Date: Thu, 20 Feb 2003 14:42:30 -0800 (PST)
From: "Gregory V. Tarsy"
Subject: Your 1/29/03 article about Sun
To: paul.murphy@linuxworld.com


Dear Mr. Murphy,


We appreciate the high estimation in which you hold Sun Microsystems, as
expressed in your LinuxWorld article of 1/29/03,
http://www.linuxworld.com/site-stories/2003/0129.sun_p.html. We would like to
address a small part of that article, quoted below, in which you call attention
to what you see as a problem with SPARC and scientific computing. In fact,
although there is something to be learned from your square-root summation
example, it is not what you conclude. Our response follows the quotation:



"1. The actual problem is that SPARC, like other RISC CPUs, defaults to the IEEE
754 64-bit floating-point specification. Therefore, it doesn't get the right
answers in cases requiring extensive iterative computation. If, for example, you
get it to sum the square roots of the first billion positive integers using the
GCC compiler and default libraries, the result is off by 0.18.


Although completely meaningless in business processing, that error is
devastating if you're trying to simulate the three-body problem over a long
period. You can get around this by using extended precision libraries, but that
cuts performance significantly.


2. The self-inflicted version of this problem is that SPARC has an answer to
floating-point performance problems in its SIMD instruction set, but Sun doesn't
go out of its way to tell anyone about it and hasn't given the research
community the support needed to motivate its use.


The main problem is that researchers tend to develop for the box on their desk,
and these days that's more likely to be an Intel box running FreeBSD or Linux
than a SPARC. As a result, when they do run that code on a SPARC, there's
nothing there to take advantage of the short-array capabilities provided in the
hardware, and they get neither the performance nor the accuracy that the machine
is capable of."



The main points we want to make are:


a. The error, measured in the more suitable unit of ULPs, is not really
large, and it is further controllable by improving the summation algorithm.


b. The SPARC results are uniform across systems, while the IA32 results vary
from somewhat better to much worse; uniformity is a virtue. Note that the
results on other RISC systems match Sun's, and also that high-performance
scientific computing is still practiced far more on the RISC systems
exemplified in the table.


c. The variability of the IA32 results stems from improper management of
extended-precision registers vis-a-vis memory precision, a subtlety that
significantly vitiates the real potential advantage of extended-precision
intermediate computation (see the sketch after this list). Note that IA32
does indeed comply with IEEE 754; its differences from SPARC and other
RISCs are within IEEE 754's latitude.


d. The example is typical of kernels in the solution of differential
equations, like the three-body problem to which you refer, which will
eventually overwhelm any finite, bounded-precision representation, so
better numerical approaches must be used anyway.


e. The SIMD instruction set, VIS, available in successively improved versions
in UltraSPARC processors, does not as yet implement floating-point
arithmetic. Until recently this disadvantage with respect to the IA32 SIMD
sets, SSE and SSE2, was not material, since the latter were addressing a
more egregious floating-point performance issue. VIS is used to accelerate
some FP codes on SPARC by doing logical and bit-wise operations on FP
registers. Note that SSE and SSE2 implement SPARC-style pure double- and
single-precision FP arithmetic; there is no use of extended precision.
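
As a brief illustration of point (c), here is a minimal sketch, our own and
not part of the original exchange, of how an IA32 answer can depend on when
a value is spilled from an 80-bit x87 register to a 64-bit memory slot. The
constants are illustrative; whether the difference appears depends on the
compiler and optimization level. On SPARC, where every operation rounds to
64-bit double, the program always prints 0.

------- spill.c -------
#include <stdio.h>

int main(void)
{
    volatile double one  = 1.0;
    volatile double tiny = 1e-17;   /* below half an ULP of 1.0 */
    double sum;

    /* In pure 64-bit arithmetic each addition rounds back to 1.0.
       At the x87's default 80-bit register precision the tiny terms
       survive until sum is finally stored to a 64-bit memory slot. */
    sum = one + tiny + tiny + tiny;

    printf("%.17g\n", sum - 1.0);   /* 0 on SPARC; may be 3e-17 on IA32 */
    return 0;
}
------- spill.c -------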


Next is a more detailed discussion. For a good foundation in floating-point
computing we recommend David Goldberg's ACM article "What Every Computer
Scientist Should Know About Floating-Point Arithmetic", especially with the
addendum by Douglas Priest that addresses IA32. See

http://www.validlab.com/goldberg/paper.pdf.


Also of interest might be Joe Darcy's JavaOne talk


http://java.sun.com/people/darcy/JavaOne/2001/1789darcy.pdf


or, even better but requiring JDC (Java Developer Connection) registration,

http://servlet.java.sun.com/javaone/resources/content/sf2002/conf/sessions/pdfs/1079.pdf


I) The Code and Compilation
===========================


The source code at the heart of the discussion:



------- sqr.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    register int i;
    double f = 0.0;

    /* Naive summation: each addition rounds to double precision and
       the rounding error is simply discarded. */
    for (i = 0; i <= 1000000000; i++) {
        f += sqrt((double) i);
    }
    printf("Finished %20.14f\n", f);
    return 0;
}
------- sqr.c -------



Investigators were asked to compile this program with
the following command:


cc -o sqr -O sqr.c -lm


and to run the executable as follows:


time ./sqr


II) The Results: Correct versus IA32 versus SPARC
===================================================



The table below presents results from several sources/runs in
order of accuracy.


RUN-Name             Result                            Double-Precision Rep (hex)
--------             ------                            --------------------------
1) Correct         : 21081851083600.37596259382529338  = 0x42b32c80 3ebb5060
2) SPARC-quad      : 21081851083600.37500000000000     = 0x42b32c80 3ebb5060
3) Correct+1ULP    : 21081851083600.3789               = 0x42b32c80 3ebb5061
4) IA32 (best)     : 21081851083600.38281250000000     = 0x42b32c80 3ebb5062
5) SPARC-double    : 21081851083600.55859375000000     = 0x42b32c80 3ebb508f



Entry number 1 is the result from the Euler-Maclaurin formula as presented
by Murphy. The third column of the table gives the IEEE-754
floating-point double-precision hexadecimal representation of the result.
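
(For reference, the expansion in question is, to its first few terms,

sum_{i=1}^{N} sqrt(i) ~= (2/3)*N^(3/2) + (1/2)*N^(1/2) + zeta(-1/2)
                         + (1/24)*N^(-1/2),

with zeta(-1/2) ~= -0.2078862250; as noted below, Murphy evaluated it at
very high precision.)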


Entry number 3 presents not the result of a run, but rather serves to
illustrate the error in the decimal representation when the hexadecimal
number varies by 1 bit, or unit in the last place, denoted ULP.


The best computed results are in error by 2 ULPs and are represented
by entry number 4.


The SPARC results, which are in error by 47 ULPs, are shown as entry
number 5.


Entry number 2 shows a result that, in decimal representation, is in
error by -0.00096259382529338. Observe, however, that the hex representation
shows an error of 0 ULPs: an exact answer. The reason for this apparent
contradiction is that IEEE-754 double precision carries only 16 or 17
significant decimal digits; additional digits serve only to direct rounding
when converting from decimal to the double-precision format. Entry number 2
presents the double-precision result when the sum is kept in IEEE-754
128-bit (quad) format.
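
As an aside, the hex column of the table can be produced by reinterpreting
a double's 64 bits as an integer, as in this small sketch of ours (it
assumes a C99-era compiler with a 64-bit long long and IEEE-754 doubles):

------- bits.c -------
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 21081851083600.375;   /* the correctly rounded double sum */
    unsigned long long bits;

    memcpy(&bits, &d, sizeof bits);  /* view the same 64 bits as an integer */
    printf("%.17g = 0x%016llx\n", d, bits);   /* 0x42b32c803ebb5060 */
    return 0;
}
------- bits.c -------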



III) Magnitude of the Errors
============================


Consider now the errors encountered in the survey, based on the table in
section II and treating entry #1 as exact. The errors can be expressed in
decimal terms, in ULPs, and relative to the correct value.



RUN-Name          Decimal Error          ULP Error    Error/Correct
========          =============          =========    =============
SPARC-quad        -0.00096259382529338       0         0
Correct+1ULP       0.00293740617470662       1         1.39333E-16
IA32 (best)        0.00684990617470662       2         3.24920E-16
SPARC-double       0.18263115617470662      47        86.6296E-16




Judged by absolute error alone, the SPARC result appears bad. Judged by
the relative error (error/correct), however, the results differ very
little: even the SPARC-double error of 0.18 is less than 10^-14 of the
correct value.


IV) Sources of the Errors
=========================


There are two sources of error in the algorithm as provided in
Murphy's article. The first is due to the sqrt operation, and the
second is due to the addition in summing the square roots.


Sqrt rounding error: the largest rounding errors for the sqrt operations
occur for the largest arguments. A one-ULP error in sqrt(1,000,000,000) is
on the order of 2*10^(-12). This rounding error is very small compared to
the one that is possible in the additions that follow the sqrt operations.


The running sum just before sqrt(1000000000) is added is 21081851051977.5977.
A 1 ULP error at that magnitude amounts to a few thousandths (see the sketch
below): huge compared to the rounding error of a sqrt operation.
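
These ULP sizes are easy to check with C99's nextafter(), as in the sketch
below (ours; link with -lm). Note that a full ULP step near the final sum
is about 0.0039; the 0.0029 error of the Correct+1ULP table entry is smaller
because it is measured from the exact value, which falls between two
representable doubles.

------- ulp.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    double r = sqrt(1000000000.0);    /* about 31622.7766 */
    double s = 21081851051977.5977;   /* running sum before the last term */

    /* nextafter(x, y) is the adjacent double in the direction of y, so
       each difference below is exactly one ULP at that magnitude. */
    printf("ULP near sqrt(1e9): %g\n", nextafter(r, 2.0 * r) - r); /* ~3.6e-12 */
    printf("ULP near the sum:   %g\n", nextafter(s, 2.0 * s) - s); /* ~0.0039  */
    return 0;
}
------- ulp.c -------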


The SPARC-quad run shows that much more precision will yield the
correct result. The cost, as Murphy points out, is significantly
more compute time.

The question is: can something be done to minimize the inaccuracy without
pushing storage to quad?

Yes.



V) Algorithm Correcting for Errors
==================================


The following algorithm uses the notion of compensated summation, which
accounts for the rounding error associated with each addition in the
summation.



------- sqrX.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    register int i;
    double sum = 0.0, newsum, f;
    double sumerr = 0.0;   /* running total of the rounding errors */
    double err;

    for (i = 1; i <= 1000000000; i++) {
        f = sqrt((double) i);

        newsum = sum + f;           /* rounded sum                      */
        err = (newsum - sum) - f;   /* rounding error of this addition  */
        sumerr += err;              /* accumulate the errors separately */
        sum = newsum;
    }
    printf("Finished %20.14f\n", sum);           /* uncompensated sum */
    printf("Finished %20.14f\n", sum - sumerr);  /* compensated sum   */
    return 0;
}
------- sqrX.c -------



This program was compiled and linked with the same command, but returned
results matching Correct while taking only about 20% more compute time.



1) Correct         : 21081851083600.37596259382529338  = 0x42b32c80 3ebb5060
!) SPARC double+e  : 21081851083600.37500000000000     = 0x42b32c80 3ebb5060
2) SPARC-quad      : 21081851083600.37500000000000     = 0x42b32c80 3ebb5060
3) Correct+1ULP    : 21081851083600.3789               = 0x42b32c80 3ebb5061
4) IA32 (best)     : 21081851083600.38281250000000     = 0x42b32c80 3ebb5062
5) SPARC-double    : 21081851083600.55859375000000     = 0x42b32c80 3ebb508f

(The row marked "!" is the compensated SPARC double result, sum - sumerr.)




In the NOTES section at the end of this paper, a bit more detail is
given about compensated summation.



VI) Why are some results better than SPARC?
===========================================


The errors cluster by system type: the RISC machines, which adhere
strictly to the IEEE-754 64-bit format for double-precision work, all
produce the same result.


The best results came from systems based on Intel's Pentium family.
Pentium processors have 80-bit extended-precision registers, versus
SPARC's 64-bit doubles, and by default they keep intermediate results
in the extended format. An explicit call is required to force a Pentium
processor to round intermediate results to the 64-bit format.
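
For the curious, here is a minimal sketch of that explicit call on
Linux/IA32 (our illustration; the _FPU_* macros are glibc-specific, and
the setting affects only x87 code):

------- fpucw.c -------
#include <stdio.h>
#include <fpu_control.h>   /* glibc-only: x87 control-word macros */

int main(void)
{
    fpu_control_t cw;

    _FPU_GETCW(cw);                            /* read current control word   */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;  /* 53-bit (double) significand */
    _FPU_SETCW(cw);                            /* x87 now rounds like SPARC   */

    /* From here on, x87 add/sub/mul/div round the significand to double
       precision, though the exponent range remains that of the 80-bit
       format, so rare edge cases can still differ from SPARC. */
    printf("x87 precision control set to double\n");
    return 0;
}
------- fpucw.c -------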


As the preceding section shows, the critical issue of concern here
should be the precision used in the problem. As Murphy points out,
the correct value was calculated using 200 digits of precision.


While the results from the Pentium-class machines were better, the
question arguably remains: are any of these answers bad? We contend
that the answer is no. Given the magnitude of the number and the size
of the errors, the results are all perfectly reasonable.


Finally, what should be learned from this example is that a given
precision level may work for one algorithm and data set, but fail if
the data changes or the algorithm is modified.


Another interpretation is that C compilers make no strong guarantee
about what "double" means. With some versions of gcc and certain
compiler options, "double" intermediates can be held at 80 bits instead
of 64. Some gcc versions may be new enough to have partial C99 support,
so it is conceivable that double_t is being used instead of double.


If one wants more accurate results without compensated summation,
one can declare f as "long double" or "double_t", as sketched below.
If one wants predictable results one can use Java, even on x86.
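
As a sketch of the first option (ours; it assumes a C99 compiler and a
libm that provides sqrtl, which was not yet universal), the accumulator
is simply widened to long double:

------- sqrL.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    long double f = 0.0L;   /* 80-bit extended on x86, 128-bit quad on SPARC */
    int i;

    for (i = 1; i <= 1000000000; i++) {
        f += sqrtl((long double) i);   /* C99's long double square root */
    }
    printf("Finished %20.14Lf\n", f);
    return 0;
}
------- sqrL.c -------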

VII) IEEE-754 and Sun's Results
===============================


IEEE 754 aids understanding of where errors arise in computations:
knowing only the operands and the destination format, we can work out
the error committed in each operation, without knowing such
system-dependent details as whether the hardware uses a guard bit or
whether it chops or rounds. But IEEE 754 also obscures understanding of
why a particular system computes the result it does, because it allows
implementations to choose arbitrarily which operations are rounded to
the format the programmer specified and which are rounded to a wider
format.

The one pertinent observation in Murphy's analysis of his example is
that the results on SPARC systems dating back to 1989 are all the same,
whereas the results on x86 systems vary in ways that are not at all
obvious.


NOTES
=====


A) Compensated Summation



------- sqrX.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    register int i;
    double sum = 0.0, newsum, f;
    double sumerr = 0.0;   /* running total of the rounding errors */
    double err;

    for (i = 1; i <= 1000000000; i++) {
        f = sqrt((double) i);

        newsum = sum + f;           /* rounded sum                      */
        err = (newsum - sum) - f;   /* rounding error of this addition  */
        sumerr += err;              /* accumulate the errors separately */
        sum = newsum;
    }
    printf("Finished %20.14f\n", sum);           /* uncompensated sum */
    printf("Finished %20.14f\n", sum - sumerr);  /* compensated sum   */
    return 0;
}
------- sqrX.c -------


In the original algorithm:


for (i = 0; i <= 1000000000; i++) {
    f += sqrt((double) i);
}



whenever the addition is done, there are two possibilities:


1) The result is exact and fits into the 64-bits allotted, or
2) the result will not exactly fit, and rounding needs to occur.


With each addition, the rounding error is simply discarded. After a
billion additions, the accumulated effect is noticeable in this problem.


The compensating algorithm is doing the following:


Do the add, don't worry about rounding error:


newsum = sum + f;


Determine the part of f that contributes to the new value:


(newsum - sum )


Subtracting f from this contribution gives the rounding error of the
addition (equivalently, the negative of the part of f that was lost):


err = (newsum - sum ) - f;


Keep track of the rounding errors separately:


sumerr += err;


All that is left is to carry the new total into the next iteration
(sum = newsum) and, once the loop finishes, subtract the accumulated
error from the sum: sum - sumerr.
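
To watch the compensation work on a smaller, self-contained case, here is
a quick sketch of ours that sums fl(0.1) a hundred million times. The gap
between the two printed values is the accumulated rounding error of the
naive sum; exact outputs vary by platform.

------- comp.c -------
#include <stdio.h>

int main(void)
{
    double sum = 0.0, sumerr = 0.0;
    double f = 0.1;   /* not exactly representable: fl(0.1) != 0.1 */
    int i;

    for (i = 0; i < 100000000; i++) {
        double newsum = sum + f;
        sumerr += (newsum - sum) - f;   /* rounding error of this add */
        sum = newsum;
    }
    printf("naive:       %.10f\n", sum);
    printf("compensated: %.10f\n", sum - sumerr);
    return 0;
}
------- comp.c -------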


==========================================================================




Floating Point and Numerical Computing Group
Sun Microsystems Inc.
Menlo Park, CA