Date: Thu, 20 Feb 2003 14:42:30 -0800 (PST)

From: "Gregory V. Tarsy"

Subject: Your 1/29/03 article about Sun

To: paul.murphy@linuxworld.com


Dear Mr. Murphy,

We appreciate the high estimation in which you hold Sun Microsystems, as

expressed in your Linuxworld article of 1/29/03,

http://www.linuxworld.com/site-stories/2003/0129.sun_p.html. We would like to

address a small part of the article in which you call attention to what you see

as a problem with SPARC and scientific computing, quoted below; in fact,

although there is something to be learned from your square-root summation

example, it is not what you conclude. Our comments follow the quotation:

"1. The actual problem is that SPARC, like other RISC CPUs, defaults to the IEEE

754 64-bit floating-point specification. Therefore, it doesn't get the right

answers in cases requiring extensive iterative computation. If, for example, you

get it to sum the square roots of the first billion positive integers using the

GCC compiler and default libraries, the result is off by 0.18.

Although completely meaningless in business processing, that error is

devastating if you're trying to simulate the three-body problem over a long

period. You can get around this by using extended precision libraries, but that

cuts performance significantly.

2. The self-inflicted version of this problem is that SPARC has an answer to

floating-point performance problems in its SIMD instruction set, but Sun doesn't

go out of its way to tell anyone about it and hasn't given the research

community the support needed to motivate its use.

The main problem is that researchers tend to develop for the box on their desk,

and these days that's more likely to be an Intel box running FreeBSD or Linux

than a SPARC. As a result, when they do run that code on a SPARC, there's

nothing there to take advantage of the short-array capabilities provided in the

hardware, and they get neither the performance nor the accuracy that the machine

is capable of."

The main points we want to make are:

a. The error as measured by the more suitable unit of ULPs is not really large

and it is further controllable by improving the summation algorithm

b. The SPARC results are uniform across systems while the IA32 results vary

from somewhat better to much worse; uniformity is a virtue. Note that the

results on other RISC systems match Sun's, and also note that
high-performance scientific computing is still practiced far more on
the RISC systems exemplified in the table.

c. The variability of IA32 results stems from improper management of extended
precision registers vis-a-vis memory precision, a subtlety that significantly
vitiates the real potential advantage of extended-precision intermediate
computation. Note that IA32 does indeed comply with IEEE 754; its differences
from SPARC and other RISCs are within IEEE 754's latitude.

d. The example is typical of kernels in the solution of differential equations,

like the three-body problem to which you refer, which will eventually
overwhelm any finite-precision representation, so better numerical
approaches must be used anyway.

e. The SIMD instruction set, VIS, available in successively improved versions

in UltraSPARC processors, does not as yet implement floating-point

arithmetic. Until recently this disadvantage with respect to the IA32 SIMD

sets, SSE and SSE2, was not material, since the latter were addressing a more

egregious floating-point performance issue. VIS is used to accelerate some

FP codes on SPARC by doing logical and bit-wise operations on FP registers.

Note that SSE and SSE2 implement SPARC-style pure double and float precision

FP arithmetic; there's no use of extended precision.

Next is a more detailed discussion. For a good foundation in floating-point

computing we recommend the ACM article by David Goldberg, "What Every Computer
Scientist Should Know About Floating-Point Arithmetic," especially with the

addendum by Douglas Priest that addresses IA32. See

http://www.validlab.com/goldberg/paper.pdf.

Also of interest might be Joe Darcy's JavaOne talk

http://java.sun.com/people/darcy/JavaOne/2001/1789darcy.pdf

or even better, but requiring JDC registration, is

http://servlet.java.sun.com/javaone/resources/content/sf2002/conf/sessions/pdfs/1079.pdf

I) The Code and Compilation

===========================

The source code at the heart of the discussion:

------- sqr.c -------

#include <stdio.h>
#include <math.h>

int main(void)
{
    register int i;
    double f = 0.0;

    /* Sum the square roots of the integers 0 through 10^9. */
    for (i = 0; i <= 1000000000; i++)
    {
        f += sqrt((double) i);
    }
    printf("Finished %20.14f\n", f);
    return 0;
}

------- sqr.c -------

Investigators were asked to compile this program with

the following command:

cc -o sqr -O sqr.c -lm

and to run the executable as follows:

time ./sqr

II) The Results: Correct versus IA32 versus SPARC

===================================================

The table below presents results from several sources/runs in

order of accuracy.

RUN-Name           Result                             Double Prec Rep
--------           ------                             ---------------
1) Correct       : 21081851083600.37596259382529338   =0x42b32c80 3ebb5060
2) SPARC-quad    : 21081851083600.37500000000000      =0x42b32c80 3ebb5060
3) Correct+1ULP  : 21081851083600.3789                =0x42b32c80 3ebb5061
4) IA32          : 21081851083600.38281250000000      =0x42b32c80 3ebb5062
5) SPARC-double  : 21081851083600.55859375000000      =0x42b32c80 3ebb508f

Entry number 1 is the result from the Euler-Maclaurin formula as presented

by Murphy. The third column of the table gives the IEEE-754

floating-point double-precision hexadecimal representation of the result.

Entry number 3 presents not the result of a run, but rather serves to

illustrate the error in the decimal representation when the hexadecimal

number varies by 1 bit, or unit in the last place, denoted ULP.

The best computed results are in error by 2 ULPs and are represented

by entry number 4.

The SPARC results, which are in error by 47 ULPs, are shown as entry
number 5.

Entry number 2 shows a result that, in decimal representation, is in
error by -0.00096259382529338. However, observe that the hex representation

shows an error of 0 ULPs, an exact answer. The reason for this apparent

contradiction is that IEEE-754 double precision represents only 16 or 17

decimal digits. Additional digits are used in rounding when converting

from decimal to the double precision format. Entry number 2 presents

the double precision results when the sum is kept in IEEE-754 128-bit

format.
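To see this concretely, here is a small program of our own (not part of
the survey; the file name is ours) that uses the standard C routine
strtod to check that both decimal strings convert to the same 64-bit
double:

------- digits.c -------
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* The "Correct" decimal value and the SPARC-quad printout both
       round to the same double, 0x42b32c80 3ebb5060 in the table. */
    double a = strtod("21081851083600.37596259382529338", NULL);
    double b = strtod("21081851083600.375", NULL);

    printf("a == b: %s\n", a == b ? "yes" : "no");  /* prints "yes" */
    printf("a = %.17g\n", a);
    return 0;
}
------- digits.c -------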

III) Magnitude of the Errors
============================

Consider now the errors encountered in the survey, based on the table

in section II, treating entry #1 as exact. Errors can be expressed both
in terms of decimal representation and in ULPs.

RUN-Name        Decimal Error          ULP Error    err/correct
========        =============          =========    ===========
SPARC-quad      -0.00096259382529338       0         0
Correct+1ULP     0.00293740617470662       1         1.39333E-16
IA32 (best)      0.00684990617470662       2         3.24920E-16
SPARC-double     0.18263115617470662      47        86.6296E-16

When considering the absolute error alone, SPARC results appear bad.

However, considering the ratio error/correct, the differences between
the results turn out to be very small.

IV) Sources of the Errors
=========================

There are two sources of error in the algorithm as provided in

Murphy's article. The first is due to the sqrt operation, and the

second is due to the addition in summing the square roots.

Sqrt rounding error: the largest rounding errors for the sqrt operations

should occur for the largest numbers. A one-ULP error in
sqrt(1,000,000,000) is on the order of 2*10^(-12).

This rounding error is very small compared to the one that is possible

due to the addition that follows the sqrt operations.

Just before the final term, sqrt(1000000000), is added, the partial sum
is 21081851051977.5977. A 1 ULP error in the sum at that magnitude
amounts to about 0.0029, which is huge compared to the rounding error of
a single sqrt operation.
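These magnitudes are easy to check. The sketch below is ours; it assumes
a C99-style libm providing nextafter and takes its constants from the
text above:

------- ulps.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    double root = sqrt(1000000000.0);   /* about 31622.776601683792     */
    double sum  = 21081851051977.5977;  /* the partial sum quoted above */

    /* One ULP is the gap between a double and the next representable
       double; nextafter exposes it directly. */
    printf("ULP near sqrt(1e9): %g\n", nextafter(root, 2.0 * root) - root);
    printf("ULP near the sum  : %g\n", nextafter(sum, 2.0 * sum) - sum);
    return 0;
}
------- ulps.c -------

(Compile with -lm; the first spacing is about 3.6e-12, the second about
0.0039.)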

The SPARC-quad run shows that a lot more precision will yield

the correct result. The cost, as Murphy points out, is significantly

more compute time.

The question is, can something be done to minimize the inaccuracy without

pushing storage to quad?

Yes.

V) Algorithm Correcting for Errors
==================================

The following algorithm uses the notion of compensated summation, which

deals with the rounding error that is associated with each add in the

summation.

------- sqrX.c -------

#include <stdio.h>
#include <math.h>

int main(void)
{
    register int i;
    double sum = 0.0, newsum, f;
    double sumerr = 0.0;   /* running total of the rounding errors */
    double err;

    for (i = 1; i <= 1000000000; i++)
    {
        f = sqrt((double) i);
        newsum = sum + f;            /* may round away low-order bits of f */
        err = (newsum - sum) - f;    /* the rounding error just committed  */
        sumerr += err;               /* accumulate it separately           */
        sum = newsum;
    }
    printf("Finished %20.14f\n", sum);
    printf("Finished %20.14f\n", sum - sumerr);
    return 0;
}

------- sqrX.c -------

This program was compiled and linked with the same command, but returned
results matching Correct while taking only about 20% more compute time
(the compensated result appears as the "!" entry below):

1) Correct          : 21081851083600.37596259382529338   =0x42b32c80 3ebb5060
!) SPARC-double+e   : 21081851083600.37500000000000      =0x42b32c80 3ebb5060
2) SPARC-quad       : 21081851083600.37500000000000      =0x42b32c80 3ebb5060
3) Correct+1ULP     : 21081851083600.3789                =0x42b32c80 3ebb5061
4) IA32             : 21081851083600.38281250000000      =0x42b32c80 3ebb5062
5) SPARC-double     : 21081851083600.55859375000000      =0x42b32c80 3ebb508f

In the NOTES section at the end of this paper, a bit more detail is

given about compensated summation.

VI) Why are some results better than SPARC's?
=============================================

There is a clustering of systems in terms of errors, identified

as RISC machines that adhere to IEEE-754 64-bit format for double

precision work.

The best results came from systems based on Intel's Pentium processors.
Pentium systems have 80-bit extended precision registers, versus
SPARC's 64-bit doubles, and they allow intermediate results to be kept
in extended format. An explicit change to the x87 floating-point control
word is required to force a Pentium processor to round intermediate
results to 64-bit format, as sketched below.
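For illustration, on an IA32 Linux system with glibc (an assumption on
our part; Solaris x86 and other systems expose this control differently),
the x87 precision-control field can be narrowed so that intermediate
results round to 64-bit double:

------- pc53.c -------
#include <stdio.h>
#include <math.h>
#include <fpu_control.h>   /* glibc-specific header */

int main(void)
{
    fpu_control_t cw;
    register int i;
    double f = 0.0;

    _FPU_GETCW(cw);                            /* read the x87 control word  */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;  /* select 53-bit significands */
    _FPU_SETCW(cw);                            /* write it back              */

    /* From here on, x87 intermediates round to double precision,
       matching SPARC-style arithmetic. */
    for (i = 0; i <= 1000000000; i++)
        f += sqrt((double) i);
    printf("Finished %20.14f\n", f);
    return 0;
}
------- pc53.c -------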

As the preceding section shows, the critical issue of concern here

should be the precision used in the problem. As Murphy points out,

the correct value was calculated using 200 digits of precision.

While the results from the Pentium-class machines were better, the
question arguably remains: are any of these answers bad?

We contend that the answer is no. Given the magnitude of the number and
the size of the error, the results are all perfectly reasonable.

Finally, what should be learned from this example is that a precision
level may work for one algorithm and data set, but fail if the data
changes or if the algorithm is modified.

Another interpretation is that C compilers don't make strong guarantees
about what "double" means. In some versions of gcc, with some compiler
options, "double" intermediates can be kept in 80 bits instead of 64.
Some gcc compilers may be new enough to have some C99 support, so it is
conceivable that double_t is being used instead of double.

If one wants more accurate results without compensated summation, one
can declare f as "long double" or "double_t", as sketched below. If one
wants predictable results, one can use Java, even on x86.
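Here is a minimal sketch of that idea (ours, not from the survey): only
the accumulator is widened. long double is 80-bit extended on IA32 and
128-bit quad on SPARC, so on SPARC this is essentially the SPARC-quad
run:

------- sqrL.c -------
#include <stdio.h>
#include <math.h>

int main(void)
{
    register int i;
    long double f = 0.0L;   /* 80 bits on IA32, 128 bits on SPARC */

    for (i = 0; i <= 1000000000; i++)
    {
        f += sqrt((double) i);   /* each square root still rounds to double */
    }
    printf("Finished %20.14Lf\n", f);
    return 0;
}
------- sqrL.c -------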

VII) IEEE-754 and Sun's Results
===============================

IEEE 754 aids understanding where errors arise in computations by

enabling us to work out what error is committed in each operation just

by knowing the operands and the destination format; we need not know such

system-dependent details as whether the hardware uses a guard bit or

whether it chops or rounds. But IEEE 754 also obscures understanding

why a particular system computes the result that it does because it

allows implementations to arbitrarily choose which operations are

rounded to the format the programmer specifies and which are rounded

to a wider format.
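A tiny example of that latitude (our own; the outcome depends on how a
given compiler uses the x87 registers):

------- latitude.c -------
#include <stdio.h>

int main(void)
{
    volatile double big = 9007199254740992.0;  /* 2^53: doubles step by 2 here */
    double r = (big + 1.0) + 1.0;

    /* If each add rounds to double, 2^53 + 1 ties to even and drops back
       to 2^53, and the second add does the same, so r == 2^53.  If the
       intermediate is held in an 80-bit x87 register, both adds are
       exact and r == 2^53 + 2.  IEEE 754 permits either behavior. */
    printf("%.1f\n", r);
    return 0;
}
------- latitude.c -------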

The one pertinent observation in Murphy's analysis of his example is

that the results on SPARC systems dating back to 1989 are all the same,

whereas the results on x86 systems vary in ways that are not at all

obvious.

NOTES

=====

A) Compensated Summation

(See the sqrX.c listing in section V above.)

In the original algorithm:

for (i = 0; i <= 1000000000; i++)
{
    f += sqrt((double) i);
}

whenever the addition is done, there are two possibilities:

1) the result is exact and fits into the 64 bits allotted, or
2) the result does not fit exactly, and rounding must occur.

With each addition, the rounding error is ignored. After a billion
additions, the accumulated error is noticeable in this problem.

The compensating algorithm is doing the following:

Do the add, don't worry about rounding error:

newsum = sum + f;

Determine the part of f that contributes to the new value:

(newsum - sum )

f's contribution to the new sum, minus f, gives the signed rounding
error, i.e. the negative of the amount of f that was lost:

err = (newsum - sum ) - f;

Keep track of the rounding errors separately:

sumerr += err;

All that is left is to carry the new partial sum into the next
iteration: sum = newsum;
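To make the compensation concrete, here is a one-step numerical example
of our own, with magnitudes chosen so that the loss is visible:

------- onestep.c -------
#include <stdio.h>

int main(void)
{
    double sum = 1.0e16;   /* exactly representable in double       */
    double f   = 1.0;      /* too small to survive an add at 1.0e16 */
    double newsum, err;

    newsum = sum + f;            /* rounds back to 1.0e16; the 1.0 is lost */
    err = (newsum - sum) - f;    /* (0.0) - 1.0 = -1.0, the rounding error */

    printf("newsum = %.1f\n", newsum);  /* 10000000000000000.0           */
    printf("err    = %.1f\n", err);     /* -1.0: the lost part, captured */
    return 0;
}
------- onestep.c -------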

==========================================================================

Floating Point and Numerical Computing Group

Sun Microsystems Inc.

Menlo Park, CA