IEEE STANDARD 754 Floating-Point Arithmetic
Radix: Binary.
Overflow and underflow:
Overflow goes by default to a signed oo.
Underflow is gradual.
Zero is represented ambiguously as +0 or -0.
Its sign transforms correctly through multiplication or
division, and is preserved by addition of zeros
with like signs; but x-x yields +0 for every
finite x.
The only operations that reveal zero’s
sign are division by zero and
copysign(x, ±0).
In particular, comparison (x > y, x >= y, etc.)
cannot be affected by the sign of zero; but if
finite x = y then oo = 1/(x-y) != -1/(y-x) = -oo.
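The behaviour of signed zeros can be observed from C. The following is a minimal sketch, assuming a C99 compiler whose double follows IEEE 754 and the default round-to-nearest mode; the helper names is_negative_zero and reciprocals_differ are illustrative only and belong to no library.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when x is a zero whose sign bit is set (-0).
   Comparison alone cannot tell, because -0 == +0. */
bool is_negative_zero(double x)
{
    return x == 0.0 && signbit(x);
}

/* Hypothetical helper: for finite x == y, both x - y and y - x are +0
   under default rounding, so 1/(x - y) is +oo while -1/(y - x) is -oo. */
bool reciprocals_differ(double x, double y)
{
    volatile double d1 = x - y;   /* +0 when x == y */
    volatile double d2 = y - x;   /* +0 when x == y */
    return (1.0 / d1) != (-1.0 / d2);
}
```

Calling reciprocals_differ(2.5, 2.5) is expected to return true, which is the inequality stated above.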
Infinity is signed.
It persists when added to itself
or to any finite number.
Its sign transforms
correctly through multiplication and division, and
(finite)/±oo = ±0
(nonzero)/0 = ±oo.
But
oo-oo, oo*0 and oo/oo
are, like 0/0 and sqrt(-3),
invalid operations that produce NaN. ...
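These rules for infinity can be checked directly. This is a sketch under the assumption that the INFINITY macro from math.h expands to an IEEE 754 infinity; the function names are hypothetical, and sqrt(-3) is left out only to keep the example free of library calls.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical check: oo-oo, oo*0, oo/oo and 0/0 all produce NaN. */
bool invalid_ops_give_nan(void)
{
    volatile double inf = INFINITY;
    volatile double zero = 0.0;
    double a = inf - inf;
    double b = inf * zero;
    double c = inf / inf;
    double d = zero / zero;
    return isnan(a) && isnan(b) && isnan(c) && isnan(d);
}

/* Hypothetical check: infinity persists when added to itself or to a
   finite number, and (finite)/oo is a zero. */
bool infinity_persists(double finite)
{
    volatile double inf = INFINITY;
    double sum_self = inf + inf;
    double sum_finite = inf + finite;
    double quotient = finite / inf;   /* +0 or -0; both compare equal to 0.0 */
    return sum_self == inf && sum_finite == inf && quotient == 0.0;
}
```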
Reserved operands (NaNs):
A NaN is Not a Number.
Some NaNs, called Signaling NaNs, trap any floating-point operation
performed upon them; they are used to mark missing
or uninitialized values, or nonexistent elements
of arrays.
The rest are Quiet NaNs; they are
the default results of Invalid Operations, and
propagate through subsequent arithmetic operations.
If x != x then x is NaN; every other predicate
(x > y, x = y, x < y, ...) is FALSE if NaN is involved.
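The self-comparison test translates directly into C. A minimal sketch, assuming the compiler is not run with options that relax IEEE semantics (such as fast-math modes); both function names are hypothetical.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: x != x holds only when x is NaN. */
bool is_nan_by_self_compare(double x)
{
    volatile double v = x;   /* volatile keeps the comparison from being folded away */
    return v != v;
}

/* Hypothetical helper: counts how many ordinary predicates hold when
   one operand is NaN. */
int count_true_comparisons_with_nan(double y)
{
    volatile double n = NAN;
    return (n > y) + (n >= y) + (n < y) + (n <= y) + (n == y);
}
```

Since every predicate other than != is FALSE when a NaN is involved, count_true_comparisons_with_nan should return 0 for any y.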
Rounding:
Every algebraic operation (+, -, *, /, sqrt)
is rounded by default to within half an ulp,
and when the rounding error is exactly half an ulp
then the rounded value’s least significant bit is zero.
(An ulp is one Unit in the Last Place.)
This kind of rounding is usually the best kind,
sometimes provably so; for instance, for every
x = 1.0, 2.0, 3.0, 4.0, ..., 2.0**52, we find
(x/3.0)*3.0 == x and (x/10.0)*10.0 == x and ...
despite that both the quotients and the products
have been rounded.
Only rounding like IEEE 754 can do that.
But no single kind of rounding can be
proved best for every circumstance, so IEEE 754
provides rounding towards zero or towards
+oo or towards -oo
at the programmer’s option.
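The round-trip property quoted above can be spot-checked over a range of integers. This sketch assumes double arithmetic is carried out in IEEE 754 double precision with the default round-to-nearest mode (volatile stores are used to discourage wider intermediate precision); the function name and its limit parameter are illustrative.

```c
#include <stdbool.h>

/* Hypothetical check: returns true if (x/divisor)*divisor == x for every
   integer-valued x in 1.0 .. limit, with both operations rounded. */
bool division_round_trips(double divisor, long limit)
{
    for (long i = 1; i <= limit; i++) {
        volatile double x = (double)i;
        volatile double q = x / divisor;   /* rounded quotient */
        volatile double p = q * divisor;   /* rounded product  */
        if (p != x)
            return false;
    }
    return true;
}
```

A case such as x = 7.0 with divisor 10.0 is instructive: the exact product lies exactly half an ulp below 7.0, and the rule that the rounded value’s least significant bit is zero selects 7.0.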
Exceptions:
IEEE 754 recognizes five kinds of floating-point exceptions,
listed below in declining order of probable importance.
Exception            Default Result

Invalid Operation    NaN, or FALSE
Overflow             ±oo
Divide by Zero       ±oo
Underflow            Gradual Underflow
Inexact              Rounded value
Single-precision:
Type name: float
Wordsize: 32 bits.
Precision: 24 significant bits,
roughly like 7 significant decimals.
If x and x’ are consecutive positive single-precision
numbers (they differ by 1 ulp), then
5.9e-08 < 0.5**24 < (x’-x)/x <= 0.5**23 < 1.2e-07.
Range: Overflow threshold = 2.0**128 = 3.4e38
Underflow threshold = 0.5**126 = 1.2e-38
Underflowed results round to the nearest
integer multiple of 0.5**149 = 1.4e-45.
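The single-precision figures can be related to the macros in float.h. The sketch below assumes float is the 32-bit IEEE 754 format and that subnormal results are not flushed to zero; next_up_positive, relative_spacing and underflowed_multiple are hypothetical helpers written for this illustration.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the next float above a positive, finite x, found
   by adding 1 to its bit pattern (valid for positive IEEE 754 floats). */
float next_up_positive(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += 1;
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* (x' - x)/x for consecutive positive floats x and x'. */
double relative_spacing(float x)
{
    return ((double)next_up_positive(x) - (double)x) / (double)x;
}

/* Multiplies the smallest subnormal, 0.5**149 = FLT_MIN * FLT_EPSILON,
   by `multiple` and reports which integer multiple the result rounded to. */
float underflowed_multiple(float multiple)
{
    volatile float tiny = FLT_MIN * FLT_EPSILON;   /* 0.5**149 */
    volatile float r = multiple * tiny;            /* rounds to an integer multiple */
    return r / tiny;
}
```

Here FLT_EPSILON is 0.5**23 and FLT_MIN is the underflow threshold 0.5**126, so relative_spacing(x) should fall in the interval given above for any normal x.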
Double-precision:
Type name: double
On some architectures, long double is the same as double.
Wordsize: 64 bits.
Precision: 53 significant bits, roughly like 16 significant decimals.
If x and x’ are consecutive positive double-precision
numbers (they differ by 1 ulp), then
1.1e-16 < 0.5**53 < (x’-x)/x <= 0.5**52 < 2.3e-16.
Range: Overflow threshold = 2.0**1024 = 1.8e308
Underflow threshold = 0.5**1022 = 2.2e-308
Underflowed results round to the nearest
integer multiple of 0.5**1074 = 4.9e-324.
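For double precision, the overflow threshold and the 53-bit precision can be probed as follows. This is a sketch assuming double is the 64-bit IEEE 754 format with default rounding; the function names are hypothetical. Note that DBL_MAX is the largest finite double, just below the overflow threshold 2.0**1024.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical check: does x * factor overflow to infinity? */
bool overflows_to_infinity(double x, double factor)
{
    volatile double p = x;
    p = p * factor;
    return isinf(p);
}

/* Hypothetical check: is `tiny` lost entirely when added to 1.0? */
bool absorbed_by_one(double tiny)
{
    volatile double s = 1.0;
    s = s + tiny;
    return s == 1.0;
}
```

With DBL_EPSILON equal to 0.5**52, adding DBL_EPSILON / 2 to 1.0 lands exactly half an ulp above 1.0 and rounds back to 1.0, whereas adding DBL_EPSILON does not.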
Extended-precision:
Type name: long double
(when supported by the hardware)
Wordsize: 96 bits.
Precision: 64 significant bits,
roughly like 19 significant decimals.
If x and x’ are consecutive positive extended-precision
numbers (they differ by 1 ulp), then
5.4e-20 < 0.5**64 < (x’-x)/x <= 0.5**63 < 1.1e-19.
Range: Overflow threshold = 2.0**16384 = 1.2e4932
Underflow threshold = 0.5**16382 = 3.4e-4932
Underflowed results round to the nearest
integer multiple of 0.5**16445 = 3.6e-4951.
Quad-extended-precision:
Type name: long double
(when supported by the hardware)
Wordsize: 128 bits.
Precision: 113 significant bits,
roughly like 34 significant decimals.
If x and x’ are consecutive positive quad-extended-precision
numbers (they differ by 1 ulp), then
9.6e-35 < 0.5**113 < (x’-x)/x <= 0.5**112 < 2.0e-34.
For each kind of floating-point exception, IEEE 754 provides a Flag that is raised each time its exception is signaled, and stays raised until the program resets it. Programs may also test, save and restore a flag. Thus, IEEE 754 provides three ways by which programs may cope with exceptions for which the default result might be unsatisfactory:
 Test for a condition that might cause an exception later, and branch to avoid the exception.
 Test a flag to see whether an exception has occurred since the program last reset its flag.
 Test a result to see whether it is a value that only an exception could have produced.
CAUTION: The only reliable ways to discover whether Underflow has occurred are to test whether products or quotients lie closer to zero than the underflow threshold, or to test the Underflow flag. (Sums and differences cannot underflow in IEEE 754; if x != y then x-y is correct to full precision and certainly nonzero regardless of how tiny it may be.) Products and quotients that underflow gradually can lose accuracy gradually without vanishing, so comparing them with zero (as one might on a VAX) will not reveal the loss. Fortunately, if a gradually underflowed value is destined to be added to something bigger than the underflow threshold, as is almost always the case, digits lost to gradual underflow will not be missed because they would have been rounded off anyway. So gradual underflows are usually provably ignorable. The same cannot be said of underflows flushed to 0.
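The caution above can be made concrete with a threshold test. A minimal sketch, assuming IEEE 754 double precision with gradual underflow enabled (no flush-to-zero mode); the three function names are hypothetical, and DBL_MIN from float.h plays the role of the underflow threshold 0.5**1022.

```c
#include <float.h>
#include <stdbool.h>

/* Threshold test: nonzero, yet closer to zero than the underflow threshold. */
bool below_underflow_threshold(double p)
{
    double mag = p < 0.0 ? -p : p;
    return p != 0.0 && mag < DBL_MIN;
}

/* If x != y, then x - y is nonzero however tiny it may be. */
bool difference_is_nonzero(double x, double y)
{
    volatile double d = x - y;
    return d != 0.0;
}

/* True when the product a*b is nonzero but has lost digits, so that
   dividing it by a no longer recovers b. Comparing the product with
   zero would not reveal this loss. */
bool underflow_lost_digits(double a, double b)
{
    volatile double p = a * b;
    volatile double back = p / a;
    return p != 0.0 && back != b;
}
```

With a = DBL_MIN and b = DBL_EPSILON * 4.00390625, the exact product is not an integer multiple of 0.5**1074, so it is rounded: the result is nonzero, but underflow_lost_digits detects the lost bits.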
At the option of an implementor conforming to IEEE 754, other ways to cope with exceptions may be provided:
 ABORT. This mechanism classifies an exception in advance as an incident to be handled by means traditionally associated with error-handling statements like "ON ERROR GO TO ...". Different languages offer different forms of this statement, but most share the following characteristics:
 No means is provided to substitute a value for the offending operation’s result and resume computation from what may be the middle of an expression. An exceptional result is abandoned. 
 In a subprogram that lacks an error-handling statement, an exception causes the subprogram to abort within whatever program called it, and so on back up the chain of calling subprograms until an error-handling statement is encountered or the whole task is aborted and memory is dumped.

 STOP. This mechanism, requiring an interactive debugging environment, is more for the programmer than the program. It classifies an exception in advance as a symptom of a programmer’s error; the exception suspends execution as near as it can to the offending operation so that the programmer can look around to see how it happened. Quite often the first several exceptions turn out to be quite unexceptionable, so the programmer ought ideally to be able to resume execution after each one as if execution had not been stopped.
 ... Other ways lie beyond the scope of this document.
Ideally, each elementary function should act as if it were indivisible, or atomic, in the sense that ...
 No exception should be signaled that is not deserved by the data supplied to that function.
 Any exception signaled should be identified with that function rather than with one of its subroutines.
 The internal behavior of an atomic function should not be disrupted when a calling program changes from one to another of the five or so ways of handling exceptions listed above, although the definition of the function may be correlated intentionally with exception handling.
The functions in libm are only approximately atomic. They signal no inappropriate exception except possibly ...