In floating point arithmetic, addition and subtraction are more complex than
multiplication and division, because the exponents of the two operands must be
aligned (made equal) before the significands can be combined. The algorithm for
addition and subtraction has four basic phases:
1. Check for zeros.
2. Align the significands.
3. Add or subtract the significands.
4. Normalize the result.
Example:
X = 0.3 × 10^2 = 30
Y = 0.2 × 10^3 = 200
X + Y = (0.3 × 10^(2-3) + 0.2) × 10^3 = 0.23 × 10^3 = 230
X - Y = (0.3 × 10^(2-3) - 0.2) × 10^3 = (-0.17) × 10^3 = -170
X × Y = (0.3 × 0.2) × 10^(2+3) = 0.06 × 10^5 = 6000
X ÷ Y = (0.3 ÷ 0.2) × 10^(2-3) = 1.5 × 10^-1 = 0.15
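The four results in the example can be checked with a small sketch. It assumes each number is held as a (significand, exponent) pair in base 10; the helper names `value`, `add`, `mul` and `div` are made up for this illustration, and exact fractions are used so that no rounding hides the alignment step.

```python
from fractions import Fraction

def value(sig, exp):
    """Exact value of sig × 10**exp."""
    return sig * Fraction(10) ** exp

def add(x, y):
    """Add two (significand, exponent) pairs after aligning exponents."""
    (xs, xe), (ys, ye) = x, y
    e = max(xe, ye)
    # Alignment: rescale each significand to the larger exponent.
    xs_aligned = xs * Fraction(10) ** (xe - e)
    ys_aligned = ys * Fraction(10) ** (ye - e)
    return (xs_aligned + ys_aligned, e)

def mul(x, y):
    """Multiply significands, add exponents (no alignment needed)."""
    return (x[0] * y[0], x[1] + y[1])

def div(x, y):
    """Divide significands, subtract exponents (no alignment needed)."""
    return (x[0] / y[0], x[1] - y[1])

X = (Fraction("0.3"), 2)   # 0.3 × 10^2 = 30
Y = (Fraction("0.2"), 3)   # 0.2 × 10^3 = 200
neg_Y = (-Y[0], Y[1])

print(value(*add(X, Y)))      # 230
print(value(*add(X, neg_Y)))  # -170
print(value(*mul(X, Y)))      # 6000
print(value(*div(X, Y)))      # 3/20, i.e. 0.15
```

Only `add` needs the alignment step; `mul` and `div` work directly on the significands and exponents, which is why addition and subtraction are the more complex operations.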
Floating Point
• We need a way to represent
  - Numbers with fractions, e.g.: 3.142
  - Very small numbers, e.g.: 0.0000001
  - Very large numbers, e.g.: 3.1428 × 10^9
• Representation:
  - sign, exponent, significand: (–1)^sign × significand × 2^exponent
  - more bits for the significand give more accuracy
  - more bits for the exponent increase the range
• IEEE 754 floating point standard
  - single precision: 8-bit exponent, 23-bit significand
  - double precision: 11-bit exponent, 52-bit significand
IEEE 754 floating point standard
• Leading “1” bit of the significand is implicit
• Exponent is “biased” to make sorting easier
  - all 0s is the smallest exponent, all 1s is the largest
  - bias of 127 for single precision and 1023 for double precision
  - summary: (–1)^sign × (1 + significand) × 2^(exponent – bias)
• Example:
  - decimal: -0.75 = -3/4 = -3/2^2
  - binary: -0.11 = -1.1 × 2^-1
  - floating point: exponent = -1 + 127 = 126 = 01111110
  - IEEE single precision: 10111111010000000000000000000000
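The bit pattern in this example can be reproduced with Python's standard `struct` module. The sketch below packs -0.75 as a 32-bit float, prints the three fields, and decodes them again with the summary formula. The helper names `float_to_bits` and `decode` are made up for this illustration, and `decode` assumes a normalized number (it ignores zeros, denormals, infinities and NaNs).

```python
import struct

def float_to_bits(x):
    """32-bit IEEE 754 single-precision pattern of x, as a string of 0s and 1s."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{n:032b}"

def decode(bits):
    """Apply (-1)^sign × (1 + fraction) × 2^(exponent - 127) to a normalized pattern."""
    sign = int(bits[0])
    exponent = int(bits[1:9], 2)           # 8-bit biased exponent
    fraction = int(bits[9:], 2) / 2 ** 23  # 23 stored bits after the implicit 1
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

bits = float_to_bits(-0.75)
print(bits[0], bits[1:9], bits[9:])  # 1 01111110 10000000000000000000000
print(decode(bits))                  # -0.75
```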
IEEE 754 Standard
Representation of floating point numbers in
IEEE 754 standard:
The magnitude of the normalized single-precision numbers that can be represented
is in the range:
2^-126 × (1.0) to 2^127 × (2 - 2^-23)
This is approximately:
1.18 × 10^-38 to 3.40 × 10^38
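Both limits are easy to evaluate numerically. The sketch below computes them as Python floats (which are double precision, so both values are held exactly) and cross-checks them against the corresponding single-precision bit patterns. The helper name `from_bits` is made up for this illustration. The lower limit comes out at about 1.18 × 10^-38.

```python
import struct

smallest_normal = 2.0 ** -126
largest = (2 - 2.0 ** -23) * 2.0 ** 127

print(f"{smallest_normal:.4e}")  # 1.1755e-38
print(f"{largest:.4e}")          # 3.4028e+38

def from_bits(hex_pattern):
    """Interpret 8 hex digits as a big-endian single-precision float."""
    (x,) = struct.unpack(">f", bytes.fromhex(hex_pattern))
    return x

# Exponent field 00000001 with zero fraction, and 11111110 with all-ones fraction.
print(from_bits("00800000") == smallest_normal)  # True
print(from_bits("7f7fffff") == largest)          # True
```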
Floating Point Complexities
• In addition to overflow we can have “underflow”
• Accuracy can be a big problem
  - IEEE 754 keeps two extra bits, guard and round
  - four rounding modes
  - a positive number divided by zero yields “infinity”
  - zero divided by zero yields “not a number”
  - other complexities
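Overflow and underflow can be observed directly, with the caveat that Python floats are double precision, so the limits are those of the 11-bit exponent rather than the single-precision ones. A minimal sketch, using only the standard library:

```python
import sys

largest = sys.float_info.max   # largest finite double
overflowed = largest * 2       # too large to represent
print(overflowed)              # inf

smallest = 5e-324              # smallest positive (denormalized) double
underflowed = smallest / 2     # too small to represent
print(underflowed)             # 0.0

# Python itself does not return infinity for float division by zero;
# it raises an exception instead of following the IEEE 754 default.
try:
    1.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True
print(raised)                  # True
```

The last part is a property of the Python language, not of IEEE 754: the hardware result would be +infinity, but Python turns float division by zero into `ZeroDivisionError`.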
Floating Point Addition Example
e.g.: Add 9.999 × 10^1 and 1.610 × 10^-1, assuming 4 decimal digits
1. Align the decimal point of the number with the smaller exponent
1.610 × 10^-1 = 0.161 × 10^0 = 0.0161 × 10^1
Shift the smaller number to the right
2. Add the significands
9.999 + 0.016 = 10.015 → SUM = 10.015 × 10^1
NOTE: One digit of precision is lost during shifting. Also, the sum is not normalized
3. Shift the sum to put it in normalized form: 1.0015 × 10^2
4. Since the significand only has 4 digits, we need to round the sum
SUM = 1.002 × 10^2
NOTE: normalization may be needed again after rounding, e.g., rounding 9.9999 gives 10.000
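The steps of this example can be sketched in code. The version below makes several assumptions that are not in the original: both operands are positive, each significand is stored as a 4-digit integer read as d.ddd, digits shifted out during alignment are simply dropped, and rounding is half-up. The function name `add_decimal` is made up for this illustration.

```python
def add_decimal(a, b, digits=4):
    """Add two positive (significand, exponent) pairs, base 10.

    A significand of 9999 with digits=4 stands for 9.999.
    """
    (sa, ea), (sb, eb) = a, b
    # Step 1: align. Make `a` the operand with the larger exponent and
    # shift the other significand right; shifted-out digits are lost.
    if ea < eb:
        sa, ea, sb, eb = sb, eb, sa, ea
    sb //= 10 ** (ea - eb)
    # Step 2: add the significands.
    total, exp = sa + sb, ea
    # Steps 3 and 4: normalize, rounding on the digit that is shifted out.
    # The loop repeats if rounding itself produces one digit too many.
    while total >= 10 ** digits:
        total, dropped = divmod(total, 10)
        if dropped >= 5:
            total += 1
        exp += 1
    return total, exp

sig, exp = add_decimal((9999, 1), (1610, -1))
print(f"{sig / 10**3} × 10^{exp}")  # 1.002 × 10^2
```

Tracing the call: 1610 shifted right by two places becomes 16 (0.016), 9999 + 16 = 10015 (10.015), and normalizing drops the final 5 with a round-up, giving 1002 with exponent 2.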
Accurate Arithmetic – Guard & Round bits
• The IEEE 754 standard specifies the use of 2 extra bits on the right during intermediate calculations: the guard bit and the round bit
• Example: Add 2.56 × 10^0 and 2.34 × 10^2, assuming 3 significant digits and without guard and round bits
2.56 × 10^0 = 0.0256 × 10^2
(2.34 + 0.02) × 10^2 = 2.36 × 10^2
• With guard and round bits
(2.34 + 0.0256) × 10^2 = 2.3656 × 10^2
ROUND → 2.37 × 10^2
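The effect of the extra positions can be sketched with decimal digits, as in the example (the standard itself works with binary bits). The function name `add_decimal`, its `extra` parameter and the half-up rounding are assumptions made for this illustration, and only positive operands are handled.

```python
def add_decimal(a, b, digits=3, extra=0):
    """Add two positive (significand, exponent) pairs, base 10.

    A significand of 234 with digits=3 stands for 2.34. `extra` is the
    number of additional digits kept on the right during the addition.
    """
    (sa, ea), (sb, eb) = a, b
    if ea < eb:                          # make `a` the larger-exponent operand
        sa, ea, sb, eb = sb, eb, sa, ea
    scale = 10 ** extra
    sa *= scale
    sb = sb * scale // 10 ** (ea - eb)   # digits beyond the extras are lost
    total, exp = sa + sb, ea
    if total >= 10 ** digits * scale:    # carry out: drop one more digit
        scale *= 10
        exp += 1
    rounded = (total + scale // 2) // scale   # round half up (an assumption)
    if rounded == 10 ** digits:          # rounding overflowed, renormalize
        rounded //= 10
        exp += 1
    return rounded, exp

x, y = (234, 2), (256, 0)                # 2.34 × 10^2 and 2.56 × 10^0
print(add_decimal(x, y, extra=0))        # (236, 2), i.e. 2.36 × 10^2
print(add_decimal(x, y, extra=2))        # (237, 2), i.e. 2.37 × 10^2
```

With no extra digits the 0.0056 is discarded before the addition, so the result is off by one unit in the last place; keeping two extra digits lets the final rounding see it.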
Infinity arithmetic
Infinity arithmetic is treated as the limiting case of real arithmetic, with the
infinity values given the following interpretation:
-∞ < (every finite number) < +∞
With the exception of the special cases discussed subsequently, any arithmetic
operation involving infinity yields the obvious result.
For example:
5 + (+∞) = +∞      5 ÷ (+∞) = +0
5 - (+∞) = -∞      (+∞) + (+∞) = +∞
5 + (-∞) = -∞      (-∞) + (-∞) = -∞
5 - (-∞) = +∞      (-∞) - (+∞) = -∞
5 × (+∞) = +∞      (+∞) - (-∞) = +∞
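These rules can be tried out with Python floats, which follow IEEE 754 for operations on infinities. A minimal sketch using `math.inf` from the standard library; the dictionary `examples` exists only for this illustration.

```python
import math

inf = math.inf

examples = {
    "5 + (+inf)": 5 + inf,
    "5 - (+inf)": 5 - inf,
    "5 + (-inf)": 5 + (-inf),
    "5 - (-inf)": 5 - (-inf),
    "5 * (+inf)": 5 * inf,
    "5 / (+inf)": 5 / inf,
    "(+inf) + (+inf)": inf + inf,
    "(-inf) + (-inf)": (-inf) + (-inf),
    "(-inf) - (+inf)": (-inf) - inf,
    "(+inf) - (-inf)": inf - (-inf),
}
for expr, result in examples.items():
    print(f"{expr} = {result}")

# One of the special cases: infinities of opposite sign do not cancel,
# so the sum is not a number rather than an infinity.
special = (-inf) + inf
print(math.isnan(special))  # True
```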
Quiet and Signaling NaNs
Table: Operations that Produce a Quiet NaN
Yu Hong Sheng
B031210099