Professional Documents
Culture Documents
radix point can "float" i.e., it can be placed anywhere relative to the significant digits of the number.
numbers
Floating numbers have large dynamic range
The
IEEE 754 has produced a standard for floating point arithmetic. This standard specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them
The IEEE 754 standard specifies a binary32 as having: Sign bit: 1 bit Exponent width: 8 bits Significand precision: 24 (23 explicitly stored)
The base is 2
The exponent field needs to represent both positive and negative exponents. To do this, a bias of 127 is added to the actual exponent in order to get the stored exponent. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. Exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.
Also known as
Mantissa
The
true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits
sign
exponent
significand
31
30
23 22
The value of the number represented in single precision format is as follows: (a)If e=255 and f=0, then v= NaN. (b) If e=255 and f=0, then v= (- I)s (c) If 0<e<255, then v=(- 1)s2e-127 (1. f). (d) If e =0 and f=0, then v = ( - 1)s2 -126(0.f). (e) If e=0 and f=0, then v=(- l)s 0, (zero).
In
order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 100.
A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.
shown sign bit: 1 bit Exponent width:11 bits significand precision: 52 bits(implicit) The bias for exponent is 1023
Sign exponent significand
63
62
52 51
Convert
the following single-precision IEEE 754 number into a floating-point decimal value. 1 10000001 10110011001100110011010 First, put the bits in three groups. Bit 31 (the leftmost bit) show the sign of the number. Bits 23-30 (the next 8 bits) are the exponent. Bits 0-22 (on the right) give the fraction
Now, look at the sign bit. If this bit is a 1, the number is negative, otherwise positive. Here this bit is 1, so the number is negative.
Get the exponent and the correct bias. The exponent is simply a positive binary number. 10000001bin = 129ten Remember that we will have to subtract a bias from this exponent to find the power of 2. Since this is a single-precision number, the bias is 127.
This is the trickiest step. The binary string represents a fraction, so conversion is a little different. Binary fractions look like this:
So, for this example, we multiply each digit by the corresponding power of 2:
0.10110011001100110011010bin = 1*2-1+ 0*2-2 + 1*2-3 + 1*2-4 + 0*2-5 + 0 * 2-6 + ... 0.10110011001100110011010bin = 1/2 + 1/8 + 1/16 + ... Note that this number is just an approximation on some decimal number. There will most likely be some error. In this case, the fraction is about 0.7000000476837158.
Convert 0.1015625 to IEEE 32-bit floating point format. Converting: 0.1015625 2 = 0.203125 0 Generate 0 and continue. 0.203125 2 = 0.40625 0 Generate 0 and continue. 0.40625 2 = 0.8125 0 Generate 0 and continue. 0.8125 2 = 1.625 1 Generate 1 and continue with the rest. 0.625 2 = 1.25 1 Generate 1 and continue with the rest. 0.25 2 = 0.5 0 Generate 0 and continue. 0.5 2 = 1.0 1 Generate 1 and nothing remains. So 0.101562510 = 0.00011012.
Normalize: 0.00011012 = 1.1012 2-4. Mantissa is 10100000000000000000000, exponent is -4 + 127 = 123 = 011110112, sign bit is 0. So 0.1015625 is 00111101110100000000000000000000
Binary Fractional Numbers Even when least significant bit is 0 Half way when bits to right of rounding position = Examples Round to nearest 1/4 (2 bits right of binary point) Value Binary Rounded Action Rounded Value 2 3/32 10.000112 10.002 (<1/2down) 2 2 3/16 10.001102 10.012 (>1/2up) 2 1/4 2 7/8 10.111002 11.002 (1/2up) 3 2 5/8 10.101002 10.102 (1/2down) 2 1/2
1002
Operands
(1)s1 M1 2E1 (1)s2 M2 2E2 Assume E1 > E2
+
(1)s1
m1
Exact Result
Fixing
Exponent E:
E1
If M 2, shift M right, increment E if M < 1, shift M left k positions, decrement E by k Overflow if E out of range Round M to fit frac precision
3.25 x 10 ** 3 + 2.63 x 10 ** -1 ----------------first step: align decimal points second step: add 3.25 x 10 ** 3 + 0.000263 x 10 ** 3 -------------------= 3.250263 x 10 ** 3