
A method of representing real numbers that can support a wide range of values.

A typical number that can be represented exactly is of the form:

significant digits × base^exponent


The term floating point refers to the fact that the radix point can "float", i.e., it can be placed anywhere relative to the significant digits of the number.

Floating point numbers approximate real numbers. Floating point numbers have a large dynamic range.

The IEEE has produced a standard for floating point arithmetic, IEEE 754. This standard specifies how single precision (32-bit) and double precision (64-bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them.

The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The base is 2.

The sign bit determines the sign of the number, which is the sign of the significand as well. Sign bit = 0 if the number is positive, = 1 if the number is negative.

The exponent field needs to represent both positive and negative exponents. To do this, a bias of 127 is added to the actual exponent in order to get the stored exponent. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. Exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.
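As a small illustration of the bias arithmetic (a sketch with a hypothetical helper name, not part of the standard):

```python
# Sketch of the binary32 exponent bias: stored = actual + 127.
BIAS = 127

def stored_exponent(actual):
    stored = actual + BIAS
    # Stored values 0 (all 0s) and 255 (all 1s) are reserved for
    # special numbers, so a normal exponent lies strictly between.
    assert 0 < stored < 255, "exponent outside the normal range"
    return stored

print(stored_exponent(0))   # 127
print(stored_exponent(73))  # 200, matching the example above
```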

The significand is also known as the mantissa. The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit with value 1, unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits.

The bits are laid out as follows: sign (bit 31), exponent (bits 30-23), significand (bits 22-0).

The value v of the number represented in single precision format is as follows:
(a) If e = 255 and f ≠ 0, then v = NaN.
(b) If e = 255 and f = 0, then v = (−1)^s ∞.
(c) If 0 < e < 255, then v = (−1)^s × 2^(e−127) × (1.f).
(d) If e = 0 and f ≠ 0, then v = (−1)^s × 2^(−126) × (0.f) (denormalized numbers).
(e) If e = 0 and f = 0, then v = (−1)^s × 0 (zero).
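The five cases can be captured in a short decoder. This is a minimal sketch with illustrative names, assuming the raw 32 bits are already available as an integer:

```python
def decode_binary32(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    s = (bits >> 31) & 0x1       # sign bit
    e = (bits >> 23) & 0xFF      # 8-bit stored exponent
    f = bits & 0x7FFFFF          # 23 fraction bits
    sign = -1.0 if s else 1.0
    frac = f / 2**23             # value of 0.f
    if e == 255:
        return float('nan') if f != 0 else sign * float('inf')
    if e == 0:
        return sign * 2**-126 * frac       # denormal, or signed zero if f == 0
    return sign * 2**(e - 127) * (1 + frac)  # normalized: implicit leading 1

print(decode_binary32(0x3F800000))  # 1.0
print(decode_binary32(0xC0000000))  # -2.0
```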

In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 × 10^0.

A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.

The storage format of double precision is as shown:
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
The bias for the exponent is 1023.
The bits are laid out the same way: sign (bit 63), exponent (bits 62-52), significand (bits 51-0).
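One way to see these fields is to unpack a Python float with the standard struct module; the helper name here is an assumption for illustration:

```python
import struct

def binary64_fields(x):
    """Split a Python float into its sign, biased exponent, and fraction."""
    bits, = struct.unpack('>Q', struct.pack('>d', x))
    sign     = bits >> 63               # bit 63
    exponent = (bits >> 52) & 0x7FF     # bits 62-52, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # bits 51-0
    return sign, exponent, fraction

s, e, f = binary64_fields(-6.8)
print(s, e - 1023)  # 1 2, i.e. -1.7 x 2^2
```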

Convert the following single-precision IEEE 754 number into a floating-point decimal value:

1 10000001 10110011001100110011010

First, put the bits in three groups. Bit 31 (the leftmost bit) shows the sign of the number. Bits 30-23 (the next 8 bits) are the exponent. Bits 22-0 (on the right) give the fraction.

Now, look at the sign bit. If this bit is a 1, the number is negative, otherwise positive. Here this bit is 1, so the number is negative.

Get the exponent and the correct bias. The exponent is simply a positive binary number: 10000001₂ = 129₁₀. Remember that we will have to subtract a bias from this exponent to find the power of 2. Since this is a single-precision number, the bias is 127.

Convert the fraction string into base ten.

This is the trickiest step. The binary string represents a fraction, so conversion is a little different. Binary fractions look like this:

0.1₂ = 1/2 = 2^−1
0.01₂ = 1/4 = 2^−2
0.001₂ = 1/8 = 2^−3

So, for this example, we multiply each digit by the corresponding power of 2:

0.10110011001100110011010₂ = 1×2^−1 + 0×2^−2 + 1×2^−3 + 1×2^−4 + 0×2^−5 + 0×2^−6 + ...
= 1/2 + 1/8 + 1/16 + ...

Note that this number is just an approximation of some decimal number. There will most likely be some error. In this case, the fraction is about 0.7000000476837158.
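A small sketch of this digit-by-digit conversion (the function name is hypothetical):

```python
def fraction_to_decimal(bits):
    """Sum each fraction bit times its (negative) power of two."""
    return sum(int(b) * 2**-(i + 1) for i, b in enumerate(bits))

print(fraction_to_decimal('10110011001100110011010'))
# 0.7000000476837158
```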

This is all the information we need. We can put these numbers in the expression:

(−1)^sign bit × (1 + fraction) × 2^(exponent − bias)
= (−1)^1 × (1.7000000476837158) × 2^(129−127)
= −6.8


The answer is approximately -6.8.
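The worked example can be checked by packing the bit pattern with Python's struct module and reading it back as a float:

```python
import struct

# Sign, exponent, and fraction bits from the example, as one 32-bit integer.
bits = int('1' '10000001' '10110011001100110011010', 2)
value, = struct.unpack('>f', struct.pack('>I', bits))
print(value)  # approximately -6.8 (prints -6.800000190734863 as a double)
```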

Convert 0.1015625 to IEEE 32-bit floating point format. Converting:

0.1015625 × 2 = 0.203125  → 0  Generate 0 and continue.
0.203125  × 2 = 0.40625   → 0  Generate 0 and continue.
0.40625   × 2 = 0.8125    → 0  Generate 0 and continue.
0.8125    × 2 = 1.625     → 1  Generate 1 and continue with the rest.
0.625     × 2 = 1.25      → 1  Generate 1 and continue with the rest.
0.25      × 2 = 0.5       → 0  Generate 0 and continue.
0.5       × 2 = 1.0       → 1  Generate 1 and nothing remains.

So 0.1015625₁₀ = 0.0001101₂.

Normalize: 0.0001101₂ = 1.101₂ × 2^−4. The mantissa is 10100000000000000000000, the exponent is −4 + 127 = 123 = 01111011₂, and the sign bit is 0. So 0.1015625 is 0 01111011 10100000000000000000000.
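The encoding can likewise be checked by packing 0.1015625 as a binary32 and printing its bits:

```python
import struct

# 0.1015625 is exactly representable, so the round trip is exact.
bits, = struct.unpack('>I', struct.pack('>f', 0.1015625))
print(f'{bits:032b}')  # 00111101110100000000000000000000
```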

Binary fractional numbers are rounded to nearest, with ties going to even:
"Even" when the least significant retained bit is 0.
"Half way" when the bits to the right of the rounding position = 100...₂

Examples, rounding to the nearest 1/4 (2 bits right of the binary point):

Value   Binary     Rounded  Action         Rounded Value
2 3/32  10.00011₂  10.00₂   (< 1/2, down)  2
2 3/16  10.00110₂  10.01₂   (> 1/2, up)    2 1/4
2 7/8   10.11100₂  11.00₂   (1/2, up)      3
2 5/8   10.10100₂  10.10₂   (1/2, down)    2 1/2
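Python 3's built-in round() already rounds half-way cases to even, so scaling by 4 reproduces the table; a small sketch (the values here are exactly representable in binary):

```python
def round_quarter(x):
    """Round to the nearest 1/4, ties to even, via round-half-to-even."""
    return round(x * 4) / 4

for v in (2 + 3/32, 2 + 3/16, 2 + 7/8, 2 + 5/8):
    print(v, '->', round_quarter(v))
# 2.09375 -> 2.0, 2.1875 -> 2.25, 2.875 -> 3.0, 2.625 -> 2.5
```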

Floating point addition. Operands:

(−1)^s1 × M1 × 2^E1
(−1)^s2 × M2 × 2^E2, assume E1 > E2

The significand of the smaller operand is shifted right by E1 − E2 bits to align the binary points before the signed significands are added.

Exact result: (−1)^s × M × 2^E
Sign s, significand M: result of signed align & add
Exponent E: E1

Fixing:
If M ≥ 2, shift M right, increment E.
If M < 1, shift M left k positions, decrement E by k.
Overflow if E is out of range.
Round M to fit the fraction precision.

Example in decimal:

  3.25     × 10^3
+ 2.63     × 10^−1

First step: align the decimal points. Second step: add.

  3.25     × 10^3
+ 0.000263 × 10^3
= 3.250263 × 10^3
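A minimal sketch of the alignment step, mirroring the decimal example above (the names and plain-number representation are illustrative):

```python
def fp_add(m1, e1, m2, e2):
    """Add m1*10**e1 + m2*10**e2 by aligning to the larger exponent."""
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1  # ensure e1 >= e2
    m2_aligned = m2 / 10**(e1 - e2)      # shift the smaller operand right
    return m1 + m2_aligned, e1           # result may still need normalizing

m, e = fp_add(3.25, 3, 2.63, -1)
print(f'{m} x 10**{e}')  # 3.250263 x 10**3 (approximately)
```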
