
A method of representing real numbers that can support a wide range of values.

A typical number that can be represented exactly is of the form:

significant digits × base^exponent


The term floating point refers to the fact that the radix point can "float", i.e., it can be placed anywhere relative to the significant digits of the number.

Floating point numbers approximate real numbers. Floating point numbers have a large dynamic range.

The IEEE has produced a standard for floating point arithmetic, IEEE 754. This standard specifies how single precision (32-bit) and double precision (64-bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them.

The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
The base is 2.

The sign bit determines the sign of the number, which is the sign of the significand as well. Sign bit = 0 if the number is positive, = 1 if the number is negative.

The exponent field needs to represent both positive and negative exponents. To do this, a bias of 127 is added to the actual exponent in order to get the stored exponent. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. Exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.
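As a small illustration of the bias arithmetic (a sketch with a hypothetical helper name, not part of the standard):

```python
# Sketch of the binary32 exponent bias: stored = actual + 127.
BIAS = 127

def stored_exponent(actual):
    stored = actual + BIAS
    # Stored values 0 (all 0s) and 255 (all 1s) are reserved for
    # special numbers, so a normal exponent lies strictly between.
    assert 0 < stored < 255, "exponent outside the normal range"
    return stored

print(stored_exponent(0))   # 127
print(stored_exponent(73))  # 200, matching the example above
```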

The significand is also known as the mantissa. The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit with value 1, unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits.

The bits are laid out as follows: sign (bit 31), exponent (bits 30-23), significand (bits 22-0).

The value v of the number represented in single precision format is as follows:
(a) If e = 255 and f ≠ 0, then v = NaN.
(b) If e = 255 and f = 0, then v = (−1)^s ∞.
(c) If 0 < e < 255, then v = (−1)^s × 2^(e−127) × (1.f).
(d) If e = 0 and f ≠ 0, then v = (−1)^s × 2^(−126) × (0.f) (denormalized numbers).
(e) If e = 0 and f = 0, then v = (−1)^s × 0 (zero).
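The five cases can be captured in a short decoder. This is a minimal sketch with illustrative names, assuming the raw 32 bits are already available as an integer:

```python
def decode_binary32(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    s = (bits >> 31) & 0x1       # sign bit
    e = (bits >> 23) & 0xFF      # 8-bit stored exponent
    f = bits & 0x7FFFFF          # 23 fraction bits
    sign = -1.0 if s else 1.0
    frac = f / 2**23             # value of 0.f
    if e == 255:
        return float('nan') if f != 0 else sign * float('inf')
    if e == 0:
        return sign * 2**-126 * frac       # denormal, or signed zero if f == 0
    return sign * 2**(e - 127) * (1 + frac)  # normalized: implicit leading 1

print(decode_binary32(0x3F800000))  # 1.0
print(decode_binary32(0xC0000000))  # -2.0
```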

In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 × 10^0.

A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.

The storage format of double precision is as shown:
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
The bias for the exponent is 1023.
The bits are laid out the same way: sign (bit 63), exponent (bits 62-52), significand (bits 51-0).
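One way to see these fields is to unpack a Python float with the standard struct module; the helper name here is an assumption for illustration:

```python
import struct

def binary64_fields(x):
    """Split a Python float into its sign, biased exponent, and fraction."""
    bits, = struct.unpack('>Q', struct.pack('>d', x))
    sign     = bits >> 63               # bit 63
    exponent = (bits >> 52) & 0x7FF     # bits 62-52, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # bits 51-0
    return sign, exponent, fraction

s, e, f = binary64_fields(-6.8)
print(s, e - 1023)  # 1 2, i.e. -1.7 x 2^2
```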

Convert the following single-precision IEEE 754 number into a floating-point decimal value:

1 10000001 10110011001100110011010

First, put the bits in three groups. Bit 31 (the leftmost bit) shows the sign of the number. Bits 30-23 (the next 8 bits) are the exponent. Bits 22-0 (on the right) give the fraction.

Now, look at the sign bit. If this bit is a 1, the number is negative, otherwise positive. Here this bit is 1, so the number is negative.

Get the exponent and the correct bias. The exponent is simply a positive binary number: 10000001₂ = 129₁₀. Remember that we will have to subtract a bias from this exponent to find the power of 2. Since this is a single-precision number, the bias is 127.

Convert the fraction string into base ten.

This is the trickiest step. The binary string represents a fraction, so conversion is a little different. Binary fractions look like this:

0.1₂ = 1/2 = 2^−1
0.01₂ = 1/4 = 2^−2
0.001₂ = 1/8 = 2^−3

So, for this example, we multiply each digit by the corresponding power of 2:

0.10110011001100110011010₂ = 1×2^−1 + 0×2^−2 + 1×2^−3 + 1×2^−4 + 0×2^−5 + 0×2^−6 + ...
= 1/2 + 1/8 + 1/16 + ...

Note that this number is just an approximation of some decimal number. There will most likely be some error. In this case, the fraction is about 0.7000000476837158.
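A small sketch of this digit-by-digit conversion (the function name is hypothetical):

```python
def fraction_to_decimal(bits):
    """Sum each fraction bit times its (negative) power of two."""
    return sum(int(b) * 2**-(i + 1) for i, b in enumerate(bits))

print(fraction_to_decimal('10110011001100110011010'))
# 0.7000000476837158
```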

This is all the information we need. We can put these numbers in the expression:

(−1)^sign bit × (1 + fraction) × 2^(exponent − bias)
= (−1)^1 × (1.7000000476837158) × 2^(129−127)
= −6.8


The answer is approximately -6.8.
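The worked example can be checked by packing the bit pattern with Python's struct module and reading it back as a float:

```python
import struct

# Sign, exponent, and fraction bits from the example, as one 32-bit integer.
bits = int('1' '10000001' '10110011001100110011010', 2)
value, = struct.unpack('>f', struct.pack('>I', bits))
print(value)  # approximately -6.8 (prints -6.800000190734863 as a double)
```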

Convert 0.1015625 to IEEE 32-bit floating point format. Converting:

0.1015625 × 2 = 0.203125  → 0  Generate 0 and continue.
0.203125  × 2 = 0.40625   → 0  Generate 0 and continue.
0.40625   × 2 = 0.8125    → 0  Generate 0 and continue.
0.8125    × 2 = 1.625     → 1  Generate 1 and continue with the rest.
0.625     × 2 = 1.25      → 1  Generate 1 and continue with the rest.
0.25      × 2 = 0.5       → 0  Generate 0 and continue.
0.5       × 2 = 1.0       → 1  Generate 1 and nothing remains.

So 0.1015625₁₀ = 0.0001101₂.

Normalize: 0.0001101₂ = 1.101₂ × 2^−4. The mantissa is 10100000000000000000000, the exponent is −4 + 127 = 123 = 01111011₂, and the sign bit is 0. So 0.1015625 is 0 01111011 10100000000000000000000.
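The encoding can likewise be checked by packing 0.1015625 as a binary32 and printing its bits:

```python
import struct

# 0.1015625 is exactly representable, so the round trip is exact.
bits, = struct.unpack('>I', struct.pack('>f', 0.1015625))
print(f'{bits:032b}')  # 00111101110100000000000000000000
```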

Binary fractional numbers are rounded to nearest, with ties going to even:
"Even" when the least significant retained bit is 0.
"Half way" when the bits to the right of the rounding position = 100...₂

Examples, rounding to the nearest 1/4 (2 bits right of the binary point):

Value   Binary     Rounded  Action         Rounded Value
2 3/32  10.00011₂  10.00₂   (< 1/2, down)  2
2 3/16  10.00110₂  10.01₂   (> 1/2, up)    2 1/4
2 7/8   10.11100₂  11.00₂   (1/2, up)      3
2 5/8   10.10100₂  10.10₂   (1/2, down)    2 1/2
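Python 3's built-in round() already rounds half-way cases to even, so scaling by 4 reproduces the table; a small sketch (the values here are exactly representable in binary):

```python
def round_quarter(x):
    """Round to the nearest 1/4, ties to even, via round-half-to-even."""
    return round(x * 4) / 4

for v in (2 + 3/32, 2 + 3/16, 2 + 7/8, 2 + 5/8):
    print(v, '->', round_quarter(v))
# 2.09375 -> 2.0, 2.1875 -> 2.25, 2.875 -> 3.0, 2.625 -> 2.5
```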

Floating point addition. Operands:

(−1)^s1 × M1 × 2^E1
(−1)^s2 × M2 × 2^E2, assume E1 > E2

The significand of the smaller operand is shifted right by E1 − E2 bits to align the binary points before the signed significands are added.

Exact result: (−1)^s × M × 2^E
Sign s, significand M: result of signed align & add
Exponent E: E1

Fixing:
If M ≥ 2, shift M right, increment E.
If M < 1, shift M left k positions, decrement E by k.
Overflow if E is out of range.
Round M to fit the fraction precision.

Example in decimal:

  3.25     × 10^3
+ 2.63     × 10^−1

First step: align the decimal points. Second step: add.

  3.25     × 10^3
+ 0.000263 × 10^3
= 3.250263 × 10^3
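A minimal sketch of the alignment step, mirroring the decimal example above (the names and plain-number representation are illustrative):

```python
def fp_add(m1, e1, m2, e2):
    """Add m1*10**e1 + m2*10**e2 by aligning to the larger exponent."""
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1  # ensure e1 >= e2
    m2_aligned = m2 / 10**(e1 - e2)      # shift the smaller operand right
    return m1 + m2_aligned, e1           # result may still need normalizing

m, e = fp_add(3.25, 3, 2.63, -1)
print(f'{m} x 10**{e}')  # 3.250263 x 10**3 (approximately)
```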
