2021-09-20

Floating point in IEEE 754 format

  • What is the computer architecture (hardware / software / algorithmic / etc.) reason for the ordering of the fields in the '754 format, sign : exponent : fraction?

  • The signed exponent field is encoded using a bias. Why is it done that way instead of just 2’s complement?
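One way to explore both questions experimentally is to look at how float bit patterns compare when they are treated as plain unsigned integers. The sketch below is only a starting point for the discussion, not a full answer; `float_to_bits` is a hypothetical helper name, and only non-negative values are tried.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# A few non-negative values, deliberately out of order.
values = [2.0, 0.5, 1.0e10, 0.0, 1.5, 1.0e-3, 1.0]

# Sort once numerically, and once using only an integer compare on the bits.
sorted_numerically = sorted(values)
sorted_by_bits = sorted(values, key=float_to_bits)

for v in sorted_by_bits:
    print(f"{v:>14} 0x{float_to_bits(v):08x}")

print(sorted_by_bits == sorted_numerically)
```

Try the same experiment with the exponent imagined as 2's complement, or with the fields in a different order, and see whether an integer comparator would still give the right answer.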

IEEE 754 single-precision (32-bit)

[Figure: single-precision bit layout, sign (1 bit) : exponent (8 bits) : fraction (23 bits)]

IEEE 754 double-precision (64-bit)

[Figure: double-precision bit layout, sign (1 bit) : exponent (11 bits) : fraction (52 bits)]

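The bits encode the binary point number \((-1)^s \times 1.F \times 2^{E - \text{bias}}\), where \(s\) is the sign bit, \(E\) is the exponent field read as an unsigned integer, \(F\) is the fraction field, and the bias is 127 for single precision and 1023 for double precision.

For normalized numbers there is always a leading 1. in the significand, so it isn’t encoded into the bit representation. (Zero and subnormals, which have an all-zero exponent field, are the exception.)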
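The formula can be checked by pulling the three fields out of a bit pattern by hand and comparing against the machine's own interpretation. This is a minimal sketch that assumes a normalized single-precision number (it does not handle zero, subnormals, infinities, or NaN); `decode_single` is a hypothetical helper name, and the example pattern 0xC0200000 is chosen for illustration.

```python
import struct

BIAS = 127  # exponent bias for single precision

def decode_single(bits: int):
    """Split a 32-bit pattern into (s, E, F) and rebuild the value.

    Only valid for normalized numbers (exponent field not all 0s or all 1s).
    """
    s = (bits >> 31) & 0x1       # 1 sign bit
    E = (bits >> 23) & 0xFF      # 8 exponent bits, stored with the bias added
    F = bits & 0x7FFFFF          # 23 fraction bits
    significand = 1 + F / 2**23  # the implicit leading "1." plus the fraction
    value = (-1) ** s * significand * 2.0 ** (E - BIAS)
    return s, E, F, value

bits = 0xC0200000
s, E, F, value = decode_single(bits)

# Let the struct module reinterpret the same four bytes as a float.
hardware_view = struct.unpack(">f", bits.to_bytes(4, "big"))[0]

print(s, E, F, value, hardware_view)  # → 1 128 2097152 -2.5 -2.5
```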

Hardware to do arithmetic (+, -, ×, ÷, compare) is … a whole 'nother topic!

Fixed point

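There is a thing called fixed point, where arithmetic is done with the binary point at a fixed, agreed-upon position somewhere inside the N-bit number. In Qm.n notation, m bits sit to the left of the binary point and n bits to the right.

0b1111.1111 is Q4.4 format, while 0b1.101 0101 is Q1.7 format.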
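To see what those two patterns are worth, the raw integer can be divided by 2 raised to the number of fraction bits. A minimal sketch, assuming both patterns are unsigned (some Q-format conventions count a sign bit differently); `q_to_float` is a hypothetical helper name.

```python
def q_to_float(raw: int, frac_bits: int) -> float:
    """Interpret an unsigned raw integer as fixed point with frac_bits fraction bits."""
    return raw / (1 << frac_bits)

# 0b1111.1111 in Q4.4: same 8 bits as the integer 0b11111111, binary point 4 places in.
q4_4 = q_to_float(0b1111_1111, 4)

# 0b1.101 0101 in Q1.7: same 8 bits as the integer 0b11010101, binary point 7 places in.
q1_7 = q_to_float(0b1101_0101, 7)

print(q4_4)  # → 15.9375
print(q1_7)  # → 1.6640625
```

Note that the stored bits carry no record of where the binary point is; the same byte 0b11010101 would be 13.3125 if read as Q4.4.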

An integer ALU can be used for fixed point arithmetic, with some additional software housekeeping to keep track of the binary point. It is somewhat common in DSP implementations, since integer arithmetic is generally faster than full IEEE 754 floating point, especially on devices that have no FPU. In this case, it is up to the programmer to keep track of the binary point! (The programmer then writes functions and other tools to help out.)

Nowadays, you can buy a "floating point DSP" which does have a speed-optimized FPU. Programming DSP algorithms on such devices is much easier than on a fixed-point DSP.
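The kind of housekeeping involved can be sketched with plain integer operations. The example below assumes unsigned Q4.4 values held in 8 bits, with wraparound on overflow and truncation instead of rounding; the helper names (`to_q44`, `from_q44`, `q44_add`, `q44_mul`) are hypothetical, the sort of small tools a programmer would write for this.

```python
FRAC_BITS = 4   # Q4.4: 4 integer bits, 4 fraction bits
MASK = 0xFF     # keep results inside 8 bits, as an 8-bit register would

def to_q44(x: float) -> int:
    """Convert a real value to its raw Q4.4 integer."""
    return int(round(x * (1 << FRAC_BITS))) & MASK

def from_q44(raw: int) -> float:
    """Convert a raw Q4.4 integer back to a real value."""
    return raw / (1 << FRAC_BITS)

def q44_add(a: int, b: int) -> int:
    # Both operands have the binary point in the same place,
    # so an ordinary integer add is already correct.
    return (a + b) & MASK

def q44_mul(a: int, b: int) -> int:
    # The integer product has 8 fraction bits (Q8.8),
    # so shift right by 4 to move the binary point back to Q4.4.
    return ((a * b) >> FRAC_BITS) & MASK

a = to_q44(2.5)
b = to_q44(1.5)
print(from_q44(q44_add(a, b)))  # → 4.0
print(from_q44(q44_mul(a, b)))  # → 3.75
```

The add needs no adjustment, but the multiply does: forgetting the shift is the classic fixed-point bug.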

In-class exercise

Decoding 1

Consider the following 32-bit value:

0001 1001 1000 0101 1000 0111 0011 0111

[Figure: single-precision bit layout]

Answers: Boltzmann constant! (SI exact: 1.380 649 e-23)

    hex       0x19858737
    int       428181303 (same value signed or unsigned, since the top bit is 0)
    decimal   1.38064905242e-23
    computed  1.3806490524162535742025823474237659731211902425229709479026496410369873046875E-23

  • Convert to hex:

  • Value if considered an unsigned integer

  • Value if considered a signed (32-bit) integer

  • Value if considered a 32-bit IEEE 754 single precision float

  • If this was a constant in some code, what well-known symbolic physical constant would this be?

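The hand-worked answers for Decoding 1 can be checked by reinterpreting the same four bytes in three ways. This is a sketch using Python's standard `struct` module; the variable names are chosen for illustration.

```python
import struct

# The 32-bit pattern from Decoding 1.
bits = 0b0001_1001_1000_0101_1000_0111_0011_0111
raw = bits.to_bytes(4, "big")  # the same pattern as 4 big-endian bytes

as_hex = f"0x{bits:08x}"
as_uint32 = struct.unpack(">I", raw)[0]  # unsigned 32-bit integer
as_int32 = struct.unpack(">i", raw)[0]   # signed (2's complement) 32-bit integer
as_single = struct.unpack(">f", raw)[0]  # IEEE 754 single precision

print(as_hex)
print(as_uint32)
print(as_int32)
print(as_single)
```

Swapping in the Decoding 2 pattern makes the signed and unsigned readings differ, because its top bit is set.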

Decoding 2

Another 32-bit value:

1010 0000 0011 1101 0010 0110 1101 0001

[Figure: single-precision bit layout]

Answers: Charge of an electron! (SI exact: 1.602 176 634 e-19)

    hex       0xa03d26d1
    uint      2688362193
    int32     -1606605103
    decimal   -1.602 176 597 46 e-19
    computed  -1.6021765974585869277149415869365700615389869199134409427642822265625E-19

  • Convert to hex:

  • Value if considered an unsigned integer

  • Value if considered a signed (32-bit) integer

  • Value if considered a 32-bit IEEE 754 single precision float

  • If this was a constant in some code, what well-known symbolic physical constant would this be?

  • Is 32 bits enough precision to encode this constant exactly as defined by the SI, without rounding error?
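The last question can be tested by rounding the SI value to single precision and comparing what comes back. A minimal sketch, assuming the magnitude 1.602 176 634 e-19 quoted in these notes and using a Python float (a double, itself only a close approximation of the decimal value) as the reference; `E_CHARGE_SI` and `to_single_and_back` are hypothetical names.

```python
import struct

E_CHARGE_SI = 1.602176634e-19  # magnitude of the SI-defined value, held as a double

def to_single_and_back(x: float) -> float:
    """Round x to the nearest single-precision value, then widen it back to a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

target = -E_CHARGE_SI
stored = to_single_and_back(target)

# Bit pattern of the nearest single-precision value.
bits = struct.unpack(">I", struct.pack(">f", target))[0]

rel_error = abs(stored - target) / E_CHARGE_SI

print(f"0x{bits:08x}")
print(stored == target)
print(rel_error)
```

A single has a 24-bit significand, roughly 7 decimal digits, while the SI value is given to 10 significant digits, so compare where the two decimal expansions first disagree.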

References


Matlab

(ick)

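format long

% a = 0b1111...  % 32 bits
% For example, the "Decoding 1" bit pattern (binary literals need R2019b or later):
a = 0b00011001100001011000011100110111u32;

typecast(a, 'int32')    % same 32 bits read as a signed integer
typecast(a, 'uint32')   % same 32 bits read as an unsigned integer
typecast(a, 'single')   % same 32 bits read as an IEEE 754 single

% A double needs 64 bits of input, so widen to uint64 first (upper 32 bits are zero):
typecast(uint64(a), 'double')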