Floating Point Representation

1. Binary representation of real numbers

It is quite rare for numbers to have finitely many digits in their expansions. Only rational numbers can have finitely many digits, and only some of them. The following theorem is easy to prove and is not surprising at all.

Theorem A proper fraction of the form \(\frac{p}{q}\) has finitely many digits in decimal expansion if and only if the only prime factors of \(q\) are \(2\) and \(5\).

The numbers \(\frac{37}{125}\) and \(\frac{71}{50}\) have finitely many digits in decimal expansion. The number \(\frac{37}{3}\) does not. Once we start working with binary expansion, things are even more desperate.

Theorem Proper fractions of the form \(\frac{p}{q}\) have finite binary expansion if and only if \(q\) is a power of \(2\).

The numbers such as \(\frac1{5}\) and \(\frac1{10}\) have infinite binary expansions. Rational numbers have periodic expansions. Irrational numbers have infinite expansions that are not periodic.

Problem 1. Determine the binary expansion of the number \(9.375\).

Problem 2. Determine the binary expansion of \(\frac15\).

2. Mantissa and exponent

The number \(12345.678\) in base \(10\) can be represented in several different ways: \begin{eqnarray*} 12345.678&=&12.345678\cdot 10^3\\ &=& 0.012345678\cdot 10^6\\ &=& 1234567.8\cdot 10^{-2}. \end{eqnarray*} Each of the representations has two important components: mantissa and exponent. In the first representation, the mantissa is \(12.345678\) and the exponent is \(3\). In the second representation, the mantissa is \(0.012345678\) and the exponent is \(6\). In the third representation the mantissa is \(1234567.8\) and the exponent is \(-2\).

3. Normalized mantissa

For every given number, there are infinitely many ways to choose mantissa and exponent. In the previous paragraph we listed \(3\) different ways to represent \(12345.678\) in base \(10\). We could easily find infinitely many more ways to do the same thing. The representations using mantissa and exponent are called floating point representations because the decimal point can be placed virtually anywhere. The decimal point floats and depends on our choice of the mantissa and exponent. However, one representation is considered special. For our number \(12345.678\), we are going to make an agreement that the following one is the most beautiful and most special of all of the floating point representations: \[12345.678=1.2345678\cdot 10^4.\] The mantissa has exactly one digit to the left of the decimal point. This representation is called normalized. The formal definition is

Definition The representation \(x=p\cdot b^q\) of the real number \(x\) in base \(b\) is called normalized if the absolute value of the mantissa \(p\) satisfies \[1\leq |p| < b.\]

The previous definition is equivalent to saying that there is exactly one digit to the left of the point that separates integer part from the fractional part.

Note: Did you notice how we are using the words point that separates integer part from fractional part? We will not use the word decimal point anymore. Starting from the next section, the binary representations will be the default one. The attribute decimal would be quite wrong and deceiving.

4. Representation of float and double in C++

There are multiple data types that are used to store real numbers. They all have in common some basic rules (also known as standard IEEE 754).

The storage space for the real number is divided into three components:

  • Component 1: Sign. The sign occupies 1 bit. If the sign bit is \(0\), the number is considered to be positive. If the sign bit is \(1\), the number is negative.
  • Component 2: Exponent. In the remainder of this document we will use the variable \(e\) to denote the length of the exponent. The length of the exponent varies across operating systems and compilers. Currently, most compilers on \(64\)-bit operating systems have \(e=8\) for type float and \(e=11\) for type double.
  • Component 3: Mantissa. The length of mantissa will be denoted by \(m\). The standard does not specify how long the mantissa should be. Currently, most compilers have \(m=23\) for type float and \(m=52\) for type double.

5. Normalization

For given real number \(x\), we first determine its normalized representation \(x=p\cdot 2^q\). The mantissa \(p\) is normalized, hence it satisfies \(1\leq p< 2\).

The exponent \(q\) can be positive or negative, however the number that will be stored is always non-negative. This is achieved with the following rule.

Exponent shift.
  • If the exponent \(q\) satisfies \(2^{e-1} > q > -(2^{e-1}-1)\), then the value \(q+(2^{e-1}-1)\) is stored in the \(e\) bits dedicated to the exponent.
  • If the exponent is smaller than or equal to \(-(2^{e-1}-1)\), then the number is \(x\) will be called sub-normal. Such \(x\) is considered too small to be normalized and will be treated differently than normal numbers. The value \(0\) is stored instead of the exponent.
  • If the exponent \(q\) is greater than or equal to \(2^{e-1}\) then the number is too big to store. We will store \(2^e-1\) in the space dedicated to the exponent. The number \(2^e-1\) will set every single bit of the exponent to \(1\) and this will signify that there is an overflow. The content of the memory will not correspond to a real number. Depending on the value inside the mantissa, the content of the memory will send one of the messages:
    • The number is \(+\infty\) (if mantissa is \(0\) and the sign is \(0\));
    • The number is \(-\infty\) (if mantissa is \(0\) and the sign is \(1\));
    • The content of the memory is not a number, commonly known as NaN (if mantissa is non-zero).

For example, in the case of the most common operating systems and most common compilers, the exponents of numbers of type float are stored with shift-127. The exponents of numbers of type double are stored with shift-1023.

The length of the space for exponent storage is \(e\). Therefore \(2^e\) values can be stored in total. Roughly half of the values will be dedicated for positive, and half for negative exponents.

Normal numbers. If the exponent \(q\) belongs to the range \(2^{e-1} > q > -(2^{e-1}-1)\), then we store the normalized number. Since the mantissa \(p\) satisfies \(1 \leq p < 2\) we know for sure that its digit \(d_0\) is equal to \(1\). We don't need to waste space for its storage. We store only the binary digits \(d_{-1}\), \(d_{-2}\), \(\dots\), \(d_{-m}\). In other words, we store only the digits after the point that separates integer part from fractional part.

Problem 3. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(6\) for exponent, and \(9\) for mantissa. Determine the representation of the number \(-23.125\) in this computer.

Sub-normal numbers. If the exponent \(q\) from the normalized representation \(x=p\cdot 2^q\) satisfies \(q \leq -(2^{e-1}-1)\), then we will represent the number \(x\) in a non-normalized form \[x=p'\cdot 2^{-(2^{e-1}-1)}.\] We will store all digits of the mantissa \(p'\) in the allocated \(m\) bits. We will store the value \(0\) in the allocated \(e\) bits for the exponent.

Special case of a sub-normal number is \(0\). Both the mantissa and exponent are \(0\). The sign bit can be either \(0\) or \(1\). This means that there are two zeroes: \(+0\) and \(-0\). This is intentional. The value \(0\) can be result of a scientific calculation that went too far and obtained sub-normal number that can't be stored any more. By analyzing the sign bit, we can get at least an idea whether \(0\) was approached from positive or negative direction.

Problem 4. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(5\) for exponent, and \(15\) for mantissa. Determine the representations for the numbers \(\frac1{2^{14}}\), \(\frac1{2^{15}}\), and \(\frac1{2^{16}}\) in this computer.

6. Difficulties with floating point representation

Earlier we mentioned a bug that happens when working with real numbers. Now it is time to give it a full analysis.

Problem 5. What does the following program print? Explain your answer.
#include<iostream>
int main(){
    double a=1.0, b=0.1, c=0.0;
    for(int i=0;i<10;++i){ c = c+b; }
    if(a==c){
        std::cout<<a<<"=="<<c<<"\n";
    }
    else{
        std::cout<<a<<"!="<<c<<"\n";
    }
    return 0;
}

The following exercise provides more practice with floating point storage.

Problem 6. What does the following program print on a computer where int and float are stored in memory that occupies \(32\) bits? Provide a justification for your answer.
#include<iostream>
int main(){
    int x;
    float d;
    void* aV=&x;
    x=2089;
    for(long i=0;i<19;++i){
        x*=2;
    }
    float* aD;
    aD=static_cast<float*>(aV);
    d=*aD;
    std::cout<<d<<"\n";
    return 0;
}

Problem 7. Create a function truncate whose input is a positive real number \(d\) of type double that produce the output of type long obtained by keeping only the integer part of \(d\) and discarding the fractional part.

Problem 8. The user input consists of a positive number \(M\), a positive integer \(n\), and a sequence \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\) of positive numbers. The person with \(M\) dollars decided to go to store and to spend all of the money. The person has a shopping list of items whose prices are \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\). Create the program that determines whether the person can choose a subset of items from the list whose total price is exactly equal to \(M\).
  • (a) Solve the problem under the assumption that \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) are of type long.
  • (b) Solve the problem under the assumption that the user will provide \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) as real numbers of type double and that each of the numbers from the user input will have decimal expansion with at most two digits after the decimal point.

From the previous paragraphs we can make the following list of rules regarding floating point numbers that should be used to avoid trouble.

  • Try to design algorithms that avoid float and double as much as possible.
  • If you really heave to use float and double, don't use the comparison operators == and !=.
  • If you are making programs that handle money, don't use float and double.

7. Practice

Problem 9. Create a program that prints the raw content stored in the memory for each of the following data types: int, long, float, and double.

Problem 10.

The user input consists of two positive integers e and m and a real number d of type double. Create a program that prints the floating point representation of the number d using the storage that has one bit for the sign, e bits for the exponent, and m bits for the mantissa.

Examples:

Input: 
6 9 -3.142
Output:
1100000100100100
Input:
11 52 3.141592653589793
Output: 
0100000000001001001000011111101101010100010001000010110100011000