1.	Introduction to C++
2.	Variables, branching, and loops
3.	Functions and recursions
4.	Pointers
5.	Linked lists
6.	Stacks
7.	Sequences
8.	Pointers practice
9.	References
10.	Sorting
11.	Object oriented programming
12.	Trees in C++
13.	Balanced trees
14.	Sets and maps: simplified
15.	Sets and maps: standard
16.	Dynamic programming
17.	Multi core programming and threads
18.	Representation of integers in computers
19.	Floating point representation
20.	Templates
21.	Inheritance
22.	Pointers to functions
23.	Reading data from internet
24.	Graphic cards and OpenCL
25.	OpenCL on AWS
26.	OpenCV and optical mark recognition

19. Floating Point Representation

1. Binary representation of real numbers

It is quite rare for numbers to have finitely many digits in their expansions. Only rational numbers can have finitely many digits, and only some of them. The following theorem is easy to prove and is not surprising at all.

Theorem A proper fraction of the form \(\frac{p}{q}\) has finitely many digits in decimal expansion if and only if the only prime factors of \(q\) are \(2\) and \(5\).

The numbers \(\frac{37}{125}\) and \(\frac{71}{50}\) have finitely many digits in decimal expansion. The number \(\frac{37}{3}\) does not. Once we start working with binary expansion, things are even more desperate.

Theorem Proper fractions of the form \(\frac{p}{q}\) have finite binary expansions if and only if \(q\) is a power of \(2\).

The numbers such as \(\frac1{5}\) and \(\frac1{10}\) have infinite binary expansions. Rational numbers have periodic expansions. Irrational numbers have infinite expansions that are not periodic.

Problem 1. Determine the binary expansion of the number \(9.375\).

Problem 2. Determine the binary expansion of \(\frac15\).

We need to find the sequence \(d_{k-1}\), \(d_{k-2}\), \(\dots\), \(d_0\), \(d_{-1}\), \(d_{-2}\), \(\dots\) such that \begin{eqnarray*} \frac15 &= &d_{k-1}\cdot 2^{k-1}+\cdots + d_1\cdot 2^1+d_0\\ &&+ d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots. \end{eqnarray*} Since \(\frac15< 1\) we must have \(d_i=0\) for \(i\geq 0\). It remains to determine \(d_{-1}\), \(d_{-2}\), \(\dots\). We start with the equation \begin{eqnarray*} \frac15 = d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+d_{-4}\cdot 2^{-4}+d_{-5}\cdot 2^{-5}+\cdots \end{eqnarray*} and multiply both sides by \(2\). We obtain \begin{eqnarray*} \frac25 = d_{-1} +d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+d_{-4}\cdot 2^{-3}+d_{-5}\cdot 2^{-4}+\cdots \end{eqnarray*} Since \(\frac25 < 1\) we have \(d_{-1}=0\). The last equation can be re-written as \begin{eqnarray*} \frac25 = d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+d_{-4}\cdot 2^{-3}+d_{-5}\cdot 2^{-4}+\cdots \end{eqnarray*} Again, we multiply both sides by \(2\) and get \begin{eqnarray*} \frac45 & =& d_{-2} +d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+d_{-5}\cdot 2^{-3}\\ &&+d_{-6}\cdot 2^{-4}+d_{-7}\cdot 2^{-5}+\cdots \end{eqnarray*} From \(\frac45 < 1\) we derive \(d_{-2}=0\). The equation becomes \begin{eqnarray*} \frac45& =& d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+d_{-5}\cdot 2^{-3}\\ &&+d_{-6}\cdot 2^{-4}+d_{-7}\cdot 2^{-5}+\cdots \end{eqnarray*} Our next step consists of multiplying both sides by \(2\). The left side is equal to \(\frac85\) which we will write as \(1+\frac35\). We derive \begin{eqnarray*}1+ \frac35 &=& d_{-3} +d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+d_{-6}\cdot 2^{-3}\\ &&+d_{-7}\cdot 2^{-4}+d_{-8}\cdot 2^{-5}+\cdots \end{eqnarray*} We must have \(d_{-3}=1\) and \begin{eqnarray*} \frac35 &=& d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+d_{-6}\cdot 2^{-3}+d_{-7}\cdot 2^{-4}\\ &&+d_{-8}\cdot 2^{-5}+\cdots \end{eqnarray*} Both sides of the previous equation should be multiplied by \(2\). We obtain \begin{eqnarray*} 1+ \frac15 &=& d_{-4} +d_{-5}\cdot 2^{-1}+d_{-6}\cdot 2^{-2}+d_{-7}\cdot 2^{-3}\\ &&+d_{-8}\cdot 2^{-4}+d_{-9}\cdot 2^{-5}+\cdots \end{eqnarray*} Therefore, \(d_{-4}=1\). The last equation becomes \begin{eqnarray*}\frac15 = d_{-5}\cdot 2^{-1}+d_{-6}\cdot 2^{-2}+d_{-7}\cdot 2^{-3}+d_{-8}\cdot 2^{-4}+d_{-9}\cdot 2^{-5}+\cdots \end{eqnarray*} This equation is the same as the first one that we started with. Repeating the same procedure as before we get \(d_{-5}=0\), \(d_{-6}=0\), \(d_{-7}=1\), \(d_{-8}=1\), and \begin{eqnarray*}\frac15 &=& d_{-9}\cdot 2^{-1}+d_{-10}\cdot 2^{-2}+d_{-11}\cdot 2^{-3}+d_{-12}\cdot 2^{-4}\\ &+&d_{-13}\cdot 2^{-5}+d_{-14}\cdot 2^{-6}+\cdots \end{eqnarray*} Thus, the binary expansion of \(\frac15\) satisfies \begin{eqnarray*} \frac15&=&\overline{0.001100110011\dots}_2 \end{eqnarray*}

2. Mantissa and exponent

The number \(12345.678\) in base \(10\) can be represented in several different ways: \begin{eqnarray*} 12345.678&=&12.345678\cdot 10^3\\ &=& 0.012345678\cdot 10^6\\ &=& 1234567.8\cdot 10^{-2}. \end{eqnarray*} Each of the representations has two important components: mantissa and exponent. In the first representation, the mantissa is \(12.345678\) and the exponent is \(3\). In the second representation, the mantissa is \(0.012345678\) and the exponent is \(6\). In the third representation the mantissa is \(1234567.8\) and the exponent is \(-2\).

3. Normalized mantissa

For every given number, there are infinitely many ways to choose mantissa and exponent. In the previous paragraph we listed \(3\) different ways to represent \(12345.678\) in base \(10\). We could easily find infinitely many more ways to do the same thing. The representations using mantissa and exponent are called floating point representations because the separating point can be placed anywhere by modifying the exponent. The decimal point floats and depends on our choice of the mantissa and exponent. However, one representation is considered special. For our number \(12345.678\), we are going to make an agreement that the following one is the most beautiful and most special of all of the floating point representations: \[12345.678=1.2345678\cdot 10^4.\] The mantissa has exactly one digit to the left of the decimal point. This representation is called normalized. The formal definition is

Definition The representation \(x=p\cdot b^q\) of the real number \(x\) in base \(b\) is called normalized if the absolute value of the mantissa \(p\) satisfies \[1\leq |p| < b.\]

The previous definition is equivalent to saying that there is exactly one digit to the left of the point that separates integer part from the fractional part.

Note: Did you notice how we are using the words point that separates integer part from fractional part? We will not use the word decimal point anymore. Starting from the next section, the binary representations will be the default one. The attribute decimal would be quite wrong and deceiving.

4. Standard IEEE 754 for floating point representation

There are multiple data types that are used to store real numbers. They all have in common some basic rules (also known as standard IEEE 754).

The storage space for the real number is divided into three components:

Component 1: Sign. The sign occupies 1 bit. If the sign bit is \(0\), the number is considered to be positive. If the sign bit is \(1\), the number is negative.
Component 2: Exponent. In the remainder of this document we will use the variable \(e\) to denote the length of the exponent. The length of the exponent varies across operating systems and compilers. Most of the modern processors support two types of numbers: single precision and double precision numbers. In the languages C and C++ these two types are called float and double. Currently, most compilers on \(64\)-bit operating systems have \(e=8\) for type float and \(e=11\) for type double.
Component 3: Mantissa. The length of mantissa will be denoted by \(m\). The standard does not specify how long the mantissa should be. Currently, most compilers have \(m=23\) for type float and \(m=52\) for type double.

5. Normalization

For given real number \(x\), we first determine its normalized representation \(x=p\cdot 2^q\). The mantissa \(p\) is normalized, hence it satisfies \(1\leq p< 2\).

The exponent \(q\) can be positive or negative, however the number that will be stored is always non-negative. This is achieved with the following rule.

Exponent shift.

If the exponent \(q\) satisfies \(2^{e-1} > q > -(2^{e-1}-1)\), then the value \(q+(2^{e-1}-1)\) is stored in the \(e\) bits dedicated to the exponent.
If the exponent is smaller than or equal to \(-(2^{e-1}-1)\), then the number is \(x\) will be called sub-normal. Such \(x\) is considered too small to be normalized and will be treated differently than normal numbers. The value \(0\) is stored instead of the exponent.
If the exponent \(q\) is greater than or equal to \(2^{e-1}\) then the number is too big to store. We will store \(2^e-1\) in the space dedicated to the exponent. The number \(2^e-1\) will set every single bit of the exponent to \(1\) and this will signify that there is an overflow. The content of the memory will not correspond to a real number. Depending on the value inside the mantissa, the content of the memory will send one of the messages:
- The number is \(+\infty\) (if mantissa is \(0\) and the sign is \(0\));
- The number is \(-\infty\) (if mantissa is \(0\) and the sign is \(1\));
- The content of the memory is not a number, commonly known as NaN (if mantissa is non-zero).

For example, in the case of the most common operating systems and most common compilers, the exponents of numbers of type float are stored with shift-127. The exponents of numbers of type double are stored with shift-1023.

The length of the space for exponent storage is \(e\). Therefore \(2^e\) values can be stored in total. Roughly half of the values will be dedicated for positive, and half for negative exponents.

Normal numbers. If the exponent \(q\) belongs to the range \(2^{e-1} > q > -(2^{e-1}-1)\), then we store the normalized number. Since the mantissa \(p\) satisfies \(1 \leq p < 2\) we know for sure that its digit \(d_0\) is equal to \(1\). We don't need to waste space for its storage. We store only the binary digits \(d_{-1}\), \(d_{-2}\), \(\dots\), \(d_{-m}\). In other words, we store only the digits after the point that separates integer part from fractional part.

Problem 3. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(6\) for exponent, and \(9\) for mantissa. Determine the representation of the number \(-23.125\) in this computer.

Sub-normal numbers. If the exponent \(q\) from the normalized representation \(x=p\cdot 2^q\) satisfies \(q \leq -(2^{e-1}-1)\), then we will represent the number \(x\) in a non-normalized form \[x=p'\cdot 2^{-(2^{e-1}-1)}.\] We will store all digits of the mantissa \(p'\) in the allocated \(m\) bits. We will store the value \(0\) in the allocated \(e\) bits for the exponent.

Special case of a sub-normal number is \(0\). Both the mantissa and exponent are \(0\). The sign bit can be either \(0\) or \(1\). This means that there are two zeroes: \(+0\) and \(-0\). This is intentional. The value \(0\) can be result of a scientific calculation that went too far and obtained sub-normal number that can't be stored any more. By analyzing the sign bit, we can get at least an idea whether \(0\) was approached from positive or negative direction.

Problem 4. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(5\) for exponent, and \(15\) for mantissa. Determine the representations for the numbers \(\frac1{2^{14}}\), \(\frac1{2^{15}}\), and \(\frac1{2^{16}}\) in this computer.

6. Difficulties with floating point representation

Earlier we mentioned a bug that happens when working with real numbers. Now it is time to give it a full analysis.

Problem 5. What does the following program print? Explain your answer.

#include<iostream>
int main(){
    double a=1.0, b=0.1, c=0.0;
    for(int i=0;i<10;++i){ c = c+b; }
    if(a==c){
        std::cout<<a<<"=="<<c<<"\n";
    }
    else{
        std::cout<<a<<"!="<<c<<"\n";
    }
    return 0;
}

The following exercise provides more practice with floating point storage.

Problem 6. What does the following program print on a computer where int and float are stored in memory that occupies \(32\) bits? Provide a justification for your answer.

#include<iostream>
int main(){
    int x;
    float d;
    void* aV=&x;
    x=2089;
    for(long i=0;i<19;++i){
        x*=2;
    }
    float* aD;
    aD=static_cast<float*>(aV);
    d=*aD;
    std::cout<<d<<"\n";
    return 0;
}

Problem 7. Create a function truncate whose input is a positive real number \(d\) of type double that produce the output of type long obtained by keeping only the integer part of \(d\) and discarding the fractional part.

Problem 8. The user input consists of a positive number \(M\), a positive integer \(n\), and a sequence \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\) of positive numbers. The person with \(M\) dollars decided to go to store and to spend all of the money. The person has a shopping list of items whose prices are \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\). Create the program that determines whether the person can choose a subset of items from the list whose total price is exactly equal to \(M\).

(a) Solve the problem under the assumption that \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) are of type long.
(b) Solve the problem under the assumption that the user will provide \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) as real numbers of type double and that each of the numbers from the user input will have decimal expansion with at most two digits after the decimal point.

The part (a) of the problem is solved in the document Recursion improvements with maps.

The old program would not work is we only replace long with double when working with \(M\) and \(x\). Let us analyze the code in which we perform this replacement.

#include<iostream>
#include<map>
struct ShoppingPair{
public:
  double M;
  long i;
  int operator<(const ShoppingPair& )const;
};
int ShoppingPair::operator<(const ShoppingPair& b) const{
  if(M < b.M){return 1;}
  if(M > b.M){return 0;}
  if(i < b.i){return 1;}
  return 0;
}
long shoppingList(std::map<ShoppingPair,long>& k, 
                                   double M, double* x, long n){
  // The function returns:
  // * -2 if it is not possible to buy items of value 
  //       exactly equal to M
  // * an integer from the set {0, 1, ..., n-1} if it 
  //       is possible to buy items of value exactly
  //       equal to M. The return value is the index 
  //       of the last integer in the list
  // * -1 if M=0. This corresponds to the answer 
  //         "Yes. It is trivial. Just buy no items"
  if(M == 0){return -1;}
  if(M < 0){return -2;}
  if(n < 1){return -2;}
  ShoppingPair sp;
  sp.M=M; sp.i=n;
  if(k.find(sp)!=k.end()){
    return k[sp];
  }
  if(x[n-1] == M){
    k[sp]=n-1;
    return n-1;
  }
  long attempt;
  if(x[n-1] < M){
    // Attempt: Buy the item n-1.
    // The remaining money is M-x[n-1].
    // See if the remaining money can be fully spent 
    // on the items 0, 1, ..., n-2
    attempt=shoppingList(k,M-x[n-1],x,n-1);
    if(attempt > -1){
      k[sp]=n-1;
      return n-1;
    }
  }
  // Attempt has failed
  // Another attempt: Skip the item n-1.
  // The remaining money is M
  // See if the remaining money can be fully spent 
  // on the items 0, 1, ..., n-2
  attempt=shoppingList(k,M,x,n-1);
  k[sp]=attempt;
  return attempt;
}
int main(){
  double M;
  long n;
  double *x;
  std::cin>>M>>n;
  x=new double[n];
  for(long i=0;i<n;++i){
    std::cin>>x[i];
  }
  std::map<ShoppingPair,long> knowledge;
  long lastTerm=shoppingList(knowledge,M,x,n);
  if(lastTerm==-2){
    std::cout<<"Not possible.\n";
  }
  if(lastTerm==-1){
    std::cout<<"Trivial.\n";
  }
  if(lastTerm>-1){
    std::cout<<"It is possible. These are the items you";
    std::cout<<" need to buy:\n";
    while(M>0){
      std::cout<<lastTerm<<" ";
      M-=x[lastTerm];
      if(M>0){
        lastTerm=shoppingList(knowledge,M,x,lastTerm);
      }
    }
    std::cout<<"\n";
  }
  delete[] x;
  return 0;
}

If we execute the code with the input 31 5 12.3 8 17.6 1.1 28 we would get the output Not possible. This output is obviously wrong, because \(31=12.3+17.6+1.1\). The code is wrong because the numbers \(12.3\), \(17.6\), and \(1.1\) have infinitely many digits in binary expansion. Only finitely many digits are stored and when the rounding happens, the final result of the sum \(12.3+17.6+1.1\) is not exactly equal to \(31\).

The correct way to deal with this problem is to not work with numbers of type double. The floating point representation is not reliable in problems where we need precise results. Only the integers offer the absolute precision. Therefore, we will transform everything to integers. In this problem we will be able to do so because we are assured that the user will only provide numbers that have at most two digits after the decimal point. We will simply multiply everything by \(100\) and truncate to integers using the function we created in problem 0. The complete code is given below

#include<iostream>
#include<map>
long truncate(double d){
  return static_cast<long>(d+0.000000000000001);
}
long multiplyBy100AndConvert(double d){
  return truncate(100.0*d);
}
struct ShoppingPair{
public:
  long M;
  long i;
  int operator<(const ShoppingPair& )const;
};
int ShoppingPair::operator<(const ShoppingPair& b) const{
  if(M < b.M){return 1;}
  if(M > b.M){return 0;}
  if(i < b.i){return 1;}
  return 0;
}
long shoppingList(std::map<ShoppingPair,long>& k, 
                                    long M, long* x, long n){
  // The function returns:
  // * -2 if it is not possible to buy items of value 
  //       exactly equal to M
  // * an integer from the set {0, 1, ..., n-1} if it 
  //       is possible to buy items of value exactly
  //       equal to M. The return value is the index 
  //       of the last integer in the list
  // * -1 if M=0. This corresponds to the answer 
  //         "Yes. It is trivial. Just buy no items"
  if(M == 0){return -1;}
  if(M < 0){return -2;}
  if(n < 1){return -2;}
  ShoppingPair sp;
  sp.M=M; sp.i=n;
  if(k.find(sp)!=k.end()){
    return k[sp];
  }
  if(x[n-1] == M){
    k[sp]=n-1;
    return n-1;
  }
  long attempt;
  if(x[n-1] < M){
    // Attempt: Buy the item n-1.
    // The remaining money is M-x[n-1].
    // See if the remaining money can be fully 
    // spent on the items 0, 1, ..., n-2
    attempt=shoppingList(k,M-x[n-1],x,n-1);
    if(attempt > -1){
      k[sp]=n-1;
      return n-1;
    }
  }
  // Attempt has failed
  // Another attempt: Skip the item n-1.
  // The remaining money is M
  // See if the remaining money can be fully 
  // spent on the items 0, 1, ..., n-2
  attempt=shoppingList(k,M,x,n-1);
  k[sp]=attempt;
  return attempt;
}
int main(){
  long M;
  long n;
  long *x;
  double uI;
  std::cin>>uI>>n;
  M=multiplyBy100AndConvert(uI);
  x=new long[n];
  for(long i=0;i<n;++i){
    std::cin>>uI;
    x[i]=multiplyBy100AndConvert(uI);
  }
  std::map<ShoppingPair,long> knowledge;
  long lastTerm=shoppingList(knowledge,M,x,n);
  if(lastTerm==-2){
    std::cout<<"Not possible.\n";
  }
  if(lastTerm==-1){
    std::cout<<"Trivial.\n";
  }
  if(lastTerm>-1){
    std::cout<<"It is possible. These are the items you";
    std::cout<<" need to buy:\n";
    while(M>0){
      std::cout<<lastTerm<<" ";
      M-=x[lastTerm];
      if(M>0){
        lastTerm=shoppingList(knowledge,M,x,lastTerm);
      }
    }
    std::cout<<"\n";
  }
  delete[] x;
  return 0;
}

From the previous paragraphs we can make the following list of rules regarding floating point numbers that should be used to avoid trouble.

Try to design algorithms that avoid float and double as much as possible.
If you really heave to use float and double, don't use the comparison operators == and !=.
If you are making programs that handle money, don't use float and double.

7. Practice

Problem 9. Create a program that prints the raw content stored in the memory for each of the following data types: int, long, float, and double.

We will print out the following menu with options that the user can choose from:

-1. Exit
 0. Print this menu
 1. Print content of int
 2. Print content of long
 3. Print content of float
 4. Print content of double

We will then take the choice from the user. If it is 1, then we will declare the variable x of type int, read the value from std::cin and then print the raw memory content of the stored number. We do the analogous operations for each of the offered choices.

In the case of int and float we will use void pointer to trick the computer into thinking that the memory corresponds to unsigned int. Once the memory is treated as unsigned integer, we can read the number and convert it into binary. If the variable is of type long or double we use a similar technique. We use a void pointer to read the address of x and send this same address to another pointer to an unsigned long. Then we convert the unsigned long into binary so we can its binary digits. The program is given below.

#include<iostream>
std::string contentToStr(unsigned long x){
  long i=64;
  std::string result="";
  while(i>0){
    --i;
    result=std::to_string(x%2)+result;
    x/=2;
  }
  return result;
}
std::string contentToStr(unsigned int x){
  long i=32;
  std::string result="";
  while(i>0){
    --i;
    result=std::to_string(x%2)+result;
    x/=2;
  }
  return result;
}
std::string contentToStr(int x){
  void* a=&x;
  unsigned int* b=static_cast<unsigned int*>(a);
  return contentToStr(*b);
}
std::string contentToStr(long x){
  void* a=&x;
  unsigned long* b=static_cast<unsigned long*>(a);
  return contentToStr(*b);
}
std::string contentToStr(float x){
  void* a=&x;
  unsigned int* b=static_cast<unsigned int*>(a);
  return contentToStr(*b);
}
std::string contentToStr(double x){
  void* a=&x;
  unsigned long* b=static_cast<unsigned long*>(a);
  return contentToStr(*b);
}
std::string mainMenu(){
  std::string menu;
  menu+="-1. Exit\n";
  menu+=" 0. Print this menu\n";
  menu+=" 1. Print content of int\n";
  menu+=" 2. Print content of long\n";
  menu+=" 3. Print content of float\n";
  menu+=" 4. Print content of double\n";
  return menu;
}
int main(){
  std::string choice;
  std::cout<<mainMenu();
  std::cout<<">> "; std::cin>>choice;
  while(choice!="-1"){
    if(choice=="0"){
      std::cout<<mainMenu();
    }
    if(choice=="1"){
      int x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
    }
    if(choice=="2"){
      long x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
    }
    if(choice=="3"){
      float x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
    }
    if(choice=="4"){
      double x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
    }
    std::cout<<">> "; std::cin>>choice;
  }
  return 0;
}

Problem 10.

The user input consists of two positive integers e and m and a real number d of type double. Create a program that prints the floating point representation of the number d using the storage that has one bit for the sign, e bits for the exponent, and m bits for the mantissa.

Examples:

Input: 
6 9 -3.142
Output:
1100000100100100
Input:
11 52 3.141592653589793
Output: 
0100000000001001001000011111101101010100010001000010110100011000