20. Floating Point Representation
1. Binary representation of real numbers
It is quite rare for numbers to have finitely many digits in their expansions. Only rational numbers can have finitely many digits, and only some of them. The following theorem is easy to prove and is not surprising at all.
Theorem A proper fraction of the form \(\frac{p}{q}\) has finitely many digits in decimal expansion if and only if the only prime factors of \(q\) are \(2\) and \(5\).
The numbers \(\frac{37}{125}\) and \(\frac{71}{50}\) have finitely many digits in decimal expansion. The number \(\frac{37}{3}\) does not.
Once we start working with binary expansion, things are even more desperate.
Theorem Proper fractions of the form \(\frac{p}{q}\) have finite binary expansion if and only if \(q\) is a power of \(2\).
The numbers such as \(\frac1{5}\) and \(\frac1{10}\) have infinite binary expansions. Rational numbers have periodic expansions. Irrational numbers have infinite expansions that are not periodic.
Problem 1. Determine the binary expansion of the number \(9.375\).
We need to find the sequence \(d_{k-1}\), \(d_{k-2}\), \(\dots\), \(d_0\), \(d_{-1}\), \(d_{-2}\), \(\dots\) such that
\begin{eqnarray*} 9.375 & = &d_{k-1}\cdot 2^{k-1}+\cdots + d_1\cdot 2^1+d_0 \\&&+ d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots.
\end{eqnarray*}
We will decompose \(9.375\) into the sum \(9.375=9+0.375\). Clearly, the binary digits must satisfy
\begin{eqnarray*}
9&=&d_{k-1}\cdot 2^{k-1}+\cdots+ d_1\cdot 2^1+d_0,\\
0.375&=&d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots \tag*{(1)}
\end{eqnarray*}
The binary expansion of \(9\) is \(9=\overline{1001}_2\), which means that \(d_3=1\), \(d_2=d_1=0\), \(d_0=1\), and \(d_m=0\) for \(m\geq 4\). The digit \(d_{-1}\) is determined by multiplying both sides of (1) by \(2\).
\begin{eqnarray*}
0.75&=&d_{-1}+d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+\cdots \tag*{(2)}
\end{eqnarray*}
Since \(d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+\cdots \leq 1\) we must have \(d_{-1}=0\). The equation (2) becomes
\begin{eqnarray*}
0.75&=&d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+\cdots
\end{eqnarray*}
We can now multiply both sides of the last equation by \(2\) to obtain
\begin{eqnarray*}
1.5&=& d_{-2}+d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+\cdots \tag*{(3)}
\end{eqnarray*}
From equation (3) and \(d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+\cdots\in[0,1]\) we conclude that \(d_{-2}\) must be equal to \(1\). The equation (3) becomes
\begin{eqnarray*}
0.5&=& d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+\cdots
\end{eqnarray*}
We multiply both sides by \(2\) and derive
\begin{eqnarray*}
1&=& d_{-3} +d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+\cdots
\end{eqnarray*}
The last equation finally gives us that \(d_{-3}=1\) and \(d_{-4}=d_{-5}=\cdots =0\). Therefore, the binary expansion of \(9.375\) is
\[9.375=\overline{1001.011}_2\]
Problem 2. Determine the binary expansion of \(\frac15\).
We need to find the sequence \(d_{k-1}\), \(d_{k-2}\), \(\dots\), \(d_0\), \(d_{-1}\), \(d_{-2}\), \(\dots\) such that
\begin{eqnarray*} \frac15 &= &d_{k-1}\cdot 2^{k-1}+\cdots + d_1\cdot 2^1+d_0\\ &&+ d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots.
\end{eqnarray*}
Since \(\frac15< 1\) we must have \(d_i=0\) for \(i\geq 0\). It remains to determine \(d_{-1}\), \(d_{-2}\), \(\dots\). We start with the equation
\begin{eqnarray*} \frac15 = d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+d_{-4}\cdot 2^{-4}+d_{-5}\cdot 2^{-5}+\cdots
\end{eqnarray*}
and multiply both sides by \(2\). We obtain
\begin{eqnarray*} \frac25 = d_{-1} +d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+d_{-4}\cdot 2^{-3}+d_{-5}\cdot 2^{-4}+\cdots
\end{eqnarray*}
Since \(\frac25 < 1\) we have \(d_{-1}=0\). The last equation can be re-written as
\begin{eqnarray*} \frac25 = d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+d_{-4}\cdot 2^{-3}+d_{-5}\cdot 2^{-4}+\cdots
\end{eqnarray*}
Again, we multiply both sides by \(2\) and get
\begin{eqnarray*} \frac45 & =& d_{-2} +d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+d_{-5}\cdot 2^{-3}\\ &&+d_{-6}\cdot 2^{-4}+d_{-7}\cdot 2^{-5}+\cdots
\end{eqnarray*}
From \(\frac45 < 1\) we derive \(d_{-2}=0\). The equation becomes
\begin{eqnarray*} \frac45& =& d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+d_{-5}\cdot 2^{-3}\\ &&+d_{-6}\cdot 2^{-4}+d_{-7}\cdot 2^{-5}+\cdots
\end{eqnarray*}
Our next step consists of multiplying both sides by \(2\). The left side is equal to \(\frac85\) which we will write as \(1+\frac35\). We derive
\begin{eqnarray*}1+ \frac35 &=& d_{-3} +d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+d_{-6}\cdot 2^{-3}\\ &&+d_{-7}\cdot 2^{-4}+d_{-8}\cdot 2^{-5}+\cdots
\end{eqnarray*}
We must have \(d_{-3}=1\) and
\begin{eqnarray*} \frac35 &=& d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+d_{-6}\cdot 2^{-3}+d_{-7}\cdot 2^{-4}\\ &&+d_{-8}\cdot 2^{-5}+\cdots
\end{eqnarray*}
Both sides of the previous equation should be multiplied by \(2\). We obtain
\begin{eqnarray*} 1+ \frac15 &=& d_{-4} +d_{-5}\cdot 2^{-1}+d_{-6}\cdot 2^{-2}+d_{-7}\cdot 2^{-3}\\ &&+d_{-8}\cdot 2^{-4}+d_{-9}\cdot 2^{-5}+\cdots
\end{eqnarray*}
Therefore, \(d_{-4}=1\). The last equation becomes
\begin{eqnarray*}\frac15 = d_{-5}\cdot 2^{-1}+d_{-6}\cdot 2^{-2}+d_{-7}\cdot 2^{-3}+d_{-8}\cdot 2^{-4}+d_{-9}\cdot 2^{-5}+\cdots
\end{eqnarray*}
This equation is the same as the first one that we started with. Repeating the same procedure as before we get \(d_{-5}=0\), \(d_{-6}=0\), \(d_{-7}=1\), \(d_{-8}=1\), and
\begin{eqnarray*}\frac15 &=& d_{-9}\cdot 2^{-1}+d_{-10}\cdot 2^{-2}+d_{-11}\cdot 2^{-3}+d_{-12}\cdot 2^{-4}\\ &+&d_{-13}\cdot 2^{-5}+d_{-14}\cdot 2^{-6}+\cdots
\end{eqnarray*}
Thus, the binary expansion of \(\frac15\) satisfies
\begin{eqnarray*}
\frac15&=&\overline{0.001100110011\dots}_2
\end{eqnarray*}
2. Mantissa and exponent
The number \(12345.678\) in base \(10\) can be represented in several different ways:
\begin{eqnarray*}
12345.678&=&12.345678\cdot 10^3\\
&=& 0.012345678\cdot 10^6\\
&=& 1234567.8\cdot 10^{-2}.
\end{eqnarray*}
Each of the representations has two important components: mantissa and exponent. In the first representation, the mantissa is \(12.345678\) and the exponent is \(3\). In the second representation, the mantissa is \(0.012345678\) and the exponent is \(6\). In the third representation the mantissa is \(1234567.8\) and the exponent is \(-2\).
3. Normalized mantissa
For every given number, there are infinitely many ways to choose mantissa and exponent. In the previous paragraph we listed \(3\) different ways to represent \(12345.678\) in base \(10\). We could easily find infinitely many more ways to do the same thing. The representations using mantissa and exponent are called floating point representations because the separating point can be placed anywhere by modifying the exponent. The decimal point floats and depends on our choice of the mantissa and exponent. However, one representation is considered special. For our number \(12345.678\), we are going to make an agreement that the following one is the most beautiful and most special of all of the floating point representations: \[12345.678=1.2345678\cdot 10^4.\] The mantissa has exactly one digit to the left of the decimal point. This representation is called normalized. The formal definition is
Definition The representation \(x=p\cdot b^q\) of the real number \(x\) in base \(b\) is called normalized if the absolute value of the mantissa \(p\) satisfies \[1\leq |p| < b.\]
The previous definition is equivalent to saying that there is exactly one digit to the left of the point that separates integer part from the fractional part.
Note: Did you notice how we are using the words point that separates integer part from fractional part? We will not use the word decimal point anymore. Starting from the next section, the binary representations will be the default one. The attribute decimal would be quite wrong and deceiving.
4. Standard IEEE 754 for floating point representation
There are multiple data types that are used to store real numbers. They all have in common some basic rules (also known as standard IEEE 754).
The storage space for the real number is divided into three components:
- Component 1: Sign. The sign occupies 1 bit. If the sign bit is \(0\), the number is considered to be positive. If the sign bit is \(1\), the number is negative.
- Component 2: Exponent. In the remainder of this document we will use the variable \(e\) to denote the length of the exponent. The length of the exponent varies across operating systems and compilers.
Most of the modern processors support two types of numbers: single precision and double precision numbers. In the languages
C and C++ these two types are called float and double.
Currently, most compilers on \(64\)-bit operating systems have \(e=8\) for type float and \(e=11\) for type double.
- Component 3: Mantissa. The length of mantissa will be denoted by \(m\). The standard does not specify how long the mantissa should be. Currently, most compilers have \(m=23\) for type
float and \(m=52\) for type double.
5. Normalization
For given real number \(x\), we first determine its normalized representation \(x=p\cdot 2^q\). The mantissa \(p\) is normalized, hence it satisfies \(1\leq p< 2\).
The exponent \(q\) can be positive or negative, however the number that will be stored is always non-negative. This is achieved with the following rule.
Exponent shift. - If the exponent \(q\) satisfies \(2^{e-1} > q > -(2^{e-1}-1)\), then the value \(q+(2^{e-1}-1)\) is stored in the \(e\) bits dedicated to the exponent.
- If the exponent is smaller than or equal to \(-(2^{e-1}-1)\), then the number is \(x\) will be called sub-normal. Such \(x\) is considered too small to be normalized and will be treated differently than normal numbers. The value \(0\) is stored instead of the exponent.
- If the exponent \(q\) is greater than or equal to \(2^{e-1}\) then the number is too big to store. We will store \(2^e-1\) in the space dedicated to the exponent. The number \(2^e-1\) will set every single bit of the exponent to \(1\) and this will signify that there is an overflow. The content of the memory will not correspond to a real number. Depending on the value inside the mantissa, the content of the memory will send one of the messages:
- The number is \(+\infty\) (if mantissa is \(0\) and the sign is \(0\));
- The number is \(-\infty\) (if mantissa is \(0\) and the sign is \(1\));
- The content of the memory is not a number, commonly known as
NaN (if mantissa is non-zero).
For example, in the case of the most common operating systems and most common compilers, the exponents of numbers of type float are stored with shift-127. The exponents of numbers of type double are stored with shift-1023.
The length of the space for exponent storage is \(e\). Therefore \(2^e\) values can be stored in total. Roughly half of the values will be dedicated for positive, and half for negative exponents.
Normal numbers. If the exponent \(q\) belongs to the range \(2^{e-1} > q > -(2^{e-1}-1)\), then we store the normalized number. Since the mantissa \(p\) satisfies \(1 \leq p < 2\) we know for sure that its digit \(d_0\) is equal to \(1\). We don't need to waste space for its storage. We store only the binary digits \(d_{-1}\), \(d_{-2}\), \(\dots\), \(d_{-m}\). In other words, we store only the digits after the point that separates integer part from fractional part.
Problem 3. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(6\) for exponent, and \(9\) for mantissa. Determine the representation of the number \(-23.125\) in this computer.
The sign is \(1\). The absolute value of the number is \(23.125\). We will first determine the binary expansion of \(23.125\). Notice that \(23.125=23+\frac18\). The binary expansion of \(23\) is \(23=\overline{10111}_2\). The binary expansion of \(\frac18\) is \(\frac18=\overline{0.001}_2\). Therefore \(23.125=\overline{10111.001}_2\) and the normalized floating point representation is \[23.125=\overline{10111.001}_2= \overline{1.0111001}_2\cdot2^{4}.\]
The exponent shift is \(2^{6-1}-1=31\). Therefore, we need to store the number \(4+31=35\) in the \(6\) exponent bits. The binary expansion of \(35\) is \(35=\overline{100011}_2\). The mantissa is \(p= \overline{1.0111001}_2\). Since the number is normal (and not sub-normal), the first bit of mantissa will not be stored. Thus, the storage will look like this: \[\underbrace{1}_1\underbrace{100011}_6\underbrace{011100100}_9\]
Sub-normal numbers. If the exponent \(q\) from the normalized representation \(x=p\cdot 2^q\) satisfies \(q \leq -(2^{e-1}-1)\), then we will represent the number \(x\) in a non-normalized form
\[x=p'\cdot 2^{-(2^{e-1}-1)}.\] We will store all digits of the mantissa \(p'\) in the allocated \(m\) bits. We will store the value \(0\) in the allocated \(e\) bits for the exponent.
Special case of a sub-normal number is \(0\). Both the mantissa and exponent are \(0\). The sign bit can be either \(0\) or \(1\). This means that there are two zeroes: \(+0\) and \(-0\). This is intentional. The value \(0\) can be result of a scientific calculation that went too far and obtained sub-normal number that can't be stored any more. By analyzing the sign bit, we can get at least an idea whether \(0\) was approached from positive or negative direction.
Problem 4. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(5\) for exponent, and \(15\) for mantissa. Determine the representations for the numbers \(\frac1{2^{14}}\), \(\frac1{2^{15}}\), and \(\frac1{2^{16}}\) in this computer.
The exponent shift is \(2^{5-1}-1=15\). The minimal exponent is \(-15\). Therefore, the number \(\frac1{2^{14}}\) will be normalized, while \(\frac1{2^{15}}\) and \(\frac1{2^{16}}\) are sub-normal.
The number \(2^{-14}\) has the normalized representation \(2^{-14}=1\cdot 2^{-14}\). The mantissa is \(1\) and the exponent is \(-14\). The exponent will be stored with shift 15. The exponent is \(1\). The mantissa will be normalized, and thus equal to \(0\). The storage will be \[\underbrace{0}_1\underbrace{00001}_5\underbrace{000000000000000}_{15}\]
The numbers \(2^{-15}\) and \(2^{-16}\) will be represented with exponent \(-15\). The representations are \(2^{-15}=1\cdot 2^{-15}\) and \(2^{-16}=\frac12\cdot 2^{-15}\). The mantissa of the first number is \(1\). The mantissa of the second number is \(\frac12=\overline{0.1}_2\) (in binary). Neither mantissa will be normalized. The representations of numbers are \begin{eqnarray*}&& \underbrace{0}_1\underbrace{00000}_5\underbrace{100000000000000}_{15}\quad\mbox{and}\\
&& \underbrace{0}_1\underbrace{00000}_5\underbrace{010000000000000}_{15}.\end{eqnarray*}
6. Difficulties with floating point representation
Earlier we mentioned a bug that happens when working with real numbers. Now it is time to give it a full analysis.
Problem 5. What does the following program print? Explain your answer.
#include<iostream>
int main(){
double a=1.0, b=0.1, c=0.0;
for(int i=0;i<10;++i){ c = c+b; }
if(a==c){
std::cout<<a<<"=="<<c<<"\n";
}
else{
std::cout<<a<<"!="<<c<<"\n";
}
return 0;
}
The program prints
1!=1
The number \(b\) is equal to \(0.1\) in decimal representation. The binary representation is infinite. The binary representation of \(b\) is \[b=\overline{0.0001100110011\dots}_2 \]
The value stored in memory is only an approximation of \(b\). The number \(c\) starts by being equal to \(0\). Then we add \(b\) to \(c\) ten times. We can inspect the contents of memory locations of \(a\) and \(c\). This is what we see
a=0011111111110000000000000000000000000000000000000000000000000000
c=0011111111101111111111111111111111111111111111111111111111111111
When the number \(c\) is rounded to any reasonable number of digits, the number \(c\) becomes \(1\). However, the comparison operation
== is brutal. It cannot overlook the fact that \(a\) and \(c\) are different.
The following exercise provides more practice with floating point storage.
Problem 6. What does the following program print on a computer where
int and
float are stored in memory that occupies \(32\) bits? Provide a justification for your answer.
#include<iostream>
int main(){
int x;
float d;
void* aV=&x;
x=2089;
for(long i=0;i<19;++i){
x*=2;
}
float* aD;
aD=static_cast<float*>(aV);
d=*aD;
std::cout<<d<<"\n";
return 0;
}
The variable x stores the value \(2089\) whose binary representation is
\[2089_{10}=100000101001_2\]
The number has a total of \(12\) digits. When the number is multiplied by \(2^{19}\), it will have 31 digits. The last \(19\) of them will be \(0\). Since this will be stored as a \(32\)-bit integer, the binary content of the memory will be
\[x =\begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} \hline
31&30&29&28&27&26&25&24&23&22&21&20&19&18&17&\dots&2&1&0\\ \hline 0&1&0&0&0&0&0&1&0&1&0&0&1&0&0&\dots&0&0&0\\
\hline\end{array}\]
Once the memory location is considered to be a real number of type double, the first bit will be a sign. That sign is \(+\). The bit is \(0\), hence the sign is +. The second 8 bits are exponent shifted by \(127\). These \(8\) bits correspond to the number
\( 10000010_2=130_{10}\). The real exponent is \(130-127=3\). The remaining bits are mantissa without the first \(1\) and the dot. The stored portion of the mantissa is \(1001\). Therefore, the real mantissa is \[1.1001_2=1+1\cdot 2^{-1}+0\cdot 2^{-2}+0\cdot 2^{-3}+1\cdot 2^{-4}.\] The number \(d\) is therefore
\[8\cdot \left(1+1\cdot 2^{-1}+0\cdot 2^{-2}+0\cdot 2^{-3}+1\cdot 2^{-4}\right)=8+4+\frac12=12.5.\] The computer will print \(12.5\).
Problem 7. Create a function truncate whose input is a positive real number \(d\) of type double that produce the output of type long obtained by keeping only the integer part of \(d\) and discarding the fractional part.
Our first attempt is to use the function
static_cast. That attempt will result in a bug, hence the name
attempt:
long truncateAttempt1(double d){
return static_cast<long>(d);
}
The function
static_cast<long>(d) does truncation but for binary representation. Due to the troubles with infinite expansions of some very innocent looking numbers, the following code will produce a behavior that is not desired:
double z; z=0.3+0.6+0.1; std::cout << truncateAttempt1(z)<<"\n";
The truncation will result in 0, because after adding \(0.3\), \(0.6\), and \(0.1\) our computer will not get \(1\) as the result. The correct truncation function should add some very small value to \(d\) to overcome the possible error that was caused by rounding. The improved function is
long truncate(double d){
return static_cast<long>(d+0.000000000000001);
}
Problem 8. The user input consists of a positive number \(M\), a positive integer \(n\), and a sequence \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\) of positive numbers. The person with \(M\) dollars decided to go to store and to spend all of the money. The person has a shopping list of items whose prices are \(x[0]\), \(x[1]\), \(\dots\), \(x[n-1]\). Create the program that determines whether the person can choose a subset of items from the list whose total price is exactly equal to \(M\).
- (a) Solve the problem under the assumption that \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) are of type
long.
- (b) Solve the problem under the assumption that the user will provide \(M\) and \(x[1]\), \(\dots\), \(x[n-1]\) as real numbers of type
double and that each of the numbers from the user input will have decimal expansion with at most two digits after the decimal point.
The part (a) of the problem is solved in the document Recursion improvements with maps.
The old program would not work is we only replace long with double when working with \(M\) and \(x\). Let us analyze the code in which we perform this replacement.
#include<iostream>
#include<map>
struct ShoppingPair{
public:
double M;
long i;
int operator<(const ShoppingPair& )const;
};
int ShoppingPair::operator<(const ShoppingPair& b) const{
if(M < b.M){return 1;}
if(M > b.M){return 0;}
if(i < b.i){return 1;}
return 0;
}
long shoppingList(std::map<ShoppingPair,long>& k,
double M, double* x, long n){
// The function returns:
// * -2 if it is not possible to buy items of value
// exactly equal to M
// * an integer from the set {0, 1, ..., n-1} if it
// is possible to buy items of value exactly
// equal to M. The return value is the index
// of the last integer in the list
// * -1 if M=0. This corresponds to the answer
// "Yes. It is trivial. Just buy no items"
if(M == 0){return -1;}
if(M < 0){return -2;}
if(n < 1){return -2;}
ShoppingPair sp;
sp.M=M; sp.i=n;
if(k.find(sp)!=k.end()){
return k[sp];
}
if(x[n-1] == M){
k[sp]=n-1;
return n-1;
}
long attempt;
if(x[n-1] < M){
// Attempt: Buy the item n-1.
// The remaining money is M-x[n-1].
// See if the remaining money can be fully spent
// on the items 0, 1, ..., n-2
attempt=shoppingList(k,M-x[n-1],x,n-1);
if(attempt > -1){
k[sp]=n-1;
return n-1;
}
}
// Attempt has failed
// Another attempt: Skip the item n-1.
// The remaining money is M
// See if the remaining money can be fully spent
// on the items 0, 1, ..., n-2
attempt=shoppingList(k,M,x,n-1);
k[sp]=attempt;
return attempt;
}
int main(){
double M;
long n;
double *x;
std::cin>>M>>n;
x=new double[n];
for(long i=0;i<n;++i){
std::cin>>x[i];
}
std::map<ShoppingPair,long> knowledge;
long lastTerm=shoppingList(knowledge,M,x,n);
if(lastTerm==-2){
std::cout<<"Not possible.\n";
}
if(lastTerm==-1){
std::cout<<"Trivial.\n";
}
if(lastTerm>-1){
std::cout<<"It is possible. These are the items you";
std::cout<<" need to buy:\n";
while(M>0){
std::cout<<lastTerm<<" ";
M-=x[lastTerm];
if(M>0){
lastTerm=shoppingList(knowledge,M,x,lastTerm);
}
}
std::cout<<"\n";
}
delete[] x;
return 0;
}
If we execute the code with the input 31 5 12.3 8 17.6 1.1 28 we would get the output Not possible. This output is obviously wrong, because \(31=12.3+17.6+1.1\). The code is wrong because the numbers \(12.3\), \(17.6\), and \(1.1\) have infinitely many digits in binary expansion. Only finitely many digits are stored and when the rounding happens, the final result of the sum \(12.3+17.6+1.1\) is not exactly equal to \(31\).
The correct way to deal with this problem is to not work with numbers of type double. The floating point representation is not reliable in problems where we need precise results. Only the integers offer the absolute precision. Therefore, we will transform everything to integers. In this problem we will be able to do so because we are assured that the user will only provide numbers that have at most two digits after the decimal point. We will simply multiply everything by \(100\) and truncate to integers using the function we created in problem 0. The complete code is given below
#include<iostream>
#include<map>
long truncate(double d){
return static_cast<long>(d+0.000000000000001);
}
long multiplyBy100AndConvert(double d){
return truncate(100.0*d);
}
struct ShoppingPair{
public:
long M;
long i;
int operator<(const ShoppingPair& )const;
};
int ShoppingPair::operator<(const ShoppingPair& b) const{
if(M < b.M){return 1;}
if(M > b.M){return 0;}
if(i < b.i){return 1;}
return 0;
}
long shoppingList(std::map<ShoppingPair,long>& k,
long M, long* x, long n){
// The function returns:
// * -2 if it is not possible to buy items of value
// exactly equal to M
// * an integer from the set {0, 1, ..., n-1} if it
// is possible to buy items of value exactly
// equal to M. The return value is the index
// of the last integer in the list
// * -1 if M=0. This corresponds to the answer
// "Yes. It is trivial. Just buy no items"
if(M == 0){return -1;}
if(M < 0){return -2;}
if(n < 1){return -2;}
ShoppingPair sp;
sp.M=M; sp.i=n;
if(k.find(sp)!=k.end()){
return k[sp];
}
if(x[n-1] == M){
k[sp]=n-1;
return n-1;
}
long attempt;
if(x[n-1] < M){
// Attempt: Buy the item n-1.
// The remaining money is M-x[n-1].
// See if the remaining money can be fully
// spent on the items 0, 1, ..., n-2
attempt=shoppingList(k,M-x[n-1],x,n-1);
if(attempt > -1){
k[sp]=n-1;
return n-1;
}
}
// Attempt has failed
// Another attempt: Skip the item n-1.
// The remaining money is M
// See if the remaining money can be fully
// spent on the items 0, 1, ..., n-2
attempt=shoppingList(k,M,x,n-1);
k[sp]=attempt;
return attempt;
}
int main(){
long M;
long n;
long *x;
double uI;
std::cin>>uI>>n;
M=multiplyBy100AndConvert(uI);
x=new long[n];
for(long i=0;i<n;++i){
std::cin>>uI;
x[i]=multiplyBy100AndConvert(uI);
}
std::map<ShoppingPair,long> knowledge;
long lastTerm=shoppingList(knowledge,M,x,n);
if(lastTerm==-2){
std::cout<<"Not possible.\n";
}
if(lastTerm==-1){
std::cout<<"Trivial.\n";
}
if(lastTerm>-1){
std::cout<<"It is possible. These are the items you";
std::cout<<" need to buy:\n";
while(M>0){
std::cout<<lastTerm<<" ";
M-=x[lastTerm];
if(M>0){
lastTerm=shoppingList(knowledge,M,x,lastTerm);
}
}
std::cout<<"\n";
}
delete[] x;
return 0;
}
From the previous paragraphs we can make the following list of rules regarding floating point numbers that should be used to avoid trouble.
- Try to design algorithms that avoid
float and double as much as possible.
- If you really heave to use
float and double, don't use the comparison operators == and !=.
- If you are making programs that handle money, don't use
float and double.
7. Practice
Problem 9. Create a program that prints the raw content stored in the memory for each of the following data types: int, long, float, and double.
We will print out the following menu with options that the user can choose from:
-1. Exit
0. Print this menu
1. Print content of int
2. Print content of long
3. Print content of float
4. Print content of double
We will then take the choice from the user. If it is 1, then we will declare the variable x of type int, read the value from std::cin and then print the raw memory content of the stored number. We do the analogous operations for each of the offered choices.
In the case of int and float we will use void pointer to trick the computer into thinking that the memory corresponds to unsigned int. Once the memory is treated as unsigned integer, we can read the number and convert it into binary. If the variable is of type long or double we use a similar technique. We use a void pointer to read the address of x and send this same address to another pointer to an unsigned long. Then we convert the unsigned long into binary so we can its binary digits. The program is given below.
#include<iostream>
std::string contentToStr(unsigned long x){
long i=64;
std::string result="";
while(i>0){
--i;
result=std::to_string(x%2)+result;
x/=2;
}
return result;
}
std::string contentToStr(unsigned int x){
long i=32;
std::string result="";
while(i>0){
--i;
result=std::to_string(x%2)+result;
x/=2;
}
return result;
}
std::string contentToStr(int x){
void* a=&x;
unsigned int* b=static_cast<unsigned int*>(a);
return contentToStr(*b);
}
std::string contentToStr(long x){
void* a=&x;
unsigned long* b=static_cast<unsigned long*>(a);
return contentToStr(*b);
}
std::string contentToStr(float x){
void* a=&x;
unsigned int* b=static_cast<unsigned int*>(a);
return contentToStr(*b);
}
std::string contentToStr(double x){
void* a=&x;
unsigned long* b=static_cast<unsigned long*>(a);
return contentToStr(*b);
}
std::string mainMenu(){
std::string menu;
menu+="-1. Exit\n";
menu+=" 0. Print this menu\n";
menu+=" 1. Print content of int\n";
menu+=" 2. Print content of long\n";
menu+=" 3. Print content of float\n";
menu+=" 4. Print content of double\n";
return menu;
}
int main(){
std::string choice;
std::cout<<mainMenu();
std::cout<<">> "; std::cin>>choice;
while(choice!="-1"){
if(choice=="0"){
std::cout<<mainMenu();
}
if(choice=="1"){
int x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
}
if(choice=="2"){
long x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
}
if(choice=="3"){
float x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
}
if(choice=="4"){
double x; std::cin>>x; std::cout<<contentToStr(x)<<"\n";
}
std::cout<<">> "; std::cin>>choice;
}
return 0;
}
Problem 10.
The user input consists of two positive integers e and m and a real number d of type double. Create a program that prints the floating point representation of the number d using the storage that has one bit for the sign, e bits for the exponent, and m bits for the mantissa.
Examples:
Input:
6 9 -3.142
Output:
1100000100100100
Input:
11 52 3.141592653589793
Output:
0100000000001001001000011111101101010100010001000010110100011000
We will write the code that handles the small subnormal numbers as well. Once we collect the real number d and the lengths el and ml of mantissa and exponent, we pass them to the function floatingPointRepresentation. The function first determines the sign bit and replaces d with its absolute value. Then it calculates the exponent shift. The exponent shift is 2^(el-1)-1. In the next step we determine the exponent and normalize the mantissa. We treat cases d < 1.0 and d >= 1.0 separately as they correspond to negative and positive exponent, respectively.
The complete code is given below
#include<iostream>
std::string floatingPointRepresentation(double d, long el, long ml){
if(el<3){el=3;}
if(ml<3){ml=3;}
long signBit=0;
if(d<0.0){signBit=1; d*=-1;}
long expShift=1;
for(long i=1;i < el;++i){
expShift*=2;
}
expShift-=1;
long exponent=expShift;
if(d < 1.0){
while((d < 1.0)&&(exponent>0)){
d += d;
--exponent;
}
}
else{
while(d >= 2.0){
d /= 2.0;
++exponent;
}
}
std::string f="";
f+=std::to_string(signBit);
long i=0;
while(i < el+ml){
f+='0';
++i;
}
i=el;
if(exponent > 0){
if(d >= 1.0){d -= 1.0; d += d;}
}
while((exponent > 0)&&(i > 0)){
if(exponent%2==1){f[i]='1';}
--i;
exponent /= 2;
}
i=el+1;
long totalLen=el+ml+1;
while((i < totalLen)&&(d>0.0)) {
if(d >= 1.0){
f[i]='1';
d -= 1.0;
}
d+=d;
++i;
}
return f;
}
int main(){
long el;
long ml;
double d;
std::cin>>el>>ml>>d;
std::cout<<floatingPointRepresentation(d,el,ml)<<"\n";
return 0;
}