This document is about representing numbers in computers. We will first explore the concept of binary numbers. Then, we will learn how integers are stored in the computer. Finally, we will learn about floating point representation of real numbers.

We will first look at two sample programs with *bugs*. These bugs are examples of unpleasant surprises that can occur if we do not properly understand how the numbers are stored. Once we learn how the numbers are represented in the computer, we will learn the limitations and tradeoffs that must be made in designing the computers. Then we will be able to write the code that will not have unwanted behavior due to vulnerabilities resulting from representations of integers and real numbers.

#include<iostream> int main(){ int x=1 ; for(int i=0;i<35;++i){ std::cout<<i<<"\t"<<x<<"\n"; x = x*2; } return 0; }

The program prints the powers of \(2\). Everything goes well with the exponents \(0\), \(1\), \(\dots\), \(30\). We get \(2^0=1\), \(2^1=2\), \(2^2=4\), \(2^3=8\), \(2^4=16\), \(2^5=32\), \(\dots\), \(2^{30}=1073741824\). However, this is where it gets surprising, and, mathematically incorrect: The computer thinks that

\[2^{31}=-2147483648,\mbox{ and } 2^{32}=2^{33}=\cdots = 0.\]We will learn that this is due to the fact that `int`

is stored as a \(32\)-bit number whose first bit is the sign. We will learn even more than that. We will learn how the storage is implemented using technique called *Two's complement*.

#include<iostream> int main(){ double a=1.0, b=0.1, c=0.0; for(int i=0;i<10;++i){ c = c+b; } std::cout<<"a="<<a<<"\n"<<"c="<<c<<"\n"; if(a==c){ std::cout<<a<<"=="<<c<<"\n"; } else{ std::cout<<a<<"!="<<c<<"\n"; } return 0; }

The above code prints the following:

a=1 c=1 1!=1

The two values \(1\) stored in the variables `a`

and `c`

are not the same for our computer. The first is obtained by directly writing \(1\) into the variable `a`

. The second value \(1\) is obtained by adding the number \(0.1\) ten times to the variable `c`

inside the for loop.

Once we learn how the real numbers are stored, we will see that when the number \(0.1\) is stored in binary system, its representation has infinitely many digits. In decimal system, there is only one digit after the decimal point. In binary, it will be infinitely many. Thus, \(0.1\) has to be rounded in the computer. When we add this rounded value ten times, the rounding error becomes significant. The computer sees the result as a number different from \(1\). When the computer displays the results, it rounds it again and prints \(1\) on the output. That is how we end up with the confusing message `1!=1`

.

Until 12th century, the Europeans did not use the decimal system to represent numbers. They wrote `I`

, `II`

, `III`

, `IV`

, and `V`

instead of \(1\), \(2\), \(3\), \(4\), and \(5\). Chinese did almost the same. The characters for the first five numbers in Chinese alphabet are
`一`

, `二`

, `三`

, `四`

, and `五`

.

These old traditional number systems need infinitely many symbols to support infinitely many numbers. The decimal system needs only \(10\) symbols: \(0\), \(1\), \(2\), \(\dots\), \(9\). Let us consider a positive integer \(n\) and let us assume that it has \(k\) digits. We will denote the digits from right to left by \(d_0\), \(d_1\), \(\dots\), \(d_{k-1}\). To avoid confusion with multiplication in the case when digits are labeled as variables, we will write a horizontal line over the digits and have an equation of the form \[ n=\overline{d_{k-1}\dots d_1d_0}.\] For example, if \(n=4529\), then we will write \(k=4\), \(d_0=9\), \(d_1=2\), \(d_2=5\), and \(d_3=4\). We can write \(n=\overline{4529}\) as well, but the horizontal line is not necessary because all digits are known and none of them is represented with a variable.

The following equation holds: \[\overline{d_{k-1}\dots d_1d_0}=d_{k-1}\cdot 10^{k-1}+d_{k-2}\cdot 10^{k-2}+\cdots+ d_1\cdot 10^1+d_0\cdot 10^0.\]

We are very used to this representation of numbers. The system is called *decimal system* and the number \(10\) is called the *base* of the decimal system. It is believed that \(10\) is chosen as the base for the system because humans have \(10\) fingers on their hands. If we accept abandoning this very old habit, we can easily replace \(10\) with any other base and have perfectly well functioning mathematics.

We will now define the representation of integers in an arbitrary base \(b\), where \(b\) is an integer bigger than \(1\).

We will now assume that \(b >1\) and that \(n\) is a positive integer.

`s`

consisting of digits \(0\) and \(1\) only and determines the integer \(n\) whose binary representation is `s`

.
Non-negative integers are the easiest to handle. In computer science they are called *unsigned integers* because we do not need to worry about the sign. We can assume it is \(+\). The binary digits are the only elements that have to be stored.

Unsigned integers were more common in past, but not so much today. The computers had very small memory in past. If your program needed both positive and negative integers, then you would be kind of sad. You would need to devote one entire bit for the sign. If you knew that your program does not need negative numbers, you would feel happy and lucky. You would use unsigned integers. That extra bit would be an extra digit. The possible range for the data would be bigger than if one of the bits were reserved for the sign.

In modern C++ there are two major data types for unsigned integers. The first one is `unsigned int`

and it occupies \(32\) bits. The second type is `unsigned long`

and it takes \(64\) bits. They are rarely used any more.

Let us denote by \(l\) the total number of bits that we are allowed to use for storing our integer. For most C++ compilers, the type `int`

has \(l=32\) while the type `long`

has \(l=64\).

We will treat \(l\) as a general number and in our examples we will keep things simple by using \(l\) that is much smaller than \(32\) and \(64\). We will often use \(l=8\).

The maximal positive number that we will be able to store is \(2^{l-1}-1\). The minimal negative number will be \(-2^{l-1}\). Thus, the range of all integers will be \([-2^{l-1},2^{l-1}-1]\). There are exactly \(2^l\) numbers in this range. Instead of storing the number \(x\), the computer memory will contain *Two's complement* of \(x\). This is the formal definition of two's complement.

From the previous exercise we can observe that for positive \(x\) that satisfies \(x\leq 2^{l-1}-1\), the two's complement is always equal to the binary expansion of \(x\). The number \(2^l+x\) will have digit \(1\) at the position \(l+1\) which gets discarded. Moreover, the left-most digit of the two's complement is equal to \(0\). However, if \(x\) is a negative number that satisfies \(x\geq -2^{l-1}\), then the two's complement will have exactly \(l\) digits. There won't be any digit to discard. The left-most digit of the two's complement will be \(1\).

Thus, the left-most digit of the two's complement is \(0\) if the number is positive, and \(1\) if the number is negative. The left-most digit is the same as the sign.

The arithmetic operations with signed integers can be performed in the same way as if the integers are unsigned. If there is any carryover that would occupy more than \(l\) bits, that carryover should just be ignored.

We are now ready to analyze the code from the Example 1.

#include<iostream> int main(){ int x=1 ; for(int i=0;i<35;++i){ std::cout<<i<<"\t"<<x<<"\n"; x = x*2; } return 0; }

Create the function `addOneInBase8`

. The arguments of the functions are the pointer to the first element of the sequence and its length. The sequence represents the number `n`

in the base 8. The function should update the sequence so it represents the number `n+1`

in base 8.

Your code should replace the text `// ??? //`

below.

void addOneInBase8(long* b, long M){ // ??? // }