8. Floating Point Representation in Assembly Language
1. Binary representation of real numbers
It is quite rare for numbers to have finitely many digits in their expansions. Only rational numbers can have finitely many digits, and only some of them. The following theorem is easy to prove and is not surprising at all.
Theorem A fraction \(\frac{p}{q}\) in lowest terms has finitely many digits in its decimal expansion if and only if \(q\) has no prime factors other than \(2\) and \(5\).
The numbers \(\frac{37}{125}\) and \(\frac{71}{50}\) have finitely many digits in decimal expansion. The number \(\frac{37}{3}\) does not.
Once we start working with binary expansion, things are even more desperate.
Theorem A fraction \(\frac{p}{q}\) in lowest terms has a finite binary expansion if and only if \(q\) is a power of \(2\).
Numbers such as \(\frac1{5}\) and \(\frac1{10}\) have infinite binary expansions. Rational numbers have expansions that are eventually periodic. Irrational numbers have infinite expansions that are not periodic.
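The two theorems can be checked mechanically. The following is a minimal sketch in Python; the function name `has_finite_expansion` is an illustrative choice and not part of the original text. It reduces the fraction to lowest terms and then removes from the denominator every prime factor that it shares with the base.

```python
from fractions import Fraction
from math import gcd


def has_finite_expansion(p: int, q: int, base: int) -> bool:
    """Return True if p/q has finitely many digits in the given base."""
    # Fraction reduces p/q to lowest terms for us.
    q = Fraction(p, q).denominator
    # Remove from q every prime factor that also divides the base.
    g = gcd(q, base)
    while g > 1:
        while q % g == 0:
            q //= g
        g = gcd(q, base)
    # The expansion is finite exactly when nothing else is left.
    return q == 1
```

With this sketch, `has_finite_expansion(37, 125, 10)` and `has_finite_expansion(71, 50, 10)` are true, while `has_finite_expansion(37, 3, 10)` and `has_finite_expansion(1, 5, 2)` are false.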
Problem 1. Determine the binary expansion of the number \(9.375\).
We need to find the sequence \(d_{k-1}\), \(d_{k-2}\), \(\dots\), \(d_0\), \(d_{-1}\), \(d_{-2}\), \(\dots\) such that
\begin{eqnarray*} 9.375 & = &d_{k-1}\cdot 2^{k-1}+\cdots + d_1\cdot 2^1+d_0 \\&&+ d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots.
\end{eqnarray*}
We will decompose \(9.375\) into the sum \(9.375=9+0.375\). Clearly, the binary digits must satisfy
\begin{eqnarray*}
9&=&d_{k-1}\cdot 2^{k-1}+\cdots+ d_1\cdot 2^1+d_0,\\
0.375&=&d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots \tag*{(1)}
\end{eqnarray*}
The binary expansion of \(9\) is \(9=\overline{1001}_2\), which means that \(d_3=1\), \(d_2=d_1=0\), \(d_0=1\), and \(d_m=0\) for \(m\geq 4\). The digit \(d_{-1}\) is determined by multiplying both sides of (1) by \(2\).
\begin{eqnarray*}
0.75&=&d_{-1}+d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+\cdots \tag*{(2)}
\end{eqnarray*}
Every term on the right-hand side of (2) is non-negative, so \(d_{-1}\leq 0.75 < 1\) and we must have \(d_{-1}=0\). The equation (2) becomes
\begin{eqnarray*}
0.75&=&d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+\cdots
\end{eqnarray*}
We can now multiply both sides of the last equation by \(2\) to obtain
\begin{eqnarray*}
1.5&=& d_{-2}+d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+\cdots \tag*{(3)}
\end{eqnarray*}
From equation (3) and \(d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+\cdots\in[0,1]\) we conclude that \(d_{-2}\) must be equal to \(1\). The equation (3) becomes
\begin{eqnarray*}
0.5&=& d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+\cdots
\end{eqnarray*}
We multiply both sides by \(2\) and derive
\begin{eqnarray*}
1&=& d_{-3} +d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+\cdots
\end{eqnarray*}
The last equation finally gives us that \(d_{-3}=1\) and \(d_{-4}=d_{-5}=\cdots =0\). Therefore, the binary expansion of \(9.375\) is
\[9.375=\overline{1001.011}_2\]
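The repeated doubling used in this solution can be sketched in Python. The function name `to_binary` and the parameter `max_frac_digits` are illustrative assumptions. Exact rational arithmetic from the `fractions` module is used so that the computation itself introduces no rounding.

```python
from fractions import Fraction


def to_binary(x: Fraction, max_frac_digits: int = 32) -> str:
    """Binary expansion of a non-negative rational x, cut off after max_frac_digits."""
    integer_part = x.numerator // x.denominator
    frac = x - integer_part
    digits = []
    while frac != 0 and len(digits) < max_frac_digits:
        frac *= 2                 # multiply both sides of the equation by 2
        digit = int(frac >= 1)    # the next digit is the integer part
        digits.append(str(digit))
        frac -= digit             # keep only the fractional part
    result = bin(integer_part)[2:]
    if digits:
        result += "." + "".join(digits)
    return result
```

For example, `to_binary(Fraction("9.375"))` returns `"1001.011"`, in agreement with the expansion derived above.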
Problem 2. Determine the binary expansion of \(\frac15\).
We need to find the sequence \(d_{k-1}\), \(d_{k-2}\), \(\dots\), \(d_0\), \(d_{-1}\), \(d_{-2}\), \(\dots\) such that
\begin{eqnarray*} \frac15 &= &d_{k-1}\cdot 2^{k-1}+\cdots + d_1\cdot 2^1+d_0\\ &&+ d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+\cdots.
\end{eqnarray*}
Since \(\frac15< 1\) we must have \(d_i=0\) for \(i\geq 0\). It remains to determine \(d_{-1}\), \(d_{-2}\), \(\dots\). We start with the equation
\begin{eqnarray*} \frac15 = d_{-1}\cdot 2^{-1}+d_{-2}\cdot 2^{-2}+d_{-3}\cdot 2^{-3}+d_{-4}\cdot 2^{-4}+d_{-5}\cdot 2^{-5}+\cdots
\end{eqnarray*}
and multiply both sides by \(2\). We obtain
\begin{eqnarray*} \frac25 = d_{-1} +d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+d_{-4}\cdot 2^{-3}+d_{-5}\cdot 2^{-4}+\cdots
\end{eqnarray*}
Since \(\frac25 < 1\) we have \(d_{-1}=0\). The last equation can be re-written as
\begin{eqnarray*} \frac25 = d_{-2}\cdot 2^{-1}+d_{-3}\cdot 2^{-2}+d_{-4}\cdot 2^{-3}+d_{-5}\cdot 2^{-4}+\cdots
\end{eqnarray*}
Again, we multiply both sides by \(2\) and get
\begin{eqnarray*} \frac45 & =& d_{-2} +d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+d_{-5}\cdot 2^{-3}\\ &&+d_{-6}\cdot 2^{-4}+d_{-7}\cdot 2^{-5}+\cdots
\end{eqnarray*}
From \(\frac45 < 1\) we derive \(d_{-2}=0\). The equation becomes
\begin{eqnarray*} \frac45& =& d_{-3}\cdot 2^{-1}+d_{-4}\cdot 2^{-2}+d_{-5}\cdot 2^{-3}\\ &&+d_{-6}\cdot 2^{-4}+d_{-7}\cdot 2^{-5}+\cdots
\end{eqnarray*}
Our next step consists of multiplying both sides by \(2\). The left side is equal to \(\frac85\) which we will write as \(1+\frac35\). We derive
\begin{eqnarray*}1+ \frac35 &=& d_{-3} +d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+d_{-6}\cdot 2^{-3}\\ &&+d_{-7}\cdot 2^{-4}+d_{-8}\cdot 2^{-5}+\cdots
\end{eqnarray*}
We must have \(d_{-3}=1\) and
\begin{eqnarray*} \frac35 &=& d_{-4}\cdot 2^{-1}+d_{-5}\cdot 2^{-2}+d_{-6}\cdot 2^{-3}+d_{-7}\cdot 2^{-4}\\ &&+d_{-8}\cdot 2^{-5}+\cdots
\end{eqnarray*}
Both sides of the previous equation should be multiplied by \(2\). We obtain
\begin{eqnarray*} 1+ \frac15 &=& d_{-4} +d_{-5}\cdot 2^{-1}+d_{-6}\cdot 2^{-2}+d_{-7}\cdot 2^{-3}\\ &&+d_{-8}\cdot 2^{-4}+d_{-9}\cdot 2^{-5}+\cdots
\end{eqnarray*}
Therefore, \(d_{-4}=1\). The last equation becomes
\begin{eqnarray*}\frac15 = d_{-5}\cdot 2^{-1}+d_{-6}\cdot 2^{-2}+d_{-7}\cdot 2^{-3}+d_{-8}\cdot 2^{-4}+d_{-9}\cdot 2^{-5}+\cdots
\end{eqnarray*}
This equation is the same as the first one that we started with. Repeating the same procedure as before we get \(d_{-5}=0\), \(d_{-6}=0\), \(d_{-7}=1\), \(d_{-8}=1\), and
\begin{eqnarray*}\frac15 &=& d_{-9}\cdot 2^{-1}+d_{-10}\cdot 2^{-2}+d_{-11}\cdot 2^{-3}+d_{-12}\cdot 2^{-4}\\ &&+d_{-13}\cdot 2^{-5}+d_{-14}\cdot 2^{-6}+\cdots
\end{eqnarray*}
Thus, the binary expansion of \(\frac15\) satisfies
\begin{eqnarray*}
\frac15&=&\overline{0.001100110011\dots}_2
\end{eqnarray*}
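The observation that the equation eventually repeats can be turned into a procedure that finds the repeating block. The sketch below is one possible way to do it; the function name and the use of parentheses to mark the repeating block are assumptions made for illustration. It remembers every remainder it has seen and stops as soon as a remainder appears for the second time.

```python
from fractions import Fraction


def binary_fraction_with_period(x: Fraction) -> str:
    """Binary expansion of a rational 0 < x < 1; the repeating block is in parentheses."""
    seen = {}     # remainder -> position of the digit produced from it
    digits = []
    frac = x
    while frac != 0 and frac not in seen:
        seen[frac] = len(digits)
        frac *= 2
        digit = int(frac >= 1)
        digits.append(str(digit))
        frac -= digit
    if frac == 0:
        # The expansion terminated, so there is no repeating block.
        return "0." + "".join(digits)
    start = seen[frac]   # the digits from this position on repeat forever
    return "0." + "".join(digits[:start]) + "(" + "".join(digits[start:]) + ")"
```

For \(\frac15\) the function returns `"0.(0011)"`, and for \(\frac1{10}\) it returns `"0.0(0011)"`.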
2. Mantissa and exponent
The number \(12345.678\) in base \(10\) can be represented in several different ways:
\begin{eqnarray*}
12345.678&=&12.345678\cdot 10^3\\
&=& 0.012345678\cdot 10^6\\
&=& 1234567.8\cdot 10^{-2}.
\end{eqnarray*}
Each of the representations has two important components: mantissa and exponent. In the first representation, the mantissa is \(12.345678\) and the exponent is \(3\). In the second representation, the mantissa is \(0.012345678\) and the exponent is \(6\). In the third representation the mantissa is \(1234567.8\) and the exponent is \(-2\).
3. Normalized mantissa
For every given number, there are infinitely many ways to choose the mantissa and exponent. In the previous section we listed \(3\) different ways to represent \(12345.678\) in base \(10\), and we could easily find infinitely many more. The representations using mantissa and exponent are called floating point representations because the separating point can be placed anywhere by modifying the exponent. The decimal point floats and depends on our choice of the mantissa and exponent. However, one representation is considered special. For our number \(12345.678\), we are going to make an agreement that the following one is the most beautiful and most special of all of the floating point representations: \[12345.678=1.2345678\cdot 10^4.\] The mantissa has exactly one digit to the left of the decimal point, and that digit is not zero. This representation is called normalized. The formal definition is
Definition The representation \(x=p\cdot b^q\) of the real number \(x\) in base \(b\) is called normalized if the absolute value of the mantissa \(p\) satisfies \[1\leq |p| < b.\]
The previous definition is equivalent to saying that there is exactly one digit to the left of the point that separates the integer part from the fractional part, and that this digit is not zero.
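The definition translates into a short normalization procedure: divide or multiply the mantissa by the base until it lands in the interval \([1,b)\), and adjust the exponent accordingly. The following Python sketch uses the hypothetical name `normalize` and works with exact fractions.

```python
from fractions import Fraction


def normalize(x: Fraction, base: int = 2):
    """Return (p, q) such that x == p * base**q and 1 <= |p| < base."""
    if x == 0:
        raise ValueError("zero has no normalized representation")
    p, q = x, 0
    while abs(p) >= base:   # mantissa too large: move the point to the left
        p /= base
        q += 1
    while abs(p) < 1:       # mantissa too small: move the point to the right
        p *= base
        q -= 1
    return p, q
```

For example, `normalize(Fraction("12345.678"), 10)` returns the mantissa \(1.2345678\) (as an exact fraction) and the exponent \(4\).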
Note: Did you notice that we wrote "the point that separates the integer part from the fractional part"? We will not use the term decimal point anymore. Starting from the next section, binary representations will be the default. The attribute decimal would be wrong and misleading.
4. Standard IEEE 754 for floating point representation
There are multiple data types that are used to store real numbers. They all follow a common set of rules, known as the IEEE 754 standard.
The storage space for the real number is divided into three components:
- Component 1: Sign. The sign occupies 1 bit. If the sign bit is \(0\), the number is considered to be positive. If the sign bit is \(1\), the number is negative.
- Component 2: Exponent. In the remainder of this document we will use the variable \(e\) to denote the length of the exponent in bits. The length depends on the format. Most modern processors support two formats: single precision and double precision. In the languages C and C++ the corresponding types are usually called float and double. The IEEE 754 standard prescribes \(e=8\) for single precision and \(e=11\) for double precision.
- Component 3: Mantissa. The length of the mantissa in bits will be denoted by \(m\). The IEEE 754 standard prescribes \(m=23\) for single precision and \(m=52\) for double precision.
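The three components can be inspected directly. The sketch below uses Python's standard `struct` module, which packs the formats `f` and `d` as IEEE 754 single and double precision values. The function names are illustrative. The functions return the raw contents of the three fields, exactly as they sit in memory.

```python
import struct


def fields_of_double(x: float):
    """Return (sign, exponent field, mantissa field) of x stored in 64 bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # e = 11 bits
    mantissa = bits & ((1 << 52) - 1)   # m = 52 bits
    return sign, exponent, mantissa


def fields_of_single(x: float):
    """Return (sign, exponent field, mantissa field) of x stored in 32 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # e = 8 bits
    mantissa = bits & ((1 << 23) - 1)   # m = 23 bits
    return sign, exponent, mantissa
```

The field widths add up to the full storage size: \(1+11+52=64\) and \(1+8+23=32\).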
5. Normalization
For a given real number \(x\), we first determine its normalized representation \(x=p\cdot 2^q\). The sign is stored separately, so we may assume that \(x\) is positive. The mantissa \(p\) is normalized, hence it satisfies \(1\leq p< 2\).
The exponent \(q\) can be positive or negative, however the number that will be stored is always non-negative. This is achieved with the following rule.
Exponent shift.
- If the exponent \(q\) satisfies \(2^{e-1} > q > -(2^{e-1}-1)\), then the value \(q+(2^{e-1}-1)\) is stored in the \(e\) bits dedicated to the exponent.
- If the exponent \(q\) is smaller than or equal to \(-(2^{e-1}-1)\), then the number \(x\) is called sub-normal. Such \(x\) is considered too small to be normalized and will be treated differently than normal numbers. The value \(0\) is stored instead of the exponent.
- If the exponent \(q\) is greater than or equal to \(2^{e-1}\), then the number is too big to store. We will store \(2^e-1\) in the space dedicated to the exponent. The number \(2^e-1\) sets every single bit of the exponent to \(1\), and this signifies that there is an overflow. The content of the memory will not correspond to a real number. Depending on the value inside the mantissa, the content of the memory will send one of the messages:
- The number is \(+\infty\) (if the mantissa is \(0\) and the sign is \(0\));
- The number is \(-\infty\) (if the mantissa is \(0\) and the sign is \(1\));
- The content of the memory is not a number, commonly known as NaN (if the mantissa is non-zero).
For example, the exponents of numbers of type float (\(e=8\)) are stored with shift \(2^{7}-1=127\). The exponents of numbers of type double (\(e=11\)) are stored with shift \(2^{10}-1=1023\).
The length of the space for exponent storage is \(e\). Therefore \(2^e\) values can be stored in total. Roughly half of the values will be dedicated for positive, and half for negative exponents.
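The exponent shift rule can be written down as two small functions. This is only a sketch of the rules stated above. The names `stored_exponent` and `classify` are hypothetical, and the returned strings are chosen for illustration.

```python
def stored_exponent(q: int, e: int) -> int:
    """Value placed in the e exponent bits for a number with exponent q."""
    shift = 2 ** (e - 1) - 1
    if q <= -shift:
        return 0              # sub-normal: the value 0 is stored
    if q >= 2 ** (e - 1):
        return 2 ** e - 1     # overflow: every exponent bit is 1
    return q + shift          # normal: the shifted exponent is stored


def classify(sign: int, exponent_field: int, mantissa_field: int, e: int) -> str:
    """Interpret the three stored fields according to the rules above."""
    if exponent_field == 2 ** e - 1:
        if mantissa_field != 0:
            return "NaN"
        return "-infinity" if sign == 1 else "+infinity"
    if exponent_field == 0:
        return "sub-normal"
    return "normal"
```

For \(e=8\), `stored_exponent(0, 8)` is \(127\), the largest normal exponent \(q=127\) is stored as \(254\), and any \(q\geq 128\) is stored as \(255\).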
Normal numbers. If the exponent \(q\) belongs to the range \(2^{e-1} > q > -(2^{e-1}-1)\), then we store the normalized number. Since the mantissa \(p\) satisfies \(1 \leq p < 2\) we know for sure that its digit \(d_0\) is equal to \(1\). We don't need to waste space for its storage. We store only the binary digits \(d_{-1}\), \(d_{-2}\), \(\dots\), \(d_{-m}\). In other words, we store only the digits after the point that separates integer part from fractional part.
Problem 3. Assume that real numbers in a computer are represented using 16 bits: \(1\) for sign, \(6\) for exponent, and \(9\) for mantissa. Determine the representation of the number \(-23.125\) in this computer.
The sign is \(1\). The absolute value of the number is \(23.125\). We will first determine the binary expansion of \(23.125\). Notice that \(23.125=23+\frac18\). The binary expansion of \(23\) is \(23=\overline{10111}_2\). The binary expansion of \(\frac18\) is \(\frac18=\overline{0.001}_2\). Therefore \(23.125=\overline{10111.001}_2\) and the normalized floating point representation is \[23.125=\overline{10111.001}_2= \overline{1.0111001}_2\cdot2^{4}.\]
The exponent shift is \(2^{6-1}-1=31\). Therefore, we need to store the number \(4+31=35\) in the \(6\) exponent bits. The binary expansion of \(35\) is \(35=\overline{100011}_2\). The mantissa is \(p= \overline{1.0111001}_2\). Since the number is normal (and not sub-normal), the first bit of mantissa will not be stored. Thus, the storage will look like this: \[\underbrace{1}_1\underbrace{100011}_6\underbrace{011100100}_9\]
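The steps of this solution can be combined into a small encoder for normal numbers. The sketch below is written in Python with the hypothetical name `encode_normal`. It assumes that mantissa digits beyond the available \(m\) bits are simply cut off, which is a simplification, since rounding is not discussed here.

```python
from fractions import Fraction


def encode_normal(x: Fraction, e: int, m: int) -> str:
    """Bit string (sign, exponent, mantissa) of a normal number x."""
    if x == 0:
        raise ValueError("zero is not a normal number")
    shift = 2 ** (e - 1) - 1
    sign = "1" if x < 0 else "0"
    # Normalize |x| so that 1 <= p < 2.
    p, q = abs(x), 0
    while p >= 2:
        p /= 2
        q += 1
    while p < 1:
        p *= 2
        q -= 1
    if not (-shift < q < 2 ** (e - 1)):
        raise ValueError("the number is not normal in this format")
    exponent_bits = format(q + shift, f"0{e}b")
    # The leading digit 1 is not stored; produce the next m digits by doubling.
    frac = p - 1
    mantissa_bits = []
    for _ in range(m):
        frac *= 2
        digit = int(frac >= 1)
        mantissa_bits.append(str(digit))
        frac -= digit
    # Assumption: any remaining digits are truncated, not rounded.
    return sign + exponent_bits + "".join(mantissa_bits)
```

The call `encode_normal(Fraction("-23.125"), 6, 9)` returns `"1100011011100100"`, which is the storage obtained above.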
Sub-normal numbers. If the exponent \(q\) from the normalized representation \(x=p\cdot 2^q\) satisfies \(q \leq -(2^{e-1}-1)\), then we will represent the number \(x\) in a non-normalized form
\[x=p'\cdot 2^{-(2^{e-1}-1)}.\] We will store all digits of the mantissa \(p'\) in the allocated \(m\) bits. We will store the value \(0\) in the allocated \(e\) bits for the exponent.
A special case of a sub-normal number is \(0\). Both the mantissa and the exponent are \(0\). The sign bit can be either \(0\) or \(1\). This means that there are two zeroes: \(+0\) and \(-0\). This is intentional. The value \(0\) can be the result of a scientific calculation that went too far and obtained a sub-normal number that is too small to be stored. By analyzing the sign bit, we can get at least an idea whether \(0\) was approached from the positive or the negative direction.
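The two zeroes can be observed in practice. The following Python sketch multiplies two very small double precision numbers. The exact product \(10^{-400}\) is far too small to be stored, so the result is \(0\), but the sign bit survives. The helper `sign_bit` is an illustrative name; it reads the sign bit with the standard `struct` module.

```python
import struct


def sign_bit(x: float) -> int:
    """Return the sign bit of x stored as a double precision number."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63


# Both products are too small to be stored and become zero.
tiny_positive = 1e-200 * 1e-200
tiny_negative = -1e-200 * 1e-200
```

Both results compare equal to `0.0`, yet `sign_bit(tiny_positive)` is \(0\) and `sign_bit(tiny_negative)` is \(1\).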
Problem 4. Assume that real numbers in a computer are represented using 21 bits: \(1\) for sign, \(5\) for exponent, and \(15\) for mantissa. Determine the representations for the numbers \(\frac1{2^{14}}\), \(\frac1{2^{15}}\), and \(\frac1{2^{16}}\) in this computer.
The exponent shift is \(2^{5-1}-1=15\). A number is normal only if its exponent \(q\) satisfies \(q > -15\); exponents \(q\leq -15\) give sub-normal numbers. Therefore, the number \(\frac1{2^{14}}\) will be normalized, while \(\frac1{2^{15}}\) and \(\frac1{2^{16}}\) are sub-normal.
The number \(2^{-14}\) has the normalized representation \(2^{-14}=1\cdot 2^{-14}\). The mantissa is \(1\) and the exponent is \(-14\). The exponent is stored with shift \(15\), so the stored value is \(-14+15=1\). Since the number is normal, the leading digit \(1\) of the mantissa is not stored, and all \(15\) stored mantissa bits are \(0\). The storage will be \[\underbrace{0}_1\underbrace{00001}_5\underbrace{000000000000000}_{15}\]
The numbers \(2^{-15}\) and \(2^{-16}\) will be represented with exponent \(-15\). The representations are \(2^{-15}=1\cdot 2^{-15}\) and \(2^{-16}=\frac12\cdot 2^{-15}\). The mantissa of the first number is \(1\). The mantissa of the second number is \(\frac12=\overline{0.1}_2\) (in binary). Neither mantissa will be normalized. The representations of numbers are \begin{eqnarray*}&& \underbrace{0}_1\underbrace{00000}_5\underbrace{100000000000000}_{15}\quad\mbox{and}\\
&& \underbrace{0}_1\underbrace{00000}_5\underbrace{010000000000000}_{15}.\end{eqnarray*}
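The stored bit patterns can be checked by decoding them back into numbers. The sketch below uses the hypothetical name `decode` and follows the convention of this solution: for a sub-normal number the first stored mantissa bit is the digit \(d_0\), while for a normal number the digit \(d_0=1\) is implied and not stored.

```python
from fractions import Fraction


def decode(bits: str, e: int, m: int) -> Fraction:
    """Exact value of a bit string made of 1 sign, e exponent and m mantissa bits."""
    if len(bits) != 1 + e + m:
        raise ValueError("wrong number of bits")
    sign = -1 if bits[0] == "1" else 1
    exponent_field = int(bits[1:1 + e], 2)
    mantissa_field = int(bits[1 + e:], 2)
    shift = 2 ** (e - 1) - 1
    if exponent_field == 2 ** e - 1:
        raise ValueError("the bits encode infinity or NaN, not a real number")
    if exponent_field == 0:
        # Sub-normal: the stored bits are d_0 d_{-1} ... d_{-(m-1)}.
        p = Fraction(mantissa_field, 2 ** (m - 1))
        return sign * p * Fraction(2) ** (-shift)
    # Normal: the leading 1 is implied, the stored bits are d_{-1} ... d_{-m}.
    p = 1 + Fraction(mantissa_field, 2 ** m)
    return sign * p * Fraction(2) ** (exponent_field - shift)
```

Decoding the three bit strings of this problem with `e=5` and `m=15` gives back \(2^{-14}\), \(2^{-15}\), and \(2^{-16}\).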
6. Floating point registers in ARM 64 assembly language
We have seen that floating point storage and floating point operations are very different from the integer storage and integer operations.
The hardware has separate registers on which the floating point operations are performed.
There are 32 registers in the ARM 64 architecture that are reserved for double precision floating point storage. Their names are D0, D1, ..., D31. Each of them occupies 64 bits.
The single precision registers are the lower halves of the double precision registers. Their names are S0, S1, ..., S31. Each of them occupies 32 bits.
The floating point instructions for addition, subtraction, multiplication and division are: FADD, FSUB, FMUL, and FDIV.
Problem 5.
The numbers a and b are double-precision real numbers. They are already placed in RAM. The address of the number a is stored in X1. The address of the number b is stored in X2. Write code that calculates \(a/b\) and stores the result in RAM at the address that is stored in X0.
Problem 6.
The numbers a and b are single-precision real numbers. They are already placed in RAM. The address of the number a is stored in X1. The address of the number b is stored in X2. Write code that calculates \(a/b\) and stores the result in RAM at the address that is stored in X0.