That's it I give up

[ipdf/sam.git] / chapters / Background.tex
diff --git a/chapters/Background.tex b/chapters/Background.tex

index ee8da21..5d63d80 100644 (file)
--- a/chapters/Background.tex
+++ b/chapters/Background.tex
@@ -107,12 +107,14 @@ The Number type does differ slightly from IEEE-754 in that there is only a singl
  
  \section{Number Representations}
  
+Consider a value of $7.25 = 2^2 + 2^1 + 2^0 + 2^{-2}$. In binary, this can be written as $111.01_2$ Such a value would require 5 binary digits (bits) of memory to represent exactly. On the other hand a rational value such as $7\frac{1}{3}$ could not be represented exactly; the sequence of bits $111.0111 \text{ ...}_2$ never terminates.
  
-\subsection{Integers and Fixed Point Numbers}
+Integer representations do not allow for a ``point'' as shown in these examples, that is, no bits are reserved for the fractional part of a number. $7.25$ would be rounded to the nearest integer value, $7$. A fixed point representation keeps the decimal place at the same position for all values. Floating point representations can be thought of as analogous to scientific notation; an ``exponent'' and fixed point value are encoded, with multiplication by the exponent moving the position of the point.
  
+Modern computer hardware typically supports integer and floating-point number representations. Due to physical limitations, the size of these representations is limited; this is the fundamental source of both limits on range and precision in computer based calculations. 
  
  
-\subsection{Floating Points}
+\subsection{Floating Point Definition}
  
  A floating point number $x$ is commonly represented by a tuple of values $(s, e, m)$ in base $B$ as\cite{HFP, ieee2008-754}:
  
@@ -120,46 +122,78 @@ A floating point number $x$ is commonly represented by a tuple of values $(s, e,
         x &= (-1)^{s} \times m \times B^{e}
  \end{align*}
  
-Where $s$ is the sign and may be zero or one, $m$ is commonly called the ``mantissa'' and $e$ is the exponent. Whilst $e$ is an integer in some range $\pm e_max$, the mantissa $m$ is actually a fixed point value in the range $0 < m < B$. The name ``floating point'' refers to the equivelance of the $\times B^e$ operation to a shifting of the ``fixed point'' along the mantissa.
+Where $s$ is the sign and may be zero or one, $m$ is commonly called the ``mantissa'' and $e$ is the exponent. Whilst $e$ is an integer in some range $\pm e_max$, the mantissa $m$ is a fixed point value in the range $0 < m < B$. 
  
-For example, the value $7.25$ can be expressed as:
-\begin{enumerate}
-       \item 
-       
-\end{enumerate}
  
-The choice of base $B = 2$, closely matches the nature of modern hardware. It has also been found that this base in general gives the smallest rounding errors\cite{HFP}. Early computers had in fact used a variety of representations including $B=3$ or even $B=7$\cite{goldman1991whatevery}, and the revised IEEE-754 standard specifies a decimal representation $B = 10$ intended for use in financial applications\cite{ieee754std2008}. From now on we will restrict ourselves to considering base 2 floats.
+The choice of base $B = 2$ in the original IEEE-754 standard matches the nature of modern hardware. It has also been found that this base in general gives the smallest rounding errors\cite{HFP}. Early computers had in fact used a variety of representations including $B=3$ or even $B=7$\cite{goldman1991whatevery}, and the revised IEEE-754 standard specifies a decimal representation $B = 10$ intended for use in financial applications\cite{ieee754std2008}. From now on we will restrict ourselves to considering base 2 floats.
  
-Figure \ref{minifloat.pdf} shows the positive real numbers which can be represented exactly by an 8 bit floating point number encoded in the IEEE-754 format, and the distance between successive floating point numbers. We show two encodings using (1,2,5) and (1,3,4) bits to encode (sign, exponent, mantissa) respectively.
+The IEEE-754 encoding of $s$, $e$ and $m$ requires a fixed number of continuous bits dedicated to each value. Originally two encodings were defined: binary32 and binary64. $s$ is always encoded in a single leading bit, whilst (8,23) and (11,53) bits are used for the (exponent, mantissa) encodings respectively. 
  
-For each distinct value of the exponent, the successive floating point representations lie on a straight line with constant slope. As the exponent increases, larger values are represented, but the distance between successive values increases\footnote{A plot of fixed point numbers or integers (which we omit for space considerations) would show points lying on a straight line with a constant slope between points}.
+$m$ as encoded in the IEEE-754 standard is not exactly equivelant to a fixed point value. By assuming an implicit leading bit (ie: restricting $1 \leq m < 2$) except for when $e = 0$, floating point values are gauranteed to have unique representations. When $e = 0$ the leading bit is not implied; these values are called ``denormals''. This idea, which allows for 1 extra bit of precision for the normalised values appears to have been considered by Goldberg as early as 1967\cite{goldbern1967twentyseven}.
  
-In the graph of the difference between representations, a single isolated point should be visible; this is not an error, but due to the greater discontinuity between the denormalised and normalised values ($e = 0$ and $1$ respectively). 
+Figure \ref{float.pdf} shows the positive real numbers which can be represented exactly by an 8 bit floating point number encoded in the IEEE-754 format\footnote{Not quite; we are ignoring the IEEE-754 definitions of NaN and Infinity for simplicity}, and the distance between successive floating point numbers. We show two encodings using (1,2,5) and (1,3,4) bits to encode (sign, exponent, mantissa) respectively. We have also plotted a fixed point representation. For each distinct value of the exponent, the successive floating point representations lie on a straight line with constant slope. As the exponent increases, larger values are represented, but the distance between successive values increases; this can be seen on the right. The marked single point discontinuity at \verb/0x10/ and \verb/0x20/ occur when $e$ leaves the denormalised region and the encoding of $m$ changes.
+
+The earlier example $7.25$ would be converted to a (1,3,4) floating point representation as follows:
+\begin{enumerate}
+       \item Determine the fixed point representation $7.25 = 111.01_2$
+       \item Determine the sign bit; in this case $s = 0$
+       \item Calculate the exponent by shifting the point $111.01_2 = 1.1101_2 \times 2^2 \implies e = 2 = 10_2$
+       \item Determine the exponent encoding; in IEEE-754 equal to the number of exponent bits is added so $e_{enc} = e+3 = 5 = 101_2$
+       \item Remove the implicit bit if the encoded exponent $\neq 0$; $1.1101_2 \to .1101_2$
+       \item Combine the three bit strings$0,101,1101$
+       \item The final encoding is $01011101 \equiv \text{0x5D}$
+\end{enumerate}
+This particular example can be encoded exactly; however as there are an infinite number of real values and only a finite number of floats, in general a value must be $7.26$ must be rounded or truncated at Step 3. 
  
  
  \begin{figure}[H]
         \centering
-       \includegraphics[width=0.8\textwidth]{figures/minifloat.pdf} \\
-       \includegraphics[width=0.8\textwidth]{figures/minifloat_diff.pdf}
-       \caption{The mapping of 8 bit floats to reals}
+\begin{minipage}[t]{0.45\textwidth}
+       \begin{figure}[H]
+               \centering
+               \includegraphics[width=1\textwidth]{figures/floats.pdf} \\
+       \end{figure}
+\end{minipage}
+\begin{minipage}[t]{0.45\textwidth}
+       \begin{figure}[H]
+               \centering
+               \includegraphics[width=1\textwidth]{figures/floats_diff.pdf} \\
+       \end{figure}
+\end{minipage}
+       \caption{8 bit float and fixed point representations a) As mapped to real values b) The distance between each representation}\label{floats.pdf}
  \end{figure}
  
-\subsection{Floating Point Operations}
-
-Floating point operations can in principle be performed using integer operations, but specialised Floating Point Units (FPUs) are an almost universal component of modern processors\cite{kelley1997acmos}. The improvement of FPUs remains highly active in several areas including: efficiency\cite{seidel2001onthe}; accuracy of operations\cite{dieter2007lowcost}; and even the adaptation of algorithms originally used in software for reducing the overal error of a sequence of operations\cite{kadric2013accurate}. In this section we will briefly describe the algorithms for floating point operations without focusing on the hardware implementation of these algorithms.
  
  
  \subsection{Precision and Rounding} 
  
-Real values which cannot be represented exactly in a floating point representation must be rounded to the nearest floating point value. The results of a floating point operation will in general be such values and thus there is a rounding error possible in any floating point operation. Referring to Figure \ref{minifloat.pdf} it can be seen that the largest possible rounding error, or ``units in last place'' (ulp) is half the distance between successive floats; this means that rounding errors increase as the value to be represented increases. The IEEE-754 standard specifies the rounding conventions for floating point arithmetic\cite{ieee754std2008}.
-
+Real values which cannot be represented exactly in a floating point representation must be rounded to the nearest floating point value. The results of a floating point operation will in general be such values and thus there is a rounding error possible in any floating point operation. Referring to Figure \ref{floats.pdf} it can be seen that the largest possible rounding error is half the distance between successive floats; this means that rounding errors increase as the value to be represented increases.
  
  Goldberg's assertively titled 1991 paper ``What Every Computer Scientist Needs to Know about Floating Point Arithmetic''\cite{goldberg1991whatevery} provides a comprehensive overview of issues in floating point arithmetic and relates these to requirements of the IEEE-754 1985 standard\cite{ieee754std1985}. More recently, after the release of the revised IEEE-754 standard in 2008\cite{ieee754std2008}, a textbook ``Handbook Of Floating Point Arithmetic'' has been published which provides a thourough review of literature relating to floating point arithmetic in both software and hardware\cite{HFP}.
  
-William Kahan, one of the architects of the IEEE-754 standard in 1984 and a contributor to its revision in 2010, has also published many articles on his website explaining the more obscure features of the IEEE-754 standard and calling out software which fails to conform to the standard\footnote{In addition to encodings and acceptable rounding errors, the standard also specifies ``exceptions'' --- mechanisms by which a program can detect an error such as division by zero --- which are sometimes neglected, as in the ECMA-256}\cite{kahanweb, kahan1996ieee754}, as well as examples of the limitations of floating point computations\cite{kahan2007wrong}. 
+William Kahan, one of the architects of the IEEE-754 standard in 1984 and a contributor to its revision in 2010, has also published many articles on his website explaining the more obscure features of the IEEE-754 standard and calling out software which fails to conform to the standard\footnote{In addition to encodings and acceptable rounding behaviour, the standard also specifies ``exceptions'' --- mechanisms by which a program can detect and report an error such as division by zero}\cite{kahanweb, kahan1996ieee754}, as well as examples of the limitations of floating point computations\cite{kahan2007wrong}. 
+
+In  Figure \ref{calculatepi.pdf} we show the effect of accumulated rounding errors on the computation of $\pi$ through a numerical integration\footnote{This is not intended to be an example of a good way to calculate $\pi$} using 32 bit ``single precision'' floats and 64 bit ``double precision'' floats.
+
+\begin{figure}[H]
+       \centering
+       \includegraphics[width=0.6\textwidth]{figures/calculatepi.pdf}
+       \caption{Numerical calculation of $\pi$}\label{calculatepi.pdf}
+\end{figure}
+
+\subsection{Floating Point Operations}
+
+Floating point operations can in principle be performed using integer operations, but specialised Floating Point Units (FPUs) are an almost universal component of modern processors\cite{kelley1997acmos}. The improvement of FPUs remains highly active in several areas including: efficiency\cite{seidel2001onthe}; accuracy of operations\cite{dieter2007lowcost}; and even the adaptation of algorithms originally used in software, such as Kahan's Fast2Sum algorithm\cite{kadric2013accurate}. 
+
+In 1971 Dekker formalised a series of algorithms including the Fast2Sum method for calculating the correction term due to accumulated rounding errors\cite{dekker1971afloating}. The exact result of $x + y$ may be expressed in terms of floating point operations with rounding as follows:
+\begin{align*}
+       z &= \text{RN}(x + y) w &= \text{RN}(z - x) \\
+       zz &= \text{RN}(y - w) \implies x + y &= zz
+\end{align*}
  
  \subsection{Arbitrary Precision Floating Point Numbers}
  
-Fouse described 
+Arbitrary precision floating point numbers are implemented in a variety of software libraries which will dynamically allocate extra bits for the exponent or mantissa as required. An example is the GNU MPFR library discussed by Fousse in 2007\cite{fouse2007mpfr}. Although many arbitrary precision libraries already existed, MPFR intends to be fully compliant with some of the more obscure IEEE-754 requirements such as rounding rules and exceptions. 
  
+As we have seen, it is trivial to find real numbers that would require an infinite number of bits to represent exactly. Implementations of ``arbitrary'' precision must carefully determine at what point rounding should occur so as to balance performance with memory usage.