X-Git-Url: https://git.ucc.asn.au/?p=ipdf%2Fsam.git;a=blobdiff_plain;f=chapters%2FBackground%2FFloats%2FDefinition.tex;h=33f1a15a80006f4404e8c425aadb857655a304b8;hp=62d2c71082f5e8af56f742c0de22d8cdc5dd6663;hb=689e433b348588e05f47d23385dbf05d239c95d2;hpb=67c33e185edd820e2ea6fef334455645bacd432b diff --git a/chapters/Background/Floats/Definition.tex b/chapters/Background/Floats/Definition.tex index 62d2c71..33f1a15 100644 --- a/chapters/Background/Floats/Definition.tex +++ b/chapters/Background/Floats/Definition.tex @@ -1,11 +1,14 @@ Whilst a Fixed Point representation keeps the ``point'' (the location considered to be $i = 0$ in \eqref{fixedpointZ}) at the same position in a string of bits, Floating point representations can be thought of as scientific notation; an ``exponent'' and fixed point value are encoded, with multiplication by the exponent moving the position of the point. + +{\bf FIXME: Cite properly} The use of floating point arithmetic in computer systems was pioneered by Knuth\cite{}, Goldberg\cite{goldbern1967twentyseven}, Dekker\cite{}, and others, but modern systems are largely compatable with the IEEE-754 standard pioneered by William Kahan in 1985 \cite{ieee754std1985} and revised (also with contributions from Kahan) in 2008\cite{ieee754std2008}. + A floating point number $x$ is commonly represented by a tuple of values $(s, e, m)$ in base $B$ as\cite{HFP, ieee2008-754}: $x = (-1)^{s} \times m \times B^{e}$ Where $s$ is the sign and may be zero or one, $m$ is commonly called the ``mantissa'' and $e$ is the exponent. Whilst $e$ is an integer in some range $\pm e_max$, the mantissa $m$ is a fixed point value in the range $0 < m < B$. -The choice of base $B = 2$ in the original IEEE-754 standard matches the nature of modern hardware. It has also been found that this base in general gives the smallest rounding errors\cite{HFP}. %Early computers had in fact used a variety of representations including $B=3$ or even $B=7$\cite{goldman1991whatevery}, and the revised IEEE-754 standard specifies a decimal representation $B = 10$ intended for use in financial applications\cite{ieee754std2008}\footnote{Eg: The smallest valid unit of currency \$0.01 could not be represented exactly in base 2}. From now on we will restrict ourselves to considering base 2 floats. +The choice of base $B = 2$ in the original IEEE-754 standard matches the nature of modern hardware. It has also been found that this base in general gives the smallest rounding errors\cite{HFP}. Early computers had in fact used a variety of representations including $B=3$ or even $B=7$\cite{goldman1991whatevery}, and the revised IEEE-754 standard specifies a decimal representation $B = 10$ intended for use in financial applications\cite{ieee754std2008}\footnote{Eg: The smallest valid unit of currency \$0.01 could not be represented exactly in base 2}. From now on we will restrict ourselves to considering base 2 floats. The IEEE-754 encoding of $s$, $e$ and $m$ requires a fixed number of continuous bits dedicated to each value. Originally two encodings were defined: binary32 and binary64. $s$ is always encoded in a single leading bit, whilst (8,23) and (11,53) bits are used for the (exponent, mantissa) encodings respectively.