X-Git-Url: https://git.ucc.asn.au/?p=ipdf%2Fsam.git;a=blobdiff_plain;f=chapters%2FBackground%2FFloats%2FDefinition.tex;fp=chapters%2FBackground%2FFloats%2FDefinition.tex;h=62d2c71082f5e8af56f742c0de22d8cdc5dd6663;hp=0000000000000000000000000000000000000000;hb=9fcf44a0c34f393689118e913a2d17d907036c85;hpb=d5e7e14d2ec624cfe0febcccd81e95082ef1c175 diff --git a/chapters/Background/Floats/Definition.tex b/chapters/Background/Floats/Definition.tex new file mode 100644 index 0000000..62d2c71 --- /dev/null +++ b/chapters/Background/Floats/Definition.tex @@ -0,0 +1,13 @@ +Whilst a Fixed Point representation keeps the ``point'' (the location considered to be $i = 0$ in \eqref{fixedpointZ}) at the same position in a string of bits, Floating point representations can be thought of as scientific notation; an ``exponent'' and fixed point value are encoded, with multiplication by the exponent moving the position of the point. + +A floating point number $x$ is commonly represented by a tuple of values $(s, e, m)$ in base $B$ as\cite{HFP, ieee2008-754}: $x = (-1)^{s} \times m \times B^{e}$ + +Where $s$ is the sign and may be zero or one, $m$ is commonly called the ``mantissa'' and $e$ is the exponent. Whilst $e$ is an integer in some range $\pm e_max$, the mantissa $m$ is a fixed point value in the range $0 < m < B$. + + +The choice of base $B = 2$ in the original IEEE-754 standard matches the nature of modern hardware. It has also been found that this base in general gives the smallest rounding errors\cite{HFP}. %Early computers had in fact used a variety of representations including $B=3$ or even $B=7$\cite{goldman1991whatevery}, and the revised IEEE-754 standard specifies a decimal representation $B = 10$ intended for use in financial applications\cite{ieee754std2008}\footnote{Eg: The smallest valid unit of currency \$0.01 could not be represented exactly in base 2}. From now on we will restrict ourselves to considering base 2 floats. + +The IEEE-754 encoding of $s$, $e$ and $m$ requires a fixed number of continuous bits dedicated to each value. Originally two encodings were defined: binary32 and binary64. $s$ is always encoded in a single leading bit, whilst (8,23) and (11,53) bits are used for the (exponent, mantissa) encodings respectively. + +The encoding of $m$ in the IEEE-754 standard is not exactly equivelant to a fixed point value. By assuming an implicit leading bit (ie: restricting $1 \leq m < 2$) except for when $e = 0$, floating point values are gauranteed to have a unique representations; these representations are said to be ``normalised''. When $e = 0$ the leading bit is not implied; these representations are called ``denormals'' because multiple representations may map to the same real value. The idea of using an implicit bit appears to have been considered by Goldberg as early as 1967\cite{goldbern1967twentyseven}. +