From: David Gow Date: Wed, 21 May 2014 09:25:36 +0000 (+0800) Subject: Amazing paper + Document Taxonomy X-Git-Url: https://git.ucc.asn.au/?a=commitdiff_plain;h=a0c06f6d51f7d1b827a11dfedf0f6eefc6f5a2e9;p=ipdf%2Fdocuments.git Amazing paper + Document Taxonomy --- diff --git a/LitReviewDavid.pdf b/LitReviewDavid.pdf index 85a6698..d819fc3 100644 Binary files a/LitReviewDavid.pdf and b/LitReviewDavid.pdf differ diff --git a/LitReviewDavid.tex b/LitReviewDavid.tex index 31f4968..5605cef 100644 --- a/LitReviewDavid.tex +++ b/LitReviewDavid.tex @@ -78,27 +78,66 @@ which already have their elements placed. This document was prepared in \LaTeXe. - \item[\texttt{DVI}] - \TeX \, traditionally outputs to the \texttt{DVI} format: a binary format which consists of a + \item[DVI] + \TeX \, traditionally outputs to the \texttt{DVI} (``DeVice Independent'') format: a binary format which consists of a simple stack machine with instructions for drawing glyphs and curves\cite{fuchs1982theformat}. A \texttt{DVI} file is a representation of a document which has been typeset, and \texttt{DVI} viewers will rasterize this for display or printing, or convert it to another similar format like PostScript to be rasterized. - \item[\texttt{HTML}] + \item[HTML] + The Hypertext Markup Language (HTML)\cite{html2rfc} is the widely used document format which underpins the + world wide web. In order for web pages to adapt appropriately to different devices, the HTML format simply + defined semantic parts of a document, such as headings, phrases requiring emphasis, references to images or links + to other pages, leaving the \emph{layout} up to the browser, which would also rasterize the final document. + The HTML format has changed significantly since its introduction, and most of the layout and styling is now controlled + by a set of style sheets in the CSS\cite{css2spec} format. + \item[PostScript] + Much like DVI, PostScript\cite{plrm} is a stack-based format for drawing vector graphics, though unlike DVI (but like \TeX), PostScript is + text-based and turing complete. PostScript was traditionally run on a control board in laser printers, rasterizing pages at high resolution + to be printed, though PostScript interpreters for desktop systems also exist, and are often used with printers which do not support PostScript natively.\cite{ghostscript} + + PostScript programs typically embody documents which have been typeset, though as a turing-complete language, some layout can be performed by the document. + + \item[PDF] + Adobe's Portable Document Format (PDF)\cite{pdfref17} takes the PostScript rendering model, but does not implement a turing-complete language. + Later versions of PDF also extend the PostScript rendering model to support translucent regions via Porter-Duff compositing\cite{porter1984compositing}. + + PDF documents represent a particular layout, and must be rasterized before display. \end{description} +\subsection{Precision in Document Formats} - -Existing document formats, due to being designed to model paper, -have limited precision (8 decimal digits for PostScript\cite{plrm}, 5 decimal digits for PDF\cite{pdfref17}). -This matches the limited resolution of printers and ink, but is limited when compared to what aught to be possible -with ``zoom'' functionality, which is prevent from working beyond a limited scale factor, lest artefacts appear due +Existing document formats --- typically due to having been designed for documents printed on paper, which of course has +limited size and resolution --- use numeric types which can only represent a fixed range and precision. +While this works fine with printed pages, users reading documents on computer screens using programs +with ``zoom'' functionality are prevented from working beyond a limited scale factor, lest artefacts appear due to issues with numeric precision. +\TeX uses a $14.16$ bit fixed point type (implemented as a 32-bit integer type, with one sign bit and one bit used to detect overflow)\cite{beebe2007extending}. +This can represent values in the range $[-(2^14), 2^14 - 1]$ with 16 binary digits of fractional precision. + +The DVI files \TeX \, produces may use ``up to'' 32-bit signed integers\cite{fuchs1982theformat} to specify the document, but there is no requirement that +implementations support the full 32-bit type. It would be permissible, for example, to have a DVI viewer support only 24-bit signed integers, though many +files which require greater range may fail to render correctly. + +PostScript\cite{plrm} supports two different numeric types: \emph{integers} and \emph{reals}, both of which are specified as strings. The interpreter's representation of numbers +is not exposed, though the representation of integers can be divined by a program by the use of bitwise operations. The PostScript specification lists some ``typical limits'' +of numeric types, though the exact limits may differ from implementation to implementation. Integers typically must fall in the range $[-2^{31}, 2^{31} - 1]$, +and reals are listed to have largest and smallest values of $\pm10^{38}$, values closest to $0$ of $\pm10^{-38}$ and approximately $8$ decimal digits of precision, +derived from the IEEE 754 single-precision floating-point specification. + +Similarly, the PDF specification\cite{pdfref17} stores \emph{integers} and \emph{reals} as strings, though in a more restricted format than PostScript. +The PDF specification gives limits for the internal representation of values. Integer limits have not changed from the PostScript specification, but numbers +representable with the \emph{real} type have been specified differently: the largest representable values are $\pm 3.403\times 10^{38}$, the smallest non-zero representable values are +\footnote{The PDF specification mistakenly leaves out the negative in the exponent here.} +$\pm1.175 \times 10^{-38}$ with approximately $5$ decimal digits of precision \emph{in the fractional part}. +Adobe's implementation of PDF uses both IEEE 754 single precision floating-point numbers and (for some calculations, and in previous versions) 16.16 bit fixed-point values. + + \section{Rendering} Computer graphics comes in two forms: bit-mapped (or raster) graphics, which is defined by an array of pixel colours; diff --git a/papers.bib b/papers.bib index d7a96a7..7ffb0c2 100644 --- a/papers.bib +++ b/papers.bib @@ -33,6 +33,29 @@ howpublished={\url{http://www.tug.org/TUGboat/Articles/tb03-2/tb06software.pdf}} } +% HTML 2 spec +@article{html2rfc, + title={Hypertext Markup Language -- 2.0}, + author={Berners-Lee, Tim and Connolly, Daniel}, + year={1995}, + journal={Internet RFC 1866} +} + +% CSS 2 spec +@misc{css2spec, + title={Cascading Style Sheets, Level 2, {CSS2} Specification}, + author={Bos, Bert and Wium Lie, HÃ¥kon and Lilley, Chris and Jacobs Ian}, + date={1998}, + howpublished={\url{http://www.w3.org/TR/1998/REC-CSS2-19980512/}} +} + +@misc{ghostscript, + title={GhostScript, an interpreter for the PostScript language and PDF}, + author={Artifex Software}, + year={1988}, + howpublished={\url{http://www.ghostscript.com/}} +} + %%%%%%%%%%%%%%%%%%%%%%%% % Basic Rendering Theory %%%%%%%%%%%%%%%%%%%%%%%% @@ -929,3 +952,13 @@ ISSN={0272-1716},} } + +% Holy mackerel, a paper on precision in document formats! +@article{beebe2007extending, + author={Beebe, Nelson}, + title={Extending {\TeX} and {METAFONT} With Floating-Point Arithmetic}, + year={2007}, + journal={{TUGboat}}, + volume={28}, + number={3}, +} \ No newline at end of file diff --git a/references/beebe2007extending.pdf b/references/beebe2007extending.pdf new file mode 100644 index 0000000..833b49c Binary files /dev/null and b/references/beebe2007extending.pdf differ