X-Git-Url: https://git.ucc.asn.au/?p=dja%2Fscandal.git;a=blobdiff_plain;f=architecture.txt;h=5646a244aa17999a8db751e1b0adaa6ff794f8cb;hp=95251d3389a6ecec61697ee1ee02e2093ae33e06;hb=88d39ee082b1599229efe259f04579e34a13a395;hpb=aeb663e7da50cdad5b60dcb5d7a1f78fc83f36b4

diff --git a/architecture.txt b/architecture.txt
index 95251d3..5646a24 100644
--- a/architecture.txt
+++ b/architecture.txt
@@ -12,17 +12,19 @@ each physical page may contain either 2 or 1 logical pages
 3. determine dpi
 4. foreach double-page-spread (scan page)
  4.1. extract scan page from pdf, save as png
- 4.2. run a mask over it to pull off large black areas
- 4.3. run unpaper over it, creating 2 pages (physical page)
- 4.4. foreach physical page
-  4.4.1. remask and retrim
-  4.4.2. attempt to detect if a physical page contains 2 logical pages,
-   4.4.2.1. if so split with unpaper
-  4.4.3. do any final processing (resize for bebook)
-5. move all the final pictures into a final picture directory
 
-In the accidentally deleted code we used ocropus's binarise stuff to do some
-extra cleaning.
+5. run ocropus's binarise over all the pngs
+
+6. foreach binarised scan page
+ 6.1. create a mask from the original (unbinarised) page
+ 6.2. use the mask to trim the binarised page (cutting this off improves unpaper's accuracy)
+ 6.3. run unpaper over the clean binarised page, creating 2 pages (physical page)
+ 6.4. foreach physical page
+  6.4.1. remask and retrim
+  6.4.2. attempt to detect if a physical page contains 2 logical pages,
+   6.4.2.1. if so split with unpaper
+  6.4.3. do any final processing (resize for bebook)
+7. move all the final pictures into a final picture directory
 
 = What options do we need? =
 Anything we attempt to detect automatically should have the option to set manually
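
Steps 6.1 and 6.2 of the new pipeline (create a mask from the unbinarised page, then use it to trim the binarised page) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the project's code: architecture.txt does not say how the mask is computed, so the "mostly dark row or column" rule, the names border_bounds and trim, and both threshold defaults are hypothetical. Pages are modelled as plain lists of pixel rows so the sketch needs no imaging library.

```python
def border_bounds(gray, dark_threshold=64, dark_fraction=0.9):
    """Find the part of a greyscale page that is not a black scanner border.

    gray is a list of rows, each a list of 0-255 intensities (0 = black).
    Returns (top, bottom, left, right) as half-open slice bounds.
    Both thresholds are assumed values, not taken from the source.
    """
    height, width = len(gray), len(gray[0])

    def mostly_dark(pixels):
        # A row or column counts as border if nearly all of it is dark.
        dark = sum(1 for p in pixels if p < dark_threshold)
        return dark >= dark_fraction * len(pixels)

    rows_ok = [y for y in range(height) if not mostly_dark(gray[y])]
    cols_ok = [x for x in range(width)
               if not mostly_dark([gray[y][x] for y in range(height)])]

    if not rows_ok or not cols_ok:
        # Nothing usable was found, so leave the page untrimmed.
        return (0, height, 0, width)
    return (rows_ok[0], rows_ok[-1] + 1, cols_ok[0], cols_ok[-1] + 1)


def trim(page, bounds):
    """Crop any page of the same size (e.g. the binarised one) to bounds."""
    top, bottom, left, right = bounds
    return [row[left:right] for row in page[top:bottom]]


if __name__ == "__main__":
    # A 6x8 greyscale scan: one-pixel black border around a light interior.
    original = [[0] * 8] + [[0] + [200] * 6 + [0] for _ in range(4)] + [[0] * 8]
    # The binarised version of the same scan (1 = black, 0 = white).
    binarised = [[1] * 8] + [[1] + [0] * 6 + [1] for _ in range(4)] + [[1] * 8]

    bounds = border_bounds(original)
    clean = trim(binarised, bounds)
    print(bounds, len(clean), len(clean[0]))
```

The bounds are measured on the original greyscale page, where the border is still distinguishable from dark text, and only then applied to the binarised page, which matches the order of steps 6.1 and 6.2. The trimmed page is what would then be handed to unpaper in step 6.3.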