Creating ebooks from book scans … on Linux
Refined semi-professional document scanning on Linux: here is a small collection of procedures and hints for producing e-books on your Linux system. I assume that your scanner already works with Sane (see below) and that you know how to get the suggested software packages from your repository, if available. The system used here is Debian GNU/Linux testing (“Squeeze”). I’ll give some availability information for other distributions, but not comprehensively. If in doubt, please check the relevant project homepages and/or the source code hosts for more information on the applications. All the software mentioned is open source, so if nobody has provided a package for your distribution you can always compile the code yourself, although it is certainly much more convenient to install software from the repository through your package manager, or at least to get it manually from the developer or from somewhere else. By the way, most of the packages are also available for non-Linux operating systems. An easy way to get Linux software running on your Windows PC is Cygwin.
The e-book to be produced is a single-sided, black & white (b/w) djvu or pdf file (“the containers”) that contains an OCR layer.
1. Scanning
Fortunately, no insider knowledge is needed for scanning on Linux. Sane is the application most commonly used for that purpose (libraries in Debian testing currently 1.0.20-13, frontends 1.0.14-9, quite up to date!) and it should be available on most everyday Linux systems. The frontend Xsane (0.996-3) is quite convenient for batch scanning. It lets you choose a scan area if the book is smaller than the scanner, the pages can be saved rotated by 90°, and you can auto-adjust gamma/brightness/contrast after getting the first picture or the preview (the big buttons, the 2nd from the left). Other scanning solutions are certainly possible too: the command line frontend scanimage, for example, can be run from a shell loop with custom intervals, which saves you from pushing a button after every page.
I am scanning at 300 DPI grayscale with the file extension .pgm [1] because my scanner backend currently doesn’t support b/w scanning (I know, I know …), but if your model allows it you could scan in b/w directly and skip the conversion step after scanning. For post-processing it is best to create a sequence 001.pgm, 002.pgm etc., and Xsane takes care of that. Usually there are two book pages in one picture; we will deal with that below.
On the DPI rate: when you check the 300 DPI outcome with your favourite image viewer you’ll see that the scans are much bigger than necessary for reading on screen. That is just right, because the images will appear scaled down in the containers, and the conversion to b/w gives better results than it does with scans of a lower DPI rate. Also, unlike with grayscale images, after conversion to b/w different DPI rates do not have such a significant effect on the overall file size of the final product, so there is no need to go below 300 to save resources.
[1] .pgm instead of the generic extension .pnm to distinguish it from .pbm after conversion to b/w (next step), and not .tif because the post-processing tool Unpaper (see below) cannot work with that.
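The scanimage loop mentioned in this section could be sketched roughly as follows. This is only an illustration under assumptions: the helper name scan_pages and its arguments are made up here, and --mode Gray and --resolution 300 are backend-specific options that your scanner's backend may name differently (scanimage --help lists what your device accepts).

```shell
# Sketch: scan COUNT sheets to 001.pgm, 002.pgm, ... with a pause between
# scans for turning the page. scan_pages is a hypothetical helper name.
# Assumption: the backend understands --mode Gray and --resolution 300.
scan_pages() {
    count=$1          # number of sheets to scan
    delay=${2:-10}    # seconds to wait before each following scan
    n=1
    while [ "$n" -le "$count" ]; do
        out=$(printf '%03d.pgm' "$n")
        # scanimage writes the image data to standard output
        scanimage --mode Gray --resolution 300 --format=pnm > "$out" || return 1
        echo "saved $out"
        n=$((n + 1))
        # no pause is needed after the last sheet
        if [ "$n" -le "$count" ]; then sleep "$delay"; fi
    done
}
```

Calling scan_pages 120 8 would then scan 120 sheets and leave 8 seconds for turning each page.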
2. Batch postprocessing (1): conversion and manipulation with Imagemagick
Imagemagick is a most versatile Swiss army knife for manipulating images on Linux systems. Like Sane, it should be widely available. Our scans can easily be converted to b/w using a simple shell loop:
for i in *pgm; do convert "$i" -verbose "${i%pgm}pbm"; done
It is also possible to adjust the threshold that decides whether a pixel becomes black or white, but normally the default works pretty well. The black stripe that often remains in the middle of the scan will be removed with Unpaper (next step). With Imagemagick's convert it is also possible to rotate the scans (-rotate 90) and to cut out a rectangular region (-crop WIDTHxHEIGHT+X+Y), and many other manipulations are possible; please check the command line options.
A hint for batch conversion: it is always a good idea not to overwrite the originals with the manipulated files but to write the new generation into another directory (like: … -verbose ~/foo/${i%pgm}pbm; done).
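The two hints of this section (an explicit threshold and a separate output directory) could be combined as in the following sketch. The helper name to_bw is hypothetical, and the 50% default is only a starting value chosen for illustration, not a recommendation from the workflow above; -threshold is the Imagemagick option for the black/white cut-off.

```shell
# Sketch: convert every .pgm in the current directory to b/w .pbm files
# in a separate directory, leaving the grayscale originals untouched.
# to_bw is a hypothetical helper name; 50% is only a starting value.
to_bw() {
    dest=$1               # target directory, e.g. ~/foo
    level=${2:-50%}       # pixels brighter than this become white
    mkdir -p "$dest" || return 1
    for i in *.pgm; do
        [ -e "$i" ] || continue    # no .pgm files: the pattern stays literal
        convert "$i" -threshold "$level" -verbose "$dest/${i%.pgm}.pbm"
    done
}
```

For example, to_bw ~/foo 60% would write 001.pbm, 002.pbm etc. into ~/foo with a slightly higher cut-off.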
3. Batch postprocessing (2): Unpaper
Unpaper, written by Jens Gulden (currently 0.3-1, also available for Ubuntu Karmic), is a tool for post-processing scanned book pages. It can remove dark areas, correct misaligned centring and rotation of book pages, remove blur and noise, and split double-page scans into individual images. It is made for heavy-duty tasks and copes even with scans of the most ridiculous book photocopies. Unpaper can perform batch processing jobs, and simple usage would look like:
unpaper --layout double --output-pages 2 %03d.pbm ~/foo/%03d.pbm
--layout double tells Unpaper that each scan image carries two book pages, and --output-pages 2 tells it to split them into two individual files. %03d is a printf-style placeholder for zero-padded three-digit numbers, which Unpaper expands itself (it is not a shell variable). Unpaper is quite versatile, and getting acquainted with everything takes some effort. While it does its job it is a good idea to monitor the output constantly. If unwanted results appear you can interrupt the process and change the settings. Unpaper is very sensitive: for example, in most cases where text blocks are accidentally removed on single pages you have to adjust the mask scan size (try a lower setting like -ms 25,25). Processing can be resumed from any file with --start-input x, but you then also have to align --start-output and pass --overwrite. Useful user documentation is provided on the project’s homepage.
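Aligning the two counters when resuming is plain arithmetic, sketched below. The helper name resume_opts is hypothetical, and the sketch assumes that with --output-pages 2 the input scan number N produces the output files 2N-1 and 2N, which is worth checking against your own output directory first.

```shell
# Sketch: print the options for resuming a two-pages-per-scan job at input
# scan N. Assumption: scan N yields the output files 2N-1 and 2N, so the
# output counter has to restart at 2N-1.
# resume_opts is a hypothetical helper name.
resume_opts() {
    n=$1
    echo "--start-input $n --start-output $((2 * n - 1)) --overwrite"
}

resume_opts 10    # prints: --start-input 10 --start-output 19 --overwrite
```

The printed options could then be added to the Unpaper call from above, for example: unpaper --layout double --output-pages 2 $(resume_opts 10) %03d.pbm ~/foo/%03d.pbm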
4. Creating djvu (& pdf)
If you didn’t know it already: djvu is a powerful container format for digital images which is faster and compresses better than other solutions, and viewers are available for nearly all operating systems (djview 4.5-3 in “Squeeze”). Even if djvu reveals its full potential especially in demanding tasks like full-resolution satellite pictures, in my experience working with it is always a little smoother, even with b/w book scans. Bundling our post-processed book pages into a djvu is no problem with the Djvulibre collection (3.5.22-7). First of all we have to convert the .pbm images into the djvu file format:
for i in *pbm; do cjb2 "$i" "${i%pbm}djvu"; echo "$i"; done
After that we bundle the single pages into the container:
djvm -c mydjvu.djvu *djvu
That’s it!
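Both steps can be wrapped into one function, as in the following sketch. The helper name make_djvu is hypothetical; passing -dpi to cjb2 and encoding into a temporary directory are additions made here, not part of the commands above.

```shell
# Sketch: encode all .pbm pages in the current directory and bundle them.
# make_djvu is a hypothetical helper name. The -dpi option of cjb2 records
# the scan resolution in the encoded pages.
make_djvu() {
    target=$1             # e.g. mydjvu.djvu
    dpi=${2:-300}
    tmp=$(mktemp -d) || return 1
    for i in *.pbm; do
        [ -e "$i" ] || continue
        cjb2 -dpi "$dpi" "$i" "$tmp/${i%.pbm}.djvu" || return 1
    done
    # the shell expands the glob in sorted order, which is the page order
    djvm -c "$target" "$tmp"/*.djvu
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Keeping the single pages in a temporary directory means that a second run cannot accidentally pick up an older mydjvu.djvu through the *djvu pattern.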
Creating a pdf on Linux is just as easy. First you have to convert the .pbm files into .tif files (for i in *pbm; do convert "$i" -verbose "${i%pbm}tif"; done), after that you create a multi-page tiff from these (tiffcp *tif bundle.tif), and finally you create a pdf from that with: tiff2pdf -o mypdf.pdf bundle.tif (Note: tiffcp and tiff2pdf are part of libtiff-tools, 3.9.2-2 in “Squeeze”. For tiff2pdf the compression method has to be given as well, -j (Jpeg) or -z (Zip); see the manpage).
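The three pdf steps could be chained like this. It is a sketch only: make_pdf is a hypothetical helper name, the intermediate files go into a temporary directory (an addition made here), and -z is picked as the compression switch simply because it is one of the two mentioned in the note.

```shell
# Sketch: build a PDF from the .pbm pages via a multi-page TIFF.
# make_pdf is a hypothetical helper name; -z selects Zip compression
# in tiff2pdf.
make_pdf() {
    target=$1             # e.g. mypdf.pdf
    tmp=$(mktemp -d) || return 1
    for i in *.pbm; do
        [ -e "$i" ] || continue
        convert "$i" "$tmp/${i%.pbm}.tif" || return 1
    done
    # the glob is expanded before bundle.tif exists, so it is not included
    tiffcp "$tmp"/*.tif "$tmp/bundle.tif" &&
        tiff2pdf -z -o "$target" "$tmp/bundle.tif"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

make_pdf mypdf.pdf, run in the directory holding the .pbm pages, would leave only the finished pdf behind.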
5. OCR
There are also solutions available on Linux for deriving OCR information from book scans for the text layers of djvu and pdf, and Tesseract seems to be the most mature application so far. Its development has been taken over by Google, and it is described as “probably one of the most accurate open source OCR engines available”. Tesseract-ocr is available in Debian testing in version 2.04-2, and there are a few language data files for the software which have to be installed as well (tesseract-ocr-eng etc.). Playing around with it, one gets the impression that Tesseract works quite nicely, especially when the correct language is chosen. It does have problems with Sanskrit diacritics, but I’ve seen that Tesseract can also be trained (I’ll report when I have found out more). It can be applied to individual image files, also through batch processing, but it is more convenient to work with a wrapper which also takes care of recombining the OCR output with the image automatically:
Ocrodjvu (0.3.2-1, Ubuntu Lucid) by Jakub Wilk is a foolproof wrapper for working on document scans that are already bundled as djvu. It depends on OCRopus (0.3.1-2), an open source OCR system under development at the German Research Center for Artificial Intelligence (DFKI). OCRopus employs Tesseract to extract the textual information from the scanned document and, and that’s the clou here, also saves the page position of every word, so that a query in Djview or other viewers yields not only the relevant pages but also the highlighted word instances on these pages (layout analysis), a feature one can hardly do without nowadays. Ocrodjvu is easy to apply to the djvu we’ve created so far:
ocrodjvu -o mydjvu_ocr.djvu mydjvu.djvu --language=eng
or similar. Start’n’forget: life is easy.
For pdf e-books it is a little more tricky because there isn’t a fully developed wrapper for OCRopus available for pdf so far (I couldn’t get the little tool pdf2ocr, which I found on the net, to work properly), so I will leave that out here for now.
6. Gscan2pdf
Gscan2pdf (0.9.29-1, Ubuntu Hardy) is a very comfortable GUI frontend for most of the tools we’ve discussed so far: Sane (scanning), Unpaper (postprocessing) and Tesseract (OCR). The whole process of producing an e-book, both djvu and pdf, can be carried out with this amazing tool. Gscan2pdf has interfaces to Tesseract and also to the alternative Gocr, but as far as I’ve seen it unfortunately has no interface to OCRopus and cannot deal with layout-analysed output (hocr), so this is a desideratum here.
7. Bookmarks
The final step to refine your e-book is to add bookmarks to the document. For djvu, custom bookmarks (in the djvu world this is called an “outline”) have to be in a form like:
(bookmarks
 ("Title" "#1")
 ("Main matter" "#5"
  ("Chapter 1" "#5")
  ("Chapter 2" "#15"))
)
After editing such a file, which you could name mydjvu.outline, djvused from the Djvulibre tools can apply the outline to the container:
djvused mydjvu.djvu -e 'set-outline mydjvu.outline' -s
That’s it. By the way, the djvu outline format is Unicode capable.
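For a long table of contents the outline file can also be generated instead of typed. The following sketch turns simple "PAGE TITLE" lines into the format shown above; make_outline is a hypothetical helper name, and the sketch only produces a flat list (no nested chapters) and does not escape double quotes inside titles.

```shell
# Sketch: turn lines of the form "PAGE TITLE" (read from stdin) into a
# flat DjVu outline. make_outline is a hypothetical helper name; nested
# chapters and titles containing double quotes are not handled.
make_outline() {
    echo "(bookmarks"
    while read -r page title; do
        [ -n "$page" ] || continue         # skip empty lines
        printf ' ("%s" "#%s")\n' "$title" "$page"
    done
    echo ")"
}

# Example input; the output goes to the terminal
make_outline <<'EOF'
1 Title
5 Chapter 1
15 Chapter 2
EOF
```

Redirecting the output into mydjvu.outline gives a file that can be applied with the set-outline command from above.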
8. Miscellaneous stuff
If the book you want to scan is bigger than the scanner you can afford, the way to go is to scan single pages one at a time. If the lid of the scanner cannot be fully removed, or for whatever other reason, you may end up with a set of scans in which the odd-numbered pages are rotated 180° in relation to the even-numbered ones, or the other way around. There is also a way to rotate either set; try:
for i in [0-9][0-9][0-9].pbm; do p=$(echo "$i" | cut -c 3); if [ $((p % 2)) -eq 1 ]; then convert "$i" -rotate 180 -verbose "$i"; fi; done
(one line!). This rotates the set of odd-numbered .pbm files with three-digit names. To work on the even-numbered set, replace -eq 1 with -eq 0. But if you try it you’ll see that scanning this way takes painful Prussian discipline.
Unpaper employs batch processing with rising numbers. If you want to re-engineer your already created djvu containers like that, you can unpack them with ddjvu, which puts out a multi-page tif (usage like: ddjvu -format=tiff -pages=1-25 ~/foo/mydjvu.djvu bundle.tif). That in turn can be split with tiffsplit, which produces a set of images aaa.tif, aab.tif etc. These can then be converted to pbm (re-encoding them into djvu afterwards is, as far as I’ve seen, the only way to obtain a custom page range djvu from a djvu). In any case, a proper sequence can always be restored with this little shell script here.
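Restoring a numbered sequence could look roughly like the following sketch. It is not the script referred to above but a stand-in written for illustration; renumber is a hypothetical helper name, and it simply relies on the shell listing aaa.tif, aab.tif etc. in sorted order.

```shell
# Sketch: rename files with a given extension to 001.EXT, 002.EXT, ... in
# the sorted order of their current names (aaa.tif, aab.tif, ...).
# renumber is a hypothetical helper name.
# Caveat: mv overwrites silently, so try it on a copy of the files first.
renumber() {
    ext=$1
    n=1
    for f in *."$ext"; do
        [ -e "$f" ] || continue
        new=$(printf '%03d.%s' "$n" "$ext")
        # skip files that already carry the right name
        [ "$f" = "$new" ] || mv -- "$f" "$new"
        n=$((n + 1))
    done
}
```

Running renumber tif in the directory with the split images would rename them to 001.tif, 002.tif and so on, ready for conversion and for Unpaper's %03d pattern.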


Thanks for the post. gscan2pdf is indeed very nice; I used to use it a lot. Have you tried Scan Tailor (http://scantailor.sourceforge.net/ )? I have found this useful too.
Latest Gscan2pdf 0.9.30 employs a port to OCRopus, cf. http://lwn.net/Articles/372405/
Excellent post, Daniel. The Ocrodjvu is really a good finding, as the last time I’ve checked, we could just add OCR to each line, not to each word. Now, being able to search all my .djvu files will be neat indeed!
Thanks Edgard. Indeed everything here is under heavy development (actually Debian [Testing] is the best distribution for running all the newest stuff). I’ve been in contact with Jakub Wilk and he told me grayscale processing will also be available soon (with 0.4.2, option -render). BTW for proper single word highlighting hocr2djvused needs data which has been produced with ocroscript --charboxes!
Update:
(1) I really hadn’t thought about kicking off such a thing with this, but the American Linux Magazine has bought the article from the German Linux User which emerged from this contribution (issue 06/2010), and it is going to appear in no. 117 in August.
(2) In the meantime Abbyy has published a Linux command line version of their popular FineReader software (see here), which seems to work really well (see the review in the recent Linux-Magazin, 07/2010, 68-71). The reader outputs hOCR as well as finished PDFs.
(3) There is a comparison of the free OCR engines in the Linux-Magazin 12/2006 (see here, might be a little bit outdated).
(4) Jakub Wilk has also implemented a little GUI editor for DjVu metadata, Djvusmooth, which as far as I can see is so far available only for Debian Testing (that is the distribution of choice anyway if you want to run the latest stuff), see here.