More lightweight PDF production: Docutils, Pandoc, Lout
Filed under: E-documenting
After the previous posting on Troff and Groff I want to continue by pointing out some more of the console based, end-user oriented solutions for producing PDF documents from plain text files that exist next to the TeX family (admittedly, some of them generate TeX source to produce their PDFs). Whether the non-TeX systems are real contenders depends heavily on the kind of input system they use. The dinosaur Groff certainly is fascinating because of its antiquity. Furthermore it’s available everywhere, it is pretty versatile given its small size (it even includes a bibliographical subsystem), and the output is very tasteful. Unfortunately the input method is somewhat strange, so Groff is not commonly used for everyday work and it’s more or less pointless to expect a Troff renaissance. Input methods differ in their usability, of course, and these days several markup languages compete. The wiki markup known from Wikipedia is in my experience one example of a pretty pleasant markup convention.
So let’s see what other systems could be used on the console to produce papers, handouts and other pieces as PDFs (and alternatively in other publishing formats) from plain text files. As always this is written with a view to Linux, but it works more or less the same way on Mac OS X with its Unix core, and mostly somehow also on Windows:
Docutils / reStructuredText
Next to Epydoc and other systems, Docutils is able to generate the documentation for Python modules right from their source code (“in-line documentation” with docstrings; Docutils itself is written in Python), but it can also be used as a stand-alone lightweight text processing system [2]. Docutils employs its own markup system, reStructuredText (ReST) [3], which can be converted into several of the heavyweight publishing formats or their sources with the console tools shipped with it: XML/DocBook, the Open Document Format for OpenOffice.org’s Writer, HTML, and LaTeX (the programs are rst2xml, rst2odt, rst2html and rst2latex). DocBook and LaTeX in turn are formats from which PDFs can easily be produced on the console (DocBook produces very classy documents, too). There is also the tool rst2pdf, which generates PDF directly from the ReST source, even without a LaTeX installation. ReST is really intuitive and most of the time it relies on plain text conventions the user would have used “naturally” for structuring anyway: a tab indent is for quotes, “1) 2) 3)” etc. are for numbered lists and so on. It’s pretty simple: “*x*” marks italics, “http://” hyperlinks are recognized automatically, etc. [5]. Unfortunately it seems that nothing is really Unicode capable here, so there is room for improvement. Anyway, the ReST family is worth keeping an eye on.
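To give an impression of the markup, here is a minimal sketch that writes a small ReST file and converts it if the Docutils tools are installed. The file name sample.rst and its content are made up for this illustration, and on some systems the tools carry a .py suffix (rst2html.py), so take the command names as an assumption about your installation.

```shell
# Write a small reStructuredText sample (file name and content are arbitrary).
cat > sample.rst <<'EOF'
A small handout
===============

Plain paragraphs need no markup; *this* is set in italics.

    An indented paragraph becomes a block quote.

1) first item
2) second item

See http://docutils.sourceforge.net/ for details.
EOF

# Convert only if the Docutils front ends are installed under these names.
if command -v rst2html >/dev/null 2>&1; then
    rst2html sample.rst sample.html
fi
if command -v rst2latex >/dev/null 2>&1; then
    rst2latex sample.rst sample.tex
fi
```

The resulting sample.tex could then be run through a LaTeX engine to get a PDF.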
Pandoc / Markdown
Pandoc is a very useful converter between the most popular publishing markup formats [6]. As input it supports, among others, HTML, LaTeX and again ReST; the wide range of output formats includes HTML, LaTeX, ConTeXt [!], DocBook, Open Document Format, Rich Text Format, ReST, and even MediaWiki markup. Pandoc is also able to convert to S5, a slide show system for presentations (see here). Another pretty interesting feature is that Pandoc processes the Markdown markup language, which is also very lean and effective [7]. As with Docutils, a PDF document can be generated manually from the DocBook or LaTeX sources which Pandoc puts out (note that by default the resulting files contain only the body text; Pandoc does not add document headers), but the package also includes the wrapper markdown2pdf for one-step processing of Markdown sources (it makes use of pdflatex, so a LaTeX distribution has to be available). A big advantage is that Pandoc is already Unicode capable, and the software is open for writing your own conversion extensions. It is also possible to run Pandoc as an embedded filter for using Markdown with ConTeXt (see here) [8].
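A minimal sketch of the manual route might look like the following. The file name sample.txt and its content are invented for the example; the -s (standalone) option asks Pandoc for a complete document including the header, so that the LaTeX output can be compiled as it is.

```shell
# Write a small Markdown sample (file name and content are arbitrary).
cat > sample.txt <<'EOF'
# A small handout

Markdown uses *asterisks* for emphasis.

1. first item
2. second item

> A quoted paragraph.
EOF

# Convert to a complete LaTeX document if pandoc is installed.
if command -v pandoc >/dev/null 2>&1; then
    # -s adds the document header which a bare conversion leaves out
    pandoc -s -f markdown -t latex -o sample.tex sample.txt
fi
```

After that, running pdflatex on sample.tex should produce the PDF, provided a LaTeX distribution is available.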
Lout
Well, for Indologists poor attestation is no indication that something is uninteresting or irrelevant, on the contrary. But if you ever thought that ConTeXt is a minority issue and that Groff is probably the extreme case in terms of how little can be found about it on the net, you really haven’t dealt with Lout. Even the Wikipedia page (with its few meagre counterparts in German, French, and Vietnamese) is marked as probably not meeting the general notability guideline [!]. Googling the term produces a similarly measly result. But Lout really doesn’t deserve to dwell in the darkness of general neglect; it is a real alternative that sits somewhere between Groff and TeX. Lout is pretty versatile given its size: this surprisingly comprehensive, everything-included document formatting system, which consists mainly of a single binary of 655 kB, is the work of Jeffrey H. Kingston of the School of Information Technologies at the University of Sydney [9].
Like LaTeX, Lout uses different document types. The basic ones are doc, report, book, slides and picture, but there are others, for example for pretty printing source code of different programming languages (for all of them see /usr/share/lout/include). If the end user wants to modify basic settings like indentation or inter-paragraph space, that has to be done by modifying the document type setup files or copies of them (see 4.1 sq. in the User guide); this method has its advantages. Like Groff, Lout is a closed system which comes with its own fonts, hyphenation patterns etc. Lout is also equipped with its own bibliographical database format, and the appearance of citations and of the reference list can be customized (see chapter 5 in the User guide). There is also a glossary and an indexing system as well as diagram, graph, and pie graph functions (see chapters 9-11), so all in all this really does not look trivial. Lout is Postscript software: the resulting file (lout mydocument.lout > mydocument.ps) can be converted to PDF, as always, with ps2pdf (which belongs to Ghostscript). There are a lot of symbols and diacritics available (see 1.4: characters), but unfortunately I couldn’t find a combining overbar, under-/overdot, or s-acute. As with Groff I think there is hardly a chance to get Unicode fonts running, but maybe it’s worth the effort to add another proper Postscript font to the system. Anyway, I’ve created a little showcase, the source is here, the resulting PDF is here. I think I am really going to play around with this a little bit longer!
Notes
[1] On the issue of markup (which nowadays is a subfield of Digital humanities and of course also covers HTML and particularly the omnipresent XML family): J.H. Coombs, A.H. Renear, S.J. DeRose: Markup systems and the future of scholarly text processing [Communications of the ACM 30,11 (1987), 933-47]; A. Witt, D. Metzing (Eds.): Linguistic modeling of information and markup languages – contributions to language technology. Dordrecht (etc.): Springer 2010.
[2] python-docutils on Debian (0.5.2 on Lenny, 0.6.4 on Squeeze; 0.6.3 on Ubuntu Lucid).
[3] On Docutils and ReST from the viewpoint of XML see David Mertz’s XML matters: reStructuredText (the stuff on IBM developerWorks is generally quite informative). A special GUI frontend, DocFactory, is also under development.
[5] See the primer and the specification. There is another introduction to ReST here and here. There is also a Vim module for highlighting ReST and an Emacs extension which does that and some other stuff. By the way, the extension .rst hasn’t been established as an official MIME type yet, but it is not reserved for another application either (see here).
[6] Recent release 1.5.1.1. For some reason Pandoc is currently not available as a package from the Debian repository, not even for Squeeze/Testing (see here), but since it is written in the functional programming language Haskell it can easily be fetched through Cabal, the Apt-like retrieval system for Haskell sources: just install cabal-install (it will also pull in some crap like the Haskell compiler ghc6) and then run cabal update and cabal install pandoc (it takes a while to fetch and compile everything). After that the executable pandoc is available in ~/.cabal/bin (which of course could be copied or softlinked to ~/bin, but it is more elegant to add: if [ -d ~/.cabal/bin ] ; then PATH=~/.cabal/bin:"${PATH}"; fi to ~/.bash_profile). Easy lover! Pandoc is also available as a library for Haskell.
[7] Markdown (home) was originally developed for producing HTML. On recent Debian Testing, for example, there are several Markdown applications in the repository: the original HTML generator markdown, libraries for Perl and Python, and a special Emacs mode which is included in emacs-goodies-el. .mkd or .pdc would be proper extensions for Markdown files, but Pandoc itself simply uses .txt when converting into Markdown; the whole point of the lightweight markup languages indeed is that their files can more or less also serve the reader as plain text files.
[8] The Lua library Lunamark for conversion between markup formats also could be employed by the LuaTeX engine.
[9] The software is hosted at Sourceforge. There is development here; version 3.36 is available for Lenny, 3.38 for Squeeze (see here). There is the Lout-users mailing list. Kingston described The design and implementation of the Lout document formatting language [Software - Practice and Experience 23,9 (1993), 1001-41 (offprint)]. The software comes with a user guide (pdf here) and an expert’s guide (pdf here, package lout-doc on Debian). There are even high-class books created with Lout, see Mark Summerfield’s page. Some business task examples are collected here.
Lightweight Pdf production with Groff
Open source Pdf production is unproblematic and there are several options to get what you want. An easy, quick and lean way to produce a paper in Pdf format is the Troff family. Troff (text runoff, spoken: “T-roff”) is a command line based text formatting system which was created in the 70s at AT&T for Unix systems (there were predecessors). In its Groff incarnation it produces Postscript documents unless told to put out Html or something else. Nroff (new roff) is the sibling for terminal and line printer output; the two differ only in details. In 1990 James Clark wrote the successor Groff and donated it to the free GNU collection. Groff (project page, manual) nowadays belongs to most Linux and Unix operating systems (… Mac OS X) by default because it’s needed to format the system’s manpages (Groff for Win here). But it can also be used to produce Pdf documents on the command line in a very simple way.
To demonstrate how effective Troff or rather Groff is: use your favourite text editor to type something like “Hello world!” into a text file named test.troff or whatever you like. After that run: groff test.troff > test.ps, and there you’ve got a Postscript document (it is a “naked” page without margins, but anyhow). This .ps file can then be converted to Pdf with ps2pdf, but there is also a wrapper belonging to Groff, pdfroff, which can be used to skip this step. That’s basically it!
There are several macro packages which make life easier for the common user because they are less complicated than dealing with Troff on a fundamental level. The man macros are for producing manpages, very popular are the ms and mm macros, there are the mom macros made for book publishing, and the me macros which were developed at Berkeley in the context of the Ingres project for writing its research papers. All macro packages are shipped together with Groff and are selected with an option when running the processor on the source file (cf. the Groff documentation). Every macro package has its own set of commands.
A rich resource for using Troff, but also for text processing on the command line in general, is the classic Unix Text Processing by Dougherty and O’Reilly (also available in its re-produced Troff source). This book also has chapters on the ms and mm macros. Robbins’ Unix in a nutshell contains several chapters on the Troff family up to the 3rd edition (see here). Also nice is M. Laha’s Introduction to Groff. Pretty useful for troffing in general is also the Groff manual. From the original programmers there are Kernighan’s A Troff tutorial and the comprehensive Troff User’s Manual by Ossanna/Kernighan (more books mentioned here). The Groff people are running a quite active mailing list, and I’ve found a little tutorial. There is also the Groff wiki (with interesting links to projects like the presentation module Gpresent).
Troffing with the me macros
As in the TeX family, in Troff a line break in the source appears only as whitespace in the output document. This gives you the freedom to arrange the text the way you like on the typing level. But in addition to that every command here must be on a line of its own, which forces the source into a shape very different from the output text. That looks awful at first sight, but on the other hand it makes the source easier to correct while proofreading the output, because searching for the corresponding passages works a little bit better. To demonstrate what it means: a sentence like “no it’s not this, it’s that and only that” in Troff/me goes like:
no it's not this, it's
.i that
and only that
That’s italics with the me macros. Now you can compile a text like this with: pdfroff -me test.me > test.pdf. The Troff standard body font is Times Roman, which is common taste, and the default settings of the me macro package, like the page dimensions, are quite all right for handouts or papers; one more reason to choose me as a starting point for Troff, I would say. In Troff there are also environments, like the one for quotes in me: .(q … .)q. Here, too, both commands demand a line of their own, like:
Schon mein Großvater hat immer gesagt:
.(q
Der frühe Vogel fängt den Wurm!
.)q
(please note that the source file generally has to be latin1 encoded to treat characters like umlauts correctly). The me macros are sufficiently documented by their creator Eric P. Allman: there is Writing papers with NROFF using -me and the -Me reference manual. There is also the 15th chapter of Robbins’ Unix in a nutshell in the 3rd edition (and try to get hold of rare old manuals if you see them, like: H. Dabney: Introduction to the -me macros used with Nroff on the VAX 11/780. David Taylor Naval Ship Research 1982). By the way, if you want to check out the source of the me macros, it’s /usr/share/…/tmac/e.tmac. To show how a few basic commands work in me I’ve created a little demonstration sample.me and the resulting pdf.
Some typographic features in Troff need additional processors. For creating tables there is the tbl preprocessor (documentation, man tbl at the prompt), there is eqn for maths (man eqn), GNU pic is used for drawing vector graphics like diagrams or stemma trees, and finally there is, no joke, refer for using bibliographic databases within Troff (it is indeed a predecessor of BibTeX; man refer, see also here). All preprocessors can be invoked independently but also through groff, which is a little bit more convenient than with TeX.
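For those who just want to try the commands mentioned so far, here is a small sketch which puts them together. The file demo.me and its text are invented for this purpose (it is not the sample.me linked above), and the example sticks to plain ASCII to sidestep the encoding question.

```shell
# Write a tiny me source; every request sits on a line of its own.
cat > demo.me <<'EOF'
.sh 1 "A first section"
.pp
A paragraph starts with the pp request.
This word is set in
.i italics
and the sentence goes on in roman.
.(q
An indented quotation goes between these two requests.
.)q
EOF

# Format it with the me macros if groff is installed.
if command -v groff >/dev/null 2>&1; then
    groff -me -Tps demo.me > demo.ps
fi
```

The Postscript file can then be converted with ps2pdf as described before.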
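Calling a preprocessor through groff could look like the following sketch, here with a little table for tbl. The file name table.me and the table content are made up; the option -t tells groff to run tbl first, and -e, -p and -R do the same for eqn, pic and refer.

```shell
# Write a me source containing a tbl table between .TS and .TE.
cat > table.me <<'EOF'
.pp
The preprocessors at a glance:
.TS
tab(:);
l l.
Tool:Purpose
tbl:tables
eqn:mathematics
pic:diagrams
refer:bibliographies
.TE
EOF

# Let groff call tbl itself instead of piping through it by hand.
if command -v groff >/dev/null 2>&1; then
    groff -t -me -Tps table.me > table.ps
fi
```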
Diacritica … Unicode?
There are several diacritical marks available in me: there is a caret (\*^), a cedilla (\*,), an overcircle (\*o), a turned caret (\*v), next to a tilde (\*~), acute and grave, but unfortunately the acute doesn’t work with “s” and there is no underdot or overdot available at all. So it all works pretty well, but the current release has some deficiencies for Sanskrit.
Since Groff is pure Postscript software there is no chance for the end user to get Unicode or OpenType running. Arbitrary Postscript fonts and converted TrueType fonts can be used, which makes Groff somewhat versatile; for example, a package made by Muhammad Muquit for getting a Bangla font running can be found here. An alternative are the reimplementations of the Heirloom Project, which provide up-to-date features like Unicode/OpenType support and direct font access next to a lot of other enhancements of the former software. As far as I can see there are no binaries available for Linux or Windows anywhere, but there are Mac OS X packages!
Prospects
Groff would lend itself to implementing a custom macro package for the standardised papers of an institute, and a Postscript font containing a full set of diacritics could be added for that purpose. As it is a single, quite small program, it would be no problem to put it all online: you upload your Troff sources and get your perfectly formatted Pdf in an instant without the need to get anything running on your own PC. Something like this could make up a pretty effective integrated system. Not to mention directions like making standardized bibliographic databases available …
Creating ebooks from book scans …. on Linux
Refined semiprofessional document scanning with Linux: here is a little collection of procedures and hints towards the production of e-books on your Linux system. I assume that your scanner is already running with Sane (see below) and that you know how to get the suggested software packages from your repository if available. The system used here is Debian GNU/Linux testing (“Squeeze”); I’ll give some availability information concerning other distributions (are there any?) but not in a comprehensive way. If in doubt please check the relevant project homepages and/or the source code hosts for more information on the applications. All the software mentioned is open source, so if nobody has thought of providing a package for your distribution there is always the way of compiling the code on your own, but it is for sure much more convenient to get software from the repository through your package management, or at least to get it manually from the programmer or from somewhere else. By the way, most of the packages are available for non-Linux operating systems, too. An easy way to get Linux stuff running on your Windows PC is Cygwin.
The e-book which is going to be produced is a single sided, black & white (b/w) djvu or pdf file (“the containers”) containing an OCR layer.
1. Scanning
Fortunately no specific insider knowledge is needed for scanning on Linux. Sane is definitely the application most used for that purpose (libraries in Debian testing currently 1.0.20-13, frontends 1.0.14-9, quite up-to-date!) and it should be available on most everyday Linux systems. The frontend Xsane (0.996-3) is quite convenient for batch scanning. It allows you to choose a scan area if the book is smaller than the scanner, the pages can be saved rotated by 90°, and you can auto-adjust gamma/brightness/contrast after getting the first picture or the preview (the big buttons, the 2nd from the left). Other scanning solutions are surely possible, too: the command line frontend scanimage, for example, can be run from a shell loop with custom intervals, which saves you from pushing a button to proceed after every page.
I am scanning at 300 DPI grayscale with the file extension .pgm [1] because my scanner backend currently doesn’t support b/w scanning (I know, I know …), but if that is possible with your model you could try to scan in b/w and skip the conversion step after scanning. For the post-processing it’s best to create a sequence 001.pgm, 002.pgm etc., and Xsane takes care of that. Usually there are two book pages in one picture; we are going to work on that next.
On the DPI rate: when you check the 300 DPI outcome with your favourite image viewer you’ll see that the scans are much bigger than necessary for reading on the screen, but that’s just the right way, because the images will appear shrunk in the containers, and the outcome of the conversion to b/w is better than with scans at a lower DPI rate. And unlike with grayscale images, after the conversion to b/w different DPI rates do not have such a significant effect on the overall file size of the final product, so there is no need to go below 300 to save resources.
[1] .pgm instead of the meta extension .pnm to keep it apart from .pbm after the conversion to b/w (next step), and not .tif because the post-processing tool Unpaper (see below) can’t work with that.
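Such a scanimage loop could be sketched roughly like this. The function names scan_name and scan_pages are made up for the illustration, and --mode and --resolution are options provided by the scanner backend, so please check scanimage --help for what your device really accepts.

```shell
# scan_name N: file name for page number N (001.pgm, 002.pgm, ...)
scan_name() {
    printf '%03d.pgm' "$1"
}

# scan_pages COUNT DELAY: scan COUNT images with DELAY seconds in between
scan_pages() {
    count=$1
    delay=$2
    n=1
    while [ "$n" -le "$count" ]; do
        # grayscale at 300 DPI; the option values depend on the backend
        scanimage --format=pnm --mode Gray --resolution 300 > "$(scan_name "$n")"
        n=$((n + 1))
        sleep "$delay"    # time to turn the page
    done
}
```

A call like scan_pages 120 10 would then scan 120 images and leave ten seconds for turning the page each time.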
2. Batch postprocessing (1): conversion and manipulation with Imagemagick
Imagemagick is a most versatile Swiss army knife for manipulating images on Linux systems. Like Sane it should be broadly available. Our scans can easily be converted to b/w using a simple shell loop:
for i in *.pgm; do convert "$i" -verbose "${i%pgm}pbm"; done
It’s also possible to adjust the threshold which decides whether a pixel becomes black or white, but normally the default works pretty well. The black stripe that often remains in the middle of the scan is going to be removed with Unpaper (next step). With Imagemagick’s convert it’s also possible to rotate the scans (-rotate 90), to cut out a rectangular region (-crop WIDTHxHEIGHT+X+Y), and a lot of other manipulations are possible; please check out the command line options.
A hint for batch conversion: it’s always a good idea not to overwrite with the manipulated files but to write the new generation into another directory (like: … -verbose ~/foo/${i%pgm}pbm; done).
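Both hints combined, a custom threshold and a separate target directory, might look like this sketch. The function name to_bw and the default of 50% are my own choices for the example; -threshold is the Imagemagick option that sets the black/white cut-off.

```shell
# to_bw SRC DEST [THRESHOLD]: convert all .pgm files in SRC to b/w .pbm
# files in DEST, leaving the originals untouched.
to_bw() {
    src=$1
    dest=$2
    threshold=${3:-50%}
    mkdir -p "$dest"
    for i in "$src"/*.pgm; do
        [ -e "$i" ] || continue            # no scans found, nothing to do
        name=$(basename "$i" .pgm)
        convert "$i" -threshold "$threshold" "$dest/$name.pbm"
    done
}
```

Something like to_bw . ~/foo 60% would then write the b/w generation into ~/foo with a slightly higher threshold.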
3. Batch postprocessing (2): Unpaper
Unpaper, written by Jens Gulden (currently 0.3-1, also available for Ubuntu Karmic), is a tool for post-processing scanned book pages. It removes dark areas, corrects misaligned centring and rotation of book pages, removes blur and noise, and is also able to split double-page scans into individual images. It’s made for heavy-duty tasks and deals even with scans of the most ridiculous book photocopies. Unpaper is able to perform batch processing jobs, and a simple invocation would look like:
unpaper --layout double --output-pages 2 %03d.pbm ~/foo/%03d.pbm
--layout double declares that each input scan carries two book pages and --output-pages 2 tells Unpaper to split them up into two individual files; %03d is not a shell variable but a printf-style placeholder for three-digit numbers which Unpaper expands itself. Unpaper is quite versatile and getting acquainted with everything takes some effort. While it does its job it’s a good idea to constantly monitor the output. If unwanted results appear you can interrupt the process and change the settings. Unpaper is very sensitive: for example, in most cases where text blocks are accidentally removed on single pages you have to adjust the mask scan setting (try a lower setting like -ms 25,25). The processing can be resumed from any file with --start-input x, but then you also have to adjust --start-output accordingly and to give --overwrite. A useful user documentation is provided on the project’s homepage (here).
4. Creating djvu (& pdf)
If you don’t know it already: djvu is a powerful container format for digital images which is faster and compresses better than other solutions, and there are viewers available for nearly all operating systems (see here, djview 4.5-3 in “Squeeze”). Even though djvu reveals its full potential especially at killer tasks like unreduced satellite pictures, in my experience working with it is always a little bit more fluent, even with b/w book scans. Concatenating our post-processed book pages into a djvu is no problem with the Djvulibre collection (3.5.22-7). First of all we have to convert the .pbm images into the djvu file format:
for i in *.pbm; do cjb2 "$i" "${i%pbm}djvu"; echo "$i"; done
After that we have to assemble the container:
djvm -c mydjvu.djvu *djvu
That’s it!
Creating a pdf on Linux is just as easy. First of all you have to convert the .pbms into .tifs (for i in *.pbm; do convert "$i" -verbose "${i%pbm}tif"; done), after that you have to create a multi-page tiff from these (tiffcp *.tif bundle.tif), and finally you can create a pdf from that with: tiff2pdf -o mypdf.pdf bundle.tif (Note: tiffcp and tiff2pdf are part of libtiff-tools, 3.9.2-2 in “Squeeze”. For tiff2pdf the compression method has to be given as well, -j (Jpeg) or -z (Zip), see the manpage here).
5. OCR
There are also solutions available for Linux to derive OCR information from book scans for the text layers of djvu and pdf, and Tesseract seems to be the most mature application so far. Its development has been taken over by Google and it is described as “probably one of the most accurate open source OCR engines available”. Tesseract-ocr is available for Debian testing in version 2.04-2, and there are a few language data files for the software which have to be installed as well (tesseract-ocr-eng etc.). Playing around with it one gets the impression that Tesseract works quite nicely, especially when the correct language is chosen. It has problems with Sanskrit diacritics, though, but I’ve seen that Tesseract can also be trained (I’ll report when I have found out more). It can be applied to individual image files, also through batch processing (see some experiences here), but it is more convenient to work with a wrapper which also takes care of re-combining the OCR output with the image automatically:
Ocrodjvu (0.3.2-1, Ubuntu Lucid) by Jakub Wilk is a foolproof wrapper for working on document scans which are already concatenated into a djvu. It depends on OCRopus (0.3.1-2), an open source OCR system which is under development at the German Research Center for Artificial Intelligence (DFKI). OCRopus employs Tesseract to extract the textual information from the scanned document and, that’s the trick here, also saves the position on the page for every word, so that a query in Djview or other viewers leads not only to the relevant pages but also highlights the word instances on these pages (layout analysis), a feature one would hardly want to miss nowadays. Ocrodjvu is easy to apply to the djvu we’ve created so far:
ocrodjvu -o mydjvu_ocr.djvu mydjvu.djvu --language=eng
or similar. Start’n’forget, life is easy.
For pdf e-books it’s a little bit more tricky because so far there isn’t a fully developed wrapper for OCRopus available for pdf (I couldn’t get the little tool pdf2ocr, which I’ve found on the net, to work properly), so I will leave that out here for now.
6. Gscan2pdf
Gscan2pdf (0.9.29-1, Ubuntu Hardy) is actually a very comfortable GUI frontend for most of the multitude of tools we’ve discussed so far, Sane (scanning), Unpaper (postprocessing) and Tesseract (OCR), and the whole process of producing an e-book, both djvu and pdf, can be carried out with this amazing tool. Gscan2pdf has interfaces to Tesseract and also to the alternative Gocr, but as far as I’ve seen it unfortunately has no interface to OCRopus and can’t deal with layout-analysed output (hocr), so this is a desideratum here.
7. Bookmarks
The final step to refine your e-book would be to apply bookmarks to the document. For djvu, custom bookmarks (in the djvu world they are called “outline”) have to be in a form like:
(bookmarks
 ("Title" "#1")
 ("Main matter" "#5"
  ("Chapter 1" "#5")
  ("Chapter 2" "#15"))
)
After editing such a file (you could name it mydjvu.outline), djvused from the Djvulibre tools can apply the outline to the container:
djvused mydjvu.djvu -e 'set-outline mydjvu.outline' -s
That’s it. By the way, the djvu outline format is Unicode capable.
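For longer books it can be handy to generate the outline file from a simple list instead of typing the brackets by hand. The following is only a sketch with a made-up function name, make_outline; it produces a flat outline without nested chapters and assumes that the titles contain no double quotes.

```shell
# make_outline: read lines of the form "PAGE TITLE" from stdin and print
# a flat djvu outline to stdout.
make_outline() {
    echo '(bookmarks'
    while read -r page title; do
        [ -n "$page" ] || continue          # skip empty lines
        printf ' ("%s" "#%s")\n' "$title" "$page"
    done
    echo ')'
}
```

Used like make_outline < chapters.txt > mydjvu.outline, where chapters.txt (a hypothetical file) has one line per bookmark such as “5 Chapter 1”.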
8. Miscellaneous stuff
If the book you want to scan is bigger than the affordable scanner, there is the way of scanning one page at a time. If the lid of the scanner can’t be fully removed, or for whatever other reason, you may then end up with a set of odd numbered scans on which the pages are rotated by 180° in relation to the ones on the even numbered scans, or the other way around. There is also a way to rotate either set, try:
for i in *.pbm; do p=$(echo "$i" | cut -c 3); if [ $((p % 2)) -eq 1 ]; then convert "$i" -rotate 180 -verbose "$i"; fi; done
(one line!). This rotates the set of odd numbered three-digit .pbms. For working on the even numbered set, replace -eq 1 with -eq 0. But if you try you’ll see that scanning this way takes painful Prussian discipline.
Unpaper employs batch processing with rising numbers. If you want to re-engineer your already created djvu containers that way, you can unpack them with ddjvu, which puts out a multi-page tif (usage like: ddjvu -format=tiff -pages=1-25 ~/foo/mydjvu.djvu bundle.tif). That in turn can be burst with tiffsplit, which produces a set of images aaa.tif, aab.tif etc. These can then be converted to pbm (converting and re-encoding them into djvu is, as far as I’ve seen, the only way to obtain a djvu with a custom page range from a djvu), and a proper sequence can always be restored with this little shell script here.
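The one-liner looks only at the third character of the file name, which is fine for names like 007.pbm. A slightly more general sketch, with the made-up helper names is_odd_page and rotate_odd, takes the whole number and strips leading zeros first, so that the shell doesn’t stumble over something like 008 being read as an octal number:

```shell
# is_odd_page FILE: succeeds if the number in front of the extension is odd
is_odd_page() {
    n=${1%%.*}                  # strip the extension: 007.pbm -> 007
    n=${n#"${n%%[!0]*}"}        # strip leading zeros: 007 -> 7
    [ $(( ${n:-0} % 2 )) -eq 1 ]
}

# rotate_odd: turn all odd numbered .pbm files in the current directory
rotate_odd() {
    for i in *.pbm; do
        [ -e "$i" ] || continue
        if is_odd_page "$i"; then
            convert "$i" -rotate 180 "$i"
        fi
    done
}
```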
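Restoring a numbered sequence can be sketched in a few lines; this is an illustration of the idea with an invented function name, renumber, and not necessarily how the linked script does it. The files are moved into a separate directory so that old and new names can’t collide.

```shell
# renumber EXT DEST: move all *.EXT files of the current directory, in the
# sorted order of their names, to DEST/001.EXT, DEST/002.EXT, ...
renumber() {
    ext=$1
    dest=$2
    n=1
    mkdir -p "$dest"
    for f in *."$ext"; do
        [ -e "$f" ] || continue
        mv -- "$f" "$dest/$(printf '%03d' "$n").$ext"
        n=$((n + 1))
    done
}
```

After tiffsplit, a call like renumber tif numbered would turn aaa.tif, aab.tif etc. into numbered/001.tif, numbered/002.tif and so on.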

