More lightweight PDF production: Docutils, Pandoc, Lout
After the previous posting on Troff and Groff I want to continue pointing out some more of the console-based, end-user-oriented solutions that exist, next to the TeX family, for producing PDF documents from plain text files (some of them do generate TeX source to produce their PDFs). Whether the non-TeX systems are real contenders depends heavily on the kind of input markup they use. The dinosaur Groff is certainly fascinating because of its age. Furthermore, it is available everywhere, is pretty versatile given its small size (it even includes a bibliographical subsystem), and its output is very tasteful. Unfortunately the input method is somewhat strange, so Groff is rarely used for everyday work, and it is more or less pointless to hope for a Troff renaissance. Input methods differ in their usability, of course, and these days several different markup languages compete [1]. The wiki markup known from Wikipedia is, in my experience, one example of a pretty pleasant markup convention.
So let’s see now what other systems could be used on the console to produce papers, handouts and other pieces as PDFs (and alternatively in other publishing formats) from plain text files. As always this is written with a view to Linux, but things work more or less the same way on Mac OS X with its Unix core, and mostly also somehow on Windows:
Docutils / reStructuredText
Next to Epydoc and other systems, Docutils is able to generate the documentation for Python modules right from their source code (“in-line documentation” with docstrings; Docutils itself is also written in Python), but it can also be used as a stand-alone lightweight text processing system [2]. Docutils employs its own markup system, reStructuredText (ReST) [3], which can be converted into several of the heavyweight publishing formats, or into sources for them, with the console tools that are shipped with it: XML (rst2xml writes Docutils’ own XML vocabulary; getting to DocBook takes a further conversion step), the Open Document Format for OpenOffice.org’s Writer, HTML, and LaTeX (the programs are rst2xml, rst2odt, rst2html and rst2latex). DocBook and LaTeX in turn are formats from which PDFs can easily be produced on the console (DocBook, too, produces very classy documents). There is also the tool rst2pdf, which generates PDF directly from the ReST source, without even the need to install LaTeX. ReST is really intuitive: most of the time it uses plain text conventions that a user would apply “naturally” for structuring anyway. A tab indent marks a quote, “1) 2) 3)” etc. typeset a numbered list, and so on. It is pretty simple: “*x*” marks italics, “http://” hyperlinks are tagged automatically, etc. [5]. Unfortunately it seems that nothing here is really Unicode capable yet, so there is room for improvement. Anyway, the ReST family is worth keeping an eye on.
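To make the tool chain a bit more tangible, here is a minimal sketch in Python (the language Docutils itself is written in). It only builds the command lines for the converters named above and runs them if they are found. The file name handout.rst, the output suffixes, the helper names and the sample text are my own assumptions for illustration, and on some installations the tools are called rst2html.py etc. instead.

```python
import shutil
import subprocess
from pathlib import Path

# A small ReST sample using only the conventions mentioned in the text.
SAMPLE = """\
A short handout
===============

This word is *emphasized* and http://example.org/ is tagged as a link.

    An indented paragraph is treated as a quote.

1) first point
2) second point
"""

# Tool names as listed in the text; the output suffixes are assumptions.
CONVERTERS = {
    "rst2html": ".html",
    "rst2latex": ".tex",
    "rst2xml": ".xml",
    "rst2odt": ".odt",
}


def docutils_command(tool, source):
    """Build 'tool source destination' for one of the Docutils front ends."""
    destination = Path(source).with_suffix(CONVERTERS[tool])
    return [tool, str(source), str(destination)]


def rst2pdf_command(source):
    """Build the rst2pdf call; it writes the PDF named after -o."""
    return ["rst2pdf", str(source), "-o", str(Path(source).with_suffix(".pdf"))]


def run_if_installed(command):
    """Run the command only if its executable is on the PATH."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True

# Possible usage (not executed here):
#   Path("handout.rst").write_text(SAMPLE, encoding="utf-8")
#   run_if_installed(rst2pdf_command("handout.rst"))
```

Keeping the command construction apart from the execution makes it easy to check what would be called before anything touches the disk.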
Pandoc / Markdown
Pandoc is a very useful converter between the most popular publishing markup formats [6]. As input it supports, among others, HTML, LaTeX and, again, ReST; the wide range of output formats includes HTML, LaTeX, ConTeXt [!], DocBook, Open Document Format, Rich Text Format, ReST, and even MediaWiki markup. Pandoc is also able to convert to S5, a slide show system for presentations (see here). Another pretty interesting feature is that Pandoc can process the Markdown markup language, which is also very lean and effective [7]. As with Docutils, a PDF document can be generated manually from the DocBook or LaTeX sources that Pandoc puts out (note that by default the resulting files contain only the body text; Pandoc does not add a document header), but the package also includes the wrapper markdown2pdf for processing a Markdown source in one step (it makes use of pdflatex, so a LaTeX distribution has to be available). A big advantage here is that Pandoc is already Unicode capable, and the software is also open to custom conversion extensions. It is also possible to run Pandoc as an embedded filter for using Markdown with ConTeXt (see here) [8].
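The manual two-step route from Markdown via LaTeX to PDF could look roughly like the following Python sketch. It only assembles the two command lines and does not run anything. The file name notes.txt and the helper names are hypothetical. The -s (standalone) switch asks Pandoc for a complete document including the header, which is what pdflatex needs in place of the body-only output.

```python
from pathlib import Path

# Output suffix per target format; this mapping is an assumption for the sketch.
SUFFIXES = {"latex": ".tex", "docbook": ".xml", "html": ".html"}


def pandoc_command(source, target_format, standalone=True):
    """Build a pandoc call that converts a Markdown file to target_format."""
    target = Path(source).with_suffix(SUFFIXES[target_format])
    command = ["pandoc", "-f", "markdown", "-t", target_format]
    if standalone:
        # -s makes Pandoc emit a full document with header, not just the body.
        command.append("-s")
    return command + ["-o", str(target), str(source)]


def two_step_pdf(source):
    """Return the two commands of the manual route: Markdown -> LaTeX -> PDF."""
    tex = Path(source).with_suffix(".tex")
    return [pandoc_command(source, "latex"), ["pdflatex", str(tex)]]

# Possible usage (not executed here):
#   import subprocess
#   for command in two_step_pdf("notes.txt"):
#       subprocess.run(command, check=True)
```

The same helper covers the DocBook route by passing "docbook" as the target format; turning that XML into a PDF then needs a separate DocBook tool chain, which is not sketched here.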
Lout
Well, for Indologists poor attestation is no indication that something is uninteresting or irrelevant; on the contrary. But if you ever thought that ConTeXt is a minority issue, and that Groff probably tops everything in terms of how little can be found about it on the net, then you really haven’t dealt with Lout. Even the Wikipedia page (with its few meagre counterparts in German, French, and Vietnamese) is marked as probably not meeting the general notability guideline [!]. Googling the term produces similarly measly results. But Lout really doesn’t deserve to dwell in the darkness of general neglect; it is a real alternative that must be located somewhere between Groff and TeX. Lout is pretty versatile given its size: this surprisingly comprehensive, everything-included document formatting system, which consists mainly of a single file of 655 kB, is the work of Jeffrey H. Kingston of the School of Information Technologies at the University of Sydney [9].
Like LaTeX, Lout uses different document types. The basic ones are doc, report, book, slides and picture, but there are also others, for example for pretty-printing the source code of different programming languages (for all of them see /usr/share/lout/include). If the end user wants to modify basic settings like indentation or inter-paragraph space, that has to be done by modifying the document type setup files or copies of them (see 4.1 sq. in the User guide); this method has its advantages. Like Groff, Lout is a closed system which comes with its own fonts, hyphenation patterns etc. Lout is also equipped with its own bibliographical database format, and the appearance of citations and of the reference list can be customized (see chapter 5 in the User guide). There are also a glossary and an indexing system, as well as diagram, graph, and pie graph functions (see chapters 9-11), so all of this really does not look trivial. Lout is PostScript software: the resulting file (lout mydocument.lout > mydocument.ps) can then be converted to PDF, as always, with ps2pdf (which belongs to Ghostscript). There are a lot of symbols and diacritics available (see 1.4: characters), but unfortunately I couldn’t find a combining overbar, under-/overdot, or s-acute. As with Groff, I think there is hardly a chance of getting Unicode fonts running, but maybe it is worth the effort to add another proper PostScript font to the system. Anyway, I’ve created a little showcase: the source is here, the resulting PDF is here. I think I am really going to play around with this a little bit longer!
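The two steps from a Lout source to a PDF can be wrapped in a few lines of Python. This is only a sketch under assumptions: the helper names are made up, the sample document is a bare-bones file of the doc type and not my showcase, and lout and ps2pdf have to be installed for build_pdf to work. Documents with cross references may need more than one lout run, which the sketch does not handle.

```python
import subprocess
from pathlib import Path

# A bare-bones document of the "doc" type (sample text is made up).
LOUT_SAMPLE = """\
@SysInclude { doc }
@Doc @Text @Begin
@Display @Heading { A first Lout document }
@PP
Hello, world.
@End @Text
"""


def lout_pipeline(source):
    """Return the PostScript path, the PDF path and both command lines."""
    ps = Path(source).with_suffix(".ps")
    pdf = Path(source).with_suffix(".pdf")
    return ps, pdf, ["lout", str(source)], ["ps2pdf", str(ps), str(pdf)]


def build_pdf(source):
    """Run lout (PostScript goes to stdout) and then ps2pdf."""
    ps, pdf, lout_command, ps2pdf_command = lout_pipeline(source)
    with open(ps, "wb") as out:
        # Equivalent to: lout mydocument.lout > mydocument.ps
        subprocess.run(lout_command, stdout=out, check=True)
    subprocess.run(ps2pdf_command, check=True)
    return pdf

# Possible usage (not executed here):
#   Path("mydocument.lout").write_text(LOUT_SAMPLE, encoding="utf-8")
#   build_pdf("mydocument.lout")
```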
Notes
[1] On the issue of markup (which nowadays is a subfield of Digital Humanities and of course also covers HTML and particularly the omnipresent XML family): J.H. Coombs, A.H. Renear, S.J. DeRose: Markup systems and the future of scholarly text processing [Communications of the ACM 30,11 (1987), 933-47]; A. Witt, D. Metzing (Eds.): Linguistic modeling of information and markup languages – contributions to language technology. Dordrecht (etc.): Springer 2010.
[2] python-docutils on Debian (0.5.2 on Lenny, 0.6.4 on Squeeze; 0.6.3 on Ubuntu Lucid).
[3] On Docutils and ReST from the viewpoint of XML see David Mertz’s XML matters: reStructuredText (the stuff on IBM developerWorks is generally quite informative). A special GUI frontend, DocFactory, is also under development.
[5] See the primer and the specification. There is another introduction to ReST here and here. There is also a Vim module for highlighting ReST, and an Emacs extension that does the same next to some other things. By the way, the extension .rst has not yet been established as an official MIME type, but it is not reserved for another application either (see here).
[6] Recent release 1.5.1.1. For some reason Pandoc is currently not available as a package from the Debian repository, not even for Squeeze/Testing (see here), but since it is written in the functional programming language Haskell it can easily be fetched through Cabal, the Apt-like system for retrieving Haskell sources: just install cabal-install (it will also pull in some crap like the Haskell compiler ghc6) and then do cabal update and cabal install pandoc (it takes a while to get and compile everything). After that the executable binary pandoc is available at ~/.cabal/bin (which of course could be copied out or softlinked to ~/bin, but it is more elegant to add: if [ -d ~/.cabal/bin ] ; then PATH=~/.cabal/bin:"${PATH}"; fi to ~/.bash_profile) – easy lover! Pandoc is also available as a library for Haskell.
[7] Markdown (home) was originally developed for producing HTML. On recent Debian Testing, for example, there are several Markdown applications in the repository: the original HTML generator markdown, libraries for Perl and Python, and a special Emacs mode which is included in emacs-goodies-el. .mkd or .pdc would be proper extensions for Markdown files, but Pandoc itself simply uses .txt when converting into Markdown. The whole point of the lightweight markup languages is indeed that their files can more or less also be read as plain text files.
[8] The Lua library Lunamark for conversion between markup formats could also be employed by the LuaTeX engine.
[9] The software is hosted at Sourceforge. There is development here; version 3.36 is available for Lenny, 3.38 for Squeeze (see here). There is the Lout-users mailing list. Kingston described The design and implementation of the Lout document formatting language [Software - Practice and Experience 23,9 (1993), 1001-41 (offprint)]. The software comes with a user guide (pdf here) and an expert’s guide (pdf here, package lout-doc on Debian). There are even high-class books created with Lout, see Mark Summerfield’s page. Some business task examples are collected here.

