Sky’s the limit: on the background of the Digital Corpus of Sanskrit (DCS)
Oliver Hellwig’s Digital Corpus of Sanskrit (DCS) has been online for a couple of days now. At first sight the project does not look any different from the many other sites on the net for which Sanskrit texts have been typed in and their lemmas or lexemes hyperlinked to one or more online dictionaries. Such projects often result in alphabetical lists of lexemes with auto-generated text instances, as is the case here with the DCS. But the scope of the Sanskrit Tagger, of which the DCS is a rather unspectacular-looking product, goes much deeper, as Hellwig explained in his presentation at the 1st International Sanskrit Computational Linguistics Symposium (1st ISSCL) at INRIA in Paris in 2007 (paper see here, video stream here [great that there is one!]).
The Sanskrit Tagger is a tool that uses stochastic (statistical) methods to automatically assign grammatical descriptors to the “words” of arbitrary Sanskrit text. For this part-of-speech (POS) tagging a Hidden Markov Model (HMM) is employed, a statistical method used in computational linguistics as well as in bioinformatics, both of which are fields of temporal (sequential) pattern recognition. In plain English this just means that the computer has been taught to read Sanskrit. The individual steps of the tagging process are described at length in the paper.
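To give a rough idea of what HMM-based tagging means in practice, here is a minimal sketch of the Viterbi algorithm, the standard way to find the most probable tag sequence under an HMM. It is not Hellwig's implementation (the real tagger also has to handle lexical analysis and sandhi splitting, among other things). The two-tag tagset, the three-word "sentence" and all probabilities below are toy values invented purely for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable sequence of states (tags) for the observations."""
    def lg(p):
        # Work in log space; unseen events get a tiny floor probability.
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    first = observations[0]
    best = [{s: (lg(start_p.get(s, 0)) + lg(emit_p[s].get(first, 0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximises path score plus transition score.
            prev, score = max(
                ((p, best[-1][p][0] + lg(trans_p[p].get(s, 0))) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + lg(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Toy model: every number here is made up, not taken from the DCS.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.5, "VERB": 0.5},
    "VERB": {"NOUN": 0.7, "VERB": 0.3},
}
emit_p = {
    "NOUN": {"rāmaḥ": 0.5, "vanam": 0.45, "gacchati": 0.05},
    "VERB": {"rāmaḥ": 0.05, "vanam": 0.05, "gacchati": 0.9},
}

tags = viterbi(["rāmaḥ", "vanam", "gacchati"], states, start_p, trans_p, emit_p)
print(tags)  # ['NOUN', 'NOUN', 'VERB']
```

In a real tagger the three probability tables are estimated from an already annotated corpus, which is why a growing database matters so much for the quality of the results.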
Thanks to Hellwig’s diligence the database behind the tagger is constantly growing, and with it, as always in statistics, the accuracy increases. I think that in the not too distant future it will be possible to let the tagger process larger Sanskrit corpora, such as GRETIL, without producing a mass of errors. Auto-generated word indexes like those of the Bodhicaryāvatāra and the Rāmāyaṇa, which Hellwig presents on another page (here), are just one result of the project, but it goes further than that. A large corpus of tagged text is ideal for syntactic inquiries, for example. And refining the statistical methods applied to Sanskrit text could also lead to research tools such as automatically generated stylistic fingerprints of authors, which would deepen our insight into Sanskrit literature significantly.
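Once a text has been tagged, generating a word index is almost trivial. The following sketch assumes, hypothetically, that the tagger output is available as a list of (lemma, reference) pairs; the function name, the data format and the sample references are my own inventions, not the DCS data model.

```python
from collections import defaultdict

def build_lemma_index(tagged_tokens):
    """Map each lemma to the list of references where it occurs."""
    index = defaultdict(list)
    for lemma, reference in tagged_tokens:
        index[lemma].append(reference)
    # sorted() uses Unicode code point order here, not the Sanskrit alphabet.
    return {lemma: index[lemma] for lemma in sorted(index)}

# Made-up sample: lemmas with invented chapter.verse references.
tagged_tokens = [
    ("rāma", "1.1"), ("vana", "1.1"), ("gam", "1.1"),
    ("rāma", "1.2"), ("gam", "1.3"),
]

lemma_index = build_lemma_index(tagged_tokens)
for lemma, references in lemma_index.items():
    print(lemma, ", ".join(references))
```

A proper index would of course sort the lemmas in Sanskrit alphabetical order, which needs a custom collation key instead of the default ordering used above.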
Basic literature:
Baldi/Brunak: Bioinformatics – the machine learning approach. 2nd ed. Cambridge (etc.): MIT Press 2001 [165 seq.: Hidden Markov Models: the theory].
Eddy: What is a Hidden Markov Model? In: Nature Biotechnology 22,10 (2004), 1315-16 [doi 10.1038/nbt1004-1315].
Fucks/Lauter: Mathematische Analyse des literarischen Stils. In: Kreuzer/Gunzenhäuser (Ed.): Mathematik und Dichtung. Versuche zur Frage einer exakten Literaturwissenschaft. München: Nymphenburger Verlagshandlung 1969, 107-22.
Hellwig: Sanskrit Tagger – a stochastical lexical and POS tagger for Sanskrit. In: Huet/Kulkarni/Scharf (Ed.): Sanskrit computational linguistics. First and Second International Symposia. Berlin, Heidelberg: Springer 2009, 266-77.
Huet: Lexicon-directed segmentation and tagging in Sanskrit. In: Tikkanen/Hettrich: Themes and tasks in old and middle Indo-Aryan linguistics. Delhi 2006 (Papers of the 12th World Sanskrit Conference; 5), 305-23.
Samuelsson: Statistical methods. In: Mitkov (Ed.): The Oxford Handbook of Computational Linguistics. Oxford University Press 2004, 358-75 [19.3: Hidden Markov Models].
Voutilainen: Part-of-speech tagging. In: Mitkov (Ed.): The Oxford Handbook of Computational Linguistics. Oxford University Press 2004, 219-32.

