Sky’s the limit: on the background of the Digital Corpus of Sanskrit (DCS)
Oliver Hellwig’s Digital Corpus of Sanskrit (DCS) has been online for a couple of days now. At first sight this project doesn’t look different from all the other sites on the net for which Sanskrit texts have been typed in and whose lemmas or lexemes have been hyperlinked to one or more online dictionaries. Projects like these often result in alphabetical lists of lexemes with auto-generated text instances, as is the case here with the DCS. But the scope of the Sanskrit Tagger, of which the DCS is a somewhat unspectacular-looking result, goes much deeper, as Hellwig explained in his presentation at the 1st International Sanskrit Computational Linguistics Symposium (1st ISSCL) at INRIA in Paris in 2007 (paper see here, video stream here [great that there is one!]).
The Sanskrit Tagger is a tool which uses stochastic (statistical) methods to automatically assign grammatical descriptors to the “words” in arbitrary Sanskrit text. For this part-of-speech (POS) tagging a Hidden Markov Model (HMM) algorithm is employed, a statistical method used in computational linguistics but also in bioinformatics, both of which are fields of temporal pattern recognition. In plain English that just means that the computer has been taught to read Sanskrit. The whole tagging process is described step by step in the paper.
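To give a rough idea of how an HMM picks tags, here is a minimal sketch of Viterbi decoding, the standard way of finding the most probable tag sequence in such a model. It is not the Sanskrit Tagger's implementation: the two tags, the three already segmented (unsandhied) word forms and all probabilities are invented for illustration, and the names `viterbi`, `START`, `TRANS`, `EMIT` and `UNSEEN` are hypothetical.

```python
import math

# Hypothetical toy model: tags, word forms and probabilities are invented
# for illustration and are not taken from the Sanskrit Tagger.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}            # P(first tag)
TRANS = {                                      # P(next tag | previous tag)
    "NOUN": {"NOUN": 0.4, "VERB": 0.6},
    "VERB": {"NOUN": 0.7, "VERB": 0.3},
}
EMIT = {                                       # P(word | tag)
    "NOUN": {"rāmaḥ": 0.5, "vanam": 0.4, "gacchati": 0.1},
    "VERB": {"rāmaḥ": 0.05, "vanam": 0.05, "gacchati": 0.9},
}
UNSEEN = 1e-6  # small fallback probability for words missing from EMIT


def viterbi(words):
    """Return the most probable tag sequence for a list of segmented words."""
    if not words:
        return []
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        p = START[tag] * EMIT[tag].get(words[0], UNSEEN)
        best[0][tag] = (math.log(p), None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], UNSEEN))
            score, prev = max(
                (best[i - 1][p][0] + math.log(TRANS[p][tag]) + emit, p)
                for p in TAGS
            )
            best[i][tag] = (score, prev)
    # Follow the stored back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["rāmaḥ", "vanam", "gacchati"]))
```

In a real tagger the probabilities would be estimated from a hand-checked database, which is why a growing database improves the results.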
Thanks to Hellwig’s diligence the database which belongs to the tagger is constantly growing and with that, as always with statistics, the accuracy increases. I think in the not too distant future it will be possible to let the tagger process larger Sanskrit corpora, like GRETIL for example, without producing a mass of errors. Auto-generated word indexes like the ones of the Bodhicaryāvatāra and the Rāmāyaṇa which Hellwig presents on another page (here) are just one result of the project, but it goes even further than that. A large corpus of tagged text is ideal for syntactical inquiries, for example. And refining the statistical methods applied to Sanskrit text could also bring us research tools like generated author style fingerprints, which would deepen our insight into Sanskrit literature significantly.
Basic literature:
Baldi/Brunak: Bioinformatics – the machine learning approach. 2nd ed. Cambridge (etc.): MIT Press 2001 [165 seq.: Hidden Markov Models: the theory].
Eddy: What is a Hidden Markov Model? In: Nature Biotechnology 22,10 (2004), 1315-16 [doi 10.1038/nbt1004-1315].
Fucks/Lauter: Mathematische Analyse des literarischen Stils. In: Kreuzer/Gunzenhäuser (Ed.): Mathematik und Dichtung. Versuche zur Frage einer exakten Literaturwissenschaft. München: Nymphenburger Verlagshandlung 1969, 107-22.
Hellwig: Sanskrit Tagger – a stochastical lexical and POS tagger for Sanskrit. In: Huet/Kulkarni/Scharf (Ed.): Sanskrit computational linguistics. First and Second International Symposia. Berlin, Heidelberg: Springer 2009, 266-77.
Huet: Lexicon-directed segmentation and tagging in Sanskrit. In: Tikkanen/Hettrich: Themes and tasks in old and middle Indo-Aryan linguistics. Delhi 2006 (Papers of the 12th World Sanskrit Conference; 5), 305-23.
Samuelsson: Statistical methods. In: Mitkov (Ed.): The Oxford Handbook of Computational Linguistics. Oxford University Press 2004, 358-75 [19.3: Hidden Markov Models].
Voutilainen: Part-of-speech tagging. In: Mitkov (Ed.): The Oxford Handbook of Computational Linguistics. Oxford University Press 2004, 219-32.

I have to say that I am skeptical about the scope of its usefulness. “The sky is the limit” has been said to me before when I asked the director of another project like this what specific uses it will have. It seems to me that that is not a very convincing answer. As far as I can see, you’ve given us two potential uses on the horizon: syntactical inquiries and “generated author style fingerprints.” We can do syntactical inquiries quite easily with GREP searching. I’m not sure exactly what is meant by the author-style fingerprint, but I think you mean the computer would be able to tell you whether a given piece of text is authorial or interpolated. About that I am extremely skeptical, for many reasons. One, and this is fundamental in my mind, is that we need to stop saying that the computer can read or the computer can judge. The computer only does what it is designed to do within the limits created for it. This matters for what you said because a human will have to decide when a given set of characteristics equals the text of an author, right? Only when that is fed to the program will it apply it to the text you feed it. Now how is one going to decide what criteria to use to determine authorship? I think it is safe to say that any given Sanskrit text from the classical period is the work of many hands copying, corrupting, correcting over and over and over again down the centuries. How is that going to be untangled? You can’t just say the computer will do it, because again the computer only does what you tell it to do.
(no offense intended, Dan, I’m just venting my feelings about the topic!)
A bucketful of critique under which not a single one of my opinions survives. I couldn’t quite tell whether it is aimed at me or at the tagger (thanks for the disclaimer at the end), but I am glad to have provoked a discussion, which is the mother of science.
My replies:
(1) The Tagger *does* read Sanskrit, though on a very basic level.
(2) Yes, we can do syntactical inquiries with grep, even without a tagged text, but we couldn’t do things like “give me all sentences where the adjective precedes its referent with 2-3 words in between”. That’s the difference between a tagged text and an e-text. Maybe I should have explained that a little more clearly.
(3) I know that the tagger and software like it work on the “edition level”, which is a little bit problematic, but I think it is open to alterations when the text improves. Projects like this naturally work with a “simple concept” of what the text is (haha, really a Hamburg-style discussion!).
(4) On the “author style fingerprint”, where I find your arguments solid: if we are going to probe in a direction like this someday, I know we have to be very careful. I wouldn’t feel good, you are right, if we left everything to the computer, which can count and calculate like nothing else but indeed is stupid in a certain way. The computer can’t be left to judge too freely because very quickly, you are right, we get into simple “garbage in, garbage out” situations and other loops like this (as was the case in the last attempt to decipher, or rather establish, an Indus culture script: Rao et al.: Entropic evidence for linguistic structure in the Indus script, doi 10.1126/science.1170391). Inquiries like this are great for discussion but basically without substance. No, I just meant that developments could lead to a situation in which we would be able to get *additional data* for reasoning. I didn’t mean (and haven’t stated anywhere) that we should work towards someday handing everything over to the computer (which, o.k., an overblown catchphrase like “sky’s the limit” may have implied too uncritically, in the light of today’s one-dimensional “emulate the human” computational linguistics).
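The kind of query quoted under (2) could look roughly like the sketch below. The annotation format is an assumption: the post does not describe how the DCS stores its tags, so the `Token` fields (`form`, `pos`, `head`), the tag names and the function `adjective_gap_matches` are all hypothetical, and the word forms are placeholders rather than real Sanskrit.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str
    pos: str             # hypothetical tag set, e.g. "ADJ", "NOUN", "VERB"
    head: Optional[int]  # index of the word this token modifies, if annotated


def adjective_gap_matches(sentence, min_gap=2, max_gap=3):
    """Return (adjective_index, noun_index) pairs in which an adjective
    precedes the noun it modifies with min_gap..max_gap words in between."""
    hits = []
    for i, tok in enumerate(sentence):
        if tok.pos != "ADJ" or tok.head is None:
            continue
        j = tok.head
        words_between = j - i - 1
        if sentence[j].pos == "NOUN" and min_gap <= words_between <= max_gap:
            hits.append((i, j))
    return hits


# Placeholder sentences: only the tags and head links matter here.
corpus = [
    [Token("adj1", "ADJ", 3), Token("w1", "ADV", None),
     Token("w2", "VERB", None), Token("noun1", "NOUN", None)],
    [Token("adj2", "ADJ", 1), Token("noun2", "NOUN", None),
     Token("w3", "VERB", None)],
]

matching = [s for s in corpus if adjective_gap_matches(s)]
print(len(matching))
```

A plain e-text has no `pos` or `head` information to filter on, which is why grep alone cannot express this question.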
Thanks for your annotations, Michael!
Thanks for your reply. I am glad you brought this topic up and welcome discussion, because I haven’t seen many people really talking about this. I admit that I haven’t read any of the basic sources you mentioned either; I have tried reading some of this type of thing and I get bogged down by the jargon. I’m happy we agree about the problems of author fingerprinting. I see what you mean now about those specific types of syntactical inquiries, and you are right about that. I still have to disagree about the tagger “reading”, but perhaps this is just a semantic issue of what it means to read something. To me reading means understanding and reflecting, comparing to a life of experience. For example, I read a love poem and my understanding of the sense the author is trying to convey is filtered through my own experiences of love or loss. I know that the tagger can be fed a word and based on a database of past human choices make a “decision” about whether it’s a compound and what the sandhi is, and so on. I don’t consider that reading. I also worry about it, because many times there is a difficult decision about how to split a compound or dissolve a sandhi and either option gives a feasible but different result. I wouldn’t want the future Sanskritists to trust the computer-generated decision more than their own critical thinking skills, or worse, let their own skills atrophy because of reliance on the computer.
I may add some comments to the ongoing discussion:
(1) I completely agree with Daniel: tagged texts offer completely new opportunities when compared with simple e-texts because methods implemented in GREP and other programs can now be applied on a lexical/semantic level and not only on the phonetic one.
(2) Of course, the computer will not find out anything new “on its own”. You have to tell the computer which pattern you are interested in, but as soon as you have formulated this question, its capacities are clearly superior. Basically, this is not different from what every scholar is doing: Formalize ideas, develop (philological, mathematical, statistical, …) methods to prove them, and then delegate the unpleasant part of the work to some other instance = the computer!
(3) As I mentioned on the website of the DCS, the corpus is only a rather small extract from the data collected in the SanskritTagger database. In addition, it does not have any of the functionalities of the tagging program (except for searching for single words). So, one should distinguish between the view of the data = DCS and the underlying tagging program.
(4) A final point: The data presented in DCS are not “computer-generated” as Michael suggests. Instead, the computer only proposes solutions, and I select the most appropriate one according to my understanding of the passage. Of course, errors occur, especially depending on the time of the day! And some composites may be analyzed in a different way by other scholars. However, I would like to emphasize that the basic data of DCS are checked by a human philologist and not only processed by a tagging algorithm.
Best, Oliver
Thanks for the post. And a very interesting discussion indeed. I think the best way to check out any “style analysis” machine is to feed it with modern literature whose authorship is known, and see whether the software is able to detect real differences between authors and literary movements or not. Has it already been done?
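Such a check on texts of known authorship could be sketched roughly as follows. This is only an illustration of the idea, not a description of any existing tool: the marker word list, the toy English "texts" and the function names (`profile`, `distance`, `attribute`, `accuracy`) are all hypothetical, and a real test would need far longer samples and a more careful measure.

```python
from collections import Counter

# Hypothetical set of marker words; real studies use much longer lists.
FUNCTION_WORDS = ["the", "and", "of", "to", "in"]


def profile(text):
    """Relative frequency of each marker word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(sample, known):
    """Return the known author whose profile is closest to the sample."""
    sample_profile = profile(sample)
    return min(known, key=lambda a: distance(sample_profile, profile(known[a])))


def accuracy(known, held_out):
    """Share of held-out texts that are attributed to their real author."""
    correct = sum(attribute(text, known) == author for author, text in held_out)
    return correct / len(held_out)


# Invented toy data standing in for texts of known authorship.
known = {
    "A": "the cat and the dog and the bird",
    "B": "go to town to buy bread in time",
}
held_out = [("A", "the sun and the moon"), ("B", "back to work in town")]
print(accuracy(known, held_out))
```

If the accuracy on held-out texts of known authors is no better than guessing, there is little reason to trust the same measure on texts whose authorship is disputed.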
Besides, I do think it has some usefulness. There’s no question about that. The problem could be just the same: it is useful, utility-focused software, like any other method of analysis. Maybe not suitable for literary studies, but very useful for statistical purposes, indexes, etc., which are themselves a tool for literary criticism.
The fact that this is not the only way or the only approach doesn’t make it useless.
Thank you all.