Hypertext Sanskrit Tools III

The previous course explaining sandhi is actually needed only for a rational explanation of sandhi on phonetic principles. First of all, if your intention is to speak Sanskrit correctly, you do not need to know all the optional sandhi rules; you may stick to a deterministic subset, for instance the one defined by Coulson's sandhi charts. Then, apart from the rules concerning visarga, there is no need to learn these charts by heart or to be self-conscious about those rules. The important point to understand is that these rules are not arbitrary decisions of grammarians. They actually record what happens naturally in your vocal organs in the throat, mouth and nose through movements of the tongue, lips and glottis. The succession of phonemes provokes such movements, and sandhi smoothing corresponds to minimizing these movements in order to save energy. This happens naturally, and being self-conscious about it actually makes it awkward. Thus a mute consonant stays mute before other mutes (no air vibration) but naturally becomes voiced when followed by a voiced one that triggers the vocal cords, making the air vibrate. Hence the alternation p/b in the "p" column of the consonants chart. For the "t" column, this is a bit more complex, because the tongue is involved rather than the lips, and its displacement along the palate is anticipated by assimilation to the varga of the next consonant, saving energy by avoiding this tongue movement. This uniform phonetic transformation is actually better explained by the Pāṇinian rules, which factor these transformations in an algebraic way, quantifying over notions such as mute and voiced, and thus killing several birds with one stone. The difficulty with the Pāṇinian rules is just terminological, since you have to know the names of the characteristic sets of phonemes sharing a certain phonetic feature.
These names are themselves coined as condensed abbreviations (pratyāhāras), which are pairs (phoneme, marker) indexing a canonical list of pairs (list of phonemes, marker) called the śivasūtrāṇi ("शिवसूत्राणि").
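To make the pratyāhāra mechanism concrete, here is a minimal Python sketch of how a pair (phoneme, marker) may be expanded into the set of phonemes it names. The function name expand and its parameter nth are our own illustrative choices, not part of any Heritage tool. The marker Ṇ occurs twice in the list, so the nth parameter states which occurrence is meant; phonemes are written in IAST without their inherent vowel.

```python
# The śivasūtrāṇi as a list of pairs (list of phonemes, marker).
# Markers are written in upper case to tell them apart from phonemes.
SHIVA_SUTRAS = [
    (["a", "i", "u"], "Ṇ"),
    (["ṛ", "ḷ"], "K"),
    (["e", "o"], "Ṅ"),
    (["ai", "au"], "C"),
    (["h", "y", "v", "r"], "Ṭ"),
    (["l"], "Ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "M"),
    (["jh", "bh"], "Ñ"),
    (["gh", "ḍh", "dh"], "Ṣ"),
    (["j", "b", "g", "ḍ", "d"], "Ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "V"),
    (["k", "p"], "Y"),
    (["ś", "ṣ", "s"], "R"),
    (["h"], "L"),
]


def expand(start, marker, nth=1):
    """Expand the pratyāhāra (start, marker) into a list of phonemes.

    Collection begins at the first occurrence of `start` and stops at the
    nth occurrence of `marker` met after that point (hypothetical parameter,
    needed because the marker Ṇ appears twice).
    """
    result = []
    started = False
    seen = 0
    for phonemes, sutra_marker in SHIVA_SUTRAS:
        for phoneme in phonemes:
            if not started and phoneme == start:
                started = True
            if started:
                result.append(phoneme)
        if started and sutra_marker == marker:
            seen += 1
            if seen == nth:
                return result
    raise ValueError("no such pratyāhāra: " + start + marker)


vowels = expand("a", "C")             # aC: the vowels
voiced_stops = expand("j", "Ś")       # jaŚ: the voiced unaspirated stops
voiced_consonants = expand("h", "Ś")  # haŚ: the voiced consonants
```

With such sets at hand, a rule such as "a mute becomes voiced before a voiced sound" can be stated once for a whole class of phonemes, instead of being spelled out cell by cell in a chart.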

Furthermore, if you are mostly interested in reading Sanskrit texts, the problem of deciphering the continuous enunciation is the inverse of effecting sandhi, and this is more complex: firstly because there is no mark in the phonetic string of where sandhi is taking place (except in the case of avagraha), and secondly because the author of the text may have used optional rules, so you have to deal with such variations, and extra non-determinism creeps in.

This is where our tools can be used, since they know all possible ways of segmenting the text. Thus our Reader tool proposes all possible segmentations of a continuous utterance as padapāṭhas guaranteed to dissolve by sandhi into the input utterance. Your main problem with the use of the tool will be the possibly large number of segmentations. So let us first present the problem with the canonical paradigmatic example analysed by Patañjali 23 centuries ago: श्वेतोधावति. Here we see that the two interpretations make sense, so we have a genuine ambiguity. Such a situation is called śleṣa "embrace", evoking the two meanings intermixing. It is ambiguous, and if well crafted may provoke equivocal implications through the double meaning. Thus it was praised as one of the ten poetic qualities (guṇa), and Daṇḍī theorized it in his Kāvyādarśa. So let us look at one of his examples: नक्षत्रपथवर्ती राजा. This nominal phrase talks of a king "who is following the path of stars", and is fit for a court panegyric. But the other side of the śleṣa hints that the king "is not following the path of heroes".
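The idea of a segmentation "guaranteed to dissolve by sandhi into the input" can be illustrated with a toy generate-and-test sketch in Python. This is not the algorithm of the Reader tool: the four-word lexicon, the two sandhi rules and all function names below are assumptions chosen only to reproduce the two classical readings of श्वेतोधावति, namely śvetaḥ dhāvati and śvā itaḥ dhāvati.

```python
from itertools import product

# Initial letters of voiced consonants (IAST), enough for this toy example.
VOICED_INITIALS = {"g", "j", "ḍ", "d", "b", "n", "m", "y", "r", "l", "v", "h"}


def sandhi(left, right):
    """Join two strings with a tiny, deliberately incomplete set of rules."""
    # -aḥ before a voiced consonant becomes -o
    if left.endswith("aḥ") and right[0] in VOICED_INITIALS:
        return left[:-2] + "o" + right
    # -ā followed by i- merges into e
    if left.endswith("ā") and right.startswith("i"):
        return left[:-1] + "e" + right[1:]
    # otherwise plain concatenation
    return left + right


def glue(padas):
    """Apply sandhi from left to right over a sequence of padas."""
    text = padas[0]
    for pada in padas[1:]:
        text = sandhi(text, pada)
    return text


def segmentations(utterance, lexicon, max_padas=3):
    """Brute force: keep every pada sequence that glues back to the input."""
    found = []
    for n in range(1, max_padas + 1):
        for padas in product(lexicon, repeat=n):
            if glue(padas) == utterance:
                found.append(padas)
    return found


LEXICON = ["śvetaḥ", "śvā", "itaḥ", "dhāvati"]  # hypothetical mini-lexicon
readings = segmentations("śvetodhāvati", LEXICON)
for padas in readings:
    print(" ".join(padas))
```

Every sequence printed is by construction a padapāṭha that yields the input by sandhi. With a realistic lexicon the same principle produces many more candidates, most of them meaningless, which is exactly the over-generation discussed in the rest of this course.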

Note that here we are not just dealing with homonymy (abhaṅgaśleṣa), which is suggestive use of a reading with given padas, but with ambiguity of segmentation (sabhaṅgaśleṣa), using oronyms. Another standard example is: नतेन लिखितो लेखः. Did he write the letter crouching, or did he not write the letter?

There is also this story in the Kathāsaritsāgara about a Sātavāhana king frolicking in his bath with his queens. As he is splashing a lady, she protests: mā udakaṃ dehi deva "don't throw water at me, Your Majesty". But she is learned in Sanskrit, applies sandhi, and utters मोदकं देहि देव, at which the king reacts by asking a servant to give her sweets (modaka), and incurs her mockery.

Another interesting śleṣa is used in this grammatical saying: तद्धितमूढोवैयाकरणः तद्धितमूढोऽवैयाकरणः. It contrasts the expert linguist, who bears the burden of secondary derivatives (through taddhita suffixes) with the non-expert, who is stupid concerning them!

This discussion of śleṣa showed that we should not always expect a unique solution to the segmentation of Sanskrit text. Several meaningful solutions may appear, and sometimes this phenomenon is intended; it actually gave rise to a specific style of double-entendre poetry. This was combined with all kinds of poetic acrobatics, which posed combinatorial puzzles to the reader. An introduction to this style of poetry is presented by Yigal Bronner in "Extreme Poetry: The South Asian Movement of Simultaneous Narration", Columbia University Press, 2010.

This being said, with the Reader tool we are faced with a far more severe overgeneration than the aesthetic experiments of creative poets: we face artificial ambiguities of the vocal form, most of which are meaningless, but meanings are not sought at this stage, which deals only with form. This is especially striking with small words of one or two syllables, which tend to proliferate as noise polluting the correct solution. Let us look at a typical example: रामोवनङ्गच्छति.

We are surprised by the red rāmaḥ. It is a verbal form of the Vedic root रा: "we give". So it should be quickly erased by clicking on its red cross. As for the blue gacchati, this present participle (he going) in the locative would require a main verb, so it must be discarded. So in two clicks, we discriminate between 4 potential solutions. So don't panic if the tool lists an astronomical number of alleged "solutions": the number of clicks needed to decide the correct one is of the order of its logarithm, and you will soon learn to discard parasitic components.
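The logarithmic estimate can be checked with a few lines of Python. The sketch assumes, as an idealization, that each click settles one binary choice and thus halves the set of remaining solutions; in practice a single click may eliminate far more than half of them. The function name clicks_needed is ours.

```python
def clicks_needed(solutions: int) -> int:
    """Smallest k such that 2**k >= solutions (ceiling of the base-2 logarithm)."""
    if solutions < 1:
        raise ValueError("there must be at least one solution")
    # For n >= 1, (n - 1).bit_length() is exactly ceil(log2(n)), with no rounding issues.
    return (solutions - 1).bit_length()


print(clicks_needed(4))          # the example above: two clicks for 4 solutions
print(clicks_needed(1_000_000))  # a million alleged solutions need about 20 clicks
```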

At this point you should go over the various examples of śleṣa given above, and re-enter them in the Reader tool. You will see that the two different readings that we exhibited, leading to actual semantic ambiguities, are actually mixed with many parasitic segmentations, which may or may not be meaningful. For instance, श्वेतोधावति may be segmented in 22 different ways, most of which do not make sense.

Let us now process a more ambitious example, taken from Corpus:Heritage_citations.19: न हि वचनशतेनाप्य् अनारभ्योऽर्थः शक्योविधातुम्. Our corpus data is just the recording of successful solutions obtained with the help of the Reader tool, when you have the Annotator facility in the Corpus tool. So now we are going to replay the sequence of clicks needed to select the correct solution among 68 possible segmentations. We select the devanāgarī Sentence in the current display, and call the Reader tool with this input, which now displays a lot of parasitic "solutions", either nonsensical or redundant. Actually, on this example, as often, you obtain the correct solution just by checking the top rectangles in the graphic display, except for api, which should be popped up. Actually, checking the top śatena and anārabhyaḥ will automatically pop up api, and here two clicks are sufficient to obtain the desired segmentation. If you go strictly in left-to-right fashion, 3 clicks will be needed. This example should be examined carefully.

We have just seen an important heuristic of our graphical segmenter: select the maximally overlapping segment, which is generally presented in the first line of the display. This "cherry-picking" strategy has the advantage of bleeding most irrelevant small words, and the selected segment has the highest probability of being in the final solution. Also the large segments are usually content words, while grammatical words such as api are usually short, and conflict a lot with word endings or beginnings.
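The effect of this heuristic may be pictured with a toy interval model in Python. It is only a sketch under simplifying assumptions: each candidate segment is reduced to the span of input positions it covers, two distinct segments conflict when their spans overlap, and the sandhi at segment boundaries, which the real interface takes into account, is ignored. The segment names and positions are invented placeholders.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int  # index of the first input position covered
    end: int    # index one past the last input position covered
    form: str   # placeholder label for the candidate pada


def conflicts(a, b):
    """Two distinct segments conflict when their spans overlap."""
    return a != b and a.start < b.end and b.start < a.end


def select(chosen, candidates):
    """Selecting a segment discards every candidate in conflict with it."""
    return [s for s in candidates if not conflicts(chosen, s)]


def longest(candidates):
    """The cherry-picking choice: the candidate covering the most input."""
    return max(candidates, key=lambda s: s.end - s.start)


candidates = [
    Segment(0, 6, "LONG"),   # one long content word
    Segment(0, 2, "w1"),     # small parasitic words covering the same span
    Segment(2, 4, "w2"),
    Segment(4, 6, "w3"),
    Segment(6, 9, "NEXT"),   # the following word, untouched by the choice
]
best = longest(candidates)
remaining = select(best, candidates)
print([s.form for s in remaining])
```

A single selection of the long segment removes the three small words at once, which is why picking long segments first converges so quickly.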

Please note that you do not have to proceed from left to right; you may select/deselect in any order. Actually, it is usually better for semantic reasons to proceed from right to left, looking for the (red) verbal form, then subgoaling into its candidate kārakas. The verbal personal suffix gives the number of its agent (in the active voice), and thus may be used as a selector for the nominative (prathamā) noun form referring to it by its name. This in turn determines its gender, and further agreeing nominal forms may be searched to serve as qualifiers (viśeṣaṇa) of this qualified (viśeṣya) subject. The sequence of such nominal, adjectival, and participial forms constitutes the (possibly dislocated) noun phrase that refers to the agent. The same holds for the other roles (kāraka). One may thus unfold the meaning by the Socratic method, asking successive questions corresponding to progressive elucidation of the speaker's communicative intention (vivakṣā).

At this point of the course, you should train on using the segmenter on a variety of examples. For instance, go back to the little Vikramacarita story, take the raw input of each sentence (the blue unanalysed devanāgarī input), and copy-paste it in the Reader tool input field, setting Input convention to Devanagari, Lexicon access to Monier-Williams if you prefer English over French, and Sanskrit display font to Devanagari if you prefer it over IAST romanisation. Now click on the green and red check signs associated with each colored box in order to retrieve the reading that was proposed by the Corpus annotator. If you make a mistake, it is easy to backtrack using the Undo button.

When you are familiar with the tool, you may try more difficult sentences. For instance, you may use the corpus of citations from the Sanskrit Heritage dictionary, which contains about 1100 popular subhāṣitas (maxims) as well as grammatical quotations. In order to do this, click on the "Corpus" link in the green bar at the bottom of Heritage site pages, then click OK at the Capacity Reader menu (there is actually no other choice on the public server), then click on the first choice "Heritage_citations", and you will see a page with 1136 links, each leading to an analysed sentence. For instance, click on the 999th, kāvyaśāstravinodena kālo gacchati dhīmatām | vyasanena ca mūrkhāṇāṃ nidrayā kalahena vā ||. You will see the segmentation of this sentence chosen by the annotator to be recorded in the corpus, following its devanāgarī rendering काव्यशास्त्रविनोदेन कालो गच्छति धीमताम् व्यसनेन च मूर्खाणां निद्रया कलहेन वा. Now copy this devanāgarī string with your mouse, and open a new browser window at the Heritage site (English) or (French) according to your preferred dictionary (Monier-Williams or Heritage respectively). Now click on Reader, paste the sentence in the input window, select "Devanagari" in the Input convention menu, select your preference in the Sanskrit display font menu, and press "Read". You will thus get the segmenter display of the potential 25560 Solutions. Now click on a green check sign ✓ to select a segment, or on its red cross sign X to discard it. You will see the number of All solutions decrease rapidly. When you reach "Unique Solution" you are done, and you may compare to the intended solution stored in the corpus repository (in your initial window). If you make a mistake at any point, you may backtrack using the "Undo" button.

If you are attentive, in the course of your clicking you will see other choices appear in the segmenter interface, besides Undo. One is labeled "Filtered Solutions", another one "All N Solutions", a third one "UoH Analysis Mode". This happens when the number of potential solutions N is less than 100. Let us deal with the "All N solutions" check sign. It delivers a complete list of the taggings of all the current N potential solutions, in the deprecated style of the Tagging mode of the Reader. Its only merit is that it shows explicitly the sandhi rules applied between the padas.

The "Filtered Solutions" choice is more interesting. It filters out the solutions that do not satisfy a consistency criterion, and ranks the remaining ones by decreasing probability. For instance, in the previous example, if you hesitate between the blue and the red gacchati segments, it will return only the red one, since the blue present participle (शतृ), which expresses simultaneity with a main action, indeed needs a finite verbal form (तिङ्), i.e. a red segment. The filter uses a shallow parser that effects a simplified kāraka analysis. It is not as sophisticated as the parser available on the Saṃsādhanī site, but it is much faster.
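As an illustration of what such a consistency criterion may look like, here is a minimal Python sketch of the single check mentioned above: a solution containing a present participle but no finite verb is rejected. It is not the shallow parser of the tool; the tag names, the data layout and the function names are assumptions, and the ranking by probability is left out.

```python
FINITE = "finite verb"             # tiṅ form, shown as a red segment
PARTICIPLE = "present participle"  # śatṛ form, shown as a blue segment
NOUN = "noun"


def is_consistent(solution):
    """Reject a solution whose present participle lacks a finite verb."""
    tags = [tag for _pada, tag in solution]
    if PARTICIPLE in tags and FINITE not in tags:
        return False
    return True


def filtered(solutions):
    """Keep only the solutions passing the consistency check."""
    return [s for s in solutions if is_consistent(s)]


# Two candidate readings of रामोवनङ्गच्छति, differing only in the tag of gacchati.
with_finite_verb = [("rāmaḥ", NOUN), ("vanam", NOUN), ("gacchati", FINITE)]
with_participle = [("rāmaḥ", NOUN), ("vanam", NOUN), ("gacchati", PARTICIPLE)]

kept = filtered([with_finite_verb, with_participle])
print(len(kept))
```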

The solutions listed in this "Filtered Solutions" mode, as well as in the "All N solutions" mode, are each associated with a green check sign ✓, which permits the manual selection of the final solution. When you click on it, you find yourself facing a new interface, called the Sanskrit Parser Assistant, that will allow you to settle the remaining ambiguities, such as homophony and ambiguities of vibhakti or gender.

This new interface is also available from the "Unique Solutions" button. For instance, let us return to our kāvyaśāstravinodena kālo gacchati dhīmatām, with intended meaning "Smart people spend their time playing with poetry theory" (or poetry and science). Once we reach the intended solution, by any of the above methods, by clicking on its check symbol we find the following display. In this new interface, the left column lists the padapāṭha as yellow segments. The middle column lists, for each pada, its possible morphological tags for each homophone candidate. For instance, notice that the pada kālaḥ proposes two homophonic bases, corresponding to lexical entries kāla_1 and kāla_2. A quick look at their lexicon links tells that kāla_1 means "black" and kāla_2 means "time". We must resolve this choice here, by clicking on the corresponding radio button. Please note that buttons have been preselected at some default choice, which may not be the right one. In this particular example, kāla_1 has been preselected, so we must check that the selected homonym is indeed the one meaning "time". Another choice, for the final pada dhīmatām, concerns the gender, ambiguous between masculine "m." (पुंलिङ्ग) and neuter "n." (नपुंसकलिङ्ग). Here, since we are talking of intelligent persons, masculine is intended, but since it is preselected no action is needed. The third column should be ignored at this point. Once the ambiguities are resolved, two actions are possible. The first one consists in clicking on the first "Submit" button. What you obtain is the cleaned-up padapāṭha, where every segment now has a unique morphological tag, with proper links to the intended lexical entries. This Web page may then be saved in the user space.
Please note that there is still a possible ambiguity of compounding: here the compound kāvya-śāstra-vinodena may be interpreted as (kāvya-śāstra)-vinodena (the correct one, left associative) or as kāvya-(śāstra-vinodena) (dubious as "poetic amusement with knowledge").
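The number of such compounding ambiguities grows quickly with the number of components. The following Python sketch, whose function name is ours, enumerates all binary bracketings of a list of components, assuming that every compound is analysed as a nesting of two-member compounds.

```python
def bracketings(parts):
    """All ways to group a sequence of components into nested pairs."""
    if len(parts) == 1:
        return [parts[0]]
    results = []
    # Split the sequence at every position, then bracket each side recursively.
    for i in range(1, len(parts)):
        for left in bracketings(parts[:i]):
            for right in bracketings(parts[i:]):
                results.append((left, right))
    return results


for reading in bracketings(["kāvya", "śāstra", "vinodena"]):
    print(reading)
```

For three components there are two bracketings, the ones discussed above; for four components there are already five, and the count follows the Catalan numbers.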

The second possible action is found after the devanāgarī layout of the sentence as colored segments. The "Submit" button that follows it is a call to the Saṃsādhanī parser with the proper morphological alignment.

By playing with the segmenter, you will quickly realize that by selecting long segments you will very rapidly converge to a small number of ambiguities which will need more thinking. On the way, you will get rid of all the drivel of small forms of unusual roots or nominal forms, mixed with small mauve indeclinable forms like api, iva, iti, ca, etc., which are grammatical words. When you read a continuous text, the content words will be familiar because of the context, and thus the use of grammatical words will soon reduce to relevant places. The initial discomfort with an unfamiliar interface, as well as the fear of the astronomical numbers of potential solutions for long sentences, will soon vanish. Indeed the speed of the tool, the rapid convergence toward intended solutions and the good probability of success for Classical Sanskrit, together with its gentle learning slope, have made the tool a popular one for perusing Sanskrit text. Think of it as a video game and enjoy the experience.

Various problems with the segmentation tool.
There are two main problems with the segmenter. First of all, it depends on the completeness of our pada generator, since its algorithm predicts potential sandhi application using a pre-computed database of padas. This database may be incomplete, or may contain illegal forms. It is generated by our grammatical tools, using information about stems and roots stated in the Heritage dictionary. This may over-generate, typically by excessive generalization of the permitted morphological schemes. Here the risk is to have the segmentation interface cluttered with irrelevant forms. But it may also under-generate, typically by absence of some stem. In that case, either some part of the input will not get recognized, leading to a grey rectangle, or the tool will propose only irrelevant segmentations. If the problem is a missing nominal stem, and if you have installed the software locally on your own server, with annotator permission, it is possible to augment the lexical database by creating your own local supplementary lexicon.

Concerning the over-generation problem, it is mostly due to the tool not being informed of the relative frequencies of padas in Sanskrit texts. We are now working on a refinement of the segmentation tool, properly informed about such occurrence probabilities, and are investigating this as an experimental module. You may get a glimpse of the current state of this experimental module by using the two upper Modes of our Reader, labelled respectively "The Best!" and "Best n solutions".

At this point it is advisable to read the full User manual, in order to understand additional functionalities, pitfalls, and practical advice on how to chunk the input to help the segmenter.

When you are completely confident with your use of the various tools, you may try reading sentences from the extensive digital corpus GRETIL.

Credits.
Pawan Goyal implemented the graphic display of the segmenter, the alignment of the Monier-Williams dictionary with Heritage, and the User-aid facility. Idir Lankri implemented the Corpus mode. Sriram Krishna is working on the statistical segmenter in the framework of his doctoral research.

© Gérard Huet 2023