Hypertext Sanskrit Tools II

Phonetics, sandhi, and introduction to the Reader tool.

We are now going to discuss sandhi. Before getting into this topic, we must go a little bit into phonetics.

Natural language is a communication medium using vocal signals, i.e. modulations of air vibration produced by the organs of the mouth and analysed by the ear. This basic fact is often obscured by the predominance of writing in modern linguistic communication. Writing is a way to store linguistic information. It is a recent invention, merely a few millennia old in the earliest of cases, compared to articulated speech, which is coexistent with mankind. Writing was not for the layman; it was the professional occupation of scribes. Furthermore, reading was for a long time actually reading the text aloud. In the West, it is Saint Ambrose, Roman governor of Liguria in the 4th century, elected bishop of Milan by acclamation, a theologian and respected intellectual who acquired the title of Father of the Church, who is credited with the invention of silent reading. Visiting the library of a monastery, he was caught reading manuscripts silently, and was denounced to the Superior as using magic, and suspected of heresy. Monks knew only of reading sacred scriptures aloud at the communal meal! At the University, the Reader would read instruction material aloud to the students. In the Indian tradition, this is still the case, and actually the ācārya is supposed to have committed the text to memory, and to speak it aloud to the students without any written support.

Writing is often imperfect, and may be an encoding coined for an earlier stage of the language. In French, orthography is to a large extent absurd, although it is a sacred cow of school education. Thus "oiseau" (bird) is an absurd encoding of the phonetic string "wazo" it represents. The problem with most writing systems is that writing was coined in an experimental manner, without sound phonetic principles. In India, the situation is completely different. Preservation of the Veda induced sophisticated phonetic investigations in the prātiśākhyas, which dealt precisely with the problem of segmentation: how to deconstruct a vākya from its continuous utterance in the saṃhitāpāṭha recitation into its step-by-step padapāṭha word chain. All this phonetic science could be put to use by later linguists such as Pāṇini to explain grammar in formal terms. Formal to the point of being amenable to mathematical analysis, and computer implementation!

Pāṇini lived roughly 25 centuries ago, probably before writing was used in India. The first attested pieces of writing are the prakrit (prākṛta) inscriptions of Emperor Aśoka, one century later. And devanāgarī is a much later development of the script. So we may question the Indian Post Office rendition of Pāṇini as a scribe. What, then, is Pāṇini writing? Certainly not devanāgarī. Saying that he is writing in IAST or in WX notation would be a bit anachronistic, but it has a ring of truth, as we shall see.

Let us consider how our Reader analyses tacchrutvā, a very frequent sentence opening, signalling a change in turn of speech: typically, the current agent is going to act after hearing some statement, given in the previous sentence and ended by iti, the indirect speech end marker. It is analysed as a 2-pada chunk as follows: tacchrutvā. Here I show that I am not cheating by actually calling the Heritage Reader on तच्छ्रुत्वा. The 9-letter line t.a.c.ch.r.u.t.v.ā explains clearly the phonetic change: the sequence t|ś is transformed into the sequence cch. We do it again in Tagging mode, where the sandhi rule is clearly stated: t|ś → cch. I can also do it with printing in devanāgarī, but then the rule is muddled with virāmas. Aside for beginners in devanāgarī: letter त encodes syllable ta, and various extra decorations give ता for tā, ति for ti, ती for tī, तु for tu, तू for tū, तृ for tṛ, तॄ for tṝ, ते for te, तै for tai, तो for to and तौ for tau. And you must also use the special virāma subscript to denote just the consonant t. Furthermore the input chunk has been massaged into just 3 akṣaras, with the middle one an incredible jungle of glyphs च्छ्रु pasting together 3 consonants and a vowel. We have to deconstruct it in order to understand the sandhi, but there is no way to cut its picture into its proper components. In other words, the problem with devanāgarī is that it is a syllabic script which is not compositional as a graphic representation at the phonemic level. Each akṣara is a complex ligature of glyphs, as opposed to the sequence of sounds it represents. Extreme mixing of glyphs is witnessed in क्ष for kṣa and ज्ञ for jña.
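The phonemic reading of an IAST string can be illustrated with a small sketch. It is not the Reader's code: the function name `phonemes` and the table `DIGRAPHS` are hypothetical, and the only assumption is that aspirated stops and the diphthongs "ai" and "au" are the two-letter units of IAST.

```python
# Hypothetical helper: split an IAST string into phonemes (varnas).
# Aspirated stops and the diphthongs "ai"/"au" take two roman letters in IAST,
# so a two-letter unit is tried before a single letter.
DIGRAPHS = ("kh", "gh", "ch", "jh", "ṭh", "ḍh", "th", "dh", "ph", "bh", "ai", "au")

def phonemes(iast):
    out = []
    i = 0
    while i < len(iast):
        pair = iast[i:i + 2]
        if pair in DIGRAPHS:
            out.append(pair)
            i += 2
        else:
            out.append(iast[i])
            i += 1
    return out

print(".".join(phonemes("tacchrutvā")))  # t.a.c.ch.r.u.t.v.ā
```

The string has ten roman letters but nine phonemes, which is the count given by the Reader's display.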

Aside: IAST is not a fully satisfactory phonemic notation either. Note that phoneme ch uses two roman letters, whereas it should be a one-character ligature. Only wacCruwvA (WX) and tacCrutvA (SLP1) are satisfactory roman encodings, although not perfect for human usage... An extreme case is वणिग्हसति (the merchant is laughing), which cannot be input in either VH or IAST, but may be input in WX as "vaNighasawi" (whereas its sandhi variant वणिग्घसति may be input as "va.nigghasati" in VH, "vaṇigghasati" in IAST, or "vaNigGasawi" in WX). If you are not equipped with a devanāgarī keyboard and are confused by the variety of transliteration schemes, please refer to Transliteration schemes at a glance.
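To make the comparison of schemes concrete, here is a minimal conversion sketch from IAST to the one-character-per-phoneme encodings SLP1 and WX. The three tables follow the usual published correspondences; the function `transliterate` is a hypothetical name, not part of our platform.

```python
# Parallel tables: the n-th entry of each list denotes the same phoneme.
IAST = ("a ā i ī u ū ṛ ṝ ḷ e ai o au ṃ ḥ k kh g gh ṅ c ch j jh ñ "
        "ṭ ṭh ḍ ḍh ṇ t th d dh n p ph b bh m y r l v ś ṣ s h").split()
SLP1 = ("a A i I u U f F x e E o O M H k K g G N c C j J Y "
        "w W q Q R t T d D n p P b B m y r l v S z s h").split()
WX = ("a A i I u U q Q L e E o O M H k K g G f c C j J F "
      "t T d D N w W x X n p P b B m y r l v S R s h").split()

def transliterate(iast, target):
    """Convert an IAST string to `target` (SLP1 or WX), longest unit first."""
    table = dict(zip(IAST, target))
    out = []
    i = 0
    while i < len(iast):
        pair = iast[i:i + 2]
        if pair in table:              # aspirates and diphthongs
            out.append(table[pair])
            i += 2
        else:                          # single letter; unknown symbols pass through
            out.append(table.get(iast[i], iast[i]))
            i += 1
    return "".join(out)

print(transliterate("tacchrutvā", SLP1))   # tacCrutvA
print(transliterate("tacchrutvā", WX))     # wacCruwvA
print(transliterate("vaṇigghasati", WX))   # vaNigGasawi
```

Feeding the unsandhied "vaṇighasati" to this converter yields "vaNiGasawi", not the intended "vaNighasawi": the IAST letters "gh" cannot tell the aspirate from "g" followed by "h", which is exactly the limitation pointed out above.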

So now we should understand that (except in exceptional situations as noted above), IAST (or any alternative phonemic representation) is better than devanāgarī to explain phonetic processes such as sandhi. And the point is not that it is a Western notion, since Pāṇini himself discusses these topics at the phonemic level, using the notion of varṇa. So we need now to discuss this notion.

Let us have a look at the varṇamālā. It looks like the English alphabet {a,b,c, ...}. But it is not an alphabet of glyphs for writing, but of phonetic segments or articulated sounds, scientifically arranged according to their place of articulation in the mouth. Contrast this with the various sounds corresponding to the pronunciation of "a" in English, as in alphabet (alfəbɛt), paper (peɪpə), parcel (pɑːsəl), place (pleɪs) etc. Conversely, a sound like "f" may be written with an "f" as in "fire" or with the sequence "ph" as in "pharmacy", the second spelling being justified for words having a Greek origin. In French, the spelling of the word "nénuphar" is the topic of fierce battles, evoking medieval disputes about the sex of angels. Does that mean that the varṇamālā is a substitute for the International Phonetic Alphabet? No, and this for at least two reasons. Firstly, not all sounds used in natural speech are used in Sanskrit. For instance, the sounds "f", "z", French "u", etc. are not used. Secondly, these varṇas are not very precisely defined as sounds or even sound neighborhoods. What matters is their mutual contrast, which gives them the power to discriminate between the words of the language. They correspond to the notion of phoneme, coined in the West by the Prague school of linguistics in the 20th century, while this notion was implicit among ancient Sanskrit linguists under the name of varṇa. What matters is that distinct phonemes ought to disambiguate similarly sounding words in order to avoid homophonic ambiguities.

This is exemplified in the following saying from the Subhāṣitaratnakośa: यद्यपि बहुनाधीषे तथापि पठ पुत्र व्याकरणम् स्वजनः श्वजनः मा भूत् सकलम् शकलम् सकृच्छकृत् "O son, even though you have studied a lot, you must now learn grammar, and say svajanaḥ and not śvajanaḥ, sakalam and not śakalam, sakṛt and not śakṛt." The saying warns of the semantic confusion arising from mispronouncing sa as śa, by exhibiting pairs of opposites which would be confused: folks with dogs, intact with broken, suddenly with shit. This humorous statement is actually a very serious assertion of the mutual strength of repulsion of the phonemes s (स्) and ś (श्). In modern linguistic terminology, this subhāṣita exhibits three contrasting pairs for the two phonemes. So phonemes are discriminating sounds relative to the vocabulary of the language, mutually repelling magnets in the phonetic space, so to speak.

Phonemes are not universal across all human languages. American English has a different set of phonemes than Oxford English. Bengali does not distinguish between श, ष and स. So when Raghunātha Śiromaṇi left Bengal to study vyākaraṇa in Mithilā, at the first lesson he protested that there were three letters for the same sound. It could be for him that this subhāṣita was coined! Similarly, Bengali, as well as Japanese, does not distinguish between ba (ब) and va (व). The similarity of the glyphs of the two akṣaras is revealing of their weak mutual distinction, as witnessed in variations like Kubera/Kuvera.

Complement. We give another version of the varṇamālā, where every phoneme is given its phonetic features, extracted from "Phonetics in Ancient India", by W.S. Allen, Oxford University Press, 1953. We also give a link to a more recent document giving the assumed original pronunciation of Sanskrit, in terms of the International Phonetic Alphabet (IPA).

After this long discussion about varṇas, we are coming at last to sandhi, that is phonetic gluing discretized as phonemic transformation. Now we have to explain a subtle point about sandhi. We have to distinguish so-called internal sandhi, happening within word formation when gluing together morphemes (verbal roots, suffixes and prefixes), from external sandhi, happening when linking together words (padas) in a continuous enunciation within a sentence. The two phenomena happen in Western languages such as English and French. Let us look for instance at English. Internal sandhi may be noticeable in the orthography of words, for instance when deriving the substantive "absorption" from the verb "to absorb": the voiced consonant "b" is muted by the mute consonant "t" and turned into a "p". Look at the pavarga of the varṇamālā: this is just a column shift. But this is rare; usually the phonetic transformation is not signaled in writing, like the difference between "cats" and "dogs", where the second is spoken as "dogz". So you have to learn pronunciation from usage, rather than looking at the written form as a phonetic notation. Concerning the linking of words within a sentence, first the convention for writing an English sentence is really that of its padapāṭha: words are separated with blanks, and sandhi is reduced to liaison, not marked in writing. So the problem of sandhiviccheda does not occur in reading text, only in disambiguating continuous speech. Thus there is no problem in disambiguating English "a notion" from "an ocean" in writing, or similarly "I scream" from "ice-cream", but orally we have a real ambiguity, and this is precisely the sandhiviccheda problem. The problem arises in Sanskrit reading not because of a defect of devanāgarī, but on the contrary because of its quality of being a perfect encoding of speech!

Let us try to locate a similar alternation of "b" and "p" in Sanskrit. For instance, ap (water) may form a compound with a form of ja (born) to form abjaḥ - born from water, an epithet of Brahmā. Here, it is the mute "p" that gets voiced by contact with voiced "j" to turn into "b", and the corresponding sandhi rule is "p|j → bj". More generally, a mute consonant becomes voiced in front of voiced consonants, vowels and semi-vowels. And a final vowel is invariant in front of a consonant. The only exception is that a final short vowel changes "ch" into "cch", this change being optional if the vowel is long. Thus काष्ठच्छित् from काष्ठ-छित् for the wood-pecker.

Before going further, let us discuss final sandhi, the change that occurs at the end of the last pada of a sentence. The rule is simple. If the last phoneme is "r" or "s", it turns into visarga "ḥ" (written ः in devanāgarī), and if it is a voiced consonant, it gets muted. More precisely, it is turned into the corresponding mute unaspirated consonant in the same varga, except for the palatals (cavarga, tālavya), where "c" or "j" may turn into guttural "k" (क्), as in vāk, or retroflex "ṭ" (ट्), as in svarāṭ; also final "h" (ह्) has a more complex treatment, which we shall not discuss at this point. But the end result is that a Sanskrit pada may have as final varṇa only a vowel, a visarga, one of the mute stops "k", "ṭ", "t", "p", or one of the three nasals "ṅ", "n" and "m".
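The final sandhi rule just stated is small enough to be written down as a table. The sketch below is only an illustration under simplifying assumptions: `final_sandhi` and `FINAL` are hypothetical names, the choice between "k" and "ṭ" for palatals is left to the caller since it depends on the stem, and final "h" is not treated.

```python
# Last phoneme -> its form at the end of a sentence.
FINAL = {
    "r": "ḥ", "s": "ḥ",
    "kh": "k", "g": "k", "gh": "k",     # gutturals
    "ṭh": "ṭ", "ḍ": "ṭ", "ḍh": "ṭ",     # retroflexes
    "th": "t", "d": "t", "dh": "t",     # dentals
    "ph": "p", "b": "p", "bh": "p",     # labials
}

def final_sandhi(word, palatal="k"):
    """Put `word` in sentence-final form; `palatal` is "k" or "ṭ" for final c/j."""
    for size in (2, 1):                 # try an aspirate (two letters) first
        tail = word[-size:]
        if tail in ("c", "j"):
            return word[:-1] + palatal
        if tail in FINAL:
            return word[:-size] + FINAL[tail]
    return word                         # vowels, nasals, mute stops, visarga: unchanged

print(final_sandhi("vāc"))                   # vāk
print(final_sandhi("svarāj", palatal="ṭ"))   # svarāṭ
print(final_sandhi("punar"))                 # punaḥ
```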

Now for the good news: (external) sandhi of pada p1 and pada p2 may be effected by first putting p1 into final sandhi, and then doing sandhi with p2. Only exception: final "ar" or "ār" should be preserved in front of vowels, semi-vowels, voiced stops, and "h". Thus punarapi, punarjanma, punarvādaḥ, etc. And NOT *puno'pi which would result from punaḥ-api. NB. This exception is actually rare, it concerns only 3 indeclinable padas: antar, punar and prātar, and a few vocative forms like mātar and pitar. Thus prātargacchati "he goes early in the morning". For "ār" this is even rarer, for archaic words dvār and vār. But now the above rule allows us to consider a very small number of cases for the possible left contexts of sandhi.

Thus the only cases to consider are explained in the following sandhi grids, extracted from the "Teach yourself" course of Michael Coulson: consonants and vowels. In the consonant chart, the footnote 4 refers to the exception for vowel "a": the sandhi of final "aḥ" and initial "a" is « o' », where the avagraha « ' » (devanāgarī ऽ) is not pronounced, but marks in writing the elision of initial "a". This is further discussed below.

These charts should be consulted for reference in case of doubt; they do not have to be committed to memory, especially if your goal is not to speak Sanskrit, but to read Sanskrit with the help of our tools, since they know not only how to effect sandhi, but also how to analyse it, which is harder since ambiguities arise. Thus we shall in this course just point out a few difficulties.

The sandhi engine of our platform is available from the green control bar at the bottom of entry pages, like here. Here we shall demo the Sandhi engine on तच्छ्रुत्वा. We have seen that the Reader indicates the analysis tat|śrutvā, and indeed the sandhi engine answers:
तत् | श्रुत्वा = तच्छ्रुत्वा
consistently with the sandhi grid for consonants above:
t|ś → cch.
However, note the warning: "NB. Other sandhi solutions may be allowed." This points out that actually sandhi is not fully deterministic, there are cases where some other optional rule may be used. Indeed, one such optional rule is:
t|ś → cś
Actually, this is the sandhi rule explained in sūtra (8,4,40), yielding form तच्श्रुत्वा, and it is a further sūtra (8,4,63), that authorizes, optionally, तच्छ्रुत्वा. This is actually the most common spelling of this pada combination. It is to be remarked that, although this form is suggested by our sandhi engine, our Reader allows the recognition of the alternative form तच्श्रुत्वा, witness: तच्श्रुत्वा.

NB. This is a typical case where "optionally" in Pāṇini means "preferably".
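The non-determinism of sandhi can be pictured as a rule table whose entries list several admissible outcomes, the preferred one first. This is a toy sketch with a single rule, not the data structure of our sandhi engine; `RULES`, `join` and `analyse` are hypothetical names.

```python
# (final of p1, initial of p2) -> admissible outcomes, preferred one first.
# "cś" is the outcome of sūtra (8,4,40); "cch" is allowed by sūtra (8,4,63).
RULES = {("t", "ś"): ["cch", "cś"]}

def join(p1, p2):
    """All sandhied forms of p1 followed by p2 (plain concatenation if no rule)."""
    outcomes = RULES.get((p1[-1], p2[0]))
    if outcomes is None:
        return [p1 + p2]
    return [p1[:-1] + o + p2[1:] for o in outcomes]

def analyse(chunk):
    """Candidate (p1, p2) pairs whose sandhi may yield `chunk`; no lexicon check."""
    pairs = []
    for (final, initial), outcomes in RULES.items():
        for o in outcomes:
            i = chunk.find(o)
            if i >= 0:
                pairs.append((chunk[:i] + final, initial + chunk[i + len(o):]))
    return pairs

print(join("tat", "śrutvā"))    # ['tacchrutvā', 'tacśrutvā']
print(analyse("tacchrutvā"))    # [('tat', 'śrutvā')]
print(analyse("tacśrutvā"))     # [('tat', 'śrutvā')]
```

Generation proposes both spellings, while analysis maps either spelling back to the same pair of padas, which mirrors the behaviour of the sandhi engine and of the Reader described above.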

One very common sandhi rule is that visarga following short "a" turns into "o" in contact with voiced consonants and semi-vowels, whereas it is erased before vowels, with a hiatus that is marked by a space in the devanāgarī string. Thus राम उज्जयिनीं गच्छति. Please note the underscore in the display, which marks a space recognized by the computer as mandatory. But there is an important exception, when the vowel is short "a". In that case, visarga turns into "o", and the "a" is erased and replaced by a dummy avagraha, as in: रामोऽयोध्यां गच्छति.
The avagraha is not a phonetic element, it is just the ghost of the erased initial "a". But it is important in writing, since it disambiguates. This avagraha mark is a recent invention, since it concerns writing, and it is not formally required. However, our software demands it, and modern editions of Sanskrit text generally mark it.
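The three cases for final "aḥ" can be sketched as follows, in IAST, with the avagraha written as an apostrophe. The function name and the two tables are hypothetical, and the case of a following mute consonant is deliberately left untouched, since it is not covered by the rule above.

```python
VOWEL_STARTS = ("a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "o")
VOICED_STARTS = ("g", "ṅ", "j", "ñ", "ḍ", "ṇ", "d", "n", "b", "m",
                 "y", "r", "l", "v", "h")

def join_after_short_a(p1, p2):
    """Sandhi of p1 ending in "aḥ" with p2, returned as a written string."""
    if not p1.endswith("aḥ"):
        raise ValueError("p1 must end in short a followed by visarga")
    stem = p1[:-2]
    if p2.startswith("a") and not p2.startswith(("ai", "au")):
        return stem + "o'" + p2[1:]     # initial short a elided, avagraha written
    if p2.startswith(VOWEL_STARTS):
        return stem + "a " + p2         # visarga erased, mandatory hiatus
    if p2.startswith(VOICED_STARTS):
        return stem + "o " + p2         # visarga turns into o
    return p1 + " " + p2                # mute consonant: outside this sketch

print(join_after_short_a("rāmaḥ", "ujjayinīṃ"))   # rāma ujjayinīṃ
print(join_after_short_a("rāmaḥ", "ayodhyāṃ"))    # rāmo'yodhyāṃ
print(join_after_short_a("rāmaḥ", "gacchati"))    # rāmo gacchati
```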

This sandhi case is very frequent, notably after nominative (prathama) singular (ekavacana) of masculine stems (prātipadika) in "a", which are very frequent. Note, for a feminine stem such as ambā, the different treatment of the feminine counterpart of the above sentence: अम्बोज्जयिनीं गच्छति, consistently with the vowel grid above.

Another frequent situation concerns visarga following long "ā", the form of the stems in nominative plural (bahuvacana). When the following word starts with a vowel, semivowel or voiced consonant, the visarga is dropped, and replaced by hiatus (oral pause, spacing in writing). Thus for instance: अनर्थका मन्त्राः.

Let us now review sandhi occurring after a visarga following a vowel different from "a" or "ā". When the following word begins with a vowel, a semi-vowel different from "r", or a voiced consonant, the visarga changes to "r". If it begins with "k", "kh", "p", "ph", a sibilant, or at the end of the sentence, it stays as visarga. Before "t" or "th", it changes into "s"; before "c" or "ch", it changes into "ś"; before "ṭ" or "ṭh", it changes into retroflex "ṣ"; but before "r", it is deleted, and if its preceding vowel is short, it is lengthened.

Examples.
nis+raktam=nīraktam "without colors".
catur+rātram=catūrātram "during four days".
punar+ramate=punāramate "he enjoys again".
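The rule for visarga after a vowel other than "a" or "ā" can be sketched as a function returning the surface form of the left pada. Names are hypothetical, and the input is assumed to be already in final sandhi form (niḥ for nis, catuḥ for catur). The third example above, with punar, involves a final "ar" and is outside the scope of this sketch.

```python
LONG = {"i": "ī", "u": "ū", "ṛ": "ṝ"}
R_CONTEXT = ("a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "o",   # vowels
             "y", "l", "v",                                          # semi-vowels except r
             "g", "ṅ", "j", "ñ", "ḍ", "ṇ", "d", "n", "b", "m", "h")  # voiced consonants

def visarga_surface(p1, p2):
    """Surface form of p1 (ending in visarga after a vowel other than a/ā) before p2."""
    if not p1.endswith("ḥ") or p1[-2:-1] in ("a", "ā"):
        raise ValueError("expected visarga after a vowel other than a or ā")
    stem, first = p1[:-1], p2[:1]
    if first == "r":                    # deleted, preceding short vowel lengthened
        if stem[-2:] not in ("ai", "au") and stem[-1] in LONG:
            stem = stem[:-1] + LONG[stem[-1]]
        return stem
    if first == "t":
        return stem + "s"
    if first == "c":
        return stem + "ś"
    if first == "ṭ":
        return stem + "ṣ"
    if first in R_CONTEXT:
        return stem + "r"
    return p1                           # k, kh, p, ph, sibilant, or end of sentence

print(visarga_surface("niḥ", "raktam") + "raktam")     # nīraktam
print(visarga_surface("catuḥ", "rātram") + "rātram")   # catūrātram
```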

Final "m". The consonant chart shows that final "m" generally turns into anusvāra ("ṃ"), except before vowels and at the end of sentences, where it stays as "m". But before stops (the five vargas), it is actually equivalent to the homorganic (savarga) nasal. It is then parasavarṇa. In that case, our software normalizes it into the corresponding nasal. This may puzzle you, since this may induce a ligature which is avoided by the anusvāra notation. Thus, if you enter वनंगच्छति (vanaṃgacchati), it will display it as वनङ्गच्छति (vanaṅgacchati) in the segmenter interface. Please note that the inverse transformation would not be valid: saṅgaḥ should not be written as saṃgaḥ, since it derives from root sañj, and there is no opportunity for anusvāra to arise.
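This normalization is easy to picture. The following sketch is an assumption about how one could do it, not our implementation: `normalize_anusvara` and `NASAL_OF` are hypothetical names, and only an anusvāra immediately followed by a stop is rewritten.

```python
# First letter of a stop -> nasal of the same varga.
NASAL_OF = {}
for stops, nasal in ((("k", "g", "ṅ"), "ṅ"), (("c", "j", "ñ"), "ñ"),
                     (("ṭ", "ḍ", "ṇ"), "ṇ"), (("t", "d", "n"), "n"),
                     (("p", "b", "m"), "m")):
    for stop in stops:
        NASAL_OF[stop] = nasal

def normalize_anusvara(text):
    """Replace each anusvāra directly followed by a stop by the matching nasal."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1:i + 2]         # empty string at the end of the text
        if ch == "ṃ" and nxt in NASAL_OF:
            out.append(NASAL_OF[nxt])
        else:
            out.append(ch)
    return "".join(out)

print(normalize_anusvara("vanaṃgacchati"))   # vanaṅgacchati
print(normalize_anusvara("saṃskṛtam"))       # saṃskṛtam (sibilant: unchanged)
```

The function goes in one direction only, in line with the remark that a nasal such as the "ṅ" of saṅgaḥ must not be turned back into anusvāra.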

Another interesting case occurs for sandhi of final "n" in front of a vowel. अस्मिन्नर्थेऽत्रभवन्तं प्रमाणीकरोमि: "I request Sir to decide on this matter". See how pada asmin joins with arthe into continuous asminnarthe (अस्मिन्नर्थे) "in this matter". This gemination is conditioned by the fact that the vowel preceding "n" is short, here "i".

Here is the actual code in our sandhi compiler:
(Short_vowel,("n","a","nna"), [ (8,3,32) ])
Note that Pāṇini has a more concise rule, since he quantifies over condensed notations (pratyāhāra), so his sūtra generates many rules in our low-level formalism. In terms of specification languages, Pāṇini's rules are a higher level language than conditional rewrite rules on varṇa strings. He was induced to give abstract statements because of his care for lāghavam.
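The gap between one sūtra and many low-level rules can be illustrated by expanding a quantified statement into concrete triples of the shape shown above. This is a sketch under an assumption of ours: that the rule ranges over the final nasals ṅ, ṇ, n and over all initial vowels. The names `expand` and `apply_rule` are hypothetical, and the tuples only imitate the format of the compiler line.

```python
SHORT_VOWELS = ("a", "i", "u", "ṛ", "ḷ")
VOWELS = ("a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "ai", "o", "au")
NASALS = ("ṅ", "ṇ", "n")        # assumed range of the quantification

def expand():
    """One (condition, (final, initial, result), sūtras) triple per nasal and vowel."""
    return [("Short_vowel", (n, v, n + n + v), [(8, 3, 32)])
            for n in NASALS for v in VOWELS]

RULES = expand()

def apply_rule(p1, p2):
    """Geminate the final nasal of p1 before a vowel if a short vowel precedes it."""
    short_before = p1[-2:-1] in SHORT_VOWELS and p1[-3:-1] not in ("ai", "au")
    if short_before:
        for _cond, (final, initial, result), _sutras in RULES:
            if p1.endswith(final) and p2.startswith(initial):
                return p1[:-1] + result + p2[len(initial):]
    return p1 + p2

print(len(RULES))                      # 39
print(apply_rule("asmin", "arthe"))    # asminnarthe
print(apply_rule("tān", "api"))        # tānapi (long vowel: no gemination)
```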

The above example अस्मिन्नर्थेऽत्रभवन्तं प्रमाणीकरोमि illustrates also a tricky sandhi rule between vowels: initial "a" is erased after final "e" or "o". But here we demand that the erasure of "a" be indicated explicitly with avagraha (ऽ), which has no phonetic realization but just indicates pada boundary. We shall come back to this situation below.

Also note the orange color, signaling a verbal compound with a cvi stem in ī. It is not a pada, but a prefix of an auxiliary tiṅanta. It has an inchoative meaning, of transformation. More generally, root forms may be prefixed with upasargas (preverbs) and certain specific forms (gati).

We finally review the vowel-vowel case of sandhi. First, savarṇa vowels merge into the savarṇa long vowel. This induces an ambiguity of four cases in the analysis. This is especially noticeable in the frequent case of "ā". Consider for instance सेनाधिपतिः. This example ought to be studied carefully, in order to yield the meaning "army general" in the masculine gender.

Otherwise, "a" or "ā" raise the grade (to guṇa or vṛddhi) of a following vowel, and the other vowels turn into the corresponding semi-vowel. The only tricky cases are the diphthongs "e", "o", "ai", "au". The first two turn into "a" followed by a hiatus before any vowel other than "a"; before initial "a" they stay unchanged and the "a" is erased, but we demand that an avagraha marks this elision. Final "ai" becomes "ā" followed by a hiatus, and final "au" becomes "āv" without hiatus.
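The whole vowel-vowel case fits in a short function. The sketch below restates the rules just given on IAST strings, with the avagraha written as an apostrophe and the hiatus as a space; it is an illustration with hypothetical names, not the code of our sandhi engine, and it ignores pragṛhya and other exceptions.

```python
CLASS = {"a": "ā", "ā": "ā", "i": "ī", "ī": "ī", "u": "ū", "ū": "ū", "ṛ": "ṝ", "ṝ": "ṝ"}
GRADE = {"i": "e", "ī": "e", "u": "o", "ū": "o", "ṛ": "ar", "ṝ": "ar", "ḷ": "al",
         "e": "ai", "ai": "ai", "o": "au", "au": "au"}
SEMI = {"i": "y", "ī": "y", "u": "v", "ū": "v", "ṛ": "r", "ṝ": "r"}
VOWELS = set(CLASS) | set(GRADE)

def vowel_sandhi(p1, p2):
    """Sandhi of p1 (final vowel) with p2 (initial vowel), as a written string."""
    cut = 2 if p1[-2:] in ("ai", "au") else 1
    stem, f = p1[:-cut], p1[-cut:]
    cut = 2 if p2[:2] in ("ai", "au") else 1
    i, rest = p2[:cut], p2[cut:]
    if f not in VOWELS or i not in VOWELS:
        raise ValueError("expected a final vowel and an initial vowel")
    if f in CLASS and CLASS.get(i) == CLASS[f]:
        return stem + CLASS[f] + rest           # savarṇa vowels merge into the long one
    if f in ("a", "ā"):
        return stem + GRADE[i] + rest           # a or ā raise the grade of the next vowel
    if f in SEMI:
        return stem + SEMI[f] + i + rest        # other simple vowels become semi-vowels
    if f in ("e", "o"):
        if i == "a":
            return stem + f + "'" + rest        # initial a elided, avagraha written
        return stem + "a " + i + rest           # a followed by hiatus
    if f == "ai":
        return stem + "ā " + i + rest           # ā followed by hiatus
    if f == "au":
        return stem + "āv" + i + rest           # āv without hiatus
    raise ValueError("final vowel not covered by this sketch")

print(vowel_sandhi("senā", "adhipatiḥ"))   # senādhipatiḥ
print(vowel_sandhi("arthe", "atra"))       # arthe'tra
print(vowel_sandhi("vane", "iva"))         # vana iva
```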

The principal difficulty is final "e" (frequent with locatives) in front of initial non-"a" vowel, because there is an ambiguity with final "aḥ" (frequent with nominative): we get "a" followed by hiatus (i.e. a mandatory space in writing). A similar ambiguity exists with final "ai" in front of any vowel: it turns into "ā" followed by a hiatus, and is thus ambiguous with a form in "āḥ", but forms ended in "ai" are not frequent (mostly datives).

Compare वन इव (as if in the forest) with कूप इव (as if in the well). The first one is not ambiguous, with padapāṭha: वने इव, whereas the second one could be कूपे इव or कूपः इव. A similar problem occurs with "iti", the indirect speech closing particle. Thus in Mahābhārata, we find King Yudhiṣṭhira stating: योत्स्ये विगतकल्मषः "I shall fight without committing sin". This echoes Arjuna's despondency: न योत्स्ये "I shall not fight". But actually, in Bhagavadgītā, this is told by Sañjaya to King Dhṛtarāṣṭra as a quotation: न योत्स्य इति गोविन्दम् उक्त्वा तूष्णीं बभूव ह and now yotsye has become yotsya because of sandhi. You should thus be very careful about this situation: final "a" followed by space followed by an initial vowel could hide an "e" marking either a sup locative or an ātmanepadi or passive conjugation tiṅ suffix. For instance, look at rice cooking. The root is pac. In parasmaipadi style, you may say pacatyodanam (पचत्योदनम्) by sandhi of pacati and odanam: "(he) cooks rice". But you may also say, in ātmanepadi style: pacyata odanaḥ (पच्यत ओदनः), sandhi of pacyate and odanaḥ, "rice is cooking". So please note the necessary hiatus, marked by a space in the devanāgarī string, and by an underscore in the diagram of the Reader. This phenomenon may also happen to words ending with vowel "o", but this is much rarer.
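The ambiguity can be made explicit by enumerating what a surface form ending in "a" or "ā" before a hiatus may hide, and filtering the candidates with a lexicon of inflected forms. The sketch is only illustrative: `candidates`, `analyse` and the four-word `LEXICON` are hypothetical, and the toy lexicon encodes the fact that vana is neuter, so that no nominative form "vanaḥ" exists.

```python
def candidates(surface):
    """Possible underlying forms of a pada seen before a hiatus and a vowel."""
    if surface.endswith("ā"):
        stem = surface[:-1]
        return [stem + "āḥ", stem + "ai"]
    if surface.endswith("a"):
        stem = surface[:-1]
        return [stem + "aḥ", stem + "e", stem + "o"]
    return [surface]

# Toy lexicon of inflected forms, for illustration only.
LEXICON = {"vane", "kūpe", "kūpaḥ", "yotsye"}

def analyse(surface):
    return [form for form in candidates(surface) if form in LEXICON]

print(analyse("vana"))     # ['vane']
print(analyse("kūpa"))     # ['kūpaḥ', 'kūpe']
print(analyse("yotsya"))   # ['yotsye']
```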

Please note that initial "h" following a final consonant turns into the savarṇa voiced aspirate. Thus taddhitaḥ for tat+hitaḥ.

Exercises. Try to segment the following phrases in concord with the English meaning
अप्यवगच्छसि "do you understand?"
तन्नेच्छति "he does not want that"
तल्लिङ्गम् "this sign"
पुनाराजाभिषेकः "consecration of a new king"
पुनश्चितिः "piling up again"
बालष्षष्ठः "the 6th boy"
शिशुश्शेते "the child is sleeping"

Important irregularities. Special case for pronoun sa, in nominative, saḥ. When followed by a consonant, the visarga is dropped, and replaced by hiatus. Thus स कृष्णः, but as usual सोऽग्निः, स आचार्यः, अश्वः सः. Forgetting the hiatus will bring confusions with stems starting with "sa", and they are extremely frequent, since "sa" is a productive prefix (sagotra, sajala, etc.)

There is also a rule that applies to the terminations "ī", "ū" and "e" of substantives in the dual number (dvivacana): they remain unchanged before vowels, thus ते फले इच्छामः "we want those two fruits". This hiatus situation is called pragṛhya (प्रगृह्य).

Where to effect sandhi, and which kind. Firstly, continuous enunciation is not always mandatory. Grammarians say: संहितैकपदे नित्या नित्या धातूपसर्गयोः नित्यासमासे वाक्ये तु सा विवक्षामपेक्षते: "Continuous enunciation is mandatory within a word, between a root and its preverb, and within a compound, but between words it must conform to the speaker's intention". Thus sandhi is not strictly necessary between padas. But this maxim hides several difficulties. Actually, sandhi effected at morphology formation, for vibhakti as well as for preverb (upasarga) prefixing of roots, is of a stronger nature than the external sandhi between words and between compound components, since it incurs retroflexion. Thus, in the same way that the tṛtīya form of rāma is rāmeṇa, with retroflexion of the "n", verbal forms for roots prefixed by an upasarga containing "r" are subject to retroflexion, like praṇayati. This extends to participles (kṛdanta) like praṇītaḥ.

However, one must be careful that sandhi must be carried from left to right in the sentence, including preverbs. This may induce a complex interference between inter-word sandhi and preverb gluing, in the case of the single-vowel preverb ā (towards the speaker). This is exemplified in the sentence ihehi (इहेहि: come here). Here the verbal form is ehi (come), obtained by gluing upasarga ā to imperative form ihi of root i, to go. But it would be wrong to do sandhi between pada iha (here) and pada ehi, resulting in incorrect *ihaihi (इहैहि). You must instead do the sandhi between iha and ā, obtaining ihā, and then do sandhi with ihi, resulting indeed in ihehi. Fortunately, this is an extremely rare situation, but it poses a complex problem to our segmenting tool. Actually, ihehi is the acid test of Sanskrit segmentation.
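In programming terms, sandhi is not an associative operation, so the order of evaluation matters. The following sketch shows it on this very example, with a toy table holding only the three vowel rules involved; `VOWEL_RULES` and `join` are hypothetical names.

```python
from functools import reduce

# Only the three vowel rules needed for this example.
VOWEL_RULES = {("a", "ā"): "ā", ("ā", "i"): "e", ("a", "e"): "ai"}

def join(left, right):
    """Merge the final vowel of `left` with the initial vowel of `right`."""
    return left[:-1] + VOWEL_RULES[(left[-1], right[0])] + right[1:]

# Left to right: (iha + ā) + ihi = ihā + ihi
correct = reduce(join, ["iha", "ā", "ihi"])
# Preverb glued first: iha + (ā + ihi) = iha + ehi
wrong = join("iha", join("ā", "ihi"))

print(correct)   # ihehi
print(wrong)     # ihaihi
```

A segmenter has to undo the left-to-right computation, which is why it cannot simply look up the verbal form ehi after cutting off iha.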

Another difficulty is that, although hiatus is forbidden within compound formation (the compound must be pronounced in continuous enunciation (saṃhitāpāṭha), with no break in the devanāgarī written form), the corresponding sandhi is of the same nature as the inter-word sandhi, and thus does not incur retroflexion. But here too there is a glitch for proper names, which may have acquired retroflexion by usage. Thus while it is correct to use rāmāyanam ("रामायनम्") if you speak about the travels of your son Rāma, you must use rāmāyaṇam ("रामायणम्") to speak about the epic Rāmāyaṇa. And a few frozen compounds also exhibit retroflexion, like agrevaṇam ("अग्रेवणम्", at the border of the wood).

We shall not discuss here the special rules leading to upadhmānīya ("उपध्मानीय") or jihvāmūlīya ("जिह्वामूलीय"), represented by ardhavisarga ("अर्धविसर्ग") in writing, since these are rules for Vedic recitation, and ardhavisarga is replaced by visarga in texts of classical Sanskrit. Also we shall not discuss the 3-mātra long vowels (pluta).

We now close the sandhi topic. The next lesson will discuss strategies to steer our Reader tool in order to effect segmentation of Sanskrit text.

© Gérard Huet 2023