The Sanskrit Heritage Engine Reference Manual

About the Sanskrit Heritage Site

The Sanskrit Heritage website, at URL sanskrit.inria.fr, provides tools for the processing of the Sanskrit language.

This site has offered public access to various Web services and Sanskrit lexicons since 2003. It offers dictionary search, declension/conjugation, stemming, and segmentation/tagging/parsing of Sanskrit sentences. The site started as a set of tools to exploit a digital version of the Sanskrit Heritage Dictionary, which had been developed as a personal independent project by Gérard Huet since 1996 as a Sanskrit-French dictionary intended as a small encyclopedia of Indian culture. These tools use the finite-state methods implemented in the ZEN Objective Caml library to provide efficient lexicon representation, morphology generation, and segmentation by sandhi recognition. This technology was published in 2005 as A Functional Toolkit for Morphological and Phonological Processing, Application to a Sanskrit Tagger. A graphical interface, designed jointly with Pawan Goyal, has been published recently as Design and analysis of a lean interface for Sanskrit corpus annotation.

Written on January 8th 2018, for Sanskrit Engine Version 3.04.

First approach to using the Sanskrit Heritage engine

The following scenario may be played remotely, if you are connected to the Internet with a Web browser. Visit URL sanskrit.inria.fr to go to the standard Inria Sanskrit Heritage server. The same scenario may be played locally, if your workstation is equipped with its own HTTP server, and if you install the Sanskrit Heritage Engine software. This is explained below in the section How to install the Heritage Engine on your own server.

What you are seeing on the entry page is a somewhat ancient-looking Web document in the HTML style of the 90's. Don't be put off by the look-and-feel, but rather thank Inria for supporting this effort without throwing advertisements at you.

The page has a green bar at the bottom, which is the navigation control panel. Just click on the Reader link and you reach the Sanskrit Reader Companion page. You may now enter a Sanskrit sentence candidate in the input window. You may choose for this purpose a variety of input conventions in the corresponding menu. Let us assume you choose Velthuis transliteration proposed as default, and input "praaptavyam artha.m labhate manu.syo devo'pi ta.m lafghayitu.m na zakta.h". After you press the Read button, you will see your input displayed in devanāgarī script, followed by a graphical display with colored rectangles labeled by word forms. Notice how manu.syo (resp. devo) became manuṣyaḥ (resp. devaḥ) by sandhi analysis. Similarly lafghayitu.m became laṅghayitum.

The display actually represents all possible decompositions of your sentence into padas (word forms), aligned on your input represented in the blue line above. Blue rectangles are subantas (adjectives and substantives), red rectangles are tiṅantas (finite verbal forms). Indeclinable words (adverbs and particles) are purple, pronouns are sky blue, vocative forms are green.

When you click on a rectangle, its morphology is displayed. For instance, clicking on the red labhate reveals that it is a 3rd person singular form of the present of root labh in the middle voice (ātmanepadin) of class (gaṇa) 1. Furthermore, underlined labh is a link to the lexicon, which you may visit to check its meaning.

Here there are two scenarios. Had the Lexicon Access field been set to Heritage in the reader input page, you would be directed to the Sanskrit-French Heritage dictionary. If it had been set to Monier-Williams, you would be directed to the Sanskrit-English Monier-Williams dictionary. By default, you will get Monier-Williams access if you enter the site through its English entry URL.

In both cases you obtain the definition of root labh, decorated with its present class index (gaṇa), here 1, in red. This index is itself a link to the conjugation service, that reveals all the conjugated forms of labh, as well as all its participial stems. These stems, e.g. past passive participle labdha, are listed as gendered stems, the gender marks being links to the declension service. This exhibits the generative nature of the lexicon: all the forms obtainable from a root, either as finite conjugated forms, or as declined first-level nominals (kṛdanta), are the building blocks of our analyser.

Similarly, if you click on the blue artham form in the graphical display, you will get its lemma as singular accusative of masculine stem artha. This stem itself is a link leading to the corresponding lexicon entry artha, decorated by active gender marks. If you click on blue prāptavyam, however, you see a more complex morphological decomposition, informing you that it is a form of the kṛdanta (primary nominal stem) prāptavya, obtained by prefixing the preposition pra- to the 3rd formation (in -tavya) of the passive future participle (gerundive) āptavya of root āp. Please note how the root is linked to its lexical access, from which the stem āptavya and form āptavyam may be derived using the conjugation cum declension tools.

Here we are lucky - the correct word analysis (padapāṭha) of the sentence is obtainable as the sequence of all the words in the upper line of the diagram. Some have no competitor, they are checked blue. The ambiguous segments have two marks, a green check sign and a red cross sign. In two clicks on the green upper signs you get the intended segmentation.

Now let us return to the reader window, and remove all the blanks in your input: "praaptavyamartha.mlabhatemanu.syodevo'pita.mlafghayitu.mnazakta.h". The segmenter now returns more solutions, 96 instead of 6, and you see unexpected new forms appear, such as prāptavyamartham, whose stem happens to be lexicalized as the name Prāptavyamartha of the young boy from Pañcatantra, blamed by his father for having bought a book containing just one poem, starting precisely with our sentence. Other forms such as ude or evaḥ are just noise due to sandhi ambiguity. But the correct segments appear here too in prominent places, and in 4 clicks the correct solution is easily attained. Note how clicking on blue segment manuṣyaḥ determines unambiguously the next one devaḥ.

If you click on a selection by mistake, it is easy to backtrack by clicking on the Undo button of the page. Other command links on the same line (Filtered Solutions, All 96 Solutions) should be ignored at this stage, and will be explained in section Shallow parsing below.

Please note that the default Velthuis transliteration is just an option. You may input devanāgarī script like यतः कृष्णस्ततोधर्मः यतोधर्मस्ततोजयः by selecting the appropriate slot in the "Input convention menu". Try this now by cut and paste from this document. Similarly you may use the IAST standard Indic romanisation in Unicode, like yataḥ kṛṣṇastatodharmaḥ yatodharmastatojayaḥ.

Morphological tools

Grammar

The Sanskrit Grammarian, accessed from link Grammar in the green control bar, gives you declined forms of nouns and conjugated forms of root verbs. It is the workhorse of morphological derivation. For nouns (under Declension heading) you must provide the base stem, and its intended gender. For verbs (under Conjugation heading), you must provide the root and its present class. The resulting table of inflected forms is displayed either in Roman with diacritics (IAST), or in devanāgarī text, according to your choice in the Output font buttons.

The Declension tool accepts 4 gender parameters: Mas for masculine, Neu for Neuter, Fem for Feminine, and a final All that is to be used for deictic personal pronouns, and for numbers.

The Conjugation tool accepts 12 Present class parameters: 1 to 10 are used for the traditional classes (gaṇa). 11 is used for denominative verbs. Finally 0 gives the secondary conjugations: causative, intensive, and desiderative. Please note that in the Roman output the first person appears first, whereas in the devanāgarī output the third person appears first (prathama), consistently with vyākaraṇa tradition.

Homonyms are addressed using homonymy indexes, as in kara#1 and kara#2. In case of doubt, access the tool from the intended entry in the lexicon. If you do not specify the index, the system will make an educated guess of the intended homonym. For instance, if you ask for the conjugation of root mā in class 1, the system will propose the forms of mā_4; in class 2 or 3 it will propose mā_1. But if you intend mā_3 of class 3 you must address it explicitly as maa#3. If you enter random stems and parameters, you will get arbitrary nonsense, according to the principle "garbage-in garbage-out". Thus if you ask for the declension of stem blablabla in the masculine you will get nonsensical forms such as ablative blablablāt. But at least you are warned by the system, which indicates its doubt by labeling the declension table as blablabla? If you ask for its forms in the feminine, you will get a Gender anomaly report.
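The stem#index notation can be illustrated by a minimal sketch. The function name parse_entry is hypothetical and is not part of the Heritage Engine; the sketch only splits the notation and does not reproduce the educated guess made by the system when the index is omitted.

```python
from typing import Optional, Tuple


def parse_entry(entry: str) -> Tuple[str, Optional[int]]:
    """Split 'kara#2' into ('kara', 2); a bare stem yields (stem, None).

    Hypothetical helper, written only to illustrate the notation.
    """
    stem, separator, index = entry.partition("#")
    if not separator:
        # No homonymy index given: the real system would guess one.
        return stem, None
    if not index.isdigit():
        raise ValueError(f"malformed homonymy index in {entry!r}")
    return stem, int(index)


print(parse_entry("kara#2"))
print(parse_entry("maa"))
```

A caller would treat a None index as a request for the default homonym, and an explicit index such as maa#3 as overriding any guess.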

This morphological engine is available from within the dictionary pages, where the gender indications of nouns, and the present family indications of roots, are active links which activate the Sanskrit Grammarian with the right parameters.

Stemmer

Conversely, an inflected form which is derivable from the dictionary entries is retrievable, with its morphological taggings, from the Stemmer, also accessible from the green control bar.

The user must provide the lexical category in which to search for the word. Available categories are:
* Noun, for nominal and adjectival forms
* Pron, for pronominal forms
* Verb, for finite root forms
* Part, for participial forms as primary derivatives from roots
* Inde, for indeclinable forms (adverbs, particles, infinitive forms, root absolutives)
* Absya, for absolutive forms in -ya (usable with preverbs prefixing)
* Abstvaa, for absolutive forms of roots in -tvaa
* Voca, for vocative forms
* Iic, for stems usable as left component of a compound
* Ifc, for right components of compounds
* Iiv, for inchoative forms in -ī or -ū usable to form compound verbal forms with auxiliaries (the cvi construction)
* Piic, for participial stems

For instance, forms usable only in fine compositi such as kāraḥ are to be found in the Ifc bank. There is some redundancy between the Noun and the Part banks. Thus a word form such as gataḥ may be found in Noun, tagged as { nom. sg. m. }[gata], as well as in Part, tagged as { nom. sg. m. }[gata { pp. }[gam]]. Such lemmatisations are linked to the lexicon by stems (here gata) as well as by roots (here gam).
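The nested shape of such lemmatisations may be sketched as a small data structure. This is an illustration only: the class Lemma and the function render are hypothetical names, and the actual resources are distributed in XML form, not as Python objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lemma:
    tags: List[str]                    # alternative tags, e.g. ["acc. sg. n.", "nom. sg. n."]
    stem: str                          # stem or root the form is attached to
    origin: Optional["Lemma"] = None   # derivation of the stem itself, if any


def render(lemma: Lemma) -> str:
    """Print a lemma in the bracketed notation used in this manual."""
    inner = lemma.stem
    if lemma.origin is not None:
        inner += " " + render(lemma.origin)
    return "{ " + " | ".join(lemma.tags) + " }[" + inner + "]"


as_noun = Lemma(["nom. sg. m."], "gata")
as_participle = Lemma(["nom. sg. m."], "gata", Lemma(["pp."], "gam"))
print(render(as_noun))
print(render(as_participle))
```

Several tags in the same Lemma render as a multitag separated by vertical bars, which is the form shown later by the Parser Assistant.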

These linguistic resources are freely provided in XML form under various transliteration schemes. Please visit the Sanskrit linguistic resources site.

The Sanskrit Heritage Dictionary

The Sanskrit Heritage Dictionary is the latest edition of a Sanskrit to French Dictionary "Dictionnaire Français de l'Héritage Sanskrit" compiled by Gérard Huet since 1994. This dictionary is freely available as a 907-page book in PDF format, easily readable with Acrobat Reader, a free Adobe product. This dictionary is still under development, and is automatically updated along with the site, being now a computer-generated by-product of the lexical database of the platform.

This dictionary is the base for morphology generation used by the grammatical tools. It may be used also as a small encyclopedia of Indian culture. The Sanskrit name that renders best our encyclopedic intention is saṃskṛtibhāratīyakośa - Treasure of India according to Perfected tradition. Knowledge in this tradition is traditionally transmitted by lineages of teachers (paraṃparā). Some of this knowledge is available to the West through Indological literature, but often in desiccated form. Many sources were used to compile this information, and inevitable mistakes and inconsistencies occur, not to speak of glaring omissions. We pray the reader who knows better to signal such shortcomings to us.

Perfected means Sanskrit or Sanskritized. Thus usual names in vernacular [prakṛta] or pāli are generally given in their original Sanskrit form. Dravidian names are sometimes adapted to Sanskrit as an approximate phonetic rendition, but our lexicon is too limited to account for Dravidian traditions, not to speak of tribal ones. In any case, this modest dictionary ought not to be considered as a scholarly erudite document, but rather as a simplified presentation of Indian culture for the educated public.

Entries in the dictionary are arranged by vocables, which may be verbs or nouns. Verbs comprise verbal roots, but also their variations with prefix sequences of preverb particles, and secondary stems for causatives, intensives and desideratives. Nouns comprise noun roots, primary noun derivatives from verbs, secondary noun derivatives by suffixes from primary ones, and compounds. The first two categories are individual entries at toplevel, the others are sub-entries of a parent vocable, or sub-sub-entries. Adjectives are just semantic roles of nominals. Pronouns and numbers are subclasses of nouns. Indeclinable forms (adverbs) and tool particles complete the lexical categories. Some idiomatic expressions and a few selected citations are listed at the end of entries at any level.

The list of abbreviations, of the Heritage dictionary as well as the grammatical engine, is available as a standalone pdf document.

Two index engines are provided. The main index requires exactly transliterated input, possibly an initial prefix of an existing entry, possibly some inflected form of a declined noun or a conjugated verb. The Sanskrit made easy index requires a romanized input for a full word, without diacritics and aspiration marks, for easy access to words like Siva, Vishnou, Panini, Sankara, etc.
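To give an idea of what a search key without diacritics and aspiration marks may look like, here is a minimal sketch. It is an assumption on our part, not the algorithm used by the Sanskrit made easy index: the function name simplified_key is hypothetical, and the sketch does not handle phonetic spellings such as Vishnou.

```python
import re
import unicodedata


def simplified_key(word: str) -> str:
    """Reduce an IAST or plain romanized word to a bare lowercase key."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word.lower())
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop the h of aspirated stops (kh, gh, ch, jh, th, dh, ph, bh).
    return re.sub(r"([kgcjtdpb])h", r"\1", bare)


# A user query and a lexicon entry meet on the same key.
print(simplified_key("Sankara") == simplified_key("śaṅkara"))
print(simplified_key("pāṇini"))
```

Under this assumption, both the lexicon entries and the user query are reduced to the same key before comparison.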

The user who opts for Monier-Williams access will have the benefit of seeing definitions in English if he does not know French, while having access to the grammatical online tools in the same way. However proper names are not properly glossed as hyper-linked entities. Furthermore, the index tool is not as smart as the Heritage one, since you have to give the exact stem of the entry. Thus e.g. devanāgarī must be entered in full, while the initial prefix devanāg suffices for its disambiguation by the Heritage index.

The Sanskrit Heritage dictionary is also available in an ebook format, usable with the Babylon, Stardict or Goldendict software. Please visit the Golden Sanskrit Heritage page.

The Sanskrit Engine

The Sanskrit Engine consists of a number of tools accessible online on the Sanskrit Heritage site. These various tools are available through interfaces easily reached from the green band at the bottom of your browser panel.

Sandhi

The Sandhi Engine takes two phoneme streams (input as transliterated strings) and returns their sandhi euphonic composition. There are two modes: external, for gluing together words in a sentence as well as making nominal compounds, and internal, for appending affixes to stems in morphological derivations. We provide a deterministic answer, that is, a choice is made when optional forms are admitted. This output does not preclude obtaining different forms using an optional rule. A fuller non-functional sandhi relation is used by the segmenter, in order to recognize the optional variants in conformity with Pāṇini.
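The idea of a deterministic external sandhi function may be sketched as follows. This is a toy under stated assumptions: it works on IAST strings, it knows only four rules (visarga after a, merging of a/ā vowels, final m before a consonant, and plain concatenation otherwise), and the name external_sandhi is hypothetical. The rule set of the actual Sandhi Engine is far more complete.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to precomposed Unicode so that each phoneme sign is one character."""
    return unicodedata.normalize("NFC", text)


VOWELS = set(nfc("aāiīuūṛṝḷeo"))
VOICED = set(nfc("gṅjñḍṇdnbmyrlvh"))  # first letters of voiced consonants


def external_sandhi(left: str, right: str) -> str:
    """Glue two words with a tiny subset of external sandhi rules."""
    left, right = nfc(left), nfc(right)
    if not left or not right:
        return left + right
    first = right[0]
    short_a = first == "a" and not right.startswith(("ai", "au"))
    if left.endswith(nfc("aḥ")):
        if short_a:                      # devaḥ + api -> devo'pi
            return left[:-2] + "o'" + right[1:]
        if first in VOICED:              # manuṣyaḥ + devaḥ -> manuṣyodevaḥ
            return left[:-2] + "o" + right
    if left[-1] in ("a", nfc("ā")) and (short_a or first == nfc("ā")):
        return left[:-1] + nfc("ā") + right[1:]   # gatvā + ādhyānam
    if left.endswith("m") and first not in VOWELS:
        return left[:-1] + nfc("ṃ") + right       # artham + labhate
    return left + right


print(external_sandhi("devaḥ", "api"))
print(external_sandhi("artham", "labhate"))
```

Since the function returns a single string, it is functional in the sense of the paragraph above; a relation admitting optional variants would return a set of strings instead.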

Reader

The Sanskrit Reader Companion allows the analysis of Sanskrit sentences. We already saw an example of its use in the graphical Summary mode. Let us now examine the nature of its parameters.

The parameter "Lexicon Access" chooses the look-up dictionary. This parameter is persistent within a session. On the standard server it is set by default to Sanskrit Heritage, but if you are an English speaker you may want to set it to Monier-Williams, by accessing the English entry URL. If you install the tools on your own server, you will set such default parameters at configuration time.

You should be aware that the choice of the look-up dictionary is of no consequence to the reader tools, since the morphology generation lexicon is Sanskrit Heritage. Thus the forms of certain stems in Monier-Williams may not be recognized (however, see user-aid below for their acquisition). Conversely, the richer generation of participles allows the recognition of many forms, whose stems are not lexicalized in Monier-Williams. The covering of Heritage within Monier-Williams is indicated explicitly since entries lexicalized in Heritage are rendered highlighted in yellow in the Monier-Williams pages.

The parameter "Cache" is for advanced use, explained below in user-aid.

The parameter "Text" is set by default to Sentence, and may be set to Word if you want to recognize a single pada. For instance, if you parse the following compound (taken from Pañcatantra): "pravaran.rpamuku.tama.nimariicima~njariicayacarcitacara.nayugala.h" in Sentence mode, you will be offered 96 solutions, but only 6 solutions in Word mode.

The next parameter "Format" is a toggle between reading sandhied text and reading text which has already been analysed in words (padapāṭha). Thus the sentence "si.mhovyaakara.nasyakarturaharatpraa.naanmune.hpaa.nine.h" may be parsed in sandhied mode (yielding 58 potential solutions), or may be presented in padapāṭha form as "si.mha.h vyaakara.nasya kartu.h aharat praa.naat mune.h paa.nine.h" (yielding only 24 solutions).

The parameter "Parser strength" is by default set at "Full". It may be set to "Simple", meaning that no generation of participial stems and privative compounds is effected: all stems must be lexicalized. Simple mode segmentation should be reserved for small sentences explained to learners.

The "Input convention" parameter allows a number of formats. Transliteration using ASCII characters is possible in 4 varieties: Velthuis, WX (University of Hyderabad), KH (Kyoto-Harvard), and SLP1 (Sanskrit Library). These various conventions are presented in a synthetic document. Thus vaiśeṣikaḥ may be input as vaize.sika.h in the default Velthuis scheme, as vaizeSikaH in the Kyoto-Harvard scheme, as vESeRikaH in the WX scheme, or as vESezikaH in the SLP1 scheme.

In addition, Unicode input may be used, both for devanāgarī and for the IAST romanisation with diacritics, the Indology standard. Thus one may input directly वैशेषिकः or vaiśeṣikaḥ.
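The correspondence between the default ASCII convention and IAST may be sketched by a small table-driven converter. The table below is partial and is an assumption reconstructed from the examples of this manual (z for ś and f for ṅ, as in zakta.h and lafghayitu.m) together with common Velthuis digraphs; the function name velthuis_to_iast is hypothetical. Please refer to the synthetic document mentioned above for the authoritative conventions.

```python
import unicodedata

# Partial table: two-character codes are tried before single characters.
VELTHUIS_TO_IAST = {
    "aa": "ā", "ii": "ī", "uu": "ū",
    ".r": "ṛ", ".m": "ṃ", ".h": "ḥ",
    ".t": "ṭ", ".d": "ḍ", ".n": "ṇ", ".s": "ṣ",
    "~n": "ñ", '"n': "ṅ", '"s': "ś",
    "z": "ś", "f": "ṅ",   # variants seen in the examples of this manual
}


def velthuis_to_iast(text: str) -> str:
    """Convert a Velthuis-style ASCII string to IAST, longest code first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in VELTHUIS_TO_IAST:
            out.append(VELTHUIS_TO_IAST[pair])
            i += 2
        elif text[i] in VELTHUIS_TO_IAST:
            out.append(VELTHUIS_TO_IAST[text[i]])
            i += 1
        else:
            out.append(text[i])   # plain letters, blanks, avagraha
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


print(velthuis_to_iast("vaize.sika.h"))
print(velthuis_to_iast("praaptavyam artha.m labhate"))
```

Note that this is a pure transliteration: it does not perform the sandhi analysis by which the Reader turns manu.syo into manuṣyaḥ.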

The "Optional topic" parameter is used in Parser mode to indicate a contextual topic usable as ellipsed agent. This is an experimental feature.

Finally, the "Mode" parameter offers several modes of operation of the Engine. We saw the default Summary mode. Other modes are provided to display all solutions sequentially. These modes are mostly deprecated, since they produce enormous pages when there are many solutions. It is possible to access these modes from the graphical Summary mode, when there remain only a few solutions.

Shallow parsing, a first approach

Let us call the Reader with a simple sentence such as: vana.mgatvaadhyaana.mkaroti (in Velthuis). The summary interface returns a page showing you your input sentence in blue devanāgarī, then a line with a number of green check marks, the first one being labeled Undo, then the graphical display where segments may be selected or discarded, as explained above. The third button is labeled "All 28 Solutions". It indicates that there is a total of 28 segmentation solutions at this initial stage. Indeed, when you select segments by clicking on their green check signs (resp. discard them by clicking on their red cross signs) you see the count of solutions decrease. Thus selecting segment dhyānam brings this count to only 4. One choice is to select verbal segment karoti, either by clicking on its green check sign, or by clicking on the red cross sign on either one of the parasitic segments kara and ūti. The only remaining choice is to select or reject the yellow a segment indicating a possible privative compound. We remark en passant that the only way to get the correct intended interpretation is to discard this parasitic privative segment, which is entirely absorbed by sandhi with the final of segment gatvā, and thus cannot be rejected through other segment selections. This shows the necessity of the red cross signs.
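How sandhi ambiguity multiplies segmentations may be sketched with a toy segmenter. Everything here is an assumption made for illustration: the lexicon holds seven forms only (the privative a segment is left out), the inverse sandhi table holds five rules, and the names LEXICON, INVERSE_SANDHI and segment are hypothetical. The actual segmenter relies on finite-state transducers and will not return the same counts.

```python
import unicodedata
from typing import Iterator, List


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Toy lexicon of inflected forms (hypothetical, for illustration only).
LEXICON = [nfc(w) for w in
           ["vanam", "gatvā", "dhyānam", "ādhyānam", "karoti", "kara", "ūti"]]

# (surface, final of left word, initial of right word)
INVERSE_SANDHI = [(nfc(s), nfc(l), nfc(r)) for s, l, r in [
    ("ā", "ā", "ā"), ("ā", "ā", "a"),
    ("o", "a", "ū"), ("o", "a", "u"),
    ("ṃ", "m", ""),
]]


def segment(text: str) -> Iterator[List[str]]:
    """Enumerate every way of reading text as a sequence of lexicon forms."""
    for word in LEXICON:
        if text == word:
            yield [word]
        elif text.startswith(word):            # plain juxtaposition
            for rest in segment(text[len(word):]):
                yield [word] + rest
        for surface, left, right in INVERSE_SANDHI:
            if len(word) > len(left) and word.endswith(left):
                glued = word[:-len(left)] + surface
                if text.startswith(glued):     # undo the sandhi at this junction
                    for rest in segment(right + text[len(glued):]):
                        yield [word] + rest


for solution in segment(nfc("vanaṃgatvādhyānaṃkaroti")):
    print(" + ".join(solution))
```

Even this tiny setting exhibits the two kinds of noise discussed above: the competition between dhyānam and ādhyānam after gatvā, and the parasitic analysis of karoti as kara followed by ūti.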

From the state after selection of karoti, click on the green check labeled "All 2 Solutions". You see the two potential solutions listed one after the other, with no sharing of common parts. Each segment is lemmatized with hyperlinks to the lexicon. Segments are separated by sandhi annotations. In this linear interface, it is possible to select solutions by clicking on the green check after the index of the solution. For instance, let us select Solution 1.

We are facing now another user interface called the Sanskrit Parser Assistant. The selected solution is displayed in 3 columns. The 1st column, yellow, is the padapāṭha. The 2nd column displays possible stemmings as a sequence of morphological multitags. For instance, on the first row, vanam is analysed as: { acc. sg. n. | nom. sg. n. }[vana], where stem vana is hyperlinked to the lexicon. Each selection in the multitag is equipped with a selecting button, preset to a default value. Here you may choose the case accusative (pre-selected) or nominative of word form vanam.

The right column attempts to represent the cases by semantic roles, with occasional English gloss of verbal forms. This representation is an approximation which is actually slightly misleading, since it attempts to relate nominatives to an evasive syntactic notion of Subject which is of little relevance to Sanskrit. Actually nominative forms denote just names of the "unexpressed" (anabhihita) semantic role. Thus at best this representation is some approximation of syntax.

Below the three columns you find a button labeled Submit. You may press it to validate the morphological choices, and the resulting page gives you a unique parse as a hypertext padapāṭha which you may save in user space.

Returning to the Parser Assistant page, you will notice some cryptic notations attempting to assign penalties to morphological choices. This is another way to make morphological selections, ranked by decreasing penalty. Each selection is marked with a mouse sensitive heart symbol, which effects its commitment. We shall return to this in the next section.

Shallow parsing, advanced

Let us return to the original state of our interface interaction, after entering vana.mgatvaadhyaana.mkaroti. We notice in the first menu line a button labeled "Filtered Solutions". If you click on it, you see a listing of solutions similar to what we saw in the last section, but now solutions are listed according to some constraint satisfaction ranking. The first one (labeled 4) is the intended one, followed by another one, proposing ādhyānam in place of dhyānam. You may select the one you prefer, and go directly to the Parser Assistant page as seen above. You may also go back to the graphical summary interface, allowing mutual interaction between the two modes of operation.

Let us illustrate this shallow parsing facility on a much-discussed ambiguous sentence going back to Patañjali. Go back to the Reader interface, and enter zvetodhaavati, using Summary Mode. You see a display of the 36 segmentation solutions. You are also offered a green check sign labeled Filtered Solutions. Click on it. You see one particular solution, labeled 24, formed with blue śvetaḥ in the nominative, followed by red dhāvati, a verbal form in the present. Actually form dhāvati is marked as ambiguous, since it may result from root dhāv_1 (running) or from root dhāv_2 (cleaning).

Clicking on the green check sign brings you to the Sanskrit Parser Assistant page. The interpretation "It runs" is pre-marked, favoring root dhāv_1. Lower in the page, you see indeed that this interpretation incurs no penalty. Clicking on the green heart sign, or equivalently on the preset Submit button, brings you to the fully disambiguated padapāṭha "The white one is running". The other segmentation has some penalty, explained with the "-Obj" indication, marking the absence of the object to a transitive verb.

In this example, the machine has succeeded in focusing on a correct solution automatically, among many interpretations. If we come back to the initial selection, it indeed tells "1 solution kept among 36", but actually lists also 7 other plausible additional solutions. Indeed, among them, Solution 33 gives another correct decomposition śvā+itaḥ+dhāvati "The dog is running towards here". Here too, the tool analyses dhāv_1 as fitting the grammatical constraints. It has penalty 0 as well, but was just disfavored over the first interpretation because it has 3 segments rather than 2, exhibiting a "shortest length bias" heuristic.
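The combined effect of penalties and of the shortest length bias may be sketched as a sort on a pair of keys. This is a simplification under our own assumptions: the class Solution and the function rank are hypothetical names, the third candidate and its penalty value are invented for the illustration, and the actual constraint satisfaction ranking is more elaborate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    index: int        # label of the solution in the listing
    n_segments: int   # number of segments in the decomposition
    penalty: int      # 0 when all grammatical constraints are met


def rank(solutions: List[Solution]) -> List[Solution]:
    """Lowest penalty first; among equal penalties, fewest segments first."""
    return sorted(solutions, key=lambda s: (s.penalty, s.n_segments))


candidates = [
    Solution(index=33, n_segments=3, penalty=0),
    Solution(index=7, n_segments=2, penalty=1),   # hypothetical penalized reading
    Solution(index=24, n_segments=2, penalty=0),
]
print([s.index for s in rank(candidates)])
```

With such a key, Solution 24 comes before Solution 33 although both have penalty 0, and any penalized reading comes after both.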

This shallow parser cannot be used on large input sentences, since its output could become enormous to the point of choking the server. Thus we have its access link "Filtered Solutions" appear only when the number of remaining segmentation candidates is below a threshold set by default to 100. This is in contrast with the situation with the graphical interface, which is fast and robust. Thus entering the following verse from Kālidāsa, we obtain very quickly a display factorizing an astronomical number (37680373292728320) of solutions: yaa tapovize.saparizafkitasya sukumaaramprahara.nam mahendrasyapratyaadeza.h ruupagarvitaayaa.h zriya.h ala.mkaara.h svargasyasaana.h priyasakhyurvazii kuberabhavanaat pratinivartamaanaasamaapattid.r.s.tena kezinaadaanavenacitralekhaadvitiiyaa bandigraaha.mg.rhiitaa.
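Why a shared display can stand for an astronomical number of solutions may be sketched by counting paths in a lattice of segments without enumerating them. The sketch rests on simplifying assumptions: segments are plain (start, end) intervals over the input, sandhi overlaps are ignored, and the names count_solutions and offers_filtered_link are hypothetical.

```python
from typing import List, Tuple

FILTER_THRESHOLD = 100  # default threshold mentioned above


def count_solutions(length: int, segments: List[Tuple[int, int]]) -> int:
    """Count the chains of segments covering positions 0..length.

    Assumes start < end for every segment, so that processing segments
    by increasing start position visits each prefix count once it is final.
    """
    ways = [0] * (length + 1)
    ways[0] = 1
    for start, end in sorted(segments):
        ways[end] += ways[start]
    return ways[length]


def offers_filtered_link(count: int, threshold: int = FILTER_THRESHOLD) -> bool:
    """The enumeration link is offered only below the threshold."""
    return count < threshold


# Twenty consecutive chunks with two competing segments each.
lattice = [(i, i + 1) for i in range(20) for _ in range(2)]
total = count_solutions(20, lattice)
print(total, offers_filtered_link(total))
```

Here 40 rectangles suffice to display a lattice whose enumeration would list more than a million solutions, which is why selection in the graphical interface remains fast while sequential listing does not.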

Lexical categories

The main lexical categories exhibited so far are:
* substantive/adjective forms (blue)
* vocative forms (green)
* finite verbal forms (red)
* undeclinable forms such as adverbs, conjunctions, prepositions (mauve)
* pronominal forms (light blue)
* left part of compounds (yellow)

Actually, complex compounds with n+1 components appear as a sequence of n yellow segments denoting stems, followed by a blue nominal inflected form. For instance, enter in the Reader the following input (Velthuis) pravaran.rpamuku.tama.nimariicima~njariicayacarcitacara.nayugala.h or प्रवरनृपमुकुटमणिमरीचिमञ्जरीचयचर्चितचरणयुगलः (Devanagari). The returned display exhibits 96 solutions in Sentence mode. But if you select Word rather than Sentence as Text mode parameter, you get only 6 solutions, which denote various nominal compounds with many components. Actually, there are remaining ambiguities concerning the bracketing of their constituents. Let us examine a few typical situations.

First of all, some of the constituents may constitute a dvandva compound. For instance, consider yakṣagandharvanāgāḥ. It is a dvandva compound with 3 components yakṣa, gandharva, and nāgāḥ. The first two are bare stems, only the third one bears declension (vibhakti). Note that actually we distinguish two cases of the last segment: a blue one for nominative, and a green one for vocative. This distinction between vocatives and other cases is important, since vocatives are not really syntactic components of a sentence, but rather separate interjections, part of the communicative structure.

Let us now consider binary branching compounds. A three component display A-B-C may actually represent the compounding structure (A-B)-C (for instance viśvarūpadarśanam) or (less commonly) the structure A-(B-C) (for instance ubhayacakravartī). Thus long compounds are represented in ambiguous ways, since the mechanical reader does not know how to choose between them on the sole basis of grammatical dependencies.

Now consider the compound stem pītāmbara. It may denote a determinative compound (tatpuruṣa), meaning "yellow garment", of neuter gender inherited from its component ambara. Or it may denote an exocentric compound (bahuvrīhi), of adjectival meaning "who wears a yellow garment". Thus, on input pītāmbaram, in mode Word, we have two solutions, sharing the yellow initial component pīta. The first solution proposes a blue neuter nominal segment ambaram, analysed as accusative or nominative of stem ambara. The second ambaram however is of a distinct cyan colour, and is analysed as masculine accusative. This second solution is mandatorily the exocentric compound "he who wears a yellow garment", typically an epithet of Lord Viṣṇu. The cyan colour segment may not occur stand-alone, it is mandatorily preceded by a yellow segment in order to form an exocentric adjectival compound. But the first solution is ambiguous, since it may be interpreted as a tatpuruṣa or as a bahuvrīhi. This example ought to be thoroughly understood in order to learn how to select the segments corresponding to the intended meaning.

There exists yet another variety of compound, the so-called avyayībhāva "turned into undeclinable". Let us consider a typical example, nirmakṣikam (without flies). Here this input is analysed as a sequence of segments, first the preposition nis, colored lavender, and then the stem makṣikā, turned into an invariable form makṣikam, colored magenta. We remark that the segment makṣikam is not accepted as stand-alone input. Please also note on this example that an unrecognized chunk of input yields a grey rectangle.

Verbal compounds exist, such as the periphrastic perfect construction, used for secondary conjugations and denominative verbs. It builds a special stem in -ām, suffixed by a perfect form of one of the auxiliaries kṛ, as and bhū. Try for instance āmantrayāṃcakre. You see the periphrastic form displayed as two segments, an orange āmantrayām, and the red cakre of the perfect of root kṛ: "he/I summoned". The orange and red segments are mutually linked: selecting one automatically selects the other.

Another periphrastic construction is the inchoative "cvi" verbal compound. Its left part is a special substantival stem in ī or ū, and its right part a finite verb form of one of the auxiliaries, like kadarthīkaroti or mṛdūbhavate. It in turn gives rise to primary derivatives (kṛdanta) like khilībhūtaḥ. Here too the left part is orange, and the right part is either red for verbal forms or blue for participial forms.

This concludes the main grammatical paradigms implemented by our machinery. Some more exotic constructions may occasionally be met, like the special construction of forms of kāma or manas, preceded by a special infinitive verbal form in -tu. Try for instance vaktukāmaḥ ("who wants to speak"). Note that two blue segments kāmaḥ appear in the result. One is used as a stand-alone nominal form (if you select the red imperative form vaktu), whereas the other one is necessarily used together with the salmon-colored special infinitive segment vaktu. Similarly for draṣṭumanāḥ ("inclined to see").

The user of our machinery may be occasionally puzzled by what may appear as redundancies. For instance, consider the input mānam. Two blue apparently identical segments labeled mānam occur. However, closer inspection (by clicking on these blue rectangles) reveals that one is a form of māna_1 (past participle of root man), and the other one is a form of māna_2 ("measure"). Although the two segments have the same color, being both subanta nominal forms, they do not obey the same combinatorics, since a participle (kṛdanta) stem like māna_1 is liable to be prefixed by the preverb particles (upasarga) allowed for root man.

Another interesting example is virodhitayā. The two blue segments look alike, and they are both instrumental singular forms of the feminine stem virodhitā. But one is the past participle of the causative of verb vi-rudh, the other is an abstract taddhitānta noun, obtained as virodhi(n)-tā. Distinguishing the two is essential, since they don't have the same dependency, the first one being an adjective requiring a substantive as its qualificand.

In order to understand the segmenting algorithm, one should study its control automaton. Here is the simplified automaton, explaining the main constructions. Words (pada) are recognized by paths going down from the starting state S, and ending in the accepting state Accept. The link going upward from Accept to S allows recognizing a sentence as a sequence of words, sandhi being effected on the arcs of the diagram. Please note the cycle through state Iic, allowing the recognition of arbitrary length (flattened) compounds. When this diagram is mastered, consider the extended automaton, which adds vocatives, cvi verbal compounds, and privative compounds. The bank Nounc (respectively Nounv) is the subset of nominal forms Noun starting with a consonant (respectively a vowel). Privative compounds are obtained by prefixing them with a- (respectively an-). Finally, consider the complete automaton. The state Krid corresponds to first-level nominal constructions from roots, notably participial forms. These may be preceded by preverbs (Pvk). The state Priv stands for one of the two forms of the privative prefix a/an. One must imagine partitioning all banks whose state follows Priv into forms starting with a consonant or a vowel, similarly to the preceding diagram. Finally, the state Neg stands also for the privative prefix a/an, prefixing a root absolutive in -tvā, like akṛtvā (having not done).
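The control structure described above may be sketched as a small state machine over sequences of bank names. The transition table below is a hypothetical simplification written for this illustration, since the diagrams themselves are not reproduced in this text; it keeps only the start state S, the cycle through Iic, and a return to S after each complete word, which plays the role of the Accept to S link.

```python
from typing import Dict, List, Set

# Hypothetical, much reduced transition table (not the actual automaton).
TRANSITIONS: Dict[str, Set[str]] = {
    "S": {"Noun", "Pron", "Verb", "Inde", "Iic"},
    "Iic": {"Iic", "Noun", "Ifc"},   # cycle allowing compounds of any length
}
WORD_FINAL: Set[str] = {"Noun", "Pron", "Verb", "Inde", "Ifc"}


def accepts(phases: List[str]) -> bool:
    """Check that a sequence of bank names spells out a sequence of words."""
    state = "S"
    for phase in phases:
        if phase not in TRANSITIONS.get(state, set()):
            return False
        # A word-final bank reaches Accept, from which we restart at S.
        state = "S" if phase in WORD_FINAL else phase
    return bool(phases) and state == "S"


print(accepts(["Iic", "Iic", "Noun", "Verb"]))   # compound followed by a verb
print(accepts(["Iic", "Verb"]))                  # dangling compound stem
```

The extended and complete automata would add further states (for instance Priv, Pvk, Krid or Neg) to the same kind of table, without changing the recognition loop.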

The results returned by our graphical interface may be thought of as describing all paths following this state diagram, except that preverbs are glued to the root and participial forms following them.

Deep parsing

The Sanskrit Heritage Engine may also be used as a segmentation front-end for the dependency parser designed by Prof. Amba Kulkarni at the University of Hyderabad.

This allows the production of dependency graphs labeled with semantic relations, and ranked by decreasing satisfaction of dependency constraints. We shall not explain this facility further, since it is still under development and not yet publicly released.

The user aid facility for lexicon acquisition

Our generating lexicon is not an extensive dictionary of Sanskrit. Occasionally you will encounter forms that are not recognized. For instance, assume you enter patamaṃ śṛṇoti in the Reader. The result is a two-segment solution attempt, where śṛṇoti is a red recognized verbal form of root śru, but patamam appears as a grey unrecognized form. Its segment is available for selection with a red spade symbol. If you click on it, you get to a help page labeled "Feedback for Unknown Chunks". This page comprises three zones. The first zone allows you to correct your sentence. The second one allows you to correct the faulty chunk of text, here patamam.

The third zone is available only if you install the Heritage Engine in Station mode on your own workstation; it is not available on the public server. It is a facility that helps you recognize forms of stems that are not lexicalized. It proposes various hypothetical lemmatizations for the unrecognized stems. Among those, the ones that are lexicalized in the Monier-Williams dictionary are underlined, indicating a lexicon link that you may consult to verify whether its meaning is appropriate. Each lemmatization is marked with a selection button. For instance, if you choose acc. sg. n. of stem patama and press Submit Morphology, you are brought back to the Sanskrit Segmenter Summary, where now segment patamam is blue. This way you may progressively augment the recognized forms or correct faulty input.

Furthermore, your choice has added the stem patama to a local cached lexicon on your platform. Thus, if you later encounter the sentence patamāḥ śrūyante, the segment patamāḥ will be recognized as a bona fide form of stem patama.

It may also happen that a chunk of text is successfully analysed, but none of the segmentation solutions corresponds to the intended one, because of some incompleteness in the lexicon. In this case, it is possible to invoke the user aid by clicking anywhere in the chunk itself on its blue rendition above the colored rectangles. This will allow the user to fill in the right segmentation, if it is a nominal form obtainable as the inflected form of a nominal item lexicalized in the Monier-Williams dictionary.

This facility is an objective reason to install the Sanskrit Heritage Engine on your own workstation.

The lexicon cache is reset by command "make empty_caches" in the installation directory.

Fine-grained input considerations

We already discussed above the parameters Format and Input conventions. In the "Sandhied" format, blanks are necessary only when there is an actual hiatus in the devanāgarī representation. For instance, in vanād grāmam adyopetyaudana āśvapatenāpāci, only the third blank space is mandatory. The others may be removed. They merely help the segmenter by indicating pada boundaries. Of course, if you remove them, the number of potential solutions may increase, since the system will attempt analyses not respecting these word boundaries. The third space above is mandatory, and actually gives rise to two distinct segmentations, one with the form odanaḥ, the other with the form odane.

Note that in the system's rendering, the mandatory space is indicated by an underscore symbol. Indeed, the user may use underscore to mark the necessary pauses, and thus the above example may be entered without any space as vanādgrāmamadyopetyaudana_āśvapatenāpāci. On the other hand, a blank may be inserted between letters even though the separate chunks are not in final sandhi, like after vanād above, or in vanaṃ gacchati. Thus Sandhied format with optional blanks is completely different from Unsandhied format, where each chunk of input must be a pada in final sandhi form, as in: vanāt grāmam adya upetya odanaḥ āśvapatena apāci. When entering a digitized corpus in our machinery, one must understand this distinction well, and possibly restore a consistent input.
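
The convention can be illustrated with a small Python sketch. The helper names are hypothetical and not part of the engine; the sketch assumes that optional blanks are typed as spaces and mandatory pauses as underscores, so that removing the spaces yields an equivalent, more compact input, while splitting on underscores yields the stretches of text between two hiatuses.

```python
from typing import List


def compact_sandhied(text: str) -> str:
    """Drop the optional blanks; underscores (mandatory pauses) are kept."""
    return "".join(text.split())


def hiatus_chunks(text: str) -> List[str]:
    """Split the compacted input at the pauses marked by underscores."""
    return [chunk for chunk in compact_sandhied(text).split("_") if chunk]


example = "vanād grāmam adyopetyaudana_āśvapatenāpāci"
print(compact_sandhied(example))
print(hiatus_chunks(example))
```

Such a helper cannot decide by itself which blanks are mandatory; that information must come from the person entering the text.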

The nasalisation sign anusvāra is optional when it stands for a nasal, and mandatory only before sibilants and h. Thus sandhi and saṃdhi are equivalent. Similarly for visarga before a sibilant. Thus śunaḥśepa or śunaśśepa.
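
The equivalence between a class nasal and anusvāra before a stop can be made concrete with a comparison key. The Python sketch below works on IAST strings and is only an illustration under stated assumptions: the function name nasal_key is hypothetical, the table covers only the five classes of stops, and it does not claim to reproduce the normalization performed inside the engine.

```python
import re
import unicodedata


def _nfc(text: str) -> str:
    """Use precomposed characters so that ṅ, ṇ, ṃ are single code points."""
    return unicodedata.normalize("NFC", text)


# Class nasal -> first letters of the stops of the same class.
# Aspirated stops (kh, gh, ...) start with the same letter as the plain ones.
_HOMORGANIC = {"ṅ": "kg", "ñ": "cj", "ṇ": "ṭḍ", "n": "td", "m": "pb"}

_ANUSVARA = _nfc("ṃ")
_RULES = [
    re.compile(_nfc(nasal) + "(?=[" + _nfc(stops) + "])")
    for nasal, stops in _HOMORGANIC.items()
]


def nasal_key(word: str) -> str:
    """Rewrite a nasal before a stop of its own class as anusvāra."""
    word = _nfc(word)
    for rule in _RULES:
        word = rule.sub(_ANUSVARA, word)
    return word


print(nasal_key("sandhi") == nasal_key("saṃdhi"))
print(nasal_key("saṃskṛta"))
```

Two spellings that differ only in this respect receive the same key, while an anusvāra before a sibilant or h is left untouched.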

Sandhi of n before l (anunāsika) is noted in our adaptation of Velthuis notation by a pair of tilde symbols, like in vidvaal~~likhati, leading to candrabindu in devanāgarī, like: विद्वालँलिखति.

It is also possible to help the segmentation of compounds, by inserting a hyphen at the stem boundaries. For instance, the long compound: "pravaran.rpamuku.tama.nimariicima~njariicayacarcitacara.nayugala.h" may be disambiguated to a certain extent as: pravara-n.rpa-muku.ta-ma.ni-mariici-ma~njarii-caya-carcita-cara.na-yugala.h
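
To make the hyphen convention concrete, the following Python sketch splits the hyphenated input at the stem boundaries and renders each stem with diacritics. The transliteration table is partial, and the codes z (for ś) and f (for ṅ) are assumptions inferred from the examples in this manual; the function names are hypothetical.

```python
import re
from typing import List

# Partial Velthuis -> IAST table (assumption: "z" and "f" are the site's
# variants of the standard Velthuis codes "s and "n).
VELTHUIS = {
    "aa": "ā", "ii": "ī", "uu": "ū",
    ".r": "ṛ", ".l": "ḷ", ".m": "ṃ", ".h": "ḥ",
    ".t": "ṭ", ".d": "ḍ", ".n": "ṇ", ".s": "ṣ",
    "~n": "ñ", '"n': "ṅ", "f": "ṅ", '"s': "ś", "z": "ś",
}

# Longest codes first, so two-character codes win over one-character ones.
_PATTERN = re.compile(
    "|".join(re.escape(code) for code in sorted(VELTHUIS, key=len, reverse=True))
)


def to_iast(text: str) -> str:
    """Replace every Velthuis code of the table by its IAST letter."""
    return _PATTERN.sub(lambda match: VELTHUIS[match.group(0)], text)


def compound_stems(hyphenated: str) -> List[str]:
    """Split a hyphenated compound into stems, rendered in IAST."""
    return [to_iast(part) for part in hyphenated.split("-")]


hinted = ("pravara-n.rpa-muku.ta-ma.ni-mariici-ma~njarii-"
          "caya-carcita-cara.na-yugala.h")
print(compound_stems(hinted))
```

Removing the hyphens from the hinted string gives back the undivided compound, so the hyphens only restrict the segmenter's search and do not change the text.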

When initial short a is deleted by sandhi, it is possible to indicate the situation with the avagraha sign, noted by an apostrophe ' in transliteration. Actually this notation is mandatory in certain situations (after e and o) like devo'pi. Thus the Bhagavadgītā verse नासतोविद्यतेभावोनाभावोविद्यतेसतः will only accommodate Śaṅkara's analysis na asataḥ vidyate bhāvaḥ na abhāvaḥ vidyate sataḥ, whereas Madhva's interpretation (with abhāvaḥ) has to be made explicit as नासतोविद्यतेऽभावोनाभावोविद्यतेसतः

Finally, the system does not currently support degemination of stems, such as modern renditions of tattva as tatva or vārttā as vārtā; only a few common stems such as chatra, chātra and patra are recognized.

Entering full verses (śloka)

It is possible to enter longer pieces of text than a single line. Verses (śloka) may be entered as lists of lines ended with the vertical bar | (daṇḍa), terminated by a line ended with two vertical bars || (pūrṇavirāma). Thus, for instance:

d.r.s.tvaa tu paa.n.davaaniika.m vyuu.dha.m duryodhanastadaa |
aacaaryamupasafgamya raajaa vacanamabraviit ||

Please note that this notation is mandatory for such examples, where the first line should not be glued by sandhi to the second one.
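
When verses are prepared by a script before being pasted into the Reader, the markers can be added mechanically. The Python sketch below is a hypothetical helper, not part of the engine; it assumes each verse is given as a list of its lines without any final bar.

```python
from typing import Iterable


def format_verse(lines: Iterable[str]) -> str:
    """End every line with ' |' except the last one, which ends with ' ||'."""
    # Discard empty lines and any bars already present at the end of a line.
    cleaned = [ln.strip().rstrip("|").rstrip() for ln in lines if ln.strip()]
    if not cleaned:
        raise ValueError("a verse needs at least one line")
    marked = [f"{ln} |" for ln in cleaned[:-1]]
    marked.append(f"{cleaned[-1]} ||")
    return "\n".join(marked)


print(format_verse([
    "d.r.s.tvaa tu paa.n.davaaniika.m vyuu.dha.m duryodhanastadaa",
    "aacaaryamupasafgamya raajaa vacanamabraviit",
]))
```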

The Sanskrit Corpus (Experimental)

This is a set of tools to browse and manage a corpus. You can explore the corpus tree and possibly add and modify the analysis of a sentence. There are three modes of use (if you install the platform in the Station mode):

  1. Reader (available regardless of the installation platform): explore the corpus tree and display in read-only mode the analysis of a given sentence with the graphical interface of the segmenter
  2. Annotator: add a new sentence and save the current state of the analysis at any time via the "Save" button in the graphical interface of the segmenter
  3. Manager: add new branches to the corpus hierarchy

When you place yourself in a certain corpus location (foo/bar for example) and you decide to add a new sentence, you are directed to the Sanskrit Reader Companion (note the subtitle "Corpus annotator mode - foo/bar") to enter a new sentence, which will be added to the corpus at the location where you clicked the "Add" button.

Every time you want to switch from one mode to another, you have to click on the "Corpus" link in the green control bar at the bottom. If you simply want to go back quickly to the top of the corpus hierarchy while preserving the current mode, you can click on the title of any page of the corpus browser.

Software and its documentation

A short overview of the software components is available as the text document README in the distribution directory.

The complete OCaml source of all modules of the Heritage Engine is available in literate programming style as a PDF document, Heritage_platform_documentation. It may be considered as our vyākaraṇasūtrasaṃgraha.

How to install the Heritage Engine on your own server

The Heritage Engine is distributed as stand-alone software for workstations running versions of UNIX such as Linux or Apple's Mac OS X.

In order to install it, you must download two git repositories:
https://gitlab.inria.fr/huet/Heritage_Resources.git and https://gitlab.inria.fr/huet/Heritage_Platform.git.

Your first installation may be tricky if you are not familiar with the UNIX/Apache technology. But once your config file is correct, it will be very easy to install updates, as summarized in the document INSTALLATION in the top distribution directory of the Platform.

Please report installation difficulties and share your experience with these tools by writing to Gerard.Huet@inria.fr.

Authors of interesting feedback will be entered in the Heritage Hall of Fame.

A useful supplement to this manual is our page of frequently asked questions Faq, also available from the "Help" button on the site control bar.

© Gérard Huet 1994-2018