Sanskrit Morphology
We provide here inflected forms and morphemes derived from the root forms
defined in the
Sanskrit Heritage Dictionary. These forms are
presented as lemmas linking each form to its stem entry by possible morpho-phonetic
operations. We limit ourselves to classical Sanskrit, and do not cover precative,
subjunctive, injunctive and conditional forms of the verbs.
At present, we provide for two transliteration schemas, respectively
WX, used by the
Department of Sanskrit Studies at
University of Hyderabad
and SLP1, used by the
Sanskrit Library.
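As an illustration of how the two schemes relate, here is a minimal Python sketch that converts between WX and SLP1 for the letters on which they differ. The function names are hypothetical and are not part of the distributed resources, and the table covers only the common letters (it leaves out rare ones such as the long vocalic l), so treat it as an assumption-laden aid rather than a reference converter.

```python
# Letters whose encoding differs between WX and SLP1 (common letters only).
# WX:   q Q L  f F  t T d D N  w W x X  R
# SLP1: f F x  N Y  w W q Q R  t T d D  z
WX_LETTERS = "qQLfFtTdDNwWxXR"
SLP1_LETTERS = "fFxNYwWqQRtTdDz"

# str.translate substitutes all characters simultaneously, which matters
# here because several letters swap roles between the two schemes.
_WX_TO_SLP1 = str.maketrans(WX_LETTERS, SLP1_LETTERS)
_SLP1_TO_WX = str.maketrans(SLP1_LETTERS, WX_LETTERS)


def wx_to_slp1(text):
    """Convert a WX-encoded string to SLP1 (hypothetical helper)."""
    return text.translate(_WX_TO_SLP1)


def slp1_to_wx(text):
    """Convert an SLP1-encoded string to WX (hypothetical helper)."""
    return text.translate(_SLP1_TO_WX)


if __name__ == "__main__":
    print(wx_to_slp1("saMskqwa"))
    print(slp1_to_wx("kfzRa"))
```

Letters not listed in the table (for example a, A, k, M, H, S) are written identically in both schemes and pass through unchanged.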
The respective data banks are given in compressed archives (6 MB each, gzip format)
WX and
SLP1. After downloading these documents,
and uncompressing them (typically with the utility gunzip),
you get a UNIX tar archive containing the following data.
The morphological lemmas are distributed in 6 files in
XML format, conformant to a common DTD.
The nominal morphological declensions of nouns, adjectives and numbers
are covered in T_nouns.xml (where T is respectively WX or SL). Those of pronouns
are covered in T_pronouns.xml.
The conjugated forms of roots in the present, imperfect, imperative, optative,
perfect, aorist
and future tenses, as well as passives of the present system,
for the primary conjugation and for some secondary conjugations
(causative, intensive, desiderative) are covered in T_roots.xml.
Additional declensions of derived participial forms are given in T_parts.xml.
Absolutives, infinitives and other undeclinable words and particles
are listed in T_adverbs.xml. In addition, T_final.xml gives additional
generative morphemes. The files are conformant to the DTD T_morph.dtd.
Finally, the text file X_preverbs.txt lists common
preverb sequences, given with their sandhi analysis.
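The following Python sketch shows one way to check that a downloaded archive contains the six XML files named above. It relies only on the standard tarfile module. The helper names are hypothetical, and the demonstration builds a small stand-in archive in memory with placeholder content, since nothing is assumed here about the real archive's directory layout or about the XML vocabulary defined by the DTD.

```python
import io
import posixpath
import tarfile

# The six kinds of XML file described above; the prefix is WX or SL.
XML_KINDS = ["nouns", "pronouns", "roots", "parts", "adverbs", "final"]


def expected_xml_files(prefix):
    """Return the six expected file names for a given prefix."""
    return [f"{prefix}_{kind}.xml" for kind in XML_KINDS]


def missing_xml_files(archive_bytes, prefix):
    """List the expected XML files that are absent from a tar archive.

    The archive may be gzip-compressed or already uncompressed;
    mode "r:*" lets tarfile detect the compression itself.
    """
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        present = {
            posixpath.basename(member.name)
            for member in tar.getmembers()
            if member.isfile()
        }
    return [name for name in expected_xml_files(prefix) if name not in present]


def build_demo_archive(names):
    """Build a gzip-compressed tar archive in memory (demonstration only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in names:
            data = b"placeholder"  # stand-in content, not real lemmas
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    demo = build_demo_archive(["WX_nouns.xml", "WX_roots.xml"])
    print(missing_xml_files(demo, "WX"))
```

To check a real download, the bytes of the archive file would be read from disk and passed to missing_xml_files together with the prefix WX or SL.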
Intellectual Property
All these linguistic data banks are Copyrighted Gérard Huet 1994-2013.
They are derived from the Sanskrit Heritage Dictionary
version 2.75 dated 2013-05-05.
Use of these linguistic resources is granted according to the
Lesser General Public Licence for Linguistic Resources,
of which copies in
pdf and
html format are provided here.
Thank you for referencing the
origin of this data if you use it in your own work.
Methodology
We deal here with a mixture of derivational and inflexional morphology.
For instance, from the roots we generate verbal and participial stems, and from
these stems we generate in turn inflected forms: conjugated forms from the
verbal stems, and declined forms from the participial stems. But at present
we do not generate mechanically primary nominal stems from roots,
nor secondary nominal stems from primary ones, because of overgeneration.
The nominal stems, as well as the undeclinable forms, are taken from the
lexicon, which also lists some frequent participles.
This organization entails a different role for our morphological data bases.
The
basic morphological categories correspond to lexical phases,
which are the atomic letters in the defining grammar of a Sanskrit
word.
The forms listed in these data bases act as morphemes of this high-level
morphological definition, which is recursive, since compounding may be
iterated, as well as preverb formation, to a certain extent.
But this recursive power is limited, in the sense that the grammar of a word
is a regular one (type 3 in the Chomsky hierarchy), and its recognizer is
a finite automaton, whose states are precisely the lexical categories indexing
the basic data bases. This definition of word implements correctly the geometry
of constructions such as absolutives (which fall in two distinct categories,
the preverb form and the root form) and periphrastic phrases (periphrastic
futures with substantives, and periphrastic perfects as prefixes of finite
perfect forms of the auxiliary roots
as,
bhū and
kṛ, which are duplicated in a specific auxiliary lexicon).
Here is a simplified diagram of the current state space of our lexer.
This automaton is also the top-level view of our Sanskrit Tagger, which
implements Sanskrit analysis from
devanāgarī text.
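To make the idea of a word grammar recognized by a finite automaton more concrete, here is a toy Python sketch. Its states are lexical categories and a word is accepted when the categories of its successive morphemes trace a path from the start state to an accepting one. The category names and transitions are illustrative assumptions chosen for this sketch; they are not the actual state space of the lexer.

```python
START = "start"

# Hypothetical transition table: state -> categories allowed next.
# Iterated preverbs and iterated compound stems give the limited
# recursion; periphrastic stems must be followed by an auxiliary form.
TRANSITIONS = {
    START: {"preverb", "root_form", "compound_stem", "noun_form",
            "periphrastic_stem"},
    "preverb": {"preverb", "root_form"},
    "compound_stem": {"compound_stem", "noun_form"},
    "periphrastic_stem": {"auxiliary_form"},
    "root_form": set(),
    "noun_form": set(),
    "auxiliary_form": set(),
}

# A word must end on an inflected form, never on a bare stem or preverb.
ACCEPTING = {"root_form", "noun_form", "auxiliary_form"}


def accepts(categories):
    """Return True if the sequence of lexical categories forms a word."""
    state = START
    for category in categories:
        if category not in TRANSITIONS[state]:
            return False
        state = category
    return state in ACCEPTING


if __name__ == "__main__":
    print(accepts(["preverb", "preverb", "root_form"]))
    print(accepts(["compound_stem", "compound_stem", "noun_form"]))
    print(accepts(["preverb", "noun_form"]))
```

Because each state is simply the category of the last morpheme read, the recognizer needs no stack or counter, which is what keeps the word grammar regular.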
The technical exposition of this method, together with its correctness
justification, has been presented in various scientific journals and conferences,
and the corresponding articles are also freely available on my
publication page
(papers [78], [87], [88], [94] and [95] are especially relevant).
This material will not be repeated here. Let us just explain a few difficulties
of the large-scale implementation of this Sanskrit analyser.
As usual in a non-deterministic search algorithm (here, the enumeration of all the possible parsings
of a sentence as a sandhied stream of forms), we have two pitfalls: silence and noise.
Silence (lack of recall) means incompleteness. Some legal Sanskrit sentences
may fail to be recognized.
Typically, some root word may be missing from the base lexicon,
or some Vedic form may use some construction rare in the later language,
like precative or subjunctive.
Compounding gives rise to two complications, the raising of new cases by
bahuvrīhi compounding,
and the formation of
avyayībhāva compounds. Some of these
constructions are treated incompletely.
The opposite of silence is noise (lack of precision), that is, overgeneration.
We deal with overgeneration
in the syntactico-semantic layer of our tagger, which filters out combinations of
tags inconsistent with semantic role assignments.
We shall not discuss this technology
further in this note on morphology, and refer the interested reader to our
Sanskrit reader
demonstration page.
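Silence and noise can be illustrated with a deliberately simplified Python sketch that enumerates every way of cutting an unspaced string into forms found in a lexicon. This is an assumption-heavy toy and not the method of the tagger: it ignores sandhi entirely, and the lexicon and input strings are abstract placeholders rather than Sanskrit data.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon forms.

    Sandhi is ignored: forms are simply concatenated. Shorter first
    forms are tried before longer ones.
    """
    if text == "":
        return [[]]  # exactly one way to segment the empty string
    results = []
    for cut in range(1, len(text) + 1):
        prefix = text[:cut]
        if prefix in lexicon:
            for rest in segmentations(text[cut:], lexicon):
                results.append([prefix] + rest)
    return results


if __name__ == "__main__":
    toy_lexicon = {"a", "ab", "abc", "b", "bc", "c"}
    # Noise: one input string, several competing analyses.
    for analysis in segmentations("abc", toy_lexicon):
        print(analysis)
    # Silence: a form missing from the lexicon blocks every analysis.
    print(segmentations("abd", toy_lexicon))
```

With this toy lexicon the string abc receives four analyses, of which a later filtering layer would be expected to keep only the plausible ones, while abd receives none because no form ending in d is listed.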
We remark that the respective data bases can be queried online through our
stemmer
interface. But note that verbal forms prefixed by preverbs
are analysed by the tagger as non-atomic words, and only root forms and
their secondary conjugations are recognized by the stemmer.
Help
Questions concerning these resources should be addressed to
Gérard Huet.
All suggestions for improvements will be gratefully considered.