The Sanskrit Heritage Site
Version 2.75 [2013-05-05] (fr)

Welcome to the Sanskrit Heritage site.
It provides various services for the computational treatment of Sanskrit.
The first service is dictionary access. The dictionary is a hypertext structure
giving access to the Sanskrit lexicon, given with grammatical information.
There are currently two versions of the dictionary.
The first one is the original Heritage Sanskrit-French dictionary, which
serves as morphology generator, and is thus fully equipped with grammatical
tools. Furthermore, it offers rich encyclopedic content about Indian culture.
You may also download a printable pdf version of this dictionary, as
explained below.
The second lexicon is a digital version of the Monier-Williams Sanskrit-English
dictionary, a much more complete lexicon for the Sanskrit language.
It derives from Thomas Malten's digitization of the Monier-Williams
at Köln University, turned into an XML databank by Jim Funderburk,
and finally adapted to the HTML Heritage look and feel by Pawan Goyal.
The Sanskrit Heritage dictionary is thus mirrored in the Monier-Williams, which
allows compatibility of the grammatical tools.
The choice of the dictionary is set to a default by the configuration of the
server site. But each dictionary is accessible separately by its search page,
respectively
Sanskrit Heritage and
Monier-Williams.
This site offers a number of linguistic services for the Sanskrit language, such
as a Sanskrit Reader that parses Sanskrit
transliterated text
into Sanskrit banks of tagged hypertext.
Various phonological and morphological tools are also provided,
as explained below.
Sanskrit Heritage dictionary in book form
You may download the pdf file of the Heritage dictionary from
PDF.
This document is readable through Acrobat Reader,
well-known document management software from Adobe, freely available on the Internet.
Since the document is rather large, expect some delay
in loading its 4.3 MB.
Multilingual hyper-text dictionary
Interactive browsing
The dictionary may be accessed through an indexing engine:
Sanskrit Heritage or
through its mirror
Monier-Williams.
Your browser must be HTML5 compliant, and for proper viewing
of Sanskrit text you must have OpenType fonts installed on your system
for roman transliteration with diacritics, and for Devanagari.
For instance, install fonts IndUni, available from
John Smith's site.
A Unicode-compliant font for Devanagari with proper ligatures is Apple's
Devanagari MT for Macintosh OS X stations. For Windows users,
installation of the font 'Arial Unicode MS' is advised for proper rendering.
You may have to fiddle with the controls of your browser, so that the font
declarations from the dictionary pages get precedence over the standard
selection, and thus encoding is specified as Unicode compliant (UTF-8 encoding).
Note that many words are given with their etymology as hypertext links. You
may thus navigate from a word to its components, down to its roots.
Also, the gender declarations of
the main entries are mouse-sensitive, and give you direct access to the
relevant declension table. Similarly, the present class mark of the verbal roots
gives access to the conjugation schemes. Also for verb entries, preverbs
lead you to the correspondingly prefixed derived verbs.
All these grammatical tools, originally developed for the Heritage dictionary,
are being progressively extended to the Monier-Williams dictionary.
Sanskrit made easy
If you want to search for a Sanskrit word
without knowing its exact transliteration, go to section "Sanskrit made easy"
of the index page, which allows you to search for words without knowing
precise diacritics usage.
For instance, search Vishnou, Siva, or the grammarian Panini. This
interface is limited for the moment to the Sanskrit Heritage dictionary.
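One way to picture this kind of diacritic-insensitive search is to reduce both the query and the dictionary headwords to a common simplified key. The Python sketch below is only an illustration under that assumption: the function name `loose_key`, the folding rules and the three-word index are hypothetical, and this is not the algorithm the site actually uses.

```python
import unicodedata

# Hypothetical folding rules for approximate spellings (assumptions).
FOLDS = [("ou", "u"), ("sh", "s")]

def loose_key(word: str) -> str:
    """Reduce a word to a lower-case search key without diacritics."""
    # NFD splits a letter such as "ṣ" into "s" plus a combining dot.
    decomposed = unicodedata.normalize("NFD", word.lower())
    key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for old, new in FOLDS:
        key = key.replace(old, new)
    return key

# Toy index standing in for the dictionary headwords.
headwords = ["viṣṇu", "śiva", "pāṇini"]
index = {loose_key(w): w for w in headwords}

for query in ["Vishnou", "Siva", "Panini"]:
    print(query, "->", index.get(loose_key(query)))
```

A real index would have to cope with several headwords sharing one key, which this sketch ignores.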
Sanskrit Grammarian
This interface gives the declension tables for Sanskrit substantives.
Try out this
declension engine by submitting Sanskrit stems
with intended gender. The same transliteration conventions as for the
dictionary index apply. For instance, submit "deva" with gender Mas,
or "devii" with gender Fem, or "brahman" with gender Neu. The fourth
button, labeled "Any", may be used for the words which take their
gender from the context, such as deictic pronouns ("aham", "tvad"),
or numeral words such as "dva", "tri", etc.
A conjugation engine for roots is also available. It handles
the full present system: present indicative, imperfect, imperative and
optative, as well as the passive present, the perfect, the aorist
and the future.
Participial stems, absolutives and infinitives are listed as well.
Some secondary conjugations (causative, intensive,
desiderative) are also generated, for the full present and future systems.
Try out this conjugation engine
with data such as "bhuu" 1, "as" 2, "m.rj" 2, "han" 2, "haa" 3, "hu" 3,
"daa" 4, "su" 5, "p.r" 6, "yuj" 7, "k.r" 8, "j~naa" 9, "namas" 10.
In order to get the secondary conjugations of a root, enter code 0.
You may cascade by generating declensions of the generated participial stems.
A word of caution is called for here. The only safe way to get correct
inflected forms is to enter the stem and its morphological parameters
consistently with their specification in the Heritage dictionary. This is
especially true of roots, since they appear with various names according to
various Sanskrit grammars. For instance, root hū is called hū,
hvā or hve according to various grammarians. Another problem
is homophony. When two items have the same phonetic realisation, their
respective lexemes are disambiguated
by an integer index, which is specific to the lexicon. Thus there are
three roots named mā in the Sanskrit Heritage dictionary. They are
addressed respectively (in SH transliteration) as maa#1, maa#3 and maa#4.
If you ask for the conjugated forms of maa in present classes 2 or 3, the
system will guess you mean maa#1 (to measure). But if you mean maa#3
(to mow) or maa#4 (to exchange) you have to enter explicitly their stems
maa#3 or maa#4. Entering an arbitrary stem and arbitrary morphology parameters
may yield random results or error messages.
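The addressing convention just described can be sketched in a few lines of Python. This is a minimal illustration, not the site's code: the names `parse_entry`, `resolve` and `DEFAULT_INDEX` are hypothetical, and the default table contains only the one case stated above (maa in present classes 2 or 3 is taken to be maa#1).

```python
from typing import Optional, Tuple

# Default homonym choices, limited to the example given in the text.
# The table is an illustrative assumption, not the site's data.
DEFAULT_INDEX = {("maa", 2): 1, ("maa", 3): 1}

def parse_entry(entry: str) -> Tuple[str, Optional[int]]:
    """Split 'maa#3' into ('maa', 3); a bare stem carries no index."""
    stem, sep, index = entry.partition("#")
    return stem, (int(index) if sep else None)

def resolve(entry: str, present_class: int) -> Tuple[str, Optional[int]]:
    """Return the lexeme addressed by an entry, guessing the index if absent."""
    stem, index = parse_entry(entry)
    if index is None:
        index = DEFAULT_INDEX.get((stem, present_class))
    return stem, index

print(resolve("maa", 2))    # index guessed from the default table
print(resolve("maa#3", 2))  # explicit index is kept as given
```

An explicit index always wins over the default, which mirrors the advice to enter maa#3 or maa#4 when those roots are meant.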
Lemmatizer
Conversely, a
lemmatizer
attempts to tag inflected words.
Try for instance (in Velthuis transliteration format)
"devaat", "jagmivaan", "a.s.tau" (clicking on Noun)
or "apibat", "akaar.siit", "dudoha", "vaahyate" etc. (clicking on Verb).
This lemmatizer knows about inflected forms of derived stems in some
secondary derivations.
For instance, "darzayi.syati" is found as conjugated form:
{ ca. fut. a. sg. 3 }[dṛś_1],
"dariid.rzyate" yields { int. pr. m. sg. 3 }[dṛś_1],
"did.rk.sate" yields { des. pr. m. sg. 3 }[dṛś_1]
and "bibhik.se" yields { des. pft. m. sg. 3 | des. pft. m. sg. 1 }[bhaj].
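If you want to post-process such answers, the notation shown in these examples is regular enough to parse. The following Python sketch assumes that a tag always has the shape `{ ... | ... }[stem]` or `{ ... }[stem_index]` as in the examples above; the names `Lemma` and `parse_tag` are hypothetical, and the abbreviations are kept as plain tokens rather than interpreted.

```python
import re
from typing import List, NamedTuple, Optional

class Lemma(NamedTuple):
    stem: str                   # e.g. "bhaj"
    index: Optional[int]        # homonym index after "_", if any
    analyses: List[List[str]]   # one token list per alternative

# "{ alternatives }[lemma]" as displayed in the examples.
TAG = re.compile(r"\{\s*(.*?)\s*\}\[([^\]]+)\]")

def parse_tag(text: str) -> Lemma:
    match = TAG.search(text)
    if match is None:
        raise ValueError(f"not a lemmatizer tag: {text!r}")
    body, lemma = match.groups()
    stem, sep, index = lemma.partition("_")
    # Alternatives are separated by "|", parameters by blanks.
    analyses = [alternative.split() for alternative in body.split("|")]
    return Lemma(stem, int(index) if sep else None, analyses)

tag = parse_tag("{ des. pft. m. sg. 3 | des. pft. m. sg. 1 }[bhaj]")
print(tag.stem, tag.index, tag.analyses)
```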
N.B. Do not attempt to lemmatize verbal forms with preverbs: this will
not work, since the lemmatizer only knows how to invert root forms. Lemmatizing
more complex forms is possible through the Sanskrit Reader interface,
as we shall see below.
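The ASCII inputs quoted in the examples above can be turned into roman transliteration with diacritics by a simple longest-match substitution. The Python sketch below is an assumption-laden illustration: the table is partial, it includes "z" for ś only because inputs such as "darzayi.syati" on this page use it, and the name `to_iast` is hypothetical rather than part of the site.

```python
import unicodedata

# Partial table of the ASCII codes seen in the examples (an assumption,
# not the complete scheme accepted by the site).
TABLE = {
    "aa": "ā", "ii": "ī", "uu": "ū",
    ".rr": "ṝ", ".r": "ṛ", ".l": "ḷ",
    ".m": "ṃ", ".h": "ḥ",
    '"n': "ṅ", "~n": "ñ", ".n": "ṇ",
    ".t": "ṭ", ".d": "ḍ",
    '"s': "ś", "z": "ś", ".s": "ṣ",
}
# Try longer codes first so that ".rr" wins over ".r".
CODES = sorted(TABLE, key=len, reverse=True)

def to_iast(text: str) -> str:
    """Rewrite an ASCII-transliterated word with diacritics."""
    out = []
    i = 0
    while i < len(text):
        for code in CODES:
            if text.startswith(code, i):
                out.append(TABLE[code])
                i += len(code)
                break
        else:
            # No code starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return unicodedata.normalize("NFC", "".join(out))

for word in ["devaat", "a.s.tau", "akaar.siit", "darzayi.syati", "j~naa"]:
    print(word, "->", to_iast(word))
```

Greedy matching treats every "aa" as a long vowel, which is the intended reading for single words but may need care on unsegmented text.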
Morphology
A dictionary of inflected forms of Sanskrit words is provided
in XML form under various transliteration schemes.
Please visit the Sanskrit linguistic resources site.
Sanskrit Reader
Try our interactive Sanskrit Reader.
It is able to segment simple sentences.
Try for instance to segment "tryambaka.myajaamahesugandhi.mpu.s.tivardhanam"
(we assume Velthuis transliteration here).
Then push the "Tagging" button and get the fully tagged sentence.
You will see two segmentations, one with an identified compound form
"tri-ambakam", the second with a compounded segment "tryambakam".
Note that each segment is indicated with a lemma giving its stem
and the set of morphological parameters that may generate the segment form
from its stem. The stem is hyperlinked to the dictionary of choice.
Note also that segments are separated by phonological information
in the shape of a sandhi rule, showing how the original
sentence is obtained by successive sandhi application. For instance, solution 1
explains the compound "tryambakam" as the sandhi of segments "tri" and
"ambakam" by rule ‹i|a → ya›.
The reader may be helped by inserting blanks in the input at word junctions.
For instance, the above mantra may be entered as
"tryambaka.m yajaamahe sugandhi.m pu.s.tivardhanam".
But compounds should stay in one piece.
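A sandhi rule such as ‹i|a → ya› can be read as a rewrite at the junction of two segments: the end of the left segment and the start of the right one are replaced by the right-hand side. The Python sketch below illustrates this reading on the mantra, in the same ASCII transliteration. Only the first rule comes from the text above; the other three rules and the names `join` and `rebuild` are assumptions chosen so that the segments rebuild the unsegmented input.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str]  # (u, v, w), read as the rule u|v -> w

def join(left: str, right: str, rule: Rule) -> str:
    """Glue two segments, rewriting the junction u|v into w."""
    u, v, w = rule
    if not (left.endswith(u) and right.startswith(v)):
        raise ValueError(f"rule {u}|{v} does not apply to {left!r} + {right!r}")
    return left[: len(left) - len(u)] + w + right[len(v):]

def rebuild(segments: List[str], rules: List[Rule]) -> str:
    """Apply one rule per junction, from left to right."""
    text = segments[0]
    for segment, rule in zip(segments[1:], rules):
        text = join(text, segment, rule)
    return text

segments = ["tri", "ambakam", "yajaamahe", "sugandhim", "pu.s.tivardhanam"]
rules = [
    ("i", "a", "ya"),    # rule quoted in the text
    ("m", "y", ".my"),   # assumed: final m becomes anusvara before y
    ("e", "s", "es"),    # assumed: no change at this junction
    ("m", "p", ".mp"),   # assumed: final m becomes anusvara before p
]
print(rebuild(segments, rules))
```

Segmentation runs this process backwards, which is why a single input can have many candidate analyses.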
Spaces are also needed for hiatus, in sentences such as:
"tacchrutvaasa~njaya uvaaca".
Many options are provided in the menu of the Reader page. For instance,
clicking on the Unsandhied button we may present text in
padapāṭha form, where each chunk is in terminal sandhi form.
For instance "tryambakam yajaamahe sugandhim pu.s.tivardhanam".
Two strengths of the Reader are provided. The Simplified mode, offered as a
default, does not recognize vocatives. The Complete mode is more powerful,
using the full range of participles of verbs, privative compounds, etc.
It may however return so many solutions that listing all solutions is
impractical, and other facilities must be used.
The grammar used to recognize sentences is presented
as a local automaton state transition graph,
Lexer automaton.
This is actually a simplification of the segmenter automaton control.
A simpler one, close to the Simplified mode of the reader, is
Simplified automaton.
A fuller one, close to the Complete mode of the reader, is
Complete automaton.
The color codes of these diagrams explain the output conventions of the tags.
In these diagrams, transparent nodes are non generative, and colored nodes
correspond to the lexical categories recognized by the lemmatizer. The
category Auxi is the subset of Verb consisting of conjugated forms of
roots "k.r", "as" and "bhuu" used as auxiliaries in periphrastic constructions.
Pv denotes sequences of preverbs.
Sanskrit Parser
If in the reader you press the "Parsing" button, many irrelevant
pseudo-solutions are eliminated. Try for instance example
"pratilekhanenaak.saraa.nisundaraa.nibhavanti". In Simplified mode,
it shows 80 potential segmentation solutions, but the parser keeps only 1.
Each solution returned by the parser is marked with a green check sign,
which may be pressed to get the semantic analysis of the sentence in
terms of roles (kāraka).
The parser recognizes sentences. It may be made to recognize nominal phrases,
provided one presses the "Contextual topic" button with the intended gender.
You may for instance analyze the compound:
"pravaran.rpamuku.tama.nimariicima~njariicayacarcitacara.nayugala.h"
as a masculine nominal.
Alternatively, one can ask to recognize this form as a single word, by pressing
"Word" rather than the default "Sentence" text category.
When the text is broken up with spaces, the Word mode makes it possible to
recognize texts given in padapāṭha fashion.
It is also possible to recognize sequences of chunks in final sandhi form
separated by spaces, where sandhi will be assumed to be undone between the
chunks, by specifying the "Unsandhied" mode in the reader interface.
Sanskrit Tagger
The semantic analysis may be still ambiguous, since a given segment may be
decorated by several morphological categories. All interpretations are
presented under the role matrix, sorted by increasing penalty. Check for
your favorite interpretation in this list, and select it by clicking on its
green heart symbol. The system will return the corresponding unambiguously
tagged sentence, as a page which you may save on your own station. Iterating
this process allows you to progressively tag a Sanskrit text with the Sanskrit
reader assistance.
Alternatively, you may select the ambiguous morphology choices, each being
provided with a selection button. Selections are chosen by default at the
first choice, but you may override this default and choose manually e.g.
the genders of nominals. When your choice is finalized, just click on the
"Submit" button and you will get the corresponding deterministically tagged
sentence. This tool is useful for semi-manual corpus annotation.
Summary mode
Now that you are more familiar with using the various modes of the Reader
on small Sanskrit sentences, it is time to try to analyse more complex
sentences. Obviously the listing of all solutions is out of the question with
long sentences in Complete mode. A new visual interface is offered for
semi-automatic segmentation. This new Summary mode is actually now proposed as a
default in the Reader page.
Try for instance
"satya.mbruuyaatpriya.mbruuyaannabruuyaatsatyamapriya.mpriya.mcanaan.rtambruuyaade.sadharma.h sanaatana.h".
The display presents a summary of the union of all solutions, as a chart of
segments aligned on their respective input contribution.
You see at the right end the segment sanātanas proposed first, on top of a forest of smaller word combinations.
Click on the green check sign below it. The check sign becomes blue,
and the forest of irrelevant combinations vanishes.
Do the same under the satyam segments, then under apriyam, all segments
presented as top candidates. Now choose the particle ca (and thus na).
Now only one choice remains, between brūyāt and brūyām.
Clicking on the first
one will finish the job. Indeed only one solution remains, as may be checked
by clicking on the "Show All Solutions" check sign, which may be invoked
at any point in the process. You are now viewing the same output as given by the
Reader in Tagging mode, but constrained to use only segments checked in the Summary.
You may alternatively check the "Show Preferred Solutions" check sign,
asking the Parser to further filter according to its system of penalties.
If you make a mistake in the selection of segments, it is easy to
backtrack using the "Undo" button.
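The pruning effect of a selection can be pictured with a small Python sketch. It rests on a simplifying assumption that is ours, not the site's: each candidate segment covers a span of the input, and choosing one discards every other candidate whose span overlaps it. The real chart must also account for sandhi at segment boundaries, which is ignored here, and the names `Segment`, `overlaps` and `select` as well as the toy offsets are hypothetical.

```python
from typing import List, NamedTuple

class Segment(NamedTuple):
    start: int  # offset of the first input character covered
    end: int    # offset just past the last input character covered
    form: str

def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two spans share at least one input position."""
    return a.start < b.end and b.start < a.end

def select(chart: List[Segment], chosen: Segment) -> List[Segment]:
    """Keep the chosen segment and every candidate compatible with it."""
    return [s for s in chart if s == chosen or not overlaps(s, chosen)]

# Toy chart with made-up offsets: one long candidate over shorter ones.
chart = [
    Segment(0, 2, "ca"),
    Segment(2, 13, "sanaatana.h"),
    Segment(2, 4, "sa"),
    Segment(4, 6, "na"),
]
remaining = select(chart, chart[1])
print([s.form for s in remaining])
```

Repeating the selection on what remains narrows the chart step by step, as in the walk-through above.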
Other Sanskrit Resources
We have an ongoing cooperation with the Department of Sanskrit Studies
of the University of Hyderabad on computational linguistics for Sanskrit.
A joint research team has been formed, together with scholars
from the Sanskrit Library. This team is actively developing cooperating
multi-platform Web services.
In October 2007 we organized the First International Sanskrit Computational
Linguistics Symposium. Please visit
the Symposium Site.
This was followed by the Second Symposium in May 2008 at
Brown University,
by a third one in January 2009 at
Hyderabad University,
and by a fourth one in December 2010 at
JNU.
The next one just took place in January 2013 at
IIT Bombay.
The Zen Library
This site reflects an ongoing project of Sanskrit processing
on a comprehensive software platform.
The project is based on a structured lexicographic database, compiled from
the Sanskrit Heritage dictionary, and on
the Zen computational linguistics toolkit. This toolkit is a library
of programs implemented in Pidgin ML, the functional core of the
Objective Caml
programming language. The Zen library and its documentation are available
as free software under the GNU Lesser General Public License (LGPL) from the
Zen site.
The Sanskrit Portal
Please visit our Sanskrit Portal
to find links to other Sanskrit resources.
Artwork credits
Orissan artwork at this site courtesy of Shauraj Rath.
© Screenex, Bhubaneshwar, Ekamra, Orissa. All rights reserved.
Wallpaper om images courtesy of
Vishvarupa.com.
Ganesh wallpaper courtesy of
François Patte.
Shri Yantra design ©
Gérard Huet 1990.