system
A Thesis
By
R.SRIBADRI NARAYANAN
March, 2012
Amrita School of Engineering
Amrita Vishwa Vidyapeetham, Coimbatore 641105
BONAFIDE CERTIFICATE
Ettimadai, Coimbatore.
Date:
DR. K P SOMAN
RESEARCH GUIDE AND HEAD, CEN.
Amrita School of Engineering,
Amrita Vishwa Vidyapeetham, Coimbatore
641105
DECLARATION
Place: Ettimadai
Date:
Signature of the Student
Countersigned by
K P SOMAN
I extend my cordial thanks to all the teaching and non-teaching staff of the
Department of Computational Engineering and Networking for the help rendered at
various phases of the project work.
I express my thanks to my parents and friends, who always stood by me
with their valuable suggestions and help.
ABSTRACT
The parser gives the grammatical tree structure of an English sentence. For this purpose
we use the Stanford parser, which gives better results than other
parsers.
The final process is integrating the tools in a single platform and producing the
Telugu output.
CONTENTS
Abstract
Contents
List of Figures
List of Tables
Chapter 1
Introduction
1.1 ISSUES IN MACHINE TRANSLATION
Chapter 2
Literature Survey
2.1 MACHINE TRANSLATION
2.2 THE NECESSITY OF MACHINE TRANSLATION
2.3 DIFFERENT CATEGORIES OF MACHINE TRANSLATION SYSTEMS
2.4 VARIOUS APPROACHES TO MACHINE TRANSLATION
2.4.1 LINGUISTICS OR RULE BASED APPROACH
2.4.2 NON-LINGUISTIC APPROACHES
2.4.3 HYBRID APPROACH
2.5 MORPHOLOGICAL ANALYZER AND GENERATOR
Chapter 3
Overview Of Telugu Language
3.1 DEMOGRAPHIC INFORMATION
3.2 GENERIC AFFILIATION AND HISTORY
3.3 THE TELUGU SCRIPT
3.3.1 ORIGIN AND DEVELOPMENT
3.3.2 TELUGU ALPHABET
3.4 COMPUTATIONAL GRAMMAR OF TELUGU
3.4.1 NOUNS
3.4.2 VERBS
Chapter 4
Overview Of English-Telugu Machine Translation System
4.1 PARSER
4.2 REORDERING
4.3 DICTIONARY
4.4 TRANSLITERATION
4.5 MORPHOLOGICAL ANALYZER
4.5.1 INTRODUCTION
4.5.2 DATA CREATION FOR SUPERVISED LEARNING
4.5.3 IMPLEMENTATION OF MORPHOLOGICAL ANALYZER MODULE
4.6 MORPHOLOGICAL GENERATOR
4.6.1 INTRODUCTION
4.6.2 MORPHOLOGICAL GENERATOR FOR TELUGU
4.6.3 DIFFICULTIES IN MORPHOLOGICAL GENERATION FOR TELUGU
4.6.4 FORMATION OF INFLECTIONAL TABLE
4.6.5 METHODOLOGY
Chapter 5
Results
5.1 TESTING AND RESULTS
5.2 DISCUSSION
5.3 SCREEN SHOT OF MORPHOLOGICAL ANALYZER
5.4 TESTING AND RESULTS
5.5 DISCUSSION
5.6 SCREEN SHOT OF MORPHOLOGICAL GENERATOR
5.7 TESTING AND RESULTS
5.8 DISCUSSION
5.9 SCREEN SHOT OF ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM
Chapter 6
Conclusion
References
Publication
LIST OF FIGURES
Fig. 4.1. General block diagram for English-Telugu machine translation system
Fig. 4.4. Steps involved in preprocessing data for SVM model
LIST OF TABLES
translation system
ABBREVIATION
CV Consonant-Vowel
MT Machine Translation
SOV Subject-Object-Verb
SVO Subject-Verb-Object
CHAPTER 1
INTRODUCTION
In such a situation, there is a big market for translation between English and the
various Indian languages. Currently, the translation is done manually, and the use of
automation is largely restricted to word processing. Two specific examples of
high-volume manual translation are the translation of news from English into local
languages, and the translation of annual reports of government departments and
public sector units among English, Hindi and the local language. Many resources in
English, such as news, weather reports and books, are manually translated into Indian
languages. Of these, news and weather reports from all around the world are the ones
most often translated from English into Indian languages by human translators. Human
translation is slow and costly compared to machine translation. Automatic machine
translation is chosen over human translation here because it is faster and cheaper.
CHAPTER 2
LITERATURE SURVEY
But it is a tedious job to find a translator who knows both the language in which the
literature was written and the language into which the user wants it translated,
i.e. the language known to the user. It is also time consuming and very
expensive. And if the resource to be translated is huge, it is practically impossible for
humans to translate all of it manually in a short span of time. The only
solution to this problem is to design a machine which can perform the translation
automatically.
MACHINE AIDED HUMAN TRANSLATION can range from automatic look-up programs
to systems which are practically fully automatic, but which require the translator to
approve each sentence. Examples of some of the more successful software of this type
are the Translator's Workbench of Trados and INK Tools.
a writing pen
a play pen
a pig pen
Human intervention can also mean post-editing to check the translation and fix
mistakes made by the computer. It should be noted that the pre-editing and glossary
compilation required for HAMT typically require a trained linguist
who can parse the syntax of the sentence, not simply a translator who understands the
foreign language and can express it in his/her own language.
Obviously the most primitive is the system which requires pre-editing, since the
computer cannot handle the text unless a human converts NL into a semi-artificial
language which is easier for the computer to understand. The ideal is when the
automatic translation is so good that all that is necessary is to check the translation
and change a few details. Interactive intervention can be anywhere in between.
FULLY AUTOMATED MACHINE TRANSLATION systems may suit
the needs of people who have to search through mountains of information and only
need to get a very general idea of the contents of a document (a good example is
provided by the low-quality needs of the military and the intelligence agencies), but
high-quality translation of truly natural language which is really fully automatic
hardly exists. Fully Automatic High Quality Machine Translation
(FAHQMT) systems either require the compilation of extensive
glossaries, or are restricted to specific sub-worlds or sublanguages, or both.
FIGURE 2.1 ILLUSTRATES DIFFERENT APPROACHES OF MACHINE TRANSLATION
SYSTEM
TABLE 2.1 AN EXAMPLE TO ILLUSTRATE THE DIRECT APPROACH TO MACHINE
TRANSLATION
Word Reordering
PAST>
Dictionary Lookup
Predicate     Reach
Agent         Boy (Number: Singular)
Theme         Hospital (Number: Singular)
Instrument    Ambulance (Number: Singular)
Tense         FUTURE
2.4.1.3 TRANSFER APPROACH
The transfer model involves three stages: analysis, transfer, and generation. In
the analysis stage, the source language sentence is parsed, and the sentence structure
and the constituents of the sentence are identified. In the transfer stage,
transformations are applied to the source language parse tree to convert the structure
to that of the target language. The generation stage translates the words and expresses
the tense, number, gender, etc.
RELATED WORKS
the human translation, by providing meaningful word translations and limiting the
work of humans to correcting the syntax and grammar of the sentence.
The corpus-based approaches do not require any explicit linguistic knowledge to
translate a sentence. Instead, a bilingual corpus of the language pair and a monolingual
corpus of the target language are required to train the system to translate a sentence.
This approach has attracted a lot of interest worldwide from the late 1980s until now.
RELATED WORKS
2.4.3 HYBRID APPROACH
The hybrid machine translation approach makes use of the advantages of both
statistical and rule-based translation methodologies. Commercial translation systems
such as Asia Online and Systran provide systems that were implemented using this
approach. Hybrid machine translation approaches differ in a number of aspects:
2.5 PARSER
Parsing is the process of analyzing a text, made of a sequence of tokens, to
determine its grammatical structure with respect to a given formal grammar. The two
approaches for developing parsers are the top-down approach and the bottom-up approach.
Some of the parsers available as open software are the XML parser, Stanford parser, LL
parser and LR parser.
RELATED WORKS
2.6.2 STEMMER
A stemmer [6] is used for stripping affixes. It uses a set of rules containing
a list of stems and replacement rules.
E.g.: writing -> write + ing
For a stemmer program we have to specify all possible affixes with
replacement rules.
E.g.: ational -> ate     relational -> relate
      tional -> tion     conditional -> condition
The most widely used stemmer algorithm is the Porter algorithm. The algorithm
is available free of cost at http://www.tartarus.org/martin/PorterStemmer/.
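The replacement rules above can be sketched in a few lines of Python. This is only a minimal illustration of suffix stripping with replacement, not the Porter algorithm itself; the rule list and the name strip_suffix are assumptions made for this example.

```python
# Suffix replacement rules taken from the examples above, longest suffix first.
# The full Porter algorithm adds conditions on the stem and several rule phases.
RULES = [
    ("ational", "ate"),   # relational -> relate
    ("tional", "tion"),   # conditional -> condition
    ("ing", ""),          # writing -> writ (bare stripping, the final "e" is not restored)
]

def strip_suffix(word):
    """Apply the first matching rule and return the rewritten word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched, leave the word unchanged
```

Ordering the rules from longest to shortest suffix keeps "relational" from being caught by the shorter "tional" rule.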
RELATED WORKS
There have been some attempts to develop stemmers for Indian languages as well. IIT
Bombay and NCST Bombay have developed a stemmer for Hindi [Manish, Anantha].
2.6.4 DAWG (DIRECTED ACYCLIC WORD GRAPH)
DAWG is a very efficient data structure for lexicon representation and fast
string matching, with a great variety of applications. This method has been
successfully implemented for the Greek language by the University of Patras, Greece. The DAWG
data structure can be used for both morphological analysis and generation. This
approach is language independent: it does not utilize any morphological rules or any
other special linguistic information. The method can be tested for Indian languages
also. Figure 2.2 shows an example of a DAWG. In the figure, A, B, C, U, L, T and S
label the transitions from one node to another.
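A minimal sketch of the idea, assuming a small word list: a trie is built first and equivalent subtrees are then merged, so that words with a common ending share nodes. The class and function names are hypothetical, and this is not the implementation used in the work cited above.

```python
class Node:
    """One DAWG state: outgoing labelled edges plus an end-of-word flag."""
    def __init__(self):
        self.edges = {}
        self.final = False

def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, registry):
    """Merge equivalent subtrees bottom-up so shared suffixes use one node."""
    for ch in list(node.edges):
        node.edges[ch] = minimize(node.edges[ch], registry)
    # Children are already canonical, so their identity describes the subtree.
    key = (node.final,
           tuple(sorted((ch, id(child)) for ch, child in node.edges.items())))
    return registry.setdefault(key, node)

def build_dawg(words):
    return minimize(build_trie(words), {})

def contains(root, word):
    """Follow one edge per character; the word is present if we end on a final node."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final

def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)
```

For the two words "cats" and "bats", the trie needs nine nodes while the merged graph needs five, because the common ending "ats" is stored once.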
A paradigm defines all the word forms of a given stem and also provides a
feature structure with every word form. The paradigm-based approach is efficient for
inflectionally rich languages.
This scheme, or a variant of it, has been used widely in NLP. The linguist or
the language expert is asked to provide different tables of word forms covering the
words in a language. Each word-form table covers a set of roots, which means that
the roots follow the pattern (or paradigm) implicit in the table for generating their
word forms. Almost all Indian language morphological analyzers are developed using
this method. Based on the paradigms, the program generates add-delete strings for
analysis. The paradigm approach relies on the finding that the different types of word
paradigms are based on their morphological behavior.
RELATED WORKS
The ANUSAARAKA research group has developed a language independent
paradigm based morphological compiler program for Indian Languages.
Words are categorized as nouns, verbs, adjectives, adverbs and postpositions.
Each category is classified into certain types of paradigms based on their
morphophonemic behavior. For example, the noun Uru (village) belongs to a paradigm
class different from that of Abbayi (boy), as
they behave differently morpho-phonemically.
We have used Machine learning using SVMTool for implementing
Morphological Analyzer and paradigm approach for Morphological generator.
CHAPTER 3
OVERVIEW OF TELUGU
LANGUAGE
Historically Telugu Language is also known by the names, amdhram, tenu (m) gu
and gentoo [8].
CONSONANTS
CONJUNCT CONSONANTS
VOWEL MODIFIERS
3.4.1 NOUNS
A noun [9] in Telugu is inflected in a complex way. Nouns in Telugu
characteristically carry the markings of gender, number, person and case.
A number of nouns in Telugu change their form before the marking of
gender, number, person and case. Systematic changes occur in the base,
particularly when it is inflected for non-nominative cases such as accusative, dative,
instrumental, ablative and locative. Conventionally, the non-nominative base of a noun is
also known as the oblique base or oblique form. However, it should be noted that such a
base is neither unique nor common.
GENDER MARKING ON NOUN
Though the inflection classes are insensitive to gender distinctions, there are
distinctions of gender discernible from the morphology of agreement on verbs, adjectives,
possessives, predicate nominals, numerals and deictic categories. It is necessary to
identify four distinctions in gender, viz. nouns indicating:
Human males
Human females
Humans, and
Non-humans.
This distinction is necessitated by the distribution of nouns indicating human
females, which are grouped with neuter nouns in the singular but with human males in the plural.
However, a number of nouns denoting human males end in du, and human
females end in di.
NUMBER MARKING IN TELUGU NOUNS
Telugu nouns usually occur in two numbers, singular and plural. However,
only plural nouns are explicitly marked. For a large number of nouns the form of
the plural suffix is lu, while for some nouns of the human male category the
plural suffix alternant is ru.
GENDER- NUMBER-PERSON MARKING ON NOUNS
Telugu nouns, when they function as nominal predicates, show agreement with the
gender, number and person of the surface subject of the clause. Pronominalized
possessive nouns (possessors) show agreement (in gender, number and person) with
the nouns of possession and function as heads of possessive phrases. In these two
cases nouns are marked by pronominal suffixes of the relevant gender, number and
person. The person marking on nouns is, however, explicit only in the 1st and 2nd person,
both singular and plural. In the case of the 3rd person, only the number is marked
explicitly and not the person.
inflected for location in time and space. Nouns when attached with various
combinations of adverbial nouns and case markers or post-positions express many
more such relations.
3.4.2 VERBS
A verb [16] denotes the state of or action by a substance. A Telugu verb may be
finite or non-finite. All finite verbs and some non-finite verbs can occur, according to
the situation, before the utterance-final juncture /#/, characterized by one of the following terminal
contours: rising pitch, meaning question; level pitch, falling pitch, meaning command.
A finite verb does not occur before any of the non-final junctures. On the
morphological level, no non-finite verb contains a morpheme indicating person; this
statement should not, however, be taken to mean that all finite verbs necessarily
contain a morpheme indicating person. Since any verb, finite or non-finite, occurs
only after some marked juncture, by definition of these junctures all verbs have
phonetic stress or prominence on their first syllable, which is invariably part of the root.
Almost every Telugu verb has a finite and a non-finite form. A finite form is
one that can stand as the main verb of a sentence and occur before a final pause (full
stop). A non-finite form cannot stand as a main verb and rarely occurs before a final
pause.
FINITE VERBS
The eight finite forms of the modern Telugu verb may be arranged in three
structural types, which are set up according to the differences in the grouping of the
three substitution classes:
Stem
Tense-mode suffix
Personal suffix(es)
The paradigms of the finite forms of a simple verbal base are given below
under the three structural types, using ammu (to sell), with two allomorphs: amm- before a
vowel.
Type 1: stem + personal suffix:
1. Imperative:
   Singular   -u       amm-u (sell)
   Plural     -andi    ammu-andi
Type 2: stem + tense-mode suffix:
2. Admonitive or abusive:
On account of semantic restrictions, many verbs cannot occur in this mood. A
few bases like kAlu (to burn), kUlu (to fall), cAvu (to die), pagulu (to break), etc.,
occur.
E.g.: nIyilli kUlu - may your house fall
3. Obligative (in all persons): -Ali
   amma-Ali    I, we, you (sg, pl), he, she, it
Type 3: stem + tense-mode suffix + personal suffix
4. Habitual-future or non-past: -t-
   ammu-t-Anu     I shall sell
   ammu-t-Am      we shall sell
   ammu-t-Ava     you (singular) shall sell
   ammu-t-Aru     you (plural) shall sell
   ammu-t-Adu     he shall sell
   ammu-tun-di    she shall sell
   ammu-t-Ay      they shall sell
5. Past tense: -i-
   ammu-i-Anu*    I sold
   ammu-i-Am      we sold
   ammu-i-Ava     you sold (singular)
   ammu-i-Aru     you sold (plural)
   ammu-i-Adu     he sold
   ammu-in-di     she/it sold
   ammu-i-Aru     they sold
6. Hortative: -d-
   ammu-d-Am    let us sell, or we shall sell
7. Negative tense: -a-
   ammu-a-nu    I (do, did, and shall) not sell
   ammu-a-m     we (do, did, and shall) not sell
   ammu-a-va    you (do, did, and shall) not sell
   ammu-a-Du    he (does, did, and shall) not sell
   ammu-a-du    she/it (does, did, and shall) not sell
   ammu-a-ru    they (do, did, and shall) not sell
8. Negative imperative or prohibitive: -Ak-
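The Type 3 pattern (stem + tense-mode suffix + personal suffix) can be illustrated with a small sketch that rebuilds the segmented forms listed above. The suffix values are copied from the habitual-future and past paradigms; the function name, the person keys and the dictionary layout are assumptions for this example, and the output is the hyphenated segment string, not the surface word after morphophonemic change.

```python
# Tense-mode suffixes from items 4 and 5 above.
TENSE_MODE = {"habitual-future": "t", "past": "i"}

# Personal suffixes as labelled in the past-tense paradigm (item 5).
PERSONAL = {
    "1sg": "Anu",
    "1pl": "Am",
    "2sg": "Ava",
    "2pl": "Aru",
    "3sg.m": "Adu",
}

def type3_form(stem, tense_mode, person):
    """Join stem, tense-mode suffix and personal suffix with hyphens."""
    return "-".join([stem, TENSE_MODE[tense_mode], PERSONAL[person]])
```

Keeping the two suffix classes in separate tables mirrors the substitution classes: either class can be extended without touching the other.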
Unbound
Bound
Type 1:
1. Present participle    -tu     ammu-tU     selling
2. Past participle       -i      ammu-i      having sold
3. Concessive            -inA    ammu-inA    even though sold
4. Conditional           -itE    ammu-itE    if sold
5. Infinitive            -a      ammu-a      to sell
6. Negative participle   -aka    amm-aka     not selling
7. Habitual adjective    -E      amm-E       that sells
8. Past adjective        -ina    amma-ina    that sold
9. Negative adjective    -ani    ammu-ani    not selling
Type 2:
Bound present -t-: ammu-t occurs with any finite form of the verb un- (to be) and
also with a few non-finite forms.
Example:
   ammu-t-unnAnu    I am selling
   ammu-t-un-nA     even selling (now)
   ammu-t-un-tE     if selling
   ammu-t-un-na     that selling
CHAPTER 4
OVERVIEW OF ENGLISH-
TELUGU MACHINE TRANSLATION
SYSTEM
[Figure 4.1: General block diagram for English-Telugu machine translation system. Blocks: English sentence (input text), Stanford parser, reordering, lexicalization, transliteration, morphological generation.]
4.2 REORDERING
Reordering plays a vital role in overcoming the structural difference between
English and Telugu. An English sentence follows the Subject-Verb-Object (SVO)
order, whereas Telugu follows the Subject-Object-Verb (SOV) order. To overcome this
difference, reordering rules are applied at the source language level. The set of reordering
rules for Telugu has been adopted from the reordering rules developed for Tamil.
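As an illustration of source-side reordering, the sketch below applies a single hypothetical rule to a parse tree: inside every VP, the verb is moved after its sister constituents. It is not the rule set adopted from the Tamil system, which is not listed here; the tree encoding and function names are assumptions, and the part-of-speech tags follow the Penn Treebank convention of verb tags starting with "VB".

```python
def reorder_svo_to_sov(tree):
    """Recursively move the verb of each VP after its other children.

    A tree is (label, [children]) for a phrase or (tag, word) for a leaf.
    """
    label, body = tree
    if isinstance(body, str):            # leaf: (POS tag, word)
        return tree
    children = [reorder_svo_to_sov(child) for child in body]
    if label == "VP":
        verbs = [c for c in children if c[0].startswith("VB")]
        others = [c for c in children if not c[0].startswith("VB")]
        children = others + verbs        # objects and complements first, verb last
    return (label, children)

def leaves(tree):
    """Return the words of the tree from left to right."""
    label, body = tree
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]

# Hand-written parse of "She is writing a letter" for the example.
PARSE = ("S", [
    ("NP", [("PRP", "She")]),
    ("VP", [("VBZ", "is"),
            ("VP", [("VBG", "writing"),
                    ("NP", [("DT", "a"), ("NN", "letter")])])]),
])
```

Calling leaves(reorder_svo_to_sov(PARSE)) gives the word order She, a, letter, writing, is, i.e. subject, object, verb.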
4.3 DICTIONARY
A comprehensive, well-prepared bilingual dictionary, specially made from the
point of view of translation, is an essential component of a translation system. A
prototype of one such dictionary has been created for the present English-Telugu machine
translation system. The bilingual dictionary is collected from various resources such as
the internet and books. At present the dictionary contains 26000 words which belong to
different grammatical categories.
TABLE 4.1 DATABASE INFORMATION
4.4 TRANSLITERATION
An SVM-based English-to-Telugu transliterator is used for transliteration.
Transliteration is mainly done for words which are not available in the bilingual
dictionary.
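The interplay of dictionary lookup and transliteration can be sketched as follows. The dictionary entry, the function names and the placeholder transliterator are all assumptions made for illustration; the SVM-based transliterator itself is not reproduced.

```python
def lexicalize(words, dictionary, transliterate):
    """Translate each word via the dictionary, transliterating on a miss."""
    output = []
    for word in words:
        target = dictionary.get(word.lower())
        if target is None:               # word not in the bilingual dictionary
            target = transliterate(word)
        output.append(target)
    return output

def placeholder_transliterate(word):
    # Stand-in for the SVM-based transliterator; it only marks the word.
    return "<translit:" + word + ">"

# Illustrative one-entry dictionary with a Romanized target word.
SAMPLE_DICTIONARY = {"letter": "uttaraM"}
```

Passing the transliterator in as a function keeps the lookup logic independent of how transliteration is actually performed.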
4.5.1 INTRODUCTION
A morphological analyzer takes a word as input and produces the
analysis of the word as output. Presently the morphological analyzer is treated as a module in
which the input is a Telugu word and the output is the analysis of the given Telugu
word.
analyzable and generatable format is a challenging task. Inflections of the Telugu
verbs include finite, infinite, adjectival, adverbial and conditional markers. The verbs
are classified into a certain number of paradigms based on their inflections. For
computational purposes we have 37 verb paradigms, each with 160
inflections.
Sixty seven paradigms are identified for Telugu nouns. Each paradigm has 117
sets of inflected forms. The root words are classified into groups based on the nature
of their inflections. A corpus with all the morphological information has been
prepared, so that the machine itself captures all the morphological rules.
Morphological analysis of nouns is less complex than that of verbs. The detailed list
of paradigms and the possible inflections of the verbs and nouns are given in the
Appendix.
Support Vector Machine (SVM) is used for the classification task. Presently there
are three modules [13]: 1. SVMTlearn, 2. SVMTagger, 3. SVMTeval. SVMTlearn is
used for training the system with the manually created corpus. SVMTagger is used for
tagging a sequence of words using the previously learned SVM
model. SVMTeval is used for evaluating the final output.
1. The first step involves data creation (corpus development) for the morphological
analyzer and classifying the verbs and nouns into paradigm types. Each root word
inflects for different grammatical features, but the nature of these inflections is the same
within each paradigm type. The verbs inflect for grammatical features such as tense,
person, number, gender, non-finite, infiniteness, conditional negation, emphasis and
interrogation. The nouns inflect for plural number, and postpositions may follow the case
immediately or after a space. Figure 4.3 illustrates the formation of paradigms.
FIGURE 4.3 FORMATION OF PARADIGM
2. The second step is to collect the list of words which fall under the paradigms of
verbs and nouns. Table 4.2 illustrates some of the words and their inflections under the
paradigm ADu.
PARADIGM 1: ADu
LIST OF WORDS     INFLECTIONS
1. ATADU          1. tunnAnu
2. IdADu          2. tunnAmu
3. KoniyADu       3. Anu
4. koTTADu        4. Amu
.                 5. tAnu
.
3. The third step is pre-processing the corpus for the morphological analyzer [12]. The steps
involved in pre-processing are shown in Figure 4.4.
FIGURE 4.4 STEPS INVOLVED IN PRE-PROCESSING DATA FOR SVM MODEL
The pre-processing steps are Romanization, segmentation, alignment and mapping,
and the handling of mismatches.
ROMANIZATION: The most commonly used noun and verb forms are
generated manually as the input data, and the corresponding output data is prepared in the same way.
These data are converted to Romanized form using the Unicode-to-Roman mapping
file.
SEGMENTATION: After Romanization, every word in the corpus is
segmented into Telugu graphemes, and each grapheme which is syllabic is
further segmented into consonants and vowels. A consonant is represented by "C"
and a vowel by "V". This is called the C-V representation, or
Consonant-Vowel representation. Morpheme boundaries (the end of each morpheme)
are indicated by the * symbol in the output data.
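A minimal sketch of the C-V labelling, written to reproduce the pipe-delimited input sequences used later in this section (for example A|D-C|a-V|n-C|u-V| for ADanu). It assumes one Roman character per consonant or vowel, a fixed vowel set, and that a word-initial vowel stays as an unlabelled grapheme; these are assumptions read off the examples, and the function name is hypothetical.

```python
VOWELS = set("aAiIuUeEoO")   # assumed Roman vowel symbols

def cv_segment(word):
    """Return the pipe-delimited C-V representation of a Romanized word."""
    units = []
    for index, ch in enumerate(word):
        if index == 0 and ch in VOWELS:
            units.append(ch)             # word-initial vowel is a grapheme on its own
        elif ch in VOWELS:
            units.append(ch + "-V")
        else:
            units.append(ch + "-C")
    return "|".join(units) + "|"
```

A Romanization scheme that uses digraphs for single sounds would need a tokenizer in front of this function.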
ALIGNMENT AND MAPPING: The segmented syllables are aligned vertically as
shown in Table 1. Here the input segmented syllables are mapped one by one to the
labeled output segmented syllables. The first column represents the input data in C-V
representation and the second column represents the output data labels. The * symbol indicates the morpheme
boundaries.
MISMATCHING: This is the key problem in mapping between the input and output data.
Mismatching occurs in two cases, i.e. when the input has either more or fewer units than
the output. This problem is solved by inserting the null symbol $ into the output data, or by
combining two output units, based on the morpho-syntactic rules. The
input segments are then mapped to the output segments. After mapping, the machine learning
tool is used for training on the data. This type of problem may sometimes occur in
the case of nouns also.
Case 1:
Input sequence:
Output sequence:
l|E*|t|u|n|n|A*|n|u*| (9 segments)
Case 1
In this case the input sequence has more segments than the
output sequence. The Telugu verb lEstunnAnu has 10 segments in the input sequence, but the
output has only 9 segments. The s in the input sequence becomes null
due to the morpho-syntactic rule, so there is no output segment to map to that s. For this
reason, in the training data s is mapped to the $ symbol ($ indicates null). The
number of input units is then equal to the number of output units, as shown in the corrected output
sequence.
Case 2:
(A)
Input sequence:
A|D-C|a-V|n-C|u-V| (5 segments)
Output sequence:
A|D|u*|a*|n|u*| (6 segments)
Corrected output sequence:
A|Du*|a*|n|u*| (5 segments)
(B)
Input sequence:
Output sequence:
A|v|u*|A|m|e*| (6 segments)
Corrected output sequence:
A|vu*|A|m|e*| (5 segments)
Case 2
In this case the input sequence has fewer units than the output
sequence. (A) and (B) are examples of case 2 for verbs and nouns respectively. The Telugu verb
ADanu has 5 units in the input sequence, but the output has 6 units or segments. Due to
a morpho-syntactic change, the unit D-C in the input sequence corresponds to two
segments, D and u*, in the output sequence. For this
reason, in training D-C is mapped to the combined unit Du*, as shown in the corrected output sequence. The input and output sequences
then have an equal number of units, so the problem of mismatching is solved. The same
happens in the case of nouns, as shown in (B).
There are some cases in which both case 1 and case 2 occur together. Such
mismatches can be solved by applying the rules of case 1 and
case 2 together. An example with the Telugu noun Urikeduru is shown below.
Input sequence:
U|r-C|i-V|K-C|e-V|d-C|u-V|r-C|u-V| (9 segments)
Corrected output sequence:
U|ru*|i*|$|e|d|u|r|u*| (9 segments)
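Once the output sequence has been corrected, input units and output labels can be paired one to one to form training rows. The sketch below does only this pairing and rejects sequences whose lengths still differ; it does not decide where $ must be inserted or which units must be combined, which follows the morpho-syntactic rules described above. The function names are hypothetical.

```python
def split_units(sequence):
    """Split 'a|b|c|' into ['a', 'b', 'c'], ignoring spaces and the trailing bar."""
    return [unit.strip() for unit in sequence.strip().strip("|").split("|")]

def training_rows(input_sequence, output_sequence):
    """Pair each input unit with its output label, one row per unit."""
    inputs = split_units(input_sequence)
    labels = split_units(output_sequence)
    if len(inputs) != len(labels):
        raise ValueError(
            f"mismatch: {len(inputs)} input units vs {len(labels)} output units"
        )
    return list(zip(inputs, labels))
```

For the verb ADanu of case 2 (A), the corrected output gives the rows (A, A), (D-C, Du*), (a-V, a*), (n-C, n) and (u-V, u*), while the uncorrected 6-segment output is rejected.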
3. Identifying morphemes.
IDENTIFYING MORPHEME: The segmented morphemes are given to training module II,
which predicts the grammatical categories of the segmented morphemes.
Suppose the system is trained on the word abbAYilu. When the system comes across a
similar word such as AvUlu, the SVM modules will give the correct morphological
interpretation.
4.6.1 INTRODUCTION
The morphological generator is developed using a data-driven approach. In this
approach three different modules are developed. The first module takes the lemma
and POS category as input and gives the lemma's paradigm number and the word's stem
as output. The second module takes the morpho-lexical information as input and gives
its index number as output. In the third module, a suffix table is used to generate the
word form from the information produced by the first two modules.
4.6.2 MORPHOLOGICAL GENERATOR FOR TELUGU
There are different methods available for morphological generation. The
most familiar approach is the rule-based morphological generator. The rule-based
approach needs linguistic knowledge to develop the morphological generator
system, as it requires morpho-phonemic rules and a morpheme dictionary. In the present
approach, rules and dictionaries are not needed. It requires only a suffix table and code
for paradigm classification. The information given as input to the morphological generator
is 1. lemma, 2. word_class and 3. Morpho-lexical information. Lemma specifies the
word whose form is to be generated, word_class specifies the grammatical category and
Morpho-lexical information specifies the type of inflection. The input to the
morphological generator is given in the form lemma + word_class + Morpho-lexical
information. The Morpho-lexical information is extracted from the morphological
analyzer tool for Telugu. An example of the morphological generator system is given
below.
TABLE 4.4 VERB PARADIGM
ADu aruvu avvu Cavu Ceppu
vellu
Nouns are classified into sixty five paradigms, and the paradigms are listed in Table
4.5.
and nouns are selected. The creation of morpho-lexical forms of verbs and nouns
make use of an order which is followed for all the paradigms. Morpho-lexical
information list is created using Morpho-lexical forms. In the tabular column, row
indicates the Morpho-lexical information and column indicates the paradigm number.
The inflection table for Verb is given in Table 4.6.
ML-1 u vu pu nu yi
4.6.5 METHODOLOGY
paradigm number has to be found. The paradigm number corresponds to the column
index of the inflection table. The morpho-lexical information for the required word
class is given by the user as input. The index of this input in the morpho-lexical
information list is identified, and it corresponds to the row index. The row and
column indices thus obtained are sent to the noun or verb suffix table; the input word
class determines which of the two suffix tables is selected. The root word is stemmed,
and the entry selected from the inflection table is concatenated with the stem.
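The lookup-and-concatenate procedure can be sketched in Python as follows. This is a
minimal illustration, not the implementation used in this work: the tables hold only the
ADu entry from the worked example in this section, and every name (VERB_PARADIGM,
ML_INFO_LIST, VERB_SUFFIX_TABLE, STEM_ENDING, generate) as well as the stemming rule
(stripping a per-paradigm ending) are assumptions.

```python
# Toy tables: only the ADu values come from the worked example; the layout
# and all names below are assumptions made for illustration.
VERB_PARADIGM = {"ADu": 1}            # lemma -> paradigm number
ML_INFO_LIST = ["Present Tense"]      # position in this list gives the row index
VERB_SUFFIX_TABLE = [                 # rows: morpho-lexical info, columns: paradigms
    ["utunnAnu"],
]
STEM_ENDING = {1: "u"}                # hypothetical ending stripped for each paradigm

# The word class selects which paradigm table and suffix table are used.
TABLES = {"verb": (VERB_PARADIGM, VERB_SUFFIX_TABLE)}


def stem(lemma, paradigm):
    """Strip the ending assumed for this paradigm, if the lemma has it."""
    ending = STEM_ENDING.get(paradigm, "")
    if ending and lemma.endswith(ending):
        return lemma[: -len(ending)]
    return lemma


def generate(lemma, word_class, ml_info):
    """Look up the suffix by (row, column) and attach it to the stem."""
    paradigm_table, suffix_table = TABLES[word_class]
    paradigm = paradigm_table[lemma]      # paradigm numbers start at 1
    column = paradigm - 1                 # zero-based column index
    row = ML_INFO_LIST.index(ml_info)     # zero-based row index
    suffix = suffix_table[row][column]
    return stem(lemma, paradigm) + suffix


print(generate("ADu", "verb", "Present Tense"))  # ADutunnAnu
```

With a full suffix table, adding a new word would only require assigning it to a
paradigm, which is the main attraction of a table-driven design over hand-written rules.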
STEP 1
Let us consider the input to the system to be ADu + verb + Present Tense.
1. ADu is the lemma
2. Verb is the word_class
3. Present Tense is the morpho-lexical information
STEP 2
STEP 3
The Romanized form ADu is given as input to the verb paradigm table, and the output is
the paradigm number of ADu, which is 1. This is the column index for Table 4.6
(Morpho-Lexical forms).
STEP 4
The lemma ADu is sent to the stemming process, and the output is AD.
STEP 5
STEP 6
Now, with the help of the row index and the column index, we can find the
morpho-lexical information, which is utunnAnu.
STEP 7
STEP 1
Consider the input sentence "She is writing a letter."
STEP 2
The input sentence is given to the parser to obtain the grammatical tree structure and
the parts-of-speech categories. The grammatical tree structure is shown in Figure 4.8.
STEP 4
For the given English words, the equivalent Telugu words are found in the bilingual
dictionary.
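A dictionary lookup of this kind can be sketched in Python as below. The entries are
illustrative only and are not taken from the bilingual dictionary used in this work; the
names BILINGUAL_DICT and lookup, the Romanized spellings, and the assumption that the
English words have already been reduced to root forms are all hypothetical.

```python
# Illustrative entries only; they are not taken from the actual dictionary.
BILINGUAL_DICT = {
    "she": "Ame",
    "write": "vrAyu",
    "letter": "uttara~m",
}


def lookup(words, dictionary=BILINGUAL_DICT):
    """Pair each English word with its Telugu equivalent, or None if missing."""
    return [(word, dictionary.get(word.lower())) for word in words]


for english, telugu in lookup(["She", "write", "letter", "a"]):
    print(english, "->", telugu)
```

Returning None for a missing entry, rather than raising an error, makes gaps in the
lexicon visible without stopping the translation of the rest of the sentence.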
STEP 5
Output: vrAstundi
STEP 6
FINAL OUTPUT
Telugu
CHAPTER 5
RESULTS
5.2 DISCUSSION
The morphological analyzer was tested separately for nouns and verbs. The system was
tested with 150 nouns and 200 verbs, and its accuracy is 62.6 percent and 58.5 percent
respectively. Incorrect output occurs mainly for words that do not fall under any of
the classified paradigms.
5.3 SCREEN SHOT OF MORPHOLOGICAL ANALYZER
Screen shots of the morphological analyzer for verbs and nouns are given below.
5.4 TESTING AND RESULTS
Morphological generation for verbs and nouns was tested separately, and the results
are given in Table 5.3 and Table 5.4.
5.5 DISCUSSION
Morphological generation was tested separately for nouns and verbs. The system was
tested with 300 nouns and 200 verbs, and its accuracy is 58 percent and 53.5 percent
respectively. Incorrect output occurs mainly for words that do not fall under any of
the classified paradigms. The accuracy of the system can be improved by handling more
special cases, clitics and negative forms.
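The reported percentages can be converted back into approximate counts of correct
outputs. The short Python sketch below does this arithmetic; the function name
implied_correct is an assumption, and the counts it prints are inferred from the stated
totals and accuracies rather than reported directly.

```python
def implied_correct(total, accuracy_percent):
    """Number of correct outputs implied by a reported accuracy (rounded)."""
    return round(total * accuracy_percent / 100)


# Generator figures stated above: 300 nouns at 58 percent, 200 verbs at 53.5 percent.
print(implied_correct(300, 58.0))   # 174 nouns generated correctly
print(implied_correct(200, 53.5))   # 107 verbs generated correctly
```

On these figures, roughly 126 nouns and 93 verbs were generated incorrectly, which
indicates how many test words fall outside the classified paradigms or need special
handling.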
5.6 SCREEN SHOT OF MORPHOLOGICAL GENERATOR
Screen shots of the morphological generator for verbs and nouns are given below.
5.7 TESTING AND RESULTS
The system is tested with simple sentences. The output sentences are classified into
three categories: 1. Good, 2. Understandable and 3. Bad.
5.8 DISCUSSION
The English to Telugu machine translation system is tested with 450 simple sentences.
The output is categorized into three types, namely good, understandable and bad. Bad
translation occurs mainly for the following reasons:
A set of tested sentences is attached as an Excel file, and the output is compared with
that of the Google translator system. Since morphological generation is not available
in the Google translator, the outputs of our translation system are morphologically
better than those of Google, which makes our translations more meaningful and easier to
understand. However, Google's lexicon is larger than that of our translation system, so
in terms of lexical coverage Google's translation system works better. The online
system is available at http://nlp.amrita.edu:8080/Eng2Tel/.
5.9 SCREEN SHOT OF ENGLISH-TELUGU MACHINE
TRANSLATION SYSTEM
CONCLUSION
Machine translation plays a key role in breaking the language barrier. India in
particular has many states, and different languages are spoken in each state, so it is
difficult to follow a single language throughout the country. A lot of research is
needed in this field to handle these difficulties. Since Telugu is the second most
spoken language in India, it is important to have a translation system for the
Telugu language.
The morphological analyzer and generator have been developed with limited
linguistic resources. In the future, people with good knowledge of Telugu can use
the system and enhance its output.
REFERENCES
5. http://unicode.org/standard/WhatIsUnicode.html
7. Brown, C.P., The Grammar of the Telugu Language. New Delhi: Laurier Books
Ltd, 2001
11. Gwynn and Krishnamurti: A Grammar of Modern Telugu, volume 11, Oxford
University Press, Delhi, 1987.
12. K.P.Soman, R.Loganathan, V.Ajay, Support Vector Machines and other Kernel
Methods, PHI Learning Private Ltd.,2009, pp 115-155.
13. Jesus Gimenez and Lluis Marquez, SVMTool Technical Manual v1.3, TALP
Research Center, LSI Department, Salgado, Barcelona, 2006.
15. http://en.wikipedia.org/wiki/Google_Translate .
PUBLICATION
INTERNATIONAL JOURNAL
[1] R. SriBadri Narayanan, Saravanan.S and Dr Soman K.P, Amrita University,
Coimbatore, India, Data Driven Suffix List And Concatenation Algorithm
For Telugu Morphological Generator, In Proceedings of International
Journal Of Engineering Science and Technology,vol.3, no 8, pp.6712-6717,
August 2011.
NATIONAL CONFERENCE
[1] Ramasamy Veerappan, R. SriBadri Narayanan, and Dr. K. P. Soman, Amrita
University, Coimbatore, India, Translation Based Support System for Smart
Education, In Proceedings of NCILC, 2011.
APPENDIX
MARKERS
GIVEN BELOW ARE THE INFLECTIONS CONSIDERED FOR TELUGU VERBS
1. PRESENT TENSE MARKERS <PRESENT_TENSE> tunnA, TunnA, tunTE, TunTE,
Tum~m, tU , TU, to~m, To~m.
2. PAST TENSE MARKERS <PAST_TENSE> nnA, sunnA, A, sA, DA, cA, ppA, lcA,
slA, tA, LLA, TTA, ccA, kunnA, kua~m, ia~m, ccA, ia~mcA, se, de, ce, ppe,
te, ue, rce, nne, ye.
4. CLITIC <CLITIC> vO, nO, rO, dO, lO, lA, kO, sai, si, stu, akA, nnA, lE.
5. AUXILIARY VERBS <AUX> nivvu, vaccu, valayu, pO, ua~mdu, cUdu, peTTu,
pArEyi, veyyi, avvu, mugia~mcu, cUpu, daluvu, manu, cupia~mcu, veLLu, goTTu,
beTTu, sAgu, tIru.
11. POST POSITIONS <PP> lOga, lOpuna, dAkA, koddi, kadA, gAni, kanuka, kadu,
gUDA, kAbOlu, kAni, gAdA, annA, kUDA, mua~mdu, ni, a~mTA, a~mTE,
aMTu, mAku, baTTi, gAni, kUDa, mAllE, mari, gala, bO, lA, sariki, dagu
nua~mDu, galugu, joccu, jAlu, baDuvu, tappa, pATiki, varaku, ka~mTE.
mIdanua~mDi, madya, madyaki, madyalOnua~mDi, madyalOki, medalukoni,
mua~mdu, naDuma, naDumaki, ni, nua~mDi, pai, paiki, painua~mDi, pakka,
painua~mdi, pakkaku, pakkalO, pakkanua~mDi, prakAra~m, stAnAniki, stAna~m,
stAna~mlO, stAna~mlOnua~Di, valana, vadd, vaddaku, vaddanua~mDi,
venukanua~mDi, venuka, venukaku, taravAta, taravAnua~mDi, venuka, venukaku,
taravAta, taravAtanua~mDi, tO, gUDA, tOpATu, gAka, daggara, daggaralO,
daggaraku, daggaranua~mDi, dRushTilO, yOkka, dvArA.
2. PRONOUNS < pro> Ayana, Ame, atanu, gAru, di, vi, taravAta, vADu, vAru, vaipu.
[Figures: tables for Paradigm 1 to Paradigm 37]
For example, nouns have the following paradigms:
[Figures: tables for noun Paradigm 1 to Paradigm 65]