
Automatic Translation of Noun Compounds from English to Hindi

Thesis submitted in partial fulfillment


of the requirements for the degree of

MS by Research
in
Computer Science with specialization in NLP

by

Prashant Mathur
200502016
mathur@research.iiit.ac.in

Language Technology Research Center


International Institute of Information Technology
Hyderabad - 500032, INDIA
October, 2011
Copyright
© Prashant Mathur, 2010
All Rights Reserved
International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Automatic Translation of Noun Compounds
from English to Hindi by Prashant Mathur, has been carried out under my supervision and is not
submitted elsewhere for a degree.

Date Advisor: Dr. Soma Paul


To Papa and Mummy
For being the world's best parents
Acknowledgments

I would like to express my deepest gratitude to my advisor Dr. Soma Paul, without whom this
thesis wouldn't have been possible. She always inspired me and had faith in my ability to rise to the
occasion and deliver my best work. She worked on the thesis as much as I did. She taught me
the course Computational Linguistics, and I have been working with her ever since. I am also thankful to Prof.
Rajeev Sangal for his valuable comments on the thesis. I started my journey in LTRC under the supervision
of Dr. V. Sriram, working on Machine Translation. Every time I asked for his help he was there, clearing my
doubts. Thank you for all the proofreading, Sir, and for the inspiration that led me to pursue
post-graduate studies on Machine Translation.

The only unsinkable ship is FRIENDSHIP. I would also like to thank my Lab mates and my friends
Vipul Mittal, Sambhav Jain, Bharat Ram, Himani, Karthik, Siva Reddy, Avinesh, Ravikiran, Samar
Hussain for creating such a wonderful working environment in LTRC. I would also like to thank my
batchmates Abhijeet, Raman, Piyush, Himank, Shrikant, Aditya, Manish, Subhashis, Karan, Chirag,
Maruti, Prashant K., Vibhav for being the friends they were. It would be disrespectful of me if I didn't
mention my 5th year buddies Mahaveer, Kulbir Chacha, Saurabh, Rahul, Abhishek.
The last year wouldn't have been fun without playing cards and computer games. I would really like to
mention my CS buddies and my clan IMMORTALS.
Above all I would like to mention my cousins and my dear ones Krishna Mathur, Devika, Chandni
Mathur, Deepika Mathur for being there when I needed them, and sometimes being on the other side of
an outburst. Love you all.

Abstract

The present work attempts to build an automatic translation system for nominal compounds (NCs) from
English to Hindi. A noun compound is a sequence of nouns acting as a single noun, e.g., colon cancer,
suppressor protein, colon cancer tumor suppressor protein. They comprise 3.9% and 2.6% of all tokens
in the Reuters corpus and the British National Corpus (BNC), respectively. As of today, no good system
exists for the translation of multi-word expressions from English to any Indian language. We have
evaluated two state-of-the-art systems, Moses and the Google Translation system, to check Noun Compound
translation accuracy from English to Hindi. The Google translation system achieves an accuracy of
57%, while Moses, a statistical machine translation system, returns an accuracy of 48% on test data of
300 Noun Compounds. These figures indicate that automatic NC translation from English to Hindi
is an important subtask of a machine translation system. We build a Noun Compound Translation system
(NCT) which returns an accuracy of 64% on the same set of test data.
This thesis examines two approaches for the translation of Noun Compounds from English to Hindi. We
have done a manual study of 50K parallel English-Hindi sentences and have found that Noun
Compounds in English are translated into Noun Compounds in Hindi in over 40% of the cases. In other
cases they are translated into varied syntactic constructs. Among these, the most frequent construction
type is Modifier + Post-Position + Head, which occurs in 35% of all the cases. Some examples are
cow milk → gAya kA dUXa, wax work → mOMa para ciwroM. This observation motivates
both the approaches for translation in the present thesis. The approaches are: a)
translation of NCs by paraphrasing on the source side and mapping the paraphrase to a target construct, and
b) context-based translation by searching and ranking translation candidates on the target side.
In the first approach, English nominal compounds are automatically paraphrased and the paraphrases are
translated into Hindi constructions. The paraphrasing is done with prepositions, following [Lauer 1995]'s
approach to paraphrasing of nominal compounds. For example, cow milk is paraphrased as milk from
cow, and blood sugar is paraphrased as sugar in blood. Since English prepositions have a one-to-one
mapping to post-positions in Hindi, English paraphrases are easily translated into Hindi using the mapping
schema. Assuming that lexical substitution for the component nouns of the compound is correct, this
method examines how paraphrasing of an English nominal compound acts as an aid for translation.
In the second approach, we first generate translation templates for the target language. These templates
are all the possible Hindi construction types that English nominal compounds can be translated into. A
context-based translation system takes context into consideration while translating. We translate a noun

compound by taking the sentence in which the compound occurs as the context. For example, the expression
finance minister is the nominal compound to be translated in the sentence The finance minister
declared the financial budget for this year. The other content words in the sentence, such as declared,
financial, budget, year, form the context. We apply a word-sense disambiguation tool to select
the correct sense of the component nouns of the NC in the given context. We use a bilingual dictionary to
get the Hindi translation of the component nouns in the sense selected by the WSD tool. Thus context-based
lexical substitution is accomplished for the target language. The output of lexical substitution is placed
in the translation templates, and the resulting constructions are searched in a Hindi indexed corpus of 28
million words. For ranking, a reference ranking based on the frequency of occurrence of the full translation
candidates in the TL corpus is taken as the baseline. To improve on the baseline, a stronger ranking
measure is borrowed from [Tanaka & Baldwin 2003b].
The context-based translation approach is adopted in the present work for building the Noun
Compound Translation system (NCT), which is integrated with Moses. The outputs of Moses and Moses
integrated with NCT are compared. Evaluation of the system is carried out at two levels: by the automatic
evaluation metric BLEU and by a manual evaluation technique. The issue of automatic evaluation is
discussed in detail, which motivates manual evaluation under the given circumstances.
Contents

Chapter Page

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Approaches to Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Noun Compound in English and its Translation . . . . . . . . . . . . . . . . . . . . . 3
1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Various approaches for translation of Noun Compounds . . . . . . . . . . . . . . . . . 6
1.6.1 Translation of NC by paraphrasing on source side and mapping the paraphrase
to target construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6.2 Context Based Machine Translation . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Developing an Integrated MT System and its Evaluation . . . . . . . . . . . . . . . . 8
1.8 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.9 Chapterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Translation of Noun Compounds using paraphrasing on source side . . . . . . . . . . . . . . 11


2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Paraphrasing of Noun Compounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Experiments & Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Translation of Noun Compounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.1 Mapping English Preposition to Hindi Post-position . . . . . . . . . . . . . . 20
2.8 Translation of Noun Compounds: Experiments and Result . . . . . . . . . . . . . . . 21
2.9 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Context Based Translation of Noun Compounds . . . . . . . . . . . . . . . . . . . . . . . . 24


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Preparation of Data and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Preparation of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Generation of Translation Templates . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3 Sense Selection for components of Noun Compound . . . . . . . . . . . . . . 32
3.4.4 Corpus Search and Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.5 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 Integration of Noun Compound Translator with Moses and its Evaluation . . . . . . . . . . . 37


4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Moses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Appendix A: Templates for candidate generation . . . . . . . . . . . . . . . . . . . . . . . . 52

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
List of Figures

Figure Page

1.1 Translation of Noun Compounds through Paraphrasing . . . . . . . . . . . . . . . . . 7


1.2 Noun Compound Translator (NCT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Moses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4.1 Phrase based Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


4.2 Distance based Reordering : Reordering distance is measured on the foreign input side.
In the illustration each foreign phrase is annotated with a dashed arrow indicating the
extent of reordering. For instance the 2nd English phrase translates the foreign word 6,
skipping over the words 4-5, a distance of +2. . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Lexicalized Reordering (Y-Axis : Source Phrase, X-Axis : Target Phrase) . . . . . . . 42

List of Tables

Table Page

2.1 Frequency of Paraphrases for finance minister resulting from Web search. . . . . . 18
2.2 Frequency of Paraphrases for welfare agencies resulting from Web search. . . . . . 18
2.3 Frequency of Paraphrases for antelope species after Web search. . . . . . . . . . . . . 19
2.4 Paraphrasing Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Distribution of Preposition on Lauer test data of 218 NC . . . . . . . . . . . . . . . . 20
2.6 Comparison of our approach with Lauer's Approach . . . . . . . . . . . . . . . . . 20
2.7 Mapping of English Preposition to Hindi postposition from aligned English-Hindi par-
allel corpora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Preposition-Postposition Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.9 Translation Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.10 Translation Accuracy for some individual prepositions . . . . . . . . . . . . . . . . . 22

3.1 Distribution of translations of English NC from English Hindi parallel corpora. . . . . 25


3.2 Number of Senses Listed in Wordnet . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Synset selected by WSD tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Translation using bilingual dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Ranking using baseline frequency model . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Ranking using CTQ Metric Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 Ranking after inclusion of default genitive translations, i.e., X kA Y, X ke Y, X kI Y, as
templates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1 Corpus Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4.2 NC translation accuracy (Surface Level) on the test data. . . . . . . . . . . . . . . . . 45
4.3 BLEU scores on the test data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 BLEU scores on the development set . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 5 point scale for Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Human Judgment score of translation of sentences and the NC Translation accuracy . . 48
4.7 Performance of systems on top 3 constructions for NC Translation . . . . . . . . . . . 48

Chapter 1

Introduction

1.1 Introduction
Machine Translation is a sub-field of Computational Linguistics that makes use of computers to
translate text/speech from one language to another. Efforts to build machine translation systems started
almost as soon as electronic computers came into existence. Computers were used in Britain to crack the
German Enigma code in World War II and decoding language codes is what we call machine translation
in todays world. Warren Weaver, one of the pioneering minds in machine translation, wrote in 1947:

When I look at an article in Russian, I say: This is really written in English, but it has
been coded in some strange symbols. I will now proceed to decode.

Hindi is a widely spoken language and the principal official language of the Republic of India. On
the other hand, English is an internationally popular language. In India, English has played
a major role in the administrative, legal and education sectors since the British period. Presently, awareness
has grown in this country about using regional languages for government document writing, for
primary and higher education, and in every other domain of public life. In this context, it has become very
important to build systems which can translate English into various Indian languages. With the existence
of huge text resources on the internet, and India being one of the most prominent users of the web, even
commercial companies are finding it necessary to venture into building machine translation.
With the emergence of India in the global market as one of the major powers, and Hindi being its official
language, the language has reached the outside world too. English to Hindi translation has become
of great importance because the country has good trade relations with English-speaking countries such
as the USA, Australia, England and other European nations. English is an international industrial language,
and companies from all around the world are investing in the Indian market. English to Hindi translation has
become a necessity if these companies want to promote or pass information to people who cannot
understand English. The coming of companies from abroad such as Quillpad, AIAIOO Labs and MSR Labs
has boosted the translation service industry in India, but since everyone wants the world to be
more automated, the development of such tools is what one needs.

Whenever there is a language barrier between two individuals, an intervention by a bilingual
speaker is required every time, which is costly. As we strive to make the world more automated,
an automated machine translation system can reduce the human effort to quite an extent.
A typical Machine Translation system can be used anywhere two different individuals are
trying to communicate but neither of them knows a common language, creating a communication gap.
Some of the popular translation systems being developed are Google Translate, Yahoo! Babel
Fish, Bing Translator and Moses [Koehn et al. 2007]. In India, two of the major projects on machine
translation are Anusaaraka and the English to Indian Language Machine Translation project (EILMT). Anusaaraka
is an English to Indian language accessing (translation) software, which employs algorithms derived
from Panini's Ashtadhyayi (grammar rules). The EILMT system aims to design and deploy a Machine
Translation System from English to Indian Languages in the Tourism and Healthcare domains. The project
is funded by the Department of Information Technology, MCIT, Government of India. Shakti is also a machine
translation system from English to Indian languages. It is currently being developed at the Language
Technology Research Centre.
These applications for translation from one language to another have become so popular and widely used
that people are developing mobile applications such as speech-to-speech, speech-to-text and
text-to-text translators on different OS platforms like Android, iOS and Symbian.

1.2 Approaches to Machine Translation


Translation can be defined as a process in two steps:

1. Decoding the meaning of source text.

2. Re-Encoding the meaning in the target language.

There are four basic types of Machine Translation systems:

1. Rule Based Machine Translation : A translation process that uses linguistic information about
both the source and target languages, such as morphological information and differences in syntactic
behaviour.

(a) Transfer based Machine Translation : A translation process which builds an intermediate
representation of the source text that captures its meaning in order to generate the correct
translation. [Tsuji & Fujita 1991]
(b) Interlingual Machine Translation : The source text is transformed into an interlingua, i.e. a
source-/target-language-independent representation. The target language is then generated
out of the interlingua. [Lampert 2004]
(c) Dictionary based Machine Translation : Word-by-word translation using a bilingual
dictionary. [Muegge 2006]

2. Statistical Machine Translation : A typical example of Machine Translation where statistical
models, derived from analyzing parallel text corpora, are used to predict the translation of a
sentence from one language to another. [Brown et. al. 1993]

3. Example Based Machine Translation : In this approach, translation is accomplished by decomposing
a sentence into certain phrases, then translating these phrases, and finally properly composing
these fragments into one long sentence. Phrases are translated by analogy to previous
translations. [Brown 1996]

4. Hybrid Machine Translation : A combination of the statistical, example-based and rule-based
approaches that leverages the best of each system.

1.3 Problem Statement

The thesis aims at automatic translation of English Noun Compounds into Hindi within a sentence.
Here is an example:
English : The coast guard was constituted by a parliament act in the year 1978.
Hindi : 1978 meM bhArawa sarakAra ne saMsada dvArA pArita adhiniyama ke aMtagarta taTarakSaka
dala kA gaThana kiyA .
The expression parliament act is translated as saMsada dvArA pArita adhiniyama, while coast guard
translates to taTarakSaka dala. In the first case, the source language compound is translated as a phrasal
construction in Hindi, while in the second case the noun compound remains a noun compound in
Hindi.
We attempt to develop a system that automatically translates English nominal compounds into Hindi. The
tool is named the Noun Compound Translation system (NCT). We also integrate the NCT tool with a
state-of-the-art translation system, Moses, and evaluate the integrated system.

1.4 Noun Compound in English and its Translation

Compounds have regularly received the attention of linguists. Noun compounds are abundant in English
and pose an important challenge for the automatic analysis of written English text. A two-word noun
compound (henceforth NC) is a construct of two nouns, the rightmost noun being the head (H) and the
preceding noun the modifier (M). The noun constituents together act as a single noun [Downing 1977].
Some examples are cow milk, road condition, machine translation, colon cancer, suppressor
protein and so on. A noun compound can have a more complex structure, as illustrated in customer
satisfaction indices, wrought iron office chair, routine health check up, fire and rescue service
department and so on. The analysis of the aforementioned noun compounds can be done as follows:

1. Customer satisfaction indices: [[Customer satisfaction] indices]

2. wrought iron office chair: [[wrought iron] [office] chair]

3. routine health check up: [routine [health [check up]]]

4. fire and rescue service department: [[[[fire] and [rescue]] service] department]
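
The bracketed analyses above can be encoded directly as nested structures; a minimal illustrative sketch in Python (the tuple representation is our own device for this illustration, not part of the thesis):

```python
# Nested tuples encode the bracketing of each noun compound:
# a tuple groups the constituents that the brackets group together.
customer_satisfaction_indices = (("customer", "satisfaction"), "indices")
wrought_iron_office_chair = (("wrought", "iron"), "office", "chair")
routine_health_check_up = ("routine", ("health", ("check", "up")))

def leaves(t):
    """Flatten a bracketing back into its word sequence."""
    if isinstance(t, str):
        return [t]
    return [w for sub in t for w in leaves(sub)]

print(leaves(customer_satisfaction_indices))  # ['customer', 'satisfaction', 'indices']
```

Recovering the flat word sequence from the nested structure shows that each bracketing is just one analysis of the same surface string.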

The complex structure of noun compounds establishes that the analysis and translation of these compounds
from one language to another is not an easy task. Translation becomes significantly more difficult
in those cases where the source language NC is expressed in varied ways in the target language,
as is the case with the English-Hindi language pair. We have done a manual study on the BNC corpus in
which we have found that English noun compounds can be translated into Hindi in the following varied ways:

1. Noun Compound

(a) Hindu Texts → hiMdU SAstroM

(b) Milk Production → dugdha utpAdana

2. Genitive Construction

(a) Rice Husks → cAval kI bhUsI

(b) Room Temperature → kamare kA tApamAna

3. Adjective Noun Construction

(a) Nature Care → prAkritika cikitsA

(b) Hill Camel → pahARI UMTa

4. Other Syntactic Phrase

(a) Wax Work → mom par citroM

(b) Body Pain → SarIr meM dard

5. One Word

(a) Cow Dung → gobara

6. Others

Hand Luggage → hATa meM le jAye jAne vAlA sAmAn.

When the compound is made up of more than two words, the translation issues sometimes become even
more acute. For example, let us study the following cases of translation:

1. body pain cure → SarIra meM darda kA ilAja

2. state forest department → rAjya vana vibhAga

3. Hindi language alphabet → hindI bhASa kI varNamAlA

In the first case, E1 E2 E3 is translated into H1 post-position H2 post-position H3; in the second case the
compound E1 E2 E3 remains a compound H1 H2 H3; while in the third case E1 E2 E3 is translated
as H1 H2 post-position H3. In the present work, we will handle only bigram noun compounds,
translating them into Hindi.
Furthermore, compounding is an extremely productive process in English. The frequency spectrum of
compound types follows a Zipfian or power-law distribution, so in effect many compound tokens
encountered belong to a long tail of low-frequency types. Over half of the two-noun compound types
in the British National Corpus occur just once. Taken together, the factors of low frequency and high
productivity mean that achieving robust noun-compound interpretation is an important goal for
broad-coverage semantic processing.
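
The long-tail observation can be checked on any POS-tagged corpus. The following hedged sketch shows only the counting logic; the adjacent-NN heuristic for spotting compounds and the toy input standing in for real corpus data are both illustrative assumptions:

```python
from collections import Counter

def compound_spectrum(tagged_sentences):
    """Count two-noun compound types from POS-tagged sentences
    (each sentence a list of (word, tag) pairs) and report what
    fraction of types occur exactly once (the hapax rate)."""
    counts = Counter()
    for sent in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            # Heuristic: adjacent noun tags signal a candidate compound.
            if t1.startswith("NN") and t2.startswith("NN"):
                counts[(w1.lower(), w2.lower())] += 1
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return counts, hapaxes / max(len(counts), 1)

# Toy input; a real run would use e.g. the tagged BNC.
sents = [[("colon", "NN"), ("cancer", "NN"), ("is", "VBZ"), ("rare", "JJ")],
         [("colon", "NN"), ("cancer", "NN")],
         [("milk", "NN"), ("production", "NN")]]
counts, hapax_rate = compound_spectrum(sents)
```

On real corpus data the hapax rate computed this way is what underlies the "over half occur just once" claim.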

1.5 Motivation
Noun Compounds constitute an important part of running English text. Translation of Noun Compounds
is a very widely researched topic, yet no work specific to noun compound translation from English
to Hindi has been attempted until now. We have given some sentences with NCs for translation to the
EILMT system. The output that the system returns is the following:
English : Keep the room temperature to 30 degrees, I am suffering from body pain.
Hindi : 30 digrI ko kakSa tApamAna ko rakheM, meM SarIr dard se kaSta bhugawa rahA hUM
In this example the translations of the Noun Compounds room temperature and body pain are not
correct. The performance of the existing translation system makes it clear that there exists no
satisfactorily efficient Noun Compound translation tool from English to Hindi, although the need for one
is pressing in the context of machine translation. This observation motivates the present work to develop
an English-Hindi noun compound translator tool. A similar algorithm will work for Indian languages
which are closely related to Hindi.

1.6 Various approaches for translation of Noun Compounds

There are various approaches to the translation of Noun Compounds, such as transfer based machine
translation, memory based machine translation, word-to-word compositional MT, etc. The review of
noun compound translation discussed in the previous section motivates us to implement two
approaches, both of which are corpus-driven statistical approaches. The two approaches can be described
as a) translation by paraphrasing on the source side and mapping the paraphrase to a target construct, and
b) context-based translation by searching and ranking translation candidates on the target side. The two
approaches are introduced here and their implementation is described in Chapters 2 and 3.

1.6.1 Translation of NC by paraphrasing on source side and mapping the paraphrase to


target construct

We develop a system that uses paraphrasing on the source side as an aid to translate a Noun Compound
from English to Hindi. In this work we present a way of using paraphrasal interpretations of English
nominal compounds for translating them into Hindi. The input nominal compound is first paraphrased
automatically with the 8 prepositions proposed by [Lauer 1995] for the task. The details of the process are
described in Chapter 2. English prepositions have a one-to-one mapping to post-positions in Hindi. We
obtain an accuracy of 71% over a gold data set of 250 nominal compounds. The translation strategy
is motivated by the following observation: in only 50% of the cases is an English nominal compound
translated into a nominal compound in Hindi. In the other cases, they are translated into varied syntactic
constructs. Among these, the most frequent construction type is Modifier + Postposition + Head.
The translation module also attempts to determine when a compound is translated using a paraphrase and
when it is translated into a nominal compound.
The following flow chart describes the approach. In this method we assume the translation of the component
nouns to be correct, because our objective in conducting the experiment is to evaluate the accuracy
of translating the paraphrase construct. We implement another method for translation of Noun Compounds,
which is described in Chapter 3.

Figure 1.1 Translation of Noun Compounds through Paraphrasing

1.6.2 Context Based Machine Translation

In Context Based Machine Translation we translate a Noun Compound taking the sentence as the
context. The translation is carried out in two steps:

1. Context information is utilized for correct lexical substitution of the component nouns of the English
NC.

2. Hindi templates for potential translation candidates are generated and searched in the target
language data.

Let us consider the following example: Soil on the river bank eroded due to the flood. The expression
river bank is the noun compound, and the content words form the context (in this case: soil, river, bank,
eroded, flood). We use a word-sense disambiguation tool [Patwardhan et. al. 2005] to derive the sense
of the component nouns of the compound in the given context. We will see in Chapter 3 that correct
sense selection of component nouns significantly improves the translation of the compound. Once the
correct sense is selected for the component nouns and the nouns are substituted in Hindi for that sense, the
substituted words are fitted into the translation templates to generate the translation candidates. These
translation candidates are searched on the Web, and the best translation candidate selected by the ranking
method (see Chapter 3) is taken to be the translation of the noun compound.
In this work we present the architecture of a Context Based Machine Translation system, the Noun
Compound Translator, which achieves an accuracy of 57% over a test set of 200 Noun Compounds.
Figure 1.2 represents the functionality of the Noun Compound Translator tool. We integrate the NCT
tool with the state-of-the-art translation system Moses, as discussed in the next section.
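
The two-step pipeline just described can be sketched in miniature. The template set, the toy dictionary and the corpus-frequency function below are all illustrative assumptions, and the sense-selection (WSD) step is omitted:

```python
# Illustrative Hindi translation templates ({M}=modifier, {H}=head);
# the thesis derives such templates from parallel-corpus study.
TEMPLATES = ["{M} {H}", "{M} kA {H}", "{M} ke {H}", "{M} kI {H}", "{M} meM {H}"]

def translate_nc(modifier, head, dictionary, corpus_freq):
    """Substitute dictionary translations into the templates and
    pick the candidate with the highest target-corpus frequency
    (the baseline frequency ranking model)."""
    m_hi, h_hi = dictionary[modifier], dictionary[head]
    candidates = [t.format(M=m_hi, H=h_hi) for t in TEMPLATES]
    return max(candidates, key=corpus_freq)

# Toy dictionary and corpus frequencies, for illustration only:
toy_dict = {"cow": "gAya", "milk": "dUXa"}
toy_freqs = {"gAya kA dUXa": 90, "gAya dUXa": 2}
best = translate_nc("cow", "milk", toy_dict, lambda c: toy_freqs.get(c, 0))
print(best)  # gAya kA dUXa
```

A stronger ranking measure (borrowed from [Tanaka & Baldwin 2003b]) later replaces the raw frequency in `corpus_freq`.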

Figure 1.2 Noun Compound Translator (NCT)

1.7 Developing an Integrated MT System and its Evaluation

We present a translation system combining the state-of-the-art machine translation tool Moses
with the Noun Compound Translator. The need for this integration arises from our desire to know whether
the output of NCT brings any improvement to the overall quality of sentence translation when the
system is plugged into a full-fledged machine translation system. NCT is a phrase based system, and
hence integrating it with another phrase based system, which Moses is, is an easier
task than integrating it with a syntax based SMT or other systems such as example-based MT or
tree based MT. The integrated system is built by combining the Moses phrase based model with the
translations from the NCT system. The Moses decoder uses the enhanced model for translation of sentences.
The integrated system is evaluated both manually and automatically. We evaluate the systems
on test data of 300 parallel sentences containing Noun Compounds (NCs) and their translations.
Figure 1.3 represents the functionality of the Moses system combined with the NCT tool.

Figure 1.3 Moses

1.8 Contribution of the Thesis


1. Two approaches for the translation of Noun Compounds are presented and compared with a
state-of-the-art system.

(a) Proposed a new approach for the translation of Noun Compounds by paraphrasing on the
source side.
(b) Developed a context based machine translation system (Noun Compound Translator) for
the English-Hindi language pair.

2. The principal mechanisms of the two approaches are different:

(a) Paraphrases of NCs are searched on an English web corpus, i.e., a search of the source language corpus.
(b) Hindi translation candidates are searched on an indexed Hindi corpus, i.e., a search of the target
language corpus.

3. The NCT tool is built as a module that can easily be integrated with any working machine translation
system.

4. The system is integrated with the state-of-the-art statistical machine translation system, Moses, and
evaluated.

5. Evaluation of the system is carried out at two levels: by the automatic evaluation metric BLEU and
by a manual evaluation technique.

1.9 Chapterization
This thesis is divided into a number of chapters. Chapter 2 explains the approach of translating para-
phrases of English noun compounds into Hindi. Chapter 3 describes the context-based MT system that we
have integrated with a full-fledged MT system, Moses. The integration is described in Chapter 4, which
also presents the evaluation report of translation after integration. Finally, the conclusion
chapter summarizes our work and discusses future tasks.

Chapter 2

Translation of Noun Compounds using paraphrasing on source side

2.1 Overview
This chapter argues that paraphrasing of source language noun compounds can be used as an aid in the
translation of English noun compounds into Hindi. We discuss and implement the paraphrase-by-
preposition method for interpreting noun compounds in this chapter, and report the results of the
implementation. For example, the meaning of the nominal compounds carbon deposit and wax work can be
conveyed as deposit of carbon and work on wax.
In this method, we automatically paraphrase English noun compounds with the 8 prepositions
proposed by [Lauer 1995]. We use the paraphrase frequency found on the web as the basis for
scoring each paraphrase. We have developed a one-to-one mapping of English prepositions to post-positions
in Hindi. Finally, we translate the top 3 paraphrases into Hindi using the mapping schema and a
bilingual dictionary. We obtain an accuracy of 71% over a gold data set of 250 Noun Compounds.
During evaluation we assume that the lexical substitution of the constituent nouns of the compound
is correct, because our objective in conducting the experiment is to evaluate the accuracy of translating the
paraphrase construct. The translation strategy is motivated by the following observation: in only 50%
of the cases is an English noun compound translated into a noun compound in Hindi. In the other cases,
NCs are translated into varied syntactic constructs, among which the most frequent construction type is
Modifier + Postposition + Head. The translation module also attempts to determine when a compound
is to be translated via a paraphrase and when it is to be translated as a noun compound.

2.2 Introduction
Noun compounds are abundant in English, and compounding is a highly productive phenomenon.
[Baldwin & Tanaka 2004] calculated that noun compounds comprise 3.9% and 2.6% of all tokens
in the Reuters corpus and the British National Corpus (BNC), respectively. The frequency spectrum of
NCs follows a Zipfian (power-law) distribution, so in effect many compound tokens encountered in a
text belong to a long tail of low-frequency types, as described in [Seaghdha 2008]. It is therefore difficult
to list in a dictionary all compounds that are likely to be encountered. [Tanaka & Baldwin 2003b] reported
that even for relatively frequent noun compounds, those occurring ten or more times in the BNC, static
English dictionaries provide only 27% coverage.
Understanding the syntax and semantics of noun compounds is difficult but important for many natural
language applications, including machine translation. [Rackow et al. 1992] observed two
main issues in translating a noun compound from a source language to a target language: (a) correctness in
the choice of the appropriate target lexeme during lexical substitution, and (b) correctness in the selec-
tion of the right target construct type. Issue (a) involves correctly selecting the sense of the
component words of the NC, followed by substitution of each source language word with the target
language word that best fits the selected sense [Mathur & Paul 2009]. From the perspective of machine
translation, the issue of selecting the right target language construct becomes very significant because
English NCs are translated into varied construct types in Hindi.
We have done a manual study on 50K parallel sentences from English to Hindi and have found that
noun compounds in English are translated into noun compounds in Hindi in over 40% of the cases.
In the other cases they are translated into varied syntactic constructs, among which the most frequent con-
struction type is Modifier + Post-position + Head, occurring in 35% of all the cases (for illustration
see Chapter 1). Paraphrases can facilitate the translation in those cases. For example, take
the compounds wax work and carbon deposit. The paraphrases of these compounds are work on wax
and deposit of carbon. Since English prepositions can mostly be mapped to Hindi postpositions in a one-
to-one manner, the paraphrases can efficiently be translated into correct Hindi constructions as follows:
work on wax → mOm para citra (wax on work) and deposit of carbon → kOyle kA bhandAra
(coal of deposit).
Currently there exist two different approaches in Computational Linguistics for interpretation of
Noun Compounds. They are:

1. Labeling the semantics of the compound with a set of abstract relations [Girju et. al. 2003].
E.g. chocolate bar → bar made of chocolate (MAKE relation)

2. Paraphrasing the compound in terms of syntactic constructs.

Paraphrasing, in turn, is done in three ways:

1. With prepositions (war story → story about war) [Lauer 1995].

2. With a verb+preposition nexus (war story → story pertaining to war, noise pollution → pollution caused by noise) [Finin 1980], [Nakov & Hearst 2005], [Nastase & Szpakowicz 2003].

3. With a copula (tuna fish → fish that is tuna) [Vanderwende 1995].

We have implemented the first of these paraphrasing methods.


This chapter motivates the advantage of expanding English noun compounds into paraphrases with
prepositions for translating them into Hindi. Paraphrasing with prepositions has the following ad-
vantages: (a) annotation is simpler; (b) learning is easier; (c) data sparseness is lower; and (d) most
importantly, English prepositions have one-to-one Hindi postposition correspondents most of the time.
Therefore we have chosen the strategy of paraphrasing with prepositions over paraphrasing with
verbs+prepositions for the task of translation. The chapter explores the possibility of maintaining
a one-to-one correspondence between English prepositions and Hindi postpositions and examines the accu-
racy of translation. The English NCs are paraphrased using the 8 prepositions mentioned in [Lauer 1995].
The task of translating English NCs into Hindi syntactic constructs is divided into two levels:

1. Paraphrasing by Preposition

(a) Paraphrase the NC using the 8 prepositions.

(b) Search for the paraphrase candidates in a web corpus.
(c) Apply an algorithm that determines when the paraphrase is to be ignored and the source
language NC is instead to be translated as an NC, or transliterated, in Hindi.

2. Translation

(a) Replace the English preposition with the corresponding Hindi postposition.

(b) Translate the noun components using a bilingual dictionary.

We have manually created a data set of 250 noun compounds extracted from the BNC corpus for the pur-
pose of evaluating the paraphrasing-with-preposition method and the subsequent translation. The gold standard
data consists of paraphrases of each noun compound and their translations.

The chapter is divided into the following sections. The next section presents a review of related work,
i.e. the attempts made at automatic paraphrasing of noun compounds. Section 2.4 discusses the method of
preparing the gold data for evaluation. Section 2.5 describes the implementation of the paraphrasing approach.
The results of paraphrasing are presented in Section 2.6. The schema for mapping English prepositions to
Hindi post-positions is described in Section 2.7.1; this schema is used for translating English paraphrases
into Hindi constructions. Finally, an evaluation of the translation results is reported in Section 2.7.

2.3 Related Works


This section surveys the various methods of paraphrasing noun compounds listed in the previous section.
Paraphrasing English NCs is a widely studied issue, and this section predominantly surveys that body of
work. Scholars (such as [Levi 1978], [Finin 1980]) agree that there is a limited number of relations that oc-
cur with high frequency among the constituents of noun compounds. However, the number and the level
of abstraction of these frequently used semantic categories are not agreed upon: they can vary from a
few prepositional paraphrases [Lauer 1995] to hundreds and even thousands of more specific semantic re-
lations [Finin 1980]. [Lauer 1995] uses eight prepositions for paraphrasing nominal compounds. They
are: of, for, with, in, on, at, about, and from. For example, the noun compound bird sanctuary can be
interpreted both as sanctuary of bird and as sanctuary for bird.
The automatic interpretation of noun compounds is a difficult task for both unsupervised and supervised
approaches. Currently, the best-performing NC interpretation methods in computational linguistics fo-
cus only on two-word noun compounds and rely either on rather ad hoc, domain-specific, hand-coded
semantic taxonomies, or on statistical models over large collections of unlabeled data.
The majority of corpus-based statistical approaches to noun compound interpretation collect statis-
tics on the occurrence frequencies of the noun constituents and use them in a probabilistic model
([Resnik 1993]; [Lapata & Keller 2004]).

[Lauer 1995] is the first scholar to devise and test an unsupervised probabilistic model for noun com-
pound interpretation, on the Grolier encyclopedia (an 8 million word corpus), based on a set of 8 prepositional
paraphrases. His probabilistic model computes the probability of a preposition p given a noun-noun pair
n1-n2 and finds the most likely prepositional paraphrase:

p* = argmax_{p ∈ P} Pr(p | n1, n2)    (2.1)

The ultimate goal of [Lauer 1995] is to perform semantic analysis of arbitrary noun com-
pounds, although the experiments were done on two-word noun compounds that can have a prepositional
paraphrase. A probabilistic model for noun compound paraphrasing is developed based on the theory of
meaning distributions. The model combines information from both head and modifier in order to determine
the semantic relation between them. Compounds can be paraphrased with three possible groups:

1. Copula : fish that is a tuna

2. Verbal-nexus compounds or nominalization : construction of buildings

3. Prepositional Compounds : bars of steel, stories about war

Lauer experimented with prepositional compounds only. He used Warren's (1978) study to
construct the list of possible prepositions. The study yielded 7 prepositions: of, for, in, at, on,
from, with; Lauer included one more preposition, about, because it constitutes
about 3.6% of the total cases. Thus, there were 8 prepositions that Lauer used in this experiment for
paraphrasing.
Lauer addressed the following issue in his work: given a noun compound, which preposition
among the list of 8 is most likely to occur as a paraphrase of the compound? To build a
statistical learner for this problem he gave a probabilistic model that computes the probability of the 8
prepositions and returns the most likely one. The function is defined as

∀ (n1, n2 ∈ N): A(n1, n2) = argmax_{p ∈ P} Pr(p | n1, n2)    (2.2)


Under assumptions such as:

1. The probability of the modifier is independent of the head.

2. Each role/preposition is equally likely.

we get the following equation:

∀ (n1, n2 ∈ N): A(n1, n2) = argmax_{p ∈ P} Σ_{c1 ∈ Φ⁻¹(n1), c2 ∈ Φ⁻¹(n2)} Pr(c1 | p) Pr(c2 | p)    (2.3)

where Φ is the syntactic mapping from concepts (c1, c2 ∈ C) to nouns (n1, n2). The steps to build the
paraphrasing technique are the following:

1. Build a gold data set of two-word noun compounds with their correct paraphrases.

2. Define a pair of patterns for extracting examples of the nouns modified by, and the nouns governed
by, the eight prepositions.

3. Estimate Pr(c1 | p) and Pr(c2 | p) for each preposition.

4. Use the distributions to estimate the best preposition for paraphrasing.

This method resulted in an accuracy of 40% on the test set of 400 noun compounds developed by
Lauer. We have used the same 8 prepositions for paraphrasing on our test data of noun compounds.
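At the lexical level (each noun standing in for its own concept), the decision rule in Equation 2.3 reduces to choosing the preposition that maximises Pr(n1 | p) · Pr(n2 | p). A minimal sketch of that rule follows; the probability tables it consumes are illustrative made-up values, not estimates from the Grolier corpus:

```python
# Lexicalised form of Lauer's decision rule: choose the preposition p
# maximising Pr(n1|p) * Pr(n2|p). The probability tables passed in are
# toy illustrative values, not corpus estimates.
def best_preposition(n1, n2, pr_mod, pr_head, prepositions):
    """pr_mod[p].get(n1) ~ Pr(n1|p); pr_head[p].get(n2) ~ Pr(n2|p)."""
    def score(p):
        return pr_mod.get(p, {}).get(n1, 0.0) * pr_head.get(p, {}).get(n2, 0.0)
    return max(prepositions, key=score)
```

For instance, with toy tables in which war is strongly associated with about as a modifier, best_preposition("war", "story", ...) selects about over of.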

[Lapata & Keller 2004] demonstrated that web counts can be used to approximate bigram frequen-
cies, and applied this method to the interpretation of compound nouns. For the majority of tasks,
they found that n-gram frequencies obtained from the web are better than those obtained from a large
corpus. Keller and Lapata had shown in their previous work (2003) that web frequencies

1. correlate with the frequencies obtained from the edited BNC corpus (100M words), and

2. correlate with the frequencies recreated using smoothing methods.

They used the model proposed by Lauer:

p* = argmax_p Σ_{t1 ∈ cats(n1), t2 ∈ cats(n2)} Pr(t1 | p) Pr(t2 | p)    (2.4)

where t1 and t2 represent concepts. Lauer tested this model at both the concept level and the lexical
level; Lapata and Keller, however, worked on the lexical version. Rather than using plain queries, they
preferred augmented queries and thus generated three types of queries:

1. Literal queries: use the quoted n-gram directly.

2. Near queries: use queries of the form a NEAR b to expand the n-gram, where NEAR stands for a
window of 10 words.

3. Inflected queries: use all morphological forms in an expanded query; e.g. history change
can be expanded as histories change, history changed, and so on.
Lapata and Keller have shown that the use of the web as a target corpus and of inflected queries can certainly
improve the performance of the system. Their system returns an accuracy of 55.71% when the queries
are searched on the AltaVista search engine. Since this method has shown a great improvement in
paraphrasing accuracy and has established a correlation between web counts and the BNC corpus, we have
borrowed the method of [Lapata & Keller 2004] for paraphrasing, with some additional features.
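The three query types can be sketched as follows. This is a toy illustration: inflect() is a naive pluraliser standing in for a real morphological generator, which would also produce verb forms such as history changed:

```python
# Sketch of the literal / NEAR / inflected query types of Lapata & Keller
# (2004). inflect() is a toy pluraliser; a real system would use a full
# morphological generator covering verb inflection as well.
def inflect(word):
    if word.endswith("y"):
        return [word, word[:-1] + "ies"]
    return [word, word + "s"]

def make_queries(n1, n2):
    literal = [f'"{n1} {n2}"']
    near = [f"{n1} NEAR {n2}"]          # NEAR = within a 10-word window
    inflected = [f'"{a} {b}"' for a in inflect(n1) for b in inflect(n2)]
    return literal, near, inflected
```

For history change this yields the literal query "history change", the proximity query history NEAR change, and inflected variants such as "histories change" and "history changes".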
[Nakov & Hearst 2005] showed that a hidden noun-noun semantic relation can be made explicit
with the help of web search engine queries. The queries are constructions that express a noun
compound as a relative clause containing wildcard characters to be filled with a verb. For example,
a compound like tear gas can be interpreted as gas that causes tears, gas that brings tears or gas that
produces tears. For extracting verbs from this type of paraphrase they devised queries of the form
gas THAT * tear, where THAT stands for that, which or who, and used Google to extract the snippets.
The query can be generalized as n2 THAT * n1, where n1 and n2 are the noun components and * stands for
one to eight words. Issuing this query, they retrieved the first 1000 snippets and considered only
those snippets for which the sequence of words following n1 is non-empty and contains at least one
non-noun. They then shallow-parsed the snippets to extract the verbs and the following preposi-
tions, and also ensured that there is only one verb phrase between the verb and n1, to disallow complex
clauses.
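The extraction step can be approximated with a naive regular expression over the returned snippets; the real system shallow-parses the snippets and keeps only verbs (optionally followed by a preposition), but this sketch of the n2 THAT * n1 pattern conveys the idea:

```python
import re

# Naive sketch of extracting wildcard fillers for the query 'n2 THAT * n1'
# (Nakov & Hearst 2005). The real system shallow-parses the snippets; here
# we simply pull the 1-8 word sequence between the relativiser and n1.
def extract_fillers(snippets, n1, n2):
    pattern = re.compile(
        rf"\b{re.escape(n2)}\s+(?:that|which|who)\s+"
        rf"((?:\w+\s+){{1,8}}?){re.escape(n1)}\b",
        re.IGNORECASE)
    fillers = []
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            fillers.append(match.group(1).strip())
    return fillers
```

For tear gas, snippets like "a gas that causes tears" and "the gas which brings tears" yield the fillers causes and brings.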

Another work that substantially overlaps with paraphrasing is that of [Kim & Baldwin 2006], who
worked on the interpretation of noun compounds via verb semantics. They developed an auto-
matic method for interpreting NCs based on semantic relations. A set of seed verbs is used to represent a
single semantic relation; for example, the NC within the sentence The bald headed guy owns the Mercedes
is taken as evidence of the POSSESSOR relation. Their work can be divided into a number of steps. First, 3
large corpora (BNC, WSJ and Brown) are taken and the sentences are parsed using the RASP parser. Second,
sentences are filtered by selecting only those which contain both the head (H) and the modifier (M). Third,
verbs are extracted from the sentences and mapped to the seed words using the WordNet::Similarity measure,
generating templates like

S(have, own, possess ∈ V, M_SUBJ, H_OBJ)    (2.5)

where V is the set of seed words. Fourth, a set of semantic relations is retrieved, from which the
best-fitting relation is selected using the TiMBL classifier.

To our knowledge, no previous work has attempted to translate nominal compounds by first paraphrasing
them and then translating the paraphrases into the target language. In this regard, the proposed approach is
the first of its kind. We describe our method of paraphrasing in Section 2.5.

2.4 Data
We have created two data sets for evaluating our algorithm. One set contains 218 nominal compounds
which originally occur in Lauer's test set of 400 noun compounds [Lauer 1995]. We have taken this data
to compare our system with Lauer's on his own data set. To evaluate the quality of paraphrasing
and translation of NCs together, we have created our own data set, since no reference data is available for
such a task. This second test set consists of 250 bigram noun compounds extracted from the BNC
corpus [Lou Burnard 2000]; it is manually paraphrased on the source side and translated into Hindi
for evaluating both the paraphrasing and the translation techniques. The BNC contains varied text,
ranging from newspaper articles to letters, books, etc.

2.5 Paraphrasing of Noun Compounds


This section describes the procedure of paraphrasing by preposition that we have adopted. The
system comprises the following stages:

1. Generating prepositional paraphrase candidates for an English NC by brute force.

2. Issuing each paraphrase as a query in a web search.

3. Filtering the results using heuristics.

4. Selecting the best paraphrase.
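The stages above can be sketched as follows. This is a minimal sketch, not the thesis implementation: web_count is a hypothetical stand-in for the search-engine hit-count lookup (the thesis used Yahoo!), and filtering uses the mean-frequency threshold described later in this section.

```python
# Sketch of paraphrase-by-preposition candidate selection. Assumption:
# web_count stands in for a search-engine hit-count lookup (Yahoo! in the
# thesis); query quoting and ties are ignored for brevity.
PREPOSITIONS = ["of", "for", "with", "in", "on", "at", "about", "from"]

def paraphrase_candidates(modifier, head, web_count):
    """Return up to 3 'Head Prep Modifier' paraphrases whose web frequency
    is at least the mean frequency over all generated paraphrases."""
    counts = {p: web_count(f"{head} {p} {modifier}") for p in PREPOSITIONS}
    total = sum(counts.values())
    if total == 0:
        # No paraphrase found: treat the NC as close-knit, keep it an NC.
        return []
    threshold = total / len(PREPOSITIONS)   # mean of all frequencies
    survivors = [p for p in counts if counts[p] >= threshold]
    survivors.sort(key=counts.get, reverse=True)
    return [f"{head} {p} {modifier}" for p in survivors[:3]]
```

Fed the Lauer-preposition counts of Table 2.1 for finance minister, this selects minister of finance and minister for finance, matching the example discussed in this section.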

Each noun compound is paraphrased with all eight prepositions under consideration. Our examination
of parallel corpora (for details see Chapter 3) has revealed that an English NC remains an NC in
Hindi 40% of the time. This observation motivates us to design an algorithm that determines whether
an English NC is to be translated as an analytic construct or retained as an NC in Hindi. We have
used the Yahoo! search engine to query Head Preposition Modifier on the web for a given input NC
(Modifier Head) to obtain the frequency of each paraphrase. For example, the paraphrases for the NC finance
minister and their frequencies are given in Table 2.1:
In Table 2.1 we notice that the distribution varies widely. For some paraphrases the count
is very low (for example, minister about finance has count 2), while the highest count, 5420000, is for
minister of finance. The wide spread is apparent even when the range is not that high, as illustrated
in Table 2.2:

Paraphrase Web Frequency
minister about finance 2
minister from finance 16
minister on finance 34300
minister for finance 1370000
minister with finance 43
minister by finance 20
minister to finance 508
minister in finance 335
minister at finance 64
minister of finance 5420000

Table 2.1 Frequency of Paraphrases for finance minister resulted from Web search.

Paraphrase Web Frequency


agencies about welfare 1
agencies from welfare 16
agencies on welfare 64
agencies for welfare 707
agencies with welfare 34
agencies in welfare 299
agencies at welfare 0
agencies of welfare 92

Table 2.2 Frequency of Paraphrases for welfare agencies resulted from Web search.

During our experiments we have come across three typical cases: (a) no paraphrase is available when
searched; (b) the frequency counts of some paraphrases for a given NC are very low; and (c) the frequencies of
a number of paraphrases cross a threshold limit. The threshold is set to be the mean of all the paraphrase
frequencies. Each of these cases signifies something about the data, and we build our translation
heuristics on these observations. When no paraphrase is found in the web corpus for a given NC, we
consider such an NC a very close-knit construction and translate it as a noun compound in Hindi; this
generally happens when the NC is a proper noun or a technical term. Similarly, when there exist a
number of paraphrases each crossing the threshold limit, it indicates that the noun components
of such an NC can occur in various contexts, and for such cases we select the top 3 paraphrases as the
probable paraphrases of the NC. For example, the threshold value for the NC finance minister is
6825288/8 = 853161. The two paraphrases considered probable paraphrases of this NC are therefore
minister of finance and minister for finance; the remaining paraphrases are ignored. When the count
of a paraphrase is less than the threshold, it is removed from the data. We presume that such a low
frequency does not convey any significant paraphrase; on the contrary, it only adds noise to the
probability distribution. For example, the counts of all paraphrases of antelope species except species of antelope
are very low, as shown in Table 2.3; they are therefore not considered probable paraphrases.

Paraphrase Web Frequency


species about antelope 0
species from antelope 44
species on antelope 98
species for antelope 8
species with antelope 10
species in antelope 9
species at antelope 8
species of antelope 60600

Table 2.3 Frequency of Paraphrases for antelope species after Web search.

2.6 Experiments & Results


For a given NC we use a brute-force method to find the paraphrase structure. We use Lauer's 8
prepositions augmented with to and by (of, in, about, for, with, at, on, from, to, by) for prepositional
paraphrasing. A web search is done for all paraphrases and the frequency counts are retrieved. The mean
frequency (F) is calculated over all retrieved frequencies, and all paraphrases whose frequency exceeds F
are selected. We have first tested the algorithm on our test set of 250 NCs. The results for the top three
paraphrases are given below:

Selection Technique Precision


Top 1 61.6%
Top 2 67.20%
Top 3 71.6%

Table 2.4 Paraphrasing Accuracy

We have also tested the algorithm on Lauer's test data (the first 218 of his 400 NCs) and have
obtained the results shown in Table 2.5. Each item of the test data is paraphrased with the preposition
that best explains the relationship between the two noun components. Lauer gives X for compounds
which cannot be paraphrased using prepositions, e.g. tuna fish.
OLauer : number of occurrences of each preposition in Lauer's test data
OCI : number of prepositions correctly identified by our method
In Table 2.6 we compare our results with Lauer's on his data. We report the results under the
following criteria:

Prep OLauer OCI Percentage
Of 54 37 68.50%
For 42 20 47.62%
In 24 9 37.50%
On 6 2 33.33%

Table 2.5 Distribution of Preposition on Lauer test data of 218 NC

1. Only N prep N cases are considered.

2. Non-prepositions (X) are also considered; this case is referred to as All in Table 2.6.

Case Our Method Lauers


N-prep-N 43.67% 39.87%
All 42.2% 28.8%

Table 2.6 Comparison of our approach with Lauers Approach

2.7 Translation of Noun Compounds


For the translation of noun compounds we first paraphrase the compounds with prepositions as described
in Section 2.5. The paraphrases are then translated into Hindi.
In many cases English prepositions are semantically overloaded. For example, the NC Hindu
law can be paraphrased as law of Hindu. This paraphrase can mean law made by Hindus (though not
necessarily for Hindu people alone) or law meant for Hindus (the law can be made by anyone, not
necessarily by Hindus). Such a resolution of meaning is not possible from the prepositional paraphrase. This
chapter argues that, at least from the point of view of translation, this is not an issue, because
the Hindi correspondent of of, which is kA, is equally ambiguous. The translation of Hindu law
is hinduoM kA kAnUn, and this construction can have both of the aforementioned interpretations; human
users can select the right interpretation in the given context. The next subsection describes the mapping
of English prepositions to Hindi post-positions in detail.

2.7.1 Mapping English Preposition to Hindi Post-position


The strategy for mapping English prepositions to Hindi post-positions is crucial for the present
translation task. The mapping decision is mainly motivated by a preliminary study of aligned parallel corpora
of English and Hindi, in which we observed the following distribution of translation probabilities
of Lauer's 8 prepositions, as shown in Table 2.7.

Prep    Post-Pos       Sense          Prob.
of      kA             Possession     0.13
        ke             Possession     0.574
        kI             Possession     0.29
        se             Possession     0.002
from    se             Source         0.999
at      meM            Location       0.748
        par            Location       0.219
with    se             Instrument     0.628
        ke sAtha       Association    0.26
on      par            Loc./Theme     0.987
        ko             Theme          0.007
about   ke bAre meM    Subj. Matter   0.68
in      meM            Location       0.999
for     ke lie         Beneficiary    0.72
        ke             Possession     0.27

Table 2.7 Mapping of English Preposition to Hindi postposition from aligned English-Hindi parallel
corpora.

Table 2.7 shows that each English preposition is mostly translated into one Hindi postposition, except
for a few cases such as at, with and for. The probabilities of on being translated into ko and
of into se are very low, and we therefore ignore these mappings in our schema. The preposition
at can be translated into meM and para, both of which can refer to location in Hindi.
The two prepositions with and for, however, can be translated into two distinct relations, as shown in
Table 2.7; from our parallel corpus data we therefore find that these prepositions are semantically over-
loaded from the Hindi language perspective. The right sense, and thereafter the right Hindi correspondent,
can only be selected in context; in the present task we select the mapping with the higher probability.
English prepositions are thus mapped to one Hindi post-position in all cases except at and about.
The final correspondence as used in the present work is given in Table 2.8.
Hindi post-positions can be multi-word, as in ke bAre meM, ke liye and so on, as shown in Table
2.8. For the present study, the lexical substitution of the head noun and the modifier noun is presumed to be cor-
rect.
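Given this mapping, translating a paraphrase reduces to substituting the nouns and swapping the construction order, since English head-PREP-modifier corresponds to Hindi modifier-POSTPOSITION-head. A minimal sketch follows; the bilingual dictionary is a toy illustration, and for ambiguous mappings only the highest-probability form is listed (e.g. kA for of, whose real choice among kA/kI/ke depends on gender and number):

```python
# Translating 'head PREP modifier' into the Hindi construct
# 'modifier POSTPOSITION head' via the Table 2.8 mapping. The dictionary
# argument is a toy stand-in for the bilingual lexicon; for ambiguous
# mappings only one form is kept here.
PREP_TO_POSTPOS = {
    "of": "kA", "on": "para", "for": "ke liye", "at": "para",
    "in": "meM", "from": "se", "with": "ke sAtha", "about": "ke bAre meM",
}

def translate_paraphrase(head, prep, modifier, bilingual_dict):
    postposition = PREP_TO_POSTPOS[prep]
    # English 'head PREP modifier' -> Hindi 'modifier POSTPOS head'.
    return f"{bilingual_dict[modifier]} {postposition} {bilingual_dict[head]}"
```

For example, law of Hindu with a toy dictionary yields hinduoM kA kAnUn, the translation discussed above.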

2.8 Translation of Noun Compounds: Experiments and Result


In this section we check for the accuracy of translation of English paraphrases of NCs into Hindi.
For this task we have used the gold standard paraphrase data of 250 noun compounds which we have
manually created. The same set of noun compound is given to Google translator for translation. We

21
Preposition Postposition
of kA/kI/ke
on para
for ke liye
at para/meM
in meM
from se
with ke sAtha
ke bAre meM
about ke viSaya meM
ke sambaMdhi

Table 2.8 Preposition-Postposition Mapping

have taken Google output to be the baseline against which our result has been compared. For the
present experiment Google Translator could translate the data with 68.8% accuracy.
Google returns only one translation which we have evaluated against our gold standard data. For our
translation system, In our case, we have considered 3 top paraphrases and translated them into Hindi by
using the English preposition to Hindi postposition mapping schema. Table 2.9 presents the accuracy of
the translation of the top three paraphrases.

Case Precision
Top 1 61.6%
Top 2 68.4%
Top 3 70.8%

Table 2.9 Translation Accuracy

Table 2.10 presents the translation accuracy for the most frequently occurring prepositions in the
source-side paraphrases. The other prepositions occur very rarely and are therefore not included in the table.

Preposition Post Position Accuracy


Of kA/ke/kI 94.3%
For ke liye 72.2%
In meM 42.9%

Table 2.10 Translation Accuracy for some individual prepositions

2.9 Conclusion and Future Work
This chapter describes an approach for translating English noun compounds into Hindi. The source
language NCs are paraphrased with prepositions, and the paraphrases are then translated into Hindi
following an English preposition to Hindi post-position mapping schema. The translation results are
encouraging as a first step towards this kind of work, and they identify a useful application for the task
of paraphrasing noun compounds using prepositions. The next steps of experimentation include the following
tasks: (a) designing the test data in such a way that all correspondents get equal representation in the
data; (b) examining whether there are any other prepositions (besides Lauer's 8) which can be
used for paraphrasing; and (c) using context for translation. The next chapter describes an approach to
translation which uses context information from the source language side.

Chapter 3

Context Based Translation of Noun Compounds

3.1 Overview
This chapter presents our second approach to noun compound translation, which we refer to as
context based translation: searching for and ranking translation candidates on the target side. English noun
compounds can be translated into Hindi in different ways. This chapter presents an automatic translation
system for English bigram noun compounds into Hindi. The method comprises the following steps:

Translation template generation.

Extraction of noun compounds.

Finding the appropriate senses of the components of the compound using sense disambiguation.

Lexical substitution of the disambiguated components using a bilingual dictionary.

Candidate generation using the templates.

Corpus search for the translation candidates.

Ranking.
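The steps above can be sketched as a small pipeline. Every component here is a toy placeholder: a real system uses a POS tagger for extraction, a word sense disambiguation module, the full bilingual dictionary, and an indexed Hindi corpus for counting candidates.

```python
# Toy sketch of the NCT pipeline stages. extract_ncs stands in for
# POS-pattern-based NC extraction; sense disambiguation is collapsed into
# the one-sense toy dictionary; corpus_count stands in for a frequency
# lookup over the indexed target-side (Hindi) corpus.
def extract_ncs(tagged_tokens):
    return [(w1, w2) for (w1, t1), (w2, t2)
            in zip(tagged_tokens, tagged_tokens[1:])
            if t1 == "NN" and t2 == "NN"]

def translate_nc(modifier, head, dictionary, templates, corpus_count):
    mod_h = dictionary[modifier]        # lexical substitution
    head_h = dictionary[head]
    candidates = [t.format(mod=mod_h, head=head_h) for t in templates]
    return max(candidates, key=corpus_count)   # corpus search + ranking
```

With templates such as "{mod} {head}" (noun compound) and "{mod} kA {head}" (genitive), the candidate with the highest target-corpus count wins the ranking step.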

We show that correct sense selection for the component nouns of a given noun compound during
the analysis stage significantly improves the performance of the system, and this makes the present
work distinct from all previous work on automatic bilingual translation of noun compounds.

3.2 Introduction
At the time of taking up the present project we made a preliminary study of NCs in English-
Hindi parallel corpora in order to identify the distribution of the various construct types to which English
NCs are aligned. In a parallel corpus of around 50,000 sentences, we found 9246 sentences (i.e. 21%
of the whole corpus) that contain a noun compound. We have discussed the various translation
possibilities of noun compounds in Chapter 1; the percentage of each type of translation is given in
Table 3.1.

We have also come across some cases where an NC corresponds to a long paraphrase construct, for
which we have not given a count in this table. There are 8% of cases (see Table 3.1) in which an English
NC becomes a single word form in Hindi. The single word form can either be a simple word, as in cow
dung translating to gobara, or a compounded word, as in blood pressure to raktacApa and transition
plan to parivartana-yojanA.

Construction Type        Example                                              No. of Occurrences

Nominal Compound         birth rate → janma dara                                      3959
Genitive (of-kA/ke/kI)   Geneva Convention → jenIvA kA samajhOtA                      1976
Purpose (for-ke liye)    Research Center → shodha ke lie keMdra (research for center)   22
Location (at/on-par)     wax work → mom par citroM (wax on work)                        34
Location (in-meM)        gum infection → masURe meM roga (gum in infection)             93
Adjective Noun Phrase    Hill Camel → pahARI UMTa (hilly camel)                        557
Single Word              cow dung → gobar                                              766
Transliterated NC        poultry bird → pOltrI barda                                  1208

Table 3.1 Distribution of translations of English NCs from English-Hindi parallel corpora.

There are 1208 cases (approximately 13%) in which the English noun compound is not translated but
transliterated in Hindi. These are mostly technical terms, unseen words, names of chemicals, people,
places and so on.

Table 3.1 reports the results of an empirical study performed on an English-Hindi parallel
corpus. We prepare a set of translation templates that represent the Hindi construct types listed in Table
3.1. In section 3.4, we will discuss how these templates are used for building translation candidates.
Table 3.1 shows that English NCs most frequently remain noun compounds in
Hindi; the second most frequent construction is the genitive. In parallel, we have
performed a study with Hindi informants to find out in how many cases an English noun compound can
legitimately be translated into a syntactic genitive construct even when a more accurate
translation exists. Our experiment shows that a noun compound is well accepted as a genitive construct in
Hindi in 59% of cases. This is an interesting finding which we have used in designing the heuristics of
the present task.
However, no definite clue is available in the data to help select the right Hindi construction type
for translating a given English NC. [Baldwin & Tanaka 2004] observe that a translator or MT
system attempting to translate a corpus will run across NCs with high frequency, but each individual
NN compound will occur only a few times (around 45-60% occur only once, e.g. wire gauge,
population explosion). The upshot for MT systems and translators is that NN compounds are
too varied to be pre-compiled into an exhaustive list of translated noun-noun compounds; the
system must be able to deal with novel noun compounds on the fly. Building an automatic translation
system for noun compounds from the source language (SL) English to the target language (TL) Hindi
is thus a very challenging task in natural language processing.
With Google Translate we achieved an accuracy of 45% on the same test data that we have used
to evaluate our model. It gave a correct translation in 29% of cases when a noun compound remains
a noun compound in Hindi. When an NC is translated as a genitive construction in Hindi, it
returned the correct result in 10% of cases. For the other cases, such as when an NC is translated as an
adjective-noun pair or as a single word, the performance of Google Translate is poor.
This chapter presents the architecture of a Noun Compound Translator system that has been able
to give an accuracy of 57% when tested on unseen gold standard test data. We limit our discussion
to English bigram noun compounds in this chapter. The approach adopted to build the system has a
close resemblance to the approaches described in [Bungum & Oepen 2009] for Norwegian to English
noun compound translation and [Baldwin & Tanaka 2004] (English to Japanese noun compound and
vice versa). All these works including the one described in this chapter follow a template based corpus
search approach. However, the present system distinctly differs from the aforementioned works for the
analysis stage. Our system, unlike others, attempts to select the correct sense of noun components by
running a WSD system [Patwardhan et. al. 2005] on the SL data. As a result of that the number of
possible translation candidates to be searched in the target language corpus is significantly reduced.
The chapter is divided into the following sections. In section 3.3, we review earlier works that have
followed approaches similar to the present work. Our approach is described in section 3.4. Finally, the
results and analysis are discussed in section 3.5.

3.3 Related Work


Considerable work has already been done on the interpretation and translation of noun compounds. One of
the pioneering works is by [Rackow et al. 1992], who put forward a transfer-based machine translation
system for noun compounds for the German-English language pair. They addressed the issues of
segmentation1 and structure selection. The two main problems they faced during machine translation of
noun compounds from German to English were:
1 German is an agglutinative language, and hence segmentation is an important part of the translation task.

1. Segmentation of German word

2. Correct translation

Correct translation again includes two steps:

1. Choosing correct target lexeme

2. Selection of right construct type

1. Segmentation of German words : The main problem in the treatment of compound words arises
from the fact that German compounds are written as one word and, above all, in many cases
the form of a word in a compound differs from its base form. They analyzed the types of
morphemes and designed the segmentation module accordingly.

2. Correctness of Translation : A set of English noun compound translations generated from German
noun compounds is matched against a target corpus of 40 million words. The
right translations are chosen on the basis of how frequently the lexemes appear in the corpus. Sometimes
not enough occurrences of the target lexeme are found in the corpus, so they
generalize the forms of the nouns. As an example, Umweltbewegung may be translated (among
other options) into ecology movement or ecological movement. Since the compound occurred
only once in the corpus, as ecological movement, they could not claim with confidence that it is
the right translation. In order to obtain more information, they searched for ecology and ecological
in the Hansard Corpus2 and found that ecological occurred 11 times compared to 1 time for
ecology. Therefore, they proposed ecological movement as the more viable translation for
Umweltbewegung.

3. Selection of Right Construct : German noun compounds are translated into following English
construction types:

(a) Noun Noun


(b) Noun of Noun
(c) Noun's Noun

This work is the inspiration for the work described in this chapter. The nature of translation from English
to Hindi is very close to the one described above. When we translate a noun compound from English
to Hindi, our two concerns are the following: a) correct lexical substitution for the components
of the noun compound, keeping their senses intact, and b) selection of the right Hindi structure from the ones
described in 3.2.
[Tanaka & Baldwin 2003] have worked on machine translation of noun compounds from Japanese to
English. They have addressed the issues of idiomaticity, as in kick the bucket (die), and overgeneration of
2
Canadian Parliament corpus of 100,000 sentences.

meaning, which arises when the system fails to capture syntactic idiosyncrasies such as lexical affinity between words,
e.g. the blocking of seemingly equivalent word combinations (many thanks vs. several thanks). They
explained two basic machine translation techniques for NN compounds.

1. Memory Based Machine Translation

(a) Extract NN compounds from the source language.


(b) Generate translation candidates.
(c) Use target language to empirically determine the best translation.

Further, there are two types of MBMT:

(a) Dictionary-driven MBMT, which stores translation pairs in a static translation database
indexed by source language strings.
(b) Alignment-driven MBMT, in which translation pairs are extracted from a parallel corpus.

2. Dynamic Machine Translation : Translation of arbitrary NN compounds. The two types of DMT
are

(a) Word-to-word compositional DMT, in which source NN compounds are fed directly into
the system rather than being extracted out of the source sentence.
(b) Interpretation driven DMT
i. Use semantics and/or pragmatics to carry out deep analysis of the source NN com-
pound.
ii. Map it to some intermediate semantic representation.
iii. Generate the translation directly from semantic representation.

[Tanaka & Baldwin 2003b] address the task of machine translation (MT) of Japanese noun-noun
(NN) compounds into English. The issues discussed in this work are:

1. Constructional variability in English translation

2. Lexical idiosyncrasies in Japanese and English.

3. Non-compositional NN compounds.

The method for translation is divided into two steps:

1. Generation

(a) Do the word-level translation for each noun component on the source side.
(b) Slot each translation pair into templates.
There were a total of 28 translation templates, determined by combining all POS alignment
mappings.

2. Selection

(a) In selection, each generated translation is scored. Candidates are scored according to CTQ
(Corpus-based Translation Quality):
CTQ(w1E, w2E, t) = α·P(w1E, w2E, t) + β·P(w1E, t)·P(w2E, t)·P(t) + γ·P(w1E)·P(w2E)·P(t)
where α + β + γ = 1

(b) Selecting the best candidate translation.

[Baldwin & Tanaka 2004] have worked on translation of noun compounds for the English-Japanese language
pair. This is a path-breaking work. They provide statistics for the frequent occurrence of noun com-
pounds in written English text3. The problems that they report for translation are the
following:

1. Constructional variability in translation

2. Lexical divergences in Japanese and English

3. Semantic underspecification

4. The existence of non-compositional NN compounds

5. High productivity and frequency

They have presented a way to translate noun-noun compounds, using a word-level dictionary and
syntactic templates for candidate generation, and corpus and dictionary statistics for selecting the right
construction. There are two stages of translation:

1. Generation : Translation of the individual components of the noun compound into the target language
using a word-level dictionary, then setting those target lexemes into (pre-generated) translation
templates to generate translation candidates.

2. Selection : Selecting the best translation out of the translation candidates generated in the first step.
The selection procedure is performed on a collection of texts in the target language collected from the web or a
corpus: CTQ(w1H, w2H, t) = α·P(w1H, w2H, t) + β·P(w1H, t)·P(w2H, t)·P(t)

The major steps to build the NCT tool have been adopted from the above-mentioned works. The use of
translation templates, followed by empirically finding the best translation using the target language, are
the major steps of the NCT tool too. [Tanaka & Baldwin 2003b] introduced a ranking measure, CTQ, to find
the best translation. Section 3.4 describes how we have made use of the CTQ ranking measure for
3 According to Baldwin & Tanaka, the BNC Corpus (84M words) has 2.6% noun compounds and the Reuters Corpus
(104M words) has 3.9% noun compounds.

finding the target noun compound.
Another work that closely resembles ours is [Bungum & Oepen 2009], on translation of noun
compounds from Norwegian to English. The major issue when working on Norwegian compounds is
segmentation, owing to agglutination in Norwegian compounds, like the German case described in
[Rackow et al. 1992]. They experimented on a test set of 444 noun compounds which were
compositional in nature. In their work they first segment the compound and then translate it by treating
the component nouns as individual entities. They also used translation templates, as described in
[Baldwin & Tanaka 2004], to generate translation candidates. They treat the problem of selecting the
right candidate as a machine learning problem: probabilistic ranking determines the most probable
translation of the source compound.
Given a source language compound n, the model estimates the probability of a candidate translation
ei as a normalized exponential of the dot product of a vector f of so-called features (arbitrary properties
determined by so-called feature functions) and a vector λ of corresponding weights:

p(ei | n) = exp(Σj λj fj(ei, n)) / Σk exp(Σj λj fj(ek, n))        (3.1)

The highest scoring candidate is considered the apt translation for the source compound n. Two
types of features were used in learning the best candidate translation, namely:

1. Monolingual Features

(a) Corpus Translation Quality
(b) freq(E1, E2, t)
(c) freq(*, E2, t)
(d) freq(E1, *, t)
(e) freq(E1, t)
(f) freq(E2, t)

2. Bi-Lingual Features

(a) freq(E1, E2 | N1, N2)
(b) freq(N1, N2 | E1, E2)
(c) freq(E1, E2, )
(d) freq(E1, E2, )
(e) freq(E1 | N1)
(f) freq(E2 | N2)
(g) freq(N1 | E1)
(h) freq(N2 | E2)

Here N1 N2 and E1 E2 are Norwegian and English noun compounds respectively, with t as the template.
Of the 444 compositional noun compounds in the test set, 90% of the data was used for training, to
calculate the feature weights, and the remaining 10% served as test data.
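The log-linear ranking of equation (3.1) can be sketched as follows. The feature functions and weights below are toy stand-ins, not the features listed above; they only show how candidate probabilities are normalized over the candidate set.

```python
import math

def loglinear_prob(candidate, source, candidates, features, weights):
    """p(e_i | n): exponential of the weighted feature sum, normalized
    over all candidate translations, as in equation (3.1)."""
    def score(e):
        return math.exp(sum(w * f(e, source) for w, f in zip(weights, features)))
    return score(candidate) / sum(score(e) for e in candidates)

# Toy feature functions (illustrative only): candidate word count, and
# whether candidate and source share an initial letter.
features = [
    lambda e, n: float(len(e.split())),
    lambda e, n: 1.0 if e[0].lower() == n[0].lower() else 0.0,
]
weights = [0.3, 1.2]  # in the actual work these are learned from training data

cands = ["ecology movement", "ecological movement"]
probs = {e: loglinear_prob(e, "Umweltbewegung", cands, features, weights)
         for e in cands}
print(probs)  # the two probabilities sum to 1
```

In [Bungum & Oepen 2009] the weights are estimated from the 90% training portion; here they are fixed by hand purely for illustration.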

3.4 Preparation of Data and Approach


This section describes our procedure in detail. The system comprises the following stages:

1. Preparation of data and template generation,

2. Determining the sense of the component nouns in the given context,

3. Lexical substitution using a bilingual dictionary,

4. Corpus search using translation templates, and

5. Ranking of the possible candidates.

Preparing the data to evaluate this kind of approach was itself a task. In this section we describe the
method we used on the test set, which we prepared in a semi-automatic manner.
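The stages above can be sketched as a single pipeline function. All arguments (wsd, dictionary, templates, corpus_score) are hypothetical stand-ins for the components described in the following subsections.

```python
def translate_noun_compound(nc, sentence, wsd, dictionary, templates, corpus_score):
    """Sketch of the NCT pipeline: sense selection, lexical substitution,
    template instantiation, and corpus-based ranking."""
    # Stage 2: determine the sense of each component noun in context.
    senses = [wsd(word, sentence) for word in nc]
    # Stage 3: lexical substitution via a bilingual dictionary.
    h1_options = dictionary(nc[0], senses[0])
    h2_options = dictionary(nc[1], senses[1])
    # Stage 4: slot every Hindi pair into every translation template.
    candidates = [t.format(h1, h2) for t in templates
                  for h1 in h1_options for h2 in h2_options]
    # Stage 5: rank candidates by corpus evidence; return the best.
    return max(candidates, key=corpus_score)

# Toy run with hand-made components (all values illustrative).
best = translate_noun_compound(
    ("road", "safety"), "Road safety aims to reduce harm",
    wsd=lambda w, s: 1,
    dictionary=lambda w, sense: {"road": ["mArga"], "safety": ["surakRA"]}[w],
    templates=["{} {}", "{} kI {}"],
    corpus_score=lambda c: {"mArga kI surakRA": 10}.get(c, 0),
)
print(best)  # mArga kI surakRA
```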

3.4.1 Preparation of Data


We have evaluated the NCT tool on manually built gold standard data of noun compound translation
pairs. We have used the parallel Tourism Corpus (8768 sentences, 0.182M words) to build our gold standard
data. We have run Tree-Tagger4 [Schmid 1994] on the English side of the Tourism Corpus. Sentences with
noun compounds are extracted from the tagged data, and the noun compounds are strictly restricted
to the two-consecutive-noun construction type. We obtained 1584 sentence pairs with distinct noun
compounds on the English side. 300 parallel sentences are manually extracted from this set so that the
noun compounds are evenly distributed over all possible Hindi structures.

3.4.2 Generation of Translation Templates


One of the most important subtasks in this work is determining the translation templates. Each
template is a possible translated construct type of an English NC in Hindi. The parallel corpus data are
inspected and generalized into translation templates. The most common templates are <E1 E2>5
4 The tagger not only gives the part of speech of the words but also outputs the lemma for each word. The lemma is required
at a later stage for searching the word in the WordNet.
5 E1 E2 are the components of the noun compound; <E E> denotes the structure.

<H1 H2> and <E1 E2> <H1 gen6 H2>. Another interesting candidate is the adjective noun phrase
in Hindi, as exemplified by the compound government official, which translates into sarakArI karmacArI;
sarakArI is the adjectival form of sarakAra. The derivation takes place by chopping off the final vowel
a and adding I in its place. Some more examples are the following: desert camel → registAnI UMTa
(registAnI < registAna), hill horse → pahARI ghOrA (pahARI < pahARa), and so on. Hindi has a rich
derivational system for adjective formation.
In the present work we have so far identified 44 templates, as shown in Appendix A.
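The a → I derivation mentioned above can be illustrated with a toy rule. Real Hindi adjective formation is richer than this single rule (which is exactly why adjectival templates are hard to generate automatically, as discussed in the conclusion).

```python
def adjectival_form(noun):
    """Toy sketch of the derivation described above: chop off a final 'a'
    and attach 'I' in its place (real Hindi morphology has more rules)."""
    return noun[:-1] + "I" if noun.endswith("a") else noun + "I"

print(adjectival_form("registAna"))  # registAnI (desert -> desert-like)
print(adjectival_form("pahARa"))     # pahARI (hill -> hilly)
```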

3.4.3 Sense Selection for components of Noun Compound


The context determines the sense of a given English NC in a sentence, the context being the content
words in the sentence. When the component nouns are treated independently, regardless of the context,
they might each represent more than one sense. For example, security has 9 different senses according to
the English WordNet. For each sense, the English word might be translated into more than one Hindi
equivalent word using an English to Hindi bilingual dictionary (security itself has 4 different translations).
Let us explain the complexity of lexical substitution with data from the corpus. We have come across the
following sentences in the test data:

1. Millions of people in the border area need to feel safe again

2. Road safety aims to reduce the harm (deaths, injuries, and property damage) resulting from
crashes of road vehicles

The noun compounds identified in sentences (1) and (2) are border area and road safety respec-
tively. All four words can be used in more than one sense. The number of senses listed in WordNet
for each of the above words is reported in Table 3.2.

Noun No. of Senses


Border 5
Area 6
Road 2
Safety 6

Table 3.2 Number of Senses Listed in Wordnet

For each sense there are many synset words, which can be seen as semantically equivalent words in
the WordNet. If we consider all words for all senses of the component nouns and attempt to translate
all of them using a bilingual dictionary, the number of translation candidates will be huge.
Moreover, we would be searching for candidates that are not even relevant to the English NC in
6 gen is the genitive marker in Hindi. It has the variants kA, kI and ke; therefore <H1 kA H2>, <H1 ke H2> and <H1 kI H2>
form three translation candidates.

the given context. In order to reduce the search space, we have chosen to use a WSD tool; we have run
WordNet-SenseRelate [Patwardhan et. al. 2005] on our data for this purpose. The tool takes the sentence
as the context and, within a specified context window, outputs the WordNet sense id for every content
word that exists in WordNet. For example, the sense ids selected by the WordNet-SenseRelate tool for the
two NCs border area and road safety are given in Table 3.3.

Noun Sense Selected Synset Words


Border #1 < boundary line, border, borderline, delimitation, mete >
Area #3 <area, region>
Road #1 <road, route>
Safety #2 <safety, refuge>

Table 3.3 Synset selected by WSD tool

The third column of Table 3.3 presents the synset associated with the sense selected by the WSD
tool. Once the synsets are acquired in this process, the translations for each word in the synset are
obtained from a bilingual dictionary. When we look into a bilingual dictionary, we may again come
across many equivalents of a word which do not match the sense id selected for that word. For
example, the word border has one equivalent, JAlara, in the bilingual dictionary that is used in
the domain of decoration and not location. We would like to discard such equivalents; otherwise the
whole point of using a WSD tool on the source language side would be lost. The ideal situation would
have been a mapping from the synset id of a word in the English WordNet to the corresponding synset
id in the Hindi WordNet. Since no such mapping is available, we have adopted the following strategy.
We first acquire all possible translations for all the words within a synset from all available dictionary
resources. Then we take out those Hindi words which are common translations of all the English words
of a synset, if there are any. For example, we got the following translations for the synset
<road, route> from bilingual dictionaries:

Noun Translation
Road pATa, mArga, sadZka, rAswA
Route mArga, sadZka, rAswA
Safety ahAnikArakawA, surakRiwa sWAna, sePZtI, surakRA, salAmawI, surakRA sAXana
Refuge SaraNa, ASraya sWAna, yAwAyAwa cakkara, sahArA, panAha

Table 3.4 Translation using bilingual dictionary

From Table 3.4, we find that mArga, sadZka and rAswA are common translations for road and
route. Once the Hindi equivalents are obtained, they are fitted into template frames to generate the
translation candidates, which are then searched in the target corpus for a match. The worst case is when
we do not find any common translation for the words in the synset, as is the case for safety and refuge
shown in Table 3.4. For such cases, we try out the translations of all synset members one by one when
generating the translation candidates.
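The common-translation strategy described above amounts to a set intersection over dictionary lookups, with a fall-back to every synset member's translations when the intersection is empty. A minimal sketch, using the entries from Table 3.4:

```python
# Dictionary entries for the synset <road, route>, taken from Table 3.4.
translations = {
    "road":  {"pATa", "mArga", "sadZka", "rAswA"},
    "route": {"mArga", "sadZka", "rAswA"},
}

def common_translations(synset, translations):
    """Intersect the Hindi translations of every member of an English
    synset; if nothing is shared, fall back to the union (i.e. trying
    all members' translations one by one)."""
    sets = [translations[w] for w in synset]
    common = set.intersection(*sets)
    return common if common else set.union(*sets)

print(sorted(common_translations(["road", "route"], translations)))
# ['mArga', 'rAswA', 'sadZka']
```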

3.4.4 Corpus Search and Ranking

We have performed a web search on the candidate translations generated in the previous step to
find the best translation for the noun compound. For ranking, a reference ranking measure based on
the frequency of occurrence of the full translation candidates in the web corpus is taken as the baseline. To
improve on the baseline, a stronger ranking measure is borrowed from [Baldwin & Tanaka 2004]. It
rates a given translation candidate according to corpus evidence for both the fully specified translation
and its parts in the context of the translation template in question. The measure, called the interpolated
Corpus-based Translation Quality (CTQ) metric, combines frequency counts extracted from the web
corpus in the following manner:
CTQ(w1H, w2H, t) = α·P(w1H, w2H, t) + β·P(w1H, t)·P(w2H, t)·P(t)

where P(w1H, w2H, t) is the probability of occurrence of template t with w1 and w2 as its instances,
and P(w1H, t)·P(w2H, t)·P(t) is the probability of occurrence of template t with w1 as its instance,
multiplied by the probability of occurrence of template t with w2 as its instance, multiplied by the
probability of occurrence of template t. After a number of experiments, the optimum values of α and β
were found to be 0.9 and 0.1 respectively, and were then fixed. Naturally, the first term should have
higher priority than the second. The results presented in the next section show that incorporating the
term P(w1H, t)·P(w2H, t)·P(t) has distinctly improved the recall of our system.
As an example, the baseline metric calculates

P(mArga kI surakRA) = Count(mArga, kI, surakRA) / (Count(mArga) + Count(surakRA) + Count(kI))

whereas the CTQ metric calculates the probability as

P(mArga kI surakRA) = [Count(mArga, kI, surakRA) + Count(mArga, kI)·Count(kI, surakRA)·Count(kI)
+ Count(mArga)·Count(surakRA)·Count(kI)] / (Count(mArga) + Count(surakRA) + Count(kI))
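The CTQ computation can be sketched directly from the formula above. The count function is a hypothetical lookup into web-corpus frequency counts, and the count values below are invented for illustration; α and β take the tuned values 0.9 and 0.1.

```python
def ctq(w1, w2, t, count, alpha=0.9, beta=0.1):
    """Interpolated CTQ: alpha * P(w1, w2, t) + beta * P(w1, t) * P(w2, t) * P(t),
    with probabilities estimated from corpus counts."""
    norm = count(w1) + count(w2) + count(t)
    full = count((w1, t, w2)) / norm                   # P(w1, w2, t)
    parts = (count((w1, t)) / norm) * (count((t, w2)) / norm) * (count(t) / norm)
    return alpha * full + beta * parts

# Hypothetical web counts for the candidate "mArga kI surakRA".
counts = {"mArga": 100, "surakRA": 80, "kI": 500,
          ("mArga", "kI", "surakRA"): 7,
          ("mArga", "kI"): 30, ("kI", "surakRA"): 25}
lookup = lambda key: counts.get(key, 0)
print(ctq("mArga", "surakRA", "kI", lookup))
```

The second (partial-count) term keeps a candidate alive even when the fully specified string is rare, which is what improves recall.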

3.5 Results and Analysis

This section presents the results of the various experiments performed as part of automatically
translating English NCs to Hindi. The results show a distinct improvement in performance as we go from
the baseline ranking measure to the CTQ ranking measure. We have used three methods of lexical substitu-
tion of noun compound components with Hindi equivalents, and the result obtained for each method
is presented in Table 3.5 and Table 3.6. In the first method we have not done any word sense
disambiguation of the component words of the source language NC; instead, we have straightaway
used the bilingual dictionaries for substituting the English NC components with all possible Hindi equivalents.
For the second method, the first WordNet sense of each component of the given English NC has been
selected as the default sense [McCarthy et. al. 2007], and all the members of the synset of the first sense have
been substituted using a bilingual dictionary.
The motivation for this approach is twofold:

1. a word occurs mostly in its default sense, which is listed as the first sense in any lexicon;

2. [McCarthy et. al. 2007] showed that choosing the first sense of a word is the most viable
option due to the skewed frequency distribution of word senses.

This increases the robustness of the system. The third method is the one we have adopted for the present
task: we run a WSD tool on the source language NC and select the appropriate sense of the given word in
that context. The purpose of trying out various methods of lexical substitution is to examine whether
the use of a WSD tool brings any improvement to the overall performance of the translator. The
tables below show that it does. The pre-processed input used for lexical substitution is not
manually analyzed data; it is the output of Tree-Tagger, which gives 94% accuracy,
and of the WSD tool WordNet-SenseRelate, which produced 80% accuracy for noun compound
disambiguation7. The results of the corpus search of the translation candidates are given in the following
two tables. The baseline frequency model performs as follows:

Lexical substitution Method Recall Precision F-Measure


Bilingual Dictionary 14.2% 50% 22.1%
WordNet 1st Sense + Dictionary 24% 46.15% 31.57%
WSD Tool + Dictionary 24.63% 53.68% 33.76%

Table 3.5 Ranking using baseline frequency model

With the use of the CTQ metric, the accuracy of translation is distinctly improved, as shown in
the following table:

Lexical substitution Method Recall Precision F-Measure


Bilingual Dictionary 19% 56.25% 28.4%
WordNet 1st Sense + Dictionary 28% 54.1% 36.9%
WSD Tool + Dictionary 28.50% 62.1% 39.06%

Table 3.6 Ranking using CTQ Metric Model

The recall of this experiment was very low. In order to increase the coverage of translation, we
conducted the following study. We involved two informants to verify on the development data whether
7 It is interesting to note that the accuracy reported for WordNet-SenseRelate on general data is 58%. When we
tested the tool on noun compounds, it gave an accuracy of around 80%.

the compounds which were not found during the corpus search can legitimately be translated as a genitive
construct. We found that this heuristic works in 59% of cases. Therefore we incorporated it as
the default translation case in our system: whenever the corpus search for a translation candidate fails,
we assign a genitive translation for that noun compound. This results in a steep improvement in recall,
although precision falls slightly. We ran the experiment on the output of the 1st and 3rd lexical
substitution methods. The result is reported in the following table:

Lexical substitution Method Recall Precision F-Measure


Bilingual Dictionary 25% 54% 34.17%
WSD Tool + Dictionary 44.5% 57% 49.98%

Table 3.7 Ranking after inclusion of the default genitive translation, i.e. X kA Y, X ke Y, X kI Y as
templates.

3.6 Conclusion
This chapter describes the architecture of a template-based translation system for translating English
noun compounds into Hindi. We have observed that English noun compounds can be translated into
Hindi in various ways; however, no clue is available to determine which type of Hindi construct a given
English noun compound will be translated into. We have therefore adopted a corpus search approach
that searches for candidate templates in a Hindi indexed corpus. While generating templates, we found
that adjectival templates are hard to generate, because adjective formation from a noun is a complex
derivational process in Hindi: it involves not only attaching an adjectival suffix to the noun but, many
a time, also a change in the vowel of the stem. In the present work, we have performed poorly on
adjective noun translation templates. Future work includes the correct generation of adjectival forms
from the modifier nouns so that correct templates for the Adjective Noun construct can be obtained.
One advantage of this approach is that a translation, if it exists in the corpus, will never be missed.
Therefore the accuracy of translation depends largely on the amount of target language data searched
for the translation candidates.
In order to improve the system, we can do the following: a) develop a WordNet sense mapping from
English to Hindi, which can further improve the translation accuracy of the NCT system, and b) use
morphological information during web search to make it more effective and thus get better results.
The NCT system described in this chapter is integrated with Moses, a statistical MT system. The details
of the integration are discussed in the next chapter.

Chapter 4

Integration of Noun Compound Translator with Moses and its Evaluation

4.1 Overview
In this chapter we present the integration of the Noun Compound Translation (NCT) system that we have
developed (see chapter 3) with the state-of-the-art machine translation tool Moses [Koehn et al. 2007].
We evaluate standalone Moses and the integrated system on a test set of 300 parallel English-Hindi sen-
tences. The test data is manually developed, and each sentence in the test set has one or more noun
compounds. The gold data contains the noun compounds and their translations into Hindi. A gain of 29%
in BLEU score and 27% in human evaluation is reported in this chapter.

4.2 Introduction
We have presented the implementation of the context-based Noun Compound Translation system
(NCT) in chapter 3. The system has reached an accuracy of 62% as reported in the previous chapter. In
order to examine the usefulness of the tool in the context of a full-fledged translation system, we have
made an effort to integrate NCT with a state-of-the-art translation system. This chapter describes the
integration of NCT with Moses [Koehn et al. 2007]. Moses is a statistical MT tool that allows automatic
training of translation/reordering models for any given pair of parallel texts. The Moses decoder is a tool
which decodes the source sentence (containing the noun compounds in this case) into the target sentence
using the translation/reordering models and the language model (built on the target language corpus).
Moses trains two types of translation models:

1. Phrase based Model

2. Tree based Model

NCT is a phrase-based system, and hence integrating it with another phrase-based system is easier than
integrating it with a syntax-based SMT system or any other MT system such as example-based MT or
tree-based MT. We aim at building an enhanced model by combining the Moses phrase

based model and the NCT system. The Moses decoder uses the enhanced model and the language model
to generate sentential translations. We evaluate the translation system extensively, both automatically and
manually, comparing the output of Moses with NCT against standalone Moses.
This chapter is divided into a number of sections. In section 4.3, we briefly review some SMT
systems and evaluation metrics for MT systems. Section 4.4 presents the details of the different data sets
we have prepared to implement Moses and evaluate our system. Section 4.5 describes the working
principle of Moses. We describe the integration of our NCT system with Moses in section 4.6. Finally,
the evaluation report is presented in section 4.7.

4.3 Related Work


This section presents a review of the following:

1. SMT systems and

2. Evaluation process of MT system

IBM introduced SMT in the early 1990s with their original approach of word-to-word translation,
allowing insertion and deletion of words. Phrase-based MT was originally introduced by [Och 2002];
the alignment template model was later reframed as the phrase-based model. Phrase-based models
translate phrases as atomic units. Most of the best performing machine translation systems, including
Google's, use phrase-based models. [Koehn et al. 2007] introduced Moses, a state-of-the-art machine
translation tool which allows automatic training of the translation model. [Federico et al. 2008] introduced
IRSTLM, a tool to build the language model. The language model keeps the grammar of the translated
sentence in check.
One of the most difficult problems in machine translation is the evaluation of a system. That one
system is better than another can only be shown if a score is attached to each: the better the score, the
better the system.
Work has been done on the evaluation of systems both automatically and by humans. ALPAC was
formed in 1964 to evaluate the progress of machine translation using human translators. The trans-
lators studied two measures, intelligibility and fidelity. Intelligibility measured how good the
language of the sentence is, and fidelity ensured that all the information is carried over into the target sentence.
In this work we use two measures, adequacy and fluency, which represent fidelity and intelligi-
bility respectively; they are measured on a scale of 5 (as described in section 4.7), unlike the scale
of 10 used in the earlier study. This was done due to low inter-annotator agreement on a scale of 10.
Many automatic evaluation metrics (a metric represents the quality of the translation) have been
developed, such as WER, TER, BLEU and NIST. [Papineni et al. 2002] introduced the metric
BLEU (Bilingual Evaluation Understudy) for automatic evaluation of machine translation. It is one of
the first metrics to have a high correlation with human judgements and has become a benchmark
for new evaluation metrics. We have used BLEU for the evaluation of our systems.
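As a rough illustration of how BLEU works, the following sketch computes sentence-level BLEU against a single reference. The real metric is corpus-level and usually smoothed; this simplified version returns 0 whenever any n-gram precision is 0.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions times a brevity
    penalty, for one candidate and one reference sentence."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference.
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # one zero precision drives the geometric mean to zero
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```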

4.4 Data Preparation

Statistical machine translation uses three different data sets, namely a) training data, b) development
data and c) test data. We have used the Tourism Corpus1 parallel data (statistics are shown in Table 4.1),
which has 3 segments: training, development and test. We have used this corpus for training and
development purposes. For training the translation and reordering models we have used the training data of
8169 sentences. For minimum error rate training [Och 2003] we have used a development set of 361
sentences. The Gyannidhi Corpus of 12000 Hindi sentences has been used to build a trigram language
model with Kneser-Ney [Kneser & Ney 1995] smoothing using the IRSTLM [Federico et al. 2008] tool.
The sizes of the training and development data are shown in Table 4.1.

Corpus Sentences Source Words Target Words


Training (Tourism) 8169 0.17M 0.18M
Development (Tourism) 358 7741 7992
Monolingual Hindi (Gyannidhi) 12000 N.A 0.4 M

Table 4.1 Corpus Statistics

The test data has not been taken from the aforementioned corpus for two reasons: a) many
NCs in the Tourism Corpus are technical terms, whose translation is often a transliterated
form of the input and not fit for our translation purpose; b) the number of NCs in the tourism test data is also
insignificant. These factors motivated us to build our test data from a different domain. We have used
Tree-Tagger [Schmid 1994] to tag the source side of an English-Hindi parallel corpus of 50K sentences.
The tagger provides useful lemma information in addition to the POS tag. Out of the tagged sentences
we have extracted about 15000 sentences which contain a bigram Noun Compound. Finally, 300 NCs
from this dataset have been handpicked to build the gold standard data, which contains the following
information:

1. Source Noun Compound

2. Target Noun Compound

3. Source Sentence

4. Target Sentence

The information is stored in the form of a tuple <NC-S, NC-T, SS, TS>, where NC-S is the Noun
Compound on the source side, NC-T is the translation correspondent of the Noun Compound on the target side,
SS is the source sentence and TS is the corresponding target sentence.

1
Hindi is a resource-poor language; the best data we could obtain was the Tourism Corpus. This corpus has been
used in the English-to-Hindi MT task of the NLP Tools Contest organized by IIIT Hyderabad.

4.5 Moses
Moses is a statistical machine translation tool that uses the phrase-based translation approach
described in [Koehn et al. 2003]. Figure 4.1 illustrates the working of the phrase-based model on the
source sentence "Ram went to buy grocery in the shopping mart".

Figure 4.1 Phrase based Model

1. The sentence is split into a number of phrases (typically not syntactic phrases), all segmentations
being equally likely. ("in the shopping mart" is one phrase.)

2. These phrases are then translated into the target language by a one-to-one mapping provided by a phrase
translation table2. ("in the shopping mart" translates to "SOpiMga mArta meM".)

3. Phrases may be reordered, with a maximum reordering limit (usually 6). (The phrase
translated from "went" on the target side jumps 4 places from the phrase translated from "Ram", i.e.
a reordering of 3.)

A translation task is cast as decoding a source sentence f into a target sentence e. This decoding is
done using the noisy channel model shown in Equation 4.1.

P(e|f) = P(e,f) / P(f) = P(e) P(f|e) / Σ_e P(e) P(f|e)    (4.1)

Applying Bayes' rule as in Equation 4.1, the model is factored into a translation model and a language
model. Substituting English as the source language and Hindi as the target language, the equation becomes

argmax_h P(h|e) = argmax_h P(e|h) P_LM(h)    (4.2)

where h is the target-language (Hindi) sentence, e is the source-language (English) sentence, and P_LM(h) is the language model.
P(e|h) is further decomposed into

p(e_1^I | h_1^I) = ∏_{i=1..I} φ(e_i|h_i) d(start_i - end_{i-1} - 1)    (4.3)
2
A table that provides a mapping from source phrase to target phrase

The foreign sentence e is broken up into I phrases e_i; all segmentations are equally likely. Each foreign
phrase e_i is translated into a target phrase h_i. Since we mathematically inverted the translation direction
in the noisy channel, the phrase translation probability φ(e_i|h_i) is modeled as a translation from Hindi to
English.
Each entry in a translation model represents

1. Source Phrase : election campaign

2. Target Phrase : cUnAva abhiyAna

3. Alignment (source to target) : (0) (1)

4. Alignment (target to source) : (0) (1)

5. Score Vector : 0.16 4.81e-05 0.5 6.35e-05 2.718

where the score vector represents the following features:

Phrase translation probability φ(h|e) : 0.16

Lexical weighting lex(h|e) : 4.81e-05

Inverse phrase translation probability φ(e|h) : 0.5

Inverse lexical weighting lex(e|h) : 6.35e-05

Phrase penalty (always exp(1) = 2.718) : 2.718

A phrase table is a compiled list of entries like the following:

election campaign ||| cUnAva abhiyAna ||| (0) (1) ||| (0) (1) ||| 0.16 4.81e-05 0.5 6.35e-05 2.718
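A sketch of how such an entry can be parsed is shown below; the helper and the field names are our own (the Moses phrase-table line format with `|||` separators is as shown above).

```python
def parse_phrase_table_line(line):
    """Split a Moses-style phrase-table entry into its fields.
    Field order follows the 'election campaign' example above."""
    parts = [p.strip() for p in line.split("|||")]
    source, target, align_st, align_ts, scores = parts
    # our own names for the five scores listed in Section 4.5
    names = ["phi_t_given_s", "lex_t_given_s",
             "phi_s_given_t", "lex_s_given_t", "phrase_penalty"]
    return {
        "source": source,
        "target": target,
        "alignment_src_tgt": align_st,
        "alignment_tgt_src": align_ts,
        "scores": dict(zip(names, map(float, scores.split()))),
    }

entry = parse_phrase_table_line(
    "election campaign ||| cUnAva abhiyAna ||| (0) (1) ||| (0) (1) ||| "
    "0.16 4.81e-05 0.5 6.35e-05 2.718")
```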
Reordering is handled by a distance-based reordering model d(start_i - end_{i-1} - 1). We consider
reordering relative to the previous phrase. We define start_i as the position of the first word of the
foreign input phrase that translates to the ith English phrase, and end_i as the position of the last word of
that foreign phrase. The reordering distance is computed as start_i - end_{i-1} - 1.
The reordering distance is the number of words skipped (either forward or backward) when taking
foreign words out of sequence. If two phrases are translated in sequence, then start_i = end_{i-1} + 1,
i.e., the position of the first word of phrase i immediately follows the position of the last word of the
previous phrase. In this case, a reordering cost of d(0) is applied. See Figure 4.2 for an
example.
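The distance computation above is simple enough to state directly in code; this small sketch (names our own) just encodes the formula:

```python
def reordering_distance(start_i, end_prev):
    """d = start_i - end_{i-1} - 1: the number of source-side words
    skipped (negative when the decoder jumps backwards)."""
    return start_i - end_prev - 1

# phrases translated in sequence: start_i = end_{i-1} + 1, so distance 0
in_sequence = reordering_distance(4, 3)
# jumping from a phrase ending at word 3 to one starting at word 6
# skips words 4-5: distance +2, as in Figure 4.2
skip_two = reordering_distance(6, 3)
```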
Each entry in the reordering model represents

1. Source Phrase

2. Target Phrase

3. Score Vector

Figure 4.2 Distance-based Reordering: Reordering distance is measured on the foreign input side. In the illustration each foreign phrase is annotated with a dashed arrow indicating the extent of reordering. For instance, the 2nd English phrase translates the foreign word 6, skipping over words 4-5, a distance of +2.

Figure 4.3 Lexicalized Reordering (Y-Axis : Source Phrase, X-Axis : Target Phrase)

where the score vector contains the bidirectional entries, source-to-target and target-to-source.
Figure 4.3 represents the three types of phrase orientation (the features in the score vector):

1. monotone (m) : if a word alignment point to the top left exists, we have evidence for monotone
orientation.

2. swap (s) : if a word alignment point to the top right exists, we have evidence of
a swap with the previous phrase.

3. discontinuous (d) : if neither a word alignment point to the top left nor to the top right exists, we
have neither monotone nor swap, and hence discontinuous orientation.
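The three-way test can be sketched over a set of word-alignment points; the coordinate convention and function signature here are our own simplification (Moses additionally distinguishes left/right directions and phrase-based vs word-based evidence):

```python
def orientation(alignment, s_start, s_end, t_start):
    """Classify the orientation of a phrase pair from the word alignment.
    alignment: set of (source_pos, target_pos) points (0-indexed).
    monotone: an alignment point exists to the top left  (s_start-1, t_start-1)
    swap:     an alignment point exists to the top right (s_end+1,  t_start-1)
    otherwise: discontinuous."""
    if (s_start - 1, t_start - 1) in alignment:
        return "m"
    if (s_end + 1, t_start - 1) in alignment:
        return "s"
    return "d"
```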

The reordering table entries look like the following:


election campaign ||| cUnAva abhiyAna ||| 0.2 4e-05 0.15 7e-04 0.65 0.9
After applying the translation and reordering models, the sentence is still essentially a bag of phrases;
the grammaticality of the output is ensured by the language model, which estimates the distribution of natural
language as accurately as possible. A statistical language model (SLM) is a probability distribution P(s)
over strings s that attempts to reflect how frequently a string s occurs as a sentence.
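As an illustration, a minimal unsmoothed maximum-likelihood trigram model is sketched below (class and variable names are our own); the system in this chapter instead uses IRSTLM with Kneser-Ney smoothing, which assigns non-zero probability to unseen trigrams:

```python
from collections import Counter

class TrigramLM:
    """Minimal MLE trigram model over whitespace-tokenized sentences."""
    def __init__(self, sentences):
        self.tri, self.bi = Counter(), Counter()
        for s in sentences:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            for i in range(2, len(toks)):
                self.tri[tuple(toks[i - 2:i + 1])] += 1   # trigram count
                self.bi[tuple(toks[i - 2:i])] += 1        # history count

    def prob(self, sentence):
        """P(s) as a product of trigram probabilities; 0 for unseen trigrams."""
        p = 1.0
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(toks)):
            h, w = tuple(toks[i - 2:i]), tuple(toks[i - 2:i + 1])
            p *= self.tri[w] / self.bi[h] if self.bi[h] else 0.0
        return p
```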
The standard translation system described so far consists of three models:

1. Phrase translation model φ(e|h);

2. Reordering model d;

3. Language model p_LM(h).

The three model components combine to form the phrase-based statistical machine translation
model

h_best = argmax_h ∏_{i=1..I} φ(e_i|h_i) d(start_i - end_{i-1} - 1) ∏_{i=1..|h|} p_LM(h_i | h_1 ... h_{i-1})    (4.4)
When we use the system, we may observe that the words between input and output match up pretty
well, but that the output is not grammatically good Hindi. Hence, we are inclined to give the language
model more weight. Formally, we can do this by introducing weights λ_φ, λ_d, λ_LM that let us scale the
contributions of each of the three components:

h_best = argmax_h ∏_{i=1..I} φ(e_i|h_i)^λ_φ d(start_i - end_{i-1} - 1)^λ_d ∏_{i=1..|h|} p_LM(h_i | h_1 ... h_{i-1})^λ_LM    (4.5)

The phrase-based decoder employs a beam search algorithm, similar to the one by [Jelinek 1998].
The Hindi output sentence is generated left to right in the form of partial translations (or hypotheses). We
start with an initial empty hypothesis. A new hypothesis is expanded from an existing hypothesis by the
translation of a phrase: the Hindi phrase is attached to the existing Hindi output sequence, the English
words it covers are marked as translated, and the probability cost of the hypothesis is updated. The cheapest
(highest probability) final hypothesis with no untranslated English words is the output of the search.
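The hypothesis-expansion loop can be sketched as a toy monotone decoder (all names ours); real Moses additionally handles reordering, hypothesis recombination and stack pruning, so this is an illustration of the search idea, not the actual decoder:

```python
import math

def decode(src_words, phrase_table, max_phrase_len=3):
    """Toy left-to-right hypothesis expansion, monotone order only.
    A hypothesis is (cost, output_phrases); stacks are indexed by the
    number of source words covered so far."""
    stacks = [[] for _ in range(len(src_words) + 1)]
    stacks[0].append((0.0, ()))                    # initial empty hypothesis
    for covered in range(len(src_words)):
        for cost, output in stacks[covered]:
            for plen in range(1, max_phrase_len + 1):
                if covered + plen > len(src_words):
                    break
                phrase = " ".join(src_words[covered:covered + plen])
                for target, prob in phrase_table.get(phrase, []):
                    # extend the hypothesis by one translated phrase
                    stacks[covered + plen].append(
                        (cost - math.log(prob), output + (target,)))
    complete = stacks[len(src_words)]              # no untranslated words left
    return " ".join(min(complete)[1]) if complete else None

table = {"election": [("cUnAva", 0.5)],
         "campaign": [("abhiyAna", 0.5)],
         "election campaign": [("cUnAva abhiyAna", 0.9)]}
best = decode(["election", "campaign"], table)
```

Here the whole-compound phrase (probability 0.9) beats the word-by-word derivation (0.5 × 0.5), which is precisely the effect a dedicated NC phrase table aims for.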
Moses treats a Noun Compound as just another phrase and does not perform any special operation
for the translation of NCs. Neither of the two issues discussed by [Rackow et al. 1992] in Section 3.3 is
handled explicitly by Moses. Above all, the low frequency of NCs means less training data (on the source and target
sides) and thus lower, insignificant translation probability scores. This motivates us to build a hybrid
system, a statistical system with some linguistic information, which tackles the issues with
Noun Compounds specifically.

4.6 Integration
As mentioned earlier, both Moses and NCT are phrase-based systems and therefore their integration is
not difficult. This section presents the integration in detail. We have applied two different techniques for
integration: a) generating an additional phrase table and b) generating additional
training data. For both methods, the following common steps are followed:

1. For each compound in the sentence we generate the translations using Noun Compound Translator
and make a list of all the translation pairs.

2. We train the translation and the reordering models using Moses on the training data described in
Section 4.4.

The two techniques of integration are described below:

1. Generating an additional phrase table : After training is done, we use the translation pairs to
build a phrase table for the NCs and add it to the list of phrase tables already generated
by Moses3. Since we want the decoder to choose the translation options provided by the
NC phrase/reordering table, we raise the probabilities of the features explained in Section 4.5 to
the maximum (i.e. 1). Thus, the phrase table entry generated for the NC election campaign would be the
following:
election campaign ||| cUnAva abhiyAna ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
One important task that remains for the NCT system to be integrated into Moses is to express the
alignment information in the phrase table built from the NCT output. In order to do that, NCT
is extended with the following feature: along with the translation output, it also generates
the alignments from the source compound/phrase to the target compound/phrase. For example, cancer
treatment is translated as kEnsar ke lie cikitsA. We treat the compound cancer
treatment as cancer NULL treatment. In this compound cancer is translated as kEnsar, treatment
is translated as cikitsA, and the NULL word is translated as ke lie. cancer, NULL and treatment have
fertilities (represented by the function f()) of 1, 2 and 1 respectively. Therefore, cancer,
the 0th source word, is aligned to kEnsar, the 0th target word; but treatment,
the 1st source word, is aligned to cikitsA, the f(cancer) + f(NULL) = 1 + 2 = 3rd target word. Keeping track
of the alignments adds to the computational cost of the system.
Simultaneously, a reordering table is built specifically for the noun compound pairs, with all
probabilities likewise raised to the maximum. We then perform minimum error rate training (MERT) on
the development set (described in Section 4.4) to optimize the translation quality.
The results with this method are not satisfactory (refer to Section 4.7 for more details). Our
error analysis for this system has shown that the decoder is compelled to choose
the translation option provided by NCT, which in turn affects the translation of the whole sentence.
We have, therefore, resorted to another method for integrating the two systems together. We
refer to this system as System-1 later in this chapter.

2. Generating additional training data : We use the translation pairs and treat them as a parallel
corpus. Rather than building the whole phrase table we simply use the translation pairs as
parallel text. For these translation pairs to be selected in the decoding process they are required to
outweigh the other possible translations. To ensure that, we build a parallel corpus which contains
each translation pair 10 times, in order to outweigh the other translations of a given pair generated
from the training corpus by Moses. In this case the computational cost is much lower than in the
first method, as we do not have to align the translation pairs. MERT is performed afterwards to
optimize the translation quality. We refer to this system as System-2 later in this chapter.

3
Moses provides a feature of adding a phrase table with flat weights to the existing ones.
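The fertility-based alignment bookkeeping described for the first method can be sketched as follows (function and names are our own): each source token's target position is the running sum of the fertilities before it, with NULL consuming target positions but owning no source position.

```python
def nc_alignments(expanded):
    """expanded: list of (source_word, fertility) for a compound with an
    explicit NULL, e.g. cancer NULL treatment -> kEnsar ke lie cikitsA.
    Returns {source_position: target_position}."""
    alignments, src_pos, tgt_pos = {}, 0, 0
    for word, fert in expanded:
        if word != "NULL":              # NULL has no source position to align
            alignments[src_pos] = tgt_pos
            src_pos += 1
        tgt_pos += fert                 # advance by the words this token emits
    return alignments

# cancer (f=1), NULL (f=2, -> ke lie), treatment (f=1): treatment, the 1st
# source word, aligns to target position f(cancer) + f(NULL) = 3 (cikitsA)
align = nc_alignments([("cancer", 1), ("NULL", 2), ("treatment", 1)])
```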

4.7 Evaluation
The two systems described in the previous section are evaluated on a test set of
300 sentences (see Section 4.4 for the preparation of the data). Two further systems, Moses standalone and Google
Translate, are run on the same test data in order to compare the results. Two methods of evaluation are
applied: a) automatic evaluation using the BLEU metric [Papineni et al. 2002] and b) human evaluation.
Three human evaluators have evaluated the data; the details are given below.
We present the evaluation report for both sentence translation and NC translation. First, Table 4.2
reports the evaluation of the NC translation tool, NCT, comparing its performance with Moses
and Google Translate on NC translation alone:

System NC Trans. Train. Set


Moses (Baseline) 23% 180K words
Google 35% -NA-
System 1 30% 180K words
System 2 28% 180K words

Table 4.2 NC translation accuracy (surface level) on the test data.

As shown in Table 4.2, System 1 performs much better than Moses (7% absolute improvement)
and slightly better than System 2. This is because System 1 has to use the translations of
Noun Compounds given by NCT, whereas System 2 is under no such constraint. Being compelled
to use the translations from the phrase table, however, affects the performance of the overall sentential
translation, as we will see below in the sentential translation evaluation.
Even though the goal of the thesis is to develop a noun compound translation tool, we realize that it is
important to examine how the performance of a full-fledged translation system is affected when our
system is integrated with it. We examine whether the overall translation quality improves when
our system is plugged into an existing system.
The following table shows the results of the automatic evaluation of sentence translation. All these sentences
contain a nominal compound.

System BLEU Train. Set
Moses (Baseline) 2.34 180K words
Google 8.07 -NA-
System 1 2.74 180K words
System 2 3.01 180K words

Table 4.3 BLEU scores on the test data.

In Table 4.3 we observe that Google shows a fairly high BLEU score compared to the other systems
because of the vast amount of data it has, which helps it build better translation and language
models. However, the score is still much lower than for other language pairs such as English-French (30
BLEU) or Arabic-English (35 BLEU). Other English-Hindi SMT systems exist, but all of them have low
BLEU scores for the English-Hindi translation task, mainly because of the scarce and low-quality
training data. [Singh & Bandopadhyay 2010] and [Udupa & Farooquie 2004] have shown as low as 13
BLEU points on different test sets even with a moderate amount of training data of 150K words.
There is a relative improvement of 29% in System 2 w.r.t. the Moses system. The BLEU scores reported
in Table 4.3 are significantly lower than the scores we obtained on the development set during the
tuning phase, as presented in Table 4.4.

System BLEU Dev Set


Moses (Baseline) 10.49 10K words
System 1 10.55 10K words
System 2 10.70 10K words

Table 4.4 BLEU scores on the development set

The reason for this difference in BLEU score (compare Tables 4.3 and 4.4) follows from our
discussion in Section 4.4, where we noted that the test data could not be taken from the
same domain from which the training and development data were selected.
To calculate Noun Compound translation accuracy, a binary score is determined on the basis of an
exact match with the gold data NCs. For example, the gold standard translation for sea food is
given as samuxrI bhojana. The NCT tool translates the compound as samuxrI khAxya. Although food
can be translated as khAxya in this context, the score will be 0 because the translation output is not an
exact match with the gold translation.
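The surface-level accuracy of Table 4.2 amounts to an exact string match against the gold NC; a minimal sketch (names ours):

```python
def exact_match_accuracy(system_ncs, gold_ncs):
    """Surface-level NC accuracy: 1 for an exact string match with the
    gold translation, 0 otherwise, so samuxrI khAxya scores 0 against
    samuxrI bhojana even though both are acceptable translations."""
    matches = sum(1 for s, g in zip(system_ncs, gold_ncs) if s == g)
    return matches / len(gold_ncs)

# one exact match out of two NCs -> accuracy 0.5
acc = exact_match_accuracy(["samuxrI khAxya", "cUnAva abhiyAna"],
                           ["samuxrI bhojana", "cUnAva abhiyAna"])
```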
BLEU has a tendency to give two semantically equivalent sentences a low score. Moreover, since Hindi is
a free word order language, it is more difficult to get a good BLEU score [Papineni et al. 2002] with a single
reference gold translation, which is what our dataset provides. While building the gold
standard data, human translators generally try to preserve the meaning of the whole sentence and are not bound
to retain the syntactic structure of the NC on the target side, which leads to arbitrary restructuring of the sentence.

These observations have made us think that the BLEU metric is not the right measure for evaluation
in this case. We therefore employed three human evaluators (all native speakers of Hindi) to
evaluate and score the translations of all the systems.
Each entry in the evaluation set contains

1. Noun Compound (Source)

2. Noun Compound (Target) (by all 4 systems)

3. Source Sentence

4. Target Sentence (by all 4 systems).

Noun Compounds are marked as correctly translated if the translation by a system is semantically
equivalent to the compound in the gold data set. For example, if the gold standard data contains the translation
of sea food as samuxrI Bojana and the translation by a system is samuxrI KAxya, we mark it
correct, because both compounds convey the same meaning.
To make evaluation an easy task, NCs were scored as follows:

1 : if translation by Moses was correct

2 : if translation by Google was correct

3 : if translation by System-1 was correct

4 : if translation by System-2 was correct

0 : if none of the translation is correct

The human evaluators are asked to score the target sentence on the scale illustrated in
Table 4.5. Here we have not used separate Adequacy and Fluency scales as in [Denkowski & Lavie 2010];
instead we have merged the two scales into a single new scale.

Score Adequacy Fluency
5 contains all information Flawless grammar
4 contains most information Good grammar
3 contains much information Non-native grammar
2 contains little information Disfluent grammar
1 contains no information Grammatically incorrect

Table 4.5 5-point scale for evaluation

Inter-annotator agreement between the human evaluators was 83.3% (250 entries out of 300) when NCs were evaluated. Inter-annotator agreement
for the sentential translations was 48% (144 entries out of 300). The agreement was calculated as

InterAnnotatorAgreement = NumberOfTimesAnnotatorsAgree / TotalNumberOfEntries    (4.6)
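Equation 4.6 can be computed directly; in this sketch (names ours) an entry counts as agreement only when every annotator assigns it the same score:

```python
def inter_annotator_agreement(annotations):
    """Fraction of entries on which all annotators assign the same score.
    annotations: list of per-annotator score lists, one score per entry."""
    n_entries = len(annotations[0])
    agreed = sum(1 for scores in zip(*annotations)
                 if len(set(scores)) == 1)      # all annotators agree
    return agreed / n_entries
```

For example, with 250 of 300 NC entries scored identically by all three evaluators, the agreement is 250/300 = 83.3%.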

The following table presents the human evaluation report for both sentential translation and NC translation:

We can observe that the human judgement scores are much higher and more significant than the BLEU scores.
There is a relative improvement of 27% in the performance of System 2 w.r.t. Moses standalone for
sentential translations.

System Sentential Translation NC Trans. Train. Set
Moses 24.4% 48% 180K words
Google 40% 57% -NA-
System 1 29.5% 64% 180K words
System 2 31% 60% 180K words

Table 4.6 Human judgment score of sentence translation and NC translation accuracy

The Noun Compound translation accuracy here is higher than the accuracy shown in
Table 4.2, simply because in this experiment we have looked at semantic equivalence
(we have checked whether the translated compound fits in the target sentence and conveys the
correct meaning) rather than surface-level matching. System 1 performs best in NC translation
accuracy (16% absolute improvement over Moses). An interesting point to note here is that System 1's Noun Compound
translation accuracy is higher than System 2's, but at the sentence level it is the other way round. The
probable reason for this is that in System 1, building a separate phrase table and compelling
the decoder to choose translations from it affects the translation score of the whole sentence, hence
the lower sentential scores.
Table 4.7 shows the translation accuracy for the top 3 constructions into which an English noun compound can
be translated in Hindi.

Construction Moses System-1 System-2 Google
Noun Compounds 40% 80% 69% 76%
Genitives 25% 67% 75% 60%
Single Word 20% 67% 67% 50%

Table 4.7 Performance of systems on the top 3 constructions for NC translation

In Table 4.7 we can observe that System-1 outperforms the other systems when Noun Compounds
are translated as Noun Compounds, while when Noun Compounds are translated into the Noun-Genitive-Noun
construct in Hindi, System-2 performs better than any other system. When a Noun Compound is translated
as a single word, System-1 and System-2 perform equally well. It is evident from Table 4.7 that the NCT
tool performs better than any other system where Noun Compound translation is concerned.

4.8 Conclusion and Future Work

In this chapter, we have observed that an integrated system performs better than Moses standalone
on sentences containing NCs. Also, there is a significant difference in Noun Compound translation
accuracy, which clearly shows that NCT performs better than Moses and Google. This also indicates
that if we have some linguistic information about the type of sentence we are translating (in this case,
a sentence with an NC), we can get better translations. This chapter also shows that the BLEU metric is
not suitable for the evaluation of sentential translations from English to Hindi, because the scores are
too low to compare the systems meaningfully. The reported results are produced using a WordNet sense
disambiguation tool [Patwardhan et. al. 2005], which has an accuracy of 72% for sense selection
of the constituents of Noun Compounds, and the POS tagger [Schmid 1994], which has an accuracy of 95%. This
shows that our system can perform much better if the pre-processing tools attain higher accuracy than they
have right now.

Chapter 5

Conclusion and Future Work

We have argued in this thesis that automatic translation of Noun Compounds is an important task in
the context of Machine Translation, since the frequency of NCs is as high as 3.9% in the Reuters Corpus.
We have built two automatic translation systems that work on the basis of two different principles:

1. Performing a search on a source language web corpus for paraphrases of NCs; and

2. Performing a search on an indexed target language corpus for Hindi translation candidates.

For the first approach, the motivation is the following: the paraphrase of an English NC, once identified,
can easily be translated into a Hindi construction by mapping the preposition within the paraphrase to the
corresponding Hindi post-position. The result of this approach is encouraging as a first step towards this
kind of work. We have used the prepositions proposed by [Lauer 1995] for paraphrasing NCs. It would
be interesting to examine the frequency of occurrence of other prepositions within paraphrases; if such
prepositions are identified they can be added to Lauer's list as future work. We have developed
heuristics for selecting the top 3 paraphrases from a given list of paraphrases. Rackow [Rackow et al. 1992],
as we have mentioned in chapter 3, has observed that there exist two issues in the translation of NCs: (a)
correctness in the choice of the appropriate target lexeme during lexical substitution and (b) correctness
in the selection of the right target construct type.
In the first approach, we have not taken the first issue into consideration; in other words, we have
assumed lexical substitution to be correct. Instead, we have focused our attention on paraphrase search
on a source language web corpus in order to select the right construct type for Hindi. This
task is particularly important in the context of noun compound translation because, as we have reported
in chapter 2, English compounds are translated into Noun Compounds in Hindi in over 40% of
the cases. In the other cases they are translated into varied syntactic constructs, among them the most
frequent construction type being Modifier + Post-Position + Head, which occurs in 35% of all the cases.
Thus, identifying the right construction type into which an English NC should legitimately be translated
is an important and significant task. An effort at making the right lexical substitution for the component nouns
of a noun compound has been attempted in the second approach to translation, which we have finally
integrated into a full-fledged MT system, Moses. We have used a WSD tool, WordNet-Sense-Relate, for

correct identification of the sense of the constituent nouns in the source language, and we have developed
heuristics to translate the English word in that sense to a corresponding Hindi word. The task of
substitution would have been much easier if the synsets of the English WordNet were mapped to the corresponding Hindi
synsets.
In the second approach, the translation templates are generated at a pre-processing stage and
the translation candidates with lexical substituents are searched for in a Hindi web corpus. While
generating templates, we found that adjectival templates are hard to generate because adjective formation
from a noun is a complex derivational process in Hindi: it does not only involve attaching an
adjectival suffix to the noun but also many a time requires a change in the vowel of the stem. In
the present work, we have performed poorly on adjective-noun translation templates. Future work
includes the correct generation of the adjectival form from the modifier noun, so that correct templates for
the Adjective Noun construct can be obtained. The web corpus of Hindi is definitely not as big and varied
as that of English; with a better Hindi corpus, the results of translation by the second method will definitely
improve.
Finally, we have integrated the context based translation system with the state-of-the-art statistical machine
translation system, Moses. The integration of NCT with Moses is accomplished in two ways
and the methods are compared. Evaluation of the integrated system has been done using the automatic
metric BLEU. We have also argued that the BLEU metric is not suitable for the type of data we are
evaluating: the BLEU scores we have obtained are quite low and insignificant. As an alternative, we have
employed a human evaluation technique, which has shown promising and significant results unlike the
BLEU score.

Appendix A

Templates for candidate generation

For all N1 N2 type compounds, the templates for candidate generation are

1. N1 N2

2. N2 kA N1

3. N2 kI N1

4. N2 ke N1

5. N2 ko N1

6. N2 meM N1

7. N2 pe N1

8. N2 par N1

9. N2 xene vAlA N1

10. N2 xene vAle N1

11. N2 ke xvArA N1

12. N2 ke prapwI N1

13. N2 se prAwa N1

14. N2 ke liye N1

15. N2 ke kArana N1

If N2 ends in a, chop off the last letter and insert the stem into the following templates

1. N2-A N1

2. N2-AnA N1

3. N2-ArA N1

4. N2-I N1

5. N2-axAra N1

6. N2-Ila N1

7. N2-aka N1

8. N2-Iya N1

9. N2-iwa N1

10. N2-aNiya N1

11. N2-aTa N1

12. N2-ya N1

13. N2-e N1

14. N2-amaMxa N1

If N2 ends in A, chop off the last letter and insert the stem into the following templates

1. N2-Ala N1

2. N2-Ela N1

3. N2-Alu N1

4. N2-ila N1

5. N2-AbAja N1

If N2 ends in U, chop off the last letter and insert the stem into the following templates

1. N2-ua N1

2. N2-vI N1

If N2 ends in i, chop off the last letter and insert the stem into the following templates

1. N2-imAna N1

2. N2-ima N1

3. N2-imaya N1
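The template lists above can be applied mechanically; this sketch covers the basic post-position templates for the chapter's election campaign example (t1 and t2 are the Hindi translations of N1 and N2; the function name and the subset of templates shown are our own, and the suffix-based adjectival templates would be added conditioned on the ending of t2):

```python
def candidate_templates(t1, t2):
    """Generate Hindi candidate strings for an N1 N2 compound.
    t1, t2: Hindi translations of N1 and N2 respectively."""
    candidates = [f"{t1} {t2}"]                    # template 1: N1 N2
    for pp in ["kA", "kI", "ke", "ko", "meM", "pe", "par",
               "xene vAlA", "xene vAle", "ke xvArA",
               "ke liye", "ke kArana"]:            # post-position templates (subset)
        candidates.append(f"{t2} {pp} {t1}")       # template: N2 <pp> N1
    return candidates

# election -> cUnAva (t1), campaign -> abhiyAna (t2)
cands = candidate_templates("cUnAva", "abhiyAna")
```

Each candidate string would then be searched for in the indexed Hindi corpus, as described in the second approach.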

Related Publications

Prashant Mathur and Soma Paul. 2009. Automatic Translation of Nominal Compounds from English
to Hindi. In Proceedings of the International Conference on Natural Language Processing (ICON), Hyderabad.
Soma Paul, Prashant Mathur and Sushant Kishore. 2010. Syntactic Construct : An Aid for
Translating English Nominal Compounds into Hindi. In Proceedings of the NAACL HLT Workshop on
Extracting and Using Constructions in Computational Linguistics. Los Angeles, California, pp. 32-38.
Prashant Mathur. 2011. A Hybrid System for Machine Translation Using Moses and Noun Compound
Translator and Its Evaluation. Journal on Natural Language Engineering, special issue on semantics of
Noun Compounds. Under submission.

Bibliography

[Bharati et. al. 1994] Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. 1994. NLP: A Paninian
Perspective. Prentice Hall, New Delhi, 1994.

[Kilgariff & Grefenstette 2003] A. Kilgariff, G Grefenstette 2003. Introduction to the special issue on
the web as Corpus. Computational Linguistics 29, (3). 333-348

[Lampert 2004] Andrew Lampert 2004. Interlingua in Machine Translation.

[Brown et. al. 1993] Brown, Peter F., Pietra, Vincent J. Della, Pietra, Stephen A. Della, and
Mercer, Robert L. 1993. The mathematics of statistical machine translation: parameter estimation.
Computational Linguistics, Special issue on using large corpora, Volume 19, Issue 2, June
1993.

[Brown 1996] Brown, Ralf D. 1996. Example-Based Machine Translation in the Pangloss System..
Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING-96)

[McCarthy et. al.2007] Diana McCarthy, Rob Koeling, Julie Weeds and John Carroll. 2007. Unsupervised
acquisition of predominant word senses. Computational Linguistics, 33(4):553-590.

[Seaghdha 2008] Diarmuid O Seaghdha 2008. Learning Noun Compounds semantics. PhD Thesis,
Computer Laboratory, University of Cambridge. Technical Report 735.

[Downing 1977] Downing, Pamela. 1977. On the creation and use of English compound nouns. Language,
vol. 53, pp. 810-842, 1977.

[Finin 1980] Finin, T.W. 1980. The semantic interpretation of nominal compounds. In Proc. of the 1st
Conference on Artificial Intelligence (AAAI-80), 1980.

[Gawronska et. al. 1994] Gawronska, B., Nordner, A., Johansson, C. and Willners, C. 1994. Interpreting
compounds for machine translation. In Proceedings of COLING-1994, Kyoto, Japan.

[Levi 1978] Judith N. Levi. 1978. The Syntax and Semantics of Complex Nominals. Academic Press,
New York.

[Tsuji & Fujita 1991] Jun-ichi Tsujii and Kimikazu Fujita. 1991. Lexical Transfer based on bilingual
signs: Towards interaction during transfer. In Proc. of the 5th European ACL Conference, pp. 275-280.

[Girju et. al. 2003] Girju, R., Badulescu, A., and Moldovan, D.. 2003. Learning Semantic Constraints
for the Automatic Discovery of Part-Whole Relations.. In the proceedings of the Human Language
Technology Conference (HLT).

[Schmid 1994] Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In
International Conference on New Methods in Language Processing. Manchester, UK.

[Kneser & Ney 1995] Kneser, Reinhard, and Hermann Ney 1995. Improved backing-off for n-gram
language model. In Proceedings of the IEEE International Conference on Acoustics, speech, and
Signal Processing (ICASSP), vol. 1, pp. 181-182. Detroit, MI

[Lapata & Keller 2004] Lapata, M. and Keller, F. 2004. The Web as a baseline: evaluating the performance
of unsupervised Web based models for a range of NLP tasks. In: Proceedings of the Human
Language Technology conference (HLT/NAACL), Boston, MA, pp. 121-128.

[Bungum & Oepen 2009] Lars Bungum and Stephan Oepen 2009. Automatic Translation of Norwegian
Noun Compounds. Proceedings of the 13th Annual Meeting of the European Association for Machine
Translation (EAMT-09)

[Lou Burnard 2000] Lou Burnard. 2000. User Reference Guide for the British National Corpus, Technical
Report. Oxford University Computing Services.

[Denkowski & Lavie 2010] M. Denkowski and A. Lavie. 2010. Choosing the Right Evaluation for
Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human
Judgment Tasks. Proceedings of AMTA, 2010.

[Federico et al. 2008] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of INTERSPEECH 2008, pp. 1618-1621.

[Lauer 1995] M. Lauer. 1995. Designing statistical language learners: experiments on noun compounds. Ph.D. Thesis, Macquarie University, Australia.

[Nakov & Hearst 2005] Nakov, P., and Hearst, M. 2005. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution. In Proceedings of HLT/EMNLP-05, Vancouver, 2005.

[Nakov & Hearst 2006] Nakov, P., and Hearst, M. 2006. Using Verbs to Characterize Noun-Noun Relations. In J. Euzenat and J. Domingue (Eds.): AIMSA 2006, LNAI 4183, pp. 233-244.

[Nastase & Szpakowicz 2003] Nastase, Vivi, and Stan Szpakowicz. 2003. Exploring noun-modifier semantic relations. In Fifth International Workshop on Computational Semantics (IWCS-5), pp. 285-301.

[Och 2002] Och, Franz J. 2002. Statistical Machine Translation: From Single Word Models to Alignment Templates. PhD Thesis, RWTH Aachen, Germany.

[Och 2003] Och, Franz J. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), Sapporo, Japan, pp. 160-167.

[Koehn et al. 2007] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, Demonstration Session, Prague, Czech Republic.

[Koehn et al. 2003] P. Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of NAACL 2003.

[Papineni et al. 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), pp. 311-318.

[Mathur & Paul 2009] Prashant Mathur, Soma Paul. 2009. Automatic Translation of Nominal Compounds from English to Hindi. In Proceedings of the International Conference on Natural Language Processing (ICON), Hyderabad.

[Paul et al. 2010] Paul, Soma, Mathur, Prashant and Kishore, Sushant. 2010. Syntactic Construct: An Aid for translating English Nominal Compound into Hindi. In Proceedings of the NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics. Los Angeles, California, pp. 32-38.

[Resnik 1993] Resnik, Philip. 1993. Selection and Information: A Class-Based Approach to Lexical Relationships. PhD dissertation, University of Pennsylvania, Philadelphia, PA.

[Udupa & Farooquie 2004] Raghavendra Udupa U., Tanveer A. Farooquie. 2004. An English-Hindi Statistical Machine Translation System. In Proceedings of IJCNLP 2004.

[Patwardhan et al. 2005] Siddharth Patwardhan, Satanjeev Banerjee and Ted Pedersen. 2005. SenseRelate::TargetWord: A Generalized Framework for Word Sense Disambiguation. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, Ann Arbor, MI.

[Kim & Baldwin 2006] Su Nam Kim, Timothy Baldwin. 2006. Interpreting Semantic Relations in Noun Compounds via Verb Semantics. In Proceedings of ACL/COLING-2006, pp. 491-498.

[Kim & Baldwin 2005] Su Nam Kim, Timothy Baldwin. 2005. Automatic Interpretation of Noun Compounds using WordNet::Similarity. In Proceedings of IJCNLP, pp. 945-956.

[Tanaka & Baldwin 2003] Takaaki Tanaka, Timothy Baldwin. 2003. Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, pp. 17-24.

[Tanaka & Baldwin 2003b] Takaaki Tanaka, Timothy Baldwin. 2003. Translation selection for Japanese-English noun-noun compounds. In Proceedings of the Ninth Machine Translation Summit (MT Summit IX), pp. 89-96, New Orleans, USA.

[Baldwin & Tanaka 2004] Timothy Baldwin, Takaaki Tanaka. 2004. Translation by Machine of Complex Nominals: Getting it Right. In Proceedings of the ACL-04 Workshop on Multiword Expressions: Integrating Processing, Barcelona, Spain.

[Singh & Bandopadhyay 2010] Thoudam Doren Singh, Sivaji Bandopadhyay. 2010. Statistical Machine Translation of English-Manipuri using Morpho-syntactic and Semantic Information. In the Student Research Workshop of AMTA 2010, Denver, Colorado.

[Rackow et al. 1992] Ulrike Rackow, Ido Dagan, Ulrike Schwall. 1992. Automatic Translation of Noun Compounds. In Proceedings of COLING, pp. 1249-1253.

[Muegge 2006] Uwe Muegge. 2006. An Excellent Application for Crummy Machine Translation: Automatic Translation of a Large Database. In Elisabeth Gräfe (ed.), Proceedings of the Annual Conference of the German Society of Technical Communicators, Stuttgart: tekom, pp. 18-21.

[Vanderwende 1995] Vanderwende, L. 1995. The analysis of noun sequences using semantic information extracted from on-line dictionaries. Ph.D. Dissertation, Georgetown University.

[Wu & Palmer 1994] Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138, New Mexico State University, Las Cruces, New Mexico.

[Maalej 1994] Zouhair Maalej. 1994. English-Arabic Machine Translation of Nominal Compounds. In Proceedings of the Workshop on Compound Nouns: Multilingual Aspects of Nominal Composition. Geneva: ISSCO, pp. 135-146.
