
Mach Translat (2012) 26:47–65
DOI 10.1007/s10590-011-9102-0

A comparison of segmentation methods and extended lexicon models for Arabic statistical machine translation
Saša Hasan · Saab Mansour · Hermann Ney

Received: 30 June 2010 / Accepted: 4 August 2011 / Published online: 22 September 2011
© Springer Science+Business Media B.V. 2011

Abstract In this article, we investigate different methodologies of Arabic segmentation for statistical machine translation by comparing a rule-based segmenter to different statistically based segmenters. We also present a method for segmentation that serves the needs of a real-time translation system without impairing translation accuracy. In addition, we report on extended lexicon models based on triplets that incorporate sentence-level context during the decoding process. Results are presented on different translation tasks and show improvements in both BLEU and TER scores.

Keywords Statistical machine translation · Segmentation · Extended lexicon models

1 Introduction

Data-driven methods have been applied very successfully within the machine translation (MT) domain since the early 1990s. Starting from single-word-based translation approaches, significant improvements have been made through advances in modeling, the availability of larger corpora and more powerful computers. Thus, the substantial progress made in the past enables today's MT systems to achieve acceptable results in terms of translation quality for specific language pairs such as Arabic–English. If sufficient

S. Hasan (✉) · S. Mansour · H. Ney
Human Language Technology and Pattern Recognition Group, Lehrstuhl für Informatik 6, RWTH Aachen University, 52062 Aachen, Germany
e-mail: hasan@cs.rwth-aachen.de
S. Mansour, e-mail: mansour@cs.rwth-aachen.de
H. Ney, e-mail: ney@cs.rwth-aachen.de

Fig. 1 Arabic segmentation example: Arabic words are accompanied by the Buckwalter transliteration and a possible alignment to the words on the English side

amounts of parallel data are available, statistical MT (SMT) systems can be trained on millions of sentence pairs and use an extended level of context based on bilingual groups of words, which denote the building blocks of state-of-the-art phrase-based SMT systems. Often, a requirement for these systems is the capability to deal with multiple genres (e.g., newswire texts, web texts such as news groups or blogs, or broadcast conversations) in real-time, i.e., without a complex setup for preprocessing and translation that would need minutes or even hours for a single document.

One of the major problems of statistical models is data sparseness, which consequently forces researchers to develop statistical models that are trained on local or limited contexts. Language models are derived from n-grams with n ≤ 5, and bilingual phrase pairs are extracted with lengths of up to 10 words on the target side. This captures the local dependencies in the data and is responsible for the success of data-driven phrase-based approaches.

In order to lessen the data sparseness problem for the task of Arabic SMT, we apply the well-studied method of segmentation as a preprocessing step (Sadat and Habash 2006; Habash and Sadat 2006; El Isbihani et al. 2006; Hasan et al. 2006). A word in Arabic may be composed of prefixes, a stem and suffixes, which are expressed as stand-alone words in many languages. Such clitics include prepositions and subject, object and possessive pronouns. In addition to reducing the data sparseness problem, segmentation minimizes the differences between Arabic and the target language and yields a smaller vocabulary and fewer out-of-vocabulary (OOV) words. An example of Arabic segmentation is shown in Fig. 1, where the Arabic words are depicted with the corresponding Buckwalter transliteration (Habash et al. 2007). One observation from this figure is that, using segmentation, a better one-to-one correspondence between English and Arabic is achieved.

In this work, we compare the performance of several segmenters on several SMT tasks. We also introduce a new segmentation method that answers the needs of a real-time translation system without impairing translation quality.

To include more context in the SMT framework, we investigate an extended lexicon model based on lexicalized triplets (f, e, e′) and (e, f, f′) (Hasan et al. 2008), which we will also refer to as triggers of the form (e, e′) → f and (f, f′) → e, respectively. This denotes that two words in one language trigger one word in the other language. The triplets in standard direction, modeled by p(f|e, e′), are closely related to lexical translation probabilities based on the IBM model 1, i.e., p(f|e) (Brown et al. 1993). Several constraints and setups will be described later on in more detail, but as an introduction one can think of the following interpretation, which is depicted in Fig. 2: using a phrase-based MT approach, a source word f is triggered by its translation e which is
part of the phrase being considered, whereas another target word e′ outside this phrase serves as an additional trigger, allowing for a more fine-grained distinction of a specific word sense. Thus, this crosslingual trigger model can be seen as a combination of a lexicon model (i.e. e → f) and a model similar to monolingual long-range triggers (i.e. a distant bigram e′ → e) which uses both local and global information for the scoring. The motivation behind this approach is to bring non-local information from outside the current context (i.e., the currently considered bilingual phrase pair) into the translation process. The triplets are trained via the Expectation-Maximization (EM) algorithm, as will be shown in Sect. 4.1.

Fig. 2 Triplet example: a source word f is triggered by two target words e and e′, where one of the words is within and the other outside the considered phrase pair (indicated by the dashed line)

This article is organized as follows. Related work on Arabic segmentation and long-range dependency modeling is presented in Sect. 2. In Sect. 3, we present different segmentation methods for Arabic. Section 4 gives a detailed introduction to the extended lexicon model, including EM training and variations of the model. The different settings will be evaluated in Sect. 5, where we show experiments on various tasks having Arabic as the source language. A discussion of the results and further examples, including final remarks, are given in Sect. 6.

2 Related work

Arabic segmentation for the task of SMT has already been successfully applied in previous work. Lee (2004) uses a language model to select among possible segmentations for translating Arabic into English. She reports improvements for small-scale tasks but no improvements for tasks with larger vocabulary. Sadat and Habash (2006) apply the MADA tool for Arabic–English MT. MADA (Habash and Rambow 2005) selects among Buckwalter Arabic Morphological Analyzer (BAMA) analyses using a combination of Support Vector Machine (SVM) classifiers. Their work is mainly focused on comparing different segmentation schemes. Habash and Sadat (2006) compare MADA to a greedy regular-expression-based segmenter (REGEX) and a BAMA-based (select first analysis) segmenter across different segmentation schemes. In general, they report better translation results when using MADA over REGEX and BAMA. El Isbihani et al. (2006) develop a Finite State Transducer (FST) based segmenter and apply it to Arabic–English SMT and later on to Arabic–French SMT (cf. Hasan et al. 2006). The method is also compared to an SVM-based segmenter presented by Diab et al. (2004) and shows improved results for small tasks but, again, no or only little improvement for large tasks. Nguyen and Vogel (2008) apply a Conditional Random Fields (CRF) segmentation method (as presented in Smith et al. 2005) for Arabic-to-English
translation. They show that a reduced morpheme segmentation, i.e. applying a statistically trained model for morpheme deletion, outperforms a full morpheme segmentation.

In this work, we present a new segmentation method that is fast enough to be used in a real-time translation system without impairing the translation accuracy. Furthermore, the method shows results comparable to MADA on different translation tasks.

In the past, a significant number of methods have been presented that try to capture long-distance dependencies, i.e., use dependencies in the data that reach beyond the local context of n-grams or bilingual phrase pairs. In language modeling, monolingual trigger approaches have been presented (Rosenfeld 1996; Tillmann and Ney 1997), as well as syntactic methods that parse the input and model long-range dependencies on the syntactic level by conditioning on the preceding words and their corresponding parent nodes (Chelba and Jelinek 2000; Roark 2001). One drawback is that the parsing process might slow down the system significantly, and it is complicated to integrate the approach directly into the search process. Thus, the effect is often shown offline in reranking experiments using n-best lists.

One of the simplest models that can be seen in the context of lexical triggers is the IBM model 1 (Brown et al. 1993), which captures lexical dependencies between source and target words. It can be seen as a lexicon containing correspondences between translations of source and target words at the full sentence level. The model presented in this work is an extension of the initial IBM model 1 and simply takes another word into the conditioning part, i.e., the triggering items. Thus, instead of p(f|e) we model p(f|e, e′) with different additional constraints, as explained later on. Since the second trigger can come from any part of the sentence, we also have a link to long-range monolingual triggers as mentioned above.

A long-range trigram model is presented in Della Pietra et al. (1994), where it is shown how to derive a probabilistic link grammar in order to capture long-range dependencies in English using the EM algorithm. EM is used for training the presented triplet model as well (cf. Sect. 4). Instead of deriving a grammar based on part-of-speech (POS) tags, we rely on a fully lexicalized approach, i.e., the training takes place at the word level. Related work in the context of fine-tuning language models by using crosslingual lexical triggers is presented in Kim and Khudanpur (2003). The authors show how to use crosslingual triggers on a document level in order to extract translation lexicons and domain-specific language models using a mutual information criterion.

Recently, word-sense disambiguation (WSD) methods have been shown to improve translation quality (Chan et al. 2007; Carpuat and Wu 2007). Chan et al. (2007) use an SVM-based classifier for disambiguating word senses which is directly incorporated into the decoder through additional features that are part of the log-linear combination of models. They use local collocations based on surrounding words to the left and right of an ambiguous word, including the corresponding parts of speech. In Carpuat and Wu (2007), another state-of-the-art WSD engine (a combination of naive Bayes, maximum entropy, boosting and Kernel PCA models) is used to dynamically determine the score of a phrase pair under consideration and, thus, let the phrase selection adapt to the context of the sentence.
The work in this article aims to complement these approaches by modeling long-range dependencies. We argue that triggers at the sentence level help to select
context-specific words that improve overall translation quality. The effect is similar to the one mentioned in the WSD approaches. We can incorporate sentence-level information for a more fine-grained lexical choice of the words considered during the translation process. Note, though, that the triplet lexicon model does not provide a concrete word sense, as opposed to WSD approaches. Furthermore, a distant second trigger might have a beneficial effect for specific languages, e.g., by capturing word splits (as is the case in German for verbs with separable prefixes) or modeling general effects introduced by the segmentation process, as is the case for Arabic (cf. also Fig. 1).

3 Arabic segmentation

In the context of MT, written modern standard Arabic (henceforth Arabic) is known for its complex morphology and ambiguous writing system (Habash 2007). Next, we discuss some of the difficulties Arabic poses:

– High rate of inflection, causing a high percentage of OOV words. In addition to inflections expressing grammatical categories also found in English (e.g. gender, number, ...), Arabic inflection includes the generation of words using the root-pattern constructor and the attachment of clitics (to a stem) which appear as stand-alone words in many other languages. An example is given in Fig. 3. The first sentence in this figure is a hypothesis generated by our translation system without Arabic segmentation. The second hypothesis is generated by a system which includes Arabic segmentation, causing one OOV word to be resolved.

– High ambiguity due to the lack of vowels in written Arabic. The increased ambiguity shows up in the increased number of possible translations per word, but also in the possible segmentations of the word, which eventually affects the corresponding translations. An example is given in Fig. 3.

– One word in Arabic often corresponds to more than one word in traditional target languages such as English and French, posing a problem for the alignment models. An example is given in Fig. 1, where it can be seen that some Arabic words are aligned to more than one word in English. This causes a problem for the traditional alignment models which are used in most state-of-the-art SMT systems.

Fig. 3 Examples of Arabic difficulties for SMT

A well-studied solution to the problems mentioned above is Arabic word segmentation. Splitting an Arabic word into its corresponding prefixes, stem and suffixes decreases the number of OOV words, resolves some of the ambiguous Arabic words and generates more one-to-one correspondences between the Arabic and the target language side, which facilitates training of the word alignment models.

As mentioned in Sect. 2, previous work has been carried out on Arabic segmentation for SMT. The FST tool presented by El Isbihani et al. (2006) inherently suffers from ambiguous words, which are not segmented in this approach. Another well-known segmentation tool for Arabic is MADA. Habash and Sadat (2006) perform a comparison between different segmentation schemes using MADA and other techniques. Their results show superior MT quality for the MADA tool in most of the cases. One small deficiency of the MADA tool is the slow speed of the segmentation process. MADA applies several SVM classifiers to classify different grammatical categories of the
words and then combines those classifications to infer a full morphological disambiguation. (Non-linear) SVM classification has a time complexity of O(N · |SV|), where N is the number of words in the text being segmented and |SV| is the number of support vectors generated in the training phase. |SV| is upper-bounded by the number of training examples; in the case of MADA this is on the order of $10^5$, as it is trained on the Arabic Treebank (Maamouri et al. 2004).

In this work, we present a Hidden-Markov-Model (HMM) segmenter for Arabic. The motivation behind the development of this tool is the need for a segmenter which achieves comparable accuracy to MADA but retains a speed level similar to the FST segmenter, which is acceptable for real-time translation systems.

3.1 MorphTagger: HMM-based segmenter

MorphTagger is a general architecture for POS tagging of natural languages. The architecture was first proposed in Bar-Haim et al. (2005) and applied to the task of POS tagging of Hebrew. Mansour et al. (2007) adapted the architecture to the Arabic language. In this work, MorphTagger was adapted to the SMT task by adding a segmenter level and normalization rules that are appropriate for translation, including speed enhancements of the software. The architecture is similar to Habash and Rambow (2005), where a specific analysis is selected from the output of the morphological analyzer. It is visualized in Fig. 4.

Fig. 4 MorphTagger segmenter architecture

First, the Arabic input sentence goes through the analyzer, which outputs all possible analyses for each word. Each analysis consists of a sequence of pairs of a segment and the corresponding POS tag. The disambiguator component in MorphTagger is realized by an HMM-based model and is trained on the segment level using Maximum Likelihood Estimation (MLE). The disambiguator outputs the most probable tagging sequence according to the model. Then, we infer the corresponding segments from the tagging sequence. Since the process of matching the corresponding segments is ambiguous, we simply use the heuristic of choosing the most probable morpheme given the tag. We believe that this heuristic is sufficient for the problem at hand, as the ambiguity mainly occurs for variations of Arabic letters such as Alif, and is rarely observed for segmentation boundaries. The variation of the letters could be modeled by a more fine-grained tag set, but this modeling is not used in this work. The Segmenter component is then responsible for the choice of which morphemes should be split. This component is realized by rules which are
selected manually. The segmenter also applies several normalization steps which are helpful for SMT. Due to these settings, we achieve the following three advantages:

– comparable tagging/segmentation accuracy to the state of the art,
– training and tagging are fast (linear in corpus size),
– appropriate for real-time systems.

To implement MorphTagger for Arabic, we use the Buckwalter Arabic Morphological Analyzer v1.0 (Buckwalter 2002), a rule-based analyzer with 80,000 lexicon entries. The POS model is a standard Markov Model tagger trained on the Arabic Treebank Part 1 v3.0 (Maamouri et al. 2004) (150,000 tokens). We estimate the probabilities of the model for segments and not words, because this achieves better POS tagging and segmentation accuracies, as reported in Bar-Haim et al. (2005). The disambiguator is implemented by wrapping around the SRILM toolkit (Stolcke 2002). The Segmenter component splits prepositions (excluding the Arabic determiner) and possessive and objective pronouns (this is the so-called ATB scheme originally used in the Arabic TreeBank). As mentioned before, the segmenter also performs a few normalization steps, most noticeably undoing some rewriting rules when attachment is involved. Reverted characters include: (i) Alif maksura: reverted to the original form when a preposition ending with Alif maksura is split from a suffix (yX → Y + X); (ii) feminine marker: reverted to the original form when a noun is split from a suffix (tX → p + X); and (iii) Arabic determiner: restored when preceded by a preposition (llX → l + Al + X).

To check the appropriateness of the different segmenters for real-time SMT application, we perform a speed comparison. The speed is measured in words per second ([w/s]). We use MADA v2.0 with the ATB scheme (similarly to MorphTagger) to perform our experiments. A newer version of MADA (v3.1) is currently available. According to the MADA website,¹ the speed measured by the authors is around 100 w/s. In our measurements, MADA achieved a speed of 70 w/s, while MorphTagger achieved 1,500 w/s and the FST method 4,500 w/s. From this data, we conclude that the MADA tool cannot be applied in a real-time manner. For example, our real-time Arabic–French SMT system (presented in Sect. 5) is running at a speed of 100 w/s, making the MADA segmenter slower than the translation system and not appropriate for such applications.

¹ http://www1.ccls.columbia.edu/MADA/MADA_FAQ.html

In this work, we do not perform a comparison of segmentation error rates. First, it is difficult to perform, as the FST method uses a different segmentation scheme than MorphTagger and MADA. Second, the authors of MADA and MorphTagger have already reported comparable accuracies, both evaluated and trained using the Arabic Treebank Part 1: MADA achieves 99.1% segment F-measure accuracy, whereas MorphTagger achieves 98.9% (with a standard deviation of 0.28 when using 10-fold cross-validation). Third, Chang et al. (2008) reported that for Chinese, optimizing segmentation accuracy does not always lead to better MT performance; other factors, e.g. segmentation consistency, might be more important for MT. We did not verify those claims for Arabic in this work.
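To make the disambiguation step of Sect. 3.1 concrete, the following is a minimal sketch of HMM-based analysis selection. It is our own illustration rather than MorphTagger's actual implementation: the data layout, the probability floor and the example tags are assumptions.

```python
import math

# Hypothetical HMM parameters; in MorphTagger they are estimated on the
# segment level via MLE from the Arabic Treebank. The 1e-8 floor stands
# in for proper smoothing.
trans = {}  # trans[(prev_tag, tag)] = p(tag | prev_tag)
emit = {}   # emit[(segment, tag)]  = p(segment | tag)

def viterbi_segment(word_analyses):
    """word_analyses: one entry per input word; each entry is the list of
    analyses proposed by the morphological analyzer (BAMA), where an
    analysis is a sequence of (segment, POS tag) pairs.
    Returns the most probable sequence of analyses under the HMM."""
    beams = {"<s>": (0.0, [])}  # last tag -> (log prob, analyses so far)
    for analyses in word_analyses:
        new_beams = {}
        for last_tag, (logp, history) in beams.items():
            for analysis in analyses:
                score, prev = logp, last_tag
                for segment, tag in analysis:  # score the word's segments
                    score += math.log(trans.get((prev, tag), 1e-8))
                    score += math.log(emit.get((segment, tag), 1e-8))
                    prev = tag
                best = new_beams.get(prev)
                if best is None or score > best[0]:
                    new_beams[prev] = (score, history + [analysis])
        beams = new_beams
    return max(beams.values(), key=lambda b: b[0])[1]

# Hypothetical usage: two candidate analyses for one word.
# analyses = [[("w", "CONJ"), ("fY", "PREP")], [("wfY", "VERB")]]
# best = viterbi_segment([analyses])
```

The rule-based Segmenter and the normalization steps then operate on the segments of the selected analyses.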
4 Triplet lexicon models

As an extension to the commonly used lexical word pair probabilities p(f|e) as introduced in Brown et al. (1993), we define our model to operate on word triplets. A triplet (f, e, e′) is assigned a value α(f|e, e′) ≥ 0 with the normalization constraint that for all e, e′:

$$\sum_f \alpha(f \mid e, e') = 1.$$

Throughout this article, e and e′ will be referred to as the first and the second trigger, respectively. In view of its triggers, f will be termed the effect. For a given bilingual sentence pair $(f_1^J, e_1^I)$, the probability of a source word $f_j$ given the whole target sentence $e_1^I$ is defined for the triplet model as

$$p_{all}(f_j \mid e_1^I) = \frac{1}{Z} \sum_{i=1}^{I} \sum_{k=i+1}^{I} \alpha(f_j \mid e_i, e_k), \qquad (1)$$

where Z denotes a normalization factor based on the target sentence length, i.e.,

$$Z = \frac{I \cdot (I-1)}{2}. \qquad (2)$$
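As a small worked illustration of Eqs. 1 and 2 (a toy case of ours, not from the original), consider a target sentence of length I = 4:

```latex
% For I = 4, Eq. 2 gives Z = 4*3/2 = 6, and Eq. 1 averages over the
% six trigger pairs (i,k) with i < k:
% (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
\[
p_{\mathrm{all}}(f_j \mid e_1^4) = \tfrac{1}{6} \bigl(
    \alpha(f_j \mid e_1, e_2) + \alpha(f_j \mid e_1, e_3)
  + \alpha(f_j \mid e_1, e_4) + \alpha(f_j \mid e_2, e_3)
  + \alpha(f_j \mid e_2, e_4) + \alpha(f_j \mid e_3, e_4) \bigr).
\]
```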

The introduction of a second trigger (i.e. $e_k$ in Eq. 1) enables the model to combine local (i.e. word or phrase level) and global (i.e. sentence level) information. In the following, we will describe the training procedure of the model via MLE for the unconstrained case.

4.1 Training

The goal of the training procedure is to maximize the log-likelihood $F_{all}$ of the triplet model for a given bilingual training corpus $\{(f_1^{J_n}, e_1^{I_n})\}_{n=1}^{N}$ consisting of N sentence pairs:

$$F_{all} := \sum_{n=1}^{N} \sum_{j=1}^{J_n} \log p_{all}(f_j \mid e_1^{I_n}),$$

where $J_n$ and $I_n$ are the lengths of the nth source and target sentences, respectively. As there is no closed-form solution for the maximum likelihood estimate, we resort to iterative training via the EM algorithm (Dempster et al. 1977). We define the auxiliary function $Q(\mu; \bar\mu)$ based on $F_{all}$, where $\bar\mu$ is the new estimate within an iteration which is to be derived from the current estimate $\mu$. Here, $\mu$ stands for the entire set of model parameters to be estimated, i.e., the set of all $\{\alpha(f|e,e')\}$. Thus, we obtain

$$Q\bigl(\{\alpha(f|e,e')\}; \{\bar\alpha(f|e,e')\}\bigr) = \sum_{n=1}^{N} \sum_{j=1}^{J_n} \sum_{i=1}^{I_n} \sum_{k=i+1}^{I_n} \frac{Z_n^{-1}\,\alpha(f_j \mid e_i, e_k)}{p_{all}(f_j \mid e_1^{I_n})}\, \log\Bigl(Z_n^{-1}\,\bar\alpha(f_j \mid e_i, e_k)\Bigr),$$

where $Z_n$ is defined as in Eq. 2. Using the method of Lagrangian multipliers for the normalization constraint, we take the derivative with respect to $\bar\alpha(f|e,e')$ and obtain

$$\bar\alpha(f|e,e') = \frac{A(f,e,e')}{\sum_{\tilde f} A(\tilde f,e,e')} \qquad (3)$$

where $A(f,e,e')$ is a relative weight accumulator over the parallel corpus:

$$A(f,e,e') = \sum_{n=1}^{N} \sum_{j=1}^{J_n} \delta(f, f_j)\, \frac{Z_n^{-1}\,\alpha(f|e,e')}{p_{all}(f_j \mid e_1^{I_n})}\, C_n(e,e') \qquad (4)$$

and

$$C_n(e,e') = \sum_{i=1}^{I_n} \sum_{k=i+1}^{I_n} \delta(e, e_i)\,\delta(e', e_k).$$

The function $\delta(\cdot,\cdot)$ denotes the Kronecker delta. The resulting training procedure is analogous to the one presented in Brown et al. (1993) and Tillmann and Ney (1997).
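The accumulation of Eqs. 3 and 4 translates directly into code. Below is a minimal sketch of one EM iteration for the unconstrained model under our own simplifying assumptions: plain dictionaries, no empty word, no trimming, and α initialized beforehand (e.g. uniformly over all triplets cooccurring in some sentence pair).

```python
from collections import defaultdict

def em_iteration(corpus, alpha):
    """One EM iteration for the unconstrained triplet model (Eqs. 1-4).
    corpus: list of (source_words, target_words) sentence pairs.
    alpha:  dict mapping a triplet (f, e, e2) to its probability.
    Returns the re-estimated parameters."""
    acc = defaultdict(float)  # accumulator A(f, e, e'), Eq. 4
    for f_sent, e_sent in corpus:
        I = len(e_sent)
        if I < 2:
            continue
        Z = I * (I - 1) / 2.0  # Eq. 2
        pairs = [(e_sent[i], e_sent[k])
                 for i in range(I) for k in range(i + 1, I)]
        for f in f_sent:
            # p_all(f | e_1^I) as in Eq. 1
            p_all = sum(alpha.get((f, e, e2), 0.0) for e, e2 in pairs) / Z
            if p_all <= 0.0:
                continue
            for e, e2 in pairs:  # E-step: distribute fractional counts
                a = alpha.get((f, e, e2), 0.0)
                if a > 0.0:
                    acc[(f, e, e2)] += (a / Z) / p_all
    # M-step: renormalize over f for every trigger pair (e, e'), Eq. 3
    denom = defaultdict(float)
    for (f, e, e2), count in acc.items():
        denom[(e, e2)] += count
    return {(f, e, e2): count / denom[(e, e2)]
            for (f, e, e2), count in acc.items()}
```

Adding the same fractional count for every occurrence of a trigger pair reproduces the factor $C_n(e,e')$ of Eq. 4.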
4.2 Model variation

Path-aligned triplets use an alignment constraint taken from the word alignments trained with GIZA++. Here, the first trigger pair (f, e) is restricted to the alignment path obtained in the alignment matrix produced by IBM model 4, whereas the second trigger can move across the whole sentence. This requires information in addition to the bilingual sentence pair $(f_1^J, e_1^I)$, namely a corresponding word alignment matrix $A = \{a_{ij}\}$ where

$$a_{ij} = \begin{cases} 1 & \text{if } e_i \text{ is aligned to } f_j, \\ 0 & \text{otherwise.} \end{cases}$$

These restrictions were introduced in order to reduce the overall number of triplets during training and to facilitate the application of these models during the decoding process. For a training set, e.g. a subset of the NIST training data comprised of approximately 306K sentence pairs with 6.1M running words on the target language side, we observe roughly 1.46B triplets for the unconstrained use case. In order to train the first iterations, we need around 34 GB of RAM and 2.3 h per iteration on an AMD Opteron with 2.3 GHz. In order to further speed up the training procedure, we apply trimming with a threshold of $10^{-7}$, i.e., we discard low-probability triplets at the end of each iteration and renormalize the probabilities accordingly. Figure 5 gives an overview of the reduction of the overall number of triplets during training. It also shows the effect of the EM training, which moves probability mass to the important triplets, observed in the decrease of overall perplexity. Experience has shown that it is seldom useful to train beyond 10 EM iterations since convergence is fast.

Fig. 5 Training an unconstrained triplet lexicon: number of triplets and perplexity on the training corpus at each EM iteration

The path-aligned triplet model (denoted by $p_{align}$ in the following) restricts the scope of e to words aligned to f by A, resulting in

$$p_{align}(f_j \mid e_1^I, A) = \frac{1}{Z_j} \sum_{i=1}^{I} \sum_{k=1}^{I} a_{ij}\, \alpha(f_j \mid e_i, e_k) \qquad (5)$$

where the $Z_j$ denote corresponding normalization terms. This constraint cuts down the overall number of triplets significantly since it rules out a large number of first triggers due to the alignment path. For the above-mentioned training data, instead of 1.46B triplets we only have to consider 90.8M. Training times are reduced to 6.8 min per iteration. The final model is much more compact than the unconstrained variant, allowing it to be integrated easily into the decoder.
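In terms of the sketch above, the path-aligned constraint only changes which trigger pairs are enumerated for a source position j. A possible formulation, assuming the alignment is given as a set of (i, j) index pairs (our own representation; excluding k = i is a choice we make for the sketch):

```python
def path_aligned_pairs(e_sent, j, alignment):
    """Trigger pairs for source position j under the path-aligned
    constraint of Eq. 5: the first trigger e_i must be aligned to f_j,
    while the second trigger e_k may come from anywhere in the sentence.
    alignment: set of (i, j) pairs, e.g. derived from GIZA++/IBM model 4."""
    return [(e_sent[i], e_sent[k])
            for i in range(len(e_sent)) if (i, j) in alignment
            for k in range(len(e_sent)) if k != i]
```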
Note that, in order to account for non-aligned words (analogously to the IBM models), the empty word $e_0$ is considered in all presented model variations. Furthermore, we can train the models in the inverse direction, i.e. $p(e|f, f')$, and combine the two directions, e.g. in a rescoring framework or, to some extent, even directly in search, as is presented in the following.

4.3 Decoding

In search, we apply an inverse triplet model directly when scoring bilingual phrase pairs (Hasan and Ney 2009). Given a trained model for $p(e|f, f')$, we compute the feature score $h_{trip}(\cdot)$ of a phrase pair $(\tilde e, \tilde f)$ as

$$h_{trip}(\tilde e, \tilde f, f_0^J) = -\sum_{i} \log\Biggl(\frac{2}{J \cdot (J+1)} \sum_{j} \sum_{j' > j} p(\tilde e_i \mid f_j, f_{j'})\Biggr),$$

where i moves over all target words in the phrase $\tilde e$, the second sum selects all source sentence words $f_0^J$ including the empty word, and $j' > j$ incorporates the rest of the source sentence to the right of the first trigger. We take negative log-probabilities and normalize to obtain the final score (representing costs) for the given phrase pair. Note that in search we can use this direction of the model, i.e. $p(e|f, f')$, since the whole source sentence is available for triggering effects, whereas not all target words have been generated so far, as would be necessary for the standard direction, $p(f|e, e')$. For the standard direction, we can use an approximation that traverses the search graph backwards for the second trigger. Since this is a very time-consuming process, we limit the history, e.g., such that the two triggers can have a maximum distance of 10 words.

Note that decoding with the inverse model can be quite efficient if caching is applied. Since the given source sentence does not change, we have to calculate $p(e|f, f')$ for each e only once and can retrieve the probabilities from the cache for consecutive scorings of the same target word e. This significantly speeds up the decoding process.
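A small sketch of this caching (our own illustration; the p_inv callable, the sentinel cost for unseen target words, and the data layout are assumptions):

```python
import math

def make_trip_scorer(src_with_empty, p_inv):
    """src_with_empty: source sentence f_0^J, index 0 holding the empty word.
    p_inv(e, f, f2): inverse triplet model probability p(e | f, f').
    Returns a cached per-target-word cost function (negative log prob)."""
    J = len(src_with_empty) - 1
    norm = 2.0 / (J * (J + 1))  # one over the number of pairs j < j'
    cache = {}

    def cost(e):
        if e not in cache:
            total = sum(p_inv(e, src_with_empty[j], src_with_empty[j2])
                        for j in range(J + 1)
                        for j2 in range(j + 1, J + 1))
            cache[e] = -math.log(norm * total) if total > 0.0 else 1e9
        return cache[e]
    return cost

def h_trip(target_phrase, cost):
    """Feature score (cost) of a phrase pair: sum over its target words."""
    return sum(cost(e) for e in target_phrase)
```

Because cost(e) depends only on the source sentence, every target word is scored once per sentence regardless of how many phrase pairs contain it.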
5 Translation experiments

In this section, we evaluate the translation performance of the MorphTagger segmenter and the various triplet lexicon models. We compare the results of MorphTagger to MADA and the FST method, and the results of adding the triplet models to a baseline SMT system. The baseline system was built using a state-of-the-art phrase-based MT system (Zens and Ney 2008). We use the standard set of models with phrase translation probabilities for the source-to-target and target-to-source directions, smoothing with lexical weights, a word and phrase penalty, distance-based and lexicalized reordering, and an n-gram target language model. The experiments were carried out on two large-scale translation tasks: the NIST MT 2009 Arabic–English constrained data track (NIST 2009), and the QUAERO 2009 Arabic–French task (QUAERO 2008).²

Table 1 NIST AR-EN 2009: corpus statistics

|        |               | Arabic: TOK | FST    | MADA   | MorphTagger | English |
|--------|---------------|-------------|--------|--------|-------------|---------|
| Train  | Sentences     | 4.7M        |        |        |             |         |
|        | Running words | 128M        | 147M   | 150M   | 153M        | 196M    |
|        | Vocabulary    | 702K        | 404K   | 336K   | 362K        | 300K    |
| nist06 | Sentences     | 1,357       |        |        |             |         |
|        | Running words | 37,597      | 43,456 | 44,593 | 45,366      |         |
|        | OOVs (run.)   | 649         | 416    | 259    | 308         |         |
| nist08 | Sentences     | 1,797       |        |        |             |         |
|        | Running words | 39,934      | 46,245 | 47,334 | 48,943      |         |
|        | OOVs (run.)   | 868         | 589    | 463    | 427         |         |
The corpus statistics of the NIST and the QUAERO tasks are given in Tables 1 and 2, respectively. The tables include statistics of the training corpora and test sets, calculated over the various segmentation methods. We also include statistics of a simple tokenizer (TOK) for Arabic which splits on punctuation marks, for comparison with the other segmenters. Both training corpora mainly consist of United Nations (UN) data. In the NIST task, the UN data comprises 70% of the data, and in the QUAERO task, 97% of the data is UN data. The test sets are drawn from the newswire domain, making the translation task hard due to the domain differences between training and testing. For the QUAERO task, the development and test sets have one reference on the French side, while the CESTA_RUN2 (Hamon et al. 2007) test set has four references. The test sets of the NIST task have four references. We built a 4-gram language model for the NIST task composed of the English Gigaword Third Edition (LDC2007T07) and the target side of the training data. For the QUAERO task, we built 4- and 6-gram language models using the French Gigaword First Edition (LDC2006T17), French text available from the WMT 2009 translation task³ and the target side of the training data. For all experiments, we report both BLEU and TER results. The results are calculated over truecased output of the SMT system.
² Note that the data is available for the project partners only.
³ http://www.statmt.org/wmt09/translation-task.html
Table 2 QUAERO AR-FR 2009: corpus statistics

|            |               | Arabic: TOK | FST    | MADA   | MorphTagger | French |
|------------|---------------|-------------|--------|--------|-------------|--------|
| Train      | Sentences     | 7.6M        |        |        |             |        |
|            | Running words | 150M        | 170M   | 175M   | 178M        | 196M   |
|            | Vocabulary    | 638K        | 380K   | 422K   | 380K        | 300K   |
| Dev        | Sentences     | 824         |        |        |             |        |
|            | Running words | 19,329      | 22,019 | 22,524 | 22,895      |        |
|            | OOVs (run.)   | 118         | 224    | 44     | 56          |        |
| Test       | Sentences     | 2,202       |        |        |             |        |
|            | Running words | 49,617      | 56,065 | 57,235 | 57,535      |        |
|            | OOVs (run.)   | 318         | 296    | 180    | 191         |        |
| CESTA_RUN2 | Sentences     | 2,121       |        |        |             |        |
|            | Running words | 50,389      | 57,264 | 58,335 | 58,516      |        |
|            | OOVs (run.)   | 337         | 289    | 176    | 185         |        |

5.1 Segmentation results

To experiment with the effect of the different segmentation methods on the final MT quality, we used the full QUAERO 2009 Arabic–French data and a subset of the NIST
2009 Arabic–English data. We excluded the UN and the ISI data from the NIST corpora. This leaves 300K sentence pairs and 5M running words (about 6% of the whole data). This selection eases building the SMT systems, makes the training and testing genres consistent, and the loss in performance is small. The loss is small because the test sets are from the newswire genre, while the ISI data is noisy (automatically sentence-aligned) and the UN data is from the parliamentary speech genre rather than the newswire one.

As already discussed in Sect. 1, segmentation of Arabic text minimizes the differences between Arabic and the target language and yields a smaller vocabulary and fewer OOV words. We can observe these phenomena in the corpus statistics of NIST and QUAERO given in Tables 1 and 2, respectively. Segmentation increases the number of Arabic running words by 20% and reduces the vocabulary by 40%. We also observe a notable reduction in OOV words of up to 60% in both the NIST and the QUAERO tasks. One interesting point to notice about the OOV figures is that the FST segmenter performs worse than a simple tokenizer. The reason is that FST restricts segmented stems to those seen in the corpus, thereby preventing the segmentation of words that include unseen stems. This causes inconsistencies in the segmentations between the train and test sets. One could overcome this problem by concatenating the train and test sets and segmenting them together. However, in the available implementation of the FST segmenter, this caused a considerable increase in time and memory usage and was therefore skipped (we actually split the training data into chunks in order to be able to segment it in reasonable time).
Table 3 AR-EN NIST 2009 subset (excludes UN and ISI): translation results

| System      | nist06 (dev) BLEU | nist06 (dev) TER | nist08 BLEU | nist08 TER |
|-------------|-------------------|------------------|-------------|------------|
| TOK         | 39.4              | 54.1             | 36.5        | 54.3       |
| FST         | 40.4              | 52.3             | 37.5        | 53.2       |
| MorphTagger | 41.5              | 51.2             | 39.7        | 51.8       |
| MADA (ATB)  | 42.7              | 50.6             | 40.0        | 51.3       |

Table 4 AR-FR QUAERO 2009: translation results comparing different segmentation methods

| System            | Dev BLEU | Dev TER | Test BLEU | Test TER | CESTA_RUN2 BLEU | CESTA_RUN2 TER |
|-------------------|----------|---------|-----------|----------|-----------------|----------------|
| Real-time systems |          |         |           |          |                 |                |
| FST               | 15.5     | 74.9    | 15.4      | 74.8     | 45.7            | 53.4           |
| MADA              | 15.5     | 73.9    | 15.5      | 74.8     | 47.7            | 53.0           |
| MorphTagger       | 15.9     | 73.9    | 15.8      | 74.7     | 48.0            | 53.2           |
| Offline systems   |          |         |           |          |                 |                |
| TOK               | 15.4     | 74.5    | 15.0      | 75.7     | 44.0            | 54.4           |
| + Triplet ef/pa   | 15.7     | 74.6    | 15.3      | 75.3     | 45.3            | 53.6           |
| FST               | 15.7     | 74.1    | 15.6      | 74.7     | 47.0            | 52.6           |
| + Triplet ef/pa   | 16.6     | 73.2    | 16.3      | 74.2     | 47.6            | 52.1           |
| MADA              | 15.7     | 74.2    | 15.7      | 74.5     | 47.7            | 53.0           |
| + Triplet ef/pa   | 16.1     | 73.7    | 16.1      | 74.9     | 47.8            | 51.7           |
| MorphTagger       | 16.6     | 73.1    | 16.2      | 74.2     | 48.1            | 50.0           |
| + Triplet ef/pa   | 17.1     | 72.5    | 16.6      | 73.5     | 48.8            | 49.8           |

Triplet model ef/pa is an inverse model p(e|f, f′) using the path-aligned constraint for the first trigger

The results of the NIST subset and the QUAERO task are summarized in Tables 3 and 4. For QUAERO we additionally distinguish real-time and offline systems. Real-time systems use a monotone decoder and a smaller language model (4-gram instead of the 6-gram used in the offline systems). Offline systems include reordering and a bigger language model. In terms of speed, real-time systems translate more than 100 words per second, whereas the offline systems run at less than one word per second.

For all test sets, we notice that the methods with Arabic segmentation are better than a simple tokenizer, which is in concordance with previous work. We also observe that the statistically based segmenters, namely MorphTagger and MADA, perform better than the FST rule-based method. For the NIST task, MADA performs slightly better than MorphTagger, with a bigger gap in BLEU observed on the development set (nist06). For the QUAERO task, we observe that MorphTagger achieves modest improvements in comparison to MADA. The FST method performs much worse on CESTA_RUN2,
probably due to the OOV problem mentioned earlier. The tendency of the results is similar for both real-time and offline systems.

Looking at the translations and the segmented sources, we see that only a few differences are the result of different segmentations, especially between MADA and MorphTagger, as they use the same segmentation scheme. A more significant difference between the segmenters might be due to the different normalizations they apply. In MADA, in addition to the normalizations mentioned in Sect. 3.1, many irregular word writings are collapsed to one form. It is interesting to see that MorphTagger performed better for Arabic–French while MADA performed better for Arabic–English. We leave studying the effects of different segmentation methods and different normalizations on different target languages to future work.

Translation examples for both translation tasks are given in Table 5.

Table 5 Examples of better translations due to Arabic segmentation

In the first sentence, MADA wrongly splits a character from the word fwArq 'differences', while MorphTagger keeps the word intact. In the second example, MADA retains the Arabic word bAlA 'concern', while MorphTagger wrongly splits it into b and AlA 'that is not'. In the third example, MADA does not segment the word wfY 'and in', which can then also be read as 'acquitted', causing a wrong translation.
Table 6 Results of combining two triplet models in search for the NIST 2009 Arabic–English task

| NIST 2009      | nist06 (dev) BLEU [%] | nist06 (dev) TER [%] | nist08 BLEU [%] | nist08 TER [%] |
|----------------|-----------------------|----------------------|-----------------|----------------|
| Baseline       | 42.4                  | 50.5                 | 40.5            | 51.9           |
| Triplet fe/d10 | 42.7                  | 50.3                 | 41.0            | 51.2           |
| Triplet ef/pa  | 43.5                  | 49.6                 | 41.7            | 50.9           |
| Combined       | 43.7                  | 49.4                 | 42.0            | 50.7           |

Two variants are used: fe/d10 is a standard model using a maximum distance constraint of 10 words, i.e. p(f|e, e′), whereas ef/pa denotes a path-aligned model in reverse direction, i.e. p(e|f, f′)

5.2 Triplet results

We tested the extended lexicon models on two large-scale systems, i.e. QUAERO and NIST, for two language pairs, namely Arabic–French and Arabic–English, respectively. The overall improvement for the Arabic–French language pair is +0.7% BLEU over the MorphTagger baseline (cf. Table 4). For NIST Arabic–English, we observe a larger improvement by combining two triplet models in search.

Table 6 summarizes the results using a combination of two triplet models during decoding. One model is an unconstrained triplet model p(f|e, e′) trained for 20 EM iterations with a maximum distance constraint of 10, i.e., the distance between the two triggering words is limited to 10. We chose 20 iterations over 10 since the triplet model was still very large, and trimming during the training process was able to reduce the overall model size to 80M triplets after 20 iterations. The second model is a path-aligned triplet model p(e|f, f′) trained for 6 EM iterations without a maximum distance limit, i.e., the second triggering word can originate from any part of the source sentence. This model consists of 56M triplets. The baseline system is trained on MADA-segmented input data and uses a 4-gram language model. It includes a discriminative word lexicon as presented in Mauser et al. (2009) and an inverse IBM model 1 p(e|f), resulting in a slightly different but comparable baseline compared to the segmentation experiments in Sect. 5.1.

We can see that the unconstrained model in the standard direction achieves moderate improvements of +0.5% BLEU on the test set. From the table, one can observe that the inverse model, p(e|f, f′), obtains much better improvements of +1.2% BLEU. One possible explanation is that the model captures information that is lost due to the preprocessing of the source side: due to the removal of diacritics, the system loses a discriminative feature that helps to disambiguate the source words. In order to deal with this, information from the context is necessary. Thus, the triplet model enables the system to add sentence-level information to the search process, which might explain the observed improvements.

Table 7 shows improved translation examples. In the first example, "holding" is retained when using triplets due to the link to "hostages". In the second example, "guaranteed" is produced by the triplets system due to the link with "membership", thus generating better lexical choice than the system without triplets.
Table 7 Translation example from the NIST test set
The Arabic source is presented in its segmented form

6 Conclusions and summary

In this work, we compared and evaluated Arabic segmenters for the task of Arabic SMT. We started out by comparing two available segmenters, an FST rule-based segmenter and the MADA tool, an SVM-based statistical classifier. The FST segmenter suffers from worse translation results on large tasks, and MADA performs too slowly to be incorporated into a real-time SMT system. To combine the best of both worlds, we adapted a Hidden-Markov-Model POS tagger to the segmentation task and plugged it into the translation system as a preprocessing step. Being an HMM disambiguator, the POS tagging process is linear in corpus size, proves comparable in speed to the FST method, and is applicable to real-time systems. Furthermore, the HMM model incorporates context knowledge to infer the output classes, thus resulting in a better, more consistent segmentation result than the FST method. We compared MorphTagger to the FST and MADA segmenters and showed improved results over the FST method and comparable ones to MADA under different translation conditions and on different test sets.

We showed that triplet lexicon models can be successfully trained on large amounts of data by using constrained variants such as the path-aligned model. They can be integrated directly into the decoder and yield improvements in translation quality on various tasks, measured by both BLEU and TER. Due to trimming and maximum-distance constraints, we were able to integrate two triplet models in both directions, i.e., a distance-limited unconstrained model in the standard direction and a path-aligned unlimited model in the inverse direction, improving translation results on the 2008 NIST test set by +1.5% BLEU and −1.2% TER.
Acknowledgments This material is partly based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-08-C-0110, and is partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation.

References
Bar-Haim R, Sima'an K, Winter Y (2005) Choosing an optimal architecture for segmentation and POS tagging of modern Hebrew. In: Semitic '05: proceedings of the ACL workshop on computational approaches to Semitic languages, Morristown, NJ, USA, pp 39–46
Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311
Buckwalter T (2002) Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No.: LDC2002L49
Carpuat M, Wu D (2007) Improving statistical machine translation using word sense disambiguation. In: EMNLP-CoNLL 2007, Prague, Czech Republic
Chan YS, Ng HT, Chiang D (2007) Word sense disambiguation improves statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL, Prague, Czech Republic, pp 33–40
Chang PC, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine translation performance. In: StatMT '08: proceedings of the third workshop on SMT, Morristown, NJ, USA, pp 224–232
Chelba C, Jelinek F (2000) Structured language modeling. Comput Speech Lang 14(4):283–332
Della Pietra SA, Della Pietra VJ, Gillett JR, Lafferty JD, Printz H, Ureš L (1994) Inference and estimation of a long-range trigram model. In: Oncina J, Carrasco RC (eds) Grammatical inference and applications, second international colloquium, ICGI-94, vol 862. Springer, Alicante, pp 78–92
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–22
Diab M, Hacioglu K, Jurafsky D (2004) Automatic tagging of Arabic text: from raw text to base phrase chunks. In: HLT-NAACL 2004: short papers, Boston, MA, USA, pp 149–152
El Isbihani A, Khadivi S, Bender O, Ney H (2006) Morpho-syntactic Arabic preprocessing for Arabic to English statistical machine translation. In: Proceedings of the workshop on SMT, New York, pp 15–22
Habash N (2007) Arabic morphological representations for machine translation. In: Soudi A, Bosch Avd, Neumann G (eds) Arabic computational morphology, text, speech and language technology, vol 38. Springer, Netherlands, pp 263–285
Habash N, Rambow O (2005) Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In: Proceedings of the 43rd annual meeting of the ACL, Morristown, NJ, USA, pp 573–580
Habash N, Sadat F (2006) Arabic preprocessing schemes for statistical machine translation. In: HLT-NAACL 2006: short papers, New York, USA, pp 49–52
Habash N, Soudi A, Buckwalter T (2007) On Arabic transliteration. In: Ide N, Véronis J, Soudi A, Bosch Avd, Neumann G (eds) Arabic computational morphology, text, speech and language technology, vol 38. Springer, Netherlands, pp 15–22
Hamon O, Hartley A, Popescu-Belis A, Choukri K (2007) Assessing human and automated quality judgments in the French MT evaluation campaign CESTA. In: MT Summit XI, Copenhagen, Denmark, pp 231–238
Hasan S, Ney H (2009) Comparison of extended lexicon models in search and rescoring for SMT. In: HLT-NAACL 2009: short papers, Boulder, CO, pp 17–20
Hasan S, El Isbihani A, Ney H (2006) Creating a large-scale Arabic to French statistical machine translation system. In: International conference on language resources and evaluation, Genoa, Italy, pp 855–858
Hasan S, Ganitkevitch J, Ney H, Andrés-Ferrer J (2008) Triplet lexicon models for statistical machine translation. In: EMNLP 2008, Honolulu, Hawaii, pp 372–381
Kim W, Khudanpur S (2003) Cross-lingual lexical triggers in statistical language modeling. In: EMNLP 2003, Morristown, NJ, USA, pp 17–24
Lee YS (2004) Morphological analysis for statistical machine translation. In: HLT-NAACL 2004: short papers, Morristown, NJ, USA, pp 57–60
Maamouri M, Bies A, Buckwalter T, Mekki W (2004) The Penn Arabic Treebank: building a large-scale annotated Arabic corpus. In: NEMLAR conference on Arabic language resources and tools
Mansour S, Sima'an K, Winter Y (2007) Smoothing a lexicon-based POS tagger for Arabic and Hebrew. In: Semitic '07: proceedings of the 2007 workshop on computational approaches to Semitic languages, Morristown, NJ, USA, pp 97–103
Mauser A, Hasan S, Ney H (2009) Extending statistical machine translation with discriminative and trigger-based lexicon models. In: EMNLP 2009, Singapore, pp 210–217
Nguyen T, Vogel S (2008) Context-based Arabic morphological analysis for machine translation. In: CoNLL '08, Morristown, NJ, USA, pp 135–142
NIST (2009) NIST open MT evaluation. http://www.itl.nist.gov/iad/mig/tests/mt/2009/
QUAERO (2008) Automatic multimedia content processing. http://www.quaero.org/
Roark B (2001) Probabilistic top-down parsing and language modeling. Comput Linguist 27(2):249–276
Rosenfeld R (1996) A maximum entropy approach to adaptive statistical language modeling. Comput Speech Lang 10(3):187–228
Sadat F, Habash N (2006) Combination of preprocessing schemes for statistical MT. In: Proceedings of the 44th annual meeting of the Association for Computational Linguistics (ACL), Sydney, Australia, pp 1–8
Smith NA, Smith DA, Tromble RW (2005) Context-based morphological disambiguation with random fields. In: HLT/EMNLP 2005, Morristown, NJ, USA, pp 475–482
Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: Proceedings of the seventh international conference on spoken language processing, ISCA, Denver, CO, USA, pp 901–904
Tillmann C, Ney H (1997) Word triggers and the EM algorithm. In: Proceedings of the special interest group workshop on computational natural language learning (ACL), Madrid, Spain, pp 117–124
Zens R, Ney H (2008) Improvements in dynamic programming beam search for phrase-based statistical machine translation. In: International workshop on spoken language translation, Honolulu, Hawaii, pp 195–205