Table of Contents
Moses for Mere Mortals Short Tutorial
A. Purpose
   A.1. Windows add-ins
   A.2. Moses for Mere Mortals (the Linux scripts)
B. Requirements
   B.1. System requirements
   B.2. Software
C. Overview
D. Installation for users new to Moses
   D.1. Demonstration corpus
   D.2. First steps with the scripts
E. Using your own corpora
F. create-moses-irstlm-randlm script
G. Understanding the directory structure of $mosesdir
H. Names of the files and of the languages
I. Files needed for the training and what you need to know before training your own corpus
   I.1. Where to put the files to be processed
   I.2. Need of strictly aligned corpora files
   I.3. Do not use spaces in file names
   I.4. Corpus files
      I.4.1. Using TMX files to create Moses corpus files
   I.5. Test files
   I.6. Language model file
   I.7. Tuning files
      I.7.1. maxruns parameter
   I.8. Evaluation (testing) files
   I.9. Recaser files
J. make-test-moses-irstlm-randlm script
   J.1. The ominous control characters
K. train-moses-irstlm-randlm script
   K.1. Description of some important parameters
   K.2. Greatly increasing the training speed
   K.3. Controlling tuning
   K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of training already done in previous trainings
   K.5. Training of an inverted corpus
   K.6. Isolating a training from all the other ones
L. translate-moses-irstlm-randlm script
   L.1. Speed
   L.2. Reusing tuning weights (only for advanced users)
M. score-moses-irstlm-randlm script
   M.1. Two types of scores
N. Utilities
   N.1. transfer-training-to-another-location-moses-irstlm-randlm script
O. Windows add-ins
P. Improving quality and speed
Q. Deleting trained corpora
   Q.1. You want to erase all the trainings that you have done
   Q.2. You want to erase just some of all the trainings that you have done
      Q.2.1. Scenario 1: More than one Moses installation available
      Q.2.2. Scenario 2: Single Moses installation available
R. New features
S. How to contribute
T. Thanks
U. Author
APPENDIX: default parameters of each of the scripts
   1) create-moses-irstlm-randlm script
   2) make-test-moses-irstlm-randlm script
   3) train-moses-irstlm-randlm script
   4) translate-moses-irstlm-randlm script
   5) score-moses-irstlm-randlm script
   6) transfer-training-to-another-location-moses-irstlm-randlm script
A. Purpose
Moses-for-Mere-Mortals builds a translation chain prototype with Moses + IRSTLM + RandLM (with either the giza-pp or the MGIZA aligner). Very large corpora can therefore be processed. Its main aims are the following: 1) to help build a translation chain for the real world; 2) to guide the first steps of users who are just beginning to use Moses; 3) to enable a quick evaluation of Moses. Even though the main thrust of this work centers on Linux (the operating system where Moses runs), translators usually work in an MS Windows environment. Two add-ins (collectively called Windows add-ins) therefore help to make the bridge from Windows to Linux and back from Linux to Windows. This work thus involves both a Linux component (the Moses-for-Mere-Mortals.tar.gz package) and an MS Windows component (the Windows-add-ins.zip package). For a brief overview of the way they interact, please see http://moses-for-mere-mortals.googlecode.com/files/Overview.jpeg.
If you are interested in building very large Moses corpora using your own translation memories (*.TMX files), please see http://moses-for-mere-mortals.googlecode.com/files/Extract_TMX_Corpus_1.041.exe. If, on the other hand, you want to transfer Moses translations to a *.TMX translation memory tool (e.g., SDL Trados Translator's Workbench), you can use Moses2TMX (http://moses-for-mere-mortals.googlecode.com/files/Moses2TMX-1.032.exe). Together, these 2 open source programs make the link between Linux (where the corpora are trained and the translations are made) and Windows (where most translators actually use Moses translations).

Users of these scripts should, after having tried the demonstration corpus, be able to get results immediately with the real corpora they are interested in. These scripts also avoid the destruction of previously trained corpora by the subsequent training of a new corpus, and simultaneously try to reuse the relevant parts of previous trainings in the subsequent training of a new corpus. It is also possible to train corpora where every word is presented together with its respective lemma and part-of-speech tag (factored training); the present scripts do not cover this type of training.

Moses-for-Mere-Mortals scripts are based on instructions from several sources, especially the http://www.dlsi.ua.es/~mlf/fosmt-moses.html and the http://www.statmt.org/moses_steps.html web pages and the Moses, IRSTLM, RandLM, giza-pp and MGIZA documentation, as well as on research on the available literature on Moses, namely the Moses mailing list (http://news.gmane.org/gmane.comp.nlp.moses.user). The comments transcribe parts of the manuals of all the tools used. The Moses MT System is an open source project under the guidance of Philipp Koehn, University of Edinburgh, and is supported by the European Commission Framework Programme on Research and Technological Development and others.
For information on the general concepts of Statistical Machine Translation, see "Statistical Machine Translation" by Philipp Koehn, Cambridge University Press, 2010.
B. Requirements

B.1. System requirements

Preferably a fast multiprocessor computer.

Disk space: as a rough rule, the disk space needed for corpus training is approximately 100 times the size of the corpus (source plus target files).
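As a concrete instance of the rule of thumb above (the corpus size is a hypothetical figure, chosen only for illustration):

```shell
# Rough disk-space estimate for training: ~100x the corpus size,
# where "corpus size" = source file + target file together.
corpus_mb=500                      # hypothetical corpus of 500 MB
echo "$((corpus_mb * 100)) MB"     # disk space needed, i.e. about 50 GB
```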
B.2. Software
In order to use the scripts, the following packages should already be installed in Ubuntu 9.04:

1) subversion
2) automake
3) autoconf
4) bison
5) boost-build
6) build-essential
7) flex
8) help2man
9) libboost1.37-dev
10) libpthread-stubs0-dev
11) libgc-dev
12) zlibc
13) zlib1g-dev
14) gawk
15) tofrodos

You can install them by selecting the System menu and then the Administration > Synaptic Package Manager command.
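Alternatively, on an apt-based system such as Ubuntu, all of the packages listed above can be installed from the Terminal in a single command (the snippet below only builds and prints that command so you can inspect it before running it with sudo):

```shell
# Build the one-line apt-get command for the packages listed above;
# copy the printed line into a Terminal (it needs root privileges).
pkgs="subversion automake autoconf bison boost-build build-essential flex \
help2man libboost1.37-dev libpthread-stubs0-dev libgc-dev zlibc zlib1g-dev \
gawk tofrodos"
echo "sudo apt-get install $pkgs"
```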
C. Overview
These scripts have only been tested in an Ubuntu 9.04 64-bit environment. Before using them, you should install the Linux packages upon which they depend (see section B. Requirements).

Note: So as to avoid having to rewrite this Help file every time a script version changes, the version numbers of the scripts have been omitted in what follows (for example, we write create-moses-irstlm-randlm instead of create-moses-irstlm-randlm-1.27).

1) You should start using these scripts by opening the create-moses-irstlm-randlm script and changing some important parameters that will make your installation adequate for your hardware (see section F. create-moses-irstlm-randlm script).

2) Then launch the create-moses-irstlm-randlm script, which will download and compile ***all*** the Moses packages. Since some compilation errors do not stop compilation but can lead to an unusable tool, the create-moses-irstlm-randlm script checks, at the end of the compilation of each tool, whether the appropriate files are present or not.

3) Select a set of corpus files (2 strictly aligned files in UTF-8 format, one in the source language and another in the target language) for Moses training and launch the make-test-moses-irstlm-randlm script in order to create a representative set of segments that will be used for testing the corpus that you will train, and in order to erase those segments from the corpus files that will be used for training. This step is not obligatory but is highly recommended.

NOTE 1: Even if you do not have your own corpus for training, you can use the demo corpus that comes with these scripts (you do not need to do anything to get this corpus: the create-moses-irstlm-randlm script takes care of this). It is highly recommended that new Moses users start with the demo corpus.

NOTE 2: If you do not have text files in UTF-8 format (necessary for Moses), but you do have translation memories in *.TMX format, you can use the Extract-TMX-Corpus tool to create the adequate UTF-8 files from your TMX files (http://code.google.com/p/extract-tmx-corpus/).

4) Next, the train-moses-irstlm-randlm script trains a corpus composed exclusively of plain words (non-factored training). This script does include some advanced features, like memory-mapping (for saving memory resources, which is essential for processing large corpora), tuning (in order to get qualitatively better results) and the ability to change certain parameters that can either reduce the processing time or increase the quality of the results.

5) The translate-moses-irstlm-randlm script then translates one or more documents placed in a specific directory.

6) Finally, the score-moses-irstlm-randlm script allows you to score Moses translations against human translations of the same text, giving BLEU and NIST scores for either the whole document or for each segment of the document (depending on the settings that you define).

7) If you want to transfer your trained corpora to someone else or to another Moses installation (even one in the same computer), the transfer-training-to-another-location-moses-irstlm-randlm script helps you to do that.

Users are expected to open the scripts and to change the parameters according to their needs (the parameters are preceded by comments that explain their purpose and sometimes even their allowable values and limits; many of those comments are citations of the Help of the several packages that Moses uses and of the Moses manual). These parameters are set at the top of each script in a clearly defined section.
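Since Moses requires UTF-8 input (step 3 above), here is a quick way to verify a corpus file before training. This is only a sketch: it creates a tiny stand-in file, and assumes iconv is available (it is on any standard Ubuntu installation):

```shell
# Validate that a corpus file is well-formed UTF-8: iconv exits with a
# non-zero status on the first invalid byte sequence it meets.
printf 'olá mundo\n' > sample.pt          # tiny stand-in corpus file
if iconv -f UTF-8 -t UTF-8 sample.pt > /dev/null; then
  echo "sample.pt is valid UTF-8"
fi
```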
(please note that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic). The corpus is small (200 000 segments in the Portuguese and English languages) and the results of its processing cannot be seen as representative of the quality Moses can achieve (especially if you consider that IRSTLM and RandLM are intended to process corpora with several to many millions of segments). However, a small corpus like this will reveal facts about Moses (like the length of time needed for each of its steps) and it is therefore highly recommended that you start using the scripts with their default settings. Later on, you can set them so that the actual work you are interested in will be done. If you don't change the default settings in the create-moses-irstlm-randlm and the train-moses-irstlm-randlm scripts, Moses will train this Portuguese-English corpus, which involves 300 000 segments for language model building and 200 000 segments for corpus training. The create-moses-irstlm-randlm script transfers the necessary files to the right place, so that you do not have to do anything. This corpus had a BLEU score of 0.6702 and a NIST score of 11.6946 with a 5-gram language model. Even though small, it took some time to train on a machine with an Intel i7 720-QM processor and 8 GB of DDR3 RAM (3h 08m 39s without tuning and 10h 02m 20s with tuning limited to a maximum of 10 iterations). The training of a corpus with 6.6 million segments on this same machine took 2 days 22h 59m (without tuning).

D. 2. First steps with the scripts

Here you will find a description of how to run the whole demo, with minimum changes in the scripts' parameters. In most cases, you just have to launch the scripts as indicated below. However, if you want to use your own corpora right away, skip this section and go directly to Section E. You can launch the scripts from wherever you saved them.
However, create-moses-irstlm-randlm (installation) should be launched first, make-test-moses-irstlm-randlm next, train-moses-irstlm-randlm (corpus training) next and translate-moses-irstlm-randlm (actual translation) last. After having Moses translations, you can optionally score them against reference (human) translations with the score-moses-irstlm-randlm script. The $mosesdir parameter of most of the scripts should always have the same value (e.g., $HOME/moses-irstlm-randlm; if your login name is john, then $HOME means /home/john).

1) Download the pack http://moses-for-mere-mortals.googlecode.com/files/Moses-for-Mere-Mortals-0.97.tar.gz to your $HOME directory (if it is not there already), right-click it and, in the contextual menu that appears, click the Extract here... command. Alternatively, in the Linux Terminal, place yourself in the directory that contains this pack and extract its contents with the following command:

tar -xzvf Moses-for-Mere-Mortals-0.97.tar.gz
A new Moses-for-Mere-Mortals directory will be created and the scripts will be placed in its scripts subdirectory. This directory has the following scripts:
2) If your computer has just one processor, you can go directly to step 3. If it has more than one processor and you want Moses to use them, see point 1 of section F and follow the instructions to change the number of processors. This change must be done before you launch the create-moses-irstlm-randlm script.

Tip: You can run the demo without changing this parameter. When you really start working with Moses, you can install it again with the number of processors you wish, so that you can use the full potential of your computer.

3) Launch the create-moses-irstlm-randlm script: open the Linux Terminal (console), place yourself in the directory where the scripts are and type:

./create-moses-irstlm-randlm

4) Now you are going to extract from the demo corpus a set of segments that will be used for testing the trained corpus. That same set of segments will be erased from the demo corpus files (which will be used later for training):

./make-test-moses-irstlm-randlm

5) Training time! If your computer has more than one processor and you want them to be used right away, please see section K. Then, again in the Terminal, type:

./train-moses-irstlm-randlm

6) You can now translate a text with the trained corpus you have created in step 5 (the create-moses-irstlm-randlm script has already put the file to be translated (100.pt) in the $mosesdir/translation_input directory). You have to indicate to the translate-moses-irstlm-randlm script the trained corpus that you want to use. In order to do that:

a) Go to the $mosesdir/logs directory and copy the name of the file that indicates the corpus that you have just trained (if this is the first time that you are using these scripts and you have done just one training, there will be just one file there);
b) Open the translate-moses-irstlm-randlm script;
c) Set the value of the translate-moses-irstlm-randlm $logfile parameter to the name of the file you identified in the $mosesdir/logs directory (paste that value there);
d) Save your changes.

Now you can launch the translate-moses-irstlm-randlm script:

./translate-moses-irstlm-randlm

7) If you want, you can score the Moses translation against a reference (human) translation:

./score-moses-irstlm-randlm

Again, the create-moses-irstlm-randlm script has already copied the necessary reference (human) translation to the $mosesdir/translation_reference directory (no need for you to do anything). By default, it will score the whole document and give you the BLEU and NIST scores for the whole document. You will probably notice that the score you get is much lower than the one you got at the end of the training test. That is normal, and serves as a warning that even a well-trained corpus with a very good score will perform poorly if it is used to translate segments that come from a domain quite different from that used for training.

After this first practical experience, you can now read the next section in order to learn how to better control all these processes.
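The run order described in this section can be summarized as follows (the loop merely prints the five launch commands in the order in which they should be run, from the Moses-for-Mere-Mortals/scripts directory):

```shell
# Print the demo pipeline in execution order: install, make test set,
# train, translate, score.
for step in create make-test train translate score; do
  echo "./${step}-moses-irstlm-randlm"
done
```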
E. Using your own corpora

In order to train your own corpora, you have to define your own settings and choose your own corpora files and the languages you are interested in. You might also want to change the parameters of Moses or of the packages it uses. In order to do that, before launching the scripts, open them and set the variables defined between

###################################################################################
# The values of the variables that follow should be filled according to your needs: #
###################################################################################

and

###################################################################################
# End of parameters that you should fill #
###################################################################################

Each parameter is preceded by a comment that describes its function and, in some cases, even states the allowable and default values you can use. These comments often consist of extracts of the Help files, readmes or manuals of the several packages used. Please refer to the sections that follow, which describe each of the scripts and some important information in more detail.
F. create-moses-irstlm-randlm script
Vital parameters: mosesdir, mosesnumprocessors

This is a shell script that creates a Moses system.

1. Go to the Moses-for-Mere-Mortals/scripts directory and open the create-moses-irstlm-randlm script. At the top of the script, you can change several variables that allow you to better adapt it to your own requirements:

$mosesdir: the base directory where Moses will be installed (default value: $HOME/moses-irstlm-randlm). You can change both its name (in this case, moses-irstlm-randlm) and its location (in this case, $HOME). This variable is defined in all the scripts and its value should be the same in all of them if you want them to be able to work together (and you do want that!).

$mosesnumprocessors: the number of processors of your computer that you want Moses to use (Moses will be compiled to make better use of them). The default value is 1, but nowadays you can easily find computers with 2, 4 and even 8 processors. If your computer has more than one processor, change this parameter so that it reflects the number of processors that you want to make available for Moses.

2. Save your changes. Do not change any other parameter for the time being (later on, after you have run all these scripts, you can start your own experiments).

This script also creates some important directories:

$mosesdir/corpora_for_training: the directory where the corpora and the other input files for training are located; there you will already find several files that can be used to run the training demo; you should place here all the files needed for training the corpora you are interested in.

$mosesdir/corpora_trained: the directory where the files created during corpus training are kept; please do not touch this directory, since you can destroy the training of one or several corpora.

$mosesdir/logs: the directory where the training summary files are located; these files record the values of several variables that will be used by the translate-moses-irstlm-randlm script (which will extract them for you); these files are very important because they are the only way to indicate to the translate-moses-irstlm-randlm script the trained corpus you want to use for translation.

$mosesdir/toolsdir: the directory where both Moses and all the other tools (giza-pp, irstlm and so on) will be installed; this directory will not change during the training and you should not change it.
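The directories just described can be pictured with the following sketch. It only recreates an empty layout under a throwaway directory name (so as not to touch a real installation); the real create-moses-irstlm-randlm script creates and populates these directories for you:

```shell
# Recreate and list the top-level layout of $mosesdir, using the directory
# names from the descriptions above (the real script adds further
# directories, e.g. translation_input, later in the workflow).
mosesdir="$HOME/moses-irstlm-randlm-demo"
mkdir -p "$mosesdir/corpora_for_training" "$mosesdir/corpora_trained" \
         "$mosesdir/logs" "$mosesdir/toolsdir"
ls "$mosesdir"
```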
G. Understanding the directory structure of $mosesdir

1. Once you have installed Moses with create-moses-irstlm-randlm, you should put the corpus files that you want to train in the $mosesdir/corpora_for_training directory. You should also place here the files used for creating the language model (if different), for training recasing (if different), for tuning (if any), and for testing the results of training (if any).

2. If you then use the train-moses-irstlm-randlm script for training that corpus, a directory $mosesdir/corpora_trained will be created to store the trained files.

NOTE: Even though you might suppose that the $mosesdir/corpora_trained directory is a vital one (a correct assumption), you are strongly urged not to change any of its contents. This is because, to allow reuse of the work done in previous trainings, it has a complex structure that mixes files from several trainings. If you change it, you risk destroying not just one, but several trainings. You can, however, use a specific training already done by referring to its logfile (see below), which can be found in the $mosesdir/logs directory. It is also possible to isolate a training from all others (please refer to section K.6 in order to learn how to do that).

3. At the end of the training, a training summary file (logfile) will be created in the $mosesdir/logs directory. This file is very important because its name will be used in the translate-moses-irstlm-randlm script to indicate the trained corpus you want to use for translation.

4. Once a corpus is trained, you can start using it to get actual translations with the translate-moses-irstlm-randlm script. Place the documents to be translated (you can translate one or many documents at the same time) in the $mosesdir/translation_input directory (created by the train-moses-irstlm-randlm script) and then launch the translate-moses-irstlm-randlm script. You should also make sure, before the translation starts, that the files placed in the translation_input directory are indeed adequate for the trained corpus that this script uses (for instance, the languages used for training should match those of the translation; otherwise, you will waste time translating files that shouldn't have been translated using that trained corpus, e.g. because their language is not adequate).

5. Translation can have 2 types of outputs:

a) a normal Moses translation, if you set the $translate_for_tmx parameter to a value different from 1 (default: 0); or
b) a Moses translation especially suited for making translation memories, if you set the $translate_for_tmx parameter to 1.

The normal translation will be located in the $mosesdir/translation_output directory. The translation intended to build TMX translation memories will appear, together with the corresponding modified input file, in the $mosesdir/translation_files_for_tmx directory. In both cases, the translation will have the name of the source document plus an ending that corresponds to the destination language and a final suffix .moses. This avoids confusion between source document and translation and between Moses translation and reference translation.

6. The score-moses-irstlm-randlm script enables you to place a reference (that is, human) translation in the translation_reference directory and get the BLEU and NIST scores of the corresponding Moses translations in the translation_scoring directory. Again, you have 2 choices:

a) get a score for the whole document, if the parameter $score_line_by_line is different from 1; or
b) get a score line by line, with the segments ordered by ascending BLEU score, if the parameter $score_line_by_line is equal to 1.

7. The contents of the $mosesdir/toolsdir directory should not be changed, since it includes all the files needed for Moses to work.
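The naming rule for translation output can be illustrated like this, using the demo input file 100.pt translated into English. This only spells out the rule as stated above (source name + target-language ending + .moses suffix); the exact separators the script uses may differ:

```shell
# Compose the expected output name for a translation of 100.pt into English,
# following the rule: source name, then target-language ending, then ".moses".
src=100.pt
target_lang=en
echo "${src}.${target_lang}.moses"
```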
I. Files needed for the training and what you need to know before training your own corpus

I.1. Where to put the files to be processed
All the files that are going to be mentioned should be put in $mosesdir/corpora_for_training (as described above, mosesdir is the base Moses system directory, whose default value is $HOME/moses-irstlm-randlm).
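For example, putting a hypothetical corpus pair (mycorpus.pt / mycorpus.en) in place could look like this. The snippet creates the target directory and two tiny placeholder files only so the sketch is self-contained; in a real installation the create script has already made the directory and your corpus files are real:

```shell
# Copy a corpus pair into the directory where all training input must live.
mosesdir="${mosesdir:-$HOME/moses-irstlm-randlm}"
mkdir -p "$mosesdir/corpora_for_training"
printf 'olá mundo\n' > mycorpus.pt        # placeholder source-language file
printf 'hello world\n' > mycorpus.en      # placeholder target-language file
cp mycorpus.pt mycorpus.en "$mosesdir/corpora_for_training/"
ls "$mosesdir/corpora_for_training"
```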
The make-test-moses-irstlm-randlm script will use these 2 files to obtain a representative sample of their segments, and creates 2 new corpus files from which the lines corresponding to the segments extracted for testing have been erased. Those segments can still occur in the new corpus files, since they may occur in more than one line but, unless your corpus is indeed very repetitive, it is likely that your test files contain text that does not exist anymore in the corpus to be trained. The new corpus files created by this script (whose names are given at the end of its execution) should then be the files used for setting the $corpusbasename. If you had started it with the example files given above, it would create 4 new files:

corpus1000.for_train.pt (a file used for corpus training)
corpus1000.for_train.en (a file used for corpus training)
corpus1000.for_test.pt (a file used for testing the trained corpus)
corpus1000.for_test.en (a file used for testing the trained corpus)
I.7.1. maxruns parameter

Tuning is a phase that can easily take more time than all the others put together. Furthermore, you can't easily estimate its duration beforehand, since the number of its runs is highly variable from corpus to corpus. Therefore, a modified mert-moses-new.pl script (mert-moses-new-modif.pl) introduces some changes to the original Moses script so that the user can control the number of tuning runs through the $maxruns parameter of the train-moses-irstlm-randlm script. A value of -1 means that an unlimited number of runs is allowed. Any positive number >= 2 means that tuning will be stopped after that number of runs. The default value is 10. Good values lie between 5 and 10 (choose the lower end of this range if you want to speed things up, at the risk of a worse tuning).
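Instead of editing the script in a text editor, the parameter can also be changed from the command line. This is a hypothetical one-liner that assumes the assignment appears at the start of a line in the form maxruns=10; it deliberately operates on a throwaway copy rather than the real script:

```shell
# Cap tuning at 5 runs by rewriting the maxruns assignment in place.
printf 'maxruns=10\n' > train-script-demo      # stand-in for the real script
sed -i 's/^maxruns=.*/maxruns=5/' train-script-demo
cat train-script-demo
```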
I.9. Recaser files

10000recaser.en
J. make-test-moses-irstlm-randlm script
Vital parameters: lang1, lang2, mosesdir, basefilename

This script assumes that a Moses installation has already been done with the create-moses-irstlm-randlm script. It uses two aligned input files, one in the source and another in the target language, whose $basefilename should be equal and which differ by a suffix that indicates their respective languages (e.g., 200000.pt and 200000.en, the $basefilename being, in this case, 200000). The script produces 2 new sets of files:

1) 2 files (one in the source language and another in the target language) used for testing the trained corpus; for that, it divides the corpus into X $sectors (a parameter that you can define) and then randomly selects Y $segments (another parameter you can define) in each sector. All the selected segments will have different line numbers (no line can be chosen more than once). This procedure offers a better guarantee that the segments used for testing the trained corpus are representative of all the styles and contexts of the corpus being used than if you arbitrarily chose the same number of consecutive segments somewhere in the input files. These files have a basefilename equal to $basefilename.for_test and will be later used by the train-moses-irstlm-randlm script (e.g., 200000.for_test.pt and 200000.for_test.en).

2) 2 files (one in the source language and another in the target language) that are equal to the starting files, except that the segments used for creating the 2 test files have been erased from them. These files have a basefilename equal to $basefilename.for_train and will be later used by the train-moses-irstlm-randlm script (e.g., 200000.for_train.pt and 200000.for_train.en).
NOTE: if you want to compare the relative results of a change in training parameters, you should execute the training test before and after the change in parameters with the same set of test files (run make-test-* just once and use the test files it creates to test both trainings).
K. train-moses-irstlm-randlm script
Vital parameters: mosesdir, lang1, lang2, corpusbasename, lmbasename, tuningbasename, testbasename, recaserbasename, reuse, paralleltraining, memmapping, tuning, runtrainingtest, lngmdl, Gram, mgizanumprocessors, maxruns

This script assumes that a Moses installation has already been done with the create-moses-irstlm-randlm script and can optionally use the files created by the make-test-moses-irstlm-randlm script. Even though it might not cover all the features you might like to play with (namely those necessary for factored training, which would involve the use of a lemmatizer and/or a part-of-speech tagger), this script does allow you to train a corpus, to memory-map your training files (so that Moses uses less RAM), to do tuning, to do a training test (also with memory-mapping), and to get the NIST and BLEU scores of that test. It also makes available all the parameters used by IRSTLM, RandLM, mkcls, GIZA and MGIZA, as well as selected parameters used by the Moses decoder and the Moses scripts. These parameters are all set by default to the values they receive when you use the Moses scripts. If you are new to Moses, do not change them. If your computer has more than 1 processor, you should also change the mgizanumprocessors parameter (set by default to 1) to the number of processors of your computer that you want to use with MGIZA: just open the train-moses-irstlm-randlm script, search for the name of this parameter, change it and save your changes. At the very least, this script will build a language model and train a corpus (unless they already exist, in which case it will not rebuild them). The other steps are optional: memory-mapping, tuning and testing. In order to set the steps that will be executed, you have to change the parameters at the top of the script; the role of each parameter is indicated there in a comment that precedes it. The directory structure created by these scripts ensures 2 things: 1) no training will interfere with the files of a previous training; and 2) a later training will reuse as much as possible the files created in previous trainings. At the end of the training, a log (training summary) file will be created in the $mosesdir/logs directory.
It includes details about the duration of the several phases of training, the values that will be used when you translate files based on this trained corpus, a list of the main input files, a list of all the files created during training, a list of all the parameters used and the score of the trained corpus test (if a test was done). The name of this file is important because it is used by the translate-moses-irstlm-randlm script to select the trained corpus that you want to use for translation.
K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of training already done in previous trainings
In order to guarantee that the training of a corpus doesn't destroy files from a previously trained corpus, and in order to ensure that each training reuses as much as possible the files already created in previous trainings, a complex (and confusing) directory structure was created. This, however, implies that the files of all the previous trainings are dispersed in the $mosesdir/corpora_trained directory. As already stated, you shouldn't change this directory, since by doing so you can destroy not just one but even several trainings. However, a $mosesdir/logs directory exists where you can find a summary of every training you made that describes, among other things, the parameters it used and the files it created. In order to use a trained corpus for translation, you just have to copy the name of its log file into the $logfile parameter of the translate-moses-irstlm-randlm script. Nothing else is necessary, and for all practical purposes you can ignore where the trained corpus files are.
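As a hypothetical helper (not part of the scripts), you could list the training summaries and pick the name of the newest one for $logfile. A demo directory stands in for a real $mosesdir here:

```shell
# Illustrative: select the newest training summary by name (no path)
mosesdir=./demo-mosesdir                      # use your real $mosesdir instead
mkdir -p "$mosesdir/logs"
touch "$mosesdir/logs/day1.txt"; sleep 1; touch "$mosesdir/logs/day2.txt"   # stand-ins for real log files
logfile=$(ls -t "$mosesdir/logs" | head -n 1)   # newest log file, name only
echo "$logfile"
```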
L. translate-moses-irstlm-randlm script
Vital parameters: mosesdir, logfile, translate_for_tmx (if this latter parameter is set to 1, then look also at minseglen, othercleanings, improvesegmentation, removeduplicates)

This script assumes that Moses with IRSTLM and RandLM has been created with create-moses-irstlm-randlm and that a trained model already exists for the language pair you want to translate (which requires you to have already run train-moses-irstlm-randlm). Using the trained model that you chose through its $logdir and $logfile parameters, this script translates the file or files that you yourself have put in $mosesdir/translation_input; in fact, it will translate in one step all the files that are there. It is very important that you fill in correctly the $logdir and $logfile parameters, since they are the only way of telling the script which trained corpus you want to use for translation. By default, $logdir has the value $HOME/moses-irstlm-randlm/logs and, if you haven't changed this parameter, you just have to go to this directory, identify the file that corresponds to the corpus you want to use and copy its name (omit the path!) into $logfile.

Translation can have 2 types of output:

1) A normal Moses translation, if the $translate_for_tmx parameter is set to 0 (default value). The normal translation will be located in the $mosesdir/translation_output directory.

2) A Moses translation especially suited for use in a translation memory tool, if you set the $translate_for_tmx parameter to 1. This type of translation will be located, together with the modified input file, in the $mosesdir/translation_files_for_tmx directory. It is especially interesting for those who use machine translation together with translation memories (notably those who only use MT segments when there is no translation memory segment above a certain match percentage).

By default, translate_for_tmx=0, which means the script will do a "normal" translation. This type of translation fully respects the formatting of the original text, and therefore keeps long paragraphs, which, according to some sources, do not lead to the best results. That, however, didn't stop us from getting very respectable BLEU and NIST scores, as you can see for yourself if you try the demo corpus.

If you set $translate_for_tmx to 1, then other parameters will be activated:
a) $minseglen: if set to a value greater than 0, all segments shorter than minseglen will be erased; if set to -1, no segments will be erased whatever their length; default value: -1;
b) $othercleanings: if set to 1, tabulation characters will be replaced by newlines and lines composed only of digits, spaces and parentheses will be removed; default value: 1;
c) $improvesegmentation: if set to 1, replaces any of the characters [:;.!?] followed by a space by that character followed by a newline, deletes empty lines and substitutes double spaces by a single space; default value: 1;
d) $removeduplicates: if set to 1, removes duplicated segments; default value: 1.

If you want to score the Moses translation (for that, you need to have a reference human translation) and $translate_for_tmx is set to 1, then you should set $minseglen = -1, $othercleanings = 0, $improvesegmentation = 0 and $removeduplicates = 0 (so that the source document and the reference translation have the same number of segments). The names of the output files will be equal to those placed in $mosesdir/translation_input, except for a suffix appended to them with the abbreviation of the target language. Therefore, if you input the file 100.pt you will get a translated 100.pt.en.moses file (if en is the abbreviation of the target language).
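The cleanups in b), c) and d) can be approximated with standard Linux tools. The sketch below, run on a toy file, is illustrative only (GNU sed is assumed for the \n replacement); it is not the script's actual code:

```shell
# Toy input: a tab, a line of digits/parentheses, and a duplicated segment
printf 'one. two! three\tfour\nfive. five. \n123 (45)\ndup\ndup\n' > in.txt
# b) othercleanings: tabs -> newlines; drop lines made only of digits, spaces and parentheses
tr '\t' '\n' < in.txt | grep -v -E '^[0-9 ()]+$' > step1.txt
# c) improvesegmentation: break after [:;.!?] + space; squeeze double spaces; delete empty lines
sed -E 's/([:;.!?]) /\1\n/g' step1.txt | sed -E 's/  +/ /g' | sed '/^$/d' > step2.txt
# d) removeduplicates: keep only the first occurrence of each segment
awk '!seen[$0]++' step2.txt > out.txt
cat out.txt
```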
Furthermore, both the source document and the Moses translation are also changed so that the named entities defined in the TMX specification (e.g., for characters such as & and /) are duly created.
L.1. Speed
Especially with very large trained corpora (several million segments), translation can be slow. According to the Moses manual, to get faster performance than the default Moses settings at roughly the same quality, use the parameters $searchalgorithm=1 (default: 0), $cubepruningpoplimit=2000 (default: 1000) and $stack=2000 (default: 100). You can also try to reduce the latter 2 parameters to values of 500 or less (say, 100) and experiment to determine if they significantly change the translation quality.
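In the translate-moses-irstlm-randlm script, this could look like the following fragment (the values are the Moses manual's suggestion, as quoted above):

```shell
#faster search at roughly the same quality
searchalgorithm=1          # default: 0
cubepruningpoplimit=2000   # default: 1000
stack=2000                 # default: 100
```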
that will be created for that same language pair, a very big time-saving trick. If the files used for tuning are representative of your corpora, they should in principle lead to better results than the default values used when no tuning is done (though that is not always the case). You might be interested in doing this only if the score-moses-irstlm-randlm script shows a significant increase in translation quality after tuning is done. In practice, you should first train a corpus without tuning, translate a representative text and then score that translation with the scoring script. Then retrain the same corpus with tuning, translate that same representative text and score it (since the scripts reuse the previously made steps, the previous training will be reused and you will just do a new tuning and a new training test). You can repeat this for several representative texts. If the scores obtained with tuning are significantly higher than those obtained without tuning, then you can use the tuning weights for all the similar corpora of that language pair. Depending on your settings, you can have a moses.ini file in $mosesdir/corpora_trained/model, $mosesdir/corpora_trained/memmaps and $mosesdir/corpora_trained/evaluation. If you want to use the tuning weights, you should change those weights in all those moses.ini files. The weights in question are listed in the [weight-d], [weight-l], [weight-t] and [weight-w] sections of the $mosesdir/corpora_trained/tuning/.../moses.weight-reused.ini file.
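To locate the weights to copy, something like the following hypothetical helper can print a single weight section. The moses.ini below is a minimal toy fragment; real files contain many more sections:

```shell
# Toy moses.ini with only the weight sections mentioned above
cat > moses.ini <<'EOF'
[weight-d]
0.3
[weight-t]
0.2
0.2
[weight-w]
-1
EOF
# print lines from "[weight-t]" up to (but not including) the next section header
awk '/^\[/ {f = ($0 == "[weight-t]")} f' moses.ini > weights.txt
cat weights.txt
```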
M. score-moses-irstlm-randlm script
Vital parameters: mosesdir, lang1, lang2, scoreTMXdocuments, s, r, m, score_line_by_line (if this latter parameter is set to 1, then you should also set the remove_equal parameter), tokenize and lowercase

In some cases, you might want to score the translations you get with the translate-moses-irstlm-randlm script against reference (human) translations that might be available. That can be useful to determine whether some parameter changes lead to improvements, or even to indirectly assess the satisfaction of the human users of translations made with your trained corpus (for that, just consider their translation, after it is done with the help of the Moses translation, as the reference translation and score the Moses translation against their reference translation). For a score to be done, you need to have a source document ($s parameter), a reference (human) translation of the source document ($r parameter) and a Moses translation of the source document ($m parameter). As seen in section L, you could have done a normal translation (if $translate_for_tmx = 0 in the translate-moses-irstlm-randlm script) or a translation especially suited for TMX translation memories (if $translate_for_tmx = 1). If you have done a normal translation, you should set the $scoreTMXdocuments parameter to 0 (its default value). The $scoreTMXdocuments parameter should be set to 1 if you have previously translated the text with $translate_for_tmx = 1 in the translate-moses-irstlm-randlm script, indicating that you have done a translation especially suited for making translation memories. However, you should know that other settings of this latter script can change the number of lines of the translation (e.g., by removing identical lines) and, in that case, scoring of such a changed document is not possible. More specifically, that happens if $minseglen != -1, $othercleanings != 0, $improvesegmentation != 0 ***or*** $removeduplicates != 0. The source document should be placed in the $mosesdir/translation_input directory, for normal translations, or in $mosesdir/translation_files_for_tmx, for translations suited for TMX. The reference translation should be put in the $mosesdir/translation_reference directory. The scripts will put the Moses translation in the right location ($mosesdir/translation_output directory, for normal translations, and $mosesdir/translation_files_for_tmx, for translations suited for TMX).
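Before scoring, it may help to verify that the three files really do align segment by segment. This is a hypothetical pre-flight check (not part of the scripts), run here on toy stand-ins for $s, $r and $m:

```shell
# The source ($s), reference ($r) and Moses translation ($m) must have the
# same number of segments, otherwise scoring is impossible.
s=src.txt; r=ref.txt; m=mt.txt
printf 'a\nb\n' > "$s"; printf 'A\nB\n' > "$r"; printf 'A\nB\n' > "$m"   # toy files
ns=$(wc -l < "$s"); nr=$(wc -l < "$r"); nm=$(wc -l < "$m")
if [ "$ns" -eq "$nr" ] && [ "$ns" -eq "$nm" ]; then
    echo "segment counts match: $ns"
else
    echo "mismatch: s=$ns r=$nr m=$nm" >&2
fi
```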
N. transfer-training-to-another-location-moses-irstlm-randlm script

This script creates, in the $mosesdir that contains the trainings that you want to transfer (the $mosesdirmine parameter), a new subdirectory $mosesdirmine/corpora_trained_for_another_location/$newusername, and places there a copy of the $mosesdirmine/corpora_trained and $mosesdirmine/logs directories that you want to transfer. In these 2 copied subdirectories, the string that referred literally to the trained corpora's initial location is replaced by the correct string that will enable them to be used by another user and/or in another location (since you can have several Moses installations in the same computer). This script will copy all the trainings contained in the $mosesdirmine/corpora_trained directory. Your original trainings are not affected by this operation. You can then copy these 2 subdirectories (prepared for being transferred) to the new location or to the new computer where you want them to be used (it is you who has to copy them there manually, for instance via a USB key or an external hard disk, into the $mosesdir directory where you want them to be used, which corresponds to the $mosesdirotheruser directory). After you have transferred the corpora to their intended location, you can safely erase $mosesdirmine/corpora_trained_for_another_location.

The $mosesdirmine parameter is the value of your $mosesdir (by default, $HOME/moses-irstlm-randlm) whose trainings you want to transfer. The $mosesdirotheruser parameter is the value of the $mosesdir to which you want to transfer your trainings. The $newusername parameter is the Linux login name of the user to whom you want to transfer your trainings (if you keep your own login, that means that you are transferring the trainings to another Moses installation in your own computer). Since reading from and writing to disk can lead to errors, we strongly suggest that you make a backup of the $mosesdirotheruser directory before transferring the $mosesdirmine/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirmine/corpora_trained_for_another_location/$newusername/logs subdirectories to it, and especially before erasing or overwriting anything. Please note that you should copy just the corpora_trained and logs subdirectories to the $mosesdirotheruser directory (not the $mosesdirmine/corpora_trained_for_another_location/$newusername directory itself). If the $mosesdirotheruser directory that is going to receive the new corpora_trained and logs subdirectories already contains some trainings, when you merge them there you will be alerted that the subdirectories with the same name will be overwritten by the new ones. Even though we again stress that it is much safer to make a backup of the contents of $mosesdirotheruser before attempting this operation, and highly recommend that you do so (any error might destroy the previous trainings already present there!), you should accept that overwriting. In fact, given the structure of those directories, the files already present there should not disappear. But this is something that you do at your own risk.
O. Windows add-ins
In order to be able to prepare corpora so that they can be used to train Moses, and in order to convert Moses output so that it can be used in translation memories, 2 MS Windows programs are presented in a separate Windows-add-ins.zip package (http://moses-for-meremortals.googlecode.com/files/Windows-add-ins.zip):
- Extract_TMX_Corpus: enables the creation of Moses corpus files from TMX (translation memory) files;
- Moses2TMX: enables the creation of TMX files from a Moses translation file and the corresponding original document in the source language.
Please consult their Readme files to learn how to use them. Together, these 2 programs create a synergy between machine translation and translation memories.
According to the Moses manual, you should first try the (less numerous) parameters of the translate-moses-irstlm-randlm script. In case you want to reset the parameters that you changed to their default values, you can consult the Appendix of the present document to get those values.
Q. Deleting trained corpora

Q.1. You want to erase all the trainings that you have done
That's really easy. Just delete the $mosesdir/corpora_trained and the $mosesdir/logs directories. The next time you use the train-moses-irstlm-randlm script, it will re-create these 2 directories.
Q.2. You want to erase just some of all the trainings that you have done
There are 2 ways to delete corpora. The first one, though more accurate, requires you to have 2 Moses installations (you can have as many Moses installations as you want, each one in a $mosesdir with a different name). The second one is much riskier and isn't as effective, but it does not require you to have more than one Moses installation. In both cases, we strongly recommend that you carefully back up the corpora_trained and logs subdirectories of the $mosesdir that will be changed. As you know, if you accidentally change the $mosesdir/corpora_trained or the $mosesdir/logs directories, you can lose one, several or even all of the trainings you have done.

Q.2.1. Scenario 1: More than one Moses installation available

This is by far the less risky and more efficient method of deleting unwanted trainings.
Tip: If you do not have more than one Moses installation (that is, more than one $mosesdir), you can easily create a new one by running the create-moses-irstlm-randlm script and setting its $mosesdir parameter to a value different from the $mosesdir value that was used to create your present Moses installation (after its execution, you will have 2 different Moses installations).

1) Prepare the $mosesdir where you want to delete a trained corpus (let's call it $mosesdirstart) to be transferred to another location, by using the transfer-training-to-another-location-moses-irstlm-randlm script.
2) As you know (see the description of the transfer-training-to-another-location-moses-irstlm-randlm script above), this script processes all the trained corpora of $mosesdirstart, that is, the ones you want to delete and also the ones you do not want to delete, and creates 2 new directories: $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs. Delete in these 2 directories, respectively, the subdirectories and the log files that correspond to the corpora that you want to delete.
3) Let's call the $mosesdir that will receive the trained corpora that you do not want to delete $mosesdirfinal. Just to play it safe, back up its $mosesdirfinal/corpora_trained and $mosesdirfinal/logs subdirectories.
4) Now you just have to move $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs to, respectively, $mosesdirfinal/corpora_trained and $mosesdirfinal/logs.
5) In order to verify that everything went well, make a small translation with one of the trained corpora that were initially present in $mosesdirfinal, as well as a translation with one of the corpora that you have now manually transferred there.
6) If no problems were detected in the previous step, delete $mosesdirstart.

Q.2.2. Scenario 2: Single Moses installation available

A more convoluted, riskier and less efficient way is also available. The log file of the training that you want to suppress (located in the $mosesdir/logs subdirectory) contains, at its very end, a list of the files used in that training. You can erase the files that use the most space and that are surely not required for any other training if you erase the files, and only the files, listed in the log file that are located in the following subdirectories of $mosesdir/corpora_trained:
1) evaluation
2) memmaps
3) model
4) tuning
This might be easier than it seems. The long names of the directories, necessary so that a training does not overwrite another training and so that you can reuse previous steps already done, do not need to be fully inspected: as soon as you find the correct first subdirectory of any of the above directories, you can erase it without checking any further. Other files, located in other $mosesdir/corpora_trained subdirectories, even though used by the corpus that you want to delete, might also be used by other corpora (since these scripts reuse the steps already done by previous trainings). As far as these subdirectories are concerned, the best advice is probably not to touch them.
R. New features
Relative to Moses-for-Mere-Mortals-0.64, the following main new features have been added:
1) Control characters are removed from the input files (these can crash a training);
2) 2 test files are extracted from the corpus files by pseudorandomly selecting non-consecutive segments, which are erased from the corpus files;
3) A new training does not interfere with the files of a previous training;
4) A new training reuses as much as possible the files created in previous trainings;
5) Inversion of corpora (e.g., from en-pt to pt-en) is detected, allowing a much quicker training than that of the original language pair;
6) The duration of tuning can be limited;
7) The BLEU and NIST scores of a translation can be obtained (either for the whole document or for each segment of it);
8) Trainings can be transferred to someone else or to another Moses installation in the same computer;
9) All the mkcls, GIZA and MGIZA parameters can now be controlled through parameters of the train-moses-irstlm-randlm script;
10) Selected parameters of the Moses scripts and the Moses decoder can now be controlled through the train-moses-irstlm-randlm and translate-moses-irstlm-randlm scripts;
11) RandLM is installed;
12) MGIZA is installed;
13) Distributed training with IRSTLM is implemented (so as to better manage memory);
14) New make-test-moses-irstlm-randlm, score-moses-irstlm-randlm and transfer-training-to-another-location-moses-irstlm-randlm scripts;
15) Bigger demo corpus.
S. How to contribute
You can contribute to the improvement of this work either by contacting Moses.for.Mere.Mortals@gmail.com or by participating in the discussion group linked to this site (http://groups.google.com/group/mosesformeremortals). Comments, criticisms and further scripts or documentation that make the process of using Moses more user-friendly are gladly welcome. If we accept your work, we will fully acknowledge its author (and only its author), and we propose that at the very beginning of it you write:

#copyright {year}, {your name}
#licenced according to the {name of the licence} licence

If you propose a significant change to an existing script, the names of all of its authors will be mentioned in it and the licence will have to be agreed upon.
T. Thanks
Special thanks:

Hilário Leal Fontes, who made very helpful suggestions about the functionality of several scripts and carried out comprehensive tests. He is also the author of the nonbreaking_prefix.pt file (for the Portuguese language). He compiled the corpora that were used to train Moses and to test these scripts, including 2 very large corpora with 6.6 and 12 million segments. He also revised the Help/Short Tutorial file.

Maria José Machado, whose suggestions and research have significantly influenced the score-moses-irstlm-randlm script. She helped in the evaluation of Moses output in general and organised, together with Hilário, a comparative evaluation, made by professional translators, of the qualitative results of Google, Moses and a rule-based MT engine. She suggested a deep restructuring of the present Help/Short Tutorial file and is a co-author of it.

Manuel Tomas Carrasco Benitez, whose Xdossier application was used to create a pack of the Moses-for-Mere-Mortals files.

The authors of the http://www.dlsi.ua.es/~mlf/fosmt-moses.html (Mikel Forcada and Francis Tyers) and http://www.statmt.org/moses_steps.html pages. These pages helped me a lot in the first steps with Moses.

The authors of the documentation of Moses, giza-pp, MGIZA, IRSTLM and RandLM; some of the comments of the present scripts describing the various parameters include extracts of them.

The European Commission's Joint Research Centre and Directorate-General for Translation for the DGT-TM Acquis, freely available on the JRC website and providing aligned corpora of about 1 million segments of Community law texts in 22 languages, which was used in the demonstration corpus. Please note that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.
U. Author
João Luís Amorim de Castro Rosas
Moses.for.Mere.Mortals@gmail.com

The author wishes to stress that the very, very long (unimaginable) working hours and the numerous extremely relevant suggestions of Hilário Leal Fontes and Maria José Machado, who tested this software in an actual translation environment, were an immense contribution, and also created a very pleasant working environment (despite the stress we all suffered :-) ). These scripts would not be the same, and would in fact be much worse, without their help, which made them reflect the practical problems of professional translators.
1) create-moses-irstlm-randlm script:
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Please set $mosesnumprocessors to the number of processors of your computer !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Number of processors in your computer
mosesnumprocessors=1
#Install small demo corpus: 1 = Install; Any other value = Do not install (!!! this will install a very small corpus that can be used to see what the scripts and Moses can do; if dodemocorpus is set to 1, this series of scripts will be able to use the demo corpus without you having to change their settings !!!)
dodemocorpus=1
#Remove the downloaded compressed packages and some directories no longer needed once the installation is done; 1 = remove the downloaded packages; any other value = do not remove those packages
removedownloadedpackges=1
2) make-test-moses-irstlm-randlm script:
#Base path of Moses installation
mosesdir=$HOME/moses-irstlm-randlm
#Source language abbreviation
lang1=pt
#Target language abbreviation
lang2=en
#Number of sectors in which each input file will be cut
totalnumsectors=100
#Number of segments pseudorandomly searched in each sector
numsegs=10
#Name of the source language file used for creating one of the test files (!!! omit the path; the name should not include spaces !!!)
basefilename=200000
3) train-moses-irstlm-randlm script:
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#NOTE 1: The corpus that you want to train, together with the respective tuning files (if different), the testing files (if different), the file used for recasing, and the file used to build the language model (if different) should be placed in $mosesdir/corpora_for_training !!!
#NOTE 2: After the script is executed, you will find a summary of what has been done (the corpus summary file) in $mosesdir/logs
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#==================== 1. LANGUAGES ========================================
#Abbreviation of language 1 (source language)
lang1=pt
#Abbreviation of language 2 (target language)
lang2=en
#==================== 2. FILES =============================================
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the files should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Basename of the corpus placed in $mosesdir/corpora_for_training (the example that follows refers to the 2 files 200000.for_train.en and 200000.for_train.pt, whose basename is 200000.for_train)
corpusbasename=200000.for_train
#Basename of the file used to build the language model (LM), placed in $mosesdir/corpora_for_training (!!! this is a file in the target language !!!)
lmbasename=300000
#Basename of the tuning corpus, placed in $mosesdir/corpora_for_training
tuningbasename=800
#Basename of the test set files (used for testing the trained corpus), placed in $mosesdir/corpora_for_training
testbasename=200000.for_test
#Basename of the recaser training file, placed in $mosesdir/corpora_for_training
recaserbasename=300000
#===================== 3. TRAINING STEPS ===================================
#---------------------------------------------------------------------------------------------------------------------------
#Reuse all relevant files that have already been created in previous trainings: 1 = Do; Any other value = Don't
reuse=1
#---------------------------------------------------------------------------------------------------------------------------
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#NOTE 1: If in doubt, leave the settings that follow as they are; you will do a full training with memory mapping, tuning, a training test and scoring of the training test of the demo corpus; the results will appear in $mosesdir/corpora_trained and a log file will be available in $mosesdir/logs.
#NOTE 2: You can also proceed step by step (e.g., first doing just LM building and corpus training and then testing), so as to better control the whole process.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Do parallel corpus training: 1 = Do; Any other value = Don't
paralleltraining=1
#Number of the first training step (possible values: 1-9); choose 1 for a completely new corpus
firsttrainingstep=1
#Number of the last training step (possible values: 1-9); choose 9 for a completely new corpus
lasttrainingstep=9
#Do memory mapping: 1 = Do; Any other value = Don't
memmapping=1
#Do tuning: 1 = Do; Any other value = Don't; can lead, but does not always lead, to better results; takes much more time
tuning=1
#Do a test (with scoring) of the training: 1 = Do; Any other value = Don't
runtrainingtest=1
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# If you are new to Moses, stop here for the time being
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#=================== 4. LANGUAGE MODEL PARAMETERS ======================
# Use IRSTLM (1) or RandLM (5)
lngmdl=1
#Order of ngrams - the higher the better, but more memory required (choose between 3 and 9; good value: 5)
Gram=5
#---------------------------*** 4.1. IRSTLM PARAMETERS ***-------------------------------------------------
# Distributed language model: 1 = Yes; Any other value = No (splits the file used to build the language model into parts, processes each part separately and finally merges the parts)
distributed=1
# Number of parts to split the dictionary into balanced n-gram prefix lists (in the creation of a distributed language model); default: 5; !!! Only used if distributed = 1 !!!
dictnumparts=20
# Smoothing; possible values: witten-bell (default), kneser-ney, improved-kneser-ney
s='witten-bell'
# Quantize LM (IRSTLM user manual, p. 4: "Reduces memory consumption at the cost of some loss of performance"); 1 = Do; Any other value = Don't. May induce some accuracy loss. Reduces the size of the LM.
quantize=0
# Memory-mapping of the LM: 1 = Do; Any other value = Don't. Avoids the creation of the binary LM directly in RAM (allows a bigger LM at the cost of lower speed; often necessary when the LM file is very big) !!!
lmmemmapping=1
#------------------------------------*** 4.2. RandLM PARAMETERS ***----------------------------------------
# The format of the input data.
The following formats are supported: for a CountRandLM, "corpus" (tokenised text corpora, one sentence per line); for a BackoffRandLM, 'arpa' (an ARPA backoff
language model) inputtype=corpus # The false positive rate of the randomised data structure on an inverse log scale so '-falsepos 8' produces a false positive rate of 1/2^8 falsepos=8 # The quantisation range used by the model. For a CountRandLM, quantisation is performed by taking a logarithm. The base of the logarithm is set as 2^{1/'values'}. For a BackoffRandLM, a binning quantisation algorithm is used. The size of the codebook is set as 2^{'values'} values=8 #====================== 5. TRAINING PARAMETERS =========================== #------------------------------------- *** 5.1. TRAINING STEP 1 ***--------------------------------------------#********** mkcls options #Number of mkcls interations (default:2) nummkclsiterations=2 #Number of word classes numclasses=50 #--------------------------------------*** 5.2. TRAINING STEP 2 ***--------------------------------------------#.................................................. 5.2.1. MGIZA parameters ....................................................................... #Number of processors of your computer that will be used by MGIZA (if you use all the processors available, the training will be considerably speeded) mgizanumprocessors=1 #........ 5.2.2. GIZA parameters ....................................................................... #maximum sentence length; !!! never exceed 101 !!! ml=101 #No. of iterations: #------------------#number of iterations for Model 1 model1iterations=5 #number of iterations for Model 2 model2iterations=0 #number of iterations for HMM (substitutes model 2) hmmiterations=5 #number of iterations for Model 3 model3iterations=3
#number of iterations for Model 4
model4iterations=3
#number of iterations for Model 5
model5iterations=0
#number of iterations for Model 6
model6iterations=0
#
#parameters for various heuristics in GIZA++ for efficient training:
#------------------------------------------------------------------
#Counts increment cutoff threshold
countincreasecutoff=1e-06
#Counts increment cutoff threshold for alignments in training of fertility models
countincreasecutoffal=1e-05
#minimal count increase
mincountincrease=1e-07
#relative cutoff probability for alignment-centers in pegging
peggedcutoff=0.03
#Probability cutoff threshold for lexicon probabilities
probcutoff=1e-07
#probability smoothing (floor) value
probsmooth=1e-07
#parameters for describing the type and amount of output:
#--------------------------------------------------------
#0: detailed alignment format; 1: compact alignment format
compactalignmentformat=0
#dump frequency of Model 1
model1dumpfrequency=0
#dump frequency of Model 2
model2dumpfrequency=0
#dump frequency of HMM
hmmdumpfrequency=0
#output: dump of transfer from Model 2 to 3
transferdumpfrequency=0
#dump frequency of Models 3/4/5
model345dumpfrequency=0
#for printing the n best alignments
nbestalignments=0
#1: do not write any files
nodumps=1
#1: write alignment files only
onlyaldumps=1
#0: not verbose; 1: verbose
verbose=0
#number of the sentence for which a lot of information should be printed (negative: no output)
verbosesentence=-10
#smoothing parameters:
#---------------------
#f-b-trn: smoothing factor for HMM alignment model (can be ignored by -emSmoothHMM)
emalsmooth=0.2
#smoothing parameter for IBM-2/3 (interpolation with constant)
model23smoothfactor=0
#smoothing parameter for alignment probabilities in Model 4
model4smoothfactor=0.4
#smoothing parameter for distortion probabilities in Model 5 (linear interpolation with constant)
model5smoothfactor=0.1
#smoothing for fertility parameters (good value: 64): weight for word-length-dependent fertility parameters
nsmooth=4
#smoothing for fertility parameters (default: 0): weight for word-independent fertility parameters
nsmoothgeneral=0
#parameters modifying the models:
#--------------------------------
#0 = IBM-3/IBM-4 as described in (Brown et al. 1993); 1: distortion model of the empty word is deficient; 2: distortion model of the empty word is deficient (differently); setting this parameter also helps to avoid that, during IBM-3 and IBM-4 training, too many words are aligned with the empty word
deficientdistortionforemptyword=0
#1 = only 3-dimensional alignment table for IBM-2 and IBM-3
compactadtable=1
#d_{=1}: &1:l, &2:m, &4:F, &8:E; d_{>1}: &16:l, &32:m, &64:F, &128:E
depm4=76
#d_{=1}: &1:l, &2:m, &4:F, &8:E; d_{>1}: &16:l, &32:m, &64:F, &128:E
depm5=68
#lextrain: dependencies in the HMM alignment model (&1: sentence length; &2: previous class; &4: previous position; &8: French position; &16: French class)
emalignmentdependencies=2
#f-b-trn: probability for the empty word
emprobforempty=0.4
#parameters modifying the EM algorithm:
#--------------------------------------
#fixed value for parameter p_0 in IBM-5 (if negative, it is determined in training)
m5p0=-1
manlexfactor1=0
manlexfactor2=0
manlexmaxmultiplicity=20
#maximum fertility for fertility models
maxfertility=10
#fixed value for parameter p_0 in IBM-3/4 (if negative, it is determined in training)
p0=0.999
#0: no pegging; 1: do pegging
pegging=0
#-------------- *** 5.3. TRAINING SCRIPT PARAMETERS ***---------------------------------------------
#Heuristic used for word alignment; possible values: intersect (intersection seems to be a synonym), union, grow, grow-final, grow-diag, grow-diag-final-and (default value), srctotgt, tgttosrc (Moses manual, pp. 72, 144)
alignment=grow-diag-final-and
#Reordering model; possible values: msd-bidirectional-fe (default), msd-bidirectional-f, msd-fe, msd-f, monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe, monotonicity-f (Moses manual, p. 77)
reordering=msd-bidirectional-fe
#Minimum length of the sentences (used by clean)
MinLen=1
#Maximum length of the sentences (used by clean)
MaxLen=60
#Maximum length of phrases entered into the phrase table (max: 7; choose a lower value if phrase table size is an issue; good value for most purposes: 3)
MaxPhraseLength=5
#-------------- *** 5.4. DECODER PARAMETERS ***-------------------------------------------------------
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Only used in the training evaluation, and only if tuning = 0 !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#***** QUALITY TUNING:
# Weight for the phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good translations of each other
weight_t=1
# Weight for the language model (good values: 0.1-1; default: 1); ensures that the output is fluent in the target language
weight_l=1
# Weight for the reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence
weight_d=1
# Weight for the word penalty (good values: -3 to 3; default: 0; negative values favour long output; positive values favour short output); ensures translations do not get too long or too short
weight_w=0
#-----------------------------------------
# Use Minimum Bayes Risk (MBR) decoding: 1 = Do; Any other value = Don't. Instead of outputting the translation with the highest probability, MBR decoding outputs the translation that is most similar to the most likely translations
mbr=0
# Number of translation candidates considered; MBR decoding uses by default the top 200 distinct candidate translations to find the translation with minimum Bayes risk
mbrsize=200
# Scaling factor used to adjust the translation scores (default: 1.0)
mbrscale=1.0
# Add walls around the punctuation marks ,.!?:;": 1 = Do; Any other value = Don't. Specifying reordering constraints around punctuation is often a good idea
monotoneatpunctuation=0
#***** SPEED TUNING:
# Fixed limit for how many translation options are retrieved for each input phrase (0 = no limit; positive value = number of translation options per phrase)
ttablelimit=20
# Use the relative scores of hypotheses for pruning, instead of a fixed limit (0 = no pruning; decimal value = more pruning)
beamthreshold=0
# Threshold for constructing hypotheses based on estimated cost (default: 0 = not used). During the beam search, many hypotheses are created that are too bad to be even entered on a stack; for many of them, it is clear even before construction that they would not be useful. Early discarding of such hypotheses hazards a guess about their viability, based on the correct score except for the actual language model costs, which are very expensive to compute. Hypotheses that, according to this estimate, are worse than the worst hypothesis of the target stack, even given an additional specified threshold as a cushion, are not constructed at all. This often speeds up decoding significantly. Try threshold factors between 0.5 and 1
earlydiscardingthreshold=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# To get faster performance than the default Moses settings at roughly the same quality, use the parameter settings searchalgorithm=1, cubepruningpoplimit=2000 and stack=2000. With cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by cubepruningpoplimit, i.e. the number of hypotheses added to each stack
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Search algorithm; cube pruning is faster than the traditional search at comparable levels of search errors: 0 = default; 1 = turn on cube pruning
searchalgorithm=0
# Number of hypotheses added to each stack; only a fixed number of hypotheses is generated for each span; default: 1000; higher numbers slow down the decoder but may result in better quality
cubepruningpoplimit=1000
# Size of the hypothesis stack that keeps the best partial translations (= beam); default: 100
stack=100
# Maximum phrase length (default: 20)
maxphraselen=20
# ****** SPEED AND QUALITY TUNING:
# Minimum number of hypotheses from each coverage pattern; you may also require that a minimum number of hypotheses is added for each word coverage (they may still be pruned out, however). This is done with the switch -cube-pruning-diversity, which sets the minimum; default: 0
cubepruningdiversity=0
# Distortion (reordering) limit in maximum number of words (0 = monotone; -1 = unlimited; any other positive value = maximal number of words; default: 6); limiting distortion often increases speed and quality
distortionlimit=6
#================= 6. TUNING PARAMETERS =================================
# Maximum number of tuning runs (-1 = no limit; any positive number = maximum number of runs)
maxruns=10
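The RandLM parameters in section 4.2 above are easier to judge with concrete numbers: $falsepos sets a false positive rate of 1/2^falsepos, and $values sets the CountRandLM quantisation log base to 2^(1/values). A quick arithmetic sketch (plain shell and awk; this is not part of the Moses for Mere Mortals scripts):

```shell
# Arithmetic sketch only: what the RandLM $falsepos and $values
# settings above mean in numbers.
falsepos=8
values=8
# false positive rate = 1 / 2^falsepos
awk -v f="$falsepos" 'BEGIN { printf "false positive rate: 1/%d = %.8f\n", 2^f, 1/2^f }'
# CountRandLM quantisation log base = 2^(1/values)
awk -v v="$values" 'BEGIN { printf "CountRandLM log base: 2^(1/%d) = %.6f\n", v, 2^(1/v) }'
```

With the default values above, the false positive rate is 1/256 (about 0.0039) and the quantisation base is about 1.0905, so raising $falsepos or $values trades memory for precision.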
4) translate-moses-irstlm-randlm script:
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Even if you are using the demonstration corpus, you have to fill in the $logfile parameter so that the script can be executed !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Name of the log file of the corpus to be used (time-saving tip: copy and paste it here; the default directory of the log files is $mosesdir/logs); example of a possible log file name: pt-en.C200000.for_train-60-1.LM-300000.MM-1.day-18-01-10-time-14-08-50.txt
logfile=
#Create a translation report when the translations are finished: 1 = Do; Any other value = Don't
create_translation_report=1
#--------------------------------------*** TMX OPTIONS ***----------------------------------------------------
#Process both the document to be translated and the Moses translation so that the machine translation can best be used in TMX translation memories
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!!! If you set this parameter to 1, you SHOULD NOT use the score-moses-irstlm-randlm script unless the $othercleanings, $improvesegmentation and $removeduplicates parameters are all set to 0 and $minseglen is set to -1, since this processing changes the order of the segments and can also make the source document have a number of segments different from that of the reference translation (namely because it can delete some segments and/or add new ones) !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
translate_for_tmx=0
#Minimal length of sentences: -1 = any length; any other value = segments shorter than $minseglen will be erased (!!! only active if translate_for_tmx = 1 !!!)
minseglen=-1
#Substitute tabulation signs with newlines and remove lines composed only of digits, spaces and parentheses (!!! only active if translate_for_tmx = 1 !!!)
othercleanings=1
#Substitute any of the characters [:;.!?] followed by a space with that character followed by a newline; delete empty lines; substitute double spaces with one space (!!! only active if translate_for_tmx = 1 !!!)
improvesegmentation=1
#Sort segments and remove those that are identical (!!! only active if translate_for_tmx = 1 !!!)
removeduplicates=1
#---------------------------------- *** MOSES DECODER PARAMETERS ***------------------------------
#***** QUALITY TUNING:
# Weight for the phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good translations of each other
weight_t=1
# Weight for the language model (good values: 0.1-1; default: 1); ensures that the output is fluent in the target language
weight_l=1
# Weight for the reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence
weight_d=1
# Weight for the word penalty (good values: -3 to 3; default: 0; negative values favour long output; positive values favour short output); ensures translations do not get too long or too short
weight_w=0
#-----------------------------------------
# Use Minimum Bayes Risk (MBR) decoding: 1 = Do; Any other value = Don't. Instead of outputting the translation with the highest probability, MBR decoding outputs the translation that is most similar to the most likely translations
mbr=0
# Number of translation candidates considered; MBR decoding uses by default the top 200 distinct candidate translations to find the translation with minimum Bayes risk
mbrsize=200
# Scaling factor used to adjust the translation scores (default: 1.0)
mbrscale=1.0
# Add walls around the punctuation marks ,.!?:;": 1 = Do; Any other value = Don't. Specifying reordering constraints around punctuation is often a good idea
monotoneatpunctuation=0
#***** SPEED TUNING:
# Fixed limit for how many translation options are retrieved for each input phrase (0 = no limit; positive value = number of translation options per phrase)
ttablelimit=20
# Use the relative scores of hypotheses for pruning, instead of a fixed limit (0 = no pruning; decimal value = more pruning)
beamthreshold=0
# Threshold for constructing hypotheses based on estimated cost (default: 0 = not used). During the beam search, many hypotheses are created that are too bad to be even entered on a stack; for many of them, it is clear even before construction that they would not be useful. Early discarding of such hypotheses hazards a guess about their viability, based on the correct score except for the actual language model costs, which are very expensive to compute. Hypotheses that, according to this estimate, are worse than the worst hypothesis of the target stack, even given an additional specified threshold as a cushion, are not constructed at all. This often speeds up decoding significantly. Try threshold factors between 0.5 and 1
earlydiscardingthreshold=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# To get faster performance than the default Moses settings at roughly the same quality, use the parameter settings searchalgorithm=1, cubepruningpoplimit=2000 and stack=2000. With cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by cubepruningpoplimit, i.e. the number of hypotheses added to each stack
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Search algorithm; cube pruning is faster than the traditional search at comparable levels of search errors: 0 = default; 1 = turn on cube pruning
searchalgorithm=0
# Number of hypotheses added to each stack; only a fixed number of hypotheses is generated for each span; default: 1000; higher numbers slow down the decoder but may result in better quality
cubepruningpoplimit=1000
# Size of the hypothesis stack that keeps the best partial translations (= beam); default: 100
stack=100
# Maximum phrase length (default: 20)
maxphraselength=20
# ****** SPEED AND QUALITY TUNING:
# Minimum number of hypotheses from each coverage pattern; you may also require that a minimum number of hypotheses is added for each word coverage (they may still be pruned out, however). This is done with the switch -cube-pruning-diversity, which sets the minimum; default: 0
cubepruningdiversity=0
# Distortion (reordering) limit in maximum number of words (0 = monotone; -1 = unlimited; any other positive value = maximal number of words; default: 6); limiting distortion often increases speed and quality
distortionlimit=6
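To make the $improvesegmentation option more concrete, here is a rough illustration of the cleaning it describes (a sketch assuming GNU sed, not the script's actual code): split after any of [:;.!?] followed by a space, delete empty lines, and collapse double spaces.

```shell
# Sketch of the $improvesegmentation cleaning (GNU sed assumed;
# not the actual Moses for Mere Mortals code).
printf 'First sentence. Second!  Third one? End\n' |
  sed 's/\([:;.!?]\) /\1\n/g' |  # split after punctuation + space
  sed '/^$/d' |                  # delete empty lines
  sed 's/  / /g'                 # collapse double spaces
```

The one-line input comes out as four lines, one segment per line, which is the shape TMX translation memories expect.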
5) score-moses-irstlm-randlm script:
#Base directory of your Moses installation (made with create-moses-irstlm-randlm)
mosesdir=$HOME/moses-irstlm-randlm
#Source language
lang1=pt
#Destination language
lang2=en
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! THIS SCRIPT SHOULD NOT BE USED WITH DOCUMENTS TRANSLATED BY THE translate-moses-irstlm-randlm SCRIPT WITH ITS $translate_for_tmx PARAMETER SET TO 1 IF ANY OF ITS $othercleanings, $improvesegmentation AND $removeduplicates PARAMETERS WAS SET TO A VALUE DIFFERENT FROM 0, OR IF $minseglen WAS SET TO A VALUE DIFFERENT FROM -1 !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Score documents prepared for TMX translation memories. If this parameter is set to 1, the script will look for the documents $s and $m in the $mosesdir/translation_files_for_tmx directory; if not set to 1, it will look for the $s document in the $mosesdir/translation_input directory and for the $m document in $mosesdir/translation_output; in both cases, it will look for the $r document in $mosesdir/translation_reference
scoreTMXdocuments=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the files should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# A USEFUL, THOUGH NOT OBLIGATORY, CONVENTION IS TO GIVE THE SOURCE, REFERENCE AND MOSES TRANSLATION FILES THE SAME BASE NAME, FOLLOWED BY A "SUFFIX": THE SOURCE LANGUAGE FOR THE SOURCE FILE; THE DESTINATION LANGUAGE PLUS THE PARTICLE "REF" FOR THE REFERENCE FILE; AND THE DESTINATION LANGUAGE PLUS THE PARTICLE "MOSES" FOR THE MOSES TRANSLATION (ex: mydoc.lang1, mydoc.lang2.ref, mydoc.lang2.moses)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Source document file name (omit the path!)
s=100.pt
#Reference (human-made) translation file name (omit the path!)
r=100.en.ref
#Moses translation file name (omit the path!)
m=100.pt.en.moses
#An arbitrary name used to identify your corpus when creating the input files used by the scorer; it will also be used in the scoring file name
testbasename="no-name"
#Create a report where each segment gets its own score: 0 = score the whole document; 1 = score each segment
score_line_by_line=0
#Remove Moses translation segments that are equal to reference translation segments and whose BLEU score is zero (!!! only active if score_line_by_line = 1 !!!)
remove_equal=1
#Tokenize the source document, the reference and the Moses translation
tokenize=1
#Lowercase the source document, the reference and the Moses translation
lowercase=1
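The naming convention suggested above can be sketched in a few lines of shell (the base name "mydoc" is only an example, and the pt/en pair is taken from the settings above):

```shell
# Sketch of the optional naming convention for the three scoring files.
base=mydoc
lang1=pt
lang2=en
s="$base.$lang1"          # source document:    mydoc.pt
r="$base.$lang2.ref"      # reference:          mydoc.en.ref
m="$base.$lang2.moses"    # Moses translation:  mydoc.en.moses
echo "$s $r $m"           # prints: mydoc.pt mydoc.en.ref mydoc.en.moses
```

Keeping the three names aligned this way makes it easy to see at a glance which reference and Moses translation belong to which source document.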
6) transfer-training-to-another-location-moses-irstlm-randlm script:
#Base directory of your Moses system (e.g., $HOME/moses-irstlm-randlm) whose trainings you want to transfer
mosesdirmine=$HOME/moses-irstlm-randlm
#Login name of the user to whom the trained corpora will be transferred
newusername=john
#Base directory of the Moses system of the user to whom the trained corpora will be transferred
mosesdirotheruser=/media/1.5T/moses-irstlm-randlm
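In other words, the three parameters just name a source tree, a destination tree and the receiving user; the actual transfer logic lives in the script. An illustration only (this echo is not part of the script):

```shell
# Illustration of what the three transfer parameters describe
# (the real script performs the transfer itself).
mosesdirmine=$HOME/moses-irstlm-randlm
newusername=john
mosesdirotheruser=/media/1.5T/moses-irstlm-randlm
echo "Transfer trainings from $mosesdirmine to $mosesdirotheruser (user: $newusername)"
```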