
Faculty of Applied Modern Languages

COURSE PAPER "DIFFICULTIES IN MACHINE AND AUTOMATIC TRANSLATION"

Written by:

Scientific advisor:

Chișinău 2005

Contents

Introduction
I. HISTORY OF MACHINE TRANSLATION
  I.1 General introduction
  I.2 Before the computer
  I.3 The first beginnings (1946-1949)
  I.4 Weaver's memorandum (1949)
  I.5 From Weaver to the first MT conference (1950-1952)
    I.5.1 First MT studies
    I.5.2 The decade of high expectation and disillusion, 1954-1966
    I.5.3 The ALPAC report and its consequences, 1966-1980
    I.5.4 The 1980s
    I.5.5 The early and the late 1990s
II. COMMON ERRORS IN MACHINE TRANSLATION
  II.1 The quality of translation
  II.2 Mechanical dictionaries
  II.3 Polysemy and semantics
  II.4 Morphological analysis
  II.5 Syntactic analysis
  II.6 Formal syntax and transformational grammar
  II.7 Syntactic ambiguity and discourse relations
  II.8 Sentences and texts
  II.9 Transfer and synthesis
  II.10 System designs and strategies
  II.11 Respective and influences
III. DIFFICULTIES IN MACHINE TRANSLATION
  III.1 Difficulties in translation
  III.2 Machine translation ambiguity
  III.3 Problems of machine translation
General conclusion
Appendix 1
Appendix 2
Appendix 3

Bibliography

Introduction
People who need documents translated often ask themselves whether they could use a computer to do the job. When a computer translates an entire document automatically and then presents it to a human, the process is called machine translation. When a human composes a translation, perhaps calling on a computer for assistance in specific tasks such as looking up specialized words and expressions in a dictionary, the process is called human translation. There is a gray area between human and machine translation, in which the computer may retrieve whole sentences of previously translated text and make minor adjustments as needed. However, even in this gray area, each sentence was originally the result of either human translation or machine translation. I will speak of "machine translation" only in the case when a computer performs both the initial translation of the sentences and the subsequent manipulations; everything else I will call simply "translator tools". This paper begins with a concise history of machine and computer-assisted translation, followed by a brief analysis of the types of errors and difficulties that appear in the process of machine translation. It then describes the technology available to translators in this first decade of the twenty-first century and examines the negative and positive aspects of machine translation and of the main tools used in computer-assisted translation: electronic dictionaries, glossaries, terminology databases, concordances, on-line bilingual texts and translation memories. Here I selected the results concerning what the evaluators regarded as translation errors caused by the input and the errors caused by malfunctioning of the machine translation system. The analysis of the data is carried out by systematizing the classification and comments of the evaluators regarding the translation errors and the translation difficulties, and by quantifying the results. My aim is to determine which characteristics of the text are attributable to the writer's intention to use language that differed from formal

norms, and which characteristics are attributable to other factors, focusing on language contact, a significant aspect of the area under study. Another important aim of mine is to introduce some pieces of advice which are helpful and necessary for a good and productive translation. This paper presents the results of a study of the problems of automatic language translation. Indeed, the future value of research on automatic translation might well hinge more on its contributions to a fundamental understanding of all levels of language structure and of the nature of automatic information processing than on any resulting machine-produced translations. My investigations have been confined to translation errors and difficulties. There has been much speculation about general methods of translation, whether directly between pairs of languages or via some natural or artificial intermediate language. It is known that we are as yet too ill-equipped for a frontal assault on the general problem, and that studies of individual language pairs are more promising. But I do believe that we will soon have translation technology productive and reliable enough to need no pre- or post-editing. For investigating the machine translation process I have used the descriptive and contextual methods. Another objective of my paper is to present to the reader and to investigate the classification of errors made in the process of automatic translation. In my paper I have included some diagrams, which show the process of automatic and human translation, and some charts with an algorithm for automatic translation. Still, translation is a very difficult task, requiring much feeling for and understanding of cultural aspects, which computers lack.

I. HISTORY OF MACHINE TRANSLATION

1.1 General introduction

Machine translation is the application of computers to the translation of texts from one natural language into another. There have been many different reasons for attempting it. The principal reason is a severely practical one: scientists, technologists, engineers, economists, agriculturalists, administrators, industrialists, businessmen, and many others have to read documents and have to communicate in languages they do not know; and there are just not enough translators to cope with the ever-increasing volume of material which has to be translated. Machine translation would ease the pressure. Secondly, many researchers have been motivated by idealism: the promotion of international cooperation and peace, the removal of language barriers, the transmission of technical, agricultural and medical information to the poor and developing countries of the world. Thirdly, by contrast, some sponsors of machine translation activity have seen its importance in military and intelligence contexts: to help them find out what the enemy knows. Fourthly, there are pure research reasons: to study the basic mechanisms of language and mind, to exploit the power of the computer and to find its limitations. Finally, there are simple commercial and economic motives: to sell a successful product, or to maintain a high standard of living in a competitive world. At certain periods in the nearly forty years of the history of machine translation, some of these motives have been more prominent than others. In the United States during the 1950s and 1960s, fear of Soviet technological prowess (particularly after the launch of the first sputnik in 1957) stimulated much governmental and military support for Russian-English translation. In the 1970s the multilingual problems of the European Communities encouraged research on translation systems to deal with administrative, economic and technical

documentation within the languages of the communities. At the same time, commercial interests began to gather pace. In the 1980s the Japanese fifth generation project, in which machine translation plays an important role, has been launched to establish for Japan a major position in the future world economy. Throughout, however, there have always been researchers motivated by idealism and by scientific curiosity, and there have been sponsors willing to support basic research. Machine translation was one of the first non-numerical applications of computers. For more than a decade until the mid 1960s it was an area of intensive research activity and the focus of much public attention; but early expectations were not fulfilled, promises of imminent commercially viable systems came to nothing, and the problems and linguistic complexities became increasingly apparent and seemed to be ever more intractable. After a widely publicized report compiled for the major US sponsors, the notorious ALPAC report, machine translation was generally considered to have been a failure, and no longer worthy of serious scientific consideration. Critics and skeptics have been fond of repeating alleged mistranslations, howlers that no human translator would perpetrate, in order to ridicule the whole enterprise. The most popular example has been a story involving the translation of two idioms from English into Russian and then backs again from Russian into English: Out of sight, out of mind, and the spirit is willing but the flesh is weak. According to some accounts the first came back as invisible insanity and the second was as The whiskey is all right but the meat has gone bad; according to others, however, the versions were Invisible and insane and The vodka is good but the meat is rotten; and yet others have given invisible lunatics and the ghost is willing but the meat is feeble. There have been various other permutations and variants; such variety is typical of hearsay, and indeed, some accounts give the languages as German and English, and others assert they were Chinese and English. Nevertheless, the supposed translations are repeated to the present day as genuine examples of the literal-mindedness of machine translation. It would seem that a likely source was an article by John A. Kouwenhoven The trouble with translation in Harper's Magazine for August 1962: Our own

attempts to communicate with the Russians in their language may be no more successful. Thanks to Robert E. Alexander, the architect, we can pass along this cheering bit of news. According to Colonel Vernon Walters, President Eisenhower's official interpreter, some electronic engineers invented an automatic translating machine into which they fed 1,500 words of Basic English and their Russian equivalents, claiming that it would translate instantly without the risk of human error. In the first test they asked it to translate the simple phrase: Out of sight, out of mind. Gears spun, lights blinked, and the machine typed out in Russian 'Invisible Idiot'. On the theory that the machine would make a better showing with a less epigrammatic passage, they fed it the scriptural saying: The spirit is willing, but the flesh is weak. The machine instantly translated it, and came up with 'The liquor is holding out all right, but the meat has spoiled'." It is a good story, but its superficial plausibility is damaged by the lack of any evidence of a US system at the time which could translate from English into Russian (for obvious reasons the Americans wanted to translate from Russian into English), and by the discovery that both examples were familiar apocrypha of translation before there were any machine translation systems in operation. For example, in April 1956, E. H. Ullrich was reported as saying: "Perhaps the popular Press is the most attractive outlet for mechanical translations, because it does not really matter whether these are right or wrong, and amusing versions such as 'the ghost wills but the meat is feeble' might make mechanical translation into a daily feature as indispensable as the cross-word puzzle". From the mid-1960s research on machine translation continued at a reduced level, largely ignored and forgotten not only by the general public but even by linguists and computer scientists. In recent years, however, the situation has changed. There are now operational systems in a number of large translation bureaus and agencies; computers are producing readable translations for administrators, scientists, and technicians at ever increasing volumes; translation systems are being marketed on a commercial basis for microcomputers; many translators are now becoming familiar with machine translation systems and with machine aids; and there is growing scientific interest in machine translation within the Artificial Intelligence

community in the United States, in Japan and elsewhere. Machine translation is now a reality and can no longer be dismissed. With distant memories of the failure of machine translation in the 1950s and 1960s, supported by apocryphal translation horrors, there are still many who do not believe that computers can translate. It is true that few systems would pass the Turing test by producing translations that could never be distinguished from the output of fluent human translators. This paper contains numerous examples of translations produced by computer programs: some are clearly unacceptable texts by whatever criteria; others, however, are the equal of some human translations and would not be readily identified as computer renditions. The question of how good machine translation should be in order to qualify as true translation is a particularly thorny one, and still not really resolved. What matters in most cases is whether the translation serves the needs of the recipient: a rough translation (human or machine produced) might be quite adequate on some occasions; on others only a perfect finished version is acceptable. Judgments of quality are necessarily both subjective and highly constrained by personal needs and attitudes. What are probably most surprising to those unfamiliar with the complexities of machine translation are examples of errors that no human translator, however inexperienced, would ever make. A genuine, not apocryphal, howler from the Systran system is cited by Wheeler & Lawson (1982): la Cour de Justice considère la création d'un sixième poste d'avocat général ('the Court of Justice is considering the creation of a sixth post of advocate general') was rendered as the Court of Justice is considering the creation of a sixth general avocado station. Such examples reassure translators that there is no danger of their being taken over by computers, and these fears are real among some translators. This paper will show that machine translation is not a threat: it is not an insidious, dehumanizing, destructive monster, and it is not a Golem astride the Tower of Babel. Machine translation should be seen as a useful tool that can relieve translators of the monotony of much technical translation and spare them the wasteful expenditure of much tedious effort on documents of ephemeral or marginal interest. Translators can then be employed where their skills are most wanted: in the translation of sensitive diplomatic and legal documents, and in the translation of cultural and literary texts.

The term machine translation has now established itself as the generally accepted name for any system that uses an electronic computer to transform a text in one language into some kind of text in another natural language. The related term machine-aided translation, designating the use of mechanized aids for translation, has likewise established itself, by and large, as the generally accepted term. Researchers and writers have also commonly used the alternative terms mechanical translation and automatic translation, but these are now more rarely encountered. For many writers the phrase mechanical translation suggests translation done in an automaton-like manner by a human translator, and this has been the primary reason for the dropping of this term. While in English-speaking countries the use of automatic translation has generally been much less common than machine translation, this nomenclature is the only possibility for the French and the Russians, who have no direct equivalent for machine translation. In the earlier periods there was often talk of translating machines, but since the realization that computers do not have to be designed specifically to function as translators this usage has died away. In recent years there has been increasing use of the terms computer translation and computer-aided translation - terms which are certainly more accurate than machine translation and machine-aided translation - but in this paper the traditional, long-established, and still most common term machine translation will be used, abbreviated throughout in the customary way as MT. A number of other common terms need also to be defined at the outset. Firstly, it has now become accepted practice to refer to the language from which a text is being translated as the source language (SL), and the language into which the text is being translated as the target language (TL). Secondly, there are now commonly accepted terms for the processes involved: analysis procedures accept source language texts as input and derive representations from which synthesis procedures produce or generate texts in the target language as output. These processes may involve various aspects of language structure: morphology is concerned with the inflectional forms and derivational variants of words or lexical items, syntax is concerned with the ways in which words combine in sentence structures, and semantics is concerned with meaning relationships among sentences and texts.
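As a rough illustration of how these terms fit together, here is a minimal sketch in Python of the analysis-transfer-synthesis division of labour; the three-word French-English lexicon and the single reordering rule are invented for the example and stand in for the far richer components of a real system.

```python
# Toy illustration of analysis, transfer and synthesis (invented data).
LEXICON = {"la": "the", "maison": "house", "blanche": "white"}  # SL -> TL
ADJECTIVES = {"white"}  # needed by the one (invented) synthesis rule below

def analyse(sl_text: str) -> list[str]:
    """Analysis: segment the source-language (SL) text into lexical items."""
    return sl_text.lower().strip(".").split()

def transfer(sl_items: list[str]) -> list[str]:
    """Transfer: replace each SL item by a target-language (TL) equivalent."""
    return [LEXICON.get(word, word) for word in sl_items]

def synthesise(tl_items: list[str]) -> str:
    """Synthesis: apply a single syntactic rule (move an adjective in front
    of the noun it follows, undoing French order) and generate the TL text."""
    items = tl_items[:]
    for i in range(len(items) - 1):
        if items[i + 1] in ADJECTIVES:  # noun + adjective -> adjective + noun
            items[i], items[i + 1] = items[i + 1], items[i]
    return " ".join(items).capitalize() + "."

print(synthesise(transfer(analyse("La maison blanche."))))  # The white house.
```

Real systems differ chiefly in how much linguistic knowledge each of the three stages carries; the sketch only fixes the vocabulary used in the rest of this paper.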

Other terms will be introduced and defined as they arise. Between fully automatic translation on the one hand and human translation on the other there are a number of intermediate possibilities involving various kinds of collaboration between man and machine. The intervention can take place before, during or after the machine processes. There can be human preparation of the input, or pre-editing in the MT jargon; there can be human revision of the output, or post-editing. There can be collaboration during the translation processes, when a human assistant may be asked by the computer to resolve problems which it cannot deal with. Finally, a translator may do most of the work alone and call upon the machine to assist with problems of terminology. We may then refer to: machine translation proper, MT with post-editing, MT with edited or restricted input, human-aided MT, machine-aided human translation, and human translation with no machine aids. The dividing line between some interactive MT systems and machine-aided translation is blurred on occasions, but in most cases there is little dispute. This historical survey includes some details about the development of mechanized aids for translating, i.e. primarily automatic dictionaries of various kinds, but it does not include aspects of natural language processing which are not directly concerned with the translation problem. Hence, although nearly all computational linguistics, natural language processing in Artificial Intelligence and certain approaches to automatic indexing and abstracting have their origins in MT research, these offshoots of MT will not be covered. Obviously, the field has to be restricted in some way. The consequence of this restriction is that methods of potential relevance to MT problems will not be dealt with in any detail if they have not in fact been applied in any MT system. Likewise, research which may have been seen at one time as of potential relevance to MT but which in fact did not lead to any kind of MT system will by and large be ignored. Another area which must be excluded, for obvious reasons, is the development of computer technology and programming, except where these developments have direct bearing on particular features of MT systems or MT research methodology. An attempt has been made to be as comprehensive and as balanced as possible in the evaluation of the contributions of the many MT projects in the forty-year history of MT research.

1.2 Before the computer

The use of mechanical devices to overcome language barriers was first suggested in the 17th century. There were two stimulants: the demise of Latin as a universal language for scientific communication, and the supposed inadequacy of natural languages to express thought succinctly and unambiguously. The idea of universal languages arose from a desire both to improve international communication and to create a rational or logical means of scientific communication. Suggestions for numerical codes to mediate among languages were common. Leibniz's proposals in the context of his monadic theory are perhaps the best known. Descartes made another proposal in his comments on an anonymous correspondent's scheme for a universal language. In a letter to Marin Mersenne of 20 November 1629, Descartes described a proposed universal language in the form of a cipher where the lexical equivalents of all known languages would be given the same code number. Descartes wrote: "Mettant en son dictionnaire un seul chiffre qui se rapporte à aymer, amare, philein, et tous les synonymes, le livre qui sera écrit avec ces caractères pourra être interprété par tous ceux qui auront ce dictionnaire" (by putting in the dictionary a single number referring to aymer, amare, philein, and all the synonyms, the book written with these characters could be interpreted by all who have the dictionary). At the height of enthusiasm about machine translation in the early 1960s, some writers saw these 17th-century proposals as genuine forerunners of machine translation. Becher's book, for example, was republished under the title Zur mechanischen Sprachübersetzung: ein Programmierungsversuch aus dem Jahre 1661 (Becher 1962), indicating the conviction of its editor that Becher's ideas foreshadowed certain principles of machine translation. Apart from an ingenious script, Becher's book is distinguished from others of this kind only by the size of the dictionary: 10,000 Latin words were provided with coding. Like others, however, Becher failed to tackle the real difficulties of providing equivalent entries in other languages (Greek, Hebrew, German, French, Slav, and Arabic were proposed) and the necessary means to cope with syntactic differences. The vast work by John Wilkins, An Essay towards a Real Character and a Philosophical Language (1668), was a more genuine attempt at a universal language in that it sought to provide a logical or rational basis for the establishment of inter-language equivalences. Wilkins' aim was a regular

enumeration and description of all those things and notions, to which marks or names ought to be assigned according to their respective natures, i.e. a codification which embodied a universal classification of concepts and entities, a genuine interlingua. All these writers recognized the problems of genuine differences between languages that could not be captured completely in dictionaries, however logically constructed. Many of them, like Kircher, advised their fellows to write in a simple style and avoid rhetorical flourishes. Suggestions for mechanical dictionaries on numerical bases continued to be made throughout the following centuries until the middle of the 20th century. Couturat and Leau in their Histoire de la langue universelle (1903) list numerous examples, including one by W. Rieger entitled Zifferngrammatik, welche mit Hilfe der Wörterbücher ein mechanisches Uebersetzen aus einer Sprache in alle anderen ermöglicht (code-grammar which, with the help of dictionaries, enables mechanical translation from one language into all others); a title which links the present mechanical age to the 17th century. As the reference to Couturat and Leau implies, all these apparent precursors of MT should be regarded more accurately as contributions to the ideal of a universal language and to the development of international auxiliary languages, of which the best known is now Esperanto. Both concepts have in fact inspired many of those engaged in machine translation. None of these proposals involved the construction of machines; all required the human translator to use the tools provided in a mechanical fashion, i.e. for man to simulate a machine. It was not until the invention of mechanical calculators in the 19th and 20th centuries that an automatic device could be envisaged which could perform some translating processes. In fact, the first explicit proposals for translating machines did not appear until 1933, when two patents were issued independently in France and Russia. In both cases, the patents were for mechanical dictionaries. A French engineer of Armenian extraction, Georges Artsrouni, was issued a patent on 22nd July 1933 for a translation machine which he called a Mechanical Brain. The invention consisted of a mechanical device, worked by an electric motor, for recording and retrieving information on a broad band of paper which passed behind a keyboard. The storage device had a capacity of several thousand characters, and was

envisaged by its inventor for use with railway timetables, bank accounts, commercial records of all sorts, and in particular as a mechanical dictionary. Each line of the broad tape would contain the entry word (SL word) and equivalents in several other languages (TL equivalents); corresponding to each entry were coded perforations on a second band, either paper or metal, which functioned as the selector mechanism. A prototype machine was exhibited and demonstrated in 1937; the French railway administration and the post and telegraph services showed considerable interest, and only the outbreak of the Second World War prevented the installation of Artsrouni's invention. More important in retrospect was the patent issued in Moscow on 5 September 1933 to Petr Petrovich Smirnov-Troyanskii for the construction of a machine for the selection and printing of words while translating from one language into another or into several others simultaneously. Troyanskii envisaged three stages in the translation process; the machine was involved only in the second stage, performing as an automated dictionary. In the first stage a human editor knowing only the source language was to analyze the input text into a particular logical form: all inflected words were to be replaced by their base forms (e.g. the nominative form of a noun, the infinitive form of a verb) and ascribed their syntactic functions in the sentence. For this process Troyanskii had devised his own logical analysis symbols. In the second stage the machine was designed to transform sequences of base forms and logical symbols of source texts into sequences of base forms and symbols of target languages. In the third stage an editor knowing only the target language was to convert this sequence into the normal forms of his own language. Troyanskii envisaged both bilingual and multilingual translation. Although the machine was assigned the task only of automating the dictionary, it is interesting to note that Troyanskii believed that the process of logical analysis could itself be mechanized, by means of a machine specially constructed for the purpose (quoted by Panov 1960a). It was this vision of the next steps beyond a simple mechanical dictionary that marks Troyanskii's proposal as a genuine precursor of machine translation. In the 1933 patent, the technical implementation proposed was a purely mechanical device, a table over which passed a tape listing in vertical columns

equivalent words from various languages. But by 1939 he had added an improved memory device operating with photo-elements (Delavenay 1960; Mounin 1964), and by May 1941 it appears that an experimental machine was operational. Troyanskii in fact went further towards the electronic computer: in 1948 he had a project for an electro-mechanical machine similar to the Harvard Mark I machine, which was developed between 1938 and 1942 and is regarded as a forerunner of the ENIAC computer. Troyanskii was clearly ahead of his time; Soviet scientists and linguists failed to respond to his proposal when he sought their support in 1939, and the Institute of Automation and Telemechanics of the Academy of Sciences was equally unforthcoming in 1944. In retrospect, there seems to be no doubt that Troyanskii would have been the father of machine translation if the electronic digital calculator had been available and the necessary computer facilities had been ready. History, however, has reserved for Troyanskii the fate of being an unrecognized precursor; his proposal was neglected in Russia and his ideas had no direct influence on later developments; it is only in hindsight that his vision has been recognised.

1.3 The first beginnings (1946-1949)

The electronic digital computer was a creation of the Second World War: the ENIAC machine at the Moore School of Electrical Engineering in the University of Pennsylvania was built to calculate ballistic firing tables; the Colossus machine at Bletchley Park in England was built to decipher German military communications. Immediately after the war, projects to develop the new calculating machines were established at numerous centers in the United States and Great Britain. The first applications were naturally in the fields of mathematics and physics, but soon the enormously wider potential of the "electronic brain" was realized and non-numeric applications began to be contemplated. The first suggestion that electronic computers could be used to translate from one language into another seems to have been made during conversations in New York between Andrew D. Booth and Warren Weaver. Warren Weaver was at this time vice president of the Rockefeller Foundation. During the war Weaver had served on a scientific mission to investigate Britain's

weapons development, and at the Rockefeller Foundation he was closely involved in the sponsorship of computer research and development. Booth had become interested in automatic digital calculation while working at the British Rubber Producers Research Association in Welwyn Garden City, and had started to build a machine for crystallographic calculations. In 1945 he was appointed a Nuffield Fellow in the Physics Department at Birkbeck College in the University of London under Professor J. D. Bernal, where he constructed a relay calculator during 1945 and 1946 and initiated plans for computational facilities in the University of London. As a consequence of this work and the efforts of Bernal, he obtained funds to visit the United States in 1946 under the auspices of the Rockefeller Foundation. There he visited all the laboratories engaged in computer research and development, at Princeton, MIT, Harvard, and Pennsylvania. The discussions that Booth had with Warren Weaver on this visit concerned only the question of acquiring American techniques for building a machine for the University of London. Booth then submitted a report on computer development with particular reference to x-ray crystallography, and he was offered a Rockefeller fellowship to enable him to work at an institution of his own choice in the United States the following year. Booth selected the von Neumann group at the Institute for Advanced Study, Princeton University. According to Booth (1985): "The discussion then was entirely on the question of the Rockefeller Foundation financing a computer for the University of London, and Weaver pointed out that there was very little hope that the Americans would fund a British computer to do number crunching, although they might be interested if we had any additional ideas for using the machine in a non-numeric context." In the mid-1940s Booth had already thought about non-numerical applications in conversations with A. M. Turing; one of these was in fact translation, although at that time he had thought only of using the machine as a dictionary. Weaver suggested treating translation as a cryptography problem. Weaver had in fact already, on 4th March 1947, just before this meeting with Booth, written to Norbert Wiener of the Massachusetts Institute of Technology, one of the pioneers in the mathematical

theory of communication, about the possibility of machine translation. Weaver wrote that, recognizing fully, even though necessarily vaguely, the semantic difficulties caused by multiple meanings, he had wondered whether it was unthinkable to design a computer which would translate; even if it translated only scientific material, and even if it produced an inelegant result, it would seem worthwhile. Knowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography, he naturally wondered whether the problem of translation could conceivably be treated as a problem in cryptography. It is evident that the first serious discussions and investigations of the possibilities of machine translation took place during 1947, beginning with Weaver's letter to Wiener and his meeting with Booth in early March. However, when writing the 'Historical introduction' to the MT collection he edited with Locke in 1955, Booth recollected the first mention of MT as having occurred during his 1946 visit. This has been generally accepted as the birth date of MT; however, in other later publications Booth gives the date 1947, and he has now confirmed the March 1947 meeting as the one at which the MT discussion with Weaver occurred. Booth was among the first to suggest this possible use of computers; he may also legitimately be regarded as a pioneer of what is now known as Artificial Intelligence. In an essay written during September 1947, Booth mentions a number of possible ways in which the new computers could demonstrate their intelligence: 1. various games, e.g. chess, noughts and crosses, bridge, poker; 2. the learning of languages; 3. translation of languages; 4. cryptography; 5. mathematics. Evidently, Weaver and Turing were thinking along similar lines independently, and probably others too. As there were no facilities available at Birkbeck College, Booth began construction of a small computer at the laboratories of the British Rubber Producers Research Association in Welwyn Garden City near London. The machine was operational by 12th May 1948, and a demonstration was given on 25th May to Warren Weaver and Gerard Pomerat, also of the Rockefeller Foundation. On

this occasion Weaver met Richard H. Richens, with whom Booth had been collaborating in experiments on mechanical dictionaries. Richens had first met Booth on 11th November 1947. His interest in mechanical translation had arisen independently out of experiments with punched cards for storing information at the Commonwealth Bureau of Plant Breeding and Genetics, where he was Assistant Director: "The idea of using punched cards for automatic translation arose as a spin-off, fuelled by my realization as editor of an abstract journal, Plant Breeding Abstracts, that linguists conversant with the grammar of a foreign language and ignorant of the subject matter provided much worse translations than scientists conversant with the subject matter but hazy about the grammar." Richens is to be credited with the first suggestion of the automatic grammatical analysis of word-endings. He proposed segmenting words into their stems and endings, both to reduce the size of dictionaries and to introduce grammatical information into a dictionary translation system. For example, in the case of the Latin verb amat a search was made for the longest matching stem, i.e. am-, and for the ending -at. The stem provides the English translation love and the ending gives the grammatical information 3rd person singular. In this way grammatical annotations augment a strict word-by-word dictionary translation. The validity of the approach was tested by hand and by using punched card machinery on a wide variety of languages; the texts were taken from abstracts in plant genetics. From the French text: Il n'est pas étonn*ant de constat*er que les hormone*s de croissance ag*issent sur certain*es espèce*s, alors qu'elles sont in*opér*antes sur d'autre*s, si l'on song*e à la grand*e spécificité de ces substance*s (where the stars indicate automatic segmentations), the English output was: v not is not/step astonish v of establish v that/which? v hormone m of growth act m on certain m species m, then that/which? v not operate m on of other m if v one dream/consider z to v great v specificity of those substance m (where v indicates a French word not translated, m 'multiple, i.e. plural or dual', z 'unspecific', and slashes separate alternative translations). These tentative experiments by Booth and Richens were known to very few. Weaver's memorandum, which was literally the first suggestion experts had seen that the new electronic computers could

be used as translating machines, launched machine translation as a scientific enterprise in the United States and subsequently elsewhere. Its historic impact is unquestionable; in his memorandum Weaver dates the origin of his speculations about MT to his wartime experience with electronic computers and to stories of startling achievements in cryptanalysis using computers.
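Richens' stem-and-ending analysis described above can be captured in a few lines; the following Python sketch uses an invented two-entry Latin dictionary and ending table, so the data (though not the longest-match idea itself) are assumptions of mine.

```python
# Sketch of Richens-style stem + ending dictionary lookup (invented entries).
STEMS = {"am": "love", "laud": "praise"}              # stem -> English gloss
ENDINGS = {"at": "3rd person singular", "ant": "3rd person plural"}

def translate_word(word: str) -> str:
    """Search for the longest matching stem that leaves a known ending."""
    for cut in range(len(word) - 1, 0, -1):           # longest stem first
        stem, ending = word[:cut], word[cut:]
        if stem in STEMS and ending in ENDINGS:
            return f"{STEMS[stem]} ({ENDINGS[ending]})"
    return "v " + word  # untranslated word, echoing Richens' 'v' annotation

print(translate_word("amat"))     # love (3rd person singular)
print(translate_word("laudant"))  # praise (3rd person plural)
```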

1.4 Weaver's memorandum (1949)

Weaver's memorandum concentrated more on the general strategies and long-term objectives of MT than on the more technical problems Booth and Richens had been tackling. Because of its historic importance it is worth enumerating in some detail the issues and problems raised by Weaver. He raised four points: the problem of multiple meaning, the logical basis of language, the application of communication theory and cryptographic techniques, and the possibilities of language universals. The problem of deciding which specific meaning an ambiguous word has in a particular text was, he suggested, solvable in principle if a sufficient amount of the immediate context is taken into account. The practical question of how much context is necessary could be answered by a statistical study of different types of texts on a variety of subject matters. Weaver explicitly rejected the idea of actually storing long sequences of words in dictionaries for this purpose, but did suggest that some reasonable way could be found of using the micro-context to settle the difficult cases of ambiguity. He expressed optimism about finding logical aspects in languages. Weaver believed that the translation problem could be largely solved by statistical semantic studies. For Weaver the most promising approach of all was the investigation of language invariants or universals. He linked this again with logical structures of language and with probabilistic uniformities. The analogy he suggested was of individuals living in a series of tall closed towers, all erected over a common foundation. Perhaps the way is to descend, from each language, down to the common base of human communication - the real but as yet undiscovered universal language.

1.5 From Weaver to the first MT conference (1950-1952)

Weaver's memorandum brought to the attention of a wide circle the possibilities of a new and exciting application of the computers whose potentialities were being discovered and proclaimed with enthusiasm and optimism at this time. But it did more: it indicated potentially fruitful lines of research in statistical analyses of language, in the logical bases of language, and in semantic universals of language. In addition, it pointed to some actual, even if tentative, achievements in the work of Booth and Richens. It was, however, received with considerable skepticism by many linguists, who rejected it for its naivety in linguistic matters and for its unfounded assumptions on the logicality of language, and who were naturally skeptical about the possibility of formalizing language and translation processes. The press had also noticed the memorandum. Booth's APEXC computer program was described as an electronic translator on which an operator could select which of a dozen or more languages he desired to translate: as fast as he could type the words, say, in French, the equivalent in Hungarian or Russian would issue on the tape.

1.5.1 First MT studies

Weaver's own favored approach, the application of cryptanalytic techniques, was immediately recognised as mistaken. Confusion between the activities of deciphering and translation arises whenever the same person does both. Obviously, no translating is involved when an English-speaking recipient deciphers an English message into English. Likewise, the decipherment of the highly complex Enigma code used by Germany in the Second World War, with its immensely complex sequences of manipulations and transpositions, was not translation; it was only after the German texts had been deciphered that they were translated. The Colossus computers at Bletchley Park were applied to cracking the cipher, not to translating the German text into English. In practice, the cryptanalyst generally knows what the language of the texts to be deciphered is, and often what their content is likely to be and the circumstances in which the message was transmitted. All this helps him to guess which letters and words are likely to be most frequent in the text. In the case

cited by Weaver, the decipherment was based on the frequencies of letters, pairs of letters, etc. in English; fortunately these were much the same in Turkish, and the original could be interpreted. Though the cryptanalytic approach was mistaken, there were sufficient stimulating ideas in Weaver's paper to launch MT as a serious line of research in the United States. During the next two years, individuals and groups began MT studies at a number of locations: the Massachusetts Institute of Technology (MIT), the University of Washington in Seattle, the University of California at Los Angeles (UCLA), the National Bureau of Standards (NBS), also in Los Angeles, and the RAND Corporation nearby at Santa Monica. On 10th January 1950, Erwin Reifler circulated privately the first of a series of studies on MT. Reifler was a Sinologist of German origin, head of the Department of Far Eastern and Slavic Languages and Literature at the University of Washington in Seattle. Recognizing the problem of multiple meanings as an obstacle to word-for-word translation of the kind attempted by Booth and Richens, Reifler introduced the concepts of pre-editor and post-editor. The human pre-editor would prepare the text for input to the computer, and the post-editor would resolve residual problems and tidy up the style of the translation. One suggestion was that the pre-editor should indicate the grammatical category of each word in the source language (SL) text by adding symbols or diacritic marks, e.g. to distinguish between the noun convict and the verb convict. The post-editor's role was to select the correct translation from the possibilities found by the computer dictionary and to rearrange the word order to suit the target language. As we shall see, the concepts of pre-editor and post-editor recur in one form or another throughout the development of MT research. Following Weaver's suggestion for statistical studies of micro-context for resolving problems of multiple meaning, Abraham Kaplan at the RAND Corporation investigated polysemy in mathematics texts. A group of test subjects were presented with a set of words, each with a number of possible meanings, and asked to select the most applicable sense. Kaplan limited the test to nouns, verbs and adjectives on the assumption that these are the major carriers of the content of any discourse and probably exhibit ambiguities most markedly. Each word was presented first in isolation, then together with the preceding and following word, and finally in the whole sentence. It was found that the most practical context is one word on each side, increased to two if one of the context words is a particle, i.e. an article, preposition or conjunction.
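Kaplan's rule of thumb lends itself to a simple sketch. In the Python fragment below, the particle list and the one-word sense table are invented placeholders; only the window-widening rule itself comes from Kaplan's finding.

```python
# Sketch of micro-context sense selection following Kaplan's finding
# (the sense table and particle list are invented placeholders).
PARTICLES = {"the", "a", "an", "of", "in", "on", "and"}
SENSES = {("bank", "river"): "river bank", ("bank", "money"): "financial bank"}

def context_window(tokens: list[str], i: int) -> list[str]:
    """One word each side, widened to two on a side whose neighbour is a particle."""
    left = 2 if i > 0 and tokens[i - 1] in PARTICLES else 1
    right = 2 if i + 1 < len(tokens) and tokens[i + 1] in PARTICLES else 1
    return tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]

def resolve(tokens: list[str], i: int) -> str:
    for neighbour in context_window(tokens, i):
        if (tokens[i], neighbour) in SENSES:
            return SENSES[(tokens[i], neighbour)]
    return tokens[i]  # unresolved: in early systems, left to the post-editor

print(resolve("the river bank was muddy".split(), 2))  # river bank
```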

Despite its limitations and deficiencies and the tentativeness of its conclusions, this study encouraged hopes that problems of ambiguity were tractable and that statistical analyses could contribute useful linguistic data for MT. In the latter half of 1950, a survey was conducted by W. F. Loomis on behalf of Weaver to find out who was interested in MT and what research was underway. The survey revealed a surprising amount of activity already; apart from Booth, Richens and Reifler, two groups had been set up in California. One was at the RAND Corporation in Santa Monica under J. D. Williams, where Kaplan's paper was to be the first of a series of MT studies. Harry D. Huskey of the National Bureau of Standards in Los Angeles had formed the other, with the intention of using the SWAC computer for MT research. The group included Victor A. Oswald of the German Department at UCLA and William E. Bull of the UCLA Spanish Department, and was soon joined by Kenneth E. Harper of the UCLA Slavic Languages Department. In support of its work, the group received some funds from the Rockefeller Foundation in July 1951. Since it was clear that word-for-word translation of a language like German would produce obviously unsatisfactory results, Oswald and Fletcher proposed a detailed grammatical coding of German sentences, indicating the syntactic functions of nouns and verb forms in clauses and enabling the identification of noun blocks and verb blocks. On the basis of the codes, certain sequences were identifiable as candidates for rearrangement when the output was to be in English. Oswald and Fletcher concluded that syntax does not constitute, as had been thought by some, a barrier to mechanical translation; instead they stressed the problems of solving the lexicographic difficulties of machine translation.

1.5.2 The decade of high expectation and disillusion, 1954-1966

The earliest systems consisted primarily of large bilingual dictionaries where entries for words of the source language gave one or more equivalents in the target language, and some rules for producing the correct word order in the output. It was

soon recognised that specific dictionary-driven rules for syntactic ordering were too complex and increasingly ad hoc, and the need for more systematic methods of syntactic analysis became evident. A number of projects were inspired by contemporary developments in linguistics, particularly in models of formal grammar, and they seemed to offer the prospect of greatly improved translation. Optimism remained at a high level for the first decade of research, with many predictions of imminent "breakthroughs". However, disillusion grew as researchers encountered "semantic barriers" for which they saw no straightforward solutions. There were some operational systems (the Mark II system installed at the USAF Foreign Technology Division, and the Georgetown University system at the US Atomic Energy Commission and at Euratom in Italy), but the quality of output was disappointing. By 1964, the US government sponsors had become increasingly concerned at the lack of progress; they set up the Automatic Language Processing Advisory Committee (ALPAC), which concluded in a famous 1966 report that MT was slower, less accurate and twice as expensive as human translation and that "there is no immediate or predictable prospect of useful machine translation." It saw no need for further investment in MT research; instead it recommended the development of machine aids for translators, such as automatic dictionaries, and the continued support of basic research in computational linguistics.
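The architecture described at the start of this section, a bilingual dictionary whose entries may carry several target equivalents, can be sketched as follows; the two transliterated Russian entries are invented for the example, and the slash-separated output imitates the common stopgap of printing every alternative for the post-editor.

```python
# Sketch of an early 'direct' system: a bilingual dictionary in which an
# entry may list several TL equivalents, all of which are printed.
# The two transliterated Russian entries are invented examples.
DICTIONARY = {
    "bolshaya": ["big", "large"],
    "ruka": ["hand", "arm"],
}

def translate(sl_tokens: list[str]) -> str:
    rendered = []
    for word in sl_tokens:
        equivalents = DICTIONARY.get(word, [word])  # echo unknown words
        rendered.append("/".join(equivalents))      # stopgap: print all options
    return " ".join(rendered)

print(translate(["bolshaya", "ruka"]))  # big/large hand/arm
```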

1.5.3 The ALPAC report and its consequences, 1966-1980

Although widely condemned as biased and shortsighted, the ALPAC report brought a virtual end to MT research in the United States for over a decade, and it had a great impact elsewhere, in the Soviet Union and in Europe. However, research did continue in Canada, in France and in Germany. Within a few years the Systran system was installed for use by the USAF (1970), and shortly afterwards by the Commission of the European Communities for translating its rapidly growing volumes of documentation (1976). In the same year, another successful operational system appeared in Canada, the Meteo system for translating weather reports,

developed at Montreal University. In the 1960s, MT activity in the US and the Soviet Union had concentrated on Russian-English and English-Russian translation of scientific and technical documents for a relatively small number of potential users, who would accept crude unrevised output for the sake of rapid access to information. From the mid-1970s onwards the demand for MT came from quite different sources, with different needs and different languages. The administrative and commercial demands of multilingual communities and multinational trade stimulated the demand for translation in Europe, Canada and Japan beyond the capacity of the traditional translation services. The demand was now for cost-effective machine-aided translation systems that could deal with commercial and technical documentation in the principal languages of international commerce.

1.5.4 The 1980s

The 1980s witnessed the emergence of a wide variety of MT system types from a widening number of countries. First there were a number of mainframe systems, whose use continues to the present day. Apart from Systran, now operating in many pairs of languages, there was Logos; the internally developed systems at the Pan American Health Organization (Spanish-English and English-Spanish); the Metal system (German-English); and major systems for English-Japanese and Japanese-English translation from Japanese computer companies. The wide availability of microcomputers and of text-processing software created a market for cheaper MT systems, exploited in North America and Europe by companies such as ALPS, Weidner, Linguistic Products, and Globalink, and by many Japanese companies, e.g. Sharp, NEC, Oki, Mitsubishi, Sanyo. Other microcomputer-based systems appeared from China, Taiwan, Korea, Eastern Europe, the Soviet Union, etc. Throughout the 1980s research on more advanced methods and techniques continued. For most of the decade, the dominant strategy was that of indirect translation via intermediary representations, sometimes interlingual in nature, involving semantic as well as morphological and syntactic analysis and sometimes

non-linguistic knowledge bases. The most notable projects of the period were GETA-Ariane, SUSY, Mu, DLT, Rosetta, the knowledge-based project at Carnegie-Mellon University, and two international multilingual projects: Eurotra, supported by the European Communities, and the Japanese CICC project with participants in China, Indonesia and Thailand.

1.5.5 The early and the late 1990s

The end of the decade was a major turning point. Firstly, a group from IBM published the results of experiments on a system based purely on statistical methods. Secondly, certain Japanese groups began to use methods based on corpora of translation examples, i.e. the approach now called example-based translation. The distinctive feature of both approaches was that no syntactic or semantic rules were used in the analysis of texts or in the selection of lexical equivalents; both differed from earlier rule-based methods in their exploitation of large text corpora. A third innovation was the start of research on speech translation, involving the integration of speech recognition, speech synthesis, and translation modules, the latter mixing rule-based and corpus-based approaches. The major projects are at ATR (Nara, Japan), the collaborative JANUS project, and in Germany the government-funded Verbmobil project. However, traditional rule-based projects have continued, e.g. the Catalyst project at Carnegie-Mellon University, the project at the University of Maryland, and the ARPA-funded research at three US universities. Another feature of the early 1990s was the changing focus of MT activity from pure research to practical applications: to the development of translator workstations for professional translators, to work on controlled language and domain-restricted systems, and to the application of translation components in multilingual information systems. These trends have continued into the later 1990s. In particular, the use of MT and translation aids by large corporations has grown rapidly; a particularly impressive increase is seen in the area of software localization. There has been a huge growth in sales of MT software for personal computers and, even more significantly,

in the availability of MT from on-line networked services. The demand has been met not just by new systems but also by downsized and improved versions of previous mainframe systems. While in these applications the need may be for reasonably good quality translation (particularly if the results are intended for publication), there has been even more rapid growth of automatic translation for direct Internet applications (electronic mail, Web pages, etc.), where the need is for fast real-time response with less importance attached to quality. With these developments, MT software is becoming a mass-market product, as familiar as word processing and desktop publishing.

Conclusion

It has long been a subject of discussion whether machine translation and computer-assisted translation could convert translators into mere editors, making them less important than the computer programs. The fear of this happening has led to a certain rejection of the new technologies on the part of translators, not only because of a possible loss of work and professional prestige, but also because of concern about a decline in the quality of production. Some translators totally reject machine translation because they associate it with the point of view that translation is merely one more marketable product based on a calculation of investment versus profits. They define translation as an art that possesses its own aesthetic criteria, which have nothing to do with profit and loss but are rather related to creativity and the power of the imagination. This applies mostly, however, to specific kinds of translation, such as that of literary texts, where polysemy, connotation and style play a crucial role. It is clear that computers could not even begin to replace human translators with such texts. In fact, translators should recognize and learn to exploit the potential of the new technologies to help them to be more rigorous, consistent and productive, without feeling threatened. Translators need to accept the new technologies and learn how to use them to their maximum potential, as a means to increased productivity and quality improvement.

II. COMMON ERRORS IN MACHINE TRANSLATION

Translation involves the production of a text in one language that was inspired by an existing text in another language, such that the two texts are in some sense 'the same'. Proper translation, like paraphrase, involves decoding the message from an existing text and re-encoding that message in a new text; this results in two texts different in form but with the same message. Translation parallels paraphrase except that the source and target texts are in different languages.

2.1 The quality of translation

Paraphrases vary in quality as a function of their accuracy, i.e. the degree to which the message they convey, when decoded, matches the message of the original text. This is true whether they are translations or not. Since translations are often performed in formal and public circumstances, the question of accuracy is raised with a greater percentage of translations than of monolingual paraphrases. But accuracy is important in all cases, and a complaint regarding accuracy should be taken seriously. Literary translations adhere to a higher standard of quality, a standard that considers the form of the text as well as the message transmitted. The target text in a good literary translation is adjusted in form so that it corresponds to the source text: the translation of a sonnet should be a sonnet, and the translation of a joke should be funny. Thus a translation may be criticized on both form and semantic content.

2.2 Mechanical dictionaries

The creation of an automatic dictionary is the first and most obvious task of an MT system. Mechanical dictionaries were the central concern of all the earliest machine translation researchers, and they are still crucial for the efficient operation of present machine translation systems. Like Artsrouni and Troyanskii, many early researchers tended to see the translation process almost exclusively in terms of consulting dictionaries to find Target Language (TL) words equivalent to Source

Language (SL) words. The resulting dictionary translations presented the TL output in the same word sequence as the SL input, i.e. word-for-word translations. The researchers knew that this would not produce good translations; they expected the results to be very poor and in need of considerable editing. Before any word-for-word translations had been seen, Reifler suggested a pre-editor and a post-editor, and the unintelligibility of the results of Richens and Booth's attempts confirmed the need for, at the least, post-editing. Nevertheless, the ability of many readers to make some sense of these dictionary translations encouraged MT researchers to believe that with suitable modifications the word-for-word approach could in the end produce reasonable output. As we have seen, Yngve considered that they were surprisingly good and worth taking as first approximations to be worked on.

The mechanization of dictionary procedures posed problems of a technical nature. Research on machine translation began at a time when computers were limited in their storage capacities and slow in access times. There was much discussion of storage devices and mechanisms for improving access times. Booth (1955) and Stout (1954), for example, assessed the relative merits of paper tape, punched cards and magnetic tape as external storage means, and the various possibilities for internal memory storage: cathode-ray-tube dielectric stores, vacuum tubes, magnetic drums, photographic drums, etc. Since the external storage could only be searched serially, the most efficient method of dictionary lookup was to sort all the words of the SL text into alphabetical order and to match them one by one against the dictionary entries. Once found, entries could often be stored internally, where faster access was possible. Various proposals were made for efficient searching of internal stores, including the sequencing of items by frequency, the binary cut method first put forward by Booth (1955a), and the letter-tree approach of Lamb.
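The binary cut method is, in modern terms, binary search over a sorted dictionary, and Lamb's letter-tree corresponds to what is now called a trie. Below is a minimal sketch of a letter-tree lookup in Python; the two German-English entries are invented for the example.

```python
# Minimal letter-tree (trie) dictionary lookup; the two German-English
# entries are invented for illustration.
def build_letter_tree(entries: dict[str, str]) -> dict:
    root: dict = {}
    for word, gloss in entries.items():
        node = root
        for letter in word:
            node = node.setdefault(letter, {})  # one branch per letter
        node["$"] = gloss                       # '$' marks the end of an entry
    return root

def lookup(tree: dict, word: str) -> str | None:
    node = tree
    for letter in word:
        if letter not in node:
            return None                         # word missing from the dictionary
        node = node[letter]
    return node.get("$")

tree = build_letter_tree({"haus": "house", "hand": "hand"})
print(lookup(tree, "haus"))  # house
print(lookup(tree, "hund"))  # None: the missing-word problem discussed below
```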

A popular method for reducing dictionary size was the division of words into stems and endings. In languages like German and Russian it was obviously wasteful to include every inflected form of nouns and verbs. The familiar regularities of noun and verb paradigms encouraged researchers to investigate methods of morphological analysis to identify stems and endings. However, there are so many peculiarities and irregularities in the morphology of languages that procedures turned out to be more complex than expected; as a result, when larger storage mechanisms with fast access times became available, many MT researchers went back to the older system of storing full forms in dictionaries.

Obviously, dictionaries cannot always include all the words occurring in SL texts. A problem for all MT systems is to establish acceptable methods for dealing with missing words; basically, there are two approaches: either to attempt some kind of analysis and translation, or to print out the original untranslated SL form. In both cases, there is a further problem with the rest of the sentence: whether to attempt an incomplete translation or to give up and produce no translation. In experimental Machine Translation systems it is obviously reasonable to admit failure, but in operational systems it is desirable, on the whole, to produce some kind of translation.

2.3 Polysemy and semantics

The most obvious deficiency of any word-for-word translation, whether mechanized or not, is that the order of words in the resulting TL text is more often wrong than correct. As we have seen, it was clear to Oswald and Fletcher that translation of German texts into English demanded some kind of structural analysis of the German sentences. At the simplest level, such analysis may take into account morphological features, such as the endings of nouns, adjectives and verbs, or basic syntactic sequences, such as noun-adjective and subject-verb relations. As we shall see, it is possible to use this kind of information to devise procedures for rearrangement in basically word-for-word systems. But, in order to go beyond the inherent limitations of the word-for-word approach, the analysis of syntactic structures must involve the identification of phrase and clause relationships. Methods of syntactic analysis will be the subject of a later section.

The second obvious problem is that there are rarely one-to-one correspondences in the vocabularies of natural languages. In most cases, a particular SL word may correspond to a number of different TL words, so that either the MT system prints out all the possibilities or it attempts to select the one that is most appropriate for the specific text in question. The first option was adopted by many systems, as we shall see, often as an intermediate stopgap; the problem of selecting the right TL equivalent was left to the post-editor. Attempts to deal with the problem took a number of approaches.

The difficulty usually occurs because the SL word has what Weaver and many after him called multiple meanings. Linguists distinguish between homonyms and polysemes: homonyms are words like bank that have two or more distinct and unrelated meanings (geological feature or financial institution); polysemes are words like face that reflect different shades of meaning according to context. They distinguish also between homophones (words which sound the same but have different meanings) such as pear, pair and pare, and homographs (words which are spelled the same but have different meanings) such as tear (crying versus ripping). Fortunately, the homophone problem is irrelevant since MT deals only with written texts. For practical purposes it is also immaterial whether the SL word is a homograph or a polyseme: the problem for MT is the same; the relevant meaning for the context must be identified and the appropriate TL form must be selected. Consequently, it is now common in MT research to refer to methods of homograph resolution, whether the words concerned are strictly homographs or not.

Sometimes the TL vocabulary makes finer sense distinctions than the SL. There are familiar examples in translating from English into French or German: the verb know may be conveyed by savoir or connaître in French and by wissen or kennen in German; likewise the English river may be either rivière or fleuve in French and either Fluss or Strom in German. In neither case can we say that the English words have more than one meaning; it is just that French and German make distinctions which English does not. Nevertheless, in the context of an MT system the problem of selecting the correct TL form is much the same as when the SL form is a genuine homograph or polyseme. MT systems do, however, differ according to whether this type of SL-TL difference is tackled at the same stage as SL homograph resolution or not. The difficulties are further compounded in languages like English where many words may function as nouns, verbs or adjectives without any formal distinctions; e.g. control can be a verb or a noun, green can be an adjective or a noun. The fact that there can be stress differences, e.g. between the verb permit and the noun permit, is of no assistance. For practical purposes these forms are also treated as homographs and much the same procedures for homograph resolution are applied. Various methods for tackling such SL-TL lexical differences have been proposed.

One method has already been mentioned: the identification of grammatical category, either by morphological clues or by syntactic analysis. For example, the endings -ed and -ing generally indicate participial forms of English verbs (although they may be functioning as adjectives). Similarly, if in a two-word sequence the first is definitely an adjective, the second is probably a noun. Therefore, homographs that happen to belong to different syntactic categories may sometimes be distinguished in this way.

Another method is to reduce the incidence of homography in the MT dictionaries. The concept of the micro-glossary was proposed not only to keep the size of dictionaries reasonably small but also to minimize problems of multiple meanings. It was maintained, for example, that the Russian vid was to be translated usually as species in biological contexts and not as view, shape or aspect. A micro-glossary for Russian-English translation in biology could, therefore, include just one of the English equivalents. In many cases the entry has to be the equivalent which is most often correct. In physics, for example, a particular Russian word may usually be equated with change; although in some contexts other translations may be better, the one which fits best most frequently should be selected.

The suggestion by Weaver was to examine the immediate context of a word. As we have seen, Kaplan concluded that a five-word sequence was in general a sufficient micro-context for disambiguation, i.e. for identifying the particular meaning of a polyseme. There are two ways in which immediate context can be implemented: one by expanding dictionary entries to include sequences of two or more words, i.e. phrases, the other by testing for the occurrence of specific words. For example, if the Russian obrazovanie is accompanied by a particular adjective, it is to be translated as formation (rather than education); either the dictionary includes the whole phrase or the analysis procedure tests for the particular adjective. The dictionary solution obviously requires storage facilities of sufficient capacity, and it is also more appropriate when phrases are idiomatic, i.e. when the meaning (translation) of the phrase as a whole cannot be deduced (or constructed) from its individual words. Apart from familiar idioms such as hold one's tongue, not move a finger, red herring and blue blood, the dictionary could include verbal phrases such as make away with, draw forth, look up and pass off, and noun phrases such as full speed, upper class and brute force.
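The micro-context test can be sketched in a few lines. In the toy version below, the cue lists, the window size and the homograph chosen are my own illustration, not Kaplan's actual procedure: each TL equivalent is paired with cue words, and the equivalent whose cues occur near the homograph wins, with the most frequent equivalent as the fallback.

```python
# TL equivalents of a homograph, each paired with cue words (invented lists).
EQUIVALENTS = {"bank": [("rive",   {"river", "shore", "water"}),
                        ("banque", {"money", "account", "loan"})]}

def choose(word, tokens, window=2):
    """Pick the TL equivalent whose cues occur within a few words of `word`."""
    i = tokens.index(word)
    context = set(tokens[max(0, i - window): i + window + 1])
    for tl, cues in EQUIVALENTS[word]:
        if cues & context:
            return tl
    return EQUIVALENTS[word][0][0]   # fall back on the most frequent equivalent

print(choose("bank", "we walked along the river bank".split()))  # -> rive
```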

A more fundamental use of contextual information is to search for semantic features that are common to or prominent in the sentence or text as a whole, and to use this information to decide on the most fitting translation for SL words. This method involves the investigation of semantic invariants or semantic regularities in vocabulary and texts, and necessarily goes far beyond the examination of lexical equivalents between languages. It involves, for example, the investigation of synonymy and paraphrase, of semantic universals or primitive elements (e.g. features such as human, animate, liquid, etc.), and of semantic relations within sentences and texts (e.g. agent-action, cause-effect, etc.).

Finally, the problem of polysemy may simply be avoided completely by insisting that texts input to an MT system be written in a regularized and normalized fashion. In other words, writers are encouraged not to be ambiguous, or rather not to include words and phrases which the MT system in use has difficulty in disambiguating. The obverse of this is to solve polysemy by using a highly restricted form of TL as output, a kind of pidgin language with its own idiosyncratic vocabulary usages. As we have seen, Dodd made the first suggestion of this approach; the groups at the University of Washington and at Cambridge were particularly interested in MT pidgin and methods of improving output of this kind.

In theory, any of these methods can be used in any MT system; in practice, particular MT systems have emphasized one or two approaches, concentrating on exploiting their full potentialities and generally neglecting the alternatives. Concentration on the contextual and micro-glossary approaches was characteristic of the MT groups at Rand and Michigan. Concentration on the dictionary and lexicographic approaches was characteristic of the groups at Harvard, at the University of Washington and at IBM. Concentration on text semantics was pursued most strongly by the Milan group with its correlational analysis approach and by the Cambridge group with its thesaurus approach.

2.4 Morphological analysis

In order to perform any kind of syntactic analysis the grammatical categories (noun, verb, adjective, adverb, etc.) of the words of sentences must be determined. The first step of analysis in any MT system is, however, the identification of the words in the SL text. This is relatively easy in English and most European languages, since words are separated by spaces in written text, but it is not, for example, in languages
such as Chinese and Japanese where there are no external markers of word boundaries. Obviously, dictionary entries could indicate the grammatical categories (word class or part of speech) of all SL words. However, it was clearly unnecessary to include every inflected form of a noun or a verb, particularly in languages such as Russian and German. The familiar regularities of noun and verb paradigms encouraged researchers to investigate methods of morphological analysis that would identify stems and endings. To give an English example, the words analyzes, analyzed and analyzing might all be recognized as having the same stem analyze and the common endings -s, -ed and -ing. At the same time, identification of endings was a first step towards the determination of grammatical categories, e.g. to continue the example: -s indicates a plural noun form or a third person singular present verb form, -ed indicates a past verb form, and -ing a present participle or adjectival form, etc. As these examples demonstrate, however, many (perhaps most) endings are ambiguous, even in Russian, and the final establishment of the grammatical category of particular words in text takes place during syntactic analysis. Morphological analysis deals necessarily with regular paradigms; irregular forms, such as the conjugation of verbs like be and have, and the plural forms of nouns such as geese and analyses, are generally dealt with by inclusion of the irregularities as full forms in the dictionary.
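A minimal stem-and-ending analyzer along these lines might look as follows; the ending table and stem list are invented for illustration, and real morphology needs far more rules. Note that the ambiguous category lists it returns are exactly what the subsequent syntactic analysis must resolve.

```python
# Candidate endings with their (often ambiguous) category readings.
ENDINGS = {"ing": ["present participle", "adjective"],
           "ed":  ["past verb"],
           "s":   ["plural noun", "3rd person singular verb"]}
STEMS = {"analyze", "watch", "form"}

def analyze(word):
    """Split off an ending and return (stem, ending, candidate categories)."""
    for ending, categories in ENDINGS.items():
        if word.endswith(ending):
            stem = word[: -len(ending)]
            for candidate in (stem, stem + "e"):      # 'analyz' + 'e' -> 'analyze'
                if candidate in STEMS:
                    return candidate, ending, categories
    return None    # unanalyzable: fall back on a full-form dictionary entry

print(analyze("analyzing"))  # ('analyze', 'ing', ['present participle', 'adjective'])
print(analyze("analyzes"))   # ('analyze', 's', ['plural noun', '3rd person singular verb'])
```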

2.5 Syntactic analysis

The first step beyond the basic word-by-word approach is the inclusion of a few rearrangement rules, such as the inversion of noun-adjective to adjective-noun, e.g. in French-English translation. In many early MT systems rearrangement rules were often initiated by codes attached to specific dictionary entries. Examples are to be found in the 1954 Georgetown-IBM experiment, and in the experiment by Panov and his colleagues shortly afterwards in the Soviet Union. When there were differences of syntactic structure more complex than inversion, the solution was often the inclusion of phrases in the dictionary, i.e. treating them rather like idiomatic expressions. This approach was expanded and refined as the lexicographic approach of the University of Washington. Rearrangement rules may take into account fairly long sequences of grammatical categories, but they do not imply any analysis of syntactic structure, e.g. the identification of a noun phrase. The next step beyond the basic word-for-word approach is therefore the establishment of syntagmas, such as noun phrases (nouns and modifiers, compound nouns, etc.), verbal complexes (e.g. auxiliaries and modals in conjunction with infinitives or participle forms), and coordinate structures. This level of analysis is to be seen in the later Georgetown system. Complete syntactic analysis involves the identification of relationships among phrases and clauses within sentences. Syntactic analysis aims to identify three basic types of information about sentence structure:
1) The sequence of grammatical elements, e.g. sequences of word classes: Art(icle) + N(oun) + V(erb) + Prep(osition)..., or of functional elements: subject + predicate. These are linear relations.

2) The grouping of grammatical elements, e.g. nominal phrases consisting of nouns, articles, adjectives and other modifiers, prepositional phrases consisting of prepositions and nominal phrases, etc., up to the sentence level. These are constituency relations.

3) The recognition of dependency relations, e.g. the head noun determines the form of its dependent adjectives in inflected languages such as French, German and Russian. These are hierarchical (or dominance) relations.

Included among the basic objectives of any method of syntactic analysis must be at least the resolution of homographs (by identification of grammatical categories, e.g. whether watch is a noun or a verb) and the identification of sequences or structures which can be handled as units in SL-TL transfer, e.g. nouns and their associated adjectives. Various models of syntactic structure and methods of parsing have been adopted in MT systems and are described in more detail in connection with particular MT projects. At this point, the main approaches will be outlined, illustrated for the most part by analyses (whole or partial) of the sentence The gold watch and chain were sold by the jeweler to a man with a red beard. This is a passive sentence (the grammatical subject is the logical object of the verb), containing a homograph (watch), an ambiguous coordinate structure (are both the watch and the chain modified by gold?) and three prepositional phrases, each of which could in theory modify the verb or the preceding
noun phrase.

An example of an analysis program (parsing program) to identify sequential (linear) information was the Predictive Syntactic Analyzer developed at the National Bureau of Standards and at Harvard University. The premise was that on the basis of an identified grammatical category (article, adjective, noun, etc.) the following category or sequence of categories could be anticipated with an empirically determinable measure of probability. The system had the following characteristics: under the general control of a push-down store (i.e. last in, first out), a sentence was parsed one word at a time, left to right, the action taken for each word being determined by a set of predictions associated with the grammatical category to which the word had been assigned. At the beginning of the analysis certain sentence types were predicted in terms of sequences of grammatical categories. Examination of each word was in two stages: first a test whether its category fulfilled one of the predictions, starting from the most probable one, then either the alteration of existing predictions or the addition of further predictions. Formally, the system was an implementation of a finite state grammar. The analysis of a sentence was completed if a terminal state had been reached and all categories had been accounted for. Initially, only the single most probable path through the series of predictions was taken during parsing, but in later models all possible predictions were pursued. The method did not in principle need to recognize phrase structures or dependency relations, although these could be derived from the identification of specific category sequences.

Diagram 1: finite state grammar analysis
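The control regime of such a predictive analyzer can be suggested with a toy grammar. The rules, the lexicon and the absence of probabilities below are simplifications of my own, not the NBS/Harvard rule set: predictions live on a push-down list, and each word must fulfil the top prediction, which is either expanded (if it is a phrase-type prediction) or consumed (if it is a word category).

```python
# Predictions are expanded or consumed from the front of a push-down list.
GRAMMAR = {"S":  [["NP", "VP"]],
           "NP": [["Art", "N"], ["Art", "Adj", "N"]],
           "VP": [["V"], ["V", "NP"]]}
LEXICON = {"the": "Art", "gold": "Adj", "watch": "N", "stopped": "V"}

def fulfil(tokens, predictions):
    """True if the tokens can satisfy the current stack of predictions."""
    if not predictions:
        return not tokens                      # success: nothing left over
    top, rest = predictions[0], predictions[1:]
    if top in GRAMMAR:                         # expand a non-terminal prediction
        return any(fulfil(tokens, alt + rest) for alt in GRAMMAR[top])
    return bool(tokens) and LEXICON.get(tokens[0]) == top and fulfil(tokens[1:], rest)

print(fulfil("the gold watch stopped".split(), ["S"]))   # True
```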

The second approach, analysis of dependency relations, is based on the identification of governors, e.g. the head noun in a noun phrase, and their dependants or modifiers, e.g. adjectives. The governor of the sentence as a whole is generally taken to be the finite verb, since this specifies the number and nature of dependent nouns. A verb such as buy, for example, can have four dependants (purchaser, object purchased, price, seller), a concept referred to as valency: a transitive verb such as see has a valency of two, an intransitive such as go has a valency of one, etc.

Diagram 2: dependency structure analysis of The gold watch was sold by the jeweler to a man with a red beard

The parsing of dependency structure can operate either top-down (identification first of governors and then of dependants) or bottom-up (determination of governors by a process of substitution). The top-down approach was most common, and can be illustrated by Garvin's fulcrum parser: in a series of passes the algorithm identified first the key elements of the sentence, e.g. the main finite verb, subject and object nouns, and prepositional phrases, then the relationships between sentence components, and finally the structure of the sentence as a whole.

The third approach, that of phrase structure analysis, provides labels for constituent groups in sentences: noun phrase (NP), verb phrase (VP), prepositional phrase (PP), etc. The phrase structure approach is associated most closely in the early period of MT research with the MIT project. Parsing can be either bottom-up or top-down. In the former, structures are built up in a series of analyses from immediate constituents, e.g. first noun phrases, then prepositional structures, then verb relationships and finally the sentence structure as a whole. In top-down parsing, the algorithm seeks the fulfillment of expected constituents NP, VP, etc. by appropriate sets and sequences of grammatical categories. The bottom-up parsing strategy was the most common approach in early MT systems, but at MIT some investigation was made of the top-down strategy (analysis by synthesis). In systems since the mid-1960s the top-down strategy has probably become more common.

Diagram 3: phrase structure analysis of The gold watch was sold by the jeweler to a man with a red beard

It may be noted that the categorial grammar developed by Bar-Hillel, which was one of the first attempts at formal syntax, is a version of constituency grammar. In a categorial grammar there are just two fundamental categories, sentence s and nominal n; the other grammatical categories (verb, adjective, adverb, etc.) are defined in terms of their potentiality to combine with one another or with one of the fundamental categories in constituent structures. Thus a transitive verb is defined as n\s because it combines with a nominal (to its left) to form sentences; and an adjective is defined as n/n because in combination with a nominal n to its right it forms a (higher order) nominal n. In other words, the category symbols themselves define how they are to combine with other categories. Combination operates by two simple 'cancellation' rules: x/y, y > x and y, y\x > x.
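The two cancellation rules are simple enough to execute directly. Below is a toy reducer over flat (un-nested) category symbols, with a small lexicon of my own choosing; a real categorial parser would of course allow nested categories and explore alternative reduction orders.

```python
# Categorial lexicon: n/n combines rightward, n\s combines leftward (toy).
LEX = {"the": "n/n", "old": "n/n", "watch": "n", "stopped": "n\\s"}

def reduce_pair(a, b):
    """Apply x/y, y -> x or y, y\\x -> x to two adjacent categories."""
    if "/" in a and a.split("/", 1)[1] == b:
        return a.split("/", 1)[0]              # x/y, y -> x
    if "\\" in b and b.split("\\", 1)[0] == a:
        return b.split("\\", 1)[1]             # y, y\x -> x
    return None

def is_sentence(words):
    cats = [LEX[w] for w in words]
    while len(cats) > 1:                       # keep cancelling adjacent pairs
        for i in range(len(cats) - 1):
            result = reduce_pair(cats[i], cats[i + 1])
            if result is not None:
                cats[i:i + 2] = [result]
                break
        else:
            return False                       # no reduction applies
    return cats == ["s"]

print(is_sentence("the old watch stopped".split()))   # True
```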

2.6 Formal syntax and transformational grammar

Research in MT helped to stimulate much interest in formal linguistics. An early result of this mathematization of syntax and linguistic theory was the demonstration that all phrase structure and dependency grammars are formally (i.e. mathematically) equivalent and that, since they can be implemented on pushdown automata, they are equivalent also to the so-called finite state grammars (Gross & Lentin 1967). All these grammars belong to the class of context-free grammars. A context-free grammar consists of a set of rewriting rules (or production rules) of the form A > a, where A belongs to a set of non-terminal symbols and a is a string of non-terminal or terminal symbols. Non-terminal symbols are grammatical categories (S, NP, VP, N, Adj, etc.) and terminal symbols are lexical items of the language. Context-free grammars are important not only as the basis for formal grammars of natural languages but also as the basis for computer programming, since the standard algorithmic methods used in compilers rely on finding only context-free structures in programming languages.

However, Noam Chomsky demonstrated the inherent inadequacies of finite state grammars, phrase structure grammars and the formally equivalent dependency grammars for the representation and description of the syntax of natural languages. Context-free grammars are unable, for example, to relate different structures having the same functional relationships, e.g. where discontinuous constituents are involved: He looked up the address and He looked the address up; or where there are differences of voice, e.g. the active The jeweler sold the watch to the man yesterday and the passive Yesterday the man was sold the watch by the jeweler. Chomsky proposed a transformational-generative model which derived surface phrase structures from deep phrase structures by transformational rules. Thus a passive construction in a surface representation is related to an underlying active construction in a deep representation, where the surface subject noun appears as the deep logical object. Deep structures are generated from an initial symbol S by context-sensitive rewriting rules. An essential feature of the Chomskyan model is that syntactic structures are generated top-down: from the initial symbol S to deep structure trees and then, by transformational rules, to surface structure trees. In the case of a coordinate phrase
such as gold watch and chain, the base deep structure would make explicit the fact that both watch and chain are gold. To produce the elliptical surface form, a transformation rule would delete the repeated adjective. The model is not intended to provide the basis for a recognition grammar (e.g. a parser), but only to define mathematically the set of well-formed sentences, and to assign a structural description indicating how the sentence is understood by the ideal speaker-hearer (Chomsky 1965: 5). The implications of this approach became clearer when researchers attempted to develop transformational parsers.

Diagram: the deep structure the gold watch and gold chain reduced by a transformational rule (with loss of the phrase structure relationship) to the surface form the gold watch and chain

Chomsky's notion of transformational rules derived formally from the work of Zellig Harris (1957). Harris' concern was the development of a symbolism for representing structural relationships. Grammatical categories were established primarily on the basis of distributional analysis. Thus, the subject of a sentence can be a (single) noun (The man...), a clause (His leaving home...), a gerundive (The barking of dogs...), an infinitive clause (To go there...), etc. In order to function as subjects, clauses have to undergo transformations from 'kernel' (atomic sentence-like) forms: e.g. He left home > His leaving home, Dogs bark > the barking of dogs. For Harris, transformations were a descriptive mechanism for relating surface structures, while in Chomsky's model transformational rules derive surface structures from deeper structures. By the mid-1960s an additional requirement of transformational rules was that they should be meaning-preserving, i.e. from a deep structure only semantically equivalent surface structures should be generated. Although Chomsky's syntactic theory has undoubtedly had the most influence, the formalization of transformations by Harris had considerable impact in MT research, particularly in the representation of SL-TL structural transfer rules.
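The adjective-deletion transformation described above can be mimicked very crudely in code. The rule below operates on flat word strings, knows nothing of real phrase structure, and is purely my own illustration of the deep-to-surface direction of such rules:

```python
def coordinate_reduction(words):
    """Delete a repeated modifier after 'and': deep structure -> surface form."""
    surface = list(words)
    if "and" in surface:
        i = surface.index("and")
        # If the word after 'and' already occurred in the first conjunct, drop it.
        if i + 1 < len(surface) and surface[i + 1] in surface[:i]:
            del surface[i + 1]
    return surface

deep = "the gold watch and gold chain".split()
print(" ".join(coordinate_reduction(deep)))   # -> the gold watch and chain
```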

2.7 Syntactic ambiguity and discourse relations

Although the identification of grammatical categories and of sentence structures is clearly important in linguistic analysis, there are inherent limitations in syntactic analysis that were recognized before even efficient parsers had been developed. A familiar example is the problem of multiple analyses of prepositional phrases. Syntactic analysis alone cannot decide which relationship is correct in a particular case. Take, for example, the sentences: The coastguard observed the yacht in the harbor with binoculars and The gold watch was sold by the jeweler to a man with a beard. In the first case, it was the coastguard who had the binoculars; therefore the PP with binoculars modifies the verb. But in the second case, the PP with a beard modifies the preceding noun man. Only semantic information can assist the analysis, by assigning semantic codes allowing binoculars as instruments to be associated with 'perceptual' verbs such as observe but prohibiting beards from being associated with objects of verbs such as sell. Such solutions have been applied in many MT systems since the mid-1960s (as the following descriptions of systems will show).

However, semantic features cannot deal with all problems of syntactic ambiguity. As Bar-Hillel argued in 1960 (Bar-Hillel 1964), human translators frequently use background knowledge to resolve syntactic ambiguities. His example was the phrase slow neutrons and protons: whether slow modifies protons as well as neutrons can be decided only with knowledge of the physics involved. Similarly, in the case of the gold watch and chain our assumption that both objects are gold is based on past experience. On the other hand, in the case of the phrase old men and women the decision would probably rest on information conveyed in previous or following sentences of the particular text being analyzed. The most frequent occasions on which recourse is made to real-world knowledge involve the reference of pronouns. Examples are the two sentence pairs: The men murdered the women. They were caught three days later. and The men murdered the women. They were buried three days later. The correct attribution of the pronoun they to the men in the first pair and to the women in the second depends entirely on our knowledge that only dead people are buried, that murder implies death, that murder is a criminal act, and that criminals ought to be apprehended. This knowledge is non-linguistic, but it has linguistic implications in, for example, the translation of these sentences into French, where a choice of ils or elles must be made. Of course, it is not only in cases of syntactic ambiguity that we use real-world knowledge to help in understanding text.

Homographs can, as indicated earlier, be resolved by identification of grammatical categories, e.g. whether watch is a noun or a verb. However, the resolution of some homographs requires, as in the physics example, knowledge of the objects referred to. There is, for example, a third sense of watch in the sentence The watch included two new recruits that night. It can be distinguished from the other noun sense only by the recognition that timepieces do not usually include animate beings. It was from such instances that Bar-Hillel was to argue in an influential paper (Bar-Hillel 1960) that fully automatic translation of high quality was never going to be feasible. In practice this type of problem can be lessened if texts for translation are restricted to a more or less narrow scientific field, so that dictionaries and grammars can concentrate on a specific sub-language (and this was the argument for micro-glossaries). Nevertheless, similar examples recur regularly, and the argument that MT requires language understanding based on encyclopedic knowledge and complicated inference procedures has convinced many researchers that the only way forward is the development of interactive and Artificial Intelligence approaches to MT.

In general, semantic analysis has developed, by and large, as an adjunct of syntactic analysis in MT systems. (Exceptions are those MT systems with an explicitly semantic orientation.) In most MT systems semantic analysis goes no further than necessary for the resolution of homographs. In such cases, all that is generally needed is the assignment of such features as human, animate, concrete, male, etc. and some simple feature-matching procedures. For example, crook can only be animate in The crook escaped from the police, because the verb escape demands an animate subject noun. The shepherd's-staff sense of crook is thus excluded. In many systems semantic features have been assigned as 'selection restrictions' in an ad hoc manner, as the demands of the analysis of a particular group of lexical items seem to require them, and also somewhat too rigidly. There are difficulties, for example, if the verb sell is defined as always having inanimate objects: the sentence The men were sold at a slave market would not be correctly parsed. One suggested answer has been to make such 'selection restrictions' define not obligatory features but preferences. True semantic analysis should include some decomposition of lexical items according to a set of semantic primitives or putative universals. Only by such means is it possible to derive common semantic representations for a pair of sentences such as The teacher paid no attention to the pupil and The pupil was ignored by the teacher.
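Feature matching of this kind, with restrictions treated as preferences rather than absolute constraints, is easy to sketch. The feature inventory and verb entries below are my own invention; the point is only that a preference score ranks readings, so that The men were sold at a slave market comes out dispreferred rather than unparsable.

```python
# Semantic features for nouns and role preferences for verbs (invented).
FEATURES = {"crook": {"animate"}, "men": {"animate"}, "watch": {"concrete"}}
PREFERS  = {"escape": {"subject": "animate"},
            "sell":   {"object": "concrete"}}       # i.e. prefers inanimate objects

def preference(verb, role, noun):
    """1 if the noun carries the verb's preferred feature for this role, else 0."""
    wanted = PREFERS.get(verb, {}).get(role)
    return int(wanted in FEATURES.get(noun, set()))

print(preference("escape", "subject", "crook"))  # 1: the animate reading of crook
print(preference("sell", "object", "men"))       # 0: dispreferred, but not rejected
```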

In general, the majority of MT systems have avoided or held back from the intricacies and complexities, and no doubt pitfalls, of this kind of semantics. It is found therefore only in those MT groups that have investigated interlinguas, and in some of those recent groups with an interest in AI methods.

2.8 Sentences and texts

The difficulties with pronominal reference described above stem also from the exclusive concentration of syntax-based analysis on sentences. The need for text-based analysis can be illustrated by the following two German sentences: In der Strasse sahen wir einen Polizisten, der einem Mann nachlief. Dem Polizisten folgte ein grosser Hund. Translation into English sentence by sentence would normally retain the active verb forms, producing: In the street we saw a policeman running after a man. A large dog followed the policeman. Text cohesion would be improved if the second sentence were passivized as: The policeman was followed by a large dog. This inversion requires that an MT system adhere as far as possible to the information structure of the original, i.e. in this case retain the policeman as the head (or topic) of the sentence. The problems of topicalisation and text cohesion are of course far more complex than this example. Scarcely any MT projects have even considered how they might be tackled.

2.9 Transfer and synthesis

The production of output text in the target language (TL) is based on the information provided by dictionaries and by the results of analysis. In general the synthesis of TL sentences is less complex than the analysis of SL input. The process nearly always involves the derivation of correct morphological forms for TL words (unless dictionaries contain only full TL forms). Thus, for example, TL synthesis must produce the right forms of verbs: for English simple past forms it is not a matter of just adding -ed as in picked (from pick), since sometimes endings must be deleted or altered, as in lived (not: liveed) and tried (not: tryed), etc. Irregular forms are generally handled by the dictionary (e.g. went would be coded directly as the past form of go).
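A minimal synthesis routine for the English simple past, combining spelling adjustments with an irregular-forms dictionary as just described, might look as follows; the rule set is a sketch, not a complete account of English spelling.

```python
IRREGULAR = {"go": "went", "sell": "sold"}     # coded directly in the dictionary
VOWELS = set("aeiou")

def simple_past(verb):
    """Derive the English simple past with basic spelling adjustments."""
    if verb in IRREGULAR:
        return IRREGULAR[verb]                 # went, not goed
    if verb.endswith("e"):
        return verb + "d"                      # lived, not liveed
    if verb.endswith("y") and verb[-2] not in VOWELS:
        return verb[:-1] + "ied"               # tried, not tryed
    return verb + "ed"                         # picked

for v in ("pick", "live", "try", "go"):
    print(v, "->", simple_past(v))
```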

If analysis has included the establishment of syntactic structure (e.g. a phrase structure), then synthesis must convert this structure into an appropriate TL structure and produce a linear representation, i.e. it must invert the analysis process in some way. However, it should be stressed that inversion does not imply that the rules devised for the analysis of structures of a particular language (as SL) can simply be reversed to obtain rules for synthesis of that language (as TL). At some point in many systems (the exceptions being interlingual systems, cf. the next section), the syntactic structures of SL texts are transformed into TL structures. Whether such transformations apply only to short segments (as in word-for-word systems) or to whole sentences, the process involves the specification of transformation rules: for example, a rule for changing a German construction with a final past participle (Er hat das Buch gestern gelesen) into an English construction with a simple past form (He read the book yesterday). Clearly, such transformation rules have much in common with the transformation rules that Harris devised for relating structures within the same language.
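That one transfer rule can be made concrete in a few lines. The toy lexicon and the flat, string-level treatment below are my own simplification of what is really a structural operation on trees:

```python
LEXICON = {"Er": "he", "das": "the", "Buch": "book", "gestern": "yesterday"}
PAST_PARTICIPLES = {"gelesen": "read", "verkauft": "sold"}

def transfer(words):
    """German 'Subject hat ... PastParticiple' -> English 'Subject Verb ...'."""
    if len(words) > 2 and words[1] == "hat" and words[-1] in PAST_PARTICIPLES:
        moved_verb = PAST_PARTICIPLES[words[-1]]
        return [LEXICON[words[0]], moved_verb] + [LEXICON[w] for w in words[2:-1]]
    return [LEXICON.get(w, w) for w in words]  # fall back on word-for-word output

print(" ".join(transfer("Er hat das Buch gestern gelesen".split())))
# -> he read the book yesterday
```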

2.10 System designs and strategies

In broad terms, there have been three types of overall strategy adopted in MT systems.

Diagram 1: the direct translation system (SL text -> analysis and synthesis, using SL-TL dictionaries and grammars -> TL text)

The first approach is the direct translation approach. Systems are designed in all details specifically for one particular pair of languages. The basic assumption is that the vocabulary and syntax of SL texts need not be analyzed any more than strictly necessary for the resolution of ambiguities, the correct identification of appropriate TL expressions and the specification of TL word order. Thus if the sequence of SL words is sufficiently close to an acceptable sequence of TL words, there is no need to identify the syntactic structure of the SL text. The majority of MT systems of the 1950s and 1960s were based on this approach. They differed in the amount of analysis and/or restructuring incorporated. There was none at all in the straight dictionary translation experiment of Richens and Booth; there was just a minimum of local restructuring in the word-for-word systems of the University of Washington and IBM; there was partial analysis of SL structure in the Georgetown system; and there was full sentence analysis in the systems at Ramo-Wooldridge, Harvard, and Wayne State University. A primary characteristic of direct translation systems of the earlier period was that no clear distinctions were made between the stages of SL analysis and TL synthesis (cf. particularly the account of the Georgetown system below). In more recent (post-1970) examples of direct systems there is a greater degree of modular structure.

Diagram 2: the interlingual system (SL text -> SL analysis -> interlingual representation -> TL synthesis -> TL text, with separate SL and TL dictionaries and grammars and an interlingual dictionary)

The second basic MT strategy is the interlingual approach, which assumes that it is possible to convert SL texts into semantico-syntactic representations common to more than one language. From such interlingual representations texts would be generated into other languages. In such systems translation from SL to TL is in two distinct and independent stages: in the first stage SL texts are fully analyzed into interlingual representations, and in the second stage interlingual forms are the sources for producing (synthesizing) TL texts. Procedures for SL analysis are intended to be SL-specific and not devised for any particular TL in the system; likewise, TL synthesis is intended to be TL-specific. Interlingual systems differ in their conceptions of an interlingual language: a logical artificial language, or a natural auxiliary language such as Esperanto; a set of semantic primitives common to all languages, or a universal vocabulary, etc. Interlingual MT projects have also differed according to the emphasis on lexical (semantic) aspects and on syntactic aspects. Some concentrated on the construction of interlingual lexica (e.g. the Cambridge and Leningrad groups); others concentrated on interlingual syntax (e.g. the Grenoble and Texas groups).

Diagram 3: the transfer system (SL text -> SL analysis -> SL representation -> transfer -> TL representation -> TL synthesis -> TL text, with SL and TL dictionaries and grammars, an SL-TL transfer dictionary and transfer rules)

The third approach to overall MT strategy is the transfer approach. Rather than operating in two stages through a single interlingual representation, there are three
stages involving underlying representations for both SL and TL texts; i.e. the first stage converts SL texts into SL transfer representations, the second converts these into TL transfer representations, and the third produces from these the final TL text forms. Whereas the interlingual approach necessarily requires complete resolution of all ambiguities and anomalies of SL texts, so that translation should be possible into any other language, in the transfer approach only those ambiguities inherent in the language in question are tackled. Differences between languages of the know-savoir/connaître type would be handled during transfer: in English analysis know is treated as unambiguous, and there is no need to determine which kind of knowing is involved. Whereas the interlingual approach would require such analysis, the transfer approach does not; problems of mismatch between SL and TL lexical ranges are resolved in the transfer component. Systems differ according to the depth of analysis and the abstractness of SL and TL transfer representations. In the earliest systems analysis went no further than surface syntactic structures, with structural transfer therefore taking place at this depth of abstraction. Later (post-1970) transfer systems have taken analysis to deep semantico-syntactic structures (of various kinds), with correspondingly more abstract transfer representations and transfer rules.

The basic difference between these two indirect approaches and the (generally earlier) direct approach lies in the configuration of dictionary and grammar data. In direct systems the main component is a single SL-TL bilingual dictionary incorporating not only information on lexical equivalents but also all the data necessary for morphological and syntactic analysis, transfer and synthesis. In indirect systems, this information is dispersed among separate SL and TL dictionaries, separate SL and TL grammars, and either the interlingua vocabulary and syntax, or the SL-TL transfer dictionary (of lexical equivalences) and a grammar of SL-TL structure transfer rules.

2.11 Perspectives and influences

While the classification of MT systems in terms of basic strategy is a convenient descriptive device and will be employed in the grouping of system descriptions in later chapters, it has not been the most prominent perspective for MT researchers,
particularly in the 1950s and 1960s. For this period, the most important distinctions were between the engineering and the perfectionist approaches, between the empiricist and other methodologies, and between the syntax orientation and various lexical and word-centered approaches. The most immediate point of dispute was between those groups who agreed with Dostert and Booth on the importance of developing operational systems as quickly as possible (ch.2.4.3) and those who argued for more fundamental research before such attempts. The engineering approach held basically that all systems can be improved and that the poor-quality early word-for-word systems represented a good starting point. There were differences between what Garvin (1967) dubbed the brute force approach, which assumed that the basic need was larger storage capacity (e.g. the IBM solution, ch.4.2), and the engineering approach proper, which believed that algorithmic improvements based on reliable methods of (linguistic) analysis could lead to better quality. The perfectionists included all those groups which concentrated on basic linguistic research with high-quality systems as the objective. The latter differed considerably in both theories and methods. Disputes recurred frequently between the perfectionists and the engineers until the mid-1960s.

On questions of methodology the main point of difference concerned the empiricist approach, exemplified by the RAND group. The approach emphasized the need to base procedures on actual linguistic data; it was distrustful of existing grammars and dictionaries; it believed it was necessary to establish from scratch the data required and to use the computer as an aid for gathering data. The approach stressed statistical and distributional analyses of texts, and a cyclic method of system development: i.e. routines devised for one corpus were tested on another, improved, tested on a third corpus, improved again, and so forth. The empirical approach was in fact fully in accord with the dominant linguistic methodology of the 1940s and 1950s in the United States, the descriptivist and structuralist tradition associated particularly with Leonard Bloomfield (1933). The descriptivists adopted the behaviorist and positivistic method which insisted that only interpersonally observable phenomena should be considered scientific data, and which rejected introspections and intuitions. They distrusted theorizing, stressed data collection, and concentrated on methods of discovery and analysis. Traditional grammars were suspect: Charles Fries
(1952), for example, undertook a distributional analysis of telephone conversations that resulted in new grammatical categories for English. Most descriptivists worked, however, on phonetics and phonology. Only in the mid-1950s did some descriptivists, such as Zellig Harris, start work on syntax. It was therefore not surprising that the empiricists regarded their research within MT as extending the range of descriptive linguistics. The empiricist emphasis on probabilistic and statistical methods, however, has perhaps a different origin. It is likely to be the considerable influence of the statistical theory of communication associated with Claude Shannon, i.e. information theory, to which Warren Weaver made a substantial contribution. The theory had great impact on the anti-metaphysical inclinations of most American linguists, since it seemed to provide a basis for developing mechanical methods for discovering grammars. It may be noted that when Yngve first presented his ideas on syntactic transfer (Yngve 1957), he related his tripartite model to the information-theoretic triple of sender, channel and receiver.

A third area of difference among early MT groups was the question of what should be taken as the central unit of language. The majority assumed the centrality of the sentence; their approach was sentence-oriented (as was, and still is, in essence, that of most linguists and logicians), and so there was an emphasis on syntactic relations and problems. A minority upheld the centrality of the word. They emphasized lexical and semantic relations and problems. They included the lexicographic approach of Reifler and King, the thesaural approach of the Cambridge group, the word-centered theories of Lamb at Berkeley, and the dictionary-centered aspects of Melchuk's meaning-text approach. It should be stressed that these are differences only of general orientation; the syntax-oriented groups did not neglect lexical and semantic issues, and the lexis-oriented groups did not by any means neglect syntax. Indeed, in the case of Lamb and Melchuk it is very much an open question whether their models can be said to be oriented one way or the other.

In the descriptions above of various aspects of MT system design and methods of analysis, it may well have been implied, at a number of points, that language systems are intrinsically multileveled; that is to say, that linguistic description is
necessarily couched in terms of phonetics, phonology, morphology (word formation), syntax and semantics, and furthermore that analysis proceeds through each of these levels in turn: first morphological analysis, then syntactic analysis, then semantic analysis. (Lamb and Melchuk in fact developed the most extensive stratificational models within the MT context.) Although a stratal view of language systems is undoubtedly dominant in linguistics, and has been since the time of Saussure, the founder of modern (structuralist) linguistics, it has not been the conception of some MT project teams. Indeed, many (particularly in the earliest period) would have rejected such a stratal view of language both for being too rigid and for not conforming to reality. For them, all aspects of language (lexical, semantic and structural) interact inextricably in all linguistic phenomena. There is no doubt that among the most linguistics-oriented MT groups there has sometimes been an excessive rigidity in the application of the stratal approach to analysis (e.g. in parsing systems), and it has led to failures of various kinds. Nevertheless, the basic validity of the approach has not been disproved, and most modern (linguistics-oriented) MT systems retain this basic conception.

Conclusion: Computer programs are producing translations - not perfect translations, for perfection is an ideal to which even human translators can only aspire. Machine Translation is not primarily an area of abstract intellectual inquiry but the application of computer and language sciences to the development of systems answering practical needs. When scientists started looking at the Machine Translation component, they didn't really know how to go about evaluating its performance. Not having much past research to go by, they began by simply checking the translations. Once researchers realized this approach wouldn't work, they translated the documents themselves and checked the human translations against the Machine Translation versions. This allowed them to compile a list of the most common types of errors that occurred during the automatic translation process: errors of word order, context, pronoun reference, dictionary coverage, missing words, extra words, proper names and many more. Since Machine Translation aims primarily at comprehension and not at the production of a perfect Target Text, it is important to
follow two basic rules in order to make the best use of such programs. First, we need to recognize that certain types of texts, such as poetry, for example, are not suitable for Machine Translation. Second, it is essential to correct the Source Text, as even one letter can radically change the meaning.

III. DIFFICULTIES IN MACHINE TRANSLATION

Difficulties in Machine Translation are mostly due to various types of ambiguity, concerning the polysemy of words, phrase attachment, coordination, anaphoric reference, the scope of logical and modal operators, and so on. Unknown words and phrases are another major source of difficulty. Translation accuracy is expected to improve drastically if the input documents are marked up with appropriate tags which resolve such ambiguities or supply missing information.

3.1 Difficulties in translation

One difficulty in translation stems from the fact that most words have multiple meanings. Whether a human or a computer does a translation, meaning cannot be ignored. A word with sharply differing meanings has several different translations, depending on how the word is being used. The word 'bank' is often given as an example of a homograph, that is, a word entirely distinct from another that happens to be spelled the same. But further investigation shows that historically the financial and river meanings of 'bank' are related. They both come from the notion of a "raised shelf or ridge of ground". The financial sense evolved from the moneychanger's table or shelf, which was originally placed on a mound of dirt. Later the same word came to represent the institution that takes care of money for people. The river meaning has remained more closely tied to the original meaning of the word. Even though there is a historical connection between the two meanings of 'bank', we do not expect their translations into another language to be the same, and they usually will not be. This example further demonstrates the need to take account of meaning in translation. A human will easily distinguish between the two uses of 'bank' and simply needs to learn how each meaning is translated. Each language follows its own path in the development of meanings of words. As a result, we end up with a mismatch between languages, and a word in one language can be translated several different ways, depending on the situation. With the extreme examples given so far, a human will easily sense that multiple translations are probably involved, even if a computer would have difficulty. What
causes trouble in translation for humans is that even subtle differences in meaning may result in different translations. A human can learn the distinctions of meanings through substantial effort. It is not clear how to tell a computer how to make them. Being a native or near-native speaker involves more than just memorizing lots of facts about words. It includes having an understanding of the culture that is mixed with the language. It also includes an ability to deal with new situations appropriately. No dictionary can contain all the solutions, since the problem is always changing as people use words in unusual ways. These unusual uses of words happen all the time. Some only last for the life of a conversation or an editorial. Others catch on and become part of the language. Some native speakers develop a tremendous skill in dealing with the subtleties of translation. However, no computer is a native speaker of a human language. All computers start out with their own language and are 'taught' human language later on. They never truly know it the way a human native speaker knows a language, with its many levels and intricacies. Does this mean that if we taught a computer a human language starting the instant it came off the assembly line, it could learn it perfectly? Computers do not learn in the same way we do. We could say that computers can't translate like humans because they do not learn like humans. Then we still have to explain why computers don't learn like humans. What is missing in a computer that is present in a human? Building on the examples given so far, there are three types of difficulty in translation that are intended to provide some further insight into what capabilities a computer would need in order to deal with human language the way humans do.
Diagram: the human translation process (Source Text -> Analysis -> Meaning -> Synthesis -> Target Text)

1) Distinguishing between general vocabulary and specialized terms.

The first type of translation difficulty is the most easily resolved. It is the case where a word can be either a word of general vocabulary or a specialized term. Consider the word 'bus'. When this word is used as an item of general vocabulary, it is understood by all native speakers of English to refer to a roadway vehicle for transporting groups of people. However, it can also be used as an item of specialized terminology. Specialized terminology is divided into areas of knowledge called domains. In the domain of computers, the term 'bus' refers to a component of a computer that has several slots into which cards can be placed. One card may control a CD-ROM drive. Another may contain a fax/modem. If you turn off the power to your desktop computer and open it up, you can probably see the 'bus' for yourself. As always, there is a connection between the new meaning and the old. The new meaning involves carrying cards while the old one involves carrying people. In this case, the new meaning has not superseded the old one. They both continue to be used, but it would be dangerous, as we have already shown with several examples, to assume that both meanings will be translated the same way in another language. The way to overcome this difficulty, either for a human or for a computer, is to recognize whether we are using the word as an item of general vocabulary or as a specialized term.

Humans have an amazing ability to distinguish between general and specialized uses of a word. Once it has been detected that a word is being used as a specialized term in a particular domain, it is often merely a matter of consulting a terminology database for that domain to find the standard translation of that term in that domain. It is common for a translator to spend a third of the time needed to produce a translation on the task of finding translations for terms that do not yet appear in the terminology database being used. Where computers shine is in retrieving information about terms. They have a much better memory than humans. But computers are very bad at deciding which translation is the best one to store in the database. This failing of computers confirms our claim that they are not native speakers of any human language, in that they are unable to deal appropriately with new situations. When the source text is restricted to one particular domain, such as
computers, it has been quite effective to program a machine translation system to consult first a terminology database corresponding to the domain of the source text and to consult a general dictionary only for words that are not used in that domain. Of course, this approach does have pitfalls. Suppose a text describes a very sophisticated public transportation vehicle that includes a computer as standard equipment. A text that describes the use of this computer may contain the word 'bus' used sometimes as general vocabulary and sometimes as a specialized term. A human translator would normally have no trouble keeping the two uses of 'bus' straight, but a typical machine translation system would be hopelessly confused. This first type of difficulty, then, is the task of distinguishing between the use of a word as a specialized term and its use as a word of general vocabulary. One might think that if that distinction can be made, we are home free and the computer can produce an acceptable translation.
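The domain-first strategy is easy to express in code; the entries below are invented for illustration. Its pitfall is visible here too: once the computing glossary is selected, every occurrence of 'bus' gets the specialized reading, including occurrences that meant the vehicle.

```python
# Consult the domain terminology database first, then the general dictionary.
COMPUTING_TERMS = {"bus": "data pathway holding expansion cards"}
GENERAL = {"bus": "road vehicle for transporting groups of people",
           "pen": "writing instrument"}

def translate_word(word, domain_terms):
    """Domain term if one exists, otherwise the general-vocabulary entry."""
    return domain_terms.get(word) or GENERAL.get(word, word)

print(translate_word("bus", COMPUTING_TERMS))  # the specialized term wins
print(translate_word("pen", COMPUTING_TERMS))  # falls back on general vocabulary
```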
Diagram: fully automated translation (Source Text -> Text Formatting -> Dictionary Search -> Analysis -> Transfer -> Synthesis -> Target Text)

2) Distinguishing between various meanings of a word of general vocabulary.

The second type of difficulty is distinguishing between various uses of a word of general vocabulary. It is essential to distinguish between various general uses of a word in order to choose an appropriate translation. Already in 1960, an early machine translation researcher named Bar-Hillel provided a now classic example of the difficulty of machine translation. He gave the seemingly simple sentence "The box is in the pen". He pointed out that to decide whether the sentence is talking about a writing instrument or a child's playpen, it would be necessary for a computer to know about the relative sizes of objects in the real world. Of course, this two-way
choice, difficult as it is for a computer, is a simplification of the problem, since 'pen' can have other meanings, such as a short form of 'penitentiary' or another name for a female swan. But restricting ourselves to just the writing instrument and playpen meanings, only an unusual size of box or writing instrument would allow an interpretation of 'pen' as anything other than an enclosure where a child plays. The related sentence, "The pen is in the box", is more ambiguous. One would assume that the pen is a writing instrument, unless the context is about unpacking a new playpen or packing up all the furniture in a room. The point is that accurate translation requires an understanding of the text, which includes an understanding of the situation and an enormous variety of facts about the world in which we live. For example, even if one can determine that, in a given situation, 'pen' is used as a writing instrument, the translation into Spanish varies depending on the Spanish-speaking country.

3) Taking into account the total context, including the intended audience and important details such as regionalisms.

The third type of difficulty is the need to be sensitive to total context, including the intended audience of the translation. Meaning is not some abstract object that is independent of people and culture. A serious example of insensitivity to the total context and the audience is the translation of the English expression 'thank you', which is problematical going into Japanese. There are several translations that are not interchangeable and depend on factors such as whether the person being thanked was obligated to perform the service and how much effort was involved. In English, we make various distinctions, such as 'thanks a million' and 'what a friend', but these distinctions are not stylized as in Japanese, nor do they necessarily have the same boundaries. A human can learn these distinctions through substantial effort. It is not clear how to tell a computer how to make them. Languages are certainly influenced by the culture they are part of. The variety of thanking words in Japanese is a reflection of the stylized intricacy of politeness in Japanese culture as observed by Westerners.

Machine Translation (MT) can be defined as translation where the initiative is with a computer system, working either autonomously (the ideal of Fully Automatic High Quality Translation) or with human assistance. Machine Aided Translation (MAT) is human translation supported by a computer system; the support may take the form of lexical data, grammatical help, translation memory, domain information and organizational support.

3.2 Machine translation ambiguity. What makes machine translation so difficult? Part of the problem is that language is highly ambiguous when looked at word by word. Consider, for example, the word "cut" without knowing which sentence it came from. It could have come from any of the following sentences:
a) He told me to cut off a piece of cheese.
b) The child cut out a bad spot from the apple.
c) My son cut out early from school again.
d) The old man cut in line without knowing it.
e) The cut became infected because it was not bandaged.
f) Cut it out! You're driving me crazy.

If a computer sees only the word "cut" and the rest of the sentence is covered up, it is impossible to know which meaning of "cut" is intended. This may not matter as long as everything stays in English, but when the sentence is translated into another language, it is unlikely that the various meanings of "cut" will all be translated the same way. This phenomenon is called "asymmetry". The word "bank" illustrates an asymmetry between English and French. The principal translation of the French word banque (a financial institution) is the English word "bank"; if banque and "bank" were symmetrical, then "bank" would always translate back into French as banque. However, this is not the case: "bank" can also translate into French as rive, when it refers to the edge of a river. One may object that this is unfair because the meaning of "bank" was allowed to shift, but a computer does not deal with meaning; it deals with sequences of letters, and both meanings, the financial institution and the edge of a river, consist of the same four letters, even though they are different words in French. Thus English and French are asymmetrical.

Early researchers in machine translation were already aware of the problem of asymmetry between languages, but they seriously underestimated the difficulty of overcoming it. They assumed that by giving the computer access to a few words of context on either side of the word in question, the computer could figure out which meaning was intended and then translate it properly. Later, researchers realized that even when the entire sentence is available, it is still not always obvious how to translate without using knowledge about the real world. A classic sentence illustrating this difficulty uses the word "pen", which can refer either to a writing instrument or to an enclosure in which a child is placed to play so that it will not crawl off into another room. The ambiguity must be resolved, or the word "pen" will probably be translated incorrectly.

- The pen was in the box.

This sentence will typically be interpreted by a human as referring to a writing instrument inside a cardboard box, such as a gift box for a nice fountain pen or gold-plated ballpoint pen, rather than a playpen inside a big box. However, look what happens if the sentence is rearranged as follows:

- The box was in the pen.

This sentence will typically be interpreted by a human as referring to a normal-size cardboard box inside a child's playpen, rather than a tiny box inside a writing instrument. A human uses knowledge about the typical and relative sizes of objects in the real world to interpret such sentences; for a human, this process is nearly effortless and usually unconscious. For a computer with no access to real-world knowledge, it is impossible. The situation must also be taken into account. Returning to the sentence about the pen in the box, there are texts, such as a description of a family with small children moving their belongings to another apartment, in which a human would interpret the pen as the child's playpen being put into a large box to protect it while it is moved to a new location. And there are texts, such as a spy story about ultra-miniature boxes of top-secret information, in which the sentence about the box in the pen would be interpreted as referring to a writing instrument containing a tiny box. The words in these sentences do not change, yet the interpretation changes. Here even real-world knowledge is insufficient; only a sense of the flow of discourse and of the current situation can decide.
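The early context-window idea, and the way it breaks down, can be made concrete with a toy sketch. The cue words below are invented for illustration; the second example fails in exactly the way just described, because the deciding clue lies outside the window, or outside the text altogether:

    # Toy illustration of the early context-window approach to asymmetry.
    # The cue words are invented; real disambiguation needs world knowledge.

    SENSES = {"bank": {"banque": {"money", "account", "loan"},
                       "rive":   {"river", "water", "shore"}}}

    def translate_word(word: str, sentence: list, window: int = 3) -> str:
        """Pick a French translation of `word` by scanning `window` words
        on either side for a cue; default to the first sense otherwise."""
        i = sentence.index(word)
        context = set(sentence[max(0, i - window): i + window + 1])
        for translation, cues in SENSES[word].items():
            if context & cues:
                return translation
        return next(iter(SENSES[word]))       # arbitrary fallback

    print(translate_word("bank", "he opened an account at the bank".split()))
    # -> 'banque': the cue 'account' falls inside the window
    print(translate_word("bank", "they walked along the bank for hours".split()))
    # -> 'banque' again, wrongly: no cue is in view, so the fallback fires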

3.3 Problems of Machine Translation. Translating may be defined as the process of transforming signs or representations into other signs or representations. If the originals have some significance, we generally require that their images have the same significance or, more realistically, as nearly the same significance as we can get. Keeping significance invariant is the central problem in translating between natural languages. Several typical factors contribute to the difficulty of machine translation: words with multiple meanings, sentences with multiple grammatical structures, uncertainty about what a pronoun refers to, and other problems of grammar. Two widespread misunderstandings make translation seem simpler than it is. First, translation is not primarily a linguistic operation: to translate "The police refused the students a permit because they feared violence" into French, one must decide what "they" refers to, since "police" is feminine in French and the pronoun must agree in gender. Second, translation is not an operation that simply preserves meaning: different languages have different usage. There are languages like French in which pronouns must show number and gender, Japanese where pronouns are often omitted altogether, Russian where there are no articles, and Chinese where nouns do not differentiate singular and plural nor verbs present and past. The most important problems in automatic and machine translation are therefore the following:
a) Polysemy: a single word has several related meanings, and sometimes the proper translation is difficult to find even for a human translator. For example, the word "fair" might mean any of "beautiful", "light", "blond", "free from bias", etc.

b) Homonymy: several independent words share the same linguistic form. Homonyms are difficult to translate because the translation often depends on context and semantics. For example, the German word "Reif" might mean "ring", "bracelet", or "white frost", while the English word "screen" might be rendered in German as "Schirm", "Leinwand", "Raster", or "Abschirmung".

c) Syntactic ambiguity: the structure of a sentence depends not only on the types of words but often also on semantics. "Flying planes can be dangerous" is ambiguous because the words can be grouped in two ways: "(flying planes) can be dangerous" and "(flying) (planes) can be dangerous".

d) Referential ambiguity: pronouns refer to certain words, but it is often not obvious to which, and references may even cross sentence boundaries. Reference resolution is one particular area of language in which ambiguity is often problematic for computers. For example, it would be useful if an MT system could somehow distinguish between the various meanings of "it" in these three sentences (a toy resolver is sketched after this list):

The monkey ate the banana because it was hungry.
The monkey ate the banana because it was ripe.
The monkey ate the banana because it was teatime.

Native speakers will normally identify the intended meaning of such language very easily; indeed, it would probably not even enter their heads that alternative interpretations are possible. Non-native speakers, when presented with the text, will often have to narrow down the possible meanings in a more conscious way. An MT system, on the other hand, will not only find it difficult to decide between several sensible interpretations of a given sentence, but will also have no way of distinguishing between sensible and absurd interpretations of a given sentence or text.

e) Fuzzy hedges: vague words and expressions that are very difficult to translate. For example, "in a sense", "irgendwie" (German for "somehow"), "very", as in "in a sense, machine translation works nowadays".

f) Metaphors and symbols: because they depend on the underlying culture and history, they often cannot be translated directly (Chinese sayings sometimes just do not make sense when rendered literally); in such situations idiomatic dictionaries may be used to ease translation. For example, in "Mit eiserner Miene feuerte er seinen treuesten Mitarbeiter" ("With an iron expression he fired his most loyal employee"), the corresponding English idiom is "with a stony expression".

g) New developments: all languages of the world are dynamic; new words are constantly created, as are proper names of new technologies. For example, "secure shell", "telnet".

h) Synonyms: there are always several words with the same meaning, and it is difficult for a computer to choose the right one because the choice depends on context, style and semantics.
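For referential ambiguity in particular, one partial remedy is to mark words with crude semantic features (selectional restrictions) and require the antecedent of "it" to be compatible with the predicate. The features and requirements in this Python sketch are invented for illustration; the "teatime" sentence shows where the trick runs out, since neither candidate fits and only world knowledge about meal conventions resolves the pronoun:

    # Toy selectional-restriction resolver for the monkey/banana sentences.
    # Features and predicate requirements are invented illustrations.

    FEATURES = {"monkey": {"animate"}, "banana": {"edible"}}
    REQUIRES = {"hungry": "animate",   # only animate things get hungry
                "ripe": "edible"}      # only edible things ripen

    def resolve_it(candidates: list, predicate: str):
        """Return the candidate whose features satisfy the predicate,
        or None when the restriction does not decide (e.g. 'teatime')."""
        needed = REQUIRES.get(predicate)
        matches = [c for c in candidates if needed in FEATURES.get(c, set())]
        return matches[0] if len(matches) == 1 else None

    for pred in ("hungry", "ripe", "teatime"):
        print(pred, "->", resolve_it(["monkey", "banana"], pred))
    # hungry -> monkey; ripe -> banana; teatime -> None (needs world knowledge)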

3.4 Cognitive processes. To understand the essential principles underlying machine translation, it is necessary to understand the functioning of the human brain. The first stage in human translation is complete comprehension of the source language text. This comprehension operates on several levels:

1. Semantic level: understanding words out of context, as in a dictionary.
2. Syntactic level: understanding words in a sentence.
3. Pragmatic level: understanding words in situations and in context.

Furthermore, there are at least five types of knowledge used in the translation process:

a) Knowledge of the source language, which allows us to understand the original text.
b) Knowledge of the target language, which makes it possible to produce a coherent text in that language.
c) Knowledge of equivalents between the source and target languages.
d) Knowledge of the subject field as well as general knowledge, both of which aid comprehension of the source language text.
e) Knowledge of socio-cultural aspects, that is, of the customs and conventions of the source and target cultures.

Given the complexity of the phenomena that underlie the work of a human translator, it would be absurd to claim that a machine could produce a target text of the same quality as a human being. However, it is clear that even a human translator is seldom capable of producing a polished translation at the first attempt. In reality the translation process comprises two stages: first, the production of a rough text or preliminary version in the target language, in which most of the translation problems are solved but which is far from perfect; and second, the revision stage, which varies from merely re-reading the text while making minor adjustments to implementing radical changes. It could therefore be said that machine translation aims at performing the first stage of this process automatically, so that the human translator can then proceed directly to the second, carrying out the meticulous and demanding task of revision. The problem is that the translator now faces a text that has been translated not by a human brain but by a machine, which changes the required approach because the errors are different. It becomes necessary to harmonize the machine version with human thought processes, judgments and experience.

Machine translation is thus both an aid and a trap for translators: an aid because it completes the first stage of translation; a trap because it is not always easy for the translator to keep the necessary critical distance from a text that, at least in a rudimentary way, is already translated, so that mistakes may go undetected. In no sense should a translation produced automatically be considered final, even if it appears on the surface to be coherent and correct.

Conclusion: From the material presented we can see that translation is not an easy process; it will have to be studied further before we have good, or good enough, machine translation. People have worked hard to make translation easier by using machines. Current machine translation systems are already very helpful, but not perfect. There are linguistic problems that cannot be satisfactorily solved by computers, which cannot think like humans. Many of the difficulties in machine translation are due to various types of ambiguity, concerning the polysemy of words, phrase/clause attachment, coordination, anaphoric reference, the scope of logical/modal operators, and so on. Perhaps in the future further progress in Artificial Intelligence will help to solve the remaining problems. Translation is a very difficult task requiring much feeling for, and understanding of, cultural aspects, which a computer does not have. Some grammatical structures in a given language do not exist in another language, and that is why translation without interpretation remains an unsolved problem.

General conclusion

Machine translation and computer-assisted translation have long been subjects of discussion. From the beginning, some translators have totally rejected even the idea of machine translation, because they associate it with the view that translation is merely one more marketable product, based on a calculation of investment versus profit. They define translation as an art that possesses its own aesthetic criteria, which have nothing to do with profit and loss but are related to creativity and the power of the imagination. This applies mostly, however, to specific kinds of translation, such as that of literary texts, where polysemy, connotation and style play a crucial role; it is clear that computers could not even begin to replace human translators with such texts. Even with other kinds of texts, the analysis of the roles and capabilities of machine translation shows that it is not efficient and accurate enough to eliminate the necessity for human translators. The first point to be made is that MT is a translation method that focuses on the source language, while human translation aims at comprehension in the target language. Machine translations are therefore often inaccurate, because they take words from a dictionary and follow the situational limitations set by the program designer. In fact, translators should recognize and learn to exploit the potential of the new technologies to help them be more rigorous, consistent and productive, without feeling threatened. Translation is a very difficult task requiring much feeling for, and understanding of, cultural aspects, which a computer does not have. One reason is that a computer is not able to think and to interpret the environment (its social and cultural aspects) as a human being does. Another reason is that some grammatical structures in the source language do not exist in the target language, and the computer does not know which grammatical structure to follow. Translating with the help of the computer is definitely not the same as working exclusively on paper and with paper products such as conventional dictionaries, because computer tools give us a relationship to the text which is much more flexible than a purely linear reading. Furthermore, the Internet, with its universal access to information and instant communication between users, has created a physical and geographical freedom for translators that was inconceivable in the past. Translators need to accept the new technologies and learn to use them to their maximum potential as a means to increased productivity and quality improvement. Of course, improving the quality of machine translation is mainly the task of its developers; however, users can also make some effort toward acceptable results, because the quality of machine translation depends first of all on the quality of the delivered source text.

Appendix 1

ENGLISH IDIOMS, MACHINE TRANSLATION, RUSSIAN EQUIVALENT

[The machine translation and Russian equivalent columns of the original table were lost in file conversion; only the English idioms are recoverable:]

As the crow flies; Ask for the moon / cry for the moon; Blue-eyed boy; Bow and scrape; A country cousin; Carte blanche; Clean as a whistle; Cold comfort; Come to a sticky end; Comfortable in one's skin; Creature comforts; Crocodile tears; Cut to the chase; Diamond in the rough (a); Don't lock the barn door after the horse is gone; Down at heel; Drag somebody's name through the mire/mud; Eye for an eye, a tooth for a tooth; Feeding frenzy (a); Get one's back up; Get one's brain in gear; Get one's feet wet; Get sth down to a fine art; Get the better of sb; Give the bum's rush; Go cold turkey; Go on the wagon; Go through thick and thin; Gobbledygook; Guinea pig; Have (had) a good innings; Have one's back to the wall; Hen night; (An) Indian summer; Kangaroo court; Keep (oneself) to oneself; Keep a stiff upper lip; Keep books; Keep one's shirt on; Keep peace; Kickback; Knock off one's feet; Knock one's block off; Last ditch effort; Let off steam; Neck and neck; No elbow room; Nosey Parker; Not for the world; Not to give somebody the time of day; Not to touch something with a ten-foot pole; Nothing if not; Nutty as a fruitcake; Of one's own free will; Off guard; Off the beaten track; Off-color; On a shoestring; On and off; On pins and needles; On the block; Out of sorts; Period; Poke one's nose in something; Probe the ground; Pull a long face; Pull up stakes; Put in a good word for; Put two and two together; Rant and rave; Red herring; Scratch the surface; Separate the wheat from the chaff; Sitting duck/target; Sow one's wild oats; Sow the seeds of doubt; Stand a good chance; Stay the course; Straw that breaks the camel's back; Sugar daddy; Under one's breath; Up and about; Upper crust; Upset the applecart; Variety store; Wear the pants in one's family; Weasel word; Weigh one's words; Well-heeled; Wet one's whistle; What have you / what not; What's the (big) idea?; When hell freezes over; When the chips are down; Where the shoe pinches; Whispering campaign; Whistle in the dark; White sale; With/in a whole skin; Words of one syllable; Work one's fingers to the bone; World is your oyster; Worth one's salt; Yak-yak; You bet (your boots).

Appendix 2 I.
Original Text in Romanian

Fiecare om se poate prevala de toate drepturile și libertățile proclamate în prezenta Declarație fără nici un fel de deosebire ca, de pildă, deosebirea de rasă, culoare, sex, limbă, religie, opinie politică sau orice altă opinie, de origine națională sau socială, avere, naștere sau orice alte împrejurări. În afară de aceasta, nu se va face nici o deosebire după statutul politic, juridic sau internațional al țării sau al teritoriului de care ține o persoană, fie că această țară sau teritoriu sînt independente, sub tutelă, neautonome sau supuse vreunei alte limitări a suveranității.
Human translation in English

Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.
Automatic Translation in English

Each arid man can prevala of all right the si the liberties proclamate at bring forward Declaratie except whatever discrepancy as the, of ensample, discrepancy of breed, color, sex, the borage, religion, judgement policy or whatever another judgement, of intranational origination or social, have, bear or any another circumstance. At out thereon, don't arid he will do either a discrepancy after the politic judicial status or intern? Ional of czar or of territory of which holds a man, is as this tare or territory am the independence, below ward, neautonome or obeied another vreunei former confinements sovereignty.

II.
Original Text in English

Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.

Human translation in Romanian

Fiecare om se poate prevala de toate drepturile și libertățile proclamate în prezenta Declarație fără nici un fel de deosebire ca, de pildă, deosebirea de rasă, culoare, sex, limbă, religie, opinie politică sau orice altă opinie, de origine națională sau socială, avere, naștere sau orice alte împrejurări. În afară de aceasta, nu se va face nici o deosebire după statutul politic, juridic sau internațional al țării sau al teritoriului de care ține o persoană, fie că această țară sau teritoriu sînt independente, sub tutelă, neautonome sau supuse vreunei alte limitări a suveranității.
Automatic Translation in Romanian

Oricine este ndreptatit sa toate drepturi si libertati expuse n aceasta Declaratie, fara deosebire de oricare amabila, ca de exemplu concurati, culoare, sex, limba, religie, politica sau parere alta, nationala sau originea sociala, proprietate, nastere sau starile alte. Nu mai mult, nici o deosebire nu va fi va face pe baza politice, de jurisdictie sau stari tarii internationale sau teritoriul la care persoana i apartine, daca el este independent, ncredeti-va n, non-de sine-de guvernamnt? sau sub oricare alta limitare de suveranitate.

III.
Original Text in Romanian

Fiecare om se poate prevala de toate drepturile și libertățile proclamate în prezenta Declarație fără nici un fel de deosebire ca, de pildă, deosebirea de rasă, culoare, sex, limbă, religie, opinie politică sau orice altă opinie, de origine națională sau socială, avere, naștere sau orice alte împrejurări. În afară de aceasta, nu se va face nici o deosebire după statutul politic, juridic sau internațional al țării sau al teritoriului de care ține o persoană, fie că această țară sau teritoriu sînt independente, sub tutelă, neautonome sau supuse vreunei alte limitări a suveranității.
Human translation in English

Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.
Automatic Translation in English

Either jack herself maybe prevala from all drepturile and libertile proclamate in prezenta Statement but neither fel from variant ca , from pild , variant from ras , color , sex , tongue , religion opinie policy or any alt opinie , from origin national or socialist , estate , birth or any alte mprejurrin out from aceasta , non herself va face neither variant after statue political , lawyer or international of rii or of teritoriului from what ine one person fie c aceast country or teritoriu snt independente , below tutel neautonome or supuse vreunei alte limit of suveranittii.

IV.
Original Text in English

Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, color, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.
Human translation in Romanian

Fiecare om se poate prevala de toate drepturile și libertățile proclamate în prezenta Declarație fără nici un fel de deosebire ca, de pildă, deosebirea de rasă, culoare, sex, limbă, religie, opinie politică sau orice altă opinie, de origine națională sau socială, avere, naștere sau orice alte împrejurări. În afară de aceasta, nu se va face nici o deosebire după statutul politic, juridic sau internațional al țării sau al teritoriului de care ține o persoană, fie că această țară sau teritoriu sînt independente, sub tutelă, neautonome sau supuse vreunei alte limitări a suveranității.
Automatic Translation in Romanian

Fiecare is entitled la spre tot arthot rights i freedoms a ezat a aranja fortrea nuntru this Declaration , fr distinction de orice kind such as race , colonie , sex , limbaj , religie , politic sau alt opinion , naional sau socialist origine , propriu , natere sau alt statuie. Mult mai ndeprtat , nu distinction shall a fi made pe basis de la politic jurisdictional sau internaional statuie de la ar sau territory la spre care un persoan sub , dac it a fi independent trust , nu - de sine - guvern sau jos orice alt limitation de sovereignty.

V.
Original Text in Romanian

Fiecare om se poate prevala de toate drepturile și libertățile proclamate în prezenta Declarație fără nici un fel de deosebire ca, de pildă, deosebirea de rasă, culoare, sex, limbă, religie, opinie politică sau orice altă opinie, de origine națională sau socială, avere, naștere sau orice alte împrejurări. În afară de aceasta, nu se va face nici o deosebire după statutul politic, juridic sau internațional al țării sau al teritoriului de care ține o persoană, fie că această țară sau teritoriu sînt independente, sub tutelă, neautonome sau supuse vreunei alte limitări a suveranității.
Human translation in English

Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.
Automatic Translation in English

Each man perhaps had been predominating of all the proclaimed liberties and the rights in present declaration without no difference kind that, thus , the breed difference , colour, sex, language, religion, policy opinion or whatever other opinion , of social her national origin , money-bag, birth or whatever other circumstances . Excepting this , it will not do no difference after the status politically, judicial her international of country or of territory of which holds a person , fIE that this territory or country are independent , in trust , nEAUTONOME or obedient VREUNEI other limitations of sovereignty .

VI.
SOURCE TEXT (Le Monde Diplomatique, September 2002): Depuis le 11 septembre 2001, l'esprit guerrier qui souffle sur Washington semble avoir balayé ces scrupules. Désormais, comme l'a dit le président George W. Bush, « qui n'est pas avec nous est avec les terroristes ».

Systran: Since September 11, 2001, the warlike spirit which blows on Washington seems to have swept these scruples. From now on, like said it the president George W Bush, "which is not with us is with the terrorists". (37 words)

Reverso: Since September 11, 2001, the warlike spirit which blows on Washington seems to have swept (annihilated) these scruples. Henceforth, as said it the president George W. Bush, "which (who) is not with us is with the terrorists". (35 + 2 words)

Human translation: Since 11 September 2001 the warmongering mood in Washington seems to have swept away such scruples. From that point, as President George Bush put it, "either you are with us or you are with the terrorists." (36 words)

The first point to be made is that Machine Translation is a translation method that focuses on the source language, while human translation aims at comprehension of the target language. Machine translations are therefore often inaccurate because they take the words from a dictionary and follow the situational limitations set by the program designer. Various types of errors can be seen in the above translations.

Appendix 3. Practical Tips for Pre-Editing

1. Always run the draft for translation through grammar-checking software, which can catch overly complex constructions, compound verbs and obscure phrasing (often flagged as passive voice).
2. Use a word processor's spelling and grammar checker, or use that function in an MT program.
3. Use a thesaurus to simplify uncommon usages.
4. Stick to a logical sequence of events, without flashbacks.
5. Spell out abbreviations when they're first used, with the abbreviation put in all-caps in brackets.
6. Avoid idiomatic, slang and regional or national expressions.
7. Don't use complex compound structures.
8. Be precise. Avoid fuzzy language (a rough automated check for this and for sentence length is sketched after this list).
9. Don't make the comprehension of the text dependent on formatting such as italics or indents.
10. Try to use the ISO format for dates.
11. Be careful with contracts, where language may have a precise but obscure legal meaning.
12. Translate back and forth (back to the original language) to see where the translation goes astray, and reword.
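Some of these tips can be checked mechanically before a text is submitted to an MT system. The following Python sketch flags over-long sentences and a few known-ambiguous words; the word list and the 20-word limit are illustrative choices taken from the advice in this appendix, not fixed rules:

    # Rough pre-editing checker: flag long sentences and words that are
    # known to confuse MT systems (illustrative list, not exhaustive).

    import re

    AMBIGUOUS = {"film", "picture", "pen", "bank", "cut"}  # illustrative
    MAX_WORDS = 20                                         # per the advice below

    def check_for_mt(text: str) -> list:
        """Return human-readable warnings for each problematic sentence."""
        warnings = []
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            words = sentence.split()
            if len(words) > MAX_WORDS:
                warnings.append(f"too long ({len(words)} words): {sentence[:40]}...")
            for w in words:
                if w.lower().strip(".,!?") in AMBIGUOUS:
                    warnings.append(f"ambiguous word {w!r} in: {sentence[:40]}...")
        return warnings

    print(check_for_mt("The pen was in the box. We saw a film."))
    # -> warnings for the ambiguous words 'pen' and 'film'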

Tips for Preparing Your Document for Translation


Translating English materials into other languages has its share of pitfalls, many of which can be avoided. At Simultaneous Translation, we look for the following primary difficulties at the beginning of a translation to prevent problems and ensure consistency and clarity in the target language:
- Maintain consistency of terminology.
- Strive for clarity and use simple, direct sentences with basic grammatical construction; international users generally prefer straightforward, factual wording.
- Provide a list of all terms which should remain in English (for example, proper names, product names and titles) to alert the translator.

Making Machine Translation Work


While expecting MT to fully translate the complexities of language remains an unrealistic standard, MT can help people get access to a vast amount of information and extract the essence of the meaning. In going beyond that to create their own original messages, people have three alternatives if they want to get acceptable results without help from a human translator:

1. Make adjustments before sending a message. While avoiding extreme controlled-language approaches, people can learn to speak carefully and add visual hints such as graphics, if the desire to communicate is strong. It's also a good idea to translate results back into the original language and, if something's completely off base, to reword the original.

2. Check the translation in progress. Some Web programs allow users to list words, such as proper names, that they do NOT want translated. Communication with the person receiving the translation can also become interactive, but people have to be willing to send back-translated sentences that are unclear for clarification, and to ask questions. That involves delays that interfere with direct communication, and it can also mean getting over conventions where people want to avoid anything that might imply criticism of the sender. Programs like Translator that allow people to add their own expressions to standard dictionaries can help, and will likely become more widely available (a placeholder-masking sketch of the do-not-translate idea appears after this list).

3. All parties to the communication adjust their expectations and tolerance. "There are many millions of people around the world, particularly younger people under 30, using the new technology . . . who have no problem at all in accepting the raw English output of the better MT systems as being acceptable as the fractured Americanized English that they use as a common language when they get together with foreign contemporaries on line, or face-to-face in our increasingly global society," asserts Haynes. He even expects a computerized equivalent of Pidgin English to develop.
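The do-not-translate lists mentioned in point 2 are often implemented by masking the protected words with placeholder tokens before translation and restoring them afterwards. In this Python sketch the translate() function is a hypothetical stand-in (here it just uppercases the text), not any real MT engine:

    # Sketch of a do-not-translate list via placeholder masking.
    # `translate` is a hypothetical stand-in for a real MT engine.

    def translate(text: str) -> str:
        return text.upper()               # dummy "translation" for the demo

    def translate_protected(text: str, protected: list) -> str:
        """Mask protected words, translate, then restore the originals."""
        masks = {}
        for i, word in enumerate(protected):
            token = f"XNOTRANSX{i}X"      # unlikely to be altered by the MT step
            masks[token] = word
            text = text.replace(word, token)
        text = translate(text)
        for token, word in masks.items():
            text = text.replace(token, word)
        return text

    print(translate_protected("Contact John Smith about the report.",
                              ["John Smith"]))
    # -> CONTACT John Smith ABOUT THE REPORT.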

ADVICE

Be Concise
- Remember that machine translation is a computer process that prefers common words and phrases.
- Start with simple, clear and formal sentences and phrases.
- Keep sentences short, limiting them to 15-20 words for best results.
- If a sentence contains multiple ideas or thoughts, break it into one sentence per idea.
- Avoid unnecessarily complex words and sentences.

Write clearly and formally


- Word your documents in such a way as to avoid idioms, clichés, colloquial expressions and slang.
- Consider the literal meaning of words and try to express this instead.

Avoid Ambiguity
- Try not to use words that have more than one meaning; for example, use "movie" instead of "film", and "painting" or "photograph" instead of "picture".
- Words ending in "-ing" can sometimes be ambiguous, such as "rowing", which can be a noun or a verb; where possible, choose an alternative.

Always check spelling and grammar


- Incorrect spelling or grammar leads to translation errors; for example, if a word is spelt incorrectly, the translator will not be able to identify it.

Include appropriate accents

Always use the correct accent marks in your text.

Be aware of Punctuation Pitfalls


- Avoid the use of complicated punctuation marks such as parentheses and hyphens.
- Avoid abbreviations or, if you need to use them, keep them consistent.
- Use articles in front of listed items; for example, instead of "the judge and jury", use "the judge and the jury".

Do not leave words out


- Some words, such as "that", "which" and "who", can be implied in everyday use and are often omitted when writing text; try not to do this, as they may be required in the target language.

Bibliography
[1] "Language and Machines. Computers in Translations and Linguistics"; Washington, 1966. [2] Oettinger, Anthony G.;" Automatic Language Translation. Lexical and Technical Aspects with Particular Reference to Russian". London, 1960. [3] Lorscher Wolfgang, "Translation Performance, Translation Process and Translation Strategies". Tubingen, 1991. [4] Heaton J.B., Turton N.D., "Longman dictionary of common Errors". Haslow, 1994. [5] An evolution of Machine Aided Translation Activities at F.T.D., Contract AF 33(657) 13616, Case 66556, May 1, 1965, p. G-10. [6] , .., " , ". , 1975. [7]. www.nbrigham.org/brigham_machinetranslation.html [8]. www.deltatranslator.com/tips.htm [9] http://archiv.tu-chemnitz.de/pub/2001/0043/data/presentation-html [10] Hutchins, W.J.: Machine Translation: Past, Present, Future. Ellis Horwood Limited, 1986. [11] Hutchins, W.J.: Research Methods and System Designs in Machine Translation: A Ten-Year Review 1984-1994. In: Proceedings of "Machine Translation Ten Years On". Cranfield, U.K. 1994.

[12] Hutchins, W.J.: Machine Translation: Past, Present, Future. Ellis Horwood Limited, 1986. [13] Hutchins, W.J.: Machine Translation. History, Current Status, and Possible Future Developments. Lecture at Tzigov Summer School on Applied Linguistics 1995. [14] Leontieva, N. and Shaljapina, Z.: Current state of MT. In: E. Popov (ed.), Handbook of Artificial Intelligence, Vol. 1, Moscow, "Radio i swjaz", 1990, pp. 216247. [15] Sowa, J.F. Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley. [16] http://www.idioms.ru/?q=forum75 [17] Claudia Gdaniec. 1999. Using MT for the Purpose of Information Assimilation from the Web. In Workshop on Problems and Potential of English-GermanMT systems. TMI, Chester, UK. [18] John White. 1995. Approaches to Black-box Machine Translation Evaluation. In Proceedings of the MT Summit 1995. Luxembourg. [19] ARNOLD, Doug, BALKAN, Lorna et al. Machine Translation: An Introductory Guide. URL:. MTbook/HTML/. [20] HUTCHINS, W. John. Translation. Technology and the Translator. URL: [21] HUTCHINS, W. John. Computer-based Translation Systems and Tools. URL: . [22] KAY, Martin. History of machine Translation. URL: . [23] http://www.foreignword.com/tools/Transnow.htm
[24] www.onlinetranslator.com/?lang=eng

[25] Olivia Craciunescu, Constanza Gerding-Salas, Susan Stringer-O'Keeffe "Machine Translation and Computer-Assisted Translation: a New Way of Translating?"
[26] TRUJILLO, A. Translation engines: techniques for machine translation. London: Springer, 1999. URL: .

[27] William J. Sullivan, "Ideology and Patterns in Translation Error" [28] Bassnett, S. (1991). Translation Studies. London: Routledge

[29] www.thefreedictionary.com [30] www.ttt.org/technology.html [31] www.-306.ibm.com/softwarel [32] http://archive.tu-chemnitz.de/pub/2001/0043/data/resentation-html/ [33] http://est6.freetranslation.com
