Googling Machine Translation

By Paula Dieli

Mention the words “machine translation,” and a translator’s thoughts will range from job security to the ridiculously funny translations we’re able to produce with so-called online translation tools. Should we be worried that machines will take over our jobs? Paula Dieli thinks not, and explains why in this report.

I recently attended a presentation on “Challenges in Machine Translation,” sponsored by the International Macintosh Users Group (IMUG), at which Dr. Franz Josef Och, Senior Staff Research Scientist at Google Research, presented some of the challenges Google is facing in its machine translation (MT) research, and how some of these challenges are being addressed. Excitement about successes in machine translation research initially came to a head back in 1954 with a report in the press regarding the Georgetown University/IBM experiment, which had used a computer to translate Russian into English. In the 50 years since, we have continued to read about the great advances that will be possible in “the next 20 years,” but these great advances never came to pass. When the Internet came of age, online translation tools surfaced and we translators amused ourselves by seeing what crazy translations we could come up with by entering seemingly simple phrases.

The linguistics of MT

So why did the research never produce anything really viable? It was based on a linguistic approach; that is, an analysis of the structure of a language followed by an attempt to map it into machine language such that one could input a source language text and out would come a wonderful translation in the target language, albeit with a few minor errors. As we all know, a language is filled with so many cultural, contextual, idiomatic, and exceptional uses that this task became virtually impossible, and no real progress has been made with this approach in the past 50 years.

Dr. Geoffrey Nunberg, adjunct full professor at UC Berkeley, linguist, researcher, and consulting professor at Stanford University, had this to say at a recent NCTA presentation: “I asked a friend of mine, who is the dean of this [MT] field, once, ‘if you asked people working in machine translation how long it will be until we have perfect, idiomatic machine translation of text …?’, they would all say about 25 years. And that’s been a constant since 1969.”

The data-driven approach

In recent years, MT researchers have begun to take a different approach, which can be loosely compared to the work you do as a translator when you use a tool such as SDL Trados WinAlign or Translator’s Workbench. That is, you use a data-driven methodology. As you translate, you store your translations in a translation memory (TM), so that if that same or a similar translation appears again, the tool will notify you and let you use that translation as is, or modify it slightly to match the source text. The more you translate similar texts in a particular domain, the more likely it is that you will find similar translations already in your TM.
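To make the translation memory idea concrete, here is a minimal sketch in Python. It is not how SDL Trados or any commercial tool works internally; the class name `TranslationMemory`, the 0.75 similarity threshold, and the sample real estate segments are all assumptions made for illustration, and the similarity measure is simply the one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory (hypothetical): stores source/target segment pairs."""

    def __init__(self):
        self.segments = {}  # source segment -> stored translation

    def add(self, source, target):
        self.segments[source] = target

    def lookup(self, source, threshold=0.75):
        """Return (score, stored_source, stored_target) for the best match
        at or above the threshold, or None if nothing is close enough."""
        best = None
        for stored_source, stored_target in self.segments.items():
            # Character-level similarity between 0.0 and 1.0
            score = SequenceMatcher(None, source.lower(), stored_source.lower()).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, stored_source, stored_target)
        return best


tm = TranslationMemory()
tm.add("Three-bedroom house for sale.", "Maison de trois chambres à vendre.")
tm.add("Open house this Sunday.", "Portes ouvertes ce dimanche.")

exact = tm.lookup("Three-bedroom house for sale.")          # identical segment
fuzzy = tm.lookup("Four-bedroom house for sale.")           # similar segment
miss = tm.lookup("The contract must be signed by Friday.")  # nothing similar stored
```

The identical segment comes back with a score of 1.0, the similar one comes back as a “fuzzy match” whose stored translation the translator would adjust by hand, and the unrelated sentence returns nothing. The more segment pairs the memory holds, the more often a lookup succeeds.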

Similarly, suppose that before you began translating a weekly online newsletter of real estate announcements, for example, you searched the Internet for existing translations in your language pair, aligned them, and imported them into your TM via WinAlign. You might find that much of the work had already been done for you. Imagine now if you were to input 47 billion words’ worth of these translations. Your chances of being able to “automatically” translate much of your source text would certainly increase. This is the approach that Google is taking.

Google’s goal, as stated by Dr. Och, is “to organize the world’s information and make it universally accessible and useful.” Now before you go thinking you’re out of a job, their data-driven approach has proven successful only for certain language pairs, and only in certain specialized domains. They have achieved success in what they call “hard” languages, that is from Chinese to English, and from Arabic to English in domains such as blogging, online FAQs, and interviews by journalists.

Dr. Och attributed their progress to “learning from examples rather than from a rule-based approach.” He admits that “more data is better data.” He went on to say that adding 2 trillion words to their data store would result in a 1 percent improvement for specific uses such as the ones described above. They see a year-to-year improvement of 4 percent by doubling the amount of data in their data store, or “corpus.” The progress reported by Dr. Och is supported by a study conducted by the National Institute of Standards and Technology (NIST) in 2005. Google received the highest BLEU (Bilingual Evaluation Understudy) scores using their MT technology to translate 100 news articles in the language pairs mentioned above. A BLEU score ranges from 0 (lowest) to 1 (highest) and is calculated by comparing the machine-translated segments with human reference translations of the same source segments (a penalty is applied to output that is too short, since short output would otherwise earn an artificially high score).
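For readers curious about how such a score might be computed, the following Python sketch shows the general idea behind BLEU for a single sentence and a single reference translation. It is a simplification and not the NIST evaluation procedure: standard BLEU is computed over a whole test set, uses word sequences of up to four words, and can use several reference translations. The function name `simple_bleu`, the choice of one- and two-word sequences, and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive words in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams also found in the reference, counting each
    n-gram no more often than it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU (hypothetical helper, one reference only)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: output shorter than the reference is scored down
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * geo_mean


reference = "the cat is on the mat"
perfect = simple_bleu("the cat is on the mat", reference)
short = simple_bleu("the cat", reference)
unrelated = simple_bleu("dogs bark loudly", reference)
```

Here the candidate that matches the reference word for word scores 1, and the unrelated candidate scores 0. The two-word candidate “the cat” contains only words and word pairs that appear in the reference, so without the brevity penalty it would also score 1; the penalty pulls it down to roughly 0.14.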

Challenges and limitations

So what are the limitations of this data-driven approach? When asked by a member of the audience if Google’s technology could be used to translate a logo, Dr. Och instantly replied that such a translation would require a human translator. It’s clear that Google’s approach handles a very specific type of translation. Similar data-driven MT implementations can be used to translate highly specialized or technical documents with a limited vocabulary which wouldn’t be translated 100 percent correctly, but which would be readable enough to determine whether the document is of interest. In that case, a human translator would be needed to “really” translate it.

The Google approach described above deals with a tremendous amount of data and a very targeted use. It works only for some languages—German, for example, has been problematic—and in order to improve in more than just small increments, human intervention is required to make corrections to errors generated by this approach. One example that Dr. Och provided—the number “1,173” was consistently incorrectly translated into the word “Swedes”—confirms that a machine can’t do it all.

And if you think for a minute about the amount of Internet-based data being generated on just an hourly basis, it’s great to have machines around to handle some of the repetitive (read: uninteresting) work, and let us translators handle the rest. That still leaves plenty of work for us humans.

Alternative technologies

There are other approaches to MT, including example-based technology, which combines existing translations (such as you have in your translation memory) with a linguistic approach in which an unmatched segment is analyzed against a set of heuristics, or rules, based on the grammar of the target language. Some proponents of this approach concede that large amounts of data would be needed to make it successful, and have all but abandoned their research. Once again, we can see that any approach that relies even partially on linguistics has not met with a reasonable level of success.

Other advances occurring in the MT arena include gisting and post-editing. MT can be used successfully in some settings where the gist of a document is all that is needed in order to determine if it is of enough interest to warrant a human translation. There are also MT systems on the market that produce translations that require post-editing by human translators who spend (often painful) time “fixing” these translations, correcting the linguistic errors that such a system invariably produces. While this may not be the translation work you’re looking for, I know of at least one large translation agency that provides specific training for this type of post-editing to linguists willing to do this kind of work. This is another example that shows that while machines play a part, there is still a role for human translators in the overall process.

Still other advancements include the licensing of machine translation technology based on a data-driven approach, which can be tailored to work with existing translations and terminology databases at a specific company. As with the Google solution, such technologies typically work on a limited set of languages. However, they can help translate some of the less interesting, repetitive information out there. And with more information being produced at a continually increasing rate, have no fear: there will still be plenty of work for human translators to do!

The road ahead

Where does that leave us? From the typewriter to word processors to CAT (Computer-Assisted or Computer-Aided Translation) tools and the pervasiveness of the Internet, our livelihood has been transformed, in a positive way. We are more productive and able to work on more interesting translations than ever before.

I encourage you to embrace technology; understand how it is helping to make information accessible, and learn how technology can help translators do the work that only humans can do.

More information

The calendar of upcoming presentations by the International Macintosh Users Group (IMUG) can be found at http://www.imug.org.

You can get the official results of the 2005 Machine Translation Evaluation from the National Institute of Standards and Technology (NIST) at http://www.nist.gov/speech/tests/mt/doc/mt05eval_official_results_release_20050801_v3.html.
