20 February 2012

From the mouths of bits - curiosities of machine translation

Google Translate is undeniably one of the most useful tools most of us will ever see, yet to the vast majority of people, it is a joke.

The principles behind Google Translate go completely against what we expect of language.  Our first instinct is to believe that Google uses a big set of rules and tables, like the ones in those dusty old Latin books on a shelf at the back of the university library.

But Google Translate is something very different.  It is based on statistical translation techniques.  What that means is that no-one has programmed it with any rules at all.  Instead, they feed it gigabyte after gigabyte of text in the target language, from which it identifies patterns of words that go together, and words that don't.  It also gets some directly translated texts so that it can compare translations, but far fewer than you might expect.
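To make "patterns of words that go together" a bit more concrete, here is a minimal sketch in Python that simply counts adjacent word pairs in a toy corpus.  The corpus and the function name bigram_counts are invented for illustration; this shows the general idea of learning from counts, not how Google Translate is actually built.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count how often each pair of adjacent words appears in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # zip(words, words[1:]) yields every pair of neighbouring words
        counts.update(zip(words, words[1:]))
    return counts

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the mat",
]

counts = bigram_counts(corpus)
print(counts[("the", "cat")])  # a pair that occurs in the corpus
print(counts[("cat", "the")])  # a pair that never occurs
```

No rule about English word order has been written down anywhere; the counts alone say that "the cat" is a likely sequence and "cat the" is not.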

Occasionally, this statistical approach throws up some very odd results.

For example, on How-To-Learn-Any-Language someone recently gave the example of a Finnish band who sing some of their songs in English and some of them in Finnish.  When he translated a piece of Finnish with the song title Kuolema Tekee Taiteilijan in it, it spat out the Siren, which is another of their songs, but one they sing in English.  The correspondent on HTLAL blames that on human correction, but that is highly unlikely.  Instead, Google's algorithm will have correctly identified the first song title as Finnish and the second as English, even when each appears in a document in the other language, and therefore it won't add the Finnish song's title to the English database or the English song's title to the Finnish database.  And because both co-occur with the band name, the software ends up associating them.
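That kind of association through shared context can be sketched as follows.  Everything here is hypothetical: the function associate, the placeholder context words and the tiny "documents" are made up to illustrate the idea, and say nothing certain about Google's real algorithm or data.

```python
from collections import Counter

def associate(source_docs, target_docs):
    """Score (source phrase, target phrase) pairs by the context words they share."""
    scores = Counter()
    for s_phrase, s_context in source_docs:
        for t_phrase, t_context in target_docs:
            shared = set(s_context) & set(t_context)
            scores[(s_phrase, t_phrase)] += len(shared)
    return scores

# Each entry is (phrase, words seen near it).  "BAND" is a placeholder
# for the band name; all of this data is invented.
finnish_docs = [("Kuolema Tekee Taiteilijan", ["BAND", "single"])]
english_docs = [
    ("the Siren", ["BAND", "single"]),
    ("an unrelated phrase", ["weather", "forecast"]),
]

scores = associate(finnish_docs, english_docs)
best_pair = max(scores, key=scores.get)
print(best_pair)
```

The two titles are paired up purely because they keep the same company, not because they mean the same thing.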

In fact, if you look at the band's list of singles, you'll find that Kuolema Tekee Taiteilijan was released directly before the Siren, so it could be that the Google algorithm is actively looking for a translation directly after an embedded foreign word.  That is how writers usually gloss foreign terms: if I talk about the clàrsach (Gaelic harp) or about an t-Eilean Sgitheanach (the Isle of Skye), the translation follows straight away.  You get the picture.  And it's quite right that Google Translate should do that.  It just so happens that while this means it makes fewer mistakes, the mistakes it does make are ones that look particularly weird to us.
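A heuristic like that might look something like the sketch below, which picks up whatever sits in brackets immediately after a known foreign term.  The function glosses_after is hypothetical, and it assumes the foreign terms have already been identified and that glosses are always written in parentheses; a real system would need something far less tidy.

```python
import re

def glosses_after(text, foreign_terms):
    """Pair each foreign term with the parenthesised text that directly follows it."""
    pairs = {}
    for term in foreign_terms:
        # The term, optional whitespace, then "(...)" captured as the gloss
        match = re.search(re.escape(term) + r"\s*\(([^)]*)\)", text)
        if match:
            pairs[term] = match.group(1)
    return pairs

text = ("I talk about the clàrsach (Gaelic harp) or about "
        "an t-Eilean Sgitheanach (the Isle of Skye).")
pairs = glosses_after(text, ["clàrsach", "an t-Eilean Sgitheanach"])
print(pairs)
```

The weakness is easy to see: whatever happens to come next gets treated as the translation, whether it is a gloss or just the next single on the list.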

1 comment:

Teango said...

Google does have a habit of throwing up some very odd translations from time to time. Here is my favourite one for Russian this week:

"спеши" or "спешите"

- rough meaning: Hurry up!

- Our Google survey says: [feel free to insert incorrect uh-ahh noise here] Take your time!

Classic. :)