A shocking amount of the web is machine translated: insights from multi-way parallelism

We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.

Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico

As a translator myself, this is entirely unsurprising. Translating is a craft, a skill, and much like with any other craft, you get what you pay for. If you pay your translator(s) a good rate, you get a good translation. If you pay your translator(s) a shit rate, you get a shit translation. If you pay nothing, you get nothing.

I’m definitely seeing more and more people in my industry integrate machine translations, but so far, it’s not been an actual issue – I have no qualms about accepting a job where I take a machine-translated text and whip it into shape and turn it into a human-readable, quality translation… As long as people pay me a reasonable rate for it. Working from a machine translation is often quicker and easier, so the going rate obviously reflects that.

The quality of machine translations is absolutely atrocious, however, and the idea of relying on them for texts that other people (customers, clients, employees, and so on) are actually supposed to read and work from is terrifying. Google Translate is an effective tool for personal use, but throwing, I don’t know, your product’s manual at it and dumping the unedited result onto your customers is borderline criminal.

Pay nothing, get nothing.
