KantanAI Blog

MT Evolution: From Rule-based Systems to Large Language Models

Machine translation (MT) has been crucial in facilitating modern communication and fostering global connectivity. Contrary to what the public might think, it emerged in the early 20th century and has developed considerably since then.

But what exactly is machine translation? Simply put, MT is the automatic production of a text in a target language from a text in a source language. It was developed mainly for defense and military purposes, and its commercial use only became widely visible at the end of the 1990s with the release of Babel Fish. Although initial MT outputs were far from perfect, MT changed how information was processed and enabled the international growth of companies and, later, of e-commerce (Kenny, 2022).

In this article, we give an overview of this evolution, from the early initiatives of Petr Smirnov-Troyanskii in the 1930s to the current advancements in Neural Machine Translation (NMT) and LLMs, and what that means for the language industry.

Early Progress and Rule-Based Machine Translation (RBMT)

Although his contribution is not often recognized, the origins of MT can be traced back to 1933 and Petr Smirnov-Troyanskii’s invention: a translation machine that used cards and a camera to recognize and decode words in four different languages. This invention, however, was mostly ignored until the 1950s. The origins of MT are more usually attributed to Warren Weaver’s 1949 memorandum, which laid the groundwork for Statistical Machine Translation (SMT) by proposing the use of statistical methods in language processing (Source: https://towardsdatascience.com/evolution-of-machine-translation-5524f1c88b25). Weaver’s proposal also boosted interest in using computers for translation tasks and prompted more research on the topic in the following years.

The next key development in MT came in 1954 with the Georgetown-IBM experiment, an early attempt to translate 60 sentences from Russian into English using computational methods. Although the experiment was tightly controlled, since the sentences had been selected and tested in advance to avoid ambiguity, it demonstrated the potential of technology to facilitate cross-language communication and sparked the race to create automated translation systems. From then on, countries redoubled their efforts and investment to produce MT systems (Source: https://www.globalizationpartners.com/2013/01/02/an-introduction-to-machine-translation/).

The 1960s and 1970s witnessed a shift towards rule-based approaches in machine translation. In general terms, these systems require mapping the source and target languages to create rules for automatic translation. This demands extensive glossaries, substantial computing capacity to run the algorithms that produce the output, and, most importantly, language experts able to write such rules. RBMT struggled to produce good output for language pairs with very different structures and rules, which multiplied the computational and human effort required (Kenny, 2022).
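To make the idea of glossaries plus hand-written rules concrete, here is a minimal sketch of a rule-based translator. The five-word English-Spanish glossary, the part-of-speech tags, and the single reordering rule are all illustrative assumptions and are not taken from any real RBMT system.

```python
# Toy rule-based translation: a bilingual glossary plus one hand-written rule.
# Glossary entries map an English word to (Spanish word, part-of-speech tag).
GLOSSARY = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "is": ("es", "VERB"),
    "big": ("grande", "ADJ"),
}


def translate_rbmt(sentence: str) -> str:
    tokens = sentence.lower().split()
    # Step 1: lexical transfer. Unknown words pass through with the tag "UNK".
    tagged = [GLOSSARY.get(tok, (tok, "UNK")) for tok in tokens]
    # Step 2: structural transfer. English "ADJ NOUN" becomes Spanish "NOUN ADJ".
    out = []
    i = 0
    while i < len(tagged):
        is_adj_noun = (
            i + 1 < len(tagged)
            and tagged[i][1] == "ADJ"
            and tagged[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            out.extend([tagged[i + 1][0], tagged[i][0]])
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)


print(translate_rbmt("The red car is big"))  # el coche rojo es grande
```

Even this toy version shows where the effort goes: every new word needs a glossary entry, and every structural difference between the two languages needs its own rule written by someone who knows both.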

Corpus-Based Approaches

While researchers worked extensively on RBMT systems, syntax-based approaches and the statistical analysis of translation were also being developed. This marked a notable transition towards corpus-based methodologies, which draw on extensive datasets to improve translation accuracy and efficiency. Corpus-based approaches use parallel, multilingual, and comparable corpora to tackle the problems of contextual information and differing language structures that RBMT faced. The most important of these approaches are Example-Based Machine Translation (EBMT) and SMT (Dajun & Yun, 2015).

Example-Based Machine Translation

EBMT revolves around establishing mappings between the source and target languages through a repository of examples. The method was developed specifically to overcome the very different language rules of English and Japanese. EBMT emphasizes feeding a large number of examples into the system to help it identify patterns and linguistic correlations across languages, something that would become vital for later systems. While effective in specific contexts, EBMT still had problems with contextual information and the pragmatic aspects of certain languages.
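The retrieval step at the core of this approach can be sketched in a few lines. This is only an illustration under stated assumptions: the three English-Japanese example pairs (in romanized form) are made up for the demonstration, and a real EBMT system would go on to adapt and recombine fragments of the retrieved examples rather than return one as is.

```python
from difflib import SequenceMatcher

# Hypothetical example repository: (English source, romanized Japanese target).
EXAMPLES = [
    ("how much is this book", "kono hon wa ikura desu ka"),
    ("where is the station", "eki wa doko desu ka"),
    ("i like green tea", "watashi wa ryokucha ga suki desu"),
]


def closest_example(source: str):
    """Return the stored example most similar to `source`, with its score."""
    query_tokens = source.lower().split()

    def score(pair):
        # Word-level similarity between the query and a stored source sentence.
        return SequenceMatcher(None, query_tokens, pair[0].split()).ratio()

    best = max(EXAMPLES, key=score)
    return best, score(best)


match, similarity = closest_example("Where is the hotel")
print(match, similarity)
```

The query shares three of its four words with "where is the station", so that pair is retrieved. The remaining mismatch ("hotel" versus "station") is exactly the part a full system would still have to translate and substitute.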

Statistical Machine Translation (SMT)

SMT appeared in the late 1980s and 1990s and revolutionized machine translation technology at the time. By using statistical models and probabilities derived from extensive language datasets, SMT systems aimed to improve translation quality by capturing phrase alignments and linguistic probabilities more effectively, without requiring deep expertise in language rules. Both word-based and phrase-based SMT models rely on probabilities extracted from parallel corpora, showing the effectiveness of statistical analysis in some translation tasks.
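As a rough illustration of how probabilities can be extracted from parallel data, the sketch below estimates phrase translation probabilities by relative frequency. The French-English phrase pairs are hypothetical and are assumed to have already been aligned and extracted; a real phrase-based system would combine such a table with a language model and a reordering model before choosing an output.

```python
from collections import Counter, defaultdict

# Hypothetical phrase pairs, as if already extracted from a parallel corpus.
PHRASE_PAIRS = [
    ("la maison", "the house"),
    ("la maison", "the house"),
    ("la maison", "the home"),
    ("bleue", "blue"),
]


def phrase_table(pairs):
    """Estimate p(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        table[src][tgt] = n / source_counts[src]
    return table


table = phrase_table(PHRASE_PAIRS)
candidates = table["la maison"]
best = max(candidates, key=candidates.get)
print(candidates, best)
```

No linguistic rule appears anywhere in the code: "the house" wins for "la maison" simply because it was seen more often in the data.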

Neural Machine Translation (NMT)

SMT remained the dominant paradigm until the mid-2010s and was adopted by companies and institutions such as Google and the European Commission. Although the outputs had improved considerably, researchers kept working on systems that would produce more “natural” outputs. Between 2014 and 2016, a series of contributions led to what we today call NMT.

NMT systems are based on deep learning models, artificial neural networks loosely inspired by how human neurons work. Instead of statistical models over words and phrases, NMT represents words and sentences as embeddings (numerical vectors), giving more room for contextual information, idiomatic expressions, and more complex language structures (Kenny, 2022). A key development during this period was the introduction of the sequence-to-sequence (Seq2Seq) model, an encoder-decoder architecture that maps a sequence in one language to a sequence in another and became foundational for NMT. The transformer, introduced in 2017, kept the encoder-decoder design but relies on attention: the encoder and decoder are each built from multiple layers that turn the input into vectors and process them again and again. In general, more layers allow the system to handle more complex language structures, at the cost of more data and computation.
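To give a feel for what processing vectors means in a transformer layer, here is a minimal sketch of scaled dot-product attention, the operation such layers repeat many times. The two-dimensional vectors are invented for the example; real systems learn embeddings with hundreds of dimensions and add learned projections, multiple attention heads, and feed-forward layers that are omitted here.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    # The context vector is the weighted average of the value vectors.
    context = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, context


# Made-up vectors: one target-side query attending over two source words.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention([2.0, 0.0], keys, values)
print(weights, context)
```

The query points in the same direction as the first key, so the first source word receives most of the weight. This is how the model decides which parts of the source sentence matter when it produces each target word.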

Google adopted NMT in 2016, and its notably good performance captured the attention of the media again, with coverage highlighting how these new systems were close to reaching “human parity” and would soon change the way people communicate across languages. In the years since its appearance, NMT engines have surpassed the quality of all existing SMT systems. MT use has grown significantly in the last decade, and MT has been adopted by many industries. This is due not only to the increase in accuracy and the reduction in syntax and spelling mistakes, but also to the possibility of translating directly between language pairs with no common dictionary (previously, it was often necessary to use English as a pivot language).

Large Language Models (LLMs)

LLMs are the result of developments in natural language processing (NLP) and of the continued research on and application of transformer models in NMT. However, these systems are trained on far more data than NMT systems, since their goal is to allow in-context learning and more complex operations (Zhu et al., 2023). LLMs represent a significant advancement in NLP, enabling machines to handle context, nuance, and idiomatic expressions with much greater precision.
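In-context learning means that the model picks up the task from examples placed in the prompt itself, without being retrained. The sketch below only builds such a few-shot translation prompt as a string. The prompt layout, the function name `build_few_shot_prompt`, and the example sentences are assumptions made for illustration; no particular LLM or API is implied, and the call to a model is deliberately left out.

```python
def build_few_shot_prompt(examples, source_text,
                          src_lang="English", tgt_lang="Spanish"):
    """Assemble a few-shot translation prompt from (source, target) pairs."""
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The final target line is left open for the model to complete.
    lines.append(f"{src_lang}: {source_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Hypothetical in-context examples, e.g. taken from a client's translation memory.
examples = [
    ("Good morning.", "Buenos días."),
    ("Thank you very much.", "Muchas gracias."),
]
prompt = build_few_shot_prompt(examples, "See you tomorrow.")
print(prompt)
```

Changing the examples changes the behavior: pairs drawn from a specific domain or written in a specific tone steer the output toward that terminology and style, which is one reason LLMs are attractive for translation workflows.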

During the last couple of years, we have seen not only how ChatGPT has revolutionized the way people and companies search for and analyze information, but also how GenAI is changing the translation industry. Where NMT still seemed insufficient at capturing cultural cues and contextual information, these systems seemed to produce more accurate and “natural” translations, prompting companies to push to integrate GenAI into their translation pipelines.

Many, however, point out that GenAI is a flawed and expensive technology that underdelivers. LLM “hallucinations”, for instance, make it hard to trust GenAI: outputs sound so coherent that fact-checking is often overlooked. Despite what CEOs of GenAI companies might say, it is hard to believe that artificial general intelligence is in sight (Source: https://www.ft.com/content/648228e7-11eb-4e1a-b0d5-e65a638e6135). In translation specifically, there are also ethical and privacy concerns regarding LLMs, as well as questions about the role linguists will play in an AI-led scenario.

From the earliest MT developments to the emergence of LLMs, the key takeaway is adaptation. The partnership between human translators and machine translation will be pivotal in delivering high-quality translations, especially for nuanced tasks that demand a profound understanding of language subtleties and cultural references. See our previous article about MT trends for 2024 to read about the evolving landscape of MT.