WMT, the reference dataset used by NMT R&D teams
The Conference on Machine Translation (historically the Workshop on Statistical Machine Translation, hosted by EMNLP or ACL) takes place every year.
It provides a set of public data to encourage academics and companies to compete and to publish results based on their latest findings.
Tasks change from year to year, but a long-standing reference has been the “News” translation task from English to German, and many research papers are based on it. We refer to this task as WMT. The test set used to measure performance also changes every year.
At Ubiqus, we use NMT on proprietary data, which gives much higher quality, but for the sake of comparison we wanted to show our latest results on the WMT dataset.
We use the BLEU metric (sacreBLEU, case-sensitive), which gives an idea of performance, but it is well known that:
(1) it is not a perfect metric for NMT, and
(2) there are various ways of using BLEU.
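To illustrate point (2), here is a minimal sketch of a simplified corpus-level BLEU in pure Python. It is not sacreBLEU's implementation and not the code used for the scores in this article: it assumes one reference per segment, whitespace tokenisation and no smoothing, and the function names and example sentences are made up for the illustration. It only shows that a single choice, here case sensitivity, changes the score for the same translation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4, lowercase=False):
    """Simplified corpus BLEU on a 0-100 scale (assumptions: one reference
    per segment, whitespace tokenisation, no smoothing)."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-grams, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        if lowercase:
            hyp, ref = hyp.lower(), ref.lower()
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection keeps the minimum count ("clipping").
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, one order with no match zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Made-up example: the hypothesis differs in casing ("the") and one word ("each").
refs = ["The Conference on Machine Translation takes place every year ."]
hyps = ["the Conference on Machine Translation takes place each year ."]

cased = corpus_bleu(hyps, refs)                    # case-sensitive
uncased = corpus_bleu(hyps, refs, lowercase=True)  # case-insensitive, scores higher
```

Tokenisation, smoothing and the number of references have a similar effect, which is why scores are only comparable when they are computed in exactly the same way.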
For the English to German task, 20.6 was the best score during the 2014 conference.
Google issued a paper in September 2016 titled “Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, with improved results based on a deep LSTM architecture:
>> Single model: 24.6
>> Ensemble of 8 models: 26.3
One issue is that they computed BLEU differently from the official WMT/NIST method, leading to a slight overestimation.
In June 2017, Google (again) released another paper outlining results on the same task: “Attention Is All You Need”, which introduced a new architecture that they called the “Transformer”.
It was another breakthrough in performance:
>> Single model: 28.4
Again, this score is comparable with the one in their previous paper, but overestimated relative to the official method.
Aiming for the highest BLEU score with the WMT dataset
Since then, various research papers have brought slight improvements. Because the WMT dataset is quite small (4.5 million parallel sentences), some researchers introduced data augmentation through back-translation of monolingual target-language data. For this task, we translate a large number of German sentences into English (with another pre-trained model) to generate a “synthetic” corpus of additional training data.
We combined all these improvements with this technique to benchmark our model against Google Translate and DeepL (two popular online translation engines that probably use much more data than the WMT public datasets).
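The back-translation step described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_de_en` is a hypothetical word-for-word stand-in for a real pre-trained German-to-English model, and the function names and sentences are invented for the example. The point it shows is that the German side of each synthetic pair is genuine text, while the English side is machine-generated.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (English source, German target)


def back_translate(
    german_monolingual: Iterable[str],
    translate_de_en: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each genuine German sentence with a machine-made English source."""
    synthetic = []
    for german in german_monolingual:
        english = translate_de_en(german)  # synthetic source side
        if english.strip():                # skip empty model outputs
            synthetic.append((english, german))
    return synthetic


def toy_de_en(sentence: str) -> str:
    """Hypothetical stand-in for a pre-trained German-to-English model."""
    lexicon = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}
    return " ".join(lexicon.get(word.lower(), word) for word in sentence.split())


# Genuine parallel data plus synthetic pairs built from German-only text.
parallel = [("The weather is nice .", "Das Wetter ist schön .")]
synthetic = back_translate(["Das Haus ist klein ."], toy_de_en)
training_data = parallel + synthetic
```

The English-to-German model is then trained on `training_data`, so it learns to produce real German even when the English input is imperfect.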
We scored all the test sets from 2014 to 2018 for the English to German task:
The first remark is that Ubiqus NMT scores significantly higher than Google Translate, and overall higher than DeepL*
*except for years 2016 and 2018
The second remark is that we obtain a score far higher than all previous papers thanks to the addition of recent improvements and the use of back‑translated data. However, we used only public data for this task, which is probably less than what Google Translate and DeepL use.
You can test these results on
and select the domain “WMT”.
In order to validate our approach, we did the same on a less popular task: Russian to English.
We outperform both engines for both years.
State of the art?
At WMT, academics also post very good results. However, most of the time these results rely on what we call “ensemble and re-ranking” techniques.
These two techniques give better results but are not really “production-ready”, since they require more computation at translation time.
Still, we compared ourselves to the latest Facebook research paper: “Understanding Back-Translation at Scale”.
In fact, we used very similar techniques and our framework (OpenNMT-py) is close to theirs.
This paper discloses a BLEU score of 33.8 on the same English to German task (comparable to the 34.0 of Ubiqus NMT).
One huge difference, however, is that we used 4 GPUs on a single machine for a training time of 50 hours, while they report training on 128 GPUs for 22.5 hours.
Our intuition is that we used better data filtering and selection at pre-processing time, and that we applied more small improvements to our Transformer.
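To put the two training setups quoted above on a common scale, the short calculation below converts them to GPU-hours. It uses only the figures given in this article, and it assumes that a GPU-hour is comparable across both setups, which ignores any difference in GPU model or interconnect.

```python
def gpu_hours(gpus: int, hours: float) -> float:
    """Total GPU-hours for a run, ignoring differences in GPU hardware."""
    return gpus * hours


ours = gpu_hours(4, 50)        # 4 GPUs for 50 hours
theirs = gpu_hours(128, 22.5)  # 128 GPUs for 22.5 hours
ratio = theirs / ours          # how many times more GPU-hours the larger run used

print(f"{ours:.0f} vs {theirs:.0f} GPU-hours, ratio {ratio:.1f}x")
```

Under that assumption, the two runs come to 200 and 2,880 GPU-hours, a factor of 14.4.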
Finally, we wanted to double-check whether we could reach the state-of-the-art score from the WMT '18 conference. Marian-NMT, a well-known toolkit (now at Microsoft Translator), reported a score of 48.3 on the 2018 test set, whereas our single model scores 46.9.
We did not take the time to replicate their paper, but simply re-ranking the output of our single model brings an improvement of +0.5, and simply ensembling two models brings us to 48.1.
Again, in a production environment there is no point in trying to implement these techniques.
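For readers unfamiliar with the two techniques discussed in this section, here is a minimal sketch of what each one does. It is a generic illustration, not the setup behind the scores above: the article does not say which re-ranking scorer was used, so the second scorer here is an assumption, and all names, probabilities and scores are made up.

```python
from typing import Dict, List, Sequence, Tuple


def ensemble_step(distributions: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average the next-token probabilities predicted by several models."""
    vocab = set().union(*distributions)
    return {
        token: sum(d.get(token, 0.0) for d in distributions) / len(distributions)
        for token in vocab
    }


def rerank(
    nbest: List[Tuple[str, float]],
    second_scores: List[float],
    weight: float = 0.5,
) -> str:
    """Return the hypothesis with the best weighted sum of two log-scores."""
    combined = [
        (1 - weight) * model_score + weight * extra
        for (_, model_score), extra in zip(nbest, second_scores)
    ]
    best = max(range(len(nbest)), key=combined.__getitem__)
    return nbest[best][0]


# Ensembling: two hypothetical models disagree on the next word.
model_a = {"Haus": 0.6, "Gebäude": 0.4}
model_b = {"Haus": 0.3, "Gebäude": 0.7}
averaged = ensemble_step([model_a, model_b])
next_token = max(averaged, key=averaged.get)

# Re-ranking: an n-best list with the translation model's log-scores,
# re-scored by a hypothetical second scorer (for example a language model).
nbest = [("Das Haus ist klein .", -1.2), ("Das Haus ist gering .", -1.0)]
chosen = rerank(nbest, second_scores=[-0.5, -2.0])
```

In both cases every sentence goes through more than one model (or more than one scoring pass), which is where the extra cost at translation time comes from.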
What about proprietary‑data based engines and production?
Since our main goal is to apply the best research findings for our clients, we also evaluated one of our trained models on a document from one of our clients.
For the English to/from French‑Canadian engines we obtained the following scores:
|English to French‑Canadian||French‑Canadian to English|
This speaks for itself.
For English into French‑Canadian, we use a very specific engine, while Google or DeepL do not make a distinction between French from France and French from Canada.
In the other direction, the scores are much closer, since we consider only one “into English” variant.
We could replicate the same results for various domains (Life Science, Finance, Legal, …).
Neural Machine Translation is a very interesting and magical discipline attracting the best machine learning researchers in the world, but when it comes to applying it in real-world production, we need to understand the “technical aspects” as well as linguistic issues and real-life issues such as tag handling.
This article was written by Vincent Nguyen, Ubiqus Group CEO.