Machine Translation for Low-resource Finno-Ugric Languages

Scientific partners involved

The University of Tartu is the leading university in the Baltics in the field of information technology and ranks among the 250 best universities in the world for Computer Science in the Times Higher Education rankings. Research directions of the University’s Chair of Natural Language Processing include machine learning and data mining on text data, machine translation from one natural language into another, summarization of long text documents, and automatic analysis of grammatical correctness, word meanings and sentence structure.

Technical/scientific challenge

Neural networks have brought rapid gains in output quality for many natural language processing tasks, including neural machine translation (NMT). However, the output quality crucially depends on the availability of large amounts of parallel and monolingual data for the covered languages. Due to a lack of training material, several lower-resource Finno-Ugric languages are not included in the existing massively multilingual models. In terms of the number of speakers, they range from 20 near-native speakers of Livonian to several hundred thousand speakers of Mordvinic languages.

Solution

An NMT system was developed to translate between 20 low-resource Finno-Ugric languages and 7 high-resource languages. Monolingual corpora were collected mainly by crawling texts from the web and combining them with pre-existing corpora. Three main categories of texts can be distinguished: news, Wikipedia, and biblical. All the NMT systems were trained on the LUMI supercomputer. All models were fine-tuned with the Fairseq framework implementation of M2M-100 for 350k updates with a batch size of 3840 tokens (the number was chosen to match earlier versions of models trained with the Huggingface implementation of M2M-100). The models were fine-tuned on 4 AMD MI250X GPUs.
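As a rough illustration of the training budget implied by these figures, the sketch below multiplies the number of updates by the batch size. The text does not say whether 3840 tokens is the global batch or a per-GPU value, so both readings are shown; the function name is hypothetical and gradient accumulation is assumed to be absent.

```python
# Values taken from the text above.
UPDATES = 350_000          # fine-tuning updates
TOKENS_PER_BATCH = 3840    # batch size in tokens
NUM_GPUS = 4               # AMD MI250X GPUs used for fine-tuning


def tokens_processed(updates: int, tokens_per_batch: int, replicas: int = 1) -> int:
    """Total tokens seen during training.

    replicas > 1 models the assumption that the batch size is per GPU.
    """
    return updates * tokens_per_batch * replicas


if __name__ == "__main__":
    # Reading 1: 3840 tokens is the global batch per update.
    print(f"global batch:  {tokens_processed(UPDATES, TOKENS_PER_BATCH):,} tokens")
    # Reading 2: 3840 tokens is per GPU, with 4 GPUs per update.
    print(f"per-GPU batch: {tokens_processed(UPDATES, TOKENS_PER_BATCH, NUM_GPUS):,} tokens")
```

Under either reading the models see on the order of a few billion tokens during fine-tuning.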

Scientific impact

More than 20 Finno-Ugric languages can now be translated using the machine translation engine of the University of Tartu. Most of these languages were added to a public translation engine for the first time. The translation engine allows researchers to translate materials that would otherwise be incomprehensible to them. It provides an opportunity to better study the history of languages and regions without knowing the respective language. Moreover, since most Finno-Ugric languages are not widely spoken today, a translation engine helps to preserve these languages.

Benefits

Cosmic Ray-based Solutions for 3D Imaging

Industrial organisations involved

GScan was founded in 2018 to revolutionise inspection, security and medical scanning markets using Muon Flux Technology (MFT). GScan, as the pioneer of MFT having unique IP, tech & sales know-how in the field, is developing a new generation of Non-Destructive Testing (NDT) scanners and tomography systems for infrastructure management applications.

Technical/scientific challenge

To keep the surrounding environment safe and ensure its longevity, careful assessment, maintenance and investment plans are required. However, currently there is no efficient way of obtaining the information required for more efficient use of assets and reducing risks for critical infrastructure.

Solution

Capitalising on the power of natural cosmic ray tomography, the technology tracks the trajectory changes or absorption of particles (muons, electrons, positrons) as they pass through the object of interest, thereby extracting crucial statistics about its material and shape. These insights are then translated into 2D and 3D visualisations of both internal and external geometries, along with data on chemical composition. The comprehensive output we deliver provides in-depth insights into the objects and materials under scrutiny – all meticulously tailored to fulfil our customers’ unique requirements. HPC plays an important role in translating the collected data into visualisations.
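To illustrate why trajectory changes carry information about the material, the sketch below evaluates the standard Highland approximation for the typical multiple-scattering angle of a muon. This is a generic textbook formula, not a description of GScan's reconstruction method; the radiation lengths are approximate reference values, and the function and constant names are hypothetical.

```python
import math

MUON_MASS_MEV = 105.66  # muon rest mass in MeV/c^2

# Approximate radiation lengths in cm (illustrative reference values).
RADIATION_LENGTH_CM = {"water": 36.08, "iron": 1.757, "lead": 0.5612}


def scattering_angle_rad(momentum_mev: float, thickness_cm: float, material: str) -> float:
    """Width of the scattering-angle distribution (Highland approximation, charge 1)."""
    x_over_x0 = thickness_cm / RADIATION_LENGTH_CM[material]
    energy = math.hypot(momentum_mev, MUON_MASS_MEV)  # total energy in MeV
    beta = momentum_mev / energy                      # velocity as a fraction of c
    log_term = 1 + 0.038 * math.log(x_over_x0 / beta**2)
    return 13.6 / (beta * momentum_mev) * math.sqrt(x_over_x0) * log_term


if __name__ == "__main__":
    # A 3 GeV/c muon crossing 10 cm of each material.
    for name in RADIATION_LENGTH_CM:
        angle_mrad = 1000 * scattering_angle_rad(3000.0, 10.0, name)
        print(f"{name:>5}: {angle_mrad:.2f} mrad")
```

Denser, higher-Z materials have shorter radiation lengths and therefore deflect muons more strongly, which is what makes a measured angle distribution usable as a signature of the material.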

Business impact

With terabytes of time- and space-related digital data, the detailed reconstruction process makes it possible to see inside structures in a way that was not possible before.

Benefits

Machine Translation Post-Editing

Industrial organisations involved

Luisa Tõlkebüroo OÜ is the biggest translation agency in Estonia. The company offers more than 50 services – including sworn translation, simultaneous and consecutive interpretation, layout work, machine translation and post-editing, subtitling and localisation.

Technical/scientific challenge

The company needed a custom-made machine translation system to reduce translation time. As the company had no previous experience in either natural language processing or machine learning, it collaborated with the TartuNLP team.

Solution

The machine translation model was trained on the Rocket cluster of the University of Tartu HPC centre.

Business impact

Thanks to rapid advances in the technology and extensive translation memory, the company is able to offer machine translations with post-editing in a range of language combinations and on a range of topics.

Benefits

An accurate AI-based Cloud Mask Processor for Sentinel-2

Industrial organisations involved

KappaZeta is a science-driven remote sensing company aiming to make space a valuable asset for everyone. KappaZeta’s expertise is in using SAR (radar) satellite data, incorporating it with optical satellite data and providing some of the most accurate AI models on the market. The key area of focus is agriculture.

Technical/scientific challenge

Cloud masking is an essential step for the pre-processing of optical satellite imagery. KappaZeta addresses the problem by introducing KappaMask, an AI-based cloud and cloud shadow masking processor for Sentinel-2, which carries an optical instrument payload that samples 13 spectral bands. As a cloud detector, KappaMask uses a large convolutional segmentation model. Faster model convergence during training can be achieved by using larger batch sizes of the training data, which means more GPU memory is needed. Additionally, faster CPUs are required for shorter data loading times to increase the training speed even further.

Solution

KappaMask was trained on an open-source dataset and fine-tuned on a Northern European terrestrial dataset which was labelled manually using the active learning methodology. The training was performed on the high-performance compute nodes of the University of Tartu’s HPC Centre. Powerful GPUs and CPUs were used to substantially speed up the training of the model.

Business impact

KappaMask is an open-source project. All the results, final software and source code will be freely and openly distributed on GitHub. Openness and accessibility of the software should directly translate into greater usage.

Benefits

Self-driving technology for a Level 4 autonomous car

Industrial organisations involved

Bolt is an Estonian mobility company headquartered in Tallinn that offers vehicle-for-hire, micromobility, car-sharing, and food delivery services, operating in over 400 cities in over 45 countries. In partnership with the University of Tartu, the company develops self-driving technology for a Level 4 autonomous car.

Technical/scientific challenge

Autonomous cars acquire up to 357 GB/hour of data during test drives. Autonomous car engineers needed a system to store and easily access those test logs.
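To put the 357 GB/hour figure in perspective, the sketch below converts it to a sustained write rate and to the storage needed for a test session. It assumes decimal units (1 GB = 10^9 bytes), which the text does not specify, and the 8-hour session length is purely hypothetical.

```python
PEAK_GB_PER_HOUR = 357  # peak acquisition rate from the text


def sustained_rate_mb_per_s(gb_per_hour: float) -> float:
    """Convert GB/hour to MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_hour * 1000 / 3600


def storage_tb(gb_per_hour: float, hours: float) -> float:
    """Storage in TB for a session of the given length (1 TB = 1000 GB)."""
    return gb_per_hour * hours / 1000


if __name__ == "__main__":
    print(f"sustained rate: {sustained_rate_mb_per_s(PEAK_GB_PER_HOUR):.1f} MB/s")
    # Hypothetical 8-hour day of test driving at the peak rate.
    print(f"8-hour session: {storage_tb(PEAK_GB_PER_HOUR, 8):.2f} TB")
```

At the peak rate, a single car produces roughly 100 MB every second, so a few days of test driving already reach multi-terabyte volumes.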

Solution

Acquired test logs are copied to HPC storage, into an appropriately guarded directory. A cron job regularly processes those log files into metadata stored in a MongoDB database. Processing is distributed over the cluster and happens in parallel. The longest logs can take up to 24 hours to process, so processing them sequentially would be very time-consuming. On top of MongoDB sits a custom-made application that allows test sessions to be filtered and browsed using the Webviz visualization tool. The visualization tool accesses the raw sensor data from HPC storage.
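The idea of turning log files into metadata records in parallel can be sketched as follows. This is a minimal stand-in, not the project's actual pipeline: a local thread pool replaces cluster-wide job distribution, a plain list of dictionaries replaces the MongoDB collection, and `extract_metadata` and `index_logs` are hypothetical names that record only file-level attributes rather than parsing sensor data.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_metadata(log_path: Path) -> dict:
    """Build one metadata record for a log file (placeholder for real log parsing)."""
    stat = log_path.stat()
    return {
        "file": log_path.name,
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,
    }


def index_logs(log_dir: Path, pattern: str = "*.log", workers: int = 4) -> list:
    """Process all matching logs in parallel and return their metadata records."""
    paths = sorted(Path(log_dir).glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so records come back sorted by file name.
        return list(pool.map(extract_metadata, paths))


if __name__ == "__main__":
    for record in index_logs(Path(".")):
        print(record)
```

In a real deployment each record would be written to the database instead of being returned, and the per-file work would be submitted as separate cluster jobs so that a 24-hour log does not block the others.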

Business impact

With the growing demand for ride-hailing services, autonomous vehicle technology will provide a solution for transportation problems on an increasingly broader scale.

Benefits

Large collections of European HPC success stories are available on the FF4EuroHPC and EuroCC webpages.