Tom Vanallemeersch

2025

AI4Culture platform: upskilling experts on multilingual / -modal tools
Tom Vanallemeersch | Sara Szoc | Marthe Lamote | Frederic Everaert | Eirini Kaldeli
Proceedings of Machine Translation Summit XX: Volume 2

The AI4Culture project, funded by the European Commission (2023-2025), developed a platform (https://ai4culture.eu) to educate cultural heritage (CH) professionals in AI technologies. Acting as an online capacity building hub, the platform describes openly labeled data sets and deployable and reusable tools applying AI technologies in tasks relevant to the CH sector. It also offers tutorials for tools and recipes for the combination of tools. In addition, the platform allows users to contribute their own resources. The resources described by project partners involve applications for optical or handwritten character recognition (OCR, HTR), generation and validation of subtitles, machine translation, image analysis, and semantic linking. The partners customized various tools to enhance the usability of interfaces and components. Here, we zoom in on the use case of correcting OCR/HTR output using various means (such as an unstructured manual transcription) to facilitate multilingual accessibility and create structured ground truth (text lines with image coordinates).

Tailoring Machine Translation for Scientific Literature through Topic Filtering and Fuzzy Match Augmentation
Thomas Moerman | Tom Vanallemeersch | Sara Szoc | Arda Tezcan
Proceedings of the Eleventh Workshop on Patent and Scientific Literature Translation (PSLT 2025)

To enhance the accessibility of scientific literature in multiple languages and facilitate the exchange of information among scholars and a wider audience, there is a need for high-performing specialized machine translation (MT) engines. However, this requires efficient filtering and the use of domain-specific data. In this study, we investigate whether approaches for increasing training data using topic filtering and more efficient use of such data through exploiting fuzzy matches (i.e. similar translations to a given input; FMs) improve translation quality. We apply these techniques both to sequence-to-sequence MT models and off-the-shelf multilingual large language models (LLMs) in three scientific disciplines. Our results suggest that the combination of topic filtering and FM augmentation is an effective strategy for training neural machine translation (NMT) models from scratch, not only surpassing baseline NMT models but also delivering improved translation performance compared to smaller LLMs in terms of the number of parameters. Furthermore, we find that although FM augmentation through in-context learning generally improves LLM translation performance, limited domain-specific datasets can yield results comparable to those achieved with additional multi-domain datasets.
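The idea of fuzzy match (FM) augmentation mentioned in this abstract can be illustrated with a minimal sketch. It assumes a tiny in-memory translation memory, token-level similarity from Python's `difflib`, and a hypothetical ` ||| ` separator for appending the target side of the best match to the input. The function names, threshold, and separator are illustrative assumptions, not the setup used in the paper.

```python
from difflib import SequenceMatcher


def fuzzy_matches(query, memory, min_score=0.6, top_k=1):
    """Return up to top_k (score, source, target) entries similar to the query.

    Similarity is the difflib ratio over lowercased tokens; the 0.6
    threshold is an arbitrary illustrative value.
    """
    query_tokens = query.lower().split()
    scored = []
    for source, target in memory:
        score = SequenceMatcher(None, query_tokens, source.lower().split()).ratio()
        if score >= min_score:
            scored.append((score, source, target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def augment(query, memory, separator=" ||| "):
    """Append the target side of each retrieved fuzzy match to the input."""
    matches = fuzzy_matches(query, memory)
    return query + "".join(separator + target for _, _, target in matches)


# Hypothetical English-Dutch translation memory.
memory = [
    ("the cell membrane is permeable", "het celmembraan is doorlaatbaar"),
    ("results are shown in table 2", "de resultaten staan in tabel 2"),
]

augmented = augment("the cell membrane is thin", memory)
```

The augmented string could then serve as input to an MT model trained on similarly augmented data, or the retrieved pair could be placed in an LLM prompt as an in-context example.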

2024

AI4Culture: Towards Multilingual Access for Cultural Heritage Data
Tom Vanallemeersch | Sara Szoc | Laurens Meeus
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

The AI4Culture project (2023-2025), funded by the European Commission, and involving a 12-partner consortium led by the National Technical University of Athens, develops a platform serving as an online capacity building hub for AI technologies in the cultural heritage (CH) sector, enabling multilingual access to CH data. It offers access to AI-related resources, including openly labelled datasets for model training and testing, deployable and reusable tools, and capacity building materials. The tools are aimed at optical character recognition (OCR) for printed and handwritten documents, subtitle generation and validation, machine translation (MT), and metadata enrichment via image information extraction and semantic linking. The project also customises these tools to enhance interface and component usability. We illustrate this with technology that corrects OCR output using language models and adapts it for MT.

2022

The CEFAT4Cities project aims at creating a multilingual semantic interoperability layer for Smart Cities that allows users from all EU Member States to interact with public services in their own language. The CEFAT4Cities processing pipeline transforms natural-language administrative procedures into machine-readable data using various multilingual Natural Language Processing techniques, such as semantic networks and machine translation, thus allowing for the development of more sophisticated and more user-friendly public services applications.

We describe the language technology (LT) assessments carried out in the ELRC action (European Language Resource Coordination) of the European Commission, which aims at minimising language barriers across the EU. We zoom in on the two most extensive assessments. These LT specifications involve not only experiments with tools and techniques but also an extensive consultation round with stakeholders from public organisations, academia and industry, in order to gather insights into scenarios and best practices. The LT specifications concern (1) the field of automated anonymisation, which is motivated by the need of public and other organisations to be able to store and share data, and (2) the field of multilingual fake news processing, which is motivated by the increasingly pressing problem of disinformation and the limited language coverage of systems for automatically detecting misleading articles. For each specification, we set up corresponding proof-of-concept software to demonstrate the opportunities and challenges involved in the field.

2021

Validating Quality Estimation in a Computer-Aided Translation Workflow: Speed, Cost and Quality Trade-off
Fernando Alva-Manchego | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Heidi Depraetere
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

In modern computer-aided translation workflows, Machine Translation (MT) systems are used to produce a draft that is then checked and edited where needed by human translators. In this scenario, a Quality Estimation (QE) tool can be used to score MT outputs, and a threshold on the QE scores can be applied to decide whether an MT output can be used as-is or requires human post-editing. While this could reduce cost and turnaround times, it could harm translation quality, as QE models are not 100% accurate. In the framework of the APE-QUEST project (Automated Post-Editing and Quality Estimation), we set up a case study on the trade-off between speed, cost and quality, investigating the benefits of QE models in a real-world scenario, where we rely on end-user acceptability as the quality metric. Using data in the public administration domain for English-Dutch and English-French, we experimented with two use cases: assimilation and dissemination. Results shed some light on how QE scores can be explored to establish thresholds that suit each use case and target language, and demonstrate the potential benefits of adding QE to a translation workflow.
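The thresholding step described in this abstract can be sketched as follows. This is a minimal illustration, assuming QE scores on a hypothetical 0-1 scale where higher means better estimated quality. The `Segment` class, the example scores, and the 0.7 threshold are all illustrative assumptions and are not taken from the APE-QUEST project.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    mt_output: str
    qe_score: float  # assumed scale: 0.0 (poor) to 1.0 (good)


def route_segments(segments, threshold):
    """Split segments into those used as-is and those sent to post-editing."""
    as_is, post_edit = [], []
    for segment in segments:
        if segment.qe_score >= threshold:
            as_is.append(segment)
        else:
            post_edit.append(segment)
    return as_is, post_edit


# Hypothetical English-Dutch segments with made-up QE scores.
segments = [
    Segment("Hello", "Hallo", 0.92),
    Segment("Tax form", "Belastingformulier", 0.71),
    Segment("See annex", "Zie bijlage", 0.40),
]

as_is, post_edit = route_segments(segments, threshold=0.7)
```

Raising the threshold sends more segments to human post-editing (higher cost, lower risk), while lowering it publishes more raw MT, which is the speed, cost and quality trade-off the abstract refers to.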

2020

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.

APE-QUEST: an MT Quality Gate
Heidi Depraetere | Joachim Van den Bogaert | Sara Szoc | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The APE-QUEST project (2018–2020) sets up a quality gate and crowdsourcing workflow for the eTranslation system of EC’s Connecting Europe Facility to improve translation quality in specific domains. It packages these services as a translation portal for machine-to-machine and machine-to-human scenarios.

Being Generous with Sub-Words towards Small NMT Children
Arne Defauw | Tom Vanallemeersch | Koen Van Winckel | Sara Szoc | Joachim Van den Bogaert
Proceedings of the Twelfth Language Resources and Evaluation Conference

In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to boost performance to a large extent. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.

The OCCAM project (Optical Character recognition, ClassificAtion & Machine Translation) aims at integrating the CEF (Connecting Europe Facility) Automated Translation service with image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT). It will support the automated translation of scanned business documents (a document format that, currently, cannot be processed by the CEF eTranslation service) and will also lead to a tool useful for the Digital Humanities domain.

The CEFAT4Cities project (2020-2022) will create a “Smart Cities natural language context” (a software layer that facilitates the conversion of natural-language administrative procedures into machine-readable data sets) on top of the existing ISA2 interoperability layer for public services. Integration with the FIWARE/ORION “Smart City” Context Broker will make existing, paper-based public services discoverable through “Smart City” frameworks, thus allowing for the development of more sophisticated and more user-friendly public services applications. An automated translation component will be included to provide a solution that can be used by all EU Member States. As a result, the project will allow EU citizens and businesses to interact with public services at the city, national, regional and EU level, in their own language.

MICE: a middleware layer for MT
Joachim Van den Bogaert | Tom Vanallemeersch | Heidi Depraetere
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The MICE project (2018-2020) will deliver a middleware layer for improving the output quality of the eTranslation system of EC’s Connecting Europe Facility through additional services, such as domain adaptation and named entity recognition. It will also deliver a user portal, allowing for human post-editing.

2019

2018

We present the highlights of the now finished 4-year SCATE project. It was completed in February 2018 and funded by the Flemish Government IWT-SBO, project No. 130041.

M3TRA: integrating TM and MT for professional translators
Bram Bulté | Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Translation memories (TM) and machine translation (MT) are both potentially useful resources for professional translators, but they are often still used independently in translation workflows. As translators tend to have higher confidence in fuzzy matches than in MT, we investigate how to combine the benefits of TM retrieval with those of MT by integrating the results of both. We develop a flexible TM-MT integration approach based on various techniques combining the use of TM and MT, such as fuzzy repair, span pretranslation and exploiting multiple matches. Results for ten language pairs using the DGT-TM dataset indicate almost consistently better BLEU, METEOR and TER scores compared to the MT, TM and NMT baselines.

2016

Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Liesbeth Augustinus | Vincent Vandeghinste | Tom Vanallemeersch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies.

We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents which are available in the online version of the publication Belgisch Staatsblad/Moniteur belge, and which have been published between 1997 and 2006. After downloading files in batch, we filtered out documents which have no translation in the other language, documents which contain several languages (by checking on discriminating words), and pairs of documents with a substantial difference in length. We segmented the documents into sentences and aligned the latter, which resulted in 5 million sentence pairs (only one-to-one links were included in the parallel corpus); there are 2.4 million unique pairs. Sample-based evaluation of the sentence alignment results indicates a near 100% accuracy, which can be explained by the text genre, the procedure filtering out weakly parallel articles and the restriction to one-to-one links. The corpus is larger than a number of well-known French-Dutch resources. It is made available to the community. Further investigation is needed in order to determine the original language in which documents were written.
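Two of the filtering steps named in this abstract, discarding document pairs with a substantial length difference and counting unique sentence pairs, can be sketched as below. The character-based length measure, the `max_ratio` value of 1.5, and the function names are illustrative assumptions; the abstract does not state how length was measured or which cut-off was used.

```python
def length_ratio_ok(text_a, text_b, max_ratio=1.5):
    """Return True if the longer text is at most max_ratio times the shorter.

    Length is measured in characters here, which is an assumption.
    """
    len_a, len_b = len(text_a), len(text_b)
    if len_a == 0 or len_b == 0:
        return False
    return max(len_a, len_b) / min(len_a, len_b) <= max_ratio


def unique_pairs(pairs):
    """Drop exact duplicate (French, Dutch) pairs, keeping first occurrences."""
    seen = set()
    result = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical aligned sentence pairs with one exact duplicate.
pairs = [
    ("Article 1er.", "Artikel 1."),
    ("Article 1er.", "Artikel 1."),
    ("Art. 2.", "Art. 2."),
]

deduplicated = unique_pairs(pairs)
```

Exact-match deduplication of this kind is what would turn a raw count of aligned pairs into a smaller count of unique pairs, as with the 5 million and 2.4 million figures reported above.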
