Tom Vanallemeersch


2024

AI4Culture: Towards Multilingual Access for Cultural Heritage Data
Tom Vanallemeersch | Sara Szoc | Laurens Meeus
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

The AI4Culture project (2023-2025), funded by the European Commission, and involving a 12-partner consortium led by the National Technical University of Athens, develops a platform serving as an online capacity building hub for AI technologies in the cultural heritage (CH) sector, enabling multilingual access to CH data. It offers access to AI-related resources, including openly labelled datasets for model training and testing, deployable and reusable tools, and capacity building materials. The tools are aimed at optical character recognition (OCR) for printed and handwritten documents, subtitle generation and validation, machine translation (MT), and metadata enrichment via image information extraction and semantic linking. The project also customises these tools to enhance interface and component usability. We illustrate this with technology that corrects OCR output using language models and adapts it for MT.

2022

ELRC Action: Covering Confidentiality, Correctness and Cross-linguality
Tom Vanallemeersch | Arne Defauw | Sara Szoc | Alina Kramchaninova | Joachim Van den Bogaert | Andrea Lösch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We describe the language technology (LT) assessments carried out in the ELRC action (European Language Resource Coordination) of the European Commission, which aims towards minimising language barriers across the EU. We zoom in on the two most extensive assessments. These LT specifications do not only involve experiments with tools and techniques but also an extensive consultation round with stakeholders from public organisations, academia and industry, in order to gather insights into scenarios and best practices. The LT specifications concern (1) the field of automated anonymisation, which is motivated by the need of public and other organisations to be able to store and share data, and (2) the field of multilingual fake news processing, which is motivated by the increasingly pressing problem of disinformation and the limited language coverage of systems for automatically detecting misleading articles. For each specification, we set up a corresponding proof-of-concept software to demonstrate the opportunities and challenges involved in the field.

Automatically extracting the semantic network out of public services to support cities becoming Smart Cities
Joachim Van den Bogaert | Laurens Meeus | Alina Kramchaninova | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The CEFAT4Cities project aims at creating a multilingual semantic interoperability layer for Smart Cities that allows users from all EU Member States to interact with public services in their own language. The CEFAT4Cities processing pipeline transforms natural-language administrative procedures into machine-readable data using various multilingual Natural Language Processing techniques, such as semantic networks and machine translation, thus allowing for the development of more sophisticated and more user-friendly public services applications.

2021

Validating Quality Estimation in a Computer-Aided Translation Workflow: Speed, Cost and Quality Trade-off
Fernando Alva-Manchego | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Heidi Depraetere
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

In modern computer-aided translation workflows, Machine Translation (MT) systems are used to produce a draft that is then checked and edited where needed by human translators. In this scenario, a Quality Estimation (QE) tool can be used to score MT outputs, and a threshold on the QE scores can be applied to decide whether an MT output can be used as-is or requires human post-edition. While this could reduce cost and turnaround times, it could harm translation quality, as QE models are not 100% accurate. In the framework of the APE-QUEST project (Automated Post-Editing and Quality Estimation), we set up a case-study on the trade-off between speed, cost and quality, investigating the benefits of QE models in a real-world scenario, where we rely on end-user acceptability as quality metric. Using data in the public administration domain for English-Dutch and English-French, we experimented with two use cases: assimilation and dissemination. Results shed some light on how QE scores can be explored to establish thresholds that suit each use case and target language, and demonstrate the potential benefits of adding QE to a translation workflow.
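
The thresholding step described in this abstract can be illustrated with a minimal sketch. It is not the APE-QUEST implementation: the `Segment` class, the `route` function, the score range and the threshold value are all hypothetical, and the sketch assumes that a higher QE score means better estimated quality.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    mt_output: str
    qe_score: float  # assumption: in [0, 1], higher means better estimated quality


def route(segments, threshold):
    """Split segments into MT output used as-is and MT output sent to post-editing."""
    as_is, post_edit = [], []
    for seg in segments:
        # A segment at or above the threshold skips human post-editing.
        (as_is if seg.qe_score >= threshold else post_edit).append(seg)
    return as_is, post_edit


# Hypothetical example data; the scores are invented for illustration.
segments = [
    Segment("Hello", "Hallo", 0.92),
    Segment("Tax form", "Belastingformulier", 0.71),
    Segment("Apply here", "Hier toepassen", 0.35),
]
as_is, post_edit = route(segments, threshold=0.7)
```

Raising the threshold sends more segments to post-editing (higher cost, lower risk), while lowering it does the opposite, which is the trade-off the case study examines per use case and target language.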

2020

Being Generous with Sub-Words towards Small NMT Children
Arne Defauw | Tom Vanallemeersch | Koen Van Winckel | Sara Szoc | Joachim Van den Bogaert
Proceedings of the Twelfth Language Resources and Evaluation Conference

In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to boost performance to a large extent. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.

A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?
Julia Ive | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Joachim Van den Bogaert | Eduardo Farah | Christine Maroti | Artur Ventura | Maxim Khalilov
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.

APE-QUEST: an MT Quality Gate
Heidi Depraetere | Joachim Van den Bogaert | Sara Szoc | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The APE-QUEST project (2018–2020) sets up a quality gate and crowdsourcing workflow for the eTranslation system of EC’s Connecting Europe Facility to improve translation quality in specific domains. It packages these services as a translation portal for machine-to-machine and machine-to-human scenarios.

MICE: a middleware layer for MT
Joachim Van den Bogaert | Tom Vanallemeersch | Heidi Depraetere
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The MICE project (2018-2020) will deliver a middleware layer for improving the output quality of the eTranslation system of EC’s Connecting Europe Facility through additional services, such as domain adaptation and named entity recognition. It will also deliver a user portal, allowing for human post-editing.

OCR, Classification & Machine Translation (OCCAM)
Joachim Van den Bogaert | Arne Defauw | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch | Pavel Smrž | Michal Hradiš
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The OCCAM project (Optical Character recognition, ClassificAtion & Machine Translation) aims at integrating the CEF (Connecting Europe Facility) Automated Translation service with image classification, Translation Memories (TMs), Optical Character Recognition (OCR), and Machine Translation (MT). It will support the automated translation of scanned business documents (a document format that, currently, cannot be processed by the CEF eTranslation service) and will also lead to a tool useful for the Digital Humanities domain.

CEFAT4Cities, a Natural Language Layer for the ISA2 Core Public Service Vocabulary
Joachim Van den Bogaert | Arne Defauw | Sara Szoc | Frederic Everaert | Koen Van Winckel | Alina Kramchaninova | Anna Bardadym | Tom Vanallemeersch
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The CEFAT4Cities project (2020-2022) will create a “Smart Cities natural language context” (a software layer that facilitates the conversion of natural-language administrative procedures into machine-readable data sets) on top of the existing ISA2 interoperability layer for public services. Integration with the FIWARE/ORION “Smart City” Context Broker will make existing paper-based public services discoverable through “Smart City” frameworks, thus allowing for the development of more sophisticated and more user-friendly public services applications. An automated translation component will be included to provide a solution that can be used by all EU Member States. As a result, the project will allow EU citizens and businesses to interact with public services at the city, national, regional and EU level in their own language.

2019

APE-QUEST
Joachim Van den Bogaert | Heidi Depraetere | Sara Szoc | Tom Vanallemeersch | Koen Van Winckel | Frederic Everaert | Lucia Specia | Julia Ive | Maxim Khalilov | Christine Maroti | Eduardo Farah | Artur Ventura
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

MICE
Joachim Van den Bogaert | Heidi Depraetere | Tom Vanallemeersch | Frederic Everaert | Koen Van Winckel | Katri Tammsaar | Ingmar Vali | Tambet Artma | Piret Saartee | Laura Katariina Teder | Artūrs Vasiļevskis | Valters Sics | Johan Haelterman | David Bienfait
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Collecting domain specific data for MT: an evaluation of the ParaCrawl pipeline
Arne Defauw | Tom Vanallemeersch | Sara Szoc | Frederic Everaert | Koen Van Winckel | Kim Scholte | Joris Brabers | Joachim Van den Bogaert
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Developing a Neural Machine Translation system for Irish
Arne Defauw | Sara Szoc | Tom Vanallemeersch | Anna Bardadym | Joris Brabers | Frederic Everaert | Kim Scholte | Koen Van Winckel | Joachim Van den Bogaert
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

M3TRA: integrating TM and MT for professional translators
Bram Bulté | Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Translation memories (TM) and machine translation (MT) both are potentially useful resources for professional translators, but they are often still used independently in translation workflows. As translators tend to have a higher confidence in fuzzy matches than in MT, we investigate how to combine the benefits of TM retrieval with those of MT, by integrating the results of both. We develop a flexible TM-MT integration approach based on various techniques combining the use of TM and MT, such as fuzzy repair, span pretranslation and exploiting multiple matches. Results for ten language pairs using the DGT-TM dataset indicate almost consistently better BLEU, METEOR and TER scores compared to the MT, TM and NMT baselines.

Smart Computer-Aided Translation Environment (SCATE): Highlights
Vincent Vandeghinste | Tom Vanallemeersch | Bram Bulté | Liesbeth Augustinus | Frank Van Eynde | Joris Pelemans | Lyan Verwimp | Patrick Wambacq | Geert Heyman | Marie-Francine Moens | Iulianna van der Lek-Ciudin | Frieda Steurs | Ayla Rigouts Terryn | Els Lefever | Arda Tezcan | Lieve Macken | Sven Coppers | Jens Brulmans | Jan Van Den Bergh | Kris Luyten | Karin Coninx
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We present the highlights of the now finished 4-year SCATE project. It was completed in February 2018 and funded by the Flemish Government IWT-SBO, project No. 130041.

2016

Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Liesbeth Augustinus | Vincent Vandeghinste | Tom Vanallemeersch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks, based on the monolingual GrETEL environment. We provide online access to the Europarl parallel treebank for Dutch and English, allowing users to query the treebank using either an XPath expression or an example sentence in order to look for similar constructions. We provide automatic alignments between the nodes. By combining example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies.

2015

Assessing linguistically aware fuzzy matching in translation memories
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Smart Computer Aided Translation Environment
Vincent Vandeghinste | Tom Vanallemeersch | Frank Van Eynde | Geert Heyman | Sien Moens | Joris Pelemans | Patrick Wambacq | Iulianna Van der Lek - Ciudin | Arda Tezcan | Lieve Macken | Véronique Hoste | Eva Geurts | Mieke Haesen
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Semantics-based pretranslation for SMT using fuzzy matches
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

Smart Computer Aided Translation Environment - SCATE
Vincent Vandeghinste | Tom Vanallemeersch | Frank Van Eynde | Geert Heyman | Sien Moens | Joris Pelemans | Patrick Wambacq | Iulianna Van der Lek - Ciudin | Arda Tezcan | Lieve Macken | Véronique Hoste | Eva Geurts | Mieke Haesen
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

Improving fuzzy matching through syntactic knowledge
Tom Vanallemeersch | Vincent Vandeghinste
Proceedings of Translating and the Computer 36

2010

Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents
Tom Vanallemeersch
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents which are available in the online version of the publication Belgisch Staatsblad/Moniteur belge, and which have been published between 1997 and 2006. After downloading files in batch, we filtered out documents which have no translation in the other language, documents which contain several languages (by checking on discriminating words), and pairs of documents with a substantial difference in length. We segmented the documents into sentences and aligned the latter, which resulted in 5 million sentence pairs (only one-to-one links were included in the parallel corpus); there are 2.4 million unique pairs. Sample-based evaluation of the sentence alignment results indicates a near 100% accuracy, which can be explained by the text genre, the procedure filtering out weakly parallel articles and the restriction to one-to-one links. The corpus is larger than a number of well-known French-Dutch resources. It is made available to the community. Further investigation is needed in order to determine the original language in which documents were written.
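
Two of the steps mentioned in this abstract, discarding document pairs with a substantial length difference and counting unique sentence pairs, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the character-based length measure and the `max_ratio` value of 1.5 are hypothetical and are not taken from the paper.

```python
def length_ratio_ok(fr_doc: str, nl_doc: str, max_ratio: float = 1.5) -> bool:
    """Keep a document pair only if neither side is much longer than the other.

    Assumption: length is measured in characters and max_ratio is an
    illustrative cut-off, not the value used for the corpus.
    """
    a, b = len(fr_doc), len(nl_doc)
    if a == 0 or b == 0:
        return False  # an empty side means there is no usable translation
    return max(a, b) / min(a, b) <= max_ratio


def unique_pairs(pairs):
    """Return sentence pairs with exact duplicates removed, preserving order."""
    seen = set()
    result = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical aligned one-to-one sentence pairs (French, Dutch).
aligned = [
    ("Article 1.", "Artikel 1."),
    ("Article 1.", "Artikel 1."),
    ("Le présent arrêté entre en vigueur.", "Dit besluit treedt in werking."),
]
deduplicated = unique_pairs(aligned)
```

Formulaic legal text repeats often, which is consistent with the gap the abstract reports between the total number of sentence pairs and the number of unique pairs.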