2025
pdf
bib
abs
Analysis of Vocabulary and Subword Tokenization Settings for Optimal Fine-tuning of MT: A Case Study of In-domain Translation
Javad Pourmostafa Roshan Sharami
|
Dimitar Shterionov
|
Pieter Spronck
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
The choice of vocabulary and subword (SW) tokenization has a significant impact on both training and fine-tuning of language and translation models. Fine-tuning is a common practice in optimizing a model with respect to new data. However, new data potentially introduces new words (or tokens), which, if not considered, may lead to suboptimal performance. In addition, the distribution of tokens in the new data can differ from the distribution of the original data. As such, the original SW tokenization model could be less suitable for the new data. With this work, we aim to gain better insights on the impact of SW tokenization and vocabulary generation on the performance of neural machine translation (NMT) models fine-tuned to a specific domain. To do so, we compare several strategies for SW tokenization and vocabulary generation and investigate the performance of the resulting models. Our findings show that the best way to fine-tune for domain adaptation is to consistently use both BPE and vocabulary from the in-domain data, which helps the model pick up on important domain-specific terms. At the same time, it is crucial not to lose sight of the vocabulary of the base (pre-trained) model—maintaining coverage of this vocabulary ensures the model keeps its general language abilities. The most successful configurations are those that introduce plenty of frequent domain terms while still retaining a substantial portion of the base model vocabulary, leading to noticeably better translation quality and adaptation, as seen in higher BLEU scores. These benefits, however, often come with greater computational costs, such as longer training times, since the model must learn more new tokens. Conversely, approaches that skip important domain terms or combine mismatched tokenization and vocabulary do not perform as well, making it clear that both domain-specific adaptation and broad vocabulary coverage matter—and that these gains are realized when the vocabulary preserves a good portion of the base (pre-trained) model. While using in-domain BPE and vocabulary yields the best domain adaptation, it substantially reduces out-of-domain translation quality. Hybrid configurations that combine base and domain vocabularies help balance this trade-off, maintaining broader translation capabilities alongside improved domain performance.
2024
pdf
bib
abs
Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation
Javad Pourmostafa Roshan Sharami
|
Dimitar Shterionov
|
Pieter Spronck
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
The quality of output from large language models (LLMs), particularly in machine translation (MT), is closely tied to the quality of in-context examples (ICEs) provided along with the query, i.e., the text to translate. The effectiveness of these ICEs is influenced by various factors, such as the domain of the source text, the order in which the ICEs are presented, the number of these examples, and the prompt templates used. Naturally, selecting the most impactful ICEs depends on understanding how these affect the resulting translation quality, which ultimately relies on translation references or human judgment. This paper presents a novel methodology for in-context learning (ICL) that relies on a search algorithm guided by domain-specific quality estimation (QE). Leveraging the XGLM model, our methodology estimates the resulting translation quality without the need for translation references, selecting effective ICEs for MT to maximize translation quality. Our results demonstrate significant improvements over existing ICL methods and higher translation performance compared to fine-tuning a pre-trained language model (PLM), specifically mBART-50.
2023
pdf
bib
A Python Tool for Selecting Domain-Specific Data in Machine Translation
Javad Pourmostafa Roshan Sharami
|
Dimitar Shterionov
|
Pieter Spronck
Proceedings of the 1st Workshop on Open Community-Driven Machine Translation
pdf
bib
abs
Tailoring Domain Adaptation for Machine Translation Quality Estimation
Javad Pourmostafa Roshan Sharami
|
Dimitar Shterionov
|
Frédéric Blain
|
Eva Vanmassenhove
|
Mirella De Sisto
|
Chris Emmery
|
Pieter Spronck
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high-cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should also be generalizabile, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues — data scarcity and domain mismatch — this paper combines domain adaptation and data augmentation within a robust QE system. Our method is to first train a generic QE model and then fine-tune it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and a superior performance in zero-shot learning scenarios as compared to state-of-the-art baselines.