2024
pdf
bib
abs
Chasing COMET: Leveraging Minimum Bayes Risk Decoding for Self-Improving Machine Translation
Kamil Guttmann
|
Mikołaj Pokrywka
|
Adrian Charkiewicz
|
Artur Nowakowski
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
This paper explores Minimum Bayes Risk (MBR) decoding for self-improvement in machine translation (MT), particularly for domain adaptation and low-resource languages. We implement the self-improvement process by fine-tuning the model on its MBR-decoded forward translations. By employing COMET as the MBR utility metric, we aim to achieve the reranking of translations that better aligns with human preferences. The paper explores the iterative application of this approach and the potential need for language-specific MBR utility metrics. The results demonstrate significant enhancements in translation quality for all examined language pairs, including successful application to domain-adapted models and generalisation to low-resource settings. This highlights the potential of COMET-guided MBR for efficient MT self-improvement in various scenarios.
pdf
bib
abs
FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes
Dawid Wisniewski
|
Zofia Rostek
|
Artur Nowakowski
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
People use language for various purposes. Apart from sharing information, individuals may use it to express emotions or to show respect for another person. In this paper, we focus on the formality level of machine-generated translations and present FAME-MT – a dataset consisting of 11.2 million translations between 15 European source languages and 8 European target languages classified to formal and informal classes according to target sentence formality. This dataset can be used to fine-tune machine translation models to ensure a given formality level for 8 European target languages considered. We describe the dataset creation procedure, the analysis of the dataset’s quality showing that FAME-MT is a reliable source of language register information, and we construct a publicly available proof-of-concept machine translation model that uses the dataset to steer the formality level of the translation. Currently, it is the largest dataset of formality annotations, with examples expressed in 112 European language pairs. The dataset is made available online.
2023
pdf
bib
abs
Exploring the Use of Foundation Models for Named Entity Recognition and Lemmatization Tasks in Slavic Languages
Gabriela Pałka
|
Artur Nowakowski
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper describes Adam Mickiewicz University’s (AMU) solution for the 4th Shared Task on SlavNER. The task involves the identification, categorization, and lemmatization of named entities in Slavic languages. Our approach involved exploring the use of foundation models for these tasks. In particular, we used models based on the popular BERT and T5 model architectures. Additionally, we used external datasets to further improve the quality of our models. Our solution obtained promising results, achieving high metrics scores in both tasks. We describe our approach and the results of our experiments in detail, showing that the method is effective for NER and lemmatization in Slavic languages. Additionally, our models for lemmatization will be available at:
https://huggingface.co/amu-cai.
2022
pdf
bib
abs
nEYron: Implementation and Deployment of an MT System for a Large Audit & Consulting Corporation
Artur Nowakowski
|
Krzysztof Jassem
|
Maciej Lison
|
Rafał Jaworski
|
Tomasz Dwojak
|
Karolina Wiater
|
Olga Posesor
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
This paper reports on the implementation and deployment of an MT system in the Polish branch of EY Global Limited. The system supports standard CAT and MT functionalities such as translation memory fuzzy search, document translation and post-editing, and meets less common, customer-specific expectations. The deployment began in August 2018 with a Proof of Concept, and ended with the signing of the Final Version acceptance certificate in October 2021. We present the challenges that were faced during the deployment, particularly in relation to the security check and installation processes in the production environment.
pdf
bib
abs
POLENG MT: An Adaptive MT Platform
Artur Nowakowski
|
Krzysztof Jassem
|
Maciej Lison
|
Kamil Guttmann
|
Mikołaj Pokrywka
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
We introduce POLENG MT, an MT platform that may be used as a cloud web application or as an on-site solution. The platform is capable of providing accurate document translation, including the transfer of document formatting between the input document and the output document. The main feature of the on-site version is dedicated customer adaptation, which consists of training on specialized texts and applying forced terminology translation according to the user’s needs.
pdf
bib
abs
Adam Mickiewicz University at WMT 2022: NER-Assisted and Quality-Aware Neural Machine Translation
Artur Nowakowski
|
Gabriela Pałka
|
Kamil Guttmann
|
Mikołaj Pokrywka
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents Adam Mickiewicz University’s (AMU) submissions to the constrained track of the WMT 2022 General MT Task. We participated in the Ukrainian ↔ Czech translation directions. The systems are a weighted ensemble of four models based on the Transformer (big) architecture. The models use source factors to utilize the information about named entities present in the input. Each of the models in the ensemble was trained using only the data provided by the shared task organizers. A noisy back-translation technique was used to augment the training corpora. One of the models in the ensemble is a document-level model, trained on parallel and synthetic longer sequences. During the sentence-level decoding process, the ensemble generated the n-best list. The n-best list was merged with the n-best list generated by a single document-level model which translated multiple sentences at a time. Finally, existing quality estimation models and minimum Bayes risk decoding were used to rerank the n-best list so that the best hypothesis was chosen according to the COMET evaluation metric. According to the automatic evaluation results, our systems rank first in both translation directions.
2021
pdf
bib
abs
Neural Machine Translation with Inflected Lexicon
Artur Nowakowski
|
Krzysztof Jassem
Proceedings of Machine Translation Summit XVIII: Research Track
The paper presents experiments in neural machine translation with lexical constraints into a morphologically rich language. In particular and we introduce a method and based on constrained decoding and which handles the inflected forms of lexical entries and does not require any modification to the training data or model architecture. To evaluate its effectiveness and we carry out experiments in two different scenarios: general and domain-specific. We compare our method with baseline translation and i.e. translation without lexical constraints and in terms of translation speed and translation quality. To evaluate how well the method handles the constraints and we propose new evaluation metrics which take into account the presence and placement and duplication and inflectional correctness of lexical terms in the output sentence.
pdf
bib
abs
Neural Translator Designed to Protect the Eastern Border of the European Union
Artur Nowakowski
|
Krzysztof Jassem
Proceedings of Machine Translation Summit XVIII: Users and Providers Track
This paper reports on a translation engine designed for the needs of the Polish State Border Guard. The engine is a component of the AI Searcher system, whose aim is to search for Internet texts, written in Polish, Russian, Ukrainian or Belarusian, which may lead to criminal acts at the eastern border of the European Union. The system is intended for Polish users, and the translation engine should serve to assist understanding of non-Polish documents. The engine was trained on general-domain texts. The adaptation for the criminal domain consisted in the appropriate translation of criminal terms and proper names, such as forenames, surnames and geographical objects. The translation process needs to take into account the rich inflection found in all of the languages of interest. To this end, a method based on constrained decoding that incorporates an inflected lexicon into a neural translation process was applied in the engine.
pdf
bib
abs
Adam Mickiewicz University’s English-Hausa Submissions to the WMT 2021 News Translation Task
Artur Nowakowski
|
Tomasz Dwojak
Proceedings of the Sixth Conference on Machine Translation
This paper presents the Adam Mickiewicz University’s (AMU) submissions to the WMT 2021 News Translation Task. The submissions focus on the English↔Hausa translation directions, which is a low-resource translation scenario between distant languages. Our approach involves thorough data cleaning, transfer learning using a high-resource language pair, iterative training, and utilization of monolingual data via back-translation. We experiment with NMT and PB-SMT approaches alike, using the base Transformer architecture for all of the NMT models while utilizing PB-SMT systems as comparable baseline solutions.