Dimitar Shterionov


2021

Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)
Dimitar Shterionov
Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)

Defining meaningful units. Challenges in sign segmentation and segment-meaning mapping (short paper)
Mirella De Sisto | Dimitar Shterionov | Irene Murtagh | Myriam Vermeerbergen | Lorraine Leeson
Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)

This paper addresses the tasks of sign segmentation and segment-meaning mapping in the context of sign language (SL) recognition. It aims to give an overview of the linguistic properties of SL, such as coarticulation and simultaneity, which make these tasks complex. A better understanding of SL structure is the necessary ground for the design and development of SL recognition and segmentation methodologies, which are fundamental for machine translation of these languages. Based on this preliminary exploration, a proposal for mapping segments to meaning in the form of an agglomerate of lexical and non-lexical information is introduced.

Early-stage development of the SignON application and open framework – challenges and opportunities
Dimitar Shterionov | John J O’Flaherty | Edward Keane | Connor O’Reilly | Marcello Paolo Scipioni | Marco Giovanelli | Matteo Villa
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

SignON is an EU Horizon 2020 Research and Innovation project that is developing a smartphone application and an open framework to facilitate translation between different European sign, spoken and text languages. The framework will incorporate state-of-the-art sign language recognition and presentation, speech processing technologies and, at its core, multi-modal, cross-language machine translation. The framework, dedicated to the computationally heavy tasks and distributed on the cloud, powers the application – a lightweight app running on a standard mobile device. The application and framework are being researched, designed and developed through a co-creation user-centric approach with the European deaf and hard of hearing communities. In this session, the speakers will detail their progress, challenges and lessons learned in the early-stage development of the application and framework. They will also present their Agile DevOps approach and the next steps in the evolution of the SignON project.

Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation
Eva Vanmassenhove | Dimitar Shterionov | Matthew Gwilliam
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gender in MT and investigate how bias amplification might affect language in a broader sense. We hypothesize that the ‘algorithmic bias’, i.e. an exacerbation of frequently observed patterns in combination with a loss of less frequent ones, not only exacerbates societal biases present in current datasets but could also lead to an artificially impoverished language: ‘machine translationese’. We assess the linguistic richness (on a lexical and morphological level) of translations created by different data-driven MT paradigms – phrase-based statistical (PB-SMT) and neural MT (NMT). Our experiments show that there is a loss of lexical and morphological richness in the translations produced by all investigated MT paradigms for two language pairs (EN-FR and EN-ES).

NeuTral Rewriter: A Rule-Based and Neural Approach to Automatic Rewriting into Gender Neutral Alternatives
Eva Vanmassenhove | Chris Emmery | Dimitar Shterionov
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent years have seen an increasing need for gender-neutral and inclusive language. Within the field of NLP, there are various mono- and bilingual use cases where gender-inclusive language is appropriate, if not preferred, due to ambiguity or uncertainty in terms of the gender of referents. In this work, we present a rule-based and a neural approach to gender-neutral rewriting for English along with manually curated synthetic data (WinoBias+) and natural data (OpenSubtitles and Reddit) benchmarks. A detailed manual and automatic evaluation highlights how our NeuTral Rewriter, trained on data generated by the rule-based approach, obtains word error rates (WER) below 0.18% on synthetic, in-domain and out-domain test sets.

2020

An Investigative Study of Multi-Modal Cross-Lingual Retrieval
Piyush Arora | Dimitar Shterionov | Yasufumi Moriya | Abhishek Kaushik | Daria Dzendzik | Gareth Jones
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

We describe work from our investigations of the novel area of multi-modal cross-lingual retrieval (MMCLIR) under low-resource conditions. We study the challenges associated with MMCLIR relating to: (i) data conversion between different modalities, for example speech and text; (ii) overcoming the language barrier between source and target languages; (iii) effectively scoring and ranking documents to suit the retrieval task; and (iv) handling low-resource constraints that prohibit development of heavily tuned machine translation (MT) and automatic speech recognition (ASR) systems. We focus on the use case of retrieving text and speech documents in Swahili using English queries, which was the main focus of the OpenCLIR shared task. Our work is developed within the scope of this task. In this paper we devote special attention to the automatic translation (AT) component, which is crucial for the overall quality of the MMCLIR system. We exploit a combination of dictionaries and phrase-based statistical machine translation (MT) systems to tackle effectively the subtask of query translation. We address each MMCLIR challenge individually, and develop separate components for automatic translation (AT), speech processing (SP) and information retrieval (IR). We find that results with respect to cross-lingual text retrieval are quite good relative to the task of cross-lingual speech retrieval. Overall, we find that the task of MMCLIR, and specifically cross-lingual speech retrieval, is quite complex. Further, we pinpoint open issues related to handling cross-lingual audio and text retrieval for low-resource languages that need to be addressed in future research.

Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation
Xabier Soto | Dimitar Shterionov | Alberto Poncelas | Andy Way
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.

2019

APE through Neural and Statistical MT with Augmented Data. ADAPT/DCU Submission to the WMT 2019 APE Shared Task
Dimitar Shterionov | Joachim Wagner | Félix do Carmo
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Automatic post-editing (APE) can be reduced to a machine translation (MT) task, where the source is the output of a specific MT system and the target is its post-edited variant. However, this approach does not consider context information that can be found in the original source of the MT system. Thus, a better approach is to employ multi-source MT, where two input sequences are considered – the one being the original source and the other being the MT output. Extra context information can be introduced in the form of extra tokens that identify a certain global property of a group of segments, added as a prefix or a suffix to each segment. Successfully applied in domain adaptation of MT as well as on APE, this technique deserves further attention. In this work we investigate multi-source neural APE (or NPE) systems with training data which has been augmented with two types of extra context tokens. We experiment with authentic and synthetic data provided by WMT 2019 and submit our results to the APE shared task. We also experiment with using statistical machine translation (SMT) methods for APE. While our systems score below the baseline, we consider this work a step towards understanding the added value of extra context in the case of APE.

Lost in Translation: Loss and Decay of Linguistic Richness in Machine Translation
Eva Vanmassenhove | Dimitar Shterionov | Andy Way
Proceedings of Machine Translation Summit XVII: Research Track

Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
Mikel Forcada | Andy Way | John Tinsley | Dimitar Shterionov | Celia Rico | Federico Gaspari
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

When less is more in Neural Quality Estimation of Machine Translation. An industry case study
Dimitar Shterionov | Félix Do Carmo | Joss Moorkens | Eric Paquin | Dag Schmidtke | Declan Groves | Andy Way
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation
Mihael Arcan | Marco Turchi | Jinhua Du | Dimitar Shterionov | Daniel Torregrosa
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

Combining PBSMT and NMT Back-translated Data for Efficient NMT
Alberto Poncelas | Maja Popović | Dimitar Shterionov | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation, which consists of generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular, we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performance when the training set is augmented with back-translated data created by merging different MT approaches.

2016

Divide and Conquer Strategy for Large Data MT
Dimitar Shterionov
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

Improving KantanMT Training Efficiency with fast_align
Dimitar Shterionov | Jinhua Du | Marc Anthony Palminteri | Laura Casanellas | Tony O’Dowd | Andy Way
Conferences of the Association for Machine Translation in the Americas: MT Users' Track