Khetam Al Sharou



2022

A Taxonomy and Study of Critical Errors in Machine Translation
Khetam Al Sharou | Lucia Specia
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Not all machine mistranslations are equal. For example, mistranslating a date or time in an appointment, mistranslating the number or currency in a contract, or hallucinating profanity may lead to consequences for the users even when MT is just used for gisting. The severity of the errors is an important, but overlooked, aspect of MT quality evaluation. In this paper, we present the result of our effort to bring awareness to the problem of critical translation errors. We study, validate and improve an initial taxonomy of critical errors with the view of providing guidance for critical error analysis, annotation and mitigation. We test the taxonomy for three different languages to examine to what extent it generalises across languages. We provide an account of factors that affect annotation tasks along with recommendations on how to improve the practice in future work. We also study the impact of the source text on generating critical errors in the translation and, based on this, propose a set of recommendations on aspects of the MT that need further scrutiny, especially for user-generated content, to avoid generating such errors, and hence improve online communication.

2021

Towards a Better Understanding of Noise in Natural Language Processing
Khetam Al Sharou | Zhenhao Li | Lucia Specia
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we propose a definition and taxonomy of various types of non-standard textual content – generally referred to as “noise” – in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of different sources of noise and how to deal with them is an aspect that has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content – which should not always be considered as “noise” – and of the need for careful, task-dependent pre-processing. This is an alternative to blanket, all-encompassing solutions generally applied by researchers through “standard” pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.