Towards a Better Understanding of Noise in Natural Language Processing

Khetam Al Sharou, Zhenhao Li, Lucia Specia


Abstract
In this paper, we propose a definition and taxonomy of various types of non-standard textual content – generally referred to as “noise” – in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of different sources of noise and how to deal with them is an aspect that has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content – which should not always be considered as “noise” – and of the need for careful, task-dependent pre-processing. This is an alternative to blanket, all-encompassing solutions generally applied by researchers through “standard” pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.
Anthology ID:
2021.ranlp-1.7
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
53–62
URL:
https://aclanthology.org/2021.ranlp-1.7
Cite (ACL):
Khetam Al Sharou, Zhenhao Li, and Lucia Specia. 2021. Towards a Better Understanding of Noise in Natural Language Processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 53–62, Held Online. INCOMA Ltd.
Cite (Informal):
Towards a Better Understanding of Noise in Natural Language Processing (Al Sharou et al., RANLP 2021)
PDF:
https://aclanthology.org/2021.ranlp-1.7.pdf
Data
OLID