Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis

Larisa Kolesnichenko, Erik Velldal, Lilja Øvrelid


Abstract
This paper explores the use of masked language modeling (MLM) for data augmentation (DA), targeting structured sentiment analysis (SSA) for Norwegian based on a dataset of annotated reviews. Considering the limited resources for the Norwegian language and the complexity of the annotation task, the aim is to investigate whether this approach to data augmentation can help boost performance. We report on experiments with substituting words both inside and outside of sentiment annotations, present an error analysis that discusses some of the potential pitfalls of using MLM-based DA for SSA, and suggest directions for future work.
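The general idea of MLM-based word substitution can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment`, the `protected` index set, and the `toy_predict` stand-in are all hypothetical names introduced here, and the sketch only covers substitution outside of annotated sentiment spans. In a real setting, `predict` would presumably wrap a masked language model for Norwegian that returns ranked candidates for the masked position.

```python
import random
from typing import Callable, List, Optional, Sequence, Set

MASK = "[MASK]"  # assumed mask symbol; real models define their own

# A predictor takes the masked token list and the masked index,
# and returns candidate replacements ranked best-first.
Predictor = Callable[[List[str], int], Sequence[str]]


def augment(
    tokens: Sequence[str],
    protected: Set[int],
    predict: Predictor,
    n_subs: int = 1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a copy of `tokens` with up to `n_subs` unprotected tokens
    replaced by the predictor's top candidate that differs from the original."""
    rng = rng or random.Random(0)
    # Only positions outside the protected (annotated) spans may change.
    candidates = [i for i in range(len(tokens)) if i not in protected]
    chosen = rng.sample(candidates, min(n_subs, len(candidates)))
    out = list(tokens)
    for i in chosen:
        masked = out[:i] + [MASK] + out[i + 1:]
        for cand in predict(masked, i):
            if cand != tokens[i]:  # skip the original word itself
                out[i] = cand
                break
    return out


def toy_predict(masked: List[str], index: int) -> Sequence[str]:
    """Hypothetical stand-in for a masked language model."""
    return ["var", "er"]


# "Maten var veldig god": indices 0, 2 and 3 are treated as annotated
# (target and polar expression), so only index 1 may be substituted.
original = ["Maten", "var", "veldig", "god"]
augmented = augment(original, protected={0, 2, 3}, predict=toy_predict)
```

Keeping the predictor as a plain callable separates the span-protection logic from the choice of model. Substituting inside annotated spans would instead draw candidates from the protected positions, which carries the risk of changing the polarity that the original label describes.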
Anthology ID:
2023.resourceful-1.6
Volume:
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
Month:
May
Year:
2023
Address:
Tórshavn, the Faroe Islands
Editors:
Nikolai Ilinykh, Felix Morger, Dana Dannélls, Simon Dobnik, Beáta Megyesi, Joakim Nivre
Venue:
RESOURCEFUL
Publisher:
Association for Computational Linguistics
Pages:
42–47
URL:
https://aclanthology.org/2023.resourceful-1.6
Cite (ACL):
Larisa Kolesnichenko, Erik Velldal, and Lilja Øvrelid. 2023. Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis. In Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023), pages 42–47, Tórshavn, the Faroe Islands. Association for Computational Linguistics.
Cite (Informal):
Word Substitution with Masked Language Models as Data Augmentation for Sentiment Analysis (Kolesnichenko et al., RESOURCEFUL 2023)
PDF:
https://aclanthology.org/2023.resourceful-1.6.pdf