Evaluating Tokenizers Impact on OOVs Representation with Transformers Models

Alexandra Benamar, Cyril Grouin, Meryl Bothua, Anne Vilnat


Abstract
Transformer models have achieved significant improvements on multiple downstream tasks in recent years. One of the main contributions of Transformers is their ability to create new representations for out-of-vocabulary (OOV) words. In this paper, we evaluate three categories of OOVs: (A) new domain-specific terms (e.g., “eucaryote” in microbiology), (B) misspelled words containing typos, and (C) cross-domain homographs (e.g., “arm” has different meanings in a clinical trial and in anatomy). We use three French domain-specific datasets from the legal, medical, and energy domains to robustly analyze these categories. Our experiments led to the following findings: (1) it is easier to improve the representation of new words (A and B) than of words that already exist in the vocabulary of the Transformer models (C); (2) the most effective method for improving the representation of OOVs relies on adding external morpho-syntactic context rather than directly improving the semantic understanding of the words (fine-tuning); and (3) the impact of minor misspellings cannot be foreseen, because similar misspellings affect word representations differently. We believe that tackling the challenges of processing OOVs according to their specificities will significantly help the domain adaptation of BERT.
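The abstract's claim that Transformers create representations for OOV words rests on subword tokenization: an unseen word such as “eucaryote” is split into smaller pieces that do exist in the vocabulary. The sketch below illustrates this with a minimal WordPiece-style greedy longest-match tokenizer; the toy vocabulary and function name are illustrative assumptions, not the tokenizers evaluated in the paper.

```python
# Minimal sketch of WordPiece-style greedy longest-match subword
# tokenization, as used by BERT-like models. The vocabulary here is a
# toy example chosen for illustration only.

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest matching vocabulary pieces, left to
    right. Continuation pieces are prefixed with '##', as in WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark non-initial pieces
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:  # no piece matches: the whole word is unknown
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

vocab = {"eu", "##cary", "##ote", "arm"}
print(wordpiece_tokenize("eucaryote", vocab))  # ['eu', '##cary', '##ote']
print(wordpiece_tokenize("arm", vocab))        # ['arm']
```

The OOV term “eucaryote” (category A) is decomposed into known subwords, while the homograph “arm” (category C) maps to a single existing vocabulary entry, which is why its domain-specific meaning is harder to adjust.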
Anthology ID:
2022.lrec-1.445
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Note:
Pages:
4193–4204
URL:
https://aclanthology.org/2022.lrec-1.445
Cite (ACL):
Alexandra Benamar, Cyril Grouin, Meryl Bothua, and Anne Vilnat. 2022. Evaluating Tokenizers Impact on OOVs Representation with Transformers Models. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4193–4204, Marseille, France. European Language Resources Association.
Cite (Informal):
Evaluating Tokenizers Impact on OOVs Representation with Transformers Models (Benamar et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.445.pdf
Code
 alexandrabenamar/evaluating_tokenizers_oov