Sliced at SemEval-2022 Task 11: Bigger, Better? Massively Multilingual LMs for Multilingual Complex NER on an Academic GPU Budget

Barbara Plank


Abstract
Massively multilingual language models (MMLMs) have become a widely used representation method, and multiple large MMLMs have been proposed in recent years. A trend is to train MMLMs on larger text corpora or with more layers. In this paper we set out to test recent popular MMLMs on detecting semantically ambiguous and complex named entities with an academic GPU budget. Our submission of a single model for 11 languages to the SemEval Task 11 MultiCoNER shared task shows that a vanilla transformer-CRF with XLM-R large outperforms the more recent RemBERT, ranking 9th of 26 submissions in the multilingual track. Compared to RemBERT, the XLM-R model has the additional advantage of fitting on a slice of a multi-instance GPU. Since, contrary to expectations and recent findings, RemBERT was not the best MMLM, we further investigated this discrepancy with additional experiments on multilingual Wikipedia NER data. While we expected RemBERT to have an edge on that dataset, as it is closer to its pre-training data, our results surprisingly show that this is not the case, suggesting that text-domain match does not explain the discrepancy.
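The abstract describes the submitted system only at a high level (a "vanilla transformer-CRF with XLM-R large"). The following is a minimal sketch of such a tagger, assuming PyTorch, the HuggingFace transformers library, and the pytorch-crf package; the class name, hyperparameters, and label handling are illustrative and not taken from the paper.

# Sketch of a transformer-CRF token tagger over XLM-R large
# (assumes per-subword label indices are prepared elsewhere).
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class XLMRCRFTagger(nn.Module):
    def __init__(self, num_labels, model_name="xlm-roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        # Contextual token representations from XLM-R.
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Per-token emission scores for the CRF layer.
        emissions = self.classifier(hidden_states)
        mask = attention_mask.bool()
        if labels is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)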
Anthology ID:
2022.semeval-1.205
Volume:
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Guy Emerson, Natalie Schluter, Gabriel Stanovsky, Ritesh Kumar, Alexis Palmer, Nathan Schneider, Siddharth Singh, Shyam Ratan
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
1494–1500
URL:
https://aclanthology.org/2022.semeval-1.205
DOI:
10.18653/v1/2022.semeval-1.205
Cite (ACL):
Barbara Plank. 2022. Sliced at SemEval-2022 Task 11: Bigger, Better? Massively Multilingual LMs for Multilingual Complex NER on an Academic GPU Budget. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1494–1500, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Sliced at SemEval-2022 Task 11: Bigger, Better? Massively Multilingual LMs for Multilingual Complex NER on an Academic GPU Budget (Plank, SemEval 2022)
PDF:
https://aclanthology.org/2022.semeval-1.205.pdf
Video:
https://aclanthology.org/2022.semeval-1.205.mp4
Data
MultiCoNER, WikiANN