Multilingual Named Entity Recognition and Matching Using BERT and Dedupe for Slavic Languages

Marko Prelevikj, Slavko Zitnik


Abstract
This paper describes the University of Ljubljana (UL FRI) Group’s submissions to the shared task at the Balto-Slavic Natural Language Processing (BSNLP) 2021 Workshop. We experiment with multiple BERT-based models, pre-trained on multilingual, Croatian-Slovene-English, and Slovene-only data. We perform training iteratively and on the concatenated data of previously available NER datasets. For the normalization task we use the Stanza lemmatizer, while for entity matching we implement a baseline using the Dedupe library. The evaluation results suggest that multi-source settings outperform less-resourced approaches. The best NER models achieve an F-score of 0.91 on Slovene training data splits, while the best official submission achieved F-scores of 0.84 and 0.78 for the relaxed partial matching and strict settings, respectively. In the multilingual NER setting we achieve F-scores of 0.82 and 0.74.
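As a rough illustration of why a relaxed score can exceed a strict one, the sketch below compares exact span matching with overlap-based matching and computes an F-score for each. The span representation, the `overlaps` and `f_score` helpers, and the toy data are assumptions made for this example only; it is not the shared task's official scorer, whose definitions differ in detail.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two spans share a label and at least one position."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f_score(gold: List[Span], pred: List[Span], strict: bool) -> float:
    """F1 over spans: identical spans if strict, any overlap otherwise."""
    def match(a: Span, b: Span) -> bool:
        return a == b if strict else overlaps(a, b)

    # Predictions that hit some gold span count toward precision;
    # gold spans hit by some prediction count toward recall.
    matched_pred = sum(1 for p in pred if any(match(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(match(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): one exact hit, one boundary error, one miss.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 12, "ORG")]
pred = [(0, 2, "PER"), (5, 7, "LOC"), (14, 15, "ORG")]

strict_f1 = f_score(gold, pred, strict=True)
relaxed_f1 = f_score(gold, pred, strict=False)
print(f"strict={strict_f1:.2f} relaxed={relaxed_f1:.2f}")
```

In this toy case the prediction with a boundary error counts only under overlap matching, so the relaxed score is higher than the strict one.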
Anthology ID:
2021.bsnlp-1.9
Volume:
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Month:
April
Year:
2021
Address:
Kyiv, Ukraine
Venues:
BSNLP | EACL
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
80–85
URL:
https://aclanthology.org/2021.bsnlp-1.9
PDF:
https://aclanthology.org/2021.bsnlp-1.9.pdf