Marko Prelevikj
2021
Multilingual Named Entity Recognition and Matching Using BERT and Dedupe for Slavic Languages
Marko Prelevikj
|
Slavko Zitnik
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
This paper describes the University of Ljubljana (UL FRI) Group’s submissions to the shared task at the Balto-Slavic Natural Language Processing (BSNLP) 2021 Workshop. We experiment with multiple BERT-based models, pre-trained in multi-lingual, Croatian-Slovene-English and Slovene-only data. We perform training iteratively and on the concatenated data of previously available NER datasets. For the normalization task we use Stanza lemmatizer, while for entity matching we implemented a baseline using the Dedupe library. The performance of evaluations suggests that multi-source settings outperform less-resourced approaches. The best NER models achieve 0.91 F-score on Slovene training data splits while the best official submission achieved F-scores of 0.84 and 0.78 for relaxed partial matching and strict settings, respectively. In multi-lingual NER setting we achieve F-scores of 0.82 and 0.74.
Search