Employing Wikipedia as a resource for Named Entity Recognition in Morphologically complex under-resourced languages

Aravind Krishnan, Stefan Ziehe, Franziska Pannach, Caroline Sporleder


Abstract
We propose a novel approach for rapid prototyping of named entity recognisers through the development of semi-automatically annotated datasets. We demonstrate the proposed pipeline on two under-resourced agglutinating languages: the Dravidian language Malayalam and the Bantu language isiZulu. Our approach is weakly supervised and bootstraps training data from Wikipedia and Google Knowledge Graph. Moreover, our approach is relatively language-independent and can consequently be ported quickly (and hence cost-effectively) from one language to another, requiring only minor language-specific tailoring.
Anthology ID:
2021.bucc-1.5
Volume:
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Month:
September
Year:
2021
Address:
Online (Virtual Mode)
Editors:
Reinhard Rapp, Serge Sharoff, Pierre Zweigenbaum
Venue:
BUCC
Publisher:
INCOMA Ltd.
Pages:
28–39
URL:
https://aclanthology.org/2021.bucc-1.5
Cite (ACL):
Aravind Krishnan, Stefan Ziehe, Franziska Pannach, and Caroline Sporleder. 2021. Employing Wikipedia as a resource for Named Entity Recognition in Morphologically complex under-resourced languages. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 28–39, Online (Virtual Mode). INCOMA Ltd.
Cite (Informal):
Employing Wikipedia as a resource for Named Entity Recognition in Morphologically complex under-resourced languages (Krishnan et al., BUCC 2021)
PDF:
https://aclanthology.org/2021.bucc-1.5.pdf