data2lang2vec: Data Driven Typological Features Completion

Hamidreza Amirzadeh, Sadegh Jafari, Anika Harju, Rob van der Goot

Abstract

Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage either predicts missing values based on features from other languages or focuses on single features; we instead propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on typology features that are likely to be missing, and show that our approach outperforms previous work in both setups.

Anthology ID: 2025.coling-main.435
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 6520–6529
URL: https://aclanthology.org/2025.coling-main.435/
Cite (ACL): Hamidreza Amirzadeh, Sadegh Jafari, Anika Harju, and Rob van der Goot. 2025. data2lang2vec: Data Driven Typological Features Completion. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6520–6529, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): data2lang2vec: Data Driven Typological Features Completion (Amirzadeh et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.435.pdf
