Anika Harju
2025
data2lang2vec: Data Driven Typological Features Completion
Hamidreza Amirzadeh | Sadegh Jafari | Anika Harju | Rob van der Goot
Proceedings of the 31st International Conference on Computational Linguistics
Language typology databases enhance multilingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9%. Previous work on automatically increasing coverage either predicts missing values from the features of other languages or focuses on single features; we instead propose to use textual data for better-informed feature prediction. To this end, we introduce a multilingual Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup that focuses on typology features likely to be missing, and show that our approach outperforms previous work in both setups.
C A N C E R: Corpus for Accurate Non-English Cancer-related Educational Resources
Anika Harju | Asma Shakeel | Tiantian He | Tianqi Xu | Aaro Harju
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models
Improving the quality of cancer terminology through Machine Translation (MT) in non-English languages remains an under-researched area, despite its critical role in supporting self-management and advancing multilingual patient education. Existing computational tools encounter significant limitations in accurately translating cancer terminologies, particularly for low-resource languages, primarily due to data scarcity and morphological complexity. To address this gap, we introduce a dedicated terminology resource — the Corpus for Accurate Non-English Cancer-related Educational Resources (C A N C E R), a manually annotated dataset in Finnish (FI), Chinese (ZH), and Urdu (UR), curated from publicly available English (EN) data. We also examine the impact of data quality versus quantity, and compare the performance of the Opus-mt-en-fi, Opus-mt-en-zh, and Opus-mt-en-ur models with the SMaLL-100 multilingual MT model. We assess translation quality using automatic and human evaluation. Results demonstrate that high-quality parallel data, though sparse, combined with fine-tuning substantially improves the translation of cancer terminology across both high- and low-resource language pairs, positioning the C A N C E R corpus as a foundational resource for improving multilingual patient education.
How to age BERT Well: Continuous Training for Historical Language Adaptation
Anika Harju | Rob van der Goot
Proceedings of the First Workshop on Language Models for Low-Resource Languages
As computational tools are increasingly applied to digitize historical archives, automatic annotation remains challenging due to the distinct linguistic and morphological features of historical languages such as Old English (OE). Existing tools struggle with historical language varieties due to insufficient training data, and previous research has focused on adapting pre-trained language models to new languages or domains, rarely exploring the modeling of language variety across time. Hence, we investigate the effectiveness of continuous language model training for adapting language models to OE on domain-specific data. We retrain a modern English (EN) BERT model and a multilingual (ML) BERT model for OE, and use part-of-speech (POS) tagging for downstream evaluation. Results confirm that continuous pre-training substantially improves performance, advancing the potential to understand the unique grammatical structures of historical OE archives. More concretely, EN BERT initially outperformed ML BERT with an accuracy of 83% during the language modeling phase; however, on the POS tagging task, ML BERT surpassed EN BERT, achieving an accuracy of 94%, which suggests effective adaptation to historical language varieties.