A Hybrid Architecture for Labelling Bilingual Māori-English Tweets

David Trye, Vithya Yogarajan, Jemma König, Te Taka Keegan, David Bainbridge, Mark Apperley


Abstract
Most large-scale language detection tools perform poorly at identifying Māori text. Moreover, rule-based and machine learning-based techniques devised specifically for the Māori-English language pair struggle with interlingual homographs. We develop a hybrid architecture that couples Māori-language orthography with machine learning models in order to annotate mixed Māori-English text. This architecture is used to label a new bilingual Twitter corpus at both the token (word) and tweet (sentence) levels. We use the collected tweets to show that the hybrid approach outperforms existing systems with respect to language detection of interlingual homographs and overall accuracy. We also evaluate its performance on out-of-domain data. Two interactive visualisations are provided for exploring the Twitter corpus and comparing errors across the new and existing techniques. The architecture code and visualisations are available online, and the corpus is available on request.
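As a rough illustration of why orthography alone is not enough, the sketch below checks whether a token fits standard Māori spelling (vowels a, e, i, o, u, optionally with macrons; consonants h, k, m, n, p, r, t, w; digraphs ng and wh; every syllable ending in a vowel) and defers to a model only for interlingual homographs. It is not the authors' implementation: the function names, the label strings, the tiny homograph list and the `homograph_model` callback are all hypothetical, chosen only to make the idea concrete.

```python
import re
import unicodedata

# Standard Māori orthography: each syllable is an optional consonant
# (h, k, m, n, p, r, t, w, or the digraphs ng / wh) followed by one vowel,
# which may carry a macron. Words therefore never end in a consonant.
_VOWELS = "aeiouāēīōū"
_MAORI_WORD = re.compile(rf"(?:(?:ng|wh|[hkmnprtw])?[{_VOWELS}])+")

# Illustrative only: a few spellings that are valid words in both languages.
_HOMOGRAPHS = {"a", "he", "here", "me", "to"}


def fits_maori_orthography(token: str) -> bool:
    """Return True if the token could be spelled this way in Māori."""
    # NFC composes "a" + combining macron into the single character "ā".
    normalised = unicodedata.normalize("NFC", token).lower()
    return _MAORI_WORD.fullmatch(normalised) is not None


def label_token(token: str, homograph_model=None) -> str:
    """Hypothetical hybrid labeller: rules first, a model for homographs."""
    if not fits_maori_orthography(token):
        return "English"
    if token.lower() in _HOMOGRAPHS:
        # The spelling rules cannot decide here, so hand over to a
        # context-aware classifier if one was supplied (assumed interface).
        if homograph_model is None:
            return "ambiguous"
        return homograph_model(token)
    return "Māori"
```

In this sketch, "whānau" and "kōrero" pass the orthographic check and "school" fails it, while "here" passes despite being a common English word. That last case is the kind of token for which the abstract's machine learning component is needed.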
Anthology ID:
2022.findings-aacl.11
Volume:
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Month:
November
Year:
2022
Address:
Online only
Editors:
Yulan He, Heng Ji, Sujian Li, Yang Liu, Chia-Hui Chang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
119–130
URL:
https://aclanthology.org/2022.findings-aacl.11
Cite (ACL):
David Trye, Vithya Yogarajan, Jemma König, Te Taka Keegan, David Bainbridge, and Mark Apperley. 2022. A Hybrid Architecture for Labelling Bilingual Māori-English Tweets. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 119–130, Online only. Association for Computational Linguistics.
Cite (Informal):
A Hybrid Architecture for Labelling Bilingual Māori-English Tweets (Trye et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-aacl.11.pdf
Software:
 2022.findings-aacl.11.Software.zip