Whit’s the Richt Pairt o Speech: PoS tagging for Scots

Harm Lameris, Sara Stymne


Abstract
In this paper we explore PoS tagging for the Scots language. Scots is spoken in Scotland and Northern Ireland, and is closely related to English. As no linguistically annotated Scots data were available, we manually PoS tagged a small set that is used for evaluation and training. We use English as a transfer language to examine zero-shot transfer and transfer learning methods. We find that training on a very small amount of Scots data was superior to zero-shot transfer from English. Combining the Scots and English data led to further improvements, with a concatenation method giving the best results. We also compared the use of two different English treebanks and found that a treebank containing web data was superior in the zero-shot setting, while it was outperformed by a treebank containing a mix of genres when combined with Scots data.
Anthology ID:
2021.vardial-1.5
Volume:
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
April
Year:
2021
Address:
Kyiv, Ukraine
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer, Tommi Jauhiainen
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
39–48
URL:
https://aclanthology.org/2021.vardial-1.5
Cite (ACL):
Harm Lameris and Sara Stymne. 2021. Whit’s the Richt Pairt o Speech: PoS tagging for Scots. In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 39–48, Kyiv, Ukraine. Association for Computational Linguistics.
Cite (Informal):
Whit’s the Richt Pairt o Speech: PoS tagging for Scots (Lameris & Stymne, VarDial 2021)
PDF:
https://aclanthology.org/2021.vardial-1.5.pdf