Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean

Hakyung Sung, Gyu-Ho Shin


Abstract
This study investigates the extent to which currently available morpheme parsers/taggers apply to lesser-studied languages and language-usage contexts, with a focus on second language (L2) Korean. We pursue this inquiry by (1) training a neural-network model (pre-trained on first language [L1] Korean data) on varying L2 datasets and (2) measuring its morpheme parsing/POS tagging performance on L2 test sets drawn both from the same sources as the L2 training sets and from different ones. Results show that the L2-trained models generally outperform the L1 pre-trained baseline model on domain-specific tokenization and POS tagging. Interestingly, increasing the size of the L2 training data does not consistently improve model performance.
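To make the evaluation setup described in the abstract concrete, below is a minimal sketch of how one might score a pre-trained L1-Korean neural pipeline on an L2 test set. It is an illustration under assumptions, not the paper's released code: Stanza is an assumed toolkit (the authors' toolchain is not specified on this page), "l2_test.conllu" is a hypothetical CoNLL-U gold file, and word-level UPOS accuracy on tokenization-matched sentences stands in for the paper's actual tokenization/tagging metrics.

```python
# Minimal sketch (assumed setup, not the paper's released code): evaluate a
# pre-trained L1-Korean neural pipeline on an L2 test set in CoNLL-U format.
# "l2_test.conllu" is a hypothetical file name; Stanza is an assumed toolkit.
import stanza

stanza.download("ko")  # fetch the pre-trained L1-Korean models (one-time)
nlp = stanza.Pipeline(lang="ko", processors="tokenize,pos",
                      tokenize_no_ssplit=True)  # treat each input as one sentence

def read_conllu(path):
    """Yield (surface sentence, [gold UPOS tags]) pairs from a CoNLL-U file."""
    words, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line terminates a sentence
                if words:
                    yield " ".join(words), tags
                words, tags = [], []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multiword-token / empty-node lines
                    words.append(cols[1])  # FORM column
                    tags.append(cols[3])   # UPOS column
    if words:  # file may not end with a blank line
        yield " ".join(words), tags

correct = total = 0
for sent, gold in read_conllu("l2_test.conllu"):
    doc = nlp(sent)
    pred = [w.upos for s in doc.sentences for w in s.words]
    # Score only sentences the model tokenizes to the gold token count;
    # mismatches would require span-based alignment to score fairly.
    if len(pred) == len(gold):
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)

print(f"UPOS accuracy on tokenization-matched sentences: {correct / max(total, 1):.3f}")
```

Sentences with tokenization mismatches are skipped here for simplicity; a fuller evaluation would align predicted and gold tokens by character spans, since tokenization quality on L2 text is itself one of the quantities the study measures.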
Anthology ID:
2023.findings-emnlp.767
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11461–11473
URL:
https://aclanthology.org/2023.findings-emnlp.767
DOI:
10.18653/v1/2023.findings-emnlp.767
Cite (ACL):
Hakyung Sung and Gyu-Ho Shin. 2023. Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11461–11473, Singapore. Association for Computational Linguistics.
Cite (Informal):
Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean (Sung & Shin, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.767.pdf