Youngsook Song
2023
Study on the Domain Adaption of Korean Speech Act using Daily Conversation Dataset and Petition Corpus
Youngsook Song | Won Ik Cho
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
In Korean, quantitative speech act studies have usually been conducted on single utterances with unspecified sources. In this study, we annotate sentences from the National Institute of Korean Language’s Messenger Corpus and the National Petition Corpus, as well as example sentences from an academic paper on contemporary Korean vlogging, and check the discrepancy between human annotation and model prediction. In particular, for sentences whose locutionary and illocutionary forces differ, we analyze the causes of errors to see whether stylistic features used in a particular domain affect the correct inference of speech acts. Through this, we see the need to build and analyze a balanced corpus across various text domains, taking into account cases with different usage roles, e.g., messenger conversations, which are private, and petition corpora and vlogging scripts, which have an unspecified audience.
Revisiting Korean Corpus Studies through Technological Advances
Won Ik Cho | Sangwhan Moon | Youngsook Song
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
2020
Open Korean Corpora: A Practical Report
Won Ik Cho | Sangwhan Moon | Youngsook Song
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also persists because available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.