Young-Seob Jeong

2022

SwahBERT: Language Model of Swahili
Gati Martin | Medard Edmund Mswahili | Young-Seob Jeong | Jiyoung Woo
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has influenced the growth of Web data. Social media and Internet forums are valuable sources of citizens’ opinions, which can be analyzed for community development and user behavior analysis. Unfortunately, the scarcity of resources (i.e., datasets or language models) becomes a barrier to the development of natural language processing applications in low-resource languages. Thanks to the recent growth of online forums and news platforms in Swahili, we introduce two Swahili datasets in this paper: a pre-training dataset of approximately 105MB with 16M words and an annotated dataset of 13K instances for the emotion classification task. The emotion classification dataset is manually annotated by two native Swahili speakers. We pre-trained a new monolingual language model for Swahili, namely SwahBERT, using our collected pre-training data, and tested it on four downstream tasks including emotion classification. We found that SwahBERT outperforms multilingual BERT, a well-known existing language model, in almost all downstream tasks.

2018

Korean TimeBank Including Relative Temporal Information
Chae-Gyun Lim | Young-Seob Jeong | Ho-Jin Choi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

papago: A Machine Translation Service with Word Sense Disambiguation and Currency Conversion
Hyoung-Gyu Lee | Jun-Seok Kim | Joong-Hwi Shin | Jaesong Lee | Ying-Xiu Quan | Young-Seob Jeong
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we introduce papago, a translator for mobile devices that is equipped with new features for user convenience. The first feature is word sense disambiguation based on user feedback. Using this feature, users can select one of the multiple meanings of a homograph and obtain a corrected translation with the user-selected sense. The second feature is the instant conversion of money expressions contained in a translation result at the current exchange rate. Users traveling abroad can quickly and precisely obtain the amount converted into the local currency.

Many emerging documents contain temporal information. Because temporal information is useful for various applications, it has become important to develop a system for extracting it from documents. Before developing such a system, it is first necessary to define the structure of temporal information. In other words, it is necessary to design a language that defines how to annotate the temporal information. There have been some studies of annotation languages, but most of them were applicable to only a specific target language (e.g., English). Thus, it is necessary to design an individual annotation language for each language. In this paper, we propose a revised version of Korean Time Mark-up Language (K-TimeML), and also introduce a dataset, named Korean TimeBank, that is constructed based on K-TimeML. We believe that the new K-TimeML and Korean TimeBank will be used in much further research on the extraction of temporal information.
