Constructing Indonesian-English Travelogue Dataset

Eunike Andriani Kardinata; Hiroki Ouchi; Taro Watanabe

Abstract

Research on low-resource languages is often hampered by the under-representation of how the language is actually used. This is particularly true for Indonesian: the variety of textual datasets is limited, and the majority were acquired from official sources written in a formal style. Datasets for geoparsing, a task that could support navigation and travel-planning applications, are rarer still, even in high-resource languages such as English. Recognising the need for a new resource in both languages for this task, we constructed a dataset of personal travelogue articles in Indonesian and English. The dataset consists of 88 articles, with exactly half written in each language. We cover both named and nominal expressions of four travel-related entity types: location, facility, transportation, and line. We also trained classifiers to recognise named entities and their nominal expressions. The results suggest that the dataset is promising for future use, as we obtained F1-scores above 0.9 for both languages.

Anthology ID: 2024.lrec-main.333
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 3759–3771
URL: https://aclanthology.org/2024.lrec-main.333/
Cite (ACL): Eunike Andriani Kardinata, Hiroki Ouchi, and Taro Watanabe. 2024. Constructing Indonesian-English Travelogue Dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3759–3771, Torino, Italia. ELRA and ICCL.
Cite (Informal): Constructing Indonesian-English Travelogue Dataset (Kardinata et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.333.pdf
