Dea Adhista


2023

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Samuel Cahyawijaya | Holy Lovenia | Fajri Koto | Dea Adhista | Emmanuel Dave | Sarah Oktavianti | Salsabil Akbar | Jhonson Lee | Nuur Shadieq | Tjeng Wawan Cenggoro | Hanung Linuwih | Bryan Wilie | Galih Muridan | Genta Winata | David Moeljadi | Alham Fikri Aji | Ayu Purwarianti | Pascale Fung
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

ICON: Building a Large-Scale Benchmark Constituency Treebank for the Indonesian Language
Ee Suan Lim | Wei Qi Leong | Ngan Thanh Nguyen | Dea Adhista | Wei Ming Kng | William Chandra Tjh | Ayu Purwarianti
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

Constituency parsing is an important task that reveals how words combine to form sentences. While constituency parsing in English has seen significant progress in the last few years, tools for constituency parsing in Indonesian remain few and far between. In this work, we publish ICON (Indonesian CONstituency treebank), to date the largest publicly available, manually annotated benchmark constituency treebank for the Indonesian language, with 10,000 sentences, approximately 124,000 constituents, and 182,000 tokens, which can support the training of state-of-the-art transformer-based models. We establish strong baselines on the ICON dataset using the Berkeley Neural Parser with transformer-based pre-trained embeddings; the best performance, an F1 score of 88.85%, comes from our own version of SpanBERT (IndoSpanBERT). We further analyze the predictions made by our best-performing model to reveal certain idiosyncrasies in the Indonesian language that pose challenges for constituency parsing.