Fajri Koto


2023

pdf bib
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Samuel Cahyawijaya | Holy Lovenia | Fajri Koto | Dea Adhista | Emmanuel Dave | Sarah Oktavianti | Salsabil Akbar | Jhonson Lee | Nuur Shadieq | Tjeng Wawan Cenggoro | Hanung Linuwih | Bryan Wilie | Galih Muridan | Genta Winata | David Moeljadi | Alham Fikri Aji | Ayu Purwarianti | Pascale Fung
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya | Holy Lovenia | Alham Fikri Aji | Genta Winata | Bryan Wilie | Fajri Koto | Rahmad Mahendra | Christian Wibisono | Ade Romadhony | Karissa Vincentio | Jennifer Santoso | David Moeljadi | Cahya Wirawan | Frederikus Hudi | Muhammad Satrio Wicaksono | Ivan Parmonangan | Ika Alfina | Ilham Firdausi Putra | Samsul Rahmadani | Yulianti Oenang | Ali Septiandri | James Jaya | Kaustubh Dhole | Arie Suryani | Rifki Afina Putri | Dan Su | Keith Stevens | Made Nindyatama Nityasya | Muhammad Adilazuarda | Ryan Hadiwijaya | Ryandito Diandaru | Tiezheng Yu | Vito Ghifari | Wenliang Dai | Yan Xu | Dyah Damapuspita | Haryo Wibowo | Cuk Tho | Ichwanul Karo Karo | Tirana Fatyanosa | Ziwei Ji | Graham Neubig | Timothy Baldwin | Sebastian Ruder | Pascale Fung | Herry Sujaini | Sakriani Sakti | Ayu Purwarianti
Findings of the Association for Computational Linguistics: ACL 2023

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments.NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

pdf bib
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU
Fajri Koto | Nurul Aisyah | Haonan Li | Timothy Baldwin
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered due to the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited knowledge of local Indonesian languages and culture. Other smaller models such as BLOOMZ and Falcon perform at even lower levels.

pdf bib
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Genta Indra Winata | Alham Fikri Aji | Samuel Cahyawijaya | Rahmad Mahendra | Fajri Koto | Ade Romadhony | Kemal Kurniawan | David Moeljadi | Radityo Eko Prasojo | Pascale Fung | Timothy Baldwin | Jey Han Lau | Rico Sennrich | Sebastian Ruder
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes sentiment and machine translation datasets, and bilingual lexicons. We provide extensive analyses and describe challenges for creating such resources. We hope this work can spark NLP research on Indonesian and other underrepresented languages.

2022

pdf bib
Easy-First Bottom-Up Discourse Parsing via Sequence Labelling
Andrew Shen | Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

We propose a novel unconstrained bottom-up approach for rhetorical discourse parsing based on sequence labelling of adjacent pairs of discourse units (DUs), based on the framework of Koto et al. (2021). We describe the unique training requirements of an unconstrained parser, and explore two different training procedures: (1) fixed left-to-right; and (2) random order in tree construction. Additionally, we introduce a novel dynamic oracle for unconstrained bottom-up parsing. Our proposed parser achieves competitive results for bottom-up rhetorical discourse parsing.

pdf bib
LipKey: A Large-Scale News Dataset for Absent Keyphrases Generation and Abstractive Summarization
Fajri Koto | Timothy Baldwin | Jey Han Lau
Proceedings of the 29th International Conference on Computational Linguistics

Summaries, keyphrases, and titles are different ways of concisely capturing the content of a document. While most previous work has released the datasets of keyphrases and summarization separately, in this work, we introduce LipKey, the largest news corpus with human-written abstractive summaries, absent keyphrases, and titles. We jointly use the three elements via multi-task training and training as joint structured inputs, in the context of document summarization. We find that including absent keyphrases and titles as additional context to the source document improves transformer-based summarization models.

pdf bib
Can Pretrained Language Models Generate Persuasive, Faithful, and Informative Ad Text for Product Descriptions?
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)

For any e-commerce service, persuasive, faithful, and informative product descriptions can attract shoppers and improve sales. While not all sellers are capable of providing such interesting descriptions, a language generation system can be a source of such descriptions at scale, and potentially assist sellers to improve their product descriptions. Most previous work has addressed this task based on statistical approaches (Wang et al., 2017), limited attributes such as titles (Chen et al., 2019; Chan et al., 2020), and focused on only one product type (Wang et al., 2017; Munigala et al., 2018; Hong et al., 2021). In this paper, we jointly train image features and 10 text attributes across 23 diverse product types, with two different target text types with different writing styles: bullet points and paragraph descriptions. Our findings suggest that multimodal training with modern pretrained language models can generate fluent and persuasive advertisements, but are less faithful and informative, especially out of domain.

pdf bib
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Alham Fikri Aji | Genta Indra Winata | Fajri Koto | Samuel Cahyawijaya | Ade Romadhony | Rahmad Mahendra | Kemal Kurniawan | David Moeljadi | Radityo Eko Prasojo | Timothy Baldwin | Jey Han Lau | Sebastian Ruder
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia’s 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.

pdf bib
Cloze Evaluation for Deeper Understanding of Commonsense Stories in Indonesian
Fajri Koto | Timothy Baldwin | Jey Han Lau
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

Story comprehension that involves complex causal and temporal relations is a critical task in NLP, but previous studies have focused predominantly on English, leaving open the question of how the findings generalize to other languages, such as Indonesian. In this paper, we follow the Story Cloze Test framework of Mostafazadeh et al. (2016) in evaluating story understanding in Indonesian, by constructing a four-sentence story with one correct ending and one incorrect ending. To investigate commonsense knowledge acquisition in language models, we experimented with: (1) a classification task to predict the correct ending; and (2) a generation task to complete the story with a single sentence. We investigate these tasks in two settings: (i) monolingual training and ii) zero-shot cross-lingual transfer between Indonesian and English.

pdf bib
Context-Aware Sentence Classification in Evidence-Based Medicine
Biaoyan Fang | Fajri Koto
Proceedings of the The 20th Annual Workshop of the Australasian Language Technology Association

2021

pdf bib
Top-down Discourse Parsing via Sequence Labelling
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both traditional recurrent models and modern pre-trained transformer models for the task, and additionally introduce a novel dynamic oracle for top-down parsing. Based on the Full metric, our proposed LSTM model sets a new state-of-the-art for RST parsing.

pdf bib
Handling Variance of Pretrained Language Models in Grading Evidence in the Medical Literature
Fajri Koto | Biaoyan Fang
Proceedings of the The 19th Annual Workshop of the Australasian Language Technology Association

In this paper, we investigate the utility of modern pretrained language models for the evidence grading system in the medical literature based on the ALTA 2021 shared task. We benchmark 1) domain-specific models that are optimized for medical literature and 2) domain-generic models with rich latent discourse representation (i.e. ELECTRA, RoBERTa). Our empirical experiments reveal that these modern pretrained language models suffer from high variance, and the ensemble method can improve the model performance. We found that ELECTRA performs best with an accuracy of 53.6% on the test set, outperforming domain-specific models.1

pdf bib
Discourse Probing of Pretrained Language Models
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks. In this paper, we introduce document-level discourse probing to evaluate the ability of pretrained LMs to capture document-level relations. We experiment with 7 pretrained LMs, 4 languages, and 7 discourse probing tasks, and find BART to be overall the best model at capturing discourse — but only in its encoder, with BERT performing surprisingly well as the baseline model. Across the different models, there are substantial differences in which layers best capture discourse information, and large disparities between models.

pdf bib
IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.

pdf bib
Evaluating the Efficacy of Summarization Evaluation across Languages
Fajri Koto | Jey Han Lau | Timothy Baldwin
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP
Fajri Koto | Afshin Rahimi | Jey Han Lau | Timothy Baldwin
Proceedings of the 28th International Conference on Computational Linguistics

Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.

pdf bib
Liputan6: A Large-scale Indonesian Dataset for Text Summarization
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document–summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE itself, as well as with extractive and abstractive summarization models.

pdf bib
Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation
Fajri Koto | Ikhwan Koto
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2019

pdf bib
Improved Document Modelling with a Neural Discourse Parser
Fajri Koto | Jey Han Lau | Timothy Baldwin
Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association

Despite the success of attention-based neural models for natural language generation and classification tasks, they are unable to capture the discourse structure of larger documents. We hypothesize that explicit discourse representations have utility for NLP tasks over longer documents or document sequences, which sequence-to-sequence models are unable to capture. For abstractive summarization, for instance, conventional neural models simply match source documents and the summary in a latent space without explicit representation of text structure or relations. In this paper, we propose to use neural discourse representations obtained from a rhetorical structure theory (RST) parser to enhance document representations. Specifically, document representations are generated for discourse spans, known as the elementary discourse units (EDUs). We empirically investigate the benefit of the proposed approach on two different tasks: abstractive summarization and popularity prediction of online petitions. We find that the proposed approach leads to substantial improvements in all cases.

2016

pdf bib
A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization
Fajri Koto
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we report our effort to construct the first ever Indonesian corpora for chat summarization. Specifically, we utilized documents of multi-participant chat from a well known online instant messaging application, WhatsApp. We construct the gold standard by asking three native speakers to manually summarize 300 chat sections (152 of them contain images). As result, three reference summaries in extractive and either abstractive form are produced for each chat sections. The corpus is still in its early stage of investigation, yielding exciting possibilities of future works.