One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Alham Aji | Genta Indra Winata | Fajri Koto | Samuel Cahyawijaya | Ade Romadhony | Rahmad Mahendra | Kemal Kurniawan | David Moeljadi | Radityo Eko Prasojo | Timothy Baldwin | Jey Han Lau | Sebastian Ruder
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia’s 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.


MultiLexNorm: A Shared Task on Multilingual Lexical Normalization
Rob van der Goot | Alan Ramponi | Arkaitz Zubiaga | Barbara Plank | Benjamin Muller | Iñaki San Vicente Roncal | Nikola Ljubešić | Özlem Çetinoğlu | Rahmad Mahendra | Talha Çolakoğlu | Timothy Baldwin | Tommaso Caselli | Wladimir Sidorenko
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MultiLexNorm shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 13 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.

IndoNLI: A Natural Language Inference Dataset for Indonesian
Rahmad Mahendra | Alham Fikri Aji | Samuel Louvan | Fahrurrozi Rahman | Clara Vania
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experiment results show that XLM-R outperforms other pre-trained models in our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research.

A Multi-Pass Sieve Coreference Resolution for Indonesian
Valentina Kania Prameswara Artari | Rahmad Mahendra | Meganingrum Arista Jiwanggi | Adityo Anggraito | Indra Budi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Coreference resolution is an NLP task to determine whether a set of referring expressions belongs to the same concept in discourse. A multi-pass sieve is a deterministic coreference model that implements several layers of sieves, where each sieve takes a pair of correlated mentions from a collection of non-coherent mentions. The multi-pass sieve is based on the principle of high precision, followed by increased recall in each sieve. In this work, we examine the portability of the multi-pass sieve coreference resolution model to the Indonesian language. We conduct the experiment on 201 Wikipedia documents, and the multi-pass sieve system yields a MUC F-measure of 72.74% and a BCUBED F-measure of 52.18%.
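The precision-first ordering described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the mention format, the two sieves (`exact_match`, `head_match`), and the sample mentions are hypothetical, and a full multi-pass sieve typically compares cluster-level attributes rather than bare mention pairs.

```python
from typing import Callable, Dict, List

# Hypothetical mention record: surface text plus its head word.
Mention = Dict[str, str]
Sieve = Callable[[Mention, Mention], bool]


def exact_match(a: Mention, b: Mention) -> bool:
    """High-precision sieve: identical surface strings (case-insensitive)."""
    return a["text"].lower() == b["text"].lower()


def head_match(a: Mention, b: Mention) -> bool:
    """Lower-precision, higher-recall sieve: identical head words."""
    return a["head"].lower() == b["head"].lower()


def resolve(mentions: List[Mention], sieves: List[Sieve]) -> List[List[int]]:
    """Apply sieves in order (most precise first) and merge clusters."""
    # Every mention starts in its own singleton cluster.
    cluster = list(range(len(mentions)))
    for sieve in sieves:
        for j in range(len(mentions)):
            for i in range(j):  # candidate antecedents precede mention j
                if cluster[i] != cluster[j] and sieve(mentions[i], mentions[j]):
                    old, new = cluster[j], cluster[i]
                    # Merge: relabel every member of the old cluster.
                    cluster = [new if c == old else c for c in cluster]
    groups: Dict[int, List[int]] = {}
    for idx, c in enumerate(cluster):
        groups.setdefault(c, []).append(idx)
    return sorted(groups.values())


# Invented sample mentions, for illustration only.
mentions = [
    {"text": "Presiden Joko Widodo", "head": "Widodo"},
    {"text": "Joko Widodo", "head": "Widodo"},
    {"text": "presiden joko widodo", "head": "widodo"},
    {"text": "Jakarta", "head": "Jakarta"},
]
clusters = resolve(mentions, [exact_match, head_match])
```

Running only the first sieve leaves "Joko Widodo" unattached; adding the looser head-match sieve afterwards links it, which is the recall gain each later pass is meant to provide.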


IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
Bryan Wilie | Karissa Vincentio | Genta Indra Winata | Samuel Cahyawijaya | Xiaohong Li | Zhi Yuan Lim | Sidik Soleman | Rahmad Mahendra | Pascale Fung | Syafri Bahar | Ayu Purwarianti
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Although Indonesian is known to be the fourth most frequently used language over the internet, the research progress on this language in natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first vast resource for training, evaluation, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, thus enabling everyone to benchmark their system performances.

The Framework of Multiword Expression in Indonesian Language
Totok Suhardijanto | Rahmad Mahendra | Zahroh Nuriah | Adi Budiwiyanto
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

ISWARA at WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets using BERT and FastText Embeddings
Wava Carissa Putri | Rani Aulia Hidayat | Isnaini Nurul Khasanah | Rahmad Mahendra
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

This paper presents Iswara’s participation in the WNUT-2020 Task 2 “Identification of Informative COVID-19 English Tweets using BERT and FastText Embeddings”, which aims to classify whether a given tweet is informative or not. We propose a method that utilizes word embeddings together with the occurrence of topic-related words. We compare several models to get the best performance. Results show that pairing BERT with word occurrences outperforms fastText, with F1-score, precision, recall, and accuracy on test data of 76%, 81%, 72%, and 79%, respectively.

UI at SemEval-2020 Task 4: Commonsense Validation and Explanation by Exploiting Contradiction
Kerenza Doxolodeo | Rahmad Mahendra
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes our submissions to the ComVe challenge, the SemEval 2020 Task 4. This evaluation task consists of three sub-tasks that test commonsense comprehension by identifying sentences that do not make sense and explaining why they do not. In subtask A, we use RoBERTa to find which sentence does not make sense. In subtask B, besides using BERT, we also experiment with replacing the dataset with MNLI when selecting the best explanation from the provided options for why the given sentence does not make sense. In subtask C, we utilize the MNLI model from subtask B to evaluate the explanations generated by RoBERTa and GPT-2 by exploiting the contradiction between the sentence and its explanation. Our system submission records a performance of 88.2%, 80.5%, and BLEU 5.5 for those three subtasks, respectively.


Normalization of Indonesian-English Code-Mixed Twitter Data
Anab Maulana Barik | Rahmad Mahendra | Mirna Adriani
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Twitter is an excellent source of data for NLP research as it offers a tremendous amount of textual data. However, processing tweets to extract meaningful information is very challenging, for at least two reasons: (i) the use of nonstandard words and informal writing, and (ii) code-mixing, i.e., the combination of multiple languages in a single tweet conversation. Most previous work has addressed these two issues as separate, isolated tasks. In this study, we work on the normalization task in code-mixed Twitter data, more specifically for Indonesian-English. We propose a pipeline that consists of four modules, i.e., tokenization, language identification, lexical normalization, and translation. Another contribution is a gold standard of Indonesian-English code-mixed data for each module.
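The four-module pipeline named in this abstract can be sketched as a chain of small functions. This is a toy illustration, not the authors' implementation: the lexicons, the dictionary-lookup strategy, lowercasing during tokenization, and the English-to-Indonesian translation direction are all assumptions made here, and every name below is hypothetical.

```python
import re
from typing import List

# Toy resources invented for this sketch; a real system would use trained models.
EN_LEXICON = {"meeting", "tired", "u", "you"}
NORMALIZATION = {
    "id": {"gw": "saya", "bgt": "banget", "abis": "habis"},
    "en": {"u": "you"},
}
EN_TO_ID = {"meeting": "rapat", "tired": "lelah", "you": "kamu"}


def tokenize(text: str) -> List[str]:
    """Module 1: split into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def identify_language(token: str) -> str:
    """Module 2: tag a token as English ("en") or Indonesian ("id")."""
    return "en" if token in EN_LEXICON else "id"


def normalize(token: str, lang: str) -> str:
    """Module 3: map a nonstandard token to its standard form, if known."""
    return NORMALIZATION[lang].get(token, token)


def translate(token: str, lang: str) -> str:
    """Module 4: translate English tokens to Indonesian (assumed direction)."""
    return EN_TO_ID.get(token, token) if lang == "en" else token


def run_pipeline(text: str) -> str:
    """Run the four modules in the order given in the abstract."""
    tagged = [(tok, identify_language(tok)) for tok in tokenize(text)]
    normalized = [(normalize(tok, lang), lang) for tok, lang in tagged]
    return " ".join(translate(tok, lang) for tok, lang in normalized)


result = run_pipeline("gw tired bgt abis meeting")
```

Keeping the language tag attached to each token lets the normalization and translation steps pick the right dictionary, which is one reason language identification sits before both of them.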


Semantic Role Labeling in Conversational Chat using Deep Bi-Directional Long Short-Term Memory Networks with Attention Mechanism
Valdi Rachman | Rahmad Mahendra | Alfan Farizki Wicaksono | Ahmad Rizqi Meydiarso | Fariz Ikhwantri
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Cross-Lingual and Supervised Learning Approach for Indonesian Word Sense Disambiguation Task
Rahmad Mahendra | Heninggar Septiantri | Haryo Akbarianto Wibowo | Ruli Manurung | Mirna Adriani
Proceedings of the 9th Global Wordnet Conference

Ambiguity is a problem we frequently face in Natural Language Processing. Word Sense Disambiguation (WSD) is the task of determining the correct sense of an ambiguous word. However, research on WSD for Indonesian is still rare. English-Indonesian parallel corpora and WordNet for both languages can be used to produce training data for WSD by applying a cross-lingual WSD method. This training data is used as input to build a model using supervised machine learning algorithms. Our research also examines the use of word embedding features to build the WSD model.

KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents
Paramita Mirza | Fariz Darari | Rahmad Mahendra
Proceedings of The 12th International Workshop on Semantic Evaluation

We present KOI (Knowledge of Incidents), a system that, given news articles as input, builds a knowledge graph (KOI-KG) of incidental events. KOI-KG can then be used to efficiently answer questions such as “How many killing incidents happened in 2017 that involve Sean?” The required steps in building the KG include: (i) document preprocessing involving word sense disambiguation, named-entity recognition, temporal expression recognition and normalization, and semantic role labeling; (ii) incidental event extraction and coreference resolution via document clustering; and (iii) KG construction and population.

Keyphrases Extraction from User-Generated Contents in Healthcare Domain Using Long Short-Term Memory Networks
Ilham Fathy Saputra | Rahmad Mahendra | Alfan Farizki Wicaksono
Proceedings of the BioNLP 2018 workshop

We propose a keyphrase extraction technique to extract important terms from healthcare user-generated content. We employ a deep learning architecture, i.e., Long Short-Term Memory, and leverage word embeddings, medical concepts from a knowledge base, and linguistic components as our features. The proposed model achieves a 61.37% F-1 score. Experimental results indicate that our proposed approach outperforms the baseline methods, i.e., RAKE and CRF, on the task of extracting keyphrases from Indonesian health forum posts.

Multi-Task Active Learning for Neural Semantic Role Labeling on Low Resource Conversational Corpus
Fariz Ikhwantri | Samuel Louvan | Kemal Kurniawan | Bagas Abisena | Valdi Rachman | Alfan Farizki Wicaksono | Rahmad Mahendra
Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP

Most Semantic Role Labeling (SRL) approaches are supervised methods that require a significant amount of annotated data, and the annotation requires linguistic expertise. In this paper, we propose a Multi-Task Active Learning framework for Semantic Role Labeling with Entity Recognition (ER) as the auxiliary task to alleviate the need for extensive data and use additional information from ER to help SRL. We evaluate our approach on an Indonesian conversational dataset. Our experiments show that multi-task active learning can outperform the single-task active learning method and standard multi-task learning. According to our results, active learning is more efficient, using 12% less training data than passive learning in both single-task and multi-task settings. We also introduce a new dataset for SRL in the Indonesian conversational domain to encourage further research in this area.


A Two-Level Morphological Analyser for the Indonesian Language
Femphy Pisceldo | Rahmad Mahendra | Ruli Manurung | I Wayan Arka
Proceedings of the Australasian Language Technology Association Workshop 2008

Extending an Indonesian Semantic Analysis-based Question Answering System with Linguistic and World Knowledge Axioms
Rahmad Mahendra | Septina Dian Larasati | Ruli Manurung
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation