Kemal Kurniawan

2026

On the Interplay between Human Label Variation and Model Fairness
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
Findings of the Association for Computational Linguistics: EACL 2026

The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness under certain configurations.

2025

pdf bib abs

A Joint Multitask Model for Morpho-Syntactic Parsing
Demian Inostroza Améstica | Meladel Mistica | Ekaterina Vylomova | Chris Guest | Kemal Kurniawan
Proceedings of The UniDive 2025 Shared Task on Multilingual Morpho-Syntactic Parsing

We present a joint multitask model for the Uni-Dive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task’s leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7%, LAS of 80.1%, and Feats F1 of 90.3%. Our ablation studies show that matching the task’s gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom–Acc) and nominal features across languages.

pdf bib abs

Robustness of Neurosymbolic Reasoners on First-Order Logic Problems
Hannah Bansal | Kemal Kurniawan | Lea Frermann
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association

Recent trends in NLP aim to improve reasoning capabilities in Large Language Models (LLMs), with key focus on generalization and robustness to variations in tasks. Counterfactual task variants introduce minimal but semantically meaningful changes to otherwise valid first-order logic (FOL) problem instances altering a single predicate or swapping roles of constants to probe whether a reasoning system can maintain logical consistency under perturbation. Previous studies showed that LLMs becomes brittle on counterfactual variations, suggesting that they often rely on spurious surface patterns to generate responses. In this work, we explore if a neurosymbolic (NS) approach that integrates an LLM and a symbolic logical solver could mitigate this problem. Experiments across LLMs of varying sizes show that NS methods are more robust but perform worse overall that purely neural methods. We then propose NSCoT that combines an NS method and Chain-of-Thought (CoT) prompting and demonstrate that while it improves performance, NSCoT still lags behind standard CoT. Our analysis opens research directions for future work.

2024

pdf bib abs

Generating bilingual example sentences with large language models as lexicography assistants
Raphael Merx | Ekaterina Vylomova | Kemal Kurniawan
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

We present a study of LLMs’ performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLMgenerated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility (Kilgarriff et al., 2008). Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low interannotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for “typicality” and “intelligibility” in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.

pdf bib abs

MoDEM: Mixture of Domain Expert Models
Toby Simonds | Kemal Kurniawan | Jey Han Lau
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

We propose a novel approach to enhancing the performance and efficiency of large language models (LLMs) by combining domain prompt routing with domain-specialized models. We introduce a system that utilizes a BERT-based router to direct incoming prompts to the most appropriate domain expert model. These expert models are specifically tuned for domains such as health, mathematics and science. Our research demonstrates that this approach can significantly outperform general-purpose models of comparable size, leading to a superior performance-to-cost ratio across various benchmarks. The implications of this study suggest a potential shift in LLM development and deployment. Rather than focusing solely on creating increasingly large, general-purpose models, the future of AI may lie in developing ecosystems of smaller, highly specialized models coupled with sophisticated routing systems. This approach could lead to more efficient resource utilization, reduced computational costs, and superior overall performance.

pdf bib abs

To Aggregate or Not to Aggregate. That is the Question: A Case Study on Annotation Subjectivity in Span Prediction
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper explores the task of automatic prediction of text spans in a legal problem description that support a legal area label. We use a corpus of problem descriptions written by laypeople in English that is annotated by practising lawyers. Inherent subjectivity exists in our task because legal area categorisation is a complex task, and lawyers often have different views on a problem. Experiments show that training on majority-voted spans outperforms training on disaggregated ones.

2023

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes sentiment and machine translation datasets, and bilingual lexicons. We provide extensive analyses and describe challenges for creating such resources. We hope this work can spark NLP research on Indonesian and other underrepresented languages.

2022

pdf bib abs

Unsupervised Cross-Lingual Transfer of Structured Predictors without Source Data
Kemal Kurniawan | Lea Frermann | Philip Schulz | Trevor Cohn
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Providing technologies to communities or domains where training data is scarce or protected e.g., for privacy reasons, is becoming increasingly important. To that end, we generalise methods for unsupervised transfer from multiple input models for structured prediction. We show that the means of aggregating over the input models is critical, and that multiplying marginal probabilities of substructures to obtain high-probability structures for distant supervision is substantially better than taking the union of such structures over the input models, as done in prior work. Testing on 18 languages, we demonstrate that the method works in a cross-lingual setting, considering both dependency parsing and part-of-speech structured prediction problems. Our analyses show that the proposed method produces less noisy labels for the distant supervision.

pdf bib abs

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia’s 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.

2021

pdf bib abs

PPT: Parsimonious Parser Transfer for Unsupervised Cross-Lingual Adaptation
Kemal Kurniawan | Lea Frermann | Philip Schulz | Trevor Cohn
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Cross-lingual transfer is a leading technique for parsing low-resource languages in the absence of explicit supervision. Simple ‘direct transfer’ of a learned model based on a multilingual input encoding has provided a strong benchmark. This paper presents a method for unsupervised cross-lingual transfer that improves over direct transfer systems by using their output as implicit supervision as part of self-training on unlabelled text in the target language. The method assumes minimal resources and provides maximal flexibility by (a) accepting any pre-trained arc-factored dependency parser; (b) assuming no access to source language data; (c) supporting both projective and non-projective parsing; and (d) supporting multi-source transfer. With English as the source language, we show significant improvements over state-of-the-art transfer models on both distant and nearby languages, despite our conceptually simpler approach. We provide analyses of the choice of source languages for multi-source transfer, and the advantage of non-projective parsing. Our code is available online.

pdf bib abs

PTST-UoM at SemEval-2021 Task 10: Parsimonious Transfer for Sequence Tagging
Kemal Kurniawan | Lea Frermann | Philip Schulz | Trevor Cohn
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes PTST, a source-free unsupervised domain adaptation technique for sequence tagging, and its application to the SemEval-2021 Task 10 on time expression recognition. PTST is an extension of the cross-lingual parsimonious parser transfer framework, which uses high-probability predictions of the source model as a supervision signal in self-training. We extend the framework to a sequence prediction setting, and demonstrate its applicability to unsupervised domain adaptation. PTST achieves F1 score of 79.6% on the official test set, with the precision of 90.1%, the highest out of 14 submissions.

2018

pdf bib abs

Multi-Task Active Learning for Neural Semantic Role Labeling on Low Resource Conversational Corpus
Fariz Ikhwantri | Samuel Louvan | Kemal Kurniawan | Bagas Abisena | Valdi Rachman | Alfan Farizki Wicaksono | Rahmad Mahendra
Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP

Most Semantic Role Labeling (SRL) approaches are supervised methods which require a significant amount of annotated corpus, and the annotation requires linguistic expertise. In this paper, we propose a Multi-Task Active Learning framework for Semantic Role Labeling with Entity Recognition (ER) as the auxiliary task to alleviate the need for extensive data and use additional information from ER to help SRL. We evaluate our approach on Indonesian conversational dataset. Our experiments show that multi-task active learning can outperform single-task active learning method and standard multi-task learning. According to our results, active learning is more efficient by using 12% less of training data compared to passive learning in both single-task and multi-task setting. We also introduce a new dataset for SRL in Indonesian conversational domain to encourage further research in this area.

pdf bib abs

Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts
Kemal Kurniawan | Samuel Louvan
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

Despite the long history of named-entity recognition (NER) task in the natural language processing community, previous work rarely studied the task on conversational texts. Such texts are challenging because they contain a lot of word variations which increase the number of out-of-vocabulary (OOV) words. The high number of OOV words poses a difficulty for word-based neural models. Meanwhile, there is plenty of evidence to the effectiveness of character-based neural models in mitigating this OOV problem. We report an empirical evaluation of neural sequence labeling models with character embedding to tackle NER task in Indonesian conversational texts. Our experiments show that (1) character models outperform word embedding-only models by up to 4 F1 points, (2) character models perform better in OOV cases with an improvement of as high as 15 F1 points, and (3) character models are robust against a very high OOV rate.

Venues

Kemal Kurniawan

2026

2025

2024

2023

2022

2021

2018

Co-authors

Venues