Holy Lovenia


2024

pdf bib
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia | Rahmad Mahendra | Salsabil Maulana Akbar | Lester James Validad Miranda | Jennifer Santoso | Elyanah Aco | Akhdan Fadhilah | Jonibek Mansurov | Joseph Marvin Imperial | Onno P. Kampman | Joel Ruben Antony Moniz | Muhammad Ravi Shulthan Habibi | Frederikus Hudi | Jann Railey Montalan | Ryan Ignatius Hadiwijaya | Joanito Agili Lopo | William Nixon | Börje F. Karlsson | James Jaya | Ryandito Diandaru | Yuze Gao | Patrick Amadeus Irawan | Bin Wang | Jan Christian Blaise Cruz | Chenxi Whitehouse | Ivan Halim Parmonangan | Maria Khelli | Wenyu Zhang | Lucky Susanto | Reynard Adha Ryanda | Sonny Lazuardi Hermawan | Dan John Velasco | Muhammad Dehan Al Kautsar | Willy Fitra Hendria | Yasmin Moslem | Noah Flynn | Muhammad Farid Adilazuarda | Haochen Li | Johanes Lee | R. Damanhuri | Shuo Sun | Muhammad Reza Qorib | Amirbek Djanibekov | Wei Qi Leong | Quyet V. Do | Niklas Muennighoff | Tanrada Pansuwan | Ilham Firdausi Putra | Yan Xu | Tai Ngee Chia | Ayu Purwarianti | Sebastian Ruder | William Chandra Tjhi | Peerat Limkonchotiwat | Alham Fikri Aji | Sedrick Keh | Genta Indra Winata | Ruochen Zhang | Fajri Koto | Zheng Xin Yong | Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.

pdf bib
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia | Wenliang Dai | Samuel Cahyawijaya | Ziwei Ji | Pascale Fung
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Object hallucination poses a significant challenge in vision-language (VL) models, often leading to the generation of nonsensical or unfaithful responses with non-existent objects. However, the absence of a general measurement for evaluating object hallucination in VL models has hindered our understanding and ability to mitigate this issue. In this work, we present NOPE (Negative Object Presence Evaluation), a novel benchmark designed to assess object hallucination in VL models through visual question answering (VQA). We propose a cost-effective and scalable approach utilizing large language models to generate 29.5k synthetic negative pronoun (NegP) data of high quality for NOPE. We extensively investigate the performance of 10 state-of-the-art VL models in discerning the non-existence of objects in visual questions, where the ground truth answers are denoted as (e.g., “none”). Additionally, we evaluate their standard performance on visual questions on 9 other VQA datasets. Through our experiments, we demonstrate that no VL model is immune to the vulnerability of object hallucination, as all models achieve accuracy below 10% on NegP. Furthermore, we uncover that lexically diverse visual questions, question types with large scopes, and scene-relevant objects capitalize the risk of object hallucination in VL models.

pdf bib
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya | Holy Lovenia | Pascale Fung
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages.Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages.Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages.

pdf bib
Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Samuel Cahyawijaya | Holy Lovenia | Fajri Koto | Rifki Putri | Wawan Cenggoro | Jhonson Lee | Salsabil Akbar | Emmanuel Dave | Nuurshadieq Nuurshadieq | Muhammad Mahendra | Rr Putri | Bryan Wilie | Genta Winata | Alham Aji | Ayu Purwarianti | Pascale Fung
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) show remarkable human-like capability in various domains and languages. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol’s effectiveness across a diverse array of tasks, attaining ~20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tunings, such as LoRA, for language adaptation. Alternatively, we propose the usage of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and showcase that safety in pre-training in one language such as English is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.

2023

pdf bib
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya | Holy Lovenia | Alham Fikri Aji | Genta Winata | Bryan Wilie | Fajri Koto | Rahmad Mahendra | Christian Wibisono | Ade Romadhony | Karissa Vincentio | Jennifer Santoso | David Moeljadi | Cahya Wirawan | Frederikus Hudi | Muhammad Satrio Wicaksono | Ivan Parmonangan | Ika Alfina | Ilham Firdausi Putra | Samsul Rahmadani | Yulianti Oenang | Ali Septiandri | James Jaya | Kaustubh Dhole | Arie Suryani | Rifki Afina Putri | Dan Su | Keith Stevens | Made Nindyatama Nityasya | Muhammad Adilazuarda | Ryan Hadiwijaya | Ryandito Diandaru | Tiezheng Yu | Vito Ghifari | Wenliang Dai | Yan Xu | Dyah Damapuspita | Haryo Wibowo | Cuk Tho | Ichwanul Karo Karo | Tirana Fatyanosa | Ziwei Ji | Graham Neubig | Timothy Baldwin | Sebastian Ruder | Pascale Fung | Herry Sujaini | Sakriani Sakti | Ayu Purwarianti
Findings of the Association for Computational Linguistics: ACL 2023

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments.NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

pdf bib
Contrastive Learning for Inference in Dialogue
Etsuko Ishii | Yan Xu | Bryan Wilie | Ziwei Ji | Holy Lovenia | Willy Chung | Pascale Fung
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Inference, especially those derived from inductive processes, is a crucial component in our conversation to complement the information implicitly or explicitly conveyed by a speaker. While recent large language models show remarkable advances in inference tasks, their performance in inductive reasoning, where not all information is present in the context, is far behind deductive reasoning. In this paper, we analyze the behavior of the models based on the task difficulty defined by the semantic information gap – which distinguishes inductive and deductive reasoning. Our analysis reveals that the information gap between dialogue contexts and desired inferences renders the inductive inference process more challenging. To mitigate this information gap, we investigate a contrastive learning approach by feeding negative samples. Our experiments suggest negative samples help models understand what is wrong and improve their inference generations.

pdf bib
Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue
Holy Lovenia | Samuel Cahyawijaya | Pascale Fung
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

The demand for multimodal dialogue systems has been rising in various domains, emphasizing the importance of interpreting multimodal inputs from conversational and situational contexts. One main challenge in multimodal dialogue understanding is multimodal object identification, which constitutes the ability to identify objects relevant to a multimodal user-system conversation. We explore three methods to tackle this problem and evaluate them on the largest situated dialogue dataset, SIMMC 2.1. Our best method, scene-dialogue alignment, improves the performance by ~20% F1-score compared to the SIMMC 2.1 baselines. We provide analysis and discussion regarding the limitation of our methods and the potential directions for future works.

pdf bib
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Yejin Bang | Samuel Cahyawijaya | Nayeon Lee | Wenliang Dai | Dan Su | Bryan Wilie | Holy Lovenia | Ziwei Ji | Tiezheng Yu | Willy Chung | Quyet V. Do | Yan Xu | Pascale Fung
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Samuel Cahyawijaya | Holy Lovenia | Fajri Koto | Dea Adhista | Emmanuel Dave | Sarah Oktavianti | Salsabil Akbar | Jhonson Lee | Nuur Shadieq | Tjeng Wawan Cenggoro | Hanung Linuwih | Bryan Wilie | Galih Muridan | Genta Winata | David Moeljadi | Alham Fikri Aji | Ayu Purwarianti | Pascale Fung
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems
Bryan Wilie | Yan Xu | Willy Chung | Samuel Cahyawijaya | Holy Lovenia | Pascale Fung
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning
Samuel Cahyawijaya | Holy Lovenia | Tiezheng Yu | Willy Chung | Pascale Fung
Proceedings of the First Workshop in South East Asian Language Processing

pdf bib
InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems
Willy Chung | Samuel Cahyawijaya | Bryan Wilie | Holy Lovenia | Pascale Fung
Proceedings of the Second Workshop on Natural Language Interfaces

pdf bib
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Zheng Xin Yong | Ruochen Zhang | Jessica Forde | Skyler Wang | Arjun Subramonian | Holy Lovenia | Samuel Cahyawijaya | Genta Winata | Lintang Sutawika | Jan Christian Blaise Cruz | Yin Lin Tan | Long Phan | Long Phan | Rowena Garcia | Thamar Solorio | Alham Fikri Aji
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its per-formance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.

2022

pdf bib
Clozer”:” Adaptable Data Augmentation for Cloze-style Reading Comprehension
Holy Lovenia | Bryan Wilie | Willy Chung | Zeng Min | Samuel Cahyawijaya | Dan Su | Pascale Fung
Proceedings of the 7th Workshop on Representation Learning for NLP

Task-adaptive pre-training (TAPT) alleviates the lack of labelled data and provides performance lift by adapting unlabelled data to downstream task. Unfortunately, existing adaptations mainly involve deterministic rules that cannot generalize well. Here, we propose Clozer, a sequence-tagging based cloze answer extraction method used in TAPT that is extendable for adaptation on any cloze-style machine reading comprehension (MRC) downstream tasks. We experiment on multiple-choice cloze-style MRC tasks, and show that Clozer performs significantly better compared to the oracle and state-of-the-art in escalating TAPT effectiveness in lifting model performance, and prove that Clozer is able to recognize the gold answers independently of any heuristics.

pdf bib
Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset
Tiezheng Yu | Rita Frieske | Peng Xu | Samuel Cahyawijaya | Cheuk Tung Yiu | Holy Lovenia | Wenliang Dai | Elham J. Barezi | Qifeng Chen | Xiaojuan Ma | Bertram Shi | Pascale Fung
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.

pdf bib
CI-AVSR: A Cantonese Audio-Visual Speech Datasetfor In-car Command Recognition
Wenliang Dai | Samuel Cahyawijaya | Tiezheng Yu | Elham J. Barezi | Peng Xu | Cheuk Tung Yiu | Rita Frieske | Holy Lovenia | Genta Winata | Qifeng Chen | Xiaojuan Ma | Bertram Shi | Pascale Fung
Proceedings of the Thirteenth Language Resources and Evaluation Conference

With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model can achieve a considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior and remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.

pdf bib
ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation
Holy Lovenia | Samuel Cahyawijaya | Genta Winata | Peng Xu | Yan Xu | Zihan Liu | Rita Frieske | Tiezheng Yu | Wenliang Dai | Elham J. Barezi | Qifeng Chen | Xiaojuan Ma | Bertram Shi | Pascale Fung
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND’s design and procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech, collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.

pdf bib
Every picture tells a story: Image-grounded controllable stylistic story generation
Holy Lovenia | Bryan Wilie | Romain Barraud | Samuel Cahyawijaya | Willy Chung | Pascale Fung
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Generating a short story out of an image is arduous. Unlike image captioning, story generation from an image poses multiple challenges: preserving the story coherence, appropriately assessing the quality of the story, steering the generated story into a certain style, and addressing the scarcity of image-story pair reference datasets limiting supervision during training. In this work, we introduce Plug-and-Play Story Teller (PPST) and improve image-to-story generation by: 1) alleviating the data scarcity problem by incorporating large pre-trained models, namely CLIP and GPT-2, to facilitate a fluent image-to-text generation with minimal supervision, and 2) enabling a more style-relevant generation by incorporating stylistic adapters to control the story generation. We conduct image-to-story generation experiments with non-styled, romance-styled, and action-styled PPST approaches and compare our generated stories with those of previous work over three aspects, i.e., story coherence, image-story relevance, and style fitness, using both automatic and human evaluation. The results show that PPST improves story coherence and has better image-story relevance, but has yet to be adequately stylistic.

pdf bib
How Long Is Enough? Exploring the Optimal Intervals of Long-Range Clinical Note Language Modeling
Samuel Cahyawijaya | Bryan Wilie | Holy Lovenia | Huan Zhong | MingQian Zhong | Yuk-Yu Nancy Ip | Pascale Fung
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

Large pre-trained language models (LMs) have been widely adopted in biomedical and clinical domains, introducing many powerful LMs such as bio-lm and BioELECTRA. However, the applicability of these methods to real clinical use cases is hindered, due to the limitation of pre-trained LMs in processing long textual data with thousands of words, which is a common length for a clinical note. In this work, we explore long-range adaptation from such LMs with Longformer, allowing the LMs to capture longer clinical notes context. We conduct experiments on three n2c2 challenges datasets and a longitudinal clinical dataset from Hong Kong Hospital Authority electronic health record (EHR) system to show the effectiveness and generalizability of this concept, achieving ~10% F1-score improvement. Based on our experiments, we conclude that capturing a longer clinical note interval is beneficial to the model performance, but there are different cut-off intervals to achieve the optimal performance for different target variables.
Search
Co-authors