Quyet V. Do


2024

pdf bib
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia | Rahmad Mahendra | Salsabil Maulana Akbar | Lester James Validad Miranda | Jennifer Santoso | Elyanah Aco | Akhdan Fadhilah | Jonibek Mansurov | Joseph Marvin Imperial | Onno P. Kampman | Joel Ruben Antony Moniz | Muhammad Ravi Shulthan Habibi | Frederikus Hudi | Jann Railey Montalan | Ryan Ignatius Hadiwijaya | Joanito Agili Lopo | William Nixon | Börje F. Karlsson | James Jaya | Ryandito Diandaru | Yuze Gao | Patrick Amadeus Irawan | Bin Wang | Jan Christian Blaise Cruz | Chenxi Whitehouse | Ivan Halim Parmonangan | Maria Khelli | Wenyu Zhang | Lucky Susanto | Reynard Adha Ryanda | Sonny Lazuardi Hermawan | Dan John Velasco | Muhammad Dehan Al Kautsar | Willy Fitra Hendria | Yasmin Moslem | Noah Flynn | Muhammad Farid Adilazuarda | Haochen Li | Johanes Lee | R. Damanhuri | Shuo Sun | Muhammad Reza Qorib | Amirbek Djanibekov | Wei Qi Leong | Quyet V. Do | Niklas Muennighoff | Tanrada Pansuwan | Ilham Firdausi Putra | Yan Xu | Tai Ngee Chia | Ayu Purwarianti | Sebastian Ruder | William Chandra Tjhi | Peerat Limkonchotiwat | Alham Fikri Aji | Sedrick Keh | Genta Indra Winata | Ruochen Zhang | Fajri Koto | Zheng Xin Yong | Samuel Cahyawijaya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in Southeast Asia.

pdf bib
ConstraintChecker: A Plugin for Large Language Models to Reason on Commonsense Knowledge Bases
Quyet V. Do | Tianqing Fang | Shizhe Diao | Zhaowei Wang | Yangqiu Song
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Reasoning over Commonsense Knowledge Bases (CSKB), i.e. CSKB reasoning, has been explored as a way to acquire new commonsense knowledge based on reference knowledge in the original CSKBs and external prior knowledge.Despite the advancement of Large Language Models (LLM) and prompt engineering techniques in various reasoning tasks, they still struggle to deal with CSKB reasoning.One of the problems is that it is hard for them to acquire explicit relational constraints in CSKBs from only in-context exemplars, due to a lack of symbolic reasoning capabilities (CITATION).To this end, we proposed **ConstraintChecker**, a plugin over prompting techniques to provide and check explicit constraints.When considering a new knowledge instance, ConstraintChecker employs a rule-based module to produce a list of constraints, then it uses a zero-shot learning module to check whether this knowledge instance satisfies all constraints.The acquired constraint-checking result is then aggregated with the output of the main prompting technique to produce the final output.Experimental results on CSKB Reasoning benchmarks demonstrate the effectiveness of our method by bringing consistent improvements over all prompting methods.

2023

pdf bib
COLA: Contextualized Commonsense Causal Reasoning from the Causal Inference Perspective
Zhaowei Wang | Quyet V. Do | Hongming Zhang | Jiayao Zhang | Weiqi Wang | Tianqing Fang | Yangqiu Song | Ginny Wong | Simon See
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Detecting commonsense causal relations (causation) between events has long been an essential yet challenging task. Given that events are complicated, an event may have different causes under various contexts. Thus, exploiting context plays an essential role in detecting causal relations. Meanwhile, previous works about commonsense causation only consider two events and ignore their context, simplifying the task formulation. This paper proposes a new task to detect commonsense causation between two events in an event sequence (i.e., context), called contextualized commonsense causal reasoning. We also design a zero-shot framework: COLA (Contextualized Commonsense Causality Reasoner) to solve the task from the causal inference perspective. This framework obtains rich incidental supervision from temporality and balances covariates from multiple timestamps to remove confounding effects. Our extensive experiments show that COLA can detect commonsense causality more accurately than baselines.

pdf bib
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Yejin Bang | Samuel Cahyawijaya | Nayeon Lee | Wenliang Dai | Dan Su | Bryan Wilie | Holy Lovenia | Ziwei Ji | Tiezheng Yu | Willy Chung | Quyet V. Do | Yan Xu | Pascale Fung
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2022

pdf bib
PseudoReasoner: Leveraging Pseudo Labels for Commonsense Knowledge Base Population
Tianqing Fang | Quyet V. Do | Hongming Zhang | Yangqiu Song | Ginny Y. Wong | Simon See
Findings of the Association for Computational Linguistics: EMNLP 2022

Commonsense Knowledge Base (CSKB) Population aims at reasoning over unseen entities and assertions on CSKBs, and is an important yet hard commonsense reasoning task. One challenge is that it requires out-of-domain generalization ability as the source CSKB for training is of a relatively smaller scale (1M) while the whole candidate space for population is way larger (200M). We propose PseudoReasoner, a semi-supervised learning framework for CSKB population that uses a teacher model pre-trained on CSKBs to provide pseudo labels on the unlabeled candidate dataset for a student model to learn from. The teacher can be a generative model rather than restricted to discriminative models as previous works.In addition, we design a new filtering procedure for pseudo labels based on influence function and the student model’s prediction to further improve the performance. The framework can improve the backbone model KG-BERT (RoBERTa-large) by 3.3 points on the overall performance and especially, 5.3 points on the out-of-domain performance, and achieves the state-of-the-art. The codes will be made public on acceptance. Codes and data are available at https://github.com/HKUST-KnowComp/PseudoReasoner.