Vincent Nguyen

2025

CSIRO LT at SemEval-2025 Task 8: Answering Questions over Tabular Data using LLMs
Tomas Turek | Shakila Mahjabin Tonni | Vincent Nguyen | Huichen Yang | Sarvnaz Karimi
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Question Answering over large tables is challenging due to the difficulty of reasoning required in linking information from different parts of a table, such as headings and metadata, to the values in the table and to information needs. We investigate using Large Language Models (LLMs) for tabular reasoning, where, given a pair of a table and a question from the DataBench benchmark, the models generate answers. We experiment with three techniques that enable symbolic reasoning through code execution: a direct code prompting (DCP) approach, ‘DCP_Py’, which uses Python; multi-step code (MSC) prompting, ‘MSC_SQL+FS’, which uses SQL; and ReAct prompting, ‘MSR_Py+FS’, which combines multi-step reasoning (MSR), few-shot (FS) learning and Python tools. We also conduct an analysis exploring the impact of answer types, data size, and multi-column dependencies on LLMs’ answer generation performance, including an assessment of the models’ limitations and the underlying challenges of tabular reasoning in LLMs.

Question Answering in Climate Adaptation for Agriculture: Model Development and Evaluation with Expert Feedback
Vincent Nguyen | Sarvnaz Karimi | Willow Hallgren | Mahesh Prakash
Findings of the Association for Computational Linguistics: ACL 2025

The generative capabilities of the large language models (LLMs) are deployed for domain-specific question answering systems. However, their ability to answer climate adaptation questions remains unclear. In particular, can they be used by agronomists and climate scientists to answer questions on the best climate adaptation strategies? Answering questions in this domain requires knowledge of climate data and its uncertainties, and the ability to link them to the broader climate literature while accommodating the unique constraints of users and experts. We investigate the generative and evaluative capabilities of several state-of-the-art LLMs, open-source and proprietary, on climate adaptation for agriculture questions posed by domain experts using evaluation criteria designed by the experts. We propose an iterative exploration framework that enables LLMs to dynamically aggregate information from heterogeneous sources, such as text from climate literature and structured tabular climate data from climate model projections and historical observations. Our experiments demonstrate that LLMs can aggregate heterogeneous data to (1) answer questions, but at a trade-off between presentation quality and epistemological accuracy; and, (2) evaluate answers, but are not as competent at identifying high-quality answers and erroneous information compared to domain experts.

My Climate CoPilot: A Question Answering System for Climate Adaptation in Agriculture
Vincent Nguyen | Willow Hallgren | Ashley Harkin | Mahesh Prakash | Sarvnaz Karimi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Accurately answering climate science questions requires scientific literature and climate data. Interpreting climate literature and data, however, presents inherent challenges such as determining relevant climate factors and drivers, interpreting uncertainties in the science and data, and dealing with the sheer volume of data. My Climate CoPilot is a platform that assists a range of potential users, such as farmer advisors, to mitigate and adapt to projected climate changes by providing answers to questions that are grounded in evidence. It emphasises transparency, user privacy, low-resource use, and provides automatic evaluation. It also strives for scientific robustness and accountability. Fifty domain experts carefully evaluated every aspect of My Climate CoPilot and based on their interactions and feedback, the system continues to evolve.

2024

My Climate Advisor: An Application of NLP in Climate Adaptation for Agriculture
Vincent Nguyen | Sarvnaz Karimi | Willow Hallgren | Ashley Harkin | Mahesh Prakash
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)

Climate adaptation in the agricultural sector necessitates tools that equip farmers and farm advisors with relevant and trustworthy information to help increase their resiliency to climate change. We introduce My Climate Advisor, a question-answering (QA) prototype that synthesises information from different data sources, such as peer-reviewed scientific literature and high-quality, industry-relevant grey literature to generate answers, with references, to a given user’s question. Our prototype uses open-source generative models for data privacy and intellectual property protection, and retrieval augmented generation for answer generation, grounding and provenance. While there are standard evaluation metrics for QA systems, no existing evaluation framework suits our LLM-based QA application in the climate adaptation domain. We design an evaluation framework with seven metrics based on the requirements of the domain experts to judge the generated answers from 12 different LLM-based models. Our initial evaluations through a user study via domain experts show promising usability results.

Finding evidence for claims from content presented in experimental results of scientific articles is difficult. The evidence is often presented in the form of tables and figures, and correctly matching it to scientific claims presents automation challenges. The Context24 shared task is launched to support the development of systems able to verify claims by extracting supporting evidence from articles. We explore different facets of this shared task modelled as a search problem and as an information extraction task. We experiment with a range of methods in each of these categories for the two sub-tasks of evidence identification and grounding context identification in the Context24 shared task.

Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation
Hashem Hijazi | Diego Molla | Vincent Nguyen | Sarvnaz Karimi
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Biomedical question-answering systems remain popular for biomedical experts interacting with the literature to answer their medical questions. However, these systems are difficult to evaluate in the absence of costly human experts. Therefore, automatic evaluation metrics are often used in this space. Traditional automatic metrics such as ROUGE or BLEU, which rely on token overlap, have shown a low correlation with humans. We present a study that uses large language models (LLMs) to automatically evaluate systems from an international challenge on biomedical semantic indexing and question answering, called BioASQ. We measure the agreement of LLM-produced scores against human judgements. We show that LLMs correlate similarly to lexical methods when using basic prompting techniques. However, by aggregating evaluators with LLMs or by fine-tuning, we find that our methods outperform the baselines by a large margin, achieving a Spearman correlation of 0.501 and 0.511, respectively.

Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels
Jiyu Chen | Vincent Nguyen | Xiang Dai | Diego Molla-Aliod | Cecile Paris | Sarvnaz Karimi
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

Monitoring and predicting the expression of suicidal risk in individuals’ social media posts is a central focus in clinical NLP. Yet, existing approaches frequently lack a crucial explainability component necessary for extracting evidence related to an individual’s mental health state. We describe the CSIRO Data61 team’s evidence extraction system submitted to the CLPsych 2024 shared task. The task aims to investigate the zero-shot capabilities of open-source LLMs in extracting evidence regarding an individual’s assigned suicide risk level from social media discourse. The results are assessed against ground truth evidence annotated by psychological experts, with an achieved recall-oriented BERTScore of 0.919. Our findings suggest that LLMs showcase strong feasibility in the extraction of information supporting the evaluation of suicidal risk in social media discourse. Opportunities for refinement exist, notably in crafting concise and effective instructions to guide the extraction process.

2023

MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2021

Cross-Domain Language Modeling: An Empirical Investigation
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

Transformer encoder models exhibit strong performance in single-domain applications. However, in a cross-domain situation, using a sub-word vocabulary model results in sub-word overlap. This is an issue when there is an overlap between sub-words that share no semantic similarity between domains. We hypothesize that alleviating this overlap allows for a more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size in a Transformer encoder model while pretraining with multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain from a reduction in sub-word overlap.

Combining Shallow and Deep Representations for Text-Pair Classification
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

Text-pair classification is the task of determining the class relationship between two sentences. It is embedded in several tasks such as paraphrase identification and duplicate question detection. Contemporary methods use fine-tuned transformer encoder semantic representations of the classification token in the text-pair sequence from the transformer’s final layer for class prediction. However, research has shown that earlier parts of the network learn shallow features, such as syntax and structure, which existing methods do not directly exploit. We propose a novel convolution-based decoder for transformer-based architecture that maximizes the use of encoder hidden features for text-pair classification. Our model exploits hidden representations within transformer-based architecture. It outperforms a transformer encoder baseline on average by 50% (relative F1-score) on six datasets from the medical, software engineering, and open-domains. Our work shows that transformer-based models can improve text-pair classification by modifying the fine-tuning step to exploit shallow features while improving model generalization, with only a slight reduction in efficiency.

2020

The OpenNMT Neural Machine Translation Toolkit: 2020 Edition
Guillaume Klein | François Hernandez | Vincent Nguyen | Jean Senellart
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Pandemic Literature Search: Finding Information on COVID-19
Vincent Nguyen | Maciek Rybinski | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association

Finding information related to a pandemic of a novel disease raises new challenges for information seeking and retrieval, as the new information becomes available gradually. We investigate how to better rank information for pandemic information retrieval. We experiment with different ranking algorithms and propose a novel end-to-end method for neural retrieval, and demonstrate its effectiveness on the TREC COVID search. This work could lead to a search system that aids scientists, clinicians, policymakers and others in finding reliable answers from the scientific literature.

Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation
Paul Tardy | David Janiszek | Yannick Estève | Vincent Nguyen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Summarizing texts is not a straightforward task. Before even considering text summarization, one should determine what kind of summary is expected. How much should the information be compressed? Is it relevant to reformulate or should the summary stick to the original phrasing? State-of-the-art on automatic text summarization mostly revolves around news articles. We suggest that considering a wider variety of tasks would lead to an improvement in the field, in terms of generalization and robustness. We explore meeting summarization: generating reports from automatic transcriptions. Our work consists in segmenting and aligning transcriptions with respect to reports, to get a suitable dataset for neural summarization. Using a bootstrapping approach, we provide pre-alignments that are corrected by human annotators, making a validation set against which we evaluate automatic models. This consistently reduces annotators’ efforts by providing iteratively better pre-alignment and maximizes the corpus size by using annotations from our automatic alignment models. Evaluation is conducted on publicmeetings, a novel corpus of aligned public meetings. We report automatic alignment and summarization performances on this corpus and show that automatic alignment is relevant for data annotation since it leads to large improvement of almost +4 on all ROUGE scores on the summarization task.

The Ubiqus English-Inuktitut System for WMT20
François Hernandez | Vincent Nguyen
Proceedings of the Fifth Conference on Machine Translation

This paper describes Ubiqus’ submission to the WMT20 English-Inuktitut shared news translation task. Our main system, and only submission, is based on a multilingual approach, jointly training a Transformer model on several agglutinative languages. The English-Inuktitut translation task is challenging at every step, from data selection, preparation and tokenization to quality evaluation down the line. Difficulties emerge both because of the peculiarities of the Inuktitut language as well as the low-resource context.

2019

Question Answering in the Biomedical Domain
Vincent Nguyen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Question answering techniques have mainly been investigated in open domains. However, there are particular challenges in extending these open-domain techniques to the biomedical domain. Question answering focusing on patients is less studied. We find that there are some challenges in patient question answering such as limited annotated data, lexical gap and quality of answer spans. We aim to address some of these gaps by extending and developing upon the literature to design a question answering system that can decide on the most appropriate answers for patients attempting to self-diagnose while including the ability to abstain from answering when confidence is low.

ANU-CSIRO at MEDIQA 2019: Question Answering Using Deep Contextual Knowledge
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 18th BioNLP Workshop and Shared Task

We report on our system for textual inference and question entailment in the medical domain for the ACL BioNLP 2019 Shared Task, MEDIQA. Textual inference is the task of finding the semantic relationships between pairs of text. Question entailment involves identifying pairs of questions which have similar semantic content. To improve upon medical natural language inference and question entailment approaches to further medical question answering, we propose a system that incorporates open-domain and biomedical domain approaches to improve semantic understanding and ambiguity resolution. Our models achieve 80% accuracy on medical natural language inference (6.5% absolute improvement over the original baseline), 48.9% accuracy on recognising medical question entailment, 0.248 Spearman’s rho for question answering ranking and 68.6% accuracy for question answering classification.

Investigating the Effect of Lexical Segmentation in Transformer-based Models on Medical Datasets
Vincent Nguyen | Sarvnaz Karimi | Zhenchang Xing
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association