Dang Van Thin



2025

This paper presents our submission to the CRAC 2025 Shared Task on Multilingual Coreference Resolution in the LLM track. We propose a prompt-based few-shot coreference resolution system where the final inference is performed by Grok-3 using in-context learning. The core of our methodology is a difficulty-aware sample selection pipeline that leverages Gemini Flash 2.0 to compute semantic difficulty metrics, including mention dissimilarity and pronoun ambiguity. By identifying and selecting the most challenging training samples for each language, we construct highly informative prompts to guide Grok-3 in predicting coreference chains and reconstructing zero anaphora. Our approach secured 3rd place in the CRAC 2025 shared task.
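As an illustration of the difficulty-aware selection step, the sketch below scores training samples by combining the two metrics named above and keeps the hardest ones as in-context demonstrations. The field names, weights, and prompt template are assumptions for illustration; in the actual system the metrics come from Gemini Flash 2.0 and inference is run with Grok-3.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    chains: list                    # gold coreference chains
    mention_dissimilarity: float    # e.g., 1 - cosine similarity between mention embeddings
    pronoun_ambiguity: float        # e.g., normalized count of plausible antecedents

def difficulty(sample: Sample, w_dis: float = 0.5, w_amb: float = 0.5) -> float:
    # Combine the two semantic difficulty metrics into a single score.
    return w_dis * sample.mention_dissimilarity + w_amb * sample.pronoun_ambiguity

def select_hardest(samples: list[Sample], k: int = 5) -> list[Sample]:
    # Keep the k most challenging training samples for this language.
    return sorted(samples, key=difficulty, reverse=True)[:k]

def build_prompt(demos: list[Sample], query_text: str) -> str:
    # Assemble an in-context-learning prompt from the hard demonstrations.
    parts = ["Resolve all coreference chains, including zero anaphora."]
    for d in demos:
        parts.append(f"Text: {d.text}\nChains: {d.chains}")
    parts.append(f"Text: {query_text}\nChains:")
    return "\n\n".join(parts)
```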
Repairing and maintaining car parts are crucial tasks in the automotive industry, requiring a mechanic to have all relevant technical documents available. However, retrieving the right documents from a large database heavily depends on domain expertise and is time-consuming and error-prone. By labeling available documents according to the components they relate to, concise and accurate information can be retrieved efficiently. However, this is a challenging task, as the relevance of a document to a particular component strongly depends on the context and the expertise of the domain specialist. Moreover, component terminology varies widely between manufacturers. We address these challenges by utilizing Large Language Models (LLMs) to enrich and unify a component database via web mining, extracting relevant keywords, and leveraging hybrid search and LLM-based re-ranking to select the most relevant component for a document. We systematically evaluate our method with various LLMs on an expert-annotated dataset and demonstrate that it outperforms baselines that rely solely on LLM prompting.
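To make the retrieval pipeline concrete, here is a minimal sketch of hybrid search followed by LLM re-ranking. Reciprocal rank fusion and the `call_llm` helper are assumptions for illustration; the abstract does not specify the fusion method or the API used.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whatever chat-completion API the system uses.
    raise NotImplementedError

def rrf_fuse(lexical_ranking: list[str], dense_ranking: list[str], k: int = 60) -> list[str]:
    # Fuse a keyword-based and an embedding-based ranking of component IDs
    # with reciprocal rank fusion (RRF).
    scores: dict[str, float] = {}
    for ranking in (lexical_ranking, dense_ranking):
        for rank, comp_id in enumerate(ranking):
            scores[comp_id] = scores.get(comp_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def rerank_with_llm(document: str, candidates: list[str], top_n: int = 10) -> str:
    # Ask the LLM to pick the single most relevant component among the fused top-n.
    prompt = (
        "Given the technical document below, return the single most relevant "
        f"component ID from this list: {candidates[:top_n]}\n\n{document}"
    )
    return call_llm(prompt)
```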
We present the first benchmark for implicit sentiment analysis (ISA) in Vietnamese, aimed at evaluating large language models (LLMs) on their ability to interpret implicit sentiment, together with ViISA, a dataset specifically constructed for this task. We assess a variety of open-source and closed-source LLMs using state-of-the-art (SOTA) prompting techniques. While LLMs achieve strong recall, they often misclassify implicit cues such as sarcasm and exaggeration, resulting in low precision. Through detailed error analysis, we highlight key challenges and suggest improving Chain-of-Thought prompting via more contextually aligned demonstrations.
Annotator-provided information during labeling can reflect differences in how texts are understood and interpreted, though such variation may also arise from inconsistencies or errors. To make use of this information, we build a BERT-based model that integrates annotator perspectives and evaluate it on four datasets from the third edition of the Learning With Disagreements (LeWiDi) shared task. For each original data point, we create a new (text, annotator) pair, optionally modifying the text to reflect the annotator’s perspective when additional information is available. The text and annotator features are embedded separately and concatenated before classification, enabling the model to capture individual interpretations of the same input. Our model achieves first place on both tasks for the Par and VariErrNLI datasets. More broadly, it performs very well on datasets where annotators provide rich information and the number of annotators is relatively small, while still maintaining competitive results on datasets with limited annotator information and a larger annotator pool.
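The architecture lends itself to a compact PyTorch sketch: the text is encoded with BERT, the annotator ID with a learned embedding table, and the two vectors are concatenated before classification. The encoder name and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class AnnotatorAwareClassifier(nn.Module):
    def __init__(self, num_annotators: int, num_labels: int,
                 encoder_name: str = "bert-base-uncased", ann_dim: int = 64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.annotator_emb = nn.Embedding(num_annotators, ann_dim)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden + ann_dim, num_labels)

    def forward(self, input_ids, attention_mask, annotator_ids):
        # Use the [CLS] token representation as the text embedding.
        text_vec = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        ann_vec = self.annotator_emb(annotator_ids)
        # Concatenate text and annotator features before classification.
        return self.classifier(torch.cat([text_vec, ann_vec], dim=-1))
```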
This paper describes the system of the son robok4 team for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. Given a question and a dataset ID, the task requires producing an answer derived solely from the provided table. We address this task by using large language models (LLMs) to translate natural language questions into executable Python code for querying Pandas DataFrames. Furthermore, we employ techniques such as a rerun mechanism for error handling, structured metadata extraction, and dataset preprocessing to enhance performance. Our best-performing system achieved 89.46% accuracy on Subtask 1, placing in the top 4 on the private test set, and 85.25% accuracy on Subtask 2, placing in the top 9. We mainly focus on Subtask 1; we analyze the effectiveness of different LLMs for structured data reasoning and discuss key challenges in tabular question answering.
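A minimal sketch of the generate-execute-rerun loop is shown below. The prompt template, the `call_llm` wrapper, and the retry budget are illustrative assumptions; only the overall mechanism (schema injection, execution, error feedback) follows the description above.

```python
import pandas as pd

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around the chosen LLM; returns Python code as text.
    raise NotImplementedError

def answer_question(df: pd.DataFrame, question: str, max_retries: int = 3):
    # Structured metadata (column names and dtypes) is injected into the prompt.
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    prompt = (f"Columns and dtypes: {schema}\n"
              f"Write Python code that sets `result` to the answer for: {question}")
    error = ""
    for _ in range(max_retries):
        code = call_llm(prompt + (f"\nPrevious attempt failed with: {error}" if error else ""))
        scope = {"df": df, "pd": pd}
        try:
            exec(code, scope)        # run the generated snippet
            return scope["result"]
        except Exception as e:       # rerun mechanism: feed the error back to the LLM
            error = repr(e)
    return None
```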
This paper describes our ABCD team's system for SemEval-2025 Task 9 (ACL 2025): The Food Hazard Detection Challenge, which covers both Task 1, text classification for food hazard prediction (predicting the type of hazard and product), and Task 2, food hazard and product “vector” detection (predicting the exact hazard and product). Specifically, given a food incident report, our system must automatically detect which category of hazard and product the report belongs to; in Task 2, it must further classify the report into the exact name of the food hazard and product. To tackle Task 1, we implement and investigate various solutions: (1) experimenting with a large battery of BERT-based models, (2) utilizing generation-based models, and (3) taking advantage of a custom ensemble learning method. To address Task 2, we additionally make use of data augmentation techniques such as synonym replacement and back-translation. To sharpen the input context, we also removed special characters that obscured the text. Our best official results on Task 1 and Task 2 are 0.786 and 0.458 in terms of F1-score, respectively; our team ranked 8th in Task 1 and 10th in Task 2.
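As an example of the augmentation step for Task 2, the sketch below performs synonym replacement; the sampling rate and the use of NLTK's WordNet are assumptions, since the abstract does not name a synonym source.

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand

def synonym_replace(text: str, p: float = 0.15) -> str:
    # Replace each word with a random WordNet synonym with probability p.
    out = []
    for w in text.split():
        syns = {l.name().replace("_", " ")
                for s in wordnet.synsets(w)
                for l in s.lemmas()
                if l.name().lower() != w.lower()}
        out.append(random.choice(sorted(syns)) if syns and random.random() < p else w)
    return " ".join(out)
```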
In this paper, we describe the official system of the Firefly team for the two main subtasks of SemEval-2025 Task 8: Question-Answering over Tabular Data. Our solution employs large language models (LLMs) to translate natural language queries into executable code, specifically Python and SQL, which is then used to generate answers categorized into five predefined types. Our empirical evaluation highlights the superiority of Python code generation over SQL for this challenge. The experimental results show that our system achieved competitive performance in both subtasks: we ranked in the Top 9 in Subtask I (DataBench QA), which covers datasets of any size, and 5th in Subtask II (DataBench QA Lite), where datasets are restricted to a maximum of 20 rows.
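For the SQL path, one possible setup is to execute the generated query directly against the in-memory DataFrame, for example with DuckDB; the choice of engine and the hard-coded query below are assumptions, since the abstract does not name one.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"country": ["VN", "DE", "VN"], "sales": [3, 5, 7]})
# In the real system this string would be produced by the LLM.
sql = "SELECT country, SUM(sales) AS total FROM df GROUP BY country"
print(duckdb.sql(sql).df())  # DuckDB resolves `df` from the local Python scope
```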
This paper presents our approach for SemEval-2025 Task 11, where we focus on multi-label emotion detection in Russian text (Track A). We preprocess the data by handling special characters, punctuation, and emotive expressions to improve feature-label relationships. To select the best-performing model, we fine-tune various pre-trained language models specialized in Russian and evaluate them using k-fold cross-validation. Our results indicate that ruRoberta-large achieved the best macro F1-score among the tested models. Finally, our system achieved fifth place in the unofficial competition ranking.
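The model-selection loop can be sketched as follows; the `fine_tune` helper, the value of k, and the use of scikit-learn's KFold are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def cross_validate(texts, labels, fine_tune, k: int = 5) -> float:
    # `fine_tune(train_idx)` is a hypothetical helper that fine-tunes one
    # candidate model and returns an object with .predict(texts) -> label matrix.
    # `labels` is a binary indicator matrix (multi-label setting).
    texts, labels = np.asarray(texts, dtype=object), np.asarray(labels)
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(texts):
        model = fine_tune(train_idx)
        preds = model.predict(list(texts[val_idx]))
        scores.append(f1_score(labels[val_idx], preds, average="macro"))
    return float(np.mean(scores))  # mean macro F1 across folds
```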

2024

This paper describes the system of the NRK team for Track A in SemEval-2024 Task 1: Semantic Textual Relatedness (STR). We focus on exploring the performance of ensemble architectures based on the voting technique and different pre-trained transformer-based language models, including multilingual and monolingual BERTology models. The experimental results show that our system achieved competitive performance in several languages in Track A (Supervised), where our submissions rank in the Top 3 for Algerian Arabic and Top 4 for Amharic. Our source code is released on GitHub.

2023

This paper describes the system of the ABCD team for the three main tasks in SemEval-2023 Task 12: AfriSenti-SemEval, sentiment analysis for low-resource African languages using a Twitter dataset. We focus on exploring the performance of ensemble architectures based on the soft voting technique and different pre-trained transformer-based language models. The experimental results show that our system achieved competitive performance in several tracks of Task A (monolingual sentiment analysis), where we rank Top 3, Top 2, and Top 4 for the Hausa, Igbo, and Moroccan Arabic languages. Besides, our model achieved competitive results, ranking 14th in the Task B (multilingual) setting, and 14th and 8th in Track 17 and Track 18 of the Task C (zero-shot) setting.
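For reference, soft voting over a set of transformer classifiers reduces to averaging their per-class probabilities, as in this small sketch (the optional per-model weights are an assumption):

```python
import numpy as np

def soft_vote(prob_matrices: list[np.ndarray], weights=None) -> np.ndarray:
    # Each matrix has shape (n_samples, n_classes); the models' probabilities
    # are averaged and the highest-scoring class is returned per sample.
    stacked = np.stack(prob_matrices)  # (n_models, n_samples, n_classes)
    return np.average(stacked, axis=0, weights=weights).argmax(axis=1)
```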