Tong Wu

Unverified author pages with similar names: Tong Wu

2026

NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Tong Wu | Thanet Markchom | Huizhi(elly) Liang
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1–5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.

pdf bib abs

NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Tong Wu | Nicolay Rusnachenko | Huizhi(elly) Liang
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence–arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, using dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language–domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.

2025

pdf bib abs

UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
Thanet Markchom | Tong Wu | Liting Huang | Huizhi Liang
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance.Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning.

pdf bib abs

NCL-UoR at SemEval-2025 Task 3: Detecting Multilingual Hallucination and Related Observable Overgeneration Text Spans with Modified RefChecker and Modified SelfCheckGPT
Jiaying Hong | Thanet Markchom | Jianfei Xu | Tong Wu | Huizhi Liang
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

SemEval-2025 Task 3 (Mu-SHROOM) focuses on detecting hallucinations in content generated by various large language models (LLMs) across multiple languages. This task involves not only identifying the presence of hallucinations but also pinpointing their specific occurrences. To tackle this challenge, this study introduces two methods: Modified-RefChecker (MRC) and Modified-SelfCheckGPT-H (MSCGH). MRC integrates prompt-based factual verification into References, structuring them as claim-based tests rather than single external knowledge sources. MSCGH incorporates external knowledge to overcome its reliance on internal knowledge. In addition, both methods’ original prompt designs are enhanced to identify hallucinated words within LLM-generated texts. Experimental results demonstrate the effectiveness of the approach, achieving a high ranking on the test dataset in detecting hallucinations across various languages, with an average IoU of 0.5310 and an average COR of 0.5669. The source code used in this paper is available at https://github.com/jianfeixu95/NCL-UoR.

Co-authors

Nicolay Rusnachenko 1

Jianfei Xu 1

Venues

SemEval4
WS4

Fix author