Jian Su

2026

Can LLM Safety Be Ensured by Constraining Parameter Regions?
Zongmin Li | Jian Su | Farah Benamara | Aixin Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) are often assumed to contain "safety regions": parameter subsets whose modification directly influences safety behaviors. We conduct a systematic evaluation of four safety region identification methods spanning different parameter granularities, from individual weights to entire Transformer layers, across four families of backbone LLMs with varying sizes. Using ten safety identification datasets, we find that the identified safety regions exhibit only low to moderate overlap, as measured by IoU. The overlap drops significantly when the safety regions are further refined using utility datasets (i.e., non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.
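
The overlap measure named in the abstract above, intersection over union (IoU), can be sketched as follows. This is a minimal illustration only: it assumes each identified safety region is represented as a set of parameter indices, and the function name and toy index sets are hypothetical, not taken from the paper.

```python
def iou(region_a: set, region_b: set) -> float:
    """Intersection over union of two sets of parameter indices."""
    union = region_a | region_b
    if not union:
        # Assumed convention: two empty regions have zero overlap.
        return 0.0
    return len(region_a & region_b) / len(union)


# Hypothetical regions identified from two different safety datasets.
region_from_dataset_1 = {0, 1, 2, 3}
region_from_dataset_2 = {2, 3, 4, 5}

# Two shared indices out of six distinct indices overall.
overlap = iou(region_from_dataset_1, region_from_dataset_2)
```

An IoU close to 1 would indicate that two datasets point to nearly the same parameters, while values near 0 indicate largely disjoint regions.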

2025

I2R-NLP at SemEval-2025 Task 8: Question Answering on Tabular Data
Yuze Gao | Bin Chen | Jian Su
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We present a Large Language Model (LLM) based system for question answering (QA) over tabular data that leverages multi-turn prompting to automatically generate executable Pandas functions. Our framework decomposes the problem into three key steps: (1) Answer Type Identification, where the system identifies the expected format of the response (e.g., boolean, number, category); (2) Pandas Function Generation, which generates a corresponding Pandas function using table metadata and in-context examples; and (3) Error Correction and Regeneration, which iteratively refines the function based on error feedback from execution. Evaluations on the SemEval-2025 Task 8 Tabular QA benchmark (Grijalba et al., 2024) demonstrate that our multi-turn approach significantly outperforms single-turn prompting models in exact match accuracy by 7.3%. The proposed system not only improves code generation robustness but also paves the way for enhanced adaptability in table-QA reasoning tasks. Our implementation is available at https://github.com/Gyyz/Question_Answering-over-Tabular-Data.
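
The generate-execute-regenerate loop described in the abstract above might look roughly like the sketch below. It is not the authors' implementation: the names `answer_with_retries` and `stub_generate` are hypothetical, the LLM is replaced by a stub that returns fixed source strings, the table is a plain list of dicts rather than a Pandas DataFrame, and the answer-type identification step is omitted.

```python
from typing import Callable, Optional


def answer_with_retries(question, table,
                        generate: Callable[[str, Optional[str]], str],
                        max_turns: int = 3):
    """Ask `generate` for source defining answer(table); retry with error feedback."""
    feedback = None
    for _ in range(max_turns):
        source = generate(question, feedback)
        namespace = {}
        try:
            exec(source, namespace)              # define the generated function
            return namespace["answer"](table)    # run it on the table
        except Exception as err:
            # Pass the execution error back for the next generation turn.
            feedback = f"{type(err).__name__}: {err}"
    return None  # no working function within the turn budget


def stub_generate(question, feedback):
    """Stand-in for an LLM call: wrong column name first, corrected after feedback."""
    if feedback is None:
        return "def answer(table):\n    return sum(row['Price'] for row in table)"
    return "def answer(table):\n    return sum(row['price'] for row in table)"


table = [{"price": 2}, {"price": 3}]
result = answer_with_retries("What is the total price?", table, stub_generate)
```

Here the first generated function fails with a `KeyError`, the error message is fed back, and the second turn produces a function that runs. Executing generated code with `exec` is only safe in a sandboxed setting.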

Soft Syntactic Reinforcement for Neural Event Extraction
Anran Hao | Jian Su | Shuo Sun | Teo Yong Sen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent event extraction (EE) methods rely on pre-trained language models (PLMs) but still suffer from errors due to a lack of syntactic knowledge. While syntactic information is crucial for EE, there is a need for effective methods to incorporate syntactic knowledge into PLMs. To address this gap, we present a novel method to incorporate syntactic information into PLM-based models for EE, which does not require external syntactic parsers to produce syntactic features of task data. Instead, our proposed soft syntactic reinforcement (SSR) mechanism learns to select syntax-related dimensions of PLM representation during pretraining on a standard dependency corpus. The adapted PLM weights and the syntax-aware representation then facilitate the model’s prediction over the task data. On both sentence-level and document-level EE benchmark datasets, our proposed method achieves state-of-the-art results, outperforming baseline models and existing syntactic reinforcement methods. To the best of our knowledge, this is the first work in this direction. Our code is available at https://github.com/Anran971/sre-naacl25.

Computational modeling of user-generated desires on social media can significantly aid decision-makers across various fields. Initially explored through wish speech, this task has evolved into a nuanced examination of hope speech. To enhance understanding and detection, we propose a novel scheme rooted in formal semantics approaches to modality, capturing both future-oriented hopes through desires and beliefs and the counterfactuality of past unfulfilled wishes and regrets. We manually re-annotated existing hope speech datasets and built a new one which constitutes a new benchmark in the field. We also explore the capabilities of LLMs in automatically detecting hope speech, relying on several prompting strategies. To the best of our knowledge, this is the first attempt towards a language-driven decomposition of the notional category hope and its automatic detection in a unified setting.

High-Quality Complex Text-to-SQL Data Generation through Chain-of-Verification
Yuchen Zhang | Yuze Gao | Bin Chen | Wenfeng Li | Shuo Sun | Jian Su
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Can today’s Text-to-SQL benchmarks still stretch modern LLMs? We argue no. Spider1.0 and BIRD, painstakingly hand-built, remain small, costly, and skewed toward medium-complexity SQL. Meanwhile, LLM-generated corpora are inexpensive but often superficial and fragile, suffering from shallow nesting, semantic drift, template fatigue, and insufficient quality checks. We address this gap with a Chain-of-Verifications framework that turns a handful of expert-labelled seeds into a large, reliably checked dataset at a fraction of the usual cost. The resulting corpus, AIGT2S, delivers: (1) 18k Question–SQL pairs across 113 databases, 41–77% larger than current English sets; (2) 55% of queries in the Ultra band of our four-level difficulty taxonomy; (3) 87.5% inter-annotator agreement; (4) ≥80% labour and ≥98% monetary savings versus earlier efforts. Baselines, including GPT-4o, Llama3, RESDSQL, and MAC-SQL, achieve at most 56% execution accuracy, indicating substantial room for improvement.

2024

Mitigating Linguistic Artifacts in Emotion Recognition for Conversations from TV Scripts to Daily Conversations
Donovan Ong | Shuo Sun | Jian Su | Bin Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Emotion Recognition in Conversations (ERC) is a well-studied task with numerous potential real-world applications. However, existing ERC models trained on the MELD dataset, which is derived from TV series, struggle when applied to daily conversation datasets. A closer examination of the datasets unveils the prevalence of linguistic artifacts such as repetitions and interjections in TV scripts, which ERC models may exploit when making predictions. To address this issue, we explore two techniques aimed at reducing the reliance of ERC models on these artifacts: 1) using contrastive learning to prioritize emotional features over dataset-specific linguistic style and 2) refining emotion predictions with pseudo-emotion intensity scores. Our experimental results show that reducing reliance on the linguistic style found in TV transcripts could enhance a model’s robustness and accuracy in diverse conversational contexts.

Humans Need Context, What about Machines? Investigating Conversational Context in Abusive Language Detection
Tom Bourgeade | Zongmin Li | Farah Benamara | Véronique Moriceau | Jian Su | Aixin Sun
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A crucial aspect of abusive language on social media platforms (toxicity, hate speech, harmful stereotypes, etc.) is its inherent contextual nature. In this paper, we focus on the role of conversational context in abusive language detection (ALD), one of the most “direct” forms of context in this domain, as given by the conversation threads (e.g., directly preceding message, original post). The incorporation of surrounding messages has proven vital for the accurate human annotation of harmful content. However, many prior works have either ignored this aspect, collecting and processing messages in isolation, or have obtained inconsistent results when attempting to embed such contextual information into traditional classification methods. The reasons behind these findings have not yet been properly addressed. To this end, we propose an analysis of the impact of conversational context in abusive language detection, through: (1) an analysis of prior works and the limitations of the most common concatenation-based approach, which we attempt to address with two alternative architectures; (2) an evaluation of these methods on existing datasets in English, and a new dataset of French tweets annotated for hate speech and stereotypes; and (3) a qualitative analysis showcasing the necessity for context-awareness in ALD, but also its difficulties.

2023

The success of ChatGPT has ignited an AI race, with researchers striving to develop new large language models (LLMs) that can match or surpass the language understanding and generation abilities of commercial ones. In recent times, a number of models have emerged, claiming performance near that of GPT-3.5 or GPT-4 through various instruction-tuning methods. As practitioners of Text-to-SQL parsing, we are grateful for their valuable contributions to open-source research. However, it is important to approach these claims with a sense of scrutiny and ascertain the actual effectiveness of these models. Therefore, we pit six popular large language models against each other, systematically evaluating their Text-to-SQL parsing capability on nine benchmark datasets with five different prompting strategies, covering both zero-shot and few-shot scenarios. Regrettably, the open-sourced models fell significantly short of the performance achieved by closed-source models like GPT-3.5, highlighting the need for further work to bridge the performance gap between these models.

Text-to-SQL translates user queries into SQL statements that can retrieve relevant answers from relational databases. Recent approaches to Text-to-SQL rely on pre-trained language models that are computationally expensive and technically challenging to deploy in real-world applications that require real-time or on-device processing capabilities. In this paper, we perform a focused study on the feasibility of applying recent model compression techniques to sketch-based and sequence-to-sequence Text-to-SQL models. Our results reveal that sketch-based Text-to-SQL models generally have higher inference efficiency and respond better to model compression than sequence-to-sequence models, making them ideal for real-world deployments, especially in use cases with simple SQL statements.

2018

Reasoning with Sarcasm by Reading In-Between
Yi Tay | Anh Tuan Luu | Siu Cheung Hui | Jian Su
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Sarcasm is a sophisticated speech act which commonly manifests on social communities such as Twitter and Reddit. The prevalence of sarcasm on the social web is highly disruptive to opinion mining systems due to not only its tendency of polarity flipping but also usage of figurative language. Sarcasm commonly manifests with a contrastive theme either between positive-negative sentiments or between literal-figurative scenarios. In this paper, we revisit the notion of modeling contrast in order to reason with sarcasm. More specifically, we propose an attention-based neural model that looks in-between instead of across, enabling it to explicitly model contrast and incongruity. We conduct extensive experiments on six benchmark datasets from Twitter, Reddit and the Internet Argument Corpus. Our proposed model not only achieves state-of-the-art performance on all datasets but also enjoys improved interpretability.

Attentive Gated Lexicon Reader with Contrastive Contextual Co-Attention for Sentiment Classification
Yi Tay | Anh Tuan Luu | Siu Cheung Hui | Jian Su
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper proposes a new neural architecture that exploits readily available sentiment lexicon resources. The key idea is that incorporating a word-level prior can aid in the representation learning process, eventually improving model performance. To this end, our model employs two distinctly unique components, i.e., (1) we introduce a lexicon-driven contextual attention mechanism to imbue lexicon words with long-range contextual information and (2) we introduce a contrastive co-attention mechanism that models contrasting polarities between all positive and negative words in a sentence. Via extensive experiments, we show that our approach outperforms many other neural baselines on sentiment classification tasks on multiple benchmark datasets.
