Tingting He


2025

This study investigates how large language models (LLMs) produce referring expressions (REs) and to what extent their behaviour aligns with human patterns. We evaluate LLM performance in two settings: slot filling, %KvD the conventional task of referring expression generation, where REs are generated within a fixed context, and language generation, where REs are analysed within fully generated texts. Using the WebNLG corpus, we assess how well LLMs capture human variation in reference production and analyse their behaviour by examining the influence of several factors known to affect human reference production, including referential form, syntactic position, recency, and discourse status. Our findings show that (1) task framing significantly affects LLMs’ reference production; (2) while LLMs are sensitive to some of these factors, their referential behaviour consistently diverges from human use; and (3) larger model size does not necessarily yield more human-like variation. These results underscore key limitations in current LLMs’ ability to replicate human referential choices.
Hallucinations pose a persistent challenge in open-ended question answering (QA). Traditional annotation methods, such as span-labelling, suffer from inconsistency and limited coverage. In this paper, we propose a rewriting-based framework as a new perspective on hallucinations in open-ended QA. We report on an experiment in which annotators are instructed to rewrite LLM-generated answers directly to ensure factual accuracy, with edits automatically recorded. Using the Chinese portion of the Mu-SHROOM dataset, we conduct a controlled rewriting experiment, comparing fact-checking tools (Google vs. GPT-4o), and analysing how tool choice, annotator background, and question openness influence rewriting behaviour. We find that rewriting leads to more hallucinations being identified, with higher inter-annotator agreement, than span-labelling.
We present the system developed by the Central China Normal University (CCNU) team for the SemEval-2025 shared task 8, which focuses on Question-Answering (QA) for tabular data. Our approach leverages multiple Large Language Models (LLMs), conducting tabular QA as code completion. Additionally, to improve its reliability, we introduce a two-stage corrections mechanism, in which we instruct the LLM to correct the code according to the judges of whether the code is executable and whether the answer obtained from executing the code is semantically consistent with the question. The experiment demonstrates that code correction works but answer correction does not. Finally, we discuss other unsuccessful approaches explored during our development process.
Large language models (LLMs) are naturally suitable for Chinese spelling check (CSC) task in few-shot scenarios due to their powerful semantic understanding and few-shot learning capabilities. Recent CSC research has begun to use LLMs as foundational models. However, most current datasets are primarily focused on errors generated during the text generation process, with little attention given to errors occurring in the modal conversion process. Furthermore, existing LLM-based CSC methods often rely on fixed prompt samples, which limits the performance of LLMs. Therefore, we propose a framework named RagID (Retrieval-Augment Generation and Iterative Discriminator Strategy). By utilizing semantic-based similarity search and an iterative discriminator mechanism, RagID can provide well-chosen prompt samples and reduce over-correction issues in LLM-based CSC. RagID demonstrates excellent effectiveness in few-shot scenarios. We conducted comprehensive experiments, and the results show that RagID achieves the best performance on dataset that include data from multiple domains and dataset containing modal conversion spelling errors. The dataset and method are available online.
Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work innovatively proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLMs detoxification without parameter fine-tuning. DSCD strengthens the inner token distribution of the safety layer while weakening that of hallucination and toxic layer during output generation. This effectively diminishes toxicity and enhances output safety. DSCD offers lightweight, high compatibility, and plug-and-play capabilities, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD’s effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD’s potential as a practical and scalable solution for safer LLM deployments.
Despite the impressive capabilities of large language models (LLMs) in aspect-based sentiment analysis (ABSA), the role of syntactic information remains underexplored in LLMs. Syntactic structures are known to be crucial for capturing aspect-opinion relationships. To explore whether LLMs can effectively leverage syntactic information to improve ABSA performance, we propose a novel multi-step reasoning framework, the Syntax-Opinion-Sentiment Reasoning Chain (Syn-Chain). Syn-Chain sequentially analyzes syntactic dependencies, extracts opinions, and classifies sentiment. We introduce Syn-Chain into LLMs via zero-shot prompting, and results show that Syn-Chain significantly enhances ABSA performance, though smaller LLM exhibit weaker performance. Furthermore, we enhance smaller LLMs via distillation using GPT-3.5-generated Syn-Chain responses, achieving state-of-the-art ABSA performance. Our findings highlight the importance of syntactic information for improving LLMs in ABSA and offer valuable insights for future research.
Few-shot multi-intent spoken language understanding (SLU) aims to identify users’ multiple intents and key slots using a tiny amount of annotated data. Recent advances in large language models (LLMs) have utilized instruction learning frameworks to model intent-slot interdependencies, typically requiring abundant data for effective training. However, in few-shot scenarios, these frameworks face challenges such as mismatches between the number of generated slots and input lengths, relational confusion in multi-intent scenarios and neglect of task-specific variations in intent counts across utterances. To overcome the challenges, we propose PICD-Instruct, a novel generative framework based on Basic Instructions (BI), Pairwise Interaction Instructions (PII) and Contrastive Distinct Instructions (CDI). Specifically, BI directs LLMs to generate entities along with associated words, thereby mitigating mismatches in quantitative correspondences. PII explicitly captures dual-task interdependencies by guiding LLMs to pair each intent with its related entities. CDI enhances understanding of utterances by guiding LLMs to determine whether two utterances share the same intent count. Experimental results on public datasets indicate that PICD-Instruct achieves state-of-the-art performance.
"中文相较于以英文为代表的表音文字具有富语义的特点,单个汉字蕴含了读音、字形结构、偏旁部首等丰富的语义特征,在构建自然语言处理相关应用时具有独特的价值,可以视作额外的特征,提升在特定任务的表现。近年来,大语言模型飞速发展,展现出海量的知识储备和强大的推理能力,其中,大模型对汉字富语义特征的掌握可以视作大模型中文能力的基础。然而,目前对于大模型汉字富语义能力评测研究较少,针对性地评测大模型在汉字富语义方面的能力边界,有助于了解大模型中英文能力差异性、并推测大模型在字形、字音相关下游任务上的表现。因此,本研究从汉字的结构、偏旁、读音、笔画、多音字和部件六个维度,对大语言模型进行了全面评测,旨在深入探究其对汉字基本富语义特征的掌握程度。本研究以GB2312 标准字符集和现代汉语词典为依据,围绕汉字的结构、偏旁、读音、笔画、多音字和部件六个维度,构建了一系列“问题-答案”对,并制定了科学合理的评分标准。在此基础上,对十余种主流的大语言模型进行了深入评测。同时,为探究模型在中英文能力上的差异,将上述中文评测任务翻译为英文,并选取了三个代表性模型进行对比评测。此外,本研究进一步从汉字结构推理、偏旁推理、读音推理三个关键角度出发,设计了一系列推理评测任务,旨在深入评估大语言模型对汉字富语义特征的推理能力。本研究的评测结果具有重要的参考价值,可为大语言模型相关领域的研究人员在中文下游任务优化、基础模型选择等关键环节提供参考和启发。"

2024

2023

Most previous few-shot Spoken Language Understanding (SLU) models typically need to be trained on a set of data-rich source domains and adapt to the target domain with a few examples. In this paper, we explore a more practical scenario for few-shot SLU, in which we only assume access to a pre-trained language model and a few labeled examples without any other source domain data. We concentrate on understanding how far the few-shot SLU could be pushed in this setting. To this end, we develop a prompt-based intent detection model in few-shot settings, which leverages the BERT original pre-training next sentence prediction task and the prompt template to detect the user’s intent. For slot filling, we propose an approach of reconstructing slot labels, which reduces the training complexity by reducing the number of slot labels in few-shot settings. To evaluate the few-shot SLU for a more practical scenario, we present two benchmarks, FewShotATIS and FewShotSNIPS. And a dynamic sampling strategy is designed to construct the two datasets according to the learning difficulty of each intent and slot. Experiments on FewShotATIS and FewShotSNIPS demonstrate that our proposed model achieves state-of-the-art performance.
In few-shot settings, fully conveying the semantic information of the dialogue act is a crucial challenge for Natural Language Generation (NLG) in the task-oriented dialogue system. An interesting fact is that NLG and Spoken Language Understanding (SLU) are a natural dual problem pair. Suppose the response generated by the NLG module can be restored to the corresponding dialogue act by the SLU module, which reflects that the generated response fully conveys the semantic information of the dialogue act. Based on this idea, a novel Dual Supervised Pre-trained Model for a few-shot Natural Language Generation (DSPM-NLG) is proposed to regularize the pre-training process. We adopt a joint model with a dual supervised framework to learn the dual correlation between NLG and SLU from the perspective of probability. In addition, a slot-masked strategy is designed to enable the model to focus better on the key slot-value pairs. DSPM-NLG is continuously trained on existing public large-scale annotated data, which thoroughly learns the duality between two tasks to enhance the semantically controlling and generalization abilities of the pre-trained model. Experiments demonstrate that our proposed model performs outstandingly on the few-shot benchmark dataset and outperforms the previous SOTA results.
“Domain-specific text classification often needs more external knowledge, and fraud cases havefewer descriptions. Existing methods usually utilize single-stage deep models to extract semanticfeatures, which is less reusable. To tackle this issue, we propose a two-stage training frameworkbased on within-task pretraining and multi-dimensional semantic enhancement for CCL23-EvalTask 6 (Telecom Network Fraud Case Classification, FCC). Our training framework is dividedinto two stages. First, we pre-train using the training corpus to obtain specific BERT. The seman-tic mining ability of the model is enhanced from the feature space perspective by introducing ad-versarial training and multiple random sampling. The pseudo-labeled data is generated throughthe test data above a certain threshold. Second, pseudo-labeled samples are added to the trainingset for semantic enhancement based on the sample space dimension. We utilize the same back-bone for prediction to obtain the results. Experimental results show that our proposed methodoutperforms the single-stage benchmarks and achieves competitive performance with 0.859259F1. It also performs better in the few-shot patent classification task with 65.160% F1, whichindicates robustness.”

2022

In recent years, Graph Neural Network (GNN) approaches with enhanced knowledge graphs (KG) perform well in question answering (QA) tasks. One critical challenge is how to effectively utilize interactions between the QA context and KG. However, existing work only adopts the identical QA context representation to interact with multiple layers of KG, which results in a restricted interaction. In this paper, we propose DRLK (Dynamic Hierarchical Reasoning with Language Model and Knowledge Graphs), a novel model that utilizes dynamic hierarchical interactions between the QA context and KG for reasoning. DRLK extracts dynamic hierarchical features in the QA context, and performs inter-layer and intra-layer interactions on each iteration, allowing the KG representation to be grounded with the hierarchical features of the QA context. We conduct extensive experiments on four benchmark datasets in medical QA and commonsense reasoning. The experimental results demonstrate that DRLK achieves state-of-the-art performances on two benchmark datasets and performs competitively on the others.

2016

2015

2006

2005

2004

2003