Qian Liu

2025

In the context of large language models (LLMs), current advanced reasoning methods have made impressive strides in various reasoning tasks. However, when it comes to logical reasoning tasks, significant challenges remain in both efficacy and efficiency. This is rooted in the fact that these systems fail to fully leverage the inherent structure of logical tasks throughout the reasoning processes, including decomposition, search, and resolution. To address this, this paper proposes a logic-complete reasoning framework, Aristotle. The framework consists of three key components: Logical Decomposer, Logical Search Router, and Logical Resolver, in which symbolic expressions and logical rules are comprehensively integrated into the entire reasoning process, significantly alleviating the bottlenecks of logical reasoning, i.e., reducing sub-task complexity, minimizing search errors, and resolving logical contradictions. Experimental results demonstrate that Aristotle consistently outperforms state-of-the-art reasoning frameworks in both accuracy and efficiency, particularly excelling in complex logical reasoning scenarios.

Code LLMs have been widely used in various domains, including code generation, logical reasoning, and agent systems. However, open-access code LLMs mostly only release weights, lacking key features such as reproducible data pipelines and transparent training protocols, which are crucial for advancing deeper, more reliable investigations. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an “open cookbook” for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Our work identifies the key ingredients for building a top-tier code LLM: optimized heuristic rules for data cleaning and deduplication, effective recall of code-related text corpus, and high-quality synthetic data for both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research and enable reproducible advancements in code intelligence. The released resource is available at https://opencoder-llm.github.io.

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94 walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.

Large Language Models (LLMs) have reshaped code generation by synergizing their exceptional comprehension of natural language and programming syntax, thereby substantially boosting developer productivity. These advancements have prompted numerous efforts to quantitatively evaluate their coding capabilities. However, persistent challenges, such as benchmark leakage, data dissipation, and limited system accessibility, continue to impede a timely and accurate assessment. To address these limitations, we introduce CodeArena, an online evaluation framework tailored for LLM code generation. Its key innovation is a collective evaluation mechanism, which dynamically recalibrates individual model scores based on the holistic performance of all participating models, mitigating score biases caused by widespread benchmark leakage. In addition, CodeArena ensures open access to all submitted solutions and test cases and provides automation-friendly APIs to streamline the code evaluation workflow. Our main contributions are: (1) a collective evaluation system for unbiased assessment, (2) a public repository of solutions and test cases, and (3) automation-ready APIs for seamless integration.

Word sense disambiguation (WSD) is a fundamental yet challenging task in natural language processing. In recent years, the advent of large language models (LLMs) has led to significant advancements in regular WSD tasks. However, most existing LLMs face two major issues that hinder their performance in WSD. Firstly, these models are often prone to misclassifying the correct meaning of an ambiguous word when confronted with contexts containing adversarial information. Secondly, there is a lack of sufficient adversarial WSD datasets, which severely limits the development and evaluation of adversarial WSD systems. To address these gaps, we propose a novel Multi-Agent Debate framework for Adversarial Word Sense Disambiguation (MADAWSD). The MADAWSD framework simulates a real-world debate environment where multiple agent roles, namely, the Debater, Moderator, Consensus-seeker, and Judge, engage in discussions about ambiguous words in the context of adversarial information. Through a collaborative mechanism among these agents, it achieves accurate WSD. Additionally, a novel dataset for Chinese adversarial WSD has been constructed, focusing on improving and evaluating the performance of WSD models in the Chinese language. Extensive experiments on both English and Chinese adversarial WSD datasets demonstrate that MADAWSD can seamlessly integrate with existing LLMs and significantly enhance their performance, showcasing broad generality and outstanding effectiveness.

pdf bib abs
NLP+Code: Code Intelligence in Language Models
Terry Yue Zhuo | Qian Liu | Zijian Wang | Wasi Uddin Ahmad | Binyuan Hui | Loubna Ben Allal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Language models (LMs) like GPT and Claude have shown impressive abilities in a range of natural language processing (NLP) tasks. Among these tasks, code understanding and generation have quickly become one of the most popular applications of LMs, given its nature of executable logic forms. However, there is a practical understanding of how programming knowledge can be combined with natural language to automate software development. Moreover, recent studies also empirically demonstrate that code can be a better form for complex reasoning and agentic task automation, but they do not indicate their significance. In this tutorial, we deem such superior capabilities brought by code modeling as Code Intelligence, and aim to provide a coherent overview of recent advances in this topic. We will start by first providing preliminaries of training foundation models on code and their common practices. We will then focus on downstream tasks in the domain of code and their evaluations. Then, we will cover how code can contribute to advancements in general tasks, and the opportunities of future research on Code Intelligence.

pdf bib abs
David vs. Goliath: Cost-Efficient Financial QA via Cascaded Multi-Agent Reasoning
Chenghao Liu | Qian Liu | Ziqin Zhu | Hao Fei | Aniket Mahanti
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) have demonstrated remarkable reasoning capabilities, including in financial question answering (FQA). However, the performance in FQA remains limited, particularly in questions that require deep financial knowledge and complex numerical reasoning. While supervised fine-tuning and closed-source LLMs have shown promise, they are often constrained by high costs or computational inefficiency. In this paper, we propose a low-cost yet effective framework, named FinMAN (Financial multi-agent framework), that enables small LLMs (e.g., 8B) to perform complex reasoning tasks without relying on expensive models or task-specific fine-tuning. FinMAN improves formula selection, extraction, and calculation to help small-scale models solve FQA tasks more accurately, with a lightweight verification mechanism to correct common errors. Experimental results show that FinMAN outperforms the best open-source model on BizBench by 10.46% and achieves competitive performance to GPT-3.5 using significantly fewer parameters. Our code and data are publicly available at https://github.com/coenliu/MultiAgentFin.

Most of the existing work focuses on enabling LLMs to leverage legal rules (, law articles) to tackle complex legal reasoning tasks, but ignores their ability to understand legal rules. To better evaluate the LLMs’ capabilities on the task, in this work, we propose a new challenge task: Legal Paragraph Prediction (LPP), which aims to predict the legal paragraph given criminal facts. Moreover, to enhance the legal reasoning ability of LLMs, we propose a novel framework CLEAR, enabling LLMs to analyze legal cases with the guidance of legal rule insights. The CLEAR contains four key components, where the Legal Rules Retriever aims to retrieve legal rule knowledge, and the Rule Insights Generator is used to generate legal insights guiding the LLM’s reasoning, then the Case Analyzer analyze the case with the guidance of legal rule insights given criminal facts. Finally, the Legal Reasoner synthesizes the criminal facts, legal rule insights, and analysis results to derive the final decision. By conducting extensive experiments on a real-world dataset, experimental results validate the effectiveness of our proposed model. Our codes and dataset are available at https://anonymous.4open.science/r/CLEAR-3048.

Due to the widespread dissemination of rumors on social media platforms, detecting rumors has been a long-standing concern for various communities. However, existing rumor detection methods rarely consider the fairness issues inherent in the model, which can lead to biased predictions across different stakeholder groups (e.g., domains and originating platforms of the detected content), also undermining their detection effectiveness. In this work, we propose a two-step framework to address this issue. First, we perform unsupervised partitioning to dynamically identify potential unfair data patterns without requiring sensitive attribute annotations. Then, we apply invariant learning to these partitions to extract fair and informative feature representations that enhance rumor detection. Extensive experiments show that our method outperforms strong baselines regarding detection and fairness performance, and also demonstrate robust performance on out-of-distribution samples. Further empirical results indicate that our learned features remain informative and fair across stakeholder groups and can correct errors when applied to existing baselines.

pdf bib
Proceedings of the 4th Table Representation Learning Workshop
Shuaichen Chang | Madelon Hulsebos | Qian Liu | Wenhu Chen | Huan Sun
Proceedings of the 4th Table Representation Learning Workshop

2024

The surge in Large Language Models (LLMs) has revolutionized natural language processing, but fine-tuning them for specific tasks often encounters challenges in balancing performance and preserving general instruction-following abilities. In this paper, we posit that the distribution gap between task datasets and the LLMs serves as the primary underlying cause. To address the problem, we introduce Self-Distillation Fine-Tuning (SDFT), a novel approach that bridges the distribution gap by guiding fine-tuning with a distilled dataset generated by the model itself to match its original distribution. Experimental results on the Llama-2-chat model across various benchmarks demonstrate that SDFT effectively mitigates catastrophic forgetting while achieving comparable or superior performance on downstream tasks compared to the vanilla fine-tuning. Moreover, SDFT demonstrates the potential to maintain the helpfulness and safety alignment of LLMs. Our code is available at https://github.com/sail-sg/sdft.

pdf bib abs
Through the MUD: A Multi-Defendant Charge Prediction Benchmark with Linked Crime Elements
Xiao Wei | Qi Xu | Hang Yu | Qian Liu | Erik Cambria
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The current charge prediction datasets mostly focus on single-defendant criminal cases.However, real-world criminal cases usually involve multiple defendants whose criminal facts are intertwined. In an early attempt to fill this gap, we introduce a new benchmark that encompasses legal cases involving multiple defendants, where each defendant is labeled with a charge and four types of crime elements, i.e., Object Element, Objective Element, Subject Element, and Subjective Element. Based on the dataset, we further develop an interpretable model called EJudge that incorporates crime elements and legal rules to infer charges. We observe that predicting crime charges while providing corresponding rationales benefits the interpretable AI system. Extensive experiments show that EJudge significantly surpasses state-of-the-art methods, which verify the importance of crime elements and legal rules in multi-defendant charge prediction. The source code and dataset are available at https://anonymous.4open.science/r/MCP_1-6010.

pdf bib abs
Beyond Memorization: The Challenge of Random Memory Access in Language Models
Tongyao Zhu | Qian Liu | Liang Pang | Zhengbao Jiang | Min-Yen Kan | Min Lin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent developments in Language Models (LMs) have shown their effectiveness in NLP tasks, particularly in knowledge-intensive tasks.However, the mechanisms underlying knowledge storage and memory access within their parameters remain elusive. In this paper, we investigate whether a generative LM (e.g., GPT-2) is able to access its memory sequentially or randomly. Through carefully-designed synthetic tasks, covering the scenarios of full recitation, selective recitation and grounded question answering, we reveal that LMs manage to sequentially access their memory while encountering challenges in randomly accessing memorized content. We find that techniques including recitation and permutation improve the random memory access capability of LMs. Furthermore, by applying this intervention to realistic scenarios of open-domain question answering, we validate that enhancing random access by recitation leads to notable improvements in question answering. The code to reproduce our experiments can be found at https://github.com/sail-sg/lm-random-memory-access.

While the recent Chain-of-Thought (CoT) technique enhances the reasoning ability of large language models (LLMs) with the theory of mind, it might still struggle in handling logical reasoning that relies much on symbolic expressions and rigid deducing rules. To strengthen the logical reasoning capability of LLMs, we propose a novel Symbolic Chain-of-Thought, namely SymbCoT, a fully LLM-based framework that integrates symbolic expressions and logic rules with CoT prompting. Technically, building upon an LLM, SymbCoT 1) first translates the natural language context into the symbolic format, and then 2) derives a step-by-step plan to solve the problem with symbolic logical rules, 3) followed by a verifier to check the translation and reasoning chain. Via thorough evaluations on 5 standard datasets with both First-Order Logic and Constraint Optimization symbolic expressions, SymbCoT shows striking improvements over the CoT method consistently, meanwhile refreshing the current state-of-the-art performances. We further demonstrate that our system advances in more faithful, flexible, and explainable logical reasoning. To our knowledge, this is the first attempt at combining symbolic expressions and rules into CoT for logical reasoning with LLMs. Code is open at https://github.com/Aiden0526/SymbCoT.

pdf bib abs
EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot
Hao Fei | Han Zhang | Bin Wang | Lizi Liao | Qian Liu | Erik Cambria
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

This paper introduces EmpathyEar, a pioneering open-source, avatar-based multimodal empathetic chatbot, to fill the gap in traditional text-only empathetic response generation (ERG) systems. Leveraging the advancements of a large language model, combined with multimodal encoders and generators, EmpathyEar supports user inputs in any combination of text, sound, and vision, and produces multimodal empathetic responses, offering users, not just textual responses but also digital avatars with talking faces and synchronized speeches. A series of emotion-aware instruction-tuning is performed for comprehensive emotional understanding and generation capabilities. In this way, EmpathyEar provides users with responses that achieve a deeper emotional resonance, closely emulating human-like empathy. The system paves the way for the next emotional intelligence, for which we open-source the code for public access.

pdf bib abs
DataTales: A Benchmark for Real-World Intelligent Data Narration
Yajing Yang | Qian Liu | Min-Yen Kan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce DataTales, a novel benchmark designed to assess the proficiency of language models in data narration, a task crucial for transforming complex tabular data into accessible narratives. Existing benchmarks often fall short in capturing the requisite analytical complexity for practical applications. DataTales addresses this gap by offering 4.9k financial reports paired with corresponding market data, showcasing the demand for models to create clear narratives and analyze large datasets while understanding specialized terminology in the field. Our findings highlights the significant challenge that language models face in achieving the necessary precision and analytical depth for proficient data narration, suggesting promising avenues for future model development and evaluation methodologies.

We present Sailor, a family of open language models ranging from 0.5B to 14B parameters, tailored for South-East Asian (SEA) languages. From Qwen1.5, Sailor models accept 200B to 400B tokens during continual pre-training, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving the model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks indicate that Sailor models demonstrate strong performance across different benchmarks, including commonsense reasoning, question answering, reading comprehension and examination. We share our insights to spark a wider interest in developing large language models for multilingual use cases.

pdf bib abs
MetaPro 2.0: Computational Metaphor Processing on the Effectiveness of Anomalous Language Modeling
Rui Mao | Kai He | Claudia Ong | Qian Liu | Erik Cambria
Findings of the Association for Computational Linguistics: ACL 2024

Metaphor interpretation is a difficult task in natural language understanding. The development of relevant techniques in this domain is slow, mostly because of the lack of large annotated datasets and effective pre-trained language models (PLMs) for metaphor learning. Thus, we propose a large annotated dataset and a PLM for the metaphor interpretation task. Our foundation model is based on a novel anomalous language modeling (ALM) method, which we benchmark with comparable PLM baselines on the new dataset, finding that it largely improves model performance on metaphor identification and interpretation.

pdf bib abs
Guided Knowledge Generation with Language Models for Commonsense Reasoning
Xiao Wei | Haoran Chen | Hang Yu | Hao Fei | Qian Liu
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have achieved notable success in commonsense reasoning tasks, benefiting from their extensive world knowledge acquired through extensive pretraining. While approaches like Chain-of-Thought (CoT) have shown promise in enhancing LLMs’ reasoning capabilities, mitigating the influence of inaccurate commonsense knowledge remains a challenge, particularly for small-scale LLMs (e.g., those with less than 10B parameters). In this work, we propose a novel method named Guided Knowledge Generation (GuideKG) to address these issues. It presents three advantages: (i) Employing LLMs to generate knowledge explanations and to automatically assign labels based on the probability of correct answers eliminates the need for costly manual annotation in subsequent training. (ii) Training a new module called the ‘Know-Filter’, which is used to evaluate knowledge, and we have introduced a new loss to enhance its performance. (iii) Evaluating the effectiveness of knowledge fragments at the sentence level and fusing them allows for precise control over the generation process of LLMs. We evaluate our GuideKG on small-scale LLMs and show that it outperforms all baselines on four widely-used commonsense reasoning benchmarks. Moreover, our experiments reveal that, with proper guidance, small-scale LLMs can exhibit exceptional performance in commonsense reasoning.

Recently the retrieval-augmented generation (RAG) has been successfully applied in code generation. However, existing pipelines for retrieval-augmented code generation (RACG) employ static knowledge bases with a single source, limiting the adaptation capabilities of Large Language Models (LLMs) to domains they have insufficient knowledge of. In this work, we develop a novel pipeline, EVOR, that employs the synchronous evolution of both queries and diverse knowledge bases. On two realistic settings where the external knowledge is required to solve code generation tasks, we compile four new datasets associated with frequently updated libraries and long-tail programming languages, named EVOR-BENCH. Extensive experiments demonstrate that EVOR achieves two to four times of execution accuracy compared to other methods such as Reflexion (Shinn et al., 2024), DocPrompting (Zhou et al., 2023), etc. We demonstrate that EVOR is flexible and can be easily combined with them to achieve further improvement. Further analysis reveals that EVOR benefits from the synchronous evolution of queries and documents and the diverse information sources in the knowledge base. We hope that our studies will inspire more insights into the design of advanced RACG pipelines in future research.

pdf bib abs
Divide and Conquer: Legal Concept-guided Criminal Court View Generation
Qi Xu | Xiao Wei | Hang Yu | Qian Liu | Hao Fei
Findings of the Association for Computational Linguistics: EMNLP 2024

The Criminal Court View Generation task aims to produce explanations that inform judicial decisions. This necessitates a nuanced understanding of diverse legal concepts, such as Recidivism, Confess, and Robbery, which often coexist within cases, complicating holistic analysis. However, existing methods mainly rely on the generation capability of language models, without paying enough attention to the important legal concepts.To enhance the precision and depth of such explanations, we introduce Legal Concept-guided Criminal Court Views Generation (LeGen), a three-stage approach designed for iterative reasoning tailored to individual legal constructs.Specifically, in the first stage, we design a decomposer to divide the court views into focused sub-views, each anchored around a distinct legal concept. Next, a concept reasoning module generates targeted rationales by intertwining the deconstructed facts with their corresponding legal frameworks, ensuring contextually relevant interpretations.Finally, a verifier and a generator are employed to align the rationale with the case fact and obtain synthesized comprehensive and legally sound final court views, respectively.We evaluate LeGen by conducting extensive experiments on a real-world dataset and experimental results validate the effectiveness of our proposed model. Our codes are available at https://anonymous.4open.science/r/LeGen-5625.

pdf bib abs
S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Model
Fangyu Lei | Qian Liu | Yiming Huang | Shizhu He | Jun Zhao | Kang Liu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like long-context understanding and reasoning.However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 200K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration.In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLMs evaluation.The synthetic nature of S3Eval provides users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios.The strong correlation between S3Eval and real-world benchmarks demonstrates the soundness of using S3Eval for evaluation of LLMs.S3Eval provides a flexible and infinite long-context data generation method. We have generated a comprehensive dataset called S3Eval-Standard, and experimental results have shown that it poses significant challenges for all existing LLMs.

pdf bib abs
COSIGN: Contextual Facts Guided Generation for Knowledge Graph Completion
Jinpeng Li | Hang Yu | Xiangfeng Luo | Qian Liu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Knowledge graph completion (KGC) aims to infer missing facts based on existing facts within a KG. Recently, research on generative models (GMs) has addressed the limitations of embedding methods in terms of generality and scalability. However, GM-based methods are sensitive to contextual facts on KG, so the contextual facts of poor quality can cause GMs to generate erroneous results. To improve the performance of GM-based methods for various KGC tasks, we propose a COntextual FactS GuIded GeneratioN (COSIGN) model. First, to enhance the inference ability of the generative model, we designed a contextual facts collector to achieve human-like retrieval behavior. Second, a contextual facts organizer is proposed to learn the organized capabilities of LLMs through knowledge distillation. Finally, the organized contextual facts as the input of the inference generator to generate missing facts. Experimental results demonstrate that COSIGN outperforms state-of-the-art baseline techniques in terms of performance.

2023

pdf bib abs
Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination
Hao Fei | Qian Liu | Meishan Zhang | Min Zhang | Tat-Seng Chua
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained with source-text image pairs, and tested with only source-text inputs. First, we represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics. To enable pure-text input during inference, we devise a visual scene hallucination mechanism that dynamically generates pseudo visual SG from the given textual SG. Several SG-pivoting based learning objectives are introduced for unsupervised translation training. On the benchmark Multi30K data, our SG-based method outperforms the best-performing baseline by significant BLEU scores on the task and setup, helping yield translations with better completeness, relevance and fluency without relying on paired images. Further in-depth analyses reveal how our model advances in the task setting.

While sentiment analysis systems try to determine the sentiment polarities of given targets based on the key opinion expressions in input texts, in implicit sentiment analysis (ISA) the opinion cues come in an implicit and obscure manner. Thus detecting implicit sentiment requires the common-sense and multi-hop reasoning ability to infer the latent intent of opinion. Inspired by the recent chain-of-thought (CoT) idea, in this work we introduce a Three-hop Reasoning (THOR) CoT framework to mimic the human-like reasoning process for ISA. We design a three-step prompting principle for THOR to step-by-step induce the implicit aspect, opinion, and finally the sentiment polarity. Our THOR+Flan-T5 (11B) pushes the state-of-the-art (SoTA) by over 6% F1 on supervised setup. More strikingly, THOR+GPT3 (175B) boosts the SoTA by over 50% F1 on zero-shot setting.

pdf bib abs
SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables
Xinyuan Lu | Liangming Pan | Qian Liu | Preslav Nakov | Min-Yen Kan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Current scientific fact-checking benchmarks exhibit several shortcomings, such as biases arising from crowd-sourced claims and an over-reliance on text-based evidence. We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims that 1) originate from authentic scientific publications and 2) require compositional reasoning for verification. The claims are paired with evidence-containing scientific tables annotated with labels. Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models, including table-based pretraining models and large language models. All models except GPT-4 achieved performance barely above random guessing. Popular prompting techniques, such as Chain-of-Thought, do not achieve much performance gains on SCITAB. Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning. Our codes and data are publicly available at https://github.com/XinyuanLu00/SciTab.

Despite the remarkable ability of large language models (LMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output. Augmenting LMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on the input. This is limiting, however, in more general scenarios involving generation of long texts, where continually gathering information throughout generation is essential. In this work, we provide a generalized view of active retrieval augmented generation, methods that actively decide when and what to retrieve across the course of the generation. We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content, which is then utilized as a query to retrieve relevant documents to regenerate the sentence if it contains low-confidence tokens. We test FLARE along with baselines comprehensively over 4 long-form knowledge-intensive generation tasks/datasets. FLARE achieves superior or competitive performance on all tasks, demonstrating the effectiveness of our method.

pdf bib abs
Generative Table Pre-training Empowers Models for Tabular Prediction
Tianping Zhang | Shaowen Wang | Shuicheng Yan | Li Jian | Qian Liu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recently, the topic of table pre-training has attracted considerable research interest. However, how to employ table pre-training to boost the performance of tabular prediction remains an open challenge. In this paper, we propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction. After pre-training on a large corpus of real-world tabular data, TapTap can generate high-quality synthetic tables to support various applications on tabular data, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. Extensive experiments on 12 datasets demonstrate that TapTap outperforms a total of 16 baselines in different scenarios. Meanwhile, it can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer. Moreover, with the aid of table pre-training, models trained using synthetic data generated by TapTap can even compete with models using the original dataset on half of the experimental datasets, marking a milestone in the development of synthetic tabular data generation. The code and datasets are available at https://github.com/ZhangTP1996/TapTap.

pdf bib abs
CocoSciSum: A Scientific Summarization Toolkit with Compositional Controllability
Yixi Ding | Yanxia Qin | Qian Liu | Min-Yen Kan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a novel toolkit for controlled summarization of scientific documents, designed for the specific needs of the scientific community. Our system generates summaries based on user preferences, adjusting key attributes specifically of length and keyword inclusion. A distinguishing feature is its ability to manage multiple attributes concurrently, demonstrating Compositional Controllability for Scientific Summarization (CocoSciSum). Benchmarked against the strong Flan-T5 baseline, CocoSciSum exhibits superior performance on both the quality of summaries generated and the control over single and multiple attributes. Moreover, CocoSciSum is a user-centric toolkit, supporting user preferences expressed in natural language instructions, and accommodating diverse input document formats. CocoSciSum is available on GitHub (https://github.com/WING-NUS/SciAssist/tree/CocoSciSum) with an introduction video (https://youtu.be/YC1YDeEjAbQ).

pdf bib abs
PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction
Xiao Wei | Jianbao Huang | Hang Yu | Qian Liu
Findings of the Association for Computational Linguistics: ACL 2023

Chinese spelling correction (CSC) is a challenging task with the goal of correcting each wrong character in Chinese texts. Incorrect characters in a Chinese text are mainly due to the similar shape and similar pronunciation of Chinese characters. Recently, the paradigm of pre-training and fine-tuning has achieved remarkable success in natural language processing. However, the pre-training objectives in existing methods are not tailored for the CSC task since they neglect the visual and phonetic properties of characters, resulting in suboptimal spelling correction. In this work, we propose to pre-train a new corrector named PTCSpell for the CSC task under the detector-corrector architecture. The corrector we propose has the following two improvements. First, we design two novel pre-training objectives to capture pronunciation and shape information in Chinese characters. Second, we propose a new strategy to tackle the issue that the detector’s prediction results mislead the corrector by balancing the loss of wrong characters and correct characters. Experiments on three benchmarks (i.e., SIGHAN 2013, 2014, and 2015) show that our model achieves an average of 5.8% F1 improvements at the correction level over state-of-the-art methods, verifying its effectiveness.

2022

Consistency identification in task-oriented dialog (CI-ToD) usually consists of three subtasks, aiming to identify inconsistency between current system response and current user response, dialog history and the corresponding knowledge base. This work aims to solve CI-ToD task by introducing an explicit interaction paradigm, Cycle Guided Interactive learning Model (CGIM), which achieves to make information exchange explicitly from all the three tasks. Specifically, CGIM relies on two core insights, referred to as guided multi-head attention module and cycle interactive mechanism, that collaborate from each other. On the one hand, each two tasks are linked with the guided multi-head attention module, aiming to explicitly model the interaction across two related tasks. On the other hand, we further introduce cycle interactive mechanism that focuses on facilitating model to exchange information among the three correlated sub-tasks via a cycle interaction manner. Experimental results on CI-ToD benchmark show that our model achieves the state-of-the-art performance, pushing the overall score to 56.3% (5.0% point absolute improvement). In addition, we find that CGIM is robust to the initial task flow order.

Reasoning over natural language is a long-standing goal for the research community. However, studies have shown that existing language models are inadequate in reasoning. To address the issue, we present POET, a novel reasoning pre-training paradigm. Through pre-training language models with programs and their execution results, POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach. POET is conceptually simple and can be instantiated by different kinds of program executors. In this paper, we showcase two simple instances POET-Math and POET-Logic, in addition to a complex instance, POET-SQL. Experimental results on six benchmarks demonstrate that POET can significantly boost model performance in natural language reasoning, such as numerical reasoning, logical reasoning, and multi-hop reasoning. POET opens a new gate on reasoning-enhancement pre-training, and we hope our analysis would shed light on the future research of reasoning like program executors.

pdf bib abs
Exploring the Secrets Behind the Learning Difficulty of Meaning Representations for Semantic Parsing
Zhenwen Li | Jiaqi Guo | Qian Liu | Jian-Guang Lou | Tao Xie
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Previous research has shown that the design of Meaning Representation (MR) greatly influences the final model performance of a neural semantic parser. Therefore, designing a good MR is a long-term goal for semantic parsing. However, it is still an art as there is no quantitative indicator that can tell us which MR among a set of candidates may have the best final model performance. In practice, in order toselect an MR design, researchers often have to go through the whole training-testing process for all design candidates, and the process often costs a lot. In this paper, we propose a data-aware metric called ISS (denoting incremental structural stability) of MRs, and demonstrate that ISS is highly correlated with the final performance. The finding shows that ISS can be used as an indicator for MR design to avoid the costly training-testing process.

Language-based environment manipulation requires agents to manipulate the environment following natural language instructions, which is challenging due to the huge space of the environments.To address this challenge, various approaches have been proposed in recent work. Although these approaches work well for their intended environments, they are difficult to generalize across environments. In this work, we propose LEMON, a general framework for language-based environment manipulation tasks. Specifically, we first specify a general approach for language-based environment manipulation tasks, which can deal with various environments using the same generative language model. Then we propose an execution-guided pre-training strategy to inject prior knowledge of environments to the language model with a pure synthetic pre-training corpus. Experimental results on tasks including Alchemy, Scene, Tangrams, ProPara and Recipes demonstrate the effectiveness of LEMON: it achieves new state-of-the-art results on four of the tasks, and the execution-guided pre-training strategy brings remarkable improvements on all experimental tasks.

pdf bib abs
Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA
Junjie Huang | Wanjun Zhong | Qian Liu | Ming Gong | Daxin Jiang | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022

Retrieving evidences from tabular and textual resources is essential for open-domain question answering (OpenQA), which provides more comprehensive information. However, training an effective dense table-text retriever is difficult due to the challenges of table-text discrepancy and data sparsity problem. To address the above challenges, we introduce an optimized OpenQA Table-Text Retriever (OTTeR) to jointly retrieve tabular and textual evidences. Firstly, we propose to enhance mixed-modality representation learning via two mechanisms: modality-enhanced representation and mixed-modality negative sampling strategy. Secondly, to alleviate data sparsity problem and enhance the general retrieval ability, we conduct retrieval-centric mixed-modality synthetic pre-training. Experimental results demonstrate that OTTeR substantially improves the performance of table-and-text retrieval on the OTT-QA dataset. Comprehensive analyses examine the effectiveness of all the proposed mechanisms. Besides, equipped with OTTeR, our OpenQA system achieves the state-of-the-art result on the downstream QA task, with 10.1% absolute improvement in terms of the exact match over the previous best system.

pdf bib abs
SenticNet 7: A Commonsense-based Neurosymbolic AI Framework for Explainable Sentiment Analysis
Erik Cambria | Qian Liu | Sergio Decherchi | Frank Xing | Kenneth Kwok
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In recent years, AI research has demonstrated enormous potential for the benefit of humanity and society. While often better than its human counterparts in classification and pattern recognition tasks, however, AI still struggles with complex tasks that require commonsense reasoning such as natural language understanding. In this context, the key limitations of current AI models are: dependency, reproducibility, trustworthiness, interpretability, and explainability. In this work, we propose a commonsense-based neurosymbolic framework that aims to overcome these issues in the context of sentiment analysis. In particular, we employ unsupervised and reproducible subsymbolic techniques such as auto-regressive language models and kernel methods to build trustworthy symbolic representations that convert natural language to a sort of protolanguage and, hence, extract polarity from text in a completely interpretable and explainable manner.

2021

pdf bib abs
Chase: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL
Jiaqi Guo | Ziliang Si | Yu Wang | Qian Liu | Ming Fan | Jian-Guang Lou | Zijiang Yang | Ting Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The cross-database context-dependent Text-to-SQL (XDTS) problem has attracted considerable attention in recent years due to its wide range of potential applications. However, we identify two biases in existing datasets for XDTS: (1) a high proportion of context-independent questions and (2) a high proportion of easy SQL queries. These biases conceal the major challenges in XDTS to some extent. In this work, we present Chase, a large-scale and pragmatic Chinese dataset for XDTS. It consists of 5,459 coherent question sequences (17,940 questions with their SQL queries annotated) over 280 databases, in which only 35% of questions are context-independent, and 28% of SQL queries are easy. We experiment on Chase with three state-of-the-art XDTS approaches. The best approach only achieves an exact match accuracy of 40% over all questions and 16% over all question sequences, indicating that Chase highlights the challenging problems of XDTS. We believe that XDTS can provide fertile soil for addressing the problems.

pdf bib abs
ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering
Shuang Chen | Qian Liu | Zhiwei Yu | Chin-Yew Lin | Jian-Guang Lou | Feng Jiang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

We present Retriever-Transducer-Checker (ReTraCk), a neural semantic parsing framework for large scale knowledge base question answering (KBQA). ReTraCk is designed as a modular framework to maintain high flexibility. It includes a retriever to retrieve relevant KB items efficiently, a transducer to generate logical form with syntax correctness guarantees and a checker to improve transduction procedure. ReTraCk is ranked at top1 overall performance on the GrailQA leaderboard and obtains highly competitive performance on the typical WebQuestionsSP benchmark. Our system can interact with users timely, demonstrating the efficiency of the proposed framework.

2020

Despite the continuing efforts to improve the engagingness and consistency of chit-chat dialogue systems, the majority of current work simply focus on mimicking human-like responses, leaving understudied the aspects of modeling understanding between interlocutors. The research in cognitive science, instead, suggests that understanding is an essential signal for a high-quality chit-chat conversation. Motivated by this, we propose Pˆ2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding. Specifically, Pˆ2 Bot incorporates mutual persona perception to enhance the quality of personalized dialogue generation. Experiments on a large public dataset, Persona-Chat, demonstrate the effectiveness of our approach, with a considerable boost over the state-of-the-art baselines across both automatic metrics and human evaluations.

Meaning representation is an important component of semantic parsing. Although researchers have designed a lot of meaning representations, recent work focuses on only a few of them. Thus, the impact of meaning representation on semantic parsing is less understood. Furthermore, existing work’s performance is often not comprehensively evaluated due to the lack of readily-available execution engines. Upon identifying these gaps, we propose , a new unified benchmark on meaning representations, by integrating existing semantic parsing datasets, completing the missing logical forms, and implementing the missing execution engines. The resulting unified benchmark contains the complete enumeration of logical forms and execution engines over three datasets × four meaning representations. A thorough experimental study on Unimer reveals that neural semantic parsing approaches exhibit notably different performance when they are trained to generate different meaning representations. Also, program alias and grammar rules heavily impact the performance of different meaning representations. Our benchmark, execution engines and implementation can be found on: https://github.com/JasperGuo/Unimer.

pdf bib abs
Incomplete Utterance Rewriting as Semantic Segmentation
Qian Liu | Bei Chen | Jian-Guang Lou | Bin Zhou | Dongmei Zhang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent years the task of incomplete utterance rewriting has raised a large attention. Previous works usually shape it as a machine translation task and employ sequence to sequence based architecture with copy mechanism. In this paper, we present a novel and extensive approach, which formulates it as a semantic segmentation task. Instead of generating from scratch, such a formulation introduces edit operations and shapes the problem as prediction of a word-level edit matrix. Benefiting from being able to capture both local and global information, our approach achieves state-of-the-art performance on several public datasets. Furthermore, our approach is four times faster than the standard approach in inference.

In Natural Language Interfaces to Databases systems, the text-to-SQL technique allows users to query databases by using natural language questions. Though significant progress in this area has been made recently, most parsers may fall short when they are deployed in real systems. One main reason stems from the difficulty of fully understanding the users’ natural language questions. In this paper, we include human in the loop and present a novel parser-independent interactive approach (PIIA) that interacts with users using multi-choice questions and can easily work with arbitrary parsers. Experiments were conducted on two cross-domain datasets, the WikiSQL and the more complex Spider, with five state-of-the-art parsers. These demonstrated that PIIA is capable of enhancing the text-to-SQL performance with limited interaction turns by using both simulation and human evaluation.

2019

pdf bib abs
Leveraging Adjective-Noun Phrasing Knowledge for Comparison Relation Prediction in Text-to-SQL
Haoyan Liu | Lei Fang | Qian Liu | Bei Chen | Jian-Guang Lou | Zhoujun Li
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

One key component in text-to-SQL is to predict the comparison relations between columns and their values. To the best of our knowledge, no existing models explicitly introduce external common knowledge to address this problem, thus their capabilities of predicting comparison relations are limited beyond training data. In this paper, we propose to leverage adjective-noun phrasing knowledge mined from the web to predict the comparison relations in text-to-SQL. Experimental results on both the original and the re-split Spider dataset show that our approach achieves significant improvement over state-of-the-art methods on comparison relation prediction.

pdf bib abs
A Split-and-Recombine Approach for Follow-up Query Analysis
Qian Liu | Bei Chen | Haoyan Liu | Jian-Guang Lou | Lei Fang | Bin Zhou | Dongmei Zhang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Context-dependent semantic parsing has proven to be an important yet challenging task. To leverage the advances in context-independent semantic parsing, we propose to perform follow-up query analysis, aiming to restate context-dependent natural language queries with contextual information. To accomplish the task, we propose STAR, a novel approach with a well-designed two-phase process. It is parser-independent and able to handle multifarious follow-up scenarios in different domains. Experiments on the FollowUp dataset show that STAR outperforms the state-of-the-art baseline by a large margin of nearly 8%. The superiority on parsing results verifies the feasibility of follow-up query analysis. We also explore the extensibility of STAR on the SQA dataset, which is very promising.

2018

Distributed word representation plays a pivotal role in various natural language processing tasks. In spite of its success, most existing methods only consider contextual information, which is suboptimal when used in various tasks due to a lack of task-specific features. The rational word embeddings should have the ability to capture both the semantic features and task-specific features of words. In this paper, we propose a task-oriented word embedding method and apply it to the text classification task. With the function-aware component, our method regularizes the distribution of words to enable the embedding space to have a clear classification boundary. We evaluate our method using five text classification datasets. The experiment results show that our method significantly outperforms the state-of-the-art methods.