Bingqing Wang

2026

Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TDs. To remedy the issue, we propose ToolDreamer, a framework that conditions retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, showcasing its flexibility. With our proposed framework, we aim to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.

2025

Multi-Step Generation of Test Specifications using Large Language Models for System-Level Requirements
Dragan Milchevski | Gordon Frank | Anna Hätty | Bingqing Wang | Xiaowei Zhou | Zhe Feng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

System-level testing is a critical phase in the development of large, safety-dependent systems, such as those in the automotive industry. However, creating test specifications can be a time-consuming and error-prone process. This paper presents an AI-based assistant to aid users in creating test specifications for system-level requirements. The system mimics the working process of a test developer by utilizing an LLM and an agentic framework, and by introducing intermediate test artifacts, i.e., structured intermediate representations derived from input requirements. Our user study demonstrates a 30 to 40% reduction in effort required for test development. For test specification generation, our quantitative analysis reveals that iteratively providing the model with more targeted information, like examples of similar test specifications, based on comparable requirements and purposes, can boost the performance by up to 30% in ROUGE-L. Overall, our approach has the potential to improve the efficiency, accuracy, and reliability of system-level testing and can be applied to various industries where safety and functionality are paramount.

2023

Knowledge-Grounded Natural Language Recommendation Explanation
Anthony Colas | Jun Araki | Zhengyu Zhou | Bingqing Wang | Zhe Feng
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Explanations accompanying a recommendation can assist users in understanding the decision made by recommendation systems, which in turn increases a user’s confidence and trust in the system. Recently, research has focused on generating natural language explanations in a human-readable format. Thus far, the proposed approaches leverage item reviews written by users, which are often subjective, sparse in language, and unable to account for new items that have not been purchased or reviewed before. Instead, we aim to generate fact-grounded recommendation explanations that are objectively described with item features while implicitly considering a user’s preferences, based on the user’s purchase history. To achieve this, we propose a knowledge graph (KG) approach to natural language explainable recommendation. Our approach draws on user-item features through a novel collaborative filtering-based KG representation to produce fact-grounded, personalized explanations, while jointly learning user-item representations for recommendation scoring. Experimental results show that our approach consistently outperforms previous state-of-the-art models on natural language explainable recommendation metrics.

Collecting labeled data for Named Entity Recognition (NER) tasks is challenging due to the high cost of manual annotations. Instead, researchers have proposed few-shot self-training and rule-augmentation techniques to minimize the reliance on large datasets. However, inductive biases and restricted logical language lexicon, respectively, can limit the ability of these models to perform well. In this work, we propose CoAug, a co-augmentation framework that allows us to improve few-shot models and rule-augmentation models by bootstrapping predictions from each model. By leveraging rules and neural model predictions to train our models, we complement the benefits of each and achieve the best of both worlds. In our experiments, we show that our best CoAug model can outperform strong weak-supervision-based NER models by at least 6.5 F1 points.

Hallucination is a well-known phenomenon in text generated by large language models (LLMs). Hallucinatory responses are found in almost all application scenarios, e.g., summarization, question answering (QA), etc. For applications requiring high reliability (e.g., customer-facing assistants), the potential existence of hallucination in LLM-generated text is a critical problem. The amount of hallucination can be reduced by leveraging information retrieval to provide relevant background information to the LLM. However, LLMs can still generate hallucinatory content for various reasons (e.g., prioritizing their parametric knowledge over the context, failing to capture the relevant information from the context, etc.). Detecting hallucinations through automated methods is thus paramount. To facilitate research in this direction, we introduce a sophisticated dataset, DelucionQA, that captures hallucinations made by retrieval-augmented LLMs for a domain-specific QA task. Furthermore, we propose a set of hallucination detection methods to serve as baselines for future works from the research community. Analysis and case study are also provided to share valuable insights on hallucination phenomena in the target scenario.

2021

A New Approach to Overgenerating and Scoring Abstractive Summaries
Kaiqiang Song | Bingqing Wang | Zhe Feng | Fei Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a new approach to generate multiple variants of the target summary with diverse content and varying lengths, then score and select admissible ones according to users’ needs. Abstractive summarizers trained on single reference summaries may struggle to produce outputs that achieve multiple desirable properties, i.e., capturing the most important information, being faithful to the original, and being grammatical and fluent. In this paper, we propose a two-staged strategy to generate a diverse set of candidate summaries from the source text in stage one, then score and select admissible ones in stage two. Importantly, our generator gives precise control over the length of the summary, which is especially well-suited when space is limited. Our selectors are designed to predict the optimal summary length and put special emphasis on faithfulness to the original text. Both stages can be effectively trained, optimized and evaluated. Our experiments on benchmark summarization datasets suggest that this paradigm can achieve state-of-the-art performance.

Modeling Endorsement for Multi-Document Abstractive Summarization
Logan Lebanoff | Bingqing Wang | Zhe Feng | Fei Liu
Proceedings of the Third Workshop on New Frontiers in Summarization

A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s). While such content may appear at the beginning of a single document, essential information is frequently reiterated in a set of documents related to a particular topic, resulting in an endorsement effect that increases information salience. In this paper, we model the cross-document endorsement effect and its utilization in multi-document summarization. Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents. Strongly endorsed text segments are used to enrich a neural encoder-decoder model to consolidate them into an abstractive summary. The method has great potential to learn from fewer examples to identify salient content, which alleviates the need for costly retraining when the set of documents is dynamically adjusted. Through extensive experiments on benchmark multi-document summarization datasets, we demonstrate the effectiveness of our proposed method over strong published baselines. Finally, we shed light on future research directions and discuss broader challenges of this task using a case study.