Pengjie Ren


pdf bib
Intent-calibrated Self-training for Answer Selection in Open-domain Dialogues
Wentao Deng | Jiahuan Pei | Zhaochun Ren | Zhumin Chen | Pengjie Ren
Transactions of the Association for Computational Linguistics, Volume 11

Answer selection in open-domain dialogues aims to select an accurate answer from candidates. The recent success of answer selection models hinges on training with large amounts of labeled data. However, collecting large-scale labeled data is labor-intensive and time-consuming. In this paper, we introduce the predicted intent labels to calibrate answer labels in a self-training paradigm. Specifically, we propose intent-calibrated self-training (ICAST) to improve the quality of pseudo answer labels through the intent-calibrated answer selection paradigm, in which we employ pseudo intent labels to help improve pseudo answer labels. We carry out extensive experiments on two benchmark datasets with open-domain dialogues. The experimental results show that ICAST outperforms baselines consistently with 1%, 5%, and 10% labeled data. Specifically, it improves 2.06% and 1.00% of F1 score on the two datasets, compared with the strongest baseline with only 5% labeled data.

pdf bib
Syllogistic Reasoning for Legal Judgment Analysis
Wentao Deng | Jiahuan Pei | Keyi Kong | Zhe Chen | Furu Wei | Yujun Li | Zhaochun Ren | Zhumin Chen | Pengjie Ren
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Legal judgment assistants are developing fast due to impressive progress of large language models (LLMs). However, people can hardly trust the results generated by a model without reliable analysis of legal judgement. For legal practitioners, it is common practice to utilize syllogistic reasoning to select and evaluate the arguments of the parties as part of the legal decision-making process. But the development of syllogistic reasoning for legal judgment analysis is hindered by the lack of resources: (1) there is no large-scale syllogistic reasoning dataset for legal judgment analysis, and (2) there is no set of established benchmarks for legal judgment analysis. In this paper, we construct and manually correct a syllogistic reasoning dataset for legal judgment analysis. The dataset contains 11,239 criminal cases which cover 4 criminal elements, 80 charges and 124 articles. We also select a set of large language models as benchmarks, and conduct a in-depth analysis of the capacity of their legal judgment analysis.

pdf bib
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
Weiwei Sun | Lingyong Yan | Xinyu Ma | Shuaiqiang Wang | Pengjie Ren | Zhumin Chen | Dawei Yin | Zhaochun Ren
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks, including search engines. However, existing work utilizes the generative ability of LLMs for Information Retrieval (IR) rather than direct passage ranking. The discrepancy between the pre-training objectives of LLMs and the ranking objective poses another challenge. In this paper, we first investigate generative LLMs such as ChatGPT and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal that properly instructed LLMs can deliver competitive, even superior results to state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to address concerns about data contamination of LLMs, we collect a new test set called NovelEval, based on the latest knowledge and aiming to verify the model’s ability to rank unknown knowledge. Finally, to improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models using a permutation distillation scheme. Our evaluation results turn out that a distilled 440M model outperforms a 3B supervised model on the BEIR benchmark. The code to reproduce our results is available at

pdf bib
Answering Ambiguous Questions via Iterative Prompting
Weiwei Sun | Hengyi Cai | Hongshen Chen | Pengjie Ren | Zhumin Chen | Maarten de Rijke | Zhaochun Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist. To provide feasible answers to an ambiguous question,one approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity. An alternative is to gather candidate answers and aggregate them, but this method can be computationally costly and may neglect dependencies among answers. In this paper, we present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions. Specifically, we integrate an answering model with a prompting model in an iterative manner. The prompting model adaptively tracks the reading process and progressively triggers the answering model to compose distinct and relevant answers. Additionally, we develop a task-specific post-pretraining approach for both the answering model and the prompting model, which greatly improves the performance of our framework. Empirical studies on two commonly-used open benchmarks show that AmbigPrompt achieves state-of-the-art or competitive results while using less memory and having a lower inference latency than competing approaches. Additionally, AmbigPrompt also performs well in low-resource settings.

pdf bib
RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue
Zhengliang Shi | Weiwei Sun | Shuo Zhang | Zhen Zhang | Pengjie Ren | Zhaochun Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluating open-domain dialogue systems is challenging for reasons such as the one-to-many problem, i.e., many appropriate responses other than just the golden response. As of now, automatic evaluation methods need better consistency with humans, while reliable human evaluation can be time- and cost-intensive. To this end, we propose the Reference-Assisted Dialogue Evaluation (RADE) approach under the multi-task learning framework, which leverages the pre-created utterance as reference other than the gold response to relief the one-to-many problem. Specifically, RADE explicitly compares reference and the candidate response to predict their overall scores. Moreover, an auxiliary response generation task enhances prediction via a shared encoder. To support RADE, we extend three datasets with additional rated responses other than just a golden response by human annotation. Experiments on our three datasets and two existing benchmarks demonstrate the effectiveness of our method, where Pearson, Spearman, and Kendall correlations with human evaluation outperform state-of-the-art baselines.

pdf bib
Generative Knowledge Selection for Knowledge-Grounded Dialogues
Weiwei Sun | Pengjie Ren | Zhaochun Ren
Findings of the Association for Computational Linguistics: EACL 2023

Knowledge selection is the key in knowledge-grounded dialogues (KGD), which aims to select an appropriate knowledge snippet to be used in the utterance based on dialogue history. Previous studies mainly employ the classification approach to classify each candidate snippet as “relevant” or “irrelevant” independently. However, such approaches neglect the interactions between snippets, leading to difficulties in inferring the meaning of snippets. Moreover, they lack modeling of the discourse structure of dialogue-knowledge interactions. We propose a simple yet effective generative approach for knowledge selection, called GenKS. GenKS learns to select snippets by generating their identifiers with a sequence-to-sequence model. GenKS therefore captures intra-knowledge interaction inherently through attention mechanisms. Meanwhile, we devise a hyperlink mechanism to model the dialogue-knowledge interactions explicitly. We conduct experiments on three benchmark datasets, and verify GenKS achieves the best results on both knowledge selection and response generation.

pdf bib
UMSE: Unified Multi-scenario Summarization Evaluation
Shen Gao | Zhitao Yao | Chongyang Tao | Xiuying Chen | Pengjie Ren | Zhaochun Ren | Zhumin Chen
Findings of the Association for Computational Linguistics: ACL 2023

Summarization quality evaluation is a non-trivial task in text summarization. Contemporary methods can be mainly categorized into two scenarios: (1) reference-based: evaluating with human-labeled reference summary; (2) reference-free: evaluating the summary consistency of the document. Recent studies mainly focus on one of these scenarios and explore training neural models built on PLMs to align with human criteria. However, the models from different scenarios are optimized individually, which may result in sub-optimal performance since they neglect the shared knowledge across different scenarios. Besides, designing individual models for each scenario caused inconvenience to the user. Inspired by this, we propose Unified Multi-scenario Summarization Evaluation Model (UMSE). More specifically, we propose a perturbed prefix tuning method to share cross-scenario knowledge between scenarios and use a self-supervised training paradigm to optimize the model without extra human labeling. Our UMSE is the first unified summarization evaluation framework engaged with the ability to be used in three evaluation scenarios. Experimental results across three typical scenarios on the benchmark dataset SummEval indicate that our UMSE can achieve comparable performance with several existing strong methods which are specifically designed for each scenario.

pdf bib
Multi-Defendant Legal Judgment Prediction via Hierarchical Reasoning
Yougang Lyu | Jitai Hao | Zihan Wang | Kai Zhao | Shen Gao | Pengjie Ren | Zhumin Chen | Fang Wang | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2023

Multiple defendants in a criminal fact description generally exhibit complex interactions, and cannot be well handled by existing Legal Judgment Prediction (LJP) methods which focus on predicting judgment results (e.g., law articles, charges, and terms of penalty) for single-defendant cases. To address this problem, we propose the task of multi-defendant LJP, which aims to automatically predict the judgment results for each defendant of multi-defendant cases. Two challenges arise with the task of multi-defendant LJP: (1) indistinguishable judgment results among various defendants; and (2) the lack of a real-world dataset for training and evaluation. To tackle the first challenge, we formalize the multi-defendant judgment process as hierarchical reasoning chains and introduce a multi-defendant LJP method, named Hierarchical Reasoning Network (HRN), which follows the hierarchical reasoning chains to determine criminal relationships, sentencing circumstances, law articles, charges, and terms of penalty for each defendant. To tackle the second challenge, we collect a real-world multi-defendant LJP dataset, namely MultiLJP, to accelerate the relevant research in the future. Extensive experiments on MultiLJP verify the effectiveness of our proposed HRN.

pdf bib
Generalizing Few-Shot Named Entity Recognizers to Unseen Domains with Type-Related Features
Zihan Wang | Ziqi Zhao | Zhumin Chen | Pengjie Ren | Maarten de Rijke | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2023

Few-shot named entity recognition (NER) has shown remarkable progress in identifying entities in low-resource domains. However, few-shot NER methods still struggle with out-of-domain (OOD) examples due to their reliance on manual labeling for the target domain. To address this limitation, recent studies enable generalization to an unseen target domain with only a few labeled examples using data augmentation techniques. Two important challenges remain: First, augmentation is limited to the training data, resulting in minimal overlap between the generated data and OOD examples. Second, knowledge transfer is implicit and insufficient, severely hindering model generalizability and the integration of knowledge from the source domain. In this paper, we propose a framework, prompt learning with type-related features (PLTR), to address these challenges. To identify useful knowledge in the source domain and enhance knowledge transfer, PLTR automatically extracts entity type-related features (TRFs) based on mutual information criteria. To bridge the gap between training and OOD data, PLTR generates a unique prompt for each unseen example by selecting relevant TRFs. We show that PLTR achieves significant performance improvements on in-domain and cross-domain datasets. The use of PLTR facilitates model adaptation and increases representation similarities between the source and unseen domains.

pdf bib
Towards a Unified Framework for Reference Retrieval and Related Work Generation
Zhengliang Shi | Shen Gao | Zhen Zhang | Xiuying Chen | Zhumin Chen | Pengjie Ren | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2023

The task of related work generation aims to generate a comprehensive survey of related research topics automatically, saving time and effort for authors. Existing methods simplify this task by using human-annotated references in a large-scale scientific corpus as information sources, which is time- and cost-intensive. To this end, we propose a Unified Reference Retrieval and Related Work Generation Model (UR3WG), which combines reference retrieval and related work generation processes in a unified framework based on the large language model (LLM). Specifically, UR3WG first leverages the world knowledge of LLM to extend the abstract and generate the query for the subsequent retrieval stage. Then a lexicon-enhanced dense retrieval is proposed to search relevant references, where an importance-aware representation of the lexicon is introduced. We also propose multi-granularity contrastive learning to optimize our retriever. Since this task is not simply summarizing the main points in references, it should analyze the complex relationships and present them logically. We propose an instruction-tuning method to leverage LLM to generate related work. Extensive experiments on two wide-applied datasets demonstrate that our model outperforms the state-of-the-art baselines in both generation and retrieval metrics.


pdf bib
Improving Multi-label Malevolence Detection in Dialogues through Multi-faceted Label Correlation Enhancement
Yangjun Zhang | Pengjie Ren | Wentao Deng | Zhumin Chen | Maarten Rijke
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A dialogue response is malevolent if it is grounded in negative emotions, inappropriate behavior, or an unethical value basis in terms of content and dialogue acts. The detection of malevolent dialogue responses is attracting growing interest. Current research on detecting dialogue malevolence has limitations in terms of datasets and methods. First, available dialogue datasets related to malevolence are labeled with a single category, but in practice assigning a single category to each utterance may not be appropriate as some malevolent utterances belong to multiple labels. Second, current methods for detecting dialogue malevolence neglect label correlation. Therefore, we propose the task of multi-label dialogue malevolence detection and crowdsource a multi-label dataset, multi-label dialogue malevolence detection (MDMD) for evaluation. We also propose a multi-label malevolence detection model, multi-faceted label correlation enhanced CRF (MCRF), with two label correlation mechanisms, label correlation in taxonomy (LCT) and label correlation in context (LCC). Experiments on MDMD show that our method outperforms the best performing baseline by a large margin, i.e., 16.1%, 11.9%, 12.0%, and 6.1% on precision, recall, F1, and Jaccard score, respectively.


pdf bib
A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues
Yangjun Zhang | Pengjie Ren | Maarten de Rijke
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Conversational dialogue systems (CDSs) are hard to evaluate due to the complexity of natural language. Automatic evaluation of dialogues often shows insufficient correlation with human judgements. Human evaluation is reliable but labor-intensive. We introduce a human-machine collaborative framework, HMCEval, that can guarantee reliability of the evaluation outcomes with reduced human effort. HMCEval casts dialogue evaluation as a sample assignment problem, where we need to decide to assign a sample to a human or a machine for evaluation. HMCEval includes a model confidence estimation module to estimate the confidence of the predicted sample assignment, and a human effort estimation module to estimate the human effort should the sample be assigned to human evaluation, as well as a sample assignment execution module that finds the optimum assignment solution based on the estimated confidence and effort. We assess the performance of HMCEval on the task of evaluating malevolence in dialogues. The experimental results show that HMCEval achieves around 99% evaluation accuracy with half of the human effort spared, showing that HMCEval provides reliable evaluation outcomes while reducing human effort by a large amount.

pdf bib
Learning to Ask Conversational Questions by Optimizing Levenshtein Distance
Zhongkun Liu | Pengjie Ren | Zhumin Chen | Zhaochun Ren | Maarten de Rijke | Ming Zhou
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Conversational Question Simplification (CQS) aims to simplify self-contained questions into conversational ones by incorporating some conversational characteristics, e.g., anaphora and ellipsis. Existing maximum likelihood estimation based methods often get trapped in easily learned tokens as all tokens are treated equally during training. In this work, we introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance through explicit editing actions. RISE is able to pay attention to tokens that are related to conversational characteristics. To train RISE, we devise an Iterative Reinforce Training (IRT) algorithm with a Dynamic Programming based Sampling (DPS) process to improve exploration. Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods and generalizes well on unseen data.


pdf bib
EmpDG: Multi-resolution Interactive Empathetic Dialogue Generation
Qintong Li | Hongshen Chen | Zhaochun Ren | Pengjie Ren | Zhaopeng Tu | Zhumin Chen
Proceedings of the 28th International Conference on Computational Linguistics

A humanized dialogue system is expected to generate empathetic replies, which should be sensitive to the users’ expressed emotion. The task of empathetic dialogue generation is proposed to address this problem. The essential challenges lie in accurately capturing the nuances of human emotion and considering the potential of user feedback, which are overlooked by the majority of existing work. In response to this problem, we propose a multi-resolution adversarial model – EmpDG, to generate more empathetic responses. EmpDG exploits both the coarse-grained dialogue-level and fine-grained token-level emotions, the latter of which helps to better capture the nuances of user emotion. In addition, we introduce an interactive adversarial learning framework which exploits the user feedback, to identify whether the generated responses evoke emotion perceptivity in dialogues. Experimental results show that the proposed approach significantly outperforms the state-of-the-art baselines in both content quality and emotion perceptivity.


pdf bib
Entity Linking for Queries by Searching Wikipedia Sentences
Chuanqi Tan | Furu Wei | Pengjie Ren | Weifeng Lv | Ming Zhou
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a simple yet effective approach for linking entities in queries. The key idea is to search sentences similar to a query from Wikipedia articles and directly use the human-annotated entities in the similar sentences as candidate entities for the query. Then, we employ a rich set of features, such as link-probability, context-matching, word embeddings, and relatedness among candidate entities as well as their related entities, to rank the candidates under a regression based framework. The advantages of our approach lie in two aspects, which contribute to the ranking process and final linking result. First, it can greatly reduce the number of candidate entities by filtering out irrelevant entities with the words in the query. Second, we can obtain the query sensitive prior probability in addition to the static link-probability derived from all Wikipedia articles. We conduct experiments on two benchmark datasets on entity linking for queries, namely the ERD14 dataset and the GERDAQ dataset. Experimental results show that our method outperforms state-of-the-art systems and yields 75.0% in F1 on the ERD14 dataset and 56.9% on the GERDAQ dataset.


pdf bib
A Redundancy-Aware Sentence Regression Framework for Extractive Summarization
Pengjie Ren | Furu Wei | Zhumin Chen | Jun Ma | Ming Zhou
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Existing sentence regression methods for extractive summarization usually model sentence importance and redundancy in two separate processes. They first evaluate the importance f(s) of each sentence s and then select sentences to generate a summary based on both the importance scores and redundancy among sentences. In this paper, we propose to model importance and redundancy simultaneously by directly evaluating the relative importance f(s|S) of a sentence s given a set of selected sentences S. Specifically, we present a new framework to conduct regression with respect to the relative gain of s given S calculated by the ROUGE metric. Besides the single sentence features, additional features derived from the sentence relations are incorporated. Experiments on the DUC 2001, 2002 and 2004 multi-document summarization datasets show that the proposed method outperforms state-of-the-art extractive summarization approaches.