2024
Comparing Neighbors Together Makes it Easy: Jointly Comparing Multiple Candidates for Efficient and Effective Retrieval
Jonghyun Song | Cheyon Jin | Wenlong Zhao | Andrew McCallum | Jay-Yoon Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
A common retrieve-and-rerank paradigm involves retrieving relevant candidates from a broad set using a fast bi-encoder (BE), followed by applying expensive but accurate cross-encoders (CE) to a limited candidate set. However, relying on this small subset is often susceptible to error propagation from the bi-encoders, which limits the overall performance. To address these issues, we propose the Comparing Multiple Candidates (CMC) framework. CMC compares a query and multiple embeddings of similar candidates (i.e., neighbors) through shallow self-attention layers, delivering rich representations contextualized to each other. Furthermore, CMC is scalable enough to handle multiple comparisons simultaneously. For example, comparing ~10K candidates with CMC takes a similar amount of time as comparing 16 candidates with CE. Experimental results on the ZeSHEL dataset demonstrate that CMC, when plugged in between bi-encoders and cross-encoders as a seamless intermediate reranker (BE-CMC-CE), can effectively improve recall@k (+6.7%-p, +3.5%-p for R@16, R@64) compared to using only bi-encoders (BE-CE), with negligible slowdown (<7%). Additionally, to verify CMC’s effectiveness as the final-stage reranker in improving top-1 accuracy, we conduct experiments on downstream tasks such as entity, passage, and dialogue ranking. The results indicate that CMC is not only faster (11x) but also often more effective than CE, with improved prediction accuracy in Wikipedia entity linking (+0.7%-p) and DSTC7 dialogue ranking (+3.3%-p).
Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation
Jiachen Zhao | Wenlong Zhao | Andrew Drozdov | Benjamin Rozonoyer | Md Arafat Sultan | Jay-Yoon Lee | Mohit Iyyer | Andrew McCallum
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We study semi-supervised sequence generation tasks, where the few labeled examples are too scarce to finetune a model, and meanwhile, few-shot prompted large language models (LLMs) exhibit room for improvement. In this paper, we present the discovery that a student model distilled from a few-shot prompted LLM can commonly generalize better than its teacher to unseen examples on such tasks. We find that the student is able to learn a general pattern from the high-quality pseudolabels produced by the teacher during knowledge distillation (KD) and, favorably, does not learn a general pattern from the low-quality pseudolabels. Leveraging this discovery, we propose a new method, Multistage Collaborative Knowledge Distillation from an LLM (MCKD), for these tasks. MCKD first few-shot prompts an LLM to produce pseudolabels for unlabeled data. Then, at each stage of an iterative KD process, a new pair of students is trained on disjoint partitions of the pseudolabeled data, and each produces new and improved pseudolabels for its unseen partition. We conduct extensive experiments on four syntactic and semantic parsing datasets and show the effectiveness of MCKD for low-resource semi-supervised sequence generation. On CRAFT biomedical parsing, for example, 3-stage MCKD with 50 labeled examples outperforms an LLM teacher and vanilla KD by 7.5% and 3.7% parsing F1, respectively, and matches the performance of supervised finetuning with 500 labeled examples.
WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models
Wenlong Zhao | Debanjan Mondal | Niket Tandon | Danica Dillion | Kurt Gray | Yuling Gu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The awareness of multi-cultural human values is critical to the ability of language models (LMs) to generate safe and personalized responses. However, this awareness of LMs has been insufficiently studied, since the computer science community lacks access to large-scale real-world data about multi-cultural values. In this paper, we present WorldValuesBench, a globally diverse, large-scale benchmark dataset for the multi-cultural value prediction task, which requires a model to generate a rating response to a value question based on demographic contexts. Our dataset is derived from an influential social science project, World Values Survey (WVS), that has collected answers to hundreds of value questions (e.g., social, economic, ethical) from 94,728 participants worldwide. We have constructed more than 20 million examples of the type “(demographic attributes, value question) → answer” from the WVS responses. We perform a case study using our dataset and show that the task is challenging for strong open and closed-source models. On merely 11.1%, 25.0%, 72.2%, and 75.0% of the questions, Alpaca-7B, Vicuna-7B-v1.5, Mixtral-8x7B-Instruct-v0.1, and GPT-3.5 Turbo can respectively achieve <0.2 Wasserstein 1-distance from the human normalized answer distributions. WorldValuesBench opens up new research avenues in studying limitations and opportunities in multi-cultural value awareness of LMs.
2023
ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution
Ankita Gupta | Marzena Karpinska | Wenlong Zhao | Kalpesh Krishna | Jack Merullo | Luke Yeh | Mohit Iyyer | Brendan O’Connor
Findings of the Association for Computational Linguistics: EACL 2023
Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreferences and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that reasonable quality annotations were already achievable (90% agreement between the crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify the presence of linguistic cases that our annotators unanimously agree upon but lack unified treatments (e.g., generic pronouns, appositives) in existing datasets. We propose the research community should revisit these phenomena when curating future unified annotation guidelines.
Machine Reading Comprehension using Case-based Reasoning
Dung Thai | Dhruv Agarwal | Mudit Chaudhary | Wenlong Zhao | Rajarshi Das | Jay-Yoon Lee | Hannaneh Hajishirzi | Manzil Zaheer | Andrew McCallum
Findings of the Association for Computational Linguistics: EMNLP 2023
We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds upon the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a test question, CBR-MRC first retrieves a set of similar cases from a nonparametric memory and then predicts an answer by selecting the span in the test context that is most similar to the contextualized representations of answers in the retrieved cases. The semi-parametric nature of our approach allows it to attribute a prediction to the specific set of evidence cases, making it a desirable choice for building reliable and debuggable QA systems. We show that CBR-MRC provides high accuracy comparable with large reader models and outperforms baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability of CBR-MRC in identifying not just the correct answer tokens but also the span with the most relevant supporting evidence. Lastly, we observe that contexts for certain question types show higher lexical diversity than others and find that CBR-MRC is robust to these variations while performance using fully-parametric methods drops.
Editing Common Sense in Transformers
Anshita Gupta | Debanjan Mondal | Akshay Sheshadri | Wenlong Zhao | Xiang Li | Sarah Wiegreffe | Niket Tandon
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Editing model parameters directly in Transformers makes updating open-source transformer-based models possible without re-training. However, these editing methods have only been evaluated on statements about encyclopedic knowledge with a single correct answer. Commonsense knowledge with multiple correct answers, e.g., an apple can be green or red but not transparent, has not been studied but is as essential for enhancing transformers’ reliability and usefulness. In this paper, we investigate whether commonsense judgments are causally associated with localized, editable parameters in Transformers, and we provide an affirmative answer. We find that directly applying the MEMIT editing algorithm results in sub-par performance and improve it for the commonsense domain by varying edit tokens and improving the layer selection strategy, i.e., MEMITCSK. GPT-2 Large and XL models edited using MEMITCSK outperform best-fine-tuned baselines by 10.97% and 10.73% F1 scores on PEP3k and 20Q datasets. In addition, we propose a novel evaluation dataset, PROBE SET, that contains unaffected and affected neighborhoods, affected paraphrases, and affected reasoning challenges. MEMITCSK performs well across the metrics while fine-tuning baselines show significant trade-offs between unaffected and affected metrics. These results suggest a compelling future direction for incorporating feedback about common sense into Transformers through direct model editing.
2022
ConReader: Exploring Implicit Relations in Contracts for Contract Clause Extraction
Weiwen Xu | Yang Deng | Wenqiang Lei | Wenlong Zhao | Tat-Seng Chua | Wai Lam
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
We study automatic Contract Clause Extraction (CCE) by modeling implicit relations in legal contracts. Existing CCE methods mostly treat contracts as plain text, creating a substantial barrier to understanding contracts of high complexity. In this work, we first comprehensively analyze the complexity issues of contracts and distill out three implicit relations commonly found in contracts, namely, 1) Long-range Context Relation that captures the correlations of distant clauses; 2) Term-Definition Relation that captures the relation between important terms with their corresponding definitions, and 3) Similar Clause Relation that captures the similarities between clauses of the same type. Then we propose a novel framework ConReader to exploit the above three relations for better contract understanding and improving CCE. Experimental results show that ConReader makes the prediction more interpretable and achieves new state-of-the-art on two CCE tasks in both conventional and zero-shot settings.
2021
IGA: An Intent-Guided Authoring Assistant
Simeng Sun | Wenlong Zhao | Varun Manjunatha | Rajiv Jain | Vlad Morariu | Franck Dernoncourt | Balaji Vasan Srinivasan | Mohit Iyyer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
While large-scale pretrained language models have significantly improved writing assistance functionalities such as autocomplete, more complex and controllable writing assistants have yet to be explored. We leverage advances in language modeling to build an interactive writing assistant that generates and rephrases text according to fine-grained author specifications. Users provide input to our Intent-Guided Assistant (IGA) in the form of text interspersed with tags that correspond to specific rhetorical directives (e.g., adding description or contrast, or rephrasing a particular sentence). We fine-tune a language model on a dataset heuristically-labeled with author intent, which allows IGA to fill in these tags with generated text that users can subsequently edit to their liking. A series of automatic and crowdsourced evaluations confirm the quality of IGA’s generated outputs, while a small-scale user study demonstrates author preference for IGA over baseline methods in a creative writing task. We release our dataset, code, and demo to spur further research into AI-assisted writing.
2020
Compressing Transformer-Based Semantic Parsing Models using Compositional Code Embeddings
Prafull Prakash | Saurabh Kumar Shashidhar | Wenlong Zhao | Subendhu Rongali | Haidar Khan | Michael Kayser
Findings of the Association for Computational Linguistics: EMNLP 2020
The current state-of-the-art task-oriented semantic parsing models use BERT or RoBERTa as pretrained encoders; these models have huge memory footprints. This poses a challenge to their deployment for voice assistants such as Amazon Alexa and Google Assistant on edge devices with limited memory budgets. We propose to learn compositional code embeddings to greatly reduce the sizes of BERT-base and RoBERTa-base. We also apply the technique to DistilBERT, ALBERT-base, and ALBERT-large, three already compressed BERT variants which attain similar state-of-the-art performances on semantic parsing with much smaller model sizes. We observe 95.15% to 98.46% embedding compression rates and 20.47% to 34.22% encoder compression rates, while preserving >97.5% semantic parsing performances. We provide the recipe for training and analyze the trade-off between code embedding sizes and downstream performances.
2016
A Customizable Editor for Text Simplification
John Lee | Wenlong Zhao | Wenxiu Xie
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations
We present a browser-based editor for simplifying English text. Given an input sentence, the editor performs both syntactic and lexical simplification. It splits a complex sentence into shorter ones, and suggests word substitutions in drop-down lists. The user can choose the best substitution from the list, undo any inappropriate splitting, and further edit the sentence as necessary. A significant novelty is that the system accepts a customized vocabulary list for a target reader population. It identifies all words in the text that do not belong to the list, and attempts to substitute them with words from the list, thus producing a text tailored for the targeted readers.