Bill Yuchen Lin - ACL Anthology

Bill Yuchen Lin

Also published as: Bill Y. Lin

2025

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Yifan Song | Guoyin Wang | Sujian Li | Bill Yuchen Lin
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the performance differences between greedy decoding and sampling, identifying benchmarks’ consistency regarding non-determinism, and examining unique model behaviors. Through extensive experiments, we observe that greedy decoding generally outperforms sampling methods for most evaluated tasks. We also observe consistent performance across different LLM sizes and alignment methods, noting that alignment can reduce sampling variance. Moreover, our best-of-N sampling approach demonstrates that smaller LLMs can match or surpass larger models such as GPT-4-Turbo, highlighting the untapped potential of smaller LLMs. This research shows the importance of considering non-determinism in LLM evaluations and provides insights for future LLM development and evaluation.

On Memorization of Large Language Models in Logical Reasoning
Chulin Xie | Yangsibo Huang | Chiyuan Zhang | Da Yu | Xinyun Chen | Bill Yuchen Lin | Bo Li | Badih Ghazi | Ravi Kumar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs’ reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using two dynamically generated logical reasoning benchmarks based on Knights and Knaves (K&K) puzzles and Zebra puzzles (DynamicZebra). We find that LLMs could interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. Through in-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers, we establish that LLMs develop reasoning skills on logical puzzles alongside memorization. Finally, our analysis based on a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles.

RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert | Valentina Pyatkin | Jacob Morrison | LJ Miranda | Bill Yuchen Lin | Khyathi Chandu | Nouha Dziri | Sachin Kumar | Tom Zick | Yejin Choi | Noah A. Smith | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: NAACL 2025

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate RMs trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.

L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
Yutaro Yamada | Khyathi Chandu | Bill Yuchen Lin | Jack Hessel | Ilker Yildirim | Yejin Choi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, thereby out-of-distribution descriptions, such as “a chair with five legs”. In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D construction of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial-and-error within the 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender where language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation.

SimulBench: Evaluating Language Models with Creative Simulation Tasks
Qi Jia | Xiang Yue | Tuney Zheng | Jie Huang | Bill Yuchen Lin
Findings of the Association for Computational Linguistics: NAACL 2025

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation tasks, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM’s general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these creative simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.

CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs’ Cultural Knowledge Through Human-AI Red-Teaming
Yu Ying Chiu | Liwei Jiang | Bill Yuchen Lin | Chan Young Park | Shuyue Stella Li | Sahithya Ravi | Mehar Bhatia | Maria Antoniak | Yulia Tsvetkov | Vered Shwartz | Yejin Choi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Robust, diverse, and challenging cultural knowledge benchmarks are essential for measuring our progress towards making LMs that are helpful across diverse cultures. We introduce CulturalBench: a set of 1,696 human-written and human-verified questions to assess LMs’ cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions are each verified by five independent annotators and span 17 diverse topics ranging from food preferences to greeting etiquette. We construct CulturalBench using methods inspired by Human-AI Red-Teaming. Compared to human performance (92.4% accuracy), the hard version of CulturalBench is challenging even for the best-performing frontier LMs, ranging from 28.7% to 61.5% in accuracy. We find that LMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to overfit to a single answer. Our results indicate that GPT-4o substantially outperform other models across cultures, besting local providers (e.g., Mistral on European culture and DeepSeek on Chinese culture). Across the board, models under-perform on questions related to North Africa, South America and Middle East.

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria-like helpfulness and harmlessness-which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.

Stronger Models are Not Always Stronger Teachers for Instruction Tuning
Zhangchen Xu | Fengqing Jiang | Luyao Niu | Bill Yuchen Lin | Radha Poovendran
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions and engage with users meaningfully. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution to provide LLMs diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt larger models as response generators to the synthetic instructions. In this paper, we challenge this commonly-adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models’ Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators since they ignore the compatibility between teachers and base models being fine-tuned. We thus develop a novel metric, named as Compatibility-Adjusted Reward (CAR) to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang | Zhangchen Xu | Yuetai Li | Luyao Niu | Zhen Xiang | Bo Li | Bill Yuchen Lin | Radha Poovendran
Findings of the Association for Computational Linguistics: ACL 2025

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 13 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models
Abhilasha Ravichander | Jillian Fisher | Taylor Sorensen | Ximing Lu | Maria Antoniak | Bill Yuchen Lin | Niloofar Mireshghallah | Chandra Bhagavatula | Yejin Choi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. This lack of transparency creates multiple challenges: it limits external oversight and inspection of LLMs for issues such as copyright infringement, it undermines the agency of data authors, and it hinders scientific research on critical issues such as data contamination and data selection. How can we recover what training data is known to LLMs? In this work we demonstrate a new method to identify training data known to proprietary LLMs like GPT-4 without requiring any access to model weights or token probabilities, by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes. By evaluating a model’s ability to successfully reconstruct high-surprisal tokens in text, we can identify a surprising number of texts memorized by LLMs.

Small Models Struggle to Learn from Strong Reasoners
Yuetai Li | Xiang Yue | Zhangchen Xu | Fengqing Jiang | Luyao Niu | Bill Yuchen Lin | Bhaskar Ramasubramanian | Radha Poovendran
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.

2024

The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis)based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman’s correlation betweenVideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result onother held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with humanjudges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.

Agent Lumos: Unified and Modular Training for Open-Source Language Agents
Da Yin | Faeze Brahman | Abhilasha Ravichander | Khyathi Chandu | Kai-Wei Chang | Yejin Choi | Bill Yuchen Lin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Closed-source agents suffer from several issues such as a lack of affordability, transparency, and reproducibility, particularly on complex interactive tasks. This motivates the development of open-source alternatives. We introduce Lumos, one of the first frameworks for training open-source LLM-based agents. Lumos features a learnable, unified and modular architecture with a planning module that learns high-level subgoal generation, and a grounding module trained to translate these into the actions using various tools in the execution module. The design allows for modular upgrades and wider applicability to diverse interactive tasks. To foster generalizable agent learning, we collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales across various complex interactive tasks. On 9 datasets, Lumos exhibits several key advantages: (1) Lumos excels multiple larger open-source agents on the held-out datasets (unused for training) for each task type. Lumos even surpasses GPT agents on QA and web tasks; (2) Lumos outperforms open-source agents produced by chain-of-thoughts and unmodularized integrated training; and (3) Lumos effectively generalizes to unseen tasks, outperforming 33B-scale agents and domain-specific agents. Code and data will be released.

Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents
Yifan Song | Da Yin | Xiang Yue | Jie Huang | Sujian Li | Bill Yuchen Lin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have become integral components in various autonomous agent systems.In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn from their exploration failures. This leads to improved performance through an iterative optimization framework. During the exploration phase, the agent interacts with the environment while completing given tasks, gathering failure trajectories to create contrastive trajectory pairs. In the subsequent training phase, the agent utilizes these trajectory preference pairs to update its policy using contrastive learning methods like DPO. This iterative cycle of exploration and training fosters continued improvement in the agents. Our experiments on three complex tasks demonstrate that ETO consistently surpasses baseline performance by a large margin. Furthermore, an examination of task-solving efficiency and potential in scenarios lacking expert trajectory underscores the effectiveness of our approach.

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu | Fengqing Jiang | Luyao Niu | Jinyuan Jia | Bill Yuchen Lin | Radha Poovendran
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a significant LLM safety threat. We analyze tokens, which are the smallest unit of text that can be processed by LLMs and make the following observations: (1) probabilities of tokens representing harmful responses are higher than those of harmless responses, and (2) responses containing safety disclaimers appear among the top tokens when token probabilities are sorted in descending order. In this paper, we leverage (1) and (2) to develop SafeDecoding, a safety-aware decoding strategy for LLMs, to defend against jailbreak attacks. We perform extensive experiments to evaluate SafeDecoding against six SOTA jailbreak attacks (GCG, AutoDAN, PAIR, DeepInception, SAP30, and template based attack) on five LLMs (Vicuna, Llama2, Guanaco, falcon, and Dolphin) using four benchmark datasets (AdvBench, HEx-PHI, MT-Bench, and Just-Eval). Our results show that SafeDecoding significantly reduces attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries while outperforming six defense methods (Perpelexity, Paraphrase, Retokenization, Self-Reminder, ICD, and Self-Examination).

Selective “Selective Prediction”: Reducing Unnecessary Abstention in Vision-Language Reasoning
Tejas Srinivasan | Jack Hessel | Tanmay Gupta | Bill Yuchen Lin | Yejin Choi | Jesse Thomason | Khyathi Chandu
Findings of the Association for Computational Linguistics: ACL 2024

Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system without increasing the error rate of the system’s predictions. When the VLM makes a low-confidence prediction, instead of abstaining ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM, collects high-confidence evidences, and if enough evidence confirms the prediction the system makes a prediction instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR.

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Seungone Kim | Juyoung Suk | Shayne Longpre | Bill Yuchen Lin | Jamin Shin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available.

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
Tianyu Zheng | Ge Zhang | Tianhao Shen | Xueling Liu | Bill Yuchen Lin | Jie Fu | Wenhu Chen | Xiang Yue
Findings of the Association for Computational Linguistics: ACL 2024

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Code Feedback, a dataset featuring 68K multi-turn interactions, OpenCodeInterpreter integrates execution and human feedback for dynamic code refinement. Our comprehensive evaluation of OpenCodeInterpreter across key benchmarks such as HumanEval, MBPP, and their enhanced versions from EvalPlus reveals its exceptional performance. Notably, OpenCodeInterpreter-33B achieves an accuracy of 83.2 (76.4) on the average (and plus versions) of HumanEval and MBPP, closely rivaling GPT-4’s 84.2 (76.2) and further elevates to 91.6 (84.6) with synthesized human feedback from GPT-4. OpenCodeInterpreterbrings the gap between open-source code generation models and proprietary systems like GPT-4 Code Interpreter.

2023

AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction
Dong-Ho Lee | Ravi Kiran Selvam | Sheikh Muhammad Sarwar | Bill Yuchen Lin | Fred Morstatter | Jay Pujara | Elizabeth Boschee | James Allan | Xiang Ren
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging “entity triggers” which are human-readable cues in the text that help guide the model to make better decisions. Our framework leverages post-hoc explanation to generate rationales and strengthens a model’s prior knowledge using an embedding interpolation technique. This approach allows models to exploit triggers to infer entity boundaries and types instead of solely memorizing the entity words themselves. Through experiments on three well-studied NER datasets, AutoTriggER shows strong label-efficiency, is capable of generalizing to unseen entities, and outperforms the RoBERTa-CRF baseline by nearly 0.5 F1 points on average.

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Dongfu Jiang | Xiang Ren | Bill Yuchen Lin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules: PairRanker and GenFuser, addressing the observation that optimal LLMs for different examples can significantly vary. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. It jointly encodes the input text and a pair of candidates, using cross-attention encoders to determine the superior one. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based ranking. Then, GenFuser aims to merge the top-ranked candidates, generating an improved output by capitalizing on their strengths and mitigating their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons. Our LLM-Blender significantly outperform individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.

Complex Reasoning in Natural Language
Wenting Zhao | Mor Geva | Bill Yuchen Lin | Michihiro Yasunaga | Aman Madaan | Tao Yu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts)

Teaching machines to reason over texts has been a long-standing goal of natural language processing (NLP). To this end, researchers have designed a diverse set of complex reasoning tasks that involve compositional reasoning, knowledge retrieval, grounding, commonsense reasoning, etc. A standard choice for building systems that perform a desired type of reasoning is to fine-tune a pretrained language model (LM) on specific downstream tasks. However, recent research has demonstrated that such a straightforward approach is often brittle. For example, Elazar et al. (2021) and Branco et al. (2021) show that, on question-answering (QA) tasks, similar performance can be achieved with questions removed from the inputs. Min et al. (2019), Chen and Durrett (2019), and Tang et al. (2021) show that models trained on multi-hop QA do not generalize to answer single-hop questions. The reasoning capabilities of these models thus remain at a surface level, i.e., exploiting data patterns. Consequently, augmenting LMs with techniques that make them robust and effective becomes an active research area. We will start the tutorial by providing an overview of complex reasoning tasks where the standard application of pretrained language models fails. This tutorial then reviews recent promising directions for tackling these tasks. Specifically, we focus on the following groups of approaches that explicitly consider problem structures: (1) knowledge-augmented methods, where the knowledge is either incorporated during fine-tuning or pretraining; (2) few-shot prompting methods, which effectively guide the models to follow instructions; (3) neuro-symbolic methods, which produce explicit intermediate representations; and, (4) rationale-based methods, one of the most popular forms of the neuro-symbolic methods, which highlight subsets of input as explanations for individual model predictions.

2022

On the Robustness of Reading Comprehension Models to Entity Renaming
Jun Yan | Yang Xiao | Sagnik Mukherjee | Bill Yuchen Lin | Robin Jia | Xiang Ren
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We study the robustness of machine reading comprehension (MRC) models to entity renaming—do models make more wrong predictions when the same questions are asked about an entity whose name has been changed? Such failures imply that models overly rely on entity information to answer questions, and thus may generalize poorly when facts about the world change or questions are asked about novel entities. To systematically audit this issue, we present a pipeline to automatically generate test examples at scale, by replacing entity names in the original test sample with names from a variety of sources, ranging from names in the same test set, to common names in life, to arbitrary strings. Across five datasets and three pretrained model architectures, MRC models consistently perform worse when entities are renamed, with particularly large accuracy drops on datasets constructed via distant supervision. We also find large differences between models: SpanBERT, which is pretrained with span-level masking, is more robust than RoBERTa, despite having similar accuracy on unperturbed test data. We further experiment with different masking strategies as the continual pretraining objective and find that entity-based masking can improve the robustness of MRC models.

Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)
Antoine Bosselut | Xiang Li | Bill Yuchen Lin | Vered Shwartz | Bodhisattwa Prasad Majumder | Yash Kumar Lal | Rachel Rudinger | Xiang Ren | Niket Tandon | Vilém Zouhar
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)
Bill Yuchen Lin | Chaoyang He | Chulin Xie | Fatemehsadat Mireshghallah | Ninareh Mehrabi | Tian Li | Mahdi Soltanolkotabi | Xiang Ren
Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)

Reflect, Not Reflex: Inference-Based Common Ground Improves Dialogue Response Quality
Pei Zhou | Hyundong Cho | Pegah Jandaghi | Dong-Ho Lee | Bill Yuchen Lin | Jay Pujara | Xiang Ren
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Human communication relies on common ground (CG), the mutual knowledge and beliefs shared by participants, to produce coherent and interesting conversations. In this paper, we demonstrate that current response generation (RG) models produce generic and dull responses in dialogues because they act reflexively, failing to explicitly model CG, both due to the lack of CG in training data and the standard RG training procedure. We introduce Reflect, a dataset that annotates dialogues with explicit CG (materialized as inferences approximating shared knowledge and beliefs) and solicits 9k diverse human-generated responses each following one common ground. Using Reflect, we showcase the limitations of current dialogue data and RG models: less than half of the responses in current data is rated as high quality (sensible, specific, and interesting) and models trained using this data have even lower quality, while most Reflect responses are judged high quality. Next, we analyze whether CG can help models produce better quality responses by using Reflect CG to guide RG models. Surprisingly, we find that simply prompting GPT3 to “think” about CG generates 30% more quality responses, showing promising benefits to integrating CG into the RG process.

Knowledge-Augmented Methods for Natural Language Processing
Chenguang Zhu | Yichong Xu | Xiang Ren | Bill Yuchen Lin | Meng Jiang | Wenhao Yu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Knowledge in natural language processing (NLP) has been a rising trend especially after the advent of large scale pre-trained models. NLP models with attention to knowledge can i) access unlimited amount of external information; ii) delegate the task of storing knowledge from its parameter space to knowledge sources; iii) obtain up-to-date information; iv) make prediction results more explainable via selected knowledge. In this tutorial, we will introduce the key steps in integrating knowledge into NLP, including knowledge grounding from text, knowledge representation and fusing. In addition, we will introduce recent state-of-the-art applications in fusing knowledge into language understanding, language generation and commonsense reasoning.

On Continual Model Refinement in Out-of-Distribution Data Streams
Bill Yuchen Lin | Sida Wang | Xi Victoria Lin | Robin Jia | Lin Xiao | Xiang Ren | Wen-tau Yih
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams while overcoming catastrophic forgetting. However, existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario. In response to this, we propose a new CL problem formulation dubbed continual model refinement (CMR). Compared to prior CL settings, CMR is more practical and introduces unique challenges (boundary-agnostic and non-stationary distribution shift, diverse mixtures of multiple OOD data clusters, error-centric streams, etc.). We extend several existing CL approaches to the CMR setting and evaluate them extensively. For benchmarking and analysis, we propose a general sampling algorithm to obtain dynamic OOD data streams with controllable non-stationarity, as well as a suite of metrics measuring various aspects of online performance. Our experiments and detailed analysis reveal the promise and challenges of the CMR problem, supporting that studying CMR in dynamic OOD streams can benefit the longevity of deployed NLP models in production.

FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks
Bill Yuchen Lin | Chaoyang He | Zihang Ze | Hulin Wang | Yufen Hua | Christophe Dupuy | Rahul Gupta | Mahdi Soltanolkotabi | Xiang Ren | Salman Avestimehr
Findings of the Association for Computational Linguistics: NAACL 2022

Increasing concerns and regulations about data privacy and sparsity necessitate the study of privacy-preserving, decentralized learning methods for natural language processing (NLP) tasks. Federated learning (FL) provides promising approaches for a large number of clients (e.g., personal devices or organizations) to collaboratively learn a shared global model to benefit all clients while allowing users to keep their data locally. Despite interest in studying FL methods for NLP tasks, a systematic comparison and analysis is lacking in the literature. Herein, we present the FedNLP, a benchmarking framework for evaluating federated learning methods on four different task formulations: text classification, sequence tagging, question answering, and seq2seq. We propose a universal interface between Transformer-based language models (e.g., BERT, BART) and FL methods (e.g., FedAvg, FedOPT, etc.) under various non-IID partitioning strategies. Our extensive experiments with FedNLP provide empirical comparisons between FL methods and help us better understand the inherent challenges of this direction. The comprehensive analysis points to intriguing and exciting future research aimed at developing FL methods for NLP tasks.

2021

CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
Qinyuan Ye | Bill Yuchen Lin | Xiang Ren
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Humans can learn a new language task efficiently with only few examples, by leveraging their knowledge obtained when learning prior tasks. In this paper, we explore whether and how such cross-task generalization ability can be acquired, and further applied to build better few-shot learners across diverse NLP tasks. We introduce CrossFit, a problem setup for studying cross-task generalization ability, which standardizes seen/unseen task partitions, data access during different learning stages, and the evaluation protocols. To instantiate different seen/unseen task partitions in CrossFit and facilitate in-depth analysis, we present the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format. Our analysis reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks. We also observe that the selection of upstream learning tasks can significantly influence few-shot performance on unseen tasks, asking further analysis on task similarity and transferability.

RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models
Bill Yuchen Lin | Wenyang Gao | Jun Yan | Ryan Moreno | Xiang Ren
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

To audit the robustness of named entity recognition (NER) models, we propose RockNER, a simple yet effective method to create natural adversarial examples. Specifically, at the entity level, we replace target entities with other entities of the same semantic class in Wikidata; at the context level, we use pre-trained language models (e.g., BERT) to generate word substitutions. Together, the two levels of at- tack produce natural adversarial examples that result in a shifted distribution from the training data on which our target models have been trained. We apply the proposed method to the OntoNotes dataset and create a new benchmark named OntoRock for evaluating the robustness of existing NER models via a systematic evaluation protocol. Our experiments and analysis reveal that even the best model has a significant performance drop, and these models seem to memorize in-domain entity patterns instead of reasoning from the context. Our work also studies the effects of a few simple data augmentation methods to improve the robustness of NER models.

RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge
Bill Yuchen Lin | Ziyi Wu | Yichi Yang | Dong-Ho Lee | Xiang Ren
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Probing Commonsense Explanation in Dialogue Response Generation
Pei Zhou | Pegah Jandaghi | Hyundong Cho | Bill Yuchen Lin | Jay Pujara | Xiang Ren
Findings of the Association for Computational Linguistics: EMNLP 2021

Humans use commonsense reasoning (CSR) implicitly to produce natural and coherent responses in conversations. Aiming to close the gap between current response generation (RG) models and human communication abilities, we want to understand why RG models respond as they do by probing RG model’s understanding of commonsense reasoning that elicits proper responses. We formalize the problem by framing commonsense as a latent variable in the RG task and using explanations for responses as textual form of commonsense. We collect 6k annotated explanations justifying responses from four dialogue datasets and ask humans to verify them and propose two probing settings to evaluate RG models’ CSR capabilities. Probing results show that models fail to capture the logical relations between commonsense explanations and responses and fine-tuning on in-domain data and increasing model sizes do not lead to understanding of CSR for RG. We hope our study motivates more research in making RG models emulate the human reasoning process in pursuit of smooth human-AI communication.

Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning
Bill Yuchen Lin | Seyeon Lee | Xiaoyang Qiao | Xiang Ren
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-general probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 14 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method — multilingual contrastive pretraining (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks (e.g., +2.7% accuracy for X-CSQA over XLM-R_L).

RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms
Pei Zhou | Rahul Khanna | Seyeon Lee | Bill Yuchen Lin | Daniel Ho | Jay Pujara | Xiang Ren
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models (PTLMs) have achieved impressive performance on commonsense inference benchmarks, but their ability to employ commonsense to make robust inferences, which is crucial for effective communications with humans, is debated. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA: Robust Inference using Commonsense Axioms, that evaluates robust commonsense inference despite textual perturbations. To generate data for this challenge, we develop a systematic and scalable procedure using commonsense knowledge bases and probe PTLMs across two different evaluation settings. Extensive experiments on our generated probe sets with more than 10k statements show that PTLMs perform no better than random guessing on the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks. We also find that fine-tuning on similar statements offer limited gains, as PTLMs still fail to generalize to unseen inferences. Our new large-scale benchmark exposes a significant gap between PTLMs and human-level language understanding and offers a new challenge for PTLMs to demonstrate commonsense.

Differentiable Open-Ended Commonsense Reasoning
Bill Yuchen Lin | Haitian Sun | Bhuwan Dhingra | Manzil Zaheer | Xiang Ren | William Cohen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Current commonsense reasoning research focuses on developing models that use commonsense knowledge to answer multiple-choice questions. However, systems designed to answer multiple-choice questions may not be useful in applications that do not provide a small list of candidate answers to choose from. As a step towards making commonsense reasoning research more realistic, we propose to study open-ended commonsense reasoning (OpenCSR) — the task of answering a commonsense question without any pre-defined choices — using as a resource only a corpus of commonsense facts written in natural language. OpenCSR is challenging due to a large decision space, and because many questions require implicit multi-hop reasoning. As an approach to OpenCSR, we propose DrFact, an efficient Differentiable model for multi-hop Reasoning over knowledge Facts. To evaluate OpenCSR methods, we adapt several popular commonsense reasoning benchmarks, and collect multiple new answers for each test question via crowd-sourcing. Experiments show that DrFact outperforms strong baseline methods by a large margin.

Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning
Xisen Jin | Bill Yuchen Lin | Mohammad Rostami | Xiang Ren
Findings of the Association for Computational Linguistics: EMNLP 2021

The ability to continuously expand knowledge over time and utilize it to rapidly generalize to new tasks is a key feature of human linguistic intelligence. Existing models that pursue rapid generalization to new tasks (e.g., few-shot learning methods), however, are mostly trained in a single shot on fixed datasets, unable to dynamically expand their knowledge; while continual learning algorithms are not specifically designed for rapid generalization. We present a new learning setup, Continual Learning of Few-Shot Learners (CLIF), to address challenges of both learning settings in a unified setup. CLIF assumes a model learns from a sequence of diverse NLP tasks arriving sequentially, accumulating knowledge for improved generalization to new tasks, while also retaining performance on the tasks learned earlier. We examine how the generalization ability is affected in the continual learning setup, evaluate a number of continual learning algorithms, and propose a novel regularized adapter generation approach. We find that catastrophic forgetting affects generalization ability to a lesser degree than performance on seen tasks; while continual learning algorithms can still bring considerable benefit to the generalization ability.

2020

CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning
Bill Yuchen Lin | Wangchunshu Zhou | Ming Shen | Pei Zhou | Chandra Bhagavatula | Yejin Choi | Xiang Ren
Findings of the Association for Computational Linguistics: EMNLP 2020

Recently, large-scale pre-trained language models have demonstrated impressive performance on several commonsense-reasoning benchmark datasets. However, building machines with commonsense to compose realistically plausible sentences remains challenging. In this paper, we present a constrained text generation task, CommonGen associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts (e.g., dog, frisbee, catch, throw); the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., “a man throws a frisbee and his dog catches it”). The CommonGen task is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowdsourced and existing caption corpora, consists of 77k commonsense descriptions over 35k unique concept-sets. Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance (31.6% v.s. 63.5% in SPICE metric). Furthermore, we demonstrate that the learned generative commonsense reasoning capability can be transferred to improve downstream tasks such as CommonsenseQA (76.9% to 78.4 in dev accuracy) by generating additional context.

TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition
Bill Yuchen Lin | Dong-Ho Lee | Ming Shen | Ryan Moreno | Xiao Huang | Prashant Shiralkar | Xiang Ren
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Training neural models for named entity recognition (NER) in a new domain often requires additional human annotations (e.g., tens of thousands of labeled instances) that are usually expensive and time-consuming to collect. Thus, a crucial research question is how to obtain supervision in a cost-effective way. In this paper, we introduce “entity triggers,” an effective proxy of human explanations for facilitating label-efficient learning of NER models. An entity trigger is defined as a group of words in a sentence that helps to explain why humans would recognize an entity in the sentence. We crowd-sourced 14k entity triggers for two well-studied NER datasets. Our proposed model, Trigger Matching Network, jointly learns trigger representations and soft matching module with self-attention such that can generalize to unseen sentences easily for tagging. Our framework is significantly more cost-effective than the traditional neural NER frameworks. Experiments show that using only 20% of the trigger-annotated sentences results in a comparable performance as using 70% of conventional annotated sentences.

Learning to Contextually Aggregate Multi-Source Supervision for Sequence Labeling
Ouyu Lan | Xiao Huang | Bill Yuchen Lin | He Jiang | Liyuan Liu | Xiang Ren
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Sequence labeling is a fundamental task for a range of natural language processing problems. When used in practice, its performance is largely influenced by the annotation quality and quantity, and meanwhile, obtaining ground truth labels is often costly. In many cases, ground truth labels do not exist, but noisy annotations or annotations from different domains are accessible. In this paper, we propose a novel framework Consensus Network (ConNet) that can be trained on annotations from multiple sources (e.g., crowd annotation, cross-domain data). It learns individual representation for every source and dynamically aggregates source-specific knowledge by a context-aware attention module. Finally, it leads to a model reflecting the agreement (consensus) among multiple sources. We evaluate the proposed framework in two practical settings of multi-source learning: learning with crowd annotations and unsupervised cross-domain model adaptation. Extensive experimental results show that our model achieves significant improvements over existing methods in both settings. We also demonstrate that the method can apply to various tasks and cope with different encoders.

Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering
Yanlin Feng | Xinyue Chen | Bill Yuchen Lin | Peifeng Wang | Jun Yan | Xiang Ren
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Existing work on augmenting question answering (QA) models with external knowledge (e.g., knowledge graphs) either struggle to model multi-hop relations efficiently, or lack transparency into the model’s prediction rationale. In this paper, we propose a novel knowledge-aware approach that equips pre-trained language models (PTLMs) has with a multi-hop relational reasoning module, named multi-hop graph relation network (MHGRN). It performs multi-hop, multi-relational reasoning over subgraphs extracted from external knowledge graphs. The proposed reasoning module unifies path-based reasoning methods and graph neural networks to achieve better interpretability and scalability. We also empirically show its effectiveness and scalability on CommonsenseQA and OpenbookQA datasets, and interpret its behaviors with case studies, with the code for experiments released.

Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models
Bill Yuchen Lin | Seyeon Lee | Rahul Khanna | Xiang Ren
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent works show that pre-trained language models (PTLMs), such as BERT, possess certain commonsense and factual knowledge. They suggest that it is promising to use PTLMs as “neural knowledge bases” via predicting masked words. Surprisingly, we find that this may not work for numerical commonsense knowledge (e.g., a bird usually has two legs). In this paper, we investigate whether and to what extent we can induce numerical commonsense knowledge from PTLMs as well as the robustness of this process. In this paper, we investigate whether and to what extent we can induce numerical commonsense knowledge from PTLMs as well as the robustness of this process. To study this, we introduce a novel probing task with a diagnostic dataset, NumerSense, containing 13.6k masked-word-prediction probes (10.5k for fine-tuning and 3.1k for testing). Our analysis reveals that: (1) BERT and its stronger variant RoBERTa perform poorly on the diagnostic dataset prior to any fine-tuning; (2) fine-tuning with distant supervision brings some improvement; (3) the best supervised model still performs poorly as compared to human performance (54.06% vs. 96.3% in accuracy).

LEAN-LIFE: A Label-Efficient Annotation Framework Towards Learning from Explanation
Dong-Ho Lee | Rahul Khanna | Bill Yuchen Lin | Seyeon Lee | Qinyuan Ye | Elizabeth Boschee | Leonardo Neves | Xiang Ren
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Successfully training a deep neural network demands a huge corpus of labeled data. However, each label only provides limited information to learn from, and collecting the requisite number of labels involves massive human effort. In this work, we introduce LEAN-LIFE, a web-based, Label-Efficient AnnotatioN framework for sequence labeling and classification tasks, with an easy-to-use UI that not only allows an annotator to provide the needed labels for a task but also enables LearnIng From Explanations for each labeling decision. Such explanations enable us to generate useful additional labeled data from unlabeled instances, bolstering the pool of available training data. On three popular NLP tasks (named entity recognition, relation extraction, sentiment analysis), we find that using this enhanced supervision allows our models to surpass competitive baseline F1 scores by more than 5-10 percentage points, while using 2X times fewer labeled instances. Our framework is the first to utilize this enhanced supervision technique and does so for three important tasks – thus providing improved annotation recommendations to users and an ability to build datasets of (data, label, explanation) triples instead of the regular (data, label) pair.

2019

AlpacaTag: An Active Learning-based Crowd Annotation Framework for Sequence Tagging
Bill Yuchen Lin | Dong-Ho Lee | Frank F. Xu | Ouyu Lan | Xiang Ren
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce an open-source web-based data annotation framework (AlpacaTag) for sequence tagging tasks such as named-entity recognition (NER). The distinctive advantages of AlpacaTag are three-fold. 1) Active intelligent recommendation: dynamically suggesting annotations and sampling the most informative unlabeled instances with a back-end active learned model; 2) Automatic crowd consolidation: enhancing real-time inter-annotator agreement by merging inconsistent labels from multiple annotators; 3) Real-time model deployment: users can deploy their models in downstream systems while new annotations are being made. AlpacaTag is a comprehensive solution for sequence labeling tasks, ranging from rapid tagging with recommendations powered by active learning and auto-consolidation of crowd annotations to real-time model deployment.

KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
Bill Yuchen Lin | Xinyue Chen | Jamin Chen | Xiang Ren
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Commonsense reasoning aims to empower machines with the human ability to make presumptions about ordinary situations in our daily life. In this paper, we propose a textual inference framework for answering commonsense questions, which effectively utilizes external, structured commonsense knowledge graphs to perform explainable inferences. The framework first grounds a question-answer pair from the semantic space to the knowledge-based symbolic space as a schema graph, a related sub-graph of external knowledge graphs. It represents schema graphs with a novel knowledge-aware graph network module named KagNet, and finally scores answers with graph representations. Our model is based on graph convolutional networks and LSTMs, with a hierarchical path-based attention mechanism. The intermediate attention scores make it transparent and interpretable, which thus produce trustworthy inferences. Using ConceptNet as the only external resource for Bert-based models, we achieved state-of-the-art performance on the CommonsenseQA, a large-scale dataset for commonsense reasoning.

2018

Automatic Extraction of Commonsense LocatedNear Knowledge
Frank F. Xu | Bill Yuchen Lin | Kenny Zhu
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

LocatedNear relation is a kind of commonsense knowledge describing two physical objects that are typically found near each other in real life. In this paper, we study how to automatically extract such relationship through a sentence-level relation classifier and aggregating the scores of entity pairs from a large corpus. Also, we release two benchmark datasets for evaluation and future research.

Mining Cross-Cultural Differences and Similarities in Social Media
Bill Yuchen Lin | Frank F. Xu | Kenny Zhu | Seung-won Hwang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-cultural differences and similarities are common in cross-lingual natural language understanding, especially for research in social media. For instance, people of distinct cultures often hold different opinions on a single named entity. Also, understanding slang terms across languages requires knowledge of cross-cultural similarities. In this paper, we study the problem of computing such cross-cultural differences and similarities. We present a lightweight yet effective approach, and evaluate it on two novel tasks: 1) mining cross-cultural differences of named entities and 2) finding similar terms for slang across languages. Experimental results show that our framework substantially outperforms a number of baseline methods on both tasks. The framework could be useful for machine translation applications and research in computational social science.

ExtRA: Extracting Prominent Review Aspects from Customer Feedback
Zhiyi Luo | Shanshan Huang | Frank F. Xu | Bill Yuchen Lin | Hanyuan Shi | Kenny Zhu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Many existing systems for analyzing and summarizing customer reviews about products or service are based on a number of prominent review aspects. Conventionally, the prominent review aspects of a product type are determined manually. This costly approach cannot scale to large and cross-domain services such as Amazon.com, Taobao.com or Yelp.com where there are a large number of product types and new products emerge almost every day. In this paper, we propose a novel framework, for extracting the most prominent aspects of a given product type from textual reviews. The proposed framework, ExtRA, extracts K most prominent aspect terms or phrases which do not overlap semantically automatically without supervision. Extensive experiments show that ExtRA is effective and achieves the state-of-the-art performance on a dataset consisting of different product types.

Neural Adaptation Layers for Cross-domain Named Entity Recognition
Bill Yuchen Lin | Wei Lu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent research efforts have shown that neural architectures can be effective in conventional information extraction tasks such as named entity recognition, yielding state-of-the-art results on standard newswire datasets. However, despite significant resources required for training such models, the performance of a model trained on one domain typically degrades dramatically when applied to a different domain, yet extracting entities from new emerging domains such as social media can be of significant interest. In this paper, we empirically investigate effective methods for conveniently adapting an existing, well-trained neural NER model for a new domain. Unlike existing approaches, we propose lightweight yet effective methods for performing domain adaptation for neural models. Specifically, we introduce adaptation layers on top of existing neural architectures, where no re-training using the source domain data is required. We conduct extensive empirical studies and show that our approach significantly outperforms state-of-the-art methods.

2017

Multi-channel BiLSTM-CRF Model for Emerging Named Entity Recognition in Social Media
Bill Y. Lin | Frank Xu | Zhiyi Luo | Kenny Zhu
Proceedings of the 3rd Workshop on Noisy User-generated Text

In this paper, we present our multi-channel neural architecture for recognizing emerging named entity in social media messages, which we applied in the Novel and Emerging Named Entity Recognition shared task at the EMNLP 2017 Workshop on Noisy User-generated Text (W-NUT). We propose a novel approach, which incorporates comprehensive word representations with multi-channel information and Conditional Random Fields (CRF) into a traditional Bidirectional Long Short-Term Memory (BiLSTM) neural network without using any additional hand-craft features such as gazetteers. In comparison with other systems participating in the shared task, our system won the 2nd place.

Co-authors

Fengqing Jiang 4

Radha Poovendran 4

Maria Antoniak 2

Chandra Bhagavatula 2

Elizabeth Boschee 2

Pegah Jandaghi 2

Sujian Li (李素建) 2

Shayne Longpre 2

Graham Neubig 2

Abhilasha Ravichander 2

Vered Shwartz 2

Mahdi Soltanolkotabi 2

Aaran Arulraj 1

Salman Avestimehr 1

Antoine Bosselut 1

Faeze Brahman 1

Hyungjoo Chae 1

Abhranil Chandra 1

Kai-Wei Chang 1

William Cohen 1

Bhuwan Dhingra 1

Christophe Dupuy 1

Jillian Fisher 1

Wenyang Gao (高文炀) 1

Hannaneh Hajishirzi 1

Yangsibo Huang 1

Shanshan Huang 1

Seung-won Hwang 1

Hyeonbin Hwang 1

Yash Kumar Lal 1

Nathan Lambert 1

Shuyue Stella Li 1

Xi Victoria Lin 1

Bodhisattwa Prasad Majumder 1

Ninareh Mehrabi 1

Fatemehsadat Mireshghallah 1

Niloofar Mireshghallah 1

Jacob Morrison 1

Fred Morstatter 1

Sagnik Mukherjee 1

Yaswanth Narsupalli 1

Leonardo Neves 1

Chan Young Park 1

Sue Hyun Park 1

Valentina Pyatkin 1

Xiaoyang Qiao 1

Bhaskar Ramasubramanian 1

Sahithya Ravi 1

Mohammad Rostami 1

Rachel Rudinger 1

Sheikh Muhammad Sarwar 1

Ravi Kiran Selvam 1

Sheikh Shafayat 1

Prashant Shiralkar 1

Noah A. Smith 1

Taylor Sorensen 1

Tejas Srinivasan 1

Jesse Thomason 1

Yulia Tsvetkov 1

Yutaro Yamada 1

Michihiro Yasunaga 1

Seonghyeon Ye 1

Ilker Yildirim 1

Dongkeun Yoon 1

Manzil Zaheer 1

Chiyuan Zhang 1

Wangchunshu Zhou 1

Chenguang Zhu 1

Vilém Zouhar 1

Venues