Shi Feng - ACL Anthology

Shi Feng

2025

Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
Nishant Balepur | Vishakh Padmakumar | Fumeng Yang | Shi Feng | Rachel Rudinger | Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

LLMs are aligned to follow input instructions by learning which of two responses users prefer for a prompt. However, such preference data do not convey *why* users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply *abductive reasoning* to preference data, inferring needs and interests of users, i.e., personas, that may prefer either response. We test this idea in two steps: **Persona Inference (PI)**—abductively inferring personas of users who prefer chosen or rejected outputs—and **Persona Tailoring (PT)**—training models to tailor outputs to personas from PI. We show: 1) LLMs infer personas accurately explaining why different users may prefer *both* chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization and generalizes to supporting user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.

SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model
Jiayang Yu | Yihang Zhang | Bin Wang | Peiqin Lin | YongKang Liu | Shi Feng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase.Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices.However, LoRA’s performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (**S**tate **S**pace **M**odel **L**ow-**R**ank **A**daptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences.

TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM
Huiying Cao | Yiqun Zhang | Shi Feng | Xiaocui Yang | Daling Wang | Yifei Zhang
Proceedings of the 31st International Conference on Computational Linguistics

Empathetic conversation is a crucial characteristic in daily conversations between individuals. Nowadays, Large Language models (LLMs) have shown outstanding performance in generating empathetic responses. Knowledge bases like COMET can assist LLMs in mitigating illusions and enhancing the understanding of users’ intentions and emotions. However, models remain heavily reliant on fixed knowledge bases and unrestricted incorporation of external knowledge can introduce noise. Tool learning is a flexible end-to-end approach that assists LLMs in handling complex problems. In this paper, we propose Emotional Knowledge Tool Calling (EKTC) framework, which encapsulates the commonsense knowledge bases as empathetic tools, enabling LLMs to integrate external knowledge flexibly through tool calling. In order to adapt the models to the new task, we construct a novel dataset TOOL-ED based on the EMPATHETICDIALOGUE (ED) dataset. We validate EKTC on the ED dataset, and the experimental results demonstrate that our framework can enhance the ability of LLMs to generate empathetic responses effectively. Our code is available at https://anonymous.4open.science/r/EKTC-3FEF.

A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users
Nishant Balepur | Matthew Shu | Yoo Yeon Sung | Seraphina Goldfarb-Tarrant | Shi Feng | Fumeng Yang | Rachel Rudinger | Jordan Lee Boyd-Graber
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions—not just preferences of what looks helpful—so we discuss the plan NLP researchers can execute to solve this problem.

Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer?
Nishant Balepur | Feng Gu | Abhilasha Ravichander | Shi Feng | Jordan Lee Boyd-Graber | Rachel Rudinger
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Question answering (QA)—giving correct answers to questions—is a popular task, but we test **reverse question answering (RQA)**: for an input answer, give a question with that answer. Past work tests QA and RQA separately, but we test them jointly, comparing their difficulty, aiding benchmark design, and checking reasoning consistency. We run 16 LLMs on QA and RQA with trivia questions/answers, revealing: 1) Versus RQA, LLMs are much less accurate in RQA for numerical answers, but slightly more accurate in RQA for textual answers; 2) LLMs often answer their own invalid questions from RQA accurately in QA, so RQA errors are not just from knowledge gaps; 3) RQA errors correlate with question difficulty and inversely correlate with answer frequencies in the Dolma corpus; and 4) LLMs struggle to give valid multi-hop questions. By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.

AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation
Ming Wang | Peidong Wang | Lin Wu | Xiaocui Yang | Daling Wang | Shi Feng | Yuxin Chen | Bixuan Wang | Yifei Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers’ mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose **AnnaAgent**, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator’s configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on [https://github.com/sci-m-wang/AnnaAgent](https://github.com/sci-m-wang/AnnaAgent).

SemanticCamo: Jailbreaking Large Language Models through Semantic Camouflage
Jihui Yan | Xiaocui Yang | Daling Wang | Shi Feng | Yifei Zhang | Yinzhi Zhao
Findings of the Association for Computational Linguistics: ACL 2025

The rapid development and increasingly widespread applications of Large Language Models (LLMs) have made the safety issues of LLMs more prominent and critical. Although safety training is widely used in LLMs, the mismatch between pre-training and safety training still leads to safety vulnerabilities. To expose the safety vulnerabilities in LLMs and improve LLMs’ performance in safety, we propose a novel framework, SemanticCamo, which attacks LLMs through semantic camouflage.SemanticCamo bypasses safety guardrails by replacing the original unsafe content with semantic features, thereby concealing malicious intent while keeping the query’s objectives unchanged. We conduct comprehensive experiments on the state-of-the-art LLMs, including GPT-4o and Claude-3.5, finding that SemanticCamo successfully induces harmful responses from the target models in over 80% of cases on average, outperforming previous counterparts. Additionally, the performance of SemanticCamo against various defenses is evaluated, demonstrating that semantic transformations introduce critical challenges to LLM safety, necessitating targeted alignment strategies to address this vulnerability. Code and data are available at https://github.com/Jihui-Yan/SemanticCamo.

TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
Fanheng Kong | Jingyuan Zhang | Hongzhi Zhang | Shi Feng | Daling Wang | Linhao Yu | Xingguang Ji | Yu Tian | Victoria W. | Fuzheng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models.

MUSE: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles
Zihan Wang | Xiaocui Yang | YongKang Liu | Shi Feng | Daling Wang | Yifei Zhang
Findings of the Association for Computational Linguistics: ACL 2025

Current conversational recommendation systems focus predominantly on text. However, real-world recommendation settings are generally multimodal, causing a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered around the Clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). It innovatively derives user profiles from real-world scenarios rather than depending on manual design and history data for better scalability, and then it fulfills conversation simulation and optimization. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. Additionally, fine-tuning experiments on three MLLMs demonstrate Muse’s learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and codes are available at https://anonymous.4open.science/r/Muse-0086.

Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Dexian Cai | Xiaocui Yang | YongKang Liu | Daling Wang | Shi Feng | Yifei Zhang | Soujanya Poria
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://anonymous.4open.science/r/PixelRS/.

Language Models as Continuous Self-Evolving Data Engineers
Peidong Wang | Ming Wang | Zhiming Ma | Xiaocui Yang | Shi Feng | Daling Wang | Yifei Zhang | Kaisong Song
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their further evolution is often hampered by the scarcity of high-quality training data and the heavy reliance of traditional methods on expert-labeled data. This reliance sets a ceiling on LLM performance and is particularly challenging in low data resource scenarios where extensive supervision is unavailable. To address this issue, we propose a novel paradigm named LANCE (**LAN**guage models as **C**ontinuous self-**E**volving data engineers) that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of post-training data construction. Through iterative fine-tuning on Qwen2 series models, we validate the effectiveness of LANCE across various tasks, showing that it can maintain high-quality data generation and continuously improve model performance. Across multiple benchmark dimensions, LANCE results in an average score enhancement of **3.64** for Qwen2-7B and **1.75** for Qwen2-7B-Instruct. This autonomous data construction paradigm not only lessens reliance on human experts or external models but also ensures data aligns with human preferences, offering a scalable path for LLM self-improvement, especially in contexts with limited supervisory data. Code is available at: https://github.com/Control-derek/LANCE.

2024

Improving Role-Oriented Dialogue Summarization with Interaction-Aware Contrastive Learning
Weihong Guan | Shi Feng | Daling Wang | Faliang Huang | Yifei Zhang | Yuan Cui
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Role-oriented dialogue summarization aims at generating summaries for different roles in dialogue, e.g., user and agent. Interaction between different roles is vital for the task. Existing methods could not fully capture interaction patterns between roles when encoding dialogue, thus are prone to ignore the interaction-related key information. In this paper, we propose a contrastive learning based interaction-aware model for the role-oriented dialogue summarization namely CIAM. An interaction-aware contrastive objective is constructed to guide the encoded dialogue representation to learn role-level interaction. The representation is then used by the decoder to generate role-oriented summaries. The contrastive objective is trained jointly with the primary dialogue summarization task. Additionally, we innovatively utilize different decoder start tokens to control what kind of summary to generate, thus could generate different role-oriented summaries with a unified model. Experimental results show that our method achieves new state-of-the-art results on two public datasets. Extensive analyses further demonstrate that our method excels at capturing interaction information between different roles and producing informative summaries.

STICKERCONV: Generating Multimodal Empathetic Responses from Scratch
Yiqun Zhang | Fanheng Kong | Peidong Wang | Shuang Sun | Lingshuai Wang | Shi Feng | Daling Wang | Yifei Zhang | Kaisong Song
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Stickers, while widely recognized for enhancing empathetic communication in online interactions, remain underexplored in current empathetic dialogue research, notably due to the challenge of a lack of comprehensive datasets. In this paper, we introduce the Agent for STICKERCONV (Agent4SC), which uses collaborative agent interactions to realistically simulate human behavior with sticker usage, thereby enhancing multimodal empathetic communication. Building on this foundation, we develop a multimodal empathetic dialogue dataset, STICKERCONV, comprising 12.9K dialogue sessions, 5.8K unique stickers, and 2K diverse conversational scenarios. This dataset serves as a benchmark for multimodal empathetic generation. To advance further, we propose PErceive and Generate Stickers (PEGS), a multimodal empathetic response generation framework, complemented by a comprehensive set of empathy evaluation metrics based on LLM. Our experiments demonstrate PEGS’s effectiveness in generating contextually relevant and emotionally resonant multimodal empathetic responses, contributing to the advancement of more nuanced and engaging empathetic dialogue systems.

Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong
Chenglei Si | Navita Goyal | Tongshuang Wu | Chen Zhao | Shi Feng | Hal Daumé Iii | Jordan Boyd-Graber
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. We conduct human experiments with 80 crowdworkers to compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information—explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users’ over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Nishant Balepur | Matthew Shu | Alexander Hoyle | Alison Robey | Shi Feng | Seraphina Goldfarb-Tarrant | Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Keyword mnemonics are memorable explanations that link new terms to simpler keywords.Prior work generates mnemonics for students, but they do not train models using mnemonics students prefer and aid learning.We build SMART, a mnemonic generator trained on feedback from real students learning new terms.To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics.We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor.We gather 2684 preferences from 45 students across two types: **expressed** (inferred from ratings) and **observed** (inferred from student learning), yielding three key findings.First, expressed and observed preferences disagree; what students *think* is helpful does not always capture what is *truly* helpful.Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal.SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.

EmpCRL: Controllable Empathetic Response Generation via In-Context Commonsense Reasoning and Reinforcement Learning
Mingxiu Cai | Daling Wang | Shi Feng | Yifei Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Empathetic response generation aims to understand the user’s feelings emotionally and generate responses with appropriate emotion. According to psychological theories, empathy consists of two main aspects: affection and cognition. However, existing works lack the perception of fine-grained dialogue emotion propagation, as well as have limitations in reasoning about the intentions of users on cognition, which affect the quality of empathetic response. To this end, we propose to generate Empathetic response based on in-context Commonsense reasoning and Reinforcement Learning (EmpCRL). First, we use a current popular large language model combined with multi-view contextual reasoning to broaden the cognitive boundaries through in-context learning. Furthermore, we infer the response emotion by jointly modeling the dialogue history and emotion flow, and achieve the control of response emotion and diversity through reinforcement learning. Extensive experiments on EmpatheticDialogues dataset show that our model outperforms state-of-the-art models in both automatic and human evaluation.

TIGER: A Unified Generative Model Framework for Multimodal Dialogue Response Generation
Fanheng Kong | Peidong Wang | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Responding with multimodal content has been recognized as one of the essential functionalities of intelligent conversational agents. However, existing research on multimodal dialogues primarily focuses on two topics: (1) textual response generation that ground the conversation on a given image; and (2) visual response selection based on the dialogue context. In light of the aforementioned gap, we propose mulTImodal GEnerator for dialogue Response (TIGER), a unified generative model framework for multimodal dialogue response generation. Through extensive experiments, TIGER has demonstrated new state-of-the-art results, providing users with an enhanced conversational experience. A multimodal dialogue system based on TIGER is available at https://github.com/friedrichor/TIGER. A video demonstrating the system is available at https://www.youtube.com/watch?v=Kd0CMwDs8Rk.

BERT-BC: A Unified Alignment and Interaction Model over Hierarchical BERT for Response Selection
Zhenfei Yang | Beiming Yu | Yuan Cui | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recently, we have witnessed a significant performance boosting for dialogue response selection task achieved by Cross-Encoder based models. However, such models directly feed the concatenation of context and response into the pre-trained model for interactive inference, ignoring the comprehensively independent representation modeling of context and response. Moreover, randomly sampling negative responses from other dialogue contexts is simplistic, and the learned models have poor generalization capability in realistic scenarios. In this paper, we propose a response selection model called BERT-BC that combines the representation-based Bi-Encoder and interaction-based Cross-Encoder. Three contrastive learning methods are devised for the Bi-Encoder to align context and response to obtain the better semantic representation. Meanwhile, according to the alignment difficulty of context and response semantics, the harder samples are dynamically selected from the same batch with negligible cost and sent to Cross-Encoder to enhance the model’s interactive reasoning ability. Experimental results show that BERT-BC can achieve state-of-the-art performance on three benchmark datasets for multi-turn response selection.

KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
Matthew Shu | Nishant Balepur | Shi Feng | Jordan Lee Boyd-Graber
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Flashcard schedulers rely on 1) *student models* to predict the flashcards a student knows; and 2) *teaching policies* to pick which cards to show next via these predictions.Prior student models, however, just use study data like the student’s past responses, ignoring the text on cards. We propose **content-aware scheduling**, the first schedulers exploiting flashcard content.To give the first evidence that such schedulers enhance student learning, we build KARL, a simple but effective content-aware student model employing deep knowledge tracing (DKT), retrieval, and BERT to predict student recall.We train KARL by collecting a new dataset of 123,143 study logs on diverse trivia questions.KARL bests existing student models in AUC and calibration error.To ensure our improved predictions lead to better student learning, we create a novel delta-based teaching policy to deploy KARL online.Based on 32 study paths from 27 users, KARL improves learning efficiency over SOTA, showing KARL’s strength and encouraging researchers to look beyond historical study data to fully capture student abilities.

HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy
YongKang Liu | Yiqun Zhang | Qian Li | Tong Liu | Shi Feng | Daling Wang | Yifei Zhang | Hinrich Schuetze
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Full-parameter fine-tuning (FPFT) has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which potentially compromises the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. We propose a novel, memory-efficient, optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT significantly reduces the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance with parameter-efficient fine-tuning and standard FPFT. (2) Results on six models show that HiFT reduces the number of trainable parameters by about 89.18% on average compared to FPFT. (3) HiFT supports FPFT of 7B models for 24G GPU memory devices under mixed precision without using any memory saving techniques. (4) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. The source code link is https://github.com/misonsky/HiFT.

2023

Global-Local Modeling with Prompt-Based Knowledge Enhancement for Emotion Inference in Conversation
Renxi Wang | Shi Feng
Findings of the Association for Computational Linguistics: EACL 2023

The ability to recognize emotions in conversations is necessary and important for the online chatbot to do tasks such as empathetic response generation and emotional support. Present researches mainly focus on recognizing emotions through a speaker’s utterance, while research on emotion inference predicts emotions of addressees through previous utterances. Because of the lack of the addressee’s utterance, emotion inference is more challenging than emotion recognition. In this paper, we propose a global-local modeling method based on recurrent neural networks (RNN) and pre-trained language models (PLM) to do emotion inference, which utilizes the sequence modeling ability of RNNs and abundant knowledge from PLMs. Moreover, we take the whole dialogue history as input of PLM to generate knowledge by in-context learning. Experimental results show that our model with knoledge enhancement achieves state-of-the-art performance on all three datasets.

Few-shot Joint Multimodal Aspect-Sentiment Analysis Based on Generative Multimodal Prompt
Xiaocui Yang | Shi Feng | Daling Wang | Qi Sun | Wenfang Wu | Yifei Zhang | Pengfei Hong | Soujanya Poria
Findings of the Association for Computational Linguistics: ACL 2023

We have witnessed the rapid proliferation of multimodal data on numerous social media platforms. Conventional studies typically require massive labeled data to train models for Multimodal Aspect-Based Sentiment Analysis (MABSA). However, collecting and annotating fine-grained multimodal data for MABSA is tough. To alleviate the above issue, we perform three MABSA-related tasks with quite a small number of labeled multimodal samples. We first build diverse and comprehensive multimodal few-shot datasets according to the data distribution. To capture the specific prompt for each aspect term in a few-shot scenario, we propose a novel Generative Multimodal Prompt (GMP) model for MABSA, which includes the Multimodal Encoder module and the N-Stream Decoders module. We further introduce a subtask to predict the number of aspect terms in each instance to construct the multimodal prompt. Extensive experiments on two datasets demonstrate that our approach outperforms strong baselines on two MABSA-related tasks in the few-shot setting.

PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism
Yongkang Liu | Shi Feng | Daling Wang | Yifei Zhang | Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative modelsbased on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makesmodels unable to capture the subtle variability observed in different dialogues and cannot distinguish the differencesbetween dialogues that are similar in composition. In this paper, we propose Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable thataggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue(PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity andrelevance of responses on two benchmark datasets.

Contrastive Learning with Generated Representations for Inductive Knowledge Graph Embedding
Qian Li | Shafiq Joty | Daling Wang | Shi Feng | Yifei Zhang | Chengwei Qin
Findings of the Association for Computational Linguistics: ACL 2023

With the evolution of Knowledge Graphs (KGs), new entities emerge which are not seen before. Representation learning of KGs in such an inductive setting aims to capture and transfer the structural patterns from existing entities to new entities. However, the performance of existing methods in inductive KGs are limited by sparsity and implicit transfer. In this paper, we propose VMCL, a Contrastive Learning (CL) framework with graph guided Variational autoencoder on Meta-KGs in the inductive setting. We first propose representation generation to capture the encoded and generated representations of entities, where the generated variations can densify representations with complementary features. Then, we design two CL objectives that work across entities and meta-KGs to simulate the transfer mode. With extensive experiments we demonstrate that our proposed VMCL can significantly outperform previous state-of-the-art baselines.

Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations
Chenglei Si | Dan Friedman | Nitish Joshi | Shi Feng | Danqi Chen | He He
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In-context learning (ICL) is an important paradigm for adapting large language models (LLMs) to new tasks, but the generalization behavior of ICL remains poorly understood. We investigate the inductive biases of ICL from the perspective of feature bias: which feature ICL is more likely to use given a set of underspecified demonstrations in which two features are equally predictive of the labels. First, we characterize the feature biases of GPT-3 models by constructing underspecified demonstrations from a range of NLP datasets and feature combinations. We find that LLMs exhibit clear feature biases—for example, demonstrating a strong bias to predict labels according to sentiment rather than shallow lexical features, like punctuation. Second, we evaluate the effect of different interventions that are designed to impose an inductive bias in favor of a particular feature, such as adding a natural language instruction or using semantically relevant label words. We find that, while many interventions can influence the learner to prefer a particular feature, it can be difficult to overcome strong prior biases. Overall, our results provide a broader picture of the types of features that ICL may be more likely to exploit and how to impose inductive biases that are better aligned with the intended task.

2022

KC-ISA: An Implicit Sentiment Analysis Model Combining Knowledge Enhancement and Context Features
Minghao Xu | Daling Wang | Shi Feng | Zhenfei Yang | Yifei Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Sentiment analysis has always been an important research direction in natural language processing. The research can be divided into explicit sentiment analysis and implicit sentiment analysis according to whether there are sentiment words in language expression. There have been many research results in explicit sentiment analysis. However, implicit sentiment analysis is rarely studied. Compared with explicit sentiment expression, implicit sentiment expression usually omits a lot of knowledge and common sense, and context also has an important impact on implicit sentiment expression. In this paper, we use a knowledge graph to supplement implicit sentiment expression and propose a novel Implicit Sentiment Analysis model combining Knowledge enhancement and Context features (dubbed KC-ISA). The KC-ISA model can effectively integrate external knowledge and contextual features by the coattention mechanism. Finally, we conduct experiments on the SMP2019 implicit sentiment analysis dataset. Moreover, to verify the generality of the model, we also conduct experiments on two common sentiment analysis datasets. The results on three datasets show that our proposed KC-ISA model can achieve better results on text sentiment analysis.

Active Example Selection for In-Context Learning
Yiming Zhang | Shi Feng | Chenhao Tan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

With a handful of demonstration examples, large-scale language models demonstrate strong capability to perform various tasks by in-context learning from these examples, without any fine-tuning. We demonstrate that in-context learning performance can be highly unstable across samples of examples, indicating the idiosyncrasies of how language models acquire information. We formulate example selection for in-context learning as a sequential decision problem, and propose a reinforcement learning algorithm for identifying generalizable policies to select demonstration examples. For GPT-2, our learned policies demonstrate strong abilities of generalizing to unseen tasks in training, with a 5.8% improvement on average. Examples selected from our learned policies can even achieve a small improvement on GPT-3 Ada. However, the improvement diminishes on larger GPT-3 models, suggesting emerging capabilities of large language models.

DialogConv: A Lightweight Fully Convolutional Network for Multi-view Response Selection
Yongkang Liu | Shi Feng | Wei Gao | Daling Wang | Yifei Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Current end-to-end retrieval-based dialogue systems are mainly based on Recurrent Neural Networks or Transformers with attention mechanisms. Although promising results have been achieved, these models often suffer from slow inference or huge number of parameters. In this paper, we propose a novel lightweight fully convolutional architecture, called DialogConv, for response selection. DialogConv is exclusively built on top of convolution to extract matching features of context and response. Dialogues are modeled in 3D views, where DialogConv performs convolution operations on embedding view, word view and utterance view to capture richer semantic information from multiple contextual views. On the four benchmark datasets, compared with state-of-the-art baselines, DialogConv is on average about 8.5x smaller in size, and 79.39x and 10.64x faster on CPU and GPU devices, respectively. At the same time, DialogConv achieves the competitive effectiveness of response selection.

MulZDG: Multilingual Code-Switching Framework for Zero-shot Dialogue Generation
Yongkang Liu | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Building dialogue generation systems in a zero-shot scenario remains a huge challenge, since the typical zero-shot approaches in dialogue generation rely heavily on large-scale pre-trained language generation models such as GPT-3 and T5. The research on zero-shot dialogue generation without cumbersome language models is limited due to lacking corresponding parallel dialogue corpora. In this paper, we propose a simple but effective Multilingual learning framework for Zero-shot Dialogue Generation (dubbed as MulZDG) that can effectively transfer knowledge from an English corpus with large-scale training samples to a non-English corpus with zero samples. Besides, MulZDG can be viewed as a multilingual data augmentation method to improve the performance of the resource-rich language. First, we construct multilingual code-switching dialogue datasets via translation utterances randomly selected from monolingual English datasets. Then we employ MulZDG to train a unified multilingual dialogue model based on the code-switching datasets. The MulZDG can conduct implicit semantic alignment between different languages. Experiments on DailyDialog and DSTC7 datasets demonstrate that MulZDG not only achieve competitive performance under zero-shot case compared to training with sufficient examples but also greatly improve the performance of the source language.

Learning to Improve Persona Consistency in Multi-party Dialogue Generation via Text Knowledge Enhancement
Dongshi Ju | Shi Feng | Pengcheng Lv | Daling Wang | Yifei Zhang
Proceedings of the 29th International Conference on Computational Linguistics

In an open-domain dialogue system, the consistent persona is a key factor to generate real and coherent dialogues. Existing methods suffer from the incomprehensive persona tags that have unique and obscure meanings to describe human’s personality. Besides, the addressee information, which is closely related to express personality in multi-party dialogues, has been neglected. In this paper, we construct a multi-party personalized dialogue dataset and propose a graph convolution network model (PersonaTKG) with addressee selecting mechanism that integrates personas, dialogue utterances, and external text knowledge in a unified graph. Extensive experiments have shown that PersonaTKG outperforms the baselines by large margins and effectively improves persona consistency in the generated responses.

Alleviating Sparsity of Open Knowledge Graphs with Ternary Contrastive Learning
Qian Li | Shafiq Joty | Daling Wang | Shi Feng | Yifei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2022

Sparsity of formal knowledge and roughness of non-ontological construction make sparsity problem particularly prominent in Open Knowledge Graphs (OpenKGs). Due to sparse links, learning effective representation for few-shot entities becomes difficult. We hypothesize that by introducing negative samples, a contrastive learning (CL) formulation could be beneficial in such scenarios. However, existing CL methods model KG triplets as binary objects of entities ignoring the relation-guided ternary propagation patterns and they are too generic, i.e., they ignore zero-shot, few-shot and synonymity problems that appear in OpenKGs. To address this, we propose TernaryCL, a CL framework based on ternary propagation patterns among head, relation and tail. TernaryCL designs Contrastive Entity and Contrastive Relation to mine ternary discriminative features with both negative entities and relations, introduces Contrastive Self to help zero- and few-shot entities learn discriminative features, Contrastive Synonym to model synonymous entities, and Contrastive Fusion to aggregate graph features from multiple paths. Extensive experiments on benchmarks demonstrate the superiority of TernaryCL over state-of-the-art models.

Human-Centered Evaluation of Explanations
Jordan Boyd-Graber | Samuel Carton | Shi Feng | Q. Vera Liao | Tania Lombrozo | Alison Smith-Renner | Chenhao Tan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

The NLP community are increasingly interested in providing explanations for NLP models to help people make sense of model behavior and potentially improve human interaction with models. In addition to computational challenges in generating these explanations, evaluations of the generated explanations require human-centered perspectives and approaches. This tutorial will provide an overview of human-centered evaluations of explanations. First, we will give a brief introduction to the psychological foundation of explanations as well as types of NLP model explanations and their corresponding presentation, to provide the necessary background. We will then present a taxonomy of human-centered evaluation of explanations and dive into depth in the two categories: 1) evaluation based on human-annotated explanations; 2) evaluation with human-subjects studies. We will conclude by discussing future directions. We will also adopt a flipped format to maximize the in- teractive components for the live audience.

Learning to Explain Selectively: A Case Study on Question Answering
Shi Feng | Jordan Boyd-Graber
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Explanations promise to bridge the gap between humans and AI, yet it remains difficult to achieve consistent improvement in AI-augmented human decision making. The usefulness of AI explanations depends on many factors, and always showing the same type of explanation in all cases is suboptimal—so is relying on heuristics to adapt explanations for each scenario. We propose learning to explain”selectively”: for each decision that the user makes, we use a model to choose the best explanation from a set of candidates and update this model with feedback to optimize human performance. We experiment on a question answering task, Quizbowl, and show that selective explanations improve human performance for both experts and crowdworkers.

2021

Concealed Data Poisoning Attacks on NLP Models
Eric Wallace | Tony Zhao | Shi Feng | Sameer Singh
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that causes the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.

Boundary Detection with BERT for Span-level Emotion Cause Analysis
Xiangju Li | Wei Gao | Shi Feng | Yifei Zhang | Daling Wang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks
Xiaocui Yang | Shi Feng | Yifei Zhang | Daling Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

With the popularity of smartphones, we have witnessed the rapid proliferation of multimodal posts on various social media platforms. We observe that the multimodal sentiment expression has specific global characteristics, such as the interdependencies of objects or scenes within the image. However, most previous studies only considered the representation of a single image-text post and failed to capture the global co-occurrence characteristics of the dataset. In this paper, we propose Multi-channel Graph Neural Networks with Sentiment-awareness (MGNNS) for image-text sentiment detection. Specifically, we first encode different modalities to capture hidden representations. Then, we introduce multi-channel graph neural networks to learn multimodal representations based on the global characteristics of the dataset. Finally, we implement multimodal in-depth fusion with the multi-head attention mechanism to predict the sentiment of image-text pairs. Extensive experiments conducted on three publicly available datasets demonstrate the effectiveness of our approach for multimodal sentiment detection.

2019

Answer-Supervised Question Reformulation for Enhancing Conversational Machine Comprehension
Qian Li | Hui Su | Cheng Niu | Daling Wang | Zekang Li | Shi Feng | Yifei Zhang
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

In conversational machine comprehension, it has become one of the research hotspots integrating conversational history information through question reformulation for obtaining better answers. However, the existing question reformulation models are trained only using supervised question labels annotated by annotators without considering any feedback information from answers. In this paper, we propose a novel Answer-Supervised Question Reformulation (ASQR) model for enhancing conversational machine comprehension with reinforcement learning technology. ASQR utilizes a pointer-copy-based question reformulation model as an agent, takes an action to predict the next word, and observes a reward for the whole sentence state after generating the end-of-sequence token. The experimental results on QuAC dataset prove that our ASQR model is more effective in conversational machine comprehension. Moreover, pretraining is essential in reinforcement learning models, so we provide a high-quality annotated dataset for question reformulation by sampling a part of QuAC dataset.

Universal Adversarial Triggers for Attacking and Analyzing NLP
Eric Wallace | Shi Feng | Nikhil Kandpal | Matt Gardner | Sameer Singh
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of “why” questions in SQuAD to be answered “to kill american people”, and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.

Answer-guided and Semantic Coherent Question Generation in Open-domain Conversation
Weichao Wang | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating intriguing question is a key step towards building human-like open-domain chatbots. Although some recent works have focused on this task, compared with questions raised by humans, significant gaps remain in maintaining semantic coherence with post, which may result in generating dull or deviated questions. We observe that the answer has strong semantic coherence to its question and post, which can be used to guide question generation. Thus, we devise two methods to further enhance semantic coherence between post and question under the guidance of answer. First, the coherence score between generated question and answer is used as the reward function in a reinforcement learning framework, to encourage the cases that are consistent with the answer in semantic. Second, we incorporate adversarial training to explicitly control question generation in the direction of question-answer coherence. Extensive experiments show that our two methods outperform state-of-the-art baseline algorithms with large margins in raising semantic coherent questions.

Misleading Failures of Partial-input Baselines
Shi Feng | Eric Wallace | Jordan Boyd-Graber
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent work establishes dataset difficulty and removes annotation artifacts via partial-input baselines (e.g., hypothesis-only model for SNLI or question-only model for VQA). A successful partial-input baseline indicates that the dataset is cheatable. But the converse is not necessarily true: failures of partial-input baselines do not mean the dataset is free of artifacts. We first design artificial datasets to illustrate how the trivial patterns that are only visible in the full input can evade any partial-input baseline. Next, we identify such artifacts in the SNLI dataset—a hypothesis-only model augmented with trivial patterns in the premise can solve 15% of previously-thought “hard” examples. Our work provides a caveat for the use and creation of partial-input baselines for datasets.

Trick Me If You Can: Human-in-the-Loop Generation of Adversarial Examples for Question Answering
Eric Wallace | Pedro Rodriguez | Shi Feng | Ikuya Yamada | Jordan Boyd-Graber
Transactions of the Association for Computational Linguistics, Volume 7

Adversarial evaluation stress-tests a model’s understanding of natural language. Because past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human- in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human–computer matches: Although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.

How Pre-trained Word Representations Capture Commonsense Physical Comparisons
Pranav Goel | Shi Feng | Jordan Boyd-Graber
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Understanding common sense is important for effective natural language reasoning. One type of common sense is how two objects compare on physical properties such as size and weight: e.g., ‘is a house bigger than a person?’. We probe whether pre-trained representations capture comparisons and find they, in fact, have higher accuracy than previous approaches. They also generalize to comparisons involving objects not seen during training. We investigate how such comparisons are made: models learn a consistent ordering over all the objects in the comparisons. Probing models have significantly higher accuracy than those baseline models which use dataset artifacts: e.g., memorizing some words are larger than any other word.

2018

Personalized Microblog Sentiment Classification via Adversarial Cross-lingual Multi-task Learning
Weichao Wang | Shi Feng | Wei Gao | Daling Wang | Yifei Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Sentiment expression in microblog posts can be affected by user’s personal character, opinion bias, political stance and so on. Most of existing personalized microblog sentiment classification methods suffer from the insufficiency of discriminative tweets for personalization learning. We observed that microblog users have consistent individuality and opinion bias in different languages. Based on this observation, in this paper we propose a novel user-attention-based Convolutional Neural Network (CNN) model with adversarial cross-lingual learning framework. The user attention mechanism is leveraged in CNN model to capture user’s language-specific individuality from the posts. Then the attention-based CNN model is incorporated into a novel adversarial cross-lingual learning framework, in which with the help of user properties as bridge between languages, we can extract the language-specific features and language-independent features to enrich the user post representation so as to alleviate the data insufficiency problem. Results on English and Chinese microblog datasets confirm that our method outperforms state-of-the-art baseline algorithms with large margins.

Interpreting Neural Networks with Nearest Neighbors
Eric Wallace | Shi Feng | Jordan Boyd-Graber
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.

Pathologies of Neural Models Make Interpretations Difficult
Shi Feng | Eric Wallace | Alvin Grissom II | Mohit Iyyer | Pedro Rodriguez | Jordan Boyd-Graber
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

One way to interpret neural model predictions is to highlight the most important input features—for example, a heatmap visualization over the words in an input sentence. In existing interpretation methods for NLP, a word’s importance is determined by either input perturbation—measuring the decrease in model confidence when that word is removed—or by the gradient with respect to that word. To understand the limitations of these methods, we use input reduction, which iteratively removes the least important word from the input. This exposes pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods. As we confirm with human experiments, the reduced examples lack information to support the prediction of any label, but models still make the same predictions with high confidence. To explain these counterintuitive results, we draw connections to adversarial examples and confidence calibration: pathological behaviors reveal difficulties in interpreting neural models trained with maximum likelihood. To mitigate their deficiencies, we fine-tune the models by encouraging high entropy outputs on reduced examples. Fine-tuned models become more interpretable under input reduction, without accuracy loss on regular examples.

A Co-Attention Neural Network Model for Emotion Cause Analysis with Emotional Context Awareness
Xiangju Li | Kaisong Song | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Emotion cause analysis has been a key topic in natural language processing. Existing methods ignore the contexts around the emotion word which can provide an emotion cause clue. Meanwhile, the clauses in a document play different roles on stimulating a certain emotion, depending on their content relevance. Therefore, we propose a co-attention neural network model for emotion cause analysis with emotional context awareness. The method encodes the clauses with a co-attention based bi-directional long short-term memory into high-level input representations, which are further fed into a convolutional layer for emotion cause analysis. Experimental results show that our approach outperforms the state-of-the-art baseline methods.

2017

The UMD Neural Machine Translation Systems at WMT17 Bandit Learning Task
Amr Sharaf | Shi Feng | Khanh Nguyen | Kianté Brantley | Hal Daumé III
Proceedings of the Second Conference on Machine Translation

2016

Improving Attention Modeling with Implicit Distortion and Fertility for Machine Translation
Shi Feng | Shujie Liu | Nan Yang | Mu Li | Ming Zhou | Kenny Q. Zhu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In neural machine translation, the attention mechanism facilitates the translation process by producing a soft alignment between the source sentence and the target sentence. However, without dedicated distortion and fertility models seen in traditional SMT systems, the learned alignment may not be accurate, which can lead to low translation quality. In this paper, we propose two novel models to improve attention-based neural machine translation. We propose a recurrent attention mechanism as an implicit distortion model, and a fertility conditioned decoder as an implicit fertility model. We conduct experiments on large-scale Chinese–English translation tasks. The results show that our models significantly improve both the alignment and translation quality compared to the original attention mechanism and several other variations.

Knowledge-Based Semantic Embedding for Machine Translation
Chen Shi | Shujie Liu | Shuo Ren | Shi Feng | Mu Li | Ming Zhou | Xu Sun | Houfeng Wang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

NEUDM: A System for Topic-Based Message Polarity Classification
Yaqi Wang | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

NDMSCS: A Topic-Based Chinese Microblog Polarity Classification System
Yang Wang | Yaqi Wang | Shi Feng | Daling Wang | Yifei Zhang
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

2013

Is Twitter A Better Corpus for Measuring Sentiment Similarity?
Shi Feng | Le Zhang | Binyang Li | Daling Wang | Ge Yu | Kam-Fai Wong
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2010

A Unified Graph Model for Sentence-Based Opinion Retrieval
Binyang Li | Lanjun Zhou | Shi Feng | Kam-Fai Wong
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Co-authors

Nishant Balepur 5

Rachel Rudinger 3

Hal Daumé III 2

Seraphina Goldfarb-Tarrant 2

Soujanya Poria 2

Pedro Rodriguez 2

Hinrich Schütze 2

Kianté Brantley 1

Samuel Carton 1

Alvin Grissom II 1

Alexander Miserlis Hoyle 1

Faliang Huang 1

Nikhil Kandpal 1

Tania Lombrozo 1

Vishakh Padmakumar 1

Abhilasha Ravichander 1

Alison Smith-Renner 1

Yoo Yeon Sung 1

Lingshuai Wang 1

Tongshuang Wu 1

Ge Yu (于戈) 1

Jingyuan Zhang 1

Hongzhi Zhang 1

Fuzheng Zhang 1

Venues