Tong Wang - ACL Anthology

2025

Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
Jingcheng Niu | Xingdi Yuan | Tong Wang | Hamidreza Saghir | Amir H. Abdi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We observe a novel phenomenon, *contextual entrainment*, across a wide range of language models (LMs) and prompt settings, providing a new mechanistic perspective on how LMs become distracted by “irrelevant” contextual information in the input prompt. Specifically, LMs assign significantly higher logits (or probabilities) to any tokens that have previously appeared in the context prompt, even for random tokens. This suggests that contextual entrainment is a mechanistic phenomenon, occurring independently of the relevance or semantic relation of the tokens to the question or the rest of the sentence. We find statistically significant evidence that the magnitude of contextual entrainment is influenced by semantic factors. Counterfactual prompts have a greater effect compared to factual ones, suggesting that while contextual entrainment is a mechanistic phenomenon, it is modulated by semantic factors. We hypothesise that there is a circuit of attention heads, the *entrainment heads*, that corresponds to the contextual entrainment phenomenon. Using a novel entrainment head discovery method based on differentiable masking, we identify these heads across various settings. When we “turn off” these heads, i.e., set their outputs to zero, the effect of contextual entrainment is significantly attenuated, causing the model to generate output that capitulates to what it would produce if no distracting context were provided. Our discovery of contextual entrainment, along with our investigation into LM distraction via the entrainment heads, marks a key step towards the mechanistic analysis and mitigation of the distraction problem.

Will Annotators Disagree? Identifying Subjectivity in Value-Laden Arguments
Amir Homayounirad | Enrico Liscio | Tong Wang | Catholijn M Jonker | Luciano Cavalcante Siebert
Findings of the Association for Computational Linguistics: EMNLP 2025

Aggregating multiple annotations into a single ground truth label may hide valuable insights into annotator disagreement, particularly in tasks where subjectivity plays a crucial role. In this work, we explore methods for identifying subjectivity in recognizing the human values that motivate arguments. We evaluate two main approaches: inferring subjectivity through value prediction vs. directly identifying subjectivity. Our experiments show that direct subjectivity identification significantly improves model performance in flagging subjective arguments. Furthermore, combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Our proposed methods can help identify arguments that individuals may interpret differently, fostering a more nuanced annotation process.

2024

Less is More for Improving Automatic Evaluation of Factual Consistency
Tong Wang | Ninad Kulkarni | Yanjun Qi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

Assessing the factual consistency of automatically generated texts in relation to source context is crucial for developing reliable natural language generation applications. Recent literature proposes AlignScore, which uses a unified alignment model to evaluate factual consistency and substantially outperforms previous methods across many benchmark tasks. In this paper, we take a closer look at the datasets used in AlignScore and uncover an unexpected finding: utilizing a smaller number of data points can actually improve performance. We process the original AlignScore training dataset to remove noise, augment it with robustness-enhanced samples, and utilize a subset comprising 10% of the data to train an improved factual consistency evaluation model, which we call LIM-RA (Less Is More for Robust AlignScore). LIM-RA demonstrates superior performance, consistently outperforming AlignScore and other strong baselines like ChatGPT across four benchmarks (two utilizing traditional natural language generation datasets and two focused on large language model outputs). Our experiments show that LIM-RA achieves the highest score on 24 of the 33 test datasets, while staying competitive on the rest, establishing a new state of the art.

2023

DeepMaven: Deep Question Answering on Long-Distance Movie/TV Show Videos with Multimedia Knowledge Extraction and Synthesis
Yi Fung | Han Wang | Tong Wang | Ali Kebarighotbi | Mohit Bansal | Heng Ji | Prem Natarajan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Long video content understanding poses a challenging set of research questions as it involves long-distance, cross-media reasoning and knowledge awareness. In this paper, we present a new benchmark for this problem domain, targeting the task of deep movie/TV question answering (QA) beyond previous work’s focus on simple plot summary and short video moment settings. We define several baselines based on direct retrieval of relevant context for long-distance movie QA. Observing that real-world QAs may require higher-order multi-hop inferences, we further propose a novel framework, called DeepMaven, which extracts events, entities, and relations from the rich multimedia content in long videos to pre-construct movie knowledge graphs (movieKGs), and at the time of QA inference, complements general semantics with structured knowledge for more effective information retrieval and knowledge reasoning. We also introduce our recently collected DeepMovieQA dataset, including 1,000 long-form QA pairs from 41 hours of videos, to serve as a new and useful resource for future work. Empirical results show that DeepMaven performs competitively on both the new DeepMovieQA and the pre-existing MovieQA dataset.

General-to-Specific Transfer Labeling for Domain Adaptable Keyphrase Generation
Rui Meng | Tong Wang | Xingdi Yuan | Yingbo Zhou | Daqing He
Findings of the Association for Computational Linguistics: ACL 2023

Training keyphrase generation (KPG) models requires a large amount of annotated data, which can be prohibitively expensive and often limited to specific domains. In this study, we first demonstrate that large distribution shifts among different domains severely hinder the transferability of KPG models. We then propose a three-stage pipeline, which gradually guides KPG models’ learning focus from general syntactical features to domain-related semantics, in a data-efficient manner. With domain-general phrase pre-training, we pre-train Sequence-to-Sequence models with generic phrase annotations that are widely available on the web, which enables the models to generate phrases in a wide range of domains. The resulting model is then applied in the Transfer Labeling stage to produce domain-specific pseudo keyphrases, which help adapt models to a new domain. Finally, we fine-tune the model with limited data with true labels to fully adapt it to the target domain. Our experimental results show that the proposed process can produce good quality keyphrases in new domains and achieve consistent improvements after adaptation with limited in-domain annotated data. All code and datasets are available at https://github.com/memray/OpenNMT-kpg-release.

Selecting Better Samples from Pre-trained LLMs: A Case Study on Question Generation
Xingdi Yuan | Tong Wang | Yen-Hsiang Wang | Emery Fine | Rania Abdelghani | Hélène Sauzéon | Pierre-Yves Oudeyer
Findings of the Association for Computational Linguistics: ACL 2023

Large Language Models (LLMs) have in recent years demonstrated impressive prowess in natural language generation. A common practice to improve generation diversity is to sample multiple outputs from the model. However, partly due to the inaccessibility of LLMs, there is no simple and robust way of selecting the best output from these stochastic samples. As a case study framed in the context of question generation, we propose two prompt-based approaches, namely round-trip and prompt-based score, for selecting high-quality questions from a set of LLM-generated candidates. Our method works without the need to modify the underlying model, nor does it rely on human-annotated references, both of which are realistic constraints for real-world deployment of LLMs. With automatic as well as human evaluations, we empirically demonstrate that our approach can effectively select questions of higher quality than greedy generation.

An Empirical Study of Instruction-tuning Large Language Models in Chinese
Qingyi Si | Tong Wang | Zheng Lin | Xu Zhang | Yanan Cao | Weiping Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

The success of ChatGPT validates the potential of large language models (LLMs) in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community’s interest in instruction-tuning, which is deemed to accelerate ChatGPT’s replication process. However, research on instruction-tuning LLMs in Chinese, the world’s most spoken language, is still in its early stages. Therefore, this paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook that provides valuable findings for effectively customizing LLMs that can better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, and instruction data types, which are the three most important elements for instruction-tuning. Besides, we also conduct experiments to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to the open Chinese version of ChatGPT. This paper will release a powerful Chinese LLM that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.

2022

Better Language Model with Hypernym Class Prediction
He Bai | Tong Wang | Alessandro Sordoni | Peng Shi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Class-based language models (LMs) were devised long ago to address context sparsity in n-gram LMs. In this study, we revisit this approach in the context of neural LMs. We hypothesize that class-based prediction leads to an implicit context aggregation for similar words and thus can improve generalization for rare words. We map words that have a common WordNet hypernym to the same class and train large neural LMs by gradually annealing from predicting the class to token prediction during training. Empirically, this curriculum learning strategy consistently improves perplexity over various large, highly-performant state-of-the-art Transformer-based models on two datasets, WikiText-103 and ARXIV. Our analysis shows that the performance improvement is achieved without sacrificing performance on rare words. Finally, we document other attempts that failed to yield empirical gains, and discuss future directions for the adoption of class-based LMs on a larger scale.

2021

Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents
Rui Meng | Khushboo Thaker | Lei Zhang | Yue Dong | Xingdi Yuan | Tong Wang | Daqing He
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Faceted summarization provides briefings of a document from different perspectives. Readers can quickly comprehend the main points of a long document with the help of a structured outline. However, little research has been conducted on this subject, partially due to the lack of large-scale faceted summarization datasets. In this study, we present FacetSum, a faceted summarization benchmark built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value. Analyses and empirical results on our dataset reveal the importance of bringing structure into summaries. We believe FacetSum will spur further advances in summarization research and foster the development of NLP systems that can leverage the structured information in both long texts and summaries.

Personalized Entity Resolution with Dynamic Heterogeneous KnowledgeGraph Representations
Ying Lin | Han Wang | Jiangning Chen | Tong Wang | Yue Liu | Heng Ji | Yang Liu | Premkumar Natarajan
Proceedings of the 4th Workshop on e-Commerce and NLP

The growing popularity of Virtual Assistants poses new challenges for Entity Resolution, the task of linking mentions in text to their referent entities in a knowledge base. Specifically, in the shopping domain, customers tend to mention the entities implicitly (e.g., “organic milk”) rather than use the entity names explicitly, leading to a large number of candidate products. Meanwhile, for the same query, different customers may expect different results. For example, with “add milk to my cart”, a customer may refer to a certain product from his/her favorite brand, while some customers may want to re-order products they regularly purchase. Moreover, new customers may lack persistent shopping history, which requires us to enrich the connections between customers through products and their attributes. To address these issues, we propose a new framework that leverages personalized features to improve the accuracy of product ranking. We first build a cross-source heterogeneous knowledge graph from customer purchase history and product knowledge graph to jointly learn customer and product embeddings. After that, we incorporate product, customer, and history representations into a neural reranking model to predict which candidate is most likely to be purchased by a specific customer. Experimental results show that our model substantially improves the accuracy of the top ranked candidates by 24.6% compared to the state-of-the-art product search model.

Diverse Distributions of Self-Supervised Tasks for Meta-Learning in NLP
Trapit Bansal | Karthick Prasad Gunasekaran | Tong Wang | Tsendsuren Munkhdalai | Andrew McCallum
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Meta-learning considers the problem of learning an efficient learning process that can leverage its past experience to accurately solve new tasks. However, the efficacy of meta-learning crucially depends on the distribution of tasks available for training, and this is often assumed to be known a priori or constructed from limited supervised datasets. In this work, we aim to provide task distributions for meta-learning by considering self-supervised tasks automatically proposed from unlabeled text, to enable large-scale meta-learning in NLP. We design multiple distributions of self-supervised tasks by considering important aspects of task diversity, difficulty, type, domain, and curriculum, and investigate how they affect meta-learning performance. Our analysis shows that all these factors meaningfully alter the task distribution, some inducing significant improvements in downstream few-shot accuracy of the meta-learned models. Empirically, results on 20 downstream tasks show significant improvements in few-shot learning – adding up to +4.2% absolute accuracy (on average) to the previous unsupervised meta-learning method, and performing comparably to supervised methods on the FewRel 2.0 benchmark.

An Empirical Study on Neural Keyphrase Generation
Rui Meng | Xingdi Yuan | Tong Wang | Sanqiang Zhao | Adam Trischler | Daqing He
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent years have seen a flourishing of neural keyphrase generation (KPG) works, including the release of several large-scale datasets and a host of new models to tackle them. Model performance on KPG tasks has increased significantly with evolving deep learning research. However, there is neither a comprehensive comparison among different model designs nor a thorough investigation of related factors that may affect a KPG system’s generalization performance. In this empirical study, we aim to fill this gap by providing extensive experimental results and analyzing the most crucial factors impacting the generalizability of KPG models. We hope this study can help clarify some of the uncertainties surrounding the KPG task and facilitate future research on this topic.

Optimizing NLU Reranking Using Entity Resolution Signals in Multi-domain Dialog Systems
Tong Wang | Jiangning Chen | Mohsen Malmir | Shuyan Dong | Xin He | Han Wang | Chengwei Su | Yue Liu | Yang Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

In dialog systems, the Natural Language Understanding (NLU) component typically makes the interpretation decision (including domain, intent and slots) for an utterance before the mentioned entities are resolved. This may result in intent classification and slot tagging errors. In this work, we propose to leverage Entity Resolution (ER) features in NLU reranking and introduce a novel loss term based on ER signals to better learn model weights in the reranking framework. In addition, for a multi-domain dialog scenario, we propose a score distribution matching method to ensure scores generated by the NLU reranking models for different domains are properly calibrated. In offline experiments, we demonstrate our proposed approach significantly outperforms the baseline model on both single-domain and cross-domain evaluations.

Entity Resolution in Open-domain Conversations
Mingyue Shang | Tong Wang | Mihail Eric | Jiangning Chen | Jiyang Wang | Matthew Welch | Tiantong Deng | Akshay Grewal | Han Wang | Yue Liu | Yang Liu | Dilek Hakkani-Tur
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

In recent years, incorporating external knowledge for response generation in open-domain conversation systems has attracted great interest. To improve the relevancy of retrieved knowledge, we propose a neural entity linking (NEL) approach. Different from formal documents, such as news, conversational utterances are informal and multi-turn, which makes it more challenging to disambiguate the entities. Therefore, we present a context-aware named entity recognition (NER) model and an entity resolution (ER) model to utilize dialogue context information. We conduct NEL experiments on three open-domain conversation datasets and validate that incorporating context information improves the performance of NER and ER models. The end-to-end NEL approach outperforms the baseline by a relative 62.8% in F1. Furthermore, we verify that using external knowledge based on NEL benefits the neural response generation model.

2020

One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases
Xingdi Yuan | Tong Wang | Rui Meng | Khushboo Thaker | Peter Brusilovsky | Daqing He | Adam Trischler
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Different texts shall by nature correspond to different numbers of keyphrases. This desideratum is largely missing from existing neural keyphrase generation models. In this study, we address this problem from both modeling and evaluation perspectives. We first propose a recurrent generative model that generates multiple keyphrases as delimiter-separated sequences. Generation diversity is further enhanced with two novel techniques by manipulating decoder hidden states. In contrast to previous approaches, our model is capable of generating diverse keyphrases and controlling the number of outputs. We further propose two evaluation metrics tailored towards variable-number generation. We also introduce a new dataset, StackEx, that expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks. With both previous and new evaluation metrics, our model outperforms strong baselines on all datasets.

Exploring and Predicting Transferability across NLP Tasks
Tu Vu | Tong Wang | Tsendsuren Munkhdalai | Alessandro Sordoni | Adam Trischler | Andrew Mattarella-Micke | Subhransu Maji | Mohit Iyyer
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent advances in NLP demonstrate the effectiveness of training large-scale language models and transferring them to downstream tasks. Can fine-tuning these models on tasks other than language modeling further improve performance? In this paper, we conduct an extensive study of the transferability between 33 NLP tasks across three broad classes of problems (text classification, question answering, and sequence labeling). Our results show that transfer learning is more beneficial than previously thought, especially when target task data is scarce, and can improve performance even with low-data source tasks that differ substantially from the target task (e.g., part-of-speech tagging transfers well to the DROP QA dataset). We also develop task embeddings that can be used to predict the most transferable source tasks for a given target task, and we validate their effectiveness in experiments controlled for source and target data size. Overall, our experiments reveal that factors such as data size, task and domain similarity, and task complexity all play a role in determining transferability.

2018

Annotating High-Level Structures of Short Stories and Personal Anecdotes
Boyang Li | Beth Cardier | Tong Wang | Florian Metze
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Neural Models for Key Phrase Extraction and Question Generation
Sandeep Subramanian | Tong Wang | Xingdi Yuan | Saizheng Zhang | Adam Trischler | Yoshua Bengio
Proceedings of the Workshop on Machine Reading for Question Answering

We propose a two-stage neural model to tackle question generation from documents. First, our model estimates the probability that word sequences in a document are ones that a human would pick when selecting candidate answers by training a neural key-phrase extractor on the answers in a question-answering corpus. Predicted key phrases then act as target answers and condition a sequence-to-sequence question-generation model with a copy mechanism. Empirically, our key-phrase extraction model significantly outperforms an entity-tagging baseline and existing rule-based approaches. We further demonstrate that our question generation system formulates fluent, answerable questions from key phrases. This two-stage system could be used to augment or generate reading comprehension datasets, which may be leveraged to improve machine reading systems or in educational settings.

2017

Machine Comprehension by Text-to-Text Neural Question Generation
Xingdi Yuan | Tong Wang | Caglar Gulcehre | Alessandro Sordoni | Philip Bachman | Saizheng Zhang | Sandeep Subramanian | Adam Trischler
Proceedings of the 2nd Workshop on Representation Learning for NLP

We propose a recurrent neural model that generates natural-language questions from documents, conditioned on answers. We show how to train the model using a combination of supervised and reinforcement learning. After teacher forcing for standard maximum likelihood training, we fine-tune the model using policy gradient techniques to maximize several rewards that measure question quality. Most notably, one of these rewards is the performance of a question-answering system. We motivate question generation as a means to improve the performance of question answering systems. Our model is trained and evaluated on the recent question-answering dataset SQuAD.

NewsQA: A Machine Comprehension Dataset
Adam Trischler | Tong Wang | Xingdi Yuan | Justin Harris | Alessandro Sordoni | Philip Bachman | Kaheer Suleman
Proceedings of the 2nd Workshop on Representation Learning for NLP

We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text in the articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. Analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (13.3% F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available online.

2015

Learning Lexical Embeddings with Syntactic and Lexicographic Knowledge
Tong Wang | Abdelrahman Mohamed | Graeme Hirst
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Extended Topic Model for Word Dependency
Tong Wang | Vish Viswanath | Ping Chen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation
Tong Wang | Graeme Hirst
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures
Tong Wang | Graeme Hirst
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Predicting Word Clipping with Latent Semantic Analysis
Julian Brooke | Tong Wang | Graeme Hirst
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Near-synonym Lexical Choice in Latent Semantic Space
Tong Wang | Graeme Hirst
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Automatic Acquisition of Lexical Formality
Julian Brooke | Tong Wang | Graeme Hirst
Coling 2010: Posters

2009

Extracting Synonyms from Dictionary Definitions
Tong Wang | Graeme Hirst
Proceedings of the International Conference RANLP-2009

Co-authors

Han Wang (王涵) 4

Jiangning Chen 3

Yang Liu (刘扬) 3

Philip Bachman 2

Julian Brooke 2

Tsendsuren Munkhdalai 2

Prem Natarajan 2

Sandeep Subramanian 2

Khushboo Thaker 2

Saizheng Zhang 2

Rania Abdelghani 1

Richard He Bai 1

Trapit Bansal 1

Yoshua Bengio 1

Peter Brusilovsky 1

Tiantong Deng 1

Akshay Grewal 1

Karthick Prasad Gunasekaran 1

Çağlar Gülçehre 1

Dilek Hakkani-Tur 1

Justin Harris 1

Amir Homayounirad 1

Catholijn M. Jonker 1

Ali Kebarighotbi 1

Ninad Kulkarni 1

Enrico Liscio 1

Subhransu Maji 1

Mohsen Malmir 1

Andrew Mattarella-Micke 1

Andrew McCallum 1

Florian Metze 1

Abdelrahman Mohamed 1

Jingcheng Niu 1

Pierre-Yves Oudeyer 1

Hamidreza Saghir 1

Hélène Sauzéon 1

Mingyue Shang 1

Luciano Cavalcante Siebert 1

Kaheer Suleman 1

Vish Viswanath 1

Yen-Hsiang Wang 1

Matthew Welch 1

Sanqiang Zhao 1
