Tao Li


2024

pdf bib
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Jiangshu Du | Yibo Wang | Wenting Zhao | Zhongfen Deng | Shuaiqi Liu | Renze Lou | Henry Peng Zou | Pranav Narayanan Venkit | Nan Zhang | Mukund Srinath | Haoran Ranran Zhang | Vipul Gupta | Yinghui Li | Tao Li | Fei Wang | Qin Liu | Tianlin Liu | Pengzhi Gao | Congying Xia | Chen Xing | Cheng Jiayang | Zhaowei Wang | Ying Su | Raj Sanjay Shah | Ruohao Guo | Jing Gu | Haoran Li | Kangda Wei | Zihao Wang | Lu Cheng | Surangika Ranathunga | Meng Fang | Jie Fu | Fei Liu | Ruihong Huang | Eduardo Blanco | Yixin Cao | Rui Zhang | Philip S. Yu | Wenpeng Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Claim: This work is not advocating the use of LLMs for paper (meta-)reviewing. Instead, wepresent a comparative analysis to identify and distinguish LLM activities from human activities. Two research goals: i) Enable better recognition of instances when someone implicitly uses LLMs for reviewing activities; ii) Increase community awareness that LLMs, and AI in general, are currently inadequate for performing tasks that require a high level of expertise and nuanced judgment.This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?This study focuses on the topic of LLMs as NLP Researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with “deficiency” labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) “LLMs as Reviewers”, how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) “LLMs as Metareviewers”, how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.

pdf bib
MUG: Interactive Multimodal Grounding on User Interfaces
Tao Li | Gang Li | Jingjie Zheng | Purple Wang | Yang Li
Findings of the Association for Computational Linguistics: EACL 2024

We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performances in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces in which 20% involves multiple rounds of interactions. To establish benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation—the online strategy consists of both human evaluation and automatic with simulators. Our experiments show that iterative interaction significantly improves the absolute task completion by 18% over the entire test set and 31% over the challenging split. Our results lay the foundation for further investigation of the problem.

pdf bib
Devil’s Advocate: Anticipatory Reflection for LLM Agents
Haoyu Wang | Tao Li | Zhiwei Deng | Dan Roth | Yang Li
Findings of the Association for Computational Linguistics: EMNLP 2024

In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to continuously introspect upon the suitability and results of their actions. We implement a three-fold introspective intervention: 1) anticipatory reflection on potential failures and alternative remedy before action execution, 2) post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution, and 3) comprehensive review upon plan completion for future strategy refinement. By deploying and experimenting with this methodology—a zero-shot approach—within WebArena for practical tasks in web environments, our agent demonstrates superior performance with a success rate of 23.5% over existing zero-shot methods by 3.5%. The experimental results suggest that our introspection-driven approach not only enhances the agent’s ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency by reducing the number of trials and plan revisions by 45% needed to achieve a task.

pdf bib
PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning
Jiaru Zou | Mengyu Zhou | Tao Li | Shi Han | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent advances in fine-tuning large language models (LLMs) have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.

pdf bib
Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression
Zhichao Xu | Ashim Gupta | Tao Li | Oliver Bentham | Vivek Srikumar
Findings of the Association for Computational Linguistics: EMNLP 2024

Increasingly, model compression techniques enable large language models (LLMs) to be deployed in real-world applications. As a result of this momentum towards local deployment, compressed LLMs will interact with a large population. Prior work on compression typically prioritize preserving perplexity, which is directly analogous to training loss. The impact of compression method on other critical aspects of model behavior—particularly safety—requires systematic assessment. To this end, we investigate the impact of model compression along four dimensions: (1) degeneration harm, i.e., bias and toxicity in generation; (2) representational harm, i.e., biases in discriminative tasks; (3) dialect bias; and (4) language modeling and downstream task performance. We examine a wide spectrum of LLM compression techniques, including unstructured pruning, semi-structured pruning, and quantization. Our analysis reveals that compression can lead to unexpected consequences. Although compression may unintentionally alleviate LLMs’ degeneration harm, it can still exacerbate representational harm. Furthermore, increasing compression produces a divergent impact on different protected groups. Finally, different compression methods have drastically different safety impacts: for example, quantization mostly preserves bias while pruning degrades quickly. Our findings underscore the importance of integrating safety assessments into the development of compressed LLMs to ensure their reliability across real-world applications.

2023

pdf bib
An Extensible Plug-and-Play Method for Multi-Aspect Controllable Text Generation
Xuancheng Huang | Zijun Liu | Peng Li | Tao Li | Maosong Sun | Yang Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, multi-aspect controllable text generation that controls the generated text in multiple aspects (e.g., sentiment, topic, and keywords) has attracted increasing attention. Although methods based on parameter efficient tuning like prefix-tuning could achieve multi-aspect controlling in a plug-and-play way, the mutual interference of multiple prefixes leads to significant degeneration of constraints and limits their extensibility to training-time unseen aspect combinations. In this work, we provide a theoretical lower bound for the interference and empirically found that the interference grows with the number of layers where prefixes are inserted. Based on these analyses, we propose using trainable gates to normalize the intervention of prefixes to restrain the growing interference. As a result, controlling training-time unseen combinations of aspects can be realized by simply concatenating corresponding plugins such that new constraints can be extended at a lower cost. In addition, we propose a unified way to process both categorical and free-form constraints. Experiments on text generation and machine translation demonstrate the superiority of our approach over baselines on constraint accuracy, text quality, and extensibility.

pdf bib
A Zero-Shot Language Agent for Computer Control with Structured Reflection
Tao Li | Gang Li | Zhiwei Deng | Bryan Wang | Yang Li
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment (e.g. MiniWoB++). To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting. Without these trace examples, it remains a challenge how an agent can autonomously learn and improve its control on a computer, which limits the ability of an agent to perform a new task. We approach this problem with a zero-shot agent that requires no given expert traces. Our agent plans for executable actions on a partially observed environment, and iteratively progresses a task by identifying and learning from its mistakes via self-reflection and structured thought management. On the easy tasks of MiniWoB++, we show that our zero-shot agent often outperforms recent SoTAs, with more efficient reasoning. For tasks with more complexity, our reflective agent performs on par with prior best models, even though previous works had the advantages of accessing expert traces or additional screen information.

pdf bib
Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers
Weng Tam | Xiao Liu | Kaixuan Ji | Lilong Xue | Jiahua Liu | Tao Li | Yuxiao Dong | Jie Tang
Findings of the Association for Computational Linguistics: EMNLP 2023

Prompt tuning attempts to update few task-specific parameters in pre-trained models. It has achieved comparable performance to fine-tuning of the full parameter set on both language understanding and generation tasks. In this work, we study the problem of prompt tuning for neural text retrievers. We introduce parameter-efficient prompt tuning for text retrieval across in-domain, cross-domain, and cross-topic settings. Through an extensive analysis, we show that the strategy can mitigate the two issues—parameter-inefficiency and weak generalizability—faced by fine-tuning based retrieval methods. Notably, it can significantly improve the out-of-domain zero-shot generalization of the retrieval models. By updating only 0.1% of the model parameters, the prompt tuning strategy can help retrieval models achieve better generalization performance than traditional methods in which all parameters are updated. Finally, to facilitate research on retrievers’ cross-topic generalizability, we curate and release an academic retrieval dataset with 18K query-results pairs in 87 topics, making it the largest topic-specific one to date.

pdf bib
Learning Semantic Role Labeling from Compatible Label Sequences
Tao Li | Ghazaleh Kazeminejad | Susan Brown | Vivek Srikumar | Martha Palmer
Findings of the Association for Computational Linguistics: EMNLP 2023

Semantic role labeling (SRL) has multiple disjoint label sets, e.g., VerbNet and PropBank. Creating these datasets is challenging, therefore a natural question is how to use each one to help the other. Prior work has shown that cross-task interaction helps, but only explored multitask learning so far. A common issue with multi-task setup is that argument sequences are still separately decoded, running the risk of generating structurally inconsistent label sequences (as per lexicons like Semlink). In this paper, we eliminate such issue with a framework that jointly models VerbNet and PropBank labels as one sequence. In this setup, we show that enforcing Semlink constraints during decoding constantly improves the overall F1. With special input constructions, our joint model infers VerbNet arguments from given PropBank arguments with over 99 F1. For learning, we propose a constrained marginal model that learns with knowledge defined in Semlink to further benefit from the large amounts of PropBank-only data. On the joint benchmark based on CoNLL05, our models achieve state-of-the-art F1’s, outperforming the prior best in-domain model by 3.5 (VerbNet) and 0.8 (PropBank). For out-of-domain generalization, our models surpass the prior best by 3.4 (VerbNet) and 0.2 (PropBank).

2021

pdf bib
OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings
Sunipa Dev | Tao Li | Jeff M Phillips | Vivek Srikumar
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they not only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention that demonstrate the tradeoff between bias removal and information retention. To address this challenge, we propose OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.

pdf bib
Automatic Entity State Annotation using the VerbNet Semantic Parser
Ghazaleh Kazeminejad | Martha Palmer | Tao Li | Vivek Srikumar
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

Tracking entity states is a natural language processing task assumed to require human annotation. In order to reduce the time and expenses associated with annotation, we introduce a new method to automatically extract entity states, including location and existence state of entities, following Dalvi et al. (2018) and Tandon et al. (2020). For this purpose, we rely primarily on the semantic representations generated by the state of the art VerbNet parser (Gung, 2020), and extract the entities (event participants) and their states, based on the semantic predicates of the generated VerbNet semantic representation, which is in propositional logic format. For evaluation, we used ProPara (Dalvi et al., 2018), a reading comprehension dataset which is annotated with entity states in each sentence, and tracks those states in paragraphs of natural human-authored procedural texts. Given the presented limitations of the method, the peculiarities of the ProPara dataset annotations, and that our system, Lexis, makes no use of task-specific training data and relies solely on VerbNet, the results are promising, showcasing the value of lexical resources.

2020

pdf bib
Structured Tuning for Semantic Role Labeling
Tao Li | Parth Anand Jawale | Martha Palmer | Vivek Srikumar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent neural network-driven semantic role labeling (SRL) systems have shown impressive improvements in F1 scores. These improvements are due to expressive input representations, which, at least at the surface, are orthogonal to knowledge-rich constrained decoding mechanisms that helped linear SRL models. Introducing the benefits of structure to inform neural models presents a methodological challenge. In this paper, we present a structured tuning framework to improve models using softened constraints only at training time. Our framework leverages the expressiveness of neural networks and provides supervision with structured loss components. We start with a strong baseline (RoBERTa) to validate the impact of our approach, and show that our framework outperforms the baseline by learning to comply with declarative constraints. Additionally, our experiments with smaller training sizes show that we can achieve consistent improvements under low-resource scenarios.

pdf bib
UNQOVERing Stereotyping Biases via Underspecified Questions
Tao Li | Daniel Khashabi | Tushar Khot | Ashish Sabharwal | Vivek Srikumar
Findings of the Association for Computational Linguistics: EMNLP 2020

While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence and question independence. We design a formalism that isolates the aforementioned errors. As case studies, we use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion. We probe five transformer-based QA models trained on two QA datasets, along with their underlying language models. Our broad study reveals that (1) all these models, with and without fine-tuning, have notable stereotyping biases in these classes; (2) larger models often have higher bias; and (3) the effect of fine-tuning on bias varies strongly with the dataset and the model size.

2019

pdf bib
A Logic-Driven Framework for Consistency of Neural Models
Tao Li | Vivek Gupta | Maitrey Mehta | Vivek Srikumar
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

While neural models show remarkable accuracy on individual predictions, their internal beliefs can be inconsistent across examples. In this paper, we formalize such inconsistency as a generalization of prediction error. We propose a learning framework for constraining models using logic rules to regularize them away from inconsistency. Our framework can leverage both labeled and unlabeled examples and is directly compatible with off-the-shelf learning schemes without model redesign. We instantiate our framework on natural language inference, where experiments show that enforcing invariants stated in logic can help make the predictions of neural models both accurate and consistent.

pdf bib
Augmenting Neural Networks with First-order Logic
Tao Li | Vivek Srikumar
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Today, the dominant paradigm for training neural networks involves minimizing task loss on a large dataset. Using world knowledge to inform a model, and yet retain the ability to perform end-to-end training remains an open question. In this paper, we present a novel framework for introducing declarative knowledge to neural network architectures in order to guide training and prediction. Our framework systematically compiles logical statements into computation graphs that augment a neural network without extra learnable parameters or manual redesign. We evaluate our modeling strategy on three tasks: machine comprehension, natural language inference, and text chunking. Our experiments show that knowledge-augmented networks can strongly improve over baselines, especially in low-data regimes.

2018

pdf bib
Visual Interrogation of Attention-Based Models for Natural Language Inference and Machine Comprehension
Shusen Liu | Tao Li | Zhimin Li | Vivek Srikumar | Valerio Pascucci | Peer-Timo Bremer
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Neural networks models have gained unprecedented popularity in natural language processing due to their state-of-the-art performance and the flexible end-to-end training scheme. Despite their advantages, the lack of interpretability hinders the deployment and refinement of the models. In this work, we present a flexible visualization library for creating customized visual analytic environments, in which the user can investigate and interrogate the relationships among the input, the model internals (i.e., attention), and the output predictions, which in turn shed light on the model decision-making process.

pdf bib
Aspect Based Sentiment Analysis with Gated Convolutional Networks
Wei Xue | Tao Li
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Aspect based sentiment analysis (ABSA) can provide more detailed information than general sentiment analysis, because it aims to predict the sentiment polarities of the given aspects or entities in text. We summarize previous approaches into two subtasks: aspect-category sentiment analysis (ACSA) and aspect-term sentiment analysis (ATSA). Most previous approaches employ long short-term memory and attention mechanisms to predict the sentiment polarity of the concerned targets, which are often complicated and need more training time. We propose a model based on convolutional neural networks and gating mechanisms, which is more accurate and efficient. First, the novel Gated Tanh-ReLU Units can selectively output the sentiment features according to the given aspect or entity. The architecture is much simpler than attention layer used in the existing models. Second, the computations of our model could be easily parallelized during training, because convolutional layers do not have time dependency as in LSTM layers, and gating units also work independently. The experiments on SemEval datasets demonstrate the efficiency and effectiveness of our models.

2017

pdf bib
MTNA: A Neural Multi-task Model for Aspect Category Classification and Aspect Term Extraction On Restaurant Reviews
Wei Xue | Wubai Zhou | Tao Li | Qing Wang
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Online reviews are valuable resources not only for consumers to make decisions before purchase, but also for providers to get feedbacks for their services or commodities. In Aspect Based Sentiment Analysis (ABSA), it is critical to identify aspect categories and extract aspect terms from the sentences of user-generated reviews. However, the two tasks are often treated independently, even though they are closely related. Intuitively, the learned knowledge of one task should inform the other learning task. In this paper, we propose a multi-task learning model based on neural networks to solve them together. We demonstrate the improved performance of our multi-task learning model over the models trained separately on three public dataset released by SemEval workshops.

2016

pdf bib
Exploiting Sentence Similarities for Better Alignments
Tao Li | Vivek Srikumar
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

pdf bib
Co-training for Semi-supervised Sentiment Classification Based on Dual-view Bags-of-words Representation
Rui Xia | Cheng Wang | Xin-Yu Dai | Tao Li
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2013

pdf bib
A Participant-based Approach for Event Summarization Using Twitter Streams
Chao Shen | Fei Liu | Fuliang Weng | Tao Li
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

pdf bib
A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels
Chao Shen | Tao Li
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf bib
Multi-Document Summarization via the Minimum Dominating Set
Chao Shen | Tao Li
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

pdf bib
A Non-negative Matrix Tri-factorization Approach to Sentiment Classification with Lexical Prior Knowledge
Tao Li | Yi Zhang | Vikas Sindhwani
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Multi-Document Summarization using Sentence-based Topic Models
Dingding Wang | Shenghuo Zhu | Tao Li | Yihong Gong
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers