Xiao Liu


2024

pdf bib
Can Multiple-choice Questions Really Be Useful in Detecting the Abilities of LLMs?
Wangyue Li | Liangzhi Li | Tong Xiang | Xiao Liu | Wei Deng | Noa Garcia
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLM’s capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ’s efficacy, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit an order sensitivity in bilingual MCQs, favoring answers located at specific positions, i.e., the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs’ output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is not only reflected in the evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.

pdf bib
CASA: Causality-driven Argument Sufficiency Assessment
Xiao Liu | Yansong Feng | Kai-Wei Chang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The argument sufficiency assessment task aims to determine if the premises of a given argument support its conclusion.To tackle this task, existing works often train a classifier on data annotated by humans. However, annotating data is laborious, and annotations are often inconsistent due to subjective criteria. Motivated by the definition of probability of sufficiency (PS) in the causal literature, we proposeCASA, a zero-shot causality-driven argument sufficiency assessment framework. PS measures how likely introducing the premise event would lead to the conclusion when both the premise and conclusion events are absent. To estimate this probability, we propose to use large language models (LLMs) to generate contexts that are inconsistent with the premise and conclusion and revise them by injecting the premise event.Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments. We further deploy CASA in a writing assistance application, and find that suggestions generated by CASA enhance the sufficiency of student-written arguments. Code and data are available at https://github.com/xxxiaol/CASA.

2023

pdf bib
How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension
Chen Zhang | Jiuheng Lin | Xiao Liu | Yuxuan Lai | Yansong Feng | Dongyan Zhao
Findings of the Association for Computational Linguistics: ACL 2023

The multi-answer phenomenon, where a question may have multiple answers scattered in the document, can be well handled by humans but is challenging enough for machine reading comprehension (MRC) systems. Despite recent progress in multi-answer MRC, there lacks a systematic analysis of how this phenomenon arises and how to better address it. In this work, we design a taxonomy to categorize commonly-seen multi-answer MRC instances, with which we inspect three multi-answer datasets and analyze where the multi-answer challenge comes from. We further analyze how well different paradigms of current multi-answer MRC models deal with different types of multi-answer instances. We find that some paradigms capture well the key information in the questions while others better model the relation between questions and contexts. We thus explore strategies to make the best of the strengths of different paradigms. Experiments show that generation models can be a promising platform to incorporate different paradigms. Our annotations and code are released for further research.

pdf bib
The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code
Xiao Liu | Da Yin | Chen Zhang | Yansong Feng | Dongyan Zhao
Findings of the Association for Computational Linguistics: ACL 2023

Causal reasoning, the ability to identify cause-and-effect relationship, is crucial in human thinking. Although large language models (LLMs) succeed in many NLP tasks, it is still challenging for them to conduct complex causal reasoning like abductive reasoning and counterfactual reasoning. Given the fact that programming code may express causal relations more often and explicitly with conditional statements like “if“, we want to explore whether Code-LLMs acquire better causal reasoning abilities. Our experiments show that compared to text-only LLMs, Code-LLMs with code prompts are better causal reasoners. We further intervene on the prompts from different aspects, and discover that the key point is the programming structure. Code and data are available at https://github.com/xxxiaol/magic-if.

pdf bib
Boosting Event Extraction with Denoised Structure-to-Text Augmentation
Bo Wang | Heyan Huang | Xiaochi Wei | Ge Shi | Xiao Liu | Chong Feng | Tong Zhou | Shuaiqiang Wang | Dawei Yin
Findings of the Association for Computational Linguistics: ACL 2023

Event extraction aims to recognize pre-defined event triggers and arguments from texts, which suffer from the lack of high-quality annotations. In most NLP applications, involving a large scale of synthetic training data is a practical and effective approach to alleviate the problem of data scarcity. However, when applying to the task of event extraction, recent data augmentation methods often neglect the problem of grammatical incorrectness, structure misalignment, and semantic drifting, leading to unsatisfactory performances. In order to solve these problems, we propose a denoised structure-to-text augmentation framework for event extraction (DAEE), which generates additional training data through the knowledge-based structure-to-text generation model and selects the effective subset from the generated data iteratively with a deep reinforcement learning agent. Experimental results on several datasets demonstrate that the proposed method generates more diverse text representations for event extraction and achieves comparable results with the state-of-the-art.

pdf bib
DiffuSum: Generation Enhanced Extractive Summarization with Diffusion
Haopeng Zhang | Xiao Liu | Jiawei Zhang
Findings of the Association for Computational Linguistics: ACL 2023

Extractive summarization aims to form a summary by directly extracting sentences from the source document. Existing works mostly formulate it as a sequence labeling problem by making individual sentence label predictions. This paper proposes DiffuSum, a novel paradigm for extractive summarization, by directly generating the desired summary sentence representations with diffusion models and extracting sentences based on sentence representation matching. In addition, DiffuSum jointly optimizes a contrastive sentence encoder with a matching loss for sentence representation alignment and a multi-class contrastive loss for representation diversity. Experimental results show that DiffuSum achieves the new state-of-the-art extractive results on CNN/DailyMail with ROUGE scores of 44.83/22.56/40.56. Experiments on the other two datasets with different summary lengths and cross-dataset evaluation also demonstrate the effectiveness of DiffuSum. The strong performance of our framework shows the great potential of adapting generative models for extractive summarization.

pdf bib
Extractive Summarization via ChatGPT for Faithful Summary Generation
Haopeng Zhang | Xiao Liu | Jiawei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023

Extractive summarization is a crucial task in natural language processing that aims to condense long documents into shorter versions by directly extracting sentences. The recent introduction of large language models has attracted significant interest in the NLP community due to its remarkable performance on a wide range of downstream tasks. This paper first presents a thorough evaluation of ChatGPT’s performance on extractive summarization and compares it with traditional fine-tuning methods on various benchmark datasets. Our experimental analysis reveals that ChatGPT exhibits inferior extractive summarization performance in terms of ROUGE scores compared to existing supervised systems, while achieving higher performance based on LLM-based evaluation metrics. In addition, we explore the effectiveness of in-context learning and chain-of-thought reasoning for enhancing its performance. Furthermore, we find that applying an extract-then-generate pipeline with ChatGPT yields significant performance improvements over abstractive baselines in terms of summary faithfulness. These observations highlight potential directions for enhancing ChatGPT’s capabilities in faithful summarization using two-stage approaches.

pdf bib
Allies: Prompting Large Language Model with Beam Search
Hao Sun | Xiao Liu | Yeyun Gong | Yan Zhang | Daxin Jiang | Linjun Yang | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2023

With the advance of large language models (LLMs), the research field of LLM applications becomes more and more popular and the idea of constructing pipelines to accomplish complex tasks by stacking LLM API calls come true. However, this kind of methods face two limitations: narrow information coverage and low fault tolerance. In this work, we propose a novel method called ALLIES. Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query, enabling an iterative reasoning process. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval. We take zero-shot open-domain question answering (ODQA) as an application scene and evaluate ALLIES on the widely-used benchmarks, such as NQ, WebQ and TriviaQA. The experimental results demonstrate that ALLIES significantly outperforms other zero-shot baselines, indicating its effectiveness in tackling those challenges. Our code is available in https://github.com/microsoft/SimXNS/tree/main/ALLIES.

pdf bib
SummIt: Iterative Text Summarization via ChatGPT
Haopeng Zhang | Xiao Liu | Jiawei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023

Existing text summarization systems have made significant progress in recent years, but typically generate summaries in a single step. The one-shot summarization setting is sometimes inadequate, however, as the generated summary may contain hallucinations or overlook important details related to the reader’s interests. In this paper, we address this limitation by proposing SummIt, an iterative text summarization framework based on large language models like ChatGPT. Our framework enables the model to refine the generated summary iteratively through self-evaluation and feedback, closely resembling the iterative process humans undertake when drafting and revising summaries. Furthermore, we explore the potential benefits of integrating knowledge and topic extractors into the framework to enhance summary faithfulness and controllability. We evaluate the performance of our framework on three benchmark summarization datasets through empirical and qualitative analyses. We also conduct a human evaluation to validate the effectiveness of the model’s refinements and find a potential issue of over-correction.

pdf bib
Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers
Weng Tam | Xiao Liu | Kaixuan Ji | Lilong Xue | Jiahua Liu | Tao Li | Yuxiao Dong | Jie Tang
Findings of the Association for Computational Linguistics: EMNLP 2023

Prompt tuning attempts to update few task-specific parameters in pre-trained models. It has achieved comparable performance to fine-tuning of the full parameter set on both language understanding and generation tasks. In this work, we study the problem of prompt tuning for neural text retrievers. We introduce parameter-efficient prompt tuning for text retrieval across in-domain, cross-domain, and cross-topic settings. Through an extensive analysis, we show that the strategy can mitigate the two issues—parameter-inefficiency and weak generalizability—faced by fine-tuning based retrieval methods. Notably, it can significantly improve the out-of-domain zero-shot generalization of the retrieval models. By updating only 0.1% of the model parameters, the prompt tuning strategy can help retrieval models achieve better generalization performance than traditional methods in which all parameters are updated. Finally, to facilitate research on retrievers’ cross-topic generalizability, we curate and release an academic retrieval dataset with 18K query-results pairs in 87 topics, making it the largest topic-specific one to date.

pdf bib
System Report for CCL23-Eval Task 9: HUST1037 Explore Proper Prompt Strategy for LLM in MRC Task
Xiao Liu | Junfeng Yu | Yibo He | Lujun Zhang | Kaiyichen Wei | Hongbo Sun | Gang Tu
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

“Our research paper delves into the Adversarial Robustness Evaluation for Chinese Gaokao Read-ing Comprehension (GCRC advRobust). While Chinese reading comprehension tasks havegained significant attention in recent years, previous methods have not proven effective for thischallenging dataset. We focus on exploring how prompt engineering can impact a model’s read-ing comprehension ability. Through our experiments using ChatGLM, GPT3.5, and GPT4, wediscovered a correlation between prompt and LLM reading comprehension ability, and found thatprompt engineering improves the performance of each model. Our team submitted the results ofour system evaluation, which ranked first in three indexes and total scores. Keywords— LLM, Prompt, Chinese Reading Comprehension”

pdf bib
Contrastive Hierarchical Discourse Graph for Scientific Document Summarization
Haopeng Zhang | Xiao Liu | Jiawei Zhang
Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)

The extended structural context has made scientific paper summarization a challenging task. This paper proposes CHANGES, a contrastive hierarchical graph neural network for extractive scientific paper summarization. CHANGES represents a scientific paper with a hierarchical discourse graph and learns effective sentence representations with dedicated designed hierarchical graph information aggregation. We also propose a graph contrastive learning module to learn global theme-aware sentence representations. Extensive experiments on the PubMed and arXiv benchmark datasets prove the effectiveness of CHANGES and the importance of capturing hierarchical structure information in modeling scientific papers.

pdf bib
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Da Yin | Xiao Liu | Fan Yin | Ming Zhong | Hritik Bansal | Jiawei Han | Kai-Wei Chang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Instruction tuning has emerged to enhance the capabilities of large language models (LLMs) to comprehend instructions and generate appropriate responses. Existing methods either manually annotate or employ LLM (e.g., GPT-series) to generate data for instruction tuning. However, they often overlook associating instructions with existing annotated datasets. In this paper, we propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging the existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost for generating instructions (e.g., it costs less than $12 USD by calling GPT-3.5-turbo for generating 800K instruction tuning samples; 2) it provides high-quality data for instruction tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data when a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset, and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting issues but generalizes to unseen tasks better. Code and data are available at https://github.com/WadeYin9712/Dynosaur.

pdf bib
DiNeR: A Large Realistic Dataset for Evaluating Compositional Generalization
Chengang Hu | Xiao Liu | Yansong Feng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Most of the existing compositional generalization datasets are synthetically-generated, resulting in a lack of natural language variation. While there have been recent attempts to introduce non-synthetic datasets for compositional generalization, they suffer from either limited data scale or a lack of diversity in the forms of combinations. To better investigate compositional generalization with more linguistic phenomena and compositional diversity, we propose the DIsh NamE Recognition (DiNeR) task and create a large realistic Chinese dataset. Given a recipe instruction, models are required to recognize the dish name composed of diverse combinations of food, actions, and flavors. Our dataset consists of 3,811 dishes and 228,114 recipes, and involves plenty of linguistic phenomena such as anaphora, omission and ambiguity. We provide two strong baselines based on T5 and large language models (LLMs). This work contributes a challenging task, baseline methods to tackle the task, and insights into compositional generalization in the context of dish name recognition.

2022

pdf bib
Counterfactual Recipe Generation: Exploring Compositional Generalization in a Realistic Scenario
Xiao Liu | Yansong Feng | Jizhi Tang | Chengang Hu | Dongyan Zhao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

People can acquire knowledge in an unsupervised manner by reading, and compose the knowledge to make novel combinations. In this paper, we investigate whether pretrained language models can perform compositional generalization in a realistic setting: recipe generation. We design the counterfactual recipe generation task, which asks models to modify a base recipe according to the change of an ingredient. This task requires compositional generalization at two levels: the surface level of incorporating the new ingredient into the base recipe, and the deeper level of adjusting actions related to the changing ingredient. We collect a large-scale recipe dataset in Chinese for models to learn culinary knowledge, and a subset of action-level fine-grained annotations for evaluation.We finetune pretrained language models on the recipe corpus, and use unsupervised counterfactual generation methods to generate modified recipes.Results show that existing models have difficulties in modifying the ingredients while preserving the original text style, and often miss actions that need to be adjusted. Although pretrained language models can generate fluent recipe texts, they fail to truly learn and use the culinary knowledge in a compositional way. Code and data are available at https://github.com/xxxiaol/counterfactual-recipe-generation.

pdf bib
IELM: An Open Information Extraction Benchmark for Pre-Trained Language Models
Chenguang Wang | Xiao Liu | Dawn Song
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LM). Recent studies have demonstrated that pre-trained LMs, such as BERT and GPT, may store linguistic and relational knowledge. In particular, LMs are able to answer “fill-in-the-blank” questions when given a pre-defined relation category. Instead of focusing on pre-defined relations, we create an OIE benchmark aiming to fully examine the open relational information present in the pre-trained LMs. We accomplish this by turning pre-trained LMs into zero-shot OIE systems. Surprisingly, pre-trained LMs are able to obtain competitive performance on both standard OIE datasets (CaRB and Re-OIE2016) and two new large-scale factual OIE datasets (TAC KBP-OIE and Wikidata-OIE) that we establish via distant supervision. For instance, the zero-shot pre-trained LMs outperform the F1 score of the state-of-the-art supervised OIE methods on our factual OIE datasets without needing to use any training sets.

pdf bib
HEGEL: Hypergraph Transformer for Long Document Summarization
Haopeng Zhang | Xiao Liu | Jiawei Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Extractive summarization for long documents is challenging due to the extended structured input context. The long-distance sentence dependency hinders cross-sentence relations modeling, the critical step of extractive summarization. This paper proposes HEGEL, a hypergraph neural network for long document summarization by capturing high-order cross-sentence relations. HEGEL updates and learns effective sentence representations with hypergraph transformer layers and fuses different types of sentence dependencies, including latent topics, keywords coreference, and section structure. We validate HEGEL by conducting extensive experiments on two benchmark datasets, and experimental results demonstrate the effectiveness and efficiency of HEGEL.

pdf bib
SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval
Kun Zhou | Yeyun Gong | Xiao Liu | Wayne Xin Zhao | Yelong Shen | Anlei Dong | Jingwen Lu | Rangan Majumder | Ji-rong Wen | Nan Duan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Sampling proper negatives from a large document pool is vital to effectively train a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we empirically show that according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are not too hard (may be false negatives) or too easy (uninformative). They are the ambiguous negatives and need more attention during training.Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives.Extensive experiments on four public and one industry datasets show the effectiveness of our approach.We made the code and models publicly available in https://github.com/microsoft/SimXNS.

pdf bib
DeepStruct: Pretraining of Language Models for Structure Prediction
Chenguang Wang | Xiao Liu | Zui Chen | Haoyun Hong | Jie Tang | Dawn Song
Findings of the Association for Computational Linguistics: ACL 2022

We introduce a method for improving the structural understanding abilities of language models. Unlike previous approaches that finetune the models with task-specific augmentation, we pretrain language models to generate structures from the text on a collection of task-agnostic corpora. Our structure pretraining enables zero-shot transfer of the learned knowledge that models have about the structure tasks. We study the performance of this approach on 28 datasets, spanning 10 structure prediction tasks including open information extraction, joint entity and relation extraction, named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, factual probe, intent detection, and dialogue state tracking. We further enhance the pretraining with the task-specific training sets. We show that a 10B parameter language model transfers non-trivially to most tasks and obtains state-of-the-art performance on 21 of 28 datasets that we evaluate. Our code and datasets will be made publicly available.

pdf bib
GLM: General Language Model Pretraining with Autoregressive Blank Infilling
Zhengxiao Du | Yujie Qian | Xiao Liu | Ming Ding | Jiezhong Qiu | Zhilin Yang | Jie Tang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs the best for all tasks of three main categories including natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank filling pretraining by adding 2D positional encodings and allowing an arbitrary order to predict spans, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× parameters of BERT Large , demonstrating its generalizability to different downstream tasks.

pdf bib
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
Xiao Liu | Da Yin | Yansong Feng | Dongyan Zhao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Spatial commonsense, the knowledge about spatial position and relationship between objects (like the relative size of a lion and a girl, and the position of a boy relative to a bicycle when cycling), is an important part of commonsense knowledge. Although pretrained language models (PLMs) succeed in many NLP tasks, they are shown to be ineffective in spatial commonsense reasoning. Starting from the observation that images are more likely to exhibit spatial commonsense than texts, we explore whether models with visual signals learn more spatial commonsense than text-based PLMs. We propose a spatial commonsense benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions. We probe PLMs and models with visual signals, including vision-language pretrained models and image synthesis models, on this benchmark, and find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models. The spatial knowledge from image synthesis models also helps in natural language understanding tasks that require spatial commonsense.

pdf bib
Dynamic Prefix-Tuning for Generative Template-based Event Extraction
Xiao Liu | Heyan Huang | Ge Shi | Bo Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We consider event extraction in a generative manner with template-based conditional generation. Although there is a rising trend of casting the task of event extraction as a sequence generation problem with prompts, these generation-based methods have two significant challenges, including using suboptimal prompts and static event type information. In this paper, we propose a generative template-based event extraction method with dynamic prefix (GTEE-DynPref) by integrating context information with type-specific prefixes to learn a context-specific prefix for each context. Experimental results show that our model achieves competitive results with the state-of-the-art classification-based model OneIE on ACE 2005 and achieves the best performances on ERE.Additionally, our model is proven to be portable to new types of events effectively.

pdf bib
P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks
Xiao Liu | Kaixuan Ji | Yicheng Fu | Weng Tam | Zhengxiao Du | Zhilin Yang | Jie Tang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks. It matches the performance of finetuning while having only 0.1%-3% tuned parameters. Our method P-Tuning v2 is an implementation of Deep Prompt Tuning (CITATION) optimized and adapted for NLU. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to finetuning and a strong baseline for future research.

pdf bib
Dual-Channel Evidence Fusion for Fact Verification over Texts and Tables
Nan Hu | Zirui Wu | Yuxuan Lai | Xiao Liu | Yansong Feng
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Different from previous fact extraction and verification tasks that only consider evidence of a single format, FEVEROUS brings further challenges by extending the evidence format to both plain text and tables. Existing works convert all candidate evidence into either sentences or tables, thus often failing to fully capture the rich context in their original format from the converted evidence, let alone the context information lost during conversion. In this paper, we propose a Dual Channel Unified Format fact verification model (DCUF), which unifies various evidence into parallel streams, i.e., natural language sentences and a global evidence table, simultaneously. With carefully-designed evidence conversion and organization methods, DCUF makes the most of pre-trained table/language models to encourage each evidence piece to perform early and thorough interactions with other pieces in its original format. Experiments show that our model can make better use of existing pre-trained models to absorb evidence of two formats, thus outperforming previous works by a large margin. Our code and models are publicly available.

2021

pdf bib
SeqDialN: Sequential Visual Dialog Network in Joint Visual-Linguistic Representation Space
Liu Yang | Fanqi Meng | Xiao Liu | Ming-Kuang Daniel Wu | Vicent Ying | James Xu
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

The key challenge of the visual dialog task is how to fuse features from multimodal sources and extract relevant information from dialog history to answer the current query. In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense SymmetricCo-Attention network (Nguyen and Okatani,2018) as a lightweight vison-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiencies. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses LSTM(Hochreiter and Schmidhuber,1997) for information propagation (IP) and the second uses a modified Transformer (Vaswani et al.,2017) for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows simpler design of the inference engine. On VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, which set a new state-of-the-art generative visual dialog model. We fine-tune discriminative SeqDialN with dense annotations and boost the performance up to 72.41% NDCG and 55.11% MRR. In this work, we discuss the extensive experiments we have conducted to demonstrate the effectiveness of our model components. We also provide visualization for the reasoning process from the relevant conversation rounds and discuss our fine-tuning methods. The code is available at https://github.com/xiaoxiaoheimei/SeqDialN.

pdf bib
Geo-BERT Pre-training Model for Query Rewriting in POI Search
Xiao Liu | Juan Hu | Qi Shen | Huan Chen
Findings of the Association for Computational Linguistics: EMNLP 2021

Query Rewriting (QR) is proposed to solve the problem of the word mismatch between queries and documents in Web search. Existing approaches usually model QR with an end-to-end sequence-to-sequence (seq2seq) model. The state-of-the-art Transformer-based models can effectively learn textual semantics from user session logs, but they often ignore users’ geographic location information that is crucial for the Point-of-Interest (POI) search of map services. In this paper, we proposed a pre-training model, called Geo-BERT, to integrate semantics and geographic information in the pre-trained representations of POIs. Firstly, we simulate POI distribution in the real world as a graph, in which nodes represent POIs and multiple geographic granularities. Then we use graph representation learning methods to get geographic representations. Finally, we train a BERT-like pre-training model with text and POIs’ graph embeddings to get an integrated representation of both geographic and semantic information, and apply it in the QR of POI search. The proposed model achieves excellent accuracy on a wide range of real-world datasets of map services.

pdf bib
Everything Has a Cause: Leveraging Causal Inference in Legal Text Analysis
Xiao Liu | Da Yin | Yansong Feng | Yuting Wu | Dongyan Zhao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Causal inference is the process of capturing cause-effect relationship among variables. Most existing works focus on dealing with structured data, while mining causal relationship among factors from unstructured data, like text, has been less examined, but is of great importance, especially in the legal domain. In this paper, we propose a novel Graph-based Causal Inference (GCI) framework, which builds causal graphs from fact descriptions without much human involvement and enables causal inference to facilitate legal practitioners to make proper decisions. We evaluate the framework on a challenging similar charge disambiguation task. Experimental results show that GCI can capture the nuance from fact descriptions among multiple confusing charges and provide explainable discrimination, especially in few-shot settings. We also observe that the causal knowledge contained in GCI can be effectively injected into powerful neural networks for better performance and interpretability.

pdf bib
Using Noisy Self-Reports to Predict Twitter User Demographics
Zach Wood-Doughty | Paiheng Xu | Xiao Liu | Mark Dredze
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g. Twitter), numerous studies have inferred demographics automatically. Despite many studies presenting proof-of-concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite the noise of automated supervision, our self-report datasets enable improvements in classification performance on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.

pdf bib
Zero-Shot Information Extraction as a Unified Text-to-Triple Translation
Chenguang Wang | Xiao Liu | Zui Chen | Haoyun Hong | Jie Tang | Dawn Song
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We cast a suite of information extraction tasks into a text-to-triple translation framework. Instead of solving each task relying on task-specific datasets and models, we formalize the task as a translation between task-specific input text and output triples. By taking the task-specific input, we enable a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task. We further demonstrate that a simple pre-training task of predicting which relational information corresponds to which input text is an effective way to produce task-specific outputs. This enables the zero-shot transfer of our framework to downstream tasks. We study the zero-shot performance of this framework on open information extraction (OIE2016, NYT, WEB, PENN), relation classification (FewRel and TACRED), and factual probe (Google-RE and T-REx). The model transfers non-trivially to most tasks and is often competitive with a fully supervised method without the need for any task-specific training. For instance, we significantly outperform the F1 score of the supervised open information extraction without needing to use its training set.

2020

pdf bib
Neighborhood Matching Network for Entity Alignment
Yuting Wu | Xiao Liu | Yansong Feng | Zheng Wang | Dongyan Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Structural heterogeneity between knowledge graphs is an outstanding challenge for entity alignment. This paper presents Neighborhood Matching Network (NMN), a novel entity alignment framework for tackling the structural heterogeneity challenge. NMN estimates the similarities between entities to capture both the topological structure and the neighborhood difference. It provides two innovative components for better learning representations for entity alignment. It first uses a novel graph sampling method to distill a discriminative neighborhood for each entity. It then adopts a cross-graph neighborhood matching module to jointly encode the neighborhood difference for a given entity pair. Such strategies allow NMN to effectively construct matching-oriented entity representations while ignoring noisy neighbors that have a negative impact on the alignment task. Extensive experiments performed on three entity alignment datasets show that NMN can well estimate the neighborhood similarity in more tough cases and significantly outperforms 12 previous state-of-the-art methods.

2019

pdf bib
A Soft Label Strategy for Target-Level Sentiment Classification
Da Yin | Xiao Liu | Xiuyu Wu | Baobao Chang
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we propose a soft label approach to target-level sentiment classification task, in which a history-based soft labeling model is proposed to measure the possibility of a context word as an opinion word. We also apply a convolution layer to extract local active features, and introduce positional weights to take relative distance information into consideration. In addition, we obtain more informative target representation by training with context tokens together to make deeper interaction between target and context tokens. We conduct experiments on SemEval 2014 datasets and the experimental results show that our approach significantly outperforms previous models and gives state-of-the-art results on these datasets.

pdf bib
Jointly Learning Entity and Relation Representations for Entity Alignment
Yuting Wu | Xiao Liu | Yansong Feng | Zheng Wang | Dongyan Zhao
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Entity alignment is a viable means for integrating heterogeneous knowledge among different knowledge graphs (KGs). Recent developments in the field often take an embedding-based approach to model the structural information of KGs so that entity alignment can be easily performed in the embedding space. However, most existing works do not explicitly utilize useful relation representations to assist in entity alignment, which, as we will show in the paper, is a simple yet effective way for improving entity alignment. This paper presents a novel joint learning framework for entity alignment. At the core of our approach is a Graph Convolutional Network (GCN) based framework for learning both entity and relation representations. Rather than relying on pre-aligned relation seeds to learn relation representations, we first approximate them using entity embeddings learned by the GCN. We then incorporate the relation approximation into entities to iteratively learn better representations for both. Experiments performed on three real-world cross-lingual datasets show that our approach substantially outperforms state-of-the-art entity alignment methods.

pdf bib
Open Domain Event Extraction Using Neural Latent Variable Models
Xiao Liu | Heyan Huang | Yue Zhang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We consider open domain event extraction, the task of extracting unconstraint types of events from news clusters. A novel latent variable neural model is constructed, which is scalable to very large corpus. A dataset is collected and manually annotated, with task-specific evaluation metrics being designed. Results show that the proposed unsupervised model gives better performance compared to the state-of-the-art method for event schema induction.

2018

pdf bib
Real-time Scholarly Retweeting Prediction System
Zhunchen Luo | Xiao Liu
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

Twitter has become one of the most import channels to spread latest scholarly information because of its fast information spread speed. How to predict whether a scholarly tweet will be retweeted is a key task in understanding the message propagation within large user communities. Hence, we present the real-time scholarly retweeting prediction system that retrieves scholarly tweets which will be retweeted. First, we filter scholarly tweets from tracking a tweet stream. Then, we extract Tweet Scholar Blocks indicating metadata of papers. At last, we combine scholarly features with the Tweet Scholar Blocks to predict whether a scholarly tweet will be retweeted. Our system outperforms chosen baseline systems. Additionally, our system has the potential to predict scientific impact in real-time.

pdf bib
Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation
Xiao Liu | Zhunchen Luo | Heyan Huang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Event extraction is of practical utility in natural language processing. In the real world, it is a common phenomenon that multiple events existing in the same sentence, where extracting them are more difficult than extracting a single event. Previous works on modeling the associations between events by sequential modeling methods suffer a lot from the low efficiency in capturing very long-range dependencies. In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information. The experiment results demonstrate that our proposed framework achieves competitive results compared with state-of-the-art methods.

2014

pdf bib
Fast Recursive Multi-class Classification of Pairs of Text Entities for Biomedical Event Extraction
Xiao Liu | Antoine Bordes | Yves Grandvalet
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf bib
Biomedical Event Extraction by Multi-class Classification of Pairs of Text Entities
Xiao Liu | Antoine Bordes | Yves Grandvalet
Proceedings of the BioNLP Shared Task 2013 Workshop