Yuwei Fang


2022

pdf bib
Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data
Shuohang Wang | Yichong Xu | Yuwei Fang | Yang Liu | Siqi Sun | Ruochen Xu | Chenguang Zhu | Michael Zeng
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-based methods have been shown to be effective in NLP tasks via introducing external knowledge. However, the indexing and retrieving of large-scale corpora bring considerable computational cost. Surprisingly, we found that REtrieving from the traINing datA (REINA) only can lead to significant gains on multiple NLG and NLU tasks. We retrieve the labeled training instances most similar to the input text and then concatenate them with the input to feed into the model to generate the output. Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks, including summarization, machine translation, language modeling, and question answering tasks. For instance, our proposed method achieved state-of-the-art results on XSum, BigPatent, and CommonsenseQA. Our code is released, https://github.com/microsoft/REINA .

pdf bib
KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering
Donghan Yu | Chenguang Zhu | Yuwei Fang | Wenhao Yu | Shuohang Wang | Yichong Xu | Xiang Ren | Yiming Yang | Michael Zeng
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current Open-Domain Question Answering (ODQA) models typically include a retrieving module and a reading module, where the retriever selects potentially relevant passages from open-source documents for a given question, and the reader produces an answer based on the retrieved passages. The recently proposed Fusion-in-Decoder (FiD) framework is a representative example, which is built on top of a dense passage retriever and a generative reader, achieving the state-of-the-art performance. In this paper we further improve the FiD approach by introducing a knowledge-enhanced version, namely KG-FiD. Our new model uses a knowledge graph to establish the structural relationship among the retrieved passages, and a graph neural network (GNN) to re-rank the passages and select only a top few for further processing. Our experiments on common ODQA benchmark datasets (Natural Questions and TriviaQA) demonstrate that KG-FiD can achieve comparable or better performance in answer prediction than FiD, with less than 40% of the computation cost.

pdf bib
Dict-BERT: Enhancing Language Model Pre-training with Dictionary
Wenhao Yu | Chenguang Zhu | Yuwei Fang | Donghan Yu | Shuohang Wang | Yichong Xu | Michael Zeng | Meng Jiang
Findings of the Association for Computational Linguistics: ACL 2022

Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. Since PLMs capture word semantics in different contexts, the quality of word representations highly depends on word frequency, which usually follows a heavy-tailed distributions in the pre-training corpus. Therefore, the embeddings of rare words on the tail are usually poorly optimized. In this work, we focus on enhancing language model pre-training by leveraging definitions of the rare words in dictionaries (e.g., Wiktionary). To incorporate a rare word definition as a part of input, we fetch its definition from the dictionary and append it to the end of the input text sequence. In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word and sentence-level alignment between input text sequence and rare word definitions to enhance language modeling representation with dictionary. We evaluate the proposed Dict-BERT model on the language understanding benchmark GLUE and eight specialized domain benchmark datasets. Extensive experiments demonstrate that Dict-BERT can significantly improve the understanding of rare words and boost model performance on various NLP downstream tasks.

pdf bib
Leveraging Knowledge in Multilingual Commonsense Reasoning
Yuwei Fang | Shuohang Wang | Yichong Xu | Ruochen Xu | Siqi Sun | Chenguang Zhu | Michael Zeng
Findings of the Association for Computational Linguistics: ACL 2022

Commonsense reasoning (CSR) requires models to be equipped with general world knowledge. While CSR is a language-agnostic process, most comprehensive knowledge sources are restricted to a small number of languages, especially English. Thus, it remains unclear how to effectively conduct multilingual commonsense reasoning (XCSR) for various languages. In this work, we propose to use English as a pivot language, utilizing English knowledge sources for our our commonsense reasoning framework via a translate-retrieve-translate (TRT) strategy. For multilingual commonsense questions and answer candidates, we collect related knowledge via translation and retrieval from the knowledge in the source language. The retrieved knowledge is then translated into the target language and integrated into a pre-trained multilingual language model via visible knowledge attention. Then we utilize a diverse of four English knowledge sources to provide more comprehensive coverage of knowledge in different formats. Extensive results on the XCSR benchmark demonstrate that TRT with external knowledge can significantly improve multilingual commonsense reasoning in both zero-shot and translate-train settings, consistently outperforming the state-of-the-art by more than 3% on the multilingual commonsense reasoning benchmark X-CSQA and X-CODAH.

2021

pdf bib
Cluster-Former: Clustering-based Sparse Transformer for Question Answering
Shuohang Wang | Luowei Zhou | Zhe Gan | Yen-Chun Chen | Yuwei Fang | Siqi Sun | Yu Cheng | Jingjing Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun | Yen-Chun Chen | Linjie Li | Shuohang Wang | Yuwei Fang | Jingjing Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, fatefully suffer from slow inference speed due to enormous computational cost mainly from cross-modal attention in Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study Image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT that accelerates the inference time of ITR by thousands of times, without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by extracting pre-cached feature indexes offline, and employing instant dot-product matching online, which significantly speeds up retrieval process. In fact, our LightningDOT achieves superior performance across mainstream ITR benchmarks such as Flickr30k and COCO datasets, outperforming existing pre-trained models that consume 1000 times magnitude of computational hours using the same features.

2020

pdf bib
Cross-Thought for Sentence Encoder Pre-training
Shuohang Wang | Yuwei Fang | Siqi Sun | Zhe Gan | Yu Cheng | Jingjing Liu | Jing Jiang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we propose Cross-Thought, a novel approach to pre-training sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. Instead of using the original signals of full sentences, we train a Transformer-based sequence encoder over a large set of short sequences, which allows the model to automatically select the most useful information for predicting masked words. Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders trained with continuous sentence signals as well as traditional masked language modeling baselines. Our proposed approach also achieves new state of the art on HotpotQA (full-wiki setting) by improving intermediate information retrieval performance.

pdf bib
Contrastive Distillation on Intermediate Representations for Language Model Compression
Siqi Sun | Zhe Gan | Yuwei Fang | Yu Cheng | Shuohang Wang | Jingjing Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Existing language model compression methods mostly use a simple L_2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective. By learning to distinguish positive sample from a large set of negative samples, CoDIR facilitates the student’s exploitation of rich information in teacher’s hidden layers. CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.

pdf bib
Hierarchical Graph Network for Multi-hop Question Answering
Yuwei Fang | Siqi Sun | Zhe Gan | Rohit Pillai | Shuohang Wang | Jingjing Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes on different levels of granularity (questions, paragraphs, sentences, entities), the representations of which are initialized with pre-trained contextual encoders. Given this hierarchical graph, the initial node representations are updated through graph propagation, and multi-hop reasoning is performed via traversing through the graph edges for each subsequent sub-task (e.g., paragraph selection, supporting facts extraction, answer prediction). By weaving heterogeneous nodes into an integral unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously. Experiments on the HotpotQA benchmark demonstrate that the proposed model achieves new state of the art, outperforming existing multi-hop QA approaches.