Siqi Sun


2022

Leveraging Knowledge in Multilingual Commonsense Reasoning
Yuwei Fang | Shuohang Wang | Yichong Xu | Ruochen Xu | Siqi Sun | Chenguang Zhu | Michael Zeng
Findings of the Association for Computational Linguistics: ACL 2022

Commonsense reasoning (CSR) requires models to be equipped with general world knowledge. While CSR is a language-agnostic process, most comprehensive knowledge sources are restricted to a small number of languages, especially English. Thus, it remains unclear how to effectively conduct multilingual commonsense reasoning (XCSR) for various languages. In this work, we propose to use English as a pivot language, utilizing English knowledge sources for our commonsense reasoning framework via a translate-retrieve-translate (TRT) strategy. For multilingual commonsense questions and answer candidates, we collect related knowledge via translation and retrieval from the knowledge in the source language. The retrieved knowledge is then translated into the target language and integrated into a pre-trained multilingual language model via visible knowledge attention. Then we utilize a diverse set of four English knowledge sources to provide more comprehensive coverage of knowledge in different formats. Extensive results on the XCSR benchmark demonstrate that TRT with external knowledge can significantly improve multilingual commonsense reasoning in both zero-shot and translate-train settings, consistently outperforming the state of the art by more than 3% on the multilingual commonsense reasoning benchmarks X-CSQA and X-CODAH.
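
The translate-retrieve-translate flow described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the dictionaries `TO_EN` and `FROM_EN` stand in for a machine translation system, `ENGLISH_KB` stands in for an English knowledge source, and word overlap stands in for a real retriever. It does not cover the visible knowledge attention step.

```python
# Hypothetical stand-ins for a translation system and an English knowledge source.
TO_EN = {"Wo leben Fische?": "where do fish live?"}
FROM_EN = {"fish live in water": "Fische leben im Wasser"}
ENGLISH_KB = ["fish live in water", "birds can fly", "cows eat grass"]


def translate_to_english(text):
    """Toy translation into the pivot language (dictionary lookup)."""
    return TO_EN.get(text, text)


def translate_from_english(text):
    """Toy translation back into the target language (dictionary lookup)."""
    return FROM_EN.get(text, text)


def retrieve(query_en, kb, top_k=1):
    """Rank English facts by word overlap with the English query."""
    query_words = set(query_en.lower().strip("?").split())
    ranked = sorted(
        kb,
        key=lambda fact: len(query_words & set(fact.split())),
        reverse=True,
    )
    return ranked[:top_k]


def trt(question):
    """Translate the question, retrieve English knowledge, translate it back."""
    query_en = translate_to_english(question)
    facts_en = retrieve(query_en, ENGLISH_KB)
    return [translate_from_english(fact) for fact in facts_en]


knowledge = trt("Wo leben Fische?")
print(knowledge)
```

In this sketch the returned target-language facts would then be supplied to the multilingual model alongside the question and answer candidates.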

Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data
Shuohang Wang | Yichong Xu | Yuwei Fang | Yang Liu | Siqi Sun | Ruochen Xu | Chenguang Zhu | Michael Zeng
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-based methods have been shown to be effective in NLP tasks by introducing external knowledge. However, the indexing and retrieving of large-scale corpora bring considerable computational cost. Surprisingly, we found that REtrieving from the traINing datA (REINA) alone can lead to significant gains on multiple NLG and NLU tasks. We retrieve the labeled training instances most similar to the input text and then concatenate them with the input to feed into the model to generate the output. Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks, including summarization, machine translation, language modeling, and question answering tasks. For instance, our proposed method achieved state-of-the-art results on XSum, BigPatent, and CommonsenseQA. Our code is released at https://github.com/microsoft/REINA .
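
The retrieve-and-concatenate step described in this abstract can be sketched as follows. This is a minimal illustration, not the released REINA code: the token-overlap scorer, the `[SEP]` separator, the `=>` label format, and the function names are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def overlap_score(query, candidate):
    """Count tokens shared by two texts (with multiplicity), case-insensitively."""
    query_counts = Counter(query.lower().split())
    candidate_counts = Counter(candidate.lower().split())
    return sum((query_counts & candidate_counts).values())


def retrieve_and_concat(input_text, train_set, k=2, sep=" [SEP] "):
    """Append the k most similar labeled training pairs to the input text."""
    ranked = sorted(
        train_set,
        key=lambda pair: overlap_score(input_text, pair[0]),
        reverse=True,
    )
    parts = [input_text] + [f"{text} => {label}" for text, label in ranked[:k]]
    return sep.join(parts)


# Toy labeled training data (hypothetical).
train_set = [
    ("where do fish live", "water"),
    ("where do birds build nests", "trees"),
    ("what do cows eat", "grass"),
]

augmented = retrieve_and_concat("where do fish swim", train_set, k=1)
print(augmented)
# where do fish swim [SEP] where do fish live => water
```

The augmented string is what would be fed to the downstream model in place of the raw input.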

2021

Cluster-Former: Clustering-based Sparse Transformer for Question Answering
Shuohang Wang | Luowei Zhou | Zhe Gan | Yen-Chun Chen | Yuwei Fang | Siqi Sun | Yu Cheng | Jingjing Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun | Yen-Chun Chen | Linjie Li | Shuohang Wang | Yuwei Fang | Jingjing Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, inevitably suffer from slow inference speed due to the enormous computational cost of cross-modal attention in the Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by extracting pre-cached feature indexes offline and employing instant dot-product matching online, which significantly speeds up the retrieval process. In fact, our LightningDOT achieves superior performance across mainstream ITR benchmarks such as the Flickr30k and COCO datasets, outperforming existing pre-trained models that consume on the order of 1000 times more computational hours using the same features.
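
The offline-cache, online-dot-product idea in this abstract can be illustrated with a small sketch. The vectors, file names, and the `retrieve_images` helper are hypothetical; in a real system the embeddings would come from trained image and text encoders.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Offline step: every image is embedded once and the vectors are cached.
# These toy vectors are hypothetical placeholders for encoder outputs.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.8, 0.2],
    "city.jpg": [0.0, 0.1, 0.9],
}


def retrieve_images(query_vec, index, top_k=2):
    """Online step: score cached image vectors against the query by dot product."""
    scored = [(dot(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical embedding of a text query such as "a path through tall trees".
query_vec = [0.2, 0.9, 0.1]
print(retrieve_images(query_vec, image_index))
# ['forest.jpg', 'beach.jpg']
```

Because no cross-modal attention runs at query time, the online cost is one dot product per cached image.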

2020

DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation
Yizhe Zhang | Siqi Sun | Michel Galley | Yen-Chun Chen | Chris Brockett | Xiang Gao | Jianfeng Gao | Jingjing Liu | Bill Dolan
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present a large, tunable neural conversational response generation model, DIALOGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain performance close to human in terms of both automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.

Cross-Thought for Sentence Encoder Pre-training
Shuohang Wang | Yuwei Fang | Siqi Sun | Zhe Gan | Yu Cheng | Jingjing Liu | Jing Jiang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we propose Cross-Thought, a novel approach to pre-training a sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. Instead of using the original signals of full sentences, we train a Transformer-based sequence encoder over a large set of short sequences, which allows the model to automatically select the most useful information for predicting masked words. Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders trained with continuous sentence signals as well as traditional masked language modeling baselines. Our proposed approach also achieves a new state of the art on HotpotQA (full-wiki setting) by improving intermediate information retrieval performance.

Contrastive Distillation on Intermediate Representations for Language Model Compression
Siqi Sun | Zhe Gan | Yuwei Fang | Yu Cheng | Shuohang Wang | Jingjing Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Existing language model compression methods mostly use a simple L_2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student’s exploitation of rich information in the teacher’s hidden layers. CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
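
To make the idea of distinguishing a positive sample from negatives concrete, here is a generic InfoNCE-style loss in plain Python. It is a sketch under assumptions, not necessarily the exact CoDIR objective: the cosine similarity, the temperature value, and the function names are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    numerator = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return numerator / (norm_u * norm_v)


def contrastive_loss(student, teacher_pos, teacher_negs, temperature=0.1):
    """Negative log-probability of picking the positive teacher vector
    among the positive and all negatives, given the student vector."""
    pos = math.exp(cosine(student, teacher_pos) / temperature)
    neg = sum(math.exp(cosine(student, t) / temperature) for t in teacher_negs)
    return -math.log(pos / (pos + neg))


# Toy 2-d representations (hypothetical values).
student = [1.0, 0.0]
aligned = contrastive_loss(student, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = contrastive_loss(student, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when the student representation is closer to its matching teacher representation than to the negatives, and large otherwise.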

Hierarchical Graph Network for Multi-hop Question Answering
Yuwei Fang | Siqi Sun | Zhe Gan | Rohit Pillai | Shuohang Wang | Jingjing Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes on different levels of granularity (questions, paragraphs, sentences, entities), the representations of which are initialized with pre-trained contextual encoders. Given this hierarchical graph, the initial node representations are updated through graph propagation, and multi-hop reasoning is performed by traversing the graph edges for each subsequent sub-task (e.g., paragraph selection, supporting facts extraction, answer prediction). By weaving heterogeneous nodes into an integral unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously. Experiments on the HotpotQA benchmark demonstrate that the proposed model achieves a new state of the art, outperforming existing multi-hop QA approaches.
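
The four node granularities named in this abstract can be illustrated with a toy graph builder. This sketch assumes pre-annotated entities and includes only containment-style edges (question to paragraph, paragraph to sentence, sentence to entity); the node naming scheme and the function `build_hierarchical_graph` are hypothetical, and the full model may use additional edge types.

```python
def build_hierarchical_graph(question, paragraphs):
    """Build nodes and edges for a question and its paragraphs.

    paragraphs: list of paragraphs, each a list of (sentence, entities) pairs.
    Returns (nodes, edges), where nodes are (type, name) tuples.
    """
    question_node = ("question", question)
    nodes = [question_node]
    edges = []
    for p_idx, sentences in enumerate(paragraphs):
        paragraph_node = ("paragraph", f"P{p_idx}")
        nodes.append(paragraph_node)
        edges.append((question_node, paragraph_node))
        for s_idx, (_sentence, entities) in enumerate(sentences):
            sentence_node = ("sentence", f"P{p_idx}S{s_idx}")
            nodes.append(sentence_node)
            edges.append((paragraph_node, sentence_node))
            for entity in entities:
                entity_node = ("entity", entity)
                if entity_node not in nodes:  # share entity nodes across sentences
                    nodes.append(entity_node)
                edges.append((sentence_node, entity_node))
    return nodes, edges


# Toy input with hand-annotated entities (hypothetical).
paragraphs = [
    [("Paris is a city.", ["Paris"]), ("Paris is in France.", ["Paris", "France"])],
    [("France is in Europe.", ["France"])],
]
nodes, edges = build_hierarchical_graph("Which country is Paris in?", paragraphs)
print(len(nodes), len(edges))
# 8 9
```

Shared entity nodes are what let information flow between sentences in different paragraphs during graph propagation.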

2019

Patient Knowledge Distillation for BERT Model Compression
Siqi Sun | Yu Cheng | Zhe Gan | Jingjing Liu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: (i) PKD-Last: learning from the last k layers; and (ii) PKD-Skip: learning from every k layers. These two patient distillation schemes enable the exploitation of rich information in the teacher’s hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
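
The two layer-selection strategies named in this abstract can be sketched as simple index computations. The functions below follow only the abstract's wording ("the last k layers", "every k layers") with 1-based layer indices; how the teacher's final layer is treated is a detail this sketch does not attempt to reproduce, and the function names are hypothetical.

```python
def pkd_last(num_teacher_layers, k):
    """1-based indices of the last k teacher layers."""
    first = num_teacher_layers - k + 1
    return list(range(first, num_teacher_layers + 1))


def pkd_skip(num_teacher_layers, k):
    """1-based indices of every k-th teacher layer."""
    return list(range(k, num_teacher_layers + 1, k))


# Example with a 12-layer teacher.
print(pkd_last(12, 6))  # [7, 8, 9, 10, 11, 12]
print(pkd_skip(12, 2))  # [2, 4, 6, 8, 10, 12]
```

Each selected teacher layer would then be paired with a student layer whose hidden states are trained to imitate it.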