Xingwu Sun


2024

LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders
Xingwu Sun | Zhen Yang | Ruobing Xie | Fengzong Lian | Zhanhui Kang | Chengzhong Xu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper studies vision-language (V&L) pre-training for deep cross-modal representations. Recently, pre-trained V&L models have shown great success in V&L tasks. However, most existing models apply multi-modal encoders to encode the image and text, at the cost of high training complexity because of the input sequence length. In addition, they suffer from noisy training corpora caused by V&L mismatching. In this work, we propose a lightweight vision-language pre-training method (LightVLP) for efficient and effective V&L pre-training. First, we design a new V&L framework with two autoencoders. Each autoencoder consists of an encoder, which takes in only the unmasked tokens (masked ones are removed), and a lightweight decoder that reconstructs the masked tokens. We also mask and remove large portions of the input tokens to accelerate training. Moreover, we propose a gated interaction mechanism to cope with noise in aligned image-text pairs. For a matched image-text pair, the model tends to use cross-modal representations for reconstruction. By contrast, for an unmatched pair, the model reconstructs mainly from uni-modal representations. Thanks to these designs, our base model shows competitive results compared to ALBEF while saving 44% FLOPs. Further, we compare our large model with ALBEF under similar FLOPs on six datasets and show the superiority of LightVLP. In particular, our model achieves 2.2% R@1 gains on COCO Text Retrieval and 1.1% on refCOCO+.
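
The two ideas in this abstract, dropping masked tokens before the encoder and gating between uni-modal and cross-modal features, can be illustrated with a minimal sketch. The function names `mask_and_remove` and `gated_mix`, the scalar gate, and the uniform random masking are assumptions made for illustration; the abstract does not specify how LightVLP samples masks or computes its gate.

```python
import random

def mask_and_remove(tokens, mask_ratio, seed=0):
    """Split a token sequence into visible tokens and masked positions.

    Only the visible (index, token) pairs would be fed to the encoder;
    the masked positions are left for a decoder to reconstruct.
    """
    rng = random.Random(seed)
    n_masked = int(len(tokens) * mask_ratio)
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    masked_set = set(masked_idx)
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked_set]
    return visible, masked_idx

def gated_mix(uni, cross, gate):
    """Blend uni-modal and cross-modal features with a scalar gate in [0, 1].

    gate near 1 leans on cross-modal features (matched pair);
    gate near 0 falls back to uni-modal features (unmatched pair).
    """
    return [(1.0 - gate) * u + gate * c for u, c in zip(uni, cross)]

visible, masked = mask_and_remove(list("abcdefghij"), mask_ratio=0.5)
mixed = gated_mix([1.0, 2.0], [3.0, 4.0], gate=0.0)
```

With a high mask ratio the encoder sees a much shorter sequence, which is where the training savings described above would come from.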

2023

TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities
Zhe Zhao | Yudong Li | Cheng Hou | Jing Zhao | Rong Tian | Weijie Liu | Yiren Chen | Ningyuan Sun | Haoyan Liu | Weiquan Mao | Han Guo | Weigang Gou | Taiqiang Wu | Tao Zhu | Wenhang Shi | Chen Chen | Shan Huang | Sihong Chen | Liqun Liu | Feifei Li | Xiaoshuai Chen | Xingwu Sun | Zhanhui Kang | Xiaoyong Du | Linlin Shen | Kimmo Yan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities show a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is the modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
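
The five-component design can be pictured as picking one module per component from a registry. The sketch below is not the TencentPretrain API: the `REGISTRY` contents, the module names, and `build_model` are hypothetical and only illustrate the composition idea.

```python
# The five components named in the abstract, in pipeline order.
COMPONENTS = ("embedding", "encoder", "target_embedding", "decoder", "target")

# Hypothetical module names per component; the real toolkit's options may differ.
REGISTRY = {
    "embedding": {"word", "patch", "speech"},
    "encoder": {"transformer", "lstm"},
    "target_embedding": {"none", "word"},
    "decoder": {"none", "transformer"},
    "target": {"mlm", "lm", "cls"},
}

def build_model(config):
    """Validate one module choice per component and return them in pipeline order."""
    model = []
    for component in COMPONENTS:
        if component not in config:
            raise ValueError(f"missing component: {component}")
        choice = config[component]
        if choice not in REGISTRY[component]:
            raise ValueError(f"unknown {component} module: {choice}")
        model.append((component, choice))
    return model

# A BERT-like configuration under these assumed names.
bert_like = build_model({
    "embedding": "word",
    "encoder": "transformer",
    "target_embedding": "none",
    "decoder": "none",
    "target": "mlm",
})
```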

2022

An Anchor-based Relative Position Embedding Method for Cross-Modal Tasks
Ya Wang | Xingwu Sun | Lian Fengzong | ZhanHui Kang | Chengzhong Xu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Position Embedding (PE) is essential for the transformer to capture the sequence ordering of input tokens. Despite its general effectiveness verified in Natural Language Processing (NLP) and Computer Vision (CV), its application in cross-modal tasks remains unexplored and suffers from two challenges: 1) the input text tokens and image patches are not aligned, and 2) the encoding space of each modality is different, which rules out direct feature comparison. In this paper, we propose a unified position embedding method for these problems, called AnChor-basEd Relative Position Embedding (ACE-RPE), in which we first introduce an anchor locating mechanism to bridge the semantic gap and locate anchors from different modalities. Then we calculate the distance of each text token and image patch by computing their shortest paths from the located anchors. Finally, we embed the anchor-based distance to guide the computation of cross-attention. In this way, the method calculates a cross-modal relative position embedding for the cross-modal transformer. Benefiting from ACE-RPE, our method obtains new SOTA results on a wide range of benchmarks, such as Image-Text Retrieval on MS-COCO and Flickr30K, Visual Entailment on SNLI-VE, Visual Reasoning on NLVR2 and Weakly-supervised Visual Grounding on RefCOCO+.

2021

Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval
Hongyin Tang | Xingwu Sun | Beihong Jin | Jingang Wang | Fuzheng Zhang | Wei Wu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recently, retrieval models based on dense representations have been gradually applied in the first stage of document retrieval tasks, showing better performance than traditional sparse vector space models. To obtain high efficiency, the basic structure of these models is a Bi-encoder in most cases. However, this simple structure may cause serious information loss during the encoding of documents, since documents are encoded without knowledge of the queries. To address this problem, we design a method to mimic the queries for each document by an iterative clustering process and represent the documents by multiple pseudo queries (i.e., the cluster centroids). To boost the retrieval process using an approximate nearest neighbor search library, we also optimize the matching function with a two-step score calculation procedure. Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results while still maintaining high efficiency.
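
The pseudo-query idea can be sketched as clustering a document's token embeddings and scoring a query against the resulting centroids. This is a simplified illustration, not the paper's procedure: plain k-means with a deterministic initialization stands in for the iterative clustering process, and a max over dot products stands in for the two-step score calculation. The names `kmeans` and `score` are assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Plain k-means; the first k points are used as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def score(query, centroids):
    """Score a document by its best-matching pseudo query (centroid)."""
    return max(dot(query, c) for c in centroids)

# Toy token embeddings of one document, forming two obvious groups.
token_embeddings = [[0, 0], [10, 10], [0, 1], [10, 11]]
pseudo_queries = kmeans(token_embeddings, k=2)
relevance = score([1, 0], pseudo_queries)
```

Representing a document by several centroids rather than one vector is what lets different query intents match different parts of the same document.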

TITA: A Two-stage Interaction and Topic-Aware Text Matching Model
Xingwu Sun | Yanling Cui | Hongyin Tang | Qiuyu Zhu | Fuzheng Zhang | Beihong Jin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we focus on the problem of keyword and document matching by considering different relevance levels. In our recommendation system, different people follow different hot keywords with interest. We need to attach documents to each keyword and then distribute the documents to people who follow these keywords. The ideal documents should have the same topic as the keyword, which we call topic-aware relevance. In other words, documents with topic-aware relevance are better than partially relevant ones in this application. However, previous work never defines topic-aware relevance clearly. To tackle this problem, we define a three-level relevance in the keyword-document matching task: topic-aware relevance, partial relevance, and irrelevance. To capture the relevance between the short keyword and the document at these three levels, we should not only combine the latent topic of the document with its deep neural representation, but also model complex interactions between the keyword and the document. To this end, we propose a Two-stage Interaction and Topic-Aware text matching model (TITA). In terms of “topic-aware”, we introduce a neural topic model to analyze the topic of the document and then use it to further encode the document. In terms of “two-stage interaction”, we propose two successive stages to model complex interactions between the keyword and the document. Extensive experiments reveal that TITA outperforms other well-designed baselines and shows excellent performance in our recommendation system.

Enhancing Document Ranking with Task-adaptive Training and Segmented Token Recovery Mechanism
Xingwu Sun | Yanling Cui | Hongyin Tang | Fuzheng Zhang | Beihong Jin | Shi Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a new ranking model DR-BERT, which improves the Document Retrieval (DR) task by a task-adaptive training process and a Segmented Token Recovery Mechanism (STRM). In the task-adaptive training, we first pre-train DR-BERT to be domain-adaptive and then perform two-phase fine-tuning. In the first-phase fine-tuning, the model learns query-document matching patterns regarding different query types in a pointwise way. Next, in the second-phase fine-tuning, the model learns document-level ranking features and ranks documents with regard to a given query in a listwise manner. Such pointwise plus listwise fine-tuning enables the model to minimize errors in the document ranking by incorporating ranking-specific supervisions. Meanwhile, the model derived from pointwise fine-tuning is also used to reduce noise in the training data of the listwise fine-tuning. On the other hand, we present STRM, which can compute OOV word representation and contextualization more precisely in BERT-based models. As an effective strategy in DR-BERT, STRM improves the matching performance of OOV words between a query and a document. Notably, our DR-BERT model has remained in the top three on the MS MARCO leaderboard since May 20, 2020.
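
The difference between the pointwise and listwise phases can be shown with two common loss functions. The abstract does not state which losses DR-BERT uses, so the choices below (binary cross-entropy per query-document pair, and softmax cross-entropy over a candidate list) are assumptions, as are the function names.

```python
import math

def pointwise_loss(score, label):
    """Binary cross-entropy on a single query-document pair (label is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def listwise_loss(scores, positive_index):
    """Softmax cross-entropy over all candidate documents for one query."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]

# Pointwise: each pair is judged in isolation.
single = pointwise_loss(score=2.0, label=1)

# Listwise: the relevant document competes with the other candidates.
good_ranking = listwise_loss([5.0, 0.0, -1.0], positive_index=0)
bad_ranking = listwise_loss([0.0, 5.0, -1.0], positive_index=0)
```

The listwise loss falls only when the relevant document outscores its competitors, which is the ranking-specific supervision the second phase adds.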

2018

Answer-focused and Position-aware Neural Question Generation
Xingwu Sun | Jing Liu | Yajuan Lyu | Wei He | Yanjun Ma | Shi Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we focus on the problem of question generation (QG). Recent neural network-based approaches employ the sequence-to-sequence model, which takes an answer and its context as input and generates a relevant question as output. However, we observe two major issues with these approaches: (1) The generated interrogative words (or question words) do not match the answer type. (2) The model copies the context words that are far from and irrelevant to the answer, instead of the words that are close and relevant to the answer. To address these two issues, we propose an answer-focused and position-aware neural question generation model. (1) By answer-focused, we mean that we explicitly model question word generation by incorporating the answer embedding, which can help generate an interrogative word matching the answer type. (2) By position-aware, we mean that we model the relative distance between the context words and the answer. Hence the model can be aware of the position of the context words when copying them to generate a question. We conduct extensive experiments to examine the effectiveness of our model. The experimental results show that our model significantly improves over the baseline and outperforms the state-of-the-art system.
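
The position-aware signal described above, the relative distance between each context word and the answer, can be computed as in the sketch below. The function name and the convention that tokens inside the answer span get distance 0 are assumptions for illustration; how the distance is embedded and used by the model is not shown.

```python
def relative_distances(num_tokens, answer_start, answer_end):
    """Distance in tokens from each context position to the answer span.

    answer_start and answer_end are inclusive token indices.
    Positions inside the span get distance 0.
    """
    distances = []
    for i in range(num_tokens):
        if i < answer_start:
            distances.append(answer_start - i)
        elif i > answer_end:
            distances.append(i - answer_end)
        else:
            distances.append(0)
    return distances

context = ["The", "tower", "was", "built", "in", "1889", "by", "Eiffel"]
# Answer span is the single token "1889" at index 5.
dist = relative_distances(len(context), answer_start=5, answer_end=5)
```

Each distance would typically index an embedding table so that the copy mechanism can favor words near the answer.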