Songfang Huang


2024

Unlocking the Potential of Model Merging for Low-Resource Languages
Mingxu Tao | Chen Zhang | Quzhe Huang | Tianyao Ma | Songfang Huang | Dongyan Zhao | Yansong Feng
Findings of the Association for Computational Linguistics: EMNLP 2024

Adapting large language models (LLMs) to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT). However, this CT-then-SFT approach struggles with limited data in the context of low-resource languages, failing to balance language modeling and task-solving capabilities. We thus propose a new model merging solution as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. We use model merging to develop task-solving LLMs for low-resource languages without SFT data in the target languages. Our experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data. Observing that the gains from model merging saturate as the number of training tokens grows, we further analyze the merging process and introduce a slack variable into the model merging algorithm to mitigate the loss of important parameters, thereby enhancing model performance. We hope that, with its higher data efficiency, model merging can benefit more human languages suffering from data scarcity.
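
To make the merging idea concrete, below is a minimal weight-space merging sketch in PyTorch. The task-vector formulation, the single mixing coefficient alpha, and the function name merge_state_dicts are illustrative assumptions, not the paper's exact algorithm or its slack-variable extension.

def merge_state_dicts(base_sd, lang_sd, task_sd, alpha=0.5):
    """Merge a language-adapted model and a task-tuned model, both fine-tuned
    from the same base checkpoint, into one model without any training.
    Inputs are PyTorch state dicts mapping parameter names to tensors."""
    merged = {}
    for name, base_param in base_sd.items():
        lang_delta = lang_sd[name] - base_param   # language-modeling capability
        task_delta = task_sd[name] - base_param   # task-solving capability
        merged[name] = base_param + alpha * lang_delta + (1.0 - alpha) * task_delta
    return merged

A slack-variable-style refinement, as described in the abstract, would additionally protect the most important (largest-magnitude) entries of each delta from being averaged away.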

A Survey on Open Information Extraction from Rule-based Model to Large Language Model
Liu Pai | Wenyang Gao | Wenjie Dong | Lin Ai | Ziwei Gong | Songfang Huang | Li Zongsheng | Ehsan Hoque | Julia Hirschberg | Yue Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text, unrestricted by relation type or domain. This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective absent in prior surveys. It examines the evolution of task settings in OpenIE to align with the advances in recent technologies. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework. Additionally, it highlights prevalent datasets and evaluation metrics currently in use. Building on this extensive review, this paper systematically reviews the evolution of task settings, data, evaluation metrics, and methodologies in the era of large language models, highlighting their mutual influence, comparing their capabilities, and examining their implications for open challenges and future research directions.

Text Diffusion Model with Encoder-Decoder Transformers for Sequence-to-Sequence Generation
Hongyi Yuan | Zheng Yuan | Chuanqi Tan | Fei Huang | Songfang Huang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The diffusion model, a new generative modeling paradigm, has achieved great success in image, audio, and video generation. However, considering the discrete categorical nature of text, it is not trivial to extend continuous diffusion models to natural language. In this work, we propose SeqDiffuSeq, a text diffusion model, to approach sequence-to-sequence text generation with an encoder-decoder Transformer architecture. To improve the generation performance, SeqDiffuSeq is equipped with the self-conditioning technique and our newly proposed adaptive noise schedule technique. Self-conditioning enables SeqDiffuSeq to better use the predicted sequence information during the generation process. The adaptive noise schedule balances the difficulty of denoising across time steps at the token level. Experimental results illustrate improved performance on five sequence-to-sequence generation tasks compared to other diffusion-based models in terms of text quality and inference time.

Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models
Qingkai Min | Qipeng Guo | Xiangkun Hu | Songfang Huang | Zheng Zhang | Yue Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-document event coreference resolution (CDECR) involves clustering event mentions across multiple documents that refer to the same real-world events. Existing approaches utilize fine-tuning of small language models (SLMs) like BERT to address the compatibility among the contexts of event mentions. However, due to the complexity and diversity of contexts, these models are prone to learning simple co-occurrences. Recently, large language models (LLMs) like ChatGPT have demonstrated impressive contextual understanding, yet they encounter challenges in adapting to specific information extraction (IE) tasks. In this paper, we propose a collaborative approach for CDECR, leveraging the capabilities of both a universally capable LLM and a task-specific SLM. The collaborative strategy begins with the LLM accurately and comprehensively summarizing events through prompting. Then, the SLM refines its learning of event representations based on these insights during fine-tuning. Experimental results demonstrate that our approach surpasses the performance of both the large and small language models individually, forming a complementary advantage. Across various datasets, our approach achieves state-of-the-art performance, underscoring its effectiveness in diverse scenarios.

Harder Task Needs More Experts: Dynamic Routing in MoE Models
Quzhe Huang | Zhenwei An | Nan Zhuang | Mingxu Tao | Chen Zhang | Yang Jin | Kun Xu | Kun Xu | Liwei Chen | Songfang Huang | Yansong Feng
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models, aiming to enhance computational efficiency and model performance by adjusting the number of activated experts based on input difficulty. Unlike existing MoE approaches that rely on fixed TopK Routing, which activates a predetermined number of experts regardless of the input’s complexity, our method dynamically allocates experts based on the confidence level in expert selection for each input. This allows for more efficient utilization of computational resources, activating more experts for complex tasks requiring advanced reasoning and fewer for simpler tasks. Through extensive evaluations, our dynamic routing method demonstrates substantial improvements over Top2 Routing across various benchmarks, achieving an average improvement of 0.7% while activating fewer than 90% of the parameters. Further analysis shows our model dispatches more experts to tasks requiring complex reasoning skills, like BBH, confirming its ability to dynamically allocate computational resources in alignment with the input’s complexity. Our findings also highlight a variation in the number of experts needed across different layers of the transformer model, offering insights into the potential for designing heterogeneous MoE frameworks. The code and models are available at https://github.com/ZhenweiAn/Dynamic_MoE.
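
As a rough illustration of confidence-based dynamic routing (not the authors' released implementation), the sketch below activates, per token, the smallest set of experts whose cumulative router probability reaches a threshold; the threshold value and the densely precomputed expert_outputs tensor are simplifying assumptions.

import torch
import torch.nn.functional as F

def dynamic_route(router_logits, expert_outputs, threshold=0.5):
    """Activate, per token, the smallest set of experts whose cumulative router
    probability reaches `threshold`, then mix their outputs.
    router_logits: [n_tokens, n_experts]
    expert_outputs: [n_tokens, n_experts, d_model] (computed densely here only
    for clarity; a real MoE layer would run just the selected experts)."""
    probs = F.softmax(router_logits, dim=-1)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_p.cumsum(dim=-1)
    # An expert is kept if the probability mass ranked ahead of it is < threshold,
    # so flat (low-confidence) routing distributions keep more experts.
    keep = (cumulative - sorted_p) < threshold
    weights = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_p * keep.to(probs.dtype))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)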

2023

HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation
Hongyi Yuan | Zheng Yuan | Chuanqi Tan | Fei Huang | Songfang Huang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models with the Transformer structure have shown great performance in natural language processing. However, problems such as over-fitting or representation collapse still arise when fine-tuning pre-trained language models on downstream tasks. In this work, we propose HyPe, a simple yet effective fine-tuning technique that alleviates such problems by perturbing the hidden representations of Transformer layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformer layers convey more diverse and meaningful language information. Therefore, making the Transformer layers more robust to hidden representation perturbations can further benefit the fine-tuning of PLMs en bloc. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances the generalization of hidden representations from different layers. In addition, HyPe incurs negligible computational overhead, and is both superior and complementary to previous state-of-the-art fine-tuning techniques.
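
The core operation is easy to sketch; the snippet below (with an assumed noise scale and only the Gaussian-noise variant) shows the kind of hidden-state perturbation applied between Transformer layers during fine-tuning.

import torch

def perturb_hidden_states(hidden_states, sigma=1e-4, training=True):
    """Add small Gaussian noise to a Transformer layer's hidden representations.
    Applied only during fine-tuning; inference is left untouched."""
    if not training:
        return hidden_states
    return hidden_states + torch.randn_like(hidden_states) * sigma

In use, each layer's output would be wrapped as h = perturb_hidden_states(layer(h), training=model.training) before being passed to the next layer.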

Transforming Visual Scene Graphs to Image Captions
Xu Yang | Jiawei Peng | Zihua Wang | Haiyang Xu | Qinghao Ye | Chenliang Li | Songfang Huang | Fei Huang | Zhangzikang Li | Yu Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose to TransForm Scene Graphs into more descriptive Captions (TFSGC). In TFSGC, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) for embedding scene graphs. After embedding, different graph embeddings contain diverse, specific knowledge for generating words with different parts of speech, e.g., object/attribute embeddings are good for generating nouns/adjectives. Motivated by this, we design a Mixture-of-Experts (MOE)-based decoder, where each expert is built on MHA, to discriminate among the graph embeddings and generate different kinds of words. Since both the encoder and decoder are built on MHA, we obtain a simple and homogeneous encoder-decoder, unlike previous heterogeneous ones which usually apply a fully-connected-based GNN and an LSTM-based decoder. The homogeneous architecture enables us to unify the training configuration of the whole model instead of specifying different training strategies for diverse sub-networks as in the heterogeneous pipeline, which reduces the training difficulty. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of our TFSGC. The code is available at: https://anonymous.4open.science/r/ACL23_TFSGC.

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
Chaoya Jiang | Wei Ye | Haiyang Xu | Songfang Huang | Fei Huang | Shikun Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we reconsider the problem of (partial) false negative samples from the Mutual Information (MI) maximization perspective: the traditional contrastive loss (such as the InfoNCE loss) pushes all negative samples away from the anchor equally, regardless of their possible semantic similarity to it. We theoretically show that the InfoNCE loss not only maximizes the MI between the anchor and positive samples but also minimizes the MI between the anchor and false negative samples, even though they share similar semantics. This provides a possible theoretical explanation for the observation that false negative samples in cross-modal contrastive learning decrease the downstream task performance of VLP models. The above analysis motivates us to propose a VLP model with a novel semantic-aware contrastive learning framework, named SACL, in which different negative samples are assigned different contrastive weights according to their semantic similarity to the anchor.
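
A minimal sketch of the weighting idea, assuming a precomputed (and detached) semantic-similarity matrix sem_sim in [0, 1], e.g. from a momentum or teacher encoder; the exact weighting function used in SACL may differ.

import torch
import torch.nn.functional as F

def similarity_weighted_contrastive_loss(img_emb, txt_emb, sem_sim, temperature=0.07):
    """InfoNCE-style loss where each negative is down-weighted in proportion to
    its semantic similarity to the anchor, softening likely false negatives.
    img_emb, txt_emb: [B, D]; sem_sim: [B, B] similarities in [0, 1], detached."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # [B, B]
    weights = 1.0 - sem_sim                            # semantically close negatives count less
    weights.fill_diagonal_(1.0)                        # positives keep full weight
    denom = (torch.exp(logits) * weights).sum(dim=-1)
    loss = -(logits.diag() - torch.log(denom))
    return loss.mean()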

Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning
Zhen-Ru Zhang | Chuanqi Tan | Haiyang Xu | Chengyu Wang | Jun Huang | Songfang Huang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Fine-tuning all the parameters of large pre-trained language models on various downstream tasks is prohibitively expensive. Hence, parameter-efficient fine-tuning, which optimizes only a few task-specific parameters while keeping the pre-trained model frozen, has attracted attention. In this work, we focus on prefix tuning, which only optimizes continuous prefix vectors (i.e., pseudo tokens) inserted into Transformer layers. Based on the observation that the learned syntactic and semantic representations vary a lot across layers, we argue that an adaptive prefix, tailored to each layer, can make fine-tuning more effective and efficient than a fixed one. Thus, we propose Adaptive Prefix Tuning (APT), which adjusts the prefix at both the fine-grained token level and the coarse-grained layer level with a gate mechanism. Experiments on the SuperGLUE and NER datasets show the effectiveness of APT. In addition, using the gate as a probe, we validate the efficiency and effectiveness of the variable prefix.
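
The following sketch shows one plausible way to realize the two-level gating described above; the parameter shapes, sigmoid gates, and initialization are assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class AdaptivePrefix(nn.Module):
    """Learned prefix key/value pseudo-tokens per layer, modulated by a
    fine-grained per-token gate and a coarse per-layer gate."""
    def __init__(self, n_layers, n_prefix, d_model):
        super().__init__()
        self.prefix_kv = nn.Parameter(torch.randn(n_layers, 2, n_prefix, d_model) * 0.02)
        self.token_gate = nn.Parameter(torch.zeros(n_layers, n_prefix))
        self.layer_gate = nn.Parameter(torch.zeros(n_layers))

    def forward(self, layer_idx):
        gate = torch.sigmoid(self.token_gate[layer_idx]).unsqueeze(-1) \
             * torch.sigmoid(self.layer_gate[layer_idx])
        keys, values = self.prefix_kv[layer_idx]       # each [n_prefix, d_model]
        return keys * gate, values * gate              # gated prefix for this layer

The returned key/value pairs would be prepended to the attention keys and values of the corresponding frozen Transformer layer.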

XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding
Moming Tang | Chengyu Wang | Jianing Wang | Chuanqi Tan | Songfang Huang | Cen Chen | Weining Qian
Findings of the Association for Computational Linguistics: ACL 2023

Recently, Contrastive Visual-Language Pre-training (CLIP) has demonstrated remarkable capability in various Visual Language Understanding (VLU) tasks. Yet, most CLIP-based methods require task-specific designs and sufficient training data. In this paper, we introduce a simple yet efficient paradigm for low-resource VLU named XtremeCLIP, which involves very few trainable parameters to improve the generalization ability of the trained models. In our XtremeCLIP framework, we reformulate a series of VLU tasks as a unified open-book affinity-matching problem. Furthermore, to handle the insufficient supervised signals in small datasets, we adopt contrastive learning to utilize the implicit sorting information of ground-truth labels to provide more supervised cues. Extensive experiments over multiple datasets on visual entailment, visual question answering, and image classification show that XtremeCLIP consistently outperforms existing baselines in low-resource settings.

PEER: Pre-training ELECTRA Extended by Ranking
Ru He | Wei Wang | Songfang Huang | Fei Huang
Findings of the Association for Computational Linguistics: ACL 2023

The BERT model and its variants have made great achievements in many downstream natural language processing tasks. These achievements, however, come at a highly expensive pre-training computation cost. To address this pre-training efficiency issue, the ELECTRA model was proposed, which uses a discriminator to perform a replaced token detection (RTD) task, that is, to classify whether each input token is original or replaced by a generator. The RTD task accelerates pre-training so substantially that it is very challenging to further improve the pre-training efficiency established by ELECTRA through using or adding other pre-training tasks, as the recent comprehensive study of Bajaj et al. (2022) summarizes. To further advance this pre-training efficiency frontier, in this paper we propose to extend the RTD task into a task of ranking input tokens according to K different quality levels. Essentially, we generalize the binary classifier in ELECTRA into a K-level ranker to undertake a more precise task with negligible additional computation cost. Our extensive experiments show that our proposed method outperforms state-of-the-art pre-training-efficient models, including ELECTRA, on downstream GLUE tasks given the same computation cost.
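
As a simplified illustration of the K-level idea (using a plain K-way cross-entropy head in place of whatever ranking objective the paper actually uses), the discriminator head below scores each token into one of K quality levels instead of making a binary original/replaced decision.

import torch.nn as nn
import torch.nn.functional as F

class KLevelDiscriminatorHead(nn.Module):
    """Generalize ELECTRA's binary replaced-token-detection head to K quality
    levels (e.g. original, high-quality replacement, ..., random replacement)."""
    def __init__(self, d_model, k_levels=4):
        super().__init__()
        self.scorer = nn.Linear(d_model, k_levels)

    def forward(self, hidden_states, level_labels):
        # hidden_states: [batch, seq_len, d_model]; level_labels: [batch, seq_len]
        logits = self.scorer(hidden_states)
        return F.cross_entropy(logits.view(-1, logits.size(-1)), level_labels.view(-1))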

2022

Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency
Yanyang Li | Fuli Luo | Runxin Xu | Songfang Huang | Fei Huang | Liwei Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Structured pruning has been extensively studied on monolingual pre-trained language models and is yet to be fully evaluated on their multilingual counterparts. This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency. Experiments on nine downstream tasks show several counter-intuitive phenomena: for settings, individually pruning for each language does not induce a better result; for algorithms, the simplest method performs the best; for efficiency, a fast model does not imply that it is also small. To facilitate the comparison on all sparsity levels, we present Dynamic Sparsification, a simple approach that allows training the model once and adapting to different model sizes at inference. We hope this work fills the gap in the study of structured pruning on multilingual pre-trained models and sheds light on future research.

S4-Tuning: A Simple Cross-lingual Sub-network Tuning Method
Runxin Xu | Fuli Luo | Baobao Chang | Songfang Huang | Fei Huang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The emergence of multilingual pre-trained language models makes it possible to adapt to target languages with only a few labeled examples. However, vanilla fine-tuning tends to achieve degenerated and unstable results, owing to Language Interference among different languages and Parameter Overload under few-sample transfer learning scenarios. To address these two problems elegantly, we propose S4-Tuning, a Simple Cross-lingual Sub-network Tuning method. S4-Tuning first detects the most essential sub-network for each target language, and only updates it during fine-tuning. In this way, the language sub-networks lower the scale of trainable parameters, and hence better suit low-resource scenarios. Meanwhile, the commonality and characteristics across languages are modeled by the overlapping and non-overlapping parts to ease the interference among languages. Simple but effective, S4-Tuning gains consistent improvements over vanilla fine-tuning on three multilingual tasks involving 37 different languages in total (XNLI, PAWS-X, and Tatoeba).
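
A rough sketch of the two steps, assuming a squared-gradient importance score and a global keep ratio (the paper's actual sub-network detection criterion may differ); loss_fn and calib_batches are placeholders for a target-language objective and data.

import torch

def detect_subnetwork(model, calib_batches, loss_fn, keep_ratio=0.1):
    """Score every parameter by its accumulated squared gradient on
    target-language data and keep the top `keep_ratio` fraction as the
    language sub-network (returned as 0/1 masks)."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in calib_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
    flat = torch.cat([s.flatten() for s in scores.values()])
    cutoff = torch.quantile(flat, 1.0 - keep_ratio)
    return {n: (s >= cutoff).float() for n, s in scores.items()}

def subnetwork_step(model, masks, optimizer):
    """Zero the gradients outside the sub-network so only it gets updated."""
    for n, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[n])
    optimizer.step()
    optimizer.zero_grad()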

Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding
Zheng Yuan | Chuanqi Tan | Songfang Huang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Automatic ICD coding is defined as assigning disease codes to electronic medical records (EMRs). Existing methods usually apply label attention with code representations to match related text snippets. Unlike these works, which model the label with the code hierarchy or description, we argue that code synonyms can provide more comprehensive knowledge, based on the observation that code expressions in EMRs vary from their descriptions in ICD. By aligning codes to concepts in UMLS, we collect synonyms of every code. Then, we propose a multiple synonyms matching network to leverage synonyms for better code representation learning, which in turn helps code classification. Experiments on the MIMIC-III dataset show that our proposed method outperforms previous state-of-the-art methods.

SpanProto: A Two-stage Span-based Prototypical Network for Few-shot Named Entity Recognition
Jianing Wang | Chengyu Wang | Chuanqi Tan | Minghui Qiu | Songfang Huang | Jun Huang | Ming Gao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Few-shot Named Entity Recognition (NER) aims to identify named entities with very little annotated data. Previous methods address this problem with token-wise classification, which ignores entity boundary information, so performance inevitably suffers from the massive number of non-entity tokens. To this end, we propose a span-based prototypical network (SpanProto) that tackles few-shot NER via a two-stage approach consisting of span extraction and mention classification. In the span extraction stage, we transform the sequential tags into a global boundary matrix, enabling the model to focus on explicit boundary information. For mention classification, we leverage prototypical learning to capture the semantic representations of each labeled span and make the model better adapt to novel-class entities. To further improve model performance, we split out false positives that are generated by the span extractor but not labeled in the current episode, and present a margin-based loss to separate them from each prototype region. Experiments over multiple benchmarks demonstrate that our model outperforms strong baselines by a large margin.
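
The second (mention-classification) stage can be sketched as standard prototypical inference over span embeddings; the Euclidean distance, the mean prototypes, and the function name are generic prototypical-network choices rather than the paper's exact formulation, and the span-extraction stage and margin-based loss are omitted.

import torch

def prototype_classify(query_spans, support_spans, support_labels, n_classes):
    """Assign each query span embedding to the nearest class prototype, where a
    prototype is the mean embedding of that class's support spans."""
    prototypes = torch.stack([
        support_spans[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                                  # [n_classes, dim]
    distances = torch.cdist(query_spans, prototypes)    # [n_query, n_classes]
    return (-distances).softmax(dim=-1)                 # probability per class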

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Chaoya Jiang | Haiyang Xu | Chenliang Li | Ming Yan | Wei Ye | Shikun Zhang | Bin Bi | Songfang Huang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from the computational inefficiency brought by long visual sequences. To tackle this problem, in this paper we propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS, which progressively reduces the visual sequence with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer can dynamically compute text-dependent visual attention to identify the attentive image tokens with text guidance and fuse inattentive ones in an end-to-end manner. Meanwhile, TRIPS does not introduce extra parameters to ViTs. Experimental results on a variety of popular benchmark datasets demonstrate that TRIPS gains a speedup of 40% over previous similar VLP models, with competitive or better downstream task performance.

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li | Haiyang Xu | Junfeng Tian | Wei Wang | Ming Yan | Bin Bi | Jiabo Ye | He Chen | Guohai Xu | Zheng Cao | Ji Zhang | Songfang Huang | Fei Huang | Jingren Zhou | Luo Si
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large-scale pre-trained foundation models have become an emerging paradigm for building artificial intelligence (AI) systems, and can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from inefficiency in cross-modal alignment and from the linguistic signal being overwhelmed by long visual sequences. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind

Fusing Heterogeneous Factors with Triaffine Mechanism for Nested Named Entity Recognition
Zheng Yuan | Chuanqi Tan | Songfang Huang | Fei Huang
Findings of the Association for Computational Linguistics: ACL 2022

Nested entities are observed in many domains due to their compositionality, and they cannot be easily recognized by the widely-used sequence labeling framework. A natural solution is to treat the task as a span classification problem. To learn better span representations and increase classification performance, it is crucial to effectively integrate heterogeneous factors, including inside tokens, boundaries, labels, and related spans, which can all contribute to nested entity recognition. To fuse these heterogeneous factors, we propose a novel triaffine mechanism consisting of triaffine attention and triaffine scoring. Triaffine attention uses boundaries and labels as queries and uses inside tokens and related spans as keys and values to build span representations. Triaffine scoring combines boundaries and span representations for classification. Experiments show that our proposed method outperforms previous span-based methods, achieves state-of-the-art F1 scores on the nested NER datasets GENIA and KBP2017, and shows comparable results on ACE2004 and ACE2005.
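
The core tensor operation can be sketched as below; this shows only a single triaffine form over three vectors (with an added bias dimension), whereas the paper applies such forms inside both the attention and the scoring stages, so treat it as an illustrative building block.

import torch
import torch.nn as nn

class Triaffine(nn.Module):
    """Score three vectors (e.g. a label query, a span's left-boundary and
    right-boundary representations) through one third-order weight tensor."""
    def __init__(self, d_in, d_out=1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_out, d_in + 1, d_in + 1, d_in + 1))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, q, h, t):
        # q, h, t: [batch, d_in]; append a constant 1 so bias/biaffine terms are included.
        ones = q.new_ones(q.size(0), 1)
        q, h, t = (torch.cat([x, ones], dim=-1) for x in (q, h, t))
        # s[b, o] = sum_{i,j,k} W[o, i, j, k] * q[b, i] * h[b, j] * t[b, k]
        return torch.einsum('oijk,bi,bj,bk->bo', self.weight, q, h, t)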

Towards Unified Prompt Tuning for Few-shot Text Classification
Jianing Wang | Chengyu Wang | Fuli Luo | Chuanqi Tan | Minghui Qiu | Fei Yang | Qiuhui Shi | Songfang Huang | Ming Gao
Findings of the Association for Computational Linguistics: EMNLP 2022

Prompt-based fine-tuning has boosted the performance of Pre-trained Language Models (PLMs) on few-shot text classification by employing task-specific prompts. Yet, PLMs are unfamiliar with prompt-style expressions during pre-training, which limits few-shot learning performance on downstream tasks. It would be desirable if the models could acquire some prompting knowledge before adapting to specific NLP tasks. We present the Unified Prompt Tuning (UPT) framework, which leads to better few-shot text classification for BERT-style models by explicitly capturing prompting semantics from non-target NLP datasets. In UPT, a novel Prompt-Options-Verbalizer paradigm is proposed for joint prompt learning across different NLP tasks, forcing PLMs to capture task-invariant prompting knowledge. We further design a self-supervised task named Knowledge-enhanced Selective Masked Language Modeling to improve the PLM’s generalization ability for accurate adaptation to previously unseen tasks. After multi-task learning across multiple tasks, the PLM can be better prompt-tuned towards any dissimilar target task in low-resource settings. Experiments over a variety of NLP tasks show that UPT consistently outperforms state-of-the-art methods for prompt-based fine-tuning.

2021

damo_nlp at MEDIQA 2021: Knowledge-based Preprocessing and Coverage-oriented Reranking for Medical Question Summarization
Yifan He | Mosha Chen | Songfang Huang
Proceedings of the 20th Workshop on Biomedical Language Processing

Medical question summarization is an important but difficult task, where the input is often complex and erroneous while annotated data is expensive to acquire. We report our participation in the MEDIQA 2021 question summarization task in which we are required to address these challenges. We start from pre-trained conditional generative language models, use knowledge bases to help correct input errors, and rerank single system outputs to boost coverage. Experimental results show significant improvement in string-based metrics.

Improving Biomedical Pretrained Language Models with Knowledge
Zheng Yuan | Yijia Liu | Chuanqi Tan | Songfang Huang | Fei Huang
Proceedings of the 20th Workshop on Biomedical Language Processing

Pretrained language models have shown success in many natural language processing tasks. Many works explore incorporating knowledge into language models. In the biomedical domain, experts have spent decades of effort building large-scale knowledge bases. For example, UMLS contains millions of entities with their synonyms and defines hundreds of relations among entities. Leveraging this knowledge can benefit a variety of downstream tasks such as named entity recognition and relation extraction. To this end, we propose KeBioLM, a biomedical pretrained language model that explicitly leverages knowledge from the UMLS knowledge base. Specifically, we extract entities from PubMed abstracts and link them to UMLS. We then train a knowledge-aware language model that first applies a text-only encoding layer to learn entity representations and then applies a text-entity fusion encoding to aggregate entity representations. In addition, we add two training objectives: entity detection and entity linking. Experiments on the named entity recognition and relation extraction tasks from the BLURB benchmark demonstrate the effectiveness of our approach. Further analysis on a collected probing dataset shows that our model has a better ability to model medical knowledge.

E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Haiyang Xu | Ming Yan | Chenliang Li | Bin Bi | Songfang Huang | Wenming Xiao | Fei Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Vision-language pre-training (VLP) on large-scale image-text pairs has achieved huge success for cross-modal downstream tasks. Most existing pre-training methods adopt a two-step training procedure, which first employs a pre-trained object detector to extract region-based visual features and then concatenates the image representations and text embeddings as the input to a Transformer for training. However, these methods face the problems of using the task-specific visual representations of a specific object detector for generic cross-modal understanding, and of the computational inefficiency of the two-stage pipeline. In this paper, we propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, in which we build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text. We incorporate the tasks of object detection and image captioning into pre-training with a unified Transformer encoder-decoder architecture to enhance visual learning. An extensive set of experiments has been conducted on well-established vision-language downstream tasks to demonstrate the effectiveness of this novel VLP paradigm.

VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation
Fuli Luo | Wei Wang | Jiahao Liu | Yijia Liu | Bin Bi | Songfang Huang | Fei Huang | Luo Si
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work only relies on the shared vocabulary and bilingual contexts to encourage the correlation across languages, which is loose and implicit for aligning the contextual representations between languages. In this paper, we plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language. More importantly, when fine-tuning on downstream tasks, the cross-attention module can be plugged in or out on-demand, thus naturally benefiting a wider range of cross-lingual tasks, from language understanding to generation. As a result, the proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For cross-lingual generation tasks, it also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1–2 BLEU.

StructuralLM: Structural Pre-training for Form Understanding
Chenliang Li | Bin Bi | Ming Yan | Wei Wang | Songfang Huang | Fei Huang | Luo Si
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Large pre-trained language models achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, they almost exclusively focus on text-only representation, while neglecting cell-level layout information that is important for form image understanding. In this paper, we propose a new pre-training approach, StructuralLM, to jointly leverage cell and layout information from scanned documents. Specifically, we pre-train StructuralLM with two new designs to make the most of the interactions of cell and layout information: 1) each cell as a semantic unit; 2) classification of cell positions. The pre-trained StructuralLM achieves new state-of-the-art results in different types of downstream tasks, including form understanding (from 78.95 to 85.14), document visual question answering (from 72.59 to 83.94) and document image classification (from 94.43 to 96.08).

Addressing Semantic Drift in Generative Question Answering with Auxiliary Extraction
Chenliang Li | Bin Bi | Ming Yan | Wei Wang | Songfang Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Recently, question answering (QA) based on machine reading comprehension has become popular. This work focuses on generative QA which aims to generate an abstractive answer to a given question instead of extracting an answer span from a provided passage. Generative QA often suffers from two critical problems: (1) summarizing content irrelevant to a given question, (2) drifting away from a correct answer during generation. In this paper, we address these problems by a novel Rationale-Enriched Answer Generator (REAG), which incorporates an extractive mechanism into a generative model. Specifically, we add an extraction task on the encoder to obtain the rationale for an answer, which is the most relevant piece of text in an input document to a given question. Based on the extracted rationale and original input, the decoder is expected to generate an answer with high confidence. We jointly train REAG on the MS MARCO QA+NLG task and the experimental results show that REAG improves the quality and semantic accuracy of answers over baseline models.

Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models
Yuxuan Lai | Yijia Liu | Yansong Feng | Songfang Huang | Dongyan Zhao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Chinese pre-trained language models usually process text as a sequence of characters, while ignoring more coarse granularity, e.g., words. In this work, we propose a novel pre-training paradigm for Chinese — Lattice-BERT, which explicitly incorporates word representations along with characters, thus can model a sentence in a multi-granularity manner. Specifically, we construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers. We design a lattice position attention mechanism to exploit the lattice structures in self-attention layers. We further propose a masked segment prediction task to push the model to learn from rich but redundant information inherent in lattices, while avoiding learning unexpected tricks. Experiments on 11 Chinese natural language understanding tasks show that our model can bring an average increase of 1.5% under the 12-layer setting, which achieves new state-of-the-art among base-size models on the CLUE benchmarks. Further analysis shows that Lattice-BERT can harness the lattice structures, and the improvement comes from the exploration of redundant information and multi-granularity representations. Our code will be available at https://github.com/alibaba/pretrained-language-models/LatticeBERT.

Noisy-Labeled NER with Confidence Estimation
Kun Liu | Yao Fu | Chuanqi Tan | Mosha Chen | Ningyu Zhang | Songfang Huang | Sheng Gao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent studies in deep learning have shown significant progress in named entity recognition (NER). However, most existing works assume clean data annotation, while real-world scenarios typically involve a large amount of noises from a variety of sources (e.g., pseudo, weak, or distant annotations). This work studies NER under a noisy labeled setting with calibrated confidence estimation. Based on empirical observations of different training dynamics of noisy and clean labels, we propose strategies for estimating confidence scores based on local and global independence assumptions. We partially marginalize out labels of low confidence with a CRF model. We further propose a calibration method for confidence scores based on the structure of entity labels. We integrate our approach into a self-training framework for boosting performance. Experiments in general noisy settings with four languages and distantly labeled settings demonstrate the effectiveness of our method.

Rethinking Denoised Auto-Encoding in Language Pre-Training
Fuli Luo | Pengcheng Yang | Shicheng Li | Xuancheng Ren | Xu Sun | Songfang Huang | Fei Huang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-trained self-supervised models such as BERT have achieved striking success in learning sequence representations, especially for natural language processing. These models typically corrupt the given sequences with certain types of noise, such as masking, shuffling, or substitution, and then try to recover the original input. However, such pre-training approaches are prone to learning representations that are covariant with the noise, leading to the discrepancy between the pre-training and fine-tuning stage. To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise invariant sequence representations. The proposed CAPT encourages the consistency between representations of the original sequence and its corrupted version via unsupervised instance-wise training signals. In this way, it not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing global semantics of the input via more effective sentence-level supervision. Different from most prior work that focuses on a particular modality, comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable for both language and vision-language tasks, and obtains surprisingly consistent improvement, including 0.6% absolute gain on GLUE benchmarks and 0.8% absolute increment on NLVR2.
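
The instance-wise signal can be sketched as a standard InfoNCE objective between pooled representations of an original sequence and its corrupted counterpart; the pooling, temperature, and use of in-batch negatives are generic assumptions rather than CAPT's exact configuration.

import torch
import torch.nn.functional as F

def contrastive_pretraining_loss(original_repr, corrupted_repr, temperature=0.1):
    """Instance-wise InfoNCE: each corrupted sequence should match the pooled
    representation of its own original sequence and not those of other
    sequences in the batch. original_repr, corrupted_repr: [B, D]."""
    z1 = F.normalize(original_repr, dim=-1)
    z2 = F.normalize(corrupted_repr, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)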

Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning
Runxin Xu | Fuli Luo | Zhiyuan Zhang | Chuanqi Tan | Baobao Chang | Songfang Huang | Fei Huang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent pretrained language models extend from millions to billions of parameters. Thus the need to fine-tune an extremely large pretrained model with a limited training corpus arises in various downstream tasks. In this paper, we propose a straightforward yet effective fine-tuning technique, Child-Tuning, which updates a subset of parameters (called the child network) of large pretrained models via strategically masking out the gradients of the non-child network during the backward process. Experiments on various downstream tasks in the GLUE benchmark show that Child-Tuning consistently outperforms vanilla fine-tuning by 1.5–8.6 average score points among four different pretrained models, and surpasses prior fine-tuning techniques by 0.6–1.3 points. Furthermore, empirical results on domain transfer and task transfer show that Child-Tuning can obtain better generalization performance by large margins.
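
A minimal sketch of the task-free variant of this idea: draw a Bernoulli mask over parameters at each step, zero the gradients outside the child network, and rescale the rest; the task-driven variant would instead fix the mask using an importance measure such as Fisher information.

import torch

def child_tuning_f_step(model, optimizer, p_child=0.2):
    """Task-free child-tuning update: sample a Bernoulli mask over parameters,
    zero the gradients of the non-child network, and rescale the remaining
    gradients so their expectation is unchanged."""
    for p in model.parameters():
        if p.grad is not None:
            mask = torch.bernoulli(torch.full_like(p.grad, p_child))
            p.grad.mul_(mask / p_child)
    optimizer.step()
    optimizer.zero_grad()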

2020

Predicting Clinical Trial Results by Implicit Evidence Integration
Qiao Jin | Chuanqi Tan | Mosha Chen | Xiaozhong Liu | Songfang Huang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Clinical trials provide essential guidance for practicing Evidence-Based Medicine, though they are often accompanied by substantial costs and risks. To optimize the design of clinical trials, we introduce a novel Clinical Trial Result Prediction (CTRP) task. In the CTRP framework, a model takes a PICO-formatted clinical trial proposal with its background as input and predicts the result, i.e. how the Intervention group compares with the Comparison group in terms of the measured Outcome in the studied Population. While structured clinical evidence is prohibitively expensive to collect manually, we exploit large-scale unstructured sentences from the medical literature that implicitly contain PICOs and results as evidence. Specifically, we pre-train a model to predict the disentangled results from such implicit evidence and fine-tune the model with limited data on the downstream datasets. Experiments on the benchmark Evidence Integration dataset show that the proposed model outperforms the baselines by large margins, e.g., with a 10.7% relative gain over BioBERT in macro-F1. Moreover, the performance improvement is also validated on another dataset composed of clinical trials related to COVID-19.

PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation
Bin Bi | Chenliang Li | Chen Wu | Ming Yan | Wei Wang | Songfang Huang | Fei Huang | Luo Si
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self-supervised pre-training, such as BERT, MASS and BART, has emerged as a powerful technique for natural language understanding and generation. Existing pre-training techniques employ autoencoding and/or autoregressive objectives to train Transformer-based models by recovering original word tokens from corrupted text with some masked tokens. The training goals of existing techniques are often inconsistent with the goals of many language generation tasks, such as generative question answering and conversational response generation, for producing new text given context. This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus, specifically designed for generating new text conditioned on context. The new scheme alleviates the mismatch introduced by the existing denoising scheme between pre-training and fine-tuning where generation is more than reconstructing original text. An extensive set of experiments show that PALM achieves new state-of-the-art results on a variety of language generation benchmarks covering generative question answering (Rank 1 on the official MARCO leaderboard), abstractive summarization on CNN/DailyMail as well as Gigaword, question generation on SQuAD, and conversational response generation on Cornell Movie Dialogues.

2018

Marrying Up Regular Expressions with Neural Networks: A Case Study for Spoken Language Understanding
Bingfeng Luo | Yansong Feng | Zheng Wang | Songfang Huang | Rui Yan | Dongyan Zhao
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The success of many natural language processing (NLP) tasks is bound by the number and quality of annotated data, but there is often a shortage of such training data. In this paper, we ask the question: “Can we combine a neural network (NN) with regular expressions (RE) to improve supervised learning for NLP?”. In answer, we develop novel methods to exploit the rich expressiveness of REs at different levels within a NN, showing that the combination significantly enhances the learning effectiveness when a small number of training examples are available. We evaluate our approach by applying it to spoken language understanding for intent detection and slot filling. Experimental results show that our approach is highly effective in exploiting the available training data, giving a clear boost to the RE-unaware NN.

2017

Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix
Bingfeng Luo | Yansong Feng | Zheng Wang | Zhanxing Zhu | Songfang Huang | Rui Yan | Dongyan Zhao
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Distant supervision significantly reduces human efforts in building training data for many classification tasks. While promising, this technique often introduces noise to the generated training data, which can severely affect the model performance. In this paper, we take a deep look at the application of distant supervision in relation extraction. We show that the dynamic transition matrix can effectively characterize the noise in the training data built by distant supervision. The transition matrix can be effectively trained using a novel curriculum learning based method without any direct supervision about the noise. We thoroughly evaluate our approach under a wide range of extraction scenarios. Experimental results show that our approach consistently improves the extraction results and outperforms the state-of-the-art in various evaluation scenarios.
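
One way to see the mechanism (omitting the curriculum-learning schedule, and with assumed layer shapes and names) is a head that predicts a clean relation distribution together with an instance-specific transition matrix, then trains against the noisy distant label through their product.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyRelationHead(nn.Module):
    """Predict a 'clean' relation distribution plus an instance-specific
    transition matrix, and train on the noisy distant label through their product."""
    def __init__(self, d_model, n_relations):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_relations)
        self.transition = nn.Linear(d_model, n_relations * n_relations)
        self.n_relations = n_relations

    def forward(self, sentence_repr, noisy_label):
        # sentence_repr: [B, d_model]; noisy_label: [B] distant-supervision labels.
        clean = F.softmax(self.classifier(sentence_repr), dim=-1)                       # [B, R]
        trans = F.softmax(self.transition(sentence_repr)
                          .view(-1, self.n_relations, self.n_relations), dim=-1)        # rows sum to 1
        observed = torch.bmm(clean.unsqueeze(1), trans).squeeze(1)                      # noisy-label distribution
        return F.nll_loss(torch.log(observed + 1e-12), noisy_label)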

2016

Hybrid Question Answering over Knowledge Base and Free Text
Kun Xu | Yansong Feng | Songfang Huang | Dongyan Zhao
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

A recent trend in question answering (QA) systems is to use structured knowledge bases (KBs) to find answers. While these systems are able to provide more precise answers than information retrieval (IR) based QA systems, the natural incompleteness of KBs inevitably limits the scope of questions that such systems can answer. In this paper, we present a hybrid question answering (hybrid-QA) system which exploits both a structured knowledge base and free text to answer a question. The main challenge is to recognize the meaning of a question using these two resources, i.e., the structured KB and free text. To address this, we map relational phrases to KB predicates and textual relations simultaneously, and further develop an integer linear program (ILP) model to perform inference over these candidates and provide a globally optimal solution. Experiments on benchmark datasets show that our system can benefit from both the structured KB and free text, outperforming the state-of-the-art systems.

Question Answering on Freebase via Relation Extraction and Textual Evidence
Kun Xu | Siva Reddy | Yansong Feng | Songfang Huang | Dongyan Zhao
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling
Kun Xu | Yansong Feng | Songfang Huang | Dongyan Zhao
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Semantic Interpretation of Superlative Expressions via Structured Knowledge Bases
Sheng Zhang | Yansong Feng | Songfang Huang | Kun Xu | Zhe Han | Dongyan Zhao
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Joint Inference for Knowledge Base Population
Liwei Chen | Yansong Feng | Jinghui Mo | Songfang Huang | Dongyan Zhao
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Encoding Relation Requirements for Relation Extraction via Joint Inference
Liwei Chen | Yansong Feng | Songfang Huang | Yong Qin | Dongyan Zhao
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)