Xiaoli Wang


2024

Multi-Level Cross-Modal Alignment for Speech Relation Extraction
Liang Zhang | Zhen Yang | Biao Fu | Ziyao Lu | Liangying Shao | Shiyu Liu | Fandong Meng | Jie Zhou | Xiaoli Wang | Jinsong Su
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Speech Relation Extraction (SpeechRE) aims to extract relation triplets from speech data. However, existing studies usually use synthetic speech to train and evaluate SpeechRE models, hindering the further development of SpeechRE due to the disparity between synthetic and real speech. Meanwhile, the modality gap issue, unexplored in SpeechRE, limits the performance of existing models. In this paper, we construct two real SpeechRE datasets to facilitate subsequent research and propose a Multi-level Cross-modal Alignment Model (MCAM) for SpeechRE. Our model consists of three components: 1) a speech encoder, extracting speech features from the input speech; 2) an alignment adapter, mapping these speech features into a suitable semantic space for the text decoder; and 3) a text decoder, autoregressively generating relation triplets based on the speech features. During training, we additionally introduce a text encoder to serve as a semantic bridge between the speech encoder and the text decoder, and then train the alignment adapter to align the output features of the speech and text encoders at multiple levels. In this way, we can effectively train the alignment adapter to bridge the modality gap between the speech encoder and the text decoder. Experimental results and in-depth analysis on our datasets strongly demonstrate the efficacy of our method.

MHGRL: An Effective Representation Learning Model for Electronic Health Records
Feiyan Liu | Liangzhi Li | Xiaoli Wang | Feng Luo | Chang Liu | Jinsong Su | Yiming Qian
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Electronic health records (EHRs) serve as a digital repository storing comprehensive medical information about patients. Representation learning for EHRs plays a crucial role in healthcare applications. In this paper, we propose a Multimodal Heterogeneous Graph-enhanced Representation Learning model, denoted MHGRL, aimed at learning effective EHR representations. To address the challenge posed by data insufficiency of EHRs, MHGRL utilizes a multimodal heterogeneous graph to model an EHR. Specifically, we construct a heterogeneous graph for each EHR and enrich it by incorporating multimodal information with medical ontology and textual notes. By integrating a pre-trained model, a graph neural network, and an attention mechanism, MHGRL effectively incorporates both node attributes and structural information across a multimodal heterogeneous graph. Moreover, we employ contrastive learning to ensure the consistency of representations for similar EHRs and improve model robustness. The experimental results show that MHGRL outperforms all baselines on two real clinical datasets in downstream tasks, including EHR clustering and disease prediction. The code is available at https://github.com/emmali808/MHGRL.

2023

A Sequence-to-Sequence&Set Model for Text-to-Table Generation
Tong Li | Zhihao Wang | Liangying Shao | Xuling Zheng | Xiaoli Wang | Jinsong Su
Findings of the Association for Computational Linguistics: ACL 2023

Recently, the text-to-table generation task has attracted increasing attention due to its wide applications. In this respect, the dominant model formalizes this task as a sequence-to-sequence generation task and serializes each table into a token sequence during training by concatenating all rows in a top-down order. However, it suffers from two serious defects: 1) the predefined order introduces a wrong bias during training, which highly penalizes shifts in the order between rows; 2) the error propagation problem becomes serious when the model outputs a long token sequence. In this paper, we first conduct a preliminary study to demonstrate that the generation of most rows is order-insensitive. Furthermore, we propose a novel sequence-to-sequence&set text-to-table generation model. Specifically, in addition to a text encoder encoding the input text, our model is equipped with a table header generator to first output a table header, i.e., the first row of the table, in the manner of sequence generation. Then we use a table body generator with learnable row embeddings and column embeddings to generate a set of table body rows in parallel. In particular, to deal with the issue that there is no correspondence between each generated table body row and target during training, we propose a target assignment strategy based on the bipartite matching between the first cells of generated table body rows and targets. Experimental results show that our model significantly surpasses the baselines, achieving state-of-the-art performance on commonly-used datasets.
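The target assignment idea in this abstract (bipartite matching between the first cells of generated body rows and the targets) can be illustrated with a minimal sketch. This is not the authors' implementation: the cost function `first_cell_cost` (a token-overlap distance), the function name `assign_targets`, and the brute-force search are all assumptions chosen to keep the example self-contained, and the paper may define the matching cost quite differently.

```python
from itertools import permutations


def first_cell_cost(pred_cell: str, target_cell: str) -> float:
    """Hypothetical matching cost: 1 minus the Jaccard overlap of tokens."""
    a = set(pred_cell.lower().split())
    b = set(target_cell.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def assign_targets(pred_first_cells, target_first_cells):
    """Return a list where entry i is the target index matched to generated
    row i, or None if that row receives no target.

    Assumes there are at least as many generated rows as targets.
    Brute force over all one-to-one assignments; fine for tiny examples only.
    """
    n_pred, n_tgt = len(pred_first_cells), len(target_first_cells)
    if n_pred < n_tgt:
        raise ValueError("need at least as many generated rows as targets")

    best_cost, best_perm = float("inf"), ()
    # perm[j] is the generated-row index assigned to target j.
    for perm in permutations(range(n_pred), n_tgt):
        cost = sum(
            first_cell_cost(pred_first_cells[perm[j]], target_first_cells[j])
            for j in range(n_tgt)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm

    assignment = [None] * n_pred
    for tgt_idx, pred_idx in enumerate(best_perm):
        assignment[pred_idx] = tgt_idx
    return assignment


# Generated rows come out in arbitrary order; matching recovers the pairing.
preds = ["Team B", "Player X", "Team A"]
targets = ["Team A", "Team B"]
print(assign_targets(preds, targets))
```

For realistic table sizes, a polynomial-time solver such as the Hungarian algorithm would replace the brute-force loop; the sketch only shows what "assignment by minimum total matching cost" means.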

ConKI: Contrastive Knowledge Injection for Multimodal Sentiment Analysis
Yakun Yu | Mingjun Zhao | Shi-ang Qi | Feiran Sun | Baoxun Wang | Weidong Guo | Xiaoli Wang | Lei Yang | Di Niu
Findings of the Association for Computational Linguistics: ACL 2023

Multimodal Sentiment Analysis leverages multimodal signals to detect the sentiment of a speaker. Previous approaches concentrate on performing multimodal fusion and representation learning based on general knowledge obtained from pretrained models, which neglects the effect of domain-specific knowledge. In this paper, we propose Contrastive Knowledge Injection (ConKI) for multimodal sentiment analysis, where specific-knowledge representations for each modality can be learned together with general knowledge representations via knowledge injection based on an adapter architecture. In addition, ConKI uses a hierarchical contrastive learning procedure performed between knowledge types within every single modality, across modalities within each sample, and across samples to facilitate the effective learning of the proposed representations, hence improving multimodal sentiment predictions. The experiments on three popular multimodal sentiment analysis benchmarks show that ConKI outperforms all prior methods on a variety of performance metrics.

2022

WR-One2Set: Towards Well-Calibrated Keyphrase Generation
Binbin Xie | Xiangpeng Wei | Baosong Yang | Huan Lin | Jun Xie | Xiaoli Wang | Min Zhang | Jinsong Su
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Keyphrase generation aims to automatically generate short phrases summarizing an input document. The recently emerged ONE2SET paradigm (Ye et al., 2021) generates keyphrases as a set and has achieved competitive performance. Nevertheless, we observe serious calibration errors in the outputs of ONE2SET, especially in the over-estimation of the ∅ token (meaning “no corresponding keyphrase”). In this paper, we deeply analyze this limitation and identify two main reasons behind it: 1) the parallel generation has to introduce excessive ∅ as padding tokens into training instances; and 2) the training mechanism assigning a target to each slot is unstable and further aggravates the ∅ token over-estimation. To make the model well-calibrated, we propose WR-ONE2SET, which extends ONE2SET with an adaptive instance-level cost Weighting strategy and a target Re-assignment mechanism. The former dynamically penalizes the over-estimated slots for different instances, thus smoothing the uneven training distribution. The latter refines the original inappropriate assignment and reduces the supervisory signals of over-estimated slots. Experimental results on commonly-used datasets demonstrate the effectiveness and generality of our proposed paradigm.

2020

Tencent submission for WMT20 Quality Estimation Shared Task
Haijiang Wu | Zixuan Wang | Qingsong Ma | Xinjie Wen | Ruichen Wang | Xiaoli Wang | Yulin Zhang | Zhipeng Yao | Siyao Peng
Proceedings of the Fifth Conference on Machine Translation

This paper presents Tencent’s submission to the WMT20 Quality Estimation (QE) Shared Task: Sentence-Level Post-editing Effort for English-Chinese in Task 2. Our system ensembles two architectures, XLM-based and Transformer-based Predictor-Estimator models. For the XLM-based Predictor-Estimator architecture, the predictor produces two types of contextualized token representations, i.e., masked XLM and non-masked XLM; the LSTM-estimator and Transformer-estimator employ two effective strategies, top-K and multi-head attention, to enhance the sentence feature representation. For the Transformer-based Predictor-Estimator architecture, we improve a top-performing model with three modifications: using multi-decoding in the machine translation module, creating a new model by replacing the transformer-based predictor with an XLM-based predictor, and finally integrating the two models by a weighted average. Our submission achieves a Pearson correlation of 0.664, ranking first (tied) on English-Chinese.
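The two quantitative notions in this abstract, combining two models by a weighted average and scoring predictions by Pearson correlation, can be sketched as follows. The function names, the `weight_a` parameter, and the toy numbers are assumptions for illustration only; the abstract does not state the actual ensemble weights or how they were chosen.

```python
from math import sqrt


def weighted_average(scores_a, scores_b, weight_a=0.5):
    """Blend two models' sentence-level scores; weight_a is a hypothetical
    mixing weight in [0, 1] for the first model."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have equal length")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data (made up): predicted post-editing effort from two models
# and the gold labels they are evaluated against.
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.20, 0.30, 0.50, 0.70]
gold = [0.15, 0.35, 0.45, 0.75]

ensemble = weighted_average(model_a, model_b, weight_a=0.5)
print(pearson(ensemble, gold))
```

In practice the mixing weight would typically be tuned on a held-out development set rather than fixed at 0.5.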