2024
SPZ: A Semantic Perturbation-based Data Augmentation Method with Zonal-Mixing for Alzheimer’s Disease Detection
FangFang Li | Cheng Huang | PuZhen Su | Jie Yin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Alzheimer’s Disease (AD), characterized by significant cognitive and functional impairment, necessitates the development of early detection techniques. Traditional diagnostic practices, such as cognitive assessments and biomarker analysis, are often invasive and costly. Deep learning-based approaches for non-invasive AD detection have been explored in recent studies, but the lack of accessible data hinders further improvements in detection performance. To address these challenges, we propose a novel semantic perturbation-based data augmentation method that essentially differs from existing techniques, which primarily rely on explicit data engineering. Our approach generates controlled semantic perturbations to enhance textual representations, aiding the model in identifying AD-specific linguistic patterns, particularly in scenarios with limited data availability. It learns contextual information and dynamically adjusts the perturbation degree for different linguistic features. This enhances the model’s sensitivity to AD-specific linguistic features and its robustness against natural language noise. Experimental results on the ADReSS challenge dataset demonstrate that our approach outperforms other strong and competitive deep learning methods.
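A minimal sketch of the kind of controlled semantic perturbation the abstract describes: token embeddings are perturbed by noise whose per-token magnitude is predicted from context. This is an illustrative PyTorch sketch under assumed shapes, not the authors' SPZ implementation.

import torch
import torch.nn as nn

class SemanticPerturbation(nn.Module):
    def __init__(self, hidden_size: int, base_scale: float = 0.1):
        super().__init__()
        # learns, from each token's contextual embedding, how strongly it may be perturbed
        self.gate = nn.Linear(hidden_size, 1)
        self.base_scale = base_scale

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden)
        degree = torch.sigmoid(self.gate(token_embeddings))   # per-token perturbation degree, (batch, seq_len, 1)
        noise = torch.randn_like(token_embeddings) * self.base_scale
        return token_embeddings + degree * noise               # augmented view of the input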
2023
Towards Better Representations for Multi-Label Text Classification with Multi-granularity Information
Fangfang Li | Puzhen Su | Junwen Duan | Weidong Xiao
Findings of the Association for Computational Linguistics: EMNLP 2023
Multi-label text classification (MLTC) aims to assign multiple labels to a given text. Previous works have focused on text representation learning and label-correlation modeling using pre-trained language models (PLMs). However, studies have shown that PLMs generate word-frequency-oriented text representations, causing texts with different labels to be closely distributed in a narrow region, which is difficult to classify. To address this, we present a novel framework CL (Contrastive Learning)-MIL (Multi-granularity Information Learning) to refine text representations for the MLTC task. We first use contrastive learning to generate uniform initial text representations and incorporate label frequency implicitly. Then, we design a multi-task learning module that integrates multi-granularity information (diverse text-label correlations, label-label relations, and label frequency) into the text representations, enhancing their discriminative ability. Experimental results demonstrate the complementarity of the modules in CL-MIL, which improves the quality of text representations and yields stable and competitive improvements for MLTC.
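A minimal sketch of the contrastive step described above, assuming an InfoNCE-style objective over two encoded views of the same texts (a common formulation; not the authors' released CL-MIL code).

import torch
import torch.nn.functional as F

def contrastive_loss(view1: torch.Tensor, view2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # view1, view2: (batch, hidden) sentence embeddings of the same texts under two encodings
    z1 = F.normalize(view1, dim=-1)
    z2 = F.normalize(view2, dim=-1)
    sim = z1 @ z2.t() / temperature                      # (batch, batch) cosine-similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)                  # diagonal pairs are the positives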
2022
WSpeller: Robust Word Segmentation for Enhancing Chinese Spelling Check
Fangfang Li | Youran Shan | Junwen Duan | Xingliang Mao | Minlie Huang
Findings of the Association for Computational Linguistics: EMNLP 2022
Chinese spelling check (CSC) detects and corrects spelling errors in Chinese texts. Previous approaches have combined character-level phonetic and graphic information while ignoring the importance of segment-level information. According to our pilot study, spelling errors are consistently associated with incorrect word segmentation, and CSC performance is greatly enhanced when appropriate word boundaries are provided. Based on these findings, we present WSpeller, a CSC model that takes word segmentation into account. A fundamental component of WSpeller is W-MLM, which is trained by predicting visually and phonetically similar words; word segmentation information is incorporated by modifying the input to the embedding layer. Additionally, a robust module is trained to assist the W-MLM-based correction module by predicting the correct word segmentations from sentences containing spelling errors. We evaluate WSpeller on the widely used benchmark datasets SIGHAN13, SIGHAN14, and SIGHAN15. Our model outperforms state-of-the-art baselines on SIGHAN13 and SIGHAN15 and performs on par with them on SIGHAN14.
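A minimal sketch of how segmentation information could enter through the embedding layer, as the abstract describes: segment-position tags (e.g. B/M/E/S per character) get their own embedding that is added to the character embedding. Shapes and the tag scheme are assumptions for illustration, not the released WSpeller code.

import torch
import torch.nn as nn

class SegmentAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, num_seg_tags: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        # hypothetical B/M/E/S tags marking each character's position within its word
        self.seg_emb = nn.Embedding(num_seg_tags, hidden_size)

    def forward(self, token_ids: torch.Tensor, seg_tags: torch.Tensor) -> torch.Tensor:
        # token_ids, seg_tags: (batch, seq_len)
        return self.token_emb(token_ids) + self.seg_emb(seg_tags)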
2020
WAE_RN: Integrating Wasserstein Autoencoder and Relational Network for Text Sequence
Xinxin Zhang | Xiaoming Liu | Guan Yang | Fangfang Li
Proceedings of the 19th Chinese National Conference on Computational Linguistics
One challenge in Natural Language Processing (NLP) is learning semantic representations in different contexts. Recent work on pre-trained language models has received great attention and has proven to be an effective technique. Despite the success of pre-trained language models in many NLP tasks, the learned text representations only contain the correlations among the words in the sentence itself and ignore the implicit relationships between arbitrary tokens in the sequence. To address this problem, we focus on making our model effectively learn word representations that contain the relational information between any tokens of a text sequence. In this paper, we propose to integrate a relational network (RN) into a Wasserstein autoencoder (WAE). Specifically, the WAE and the RN are used to better preserve the semantic structure and to capture the relational information, respectively. Extensive experiments demonstrate that our proposed model achieves significant improvements over traditional Seq2Seq baselines.
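A minimal sketch of a relational-network layer over token representations, in the spirit of the RN component described above: a small MLP scores every ordered token pair and the results are aggregated per sequence. Shapes and sizes are assumptions; this is not the authors' WAE_RN implementation.

import torch
import torch.nn as nn

class RelationalLayer(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # g(.) processes a concatenated pair of token representations
        self.g = nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), nn.ReLU())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden)
        b, n, h = tokens.shape
        left = tokens.unsqueeze(2).expand(b, n, n, h)    # token i, broadcast over j
        right = tokens.unsqueeze(1).expand(b, n, n, h)   # token j, broadcast over i
        pairs = torch.cat([left, right], dim=-1)         # every ordered pair (i, j)
        relations = self.g(pairs)                        # (batch, n, n, hidden)
        return relations.sum(dim=(1, 2))                 # aggregated relational vector per sequence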