Yujia Tian

2024

An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Zhuowei Chen | Lianxi Wang | Yuben Wu | Xinfeng Liao | Yujia Tian | Junyang Zhong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with a language model. In the context of SC, strongly emotional tokens can critically affect the sentiment of the whole sequence. Therefore, rather than rephrasing less important context, we propose DiffusionCLS, which leverages a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strongly label-related tokens. This approach balances consistency and diversity, avoiding the introduction of noise while augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios, including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework’s modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.

Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings
Lianxi Wang | Yujia Tian | Zhuowei Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pretrained language models excel in various natural language processing tasks but often neglect the integration of different scripts within a language, as in Hindi, which constrains their ability to capture richer semantic information. In this work, we present a dual-script enhanced feature representation method for Hindi. We combine single-script features from Devanagari and Romanized Hindi RoBERTa using concatenation, addition, cross-attention, and convolutional networks. Experimental results show that the dual-script approach significantly improves model performance across various tasks. The addition fusion technique excels in sequence generation tasks. For text classification, the CNN-based dual-script enhanced representation performs best on longer sentences, while the addition fusion technique is more effective for shorter sequences. Our approach shows significant advantages in multiple natural language processing tasks, providing a new perspective on feature representation for Hindi. Our code has been released at https://github.com/JohnnyChanV/Hindi-Fusion.

Effect of Rap Music Context on Lexical Tone Normalization
Yujia Tian | Yanyuan Ye | Mingxi Lu | Fanlu Jia | Ran Tao
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Co-authors

Ran Tao 1

Yuben Wu 1

Yanyuan Ye 1

Junyang Zhong 1
