Lianxi Wang

2025

Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data
Zhuowei Chen | Bowei Zhang | Nankai Lin | Tian Hou | Lianxi Wang
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our code to support future research.

Pseudo-label Data Construction Method and Syntax-enhanced Model for Chinese Semantic Error Recognition
Hongyan Wu | Nankai Lin | Shengyi Jiang | Lianxi Wang | Aimin Yang
Proceedings of the 31st International Conference on Computational Linguistics

Chinese Semantic Error Recognition (CSER) has always been a weak link in Chinese language processing due to the complexity and obscureness of Chinese semantics. Existing research has gradually focused on leveraging pre-trained models to perform CSER. Although some researchers have attempted to integrate syntax information into the pre-trained language model, it requires training the models from scratch, which is time-consuming and laborious. Furthermore, despite the existence of datasets for CSER, the constrained size of these datasets impairs the performance of the models. Thus, in order to address the difficulty posed by a limited sample set and the need for annotating samples with semantic-level errors, we propose a Pseudo-label Data Construction method for CSER (PDC-CSER), generating pseudo-labels for augmented samples based on perplexity and model respectively, which overcomes the difficulty of constructing pseudo-label data containing semantic-level errors and ensures the quality of pseudo-labels. Moreover, we propose a CSER method with the Dependency Syntactic Attention mechanism (CSER-DSA) to explicitly infuse dependency syntactic information only in the fine-tuning stage, achieving robust performance, and simultaneously reducing substantial computing power and time cost. Results demonstrate that the pseudo-label technology PDC-CSER and the semantic error recognition method CSER-DSA surpass the existing models.

2024

Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings
Lianxi Wang | Yujia Tian | Zhuowei Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pretrained language models excel in various natural language processing tasks but often neglect the integration of different scripts within a language, constraining their ability to capture richer semantic information, such as in Hindi. In this work, we present a dual-script enhanced feature representation method for Hindi. We combine single-script features from Devanagari and Romanized Hindi Roberta using concatenation, addition, cross-attention, and convolutional networks. The experiment results show that using a dual-script approach significantly improves model performance across various tasks. The addition fusion technique excels in sequence generation tasks, while for text classification, the CNN-based dual-script enhanced representation performs best with longer sentences, and the addition fusion technique is more effective for shorter sequences. Our approach shows significant advantages in multiple natural language processing tasks, providing a new perspective on feature representation for Hindi. Our code has been released on https://github.com/JohnnyChanV/Hindi-Fusion.

An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Zhuowei Chen | Lianxi Wang | Yuben Wu | Xinfeng Liao | Yujia Tian | Junyang Zhong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework’s modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.

2023

A Distantly-Supervised Relation Extraction Method Based on Selective Gate and Noise Correction
Zhuowei Chen | Yujia Tian | Lianxi Wang | Shengyi Jiang
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

Entity relation extraction, as a core task of information extraction, aims to predict the relation of entity pairs identified by text, and its research results are applied to various fields. To address the problem that current distantly supervised relation extraction (DSRE) methods based on large-scale corpus annotation generate a large amount of noisy data, a DSRE method that incorporates selective gate and noise correction framework is proposed. The selective gate is used to reasonably select the sentence features in the sentence bag, while the noise correction is used to correct the labels of small classes of samples that are misclassified into large classes during the model training process, to reduce the negative impact of noisy data on relation extraction. The results on the English datasets clearly demonstrate that our proposed method outperforms other baseline models. Moreover, the experimental results on the Chinese dataset indicate that our method surpasses other models, providing further evidence that our proposed method is both robust and effective.

2022

Transliteration is an important task in natural language processing (NLP) which aims to convert a name in the source language to the target language without changing its pronunciation. Particularly, transliteration from English to Arabic is highly needed in many applications, especially in countries (e.g., United Arab Emirates (UAE)) where most citizens are foreigners but the official language is Arabic. In such a task-oriented scenario, namely transliterating the English names to the corresponding Arabic ones, the performance of the transliteration model is highly important. However, most existing neural approaches mainly apply a universal transliteration model with advanced encoders and decoders to the task, where limited attention is paid to leveraging the phonemic association between English and Arabic to further improve model performance. In this paper, we focus on transliteration of people’s names from English to Arabic for the general public. In doing so, we collect a corpus named EANames by extracting high quality name pairs from online resources which better represent the names in the general public than linked Wikipedia entries (that are always names of famous people). We propose a model for English-Arabic transliteration, where a memory module modeling the phonemic association between English and Arabic is used to guide the transliteration process. We run experiments on the collected data and the results demonstrate the effectiveness of our approach for English-Arabic transliteration.
