Phu Quy Nguyen Lam

Also published as: Phu Quy Nguyen Lam

2026

HCMUS_PrompterXPrompter at AbjadMed: When Classification Meets Retrieval: Taming the Long Tail in Arabic Medical Text Classification
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Dinh Ha Duong | Nguyen Chi Tran | Phu Quy Nguyen Lam | Hoa Pham Phu
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

Medical text classification is high-stakes work, yet models often falter precisely where they are needed most: on rare, critical conditions buried in the long tail of the data distribution. In the EACL 2026 ABJAD-NLP Shared Task, we confronted this challenge with a dataset of Arabic medical questions heavily skewed towards a few common topics, leaving dozens of categories with fewer than ten examples. We present HybridMed, a system that effectively tames this long tail by marrying the semantic generalization of a fine-tuned Arabic BERT model with the precise, instance-based memory of k-nearest neighbor retrieval. This complementary union allowed our system to achieve a macro-F1 score of 0.4902, demonstrating that for diverse and imbalanced medical data, the whole is indeed greater than the sum of its parts.

2025

pdf bib abs

Challenge Track: JHARNA-MT: A Copy-Augmented Hybrid of LoRA-Tuned NLLB and Lexical SMT with Minimum Bayes Risk Decoding for Low-Resource Indic Languages
Dao Sy Duy Minh | Trung Kiet Huynh | Tran Chi Nguyen | Phu Quy Nguyen Lam | Phu-Hoa Pham | Nguyễn Đình Hà Dương | Dien Dinh | Long HB Nguyen
Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)

This paper describes JHARNA-MT, our system for the MMLoSo 2025 Shared Task on translation between high-resource languages (Hindi, English) and four low-resource Indic tribal languages: Bhili, Gondi, Mundari, and Santali. The task poses significant challenges, including data sparsity, morphological richness, and structural divergence across language pairs. To address these, we propose a hybrid translation pipeline that integrates non-parametric retrieval, lexical statistical machine translation (SMT), and LoRA-tuned NLLB-200 neural machine translation under a unified Minimum Bayes Risk (MBR) decoding framework. Exact and fuzzy retrieval exploit redundancy in government and administrative texts, SMT with diagonal alignment priors and back-translation provides lexically faithful hypotheses, and the NLLB-LoRA component contributes fluent neural candidates. MBR decoding selects consensus translations using a metric-matched utility based on a weighted combination of BLEU and chrF, mitigating the complementary error modes of SMT and NMT. Our final system, further enhanced with script-aware digit normalization and entity-preserving post-processing, achieves a private leaderboard score of 186.37 and ranks 2nd overall in the shared task, with ablation studies confirming the contribution of each component.

Co-authors

Venues

Fix author