Trung Kiet Huynh
2026
HCMUS_PrisonDilemma at AbjadAuthorID Shared Task: Less is More with Base Models
Trung Kiet Huynh | Duy Minh Dao Sy | Nguyen Chi Tran | Pham Phu Hoa | Nguyen Lam Phu Quy | Truong Bao Tran
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
We present our approach to the AbjadNLP 2026 Arabic Authorship Identification shared task, achieving 4th place. Our key finding is that AraBERT-base (110M) outperforms AraBERT-large (340M) on the test set with macro F1 of 0.8449 versus 0.8096, despite lower validation scores. We handle long passages via sliding window chunking with mean pooling, and use a two-stage classification head with dual dropout for regularization. Per-class analysis reveals that translated works achieve perfect F1 while classical poets remain challenging due to shared formal structures. Our results challenge the "scale is all you need" assumption for stylometric tasks.
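As a rough illustration of the pipeline this abstract describes, here is a minimal PyTorch sketch of sliding-window chunking with mean pooling over chunk embeddings, feeding a two-stage head with dual dropout. The checkpoint name, window size, stride, dropout rate, and pad-token assumption are all illustrative, not the authors' reported settings.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ChunkedAuthorClassifier(nn.Module):
    """Sliding-window chunking + mean pooling + two-stage dual-dropout head.
    A sketch under assumed hyperparameters, not the authors' exact system."""

    def __init__(self, model_name="aubmindlab/bert-base-arabertv2",
                 num_authors=10, chunk_len=512, stride=256, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.chunk_len, self.stride = chunk_len, stride
        hidden = self.encoder.config.hidden_size
        # Two-stage classification head with dual dropout for regularization.
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden // 2, num_authors),
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """input_ids: (seq_len,) token ids for one long passage."""
        # Slide a fixed-size window over the passage with overlap.
        chunks = []
        for start in range(0, input_ids.size(0), self.stride):
            chunks.append(input_ids[start:start + self.chunk_len])
            if start + self.chunk_len >= input_ids.size(0):
                break
        # Pad chunks to a uniform length and batch-encode them
        # (assumes pad token id 0, as in standard BERT vocabularies).
        batch = nn.utils.rnn.pad_sequence(chunks, batch_first=True)
        mask = (batch != 0).long()
        cls = self.encoder(input_ids=batch,
                           attention_mask=mask).last_hidden_state[:, 0]
        # Mean-pool the per-chunk [CLS] embeddings into one passage vector.
        return self.head(cls.mean(dim=0, keepdim=True))
```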
HCMUS_PrompterXPrompter at AbjadMed: When Classification Meets Retrieval: Taming the Long Tail in Arabic Medical Text Classification
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Dinh Ha Duong | Nguyen Chi Tran | Phu Quy Nguyen Lam | Hoa Pham Phu
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Medical text classification is high-stakes work, yet models often falter precisely where they are needed most: on rare, critical conditions buried in the long tail of the data distribution. In the EACL 2026 ABJAD-NLP Shared Task, we confronted this challenge with a dataset of Arabic medical questions heavily skewed towards a few common topics, leaving dozens of categories with fewer than ten examples. We present HybridMed, a system that effectively tames this long tail by marrying the semantic generalization of a fine-tuned Arabic BERT model with the precise, instance-based memory of k-nearest neighbor retrieval. This complementary union allowed our system to achieve a macro-F1 score of 0.4902, demonstrating that for diverse and imbalanced medical data, the whole is indeed greater than the sum of its parts.
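The classifier-plus-retrieval interpolation at the heart of this design can be sketched as follows; `lam`, `k`, and the temperature are illustrative hyperparameters, and the cosine-similarity kNN over stored embeddings is a generic stand-in for whatever index HybridMed actually uses.

```python
import numpy as np

def hybrid_predict(probs_bert, query_emb, train_embs, train_labels,
                   num_classes, k=8, lam=0.5, temp=10.0):
    """Interpolate a fine-tuned classifier's softmax distribution with a
    kNN distribution over retrieved training examples (illustrative).
    train_embs: (N, d) array; train_labels: (N,) int array."""
    # Cosine similarity between the query and all training embeddings.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    topk = np.argsort(-sims)[:k]
    # Softmax-weighted vote over the retrieved neighbours' labels.
    w = np.exp(temp * sims[topk])
    w /= w.sum()
    probs_knn = np.zeros(num_classes)
    for weight, label in zip(w, train_labels[topk]):
        probs_knn[label] += weight
    # Convex combination: instance-based memory supports rare classes
    # that the parametric classifier rarely saw during fine-tuning.
    return lam * probs_bert + (1 - lam) * probs_knn
```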
HCMUS_The Fangs at AbjadStyleTransfer Shared Task: Learning to Query Style, Contrastive Representations for Zero-Shot Arabic Authorship Style Transfer
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Chi Tran | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
This paper describes the system developed by team HCMUS_The Fangs for the AbjadStyleTransfer shared task (ArabicNLP 2026), where we achieved 1st place. We present a contrastive style learning approach for zero-shot Arabic authorship style transfer. Our key discovery is that the 21 test authors, including Nobel laureate Naguib Mahfouz and literary pioneer Taha Hussein, have zero overlap with the 32,784 training authors, transforming this into a pure zero-shot challenge. This insight led us to develop a dual-encoder architecture that learns transferable style representations through contrastive objectives, rather than memorizing author-specific patterns. Our system achieves 19.77 BLEU and 55.74 chrF, outperforming retrieval-augmented generation (+18%) and multi-task learning (+31%). Counter-intuitively, we find that sophisticated architectural modifications like style injection consistently degrade performance, while simpler approaches that preserve pre-trained knowledge excel. Our analysis reveals that for famous authors, pre-trained Arabic language models already encode substantial stylistic knowledge; the key is surfacing it, not learning from scratch.
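A minimal sketch of the kind of contrastive objective such a dual encoder could be trained with, assuming in-batch negatives and a symmetric InfoNCE loss over pairs of texts by the same author; this is a generic formulation, not the team's exact loss.

```python
import torch
import torch.nn.functional as F

def style_contrastive_loss(anchor_emb, positive_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (anchor, positive) style pairs:
    texts by the same author attract, texts by other authors repel.
    anchor_emb, positive_emb: (batch, dim) encoder outputs."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature          # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # In-batch negatives: every non-matching pair is a negative example.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```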
HCMUS_TheFangs at AbjadGenEval Shared Task: Weighted Layer Pooling with Attention Fusion for Arabic AI-Generated Text Detection
Duy Minh Dao Sy | Nguyen Chi Tran | Trung Kiet Huynh | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
The rapid advancement of large language models poses significant challenges for content authenticity, particularly in under-resourced languages where detection tools remain scarce. We present our winning system for the AbjadGenEval shared task on Arabic AI-generated text detection. Our key insight is that AI-generated text exhibits distinctive patterns across multiple linguistic levels, from local syntax to global semantics, that can be captured by learning to fuse representations from different transformer layers. We introduce a Weighted Layer Pooling mechanism that learns optimal layer combinations, combined with Attention Pooling for sequence-level context aggregation. Through systematic experimentation with 15+ approaches, we make a surprising discovery: model architecture selection dominates over sophisticated training techniques, with DeBERTa-v3 providing +27% relative improvement over AraBERT regardless of training strategy. Our system achieves 0.93 F1-score, securing 1st place among all participants and outperforming the runner-up by 3 absolute points.
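The two pooling mechanisms named in the abstract can be sketched roughly as below, assuming the backbone (e.g. DeBERTa-v3) is called with `output_hidden_states=True` and its `hidden_states` tuple is passed in; the module name and the linear token scorer are illustrative choices.

```python
import torch
import torch.nn as nn

class WeightedLayerAttentionPooler(nn.Module):
    """Weighted Layer Pooling + Attention Pooling (illustrative sketch):
    learn a softmax-weighted mix of all encoder layers, then aggregate
    tokens with a learned attention score per token."""

    def __init__(self, num_layers: int, hidden: int):
        super().__init__()
        # One learnable scalar weight per encoder layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Attention pooling: a learned head scores each token position.
        self.token_scorer = nn.Linear(hidden, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of (batch, seq, hidden), one per layer.
        stacked = torch.stack(hidden_states, dim=0)         # (L, B, S, H)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w[:, None, None, None] * stacked).sum(0)   # (B, S, H)
        # Mask padding tokens before the token-level softmax.
        scores = self.token_scorer(fused).squeeze(-1)       # (B, S)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (B, S, 1)
        return (attn * fused).sum(dim=1)                    # (B, H)
```

A classification head on the pooled vector completes the detector; the layer weights let the model emphasize whichever depths carry the most discriminative local-syntax or global-semantics signal.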
2025
Challenge Track: JHARNA-MT: A Copy-Augmented Hybrid of LoRA-Tuned NLLB and Lexical SMT with Minimum Bayes Risk Decoding for Low-Resource Indic Languages
Dao Sy Duy Minh | Trung Kiet Huynh | Tran Chi Nguyen | Phu Quy Nguyen Lam | Phu-Hoa Pham | Nguyễn Đình Hà Dương | Dien Dinh | Long HB Nguyen
Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025)
This paper describes JHARNA-MT, our system for the MMLoSo 2025 Shared Task on translation between high-resource languages (Hindi, English) and four low-resource Indic tribal languages: Bhili, Gondi, Mundari, and Santali. The task poses significant challenges, including data sparsity, morphological richness, and structural divergence across language pairs. To address these, we propose a hybrid translation pipeline that integrates non-parametric retrieval, lexical statistical machine translation (SMT), and LoRA-tuned NLLB-200 neural machine translation under a unified Minimum Bayes Risk (MBR) decoding framework. Exact and fuzzy retrieval exploit redundancy in government and administrative texts, SMT with diagonal alignment priors and back-translation provides lexically faithful hypotheses, and the NLLB-LoRA component contributes fluent neural candidates. MBR decoding selects consensus translations using a metric-matched utility based on a weighted combination of BLEU and chrF, mitigating the complementary error modes of SMT and NMT. Our final system, further enhanced with script-aware digit normalization and entity-preserving post-processing, achieves a private leaderboard score of 186.37 and ranks 2nd overall in the shared task, with ablation studies confirming the contribution of each component.
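The MBR consensus step can be illustrated with sacrebleu's sentence-level metrics: each pooled hypothesis (from retrieval, SMT, and NLLB-LoRA) is scored by its average utility against all the others, and the highest-consensus candidate wins. The `alpha` weight between BLEU and chrF is an assumption for the sketch, not the system's tuned value.

```python
from sacrebleu import sentence_bleu, sentence_chrf

def mbr_select(candidates, alpha=0.5):
    """Minimum Bayes Risk decoding over a pooled candidate list.
    Utility is a weighted mix of sentence BLEU and chrF; the candidate
    with the highest expected utility against its peers is returned."""
    def utility(hyp, ref):
        bleu = sentence_bleu(hyp, [ref]).score
        chrf = sentence_chrf(hyp, [ref]).score
        return alpha * bleu + (1 - alpha) * chrf

    best, best_score = None, float("-inf")
    for hyp in candidates:
        # Expected utility: mean agreement with every other hypothesis,
        # so SMT and NMT error modes vote each other down.
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        score /= max(1, len(candidates) - 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

# Usage: mbr_select(["hypothesis from retrieval", "hypothesis from SMT",
#                    "hypothesis from NLLB-LoRA"])
```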