Truong Bao Tran
2026
HCMUS_PrisonDilemma at AbjadAuthorID Shared Task: Less is More with Base Models
Trung Kiet Huynh | Duy Minh Dao Sy | Nguyen Chi Tran | Pham Phu Hoa | Nguyen Lam Phu Quy | Truong Bao Tran
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
We present our approach to the AbjadNLP 2026 Arabic Authorship Identification shared task, achieving 4th place. Our key finding is that AraBERT-base (110M) outperforms AraBERT-large (340M) on the test set, with a macro F1 of 0.8449 versus 0.8096, despite lower validation scores. We handle long passages via sliding-window chunking with mean pooling, and use a two-stage classification head with dual dropout for regularization. Per-class analysis reveals that translated works achieve perfect F1 while classical poets remain challenging due to shared formal structures. Our results challenge the "scale is all you need" assumption for stylometric tasks.
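The sliding-window chunking and mean pooling mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no window size, stride, or encoder details, so the function names, the `window` and `stride` parameters, and the toy vectors below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Split a token sequence into overlapping chunks of at most `window` tokens.

    `window` and `stride` are hypothetical parameters; the abstract does not
    state the values used in the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        # Stop once the current chunk reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average chunk-level vectors element-wise into one passage-level vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy usage: 10 token ids, window of 4, stride of 2.
chunks = sliding_windows(list(range(10)), window=4, stride=2)
# In a real pipeline each chunk would be encoded by the language model;
# here two made-up 2-dimensional chunk vectors stand in for encoder outputs.
passage_vector = mean_pool([[1.0, 3.0], [3.0, 5.0]])
```

In this sketch the pooled vector would then feed the classification head, so a passage of any length maps to a single fixed-size representation.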
2025
DRAGON: Dual-Encoder Retrieval with Guided Ontology Reasoning for Medical Normalization
Dao Sy Duy Minh | Nguyen Lam Phu Quy | Pham Phu Hoa | Tran Chi Nguyen | Huynh Trung Kiet | Truong Bao Tran
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association
Adverse Drug Event (ADE) normalization to standardized medical terminologies such as MedDRA presents significant challenges due to lexical and semantic gaps between colloquial user-generated content and formal medical vocabularies. This paper presents our submission to the ALTA 2025 Shared Task on ADE normalization, evaluated using Accuracy@k metrics. Our approach employs distinct methodologies for the development and test phases. In the development phase, we propose a three-stage neural architecture: (1) bi-encoder training to establish semantic representations, (2) lexical-aware fine-tuning to capture morphological patterns alongside semantic similarity, and (3) cross-encoder re-ranking for fine-grained discrimination, enabling the model to leverage both distributional semantics and lexical cues through explicit interaction modeling. For the test phase, we utilize the trained bi-encoder from stage (1) for efficient candidate retrieval, then adopt an alternative re-ranking pipeline leveraging large language models with tool-augmented retrieval and multi-stage reasoning. Specifically, a capable model performs reasoning-guided candidate selection over the retrieved top-k results, a lightweight model provides iterative feedback based on reasoning traces, and an automated verification module ensures output correctness with self-correction mechanisms. Our system achieves competitive performance on both development and test benchmarks, demonstrating the efficacy of neural retrieval-reranking architectures and the versatility of LLM-augmented neural pipelines for medical entity normalization tasks.
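The retrieval step and the Accuracy@k metric named in the abstract can be sketched as follows. This is an illustration under stated assumptions: the embeddings are toy 2-dimensional vectors rather than bi-encoder outputs, cosine similarity is assumed as the scoring function (the abstract does not say which is used), and `cosine`, `top_k`, and `accuracy_at_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k(query: Sequence[float], candidates: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k candidates most similar to the query vector."""
    ranked = sorted(candidates, key=lambda cid: cosine(query, candidates[cid]), reverse=True)
    return ranked[:k]


def accuracy_at_k(ranked_lists: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold id appears among the first k ranked ids."""
    if len(ranked_lists) != len(gold):
        raise ValueError("ranked_lists and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy term embeddings (made up); ids "A", "B", "C" stand in for terminology entries.
terms = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
retrieved = top_k([1.0, 0.1], terms, k=2)
score_at_1 = accuracy_at_k([retrieved, ["B", "A"]], gold=["C", "C"], k=1)
score_at_2 = accuracy_at_k([retrieved, ["B", "A"]], gold=["C", "C"], k=2)
```

A re-ranker, whether a cross-encoder or an LLM-based pipeline, would reorder the `retrieved` list before scoring. Accuracy@k with a larger k therefore bounds what re-ranking can achieve at k = 1.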