Truong-Son Hy

Also published as: Truong Son Hy

2025

pdf bib abs
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
Nghia Huynh Nguyen Hieu | Ngoc Son Nguyen | Huynh Nguyen Dang | Thieu Vo | Truong-Son Hy | Van Nguyen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text-to-speech (TTS) systems have seen significant advancements in recent years, driven by improvements in deep learning and neural network architectures. Viewing the output speech as a data distribution, previous approaches often employ traditional speech representations, such as waveforms or spectrograms, within the Flow Matching framework. However, these methods have limitations, including overlooking various speech attributes and incurring high computational costs due to additional constraints introduced during training. To address these challenges, we introduce OZSpeech, the first TTS method to explore optimal transport conditional flow matching with one-step sampling and a learned prior as the condition, effectively disregarding preceding states and reducing the number of sampling steps. Our approach operates on disentangled, factorized components of speech in token format, enabling accurate modeling of each speech attribute, which enhances the TTS system’s ability to precisely clone the prompt speech. Experimental results show that our method achieves promising performance over existing methods in content accuracy, naturalness, prosody generation, and speaker style preservation. Audio samples are available at our demo page https://ozspeech.github.io/OZSpeech_Web/.

pdf bib abs
MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder
Khai Le-Duc | Phuc Phan | Tan-Hanh Pham | Bach Phan Tat | Minh-Huong Ngo | Thanh Nguyen-Tang | Truong-Son Hy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology improves patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To our best knowledge, MultiMed stands as the world’s largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes reproducible empirical baselines, a monolinguality-multilinguality analysis, Attention Encoder Decoder (AED) vs Hybrid comparative study and a linguistic analysis. We present practical ASR end-to-end training schemes optimized for a fixed number of trainable parameters that are common in industry settings. All code, data, and models are available online.

Transparency in AI healthcare decision-makingis crucial. By incorporating rationales to explain reason for each predicted label, userscould understand Large Language Models(LLMs)’s reasoning to make better decision.In this work, we introduce a new task - Sentiment Reasoning - for both speech and textmodalities, and our proposed multimodal multitask framework and the world’s largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model predicts boththe sentiment label and generates the rationale behind it based on the input transcript.Our study conducted on both human transcriptsand Automatic Speech Recognition (ASR) transcripts shows that Sentiment Reasoning helpsimprove model transparency by providing rationale for model prediction with quality semantically comparable to humans while alsoimproving model’s classification performance(+2% increase in both accuracy and macro-F1) via rationale-augmented fine-tuning. Also,no significant difference in the semantic quality of generated rationales between human andASR transcripts. All code, data (five languages - Vietnamese, English, Chinese, German, andFrench) and models are published online.

pdf bib abs
SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
Tan-Hanh Pham | Le Hoang Nam | Phu-Vinh Nguyen | Chris Ngo | Truong-Son Hy
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Visual Language Models have demonstrated remarkable capabilities across various tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in natural human-machine interactions. Moreover, the quality of language models primarily depends on reasoning and prompting techniques, such as chain-of-thought, which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, an end-to-end multimodal model that leverages speech instructions for reasoning-based visual question answering. Additionally, we investigate reasoning techniques at different levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interactions by allowing users to provide verbal or text-based instructions. To this end, we introduce a new dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model’s ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interactions. To our knowledge, SilVar is the first open-source, speech-driven VLM. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.

pdf bib abs
Medical Spoken Named Entity Recognition
Khai Le-Duc | David Thulke | Hung-Phong Tran | Long Vo-Dang | Khai-Nguyen Nguyen | Truong-Son Hy | Ralf Schlüter
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Spoken Named Entity Recognition (NER) aims to extract named entities from speech and categorise them into types like person, location, organization, etc. In this work, we present *VietMed-NER* - the first spoken NER dataset in the medical domain. To our knowledge, our Vietnamese real-world dataset is the largest spoken NER dataset in the world regarding the number of entity types, featuring 18 distinct types. Furthermore, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence; and conduct quantitative and qualitative error analysis. We found that pre-trained multilingual models generally outperform monolingual models on reference text and ASR output and encoders outperform sequence-to-sequence models in NER tasks. By translating the transcripts, the dataset can also be utilised for text NER in the medical domain in other languages than Vietnamese. All code, data and models are publicly available.

2024

pdf bib abs
HierarchyNet: Learning to Summarize Source Code with Heterogeneous Representations
Minh Huynh Nguyen | Nghi D. Q. Bui | Truong Son Hy | Long Tran Thanh | Tien N. Nguyen
Findings of the Association for Computational Linguistics: EACL 2024

Code representation is important to machine learning models in the code-related applications. Existing code summarization approaches primarily leverage Abstract Syntax Trees (ASTs) and sequential information from source code to generate code summaries while often overlooking the critical consideration of the interplay of dependencies among code elements and code hierarchy. However, effective summarization necessitates a holistic analysis of code snippets from three distinct aspects: lexical, syntactic, and semantic information. In this paper, we propose a novel code summarization approach utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs adeptly capture essential code features at lexical, syntactic, and semantic levels within a hierarchical structure. HierarchyNet processes each layer of the HCR separately, employing a Heterogeneous Graph Transformer, a Tree-based CNN, and a Transformer Encoder. In addition, HierarchyNet demonstrates superior performance compared to fine-tuned pre-trained models, including CodeT5, and CodeBERT, as well as large language models that employ zero/few-shot settings, such as CodeLlama, StarCoder, and CodeGen. Implementation details can be found at https://github.com/FSoft-AI4Code/HierarchyNet.

2023

pdf bib abs
ViDeBERTa: A powerful pre-trained language model for Vietnamese
Cong Dao Tran | Nhut Huy Pham | Anh Tuan Nguyen | Truong Son Hy | Tu Vu
Findings of the Association for Computational Linguistics: EACL 2023

This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using DeBERTa architecture. Although many successful pre-trained language models based on Transformer have been widely proposed for the English language, there are still few pre-trained models for Vietnamese, a low-resource language, that perform good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks, Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa with far fewer parameters surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base with 86M parameters, which is only about 23% of PhoBERT_large with 370M parameters, still performs the same or better results than the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.

Co-authors

Venues

Fix author