Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Kai-Wei Chang | Ke-Han Lu | Chih-Kai Yang | Zhi-Rui Tam | Wen-Yu Chang | Chung-Che Wang
Training a Chinese Listenability Model Using Word2Vec to Predict the Difficulty of Spoken Texts
Yen-Hsiang Chien | Hou-Chiang Tseng | Kuan-Yu Chen | Yao-Ting Sung
With the proliferation of digital learning, an increasing number of learners are engaging with audio-visual materials. For preschool and lower elementary students, whose literacy skills are still limited, knowledge acquisition relies more heavily on spoken and visual content. Traditional readability models were primarily developed for written texts, and their applicability to spoken materials remains uncertain. To address this issue, this study investigates the impact of different word segmentation tools and language models on the performance of automatic grade classification models for Chinese spoken materials. Support Vector Machines were employed for grade prediction, aiming to automatically determine the appropriate grade level of learning resources and assist learners in selecting suitable materials. The results show that language models with higher-dimensional word embeddings achieved better classification performance, with an accuracy of up to 61% and an adjacent accuracy of 76%. These findings may contribute to future digital learning platforms or educational resource recommendation systems by automatically providing students with appropriate listening materials to enhance learning outcomes.
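The two metrics reported above can be sketched as follows. The abstract does not define adjacent accuracy, so this example assumes the definition common in readability research: a prediction counts as correct when it falls within one grade level of the label. The function names and toy grades are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of texts whose predicted grade equals the labelled grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def adjacent_accuracy(y_true, y_pred, tolerance=1):
    """Fraction of texts predicted within `tolerance` grades of the label.

    Assumption: "adjacent" means an absolute grade difference of at most 1.
    """
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy grade labels and predictions (not data from the paper).
grades_true = [1, 2, 3, 4, 5, 6]
grades_pred = [1, 3, 3, 6, 5, 4]

print(accuracy(grades_true, grades_pred))
print(adjacent_accuracy(grades_true, grades_pred))
```

Adjacent accuracy is always at least as high as exact accuracy, which is consistent with the gap between the 61% and 76% figures reported in the abstract.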
Cubicpower Agentic Mixture of Experts(AMoE) Framework for Fine-Tuning NLP Tasks Without GPUs
Chao-Yih Hsia
The rise of Green AI emphasizes minimizing the environmental footprint of AI systems. This paper explores a no-GPU agentic architecture for fine-tuning NLP tasks. It presents our initial experiments applying these no-GPU algorithms in pretraining and fine-tuning tasks on our CubicPower agentic mixture of experts (AMoE) framework, with the aim of contributing to more sustainable AI development. In contrast to the training procedures of neural networks, which consume significant power, the AMoE framework’s primary contribution toward power savings is that it requires no training process. We explore non-neural-network methods for solving NLP tasks and employ similarity measures to match predefined patterns for use in a RAG database.
Design and Evaluation of a Courtroom Examination AI Simulation System with Behavioral Fidelity
Hsien-Jyh Liao
We present a courtroom examination AI simulation system centered on Behavioral Fidelity, with speech interaction included as a design feature to enhance immersion. For standardization and reproducibility, the present pilot evaluation uses transcripts. The system integrates pragmatic–psychological rules with Taiwanese criminal case files to simulate witness behavior under cross-examination pressure. We conduct a pilot study using an optimized Expert Turing Test framework with four dimensions: professional accuracy, situational adaptability, human-likeness, and logical consistency. Under identical prompts and knowledge sources, the customized GPT condition received higher ratings than GPT-Vanilla on adaptability and human-likeness. Applying the same framework to another mainstream model (Gemini 2.5 Flash) yielded comparable performance, although differences remain inconclusive at this sample size. Overall, the results provide preliminary evidence that Behavioral Fidelity is a feasible evaluation target and indicate the scalability of generative AI for legal training; speech-condition evaluation and multi-case, multi-role extensions are left for future work.
Multimodal Approaches for Stress Recognition: A Comparative Study Using the StressID Dataset
Chia-Yun Lee | Matúš Pleva | Daniel Hladek | Ming-Hsiang Su
Mental health concerns have garnered increasing attention, highlighting the importance of timely and accurate identification of individual stress states as a critical research domain. This study employs the multimodal StressID dataset to evaluate the contributions of three modalities—physiological signals, video, and audio—in stress recognition tasks. A set of machine learning models, including Random Forests (RF), Support Vector Machines (SVM), Multi-Layer Perceptrons (MLP), and K-Nearest Neighbors (KNN), were trained and tested with optimized parameters for each modality. In addition, the effectiveness of different multimodal fusion strategies was systematically examined. The unimodal experiments revealed that the physiological modality achieved the highest performance in the binary stress classification task (F1-score = 0.751), whereas the audio modality outperformed the others in the three-class classification task (F1-score = 0.625). In the multimodal setting, feature-level fusion yielded stable improvements in the binary classification task, while decision-level fusion achieved superior performance in the three-class classification task (F1-score = 0.65). These findings demonstrate that multimodal integration can substantially enhance the accuracy of stress recognition. Future research directions include incorporating temporal modeling and addressing data imbalance to further improve the robustness and applicability of stress recognition systems.
Beyond Binary: Enhancing Misinformation Detection with Nuance-Controlled Event Context
Elijah Frederick Albertson | Retnani Latifah | Yi-Shin Chen
Misinformation rarely presents itself as entirely true or entirely false. Instead, it often embeds partial truths within misleading contexts, creating narratives that blur the boundary between fact and falsehood. Traditional binary fact-checking frameworks fail to capture this nuance, forcing complex claims into oversimplified categories. To address this gap, we introduce MEGA, a multidimensional graph framework designed to classify ambiguous claims, with a particular focus on those labelled Somewhat True. MEGA integrates event evidence, spatio-temporal metadata, and a quantifiable nuance score. Its Event Candidate Extraction (ECE) module identifies supporting or contradicting evidence, while the Nuance Control Module (NCM) injects or removes nuance to assess its effect on classification. Experiments show that nuance is both detectable and learnable: adding nuance improves borderline discrimination, while stripping it pushes decisions toward false extremes and conceals partial truth. Our top model (nuance-injected, without score weighting) improves accuracy and F1 score by 15 and 16 points over the claims-only baseline, and by 6 and 9 points over the ECE-only variant. These results show that explicitly modeling nuance alongside context is crucial for classifying mixed-truth claims and advancing fact-checking beyond binary judgments.
A Preliminary Study of RAG for Taiwanese Historical Archives
Claire Lin | Bo-Han Feng | Xuanjun Chen | Te-Lun Yang | Hung-Yi Lee | Jyh-Shing Roger Jang
Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
Bridging Underspecified Queries and Multimodal Retrieval: A Two-Stage Query Rewriting Approach
Szu-Ting Liu | Wen-Yu Cho | Hsin-Wei Wang | Berlin Chen
Retrieval-Augmented Generation (RAG) has proven effective for text-only question answering, yet expanding it to visually rich documents remains a challenge. Existing multimodal benchmarks are often derived from visual question answering (VQA) datasets or from query-image pairs generated by large vision-language models (LVLMs), and they often contain underspecified questions that assume direct image access. To mitigate this issue, we propose a two-stage query rewriting framework that first generates OCR-based image descriptions and then reformulates queries into precise, retrieval-friendly forms under explicit constraints. Experiments show consistent improvements across dense, hybrid, and multimodal retrieval paradigms, with the most pronounced gains in visual document retrieval: Hits@1 rises from 21.0% to 56.6% with VDocRetriever and further to 79.3% when OCR-based descriptions are incorporated. These results indicate that query rewriting, particularly when combined with multimodal fusion, provides a reliable and scalable solution to bridge underspecified queries and improve retrieval over visually rich documents.
The Study of a Traffic Accident Information Collection Agent System Based on Fine-tuned Open-Source Large Language Models
Jo-Chi Kung | Chia-Hui Chang
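This study proposes a system architecture called the Collision Care Guide (CCG), a traffic accident information collection agent focused on structured information gathering in the initial stage after an accident. CCG integrates three modules: question generation, information extraction, and accident reconstruction. Through multi-turn dialogue, it guides users to describe accident details and converts them into a structured data format (TARF), while also generating a readable narrative for verification. To meet requirements for cost-effectiveness, privacy protection, and deployment flexibility, this study compares open-source Llama models (3B/8B parameters, with full fine-tuning and 4-bit PEFT) against the commercial baseline GPT-4o-mini. The results show that the information extraction module achieves field accuracy above 0.94 and a JSON semantic similarity of 0.995, while the question generation module achieves semantic similarity between 0.85 and 0.88 with more concise question phrasing. In LLM-based evaluation of dialogue quality and information extraction, the fine-tuned models scored 4 or higher (out of 5) on both, within 0.5 points of the commercial baseline. The study confirms that fine-tuned open-source models can approach the performance of commercial models, and that quantized versions offer strong performance and deployment potential in resource-constrained scenarios. CCG’s design fills a technical gap in interactive information collection during the initial stage of an accident, providing an efficient and cost-effective solution for traffic accident handling.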
Automatic Generation of Corpus-Based Exercises Using Generative AI
Adrian Jan Zasina
This study explores the automatic generation of corpus-based language exercises using generative AI models. We focus on the interaction between language models and corpus data, detailing a workflow in which lexical and syntactic patterns are extracted from a tagged corpus and structured prompts are constructed to guide the model in producing sentence-level exercises. The generated exercises reveal the potential of AI-driven approaches; however, our observations highlight the necessity of careful design and critical evaluation when integrating generative models with corpus-based language materials. By analysing these processes from a computational linguistics perspective, this study contributes to understanding how generative AI can interact with structured linguistic data, informing future applications in automated language resources.
Diversity is the Key: Enhancing LLM-based Post-processing for Automated Audio Captioning
Seyed Ali Farokh | Mohammad Mehdi Homayounpour | Ahmad Nickabadi
Automated Audio Captioning (AAC) is a multimodal task aimed at generating natural language descriptions of audio content. Previous studies have shown that LLMs can improve AAC performance by summarizing audio events based on a list of candidate captions, which are selected by an external reranker from those generated using Nucleus Sampling. However, the reranking process often selects overly similar captions, disregarding the original diversity of the sampled captions. In this work, we show that this diversity reflects the AAC model’s level of certainty and propose a lightweight candidate selection approach that preserves the initial diversity of the generated captions. This, in turn, enables an LLM to summarize the captions while considering the AAC model’s certainty in a few-shot setting. Experimental results demonstrate that our method outperforms previous post-processing techniques while being significantly faster.
Memory-Efficient Training for Text-Dependent SV with Independent Pre-trained Models
Seyed Ali Farokh | Hossein Zeinali
This paper presents our submission to the Iranian division of the Text-Dependent Speaker Verification Challenge (TdSV) 2024. Conventional TdSV approaches typically jointly model speaker and linguistic features, requiring unsegmented inputs during training and incurring high computational costs. Additionally, these methods often fine-tune large-scale pre-trained speaker embedding models on the target domain dataset, which may compromise the pre-trained models’ original ability to capture speaker-specific characteristics. To overcome these limitations, we employ a TdSV system that utilizes two pre-trained models independently and demonstrate that, by leveraging pre-trained models with targeted domain adaptation, competitive results can be achieved while avoiding the substantial computational costs associated with joint fine-tuning on unsegmented inputs in conventional approaches. Our best system reached a MinDCF of 0.0358 on the evaluation subset and secured first place in the challenge.
Information-theoretic conditioning in terminological alternations in specialized domains: The cases of Taiwan Mandarin legal language and English biomedical language
Po-Hsuan Huang | Hsuan-Lei Shao
This study examines how information-theoretic correlates, specifically contextual surprisal, condition terminological alternations in specialized domains, where both domain-specific and general terms express similar concepts. Specifically, two competing theories exist. The Uniform Information Density (UID) theory proposes that the speaker would avoid abrupt information rate changes. This predicts the use of more specific variants when the surprisals are higher. Conversely, availability-based production suggests the use of more readily-accessible items with higher surprisals. This study examines the dynamics between these two potential mechanisms in the terminological use in specialized domains. Specifically, we argue that, in specialized language, due to the higher frequency of domain-specific terms, both accounts predict the use of specific items in higher-surprisal contexts. The cases of Taiwan Mandarin legal language and English biomedical language were, therefore, examined. Crucially, a current popular method for probability estimation is through large language models (LLMs). The linguistic distribution in specialized domains, however, may deviate from the general linguistic distribution on which the LLMs are trained. Thus, we propose a novel semantics-based method of estimating the token probability distribution in a given corpus that avoids the potentially different linguistic distribution and the issue of word segmentation. As expected, results indicated a positive correlation between a variable’s surprisal and the use of domain-specific variants in both cases. This supports UID-based production, and arguably also availability-based production, since more specific and frequent variants are preferred in high-surprisal contexts. Specifically, our semantics-based probability estimation outperformed LLM-based estimation and the baseline in both cases. This suggests the feasibility of semantics-based probability estimation in specialized domains.
Voice Spoofing Detection via Speech Rule Generation Using wav2vec 2.0-Based Attention
Qian-Bei Hong | Yu-Chen Gao | Yu-Ying Xiao | Yeou-Jiunn Chen | Kun-Yi Huang
Recent advancements in AI-based voice cloning have led to increasingly convincing synthetic speech, posing significant threats to speaker verification systems. In this paper, we propose a novel voice spoofing detection method that integrates acoustic feature variations with attention mechanisms derived from wav2vec 2.0 representations. Unlike prior approaches that directly utilize wav2vec 2.0 features as model inputs, the proposed method leverages wav2vec 2.0 features to construct speech rules characteristic of bona-fide speech. Experimental results indicate that the proposed RULE-AASIST-L system significantly outperforms the baseline systems on the ASVspoof 2019 LA evaluation set, achieving a 24.6% relative reduction in equal error rate (EER) and a 10.8% reduction in minimum tandem detection cost function (min t-DCF). Ablation studies further confirm the importance of incorporating speech rules and selecting appropriate hidden layer representations. These findings highlight the potential of using self-supervised representations to guide rule-based modeling for robust spoofing detection.
Computational Approaches to Quantitative Analysis of Pause Duration in Taiwan Mandarin
I-Ping Wan | Yu-Ju Lai | Pu Yu
This study presents a quantitative analysis of pause-duration patterns in a Mandarin spoken corpus to establish a baseline for prosodic and cognitive assessment. Drawing on cross-linguistic research, the distribution of pause patterns is viewed as reflecting multiple underlying factors. Longer pauses aligned with prosodic and syntactic boundaries indicate more deliberative and planned discourse rather than spontaneous speech. Such settings place higher demands on cognitive and articulatory planning, producing extended thinking time as speakers handle complex topics and specialized terminology. The spoken corpus was automatically processed and annotated using an in-house alignment and pause-tagging pipeline. Outlier detection with a 3.0×IQR threshold retained 35,474 tokens and removed extreme values exceeding 1,016 ms. Short and medium pauses remained stable across mean, median, and variability measures, while long pauses showed a moderate reduction (16,436 to 15,420 tokens), with mean duration decreasing from 535 to 426 ms and standard deviation sharply reduced from 786 to 169 ms, while the median stayed around 370–380 ms. These findings demonstrate that automatic cleaning primarily removed aberrant values while preserving linguistically meaningful long pauses. This baseline from non-impaired adult speakers underscores the need for corpus-specific frameworks and offers a reference point for cross-linguistic research on speech planning.
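The 3.0×IQR cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' in-house pipeline: the quartile method (`inclusive`), the function names, and the toy pause durations are all assumptions, and only the upper fence is applied because the abstract mentions only the removal of extremely long values.

```python
import statistics


def iqr_upper_fence(durations_ms, k=3.0):
    """Upper outlier fence Q3 + k * IQR (k = 3.0 as in the abstract)."""
    # Assumption: quartiles by the "inclusive" method; the paper may differ.
    q1, _, q3 = statistics.quantiles(durations_ms, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def remove_extreme_pauses(durations_ms, k=3.0):
    """Keep only pauses at or below the upper fence."""
    fence = iqr_upper_fence(durations_ms, k)
    return [d for d in durations_ms if d <= fence]


# Toy pause durations in milliseconds (not data from the corpus).
pauses = [120, 180, 200, 240, 300, 350, 380, 420, 500, 5000]
cleaned = remove_extreme_pauses(pauses)

print(iqr_upper_fence(pauses))
print(cleaned)
```

Because the fence is derived from quartiles, a few aberrant values barely move it, which is why this kind of filter can cut the standard deviation sharply while leaving the median almost unchanged.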
A Novel Chinese-Idiom Automatic Error Correction Method Based on the Hidden Markov Model
Rongbin Zhang | Anlu Gui | Peng Cao | Lingfeng Wu | Feng Huang | Jiahui Li
Spelling errors in Chinese idioms frequently occur due to various types of misspellings and optical character recognition (OCR) errors in daily learning and usage. Achieving automatic error correction for Chinese idioms is an important natural language processing task, as it helps improve the quality of Chinese texts as well as language learning. Existing methods, such as edit distance and custom dictionary approaches, suffer from limited error correction capability, low computational efficiency, and weak flexibility. To address these limitations, this paper proposes a novel automatic error correction method for Chinese idioms based on the hidden Markov model (HMM). Specifically, the generation process of idiom spelling errors is modeled using an HMM, transforming the idiom correction problem into a matching task between erroneous idioms and legitimate idioms. By constructing a legitimate idiom table and a Chinese character confusion set, a prototype system for idiom correction was developed, and performance testing was completed. Experimental results demonstrate that the proposed model is simpler, with fewer parameters and lower computational complexity, while exhibiting stronger error correction capability and parameter robustness than existing methods. It can more flexibly correct diverse types of idiom errors, showing high potential application value.
Toward Traditional Chinese ModernBERT: A Preliminary Study
Yi-En Chen | Qiao-Ying He | Kuan-Yu Chen
This study employs several state-of-the-art techniques, including RoPE and Flash Attention, and leverages large-scale Chinese web corpora and encyclopedic data to pre-train an encoder model specifically designed for long text in Traditional Chinese. We evaluate the model on tasks such as reading comprehension and text classification, and the results show that its overall performance lags behind existing Chinese benchmarks. Through pseudo-perplexity analysis, we infer that the pre-training phase did not sufficiently capture the data distribution, potentially due to factors such as hyperparameters, convergence, and data quality. Although the results are suboptimal, this study still offers valuable experimental insights and directions for improving Chinese language model development.
Effective Speaker Diarization Leveraging Multi-task Logarithmic Loss Objectives
Jhih-Rong Guo | Tien-Hong Lo | Yu-Sheng Tsao | Pei-Ying Lee | Yung-Chang Hsu | Berlin Chen
End-to-End Neural Diarization (EEND) has undergone substantial development, particularly with powerset classification methods that enhance performance but can exacerbate speaker confusion. To address this, we propose a novel training strategy that complements the standard cross entropy loss with an auxiliary ordinal log loss, guided by a distance matrix of speaker combinations. Our experiments reveal that while this approach yields significant relative improvements of 15.8% in false alarm rate and 10.0% in confusion error rate, it also uncovers a critical trade-off with an increased missed error rate. The primary contribution of this work is the identification and analysis of this trade-off, which stems from the model adopting a more conservative prediction strategy. This insight is crucial for designing more balanced and effective loss functions in speaker diarization.
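One way to picture the training objective described above is the sketch below. It is not the authors' formulation: it assumes a commonly used ordinal log loss of the form −Σᵢ d(y, i)^α · log(1 − pᵢ), a distance matrix defined as the number of speakers by which two powerset classes differ, and a hypothetical mixing weight; the abstract confirms only that cross entropy is complemented by an auxiliary ordinal log loss guided by a distance matrix of speaker combinations.

```python
import math

# Powerset classes for two speakers: each class is the set of active speakers.
CLASSES = [frozenset(), frozenset("A"), frozenset("B"), frozenset("AB")]

# Assumed distance: size of the symmetric difference between speaker sets.
DISTANCE = [[len(a ^ b) for b in CLASSES] for a in CLASSES]


def cross_entropy(probs, target):
    """Standard cross entropy for one frame."""
    return -math.log(probs[target])


def ordinal_log_loss(probs, target, distance, alpha=1.0):
    """Penalise probability mass on classes far from the target class."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        (distance[target][i] ** alpha) * math.log(max(1.0 - p, eps))
        for i, p in enumerate(probs)
        if i != target
    )


def combined_loss(probs, target, distance, weight=0.5, alpha=1.0):
    """Cross entropy plus a weighted auxiliary ordinal term (weight is hypothetical)."""
    return cross_entropy(probs, target) + weight * ordinal_log_loss(
        probs, target, distance, alpha
    )


# Target is class 1 ({A}). Both predictions give it probability 0.6, but one
# puts the remaining mass on {A, B} (distance 1), the other on {B} (distance 2).
near_miss = [0.05, 0.6, 0.05, 0.3]
far_miss = [0.05, 0.6, 0.3, 0.05]

print(combined_loss(near_miss, 1, DISTANCE))
print(combined_loss(far_miss, 1, DISTANCE))
```

Cross entropy alone cannot distinguish the two predictions, whereas the auxiliary term penalises the one that confuses speaker A with speaker B more heavily. A term of this shape discourages confident mass on distant classes, which may relate to the more conservative prediction behaviour the abstract reports.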
Leveraging Weak Segment Labels for Robust Automated Speaking Assessment in Read-Aloud Tasks
Yue-Yang He | Berlin Chen
Automated speaking assessment (ASA) has become a crucial component in computer-assisted language learning, providing scalable, objective, and timely feedback to second-language learners. While early ASA systems relied on hand-crafted features and shallow classifiers, recent advances in self-supervised learning (SSL) have enabled richer representations for both text and speech, improving assessment accuracy. Despite these advances, challenges remain in evaluating long speech responses, due to limited labeled data, class imbalance, and the importance of pronunciation clarity and fluency, especially for read-aloud tasks. In this work, we propose a segment-based ASA framework leveraging WhisperX to split long responses into shorter fragments, generate weak labels from holistic scores, and aggregate segment-level predictions to obtain final proficiency scores. Experiments on the GEPT corpus demonstrate that our framework outperforms baseline holistic models, generalizes robustly to unseen prompts and speakers, and provides diagnostic insights at both segment and response levels.
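The weak-labelling and aggregation idea above can be sketched as follows. This is a hypothetical simplification: it assumes every segment simply inherits the holistic score of its response and that segment predictions are averaged, whereas the abstract does not state the exact labelling or aggregation rule. Segmentation with WhisperX and the scoring model itself are omitted.

```python
from statistics import mean


def make_weak_labels(segments, holistic_score):
    """Give each segment the holistic score of the response it came from."""
    return [(segment, holistic_score) for segment in segments]


def aggregate_segment_predictions(segment_scores):
    """Combine segment-level predictions into one response-level score.

    Assumption: a plain mean; the paper may use a different aggregator.
    """
    return mean(segment_scores)


# Toy example: one read-aloud response split into three segments.
segments = ["segment 1 audio", "segment 2 audio", "segment 3 audio"]
training_pairs = make_weak_labels(segments, holistic_score=4)

# Placeholder segment-level predictions from some scoring model.
predicted = [3.0, 4.0, 5.0]
print(aggregate_segment_predictions(predicted))
```

Keeping the per-segment predictions alongside the aggregate is what would allow feedback at both the segment and the response level.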
Exploring the Feasibility of Large Language Model- and Rubric-Based Automatic Assessment of Elementary Students’ Book Summaries
Qi-Zhen Huang | Hou-Chiang Tseng | Yao-Ting Sung
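Summary writing is a high-level language task that integrates reading and writing; it not only assesses students’ text comprehension but also fosters their expression and paraphrasing abilities. Previous automatic summary scoring systems have mostly relied on “bottom-up” methods such as keyword matching or semantic overlap, which struggle to comprehensively evaluate students’ depth of understanding and ability to restate a text. Moreover, although research on scoring Chinese summary writing exists, it remains relatively scarce compared with English, leaving a research gap. The development of large language models (LLMs), with their breakthroughs in semantic understanding and generation, brings new opportunities for automatic summary scoring and feedback. Accordingly, this study takes a top-down approach to explore the potential of combining LLMs with reading-summary scoring rubrics for scoring and giving feedback on students’ reading summaries. Specifically, in consideration of the privacy of teaching data, this study uses Meta-Llama-3.1-70B to generate machine summaries and, following an expert-designed summary rubric covering five dimensions (comprehension and accuracy, organization and structure, conciseness, language expression and grammar, and paraphrasing ability), automatically scores students’ reading summaries and provides feedback on them. The results show that Meta-Llama-3.1-70B can provide specific, clear, and immediate feedback: it not only points out key concepts missing from a summary but also suggests corrections for structural arrangement and grammatical errors, helping students quickly identify how to improve their summaries. However, the feedback tends to focus on surface-level language and structural adjustments, and limitations remain in assessing higher-level language abilities such as language expression, rhetorical diversity, and paraphrasing. Overall, LLMs can serve as tools for formative assessment and teaching support that improve scoring efficiency, but they need to be combined with teachers’ professional judgment and feedback to supply guidance on deeper concepts and strategic writing, thereby promoting the development of students’ summary-writing ability.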
From Scarcity to Scalability: Lexicon and Grammar Enhanced Amis to Mandarin Translation with GPT Models
Joseph Lin | Kai-Ying Lin | Hung-Yu Kao
Machine translation (MT) for low-resource languages remains constrained by extreme data scarcity, making traditional fine-tuning infeasible. This study examines Amis→Mandarin translation as a practical case, leveraging GPT-4o-mini and GPT-5-mini with dictionary integration and grammar-informed prompting. Experiments show that GPT-5-mini, supported by a dictionary, achieves usable quality (BLEU-3 ∼31, COMET ∼78, BLEURT ∼71). To address the bottleneck of incomplete dictionaries, we propose Context-Driven Lexical Augmentation, which infers Mandarin equivalents for unseen Amis terms from corpus context, raising BLEU-3 to 34 and establishing a stronger basis for semi-automatic corpus generation. These results demonstrate that expanding and refining the dictionary provides greater benefits than parameter-intensive fine-tuning in extremely low-resource settings. We also discuss the performance gap between Amis→Mandarin and Mandarin→Amis translation, attributing it to Amis’s morphological complexity and narrower semantic coverage. Overall, our resource-driven strategy offers a scalable pathway toward high-quality MT and corpus expansion, ultimately supporting both linguistic research and language revitalization.
CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Hung-Yang Sung | Chien-Chun Wang | Kuan-Tang Huang | Tien-Hong Lo | Yu-Sheng Tsao | Yung-Chang Hsu | Berlin Chen
Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on romanization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.
MINAS: Mandarin Intelligent Narrative Assessment of Syntax for Children
Ruei-Ru Wang | Ya-Sin Li | Yi-Shuo Yin | Tao-Yu Chen | Hint-Tat Cheung | Ching-Tai Chen
Children’s narrative ability is an important indicator of language development and is commonly used in clinical diagnosis and linguistic research. However, the lack of large-scale, standardized, and accurately annotated Chinese child language corpora makes grammatical analysis both time-consuming and prone to subjectivity, while existing automated tools fall short of clinical and research needs. This study introduces MINAS (Mandarin Intelligent Narrative Assessment of Syntax for Children), which integrates the MAIN story framework with the MAPS-R syntactic framework to construct a Chinese narrative corpus encompassing four categories and 20 indicators. We evaluated commercial models (ChatGPT-4, Claude Sonnet 4, Gemini 2.5 Flash, DeepSeek) through prompt engineering, and fine-tuned open-source models (Chinese RoBERTa, OpenHermes-2.5) with LoRA. Experimental results show that few-shot prompting achieves high accuracy across most indicators, while fine-tuning with LoRA achieves better performance in noun and verb phrase identification but is not as good for complex sentence structures. This study validates the feasibility of applying large language models to syntactic classification of Chinese child narrative corpora, highlighting their potential in clinical applications and linguistic research.
LOBSTER: Linguistics Olympiad Benchmark for Structured Evaluation on Reasoning
Da-Chen Lian | Ri-Sheng Huang | Pin-Er Chen | Chunki Lim | You-Kuan Lin | Guan-Yu Tseng | Zhen-Yu Lin | Pin-Cheng Chen | Shu-Kai Hsieh
We propose the Linguistics Olympiad Benchmark for Structured Evaluation on Reasoning, or LOBSTER, a linguistically-informed benchmark designed to evaluate large language models (LLMs) on complex linguistic puzzles of the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, our benchmark provides concrete evaluation protocols and rich typological metadata across over 90 low-resource and cross-cultural languages alongside the puzzles. Through systematic evaluations of state-of-the-art models on multilingual abilities, we demonstrate that LLMs struggle with low-resource languages, underscoring the need for such a benchmark. Experiments with various models on our benchmark showed that IOL problems remain a challenging task for reasoning models, though there are ways to enhance the performance—for example, iterative reasoning outperforms single-pass approaches in both final answers and explanations. Our benchmark offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.
Cross-user Collaborative and Sequential Modeling for Recommendation
Qiao-Ying He | Yi-En Chen | Kuan-Yu Chen
Multi-behavior recommendation leverages auxiliary behaviors to effectively alleviate the sparsity of target behaviors. Existing approaches can be broadly categorized into two paradigms: sequential models that capture individual temporal dynamics but often omit cross-user information, and graph-based models that mine collaborative patterns yet lack temporal dependency modeling. To address these limitations, this paper proposes an integrated approach that combines sequential and graph modeling: the former focuses on learning temporal dependencies within user behavior sequences, while the latter captures cross-user behavior paths. By fusing the predictions from both components, the method achieves more accurate recommendations. Experiments on two e-commerce datasets, Taobao and RetailRocket, show that the integrated model outperforms the strong baseline MB-STR by about 1% in both HR@10 and NDCG@10. These results indicate that incorporating cross-user collaborative information consistently improves performance, even on top of strong sequential models.
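The fusion and evaluation steps mentioned above can be sketched as follows. The abstract says only that predictions from the two components are fused, so the weighted sum and its `weight` parameter are assumptions, as are the function names and toy scores. HR@k and NDCG@k are written in their usual single-target form, where each test case has one held-out item.

```python
import math


def fuse_scores(seq_scores, graph_scores, weight=0.5):
    """Weighted sum of per-item scores from the two components (assumed rule)."""
    items = seq_scores.keys() & graph_scores.keys()
    return {
        item: weight * seq_scores[item] + (1.0 - weight) * graph_scores[item]
        for item in items
    }


def rank_of(scores, target):
    """1-based rank of the target item when sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target) + 1


def hit_rate_at_k(rank, k=10):
    """1 if the held-out item appears in the top k, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k=10):
    """Single-target NDCG: 1 / log2(rank + 1) inside the top k, else 0."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Toy scores for three candidate items (not data from the paper).
seq_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
graph_scores = {"a": 0.1, "b": 0.6, "c": 0.2}
fused = fuse_scores(seq_scores, graph_scores)

rank_b = rank_of(fused, "b")
print(rank_b, hit_rate_at_k(rank_b), ndcg_at_k(rank_b))
```

In this toy case the sequential scores alone rank item `a` first, while the fused scores promote item `b`, which both components rate reasonably well.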
Structured vs. Unstructured Inputs in LLMs: Evaluating the Semantic and Pragmatic Predictive Power in Abnormal Event Forecasting
Jou-An Chi | Shu-Kai Hsieh
Large Language Models (LLMs) are increasingly applied to temporally grounded reasoning tasks, yet the role of input representation remains unclear. This paper compares structured temporal inputs, represented as Temporal Knowledge Graphs (TKGs), with unstructured captions in two settings: forecasting future events and detecting anomalies in surveillance video descriptions. To enable direct comparison, we build a unified dataset by aligning anomaly labels from UCF-Crime with caption annotations from UCA. Experiments show that unstructured captions consistently yield slightly higher scores across both tasks, but the differences do not reach statistical significance. Their trade-offs, however, differ: captions provide richer semantic cues for generation, while TKGs reduce input length, suppress noise, and enhance interpretability. These findings suggest that action-centric corpora, such as surveillance or forensic narratives, naturally lend themselves to structured representations, which can provide temporal scaffolds for timeline reconstruction and more traceable reasoning. All code, data processing scripts, and experimental results are available at our GitHub repository.
Embodiment in Multimodal Semantics: Comparing Sensory, Emotional, and Visual Features in Chinese Color Metaphors
Yufeng Wu | Meichun Liu
This study examines how sensory-motor experience, emotional valence and arousal, and visual image statistics contribute to multimodal alignment in Chinese color metaphors. Using 184 metaphorical lexemes from six basic color terms, we combined textual data from the Chinese Corpus Internet (CCI 3.0) with image sets from Baidu, embedding both with Chinese-CLIP and measuring alignment via cosine similarity. Sensory-motor ratings, particularly effector exclusivity and tactile strength, correlated negatively with alignment, while emotional valence showed strong positive correlations and visual features such as color variability and entropy contributed positively. Regression and importance analyses confirmed emotion as the most reliable predictor, with sensory ratings offering little explanatory power. The findings indicate that affective salience and perceptual richness, rather than generalized sensory norms, are central to the embodied grounding of metaphorical words in multimodal contexts.
pdf
bib
abs
Language Modeling Using Entanglement Enhanced Tensor Trains
Ellis Reyes
|
Yi-Shin Chen
Tensor Train Language Models (TTLMs) offer significant memory savings by representing text sequences as tensor networks, but naive implementations struggle with long-range dependencies and limited flexibility. We introduce a modular TTLM framework that combines local and non-local context modules to achieve scalable language modeling. Our non-local modules, inspired by entanglement in quantum information theory, enable efficient modeling of long-range interactions between hidden states. Experiments on Penn Treebank and Wikitext datasets show that our modular TTLM, including entanglement-augmented variants, outperforms naive baselines. These results highlight TTLMs as a promising, memory-efficient alternative for modern language modeling.
pdf
bib
abs
Multimodal Fake News Detection Combining Social Network Features with Images and Text
Lawrence Yung Hak Low
|
Yen-Tsang Wu
|
Yan-Hong Liu
|
Jenq-Haur Wang
The rapid development of social networks, coupled with the prevalence of Generative AI (GAI), has led to a sharp increase in fake tweets and fake news on social media platforms, which in turn has prompted more in-depth research on fake news detection. At present, there are two mainstream approaches: content-based detection and propagation/network-based detection. Early content-based methods took an article’s content as input and used a similarity algorithm to identify fake news. These methods were later improved by using single-modality features such as images or text as input. However, existing research shows that single-modality features alone cannot identify fake news efficiently, so the most recent methods fuse multimodal features such as images and text as model input for classification. Propagation/network-based methods instead build graphs or decision trees from social networks and use them as model input for classification. In this study, we propose a multimodal fake news detection framework that combines these two mainstream approaches. The framework not only uses images and text as input features but also incorporates social metadata such as comments, which it extracts and organizes into a tree structure to obtain features. Furthermore, we propose different feature fusion methods that achieve better results than existing methods. Finally, ablation experiments show that each module contributes to the framework’s overall performance, demonstrating the effectiveness of the proposed approach.
pdf
bib
abs
Speech-Driven Editing System for Chinese ASR Errors
Sji-Jie Ding
|
Chia-Hui Chang
|
Zi-Xuan Jian
Despite recent advances in AI, ASR systems still struggle with real-world errors from pronunciation and homophones. To solve this issue, we propose a verbal-command-based correction system that enables users to utter natural-language instructions to refine recognition outputs with minimal effort. The system consists of three modules: an input classifier, a command classifier, and a correction labeler. To support training and evaluation, we simulate potential ASR errors with a TTS-and-ASR pipeline and then generate verbal correction commands based on linguistic features or LLMs. Experiments show that the overall system achieves over 80% correction accuracy and delivers stable performance. Compared to manual correction, the system also shows highly competitive correction speed, which indicates its feasibility for practical deployment.
pdf
bib
abs
A Fake News Detection Model Utilizing Graph Neural Networks to Capture Writing Styles
Yen-Tsang Wu
|
Lawrence Y. H Low
|
Jenq-Haur Wang
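This paper proposes CWSMN (Capture Writing Style Multi-Graph Network), a graph-neural-network-based method for early fake news detection that captures writing style to overcome the limitations of traditional semantic-content and propagation-feature approaches under scarce annotation and weak cross-domain generalization. CWSMN combines stylistic analysis, semantic embeddings, and multi-graph fusion: it uses a Bi-GRU for contextual initialization, a GAT for attention-guided graph aggregation, and LDA to construct a topic graph, with a lightweight feed-forward classifier producing the predictions. Experiments on multiple datasets show that CWSMN consistently outperforms strong baselines such as BERT, ALBERT, and GraphSAINT. The gains are especially pronounced in the Source-CV setting with unseen sources, demonstrating robust generalization in low-resource and cross-domain environments and enabling early detection that does not rely on propagation. The results confirm that the method still achieves effective early detection under sample scarcity and unseen-source conditions.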
pdf
bib
abs
Revisiting Pre-trained Language Models for Conversation Disentanglement
Tung-Thien Lam
|
Cheng-Zen Yang
Multi-party conversation is a popular form of online group chat. However, the interweaving of utterance threads makes the dialogues harder for participants to follow. Many conversation disentanglement models have been proposed using transformer-based pre-trained language models (PrLMs). However, advanced transformer-based PrLMs have not been extensively studied. This paper investigates the effectiveness of five advanced PrLMs: BERT, XLNet, ELECTRA, RoBERTa, and ModernBERT. The experimental results show that ELECTRA and RoBERTa outperform the other PrLMs on the conversation disentanglement task.
pdf
bib
abs
Multilingual Promise Verification in ESG Reports with Large Language Model Performance Evaluation
Wei-Chen Huang
|
Hsin-Ting Lu
|
Wen-Ze Chen
|
Min-Yuh Day
Corporate ESG reports often contain statements that are vague or difficult to verify, creating room for potential greenwashing. Building automated systems to evaluate such claims is therefore a relevant research direction. Yet, existing analytical tools still show limited ability to verify sustainability promises in multiple languages, especially beyond English. This study examines how large language models (GPT-5) perform in verifying ESG-related promises across Chinese, Japanese, and English reports, aiming to provide a multilingual evaluation baseline. We assess four verification tasks using the PromiseEval datasets [1] in three languages, comparing five prompting strategies from zero-shot to five-shot learning, including Chain-of-Thought reasoning. The four subtasks are Promise Identification (PI), Evidence Status Assessment (ESA), Evidence Quality Evaluation (EQE), and Verification Timeline Prediction (VTP). The five-shot setting achieved the highest overall performance (71.12% accuracy, 51.92% Macro-F1). Although the accuracy results appear higher for Chinese (85.12%) than for Japanese (68.94%) and English (63.62%), this mainly reflects class imbalance in the data. Hence, Macro-F1 provides a fairer comparison across languages. Among the four tasks, Evidence Quality Evaluation (EQE) remains the most difficult. While Chain-of-Thought prompting slightly lowers the overall average, it shows selective benefit on the more complex EQE task. Overall, this work offers a clearer multilingual baseline for ESG promise verification and supports the development of language-based tools that enhance the credibility and transparency of sustainability reporting.
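The point about class imbalance can be illustrated with a small sketch. It computes accuracy and Macro-F1 by hand on a made-up, heavily skewed label set; the labels and predictions are hypothetical and are not taken from PromiseEval.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical skewed data: nine "yes" items, one "no" item.
y_true = ["yes"] * 9 + ["no"]
# A degenerate classifier that always answers "yes".
y_pred = ["yes"] * 10

acc = accuracy(y_true, y_pred)   # high, because "yes" dominates
mf1 = macro_f1(y_true, y_pred)   # much lower, because "no" is never found
```

Here the always-"yes" classifier reaches 90% accuracy while its Macro-F1 stays below 0.5, which is the kind of gap that makes Macro-F1 the fairer cross-language comparison.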
pdf
bib
abs
Exploring Sentence Stress Detection using Whisper-based Speech Models
Ting-An Hung
|
Yu-Hsuan Hsieh
|
Tien-Hong Lo
|
Yung-Chang Hsu
|
Berlin Chen
Sentence stress reflects the relative prominence of words within a sentence. It is fundamental to speech intelligibility and naturalness, and is particularly important in second language (L2) learning. Accurate stress production facilitates effective communication and reduces misinterpretation. In this work, we investigate sentence stress detection (SSD) using Whisper-based transformer speech models under diverse settings, including model scaling, backbone–decoder interactions, architectural and regularization enhancements, and embedding visualization for interpretability. Results show that smaller Whisper variants achieve stronger performance under limited data, while architectural and regularization enhancements improve stability and generalization. Embedding analysis reveals clear separation between stressed and unstressed words. These findings offer practical insights into model selection, architecture design, and interpretability for SSD applications, with implications for L2 learning support tools.
pdf
bib
abs
Integrating Sequential Information and Graph Structures for Anti-Money Laundering Anomaly Detection
Yin-Ju Wu
|
Gavin Tseng
|
Berlin Chen
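Anti-Money Laundering (AML) is an important research topic in financial technology; its goal is to identify potentially suspicious accounts and transactions. However, with the rise of cross-border payments and new types of transactions, money laundering is often highly concealed and involves complex network structures, and traditional rule-based methods fall short in both detection performance and generalization. In recent years, studies have attempted to apply machine learning or deep learning to AML, but many challenges remain. To address these problems, this study proposes an AML account risk prediction framework based on sequence-graph fusion. The core of the method is to jointly model an account’s individual temporal behavior and its structural features in the transaction network. First, each account’s transaction history is decomposed into incoming-edge and outgoing-edge sequences, which are encoded separately by a dual-branch GRU architecture to capture the account’s temporal transaction patterns. Next, a bidirectional attention graph convolution layer processes forward and reverse neighbor relations simultaneously through a difference-aware message-passing mechanism, learning behavioral differences between accounts, and adaptively fuses each node’s own features with the bidirectional neighbor aggregates through an attention mechanism. In addition, to handle the extreme class imbalance of AML datasets, class reweighting and balanced sampling strategies are introduced. We validate the proposed method on a public AML dataset; the experimental results show that the framework achieves stable F1 performance under extreme imbalance and has a significant advantage over traditional baseline methods.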
pdf
bib
abs
A Multi-faceted Statistical Analysis for Logit-based Pronunciation Assessment
Chieh-Ren Liao
|
Berlin Chen
The Goodness of Pronunciation (GOP) score for pronunciation quality assessment is a key technology in computer-assisted language learning. Recent studies have shown that computing GOP scores directly from the acoustic model’s raw output logits outperforms traditional softmax-probability-based methods, because logits avoid probability saturation issues and retain richer discriminative information. However, existing logit-based methods mostly rely on basic statistics such as maxima, means, or variances, which neglect the more complex dynamic distributions and temporal characteristics of logit sequences over phoneme durations. To more comprehensively capture pronunciation details embedded in logit sequences, this study proposes a multi-faceted statistical analysis method. We explore five higher-order statistical indicators that describe different characteristics of logit sequences: (1) moment-generating functions to compute distribution skewness and kurtosis; (2) information theory, using entropy to quantify model uncertainty; (3) Gaussian mixture models (GMMs) to fit multimodal distributions of logits; (4) time-series analysis, computing autocorrelation coefficients to measure logit stability; and (5) extreme value theory, using top-k averaging to obtain more robust peak-confidence estimates. We conduct experiments on the public L2 English speech corpus SpeechOcean762, comparing these newly proposed statistical indicators with baseline methods from the literature (GOP_MaxLogit, GOP_margin). Preliminary results show that some higher-order statistical indicators—particularly those that describe logit-sequence stability and distribution shape—achieve higher accuracy on pronunciation-error detection classification tasks and exhibit stronger correlation with human expert ratings. This study demonstrates that deeper statistical modeling of logit sequences is an effective approach to improving the performance of automated pronunciation assessment systems.
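Several of the indicators listed above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors’ implementation: it treats `xs` as the per-frame logit of the target phoneme over one phoneme segment, uses population moments for skewness and excess kurtosis, lag-1 autocorrelation as the stability measure, and a plain top-k mean. The exact definitions used in the paper may differ.

```python
import math


def skewness_and_kurtosis(xs):
    """Population skewness and excess kurtosis from central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0, 0.0  # constant sequence: shape statistics are undefined, use 0
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def softmax_entropy(frame_logits):
    """Entropy (in nats) of the softmax over one frame's phoneme logits."""
    peak = max(frame_logits)
    exps = [math.exp(z - peak) for z in frame_logits]  # shift for numerical stability
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0)


def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation; values near 1 indicate a smooth, stable sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1)) / var


def top_k_mean(xs, k=3):
    """Mean of the k largest values, a smoother peak estimate than the maximum."""
    top = sorted(xs, reverse=True)[:k]
    return sum(top) / len(top)
```

Each function returns one scalar per phoneme segment, so the outputs could be concatenated into a feature vector for a downstream error-detection classifier.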
pdf
bib
abs
Learning User Common Interests for Unseen Group Recommendation
Yu-Ting Cheng
|
Pin-Hsin Hsiao
|
Chiou-Shann Fuh
|
Pu-Jen Cheng
Previous studies on recommender systems have primarily focused on learning implicit preferences from individual user behaviors or enhancing recommendation performance by identifying similar users. However, in real-life scenarios, group decision-making is often required, such as when a group of friends decides which movie to watch together. Thus, discovering common interests has become a key research issue in group recommendation. The most straightforward approach to group recommendation is to model the past joint behaviors of a user group. Nevertheless, this method fails to handle newly formed groups with no historical interactions. To address this limitation, we apply Graph Convolutional Networks to capture high-order structural features within the user–item interaction graph, thereby uncovering the potential common interests of unseen groups. Experimental evaluations on three real-world datasets demonstrate the feasibility and effectiveness of the proposed method.
pdf
bib
abs
Introduction: Persuasive Language in the Age of AI
Siaw-Fong Chung
Persuasive language shapes communication across disciplines and everyday life. As large language models (LLMs) become increasingly integrated into these spheres, understanding persuasion now encompasses both human and machine discourse. This introduction examines how persuasive language operates across diverse contexts by analyzing the interactional frameworks of human and AI communication. It also explores how persuasion emerges in human-AI exchanges and how these insights can inform language education and communication practices. Drawing on perspectives from linguistics, computer science, journalism, and communication studies, it presents persuasion as both a rhetorical and interactional process shaped by technology. Ultimately, it aims to deepen understanding of how AI transforms persuasive practices and to promote greater awareness of persuasion in language learning.
pdf
bib
abs
Stance and Cohesion: The Use of However and While in AI-Human Argumentative Discourse
Yu-Che Yen
|
Siaw-Fong Chung
This study investigates how the connectives However and While, which signal contrast or concession in constructing stance, are distributed in AI chatbots’ task-based argumentation. The corpus, comprising 13,482 words of chatbot-produced discourse, was analyzed to examine the connectives’ sentence positions and their relation to content-, writer-, and reader-oriented propositions, using a framework that integrates Hyland’s (2005) framework with Thetela’s (1997) evaluative-entity framework. A total of 124 tokens of However and While were extracted, excluding tokens whose stance and cohesive functions could not be clearly interpreted. Results show that sentence-initial However (N=40) and sentence-initial While (N=59) are the primary devices for asserting a writer-oriented stance, signaling evaluation, claim, or counter-claim. Sentence-initial While is more frequently used to frame a factual premise before projecting writer orientation. With sentence-medial While, both the preceding and the subsequent clauses often present content-oriented propositions, indicating that achieving cohesion is prioritized over expressing an evaluative stance. The study concludes that these connectives, strategically applied in AI-human argumentation, contribute to managing stance construction and discourse coherence.
pdf
bib
abs
Quantum Perspectives on Persuasive Language in AI-Generated News: A QNLP-Based Analysis
Jung-Hua Liu
This study applies quantum natural language processing (QNLP) to 298 Chinese AI-generated YouTube news articles. Using IBM Qiskit, we reveal multi-reality narratives with high frame competition but low conflict. Headlines employ emotion while content stays neutral or positive, showing strategic ambiguity. QNLP metrics highlight persuasive tactics and implications for communication theory and AI ethics.
pdf
bib
abs
Interpretation of the level of ANGER in discussion forum
Suet Ching Soon
In the internet era, people have easy access to a wide range of social media platforms for quick communication and interaction. How internet users convey emotion attracted our interest. This paper investigates the literal expression of ANGER in a Chinese online discussion forum, targeting the term nu4 ‘angry/anger’. We drew on a Bulletin Board System (BBS) in Taiwan, a conversation-like platform with no emoji icons for conveying emotion directly. A collection of 7,464 instances was retrieved from the platform. After removing noisy data, we examined the meanings and distribution of nu4 in the remaining 7,285 instances. Because nearly a quarter of the instances involved an unconventional use of nu4, in which the expression does not necessarily convey the emotion of anger, we further analyzed the collocations of these instances. In conclusion, the collocates of these unconventional uses of nu4 show a shift from expressing emotion to expressing aggressiveness and the degree of an action.
pdf
bib
abs
ROCLING-2025 Shared Task: Chinese Dimensional Sentiment Analysis for Medical Self-Reflection Texts
Lung-Hao Lee
|
Tzu-Mi Lin
|
Hsiu-Min Shih
|
Kuo-Kai Shyu
|
Anna S. Hsu
|
Peih-Ying Lu
This paper describes the ROCLING-2025 shared task aimed at Chinese dimensional sentiment analysis for medical self-reflection texts, including task organization, data preparation, performance metrics, and evaluation results. A total of six participating teams submitted results for techniques developed for valence-arousal intensity prediction. All datasets with gold standards and evaluation scripts used in this shared task are publicly available online for further research.
pdf
bib
abs
CYUT-NLP at ROCLING-2025 Shared Task: Valence–Arousal Prediction in Physicians’ Texts Using BERT, RAG, and Multi-Teacher Pseudo-Labeling
Yi-Min Jian
|
An Yu Hsiao
|
Shih-Hung Wu
Accurately modeling physicians’ emotional states from self-reflection texts remains challenging due to the low-resource, domain-specific nature of medical corpora. The proposed workflow performs Retrieval-Augmented Generation (RAG) and multi-teacher pseudo-labeling to generate high-quality augmented data. This workflow enables effective cross-domain adaptation from general text corpora to professional medical texts. Evaluations on the ROCLING 2025 test set demonstrate substantial improvements over the best-performing baseline in Valence–Arousal prediction accuracy and model stability. Importantly, the workflow is domain-agnostic and provides a generalizable methodology for systematically transferring models to new, low-resource domains, making it applicable beyond medical text analysis.
pdf
bib
abs
NTULAW at ROCLING-2025 Shared Task: Domain-Adaptive Modeling of Implicit Emotions in Medical Reflections
Sieh-Chuen Huang
|
Hsuan-Lei Shao
This paper describes the NTULAW team’s participation in the ROCLING 2025 Dimensional Sentiment Analysis (DSA) shared task, which focuses on predicting valence and arousal ratings for Chinese doctors’ self-reflection texts. Unlike previous editions of the DSA task that targeted words, phrases, or educational comments, this year’s dataset consists of domain-specific multi-sentence medical narratives, posing challenges such as low-arousal writing styles, implicit emotion expressions, and discourse complexity. To address the domain shift between general affective resources (Chinese EmoBank) and medical reflections, we designed a multi-scale BERT-based architecture and explored different data selection strategies. Our final system adopted a hybrid submission: using a model trained solely on doctors’ annotations for arousal prediction, and a combined model with Chinese EmoBank for valence prediction. The system achieved stable performance, ranking third among six participating teams. Error analysis shows systematic overestimation of implicit or negated expressions for valence and regression toward mid-range predictions for arousal. We conclude with limitations of relying only on BERT and outline future work involving domain adaptation, discourse-aware modeling, and large language models (LLMs).
pdf
bib
abs
TCU at ROCLING-2025 Shared Task: Leveraging LLM Embeddings and Ensemble Regression for Chinese Dimensional Sentiment Analysis
Hsin-Chieh Li
|
Wen-Cheng Lin
This study participates in the ROCLING-2025 shared task on Chinese dimensional sentiment analysis for medical self-reflection texts. Dimensional Sentiment Analysis (DSA) represents emotions as continuous dimensions, such as valence (positive to negative) and arousal (calm to excited), providing finer-grained representations than traditional categorical approaches and making it suitable for applications in mental health monitoring and risk assessment. We use large language models (LLMs) to extract contextual embedding vectors, which are then fed into regression models, such as Support Vector Regression (SVR), for valence-arousal prediction. The training data consists of the Chinese EmoBank dataset (2,954 general-domain samples), the validation data is a Medical Self-Reflection Corpus Dataset (994 samples), and the test data is another Medical Self-Reflection Corpus Dataset (1,541 samples). Experimental results show that the SVR model with DeepSeek embeddings performs best. Multi-model ensemble learning further improves performance to 0.463 valence MAE, 0.759 arousal MAE, 0.805 valence PCC, and 0.608 arousal PCC. This approach shows the potential of multi-model fusion in DSA for biomedical applications, facilitating the development of non-intrusive mental health assessment tools.
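The two metrics reported throughout the shared-task papers, mean absolute error (MAE) and the Pearson correlation coefficient (PCC), can be computed as in the sketch below. The function names and the toy ratings are assumptions made for illustration; the official evaluation scripts may differ in detail.

```python
import math


def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted ratings (lower is better)."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


def pearson_correlation(gold, predicted):
    """Pearson correlation coefficient between two rating lists (higher is better)."""
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(predicted) / n
    covariance = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    spread_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return covariance / (spread_g * spread_p)


# Made-up valence ratings for three texts.
gold_valence = [1.0, 2.0, 3.0]
predicted_valence = [1.0, 2.0, 4.0]

valence_mae = mean_absolute_error(gold_valence, predicted_valence)
valence_pcc = pearson_correlation(gold_valence, predicted_valence)
```

MAE measures how far predictions are from the gold ratings, while PCC measures whether they move together, so a system can do well on one and poorly on the other.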
pdf
bib
abs
Hey Vergil at ROCLING-2025 Shared Task: Emotion-Space-Based System for Doctors’ Self-Reflection Sentiment Analysis
Ting-Yi Lin
|
Cong-Ying Lin
|
Jui-Feng Yeh
In the ROCLING 2025 dimensional sentiment analysis task, we present EmoTracer. It is an emotion-space-based system for analyzing doctors’ self-reflection texts. The system uses XLNet, BERT, and LSTM models. It is trained on the SLAKE medical dataset and Chinese datasets, such as Chinese EmoBank and NRC-VAD. This helps the system capture the possible emotional changes of doctors when they write patient-related reflections. EmoTracer converts texts into Valence and Arousal scores. The experiments show about 60% accuracy, a Pearson correlation coefficient (PCC) of 0.9, and a mean absolute error (MAE) of 0.3. These results can help support mental health management. The system also has a simple front-end UI. Users can enter texts and see the analysis results. This demonstrates the full functionality of the EmoTracer system.
pdf
bib
abs
KOLab at ROCLING-2025 Shared Task: Research on Emotional Dimensions in Chinese Medical Self-Reflection Texts
Chia-Yu Chan
|
Chia-Wen Wang
|
Jui-Feng Yeh
Currently, most sentiment analysis techniques are primarily applied to general texts such as social media or news reports, and there is still a relative gap in emotion recognition within the medical field. Self-reflection involves communication between individuals and their inner selves, which has a positive impact on people’s future lives. This article aims to design a classification model for reflective texts aimed at medical professionals to fill gaps in sentiment analysis within the medical field. This task used a BERT model, trained on a dataset from the Chinese EmoBank, and evaluated using the test set provided by the ROCLING 2025 Dimensional Sentiment Analysis – Shared Task. The assessment results show that the PCC scores for Valence and Arousal are 0.76 and 0.58, respectively, while the MAE scores are 0.53 and 0.82, respectively.
pdf
bib
abs
SCUNLP at ROCLING-2025 Shared Task: Systematic Guideline Refinement for Continuous Value Prediction with Outlier-Driven LLM Feedback
Hong Rui Pan
|
Jheng Long Wu
Regression-based prediction is widely applied to continuous outputs, such as emotion dimension estimation. However, traditional methods struggle to handle unclear annotation standards and ambiguous cases. To address this challenge, we propose a dual-layer agent-executor framework, where the agent is responsible for constructing and refining guidelines, while the executor applies these guidelines to annotate large-scale data. Notably, we introduce a novel refinement mechanism that can detect outlier instances and provide feedback to the agent for guideline revision, thereby achieving iterative improvement. We applied this method to the ROCLING 2025 shared task for predicting valence-arousal (VA) values in medical self-reflection texts. Compared to the unmodified version, the outlier-driven configuration effectively reduced MAE for both V/A, with A-MAE significantly decreased by 7.7%. The final valence-MAE was 0.51 and arousal-MAE was 0.87, ranking fourth.
pdf
bib
abs
Taiwanese Hakka Across Taiwan Corpus and Formosa Speech Recognition Challenge 2025 – Dapu & Zhao’an Accents
Yuan-Fu Liao
|
Chih-Chung Kuo
|
Chao-Shih Huang
|
Yu-Siang Lan
|
Han-Chun Lai
|
Wen-Han Hsu
To revive the endangered Hakka language in Taiwan, the first large-scale Hakka speech corpus covering all aspects of Taiwanese Hakka across Taiwan (HAT) was created. This paper introduces the second part of the HAT corpus: the Dapu and Zhao’an accents. Furthermore, to promote this newly constructed corpus and evaluate the performance of the most advanced Hakka ASR systems, the 2025 Formosa Speech Recognition Challenge, FSR-2025–Hakka ASR II, was held. Sixteen teams participated in two tracks: speech-to-Hakka-Hanzi and speech-to-Hakka-Pinyin. The best results were: Hanzi character error rate (CER) 7.50%; Pinyin syllable error rate (SER) 14.81%.
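The error rates used in the challenge papers are edit-distance ratios. The sketch below computes a Levenshtein distance over arbitrary token sequences and divides by the reference length, so the same function gives CER when the tokens are characters and SER when they are space-separated syllables. The function names and the placeholder strings are assumptions; the official scoring may apply additional text normalization.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_token != hyp_token)  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference tokens."""
    return edit_distance(reference, hypothesis) / len(reference)


# Character-level example (CER): one substitution in four characters.
cer = error_rate(list("abcd"), list("abxd"))

# Syllable-level example (SER) with placeholder syllables: one deletion in three.
ser = error_rate("s1 s2 s3".split(), "s1 s3".split())
```

Because insertions are counted, the rate can exceed 100% when the hypothesis is much longer than the reference.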
pdf
bib
abs
Speech Recognition for Low-resource Languages: A Comparative Study on Hakka Han Characters and Romanization
Yu-Hsiang Cheng
|
Yi-Syuan Wu
This study focuses on speech recognition for low-resource languages, with Hakka as the case study. Since there is currently a lack of dedicated speech models for Taiwanese Southern Min, Hakka, and indigenous languages, we adopt OpenAI Whisper-Medium as the base model and apply Low-Rank Adaptation (LoRA) for fine-tuning. Two models with different output forms were developed: a Hakka character-based model and a Hakka phonetic-based model. The experimental dataset contains approximately 80 hours of speech, covering the Dapu and Zhao’an dialects, and the models were evaluated using Character Error Rate (CER) and Word Error Rate (WER).
pdf
bib
abs
Applying Whisper Fine-tuning and Branchformer to Hakka Speech Recognition
Yu-Sheng Huang
|
Wei-Cheng Hong
|
Xin-Yu Chen
|
Szu-Yin Lin
This study addresses the FSR 2025 Hakka speech recognition task by comparing two strategies: fine-tuning large pre-trained models and training from scratch. For character (Hanzi) recognition, we fine-tuned five different scales of the Whisper model, with large-v3-turbo achieving a 7.55% CER on the test set. For Pinyin recognition, a Branchformer model was compared against a LoRA fine-tuned Whisper-small, yielding WERs of 4.7% and 6.5% on the test set, respectively. Speed perturbation was the primary method used for data augmentation in our pre-processing pipeline.
pdf
bib
abs
Improving Low-Resource Speech Recognition with Whisper-MoE and Synthetic Data Augmentation: A Case Study on Hakka
Yuan-Chi Hsu
|
Liang-Chun Fang
|
Hong-Jie Dai
The objective of this study is to improve speech recognition performance for low-resource Hakka, a language spoken by a specific ethnic group. Our team conducted experiments by fine-tuning different base versions of Whisper (e.g., the original model and the Mandarin-focused Belle model). We found that fine-tuning on different bases yielded distinct advantages and varying results in Hakka character and phonetic recognition tasks. To further enhance model accuracy, we experimented with replacing the q, k, and v linear layers in the attention blocks of the Whisper encoder with a mixture-of-experts model combined with RoLA. In addition, we augmented the training data with synthesized speech generated with diverse voice styles and varying speaking rates. The results showed a 0.73% reduction in character error rate for Task 1 and a 0.2% reduction in word error rate for Task 2. These findings confirm that both architectural adjustments to the model and the strategic use of limited synthetic speech data in low-resource dialect corpora can effectively improve recognition performance.
pdf
bib
abs
Whisper Finetuning For Hakka Recognition in Low Resource
Min Han Teng
|
Ci Dao Chen
|
You Ting Lin
|
Bing Jhih Huang
We study automatic speech recognition (ASR) for Hakka, a low-resource language with substantial dialectal variation. Focusing on Zhaoan and Dapu, we fine-tune Whisper using Low-Rank Adaptation (LoRA) and apply data augmentation to mitigate data scarcity. Experiments show that LoRA combined with augmentation substantially improves cross-dialect recognition while maintaining parameter efficiency. Our results demonstrate the potential of lightweight adaptation to extend large-scale ASR systems to underrepresented languages, supporting the preservation of Hakka speech and orthography.
pdf
bib
abs
Hakka Speech Recognition with Whisper and Pinyin Post-processing for FSR-2025
Chia-Hsin Lee
|
Yung-Jun Chang
|
Jin-Yan Wu
|
Kuan-Yu Chen
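This technical report describes our participation in the FSR-2025 Hakka speech recognition challenge (Hakka ASR II), with the aim of advancing automatic speech recognition for Hakka. Because Hakka is a low-resource language with multiple accents, speech recognition is highly challenging. We use Whisper-large-v2 as the backbone model and design a two-stage training procedure: first, the model is adapted on the “Hakka Across Taiwan (HAT)” corpus to capture general acoustic characteristics of Hakka; second, it is fine-tuned on the 60 hours of accent-specific data provided by the organizers to strengthen adaptation to the target data. Experiments show that directly outputting Hakka Hanzi achieves a good character error rate (CER), but performance on the Pinyin task drops markedly because of accent differences and highly variable romanization rules. To address this, we initialize the Pinyin model with the encoder of the Hanzi model and propose a post-processing module that combines RoBERTa-based Hanzi-to-Pinyin conversion, accent identification, and dictionary-based correction, with the aim of improving recognition performance in the competition.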
pdf
bib
abs
A Study on a Low-Resource Speech Recognition System for Taiwan Hakka Based on Whisper and LoRA
Zheng-Ting Liu
|
Heng-You Wang
|
Yi-Xiang Liao
|
Zhong-Yuan Qiu
|
Zhao-Yi Huang
This study presents the development of a high-performance automatic speech recognition (ASR) system for Taiwan Hakka, a low-resource language facing challenges in preservation and digitalization. We adopt OpenAI’s Whisper large-v3-taiwanese-hakka as the foundation, leveraging its advanced Transformer encoder–decoder architecture. To achieve parameter efficiency and adaptability to a new language, we employ the Low-Rank Adaptation (LoRA) fine-tuning strategy, targeting key modules including q_proj, k_proj, v_proj, out_proj, fc1, and fc2. Experimental results demonstrate that the fine-tuned model achieves strong performance on the FSR 2025 HAT-Vol2 test set, with an average character error rate (CER) of 7.07% and an average word error rate (WER) of 40.99%. Training analysis further indicates that both validation loss and error rates consistently decreased and converged, confirming that LoRA enables effective knowledge transfer to Hakka ASR without catastrophic forgetting. These findings provide an efficient and practical solution for speech recognition in low-resource languages.
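As background for the LoRA-based systems in this session, the sketch below shows the low-rank update idea on a single linear layer using plain Python lists: the frozen weight `W` is left untouched and a trainable product `B A`, scaled by `alpha / rank`, is added to its output. The matrix sizes and values are hypothetical toy numbers, not Whisper dimensions, and this is not how any of the papers implement it (they would typically rely on a fine-tuning library).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x) with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by LoRA for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


# Toy 2x2 layer with a rank-1 adapter (all values made up).
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0]]            # rank x d_in
B_zero = [[0.0], [0.0]]     # d_out x rank, zero-initialized as in standard LoRA
B_trained = [[1.0], [2.0]]  # pretend values after some training
x = [1.0, 1.0]

y_start = lora_forward(x, W, A, B_zero, alpha=2.0, rank=1)
y_later = lora_forward(x, W, A, B_trained, alpha=2.0, rank=1)
```

With `B` initialized to zero the adapted layer reproduces the frozen layer exactly, and the adapter adds only `rank * (d_in + d_out)` trainable parameters instead of `d_in * d_out`.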
pdf
bib
abs
A Compact Whisper+LoRA Baseline for Taiwanese Hakka ASR in FSR-2025
Hung-Ting Hsieh
We present a compact baseline for the Formosa Speech Recognition (FSR-2025) Taiwanese Hakka ASR challenge. Our system fine-tunes Whisper-large-v2 (Track 1) and Whisper-large-v3-turbo (Track 2) (Radford et al., 2022) with LoRA (Hu et al., 2021), under a consistent normalization policy and balanced speaker-based dev splits. On the official warm-up set, we obtain 10.94% CER for Track 1 (Hanzi) and 28.48% SER for Track 2 (Pinyin). We provide simple, reproducible pipelines covering data preparation, training, inference, and evaluation, without using external data or language models.
pdf
bib
abs
Optimizing Whisper Parameters and Training Data Processing for Formosa Speech Recognition Challenge 2025 - Hakka ASR II
Jhen-Hao Lee
|
Sheng-Wei Kuo
|
An-Che Cheng
|
Bing-Hua Chen
|
Yi-An Liu
This paper presents the development and experimental process of our system for the Formosa Speech Recognition Challenge 2025 (Hakka ASR). The proposed system is built upon the OpenAI Whisper model. We achieved significant performance improvements for the Sixian dialect of Hakka through dataset preprocessing and model fine-tuning. In the warm-up evaluation, our system achieved a Character Error Rate (CER) of 10.51% on the character recognition track and a Syllable Error Rate (SER) of 14.72% on the pinyin recognition track. In the final evaluation, our system achieved a Character Error Rate (CER) of 11.21% on the character recognition track and a Syllable Error Rate (SER) of 15.08% on the pinyin recognition track.
pdf
bib
abs
The EZ-AI System for Formosa Speech Recognition Challenge 2025
Yu-Sheng Tsao
|
Hung-Yang Sung
|
An-Ci Peng
|
Jhih-Rong Guo
|
Tien-Hong Lo
This study presents our system for the Hakka Speech Recognition Challenge 2025. We designed and compared different systems for two low-resource dialects: Dapu and Zhaoan. On the Pinyin track, we obtain gains by leveraging cross-lingual transfer learning from related languages and combining it with self-supervised learning (SSL). For the Hanzi track, we employ pretrained Whisper with Low-Rank Adaptation (LoRA) fine-tuning. To alleviate the low-resource issue, we experiment with two data augmentation methods: simulating conversational speech to handle multi-speaker scenarios, and generating additional training data via text-to-speech (TTS). Results from the pilot test showed that transfer learning significantly improved performance in the Pinyin track, achieving an average character error rate (CER) of 19.57% and ranking third among all teams. In the Hanzi track, the Whisper + LoRA system achieved an average CER of 6.84%, earning first place among all teams. This study demonstrates that transfer learning and data augmentation can effectively improve recognition performance for low-resource languages. However, the domain mismatch seen in the media test set remains a challenge. We plan to explore in-context learning (ICL) and hotword modeling in the future to better address this issue.
pdf
bib
abs
A Multi-Module Error Detection and Correction System for Hakka ASR
Min-Chun Hu
|
Yu-Lin Xiao
|
Wen-Hsiang Lu
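This study proposes a post-correction system for automatic speech recognition (ASR) of Hakka (mainly the Dapu and Zhao'an accents), aiming to address the high recognition error rates of low-resource languages. Because of limited corpus size, variant characters, and accent differences, Hakka often performs poorly on existing general-purpose ASR models. We therefore first take Whisper Large v3 Turbo as the base recognition model and fine-tune it on about 60 hours of Dapu and Zhao'an speech to improve its adaptation to these accents. After obtaining the ASR N-best candidate sentences, the system applies a multi-module error detection and correction pipeline consisting of four main steps: (1) potential error detection, which locates the words that are likely to be wrong across the candidates; (2) phoneme confusion set detection, which proposes possible replacement words based on phonemic similarity; (3) lexicon correction, which ensures that each word exists in actual language use; and (4) collocation association detection, which uses collocation association scores built from the collected corpus to detect erroneous words. The proposed correction mechanism effectively compensates for the weaknesses of ASR in low-resource languages. Experiments show that after multi-stage error detection and correction, the final CER drops to 15.49%, a reduction of 2.14%, demonstrating that the method effectively improves speech recognition accuracy.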
pdf
bib
abs
A Whisper-Based System with Multi-Faceted Data Augmentation for Low-Resource Language
Pin-Cheng Chen
|
Yu-Chi Chen
|
Chia-Chun Liang
|
Cheng-Yu Lin
|
Ping-Juei Tsai
|
Wei-Yun Ma
This paper presents a comprehensive approach for the Formosa Speech Recognition Challenge 2025 (FSR-2025), targeting automatic speech recognition (ASR) for the under-resourced Dapu and Zhao’an dialects of Taiwanese Hakka. Our method integrates data augmentation and robustness techniques, including SpecAugment, dialect-aware special tokens, text-to-speech (TTS) augmentation, noise/reverberation mixing, and speed perturbation, to mitigate data scarcity and domain mismatch. Experiments on the official FSR-2025 datasets show consistent improvements in both character error rate (CER) and word error rate (WER). Extensive ablation studies further confirm that each component contributes positively. These results offer a practical path toward robust ASR for under-resourced Hakka dialects and suggest broader applicability to other low-resource languages.
pdf
bib
abs
A Channel-Aware Anomaly-Guided Data Augmentation Framework for the FSR-2025 Hakka Speech Recognition Challenge
Siang-Ting Lin
|
Arthur Hao
|
Chiun-Yu Hua
|
Kuan-Tang Huang
|
Berlin Chen
The Formosa Speech Recognition Challenge 2025 (FSR-2025) focuses on Taiwanese Hakka, a low-resource language with limited data diversity and channel coverage. To address this challenge, we propose a channel-aware, data-centric framework that leverages multilingual foundation models to mitigate mismatches between field recordings and training data. Our method integrates unsupervised anomaly detection and channel-conditioned augmentation to enhance data representativeness before ASR fine-tuning, aiming to explore the potential for improving robustness in low-resource Hakka speech recognition.
pdf
bib
abs
The AS-SLAM system for Formosa Speech Recognition Challenge 2025
Chih-Hsi Chen
|
Pei-Jun Liao
|
Chia-Hua Wu
|
Pang-Cheng Wu
|
Hsin-Min Wang
In recent years, large-scale pre-trained speech models such as Whisper have been widely applied to speech recognition. While they achieve strong performance on high-resource languages such as English and Mandarin, dialects and other low-resource languages remain challenging due to limited data availability. The government-led “Formosa Speech in the Wild (FSW) project” is an important cultural preservation initiative for Hakka, a regional dialect, where the development of Hakka ASR systems represents a key technological milestone. Beyond model architecture, data processing and training strategies are also critical. In this paper, we explore data augmentation techniques for Hakka speech, including TTS and MUSAN-based approaches, and analyze different data combinations by fine-tuning the pre-trained Whisper model. We participated in the 2025 Hakka FSR ASR competition (student track) for the Dapu and Zhaoan varieties. In the pilot test, our system achieved 7th place in Hanzi recognition (CER: 15.92) and 3rd place in Pinyin recognition (SER: 20.49). In the official finals, our system ranked 6th in Hanzi recognition (CER: 15.73) and 4th in Pinyin recognition (SER: 20.68). We believe that such data augmentation strategies can advance research on Hakka ASR and support the long-term preservation of Hakka culture.
pdf
bib
abs
Challenges and Limitations of the Multilingual Pre-trained Model Whisper on Low-Resource Languages: A Case Study of Hakka Speech Recognition
Pei-Chi Lan
|
Hsin-Tien Chiang
|
Ting-Chun Lin
|
Ming-Hsiang Su
This study investigates the practical performance and limitations of the multilingual pre-trained model Whisper in low-resource language settings, using a Hakka speech recognition challenge as a case study. In the preliminary phase, our team (Group G) achieved official scores of 75.58% in Character Error Rate (CER) and 100.97% in Syllable Error Rate (SER). However, in the final phase, both CER and Word Error Rate (WER) reached 100%. Through a retrospective analysis of system design and implementation, we identified three major sources of failure: (1) improper handling of long utterances, where only the first segment was decoded, causing content truncation; (2) inconsistent language prompting, fixed to “Chinese” instead of the Hakka target; and (3) lack of systematic verification in data alignment and submission generation, combined with inadequate evaluation setup. Based on these findings, we propose a set of practical guidelines covering long-utterance processing, language consistency checking, and data submission validation. The results highlight that in low-resource speech recognition tasks, poor data quality or flawed workflow design can cause severe degradation of model performance. This study underscores the importance of robust data and process management in ASR system development and provides concrete insights for future improvements and reproducibility.
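The first failure mode, decoding only the first segment of a long utterance, can be avoided by splitting the audio into windows that together cover every sample. The sketch below is one possible way to compute such windows; it is not the authors' proposed guideline. The function name, the 16 kHz sample rate, and the 5-second overlap are assumptions of this example, while the 30-second window matches Whisper's fixed input length.

```python
def chunk_spans(num_samples, sample_rate=16000, window_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    if overlap_s >= window_s:
        raise ValueError("overlap must be shorter than the window")
    if num_samples <= 0:
        return []
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    spans = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        spans.append((start, end))
        if end >= num_samples:  # the last window reaches the end of the audio
            break
        start += step
    return spans


if __name__ == "__main__":
    # A 70-second utterance at 16 kHz needs more than one 30-second window.
    for start, end in chunk_spans(70 * 16000):
        print(f"decode samples {start}..{end}")
```

Each span would be decoded separately and the overlapping transcripts merged afterwards; the merging step is outside the scope of this sketch. A simple submission check is that the last span ends at the final sample, so no content is silently truncated.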
pdf
bib
abs
The NPTU ASR System for FSR2025 Hakka Character/Pinyin Recognition: Whisper with mBART Post-Editing and RNNLM Rescoring
Yi-Chin Huang
|
Yu-Heng Chen
|
Jian-Hua Wang
|
Hsiu-Chi Wu
|
Chih-Chung Kuo
|
Chao-Shih Huang
|
Yuan-Fu Liao
This paper presents our system for the FSR-2025 Hakka Automatic Speech Recognition (ASR) Challenge, which consists of two sub-tasks: (i) Hakka Characters and (ii) Hakka Pinyin. We propose a unified architecture built upon Whisper [1], a large weakly supervised ASR model, as the acoustic backbone, with optional LoRA (Low-Rank Adaptation [2]) for parameter-efficient fine-tuning. Data augmentation techniques include the MUSAN [3] corpus (music/speech/noise) and tempo/speed perturbation [4]. For the character task, mBART-50 [5,6], a multilingual sequence-to-sequence model, is applied for text correction, while both tasks employ an RNNLM [7] for N-best rescoring. Under the final evaluation setting of the character task, mBART-driven 10-best text correction combined with RNNLM rescoring achieved a CER (Character Error Rate) of 6.26%, whereas the official leaderboard reported 22.5%. For the Pinyin task, the Medium model proved more suitable than the Large model given the dataset size and accent distribution. With 10-best RNNLM rescoring, it achieved a SER (Syllable Error Rate) of 4.65% on our internal warm-up test set, and the official final score (with tone information) was 14.81%. Additionally, we analyze the contribution of LID (Language Identification) for accent recognition across different recording and media sources.
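N-best rescoring, as used in both tasks, can be sketched as re-ranking the ASR hypotheses by a weighted sum of the ASR score and a language-model score. The example below is only an illustration under stated assumptions: the function name, the interpolation weight, and the toy lookup-table language model are hypothetical stand-ins, and the RNNLM and score combination actually used by the authors are not specified in the abstract.

```python
def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """Re-rank hypotheses by ASR log-probability plus weighted LM log-probability.

    nbest: list of (text, asr_logprob) pairs.
    lm_score: callable mapping a text to a log-probability (stand-in for an RNNLM).
    Returns the hypotheses sorted from best to worst combined score.
    """
    def combined(item):
        text, asr_logprob = item
        return asr_logprob + lm_weight * lm_score(text)

    return sorted(nbest, key=combined, reverse=True)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    toy_lm = {"hyp one": -5.0, "hyp two": -2.0}
    nbest = [("hyp one", -1.0), ("hyp two", -1.2)]
    best_text, _ = rescore_nbest(nbest, toy_lm.get)[0]
    print(best_text)
```

With `lm_weight=0` the ASR ranking is left unchanged, which gives a convenient baseline when tuning the weight on a development set.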