Zachary William Hopton
Also published as: Zachary Hopton
2026
CTC Regularization for Low-Resource Speech-to-Text Translation
Zachary Hopton | Rico Sennrich
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
The challenges of building speech-to-text translation (ST) systems (e.g., a relative lack of parallel speech–text data and robustness to noise in audio) are exacerbated for low-resource language pairs. In this work, we seek to improve low-resource ST by building on previous studies that regularize ST training with the connectionist temporal classification (CTC) loss. By systematically evaluating a diverse range of linguistic annotations as CTC labels across multiple auxiliary loss configurations, we improve speech translation systems for both low- and high-resource settings. These improvements over both a standard end-to-end ST system and a speech LLM indicate a need for continued research on regularizing speech representations in ST.
The Mediomatix Corpus: Parallel Data for Romansh Language Varieties via Comparable Schoolbooks
Zachary William Hopton | Jannis Vamvas | Andrin Büchler | Anna Rutkiewicz | Rico Cathomas | Rico Sennrich
Findings of the Association for Computational Linguistics: EACL 2026
The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM and a supervised multilingual MT model on the dataset.
2025
The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan
Zachary Hopton | Eleanor Chodroff
Proceedings of the 22nd SIGMORPHON workshop on Computational Morphology, Phonology, and Phonetics
To accurately transcribe a speech signal, automatic speech recognition (ASR) systems must show robustness to a wide range of task-independent variation, such as speaker factors, recording quality, or even “adversarial noise” designed to disrupt performance. We manipulated the dialect composition of fine-tuning data for ASR to study whether balancing the relative proportion of dialects had an impact on models’ robustness to two such sources of variation: dialect variation and adversarial perturbations. We fine-tuned XLSR-53 for Catalan ASR using four different dialect compositions, each containing the Central Catalan dialect. These were defined as 100%, 80%, 50%, and 20% Central Catalan, with the remaining portions split evenly between four other Catalan dialects. While increasing the relative proportion of dialect variants improved models’ dialect robustness, this did not have a meaningful impact on adversarial robustness. These findings suggest that while improvements to ASR can be made by diversifying the training data, such changes do not sufficiently counteract adversarial attacks, leaving the technology open to security threats.
Functional Lexicon in Subword Tokenization
Zachary William Hopton | Yves Scherrer | Tanja Samardzic
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The distinction between function and content units of the lexicon has been somewhat neglected in recent NLP work, but it could still be useful when working with low-resource languages, and, in particular, to improve cross-lingual transfer. In this paper, we investigate to what extent BPE subword tokenization can be used to identify units of the functional lexicon in a language without any annotated data. We analyze subword tokens in terms of their productivity and attempt to find thresholds that best distinguish function from content tokens. On a sample of seven diverse languages, we find that the best results are obtained with 50 BPE merges. We also show that this subword tokenization setting can be beneficial for the interlinear glossing task.
2024
System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task
Joseph Attieh | Zachary William Hopton | Yves Scherrer | Tanja Samardzic
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
This paper presents the system description of the NordicsAlps team for the AmericasNLP 2024 Machine Translation Shared Task 1. We investigate the effect of tokenization on translation quality by exploring two different tokenization schemes: byte-level and redundancy-driven tokenization. We submitted three runs per language pair. The redundancy-driven tokenization ranked first among all submissions, scoring the highest average chrF2++, chrF, and BLEU metrics (averaged across all languages). These findings demonstrate the importance of carefully tailoring the tokenization strategies of machine translation systems, particularly in resource-constrained scenarios.
Modeling Orthographic Variation in Occitan’s Dialects
Zachary William Hopton | Noëmi Aepli
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Effectively normalizing spellings in textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model’s representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model’s embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.