Roland Kuhn

2025

Supporting SENĆOŦEN Language Documentation Efforts with Automatic Speech Recognition
Mengzhe Geng | Patrick Littell | Aidan Pine | Penáć | Marc Tessier | Roland Kuhn
Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages

The SENĆOŦEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENĆOŦEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENĆOŦEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENĆOŦEN language documentation.

2024

Gramble: A Tabular Programming Language for Collaborative Linguistic Modeling
Patrick Littell | Darlene Stewart | Fineen Davis | Aidan Pine | Roland Kuhn
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce Gramble, a domain-specific programming language for linguistic parsing and generation, in the tradition of XFST, TWOLC, and Kleene. Gramble features an intuitive tabular syntax and supports live group programming, allowing community experts to participate more directly in system development without having to be programmers themselves. A cross-platform interpreter is available for Windows, MacOS, and UNIX, supports collaborative programming on the web via Google Sheets, and is released open-source under the MIT license.

2023

We develop an interactive web-based user interface for performing text-speech alignment and creating digital interactive “read-along” audio books that highlight words as they are spoken and allow users to replay individual words when clicked. We build on an existing Python library for zero-shot multilingual text-speech alignment (Littell et al., 2022), extend it by exposing its functionality through a RESTful API, and rewrite the underlying speech recognition engine to run in the browser. The ReadAlong Studio Web App is open-source, user-friendly, prioritizes privacy and data sovereignty, allows for a variety of standard export formats, and is designed to work for the majority of the world’s languages.

2020

This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family), release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences and experiments with machine translation (MT) systems trained on this corpus, free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages), software for implementing text prediction and read-along audiobooks for Indigenous languages, and several other subprojects.

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.

2018

In this article, we discuss which text, speech, and image technologies have been developed, and would be feasible to develop, for the approximately 60 Indigenous languages spoken in Canada. In particular, we concentrate on technologies that may be feasible to develop for most or all of these languages, not just those that may be feasible for the few most-resourced of these. We assess past achievements and consider future horizons for Indigenous language transliteration, text prediction, spell-checking, approximate search, machine translation, speech recognition, speaker diarization, speech synthesis, optical character recognition, and computer-aided language learning.

2017

2016

Bilingual Methods for Adaptive Training Data Selection for Machine Translation
Boxing Chen | Roland Kuhn | George Foster | Colin Cherry | Fei Huang
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

In this paper, we propose a new data selection method which uses semi-supervised convolutional neural networks based on bitokens (Bi-SSCNNs) for training machine translation systems from a large bilingual corpus. In earlier work, we devised a data selection method based on semi-supervised convolutional neural networks (SSCNNs). The new method, Bi-SSCNN, is based on bitokens, which use bilingual information. When the new methods are tested on two translation tasks (Chinese-to-English and Arabic-to-English), they significantly outperform the other three data selection methods in the experiments. We also show that the Bi-SSCNN method is much more effective than other methods in preventing noisy sentence pairs from being chosen for training. More interestingly, this method only needs a tiny amount of in-domain data to train the selection model, which makes fine-grained topic-dependent translation adaptation possible. In the follow-up experiments, we find that neural machine translation (NMT) is more sensitive to noisy data than statistical machine translation (SMT). Therefore, Bi-SSCNN, which can effectively screen out noisy sentence pairs, can benefit NMT much more than SMT. We observed a BLEU improvement of over 3 points on an English-to-French WMT task when Bi-SSCNNs were used.

2015

Multi-level Evaluation for Machine Translation
Boxing Chen | Hongyu Guo | Roland Kuhn
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

Coarse “split and lump” bilingual language models for richer source information in SMT
Darlene Stewart | Roland Kuhn | Eric Joanis | George Foster
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Recently, there has been interest in automatically generated word classes for improving statistical machine translation (SMT) quality: e.g., (Wuebker et al., 2013). We create new models by replacing words with word classes in features applied during decoding; we call these “coarse models”. We find that coarse versions of the bilingual language models (biLMs) of (Niehues et al., 2011) yield larger BLEU gains than the original biLMs. BiLMs provide phrase-based systems with rich contextual information from the source sentence; because they have a large number of types, they suffer from data sparsity. Niehues et al. (2011) mitigated this problem by replacing source or target words with parts of speech (POSs). We vary their approach in two ways: by clustering words on the source or target side over a range of granularities (word clustering), and by clustering the bilingual units that make up biLMs (bitoken clustering). We find that loglinear combinations of the resulting coarse biLMs with each other and with coarse LMs (LMs based on word classes) yield even higher scores than single coarse models. When we add an appealing “generic” coarse configuration chosen on English > French devtest data to four language pairs (keeping the structure fixed, but providing language-pair-specific models for each pair), BLEU gains on blind test data against strong baselines averaged over 5 runs are +0.80 for English > French, +0.35 for French > English, +1.0 for Arabic > English, and +0.6 for Chinese > English.

A comparison of mixture and vector space techniques for translation model adaptation
Boxing Chen | Roland Kuhn | George Foster
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

In this paper, we propose two extensions to the vector space model (VSM) adaptation technique (Chen et al., 2013b) for statistical machine translation (SMT), both of which result in significant improvements. We also systematically compare the VSM techniques to three mixture model adaptation techniques: linear mixture, log-linear mixture (Foster and Kuhn, 2007), and provenance features (Chiang et al., 2011). Experiments on NIST Chinese-to-English and Arabic-to-English tasks show that all methods achieve significant improvement over a competitive non-adaptive baseline. Except for the original VSM adaptation method, all methods yield improvements in the +1.7-2.0 BLEU range. Combining them gives further significant improvements of up to +2.6-3.3 BLEU over the baseline.

In statistical machine translation systems, phrases with similar meanings often have similar but not identical distributions of translations. This paper proposes a new soft clustering method to smooth the conditional translation probabilities for a given phrase with those of semantically similar phrases. We call this semantic smoothing (SS). Moreover, we fabricate new phrase pairs that were not observed in training data, but which may be used for decoding. In learning curve experiments against a strong baseline, we obtain a consistent pattern of modest improvement from semantic smoothing, and further modest improvement from phrase pair fabrication.

Unpacking and Transforming Feature Functions: New Ways to Smooth Phrase Tables
Boxing Chen | Roland Kuhn | George Foster | Howard Johnson
Proceedings of Machine Translation Summit XIII: Papers

AMBER: A Modified BLEU, Enhanced Ranking Metric
Boxing Chen | Roland Kuhn
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Translating Structured Documents
George Foster | Pierre Isabelle | Roland Kuhn
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Machine Translation traditionally treats documents as sets of independent sentences. In many genres, however, documents are highly structured, and their structure contains information that can be used to improve translation quality. We present a preliminary approach to document translation that uses structural features to modify the behaviour of a language model, at sentence-level granularity. To our knowledge, this is the first attempt to incorporate structural information into statistical MT. In experiments on structured English/French documents from the Hansard corpus, we demonstrate small but statistically significant improvements.

Phrase Clustering for Smoothing TM Probabilities - or, How to Extract Paraphrases from Phrase Tables
Roland Kuhn | Boxing Chen | George Foster | Evan Stratford
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation
George Foster | Cyril Goutte | Roland Kuhn
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Bilingual Sense Similarity for Statistical Machine Translation
Boxing Chen | George Foster | Roland Kuhn
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Fast Consensus Hypothesis Regeneration for Machine Translation
Boxing Chen | George Foster | Roland Kuhn
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

MT: the Current Research Landscape
Roland Kuhn | Pierre Isabelle (National Research Council Canada)
Proceedings of Machine Translation Summit XII: Plenaries

PortageLive: delivering machine translation technology via virtualization
Patrick Paul | Samuel Larkin | Ulrich Germann | Eric Joanis | Roland Kuhn
Proceedings of Machine Translation Summit XII: Plenaries

Phrase Translation Model Enhanced with Association based Features
Boxing Chen | George Foster | Roland Kuhn
Proceedings of Machine Translation Summit XII: Papers

Stabilizing Minimum Error Rate Training
George Foster | Roland Kuhn
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

2007

Improving Translation Quality by Discarding Most of the Phrasetable
Howard Johnson | Joel Martin | George Foster | Roland Kuhn
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Integration of an Arabic Transliteration Module into a Statistical Machine Translation System
Mehdi M. Kashani | Eric Joanis | Roland Kuhn | George Foster | Fred Popowich
Proceedings of the Second Workshop on Statistical Machine Translation

Mixture-Model Adaptation for SMT
George Foster | Roland Kuhn
Proceedings of the Second Workshop on Statistical Machine Translation

Rule-Based Translation with Statistical Phrase-Based Post-Editing
Michel Simard | Nicola Ueffing | Pierre Isabelle | Roland Kuhn
Proceedings of the Second Workshop on Statistical Machine Translation

2006

Système de traduction automatique statistique combinant différentes ressources
Fatiha Sadat | George Foster | Roland Kuhn
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

This article describes an approach that combines different statistical models for phrase-based machine translation. To this end, different resources are used, including two parallel corpora with different characteristics and a bilingual terminology dictionary, in order to improve the quantitative and qualitative performance of the translation system. We evaluate our approach on the French-English language pair and show how combining the proposed resources significantly improves the results.