Jeffrey Micher


2020

pdf bib
The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0 with Preliminary Machine Translation Results
Eric Joanis | Rebecca Knowles | Roland Kuhn | Samuel Larkin | Patrick Littell | Chi-kiu Lo | Darlene Stewart | Jeffrey Micher
Proceedings of the 12th Language Resources and Evaluation Conference

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.

pdf bib
A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan S Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

2018

pdf bib
Challenges in Speech Recognition and Translation of High-Value Low-Density Polysynthetic Languages
Judith Klavans | John Morgan | Stephen LaRocca | Jeffrey Micher | Clare Voss
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

pdf bib
Using the Nunavut Hansard Data for Experiments in Morphological Analysis and Machine Translation
Jeffrey Micher
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

Inuktitut is a polysynthetic language spoken in Northern Canada and is one of the official languages of the Canadian territory of Nunavut. As such, the Nunavut Legislature publishes all of its proceedings in parallel English and Inuktitut. Several parallel English-Inuktitut corpora from these proceedings have been created from these data and are publically available. The corpus used for current experiments is described. Morphological processing of one of these corpora was carried out and details about the processing are provided. Then, the processed corpus was used in morphological analysis and machine translation (MT) experiments. The morphological analysis experiments aimed to improve the coverage of morphological processing of the corpus, and compare an additional experimental condition to previously published results. The machine translation experiments made use of the additional morphologically analyzed word types in a statistical machine translation system designed to translate to and from Inuktitut morphemes. Results are reported and next steps are defined.

2017

pdf bib
Improving Coverage of an Inuktitut Morphological Analyzer Using a Segmental Recurrent Neural Network
Jeffrey Micher
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2008

pdf bib
Buckwalter-based Lookup Tool as Language Resource for Arabic Language Learners
Jeffrey Micher | Clare Voss
Software Engineering, Testing, and Quality Assurance for Natural Language Processing

pdf bib
Exploitation of an Arabic Language Resource for Machine Translation Evaluation: using Buckwalter-based Lookup Tool to Augment CMU Alignment Algorithm
Clare Voss | Jamal Laoudi | Jeffrey Micher
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Voss et al. (2006) analyzed newswire translations of three DARPA GALE Arabic-English MT systems at the segment level in terms of subjective judgmen+F925t scores, automated metric scores, and correlations among these different score types. At this level of granularity, the correlations are weak. In this paper, we begin to reconcile the subjective and automated scores that underlie these correlations by explicitly “grounding” MT output with its Reference Translation (RT) prior to subjective or automated evaluation. The first two phases of our approach annotate {MT, RT} pairs with the same types of textual comparisons that subjects intuitively apply, while the third phase (not presented here) entails scoring the pairs: (i) automated calculation of “MT-RT hits” using CMU aligner from METEOR, (ii) an extension phase where our Buckwalter-based Lookup Tool serves to generate six other textual comparison categories on items in the MT output that the CMU aligner does not identify, and (iii) given the fully categorized RT & MT pair, a final adequacy score is assigned to the MT output, either by an automated metric based on weighted category counts and segment length, or by a trained human judge.

pdf bib
Boosting performance of weak MT engines automatically: using MT output to align segments & build statistical post-editors
Clare R. Voss | Matthew Aguirre | Jeffrey Micher | Richard Chang | Jamal Laoudi | Reginald Hobbs
Proceedings of the 12th Annual conference of the European Association for Machine Translation

2003

pdf bib
An integrated system for source language checking, analysis and term management
Eric Nyberg | Teruko Mitamura | David Svoboda | Jeongwoo Ko | Kathryn Baker | Jeffrey Micher
Proceedings of Machine Translation Summit IX: System Presentations

This paper presents an overview of the tools provided by KANTOO MT system for controlled source language checking, source text analysis, and terminology management. The steps in each process are described, and screen images are provided to illustrate the system architecture and example tool interfaces.