Paul McNamee

2024

Conversational speech translation is an important technology that fosters communication among people of different language backgrounds. Three-way parallel data in the form of source speech, source transcript, and target translation is usually required to train end-to-end systems. However, such datasets are not readily available and are expensive to create as this involves multiple annotation stages. In this paper, we investigate the use of synthetic data from generative models, namely machine translation and text-to-speech synthesis, for training conversational speech translation systems. We show that adding synthetic data to the training recipe increasingly improves end-to-end training performance, especially when limited real data is available. However, when no real data is available, no amount of synthetic data helps.

Johns Hopkins University (JHU) submitted systems for all eight language pairs in the 2024 Low-Resource Language Track. The main effort of this work revolves around fine-tuning large and publicly available models in three proposed systems: i) end-to-end speech translation (ST) fine-tuning of Seamless4MT v2; ii) ST fine-tuning of Whisper; iii) a cascaded system involving automatic speech recognition with fine-tuned Whisper and machine translation with NLLB. On top of systems above, we conduct a comparative analysis on different training paradigms, such as intra-distillation for NLLB as well as joint training and curriculum learning for SeamlessM4T v2. Our results show that the best-performing approach differs by language pairs, but that i) fine-tuned SeamlessM4T v2 tends to perform best for source languages on which it was pre-trained, ii) multi-task training helps Whisper fine-tuning, iii) cascaded systems with Whisper and NLLB tend to outperform Whisper alone, and iv) intra-distillation helps NLLB fine-tuning.

2023

pdf bib abs
A Hyperparameter Optimization Toolkit for Neural Machine Translation Research
Xuan Zhang | Kevin Duh | Paul McNamee
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Hyperparameter optimization is an important but often overlooked process in the research of deep learning technologies. To obtain a good model, one must carefully tune hyperparameters that determine the architecture and training algorithm. Insufficient tuning may result in poor results, while inequitable tuning may lead to exaggerated differences between models. We present a hyperparameter optimization toolkit for neural machine translation (NMT) to help researchers focus their time on the creative rather than the mundane. The toolkit is implemented as a wrapper on top of the open-source Sockeye NMT software. Using the Asynchronous Successive Halving Algorithm (ASHA), we demonstrate that it is possible to discover near-optimal models under a computational budget with little effort. Code: https://github.com/kevinduh/sockeye-recipes3Video demo: https://cs.jhu.edu/kevinduh/j/demo.mp4

pdf bib abs
An Extensive Exploration of Back-Translation in 60 Languages
Paul McNamee | Kevin Duh
Findings of the Association for Computational Linguistics: ACL 2023

Back-translation is a data augmentation technique that has been shown to improve model quality through the creation of synthetic training bitext. Early studies showed the promise of the technique and follow on studies have produced additional refinements. We have undertaken a broad investigation using back-translation to train models from 60 languages into English; the majority of these languages are considered moderate- or low-resource languages. We observed consistent gains, though compared to prior work we saw conspicuous gains in quite a number of lower-resourced languages. We analyzed differences in translations between baseline and back-translation models, and observed many indications of improved translation quality. Translation of both rare and common terms is improved, and these improvements occur despite the less natural synthetic source-language text used in training.

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

2022

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation. A total of 27 teams participated in at least one of the shared tasks. This paper details, for each shared task, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved.

pdf bib abs
The Multilingual Microblog Translation Corpus: Improving and Evaluating Translation of User-Generated Text
Paul McNamee | Kevin Duh
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Translation of the noisy, informal language found in social media has been an understudied problem, with a principal factor being the limited availability of translation corpora in many languages. To address this need we have developed a new corpus containing over 200,000 translations of microblog posts that supports translation of thirteen languages into English. The languages are: Arabic, Chinese, Farsi, French, German, Hindi, Korean, Pashto, Portuguese, Russian, Spanish, Tagalog, and Urdu. We are releasing these data as the Multilingual Microblog Translation Corpus to support futher research in translation of informal language. We establish baselines using this new resource, and we further demonstrate the utility of the corpus by conducting experiments with fine-tuning to improve translation quality from a high performing neural machine translation (NMT) system. Fine-tuning provided substantial gains, ranging from +3.4 to +11.1 BLEU. On average, a relative gain of 21% was observed, demonstrating the utility of the corpus.

2020

pdf bib abs
Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
Kevin Duh | Paul McNamee | Matt Post | Brian Thompson
Proceedings of the Twelfth Language Resources and Evaluation Conference

Research in machine translation (MT) is developing at a rapid pace. However, most work in the community has focused on languages where large amounts of digital resources are available. In this study, we benchmark state of the art statistical and neural machine translation systems on two African languages which do not have large amounts of resources: Somali and Swahili. These languages are of social importance and serve as test-beds for developing technologies that perform reasonably well despite the low-resource constraint. Our findings suggest that statistical machine translation (SMT) and neural machine translation (NMT) can perform similarly in low-resource scenarios, but neural systems require more careful tuning to match performance. We also investigate how to exploit additional data, such as bilingual text harvested from the web, or user dictionaries; we find that NMT can significantly improve in performance with the use of these additional data. Finally, we survey the landscape of machine translation resources for the languages of Africa and provide some suggestions for promising future research directions.

pdf bib abs
Tagging Location Phrases in Text
Paul McNamee | James Mayfield | Cash Costello | Caitlyn Bishop | Shelby Anderson
Proceedings of the Twelfth Language Resources and Evaluation Conference

For over thirty years researchers have studied the problem of automatically detecting named entities in written language. Throughout this time the majority of such work has focused on detection and classification of entities into coarse-grained types like: PERSON, ORGANIZATION, and LOCATION. Less attention has been focused on non-named mentions of entities, including non-named location phrases such as “the medical clinic in Telonge” or “2 km below the Dolin Maniche bridge”. In this work we describe the Location Phrase Detection task to identify such spans. Our key accomplishments include: developing a sequential tagging approach; crafting annotation guidelines; building annotated datasets for English and Russian news; and, conducting experiments in automated detection of location phrases with both statistical and neural taggers. This work is motivated by extracting rich location information to support situational awareness during humanitarian crises such as natural disasters.

pdf bib abs
Dragonfly: Advances in Non-Speaker Annotation for Low Resource Languages
Cash Costello | Shelby Anderson | Caitlyn Bishop | James Mayfield | Paul McNamee
Proceedings of the Twelfth Language Resources and Evaluation Conference

Dragonfly is an open source software tool that supports annotation of text in a low resource language by non-speakers of the language. Using semantic and contextual information, non-speakers of a language familiar with the Latin script can produce high quality named entity annotations to support construction of a name tagger. We describe a procedure for annotating low resource languages using Dragonfly that others can use, which we developed based on our experience annotating data in more than ten languages. We also present performance comparisons between models trained on native speaker and non-speaker annotations.

2019

pdf bib abs
Curriculum Learning for Domain Adaptation in Neural Machine Translation
Xuan Zhang | Pamela Shapiro | Gaurav Kumar | Paul McNamee | Marine Carpuat | Kevin Duh
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce a curriculum learning approach to adapt generic neural machine translation models to a specific domain. Samples are grouped by their similarities to the domain of interest and each group is fed to the training algorithm with a particular schedule. This approach is simple to implement on top of any neural framework or architecture, and consistently outperforms both unadapted and adapted baselines in experiments with two distinct domains and two language pairs.

pdf bib abs
JHU System Description for the MADAR Arabic Dialect Identification Shared Task
Tom Lippincott | Pamela Shapiro | Kevin Duh | Paul McNamee
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Our submission to the MADAR shared task on Arabic dialect identification employed a language modeling technique called Prediction by Partial Matching, an ensemble of neural architectures, and sources of additional data for training word embeddings and auxiliary language models. We found several of these techniques provided small boosts in performance, though a simple character-level language model was a strong baseline, and a lower-order LM achieved best performance on Subtask 2. Interestingly, word embeddings provided no consistent benefit, and ensembling struggled to outperform the best component submodel. This suggests the variety of architectures are learning redundant information, and future work may focus on encouraging decorrelated learning.

pdf bib
Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation
Marianna Martindale | Marine Carpuat | Kevin Duh | Paul McNamee
Proceedings of Machine Translation Summit XVII: Research Track

pdf bib
JHU LoResMT 2019 Shared Task System Description
Paul McNamee
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

We demonstrate two annotation platforms that allow an English speaker to annotate names for any language without knowing the language. These platforms provided high-quality ’‘silver standard” annotations for low-resource language name taggers (Zhang et al., 2017) that achieved state-of-the-art performance on two surprise languages (Oromo and Tigrinya) at LoreHLT20171 and ten languages at TAC-KBP EDL2017 (Ji et al., 2017). We discuss strengths and limitations and compare other methods of creating silver- and gold-standard annotations using native speakers. We will make our tools publicly available for research use.

To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component’s contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, compared to a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.

2017

pdf bib abs
Language-Independent Named Entity Analysis Using Parallel Projection and Rule-Based Disambiguation
James Mayfield | Paul McNamee | Cash Costello
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

The 2017 shared task at the Balto-Slavic NLP workshop requires identifying coarse-grained named entities in seven languages, identifying each entity’s base form, and clustering name mentions across the multilingual set of documents. The fact that no training data is provided to systems for building supervised classifiers further adds to the complexity. To complete the task we first use publicly available parallel texts to project named entity recognition capability from English to each evaluation language. We ignore entirely the subtask of identifying non-inflected forms of names. Finally, we create cross-document entity identifiers by clustering named mentions using a procedure-based approach.

2016

bib
Putting the “human” back in HLT: The importance of human evaluation in assessing the quality and potential uses of translation technology
Erica Michael | Petra Bradley | Paul McNamee | Matt Post
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

pdf bib abs
Language and Dialect Discrimination Using Compression-Inspired Language Models
Paul McNamee
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

The DSL 2016 shared task continued previous evaluations from 2014 and 2015 that facilitated the study of automated language and dialect identification. This paper describes results for this year’s shared task and from several related experiments conducted at the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLTCOE). Previously the HLTCOE has explored the use of compression-inspired language modeling for language and dialect identification, using news, Wikipedia, blog post, and Twitter corpora. The technique we have relied upon is based on prediction by partial matching (PPM), a state of the art text compression technique. Due to the close relationship between adaptive compression and language modeling, such compression techniques can also be applied to multi-way text classification problems, and previous studies have examined tasks such as authorship attribution, email spam detection, and topical classification. We applied our approach to the multi-class decision that considered each dialect or language as a possibility for the given shared task input line. Results for test-set A were in accord with our expectations, however results for test-sets B and C appear to be markedly worse. We had not anticipated the inclusion of multiple communications in differing languages in test-set B (social media) input lines, and had not expected the test-set C (dialectal Arabic) data to be represented phonetically instead of in native orthography.

2014

pdf bib
A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards
Jacqueline Aguilar | Charley Beller | Paul McNamee | Benjamin Van Durme | Stephanie Strassel | Zhiyi Song | Joe Ellis
Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation

2013

2012

pdf bib abs
Creating and Curating a Cross-Language Person-Entity Linking Collection
Dawn Lawrie | James Mayfield | Paul McNamee | Douglas Oard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

To stimulate research in cross-language entity linking, we present a new test collection for evaluating the accuracy of cross-language entity linking in twenty-one languages. This paper describes an efficient way to create and curate such a collection, judiciously exploiting existing language resources. Queries are created by semi-automatically identifying person names on the English side of a parallel corpus, using judgments obtained through crowdsourcing to identify the entity corresponding to the name, and projecting the English name onto the non-English document using word alignments. Name projections are then curated, again through crowdsourcing. This technique resulted in the first publicly available multilingual cross-language entity linking collection. The collection includes approximately 55,000 queries, comprising between 875 and 4,329 queries for each of twenty-one non-English languages.

pdf bib
Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma | Paul McNamee | Mossaab Bagdouri | Clayton Fink | Theresa Wilson
Proceedings of the Second Workshop on Language in Social Media

2011

pdf bib
Cross-Language Entity Linking
Paul McNamee | James Mayfield | Dawn Lawrie | Douglas Oard | David Doermann
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
Entity Disambiguation for Knowledge Base Population
Mark Dredze | Paul McNamee | Delip Rao | Adam Gerber | Tim Finin
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Streaming Cross Document Entity Coreference Resolution
Delip Rao | Paul McNamee | Mark Dredze
Coling 2010: Posters

pdf bib abs
An Evaluation of Technologies for Knowledge Base Population
Paul McNamee | Hoa Trang Dang | Heather Simpson | Patrick Schone | Stephanie M. Strassel
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Previous content extraction evaluations have neglected to address problems which complicate the incorporation of extracted information into an existing knowledge base. Previous question answering evaluations have likewise avoided tasks such as explicit disambiguation of target entities and handling a fixed set of questions about entities without previous determination of possible answers. In 2009 NIST conducted a Knowledge Base Population track at its Text Analysis Conference to unite the content extraction and question answering communities and jointly explore some of these issues. This exciting new evaluation attracted 13 teams from 6 countries that submitted results in two tasks, Entity Linking and Slot Filling. This paper explains the motivation and design of the tasks, describes the language resources that were developed for this evaluation, offers comparisons to previous community evaluations, and briefly summarizes the performance obtained by systems. We also identify relevant issues pertaining to target selection, challenging queries, and performance measures.

pdf bib abs
Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population
Heather Simpson | Stephanie Strassel | Robert Parker | Paul McNamee
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Text Analysis Conference (TAC) is a series of Natural Language Processing evaluation workshops organized by the National Institute of Standards and Technology. The Knowledge Base Population (KBP) track at TAC 2009, a hybrid descendant of the TREC Question Answering track and the Automated Content Extraction (ACE) evaluation program, is designed to support development of systems that are capable of automatically populating a knowledge base with information about entities mined from unstructured text. An important component of the KBP evaluation is the Entity Linking task, where systems must accurately associate text mentions of unknown Person (PER), Organization (ORG), and Geopolitical (GPE) names to entries in a knowledge base. Linguistic Data Consortium (LDC) at the University of Pennsylvania creates and distributes linguistic resources including data, annotations, system assessment, tools and specifications for the TAC KBP evaluations. This paper describes the 2009 resource creation efforts, with particular focus on the selection and development of named entity mentions for the Entity Linking task evaluation.

Accurately translating multiword expressions is important to obtain good performance in machine translation, cross-language information retrieval, and other multilingual tasks in human language technology. Existing approaches to inducing translation equivalents of multiword units have focused on agglomerating individual words or on aligning words in a statistical machine translation system. We present a different approach based upon information theoretic heuristics and the exact counting of frequencies of occurrence of multiword strings in aligned parallel corpora. We are applying a technique introduced by Yamamoto and Church that uses suffix arrays and longest common prefix arrays. Evaluation of the method in multiple language pairs was performed using bilingual lexicons of domain-specific terminology as a gold standard. We found that performance of 50-70%, as measured by mean reciprocal rank, can be obtained for terms that occur more than 10 or so times.