Marine Carpuat


2021

pdf bib
A Review of Human Evaluation for Style Transfer
Eleftheria Briakou | Sweta Agrawal | Ke Zhang | Joel Tetreault | Marine Carpuat
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

This paper reviews and summarizes human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency. In principle, evaluations by human raters should be the most reliable. However, in style transfer papers, we find that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.

pdf bib
Beyond Noise: Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation
Eleftheria Briakou | Marine Carpuat
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

While it has been shown that Neural Machine Translation (NMT) is highly sensitive to noisy parallel training samples, prior work treats all types of mismatches between source and target as noise. As a result, it remains unclear how samples that are mostly equivalent but contain a small number of semantically divergent tokens impact NMT training. To close this gap, we analyze the impact of different types of fine-grained semantic divergences on Transformer models. We show that models trained on synthetic divergences output degenerated text more frequently and are less confident in their predictions. Based on these findings, we introduce a divergent-aware NMT framework that uses factors to help NMT recover from the degradation caused by naturally occurring divergences, improving both translation quality and model calibration on EN-FR tasks.

pdf bib
Machine Translation Believability
Marianna Martindale | Kevin Duh | Marine Carpuat
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Successful Machine Translation (MT) deployment requires understanding not only the intrinsic qualities of MT output, such as fluency and adequacy, but also user perceptions. Users who do not understand the source language respond to MT output based on their perception of the likelihood that the meaning of the MT output matches the meaning of the source text. We refer to this as believability. Output that is not believable may be off-putting to users, but believable MT output with incorrect meaning may mislead them. In this work, we study the relationship of believability to fluency and adequacy by applying traditional MT direct assessment protocols to annotate all three features on the output of neural MT systems. Quantitative analysis of these annotations shows that believability is closely related to but distinct from fluency, and initial qualitative analysis suggests that semantic features may account for the difference.

pdf bib
A Non-Autoregressive Edit-Based Approach to Controllable Text Simplification
Sweta Agrawal | Weijia Xu | Marine Carpuat
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
Weijia Xu | Shuming Ma | Dongdong Zhang | Marine Carpuat
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Dual Reconstruction: a Unifying Objective for Semi-Supervised Neural Machine Translation
Weijia Xu | Xing Niu | Marine Carpuat
Findings of the Association for Computational Linguistics: EMNLP 2020

While Iterative Back-Translation and Dual Learning effectively incorporate monolingual training data in neural machine translation, they use different objectives and heuristic gradient approximation strategies, and have not been extensively compared. We introduce a novel dual reconstruction objective that provides a unified view of Iterative Back-Translation and Dual Learning. It motivates a theoretical analysis and controlled empirical study on German-English and Turkish-English tasks, which both suggest that Iterative Back-Translation is more effective than Dual Learning despite its relative simplicity.

pdf bib
Evaluating a Bi-LSTM Model for Metaphor Detection in TOEFL Essays
Kevin Kuo | Marine Carpuat
Proceedings of the Second Workshop on Figurative Language Processing

This paper describes systems submitted to the Metaphor Shared Task at the Second Workshop on Figurative Language Processing. In this submission, we replicate the evaluation of the Bi-LSTM model introduced by Gao et al.(2018) on the VUA corpus in a new setting: TOEFL essays written by non-native English speakers. Our results show that Bi-LSTM models outperform feature-rich linear models on this challenging task, which is consistent with prior findings on the VUA dataset. However, the Bi-LSTM models lag behind the best performing systems in the shared task.

pdf bib
The University of Maryland’s Submissions to the WMT20 Chat Translation Task: Searching for More Data to Adapt Discourse-Aware Neural Machine Translation
Calvin Bao | Yow-Ting Shiue | Chujun Song | Jie Li | Marine Carpuat
Proceedings of the Fifth Conference on Machine Translation

This paper describes the University of Maryland’s submissions to the WMT20 Shared Task on Chat Translation. We focus on translating agent-side utterances from English to German. We started from an off-the-shelf BPE-based standard transformer model trained with WMT17 news and fine-tuned it with the provided in-domain training data. In addition, we augment the training set with its best matches in the WMT19 news dataset. Our primary submission uses a standard Transformer, while our contrastive submissions use multi-encoder Transformers to attend to previous utterances. Our primary submission achieves 56.7 BLEU on the agent side (en→de), outperforming a baseline system provided by the task organizers by more than 13 BLEU points. Moreover, according to an evaluation on a set of carefully-designed examples, the multi-encoder architecture is able to generate more coherent translations.

pdf bib
Incorporating Terminology Constraints in Automatic Post-Editing
David Wan | Chris Kedzie | Faisal Ladhak | Marine Carpuat | Kathleen McKeown
Proceedings of the Fifth Conference on Machine Translation

Users of machine translation (MT) may want to ensure the use of specific lexical terminologies. While there exist techniques for incorporating terminology constraints during inference for MT, current APE approaches cannot ensure that they will appear in the final translation. In this paper, we present both autoregressive and non-autoregressive models for lexically constrained APE, demonstrating that our approach enables preservation of 95% of the terminologies and also improves translation quality on English-German benchmarks. Even when applied to lexically constrained MT output, our approach is able to improve preservation of the terminologies. However, we show that our models do not learn to copy constraints systematically and suggest a simple data augmentation technique that leads to improved performance and robustness.

pdf bib
Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
Eleftheria Briakou | Marine Carpuat
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on the Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence-pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.

bib
Multitask Models for Controlling the Complexity of Neural Machine Translation
Sweta Agrawal | Marine Carpuat
Proceedings of the The Fourth Widening Natural Language Processing Workshop

We introduce a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a novel dataset of news articles available in English and Spanish and written for diverse reading grade levels. We leverage this dataset to train multitask sequence to sequence models that translate Spanish into English targeted at an easier reading grade level than the original Spanish. We show that multitask models outperform pipeline approaches that translate and simplify text independently.

bib
An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages
Aquia Richburg | Ramy Eskander | Smaranda Muresan | Marine Carpuat
Proceedings of the The Fourth Widening Natural Language Processing Workshop

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.

pdf bib
Generating Diverse Translations via Weighted Fine-tuning and Hypotheses Filtering for the Duolingo STAPLE Task
Sweta Agrawal | Marine Carpuat
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes the University of Maryland’s submission to the Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Unlike the standard machine translation task, STAPLE requires generating a set of outputs for a given input sequence, aiming to cover the space of translations produced by language learners. We adapt neural machine translation models to this requirement by (a) generating n-best translation hypotheses from a model fine-tuned on learner translations, oversampled to reflect the distribution of learner responses, and (b) filtering hypotheses using a feature-rich binary classifier that directly optimizes a close approximation of the official evaluation metric. Combination of systems that use these two strategies achieves F1 scores of 53.9% and 52.5% on Vietnamese and Portuguese, respectively ranking 2nd and 4th on the leaderboard.

2019

pdf bib
Controlling Text Complexity in Neural Machine Translation
Sweta Agrawal | Marine Carpuat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This work introduces a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a high quality dataset of news articles available in English and Spanish, written for diverse grade levels and propose a method to align segments across comparable bilingual articles. The resulting dataset makes it possible to train multi-task sequence to sequence models that can translate and simplify text jointly. We show that these multi-task models outperform pipeline approaches that translate and simplify text independently.

pdf bib
Weakly Supervised Cross-lingual Semantic Relation Classification via Knowledge Distillation
Yogarshi Vyas | Marine Carpuat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Words in different languages rarely cover the exact same semantic space. This work characterizes differences in meaning between words across languages using semantic relations that have been used to relate the meaning of English words. However, because of translation ambiguity, semantic relations are not always preserved by translation. We introduce a cross-lingual relation classifier trained only with English examples and a bilingual dictionary. Our classifier relies on a novel attention-based distillation approach to account for translation ambiguity when transferring knowledge from English to cross-lingual settings. On new English-Chinese and English-Hindi test sets, the resulting models largely outperform baselines that more naively rely on bilingual embeddings or dictionaries for cross-lingual transfer, and approach the performance of fully supervised systems on English tasks.

pdf bib
Bi-Directional Differentiable Input Reconstruction for Low-Resource Neural Machine Translation
Xing Niu | Weijia Xu | Marine Carpuat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We aim to better exploit the limited amounts of parallel text available in low-resource settings by introducing a differentiable reconstruction loss for neural machine translation (NMT). This loss compares original inputs to reconstructed inputs, obtained by back-translating translation hypotheses into the input language. We leverage differentiable sampling and bi-directional NMT to train models end-to-end, without introducing additional parameters. This approach achieves small but consistent BLEU improvements on four language pairs in both translation directions, and outperforms an alternative differentiable reconstruction strategy based on hidden states.

pdf bib
Curriculum Learning for Domain Adaptation in Neural Machine Translation
Xuan Zhang | Pamela Shapiro | Gaurav Kumar | Paul McNamee | Marine Carpuat | Kevin Duh
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce a curriculum learning approach to adapt generic neural machine translation models to a specific domain. Samples are grouped by their similarities to the domain of interest and each group is fed to the training algorithm with a particular schedule. This approach is simple to implement on top of any neural framework or architecture, and consistently outperforms both unadapted and adapted baselines in experiments with two distinct domains and two language pairs.

pdf bib
Differentiable Sampling with Flexible Reference Word Order for Neural Machine Translation
Weijia Xu | Xing Niu | Marine Carpuat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Despite some empirical success at correcting exposure bias in machine translation, scheduled sampling algorithms suffer from a major drawback: they incorrectly assume that words in the reference translations and in sampled sequences are aligned at each time step. Our new differentiable sampling algorithm addresses this issue by optimizing the probability that the reference can be aligned with the sampled output, based on a soft alignment predicted by the model itself. As a result, the output distribution at each time step is evaluated with respect to the whole predicted sequence. Experiments on IWSLT translation tasks show that our approach improves BLEU compared to maximum likelihood and scheduled sampling baselines. In addition, our approach is simpler to train with no need for sampling schedule and yields models that achieve larger improvements with smaller beam sizes.

pdf bib
The University of Maryland’s Kazakh-English Neural Machine Translation System at WMT19
Eleftheria Briakou | Marine Carpuat
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the University of Maryland’s submission to the WMT 2019 Kazakh-English news translation task. We study the impact of transfer learning from another low-resource but related language. We experiment with different ways of encoding lexical units to maximize lexical overlap between the two language pairs, as well as back-translation and ensembling. The submitted system improves over a Kazakh-only baseline by +5.45 BLEU on newstest2019.

pdf bib
Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation
Marianna Martindale | Marine Carpuat | Kevin Duh | Paul McNamee
Proceedings of Machine Translation Summit XVII: Research Track

2018

pdf bib
Multi-Task Neural Models for Translating Between Styles Within and Across Languages
Xing Niu | Sudha Rao | Marine Carpuat
Proceedings of the 27th International Conference on Computational Linguistics

Generating natural language requires conveying content in an appropriate style. We explore two related tasks on generating text of varying formality: monolingual formality transfer and formality-sensitive machine translation. We propose to solve these tasks jointly using multi-task learning, and show that our models achieve state-of-the-art performance for formality transfer and are able to perform formality-sensitive translation without being explicitly trained on style-annotated translation examples.

pdf bib
Robust Cross-Lingual Hypernymy Detection Using Dependency Context
Shyam Upadhyay | Yogarshi Vyas | Marine Carpuat | Dan Roth
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Cross-lingual Hypernymy Detection involves determining if a word in one language (“fruit”) is a hypernym of a word in another language (“pomme” i.e. apple in French). The ability to detect hypernymy cross-lingually can aid in solving cross-lingual versions of tasks such as textual entailment and event coreference. We propose BiSparse-Dep, a family of unsupervised approaches for cross-lingual hypernymy detection, which learns sparse, bilingual word embeddings based on dependency contexts. We show that BiSparse-Dep can significantly improve performance on this task, compared to approaches based only on lexical context. Our approach is also robust, showing promise for low-resource settings: our dependency-based embeddings can be learned using a parser trained on related languages, with negligible loss in performance. We also crowd-source a challenging dataset for this task on four languages – Russian, French, Arabic, and Chinese. Our embeddings and datasets are publicly available.

pdf bib
Identifying Semantic Divergences in Parallel Text without Annotations
Yogarshi Vyas | Xing Niu | Marine Carpuat
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recognizing that even correct translations are not always semantically equivalent, we automatically detect meaning divergences in parallel sentence pairs with a deep neural model of bilingual semantic similarity which can be trained for any parallel corpus without any manual annotation. We show that our semantic model detects divergences more accurately than models based on surface features derived from word alignments, and that these divergences matter for neural machine translation.

pdf bib
Fluency Over Adequacy: A Pilot Study in Measuring User Trust in Imperfect MT
Marianna Martindale | Marine Carpuat
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
Bi-Directional Neural Machine Translation with Synthetic Parallel Data
Xing Niu | Michael Denkowski | Marine Carpuat
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

Despite impressive progress in high-resource settings, Neural Machine Translation (NMT) still struggles in low-resource and out-of-domain scenarios, often failing to match the quality of phrase-based translation. We propose a novel technique that combines back-translation and multilingual NMT to improve performance in these difficult cases. Our technique trains a single model for both directions of a language pair, allowing us to back-translate source or target monolingual data without requiring an auxiliary model. We then continue training on the augmented parallel data, enabling a cycle of improvement for a single model that can incorporate any source, target, or parallel data to improve both translation directions. As a byproduct, these models can reduce training and deployment costs significantly compared to uni-directional models. Extensive experiments show that our technique outperforms standard back-translation in low-resource scenarios, improves quality on cross-domain tasks, and effectively reduces costs across the board.

pdf bib
The University of Maryland’s Chinese-English Neural Machine Translation Systems at WMT18
Weijia Xu | Marine Carpuat
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the University of Maryland’s submission to the WMT 2018 Chinese↔English news translation tasks. Our systems are BPE-based self-attentional Transformer networks with parallel and backtranslated monolingual training data. Using ensembling and reranking, we improve over the Transformer baseline by +1.4 BLEU for Chinese→English and +3.97 BLEU for English→Chinese on newstest2017. Our best systems reach BLEU scores of 24.4 for Chinese→English and 39.0 for English→Chinese on newstest2018.

pdf bib
Proceedings of The 12th International Workshop on Semantic Evaluation
Marianna Apidianaki | Saif M. Mohammad | Jonathan May | Ekaterina Shutova | Steven Bethard | Marine Carpuat
Proceedings of The 12th International Workshop on Semantic Evaluation

pdf bib
UMD at SemEval-2018 Task 10: Can Word Embeddings Capture Discriminative Attributes?
Alexander Zhang | Marine Carpuat
Proceedings of The 12th International Workshop on Semantic Evaluation

We describe the University of Maryland’s submission to SemEval-018 Task 10, “Capturing Discriminative Attributes”: given word triples (w1, w2, d), the goal is to determine whether d is a discriminating attribute belonging to w1 but not w2. Our study aims to determine whether word embeddings can address this challenging task. Our submission casts this problem as supervised binary classification using only word embedding features. Using a gaussian SVM model trained only on validation data results in an F-score of 60%. We also show that cosine similarity features are more effective, both in unsupervised systems (F-score of 65%) and supervised systems (F-score of 67%).

2017

pdf bib
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
Marine Carpuat | Yogarshi Vyas | Xing Niu
Proceedings of the First Workshop on Neural Machine Translation

Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves translation quality.

pdf bib
Discovering Stylistic Variations in Distributional Vector Space Models via Lexical Paraphrases
Xing Niu | Marine Carpuat
Proceedings of the Workshop on Stylistic Variation

Detecting and analyzing stylistic variation in language is relevant to diverse Natural Language Processing applications. In this work, we investigate whether salient dimensions of style variations are embedded in standard distributional vector spaces of word meaning. We hypothesizes that distances between embeddings of lexical paraphrases can help isolate style from meaning variations and help identify latent style dimensions. We conduct a qualitative analysis of latent style dimensions, and show the effectiveness of identified style subspaces on a lexical formality prediction task.

pdf bib
Detecting Asymmetric Semantic Relations in Context: A Case-Study on Hypernymy Detection
Yogarshi Vyas | Marine Carpuat
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

We introduce WHiC, a challenging testbed for detecting hypernymy, an asymmetric relation between words. While previous work has focused on detecting hypernymy between word types, we ground the meaning of words in specific contexts drawn from WordNet examples, and require predictions to be sensitive to changes in contexts. WHiC lets us analyze complementary properties of two approaches of inducing vector representations of word meaning in context. We show that such contextualized word representations also improve detection of a wider range of semantic relations in context.

pdf bib
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Steven Bethard | Marine Carpuat | Marianna Apidianaki | Saif M. Mohammad | Daniel Cer | David Jurgens
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

pdf bib
Proceedings of ACL 2017, Student Research Workshop
Allyson Ettinger | Spandana Gella | Matthieu Labeau | Cecilia Ovesdotter Alm | Marine Carpuat | Mark Dredze
Proceedings of ACL 2017, Student Research Workshop

pdf bib
A Study of Style in Machine Translation: Controlling the Formality of Machine Translation Output
Xing Niu | Marianna Martindale | Marine Carpuat
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Stylistic variations of language, such as formality, carry speakers’ intention beyond literal meaning and should be conveyed adequately in translation. We propose to use lexical formality models to control the formality level of machine translation output. We demonstrate the effectiveness of our approach in empirical evaluations, as measured by automatic metrics and human assessments.

2016

pdf bib
Learning Monolingual Compositional Representations via Bilingual Supervision
Ahmed Elgohary | Marine Carpuat
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment
Yogarshi Vyas | Marine Carpuat
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Retrofitting Sense-Specific Word Vectors Using Parallel Text
Allyson Ettinger | Philip Resnik | Marine Carpuat
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Steven Bethard | Marine Carpuat | Daniel Cer | David Jurgens | Preslav Nakov | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM)
Nathan Schneider | Dirk Hovy | Anders Johannsen | Marine Carpuat
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marine Carpuat | Eneko Agirre | Nora Aranberri
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Proceedings of the Second Workshop on Discourse in Machine Translation
Bonnie Webber | Marine Carpuat | Andrei Popescu-Belis | Christian Hardmeier
Proceedings of the Second Workshop on Discourse in Machine Translation

pdf bib
Connotation in Translation
Marine Carpuat
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
The UMD machine translation systems at IWSLT 2015
Amittai Axelrod | Marine Carpuat
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
Class-based N-gram language difference models for data selection
Amittai Axelrod | Yogarshi Vyas | Marianna Martindale | Marine Carpuat
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

2014

pdf bib
Linear Mixture Models for Robust Machine Translation
Marine Carpuat | Cyril Goutte | George Foster
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Mixed Language and Code-Switching in the Canadian Hansard
Marine Carpuat
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marine Carpuat | Xavier Carreras | Eva Maria Vecchi
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
The NRC System for Discriminating Similar Languages
Cyril Goutte | Serge Léger | Marine Carpuat
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

pdf bib
Cross-lingual Discourse Relation Analysis: A corpus study and a semi-supervised classification system
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
CNRC-TMT: Second Language Writing Assistant System Description
Cyril Goutte | Michel Simard | Marine Carpuat
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Assessing the Discourse Factors that Influence the Quality of Machine Translation
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation
Marine Carpuat | Lucia Specia | Dekai Wu
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
A Semantic Evaluation of Machine Translation Lexical Choice
Marine Carpuat
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Feature Space Selection and Combination for Native Language Identification
Cyril Goutte | Serge Léger | Marine Carpuat
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Measuring Machine Translation Errors in New Domains
Ann Irvine | John Morgan | Marine Carpuat | Hal Daumé III | Dragos Munteanu
Transactions of the Association for Computational Linguistics, Volume 1

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.

pdf bib
SenseSpotting: Never let your parallel data tie you to an old domain
Marine Carpuat | Hal Daumé III | Katharine Henry | Ann Irvine | Jagadeesh Jagarlamudi | Rachel Rudinger
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
NRC: A Machine Translation Approach to Cross-Lingual Word Sense Disambiguation (SemEval-2013 Task 10)
Marine Carpuat
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib
The Trouble with SMT Consistency
Marine Carpuat | Michel Simard
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Marine Carpuat | Lucia Specia | Dekai Wu
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Domain Adaptation in Machine Translation: Findings from the 2012 Johns Hopkins Summer Workshop
Hal Daumé III | Marine Carpuat | Alex Fraser | Chris Quirk
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Keynote Presentations

pdf bib
The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance
Cyril Goutte | Marine Carpuat | George Foster
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

When parallel or comparable corpora are harvested from the web, there is typically a tradeoff between the size and quality of the data. In order to improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. But, at the same time, Statistical Machine Translation (SMT) systems are widely assumed to be relatively robust to sentence alignment errors. However, there is little empirical evidence to support and characterize this robustness. This contribution investigates the impact of sentence alignment errors on a typical phrase-based SMT system. We confirm that SMT systems are highly tolerant to noise, and that performance only degrades seriously at very high noise levels. Our findings suggest that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. Although fixing errors, when applicable, is a preferable strategy to removal, its benefits only become apparent for fairly high misalignment rates. We provide several explanations to support these findings.

2011

pdf bib
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marianna Apidianaki | Marine Carpuat | Lucia Specia
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

2010

pdf bib
Task-based Evaluation of Multiword Expressions: a Pilot Study in Statistical Machine Translation
Marine Carpuat | Mona Diab
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
Marine Carpuat | Yuval Marton | Nizar Habash
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Reordering Matrix Post-verbal Subjects for Arabic-to-English SMT
Marine Carpuat | Yuval Marton | Nizar Habash
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We improve our recently proposed technique for integrating Arabic verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by distinguishing between matrix (or main clause) and non-matrix Arabic verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject) constructions are translated in inverted SV order, while non-matrix (subordinate clause) VS constructions are inverted in only half the cases. In addition, while detecting verbs and their subjects is a hard task, our syntactic parser detects VS constructions better in matrix than in non-matrix clauses. As a result, reordering only matrix VS for word alignment consistently improves translation quality over a phrase-based SMT baseline, and over reordering all VS constructions, in both medium- and large-scale settings. In fact, the improvements obtained by reordering matrix VS on the medium-scale setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER obtained with a word alignment training bitext that is 5 times larger.

2009

pdf bib
Toward Using Morphology in French-English Phrase-Based SMT
Marine Carpuat
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
One Translation Per Discourse
Marine Carpuat
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

2008

pdf bib
Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
Marine Carpuat | Dekai Wu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present new direct data analysis showing that dynamically-built context-dependent phrasal translation lexicons are more useful resources for phrase-based statistical machine translation (SMT) than conventional static phrasal translation lexicons, which ignore all contextual information. After several years of surprising negative results, recent work suggests that context-dependent phrasal translation lexicons are an appropriate framework to successfully incorporate Word Sense Disambiguation (WSD) modeling into SMT. However, this approach has so far only been evaluated using automatic translation quality metrics, which are important, but aggregate many different factors. A direct analysis is still needed to understand how context-dependent phrasal translation lexicons impact translation quality, and whether the additional complexity they introduce is really necessary. In this paper, we focus on the impact of context-dependent translation lexicons on lexical choice in phrase-based SMT and show that context-dependent lexicons are more useful to a phrase-based SMT system than a conventional lexicon. A typical phrase-based SMT system makes use of more and longer phrases with context modeling, including phrases that were not seen very frequently in training. Even when the segmentation is identical, the context-dependent lexicons yield translations that match references more often than conventional lexicons.

2007

pdf bib
How phrase sense disambiguation outperforms word sense disambiguation for statistical machine translation
Marine Carpuat | Dekai Wu
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf bib
Improving Statistical Machine Translation Using Word Sense Disambiguation
Marine Carpuat | Dekai Wu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib
Context-dependent phrasal translation lexicons for statistical machine translation
Marine Carpuat | Dekai Wu
Proceedings of Machine Translation Summit XI: Papers

pdf bib
HKUST statistical machine translation experiments for IWSLT 2007
Yihai Shen | Chi-kiu Lo | Marine Carpuat | Dekai Wu
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper describes the HKUST experiments in the IWSLT 2007 evaluation campaign on spoken language translation. Our primary objective was to compare the open-source phrase-based statistical machine translation toolkit Moses against Pharaoh. We focused on Chinese to English translation, but we also report results on the Arabic to English, Italian to English, and Japanese to English tasks.

2006

pdf bib
Proceedings of the COLING/ACL 2006 Student Research Workshop
Marine Carpuat | Kevin Duh | Rebecca Hwa
Proceedings of the COLING/ACL 2006 Student Research Workshop

pdf bib
Boosting for Chinese Named Entity Recognition
Xiaofeng Yu | Marine Carpuat | Dekai Wu
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

pdf bib
Toward integrating word sense and entity disambiguation into statistical machine translation
Marine Carpuat | Yihai Shen | Xiaofeng Yu | Dekai Wu
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

2005

pdf bib
Word Sense Disambiguation vs. Statistical Machine Translation
Marine Carpuat | Dekai Wu
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

pdf bib
Evaluating the Word Sense Disambiguation Performance of Statistical Machine Translation
Marine Carpuat | Dekai Wu
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

2004

pdf bib
Augmenting ensemble classification for Word Sense Disambiguation with a kernel PCA model
Marine Carpuat | Weifeng Su | Dekai Wu
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
Semantic role labeling with Boosting, SVMs, Maximum Entropy, SNOW, and Decision Lists
Grace Ngai | Dekai Wu | Marine Carpuat | Chi-Shing Wang | Chi-Yung Wang
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
Joining forces to resolve lexical ambiguity: East meets West in Barcelona
Richard Wicentowski | Grace Ngai | Dekai Wu | Marine Carpuat | Emily Thomforde | Adrian Packel
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
Using N-best lists for Named Entity Recognition from Chinese Speech
Lufeng Zhai | Pascale Fung | Richard Schwartz | Marine Carpuat | Dekai Wu
Proceedings of HLT-NAACL 2004: Short Papers

pdf bib
A Kernel PCA Method for Superior Word Sense Disambiguation
Dekai Wu | Weifeng Su | Marine Carpuat
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf bib
Raising the Bar: Stacked Conservative Error Correction Beyond Boosting
Dekai Wu | Grace Ngai | Marine Carpuat
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Why Nitpicking Works: Evidence for Occam’s Razor in Error Correctors
Dekai Wu | Grace Ngai | Marine Carpuat
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Semi-supervised training of a Kernel PCA-Based Model for Word Sense Disambiguation
Weifeng Su | Marine Carpuat | Dekai Wu
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

pdf bib
A Stacked, Voted, Stacked Model for Named Entity Recognition
Dekai Wu | Grace Ngai | Marine Carpuat
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

2002

pdf bib
Boosting for Named Entity Recognition
Dekai Wu | Grace Ngai | Marine Carpuat | Jeppe Larsen | Yongsheng Yang
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

pdf bib
Identifying Concepts Across Languages: A First Step towards a Corpus-based Approach to Automatic Ontology Alignment
Grace Ngai | Marine Carpuat | Pascale Fung
COLING 2002: The 19th International Conference on Computational Linguistics

Search