Marine Carpuat - ACL Anthology

Marine Carpuat

2026

Evaluating Native-Speaker Preferences on Machine Translation and Post-Edits for Five African Languages
Hiba El Oirghi | Tajuddeen Gwadabe | Marine Carpuat
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)

Wikipedia editors undertake the task of editing machine translation (MT) outputs in various languages to disseminate multilingual knowledge from English. But are editors doing more than just translating or fixing MT output? To answer this broad question, we constructed a dataset of 4,335 fine-grained annotated parallel pairs of MT translations and human post-edit (HE) translations for five low-resource African languages: Hausa, Igbo, Swahili, Yoruba, and Zulu. We report on our data selection and annotation methodologies as well as findings from the annotated dataset, the most surprising of which is that annotators mostly preferred the MT translations over their HE counterparts for three out of five languages. We analyze the nature of these "fluency breaking" edits and provide recommendations for the MT post-editing workflows in the Wikipedia domain and beyond.

Can you map it to English? The Role of Cross-Lingual Alignment in the Multilingual Performance of LLMs
Kartik Ravisankar | HyoJung Han | Sarah Wiegreffe | Marine Carpuat
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) can answer prompts in many languages, despite being trained predominantly on English; yet, the mechanisms driving this generalization remain poorly understood. This work asks: How does an LLM’s ability to align representations of non-English inputs to English impact its performance on natural language understanding (NLU) tasks? We study the role of representation alignment in instance-level task decisions, complementing prior analyses conducted both at the language level and task-independently. We introduce the Discriminative Alignment Index (\DALI) to quantify instance-level alignment across 24 languages other than English and three distinct NLU tasks. Results show that incorrect NLU predictions are strongly associated with lower representation alignment with English in the model’s middle layers. Through activation patching, we show that incorrect predictions in languages other than English can be fixed by patching their parallel English activations in the middle layers, thereby demonstrating the causal role of representation (mis)alignment in cross-lingual correctness.

VietMix: A Naturally-Occurring Parallel Corpus and Augmentation Framework for Vietnamese-English Code-Mixed Machine Translation
Hieu Tran | Phuong-Anh Nguyen-Le | Huy Nghiem | Quang-Nhan Nguyen | Wei Ai | Marine Carpuat
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Machine translation (MT) systems universally degrade when faced with code-mixed text. This problem is more acute for low-resource languages that lack dedicated parallel corpora. This work directly addresses this gap for Vietnamese-English, a language context characterized by challenges including orthographic ambiguity and the frequent omission of diacritics in informal text. We introduce VietMix, the first expert-translated, naturally occurring parallel corpus of Vietnamese-English code-mixed text. We establish VietMix’s utility by developing a data augmentation pipeline that leverages iterative fine-tuning and targeted filtering. Experiments show that models augmented with our data outperform strong back-translation baselines by up to +3.5 xCOMET points and improve zero-shot models by up to +11.9 points. Our work delivers a foundational resource for a challenging language pair and provides a validated, transferable framework for building and augmenting corpora in other low-resource settings.

2025

Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation
Dayeon Ki | Kevin Duh | Marine Carpuat
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly give users an assessment of translation quality using (1) error highlights and (2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through (3) backtranslation and (4) question–answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions – receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.

Automatic Input Rewriting Improves Translation with Large Language Models
Dayeon Ki | Marine Carpuat
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Can we improve machine translation (MT) with LLMs by rewriting their inputs automatically? Users commonly rely on the intuition that well-written text is easier to translate when using off-the-shelf MT systems. LLMs can rewrite text in many ways but in the context of MT, these capabilities have been primarily exploited to rewrite outputs via post-editing. We present an empirical study of 21 input rewriting methods with 3 open-weight LLMs for translating from English into 6 target languages. We show that text simplification is the most effective MT-agnostic rewrite strategy and that it can be improved further when using quality estimation to assess translatability. Human evaluation further confirms that simplified rewrites and their MT outputs both largely preserve the original meaning of the source and MT. These results suggest LLM-assisted input rewriting as a promising direction for improving translations.

Improving MT-enabled Triage Performance with Multiple MT Outputs
Marianna J. Martindale | Marine Carpuat
Proceedings of Machine Translation Summit XX: Volume 1

Recent advances in Machine Translation (MT) quality may motivate adoption in a variety of use cases, but the success of MT deployment depends not only on intrinsic model quality but on how well the model, as deployed, helps users meet the objectives of their use case. This work focuses on a specific triage use case, MT-enabled scanning in intelligence analysis. After describing the use case with its objectives and failure modes, we present a user study to establish a baseline performance level and measure the mitigating effects of a simple intervention, providing additional MT outputs. We find significant improvements in relevance judgment accuracy with outputs from two distinct neural MT models and significant improvements in relevant entity identification with the addition of a rule-based MT. Users also like seeing multiple MT outputs, making it an appealing way to improve MT-enabled scanning performance.

Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and real-world usage, particularly for non-expert users who may struggle to assess translation reliability.This paper advocates for a human-centered approach to MT, emphasizing the alignment of system design with diverse communicative goals and contexts of use. We survey the literature in Translation Studies and Human-Computer Interaction to recontextualize MT evaluation and design to address the diverse real-world scenarios in which MT is used today.

Multiple LLM Agents Debate for Equitable Cultural Alignment
Dayeon Ki | Rachel Rudinger | Tianyi Zhou | Marine Carpuat
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where either LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracies comparable to that of a much larger model (27B parameters).

Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations
Yimin Xiao | Yongle Zhang | Dayeon Ki | Calvin Bao | Marianna J. Martindale | Charlotte Vaughn | Ge Gao | Marine Carpuat
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users’ reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.

AskQE: Question Answering as Automatic Evaluation for Machine Translation
Dayeon Ki | Kevin Duh | Marine Carpuat
Findings of the Association for Computational Linguistics: ACL 2025

How can a monolingual English speaker determine whether an automatic translation in French is good enough to be shared? Existing MT error detection and quality estimation (QE) techniques do not address this practical scenario. We introduce AskQE, a question generation and answering framework designed to detect critical MT errors and provide actionable feedback, helping users decide whether to accept or reject MT outputs even without the knowledge of the target language. Using ContraTICO, a dataset of contrastive synthetic MT errors in the COVID-19 domain, we explore design choices for AskQE and develop an optimized version relying on LLaMA-3 70B and entailed facts to guide question generation. We evaluate the resulting system on the BioMQM dataset of naturally occurring MT errors, where AskQE has higher Kendall’s Tau correlation and decision accuracy with human ratings compared to other QE metrics.

Who’s the Author? How Explanations Impact User Reliance in AI-Assisted Authorship Attribution
Calvin Bao | Connor Baumler | Hal Daumé Iii | Marine Carpuat
Findings of the Association for Computational Linguistics: EMNLP 2025

Despite growing interest in explainable NLP, it remains unclear how explanation strategies shape user behavior in tasks like authorship identification, where relevant textual features may be difficult for lay users to pinpoint. To support their analysis of text style, we consider two explanation types: example-based style rewrites and feature-based rationales, generated using a LLM-based pipeline. We measured how explanations impact user behavior in a controlled study (n=95) where participants completed authorship identification tasks with our types of assistance. While no explanation type improved overall task accuracy, fine-grained reliance patterns (CITATION) revealed that rewrites supported appropriate reliance, whereas presenting both explanation types increased AI overreliance, minimizing participant self-reliance. We find that participants exhibiting better reliance behaviors had focused explanation needs, contrasting with the diffused preferences of those who overrelied on AI, or incorrectly self-relied. These findings highlight the need for adaptive explanation systems that tailor support based on specific user reliance behaviors.

Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation
Eric R. Bennett | HyoJung Han | Xinchen Yang | Andrew Schonebaum | Marine Carpuat
Proceedings of the Second Workshop on Ancient Language Processing

Evaluation metrics are an important driver of progress in Machine Translation (MT), but they have been primarily validated on high-resource modern languages. In this paper, we conduct an empirical evaluation of metrics commonly used to evaluate MT from Ancient Chinese into English. Using LLMs, we construct a contrastive test set, pairing high-quality MT and purposefully flawed MT of the same Pre-Qin texts. We then evaluate the ability of each metric to discriminate between accurate and flawed translations.

2024

AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Jiayi Wang | David Ifeoluwa Adelani | Sweta Agrawal | Marek Masiak | Ricardo Rei | Eleftheria Briakou | Marine Carpuat | Xuanli He | Sofia Bourhim | Andiswa Bukula | Muhidin Mohamed | Temitayo Olatoye | Tosin Adewumi | Hamam Mokayed | Christine Mwase | Wangui Kimotho | Foutse Yuehgoh | Anuoluwapo Aremu | Jessica Ojo | Shamsuddeen Hassan Muhammad | Salomey Osei | Abdul-Hakeem Omotayo | Chiamaka Chukwuneke | Perez Ogayo | Oumaima Hourrane | Salma El Anigri | Lolwethu Ndolela | Thabiso Mangwana | Shafie Abdi Mohamed | Ayinde Hassan | Oluwabusayo Olufunke Awoyomi | Lama Alkhaled | Sana Al-Azzawi | Naome A. Etori | Millicent Ochieng | Clemencia Siro | Samuel Njoroge | Eric Muchiri | Wangari Kimotho | Lyse Naomi Wamba Momo | Daud Abolade | Simbiat Ajao | Iyanuoluwa Shode | Ricky Macharm | Ruqayya Nasir Iro | Saheed S. Abdullahi | Stephen E. Moore | Bernard Opoku | Zainab Akinjobi | Abeeb Afolabi | Nnaemeka Obiefuna | Onyekachi Raphael Ogbu | Sam Brian | Verrah Akinyi Otiende | Chinedu Emmanuel Mbonu | Sakayo Toadoum Sari | Yao Lu | Pontus Stenetorp
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
Shramay Palta | Nishant Balepur | Peter Rankel | Sarah Wiegreffe | Marine Carpuat | Rachel Rudinger
Findings of the Association for Computational Linguistics: EMNLP 2024

Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQS, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracyand high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.

Keep it Private: Unsupervised Privatization of Online Text
Calvin Bao | Marine Carpuat
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of English Reddit posts by 68k authors composed of short-medium length texts. We study how the performance changes among evaluative conditions including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.

SpeechQE: Estimating the Quality of Direct Speech Translation
HyoJung Han | Kevin Duh | Marine Carpuat
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent advances in automatic quality estimation for machine translation have exclusively focused on written language, leaving the speech modality underexplored. In this work, we formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures. In this process, we introduce a novel end-to-end system leveraging pre-trained text LLM. Results suggest that end-to-end approaches are better suited to estimating the quality of direct speech translation than using quality estimation systems designed for text in cascaded systems. More broadly, we argue that quality estimation of speech translation needs to be studied as a separate problem from that of text, and release our [data and models](https://github.com/h-j-han/SpeechQE) to guide further research in this space.

How Often Are Errors in Natural Language Reasoning Due to Paraphrastic Variability?
Neha Srikanth | Marine Carpuat | Rachel Rudinger
Transactions of the Association for Computational Linguistics, Volume 12

Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric, PC, for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model’s variance in correctness attributable to paraphrasing. To estimate PC, we collect ParaNlu, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference.1 Using ParaNlu, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not fine-tuning. All models tested exhibited room for improvement in paraphrastic consistency.

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
HyoJung Han | Mohamed Anwar | Juan Pino | Wei-Ning Hsu | Marine Carpuat | Bowen Shi | Changhan Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources.To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.

Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations
Dayeon Ki | Marine Carpuat
Findings of the Association for Computational Linguistics: NAACL 2024

Machine Translation (MT) remains one of the last NLP tasks where large language models (LLMs) have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality, derived from Multidimensional Quality Metric (MQM) annotations. Working with LLaMA-2 models, we consider prompting strategies varying the nature of feedback provided and then fine-tune the LLM to improve its ability to exploit the provided guidance. Through experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU and COMET scores, although the benefits of fine-grained feedback are not clear. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation.

FINDINGS OF THE IWSLT 2024 EVALUATION CAMPAIGN
Ibrahim Said Ahmad | Antonios Anastasopoulos | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | William Chen | Qianqian Dong | Marcello Federico | Barry Haddow | Dávid Javorský | Mateusz Krubiński | Tsz Kin Lam | Xutai Ma | Prashant Mathur | Evgeny Matusov | Chandresh Maurya | John P. McCrae | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Xing Niu | Atul Kr. Ojha | John Ortega | Sara Papi | Peter Polák | Adam Pospíšil | Pavel Pecina | Elizabeth Salesky | Nivedita Sethiya | Balaram Sarkar | Jiatong Shi | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Brian Thompson | Alex Waibel | Shinji Watanabe | Patrick Wilken | Petr Zemánek | Rodolfo Zevallos
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 17 teams whose submissions are documented in 27 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
Elizabeth Salesky | Marcello Federico | Marine Carpuat
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension
Sweta Agrawal | Marine Carpuat
Transactions of the Association for Computational Linguistics, Volume 12

Automatic text simplification (TS) aims to automate the process of rewriting text to make it easier for people to read. A pre-requisite for TS to be useful is that it should convey information that is consistent with the meaning of the original text. However, current TS evaluation protocols assess system outputs for simplicity and meaning preservation without regard for the document context in which output sentences occur and for how people understand them. In this work, we introduce a human evaluation framework to assess whether simplified texts preserve meaning using reading comprehension questions. With this framework, we conduct a thorough human evaluation of texts by humans and by nine automatic systems. Supervised systems that leverage pre-training knowledge achieve the highest scores on the reading comprehension tasks among the automatic controllable TS systems. However, even the best-performing supervised system struggles with at least 14% of the questions, marking them as “unanswerable” based on simplified content. We further investigate how existing TS evaluation metrics and automatic question-answering systems approximate the human judgments we obtained.

Automatic Authorship Analysis in Human-AI Collaborative Writing
Aquia Richburg | Calvin Bao | Marine Carpuat
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

As the quality of AI-generated text increases with the development of new Large Language Models, people use them to write in a variety of contexts. Human-AI collaborative writing poses a potential challenge for existing AI analysis techniques, which have been primarily tested either on human-written text only, or on samples independently generated by humans and AI. In this work, we investigate the extent to which existing AI detection and authorship analysis models can perform classification on data generated in human-AI collaborative writing sessions. Results show that, for AI text detection in the cowriting setting, classifiers based on authorship embeddings (Rivera-Soto et al., 2021) outperform classifiers used in prior work distinguishing AI vs. human text generated independently. However, these embeddings are not optimal for finer-grained authorship identification tasks: for authorship verification, n-gram based models are more robust to human-AI co-written text, and authorship attribution performance degrades compared to baselines that use human-written text only. Taken together, this suggests that the rise of human-AI co-written text will require adapting AI detection tools and authorship analysis techniques in the near future. We release our code at https://github.com/AARichburg/Human-AI_Authorship_Analysis.

2023

Explaining with Contrastive Phrasal Highlighting: A Case Study in Assisting Humans to Detect Translation Differences
Eleftheria Briakou | Navita Goyal | Marine Carpuat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Explainable NLP techniques primarily explain by answering “Which tokens in the input are responsible for this prediction?”. We argue that for NLP models that make predictions by comparing two input texts, it is more useful to explain by answering “What differences between the two inputs explain this prediction?”. We introduce a technique to generate contrastive phrasal highlights that explain the predictions of a semantic divergence model via phrase alignment guided erasure. We show that the resulting highlights match human rationales of cross-lingual semantic differences better than popular post-hoc saliency techniques and that they successfully help people detect fine-grained meaning differences in human translations and critical machine translation errors.

What Else Do I Need to Know? The Effect of Background Information on Users’ Reliance on QA Systems
Navita Goyal | Eleftheria Briakou | Amanda Liu | Connor Baumler | Claire Bonial | Jeffrey Micher | Clare Voss | Marine Carpuat | Hal Daumé III
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

NLP systems have shown impressive performance at answering questions by retrieving relevant context. However, with the increasingly large models, it is impossible and often undesirable to constrain models’ knowledge or reasoning to only the retrieved context. This leads to a mismatch between the information that the models access to derive the answer and the information that is available to the user to assess the model predicted answer. In this work, we study how users interact with QA systems in the absence of sufficient information to assess their predictions. Further, we ask whether adding the requisite background helps mitigate users’ over-reliance on predictions. Our study reveals that users rely on model predictions even in the absence of sufficient information needed to assess the model’s correctness. Providing the relevant background, however, helps users better catch model errors, reducing over-reliance on incorrect predictions. On the flip side, background information also increases users’ confidence in their accurate as well as inaccurate judgments. Our work highlights that supporting users’ verification of QA predictions is an important, yet challenging, problem.

Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors
Nikita Mehandru | Sweta Agrawal | Yimin Xiao | Ge Gao | Elaine Khoong | Marine Carpuat | Niloufar Salehi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

A major challenge in the practical use of Machine Translation (MT) is that users lack information on translation quality to make informed decisions about how to rely on outputs. Progress in quality estimation research provides techniques to automatically assess MT quality, but these techniques have primarily been evaluated in vitro by comparison against human judgments outside of a specific context of use. This paper evaluates quality estimation feedback in vivo with a human study in realistic high-stakes medical settings. Using Emergency Department discharge instructions, we study how interventions based on quality estimation versus backtranslation assist physicians in deciding whether to show MT outputs to a patient. We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.

Findings of the CoCo4MT 2023 Shared Task on Corpus Construction for Machine Translation
Ananya Ganesh | Marine Carpuat | William Chen | Katharina Kann | Constantine Lignos | John E. Ortega | Jonne Saleva | Shabnam Tafreshi | Rodolfo Zevallos
Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation

This paper provides an overview of the first shared task on choosing beneficial instances for machine translation, conducted as part of the CoCo4MT 2023 Workshop at MTSummit. This shared task was motivated by the need to make the data annotation process for machine translation more efficient, particularly for low-resource languages for which collecting human translations may be difficult or expensive. The task involved developing methods for selecting the most beneficial instances for training a machine translation system without access to an existing parallel dataset in the target language, such that the best selected instances can then be manually translated. Two teams participated in the shared task, namely the Williams team and the AST team. Submissions were evaluated by training a machine translation model on each submission’s chosen instances, and comparing their performance with the chRF++ score. The system that ranked first is by the Williams team, that finds representative instances by clustering the training data.

Bridging Background Knowledge Gaps in Translation with Automatic Explicitation
HyoJung Han | Jordan Boyd-Graber | Marine Carpuat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. Despite its potential to help users, NLP research on explicitation is limited because of the dearth of adequate evaluation methods. This work introduces techniques for automatically generating explicitations, motivated by WikiExpl: a dataset that we collect from Wikipedia and annotate with human translators. The resulting explicitations are useful as they help answer questions more accurately in a multilingual question answering framework.

Controlling Pre-trained Language Models for Grade-Specific Text Simplification
Sweta Agrawal | Marine Carpuat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Text simplification systems rewrite text to make it more readable while preserving its content. However, what makes a text easy to read depends on the intended readers. Recent work has shown that pre-trained language models can simplify text using a wealth of techniques to control output simplicity, ranging from specifying only the desired reading grade level, to directly specifying low-level edit operations. Yet it remains unclear how to set these control parameters in practice. Existing approaches set them at the corpus level, disregarding the complexity of individual inputs and considering only one level of output complexity. In this work, we conduct an empirical study to understand how different control mechanisms impact the adequacy and simplicity of text simplification systems. Based on these insights, we introduce a simple method that predicts the edit operations required for simplifying a text for a specific grade level on an instance-per-instance basis. This approach improves the quality of the simplified outputs over corpus-level search-based heuristics.

Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection
Weijia Xu | Sweta Agrawal | Eleftheria Briakou | Marianna J. Martindale | Marine Carpuat
Transactions of the Association for Computational Linguistics, Volume 11

Neural sequence generation models are known to “hallucinate”, by producing outputs that are unrelated to the source text. These hallucinations are potentially harmful, yet it remains unclear in what conditions they arise and how to mitigate their impact. In this work, we first identify internal model symptoms of hallucinations by analyzing the relative token contributions to the generation in contrastive hallucinated vs. non-hallucinated outputs generated via source perturbations. We then show that these symptoms are reliable indicators of natural hallucinations, by using them to design a lightweight hallucination detector which outperforms both model-free baselines and strong classifiers based on quality estimation or large pre-trained models on manually annotated English-Chinese and German-English translation test beds.

FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN
Milind Agarwal | Sweta Agrawal | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | Mingda Chen | William Chen | Khalid Choukri | Alexandra Chronopoulou | Anna Currey | Thierry Declerck | Qianqian Dong | Kevin Duh | Yannick Estève | Marcello Federico | Souhir Gahbiche | Barry Haddow | Benjamin Hsu | Phu Mon Htut | Hirofumi Inaguma | Dávid Javorský | John Judge | Yasumasa Kano | Tom Ko | Rishu Kumar | Pengwei Li | Xutai Ma | Prashant Mathur | Evgeny Matusov | Paul McNamee | John P. McCrae | Kenton Murray | Maria Nadejde | Satoshi Nakamura | Matteo Negri | Ha Nguyen | Jan Niehues | Xing Niu | Atul Kr. Ojha | John E. Ortega | Proyag Pal | Juan Pino | Lonneke van der Plas | Peter Polák | Elijah Rippeth | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Yun Tang | Brian Thompson | Kevin Tran | Marco Turchi | Alex Waibel | Mingxuan Wang | Shinji Watanabe | Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
Elizabeth Salesky | Marcello Federico | Marine Carpuat
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

A Rose by Any Other Name would not Smell as Sweet: Social Bias in Names Mistranslation
Sandra Sandoval | Jieyu Zhao | Marine Carpuat | Hal Daumé III
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We ask the question: Are there widespread disparities in machine translations of names across race/ethnicity, and gender? We hypothesize that the translation quality of names and surrounding context will be lower for names associated with US racial and ethnic minorities due to these systems’ tendencies to standardize language to predominant language patterns. We develop a dataset of names that are strongly demographically aligned and propose a translation evaluation procedure based on round-trip translation. We analyze the effect of name demographics on translation quality using generalized linear mixed effects models and find that the ability of translation systems to correctly translate female-associated names is significantly lower than male-associated names. This effect is particularly pronounced for female-associated names that are also associated with racial (Black) and ethnic (Hispanic) minorities. This disparity in translation quality between social groups for something as personal as someone’s name has significant implications for people’s professional, personal, and cultural identities, self-worth and ease of communication. Our findings suggest that more MT research is needed to improve the translation of names and to provide high-quality service for users regardless of gender, race, and ethnicity.

Towards Conceptualization of “Fair Explanation”: Disparate Impacts of anti-Asian Hate Speech Explanations on Content Moderators
Tin Nguyen | Jiannan Xu | Aayushi Roy | Hal Daumé III | Marine Carpuat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recent research at the intersection of AI explainability and fairness has focused on how explanations can improve human-plus-AI task performance as assessed by fairness measures. We propose to characterize what constitutes an explanation that is itself “fair” – an explanation that does not adversely impact specific populations. We formulate a novel evaluation method of “fair explanations” using not just accuracy and label time, but also psychological impact of explanations on different user groups across many metrics (mental discomfort, stereotype activation, and perceived workload). We apply this method in the context of content moderation of potential hate speech, and its differential impact on Asian vs. non-Asian proxy moderators, across explanation approaches (saliency map and counterfactual explanation). We find that saliency maps generally perform better and show less evidence of disparate impact (group) and individual unfairness than counterfactual explanations. Content warning: This paper contains examples of hate speech and racially discriminatory language. The authors do not support such content. Please consider your risk of discomfort carefully before continuing reading!

2022

Data Cartography for Low-Resource Neural Machine Translation
Aquia Richburg | Marine Carpuat
Findings of the Association for Computational Linguistics: EMNLP 2022

While collecting or generating more parallel data is necessary to improve machine translation (MT) in low-resource settings, we lack an understanding of how the limited amounts of existing data are actually used to help guide the collection of further resources. In this paper, we apply data cartography techniques (Swayamdipta et al., 2020) to characterize the contribution of training samples in two low-resource MT tasks (Swahili-English and Turkish-English) throughout the training of standard neural MT models. Our empirical study shows that, unlike in prior work for classification tasks, most samples contribute to model training in low-resource MT, albeit not uniformly throughout the training process. Furthermore, uni-dimensional characterizations of samples – e.g., based on dual cross-entropy or word frequency – do not suffice to characterize to what degree they are hard or easy to learn. Taken together, our results suggest that data augmentation strategies for low-resource MT would benefit from model-in-the-loop strategies to maximize improvements.

A Proposed User Study on MT-Enabled Scanning
Marianna J Martindale | Marine Carpuat
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)

In this talk I will present a proposed user study to measure the impact of potentially misleading MT output on MT-enabled scanning of foreign language text by intelligence analysts (IAs) and the effectiveness of a practical intervention: providing output from more than one NMT system to the user. The focus of the talk will be on the approach to de-signing the user study to resemble scanning tasks in a measurable way with unclassified documents.

Constrained Regeneration for Cross-Lingual Query-Focused Extractive Summarization
Elsbeth Turcan | David Wan | Faisal Ladhak | Petra Galuscakova | Sukanta Sen | Svetlana Tchistiakova | Weijia Xu | Marine Carpuat | Kenneth Heafield | Douglas Oard | Kathleen McKeown
Proceedings of the 29th International Conference on Computational Linguistics

Query-focused summaries of foreign-language, retrieved documents can help a user understand whether a document is actually relevant to the query term. A standard approach to this problem is to first translate the source documents and then perform extractive summarization to find relevant snippets. However, in a cross-lingual setting, the query term does not necessarily appear in the translations of relevant documents. In this work, we show that constrained machine translation and constrained post-editing can improve human relevance judgments by including a query term in a summary when its translation appears in the source document. We also present several strategies for selecting only certain documents for regeneration which yield further improvements

Findings of the Association for Computational Linguistics: NAACL 2022
Marine Carpuat | Marie-Catherine de Marneffe | Ivan Vladimir Meza Ruiz
Findings of the Association for Computational Linguistics: NAACL 2022

Controlling Translation Formality Using Pre-trained Multilingual Language Models
Elijah Rippeth | Sweta Agrawal | Marine Carpuat
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper describes the University of Maryland’s submission to the Special Task on Formality Control for Spoken Language Translation at IWSLT, which evaluates translation from English into 6 languages with diverse grammatical formality markers. We investigate to what extent this problem can be addressed with a single multilingual model, simultaneously controlling its output for target language and formality. Results show that this strategy can approach the translation quality and formality control achieved by dedicated translation models. However, the nature of the underlying pre-trained language model and of the finetuning samples greatly impact results.

SimQA: Detecting Simultaneous MT Errors through Word-by-Word Question Answering
HyoJung Han | Marine Carpuat | Jordan Boyd-Graber
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Detractors of neural machine translation admit that while its translations are fluent, it sometimes gets key facts wrong. This is particularly important in simultaneous interpretation where translations have to be provided as fast as possible: before a sentence is complete. Yet, evaluations of simultaneous machine translation (SimulMT) fail to capture if systems correctly translate the most salient elements of a question: people, places, and dates. To address this problem, we introduce a downstream word-by-word question answering evaluation task (SimQA): given a source language question, translate the question word by word into the target language, and answer as soon as possible. SimQA jointly measures whether the SimulMT models translate the question quickly and accurately, and can reveal shortcomings in existing neural systems—hallucinating or omitting facts.

An Imitation Learning Curriculum for Text Editing with Non-Autoregressive Models
Sweta Agrawal | Marine Carpuat
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a framework for training non-autoregressive sequence-to-sequence models for editing tasks, where the original input sequence is iteratively edited to produce the output. We show that the imitation learning algorithms designed to train such models for machine translation introduces mismatches between training and inference that lead to undertraining and poor generalization in editing scenarios. We address this issue with two complementary strategies: 1) a roll-in policy that exposes the model to intermediate training sequences that it is more likely to encounter during inference, 2) a curriculum that presents easy-to-learn edit operations first, gradually increasing the difficulty of training samples as the model becomes competent. We show the efficacy of these strategies on two challenging English editing tasks: controllable text simplification and abstractive summarization. Our approach significantly improves output quality on both tasks and controls output complexity better on the simplification task.

Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)
John E. Ortega | Marine Carpuat | William Chen | Katharina Kann | Constantine Lignos | Maja Popovic | Shabnam Tafreshi
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Marine Carpuat | Marie-Catherine de Marneffe | Ivan Vladimir Meza Ruiz
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Can Synthetic Translations Improve Bitext Quality?
Eleftheria Briakou | Marine Carpuat
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Synthetic translations have been used for a wide range of NLP tasks primarily as a means of data augmentation. This work explores, instead, how synthetic translations can be used to revise potentially imperfect reference translations in mined bitext. We find that synthetic samples can improve bitext quality without any additional bilingual supervision when they replace the originals based on a semantic equivalence classifier that helps mitigate NMT noise. The improved quality of the revised bitext is confirmed intrinsically via human evaluation and extrinsically through bilingual induction and MT tasks.

Quality Estimation via Backtranslation at the WMT 2022 Quality Estimation Task
Sweta Agrawal | Nikita Mehandru | Niloufar Salehi | Marine Carpuat
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes submission to the WMT 2022 Quality Estimation shared task (Task 1: sentence-level quality prediction). We follow a simple and intuitive approach, which consists of estimating MT quality by automatically back-translating hypotheses into the source language using a multilingual MT system. We then compare the resulting backtranslation with the original source using standard MT evaluation metrics. We find that even the best-performing backtranslation-based scores perform substantially worse than supervised QE systems, including the organizers’ baseline. However, combining backtranslation-based metrics with off-the-shelf QE scorers improves correlation with human judgments, suggesting that they can indeed complement a supervised QE system.

2021

A Non-Autoregressive Edit-Based Approach to Controllable Text Simplification
Sweta Agrawal | Weijia Xu | Marine Carpuat
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
Weijia Xu | Shuming Ma | Dongdong Zhang | Marine Carpuat
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling
Tasnim Kabir | Marine Carpuat
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

This paper describes the UMD submission to the Explainable Quality Estimation Shared Task at the EMNLP 2021 Workshop on “Evaluation & Comparison of NLP Systems”. We participated in the word-level and sentence-level MT Quality Estimation (QE) constrained tasks for all language pairs: Estonian-English, Romanian-English, German-Chinese, and Russian-German. Our approach combines the predictions of a word-level explainer model on top of a sentence-level QE model and a sequence labeler trained on synthetic data. These models are based on pre-trained multilingual language models and do not require any word-level annotations for training, making them well suited to zero-shot settings. Our best-performing system improves over the best baseline across all metrics and language pairs, with an average gain of 0.1 in AUC, Average Precision, and Recall at Top-K score.

Rule-based Morphological Inflection Improves Neural Terminology Translation
Weijia Xu | Marine Carpuat
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Current approaches to incorporating terminology constraints in machine translation (MT) typically assume that the constraint terms are provided in their correct morphological forms. This limits their application to real-world scenarios where constraint terms are provided as lemmas. In this paper, we introduce a modular framework for incorporating lemma constraints in neural MT (NMT) in which linguistic knowledge and diverse types of NMT models can be flexibly applied. It is based on a novel cross-lingual inflection module that inflects the target lemma constraints based on the source context. We explore linguistically motivated rule-based and data-driven neural-based inflection modules and design English-German health and English-Lithuanian news test suites to evaluate them in domain adaptation and low-resource MT settings. Results show that our rule-based inflection module helps NMT models incorporate lemma constraints more accurately than a neural module and outperforms the existing end-to-end approach with lower training costs.

Beyond Noise: Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation
Eleftheria Briakou | Marine Carpuat
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

While it has been shown that Neural Machine Translation (NMT) is highly sensitive to noisy parallel training samples, prior work treats all types of mismatches between source and target as noise. As a result, it remains unclear how samples that are mostly equivalent but contain a small number of semantically divergent tokens impact NMT training. To close this gap, we analyze the impact of different types of fine-grained semantic divergences on Transformer models. We show that models trained on synthetic divergences output degenerated text more frequently and are less confident in their predictions. Based on these findings, we introduce a divergent-aware NMT framework that uses factors to help NMT recover from the degradation caused by naturally occurring divergences, improving both translation quality and model calibration on EN-FR tasks.

Machine Translation Believability
Marianna Martindale | Kevin Duh | Marine Carpuat
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Successful Machine Translation (MT) deployment requires understanding not only the intrinsic qualities of MT output, such as fluency and adequacy, but also user perceptions. Users who do not understand the source language respond to MT output based on their perception of the likelihood that the meaning of the MT output matches the meaning of the source text. We refer to this as believability. Output that is not believable may be off-putting to users, but believable MT output with incorrect meaning may mislead them. In this work, we study the relationship of believability to fluency and adequacy by applying traditional MT direct assessment protocols to annotate all three features on the output of neural MT systems. Quantitative analysis of these annotations shows that believability is closely related to but distinct from fluency, and initial qualitative analysis suggests that semantic features may account for the difference.

Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer
Eleftheria Briakou | Sweta Agrawal | Joel Tetreault | Marine Carpuat
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

While the field of style transfer (ST) has been growing rapidly, it has been hampered by a lack of standardized practices for automatic evaluation. In this paper, we evaluate leading automatic metrics on the oft-researched task of formality style transfer. Unlike previous evaluations, which focus solely on English, we expand our focus to Brazilian-Portuguese, French, and Italian, making this work the first multilingual evaluation of metrics in ST. We outline best practices for automatic evaluation in (formality) style transfer and identify several models that correlate well with human judgments and are robust across languages. We hope that this work will help accelerate development in ST, where human evaluation is often challenging to collect.

Models and Tasks for Human-Centered Machine Translation
Marine Carpuat
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

In this talk, I will describe current research directions in my group that aim to make machine translation (MT) more human-centered. Instead of viewing MT solely as a task that aims to transduce a source sentence into a well-formed target language equivalent, we revisit all steps of the MT research and development lifecycle with the goal of designing MT systems that are able to help people communicate across language barriers. I will present methods to better characterize the parallel training data that powers MT systems, and how the degree of equivalence impacts translation quality. I will introduce models that enable flexible conditional language generation, and will discuss recent work on framing machine translation tasks and evaluation to center human factors.

A Review of Human Evaluation for Style Transfer
Eleftheria Briakou | Sweta Agrawal | Ke Zhang | Joel Tetreault | Marine Carpuat
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

This paper reviews and summarizes human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency. In principle, evaluations by human raters should be the most reliable. However, in style transfer papers, we find that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.

EDITOR: An Edit-Based Transformer with Repositioning for Neural Machine Translation with Soft Lexical Constraints
Weijia Xu | Marine Carpuat
Transactions of the Association for Computational Linguistics, Volume 9

We introduce an Edit-Based TransfOrmer with Repositioning (EDITOR), which makes sequence generation flexible by seamlessly allowing users to specify preferences in output lexical choice. Building on recent models for non-autoregressive sequence generation (Gu et al., 2019), EDITOR generates new sequences by iteratively editing hypotheses. It relies on a novel reposition operation designed to disentangle lexical choice from word positioning decisions, while enabling efficient oracles for imitation learning and parallel edits at decoding time. Empirically, EDITOR uses soft lexical constraints more effectively than the Levenshtein Transformer (Gu et al., 2019) while speeding up decoding dramatically compared to constrained beam search (Post and Vilar, 2018). EDITOR also achieves comparable or better translation quality with faster decoding speed than the Levenshtein Transformer on standard Romanian-English, English-German, and English-Japanese machine translation tasks.

The University of Maryland, College Park Submission to Large-Scale Multilingual Shared Task at WMT 2021
Saptarashmi Bandyopadhyay | Tasnim Kabir | Zizhen Lian | Marine Carpuat
Proceedings of the Sixth Conference on Machine Translation

This paper describes the system submitted to Large-Scale Multilingual Shared Task (Small Task #2) at WMT 2021. It is based on the massively multilingual open-source model FLORES101_MM100 model, with selective fine-tuning. Our best-performing system reported a 15.72 average BLEU score for the task.

2020

The University of Maryland’s Submissions to the WMT20 Chat Translation Task: Searching for More Data to Adapt Discourse-Aware Neural Machine Translation
Calvin Bao | Yow-Ting Shiue | Chujun Song | Jie Li | Marine Carpuat
Proceedings of the Fifth Conference on Machine Translation

This paper describes the University of Maryland’s submissions to the WMT20 Shared Task on Chat Translation. We focus on translating agent-side utterances from English to German. We started from an off-the-shelf BPE-based standard transformer model trained with WMT17 news and fine-tuned it with the provided in-domain training data. In addition, we augment the training set with its best matches in the WMT19 news dataset. Our primary submission uses a standard Transformer, while our contrastive submissions use multi-encoder Transformers to attend to previous utterances. Our primary submission achieves 56.7 BLEU on the agent side (en→de), outperforming a baseline system provided by the task organizers by more than 13 BLEU points. Moreover, according to an evaluation on a set of carefully-designed examples, the multi-encoder architecture is able to generate more coherent translations.

Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
Eleftheria Briakou | Marine Carpuat
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on the Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence-pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.

Evaluating a Bi-LSTM Model for Metaphor Detection in TOEFL Essays
Kevin Kuo | Marine Carpuat
Proceedings of the Second Workshop on Figurative Language Processing

This paper describes systems submitted to the Metaphor Shared Task at the Second Workshop on Figurative Language Processing. In this submission, we replicate the evaluation of the Bi-LSTM model introduced by Gao et al.(2018) on the VUA corpus in a new setting: TOEFL essays written by non-native English speakers. Our results show that Bi-LSTM models outperform feature-rich linear models on this challenging task, which is consistent with prior findings on the VUA dataset. However, the Bi-LSTM models lag behind the best performing systems in the shared task.

Incorporating Terminology Constraints in Automatic Post-Editing
David Wan | Chris Kedzie | Faisal Ladhak | Marine Carpuat | Kathleen McKeown
Proceedings of the Fifth Conference on Machine Translation

Users of machine translation (MT) may want to ensure the use of specific lexical terminologies. While there exist techniques for incorporating terminology constraints during inference for MT, current APE approaches cannot ensure that they will appear in the final translation. In this paper, we present both autoregressive and non-autoregressive models for lexically constrained APE, demonstrating that our approach enables preservation of 95% of the terminologies and also improves translation quality on English-German benchmarks. Even when applied to lexically constrained MT output, our approach is able to improve preservation of the terminologies. However, we show that our models do not learn to copy constraints systematically and suggest a simple data augmentation technique that leads to improved performance and robustness.

An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages
Aquia Richburg | Ramy Eskander | Smaranda Muresan | Marine Carpuat
Proceedings of the Fourth Widening Natural Language Processing Workshop

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.

Dual Reconstruction: a Unifying Objective for Semi-Supervised Neural Machine Translation
Weijia Xu | Xing Niu | Marine Carpuat
Findings of the Association for Computational Linguistics: EMNLP 2020

While Iterative Back-Translation and Dual Learning effectively incorporate monolingual training data in neural machine translation, they use different objectives and heuristic gradient approximation strategies, and have not been extensively compared. We introduce a novel dual reconstruction objective that provides a unified view of Iterative Back-Translation and Dual Learning. It motivates a theoretical analysis and controlled empirical study on German-English and Turkish-English tasks, which both suggest that Iterative Back-Translation is more effective than Dual Learning despite its relative simplicity.

Multitask Models for Controlling the Complexity of Neural Machine Translation
Sweta Agrawal | Marine Carpuat
Proceedings of the Fourth Widening Natural Language Processing Workshop

We introduce a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a novel dataset of news articles available in English and Spanish and written for diverse reading grade levels. We leverage this dataset to train multitask sequence to sequence models that translate Spanish into English targeted at an easier reading grade level than the original Spanish. We show that multitask models outperform pipeline approaches that translate and simplify text independently.

Generating Diverse Translations via Weighted Fine-tuning and Hypotheses Filtering for the Duolingo STAPLE Task
Sweta Agrawal | Marine Carpuat
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes the University of Maryland’s submission to the Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). Unlike the standard machine translation task, STAPLE requires generating a set of outputs for a given input sequence, aiming to cover the space of translations produced by language learners. We adapt neural machine translation models to this requirement by (a) generating n-best translation hypotheses from a model fine-tuned on learner translations, oversampled to reflect the distribution of learner responses, and (b) filtering hypotheses using a feature-rich binary classifier that directly optimizes a close approximation of the official evaluation metric. Combination of systems that use these two strategies achieves F1 scores of 53.9% and 52.5% on Vietnamese and Portuguese, respectively ranking 2nd and 4th on the leaderboard.

2019

Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation
Marianna Martindale | Marine Carpuat | Kevin Duh | Paul McNamee
Proceedings of Machine Translation Summit XVII: Research Track

Controlling Text Complexity in Neural Machine Translation
Sweta Agrawal | Marine Carpuat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This work introduces a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a high quality dataset of news articles available in English and Spanish, written for diverse grade levels and propose a method to align segments across comparable bilingual articles. The resulting dataset makes it possible to train multi-task sequence to sequence models that can translate and simplify text jointly. We show that these multi-task models outperform pipeline approaches that translate and simplify text independently.

Weakly Supervised Cross-lingual Semantic Relation Classification via Knowledge Distillation
Yogarshi Vyas | Marine Carpuat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Words in different languages rarely cover the exact same semantic space. This work characterizes differences in meaning between words across languages using semantic relations that have been used to relate the meaning of English words. However, because of translation ambiguity, semantic relations are not always preserved by translation. We introduce a cross-lingual relation classifier trained only with English examples and a bilingual dictionary. Our classifier relies on a novel attention-based distillation approach to account for translation ambiguity when transferring knowledge from English to cross-lingual settings. On new English-Chinese and English-Hindi test sets, the resulting models largely outperform baselines that more naively rely on bilingual embeddings or dictionaries for cross-lingual transfer, and approach the performance of fully supervised systems on English tasks.

Differentiable Sampling with Flexible Reference Word Order for Neural Machine Translation
Weijia Xu | Xing Niu | Marine Carpuat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Despite some empirical success at correcting exposure bias in machine translation, scheduled sampling algorithms suffer from a major drawback: they incorrectly assume that words in the reference translations and in sampled sequences are aligned at each time step. Our new differentiable sampling algorithm addresses this issue by optimizing the probability that the reference can be aligned with the sampled output, based on a soft alignment predicted by the model itself. As a result, the output distribution at each time step is evaluated with respect to the whole predicted sequence. Experiments on IWSLT translation tasks show that our approach improves BLEU compared to maximum likelihood and scheduled sampling baselines. In addition, our approach is simpler to train with no need for sampling schedule and yields models that achieve larger improvements with smaller beam sizes.

Bi-Directional Differentiable Input Reconstruction for Low-Resource Neural Machine Translation
Xing Niu | Weijia Xu | Marine Carpuat
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We aim to better exploit the limited amounts of parallel text available in low-resource settings by introducing a differentiable reconstruction loss for neural machine translation (NMT). This loss compares original inputs to reconstructed inputs, obtained by back-translating translation hypotheses into the input language. We leverage differentiable sampling and bi-directional NMT to train models end-to-end, without introducing additional parameters. This approach achieves small but consistent BLEU improvements on four language pairs in both translation directions, and outperforms an alternative differentiable reconstruction strategy based on hidden states.

The University of Maryland’s Kazakh-English Neural Machine Translation System at WMT19
Eleftheria Briakou | Marine Carpuat
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the University of Maryland’s submission to the WMT 2019 Kazakh-English news translation task. We study the impact of transfer learning from another low-resource but related language. We experiment with different ways of encoding lexical units to maximize lexical overlap between the two language pairs, as well as back-translation and ensembling. The submitted system improves over a Kazakh-only baseline by +5.45 BLEU on newstest2019.

Curriculum Learning for Domain Adaptation in Neural Machine Translation
Xuan Zhang | Pamela Shapiro | Gaurav Kumar | Paul McNamee | Marine Carpuat | Kevin Duh
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce a curriculum learning approach to adapt generic neural machine translation models to a specific domain. Samples are grouped by their similarities to the domain of interest and each group is fed to the training algorithm with a particular schedule. This approach is simple to implement on top of any neural framework or architecture, and consistently outperforms both unadapted and adapted baselines in experiments with two distinct domains and two language pairs.

2018

Robust Cross-Lingual Hypernymy Detection Using Dependency Context
Shyam Upadhyay | Yogarshi Vyas | Marine Carpuat | Dan Roth
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Cross-lingual Hypernymy Detection involves determining if a word in one language (“fruit”) is a hypernym of a word in another language (“pomme” i.e. apple in French). The ability to detect hypernymy cross-lingually can aid in solving cross-lingual versions of tasks such as textual entailment and event coreference. We propose BiSparse-Dep, a family of unsupervised approaches for cross-lingual hypernymy detection, which learns sparse, bilingual word embeddings based on dependency contexts. We show that BiSparse-Dep can significantly improve performance on this task, compared to approaches based only on lexical context. Our approach is also robust, showing promise for low-resource settings: our dependency-based embeddings can be learned using a parser trained on related languages, with negligible loss in performance. We also crowd-source a challenging dataset for this task on four languages – Russian, French, Arabic, and Chinese. Our embeddings and datasets are publicly available.

UMD at SemEval-2018 Task 10: Can Word Embeddings Capture Discriminative Attributes?
Alexander Zhang | Marine Carpuat
Proceedings of the 12th International Workshop on Semantic Evaluation

We describe the University of Maryland’s submission to SemEval-018 Task 10, “Capturing Discriminative Attributes”: given word triples (w1, w2, d), the goal is to determine whether d is a discriminating attribute belonging to w1 but not w2. Our study aims to determine whether word embeddings can address this challenging task. Our submission casts this problem as supervised binary classification using only word embedding features. Using a gaussian SVM model trained only on validation data results in an F-score of 60%. We also show that cosine similarity features are more effective, both in unsupervised systems (F-score of 65%) and supervised systems (F-score of 67%).

Proceedings of the 12th International Workshop on Semantic Evaluation
Marianna Apidianaki | Saif M. Mohammad | Jonathan May | Ekaterina Shutova | Steven Bethard | Marine Carpuat
Proceedings of the 12th International Workshop on Semantic Evaluation

Bi-Directional Neural Machine Translation with Synthetic Parallel Data
Xing Niu | Michael Denkowski | Marine Carpuat
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

Despite impressive progress in high-resource settings, Neural Machine Translation (NMT) still struggles in low-resource and out-of-domain scenarios, often failing to match the quality of phrase-based translation. We propose a novel technique that combines back-translation and multilingual NMT to improve performance in these difficult cases. Our technique trains a single model for both directions of a language pair, allowing us to back-translate source or target monolingual data without requiring an auxiliary model. We then continue training on the augmented parallel data, enabling a cycle of improvement for a single model that can incorporate any source, target, or parallel data to improve both translation directions. As a byproduct, these models can reduce training and deployment costs significantly compared to uni-directional models. Extensive experiments show that our technique outperforms standard back-translation in low-resource scenarios, improves quality on cross-domain tasks, and effectively reduces costs across the board.

Fluency Over Adequacy: A Pilot Study in Measuring User Trust in Imperfect MT
Marianna Martindale | Marine Carpuat
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

The University of Maryland’s Chinese-English Neural Machine Translation Systems at WMT18
Weijia Xu | Marine Carpuat
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the University of Maryland’s submission to the WMT 2018 Chinese↔English news translation tasks. Our systems are BPE-based self-attentional Transformer networks with parallel and backtranslated monolingual training data. Using ensembling and reranking, we improve over the Transformer baseline by +1.4 BLEU for Chinese→English and +3.97 BLEU for English→Chinese on newstest2017. Our best systems reach BLEU scores of 24.4 for Chinese→English and 39.0 for English→Chinese on newstest2018.

Identifying Semantic Divergences in Parallel Text without Annotations
Yogarshi Vyas | Xing Niu | Marine Carpuat
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recognizing that even correct translations are not always semantically equivalent, we automatically detect meaning divergences in parallel sentence pairs with a deep neural model of bilingual semantic similarity which can be trained for any parallel corpus without any manual annotation. We show that our semantic model detects divergences more accurately than models based on surface features derived from word alignments, and that these divergences matter for neural machine translation.

Multi-Task Neural Models for Translating Between Styles Within and Across Languages
Xing Niu | Sudha Rao | Marine Carpuat
Proceedings of the 27th International Conference on Computational Linguistics

Generating natural language requires conveying content in an appropriate style. We explore two related tasks on generating text of varying formality: monolingual formality transfer and formality-sensitive machine translation. We propose to solve these tasks jointly using multi-task learning, and show that our models achieve state-of-the-art performance for formality transfer and are able to perform formality-sensitive translation without being explicitly trained on style-annotated translation examples.

2017

Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
Marine Carpuat | Yogarshi Vyas | Xing Niu
Proceedings of the First Workshop on Neural Machine Translation

Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves translation quality.

Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Steven Bethard | Marine Carpuat | Marianna Apidianaki | Saif M. Mohammad | Daniel Cer | David Jurgens
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Discovering Stylistic Variations in Distributional Vector Space Models via Lexical Paraphrases
Xing Niu | Marine Carpuat
Proceedings of the Workshop on Stylistic Variation

Detecting and analyzing stylistic variation in language is relevant to diverse Natural Language Processing applications. In this work, we investigate whether salient dimensions of style variations are embedded in standard distributional vector spaces of word meaning. We hypothesizes that distances between embeddings of lexical paraphrases can help isolate style from meaning variations and help identify latent style dimensions. We conduct a qualitative analysis of latent style dimensions, and show the effectiveness of identified style subspaces on a lexical formality prediction task.

A Study of Style in Machine Translation: Controlling the Formality of Machine Translation Output
Xing Niu | Marianna Martindale | Marine Carpuat
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Stylistic variations of language, such as formality, carry speakers’ intention beyond literal meaning and should be conveyed adequately in translation. We propose to use lexical formality models to control the formality level of machine translation output. We demonstrate the effectiveness of our approach in empirical evaluations, as measured by automatic metrics and human assessments.

Detecting Asymmetric Semantic Relations in Context: A Case-Study on Hypernymy Detection
Yogarshi Vyas | Marine Carpuat
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

We introduce WHiC, a challenging testbed for detecting hypernymy, an asymmetric relation between words. While previous work has focused on detecting hypernymy between word types, we ground the meaning of words in specific contexts drawn from WordNet examples, and require predictions to be sensitive to changes in contexts. WHiC lets us analyze complementary properties of two approaches of inducing vector representations of word meaning in context. We show that such contextualized word representations also improve detection of a wider range of semantic relations in context.

Proceedings of ACL 2017, Student Research Workshop
Allyson Ettinger | Spandana Gella | Matthieu Labeau | Cecilia Ovesdotter Alm | Marine Carpuat | Mark Dredze
Proceedings of ACL 2017, Student Research Workshop

2016

SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM)
Nathan Schneider | Dirk Hovy | Anders Johannsen | Marine Carpuat
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Learning Monolingual Compositional Representations via Bilingual Supervision
Ahmed Elgohary | Marine Carpuat
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment
Yogarshi Vyas | Marine Carpuat
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Retrofitting Sense-Specific Word Vectors Using Parallel Text
Allyson Ettinger | Philip Resnik | Marine Carpuat
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The UMD Machine Translation Systems at IWSLT 2016: English-to-French Translation of Speech Transcripts
Xing Niu | Marine Carpuat
Proceedings of the 13th International Conference on Spoken Language Translation

We describe the University of Maryland machine translation system submitted to the IWSLT 2016 Microsoft Speech Language Translation (MSLT) English-French task. Our main finding is that translating conversation transcripts turned out to not be as challenging as we expected: while translation quality is of course not perfect, a straightforward phrase-based system trained on movie subtitles yields high BLEU scores (high 40s on the development set) and manual analysis of 100 examples showed that 61 of them were correctly translated, and errors were mostly local disfluencies in the remaining examples.

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Steven Bethard | Marine Carpuat | Daniel Cer | David Jurgens | Preslav Nakov | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marine Carpuat | Eneko Agirre | Nora Aranberri
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

Connotation in Translation
Marine Carpuat
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Class-based N-gram language difference models for data selection
Amittai Axelrod | Yogarshi Vyas | Marianna Martindale | Marine Carpuat
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

The UMD machine translation systems at IWSLT 2015
Amittai Axelrod | Marine Carpuat
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

Proceedings of the Second Workshop on Discourse in Machine Translation
Bonnie Webber | Marine Carpuat | Andrei Popescu-Belis | Christian Hardmeier
Proceedings of the Second Workshop on Discourse in Machine Translation

2014

Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marine Carpuat | Xavier Carreras | Eva Maria Vecchi
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

Linear Mixture Models for Robust Machine Translation
Marine Carpuat | Cyril Goutte | George Foster
Proceedings of the Ninth Workshop on Statistical Machine Translation

Assessing the Discourse Factors that Influence the Quality of Machine Translation
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Mixed Language and Code-Switching in the Canadian Hansard
Marine Carpuat
Proceedings of the First Workshop on Computational Approaches to Code Switching

Cross-lingual Discourse Relation Analysis: A corpus study and a semi-supervised classification system
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

CNRC-TMT: Second Language Writing Assistant System Description
Cyril Goutte | Michel Simard | Marine Carpuat
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

The NRC System for Discriminating Similar Languages
Cyril Goutte | Serge Léger | Marine Carpuat
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

2013

SenseSpotting: Never let your parallel data tie you to an old domain
Marine Carpuat | Hal Daumé III | Katharine Henry | Ann Irvine | Jagadeesh Jagarlamudi | Rachel Rudinger
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Measuring Machine Translation Errors in New Domains
Ann Irvine | John Morgan | Marine Carpuat | Hal Daumé III | Dragos Munteanu
Transactions of the Association for Computational Linguistics, Volume 1

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.

Bitext as Word Sense Annotation: Lessons from Evaluating Machine Translation on Lexical Semantics Tasks
Marine Carpuat
Proceedings of the Workshop on Twenty Years of Bitext

NRC: A Machine Translation Approach to Cross-Lingual Word Sense Disambiguation (SemEval-2013 Task 10)
Marine Carpuat
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

Feature Space Selection and Combination for Native Language Identification
Cyril Goutte | Serge Léger | Marine Carpuat
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

A Semantic Evaluation of Machine Translation Lexical Choice
Marine Carpuat
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation
Marine Carpuat | Lucia Specia | Dekai Wu
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

2012

The Trouble with SMT Consistency
Marine Carpuat | Michel Simard
Proceedings of the Seventh Workshop on Statistical Machine Translation

Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
Marine Carpuat | Lucia Specia | Dekai Wu
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Domain Adaptation in Machine Translation: Findings from the 2012 Johns Hopkins Summer Workshop
Hal Daumé III | Marine Carpuat | Alex Fraser | Chris Quirk
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Keynote Presentations

The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance
Cyril Goutte | Marine Carpuat | George Foster
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

When parallel or comparable corpora are harvested from the web, there is typically a tradeoff between the size and quality of the data. In order to improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. But, at the same time, Statistical Machine Translation (SMT) systems are widely assumed to be relatively robust to sentence alignment errors. However, there is little empirical evidence to support and characterize this robustness. This contribution investigates the impact of sentence alignment errors on a typical phrase-based SMT system. We confirm that SMT systems are highly tolerant to noise, and that performance only degrades seriously at very high noise levels. Our findings suggest that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. Although fixing errors, when applicable, is a preferable strategy to removal, its benefits only become apparent for fairly high misalignment rates. We provide several explanations to support these findings.

2011

Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marianna Apidianaki | Marine Carpuat | Lucia Specia
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

2010

Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
Marine Carpuat | Yuval Marton | Nizar Habash
Proceedings of the ACL 2010 Conference Short Papers

Reordering Matrix Post-verbal Subjects for Arabic-to-English SMT
Marine Carpuat | Yuval Marton | Nizar Habash
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We improve our recently proposed technique for integrating Arabic verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by distinguishing between matrix (or main clause) and non-matrix Arabic verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject) constructions are translated in inverted SV order, while non-matrix (subordinate clause) VS constructions are inverted in only half the cases. In addition, while detecting verbs and their subjects is a hard task, our syntactic parser detects VS constructions better in matrix than in non-matrix clauses. As a result, reordering only matrix VS for word alignment consistently improves translation quality over a phrase-based SMT baseline, and over reordering all VS constructions, in both medium- and large-scale settings. In fact, the improvements obtained by reordering matrix VS on the medium-scale setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER obtained with a word alignment training bitext that is 5 times larger.

Task-based Evaluation of Multiword Expressions: a Pilot Study in Statistical Machine Translation
Marine Carpuat | Mona Diab
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

One Translation Per Discourse
Marine Carpuat
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

Toward Using Morphology in French-English Phrase-Based SMT
Marine Carpuat
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
Marine Carpuat | Dekai Wu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present new direct data analysis showing that dynamically-built context-dependent phrasal translation lexicons are more useful resources for phrase-based statistical machine translation (SMT) than conventional static phrasal translation lexicons, which ignore all contextual information. After several years of surprising negative results, recent work suggests that context-dependent phrasal translation lexicons are an appropriate framework to successfully incorporate Word Sense Disambiguation (WSD) modeling into SMT. However, this approach has so far only been evaluated using automatic translation quality metrics, which are important, but aggregate many different factors. A direct analysis is still needed to understand how context-dependent phrasal translation lexicons impact translation quality, and whether the additional complexity they introduce is really necessary. In this paper, we focus on the impact of context-dependent translation lexicons on lexical choice in phrase-based SMT and show that context-dependent lexicons are more useful to a phrase-based SMT system than a conventional lexicon. A typical phrase-based SMT system makes use of more and longer phrases with context modeling, including phrases that were not seen very frequently in training. Even when the segmentation is identical, the context-dependent lexicons yield translations that match references more often than conventional lexicons.

2007

Improving Statistical Machine Translation Using Word Sense Disambiguation
Marine Carpuat | Dekai Wu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

HKUST statistical machine translation experiments for IWSLT 2007
Yihai Shen | Chi-kiu Lo | Marine Carpuat | Dekai Wu
Proceedings of the Fourth International Workshop on Spoken Language Translation

This paper describes the HKUST experiments in the IWSLT 2007 evaluation campaign on spoken language translation. Our primary objective was to compare the open-source phrase-based statistical machine translation toolkit Moses against Pharaoh. We focused on Chinese to English translation, but we also report results on the Arabic to English, Italian to English, and Japanese to English tasks.

How phrase sense disambiguation outperforms word sense disambiguation for statistical machine translation
Marine Carpuat | Dekai Wu
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

Context-dependent phrasal translation lexicons for statistical machine translation
Marine Carpuat | Dekai Wu
Proceedings of Machine Translation Summit XI: Papers

2006

Toward integrating word sense and entity disambiguation into statistical machine translation
Marine Carpuat | Yihai Shen | Xiaofeng Yu | Dekai Wu
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

Proceedings of the COLING/ACL 2006 Student Research Workshop
Marine Carpuat | Kevin Duh | Rebecca Hwa
Proceedings of the COLING/ACL 2006 Student Research Workshop

Boosting for Chinese Named Entity Recognition
Xiaofeng Yu | Marine Carpuat | Dekai Wu
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

2005

Evaluating the Word Sense Disambiguation Performance of Statistical Machine Translation
Marine Carpuat | Dekai Wu
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

Word Sense Disambiguation vs. Statistical Machine Translation
Marine Carpuat | Dekai Wu
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Joining forces to resolve lexical ambiguity: East meets West in Barcelona
Richard Wicentowski | Grace Ngai | Dekai Wu | Marine Carpuat | Emily Thomforde | Adrian Packel
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

Using N-best lists for Named Entity Recognition from Chinese Speech
Lufeng Zhai | Pascale Fung | Richard Schwartz | Marine Carpuat | Dekai Wu
Proceedings of HLT-NAACL 2004: Short Papers

Semi-supervised training of a Kernel PCA-Based Model for Word Sense Disambiguation
Weifeng Su | Marine Carpuat | Dekai Wu
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Raising the Bar: Stacked Conservative Error Correction Beyond Boosting
Dekai Wu | Grace Ngai | Marine Carpuat
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Why Nitpicking Works: Evidence for Occam’s Razor in Error Correctors
Dekai Wu | Grace Ngai | Marine Carpuat
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Semantic role labeling with Boosting, SVMs, Maximum Entropy, SNOW, and Decision Lists
Grace Ngai | Dekai Wu | Marine Carpuat | Chi-Shing Wang | Chi-Yung Wang
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

Augmenting ensemble classification for Word Sense Disambiguation with a kernel PCA model
Marine Carpuat | Weifeng Su | Dekai Wu
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

A Kernel PCA Method for Superior Word Sense Disambiguation
Dekai Wu | Weifeng Su | Marine Carpuat
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2003

A Stacked, Voted, Stacked Model for Named Entity Recognition
Dekai Wu | Grace Ngai | Marine Carpuat
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

2002

Identifying Concepts Across Languages: A First Step towards a Corpus-based Approach to Automatic Ontology Alignment
Grace Ngai | Marine Carpuat | Pascale Fung
COLING 2002: The 19th International Conference on Computational Linguistics

Boosting for Named Entity Recognition
Dekai Wu | Grace Ngai | Marine Carpuat | Jeppe Larsen | Yongsheng Yang
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

Co-authors

Marianna Martindale 9

Hal Daumé III 8

Yogarshi Vyas 7

Marcello Federico 4

Rachel Rudinger 4

Elizabeth Salesky 4

Marianna Apidianaki 3

Steven Bethard 3

Ge Gao (高歌) 3

John E. Ortega 3

Aquia Richburg 3

Michel Simard 3

Rodolfo Zevallos 3

Antonios Anastasopoulos 2

Amittai Axelrod 2

Connor Baumler 2

Luisa Bentivogli 2

Ondřej Bojar 2

Jordan Lee Boyd-Graber 2

Roldano Cattoni 2

Mauro Cettolo 2

Qianqian Dong 2

Allyson Ettinger 2

George Foster 2

Dávid Javorský 2

David Jurgens 2

Faisal Ladhak 2

Junyi Jessy Li 2

Constantine Lignos 2

Prashant Mathur 2

Evgeny Matusov 2

John Philip McCrae 2

Kathleen McKeown 2

Nikita Mehandru 2

Ivan Meza-Ruiz 2

Saif Mohammad 2

Kenton Murray 2

Satoshi Nakamura 2

Douglas W. Oard 2

Atul Kr. Ojha 2

Maja Popović 2

Elijah Rippeth 2

Niloufar Salehi 2

Matthias Sperber 2

Sebastian Stüker 2

Katsuhito Sudoh 2

Shabnam Tafreshi 2

Joel Tetreault 2

Brian Thompson 2

Shinji Watanabe 2

Sarah Wiegreffe 2

Marie-Catherine de Marneffe 2

Katharina von der Wense 2

Saheed S. Abdullahi 1

David Ifeoluwa Adelani 1

Tosin Adewumi 1

Abeeb Afolabi 1

Milind Agarwal 1

Ibrahim Said Ahmad 1

Zainab Akinjobi 1

Sana Al-Azzawi 1

Lama Alkhaled 1

Mohamed Anwar 1

Nora Aranberri 1

Anuoluwapo Aremu 1

Oluwabusayo Olufunke Awoyomi 1

Nishant Balepur 1

Saptarashmi Bandyopadhyay 1

Eric R. Bennett 1

Frédéric Blain 1

Claire Bonial 1

Sofia Bourhim 1

Andiswa Bukula 1

Xavier Carreras 1

Monojit Choudhury 1

Khalid Choukri 1

Alexandra Chronopoulou 1

Chiamaka Chukwuneke 1

Thierry Declerck 1

Michael Denkowski 1

Salma El Anigri 1

Hiba El Oirghi 1

Ahmed Elgohary 1

Ramy Eskander 1

Yannick Estève 1

Naome A. Etori 1

Alexander Fraser 1

Souhir Gahbiche 1

Petra Galuščáková 1

Ananya Ganesh 1

Spandana Gella 1

Alvin Grissom II 1

Tajuddeen Gwadabe 1

Christian Hardmeier 1

Ayinde Hassan 1

Kenneth Heafield 1

Katharine Henry 1

Oumaima Hourrane 1

Hirofumi Inaguma 1

Ruqayya Nasir Iro 1

Jagadeesh Jagarlamudi 1

Anders Johannsen 1

Yasumasa Kano 1

Marzena Karpinska 1

Elaine Khoong 1

Elaine C. Khoong 1

Wangui Kimotho 1

Wangari Kimotho 1

Mateusz Krubiński 1

Matthieu Labeau 1

William Lewis 1

Ricky Macharm 1

Thabiso Mangwana 1

André F. T. Martins 1

Chandresh Maurya 1

Chinedu Emmanuel Mbonu 1

Jeffrey Micher 1

Muhidin Mohamed 1

Shafie Abdi Mohamed 1

Hamam Mokayed 1

Stephen E. Moore 1

Shamsuddeen Hassan Muhammad 1

Dragos Stefan Munteanu 1

Smaranda Muresan 1

Christine Mwase 1

Maria Nadejde 1

Preslav Nakov 1

Lolwethu Ndolela 1

Quang-Nhan Nguyen 1

Phuong-Anh Nguyen-Le 1

Samuel Njoroge 1

Mary Nurminen 1

Nnaemeka Obiefuna 1

Millicent Ochieng 1

Onyekachi Raphael Ogbu 1

Temitayo Olatoye 1

Abdul-Hakeem Omotayo 1

Bernard Opoku 1

Verrah Akinyi Otiende 1

Cecilia Ovesdotter Alm 1

Adrian Packel 1

Shramay Palta 1

Andrei Popescu-Belis 1

Adam Pospíšil 1

Peter A. Rankel 1

Kartik Ravisankar 1

Philip Resnik 1

Sandra Sandoval 1

Balaram Sarkar 1

Nathan Schneider 1

Andrew Schonebaum 1

Richard Schwartz 1

Nivedita Sethiya 1

Pamela Shapiro 1

Yow-Ting Shiue 1

Iyanuoluwa Shode 1

Ekaterina Shutova 1

Claytone Sikasote 1

Clemencia Siro 1

Neha Srikanth 1

Pontus Stenetorp 1

Jonne Sälevä 1

Svetlana Tchistiakova 1

Emily Thomforde 1

Sakayo Toadoum Sari 1

Elsbeth Turcan 1

Shyam Upadhyay 1

Charlotte Vaughn 1

Eva Maria Vecchi 1

Lyse Naomi Wamba Momo 1

Changhan Wang 1

Chi-Shing Wang 1

Chi-Yung Wang 1

Mingxuan Wang 1

Bonnie Webber 1

Richard Wicentowski 1

Patrick Wilken 1

Yongsheng Yang 1

Foutse Yuehgoh 1

François Yvon 1

Petr Zemánek 1

Torsten Zesch 1

Dongdong Zhang 1

Alexander Zhang 1

Lonneke van der Plas 1

Venues

JEP/TALN/RECITAL1