André F. T. Martins

Also published as: Andre F. T. Martins, Andre Martins, André F.T. Martins, André Martins


2024

Can Automatic Metrics Assess High-Quality Translations?
Sweta Agrawal | António Farinhas | Ricardo Rei | Andre Martins
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Automatic metrics for evaluating translation quality are typically validated by measuring how well they correlate with human assessments. However, correlation methods tend to capture only the ability of metrics to differentiate between good and bad source-translation pairs, overlooking their reliability in distinguishing alternative translations for the same source. In this paper, we confirm that this is indeed the case by showing that current metrics are insensitive to nuanced differences in translation quality. This effect is most pronounced when the quality is high and the variance among alternatives is low. Given this finding, we shift towards detecting high-quality correct translations, an important problem in practical decision-making scenarios where a binary check of correctness is prioritized over a nuanced evaluation of quality. Using the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors as marked by humans. Our findings reveal that current metrics often over- or underestimate translation quality, indicating significant room for improvement in machine translation evaluation.
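A minimal sketch of this style of stress test (the function and threshold sweep are illustrative, not the paper's exact protocol): treat translations marked error-free under MQM as the positive class and measure how well thresholding a metric score recovers them.

```python
import numpy as np

def detection_curve(metric_scores, is_error_free, num_thresholds=50):
    # Precision/recall for detecting error-free (MQM = 0) translations
    # by thresholding a metric score; illustrative, not the paper's setup.
    scores = np.asarray(metric_scores, dtype=float)
    labels = np.asarray(is_error_free, dtype=bool)
    curve = []
    for tau in np.linspace(scores.min(), scores.max(), num_thresholds):
        pred = scores >= tau                      # predicted "no errors"
        tp = np.sum(pred & labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(labels.sum(), 1)
        curve.append((tau, precision, recall))
    return curve
```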

Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation
Sweta Agrawal | José G. C. De Souza | Ricardo Rei | António Farinhas | Gonçalo Faria | Patrick Fernandes | Nuno M Guerreiro | Andre Martins
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Alignment with human preferences is an important step in developing accurate and safe large language models. This is no exception in machine translation (MT), where better handling of language nuances and context-specific variations leads to improved quality. However, preference data based on human feedback can be very expensive to obtain and curate at a large scale. Automatic metrics, on the other hand, can induce preferences, but they might not match human expectations perfectly. In this paper, we propose an approach that leverages the best of both worlds. We first collect sentence-level quality assessments from professional linguists on translations generated by multiple high-quality MT systems and evaluate the ability of current automatic metrics to recover these preferences. We then use this analysis to curate a new dataset, MT-Pref (metric-induced translation preference), comprising 18k instances covering 18 language directions, using texts sourced from multiple domains post-2022. We show that aligning TOWER models on MT-Pref significantly improves translation quality on WMT23 and FLORES benchmarks.
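A toy sketch of how metric-induced preferences can be constructed, assuming a scoring function `metric(source, translation)` such as a COMET-style model is available (the field names and margin are ours, for illustration):

```python
def induce_preferences(source, translations, metric, margin=0.05):
    # Turn metric scores over alternative translations into preference
    # pairs, keeping only pairs whose score gap exceeds a margin so the
    # induced preference is more likely to match a human judgment.
    scored = sorted(((metric(source, t), t) for t in translations), reverse=True)
    pairs = []
    for s_hi, t_hi in scored:
        for s_lo, t_lo in scored:
            if s_hi - s_lo > margin:
                pairs.append({"prompt": source, "chosen": t_hi, "rejected": t_lo})
    return pairs
```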

xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection
Nuno M. Guerreiro | Ricardo Rei | Daan van Stigt | Luisa Coheur | Pierre Colombo | André F. T. Martins
Transactions of the Association for Computational Linguistics, Volume 12

Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what the errors are and what their severity is). On the other hand, generative large language models (LLMs) are driving the adoption of more granular evaluation strategies that attempt to detail and categorize translation errors. In this work, we introduce xCOMET, an open-source learned metric designed to bridge the gap between these approaches. xCOMET integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.

Assessing the Role of Context in Chat Translation Evaluation: Is Context Helpful and Under What Conditions?
Sweta Agrawal | Amin Farajian | Patrick Fernandes | Ricardo Rei | André F. T. Martins
Transactions of the Association for Computational Linguistics, Volume 12

Despite the recent success of automatic metrics for assessing translation quality, their application in evaluating the quality of machine-translated chats has been limited. Unlike more structured texts like news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This poses questions about the reliability of existing sentence-level metrics in this domain, as well as the role of context in assessing translation quality. Motivated by this, we conduct a meta-evaluation of existing automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information in these metrics for sentence-level evaluation affects their performance. Our findings show that augmenting neural learned metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model (LLM) and further validate that adding context helps even for LLM-based evaluation metrics.

mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Peiqin Lin | Chengzhi Hu | Zheyu Zhang | Andre Martins | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EACL 2024

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
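As a rough illustration of the core computation, one can pool mPLM representations of the same multi-parallel sentences per language and compare them with cosine similarity (a sketch under our own simplifying assumptions, not the paper's full pipeline):

```python
import numpy as np

def language_similarity(embeddings):
    # embeddings: dict mapping language code -> (num_sentences, d) array of
    # mPLM sentence representations for the SAME multi-parallel sentences.
    # Similarity is the average cosine between aligned sentence pairs.
    def mean_cosine(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return float((a * b).sum(axis=1).mean())
    langs = sorted(embeddings)
    return {(l1, l2): mean_cosine(embeddings[l1], embeddings[l2])
            for l1 in langs for l2 in langs if l1 < l2}
```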

Non-Exchangeable Conformal Language Generation with Nearest Neighbors
Dennis Ulmer | Chrysoula Zerva | Andre Martins
Findings of the Association for Computational Linguistics: EACL 2024

Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework for providing predictions imbued with statistical guarantees; however, its application to text generation is challenging, since the i.i.d. assumption is not realistic in this setting. In this paper, we bridge this gap by leveraging recent results on *non-exchangeable* conformal prediction, which still ensures bounds on coverage. The result, *non-exchangeable conformal nucleus sampling*, is a novel extension of the conformal prediction framework to generation based on nearest neighbors. Our method can be used post-hoc for an arbitrary model without extra training and supplies token-level, calibrated prediction sets equipped with statistical guarantees. Experiments in machine translation and language modeling show encouraging results in generation quality. By producing tighter prediction sets with good coverage, we give a more theoretically principled way to perform sampling with conformal guarantees.
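A heavily simplified sketch of the non-exchangeable conformal recipe (the weights, scores, and handling of the boundary case are illustrative; see Barber et al., 2023 for the actual estimator): calibration nonconformity scores come from neighbors of the current context, weighted by proximity, and their weighted quantile defines the token prediction set.

```python
import numpy as np

def weighted_quantile(scores, weights, alpha):
    # Weighted (1 - alpha) quantile with one unit of weight reserved for
    # the test point; if the quantile falls beyond all calibration
    # scores, the prediction set must be the full vocabulary.
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    cum = np.cumsum(weights) / (weights.sum() + 1.0)
    idx = np.searchsorted(cum, 1.0 - alpha)
    return scores[idx] if idx < len(scores) else np.inf

def conformal_token_set(probs, calib_scores, calib_weights, alpha=0.1):
    # probs: model distribution over the vocabulary at one decoding step.
    # Nonconformity is 1 - p(token); calibration points closer to the
    # current context receive larger weights.
    qhat = weighted_quantile(np.asarray(calib_scores, dtype=float),
                             np.asarray(calib_weights, dtype=float), alpha)
    return np.where(1.0 - probs <= qhat)[0]
```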

Findings of the Association for Computational Linguistics: ACL 2024
Lun-Wei Ku | Andre Martins | Vivek Srikumar
Findings of the Association for Computational Linguistics: ACL 2024

Analyzing Context Contributions in LLM-based Machine Translation
Emmanouil Zaranis | Nuno M Guerreiro | Andre Martins
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have achieved state-of-the-art performance in machine translation (MT) and demonstrated the ability to leverage in-context learning through few-shot examples. However, the mechanisms by which LLMs use different parts of the input context remain largely unexplored. In this work, we provide a comprehensive analysis of context utilization in MT, studying how LLMs use various context parts, such as few-shot examples and the source text, when generating translations. We highlight several key findings: (1) the source part of few-shot examples appears to contribute more than its corresponding targets, irrespective of translation direction; (2) finetuning LLMs with parallel data alters the contribution patterns of different context parts; and (3) there is a positional bias where earlier few-shot examples have higher contributions to the translated sequence. Finally, we demonstrate that inspecting anomalous context contributions can potentially uncover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond those known for standard encoder-decoder MT models.

xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
Marcos V Treviso | Nuno M Guerreiro | Sweta Agrawal | Ricardo Rei | José Pombal | Tania Vaz | Helena Wu | Beatriz Silva | Daan Van Stigt | Andre Martins
Findings of the Association for Computational Linguistics: EMNLP 2024

While machine translation (MT) systems are achieving increasingly strong performance on benchmarks, they often produce translations with errors and anomalies. Understanding these errors can potentially help improve the translation quality and user experience. This paper introduces xTower, an open large language model (LLM) built on top of TowerBase, designed to provide free-text explanations for translation errors in order to guide the generation of a corrected translation. The quality of the explanations generated by xTower is assessed via both intrinsic and extrinsic evaluation. We ask expert translators to evaluate the quality of the explanations across two dimensions: relatedness towards the error span being explained, and helpfulness in error understanding and improving translation quality. Extrinsically, we test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality. Our findings highlight xTower’s potential not only to produce plausible and helpful explanations of automatic translations, but also to leverage them to suggest corrected translations.

Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
Miguel Ramos | Patrick Fernandes | António Farinhas | Andre Martins
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF’s success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs. In machine translation (MT), where metrics trained from human annotations can readily be used as reward models, recent methods using minimum Bayes risk decoding and reranking have succeeded in improving the final quality of translation. In this study, we comprehensively explore and compare techniques for integrating quality metrics as reward models into the MT pipeline. This includes using the reward model for data filtering, during the training phase through RL, and at inference time by employing reranking techniques, and we assess the effects of combining these in a unified approach. Our experimental results, conducted across multiple translation tasks, underscore the crucial role of effective data filtering, based on estimated quality, in harnessing the full potential of RL in enhancing MT quality. Furthermore, our findings demonstrate the effectiveness of combining RL training with reranking techniques, showcasing substantial improvements in translation quality.

The Center for Responsible AI Project
Maria Ana Henriques | Ana Farinha | Nuno André | António Novais | Sara Guerreiro de Sousa | Bruno Prezado Silva | Ana Oliveira | Helena Moniz | Andre Martins | Paulo Dimas
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

This paper describes the project “NextGenAI: Center for Responsible AI”, a 39-month Mobilizing and Green Agenda for Business Innovation funded by the Portuguese Recovery and Resilience Plan, under the Recovery and Resilience Facility (RRF). The project aims to create a new Center for Responsible AI in Portugal, capable of delivering more than 20 AI products in crucial areas like “Life Sciences”, many of which use generative AI, particularly NLP models such as those for Machine Translation. It also contributes to translating into legislation the European law included in the EU AI Act, and to creating a critical mass in the development of responsible AI technologies. To accomplish this mission, the Center for Responsible AI is formed by an ecosystem of startups and research institutions that drive research in a virtuous way by addressing real market needs and opportunities in Responsible AI.

Findings of the Quality Estimation Shared Task at WMT 2024: Are LLMs Closing the Gap in QE?
Chrysoula Zerva | Frederic Blain | José G. C. De Souza | Diptesh Kanojia | Sourabh Deoghare | Nuno M. Guerreiro | Giuseppe Attanasio | Ricardo Rei | Constantin Orasan | Matteo Negri | Marco Turchi | Rajen Chatterjee | Pushpak Bhattacharyya | Markus Freitag | André Martins
Proceedings of the Ninth Conference on Machine Translation

We report the results of the WMT 2024 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. In this edition, we expanded our scope to assess the potential for quality estimates to help in the correction of translated outputs, hence including an automated post-editing (APE) direction. We publish new test sets with human annotations that target two directions: providing new Multidimensional Quality Metrics (MQM) annotations for three multi-domain language pairs (English to German, Spanish and Hindi), and extending the annotations on Indic languages by providing direct assessments and post-edits for translation from English into Hindi, Gujarati, Tamil and Telugu. We also perform a detailed analysis of the behaviour of different models with respect to different phenomena including gender bias, idiomatic language, and numerical and entity perturbations. We received submissions based on both traditional encoder-based approaches and large language model (LLM) based ones.

Tower v2: Unbabel-IST 2024 Submission for the General MT Shared Task
Ricardo Rei | Jose Pombal | Nuno M. Guerreiro | João Alves | Pedro Henrique Martins | Patrick Fernandes | Helena Wu | Tania Vaz | Duarte Alves | Amin Farajian | Sweta Agrawal | Antonio Farinhas | José G. C. De Souza | André Martins
Proceedings of the Ninth Conference on Machine Translation

In this work, we present Tower v2, an improved iteration of the state-of-the-art open-weight Tower models, and the backbone of our submission to the WMT24 General Translation shared task. Tower v2 introduces key improvements including expanded language coverage, enhanced data quality, and increased model capacity up to 70B parameters. Our final submission combines these advancements with quality-aware decoding strategies, selecting translations based on multiple translation quality signals. The resulting system demonstrates significant improvement over previous versions, outperforming closed commercial systems like GPT-4o, Claude 3.5, and DeepL even at a smaller 7B scale.

Improving Context Usage for Translating Bilingual Customer Support Chat with Large Language Models
Jose Pombal | Sweta Agrawal | André Martins
Proceedings of the Ninth Conference on Machine Translation

This paper describes Unbabel+IT’s submission to the Chat Shared Task held at the Ninth Conference on Machine Translation (WMT 2024). The task focuses on translating customer support chats between agents and customers communicating in different languages. We present two strategies for adapting state-of-the-art language models to better utilize contextual information when translating such conversations. Our training strategy involves finetuning the model on chat datasets with context-augmented instructions, resulting in a specialized model, TOWERCHAT. For inference, we propose a novel quality-aware decoding approach that leverages a context-aware metric, CONTEXTCOMET, to select the optimal translation from a pool of candidates. We evaluate our proposed approach on the official shared task datasets for ten language pairs, showing that our submission consistently outperforms the baselines on all language pairs, and the competing systems on 8 out of 10 of them, across multiple automated metrics. Remarkably, TOWERCHAT outperforms our contrastive submission based on the much larger TOWER-V2-70B model while being 10× smaller. According to human evaluation, our system outperforms all other systems and baselines across all language pairs. These results underscore the importance of context-aware training and inference in handling complex bilingual dialogues.

How Effective Are State Space Models for Machine Translation?
Hugo Pitorro | Pavlo Vasylenko | Marcos Treviso | André Martins
Proceedings of the Ninth Conference on Machine Translation

Transformers are the current architecture of choice for NLP, but their attention layers do not scale well to long contexts. Recent works propose to replace attention with linear recurrent layers; this is the case for state space models, which enjoy efficient training and inference. However, it remains unclear whether these models are competitive with transformers in machine translation (MT). In this paper, we provide a rigorous and comprehensive experimental comparison between transformers and linear recurrent models for MT. Concretely, we experiment with RetNet, Mamba, and hybrid versions of Mamba which incorporate attention mechanisms. Our findings demonstrate that Mamba is highly competitive with transformers on sentence- and paragraph-level datasets; on the latter, both models benefit from shifting the training distribution towards longer sequences. Further analysis shows that integrating attention into Mamba improves translation quality, robustness to sequence length extrapolation, and the ability to recall named entities.
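The linear-recurrent core that gives these models their efficiency is easy to state; a minimal fixed-parameter scan is sketched below (Mamba additionally makes the transition input-dependent, which this sketch omits):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    # Discrete state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    # Runs in O(seq_len) time, versus O(seq_len^2) for full attention.
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:               # x: (seq_len, d_in)
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)         # (seq_len, d_out)
```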

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Lun-Wei Ku | Andre Martins | Vivek Srikumar
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Lun-Wei Ku | Andre Martins | Vivek Srikumar
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2023

When Does Translation Require Context? A Data-driven, Multilingual Exploration
Patrick Fernandes | Kayo Yin | Emmy Liu | André Martins | Graham Neubig
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although proper handling of discourse significantly contributes to the quality of machine translation (MT), these improvements are not adequately measured in common translation quality metrics. Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation, however not in a fully systematic way. In this paper, we develop the Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena in any given dataset. The choice of phenomena is inspired by a novel methodology to systematically identify translations that require context. This methodology confirms the difficulty of previously studied phenomena while uncovering others which were not previously addressed. We find that commonly studied context-aware MT models make only marginal improvements over context-agnostic models, which suggests these models do not handle these ambiguities effectively. We release code and data for 14 language pairs to encourage the MT community to focus on accurately capturing discourse phenomena. Code available at https://github.com/neulab/contextual-mt

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob Imani | Peiqin Lin | Amir Hossein Kargaran | Silvia Severini | Masoud Jalili Sabet | Nora Kassner | Chunlan Ma | Helmut Schmid | André Martins | François Yvon | Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, “help” from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

Optimal Transport for Unsupervised Hallucination Detection in Neural Machine Translation
Nuno M. Guerreiro | Pierre Colombo | Pablo Piantanida | André Martins
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Neural machine translation (NMT) has become the de-facto standard in real-world machine translation applications. However, NMT models can unpredictably produce severely pathological translations, known as hallucinations, that seriously undermine user trust. It becomes thus crucial to implement effective preventive strategies to guarantee their proper functioning. In this paper, we address the problem of hallucination detection in NMT by following a simple intuition: as hallucinations are detached from the source content, they exhibit encoder-decoder attention patterns that are statistically different from those of good quality translations. We frame this problem with an optimal transport formulation and propose a fully unsupervised, plug-in detector that can be used with any attention-based NMT model. Experimental results show that our detector not only outperforms all previous model-based detectors, but is also competitive with detectors that employ external models trained on millions of samples for related tasks such as quality estimation and cross-lingual sentence similarity.
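A simplified version of the detector's intuition, using SciPy's 1-D Wasserstein distance and a uniform reference (the paper defines more careful Wass-to-Unif/Wass-to-Data variants; this sketch is ours):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wass_to_unif(attn):
    # attn: (target_len, source_len) encoder-decoder attention weights.
    # Pool target-to-source attention into one mass distribution over
    # source positions and measure its transport distance to uniform;
    # hallucinations tend to concentrate mass on few source tokens.
    mass = attn.mean(axis=0)
    mass = mass / mass.sum()
    pos = np.arange(len(mass))
    uniform = np.full(len(mass), 1.0 / len(mass))
    return wasserstein_distance(pos, pos, u_weights=mass, v_weights=uniform)
```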

Python Code Generation by Asking Clarification Questions
Haau-Sing (Xiaocheng) Li | Mohsen Mesgar | André Martins | Iryna Gurevych
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Code generation from text requires understanding the user’s intent from a natural language description and generating an executable code snippet that satisfies this intent. While recent pretrained language models demonstrate remarkable performance for this task, these models fail when the given natural language description is under-specified. In this work, we introduce a novel and more realistic setup for this task. We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions. Therefore, we collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers. The empirical results of our evaluation of pretrained language model performance on code generation show that clarifications result in more precisely generated code, as shown by the substantial improvement of model performance in all evaluation metrics. Alongside this, our task and dataset introduce new challenges to the community, including when and what clarification questions should be asked. Our code and dataset are available on GitHub.

CREST: A Joint Framework for Rationalization and Counterfactual Text Generation
Marcos Treviso | Alexis Ross | Nuno M. Guerreiro | André Martins
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Selective rationales and counterfactual examples have emerged as two effective, complementary classes of interpretability methods for analyzing and training NLP models. However, prior work has not explored how these methods can be integrated to combine their complementary advantages. We overcome this limitation by introducing CREST (ContRastive Edits with Sparse raTionalization), a joint framework for selective rationalization and counterfactual text generation, and show that this framework leads to improvements in counterfactual quality, model robustness, and interpretability. First, CREST generates valid counterfactuals that are more natural than those produced by previous methods, and subsequently can be used for data augmentation at scale, reducing the need for human-generated examples. Second, we introduce a new loss function that leverages CREST counterfactuals to regularize selective rationales and show that this regularization improves both model robustness and rationale quality, compared to methods that do not leverage CREST counterfactuals. Our results demonstrate that CREST successfully bridges the gap between selective rationales and counterfactual examples, addressing the limitations of existing methods and providing a more comprehensive view of a model’s predictions.

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics
Ricardo Rei | Nuno M. Guerreiro | Marcos Treviso | Luisa Coheur | Alon Lavie | André Martins
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, “black boxes” returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: https://github.com/Unbabel/COMET/tree/explainable-metrics

Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning
Duarte Alves | Nuno Guerreiro | João Alves | José Pombal | Ricardo Rei | José de Souza | Pierre Colombo | Andre Martins
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.

An Empirical Study of Translation Hypothesis Ensembling with Large Language Models
António Farinhas | José de Souza | Andre Martins
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) are becoming a one-fits-many solution, but they sometimes hallucinate or produce unreliable output. In this paper, we investigate how hypothesis ensembling can improve the quality of the generated text for the specific problem of LLM-based machine translation. We experiment with several techniques for ensembling hypotheses produced by LLMs such as ChatGPT, LLaMA, and Alpaca. We provide a comprehensive study along multiple dimensions, including the method to generate hypotheses (multiple prompts, temperature-based sampling, and beam search) and the strategy to produce the final translation (instruction-based, quality-based reranking, and minimum Bayes risk (MBR) decoding). Our results show that MBR decoding is a very effective method, that translation quality can be improved using a small number of samples, and that instruction tuning has a strong impact on the relation between the diversity of the hypotheses and the sampling temperature.
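For readers unfamiliar with MBR decoding, a self-contained sketch follows: each sampled hypothesis is scored by its average utility against all other samples used as pseudo-references, and the highest-scoring one is returned (the toy token-overlap utility stands in for a learned metric such as COMET):

```python
def mbr_decode(candidates, utility):
    # Minimum Bayes risk decoding over a pool of sampled hypotheses:
    # pick the candidate with the highest expected utility when the
    # other samples act as pseudo-references.
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates) / len(candidates)
        if score > best_score:
            best, best_score = hyp, score
    return best

def token_f1(hyp, ref):
    # Toy utility; a neural metric would be used in practice.
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if not overlap:
        return 0.0
    p, q = overlap / len(h), overlap / len(r)
    return 2 * p * q / (p + q)
```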

Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation
Nuno M. Guerreiro | Elena Voita | André Martins
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground. Previous work has been limited in several ways: it often resorts to artificial settings where the problem is amplified, it disregards some (common) types of hallucinations, and it does not validate the adequacy of detection heuristics. In this paper, we set foundations for the study of NMT hallucinations. First, we work in a natural setting, i.e., in-domain data without artificial noise in either training or inference. Next, we annotate a dataset of over 3.4k sentences indicating different kinds of critical errors and hallucinations. Then, we turn to detection methods, both revisiting methods used previously and proposing glass-box uncertainty-based detectors. Overall, we show that for preventive settings, (i) previously used methods are largely inadequate, and (ii) sequence log-probability works best and performs on par with reference-based methods. Finally, we propose DeHallucinator, a simple method for alleviating hallucinations at test time that significantly reduces the hallucinatory rate.
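The best-performing glass-box detector mentioned above is simple to reproduce in outline (the threshold value is ours, for illustration):

```python
import numpy as np

def seq_logprob(token_logprobs):
    # Length-normalized sequence log-probability of a translation,
    # computed from the model's own token log-probabilities.
    return float(np.mean(token_logprobs))

def flag_hallucinations(per_hyp_token_logprobs, threshold=-2.0):
    # Flag translations whose model confidence falls below a threshold
    # tuned on held-out data; low seq-logprob correlates with
    # detachment from the source.
    return [seq_logprob(lp) < threshold for lp in per_hyp_token_logprobs]
```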

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation
Taisiya Glushkova | Chrysoula Zerva | André F. T. Martins
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical errors, such as deviations in entities and numbers. In contrast, traditional evaluation metrics such as BLEU or chrF, which measure lexical or character overlap between translation hypotheses and human references, have lower correlations with human judgements but are sensitive to such deviations. In this paper, we investigate several ways of combining the two approaches in order to increase robustness of state-of-the-art evaluation methods to translations with critical errors. We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena, which leads to gains in correlations with humans and on the recent DEMETR benchmark on several language pairs.

Empirical Assessment of kNN-MT for Real-World Translation Scenarios
Pedro Henrique Martins | João Alves | Tânia Vaz | Madalena Gonçalves | Beatriz Silva | Marianna Buchicchio | José G. C. de Souza | André F. T. Martins
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper aims to investigate the effectiveness of the k-Nearest Neighbor Machine Translation model (kNN-MT) in real-world scenarios. kNN-MT is a retrieval-augmented framework that combines the advantages of parametric models with non-parametric datastores built using a set of parallel sentences. Previous studies have primarily focused on evaluating the model using only the BLEU metric and have not tested kNN-MT in real-world scenarios. Our study aims to fill this gap by conducting a comprehensive analysis on various datasets comprising different language pairs and different domains, using multiple automatic metrics and expert evaluated Multidimensional Quality Metrics (MQM). We compare kNN-MT with two alternate strategies: fine-tuning all the model parameters and adapter-based finetuning. Finally, we analyze the effect of the datastore size on translation quality, and we examine the number of entries necessary to bootstrap and configure the index.

Efficient Methods for Natural Language Processing: A Survey
Marcos Treviso | Ji-Ung Lee | Tianchu Ji | Betty van Aken | Qingqing Cao | Manuel R. Ciosici | Michael Hassid | Kenneth Heafield | Sara Hooker | Colin Raffel | Pedro H. Martins | André F. T. Martins | Jessica Zosa Forde | Peter Milder | Edwin Simpson | Noam Slonim | Jesse Dodge | Emma Strubell | Niranjan Balasubramanian | Leon Derczynski | Iryna Gurevych | Roy Schwartz
Transactions of the Association for Computational Linguistics, Volume 11

Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.

Hallucinations in Large Multilingual Translation Models
Nuno M. Guerreiro | Duarte M. Alves | Jonas Waldendorf | Barry Haddow | Alexandra Birch | Pierre Colombo | André F. T. Martins
Transactions of the Association for Computational Linguistics, Volume 11

Hallucinated translations can severely undermine user trust and raise safety issues when machine translation systems are deployed in the wild. Previous research on the topic focused on small bilingual models trained on high-resource languages, leaving a gap in our understanding of hallucinations in multilingual models across diverse translation scenarios. In this work, we fill this gap by conducting a comprehensive analysis—over 100 language pairs across various resource levels and going beyond English-centric directions—on both the M2M neural machine translation (NMT) models and GPT large language models (LLMs). Among several insights, we highlight that models struggle with hallucinations primarily in low-resource directions and when translating out of English, where, critically, they may reveal toxic patterns that can be traced back to the training data. We also find that LLMs produce qualitatively different hallucinations to those of NMT models. Finally, we show that hallucinations are hard to reverse by merely scaling models trained with the same data. However, employing more diverse models, trained on different data or with different procedures, as fallback systems can improve translation quality and virtually eliminate certain pathologies.

Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
Patrick Fernandes | Aman Madaan | Emmy Liu | António Farinhas | Pedro Henrique Martins | Amanda Bertsch | José G. C. de Souza | Shuyan Zhou | Tongshuang Wu | Graham Neubig | André F. T. Martins
Transactions of the Association for Computational Linguistics, Volume 11

Natural language generation has witnessed significant advancements due to the training of large language models on vast internet-scale datasets. Despite these advancements, there exists a critical challenge: These models can inadvertently generate content that is toxic, inaccurate, and unhelpful, and existing automatic evaluation metrics often fall short of identifying these shortcomings. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of recent research that has leveraged human feedback to improve natural language generation. First, we introduce a taxonomy distilled from existing research to categorize and organize the varied forms of feedback. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which uses large language models to make judgments based on a set of principles and minimize the need for human intervention. We also release a website of this survey at feedback-gap-survey.info.

Findings of the WMT 2023 Shared Task on Quality Estimation
Frederic Blain | Chrysoula Zerva | Ricardo Rei | Nuno M. Guerreiro | Diptesh Kanojia | José G. C. de Souza | Beatriz Silva | Tânia Vaz | Yan Jingxuan | Fatemeh Azadi | Constantin Orasan | André Martins
Proceedings of the Eighth Conference on Machine Translation

We report the results of the WMT 2023 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. This edition introduces a few novel aspects and extensions that aim to enable more fine-grained and explainable quality estimation approaches. We introduce an updated quality annotation scheme using Multidimensional Quality Metrics to obtain sentence- and word-level quality scores for three language pairs. We also extend the provided data to new language pairs: we specifically target low-resource languages and provide training, development and test data for English-Hindi, English-Tamil, English-Telugu and English-Gujarati, as well as a zero-shot test set for English-Farsi. Further, we introduce a novel fine-grained error prediction task aspiring to motivate research towards more detailed quality predictions.

Scaling up CometKiwi: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task
Ricardo Rei | Nuno M. Guerreiro | José Pombal | Daan van Stigt | Marcos Treviso | Luisa Coheur | José G. C. de Souza | André Martins
Proceedings of the Eighth Conference on Machine Translation

We present the joint contribution of Unbabel and Instituto Superior Técnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated in all tasks: Sentence- and Word-level Quality Prediction, and Fine-grained Error Span Detection. For all tasks we build on the CometKiwi model (Rei et al., 2022). Our multilingual approaches are ranked first for all tasks, reaching state-of-the-art performance for quality estimation at word-, span- and sentence-level granularity. Compared to the previous state-of-the-art, CometKiwi, we show large improvements in correlation with human judgements (up to 10 Spearman points), surpassing the second-best multilingual submission by up to 3.8 absolute points.

The Devil Is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
Patrick Fernandes | Daniel Deutsch | Mara Finkelstein | Parker Riley | André Martins | Graham Neubig | Ankush Garg | Jonathan Clark | Markus Freitag | Orhan Firat
Proceedings of the Eighth Conference on Machine Translation

Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.

2022

Improving abstractive summarization with energy-based re-ranking
Diogo Pernes | Afonso Mendes | André F. T. Martins
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores (Deng et al., 2021) have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.

Quality-Aware Decoding for Neural Machine Translation
Patrick Fernandes | António Farinhas | Ricardo Rei | José G. C. de Souza | Perez Ogayo | Graham Neubig | Andre Martins
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite the progress in machine translation quality estimation and evaluation in the last years, decoding in neural machine translation (NMT) is mostly oblivious to this and centers around finding the most probable translation according to the model (MAP decoding), approximated with beam search. In this paper, we bring together these two lines of research and propose quality-aware decoding for NMT, by leveraging recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods like N-best reranking and minimum Bayes risk decoding. We perform an extensive comparison of various possible candidate generation and ranking methods across four datasets and two model classes and find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics (COMET and BLEURT) and to human assessments.
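A compact sketch of one quality-aware strategy, N-best reranking with a tuned combination of model likelihood and a reference-free QE score (the interpolation weight and the `qe_score` interface are our illustrative assumptions):

```python
def quality_aware_rerank(source, candidates, logprobs, qe_score, lam=0.5):
    # Combine length-normalized model log-probability with a
    # reference-free quality-estimation score; in practice the weight
    # lam would be tuned on a development set.
    def combined(i):
        norm_lp = logprobs[i] / max(len(candidates[i].split()), 1)
        return lam * norm_lp + (1.0 - lam) * qe_score(source, candidates[i])
    return candidates[max(range(len(candidates)), key=combined)]
```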

∞-former: Infinite Memory Transformer
Pedro Henrique Martins | Zita Marinho | Andre Martins
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the ∞-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the ∞-former’s attention complexity becomes independent of the context length, trading off memory length with precision. In order to control where precision is more important, the ∞-former maintains “sticky memories,” being able to model arbitrarily long contexts while keeping the computation budget fixed. Experiments on a synthetic sorting task, language modeling, and document grounded dialogue generation demonstrate the ∞-former’s ability to retain information from long sequences.
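The continuous-space attention at the heart of this construction can be sketched with a discretized integral (our simplification: the paper fits basis-function coefficients to the memory, while here the memory signal is just sampled on a grid):

```python
import numpy as np

def continuous_attention(signal, mu, sigma, num_points=512):
    # signal: callable t in [0, 1] -> d-dim vector, the long-term memory
    # viewed as a continuous function. Attention is a Gaussian density
    # over [0, 1]; the context vector is c = E_p[x(t)], approximated on
    # a grid, so its cost does not grow with the original context length.
    t = np.linspace(0.0, 1.0, num_points)
    density = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    density /= density.sum()
    values = np.stack([signal(ti) for ti in t])   # (num_points, d)
    return density @ values

# Usage with a stand-in memory of 10,000 past states:
mem = np.random.randn(10_000, 64)
signal = lambda ti: mem[int(ti * (len(mem) - 1))]
context = continuous_attention(signal, mu=0.8, sigma=0.05)
```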

Chunk-based Nearest Neighbor Machine Translation
Pedro Henrique Martins | Zita Marinho | André F. T. Martins
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Semi-parametric models, which augment generation with retrieval, have led to impressive results in language modeling and machine translation, due to their ability to retrieve fine-grained information from a datastore of examples. One of the most prominent approaches, kNN-MT, exhibits strong domain adaptation capabilities by retrieving tokens from domain-specific datastores (Khandelwal et al., 2021). However, kNN-MT requires an expensive retrieval operation for every single generated token, leading to a very low decoding speed (around 8 times slower than a parametric model). In this paper, we introduce a chunk-based kNN-MT model which retrieves chunks of tokens from the datastore, instead of a single token. We propose several strategies for incorporating the retrieved chunks into the generation process, and for selecting the steps at which the model needs to search for neighbors in the datastore. Experiments on machine translation in two settings, static and “on-the-fly” domain adaptation, show that the chunk-based kNN-MT model leads to significant speed-ups (up to 4 times) with only a small drop in translation quality.
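A bare-bones view of the retrieval step, with chunks of tokens stored as datastore values instead of single tokens (brute-force search stands in for the approximate index used in practice):

```python
import numpy as np

def knn_chunk_retrieve(query, keys, chunk_values, k=4):
    # keys: (N, d) decoder states recorded when building the datastore;
    # chunk_values[i]: the chunk of target tokens that followed state i.
    d2 = ((keys - query) ** 2).sum(axis=1)
    idx = np.argpartition(d2, k)[:k]
    return [(chunk_values[i], float(d2[i])) for i in idx]
```

The retrieved chunks would then be interpolated with the parametric model's distribution, e.g. p = lam * p_knn + (1 - lam) * p_model, with retrieval triggered only at selected decoding steps.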

Disentangling Uncertainty in Machine Translation Evaluation
Chrysoula Zerva | Taisiya Glushkova | Ricardo Rei | André F. T. Martins
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Trainable evaluation metrics for machine translation (MT) exhibit strong correlation with human judgements, but they are often hard to interpret and might produce unreliable scores under noisy or out-of-domain data. Recent work has attempted to mitigate this with simple uncertainty quantification techniques (Monte Carlo dropout and deep ensembles); however, these techniques (as we show) are limited in several ways – for example, they are unable to distinguish between different kinds of uncertainty, and they are time- and memory-consuming. In this paper, we propose more powerful and efficient uncertainty predictors for MT evaluation, and we assess their ability to target different sources of aleatoric and epistemic uncertainty. To this end, we develop and compare training objectives for the COMET metric to enhance it with an uncertainty prediction output, including heteroscedastic regression, divergence minimization, and direct uncertainty prediction. Our experiments show improved results on uncertainty prediction for the WMT metrics task datasets, with a substantial reduction in computational costs. Moreover, they demonstrate the ability of these predictors to address specific uncertainty causes in MT evaluation, such as low-quality references and out-of-domain data.
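One of the training objectives mentioned, heteroscedastic regression, amounts to giving the metric a variance head and training with a Gaussian negative log-likelihood (a generic sketch, not the exact COMET training code):

```python
import torch

def heteroscedastic_loss(mu, log_var, target):
    # mu: predicted quality score; log_var: predicted log-variance;
    # target: human judgment. High predicted variance expresses
    # uncertainty and down-weights the squared error on that example.
    return 0.5 * (log_var + (target - mu) ** 2 / log_var.exp()).mean()
```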

Proceedings of the Sixth Workshop on Structured Prediction for NLP
Andreas Vlachos | Priyanka Agrawal | André Martins | Gerasimos Lampouras | Chunchuan Lyu
Proceedings of the Sixth Workshop on Structured Prediction for NLP

Predicting Attention Sparsity in Transformers
Marcos Treviso | António Góis | Patrick Fernandes | Erick Fonseca | Andre Martins
Proceedings of the Sixth Workshop on Structured Prediction for NLP

Transformers’ quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for detailed comparison between different models along their Pareto curves, important to guide future benchmarks for sparse attention models.
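A toy version of the quantization variant (random projections stand in for the learned ones, and bucket boundaries are set by quantiles): attention would then be computed only for query-key pairs that land in the same bucket.

```python
import numpy as np

def shared_bucket_pairs(Q, K, num_buckets=8, seed=0):
    # Project queries and keys to one dimension, bucket by quantiles,
    # and return only the pairs sharing a bucket; the paper instead
    # learns projections that mimic the entmax sparsity pattern.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal(Q.shape[1])
    scores = np.concatenate([Q @ proj, K @ proj])
    edges = np.quantile(scores, np.linspace(0, 1, num_buckets + 1)[1:-1])
    qb, kb = np.digitize(Q @ proj, edges), np.digitize(K @ proj, edges)
    return [(i, j) for i in range(len(Q)) for j in range(len(K))
            if qb[i] == kb[j]]
```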

Efficient Machine Translation Domain Adaptation
Pedro Martins | Zita Marinho | Andre Martins
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

Machine translation models struggle when translating out-of-domain text, which makes domain adaptation a topic of critical importance. However, most domain adaptation methods focus on fine-tuning or training the entire or part of the model on every new domain, which can be costly. On the other hand, semi-parametric models have been shown to successfully perform domain adaptation by retrieving examples from an in-domain datastore (Khandelwal et al., 2021). A drawback of these retrieval-augmented models, however, is that they tend to be substantially slower. In this paper, we explore several approaches to speed up nearest neighbor machine translation. We adapt the methods recently proposed by He et al. (2021) for language modeling, and introduce a simple but effective caching strategy that avoids performing retrieval when similar contexts have been seen before. Translation quality and runtimes for several domains show the effectiveness of the proposed solutions.
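The caching idea can be made concrete in a few lines (the distance threshold and linear scan are our simplifications; any fast lookup structure would do):

```python
import numpy as np

class RetrievalCache:
    # Reuse previously retrieved neighbors when the current decoder
    # state is close enough to one already queried, skipping the
    # expensive datastore search.
    def __init__(self, tau=0.05):
        self.states, self.results, self.tau = [], [], tau

    def lookup(self, query):
        for state, result in zip(self.states, self.results):
            if float(np.sum((state - query) ** 2)) < self.tau:
                return result          # cache hit
        return None                    # cache miss: search the datastore

    def store(self, query, result):
        self.states.append(query)
        self.results.append(result)
```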

Beyond Characters: Subword-level Morpheme Segmentation
Ben Peters | Andre F. T. Martins
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents DeepSPIN’s submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation. We make three submissions, all to the word-level subtask. First, we show that entmax-based sparse sequence-to-sequence models deliver large improvements over conventional softmax-based models, echoing results from other tasks. Then, we challenge the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords. This subword transformer outperforms all of our character-level models and wins the word-level subtask. Although we do not make an official submission to the sentence-level subtask, we show that this subword-based approach is highly effective there as well.

MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset
Marina Fomicheva | Shuo Sun | Erick Fonseca | Chrysoula Zerva | Frédéric Blain | Vishrav Chaudhary | Francisco Guzmán | Nina Lopatina | Lucia Specia | André F. T. Martins
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains annotations for eleven language pairs, including both high- and low-resource languages. Specifically, it is annotated for translation quality with human labels for up to 10,000 translations per language pair in the following formats: sentence-level direct assessments and post-editing effort, and word-level binary good/bad labels. Apart from the quality-related scores, each source-translation sentence pair is accompanied by the corresponding post-edited sentence, as well as titles of the articles where the sentences were extracted from, and information on the neural MT models used to translate the text. We provide a thorough description of the data collection and annotation process as well as an analysis of the annotation distribution for each language pair. We also report the performance of baseline systems trained on the MLQE-PE dataset. The dataset is freely available and has already been used for several WMT shared tasks.

Searching for COMETINHO: The Little Metric That Could
Ricardo Rei | Ana C Farinha | José G.C. de Souza | Pedro G. Ramos | André F.T. Martins | Luisa Coheur | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

In recent years, several neural fine-tuned machine translation evaluation metrics such as COMET and BLEURT have been proposed. These metrics achieve much higher correlations with human judgments than lexical overlap metrics, at the cost of computational efficiency and simplicity, limiting their applications to scenarios in which one has to score thousands of translation hypotheses (e.g., scoring multiple systems or Minimum Bayes Risk decoding). In this paper, we explore optimization techniques, pruning, and knowledge distillation to create more compact and faster COMET versions. Our results show that just by optimizing the code through the use of caching and length batching we can reduce inference time between 39% and 65% when scoring multiple systems. Also, we show that pruning COMET can lead to a 21% model reduction without affecting the model’s accuracy beyond 0.01 Kendall tau correlation. Furthermore, we present DISTIL-COMET, a lightweight distilled version that is 80% smaller and 2.128x faster while attaining a performance close to the original model and above strong baselines such as BERTSCORE and PRISM.
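Of the code-level optimizations mentioned, length batching is the simplest to illustrate (a generic sketch, not the COMET codebase itself): sort inputs by length so each batch pads to a similar size, and keep the original indices to restore the order of the scores.

```python
def length_batches(sentences, batch_size=32):
    # Yield (original_indices, batch) pairs with length-sorted batches,
    # minimizing wasted computation on padding tokens.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]
```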

pdf bib
QUARTZ: Quality-Aware Machine Translation
José G.C. de Souza | Ricardo Rei | Ana C. Farinha | Helena Moniz | André F. T. Martins
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper presents QUARTZ, QUality-AwaRe machine Translation, a project led by Unbabel which aims at developing machine translation systems that are more robust and produce fewer critical errors. With QUARTZ we want to enable machine translation for user-generated conversational content types that do not tolerate critical errors in automatic translations.

pdf bib
DeepSPIN: Deep Structured Prediction for Natural Language Processing
André F. T. Martins | Ben Peters | Chrysoula Zerva | Chunchuan Lyu | Gonçalo Correia | Marcos Treviso | Pedro Martins | Tsvetomila Mihaylova
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

DeepSPIN is a research project funded by the European Research Council (ERC) whose goal is to develop new neural structured prediction methods, models, and algorithms for improving the quality, interpretability, and data-efficiency of natural language processing (NLP) systems, with special emphasis on machine translation and quality estimation. We describe in this paper the latest findings from this project.

pdf bib
Proceedings of the Seventh Conference on Machine Translation (WMT)
Philipp Koehn | Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Tom Kocmi | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Marco Turchi | Marcos Zampieri
Proceedings of the Seventh Conference on Machine Translation (WMT)

pdf bib
Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | Eleftherios Avramidis | Tom Kocmi | George Foster | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, e-commerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. As in the previous year, we acquired our own human ratings through expert-based evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation is more reliable, and (ii) we extended the pool of translations with 5 additional translations based on MBR decoding or rescoring, which are challenging for current metrics. In addition, we initiated a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis of how well metrics perform on three language pairs: English to German, English to Russian, and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and confirm once again that overlap metrics like Bleu, spBleu, or chrf correlate poorly with human ratings. The results also reveal that neural-based metrics are remarkably robust across different domains and challenges.
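
For readers unfamiliar with the MBR-based candidates mentioned above, Minimum Bayes Risk decoding is standardly defined as follows (notation mine, not taken from the paper):

    % Standard MBR decoding; u is a utility function (often an automatic metric),
    % and y^(1),...,y^(N) are candidate translations sampled from the model.
    \hat{y} = \operatorname*{arg\,max}_{y \in \mathcal{Y}}
        \mathbb{E}_{y' \sim p(\cdot \mid x)}\left[ u(y, y') \right]
      \approx \operatorname*{arg\,max}_{y \in \mathcal{Y}}
        \frac{1}{N} \sum_{n=1}^{N} u\left(y, y^{(n)}\right)

Such candidates tend to be high quality yet hard for metrics to rank, which is why they make a challenging addition to the pool.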

pdf bib
Findings of the WMT 2022 Shared Task on Quality Estimation
Chrysoula Zerva | Frédéric Blain | Ricardo Rei | Piyawat Lertvittayakumjorn | José G. C. de Souza | Steffen Eger | Diptesh Kanojia | Duarte Alves | Constantin Orăsan | Marina Fomicheva | André F. T. Martins | Lucia Specia
Proceedings of the Seventh Conference on Machine Translation (WMT)

We report the results of the WMT 2022 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. This edition introduces a few novel aspects and extensions that aim to enable more fine-grained and explainable quality estimation approaches. We introduce an updated quality annotation scheme using Multidimensional Quality Metrics to obtain sentence- and word-level quality scores for three language pairs. We also extend the Direct Assessments and post-edit data (MLQE-PE) to new language pairs: we present a novel and large dataset on English-Marathi, as well as a zero-shot test set on English-Yoruba. Further, we include an explainability sub-task for all language pairs and present a new format of the critical error detection task for two new language pairs. Participants from 11 different teams submitted altogether 991 systems to different task variants and language pairs.

pdf bib
Robust MT Evaluation with Sentence-level Multilingual Augmentation
Duarte Alves | Ricardo Rei | Ana C Farinha | José G. C. de Souza | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

Automatic translations with critical errors may lead to misinterpretations and pose several risks for the user. As such, it is important that Machine Translation (MT) evaluation systems are robust to these errors in order to increase the reliability and safety of MT systems. Here we introduce SMAUG, a novel Sentence-level Multilingual AUGmentation approach for generating translations with critical errors, and apply this approach to create a test set for evaluating the robustness of MT metrics to these errors. We show that current state-of-the-art metrics are improving in their ability to distinguish translations with and without critical errors and to penalize the former accordingly. We also show that metrics tend to struggle with errors related to named entities and numbers, and that there is high variance in the robustness of current methods to translations with critical errors.

pdf bib
COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
Ricardo Rei | José G. C. de Souza | Duarte Alves | Chrysoula Zerva | Ana C Farinha | Taisiya Glushkova | Alon Lavie | Luisa Coheur | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics Shared Task. Our primary submission – dubbed COMET-22 – is an ensemble of a COMET estimator model trained with Direct Assessments and a newly proposed multitask model trained to predict sentence-level scores along with OK/BAD word-level tags derived from Multidimensional Quality Metrics error annotations. These models are ensembled together using a hyper-parameter search that weights different features extracted from both evaluation models and combines them into a single score. For the reference-free evaluation, we present CometKiwi. Similarly to our primary submission, CometKiwi is an ensemble of two models: a traditional predictor-estimator model inspired by OpenKiwi, and our new multitask model trained on Multidimensional Quality Metrics, which can also be used without references. Both our submissions show improved correlations compared to state-of-the-art metrics from last year, as well as increased robustness to critical errors.

pdf bib
CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task
Ricardo Rei | Marcos Treviso | Nuno M. Guerreiro | Chrysoula Zerva | Ana C Farinha | Christine Maroti | José G. C. de Souza | Taisiya Glushkova | Duarte Alves | Luisa Coheur | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated in all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.

pdf bib
Findings of the WMT 2022 Shared Task on Chat Translation
Ana C Farinha | M. Amin Farajian | Marianna Buchicchio | Patrick Fernandes | José G. C. de Souza | Helena Moniz | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper reports the findings of the second edition of the Chat Translation Shared Task. As in the previous WMT 2020 edition, the task consisted of translating bilingual customer support conversational text. However, unlike the previous edition, in which the bilingual data was created from a synthetic monolingual English corpus, this year we used a portion of Unbabel’s newly released MAIA corpus, which contains genuine bilingual conversations between agents and customers. We also expanded the language pairs to English↔German (en↔de), English↔French (en↔fr), and English↔Brazilian Portuguese (en↔pt-br). Given that the main goal of the shared task is to translate bilingual conversations, participants were encouraged to train and test their models specifically for this environment. In total, we received 18 submissions from 4 different teams. All teams participated in both directions of en↔de. One of the teams also participated in en↔fr and en↔pt-br. We evaluated the submissions with automatic metrics as well as human judgments via Multidimensional Quality Metrics (MQM) in both directions. The official ranking of the systems is based on the overall MQM scores of the participating systems in both directions, i.e., agent and customer.

pdf bib
Unbabel-IST at the WMT Chat Translation Shared Task
João Alves | Pedro Henrique Martins | José G. C. de Souza | M. Amin Farajian | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the joint contribution of IST and Unbabel to the WMT 2022 Chat Translation Shared Task. We participated in all six language directions (English ↔ German, English ↔ French, English ↔ Brazilian Portuguese). Due to the lack of domain-specific data, we use mBART50, a large pretrained language model trained on millions of sentence pairs, as our base model. We fine-tune it in two steps: first on publicly available data, and then on the validation set. Once we have a domain-specific model, we explore the use of kNN-MT as a way of incorporating domain-specific data at decoding time.
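
A minimal sketch of the kNN-MT interpolation mentioned at the end (toy NumPy datastore and variable names of my own choosing; the real system retrieves from decoder hidden states at scale):

    import numpy as np

    # Hypothetical sketch of kNN-MT interpolation; not the system's actual code.
    def knn_mt_distribution(p_mt, query, keys, values, vocab_size,
                            k=4, temperature=10.0, lam=0.5):
        """Interpolate the base MT distribution with a nearest-neighbor one."""
        dists = np.linalg.norm(keys - query, axis=1)  # distance to each datastore key
        nn = np.argsort(dists)[:k]                    # the k nearest neighbors
        weights = np.exp(-dists[nn] / temperature)
        weights /= weights.sum()
        p_knn = np.zeros(vocab_size)
        for w, tok in zip(weights, values[nn]):       # values store target-token ids
            p_knn[tok] += w
        return lam * p_knn + (1.0 - lam) * p_mt      # final next-token distribution

The datastore (keys, values) would be built by running the fine-tuned model over in-domain data and recording (hidden state, next token) pairs.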

2021

pdf bib
Do Context-Aware Translation Models Pay the Right Attention?
Kayo Yin | Patrick Fernandes | Danish Pruthi | Aditi Chaudhary | André F. T. Martins | Graham Neubig
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Context-aware machine translation models are designed to leverage contextual information, but often fail to do so. As a result, they inaccurately disambiguate pronouns and polysemous words that require context for resolution. In this paper, we ask several questions: What contexts do human translators use to resolve ambiguous words? Are models paying large amounts of attention to the same context? What if we explicitly train them to do so? To answer these questions, we introduce SCAT (Supporting Context for Ambiguous Translations), a new English-French dataset comprising supporting context words for 14K translations that professional translators found useful for pronoun disambiguation. Using SCAT, we perform an in-depth analysis of the context used to disambiguate, examining positional and lexical characteristics of the supporting words. Furthermore, we measure the degree of alignment between the model’s attention scores and the supporting context from SCAT, and apply a guided attention strategy to encourage agreement between the two.

pdf bib
Measuring and Increasing Context Usage in Context-Aware Machine Translation
Patrick Fernandes | Kayo Yin | Graham Neubig | André F. T. Martins
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent work in neural machine translation has demonstrated both the necessity and feasibility of using inter-sentential context: context from sentences other than those currently being translated. However, while many current methods present model architectures that can, in theory, use this extra context, it is often unclear how much they actually utilize it at translation time. In this paper, we introduce a new metric, conditional cross-mutual information, to quantify the usage of context by these models. Using this metric, we measure how much document-level machine translation systems use particular varieties of context. We find that target context is referenced more than source context, and that including more context has a diminishing effect on results. We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models. Experiments show that our method not only increases context usage, but also improves translation quality according to metrics such as BLEU and COMET, as well as performance on anaphoric pronoun resolution and lexical cohesion contrastive datasets.
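
As I read the abstract, the proposed metric measures how much adding context C lowers the model’s cross-entropy over the target Y, roughly (my notation, an approximation of the paper’s definition):

    % CXMI: drop in cross-entropy when context C is made available.
    \mathrm{CXMI}(C \rightarrow Y \mid X)
      = H_{q_{\mathrm{MT}}}(Y \mid X) - H_{q_{\mathrm{CA}}}(Y \mid X, C)

where q_MT is a context-agnostic model and q_CA the context-aware one; a larger value indicates that the model genuinely uses the context.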

pdf bib
IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task
Marcos Treviso | Nuno M. Guerreiro | Ricardo Rei | André F. T. Martins
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

We present the joint contribution of Instituto Superior Técnico (IST) and Unbabel to the Explainable Quality Estimation (QE) shared task, where systems were submitted to two tracks: constrained (without word-level supervision) and unconstrained (with word-level supervision). For the constrained track, we experimented with several explainability methods to extract the relevance of input tokens from sentence-level QE models built on top of multilingual pre-trained transformers. Among the different tested methods, composing explanations in the form of attention weights scaled by the norm of value vectors yielded the best results. When word-level labels are used during training, our best results were obtained by using word-level predicted probabilities. We further improve the performance of our methods on the two tracks by ensembling explanation scores extracted from models trained with different pre-trained transformers, achieving strong results for in-domain and zero-shot language pairs.

pdf bib
Smoothing and Shrinking the Sparse Seq2Seq Search Space
Ben Peters | André F. T. Martins
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one unsatisfying aspect is its length bias: models give high scores to short, inadequate hypotheses and often make the empty string the argmax—the so-called cat got your tongue problem. Recently proposed entmax-based sparse sequence-to-sequence models present a possible solution, since they can shrink the search space by assigning zero probability to bad hypotheses, but their ability to handle word-level tasks with transformers has never been tested. In this work, we show that entmax-based models effectively solve the cat got your tongue problem, removing a major source of model error for neural machine translation. In addition, we generalize label smoothing, a critical regularization technique, to the broader family of Fenchel-Young losses, which includes both cross-entropy and the entmax losses. Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion and deliver improvements and better calibration properties on cross-lingual morphological inflection and machine translation for 7 language pairs.
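
The Fenchel-Young loss family referenced above has the standard form (my notation, not the paper’s):

    % Fenchel-Young loss; Omega* is the convex conjugate of the regularizer Omega.
    L_{\Omega}(\theta; y) = \Omega^{*}(\theta) + \Omega(y) - \langle \theta, y \rangle

Choosing Ω as the negative Shannon entropy over the simplex recovers the softmax/cross-entropy pair, while Tsallis α-entropies yield the entmax losses; the paper’s contribution is generalizing label smoothing across this whole family.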

pdf bib
Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021)
Zornitsa Kozareva | Sujith Ravi | Andreas Vlachos | Priyanka Agrawal | André Martins
Proceedings of the 5th Workshop on Structured Prediction for NLP (SPNLP 2021)

pdf bib
Proceedings of the Sixth Conference on Machine Translation
Loic Barrault | Ondrej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussa | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Tom Kocmi | Andre Martins | Makoto Morishita | Christof Monz
Proceedings of the Sixth Conference on Machine Translation

pdf bib
Findings of the WMT 2021 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Marina Fomicheva | Chrysoula Zerva | Zhenhao Li | Vishrav Chaudhary | André F. T. Martins
Proceedings of the Sixth Conference on Machine Translation

We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.

pdf bib
IST-Unbabel 2021 Submission for the Quality Estimation Shared Task
Chrysoula Zerva | Daan van Stigt | Ricardo Rei | Ana C Farinha | Pedro Ramos | José G. C. de Souza | Taisiya Glushkova | Miguel Vera | Fabio Kepler | André F. T. Martins
Proceedings of the Sixth Conference on Machine Translation

We present the joint contribution of IST and Unbabel to the WMT 2021 Shared Task on Quality Estimation. Our team participated in two tasks: Direct Assessment and Post-Editing Effort, encompassing a total of 35 submissions. For all submissions, our efforts focused on training multilingual models on top of the OpenKiwi predictor-estimator architecture, using pre-trained multilingual encoders combined with adapters. We further experiment with uncertainty-related objectives and features, as well as training on out-of-domain direct assessment data.

pdf bib
Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task
Ricardo Rei | Ana C Farinha | Chrysoula Zerva | Daan van Stigt | Craig Stewart | Pedro Ramos | Taisiya Glushkova | André F. T. Martins | Alon Lavie
Proceedings of the Sixth Conference on Machine Translation

In this paper, we present the joint contribution of Unbabel and IST to the WMT 2021 Metrics Shared Task. With this year’s focus on Multidimensional Quality Metric (MQM) as the ground-truth human assessment, our aim was to steer COMET towards higher correlations with MQM. We do so by first pre-training on Direct Assessments and then fine-tuning on z-normalized MQM scores. In our experiments we also show that reference-free COMET models are becoming competitive with reference-based models, even outperforming the best COMET model from 2020 on this year’s development data. Additionally, we present COMETinho, a lightweight COMET model that is 19x faster on CPU than the original model, while also achieving state-of-the-art correlations with MQM. Finally, in the “QE as a metric” track, we also participated with a QE model trained using the OpenKiwi framework leveraging MQM scores and word-level annotations.
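
A minimal sketch of per-annotator z-normalization, as I understand it from the abstract (hypothetical data layout, not the actual pipeline):

    import statistics

    # Hypothetical preprocessing sketch; the data layout is an assumption.
    def z_normalize(scores_by_annotator):
        """Map each annotator's raw MQM scores to zero mean and unit variance."""
        normalized = {}
        for annotator, scores in scores_by_annotator.items():
            mu = statistics.mean(scores)
            sigma = statistics.pstdev(scores) or 1.0  # guard against zero variance
            normalized[annotator] = [(s - mu) / sigma for s in scores]
        return normalized

This removes annotator-specific scale and leniency effects before the scores are used as regression targets.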

pdf bib
Uncertainty-Aware Machine Translation Evaluation
Taisiya Glushkova | Chrysoula Zerva | Ricardo Rei | André F. T. Martins
Findings of the Association for Computational Linguistics: EMNLP 2021

Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on noisy, biased and scarce human judgements, often resulting in unreliable quality predictions. In this paper, we introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality. We combine the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to obtain quality scores along with confidence intervals. We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task, augmented with MQM annotations. We experiment with varying numbers of references and further discuss the usefulness of uncertainty-aware quality estimation (without references) to flag possibly critical translation mistakes.
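
A minimal PyTorch sketch of the Monte Carlo dropout variant (hypothetical model interface; deep ensembles would instead average over independently trained models):

    import torch

    # Hypothetical sketch; `model` is any regressor with dropout layers.
    def mc_dropout_score(model, inputs, n_samples=30):
        """Return a quality estimate plus a spread usable for confidence intervals."""
        model.train()  # keep dropout active at inference time
        with torch.no_grad():
            samples = torch.stack([model(inputs) for _ in range(n_samples)])
        return samples.mean(dim=0), samples.std(dim=0)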

pdf bib
SPECTRA: Sparse Structured Text Rationalization
Nuno M. Guerreiro | André F. T. Martins
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Selective rationalization aims to produce decisions along with rationales (e.g., text highlights or word alignments between two sentences). Commonly, rationales are modeled as stochastic binary masks, requiring sampling-based gradient estimators, which complicates training and requires careful hyperparameter tuning. Sparse attention mechanisms are a deterministic alternative, but they lack a way to regularize the rationale extraction (e.g., to control the sparsity of a text highlight or the number of alignments). In this paper, we present a unified framework for deterministic extraction of structured explanations via constrained inference on a factor graph, forming a differentiable layer. Our approach greatly eases training and rationale regularization, generally outperforming previous work in terms of both performance and plausibility of the extracted rationales. We further provide a comparative study of stochastic and deterministic methods for rationale extraction for classification and natural language inference tasks, jointly assessing their predictive power, quality of the explanations, and model variability.

2020

pdf bib
Proceedings of the Fifth Conference on Machine Translation
Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Yvette Graham | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri
Proceedings of the Fifth Conference on Machine Translation

pdf bib
Findings of the WMT 2020 Shared Task on Chat Translation
M. Amin Farajian | António V. Lopes | André F. T. Martins | Sameen Maruf | Gholamreza Haffari
Proceedings of the Fifth Conference on Machine Translation

We report the results of the first edition of the WMT shared task on chat translation. The task consisted of translating bilingual conversational text, in particular customer support chats for the English-German language pair (English agent, German customer). This task varies from the other translation shared tasks, i.e. news and biomedical, mainly due to the fact that the conversations are bilingual, less planned, more informal, and often ungrammatical. Furthermore, such conversations are usually characterized by shorter and simpler sentences and contain more pronouns. We received 14 submissions from 6 participating teams, all of them covering both directions, i.e. En->De for agent utterances and De->En for customer messages. We used automatic metrics (BLEU and TER) for evaluating the translations of both agent and customer messages and human document-level direct assessments (DDA) to evaluate the agent translations.

pdf bib
Findings of the WMT 2020 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Marina Fomicheva | Erick Fonseca | Vishrav Chaudhary | Francisco Guzmán | André F. T. Martins
Proceedings of the Fifth Conference on Machine Translation

We report the results of the WMT20 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word, sentence and document levels. This edition included new data with open domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English, Sinhala-English and Nepali-English data for the sentence-level subtasks, English-German and English-Chinese for the word-level subtask, and English-French data for the document-level subtask. In addition, we made neural machine translation models available to participants. 19 participating teams from 27 institutions submitted altogether 1374 systems to different task variants and language pairs.

pdf bib
IST-Unbabel Participation in the WMT20 Quality Estimation Shared Task
João Moura | Miguel Vera | Daan van Stigt | Fabio Kepler | André F. T. Martins
Proceedings of the Fifth Conference on Machine Translation

We present the joint contribution of IST and Unbabel to the WMT 2020 Shared Task on Quality Estimation. Our team participated in all tracks (Direct Assessment, Post-Editing Effort, Document-Level), encompassing a total of 14 submissions. Our submitted systems were developed by extending the OpenKiwi framework to a transformer-based predictor-estimator architecture, and to cope with glass-box, uncertainty-based features coming from neural machine translation systems.

pdf bib
One-Size-Fits-All Multilingual Models
Ben Peters | André F. T. Martins
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents DeepSPIN’s submissions to Tasks 0 and 1 of the SIGMORPHON 2020 Shared Task. For both tasks, we present multilingual models, training jointly on data in all languages. We perform no language-specific hyperparameter tuning – each of our submissions uses the same model for all languages. Our basic architecture is the sparse sequence-to-sequence model with entmax attention and loss, which allows our models to learn sparse, local alignments while still being trainable with gradient-based techniques. For Task 1, we achieve strong performance with both RNN- and transformer-based sparse models. For Task 0, we extend our RNN-based model to a multi-encoder set-up in which separate modules encode the lemma and inflection sequences. Despite our models’ lack of language-specific tuning, they tie for first in Task 0 and place third in Task 1.

pdf bib
Revisiting Higher-Order Dependency Parsers
Erick Fonseca | André F. T. Martins
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural encoders have allowed dependency parsers to shift from higher-order structured models to simpler first-order ones, making decoding faster and still achieving better accuracy than non-neural parsers. This has led to a belief that neural encoders can implicitly encode structural constraints, such as siblings and grandparents in a tree. We tested this hypothesis and found that neural parsers may benefit from higher-order features, even when employing a powerful pre-trained encoder, such as BERT. While the gains of higher-order features are small in the presence of a powerful encoder, they are consistent for long-range dependencies and long sentences. In particular, higher-order models are more accurate on full sentence parses and on the exact match of modifier lists, indicating that they deal better with larger, more complex structures.

pdf bib
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
André Martins | Helena Moniz | Sara Fumega | Bruno Martins | Fernando Batista | Luisa Coheur | Carla Parra | Isabel Trancoso | Marco Turchi | Arianna Bisazza | Joss Moorkens | Ana Guerberof | Mary Nurminen | Lena Marg | Mikel L. Forcada
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

pdf bib
Learning Non-Monotonic Automatic Post-Editing of Translations from Human Orderings
António Góis | Kyunghyun Cho | André Martins
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Recent research in neural machine translation has explored flexible generation orders, as an alternative to left-to-right generation. However, training non-monotonic models brings a new complication: how to search for a good ordering when there is a combinatorial explosion of orderings arriving at the same final result? Also, how do these automatic orderings compare with the actual behaviour of human translators? Current models rely on manually built biases or are left to explore all possibilities on their own. In this paper, we analyze the orderings produced by human post-editors and use them to train an automatic post-editing system. We compare the resulting system with those trained with left-to-right and random post-editing orderings. We observe that humans tend to follow a nearly left-to-right order, but with interesting deviations, such as preferring to start by correcting punctuation or verbs.

pdf bib
Document-level Neural MT: A Systematic Comparison
António Lopes | M. Amin Farajian | Rachel Bawden | Michael Zhang | André F. T. Martins
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In this paper we provide a systematic comparison of existing and new document-level neural machine translation solutions. As part of this comparison, we introduce and evaluate a document-level variant of the recently proposed Star Transformer architecture. In addition to using the traditional metric BLEU, we report the accuracy of the models in handling anaphoric pronoun translation as well as coherence and cohesion using contrastive test sets. Finally, we report the results of human evaluation in terms of Multidimensional Quality Metrics (MQM) and analyse the correlation of the results obtained by the automatic metrics with human judgments.

pdf bib
DeepSPIN: Deep Structured Prediction for Natural Language Processing
André F. T. Martins
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

DeepSPIN is a research project funded by the European Research Council (ERC) whose goal is to develop new neural structured prediction methods, models, and algorithms for improving the quality, interpretability, and data-efficiency of natural language processing (NLP) systems, with special emphasis on machine translation and quality estimation applications.

pdf bib
Project MAIA: Multilingual AI Agent Assistant
André F. T. Martins | Joao Graca | Paulo Dimas | Helena Moniz | Graham Neubig
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents the Multilingual Artificial Intelligence Agent Assistant (MAIA), a project led by Unbabel with the collaboration of CMU, INESC-ID and IT Lisbon. MAIA will employ cutting-edge machine learning and natural language processing technologies to build multilingual AI agent assistants, eliminating language barriers. MAIA’s translation layer will empower human agents to provide customer support in real-time, in any language, with human quality.

pdf bib
Proceedings of the Fourth Workshop on Structured Prediction for NLP
Priyanka Agrawal | Zornitsa Kozareva | Julia Kreutzer | Gerasimos Lampouras | André Martins | Sujith Ravi | Andreas Vlachos
Proceedings of the Fourth Workshop on Structured Prediction for NLP

pdf bib
The Explanation Game: Towards Prediction Explainability through Sparse Communication
Marcos Treviso | André F. T. Martins
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Explainability is a topic of growing importance in NLP. In this work, we provide a unified perspective of explainability as a communication problem between an explainer and a layperson about a classifier’s decision. We use this framework to compare several explainers, including gradient methods, erasure, and attention mechanisms, in terms of their communication success. In addition, we reinterpret these methods in the light of classical feature selection, and use this as inspiration for new embedded explainers, through the use of selective, sparse attention. Experiments in text classification and natural language inference, using different configurations of explainers and laypeople (including both machines and humans), reveal an advantage of attention-based explainers over gradient and erasure methods, and show that selective attention is a simpler alternative to stochastic rationalizers. Human experiments show strong results on text classification with post-hoc explainers trained to optimize communication success.

pdf bib
Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning
Tsvetomila Mihaylova | Vlad Niculae | André F. T. Martins
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Latent structure models are a powerful tool for modeling language data: they can mitigate the error propagation and annotation bottleneck in pipeline systems, while simultaneously uncovering linguistic insights about the data. One challenge with end-to-end training of these models is the argmax operation, which has a null gradient. In this paper, we focus on surrogate gradients, a popular strategy to deal with this problem. We explore latent structure learning through the angle of pulling back the downstream learning objective. In this paradigm, we discover a principled motivation for both the straight-through estimator (STE) and the recently proposed SPIGOT, a variant of STE for structured models. Our perspective leads to new algorithms in the same family. We empirically compare the known and the novel pulled-back estimators against the popular alternatives, yielding new insight for practitioners and revealing intriguing failure cases.
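
For concreteness, the straight-through estimator discussed here can be written in a few lines (a generic sketch, not the paper’s code): the forward pass is discrete, while the backward pass pretends the discretization was the identity.

    import torch

    def straight_through_argmax(logits):
        probs = torch.softmax(logits, dim=-1)
        hard = torch.nn.functional.one_hot(
            probs.argmax(dim=-1), logits.size(-1)).float()
        # Forward value equals `hard`; gradients flow through `probs` as if
        # no argmax had been applied (the straight-through surrogate).
        return (hard - probs).detach() + probs

SPIGOT replaces this identity surrogate with a projection step better suited to structured argmax problems.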

pdf bib
Sparse Text Generation
Pedro Henrique Martins | Zita Marinho | André F. T. Martins
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Current state-of-the-art text generators build on powerful language models such as GPT-2, achieving impressive performance. However, to avoid degenerate text, they require sampling from a modified softmax, via temperature parameters or ad-hoc truncation techniques, as in top-k or nucleus sampling. This creates a mismatch between training and testing conditions. In this paper, we use the recently introduced entmax transformation to train and sample from a natively sparse language model, avoiding this mismatch. The result is a text generator with favorable performance in terms of fluency and consistency, fewer repetitions, and n-gram diversity closer to human text. In order to evaluate our model, we propose three new metrics for comparing sparse or truncated distributions: 𝜖-perplexity, sparsemax score, and Jensen-Shannon divergence. Human-evaluated experiments in story completion and dialogue generation show that entmax sampling leads to more engaging and coherent stories and conversations.
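
For contrast, here is the kind of ad-hoc truncation the paper argues against, nucleus (top-p) sampling, in a minimal PyTorch sketch (a standard technique; code my own):

    import torch

    def nucleus_sample(logits, p=0.9):
        """Sample from the smallest set of tokens whose total mass reaches p."""
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = probs.sort(descending=True)
        keep = sorted_probs.cumsum(dim=-1) - sorted_probs < p  # mass before token < p
        truncated = sorted_probs * keep                        # zero out the tail
        truncated = truncated / truncated.sum()
        return sorted_idx[torch.multinomial(truncated, 1)]

Entmax sampling removes the need for this truncation because the trained distribution is already sparse.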

2019

pdf bib
Sparse Sequence-to-Sequence Models
Ben Peters | Vlad Niculae | André F. T. Martins
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This density is wasteful, making models less interpretable and assigning probability mass to many implausible outputs. In this paper, we propose sparse sequence-to-sequence models, rooted in a new family of 𝛼-entmax transformations, which includes softmax and sparsemax as particular cases, and is sparse for any 𝛼 > 1. We provide fast algorithms to evaluate these transformations and their gradients, which scale well for large vocabulary sizes. Our models are able to produce sparse alignments and to assign nonzero probability to a short list of plausible outputs, sometimes rendering beam search exact. Experiments on morphological inflection and machine translation reveal consistent gains over dense models.
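
The α-entmax family has the following standard variational form (my notation):

    % alpha-entmax as a regularized argmax over the probability simplex,
    % with H_alpha the Tsallis alpha-entropy.
    \alpha\text{-entmax}(z)
      = \operatorname*{arg\,max}_{p \in \triangle^{d}} \; p^{\top} z + H_{\alpha}(p),
    \qquad
    H_{\alpha}(p) = \frac{1}{\alpha(\alpha - 1)} \sum_{j} \left(p_j - p_j^{\alpha}\right)
    \quad (\alpha \neq 1)

As α → 1 this recovers softmax, α = 2 gives sparsemax, and any α > 1 can assign exactly zero probability to low-scoring outputs.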

pdf bib
A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning
Gonçalo M. Correia | André F. T. Martins
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Automatic post-editing (APE) seeks to automatically refine the output of a black-box machine translation (MT) system through human post-edits. APE systems are usually trained by complementing human post-edited data with large amounts of artificial data generated through back-translations, a time-consuming process often no easier than training an MT system from scratch. In this paper, we propose an alternative where we fine-tune pre-trained BERT models on both the encoder and decoder of an APE system, exploring several parameter sharing strategies. By training on a dataset of only 23K sentences for 3 hours on a single GPU, we obtain results that are competitive with systems trained on 5M artificial sentences. When we add this artificial data, our method obtains state-of-the-art results.

pdf bib
Joint Learning of Named Entity Recognition and Entity Linking
Pedro Henrique Martins | Zita Marinho | André F. T. Martins
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Named entity recognition (NER) and entity linking (EL) are two fundamentally related tasks, since in order to perform EL, first the mentions to entities have to be detected. However, most entity linking approaches disregard the mention detection part, assuming that the correct mentions have been previously detected. In this paper, we perform joint learning of NER and EL to leverage their relatedness and obtain a more robust and generalisable system. For that, we introduce a model inspired by the Stack-LSTM approach. We observe that, in fact, doing multi-task learning of NER and EL improves the performance in both tasks when comparing with models trained with individual objectives. Furthermore, we achieve results competitive with the state-of-the-art in both NER and EL.

pdf bib
Scheduled Sampling for Transformers
Tsvetomila Mihaylova | André F. T. Martins
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Scheduled sampling is a technique for avoiding one of the known problems in sequence-to-sequence generation: exposure bias. It consists of feeding the model a mix of teacher-forced embeddings and the model predictions from the previous step at training time. The technique has been used to improve model performance with recurrent neural networks (RNNs). In the Transformer model, unlike the RNN, the generation of a new word attends to the full sentence generated so far, not only to the last word, so it is not straightforward to apply scheduled sampling. We propose structural changes that allow scheduled sampling to be applied to Transformer architectures, via a two-pass decoding strategy. Experiments on two language pairs achieve performance close to a teacher-forcing baseline and show that this technique is promising for further exploration.
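
A minimal sketch of the two-pass strategy (hypothetical interfaces; a paraphrase of the idea, not the authors’ code): the first pass decodes with gold embeddings, and the second re-decodes after randomly swapping some positions for the model’s own predictions.

    import torch

    # Hypothetical sketch; the `decoder` and `embed` interfaces are assumptions.
    def two_pass_step(decoder, embed, gold_ids, memory, mix_prob=0.25):
        gold_emb = embed(gold_ids)
        first_logits = decoder(gold_emb, memory)        # pass 1: teacher forcing
        pred_emb = embed(first_logits.argmax(dim=-1))   # model's own predictions
        swap = (torch.rand(gold_ids.shape) < mix_prob).unsqueeze(-1)
        mixed = torch.where(swap, pred_emb, gold_emb)   # per-position mixing
        return decoder(mixed, memory)                   # pass 2: compute the loss here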

pdf bib
OpenKiwi: An Open Source Framework for Quality Estimation
Fabio Kepler | Jonay Trénous | Marcos Treviso | Miguel Vera | André F. T. Martins
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce OpenKiwi, a PyTorch-based open source framework for translation quality estimation. OpenKiwi supports training and testing of word-level and sentence-level quality estimation systems, implementing the winning systems of the WMT 2015–18 quality estimation campaigns. We benchmark OpenKiwi on two datasets from WMT 2018 (English-German SMT and NMT), yielding state-of-the-art performance on the word-level tasks and near state-of-the-art performance on the sentence-level tasks.

pdf bib
Latent Structure Models for Natural Language Processing
André F. T. Martins | Tsvetomila Mihaylova | Nikita Nangia | Vlad Niculae
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Latent structure models are a powerful tool for modeling compositional data, discovering linguistic structure, and building NLP pipelines. They are appealing for two main reasons: they allow incorporating structural bias during training, leading to more accurate models; and they allow discovering hidden linguistic structure, which provides better interpretability. This tutorial will cover recent advances in discrete latent structure models. We discuss their motivation, potential, and limitations, then explore in detail three strategies for designing such models: gradient approximation, reinforcement learning, and end-to-end differentiable methods. We highlight connections among all these methods, enumerating their strengths and weaknesses. The models we present and analyze have been applied to a wide variety of NLP tasks, including sentiment analysis, natural language inference, language modeling, machine translation, and semantic parsing. Examples and evaluation will be covered throughout. After attending the tutorial, a practitioner will be better informed about which method is best suited for their problem.

pdf bib
Selective Attention for Context-aware Neural Machine Translation
Sameen Maruf | André F. T. Martins | Gholamreza Haffari
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Despite the progress made in sentence-level NMT, current systems still fall short at achieving fluent, good quality translation for a full document. Recent works in context-aware NMT consider only a few previous sentences as context and may not scale to entire documents. To this end, we propose a novel and scalable top-down approach to hierarchical attention for context-aware NMT which uses sparse attention to selectively focus on relevant sentences in the document context and then attends to key words in those sentences. We also propose single-level attention approaches based on sentence or word-level information in the context. The document-level context representation, produced from these attention modules, is integrated into the encoder or decoder of the Transformer model depending on whether we use monolingual or bilingual context. Our experiments and evaluation on English-German datasets in different document MT settings show that our selective attention approach not only significantly outperforms context-agnostic baselines but also surpasses context-aware baselines in most cases.

pdf bib
Jointly Extracting and Compressing Documents with Summary State Representations
Afonso Mendes | Shashi Narayan | Sebastião Miranda | Zita Marinho | André F. T. Martins | Shay B. Cohen
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a new neural model for text summarization that first extracts sentences from a document and then compresses them. The proposed model offers a balance that sidesteps the difficulties in abstractive methods while generating more concise summaries than extractive methods. In addition, our model dynamically determines the length of the output summary based on the gold summaries it observes during training and does not require length constraints typical to extractive summarization. The model achieves state-of-the-art results on the CNN/DailyMail and Newsroom datasets, improving over current extractive and abstractive methods. Human evaluations demonstrate that our model generates concise and informative summaries. We also make available a new dataset of oracle compressive summaries derived automatically from the CNN/DailyMail reference summaries.

pdf bib
Adaptively Sparse Transformers
Gonçalo M. Correia | Vlad Niculae | André F. T. Martins
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with alpha-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the alpha parameter – which controls the shape and sparsity of alpha-entmax – allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets. Findings of the quantitative and qualitative analysis of our approach include that heads in different layers learn different sparsity preferences and tend to be more diverse in their attention distributions than softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.

pdf bib
Proceedings of the Third Workshop on Structured Prediction for NLP
Andre Martins | Andreas Vlachos | Zornitsa Kozareva | Sujith Ravi | Gerasimos Lampouras | Vlad Niculae | Julia Kreutzer
Proceedings of the Third Workshop on Structured Prediction for NLP

pdf bib
ITIST at the SIGMORPHON 2019 Shared Task: Sparse Two-headed Models for Inflection
Ben Peters | André F. T. Martins
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents the Instituto de Telecomunicações–Instituto Superior Técnico submission to Task 1 of the SIGMORPHON 2019 Shared Task. Our models combine sparse sequence-to-sequence models with a two-headed attention mechanism that learns separate attention distributions for the lemma and inflectional tags. Among submissions to Task 1, our models rank second and third. Despite the low data setting of the task (only 100 in-language training examples), they learn plausible inflection patterns and often concentrate all probability mass into a small set of hypotheses, making beam search exact.

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

pdf bib
Findings of the WMT 2019 Shared Tasks on Quality Estimation
Erick Fonseca | Lisa Yankovskaya | André F. T. Martins | Mark Fishel | Christian Federmann
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We report the results of the WMT19 shared task on Quality Estimation, i.e. the task of predicting the quality of the output of machine translation systems given just the source text and the hypothesis translations. The task includes estimation at three granularity levels: word, sentence and document. A novel addition is evaluating sentence-level QE against human judgments: in other words, designing MT metrics that do not need a reference translation. This year we include three language pairs, produced solely by neural machine translation systems. Participating teams from eleven institutions submitted a variety of systems to different task variants and language pairs.

pdf bib
Unbabel’s Participation in the WMT19 Translation Quality Estimation Shared Task
Fabio Kepler | Jonay Trénous | Marcos Treviso | Miguel Vera | António Góis | M. Amin Farajian | António V. Lopes | André F. T. Martins
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We present the contribution of the Unbabel team to the WMT 2019 Shared Task on Quality Estimation. We participated in the word-, sentence-, and document-level tracks, encompassing 3 language pairs: English-German, English-Russian, and English-French. Our submissions build upon the recent OpenKiwi framework: we combine linear, neural, and predictor-estimator systems with new transfer learning approaches using BERT and XLM pre-trained models. We compare systems individually and propose new ensemble techniques for word- and sentence-level predictions. We also propose a simple technique for converting word labels into document-level predictions. Overall, our submitted systems achieve the best results on all tracks and language pairs by a considerable margin.

pdf bib
Unbabel’s Submission to the WMT2019 APE Shared Task: BERT-Based Encoder-Decoder for Automatic Post-Editing
António V. Lopes | M. Amin Farajian | Gonçalo M. Correia | Jonay Trénous | André F. T. Martins
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes Unbabel’s submission to the WMT2019 APE Shared Task for the English-German language pair. Following the recent rise of large, powerful, pre-trained models, we adapt the BERT pretrained model to perform Automatic Post-Editing in an encoder-decoder framework. Analogously to dual-encoder architectures, we develop a BERT-based encoder-decoder (BED) model in which a single pretrained BERT encoder receives both the source src and machine translation mt strings. Furthermore, we explore a conservativeness factor to constrain the APE system to perform fewer edits. As the official results show, when trained on a weighted combination of in-domain and artificial training data, our BED system with the conservativeness penalty significantly improves the translations of a strong NMT system, by -0.78 TER and +1.23 BLEU. Finally, our submission achieves a new state-of-the-art, ex-aequo, in English-German APE of NMT.

pdf bib
Translator2Vec: Understanding and Representing Human Post-Editors
António Góis | André F. T. Martins
Proceedings of Machine Translation Summit XVII: Research Track

pdf bib
Pivot Machine Translation in INTERACT Project
Chao-Hong Liu | Andy Way | Catarina Silva | André Martins
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

pdf bib
Sparse and Constrained Attention for Neural Machine Translation
Chaitanya Malaviya | Pedro Ferreira | André F. T. Martins
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In neural machine translation, words are sometimes dropped from the source or generated repeatedly in the translation. We explore novel strategies to address this coverage problem that change only the attention transformation. Our approach allocates fertilities to source words, which are used to bound the attention each word can receive. We experiment with various sparse and constrained attention transformations and propose a new one, constrained sparsemax, shown to be differentiable and sparse. Empirical evaluation is provided on three language pairs.
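
In standard form, the constrained sparsemax proposed here is a Euclidean projection onto the simplex with per-word upper bounds u given by the fertilities (my notation):

    % Constrained sparsemax: projection onto the capped probability simplex.
    \mathrm{csparsemax}(z; u)
      = \operatorname*{arg\,min}_{p \in \triangle^{d},\; p \le u} \; \| p - z \|^{2}

The result stays differentiable and sparse, while no source word can accumulate more attention than its fertility budget allows.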

pdf bib
Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

pdf bib
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Maja Popović | Celia Rico | André Martins | Joachim Van den Bogaert | Mikel L. Forcada
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

pdf bib
Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing
Ramón Astudillo | João Graça | André Martins
Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing

pdf bib
Interpretable Structure Induction via Sparse Attention
Ben Peters | Vlad Niculae | André F. T. Martins
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Neural network methods are experiencing wide adoption in NLP, thanks to their empirical performance on many tasks. Modern neural architectures go far beyond simple feedforward and recurrent models: they are complex pipelines that perform soft, differentiable computation instead of discrete logic. The price of such soft computing is the introduction of dense dependencies, which make it hard to disentangle the patterns that trigger a prediction. Our recent work on sparse and structured latent computation presents a promising avenue for enhancing the interpretability of such neural pipelines. Through this extended abstract, we aim to discuss and explore the potential and impact of our methods.

pdf bib
Contextual Neural Model for Translating Bilingual Multi-Speaker Conversations
Sameen Maruf | André F. T. Martins | Gholamreza Haffari
Proceedings of the Third Conference on Machine Translation: Research Papers

Recent works in neural machine translation have begun to explore document translation. However, translating online multi-speaker conversations is still an open problem. In this work, we propose the task of translating Bilingual Multi-Speaker Conversations, and explore neural architectures which exploit both source and target-side conversation histories for this task. To initiate an evaluation for this task, we introduce datasets extracted from Europarl v7 and OpenSubtitles2016. Our experiments on four language-pairs confirm the significance of leveraging conversation history, both in terms of BLEU and manual evaluation.

pdf bib
Findings of the WMT 2018 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Varvara Logacheva | Ramón F. Astudillo | André F. T. Martins
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We report the results of the WMT18 shared task on Quality Estimation, i.e., the task of predicting the quality of the output of machine translation systems at various granularity levels: word, phrase, sentence, and document. This year we include four language pairs, three text domains, and translations produced by both statistical and neural machine translation systems. Participating teams from ten institutions submitted a variety of systems to different task variants and language pairs.

pdf bib
Towards Dynamic Computation Graphs via Sparse Latent Structure
Vlad Niculae | André F. T. Martins | Claire Cardie
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Deep NLP models benefit from underlying structures in the data—e.g., parse trees—typically extracted using off-the-shelf parsers. Recent attempts to jointly learn the latent structure encounter a tradeoff: either make factorization assumptions that limit expressiveness, or sacrifice end-to-end differentiability. Using the recently proposed SparseMAP inference, which retrieves a sparse distribution over latent structures, we propose a novel approach for end-to-end learning of latent structure predictors jointly with a downstream predictor. To the best of our knowledge, our method is the first to enable unrestricted dynamic computation graph construction from the global latent structure, while maintaining differentiability.
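For context, SparseMAP inference can be stated as a quadratically regularized MAP problem over the marginal polytope; the formulation below is a sketch of the definition, with notation assumed rather than quoted from the paper:

```latex
% SparseMAP: quadratic regularization of MAP inference over the
% marginal polytope M (a sketch; notation assumed).
\[
\operatorname{SparseMAP}(\boldsymbol{\theta})
  = \operatorname*{arg\,max}_{\boldsymbol{\mu} \in \mathcal{M}}
    \boldsymbol{\theta}^{\top}\boldsymbol{\mu}
    - \tfrac{1}{2}\|\boldsymbol{\mu}\|_{2}^{2}
\]
```

The quadratic term makes the solution a sparse convex combination of a few structures, which is what allows the downstream computation graph to be built dynamically over only those structures while keeping the whole model differentiable.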

2017

pdf bib
Learning What’s Easy: Fully Differentiable Neural Easy-First Taggers
André F. T. Martins | Julia Kreutzer
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We introduce a novel neural easy-first decoder that learns to solve sequence tagging tasks in a flexible order. In contrast to previous easy-first decoders, our models are end-to-end differentiable. The decoder iteratively updates a “sketch” of the predictions over the sequence. At its core is an attention mechanism that controls which parts of the input are strategically the best to process next. We present a new constrained softmax transformation that ensures the same cumulative attention to every word, and show how to efficiently evaluate and backpropagate over it. Our models compare favourably to BiLSTM taggers on three sequence tagging tasks.
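The constrained softmax mentioned above can be sketched as an entropy-regularized attention problem with per-position upper bounds u; the notation below is assumed, not quoted from the paper:

```latex
% Constrained softmax: entropy-regularized attention with upper bounds u
% (a sketch; H denotes the Shannon entropy).
\[
\operatorname{csoftmax}(\mathbf{z}; \mathbf{u})
  = \operatorname*{arg\,max}_{\boldsymbol{\alpha} \in \Delta^{n},\ \boldsymbol{\alpha} \le \mathbf{u}}
    \mathbf{z}^{\top}\boldsymbol{\alpha} + H(\boldsymbol{\alpha})
\]
```

With all upper bounds set to 1 the box constraint is inactive and the usual softmax is recovered; tighter bounds are what enforce the cumulative-attention guarantee the abstract describes.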

pdf bib
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics
André Martins | Anselmo Peñas
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Pushing the Limits of Translation Quality Estimation
André F. T. Martins | Marcin Junczys-Dowmunt | Fabio N. Kepler | Ramón Astudillo | Chris Hokamp | Roman Grundkiewicz
Transactions of the Association for Computational Linguistics, Volume 5

Translation quality estimation is a task of growing importance in NLP, due to its potential to reduce post-editing human effort in disruptive ways. However, this potential is currently limited by the relatively low accuracy of existing systems. In this paper, we achieve remarkable improvements by exploiting synergies between the related tasks of word-level quality estimation and automatic post-editing. First, we stack a new, carefully engineered, neural model into a rich feature-based word-level quality estimation system. Then, we use the output of an automatic post-editing system as an extra feature, obtaining striking results on WMT16: a word-level F1-MULT score of 57.47% (an absolute gain of +7.95% over the current state of the art), and a Pearson correlation score of 65.56% for sentence-level HTER prediction (an absolute gain of +13.36%).
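The stacking idea can be illustrated with a short, hypothetical sketch: per-word probabilities from a neural QE model and an agreement-with-APE indicator are appended as extra features to a feature-based word-level classifier. All names and the stand-in classifier below are illustrative assumptions, not the paper's actual system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_features(base_features, neural_probs, ape_agreement):
    """Append the neural model's per-word OK-probabilities and an
    agreement-with-APE indicator to hand-crafted word features.
    (Hypothetical helper; not from the paper.)"""
    return np.hstack([base_features,
                      neural_probs[:, None],
                      ape_agreement[:, None]])

# Dummy data just to exercise the sketch (100 words, 20 base features).
rng = np.random.default_rng(0)
X = stack_features(rng.normal(size=(100, 20)),
                   rng.random(100),
                   rng.integers(0, 2, 100))
y = rng.integers(0, 2, 100)                 # OK/BAD word labels (dummy)
clf = LogisticRegression(max_iter=1000).fit(X, y)
```

In the paper the base system is a structured feature-based model rather than plain logistic regression; the sketch only shows how the extra predictors enter as features.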

pdf bib
The INTERACT Project and Crisis MT
Sharon O’Brien | Chao-Hong Liu | Andy Way | João Graça | André Martins | Helena Moniz | Ellie Kemp | Rebecca Petras
Proceedings of Machine Translation Summit XVI: Commercial MT Users and Translators Track

pdf bib
Unbabel’s Participation in the WMT17 Translation Quality Estimation Shared Task
André F. T. Martins | Fabio Kepler | José Monteiro
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
Semi-Supervised Learning of Sequence Models with Method of Moments
Zita Marinho | André F. T. Martins | Shay B. Cohen | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Unbabel’s Participation in the WMT16 Word-Level Translation Quality Estimation Shared Task
André F. T. Martins | Ramón Astudillo | Chris Hokamp | Fabio Kepler
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Jointly Learning to Embed and Predict with Multiple Languages
Daniel C. Ferreira | André F. T. Martins | Mariana S. C. Almeida
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib
Lisbon: Evaluating TurboSemanticParser on Multiple Languages and Out-of-Domain Data
Mariana S. C. Almeida | André F. T. Martins
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Aligning Opinions: Cross-Lingual Opinion Mining with Dependencies
Mariana S. C. Almeida | Cláudia Pinto | Helena Figueira | Pedro Mendes | André F. T. Martins
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Transferring Coreference Resolvers with Posterior Regularization
André F. T. Martins
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Parsing as Reduction
Daniel Fernández-González | André F. T. Martins
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
Priberam Compressive Summarization Corpus: A New Multi-Document Summarization Corpus for European Portuguese
Miguel B. Almeida | Mariana S. C. Almeida | André F. T. Martins | Helena Figueira | Pedro Mendes | Cláudia Pinto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we introduce the Priberam Compressive Summarization Corpus, a new multi-document summarization corpus for European Portuguese. The corpus follows the format of the summarization corpora for English in recent DUC and TAC conferences. It contains 80 manually chosen topics referring to events that occurred between 2010 and 2013. Each topic contains 10 news stories from major Portuguese newspapers, radio, and TV stations, along with two human-generated summaries of up to 100 words. Apart from the language, one important difference from the DUC/TAC setup is that the human summaries in our corpus are compressive: the annotators performed only sentence and word deletion operations, as opposed to generating summaries from scratch. We use this corpus to train and evaluate learning-based extractive and compressive summarization systems, providing an empirical comparison between these two approaches. The corpus is made freely available in order to facilitate research on automatic summarization.
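To make the compressive property concrete, here is a hypothetical sketch (class and function names are illustrative, not the corpus's actual distribution format): each topic bundles ten documents with two reference summaries, and every summary sentence must be obtainable from some source sentence by word deletion alone:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Topic:                   # hypothetical layout, not the released format
    documents: List[str]       # 10 news stories on one event
    summaries: List[str]       # 2 human compressive summaries (<= 100 words)

def is_compressive(summary_sentences, source_sentences):
    """True if every summary sentence can be obtained from some source
    sentence by deleting words only (relative order preserved)."""
    def is_subsequence(short, long):
        it = iter(long)
        return all(word in it for word in short)
    return all(any(is_subsequence(s.split(), d.split())
                   for d in source_sentences)
               for s in summary_sentences)
```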

pdf bib
Priberam: A Turbo Semantic Parser with Second Order Features
André F. T. Martins | Mariana S. C. Almeida
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

bib
Linear Programming Decoders in Natural Language Processing: From Integer Programming to Message Passing and Dual Decomposition
André F. T. Martins
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial will cover the theory and practice of linear programming decoders. This class of decoders encompasses a variety of techniques that have enjoyed great success in devising structured models for natural language processing (NLP). Throughout the tutorial, we provide a unified view of different algorithms and modeling techniques, including belief propagation, dual decomposition, integer linear programming, Markov logic, and constrained conditional models. Various applications in NLP will serve as motivation.

There is a long string of work using integer linear programming (ILP) formulations in NLP, for example in semantic role labeling, machine translation, summarization, dependency parsing, coreference resolution, and opinion mining, to name just a few. At the heart of these approaches is the ability to encode logic and budget constraints (common in NLP and information retrieval) as linear inequalities. Thanks to general-purpose solvers (such as Gurobi, CPLEX, or GLPK), the practitioner can abstract away from the decoding algorithm and focus on developing a powerful model. A disadvantage, however, is that general solvers do not scale well to large problem instances, since they fail to exploit the structure of the problem.

This is where graphical models come into play. In this tutorial, we show that most logic and budget constraints that arise in NLP can be cast in this framework. This opens the door for the use of message-passing algorithms, such as belief propagation and variants thereof. An alternative is given by algorithms based on dual decomposition, such as the subgradient method or AD3. These algorithms have achieved great success in a variety of applications, such as parsing, corpus-wide tagging, machine translation, summarization, joint coreference resolution and quotation attribution, and semantic role labeling. Interestingly, most decoders used in these works can be regarded as structure-aware solvers for addressing relaxations of integer linear programs. All these algorithms have a similar consensus-based architecture: they repeatedly perform certain "local" operations in the graph until some form of local agreement is achieved. The local operations are performed at each factor, and they range between computing marginals, max-marginals, an optimal configuration, or a small quadratic problem, all of which are commonly tractable and efficient in a wide range of problems.

As a companion to this tutorial, we provide an open-source implementation of some of the algorithms described above, available at http://www.ark.cs.cmu.edu/AD3.
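As a generic illustration of the kind of encoding discussed above (not a formulation taken from the tutorial itself), a logical disjunction and a budget constraint over binary variables become linear inequalities:

```latex
% Logic constraint: y = y_1 OR y_2, encoded with linear inequalities
% over binary variables (a generic illustration).
\[
y = y_1 \lor y_2
\;\Longleftrightarrow\;
y \ge y_1,\quad y \ge y_2,\quad y \le y_1 + y_2,
\qquad y, y_1, y_2 \in \{0,1\}
\]
% Budget constraint: select at most B items.
\[
\sum_{i=1}^{n} y_i \le B
\]
```

Once constraints take this form, they can be handed to a general-purpose ILP solver or, as the tutorial argues, treated as factors in a graphical model and handled by message-passing or dual decomposition.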

pdf bib
A Joint Model for Quotation Attribution and Coreference Resolution
Mariana S. C. Almeida | Miguel B. Almeida | André F. T. Martins
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Frame-Semantic Parsing
Dipanjan Das | Desai Chen | André F. T. Martins | Nathan Schneider | Noah A. Smith
Computational Linguistics, Volume 40, Issue 1 - March 2014

2013

pdf bib
Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning
Miguel Almeida | André Martins
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
André Martins | Miguel Almeida | Noah A. Smith
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Structured Sparsity in Natural Language Processing: Models, Algorithms and Applications
André F. T. Martins | Mário A. T. Figueiredo | Noah A. Smith
Tutorial Abstracts at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
An Exact Dual Decomposition Algorithm for Shallow Semantic Parsing with Constraints
Dipanjan Das | André F. T. Martins | Noah A. Smith
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2011

pdf bib
Dual Decomposition with Many Overlapping Components
André Martins | Noah Smith | Mário Figueiredo | Pedro Aguiar
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Structured Sparsity in Structured Prediction
André Martins | Noah Smith | Mário Figueiredo | Pedro Aguiar
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf bib
Turbo Parsers: Dependency Parsing by Approximate Variational Inference
André Martins | Noah Smith | Eric Xing | Pedro Aguiar | Mário Figueiredo
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib
Summarization with a Joint Model for Sentence Extraction and Compression
André Martins | Noah A. Smith
Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing

pdf bib
Concise Integer Linear Programming Formulations for Dependency Parsing
André Martins | Noah Smith | Eric Xing
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2004

pdf bib
Design and Implementation of a Semantic Search Engine for Portuguese
Carlos Amaral | Dominique Laurent | André Martins | Afonso Mendes | Cláudia Pinto
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
