José G. C. de Souza

Also published as: Jose G.C. de Souza, José G. C. de Souza, José G. Camargo de Souza, José G.C. de Souza, José Guilherme C. de Souza, José Guilherme Camargo de Souza


2023

Empirical Assessment of kNN-MT for Real-World Translation Scenarios
Pedro Henrique Martins | João Alves | Tânia Vaz | Madalena Gonçalves | Beatriz Silva | Marianna Buchicchio | José G. C. de Souza | André F. T. Martins
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper aims to investigate the effectiveness of the k-Nearest Neighbor Machine Translation model (kNN-MT) in real-world scenarios. kNN-MT is a retrieval-augmented framework that combines the advantages of parametric models with non-parametric datastores built using a set of parallel sentences. Previous studies have primarily focused on evaluating the model using only the BLEU metric and have not tested kNN-MT in real-world scenarios. Our study aims to fill this gap by conducting a comprehensive analysis on various datasets comprising different language pairs and different domains, using multiple automatic metrics and expert-evaluated Multidimensional Quality Metrics (MQM). We compare kNN-MT with two alternative strategies: fine-tuning all the model parameters and adapter-based fine-tuning. Finally, we analyze the effect of the datastore size on translation quality, and we examine the number of entries necessary to bootstrap and configure the index.

Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
Patrick Fernandes | Aman Madaan | Emmy Liu | António Farinhas | Pedro Henrique Martins | Amanda Bertsch | José G. C. de Souza | Shuyan Zhou | Tongshuang Wu | Graham Neubig | André F. T. Martins
Transactions of the Association for Computational Linguistics, Volume 11

Natural language generation has witnessed significant advancements due to the training of large language models on vast internet-scale datasets. Despite these advancements, there exists a critical challenge: These models can inadvertently generate content that is toxic, inaccurate, and unhelpful, and existing automatic evaluation metrics often fall short of identifying these shortcomings. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of recent research that has leveraged human feedback to improve natural language generation. First, we introduce a taxonomy distilled from existing research to categorize and organize the varied forms of feedback. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which uses large language models to make judgments based on a set of principles and minimize the need for human intervention. We also release a website of this survey at feedback-gap-survey.info.

Findings of the WMT 2023 Shared Task on Quality Estimation
Frederic Blain | Chrysoula Zerva | Ricardo Rei | Nuno M. Guerreiro | Diptesh Kanojia | José G. C. de Souza | Beatriz Silva | Tânia Vaz | Yan Jingxuan | Fatemeh Azadi | Constantin Orasan | André Martins
Proceedings of the Eighth Conference on Machine Translation

We report the results of the WMT 2023 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. This edition introduces a few novel aspects and extensions that aim to enable more fine-grained and explainable quality estimation approaches. We introduce an updated quality annotation scheme using Multidimensional Quality Metrics to obtain sentence- and word-level quality scores for three language pairs. We also extend the provided data to new language pairs: we specifically target low-resource languages and provide training, development and test data for English-Hindi, English-Tamil, English-Telugu and English-Gujarati, as well as a zero-shot test set for English-Farsi. Further, we introduce a novel fine-grained error prediction task aspiring to motivate research towards more detailed quality predictions.

Scaling up CometKiwi: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task
Ricardo Rei | Nuno M. Guerreiro | José Pombal | Daan van Stigt | Marcos Treviso | Luisa Coheur | José G. C. de Souza | André Martins
Proceedings of the Eighth Conference on Machine Translation

We present the joint contribution of Unbabel and Instituto Superior Técnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated in all tasks: Sentence- and Word-level Quality Prediction and Fine-grained Error Span Detection. For all tasks we build on the CometKiwi model (Rei et al., 2022). Our multilingual approaches are ranked first for all tasks, reaching state-of-the-art performance for quality estimation at word-, span- and sentence-level granularity. Compared to the previous state of the art, CometKiwi, we show large improvements in correlation with human judgements (up to 10 Spearman points) and surpass the second-best multilingual submission by up to 3.8 absolute points.

2022

Searching for COMETINHO: The Little Metric That Could
Ricardo Rei | Ana C Farinha | José G.C. de Souza | Pedro G. Ramos | André F.T. Martins | Luisa Coheur | Alon Lavie
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

In recent years, several neural fine-tuned machine translation evaluation metrics such as COMET and BLEURT have been proposed. These metrics achieve much higher correlations with human judgments than lexical overlap metrics at the cost of computational efficiency and simplicity, limiting their applications to scenarios in which one has to score thousands of translation hypotheses (e.g. scoring multiple systems or Minimum Bayes Risk decoding). In this paper, we explore optimization techniques, pruning, and knowledge distillation to create more compact and faster COMET versions. Our results show that just by optimizing the code through the use of caching and length batching we can reduce inference time between 39% and 65% when scoring multiple systems. Also, we show that pruning COMET can lead to a 21% model reduction without affecting the model’s accuracy beyond 0.01 Kendall tau correlation. Furthermore, we present DISTIL-COMET, a lightweight distilled version that is 80% smaller and 2.128x faster while attaining a performance close to the original model and above strong baselines such as BERTSCORE and PRISM.

QUARTZ: Quality-Aware Machine Translation
José G.C. de Souza | Ricardo Rei | Ana C. Farinha | Helena Moniz | André F. T. Martins
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper presents QUARTZ, QUality-AwaRe machine Translation, a project led by Unbabel which aims at developing machine translation systems that are more robust and produce fewer critical errors. With QUARTZ we want to enable machine translation for user-generated conversational content types that do not tolerate critical errors in automatic translations.

Findings of the WMT 2022 Shared Task on Quality Estimation
Chrysoula Zerva | Frédéric Blain | Ricardo Rei | Piyawat Lertvittayakumjorn | José G. C. de Souza | Steffen Eger | Diptesh Kanojia | Duarte Alves | Constantin Orăsan | Marina Fomicheva | André F. T. Martins | Lucia Specia
Proceedings of the Seventh Conference on Machine Translation (WMT)

We report the results of the WMT 2022 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. This edition introduces a few novel aspects and extensions that aim to enable more fine-grained and explainable quality estimation approaches. We introduce an updated quality annotation scheme using Multidimensional Quality Metrics to obtain sentence- and word-level quality scores for three language pairs. We also extend the Direct Assessments and post-edit data (MLQE-PE) to new language pairs: we present a novel and large dataset on English-Marathi, as well as a zero-shot test set on English-Yoruba. Further, we include an explainability sub-task for all language pairs and present a new format of a critical error detection task for two new language pairs. Participants from 11 different teams submitted a total of 991 systems to different task variants and language pairs.

Robust MT Evaluation with Sentence-level Multilingual Augmentation
Duarte Alves | Ricardo Rei | Ana C Farinha | José G. C. de Souza | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

Automatic translations with critical errors may lead to misinterpretations and pose several risks for the user. As such, it is important that Machine Translation (MT) Evaluation systems are robust to these errors in order to increase the reliability and safety of Machine Translation systems. Here we introduce SMAUG, a novel Sentence-level Multilingual AUGmentation approach for generating translations with critical errors, and apply this approach to create a test set to evaluate the robustness of MT metrics to these errors. We show that current state-of-the-art metrics are improving their capability to distinguish translations with and without critical errors and to penalize the former accordingly. We also show that metrics tend to struggle with errors related to named entities and numbers and that there is a high variance in the robustness of current methods to translations with critical errors.

COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
Ricardo Rei | José G. C. de Souza | Duarte Alves | Chrysoula Zerva | Ana C Farinha | Taisiya Glushkova | Alon Lavie | Luisa Coheur | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics Shared Task. Our primary submission – dubbed COMET-22 – is an ensemble between a COMET estimator model trained with Direct Assessments and a newly proposed multitask model trained to predict sentence-level scores along with OK/BAD word-level tags derived from Multidimensional Quality Metrics error annotations. These models are ensembled together using a hyper-parameter search that weights different features extracted from both evaluation models and combines them into a single score. For the reference-free evaluation, we present CometKiwi. Similarly to our primary submission, CometKiwi is an ensemble between two models: a traditional predictor-estimator model inspired by OpenKiwi and our new multitask model trained on Multidimensional Quality Metrics, which can also be used without references. Both our submissions show improved correlations compared to state-of-the-art metrics from last year as well as increased robustness to critical errors.

CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task
Ricardo Rei | Marcos Treviso | Nuno M. Guerreiro | Chrysoula Zerva | Ana C Farinha | Christine Maroti | José G. C. de Souza | Taisiya Glushkova | Duarte Alves | Luisa Coheur | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated in all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.

Findings of the WMT 2022 Shared Task on Chat Translation
Ana C Farinha | M. Amin Farajian | Marianna Buchicchio | Patrick Fernandes | José G. C. de Souza | Helena Moniz | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper reports the findings of the second edition of the Chat Translation Shared Task. Similarly to the previous WMT 2020 edition, the task consisted of translating bilingual customer support conversational text. However, unlike the previous edition, in which the bilingual data was created from a synthetic monolingual English corpus, this year we used a portion of Unbabel’s newly released MAIA corpus, which contains genuine bilingual conversations between agents and customers. We also expanded the language pairs to English↔German (en↔de), English↔French (en↔fr), and English↔Brazilian Portuguese (en↔pt-br). Given that the main goal of the shared task is to translate bilingual conversations, participants were encouraged to train and test their models specifically for this environment. In total, we received 18 submissions from 4 different teams. All teams participated in both directions of en↔de. One of the teams also participated in en↔fr and en↔pt-br. We evaluated the submissions with automatic metrics as well as human judgments via Multidimensional Quality Metrics (MQM) on both directions. The official ranking of the systems is based on the overall MQM scores of the participating systems on both directions, i.e. agent and customer.

Unbabel-IST at the WMT Chat Translation Shared Task
João Alves | Pedro Henrique Martins | José G. C. de Souza | M. Amin Farajian | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the joint contribution of IST and Unbabel to the WMT 2022 Chat Translation Shared Task. We participated in all six language directions (English ↔ German, English ↔ French, English ↔ Brazilian Portuguese). Due to the lack of domain-specific data, we use mBART50, a large pretrained language model trained on millions of sentence pairs, as our base model. We fine-tune it using a two-step fine-tuning process. In the first step, we fine-tune the model on publicly available data. In the second step, we use the validation set. After obtaining a domain-specific model, we explore the use of kNN-MT as a way of incorporating domain-specific data at decoding time.

Quality-Aware Decoding for Neural Machine Translation
Patrick Fernandes | António Farinhas | Ricardo Rei | José G. C. de Souza | Perez Ogayo | Graham Neubig | Andre Martins
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite the progress in machine translation quality estimation and evaluation in the last years, decoding in neural machine translation (NMT) is mostly oblivious to this and centers around finding the most probable translation according to the model (MAP decoding), approximated with beam search. In this paper, we bring together these two lines of research and propose quality-aware decoding for NMT, by leveraging recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods like N-best reranking and minimum Bayes risk decoding. We perform an extensive comparison of various possible candidate generation and ranking methods across four datasets and two model classes and find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics (COMET and BLEURT) and to human assessments.

2021

IST-Unbabel 2021 Submission for the Quality Estimation Shared Task
Chrysoula Zerva | Daan van Stigt | Ricardo Rei | Ana C Farinha | Pedro Ramos | José G. C. de Souza | Taisiya Glushkova | Miguel Vera | Fabio Kepler | André F. T. Martins
Proceedings of the Sixth Conference on Machine Translation

We present the joint contribution of IST and Unbabel to the WMT 2021 Shared Task on Quality Estimation. Our team participated in two tasks: Direct Assessment and Post-Editing Effort, encompassing a total of 35 submissions. For all submissions, our efforts focused on training multilingual models on top of the OpenKiwi predictor-estimator architecture, using pre-trained multilingual encoders combined with adapters. We further experiment with uncertainty-related objectives and features, as well as training on out-of-domain direct assessment data.

2018

Generating E-Commerce Product Titles and Predicting their Quality
José G. Camargo de Souza | Michael Kozielski | Prashant Mathur | Ernie Chang | Marco Guerini | Matteo Negri | Marco Turchi | Evgeny Matusov
Proceedings of the 11th International Conference on Natural Language Generation

E-commerce platforms present products using titles that summarize product information. These titles cannot be created by hand, therefore an algorithmic solution is required. The task of automatically generating these titles given noisy user-provided titles is one way to achieve the goal. The setting requires the generation process to be fast and the generated title to be both human-readable and concise. Furthermore, we need to understand if such generated titles are usable. As such, we propose approaches that (i) automatically generate product titles and (ii) predict their quality. Our approach scales to millions of products, and both automatic and human evaluations performed on real-world data indicate our approaches are effective and applicable to existing e-commerce scenarios.

Quality Estimation for Automatically Generated Titles of eCommerce Browse Pages
Nicola Ueffing | José G. C. de Souza | Gregor Leusch
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

At eBay, we are automatically generating a large amount of natural language titles for eCommerce browse pages using machine translation (MT) technology. While automatic approaches can generate millions of titles very fast, they are prone to errors. We therefore develop quality estimation (QE) methods which can automatically detect titles with low quality in order to prevent them from going live. In this paper, we present different approaches: The first one is a Random Forest (RF) model that explores hand-crafted, robust features, which are a mix of established features commonly used in Machine Translation Quality Estimation (MTQE) and new features developed specifically for our task. The second model is based on Siamese Networks (SNs) which embed the metadata input sequence and the generated title in the same space and do not require hand-crafted features at all. We thoroughly evaluate and compare those approaches on in-house data. While the RF models are competitive for scenarios with smaller amounts of training data and somewhat more robust, they are clearly outperformed by the SN models when the amount of training data is larger.

2016

FBK HLT-MT at SemEval-2016 Task 1: Cross-lingual Semantic Similarity Measurement Using Quality Estimation Features and Compositional Bilingual Word Embeddings
Duygu Ataman | José G. C. de Souza | Marco Turchi | Matteo Negri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

The FBK Participation in the WMT 2016 Automatic Post-editing Shared Task
Rajen Chatterjee | José G. C. de Souza | Matteo Negri | Marco Turchi
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

TranscRater: a Tool for Automatic Speech Recognition Quality Estimation
Shahab Jalalvand | Matteo Negri | Marco Turchi | José G. C. de Souza | Daniele Falavigna | Mohammed R. H. Qwaider
Proceedings of ACL-2016 System Demonstrations

TMop: a Tool for Unsupervised Translation Memory Cleaning
Masoud Jalili Sabet | Matteo Negri | Marco Turchi | José G. C. de Souza | Marcello Federico
Proceedings of ACL-2016 System Demonstrations

2015

Online Multitask Learning for Machine Translation Quality Estimation
José G. C. de Souza | Matteo Negri | Elisa Ricci | Marco Turchi
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

MT quality estimation for e-commerce data
José G. C. de Souza | Marcello Federico | Hassan Sawaf
Proceedings of Machine Translation Summit XV: User Track

Multitask Learning for Adaptive Quality Estimation of Automatically Transcribed Utterances
José G. C. de Souza | Hamed Zamani | Matteo Negri | Marco Turchi | Daniele Falavigna
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Adaptive Quality Estimation for Machine Translation
Marco Turchi | Antonios Anastasopoulos | José G. C. de Souza | Matteo Negri
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

FBK-UPV-UEdin participation in the WMT14 Quality Estimation shared-task
José Guilherme Camargo de Souza | Jesús González-Rubio | Christian Buck | Marco Turchi | Matteo Negri
Proceedings of the Ninth Workshop on Statistical Machine Translation

Online multi-user adaptive statistical machine translation
Prashant Mathur | Mauro Cettolo | Marcello Federico | José G.C. de Souza
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

In this paper we investigate the problem of adapting a machine translation system to the feedback provided by multiple post-editors. It is well known that translators might have very different post-editing styles and that this variability hinders the application of online learning methods, which indeed assume a homogeneous source of adaptation data. We hence propose multi-task learning to leverage bias information from each individual post-editor in order to constrain the evolution of the SMT system. A new framework for significance testing with sentence-level metrics is described, which shows that multi-task learning approaches outperform existing online learning approaches, with significant gains of 1.24 and 1.88 TER score over a strong online adaptive baseline, on a test set of post-edits produced by four translators and on a popular benchmark with multiple references, respectively.

Towards a combination of online and multitask learning for MT quality estimation: a preliminary study
José G.C. de Souza | Marco Turchi | Matteo Negri
Workshop on interactive and adaptive machine translation

Quality estimation (QE) for machine translation has emerged as a promising way to provide real-world applications with methods to estimate at run-time the reliability of automatic translations. Real-world applications, however, pose challenges that go beyond those of current QE evaluation settings. For instance, the heterogeneity and the scarce availability of training data might contribute to significantly raise the bar. To address these issues we compare two alternative machine learning paradigms, namely online and multi-task learning, measuring their capability to overcome the limitations of current batch methods. The results of our experiments, which are carried out in the same experimental setting, demonstrate the effectiveness of the two methods and suggest their complementarity. This indicates, as a promising research avenue, the possibility to combine their strengths into an online multi-task approach to the problem.

Machine Translation Quality Estimation Across Domains
José G. C. de Souza | Marco Turchi | Matteo Negri
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Quality Estimation for Automatic Speech Recognition
Matteo Negri | Marco Turchi | José G. C. de Souza | Daniele Falavigna
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks
José G.C. de Souza | Miquel Esplà-Gomis | Marco Turchi | Matteo Negri
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

QuEst - A translation quality estimation framework
Lucia Specia | Kashif Shah | Jose G.C. de Souza | Trevor Cohn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

FBK-UEdin Participation to the WMT13 Quality Estimation Shared Task
José Guilherme Camargo de Souza | Christian Buck | Marco Turchi | Matteo Negri
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

FBK: Machine Translation Evaluation and Word Similarity metrics for Semantic Textual Similarity
José Guilherme Camargo de Souza | Matteo Negri | Yashar Mehdad
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

FBK: Cross-Lingual Textual Entailment Without Translation
Yashar Mehdad | Matteo Negri | José Guilherme C. de Souza
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)