Miguel Graça


2024

LumberChunker: Long-Form Narrative Document Segmentation
André V. Duarte | João DS Marques | Miguel Graça | Miguel Freire | Lei Li | Arlindo L. Oliveira
Findings of the Association for Computational Linguistics: EMNLP 2024

Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments of varying size, so that the semantic independence of each passage's content is better captured. We propose LumberChunker, a method that leverages an LLM to dynamically segment documents: it iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark of 3,000 "needle in a haystack" question-answer pairs derived from 100 public-domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also, when integrated into a RAG pipeline, proves more effective than other chunking methods and competitive baselines such as Gemini 1.5 Pro with its 1M-token context window.
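
A minimal sketch of the iterative prompting loop the abstract describes, assuming a generic `llm_complete` helper; the prompt wording and the per-group token budget are assumptions, not the paper's exact settings:

```python
# Sketch of LLM-driven dynamic chunking in the spirit of LumberChunker.
# `llm_complete` is a hypothetical wrapper around any LLM API; the prompt
# text and the token budget below are illustrative assumptions.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def dynamic_chunks(paragraphs: list[str], group_token_budget: int = 550) -> list[str]:
    chunks, start = [], 0
    while start < len(paragraphs):
        # 1) Collect a group of sequential passages up to a token budget.
        group, tokens, end = [], 0, start
        while end < len(paragraphs) and tokens < group_token_budget:
            group.append(f"ID {end}: {paragraphs[end]}")
            tokens += len(paragraphs[end].split())  # crude token estimate
            end += 1
        # 2) Ask the LLM where the content begins to shift within the group.
        prompt = (
            "Below are consecutive passages of a document. Reply with the ID "
            "of the first passage whose content clearly shifts away from the "
            "preceding ones.\n\n" + "\n".join(group)
        )
        split_id = int(llm_complete(prompt).strip())
        split_id = max(start + 1, min(split_id, end))  # guarantee progress
        # 3) Everything before the shift becomes one retrieval chunk.
        chunks.append(" ".join(paragraphs[start:split_id]))
        start = split_id
    return chunks
```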

2020

When and Why is Unsupervised Neural Machine Translation Useless?
Yunsu Kim | Miguel Graça | Hermann Ney
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper studies the practicality of the current state-of-the-art unsupervised methods in neural machine translation (NMT). In ten translation tasks with various data settings, we analyze the conditions under which the unsupervised methods fail to produce reasonable translations. We show that their performance is severely affected by linguistic dissimilarity and domain mismatch between source and target monolingual data. Such conditions are common for low-resource language pairs, where unsupervised learning works poorly. In all of our experiments, supervised and semi-supervised baselines with 50k-sentence bilingual data outperform the best unsupervised results. Our analyses pinpoint the limits of current unsupervised NMT and suggest immediate research directions.

2019

Generalizing Back-Translation in Neural Machine Translation
Miguel Graça | Yunsu Kim | Julian Schamper | Shahram Khadivi | Hermann Ney
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

Back-translation, i.e., data augmentation by translating target-side monolingual data, is a crucial component in modern neural machine translation (NMT). In this work, we reformulate back-translation in the scope of cross-entropy optimization of an NMT model, clarifying its underlying mathematical assumptions and approximations beyond its heuristic usage. Our formulation covers broader synthetic data generation schemes, including sampling from a target-to-source NMT model. With this formulation, we point out fundamental problems of the sampling-based approaches and propose to remedy them by (i) disabling label smoothing for the target-to-source model and (ii) sampling from a restricted search space. Our statements are investigated on the WMT 2018 German ↔ English news translation task.
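
The cross-entropy reformulation the abstract alludes to can be sketched as follows; the notation is an illustrative reconstruction, not the paper's own. Training on target monolingual data requires an expectation over unknown source sentences, which back-translation approximates with a learned target-to-source model:

```latex
% Illustrative reconstruction; p(y) is the true target distribution
% (approximated by the monolingual corpus) and p(x|y) the unknown
% source-given-target posterior.
J(\theta) \;=\; \mathbb{E}_{y \sim p(y)}\, \mathbb{E}_{x \sim p(x \mid y)}
\bigl[ -\log p_\theta(y \mid x) \bigr]
% Back-translation substitutes a target-to-source model p_\gamma(x|y),
% using either its mode (beam search) or samples drawn from it:
\;\approx\; \frac{1}{N} \sum_{n=1}^{N} -\log p_\theta\bigl(y^{(n)} \mid \tilde{x}^{(n)}\bigr),
\qquad \tilde{x}^{(n)} \sim p_\gamma\bigl(x \mid y^{(n)}\bigr)
```

Seen this way, remedies (i) and (ii) both aim to make the samples from p_\gamma a better stand-in for the true posterior.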

The RWTH Aachen University Machine Translation Systems for WMT 2019
Jan Rosendahl | Christian Herold | Yunsu Kim | Miguel Graça | Weiyue Wang | Parnia Bahar | Yingbo Gao | Hermann Ney
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the neural machine translation systems developed at RWTH Aachen University for the German-English, Chinese-English and Kazakh-English news translation tasks of the Fourth Conference on Machine Translation (WMT19). For all tasks, the final submitted system is based on the Transformer architecture. We focus on improving data filtering and fine-tuning, as well as systematically evaluating promising approaches such as unigram language model segmentation and transfer learning. For the De-En task, none of the tested methods gave a significant improvement over last year's winning system, and we end up with the same performance, resulting in 39.6% BLEU on newstest2019. In the Zh-En task, we show a 1.3% BLEU improvement over our last year's submission, which we mostly attribute to the splitting of long sentences during translation. We further report results on the Kazakh-English task, where we gain improvements of 11.1% BLEU over our baseline system. On the same task, we present a recent transfer learning approach that uses half of the free parameters of our submission system and performs on par with it.
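
The unigram language model segmentation evaluated above is the scheme available in SentencePiece; a minimal usage sketch, with the corpus path, model prefix, and vocabulary size as placeholders (the paper's actual tooling and settings may differ):

```python
# Train and apply a unigram-LM subword segmentation with SentencePiece.
# Corpus path, model prefix, and vocab size are illustrative placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.de-en.txt",   # one sentence per line
    model_prefix="unigram_de_en",
    vocab_size=32000,
    model_type="unigram",      # unigram LM segmentation (Kudo, 2018)
)

sp = spm.SentencePieceProcessor(model_file="unigram_de_en.model")
pieces = sp.encode("Maschinelle Übersetzung ist schwierig.", out_type=str)
print(pieces)  # subword pieces, e.g. ['▁Maschine', 'lle', '▁Übersetzung', ...]
```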

2018

The RWTH Aachen University English-German and German-English Unsupervised Neural Machine Translation Systems for WMT 2018
Miguel Graça | Yunsu Kim | Julian Schamper | Jiahui Geng | Hermann Ney
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the unsupervised neural machine translation (NMT) systems developed at RWTH Aachen University for the English ↔ German news translation task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). Our work is based on iterative back-translation using a shared encoder-decoder NMT model. We extensively compare different vocabulary types, word embedding initialization schemes and optimization methods for our model. We also investigate gating and weight normalization for the word embedding layer.
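
A minimal sketch of iterative back-translation with a shared encoder-decoder, assuming hypothetical `translate` and `train_step` methods; initialization (e.g. from cross-lingual embeddings) and the denoising objectives typically used alongside it are omitted:

```python
# Sketch of unsupervised NMT training via iterative back-translation with a
# single shared encoder-decoder. `model.translate` and `model.train_step`
# are hypothetical helpers; a language tag selects the direction.

def iterative_back_translation(model, mono_en, mono_de, rounds=5):
    for _ in range(rounds):
        # English monolingual data -> synthetic German sources (de->en pairs).
        pairs_de_en = [(model.translate(s, src="en", tgt="de"), s) for s in mono_en]
        # German monolingual data -> synthetic English sources (en->de pairs).
        pairs_en_de = [(model.translate(s, src="de", tgt="en"), s) for s in mono_de]
        # The shared model is updated on both synthetic directions, so each
        # round's translations supervise the next round's model.
        for src, tgt in pairs_de_en:
            model.train_step(src, tgt, direction="de-en")
        for src, tgt in pairs_en_de:
            model.train_step(src, tgt, direction="en-de")
    return model
```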

The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task
Nick Rossenbach | Jan Rosendahl | Yunsu Kim | Miguel Graça | Aman Gokrani | Hermann Ney
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of RWTH Aachen University for the De→En parallel corpus filtering task of the EMNLP 2018 Third Conference on Machine Translation (WMT 2018). We use several rule-based heuristics to preselect sentence pairs, which are then scored with count-based and neural systems serving as language and translation models. In addition to single sentence-pair scoring, we further implement a simple redundancy-removing heuristic. Our best-performing corpus filtering system relies on recurrent neural language models and translation models based on the Transformer architecture. A model trained on 10M randomly sampled tokens reaches a performance of 9.2% BLEU on newstest2018. Using our filtering and ranking techniques we achieve 34.8% BLEU.
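
A minimal sketch of the pipeline shape the abstract describes (heuristic preselection, model scoring, redundancy removal); the length thresholds and the `lm_score`/`tm_score` callables are illustrative assumptions, not the submission's tuned components:

```python
# Sketch of a parallel-corpus filtering pipeline: heuristic preselection,
# model-based scoring, and simple redundancy removal. Thresholds and the
# scoring helpers are illustrative, not the paper's tuned values.

def heuristic_ok(src: str, tgt: str) -> bool:
    ns, nt = len(src.split()), len(tgt.split())
    if ns == 0 or nt == 0 or max(ns, nt) > 100:
        return False
    return 1 / 3 <= ns / nt <= 3        # length-ratio rule

def filter_corpus(pairs, lm_score, tm_score, budget_tokens=10_000_000):
    seen, scored = set(), []
    for src, tgt in pairs:
        if not heuristic_ok(src, tgt):
            continue
        key = (src.lower(), tgt.lower())
        if key in seen:                  # redundancy removal
            continue
        seen.add(key)
        scored.append((lm_score(tgt) + tm_score(src, tgt), src, tgt))
    scored.sort(reverse=True)            # best-scoring pairs first
    out, tokens = [], 0
    for _, src, tgt in scored:
        if tokens >= budget_tokens:      # keep pairs up to a token budget
            break
        out.append((src, tgt))
        tokens += len(src.split())
    return out
```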

2017

The RWTH Aachen University English-German and German-English Machine Translation System for WMT 2017
Jan-Thorsten Peter | Andreas Guta | Tamer Alkhouli | Parnia Bahar | Jan Rosendahl | Nick Rossenbach | Miguel Graça | Hermann Ney
Proceedings of the Second Conference on Machine Translation

2016

The RWTH Aachen Machine Translation System for IWSLT 2016
Jan-Thorsten Peter | Andreas Guta | Nick Rossenbach | Miguel Graça | Hermann Ney
Proceedings of the 13th International Conference on Spoken Language Translation

This work describes the statistical machine translation (SMT) systems of RWTH Aachen University developed for the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2016. We have participated in the MT track for the German→English language pair, employing our state-of-the-art phrase-based system, our neural machine translation implementation, and our joint translation and reordering decoder. Furthermore, we have applied feed-forward and recurrent neural language and translation models for reranking. The attention-based approach has been used for reranking the n-best lists for both phrase-based and hierarchical setups. On top of these systems, we make use of system combination to enhance the translation quality by combining individually trained systems.
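
A minimal sketch of n-best reranking as used above: each hypothesis receives a weighted combination of model scores. The feature set and weights are assumptions; in practice the weights are tuned on held-out data:

```python
# Rerank an n-best list with a log-linear combination of model scores.
# `scorers` might include the phrase-based model score, a neural LM, and an
# attention-based NMT score; weights are illustrative and would be tuned.

def rerank(nbest: list[str], scorers, weights) -> str:
    def combined(hyp: str) -> float:
        return sum(w * score(hyp) for w, score in zip(weights, scorers))
    return max(nbest, key=combined)
```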

2015

Extended Translation Models in Phrase-based Decoding
Andreas Guta | Joern Wuebker | Miguel Graça | Yunsu Kim | Hermann Ney
Proceedings of the Tenth Workshop on Statistical Machine Translation