2020
Combining Subword Representations into Word-level Representations in the Transformer Architecture
Noe Casas | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
In Neural Machine Translation, using word-level tokens leads to degraded translation quality. The dominant approaches use subword-level tokens instead, but this increases the length of the sequences and makes it difficult to exploit word-level information such as POS tags or semantic dependencies. We propose a modification to the Transformer model that combines subword-level representations into word-level ones in the first layers of the encoder, reducing the effective length of the sequences in the following layers and providing a natural point to incorporate extra word-level information. Our experiments show that this approach matches the translation quality of the standard Transformer model when no extra word-level information is injected, and that it is superior to the currently dominant method for incorporating word-level source-language information into models based on subword-level vocabularies.
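For intuition, the sketch below shows the core operation the abstract describes: pooling the subword states produced by the first encoder layers into one vector per word, so that the remaining layers operate on a shorter, word-level sequence. The mean-pooling choice, tensor layout, and function names are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch: merge subword-level encoder states into word-level ones.
# Mean pooling and the subword-to-word index map are assumptions.
import torch

def pool_subwords_to_words(subword_states, word_ids, num_words):
    """Average the subword vectors belonging to each word.

    subword_states: (num_subwords, d_model) states after the first encoder layers
    word_ids:       (num_subwords,) index of the word each subword belongs to
    num_words:      number of words in the sentence
    """
    d_model = subword_states.size(1)
    sums = torch.zeros(num_words, d_model)
    sums.index_add_(0, word_ids, subword_states)              # sum subwords per word
    counts = torch.zeros(num_words)
    counts.index_add_(0, word_ids, torch.ones(len(word_ids))) # subwords per word
    return sums / counts.unsqueeze(1)                         # mean pooling

# Example: "un@@ believ@@ able story" -> words ["unbelievable", "story"]
states = torch.randn(4, 512)
word_ids = torch.tensor([0, 0, 0, 1])
print(pool_subwords_to_words(states, word_ids, num_words=2).shape)  # (2, 512)
```

The word-level states produced here are also the natural point, mentioned in the abstract, at which extra word-level features such as POS-tag embeddings could be added.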
Syntax-driven Iterative Expansion Language Models for Controllable Text Generation
Noe Casas | José A. R. Fonollosa | Marta R. Costa-jussà
Proceedings of the Fourth Workshop on Structured Prediction for NLP
The dominant language modeling paradigm handles text as a sequence of discrete tokens. While that approach can capture the latent structure of the text, it is inherently constrained to sequential dynamics for text generation. We propose a new paradigm for introducing a syntactic inductive bias into neural text generation, in which the dependency parse tree drives the Transformer model to generate sentences iteratively. Our experiments show that this paradigm is effective at text generation, with quality between that of LSTMs and Transformers and comparable diversity, while requiring fewer than half their decoding steps. Moreover, its generation process allows direct control over the syntactic constructions of the generated text, enabling the induction of stylistic variations.
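As a toy illustration of the iterative-expansion idea, the snippet below emits tokens level by level over a dependency tree, showing why the number of decoding iterations scales with tree depth rather than sentence length; the tree encoding and expansion order are assumptions for illustration, not the paper's model.

```python
# Toy illustration: produce a sentence's tokens level by level over its
# dependency tree, so decoding takes depth(tree) iterations, not len(sentence).
from collections import defaultdict

def iterative_expansion(tokens, heads):
    """heads[i] is the index of token i's dependency head (-1 for the root)."""
    children = defaultdict(list)
    for i, h in enumerate(heads):
        children[h].append(i)
    frontier, step = children[-1], 0        # start from the root word(s)
    while frontier:
        step += 1
        print(f"iteration {step}:", " ".join(tokens[i] for i in frontier))
        frontier = [c for i in frontier for c in children[i]]

# "the cat sat on the mat": 6 tokens, but only 4 expansion iterations.
iterative_expansion(["the", "cat", "sat", "on", "the", "mat"],
                    heads=[1, 2, -1, 2, 5, 3])
```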
2019
Evaluating the Underlying Gender Bias in Contextualized Word Embeddings
Christine Basta | Marta R. Costa-jussà | Noe Casas
Proceedings of the First Workshop on Gender Bias in Natural Language Processing
Gender bias strongly affects natural language processing applications. Word embeddings have been shown both to retain and to amplify gender biases present in current data sources. Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations that depend on the sentence in which the word appears. In this paper, we study the impact of this conceptual change in word embedding computation in relation to gender bias. Our analysis includes several measures previously applied in the literature to standard word embeddings. Our findings suggest that contextualized word embeddings are less biased than standard ones, even when the latter are debiased.
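As a rough illustration of the kind of measure used in this line of work, the sketch below scores words by cosine similarity to a he-minus-she gender direction. The random vectors are stand-ins for real (contextualized) embeddings, and this particular probe is an assumption, not necessarily the paper's exact protocol.

```python
# Illustrative bias probe: project word vectors onto a gender direction.
# Random vectors stand in for real embeddings; with contextualized models,
# each occurrence of a word yields its own vector to be scored or averaged.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(300) for w in ["he", "she", "nurse", "engineer"]}

gender_direction = emb["he"] - emb["she"]
for word in ["nurse", "engineer"]:
    # Under this probe, positive scores lean "male", negative lean "female".
    print(word, round(cosine(emb[word], gender_direction), 3))
```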
The TALP-UPC Machine Translation Systems for WMT19 News Translation Task: Pivoting Techniques for Low Resource MT
Noe Casas | José A. R. Fonollosa | Carlos Escolano | Christine Basta | Marta R. Costa-jussà
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
In this article, we describe the TALP-UPC research group's participation in the WMT19 news translation shared task for Kazakh-English. Given the small amount of parallel training data, we resort to using Russian as a pivot language, training subword-based statistical translation systems for Russian-Kazakh and Russian-English that were then used to create two synthetic pseudo-parallel corpora, for Kazakh-English and English-Kazakh respectively. Finally, a self-attention model based on the decoder part of the Transformer architecture was trained on the two pseudo-parallel corpora.
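The pivoting step can be pictured as below: the Russian side of a Russian-Kazakh corpus is translated into English to yield a synthetic Kazakh-English corpus (and symmetrically for English-Kazakh). `translate_ru_en` is a hypothetical placeholder for the paper's subword-based statistical system, and the corpus format is an assumption.

```python
# Schematic of pivot-based synthetic corpus creation (Kazakh-English via Russian).
def translate_ru_en(russian_sentence):
    # Hypothetical stand-in for the Russian->English translation system.
    return f"<EN translation of: {russian_sentence}>"

def build_synthetic_kk_en(ru_kk_corpus):
    """Turn (ru, kk) sentence pairs into pseudo-parallel (kk, en) pairs."""
    return [(kk, translate_ru_en(ru)) for ru, kk in ru_kk_corpus]

ru_kk = [("привет, мир", "сәлем, әлем")]
print(build_synthetic_kk_en(ru_kk))
```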
Leveraging Rule-Based Machine Translation Knowledge for Under-Resourced Neural Machine Translation Models
Daniel Torregrosa | Nivranshu Pasricha | Maraim Masoud | Bharathi Raja Chakravarthi | Juan Alonso | Noe Casas | Mihael Arcan
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks
2018
The TALP-UPC Machine Translation Systems for WMT18 News Shared Translation Task
Noe Casas | Carlos Escolano | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
In this article, we describe the TALP-UPC research group's participation in the WMT18 news shared translation task for Finnish-English and Estonian-English within the multilingual subtrack. All of our primary submissions implement an attention-based Neural Machine Translation architecture. Since Finnish and Estonian belong to the same language family and are similar, we use the combination of both language pairs' datasets as training data, to mitigate the data scarcity of each individual pair. We also report the translation quality of systems trained on the individual language-pair data, to serve as baselines and comparison references.
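A schematic of the data combination, assuming plain pooling of the two training sets; the source-language tag is a common multilingual-NMT convention assumed here for illustration, not a detail confirmed by the abstract.

```python
# Sketch: pool Finnish-English and Estonian-English pairs into one corpus.
import random

def combine_corpora(fi_en, et_en, tag_source=True):
    pooled = []
    for tag, corpus in (("<fi>", fi_en), ("<et>", et_en)):
        for src, tgt in corpus:
            pooled.append((f"{tag} {src}" if tag_source else src, tgt))
    random.shuffle(pooled)  # mix the pairs so every batch is multilingual
    return pooled

fi_en = [("hyvää huomenta", "good morning")]
et_en = [("tere hommikust", "good morning")]
print(combine_corpora(fi_en, et_en))
```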
In this article we describe the TALP-UPC research group participation in the WMT18 news shared translation task for Finnish-English and Estonian-English within the multi-lingual subtrack. All of our primary submissions implement an attention-based Neural Machine Translation architecture. Given that Finnish and Estonian belong to the same language family and are similar, we use as training data the combination of the datasets of both language pairs to paliate the data scarceness of each individual pair. We also report the translation quality of systems trained on individual language pair data to serve as baseline and comparison reference.