Reinhard Rapp

Despite impressive progress in machine translation in recent years, it has occasionally been argued that current systems are still mainly based on pattern recognition and that further progress may be possible by using text understanding techniques, for example by looking at semantics of the type “Who is doing what to whom?”. In the current research we aim to take a small step in this direction. Assuming that semantic role labeling (SRL) captures some of the relevant semantics, we automatically annotate the source language side of a standard parallel corpus, namely Europarl, with semantic roles. We then train a neural machine translation (NMT) system using the annotated corpus on the source language side and the original unannotated corpus on the target language side. New text to be translated is first annotated by the same SRL system and then fed into the translation system. We compare the results to those of a baseline NMT system trained with unannotated text on both sides and find that the SRL-based system yields small improvements in terms of BLEU scores for each of the four language pairs under investigation, involving English, French, German, Greek and Spanish.

2021

Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Reinhard Rapp | Serge Sharoff | Pierre Zweigenbaum
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Similar Language Translation for Catalan, Portuguese and Spanish Using Marian NMT
Reinhard Rapp
Proceedings of the Sixth Conference on Machine Translation

This paper describes the SEBAMAT contribution to the 2021 WMT Similar Language Translation shared task. Using the Marian neural machine translation toolkit, translation systems based on Google’s transformer architecture were built in both directions of Catalan–Spanish and Portuguese–Spanish. The systems were trained in two contrastive parameter settings (different vocabulary sizes for byte pair encoding) using only the parallel but not the comparable corpora provided by the shared task organizers. According to the organizers' official evaluation results, the SEBAMAT system turned out to be competitive, with rankings among the top teams and BLEU scores between 38 and 47 for the language pairs involving Portuguese and between 76 and 80 for the language pairs involving Catalan.

2020

Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

Overview of the Fourth BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

The shared task of the 13th Workshop on Building and Using Comparable Corpora was devoted to the induction of bilingual dictionaries from comparable rather than parallel corpora. In this task, for a number of language pairs involving Chinese, English, French, German, Russian and Spanish, the participants were asked to automatically determine the target language translations of several thousand source language test words of three frequency ranges. We describe here some background, the task definition, the training and test data sets and the evaluation used for ranking the participating systems. We also summarize the approaches used and present the results of the evaluation. In conclusion, the competition yielded results from a number of systems that provide surprisingly good solutions to this ambitious problem.

An Overview of the SEBAMAT Project
Reinhard Rapp | George Tambouratzis
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

SEBAMAT (semantics-based MT) is a Marie Curie project intended to contribute to the state of the art in machine translation (MT). Current MT systems typically take the semantics of a text into account only insofar as it is implicit in the underlying text corpora or dictionaries. Occasionally it has been argued that it may be difficult to advance MT quality to the next level as long as the systems do not make more explicit use of semantic knowledge. SEBAMAT aims to evaluate three approaches to incorporating such knowledge into MT.

2018

A Multilingual Dataset for Evaluating Parallel Sentence Extraction from Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics, and describes the methods used to evaluate system results. Four teams submitted 13 runs to the shared task, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We manually examined a small sample of the false negative sentence pairs for the most precise French-English runs and estimated the number of parallel sentence pairs not yet in the provided gold standard. Adding them to the gold standard leads to revised estimates for the French-English F-scores of at most +1.5pt. This suggests that the BUCC 2017 datasets provide a reasonable approximate evaluation of the parallel sentence spotting task.

2016

Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)
Patrik Lambert | Bogdan Babych | Kurt Eberle | Rafael E. Banchs | Reinhard Rapp | Marta R. Costa-jussà
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)

2015

Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

BUCC Shared Task: Cross-Language Document Similarity
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)
Bogdan Babych | Kurt Eberle | Patrik Lambert | Reinhard Rapp | Rafael E. Banchs | Marta R. Costa-jussà
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

A Methodology for Bilingual Lexicon Extraction from Comparable Corpora
Reinhard Rapp
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

2014

Using Collections of Human Language Intuitions to Measure Corpus Representativeness
Reinhard Rapp
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

A Graph-Based Approach for Computing Free Word Associations
Gemma Bel Enguix | Reinhard Rapp | Michael Zock
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A graph-based algorithm is used to analyze the co-occurrences of words in the British National Corpus. It is shown that the statistical regularities detected can be exploited to predict human word associations. The corpus-derived associations are evaluated using a large test set comprising several thousand stimulus/response pairs as collected from humans. The finding is that there is a high agreement between the two types of data. The considerable size of the test set allows us to split the stimulus words into a number of classes relating to particular word properties. For example, we construct six saliency classes, and for the words in each of these classes we compare the simulation results with the human data. It turns out that for each class there is a close relationship between the performance of our system and human performance. This is also the case for classes based on two other properties of words, namely syntactic and semantic word ambiguity. We interpret these findings as evidence for the claim that human association acquisition must be based on the statistical analysis of perceived language and that when producing associations the detected statistical regularities are replicated.

Corpus-Based Computation of Reverse Associations
Reinhard Rapp
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

According to psychological learning theory, an important principle governing language acquisition is co-occurrence. For example, when we perceive language, our brain seems to unconsciously analyze and store the co-occurrence patterns of the words, and during language production these co-occurrence patterns are reproduced. The applicability of this principle is particularly obvious in the case of word associations. There is evidence that the associative responses people typically come up with upon presentation of a stimulus word are often words which frequently co-occur with it. It is thus possible to predict a response by looking at co-occurrence data. The work presented here is along these lines. However, it differs from most previous work in that it investigates the direction from the response to the stimulus rather than vice versa, and that it also deals with the case when several responses are known. Our results indicate that it is possible to predict a stimulus word from its responses, and that it helps if several responses are given.

Using Word Familiarities and Word Associations to Measure Corpus Representativeness
Reinhard Rapp
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The definition of corpus representativeness used here assumes that a representative corpus should reflect as well as possible the average language use a native speaker encounters in everyday life over a longer period of time. As it is not practical to observe people’s language input over years, we suggest utilizing two types of experimental data capturing two forms of human intuitions: word familiarity norms and word association norms. If it is true that human language acquisition is corpus-based, such data should reflect people’s perceived language input. Assuming so, we compute a representativeness score for a corpus by extracting word frequency and word association statistics from it and by comparing these statistics to the human data. The higher the similarity, the more representative the corpus should be for the language environments of the test persons. We present results for five different corpora and for truncated versions thereof. The results confirm the expectation that corpus size and corpus balance are crucial aspects of corpus representativeness.

How well can a corpus-derived co-occurrence network simulate human associative behavior?
Gemma Bel Enguix | Reinhard Rapp | Michael Zock
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)
Rafael E. Banchs | Marta R. Costa-jussà | Reinhard Rapp | Patrik Lambert | Kurt Eberle | Bogdan Babych
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

Extracting Multiword Translations from Aligned Comparable Documents
Reinhard Rapp | Serge Sharoff
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)
Michael Zock | Reinhard Rapp | Chu-Ren Huang
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

The CogALex-IV Shared Task on the Lexical Access Problem
Reinhard Rapp | Michael Zock
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

TALN-RECITAL 2014 Workshop RLTLN 2014 : Réseaux Lexicaux pour le TAL (RLTLN 2014 : Lexical Networks for NLP)
Michael Zock | Gemma Bel-Enguix | Reinhard Rapp
TALN-RECITAL 2014 Workshop RLTLN 2014 : Réseaux Lexicaux pour le TAL (RLTLN 2014 : Lexical Networks for NLP)

2013

Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Identifying Word Translations from Comparable Documents Without a Seed Lexicon
Reinhard Rapp | Serge Sharoff | Bogdan Babych
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The extraction of dictionaries from parallel text corpora is an established technique. However, as parallel corpora are a scarce resource, in recent years the extraction of dictionaries using comparable corpora has received increasing attention. In order to find a mapping between languages, almost all approaches suggested in the literature rely on a seed lexicon. The work described here achieves competitive results without requiring such a seed lexicon. Instead it presupposes mappings between comparable documents in different languages. For some common types of textual resources (e.g. encyclopedias or newspaper texts) such mappings are either readily available or can be established relatively easily. The current work is based on Wikipedias where the mappings between languages are determined by the authors of the articles. We describe a neural-network-inspired algorithm which first characterizes each Wikipedia article by a number of keywords, and then considers the identification of word translations as a variant of word alignment in a noisy environment. We present results and evaluations for eight language pairs involving Germanic, Romance, and Slavic languages as well as Chinese.

Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Marta R. Costa-jussà | Patrik Lambert | Rafael E. Banchs | Reinhard Rapp | Bogdan Babych
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Design of a hybrid high quality machine translation system
Bogdan Babych | Kurt Eberle | Johanna Geiß | Mireia Ginestí-Rosell | Anthony Hartley | Reinhard Rapp | Serge Sharoff | Martin Thomas
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon
Michael Zock | Reinhard Rapp
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

2011

Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Pierre Zweigenbaum | Reinhard Rapp | Serge Sharoff
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

Proceedings of the 2nd Workshop on Cognitive Aspects of the Lexicon
Michael Zock | Reinhard Rapp
Proceedings of the 2nd Workshop on Cognitive Aspects of the Lexicon

Utilizing Citations of Foreign Words in Corpus-Based Dictionary Generation
Reinhard Rapp | Michael Zock
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

The Noisier the Better: Identifying Multilingual Word Translations Using a Single Monolingual Corpus
Reinhard Rapp | Michael Zock
Proceedings of the 4th Workshop on Cross Lingual Information Access

2009

The Backtranslation Score: Automatic MT Evaluation at the Sentence Level without Reference Translations
Reinhard Rapp
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)
Pascale Fung | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2008

The Computation of Associative Responses to Multiword Stimuli
Reinhard Rapp
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)

2007

Deriving an Ambiguous Word’s Part-of-Speech Distribution from Unannotated Text
Reinhard Rapp
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

Exploring the Sense Distributions of Homographs
Reinhard Rapp
Demonstrations

Example-Based Machine Translation Using a Dictionary of Word Pairs
Reinhard Rapp | Carlos Martin Vide
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Machine translation systems, whether rule-based, example-based, or statistical, all rely on dictionaries that are in essence mappings between individual words of the source and the target language. Criteria for the disambiguation of ambiguous words and for differences in word order between the two languages are not accounted for in the lexicon. Instead, these important issues are dealt with in the translation engines. Because the engines tend to be compact and (even with data-oriented approaches) do not fully reflect the complexity of the problem, this approach generally does not account for the more fine-grained facets of word behavior. This leads to wrong generalizations and, as a consequence, translation quality tends to be poor. In this paper we propose to approach this problem by using a new type of lexicon that is not based on individual words but on pairs of words. For each pair of consecutive words in the source language, the lexicon lists the possible translations in the target language together with information on the order and distance of the target words. The process of machine translation is then seen as a combinatorial problem: for all word pairs in a source sentence, all possible translations are retrieved from the lexicon, and those translations that lead to contradictions when constructing the target sentence are discarded. This process implicitly leads to word sense disambiguation and to language-specific reordering of words.

2005

A Practical Solution to the Problem of Automatic Part-of-Speech Induction from Text
Reinhard Rapp
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2004

Utilizing the One-Sense-per-Discourse Constraint for Fully Unsupervised Word Sense Induction and Disambiguation
Reinhard Rapp
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A Freely Available Automatically Generated Thesaurus of Related Words
Reinhard Rapp
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A Practical Solution to the Problem of Automatic Word Sense Induction
Reinhard Rapp
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2003

Word sense discovery based on sense descriptor dissimilarity
Reinhard Rapp
Proceedings of Machine Translation Summit IX: Papers

In machine translation, information on word ambiguities is usually provided by the lexicographers who construct the lexicon. In this paper we propose an automatic method for word sense induction, i.e. for the discovery of a set of sense descriptors for a given ambiguous word. The approach is based on the statistics of the distributional similarity between the words in a corpus. Our algorithm works as follows: The 20 strongest first-order associations to the ambiguous word are considered as sense descriptor candidates. All pairs of these candidates are ranked according to the following two criteria: First, the two words in a pair should be as dissimilar as possible. Second, although dissimilar, their co-occurrence vectors should add up to the co-occurrence vector of the ambiguous word scaled by two. Both conditions together have the effect that preference is given to pairs whose co-occurring words are complementary. For best results, our implementation uses singular value decomposition, entropy-based weights, and second-order similarity metrics.
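
The pair-ranking step described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of cosine similarity for the dissimilarity criterion, the Euclidean distance for the "adds up to twice the target vector" criterion, and the way the two criteria are combined into one score are all assumptions made for illustration. The selection of the 20 strongest associations, the singular value decomposition, and the entropy-based weighting are left out; the candidate vectors are simply taken as input.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_descriptor_pairs(target, candidates):
    """Rank candidate pairs as sense descriptors for an ambiguous word.

    target:     co-occurrence vector of the ambiguous word
    candidates: dict mapping candidate word -> co-occurrence vector
    Returns a list of (score, word1, word2), best pair first.
    """
    scored = []
    for (w1, v1), (w2, v2) in combinations(candidates.items(), 2):
        # Criterion 1: the two candidates should be as dissimilar as possible.
        dissimilarity = 1.0 - cosine(v1, v2)
        # Criterion 2: v1 + v2 should be close to the target vector scaled by two.
        distance = math.sqrt(
            sum((a + b - 2.0 * t) ** 2 for a, b, t in zip(v1, v2, target))
        )
        # Assumed combination: reward dissimilarity, penalize distance.
        scored.append((dissimilarity / (1.0 + distance), w1, w2))
    return sorted(scored, reverse=True)


# Toy data (invented): "palm" co-occurs with hand-related and tree-related contexts.
palm = [1, 1, 1, 1]
candidates = {
    "hand": [2, 2, 0, 0],
    "tree": [0, 0, 2, 2],
    "branch": [0, 0, 2, 1],
}
ranked = rank_descriptor_pairs(palm, candidates)
print(ranked[0][1], ranked[0][2])
```

In this toy setting the complementary pair "hand"/"tree" ranks first, because the two vectors are orthogonal and their sum equals twice the vector of the ambiguous word.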