Creating and using large monolingual parallel corpora for sentential paraphrase generation

Sander Wubben; Antal van den Bosch; Emiel Krahmer

Abstract

In this paper we investigate the automatic generation of paraphrases using machine translation techniques. We make three contributions: the construction of a large paraphrase corpus for English and Dutch, a re-ranking heuristic that adapts machine translation to paraphrase generation, and a proper evaluation methodology. The large parallel corpus is constructed by aligning clustered headlines scraped from a news aggregator site. To generate sentential paraphrases we use a standard phrase-based machine translation (PBMT) framework modified with a re-ranking component (henceforth PBMT-R). We demonstrate this approach for Dutch and English and evaluate it using human judgments collected from 76 participants. The judgments are compared to two automatic machine translation evaluation metrics. We observe that as the paraphrases deviate more from the source sentence, the performance of the PBMT-R system degrades less than that of the word substitution baseline system.
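The abstract does not say which criterion the re-ranking component uses, so the following is only a minimal sketch of how an n-best list from a PBMT decoder could be re-ranked for paraphrasing. It assumes, purely for illustration, that candidates are scored by word-level edit distance to the source, so that outputs that differ more from the input are preferred. The function names `word_edit_distance` and `rerank_by_dissimilarity` and the example headlines are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def rerank_by_dissimilarity(source: str, nbest: List[str]) -> List[str]:
    """Order candidates from most to least different from the source.

    Assumption: dissimilarity to the source is the re-ranking criterion.
    Ties keep the decoder's original order, because sorted() is stable.
    """
    src_tokens = source.lower().split()
    return sorted(
        nbest,
        key=lambda cand: word_edit_distance(src_tokens, cand.lower().split()),
        reverse=True,
    )


# Hypothetical n-best list for a single headline.
source = "police arrest suspect in bank robbery"
nbest = [
    "police arrest suspect in bank robbery",
    "police detain suspect in bank robbery",
    "suspect held after bank heist",
]
ranked = rerank_by_dissimilarity(source, nbest)
print(ranked[0])
```

A monolingual PBMT system tends to rank the unchanged input highly, which is why a re-ranking step of some kind is useful. A real system would likely also weigh the decoder's own model score rather than rely on dissimilarity alone.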

Anthology ID: L14-1094
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4292–4299
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1135_Paper.pdf
Cite (ACL): Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2014. Creating and using large monolingual parallel corpora for sentential paraphrase generation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4292–4299, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Creating and using large monolingual parallel corpora for sentential paraphrase generation (Wubben et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1135_Paper.pdf
