Automatic Corpus Extension for Data-driven Natural Language Generation

Elena Manishina; Bassam Jabaian; Stéphane Huet; Fabrice Lefèvre

Abstract

As data-driven approaches started to make their way into the Natural Language Generation (NLG) domain, the need to automate corpus building and extension became apparent. Corpus creation and extension in the data-driven NLG domain have traditionally involved manual paraphrasing, performed either by a group of experts or through crowdsourcing. Building training corpora manually is a costly enterprise that requires a lot of time and human resources. We propose to automate the process of corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native speakers favor the outputs of the model built on the extended corpus.

Anthology ID: L16-1575
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 3624–3631
URL: https://aclanthology.org/L16-1575/
Cite (ACL): Elena Manishina, Bassam Jabaian, Stéphane Huet, and Fabrice Lefèvre. 2016. Automatic Corpus Extension for Data-driven Natural Language Generation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3624–3631, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Automatic Corpus Extension for Data-driven Natural Language Generation (Manishina et al., LREC 2016)
PDF: https://aclanthology.org/L16-1575.pdf