Portuguese Text Generation from Large Corpora

Eder Novais, Ivandré Paraboni, Douglas Silva


Abstract
In the implementation of a surface realisation engine, many of the computational techniques seen in other AI fields have been widely applied. Among these, the use of statistical methods has been particularly successful, as in the so-called 'generate-and-select', or two-stage, architectures. Systems of this kind produce output strings from possibly underspecified input data by over-generating a large number of alternative realisations (often including ungrammatical candidate sentences). These are subsequently ranked with the aid of a statistical language model, and the most likely candidate is selected as the output string. Statistical approaches may, however, face a number of difficulties. Among these is the issue of data sparseness, a problem that is particularly evident for our target language, Brazilian Portuguese, which is not only morphologically rich but also relatively poor in NLP resources such as large, publicly available corpora. In this work we describe a first implementation of a shallow surface realisation system for this language that deals with the issue of data sparseness by making use of factored language models built from a (relatively) large corpus of Brazilian newspaper articles.
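The generate-and-select idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it assumes a plain bigram model with add-one smoothing in place of the factored language models the paper uses, and the toy corpus, the `slots` input format, and all function names are hypothetical choices made for this illustration.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy corpus standing in for a large newspaper corpus.
CORPUS = [
    "a menina bonita canta".split(),
    "o menino bonito canta".split(),
    "as meninas bonitas cantam".split(),
    "os meninos bonitos cantam".split(),
]

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenised sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of one candidate sentence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(padded, padded[1:])
    )

def over_generate(slots):
    """Stage 1: expand every combination of alternative word forms per slot."""
    return [list(choice) for choice in product(*slots)]

def realise(slots, unigrams, bigrams):
    """Stage 2: rank all candidates with the language model and keep the best."""
    candidates = over_generate(slots)
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))

# Underspecified input: gender and number agreement are left open, so the
# generator proposes every inflected form and lets the model decide.
SLOTS = [
    ["o", "a", "os", "as"],
    ["menina"],
    ["bonito", "bonita", "bonitos", "bonitas"],
    ["canta", "cantam"],
]

unigrams, bigrams = train_bigram(CORPUS)
best = realise(SLOTS, unigrams, bigrams)
print(" ".join(best))
```

Most of the 32 candidates produced here violate agreement, and the language model is what filters them out. With a word-form model like this one, any inflected form missing from the corpus receives only smoothed counts, which is the data sparseness problem that motivates backing off to more general factors such as lemmas or part-of-speech tags.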
Anthology ID:
L12-1026
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4010–4014
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/153_Paper.pdf
Cite (ACL):
Eder Novais, Ivandré Paraboni, and Douglas Silva. 2012. Portuguese Text Generation from Large Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 4010–4014, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Portuguese Text Generation from Large Corpora (Novais et al., LREC 2012)