Enriching the E2E dataset

Thiago Castro Ferreira, Helena Vaz, Brian Davis, Adriana Pagano


Abstract
This study introduces an enriched version of the E2E dataset, one of the most popular language resources for data-to-text NLG. We extract intermediate representations for popular pipeline tasks such as discourse ordering, text structuring, lexicalization and referring expression generation, enabling researchers to rapidly develop and evaluate their data-to-text pipeline systems. The intermediate representations are extracted by aligning non-linguistic and text representations through a process called delexicalization, which consists in replacing input referring expressions to entities/attributes with placeholders. The enriched dataset is publicly available.
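As a rough illustration of the delexicalization step described in the abstract, the sketch below replaces attribute values found in a text with placeholders. It is a minimal sketch and not the authors' implementation: the function name `delexicalize`, the `__ATTRIBUTE__` placeholder format, and the case-insensitive exact string matching are all assumptions made for this example.

```python
import re


def delexicalize(text, meaning_representation):
    """Replace each attribute value found in the text with a placeholder.

    Hypothetical helper: placeholder format and matching strategy are
    assumptions for illustration only.
    """
    # Replace longer values first so a short value never clobbers part of a longer one.
    items = sorted(
        meaning_representation.items(),
        key=lambda pair: len(pair[1]),
        reverse=True,
    )
    template = text
    for attribute, value in items:
        placeholder = "__" + attribute.upper() + "__"
        # Exact, case-insensitive string match on the attribute value.
        template = re.sub(re.escape(value), placeholder, template, flags=re.IGNORECASE)
    return template


# Example input in the style of an E2E meaning representation.
mr = {"name": "The Eagle", "eatType": "coffee shop", "area": "riverside"}
text = "The Eagle is a coffee shop in the riverside area."
template = delexicalize(text, mr)
print(template)  # __NAME__ is a __EATTYPE__ in the __AREA__ area.
```

The order of placeholders in the resulting template is the kind of alignment from which intermediate representations such as discourse ordering can be read off. Exact matching will miss paraphrased values, which a real pipeline would need to handle separately.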
Anthology ID:
2021.inlg-1.18
Volume:
Proceedings of the 14th International Conference on Natural Language Generation
Month:
August
Year:
2021
Address:
Aberdeen, Scotland, UK
Editors:
Anya Belz, Angela Fan, Ehud Reiter, Yaji Sripada
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
177–183
URL:
https://aclanthology.org/2021.inlg-1.18
DOI:
10.18653/v1/2021.inlg-1.18
Cite (ACL):
Thiago Castro Ferreira, Helena Vaz, Brian Davis, and Adriana Pagano. 2021. Enriching the E2E dataset. In Proceedings of the 14th International Conference on Natural Language Generation, pages 177–183, Aberdeen, Scotland, UK. Association for Computational Linguistics.
Cite (Informal):
Enriching the E2E dataset (Castro Ferreira et al., INLG 2021)
PDF:
https://aclanthology.org/2021.inlg-1.18.pdf
Code:
ThiagoCF05/EnrichedE2E
Data:
WebNLG