Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data

Katja Filippova


Abstract
Neural text generation (data- or text-to-text) demonstrates remarkable performance when training data is abundant, which for many applications is not the case. To collect a large corpus of parallel data, heuristic rules are often used, but they inevitably let noise into the data, such as phrases in the output that cannot be explained by the input. Consequently, models pick up on the noise and may hallucinate, i.e., generate fluent but unsupported text. Our contribution is a simple but powerful technique to treat such hallucinations as a controllable aspect of the generated text, without dismissing any input and without modifying the model architecture. On the WikiBio corpus (Lebret et al., 2016), a particularly noisy dataset, we demonstrate the efficacy of the technique in both an automatic and a human evaluation.
Anthology ID:
2020.findings-emnlp.76
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
864–870
URL:
https://aclanthology.org/2020.findings-emnlp.76
DOI:
10.18653/v1/2020.findings-emnlp.76
Cite (ACL):
Katja Filippova. 2020. Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 864–870, Online. Association for Computational Linguistics.
Cite (Informal):
Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data (Filippova, Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.76.pdf
Data
WikiBio