Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Julia Kreutzer; Joshua Uyheng; Stefan Riezler

doi:10.18653/v1/P18-1165

Abstract

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
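The abstract does not spell out how cardinal feedback is standardized. As a minimal sketch, assuming standardization means converting each annotator's 5-point ratings to z-scores against that annotator's own mean and standard deviation, it could look like the following. The function name `standardize_per_annotator` and the `(annotator_id, item_id, score)` tuple layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_annotator(ratings):
    """Z-score each rating against its annotator's own mean and std.

    ratings: list of (annotator_id, item_id, score) tuples (assumed layout).
    Returns a list of (annotator_id, item_id, z_score) in the same order.
    """
    # Collect every score given by each annotator.
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    # Population mean and standard deviation per annotator.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    standardized = []
    for annotator, item, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who always gives the same score carries no signal;
        # map those ratings to 0.0 instead of dividing by zero.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        standardized.append((annotator, item, z))
    return standardized


example = [
    ("a", 1, 1), ("a", 2, 3), ("a", 3, 5),  # annotator "a" uses the full scale
    ("b", 1, 4), ("b", 2, 4),               # annotator "b" always answers 4
]
result = standardize_per_annotator(example)
```

Under this assumption, a rating of 3 from an annotator whose mean is 3 maps to 0.0 regardless of how harsh or lenient other annotators are, which is the kind of per-rater bias such a normalization is meant to remove.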

Anthology ID: P18-1165
Volume: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2018
Address: Melbourne, Australia
Editors: Iryna Gurevych, Yusuke Miyao
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1777–1788
URL: https://aclanthology.org/P18-1165/
DOI: 10.18653/v1/P18-1165
Cite (ACL): Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. 2018. Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal): Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning (Kreutzer et al., ACL 2018)
PDF: https://aclanthology.org/P18-1165.pdf
Note: P18-1165.Notes.pdf
Poster: P18-1165.Poster.pdf
