Offline Reinforcement Learning from Human Feedback in Real-World Sequence-to-Sequence Tasks

Large volumes of interaction logs can be collected from NLP systems that are deployed in the real world. How can this wealth of information be leveraged? Using such interaction logs in an offline reinforcement learning (RL) setting is a promising approach. However, due to the nature of NLP tasks and the constraints of production systems, a series of challenges arise. We present a concise overview of these challenges and discuss possible solutions.


Introduction
When Natural Language Processing (NLP) systems are deployed in production, and interact with users ("the real world"), there are many potential ways of collecting feedback data or rich interaction logs. For example, one can ask for explicit user ratings (Kreutzer et al., 2018a), collect user clicks (De Bona et al., 2010), or elicit user revisions (Trivedi et al., 2019) to get an estimate of how well the deployed system is doing. However, such user interaction logs are primarily used for a one-off assessment of the system, e.g., for spotting critical errors, detecting domain shifts, or identifying the most successful use cases of the system in production. This assessment can then be used to support the decision of keeping or replacing this system in production.
From a machine learning perspective, using interaction logs only for evaluation purposes is a lost opportunity for offline reinforcement learning (RL). Logs of user interactions are gold mines for off-policy learning, and they should be put to use rather than being discarded after a one-off evaluation. To move towards the goal of using user interaction logs for learning, we discuss the challenges that have so far hindered RL from being employed in real-world interactions with users of NLP systems.
Concretely, our focus is on sequence-to-sequence learning for NLP applications (see § 2 for an overview). For example, many machine translation services provide the option for users to give feedback on the quality of the translation, e.g., by collecting post-edits. Similarly, industrial chatbots can easily collect vast amounts of interaction logs, which can be utilized with offline RL methods (Kandasamy et al., 2017; Zhou et al., 2017; Hancock et al., 2019). In the following, we will thus present challenges that are encountered in user-interactive RL for NLP systems. With this discussion, we aim to (1) encourage NLP practitioners to leverage their interaction logs through offline RL, and (2) inspire RL researchers to steel their algorithms for the challenging applications in NLP.

Offline Feedback for Seq2Seq in NLP
In sequence-to-sequence (Seq2Seq) learning, the task is to map an input sequence x = x_1, x_2, ..., x_{|x|}, ∀x_i ∈ X, to an output sequence y = y_1, y_2, ..., y_{|y|}, ∀y_j ∈ Y, where X and Y denote the input and output vocabularies, respectively. The conditional distribution of the output sequence given the input can be modeled with a policy π_θ with learnable parameters θ. Assuming a left-to-right generation order, the output sequence y is generated by conditioning on the previous output elements y_{<j} and the input sequence x:

π_θ(y | x) = ∏_{j=1}^{|y|} π_θ(y_j | y_{<j}, x).

Mapping the sequence-to-sequence problem formulation to NLP tasks, we have for example:
• Machine translation: x is a source sentence and y the translation of x in a target language.
• Semantic parsing: x is a sentence and y its semantic parse (e.g., in SQL).
• Summarization: x is the document that is to be summarized and y a corresponding summary.
• Dialogue generation: x is the conversation history and y an appropriate reply.
The most distinctive feature of Seq2Seq NLP tasks for RL is their extremely large, structured output spaces: given an output vocabulary of size |Y| and a maximum sequence length M, there are |Y|^M possible output sequences. For instance, in machine translation there might be as many as 30,000 output tokens in the vocabulary and the output sequence length can easily reach 100, leading to a total of 30,000^100 possible outputs.
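To get a feel for this scale, the size of the search space can be computed in log-space with a few lines of Python (the numbers are the ones from the machine translation example above):

```python
import math

# Size of the Seq2Seq output space: |Y|^M combinations for a vocabulary
# of |Y| = 30,000 tokens and a maximum sequence length of M = 100.
vocab_size = 30_000
max_len = 100

# The number itself is astronomically large, so work in log-space:
# log10(|Y|^M) = M * log10(|Y|)
log10_num_outputs = max_len * math.log10(vocab_size)
print(f"roughly 10^{log10_num_outputs:.0f} possible output sequences")
```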
A successful policy identifies the few combinations of tokens that form valid output sequences. In the most extreme case, only one correct output sequence exists, e.g., in a semantic parsing setup where potentially only one specific SQL query will return the correct answer when executed. To train a policy, supervised data can be used: we assume a given dataset D_sup = {(x_t, y_t)}_{t=1}^T on which the parameters θ can be learnt with a maximum likelihood approach, aiming to maximize the model score for the given reference output.
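As a minimal illustration of the maximum likelihood setup, the following sketch scores a reference output under a toy tabular policy. The lookup table stands in for a neural Seq2Seq model, and all names are hypothetical:

```python
import math

def log_prob(policy, x, y):
    """log pi_theta(y | x) = sum over j of log pi_theta(y_j | y_<j, x)."""
    total = 0.0
    for j, token in enumerate(y):
        context = (x, tuple(y[:j]))  # input plus previously generated tokens
        total += math.log(policy[context][token])
    return total

def nll(policy, d_sup):
    """Negative log-likelihood over D_sup; MLE training minimizes this."""
    return -sum(log_prob(policy, x, y) for x, y in d_sup)

# Toy tabular "policy" for a single supervised pair (German -> English).
policy = {
    ("hallo", ()): {"hello": 0.8, "hi": 0.2},
    ("hallo", ("hello",)): {"</s>": 1.0},
}
d_sup = [("hallo", ["hello", "</s>"])]
loss = nll(policy, d_sup)  # -log(0.8) - log(1.0)
```

In a real system the gradient of this loss with respect to θ would drive the parameter updates; here the table is fixed for clarity.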
In practice, it may be too expensive to collect correct, i.e., supervised, output sequences, since they require skilled annotators, e.g., trained translators for a machine translation task. Therefore, one option is to pre-train the policy on some available supervised data, which will allow the model to concentrate on reasonable areas in the output space (Choshen et al., 2020). The model can then be used to produce potentially imperfect output sequences ỹ, which humans judge by assigning a reward δ_t ∈ [0, 1]. Model parameters may then be optimized by pairing the model outputs with their reward estimates. Depending on the use case, quality judgments may also exist for single elements of the structure, adding δ_{(t,j)} for every step in the output sequence. The core idea is that the weighting by δ enables learning from imperfect outputs while respecting their faults. In RL, these quality assessments are used to reward desirable model actions, here desirable output sequences.
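The reward-weighting idea can be sketched in a few lines; here `log_prob` is a hypothetical stand-in for the policy's scoring function, not a real API:

```python
def reward_weighted_objective(log_prob, d_log):
    """Sum of delta_t * log pi_theta(y~_t | x_t) over logged triples.

    Outputs judged close to 0 contribute almost nothing, while outputs
    judged close to 1 are reinforced almost as strongly as supervised
    references: maximizing this realizes "learning from imperfect
    outputs while respecting their faults".
    """
    return sum(delta * log_prob(x, y) for x, y, delta in d_log)
```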
When collecting quality judgments from human users in production systems, it would be risky to directly update the model online according to their feedback. Some user feedback might be adversarial, inappropriate, or unrepresentative if used for training without prior treatment (Rivas et al., 2018; Kreutzer et al., 2018a; Davis, 2016). Furthermore, interpreting feedback wrongly (e.g., through incorrect credit assignment (Bahdanau et al., 2017)) or receiving misleading feedback (Nguyen et al., 2017; Kreutzer et al., 2018a) could easily push the policy into less favorable conditions.
Because updating systems online is too risky, quality judgments are instead stored in interaction logs, i.e., D_log = {(x_t, ỹ_t, δ_t)}_{t=1}^T, and the system is updated offline. As a result, the imperfect output sequences are produced by a possibly different policy, the logging policy µ, and updates to our learning policy are conducted offline, which is a classic off-policy RL scenario.
Due to the logging setup, the collected dataset is biased towards the choices of the deployed model, the logging policy µ. This results in a counterfactual learning scenario (Bottou et al., 2013). The bias may be corrected via importance sampling: if the logging policy is known and µ(ỹ_t | x_t) is logged as well, the policy can be optimized for the Inverse Propensity Scoring (IPS) objective (Rosenbaum and Rubin, 1983):

R_IPS(θ) = (1/T) ∑_{t=1}^T δ_t · π_θ(ỹ_t | x_t) / µ(ỹ_t | x_t).


Challenges for Off-Policy RL in NLP
On top of the difficulties encountered in offline RL in general, additional constraints arise in production scenarios. We address these and possible solutions in §3.1, while §3.2 focuses on how to obtain reliable data from which machine learning can succeed.

Deterministic Logging and Offline Learning
In order not to show inferior outputs to users, production NLP systems present the most likely output, which disables the exploration component that is typically crucial in RL. This effectively results in deterministic logging policies that lack explicit exploration, which makes an application of standard off-policy methods for counterfactual learning questionable. For example, techniques such as inverse propensity scoring (Rosenbaum and Rubin, 1983) or weighted importance sampling (Precup et al., 2000; Jiang and Li, 2016; Thomas and Brunskill, 2016) rely on sufficient exploration of the output space by the logging system as a prerequisite for counterfactual learning. In fact, Langford et al. (2008) and Strehl et al. (2010) even give impossibility results for exploration-free counterfactual learning.
One option is to hope for implicit exploration due to input or context variability. This has been observed in online advertising (Chapelle and Li, 2011) and investigated theoretically (Bastani et al., 2017). In NLP, output sequences may overlap in some of their words, so the learner can infer from rewards in which contexts specific words are more suitable than in others. This has been explored in the context of machine translation (Lawrence et al., 2017b) utilizing the Deterministic Propensity Matching (DPM) objective, which closely follows the IPS objective but, due to the deterministic logging, sets µ(ỹ | x) = 1 for all ỹ. While this form of exploration is limited by the input data, solutions for safe exploration might be attractive to transfer to NLP applications, to actively guide exploration without sacrificing quality (Hans et al., 2008; Berkenkamp et al., 2017).
Another option is to consider concrete cases of degenerate behavior in estimation from logged data. We look at two such issues and possible solutions. Both problems occur irrespective of whether data is logged deterministically or not, but their effects might be amplified in the case of deterministic logging.
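As a concrete sketch, the IPS estimate and its deterministic-logging special case (DPM) might look as follows; the `pi` argument is a hypothetical helper returning π_θ(ỹ | x) under the current policy:

```python
def ips_objective(pi, d_log):
    """IPS: (1/T) * sum_t delta_t * pi_theta(y~_t | x_t) / mu(y~_t | x_t).

    d_log: list of (x, y_tilde, delta, mu_prob) tuples, where mu_prob is
    the logged propensity mu(y_tilde | x) of the logging policy.
    """
    return sum(delta * pi(x, y) / mu for x, y, delta, mu in d_log) / len(d_log)

def dpm_objective(pi, d_log):
    """DPM: IPS with mu(y_tilde | x) = 1 for every logged output, as is
    appropriate when the logging policy was deterministic.

    d_log: list of (x, y_tilde, delta) triples without propensities.
    """
    return sum(delta * pi(x, y) for x, y, delta in d_log) / len(d_log)
```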
The first form of degenerate behavior occurs for a collected log D_log with δ ∈ [0, 1] because IPS and DPM can trivially be optimized by setting all probabilities in the dataset D_log to 1 for any δ_t > 0 (Lawrence et al., 2017a). Concretely, this means that while the worst output sequences with δ_t = 0 are simply ignored, all other sequences are encouraged, even if their reward is close to 0. However, it is clearly undesirable to increase the probability of low-reward examples (Swaminathan and Joachims, 2015; Lawrence et al., 2017b,a).
There are two possible solutions to this problem. The first is to tune the learning rate and perform early stopping before the degenerate state can be reached. The second is to utilize a multiplicative control variate (Kong, 1992) for self-normalization (Swaminathan and Joachims, 2015). For efficient gradient calculation, batches of size B can be reweighted one-step-late (OSL) (Lawrence and Riezler, 2018) using parameters θ′ from some previous iteration:

R_OSL(θ) = ∑_{b=1}^B δ_b π_θ(ỹ_b | x_b) / ∑_{b=1}^B π_{θ′}(ỹ_b | x_b).

Self-normalization discourages increasing the probability of low-reward data because this would take away probability mass from higher-reward outputs. This introduces a bias in the estimator (which decreases as T increases), but it makes learning under deterministic logging feasible, as has been shown for learning with real human feedback in a semantic parsing scenario (Lawrence and Riezler, 2018). This gives the RL agent an edge in an environment that has been deemed impossible in the literature.
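A sketch of the self-normalized, OSL-reweighted estimate over a batch, where `pi` is the current policy and `pi_osl` returns the frozen probabilities from a previous iteration (both hypothetical helpers):

```python
def osl_objective(pi, pi_osl, batch):
    """Self-normalized DPM with one-step-late (OSL) reweighting.

    batch: list of (x, y_tilde, delta) triples logged deterministically.
    The normalizer uses probabilities under the previous parameters
    theta', so in an autodiff framework it would be treated as a
    constant and gradients would flow only through the numerator.
    """
    numerator = sum(delta * pi(x, y) for x, y, delta in batch)
    normalizer = sum(pi_osl(x, y) for x, y, _ in batch)
    return numerator / normalizer
```

This mirrors the multiplicative control variate idea: the reward-weighted batch probability is normalized by the total batch probability mass, at the price of the small bias discussed above.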
A second form of degenerate behavior occurs because the reward δ_t of an output sequence is typically measured with some non-negative value, e.g., δ_t ∈ [0, 1]. For example, for machine translation, Kreutzer et al. (2018b) collect ratings for translations on a 5-point Likert scale and map the values linearly to [0, 1]. However, utilizing any of the above objectives means that bad output sequences with low rewards cannot actively be discouraged.
There are two possible solutions, both of which have been used as additive control variates to reduce variance in gradient estimators. First, low-reward sequences can be discouraged by employing a reward baseline, where for example the average reward ∆ = (1/t) ∑_{t′=1}^t δ_{t′} is subtracted from each δ_t. This causes output sequences worse than the running average to be discouraged rather than encouraged. The second option is to use the logged data D_log to learn a reward estimator δ̂ that can return a reward estimate for any pair (x, y). This estimator together with the IPS objective leads to the Doubly Robust (DR) objective (Dudik et al., 2011):

R_DR(θ) = (1/T) ∑_{t=1}^T ( E_{y∼π_θ(·|x_t)}[δ̂(x_t, y)] + (π_θ(ỹ_t | x_t) / µ(ỹ_t | x_t)) · (δ_t − δ̂(x_t, ỹ_t)) ).

This objective enables the exploration of other outputs ỹ that are not part of the original log and encourages them based on the reward value returned by the estimator. For the task of machine translation, Lawrence et al. (2017b) show this objective to be the most successful in their setup, and Kreutzer et al. (2018a) report simulation results showing that it can significantly reduce the gap between offline and online policy learning, even if the reward estimator is not perfect. Zhou et al. (2017) present an alternating approach to integrating a reward estimator for exploration, switching in phases between learning offline from logged rewards and exploring online with the help of a reward estimator.
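Both additive control variates can be sketched in a few lines; `pi`, `mu`, `reward_est`, and `candidates` are hypothetical helpers standing in for the current policy, the logging policy, the learned reward estimator, and a candidate-output generator:

```python
def baseline_objective(pi, d_log):
    """DPM-style objective with the average reward as a baseline, so
    below-average outputs receive a negative weight and are discouraged.

    d_log: list of (x, y_tilde, delta) triples.
    """
    avg = sum(delta for _, _, delta in d_log) / len(d_log)
    return sum((delta - avg) * pi(x, y) for x, y, delta in d_log) / len(d_log)

def dr_objective(pi, mu, reward_est, candidates, d_log):
    """Doubly Robust estimate: the expected estimated reward over a set
    of candidate outputs, plus an importance-weighted correction using
    the true logged reward on the logged output."""
    total = 0.0
    for x, y_log, delta in d_log:
        # Direct part: estimated reward of outputs not in the log.
        direct = sum(pi(x, y) * reward_est(x, y) for y in candidates(x))
        # Correction part: residual of the estimator on the logged output.
        correction = pi(x, y_log) / mu(x, y_log) * (delta - reward_est(x, y_log))
        total += direct + correction
    return total / len(d_log)
```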

Reliability and Learnability of Feedback
In interactive NLP, it is unrealistic to expect anything other than bandit feedback from a human user interacting with a chatbot, automatic summarization tool, or commercial machine translation system. That is, users of such systems will only provide a reward signal for the one output that is presented to them, and cannot be expected to rate a multitude of outputs for the same input. As a result, the feedback is very sparse in relation to the size of the output space.
Ideally, the user experience should not be disrupted by feedback collection. Non-intrusive interface options, for example, allow corrections of the output ("post-edits" in the context of machine translation) to serve as a negative signal, or record whether the output is copied and/or shared without changes, which may be interpreted as a positive signal. However, the signal might be noisy, since the notion of output quality for natural language generation tasks is not a well-defined function to start with: each input might have many possible valid outputs, each of which humans may judge differently, depending on many contextual and personal factors. In machine translation evaluation, for instance, inter-rater agreement has traditionally been reported as low (Turian et al., 2003; Carl et al., 2011; Lommel et al., 2014), especially when quality estimates are collected from non-professional raters (Callison-Burch, 2009). Similar observations have been made for other text generation tasks (Godwin and Piwek, 2016; Verberne et al., 2018). Nguyen et al. (2017) used simulations to illustrate how badly machine translation systems handle human-level noise in direct feedback for online RL. The level of noise in real-world human feedback may be so high that it prevents learning completely, as experienced for example with e-commerce machine translation logs (Kreutzer et al., 2018a). The issue is even more pronounced in dialogue generation, where there is a plenitude of acceptable responses (Pang et al., 2020). To address this, inverse RL has been proposed to infer reward functions indirectly from responses (Takanobu et al., 2019).
Surprisingly, the question of how to best improve an RL agent in the scenario of learning from real-world human feedback has scarcely been researched. This might originate from many RL research environments coming with fixed reward functions. In the real world, however, there is rarely a single clearly defined reward function for which it would suffice to optimize. The suggestions in Dulac-Arnold et al. (2019) seem straightforward: warm-starting agents to decrease sample complexity, or using inverse reinforcement learning to recover reward functions from demonstrations (Wang et al., 2020). However, these require additional supervision signals that RL was supposed to alleviate the need for.
When it comes to the question of which type of human feedback is most beneficial for training an RL agent, one finds many blanket statements, e.g., referring to the advantages of pairwise comparisons (Thurstone, 1927). For instance, learning from pairwise human preferences has been advertised for summarization (Christiano et al., 2017; Stiennon et al., 2020) and language modeling (Ziegler et al., 2019), but the reliability of the signal has not been evaluated. An exception is the work of Kreutzer et al. (2018b), which is the first to investigate two crucial questions. The first question is which type of human feedback, pairwise judgments or cardinal feedback on a 5-point scale, can be given most reliably by human teachers. The second question is which type of feedback allows learning reward estimators that best approximate human rewards and can best be integrated into an end-to-end RL-NLP task.
Regarding the first question, Kreutzer et al. (2018b) found that the common assumption that pairwise comparisons are easier to judge than a single output on a Likert scale (Thurstone, 1927) turned out to be false for the task of machine translation. Inter-rater reliability proved to be higher for 5-point ratings (Krippendorff's α = 0.51) than for pairwise judgments (α = 0.39). Kreutzer et al. (2018b) explain two advantages that the Likert-scale setup offers: (1) cardinal judgments can be standardized for each rater to remove individual biases, and (2) they offer an absolute anchoring of quality, while preference rankings leave the overall positioning of the pair of outputs on a quality scale open. For pairwise judgments, it is difficult or even impossible to reliably choose between two outputs that are similarly good or bad, e.g., differing by only a few words. Therefore, filtering out raters with low intra-rater reliability proved effective for absolute ratings, while filtering out outputs with a high variance in ratings was most effective for pairwise ratings, yielding the final inter-rater reliability figures given above. Discarding rated outputs, however, reduces the size of the log to learn from, which is undesirable in settings where rewards are scarce or costly.
To answer the second question, Kreutzer et al. (2018b) found that a neural machine translation system can be significantly improved using a reward estimator trained on only a few hundred cardinal user judgments. This work highlights that future research in real-world RL might have to involve studies of user interfaces or user experience, since the interfaces for feedback collection influence the reward function that RL agents learn from, and thereby the downstream task success. Collecting implicit feedback (Kreutzer et al., 2018a; Jaques et al., 2020) might offer a better user experience.
For the challenges discussed in Sections 3.1 and 3.2, a promising approach is to first tackle the arguably simpler problem of learning a reward estimator from human feedback, and then use it to provide unlimited learned feedback that generalizes to unseen outputs in off-policy RL. However, the risks of introducing bias and the potential benefits for noise reduction when replacing user feedback with reward estimators are yet to be quantified.

Conclusion
There is large potential in NLP to leverage user interaction logs for system improvement. We discussed how algorithms for offline RL can offer promising solutions for this learning problem. However, specific challenges in offline RL arise due to the particular nature of NLP systems that collect human feedback in real-world applications. We presented cases where such challenges have been found and offered solutions that have helped. So far, the solutions have mainly been explored in the context of machine translation and semantic parsing. In the future, it will be interesting to explore further tasks and additional real-world use cases to find out how to best learn from human feedback.