Feedback Attribution for Counterfactual Bandit Learning in Multi-Domain Spoken Language Understanding

With counterfactual bandit learning, models can be trained based on positive and negative feedback received for historical predictions, with no labeled data needed. Such feedback is often available in real-world dialog systems; however, the modularized architecture commonly used in large-scale systems prevents the direct application of such algorithms. In this paper, we study the feedback attribution problem that arises when using counterfactual bandit learning for multi-domain spoken language understanding. We introduce an experimental setup to simulate the problem on small-scale public datasets, propose attribution methods inspired by multi-agent reinforcement learning, and evaluate them against multiple baselines. We find that while directly using overall feedback leads to disastrous performance, our proposed attribution methods allow training competitive models from user feedback.


Introduction
Spoken language understanding (SLU) is a key component of task-oriented dialog systems (Tur and De Mori, 2011). It is commonly modeled as two tasks: intent classification (IC), which assigns an intent to an utterance, and slot labeling (SL), which recognizes boundaries and types of slots in the utterance's tokens. In recent years, neural models that jointly learn both tasks, in combination with pre-trained transformers (Zhang et al., 2019), have become the state-of-the-art approach to SLU (Louvan and Magnini, 2020; Weld et al., 2021). However, a sufficient amount of manually labeled data is needed for fine-tuning.
If dialog systems are actively used, a cost-efficient alternative to manually labeled data is to leverage user feedback, which can appear explicitly (e.g. "no stop", "thank you") or implicitly (e.g. retries, interruptions). Such positive and negative feedback in response to a prediction is known as bandit feedback. Compared to ground truth labels it is less informative, as negative feedback does not reveal which prediction would have been better, but on the other hand, it is available in large quantities without manual effort. Recent work proposed multiple ways in which such feedback can be used to improve dialog systems (Muralidharan et al., 2019; Ponnusamy et al., 2020; Kim and Kim, 2020; Falke et al., 2020). In this work, we focus on leveraging it via counterfactual bandit learning, a well-studied approach for training models with historical bandit feedback (Langford et al., 2008; Dudík et al., 2011; Joachims et al., 2018).
The key challenge addressed in this paper is the feedback attribution problem arising in large-scale multi-domain SLU systems (see Figure 1). A common architecture in such systems combines domain-specific IC and SL models with a domain classifier (DC) (Jeong and Lee, 2009; Xu and Sarikaya, 2014; Hakkani-Tür et al., 2016), which makes it easier to update domain-specific behavior regularly without being bottlenecked by a single joint model. For the bandit learning setting, however, this introduces additional challenges: given negative feedback for a combined prediction, it is unclear which individual model caused the error, and thus how to use the feedback for training. If it is given directly to all models, already correct predictions will be penalized. A three-fold attribution problem, consisting of task-level, domain-level and token-level attribution (see Section 3.2), needs to be solved to use the feedback effectively.
In this paper, we make several contributions to address the feedback attribution problem: First, we propose an experimental setup based on SNIPS (Coucke et al., 2018) and TOP (Schuster et al., 2019) to simulate the problem on small-scale public datasets. Second, we propose a first set of attribution methods inspired by multi-agent reinforcement learning. Third, we evaluate them against multiple baselines on the two datasets. We find that while using the overall feedback without attribution leads to disastrous performance, the proposed attribution methods allow learning competitive models if the logged feedback data contains sufficient exploration.

Counterfactual Bandit Learning
In counterfactual bandit learning, we have access to a dataset of n tuples of bandit feedback (x_i, y_i, p_i, δ_i) collected by a logging policy¹ π_0. In each tuple, y_i is the prediction made by π_0 for input x_i with probability p_i = π_0(y_i|x_i), called the propensity, and δ_i is the feedback received for that prediction. We assume δ_i ∈ ℝ, where higher is better. The goal is to find a new policy π that maximizes the expected feedback of that policy:

R(π) = E_{x∼P(X)} E_{y∼π(·|x)} [δ(x, y)]    (1)

Given the bandit dataset, we can estimate R(π) via importance sampling with the inverse propensity scaling (IPS) estimator (Langford et al., 2008):

R̂_IPS(π) = (1/n) Σ_{i=1}^{n} δ_i · π(y_i|x_i) / p_i    (2)

Note that Eq. 2 relies only on the logged prediction y_i for which feedback was given. This makes it possible to use the partial bandit feedback directly to train a classifier. It differs from supervised learning, which instead requires access to the ground truth label y*_i (see Eq. 3). Several improvements to the IPS estimator have been proposed to learn more effectively, including self-normalization (Swaminathan and Joachims, 2015; Joachims et al., 2018) or combining IPS with a direct feedback estimation model (Dudík et al., 2011). We rely on vanilla IPS for our experiments, but make no assumptions that would prevent using a more advanced bandit learning method instead.

¹ We use policy and model interchangeably in this paper.

A crucial assumption of most counterfactual bandit learning methods is that the logging policy is stochastic, i.e. feedback is given for predictions that were sampled from π_0 and thus covers different labels for the same input. However, as Lawrence et al. (2017) point out, real-world systems typically use argmax predictions to deliver the best possible user experience in each case, which limits the amount of exploration present in historical data. They find that counterfactual learning can be possible even with fully deterministic logging policies (Lawrence et al., 2017; Lawrence and Riezler, 2018). In this work, we address that question for the case of multi-domain SLU.
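As a minimal sketch of the IPS estimate in Eq. 2 (our own toy code, not from the paper; all names and values are illustrative), assuming per-example feedback, logging propensities, and the new policy's probabilities for the logged labels are available as arrays:

```python
import numpy as np

def ips_estimate(delta, p0, p_new):
    """Vanilla IPS estimate of R(pi): the average logged feedback,
    importance-weighted by how likely the new policy is to repeat each
    logged prediction relative to the logging policy."""
    return float(np.mean(delta * (p_new / p0)))

# Toy log of three interactions.
delta = np.array([1.0, -1.0, 1.0])   # feedback for the logged predictions
p0    = np.array([0.8, 0.6, 0.5])    # logging propensities pi_0(y_i|x_i)
p_new = np.array([0.9, 0.1, 0.6])    # new policy's prob. of the same y_i

r_hat = ips_estimate(delta, p0, p_new)
```

Note how the estimate rewards the new policy for down-weighting the prediction that received negative feedback, using only the logged labels.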

Application to Multi-Domain SLU
The goal of SLU is to predict, given an utterance x with T tokens, its domain y^D, intent y^I and token-wise slot labels (y^S_t)_{t=1}^{T}. In the modularized multi-domain setup considered here, such predictions are produced by a DC model π^D and domain-specific IC and SL models π^{I,d} and π^{S,d} (see Figure 1). At inference time, the IC and SL predictions of the top domain predicted by the DC are used as the overall prediction.
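The modular inference flow can be sketched as follows; the model objects here are hypothetical stand-ins for π^D, π^{I,d} and π^{S,d}, and the domain/intent/slot names are illustrative:

```python
def predict(tokens, dc_model, ic_models, sl_models):
    """Run the modular pipeline: the DC picks a domain, then only that
    domain's IC and SL models produce the final intent and slot labels."""
    domain = dc_model(tokens)             # top domain y^D
    intent = ic_models[domain](tokens)    # y^I from that domain's IC model
    slots = sl_models[domain](tokens)     # (y^S_t) from that domain's SL model
    return domain, intent, slots

# Minimal fake models for illustration.
dc = lambda toks: "music"
ic = {"music": lambda toks: "PlayMusic",
      "weather": lambda toks: "GetWeather"}
sl = {"music": lambda toks: ["O", "B-artist"],
      "weather": lambda toks: ["O"] * len(toks)}

prediction = predict(["play", "adele"], dc, ic, sl)
```

Only the winning domain's models contribute to the combined prediction, which is exactly why overall feedback cannot be traced back to one model without attribution.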

Supervised Training
Given a dataset D_S of examples with ground truth labels, the models in the multi-domain setting can be trained as follows: for the DC model, all examples of D_S are used, and for each domain-specific IC or SL model, the subset of examples of that domain, defined by y*^D_i, is used. The standard loss to train π^D, π^{I,d} and π^{S,d} with such data is cross entropy, applied at the token level in the case of π^{S,d}:²

L_CE(π) = −(1/n) Σ_{i=1}^{n} log π(y*_i | x_i)    (3)
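The data routing described above can be sketched with a small helper (our own illustrative code; the tuple layout is an assumption, not the paper's format):

```python
from collections import defaultdict

def route_examples(dataset):
    """Split labeled data for the modular setup: the DC model is trained
    on all examples, while each domain-specific IC/SL model only sees
    the examples whose gold domain matches its own."""
    dc_data = [(x, d) for x, d, _, _ in dataset]
    per_domain = defaultdict(list)
    for x, d, intent, slots in dataset:
        per_domain[d].append((x, intent, slots))
    return dc_data, dict(per_domain)

toy = [("play adele", "music", "PlayMusic", ["O", "B-artist"]),
       ("rain today?", "weather", "GetWeather", ["O", "B-date"])]
dc_data, ic_sl_data = route_examples(toy)
```

With ground truth domains, this routing is trivial; in the bandit setting it becomes the domain-level attribution problem, since the true domain is unknown.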

Bandit Training and Attribution
In the bandit setting, we instead have a dataset with overall feedback δ_i (omitting logging propensities for readability). It can be used to train the individual models π^D, π^{I,d} and π^{S,d} by minimizing the negative IPS estimate of Eq. 2 (at the token level for π^{S,d}):

L_IPS(π) = −(1/n) Σ_{i=1}^{n} δ_i · π(y_i|x_i) / p_i    (4)

However, since δ_i is overall feedback, using it directly is problematic: if only some parts of the predicted labels are wrong, which is typically the case, using the negative feedback directly to train all models via Eq. 4 is detrimental to the already correct predictions. Three attribution challenges arise: feedback should be attributed to the responsible model (task-level attribution) and sub-prediction (token-level attribution for SL). In addition, it is unclear which examples to use to train domain-specific IC/SL models, as the true domain is unknown (domain-level attribution). To cope with these challenges, we aim to attribute the overall feedback by mapping δ_i to model-specific feedback signals δ^D_i, δ^I_i and δ^S_{i,t}.

Attribution Methods
We propose and evaluate two methods for fine-grained feedback attribution.

Propensity-based FA The first method relies on the propensities of the logging policy, following the idea that the policy might be self-aware of some of its mistakes and reflect them in its propensities. Given overall feedback δ_i and propensities p^D_i, p^I_i and p^S_i, with p^S_i being the average of the token-level propensities, we compute

u^c_i = 1 − p^c_i for c ∈ {D, I, S}    (5)

δ^c_i = δ_i · u^c_i / (u^D_i + u^I_i + u^S_i)    (6)

This distributes the feedback proportionally to the uncertainty reflected in the propensities.
² For SL, a CRF loss is an alternative to token-level cross entropy, but is not necessarily superior.
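The propensity-based split can be sketched as follows, assuming feedback is distributed proportionally to each component's uncertainty 1 − p (function and variable names are ours):

```python
def propensity_attribution(delta, p_dc, p_ic, p_sl_tokens):
    """Split overall feedback across DC, IC and SL proportionally to
    each component's uncertainty (1 - propensity); the SL propensity
    is the average of the token-level propensities."""
    p_sl = sum(p_sl_tokens) / len(p_sl_tokens)
    uncertainty = {"D": 1 - p_dc, "I": 1 - p_ic, "S": 1 - p_sl}
    z = sum(uncertainty.values())
    if z == 0:                       # fully confident logging policy
        return {c: 0.0 for c in uncertainty}
    return {c: delta * u / z for c, u in uncertainty.items()}

# Negative overall feedback; the IC model was the least confident.
fa = propensity_attribution(-1.0, p_dc=0.9, p_ic=0.5, p_sl_tokens=[0.8, 0.6])
```

Note that the components' attributed feedback sums back to δ_i, and that negative feedback stays negative for every component, since only the magnitude can change.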
Advantage-based FA For the second method, we follow the credit assignment idea of COMA (Foerster et al., 2018, Eq. 4), a method for reinforcement learning in a multi-agent setting. It uses an advantage function for each agent that subtracts a baseline estimated based on a joint Q-function.
In our setup, this idea translates to training an overall feedback estimator f(x, y^D, y^I, (y^S_t)) → δ̂ and using it in an advantage function, as follows for DC:

δ^D_i = f(x_i, y^D_i, y^I_i, (y^S_{i,t})) − Σ_{y∈Y^D} w_y · f(x_i, y, y^I_i, (y^S_{i,t}))    (7)

The second part of Eq. 7 computes the weighted average estimated feedback across a set of alternative domain predictions Y^D, obtained by replacing the original y^D_i with the alternatives (and leaving the IC and SL predictions constant). By subtracting this baseline from the estimate for the originally predicted domain (first part of Eq. 7), we obtain the DC-specific feedback δ^D_i. If the original prediction was truly better or worse than the alternatives, the difference will be large and can serve as a corresponding model-specific feedback signal. If it made no difference, however, δ^D_i will be close to zero, and the attribution thereby reflects this irrelevance.
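The advantage computation for DC can be sketched as follows (our own illustrative code; the fake estimator below stands in for the trained model f):

```python
def advantage_attribution(f, x, y_d, y_i, y_s, alternatives, weights):
    """COMA-style DC attribution: the estimated feedback of the logged
    domain minus a weighted baseline over alternative domains, with
    the IC and SL predictions held fixed."""
    baseline = sum(w * f(x, alt, y_i, y_s)
                   for alt, w in zip(alternatives, weights))
    return f(x, y_d, y_i, y_s) - baseline

# Fake feedback estimator: believes only "music" earns positive feedback.
f_hat = lambda x, d, i, s: 1.0 if d == "music" else -1.0

delta_dc = advantage_attribution(
    f_hat, x="play adele", y_d="music", y_i="PlayMusic",
    y_s=["O", "B-artist"], alternatives=["music", "weather"],
    weights=[0.5, 0.5])
```

Here the logged domain scores above the baseline over alternatives, so the DC receives positive attributed feedback; had an alternative scored equally well, the advantage would shrink toward zero.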
We use variations of Eq. 7 to also attribute to IC and SL: while for DC the alternatives Y^D in Eq. 7 are all domains, for IC and SL we use the domain-specific label spaces of intents and slots. For SL, the equation is applied per token, and we restrict the alternatives Y^S to the two highest-propensity labels per token to limit computational complexity. We test different weights w_y (see Section 4).

Experiments
As SNIPS, with its train/dev and 700 test examples, has no domain annotation, we introduce domains by grouping its 7 intents into 3 domains. We split the train/dev part of each dataset into 5% labeled data (D_S), on which the logging policy π_0 is trained with full supervision, and use the rest to create bandit data D_B. For each utterance x_i in D_B, we either take the argmax prediction or sample from the predicted class distribution of π_0, and we determine (perfect) overall feedback δ_i by comparing the prediction to the ground truth: if any of the domain, intent or slot labels are incorrect, we set δ_i = −1, otherwise δ_i = 1.
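The simulated overall feedback is a simple exact-match check, sketched here with our own names:

```python
def simulate_feedback(pred, gold):
    """Perfect simulated overall feedback: +1 only if domain, intent and
    every slot label match the ground truth, otherwise -1."""
    pred_d, pred_i, pred_s = pred
    gold_d, gold_i, gold_s = gold
    correct = (pred_d == gold_d and pred_i == gold_i and pred_s == gold_s)
    return 1 if correct else -1
```

A single wrong slot label flips the whole example to δ_i = −1, which is precisely what makes attribution necessary downstream.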
We compare seven methods to attribute δ_i:
• Overall No attribution; δ_i is used directly.
• Positive Only examples with positive feedback are used, which requires no attribution.
• Propensity Propensity-based FA following Eq. 5 and 6. Note that the method can change the magnitude of the feedback per component, but negative feedback will always stay negative, as the sign cannot change in the computation. As the IC/SL models use only data with δ^D_i > 0, they will use just positive feedback. We thus also include a variation Prop-All that instead uses all examples for IC/SL, at the risk of giving noisier signals.
• Advantage Advantage-based FA following Eq. 7. The feedback estimator is trained on D_B minimizing mean squared error. We evaluate two variations: Adva-Uni, using uniform weights w_y, and Adva-Prop, using π_0's propensities as weights.
• Oracle Perfect attribution by setting −1 or 1 based on ground truth labels (upper bound).
Note that all methods use positive feedback right away and attribute only negative feedback (except for Positive, which completely drops the negative feedback examples).

In our multi-domain setup, we train a DC and three IC/SL models. We rely on DistilBERT (Sanh et al., 2019) and fine-tune it with an utterance-level classifier for DC or, for IC/SL, with jointly trained utterance- and token-level classifiers. In each case, we use a 256-d hidden ReLU layer. For π_0, all four models are trained with cross entropy (Eq. 3) on D_S. For bandit learning, we train the same models, using π_0 as initial weights, but use the IPS loss (Eq. 4) and D_B. The overall feedback estimator f is trained with mean squared error on D_B. Additional details and hyperparameters, tuned per FA method, can be found in the Appendix.

Evaluation is done on the standard test sets, and we report accuracy for DC and IC, slot-based F1 for SL and overall accuracy (Sem-Acc), all averaged over 10 runs. Table 1 shows results for both datasets and feedback setups. We make the following observations:

Results
Multi-domain SLU can be learned from bandit feedback. The initial logging policy, trained with only 5% of the usual data, leaves substantial room for improvement on both datasets, especially in SL. All models trained with counterfactual bandit learning, except for the naive Overall baseline, effectively use the bandit feedback to improve over the logging policy. With perfect feedback attribution (Oracle), bandit learning can even yield models that come close to the performance of models trained with full supervision on 100% of the labels.
Feedback attribution is crucial. As the Overall baseline illustrates, using the overall feedback directly without any attribution hurts the models and decreases performance below the logging policy, confirming the need to address this challenge. Perfect attribution (Oracle), on the other hand, shows as an upper bound how well learning can work with properly attributed feedback, resulting in large improvements over Overall. The attribution methods we evaluated capture some of that potential, but still leave room to improve performance with better attribution.
Advantage-based FA beats propensity-based FA. The propensity-based approach, especially the Prop-All variant, does poorly compared to the other methods. One shortcoming is that it is not equipped to solve the domain-level attribution problem and thus cannot use feedback effectively for IC/SL; more importantly, its attribution relies completely on the model's self-awareness of errors and does not leverage the feedback itself for attribution. The advantage-based methods leverage the feedback via the estimation model and show better performance, in particular the uniform-weighting variant.
Exploration is crucial. Side-stepping the attribution problem by not leveraging the negative feedback at all (Positive) shows strong results in both experimental settings, the one where feedback is available only for the argmax prediction and the one with feedback for 5 samples. Only in the 5-sample setting, where the feedback estimator can be trained more effectively because the data contains some amount of exploration, can Adva-Uni beat Positive in some cases, demonstrating the potential of additionally using negative feedback. We would like to emphasize that even in this case the exploration in the dataset is limited, as the sampled predictions are still equal to the argmax in most cases (for DC/IC/SL on 98.5%/98.2%/92.8% of SNIPS examples and 99.7%/98.7%/96.9% of TOP examples), and the setting is thus still far from the supervised setting, where feedback for all possible predictions is known. We expect further improvements when using bandit data with higher exploration.

Conclusion
In this paper, we studied feedback attribution for bandit learning in multi-domain SLU. We proposed multiple attribution methods together with an evaluation setup and found that attribution is crucial to learning. Advantage-based FA helps if historical exploration is available. In future work, we plan to extend the setup to use true user feedback, which can be noisy and inconsistent, and to include components beyond SLU in the attribution scope.

Ethical Considerations
In this work, we study training SLU models directly from user feedback via counterfactual bandit learning. While making the model training more cost-efficient, this also has the benefit that the system behavior can be more in line with user expectations and as a result improve the user experience. However, optimizing models directly towards user feedback can also have negative consequences: The optimization could emphasize the feedback given by the largest user groups and thereby amplify model biases and undesired system behavior for minority groups. In addition, some users might even adapt their behavior to influence the system directly to their advantage or the disadvantage of other users. To mitigate such risks, we propose to verify the model performance on carefully curated offline test sets before deploying trained models and to try to filter out problematic user behavior from bandit data before training. As counterfactual bandit learning is fully offline, such measures can be implemented easily compared to the more challenging online learning setting.

A Training Details
SLU Models For the logging policy, we fine-tune the base uncased version of DistilBERT (Sanh et al., 2019) with batch size 16, dropout 0.1, and Adam with learning rates 1e-4 (DC) and 5e-4 (IC/SL), gradually unfreezing the BERT layers over the first two epochs. All models are trained until the loss stops decreasing on the validation set. For bandit learning, we use the same architecture, initialized with the final weights of the logging policy models, and continue training with batch size 16. Each model has between 66.6M (DC) and 66.8M (IC/SL) parameters (with slight differences depending on the size of a domain's label space).
Feedback Estimator For the feedback estimator f(x, y^D, y^I, (y^S_t)) → δ̂, we fine-tune the same pre-trained model, but with a regression output on top of a 256-d hidden ReLU layer. As input, we provide a single sequence consisting of the predicted domain, intent and, token by token, the utterance and slot labels. To represent domain, intent and slot labels, we use tokens reserved for such purposes in the pre-trained model's vocabulary. The model is trained to minimize mean squared error predicting the overall feedback in D_B. The batch size is 32. It has around 66.6M parameters.
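The input serialization might look as follows; this is an illustrative sketch, where bracketed strings stand in for the reserved vocabulary tokens mentioned above:

```python
def build_estimator_input(domain, intent, tokens, slot_labels):
    """Serialize a full prediction into one input sequence for the
    feedback estimator: domain and intent labels first, then the
    utterance interleaved token by token with its predicted slot labels."""
    sequence = [f"[{domain}]", f"[{intent}]"]
    for token, slot in zip(tokens, slot_labels):
        sequence += [token, f"[{slot}]"]
    return sequence

seq = build_estimator_input("music", "PlayMusic",
                            ["play", "adele"], ["O", "B-artist"])
```

Interleaving slot labels with the tokens keeps each label adjacent to the token it annotates, so the estimator can judge token-level plausibility.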
Parameter Tuning To ensure a fair comparison, we tune the learning rates used for bandit learning (including for training the feedback estimator) for each feedback attribution method separately, for DC and IC/SL separately, and pick the rate with the best validation loss. This is especially important as feedback attributed with different methods can differ in magnitude.