Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Reliable automatic evaluation of dialog systems in an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics for language generation tasks (e.g., perplexity, BLEU) or model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods show only a weak correlation with actual human evaluation in practice. To bridge this gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances in off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during evaluation, making automatic evaluation feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies used to collect the experience data, which significantly alleviates the technical difficulties of modeling complex dialog environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.


Introduction
One of the fundamental research bottlenecks for developing dialog systems is evaluation, namely how to measure the performance of these systems in an automatic and scalable manner. Different from supervised natural language understanding tasks (e.g., text classification and machine translation), an ideal environment for evaluating dialog systems, also known as the Turing test, involves multi-turn human interaction (Turing, 1950; Liu et al., 2016; Ghandeharioun et al., 2019; See et al., 2019). While online platforms such as Amazon Mechanical Turk can provide human-based evaluation, they are often expensive and not scalable.

* Work was done during an internship at Google Cloud AI.
Researchers have adopted language quality metrics for single-turn response generation given a fixed context (e.g., BLEU score and perplexity) to implement automatic dialog system evaluation (DeVault et al., 2011; Xiang et al., 2014; Higashinaka et al., 2014; Gandhe and Traum, 2016; Lowe et al., 2017). However, these metrics only weakly correlate with human evaluation in practice (Liu et al., 2016; Ghandeharioun et al., 2019). One cause of this weak correlation is that language quality metrics rely on an exact match between generated text and the ground truth, which generally do not fully overlap. While certain embedding-based metrics have been developed to combat this lack of coverage (Mitchell and Lapata, 2008; Dziri et al., 2019), they are only post-hoc judgments based on static experience data, and do not necessarily reflect the dynamic quality of multi-turn interactive dialog well (Ghandeharioun et al., 2019). Moreover, evaluation of goal-oriented dialog systems should be based on how well dialog systems collect information from users and whether the goal is completed; language quality metrics are thus unable to meet these requirements.
To overcome the limitations of the aforementioned static evaluation methods, another line of work has proposed to model the interactive process of a conversation as a Markov decision process (MDP) (Möller et al., 2006; Li et al., 2016; Yu et al., 2016; Shah et al., 2018; Jaques et al., 2019). Accordingly, automatic evaluation of dialog systems can be formulated as an off-policy evaluation (OPE) problem, where a human subject is the so-called "environment" in the reinforcement learning (RL) literature. For instance, Wei et al. (2018) propose a model-based approach for goal-oriented dialog systems. They first learn an environment/human model from the experience data consisting of human responses, and then evaluate a dialog agent/policy by executing the policy within the learned environment. This procedure is known as "self-play evaluation". Such a model-based approach requires an accurate estimate of the environment/human. However, both the input and output of the environment lie in a combinatorially large space, i.e., the trained model needs to be able to mimic the complex human behavior of generating meaningful sentences from a huge vocabulary. Unfortunately, such a requirement is far beyond the current capability of model-based RL algorithms. As a result, evaluations that require accurate modeling of the environment are often unreliable. A similar model-based approach has been proposed to evaluate open-domain chit-chat dialog systems (Ghandeharioun et al., 2019). In addition to modeling human behavior, it also models the reward function (to mimic the complex mechanism behind human ratings) based on handcrafted features, which makes the evaluation even more unreliable.
In this paper, we propose a general OPE framework named ENIGMA (EvaluatiNg dIaloG systeMs Automatically) for estimating human evaluation scores (i.e., how a human would rate a dialog system). Different from the existing model-based approaches, which rely on complex modeling of human behavior over a combinatorially large vocabulary, ENIGMA takes advantage of recent advances in model-free OPE and avoids direct modeling of dynamic transitions and reward functions in a complex environment. Moreover, ENIGMA overcomes several limitations of existing OPE methods in order to evaluate dialog systems: (I) Existing OPE methods only apply to infinite- or fixed-horizon settings (where the horizon length corresponds to the number of turns in a conversation), while conversations often have varying horizon lengths; (II) Existing OPE methods require the experience data to sufficiently cover the states and actions a target policy might visit. Due to limited experience data and the combinatorial nature of language, such a requirement can hardly be satisfied in dialog evaluation; (III) Certain OPE methods rely on accurate estimation of the behavior policies used to collect the experience data. Unfortunately, such behavior policies are humans or complex dialog systems, and estimating their probabilistic models is a challenging imitation learning problem.¹

¹ Note that even though some model-free OPE estimators still require modeling behavior policies, this is still significantly easier than model-based OPE, which has to model the underlying dialog environment.

To address (I), we propose a pseudo-state padding method, which augments each conversation into infinitely many turns while preserving the original policy value; to address (II), we leverage pre-trained language models (Devlin et al., 2018), which transfer knowledge from open-domain data to learn a representation that alleviates the coverage requirement in the original combinatorial space; to address (III), we adopt a stationary distribution correction estimation approach (Nachum et al., 2019a), which directly models the state-action density ratio between the experience data and the target policy (Liu et al., 2018), and is therefore agnostic to the behavior policy.

Background
• Dialog Generation as a Markov Decision Process. A conversation is generated through interactions alternating between an agent π (i.e., a dialog system) and an environment E (i.e., a human). We denote the conversation as h = {e_0, a_1, e_1, ..., a_T}, where a_i and e_i are sentences generated by π and E respectively, and T is the number of turns in the conversation. Dialog can be naturally described as an MDP (Puterman, 1995), M = ⟨S, A, P, R, µ_0⟩. Specifically, at the t-th turn, the state s_t ∈ S captures the previous conversation history s_t = {e_0, a_1, e_1, ..., a_{t-1}, e_{t-1}}. An action a_t ∈ A is the agent's response given this context. The conversation can then be represented by the last state and action, i.e., h = {s_T, a_T}. An agent π is essentially a policy that maps S to P(A), where P(·) denotes the set of probability measures over the action space. A transition kernel P(·|s_t, a_t) returns s_{t+1} as the state at turn t + 1, and the environment E generates a reward r_t = R(s_t, a_t) ∈ [0, 1]. Note that s_{t+1} essentially concatenates s_t and a_t with e_t. The initial state s_1 = {e_0} is randomly sampled from some distribution µ_0. We follow the sparse reward setting, where each conversation is only evaluated at the ending state, i.e., r_t = 0 for t < T.
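The MDP formulation above can be made concrete with a small sketch (pure Python; the class and function names, and the `<end_conversation>` terminal marker, are illustrative and not from the paper's implementation):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DialogState:
    """State s_t: the full conversation history up to the agent's t-th turn."""
    history: Tuple[str, ...]  # (e_0, a_1, e_1, ..., a_{t-1}, e_{t-1})

def transition(state: DialogState, action: str, env_reply: str) -> DialogState:
    """P(s_{t+1} | s_t, a_t): s_{t+1} concatenates s_t with the agent's a_t
    and the human's reply e_t."""
    return DialogState(history=state.history + (action, env_reply))

def sparse_reward(action: str, final_score: float) -> float:
    """R(s_t, a_t): zero until the conversation ends, then the human's
    evaluation score in [0, 1] (sparse reward setting)."""
    return final_score if action == "<end_conversation>" else 0.0
```

Note how the state fully determines the history, which is why a conversation can be summarized by its last state-action pair {s_T, a_T}.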
• Automatic Dialog Evaluation as Off-Policy Evaluation. Dialog evaluation can be naturally viewed as computing the expected reward of the above MDP, defined as

ρ(π) := E_h[R(s_T, a_T)], (1)

where h = {s_T, a_T} is sampled from the initial distribution µ_0 and the interaction between π and E. When the environment (i.e., a human) is accessible, ρ(π) can be directly estimated by interacting with the environment, which is known as on-policy evaluation (Sutton and Barto, 2018). When interaction with humans is prohibited, human-free automatic evaluation is required, and off-policy evaluation (OPE) (Precup, 2000) is an appealing choice. In particular, OPE can estimate ρ(π) based solely on pre-collected tuples {(s, a, r, s′)_i}_{i=1}^N from (multiple) behavior policies that are different from π.
OPE has been considered one of the most fundamental problems in RL. A straightforward approach is to first learn an environment model (R and P) directly from experience data and then estimate ρ(π) by executing the policy within the learned environment. Such model-based OPE exactly corresponds to the so-called "self-play evaluation" in the dialog system literature (Wei et al., 2018; Ghandeharioun et al., 2019). Unfortunately, it is notoriously difficult to specify a proper model for highly complicated environments such as a dialog environment (i.e., a human), where the state and action spaces are combinatorially large due to the huge vocabulary size and complex transitions. As a result, the estimation error of the environment accumulates as the interaction proceeds, and model-based self-play evaluation of dialog systems often becomes unreliable (Voloshin et al., 2019).
To address the challenge above, many model-free OPE methods that avoid direct modeling of the environment have been proposed. Model-free OPE can be categorized into behavior-aware and behavior-agnostic methods. Specifically, behavior-aware methods rely on either knowing or accurately estimating the probabilistic model of the behavior policies used for collecting the experience data (e.g., inverse propensity scoring, Horvitz and Thompson (1952)). Unfortunately, behavior policies are often unknown in practice. Estimating their probabilistic models is also quite challenging, as it requires modeling human behaviors or complex dialog systems. Behavior-agnostic methods, on the other hand, do not require explicit knowledge or direct modeling of the behavior policies, and are therefore more favorable when the experience data is collected by multiple (potentially unknown) behavior policies.
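For concreteness, per-episode inverse propensity scoring, the classic behavior-aware estimator mentioned above, can be sketched as follows (illustrative code; it assumes the behavior policy's probabilities are known, which is exactly what is unavailable in our dialog setting):

```python
def ips_estimate(episodes, target_prob, behavior_prob):
    """Per-episode inverse propensity scoring (Horvitz & Thompson, 1952):
    reweight each logged return by the likelihood ratio of the target vs.
    the behavior policy accumulated over the whole trajectory."""
    total = 0.0
    for trajectory, reward in episodes:
        weight = 1.0
        for state, action in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
        total += weight * reward
    return total / len(episodes)
```

As a toy check, with a uniform behavior policy and a target policy that always picks the rewarded action, the IPS estimate recovers the target value exactly; with a near-deterministic behavior policy the weights (and hence the variance) blow up, which is the statistical limit discussed later.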
Unfortunately, most existing model-free behavior-agnostic OPE methods focus on either infinite-horizon (Nachum et al., 2019a; Zhang et al., 2020b; Yang et al., 2020) or fixed-horizon settings (Yin and Wang, 2020; Duan and Wang, 2020), and cannot be applied to evaluating dialog systems, whose horizon (number of turns) varies across conversations. While LSTDQ (Lagoudakis and Parr, 2003) can be adapted to handle varying horizons, it has been shown not to work well under the sparse reward setting (Lagoudakis and Parr, 2003; Mataric, 1994).

ENIGMA
We present the ENIGMA framework for automatically evaluating dialog systems. In particular, ENIGMA is model-free and agnostic to the behavior policies used to generate the experience data. ENIGMA has three components: (1) pseudo-state padding for converting a dialog into an infinite-horizon MDP, (2) distribution correction estimation (DICE, Nachum et al. (2019a)) with post-normalization for estimating the value of the target policy based on experience data, and (3) function approximation with pre-trained language models.

Pseudo-State Padding
As mentioned in Section 2, existing model-free behavior-agnostic OPE methods cannot handle varying horizon lengths in conversations under the sparse reward setting. To address this issue, we design a special padding scheme, so that the policy value can be estimated by OPE methods from the resulting padded MDP. We first pad conversation sequences with pseudo states, which leads to a padded MDP with a fixed horizon length T_max. We then convert this fixed-horizon MDP into an infinite-horizon one by augmentation, i.e., we repeatedly concatenate the ending state of the fixed-horizon MDP to its initial state. More specifically, the policy takes a deterministic action at all pseudo states, i.e., π(a = NextPad | s = Pad_k) = 1. The transition kernel of the new process can be defined as

P_A(Pad_1 | s_T, a_T) = 1, P_A(Pad_{k+1} | Pad_k, NextPad) = 1 for k < T_max − T, and P_A(s_1 | Pad_{T_max−T}, NextPad) = µ_0(s_1).

This new process is still a valid MDP, as its transition kernel satisfies the Markov property. For notational simplicity, we refer to this new process as "the augmented MDP".
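A minimal sketch of the padding step (the names and `T_MAX` value are illustrative, not from the paper's code):

```python
T_MAX = 4  # fixed horizon after padding (illustrative value)

def pad_dialog(turns):
    """Pad a dialog's (state, action) turns with pseudo states up to T_MAX.
    At every pseudo state Pad_k the policy deterministically takes NextPad,
    so padding adds no randomness and preserves the policy value."""
    padded = list(turns)
    k = 1
    while len(padded) < T_MAX:
        padded.append((f"Pad_{k}", "NextPad"))
        k += 1
    return padded
```

Concatenating such padded episodes end-to-start, with the last pseudo state transitioning to a fresh initial state s_1 ∼ µ_0, yields the infinite-horizon augmented MDP.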
Accordingly, the policy value of π for the augmented MDP can be defined as the average per-step reward

ρ_A(π) := lim_{N→∞} (1/N) E[Σ_{t=1}^{N} r_t], (2)

where the expectation is over padded conversations h_i sampled from interactions between π and E. Since there is only one non-zero reward for every T_max steps, rewards in the augmented MDP are also sparse. We remark that the augmented MDP has a unique stationary distribution d^π(s, a). For the state-action pair (s_t, a_t) in a conversation h with padded pseudo states, we have

d^π(s_t, a_t) = (1/T_max) [µ_0(s_1)π(a_1|s_1)P(s_2|a_1, s_1) · · · P(s_t|a_{t-1}, s_{t-1})π(a_t|s_t)], (3)

where {(s_k, a_k)}_{k=1}^{t-1} are the state-action pairs in the same conversation as (s_t, a_t).

Moreover, the policy value of π under the augmented MDP is proportional to its counterpart under the original MDP without augmentation:

ρ_A(π) = E_{(s,a)∼d^π}[R(s, a)] = ρ(π)/T_max. (4)

Due to the space limit, we defer the details and proof to Appendix A.1.

Remark 1. Some OPE methods, e.g., LSTDQ (Lagoudakis and Parr, 2003), can handle fixed horizons, so the fixed-horizon padding alone would suffice for them. DICE estimators (Nachum et al., 2019a), on the other hand, can only handle infinite horizons, so the infinite-horizon augmentation is necessary.

Remark 2. In practice, we do not actually need to concatenate infinitely many conversations to compute ρ_A(π). As suggested by (4), ρ_A(π) can be computed based on d^π(s_t, a_t) defined in (3), which is a product of only finitely many terms.
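The proportionality between the two policy values can be checked numerically on a toy example with two dialogs of different lengths (a Monte Carlo sketch under the sparse reward setting; all numbers here are illustrative):

```python
import random

def simulate_augmented(dialogs, probs, t_max, n_episodes=20000, seed=0):
    """Average per-step reward in the augmented MDP: each sampled dialog is
    padded to exactly t_max steps, with its single sparse reward at the end."""
    rng = random.Random(seed)
    total_reward, total_steps = 0.0, 0
    for _ in range(n_episodes):
        dialog_len, reward = rng.choices(dialogs, weights=probs)[0]
        total_reward += reward  # one sparse reward per episode
        total_steps += t_max    # padding makes every episode t_max steps long,
                                # regardless of dialog_len
    return total_reward / total_steps

# Two dialogs (length, reward) sampled with probability 0.2 / 0.8.
rho = 0.2 * 0.0 + 0.8 * 1.0  # value of the original (per-episode) MDP
rho_a = simulate_augmented([(3, 0.0), (5, 1.0)], [0.2, 0.8], t_max=8)
```

Because every padded episode occupies exactly T_max steps, the per-step value ρ_A is the per-episode value ρ divided by T_max, regardless of the original dialog lengths.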

Model-Free Behavior-Agnostic DICE Estimator
With the proposed augmentation, we obtain an infinite-horizon MDP from which the policy value of the original MDP can be recovered. We then apply DICE (Nachum et al., 2019a; Yang et al., 2020) to estimate ρ_A(π) based on pre-collected experience data D = {(s, a, r, s′)_i}_{i=1}^N without interacting with E (i.e., a human), where (s, a) ∼ d^D are samples from some unknown distribution d^D. We slightly abuse notation and use (s, a, r, s′) ∼ d^D as a shorthand for (s, a) ∼ d^D, r = R(s, a), s′ ∼ P(·|s, a), which simulates sampling from the dataset D.
DICE is a model-free policy evaluation method (it does not explicitly model E) and does not require knowledge of the behavior policies that generated the experience data, which provides a more reliable estimate of ρ_A(π) than other OPE methods. Specifically, DICE decomposes ρ_A(π) into

ρ_A(π) = E_{(s,a,r)∼d^D}[ζ(s, a) · r], (5)

where ζ(s, a) := d^π(s, a)/d^D(s, a) is the distribution correction ratio. DICE then estimates ζ by solving a regularized minimax optimization problem of the form

min_ζ max_{ν,λ} E_{(s,a,s′)∼d^D, a′∼π(·|s′)}[ζ(s, a)(ν(s′, a′) − ν(s, a))] + λ(E_{(s,a)∼d^D}[ζ(s, a)] − 1) + α_ζ E_{(s,a)∼d^D}[f(ζ(s, a))], (6)

where ν(s, a)'s are auxiliary variables, f is a convex regularizer (e.g., f(x) = x²), and α_ζ is a tuning parameter. Due to the space limit, we omit the details of deriving the DICE estimator; please refer to Yang et al. (2020) for more details.
• Post-Normalization. Note that (6) handles the constraint E_{(s,a)∼d^D}[ζ(s, a)] = 1 through the Lagrange multiplier λ, which cannot guarantee that the constraint is exactly satisfied when solving (6) with alternating SGD-type algorithms (Dai et al., 2017; Chen et al., 2018). To address this issue, we propose a post-normalization step that explicitly enforces the constraint:

ρ̂_A(π) = Σ_{i=1}^N ζ(s_i, a_i) r_i / Σ_{i=1}^N ζ(s_i, a_i). (7)

As we will see in our experiments in Section 4, the post-normalization step is crucial for DICE to attain good estimation accuracy in practice; without post-normalization, we observe potential divergence of the policy value estimates.
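The post-normalized estimate is simply a self-normalized weighted average of the logged rewards; a minimal sketch:

```python
def dice_estimate(weights, rewards, post_normalize=True):
    """Weighted-reward estimate of the policy value, rho = E_dD[zeta * r].
    Post-normalization rescales the learned ratios so that their empirical
    mean is exactly 1, which the Lagrangian term alone cannot guarantee."""
    if post_normalize:
        mean_w = sum(weights) / len(weights)
        weights = [w / mean_w for w in weights]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)
```

For instance, if the learned ratios come out uniformly twice too large, the unnormalized estimate is off by a factor of two, while the post-normalized one is unaffected.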
• Why do we prefer DICE? Deep Q-learning and its variants are another popular model-free, behavior-agnostic approach to off-policy evaluation. However, due to the sparse rewards in dialogs, fitting the state-action value function (i.e., the Q-function) in deep Q-learning is notoriously difficult (Mataric, 1994). We observe in Section 4 that deep Q-learning is computationally unstable. In contrast, DICE only needs to estimate the density correction ratio ζ, which is decoupled from the rewards associated with the policy value, as shown in (6). This significantly alleviates the computational challenge incurred by sparse rewards. Moreover, DICE also applies post-normalization, additional regularization (i.e., E_{(s,a)∼d^D}[f(ζ(s, a))]), and constraints on ζ (i.e., ζ ≥ 0 and E_{(s,a)∼d^D}[ζ(s, a)] = 1), all of which further stabilize training. These features allow DICE to achieve better estimation performance than deep Q-learning in dialog system evaluation.
Recent progress in OPE based on density ratio estimation is remarkable (Liu et al., 2018; Nachum et al., 2019a; Xie et al., 2019; Uehara et al., 2019); however, there exists a statistical limit in off-policy evaluation. Specifically, the Cramer-Rao lower bound on the MSE established in Jiang and Li (2016) is proportional to the square of the density ratio. This implies that we can obtain an accurate estimate of the policy value only if the ratio ζ is not too large. While ratio-based minimax algorithms can achieve this lower bound (Kallus and Uehara, 2019; Yin and Wang, 2020; Ren et al., 2021), even better estimation results can be obtained when the behavior and target policies are more similar. We thus introduce an experience data collection protocol in Section 4.1 that satisfies the bounded-ratio requirement and ensures the success of OPE methods.

Function Approximation with RoBERTa
Despite the apparent advantages of DICE estimators, directly training DICE from scratch falls short because the bounded-ratio requirement is quickly violated in the large combinatorial state-action space of dialog.
We alleviate this issue by using reliable representations of pre-trained language models (Devlin et al., 2018). By virtue of the huge amounts of pretraining data and the massive model size, the pretrained models can effectively capture rich semantic and syntactic information of natural language (rather than enumerating the original combinatorial language space).
In particular, we transfer the knowledge from RoBERTa (Liu et al., 2019) to dialog evaluation, and parameterize ζ and ν as follows: we keep the pre-trained RoBERTa encoder layer and replace the original masked language modeling head with a two-layer fully connected network with a scalar output. For simplicity, we denote the corresponding parametric forms of ζ and ν in (6) as RoBERTa-ζ and RoBERTa-ν, respectively. Note that we only need RoBERTa-ζ and RoBERTa-ν to share the same encoder. We then use RoBERTa-ζ and RoBERTa-ν as the initial solution to solve (6), which is also known as fine-tuning (Devlin et al., 2018).
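A schematic of this parameterization, with a toy stand-in for the encoder (pure Python; in the actual system the encoder is RoBERTa, and `HIDDEN` is not the real hidden size):

```python
import random

HIDDEN = 8  # stand-in for the encoder hidden size (illustrative)

def encoder(token_ids):
    """Toy stand-in for the shared pre-trained encoder: deterministically
    maps a (state, action) token sequence to a fixed-size representation."""
    rng = random.Random(sum((i + 1) * t for i, t in enumerate(token_ids)))
    return [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]

class ScalarHead:
    """Two-layer fully connected head with a scalar output, replacing the
    masked-language-modeling head; one instance each for zeta and nu."""
    def __init__(self, seed):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

    def __call__(self, rep):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, rep))) for row in self.w1]
        return sum(w * h for w, h in zip(self.w2, hidden))

zeta_head, nu_head = ScalarHead(seed=1), ScalarHead(seed=2)  # heads differ
rep = encoder([101, 2054, 102])  # the encoder itself is shared by both heads
zeta_value, nu_value = zeta_head(rep), nu_head(rep)
```

The point of the sketch is the sharing pattern: one encoder producing a common representation, with two small scalar-output heads on top for ζ and ν.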
With a properly designed mask, the self-attention mechanism in the bidirectional transformer architecture allows us to efficiently compute ζ(s, a) and ν(s, a) for all state-action pairs in the same dialog simultaneously. Due to the space limit, we defer the mask design details to Appendix A.2.

Summary
We summarize ENIGMA in Algorithm 1. Due to the space limit, we only present ENIGMA using SGD with batch-size 1 here. We defer the details of ENIGMA with mini-batch SGD to Appendix A.3 (Algorithm 2).

Experiments
We empirically evaluate ENIGMA on two dialog datasets: AirDialog (Wei et al., 2018) for goal-oriented tasks and ConvAI2 (Dinan et al., 2020) for open-domain chit-chat. See details of the experimental setup in Appendix B.

Policy Training Data and Experience Data
As mentioned in Section 3.2, there exists an information-theoretic limit for all off-policy evaluation methods: no method can perform well when the state-action density ratio between the target and behavior policy is too large. To avoid such a circumstance, we need to ensure that the experience data collected by a behavior policy do not deviate too much from data induced by the target policy. Unfortunately, both datasets used in our experiments do not satisfy such a requirement. AirDialog, for example, consists of dialog between humans, which are near-perfect golden samples, as human agents almost always successfully book tickets for customers. Dialog system agents, on the other hand, have many failure modes (i.e., the target policy/agent does not book the correct ticket for a human customer). Hence, directly using human dialog as the behavior data to evaluate dialog agents is subject to limitations.
In order to properly evaluate an imperfect target policy in the presence of the information theoretic limit, we refer to Lowe et al. (2017); Ghandeharioun et al. (2019), and collect experience data using behavior policies similar to the target policy. To avoid confusion, we call data collected by the behavior policy "experience data" and data used to train an agent "policy training data". More details are elaborated below for each dataset.
It is worth noting that existing work on dialog system evaluation also enforces similar requirements. For example, Lowe et al. (2017) show a higher Pearson correlation coefficient (0.37) between automatic metrics and human ratings when the behavior policies contain the target policy. When the target policy is excluded from the behavior policies, however, the correlation is only 0.13, even lower than the meaningless correlation between dialog lengths and human ratings (0.27). Another example is Ghandeharioun et al. (2019), where the studied agents are similar to each other in their hierarchical architectures, hyperparameters, and training data.

We release our source code for the ENIGMA algorithm at https://github.com/google-research/google-research/tree/master/dialogue_ope/airdialogue_ope and our dataset at https://github.com/HMJiangGatech/dialogue_ope_data.

Goal-Oriented Systems
We first test ENIGMA for evaluating goal-oriented dialog systems on a flight ticket booking task.
• Policy Training Data. We use the AirDialog dataset for policy training (Wei et al., 2018). It contains 402,038 pieces of dialog from human sellers and human customers collaborating on buying flight tickets. We use different proportions of the dataset and different hyperparameters to train 24 seller agents using behavioral cloning (see Appendix C for details).

• Experience Data. We invite 20 people to evaluate the 24 seller agents. Specifically, each of the 20 human customers interacts with a seller agent 5 times to generate 100 pieces of dialog, and gives each piece an evaluation score between 0 and 1. The final score an agent receives is the average of the 100 scores. We consider three types of scores: flight score, status score, and the overall reward used in Wei et al. (2018).
We evaluate ENIGMA, BLEU/PPL (Papineni et al., 2002), and Self-Play Evaluation (SPE) based on the correlation between estimated and true rewards. The results are summarized in Table 1. ENIGMA uses the experience data of the other 23 agents to evaluate each agent (i.e., leave-one-bot-out). Note that SPE (Wei et al., 2018) needs to train a customer agent in addition to the seller agent being evaluated. For a fair comparison, we train the SPE customer agent on both the experience data and the policy training data (see Appendix C for details). Our empirical observations are as follows:

• ENIGMA vs. BLEU/PPL. ENIGMA significantly outperforms BLEU/PPL. As mentioned earlier, BLEU and PPL are well-known metrics for evaluating language quality. For goal-oriented systems, whose goal is to complete a specific task, however, BLEU and PPL scores show little correlation with task completion scores.

[Table 1: The correlation between the two metrics. Each column is a task completion score obtained by interacting with human customers ("Selected Agents" denotes only evaluating agents with reasonably good performance).]
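The reported numbers are correlations of this kind; for reference, the Pearson coefficient between estimated and true rewards across agents can be computed as:

```python
def pearson(xs, ys):
    """Pearson correlation between estimated rewards xs and true rewards ys,
    one value per evaluated agent."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Spearman's rank correlation, also reported in the experiments, is the same quantity computed on the ranks of the scores rather than the scores themselves.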
• ENIGMA vs. SPE. ENIGMA significantly outperforms SPE. To better understand their performance, we also present the regression plots between estimated and true rewards in Figure 1. Both ENIGMA and SPE can easily identify agents with extremely poor rewards. However, for certain good agents whose flight score, status score, and overall reward are better than 0.5, 0.7, and 0.65, respectively, SPE falls short of ENIGMA by a much larger margin (especially for the flight score). Additional regression plots are shown in Appendix D.1.
• Ablation Study. We select 2 out of the 24 agents to illustrate the importance of each component in ENIGMA.
DICE vs. LSTDQ. Figures 2(a) and 2(b) show the estimated values of LSTDQ (only fitting the Q-function) and DICE, respectively: the estimates of LSTDQ are stuck at 0, whereas the estimates of DICE approach the true rewards (dotted lines) as training progresses. Figure 3 additionally shows that the training objective of LSTDQ oscillates while DICE stably converges.
Post-normalization. Figure 2(c) shows the performance of ENIGMA without post-normalization: The algorithm fails to estimate the true rewards.
Pretrained Encoder. Figure 2(d) shows the performance of ENIGMA without the pretrained encoder: The estimated values can approach the true rewards, but are less stable and less accurate than the counterpart with the pretrained encoder.

Open-Domain Chit-chat Systems
We now test ENIGMA for evaluating open-domain chit-chat dialog systems. The results are summarized in Table 2. Moreover, we consider using 8 HCDFs (hand-crafted dialog features) to fit the true rewards using linear regression, and these results are also included in Table 2.
Note that since we are considering a chit-chat dialog system, SPE does not train an additional agent but asks two identical target agents to chat with each other. However, SPE needs to train an additional model to predict the reward of each dialog. Specifically, we fine-tune the pre-trained RoBERTa encoder with an output layer over the experience data (an additional sigmoid function is applied to ensure an output between 0 and 1). For automatic evaluation of each agent using ENIGMA, we use the experience data of the other 28 agents (i.e., leave-one-bot-out).
• ENIGMA vs. Baselines. ENIGMA significantly outperforms SPE, BLEU, BLEURT, BERTscore, and HCDFs in both Pearson and Spearman's rank correlations. Moreover, we compare the correlations between estimated rewards and human evaluation scores under each language quality metric. Due to the space limit, we only show the plots of ENIGMA and SPE under 3 out of 10 language quality metrics in Figure 4. Additional plots and detailed results can be found in Appendix D.3. We see that ENIGMA outperforms SPE and HCDFs under all language quality metrics.

[Figure 2: Value estimation using different methods for two target agents (π_1 and π_2) vs. the number of iterations. Dotted lines denote the true rewards.]
• Sample Efficiency of ENIGMA. To demonstrate that ENIGMA is sample efficient, we test ENIGMA on randomly sub-sampled (10% and 50%) experience data. We find that even when using only 10% of the experience data, ENIGMA still outperforms SPE and HCDFs.
• Evaluation under Challenging Experience Data. To make the evaluation more challenging, we further test ENIGMA by excluding the experience data obtained by behavior policies similar to the target policy (see more details in Appendix D.3). We see that even with such challenging experience data, ENIGMA still outperforms SPE trained on the full data and HCDFs under almost all language quality metrics.

Discussions and Conclusion
Existing research on automatic evaluation of dialog systems can be categorized into static and dynamic evaluation. Most existing research falls into static evaluation, focusing on the language quality of single-turn responses or on task completion given a fixed dialog. Little of the literature emphasizes the dynamic properties of an interactive environment and considers the sequential interaction between a human and an agent, which is more challenging. We note that in both static and dynamic evaluation, the algorithms rely on the assumption of sufficient data coverage (explicitly or implicitly) to ensure reliable evaluation. For example, in static evaluation, the BLEU score requires all reasonably good responses to be exactly covered by the experience data. More recently, Lowe et al. (2017) show that their method only works when the behavior policies include the target policy. Dynamic evaluation also assumes sufficient coverage. We emphasize that this is the information-theoretic limit of all OPE methods (Jiang and Li, 2016), which requires the experience data to cover sufficient target policy behaviors to ensure accurate estimation. We therefore encourage the broader research community to release human-model interactive evaluation data to further promote research in automatic dialog system evaluation.
In this paper, we develop a model-free dynamic evaluation framework, ENIGMA, which adopts the current state-of-the-art OPE methods in reinforcement learning. Different from existing single-turn language quality metrics and model-based reinforcement learning methods, ENIGMA naturally takes into consideration the interactive and dynamic nature of conversations, while avoiding the difficulty of modeling complex human conversational behaviors. Our thorough experimental results demonstrate that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores. One potential future direction is to extend ENIGMA from off-policy evaluation to off-policy improvement, which aims to learn a dialog system from experience data (Nachum et al., 2019b; Kallus and Uehara, 2020).

Broader Impact
This paper proposes ENIGMA, a model-free dynamic evaluation framework for dialog systems. We demonstrate that the ENIGMA framework can be used for both goal-oriented systems and chit-chat systems. For the AirDialog dataset, we collect experience data (human-model conversations), which does not contain any personal or sensitive information (see Figure 9, Appendix C). In all other experiments, we use publicly available data. We build our algorithms using public code bases and do not find any ethical concerns.


A.1 Pseudo-State Padding: Details and Proof

Theorem 1. The augmented MDP with infinite horizon satisfies the following properties:
• It has a unique stationary state-action visitation distribution d^π(s, a);
• For the state-action pair (s_t, a_t) in a conversation h with padded pseudo states, we have

d^π(s_t, a_t) = (1/T_max) [µ_0(s_1)π(a_1|s_1)P(s_2|a_1, s_1) · · · P(s_t|a_{t-1}, s_{t-1})π(a_t|s_t)], (8)

where {(s_k, a_k)}_{k=1}^{t-1} are the state-action pairs in the same conversation as (s_t, a_t);
• The policy value can be computed by sampling from d^π(s, a), and we have

ρ_A(π) = E_{(s,a)∼d^π}[R(s, a)] = ρ(π)/T_max. (9)

Proof. First, we prove that the augmented MDP has the unique stationary state-action visitation distribution shown in (8).
As the augmented MDP is periodic with period T_max, the uniqueness of the stationary distribution cannot be immediately obtained from ergodicity of the MDP (the first two points of the theorem).
To obtain the stationary state-action visitation distribution, we essentially need to solve the following equations:

d^π(s′, a′) = Σ_{(s,a)} d^π(s, a) P_A(s′|s, a) π(a′|s′) for all (s′, a′),

where P_A is the transition kernel of the augmented MDP and d^π(s, a) is a probability measure on the state-action space, i.e., Σ_{(s,a)} d^π(s, a) = 1. We first group the state-action pairs by their dialog turn t. More specifically, we define S_t := {s_t : s_t contains t dialog turns}, A_t := {a_t : a_t is the response at the t-th dialog turn}, and Q_t := S_t × A_t. The state space is then the direct sum of the state groups, S_0 ⊕ S_1 ⊕ · · · ⊕ S_{T_max} = S, and the action space is the union of all action groups, ∪_{t=1}^{T_max} A_t = A. We further have Q_0 ⊕ Q_1 ⊕ · · · ⊕ Q_{T_max} = S × A = Q. Notice that t is the number of dialog turns in the original MDP, not the time step of the augmented MDP.
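Although powers of a periodic transition matrix do not converge, the stationary distribution still exists and can be obtained as a time average over one period; a toy check on a deterministic 3-cycle (standing in for a T_max = 3 augmented MDP with a single dialog; pure Python, illustrative):

```python
def matvec(d, P):
    """One step of the balance equation, d' = d P, for a row-stochastic
    matrix P given as nested lists."""
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

def cesaro_stationary(P, d0, period):
    """Average the occupancy over one full period: for a periodic chain the
    time average is stationary even though d0 P^k itself keeps cycling."""
    d, acc = list(d0), [0.0] * len(d0)
    for _ in range(period):
        acc = [a + x / period for a, x in zip(acc, d)]
        d = matvec(d, P)
    return acc

# Deterministic 3-cycle: state 0 -> 1 -> 2 -> 0.
P = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
d = cesaro_stationary(P, [1.0, 0.0, 0.0], period=3)
```

The resulting d is uniform over the cycle and satisfies d = dP exactly, mirroring the 1/T_max factor in (8).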
So far, we have shown that the augmented MDP has the unique stationary state-action visitation distribution shown in (8) (the first two points of the theorem).
Next, we show that the policy value of the policy π under the augmented MDP is proportional to its counterpart under the original MDP without the augmentation (the third point of the Theorem).
Recall that the expected reward of the original MDP (1) is defined over trajectories drawn from μ_0(s_1) π(a_1|s_1) P(s_2|a_1, s_1) ··· P(s_T|a_{T-1}, s_{T-1}) π(a_T|s_T), where T is the number of turns in the original dialog before padding. Recall also that the MDP only obtains a non-zero reward when the dialog ends (i.e., when a_T = End Conversation). On the other hand, due to the existence of the unique stationary distribution, the policy value of π for the augmented MDP (2) can be written as an expectation over d^π(s, a), whose mass on (s_t, a_t) is proportional to μ_0(s_1) π(a_1|s_1) P(s_2|a_1, s_1) ··· P(s_t|a_{t-1}, s_{t-1}) π(a_t|s_t).

Can we directly apply the infinite-horizon augmentation without padding? The answer is NO. Here we use an example to illustrate the difference between ρ_A and ρ, and why we need to pad every dialog to the same length before using OPE.

Example 1. Suppose we have two experience dialogs a_0 → ··· → a_{t_1} and b_0 → ··· → b_{t_2} with rewards 0 and 1, respectively. Under the target policy, the dialogs have per-episode densities 0.2 and 0.8, respectively. The true value of the policy is 0 × 0.2 + 1 × 0.8 = 0.8. The corresponding per-state density of [a_0, ···, a_{t_1}] is 0.2 / (0.2 × t_1 + 0.8 × t_2), and that of [b_0, ···, b_{t_2}] is 0.8 / (0.2 × t_1 + 0.8 × t_2). The value in the new augmented MDP is (0.2 × 0 + 0.8 × 1) / (0.2 × t_1 + 0.8 × t_2), which depends on the dialog lengths and cannot be directly converted into the policy value of the original MDP.
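The arithmetic in Example 1 can be checked numerically. The dialog lengths t1, t2 and the horizon T_max below are illustrative choices, not values from the paper:

```python
# Numeric sanity check of Example 1 (illustrative lengths).
t1, t2 = 3, 5            # turns in dialogs a and b
p_a, p_b = 0.2, 0.8      # per-episode densities under the target policy
r_a, r_b = 0.0, 1.0      # episode rewards

# True policy value: expectation of the episode reward.
true_value = p_a * r_a + p_b * r_b          # = 0.8

# Without padding, per-state densities are normalized by the expected
# number of states, so the value depends on the dialog lengths:
norm = p_a * t1 + p_b * t2
naive_value = (p_a * r_a + p_b * r_b) / norm

# With every dialog padded to T_max turns, the normalization is T_max
# regardless of the original lengths, so the true value is recovered
# up to the known constant T_max.
T_max = 8
padded_value = (p_a * r_a + p_b * r_b) / T_max
assert abs(padded_value * T_max - true_value) < 1e-12
```

Changing t1 or t2 changes `naive_value` but leaves `padded_value * T_max` fixed at the true value, which is exactly why the padding is required.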

A.2 Function Approximation with Pre-Trained Language Models
We can compute all state-action pairs for the same dialog in parallel, as shown in Figure 6. The input to the RoBERTa encoder consists of three parts: word tokens, position ids, and token types.
Notation: an experience dialog h = {e_0, a_1, e_1, ..., a_T}, and the corresponding responses generated by the target policy π, {ā_t = π(s_t)}_{t=1}^{T}.

Word Tokens. The input is the concatenation of responses {e_0, a_1, ā_1, e_1, ..., e_{T-1}, a_T, ā_T}.

Position Ids. The position ids are calculated separately for each response. For e_i, the position ids run from l_{2i} = Σ_{j<i} len(e_j) + Σ_{j≤i} len(a_j) to l_{2i+1} = l_{2i} + len(e_i), where len(·) denotes the number of tokens in a given response. For a_i, the position ids run from l_{2i-1} to l_{2i} = l_{2i-1} + len(a_i). For ā_i, the position ids run from l_{2i-1} to l_{2i-1} + len(ā_i), i.e., ā_i shares its start position with a_i.
Token Types. For the e_i's, the token type is 0, which denotes human responses. For the a_i's and ā_i's, the token type is 1, which denotes agent responses.
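The position-id scheme above can be sketched as follows. This is a minimal illustration assuming only the response lengths are known; the function name and the dictionary output are illustrative, and `abar` stands for ā:

```python
def response_position_ranges(e_lens, a_lens, abar_lens):
    """Position-id ranges (start, end) for the interleaved input
    [e_0, a_1, abar_1, e_1, ..., e_{T-1}, a_T, abar_T].

    abar_i (the target-policy response) shares its start position
    l_{2i-1} with the logged response a_i, since both continue the
    same dialog state.
    """
    T = len(a_lens)
    assert len(e_lens) == T and len(abar_lens) == T
    ranges = {}
    l = 0                                  # running boundary l_{2i}
    for i in range(T):
        ranges[f"e_{i}"] = (l, l + e_lens[i])
        l += e_lens[i]                     # l_{2i+1}
        ranges[f"a_{i+1}"] = (l, l + a_lens[i])
        ranges[f"abar_{i+1}"] = (l, l + abar_lens[i])
        l += a_lens[i]                     # l_{2i+2}
    return ranges
```

For example, with len(e_0) = 2 and len(a_1) = 4, both a_1 and ā_1 start at position 2, matching the description above.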

A.3 ENIGMA with regularized DICE
[Algorithm listing (regularized DICE), garbled in extraction. Recoverable structure: Step 1 pads every experience dialog i from its original length T_i to T_max turns, looping over the real turns t = 1, ..., T_i and the padded pseudo turns t = T_i + 1, ..., T_max; Step 2 solves the min-max optimization with function approximators.]
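The padding step of the algorithm (looping over real turns t = 1, ..., T_i and padded turns t = T_i + 1, ..., T_max) can be sketched as follows; the padding tokens and function name are illustrative, not from the paper:

```python
PAD_STATE, PAD_ACTION = "<pad_state>", "<pad_action>"  # illustrative tokens

def pad_dialog(states, actions, t_max):
    """Pad a dialog with T_i real turns with pseudo state-action pairs
    for the remaining turns, so every trajectory in the experience data
    has exactly t_max turns, as required by the augmented MDP."""
    t_i = len(actions)
    assert len(states) == t_i and t_i <= t_max
    states = list(states) + [PAD_STATE] * (t_max - t_i)
    actions = list(actions) + [PAD_ACTION] * (t_max - t_i)
    return states, actions
```

After this preprocessing, all trajectories have identical length, so the stationary-distribution argument of Theorem 1 applies.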

B Experiment Set-Up
In the following experiments, we share the RoBERTa encoder between RoBERTa-ζ and RoBERTa-ν. On top of RoBERTa-ζ and RoBERTa-ν, we place a two-layer fully connected neural network with GeLU activation (Hendrycks and Gimpel, 2016) and the same hidden dimension as RoBERTa. The RoBERTa encoder is initialized from the RoBERTa-base checkpoint (Liu et al., 2019). We simply use gradient reversal for the mini-max updates. We set the learning rate to 2 × 10^{-4} and use inverse square root learning rate decay. We clip gradients to a maximum norm of ‖·‖_2 ≤ 10. We use a 100× larger learning rate for optimizing λ, and a 2× larger learning rate for RoBERTa-ν. In (6)
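The optimizer settings above (base learning rate 2 × 10^{-4}, inverse square root decay, gradient-norm clipping at 10) can be sketched as below. The warmup length is an assumption for illustration; the text does not specify one:

```python
import math

BASE_LR = 2e-4           # learning rate stated in the text
MAX_NORM = 10.0          # gradient-norm clipping threshold from the text
WARMUP = 1000            # warmup length is an assumption, not from the text

def inv_sqrt_lr(step, base_lr=BASE_LR, warmup=WARMUP):
    """Inverse square root decay: linear warmup to base_lr, then
    lr(step) = base_lr * sqrt(warmup / step)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)

def clip_by_global_norm(grads, max_norm=MAX_NORM):
    """Rescale a flat list of gradient values so their global L2 norm
    is at most max_norm (no-op when already within the bound)."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]
```

The group-specific learning rates (100× for λ, 2× for RoBERTa-ν) would then be applied by scaling `inv_sqrt_lr(step)` per parameter group.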

C Transformer-Based Agents for AirDialog
Seller Agent Transformer Architecture. The encoder has four components: a ticket encoder, a reservation encoder, a dialog encoder, and task-specific heads (an intent classification head and a name classification head). All tickets and reservations are converted to natural language. Note that we always append a pseudo ticket to the ticket database representing the "no ticket found" situation. The architecture is illustrated in Figure 7.

Customer Agent Transformer Architecture. The encoder has two components: an intent encoder and a reservation encoder. All intents are converted to natural language. The architecture is illustrated in Figure 8.
Training Objective. Besides the language generation loss L_l, the training objective for the seller consists of three additional parts: a name loss, a flight loss, and an intent loss. The customer agent is trained with the standard language generation loss.
Benchmark. We compare the proposed model with the current AirDialog RNN baseline (Wei et al., 2018). As can be seen, the agents used in this paper are significantly stronger than the baseline agent of Wei et al. (2018).

Hyper-Parameters. For training the 24 seller agents used in Section 4, we vary the size of the training data (the number of training dialogs) from 5K to the full size, and vary λ_i and λ_f from 0.0001 to 1. For training the customer agent used in self-play evaluation, we use the full training data and tune the hyperparameters based on the BLEU score evaluated on the validation set.
Human Evaluation. The human evaluation is collected from 20 different Ph.D. students majoring in Math/Stats/CS/IEOR. We provide detailed guidelines instructing the human evaluators to speak to the agents in a consistent tone. Figure 9 presents screenshots of the human evaluation software.
We first provide the context to the human evaluator. For ease of use and consistency of the human evaluation, we have prepared several response templates, though the human evaluators are also allowed to use their own words. After the conversation ends, we show the details of the agent's decision (ticket booked/cancellation), as well as the task completion scores (flight/name/status/reward score).

D.1 Additional Results for AirDialog

Regression Plot. We present the regression plot for the full setting in Figure 10 and for the selected agents in Figure 11.

Training Curves. We show the training curves of ENIGMA in Figure 12. Four models are presented: the best model (ranked 100%), the models ranked at 50% and 25%, and the worst model (ranked 0%). As can be seen, the estimated reward converges steadily to its true value.
Ablation Study. Here we provide larger versions of the figures (Figure 13 and Figure 14) for the ablation study mentioned in Section 4.

D.2 Additional Results for Rule-Based Agents of AirDialog

• Rule-Rule (R-R): Both the customer and seller agents are rule-based. We fix the rule-based customer model and construct and evaluate 6 seller agents. The strongest agent can perfectly interpret the intent of rule-based customers, while the weaker agents interpret the intent with different levels of noise. The learning curve is presented in Figure 15.

Figure 12: Learning curves for AirDialog. The x-axis is the number of mini-max updates; the y-axis is the estimated value. The straight line is the true reward, and the shaded region denotes the 90% confidence interval. The true reward and confidence interval are obtained via evaluation chats between the agents and the environment (model/human). Different colors denote different agents.

Figure 13: Reward estimation of two target agents (π_1 and π_2) vs. the number of iterations. Dotted lines represent true rewards.

D.3 ConvAI2
Training Curves. Similar to the AirDialog dataset, we show the training curves for the agents ranked at 100%, 50%, 25%, and 0% in Figure 16. ENIGMA also converges steadily to the true values within a reasonable error.

Regression Plot. We present the regression plots for all 10 metrics in Figure 17. The corresponding correlations are presented in Table 5. For comparison, we present the regression plot for self-play evaluation in Figure 18.

Experience Data. To analyze how many human-model evaluation dialogs are needed, we study the ENIGMA error under different sizes of the experience data. For ConvAI2, we compare the error when using 100%, 50%, and 10% of the data. As shown in Table 5 and Figure 19, when we use half of the data, the error is similar to that with the full data. If we use only 10% of the data, ENIGMA becomes very inaccurate; OPE under the low-resource setting remains very challenging.
In Figure 19, we study the estimation error under different sizes of the experience data. As can be seen, when using 50% of the data, the reward estimation is very similar to that with the full data. When using only 10% of the data, the error is larger and ENIGMA has a lower correlation with the true reward.
A More Challenging Setting. Since some target agents are similar to the behavior policies, differing only slightly in the decoding method, they might yield very similar dialogs when the human acts in the same way. Specifically, in the data collection process, the target model might yield responses that are very similar to those of the behavior policy for all turns in the dialog: EditDistance(a_t, ā_t) ≤ 15 for all 0 ≤ t ≤ T. For a more realistic setting, we consider removing these highly overlapped dialogs after the data collection process. This setting is very challenging in that the target policy's behavior is less covered by the experience data, and ENIGMA can only hope to generalize via the pre-trained RoBERTa. The results are shown in Figure 20 and Table 5. As can be seen, this setting remains challenging, as the Pearson correlation is between 0.5 and 0.8. For comparison, we present the regression plot for self-play evaluation using this challenging subset of the experience data in Figure 21. We remark that such experiments could also be done for AirDialog. However, because most agents learn template-like responses due to the goal-oriented nature of the task, removing overlapped dialogs results in an extremely incomplete experience dataset. For example, most "cancellation" dialogs would be removed, since they are very simple and essentially the same across different agents. As a result, ENIGMA cannot make a reasonable estimation due to the highly incomplete experience data. Figure 22 compares the error of ENIGMA between the normal experience data and the selected challenging one. As can be seen, the error on the selected data is larger, particularly for agents with exceptionally low/high true reward. This indicates that the lack of dialog coverage is exaggerated under the challenging setting, while the ENIGMA estimation remains accurate when there is sufficient dialog coverage.
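The overlap filter described above (drop a dialog when every target-policy response stays within 15 edits of the logged response) can be sketched as follows. The function names and the dictionary layout of a dialog are illustrative; the paper does not specify whether the edit distance is character- or token-level, so a character-level Levenshtein distance is assumed here:

```python
def levenshtein(s, t):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def is_highly_overlapped(logged, generated, threshold=15):
    """True if every target-policy response stays within `threshold`
    edits of the corresponding logged behavior response."""
    return all(levenshtein(a, abar) <= threshold
               for a, abar in zip(logged, generated))

def select_challenging(dialogs, threshold=15):
    """Keep only dialogs where the target policy visibly deviates."""
    return [d for d in dialogs
            if not is_highly_overlapped(d["logged"], d["generated"], threshold)]
```

Filtering with `select_challenging` yields the reduced experience data used for the more challenging setting.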
Figure 22: ENIGMA error comparison between the normal and the selected challenging experience data on ConvAI2. The x-axis is the true average reward; the y-axis is the ENIGMA error. The solid line is a fitted quadratic function. The histogram is the empirical distribution of the rewards of all the experience data. Orange represents the challenging dataset, and blue represents the normal dataset.

Comparison to Automatic Hand-crafted Metrics.
We compare ENIGMA with the automatic hand-crafted metrics proposed in See et al. (2019). For a more intuitive comparison, we use a heat map and a box plot to visualize the correlations between the automatic evaluation metrics and the human evaluation metrics. As can be seen in Figure 23 and Figure 24, most hand-crafted metrics have relatively low correlation with the human evaluation metrics. The only exception is the "question marks" automatic metric for the inquisitive human evaluation metric. Some hand-crafted metrics have a high Pearson correlation with certain human evaluation metrics while the corresponding Spearman's rank correlation is low. The reason is that they can easily identify some extremely good/bad agents but are less effective at distinguishing agents with similar performance.
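The high-Pearson/low-Spearman phenomenon can be illustrated with synthetic agent scores (the numbers below are invented for illustration, not from the paper): a metric that separates two extreme agents but ranks the similar mid-pack agents almost at random scores well on Pearson yet poorly on Spearman.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson on ranks (no ties here)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))

# Synthetic scores: extremes agree, the four mid-pack agents are shuffled.
human  = [0.0, 4.0, 5.0, 6.0, 7.0, 20.0]
metric = [0.1, 7.0, 4.9, 6.5, 5.2, 19.5]

assert pearson(human, metric) > 0.9          # extremes dominate Pearson
assert spearman(human, metric) < pearson(human, metric)
```

Here the two extreme agents drive Pearson above 0.9, while the scrambled mid-pack ranking pulls Spearman well below it.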
Comparison to BLEU, BLEURT, and BERTScore. We compare ENIGMA with automatic single-turn language quality metrics in Figure 23: BLEU, BLEURT (Sellam et al., 2020), and BERTScore (Zhang et al., 2019). As can be seen, these metrics have a high correlation with only certain human evaluation metrics and a low correlation with the others. Note that we do not compare perplexity, as the agents rely on complicated decoding methods (See et al., 2019) and perplexity does not take decoding into consideration.

D.4 Error Analysis
We analyze the detailed errors to identify error patterns and better understand the limits of ENIGMA. We calculate the absolute difference between the estimate and the true average reward. The results are summarized in Figure 25. A common pattern on ConvAI2 is that when the true average reward is too high or too low, ENIGMA becomes less accurate. One possible reason is the lack of dialogs with extreme rewards in the experience data. We empirically verify this conjecture by comparing the error with the reward distribution of the experience data in Figure 25. For AirDialog, no such pattern is apparent. This is because the quality of the decision module matters more to the agent's task completion scores than language quality. As a result, even when the performance of the target agent is much higher/lower than that of the experience data, ENIGMA can estimate the performance accurately as long as they share similar language.

D.5 Embedding Visualization
In Figure 26, we present t-SNE plots of the embeddings of the state-action pairs from the behavior experience data and the target policy. The two sets of embeddings provided by the pre-trained language model largely overlap and carry rich semantic information. In contrast, the embeddings provided by a randomly initialized model spread over the entire high-dimensional space.

E.1 Static Methods
As can be seen, most previous methods only focus on evaluating the language quality of a single-turn response given a fixed context. These methods cannot evaluate agents in an interactive context, and as a result they cannot be extended to goal-oriented dialogs. For goal-oriented dialogs, static evaluation methods are very limited: they can only evaluate the model's actions against a fixed complete dialog, e.g., intent detection.
Comparison to the Meena paper (Adiwardana et al., 2020): 1. They only show that perplexity (PPL) correlates with one specific metric, Sensibleness and Specificity Average. We consider a wide range of metrics covering both task-completion scores and dialog quality scores (listed in Table 7), and no evidence shows that PPL correlates well with most of them. 2. They draw their conclusion using only 7 chatbots, which is not statistically reliable: for R² = 0.93 with 7 data points, the 95% confidence interval is 0.64 ≤ R² ≤ 0.99. In contrast, we use 24/29 agents; with 24 data points, the 95% CI is 0.87 ≤ R² ≤ 0.96, which is much more reliable.
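The confidence-interval argument above can be reproduced approximately with Fisher's z-transform of the correlation coefficient. This is one standard construction; the text does not state which method was used, so the endpoints below only roughly match its numbers:

```python
import math

def r2_confidence_interval(r2, n, z_crit=1.96):
    """Approximate 95% CI for R^2 via Fisher's z-transform of r = sqrt(R^2).
    A standard construction, assumed here for illustration; the paper's
    exact CI method is not stated."""
    r = math.sqrt(r2)
    z = math.atanh(r)                  # Fisher transform
    se = 1.0 / math.sqrt(n - 3)        # standard error of z
    lo = math.tanh(z - z_crit * se)
    hi = math.tanh(z + z_crit * se)
    return lo ** 2, hi ** 2

lo7, hi7 = r2_confidence_interval(0.93, 7)     # wide interval, roughly (0.6, 0.99)
lo24, hi24 = r2_confidence_interval(0.93, 24)  # much tighter interval
assert (hi24 - lo24) < (hi7 - lo7)
```

With 7 data points the interval is so wide that R² = 0.93 is barely informative, whereas 24 data points shrink it substantially, which is the point being made.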

E.2 Dynamic Methods
Previous dynamic methods under the RL framework are based on self-play evaluation, which requires learning the environment, i.e., the human. As discussed in the main paper, learning a human model is significantly beyond the current technical limit.
ENIGMA avoids learning the environment by directly modeling the performance of agents.

E.3 Information Theoretic Limit
The common limitation of all existing methods is that they require similarity between the target policy and the behavior policies, so that the experience data covers sufficient interaction patterns between the target policy and humans.
For example, the BLEU score requires the agent response to be similar to the reference response. Another example is ADEM (Lowe et al., 2017), which includes the target policy in the experience data collection to achieve decent performance (0.37 Pearson correlation with human ratings). If the target policy is excluded from the behavior policies, ADEM only achieves a 0.13 Pearson correlation, which is even lower than the 0.27 correlation between dialog length and human ratings.
For static single-turn evaluation of language quality, one might satisfy this requirement by using humans as the behavior policy together with large-scale, diverse experience data. This is because the single-turn responses of the target model have a very similar pattern to human responses, as the model is usually trained to mimic one-turn human responses. However, high similarity between the target model's responses and human responses requires a very strong target model trained with large-scale data, which is not practical in most settings. Some existing works try to alleviate this requirement and increase the coverage of the experience data using external knowledge graphs (Huang et al., 2020) and synthetic samples (Sellam et al., 2020). We remark that although static methods only require single-turn similarity between the behavior and target policies, their empirical performance is unsatisfactory compared with multi-turn interactive human evaluation (Ghandeharioun et al., 2019).
In multi-turn interactive evaluation, we cannot simply use humans as the behavior policy, especially for goal-oriented dialogs, because the multi-turn behavior of the target model is very different from human behavior. Taking AirDialog as an example, human agents can always book the correct tickets, while the target model may fail many times.
Such a limitation corresponds to the theoretical requirement of a bounded state-action density ratio between the target and behavior policies, which has been discussed extensively in the off-policy evaluation literature (Wang et al., 2020b; Xie et al., 2019).
Due to this theoretical limitation, a large amount of human-model interactive evaluation data is needed to study automatic interactive evaluation. However, most evaluation logs are not publicly available, and research in this direction has largely lagged behind. To the best of our knowledge, ConvAI2 (See et al., 2019) is the only public comprehensive human-model interactive evaluation dataset. Therefore, we recommend that the research community release human-model interactive evaluation data to promote dialog evaluation/learning research and benefit the entire community.