Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic

Dialog Structure Induction (DSI) is the task of inferring the latent dialog structure (i.e., a set of dialog states and their temporal transitions) of a given goal-oriented dialog. It is a critical component for modern dialog system design and discourse analysis. Existing DSI approaches are often purely data-driven: they deploy models that infer latent states without access to domain knowledge, underperform when the training corpus is limited or noisy, and struggle when test dialogs exhibit distributional shifts from the training domain. This work explores a neural-symbolic approach as a potential solution to these problems. We introduce Neural Probabilistic Soft Logic Dialogue Structure Induction (NEUPSL DSI), a principled approach that injects symbolic knowledge into the latent space of a generative neural model. We conduct a thorough empirical investigation of the effect of NEUPSL DSI learning on hidden representation quality, few-shot learning, and out-of-domain generalization performance. Over three dialog structure induction datasets, and across unsupervised and semi-supervised settings for standard and cross-domain generalization, the injection of symbolic knowledge using NEUPSL DSI provides a consistent boost in performance over the canonical baselines.


INTRODUCTION
The seamless integration of prior domain knowledge into the neural learning of language structure has been an open challenge in the machine learning and natural language processing communities. In this work, we inject symbolic knowledge into the neural learning process of a two-party dialog structure induction (DSI) task (Zhai & Williams, 2014; Shi et al., 2019). This task aims to learn a graph, known as the dialog structure, capturing the potential flow of states occurring in a dialog dataset for a specific task-oriented domain; e.g., Figure 1 represents a possible dialog structure for the goal-oriented task of booking a hotel. Nodes in the dialog structure represent conversational topics or dialog acts that abstract the intent of individual utterances, and edges represent transitions between dialog acts over successive turns of the dialog. Similar to the motivation described in Shi et al. (2019), previous work in DSI has been split between supervised and unsupervised methods. In particular, traditional supervised methods (Jurafsky, 1997) relied on dialog structures hand-crafted by human domain experts. Unfortunately, this process is labor-intensive and, in most situations, does not generalize easily to new domains. Therefore, recent work attempts to overcome this limitation by studying unsupervised DSI, e.g., hidden Markov models (Chotimongkol, 2008; Ritter et al., 2010; Zhai & Williams, 2014) and, more recently, Variational Recurrent Neural Networks (VRNNs) (Chung et al., 2015; Shi et al., 2019). However, being purely data-driven, these approaches have difficulty with limited/noisy data and cannot easily exploit domain-specific or domain-independent constraints on dialog that may be readily provided by human experts (e.g., Greet utterances are typically made in the first couple of turns).
In this work, we propose Neural Probabilistic Soft Logic Dialog Structure Induction (NEUPSL DSI). This practical neuro-symbolic approach improves the quality of learned dialog structure by infusing domain knowledge into the end-to-end, gradient-based learning of a neural model. We leverage Probabilistic Soft Logic (PSL), a well-studied soft logic formalism, to express domain knowledge as soft rules in succinct and interpretable first-order logic statements that can be incorporated easily into differentiable learning (Bach et al., 2017; Pryor et al., 2022). This leads to a simple method for knowledge injection with minimal change to the SGD-based training pipeline of an existing neural generative model. Our key contributions are: 1) we propose NEUPSL DSI, which introduces a novel smooth relaxation of PSL constraints tailored to ensure a rich gradient signal during back-propagation; and 2) we evaluate NEUPSL DSI over synthetic and realistic dialog datasets under three settings: standard generalization, domain generalization, and domain adaptation.

BACKGROUND ON PROBABILISTIC SOFT LOGIC
This work introduces soft constraints in a declarative fashion, similar to Probabilistic Soft Logic (PSL) (Bach et al., 2017). PSL models relational dependencies and structural constraints using first-order logical rules, referred to as templates, with arguments known as atoms. For example, the statement "the first utterance in a dialog is likely to belong to the greet state" can be expressed as:

FIRSTUTT(U) → STATE(U, greet)    (1)

where FIRSTUTT(U) and STATE(U, greet) are the atoms (i.e., atomic boolean statements) indicating, respectively, whether an utterance U is the first utterance of the dialog, or whether it belongs to the state greet. The atoms in a PSL rule are grounded by replacing the free variables (such as U above) with concrete instances from a domain of interest (e.g., the concrete utterance 'Hello!') to create ground atoms. The observed variables and target/decision variables of the probabilistic model correspond to ground atoms constructed from the domain; e.g., FIRSTUTT('Hello!') is an observed variable and STATE('Hello!', greet) is a target variable. PSL allows the originally Boolean-valued atoms to take continuous truth values in the interval [0, 1]. In doing so, PSL replaces logical operations with a form of soft logic called Lukasiewicz logic (Klir & Yuan, 1995): 1) A ∧ B = max(0.0, A + B − 1.0); 2) A ∨ B = min(1.0, A + B); and 3) ¬A = 1.0 − A. Here A and B represent either ground atoms or logical expressions over atoms and take values in [0, 1]. For example, PSL converts Equation 1 into:

min(1.0, 1.0 − FIRSTUTT(U) + STATE(U, greet))    (2)

From such rules, PSL creates a collection of functions {ϕ_i}_{i=1}^m, called potentials, that map data to [0, 1]. PSL defines a conditional probability density function over the unobserved random variables y given the observed data x and non-negative weights λ, known as the Hinge-Loss Markov Random Field (HL-MRF):

P(y | x, λ) ∝ exp( −∑_{i=1}^m λ_i ϕ_i(y, x) )    (3)
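The Lukasiewicz operators and the grounding of the rule in Equation 1 can be sketched in a few lines of Python (an illustrative sketch only, not part of any PSL implementation; the utterance and the 0.7 truth value are hypothetical):

```python
def l_and(a, b):
    """Lukasiewicz conjunction: max(0, A + B - 1)."""
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    """Lukasiewicz disjunction: min(1, A + B)."""
    return min(1.0, a + b)

def l_not(a):
    """Lukasiewicz negation: 1 - A."""
    return 1.0 - a

def l_implies(a, b):
    """A -> B rewritten as (not A) or B, i.e., min(1, 1 - A + B)."""
    return l_or(l_not(a), b)

# Grounding FIRSTUTT(U) -> STATE(U, greet) for the concrete utterance 'Hello!':
first_utt = 1.0    # observed ground atom: 'Hello!' is the first utterance
state_greet = 0.7  # hypothetical soft truth value of STATE('Hello!', greet)
print(l_implies(first_utt, state_greet))  # 0.7
```

Note that each operator is a piecewise-linear function of its arguments, which is precisely what motivates the smoother relaxation introduced later in the paper.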

NEURAL PROBABILISTIC SOFT LOGIC DIALOG STRUCTURE INDUCTION
Our neuro-symbolic approach to dialog structure induction combines the principled formulation of probabilistic soft logic (PSL) (Bach et al., 2017) rules with the state-of-the-art Direct-Discrete Variational Recurrent Neural Network (DD-VRNN) (Shi et al., 2019). We refer to our approach as Neural Probabilistic Soft Logic Dialog Structure Induction (NEUPSL DSI). Throughout this section, we define the dialog structure learning problem, describe how to integrate the neural and symbolic losses, and introduce an improvement to the neuro-symbolic gradient.
Problem Formulation Given a goal-oriented dialog corpus D, we consider the DSI problem of learning a graph G underlying the corpus. More formally, a dialog structure is defined as a directed graph G = (S, P), where S = {s_1, ..., s_m} encodes a set of dialog states and P is a probability distribution p(s_t | s_<t) representing the likelihood of transitions between states (see Figure 1 for an example). Given the underlying dialog structure G, a dialog d_i = {x_1, ..., x_T} ∈ D is a temporally-ordered set of utterances x_t. Assume x_t is generated according to an utterance distribution conditioned on the past history, p(x_t | s_≤t, x_<t), and the state s_t is generated according to p(s_t | s_<t). Given a dialog corpus D = {d_i}_{i=1}^n, the task of DSI is to learn a directed graphical model G = (S, P) as close to the underlying graph as possible.
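A minimal encoding of such a structure G = (S, P) might look as follows (a sketch with hypothetical states and, for brevity, first-order Markov transitions rather than the full-history p(s_t | s_<t) used above):

```python
# Dialog structure G = (S, P): a set of states S and, for each state,
# a probability distribution over successor states.
states = ["greet", "initial_request", "second_request", "end"]

transitions = {
    "greet":           {"initial_request": 1.0},
    "initial_request": {"second_request": 0.8, "end": 0.2},
    "second_request":  {"second_request": 0.3, "end": 0.7},
    "end":             {"end": 1.0},
}

# Each row of P must be a valid probability distribution over next states.
for state, row in transitions.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, state
print("valid transition model")
```

This is merely the data structure the learner must recover; the DD-VRNN parameterizes the states and transitions implicitly through its latent variables.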

INTEGRATING NEURAL AND SYMBOLIC LEARNING UNDER NEUPSL DSI
We now introduce how the NEUPSL DSI approach formally integrates the DD-VRNN with the soft symbolic constraints to allow for end-to-end gradient training. To do this, we build upon the foundations developed by Pryor et al. (2022) on Neural Probabilistic Soft Logic (NeuPSL) by augmenting the standard unsupervised DD-VRNN loss (Shi et al., 2019) with a constraint loss. Figure 2 provides a graphical representation of this integration of the DD-VRNN and the symbolic constraints. Intuitively, NEUPSL DSI can be described in three parts: instantiation, inference, and learning.
Instantiation of a NEUPSL DSI model uses a set of first-order logic templates to create a set of potentials that define a loss used for learning and evaluation. Let p_w be the DD-VRNN's predictive function of latent states with hidden parameters w and input utterances x_vrnn. The output of this function, defined as p_w(x_vrnn), is the probability distribution representing the likelihood of each latent class for a given utterance. Given a first-order symbolic rule ℓ_i(y, x_vrnn, x), where the decision variable y = p_w(x_vrnn) is the latent state prediction from the neural model p_w and x are the observed variables, we can instantiate a set of deep hinge-loss potentials of the form:

ϕ_{w,i}(x_vrnn, x) = min(1, ℓ_i(p_w(x_vrnn), x))

For example, in reference to Equation 2, the decision variable y = p_w(x_vrnn) is associated with the STATE(U, greet) random variables, leading to:

ϕ_w(x_vrnn, x) = min(1.0, 1.0 − FIRSTUTT(U) + p_w(x_vrnn))

The instantiated model described above breaks the NEUPSL DSI inference objective into neural and symbolic inference objectives. The neural inference objective is computed by evaluating the DD-VRNN model predictions with respect to the standard loss function for DSI.
Given the deep hinge-loss potentials {ϕ_{w,i}}_{i=1}^m, the symbolic inference objective is the HL-MRF likelihood (Equation 3) evaluated at the decision variables y = p_w(x_vrnn):

P_w(y | x_vrnn, x, λ) ∝ exp( −∑_{i=1}^m λ_i ϕ_{w,i}(x_vrnn, x) )

Under NEUPSL DSI, the decision variables y = p_w(x_vrnn) are implicitly controlled by the neural network weights w; therefore, the conventional MAP inference over the decision variables in symbolic learning, y* = arg max_y P_w(y | x_vrnn, x, λ), can be done simply via neural weight optimization, arg max_w P_w(y | x_vrnn, x, λ). As a result, NEUPSL DSI learning minimizes a constrained optimization objective:

min_w [ L_vrnn(w) + L_constraint(w) ]

where L_constraint is the negative log-likelihood of the hinge-loss potentials: −log P_w(y | x_vrnn, x, λ).
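The shape of this learning objective can be sketched as follows (an illustrative sketch, not the authors' implementation; the potential here is written in the standard PSL distance-to-satisfaction convention, and the weights, truth values, and neural loss are hypothetical):

```python
def hinge_potential(rule_truth):
    """Distance to satisfaction of one ground rule: 1 - truth, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - rule_truth))

def constraint_loss(rule_truths, weights):
    """Negative log of the unnormalized HL-MRF density: sum_i lambda_i * phi_i."""
    return sum(w * hinge_potential(t) for t, w in zip(rule_truths, weights))

def neupsl_dsi_loss(vrnn_loss, rule_truths, weights):
    """Total training objective: the neural DD-VRNN loss plus the symbolic loss."""
    return vrnn_loss + constraint_loss(rule_truths, weights)

# Two ground rules: one fully satisfied (truth 1.0), one 60% satisfied.
# Only the partially violated rule contributes to the loss.
print(neupsl_dsi_loss(2.5, [1.0, 0.6], [1.0, 2.0]))  # 2.5 + 2.0 * 0.4 = 3.3
```

In the actual model the rule truths are differentiable functions of the DD-VRNN's state distribution, so gradients of this sum flow back into the network weights w.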

IMPROVING SOFT LOGIC CONSTRAINTS FOR GRADIENT LEARNING
The straightforward linear soft constraints used by the classic Lukasiewicz relaxation fail to pass back gradients with a magnitude; instead, they pass back only a direction (e.g., ±1). Formally, the gradient of a potential ϕ_w(x_vrnn, x) = min(1, ℓ(p_w(x_vrnn), x)) with respect to w is:

∂/∂w ϕ_w = 1_{ϕ_w < 1} · (∂/∂p_w ℓ(p_w, x)) · (∂/∂w p_w)

Here ℓ(p_w, x) = a · p_w + b, where a, b ∈ R and p_w ∈ [0, 1], which leads to the gradient ∂/∂p_w ℓ(p_w, x) = a. Observing the three Lukasiewicz operations described in Section 2, it is clear that a will always be ±1 unless there are multiple p_w per constraint. As a result, this classic soft relaxation leads to a naive, non-smooth gradient (∂/∂w ϕ_w = a · 1_{ϕ_w < 1} · ∂/∂w p_w) that mostly consists of the predictive probability gradient ∂/∂w p_w. It barely informs the model of the degree to which p_w satisfies the symbolic constraint ϕ_w (other than through the non-smooth step function 1_{ϕ_w < 1}), thereby creating challenges in gradient-based learning. In this work, we propose a novel log-based relaxation that provides smoother and more informative gradient information for the symbolic constraints:

ϕ̃_w(x_vrnn, x) = log ϕ_w(x_vrnn, x) = log( min(1, ℓ(p_w(x_vrnn), x)) )

This seemingly simple transformation brings a non-trivial change to the gradient behavior: the gradient now contains a factor 1/ϕ_w, which informs the model of the degree to which the prediction satisfies the symbolic constraint. As a result, when the satisfaction of a rule ϕ_w is low (i.e., uncertain), the gradient magnitude will be high, and when the satisfaction of the rule is high, the gradient magnitude will be low. In this way, the gradient of the symbolic constraint guides the neural model to focus on learning the challenging examples that violate the symbolic rules.
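The contrast between the two relaxations can be checked numerically (a sketch under the simplifying assumption of a one-atom rule, so that the Lukasiewicz truth reduces to min(1, p); the probe points 0.2 and 0.8 are arbitrary):

```python
import math

def linear_potential(p):
    """Classic Lukasiewicz satisfaction of a one-atom rule: min(1, p)."""
    return min(1.0, p)

def log_potential(p, eps=1e-12):
    """Proposed log relaxation; its gradient carries the extra 1/phi factor."""
    return math.log(max(linear_potential(p), eps))

def grad(f, p, h=1e-6):
    """Central finite-difference approximation of df/dp."""
    return (f(p + h) - f(p - h)) / (2.0 * h)

# Linear relaxation: gradient magnitude is 1 whether the rule is badly
# violated (p = 0.2) or nearly satisfied (p = 0.8).
print(grad(linear_potential, 0.2), grad(linear_potential, 0.8))  # 1.0 1.0
# Log relaxation: gradient is 1/p, so poorly satisfied rules receive a
# much stronger learning signal.
print(grad(log_potential, 0.2), grad(log_potential, 0.8))  # ~5.0 ~1.25
```

This mirrors the claim above: the 1/ϕ_w factor scales the gradient by the degree of rule violation rather than passing back a bare direction.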

EXPERIMENTAL EVALUATION
Datasets Experiments are conducted using three goal-oriented dialog datasets: MultiWoZ 2.1 synthetic (Campagna et al., 2020) and two versions of the Schema Guided Dialog (SGD) dataset: SGD-synthetic (where the utterances are generated by a template-based dialog simulator) and SGD-real (which replaces the machine-generated utterances of SGD-synthetic with their human-paraphrased counterparts) (Rastogi et al., 2020). The SGD-real dataset is evaluated over three unique settings: standard generalization (train and test over the same domain), domain generalization (train and test over different domains), and domain adaptation (train on (potentially labeled) data from the training domain and unlabeled data from the test domain, and test on evaluation data from the test domain).

Constraints In the synthetic MultiWoZ setting, we introduce a set of 11 structural, domain-agnostic dialog rules; an example of one of these rules can be seen in Equation 1. These rules represent general facts about dialogs, with the goal of showing how the incorporation of a few expert-designed rules can drastically improve generalization performance. For the SGD settings, a single dialog rule is used that encodes the concept that dialog acts should contain utterances with correlated tokens, e.g., utterances with 'hello' are likely to belong to the greet state. This rule demonstrates the boost in performance a model can achieve from a simple source of prior information.

Metrics and Methodology
We assess the correctness of the learned latent dialog structure and the quality of the learned hidden representation using Adjusted Mutual Information (AMI) and linear probing, respectively. AMI allows for a comparison between ground truth labels (e.g., "greet", etc.) and latent state predictions (e.g., State 1, etc.). Linear probing trains a lightweight probing model on top of the frozen learned representation and evaluates the linear model's generalization performance on supervised tasks (Tenney et al., 2019). We train both a full-supervision and a few-shot-supervision linear classifier on top of input features extracted from the penultimate layer of the DD-VRNN. Full supervision averages the class-balanced accuracy of two separate models that classify dialog acts (e.g., "greet", etc.) and domains (e.g., "hotel", etc.), respectively. Few-shot averages the class-balanced accuracy of models classifying dialog acts in one-, five-, and ten-shot settings.

Table 1 summarizes the results of NEUPSL DSI and DD-VRNN in unsupervised settings. NEUPSL DSI outperforms the strictly data-driven DD-VRNN on AMI by 4%-27% depending on the setting, while maintaining or improving the hidden representation quality. To reiterate, this improvement is achieved without supervision in the form of labels, but rather with a few structural constraints. Comparing AMI performance on SGD-real across different settings (standard generalization vs. domain generalization/adaptation), we see that NEUPSL DSI consistently improves over DD-VRNN, albeit with the advantage slightly diminished in the non-standard generalization settings.

CONCLUSION
This paper introduces NEUPSL DSI, a novel neuro-symbolic learning framework that guides latent dialog structure learning using differentiable symbolic knowledge. Through extensive empirical evaluations, we illustrate how the injection of just a few domain knowledge rules significantly improves both correctness and hidden representation quality in this unsupervised NLP task.

A MODEL DETAILS
This section provides additional details on the NEUPSL DSI models for the Multi-WoZ and SGD settings. Throughout these subsections, we cover the symbolic constraints, evaluation metrics, and hyperparameters. The code is under the Apache 2.0 license.

A.1 SGD CONSTRAINTS
The NEUPSL DSI model uses a single constraint for all SGD settings (synthetic, standard, domain generalization, and domain adaptation). Figure 3 provides an overview of the constraint, which contains the following two predicates:

1. STATE(Utt, Class) The STATE continuous-valued predicate is the probability that an utterance, identified by the argument Utt, belongs to a dialog state, identified by the argument Class. For instance, the utterance "hello world!" for the greet dialog state would create a predicate with a value between zero and one, e.g., STATE(hello world!, greet) = 0.7.

HASWORD(Utt, Class)
The HASWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for a particular class, identified by the argument Class. For instance, if a known token associated with the greet class is "hello", then the utterance "hello world!" would create a predicate with value one, i.e., HASWORD(hello world!, greet) = 1.
This token constraint encodes the prior knowledge that an utterance is likely to belong to a dialog state when it contains tokens representing that state. For example, if a known token associated with the greet class is "hello", then the utterance "hello world!" is likely to belong to the greet state. The primary purpose of incorporating this constraint into the model is to show how even a small amount of prior knowledge can aid predictions. To get the set of tokens associated with each state, we trained a supervised linear classifier where the input is an utterance and the label is the class. After training, every token is individually run through the trained model to get a set of logits over the classes. These logits represent the relative importance each token has for every class. Sparsity is then introduced to this set of logits, leaving only the top 0.1% of values and replacing the others with zeros. This reduces the set of 261,651 logits to 262 non-zero logits.
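The sparsification step can be sketched as follows (illustrative only; the toy logit matrix stands in for the trained classifier's 261,651 token-class logits, and keep_frac mirrors the 0.1% threshold):

```python
import numpy as np

def sparsify_token_logits(logits, keep_frac=0.001):
    """Zero out all but the top keep_frac fraction of token-class logits."""
    flat = logits.ravel()
    k = max(1, int(round(flat.size * keep_frac)))
    threshold = np.sort(flat)[-k]  # the k-th largest value
    return np.where(logits >= threshold, logits, 0.0)

# A toy 100-token x 10-class logit matrix: with keep_frac = 0.001, only the
# single largest of the 1,000 entries survives.
toy_logits = np.arange(1000.0).reshape(100, 10)
sparse = sparsify_token_logits(toy_logits, keep_frac=0.001)
print(int(np.count_nonzero(sparse)))  # 1
```

The surviving non-zero entries then define the known token list that grounds the HASWORD predicate.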

A.2 MULTI-WOZ CONSTRAINTS
The NEUPSL DSI model for the Multi-WoZ setting uses a set of dialog constraints, which can be broken into dialog start, middle, and end constraints. Figure 4 provides an overview of the constraints, which contain the following 11 predicates:

1. STATE(Utt, Class) The STATE continuous-valued predicate is the probability that an utterance, identified by the argument Utt, belongs to a dialog state, identified by the argument Class. For instance, the utterance "hello world!" for the greet dialog state would create a predicate with a value between zero and one, e.g., STATE(hello world!, greet) = 0.7.

FIRSTUTT(Utt)
The FIRSTUTT binary predicate indicates if an utterance, identified by the argument Utt, is the first utterance in a dialog.

LASTUTT(Utt)
The LASTUTT binary predicate indicates if an utterance, identified by the argument Utt, is the last utterance in a dialog.

PREVUTT(Utt1, Utt2)
The PREVUTT binary predicate indicates if the utterance identified by the argument Utt2 is the utterance immediately preceding, in the dialog, the utterance identified by the argument Utt1.

HASGREETWORD(Utt)
The HASGREETWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the greet class. The list of known greet words is

HASINFOQUESTIONWORD(Utt)
The HASINFOQUESTIONWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the info question class. The list of known info question words is ['address', 'phone'].

HASSLOTQUESTIONWORD(Utt)
The HASSLOTQUESTIONWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the slot question class. The list of known slot question words is

HASINSISTWORD(Utt)
The HASINSISTWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the insist class. The list of known insist words is

HASCANCELWORD(Utt)
The HASCANCELWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the cancel class. The list of known cancel words is

HASACCEPTWORD(Utt)
The HASACCEPTWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the accept class. The list of known accept words is

HASENDWORD(Utt)
The HASENDWORD binary predicate indicates if an utterance, identified by the argument Utt, contains a known token for the end class. The list of known end words is

The dialog start constraints take advantage of the inherent structure built into the beginning of task-oriented dialogs. In the same order as the dialog start rules in Figure 4: 1) If the first turn utterance does not contain a known greet word, then it does not belong to the greet state. 2) If the first turn utterance contains a known greet word, then it belongs to the greet state. 3) If the first turn utterance does not contain a known greet word, then it belongs to the initial request state.
The dialog middle constraints exploit the temporal dependencies within the middle of a dialog. In the same order as the dialog middle rules in Figure 4: 1) If the previous utterance belongs to the greet state, then the current utterance belongs to the initial request state. 2) If the previous utterance does not belong to the greet state, then the current utterance does not belong to the initial request state.
3) If the previous utterance belongs to the initial request state, then the current utterance belongs to the second request state. 4) If the previous utterance belongs to the second request state and it has a known info question token, then the current utterance belongs to the info question state. 5) If the previous utterance belongs to the second request state and it has a known slot question token, then the current utterance belongs to the slot question state. 6) If the previous utterance belongs to the end state and it has a known cancel token, then the current utterance belongs to the cancel state.
The dialog end constraints take advantage of the inherent structure built into the end of task-oriented dialogs. In the same order as the dialog end rules in Figure 4: 1) If the last turn utterance contains a known end word, then it belongs to the end state. 2) If the last turn utterance contains a known accept word, then it belongs to the accept state. 3) If the last turn utterance contains a known insist word, then it belongs to the insist state.
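As an illustration, the first dialog-end rule, LASTUTT(U) ∧ HASENDWORD(U) → STATE(U, end), grounds to a soft truth value via the Lukasiewicz operators from the background section (a hypothetical grounding; the utterance and the 0.4 prediction are made up):

```python
def l_and(a, b):
    """Lukasiewicz conjunction: max(0, A + B - 1)."""
    return max(0.0, a + b - 1.0)

def l_implies(a, b):
    """Lukasiewicz implication A -> B: min(1, 1 - A + B)."""
    return min(1.0, 1.0 - a + b)

# Ground atoms for a hypothetical final utterance "goodbye!":
last_utt = 1.0      # observed: it is the last utterance of the dialog
has_end_word = 1.0  # observed: "goodbye" is a known end token
state_end = 0.4     # neural prediction for STATE(U, end)

# Rule truth; any value below 1.0 contributes to the constraint loss.
print(l_implies(l_and(last_utt, has_end_word), state_end))  # 0.4
```

Because both body atoms are observed to be 1, the rule truth equals the model's own STATE(U, end) prediction, so the constraint directly pushes that probability upward during training.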

A.3 EVALUATION METRICS
Adjusted Mutual Information (AMI) evaluates dialog structure prediction by evaluating the correctness of the dialog state assignments. Let U* = {U*_1, ..., U*_{C*}} be the ground-truth assignment of dialog states for all utterances in the corpus, and U = {U_1, ..., U_C} be the predicted assignment of dialog states based on the learned dialog structure model. U* and U are not directly comparable because they draw from different base sets of states (U* from the ground-truth set of states and U from the collection of states induced by the DD-VRNN) that may even have different cardinalities. We address this problem using Adjusted Mutual Information (AMI), a metric developed initially to compare unsupervised clustering algorithms. Intuitively, AMI treats each assignment as a probability distribution over states and uses mutual information to measure their similarity, adjusting for the fact that larger clusters tend to have higher MI. AMI is defined as follows:

AMI(U, U*) = ( MI(U, U*) − E[MI(U, U*)] ) / ( Avg(H(U), H(U*)) − E[MI(U, U*)] )

where MI(U, U*) is the mutual information score, E[MI(U, U*)] is the expected mutual information over all possible assignments, and Avg(H(U), H(U*)) is the average entropy of the two clusterings (Vinh et al., 2010).
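This definition matches scikit-learn's adjusted_mutual_info_score (with its default arithmetic averaging of the entropies). A toy check with hypothetical state assignments shows that AMI is invariant to a relabeling of the induced states and is zero for an uninformative single-cluster prediction:

```python
from sklearn.metrics import adjusted_mutual_info_score

# Ground-truth dialog acts vs. induced latent states: the induced states
# match the truth exactly up to renaming, so AMI = 1.0.
true_states = ["greet", "greet", "request", "request", "end", "end"]
pred_states = ["s3", "s3", "s1", "s1", "s2", "s2"]
print(adjusted_mutual_info_score(true_states, pred_states))  # 1.0

# A degenerate prediction that places everything in one state carries no
# information about the ground truth, so AMI = 0.0.
print(adjusted_mutual_info_score(true_states, ["s1"] * 6))  # 0.0
```

This invariance to relabeling is exactly what makes AMI usable here, since the DD-VRNN's induced state indices have no a priori correspondence to the ground-truth dialog acts.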

A.4 HYPERPARAMETERS
The DD-VRNN uses an LSTM (Hochreiter & Schmidhuber, 1997) with 200-400 units for the RNNs, and fully-connected, highly flexible feature extraction functions with a dropout of 0.4 for the input x, the latent vector z, the prior, the encoder, and the decoder. The input to the DD-VRNN is the utterances with a 300-dimension word embedding created using a GloVe embedding (Pennington et al., 2014) and a BERT embedding (Devlin et al., 2019). The maximum utterance word length was set to 40, the maximum length of a dialog was set to 10, and the tunable weight γ was set to 0.1. The total number of parameters is 26,033,659 for the model with the GloVe embedding and 135,368,227 with the BERT embedding. The experiments are run on Google TPU v4, and the total for all fine-tuning is 326 TPU hours.

B DATASETS
This section provides additional information on the SGD, SGD synthetic, and MultiWoZ 2.1 synthetic datasets.

Figure 1 :
Figure 1: Example dialog structure for the goal-oriented task of booking a hotel.

Figure 2 :
Figure 2: The high-level pipeline of the NEUPSL DSI learning procedure.
Figure 3: SGD Structure Induction Constraint Model

Figure 5 :
Figure 5: Ground truth dialog structure used to generate the MultiWoZ 2.1 dataset. The transition graph shows transitions over 0.05%.

Table 1 :
Test set performance on all datasets.All reported results are averaged over 10 splits.The highest-performing methods per dataset and learning setting are bolded.