End-to-End Neural Discourse Deixis Resolution in Dialogue

We adapt Lee et al.’s (2018) span-based entity coreference model to the task of end-to-end discourse deixis resolution in dialogue, specifically by proposing extensions to their model that exploit task-specific characteristics. The resulting model, dd-utt, achieves state-of-the-art results on the four datasets in the CODI-CRAC 2021 shared task.


Introduction
Discourse deixis (DD) resolution, also known as abstract anaphora resolution, is an under-investigated task that involves resolving a deictic anaphor to its antecedent. A deixis is a reference to a discourse entity such as a proposition, a description, an event, or a speech act (Webber, 1991). DD resolution is arguably more challenging than the extensively investigated entity coreference resolution task. Recall that in entity coreference, the goal is to cluster the entity mentions in narrative text or dialogue, which are composed of pronouns, names, and nominals, so that the mentions in each cluster refer to the same real-world entity. Lexical overlap is a strong indicator of entity coreference, both among names (e.g., "President Biden", "Joe Biden") and in the resolution of nominals (e.g., linking "the president" to "President Biden"). DD resolution, on the other hand, can be viewed as a generalized case of event coreference involving the clustering of deictic anaphors, which can be pronouns or nominals, and clauses, such that the mentions in each cluster refer to the same real-world proposition/event/speech act, etc. The first example in Figure 1 is an example of DD resolution in which the deictic anaphor "the move" refers to Salomon's act of issuing warrants on shares described in the preceding sentence. DD resolution is potentially more challenging than entity coreference resolution because (1) DD resolution involves understanding clause semantics, since antecedents are clauses, and clause semantics are arguably harder to encode than noun phrase semantics; and (2) string matching plays little role in DD resolution, unlike in entity coreference.

Salomon Brothers International Ltd. announced it will issue warrants on shares of Hong Kong Telecommunications Ltd. The move closely follows a similar offer by Salomon of warrants for shares of Hongkong & Shanghai Banking Corp.

Figure 1: Examples of discourse deixis resolution. In each example, the deictic anaphor is italicized and the antecedent is boldfaced.
In this paper, we focus on end-to-end DD resolution in dialogue. The second example in Figure 1 shows a dialogue between A and B in which the deictic anaphor "it" refers to the utterance by B in which s/he said s/he would donate $10. While the deictic anaphors in dialogue are also composed of pronouns and nominals, the proportion of pronominal deictic anaphors in dialogue is much higher than that in narrative text. For instance, while 76% of the deictic anaphors in two text corpora (ARRAU RST and GNOME) are pronominal, the corresponding percentage rises to 93% when estimated based on seven dialogue corpora (TRAINS91, TRAINS93, PEAR, and the four CODI-CRAC 2021 development sets). In fact, the three pronouns "that", "this", and "it" alone comprise 89% of the deictic anaphors in these dialogue corpora. The higher proportion of pronominal deictic anaphors potentially makes DD resolution in dialogue more challenging than that in text: since a pronoun is semantically empty, the successful resolution of a pronominal deictic anaphor depends entirely on a proper understanding of its context. It also makes DD recognition more challenging in dialogue. For instance, while the head of a non-pronominal phrase can often be exploited to determine whether the phrase is a deictic anaphor (e.g., "the man" cannot be a deictic anaphor, but "the move" can), such cues are absent in pronouns.
Since DD resolution can be cast as a generalized case of event coreference, a natural question is: how successful would a state-of-the-art entity coreference model be when applied to DD resolution? Recently, Kobayashi et al. (2021) applied Xu and Choi's (2020) re-implementation of Lee et al.'s span-based entity coreference model to resolve the deictic anaphors in the DD track of the CODI-CRAC 2021 shared task after augmenting it with a type prediction model (see Section 4). Not only did they achieve the highest score on each dataset, but they also beat the second-best system (Anikina et al., 2021), a non-span-based neural approach combined with hand-crafted rules, by a large margin. These results suggest that a span-based approach to DD resolution holds promise.
Our contributions in this paper are three-fold. First, we investigate whether task-specific observations can be exploited to extend a span-based model originally developed for entity coreference so as to improve its performance on end-to-end DD resolution in dialogue. Second, we show that our extensions are effective in improving model performance, allowing our model to achieve state-of-the-art results on the CODI-CRAC 2021 shared task datasets. Finally, we present an empirical analysis of our model, which, to our knowledge, is the first analysis of a state-of-the-art span-based DD resolver.

Related Work
Broadly, existing approaches to DD resolution can be divided into three categories, as described below.

Rule-based approaches. Early systems that resolve deictic expressions are rule-based (Eckert and Strube, 2000; Byron, 2002; Navarretta, 2000). Specifically, they use predefined rules to extract anaphoric mentions and select an antecedent for each extracted anaphor based on the dialogue act type of each candidate antecedent.

Non-neural learning-based approaches. Early non-neural learning-based approaches to DD resolution use hand-crafted feature vectors to represent mentions (Strube and Müller, 2003; Müller, 2008). A classifier is then trained to determine whether a pair of mentions is a valid antecedent-anaphor pair.

Deep learning-based approaches. Deep learning has also been applied to DD resolution. For instance, Marasović et al. (2017) and Anikina et al. (2021) use a Siamese neural network, which takes as input the embeddings of two sentences, one containing a deictic anaphor and the other a candidate antecedent, to score each candidate antecedent and subsequently rank the candidate antecedents based on these scores. In addition, motivated by the recent successes of Transformer-based approaches to entity coreference (e.g., Kantor and Globerson (2019)), Kobayashi et al. (2021) recently proposed a Transformer-based approach to DD resolution: an end-to-end coreference system based on SpanBERT (Joshi et al., 2019, 2020). Their model jointly learns mention extraction and DD resolution and achieved state-of-the-art results in the CODI-CRAC 2021 shared task.

Corpora
We use the DD-annotated corpora provided as part of the CODI-CRAC 2021 shared task. For training, we use the official training corpus from the shared task (Khosla et al., 2021), ARRAU (Poesio and Artstein, 2008), which consists of three conversational sub-corpora (TRAINS-93, TRAINS-91, PEAR) and two non-dialogue sub-corpora (GNOME, RST). For validation and evaluation, we use the official development sets and test sets from the shared task. The shared task corpus is composed of four well-known conversational datasets: AMI (McCowan et al., 2005), LIGHT (Urbanek et al., 2019), Persuasion (Wang et al., 2019), and Switchboard (Godfrey et al., 1992). Statistics on these corpora are provided in Table 1.

Baseline Systems
We employ three baseline systems.
The first baseline, coref-hoi, is Xu and Choi's (2020) re-implementation of Lee et al.'s (2018) widely-used end-to-end entity coreference model. The model ranks all text spans of up to a predefined length based on how likely they correspond to entity mentions. For each top-ranked span x, the model learns a distribution P(y) over its antecedents y ∈ Y(x), where Y(x) includes a dummy antecedent ϵ and every preceding span:

P(y) = exp(s(x, y)) / Σ_{y' ∈ Y(x)} exp(s(x, y'))

where s(x, y) measures how likely x and y refer to the same entity1 (s_a(x, ϵ) = 0 for dummy antecedents):

s(x, y) = s_c(x, y) + s_a(x, y)    (1)
s_a(x, y) = FFNN([g_x, g_y, g_x ∘ g_y, ϕ(x, y)])    (2)
s_c(x, y) = g_x^T W_c g_y    (3)

where g_x and g_y are the vector representations of x and y, W_c is a learned weight matrix for bilinear scoring, FFNN(·) is a feedforward neural network, and ϕ(·) encodes features. Two features are used, one encoding speaker information and the other the segment distance between two spans.

The second baseline, UTD_NLP2, is the top-performing system in the DD track of the CODI-CRAC 2021 shared task (Kobayashi et al., 2021). It extends coref-hoi with a set of modifications, two of the most important being: (1) the addition of a sentence distance feature to ϕ(·), and (2) the incorporation into coref-hoi of a type prediction model, which predicts the type of a span. The possible types of a span i are: ANTECEDENT (if i corresponds to an antecedent), ANAPHOR (if i corresponds to an anaphor), and NULL (if it is neither an antecedent nor an anaphor). The types predicted by the model are then used by coref-hoi as follows: only spans predicted as ANAPHOR can be resolved, and they can only be resolved to spans predicted as ANTECEDENT. Details of how the type prediction model is trained can be found in Kobayashi et al. (2021).
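As a rough illustration, the antecedent-scoring scheme of coref-hoi can be sketched as follows. The toy FFNN, the feature function, and all dimensions are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of Lee et al. (2018)-style antecedent scoring as used by
# coref-hoi. Weights, the toy FFNN, and the feature function are
# illustrative assumptions.
import numpy as np

def antecedent_distribution(g_x, candidates, W_c, ffnn, phi):
    """P(y) over candidate antecedents of span x, plus a dummy antecedent.

    g_x        : span representation of span x
    candidates : list of span representations g_y for preceding spans
    W_c        : weight matrix for the bilinear (coarse) score s_c
    ffnn       : callable scoring the concatenated (fine) features, s_a
    phi        : callable returning the feature vector phi(x, y)
    """
    scores = [0.0]  # s(x, eps) = 0 for the dummy antecedent
    for g_y in candidates:
        s_c = g_x @ W_c @ g_y                                    # bilinear score
        s_a = ffnn(np.concatenate([g_x, g_y, g_x * g_y, phi(g_x, g_y)]))
        scores.append(s_c + s_a)                                 # s = s_c + s_a
    scores = np.array(scores)
    exp = np.exp(scores - scores.max())                          # stable softmax
    return exp / exp.sum()                                       # P(eps), P(y1), ...
```

The returned vector places the dummy antecedent's probability first, followed by one probability per preceding span.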
The third baseline, coref-hoi-utt, is essentially the first baseline except that we restrict the candidate antecedents to be utterances. This restriction is motivated by the observation that the antecedents of the deictic anaphors in the CODI-CRAC 2021 datasets are all utterances. To see what an utterance is, consider again the second example in Figure 1. Each line in this dialogue is an utterance. As can be seen, an utterance roughly corresponds to a sentence, although it can also be a text fragment or simply an interjection (e.g., "uhhh"). While by definition the antecedent of a deictic anaphor can be any clause, the human annotators of the DD track of the CODI-CRAC 2021 shared task decided to restrict the unit of annotation to utterances because (1) based on previous experience, it was difficult to achieve high inter-annotator agreement when clauses are used as the annotation unit (Poesio and Artstein, 2008); and (2) unlike the sentences in narrative text, which can be composed of multiple clauses and therefore can be long, the utterances in these datasets are relatively short and can reliably be used as annotation units. From a modeling perspective, restricting candidate antecedents also has advantages. First, it substantially reduces the number of candidate antecedents and therefore memory usage, allowing our full model to fit into memory. Second, it allows us to focus on resolution rather than recognition of deictic anaphors, as the recognition of clausal antecedents remains a challenging task, especially since existing datasets for DD resolution are relatively small compared to those available for entity coreference (e.g., OntoNotes (Hovy et al., 2006)).

1 See Lee et al. (2018) for a description of the differences between s_c(·) and s_a(·).
2 For an analysis of this and other resolvers competing in the CODI-CRAC 2021 shared task, see Li et al. (2021).


Our Model

We extend coref-hoi-utt with ten extensions, E1-E10, described below. The resulting model is dd-utt.

E1. Modeling recency. While coreferent entity mentions can be far from each other in the corresponding document (because names are non-anaphoric), the distance between a deictic anaphor and its antecedent is comparatively smaller. To model recency, we restrict the set of candidate antecedents of an anaphor to the utterance containing the anaphor as well as the preceding 10 utterances, a choice based on our observation of the development data, where the 10 closest utterances already cover 96-99% of the antecedent-anaphor pairs.

E2. Modeling distance. While the
previous extension allows us to restrict our attention to candidate antecedents that are close to the anaphor, it does not model the fact that the likelihood of a candidate being the correct antecedent tends to increase as its distance from the anaphor decreases. To model this relationship, we subtract the term γ1 · Dist(x, y) from s(x, y) (see Equation (1)), where Dist(x, y) is the utterance distance between anaphor x and candidate antecedent y, and γ1 is a tunable parameter that controls the importance of utterance distance in the resolution process. Since s(x, y) is used to rank candidate antecedents, modeling utterance distance by updating s(x, y) allows distance to have a direct impact on DD resolution.
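The recency restriction can be sketched as a simple window over utterance indices. The data representation (one index per utterance, 0-based) is an illustrative assumption:

```python
# Sketch of E1 (modeling recency): a candidate antecedent must be the
# utterance containing the anaphor or one of the 10 preceding utterances.
def candidate_antecedents(anaphor_utt_idx, window=10):
    """Return indices of candidate antecedent utterances for an anaphor
    appearing in utterance `anaphor_utt_idx` (0-based)."""
    start = max(0, anaphor_utt_idx - window)
    return list(range(start, anaphor_utt_idx + 1))
```

Note that the anaphor's own utterance is included, since a distance of 0 (the anaphor referring to its own utterance) is possible in these datasets.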
E3. Modeling candidate antecedent length. Some utterances are pragmatic in nature and do not convey any important information; therefore, they cannot serve as antecedents of deictic anaphors. Examples include "Umm", "Ahhhh... okay", "that's right", and "I agree". Ideally, the model should identify such utterances and prevent them from being selected as antecedents. We hypothesize that we could help the model by modeling such utterances. To do so, we observe that such utterances tend to be short and model them by penalizing shorter utterances. Specifically, we subtract the term γ2 · 1/Length(y) from s(x, y), where Length(y) is the number of words in candidate antecedent y and γ2 is a tunable parameter that controls the importance of candidate antecedent length in resolution.
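Together, the E2 and E3 penalties amount to a simple adjustment of the pairwise score; the γ values below are placeholders, not the tuned settings:

```python
# Sketch of the E2/E3 score adjustments: subtract a distance penalty
# gamma1 * Dist(x, y) and a short-utterance penalty gamma2 / Length(y)
# from the base score s(x, y). The gamma defaults are illustrative.
def adjusted_score(s_xy, utt_dist, antecedent_len, gamma1=1.0, gamma2=5.0):
    """s(x, y) after the E2 (distance) and E3 (length) penalties."""
    return s_xy - gamma1 * utt_dist - gamma2 / max(antecedent_len, 1)
```

All else being equal, a closer and longer candidate antecedent receives a higher adjusted score, which is exactly the preference the two extensions encode.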
E4. Extracting candidate anaphors. As mentioned before, the deictic anaphors in dialogue are largely composed of pronouns. Specifically, in our development sets, the three pronouns "that", "this", and "it" alone account for 74-88% of the anaphors. Consequently, we extract candidate deictic anaphors as follows: instead of allowing each span of length n or less to be a candidate anaphor, we only allow a span to be a candidate anaphor if its underlying word/phrase has appeared at least once in the training set as a deictic anaphor.

E5. Predicting anaphors. Now that we have the candidate anaphors, our next extension involves predicting which of them are indeed deictic anaphors. To do so, we retrain the type prediction model in UTD_NLP, an FFNN that takes as input the (contextualized) span representation g_i of candidate anaphor i and outputs a vector ot_i of dimension 2, in which the first element denotes the likelihood that i is a deictic anaphor and the second element denotes the likelihood that it is not. i is predicted as a deictic anaphor if and only if the first element of ot_i is larger than the second:

type(i) = A if ot_i[1] > ot_i[2], and NA otherwise

where A (ANAPHOR) and NA (NON-ANAPHOR) are the two possible types. Following UTD_NLP, this type prediction model is jointly trained with the resolution model. Specifically, we compute the cross-entropy loss using ot_i, multiply it by a type loss coefficient λ, and add it to the loss function of coref-hoi-utt. λ is a tunable parameter that controls the importance of type prediction relative to DD resolution.

E6. Modeling the relationship between anaphor recognition and resolution. In principle, the model should resolve a candidate anaphor to a non-dummy candidate antecedent if it is predicted to be a deictic anaphor by the type prediction model. However, type prediction is not perfect, and enforcing this consistency constraint, which we will refer to as C1, will allow errors in type prediction to be propagated to DD resolution. For
example, if a non-deictic anaphor is misclassified by the type prediction model, then it will be (incorrectly) resolved to a non-dummy antecedent. To alleviate error propagation, we instead enforce C1 in a soft manner. To do so, we define a penalty function p1, which imposes a penalty on span i if C1 is violated (i.e., a deictic anaphor is resolved to the dummy antecedent):

p1(i) = max(0, ot_i[1] − ot_i[2])

Intuitively, p1 estimates the minimum amount to be adjusted so that span i's type is not ANAPHOR.
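The E4 candidate-anaphor extraction described earlier reduces to a lexicon lookup. A minimal sketch, assuming the gold deictic-anaphor strings of the training set are available as a list:

```python
# Sketch of E4: a span may be a candidate deictic anaphor only if its
# surface form occurred at least once as a deictic anaphor in training.
# The data format (plain gold anaphor strings) is an assumption.
def build_anaphor_lexicon(training_anaphors):
    """training_anaphors: iterable of gold deictic-anaphor strings."""
    return {a.lower() for a in training_anaphors}

def is_candidate_anaphor(span_text, lexicon):
    """True iff the span's surface form appears in the training lexicon."""
    return span_text.lower() in lexicon
```

Since the vast majority of deictic anaphors are a handful of pronouns, this filter discards most spurious spans while retaining nearly all true anaphors.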
We incorporate p1 into the model as a penalty term in s (Equation (1)). Specifically, we redefine s(i, j) when j = ϵ, as shown below:

s(i, ϵ) = 0 − γ3 · p1(i)

where γ3 is a positive constant that controls the hardness of C1. The smaller γ3 is, the softer C1 is. Intuitively, if C1 is violated, s(i, ϵ) will be lowered by the penalty term, and the dummy antecedent will be less likely to be selected as the antecedent of i.

E7. Modeling the relationship between non-anaphor recognition and resolution. Another consistency constraint that should be enforced is that the model should resolve a candidate anaphor to the dummy antecedent if it is predicted as a non-deictic anaphor by the type prediction model. As in Extension E6, we enforce this constraint, which we will refer to as C2, in a soft manner by defining a penalty function p2:

p2(i) = max(0, ot_i[2] − ot_i[1])

Then we redefine s(i, j) when j ≠ ϵ as follows:

s(i, j) = s_c(i, j) + s_a(i, j) − γ4 · p2(i)

where γ4 is a positive constant that controls the hardness of C2. Intuitively, if C2 is violated, s(i, j) will be lowered by the penalty term, and j will be less likely to be selected as the antecedent of i.

E8. Encoding candidate anaphor context. Examining Equation (1), we see that s(x, y) is computed based on the span representations of x and y. While these span representations are contextualized, the contextual information they encode is arguably limited. As noted before, most of the deictic anaphors in dialogue are pronouns, which are semantically empty. As a result, we hypothesize that we could improve the resolution of these deictic anaphors if we explicitly modeled their contexts. Specifically, we represent the context of a candidate anaphor using the embedding of the utterance in which it appears and add the resulting embedding as features to both the bilinear score s_c(x, y) and the concatenation-based score s_a(x, y):

s_c(x, y) = g_x^T W_c g_y + g_s^T W_s g_y
s_a(x, y) = FFNN([g_x, g_y, g_x ∘ g_y, g_s, ϕ(x, y)])

where W_c and W_s are learned weight matrices, g_s is the embedding of the utterance s in which candidate anaphor x appears, and ϕ(x, y) encodes the relationship between x and y as features.
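A minimal sketch of the p1/p2 soft-constraint penalties from E6 and E7. The type scores are 0-indexed here (ot[0] is the ANAPHOR score, ot[1] the NON-ANAPHOR score), and the γ defaults are placeholders:

```python
# Sketch of the E6/E7 soft consistency penalties. ot is the 2-dim type
# score vector of a candidate anaphor i: ot[0] is the ANAPHOR score,
# ot[1] the NON-ANAPHOR score. gamma3/gamma4 control constraint hardness.
def p1(ot):
    # minimum adjustment needed so that i's type is not ANAPHOR
    return max(0.0, ot[0] - ot[1])

def p2(ot):
    # minimum adjustment needed so that i's type is not NON-ANAPHOR
    return max(0.0, ot[1] - ot[0])

def dummy_score(ot, gamma3=1.0):
    # s(i, eps): 0 for the dummy antecedent, lowered when C1 is violated
    # (i is predicted as an anaphor but would be resolved to the dummy)
    return 0.0 - gamma3 * p1(ot)

def antecedent_score(s_ij, ot, gamma4=1.0):
    # s(i, j), j != eps: lowered when C2 is violated (i is predicted as a
    # non-anaphor but would be resolved to a real antecedent)
    return s_ij - gamma4 * p2(ot)
```

Exactly one of the two penalties is nonzero for any type prediction, so each candidate pays at most one consistency cost.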
Table 2: Filling words and reporting verbs ignored when computing utterance distance.

Filling words: yeah, okay, ok, uh, right, so, hmm, well, um, oh, mm, yep, hi, ah, whoops, alright, shhhh, yes, ay, hello, aww, alas, ye, aye, uh-huh, huh, wow, www, no, and, but, again, wonderful, exactly, absolutely, actually, sure, thanks, awesome, gosh, ooops

Reporting verbs: command, mention, demand, request, reveal, believe, guarantee, guess, insist, complain, doubt, estimate, warn, learn, realise, persuade, propose, announce, advise, imagine, boast, suggest, remember, claim, describe, see, understand, discover, answer, wonder, recommend, beg, prefer, suppose, comment, think, argue, consider, swear, ask, agree, explain, report, know, tell, decide, discuss, repeat

E9. Encoding the relationship between candidate anaphors and antecedents. As noted in Extension E8, ϕ(x, y) encodes the relationship between candidate anaphor x and candidate antecedent y. In UTD_NLP, ϕ(x, y) is composed of three features, including two features from coref-hoi-utt (i.e., the speaker id and the segment distance between x and y) and one feature that encodes the utterance distance between them. Similar to the previous extension, we hypothesize that we could better encode the relationship between x and y using additional features. Specifically, we incorporate an additional feature into ϕ(x, y) that encodes the utterance distance between x and y. Unlike the one used in UTD_NLP, this feature aims to more accurately capture proximity by ignoring unimportant utterances (i.e., those that contain only interjections, filling words, reporting verbs, and punctuation) when computing utterance distance. The complete list of filling words and reporting verbs that we filter can be found in Table 2.
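The filtered utterance distance of E9 can be sketched as follows; the word lists here are a small illustrative subset of Table 2, and the tokenization is deliberately naive:

```python
# Sketch of the E9 filtered utterance distance: utterances consisting only
# of filler words, reporting verbs, and punctuation are skipped when
# counting the distance between a candidate antecedent and the anaphor.
# FILLERS and REPORTING are illustrative subsets of Table 2.
FILLERS = {"yeah", "okay", "ok", "uh", "right", "hmm", "well", "um", "oh"}
REPORTING = {"see", "think", "know", "agree", "guess"}

def is_unimportant(utterance):
    """True iff every token is a filler, a reporting verb, or punctuation."""
    tokens = [t.strip(".,!?").lower() for t in utterance.split()]
    return all(t == "" or t in FILLERS or t in REPORTING for t in tokens)

def filtered_distance(utterances, antecedent_idx, anaphor_idx):
    """Utterance distance that ignores unimportant utterances in between."""
    between = utterances[antecedent_idx + 1 : anaphor_idx + 1]
    return sum(1 for u in between if not is_unimportant(u))
```

With this definition, an antecedent separated from the anaphor only by backchannels like "Okay." still counts as immediately preceding, which is the proximity notion the extension targets.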
E10. Encoding candidate antecedents. In coref-hoi-utt, a candidate antecedent is simply encoded using its span representation. We hypothesize that we could better encode a candidate antecedent using additional features. Specifically, we employ seven features to encode a candidate antecedent y and incorporate them into ϕ(x, y): (1) the number of words in y; (2) the number of nouns in y; (3) the number of verbs in y; (4) the number of adjectives in y; (5) the number of content word overlaps between y and the portion of the utterance containing the anaphor that precedes the anaphor; (6) whether y is the longest among the candidate antecedents; and (7) whether y has the largest number of content word overlaps (as computed in Feature 5) among the candidate antecedents. Like Extension E3, some of these features implicitly encode the length of a candidate antecedent. Despite this redundancy, we believe the redundant information could be exploited by the model differently and may therefore have varying degrees of impact on it.
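A sketch of the seven E10 features; POS tags and content-word filtering are stubbed with simple assumptions rather than a real tagger or stopword list:

```python
# Sketch of the seven E10 candidate-antecedent features. The stopword set
# and pre-supplied POS tags are illustrative assumptions; a real
# implementation would use a tagger and a fuller stopword list.
STOPWORDS = frozenset({"the", "a", "an", "is", "to"})

def content_overlap(tokens_a, tokens_b):
    """Number of distinct content words shared by two token sequences."""
    a = {t.lower() for t in tokens_a} - STOPWORDS
    b = {t.lower() for t in tokens_b} - STOPWORDS
    return len(a & b)

def antecedent_features(y_tokens, y_pos, pre_anaphor_tokens,
                        candidate_lengths, candidate_overlaps):
    """Seven features for one candidate antecedent y.

    y_tokens           : words of y;  y_pos: POS tag per word
    pre_anaphor_tokens : words preceding the anaphor in its utterance
    candidate_lengths/candidate_overlaps : the same quantities for every
        candidate, used for the two "largest among candidates" indicators
    """
    overlap = content_overlap(y_tokens, pre_anaphor_tokens)
    return {
        "num_words": len(y_tokens),
        "num_nouns": sum(p == "NOUN" for p in y_pos),
        "num_verbs": sum(p == "VERB" for p in y_pos),
        "num_adjs": sum(p == "ADJ" for p in y_pos),
        "overlap": overlap,
        "is_longest": len(y_tokens) >= max(candidate_lengths),
        "has_max_overlap": overlap >= max(candidate_overlaps),
    }
```

The first four features act as a soft proxy for how contentful an utterance is, complementing the raw length penalty of E3.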
Evaluation

Experimental Setup

Evaluation metrics. We obtain the results of DD resolution using the Universal Anaphora Scorer (Yu et al., 2022b). Since DD resolution is viewed as a generalized case of event coreference, the scorer reports performance in terms of CoNLL score, which is the unweighted average of the F-scores of three coreference scoring metrics, namely MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and CEAFe (Luo, 2005). In addition, we report the results of deictic anaphor recognition. We express recognition results in terms of Precision (P), Recall (R), and F-score, considering an anaphor correctly recognized if it has an exact boundary match with a gold anaphor.
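The CoNLL score is simply the unweighted mean of the three coreference F-scores:

```python
# The CoNLL score reported by the Universal Anaphora Scorer is the
# unweighted average of the MUC, B3, and CEAF_e F-scores.
def conll_score(muc_f, b3_f, ceafe_f):
    return (muc_f + b3_f + ceafe_f) / 3.0
```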
Model training and parameter tuning. For coref-hoi and coref-hoi-utt, we use SpanBERT Large as the encoder and reuse the hyperparameters from Xu and Choi (2020), with the only exception being the maximum span width: for coref-hoi, we increase the maximum span width from 30 to 45 in order to cover more than 97% of the antecedent spans; for coref-hoi-utt, we use 15 as the maximum span width, which covers more than 99% of the anaphor spans in the training sets. For UTD_NLP, we simply take the outputs produced by the model on the test sets and report the results obtained by running the scorer on those outputs.3 For dd-utt, we use SpanBERT Large as the encoder. Since we do not rely on span enumeration to generate candidate spans, the maximum span width can be set to any number large enough to cover all candidate antecedents and anaphors; we use 300. We tune the parameters (i.e., λ, γ1, γ2, γ3, γ4) using grid search to maximize the CoNLL score on development data. For the type loss coefficient λ, we search over {0.2, 0.5, 1, 200, 500, 800, 1200, 1600}, and for each γ, we search over {1, 5, 10}. We reuse all other hyperparameters from Xu and Choi (2020).
All models are trained for 30 epochs with a dropout rate of 0.3 and early stopping.We use 1 × 10 −5 as our BERT learning rate and 3 × 10 −4 as our task learning rate.Each experiment is run using a random seed of 11 and takes less than three hours to train on an NVIDIA RTX A6000 48GB.
Train-dev partition. Since we have four test sets, we use ARRAU and all dev sets other than the one associated with the test set being evaluated for model training, and the remaining dev set for parameter tuning. For example, when evaluating on the AMI test set, we train models on ARRAU, the LIGHT, Persuasion, and Switchboard dev sets, and use the AMI dev set for tuning.
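The leave-one-out partition can be sketched as:

```python
# Sketch of the leave-one-out train/tune partition described above: when
# evaluating on one corpus's test set, train on ARRAU plus the other three
# dev sets and tune on that corpus's own dev set.
def train_tune_split(eval_corpus,
                     dev_sets=("AMI", "LIGHT", "Persuasion", "Switchboard")):
    train = ["ARRAU"] + [d + "_dev" for d in dev_sets if d != eval_corpus]
    tune = eval_corpus + "_dev"
    return train, tune
```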

Results
Recall that our goal is to perform end-to-end DD resolution, which corresponds to the Predicted evaluation setting in the CODI-CRAC shared task.
Overall performance. Recognition results (expressed in F-score) and resolution results (expressed in CoNLL score) of the three baselines and our model on the four test sets are shown in Table 3, where the Avg. columns report the macro-averages of the corresponding results on the four test sets; the parameter settings that enable our model to achieve the highest CoNLL scores on the development sets are shown in Table 4. Since coref-hoi and coref-hoi-utt do not explicitly identify deictic anaphors, we assume that all but the first mention in each output cluster are anaphors when computing recognition precision. While UTD_NLP (the top-performing system in the shared task) does recognize anaphors, we still make the same assumption when computing its recognition precision, since the anaphors are not explicitly marked in its output (recall that we computed the results of UTD_NLP from its outputs).
We test the statistical significance among the four models using two-tailed Approximate Randomization (Noreen, 1989). For recognition, the models are statistically indistinguishable from each other w.r.t. their Avg. scores at the p < 0.05 level. For resolution, dd-utt is highly significantly better than the baselines w.r.t. Avg. (p < 0.001), while the three baselines are statistically indistinguishable from each other. These results suggest that (1) dd-utt's superior resolution performance stems from better antecedent selection, not better anaphor recognition; and (2) the restriction of candidate antecedents to utterances in coref-hoi-utt does not by itself enable the resolver to yield significantly better resolution results than coref-hoi.
Per-anaphor results. Next, we show the recognition and resolution results of the four models on the most frequently occurring deictic anaphors in Table 5 after micro-averaging them over the four test sets. Not surprisingly, "that" is the most frequent deictic anaphor, appearing as an anaphor 402 times on the test sets and contributing 68.8% of the anaphors. It is followed by "it" (16.3%) and "this" (4.3%). Only 8.9% of the anaphors are not among the top four anaphors.

Consider first the recognition results. As can be seen, "that" has the highest recognition F-score among the top anaphors. This is perhaps not surprising given the comparatively larger number of "that" examples the models are trained on. While "it" occurs more frequently than "this" as a deictic anaphor, its recognition performance is lower than that of "this". This is not surprising either: "this", when used as a pronoun, is more likely to be deictic than "it", although both can serve as a coreference anaphor and a bridging anaphor. In other words, it is comparatively more difficult to determine whether a particular occurrence of "it" is deictic. Overall, UTD_NLP recognizes more anaphors than the other models.
Next, consider the resolution results. To obtain the CoNLL scores for a given anaphor, we retain all and only those clusters containing the anaphor in both the gold partition and the system partition and apply the official scorer to them. Generally, the more frequently an anaphor occurs, the better its resolution performance is. Interestingly, for the "Others" category, dd-utt achieves the highest resolution results despite having the lowest recognition performance. In contrast, while UTD_NLP achieves the best recognition performance on average, its resolution results are among the worst.

Per-distance results. To better understand how resolution results vary with the utterance distance between a deictic anaphor and its antecedent, we show in Table 6 the number of correct and incorrect links predicted by the four models for each utterance distance on the test sets. For comparison purposes, we show at the top of the table the distribution of gold links over utterance distances. Note that a distance of 0 implies that the anaphor refers to the utterance in which it appears.
A few points deserve mention.First, the distribution of gold links is consistent with our intuition: a deictic anaphor most likely has the immediately preceding utterance (i.e., distance = 1) as its referent.In addition, the number of links drops as distance increases, and more than 90% of the antecedents are at most four utterances away from their anaphors.Second, although UTD_NLP recognizes more anaphors than the other models, it is the most conservative w.r.t.link identification, predicting the smallest number of correct and incorrect links for almost all of the utterance distances.Third, dd-utt is better than the other models at (1) identifying short-distance anaphoric dependencies, particularly when distance ≤ 1, and (2) positing fewer erroneous long-distance anaphoric dependencies.These results provide suggestive evidence of dd-utt's success at modeling recency and distance explicitly.Finally, these results suggest that resolution difficulty increases with distance: except for UTD_NLP, none of the models can successfully recognize a link when distance > 5. Ablation results.To evaluate the contribution of each extension presented in Section 5 to dd-utt's resolution performance, we show in Table 6: Distribution of links over the utterance distances between the anaphor and the antecedents.
extension at a time from dd-utt and retraining it.
For ease of comparison, we show in the first row of the table the CoNLL scores of dd-utt.
A few points deserve mention. First, when E1 (modeling recency) is ablated, we use as candidate antecedents the 10 highest-scoring candidate antecedents for each candidate anaphor according to s_c(x, y) (Equation (3)). Second, when one of E2, E3, E6, and E7 is ablated, we set the corresponding γ to zero. Third, when E4 is ablated, candidate anaphors are extracted in the same way as in coref-hoi and coref-hoi-utt, where the top spans learned by the model serve as candidate anaphors. Fourth, when E5 is ablated, E6 and E7 are also ablated, because the penalty functions p1 and p2 must be computed from the output of the type prediction model introduced in E5.
We use two-tailed Approximate Randomization to determine which of these ablated models is statistically different from dd-utt w.r.t. the Avg. score. Results show that except for the model in which E1 is ablated, all of the ablated models are statistically indistinguishable from dd-utt at the p < 0.05 level. Note that these results do not imply that nine of the extensions fail to contribute positively to dd-utt's resolution performance: (1) it only means that none of them is useful in the presence of the other extensions; (2) when the model is retrained, the learner manages to adjust the network weights so as to make up for the potential loss incurred by ablating an extension; and (3) large fluctuations in performance can be observed on individual datasets in some of the experiments, but they may disappear after averaging. Additional experiments are needed to determine the reason.

Error Analysis
Below we analyze the errors made by dd-utt.
DD anaphor recognition precision errors. A common type of recognition precision error involves misclassifying a coreference anaphor as a deictic anaphor. Consider the first example in Figure 2:

A: The design should minimize R_S_I and be easy to locate and we were still slightly ambivalent as to whether to use voice recognition there, though that did seem to be the favored strategy, but there was also, on the sideline, the thought of maybe having a beeper function.

Here, the pronoun "that" is a coreference anaphor with "voice recognition" as its antecedent but is misclassified as a deictic anaphor with the whole sentence as its antecedent. This type of error occurs because virtually all of the frequently occurring deictic anaphors, including "that", "it", "this", and "which", appear as a coreference anaphor in some contexts and as a deictic anaphor in others, and distinguishing between these two uses can be challenging.

DD anaphor recognition recall errors. Consider the second example in Figure 2:

A: Sounds like a blessed organization.
B: Yes, it does.
A: Did you know they've won over 7 different awards for their charitable work?
A: As a former foster kid, it makes me happy to see this place bring such awareness to the issues and needs of our young.
B: I am not surprised to hear that at all.

Here, "it" is a deictic anaphor that refers to the boldfaced utterance, but dd-utt fails to identify this and many other occurrences of "it" as deictic, probably because "it" is more likely to be a coreference anaphor than a deictic anaphor: in the dev sets, 80% of the occurrences of "it" are coreference anaphors while only 5% are deictic anaphors.

DD resolution precision errors. A major source of DD resolution precision errors can be attributed to the model's failure to properly understand the context in which a deictic anaphor appears. Consider the third example in Figure 2, in which "that" is a deictic anaphor that refers to the boldfaced utterance. While dd-utt correctly identifies "that" as a deictic anaphor, it erroneously posits the italicized utterance as its antecedent. This example is interesting in that, without looking at the boldfaced utterance, the italicized utterance is a plausible antecedent for "that", because "I am not surprised to hear that at all" can be used as a response to almost any statement. However, when both the boldfaced and italicized utterances are taken into consideration, it is clear that the boldfaced utterance is the correct antecedent: winning over seven awards for charitable work is certainly more surprising than seeing a place bring awareness to the needs of the young. Correctly resolving this anaphor thus requires modeling the emotional implication of its context.

Further Analysis
Next, we analyze the deictic anaphors correctly resolved by dd-utt but erroneously resolved by the baseline resolvers.
The example shown in Figure 3 is one such case. In this example, dd-utt successfully extracts the anaphor "that" and resolves it to the correct antecedent, "Losing one decimal place, that is okay". UTD_NLP fails to extract "that" as a deictic anaphor. While coref-hoi correctly extracts the anaphor, it incorrectly selects "You want your rating to be a two?" as the antecedent. Even a cursory look at this example suggests that this candidate is highly unlikely to be the correct antecedent, since it is 10 utterances away from the anaphor. As for coref-hoi-utt, the resolver successfully extracts the anaphor but incorrectly selects "Its just two point five for that one" as the antecedent, which, like the antecedent chosen by coref-hoi, is farther away from the anaphor than the correct antecedent is. We speculate that coref-hoi and coref-hoi-utt fail to identify the correct antecedent because they do not explicitly model distance and therefore may have no notion of how far a candidate antecedent is from the anaphor under consideration. The additional features that dd-utt has access to, including those that encode sentence distance as well as those that capture contextual information, may have helped dd-utt choose the correct antecedent, but additional analysis is needed to determine the reason.
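To make the distance feature concrete, the following is a minimal sketch of the kind of bucketed distance encoding commonly used in span-based coreference models; the bucket boundaries below are illustrative assumptions, not the actual hyperparameters of dd-utt.

```python
import bisect

# Illustrative bucket lower bounds (an assumption, not dd-utt's actual
# setting): nearby distances get their own bucket, while larger distances
# share exponentially wider buckets.
BUCKET_STARTS = [1, 2, 3, 4, 5, 8, 16, 32, 64]

def distance_bucket(distance: int) -> int:
    """Map a raw anaphor-antecedent utterance distance (>= 1) to a bucket
    index whose embedding the model can learn, letting it capture that,
    e.g., a 10-utterance gap is far less likely than a 1-utterance gap."""
    return bisect.bisect_right(BUCKET_STARTS, distance) - 1
```

Each bucket index would then be mapped to a learned embedding that is concatenated to the pairwise representation of the anaphor and the candidate antecedent.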

Conclusion
We presented an end-to-end discourse deixis resolution model that augments Lee et al.'s (2018) span-based entity coreference model with 10 extensions. The resulting model achieved state-of-the-art results on the CODI-CRAC 2021 datasets. We employed a variant of this model in our recent participation in the discourse deixis track of the CODI-CRAC 2022 shared task (Yu et al., 2022a) and achieved the best results (see Li et al. (2022) for details). To facilitate replicability, we make our source code publicly available.4

Limitations
Below we discuss several limitations of our work.
Generalization to corpora with clausal antecedents. As mentioned in the introduction, the general discourse deixis resolution task involves resolving a deictic anaphor to a clausal antecedent. The fact that our resolver can only resolve anaphors to utterances raises the question of whether it can be applied to resolve deictic anaphors in texts where antecedents can be clauses. To apply our resolver to such datasets, all we need to do is expand the set of candidate antecedents of an anaphor to include the clauses that precede it. While corpora annotated with clausal antecedents exist (e.g., TRAINS-91 and TRAINS-93), we note that the CODI-CRAC 2021 shared task organizers' decision to use utterances as the unit of annotation has to do with annotation quality, as inter-annotator agreement on the selection of clausal antecedents tends to be low (Poesio and Artstein, 2008).

Discourse deixis resolution in dialogue vs. narrative text. Whether our model will generalize well to non-dialogue datasets (e.g., narrative text) is unclear. Given the differences between dialogue and non-dialogue datasets (e.g., the percentage of pronominal anaphors), we speculate that the performance of our resolver will take a hit when applied to resolving deictic anaphors in narrative text.
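The candidate-set expansion for clausal antecedents described above can be sketched as follows. This is an illustration rather than dd-utt's actual candidate generator, and `split_into_clauses` is a hypothetical hook for any off-the-shelf clause segmenter.

```python
def candidate_antecedents(utterances, anaphor_utt_idx, split_into_clauses=None):
    """Enumerate candidate antecedents for a deictic anaphor occurring
    in utterances[anaphor_utt_idx].

    Without a clause segmenter, the candidates are the preceding
    utterances (as in our resolver); with one, each preceding utterance
    also contributes its individual clauses as finer-grained candidates.
    """
    candidates = []
    for utt in utterances[:anaphor_utt_idx]:
        candidates.append(utt)
        if split_into_clauses is not None:
            clauses = split_into_clauses(utt)
            if len(clauses) > 1:  # single-clause utterances are already included
                candidates.extend(clauses)
    return candidates
```

In practice one would also cap the candidate window (e.g., the previous k utterances) to keep the pairwise scoring tractable.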
Size of training data. We believe that the performance of our resolver is currently limited in part by the small amount of data on which it was trained. The annotated corpora available for training a discourse deixis resolver are much smaller than those available for training an entity coreference resolver (e.g., OntoNotes (Hovy et al., 2006)).
Data biases. Generally, our work should not cause any significant risks. However, applying our model to language varieties not present in its training data could amplify existing inequalities and contribute to misunderstandings.

A Detailed Experimental Results
We report the resolution results of the four resolvers (UTD_NLP, coref-hoi, coref-hoi-utt, and dd-utt) on the CODI-CRAC 2021 shared task test sets in terms of MUC, B³, and CEAFₑ scores in Table 8, and their mention extraction results in terms of recall (R), precision (P), and F-score (F) in Table 9.
Consider first the resolution results in Table 8. As can be seen, not only does dd-utt achieve the best CoNLL scores on all four datasets, but it does so via achieving the best MUC, B³, and CEAFₑ F-scores. In terms of MUC F-score, the performance difference between dd-utt and the second-best resolver on each dataset is substantial (2.2%–14.9% points). These results suggest that better link identification, which is what the MUC F-score reveals, is the primary reason for the superior performance of dd-utt. Moreover, Persuasion appears to be the easiest of the four datasets, as this is the dataset on which three of the four resolvers achieved their highest CoNLL scores. Note that Persuasion is also the dataset on which the differences in CoNLL score between dd-utt and the other resolvers are the smallest. These results seem to suggest that the performance gap between dd-utt and the other resolvers tends to widen as the difficulty of a dataset increases.
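Since this analysis leans on the MUC F-score, a minimal sketch of the link-based MUC metric may be helpful. This follows the standard Vilain et al. (1995) definition and is not code from our resolver or the shared task scorer.

```python
def muc_recall(key_clusters, response_clusters):
    """MUC recall: for each key cluster K, count the recovered links,
    |K| - |p(K)|, where p(K) partitions K by the response clusters
    (mentions missing from the response form singleton parts), and
    divide by the total number of key links, sum(|K| - 1)."""
    mention_to_resp = {}
    for i, cluster in enumerate(response_clusters):
        for m in cluster:
            mention_to_resp[m] = i
    num = den = 0
    for k in key_clusters:
        parts = {mention_to_resp.get(m, ("singleton", m)) for m in k}
        num += len(k) - len(parts)
        den += len(k) - 1
    return num / den if den else 0.0

def muc_f1(key_clusters, response_clusters):
    # MUC precision is MUC recall with the roles of key and response swapped.
    r = muc_recall(key_clusters, response_clusters)
    p = muc_recall(response_clusters, key_clusters)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Because MUC counts links rather than mentions, a higher MUC F-score directly reflects better anaphor-antecedent link identification.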
Next, consider the anaphor extraction results in Table 9. In terms of F-score, dd-utt lags behind UTD_NLP on two datasets, AMI and Switchboard. Nevertheless, the anaphor extraction precision achieved by dd-utt is among the highest on each dataset.

Figure 2 :
Figure 2: Examples illustrating the three major types of errors made by dd-utt.

Figure 3 :
Figure 3: Example in which the correct antecedent is identified by dd-utt but not by the baseline resolvers.

Table 1 :
Statistics on the datasets.

Table 2 :
Lists of filtered words.

Table 3 :
Resolution and recognition results on the four test sets (columns: LIGHT, AMI, Pers., Swbd., Avg.).

Table 4 :
Parameter values enabling dd-utt to achieve the best CoNLL score on each development set.

Table 7 :
Resolution results of ablated models.
(Dialogue from Figure 3:)
You want your rating to be a two?
A: Is that what you're saying?
B: Yeah, I just got it the other way.
B: Uh in Yep, I just got
A: Okay.
A: So, I'll work out the average for that again at the end.
A: It's very slightly altered.
Okay, and we're just waiting for
Losing one decimal place, that is okay.
A: