Signed Coreference Resolution

Coreference resolution is key to many natural language processing tasks and yet has been relatively unexplored in Sign Language Processing. In signed languages, space is primarily used to establish reference. Solving coreference resolution for signed languages would not only enable higher-level Sign Language Processing systems, but also enhance our understanding of language in different modalities and of situated references, which are key problems in studying grounded language. In this paper, we: (1) introduce Signed Coreference Resolution (SCR), a new challenge for coreference modeling and Sign Language Processing; (2) collect an annotated corpus of German Sign Language with gold labels for coreference, together with annotation software for the task; (3) explore features of hand gesture, iconicity, and spatial situated properties, and propose a set of linguistically informed heuristics and unsupervised models for the task; (4) put forward several proposals about ways to address the complexities of this challenge effectively.


Introduction
While signed languages are fully-fledged natural languages with sophisticated grammatical systems that are fully comparable to those of spoken languages (Emmorey, 2001), they are expressed in a completely different modality whose complexity has yet to be thoroughly studied and understood. Many of our current language technologies are not effective on signed languages, as natural language processing (NLP) modeling approaches are often based on linguistic theories of spoken languages and expect either speech or written text as input. This results in technology that may be inaccessible to Deaf people for whom signed languages are the primary means of communication and who strongly prefer using their native language over a spoken language (Padden and Humphries, 1988; Glickman and Hall, 2018); it is thus essential to extend NLP to signed languages. On the other hand, most recent research in Sign Language Processing (SLP) focuses on the visual component of signed languages and fails to address their linguistic challenges, such as coreference resolution (Yin et al., 2021a).
Our code, data and signed coreference annotation software are publicly available at https://github.com/kayoyin/scr.
Coreference resolution is a critical component of natural language understanding and higher-level NLP applications including information extraction, text summarization, and machine translation, yet it is completely unexplored for signed languages. Although coreference relations have been studied in sign linguistics, computational models fall short in this area. Resolving coreference in signed languages presents novel challenges, as the meaning of pronominal signs is highly dependent on discourse and spatial context (Cormier et al., 2010, 2013). Tackling this problem will help us gain a better understanding of how grounding is achieved across different types of natural languages and in multimodal communication, and broaden the ability of current NLP systems to handle multiple modalities. In addition, achieving automatic coreference resolution for signed languages will enable technologies for Sign Language Translation and educational tools for sign language learners, among many others.
In this paper, we introduce Signed Coreference Resolution (SCR) (Figure 1) as a new challenge for coreference resolution and SLP. We present how coreference is established in signed languages and explore its features of gesture, discourse and spatial grounding for modeling. We then develop software to annotate signed coreference and release DGS-Coref, a German Sign Language (DGS) dataset to evaluate SCR models. We propose a novel architecture based on multigraphs and linguistically-informed heuristics to perform unsupervised SCR, which we hope to extend to other signed languages, as all signed languages exhibit similar properties and methods to establish referents in space (McBurney, 2004). Finally, we discuss the complexities and important considerations to take into account for SCR, as well as suggestions for future directions of research. We believe that the development of SCR will provide an important stepping stone to sign language understanding.

Related Work
Coreference resolution aims to identify all references to the same entity in discourse and forms a core component of NLP. While automatic coreference resolution has been widely studied for various spoken languages (McCarthy and Lehnert, 1995;Pradhan et al., 2012), no existing work to our knowledge attempts to resolve coreference in signed phrases automatically. We refer to Mitkov (1999) for an overview of the early coreference resolution algorithms, Ng (2010) for the mention-pair model, entity-mention model, and ranking models, and Sukthanker et al. (2020) for a more recent survey of deep-learning based approaches.

Unsupervised Coreference Resolution
Collecting signed language data is costly, due to the limited availability of qualified signers and annotators, and the complexity of signing videos: 1 hour of a signed video can take up to 100 hours to annotate all manual and non-manual components, compared to 30 hours of annotation for speech (Dreuw and Ney, 2008). Due to the lack of existing annotated signed language data for coreference resolution, we adopt an unsupervised approach.
For unsupervised coreference resolution in spoken languages, earlier works are based on clustering (Cardie and Wagstaff, 1999; Angheluta et al., 2004) and unsupervised generative models (Haghighi and Klein, 2007; Ng, 2008; Charniak and Elsner, 2009; Ma et al., 2016). However, these approaches require unannotated training data to learn the model parameters, which is considerably difficult to obtain for the majority of signed languages. Multi-pass sieve systems (Haghighi and Klein, 2009; Raghunathan et al., 2010; Lee et al., 2011, 2013) were popular and effective before the advent of deep learning. However, the sieves used for English cannot be directly applied to signed languages: the linguistic tools they rely on, such as POS taggers, are not available for signed languages, the nature of sieves for SCR is unexplored, and it is unclear whether such an architecture provides an advantage in our setting. Martschat (2013) uses a multigraph-based approach that models a document as a graph, where edges between mentions are established through heuristics. Unlike other graph-based approaches, this method does not need to learn edge weights and therefore remains fully unsupervised. However, it uses constant edge weights, which do not account for features with variable strengths. We instead build on this approach by proposing novel linguistically-informed heuristics for signed languages, and assign continuous-valued edge weights according to the strength of pairwise mention features.

Coreference in Signed Languages
Coreference is a core property of natural language (Jackendoff, 2002) and signed language is no exception. Expressed in the visual modality, signed languages use space to maintain discourse coherence and refer back to previously mentioned entities (Liddell, 1980;Kegl, 1987). Moreover, research suggests that the ability to use space to ground referents is innate to humans (Coppola and So, 2006). Therefore, studying coreference in signed languages will give us a better understanding of fundamental phenomena of natural language and help us build tools in various communication systems that are expressed in the visual modality.

Pronominal Pointing Signs
Sign linguists generally recognize the existence of signs serving a pronominal function in various signed languages (e.g. Van Hoek (1992); Emmorey and Lillo-Martin (1995); Emmorey and Falgier (2004); Alibašić Ciciliani and Wilbur (2006); Cormier et al. (2010)). Referents of pronominal signs are often established in the signing space. The signer can point to the actual location of the referent, such as towards themselves for "I", towards the addressee for "you", or towards an entity in the same room for "he, she, they, it". For entities that are not present, the signer can assign a locus to the entity, then point at this locus for all mentions of the entity. For example, in Figure 1, the two characters Alice and Bob are introduced by fingerspelling their names on the left and right side of the signer respectively. To explicitly ground them in the signing space, the signer can also point to the assigned locus after each fingerspelling, although this is not always required. Then, instead of fingerspelling their names at each subsequent mention of one of the characters, the signer can simply point to the locus assigned to the character. Here, the indexing signs serve a similar pronominal function as "she" and "he" in English, and the visual space is heavily exploited to make referencing clear. As a result, the meaning of pronominal pointing signs is not stable and depends highly on context, and therefore coreference resolution is necessary to identify the antecedent of these signs.

Complexities of Pointing Signs
There are several complexities in pointing signs to consider during their modeling. For instance, besides a pronominal one, pointing signs may serve other functions as well: as an example, locative pointing signs point to a space to refer to that location instead of a referent (Özyürek et al., 2010), similarly to adverbs "here" and "there" in English, and determiner pointing signs occur in a noun phrase to assign a new locus to an entity (Cormier et al., 2013). Current SLP systems often process solely the local visual features of signs, such as handshape, facial expressions and movement, and therefore cannot disambiguate pointing signs with different meanings and functions that, removed from discourse and spatio-temporal context, have identical visual features.
In comparison with spoken languages, while English pronouns such as "he", "she" and "they" carry some meaning on their own, such as the gender or number of the referent, pronominal signs often use the same indexing handshape for personal pronouns, or an open hand with no spaces between fingers for possessive pronouns, regardless of the referent. On the other hand, while the same pronoun "she" can refer to two or more distinct entities at once (for example "My mother never liked Alice, she thought she was up to no good"), a given locus refers to at most one referent at a time. However, the same locus can be reassigned to different entities, and a signed entity may also be reassigned from one locus to another during a given discourse. Therefore, models must be able to handle long-term dependencies as well as detect and keep track of changes in entity-locus assignment.
Another notable feature of spatial grammar in signed languages is the role of iconicity. Loci in signed languages can simultaneously have a grammatical (i.e. pronominal) and a logical function. Signed languages observe an iconic semantics where some geometric properties of signs in signing space reflect those in real life, such as the relative positions or sizes of different entities (Schlenker, 2018). Therefore, studying the integration of iconicity and situated referents in signed communication will provide valuable insights in understanding grounded spoken language as well.
While signed languages can help us better understand multimodal communication and linguistic universals in general (Sandler and Lillo-Martin, 2006), some theories of coreference in spoken languages may be extended to signed languages as well. For instance, Steinbach and Onea (2016) extend the classical Discourse Representation Theory (Kamp et al., 2011) to DGS by incorporating the geometrical properties of loci in signing space where discourse referents are grounded. Moreover, Wienholz et al. (2020) find evidence of the first mention effect (Gernsbacher and Hargreaves, 1988) in DGS as well. This suggests that several properties of coreference observed in spoken languages are shared across modalities, which further motivates the development of linguistically-informed SLP models for NLP challenges. We therefore propose to extend the task of coreference resolution to signed languages.

Signed Coreference Resolution
We formalize the novel challenge of Signed Coreference Resolution (SCR) by decomposing it into two tasks:
Mention Detection. Given a video of signing S, we extract all mentions {m_1, m_2, ..., m_N}, that is, the signs or groups of signs in the video that refer to some entity. This task would first require the visual processing of multiple manual and non-manual features in the video to identify each sign, as well as the modeling of long-term dependencies between different signs to deduce mentions. A related existing task is Continuous Sign Language Recognition (CSLR) (Cui et al., 2017), which extracts all signed glosses from a video, though mention detection requires an additional step to group glosses and detect mentions.
There are two possible ways to perform this task: either mention detection is performed at once during visual processing, where a single pipeline outputs mentions from videos, or CSLR is first performed to extract all glosses, which are then analysed to identify mentions. The advantage of the first method is that it can make full use of all visual features for mention detection and avoids the bottleneck of an intermediate glossing step. However, SLP research is still in its infancy and CSLR alone is still an ongoing challenge, so mention detection may benefit from being decomposed into several parts. Signed language datasets used for SLP often contain gloss annotations (Hanke et al., 2020a); it is therefore possible to model mention detection directly on glosses to remove the overhead of visual processing.

Data
To evaluate SCR models, we develop a small dataset of a signed language with gold coreference labels. The Public DGS Corpus (Hanke et al., 2020b) is a dataset comprising 50 hours of annotated dialogue between two native signers of DGS. We use this dataset for the following reasons: (i) it is the largest publicly available dataset of a signed language containing gloss annotations at the time, which enables the extraction of enough instances of pronominal pointing to train our models; (ii) it is an open-domain collection of natural signing by 330 native signers, which more closely portrays signing in the real world than other datasets (Yin et al., 2021b); (iii) its annotations include pose estimations, specific glosses for different indexing signs as well as English and German translations, which we use during our modeling.
Although our study is limited to DGS, primarily because of the lack of adequate resources in other signed languages, research suggests all (studied) signed languages use signing space similarly to ground discourse entities and establish pronominal references (McBurney, 2004). Thus, we believe that the task we define and the modeling approach we propose are easily generalizable to other signed languages.
Figure 2: Our annotation interface for pronominal indexing signs. For each annotation, the link to the signing video is shown at the top of the page. Annotators are given the previous 7 sentences as the context, with both the English translations (left) and the gloss annotations (right) from the original dataset. The gloss of the sign to be annotated is underlined, and annotators can annotate all glosses shown on the screen that refer to the same entity as the underlined gloss by highlighting them. Annotators can also report their confidence level for each annotation.

Coreference Annotation
While existing annotated sign language datasets sometimes contain glosses for signs or translations of the signed phrase in a spoken language, none of them contain explicit annotation for coreference. We therefore enhance the gloss annotations from the Public DGS Corpus to construct our dataset.
To do so, we develop an annotation interface for signed languages (Figure 2), as existing annotation tools for signed languages, especially for targeted tasks such as SCR, are scarce. For each video, our software displays a signed sentence containing a pronominal sign to annotate, accompanied by its English translation. Because coreference can often span several sentences, the previous contextual sentences are also displayed, both in gloss form and in English. The annotator may also play the video of the phrase being signed along with timestamped gloss annotations. Annotations are submitted by highlighting all glosses shown that refer to the same entity as the underlined gloss. We hired ASL students who were paid $15/hour, and our data collection process was approved by our institution's human subject review board. We obtain an inter-annotator agreement of 93.93 in terms of MUC score (Vilain et al., 1995), which suggests high agreement.

DGS-Coref Dataset
We release the DGS-Coref Dataset, a subset of the Public DGS Corpus that has been enhanced with annotations for pronominal indexing coreference.

Model
In this initial study of SCR, we use DGS glosses and spatial features extracted from pose estimations of the signed phrases to remove the overhead of visual processing and model the linguistic aspect of the task while adequate sign language recognition resources are lacking. We jointly model mention detection and coreference resolution on glosses.

Unsupervised Continuous Multigraph
The backbone of our approach is the unsupervised multigraph coreference model of Martschat (2013). The advantage of this model is that it achieves competitive performance in unsupervised coreference resolution in English while not requiring large amounts of unannotated data to tune parameters, which are not always readily available in signed languages. Moreover, its architecture is flexible in allowing the modeling of features of varying importance, which is especially suited to the continuous nature of signing space.
We model the document as a directed labeled weighted multigraph D = (R, V, A, w), where R is the set of relation labels, V the set of nodes, A the set of edges and w the edge weight function. Two mentions m, n ∈ V are two nodes of the graph, and have a directed edge e = (m, n, r) ∈ A with weight w(e) and label r ∈ R if m precedes n and the relation r(m, n) holds true (Figure 3). Then, clustering is applied to the resulting multigraph to obtain the entity groups contained in the document.
By convention, we refer to sign glosses using all capitals.
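As a minimal sketch of this data structure (not the released implementation), the multigraph can be stored as a flat list of labeled, weighted edges between sign indices. The class and field names below (`Edge`, `Multigraph`, `total_weight`) and the relation label strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Edge:
    antecedent: int  # position of the earlier sign m in the document
    anaphor: int     # position of the later sign n
    relation: str    # label r in R (illustrative label strings)
    weight: float    # w(e)


@dataclass
class Multigraph:
    num_signs: int
    edges: List[Edge] = field(default_factory=list)

    def add_edge(self, m: int, n: int, relation: str, weight: float) -> None:
        # Edges are directed from an earlier sign to a later one.
        if not 0 <= m < n < self.num_signs:
            raise ValueError("an edge must run from an earlier sign to a later one")
        self.edges.append(Edge(m, n, relation, weight))

    def total_weight(self, m: int, n: int) -> float:
        # A multigraph may hold several edges for the same (m, n) pair,
        # one per relation that holds; their weights are summed.
        return sum(e.weight for e in self.edges
                   if (e.antecedent, e.anaphor) == (m, n))


graph = Multigraph(num_signs=4)
graph.add_edge(1, 3, "P_TemporallyCloseIndex", 0.25)
graph.add_edge(1, 3, "P_SpatiallyCloseIndex", 0.5)
print(graph.total_weight(1, 3))
```

Keeping parallel edges separate, rather than merging them into one weight, preserves which relations fired for a pair, which is useful when inspecting individual predictions.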

Relations
First, we define a set of relations that either suggests coreference between two candidate mentions, or provides constraints against possible coreference candidates. Previously explored coreference relations for spoken languages often rely on lexical heuristics and linguistic features such as syntactic dependencies, part-of-speech tags, or morphology. However, such features are currently not available for signed languages due to the lack of core NLP tools to provide them and the recency of linguistic studies on signed languages to develop such tools. Moreover, coreference is inherently expressed differently between spoken and signed languages, which motivates us to design a new set of indicators and constraints for coreference.
We propose the following heuristics as positive relations that indicate coreference:
(1) P IAndI The two signs are produced by the same signer and point to the signer's chest.
(2) P YouAndYou The two signs are produced by the same signer and point away from the signer's body towards the addressee.
(3) P IAndYou The two signs are produced by different signers, one points to the signer's chest and the other points away from the signer's body towards the addressee.
(4) P TemporallyCloseIndex The two signs are indexing signs produced by the same signer and have less than 10 signs between them.
(5) P NounPhrase If an indexing sign has no other indexing signs within the previous 10 signs, it is coreferent to the temporally closest previous sign, that is not a verb, produced by the same signer.
(6) P SpatiallyCloseIndex The two signs are indexing signs produced by the same signer and the Euclidean distance between the two locations of production is less than 50 pixels.
We also add constraints to coreference as negative relations:
(7) N IAndI The two signs are produced by different signers and point to the respective signer's chest.
(8) N YouAndYou The two signs are produced by different signers and point to the respective addressee.
(9) N IAndYou The two signs are produced by the same signer, one points to the signer's chest and the other points away from the signer's body towards the addressee.
(10) [...] towards the signer's chest or towards the addressee, and the other points to a third location.
(11) N SpatiallyFarIndex The two signs are indexing signs produced by the same signer and the Euclidean distance between the two loci of production is greater than 100 pixels.
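Several of the relations above can be written as simple pairwise predicates. The sketch below is illustrative only: the `Sign` record, its `target` values ("self", "addressee", "third") and the choice to treat any sign with a pointing target as an indexing sign are assumptions made for this example, not details taken from the dataset or the released code.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sign:
    signer: str                     # identifier of the person signing
    target: Optional[str]           # "self", "addressee", "third", or None if not a pointing sign
    position: Tuple[float, float]   # pixel coordinates where the sign is produced


def p_i_and_i(m: Sign, n: Sign) -> bool:
    # (1) same signer, both signs point to the signer's chest
    return m.signer == n.signer and m.target == n.target == "self"


def p_i_and_you(m: Sign, n: Sign) -> bool:
    # (3) different signers, one points to self and the other to the addressee
    return m.signer != n.signer and {m.target, n.target} == {"self", "addressee"}


def n_i_and_i(m: Sign, n: Sign) -> bool:
    # (7) different signers, each pointing to their own chest
    return m.signer != n.signer and m.target == n.target == "self"


def _same_signer_pointing(m: Sign, n: Sign) -> bool:
    return m.signer == n.signer and m.target is not None and n.target is not None


def p_spatially_close_index(m: Sign, n: Sign, threshold: float = 50.0) -> bool:
    # (6) same signer, Euclidean distance below 50 pixels
    return _same_signer_pointing(m, n) and math.dist(m.position, n.position) < threshold


def n_spatially_far_index(m: Sign, n: Sign, threshold: float = 100.0) -> bool:
    # (11) same signer, Euclidean distance above 100 pixels
    return _same_signer_pointing(m, n) and math.dist(m.position, n.position) > threshold


a = Sign("A", "self", (0.0, 0.0))
b = Sign("A", "self", (3.0, 4.0))
c = Sign("B", "self", (0.0, 0.0))
d = Sign("B", "addressee", (200.0, 0.0))
print(p_i_and_i(a, b), n_i_and_i(a, c), p_i_and_you(a, d))
```

Distances between 50 and 100 pixels trigger neither spatial relation, so such pairs are left to the other heuristics.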

Weight Assignment
For all negative relations, we assign the weight w(e) = −∞ as they are hard constraints for coreference. For binary positive relations (relations 1-3), we assign a fixed weight w(e) = 0.5.
Because spoken languages are discrete in nature, it is reasonable that previous work models coreference with fixed weights. However, in signed languages, referents are grounded in continuous time and space, and we hypothesize that the temporal or spatial proximity of signs is a strong signal for coreference. Therefore, we introduce a novel continuous weighting system to our model. For (4) P TemporallyCloseIndex and (5) P NounPhrase, if the signs m and n have k < 10 signs between them, the assigned weight is w(e) = (10 − k)/20. For (6) P SpatiallyCloseIndex, if the Euclidean distance between the signs m and n is k < 50 pixels, the assigned weight is w(e) = (50 − k)/50. We assign stronger weights to spatially close indexing signs than to temporally close ones, based on the hypothesis that referencing in signed languages is mostly grounded in space.
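The weighting scheme can be sketched as follows. The function and constant names are hypothetical, and returning 0.0 outside the stated ranges is an assumption made for illustration (in the model, the corresponding relation simply does not hold there). Note that the temporal weight peaks at 0.5 while the spatial weight peaks at 1.0, which is how spatial proximity ends up outweighing temporal proximity.

```python
NEGATIVE_WEIGHT = float("-inf")  # hard constraint for every negative relation
BINARY_POSITIVE_WEIGHT = 0.5     # fixed weight for relations (1)-(3)


def temporal_weight(signs_between: int) -> float:
    """Weight for relations (4) and (5): (10 - k) / 20 for k < 10 intervening signs."""
    if 0 <= signs_between < 10:
        return (10 - signs_between) / 20
    return 0.0  # assumption: no edge contribution outside the window


def spatial_weight(distance_px: float) -> float:
    """Weight for relation (6): (50 - k) / 50 for a distance k < 50 pixels."""
    if 0 <= distance_px < 50:
        return (50 - distance_px) / 50
    return 0.0  # assumption: no edge contribution beyond the threshold


print(temporal_weight(0), temporal_weight(4))  # adjacent signs vs. 4 signs apart
print(spatial_weight(0), spatial_weight(25))   # same location vs. 25 pixels apart
# A single negative relation overrides any sum of positive weights.
print(NEGATIVE_WEIGHT + BINARY_POSITIVE_WEIGHT + spatial_weight(0))
```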

Clustering
We apply 1-nearest-neighbor clustering on the obtained multigraph to identify coreferent signs: for every sign n, its candidate antecedents are all signs m such that there exists at least one edge e = (m, n, r) ∈ A, and the sum of edge weights between m and n is strictly positive. n is a mention if it has at least one candidate. If n is a mention, the antecedent of n is the candidate whose sum of edge weights with n is maximal. Ties for antecedents are broken by selecting the closest sign temporally.
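The clustering step can be sketched as below, assuming edges are given as `(m, n, relation, weight)` tuples over sign indices in document order. The function names are hypothetical, and using the index gap as a stand-in for temporal distance when breaking ties is an assumption of this example.

```python
from collections import defaultdict


def resolve_antecedents(num_signs, edges):
    """Pick, for each sign n, the preceding sign with the largest positive summed weight."""
    totals = defaultdict(float)
    for m, n, _relation, weight in edges:
        totals[(m, n)] += weight  # a single -inf edge makes the whole sum -inf

    antecedent = {}
    for n in range(num_signs):
        candidates = [(totals[(m, n)], m) for m in range(n)
                      if totals.get((m, n), 0.0) > 0]
        if candidates:
            # max over (weight, index): highest weight first, and among ties
            # the largest index, i.e. the closest preceding sign.
            antecedent[n] = max(candidates)[1]
    return antecedent


def build_clusters(antecedent):
    """Group signs into entities by following antecedent links back to the first mention."""
    def first_mention(sign):
        while sign in antecedent:  # terminates because antecedent[n] < n
            sign = antecedent[sign]
        return sign

    groups = defaultdict(list)
    for sign in sorted(set(antecedent) | set(antecedent.values())):
        groups[first_mention(sign)].append(sign)
    return sorted(groups.values())


edges = [
    (0, 2, "P_IAndI", 0.5),
    (1, 2, "P_TemporallyCloseIndex", 0.45),
    (1, 3, "P_SpatiallyCloseIndex", 0.8),
    (1, 3, "N_SpatiallyFarIndex", float("-inf")),  # vetoes the pair (1, 3)
    (0, 3, "P_IAndI", 0.5),
    (2, 3, "P_TemporallyCloseIndex", 0.5),         # ties with (0, 3); closer sign wins
]
links = resolve_antecedents(4, edges)
print(links)
print(build_clusters(links))
```

Signs that never appear as an anaphor or an antecedent are not treated as mentions and therefore do not show up in any cluster.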

Results
In this section, we discuss the strengths and limitations of our approach. As SCR is a new challenge with no existing baseline, our proposed unsupervised model presents a strong baseline for subsequent works.

Quantitative Evaluation
We evaluate our system on commonly used metrics for coreference resolution in spoken languages: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), and CEAFe (Luo, 2005). We use the official CoNLL shared task scorer. Table 1 shows the full results of our model. We achieve a mean F1 of 50.54 across all indexing signs. We achieve a high mean F1 of 92.2 on I and YOU signs, which is expected as they contain low ambiguity in meaning. On INDEX signs, where the model must keep track of spatial coherence in discourse and resolve different loci, we achieve 26.9 mean F1, which shows there is still much room for improvement to disambiguate third-person indexing signs.
INDEX signs obtain the lowest F1 on the MUC metric, which focuses on the links between pairs of mentions and therefore especially penalizes extra or missing links in the prediction. On the other hand, INDEX signs obtain the highest F1 on the B³ metric, which is mention-based: scores are computed based on individual mentions rather than links. This can lead to the mention identification effect (Moosavi and Strube, 2016), where the metric unreliably rewards mentions that are correctly identified but linked to the wrong entity, and suggests that our model may be able to detect mentions accurately but is weaker at finding the correct links.
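To make the mention-based nature of B³ concrete, here is a simplified sketch of the metric. It is not the official CoNLL scorer used for the reported numbers, and it assumes that the gold (key) and predicted (response) clusterings cover exactly the same set of mentions; the function name `b_cubed` is hypothetical.

```python
def b_cubed(key, response):
    """Simplified B-cubed: per-mention precision and recall, averaged over mentions."""
    key_of = {m: frozenset(cluster) for cluster in key for m in cluster}
    response_of = {m: frozenset(cluster) for cluster in response for m in cluster}
    if key_of.keys() != response_of.keys():
        raise ValueError("this sketch assumes identical mention sets")

    n = len(key_of)
    # For each mention, compare the gold cluster and the predicted cluster containing it.
    precision = sum(len(key_of[m] & response_of[m]) / len(response_of[m]) for m in key_of) / n
    recall = sum(len(key_of[m] & response_of[m]) / len(key_of[m]) for m in key_of) / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [{1, 2, 3}, {4}]
predicted = [{1, 2}, {3, 4}]  # mention 3 is attached to the wrong entity
p, r, f = b_cubed(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case one wrong link lowers the scores of every mention in the affected clusters only partially, which illustrates why a mention-based metric can remain comparatively high when links are wrong.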

Table 2: Qualitative analysis of model outputs, with relations that were applied for the prediction and the video frames of the glosses in bold. Bold glosses are mentions predicted by our model as coreferent. Underlined glosses are ground-truth coreferent mentions. English translations are provided in italics.

Glosses | Relation | English translation
TO-SEE YOU GOOD YOU | (2) P YouAndYou | I think you could do a good job there.
GEST-DECLINE1 I CAN NOT TO-SAY TO-HOLD-ON I | (1) P IAndI, (3) P IAndYou | I can't keep that promise.
STUTTGART NUM-1 NAME INDEX NUM-1 FREIBURG | (5) P NounPhrase | Once we were in Stuttgart, once in Ingolstadt and once in Freiburg.
WITH TRIP INDEX SHIP INDEX | (4) P TemporallyCloseIndex | We went there with an excursion boat.

Qualitative Analysis
To go beyond the limitations of automatic coreference metrics and investigate how our system handles various phenomena in pronominal indexing signs, we perform a qualitative analysis of our model outputs. In Table 2, we give examples of our model outputs and the gold annotations. The first example shows how most coreference relations with I and YOU are effectively handled by our system. The second example demonstrates how the model can detect the introduction of a new referent to the discourse and signing space. In the third example, the model successfully resolves the two indexing signs as coreferent, due to their temporal and spatial proximity.
In the last example, the model fails to identify "Hamburg" being introduced as a new referent. Instead, it resolves the second INDEX to the first, as relations (4) and (6) give stronger weights to the multi-edge between the two indexing signs than relation (5) does to the edge between HAMBURG and INDEX. In general, the main weakness of our model is choosing correctly between an antecedent candidate that is a spatially close indexing sign and another candidate that marks the introduction of a new referent. To overcome this challenge, we believe that a more sophisticated system to model the deeper meaning of the signed phrase is needed.

Discussion
We now discuss phenomena that are beyond the scope of this initial study, but that are important challenges and considerations to take for future efforts in SCR.
Naturally, signers may reassign a locus to a new referent. Our current approach does not explicitly address this, and can only capture a reassignment if it occurs after the locus has gone unused for an extended period, which is not always the case. Future approaches need to be able to detect when a change of referent for a locus occurs.
As discussed in §3, not all indexing signs are pronominal: some serve a locative function, where the sign is not necessarily coreferent with another sign in discourse but is used to refer to a physical location in space. Future work should therefore be able to distinguish the different functions of indexing signs.
Finally, the partitioning of signing space is dynamic (Steinbach and Onea, 2016). For example, when there are only two referents established, the locus assigned to each can be relatively large without causing ambiguity, such as the first referent being assigned the right half, and the second the left half of the signing space. As more referents are introduced, the signing space is partitioned into smaller loci. Therefore, what constitutes two indexing signs that are "spatially close enough" to be pointing to the same locus depends on the evolution of the discourse, whereas our approach maintains the same heuristic on spatial relations throughout discourse.

Conclusions and Future Work
We present a new challenge for automatically resolving and evaluating coreference in signed languages. We also release the first dataset in German Sign Language with gold labels for coreference resolution, as well as a web interface to annotate coreference in signed languages. Finally, we propose a novel model to perform unsupervised coreference resolution that relies on a multigraph-based architecture with new, linguistically-informed heuristics which provides a strong baseline for this task.
Our paper performs coreference resolution on glosses to remove the overhead of visual processing and focus on the purely linguistic aspect of signed coreference. Future work involves modeling approaches that process signing videos directly, which may more closely reflect real-world applications. We also leave for future work the resolution of non-indexing signs that may also serve a pronominal function, such as body shift and facial markers. Our work can additionally be extended to studying other types of ambiguous signs, such as directional verbs where the subject and/or object are not explicitly signed but grounded in space.
This task also provides the opportunity to explore ways in which studying SCR can benefit spoken language understanding, particularly multimodal communication where meaning in spoken languages can also be conveyed through the visual modality, such as co-speech indexing gestures. We also hope that future efforts towards SCR and SLP in general, through close collaboration with signing communities, result in assistive technology that can help deaf students in education, research, and everyday communication in their preferred language.