End-to-End AMR Coreference Resolution

Although parsing to Abstract Meaning Representation (AMR) has become popular and AMR has been shown effective on many sentence-level downstream tasks, little work has studied how to generate AMRs that can represent multi-sentence information. We introduce the first end-to-end AMR coreference resolution model for building multi-sentence AMRs. Compared with previous pipeline and rule-based approaches, our model alleviates error propagation and is more robust in both in-domain and out-of-domain settings. In addition, the document-level AMRs produced by our model significantly improve over the AMRs generated by a rule-based method (Liu et al., 2015) on text summarization.

Existing work on AMR mainly focuses on individual sentences (Lyu and Titov, 2018; Naseem et al., 2019; Ge et al., 2019; Cai and Lam, 2020a; Zhou et al., 2020). On the other hand, with the advance of neural networks in NLP, tasks involving multiple sentences with cross-sentence reasoning (e.g., text summarization, reading comprehension and dialogue response generation) have received increasing research attention.

[Figure 1: Multi-sentence AMR example, where nodes with the same non-black color are coreferential and the dotted ellipse represents an implicit role coreference.]

Given the effectiveness of AMR on sentence-level tasks (Pan et al., 2015; Rao et al., 2017; Issa Alaa Aldine et al., 2018; Song et al., 2019b), it is important to extend sentence-level AMRs to the multi-sentence level. To this end, a prerequisite step is AMR coreference resolution, which aims to find the AMR components that refer to the same entity. Figure 1 shows the AMR graphs of two consecutive sentences in a document. An AMR coreference resolution model needs to identify two coreference cases: "he" refers to "Bill" in the first graph, and "arrive-01" omits an argument ":arg3" that refers to "Paris".
Relatively little research has been done on AMR coreference resolution. Initial attempts (Liu et al., 2015) merge the nodes that have the same surface string. To minimize noise, only named entities and date entities are considered, and non-identical coreferent nodes (e.g., "Bill" and "he" in Figure 1), which are also frequent in real-life situations, are not merged. Subsequent work considers more coreference cases by either manually annotating AMR coreference information (O'Gorman et al., 2018) or taking a pipeline system (Anikina et al., 2020) consisting of a textual coreference resolution model (Lee et al., 2018) and an AMR-to-text aligner (Flanigan et al., 2014). Yet there is little research on automatically resolving coreference ambiguities directly on AMR, making use of AMR graph-structural features.
In this work, we formulate AMR coreference resolution as a missing-link prediction problem over AMR graphs, where the input consists of multiple sentence-level AMRs, and the goal is to recover the missing coreference links connecting the AMR nodes that refer to the same entity. There are two types of links. The first type corresponds to the standard situation, where the edge connects two entity nodes (e.g., "Bill" and "he" in Figure 1) that refer to the same entity. The second type is implicit role coreference, where one node (e.g., "Paris" in Figure 1) is a dropped argument (":arg3") of another predicate node ("arrive-01").
We propose an AMR coreference resolution model by extending an end-to-end text-based coreference resolution model (Lee et al., 2017). In particular, we use a graph neural network to represent input AMRs for inducing expressive features. To enable cross-sentence information exchange, we make connections between sentence-level AMRs by linking their root nodes. Besides, we introduce a concept identification module to distinguish functional graph nodes (non-concept nodes, e.g., "person" in Figure 1), entity nodes (e.g., "Bill"), verbal nodes with implicit role (e.g., "arrive-01") and other regular nodes (e.g., "leave-11") to help improve the performance. The final antecedent prediction is conducted between the selected nodes and all their possible antecedent candidates, following previous work on textual coreference resolution (Lee et al., 2017).
Experiments on the MS-AMR benchmark (O'Gorman et al., 2018), which consists of gold coreference links on gold AMRs, show that our model outperforms competitive baselines by a large margin. To verify the effectiveness and generalization of our proposed model, we annotate an out-of-domain test set over the gold AMR Little Prince 3.0 data following the guidelines of O'Gorman et al. (2018), and the corresponding results show that our model is consistently more robust than the baselines in domain-transfer scenarios. Finally, results on document abstractive summarization show that our document AMRs lead to much better summary quality than the document AMRs built by Liu et al. (2015). This further verifies the practical value of our approach. Our code and data are available at https://github.com/Sean-Blank/AMRcoref

Model
Formally, an input instance of AMR coreference resolution consists of multiple sentence-level AMRs G_1, G_2, ..., G_n, where each G_i can be written as G_i = ⟨V_i, E_i⟩, with V_i and E_i representing the corresponding nodes and edges of G_i. We consider a document-level AMR graph Ĝ = [G_1, G_2, ..., G_n; ê_1, ê_2, ..., ê_m], where each ê_i is a coreference link connecting two nodes from different sentence-level AMRs. The task of AMR coreference resolution aims to recover ê_1, ..., ê_m, which are missing from the inputs. Figure 2 shows the architecture of our model, which consists of a graph encoder (§2.1), a concept identifier (§2.2), and an antecedent prediction module (§2.3).
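As an illustrative sketch, the input AMRs and the coreference links to be recovered can be held in a simple structure like the following (the class name, node indices and link encoding are our own hypothetical choices, not from the released code):

```python
# Hypothetical sketch of the task's input/output structure.
class AMRGraph:
    def __init__(self, nodes, edges):
        self.nodes = nodes  # concept names, e.g. ["leave-11", "person", ...]
        self.edges = edges  # (src_idx, label, tgt_idx) triples

# Input: n sentence-level AMRs (toy versions of the Figure 1 graphs).
g1 = AMRGraph(["leave-11", "person", "Bill"], [(0, ":arg0", 1), (1, ":name", 2)])
g2 = AMRGraph(["arrive-01", "he", "Paris"], [(0, ":arg1", 1)])

# Goal: recover the cross-sentence coreference links e_1..e_m, here the
# single link connecting "Bill" (graph 0) and "he" (graph 1).
coref_links = [((0, 2), (1, 1))]  # ((graph_idx, node_idx), (graph_idx, node_idx))
```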

Representing Input AMRs using GRN
Given sentence-level AMRs G_1, ..., G_n as the input, randomly initialized word embeddings are adopted to represent each node v_k as a dense vector e_k. To alleviate data sparsity and to obtain better node representations, character embeddings e^char_k are computed using a character-level CNN. We concatenate the e_k and e^char_k embeddings for each concept and apply a linear projection to form the initial representation:

x_k = W_node [e_k ; e^char_k] + b_node,   (1)

where W_node and b_node are model parameters.
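This initialization can be sketched in numpy with toy dimensions; the character-level CNN is replaced here by a crude character max-pool, and all names and shapes are our assumptions rather than the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_char, d_model = 8, 4, 16

# Toy character embedding table (the real model learns these jointly).
char_table = {c: rng.standard_normal(d_char) for c in "Bil"}

def char_cnn(token):
    # Crude stand-in for the character-level CNN: max-pool over the
    # character embeddings of the token.
    return np.max([char_table[c] for c in token], axis=0)

def init_node_repr(token):
    # Concatenate word and character embeddings, then apply the linear
    # projection (Eq. 1).
    e_k = word_emb[token]
    e_char_k = char_cnn(token)
    return W_node @ np.concatenate([e_k, e_char_k]) + b_node

word_emb = {"Bill": rng.standard_normal(d_word)}
W_node = rng.standard_normal((d_model, d_word + d_char))
b_node = rng.standard_normal(d_model)

x_bill = init_node_repr("Bill")  # one d_model-dimensional node vector
```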
To enable global information exchange across different sentence-level AMRs, we construct a draft document-level graph by connecting the root nodes of the AMR subgraphs, as shown in Figure 2. This is important because AMR coreference resolution involves cross-sentence reasoning. We then adopt a Graph Recurrent Network (GRN; Beck et al., 2018) to obtain rich document-level node representations. GRN is one type of graph neural network that iteratively updates its node representations with the message passing framework (Scarselli et al., 2009).

Message passing In the message passing framework, a node v_k receives information from its directly connected neighbor nodes at each layer l. We use a hidden state vector h^l_k to represent each node, and the initial state h^0_k is defined as a vector of zeros.
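Before message passing begins, the draft document-level graph described above can be built as in this sketch. The paper only states that root nodes are connected, so the chain-style linking of consecutive roots and the edge label used here are our assumptions:

```python
def build_draft_document_graph(amr_graphs):
    """Merge sentence-level AMRs into one document graph and link the
    roots of consecutive sentences (linking scheme and edge label are
    hypothetical; the paper only says roots are connected)."""
    doc_nodes, doc_edges, roots = [], [], []
    for g in amr_graphs:
        offset = len(doc_nodes)
        doc_nodes.extend(g["nodes"])
        doc_edges.extend((s + offset, lbl, t + offset) for s, lbl, t in g["edges"])
        roots.append(offset)  # assume node 0 of each graph is its root
    for r1, r2 in zip(roots, roots[1:]):
        doc_edges.append((r1, ":same-doc", r2))  # cross-sentence link
    return doc_nodes, doc_edges

# Toy usage with two one-edge graphs.
g1 = {"nodes": ["leave-11", "person"], "edges": [(0, ":arg0", 1)]}
g2 = {"nodes": ["arrive-01", "he"], "edges": [(0, ":arg1", 1)]}
nodes, edges = build_draft_document_graph([g1, g2])
```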
In the first step of each message passing layer, the concept representation of each neighbor of v_k is combined with the corresponding edge representation to form a message x_{k,j}. This is because edges contain semantic information that is important for learning global representations and subsequent reasoning. Formally, a neighbor v_j of node v_k is represented as

x_{k,j} = [x_j ; e^label_{k,j}],   (2)

where x_j is the initial representation of v_j (Eq. 1) and e^label_{k,j} denotes the label embedding of the edge from node v_k to v_j.
Next, representations of neighboring nodes from the incoming and outgoing directions are aggregated:

x̄^in_k = Σ_{j ∈ N_in(k)} x_{k,j},   x̄^out_k = Σ_{j ∈ N_out(k)} x_{k,j},   (3)

where N_in(k) and N_out(k) denote the sets of incoming and outgoing neighbors of v_k, respectively.
Similarly, the hidden states from incoming and outgoing neighbors are also summed up:

h̄^in_k = Σ_{j ∈ N_in(k)} h^{l−1}_j,   h̄^out_k = Σ_{j ∈ N_out(k)} h^{l−1}_j,   (4)

where h^{l−1}_j denotes the hidden state vector of node v_j at the previous layer l−1. Finally, the message passing from layer l−1 to l is conducted following the gated operations of an LSTM (Hochreiter and Schmidhuber, 1997):

i^l_k = σ(W^m_i [h̄^in_k ; h̄^out_k] + W^x_i [x̄^in_k ; x̄^out_k] + b_i),
o^l_k = σ(W^m_o [h̄^in_k ; h̄^out_k] + W^x_o [x̄^in_k ; x̄^out_k] + b_o),
f^l_k = σ(W^m_f [h̄^in_k ; h̄^out_k] + W^x_f [x̄^in_k ; x̄^out_k] + b_f),
u^l_k = tanh(W^m_u [h̄^in_k ; h̄^out_k] + W^x_u [x̄^in_k ; x̄^out_k] + b_u),
c^l_k = f^l_k ⊙ c^{l−1}_k + i^l_k ⊙ u^l_k,
h^l_k = o^l_k ⊙ tanh(c^l_k),   (5)

where i^l_k, o^l_k and f^l_k are the input, output and forget gates that control information flow from different sources, u^l_k represents the input messages, c^l_k is the cell vector that records memory, and c^0_k is also initialized as a vector of zeros. W^m_z, W^x_z and b_z (z ∈ {i, o, f, u}) are model parameters. We stack L GRN layers in total, where L is determined by a development experiment. The output h^L_k at layer L is adopted as the representation of each node v_k for subsequent modules.
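One layer of this gated message passing can be sketched in numpy as follows. This is a simplified illustration under our reading of the update rules; parameter names and shapes are assumptions, not the released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grn_layer(h, c, x_msg, params, in_nbrs, out_nbrs):
    """One GRN message-passing layer.

    h, c   : [num_nodes, d] hidden and cell states from the previous layer
    x_msg  : dict (k, j) -> message vector built from node and edge features
    in_nbrs, out_nbrs : dict node -> list of neighbor indices
    """
    num_nodes, d = h.shape
    h_new, c_new = np.zeros_like(h), np.zeros_like(c)
    for k in range(num_nodes):
        # Aggregate incoming/outgoing messages and hidden states.
        x_in = sum((x_msg[(k, j)] for j in in_nbrs[k]), np.zeros(d))
        x_out = sum((x_msg[(k, j)] for j in out_nbrs[k]), np.zeros(d))
        h_in = sum((h[j] for j in in_nbrs[k]), np.zeros(d))
        h_out = sum((h[j] for j in out_nbrs[k]), np.zeros(d))
        m = np.concatenate([h_in, h_out])
        x = np.concatenate([x_in, x_out])
        # LSTM-style gated update with input, output and forget gates.
        i = sigmoid(params["Wm_i"] @ m + params["Wx_i"] @ x + params["b_i"])
        o = sigmoid(params["Wm_o"] @ m + params["Wx_o"] @ x + params["b_o"])
        f = sigmoid(params["Wm_f"] @ m + params["Wx_f"] @ x + params["b_f"])
        u = np.tanh(params["Wm_u"] @ m + params["Wx_u"] @ x + params["b_u"])
        c_new[k] = f * c[k] + i * u
        h_new[k] = o * np.tanh(c_new[k])
    return h_new, c_new

# Tiny usage example: two nodes with a single edge 0 -> 1.
rng = np.random.default_rng(0)
d = 4
params = {f"{w}_{z}": rng.standard_normal((d, 2 * d))
          for w in ("Wm", "Wx") for z in "iofu"}
params.update({f"b_{z}": np.zeros(d) for z in "iofu"})
h0, c0 = np.zeros((2, d)), np.zeros((2, d))
msgs = {(0, 1): rng.standard_normal(d), (1, 0): rng.standard_normal(d)}
h1, c1 = grn_layer(h0, c0, msgs, params,
                   in_nbrs={0: [], 1: [0]}, out_nbrs={0: [1], 1: []})
```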
Concept Identification

Formally, the concept representation h^L_k from the top GRN layer is concatenated with a learnable type embedding e^type_k(t) of type t for each concept v_k, and the corresponding type score s^type_k(t) is computed using a feed-forward network:

s^type_k(t) = W_type FFNN([h^L_k ; e^type_k(t)]),   (6)

where W_type is a mapping matrix and e^type_k(t) is a randomly initialized concept-type embedding. A probability distribution P(t|v_k) over all concept types T for each concept v_k is then calculated using a softmax layer:

P(t|v_k) = exp(s^type_k(t)) / Σ_{t′ ∈ T} exp(s^type_k(t′)).   (7)

Finally, we predict the type t*_k for each concept and use it to filter the input nodes. In particular, functional concepts are dropped directly, and the other concepts (i.e., ent, ver_0, ver_1, ver_2, reg) are selected as candidate nodes for antecedent prediction.
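The type scoring and filtering step can be sketched as follows. This is a simplified stand-in: a single linear layer replaces the feed-forward network, and the "func" label we use for functional nodes is our own naming:

```python
import numpy as np

# Concept types: "func" marks functional (non-concept) nodes, the rest
# ("ent", "ver0".."ver2", "reg") become antecedent candidates.
TYPES = ["func", "ent", "ver0", "ver1", "ver2", "reg"]

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def predict_type(h_k, type_emb, W_type):
    # Score each type by pairing the node state with its learnable type
    # embedding, then normalize with a softmax.
    scores = np.array([W_type @ np.concatenate([h_k, type_emb[t]])
                       for t in TYPES])
    probs = softmax(scores)
    return TYPES[int(np.argmax(probs))], probs

rng = np.random.default_rng(1)
d, d_type = 8, 4
type_emb = {t: rng.standard_normal(d_type) for t in TYPES}
W_type = rng.standard_normal(d + d_type)

t_star, probs = predict_type(rng.standard_normal(d), type_emb, W_type)
keep = t_star != "func"  # functional concepts are dropped
```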

Antecedent Prediction
Given a node v_k selected by the concept identifier, the goal is to predict its antecedent y_k from all possible candidate nodes Y_k = {ε, y_π, ..., y_{k−1}}, where a dummy antecedent ε is adopted for nodes that are not coreferent with any previous concept. Here π = max(1, k − ψ), where ψ is the maximum number of antecedents considered as candidates. As noted in previous work on textual coreference resolution (Lee et al., 2017), considering too many candidates can hurt the final performance, so we conduct development experiments to decide the best ψ. The predicted coreference links implicitly determine the coreference clusters.
2 We do not model other :argx roles to avoid the long-tail issue.
Type information from §2.2 can help guide antecedent prediction and ensure global type consistency. We combine the node hidden vector and its type representation into the final concept state:

g_k = [h^L_k ; e^type_k(t*_k)],   (8)

where e^type_k(t*_k) denotes the learned embedding of the predicted concept type of node v_k.
Similar to Lee et al. (2017), the goal of the antecedent prediction module is to learn a distribution Q(y_k) over the antecedents of each node v_k:

Q(y_k = y) = exp(s(k, y)) / Σ_{y′ ∈ Y_k} exp(s(k, y′)),   (10)

where s(k, a) computes a coreference link score for each concept pair (v_k, v_a):

s(k, a) = s_m(k) + s_m(a) + s_an(k, a).   (11)

Here a < k, and s_m(k) scores whether concept v_k is a mention involved in a coreference link. It is calculated using a feed-forward network over the final concept state:

s_m(k) = FFNN_m(g_k).   (12)

s_an(k, a) indicates whether mention v_a is an antecedent of v_k and measures the semantic similarity between v_k and v_a, computed with rich features using a feed-forward network:

s_an(k, a) = FFNN_an([g_k ; g_a ; g_k ◦ g_a ; φ(k, a)]),   (13)

where ◦ denotes element-wise multiplication for each mention pair (v_k, v_a), and the feature vector φ(k, a) encodes the normalized distance between the two mentions and the speaker information if available. Following Lee et al. (2017), we normalize the distance values by grouping them into the buckets [1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+]. All features (speaker, distance, concept type) are randomly initialized 32-dimensional embeddings jointly learned with the model.
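The distance bucketing is fully specified by the text; the additive score composition with a fixed zero score for the dummy antecedent follows Lee et al. (2017) and should be read as an assumption here:

```python
def distance_bucket(d):
    """Map a mention-pair distance into the 9 buckets used by the model:
    [1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+]."""
    if d <= 4:
        return d - 1  # distances 1..4 get their own bucket
    for i, upper in enumerate((7, 15, 31, 63)):
        if d <= upper:
            return 4 + i
    return 8

def link_score(k, a, s_m, s_an):
    # Additive composition following Lee et al. (2017); the dummy
    # antecedent (a is None) receives a fixed score of 0.
    if a is None:
        return 0.0
    return s_m[k] + s_m[a] + s_an[(k, a)]

# Toy scores for mention 2 with candidate antecedents {dummy, 0, 1}.
s_m = {0: 0.5, 1: -1.0, 2: 1.0}
s_an = {(2, 0): 2.0, (2, 1): 0.1}
scores = {a: link_score(2, a, s_m, s_an) for a in (None, 0, 1)}
best = max(scores, key=scores.get)  # antecedent with the highest score
```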

Training
Our objective function has two parts: L_type(θ) (the concept-type identification loss) and L_antecedent(θ) (the antecedent prediction loss):

L(θ) = L_antecedent(θ) + λ L_type(θ),

where λ is a weight coefficient (we empirically set λ = 0.1 in this paper).

Concept Identification Loss. L_type measures whether our model can accurately identify meaningful concepts and learn the correct type representations. Specifically, given the concept set V = {v_1, ..., v_N}, the concept identifier is trained to minimize an average cross-entropy loss:

L_type(θ) = −(1/N) Σ_{k=1}^{N} log P(t*_k | v_k),

where θ is the set of model parameters and P(t*_k | v_k) denotes the output probability of the gold type t*_k for each node v_k, as in Eq. 7.

Antecedent Prediction Loss. Given a training AMR document with gold coreference clusters {GOLD(k)}_{k=1}^{N} and antecedent candidates Y_k = {ε, y_π, ..., y_{k−1}} for mention v_k, L_antecedent measures whether mentions are linked to their correct antecedents. Since the antecedents are latent, the antecedent loss is the marginal log-likelihood of all correct antecedents implied by the gold clustering:

L_antecedent(θ) = −Σ_{k=1}^{N} log Σ_{y ∈ Y_k ∩ GOLD(k)} Q(y),

where GOLD(k) = {ε} if mention v_k does not belong to any gold cluster. Q(y) is calculated using Eq. 10.
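The two losses can be combined and the marginal log-likelihood computed as in this sketch; the order of combination (weighting the type loss by λ) is our reading of the text:

```python
import math

def antecedent_loss(Q, gold_antecedents):
    """Negative marginal log-likelihood over all correct antecedents.

    Q : per-mention dict {candidate: probability} from the antecedent
        softmax; candidate None stands for the dummy antecedent.
    gold_antecedents : per-mention set GOLD(k); {None} when the mention
        belongs to no gold cluster.
    """
    loss = 0.0
    for q_k, gold_k in zip(Q, gold_antecedents):
        # Marginalize over all correct antecedents of this mention.
        marginal = sum(p for cand, p in q_k.items() if cand in gold_k)
        loss -= math.log(marginal)
    return loss

def total_loss(l_type, l_antecedent, lam=0.1):
    # Combined objective with the type loss weighted by lambda = 0.1.
    return l_antecedent + lam * l_type

# Toy example: mention 0 is non-coreferent; mention 1 has gold
# antecedent 0 with probability 0.8.
Q = [{None: 1.0}, {None: 0.2, 0: 0.8}]
gold = [{None}, {0}]
loss = antecedent_loss(Q, gold)
```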

Experiments
We conduct experiments on the MS-AMR dataset (O'Gorman et al., 2018), which is annotated over a previous gold AMR corpus (LDC2017T10). It has 293 annotated documents in total, with an average of 27.4 AMRs per document, covering roughly 10% of the total AMR corpus. We split a development set of the same size as the test set from the training data.
Following the annotation guidelines of MS-AMR, we manually annotate AMR coreference resolution information over the development and test data of the Little Prince (LP) AMR corpus and use it as an out-of-domain test set. For this dataset, we consider each chapter as a document. The data statistics are shown in Table 1.

Setup
Evaluation Metrics We use the standard evaluation metrics for coreference resolution, computed using the official CoNLL-2012 evaluation toolkit. The three measures are MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998) and CEAF_φ4 (Luo, 2005). Following previous studies (Lee et al., 2018), the primary metric AVG-F is the unweighted average of the three F-scores.
Baselines To study the effectiveness of end-to-end AMR coreference resolution, we compare our model with the following baselines:
• Rule-based (Liu et al., 2015): a heuristic method that builds a large document-level AMR graph by linking identical entities.

Models
We study two versions of our model with or without BERT features.
• AMRcoref-base: our model described in §2, using word embeddings only.
• AMRcoref-bert: our model in §2, except that the word embeddings (e_k in Eq. 1) are concatenated with BERT outputs. Specifically, we use a cased BERT-base model with fixed parameters to encode each sentence, taking an AMR-to-text aligner (Flanigan et al., 2014) to project the BERT outputs onto the corresponding AMR nodes.
Hyperparameters We set the dimension of concept embeddings to 256. Characters in the character CNN (§2.1) are represented as learned embeddings with 32 units, and the convolution window sizes are 2, 3, and 4 characters, each with 100 filters. We use Adam (Kingma and Ba, 2015) with a learning rate of 0.005 for optimization.

Development Experiments
We first conduct development experiments to choose the values of crucial hyperparameters.

GRN Encoder Layers The number of recurrent layers L in the GRN determines the amount of message interaction. Too many message passing layers may lead to over-smoothing, while too few may result in weak graph representations (Qin et al., 2020). Figure 3 shows development experiments of the AMRcoref-base model in this aspect. We observe large improvements when increasing the number of layers from 1 to 3, but further increases from 3 to 7 do not lead to additional gains. We therefore choose 3 layers for our final model.

Antecedent Candidates The number of antecedents considered as candidates for each coreference decision (denoted as ψ in Section 2.3) is another important hyperparameter of a coreference resolution model (Lee et al., 2017). Intuitively, allowing more antecedents gives a higher upper bound, but it also introduces a larger search space. Table 3 shows the statistics of the distance between each mention and its gold antecedent, together with the dev-set performance of the AMRcoref-base model using this distance as the search space. The performance of AMRcoref-base improves as the search space increases, with the best performance observed when 250 antecedents are considered. We choose ψ = 250 in subsequent experiments.

[Table 3: Dev-set statistics on mention-gold-antecedent distance and the performance of AMRcoref-base using the distance as the search space.]

In-domain Results
The Rule-based method performs worst, because it only links identical entities and suffers from low recall. The Pipeline model performs better than the Rule-based model due to better coverage, but it suffers from error propagation from both the textual coreference model and the inaccurate AMR aligner. In addition, it does not make use of AMR structural features, which are less sparse than textual cues. Our proposed AMRcoref-base model outperforms the two baselines by a huge margin, gaining at least 9.3% and 13.2% average F1, respectively. This verifies the effectiveness of the end-to-end framework.

Out-domain Results
On the cross-domain LP data, our model largely outperforms both the Rule-based method and the Pipeline model. Compared with the in-domain setting, there is only a minor drop on the out-of-domain dataset (4.1% and 2.3% F1 for AMRcoref-base and AMRcoref-bert, respectively). The performance of neither Rule-based nor Pipeline changes much on this dataset, because these systems are not trained on a particular domain. The consistent performance changes of both AMRcoref-base and AMRcoref-bert when switching from MS-AMR to LP also reflect the quality of our LP annotations.

Analysis
We analyze the effects of mention type, textual embeddings and various extra features in this section.

Concept Identification As shown in the first group of Table 4, we conduct an ablation study on the concept identification module, which has been shown to be crucial for textual coreference resolution (Lee et al., 2017). Removing the concept identifier from the AMRcoref-base model results in a large performance degradation of up to 19.9%, indicating that the concept type information of AMR nodes positively guides the prediction of coreference links. On the other hand, when the concept identifier outputs are replaced with gold mentions, the results improve by a further 19.1%. This indicates that better performance can be expected if concept identification is further improved.
Injecting BERT knowledge As shown in the second group of Table 4, we study the influence of rich features from BERT, which have proven effective for text-based coreference resolution. Two alternatives of using BERT are studied: concatenate (i.e., AMRcoref-bert) denotes concatenating the AMR node embeddings with the corresponding textual BERT embeddings, and graph means that we construct an AMR-token graph that connects AMR nodes with the corresponding tokens. We find that the AMRcoref-base model is improved by a similar margin with both approaches. This is consistent with existing observations on other structured prediction tasks, such as constituent parsing (Kitaev et al., 2019) and dependency parsing. Due to the limited scale of our training data, we expect the gain to be smaller with more training data.
Feature Ablation As shown in the last group of Table 4, we investigate the impact of each component of our proposed model on the development set of MS-AMR. We have the following observations. First, consistent with the findings of Lee et al. (2017), the distance between a pair of AMR concepts is an important feature: the final model performance drops by 2.1% when removing the distance feature (Eq. 13). Second, the speaker indicator features (Eq. 13) contribute a 1.9% improvement to our model. Intuitively, speaker information is helpful for pronoun coreference resolution in dialogues. For example, "my package" in one sentence may refer to the same entity as "your package" in the next utterance. Third, the character CNN provides morphological information and a way to back off for out-of-vocabulary tokens; for AMR node representations, it makes a modest contribution of 1.2% F1. Finally, we examine the necessity of cross-sentence AMR connections. Compared with encoding each AMR graph individually, global information exchange across sentences yields a significant performance improvement.

[Figure 4: Test results of AMRcoref-base with different ratios of training data used. The F1 score of Pipeline is 42.0% (Table 2).]
Data Hunger As with other neural models, it is important to study how much data is necessary to obtain strong performance (at least better than the baselines). Figure 4 shows the performance when training the AMRcoref-base model on different portions of the data. As the number of training samples increases, the performance of our model continuously improves, showing that our model has room for further improvement with more training data. Moreover, our model outperforms the Pipeline baseline even when trained on only 20% of the data. This confirms the robustness of our end-to-end framework.
Effect of Document Length Figure 5 shows the performance for different MS-AMR document lengths (i.e., the number of AMR graphs in a document). Both our model and the Pipeline model show decreasing performance with increasing document length. This is likely because a longer document usually involves more complex coreference situations and poses more challenges for the encoder; insufficient information interaction for distant nodes further leads to weaker inference performance. As expected, the Rule-based approach (Liu et al., 2015) is not significantly affected, but its results remain low. When a document contains more than 30 sentences, the AMRcoref-base model slightly under-performs both the Rule-based method and the Pipeline baseline. One reason is that only a few training instances have long documents, so we expect that the performance of our model can be further improved given more long documents.

Table 5 compares the summarization performance using the document-level AMRs generated by various methods on the LDC2015E86 benchmark (Knight et al., 2014). Following Liu et al. (2015), Rouge scores (R-1/2/L; Lin, 2004) are used as the metrics. To consume each document AMR and the corresponding text, we take a popular dual-to-sequence model (D2S; Song et al., 2019b), which extends the standard sequence-to-sequence framework with an additional graph encoder and a dual attention mechanism for extracting both text and graph contexts during decoding. Regarding previous work, summarization using AMR was first explored by Liu et al. (2015), who use a rule-based method to build document AMRs and then take a statistical model to generate summaries. Dohare et al. (2017) improve this approach by selecting important sentences before building a document AMR. D2S-Rule-based can be considered a fair comparison with Liu et al. (2015) on the same summarization platform.
The D2S models overall outperform the previous approaches, indicating that our experiments are conducted on a stronger baseline. Although Pipeline is better than Rule-based on AMR coreference resolution, D2S-Pipeline is only comparable with D2S-Rule-based on the downstream summarization task. This shows that the error propagation issue of Pipeline can introduce further negative effects in a downstream application. On the other hand, both D2S-AMRcoref-base and D2S-AMRcoref-bert show much better results than the baselines across all Rouge metrics. This demonstrates that the improvements made by our end-to-end model are solid and transfer to a downstream application. D2S-AMRcoref-bert achieves the best performance, which is consistent with the experiments above.

Related Work
Multi-sentence AMR Although some previous work (Szubert et al., 2020; Van Noord and Bos, 2017) explores the coreference phenomena of AMR, it mainly focuses on coreference within a sentence. On the other hand, previous work on multi-sentence AMR primarily focuses on data annotation. Song et al. (2019a) annotate dropped pronouns over Chinese AMR but only deal with implicit roles in specific constructions. Gerber and Chai (2012) provide implicit role annotations, but the resource is limited to a small inventory of 5-10 predicate types rather than all implicit arguments. O'Gorman et al. (2018) annotate the MS-AMR dataset by simultaneously considering coreference, implicit role coreference and bridging relations. We consider coreference resolution a prerequisite for creating multi-sentence AMRs, and propose the first end-to-end model for this task.
Coreference Resolution Coreference resolution is a fundamental problem in natural language processing. Neural network models have shown promising results over the years. Recent work (Lee et al., 2017, 2018; Kantor and Globerson, 2019) tackles the problem end-to-end by jointly detecting mentions and predicting coreference. Lee et al. (2018) build a complete end-to-end system with a span-ranking architecture and a higher-order inference technique. While previous work considers only text-level coreference, we investigate AMR coreference resolution.
AMR Representation using GNN To encode AMR graphs, many GNN variants such as GRNs (Beck et al., 2018), GCNs (Zhou et al., 2020) and GATs (Damonte and Cohen, 2019; Cai and Lam, 2020b; Wang et al., 2020) have been introduced. We choose a classic GRN model to represent our document-level AMR graph and leave the exploration of more efficient GNN structures for future work.

Conclusion
We investigated a novel end-to-end multi-sentence AMR coreference resolution model using a graph neural network. Compared with previous rule-based and pipeline methods, our model better captures multi-sentence semantic information. Results on the MS-AMR (in-domain) and LP (out-of-domain) datasets show the superiority and robustness of our model. In addition, experiments on the downstream text summarization task further demonstrate the effectiveness of the document-level AMRs produced by our model.
In future work, we plan to resolve both the cross-AMR coreference links and the sentence-level ones together with our model.