Neural semi-Markov CRF for Monolingual Word Alignment

Monolingual word alignment is important for studying fine-grained editing operations (i.e., deletion, addition, and substitution) in text-to-text generation tasks, such as paraphrase generation, text simplification, neutralizing biased language, etc. In this paper, we present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans. We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models in more realistic settings. Experimental results show that our proposed model outperforms all previous approaches for monolingual word alignment as well as a competitive QA-based baseline, which was previously only applied to bilingual data. Our model demonstrates good generalizability to three out-of-domain datasets and shows great utility in two downstream applications: automatic text simplification and sentence pair classification tasks.


Introduction
Monolingual word alignment aims to align words or phrases with similar meaning in two sentences that are written in the same language. It is useful for improving the interpretability of natural language understanding tasks, including semantic textual similarity (Li and Srikumar, 2016) and question answering (Yao, 2014). Monolingual word alignment can also support the analysis of human editing operations (Figure 1) and improve model performance for text-to-text generation tasks, such as text simplification (Maddela et al., 2021) and neutralizing biased language (Pryzant et al., 2020). It has also been shown to be helpful for data augmentation and label projection (Culkin et al., 2021) when combined with paraphrase generation.

Figure 1: An example illustrating how monolingual word alignment (shown as arrows) can support analysis of the human editing process (deletion, substitution, and insertion) and the training of text generation models (§6.1), such as for simplifying complex sentences for children to read.

One major challenge for automatic alignment is the need to handle not only alignments between words and linguistic phrases (e.g., a dozen ↔ more than 10), but also non-linguistic phrases that are semantically related given the context (e.g., tensions ↔ relations being strained in Figure 3). In this paper, we present a novel neural semi-Markov CRF alignment model, which unifies both word and phrase alignments through variable-length spans, calculates span-based semantic similarities, and takes alignment label transitions into consideration. We also create a new manually annotated benchmark, Multi-Genre Monolingual Word Alignment (MultiMWA), which consists of four datasets across different text genres and is large enough to support the training of neural-based models (Table 1). It addresses the shortcomings of existing datasets for monolingual word alignment: MTReference (Yao, 2014) was annotated by crowd-sourcing workers and contains many obvious errors (more details in §4); iSTS (Agirre et al., 2016) and SPADE/ESPADA (Arase and Tsujii, 2018, 2020) were annotated based on chunking and parsing results, which may restrict the granularity and flexibility of the alignments.
Our experimental results show that the proposed semi-Markov CRF model achieves state-of-the-art performance with higher precision, in comparison to previous monolingual word alignment models (Yao et al., 2013a,b; Sultan et al., 2014), as well as another very competitive span-based neural model (Nagata et al., 2020) that had previously only been applied to bilingual data. Our model exceeds 90% F1 in the in-domain evaluation and also generalizes well to three out-of-domain datasets. We present a detailed ablation and error analysis to better understand the performance gains. Finally, we demonstrate the utility of monolingual word alignment in two downstream applications, namely automatic text simplification and sentence pair classification.

Related Work
Word alignment has a long history and was first proposed for statistical machine translation. The most representative approaches are the IBM models (Brown et al., 1993), a sequence of unsupervised models of increasing complexity implemented in the GIZA++ toolkit (Och and Ney, 2003). Many more works followed, such as FastAlign (Dyer et al., 2013). Dyer et al. (2011) also used a globally normalized log-linear model for discriminative word alignment. Bansal et al. (2011) proposed a hidden semi-Markov model to handle both continuous and non-continuous phrase alignment. These statistical methods promoted the development of monolingual word alignment (MacCartney et al., 2008; Thadani and McKeown, 2011; Thadani et al., 2012). Yao et al. (2013a) proposed a CRF aligner following Blunsom and Cohn (2006), then extended it to a semi-CRF model for phrase-level alignments (Yao et al., 2013b). Sultan et al. (2014) designed a simple system with heuristic rules based on word similarity and contextual evidence.
Neural methods have been explored in the past decade primarily for bilingual word alignment. Some early attempts (Yang et al., 2013; Tamura et al., 2014) did not match the performance of GIZA++, but recent Transformer-based models have started to outperform it. Garg et al. (2019) proposed a multi-task framework for machine translation and word alignment, while Zenkel et al. (2020) designed an alignment layer on top of a Transformer for machine translation. Both can be trained without word alignment annotations but rely on millions of bilingual sentence pairs. As for supervised methods, Stengel-Eskin et al. (2019) extracted representations from a Transformer-based MT system, then used a convolutional neural network to incorporate neighboring words for alignment. Nagata et al. (2020) proposed a span prediction method that formulates bilingual word alignment as a SQuAD-style question answering task, solved by fine-tuning multilingual BERT. We adapt their method to monolingual word alignment as a new state-of-the-art baseline (§5.1). Some monolingual neural models have different settings from this work. Ouyang and McKeown (2019) introduced pointer networks for long, sentence- or clause-level alignments. Arase and Tsujii (2017, 2020) utilized constituency parsers for compositional and non-compositional phrase alignments. Culkin et al. (2021) considered span alignment for FrameNet (Baker et al., 1998) annotations and treated each span pair as an independent prediction.

Neural Semi-CRF Alignment Model
In this section, we first describe the problem formulation for monolingual word alignment, then present the architecture of our neural semi-CRF word alignment model (Figure 2).

Problem Formulation
We formulate word alignment as a sequence tagging problem following previous work (Blunsom and Cohn, 2006; Yao et al., 2013b). Given a source sentence s and a target sentence t of the same language, the span alignment a consists of a sequence of tuples (i, j), which indicates that span s_i in the source sentence is aligned with span t_j in the target sentence. More specifically, a_i = j means source span s_i is aligned with target span t_j. We consider all spans up to a maximum length of D words. Given a source span s_i of length d, where b_i is its beginning word index, its label a_i means that every word within span s_i is aligned to the target span t_{a_i}. That is, the word-level alignments a^w_{b_i}, a^w_{b_i+1}, ..., a^w_{b_i+d−1} all have the same value j. We use a^w to denote the label sequence of word-level alignments and s^w_{b_i} to denote the b_i-th word in the source sentence. If span s_i is not aligned to any words in the target sentence, then a_i = [NULL]. When D ≥ 2, the Markov property no longer holds for word-level alignment labels, but it does hold for span-level labels. That is, a_i depends on a^w_{b_i−1}, the position in the target sentence to which the preceding source span (ending at word index b_i − 1) is aligned. We therefore design a discriminative model using semi-Markov conditional random fields (Sarawagi and Cohen, 2005) to segment the source sentence and find the best span alignment, which we present below. One unique aspect of our semi-Markov CRF model is that it utilizes a varied set of labels for each sentence pair.
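To make the label scheme concrete, here is a minimal Python sketch of span enumeration and the expansion of span-level labels into word-level labels; the function names and the toy segmentation are illustrative, not part of the released model.

```python
D = 3  # maximum span length, as in the formulation above

def enumerate_spans(n, max_len=D):
    """All candidate spans (begin, end) over a sentence of n words,
    with end exclusive and span length at most max_len."""
    return [(b, e) for b in range(n) for e in range(b + 1, min(b + max_len, n) + 1)]

def expand_to_word_labels(segmentation):
    """Expand span-level labels a_i to word-level labels a^w: every word
    in source span (b, e) receives that span's target label j
    (j is a target span index, or None for [NULL])."""
    word_labels = {}
    for (b, e), j in segmentation:
        for w in range(b, e):
            word_labels[w] = j
    return word_labels

spans = enumerate_spans(4)  # candidate spans for a 4-word source sentence
# Toy segmentation: words 0-1 form one span aligned to target span 2,
# word 2 is unaligned ([NULL]), word 3 aligns to target span 0.
labels = expand_to_word_labels([((0, 2), 2), ((2, 3), None), ((3, 4), 0)])
```

The word-level labels recovered this way are what the Markov transition term conditions on across span boundaries.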

Our Model
The conditional probability of alignment a given a sentence pair s and t is defined as follows:

p(a | s, t) = exp(ψ(a, s, t)) / Σ_{a′∈A} exp(ψ(a′, s, t))    (1)

where the set A denotes all possible alignments between the two sentences. The potential function ψ can be decomposed into:

ψ(a, s, t) = Σ_i [ υ(a_i, s, t) + τ(a_i, a^w_{b_i−1}) ] + cost(a, a*)    (2)

where i ranges over the indices of the source spans involved in the alignment a, and a* represents the gold alignment sequence at the span level. The potential function ψ consists of three elements, the first two of which compose the negative log-likelihood loss: the span interaction function υ, which accounts for the similarity between a source span and a target span; the Markov transition function τ, which models the transition of alignment labels between adjacent source spans; and the cost term, implemented with Hamming loss, which encourages the predicted alignment sequence to be consistent with the gold labels. Functions υ and τ are implemented as two neural components, which we describe below.
Span Representation Layer. First, the source and target sentences are concatenated together and encoded by the pre-trained SpanBERT (Joshi et al., 2020) model. The hidden representations in the last layer of the encoder are extracted for each WordPiece token, then averaged to form word representations. Following previous work (Joshi et al., 2020), each span is represented by a self-attention vector computed over the representations of the words within the span, concatenated with the Transformer output states of the span's two endpoints.
Span Interaction Layer. The semantic similarity score between source span s_i and target span t_j is calculated by a 2-layer feed-forward neural network FF_sim with Parametric ReLU (PReLU) (He et al., 2015), after applying layer normalization to each span representation:

υ(a_i, s, t) = FF_sim([h_{s_i} ; h_{t_j} ; h_{s_i} ∘ h_{t_j}])

where [;] is concatenation and ∘ is element-wise multiplication. We use h_{s_i} and h_{t_j} to denote the representations of source span s_i and target span t_j, respectively.
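As a rough illustration of the span interaction function, the following NumPy sketch scores a span pair by feeding the concatenation of the two (layer-normalized) span vectors and their element-wise product through a 2-layer feed-forward network with PReLU. The dimensions, random weights, and PReLU slope are illustrative assumptions, not the authors' values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8        # span representation size (illustrative)
hidden = 16  # hidden size of the 2-layer feed-forward scorer (illustrative)

W1 = rng.standard_normal((hidden, 3 * d)); b1 = np.zeros(hidden)
W2 = rng.standard_normal((1, hidden));     b2 = np.zeros(1)
alpha = 0.25  # PReLU slope (learned per-channel in the real model)

def layer_norm(x, eps=1e-5):
    # Normalize a single span vector to zero mean, unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def prelu(x):
    return np.where(x > 0, x, alpha * x)

def span_similarity(h_s, h_t):
    """FF_sim over [h_s ; h_t ; h_s * h_t] -> scalar similarity score."""
    h_s, h_t = layer_norm(h_s), layer_norm(h_t)
    x = np.concatenate([h_s, h_t, h_s * h_t])
    out = W2 @ prelu(W1 @ x + b1) + b2  # shape (1,)
    return float(out[0])

score = span_similarity(rng.standard_normal(d), rng.standard_normal(d))
```

In the actual model this score fills one cell of the source-span × target-span interaction matrix consumed by the CRF.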
Markov Transition Layer. Monolingual word alignment moves along the diagonal direction in most cases. To incorporate this intuition, we propose a scoring function to model the transition between the adjacent alignment labels a^w_{b_i−1} and a_i. The main feature we use is the distance between the beginning index of the current target span and the end index of the target span that the prior source span is aligned to. The distance is binned into one of 13 buckets with the boundaries [−11, −6, −4, −3, −2, −1, 0, 1, 2, 3, 5, 10], and each bucket is encoded by a 128-dim randomly initialized embedding, which is then transformed into a real-valued score by a 1-layer feed-forward neural network.
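The binning step can be sketched as follows; the bucket boundaries come from the text above, while how ties at the boundaries fall (here, via `bisect_right`) is an assumption.

```python
import bisect

# 12 boundaries partition the integer line into 13 buckets.
BOUNDARIES = [-11, -6, -4, -3, -2, -1, 0, 1, 2, 3, 5, 10]

def distance_bucket(distance):
    """Bucket index in [0, 12] for the signed distance between the
    beginning of the current target span and the end of the target
    span that the preceding source span aligned to."""
    return bisect.bisect_right(BOUNDARIES, distance)

# Each bucket index would look up a 128-dim embedding, which a 1-layer
# feed-forward network then maps to a real-valued transition score.
```

This keeps the transition feature coarse and learnable: small diagonal moves and large jumps land in distinct buckets.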
Training and Inference. During training, we minimize the negative log-likelihood of the gold alignment a*, and the model is trained in both directions (source to target and target to source):

L = −log p(a*_{s2t} | s, t) − log p(a*_{t2s} | t, s)

where a*_{s2t} and a*_{t2s} represent the gold alignment labels for the two directions.
During inference, we use the Viterbi algorithm to find the optimal alignment. There are different strategies to merge the outputs from the two directions, including intersection, union, grow-diag (Koehn, 2009), bidi-avg (Nagata et al., 2020), etc. The choice of strategy can be treated as a hyper-parameter and decided on the dev set. In this work, we use intersection in our semi-CRF model for all experiments.
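Symmetrization over word-level pairs can be sketched as below; intersection and union are shown, while grow-diag and bidi-avg are omitted. Representing each direction's output as a set of (source index, target index) pairs is an assumption for illustration.

```python
def symmetrize(s2t, t2s, strategy="intersection"):
    """Merge two directional alignments. Each element is a word-level
    pair; the target-to-source set is flipped so that both directions
    live in the same (src_idx, tgt_idx) coordinate system."""
    flipped = {(i, j) for (j, i) in t2s}
    if strategy == "intersection":
        return s2t & flipped
    if strategy == "union":
        return s2t | flipped
    raise ValueError(strategy)  # grow-diag, bidi-avg, etc. omitted here

s2t = {(0, 0), (1, 2), (2, 3)}  # source-to-target decoding
t2s = {(0, 0), (2, 1)}          # target-to-source decoding
inter = symmetrize(s2t, t2s)    # intersection keeps pairs found by both
```

Intersection trades recall for precision, which matches the paper's observation that it yields the most precise merged output.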

Implementation Details
We implement our model in PyTorch (Paszke et al., 2017). We use the Adam optimizer and set both the learning rate and weight decay to 1e-5. We set the maximum span size to 3 for our neural semi-CRF model, which converges within 5 epochs. The neural semi-CRF model takes ∼2 hours of training time per epoch on MultiMWA-MTRef, measured on a single GeForce GTX 1080 Ti GPU.

A Multi-Genre Benchmark for Monolingual Word Alignment
In this section, we present the manually annotated Multi-Genre Monolingual Word Alignment (MultiMWA) benchmark, which consists of four datasets of different text genres. As summarized in Table 1, our new benchmark is the largest to date and of higher quality than existing datasets. In contrast to iSTS (Agirre et al., 2016) and SPADE/ESPADA (Arase and Tsujii, 2018, 2020), our annotation does not rely on external chunking or parsing that may introduce errors or restrict granularity and flexibility. Our benchmark contains both token alignments and a significant portion of phrase alignments that are semantically equivalent as a whole. It also contains a large portion of semantically similar but not strictly equivalent sentence pairs, which are common in text-to-text generation tasks and thus important for evaluating monolingual word alignment models in this realistic setting. For all four datasets, we closely follow the standard 6-page annotation guideline from Callison-Burch et al. (2006) and further extend it to improve phrase-level annotation consistency (more details in Appendix B.1). We describe each of the four datasets below.
MultiMWA-MTRef. We create this dataset by annotating 3,998 sentence pairs from MTReference (Yao, 2014), which are human references used in a machine translation task. The original labels in MTReference were annotated by crowd-sourcing workers on Amazon Mechanical Turk following the guideline from Callison-Burch et al. (2006). In an early pilot study, we discovered that these crowd-sourced annotations are noisy and contain many obvious errors: they achieve only 73.6/96.3/83.4 Precision/Recall/F1 on a random sample of 100 sentence pairs, when compared to labels we manually corrected.
To address the lack of reliable annotation, we hire two in-house annotators to correct the original labels using GoldAlign (Gokcen et al., 2016), an annotation tool for monolingual word alignment. Both annotators have a linguistics background and extensive NLP annotation experience. We provide a three-hour training session to the annotators, during which they are asked to align 50 sentence pairs and discuss until consensus. Following previous work, we calculate the inter-annotator agreement as 84.2 F1 for token-level non-identical alignments by comparing one annotator's annotation against the other's; alignments between identical words are usually easy for human annotators. After merging the labels from both annotators, we create a new split of 2398/800/800 for the train/dev/test sets. To ensure quality, an adjudicator further examines the dev and test sets and constructs the final labels.
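The agreement figure is a standard precision/recall/F1 computation over sets of alignment pairs, treating one annotator as the reference; the toy pairs below are illustrative.

```python
def alignment_prf(pred, gold):
    """Precision, recall, and F1 of one set of alignment pairs against
    another; for inter-annotator agreement, 'gold' is simply the other
    annotator's set."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

a1 = {(0, 0), (1, 2), (3, 3)}  # annotator 1's non-identical token pairs
a2 = {(0, 0), (1, 2), (4, 4)}  # annotator 2's
p, r, f1 = alignment_prf(a1, a2)
```

The same function doubles as the evaluation metric when `gold` holds the benchmark labels.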
MultiMWA-Newsela. The Newsela corpus (Xu et al., 2015b) consists of 1,932 English news articles and their simplified versions written by professional editors. It has been widely used in text simplification research (Xu et al., 2016; Zhang and Lapata, 2017). We randomly select 500 complex-simple sentence pairs from the test set of Newsela-Auto, the newest sentence-aligned version of Newsela. 214 of these 500 pairs contain sentence splitting. An in-house annotator labels the word alignment by correcting the outputs from GIZA++ (Och and Ney, 2003).
MultiMWA-arXiv. arXiv is an open-access platform that stores more than 1.7 million research papers along with their historical versions. It has been used to study paraphrase generation (Dong et al., 2021) and statement strength (Tan and Lee, 2014). We first download the LaTeX source code for 750 randomly sampled papers and their historical versions, then use the OpenDetex package to extract plain text from them. We use a trained neural CRF sentence alignment model to align sentences between different versions of the papers and sample 200 non-identical aligned sentence pairs for further annotation. The word alignment is annotated following a procedure similar to that of MultiMWA-Wiki.
MultiMWA-Wiki. Wikipedia has been widely used in text-to-text tasks, including text simplification, sentence splitting (Botha et al., 2018), and neutralizing biased language (Pryzant et al., 2020). We follow the method of Pryzant et al. (2020) to extract parallel sentences from the Wikipedia revision history dump (dated 01/01/2021) and randomly sample 4,099 sentence pairs for further annotation. We first use an earlier version of our neural semi-CRF word aligner (§3) to automatically align words for the sentence pairs, then ask two in-house annotators to correct the aligner's outputs. The inter-annotator agreement is 98.1 at the token level measured by F1. We split the data into 2514/533/1052 sentence pairs for the train/dev/test sets.

Experiments
In this section, we present both in-domain and out-of-domain evaluations of different word alignment models on our MultiMWA benchmark. We also provide a detailed error analysis of our neural semi-CRF model and an ablation study analyzing the importance of each component.

Baselines
We introduce a new state-of-the-art baseline by adapting the QA-based method of Nagata et al. (2020), which had previously been applied only to bilingual word alignment. This method treats the word alignment problem as a collection of independent predictions from every token in the source sentence to a span in the target sentence, which is then solved by fine-tuning multilingual BERT (Devlin et al., 2019), similarly to a SQuAD-style question answering task. Taking the sentence pair in Figure 1 as an example, the word to be aligned is marked by ¶ in the source sentence and concatenated with the entire target sentence to form the input "With Canadian · · · ¶ conduct ¶ · · · his model. Lloyd performed · · · his model." A span prediction model based on fine-tuned multilingual BERT is then expected to extract performed from the target sentence. The predictions from both directions (source to target, target to source) are symmetrized to produce the final alignment, using a probability threshold of 0.4 instead of the typical 0.5. We switch to standard BERT in this model for monolingual alignment and find that the 0.4 threshold chosen by Nagata et al. (2020) is nearly optimal for maximizing F1 on our MultiMWA-MTRef dataset.

Table 2: We report precision (P), recall (R), F1, and exact match (EM), the percentage of sentence pairs for which model predictions exactly match the gold labels for the entire sentence. For each metric, we also report the performance on identical alignments (P_i, R_i, F1_i) and non-identical alignments (P_n, R_n, F1_n) separately. *MultiMWA-Wiki contains only about 5% non-identical alignments.
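A minimal sketch of constructing this QA-style input; the function name and whitespace tokenization are illustrative assumptions.

```python
def build_qa_input(src_tokens, idx, tgt_tokens):
    """Mark src_tokens[idx] with paragraph symbols and concatenate with
    the target sentence, mimicking a SQuAD-style (question, context)
    pair for the span predictor."""
    marked = src_tokens[:idx] + ["¶", src_tokens[idx], "¶"] + src_tokens[idx + 1:]
    return " ".join(marked) + " " + " ".join(tgt_tokens)

src = "With Canadian collaborators , Lloyd went on to conduct simulations".split()
tgt = "Lloyd performed simulations with Canadian collaborators".split()
query = build_qa_input(src, src.index("conduct"), tgt)
# The fine-tuned BERT span predictor is then expected to extract
# "performed" from the target portion of this input.
```

One such input is built per source token, which is what makes the predictions independent of one another.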
This QA-based method alone outperforms all existing models for monolingual word alignment, including: the JacanaToken aligner (Yao et al., 2013a), a CRF model using hand-crafted features and external resources; the JacanaPhrase aligner (Yao et al., 2013b), a semi-CRF model relying on feature templates and external resources; and the PipelineAligner (Sultan et al., 2014), a pipeline system that utilizes word similarity and contextual information with heuristic algorithms. We also create a variant of our model, a Neural CRF aligner, in which all modules remain the same but the maximum span length is set to 1, to evaluate the benefits of span-based alignments.

Experimental Results
Following the literature (Thadani et al., 2012; Yao et al., 2013a,b), we present results under both Sure and Sure+Poss settings for the MultiMWA-MTRef dataset. The Sure+Poss setting includes all the annotated alignments, while Sure contains only the subset agreed upon by multiple annotators. We consider Sure+Poss the default setting for the other three datasets.
The in-domain evaluation results are shown in Table 2. The QA-based aligner also achieves competitive performance with strong recall; however, its precision is lower than our model's. It is worth noting that our model has a modular design and can be adjusted more easily than the QA-based method to suit different datasets and downstream tasks.

Table 3 presents the out-of-domain evaluation results. Our neural models achieve the best performance across all three datasets, which demonstrates the generalization ability of our model and its usefulness for downstream applications.

Table 4 shows the ablation study for our neural semi-CRF model. F1 and EM drop by 1.3 and 4.4 points, respectively, after replacing SpanBERT with BERT, indicating the importance of optimized pre-trained representations. The Markov transition layer contributes mainly to alignment accuracy (EM). We have experimented with different strategies to merge the outputs from the two directions: intersection yields better precision, while grow-diag and union are biased towards recall. Leveraging the span interaction matrix generated by our model (details in §3.2), we design a simple post-processing rule to extend phrasal alignments to spans longer than 3 tokens: adjacent target words are gradually included if they have very high semantic similarity with the same source span. This rule further improves recall and achieves the best F1 on MultiMWA-MTRef.
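The post-processing rule can be sketched as below; the similarity threshold of 0.9 and the per-word similarity representation are illustrative assumptions, not values from the paper.

```python
def extend_target_span(tgt_span, sim_to_source, n_target, threshold=0.9):
    """Grow a predicted target span (begin, end), end exclusive, by
    absorbing adjacent target words whose similarity to the (fixed)
    source span stays above the threshold. sim_to_source[j] is taken
    from the model's span interaction matrix."""
    b, e = tgt_span
    while b > 0 and sim_to_source[b - 1] >= threshold:
        b -= 1
    while e < n_target and sim_to_source[e] >= threshold:
        e += 1
    return (b, e)

sims = [0.1, 0.95, 0.8, 0.92, 0.97, 0.2]
extended = extend_target_span((3, 4), sims, len(sims))  # grows right to word 4
```

Because growth stops at the first low-similarity neighbor, the rule lifts recall on long phrasal alignments without severely hurting precision.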

Error Analysis
We sample 50 sentence pairs from the dev set of MultiMWA-MTRef and analyze the errors under the Sure+Poss setup. Figure 4 shows how the performance of the different alignment models would improve if we resolved each of the 7 types of errors. We discuss the categorization of errors and their breakdown percentages below:

Phrase Boundary (58.6%). The phrase boundary error (see example 3 in Figure 3) is the most prominent error in all models, accounting for 7.6 points of F1 for JacanaPhrase, 5.7 for the QA aligner, and 4.7 for our neural semi-CRF aligner. For another example, instead of the 3×2 alignment funds for research ↔ research funding, our model captures two 1×1 alignments, funds ↔ funding and research ↔ research. This is largely due to the fact that alignments are not limited to linguistic phrases (e.g., noun phrases, verb phrases, etc.), but rather include non-linguistic phrases. It can also be challenging to handle longer spans, such as keep his position ↔ protect himself from being removed (more on this in Appendix B.2). Although we use SpanBERT for better phrase representation, there is still room for improvement.
Function Words (19.1%). Function words can be tricky to align when rewording and reordering happen, as in example 2 in Figure 3. Adding to the complexity, the same function word may appear more than once in a sentence. This type of error is common in all the models we experiment with, accounting for 4.7 points of F1 for JacanaPhrase, 1.3 for the QA aligner, and 1.5 for our neural semi-CRF aligner.
Content Words (14.2%). Similar to function words, content words (e.g., security bureau ↔ defense ministry) can also be falsely aligned or missed, but here the difference between neural and non-neural models is much more significant. This error type accounts for 7.7 points of F1 for the Jacana aligner, but only 1.1 and 0.8 for the neural semi-CRF aligner and QA aligner, respectively.
Context Implication (5.6%). Some words or phrases that are not strictly semantically equivalent can still be aligned if they appear in a similar context. For example, given the source sentence 'Gaza international airport was put into operation the day before' and the target sentence 'The airport began operations one day before', the phrase pair was put into ↔ began can be aligned. Resolving this type yields a 2.8-point F1 improvement for the Jacana aligner, but only 0.4 and 0.2 for the neural semi-CRF and QA-based aligners, respectively.
Debatable Labels (1.9%). Word alignment annotation can sometimes be subjective. Take the phrase alignment two days of ↔ a two-day as an example: one could argue either way about whether to include the function word 'a' in the alignment.
Name Variations (0.6%). While our neural semi-CRF model is designed to handle spelling variations and name abbreviations, it sometimes fails, as shown by example 1 in Figure 3. Some cases can be very difficult, such as SAWS ↔ the state's supervision and control bureau of safe production, where SAWS stands for State Administration of Work Safety.
Skip Alignment (0.0%). Non-contiguous tokens can be aligned to the same target token or phrase (e.g., owes ... to ↔ is a result of), posing a challenging situation for monolingual word aligners. However, this error is rare, as only 0.6% of all alignments in MTRef dev set are discontinuous.

Downstream Applications
In this section, we apply our monolingual word aligner to some downstream applications, including both generation and understanding tasks.

Automatic Text Simplification
Text simplification aims to improve the readability of text by rewriting complex sentences in simpler language. We propose to incorporate word alignment information into the state-of-the-art EditNTS model (Dong et al., 2019) to explicitly learn the edit operations, including addition, deletion, and paraphrase. The EditNTS model uses a neural programmer-interpreter architecture, which derives the ADD, KEEP, and DELETE operation sequence based on edit-distance measurements during training. We instead construct this edit sequence based on the neural semi-CRF aligner's outputs (trained on MTRef Sure+Poss) with an additional REPLACE tag to train the EditNTS model (more details in Appendix A). Table 5 presents the text simplification results on two benchmark datasets, Newsela-auto and Wikipedia-auto, where we improve the SARI score (Xu et al., 2016) by 0.9 and 0.6, respectively. The SARI score averages the F1/precision of n-grams inserted (add), kept (keep), and deleted (del) when compared to human references. We also calculate the BLEU score with respect to the input (s-BL), the percentage of new words added (%new), and the percentage of system outputs identical to the input (%eq) to show paraphrasing capability. We manually inspect 50 sentences sampled from the Newsela-auto test set and find that both models (EditNTS and EditNTS+Aligner) generate the same output for 10 sentences. For the remaining 40 sentences, the original EditNTS attempts to paraphrase only 4 times (2 are good), while our modified model (EditNTS+Aligner) is more aggressive, generating 25 paraphrases (11 are good). With the help of the word aligner, the modified model also produces more good deletions (20 vs. 13) and fewer bad deletions (6 vs. 12), which is consistent with its better keep and del scores.

Conclusion
In this work, we present the first neural semi-CRF word alignment model, which achieves competitive performance in both in-domain and out-of-domain evaluations. We also create a manually annotated Multi-Genre Monolingual Word Alignment (MultiMWA) benchmark, which is the largest to date and of higher quality than existing datasets.

Acknowledgement
We thank Yang Chen, Sarthak Garg, and anonymous reviewers for their helpful comments. We also thank Sarah Flanagan, Yang Zhong, Panya Bhinder, Kenneth Kannampully for helping with data annotation. This research is supported in part by the NSF awards IIS-2055699, ODNI and IARPA via the BETTER program contract 19051600004, ARO and DARPA via the SocialSim program contract W911NF-17-C-0095, and Criteo Faculty Research Award to Wei Xu. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of NSF, ODNI, IARPA, ARO, DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

A EditNTS with Aligner
The original EditNTS model constructs the expert program with the shortest edit path from the complex sentence to the simple sentence; specifically, it calculates the Levenshtein distance without substitutions and recovers the edit path with three labels: ADD, KEEP, and DEL. Since edit distance relies on word identity to match the sentence pair, it cannot produce lexical paraphrases (e.g., conduct ↔ performed and simulations ↔ experiments in Figure 1). The final edit sequence mixes paraphrase words (performed and experiments) and normally added words (successful) together under the same ADD label. To differentiate these two types of added words, we introduce special tags (REPLACE-S and REPLACE-E) to mark lexical paraphrases specifically. During the edit label construction process, after checking word-pair identity for the KEEP label, we additionally check whether the words are aligned by our neural semi-CRF aligner; if so, we produce REPLACE-S/E tags, otherwise the normal ADD/DEL tags. See Table 7 for a specific example. Since word alignment can arbitrarily align any words in the target sentence, it can break the sequential dependency of the edit labels; we therefore discard some lexical paraphrases to guarantee this property (e.g., conduct ↔ performed in Table 7). To show the effectiveness of our modified model, we compare two more versions of EditNTS in Table 8: EditNTS (original) + Aligner, where we directly add word alignment information to the original EditNTS model without any REPLACE tags; and EditNTS (new), where we keep the REPLACE tags but do not use any word alignments. The results show that the EditNTS model with REPLACE tags improves performance, but not significantly. After adding the word alignment information, the SARI score improves significantly, demonstrating the effectiveness of our modified EditNTS with aligner.
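A simplified sketch of this edit-label construction, assuming a monotonic one-to-one word alignment for clarity (the real construction follows the Levenshtein edit path); everything except the ADD/KEEP/DEL/REPLACE-S/REPLACE-E tags themselves is illustrative.

```python
def build_edit_labels(src, tgt, align):
    """align: dict src_idx -> tgt_idx, assumed monotonic and one-to-one.
    Identical aligned words -> KEEP; non-identical aligned words ->
    REPLACE-S/REPLACE-E paraphrase pairs; everything else -> DEL/ADD."""
    labels, j = [], 0
    for i, w in enumerate(src):
        if i in align:
            t = align[i]
            while j < t:  # emit unaligned target words first
                labels.append(("ADD", tgt[j])); j += 1
            if w == tgt[t]:
                labels.append(("KEEP", w))
            else:         # aligned but reworded: a lexical paraphrase
                labels.append(("REPLACE-S", w))
                labels.append(("REPLACE-E", tgt[t]))
            j = t + 1
        else:
            labels.append(("DEL", w))
    while j < len(tgt):   # trailing unaligned target words
        labels.append(("ADD", tgt[j])); j += 1
    return labels

src = ["conduct", "laboratory", "simulations"]
tgt = ["performed", "successful", "experiments"]
labels = build_edit_labels(src, tgt, {0: 0, 2: 2})
```

Here successful stays a plain ADD, while the aligned pairs become REPLACE-S/E, which is exactly the distinction the modified EditNTS is trained on.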

B More Details for MultiMWA Benchmark B.1 Updated Annotation Guideline
After the first round of annotation, we discovered that the definition of phrasal alignment can be ambiguous, which can hinder the development and error analysis of word alignment models. Therefore, we further extend the standard 6-page annotation guideline from Callison-Burch et al. (2006) to cover three linguistic phenomena, in order to improve phrase-level annotation consistency.
• "a/an/the + noun" should be aligned together with the noun if both nouns are the same.
• noun₁ should only be aligned to noun₁ in the phrase "noun₁ and noun₂".
• The noun should only be aligned to the noun in an "adjective + noun" phrase.
Utilizing the constituency parser implemented in the AllenNLP package (Gardner et al., 2018), we first write a script to implement these rules and apply them to all the training/dev/test sets of MultiMWA-MTRef. Then, we manually go through both dev and test sets to further ensure the annotation consistency.

B.2 Statistics of Alignment Shape
We also analyze the shape of alignments in each dataset; the statistics can be found in Table 9. The statistics show that the dev and test sets of MultiMWA-MTRef contain a similar portion of phrasal alignments, less than the training set. There even exist 1×10 alignment annotations in MultiMWA-MTRef, which are in fact correct based on our manual inspection. Both MultiMWA-Newsela and MultiMWA-arXiv contain a significantly larger portion of 1×1 alignments; the latter contains only 3.2% phrasal alignments.

Table 9: Statistics of alignment shapes in each dataset. Each number represents how many word alignments are included for a phrasal alignment of a specific shape; for example, one 2×3 phrasal alignment contributes six word alignments. % of 1×1 is calculated as 1×1 over the sum of the row.