Unsupervised Neural Machine Translation with Universal Grammar

Machine translation usually relies on parallel corpora to provide parallel signals for training. The advent of unsupervised machine translation has freed machine translation from this reliance, though performance still lags behind traditional supervised machine translation. In unsupervised machine translation, the model seeks symmetric language similarities as a source of weak parallel signal to achieve translation. Chomsky's Universal Grammar theory postulates that grammar is a form of knowledge innate to humans and is governed by universal principles and constraints. Therefore, in this paper, we seek to leverage such shared grammar clues to provide more explicit language parallel signals to enhance the training of unsupervised machine translation models. Through experiments on multiple typical language pairs, we demonstrate the effectiveness of our proposed approaches.


Introduction
Recently, Neural Machine Translation (NMT) (Bahdanau et al., 2014; Sutskever et al., 2014) has developed rapidly and become the dominant paradigm in machine translation. On the one hand, the development of deep neural networks such as the Transformer (Vaswani et al., 2017; Li et al., 2021a) has played a significant role in NMT's improvements. On the other hand, large-scale parallel corpora like the UN corpus (Ziemski et al., 2016) have also played an important role.
Despite the recent success of NMT on standard benchmarks, the need for large-scale parallel corpora has limited the effectiveness of NMT in many language pairs, especially low-resource ones (Koehn and Knowles, 2017). Unsupervised Neural Machine Translation (UNMT) (Artetxe et al., 2018b) was proposed to alleviate this issue by completely removing the need for parallel data and training an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. Unsupervised machine translation does not need the parallel information from parallel sentences; rather, it generally uses embedding alignments, initializes parameters with pretrained language models, and uses iterative back-translation between two languages to synthesize pseudo parallel corpora for model training (Lample et al., 2018a,c; Sun et al., 2019; Conneau and Lample, 2019; Li et al., 2020a).

(Figure 1 caption: English Penn Treebank (PTB) and German dataset of SPMRL14 shared task. The dotted box indicates the constituents that can be masked for prediction.)
The pseudo parallel data created by iterative back-translation is the key to the success of unsupervised NMT model training (Kim et al., 2020). It takes advantage of the equivalence of translation languages to bring supervision (albeit weak supervision) to model training. Recent results in semi-supervised NMT have demonstrated that further training a UNMT model with true bilingual parallel sentences can lead to better translation performance (He et al., 2016;Kim et al., 2020;Conneau and Lample, 2019;Song et al., 2019a), which suggests that after training, UNMT models are still not optimized because of their lack of explicit supervision.
Universal grammar (UG) is a notion in linguistics and philosophy that goes back at least to Roger Bacon's observation, "in its substance, grammar is one and the same in all languages, even if it accidentally varies" (Bacon, 1902). Chomsky (1965a,b) developed a universal grammar theory. The idea of a universal grammar states that all human languages are species of a common genus because they have all been shaped by a factor that is common to all human beings (Lappin and Shieber, 2007;Nivre, 2015). Therefore, in this paper, we leverage this grammar commonality to derive additional supervision to enhance UNMT training. In other words, our proposed method is built on the existence of universal grammar. If there is no crosslingual commonality and definitional similarity in the syntactic structure, then we will not be able to obtain weakly supervised signals for UNMT.
Specifically, we choose the grammar representation framework of constituent syntax as the research object. Unlike typical approaches to leveraging syntax information, rather than adopting a syntactic encoder to enhance representations, we focus on acquiring more supervision by finding commonalities between two languages' syntaxes, and we demonstrate this supervision by training UNMT models. Since different languages often share some of the same constituent types (syntactic categories), predicting these matching constituents during model training can serve as a weak alignment. As shown in Figure 1, although the two sentences are not parallel, during training the model is exposed to both NP and VP constituents, and a weak alignment between these constituents can be used to enhance UNMT training; i.e., the NP constituents in English and the NP constituents in German (likewise for VP, PP, etc.) are more likely to be parallel. Notably, our method is only one application of Universal Grammar in UNMT, far from all of them, since we leverage only a very small part of Universal Grammar (universal constituents and syntactic label definitions).
Masked Language Modeling (MLM) is a commonly used training approach for language modeling. In MLM, some of the tokens in a sentence are masked, and the model is then required to predict these masked tokens at their placeholders. Based on MLM, we propose a CONSTMLM approach that also draws on constituent syntax. In CONSTMLM, constituents are masked, and the model is tasked with predicting both the tokens in a constituent and the constituent's syntactic category. Masking large constituents presents too difficult a problem for the model, as there is insufficient context, so we also propose BTLM, a method that leverages back-translation to provide more context and alleviate this issue. We then implement CONSTBTLM based on CONSTMLM, which leverages our proposed BTLM. To accommodate both UNMT and language modeling training, we provide both encoder-decoder models and encoder-only models for our CONSTBTLM, BTLM, and CONSTMLM approaches.
In our experiments, we demonstrated the effectiveness of leveraging universal grammar and of our proposed approaches on multiple unsupervised translation tasks. Our proposed approaches show consistent improvements compared to the baselines in these tasks. We also present a significantly boosted performance on several low-resource semisupervised tasks. These results verify that universal grammar commonalities can bring additional supervision information to bolster the training of unsupervised and low-resource translation models.

Background
We formally present the background of our baseline UNMT system in terms of unsupervised machine translation between languages L1 and L2. Our UNMT model follows an encoder-decoder architecture as in standard NMT. We use a joint subword (Sennrich et al., 2016b) vocabulary shared between languages and share parameters between the source→target and target→source models to take advantage of multilingualism (Edwards, 2002). In this framework, three training methods are indispensable for the feasibility of unsupervised machine translation: initialization, denoising generation, and iterative back-translation. UNMT models typically use denoising generation and iterative back-translation simultaneously by alternating between the two methods in a single phase rather than separately in multiple phases. The model is given monolingual data {X_i} in language L1 and {Y_j} in language L2. |X| and |Y| are the numbers of sentences in the monolingual data {X_i} and {Y_j}, respectively.
Initialization Initialization is a crucial step for bootstrapping UNMT models. The initialization process injects non-randomized cross-or multilingual knowledge into a UNMT model. In general, two types of initialization are usually adopted (Lample et al., 2018c). The first entails initializing the embedding layer of a UNMT model with pre-trained embeddings, while the second uses a pre-trained language model with the same structure as the UNMT encoder to initialize the embedding layer and most of the neural network parameters in the encoder and decoder (Conneau and Lample, 2019). The experimental performance in (Conneau and Lample, 2019) shows that using a pre-trained language model to initialize a UNMT model can produce better performance, so we choose this as our method of initialization.
Denoising Generation Denoising generation training aims to help UNMT models learn to generate fluent text. Noise is introduced into input sentences via replace, delete, and shuffle functions, and the UNMT model is then tasked with encoding these noisy sentences and using the encodings to reconstruct the original sentences. The UNMT model is optimized by the loss L_D during this training process:

    L_D = E_{X∼{X_i}}[−log P_{L1→L1}(X | N(X), θ)] + E_{Y∼{Y_j}}[−log P_{L2→L2}(Y | N(Y), θ)],

where N(·) refers to the noise functions and θ represents the UNMT model parameters. P_{L1→L1} and P_{L2→L2} denote the reconstruction probabilities in the languages L1 and L2, respectively.
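The three noise functions can be sketched as follows (a minimal stdlib sketch; the function name, noise rates, and the `<blank>` placeholder token are illustrative assumptions, not the paper's exact configuration):

```python
import random

def add_noise(tokens, rng, p_drop=0.1, p_blank=0.1, k=3):
    """Sketch of UNMT-style noise: delete words, replace words with a
    blank token, and locally shuffle within a window of about k positions
    (rates and window size are illustrative)."""
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                                   # delete
        out.append("<blank>" if r < p_drop + p_blank else tok)  # replace
    # local shuffle: perturb each index by a small random offset and re-sort
    keys = [i + rng.uniform(0, k) for i in range(len(out))]
    return [tok for _, tok in sorted(zip(keys, out), key=lambda x: x[0])]
```

Because the perturbation added to each index is bounded by k, tokens can only move a few positions, so the noisy sentence stays locally similar to the original.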
Iterative Back-translation Back-translation (Sennrich et al., 2016a) was first proposed to boost translation performance using target-side monolingual data. By using symmetric models, it can boost translation in both directions. In UNMT, back-translation is used to synthesize pseudo parallel data from monolingual text, which alleviates the scarcity of true parallel data. This synthesis is performed repeatedly throughout UNMT training. The loss L_B is defined as follows:

    L_B = E_{X∼{X_i}}[−log P_{L2→L1}(X | S_{L1→L2}(X), θ)] + E_{Y∼{Y_j}}[−log P_{L1→L2}(Y | S_{L2→L1}(Y), θ)],

where S_{L1→L2} and S_{L2→L1} represent the translation processes from L1 to L2 and from L2 to L1, respectively. P_{L1→L2} and P_{L2→L1} denote the translation probabilities between the two languages.
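Schematically, one round of pseudo-parallel synthesis looks like this (a sketch; `s_xy` and `s_yx` are placeholders standing in for the current model's two translation directions):

```python
def synthesize_pairs(mono_x, mono_y, s_xy, s_yx):
    """One back-translation round (schematic): the synthetic side of each
    pair is model output, and the reference side is genuine monolingual
    text, so the decoder is always trained toward real sentences."""
    # the L2->L1 direction trains on (S_{L1->L2}(X), X)
    pairs_yx = [(s_xy(x), x) for x in mono_x]
    # the L1->L2 direction trains on (S_{L2->L1}(Y), Y)
    pairs_xy = [(s_yx(y), y) for y in mono_y]
    return pairs_xy, pairs_yx
```

In iterative back-translation this loop is repeated: as both directions improve, the synthetic sources become cleaner, which in turn yields better pseudo-parallel data.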

CONSTMLM
We propose Constituent Masked Language Modeling (CONSTMLM) in this section. CONSTMLM is a variant of MLM that is enhanced with constituent syntax information. In traditional MLM, given a sentence X = {x_1, x_2, ..., x_n} of n tokens and a set of masked positions M, the training loss L_MLM is:

    L_MLM = (1/|M|) Σ_{i∈M} −log P(x_i | X_{\M}, θ),

where |M| is the size of the set M, and X_{\M} indicates the sequence after masking. The masked position set M consists of randomly sampled discrete positions, that is, M = TopK([rand_i(0, 1)]_{i=1}^{n}), where TopK is a function that selects positions by probability until the masking budget has been spent. In span-based MLM such as SpanBERT (Joshi et al., 2020), a span length ℓ is first sampled from a geometric distribution ℓ ∼ Geo(p), and the start position of the span is sampled in the same manner as in MLM, giving the final masked span set. In another linguistically guided language modeling approach, Zhou et al. (2020b) proposed Syntactic/Semantic Phrase Masking (SPM) for their model LIMIT-BERT. In SPM, the masked position set consists of tuples randomly sampled from a linguistic span set instead of the discrete token position set. Only the span boundary information, however, is used in SPM; the linguistic label is ignored, so we remedy this and propose CONSTMLM.
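The span sampling just described can be sketched as follows (a sketch assuming SpanBERT-style defaults of p = 0.2 and a length clip; the function name and parameters are illustrative):

```python
import math
import random

def sample_span(n, p=0.2, max_len=10, rng=random):
    """Sample one masked span (start, end) over n tokens: the length is
    drawn from Geo(p) via inverse-transform sampling and clipped to
    max_len, and the start position is uniform, as in span-based MLM."""
    u = rng.random()
    length = 1 + int(math.log(1.0 - u) / math.log(1.0 - p))  # Geo(p) draw
    length = min(length, max_len, n)
    start = rng.randrange(n - length + 1)
    return start, start + length
```

Because Geo(p) is skewed toward small values, most sampled spans are short, which keeps enough context visible for the prediction task.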
In CONSTMLM, we first extract and filter the constituent span set CS = {(s, e, c)_i}_{i=1}^{m}, where s, e, and c represent the start position, end position, and syntactic category, respectively. During filtering, constituent spans whose length ratio ℓ/n is greater than γ are removed. Random sampling is then performed on this set to obtain the masked span set. Unlike SpanBERT and LIMIT-BERT, we sample only one span at a time because CONSTMLM not only predicts the masked tokens in the sampled span but also predicts the syntactic category of the sampled span. CONSTMLM sums the loss from the span's syntactic category and the regular masked language model objective for each token in the masked span:

    L_CONSTMLM = Σ_{i=s}^{e} −log P(x_i | X_{\s:e}, θ) − log P(c | X_{\s:e}, θ).
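The extraction-and-filtering step can be sketched over a bracketed parse string (a minimal stdlib sketch; real treebank trees would need tokenization-aware handling, and the function name is an assumption):

```python
import re

def extract_spans(bracketed, gamma=1.0):
    """Extract (start, end, label) phrase spans from a bracketed parse
    string, skipping POS-level brackets and dropping spans whose
    length ratio (e - s) / n exceeds gamma."""
    toks = re.findall(r"\(|\)|[^\s()]+", bracketed)
    spans = []
    pos = 0

    def parse(i):                          # i points at '('
        nonlocal pos
        label, i = toks[i + 1], i + 2
        start, is_phrase = pos, False
        while toks[i] != ")":
            if toks[i] == "(":
                i, is_phrase = parse(i), True
            else:                          # leaf token
                pos += 1
                i += 1
        if is_phrase:                      # keep phrases, not POS tags
            spans.append((start, pos, label))
        return i + 1

    parse(0)
    n = pos
    return {(s, e, c) for s, e, c in spans if (e - s) / n <= gamma}
```

With γ = 0.7, for example, the whole-sentence S span is filtered out while NP and VP sub-spans survive, matching the intuition that masking the full sentence leaves no context.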
Since the UNMT model architecture, which includes both an encoder and a decoder, differs from pre-trained language models in general, we provide two implementations of CONSTMLM: encoder-only and encoder-decoder. In the encoder-only CONSTMLM, the masked span's token and syntactic category predictions are both performed on the encoder side, which is no different from popular pre-trained language models such as BERT that consist only of encoders. Both target prediction probabilities are calculated as follows:

    P(x_i | X_{\s:e}, θ) = Softmax(MLP(enc(X_{\s:e}))),
    P(c | X_{\s:e}, θ) = Softmax(MLP(Pooling(enc(X_{\s:e})))),

where enc(·) represents the encoding process, and Pooling(·) is a pooling operation that uses a first-token pooling strategy.
In the encoder-only CONSTMLM, only the encoder is updated by the loss; the decoder cannot benefit from it. Using the same training method on the decoder as on the encoder is not viable because the decoder uses incremental self-attention instead of full self-attention. To mitigate this, we propose an encoder-decoder CONSTMLM, in which the masked token prediction probability is calculated as:

    P(x_i | X_{\s:e}, θ) = Softmax(MLP(dec([BOS, X_{s:e−1}], enc(X_{\s:e})))),

where dec(·) represents the decoding process, and [BOS, X_{s:e−1}] is the operation of prepending a BOS token to the sequence X_{s:e−1}. In encoder-decoder CONSTMLM, the encoder still handles the incomplete sentence encoding, so the syntactic category prediction is consistent with that of the encoder-only version. This means that the weak alignment information brought by the syntactic category still directly trains the encoder, while the decoder is optimized by the span generation process.
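The input/target layout of the encoder-decoder variant can be illustrated with a small helper (token strings like `<mask>` and `<bos>` are placeholders for whatever the actual vocabulary uses; the helper name is an assumption):

```python
MASK, BOS = "<mask>", "<bos>"

def constmlm_encdec_io(tokens, s, e):
    """Build (encoder input, decoder input, decoder target) for a masked
    span tokens[s:e]: the encoder sees the masked sentence, and the
    decoder predicts the span shifted right behind a BOS token."""
    enc_in = tokens[:s] + [MASK] * (e - s) + tokens[e:]
    dec_in = [BOS] + tokens[s:e - 1]       # teacher-forced, shifted right
    dec_out = tokens[s:e]
    return enc_in, dec_in, dec_out
```

This shows why the decoder is trained by the span generation while the encoder only ever sees the incomplete sentence.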

BTLM and CONSTBTLM
Whether in traditional MLM or span-based MLM, the number of masked tokens is limited to a certain ratio of the sentence. In BERT's implementation, at most 15% of the tokens are put up for masking. SpanBERT followed this practice and, after obtaining span lengths by sampling a geometric distribution skewed towards shorter spans, removed spans with a length greater than ℓ_max = 10. Skewing towards shorter spans is crucial because of an issue in MLM: if too many tokens are masked, it is difficult for the model to recover them from the remaining incomplete sentence. Limiting the number of masked tokens is especially important for span-based MLM, as spans can compose much larger parts of the sentence. We call this the difficulty of reasoning with insufficient information. This situation is still acceptable for language model pre-training, and limiting the maximum ratio of masked tokens in MLM and the span length in span-based MLM alleviates the issue, but for linguistically guided span-based MLM, the length of the extracted span cannot be set flexibly because it contains specific grammatical information. Making the maximum span width too small means that too few spans, or even no spans for some trees, are extracted.

To combat the difficulty of reasoning with insufficient information, we first propose Back-translation Language Modeling (BTLM), a training method that can use cross-lingual translation as a source of information for inference. It can be formally presented as:

    L_BTLM = (1/|M|) Σ_{i∈M} −log P(x_i | X_{\M}, S_{L1→L2}(X), θ).

In BTLM, the sentence X in language L1 is first translated into language L2 by S_{L1→L2} for use as cross-lingual context. Then, X is masked as in MLM. Finally, the target prediction is performed by jointly considering the cross-lingual context and the MLM context. Due to the existence of a complete (albeit noisy) cross-lingual context, the proportion of masked spans in a sentence can be significantly increased.
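Concretely, the BTLM input can be assembled as follows (a sketch; `translate` is a stand-in for S_{L1→L2}, and the `</s>` separator and `<mask>` tokens are assumed placeholders):

```python
def btlm_inputs(tokens, s, e, translate):
    """BTLM sketch: prepend the model's own translation of the full
    sentence as cross-lingual context, then mask the span tokens[s:e].
    Returns the model input and the target tokens to predict."""
    y_hat = translate(tokens)                        # noisy but complete context
    masked = tokens[:s] + ["<mask>"] * (e - s) + tokens[e:]
    return y_hat + ["</s>"] + masked, tokens[s:e]
```

Because the translated context covers the whole sentence, even a large masked span still leaves the model enough information to infer the targets.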
In addition, this training forces the model to infer with a cross-language context, which implicitly promotes bilingual alignment.
We also implemented encoder-only and encoder-decoder versions of CONSTBTLM for different purposes. In encoder-only CONSTBTLM, the target prediction probabilities become:

    P(x_i | X_{\s:e}, Ŷ, θ) = Softmax(MLP(enc([Ŷ, X_{\s:e}]))),
    P(c | X_{\s:e}, Ŷ, θ) = Softmax(MLP(Pooling(enc([Ŷ, X_{\s:e}])))),

where Ŷ = S_{L1→L2}(X), and [Ŷ, X_{\s:e}] indicates that the translated sequence Ŷ is prepended to the rest of the sequence. Purely from an implementation perspective, the use of cross-lingual context here is consistent with the TLM proposed in (Conneau and Lample, 2019); the difference is that we only mask the input monolingual sequence, while TLM masks both of the input parallel sentences.
In CONSTMLM, the encoder handles predicting syntactic categories. Although the encoder-decoder version is designed so that the decoder can also be updated during CONSTMLM training, the decoder is still only responsible for generating the masked span sequence.
In the encoder-only version of CONSTBTLM, the encoder also handles the prediction of syntactic categories, but cross-lingual context is adopted to support larger span masking. As for the encoder-decoder version, the encoder handles the cross-lingual context, and the decoder predicts syntactic categories and generates the masked span text. In CONSTMLM and the encoder-only CONSTBTLM, the weak alignment training on the syntactic category is performed on the source side, while it is completed on the target side in the encoder-decoder CONSTBTLM. For the detailed training process, please refer to Appendix A.1.

Setup
Following the XLM codebase 1 and the model structure setup (6 stacked Transformer layers with a hidden dimension of 1024) of (Conneau and Lample, 2019), we train the baseline UNMT model with an embedding-shared Transformer encoder-decoder architecture. The UNMT model training is divided into two stages: pre-training and unsupervised training. Our method is only used in the second stage for fast convergence. To make the unsupervised training more thorough, we used an epoch size of 400K instead of the 200K originally recommended in XLM. γ is set to 0.3 in CONSTMLM and 0.5 in CONSTBTLM.
As the source of monolingual corpora for training, we use the 2007-2018 News Crawl datasets for English (En), French (Fr), German (De), Romanian (Ro), and Chinese (Zh). Since the Chinese News Crawl data is relatively small, we extracted sentences from Wikipedia dumps and converted them from traditional Chinese to simplified Chinese. Joint Byte-Pair Encodings (BPE) (Sennrich et al., 2016a) with 60K merge operations were used in the translation experiments for all language pairs. We explored the role of UG at two different monolingual corpus sizes in UNMT. All monolingual data from the newstest 2008-2018 is combined for use in the large-scale setting, while a subset of 5M sentences per language was randomly sampled from this data in the smaller-scale setting. Our evaluations were mainly carried out under unsupervised and low-resource semi-supervised scenarios. In the unsupervised translation scenario, we report results on WMT newstest2014 for En-Fr and En-Ro, WMT newstest2016 for En-De, and WMT newstest2020 for En-Zh. In the low-resource semi-supervised translation scenario, the IWSLT'14 En-Fr and En-De parallel sentences were used for training. IWSLT14.TED dev2010, tst2010, tst2011, and tst2012 were merged to evaluate the En-Fr translation model, and dev2010, dev2012, tst2010, tst2011, and tst2012 in IWSLT14.TED were merged to evaluate the En-De model.
To acquire constituent parse trees for monolingual sentences, we adopted the current state-of-the-art Berkeley Neural Parser (Kitaev and Klein, 2018) as our parsing model and trained an En parser using PTB (Marcus et al., 1993), Fr and De parsers using the SPMRL14 multilingual constituent treebank (Seddah et al., 2014), and a Zh parser using CTB (Xue et al., 2005). Since no constituent treebank is available for Ro, and for the consistency of the constituent trees used in En-Ro UNMT, we created En and Ro pseudo-constituent treebanks by converting their respective UD 2.7 treebanks using the Head Feature Principle (HFP) (Pollard and Sag, 1994), and trained En* and Ro* parsers on them. The processing and training details of each parser are presented in Appendix A.2. For each language, 500K sentences are parsed with these trained parsers for UNMT and low-resource semi-supervised NMT enhancement.

Results and Analysis
The results of the UNMT experiment are mainly shown in Table 1. When a large-scale monolingual corpus is used, our baseline model outperforms XLM's reported results. This may be due to the larger epoch size, which makes for more adequate training. On top of our strong baseline model, the four implementations of our CONSTMLM and CONSTBTLM approaches achieve consistent improvements in all language pairs, which demonstrates the effectiveness of universal grammar in UNMT. In the large-scale monolingual corpus scenario, comparing the four implementations of CONSTMLM and CONSTBTLM, we find that enc-only is generally weaker than the enc-dec implementation. This shows that training the model as a whole is better than training only part of it. This conclusion also partially explains the source of improvement of other enc-dec pre-training methods in UNMT, such as MASS (Song et al., 2019b) and BART (Lewis et al., 2020).
In the small-scale monolingual training data scenario, the performance of the baseline model declines considerably compared with the large-scale monolingual scenario, which shows that the size of the monolingual data is still an important factor in UNMT model performance. As in the large-scale monolingual scenario, our CONSTMLM and CONSTBTLM achieve improvements in translation performance, and the maximum increase is even greater than that in the large-scale monolingual scenario. This shows that when training data is relatively scarce, introducing universal grammar as prior knowledge can effectively mitigate the performance loss.
Comparing the improvements of our approaches across language pairs, we find that the average improvement of each language pair is basically consistent with the overlap of constituent labels between the languages; that is, En-De, En-Zh, and En*-Ro* improve more than En-Fr (refer to Appendix 4.4 for the detailed statistics). This shows that the more grammatical commonalities two languages share, the greater the supervision from their alignment will be. In addition, compared to the recent state-of-the-art work MASS, which focuses on pre-training, ours concentrates on NMT training with weak parallel information from universal grammar, so our contribution is orthogonal to theirs.
In Table 2, we report the evaluation results for the low-resource semi-supervised scenario. We use a small-scale, monolingually trained UNMT model as the basis, so we also include the results of the UNMT model evaluated directly on the test datasets. After using the parallel data, the performance of our baseline model greatly improved, which reinforces our claim that UNMT models do not receive enough supervision in back-translation training. With the use of universal grammar for enhancement, the CONSTMLM and CONSTBTLM enc-only methods achieved only a slight improvement, which may suggest that training enhancement on the encoder side does not significantly improve translation performance once parallel data is introduced. In the enc-dec approaches, the encoder and decoder are jointly optimized, and the performance improvement is greater, especially in CONSTBTLM enc-dec, where larger and more numerous spans can be leveraged.

Constituent trees and parallel data size
To show that UG plays a role similar to the alignment information given by a parallel corpus, we compare the semi-supervised and UG-enhanced UNMT (UGUNMT) settings. The experimental results are evaluated on IWSLT'14 En-Fr. In the semi-supervised setting, we vary the amount of parallel data, while in UGUNMT we vary the number of monolingual parse trees. The performance trend is shown in Figure 3. The trends in the figure demonstrate that the performance of the UNMT model steadily improved with the addition of parallel corpus data. The performance of UGUNMT showed a similar trend with the increase in constituent parse data. This suggests that UG information plays a role similar to that of parallel data; that is, it brings supervision signals. The demand for monolingual constituent parse data, however, is greater than that for parallel data, and the improvement from parallel data is greater than that from constituent parses, which shows that UG can only provide a weak supervision signal. While UG cannot achieve the same effect as parallel data, it is quite useful when parallel data is lacking.

Different Maximum Span Ratios
As in our approach description, we propose BTLM and its variant with the goal of mitigating the difficulty of reasoning with insufficient information in MLM. Although this problem has been noted in the training of PrLMs such as SpanBERT, in order to verify this problem's presence in the UNMT model and show that our proposed BTLM alleviates this issue, we explored the effects of different maximum span ratios γ in UNMT training. The results are shown in Table 3.
The comparison shows that the higher γ is, the greater the utilization proportion of the phrases in the constituent trees. In CONSTMLM and CONSTBTLM, when γ is small, the phrases available for training are limited, and therefore the performance gains are limited. With increased γ, the utilization proportion increases, but CONSTMLM struggles with reasoning with insufficient information because too many spans are masked, and its performance even declines compared to the baseline. CONSTBTLM can adapt to larger γ and higher phrase utilization proportions, so it achieves better results.

Cross-lingual Alignment Evaluation
In order to verify that better alignment in the UNMT model is obtained using UG and our proposed training approaches, we conducted an experimental exploration of embedding alignment according to the experimental settings of (Conneau and Lample, 2019) and evaluated models on the SemEval'17 En-De cross-lingual semantic word similarity task (Camacho-Collados et al., 2017).

The results are shown in Table 4. As the results show, our method is not only better than the pure embedding training methods, Concat Fasttext and MUSE, on all three evaluation metrics (cosine similarity, L2 distance, and Pearson correlation), but also surpasses our strong XLM baseline, which demonstrates that the alignment of the UGUNMT model is indeed improved by the weak alignment information from syntactic categories.

Universal Constituent Labels
To illustrate the universal nature of phrase grammar, we compute statistics on the constituent labels in the annotations of each language, including the proportions of shared and differing labels. The statistics are shown in Table 7. The statistical data shows that most of the grammatical phenomena (constituent labels) of the three language pairs overlap, and the distributions of these labels are also close across languages. The proportions of common labels in En-De and En-Zh are greater than that in En-Fr. Although En, Fr, De, and Zh each have their own unique grammatical phenomena, they have greater proportions of overlapping labels than differing labels. Since the En* and Ro* labels are pseudo-constituent labels transformed from UD, they cannot be directly compared with En-Fr, En-De, and En-Zh, but they also have many similar labels and comparable common-label proportions, indicating the universality of the UD annotation and the effectiveness of our conversion in preserving grammatical features. This does not explain more complicated issues such as language similarity or commonality, but rather indicates the overlap of grammatical phenomena and universal features in the annotations and parser predictions.

Effects of SpanBERT, LIMIT-BERT, and CONSTBTLM for UNMT
From the main experiments, UNMT performance is improved, especially in the small-scale data setting. To determine whether the improvements come from CONSTMLM/CONSTBTLM and whether the syntactic information is really necessary, we compare our approaches with LIMIT-BERT, which applies a linguistically guided span-based MLM objective during UNMT training, and SpanBERT, which uses a non-syntactic span masking strategy. Implementing SpanBERT and LIMIT-BERT in our UNMT framework is relatively simple: removing the syntactic category prediction objective from the CONSTMLM enc-only variant yields the LIMIT-BERT objective, and further removing the use of the syntactic parse tree in span sampling yields the SpanBERT objective. The results of the comparison are shown in Table 7. The SpanBERT and LIMIT-BERT training approaches both improve translation performance over the XLM baseline, which indicates that additional span-based pre-training is helpful for UNMT. SpanBERT outperforms LIMIT-BERT: because syntactic annotation is costly, the fixed-size set of syntactic parse trees severely limits pre-training when only span boundaries are considered, while SpanBERT with dynamic span masking can get sufficient training. In CONSTMLM, this disadvantage is mitigated by the additional syntactic label prediction, and when we use the enc-dec variant, which is better suited to encoder-decoder structures, its performance exceeds SpanBERT's. This suggests that syntactic information is not useless. With the help of CONSTBTLM, a stronger variant, the UNMT model achieves much better translation results. This demonstrates that in UNMT training, additional pre-training is helpful on the one hand, and on the other hand, effective means of integrating the weak alignment information provided by syntactic parse trees also improve translation performance.

Related Work
UNMT has been greatly developed in recent years (Artetxe et al., 2018b; Sun et al., 2019; Conneau and Lample, 2019; Ren et al., 2019). Syntax has been extensively explored in the supervised MT research field (Wu et al., 2018).

Conclusion and Future Work
In this paper, we mine weak alignment information from universal grammar annotations and use it to improve unsupervised machine translation. Two specific training approaches, CONSTMLM and CONSTBTLM, are proposed to apply this weak supervision. Through empirical exploration on unsupervised and semi-supervised machine translation benchmarks, we verify that universal grammar can boost cross-lingual alignment for UNMT. Our analysis shows that with universal grammar, the reliance on parallel corpora can be reduced while achieving the same effect, because the weak supervision signal based on universal grammar plays a role similar to the supervision signal of a parallel corpus.
In this work, we rely on the dependency syntax of 100+ languages provided by the Universal Dependencies project for synthesizing pseudo-constituent syntax in some languages. In the future, we intend to train a multilingual parser based on the multilingual language model XLM-R (with the training data being a combination of constituent syntax from 10+ languages), which would be able to parse 100+ languages in a single model, further increasing the practicality of our method. In addition, we will examine more low-resource languages to verify the method's universality.

A.2 Parser Training and Evaluation
In this section, we evaluate the performance of the parsers used in this paper on their respective test sets. Our parsing model is based on the architecture described in (Kitaev and Klein, 2018), a state-of-the-art multilingual parser. We trained our En constituent parser with the Penn Treebank (Marcus et al., 1994), our Zh parser with the Chinese Penn Treebank (Xue et al., 2005), and the Fr and De parsers with the SPMRL 2013/2014 shared tasks (Seddah et al., 2013, 2014). These parsers are thus evaluated on the test datasets of these treebanks or shared tasks. Some languages lack well-annotated constituent treebanks, which adds some difficulty to our research in using universal grammar for UNMT. Universal Dependencies (UD), however, provides consistent dependency syntactic annotation for more than 100 languages. Dependency treebanks are usually converted from constituent treebanks, though they may also be independently annotated for the same languages. Constituent trees can be accurately converted to dependency representations using grammatical rules or machine learning methods (de Marneffe et al., 2006). Such convertibility shows a close relation between constituent and dependency representations. Therefore, we consider transforming the widely annotated UD treebank 2 into a constituent treebank for languages that lack constituent annotations. It is not hard to obtain an approximate constituent structure from a dependency structure, but the labels change a lot, and it is also very difficult to train a machine learning conversion model when the original constituent annotations are lacking.
In order to address this inconvenience, we propose converting the dependency structure to a constituent structure using the HFP. Our UNMT model does not need genuine constituent labels; rather, it only needs labels to be consistent across corpora in different languages. As a result, we use the relationship between the head word of a constituent and its dependency head as the constituent label, resulting in a fully annotated constituent parse tree. Like Kitaev et al. (2019), we use the pre-trained language model BERT to enhance the parser: En uses bert-base-cased, Zh uses bert-base-chinese, and Fr, De, and Ro use bert-base-multilingual-cased. The results of the evaluation on each language's dataset are shown in Table 6.

Table 7: Statistics of common and distinct constituent labels in different language pairs. * indicates that the statistics are based on the dataset transformed from UD. The L1 ∩ L2 column gives the number of constituent labels common to languages L1 and L2, with the proportions of these labels in the respective datasets in parentheses. L1 − L2 gives the number and proportion of constituent labels that exist only in language L1, and L2 − L1 those that exist only in language L2.
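The conversion idea above can be sketched concretely. The following is a minimal illustration, not the authors' implementation: each token's constituent spans the token together with all of its dependents, and the constituent label is taken to be the dependency relation between the constituent's head word and that word's own head. The data representation (parallel lists of head indices and relation labels) and the function name are our own assumptions, and the span computation assumes a projective dependency tree.

```python
# Hedged sketch of an HFP-style dependency-to-constituent conversion:
# the constituent headed by token i covers i and all its descendants,
# and its label is the dependency relation of token i to its head.
# Representation and names are illustrative, not the paper's code.

def dep_to_constituents(heads, rels):
    """heads[i]: 0-based index of token i's head (-1 for the root);
    rels[i]: dependency relation label of token i.
    Returns a nested tuple (label, start, end, children), end inclusive."""
    n = len(heads)
    children = [[] for _ in range(n)]
    root = -1
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    def build(i):
        subtrees = [build(c) for c in children[i]]
        # Span of the constituent: the head token plus all descendant spans.
        # For non-projective trees this min/max span is only approximate.
        lo = min([i] + [s[1] for s in subtrees])
        hi = max([i] + [s[2] for s in subtrees])
        return (rels[i], lo, hi, subtrees)

    return build(root)
```

For example, for "the cat sleeps" with `heads = [1, 2, -1]` and `rels = ["det", "nsubj", "root"]`, the whole sentence becomes a `root` constituent containing an `nsubj` constituent over "the cat", which in turn contains a `det` constituent over "the".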

A.3 Related Work
Unsupervised machine translation systems have been developed since Knight et al. (2006). Ravi and Knight (2011) framed unsupervised MT as a decipherment task between two languages. With the development of deep end-to-end neural translation and language models, UNMT has become competitive on translation benchmarks. Before this development, unsupervised cross-lingual embeddings (Artetxe et al., 2017; Zhang et al., 2017) and word translation without parallel data (Lample et al., 2018b) were alternative approaches to unsupervised machine translation. Artetxe et al. (2018a) and Lample et al. (2018c) studied unsupervised training using phrase-based translation systems. Recently, UNMT has been a hot research topic in machine translation (Artetxe et al., 2018b; Sun et al., 2019; Conneau and Lample, 2019; Ren et al., 2019; Sun et al., 2020; Li et al., 2020b). Our work builds on these lines of research in unsupervised machine translation, but we focus on improvement by leveraging universal grammar.
Grammar information, especially syntactic information, has long been a focus of research in machine translation. In statistical machine translation (SMT), syntactic trees were used as the basis for re-structuring, re-labeling, and re-aligning (re-ordering) sentences to improve translation accuracy (Wang et al., 2010). Based on the type of linguistic information used, syntactic SMT can be divided into four types: tree-to-string, string-to-tree, tree-to-tree, and hierarchical phrase-based (Zhang et al., 2008; Nguyen et al., 2008). Our use of universal grammar to enhance UNMT is, from a motivation perspective, similar to the tree-to-tree approach in SMT: parallel syntactic trees are used to obtain structure-alignment information in tree-to-tree SMT, while our approach leverages non-parallel syntactic parse trees to obtain weak alignment information based on our proposed training objectives in UNMT.

In NMT, syntactic information is mainly used as features and/or constraints (regularization). Eriguchi et al. (2016) and Bastings et al. (2017) augmented the RNN encoder with an additional syntactic encoder for feature extraction, as in Tree-LSTM (Tai et al., 2015) and GCN (Kipf and Welling, 2016), and combined it with a standard RNN decoder; Chen et al. (2018) likewise incorporated source-side syntactic information. Sharing NMT model parameters with a syntactic parser for multi-task learning is also a popular approach to obtaining syntactically aware representations (Luong et al., 2016; Dyer et al., 2016; Eriguchi et al., 2017; Nădejde et al., 2017). The use of syntax in UNMT research is relatively rare. Xu et al. (2020) incorporated syntax information into a UNMT model by leveraging linearized parse trees of the training sentences. Although all these works use syntactic information, our motivation is very different: unlike approaches that use syntax information as a feature or constraint, we use it to produce a form of weak supervision that can guide model training.
We differ from multi-task learning approaches that combine syntax and machine translation in that our purpose is not to predict the syntactic tree but to align text across languages using syntactic categories, which we achieve through a masking-and-prediction process over syntactic constituents.