PROSE: A Pronoun Omission Solution for Chinese-English Spoken Language Translation

Neural Machine Translation (NMT) systems face a significant challenge when translating from a pro-drop (‘pronoun-dropping’) language (e.g., Chinese) to a non-pro-drop one (e.g., English), since the pro-drop phenomenon requires NMT systems to recover omitted pronouns. This important task, however, lacks sufficient datasets for benchmarking. To bridge this gap, we introduce PROSE, a new benchmark featuring diverse pro-drop instances for document-level Chinese-English spoken language translation. Furthermore, we conduct an in-depth investigation of the pro-drop phenomenon in spoken Chinese on this dataset, reconfirming that pro-drop reduces the performance of NMT systems on Chinese-English translation. To alleviate the negative impact of pro-drop, we propose Mention-Aware Semantic Augmentation, a novel approach that leverages the semantic embedding of dropped pronouns to augment training pairs. Results on four Chinese-English translation corpora show that our method outperforms existing methods in both omitted-pronoun recovery and overall translation quality.


Introduction
In recent years, neural machine translation (NMT) technology has made significant progress in lowering communication barriers between individuals from different language backgrounds. However, NMT systems often struggle when translating sentences from a pro-drop ('pronoun-dropping') language, such as Chinese, Korean, and Japanese (Shimazu et al., 2020), to a non-pro-drop language, such as English, French, and German (Haspelmath, 2001). While the pro-drop phenomenon has been widely studied in the research community (Nagard and Koehn, 2010; Taira et al., 2012; Wang et al., 2016, 2018a; Tan et al., 2019), advanced commercial NMT systems still occasionally fail to faithfully recover pronouns dropped in the source language.
In some cases, leaving dropped pronouns unrecovered can result in severe semantic distortion and alter the intended meaning of the translated text, as demonstrated in Figure 1.
To tackle this issue, researchers have proposed two primary strategies: (1) incorporating additional pro-drop resolution systems to provide supplementary syntactic information (Nagard and Koehn, 2010; Taira et al., 2012; Wang et al., 2016); for instance, Xiang et al. (2013) modeled Empty Categories within the framework of government-binding theory; (2) treating pro-drop resolution as a regularization component of the NMT task itself (Wang et al., 2018b; Tan et al., 2019), for example by filling dropped pronouns (Wang et al., 2018a) or predicting pro-drops in the Chinese text encoder of a seq2seq model (Wang et al., 2019). Despite the work done on resolving Chinese pro-drop in NMT so far, benchmarks evaluating the effectiveness of pro-drop mitigation remain highly limited, and Chinese-English spoken translation datasets with fine-grained annotation are even fewer.
In this study, we present PROSE, a PRonoun Omission Solution for Chinese-English spoken language translation. To facilitate research in this area, we introduce a novel dataset for document-level Chinese-English spoken language translation that includes abundant and diverse pro-drop instances with contextual and pro-drop annotations across four spoken language genres (talk, drama, movie, and vlog). Analysis of this dataset reveals the negative impact of pro-drop on Chinese-English spoken language translation. Furthermore, we propose the Mention-Aware Semantic Augmentation approach, which utilizes a mention encoder to capture the semantic embedding of dropped pronouns and employs a mention-side data augmentation technique to generate additional training pairs. Experimental results on four Chinese-English translation corpora demonstrate that our approach significantly increases translation quality and the recovery rate of dropped pronouns compared with baseline methods, on both automatic and human evaluation metrics. Additionally, we conduct ablation studies to provide further insight into the effect of the designated losses.
Our contributions are summarized as follows:
• We construct a document-level Chinese-English spoken translation dataset that covers multiple spoken genres and provides detailed contextual and pro-drop annotation information.
• Our analysis reveals that pro-drop negatively impacts the quality of Chinese-English spoken language translation.
• We propose a Mention-Aware Semantic Augmentation approach that increases the recovery rate of dropped pronouns during translation and hence enhances overall translation quality.

Dataset Creation
To mitigate the scarcity of benchmarks evaluating pro-drop in Chinese-English spoken language translation, we collect and construct a new benchmark, PROSE, a high-quality Chinese-English bilingual dataset spanning four genres: Talk, Drama, Movie, and Vlog.

Data Collection and Filtering
The raw data was collected from bilingual subtitles of publicly accessible videos on the internet. We assume that these subtitles reflect authentic daily spoken expressions in Chinese and cover a diverse range of zero-anaphora phenomena. Specifically, our filtering process is based on three criteria.
• The chosen domain must be spoken rather than written (e.g., news articles), to preserve the colloquial features of Chinese;
• To ensure high-quality English translations, we only considered Chinese source materials that have been manually translated by professionals, rather than relying on machine translations. For instance, we primarily chose Chinese movies that have been promoted overseas and short videos with professional fan-made translations on YouTube;
• To enable accurate restoration of missing pronouns, the source material must contain contextual sentences that lend greater context and accuracy to the translations.
We ultimately collected over 20,000 videos in Chinese and over 2 million lines of English and Chinese subtitles, which fall into four distinct spoken genres:
• Talk: Subtitles from personal presentations on websites such as TED.
• Drama: Subtitles from Chinese TV series, such as Nirvana in Fire (琅琊榜).
• Movie: Subtitles from Chinese films, such as The Great Cause of the Founding of the People ( 建国大业 ).
• Vlog: Subtitles from short videos filmed by Chinese internet celebrities, such as Huanong Brothers (华农兄弟).

Pro-drop Annotation
We employ DDparser (Zhang et al., 2020), a Chinese dependency parsing tool, to detect omissions of subject or object pronouns in the Chinese source. Subject pronouns are tagged as SBV (subject-verb) or VV (verb-verb), while object pronouns are tagged as VOB (verb-object), POB (preposition-object), DOB (double object), or DBL (double). Sentences whose dependencies contain none of these marks are assumed to be missing either the subject or the object pronoun. Although this labeling method is not perfect, it warrants further study.
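As a rough illustration, the tag-based heuristic above can be sketched as follows. The helper operates on the list of dependency relations for one sentence (e.g., as produced by a parser such as DDparser); the function name and the example parse are our own simplification for illustration, not part of the actual annotation pipeline.

```python
# Sketch of the pro-drop detection heuristic: flag a missing subject/object
# pronoun when no dependency relation from the corresponding tag set appears
# in the sentence's parse. The tag sets mirror the ones listed in the text.

SUBJECT_TAGS = {"SBV", "VV"}                 # relations implying an overt subject
OBJECT_TAGS = {"VOB", "POB", "DOB", "DBL"}   # relations implying an overt object

def detect_prodrop(deprels):
    """deprels: dependency relation labels for one sentence."""
    tags = set(deprels)
    return {
        "subject_ellipsis": not (tags & SUBJECT_TAGS),
        "object_ellipsis": not (tags & OBJECT_TAGS),
    }

# A sentence whose parse has an object (VOB) but no subject relation
# would be flagged as subject ellipsis only.
parse = ["ADV", "ADV", "VOB", "HED", "CMP"]
print(detect_prodrop(parse))  # {'subject_ellipsis': True, 'object_ellipsis': False}
```

In practice, the relation labels would come from DDparser's output rather than a hand-written list, and accuracy depends on the parser itself (cf. Table 1).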
Below is an example from the subtitles of a short video about cooking.

Data Statistics
The statistics of our datasets, which cover four genres of spoken Chinese, are presented in Table 2. CWMT2018 is the most popular Chinese-English machine translation corpus, containing written language such as news articles, while AIChallenger is, to the best of our knowledge, the largest spoken Chinese-English machine translation dataset.
In comparison with those two widely used bilingual datasets, our dataset is 1) more representative with a higher pro-drop ratio, 2) more diverse, containing four genres of spoken language, and 3) more informative, with contextual and pro-drop annotation information.

Pronoun-Dropping Analysis
To gain more insight into the phenomenon of pro-drop in the translation of spoken Chinese into English, we examine the prevalence of pro-drop in spoken Chinese and its impact on the quality of Chinese-English spoken language translation.

Spoken Chinese Contains More Pro-drop Than Literary Language Formally, pro-drop refers to a reference position that is filled with a morphologically unrealized form, and it is one of the most common referential options in many languages such as Chinese (Wang et al., 2018a), Japanese (Taira et al., 2012), Korean (Park et al., 2015), and Thai (Kawtrakul et al., 2002). Previous studies have revealed that spoken Chinese tends to contain more pro-drops than literary language (Wang et al., 2016, 2017; Xu et al., 2021). However, quantitative studies on pro-drops in different genres of spoken Chinese remain scarce.
As demonstrated in Table 2, both written and spoken language contain a certain proportion of pro-drops, consistent with this characteristic grammatical phenomenon of Chinese. However, written language contains fewer object ellipses than spoken language. For example, in the CWMT2018 dataset, the proportion of Object Ellipsis (2.80%) is significantly smaller than that of Subject Ellipsis (9.00%). Our four bilingual spoken language corpora are varied, displaying differences in the rates of subject and object pronoun drop, average sentence length, average document length, and so on. For example, the average sentence length in three of the spoken corpora, namely Drama, Movie, and Vlog, is much shorter than that of Talk (i.e., individual talks) and AIChallenger. In particular, the Drama, Movie, and Vlog corpora in our dataset contain a striking proportion of pro-drops (about 33% to 46%), more extensive than in the current largest Chinese-English spoken translation corpus, AIChallenger.
Pro-drop Harms the Quality of Chinese-English Spoken Language Translation Subject and object pronouns are frequently omitted in spoken Chinese but must be recovered in non-pro-drop languages like English. The question arises whether current NMT systems can accurately translate spoken Chinese sentences with dropped pronouns into English, a non-pro-drop language, as illustrated in Figure 1. Figure 2 shows the distribution of Chinese-to-English translation errors in our online simultaneous machine translation system, whose primary use case is Chinese-to-English translation (primarily spoken Chinese) in meetings and conferences. We asked experienced labeling experts to categorize the bad cases generated by the online system. As Figure 2 shows, the proportion of errors caused by pro-drop is relatively high, constituting more than 11% of all errors; this is one of the major factors degrading the translation quality of our system. To investigate whether reinstating pronouns in spoken Chinese sentences can improve the quality of Chinese-English spoken language translation, we conduct experiments using spoken Chinese sentences whose omitted pronouns are completed by humans. We first train a Transformer-base (Vaswani et al., 2017; Hassan et al., 2018) model on the CWMT dataset, and then report BLEU (Papineni et al., 2002) scores computed with SacreBLEU (Post, 2018) on the test sets of our four corpora (i.e., Talk, Drama, Movie, and Vlog). Next, the spoken Chinese sentences in the test sets that are detected as pro-drop are completed manually, as shown in brackets in Figure 1.
The experimental results before and after human completion are shown in Table 3. Although the model achieves a 27.40 BLEU score on the CWMT dataset, its performance declines significantly on our dataset, with BLEU scores ranging from 9.38 to 15.56 across the four genres. This indicates a large discrepancy between spoken and written Chinese for neural machine translation systems that rely on data-driven approaches. For reference, the second column of Table 3 displays the pro-drop ratio of each dataset. Human completion of dropped pronouns leads to varying levels of performance improvement, roughly proportional to the pro-drop ratio. Interestingly, even on the CWMT dataset, human completion improves translation quality (+0.51 BLEU), suggesting that pro-drop may also degrade Chinese-English translation quality on that dataset.

Problem Definition
Given two data spaces, X and Y, encompassing all possible sentences in the source (Chinese) and target (English) languages, each sample is a pair of sentences from the two languages, i.e., (x, y) ∈ (X, Y). Here, x = {x_1, x_2, ..., x_{|x|}} is the Chinese sentence containing |x| tokens, and y = {y_1, y_2, ..., y_{|y|}} is the English sentence with |y| tokens. To identify the mentions (coreferences) of entities (i.e., pronouns) in x, its surrounding context is denoted as c. For example, in the context c = "饭应该做好了 (The meal should be ready)", the missing object pronoun of "吃 (eat)" in the sentence x = "走走走，一起去吃吧" can be inferred to be "饭 (meal)", so the translation of the non-pro-drop sentence would be "Let's go out and have a meal together".
The neural machine translation task (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) seeks to model the translation probability P(y|x, c; Θ) using a conditional language model based on the Transformer architecture (Vaswani et al., 2017), where Θ represents the model parameters to be optimized. Formally, the training objective over a set of observed sentence pairs is to maximize the log-likelihood, or equivalently to minimize the negative log-likelihood:

L_nmt(Θ) = − Σ_{(x,y)} Σ_{t=1}^{|y|} log P(y_t | y_{<t}, x, c; Θ).
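The log-likelihood objective can be illustrated numerically. The toy per-position distributions below are invented for the example and stand in for a real model's softmax outputs.

```python
import numpy as np

# Toy sketch of the maximum-likelihood objective: given the model's predicted
# distribution over a 4-word vocabulary at each target position, the training
# loss is the negative log-probability of the reference tokens.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],    # P(y_1 | x, c)
    [0.1, 0.8, 0.05, 0.05],  # P(y_2 | x, c, y_1)
])
reference = [0, 1]  # gold token ids at positions 1 and 2

log_likelihood = sum(np.log(probs[t, y]) for t, y in enumerate(reference))
nll = -log_likelihood  # the quantity minimized during training
print(round(nll, 4))   # 0.5798
```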

Mention-Aware Semantic Augmentation
Motivated by the high prevalence of pro-drop in spoken Chinese and the consequent difficulty of automatically understanding pro-drop source sentences when translating into non-pro-drop English, we present Mention-Aware Semantic Augmentation (illustrated in Figure 3) as a potential solution.
Architecture This approach is built on top of the Transformer (Vaswani et al., 2017). Overall, our approach leverages 1) the mention encoder to complete the dropped pronouns in the input x from the context c given a limited parallel corpus, and 2) representation interpolation in the semantic space of observed samples to expand the set of training pairs, compensating for the lack of large-scale Chinese-English spoken language translation corpora.
Mention-Aware Contrastive Learning We propose a contrastive objective to learn the semantic embedding m of mentions in the source sentence x. Specifically, representations of sentences containing mentions of entities should be "closer" to m than those without mentions.
To this end, we expect the similarity between m and a "similar" sample m+ to be far greater than that between m and a negative sample m−, i.e., Sim(m, m+) ≫ Sim(m, m−). To obtain m−, we use DDparser (Zhang et al., 2020) to detect all mentioned entities in the context and randomly replace them with a special token [MASK]; m+ is sampled by randomly replacing non-entity words instead. The similarity between two embeddings, denoted Sim, is computed as the dot product, which relates to the angle between the two embeddings in the vector space. Consequently, the mention-aware contrastive objective is formulated as follows:

L_mcl(Θ) = − log [ exp(Sim(m, m+)) / (exp(Sim(m, m+)) + exp(Sim(m, m−))) ].
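A minimal numeric sketch of one standard (InfoNCE-style) reading of this contrastive objective follows, assuming a single negative sample per mention and the dot product as Sim; the exact softmax form and the toy vectors are our assumptions for illustration.

```python
import numpy as np

# Contrastive loss that pushes Sim(m, m+) above Sim(m, m-):
# -log softmax over the {positive, negative} similarities.
def contrastive_loss(m, m_pos, m_neg):
    s_pos = float(np.dot(m, m_pos))
    s_neg = float(np.dot(m, m_neg))
    return -s_pos + np.log(np.exp(s_pos) + np.exp(s_neg))

m = np.array([1.0, 0.0])       # mention embedding
m_pos = np.array([0.9, 0.1])   # context with non-entity words masked
m_neg = np.array([-0.2, 0.8])  # context with entity mentions masked
loss = contrastive_loss(m, m_pos, m_neg)
print(loss > 0, loss < 1.0)  # True True: small loss when Sim(m, m+) >> Sim(m, m-)
```

The loss shrinks toward zero as the positive similarity dominates, which is exactly the Sim(m, m+) ≫ Sim(m, m−) condition stated above.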
We further introduce a regularization loss to reduce disagreement among the columns of the mention projection matrix A and to reduce parameter redundancy: L_reg(Θ) = ||A^T A − I||_2, where I is the identity matrix.
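This regularizer can be sketched in a few lines; reading ||·||_2 as the Frobenius norm is our assumption.

```python
import numpy as np

# L_reg = ||A^T A - I||: pushes the columns of the mention projection
# matrix A toward orthonormality, reducing redundancy among them.
def reg_loss(A):
    k = A.shape[1]
    return float(np.linalg.norm(A.T @ A - np.eye(k)))  # Frobenius norm

print(reg_loss(np.eye(3)))  # 0.0: an exactly orthonormal A incurs no penalty
```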
Mention-Side Mixup Interpolation Drawing inspiration from Mixup approaches (Zhang et al., 2018; Wang et al., 2021; Wei et al., 2022), we propose to sample data points from the adjacent mention semantic region to augment the current training instance. Given pairs of samples (x_1, y_1) and (x_2, y_2), Mixup chooses a random mixing proportion λ from a Beta distribution β(α, α) controlled by the hyper-parameter α, and creates an artificial training example (λx_1 + (1−λ)x_2, λy_1 + (1−λ)y_2) to train the network by minimizing the loss on the mixed-up data points:

L_mix(Θ) = E_{λ∼β(α,α)} [ λ ℓ(f(λx_1 + (1−λ)x_2), y_1) + (1−λ) ℓ(f(λx_1 + (1−λ)x_2), y_2) ],

where ℓ is the cross-entropy loss (de Boer et al., 2005) and f denotes the model. Following Appendix A, this objective can be simplified as follows:

L_mix(Θ) = E_{λ∼β(α+1,α)} [ ℓ(f(λx_1 + (1−λ)x_2), y_1) ],

which removes the need for label blending between y_1 and y_2, with λ now drawn from β(α+1, α). This is beneficial when y_2 is a discrete sequence. Accordingly, our mention-side mixup loss minimizes the interpolation loss over a vicinity distribution (Chapelle et al., 2000) defined in the representation space:

L_mix(Θ) = E_{λ∼β(α+1,α)} [ ℓ(f(x, λm + (1−λ)m+), y) ].

In other words, we can utilize the presence or absence of pronoun context (i.e., m and m+) to augment the training samples and enhance robustness toward pronouns.
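The mention-side interpolation step can be sketched as below, assuming λ is drawn from β(α+1, α) as described; the concrete vectors and seed are illustrative only.

```python
import numpy as np

# Mention-side mixup: interpolate the mention representation m with its
# positive counterpart m+ using lambda ~ Beta(alpha + 1, alpha), so the
# target label y needs no blending (per the simplification above).
rng = np.random.default_rng(0)
alpha = 0.1  # the value used in the paper's implementation details

def mixup_mention(m, m_pos):
    lam = rng.beta(alpha + 1, alpha)  # skewed toward 1: mostly the original m
    return lam * m + (1 - lam) * m_pos, lam

m = np.array([1.0, 0.0])
m_pos = np.array([0.0, 1.0])
mixed, lam = mixup_mention(m, m_pos)
# mixed lies on the segment between m and m+ in the representation space
print(0.0 <= lam <= 1.0)  # True
```

With a small α, most sampled points stay near m, so the augmentation perturbs the mention semantics only mildly on average.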

Training and Inference
Finally, we optimize the sum of the above losses:

L(Θ) = L_nmt(Θ) + L_mcl(Θ) + L_reg(Θ) + L_mix(Θ).

During inference, beam search decoding is performed.
points on the test set. Afterward, we optimize the parameters on our specific spoken Chinese corpus, which is relatively small in size. Implementation details are provided in Appendix B.
For analysis, we also report the performance of NMT models trained on different corpora: (1) Base: training the NMT model solely on a small Chinese-English spoken language translation corpus; (2) Fine-tuning: training the NMT model on the AIChallenger dataset and then fine-tuning it on the Chinese-English spoken corpora.

Automatic Evaluation
For automatic translation evaluation, we report classical BLEU (Papineni et al., 2002) scores computed with SacreBLEU. The automatic evaluation results on our four-genre Chinese-English spoken translation dataset are presented in Table 4.
Our experimental results show that the Fine-tuning method outperforms the Base method by 4.76 BLEU points, indicating that the amount of data remains the bottleneck for translation performance on spoken language translation with a limited corpus. Furthermore, the document-level machine translation method (HanNMT) is significantly better than single-text-input-based methods (RecNMT and pro-dropP&T) and data-augmentation-based methods (AdvAug and CsaNMT), indicating that context information is useful for pro-drop translation. Interestingly, the data-augmentation-based NMT methods (AdvAug and CsaNMT) also achieve an approximate BLEU gain of 1.34 to 2.00, demonstrating that sampling in the semantic space to expand the training data can enhance the generalization of pro-drop spoken language translation. In any case, our method greatly outperforms these baselines, demonstrating the effectiveness of our proposed approach for pro-drop translation.

Human Evaluation
We also conduct a human evaluation focusing on three metrics: pronoun recovery (whether the translated sentence is complete or contains missing mentions), semantic correctness (whether the translated sentence is semantically consistent with the source sentence), and overall quality.
We sampled 200 instances from the four corpora and hired two workers to rate the translation results of pro-dropP&T, HanNMT, CsaNMT, and our model on the above three aspects. We used Best-Worst Scaling, which has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Specifically, each score is computed as the percentage of times a system was selected as best minus the percentage of times it was selected as worst, and ranges from −1 (unanimously worst) to +1 (unanimously best). The order in which the translated texts were presented to the judges was random. The details of the questions can be found in Appendix C. Table 5 indicates that HanNMT, a strong document-level machine translation method, performs better than CsaNMT and RecNMT in recovering missing pronouns, possibly due to its use of rich source-side context. Interestingly, CsaNMT, which utilizes data augmentation, exhibits superior semantic correctness and overall quality. Nonetheless, our method outperforms all baselines in pronoun recovery and overall quality, indicating that its performance improvement is attributable to pro-drop resolution. More example translations generated by our model and the comparison systems are presented in Appendix D.
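The Best-Worst Scaling score described above can be computed as follows; the judgment tuples are fabricated for illustration, not the actual annotation data.

```python
from collections import Counter

# Best-Worst Scaling: for each system,
# (% times chosen best) - (% times chosen worst), ranging from -1 to +1.
def bws_scores(judgments):
    """judgments: list of (best_system, worst_system) picks, one per item."""
    n = len(judgments)
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    systems = set(best) | set(worst)
    return {s: (best[s] - worst[s]) / n for s in systems}

# Hypothetical picks over four items (system names match the paper's baselines).
judgments = [("ours", "RecNMT"), ("ours", "CsaNMT"),
             ("HanNMT", "RecNMT"), ("ours", "RecNMT")]
scores = bws_scores(judgments)
print(scores["ours"])  # 0.75: chosen best 3/4 times, never chosen worst
```

Note that by construction the scores across all systems sum to zero, which makes the scale directly comparable across evaluation batches.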

Ablation Study
We conduct ablation studies on our dataset, shown in Table 6, to assess the contribution of the different losses. SacreBLEU scores are reported on the test sets.
The experimental results show that removing Mention-Side Mixup Interpolation causes a 0.72 BLEU point drop, indicating that the mention-based data augmentation method increases the generalization of pro-drop translation. Moreover, we find that all of our losses, especially L_mcl, are beneficial for improving translation quality. This implies that our mention-aware contrastive learning captures the lost pronoun information and thus improves the overall performance of NMT.
It is worth noting that the third row of Table 6 is a strong document-level machine translation baseline, indicating that the improvement of our model comes mainly from the mention-aware losses rather than from the wide source-side context.

Related Work
Pro-drop in Machine Translation Research on pro-drop machine translation falls mainly into two categories: (1) methods using external pro-drop resolution systems and (2) joint pro-drop resolution and translation training. The former relies on syntactic tools to provide extra information to the MT system (Nagard and Koehn, 2010; Taira et al., 2012; Wang et al., 2016), such as modeling empty categories (Xiang et al., 2013). However, directly feeding the output of external pro-drop resolution systems to the translation task yields limited improvements (Taira et al., 2012), since such external systems are trained on small-scale data that is non-homologous to MT. To bridge the gap between the two tasks, later studies (Wang et al., 2018b; Tan et al., 2019; Xu et al., 2022) integrated pro-drop resolution directly into the machine translation task, for example by reconstructing the missing pronouns (Wang et al., 2018a) in the encoder or predicting the pro-drop (Wang et al., 2019).
Unlike previous methods, our method recovers the missing pro-drop information from the context and uses data augmentation in the semantic space to increase the training data. To the best of our knowledge, we are the first to construct a document-level Chinese-English spoken translation dataset covering multiple spoken genres.

Document-Level Machine Translation
Recent work on customized model architectures has focused on improving context representations for document-level translation models, such as context-aware encoders (Voita et al., 2019a), context-aware decoders (Voita et al., 2019b), hierarchical history representations (Miculicich et al., 2018), and memory networks (Maruf and Haffari, 2018). However, Yin et al. (2021) point out that simply feeding contextual text may not accurately disambiguate pronouns and polysemous words that require context for resolution. Thus, we employ contrastive learning to force the model to incorporate mention information about the dropped pronouns.

Data Augmentation in Machine Translation
Our approach is also related to Vicinal Risk Minimization (Chapelle et al., 2000), which formalizes data augmentation as extracting additional pseudo samples from the vicinal distribution of observed instances (Krizhevsky et al., 2012; Zhang et al., 2018; Wang et al., 2021). In machine translation, this vicinity is often defined through adversarial augmentation with manifold neighborhoods (Cheng et al., 2020; Wei et al., 2022). Our approach is similar in that it uses an adjacent mention semantic region as the vicinity manifold for each training instance.

Conclusion
This study provides valuable insights into the phenomenon of pro-drop in Chinese and its impact on Chinese-English spoken language translation. Furthermore, we introduced a new dataset that improves upon existing corpora in representativeness, diversity, and informational value. Lastly, our proposed approach, Mention-Aware Semantic Augmentation, demonstrates superior performance over existing methods in addressing the challenges posed by pro-drop.
Our study underscores the critical importance of taking into account pro-drops in NMT systems, and offers valuable benchmarks and insights to guide future advancements in this field.

Limitations
Our method has shown effectiveness in improving translation quality in the pro-drop machine translation task from a pro-drop language, Chinese in this case, to a non-pro-drop target language, English. However, due to the limited availability of data resources, translation performance from other pro-drop languages, such as Japanese (Sasano and Kurohashi, 2011), Thai (Kawtrakul et al., 2002), Korean (Park et al., 2015), Italian (Iida and Poesio, 2011), and Spanish (Palomar et al., 2001), to non-pro-drop languages remains to be evaluated. Furthermore, our method may not match the performance of large language models such as PaLM (Chowdhery et al., 2022), ChatGPT, and GPT-4, which are trained on massive machine translation corpora and other language resources.

Acknowledgments
We thank all those whose insights and expertise have been instrumental in the realization of our ideas. We are also immensely grateful to the anonymous reviewers for their constructive feedback and comments; their perspectives have greatly enhanced the quality of our work. Lastly, we extend our gratitude to all those who have indirectly contributed to this project. Your support has not gone unnoticed and is much appreciated.
• Eq (10): The Beta distribution is the conjugate prior of the Bernoulli distribution.

B Implementation Details
We implement our method on top of the Transformer-base (Vaswani et al., 2017) in Fairseq (Ott et al., 2019). The embedding dimension k was set to 512, the number of attention heads to 8, the mention encoder E_m, text encoder E_t, and text decoder D_t to 6 layers each, and the maximum sequence length to 200. The beam size for beam search was 5. Other hyper-parameters included a dropout rate of 0.1 and the Adam optimizer with a learning rate of 1e-5, β_1 = 0.9, and β_2 = 0.999. To address the out-of-vocabulary problem, we apply a byte-pair-encoding (BPE) vocabulary (Sennrich et al., 2016) with 40k merge operations, and set α in β(α + 1, α) to 0.1. We implemented our model in PyTorch and trained it on 8 Tesla V100 graphics cards.

C Human Evaluation Questions
• Completeness: Does the translated sentence demonstrate syntactic completeness?
• Semantic Correctness: Is the translated sentence semantically correct?
• Overall: What is the overall quality of the translation?

D Examples of Generated Translations
Examples of translations generated by our model and the comparison systems are shown in Table 7, Table 8, Table 9, and Table 10.

Figure 1: Examples of pro-drop in daily spoken Chinese with corresponding English translations produced by our model and several mature commercial NMT systems (Google, DeepL, Baidu, and ChatGPT, respectively; data collected on January 13th, 2023). The Chinese pronouns in brackets are dropped; dropped subject and object pronouns are marked in red and green, respectively.

Figure 2: Error distribution of Chinese-English spoken translation in our online simultaneous translation system. Errors caused by pro-drop (i.e., Ellipsis) account for about 11% of all errors.

Figure 3: The framework of mention-aware semantic augmentation. x and y denote sentences in the source and target languages, respectively. The contextual text is denoted by c.

Table 1: The accuracies of the Subject Ellipsis and Object Ellipsis marking by the annotation tool.

Table 2: The data distribution of our Chinese-English pro-drop machine translation datasets. Doc. and Sen. denote Document and Sentence, respectively. # denotes quantity, and / denotes ratio.

Table 3: Results of Chinese-English spoken language translation with the omitted pronouns completed by humans. Although the model achieved a high BLEU score of 27.40 on the CWMT dataset, its performance declined significantly on our dataset, with BLEU scores ranging from 9.38 to 15.56.
The model consists of three modules: a text encoder E_t, a text decoder D_t, and an additional mention encoder E_m. The mention encoder E_m is a 6-layer Transformer encoder that maps the context c to representations E_m(c) ∈ R^k, where k is the embedding dimension. To obtain a real-valued output, a projection matrix A ∈ R^{k×k} is applied to E_m(c), resulting in m = E_m(c)A. The mention representation m and the text representation r are concatenated at each time step and sent to the decoder to compute the cross-attention. It is worth noting that our mention encoder module shares parameters with the text encoder E_t. Moreover, it is agnostic to the model architecture and can easily be adapted to other text generation frameworks.
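At the shape level, the mention projection and decoder memory construction can be sketched as follows. Random values stand in for real encoder outputs, and concatenating the two memories along the sequence axis is one plausible reading of "concatenated at each time-step"; the dimensions match the implementation details (k = 512).

```python
import numpy as np

# Shape-level sketch of the mention-encoder pathway: project E_m(c) with A
# to obtain m = E_m(c) A, then combine m with the text representation r
# into the memory the decoder cross-attends over.
k, src_len, ctx_len = 512, 10, 30
rng = np.random.default_rng(0)

E_m_c = rng.standard_normal((ctx_len, k))  # mention encoder output for context c
A = rng.standard_normal((k, k))            # mention projection matrix
r = rng.standard_normal((src_len, k))      # text encoder output for x

m = E_m_c @ A                               # mention representations, (ctx_len, k)
memory = np.concatenate([r, m], axis=0)     # decoder cross-attention memory
print(memory.shape)  # (40, 512)
```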

Table 5: Human evaluation results in terms of Best-Worst Scaling. The kappa coefficient between judges is 0.52.

Table 6: Ablation study of the different losses.