Better Chinese Sentence Segmentation with Reinforcement Learning

A long-standing challenge in Chinese–English machine translation is that sentence boundaries are ambiguous in Chinese orthography, but inferring good splits is necessary for obtaining high-quality translations. To solve this, we use reinforcement learning to train a segmentation policy that splits Chinese texts into segments that can be independently translated so as to maximise the overall translation quality. We compare to a variety of segmentation strategies and find that our approach improves the baseline BLEU score on the WMT2020 Chinese–English news translation task by +0.3 BLEU overall and improves the score on input segments that contain more than 60 words by +3 BLEU.


Introduction
Machine translation systems typically operate on sentence-like units, where sentences are translated independently of each other (Vaswani et al., 2017; Bahdanau et al., 2015; Koehn et al., 2003; Brown et al., 1993), in some cases with additional conditioning on a representation of adjacent sentences to improve coherence (Miculicich et al., 2018; Zhang et al., 2018). While many pairs of languages use similar orthographic conventions to designate sentence boundaries, English and Chinese diverge considerably: complete sentences in Chinese may be terminated either unambiguously with a full stop (。) or ambiguously with a comma. See Figure 1 for an example.
This divergence poses a challenge for Chinese-English translation systems, since they must either be able to cope with potentially long, multi-sentence inputs (i.e., translating any text that falls between unambiguous sentence-ending punctuation) or, alternatively, be able to determine which comma occurrences terminate complete sentences that can be translated independently and which do not.

Figure 1: An example taken from the WMT2020 test set in which a single Chinese source segment is translated into two separate English sentences. The highlighted comma separates the two corresponding complete sentences in the Chinese text, whereas the other two commas are sentence-internal boundaries.
Being able to directly accommodate long, multi-sentence inputs has clear appeal. However, in practice, the training data available for translation models is dominated by the relatively short (sub)sentence pairs that are preferentially recovered by standard approaches to sentence alignment (Tiedemann, 2011; Gale and Church, 1993) and by the natural distribution of sentence lengths. Unfortunately, generalisation from training on short sequences to testing on long sequences remains an unsolved problem even in otherwise well-performing translation models (Lake and Baroni, 2018; Koehn and Knowles, 2017). Rather than addressing the length generalisation problem directly, in this paper we side-step it by learning to make segmentation decisions so as to maximise the performance of an unreliable machine translation system that operates optimally only on shorter segments of input Chinese text.
While numerous text segmentation techniques designed to improve machine translation have been proposed over the years (§5), these have typically been based on heuristics that capture linguistic or statistical notions of what "minimal translatable units" consist of. In Chinese, robustly identifying such units is particularly challenging on account of the lack of overt tense and frequent argument dropping, which deprive an annotator of important clues (Huang, 1984). In contrast, we formalise the segmentation problem as a series of classification decisions about whether or not to split at candidate segmentation boundaries, and we use reinforcement learning to train the segmentation policy to optimise the aggregate BLEU score that results from translating the resulting segments with a particular translation system. Our approach is therefore robust to the idiosyncrasies of the underlying translation system; it is capable of discovering a policy that deviates from perhaps unreliable intuitions about minimal translatable units; and it can easily be retrained as the translation system improves.
Experiments indicate that the proposed approach outperforms a baseline that splits only at unambiguous points, a classification approach based on linguistic criteria, and a heuristic system used in prior work. Overall, we improve the BLEU score on the WMT2020 Chinese-English news translation task by 0.3 BLEU, while for segments consisting of more than 60 words the BLEU score increases by 3 points.

Problem setup
We set up the segmentation problem as a Markov decision process (MDP) whose states, actions and rewards are characterized as follows. Every example in the training dataset is treated as a new episode, and the objective is to maximise the sentence-level BLEU score of the translation.

• The state at timestep t is s_t = ⟨(φ(x_1), p_1^(t)), …, (φ(x_n), p_n^(t))⟩, where n is the length of the input sequence x in words, φ(x_i) is a vector encoding of the ith token in context, and p_i^(t) is a record of the split decisions previously taken by the classifier. The decision state for each token at timestep t can take on 4 possible discrete values: no punctuation (no actions are taken on these), undecided punctuation (punctuation on which an action still needs to be taken), un-split punctuation and split punctuation.
• The action set is A = {SPLIT, CONTINUE}. At each timestep t, an action is taken at the leftmost undecided punctuation mark, and the state update marks the corresponding punctuation marker as split or un-split accordingly. The episode is considered terminal when no undecided punctuation markers remain in the sentence.
• For our reward, we use r_t = BLEU(τ(s_{t+1}), y*) − BLEU(τ(s_t), y*), the marginal change in BLEU score due to the current action, similar to Wu et al. (2018). Here τ denotes the translation of the source input constructed from the state definition: we split the sentence according to the punctuation markers in s_t, translate each segment independently, and recombine the best translation of each segment. An action of no split yields zero reward, as the segments remain identical, whereas splitting produces a positive or negative reward depending on the improvement or degradation in the quality of the overall sentence translation. The marginal BLEU rewards sum to the sentence-level BLEU obtained on the full sequence of translations, but provide a denser reward signal, which makes policy learning more efficient.
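To make this MDP concrete, below is a minimal Python sketch of a single episode using the marginal-BLEU reward. The translate() function is a hypothetical stand-in for the underlying (fixed) translation system, and the policy interface is our own assumption; everything else follows the state, action and reward definitions above.

```python
import sacrebleu

SPLIT, CONTINUE = 0, 1

def translate(segment: str) -> str:
    """Hypothetical wrapper around the underlying Chinese-English MT system."""
    raise NotImplementedError

def translation_of_state(tokens, punct_states):
    """Split the source at positions marked 'split', translate each segment
    independently, and recombine the segment translations (this is tau)."""
    segments, current = [], []
    for tok, state in zip(tokens, punct_states):
        current.append(tok)
        if state == "split":
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return " ".join(translate(seg) for seg in segments)

def run_episode(tokens, punct_states, reference, policy):
    """punct_states: one of {"none", "undecided", "unsplit", "split"} per token."""
    transitions = []
    prev_bleu = sacrebleu.sentence_bleu(
        translation_of_state(tokens, punct_states), [reference]).score
    while "undecided" in punct_states:
        i = punct_states.index("undecided")       # leftmost undecided punctuation
        action = policy(tokens, punct_states, i)  # SPLIT or CONTINUE
        punct_states[i] = "split" if action == SPLIT else "unsplit"
        new_bleu = sacrebleu.sentence_bleu(
            translation_of_state(tokens, punct_states), [reference]).score
        transitions.append((list(punct_states), action, new_bleu - prev_bleu))
        prev_bleu = new_bleu                      # rewards telescope to sentence BLEU
    return transitions                            # consumed by the learner
```

Note that a CONTINUE action leaves the segmentation, and hence the translation, unchanged, so its reward is exactly zero, as stated above.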
Network and Learning Algorithm. We learn optimal segmentation policies using a distributed deep RL algorithm, IMPALA (Espeholt et al., 2018), which provides data-efficient, stable learning at high throughput by combining decoupled acting and learning. At each timestep, our policy receives a state s_t and defines a distribution over the discrete action set A = {SPLIT, CONTINUE}, i.e. π(s_t): R^n → A. Our algorithm also employs a value network that learns the value of a state, V(s_t): R^n → R, which is used to regularize the policy network. In this work, we use a transformer encoder with self-attention layers as an observation encoder and then apply a feed-forward classifier on this encoded observation to learn the policy's action distribution. More details on the network architecture can be found in Appendix A.
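For illustration, here is a PyTorch-style sketch of this architecture using the layer sizes from Appendix A.3. The token and decision-state embedding scheme, the model width, the activation choice, and reading out the representation at the current decision position are our assumptions, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SegmentationPolicy(nn.Module):
    """Shared transformer encoder with separate policy and value heads."""

    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # 4 decision states: no / undecided / un-split / split punctuation.
        self.state_embed = nn.Embedding(4, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=512,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Feed-forward heads of sizes 256-2 (policy) and 256-1 (value).
        self.policy_head = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, 2))
        self.value_head = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, tokens, punct_states, position):
        # tokens, punct_states: (batch, seq_len) integer tensors;
        # position: (batch,) index of the punctuation mark being decided.
        h = self.encoder(self.token_embed(tokens) + self.state_embed(punct_states))
        h_pos = h[torch.arange(tokens.size(0)), position]
        return self.policy_head(h_pos), self.value_head(h_pos)  # logits, V(s_t)
```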

Experiments
Our experiments are carried out on the WMT2020 Chinese-English news translation task, subjected to the same pre-processing steps as described by Yu et al. (2020). Figure 2 shows, for each example in our test dataset, the maximum number of segments into which it could be split if a model were to split at every available punctuation mark (commas and full stops, in our case). We report case-sensitive BLEU, as computed with sacrebleu (Post, 2018). All model/training details, such as datasets, model architectures and hyperparameters pertaining to the baseline models, are listed in Appendix B.
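For reference, the evaluation metric can be computed with sacrebleu as in the sketch below; the file names are placeholders.

```python
import sacrebleu

# One detokenized sentence per line; paths are placeholders.
hypotheses = [line.strip() for line in open("system_output.en")]
references = [line.strip() for line in open("reference.en")]

# sacrebleu computes case-sensitive BLEU by default.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}, BP = {bleu.bp:.3f}")
```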
We compare the performance of RLSEGMENT, our proposed approach, to six other baselines that highlight different trade-offs involved in segmentation decisions for machine translation:

• NOSPLIT - Our key baseline, which we wish to improve on: a strategy of performing no splits on the source inputs beyond unambiguous full stops.
• ALLSPLIT - An aggressive segmentation policy in which we segment the source at every possible comma and full stop.
• ORACLE - To compute the oracle score, we translate all possible splits of a given source sentence and select the one that maximizes the example-level BLEU score. This benchmark is the upper limit of any segmentation policy, given a translation model. It is quite expensive to compute, since it requires decoding all possible segmentations of every sequence.
• ORACLESUP - Oracle segmentation decisions from the training corpus are used to set up a supervised binary classification task, with an architecture similar to RLSEGMENT's policy network.
• COMMACLASS - Using the syntactic patterns from the Penn Chinese Treebank 6.0, this system builds a comma classifier to disambiguate terminal and non-terminal commas, similar to Xue and Yang (2011). It uses a transformer encoder followed by a position-wise feed-forward network to classify every comma.
• HEURISTIC - Uses a combination of predictions from COMMACLASS and heuristic length constraints: only long inputs (> 60 words) are split, only at terminal punctuation and at terminal commas suggested by the classifier, and only if the resulting segments are not too short (> 10 words); a sketch of this logic follows the list.
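The following is a minimal sketch of the HEURISTIC splitting logic described above; the terminal_comma_positions input stands in for COMMACLASS predictions, and the exact boundary handling is our assumption.

```python
def heuristic_split(tokens, terminal_comma_positions,
                    max_len: int = 60, min_seg: int = 10):
    """Split inputs longer than max_len words at predicted terminal commas,
    accepting a split only if both resulting segments exceed min_seg words."""
    if len(tokens) <= max_len:
        return [tokens]                      # short inputs are left intact
    segments, start = [], 0
    for pos in sorted(terminal_comma_positions):
        left_len = pos + 1 - start           # words in the candidate segment
        right_len = len(tokens) - (pos + 1)  # words remaining after the split
        if left_len > min_seg and right_len > min_seg:
            segments.append(tokens[start:pos + 1])
            start = pos + 1
    segments.append(tokens[start:])          # trailing segment
    return segments
```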
As discussed above, there is no standard segmentation of Chinese texts into sentences, and therefore all of the "supervised" approaches, including our baselines ORACLESUP and HEURISTIC, construct their own training data on which to train a classifier. RLSEGMENT, on the other hand, requires only a translation system and a corpus of standard parallel data for training, and it learns without any hand-engineered constraints on the model itself, thus presenting a generic solution that can scale across languages and system variants.

Figure 3 shows the distribution of segments proposed by our model, RLSEGMENT. Many of the long sentences (source length ≥ 60 words) are split into two or more independent segments, mitigating the premature truncation seen when transformer models translate such sentences.

Table 1 compares different segmentation policies when translating the WMT20 Chinese-English news test dataset. While the BLEU scores indicate the quality of translations on the entire corpus, we also report BLEU scores on sources longer than 60 words as a metric for the performance of these models on longer sentences, where standard transformers tend to produce translations that are too short. In both cases, we also report the brevity penalty (BP), the component of BLEU that reflects the overall length of the translation. Our proposed segmentation policy, RLSEGMENT, improves both the BLEU scores and the brevity penalties compared to the baseline translation case, NOSPLIT. Specifically, the RL model improves BLEU scores on long sentences by more than 3 BLEU points and BP on those sentences by about 9 points. This shows that our model, via smart segmentation, suffers less from premature truncation of long translations than the baseline, a common problem (Meister et al., 2020; Koehn and Knowles, 2017).

While segmentation of long sentences at appropriate punctuation marks helps performance, segmentation at all punctuation marks is expected to hurt performance, as it is highly likely to produce extremely small segments that lose much of the necessary source context when translated individually. This is demonstrated by the poorer BLEU score of the ALLSPLIT baseline, even though it achieves good BP scores both on the corpus and on long translations. Compared to supervised baselines trained on syntactic data, such as COMMACLASS and HEURISTIC, our model performs competitively on both BLEU and BP without any supervised data for segmentation or hand-engineered length constraints.

Figure 4: An example translation in which segmentation with RLSEGMENT mitigates premature truncation in our system; material dropped by the baseline system is highlighted in grey.

In Figure 4, we see an example where RLSEGMENT mitigates premature truncation of the resulting translation. In this example, although the input (and the resulting English translation) consists of three sentences separated by commas, the segmentation policy has chosen to split at only one position, having learned that the underlying translation system is capable of translating some two-sentence inputs. This example thus illustrates the practicality of learning a segmentation policy based on the abilities of the underlying translation system, not just on the basis of normative notions of translatable units.
(More examples can be found in Appendices C.1 and C.2.) While our model does better than the baselines, a substantial performance gap remains to the oracle BLEU scores that could be achieved via "perfect" segmentation (owing in part to the different data/length characteristics between training and test time), demonstrating the value of further research into better segmentation strategies. However, we also note that RLSEGMENT outperforms ORACLESUP, especially on long sentences. We suspect this has to do with the relative scarcity of such examples in the training data: while a supervised learner can happily ignore those rare cases at little cost in terms of cross-entropy, they have an out-sized impact on BLEU, and the RL learner is therefore sensitive to them.

Results
Finally, it is important to note that while RLSEGMENT improves BLEU at the corpus level, there exist cases where individual translation examples (Appendix C.3) are worse because of inappropriate segmentation.

Related Work
The segmentation of long texts and sentences into segments suitable for translation has been a recurring topic in machine translation research (Tien and Minh, 2019; Pouget-Abadie et al., 2014; Goh and Sumita, 2011; Doi and Sumita, 2003); however, we are the first to apply reinforcement learning to the problem. A related problem arises in automated simultaneous interpretation, where the system must produce translations as quickly as possible but must wait until sufficient context has been received before an accurate translation can be produced. Grissom II et al. (2014) used an RL approach for this setting, targeting a reward that balances translation quality against translation latency.
Chinese comma disambiguation has likewise been studied. However, without exception, this prior work has sought to predict normative notions of what constitutes a complete clause or elementary discourse unit (Xu and Li, 2013; Xue and Yang, 2011; Jin et al., 2004) on the basis of syntactic annotations in the Chinese Treebank (Xue et al., 2005). In contrast, our solution directly targets a segmentation strategy that results in a good downstream translation, rather than conformity to any single normative notion of what constitutes a complete sentence.

Conclusion
In this work, we have addressed a key challenge in Chinese-English machine translation: the ambiguity of English-like sentence boundaries in Chinese, which results in long, multi-sentence Chinese inputs for machine translation. Our solution casts Chinese sentence segmentation as a sequential decision-making problem and then uses reinforcement learning to learn a segmentation policy that maximizes the BLEU score of the eventual translation assembled from the independent segment translations. Our solution does not require any paired training data for segmentation and is able to learn an effective strategy purely from standard parallel machine translation data. Our model outperforms a baseline translation strategy that segments only at unambiguous full stops by 0.3 BLEU at the corpus level and by 3 BLEU on a sub-corpus comprising only source sentences longer than 60 words.

A Data and Model details for RLSEGMENT

A.1 Dataset
We use the training, validation and test datasets from the WMT2020 Chinese-English constrained data provided to shared-task participants. All of the examples are aligned paired translations, with sentence-like units provided as part of the dataset. Pre-processing of the text follows exactly the methodology described in Section 3 of Yu et al. (2020). We also use the sentencepiece tokenizer described in Section 3 of that work to convert the text into integer tokens.

A.3 RL Models
The policy and value networks share common encoding layers, with different feed-forward networks on top of the shared encoder layer. The shared encoder consists of 2 stacked self-attention layers with 8 attention heads each and feed-forward layers of size 512, with a rate of 0.1 for all attention, feed-forward and sub-layer dropouts. Sequence lengths for training were restricted to 280 tokens (the maximum sentence length in the validation set is 277). The policy network applies a feed-forward network of sizes 256-2 to the outputs of the shared encoder, and the value network applies a feed-forward network of sizes 256-1. The Adam optimizer with learning rate 0.0002, β1 = 0.0 and β2 = 0.99 was used for training. For the IMPALA-style loss function (Espeholt et al., 2018), the weights of the policy, baseline and entropy losses are set to 1.0, 0.5 and 0.0005 respectively. Key hyperparameters such as the loss weights, learning rates and model sizes were tuned (a single trial for each hyperparameter configuration) using BLEU scores on the WMT20 Chinese-English validation dataset.

Compute and other details - For both inference and learning, we use 2x2 slices (4 TPUs) of Google TPU v2 with 2 cores per device (the 8 cores were split into 2 cores used for inference and 6 for learning). 512 actors were run in parallel on CPU to generate transition data for the distributed learner. Learning was done on 70 million episodes with a batch size of 256 per core, and the entire experiment had an average run-time of approximately 6 hours.
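For convenience, the hyperparameters above can be summarised in a single configuration; this is only a restatement of the values in this section, with key names of our own choosing.

```python
RLSEGMENT_CONFIG = {
    # Shared encoder (policy and value networks)
    "encoder_layers": 2,
    "attention_heads": 8,
    "feed_forward_size": 512,
    "dropout": 0.1,                 # attention, feed-forward and sub-layer dropout
    "max_sequence_length": 280,     # max validation-set sentence length is 277
    # Heads
    "policy_head_sizes": (256, 2),
    "value_head_sizes": (256, 1),
    # Adam optimizer
    "learning_rate": 2e-4,
    "beta1": 0.0,
    "beta2": 0.99,
    # IMPALA loss weights (Espeholt et al., 2018)
    "policy_loss_weight": 1.0,
    "baseline_loss_weight": 0.5,
    "entropy_loss_weight": 5e-4,
    # Training scale
    "actors": 512,
    "batch_size_per_core": 256,
    "episodes": 70_000_000,
}
```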

B Data and Model details for BASELINES
All baseline models use the same inference model as described in Appendix A.2, and all baselines are evaluated on the WMT20 Chinese-English test data described in Appendix A.1. The baselines NOSPLIT, ALLSPLIT and ORACLE require no training data and are applied directly at test time.

B.1 COMMACLASS and HEURISTIC
Dataset - The comma classifier used in these baselines is trained on Chinese Treebank data prepared in the same format as described by Xue and Yang (2011).
Model - Both of these baselines rely on a comma classifier model.

C Example Model Outputs
The splits in the source sentence have been highlighted.

Sailing Series 5 years ago , resulting in the muscular rupture , and fortunately , the local race medical team helped him deal with it so that he could stand here to take part in the race after 5 years .

C.1 RLSEGMENT Translation
despite the intense sea breeze , Pierre Ives Durand , the helmsman of the French ABM team , told his warm past with Qingdao , the venue of the competition, that he was injured accidentally in the International Extreme Sailing Series five years ago . caused muscle fracture , thanks to the handling of the local event medical team , he can still stand on the venue today five years later .

C.2 RLSEGMENT Translation
Kazan , Nova Scotia , Krasnoyarsk and Arkhangelsk are all improved models of White wax tree -M , with a displacement of 13,800 tons and a submersible depth of 520 meters . staffing 64 , underwater speed of 31 knots. They will carry mines , torpedoes and " caliber " and " agate " cruise missiles .