Enhancing Neural Machine Translation with Semantic Units

Conventional neural machine translation (NMT) models typically use subwords and words as the basic units for model input and comprehension. However, complete words and phrases composed of several tokens are often the fundamental units for expressing semantics, which we refer to as semantic units. To address this issue, we propose Semantic Units for Machine Translation (SU4MT), a method that models the integral meanings of semantic units within a sentence and then leverages them to provide a new perspective for understanding the sentence. Specifically, we first propose Word Pair Encoding (WPE), a phrase extraction method that helps identify the boundaries of semantic units. Next, we design an Attentive Semantic Fusion (ASF) layer to integrate the semantics of multiple subwords into a single vector: the semantic unit representation. Lastly, the semantic-unit-level sentence representation is concatenated to the token-level one, and they are combined as the input of the encoder. Experimental results demonstrate that our method effectively models and leverages semantic-unit-level information and outperforms strong baselines. The code is available at https://github.com/ictnlp/SU4MT.


Introduction
Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015b; Gehring et al., 2017; Vaswani et al., 2017; Zhang et al., 2023) has achieved great success and continues to attract significant attention. Currently, mainstream language models use tokens as the basic meaningful units, which are typically words or subwords, and in a few cases, characters (Xue et al., 2022). However, these tokens are not necessarily semantic units in natural language. For words composed of multiple subwords, their complete meaning is expressed only when they are combined. For some phrases, the overall meaning differs from that of any individual constituent word. Consequently, other tokens attend to each part of a semantic unit rather than its integral representation, which hinders the model from effectively understanding the whole sentence.
Recent studies have explored effective ways to leverage phrase information. Some works focus on finding phrase alignments between source and target sentences (Lample et al., 2018; Huang et al., 2018). Others focus on utilizing source sentence phrase representations (Xu et al., 2020; Hao et al., 2019; Li et al., 2022a, 2023). However, these approaches often rely on time-consuming parsing tools to extract phrases. For learning phrase representations, averaging token representations is commonly used (Fang and Feng, 2022; Ma et al., 2022). While this method is simple, it fails to effectively capture the overall semantics of the phrases, thereby impacting model performance. Some researchers have proposed to effectively learn phrase representations (Yamada et al., 2020; Li et al., 2022b) using BERT (Devlin et al., 2019), but this introduces a large number of additional parameters.
Therefore, we propose a method for extracting semantic units, learning their representations, and incorporating semantic-level representations into neural machine translation to enhance translation quality. In this paper, a semantic unit means a contiguous sequence of tokens that collectively form a unified meaning. In other words, a semantic unit can be a combination of subwords, words, or both.
For the extraction of semantic units, it is relatively easy to identify semantic units composed of subwords, which simplifies the problem to extracting phrases that represent unified meanings within sentences. Generally, words that frequently co-occur in a corpus are more likely to form a phrase. Therefore, we propose Word Pair Encoding (WPE), a method to extract phrases based on the relative co-occurrence frequencies of words.
For learning semantic unit representations, we propose an Attentive Semantic Fusion (ASF) layer, exploiting the property of the attention mechanism (Bahdanau et al., 2015b) that the length of the output vector is identical to the length of the query vector. Therefore, it is easy to obtain a fixed-length semantic unit representation by controlling the query vector. To achieve this, pooling operations are performed on the input vectors and the result is used as the query vector. Specifically, a combination of max pooling, min pooling, and average pooling is employed to preserve the original semantics as much as possible.
After obtaining the semantic unit representations, we propose a parameter-efficient method to utilize them. The token-level sentence representation and the semantic-unit-level sentence representation are concatenated together as the input to the encoder layers, with separate positional encodings applied. This allows the model to fully exploit both levels of semantic information. Experimental results demonstrate that this approach is simple but effective.

Transformer Neural Machine Translation model
The neural machine translation task is to translate a source sentence x into its corresponding target sentence y. Our method is based on the Transformer neural machine translation model (Vaswani et al., 2017), which is composed of an encoder and a decoder that process the source and target sentences respectively. Both the encoder and decoder consist of a word embedding layer and a stack of 6 encoder/decoder layers. The word embedding layer maps the source/target natural language text x/y to its vector representation x_emb/y_emb. An encoder layer consists of a self-attention module and a non-linear feed-forward module, and the encoder outputs the contextualized sentence representation of x. A decoder layer is similar to an encoder layer but has an additional cross-attention module between the self-attention and feed-forward modules. The input of a decoder layer, y_emb, attends to itself and then cross-attends to x to integrate source sentence information and generate the translation y. Generation is usually auto-regressive, and a cross-entropy loss is applied to train a Transformer model:

$$\mathcal{L} = -\sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x) \quad (1)$$

An attention module projects a query and a set of key-value pairs into an output, where the query, key, value, and output are vectors of the same dimension, and the query vector determines the output's length. The output is the weighted sum of the values, where the weights are determined by the dot product of the query and keys. For training stability, a scaling factor 1/√d is applied to the weights. The attention operation can be summarized as equation (2):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \quad (2)$$

Specifically, in the Transformer model, the self-attention module uses the same input vector as the query and the key-value pairs, while the cross-attention module in the decoder uses the target-side hidden states as the query and the encoder output x as the key-value pairs.

Byte Pair Encoding (BPE)
NLP tasks face an out-of-vocabulary problem, which has largely been alleviated by BPE (Sennrich et al., 2016). BPE is a method that splits words into subwords based on statistical information from the training corpus.
There are three levels of text in BPE: character level, subword level, and word level. BPE does not learn how to turn a word into subwords, but rather how to combine character-level units into subwords. Firstly, BPE splits every word into separate characters, which are the basic units. Then, BPE merges the two adjacent basic units with the highest co-occurrence frequency. The combination of two units forms a new subword, which is also a basic unit, and it joins the next merge iteration like other units. The merge operation usually iterates 10k-40k times, and a BPE code table is used to record this process. These operations are called BPE learning, which is conducted on the training set only.
Applying BPE means following the BPE code table and conducting merge operations on the training, validation, and test sets. Finally, a special sign "@@" is added to subwords that do not end a word. For example, the word "training" may become two subwords, "train@@" and "ing".
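As an illustration only (not the authors' implementation), the learn-and-apply cycle described above can be sketched as follows; the toy corpus and merge count are hypothetical:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a word-frequency dict; returns the code table."""
    # Represent each word as a tuple of symbols (characters initially).
    vocab = {tuple(w): f for w, f in words.items()}
    codes = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, f in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        codes.append(best)
        merged = {}
        for sym, f in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            merged[tuple(out)] = f
        vocab = merged
    return codes

def apply_bpe(word, codes):
    """Apply learned merges to one word; mark non-final subwords with '@@'."""
    sym = list(word)
    for a, b in codes:
        i = 0
        while i < len(sym) - 1:
            if sym[i] == a and sym[i + 1] == b:
                sym[i:i + 2] = [a + b]
            else:
                i += 1
    return [s + "@@" for s in sym[:-1]] + [sym[-1]]

codes = learn_bpe({"training": 5, "train": 4, "ing": 3}, num_merges=4)
print(apply_bpe("training", codes))   # ['train@@', 'in@@', 'g']
```

Note how the learned merges are conducted on the training vocabulary only, while `apply_bpe` can then be run on any text, matching the learn/apply split described above.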

Method
In this section, we present our method in detail. As depicted in Figure 2, the proposed method introduces an additional Attentive Semantic Fusion (ASF) module between the encoder layers and the token embedding layer of the conventional Transformer model, adding only a small number of extra parameters. The ASF takes as input the representations of several tokens that form a semantic unit and integrates the information from each token into a single vector representing the unified semantics. The sentence representations of both the token level and the semantic-unit level are then concatenated as the input of the first encoder layer. Note that the token-level representation is provided to the encoder layers to supplement detailed information. We also propose Word Pair Encoding (WPE), an offline method to efficiently extract phrase spans that indicate the boundaries of semantic units.

Learning Semantic Unit Representations
In our proposed method, a semantic unit refers to a contiguous sequence of tokens that collectively form a unified meaning. This means that not only phrases but also words composed of subwords can be considered semantic units.
We extract features from token representations through pooling operations. Then, we utilize the attention mechanism to integrate the semantics of the entire semantic unit and map it to a fixed-length representation.
The semantic unit representations are obtained through the Attentive Semantic Fusion (ASF) layer, illustrated on the right side of Figure 2. It takes a series of token representations T_{i∼i+k} = [T_i, T_{i+1}, ..., T_{i+k}] as input, which can form a phrase of length k + 1. To constrain the output to a fixed size, we leverage a characteristic of the attention mechanism (Bahdanau et al., 2015a): the size of the attention output is determined by its query vector. Inspired by Xu et al. (2020), who have demonstrated that max pooling and average pooling are beneficial features for learning phrase representations, we propose to leverage the concatenation of min pooling, max pooling, and average pooling to preserve more of the original information, yielding T_pool = concat(T_min, T_max, T_avg), which serves as the query vector for the attentive fusion layer. The attentive fusion layer uses T_{i∼i+k} as the key and value vectors and outputs the fused representation of length 3:

$$T_{fused} = \mathrm{Attention}(T_{pool}, T_{i\sim i+k}, T_{i\sim i+k})$$

The output is then transposed to be a vector with a triple-sized embedding dimension but a length of 1. This is followed by a downsampling feed-forward layer to acquire the single-token-sized semantic representation P_i.
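A minimal numerical sketch of the ASF idea, using plain Python in place of learned projections (the real layer has trainable parameters; the vectors and dimensions below are illustrative only):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention; output length equals the number of queries."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def asf_fuse(tokens):
    """Fuse the token vectors of one semantic unit into a length-3 output by
    using [min-pool, max-pool, avg-pool] of the tokens as the query vectors."""
    dims = range(len(tokens[0]))
    t_min = [min(t[j] for t in tokens) for j in dims]
    t_max = [max(t[j] for t in tokens) for j in dims]
    t_avg = [sum(t[j] for t in tokens) / len(tokens) for j in dims]
    return attention([t_min, t_max, t_avg], tokens, tokens)

unit = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.6]]   # three subword vectors of one unit
fused = asf_fuse(unit)
assert len(fused) == 3   # output length follows the query, not the input length
```

Whatever the number of input tokens, the pooled query fixes the output at length 3, which the paper's downsampling feed-forward layer then maps to a single-token-sized vector.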
In practical applications, we also consider semantic units that are composed of only one token as phrases: a generalized form of phrases with a length of 1. Therefore, all tokens are processed by the ASF and transformed into semantic representations. By doing so, the representations of all semantic units lie in the same vector space.

Integrating Semantic Units
The left part of Figure 2 illustrates how the whole model works. The tokens in the input sentence are printed in different colors, and tokens of the same color can form a semantic unit. When SU4MT takes a sentence as input, the token embedding layer maps these tokens into token-level representations, denoted by colored circles in Figure 2.
One copy of the token-level sentence representation is directly prepared to be half of the next layer's input after the positional encoding is added. The other copy goes through the ASF layer and is transformed into the semantic-unit-level sentence representation, denoted by colored squares in Figure 2. It is then also added with positional encoding, with the position count starting from zero. Subsequently, the two levels of sentence representations are concatenated as the complete input of the encoder layers. In practice, a normalization layer is applied to enhance model stability.
Since the two levels of sentence representations are each added with positional encodings, they essentially represent two separate and complete sentences. This is like generating translations based on two different perspectives of the same natural language sentence, allowing the model to capture either the overall information or the fine-grained details as needed. To ensure the validity of concatenating the two levels of sentence representations, we conducted preliminary experiments: when a duplicate of the source sentence is concatenated to the original one as the input of the conventional Transformer model, the translation quality does not deteriorate. This confirms the rationality of concatenating the two levels of sentence representations.
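The concatenation scheme can be sketched as follows; sinusoidal encodings stand in for the model's positional encoding, and the shapes are toy values:

```python
import math

def positional_encoding(length, dim):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pe = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def build_encoder_input(token_reps, unit_reps):
    """Each level gets its own positional encoding (both starting at position 0),
    then the two sequences are concatenated along the length dimension."""
    dim = len(token_reps[0])
    def add_pe(reps):
        pe = positional_encoding(len(reps), dim)
        return [[x + p for x, p in zip(r, pr)] for r, pr in zip(reps, pe)]
    return add_pe(token_reps) + add_pe(unit_reps)

tokens = [[0.0] * 4 for _ in range(5)]   # 5 token-level vectors (toy values)
units = [[0.0] * 4 for _ in range(3)]    # 3 semantic-unit-level vectors
inp = build_encoder_input(tokens, units)
assert len(inp) == 8                     # concatenated length = 5 + 3
```

Because each level restarts its position count at zero, the encoder effectively sees two complete renderings of the same sentence, matching the "two perspectives" view above.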

Extracting Semantic Spans
The semantic units encompass both words comprised of subwords and phrases. The former can be easily identified through the BPE mark "@@". However, identifying phrases in sentences requires more effort. Previous research often relied on parsing tools to extract phrases, which can be time-consuming. In our model, syntactic structure information is not needed; we only need to identify phrases. Therefore, we propose Word Pair Encoding (WPE), a fast method for phrase extraction. Since our model only utilizes phrase boundary information as auxiliary guidance, it does not require extremely high accuracy. As a result, WPE strikes a good balance between quality and speed, making it suitable for the subsequent stages of our approach.
Inspired by byte pair encoding (BPE) (Sennrich et al., 2016), we propose a similar but higher-level algorithm, word pair encoding (WPE), to extract phrases. There are two major differences between WPE and BPE: the basic unit and the merge rule.
The basic units in BPE are characters and pre-merged subwords, as discussed in Section 2.2. Analogously, the basic units in WPE are words and pre-merged sub-phrases. Besides, the merge operation does not paste basic units together directly but adds a special sign "#$&" to identify word boundaries.
The merge rule determines which two basic units should be merged in the next step. In BPE, the criterion is co-occurrence frequency. In WPE, however, the most frequent co-occurring word pairs mostly consist of punctuation and stopwords. To effectively identify phrases, we change the criterion to a score function, as shown in equation (3).
Here, w1 and w2 represent two adjacent basic units, and δ is a controllable threshold to filter out noise. Notably, our WPE is orthogonal to BPE, and we provide a strategy to apply both of them, as illustrated in Figure 3. Firstly, the BPE word table and the WPE phrase table are learned separately on the raw text X. Then, WPE is applied to yield X_W, denoted by the red boxes. Next, we apply BPE to X_W and obtain X_WB, denoted by the purple boxes. Finally, the special WPE signs are removed, resulting in subword-level sentences X_B, which are the same as those obtained by applying BPE to the raw text. Importantly, we extract semantic units regardless of whether their constituent parts are words or subwords. To sum up, WPE is applied to extract the positions of semantic units without changing the text.
In practice, there are some frequent long segments in the training corpus, which bring non-negligible noise to WPE. We therefore discard phrases of more than 6 tokens to avoid this problem.
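Equation (3) is not reproduced in this excerpt, so the sketch below assumes a word2vec-style phrase score, score(w1, w2) = (count(w1 w2) − δ) / (count(w1) · count(w2)), which matches the description of δ as a noise-filtering threshold; the corpus, threshold, and function names are all illustrative:

```python
from collections import Counter

DELTA = 1  # noise-filtering threshold (the paper uses δ = 100 on real corpora)

def score(pair_count, c1, c2, delta=DELTA):
    """Assumed phrase score: high when two units co-occur far more often
    than their individual frequencies would predict."""
    return (pair_count - delta) / (c1 * c2)

def wpe_merge_step(sentences, delta=DELTA):
    """One WPE merge: find the best-scoring adjacent pair and join it with
    the special boundary sign '#$&' instead of pasting the words directly."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        pairs.update(zip(sent, sent[1:]))
    best, best_score = None, float("-inf")
    for (a, b), c in pairs.items():
        s = score(c, unigrams[a], unigrams[b], delta)
        if s > best_score:
            best, best_score = (a, b), s
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) == best:
                out.append(sent[i] + "#$&" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged, best

corpus = [
    ["the", "long", "jumper", "won", "the", "race"],
    ["a", "long", "jumper", "ran"],
    ["the", "long", "jumper", "smiled"],
]
merged, best = wpe_merge_step(corpus)
print(best)   # ('long', 'jumper')
```

Note that "the long" co-occurs often too, but the denominator penalizes pairs built from high-frequency words, so the content phrase "long jumper" wins, which is exactly the stopword-filtering behavior the score function is introduced for.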

Experiments
In this section, we conduct experiments on three datasets of different scales to validate the effectiveness and generality of the SU4MT approach.

Data
To explore model performance on small-, medium-, and large-scale training corpora, we test SU4MT and related systems on the WMT14 English-to-German (En→De), WMT16 English-to-Romanian (En→Ro), and WMT17 English-to-Chinese (En→Zh) translation tasks.

Datasets
For the En→De task, we use the script "prepare-wmt14en2de.sh" provided by Fairseq (Ott et al., 2019) to download the data. The training corpus consists of approximately 4.5M sentence pairs. We remove noisy data from the training corpus by filtering out (1) sentences with fewer than 3 or more than 250 words, (2) sentences containing words of more than 25 characters, and (3) sentence pairs whose ratio of source sentence length to target sentence length is larger than 2 or less than 0.5. After cleaning, the training corpus contains 4.02M sentence pairs. We use newstest2013 as the validation set and newstest2014 as the test set.
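The three filtering rules can be sketched as a single predicate (the thresholds are from the paper; the function name is ours):

```python
def keep_pair(src_tokens, tgt_tokens):
    """Return True if a sentence pair passes all three cleaning filters."""
    for sent in (src_tokens, tgt_tokens):
        if not (3 <= len(sent) <= 250):        # rule 1: length bounds
            return False
        if any(len(w) > 25 for w in sent):     # rule 2: overlong words
            return False
    ratio = len(src_tokens) / len(tgt_tokens)  # rule 3: length ratio
    return 0.5 <= ratio <= 2.0

assert keep_pair(["a"] * 10, ["b"] * 12)
assert not keep_pair(["a"] * 2, ["b"] * 4)     # too short
assert not keep_pair(["a"] * 30, ["b"] * 10)   # ratio too large
```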
For the En→Zh task, we collect the corpora provided by WMT17 (Bojar et al., 2017). The training corpus consists of approximately 25.1M sentence pairs, with a notable amount of noisy data. Therefore, we clean the corpus by filtering out non-printing characters and duplicated or null lines, leaving 20.1M sentence pairs. We use newsdev2017 and newstest2017 as the validation and test sets.
For the En→Ro task, the training corpus consists of 610K sentence pairs, and newsdev2016 and newstest2016 serve as the validation and test sets.

Data Preparation
We use mosesdecoder (Koehn et al., 2007) for the tokenization of English, German, and Romanian texts, while the jieba tokenizer is employed for Chinese texts. For the En-De and En-Ro datasets, we apply joint BPE with 32,768 merges and use a shared vocabulary of size 32,768. As for the En-Zh dataset, since English and Chinese cannot share the same vocabulary, we perform 32,000 BPE merges separately on the texts of each language. The vocabulary sizes are not limited to a fixed number; they are about 34K for English and about 52K for Chinese.
In our approach, the proposed WPE method is used to extract the boundaries of semantic units. For all three tasks, we employ the same settings: a δ value of 100 and 10,000 WPE merges. Finally, the resulting spans longer than 6 tokens are removed.

Implementation Details Training Stage
In SU4MT and the baseline method, the hyperparameters related to the training data scale and learning strategies remain unchanged across all tasks and model scales. We set the learning rate to 7e-4 and the batch size to 32k. The Adam optimizer (Kingma and Ba, 2015) is applied with β = (0.9, 0.98) and ϵ = 1e-8. The dropout rate is adjusted according to the scale of model parameters and training data: for base-setting models, we set dropout = 0.1; for large-setting models, we set dropout = 0.3 for the En→Ro and En→De tasks and dropout = 0.1 for the En→Zh task.
For SU4MT, a pretrain-finetune strategy is applied for training stability. Specifically, we initialize our method by training a Transformer model for approximately half of the total convergence steps. At this point, the token embedding parameters have reached a relatively stable state, which facilitates the convergence of the ASF layer. The En→Ro task, however, is an exception: we train SU4MT models from scratch because they converge in only a few steps.
We reproduce UMST (Li et al., 2022a) according to the settings described in the paper. Note that the authors have clarified that they applied 20k BPE merge operations in the WMT16 En→Ro task, and we follow this setting in our re-implementation.

Inference Stage
For all experiments, we average the last 5 checkpoints as the final model and use it for inference. Note that we save model checkpoints every epoch for the En→Ro and En→De tasks and every 5000 steps for the En→Zh task.

Evaluation Stage
We report the results with three statistical metrics and two model-based metrics. Due to space limitations, we exhibit the two mainstream metrics here and display the complete evaluation in Appendix A. For the En→Ro and En→De tasks, we report Multi-BLEU for comparison with previous works and SacreBLEU as a more objective and fair comparison for future works. For the En→Zh task, we discard Multi-BLEU because it is sensitive to the word segmentation of Chinese sentences; instead, we report SacreBLEU and ChrF (Popović, 2015). For model-based metrics, COMET (Rei et al., 2022) and BLEURT (Sellam et al., 2020; Pu et al., 2021) are leveraged.

Main Results
We report the experimental results of the 3 tasks in Tables 1, 2, and 3. Among the compared systems, MG-SA (Hao et al., 2019), Proto-TF (Yin et al., 2022), and UMST (Li et al., 2022a) use external tools to obtain syntactic information, which is time-consuming. Notably, our approach and LD (Xu et al., 2020) work without any syntactic parsing tools and still yield prominent improvements. Apart from Multi-BLEU, we also report SacreBLEU in brackets. Results in Table 1 demonstrate that our approach outperforms the other systems in both base and big model settings and significantly surpasses the systems without extra knowledge.
To make the evaluation of translation performance more convincing, we report detailed evaluation scores on five metrics, which are displayed in Appendix A.

Ablation Study

To examine the effectiveness of concatenating both levels of sentence representations, we replace the encoder input with only one type of sentence representation, but duplicate it and concatenate it to the original sentence to eliminate the influence of doubled sentence length.
As demonstrated by the experimental results, using only one perspective of sentence representation leads to a decrease in translation quality, indicating that diverse sentence representations complement each other and provide more comprehensive semantic information. Furthermore, we remove the duplicated portion to investigate the impact of adding redundant information on translation. Surprisingly, when doubling the token-level sentence representation, SacreBLEU increases by 0.6, but when doubling the semantic-unit-level sentence representation, SacreBLEU even exhibits a slight decline. We speculate that repeating the information-sparse token-level sentence representation allows the model to learn detailed semantic information and, to some extent, learn the overall semantic representation separately from the two identical representations. However, the semantic-unit-level sentence representation is semantically dense: it is already easy for the model to understand the sentence, and duplicating it may introduce noise. In conclusion, using both perspectives of semantic representation together as input maximizes the provision of clear, comprehensive semantics and detailed linguistic features.

Influence of Granularity
We also discuss the criterion for considering a phrase a semantic unit. We vary the WPE merge steps to control the granularity. In Table 5, "BPE" indicates that we treat all BPE-split subwords as semantic units. "RANDOM" means we randomly select consecutive tokens as semantic units until the ratio of these tokens matches that of our default setting (approximately 36%). The results demonstrate that it is essential to integrate the semantic meanings of subwords.

Performance on Different Span Number
To explore the effectiveness of our method in modeling semantic units, we divide the test set of the En→De task into five groups based on the number of semantic-unit spans contained in the source sentences (extracted using WPE 10k + BPE). We calculate the average translation quality of our method and the comparison systems in each group. The results are presented in Figure 4, where the bar graph represents the proportion of sentences in each group relative to the total number of sentences in the test set, and the line graph shows the variation in average SacreBLEU. When the span count is 0, SU4MT degrades to the scenario of doubling the token-level sentence representation, thus resulting in only a marginal improvement over the baseline. However, when spans are present in the sentences, SU4MT significantly outperforms the baseline and consistently surpasses the comparison system (Li et al., 2022a) in almost all groups. This demonstrates that SU4MT effectively models semantic units and leverages them to improve translation quality.

Effectiveness of Modeling Semantic Units
To evaluate the efficacy of the proposed approach in modeling the integral semantics of semantic units, we calculate the translation accuracy of target-side words that correspond to source-side semantic units. For this purpose, we leverage the Gold Alignment for German-English Corpus dataset proposed by RWTH (Vilar et al., 2006), which consists of 508 sentence pairs and human-annotated word alignments. The alignments are labeled either S (sure) or P (possible). We train SU4MT and the baseline systems on the WMT14 En-De dataset and evaluate them on the RWTH alignment dataset. The recall rate of target-side words that correspond to source-side semantic units, i.e., the proportion of correctly translated semantic units, is reported as the metric. The results are presented in Table 6.

Table 6: Our approach presents more accurate translations of words that align with semantic units. In "recall (S)", only the words with "sure" alignments are calculated, while in "recall (P&S)", we calculate words with both "sure" and "possible" alignments.
Related Work

While the study of multi-word expressions (MWEs) was once a hot research topic, leveraging them in neural machine translation is hindered by intrinsic characteristics such as discontinuity (Constant et al., 2017), e.g., the expression "as ... as ...".
Despite the large overlap between the two notions, we highlight the distinctions of semantic units: for empirical usage, we require semantic units to be strictly consecutive tokens, and subword tokens are also considered part of a semantic unit.

Extraction of Phrases
Phrase extraction has been explored by various methods, such as the rule-based method (Ahonen et al., 1998), the alignment-based method (Venugopal et al., 2003; Vogel, 2005; Neubig et al., 2011), the syntax-based method (Eriguchi et al., 2016; Bastings et al., 2017), and the neural network-based method (Mingote et al., 2019). From the perspective of extracting semantic phrases, some of these methods require an additional parsing tool to extract semantic units and their syntactic relations, which may cost days in the data preprocessing stage.

Translation with Phrases
Several works have discussed leveraging semantic or phrasal information for neural machine translation. Some of them take advantage of the span information of phrases but do not model phrase representations. This kind of method modifies attention matrices, either conducting matrix transformation to integrate semantic units (Li et al., 2022a) or applying attention masks to enhance attention inside a semantic span (Slobodkin et al., 2022). Other works model a unified representation of semantic units. Xu et al. (2020) effectively learn phrase representations and achieve a great improvement over the baseline system; however, a complicated model structure is designed to leverage the semantic information, resulting in a heavy parameter overhead. MG-SA (Hao et al., 2019) models phrase representations but requires them to contain syntactic information instead of semantics, which narrows the effectiveness of modeling phrases. Proto-TF (Yin et al., 2022) uses semantic categorization to enhance the semantic representations of tokens and obtain a more accurate sentence representation, but it fails to extend the method to phrasal semantic representations.

Conclusions
In this work, we introduce the concept of semantic units and effectively utilize them to enhance neural machine translation. Firstly, we propose Word Pair Encoding (WPE), a phrase extraction method based on the relative co-occurrence of words, to efficiently extract phrase-level semantic units. Next, we propose Attentive Semantic Fusion (ASF), a phrase information fusion method based on the attention mechanism. Finally, we employ a simple but effective approach to leverage both token-level and semantic-unit-level sentence representations. Experimental results demonstrate that our method significantly outperforms similar methods.

Limitations
Our approach aims to model the semantic units within a sentence. However, there are still a few sentences that contain no semantic units, as Figure 4 shows. In this scenario, SU4MT brings less improvement in translation performance, so the advantage of our approach may be concealed on extreme test sets. Admittedly, SU4MT requires more FLOPs than the Transformer does, which is an inevitable cost of leveraging extra information. There are also potential risks in our work: limited by model performance, our model could generate unreliable translations, which may cause misunderstandings in cross-cultural scenarios.

Figure 1 :
Figure 1: An example of how integrating semantic units helps reduce translation errors. Instead of translating each word separately, SU4MT learns the unified meaning of "picket line" and translates it as a whole.

Figure 2 :
Figure 2: An overview of SU4MT. The left part illustrates the whole model structure. We only modify the embedding layer of the encoder and keep the others unchanged. The circles denote token-level representations, and circles of the same color show that they can form a semantic unit. The squares denote semantic-level representations. The right part illustrates the detailed structure of the Attentive Semantic Fusion (ASF) layer. Its input is the contiguous tokens that constitute a semantic unit, and its output is a single semantic representation.

Figure 3 :
Figure 3: An example of how to get the span positions of semantic units in a subword-level sentence by applying WPE. The boxes are colored differently to distinguish different processing stages. Tokens that form a semantic unit are highlighted with a background color. The two highlighted strings in the uppermost box are the extracted semantic units.

Table 1 :
This table shows the experimental results on the En→De translation task. The proposed SU4MT approach is compared with related works that involve phrases or spans. The * sign denotes that the results come from the original papers, and the † sign denotes that the approach uses external parsing tools. Besides Multi-BLEU, we report SacreBLEU in brackets. The results show that our approach outperforms the other methods w/ or w/o prior syntactic knowledge. The ↑ sign denotes that our approach significantly surpasses the Transformer (p < 0.01) and UMST (p < 0.05).

Table 2 :
Experimental results on the small-sized WMT16 En→Ro dataset. The † sign denotes that the approach uses an external parsing tool. The ⇑ and ↑ signs denote that the approaches significantly surpass the Transformer system, measured by SacreBLEU (p < 0.05 and p < 0.1 respectively).

Table 3 :
Experimental results on the En→Zh translation task are reported with SacreBLEU and ChrF, because the tokenization of Chinese is highly dependent on the segmentation tool, making Multi-BLEU unreliable as a metric.

Table 4 :
In this set of experiments, the encoder input is varied to assess the effectiveness of concatenating sentence representations of both levels. Experiments are conducted on the En→De task, and the strategy for identifying semantic units is WPE 10k + BPE.