Word Translation Prediction for Morphologically Rich Languages with Bilingual Neural Networks

Translating into morphologically rich languages is a particularly difficult problem in machine translation due to the high degree of inflectional ambiguity in the target language, often only poorly captured by existing word translation models. We present a general approach that exploits source-side contexts of foreign words to improve translation prediction accuracy. Our approach is based on a probabilistic neural network which requires neither linguistic annotation nor manual feature engineering. We report significant improvements in word translation prediction accuracy for three morphologically rich target languages. In addition, preliminary results for integrating our approach into a large-scale English-Russian statistical machine translation system show small but statistically significant improvements in translation quality.


Introduction
The ability to make context-sensitive translation decisions is one of the major strengths of phrase-based SMT (PSMT). However, the way PSMT exploits source-language context has several limitations as pointed out, for instance, by Quirk and Menezes (2006) and Durrani et al. (2013). First, the amount of context used to translate a given input word depends on the phrase segmentation, with hypotheses resulting from different segmentations competing with one another. Another issue is that, given a phrase segmentation, each source phrase is translated independently from the others, which can be problematic especially for short phrases. As a result, the predictive translation of a source phrase does not access useful linguistic clues in the source sentence that are outside of the scope of the phrase.
Lexical weighting tackles the problem of unreliable phrase probabilities, typically associated with long phrases, but does not alleviate the problem of context segmentation. An important share of the translation selection task is then left to the language model (LM), which is certainly very effective but can only leverage target language context. Moreover, decisions that are taken at early decoding stages (such as the common practice of retaining only the top n translation options for each source span) depend only on the translation models and on the target context available in the phrase.
Source context based translation models (Gimpel and Smith, 2008; Mauser et al., 2009; Jeong et al., 2010; Haque et al., 2011) naturally address these limitations. These models can exploit a boundless context of the input text, but they assume that target words can be predicted independently from each other, which makes them easy to integrate into state-of-the-art PSMT systems. Even though the independence assumption is made on the target side, these models have shown the benefits of utilizing source context, especially in translating into morphologically rich languages. One drawback of previous research on this topic, though, is that it relied on rich sets of manually designed features, which in turn required the availability of linguistic annotation tools like POS taggers and syntactic parsers.
In this paper, we specifically focus on improving the prediction accuracy for word translations. Achieving high levels of word translation accuracy is particularly challenging for language pairs where the source language is morphologically poor, such as English, and the target language is morphologically rich, such as Russian, i.e., language pairs with a high degree of surface realization ambiguity (Minkov et al., 2007). To address this problem we propose a general approach based on bilingual neural networks (BNN) exploiting source-side contextual information.
This paper makes a number of contributions: Unlike previous approaches, our models do not require any form of linguistic annotation (Minkov et al., 2007; Kholy and Habash, 2012; Chahuneau et al., 2013), nor do they require any feature engineering (Gimpel and Smith, 2008). Moreover, besides directly predicting fully inflected forms as in Jeong et al. (2010), our approach can also model stem and suffix prediction explicitly. Prediction accuracy is evaluated with respect to three morphologically rich target languages (Bulgarian, Czech, and Russian), showing that our approach consistently yields substantial improvements over a competitive baseline. We also show that these improvements in prediction accuracy can be beneficial in an end-to-end machine translation scenario by integrating our models into a large-scale English-Russian PSMT system. Finally, a detailed analysis shows that our approach induces a positive bias on phrase translation probabilities, leading to a better ranking of the translation options employed by the decoder.

Lexical coverage of SMT models
The first question we ask is whether translation can be improved by a more accurate selection of the translation options already existing in the SMT models, as opposed to generating new options.
To answer this question we measure the lexical coverage of a baseline PSMT system trained on English-Russian (training data and SMT setup are described in Section 6). We choose this language pair because of the morphological richness on the target side: Russian is characterized by a highly inflectional morphology with a particularly complex nominal declension (six core cases, three genders and two number categories). As suggested by Green and DeNero (2012), we compute the recall of reference tokens in the set of target tokens that the decoder could produce in a translation of the source, that is, the target tokens of all phrase pairs that matched the input sentence and that were actually used for decoding. We call this the decoder's lexical search space. Then, we compare the reference/space recall against the reference/MT-output recall, that is, the percentage of reference tokens that also appeared in the 1-best translation output by the SMT system. Results for the WMT12 benchmark are presented in Table 1. From the first two rows, we see that only a rather small part of the correct target tokens available to the decoder are actually produced in the 1-best MT output (50% against 86%). Although our word-level analysis does not directly estimate phrase-level coverage, these numbers suggest that a large potential for translation improvement lies in better lexical selection during decoding.

To quantify the importance of morphology, we count how many reference tokens matched the MT output only at the stem level and for how many of those the correct surface form existed in the search space (reachable matches). These two numbers represent the upper bound of the improvement achievable by a model only predicting suffixes given the target stems. As shown in Table 1, such a model could potentially increase the reference/MT-output recall by 12.3% with generation of new inflected forms, and by 11.2% without. Thus, also when it comes to morphology, generation seems to be of secondary importance compared to better selection in our experimental setup.
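The two recall figures compared above can be illustrated with a small sketch. It assumes that a reference token counts as covered whenever its surface form appears anywhere in the candidate set (no clipping by frequency); the function name `token_recall` and the transliterated toy tokens are hypothetical and not taken from the paper.

```python
def token_recall(reference, produced):
    """Fraction of reference tokens whose surface form occurs in `produced`.

    Assumption: membership is tested per token, without clipping counts.
    """
    produced = set(produced)
    if not reference:
        return 0.0
    return sum(tok in produced for tok in reference) / len(reference)


# Toy example with transliterated tokens (hypothetical data).
reference = ["zakona", "shtata", "indiana"]
search_space = {"zakon", "zakona", "shtata"}  # tokens the decoder could produce
mt_output = ["zakon", "shtata"]               # tokens in the 1-best output

space_recall = token_recall(reference, search_space)
output_recall = token_recall(reference, mt_output)
```

In this toy case the search space covers two of the three reference tokens while the 1-best output covers only one, which mirrors the kind of gap between reachable and produced tokens discussed in the text.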

Predicting word translations in context
It is standard practice in PSMT to use word-to-word translation probabilities as an additional phrase score. More specifically, state-of-the-art PSMT systems employ the maximum-likelihood estimate of the context-independent probability of a target word given its aligned source word, $P(t_j \mid s_i)$, for each word alignment link $a_{ij}$. The main goal of our work is to improve the estimation of such probabilities by exploiting the context of $s_i$, which in turn we expect will result in better phrase translation selection. Figure 1 illustrates this idea: the translation of "law" in this example has a wrong case (nominative instead of genitive). Due to the rare word "Indiana/индиана", the target LM must back off to the bigram history and does not penalize this choice sufficiently. However, a model that has access to the word "of" in the near source context could bias the translation of "law" to the correct case.
We then model $P(t_j \mid c_{s_i})$ with source context $c_{s_i}$ defined as a fixed-length word sequence centered around $s_i$:

$$c_{s_i} = s_{i-k}, \ldots, s_i, \ldots, s_{i+k}$$

Our definition of context is similar to the $n-1$ word history used in n-gram LMs. Similarly to previous work in source context-sensitive translation modeling (Jeong et al., 2010; Chahuneau et al., 2013), target words are predicted independently from each other, which allows for an efficient decoding integration. We are particularly interested in translating into morphologically rich languages where source context can provide useful information for the prediction of target translation; for example, the gender of the subject in a source sentence constrains the morphology of the translation of the source verb. Therefore, we integrate the notions of stem and suffix directly into the model. We assume the availability of a word segmentation function $g$ that takes a target word $t$ as input and returns its stem and suffix: $g(t) = (\sigma, \mu)$. Then, the conditional probability $P(t_j \mid c_{s_i})$ can be decomposed into stem probability and suffix probability:

$$P(t_j \mid c_{s_i}) = P(\sigma_j \mid c_{s_i}) \, P(\mu_j \mid c_{s_i}, \sigma_j)$$

These two probabilities can be estimated separately, which yields the two subtasks:
1. predict target stem $\sigma$ given source context $c_s$;
2. predict target suffix $\mu$ given source context $c_s$ and target stem $\sigma$.
Based on the results of our analysis, we focus on the selection of existing translation candidates. We then restrict our prediction to a set of possible target candidates depending on the task instead of considering all target words in the vocabulary. More specifically, for each source word $s_i$, our candidate generation function returns the set of target words $T_s = \{t_1, \ldots, t_m\}$ that were aligned to $s_i$ in the parallel training corpus, which in turn corresponds to the set of target words that the SMT system can produce for a given source. In practice, we use a pruned version of $T_s$ to speed up training and reduce noise (see details in Section 5).
As for the morphological models, given $T_s$ and $g$, we can obtain $L_s = \{\sigma_1, \ldots, \sigma_k\}$, the set of possible target stem translations of $s$, and $M_\sigma = \{\mu_1, \ldots, \mu_l\}$, the set of possible suffixes for a target stem $\sigma$. The use of $L_s$ and $M_\sigma$ is similar to stemming and inflection operations in (Toutanova et al., 2008) while the set $T_s$ is similar to the GEN function in (Jeong et al., 2010).

Our approach differs crucially from previous work (Minkov et al., 2007; Chahuneau et al., 2013) in that it does not require linguistic features such as part-of-speech and syntactic tree on the source side. The proposed models automatically learn features that are relevant for each of the modeled tasks, directly from word-aligned data. To make the approach completely language independent, the word segmentation function $g$ can be trained with an unsupervised segmentation tool. The effects of using different word segmentation techniques are discussed in Section 5.
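As a rough illustration of the stem/suffix decomposition, the sketch below multiplies a stem probability by a suffix probability conditioned on that stem. The suffix-stripping function `g`, its suffix list, and the probability tables are all hypothetical stand-ins: in the paper, segmentation comes from an external tool and the probabilities come from the BNN models.

```python
def g(word, suffixes=("ogo", "a", "u")):
    """Hypothetical segmentation function: strip the first matching suffix."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            return word[: -len(suf)], suf
    return word, ""  # no suffix recognized


def word_prob(word, p_stem, p_suffix):
    """P(word | context) = P(stem | context) * P(suffix | context, stem)."""
    stem, suffix = g(word)
    return p_stem.get(stem, 0.0) * p_suffix.get(stem, {}).get(suffix, 0.0)


# Toy distributions for one fixed source context (made-up numbers).
p_stem = {"zakon": 0.8, "prav": 0.2}
p_suffix = {"zakon": {"": 0.3, "a": 0.6, "u": 0.1}}

p_genitive = word_prob("zakona", p_stem, p_suffix)
p_nominative = word_prob("zakon", p_stem, p_suffix)
```

With these toy numbers the genitive form receives 0.8 × 0.6 and the nominative form 0.8 × 0.3, so a suffix model that is sensitive to the source context can separate inflected variants of the same stem.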
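The candidate sets described above can be derived from word alignment links in a few lines. This is only a sketch: `build_candidate_sets` and the toy segmentation `toy_g` (which expects tokens already marked with a `+` boundary) are hypothetical names, and pruning is left out.

```python
from collections import defaultdict


def toy_g(token):
    """Hypothetical segmentation: tokens are pre-split as 'stem+suffix'."""
    stem, _, suffix = token.partition("+")
    return stem, suffix


def build_candidate_sets(links, g):
    """Collect T_s (words), L_s (stems) and M_sigma (suffixes per stem)."""
    T, L, M = defaultdict(set), defaultdict(set), defaultdict(set)
    for source, target in links:
        stem, suffix = g(target)
        T[source].add(target)  # target words aligned to this source word
        L[source].add(stem)    # target stems aligned to this source word
        M[stem].add(suffix)    # suffixes observed with this target stem
    return T, L, M


# Each pair is one alignment link (source word, segmented target word).
links = [("law", "zakon+"), ("law", "zakon+a"), ("law", "prav+o"), ("right", "prav+o")]
T, L, M = build_candidate_sets(links, toy_g)
```

Here `M` is indexed by target stem only, which follows the definition of $M_\sigma$ in the text; conditioning it on the source word as well would be a different design choice.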

Bilingual neural networks for translation prediction
Probabilistic neural network (NN), or continuous space, language models have received increasing attention over the last few years and have been applied to several natural language processing tasks (Bengio et al., 2003; Collobert and Weston, 2008; Socher et al., 2011; Socher et al., 2012). Within statistical machine translation, they have been used for monolingual target language modeling (Schwenk et al., 2006; Le et al., 2011; Duh et al., 2013; Vaswani et al., 2013), n-gram translation modeling (Son et al., 2012), phrase translation modeling (Schwenk, 2012; Zou et al., 2013; Gao et al., 2014) and minimal translation modeling (Hu et al., 2014). The recurrent neural network LMs of Auli et al. (2013) are primarily trained to predict target word sequences. However, they also experiment with an additional input layer representing source side context. Our models differ from most previous work in neural language modeling in that we predict a target translation given a source context while previous models predict the next word given a target word history. Unlike previous work in phrase translation modeling with NNs, our models have the advantage of accessing source context that can fall outside the phrase boundaries.
We now describe our models in a general setting, predicting target translations given a source context, where target translations can be either words, stems or suffixes. 5

Neural network architecture
Following a common approach in deep learning for NLP (Bengio et al., 2003; Collobert and Weston, 2008), we represent each source word $s_i$ by a column vector $r_{s_i} \in \mathbb{R}^{d}$. Given a source context $c_{s_i} = s_{i-k}, \ldots, s_i, \ldots, s_{i+k}$ of $k$ words on the left and $k$ words on the right of $s_i$, the context representation $r_{c_{s_i}} \in \mathbb{R}^{(2k+1) \times d}$ is obtained by concatenating the vector representations of all words in $c_{s_i}$:

$$r_{c_{s_i}} = [r_{s_{i-k}}; \ldots; r_{s_i}; \ldots; r_{s_{i+k}}]$$

Our main BNN architecture for word or stem prediction (Figure 2a) is a feed-forward neural network (FFNN) with one hidden layer, a matrix $W_1 \in \mathbb{R}^{n \times (2k+1)d}$ connecting the input layer to the hidden layer, a matrix $W_2 \in \mathbb{R}^{|V_t| \times n}$ connecting the hidden layer to the output layer, and a bias vector $b_2 \in \mathbb{R}^{|V_t|}$, where $|V_t|$ is the size of the target translation vocabulary. The target translation distribution $P_{\text{BNN}}(t \mid c_{s_i})$ for a given source context $c_{s_i}$ is computed by a forward pass:

$$P_{\text{BNN}}(t \mid c_{s_i}) = \mathrm{softmax}\left(W_2 \, \phi(W_1 r_{c_{s_i}}) + b_2\right)$$

where $\phi$ is a nonlinearity (tanh, sigmoid or rectified linear units). The parameters of the neural network are $\theta = \{r_{s_i}, W_1, W_2, b_2\}$. The suffix prediction BNN is obtained by adding the target stem representation $r_\sigma$ to the input layer (see Figure 2b).

5 The source code of our models is available at https://bitbucket.org/ketran
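A minimal sketch of the forward pass of such a one-hidden-layer network, written with plain Python lists so that it runs without external libraries. The rectified nonlinearity, the restriction of the softmax to a candidate list, the toy dimensions, and all function and variable names are assumptions made for illustration; this is not the authors' released code.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in z]
    total = sum(e)
    return [a / total for a in e]


def ffnn_forward(context_ids, E, W1, W2, b2, candidates):
    """Distribution over candidate target ids for one source context."""
    r_c = [x for i in context_ids for x in E[i]]    # concatenate 2k+1 embeddings
    h = [max(0.0, a) for a in matvec(W1, r_c)]      # hidden layer, rectified units
    scores = [a + b for a, b in zip(matvec(W2, h), b2)]
    probs = softmax([scores[t] for t in candidates])  # normalize over candidates only
    return dict(zip(candidates, probs))


rng = random.Random(0)
d, k, n, V_s, V_t = 4, 1, 5, 6, 7  # toy sizes (hypothetical)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


E = rand_matrix(V_s, d)                 # source word representations
W1 = rand_matrix(n, (2 * k + 1) * d)    # input -> hidden
W2 = rand_matrix(V_t, n)                # hidden -> output
b2 = [0.0] * V_t
dist = ffnn_forward([2, 0, 5], E, W1, W2, b2, candidates=[1, 3, 4])
```

Normalizing over a candidate list instead of the full target vocabulary follows the candidate-restricted setup described elsewhere in the paper; passing every target id as a candidate recovers a full-vocabulary softmax.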

Model variants
We encounter two major issues with FFNNs: (i) they do not provide a natural mechanism to compute the word surface conditional probability $P(t \mid c_s)$ given the individual stem probability $P(\sigma \mid c_s)$ and suffix probability $P(\mu \mid c_s, \sigma)$, and (ii) FFNNs do not provide the flexibility to capture long-range dependencies among words if they lie outside the source context window. Hence, we consider two BNN variants: a log-bilinear model (LBL) and a convolutional neural network model (ConvNet). LBL could potentially address (i) by factorizing target representations into target stem and suffix representations, whereas ConvNets offer the advantage of modeling variable input length (ii) (Kalchbrenner et al., 2014).
Log-bilinear model. The FFNN models stem and suffix probabilities separately. A log-bilinear model could instead directly model word prediction through a factored representation of target words, similarly to Botha and Blunsom (2014). Thus, no probability mass would be wasted on stem-suffix combinations that are not produced by the candidate generation function. The LBL model specifies the conditional distribution for the word translation $t_j \in T_{s_i}$ given a source context $c_{s_i}$:

$$P_\theta(t_j \mid c_{s_i}) = \frac{\exp\left(s_\theta(t_j, c_{s_i})\right)}{\sum_{t_k \in T_{s_i}} \exp\left(s_\theta(t_k, c_{s_i})\right)}$$

We use an additional set of word representations $q_{t_j} \in \mathbb{R}^{n}$ for target translations $t_j$. The LBL model computes a predictive representation $q$ of a source context $c_{s_i}$ by taking a linear combination of the source word representations $r_{s_{i+m}}$ with the position-dependent weight matrices $C_m \in \mathbb{R}^{n \times d}$:

$$q = \sum_{m=-k}^{k} C_m r_{s_{i+m}}$$

The score function $s_\theta(t_j, c_{s_i})$ measures the similarity between the predictive representation $q$ and the target representation $q_{t_j}$:

$$s_\theta(t_j, c_{s_i}) = (q + b_h)^\top q_{t_j} + b_{t_j}$$

Here $b_{t_j}$ is the bias term associated with target word $t_j$, and $b_h \in \mathbb{R}^{n}$ are the representation biases. $s_\theta(t_j, c_{s_i})$ can be seen as the negative energy function of the target translation $t_j$ and its context $c_{s_i}$. The parameters of the model thus are $\theta = \{r_{s_i}, C_m, q_{t_j}, b_h, b_{t_j}\}$. Our log-bilinear model is a modification of the log-bilinear model proposed for n-gram language modeling in (Mnih and Hinton, 2007).
Convolutional neural network model. This model (Figure 3) computes the predictive representation $q$ by applying a sequence of $2k$ convolutional layers $\{L_1, \ldots, L_{2k}\}$. The source context $c_{s_i}$ is represented as a matrix $m_{c_{s_i}} \in \mathbb{R}^{d \times (2k+1)}$:

$$m_{c_{s_i}} = [r_{s_{i-k}}, \ldots, r_{s_i}, \ldots, r_{s_{i+k}}]$$

Each convolutional layer $L_i$ consists of a one-dimensional filter $m_i \in \mathbb{R}^{d \times 2}$. Each row of $m_i$ is convolved with the corresponding row in the previous layer, resulting in a matrix whose number of columns decreases by one. Thus, after $2k$ convolutional layers, the network transforms the source context matrix $m_{c_{s_i}}$ into a feature vector $q \in \mathbb{R}^{d}$. A fully connected layer with weight matrix $W$ followed by a softmax layer is placed after the last convolutional layer $L_{2k}$ to perform classification. The parameters of the convolutional neural network model are $\theta = \{r_{s_i}, m_j, W\}$.
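The log-bilinear scoring step can be sketched as follows. The way the representation bias $b_h$ enters the score (added to the predictive representation before the dot product) is an assumption made for this sketch, as are the function names and the toy values.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lbl_scores(context_vecs, C, Q, b_h, b_t):
    """Score every target candidate in Q against one source context.

    context_vecs: source word vectors, one per context position
    C: position-dependent matrices, aligned with context_vecs
    Q: {target: target representation}, b_t: {target: bias}
    """
    q = list(b_h)  # start from the representation bias (assumed placement)
    for C_m, r in zip(C, context_vecs):
        for a, v in enumerate(matvec(C_m, r)):
            q[a] += v  # add the linear contribution of this position
    return {t: sum(x * y for x, y in zip(q, Q[t])) + b_t[t] for t in Q}


identity = [[1.0, 0.0], [0.0, 1.0]]
scores = lbl_scores(
    context_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    C=[identity, identity, identity],
    Q={"zakon": [1.0, 0.0], "zakona": [0.0, 1.0]},
    b_h=[0.5, -0.5],
    b_t={"zakon": 0.1, "zakona": 0.0},
)
```

Turning these scores into probabilities only requires a softmax over the candidate set of the source word, which keeps the cost per example independent of the target vocabulary size.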
Here, we focus on fixed-length input; however, convolutional neural networks may be used to model variable-length input (Kalchbrenner et al., 2014; Kalchbrenner and Blunsom, 2013).
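The shrinking-by-one behaviour of the convolutional layers can be checked with a small sketch. It assumes a width-2 filter applied row by row in cross-correlation form (no kernel flip) and a rectified nonlinearity after each layer; the function names and toy numbers are hypothetical.

```python
def conv_layer(X, filt):
    """Apply one row-wise width-2 filter followed by a rectifier.

    X: d rows, each of length w; filt: d rows, each of length 2.
    Returns d rows of length w - 1.
    """
    return [
        [max(0.0, f[0] * row[j] + f[1] * row[j + 1]) for j in range(len(row) - 1)]
        for row, f in zip(X, filt)
    ]


def convnet_features(X, filters):
    """Stack layers until each row has a single column, then flatten."""
    for filt in filters:
        X = conv_layer(X, filt)
    return [row[0] for row in X]


# d = 2 dimensions, k = 1, so the context matrix has 2k + 1 = 3 columns
# and 2k = 2 layers reduce it to a vector of length d.
X = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
filters = [
    [[1.0, 1.0], [1.0, -1.0]],  # layer 1
    [[1.0, 1.0], [1.0, 1.0]],   # layer 2
]
q = convnet_features(X, filters)
```

After the first layer the rows are `[3, 5]` and `[0, 1]`; the second layer leaves one value per row, giving the feature vector that a fully connected softmax layer would then classify.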

Training
In training, for each example $(t, c_s)$, we maximize the conditional probability $P_\theta(t \mid c_s)$ of a correct target label $t$. The contribution of the training example $(t, c_s)$ to the gradient of the log conditional probability is given by:

$$\frac{\partial}{\partial \theta} \log P_\theta(t \mid c_s) = \frac{\partial}{\partial \theta} s_\theta(t, c_s) - \sum_{t' \in T_s} P_\theta(t' \mid c_s) \, \frac{\partial}{\partial \theta} s_\theta(t', c_s)$$

Note that in the gradient, we do not sum over all target translations $T$ but over the set of possible candidates $T_s$ of a source word $s$. In practice $|T_s| \leq 200$ with our pruning settings (see Section 5.1), thus training time for one example does not depend on the vocabulary size. Our training criterion can be seen as a form of contrastive estimation (Smith and Eisner, 2005); however, we explicitly move the probability mass from competing candidates to the correct translation candidate, thus obtaining more reliable estimates of the conditional probabilities.
The BNN parameters are initialized randomly according to a zero-mean Gaussian. We regularize the models with $L_2$. As an alternative to the $L_2$ regularizer, we also experiment with dropout (Hinton et al., 2012), where the neurons are randomly zeroed out with dropout rate $p$. This technique is known to be useful in computer vision tasks but has rarely been used in NLP tasks. In FFNN, we use dropout after the hidden layer, while in ConvNet, dropout applies after the last convolutional layer. The dropout rate $p$ is set to 0.3 in our experiments. We use rectified nonlinearities in FFNN and after each convolutional layer in ConvNet. We train our BNN models with standard stochastic gradient descent.
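For a softmax restricted to the candidate set, the gradient of the log probability with respect to each candidate score is the indicator of the correct candidate minus the model probability; the gradient with respect to the parameters then follows by the chain rule. The sketch below shows only this score-level step, with hypothetical names and toy scores.

```python
import math


def candidate_softmax(scores):
    """Softmax over the candidates of one source word only."""
    m = max(scores.values())
    exp = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exp.values())
    return {t: e / z for t, e in exp.items()}


def grad_log_prob_wrt_scores(scores, correct):
    """d log P(correct | context) / d score(t) for every candidate t."""
    probs = candidate_softmax(scores)
    return {t: (1.0 if t == correct else 0.0) - p for t, p in probs.items()}


scores = {"zakon": 1.0, "zakona": 0.5, "pravo": -1.0}  # toy candidate scores
grad = grad_log_prob_wrt_scores(scores, correct="zakona")
```

The gradient pushes the score of the correct candidate up and the scores of its competitors down in proportion to their current probability, and the loop runs over the candidate set rather than the whole target vocabulary.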

Evaluating word translation prediction
In this section, we assess the ability of our BNN models to predict the correct translation of a word in context. In addition to English-Russian, we also consider translation prediction for Czech and Bulgarian. As members of the Slavic language family, Czech and Bulgarian are also characterized by highly inflectional morphology. Czech, like Russian, displays a very rich nominal inflection with as many as 14 declension paradigms. Bulgarian, unlike Russian, is not affected by case distinctions but is characterized by a definiteness suffix.

Experimental setup
The following parallel corpora are used to train the BNN models:
• English-Russian: WMT13 data (News Commentary and Yandex corpora);
• English-Czech: CzEng 1.0 corpus (Bojar et al., 2012) (Web Pages and News sections);
• English-Bulgarian: a mix of crawled news data, TED talks and Europarl proceedings.
Detailed corpus statistics are given in Table 2. For each language pair, accuracies are measured on a held-out set of 10K parallel sentences.
To prepare the candidate generation function, each dataset is first word-aligned with GIZA++, then a bilingual lexicon with maximum-likelihood probabilities ($P_{\text{mle}}$) is built from the symmetrized alignment. After some frequency and significance pruning, the top 200 translations sorted by $P_{\text{mle}}(t \mid s) \cdot P_{\text{mle}}(s \mid t)$ are kept as candidate word translations for each source word in the vocabulary. Word alignments are also used to train the BNN models: each alignment link constitutes a training sample, with no special treatment of unaligned words and 1-to-many alignments.
The context window size $k$ is set to 3 (corresponding to 7-gram) and the dimensionality of source word representations to 100 in all experiments. The number of hidden units in our feed-forward neural networks and the target translation embedding size in LBL models are set to 200. All models are trained for 10 iterations with learning rate set to 0.001.

Table 2: BNN training corpora statistics: number of sentences, tokens, and type/token ratio (T/T).
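The ranking step of the candidate generation function can be sketched from raw alignment link counts. The sketch omits the frequency and significance pruning, whose settings are not given here, and the function name `top_candidates` and the toy links are hypothetical.

```python
from collections import Counter, defaultdict


def top_candidates(links, top_n=200):
    """Keep the top_n targets per source word, ranked by P(t|s) * P(s|t).

    Both probabilities are maximum-likelihood estimates from link counts.
    """
    pair = Counter(links)
    src = Counter(s for s, _ in links)
    tgt = Counter(t for _, t in links)
    scored = defaultdict(list)
    for (s, t), c in pair.items():
        scored[s].append(((c / src[s]) * (c / tgt[t]), t))
    return {
        # sort by descending score, breaking ties alphabetically
        s: [t for _, t in sorted(cands, key=lambda x: (-x[0], x[1]))[:top_n]]
        for s, cands in scored.items()
    }


links = (
    [("law", "zakon")] * 3
    + [("law", "zakona")] * 2
    + [("law", "pravo")]
    + [("right", "pravo")] * 3
)
lexicon = top_candidates(links, top_n=2)
```

Multiplying the two directional probabilities penalizes a target word that is frequent overall but only rarely aligned to the source word in question, as with the link between "law" and "pravo" in the toy data.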

Word, stem and suffix prediction accuracy
We measure accuracy at top-n, i.e., the number of times the correct translation was in the top n candidates sorted by a model. For each subtask (word, stem and suffix prediction) the BNN model is compared to the context-independent maximum-likelihood baseline $P_{\text{mle}}(t \mid s)$ on which the PSMT lexical weighting score is based. Note that this is a more realistic baseline than the uniform models sometimes reported in the literature. The oracle corresponds to the percentage of aligned source-target word pairs in the held-out set that are covered by the candidate generation function. Out of the missing links, about 4% is due to lexicon pruning. Results for all three language pairs are presented in Table 3. In this series of experiments, the morphological BNNs utilize unsupervised segmentation models trained on each target language following Lee et al. (2011).

As shown in Table 3, the BNN models outperform the baseline by a large margin in all tasks and languages. In particular, word prediction accuracy at top-1 increases by +6.4%, +24.6% and +9.0% absolute in English-Russian, English-Czech and English-Bulgarian respectively, without the use of any features based on linguistic annotation. While the baseline and oracle differences among languages can be explained by different levels of overlap between training and held-out set, we cannot easily explain why the Czech BNN performance is so much higher.

From these figures, it is hard to predict whether word BNNs or morphological BNNs will have a better effect on SMT performance. On one hand, the word-level BNN achieves the highest gain over the MLE baseline. On the other, the stem- and suffix-level BNNs provide two separate scoring functions, whose weights can be directly tuned for translation quality. A preliminary answer to this question is given by the SMT experiments presented in Section 6.
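Accuracy at top-n as described above can be computed with a short helper. The data layout (a gold label paired with a score dictionary per held-out example) and the name `top_n_accuracy` are assumptions for this sketch.

```python
def top_n_accuracy(examples, n):
    """Share of examples whose gold label is among the n best-scored candidates."""
    hits = 0
    for gold, scores in examples:
        ranked = sorted(scores, key=scores.get, reverse=True)[:n]
        hits += gold in ranked
    return hits / len(examples)


# Toy held-out examples: (gold translation, model scores per candidate).
examples = [
    ("a", {"a": 0.6, "b": 0.4}),
    ("b", {"a": 0.5, "b": 0.3, "c": 0.2}),
    ("c", {"a": 0.5, "b": 0.3, "c": 0.2}),
]
acc1 = top_n_accuracy(examples, 1)
acc3 = top_n_accuracy(examples, 3)
```

The same helper applies to the maximum-likelihood baseline by passing its context-independent probabilities as the scores, so both models are ranked over identical candidate sets.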

Effect of word segmentation
This section analyzes the effect of using different segmentation techniques. We consider two supervised tagging methods that produce lemma and inflection tag for each token in a context-sensitive manner: TreeTagger (Sharoff et al., 2008) for Russian and the Morce tagger (Spoustová et al., 2007) for Czech (annotation included in the CzEng 1.0 corpus release). Finally, we employ the Russian Snowball rule-based stemmer as a light-weight context-insensitive segmentation technique. As shown in Figure 4, accuracies for both stem and suffix prediction vary noticeably with the segmentation used. However, higher stem accuracies correspond to lower suffix accuracies and vice versa, which may be mainly due to a general preference of a tool to segment more or less than another. In summary, the unsupervised segmentation methods and the light-weight stemmer appear to perform comparably to the supervised methods.

Effect of training data size
We examine the predictive power of our models with respect to the size of training data. Table 4 shows the accuracies of stem and suffix models trained on 200K and 1M English-Russian sentence pairs with unsupervised word segmentation. Surprisingly, we observe only a minor loss when we decrease the training data size, which suggests that our models are robust even on a small data set.

Fine-grained evaluation
We evaluate the suffix BNN model at the part-of-speech (POS) level. Table 5 provides suffix prediction accuracy per POS for En-Ru. For this analysis, Russian data is segmented by TreeTagger. Additionally, we report the average number of suffixes per stem given the part-of-speech.
Our results are consistent with the findings of Chahuneau et al. (2013): the prediction of adjectives is more difficult than that of other POS while Russian verb prediction is relatively easier in spite of the higher number of suffixes per stem. These differences reflect the importance of source versus target context features in the prediction of the target inflection: for instance, adjectives agree in gender with the nouns they modify, but this may only be inferred from the target context.

Table 5: Suffix prediction accuracy at top-1 (%), breakdown by category (A: adjectives, V: verbs, N: nouns, M: numerals and P: pronouns). $|M_\sigma|$ denotes the average number of suffixes per stem.

Neural Network variants
Table 6 shows the stem and suffix accuracies of BNN variants on English-Czech. Although none of the variants outperforms our main FFNN architecture, we observe similar performances by the LBL on stem prediction, and by the ConvNet on suffix prediction. This suggests that future work could exploit their additional flexibility (see Section 4.2) to improve the BNN predictive power.
As for the low suffix accuracy of the LBL, it can be explained by the absence of a nonlinear transformation. Nonlinearity is important for the suffix model, where the prediction of target suffix $\mu_j$ often does not depend linearly on $s_i$ and $\sigma_j$. The predictive representation of the target stem in the LBL stem model, however, mainly depends on the source representation $r_{s_i}$ through a position-dependent weight matrix $C_0$. Thus, we observe a smaller accuracy drop in the stem model than in the suffix model. Conversely, the ConvNet performs poorly on stem prediction because it captures the meaning of the whole source context instead of emphasizing the importance of the source word $s_i$ as the main predictor of the target translation $t_j$.
Unexpectedly, no improvement is obtained by the use of the dropout regularizer (see Section 4.3).

SMT experiments
While the main objective of this paper is to improve prediction accuracy of word translations (see Section 5), we are also interested in knowing to what extent these improvements carry over within an end-to-end machine translation task. To this end, we integrate our translation prediction models described in Section 4 into our existing English-Russian SMT system.
For each phrase pair matching the input, the phrase BNN score $P_{\text{BNN-p}}$ is computed from the word-level BNN probabilities of its alignment links, where $a$ is the word-level alignment of the phrase pair $(s, t)$ and $\{a_i\}$ is the set of target positions aligned to $s_i$. If a source-target link cannot be scored by the BNN model, we give it a $P_{\text{BNN}}$ probability of 1 and increment a separate count feature $\varepsilon$. Note that the same phrase pair can get different BNN scores if used in different source side contexts.
Our baseline is an in-house phrase-based (Koehn et al., 2003) statistical machine translation system very similar to Moses (Koehn et al., 2007). All system runs use hierarchical lexicalized reordering (Galley and Manning, 2008; Cherry et al., 2012), distinguishing between monotone, swap, and discontinuous reordering, all with respect to left-to-right and right-to-left decoding. Other features include linear distortion, bidirectional lexical weighting (Koehn et al., 2003), word and phrase penalties, and finally a word-level 5-gram target LM trained on all available monolingual data with modified Kneser-Ney smoothing (Chen and Goodman, 1999). The distortion limit is set to 6 and for each source phrase the top 30 translation candidates are considered. When translating into a morphologically rich language, data sparsity issues in the target language become particularly apparent. To compensate for this we also experiment with a 5-gram suffix-based LM in addition to the surface-based LM (Müller et al., 2012; Bisazza and Monz, 2014).
The BNN models are integrated as additional log-probability feature functions ($\log P_{\text{BNN-p}}$): one feature for the word prediction model or two features for the stem and suffix models respectively, plus the penalty feature $\varepsilon$.
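A minimal sketch of one way such a phrase-level score could be assembled, assuming a lexical-weighting-style aggregation: the probabilities of the target positions aligned to each source word are averaged, and the averages are multiplied over the source words. This aggregation, the skipping of unaligned source words, and all names are assumptions for illustration and not the authors' exact formula; the handling of unscorable links follows the description above.

```python
import math


def phrase_bnn_score(alignment, link_prob):
    """Return (log phrase score, epsilon count) for one phrase pair.

    alignment: {source position: [aligned target positions]}
    link_prob: function (i, j) -> probability, or None if the link
               cannot be scored by the model
    """
    log_score, epsilon = 0.0, 0
    for i, targets in alignment.items():
        if not targets:
            continue  # assumption: unaligned source words contribute nothing
        total = 0.0
        for j in targets:
            p = link_prob(i, j)
            if p is None:  # unscorable link: probability 1, count it in epsilon
                p = 1.0
                epsilon += 1
            total += p
        log_score += math.log(total / len(targets))
    return log_score, epsilon


toy_probs = {(0, 0): 0.5, (1, 1): 0.2}  # link (1, 2) is deliberately missing
log_score, epsilon = phrase_bnn_score(
    {0: [0], 1: [1, 2], 2: []}, lambda i, j: toy_probs.get((i, j))
)
```

Because `link_prob` depends on the source context of each position, the same phrase pair can receive different scores in different sentences, which is the property the text highlights.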
Table 7 shows the data used to train our English-Russian SMT system. The feature weights for all approaches were tuned by using pairwise ranking optimization (Hopkins and May, 2011) on the wmt12 benchmark (Callison-Burch et al., 2012). During tuning, 14 PRO parameter estimation runs are performed in parallel on different samples of the n-best list after each decoder iteration. The weights of the individual PRO runs are then averaged and passed on to the next decoding iteration. Performing weight estimation independently for a number of samples corrects for some of the instability that can be caused by individual samples. The wmt13 set (Bojar et al., 2013) was used for testing. We use approximate randomization (Noreen, 1989) to test for statistically significant differences between runs (Riezler and Maxwell, 2005).
Translation quality is measured with case-insensitive BLEU[%] using one reference translation. In Table 8, statistically significant improvements over the respective baseline (Baseline and Base+suffLM) are marked at the p < .01 level. Integrating our bilingual neural network approach into our SMT system yields small but statistically significant improvements of 0.4 BLEU over a competitive baseline.

To better understand the BNN effect on the SMT system, we analyze the set of phrase pairs that are employed by the decoder to translate each sentence. This set is ranked by the weighted combination of phrase translation and lexical weighting scores, target language model score and, if available, phrase BNN scores. As shown in Table 9, the morphological BNN models have a positive effect on the decoder's lexical search space, increasing the recall of reference tokens among the top 1 and 3 phrase translation candidates. The mean reciprocal rank (MRR) also improves from 0.655 to 0.662. Looking at the 1-best SMT output, we observe a slight increase of reference/output recall (50.0% to 50.7%), which is less than the increase we observe for the top 1 translation candidates (57.6% to 59.0%). One possible explanation is that the new, more accurate translation distributions are overruled by other SMT model scores, like the target LM, that are based on traditional maximum-likelihood estimates. While the suffix-based LMs proved beneficial in our experiments, we speculate that higher gains could be obtained by coupling our approach with a morphology-aware neural LM like the one recently presented by Botha and Blunsom (2014).
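Approximate randomization can be sketched as a paired test that randomly swaps the outputs of two systems sentence by sentence. To stay short, the sketch uses the difference of summed per-sentence scores as the test statistic, which is a simplifying assumption: a corpus-level metric such as BLEU would instead be recomputed from the shuffled outputs. The function name, trial count and toy scores are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=1000, seed=0):
    """Two-sided p-value for the difference between two paired systems."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    at_least = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Swap the two systems' scores for this sentence with probability 0.5.
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        if abs(diff) >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)


p_same = approximate_randomization([0.3, 0.5, 0.4], [0.3, 0.5, 0.4])
p_diff = approximate_randomization([1.0] * 20, [0.0] * 20)
```

Identical systems give a p-value of 1, while a system that wins on every sentence gives a p-value near the smallest value the chosen number of trials can express.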
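The mean reciprocal rank reported above can be illustrated with a generic helper. How the paper maps reference tokens to ranked phrase candidates is not spelled out here, so the data layout (a ranked candidate list paired with a set of correct items) is an assumption, as are the names and toy data.

```python
def mean_reciprocal_rank(queries):
    """Average of 1/rank of the first correct candidate (0 if none is correct)."""
    total = 0.0
    for ranked, correct in queries:
        for rank, cand in enumerate(ranked, start=1):
            if cand in correct:
                total += 1.0 / rank
                break
    return total / len(queries)


queries = [
    (["a", "b", "c"], {"b"}),  # first correct candidate at rank 2
    (["x", "y"], {"x"}),       # first correct candidate at rank 1
    (["p", "q"], {"z"}),       # no correct candidate
]
mrr = mean_reciprocal_rank(queries)
```

A small MRR gain therefore means that correct candidates move up slightly in the ranking on average, even when the top-ranked candidate often stays the same.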

Related work
While most relevant literature has been discussed in earlier sections, the following approaches are particularly related to ours: Minkov et al. (2007) and Toutanova et al. (2008) address target inflection prediction with a log-linear model based on rich morphological and syntactic features. Their model exploits target context and is applied to inflect the output of a stem-based SMT system, whereas our models predict target words (or pairs of stem-suffix) independently and are integrated into decoding. Chahuneau et al. (2013) address the same problem with another feature-rich discriminative model that can be integrated in decoding, like ours, but they also use it to inflect on-the-fly stemmed phrases. It is not clear what part of their SMT improvements is due to the generation of new phrases or to better scoring. Jeong et al. (2010) predict surface word forms in context, similarly to our word BNN, and integrate the scores into the SMT system. Unlike us, they rely on linguistic feature-rich log-linear models to do that. Gimpel and Smith (2008) propose a similar approach to directly predict phrases in context, instead of words. All those approaches employed features that capture the global structure of source sentences, like dependency relations. By contrast, our models access only local context in the source sentence but they achieve accuracy gains comparable to models that also use global sentence structure.

Conclusions
We have proposed a general approach to predict word translations in context using bilingual neural network architectures. Unlike previous NN approaches, we model word, stem and suffix distributions in the target language given context in the source language. Instead of relying on manually engineered features, our models automatically learn abstract word representations and features that are relevant for the modeled task directly from word-aligned parallel data. Our preliminary results with LBL and ConvNet architectures suggest that potential improvement may be achieved by factorizing target representations or by dynamically modeling source context size. Evaluated on three morphologically rich languages, our approach achieves considerable gains in word, stem and suffix accuracy over a context-independent maximum-likelihood baseline. Finally, we have shown that the proposed BNN models can be tightly integrated into a phrase-based SMT system, resulting in a small but statistically significant BLEU improvement over a competitive, large-scale English-Russian baseline.
Our analysis shows that the number of correct target words occurring in highly scored phrase translation candidates increases after integrating the morphological BNNs. However, only a few of these end up in the 1-best translation output. Future work will investigate the benefits of coupling our BNN models with target language models that also exploit abstract word representations, such as Botha and Blunsom (2014) and Auli et al. (2013).

Figure 1: Fragment of an English sentence and its incorrect Russian translation produced by the baseline SMT system. Square brackets indicate phrase boundaries.


Figure 2: Feed-forward BNN architectures for predicting target translations: (a) word model (similar to stem model), and (b) suffix model with an additional vector representation $r_\sigma$ for target stems $\sigma$.

Figure 3: Convolutional neural network model. Edges with the same color indicate the same kernel weight matrix.

Figure 4: Effect of different word segmentation techniques (U: unsupervised, S: supervised, R: rule-based stemmer) on stem and suffix prediction accuracy. The dark part of each bar stands for top-1, the light one for top-3 accuracy.

Table 1: Lexical coverage analysis of the baseline SMT system (English-Russian wmt12).

Table 4: Accuracy at top-1/top-3 (%) of stem and suffix BNNs with different training data sizes.

Table 7: SMT training and test data statistics. All numbers refer to tokenized, lowercased data.

Table 9: Target word coverage analysis of the English-Russian SMT system before and after adding the morphological BNN models.