Word Alignment without NULL Words



Introduction
When the IBM models (Brown et al., 1993) were designed, some way of accounting for words that likely have no translation was needed. The modellers decided to introduce a NULL word on the target (generating) side. All source words without a proper target translation would then be generated by that NULL word.
While this solution is technically valid, it neglects the fact that those untranslatable words are still required for source-side fluency. Moreover, the NULL word, although hypothetical in nature, does occupy a position. It is well known that this NULL position is problematic for distortion-based alignment models: alignments to NULL demand special treatment, as they would otherwise induce very long jumps that one does not usually observe. Examples of this can be found in Vogel et al. (1996), who drop the NULL word entirely and thus force all source words to align lexically, and in Och and Ney (2003), who choose a fixed NULL probability.
In the present work, we introduce a family of IBM-style alignment models that can express dependencies between translated and untranslated source words. The models do not use NULL words and instead allow untranslatable source words to be generated from translated words in their context. This is achieved by modelling source word collocations. From a technical point of view the model can be seen as a mixture of an alignment and a language model.

IBM models 1 and 2
Here, we quickly review the IBM alignment models 1 and 2 (Brown et al., 1993). We assume a random variable E over the English (target) vocabulary, a variable F over the French (source) vocabulary and a variable A over alignment links. The IBM models assign probabilities to alignment configurations and source sentences given the target side. Under the assumption that all source words are conditionally independent given the alignment links, these probabilities factorise as

$$P(f_1^m, a_1^m \mid e_0^l) = P(a_1^m) \prod_{j=1}^{m} P(f_j \mid e_{a_j}),$$

where $x_1^k$ is a vector of outcomes $x_1, \ldots, x_k$ and $e_{a_j}$ denotes the English word that the French word in the $j$th position ($f_j$) is aligned to under $a_1^m$. In IBM model 1, $P(a_1^m)$ is uniform. In IBM model 2, all alignment links $a_j$ are assumed to be independent and to follow a categorical distribution. Here, we choose to parametrise this categorical by the distance between the two words to be aligned, as has been done by Vogel et al. (1996) and Liang et al. (2006). Thus, in our IBM model 2

$$P(a_j = i \mid l, m) \propto \theta_{d(i,j,l,m)},$$

where $d(i,j,l,m)$ measures the distance between target position $i$ and source position $j$ relative to the sentence lengths (for example $i - \lfloor j \cdot l / m \rfloor$), $i$ is the position of the English word that $a_j$ links to, and $l$ and $m$ are the target and source sentence lengths. Notice that there is a target position $i = 0$ for the NULL word. Alignments to this NULL position often cause unusually long alignment jumps.
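To make the factorisation concrete, here is a small sketch that scores a source sentence and alignment under an IBM-model-2-style parameterisation; the translation table, the jump-based distortion and all probability values are illustrative assumptions, not estimates from the paper.

```python
import math
from collections import defaultdict

def jump(i, j, l, m):
    """One common distance parameterisation: how far target position i is
    from where source position j would land under a diagonal alignment."""
    return i - (j * l) // m

def log_prob_ibm2(source, target, alignment, t_table, jump_probs):
    """log P(f, a | e) = log P(a) + sum_j log P(f_j | e_{a_j}).
    alignment[j-1] gives the (1-based) target position for source position j."""
    l, m = len(target), len(source)
    logp = 0.0
    for j, f in enumerate(source, start=1):
        i = alignment[j - 1]
        e = target[i - 1]
        logp += math.log(jump_probs[jump(i, j, l, m)])  # distortion term
        logp += math.log(t_table[e][f])                 # lexical translation term
    return logp

# Toy example with made-up probabilities.
t_table = {"the": {"le": 0.6, "la": 0.4}, "house": {"maison": 0.9, "le": 0.1}}
jump_probs = defaultdict(lambda: 0.1, {0: 0.8})
print(log_prob_ibm2(["le", "maison"], ["the", "house"], [1, 2], t_table, jump_probs))
```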
Removing the NULL word

Model description
Our model consists of an alignment model component (which is either IBM model 1 or 2 without NULL words) and a language model component.
It also contains a random variable Z that indicates which component to use: if Z = 0 we use the alignment model, if Z = 1 we use the language model. We generate each z_j conditional on f_{j-1}. By making the outcome z_j depend on f_{j-1}, we allow the model to capture the tendency of individual source words to be part of a collocation, i.e. to be followed by a closely related word. A similar strategy has been employed for topic modelling by Griffiths et al. (2007). When generating the source side, the model does the following for each source word f_j (a code sketch of this procedure follows below):

1. Depending on the previous source word f_{j-1}, draw z_j.
2. If z_j = 1, generate f_j from f_{j-1} and choose a_j according to P(a_j). Otherwise, if z_j = 0, generate f_j from the target side and choose a_j according to the probability that it has under the relevant alignment model without a target NULL word.
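Below is a minimal sketch of this generative story in Python; the parameter containers (choice_prob, trans_prob, lm_prob) and the align_prob callable are hypothetical stand-ins for the model's distributions, not the authors' implementation.

```python
import random

def sample_categorical(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome

def generate_source(target, choice_prob, trans_prob, lm_prob, align_prob, length):
    """Hypothetical sketch of the generative story: for each source position,
    first decide (conditioned on the previous source word) whether the word is
    translated (z=0) or generated as a collocate of its predecessor (z=1)."""
    source, alignment, choices = [], [], []
    prev = "<s>"  # sentence-initial context for the first source word
    for j in range(1, length + 1):
        z = 1 if random.random() < choice_prob[prev] else 0
        i = sample_categorical(align_prob(j, len(target), length))  # alignment link a_j
        if z == 1:
            f = sample_categorical(lm_prob[prev])              # collocation: from f_{j-1}
        else:
            f = sample_categorical(trans_prob[target[i - 1]])  # translation: from e_{a_j}
        source.append(f)
        alignment.append(i)
        choices.append(z)
        prev = f
    return source, alignment, choices
```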
Our model thus induces a joint probability distribution of the form

$$P(f_1^m, a_1^m, z_1^m \mid e_1^l) = \prod_{j=1}^{m} P(z_j \mid f_{j-1})\, P(a_j)\, P(f_j \mid e_{a_j}, f_{j-1}, z_j),$$

with

$$P(f_j \mid e_{a_j}, f_{j-1}, z_j) = \begin{cases} P(f_j \mid e_{a_j}) & \text{if } z_j = 0 \quad (4) \\ P(f_j \mid f_{j-1}) & \text{if } z_j = 1 \quad (5) \end{cases}$$

[Figure 1: A graphical representation of our model for S sentence pairs. We use V_{f/e} to denote the source/target vocabulary sizes and D to denote the number of possible alignment link configurations. Furthermore, m/l denote the number of source/target words in the current sentence and f_{prv} the source word preceding the one that we currently generate.]
where it is crucial to note that there is no longer an E_0 variable standing for the NULL word. Therefore, jumps to a NULL position do not need to be modelled. Notice further that the formulation of our model is general enough to be readily extensible to an HMM alignment model (Vogel et al., 1996). Depending on the value of z_j, F_j is distributed either according to the alignment model (4) or the language model (5).

The full model
Our full model is a Bayesian model: we treat all model parameters as random variables drawn from prior distributions. A graphical depiction of the model can be found in Figure 1. We impose Dirichlet priors on the translation (θ_e), language model (θ_f) and distortion (θ_a) parameters. This has been done before and has improved the standard IBM models. In order to bias the model against using the language model component (5) too often and instead make it prefer the alignment model component (4), we impose a Beta prior on the Bernoulli distributions over component choices.
In effect, the model will only explain a source word with the language model if there is strong evidence that this word cannot be translated from the target side. The full model can be summarised as follows:

$$\begin{aligned}
\theta_e &\sim \mathrm{Dirichlet}(\alpha) && \text{(translation distribution of target word } e\text{)}\\
\theta_f &\sim \mathrm{Dirichlet}(\beta) && \text{(language model distribution of source word } f\text{)}\\
\theta_a &\sim \mathrm{Dirichlet}(\gamma) && \text{(distortion distribution)}\\
\phi_f &\sim \mathrm{Beta}(s, r) && \text{(component choice distribution of source word } f\text{)}\\
Z_j \mid f_{j-1} &\sim \mathrm{Bernoulli}(\phi_{f_{j-1}})\\
F_j \mid e_{a_j}, z_j = 0 &\sim \mathrm{Categorical}(\theta_{e_{a_j}})\\
F_j \mid f_{j-1}, z_j = 1 &\sim \mathrm{Categorical}(\theta_{f_{j-1}})
\end{aligned}$$

For IBM model 1, $A_j$ is uniformly distributed, whereas for model 2 it follows the categorical distribution $\mathrm{Categorical}(\theta_a)$ parametrised by the distance measure $d(i,j,l,m)$ introduced above.
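As a concrete picture of this Bayesian layer, the sketch below draws one set of parameters from the priors with NumPy; the mapping of the first Beta parameter to the alignment component (z = 0) and the vocabulary sizes passed in are assumptions made only for illustration.

```python
import numpy as np

def draw_parameters(V_f, V_e, D, alpha=1e-4, beta=1e-4, gamma=1.0, s=1.0, r=0.1, rng=None):
    """Draw one set of model parameters from the priors: Dirichlet priors on
    the translation, language-model and distortion distributions, and a Beta
    prior on the per-word component choice (assumed here to govern z = 0)."""
    rng = rng or np.random.default_rng()
    theta_e = rng.dirichlet(np.full(V_f, alpha), size=V_e)  # P(f | e), one row per target word
    theta_f = rng.dirichlet(np.full(V_f, beta), size=V_f)   # P(f | f_prev), one row per source word
    theta_a = rng.dirichlet(np.full(D, gamma))               # distortion over distance buckets
    phi = rng.beta(s, r, size=V_f)                           # assumed P(z = 0 | f_prev) per source word
    return theta_e, theta_f, theta_a, phi
```

With the sparse settings α = β = 0.0001 most of the probability mass in each sampled lexical distribution concentrates on a handful of outcomes, which is the intended effect described in the experiments section.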

Inference
We use a Gibbs sampler to infer the alignment and choice variables. Since our priors are conjugate to the model distributions, we integrate out the model parameters, giving us a collapsed sampler. (Derivations of samplers similar to ours can be found in the appendices of Mermer et al. (2013) and Griffiths et al. (2007); we omit our derivation for space reasons.) The sampler alternates between sampling alignment links A and component choices Z.
The predictive posterior probabilities for Z_j = 0 and Z_j = 1 are given in Equations (6) and (7) (up to proportionality). We use c(·) as a (conditional) count function that records how often an outcome has been observed in a given context, and V_f to denote the French (source) vocabulary size. To ease notation, we also introduce the context set C_{-X_j}, which contains the current values of all variables in our model except X_j, and the set H, which contains all hyperparameters.
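Equations (6) and (7) themselves are not reproduced here. As a hedged sketch of their general shape only, assuming the Beta prior pairs s with the alignment component (z = 0) and r with the language model component (z = 1), the collapsed Beta-Bernoulli and Dirichlet-categorical predictives would read

$$P(Z_j = 0 \mid C_{-Z_j}, H) \propto \frac{c(z = 0 \mid f_{j-1}) + s}{c(f_{j-1}) + s + r} \cdot \frac{c(f_j \mid e_{a_j}, z = 0) + \alpha}{c(e_{a_j} \mid z = 0) + \alpha V_f}$$

$$P(Z_j = 1 \mid C_{-Z_j}, H) \propto \frac{c(z = 1 \mid f_{j-1}) + r}{c(f_{j-1}) + s + r} \cdot \frac{c(f_j \mid f_{j-1}, z = 1) + \beta}{c(f_{j-1} \mid z = 1) + \beta V_f}$$

where all counts exclude the variable currently being resampled; the first expression corresponds to Equation (6) and the second to Equation (7).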
When Z_j = 0, the predictive probability for alignment link A_j is proportional to Equation (8):

$$P(a_j)\,\frac{c(f_j \mid e_{a_j}, z = 0) + \alpha}{c(e_{a_j} \mid z = 0) + \alpha V_f} \qquad (8)$$

When Z_j = 1, it is simply proportional to P(a_j).
In the case of IBM model 1, P(a_j) is a constant. For IBM model 2, we use the analogous collapsed predictive for the distortion counts,

$$P(a_j = i) \propto \frac{c\!\left(d(i, j, l, m)\right) + \gamma}{c(\cdot) + \gamma D},$$

where l and m are the target and source sentence lengths. Notice that target positions start at 1, as we do not use a NULL word. A naïve implementation of our sampler is impractically slow. We therefore augment the sampler with an auxiliary variable (Tanner and Wong, 1987) that uniformly chooses only one candidate new assignment per sampled link. The sampling complexity, which would otherwise be linear in the length of the target sentence, thus becomes constant. In practice this speeds up the sampler by several orders of magnitude, making our aligner as fast as Giza++. Unfortunately, this strategy also slightly impairs the mobility of our sampler.
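One plausible reading of this auxiliary-variable scheme is a Barker-style choice between the current link and a single uniformly proposed candidate. The sketch below illustrates that reading for the model 1 case; the count tables and their layout are hypothetical, and this is not the authors' implementation.

```python
import random

def resample_link(j, links, source, target, counts, alpha, V_f):
    """Illustrative auxiliary-variable move: uniformly propose one candidate
    target position and choose between it and the current link in proportion
    to the collapsed predictive of Equation (8). The counts passed in are
    assumed to already exclude the link currently being resampled."""
    current = links[j]
    candidate = random.randint(1, len(target))  # uniform proposal over positions 1..l (no NULL)

    def weight(i):
        e, f = target[i - 1], source[j]
        # Lexical part of Equation (8); P(a_j) is constant under IBM model 1 and cancels.
        return (counts.get((e, f), 0) + alpha) / (counts.get(e, 0) + alpha * V_f)

    w_cur, w_cand = weight(current), weight(candidate)
    if random.random() < w_cand / (w_cur + w_cand):
        links[j] = candidate
    return links[j]
```

Because only two candidates are ever scored, the cost per link is constant rather than linear in the target sentence length, which matches the speed-up described above.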

Decoding
Our samples contain assignments of the A and Z variables. If for a word f j we have z j = 1, we treat the word as not aligned. We then use maximum marginal decoding (Johnson and Goldwater, 2009) over alignment links to generate final word alignments. This means that we align each source word to the target word it has been aligned to most often in the samples. If the word was unaligned in most samples, we leave it unaligned in the output alignment.
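A minimal sketch of maximum marginal decoding over the collected samples, assuming each sample stores, per source position, either the aligned target position or None when z_j = 1 (this data layout is an assumption for illustration):

```python
from collections import Counter

def max_marginal_decode(samples):
    """samples: list of samples for one sentence pair, each a list with one
    entry per source position: a target position (int) or None if z_j = 1.
    Returns, per source position, the most frequent outcome across samples
    (None meaning the word is left unaligned)."""
    num_positions = len(samples[0])
    alignment = []
    for j in range(num_positions):
        votes = Counter(sample[j] for sample in samples)
        outcome, _ = votes.most_common(1)[0]
        alignment.append(outcome)
    return alignment

# Toy usage: three samples over a two-word source sentence.
print(max_marginal_decode([[2, None], [2, 1], [2, None]]))  # -> [2, None]
```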

Experiments and results
We present translation experiments on English paired with German, French, Czech and Japanese, thereby covering four language families. We compare our model and the Bayesian IBM models 1 and 2 of Mermer et al. (2013) against IBM model 2 as a baseline.

Experiments
Data We use the news commentary data from the WMT 2014 translation task for German, French and Czech paired with English. We use newstest-2013 as development data and newstest-2014 for testing. We use all available monolingual data from WMT 2014 for language modelling. All data are truecased and sentences with more than 100 words are discarded, as is standard in SMT. The Japanese training data consist of 200,000 randomly extracted sentence pairs from the NTCIR-8 Patent Translation Task; the full data are used for language modelling. We use the NTCIR-7 dev sets for tuning and the NTCIR-9 test set for testing. (The Japanese data was provided to us by a colleague with the pre-processing steps already performed, with sentences shortened to at most 40 words. Our algorithm can handle sentences of any length, so there is actually no need to restrict the sentence lengths.)

[Table 1: Results in the top (1a) and bottom (1b) tables were obtained in the target-to-source direction and symmetrised, respectively. Differences are computed with respect to the directional IBM model 2 in its original parameterisation (Brown et al., 1993). The best Bayesian model in each column is boldfaced.]

Training The maximum likelihood IBM model 2 is initialized with model 1 parameter estimates and trained for 5 EM iterations. Following Mermer and Saraçlar (2011), we initialize the Gibbs samplers of all Bayesian models with the Viterbi alignment from IBM model 1. We run each sampler for 1000 iterations and take a sample after every 25th iteration. We do not use burn-in, since burn-in is simply a heuristic that is not guaranteed to improve the samples in any way (see http://users.stat.umn.edu/~geyer/mcmc/burn.html for further details).

Hyperparameters All Bayesian models are trained with α = 0.0001 and β = 0.0001 to induce sparse lexical distributions. We also set s = 1 and r = 0.1 when IBM1 is the alignment component in our model. This has the effect of biasing the model towards using the alignment component. For the IBM2 version we even set r = 0.01, since IBM2 is a more trustworthy alignment model. For IBM2, we furthermore set γ = 1 to obtain a flat distortion prior.
Observe that the experiments presented here use the same fixed hyperparameters for all language pairs. We tried to add another level to our model by imposing Gamma priors on the hyperparameters, which were then inferred using slice sampling after each Gibbs iteration. When run on the German-English and Czech-English data, this strategy increased the posterior probability of the states visited by our sampler but had no effect on BLEU. This may indicate either that the hand-chosen hyperparameters are adequate for the task or that the model performs well over a large range of hyperparameters.
Translation We train Moses systems (Koehn et al., 2007) with 5-gram language models with modified Kneser-Ney smoothing, built with KenLM (Heafield et al., 2013), and orientation-based lexicalised reordering. We tune the systems with MERT (Och, 2003) on the dev sets. We report BLEU scores (Papineni et al., 2002) for all models, averaged over 5 MERT runs.

Results
We report the translation results in Tables (1a) and (1b). Results of the full Giza++ pipeline and fastAlign (Dyer et al., 2013) are reported as a comparison standard. All symmetrised results were obtained using the grow-diag-final-and heuristic.
Using IBM2 as the alignment component, our model mostly outperforms the standard IBM models and their Bayesian variants. Importantly, the improvement that our model 2 achieves over its model 1 variant is much larger than the difference between the corresponding models of Mermer et al. (2013). This indicates that our model makes better use of the distortion distribution, which is no longer affected by NULL alignments. We also observe that our model gains relatively little from symmetrisation, likely because it is already a very strong model. Interestingly, although our model 2 uses neither fertility parameters nor dependencies between alignment links, it often approaches the performance of Giza++, which does use these features. Moreover, it also approaches the performance of fastAlign, which likewise uses neither fertility nor dependencies between alignment links but has a stronger inductive bias with respect to distortion.

Discussion and future work
We have presented an IBM-style word alignment model that does not need to hypothesise a NULL word as it explains untranslatable source words by grouping them with translated words. This also leads to a cleaner handling of distortion probabilities.
In our present work, we have only considered IBM models 1 and 2. As mentioned above, our model can easily be extended with the HMM alignment model, and we are currently exploring this possibility. Our models also allow symmetrisation (Liang et al., 2006) of all translation and distortion parameters, whereas before the NULL distortion parameters had to be fixed. We therefore plan to extend them towards model-based instead of heuristic alignment symmetrisation.
A limitation of our model is that it can only capture left-to-right linear dependencies on the source side. In languages like German or English, however, where an adjective or determiner is selected by the following noun, such left-to-right dependencies may not be appropriate for modelling selection preferences amongst neighbouring words. An interesting extension of our model is thus to add more structure, so that it can capture more complex source-side dependencies.
Another concern is inference in our model. With the auxiliary variable sampler, inference becomes very fast but may sacrifice performance. We are therefore interested in improving the inference method, e.g. by using a more mobile sampler or by employing a variational Bayes algorithm. The software used in our experiments can be downloaded from https://github.com/philschulz/Aligner.