Bilingual Termbank Creation via Log-Likelihood Comparison and Phrase-Based Statistical Machine Translation

Bilingual termbanks are important for many natural language processing (NLP) applications, especially in translation workflows in industrial settings. In this paper, we apply a log-likelihood comparison method to extract monolingual terminology from the source and target sides of a parallel corpus. Then, using a Phrase-Based Statistical Machine Translation model, we align the extracted monolingual term lists to create a bilingual termbank. We manually evaluate our novel terminology extraction model on English-to-Spanish and English-to-Hindi data sets, and observe excellent performance for all domains. Furthermore, we report the performance of our monolingual terminology extraction model in comparison with a number of state-of-the-art terminology extraction models on the English-to-Hindi datasets.


Introduction
Terminology plays an important role in various NLP tasks including Machine Translation (MT) and Information Retrieval. It is also exploited in human translation workflows, where it plays a key role in ensuring translation consistency and reducing ambiguity across large translation projects involving multiple files and translators over a long period of time. The creation of monolingual and bilingual terminological resources using human experts is, however, an expensive and time-consuming task. In contrast, automatic terminology extraction is much faster and less expensive, but cannot be guaranteed to be error-free. Accordingly, in real NLP applications, a manual inspection is required to amend or discard anomalous items from an automatically extracted terminology list. The automatic terminology extraction task starts with selecting candidate terms from the input domain corpus, usually in one of two ways: (i) linguistic processors are used to identify noun phrases that are regarded as candidate terms (Kupiec, 1993; Frantzi et al., 2000), or (ii) non-linguistic n-gram word sequences are regarded as candidate terms (Deane, 2005).
Various statistical measures have been used to rank candidate terms, such as C-Value (Ananiadou et al., 1994), NC-Value (Frantzi et al., 2000), log-likelihood comparison (Rayson and Garside, 2000), and TF-IDF (Basili et al., 2001). In this paper, we present our bilingual terminology extraction model, which is composed of two consecutive and independent processes: 1. A log-likelihood comparison method is employed to rank candidate terms (n-gram word sequences) independently from the source and target sides of a parallel corpus, 2. The extracted source terms are aligned to one or more extracted target terms using a Phrase-Based Statistical Machine Translation (PB-SMT) model (Koehn et al., 2003).
We then evaluate our novel bilingual terminology extraction model on various domain corpora considering English-to-Spanish and low-resourced and less-explored English-to-Hindi language-pairs and see excellent performance for all data sets.
The remainder of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe our two-stage terminology extraction model. Section 4 presents the results and analyses of our experiments, while Section 5 concludes, and provides avenues for further work.
Related Work

In this work, we focus on extracting bilingual terminology from a parallel corpus. He et al. (2006) demonstrate that using log-likelihood for term discovery performs better than TF-IDF. Accordingly, similarly to Rayson and Garside (2000) and Gelbukh et al. (2010), we extract terms independently from both sides of a parallel corpus using log-likelihood comparisons with a generic reference corpus. Some of the most influential research on bilingual terminology extraction includes Kupiec (1993), Gaussier (1998), Ha et al. (2008) and Lefever et al. (2009). Lefever et al. (2009) proposed a sub-sentential alignment-based terminology extraction module that links linguistically motivated phrases in parallel texts. Unlike our approach, theirs relies on linguistic analysis tools such as PoS taggers or lemmatizers, which might be unavailable for under-resourced languages (e.g., Hindi). Gaussier (1998) and Ha et al. (2008) applied statistical approaches to acquire parallel term-pairs directly from a sentence-aligned corpus, with the latter focusing on improving monolingual term extraction, rather than on obtaining a bilingual term list. In contrast, we build a PB-SMT model (Koehn et al., 2003) from the input parallel corpus, which we use to align a source term to one or more target terms. While Rayson and Garside (2000) and Gelbukh et al. (2010) only allowed the extraction of single-word terms, we focus on the extraction of terms of up to 3-grams.

Methodology
In this section, we describe our two-stage bilingual terminology extraction model. In the first stage, we extract monolingual terms independently from either side of a sentence-aligned domain-specific parallel corpus. In the second stage, the extracted source terms are aligned to one or more extracted target terms using a PB-SMT model.

Monolingual Terminology Extraction
The monolingual term extraction task involves the identification of terms from a list of candidate terms formed from all n-gram word sequences in the monolingual domain corpus (i.e. in our case, each side of the domain parallel corpus, cf. Section 4.1). On both source and target sides, we used lists of language-specific stop-words and punctuation marks in order to filter out anomalous items from the candidate termlists. In order to rank the candidate terms in those lists, we used a log-likelihood comparison method that compares the frequencies of each candidate term in the domain corpus and in the large general corpus used as a reference. The log-likelihood (LL) value of a candidate term ($C_n$) is calculated using equation (1) from Gelbukh et al. (2010):

$$LL(C_n) = 2\left(F_d \log\frac{F_d}{E_d} + F_g \log\frac{F_g}{E_g}\right) \quad (1)$$
where $F_d$ and $F_g$ are the frequencies of $C_n$ in the domain corpus and the generic reference corpus, respectively, and $E_d$ and $E_g$ are the expected frequencies of $C_n$, which are calculated using (2) and (3):

$$E_d = \frac{N_d^n (F_d + F_g)}{N_d^n + N_g^n} \quad (2) \qquad E_g = \frac{N_g^n (F_d + F_g)}{N_d^n + N_g^n} \quad (3)$$

where $N_d^n$ and $N_g^n$ are the numbers of n-grams in the domain corpus and reference corpus, respectively. Thus, each candidate term is associated with a weight (its LL value) which is used to sort the candidate terms: those candidates with the highest weights have the most significant differences in frequency between the two corpora. However, we are interested in those candidate terms that are likely to be terms in the domain corpus. Like Gelbukh et al. (2010), we therefore use the condition in (4) to retain only those candidate terms whose relative frequencies are higher in the domain corpus than in the reference corpus:

$$\frac{F_d}{N_d^n} > \frac{F_g}{N_g^n} \quad (4)$$
In contrast with Gelbukh et al. (2010), we extract multi-word terms up to 3-grams, whereas they focused solely on extracting single word terms.
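To make the ranking concrete, the following is a minimal Python sketch of the log-likelihood comparison described above, assuming whitespace-tokenized sentences; the function names (`ngrams`, `ll_rank`) are our own illustration, not the paper's actual implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-gram word sequences of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ll_rank(domain_sents, ref_sents, n, stopwords):
    """Rank candidate n-grams by the log-likelihood comparison in (1)-(4)."""
    dom = Counter(g for s in domain_sents for g in ngrams(s.split(), n))
    ref = Counter(g for s in ref_sents for g in ngrams(s.split(), n))
    n_d, n_g = sum(dom.values()), sum(ref.values())  # n-gram totals N_d, N_g
    scored = []
    for cand, f_d in dom.items():
        if any(w in stopwords for w in cand.split()):
            continue  # stop-word filtering of anomalous candidates
        f_g = ref.get(cand, 0)
        if f_d / n_d <= f_g / n_g:
            continue  # condition (4): keep terms relatively more frequent in the domain
        e_d = n_d * (f_d + f_g) / (n_d + n_g)  # expected frequency, eq. (2)
        e_g = n_g * (f_d + f_g) / (n_d + n_g)  # expected frequency, eq. (3)
        ll = 2 * (f_d * math.log(f_d / e_d)
                  + (f_g * math.log(f_g / e_g) if f_g else 0.0))  # eq. (1)
        scored.append((ll, cand))
    return sorted(scored, reverse=True)
```

In practice the candidates would be pooled over n = 1, 2, 3 and ranked in a single list, as the paper extracts terms of up to 3-grams.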

Creating a Bilingual Termbank
We obtained source and target termlists from the bilingual domain corpus using the approach described in Section 3.1. We use a PB-SMT model (Koehn et al., 2003) to create a bilingual termbank from the extracted source and target termlists. This section provides a mathematical derivation of the PB-SMT model to show how we scored candidate term-pairs. We built a source-to-target PB-SMT model from the bilingual domain corpus using the Moses toolkit (Koehn et al., 2007). In PB-SMT, the posterior probability $P(e_1^I \mid f_1^J)$ is directly modelled as a (log-linear) combination of features (Och and Ney, 2002), usually comprising M translational features and the language model, as in (5):

$$\log P(e_1^I \mid f_1^J) = \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, s_1^K) + \lambda_{LM} \log P_{LM}(e_1^I) \quad (5)$$

where $e_1^I = e_1, \ldots, e_I$ is the probable candidate translation for the given input sentence $f_1^J = f_1, \ldots, f_J$, and $s_1^K = s_1, \ldots, s_K$ denotes a segmentation of the source and target sentences respectively into the sequences of phrases $(\hat{f}_1, \ldots, \hat{f}_K)$ and $(\hat{e}_1, \ldots, \hat{e}_K)$ such that (we set $i_0 := 0$):

$$s_k = (i_k; b_k, j_k), \qquad \hat{e}_k := e_{i_{k-1}+1} \ldots e_{i_k}, \qquad \hat{f}_k := f_{b_k} \ldots f_{j_k}$$

Since the translational features decompose over phrase-pairs, (5) can be rewritten as in (6):

$$\log P(e_1^I \mid f_1^J) = \sum_{m=1}^{M} \lambda_m \sum_{k=1}^{K} \hat{h}_m(\hat{f}_k, \hat{e}_k) + \lambda_{LM} \log P_{LM}(e_1^I) \quad (6)$$

Therefore, the translational features in (5) can be rewritten as in (7):

$$h_m(e_1^I, f_1^J, s_1^K) = \sum_{k=1}^{K} \hat{h}_m(\hat{f}_k, \hat{e}_k) \quad (7)$$

In equation (7), $\hat{h}_m$ is a feature defined on phrase-pairs $(\hat{f}_k, \hat{e}_k)$, and $\lambda_m$ is the feature weight of $\hat{h}_m$. These weights ($\lambda_m$) are optimized using minimum error-rate training (MERT) (Och, 2003) on a held-out 500 sentence-pair development set for each of the experiments. We create a list of probable source-target term-pairs by taking each source and target term from the source and target termlists, respectively, provided that those source-target term-pairs are present in the PB-SMT phrase-table. We calculate a weight ($w$) for each source-target term-pair (essentially, a phrase-pair, i.e. $(\hat{e}_k, \hat{f}_k)$) using (8):

$$w(\hat{f}_k, \hat{e}_k) = \sum_{m=1}^{4} \lambda_m \hat{h}_m(\hat{f}_k, \hat{e}_k) \quad (8)$$

In order to calculate $w$, we used the four standard PB-SMT translational features ($\hat{h}_m$): the forward phrase translation log-probability ($\log P(\hat{e}_k \mid \hat{f}_k)$), its inverse ($\log P(\hat{f}_k \mid \hat{e}_k)$), the forward lexical log-probability ($\log P_{lex}(\hat{e}_k \mid \hat{f}_k)$), and its inverse ($\log P_{lex}(\hat{f}_k \mid \hat{e}_k)$). We set a threshold on $w$ and retained only those term-pairs whose weights exceeded it. For each source term, we kept at most the four highest-weighted target terms.
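As an illustration of the pairing step, here is a hedged Python sketch that scores candidate term-pairs with the weight in (8) against a phrase table held in memory; the dictionary layout and the function name `pair_terms` are our own assumptions for illustration, not the Moses API or the paper's code.

```python
import math
from collections import defaultdict

def pair_terms(src_terms, tgt_terms, phrase_table, lambdas, threshold, k=4):
    """Score candidate term-pairs with eq. (8) and keep the k best per source.

    phrase_table maps (source phrase, target phrase) to the four standard
    translational feature values (log-probabilities); lambdas are the four
    MERT-tuned feature weights.
    """
    src_set, tgt_set = set(src_terms), set(tgt_terms)
    candidates = defaultdict(list)
    for (src, tgt), feats in phrase_table.items():
        if src not in src_set or tgt not in tgt_set:
            continue  # keep only pairs whose sides appear in both termlists
        w = sum(lam * h for lam, h in zip(lambdas, feats))  # eq. (8)
        if w > threshold:
            candidates[src].append((w, tgt))
    # at most the k highest-weighted target terms per source term
    return {src: [tgt for _, tgt in sorted(pairs, reverse=True)[:k]]
            for src, pairs in candidates.items()}
```

A real run would load the Moses phrase-table file rather than an in-memory dictionary, but the scoring and thresholding logic is the same.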

Data Used
We conducted experiments on several data domains for two different language-pairs, English-to-Spanish and English-to-Hindi. For English-to-Spanish, we worked with client-provided data taken from six different domains in the form of translation memories. For English-to-Hindi, we used three parallel corpora from three different sources (EILMT, EMILLE and Launchpad) taken from HindEnCorp 3 (Bojar et al., 2014) released for the WMT14 shared translation task, 4 and a parallel corpus of KDE4 localization files 5 (Tiedemann, 2009). The EMILLE corpus contains leaflets from the UK Government and various local authorities. The domain of the EILMT 6 corpus is tourism. We used data from a collection of translated documents from the United Nations (MultiUN) 7 (Tiedemann, 2009) and the European Parliament (Koehn et al., 2005) as the monolingual English and Spanish reference corpora. We used the HindEnCorp monolingual corpus (Bojar et al., 2014) as the monolingual Hindi reference corpus. The statistics of the data used in our experiments are shown in Table 1.

Runtime Performance
Our terminology extraction model is composed of two main processes: (i) Moses training and tuning (restricting the number of iterations of MERT to a maximum of 6), and (ii) terminology extraction. In Table 2, we report the actual runtimes of these two processes on the six domain corpora. As Table 2 demonstrates, both the MT system-building (training and tuning combined) and terminology extraction processes complete very quickly on each corpus. Given the crucial influence of bilingual terminology on quality in translation workflows, we believe that the creation of such assets from scratch in less than 30 minutes may prove to be a significant breakthrough for translators.

3 http://ufallab.ms.mff.cuni.cz/~bojar/hindencorp/
4 http://www.statmt.org/wmt14/
5 http://opus.lingfil.uu.se/KDE4.php
6 English-to-Indian Language Machine Translation (EILMT) is a project sponsored by the Ministry of IT, Govt. of India.
7 http://opus.lingfil.uu.se/MultiUN.php

Human Evaluation
Of course, it is one thing to rapidly create translation assets such as bilingual termbanks, and another entirely to ensure the quality of such resources. Accordingly, we evaluated the performance of our bilingual terminology extraction model on each English-to-Spanish and English-to-Hindi domain corpus reported in Table 1, with the evaluation goals being twofold: (i) measuring the accuracy of the monolingual terminology extraction process, and (ii) measuring the accuracy of our novel bilingual terminology creation model. As mentioned in Section 3.2, a source term may be aligned with up to four target terms. For evaluation purposes, we considered the top-100 source terms based on the LL values (cf. (1)) and their target counterparts (i.e. one to four target terms). The quality of the extracted terms was judged by native Spanish and Hindi speakers, both with excellent English skills, and the evaluation results are reported in Table 3. Note that we were not able to measure recall of the term extraction model on the domain corpora due to the unavailability of a reference terminology set. The evaluator counted the number of valid terms in the source term list for the domain in question, and the percentage of valid terms with respect to the total number of terms (i.e. 100) is reported in the second column in Table 3. We refer to this as VST (Valid Source Terms). For each valid source term there are one to four target terms that are ranked according to the weights in (8). In theory, therefore, the top-ranked target term is the most suitable target translation of the aligned source term. The evaluator counted the number of instances where the top-ranked target term was a suitable target translation of the source term; the percentage with respect to the number of valid source terms is shown in the third column in Table 3, and denoted as VTT (Valid Target Terms).
The evaluator also reported the number of cases where any of the four target terms was a suitable translation of the source term; the percentage with respect to the number of valid source terms is given in the fourth column in Table 3. Furthermore, the evaluator counted the number of instances where any of the four target terms could, with minor editing, be regarded as a suitable target translation; the percentage with respect to the number of valid source terms is reported in the last column of Table 3. In Table 4, we show three English-Spanish term-pairs extracted by our automatic term extractor where the target terms (Spanish) are slightly incorrect. In all these examples the edit distance between the correct term and the one proposed by our automatic extraction method is quite low, meaning that just a few keystrokes can transform the candidate term into the correct one. In these cases, editing the candidate term is much cheaper (in terms of time) than creating the translation from scratch.

In Table 3, we see that the accuracy of the monolingual term extraction model varies from 72% to 94% for both English-to-Spanish and English-to-Hindi. For English-to-Spanish, the accuracy of our bilingual terminology creation model ranges from 86.1% to 93.6%, 91.7% to 97.8% and 93.1% to 97.8% when the 1-best, 4-best and 4-best with slightly edited target terms are considered, respectively. For English-to-Hindi, the corresponding ranges are 62.1% to 95.4%, 83.5% to 98.8% and 94.9% to 98.8%.
We are greatly encouraged by these results, as they demonstrate that our novel bilingual termbank creation method is robust in the face of the somewhat noisy monolingual term-extraction results; as a consequence, if better methods for suggesting monolingual term candidates are proposed, we expect the performance of our bilingual term-creation model to improve accordingly.
We calculated the distributions of unigram, bigram and trigram terms among the valid source terms (cf. Table 3) and report them in Table 5. We also calculated the percentages of their distributions in the valid source terms averaged over all 10 data sets. As can be seen from Table 5, the average proportion of trigram terms is quite low (i.e. 2.5%). This result justifies our decision to extract terms of up to 3-grams only.
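The distribution computation is straightforward; a small Python sketch (our own illustration, with a hypothetical function name) that derives the unigram/bigram/trigram percentages for a term list:

```python
from collections import Counter

def length_distribution(terms):
    """Percentage of unigram, bigram and trigram entries in a term list,
    where each term is a whitespace-separated sequence of up to 3 words."""
    counts = Counter(len(t.split()) for t in terms)
    total = sum(counts.values())
    return {n: 100.0 * counts.get(n, 0) / total for n in (1, 2, 3)}
```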

Comparison: Monolingual Terminology Extraction
In this section we report the performance of our monolingual terminology extraction model (cf. Section 3.1) in comparison with that of several state-of-the-art terminology extraction algorithms capable of recognising multiword terms. In order to extract monolingual multiword terms we used the JATE toolkit (Zhang et al., 2008). This toolkit first extracts candidate terms from a corpus using linguistic tools and then applies term extraction algorithms to recognise terms specific to the domain corpus. The JATE toolkit is currently available only for the English language. For evaluation purposes, we considered the source-side of the English-to-Hindi domain corpora.

Algorithm                              EILMT   EMILLE   Launchpad   KDE4
LLC (Bilingual; cf. VST in Table 3)     91      79       88          79
LLC                                     77      53       80          71
STF                                     46       4       54          44
ACTF                                    42      15       62          48
TF-IDF                                  50      36       45          17
Glossex (Kozakov et al., 2004)

The automatic term extraction algorithms in JATE assign weights (domain representativeness) to the candidate terms, giving an indication of their likelihood of being good domain-specific terms. The quality of the extracted terms (the top-100 highest weighted) was judged by an evaluator with excellent English skills, and the evaluation results are reported in Table 6. The evaluator counted the number of valid terms in the highest-weighted 100 terms that were extracted using different state-of-the-art term extraction algorithms.
The third row of Table 6 represents the percentage of the valid source terms extracted by our log-likelihood comparison (LLC) based monolingual term extraction algorithm. The next three rows represent three basic monolingual term extraction algorithms (STF: simple term frequency, ACTF: average corpus term frequency, and TF-IDF) available in the JATE toolkit. The last seven rows represent seven state-of-the-art terminology extraction algorithms. As can be seen from Table 6, LLC is the joint best-performing algorithm, together with Weirdness (Ahmad et al., 1999) on the EILMT corpus and with Glossex (Kozakov et al., 2004) on the KDE4 corpus. LLC is also the second-best performing algorithm on the EMILLE and the Launchpad corpora.
We see in Table 6 that the percentage of valid source terms is quite low on the EMILLE corpus. This might be caused by it containing information leaflets in a variety of domains (consumer, education, housing, health, legal, social), which might bring down the percentage of valid source terms on this corpus.
Note that the percentage of valid source terms (VST) reported in Table 3 is calculated taking the top-100 source terms from the bilingual term-pair list that were extracted using the method described in Section 3.2. For comparison purposes we again report this percentage (VST in Table 3) in the second row in Table 6. Our bilingual term extraction method discards any anomalous pairs from the initial candidate term-pair list (cf. Section 3.2). This essentially removes some of the source entries that are not pertinent to the domain. As a result, the percentage of the valid source terms extracted applying our bilingual terminology extraction method (Table 3) is higher than the percentage of the valid source terms extracted applying our monolingual terminology extraction algorithm (LLC) (Table 6). We clearly see from Tables 3 and 6 that this bilingual approach to term extraction not only achieves remarkable performance on the bilingual task, but that when used in a monolingual context it outperforms most state-of-the-art extraction algorithms, and is comparable with the best ones. We should also note that JATE's implementation of these algorithms (including Weirdness) uses language-dependent modules such as a lemmatizer, unlike our implementation of LLC, which is language-independent.

Conclusions and Future Work
In this paper we presented a bilingual multi-word terminology extraction model based on two independent, consecutive processes. Firstly, we employed a log-likelihood comparison method to extract source and target terms independently from both sides of a parallel domain corpus. Secondly, we used a PB-SMT model to align source terms to one or more target terms. The manual evaluation results on ten different domain corpora of two syntactically divergent language-pairs showed the accuracy of our bilingual terminology extraction model to be very high, especially in light of the somewhat noisy monolingual candidate terms presented to it. Given the reported high levels of performance (minimum levels of 93.1% and 94.9% in the 4-best set-up with slightly edited target terms, across all six domains for English-to-Spanish and all four domains for English-to-Hindi, respectively), we are convinced that the extracted bilingual multiword termbanks are useful 'as is', and with a small amount of post-processing from domain experts would be completely error-free.
The proposed bilingual terminology extraction model has been tested on a highly investigated language-pair, English-to-Spanish, and a less-explored, low-resourced English-to-Indic language-pair, English-to-Hindi. Interestingly, the performance of the bilingual terminology extraction model is excellent for both language-pairs. We also tested several state-of-the-art monolingual terminology extraction algorithms, including our own log-likelihood comparison method, on the source-side of the four English-to-Hindi domain data sets. According to the manual evaluation results, our monolingual multiword term extraction model proves to be the best-performing algorithm on two domain data sets and the second best-performing on the remaining two. Our monolingual multiword terminology extraction method is thus clearly comparable to the state-of-the-art monolingual terminology extraction algorithms.
In this work, we considered all n-gram word sequences from the domain corpus as candidate terms.
In future work, we would like to incorporate the candidate phrasal term identification model of Deane (2005), which would omit irrelevant multiword units, and help us extend our evaluation beyond the top-100 terms. We also plan to demonstrate the impact of the created termbanks on translator productivity in a number of workflows -different language pairs, domains, and levels of post-editing -in an industrial setting.