Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora

Narrow specialized comparable corpora are often small in size. This particularity makes it difficult to build efficient models to acquire translation equivalents, especially for less frequent and rare words. One way to overcome this issue is to enrich the specialized corpora with out-of-domain resources. Although some recent studies have shown improvements using data augmentation, the enrichment was carried out by simply adding out-of-domain data, with no particular attention to which words to enrich or how to do so optimally. In this paper, we contrast several data selection techniques to improve bilingual lexicon induction from specialized comparable corpora. We first apply two well-established data selection techniques often used in machine translation, namely Tf-Idf and cross entropy. Then, we propose to exploit BERT for data selection. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best performing technique is cross entropy, which obtains a gain of about 4 MAP points while decreasing computation time by a factor of 10.


Introduction
Mining translation equivalents to automate the process of generating and extending dictionaries from bilingual corpora is a well-known Natural Language Processing (NLP) application. Initially conducted on parallel data, this task has rapidly moved to comparable corpora, mainly because of their availability and the fact that they are easier to acquire (Sharoff et al., 2013). Comparable corpora gather texts of the same domain, often over the same period, without being in a translation relation. Journalistic or encyclopedic websites as well as web crawl data are popular sources for this purpose (Rapp, 1999; Li and Gaussier, 2010; Rapp et al., 2020). In technical and scientific domains, comparable corpora suffer from their modest and limited size compared to general language corpora. This phenomenon is amplified by the difficulty of obtaining many specialized documents in a language other than English (Morin and Hazem, 2014). This aspect, which is gradually changing with the deployment of open access data, nevertheless remains a major difficulty.
One way to overcome the limited size of specialized (in-domain) comparable corpora for bilingual lexicon induction (BLI) is to associate them with out-of-domain resources such as lexical databases (Bouamor et al., 2013) or large general domain corpora (Hazem and Morin, 2018). For instance, by combining a specialized comparable corpus with a general domain corpus, methods based on distributional analysis are boosted (Hazem and Morin, 2018). General domain corpora greatly enhance the representation of specialized vocabulary by adding new contexts. The main drawbacks of this data augmentation approach are the introduction of polysemous information and a tremendous increase in computation time. Consider, for instance, a French/English comparable corpus in the medical domain and the French terms os (bone) and sein (breast). When looking for additional contexts in general corpora, os is very likely to recover contexts dedicated to an operating system, since OS is the abbreviation used in French. Similarly, when studying the French term sein, many contexts related to the French preposition au sein de (within) will be added. In the same way, the English medical term breast can be associated with food, as in chicken breast, to mention only the most obvious ones.
Associating out-of-domain data with a specialized corpus is not the mainstream research direction in bilingual lexicon induction from specialized comparable corpora. In computational terminology it is generally accepted that the specialized corpus conveys knowledge of the domain. In the same way, polysemy is often considered as a marginal phenomenon in technical and scientific domains. In this sense, associating out-of-domain data with a specialized comparable corpus can be seen as a "heresy" or, as in the previous examples, an unfortunate way to introduce polysemy.
In the context of Statistical Machine Translation (SMT), Wang et al. (2014) demonstrated that adding out-of-domain data to the training material of a system was detrimental when translating scientific and technical domains. Instead of using a data augmentation strategy, data selection is proposed to improve the quality of SMT systems (Moore and Lewis, 2010; Axelrod et al., 2011;Wang et al., 2014). The basic idea is that in-domain training data can be enriched with suitable sub-parts of out-of-domain data.
Within this context, we apply two well-established data selection techniques often used in machine translation, namely Tf-Idf and cross entropy. We also propose for the first time to exploit BERT for data selection. We show that a careful selection of the contexts of out-of-domain data improves the representation of specialized domains, while preventing the introduction of polysemy induced by data augmentation. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best performing technique is cross entropy, which obtains a gain of about 4 MAP points while decreasing computation time by a factor of 10.

Related Work
The historical distributional approach (Fung, 1998; Rapp, 1999), known as the standard bag-of-words approach, builds a context vector for each word of the source and the target languages, translates the source context vectors into the target language using a bilingual seed lexicon, and compares the translated context vectors to each target context vector using a similarity measure. While this approach gives interesting results for large general comparable corpora (Gaussier et al., 2004; Laroche and Langlais, 2010; Vulić et al., 2011), it is rather unsuitable for small specialized comparable corpora due to the sparsity of the context vectors (Chiao and Zweigenbaum, 2002; Morin et al., 2007; Prochasson et al., 2009).
More recent distributed approaches, based on deep neural network models (Mikolov et al., 2013), have come to renew traditional ones. In these approaches, words are embedded into a low-dimensional vector space. For instance, Mikolov et al. (2013) proposed an approach to learn a linear transformation from the source language into the target language. Faruqui and Dyer (2014) used Canonical Correlation Analysis to project the embeddings of both languages into a shared vector space. Xing et al. (2015) proposed an orthogonal transformation based on normalized word vectors to resolve the inconsistency between the objective function used to learn the word vectors and the one used to learn the linear transformation matrix. More recently, Artetxe et al. (2016; 2018a) proposed and then improved an approach that generalizes previous works in order to preserve monolingual invariance using several meaningful and intuitive constraints (i.e. orthogonality, vector length normalization, mean centering, whitening, etc.). Finally, Conneau et al. (2017) and Artetxe et al. (2018b) proposed unsupervised mapping methods that obtain results close to supervised methods. A thorough comparison of cross-lingual word embeddings has been proposed by Ruder (2017).
A comparison of the approaches of Mikolov et al. (2013) and Faruqui and Dyer (2014) with the standard bag of words approach was carried out by Jakubina and Langlais (2017). The authors showed that embedding approaches perform well when the terms to be translated occur very frequently while the standard approach is slightly better when the terms are less frequent. On specialized domains, Hazem and Morin (2018) compared different approaches and observed that the one described by Bojanowski et al. (2016) (an enhanced variant of the skip-gram and C-BOW models) outperformed the others.
Recently, Peters et al. (2018) proposed a deep contextualized word representation that improves the state-of-the-art across different challenging NLP tasks. This representation is useful to model different types of syntactic and semantic information about words-in-context. It is thus possible from a general language corpus to have different embeddings for a given word, with each of these embeddings reflecting one of the word's meanings. In specialized domains, it is widely accepted that words can have only one meaning even if their contexts may convey different meanings in a general language corpus.
El Boukkouri et al. (2019) showed that combining an out-of-domain contextualized word representation (ELMo) with a small in-domain uncontextualized word representation (word2vec) does not improve the results when compared to models trained on large in-domain corpora. For this reason, we focus this study on uncontextualized word embeddings (fastText).

Data Selection Techniques
While abruptly adding out-of-domain data to a specialized corpus, with no further thought on how to do it, improves the results, it also greatly increases computation time and introduces polysemous information. To address these issues, we present in this section the different data selection techniques used in our experiments.
Tf-Idf is the term frequency-inverse document frequency score; this selection is performed at the document level. Tf-Idf weights are computed for both in-domain and out-of-domain vocabularies, then the out-of-domain documents are ranked by measuring their cosine similarity with the in-domain corpus. Given an in-domain document $D_I$ and an out-of-domain document $D_O$, the similarity is computed as follows:

$$\mathrm{sim}(D_I, D_O) = \frac{\vec{v}(D_I) \cdot \vec{v}(D_O)}{\lVert\vec{v}(D_I)\rVert \, \lVert\vec{v}(D_O)\rVert}$$

where $\vec{v}(D)$ denotes the Tf-Idf vector of document $D$.

Cross entropy is used as defined in Moore and Lewis (2010) to select out-of-domain data that is close to the in-domain corpus. Conversely to the Tf-Idf technique, the data selection is performed at the sentence level. Given the in-domain corpus $I$, a language model is first computed for the source side ($LM_{I,s}$) and the target side ($LM_{I,t}$) of the bilingual corpus. Similarly, we compute language models for the out-of-domain corpus $O$ ($LM_{O,s}$ and $LM_{O,t}$), trained on a random sample of the same size as the specialized corpus. The cross entropy is computed as follows:

$$H_{LM}(W) = -\frac{1}{|W|} \sum_{i=1}^{|W|} \log P_{LM}(w_i \mid w_1, \ldots, w_{i-1})$$

where $P_{LM}$ is the probability given by the language model $LM$ to the word sequence $W$, and $w_1, \ldots, w_{i-1}$ represents the history of the word $w_i$. Formally, $H_{LM_{I,s}}(W)$ represents the cross entropy of the sentence $W$ given the language model $LM_{I,s}$. The cross entropy is computed for every sentence of the out-of-domain corpus given the out-of-domain and in-domain language models. The source and target sentences are then scored by

$$H_{LM_{I,k}}(W) - H_{LM_{O,k}}(W), \quad k \in \{s, t\}$$

(where $k$ refers to the side of the corpus) and ranked accordingly, lower scores indicating sentences closer to the in-domain corpus. The cross entropy is computed using the XenC tool (Rousseau, 2013).
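To make the two criteria concrete, the following Python sketch illustrates them under simplifying assumptions; it is our own illustration, not the exact pipeline of this paper, which relies on n-gram language models through the XenC tool. It ranks out-of-domain documents by Tf-Idf cosine similarity and scores out-of-domain sentences with the Moore-Lewis cross-entropy difference, using a toy unigram language model as a stand-in for real n-gram models.

```python
# Illustrative sketch of the two selection schemes (not the tools used in the paper).
import math
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_documents_tfidf(in_domain_docs, out_domain_docs):
    """Rank out-of-domain documents by cosine similarity with the in-domain corpus."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(in_domain_docs + out_domain_docs)  # both vocabularies
    # Represent the whole in-domain corpus as a single Tf-Idf vector.
    in_vec = vectorizer.transform([" ".join(in_domain_docs)])
    out_vecs = vectorizer.transform(out_domain_docs)
    sims = cosine_similarity(out_vecs, in_vec).ravel()
    return sorted(zip(out_domain_docs, sims), key=lambda x: -x[1])


class UnigramLM:
    """Toy unigram LM with add-one smoothing (stand-in for a real n-gram LM)."""

    def __init__(self, sentences):
        tokens = [tok for sent in sentences for tok in sent.split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = len(self.counts) + 1

    def cross_entropy(self, sentence):
        toks = sentence.split()
        logprob = sum(math.log((self.counts[t] + 1) / (self.total + self.vocab))
                      for t in toks)
        return -logprob / max(len(toks), 1)


def moore_lewis_scores(in_domain_sents, out_domain_sents):
    """Score out-of-domain sentences by H_in(W) - H_out(W); lower is more in-domain."""
    lm_in = UnigramLM(in_domain_sents)
    # Ideally trained on a random sample of the same size as the in-domain corpus.
    lm_out = UnigramLM(out_domain_sents)
    scores = [lm_in.cross_entropy(s) - lm_out.cross_entropy(s) for s in out_domain_sents]
    return sorted(zip(out_domain_sents, scores), key=lambda x: x[1])
```

In practice, n-gram language models with proper smoothing replace the unigram stand-in, and the out-of-domain model is trained on a random sample whose size matches the specialized corpus.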
BERT is a pre-trained language model that has proven to be efficient in many downstream NLP tasks (Devlin et al., 2019). It has been trained on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) objectives (Devlin et al., 2019). Hence, it can be used either for word representation or for sentence classification. In order to perform data selection, we use BERT as a binary classifier to predict whether a given input sentence is in-domain or out-of-domain. The intuition behind this strategy is that BERT can learn shared features between in-domain sentences. For each positive (in-domain) training sentence, we randomly select a negative sentence from a general domain data set to keep a balanced training set. Class balance is often vital for classifier performance. Although some existing techniques reduce the impact of unbalanced classes, we only consider balanced training in this work.
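As an illustration of this strategy, the sketch below fine-tunes a BERT sentence classifier with the Hugging Face transformers library. The hyper-parameters, output directory, and helper names are illustrative assumptions, not the exact configuration of our experiments (the paper uses bert-base-cased for English and CamemBERT for French with default settings).

```python
# Minimal sketch: BERT as a binary in-domain / out-of-domain sentence classifier.
import random

import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)


class DomainDataset(Dataset):
    def __init__(self, sentences, labels, tokenizer):
        self.enc = tokenizer(sentences, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


def train_domain_classifier(in_domain_sents, out_domain_sents,
                            model_name="bert-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    # Balanced training set: one random negative per positive (label 1 = in-domain).
    negatives = random.sample(out_domain_sents, len(in_domain_sents))
    sents = in_domain_sents + negatives
    labels = [1] * len(in_domain_sents) + [0] * len(negatives)
    dataset = DomainDataset(sents, labels, tokenizer)
    args = TrainingArguments(output_dir="domain-clf", num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return tokenizer, model


def in_domain_probability(sentences, tokenizer, model):
    """Probability that each sentence belongs to the specialized domain."""
    model = model.cpu().eval()
    with torch.no_grad():
        enc = tokenizer(sentences, truncation=True, padding=True,
                        max_length=128, return_tensors="pt")
        probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]
    return probs.tolist()
```

Out-of-domain sentences can then be ranked by their predicted in-domain probability and added to the training material by decreasing score.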
Random data selection is also applied at the sentence level: we randomly shuffle the sentences of the out-of-domain corpora.

Experimental Protocol
In this section, we present the data used in our experiments, namely the comparable corpora, the seed lexicon and the reference lists, as well as the methodology applied.

Data
We conducted our experiments on two French/English specialized data sets: breast cancer and wind energy. The Breast Cancer corpus (BC) is composed of scientific documents published between 2001 and 2015 available in open access on the ScienceDirect portal 1 where the title or the keywords contain the term breast cancer in English and its translation in French. The Wind Energy corpus (WE) has been released within the TTC project 2 and is composed of documents harvested from the Web using a focused crawler based on several keywords such as wind, energy, rotor in English and their translation in French.
We considered two separate out-of-domain data sets in English and French: i) the JRC-Acquis corpus (JRC), composed of legislative texts of the European Union 3 (we used the French/English version from OPUS, which is based on the paragraph-aligned corpus provided by JRC (Tiedemann, 2012)), and ii) a fraction of the Wikipedia corpus (WIKI) 4 . The corpora used differ in size and quality. On the one hand, for the specialized corpora, BC is of better quality than WE, due to their construction methods (crawled from a scientific portal vs. crawled from the web). On the other hand, for the general corpora, WIKI is considerably larger and more diverse in terms of topics than JRC. The bilingual terminology reference lists required to evaluate the quality of bilingual terminology induction from comparable corpora are the same as those used in (Hazem and Morin, 2016); the list for the breast cancer domain was derived from the UMLS meta-thesaurus, and the list for the wind energy domain is provided with the corpora (see footnote 2). Each word in the reference lists appears at least 5 times in the specialized corpus. The reference list for breast cancer is composed of 248 French/English single-word pairs, and the one for wind energy of 145 single-word pairs. We are aware that the reference lists are small but, considering that we work on specialized domains, they remain representative of their vocabulary, and obtaining larger lists is difficult since the specialized vocabulary itself is small.
For the seed lexicon, we employed the ELRA-M0033 dictionary 5 which contains 243,539 pairs of French/English general terms, from which we remove pairs containing words from our reference lists.

Methodology
Hereafter, we specify the methodology of our experiments:
• For a given specialized corpus, we first use our data selection methods to find the most similar sentences (or documents for the Tf-Idf approach) in our out-of-domain corpora, for both source and target languages. For fine-tuning BERT, we used bert-base-cased for English (Devlin et al., 2019) and CamemBERT (Martin et al., 2020) for French, with their default parameters. To build the training data set, we used the news commentary 6 corpus as the out-of-domain data set.
• Then, we create our training data for both languages by concatenating our full specialized corpus with sub-parts of our general corpus. We build several training data sets by incrementally adding samples of 10%, from the most similar to the least similar data (but also from the least similar to the most similar to study the impact of data selection).
• To project our word embeddings of source and target languages in the same space, we use the VecMap tool (Artetxe et al., 2018a).
• Finally, for a word to translate, we measure its similarity with every word of the target language. The target candidates are ranked using Cross-domain Similarity Local Scaling (CSLS), an improvement over cosine similarity that takes nearest neighbors into account (Conneau et al., 2017), as sketched below.
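The following NumPy sketch illustrates CSLS ranking for a single source word. It assumes that source and target embeddings have already been mapped into a shared space (e.g. with VecMap) and L2-normalized; the function name and the value k=10 are our own illustrative choices.

```python
import numpy as np


def csls_rank(src_vec, tgt_matrix, src_matrix, k=10):
    """Rank target words for one source word with CSLS (Conneau et al., 2017).

    src_vec:    (d,)   embedding of the source word to translate
    tgt_matrix: (n, d) all target-language embeddings
    src_matrix: (m, d) all source-language embeddings
    All vectors are assumed unit-normalized, so dot products are cosines.
    """
    # Cosine similarity between the source word and every target word.
    sims = tgt_matrix @ src_vec                                  # (n,)
    # r_T(x): mean similarity of the source word to its k nearest target neighbors.
    r_src = np.mean(np.sort(sims)[-k:])
    # r_S(y): mean similarity of each target word to its k nearest source neighbors.
    tgt_to_src = tgt_matrix @ src_matrix.T                       # (n, m)
    r_tgt = np.mean(np.sort(tgt_to_src, axis=1)[:, -k:], axis=1)
    # CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y)
    csls = 2 * sims - r_src - r_tgt
    return np.argsort(-csls)  # indices of target words, best candidate first
```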
The results are measured in terms of Mean Average Precision (MAP) (Manning et al., 2008):

$$MAP = \frac{1}{|Ref|} \sum_{i=1}^{|Ref|} \frac{1}{r_i}$$

where $|Ref|$ is the number of terms of the reference list and $r_i$ the rank of the correct translation $i$.
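As a sanity check on the formula above, a minimal helper (our own illustration, assuming each reference term has a single correct translation):

```python
def mean_average_precision(ranks):
    """MAP over the reference list, given the 1-based rank of each correct translation.

    Use a very large rank (or float('inf')) when the correct translation is not
    retrieved at all, so that its contribution tends to zero.
    """
    return sum(1.0 / r for r in ranks) / len(ranks)


# Example: three reference terms whose correct translations are ranked 1, 2 and 5.
print(mean_average_precision([1, 2, 5]))  # (1 + 0.5 + 0.2) / 3 = 0.5667
```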

Results
In this section we present the results obtained for BLI on our corpora as well as the differences in computation time for each method.

Figure 1 presents the results obtained with the four data selection methods (CrossEntropy, Tf-Idf, BERT and Random) for our two specialized corpora and the different combinations with data from the two general corpora. The curves labeled + mean that we take the documents/sentences from the most similar to the least similar, while the curves labeled − indicate that we select the documents/sentences from the least similar to the most similar. This distinction allows us to see the real impact of data selection. The first point of all the curves (x axis = 0%) corresponds to the results obtained with the specialized corpus only, and each following point corresponds to a combination of the specialized corpus with a different percentage of data selected from the general corpus, up to the full corpus. We first note that for every corpus and configuration, adding general data improves the results. If we look at the trends of the CrossEntropy+ curves, we see a great improvement at 10%, followed by a slower improvement for JRC and a slow degradation for WIKI. These results are particularly interesting because they show the usefulness of a good data selection, but they also show the need for a corpus that is large enough to have a real impact on the results. WIKI being larger than JRC, it offers more varied contexts, and 10% of it corresponds to 30M words whereas it is only 6M words for JRC. Conversely, the − curves present the opposite trend and show slow improvements; their best results are obtained with the full corpora. The + curves for Tf-Idf and BERT are not as interesting as CrossEntropy+ (with the exception of BC+WIKI for BERT), as there is no great improvement followed by a slow degradation. Also, we do not observe many differences between the − and Random curves.

Table 2 presents some major points of our experiments. First, we present the results on the general and specialized corpora only. Then we show three results for each combination of specialized and general corpora. The first result (column 100%) is the concatenation of the specialized corpora with the full general corpora, and illustrates a strategy of data augmentation. The two other columns illustrate the optimal percentage for the CrossEntropy data selection approach. The + (resp. −) columns represent the most (resp. least) similar documents of the general corpora.

BLI Evaluation
We clearly see improvements for both specialized corpora with the WIKI corpus when going from full data augmentation (100%) to data selection (10%). In terms of MAP, this yields a gain of 3.7 points for BC and 5 points for WE. However, for the smaller JRC corpus, the gain is less important: while we still obtain an interesting MAP improvement of 2.2% and 3.2% for BC and WE respectively, much more data is needed to be effective (70%). These observations confirm the usefulness of data selection, provided that the corpora used are large enough and varied in content.
Based on the results of CrossEntropy+ at 10% of data for WIKI, we conducted one more experiment on smaller percentages to gain better insight into the effect of smaller data selections. This experiment is illustrated in Figure 2, where the evolution of data quantity is represented in terms of number of words, to allow the comparison with the JRC corpus (3M words represent 1% of WIKI and 5% of JRC). Overall, we see that the two curves keep the same trend, with better MAP scores for WIKI. Interestingly, we observe that the selection of only 3M words from WIKI (1% of the corpus) allows us to reach a MAP score of 82.3%, which is almost the same as with the whole corpus (83.9%). This result is nevertheless surpassed when selecting 6M words (2% of the corpus), reaching a MAP score of 87.2%. The best results for WIKI are achieved at 8 and 9% with a MAP score of 88.9%. For JRC, the improvement with 3M words (5% of the corpus) remains interesting (75.7%) but does not surpass the previous best results (83.2%). These observations show the necessity of using a corpus covering many diverse topics to obtain the best results.

Computation Time
In this section, we report the computation time for all the configurations. All our experiments were carried out with an Intel Core i9-9900K and a GeForce RTX 2080. Table 3 shows the computation time needed to perform data selection and to train the word embeddings for BC and WIKI. The embeddings training time corresponds to the computation time of the source and target embeddings. We see that using the in-domain corpus only (No Selection) is very fast (180s) but does not yield good results (MAP of 50.6%), while a data augmentation approach improves the results (83.9%) but also drastically increases the computation time (65,521s). Finally, with a good data selection approach (CrossEntropy), the increase in computation time is smaller (6,972s) and the MAP score even improves (87.6%). The BERT and Tf-Idf selections also give better results than the data augmentation approach, but no real improvement in terms of computation time is observed and the MAP increase is less interesting than for CrossEntropy. The optimal selection corresponds to 10% of the data, as seen in the previous section.
The results shown in Section 5 establish a relationship between data selection and the quality of the bilingual lexicons. More precisely, they show that we can improve the results while reducing the computation time in comparison with data augmentation. The results obtained when contrasting one document-level selection (Tf-Idf+) and two sentence-level selections (CrossEntropy+ and BERT+) confirm the usefulness of data selection and, more importantly, stress the fact that a small amount of data can be sufficient for that purpose. While the main issue is to better characterize specialized terms while preserving the domain characteristics, we assume that the representations of common (i.e. general-domain) words present in the general corpus are also enriched. In order to measure the impact of the augmentation on BLI, we selected several translation pairs that were affected either positively or negatively by the data increase and discuss each case. Table 4 shows, for each amount of selected data, the rank of a given translation term as well as the number of occurrences of the translation pair after data augmentation for BC. To each rank we assign a color according to the impact of the selection on the results: green means that the rank obtained after adding n% of data equaled or improved the translation rank, while red means that the rank was degraded. This table presents the results for the CrossEntropy+ lists, for which the optimal results were obtained at 10% of data selected.
First, we see that the term breast is almost correctly translated to sein (Rank = 2) when no data is added. However, with 100% of the general corpus, the rank is hugely degraded (Rank ≥ 1000). A closer look at the added contexts revealed that the majority of documents contained the French expression au sein de (appearing 47,025 times), which can be translated as within or at the heart of. This pair is one of the few that does not improve with the addition of out-of-domain data, but we can see that taking only 10% (Rank = 3) leads to a significant improvement compared to the full general corpus (Rank ≥ 1000).
Conversely, calcium shows the benefits of adding correct contexts on both sides. While its correct translation calcium was ranked 140 when only the specialized corpus was used, the rank immediately improves to 1 with the addition of out-of-domain data. It is also important to note that the source and the target words remain close in terms of frequency. This pair also shows that the selection succeeded in finding most of the interesting occurrences of the words within the first 10%.
The pair back-dos shows the same trend as breast-sein, mainly due to the frequent use of back in English. However, adding out-of-domain data did not degrade the rank; on the contrary, it had quite a positive impact, mostly because there were initially few occurrences of dos in BC.
Finally, our last pair shows that the addition of out-of-domain data remains interesting even when the word pair is almost never found in the general corpora. Here, with only one new occurrence and the improved representation of the other words of the specialized vocabulary, rank 1 could be reached.
These examples reflect the benefits of a data selection approach, which helps enrich the specialized vocabulary as a whole without degrading the representation of the terms of the reference list.
In what follows, we briefly illustrate the type of contexts that have been selected. Table 5 presents examples of sentences selected by CrossEntropy on WIKI for the specialized corpus BC. It is interesting to see that the best selected sentences (CrossEntropy+) do not look like real sentences, but are closer to phrases related to the domain (tumor, gene, operate, needle...), which shows that CrossEntropy+ selected contexts related to the specialized vocabulary. We nevertheless find real sentences among the highest ranked (see line 115), but CrossEntropy seems to favor phrases. As intended, for the least interesting sentences (CrossEntropy−), we cannot find any relation between the breast cancer domain and the proposed sentences.

Finally, Table 6 illustrates the evolution of the candidate translations over three stages of data selection. The words in bold are the correct translations. The first pair pressure-pression illustrates the problem of having only a small specialized corpus, where the words do not have enough context to be precisely represented. For the second pair breast-sein, both words already have many occurrences, so we do not face this problem of poor word representation, even if some candidates should not be there (année, der...). The data augmentation column (100%) shows the introduction of polysemy: sein no longer appears as a correct translation because of the French preposition au sein de, although related terms such as poitrine remain. This problem is mostly resolved in the data selection column (10%), where sein is still among the top candidates (rank 3).

Conclusion
In order to improve BLI from specialized comparable corpora, we have studied in this paper the impact of common data selection techniques such as CrossEntropy and Tf-Idf, and also applied BERT for the first time as a data selection technique. The experiments on two specialized comparable corpora and two general corpora revealed that selecting a small and adapted amount of out-of-domain data is sufficient to obtain better results (MAP of 87.6% for BC and 80.9% for WE) than using the full amount of data (resp. 83.9% and 75.9%), while reducing computation time by a factor of 10. The data selection strategy also reduces the impact of polysemy compared to data augmentation. This strategy shows that bringing new occurrences for both the word to translate and the general vocabulary improves their mutual discrimination. An interesting perspective would be to select, for each word to translate, an adequate amount of out-of-domain data instead of a common selection for all the terms to translate.