Textual Representations for Crosslingual Information Retrieval

In this paper, we explore different levels of textual representation for cross-lingual information retrieval. Beyond the traditional token-level representation, we adopt the subword- and character-level representations that have been shown to improve neural machine translation by reducing out-of-vocabulary issues. We find that cross-lingual information retrieval performance can be improved by combining search results from the subword- and token-level representations. Additionally, we improve search performance by combining and re-ranking the result sets from the different text representations for German, French and Japanese.


Introduction
Cross-lingual information retrieval (CLIR) systems commonly use machine translation (MT) systems to translate the user query to the language of the search index before retrieving the search results (Fujii and Ishikawa, 2000; Pecina et al., 2014; Saleh and Pecina, 2020; Bi et al., 2020).
Traditionally, information retrieval and machine translation systems convert search queries into token- and n-gram-level textual representations (Jiang and Zhai, 2007; McNamee and Mayfield, 2004; Leveling and Jones, 2010; Yarmohammadi et al., 2019). Modern neural machine translation (NMT) systems have shown that subword and character representations with flexible vocabularies outperform fixed-vocabulary token-level translations (Sennrich et al., 2016; Lee et al., 2017; Kudo and Richardson, 2018). This study explores the shared granularity of textual representations between machine translation and cross-lingual information retrieval.
Textual representations of varying granularity encode queries differently, resulting in more diverse and robust search retrieval. Potentially, subwords and character-level representations are less sensitive to irregularities in noisy user-generated queries, e.g. misspellings and dialectal variants.

Subwords: am er ic ium ist ein chemische s element ...
Characters: a m e r i c i u m i s t e i n c h e m i s c h e s e l e m e n t

Related Work
Neural machine translation has been shown to significantly outperform the older paradigm of statistical machine translation, and has even "achieved human parity in specific machine translation tasks" (Hassan et al., 2018; Läubli et al., 2018; Toral, 2020). Moving from a fixed token-level vocabulary to a subword representation unlocks open-vocabulary capabilities that minimize out-of-vocabulary (OOV) issues.1 Byte-Pair Encoding (BPE) is a popular subword algorithm that splits tokens into smaller units (Sennrich et al., 2016). It is based on the intuition that smaller units of character sequences can be translated more easily across languages.
For instance, subword units can better handle compound words via compositional German-to-English translations, e.g. schokolade → chocolate and schoko-creme → chocolate cream. Subwords can also cope with translations where part of the source token can simply be copied or translated, and with cognates and loanwords translated via phonological or morphological transformations, e.g. positiv → positive and negativ → negative.
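As a concrete illustration, the core BPE merge procedure can be sketched in a few lines of Python. This is a toy version of the algorithm described by Sennrich et al. (2016), not their released implementation, and the example vocabulary below is invented:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of `pair` into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily apply the most frequent pair merge `num_merges` times."""
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
    return vocab

# Toy vocabulary: words are space-separated character sequences with counts.
vocab = {'l o w': 5, 'l o w e r': 2}
print(learn_bpe(vocab, 2))  # after two merges, 'low' becomes a single unit
```

Frequent character sequences thus become single vocabulary entries, while rare words remain decomposed into smaller, translatable pieces.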
While BPE reduces the OOV instances, it requires the input to be pre-tokenized before applying the subword compression. Alternatively, Kudo and Richardson (2018) proposed a more language-agnostic approach to subword tokenization directly from raw string inputs using unigram language models.
Completing the whole gamut of granular text representations, Lee et al. (2017) explored character-level neural machine translation that does not require any form of pre-processing or subword or token-level tokenization. They found that multilingual many-to-one character-level NMT models are more efficient and can be as competitive as, or sometimes better than, subword NMT models. Moreover, character-level NMT can naturally handle intra-sentence code-switching; in the context of CLIR, such models would be able to handle mixed-language queries. Follow-up work found that a byte-level BPE vocabulary can be 1/8 the size of a full subword BPE vocabulary, and that a multilingual (many-to-one) NMT setting achieves the best translation quality, outperforming both subword and character-level models.
While finer granularities of text representation have been exploited for machine translation, to the best of our knowledge, information retrieval studies have yet to examine the impact of these subword representations on traditional information retrieval systems (Robertson, 2004; Robertson and Zaragoza, 2009; Aly et al., 2014). Instead, many previous works have leapfrogged to fully neural information retrieval systems that represent text with various underlying subword representations and dense neural text representations.
Often, these neural representations are available in multilingual settings in which the same neural language model can encode texts in multiple languages. Jiang et al. (2020) explored using the popular multilingual Bidirectional Encoder Representations from Transformers (BERT) model to learn the relevance between English queries and foreign language documents in a CLIR setup. They showed that the model outperforms competitive non-neural traditional IR systems on a few of the sub-tasks.
Alternatively, previous research has also used a cascading approach to machine translation and traditional IR, where (i) the documents are translated to the foreign languages with neural machine translation and/or (ii) the foreign queries are translated before retrieval from the source document index (Saleh and Pecina, 2020; Oard, 1998; McCarley, 1999). Saleh and Pecina (2020) compared the effects of statistical machine translation (SMT) and NMT in a cascaded traditional CLIR setting. They found that the higher-quality translations from NMT outperform SMT, and that translating queries into the source document language achieves better IR results than running foreign-language queries against an index of translated documents.
Although fully neural IR systems are changing the paradigm of information retrieval, traditional IR (e.g. TF-IDF or BM25) approaches remain very competitive and can still outperform neural IR systems for some tasks (Boytsov, 2020;Jiang et al., 2020). In this regard, we follow up on the cascading approach to machine translation and information retrieval on traditional IR systems. This study fills the knowledge gap of understanding the effects of subword representation in traditional IR indices.

Experiments
We report experiments on different textual representations in traditional IR in a cross-lingual setting, using a large-scale dataset derived from Wikipedia (Sasaki et al., 2018). Sasaki et al. (2018) focused their work on a supervised re-ranking task using relevance annotations; we use those annotations from the same Wikipedia dataset to perform the typical retrieval task. The dataset was designed so that English queries are expected to retrieve Wikipedia documents in the foreign languages, and the foreign documents with the highest relevance are annotated with three levels of relevance. Formally, the ground-truth data is a set of tuples (q, d, r), where q is an English query, d a foreign document, and r ∈ {0, 1, 2} a relevance judgement. We note that the Wikipedia documents in the dataset are not parallel (i.e. not translations of each other) but comparable in nature, depending on the varying amounts of contributions available in the official Wikipedia dumps across languages. For our study, we use the German, French and Japanese document collections and report the retrieval performance of English queries translated into these languages.3

The Wikipedia corpus came pre-tokenized, so we had to detokenize the documents4 (Tan, 2018) before putting them through the subword tokenizer. We used the pre-trained SentencePiece subword tokenizers of the OPUS machine translation models (Tiedemann and Thottingal, 2020).5 Additionally, we emulated the typical pre-processing steps for character-level machine translation and split all individual characters by space, replacing the whitespaces with an underscore character.

Table 2 shows the corpus statistics of the number of documents, tokens, subwords and characters for the respective languages. Although the Latin-alphabet languages benefit from the extra information produced by splitting tokens into subwords, Japanese presents the opposite condition: it became more compact when represented by subwords in place of tokens. The examples in Table 1 show an instance of a sentence pre-processed at the different levels of granularity. The underscore in the subword sequence represents a symbolic space and is usually attached to the following subword unit, whereas the whitespace represents the boundary between subword units.
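The character-level pre-processing described above can be sketched as follows. This is a minimal illustration of the described steps, not the exact pre-processing script we used:

```python
def to_char_level(text):
    """Emulate character-level MT pre-processing: replace whitespace with
    an underscore, then separate every character with a space."""
    return ' '.join(text.replace(' ', '_'))

print(to_char_level('ist ein'))  # i s t _ e i n
```

The underscore preserves the original word boundaries so that the token sequence can be reconstructed from the character sequence.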
The English queries were translated using the same OPUS machine translation models.6 Although these machine translation models are open source and free to use under a permissive CC-BY license, it takes a significant amount of GPU computation and major changes to the HuggingFace API (Wolf et al., 2020) to translate the query samples efficiently with parallelized inference. We will release the modified code for parallel GPU inference, along with the translation outputs for the data used in this experiment, to improve the replicability of this paper.

Information Retrieval System
We use the Okapi BM25 implementation in PyLucene as the retrieval framework with the hyperparameter setting (k1 = 1.2, b = 0.75) (Manning et al., 2008). We consider the top 100 documents (k = 100) in the search ranking as the search results for each query.
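For reference, the BM25 scoring function with these hyperparameters can be sketched in plain Python. This is an illustrative re-implementation using a Lucene-style IDF, not the PyLucene code we ran:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    query_terms: list of query terms; doc_terms: list of terms in the
    scored document; corpus: list of documents (each a list of terms).
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # document frequency
        if df == 0:
            continue
        # Lucene-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[t]
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / (f + norm)
    return score
```

The parameter k1 controls term-frequency saturation and b controls document-length normalization; the values above match the defaults we used.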

Building index for the documents
For each foreign language, we created an index for the documents with five TextFields as follows:

Querying the document index
During retrieval, each translated query is first processed into its respective text representations (tokens, subwords or characters) and parsed using Lucene's built-in query parser and analyzer. Additionally, we tried to improve the search results by combining and re-ranking the result sets from the different text representations.

Search result expansion
Our intuition is that queries of more granular text representations can improve the robustness of retrieval and potentially overcome textual noise (e.g., misspellings are handled better for some languages). Hence, we attempt to expand the list of candidate documents by combining the search results from the token and subword representations. Given a query q and its token q_token and subword q_subword representations, we obtain two sets of search results, R_tokens and R_subword, from their respective indices. We concatenate R_tokens and R_subword and remove from R_subword the repeated candidates that appear in both sets, as illustrated in Figure 1.
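The expansion step above can be sketched as follows, treating each result set as a ranked list of document ids (an illustrative sketch, not our retrieval code):

```python
def expand_results(r_tokens, r_subword):
    """Concatenate the token-based and subword-based result lists,
    dropping subword hits that already appear in the token results."""
    seen = set(r_tokens)
    return r_tokens + [doc for doc in r_subword if doc not in seen]

# Token results keep their original ranks; novel subword hits are appended.
print(expand_results(['d1', 'd2'], ['d2', 'd3']))  # ['d1', 'd2', 'd3']
```

Token-based results thus keep their original ordering, while the subword index contributes additional candidates at the tail of the ranking.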

Search result re-ranking
Aside from expanding the search results, we tried a re-ranking technique. We presume that if different representations retrieve the same document for a single query, it is more relevant than documents that appear in only one representation's results. Thus, we boost the rank of the documents (D_shared) that are retrieved in both R_tokens and R_subword for the same query: for d ∈ D_shared, rank_new(d) = rank_original(d) − 2. After boosting these documents, we re-rank the token-based search result, as illustrated in Figure 2, to get the final search result R.
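The re-ranking step can be sketched as follows. This is an illustrative sketch; ties between boosted and unboosted documents are broken by the original order, which is one possible reading of the description above:

```python
def rerank(r_tokens, r_subword, boost=2):
    """Boost documents that appear in both result lists by `boost`
    positions within the token-based ranking, then re-sort."""
    shared = set(r_tokens) & set(r_subword)
    scored = []
    for rank, doc in enumerate(r_tokens):  # 0-based original ranks
        new_rank = rank - boost if doc in shared else rank
        # keep the original rank as a tie-breaker
        scored.append((new_rank, rank, doc))
    scored.sort()
    return [doc for _, _, doc in scored]

# 'c' appears in both lists, so it is promoted two positions.
print(rerank(['a', 'b', 'c', 'd'], ['c', 'x']))  # ['a', 'c', 'b', 'd']
```

Only documents already retrieved by the token index can be promoted; the subword results serve purely as a relevance signal here.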

Evaluation Metrics
We choose the following ranking metrics to evaluate the retrieval performance of the different text representations of the translated queries: Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) and normalized Discounted Cumulative Gain (nDCG).

• MRR measures the rank of the first document that is relevant to a given query in the search result.
• MAP evaluates the rankings of the top 100 documents retrieved for a given query.
• nDCG calibrates the ranking and relevance score of all the documents that are relevant to a given query in the search result. We compute nDCG@16 for the top-16 search results.


Results

Tables 3, 4 and 5 show the results of the CLIR experiments on the translated English queries and the German, French and Japanese documents under the different textual representations. For all the German and French setups, the token-level representation achieved the best MAP, MRR and nDCG scores, followed by subwords at significantly lower performance. The character-level representation performs the worst, with scores roughly 10^4 times lower than the token-level results. We expected a margin between the token- and subword-level performance, but the stark difference was surprising. Although machine translation can exploit the sequential nature of an open vocabulary with the subword representation, traditional information retrieval benefits far less from these finer-grained representations. For Japanese, however, we see that the subword representation performs very similarly to its token counterpart.

For the German and French documents, the poor performance of the character-level representation can be attributed to the meaningless and arbitrary nature of an unordered bag of characters. In Japanese, with its mix of syllabic and logographic orthography, individual characters can potentially encode crucial semantic information.
We can see that both the search result expansion and re-ranking techniques can improve the final search results for some languages. Tables 3, 4 and 5 show that the search result expansion technique improves MRR for all three languages compared with the token-based retrieval baseline, and it improves both MRR and MAP for Japanese. The re-ranking technique achieves the highest MRR for both German and Japanese. The improvement in MRR indicates that these two techniques can improve the ranking of the first relevant document appearing in the search results, which can be beneficial for cross-lingual e-commerce search systems. Neither the expansion nor the re-ranking technique achieves a better nDCG score, which is consistent with our expectation: they improve the accuracy and robustness of retrieval while making minimal changes to the relevance scores that nDCG depends on.
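For clarity, the MRR and nDCG@k metrics used in our evaluation can be computed as in this minimal sketch (an illustration of the standard definitions, not the actual evaluation code):

```python
import math

def mrr(results, relevant):
    """Mean Reciprocal Rank over queries.

    results: one ranked list of doc ids per query;
    relevant: one set of relevant doc ids per query.
    """
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i  # reciprocal rank of first relevant hit
                break
    return total / len(results)

def ndcg_at_k(ranked, rel_grades, k=16):
    """nDCG@k with graded relevance (rel_grades maps doc id -> grade)."""
    dcg = sum(rel_grades.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(rel_grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

With the three-level relevance judgements r ∈ {0, 1, 2} from the dataset, the grades feed directly into `rel_grades`, and k = 16 matches the nDCG@16 we report.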

Conclusion
We explored different granularities of textual representation in a traditional IR system within the CLIR task by re-using the subword representations from neural machine translation systems. Our experiments provide empirical evidence for the underwhelming impact of subwords in traditional IR systems for Latin-alphabet languages, in contrast to the advancements that subword representations have brought to machine translation.7 In some scenarios, it is possible to achieve better CLIR performance by combining and expanding the retrieval results of the token and subword representations.
We conducted the experiments in this study using well-formed queries and documents. Our intuition is that a combination of the different textual representations can improve the robustness of indexing and retrieval systems in realistic situations with noisier data (e.g. query misspellings or translation errors). For future work, we want to explore similar experiments with noisy e-commerce search datasets.8