IR like a SIR: Sense-enhanced Information Retrieval for Multiple Languages

With the advent of contextualized embeddings, attention towards neural ranking approaches for Information Retrieval increased considerably. However, two aspects have remained largely neglected: i) queries usually consist of few keywords only, which increases ambiguity and makes their contextualization harder, and ii) performing neural ranking on non-English documents is still cumbersome due to shortage of labeled datasets. In this paper we present SIR (Sense-enhanced Information Retrieval) to mitigate both problems by leveraging word sense information. At the core of our approach lies a novel multilingual query expansion mechanism based on Word Sense Disambiguation that provides sense definitions as additional semantic information for the query. Importantly, we use senses as a bridge across languages, thus allowing our model to perform considerably better than its supervised and unsupervised alternatives across French, German, Italian and Spanish languages on several CLEF benchmarks, while being trained on English Robust04 data only. We release SIR at https://github.com/SapienzaNLP/sir.


Introduction
Information Retrieval (IR) is the task of retrieving, from a large collection of unstructured information (generally textual documents), those items deemed relevant to users' needs, which are expressed by a query (typically a few keywords).
IR systems have become an integral part of our daily lives, as Web search engines testify, by allowing us to address distinct search tasks. Relevance is the key notion in IR: indeed, the core component of an IR system is the ranking module, which estimates the relevance of a document to a given query. This is achieved through a ranking function that complies with an underlying formal model such as the Vector Space Model, probabilistic models and, more recently, neural models (Guo et al., 2020). Lately, IR systems have begun taking advantage of these latter models, whose aim is to learn continuous representations capable of grasping the semantics of the text, as opposed to traditional lexical approaches based on the bag-of-words representation. In this new line of research, following the success of neural models in several Natural Language Processing (NLP) tasks, researchers have employed contextualized word representations (Devlin et al., 2019; Conneau et al., 2020) in IR to capture semantic aspects of texts (for both queries and documents), which prove beneficial to ranking approaches (MacAvaney et al., 2019, 2020b). Moreover, thanks to the unsupervised training strategy of contextualized language models, i.e., Masked Language Modeling, it is feasible to train multilingual models which are able to encode sentences across languages within the same semantic space. Nonetheless, there are challenges peculiar to IR that may hinder the effectiveness of contextualized embeddings. For example, queries are typically composed of just a few keywords, which may not be sufficient to assess the relevance of documents to a query effectively. In classical IR, the technique of query expansion is employed to provide more context about users' actual needs (Rocchio, 1971), by exploiting synonymous terms to overcome the vocabulary mismatch problem. However, this is not suitable for neural language models, which are trained to process well-formed sentences.
This issue is even more pronounced when dealing with languages other than English, where the lack of training data hinders the use of machine learning in the multilingual setting.
Recently, Word Sense Disambiguation (WSD) has received considerably more attention (Bevilacqua et al., 2021), with large improvements reported not only in English (Lacerra et al., 2020; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020; Barba et al., 2021a), but also across other languages (Scarlini et al., 2020; Procopio et al., 2021). We argue that word senses, thanks to their glosses, i.e., sentences defining word meanings, can provide valuable information to enrich the input query and to aid the retrieval of relevant documents that are semantically related. Moreover, multilingual sense vocabularies (where concepts are lexicalized with synonymous words in different languages) may provide a bridge across languages, leading neural models to perform better in a zero-shot setting.
Based on these hypotheses, this paper makes the following contributions: 1. we introduce, for the first time, a neural approach to augment the input query with sentences defining the meanings of the words therein, 2. we present SIR, a supervised neural architecture leveraging additional semantic information for the monolingual ad-hoc Information Retrieval task, and 3. we perform an extensive evaluation in English and across several test collections on French, German, Italian and Spanish in a zero-shot setting.
Our findings show that word definitions are indeed beneficial to the task, allowing SIR to better contextualize queries and thus match more relevant documents than all its baselines.

Related Work
Information Retrieval approaches have long relied on simple statistical metrics based on term frequency, such as TF-IDF and BM25 (Robertson et al., 1996), to represent texts and to match documents against a given query. These methods are still used as strong baselines nowadays (Lin, 2019), especially because they perform retrieval in an unsupervised and efficient way. In the last decade, two different kinds of neural approaches to IR have been defined (Mitra and Craswell, 2018): the first aims at encoding queries and documents within the same vector space (Huang et al., 2013); the second, instead, focuses on learning an estimator for the relevance of a document with respect to a query (Guo et al., 2016). More recently, with the advent of transformer-based language models such as BERT (Devlin et al., 2019), contextualized representations were rapidly incorporated into retrieval models (MacAvaney et al., 2019), which previously had relied on static embeddings only, mainly by pairing contextualized models with a binary classifier to compute a score per query-document (MacAvaney et al., 2019) or query-sentence pair (Akkalyoncu Yilmaz et al., 2019; Dai and Callan, 2019). Nevertheless, most supervised works have focused on the English retrieval task, where enough labeled data are available to train a neural model. Instead, while datasets in languages other than English do exist in several tracks of TREC (Braschler et al., 2000; Oard and Gey, 2002) or CLEF (Braschler, 2003), they are rather small and not suitable for training deep neural networks. In this setting, multilingual pretrained language models have emerged as an effective solution, and have proved able to leverage annotations in one language (typically English) and perform retrieval in other languages, e.g., Arabic, Mandarin, and Spanish (MacAvaney et al., 2020b) or Chinese, Arabic, French, Hindi and Bengali (Shi et al., 2020).
However, by relying on large pretrained language models, these approaches assume that queries are expressive enough to model their underlying semantics, which is not always the case. This is a long-standing issue in IR, and one which has stimulated extensive research for years. Different approaches to query expansion such as Markov chains (Metzler and Croft, 2007), term classification (Cao et al., 2008), and static word embeddings (Diaz et al., 2016; Zamani and Croft, 2016) have been applied effectively to improve query representation. More recently, researchers have tried to tackle the problem from the opposite perspective by expanding documents (Raffel et al., 2020).
Word Sense Disambiguation (WSD) is specifically tailored to resolve this issue, and several attempts were made in the past to include word senses within IR pipelines. These early attempts, unfortunately, did not produce encouraging results (Krovetz and Croft, 1992; Voorhees, 1993; Sanderson, 1994). Indeed, Sanderson (2000) emphasised that the effectiveness of WSD integration was diminished by the inaccuracies in disambiguation. A little over a decade later, instead, Zhong and Ng (2012) presented a successful application of WSD in IR by incorporating word senses and synonym relations into a language modeling approach. In addition, further developments over the years led to the remarkable performance attained by modern WSD models, which now perform close to the inter-annotator agreement upper bound (Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020; Barba et al., 2021a,b). This makes us optimistic that these models are finally suitable to be used within downstream tasks.
Differently from previous works, in this paper we explore this possibility and focus on enriching the query context by devising a neural approach that first retrieves word senses for the input query terms, and then encodes their definitions together with the query and documents to perform end-to-end document ranking. To the best of our knowledge, this is the first time a Word Sense Disambiguation approach has been employed to expand the query with sense definitions, and we show that this is beneficial not only in the monolingual setting but also in cross-lingual zero-shot settings.

Preliminaries
In this Section, we describe the task we are tackling and the resources we exploit.

Task
We focus on the task of ranking documents given a query, i.e., a topic composed of a title and a description. More formally, let $Q_{title} = [t_1, \ldots, t_n]$ be the sequence of $n$ terms of the topic title, $Q_{desc} = [d_1, \ldots, d_l]$ the sequence of $l$ words describing the topic, and $C$ a collection of documents. The retrieval task we focus on consists of learning a scoring function $S_\theta(Q, D)$, $\forall D \in C$, to rank documents in the collection according to their relevance to the query $Q$, where $Q = Q_{title} \,\|\, Q_{desc} = [q_1, \ldots, q_{n+l}]$, i.e., the concatenation of $Q_{title}$ and $Q_{desc}$, and $\theta$ denotes the model parameters.
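The task definition above can be illustrated with a minimal sketch. The function names and the toy overlap scorer are hypothetical stand-ins introduced only for illustration; in particular, `overlap_score` is not the learned $S_\theta$, it merely plays its role so that the ranking step can be run.

```python
from typing import Callable, Dict, List, Tuple

Tokens = List[str]


def overlap_score(query: Tokens, document: Tokens) -> float:
    """Toy stand-in for S_theta: fraction of query tokens found in the document."""
    doc_terms = set(document)
    return sum(token in doc_terms for token in query) / len(query)


def rank_documents(
    q_title: Tokens,
    q_desc: Tokens,
    collection: Dict[str, Tokens],
    score: Callable[[Tokens, Tokens], float],
) -> List[Tuple[str, float]]:
    """Score every document in C against Q = Q_title || Q_desc and sort by score."""
    query = q_title + q_desc  # concatenation of title and description
    scored = [(doc_id, score(query, doc)) for doc_id, doc in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Any scorer with the same signature, including a neural one, could be passed as `score` without changing the ranking step.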

Resources
In our approach we make use of BabelNet (Navigli and Ponzetto, 2010; Navigli et al., 2021) as our vocabulary of senses. BabelNet is a multilingual knowledge base which organizes word meanings, namely senses, into synsets, i.e., sets of synonyms that express a common concept in different languages (up to 500). Each synset within BabelNet is associated with different glosses in multiple languages that describe its meaning.
The query $Q_{title}$ consists of three terms only, i.e., Polygamy, Polyandry, and Polygyny, and it is not a well-formed sentence. The query description, i.e., $Q_{desc}$ = A look at the roots and prevalence of polygamy in the world today in the example, has proved to be useful in enabling neural models to better represent the input query (Dai and Callan, 2019), as it describes the kind of documents to be retrieved. Therefore, we further leverage this information to also retrieve sense definitions related to the terms within the title through a system for Word Sense Disambiguation. For example, given the title and its description, we can add the following sense definitions: i) Having more than one spouse at a time, ii) Having more than one husband at a time, and iii) Having more than one wife at a time, which explicitly define the meaning of each query term.
With this in mind, in this Section we introduce SIR, our approach to Sense-enhanced Information Retrieval. SIR is divided into two steps: i) expand (§4.2), where we employ a multilingual neural model to expand the input query (Figure 1, A), and ii) rank (§4.3), where the actual document scoring takes place (Figure 1, B).

Query Expander
Inspired by multiple retrieval-augmented approaches for NLP (Guu et al., 2020; Lewis et al., 2020), we enrich the query with the definitions of the senses that are most closely related to its terms, which we collect by means of a learned sense gloss retriever component. To this end, we leverage a simple yet effective 1-Nearest-Neighbours (1-NN) approach between the query contextualized word embeddings and sense vectors for BabelNet concepts. As representations for word senses, we use ARES (Scarlini et al., 2020), which provides English and multilingual sense embeddings for all BabelNet synsets containing a WordNet sense. This choice is motivated by three reasons:
• ARES embeddings have been successfully applied to English and multilingual WSD with a simple 1-NN algorithm, achieving state-of-the-art performance;
• the ARES embedding space is comparable to that of BERT (Devlin et al., 2019);
• the linkage of ARES with BabelNet allows us to easily collect sense definitions in different languages.
To represent query terms $q_i \in Q$, instead, we use BERT, as its representations are comparable to those of ARES, thus making the retrieval easy and without any need for training. Indeed, in order to retrieve the senses (and thus the definitions) that are closely related to a query $Q$, we first feed it through BERT and extract the representations for each word $q_i$ therein. Then, for each term of the query title, i.e., $q_i$ with $i \leq n$, we retrieve the sense with the closest vector in terms of L2 distance. To avoid the query becoming excessively long, we retain only the top-$k$ closest senses according to their L2 distance, where $k = \min(m, n)$ and $m$ is a hyperparameter of the system. For each sense $s_i \in [s_1, \ldots, s_k]$ that we retain, we collect its gloss $G_i$ in the language of interest from BabelNet.
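The retrieval step just described (1-NN per title term, then keeping the $k = \min(m, n)$ closest senses) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function names, the brute-force search, and the toy vectors used to exercise it are hypothetical, whereas the actual system works on BERT and ARES vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def expand_query(
    title_terms: List[Tuple[str, Sequence[float]]],
    sense_vectors: Dict[str, Sequence[float]],
    glosses: Dict[str, str],
    m: int = 3,
) -> List[str]:
    """Return the glosses of the k = min(m, n) senses closest to the title terms."""
    nearest = []
    for _term, vector in title_terms:
        # 1-NN: the single sense whose vector is closest to this term's vector.
        sense_id = min(
            sense_vectors, key=lambda s: l2_distance(vector, sense_vectors[s])
        )
        nearest.append((l2_distance(vector, sense_vectors[sense_id]), sense_id))
    k = min(m, len(title_terms))
    nearest.sort(key=lambda pair: pair[0])  # closest senses first
    return [glosses[sense_id] for _, sense_id in nearest[:k]]
```

The brute-force scan is linear in the number of senses per term; an approximate or indexed nearest-neighbour search would be the natural replacement at BabelNet scale.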

Document Ranker
After the query expansion step, we use the enriched query in a Document Ranker module. While our approach can be used in combination with any document ranker, in this paper we employ a popular neural ranking model from the literature based on BERT, i.e., VanillaBERT (MacAvaney et al., 2019), which has been applied to both English and multilingual zero-shot IR settings (MacAvaney et al., 2020b). In Figure 1 (B) we schematize the Document Ranker architecture. Following VanillaBERT, we finetune a pretrained BERT Transformer model for learning the query-document scoring function. The input to the model is formatted following the standard practice, i.e., [CLS] $Q_e$ [SEP] $D$ [SEP], where $Q_e$ is the expanded query, while the ranking score is produced by projecting the vector of the [CLS] token through a dense layer. The model is trained using a pairwise cross-entropy loss between a relevant and a non-relevant document for the query, which pushes the model to rank the relevant document higher than the non-relevant one. More formally, given a triple $(Q_e, D^+, D^-)$, where document $D^+$ is ranked higher than document $D^-$, the model is trained to optimize a pairwise loss over the scores $S_\theta(Q_e, D^+)$ and $S_\theta(Q_e, D^-)$, where $\theta$ denotes the parameters of the model and $S_\theta(\cdot, \cdot)$ is the ranking function that we are learning. At inference time, given a query, we score all documents in the collection and rank them accordingly.

Table 1: Number of queries for each test collection.

          CLEF 2000-2003          CLEF 2004-2008
          2000  2001  2002  2003  2004  2005  2006
French      34    49    50    52    49    50    49
German      37    49    50    56     -     -     -
Italian     34    47    49    51     -     -     -
Spanish      -    49    50    57     -     -     -
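The input format and the pairwise objective described above can be sketched as follows. The exact form of the loss is an assumption here: we take the softmax cross-entropy over the pair of scores, $-\log \frac{e^{S_\theta(Q_e, D^+)}}{e^{S_\theta(Q_e, D^+)} + e^{S_\theta(Q_e, D^-)}}$, which is one common instance of a pairwise cross-entropy loss and may differ from the one actually used. The function names are hypothetical.

```python
import math


def format_input(expanded_query: str, document: str) -> str:
    """Build the [CLS] Q_e [SEP] D [SEP] input string for the ranker."""
    return f"[CLS]{expanded_query}[SEP]{document}[SEP]"


def pairwise_cross_entropy(score_pos: float, score_neg: float) -> float:
    """Assumed loss: -log softmax of the relevant document's score over the pair.

    Computed with the log-sum-exp trick for numerical stability.
    """
    top = max(score_pos, score_neg)
    log_norm = top + math.log(
        math.exp(score_pos - top) + math.exp(score_neg - top)
    )
    return log_norm - score_pos
```

Under this assumed form, the loss equals $\log 2$ when both documents receive the same score and approaches zero as the relevant document's score grows relative to the non-relevant one.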

Experiments
In this Section, we describe the baselines we compare our approach with, as well as the tasks and datasets used for training and evaluating them.

Experimental Setup
We focus on the monolingual English and non-English Information Retrieval tasks. However, due to the lack of large non-English labeled datasets suitable for training neural ranking models, we follow the zero-shot setting proposed by MacAvaney et al. (2020b), i.e., zero-shot cross-lingual ranking. In this setting, the model is trained in a language for which enough relevance-labeled data exist, i.e., English, and it is tested on queries and documents written in other languages.
Datasets. For Robust04, we use the folds in Table 1 of Huston and Croft (2014). We do not evaluate on the multilingual TREC benchmarks as in MacAvaney et al. (2020b) due to unavailability of the data; instead, we run their released code on the CLEF Test Suite for comparison. We report the number of queries for each test collection in Table 1.
Comparison systems. We compare SIR with BM25 and BM25+RM3 query expansion as implemented in the Anserini toolkit (Yang et al., 2018), using the default parameters. Our main competitor is VanillaBERT (github.com/Georgetown-IR-Lab/cedr), which has the same underlying neural ranking model as SIR, with the exception of our Query Expander module. This comparison allows us to clearly measure the impact of sense glosses on the document ranking task. As for the non-English setting, we evaluate two versions of SIR: i) SIR$_{EN}$, which augments the query with the English glosses of the retrieved senses, and ii) SIR$_{TL}$, which concatenates to the non-English query the glosses of the retrieved senses in the target language, when applicable (when no gloss is available in the language of the query, we fall back to the English gloss). Interestingly enough, in this setting, switching from SIR$_{EN}$ to SIR$_{TL}$ comes at no cost, since we rely on a multilingual knowledge base, i.e., BabelNet. To remain consistent with the non-English setting, we consider only English glosses during training (since the training queries are always in English), and feed SIR$_{TL}$ with glosses in other languages at inference time only.
Training and hyperparameters. The SIR model relies on two BERT Transformer models, one for the Query Expander, to encode the query, and another one for the Document Ranker component, to encode the query-document pair. We use BERT as query encoder so as to create contextualized word representations that are comparable to those of ARES (see §4.2). Specifically, we use bert-large-cased for English and bert-base-multilingual-cased for other languages. Since ARES representations are conceived and computed to be in the same space as BERT representations, we do not need to train the query encoder, but rather we simply employ a 1-NN strategy. That is, for each query term encoded through BERT, we retrieve the sense (and thus the gloss) with the most similar vector. We then retain only the top $m$ glosses to be considered for a query, and set $m = 3$ as that is the average number of query terms in Robust04. For the Document Ranker component, we follow MacAvaney et al. (2019, 2020b) and finetune a bert-base-uncased model for English, and bert-base-multilingual-cased for all the other languages of the non-English tasks. Both VanillaBERT and SIR take as input query the concatenation of the query title and its description, and the first 800 tokens of a document. We limit the maximum number of tokens for a query to 100, while for the expanded query, we additionally consider a maximum number of 100 tokens for the retrieved glosses. We choose the best model by monitoring the precision@20 (P@20) score on the validation set. We include more implementation details in Appendices C and D.

Table 2: Results on each fold of TREC Robust04 for English retrieval: SIR$_{EN}$ outperforms VanillaBERT in both P@20 and MAP score, with larger gains in separate folds but also in ALL. Best per metric column in bold.
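The difference between the two SIR variants, including the fallback to English glosses when a target-language gloss is missing, amounts to a lookup like the one below. This is a minimal sketch: the nested-dictionary layout, the language codes, and the function name are hypothetical and do not reflect the BabelNet API.

```python
from typing import Dict, List


def select_glosses(
    sense_ids: List[str],
    glosses: Dict[str, Dict[str, str]],
    target_lang: str = "EN",
) -> List[str]:
    """Pick one gloss per sense in the target language, falling back to English.

    `glosses` maps a sense id to a {language code: gloss} dictionary and is
    assumed to always contain an "EN" entry.
    """
    selected = []
    for sense_id in sense_ids:
        by_language = glosses[sense_id]
        selected.append(by_language.get(target_lang, by_language["EN"]))
    return selected
```

With `target_lang="EN"` the function behaves like the English-gloss variant, while any other language code gives the target-language variant, so switching between the two requires no change to the rest of the pipeline.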
Evaluation. To evaluate the SIR and VanillaBERT models, we re-rank the top 150 documents returned by an unsupervised term-weighting algorithm: BM25 for the English retrieval task, following MacAvaney et al. (2019), and BM25+RM3 for the non-English tasks. RM3 (Abdul-Jaleel et al., 2004) shows consistent improvements over BM25, reinforcing the claims as to the effectiveness of query expansion mechanisms. Therefore, re-ranking BM25+RM3 results shows whether both VanillaBERT and SIR are able to improve the ranking even when the baseline considers extra terms for the query. We use P@20 and mean average precision (MAP), computed with the official trec_eval tool, to evaluate the performance of participating systems.
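For reference, the two metrics can be written down in a few lines. This is a simplified sketch with hypothetical function names and binary relevance judgments; it is meant to illustrate the definitions and is not a substitute for trec_eval, whose handling of corner cases (e.g., queries without relevant documents) may differ.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each retrieved relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of the per-query average precision over all queries in `runs`."""
    return sum(
        average_precision(ranked, qrels.get(query, set()))
        for query, ranked in runs.items()
    ) / len(runs)
```

P@20 only looks at the head of the ranking, whereas MAP rewards placing every relevant document as high as possible, which is why the two metrics can move differently for the same system.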

Results
English. In Table 2 we report the ranking results on the Robust04 benchmark. Firstly, both neural reranking approaches, i.e., VanillaBERT and SIR$_{EN}$, significantly outperform the BM25 baseline. This result is in line with the previously reported findings in the literature. More importantly, SIR$_{EN}$ attains better performance than VanillaBERT in almost all folds, both in terms of P@20 and MAP. Across all folds, we observe relative improvements in MAP score from 4% to 7% (folds 1 and 3), with an overall improvement of 2.4% in ALL. As for P@20, instead, SIR$_{EN}$ improves over VanillaBERT by 3% and 4% in folds 2 and 4, with an overall improvement of 2.2% in ALL. When considering the highest reachable performance, i.e., perfectly ranking the documents returned by BM25, SIR reduces the error rate of VanillaBERT by 3.8% in P@20 and 3.3% in MAP score overall. This shows that the sense glosses retrieved by our Query Expander (see §4.2) are of high quality and beneficial to the model, helping to substantially reduce the error rate.
Non-English. In Table 3 we report the performance on the CLEF 2000-2003 ad-hoc test collections. In this setting, we rerank the documents returned by BM25+RM3, as the latter achieves consistently better performance than BM25 alone. Similarly to the English retrieval task, the re-ranking systems, i.e., VanillaBERT and the SIR variants, outperform both baselines in all benchmarks. When considering only the behaviour of the SIR variants, we observe that using language-specific glosses (SIR$_{TL}$) does not affect the performance in general. In fact, SIR$_{TL}$ shows mostly comparable or slightly worse results than SIR$_{EN}$ across all years and measures. This could be due to the fact that the model is trained on English glosses only, which come from a manually-curated English source, i.e., WordNet (see §3.2), whereas non-English glosses come from Wikipedia, are written in a different style, are inherently of lower quality, and have limited coverage.
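The two kinds of percentages quoted for the English results, relative improvement and error-rate reduction, can be made concrete with a small sketch. The formulas below are an assumed reading of the text, with the error taken as the gap between a system's score and the upper bound obtained by perfectly ranking the BM25 candidates; the function names are hypothetical and no value from Table 2 is used.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Gain of `system` over `baseline`, as a fraction of the baseline score."""
    return (system - baseline) / baseline


def error_rate_reduction(baseline: float, system: float, upper_bound: float) -> float:
    """Fraction of the baseline's gap to the upper bound that `system` closes.

    Assumes error = upper_bound - score, so the reduction is
    ((upper_bound - baseline) - (upper_bound - system)) / (upper_bound - baseline).
    """
    return (system - baseline) / (upper_bound - baseline)
```

For instance, with purely hypothetical scores of 0.40 for the baseline, 0.42 for the system, and 0.80 for the upper bound, both quantities come out at 5%, because the baseline score and its gap to the upper bound happen to coincide.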
When compared to VanillaBERT, SIR$_{EN}$ attains better results across the board, showing significant improvements in MAP score on most datasets. More specifically, SIR$_{EN}$ improves over the VanillaBERT baseline by 1.6% in P@20 and 3.1% in MAP score in the ALL dataset of the French language, with significant gains in year 2001. Also, SIR$_{EN}$ significantly outperforms VanillaBERT with respect to the overall MAP score, and increases its performance by roughly 1% in P@20 and 8.6% in MAP score in the ALL dataset of the German language, with the largest gain in year 2000. Furthermore, both SIR$_{EN}$ and SIR$_{TL}$ significantly outperform VanillaBERT in MAP score across the row block of the Italian language, with SIR$_{TL}$ showing higher improvements. Indeed, on ALL, it improves the performance of the baseline by roughly 4% in P@20 and 10% in MAP score. Differently from all the other languages, although the contribution of SIR$_{EN}$ in Spanish is more modest across years 2001 and 2002, it brings roughly 3.5% and 1.5% improvements in both measures in the 2003 and ALL datasets, respectively. We continue our evaluation by showing in Table 4 the results in the CLEF 2004-2008 ad-hoc News French monolingual tasks. The behaviour in these benchmarks is similar to that of CLEF 2000-2003, with the SIR variants consistently improving over VanillaBERT. Differently from the trend of results in Table 3, SIR$_{TL}$ shows slightly higher or comparable performance with respect to SIR$_{EN}$, especially regarding P@20. In comparison to VanillaBERT, the best SIR variant improves its P@20 by 5.6% and MAP by 6.6% in the ALL dataset.

Table 5: Excerpt of term definitions retrieved by our Query Expander: accurate disambiguations improve performance by more than inaccurate disambiguations degrade it (upper); even when accurately disambiguated, retrieval results decrease due to more general information in glosses (lower). VB denotes VanillaBERT.

Query: [. . . ]
Definitions: Asia - the largest continent with 60% of the earth's population; exports - sell or transfer abroad; timber - fragments of wood
Scores: 0.325 0.392

Query: safety plastic surgery: Find documents that discuss the safety of or the hazards of cosmetic plastic surgery.
Definitions: plastic - generic name for certain synthetic or semisynthetic materials that can be molded [. . . ]; surgery - the branch of dentistry involving surgical procedures; safety - a safe for storing meat
Scores: 0.644 0.728

Query: women ordained Church of England: [. . . ] arguments for and against Great Britain's approval of women being ordained as Church of England priests?
Definitions: England - a division of the United Kingdom; ordained - appoint to a clerical posts; Church - one of the groups of Christians who have their own beliefs and forms of worship
In summary, the contribution of SIR is mainly evident in the MAP score across both tables, suggesting that gloss information, while not improving by a large margin in P@20, i.e., top retrieved documents, enables the system to return an overall better ranking of all the relevant documents.

Error Analysis
Here we provide insights into the cases where the retrieved definitions do indeed help the underlying model in the retrieval tasks. To this end, we manually check the quality of the disambiguation of the query terms, and perform a comparison of VanillaBERT and SIR according to the MAP score per query. More specifically, we compute the absolute difference of MAP between systems for each query in Robust04 and pick the top ones where SIR performs better than VanillaBERT and those where it performs worse. We report an excerpt in Table 5. By inspecting the data we note that, firstly, accurate disambiguation improves performance by a larger margin than inaccurate disambiguation degrades it. This phenomenon can be attributed to the ability of the Document Ranker to ignore noisy input while benefiting from useful extra information, and this appears to be so in the majority of cases, as demonstrated by our experimental results (see §5.2). Secondly, we notice that there are some disambiguation mistakes: we attribute this issue to the absence of any mechanism restricting the possible senses for a given word, since we base our retrieval only on the L2 distance between representations (see §4.2). For instance, the words safety and surgery in Table 5 are associated with glosses that are loosely related to the target words but do not correspond to any of their senses. While this issue can be alleviated by filtering the possible senses for a word, similarly to the standard WSD task, we decide not to do so, as it would require lemmatizing and POS tagging the input query, and we want to keep the approach as end-to-end and scalable across languages as possible.
Alternatively, more recent WSD approaches could be useful and we leave this extensive study for future work. Another source of error, even though less frequent, concerns SIR's failure to outperform VanillaBERT even when the disambiguation of its terms is accurate. We inspect the possible reasons behind errors of this kind by checking the top documents retrieved by each system and provide these in the second row block of Table 5. Although the retrieved glosses are factually correct for the query words, the gloss for women has been discarded as scoring lower than the top 3 senses in the sense retrieval step, thus the highest ranked documents generally focused on the Church of England rather than on women in the Church. This issue requires further investigation and analysis. A possible direction for future work would be to identify the most peculiar terms within the query and ensure that their definitions are included in its expanded version.
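The per-query comparison used in this analysis can be sketched as follows. The function names and the dictionaries of per-query scores are hypothetical; the sketch only shows how queries with the largest gains and losses could be selected for manual inspection.

```python
from typing import Dict, List, Tuple


def per_query_deltas(
    map_sir: Dict[str, float], map_vb: Dict[str, float]
) -> Dict[str, float]:
    """Signed per-query difference (SIR minus VanillaBERT) on shared queries."""
    return {q: map_sir[q] - map_vb[q] for q in map_sir if q in map_vb}


def extremes(deltas: Dict[str, float], n: int = 2) -> Tuple[List[str], List[str]]:
    """Return up to n queries with the largest gains and up to n with the largest losses."""
    ordered = sorted(deltas.items(), key=lambda pair: pair[1], reverse=True)
    improved = [q for q, delta in ordered[:n] if delta > 0]
    degraded = [q for q, delta in ordered[::-1][:n] if delta < 0]
    return improved, degraded
```

Queries with a zero difference are excluded from both lists, so ties between the two systems never appear among the inspected cases.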

Conclusions
In this paper we presented SIR, a novel approach for ranking documents in multiple languages. Our approach is the first to take advantage of a WSD model to expand the input query with sense definitions as additional semantic information. By evaluating SIR on multiple gold Information Retrieval benchmarks across languages, we show that our approach consistently improves over its main competitors that do not have access to sense glosses, thus demonstrating that such information is beneficial for the English retrieval task, as well as in the zero-shot cross-lingual setting. In addition, through a simple qualitative analysis, we highlight the advantages and disadvantages of SIR, suggesting promising directions for better utilizing WSD to improve IR models. We release SIR at https://github.com/SapienzaNLP/sir to ease future research in this direction.

A Sense Retrieval
As per the standard practice, we tokenize $Q$ by applying wordpiece tokenization, adding the [CLS] prefix and the [SEP] suffix. Following ARES, we represent the query term vector $V_{q_i}$ as the sum of the BERT representations of the last four hidden layers, and average the wordpiece vectors belonging to the same query term. Moreover, since ARES vectors are composed of two stacked BERT representations, we concatenate $V_{q_i}$ with itself. We search the most related senses for each term $q_i$ within the query, first normalizing the $q_i$ vectors and those of ARES and then employing the L2-distance search index provided by the FAISS (Johnson et al., 2021) library.
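The construction of a query term vector described above can be sketched in plain Python. The input layout (a list of four layers, each a list of per-wordpiece vectors) and the function name are assumptions made for illustration; the sketch does not call BERT or FAISS.

```python
import math
from typing import List, Sequence

Vector = List[float]


def term_vector(last_four_layers: Sequence[Sequence[Vector]], piece_idx: Sequence[int]) -> Vector:
    """Build the normalized vector of one query term.

    `last_four_layers[l][p]` is the vector of wordpiece p at hidden layer l;
    `piece_idx` lists the wordpiece positions belonging to the term.
    """
    dim = len(last_four_layers[0][0])
    # Sum the four hidden layers for each wordpiece of the term.
    summed = [
        [sum(layer[p][d] for layer in last_four_layers) for d in range(dim)]
        for p in piece_idx
    ]
    # Average the wordpiece vectors belonging to the same term.
    averaged = [sum(vec[d] for vec in summed) / len(summed) for d in range(dim)]
    # Concatenate the vector with itself to match the doubled sense vectors.
    doubled = averaged + averaged
    # L2-normalize before the nearest-neighbour search.
    norm = math.sqrt(sum(x * x for x in doubled)) or 1.0
    return [x / norm for x in doubled]
```

Since $\|a - b\|^2 = 2 - 2\,a \cdot b$ for unit vectors, ranking senses by L2 distance after normalization gives the same order as ranking them by cosine similarity.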

B ARES WSD Performance
In Table 6 we show the results obtained by ARES in the SemEval-2013 benchmark of all-words WSD task in different languages as reported by Scarlini et al. (2020). We choose to report SemEval-2013 only as it comprises all the languages of interest. We direct the reader to Scarlini et al. (2020) for the complete evaluation of ARES in WSD.

C Document Ranker Details
For the Document Ranker component, we follow MacAvaney et al. (2019, 2020b) and finetune a bert-base-uncased model for English, and bert-base-multilingual-cased for all the non-English tasks. Both VanillaBERT and SIR take as input query the concatenation of the query title and its description, and the first 800 tokens of a document. We limit the maximum number of tokens for a query to 100, while for the expanded query, we additionally consider a maximum number of 100 tokens for the retrieved glosses. Since BERT supports 512 tokens, we split longer documents into segments, separately encoding each with the query. Then we average the multiple [CLS] tokens to compute the final query-document pair representation used for classification.
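The segmentation and averaging steps can be sketched as follows. How segments are sized is not specified above, so the fixed window used here (the 512-token limit minus the query length and three special tokens) is an assumption, as are the function names.

```python
from typing import List


def split_document(
    doc_tokens: List[str],
    query_len: int,
    max_len: int = 512,
    doc_limit: int = 800,
    n_special: int = 3,
) -> List[List[str]]:
    """Truncate the document to `doc_limit` tokens and cut it into segments.

    Each segment must fit next to the query and the [CLS]/[SEP]/[SEP] tokens
    within `max_len` positions (assumed fixed-size windows).
    """
    window = max_len - query_len - n_special
    if window <= 0:
        raise ValueError("query leaves no room for document tokens")
    tokens = doc_tokens[:doc_limit]
    return [tokens[i:i + window] for i in range(0, len(tokens), window)] or [[]]


def average_cls(cls_vectors: List[List[float]]) -> List[float]:
    """Average the per-segment [CLS] vectors into one pair representation."""
    dim = len(cls_vectors[0])
    return [sum(vec[d] for vec in cls_vectors) / len(cls_vectors) for d in range(dim)]
```

With a 200-token expanded query, for example, the window is 309 tokens, so an 800-token document yields three segments under this assumed scheme.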

D Training Hyperparameters
We employ the hyperparameters of VanillaBERT (MacAvaney et al., 2019), on top of which we show the improvements of our contribution. The models are trained with the Adam optimizer with learning rate 0.001 for the classifier and $2 \times 10^{-5}$ for BERT layers. The training process is carried out on a single GPU (Nvidia GeForce GTX 1080Ti) for 100 epochs, each consisting of 32 batches of 16 query-document pairs. We validate by monitoring P@20 and employ early stopping with a patience of 20 epochs. Training takes 5-10 hours for both VanillaBERT and SIR, depending on whether early stopping is triggered. Both VanillaBERT and SIR have 110M and 179M trainable parameters when trained with the bert-base-uncased and bert-base-multilingual-cased BERT models, respectively.
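The model selection procedure described above (monitoring validation P@20 with early stopping) can be sketched as follows. The function name and the list of per-epoch validation scores are hypothetical; in practice each score would come from evaluating the model after an epoch of training.

```python
from typing import List, Tuple


def select_best_epoch(
    validation_p20: List[float], max_epochs: int = 100, patience: int = 20
) -> Tuple[int, float]:
    """Return (best epoch, best P@20), stopping after `patience` epochs without gain."""
    best_score, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(validation_p20[:max_epochs]):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # early stopping triggered
    return best_epoch, best_score
```

A strict improvement is required to reset the counter, so a plateau in validation P@20 counts towards the patience budget.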