A General-Purpose Multilingual Document Encoder

Massively multilingual pretrained transformers (MMTs) have tremendously pushed the state of the art in multilingual NLP and, in particular, cross-lingual transfer of NLP models. While a large body of work has leveraged MMTs to mine parallel data and induce bilingual document embeddings, much less effort has been devoted to training a general-purpose (massively) multilingual document encoder that can be used for both supervised and unsupervised document-level tasks. In this work, we pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE) in which a shallow document transformer contextualizes sentence representations produced by a state-of-the-art pretrained multilingual sentence encoder. We leverage Wikipedia as a readily available source of comparable documents for creating training data, and train HMDE by means of a cross-lingual contrastive objective, further exploiting the category hierarchy of Wikipedia for the creation of difficult negatives. We evaluate the effectiveness of HMDE in two of the arguably most common and prominent cross-lingual document-level tasks: (1) cross-lingual transfer for topical document classification and (2) cross-lingual document retrieval. HMDE is significantly more effective than (i) aggregations of segment-based representations and (ii) multilingual Longformer. Crucially, owing to its massively multilingual lower transformer, HMDE successfully generalizes to languages unseen in document-level pretraining. We publicly release our code and models at https://github.com/ogaloglu/pre-training-multilingual-document-encoders.


Introduction
Massively multilingual Transformers (MMTs) such as XLM-R (Conneau et al., 2020) and mT5 (Xue et al., 2021) have drastically pushed the state of the art in multilingual NLP, especially for medium-resourced languages included in their pretraining, enabling effective cross-lingual transfer of task-specific NLP models from languages with plenty of training data to languages with little or no annotated task data. Being standard transformer-based language models, MMTs process text linearly, as a flat sequence of tokens, which has, in monolingual contexts, been shown to be suboptimal for document-level tasks (e.g., document classification or retrieval) for two main reasons: (1) it does not correspond to the hierarchical nature of document organization, in which documents are sequences of (presumably meaningfully ordered) paragraphs, which are in turn sequences of sentences (Zhang et al., 2019; Glavaš and Somasundaran, 2020); and (2) representing documents longer than the MMT's maximal input length requires either document trimming, which leads to loss of potentially task-relevant information, or segmentation, which leads to context fragmentation (Ding et al., 2021).
A number of models that produce document-level representations have been proposed, albeit predominantly in the monolingual (English) realm, with two prominent lines of work. (1) Hierarchical encoders (Pappas and Popescu-Belis, 2017; Pappagari et al., 2019; Zhang et al., 2019; Yang et al., 2020; Glavaš and Somasundaran, 2020; Chalkidis et al., 2022) typically contextualize sentence-level representations with additional document-level parameters (e.g., an additional, document-level transformer). These document-level parameters of the encoder, added on top of a pretrained language model like BERT (Devlin et al., 2019), are typically trained on large task-specific datasets, ranging from document classification (Pappagari et al., 2019) to summarization (Zhang et al., 2019) and segmentation (Glavaš and Somasundaran, 2020). Task-specific training of document-level parameters impedes the transfer of such encoders to other tasks. (2) Sparse attention models (Child et al., 2019; Zaheer et al., 2020; Beltagy et al., 2020; Tay et al., 2020) modify the attention mechanism in order to reduce its computational complexity and consequently be able to encode longer texts. Although flat long-text encoders do not model the hierarchical nature of documents, they allow for flat encoding of substantially longer documents.
In this work, we demonstrate the benefits of hierarchical document representations in the multilingual context. We propose to train a hierarchical transformer model (HMDE), coupling (i) a pretrained multilingual sentence encoder as a lower encoder with (ii) an upper transformer that contextualizes sentence representations against each other and from which we derive document representations. Unlike in the monolingual setup, where task-specific data is commonly used to train the parameters of the upper transformer (Zhang et al., 2019; Glavaš and Somasundaran, 2020), we exploit the fact that in the multilingual context one can leverage cross-lingual document alignments to guide the pretraining of the document encoder, i.e., its upper transformer. To this end, we leverage Wikipedia as a readily available source of quasi-parallel documents, and additionally exploit its hierarchy of categories to create hard negative examples for our contrastive pretraining objective.
We evaluate HMDE in two of the arguably most prominent (cross-lingual) document-level tasks: (1) cross-lingual transfer for document classification (XLDC) and (2) cross-lingual document retrieval (CLIR). For XLDC, as a supervised task, we fine-tune HMDE on English task-specific data; in CLIR, in contrast, we leverage HMDE in an unsupervised fashion, using it to produce static document embeddings (and its lower transformer to produce query embeddings). HMDE exhibits performance superior to that of competitive models: MMTs with a sliding window and multilingual Longformer (Yu et al., 2021; Sagen, 2021). Crucially, HMDE generalizes well to languages unseen in its document-level pretraining. Our further analyses offer additional insights: (i) it is important to allow updates from document-level training to propagate to the sentence-level encoder (i.e., not to freeze the parameters of the pretrained sentence encoder), and (ii) the size of the document-level pretraining corpus matters more than its linguistic diversity (i.e., the number of languages it encompasses).

Hierarchical Multilingual Encoder
The HMDE architecture, illustrated in Figure 1, is similar to that of hierarchical document encoders trained monolingually with task-specific objectives (e.g., Glavaš and Somasundaran (2020)): a sentence-level (lower) encoder produces sentence embeddings from tokens, whereas the document-level (upper) transformer yields a document representation from the sequence of sentence embeddings. We initialize the lower transformer with the pretrained weights of a multilingual sentence encoder (Feng et al., 2022), and train the whole model in a bi-encoder configuration (also known as a Siamese architecture), in which we compute a similarity score between representations of two documents produced independently with HMDE, using a cross-lingual contrastive objective with both in-batch and hard negatives (Oord et al., 2018).

Hierarchical Encoding
The role of the sentence-level (lower) transformer is to produce sentence representations from sequences of tokens. Because of this, we initialize it with the pretrained weights (including subword embeddings) of LaBSE (Feng et al., 2022), a state-of-the-art multilingual sentence encoder. The sentence embedding is the transformed representation of the special beginning-of-sequence (BOS) token. The sequence of sentence embeddings obtained with the sentence-level transformer is then forwarded to the document-level (upper) transformer, which mutually contextualizes them, prepended with a special document-level beginning-of-sequence token (DBOS, with a randomly initialized embedding). We derive the document representation by average-pooling the contextualized sentence embeddings (i.e., the output of the last layer of the document-level transformer).
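The two-level encoding can be sketched as follows (a minimal sketch with stub encoders, a toy hidden size, and an identity "upper transformer"; in HMDE itself, the lower encoder is LaBSE and the upper encoder is a trained transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # hidden size (toy value for illustration)

def encode_sentence(sentence):
    """Stand-in for the lower (LaBSE-initialized) transformer: in HMDE,
    the sentence embedding is the transformed [BOS] representation."""
    return rng.standard_normal(H)

def encode_document(sentences, dbos_embedding):
    """Stand-in for the upper transformer: prepend the DBOS embedding,
    contextualize the sequence (identity here; self-attention in the real
    model), then average-pool the contextualized sentence embeddings."""
    sent_embs = np.stack([encode_sentence(s) for s in sentences])
    seq = np.vstack([dbos_embedding, sent_embs])  # (1 + num_sents, H)
    contextualized = seq                          # placeholder for attention
    return contextualized[1:].mean(axis=0)        # document embedding (H,)

dbos = rng.standard_normal(H)  # randomly initialized DBOS embedding
doc_emb = encode_document(["First sentence.", "Second sentence."], dbos)
```

The document embedding has the upper transformer's hidden size, regardless of document length.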
For each positive pair (d_1^(i), d_2^(i)) in a batch of N pairs, we minimize:

L = -log [ exp(s(d_1^(i), d_2^(i))/τ) / ( Σ_{j=1..N} exp(s(d_1^(i), d_2^(j))/τ) + exp(s(d_1^(i), d_neg^(i))/τ) ) ]   (1)

with d ∈ R^h as the embedding of d, i.e., the output of the document-level transformer (and h as the hidden size of the upper transformer), s(d_i, d_j) as the scoring function capturing the similarity between two document embeddings, and τ as the hyperparameter (the so-called temperature) of the InfoNCE loss. Following common practice, we use cosine similarity as the scoring function s.
Note that the loss we compute is both multilingual and cross-lingual: the documents in a batch come from all |L| training languages, and the positive pairs (d_1^(i), d_2^(i)) are cross-lingual. Among the in-batch negatives, there will be cross-lingual as well as monolingual pairs (when d_1^(i) and d_2^(j) happen to be documents written in the same language). Our hard negatives are, by design, always monolingual pairs. While one could create cross-lingual hard negatives in the same manner (e.g., by pairing the English article "France" with the Italian article "Svizzera" (Switzerland), which covers another concept from the same category "Country"), monolingual hard negatives should be harder because the two document representations will originate from the same language-specific subspace of the embedding space of the lower (multilingual) transformer (Cao et al., 2020; Wu and Dredze, 2020).
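The contrastive objective with in-batch and hard negatives can be sketched in NumPy as follows (a minimal sketch; the temperature value is illustrative, not the paper's tuned hyperparameter):

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce_hard(d1, d2, d_neg, tau=0.05):
    """Batch-mean InfoNCE with hard negatives: d1[i]/d2[i] are cross-lingual
    positives, d2[j] (j != i) serve as in-batch negatives, and d_neg[i] is
    the monolingual hard negative for d1[i]."""
    n = len(d1)
    sims = cosine_matrix(d1, d2) / tau                       # (N, N)
    hard = np.diag(cosine_matrix(d1, d_neg))[:, None] / tau  # (N, 1)
    logits = np.concatenate([sims, hard], axis=1)            # (N, N + 1)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), np.arange(n)].mean()

rng = np.random.default_rng(0)
d1, d2, d_neg = (rng.standard_normal((4, 8)) for _ in range(3))
loss = info_nce_hard(d1, d2, d_neg)
```

When the positives are identical to the anchors, the diagonal dominates the softmax and the loss approaches zero, as expected from Eq. (1).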

Experimental Setup
We first describe how we created the multilingual dataset for HMDE pretraining from Wikipedia (§3.1). We then briefly describe the two evaluation tasks, cross-lingual transfer for document classification and cross-lingual information retrieval, and their respective datasets (§3.2), followed by the description of the baselines: a multilingual sentence encoder with a sliding window and a multilingual Longformer (Yu et al., 2021; Sagen, 2021) (§3.3). We provide training and optimization details for all models in Appendix A.1.

Data Creation
Wikipedia has been leveraged as a suitable source for mining comparable and parallel corpora for decades (Ni et al., 2009; Plamadȃ and Volk, 2013; Schwenk et al., 2021, inter alia). We add to the body of work that exploits Wikipedia as a massively multilingual text resource by using it to build pretraining data for HMDE. Concretely, for a set of languages L = {L_1, L_2, ..., L_n}, we first fetch the monolingual portions from the Wiki-40B corpus. We then identify articles in different languages that are about the same concept (via the wikidata_id field) and keep only those concepts for which pages are found in at least two languages from L. For each such concept with pages p_1, p_2, ..., p_m in m different languages, we create all possible cross-lingual pairs of articles (p_i, p_j) covering the same concept. For each pair (p_i, p_j), we then leverage Wikipedia metadata, namely the mapping of Wikipedia pages into its hierarchy of categories, to select an article n_i from the same monolingual Wikipedia as p_i (i.e., written in the same language as p_i) that belongs to (at least one) same Wikipedia category as p_i. This yields triples (p_i, p_j, n_i) from which we create cross-lingual positives (p_i, p_j) and their corresponding monolingual hard negatives (p_i, n_i) for our contrastive objective (see §2.2).
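The triple-creation procedure above can be sketched as follows (a minimal sketch; the field names other than wikidata_id, and the greedy choice of the first matching negative, are illustrative assumptions):

```python
from itertools import permutations

def build_triples(pages):
    """pages: dicts with 'wikidata_id', 'lang', and 'categories' fields
    ('lang' and 'categories' are assumed names for illustration).
    Returns triples (p_i, p_j, n_i): a cross-lingual positive pair plus a
    same-language hard negative sharing a category with p_i."""
    by_concept = {}
    for page in pages:
        by_concept.setdefault(page["wikidata_id"], []).append(page)
    triples = []
    for concept, group in by_concept.items():
        if len({p["lang"] for p in group}) < 2:
            continue  # keep concepts covered in at least two languages
        for p_i, p_j in permutations(group, 2):
            if p_i["lang"] == p_j["lang"]:
                continue
            for n_i in pages:  # monolingual hard negative for p_i
                if (n_i["wikidata_id"] != concept
                        and n_i["lang"] == p_i["lang"]
                        and set(n_i["categories"]) & set(p_i["categories"])):
                    triples.append((p_i, p_j, n_i))
                    break
    return triples
```

On a toy collection with two concepts each covered in two languages and sharing a category, this yields one triple per cross-lingual article pair.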
On the one hand, the quality of an MMT's representations of a particular language depends on the size of the pretraining corpora for that language (Hu et al., 2020; Lauscher et al., 2020). On the other hand, multilingual model training with instances from linguistically diverse languages may generalize better to unseen languages (Chen et al., 2019; Ansell et al., 2021). Most well-resourced languages, however, tend to be Indo-European (Joshi et al., 2020), putting corpus size and linguistic diversity at odds. We thus create two different datasets, each emphasizing one of these two aspects: (1) XLW-4L is built starting from four high-resource Indo-European languages: English, German, French, and Italian; (2) XLW-12L is built starting from a set of 12 linguistically diverse languages: English, French, Russian, Japanese, Chinese, Hungarian, Finnish, Arabic, Persian, Turkish, Greek, and Malay. With 1.1M triples (p_i, p_j, n_i), XLW-4L is almost twice as large as XLW-12L (which encompasses 592K triples), despite encompassing three times fewer languages: this is primarily because there are many more shared concepts between the large Wikipedias of XLW-4L (e.g., German and Italian) than between the smaller Wikipedias of XLW-12L (e.g., Turkish and Malay).

Evaluation Tasks and Datasets
HMDE is meant to be a general-purpose multilingual document encoder. It thus needs to be useful both (1) when fine-tuned for a supervised document-level task, and (2) as a standalone document encoder. We thus evaluate HMDE in (1) zero-shot cross-lingual transfer for supervised document classification (XLDC) and (2) unsupervised cross-lingual document retrieval (CLIR).
XLDC. Regular MMTs (e.g., mBERT or XLM-R) are primarily used in zero-shot cross-lingual transfer for supervised NLP tasks: an MMT fine-tuned on task-specific training data in a resource-rich language is used to make predictions for language(s) without task data. We evaluate HMDE in exactly the same zero-shot cross-lingual transfer setup, only for a document-level task: topical document classification. We fine-tune HMDE in the standard manner, by stacking a softmax classifier on top of the output of the document-level encoder. With d as HMDE's encoding of the input document d, the classifier's prediction is computed as ŷ = softmax(Wd + b), with W ∈ R^{C×h} and b ∈ R^C as the classifier's trainable parameters (and C as the number of classes).
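A minimal sketch of this classification head (toy dimensions; in the fine-tuned model, W and b are trained jointly with HMDE's parameters):

```python
import numpy as np

def classify(d, W, b):
    """Softmax classifier over HMDE's document embedding d:
    y_hat = softmax(W d + b), with W of shape (C, h) and b of shape (C,)."""
    logits = W @ d + b
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
C, h = 4, 8  # MLDOC has C = 4 classes; h is a toy hidden size
probs = classify(rng.standard_normal(h),
                 rng.standard_normal((C, h)),
                 rng.standard_normal(C))
```

The output is a probability distribution over the C classes.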
We fine-tune HMDE on the English training portion of the MLDOC dataset (Schwenk and Li, 2018) and evaluate its performance on the test portions of all other (target) languages. MLDOC is a subset of the Reuters Corpus Volume 2 (RCV2), with training, development, and test portions in 8 languages (English, Spanish, German, French, Italian, Russian, Japanese, and Chinese), consisting of 1000, 1000, and 4000 documents, respectively. News stories are categorized into C = 4 semantically closely related classes (Corporate/Industrial, Economics, Government/Social, and Markets).
CLIR. We evaluate the effectiveness of HMDE as a standalone document encoder in an unsupervised cross-lingual document retrieval task: queries (short text) in one language are fired against a collection of documents written in another language. We adopt a simple retrieval model: we rank documents in decreasing order of the cosine similarity of their embeddings d, produced by HMDE, with the embedding q of the query, cos(d, q). We obtain the query embedding q by encoding the query only with HMDE's lower (sentence-level) transformer: q is the transformed representation of the beginning-of-sequence ([BOS]) token.
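The retrieval model thus reduces to a cosine-similarity ranking, sketched here with NumPy (the toy embeddings are stand-ins for HMDE outputs):

```python
import numpy as np

def rank_documents(q, doc_embs):
    """Return document indices sorted by decreasing cosine similarity
    between the query embedding q and the document embeddings."""
    q = q / np.linalg.norm(q)
    docs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = docs @ q
    return np.argsort(-scores)

# toy example: the second document points in the query's direction
query = np.array([1.0, 0.0])
collection = np.array([[0.0, 1.0], [2.0, 0.2], [-1.0, 0.0]])
ranking = rank_documents(query, collection)
```

Because document embeddings are static, they can be precomputed once for the whole collection and only the query needs to be encoded at retrieval time.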
We carry out the evaluation on CLEF-2003, a popular CLIR benchmark, including the following languages: English (EN), German (DE), Italian (IT), Finnish (FI), and Russian (RU). Following prior work (Glavaš et al., 2019; Litschko et al., 2022), we evaluate HMDE on 9 language pairs (with the first language being the query language): EN-{FI, DE, IT, RU}, DE-{FI, IT, RU}, FI-{IT, RU}. For each language pair we work with 60 queries and document collections of the following sizes: RU: 17K, FI: 55K, IT: 158K, and DE: 295K documents.

Baseline Models
There are two main alternatives to hierarchical (long) document encoding. The first is to (i) fragment the document into smaller segments, (ii) encode each segment with a regular pretrained MMT (e.g., a vanilla MMT like XLM-R or a multilingual sentence encoder like LaBSE), and (iii) aggregate the document representation from the embeddings of the segments. The second is to train a multilingual sparse-attention encoder, akin to Sagen (2021).

MMT with a Sliding Window (LaBSE-Seg).
For a fair comparison, we use LaBSE (Feng et al., 2022), the same pretrained MMT that we use for the initialization of the lower transformer in HMDE, to independently encode overlapping segments of the input document. We break the document down into segments of length N_S tokens. Following Dai et al. (2022), who find that overlapping segments alleviate the context fragmentation problem, we make adjacent segments overlap in N_S/3 tokens. After encoding each segment with LaBSE, we average-pool the document representation d from the set of segment embeddings. In XLDC (topical document classification), this average of segment embeddings is fed into the classification head. In CLIR, it is compared with the LaBSE encoding of the query.
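The overlapping segmentation can be sketched as follows (a minimal sketch operating on token lists; the stride of N_S - N_S // 3 realizes the N_S/3 overlap between adjacent segments):

```python
def overlapping_segments(tokens, seg_len):
    """Split a token sequence into segments of up to seg_len tokens,
    with adjacent segments overlapping in seg_len // 3 tokens."""
    stride = seg_len - seg_len // 3
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + seg_len])
        if start + seg_len >= len(tokens):
            break  # the last segment already covers the document's end
    return segments

segments = overlapping_segments(list(range(10)), seg_len=6)
```

Each segment would then be encoded with LaBSE and the resulting embeddings average-pooled into the document representation.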

Multilingual Longformer (mLongformer).
The Longformer architecture (Beltagy et al., 2020) combines local-window attention with global attention, resulting in a hybrid attention mechanism whose memory requirements scale linearly with the input length. Beltagy et al. (2020) additionally propose a multi-step procedure for initializing Longformer's parameters based on the parameters of a pretrained regular transformer (e.g., in the case of the monolingual English Longformer, from RoBERTa (Liu et al., 2019)), and then further train the Longformer via masked language modeling (MLM). We train the multilingual Longformer following the same procedure: for a fair comparison with HMDE, we initialize its parameters from the parameters of LaBSE and carry out the additional MLM training on XLW-4L, the same corpus on which we train HMDE.

Results and Discussion
We first report and discuss the main results we obtain with HMDE on XLDC and CLIR (§4.1). In a series of follow-up experiments, we further analyze key design choices for HMDE (§4.2).

Main Results
Cross-lingual Document Classification. Table 1 compares HMDE trained on XLW-4L against several standard and long-document multilingual encoders: besides the baselines introduced in §3.3, for completeness we add the results for vanilla LaBSE (i.e., without sliding over the long document) and models based on XLM-R and mBERT reported by Dong et al. (2020) and Zhao et al. (2021), respectively. Expectedly, all long-document encoders outperform all of the standard MMTs. mLongformer and HMDE generally exhibit similar performance, surpassing the performance of the segmentation-based LaBSE-Seg for virtually all languages. The comparable performance of mLongformer and HMDE suggests that, in the presence of task-specific fine-tuning data, it does not really matter whether we aggregate document representations in a flat or hierarchical fashion. What is particularly encouraging is that both HMDE and mLongformer exhibit strong performance for languages that they did not observe in document-level pretraining: Spanish, Russian, Japanese, and Chinese.

Cross-lingual Retrieval. The results for unsupervised CLIR are shown in Table 2. As in XLDC, we additionally report the results for LaBSE that encodes only the beginning of the document (without sliding) as well as for mBERT, as reported by Litschko et al. (2022):

mBERT (Litschko et al., 2022): .145 .146 .167 .107 .151 .116 .149 .117 .128 .136
Multilingual long-document encoders:
LaBSE-Seg: .243 .169 .107 .194 .268 .178 .104 .153 .014 .159
mLongformer (XLW-4L): […]
HMDE (XLW-4L): .380 .282 .141 .326 .352 .259 .130 .238 .129 .249

Table 2: Performance of HMDE compared against standard MMTs and baseline multilingual long-document encoders on unsupervised cross-lingual document retrieval (CLEF-2003). Bold: best performance in each column.
Like standard MMTs, mLongformer requires fine-tuning and cannot reliably encode documents "out of the box". HMDE also substantially outperforms LaBSE-Seg, the long-document encoder based on sliding LaBSE over the document. Interestingly, vanilla LaBSE, which encodes only the beginning of the document, also outperforms its sliding counterpart LaBSE-Seg, which is exposed to the entire document. We believe that this is because (1) in CLEF, retrieval-relevant information often occurs at the beginnings of documents, and in such cases (2) LaBSE-Seg's average-pooling over all document segments dilutes the encoding of query-relevant content. Importantly, HMDE in CLIR also seems to generalize very well to languages unseen in its document-level pretraining (in particular for Finnish documents).

Further Analysis
We next empirically examine how different choices in HMDE's design and pretraining affect its performance, focusing on: (i) the linguistic diversity and size of the pretraining corpus (XLW-4L vs. XLW-12L), (ii) freezing the lower transformer (i.e., the LaBSE weights) after initialization, and (iii) initializing it with the weights of XLM-R as a standard MMT (vs. initialization with LaBSE as a sentence encoder). We provide further ablations on document segmentation (sentences vs. token sequences ignorant of sentence boundaries) in Appendix A.2.
As discussed in §3.1, we prepare two different corpora for HMDE pretraining: XLW-4L, which is larger (1.1M instances) but encompasses only four major Indo-European languages, and XLW-12L, which is smaller (592K instances) but has documents from a set of 12 linguistically diverse languages. To control for size, and assess the effect of linguistic diversity alone, we randomly downsample XLW-4L, creating a 4-language dataset XLW-4L-S that matches XLW-12L in size. Figure 2 shows the downstream performance of HMDE when pretrained on each of these three datasets. The comparison between XLW-4L and XLW-4L-S (same languages, different dataset size) shows that our flavor of cross-lingual contrastive pretraining (§2.2) is fairly sample-efficient: cutting the training data almost in half leads to small performance drops (a mere 0.3 accuracy points in XLDC; 1.3 MAP points in CLIR). The comparison between XLW-4L-S and XLW-12L (same size, different language sets) quantifies the role of linguistic diversity in pretraining. Somewhat surprisingly, the more linguistically diverse pretraining on XLW-12L does not bring better performance compared to "Indo-European-only" pretraining on XLW-4L-S: while they perform comparably on XLDC, more diverse pretraining (XLW-12L) leads to worse CLIR performance (-1.3 MAP points on average). We hypothesize that this is due to the higher quality of the representations of the four Indo-European languages (EN, DE, FR, IT) in LaBSE (owing to their overrepresentation in LaBSE's pretraining), with which we initialize the lower transformer of HMDE. We find this result particularly encouraging, as, together with the observation that HMDE generalizes well to languages unseen in its document-level pretraining, it suggests that document-level pretraining itself does not necessarily need to be massively multilingual in order to yield successful massively multilingual document encoders.
Figure 2: Results are averages across all test languages (XLDC) and language pairs (CLIR).

Lower Transformer. We next investigate two aspects of the lower transformer: (1) with which weights to initialize it, and (2) whether it pays off to update its parameters during document-level pretraining. For the former, we compare our default initialization with LaBSE (a sentence-specialized multilingual encoder) against initialization with the weights of XLM-R (a vanilla MMT). For the latter, we additionally train HMDE with its lower transformer frozen during document-level pretraining. Table 3 summarizes the results of these ablations.
While freezing the lower transformer after initialization leads to much faster training, it results in a poorer document encoder, especially if used for standalone document encoding without task-specific fine-tuning (HMDE-LaBSE Updated vs. Frozen; a 1 accuracy point drop in XLDC vs. an 8 MAP point drop in CLIR). Initializing HMDE's lower transformer with LaBSE weights leads to much better downstream performance compared to initialization with XLM-R, which is not specialized for sentence-level semantics.

Related Work
We position our contributions w.r.t. three related lines of work: (1) pretraining long-document encoders, (2) self-supervised pretraining for retrieval, and (3) mining parallel documents.

Long-Document Encoders. Hierarchical encoders (Zhang et al., 2019; Yang et al., 2020; Glavaš and Somasundaran, 2020) and sparse-attention-based encoders (Beltagy et al., 2020; Zaheer et al., 2020; Tay et al., 2020), already discussed in §1, account for the vast majority of long-document encoding approaches. Dai et al. (2022) extensively compare Longformer (Beltagy et al., 2020) against hierarchical transformers on various long-document classification tasks, showing that the latter exhibit slightly better performance, especially if the lower encoder encodes overlapping segments. Ding et al. (2021) propose a different, segmentation-based model based on recurrence transformers (Dai et al., 2019), designed to remedy context fragmentation with a retrospective feed mechanism: each segment is encoded twice; after an initial left-to-right segment-by-segment pass with a recurrent transformer, the segment representations are further mutually contextualized bidirectionally. Their training couples MLM-ing with a segment-reordering objective.
The vast majority of work on pretraining encoders for long documents focuses on monolingual (mainly English) models. The few multilingual exceptions (Yu et al., 2021; Sagen, 2021) derive a multilingual Longformer from standard MMTs (XLM-R and mBERT) in exactly the same fashion in which the original work (Beltagy et al., 2020) pretrains the English Longformer after initialization from RoBERTa weights. In this work, we replicated this effort, evaluating mLongformer as the main baseline for HMDE.
Pretraining for Retrieval. Self-supervised and distantly supervised approaches have recently been proposed for pretraining document encoders specifically for the task of document retrieval (Izacard et al., 2022; Yu et al., 2021; Gao et al., 2022). Izacard et al. (2022) pretrain Contriever, a BERT-based document encoder, with an objective based on the inverse cloze task (Lee et al., 2019): a positive query-document pair is created by extracting a span of text from the document and using it as a "query"; they train with a contrastive objective that scores the document from which the query was extracted higher than other documents. Gao et al. (2022) feed queries as prompts to a generative language model, which then generates a document; they then use Contriever to embed this synthetic document and find the most similar real documents in the collection, finally fine-tuning Contriever on query-document pairs obtained this way. In a manner similar to ours, Yu et al. (2021) leverage Wikipedia as a source of quasi-parallel data: while we exploit document-level alignments, they leverage section-level alignments to create positive cross-lingual training instances for paragraph retrieval: a section title ("query") in one language is coupled with the section body ("document") in another language; they then train a multilingual Longformer initialized from mBERT with a combination of query MLM-ing and contrastive relevance ranking. In contrast to these efforts, we create a general-purpose (i.e., task-agnostic) multilingual document encoder that can both be fine-tuned for supervised tasks and used as a standalone document embedder.
Mining Parallel Documents. Mining parallel documents, a task which aims to identify mutual translations in a large document collection and is often used as a first step in extracting parallel sentences (Resnik and Smith, 2003; Uszkoreit et al., 2010; Schwenk, 2018, inter alia), is the task that bears the most resemblance to our pretraining. Transformer-based approaches to the task (Guo et al., 2019; El-Kishky and Guzmán, 2020; Gong et al., 2021) typically aggregate document-level representations from multilingual sentence embeddings. The work of Guo et al. (2019) is arguably most related to ours: they train a hierarchical encoder with a simple feed-forward net as the upper encoder that independently transforms precomputed sentence embeddings: the document embedding is then the average of the feed-forward-transformed sentence embeddings. The model is trained bilingually (English-Spanish and English-French) with a contrastive objective on a huge silver-standard corpus of parallel documents (13M and 6M document pairs, respectively) and evaluated on the very same task of parallel document mining. Our work differs in two crucial aspects: (1) while Guo et al. (2019) train bilingual models for recognizing parallel documents, we train a single general-purpose massively multilingual document encoder; (2) we train on a much smaller corpus of comparable (not parallel) documents, readily available from Wikipedia. Both aspects make HMDE much more widely applicable, for both supervised and unsupervised document-level tasks and any of the languages from LaBSE's pretraining (as HMDE's lower encoder is initialized with LaBSE's weights).

Conclusion
In this work, we pretrain a multilingual document encoder based on a hierarchical transformer architecture (HMDE), and initialize its lower-level encoder with the weights of a state-of-the-art multilingual sentence encoder. We leverage Wikipedia as a rich source of quasi-parallel long documents and train HMDE with a contrastive cross-lingual document-matching objective. We show that the obtained model is a general-purpose multilingual document encoder that can successfully be both (1) fine-tuned for document-level cross-lingual transfer and (2) used as a document embedding model out of the box. Our results render HMDE substantially more effective than both multilingual Longformer and segmentation-based document encoding. Crucially, HMDE generalizes well to languages unseen in its document-level pretraining. Our follow-up experiments reveal that the size of the pretraining corpus affects the performance more than the number and diversity of the languages involved, suggesting that reliable massively multilingual document encoders do not necessarily require equally massively multilingual pretraining.

Limitations
Because we initialize the lower transformer of HMDE with LaBSE (Feng et al., 2022), the set of languages that HMDE "supports" out of the box is bound to the set of 109 languages included in LaBSE's pretraining. This means that HMDE will, in principle, be less effective as a document encoder for other languages. HMDE, like LaBSE, should in principle be useless for languages written in a script that LaBSE (or, in fact, mBERT, from which LaBSE borrows the vocabulary and pretrained subword embeddings) has not seen in its pretraining, as the corresponding tokenizer will produce a sequence of unknown tokens ([UNK]). This means that HMDE, much like the rest of the existing multilingual encoders, supports only a small fraction of the world's 7000+ languages (Joshi et al., 2020). Moreover, all languages included in our evaluation datasets, MLDOC and CLEF, are covered by this set of 109 languages, which means that the average performance we report is likely a gross overestimate for languages unseen in LaBSE's pretraining. Further, HMDE leverages Wikipedia for training (with sets of either 4 or 12 languages, see §3.1); the number of Wikipedia pages (and, more generally, the digital footprint of a language on the web) varies tremendously across languages, effectively limiting the selection of languages for HMDE's document-level pretraining. Our results (see §4.1), however, show that HMDE generalizes well to languages not seen in its document-level pretraining.
Further, HMDE is implemented as a bi-encoder (aka Siamese network), which means that, for a given pair of documents in a training example (positive or negative pair), it separately encodes each of the documents. A cross-encoder architecture, in which the documents would be concatenated before encoding, would have the advantage of allowing the encoder to contextualize the token/sentence representations of one document with those of the other before the computation of their similarity score. Cross-encoding architectures have been shown to be effective, albeit not efficient (i.e., slow) in training, for document retrieval, in which the (short) query is concatenated with the (long) document (MacAvaney et al., 2020; Shi et al., 2020; Rosa et al., 2022). We do not explore cross-encoding in our work; in our case, it would imply joint encoding of the concatenation of two long documents (in different languages), arguably exploding GPU memory occupancy and possibly preventing us from fitting even single-instance batches on our GPU cards.

Ethical Considerations
We do not test HMDE explicitly to check whether the representations it produces reflect negative societal biases and stereotypes (e.g., sexism or racism), but given that its lower encoder is initialized from LaBSE's weights, it would not be surprising if this were the case. If so, many of the existing techniques from the literature designed to debias pretrained language models (Qian et al., 2019; Barikeri et al., 2021; Guo et al., 2022) could be applied to HMDE too, and in principle "as-is" (i.e., without special modifications).

Figure 1:
Figure 1: Illustration of HMDE: a hierarchical transformer architecture coupled with a cross-lingual contrastive objective. Document colors indicate the Wikipedia concepts: d_1 and d_2 are the pages of the same concept (e.g., New York) in two different languages, L_1 and L_2; documents d_3 and d_4 are pages of other concepts in L_1. The pair (d_1, d_2) is a positive pair (i.e., same concept) for the contrastive training objective, and the pairs (d_1, d_3) and (d_1, d_4) are corresponding negative pairs (i.e., different concepts).

2.2 Multi- and Cross-Lingual Objective

Our training dataset consists of Wikipedia pages written in one of n languages (see §3.1 for details on the creation of the different training datasets): let L = {L_1, L_2, ..., L_n} denote our set of training languages. In each training step, we select a batch of N document pairs (d_1^(i), d_2^(i)), pages of the same concept but in two different languages, L_k and L_m ∈ L. Each of the documents d_1^(i) (i.e., the first document of each pair) is additionally paired with a document d_neg^(i), a document in the same language L_k as d_1^(i) and from the same Wikipedia category, representing a hard negative for d_1^(i) (see §3.1 for details). We then compute and minimize a variant of the popular InfoNCE loss (Oord et al., 2018) that incorporates hard negatives, treating all other documents d_2^(j), j ≠ i, in the batch as in-batch negatives for d_1^(i).

Table 1:
Performance of HMDE compared against standard MMTs and baseline multilingual long-document encoders on supervised topical document classification (MLDOC). Performance (except for En) is for zero-shot cross-lingual transfer: all models are fine-tuned only on English training data. Bold: best performance in each column.

Table 3:
HMDE results for different choices w.r.t. initialization and training of the lower transformer.