Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key to achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and a lack of appropriate data. Instead, most approaches fall back on computing document embeddings from sentence representations. Although architectures and models exist to encode documents fully, they are in general limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on the LASER, LaBSE and Sentence BERT pre-trained multilingual models. We compare input token truncation, sentence averaging, simple windowing and, in some cases, new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.


Introduction
Semantic representations, especially embeddings, are crucial for natural language processing (NLP). In fact, the field has exploded since the success of dense word embeddings (Mikolov et al., 2013). For some tasks, like finding semantic or syntactic relations among words, high-quality word embeddings are enough. Other tasks, like question classification or paraphrase detection, benefit from sentence embeddings. Finally, many tasks deal with documents: summarisation, document classification, question answering, etc. Document representations are difficult to learn, especially multilingually, given the amount of available training data and the length of each training instance.
For these reasons, document embeddings usually resort to sentence embeddings. Since some of the state-of-the-art techniques for language modelling and sentence embeddings are based on self-attention architectures such as BERT (Devlin et al., 2019), and self-attention scales quadratically with the input length, one cannot afford arbitrarily long inputs. Training is usually constrained to input fragments of up to 512 tokens (subunits). This limit goes well beyond an average sentence length and can cover several paragraphs. However, full documents can be significantly longer. The average length of a Wikipedia article in English, for example, is 647 words (not subunits), and the average for two of the tasks that we consider in this work, document alignment and ICD code classification, is around 800 words, with documents of up to 40k words.
In order to be able to process long inputs, more efficient architectures such as Linformer (Wang et al., 2020), Big Bird (Zaheer et al., 2020) or Longformer (Beltagy et al., 2020) implement sparse attention mechanisms that scale linearly instead of quadratically. These architectures accept at least 4,096 input tokens. With this length, one can embed most Wikipedia articles, news articles, medical records, etc. These architectures are available as pre-trained models in English and can be fine-tuned for NLP tasks such as document classification, question answering or summarisation. However, multilingual or non-English versions are rare. For most languages, it is not just a matter of training a model from scratch: the number of available documents is simply not large enough to train high-quality models.
LASER (Artetxe and Schwenk, 2019; Heffernan et al., 2022), Sentence BERT (Reimers and Gurevych, 2019, 2020) and LaBSE (Feng et al., 2022) are representative state-of-the-art models which adapt language models to serve as task-independent sentence representations. These models are available pre-trained and, contrary to the long-sequence models introduced above, they are multilingual. LASER, which is not transformer-based, allows longer inputs.
These observations explain why the two main approaches to obtain multilingual (or non-English) document embeddings are simply (i) truncating the input to 512 tokens and feeding it into a sentence-level encoder or (ii) splitting the document into shorter fragments and then combining their embeddings. Few works perform a systematic comparison among methods. Park et al. (2022) carry out a systematic study for document classification in English and find that the most sophisticated models, such as Longformer, do not always improve on a baseline that truncates the input to fit it into a fine-tuned BERT. The results mostly depend on how the information is distributed along a document and therefore vary from dataset to dataset.
In this work we explore multilingual document-level embeddings in detail on three tasks: document alignment, a bilingual semantic task; ICD code (multi-label) classification in 2 languages; and cross-lingual document classification in 8 languages. We compare input token truncation, sentence averaging, simple windowing and, in some cases, new augmented and learnable approaches. Our results show that a simple sentence average is a very strong baseline, even better than considering the whole document as a single unit, but that positional information is needed when the distribution of information across a document is not uniform.

Related Work
Word embeddings have been exceptionally successful in many NLP applications (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). Subsequent works developed methods to learn continuous vector representations for longer sequences such as sentences or even documents. Skip-thought embeddings (Kiros et al., 2015) train an encoder-decoder architecture to predict surrounding sentences. Conneau et al. (2017) showed that the task on which sentence representations are learnt significantly impacts their quality. InferSent (Conneau et al., 2017), a Siamese BiLSTM network with max pooling, and the Universal Sentence Encoder (Cer et al., 2018), a transformer-based network, are trained on the SNLI dataset, which is suitable for learning semantic representations (Bowman et al., 2015).
These methods primarily work on a single language, but as multilingual representations have attracted more interest, sentence-level embeddings have been extended to obtain wider language coverage. Artetxe and Schwenk (2019) (LASER) learn joint multilingual sentence representations for 93 languages based on a single BiLSTM encoder with a shared BPE vocabulary trained on publicly available parallel corpora. However, this architecture was shown to underperform in high-resource scenarios (Feng et al., 2022). LASER is especially interesting for our work as, being LSTM-based, it does not have the 512-token constraint. Li and Mak (2020) introduce T-LASER, a version of LASER that uses a transformer encooder in place of the original bidirectional LSTM. However, this model was tested only on the Multilingual Document Classification (MLDoc) corpus (Schwenk and Li, 2018), which does not have significantly long documents. Similarly, Reimers and Gurevych (2019) (sBERT in the following) extended a transformer-encoder architecture, BERT, by using a Siamese network with cosine similarity for contrastive learning in order to derive semantically meaningful sentence representations. More recently, Feng et al. (2022) (LaBSE) explored cross-lingual sentence embeddings with BERT by introducing a pre-trained multilingual language model component and showed that, on several benchmarks, their method outperforms many state-of-the-art embeddings such as LASER.
While sentence-level representations have been widely explored in the literature, document-level representations are less well studied. The earliest approaches to learning document-level vector representations included an extension of the Word2Vec algorithm named Doc2Vec (Le and Mikolov, 2014), with two variants proposed, a bag-of-words and a skip-gram based model. However, while these methods worked well at the word level, the document-level counterpart led to scaling issues due to large vocabulary sizes (Lau and Baldwin, 2016). Due to these limitations, further works have attempted to reduce the computational bottlenecks involved in training on long sequences such as documents. Linformer (Wang et al., 2020) is a transformer-based architecture with linear complexity due to a sparse self-attention mechanism, making it significantly more memory- and time-efficient than the original transformer (Vaswani et al., 2017). Big Bird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020) introduced a sparse attention mechanism and localised global attention, respectively. Big Bird is able to handle sequences of up to 4,096 tokens, and Longformer scales linearly with the sequence length, with experiments on sequences of length up to 32,256. To the best of our knowledge, to date not much has been done to extend them beyond English. Shen (2021) and Romero (2022) made available Chinese and Spanish Longformer models, respectively, while Sagen (2021) trained a multilingual version starting from a RoBERTa checkpoint rather than from scratch. We use Longformer as a comparison system in our experiments, but we do not consider the multilingual model given that its multilinguality was achieved by fine-tuning on question answering data and we do not explore this task.

Sentence Embeddings
We use three multilingual sentence-level embedding models that cover different languages, architectures and learning objectives. LASER (Schwenk and Douze, 2017; Artetxe and Schwenk, 2019) uses max-pooling over the output of a stacked BiLSTM encoder. The encoder is extracted from an encoder-decoder machine translation setup trained on parallel corpora over 93 languages. Since it is based on LSTMs rather than transformers, the maximum number of input tokens can in principle be arbitrary and is set to 12,000. sBERT (Reimers and Gurevych, 2019) uses the output of BERT-base with mean pooling to create a fixed-size sentence representation. A Siamese-BERT architecture trained on NLI is used to obtain the final sentence-embedding model. The maximum number of input tokens is 512, with a default value of 128. We use the multilingual version (Reimers and Gurevych, 2020).

Document Embeddings
We divide our approaches to building document embeddings into three families. In (i) Document Excerpts, we feed token sequences directly into LASER, LaBSE and sBERT to obtain a document-level representation. In (ii) Sentence Weighting Schemes, we divide documents into sentences represented using base sentence embeddings and then explore different combination and weighting strategies to obtain document embeddings. In (iii) Windowing Approaches, we study different distributions to learn document-level positional and semantic information.

(i) Document Excerpts
All Tokens: The full document is fed into the system (no truncation). We explore this option only with LASER, which does not have the 510-token-length restriction, and, when possible (English, Spanish and Chinese), with Longformer.
Top-N Tokens: The document is truncated to the first n = 510 tokens.

Bottom-N Tokens: The last n = 510 tokens are fed into the system.
Top-N + Bottom-M Tokens: We select N = 128 and M = 382 to use the first N and last M tokens of the document. These values are based on empirical explorations by Sun et al. (2019).
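The three truncation strategies above amount to simple slices over the tokenised document. A minimal illustrative sketch (tokenisation itself, done with each model's own sentencepiece vocabulary, is not shown; function names are ours):

```python
def top_n(tokens, n=510):
    """Top-N: keep the first n tokens."""
    return tokens[:n]

def bottom_n(tokens, n=510):
    """Bottom-N: keep the last n tokens."""
    return tokens[-n:]

def top_bottom(tokens, n=128, m=382):
    """Top-N + Bottom-M: first n and last m tokens (Sun et al., 2019)."""
    if len(tokens) <= n + m:
        return tokens  # short document: nothing to drop
    return tokens[:n] + tokens[-m:]
```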

(ii) Sentence Weighting Schemes
Sentence Average: Each base sentence embedding (obtained with LASER, LaBSE or sBERT) is given a uniform weight. This computes the vanilla average embedding vector of all sentences in the document.
Top/Bottom-Half Average: Only the top (bottom) half of the sentences in the document are considered for averaging.
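These unweighted schemes reduce to a plain mean over (part of) the stacked sentence embeddings. A minimal numpy sketch, where `sent_embs` stands for the base embeddings of one document (one row per sentence):

```python
import numpy as np

def sentence_average(sent_embs):
    """Uniform average of the base sentence embeddings (LASER/LaBSE/sBERT)."""
    return np.mean(np.asarray(sent_embs), axis=0)

def half_average(sent_embs, top=True):
    """Average only the top (or bottom) half of the sentences."""
    E = np.asarray(sent_embs)
    half = max(1, len(E) // 2)
    return np.mean(E[:half] if top else E[-half:], axis=0)
```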

TF-IDF Weights: We compute TF-IDF scores for all terms in a document and average their values at the sentence level. The base sentence embeddings (LASER, LaBSE, sBERT) are then weighted by the normalised value of the TF-IDF averages. Following Buck and Koehn (2016b), we use different TF-IDF computations based on variations of the term frequency tf and inverse document frequency idf definitions. For words w in a document d belonging to a collection D, we report results using these variants, with df(w, D) = |{d ∈ D | w ∈ d}|, and where S_k is a sentence in a given document d and #w_k is the number of words in sentence S_k.
The weights of these models are fixed for the static tasks and used as initialisation when training a classifier.
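The TF-IDF scheme can then be sketched as a weighted average, where `sent_scores[k]` holds the already-computed average TF-IDF of the terms in sentence S_k (the exact tf/idf variants follow Buck and Koehn, 2016b, and are not reproduced here):

```python
import numpy as np

def tfidf_weighted_embedding(sent_embs, sent_scores):
    """Document embedding as a TF-IDF-weighted average of sentence embeddings."""
    w = np.asarray(sent_scores, dtype=float)
    w = w / w.sum()  # normalise the per-sentence TF-IDF averages
    return np.sum(w[:, None] * np.asarray(sent_embs, dtype=float), axis=0)
```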
(iii) Windowing Approaches
TK-PERT: Thompson and Koehn (2020) introduced a windowing approach that weights the contribution of each sentence according to the modified PERT function (Vose, 2008) and a down-weighting function for boilerplate text. The latter was introduced to deal with webpages but can be ignored for other types of documents. The smoothed overlapping windowing functions, based on a cache of the PERT distribution (PERT-cache), encode fine-grained positional information into the resultant document vector.
A document with N sentences S_n, n ∈ {0, ..., N−1}, is split uniformly into J parts, and the final representation D for a document is given by a concatenation of normalised position-weighted (via PERT) sub-vectors, where each sub-vector is

D_j = Σ_{n=0}^{N−1} P_j(n) B(n) emb(S_n),

where emb(S_n) is the (LASER, LaBSE, sBERT) embedding of sentence n, P_j is the modified PERT function for part j and B is a boilerplate down-weighting function, if there is one.
In cases where no boilerplate text is present, we set B to 1. Following the Thompson and Koehn (2020) setting for the modified PERT distribution, we use J = 16 and set its shape parameter to γ = 20.
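A simplified sketch of TK-PERT under these settings (boilerplate down-weighting omitted, i.e. B = 1; the PERT density below is one standard Beta-based parameterisation and may differ in detail from the PERT-cache implementation of Thompson and Koehn, 2020):

```python
import math
import numpy as np

def pert_pdf(x, mode, gamma=20.0):
    """Modified PERT density on (0, 1) with the given mode and shape gamma."""
    a = 1.0 + gamma * mode
    b = 1.0 + gamma * (1.0 - mode)
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def tk_pert_doc_emb(sent_embs, J=16, gamma=20.0):
    """Concatenate J PERT-weighted, L2-normalised sub-vectors (TK-PERT sketch)."""
    E = np.asarray(sent_embs, dtype=float)
    N = len(E)
    pos = (np.arange(N) + 0.5) / N              # sentence positions in (0, 1)
    subs = []
    for j in range(J):
        mode = (j + 0.5) / J                    # centre of window j
        w = pert_pdf(pos, mode, gamma)          # positional weight per sentence
        d_j = (w[:, None] * E).sum(axis=0)      # position-weighted sub-vector
        d_j /= np.linalg.norm(d_j) + 1e-12      # normalise the sub-vector
        subs.append(d_j)
    return np.concatenate(subs)                 # dimension J * emb_dim
```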
TF-PERT: is a new extension of TK-PERT to further incorporate semantics. PERT focuses on the positional information encoded in the document, while TF-IDF focuses on the semantic information; a combined metric should therefore be able to consider both features. We combine the two contributions with a multiplication at the sentence level:

D_j = Σ_{n=0}^{N−1} tfidf(S_n) P_j(n) B(n) emb(S_n),

where we use the same notation as in Eqs. 4 and 5.
ATT-PERT: is a new extension of TK-PERT to further incorporate a global learnable attention. Figure 1 illustrates the basic architecture. The PERT distribution encodes global positional information of the document. By adding an attention layer over it, we introduce a global attention that weights the different parts of the document and that is combined with the standard local attention at the word level performed by the sentence encoder. Mathematically,

D_j = Σ_{n=0}^{N−1} a_j(n) P_j(n) S_n,

where S_n refers to the sentence embedding that has been trained for a classification task and a_j(n) is the respective global attention weight.
In TK-PERT, the static PERT distribution is multiplied by the fine-tuned sentence embeddings. In contrast, in ATT-PERT, the distribution is multiplied with the embeddings prior to training a classifier without freezing the embedding layer, as this allows the positional weights in the PERT distribution to be trained for the specific task.
ATT-TF-PERT: is a new extension of TF-PERT to further incorporate a global learnable attention, as in ATT-PERT. In this configuration, we learn combined TF-IDF-PERT weighted embeddings whose attention weights are further updated while training the classifier. We use the same global attention a_j(n) as in ATT-PERT; however, here it is multiplied with both the TF-IDF weight of the sentence, tfidf_j(w, S_n), as computed in the TF-IDF setup, and the PERT distribution P_j(n) as in TK-PERT:

D_j = Σ_{n=0}^{N−1} a_j(n) tfidf_j(w, S_n) P_j(n) S_n.

Evaluation Tasks
We apply the different configurations discussed above across the following tasks. Bilingual Document Alignment aims at aligning documents from two collections in languages L1 and L2 according to whether they are parallel or comparable. In our experiments, we use the data from the WMT 2016 Shared Task on Bilingual Document Alignment to align French web pages to English web pages for a given crawled web domain (Buck and Koehn, 2016a). In these experiments we do not perform any learning using the training data, but just estimate document-level semantic similarity between the pairs of documents in the test set. To compute this, we find the top K = 32 candidate translations using approximate nearest neighbour search via FAISS, as in Buck and Koehn (2016a). We use cosine similarity on the document embeddings to quantify semantic similarity.
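The candidate search can be sketched as follows. The paper uses approximate nearest-neighbour search with FAISS; for clarity, this illustration performs exact search with numpy, which gives the same ranking on small collections:

```python
import numpy as np

def top_k_candidates(src_embs, tgt_embs, k=32):
    """Indices of the k most cosine-similar target documents per source document."""
    A = np.asarray(src_embs, dtype=float)
    B = np.asarray(tgt_embs, dtype=float)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-normalise rows
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sims = A @ B.T                                    # cosine similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]
```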
Multi-label ICD Code Classification aims at assigning one or more ICD-10 codes to medical-domain texts (electronic health records). An arbitrary number of ICD-10 codes can be assigned to the input text. In particular, out of all the possible ICD-10 codes, 4 account for more than 90% of the documents, making this an imbalanced classification task and leading to the 'tail-end problem' (Chapman and Neumann, 2020). We use the CLEF eHealth 2019 task for German non-technical summaries (Neves et al., 2019) and CANTEMIST-CODING (Miranda-Escalada et al., 2020) for Spanish electronic health records. Here, we learn a weighted-attention classifier layer (Lee et al., 2022) on top of the base document embeddings, consisting of a feed-forward neural network with a single hidden layer of 10 units.
Cross-lingual Document Classification aims at classifying documents into a set of predefined categories in one language (usually English) and then transferring the model to unseen languages. We use the MLDoc dataset for this purpose (Schwenk and Li, 2018). The corpus contains 1,000 development documents and 4,000 test documents in eight languages (English, German, French, Italian, Spanish, Japanese, Russian and Chinese), divided into four different genres with uniform class priors. For zero-shot transfer, we train a classifier on top of the multilingual document representations estimated as described in Section 4, using only the English training data and the hyperparameters optimised in Artetxe and Schwenk (2019). As in the previous classification task, we use a feed-forward neural network with one hidden layer of 10 units. We use this classifier on top of the multilingual embeddings to evaluate the system on the remaining languages.
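The zero-shot setup can be sketched with a small feed-forward classifier. Here scikit-learn's MLPClassifier with a 10-unit hidden layer stands in for the classifier described above; the actual hyperparameters follow Artetxe and Schwenk (2019), and the training-loop details are ours:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_zero_shot_classifier(en_embs, en_labels, seed=0):
    """Fit a one-hidden-layer (10 units) classifier on English document
    embeddings; the same model is then applied, unchanged, to the
    document embeddings of every other language."""
    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000,
                        random_state=seed)
    clf.fit(en_embs, en_labels)
    return clf
```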
Table 1 shows the statistics for the datasets used in the three tasks, as well as the average length of training instances in terms of sentencepiece tokens. The average document length in the document alignment and ICD code classification tasks is larger than 512 tokens, making the usage of sentence embeddings alone insufficient. This is not the case for document classification, but we still consider it in order to compare the different approaches and add a highly multilingual setting.

Results and Discussion
Thompson and Koehn (2020) empirically obtained the best trade-off between accuracy and inference time when using PCA-reduced sentence embeddings of 128 dimensions in the bilingual document alignment task. We performed equivalent experiments with 128 and 256 dimensions for selected configurations in the three tasks and confirmed the trend. As we obtained no major gains from using more dimensions, we report all the results for the three tasks with 128-dimensional sentence embeddings.
We report confidence intervals at the 95% confidence level using bootstrap resampling, with 1,000 samples for document alignment, 500 samples for ICD code classification and 1,000 samples for document classification.
Bilingual Document Alignment quality ranges from 65% to 96% recall depending on the document embedding method. Table 2 shows the results obtained for all the configurations considered. A simple sentence average achieves a recall around 82% (depending on the sentence embedding used). When using LASER, the only method that allows the comparison, the recall with sentence average is larger, but not statistically significantly so, than embedding the full document as a single unit (81.8% vs 81.2%). Taking a token-based excerpt of the document is 10 percentage points below sentence-averaging the same excerpt. The information in webpages seems to be more densely distributed towards the top of the page: comparing the top half with the bottom half of the sentences of the webpages, there is a 17% reduction in the scores obtained. In these unweighted and averaged configurations, in both the token- and sentence-based methods, we do not encode any positional information: sentence order and semantic relevance are not considered in the final document embeddings. However, intuitively, these factors are indicative of each sentence's contribution to the larger document embedding. In order to incorporate semantic relevance into our final embeddings, we consider the weighted average using TF-IDF. We explore several TF-IDF forms and obtain a difference of 7% on average among them. Table 2 shows the 2 most promising ones. With the best option (tf4-idf4), TF-IDF weighting improves between 3 and 5 percentage points with respect to sentence averaging with uniform weights. We use tf4-idf4 in the following experiments when required, as these formulae empirically performed the best. To include sentence order, we use the PERT-window based approach. TK-PERT outperforms all other methods by a margin of 11.7%. This result attests to the relevance of contextual information, sentence order, and positional importance. Although we find improvements over the baseline models by introducing TF-IDF weights and the PERT distribution, a combination of the two in TF-PERT does not lead to further improvements.
The other dimension of the study, the choice of sentence embeddings, is less important to the recall. LASER, LaBSE and sBERT achieve similar results. As we are working with French and English documents, both high-resource languages, all base sentence embeddings are high-quality and therefore they do not impact the final model strongly in a consistent way.
Multi-label ICD Code Classification shows the same trend with respect to the different sentence embeddings as above for German and Spanish, with a slight preference towards LaBSE embeddings. Table 3 shows the results for this task. There is a large discrepancy between the scores for the German and the Spanish datasets, as already noticed in the evaluations of the original corresponding shared tasks.
The classification in Spanish achieves much lower results, probably because of a very small training corpus. Our results indicate that the information is spread throughout documents in this case. The difference between only using the top of the document and only using the bottom part is small, and using the whole document, either by sentence averaging or by considering it a single unit, is always better than any of its parts at the 95% significance level. Semantic (TF-IDF) and positional (TK-PERT) information is less relevant. For the German task, either considering the full document as a whole (All tokens) or averaging all the sentences gives the highest performance. For the Spanish task, even with a very low overall quality, learning specific weights for different parts of the document (ATT-PERT) boosts the quality. Comparing ATT-PERT with TK-PERT, we find that the trainable alternative performs better for all languages and base embeddings considered; however, the improvements are not statistically significant for all base embeddings in the case of German. In general, the windowing approaches that combine semantics with position (TF-PERT and ATT-TF-PERT) do not perform significantly better than the pure positional methods (TK-PERT and ATT-PERT). This can be explained by looking at a concrete example. Figure 2 shows the distribution of weights across a document from the CANTEMIST health record corpus for 8 configurations based on LASER embeddings. The example shows that the effect of the tfidf component in ATT-TF-PERT (configuration 7) is equivalent to moving weight mass from ATT-PERT (configuration 6) towards TF-IDF (configuration 3). When this happens, the result is a score midway between those of ATT-PERT and TF-IDF. In this document, a medical diagnostic evaluation is detailed and includes patient information, past diagnoses, family medical history, as well as the potential evolution of the disease. We observe that while the 'Sentence average' configuration places largely equivalent weights on all the sentences, the TF-IDF weights place more emphasis on the beginning and end of the document, which store information about the patient and the evolution of the disease, respectively. This behaviour is similar to the one exhibited by the PERT family of methods: the weight patterns observed for configurations 3-7 remain quite consistent but vary in their intensity.
Cross-lingual Document Classification data allows us to test the embedding methods on 8 languages (Table 4). The languages belong to three families: Indo-European (Germanic, Romance and Slavic), Japonic and Sino-Tibetan. All languages are high-resource and included in our pre-trained sentence representation models. MLDoc documents are shorter than 1,000 tokens, with an average length of 275 tokens for English and 562 for Chinese; the other languages lie in between. Given that length, the methods that use different 510-sized excerpts of the documents do not differ much, as the excerpts are, for most of the documents, the same.
Accuracies in Table 4 show that the documents convey slightly more meaning in the top part than in the bottom (Top-Half Avg. vs Bottom-Half Avg.). The sentence average is a very strong baseline and, for half of the languages (English, German, Russian and Chinese), it is statistically significantly better at the 95% confidence level than treating the document as a single unit with LASER. The TF-IDF version is worse than the simple sentence average except for Japanese. Japanese has the lowest accuracy of all the languages and a high difference between the information at the top and the bottom of its documents. In general, position (TK-PERT) is more important than semantics (TF-IDF), and learning task-specific weights further increases accuracy. Additional experiments with TF-PERT and ATT-TF-PERT do not show statistically significant improvements over their counterparts TK-PERT and ATT-PERT, similarly to the trend observed in the previous tasks. For English, Chinese and Spanish, we are further able to compare the performance of pre-trained large-input transformers. Longformer achieves 92.3% accuracy for English, which is 4.1% better than the 88.7% that LASER achieves in the All tokens configuration and about 2% better than the best performing architecture, the sentence average of LaBSE embeddings (90.9%). However, the latter difference is not statistically significant at the 95% confidence level. The result is different for Chinese and Spanish. In both cases, considering all tokens with LASER and the sentence average are better than Longformer, although the difference is not statistically significant for Spanish. This indicates that smaller amounts of training data can prevent native full document-level embeddings from being extended to languages other than English.

Conclusions
We presented exhaustive evaluations across three sentence embedding models, three tasks and eight languages.
Our experiments show that the specific base sentence embedding model (LASER, LaBSE, sBERT) does not impact the performance of the document-level embeddings much. We observe similar performance amongst them across all experiments. However, it is to be noted that we experiment with languages that, while morphologically distinct, are well resourced and covered by the three base sentence-embedding models. It would be interesting to explore how the models behave when the embeddings have lower quality. For this, one would need to create evaluation datasets at the document level for low-resource languages, but this is out of the scope of this work.
We observed that a simple sentence average is a very strong pooling strategy, especially for classification tasks. Positional and contextual information is more important than semantic information for the final performance, as exemplified by the fact that PERT-based weightings perform better than TF-IDF ones in all the tasks. When combining both positional and semantic information, we do not observe statistically significant improvements with respect to only including positional information. For the classification tasks, which include a learnable layer, we extend TK-PERT to ATT-PERT (and its semantic counterparts) and include a global trainable attention on the positional information. This global attention is beneficial in all cases.
The type of document is also relevant when choosing the best method. Long documents might have the most crucial information stored in different parts. For instance, webpages have most of their information in the first half of the document, as we observed in the document alignment task. In this case, positional information significantly outperforms any model that does not take it into account.

Limitations
One of the main focal points of this work is multilinguality. In the presented approaches, the multilinguality of the resultant document embeddings depends solely on the language coverage and cross-lingual transfer ability of the pre-trained sentence embeddings used as basic units. Document-level representations are as robust to new languages and scripts as the base sentence embeddings are. Cross-lingual transfer is an orthogonal dimension not studied in this work.
We introduce ATT-PERT, a new learnable approach for the combination of sentence embeddings. This model is therefore of use for tasks with a learning/fine-tuning phase, but it is not intended to provide ready-to-use multilingual document-level embeddings, in contrast to the existing pre-trained sentence-level counterparts.

LaBSE
Feng et al. (2022) train a multilingual BERT-like model with masked LM and translation LM objective functions. A dual-encoder transformer is initialised with the model and fine-tuned on a translation ranking task. The final model covers 109 languages. The maximum number of input tokens is 512.

Figure 1: ATT-PERT model for classification. A static modified PERT distribution is used to extend the sentence embeddings to documents. Afterwards, an attention-weighted classifier is learnt.

Table 1: Number of documents and average tokenised document length in sentencepiece units (prior to boilerplate down-weighting for Document Alignment) for the three tasks used in the experiments.

Table 2: Document recall on the WMT-16 Shared Task on English-French document alignment. The best score for each family is in bold.

Table 3: F1 scores for the multi-label ICD code classification task for German (de) and Spanish (es) documents. Best scores are in bold, and best scores per family are in italics.

Table 4: Accuracy for MLDoc classification on the zero-shot transfer task. Best results per language are shown in bold and per family in italics.