Discrete Cosine Transform as Universal Sentence Encoder

Modern sentence encoders are used to generate dense vector representations that capture the underlying linguistic characteristics of a sequence of words, be it a phrase, a sentence, or a paragraph. Such representations are ideal for training a classifier for an end task such as sentiment analysis, question answering, or text classification. Different models have been proposed to efficiently generate general-purpose sentence representations to be used in pretraining protocols. While averaging is the most commonly used efficient sentence encoder, the Discrete Cosine Transform (DCT) was recently proposed as an alternative that captures the underlying syntactic characteristics of a given text without compromising practical efficiency compared to averaging. However, as with most other sentence encoders, the DCT sentence encoder has only been evaluated on English. To this end, we utilize the DCT encoder to generate universal sentence representations for other languages, namely German, French, Spanish, and Russian. The experimental results clearly show the effectiveness of DCT encoding, with consistent performance improvements over strong baselines on multiple standardized datasets.


Introduction
Recently, a number of sentence encoding representations have been developed to accommodate the need for sentence-level understanding; some of these models are discussed in (Hill et al., 2016; Logeswaran and Lee, 2018; Conneau et al., 2017), yet most of these representations have focused on English only.
To generate sentence representations in different languages, the most obvious solution is to train monolingual sentence encoders for each language. However, training a heavily parameterized monolingual sentence encoder for every language is inefficient and computationally expensive, let alone the impact on the environment. Thus, utilizing a non-parameterized model with ready-to-use word embeddings is an efficient alternative for generating sentence representations in various languages.
A number of non-parameterized models have been proposed to derive sentence representations from pre-trained word embeddings (Rücklé et al., 2018; Yang et al., 2019; Kayal and Tsatsaronis, 2019). However, most of these models, including averaging, disregard structure information, which is an important aspect of any given language. Recently, Almarwani et al. (2019) proposed a structure-sensitive sentence encoder, which utilizes the Discrete Cosine Transform (DCT) as an efficient alternative to averaging. The authors show that this approach is versatile and scalable because it relies only on word embeddings, which can be easily obtained from large unlabeled data. Hence, in principle, this approach can be adapted to different languages. Furthermore, having an efficient, ready-to-use, language-independent sentence encoder can enable knowledge transfer between different languages in cross-lingual settings, empowering the development of efficient and performant NLP models for low-resource languages.
In this paper, we empirically investigate the generality of DCT representations across languages as both a single language model and a cross-lingual model in order to assess the effectiveness of DCT across different languages.

DCT as Sentence Encoder
In the signal processing domain, DCT is used to decompose a signal into component frequencies, revealing the dynamics that make up the signal and the transitions within it (Shu et al., 2017). Recently, DCT has been adopted as a way to compress textual information (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019). A key observation in NLP is that word vectors obey laws of algebra: King - Man + Woman ≈ Queen (Mikolov et al., 2013). Thus, given word embeddings, a sentence can be cast as a multidimensional signal over time, on which DCT is used to summarize the general feature patterns in word sequences and compress them into fixed-length vectors (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019).
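This encoding scheme can be sketched as follows. This is a minimal illustration assuming pre-computed word embeddings; the function name, the choice of k, and the zero-padding for short sentences are our own illustrative assumptions, not a verbatim reproduction of any cited implementation:

```python
import numpy as np
from scipy.fft import dct

def dct_sentence_embedding(word_vectors, k=2):
    # word_vectors: (sentence_length, embedding_dim) matrix of word embeddings.
    n, d = word_vectors.shape
    # Zero-pad sentences shorter than k so that k coefficients always exist
    # (padding behavior is an illustrative assumption).
    if n < k:
        word_vectors = np.vstack([word_vectors, np.zeros((k - n, d))])
    # Apply DCT along the sentence-length axis, independently per feature.
    coeffs = dct(word_vectors, norm='ortho', axis=0)
    # Concatenate the first k coefficients of every dimension: a k*d vector.
    return coeffs[:k].ravel()

# Toy example: a "sentence" of 5 word embeddings of dimension 3.
sent = np.random.randn(5, 3)
emb = dct_sentence_embedding(sent, k=2)
print(emb.shape)  # (6,)
```

Note that with k=1 this reduces to a scaled version of vector averaging, since the first coefficient of each dimension is the column sum divided by the square root of the sentence length.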
Mathematically, DCT is an invertible function that maps an input sequence of N real numbers to the coefficients of N orthogonal cosine basis functions of increasing frequencies (Ahmed et al., 1974). The DCT components are arranged in order of significance. The first coefficient (c[0]) represents the sum of the input sequence normalized by the square root of its length, which is proportional to the average of the sequence (Ahmed et al., 1974). The lower-order coefficients represent lower signal frequencies, which correspond to the overall patterns in the sequence. Hence, DCT is used for compression by preserving only the coefficients with large magnitudes, while the full set of coefficients can be used to reconstruct the original sequence exactly using the inverse transform (Watson, 1994).
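These properties are easy to verify numerically. A small sketch using the orthonormal DCT-II from SciPy; the input values and the choice of K are arbitrary:

```python
import numpy as np
from scipy.fft import dct, idct

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])  # arbitrary input sequence
N = len(x)

# Orthonormal DCT-II: c[0] is the sum scaled by 1/sqrt(N),
# i.e., proportional to the average of the sequence.
c = dct(x, norm='ortho')
assert np.isclose(c[0], x.sum() / np.sqrt(N))

# With all N coefficients, the inverse transform (DCT-III)
# reconstructs the sequence exactly.
assert np.allclose(idct(c, norm='ortho'), x)

# Compression: keep only the K largest-magnitude coefficients and
# reconstruct an approximation of the original sequence.
K = 3
top = np.argsort(-np.abs(c))[:K]
c_comp = np.zeros_like(c)
c_comp[top] = c[top]
x_approx = idct(c_comp, norm='ortho')
print(np.round(x_approx, 2))
```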
In NLP, Kayal and Tsatsaronis (2019) applied DCT at the word level to reduce the dimensionality of the embeddings, while Almarwani et al. (2019) applied it along the sentence length as a way to compress each feature in the embedding space independently. In both implementations, the top coefficients are concatenated to generate the final representation for a sentence. As shown in (Almarwani et al., 2019), applying DCT along the features in the embedding space renders representations that yield better results. Also, Zhu and de Melo (2020) noted that, similar to vector averaging, the DCT model proposed by Almarwani et al. (2019) yields better overall performance compared to more complex encoders; thus, in this work, we adopt their implementation to extract sentence-level representations.

Results: Figure 1 shows a heat-map reflecting the probing results of the different languages relative to English. Overall, French (FR) appears to be the closest to English (EN), followed by Spanish (ES), then German (DE), and finally Russian (RU) across the various DCT coefficients. Higher coefficients mostly yield better performance across tasks for FR, ES, and DE. We see the most variation, with worse results than English, on the syntactic tasks of TreeDepth, CoordInv, Tense, SubjNum, and ObjNum for RU. SOMO stands out for RU, where it outperforms EN. The variation in Russian might be due to RU being a more complex language that is morphologically rich with flexible word order (Toldova et al., 2015).
In terms of performance per number of DCT coefficients, we observe consistent performance gains across the different languages, similar to the English result trends. Specifically, for the surface-level tasks, among the DCT models the c[0] model significantly outperforms AVG, with an increase of ∼30 percentage points in all languages. The surface-level tasks (SentLen and WC) show the most notable variance in performance, in which the highest results are obtained using the c[0] model; however, the performance decreases in all languages when K is increased. On the other hand, for all languages, we observe a positive effect on the model's performance with larger K in both the syntactic and semantic tasks. The complete numerical results are presented in Table 5 in the Appendix. (The pre-trained embeddings are available at https://fasttext.cc.)

Approach
Aldarmaki and Diab (2019) proposed sentence-level transformation approaches to learn context-aware representations for cross-lingual mappings. While word-level cross-lingual transformations utilize an aligned dictionary of word embeddings to learn the mapping, sentence-level transformations utilize a large dictionary of parallel sentence embeddings. Since sentences provide contexts that are useful for disambiguating the specific meanings of individual words, sentence-level mapping yields better cross-lingual representations compared to word-level mappings. A simple model like sentence averaging can be used to learn transformations between two languages, as shown in (Aldarmaki and Diab, 2019). However, the resulting vectors fail to capture structural information such as word order, which may result in poor cross-lingual alignment. Therefore, guided by the results in (Aldarmaki and Diab, 2019), we further utilize DCT to construct sentence representations for sentence-level cross-lingual modeling.
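A sentence-level mapping of this kind can be sketched as a simple least-squares problem over parallel sentence embeddings. This is an illustrative sketch only; the function name and the exact least-squares formulation are our assumptions, not necessarily the procedure of Aldarmaki and Diab (2019):

```python
import numpy as np

def learn_sentence_mapping(src_embs, tgt_embs):
    # Learn a linear map W that takes source-language sentence embeddings
    # into the target space, via least squares over parallel sentence pairs.
    # src_embs, tgt_embs: (n_pairs, dim) matrices of aligned sentence vectors.
    W, *_ = np.linalg.lstsq(src_embs, tgt_embs, rcond=None)
    return W  # (dim, dim)

# Toy "parallel corpus": target vectors are an exact linear image of source.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 8))
tgt = src @ rng.normal(size=(8, 8))
W = learn_sentence_mapping(src, tgt)
assert np.allclose(src @ W, tgt, atol=1e-6)
```

In practice, the sentence embeddings on both sides would be produced by the same encoder (AVG or DCT), and the learned map would be applied to unseen source sentences before retrieval.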

Experimental Setups and Results
For model evaluation, we use the same cross-lingual evaluation framework introduced in (Aldarmaki and Diab, 2019). Intuitively, sentences tend to cluster with their translations when their vectors exist in a well-aligned cross-lingual space. Thus, in this framework, cross-lingual mapping approaches are evaluated using sentence translation retrieval by calculating the accuracy of correct sentence retrieval. Formally, cosine similarity is used to find the nearest neighbor of a given source sentence on the target side of the parallel corpus.
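The retrieval metric can be sketched as follows. A minimal illustration; the function name and the row-aligned data layout are our assumptions:

```python
import numpy as np

def retrieval_accuracy(src_embs, tgt_embs):
    # Row i of src_embs and row i of tgt_embs are assumed to be translations.
    # L2-normalize so the dot product equals cosine similarity.
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = s @ t.T                 # (n_src, n_tgt) cosine similarities
    nearest = sims.argmax(axis=1)  # best-matching target per source sentence
    return float((nearest == np.arange(len(s))).mean())

# Sanity check: in a perfectly aligned space, every sentence
# retrieves its own translation.
x = np.random.randn(50, 16)
print(retrieval_accuracy(x, x))  # 1.0
```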

Evaluation Datasets and Results
To demonstrate the efficacy of cross-lingual mapping using the sentence-level representations generated by DCT models, similarly to Aldarmaki and Diab (2019), we used the WMT'13 dataset, which includes the EN, ES, and DE languages (Bojar et al., 2013). We further used five language pairs from the WMT'17 translation task to evaluate the effectiveness of DCT-based embeddings. Specifically, we used a sample of 1 million parallel sentences from the WMT'13 common-crawl data; this subset is the same one used in (Aldarmaki and Diab, 2019). To assess the efficacy of the DCT models for cross-lingual mapping, we report the performance on the sentence translation retrieval task within the WMT'13 test set, which includes EN, ES, and DE as test languages (Bojar et al., 2013). Specifically, we first used the 1M parallel sentences for the alignment from each source language (ES and DE) to the target language (EN) independently. We evaluated the translation retrieval performance in all language directions, from the source languages to English (ES-EN and DE-EN), as well as between the source languages (ES-DE).
Similarly, we conducted a series of experiments on 5 different language pairs from the WMT'17 translation task, which include DE, Latvian (LV), Finnish (FI), Czech (CS), and Russian (RU), each of which is associated with an English translation (Zhang et al., 2018). For each language pair, we sampled 1M parallel sentences from the training corpus for the cross-lingual alignment between each source language and EN. We also used the test set available for each language pair to evaluate translation retrieval performance.
In our experiments, we evaluate the translation retrieval performance in all language directions using three types of word embeddings: (1) publicly available pre-trained word embeddings, on which we compare the performance of DCT against averaging, referred to hereafter as out-of-domain embeddings (Table 2); (2) domain-specific word embeddings that we trained on a genre similar to that of the translation task; and (3) contextualized word embeddings. We refer to the latter two hereafter as in-domain embeddings (Table 3). The evaluation scripts and the WMT'13 dataset described in (Aldarmaki and Diab, 2019) are available at https://github.com/h-aldarmaki/sent_translation_retrieval. We used the pre-processed version of the WMT'17 dataset; for more information, refer to (Zhang et al., 2018).
Out-of-domain embeddings: For all language pairs, DCT-based models outperform the AVG and c[0] models in the sentence translation retrieval task. In the direction to EN, while the c[0:2] model achieves the highest accuracy for the ES, DE, RU, and FI languages, the c[0:3] model achieves the highest accuracy for CS and LV. Specifically, the c[0:2] model yields increases of 5.59% to 30% in the directions from the source languages (ES, DE, RU, and FI) to English compared to the AVG model. Also, while the c[0:3] model yields a 13% gain over the baseline for CS, it provides the most notable increase of 38% for LV. For the opposite directions, EN to source, the DCT-based models also outperform the AVG and c[0] models. In particular, we observe accuracy gains of at least 3.81 percentage points using more coefficients in DCT-based models compared to the AVG and c[0] models for all languages. A similar trend is observed in the zero-shot translation retrieval between the two non-English languages (ES and DE), in which the DCT-based models outperform the AVG and c[0] models.

In-domain embeddings: To ensure comparability to state-of-the-art results, we further utilized in-domain FastText embeddings like those used in (Aldarmaki and Diab, 2019). As shown in Table 3, using in-domain word embeddings yields stronger results compared to the pre-trained embeddings used in the previous experiments (Table 2). On the other hand, we observe additional improvements using mBERT as word embeddings on all models. Furthermore, increasing K has a positive effect with both embeddings, in which c[0:1] demonstrates performance gains compared to the other models in all language directions. This trend is clearly observed in the zero-shot performance between the non-English languages.
Furthermore, as shown in Table 4, we obtained a state-of-the-art result using mBERT with c[0:1], achieving 91.83% average accuracy across all translation directions, compared to the 84.03% average accuracy of ELMo as reported in (Aldarmaki and Diab, 2019).

Conclusion
In this paper, we extended the application of the DCT encoder to multi- and cross-lingual settings. Experimental results across different languages showed that, as in English, DCT outperforms vector averaging. We further presented a sentence-level approach for cross-lingual mapping without any additional training parameters. In this context, DCT embeddings are used to generate sentence representations, which are then used in the alignment process. Moreover, we have shown that incorporating the structural information encoded in the lower-order coefficients yields significant performance gains compared to AVG in sentence translation retrieval.

Table 5 shows the complete numerical results for the probing tasks on all languages.