Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are among the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse the classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.


Introduction
Text genre or register (Biber, 1988), such as discussion forum, news article or poem, is one of the most important predictors of linguistic variation (Biber, 2012). Thus, register also crucially affects the automatic processing of language (Mahajan et al., 2015; Webber, 2009; Van der Wees et al., 2018). Yet, despite its importance, register information is not available in the web-crawled datasets that are widely used, e.g., for pre-training language models in modern NLP. This is a challenge, as better structured language resources would enable a more detailed understanding and more sophisticated use of this data.
While web register identification would allow better realization of the potential offered by web-crawled datasets, most previous web register identification studies have been limited by skewed datasets, low performance, and a near-exclusive focus on English. For example, Asheghi et al. (2014) and Pritsos and Stamatatos (2018) reported comparatively strong results, but their evaluations were based on datasets representing only a subset of the registers found online. With the CORE corpus, Egbert et al. (2015) were the first to present a dataset featuring the full extent of registers found on the open, searchable English web. While Biber and Egbert (2016b) demonstrated the possibility of automatic register classification using Stepwise Discriminant Analysis, improvements in modeling and more efficient methods remained necessary to reach practical levels of performance.

† The marked authors contributed equally to this paper.
A challenge in modeling web registers is that documents drawn from the unrestricted web do not always fit discrete classes but could rather be described in a continuous space (Biber and Egbert, 2018; Sharoff, 2018). Not all documents have clear characteristics of a single register, or even of any register at all. This is also reflected in the relatively low inter-annotator agreement reported for web register annotation (Crowston et al., 2010).
Very recently, however, the advances brought to NLP by neural networks have shown that registers can be identified even in a corpus featuring the full range of online language variation (Laippala et al., 2020a). Laippala et al. (2019) extended the possibilities of web register identification beyond English by presenting an online register corpus for Finnish (FinCORE) and demonstrating that web registers can be modeled also in a cross-lingual setting.
In this paper, we substantially extend this early work on cross-lingual web register identification through the following contributions: 1) we
introduce manually annotated web register datasets for two new languages, French and Swedish, 2) we demonstrate competitive performance for cross-lingual transfer of a register classification model from English to other languages in a zero-shot setting, and 3) we analyze zero-shot vs. monolingual training for register classification and the remaining challenges in both. In particular, using Transformer-based pre-trained language models, we show that a zero-shot cross-lingual approach outperforms the monolingual results achieved by a previously proposed state-of-the-art method for all three language pairs (En-Fr, En-Sv, and En-Fi), and that strong monolingual performance can be achieved with limited training data.

Data
We use four register-annotated corpora representing the unrestricted open web: the English CORE and Finnish FinCORE, which have been introduced in previous work (Egbert et al., 2015; Laippala et al., 2019), and two new corpora, FreCORE for French and SweCORE for Swedish. These novel datasets are released under open licences together with this paper. With these new resources, the possibilities for web register identification expand substantially. FreCORE and SweCORE are random samples of the 2017 CoNLL datasets (Ginter et al., 2017) originally drawn from Common Crawl. Both datasets were deduplicated using Onion (Pomikálek, 2011) with a threshold of 0.7 and an n-gram length of 5. All material not belonging to the body of text, such as boilerplate, was removed; titles, however, were preserved. The cleaning and pre-processing steps follow the procedure suggested in Laippala et al. (2020b). The register annotation of the datasets was conducted individually by two trained annotators with a linguistics background. Uncertain cases were discussed and resolved together with an annotation supervisor. The inter-annotator agreement, computed prior to these discussions, was 78% F1-score for FreCORE and 84% for SweCORE; this can be considered a lower bound.
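As a concrete reference for how multi-label agreement can be scored as an F1 percentage, the sketch below computes micro-averaged F1 between two annotators' label sets. The function name and the toy register labels are illustrative, not taken from the released code.

```python
def agreement_f1(annotator_a, annotator_b):
    """Micro-averaged F1 between two annotators' per-document label sets.

    Treating annotator A as 'gold' and B as 'predicted'; micro-F1 is
    symmetric here because swapping annotators only swaps precision
    and recall, leaving F1 unchanged.
    """
    tp = fp = fn = 0
    for labels_a, labels_b in zip(annotator_a, annotator_b):
        a, b = set(labels_a), set(labels_b)
        tp += len(a & b)   # labels both annotators assigned
        fp += len(b - a)   # labels only B assigned
        fn += len(a - b)   # labels only A assigned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: three documents with multi-label register annotations.
a = [{"NA"}, {"IN", "OP"}, {"ID"}]
b = [{"NA"}, {"IN"}, {"OP"}]
print(round(agreement_f1(a, b), 3))  # 0.571
```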
All datasets are similarly annotated across languages, and they all apply the same hierarchical register class taxonomy originally introduced for CORE. It includes eight main registers (e.g., Narrative) and approximately 30 sub-registers (e.g., News report within Narrative). The main and sub-register categories are illustrated in the appendix. When a document shares characteristics of several registers, it can be assigned several labels at both the main and sub-register level. These documents are called hybrids. As our focus in this paper is on general register categories, we initially pre-process all four corpora to remove the more specific sub-register labels.
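The pre-processing step of collapsing sub-register labels to their main registers can be sketched as follows; the 'MAIN/sub' label encoding and the helper name are hypothetical, assumed here only for illustration.

```python
def collapse_to_main(labels):
    """Map hierarchical register labels to their main-register prefix.

    Assumes a hypothetical 'MAIN/sub' encoding (e.g. 'NA/ne' for a news
    report within Narrative); plain main-register labels pass through.
    """
    return sorted({label.split("/")[0] for label in labels})

# A document tagged with a Narrative sub-register and an Opinion
# sub-register collapses to two main registers -- i.e. a hybrid.
print(collapse_to_main(["NA/ne", "OP/ob"]))  # ['NA', 'OP']
print(collapse_to_main(["IN/en", "IN"]))     # ['IN']
```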
The general register categories and their distributions, as well as the average document length and standard deviation for each class, are presented in Table 1 and Table 2, respectively. The register class Empty consists of texts whose register the annotators could not agree on. Due to the very small number of each type of hybrid label combination in the data, the class Hybrids in Tables 1 and 2 includes all documents that have more than one label. All four corpora feature hybrids among the four most frequent categories. The top four also include Informational persuasion in FinCORE, FreCORE, and SweCORE, while in CORE this label is relatively infrequent. Additionally, Opinion is notably more frequent in CORE and FinCORE than in FreCORE and SweCORE. These differences may reflect differences in data compilation. Table 2 shows that, on average, English documents are longer than documents in the other languages, whereas Swedish documents tend to be the shortest. Overall, document lengths vary widely in most classes, with the longest documents containing tens of thousands of words.

Experimental setup
The architectures and models we use are presented below; the code is available at https://github.com/TurkuNLP/Multilingual-register-corpora. We perform multi-label document classification, where each document can have zero, one, or several register labels. The experiments are divided into 1) a monolingual setup with training and evaluation on Finnish, French, Swedish, and English (as reference), and 2) a zero-shot cross-lingual setup with training on English and evaluation on the other languages.

BERT, Bidirectional Encoder Representations from Transformers (Devlin et al., 2019), is a state-of-the-art deep bidirectional language model pre-trained on large unlabelled corpora. BERT's architecture is a multi-layer Transformer encoder based on the original Transformer architecture introduced by Vaswani et al. (2017). We use cased BERT models (TensorFlow versions) through the Huggingface Transformers library (Wolf et al., 2020) with the following language-specific models: the original English BERT, Finnish FinBERT (Virtanen et al., 2019), French FlauBERT (Le et al., 2020), and Swedish KB-BERT (Malmsten et al., 2020). Additionally, we use Multilingual BERT (mBERT) (Devlin et al., 2019), which was pre-trained on monolingual Wikipedia corpora from 104 languages with a shared multilingual vocabulary.
XLM-RoBERTa (XLM-R; Conneau et al., 2020) is a multilingual language model that follows the Cross-lingual Language Modeling (XLM) approach (Conneau and Lample, 2019) and is based on the RoBERTa model (Liu et al., 2019), which shares the architecture of BERT. The authors argue that XLM and mBERT are undertuned and that the improved and prolonged training procedure of RoBERTa, in combination with more data (on average two orders of magnitude more for low-resource languages), is key to improving cross-lingual performance. XLM-R is trained on 2.5TB of filtered Common Crawl data comprising monolingual texts in 100 languages. It is claimed to be the first multilingual model to outperform monolingual models, as well as Multilingual BERT, in a number of experiments (Conneau et al., 2020; Libovický et al., 2020; Tanase et al., 2020).
As a baseline, we also apply a CNN (Convolutional Neural Network) architecture following Kim (2014). We modify the cross-lingual CNN used by Laippala et al. (2019), with a max-pooling layer and sigmoid output activation.
The French and Swedish data were divided into training, development and test sets using stratified sampling with a 50/20/30 split. For the BERT-based models, we used the large model size when available to maximize performance. We used a maximum sequence length of 512 tokens (with truncation at the end) and a batch size of 7, and performed a grid search over the learning rate (8e-6 to 6e-5) and the number of training epochs (3 to 7). For the CNN, we performed a grid search over the kernel size (1-2), learning rate (1e-4 to 1e-2), and prediction threshold (0.4, 0.5, 0.6).
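The hyperparameter tuning described above amounts to an exhaustive grid search over the candidate values. The sketch below shows this pattern with a stand-in scoring function; `toy_eval` replaces the actual fine-tune-and-evaluate step, which is not reproduced here.

```python
from itertools import product

def grid_search(train_eval, learning_rates, epoch_counts):
    """Exhaustive grid search; train_eval returns a dev-set score."""
    best_score, best_params = float("-inf"), None
    for lr, epochs in product(learning_rates, epoch_counts):
        score = train_eval(lr, epochs)
        if score > best_score:
            best_score, best_params = score, (lr, epochs)
    return best_params, best_score

# Stand-in for fine-tuning plus dev evaluation: a toy scoring function
# constructed to peak at lr=2e-5, epochs=5.
def toy_eval(lr, epochs):
    return 1.0 - abs(lr - 2e-5) * 1e4 - abs(epochs - 5) * 0.01

params, score = grid_search(toy_eval, [8e-6, 2e-5, 6e-5], range(3, 8))
print(params)  # (2e-05, 5)
```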

Results
In Table 3, we present the primary results on English, Finnish, French and Swedish monolingual classification with the models described in Section 3, as well as cross-lingual results with English as the source language and Finnish, French and Swedish as target languages. We report the mean and standard deviation of F1 over three repetitions.
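In the multi-label setup, per-register sigmoid scores are thresholded into a (possibly empty) label set, and scores over repetitions are summarised as mean and standard deviation. In the sketch below, the register abbreviations, the 0.5 threshold, and the run scores are illustrative assumptions, not values from Table 3.

```python
from statistics import mean, stdev

# Main register abbreviations (order is an illustrative assumption).
REGISTERS = ["NA", "IN", "OP", "ID", "HI", "IP", "LY", "SP"]

def predict_labels(probs, threshold=0.5):
    """Per-register sigmoid scores -> a (possibly empty) label list."""
    return [r for r, p in zip(REGISTERS, probs) if p >= threshold]

print(predict_labels([0.9, 0.1, 0.7, 0.2, 0.1, 0.0, 0.0, 0.1]))  # ['NA', 'OP']
print(predict_labels([0.2] * 8))                                  # []

# F1 over three repetitions, reported as mean and (sample) std.
runs = [0.684, 0.691, 0.676]
print(f"{mean(runs):.3f} +/- {stdev(runs):.3f}")  # 0.684 +/- 0.008
```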
In the monolingual settings, XLM-R large performs competitively compared to the monolingual models and clearly outperforms both mBERT and the CNN baseline. The lead of XLM-R over the monolingual models is substantial in all cases except for FinBERT, where the two perform within one standard deviation of each other. Our results support the claimed competitiveness of XLM-R large with monolingual models mentioned in Section 3. The English, Finnish and French BERT models achieve similar monolingual test results (73-74% F1-score), while the Swedish KB-BERT achieves the highest F1-score (81%). The Finnish classification task is seemingly easier due to the smaller number of classes; nevertheless, other factors may cause the difficulty of the task to differ between languages. For instance, the measured human inter-annotator agreements of 78% (Fr) and 84% (Sv) F1-score (see Section 2) represent a theoretical upper bound for the classification task and reflect the tendency of Swedish being easier to classify; the level of agreement has not been reported for Finnish. Although not strictly comparable, our results clearly outperform the previous state-of-the-art results achieved with the CNN (Laippala et al., 2019) in terms of F1, which in turn outperforms Biber and Egbert (2016b), who used the same corpus but in a multi-class setting.

Furthermore, Table 3 shows very strong zero-shot cross-lingual results with XLM-R large, with F1-scores in the 61-69% range. This represents a remarkably consistent relative decrease of 16.2-16.6% (11.8-13.8% absolute) from the monolingual scores of XLM-R. Its lead over mBERT increases from 6.6-8.4% absolute F1 to 7.8-11.4% in the cross-lingual settings, whereas its lead over the CNN grows from 15.1-18.8% to 17.5-25.4%. Most interestingly, zero-shot XLM-R even beats the monolingually trained CNN baselines by a significant margin for Finnish and French, while its lead remains within a standard deviation for Swedish.

In Figure 1, we illustrate the effect of training monolingual XLM-R large models with varying training set sizes and compare the performance against the reported zero-shot performance. The optimal monolingual hyperparameter settings for each language are used, while training the model instances on 100-900 examples each. We see that zero-shot cross-lingual performance is surpassed already with about 150 training instances for French, 225 for Swedish and 400 for Finnish, while performance seems to converge around 500 examples.
Previous studies have repeatedly shown that registers vary considerably in how well they are linguistically defined and thus how well they can be automatically identified (Biber and Egbert, 2018, 2016a; Laippala et al., 2020a). For instance, while texts in the IN (Informational description) and NA (Narrative) classes, such as encyclopedia articles and sports reports, have very distinctive characteristics and can be identified with very high reliability, others, such as information blogs in the IN class or advice in the OP (Opinion) class, receive much lower scores. Figure 2 presents confusion matrices for the predictions in the monolingual and cross-lingual settings, using the best-performing model. For the sake of simplicity, the multi-label predictions have been collapsed into multi-class by including all hybrids under one label, HYB, in Figure 2. In the monolingual settings, we can see that hybrids in particular present a challenge. This is expected, as they feature characteristics of several registers. Additionally, while IP (Informational persuasion) and NA are predicted with high performance in all three languages, the other classes display more variation. For instance, ID (Interactive discussion) reaches an F1-score of 90% (see appendix) in the French monolingual setting, whereas in Swedish and Finnish it is frequently misclassified, most likely because of the small number of examples in the training data.
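The collapsing of multi-label predictions into confusion-matrix classes can be sketched as below; treating the zero-label case as "EMPTY" is our assumption, mirroring the Empty class described in Section 2.

```python
def to_confusion_class(labels):
    """Collapse a multi-label prediction into a single class for the
    confusion matrix: two or more labels -> 'HYB', none -> 'EMPTY'."""
    if len(labels) > 1:
        return "HYB"
    return labels[0] if labels else "EMPTY"

print(to_confusion_class(["NA"]))        # NA
print(to_confusion_class(["NA", "OP"]))  # HYB
print(to_confusion_class([]))            # EMPTY
```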
The hybrids are also frequently misclassified in cross-lingual settings. Interestingly, register classes also feature clear differences in the extent to which the cross-lingual transfer affects the identification performance. The register class IN tends to be predicted strongly in all zero-shot language pairs. This is probably due to the IN class including documents with strong cross-lingual signals. For instance, IN includes Encyclopedia articles (see appendix), such as Wikipedia texts, that tend to be very similar across languages.
While most of the non-hybrid classes experience only a small drop in performance, the identification rate for IP and HI (How-to/Instructions) drops dramatically in the cross-lingual settings for all language pairs. The decrease for IP can be linked to its smaller proportion in the English data (see Section 2), but the drops experienced by IP and HI may also reflect the variation displayed by registers across languages. Biber (2014) showed that registers, such as spoken texts, display functional similarities across languages, which is obviously needed for high-quality transfer in register identification. However, analyzing the English CORE registers, Laippala et al. (2020a) noted that some registers, such as many blogs, depend highly on lexical characteristics reflecting the discussion topics. These topics, however, may vary extensively between languages, which may complicate transfer learning for these classes.

Discussion and conclusions
Despite the many opportunities that reliable recognition of text register would offer for the analysis and use of web documents, and the many efforts to address this task over the years, only limited progress has been made toward unrestricted web document register classification. Previous work has also focused almost exclusively on English.
In this study, we have introduced manual register annotation compatible with that of the large English CORE corpus for two languages previously lacking such a resource, namely French and Swedish. We also demonstrated that state-of-the-art multilingual neural language models support zero-shot transfer of register annotations from English to a Germanic, a Romance and a Finnic language at levels of performance broadly comparable to or better than previously published monolingual results on CORE.
Moreover, we demonstrated that only small amounts of monolingual training data are needed to reach or surpass this level of performance, which attests that reliable register identification in a new language is readily attainable using current pre-trained language models. We further compared and analysed the results of the monolingual and cross-lingual register classifiers, finding that certain registers, as well as hybrid texts combining characteristics of several registers, continue to pose challenges, in particular for cross-lingual transfer. In future work, we will build on these results to extend multi- and cross-lingual modeling in order to create massive multilingual register-annotated web corpora.

Appendix: Main registers and their sub-registers

Narrative: News report / news blog, sports report, personal blog, historical article, fiction, travel blog, community blog, online article
Informational description: Description of a thing, encyclopedia article, research article, description of a person, information blog, FAQ, course material, legal terms / conditions, report, job description
Opinion: Review, opinion blog, religious blog / sermon, advice
Interactive discussion: Discussion forum, question-answer forum
How-to/Instructions: How-to / instruction, recipe
Informational persuasion: Description with intent to sell, news+opinion blog / editorial
Lyrical: Songs, poem
Spoken: Interview, formal speech, TV transcript