MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark

There is a lack of research into the capabilities of recent LLMs to generate convincing text in languages other than English and into the performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks, which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE, a novel benchmarking dataset for multilingual machine-generated text detection comprising 74,081 authentic and machine-generated texts in 11 languages (ar, ca, cs, de, en, es, nl, pt, ru, uk, and zh) generated by 8 multilingual LLMs. Using this benchmark, we compare the performance of zero-shot (statistical and black-box) and fine-tuned detectors. Considering the multilinguality, we evaluate 1) how these detectors generalize to unseen languages (linguistically similar as well as dissimilar) and unseen LLMs and 2) whether the detectors improve their performance when trained on multiple languages.


Introduction
Machine text generation has progressed significantly in the past few months thanks to a new generation of large language models (LLMs). First, it was the arrival of ChatGPT and later GPT-4 that made inexpensive generation of text in a range of languages available to millions of people, with ChatGPT becoming the fastest-growing consumer application in history. Second, the introduction of LLaMA (Touvron et al., 2023b) opened up possibilities for researchers and practitioners to fine-tune LLMs inexpensively, consequently ushering in fine-tuned models like Alpaca (Taori et al., 2023) or Vicuna (Chiang et al., 2023) that mimic the capabilities of much larger (and more expensive) ones such as ChatGPT. The defining characteristic of this new generation of LLMs is not only the increased quality of text generation, but also their multilinguality.
Due to the potential for misuse of machine-generated text (MGT) for influence operations (Goldstein et al., 2023), disinformation (Buchanan et al., 2021), spam, or unethical authorship (Crothers et al., 2022a), there has been a substantial amount of research on the task of machine-generated text detection (Jawahar et al., 2020; Stiff and Johansson, 2022; Uchendu et al., 2023a). Although GPT-3 was already capable of generating text in languages other than English, and despite the availability of the multilingual BLOOM (Scao et al., 2022), most of the prior works on this task have been (until very recently) still focusing on GPT-2, which supports English only, or newer models like GPT-Neo (Gao et al., 2020), GPT-NeoX (Black et al., 2022), or GPT-J (Wang and Komatsuzaki, 2021), which were all trained on English-only datasets.
Recently, the first MGT datasets in languages other than English - AuTexTification for Spanish (Sarvazyan et al., 2023) and RuATD for Russian (Shamardina et al., 2022) - were made public for the detection task. There is also a dataset containing 5 languages (Chen et al., 2022), but it was obtained by machine translation with human corrections, which renders it less useful for MGT detection benchmarking due to the potential noise. Thus, a dataset comprising authentic texts in multiple languages from a single domain (i.e., a text form, such as a news article or social media post) is still missing, hampering the comprehensive evaluation of detection methods in a multilingual setting. At the same time, prior works have already shown that detectors fine-tuned on English data fail to generalize to other languages (e.g., to German in the case of Mitchell et al., 2023, showing a drop from 0.946 AUC ROC to 0.537), which is also confirmed by our results (see Table 1).
In this paper, we aim to address the shortcomings of prior works and focus on the multilingual aspect of the MGT detection task (a binary classification of a text as human-written or machine-generated). Our main contributions are: (1) We evaluate the cross-language generalization of fine-tuned detectors trained in monolingual vs. multilingual settings. More specifically, we evaluate how the detection methods fine-tuned on a specific language (monolingual) or on a set of training languages (multilingual) generalize to unseen languages. We observe a strong influence of language family and script on generalization and clear benefits of multilingual fine-tuning.
(2) We provide the first comprehensive multilingual benchmark of a range of state-of-the-art (SOTA) detection methods, comparing the performance of fine-tuned detectors and their ability to generalize to unseen LLMs with the performance of zero-shot statistical detectors, such as DetectGPT (Mitchell et al., 2023), and black-box methods, such as GPTZero.

Related Work
Large Language Models (LLMs). These are language models with an unprecedented number of parameters trained on massive amounts of data, including models such as ChatGPT powered by GPT-3.5 or GPT-4 (OpenAI, 2023), OPT (Zhang et al., 2022), LLaMA (Touvron et al., 2023a), PaLM (Narang and Chowdhery, 2022), LaMDA (Collins and Ghahramani, 2021), BLOOM (Scao et al., 2022), Vicuna (Chiang et al., 2023), Alpaca (Taori et al., 2023), etc. The scale of LLMs has led to emergent abilities, observed only with these models, and to the solving of several non-trivial NLG and NLI (Natural Language Inference) tasks. Among the most impressive is the ability to generate authentic-looking, human-like texts, nearly indistinguishable from human-written texts. Similarly impressive is the ability to generate coherent texts in languages other than English (Scao et al., 2022), as LLMs are often trained on more than 50 languages. Because of these abilities, LLMs can be used maliciously, e.g., to generate misinformation (Shevlane et al., 2023). To combat LLM misuse, generated-text detectors and benchmarks are required.
However, the vast majority of these datasets are in English. A few researchers released non-English datasets in Russian (Shamardina et al., 2022), Chinese (Pu et al., 2022b), Iberian languages (Sarvazyan et al., 2023), and M4, which contains Russian, Chinese, Urdu, Indonesian, and Arabic (Wang et al., 2023). In light of the increasing number of multilingual LLMs, we generate the largest multilingual dataset for machine-generated text detection, containing 11 languages.

Detection Methods. Prior works have shown that humans are already not capable of reliably distinguishing machine-generated from human-written text, with accuracy only slightly above random guessing (Uchendu et al., 2021), and even find MGTs more trustworthy (Zellers et al., 2019). Thus, researchers have proposed a variety of automatic MGT detection methods. These include stylometric-based, deep learning-based, statistics-based, and hybrid approaches (i.e., ensembles of at least 2 approaches) (Uchendu et al., 2023a).
For stylometric-based detectors, researchers used linguistic features to capture the unique writing styles of machine and human authors (Uchendu et al., 2020; Fröhling and Zubiaga, 2021; Kumarage et al., 2023). Due to the computational cost of extracting the linguistic features, researchers proposed deep learning-based detectors, which involve fine-tuned and other variants of BERT (Zellers et al., 2019; Uchendu et al., 2021; Bakhtin et al., 2019; Liyanage et al., 2022; Rosati, 2022). However, deep learning models have a few limitations: (1) they are susceptible to adversarial perturbations (Gagiano et al., 2021; Crothers et al., 2022b; Wolff and Wolff, 2020) and (2) they need a lot of labeled data to perform well. Thus, researchers proposed statistics-based techniques, which are robust to adversarial perturbations and are unsupervised, requiring minimal data (Gehrmann et al., 2019; Mitchell et al., 2023; Gallé et al., 2021; Su et al., 2023). However, while these statistics-based detectors are more robust to perturbations than deep learning-based techniques, they still underperform deep learning-based models in terms of non-perturbed performance. Therefore, researchers combine statistics-based and deep learning-based techniques to gain both adversarial robustness and high performance (Kushnareva et al., 2021; Liu et al., 2022; Uchendu et al., 2023b; Zhong et al., 2020).

Benchmark Dataset
As a suitable multilingual human-machine pair dataset containing authentic non-English texts and machine texts generated by SOTA text-generation models was not available, we have put together a new dataset, called MULTITuDE (benchmark dataset for MULTIlingual machine Text DEtection). Its human segment comprises texts in 11 languages (authentic news articles) from the MassiveSumm dataset (Varab and Schluter, 2021) (see Figure 2). Titles of the selected articles have been used in the prompts for 8 language models to generate corresponding machine texts. The titles were split into train and test portions of the dataset, ensuring that machine and human texts generated for the same title end up in the same split. The train split is used to fine-tune detectors in a monolingual (a single training language) as well as a multilingual (multiple training languages) manner; the test split of the dataset is used for the evaluation of the detectors' performance. Details regarding text generation and pre-processing of both human and generated texts can be found in Appendix B.
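The grouped split described above (keeping all texts that share a title in the same split) can be sketched as follows. This is an illustrative reimplementation, not the authors' actual code; the record structure and function name are our assumptions.

```python
import random

def grouped_split(records, test_ratio=0.3, seed=42):
    """Split records so that all texts sharing a title land in the same split.

    records: list of dicts with at least a "title" key; human and machine
    texts generated for the same title share that title (assumed schema).
    """
    titles = sorted({r["title"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(titles)
    n_test = int(len(titles) * test_ratio)
    test_titles = set(titles[:n_test])
    train = [r for r in records if r["title"] not in test_titles]
    test = [r for r in records if r["title"] in test_titles]
    return train, test
```

Splitting on titles rather than on individual texts prevents leakage: a detector must never see, at test time, a human article whose machine-generated counterpart was in its training data.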
Language Selection.We have deliberately selected major languages from three different language families as our training languages -English (Germanic), Spanish (Romance), and Russian (Slavic).To see whether linguistic similarity influences the transferability of the detection, we have selected two genealogically related test languages for all of them -Dutch and German for English, Czech and Ukrainian for Russian, Portuguese and Catalan for Spanish.On top of that, we have also generated test data for linguistically completely unrelated Arabic (Semitic) and Mandarin Chinese (Sino-Tibetan).
Most of the languages use the Latin script. Russian is the only training language that uses a different script - Cyrillic. We have deliberately selected Czech (Latin) and Ukrainian (Cyrillic) as its Slavic neighbours to see how the script affects the results. Arabic and Chinese use their own scripts. Overall, our selection of languages is still biased towards Indo-European languages and the Latin script.
The selected languages are just representatives chosen to see the effects and answer our research questions while avoiding a waste of resources (by including more languages than necessary). Using the published source code, the study can easily be extended to other languages.

Generated Texts. The summarized statistics of the MULTITuDE dataset are provided in Table 2. The dataset includes approximately 1000 human texts for each training language along with the corresponding MGTs from each generation model. For each test language, 300 human texts with the same amount of MGTs per model are included.
The linguistic analysis (see Appendix B.4) confirmed that all used text-generation models have been able to generate texts in the requested language in more than 95% of cases (based on FastText language detection), except for LLaMA 65B (still reaching over 85%), which failed mostly on Arabic and Chinese texts that it was not pre-trained on. The numbers of unique sentences and words per text are comparable to human texts, and the results from the subsequent experiments show that none of the LLMs generated texts that are especially easy to detect. Nevertheless, some artifacts in the machine texts may still be present, since such a detailed analysis of the generated texts was not performed.

Detection Methods
For the purpose of this benchmark, the MGT detection methods are divided into three categories. The first category includes the black-box detectors - zero-shot methods available either through a web interface or an API, providing little or no information about the underlying model or method used for detection, typically offered as commercial paid services. The second category includes statistical detectors - zero-shot methods relying on distributional differences between generated and human-written text. The third category includes the fine-tuned detectors - language-model-based methods, which require fine-tuning of the models for the MGT detection task. See Appendix C for a complete list and more details on the detection methods used in the benchmark.
Black-Box Detectors. We examine popular commercial black-box detectors, specifically ZeroGPT and GPTZero. Despite their wide use and support of non-English languages, the extent of their zero-shot multilingual and cross-lingual proficiency in detecting MGT remains unknown. The training methodologies, weight parameters, and the specific data used for these detectors remain undisclosed. We interacted with these detectors via a subscription-based API, enabling us to assess their performance on our multilingual dataset.
Statistical Detectors. We evaluate our dataset on all the current baseline and SOTA statistics-based detectors that had previously been shown to perform very well on English datasets (He et al., 2023; Mitchell et al., 2023), with all models (except Entropy) achieving over 75% performance - Log-likelihood, Rank, Log-Rank, Entropy, GLTR Test-2, and DetectGPT. The main benefit of these techniques is that they require no training. Instead, they utilize the probability of each word in a piece of text to distinguish MGT from human-written texts. Typically, these statistics-based models use GPT-2 Medium to get the probability of the words; however, as we are evaluating on a multilingual dataset, we use mGPT, a multilingual GPT-based model. Further, for DetectGPT, an additional model is used to generate perturbations of the original text, so we keep the default perturbation model - T5 (Raffel et al., 2020).
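To make the statistical features concrete, the sketch below computes the per-text statistics these detectors rely on (mean log-likelihood, mean token rank, mean entropy) from a causal LM's per-position probability distributions. In practice the distributions come from mGPT; here they are passed in as plain dicts so the sketch stays self-contained, and the function name and input format are our own.

```python
import math

def token_statistics(token_probs):
    """Compute per-text detector features from per-token distributions.

    token_probs: list of (observed_token, {token: prob}) pairs, one per
    position, as a stand-in for a causal LM's next-token distributions.
    Returns (mean log-likelihood, mean rank, mean entropy).
    """
    lls, ranks, ents = [], [], []
    for tok, dist in token_probs:
        # log-likelihood of the token the text actually contains
        lls.append(math.log(dist[tok]))
        # rank of the observed token when candidates are sorted by prob (1 = top)
        ranked = sorted(dist, key=dist.get, reverse=True)
        ranks.append(ranked.index(tok) + 1)
        # Shannon entropy of the full next-token distribution
        ents.append(-sum(q * math.log(q) for q in dist.values() if q > 0))
    n = len(token_probs)
    return sum(lls) / n, sum(ranks) / n, sum(ents) / n
```

The intuition behind these features: machine-generated text tends to consist of tokens the LM itself finds likely (high log-likelihood, low rank), whereas human text more often picks lower-ranked tokens.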
Fine-Tuned Detectors. We have selected the 7 most popular HuggingFace language models representing SOTA while taking multilinguality into account.
In the experiments, these detector models have been fine-tuned on the MGT detection task, taking various combinations of source languages and text-generation models' outputs contained in the MULTITuDE dataset. For each of the three training languages separately (English, Spanish, Russian), for all training languages combined, and for English with 3 times more training samples, we have fine-tuned these detector models for each generator separately and for the data of all generators combined. This yielded 45 fine-tuned versions of each detector model, resulting in 315 fine-tuned detection methods in total. Details regarding the fine-tuning process can be found in Appendix D.
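The counts follow directly from the combinatorics of the setups (the setup labels below are ours):

```python
# 5 training-language setups: en, es, ru, all three combined, and en with 3x data
language_setups = ["en", "es", "ru", "all", "en3"]
# 9 generator setups: each of the 8 LLMs separately, plus all generators combined
generator_setups = 8 + 1
versions_per_base_model = len(language_setups) * generator_setups  # 45
total_detectors = versions_per_base_model * 7  # 7 base detector models -> 315
```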

Experiments and Results
We evaluate various aspects of the multilingual capabilities of the existing SOTA MGT detection methods. Mainly, we focus on detection capabilities in English and non-English languages. However, we also analyze cross-lingual relations and the generalization capabilities of detectors fine-tuned on a specific language and a specific MGT generator's data.
First, in Table 3, we provide a comparison of the detectors' performance evaluated on the whole multilingual MULTITuDE benchmark test data (i.e., data balanced across 11 languages).
There are 315 fine-tuned detection methods, which is infeasible to show in a single table. For clarity, the table contains all evaluated black-box and statistical methods, but includes only the best-performing version of each base model of the fine-tuned methods (i.e., only one fine-tuned version of XLM-RoBERTa is provided, with information about the language and generator LLM data used for training). Performance evaluation of all versions can be found in the associated GitHub repository. The results are ordered according to the achieved macro average F1-score (since the test data are highly imbalanced in terms of machine vs. human classes). This metric is used in all experiments if not stated otherwise. In the table, we also show other standard performance metrics, such as the weighted average F1-score, Precision, Recall, Accuracy, and FPR (false positive rate) with FNR (false negative rate). These metrics are calculated based on a default classification threshold of 0.5. Such a threshold can be calibrated based on various aspects (such as minimizing FPR or maximizing Recall). Therefore, we also show AUC ROC (area under the receiver operating characteristic curve), which is a threshold-independent metric calculated from prediction probabilities rather than the predictions themselves. Unfortunately, due to missing prediction probabilities, it is available only for the fine-tuned methods in our results. It must be noted that even when using optimal thresholds maximizing true positive rate and minimizing false positive rate, the key conclusions reported in this paper hold. We use the mentioned default threshold also when reporting results in the rest of this paper.
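The distinction between the two headline metrics can be illustrated with minimal reference implementations: macro F1 depends on the chosen threshold (0.5 by default here), while AUC ROC is computed purely from the ordering of the prediction probabilities.

```python
def macro_f1(y_true, y_prob, threshold=0.5):
    """Macro-averaged F1: F1 is computed per class, then averaged,
    so the minority class weighs as much as the majority class."""
    y_pred = [int(p >= threshold) for p in y_prob]
    f1s = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def auc_roc(y_true, y_prob):
    """Threshold-independent AUC: probability that a randomly chosen
    positive scores higher than a randomly chosen negative (ties count 1/2)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because AUC ROC never applies a threshold, it stays informative when a detector's probabilities are well-ordered but badly calibrated, which is exactly why both metrics are reported where probabilities are available.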
Based on the results, we can clearly see that the fine-tuned methods outperform the others when utilizing training data from all LLMs and all training languages (with two exceptions, where fine-tuning on a single language performed better). We can also notice that the zero-shot methods cannot clearly distinguish between human texts and machine texts generated by the newest LLMs. However, this is evaluated across the data of all test languages combined; the results can differ among languages. In the following subsections, we thus report the results per individual test language.

Zero-Shot Setting
We aim to answer the following research question: How capable are zero-shot (statistical and black-box) detectors of detecting MGT in multiple languages? The objective is to see how well these detectors can detect machine text generated by the newest LLMs and whether they are able to detect MGT in non-English languages.
To answer this question, we run these detectors on the test split of the MULTITuDE dataset and analyze their per-language performances.
(1) Statistical detectors cannot cope with multilingual data. From Table 3, we observe that these models achieve about 47% F1-score. This suggests that these statistics-based models are unable to perform well under this multilingual constraint. Also, we observe that Rank, Entropy, and DetectGPT achieve the same performance. This is because all 3 models predict only one class, the machine class. The MGTBench implementation of these methods, used in our experiments, uses a Logistic Regression classifier for binary predictions with default parameters. For the Entropy-based method, we have also used a Random Forest classifier with hyperparameters optimized using Randomized Grid Search with 5-fold cross-validation and 1k iterations (details regarding the optimized hyperparameters can be found in Appendix D), achieving slightly higher performance. Notably, we can see that such a model also predicts the human class, although negligibly, meaning the method actually works. Finally, the low performance of these previously high-performing statistical models suggests the non-trivial nature of evaluating on a multilingual dataset.
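The identical scores of Rank, Entropy, and DetectGPT follow directly from this degeneracy: a classifier that always outputs the machine class has a macro F1 determined solely by the class balance of the test set, regardless of which feature fed it. A minimal illustration (the 8:1 machine-to-human ratio is our reading of the test-set composition, eight generators' outputs per one human set):

```python
# Macro F1 of a degenerate detector that predicts "machine" for every input.
def always_machine_macro_f1(n_human, n_machine):
    # machine class: recall = 1, precision = share of machine texts in the set
    prec = n_machine / (n_human + n_machine)
    f1_machine = 2 * prec * 1.0 / (prec + 1.0)
    f1_human = 0.0  # no human predictions -> zero recall and zero F1
    return (f1_machine + f1_human) / 2
```

With 300 human texts against 2,400 machine texts, this degenerate baseline lands near 0.47 macro F1, matching the scores such detectors show in Table 3.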
(2) Transferability to non-English languages cannot be properly evaluated. This is due to the overall low performance (e.g., predicting a single class only) of the statistical and black-box detectors (even on English, as previously mentioned). Per-language results show that black-box detectors outperformed statistical detectors on English data, but their performance on other languages is the same or even worse (see Table 10 in Appendix E).

Monolingual Generalization
In this experiment, we aim to answer the following research question: Do detectors fine-tuned in monolingual settings generalize to other languages? For example, will a detector fine-tuned only on English data perform well on Spanish? Is there a relation between how close languages are and how well the detectors generalize?
To answer this question, we use various versions of detectors fine-tuned on individual languages' data. To better show language dependencies, we perform this experiment for each generator's data separately. For example, the XLM-RoBERTa model is fine-tuned on GPT-4 machine data (plus human data) from the train split for English, and evaluated on GPT-4 machine data (plus human data) from the test split for each language separately (English, Spanish, etc.).
Table 4 shows the aggregated performance across all generators and all multilingual detectors (i.e., detectors having a multilingual base LM). We use only multilingual detectors here because they have the best performance; the cross-lingual generalization capability of English-only models is worse (see Table 3). For each test language, we test whether the differences between the train languages observed in Table 4 are statistically significant. To do this, we conduct repeated-measures ANOVA tests for each test language: we use the macro F1-score for a given test language as the dependent variable, the combinations of detectors and text generators as "subjects", and the train language as an independent within-subjects variable. For all 11 test languages, the observed differences are statistically significant (p < 0.05). We further conduct post hoc pairwise tests between pairs of train languages for each test language for a more in-depth analysis. We also show how the performance for individual languages correlates in Table 5. For completeness, we also provide full results in Appendix E in Tables 11-13 for the English, Spanish, and Russian training languages, respectively. Several observations can be made based on our results. (1) The results confirm that the monolingually fine-tuned detectors are able to generalize to other languages, although with some performance degradation. There are significant differences in the performance achieved for individual test languages (ranging from 0.54 to 0.96).
(2) Linguistic similarity matters. The results indicate that the similarity between languages plays a role in how well detectors generalize between them. Spanish dominates both Catalan and Portuguese; similarly, Russian works really well for Ukrainian. The correlations also clearly show that the performance of similar languages correlates. Czech is the one exception to this trend, but this might be caused by the fact that it is Slavic (more similar to Russian) while also using the Latin script (making it more similar to the other Latin-script Indo-European languages).
(3) English is an outlier language. Overall, it has a low (but statistically significant) correlation with other related languages, and it is the only language that has a negative correlation with any other language. It is outperformed by Spanish in most cases, even for the languages from its own language family; the observed differences in performance between using Spanish or English as the training language are statistically significant for both German and Dutch. At the same time, detectors trained on the other training languages (Spanish and Russian) have unusually weak performance on English. We hypothesize that this is caused by the fact that English is often the most common language in the pre-training data of both the generators and the detectors, which might lead to different behavior (regarding cross-lingual capability) for this particular language (e.g., the perplexity might be lower).
(4) Languages with non-Latin scripts are correlated. Even though Russian is completely unrelated to Arabic or Chinese, it has the best performance as a training language for them (although the differences in performance when using Russian or Spanish as the training language are not statistically significant). The non-Latin-script languages seem to correlate well with each other. This might indicate that the models behave differently for the Latin script (which is, again, by far the most common) than for other scripts.

Multilingual Generalization
In this experiment, we aim to answer the following research question: Do detectors fine-tuned in multilingual settings generalize better to unseen languages than monolingually fine-tuned ones? The objective is to see whether it is beneficial to train detectors on multilingual rather than monolingual data with regard to transferability to other languages.
For the purpose of this experiment, we fine-tune the detectors using the train samples of all three languages combined. The train set consists of 1k MGTs and 1k human texts for each train language (this train set is denoted all in the results). The evaluation is again done for each LLM separately. To see whether the performance is not simply a result of a higher number of train samples, we also fine-tune the detectors with a comparable amount of English-only data, i.e., approximately 6k train samples (this train set is denoted en3). The mean results are provided in the bottom two rows of Table 4, analogously to the previous experiment. For completeness, the full results of each detector on each LLM's generated data are provided in Appendix E in Tables 14 and 15.
The multilingually fine-tuned detectors perform better on unseen languages than the monolingually fine-tuned ones. The observed differences in performance between using all three train languages (all) and all other train setups are statistically significant in the case of Czech, German, Dutch, and Portuguese. For all other test languages, the multilingually fine-tuned detectors also perform better (with the sole exception of Ukrainian, where the detectors trained on Russian slightly outperform the ones trained on all), but the differences from the best monolingually fine-tuned detectors are not statistically significant. The reason for the better performance of the multilingually fine-tuned detectors might be the higher number of training samples. Indeed, when we look at the results of detectors fine-tuned with the en3 train set, they achieve slightly (but mostly not statistically significantly) better performance in almost all cases compared to the original English fine-tuned detectors (performing almost the same as the multilingually fine-tuned detectors on English). However, the generalization to other languages is still significantly better for the multilingually fine-tuned detectors (en3 having a minimum of 0.56 for Arabic vs. all having a minimum of 0.77 for Chinese). Thus, regarding transferability to other languages, the detectors fine-tuned in a multilingual manner seem stronger (for detectors with multilingual base models as well as for the ones with monolingual base models; see Appendix E for the latter).

Cross-Generator Generalization
We also aim to answer the research question: How do fine-tuned detectors trained on data from a single LLM perform in detecting MGT produced by different LLMs? Analogously to the cross-lingual evaluation in Section 5.2, the objective of this experiment is to scrutinize the cross-generator generalizability of each individual detector fine-tuned on data from a single LLM.
Table 6 shows the correlation in the performance of each individual LLM. Comprehensive results (i.e., all the macro F1-scores, means, and standard deviations separated per language) are provided in Appendix E in Tables 16-20.
LLM similarity matters. We discern two distinct groups among the 8 generators (text-davinci-003, gpt-3.5-turbo, gpt-4, alpaca-lora-30b, vicuna-13b, llama-65b, opt-66b, opt-iml-max-1.3b) whose detection performances correlate with each other.


Discussion

Linguistic similarity matters. Our results show that the linguistic similarity between languages influences how well they generalize to each other and how much the performance for the languages correlates. The typology of the languages, but also the script they use, is important. Multilingual MGT detectors should be trained and tested with a diverse set of languages to ensure the inclusivity of their performance. Yet, practical development is often hindered by the fact that different LMs (used both as generators and detectors) support different sets of languages to different extents, making it hard to create one-model-fits-all solutions.
English is an inappropriate default. As mentioned previously, the performance on the English language is an outlier, and the models do not generalize from this language to other languages that well. English is often used as the de facto default language for many NLP use cases, including serving as the source language for cross-lingual learning. This should be reconsidered for multilingual MGT detection.

Conclusion
In this paper, we provide the first comprehensive benchmark of black-box, statistical, and fine-tuned machine-generated text detection methods in multilingual settings using our novel MULTITuDE dataset, covering 11 languages and 8 SOTA LLMs. Our results show that most currently available black-box methods do not work in multilingual settings and that the statistical approaches lag behind the fine-tuned ones. We also show that fine-tuning models in a multilingual manner (i.e., with train data in multiple languages) results in better performance of detectors on unseen languages compared to monolingual fine-tuning. The generalization is strongly affected by the language script as well as the language family branches of the train and test languages. Also, English seems to be a particularly inappropriate choice of training language if one aims for generalization of machine-generated text detection to non-English languages. This further emphasizes the importance of creating multilingual benchmarks for machine-generated text detection such as MULTITuDE. As future work, we plan to extend it with a more diverse set of languages (in terms of scripts and language families) and with texts from other domains, especially social media.

Limitations
Language Selection. Our work is limited by the final selection of 3 training and 11 testing languages. This already allowed us to discover interesting linguistic properties of the detection methods, but based on our work we still cannot tell how they would behave for all the other languages. Non-European languages especially are still a blind spot in our evaluation.
Limited Amount of Training Data. Another limitation, closely related to the previous one, is that the amount of data we use for benchmarking is limited. Apart from using different languages, the data could also be expanded with different domains, writing styles, etc. The amount of training data we use is also limited (several thousand samples), and simply extending the existing data could also yield additional improvements.
Limited Selection of Generative Language Models. In the end, we have selected and experimented with 8 generative language models that are capable of generating multilingual content. It is hard to ascertain how generalizable the results are to all the other language models that are being or will be developed in the future with different training data and different training regimes.

Ethics Statement
As a part of this paper, we introduce the MULTITuDE dataset consisting of human-written and machine-generated texts. The human-written texts are news articles collected in the MassiveSumm dataset (Varab and Schluter, 2021). The MassiveSumm dataset does not specify a license under which the data are published, as its public version only contains a list of URLs and a software package for their downloading and processing. Thus, we can assume that the news articles are protected by copyright, which, however, allows their use for non-commercial research such as our work. Although most of the LLMs we used were hosted at our premises, we also used the OpenAI API. As a part of the prompts, we were sending headlines of the news articles to the API; these, however, are not used by OpenAI to train or improve their models (which would constitute a commercial use) and are retained for a maximum duration of 30 days, after which they are deleted (https://openai.com/policies/api-data-usage-policies). Regarding the used LLMs, we made sure to follow their terms of use as well. LLaMA models (and their variants Alpaca and Vicuna) are licensed for non-commercial use only. Additionally, outputs of OpenAI services cannot be used to "develop models that compete with OpenAI." Respecting these limitations, we publish the MULTITuDE dataset containing both the human-written texts with attribution (original source) and the machine-generated texts only for non-commercial research purposes.
Intended Use. The collected dataset is primarily intended for research on multilingual machine-generated text detection. We used it for binary classification, but it could also be employed for multi-class classification (i.e., authorship attribution as defined in Uchendu et al., 2021, 2023a). We also publish the code for analysis and reproduction of our results, including the training (fine-tuning) of the detection methods. These are also intended for research purposes only. They are not intended (in their current form) to be used in actual deployment, where they would automatically classify texts as human-written or machine-generated.
Failure Modes. As already noted in the limitations, the fine-tuned detectors, while showing promising performance in our experiments, might fail when used on unseen languages or on texts from different domains or writing styles. Additionally, they can fail to generalize to other unknown LLMs, decoding strategies, or obfuscation efforts. The potential harms come not only from false negatives (i.e., failing to detect machine-generated texts), but also (and potentially even more so) from false positives (i.e., falsely flagging a text as machine-generated when it was in fact human-written). It is also worth noting that there are many non-malicious uses of machine-generated texts (e.g., proofreading, translation, etc.), which need to be considered before any use of the detection methods trained on our collected dataset for purposes beyond research.
Biases. Although we selected languages from different language families and with different scripts (see Section 3), the dataset is still biased towards Indo-European languages and the Latin script. Because the training data consist of news articles written in a standardized form of each included language, detectors fine-tuned on the dataset might be biased with respect to the use of dialects, slang, or code-switching, which could potentially harm individuals from some ethnic groups or social origins.
Misuse Potential. We believe that there is only a limited possibility of misuse of our dataset. First, the dataset is published for research purposes only. Second, the machine-generated texts, although inauthentic and most likely false, should not cover any sensitive topics. Also, the prompts asked the LLMs to generate news articles given a headline; we did not prompt the LLMs to intentionally generate disinformation. Their potential harm and impact in case of misuse is therefore limited.
Collecting Data from Users. The collection and processing of the dataset did not involve any crowd workers or other annotators. We do not intentionally collect or store any personal data as a part of this research. Some personal data (e.g., names) might be generated by the LLMs, but we can assume these to refer mostly to public figures that could have appeared in the training data of the LLMs.
Potential Harm to Vulnerable Populations. To the best of our knowledge, the dataset does not cover any sensitive topics beyond what is normally covered in the news. As already noted in Biases, the dataset does not include texts in different writing styles, dialects, or slang that may be used by marginalized populations, and the detectors fine-tuned on the dataset could thus fail in such cases.

B Data Pre-processing
For the MULTITuDE dataset creation, we used human texts from the original articles included in the MassiveSumm (Varab and Schluter, 2021) dataset. We used the processed data provided on request by the authors as well as the CommonCrawl-based links published in the GitHub repository. Both sources result in per-language datasets (split into files according to language).

B.1 Human-Text Pre-processing
We selected 3 languages for training (detector fine-tuning) and 8 more for testing (11 languages in total). For each of the selected languages, we took the first 50,000 samples (if available for that language) from each data source, resulting in up to 100,000 samples per language. The texts of these samples were stripped, meaning that white-space characters at the beginning and end of the texts were removed. Out of these samples, we dropped samples with missing values and duplicated samples. Then, we dropped samples whose texts contained 5 or fewer words (where a "word" is a white-space-delimited substring).

Table 7: Overview of the number of samples per language from the MassiveSumm dataset remaining after each pre-processing step.
The MassiveSumm per-language datasets contained texts in other languages, meaning that some texts were in a different language than the one indicated by the file name. Therefore, we performed language detection on the samples and dropped those that did not match. For language detection, we used a combination of the FastText (Joulin et al., 2017) and polyglot tools. We took a language prediction into account only if its probability score ("confidence") was at least 0.9. We performed such detection separately on the title and the first sentence of the text of a given sample, resulting in 4 predictions, out of which a majority vote gave the final detection result.
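For illustration, this filtering-and-voting scheme can be sketched as follows; the detector callables stand in for the actual FastText and polyglot predictions, and all names are illustrative, not the released code:

```python
from collections import Counter

CONFIDENCE_THRESHOLD = 0.9  # predictions below this confidence are ignored

def majority_language(title, first_sentence, detectors):
    """Run each detector on both the title and the first sentence
    (2 detectors x 2 inputs = 4 predictions), keep only predictions
    with confidence >= 0.9, and majority-vote over the survivors."""
    votes = []
    for detect in detectors:                  # e.g. FastText and polyglot wrappers
        for text in (title, first_sentence):
            language, confidence = detect(text)
            if confidence >= CONFIDENCE_THRESHOLD:
                votes.append(language)
    if not votes:
        return None                           # no confident prediction at all
    return Counter(votes).most_common(1)[0][0]
```

A sample would then be dropped when `majority_language(...)` disagrees with the language indicated by the source file.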
Table 7 provides the number of texts per language available after each of the above-mentioned pre-processing steps.
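As a minimal sketch (assuming plain Python strings as input; the actual pipeline and helper names differ), the filtering steps above amount to:

```python
def preprocess_human_texts(samples):
    """Filter raw samples as described in Section B.1: strip
    leading/trailing whitespace, drop missing values and duplicates,
    then drop texts with 5 or fewer whitespace-delimited words."""
    seen = set()
    kept = []
    for text in samples:
        if text is None:
            continue                  # missing value
        text = text.strip()           # strip surrounding whitespace
        if not text or text in seen:
            continue                  # empty or duplicate text
        if len(text.split()) <= 5:
            continue                  # 5 or fewer words
        seen.add(text)
        kept.append(text)
    return kept
```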
After pre-processing, we pseudo-randomly sub-sampled the English data to 3,300 samples, the Spanish and Russian data to 1,300 samples, and the others to 300 samples. For each language, 300 samples are used for the test split, while the remaining ones form the train split of the dataset.
The selected numbers of samples are based on our preliminary study using the existing datasets: TuringBench English data (Uchendu et al., 2021), AuTexTification Spanish data (Sarvazyan et al., 2023), and RuATD Russian data (Shamardina et al., 2022). We experimented extensively and chose the minimal number of samples needed to fine-tune the detectors and properly evaluate them. Specifically, we compared performance using smaller numbers of samples (500, 600, 1,000, and 1,500 human samples where available) against all available samples (5,000 for English, 1,150 for Spanish, 2,450 for Russian). These experiments showed that 1k human samples and 1k samples per text generator suffice for training, with a negligible drop in the performance of fine-tuned detectors (i.e., within 5%). For testing, 300 texts per class proved to be enough, giving the same detector performance values as using all the available samples. In addition, since we experiment with a smaller number of samples in the train sets, we provided 3k human texts for the English train data to ensure that the performance effect of multilingually fine-tuned detectors is not simply due to a higher number of train samples.

B.2 Machine Text Generation
Titles (i.e., headlines) of the human-text samples were used in prompts to generate corresponding machine texts by multiple large language models. For instruction-following models, an instruction-based prompt was used (a universal prompt in English instructing the model to generate text in a target language, corresponding to the title in that language); for the other models, the pure title was used. The instruction-based prompt had the following form (where language_name and headline are variables inserting strings specifying the language and the title of an individual text sample):

You are a multilingual journalist.\n\nTask: Write a news article in {language_name} for the following headline: "{headline}". Leave out the instructions, return just the text of the article.\n\nOutput:

Settings for the text generation include a minimal length of generated text of 200 tokens, a maximal length of 512 tokens, the number of returned sequences set to 1, and sampling activated with a beam number of 1, top_k of 50, and top_p of 0.95. For models available via the OpenAI API, we set only the maximal number of tokens to 512 and top_p to 0.95.
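The prompt construction and decoding settings described in this subsection can be summarized in the following sketch. The keyword names follow the Hugging Face transformers generate() convention; whether the length limits were applied via the *_new_tokens variants (rather than min_length/max_length) is our assumption, and the exact whitespace of the template is reconstructed:

```python
PROMPT_TEMPLATE = (
    'You are a multilingual journalist.\n\n'
    'Task: Write a news article in {language_name} for the following '
    'headline: "{headline}". Leave out the instructions, return just '
    'the text of the article.\n\nOutput:'
)

def build_prompt(language_name, headline):
    """Fill the universal English instruction prompt for one sample."""
    return PROMPT_TEMPLATE.format(language_name=language_name, headline=headline)

# Decoding settings from Section B.2; for OpenAI API models only the
# maximal number of tokens (512) and top_p (0.95) were set.
GENERATION_KWARGS = {
    "min_new_tokens": 200,
    "max_new_tokens": 512,
    "num_return_sequences": 1,
    "do_sample": True,
    "num_beams": 1,
    "top_k": 50,
    "top_p": 0.95,
}
```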

B.3 Generated-Text Pre-processing
Pre-processing of the generated texts included text stripping (i.e., removal of white-space characters from the beginning and end of texts), removal of prompts from the generated text (both the title and the instruction-based prompt), and removal of an unfinished sentence from the end of the text (if more sentences are present). In order to achieve a similar text-length distribution between human and machine texts, each machine text is shortened if the corresponding human text (whose title was used in the prompt) is shorter. Similarly, the human texts are shortened towards the mean length of the corresponding machine texts. Shortening occurs only if the difference between the corresponding machine- and human-text lengths is greater than 5 words and if more than one sentence is present, and it is performed by removing the last sentence from the longer text until the condition is met. Such texts are then processed by FastText full-text language detection, and language mismatches are analyzed (see the next subsection).
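The length-matching shortening can be sketched as follows; the naive '. '-based sentence splitting is a simplification of whatever sentence segmenter was actually used, and the function name is ours:

```python
def word_count(text):
    """Count whitespace-delimited substrings, as in the paper."""
    return len(text.split())

def shorten_to_match(longer, target_word_count):
    """Drop trailing sentences from `longer` until its word count is
    within 5 words of `target_word_count`, stopping once only one
    sentence remains (shortening requires more than one sentence)."""
    sentences = longer.split(". ")    # naive sentence segmentation
    while (len(sentences) > 1
           and word_count(". ".join(sentences)) - target_word_count > 5):
        sentences.pop()               # remove the last sentence
    return ". ".join(sentences)
```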
After an initial analysis of the generated texts, we noticed that multiple prompts were duplicated. In order to preserve consistency, we moved all texts generated for the duplicated prompts to the same (train) split of the dataset; the intuition is to avoid having texts generated for the same prompt in both splits. We then dropped generated samples with 5 or fewer words and dropped text duplicates, ensuring that no text sample has multiple labels. Fortunately, even after the removal of duplicated texts from the final dataset, the numbers of samples are not significantly reduced. The smallest number in the train split is that of human Spanish samples, with 937 unique texts (out of the intended 1,000). The smallest number in the test split is that of llama-65b English samples, with 275 unique texts (out of the intended 300). Thus, the numbers of samples for each text-generation language model and for each language are still well balanced.
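A minimal sketch of this split-consistency step, assuming samples are plain dicts (the actual data structures and field names differ):

```python
from collections import Counter

def assign_splits(records):
    """`records` are dicts with 'prompt', 'text', and a tentative 'split'.
    Any prompt occurring more than once has all of its texts forced into
    the train split (so no prompt spans both splits); exact text
    duplicates are then dropped so no text carries two labels."""
    prompt_counts = Counter(r["prompt"] for r in records)
    seen_texts = set()
    out = []
    for r in records:
        if prompt_counts[r["prompt"]] > 1:
            r = {**r, "split": "train"}   # keep duplicated prompts together
        if r["text"] in seen_texts:
            continue                       # drop duplicate texts
        seen_texts.add(r["text"])
        out.append(r)
    return out
```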

B.4 Linguistic Analysis of the Generated Text
The statistics of the analyzed generated texts per language model are provided in Table 8. For the Chinese word count, the polyglot tool was utilized; for other languages, white-space-separated substrings were counted. The Empty Text column contains the number of samples with no new text generated (i.e., the model returned only the provided prompt or no text at all). The Short Text column represents the number of generated texts with 5 or fewer words. As the table indicates, the llama-65b model performed worst in generating texts in multiple languages, but it still achieved more than 85% accuracy regarding language match (i.e., the language of the generated text is the same as the one requested by the prompt). We must also take into account that FastText language identification is not error-free (i.e., misclassification can occur). As expected, a deeper analysis revealed that the llama-65b model failed mostly on Chinese and Arabic (202 and 140 mismatched samples, respectively), since these languages were not used in the model's training.

C Description of Detection Methods
Table 9 shows descriptions of all detection methods used in the benchmark. The base models for the detection methods have been carefully selected with respect to the state of the art and the limited experimentation resources. We have primarily used multilingual base models (XLM-RoBERTa, BERT-multilingual, mGPT, and mDeBERTa) that are publicly available and belong to the SOTA multilingual pre-trained models used for a wide range of downstream tasks. Besides these, we have also used English-only pre-trained models that have been commonly used as detectors in previous studies (Uchendu et al., 2021; Zhong et al., 2020), to see how they would perform on non-English data. In this group, there were RoBERTa-large-OpenAI-detector, GPT-2, and ELECTRA.

Log-Likelihood [Statistical]: Given a text, this method calculates the average log probability of its words. The log probability is extracted from a language model (i.e., mGPT (Shliazhko et al., 2022)).
Rank (Gehrmann et al., 2019) [Statistical]: Given each word in a text and its context, this method calculates the absolute rank of the word and then averages the rank scores over the text.
Log-Rank (Mitchell et al., 2023) [Statistical]: Similar to the Rank metric, Log-Rank takes the log of the rank score for each word.
Entropy (Mitchell et al., 2023) [Statistical]: Entropy is calculated by obtaining the entropy score of each word given its context (i.e., the previous words) and averaging the final scores.
GLTR Test-2 (Rank) (Gehrmann et al., 2019) [Statistical]: GLTR uses 3 tests to calculate scores used to distinguish machine-generated text from human-written text. Following the same procedure as He et al. (2023), we only use the 2nd test, which calculates the fraction of words whose rank falls within the top-10, top-100, top-1000, or beyond-top-1000 most probable words.
DetectGPT (Mitchell et al., 2023) [Statistical]: DetectGPT perturbs the text and compares the log probability of the original vs. the perturbed texts. The hypothesis is that machine-generated text tends to lie in a region of negative log-probability curvature, while human-written text may have either a higher or lower probability than its perturbations.
RoBERTa-large-OpenAI-detector (Solaiman et al., 2019) [Fine-tuned]: A sequence classifier based on RoBERTa-large, fine-tuned to distinguish between GPT-2-generated text and WebText.
GPT-2 Medium (Radford et al., 2019) [Fine-tuned]: A transformer-based autoregressive language model, pre-trained on English.
XLM-RoBERTa-large (Conneau et al., 2019) [Fine-tuned]: Pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages.
BERT-base-multilingual-cased (Devlin et al., 2018) [Fine-tuned]: A multilingual version of BERT, a masked language model pre-trained on the 104 languages with the largest Wikipedias.
mDeBERTa-v3-base (He et al., 2021a) [Fine-tuned]: A multilingual version of DeBERTa (He et al., 2021b) trained on CC100 data covering 100 languages.
ELECTRA-large (Clark et al., 2020) [Fine-tuned]: A language model pre-trained as a discriminator rather than a generator.
mGPT (Shliazhko et al., 2022) [Fine-tuned]: A multilingual autoregressive model using the GPT-3 architecture, trained on 61 languages from 25 language families using Wikipedia and the Colossal Clean Crawled Corpus.
Table 9: A detailed list of all detection methods used in this benchmark together with their descriptions.
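Two of the statistical scores in Table 9 reduce to simple aggregations once per-token log probabilities or ranks are available from a language model such as mGPT. The sketch below operates on plain lists of such values; extracting them from an actual model is omitted:

```python
def avg_log_likelihood(token_logprobs):
    """Log-Likelihood score: the mean per-token log probability
    assigned to the text by the language model."""
    return sum(token_logprobs) / len(token_logprobs)

def gltr_test2_features(token_ranks):
    """GLTR Test-2: fractions of tokens whose rank under the language
    model falls in the top-10, top-100, top-1000, and beyond-top-1000
    buckets (ranks are 1-based)."""
    buckets = [0, 0, 0, 0]
    for rank in token_ranks:
        if rank <= 10:
            buckets[0] += 1
        elif rank <= 100:
            buckets[1] += 1
        elif rank <= 1000:
            buckets[2] += 1
        else:
            buckets[3] += 1
    n = len(token_ranks)
    return [b / n for b in buckets]
```

A threshold on the first score, or a classifier over the four bucket fractions, then separates machine-generated from human-written text.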

D Model Parameters
For the purpose of fine-tuning language models for the machine-generated text detection task, we used the Trainer API of the transformers library for the PyTorch framework. We used a maximum of 10 epochs with an early-stopping mechanism (patience of 5), a batch size of 16 with gradient accumulation of 4 steps, an adaptive learning rate using the AdaFactor optimizer, weight decay of 0.01, and half-precision training (except for the mDeBERTa model, where half-precision training is faulty), using the macro average F1-score as the metric for best-model selection. The tokenizers truncated texts to a maximum length of 512 tokens. We performed a manual hyper-parameter search prior to the fine-tuning, finding parameters that work for all detector models.

For the Random Forest classifier used in the entropy-based detector, we optimized hyperparameters on the train split of the dataset using an automated randomized grid search with 5-fold cross-validation and 1,000 iterations. The grid consisted of the following parameters:
- n_estimators = [10, 50, 100, 150, 300]: the number of trees in the random forest,
- criterion = ['gini', 'entropy']: the function measuring the quality of a split,
- max_features = ['sqrt', 'log2', None]: the number of features considered at every split,
- max_depth = [None, 10, 100]: the maximum number of levels allowed in each decision tree,
- min_samples_split = [2, 4, 6]: the minimum number of samples required to split a node,
- min_samples_leaf = [1, 3]: the minimum number of samples that can be stored in a leaf node,
- bootstrap = [True, False]: the method used to sample data points.
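The search grid above can be written out as a plain dictionary, in the form scikit-learn's RandomizedSearchCV accepts as param_distributions (with n_iter=1000 and cv=5); note that 1,000 random iterations nearly exhaust the grid's 1,080 distinct combinations:

```python
import math

# Random Forest hyperparameter grid from Appendix D.
RF_PARAM_GRID = {
    "n_estimators": [10, 50, 100, 150, 300],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 10, 100],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 3],
    "bootstrap": [True, False],
}

# Total number of distinct configurations: 5 * 2 * 3 * 3 * 3 * 2 * 2 = 1080
grid_size = math.prod(len(values) for values in RF_PARAM_GRID.values())
```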

E Results Data
In Table 10, the performance results (macro average F1-score) of all statistical and black-box detectors for each test language are provided. The highest value for each test language is boldfaced.
In Tables 11-15, the performance results (macro average F1-score) of all fine-tuned detector models for each test language are provided. The highest value in a row is boldfaced, denoting the test language on which a particular fine-tuned detector version performs best. At the bottom of the tables, mean values of all detectors per test language are provided, along with separate mean results for detectors with multilingual and monolingual base models.
In Tables 16-18, the performance results (macro average F1-score) of all fine-tuned detector models across the data of individual text-generation LLMs are provided. Also in this case, the highest value in a row is boldfaced.

Table 1 :
Figure 1: Train and test languages from our dataset. Average F1-score of detectors fine-tuned on the English train split of the MULTITuDE dataset, then evaluated on the English test split vs. non-English test languages. The ↓25.7% drop in performance calls attention to the need for accurate multilingual MGT detectors.

Table 2 :
Number of samples per language and per generator for train and test splits of the MULTITuDE dataset.

Table 3 :
General performance of detection methods on the whole test split (all languages) of the MULTITuDE benchmark. The symbol * denotes detectors capable of handling multilingual text. The letters "B", "S", and "F" denote the category of the detector: black-box, statistical, and fine-tuned, respectively. Due to space limitations, we only report the best-performing version of each base model in the case of fine-tuned detectors.

Table 4 :
Performance for the test languages based on various train-language combinations. The table shows the mean over all trained detectors with multilingual base models, along with 95% confidence interval error bounds. The reported score is the macro average F1-score.

Table 5 :
The correlations between the macro average F1-score performances of the test languages, calculated based on the results from multilingual detectors (i.e., those having a multilingual base LM). Results that are not statistically significant are marked (n.s.).

Table 6 :
The cross-generator correlations of the macro average F1-score performance. All the presented results are statistically significant.
i.e., gpt-3.5-turbo) quality (Chiang et al., 2023). Hence, we consider these as OpenAI-based models. On the other hand, Group 2 encompasses models developed and released by Meta AI, hence recognized as Meta-based models. The LLM architectures within each group are similar, which may explain the similar fine-tuned performance on the dataset and the observed phenomenon.

Table 8 :
Statistics of the machine-generated texts. Mean and standard deviation of ratios are provided, where WC refers to the word count, US to the unique sentences, and UW to the unique words.
ZeroGPT [Black-box]: The ZeroGPT service uses a series of complex and deep algorithms to analyze the text, presented with an accuracy rate of text detection of up to 98%, claiming to detect AI text output in all available languages. https://www.zerogpt.com
GPTZero [Black-box]: The GPTZero model can detect AI-generated and human-written text at the sentence, paragraph, and document levels. Trained on a mixed corpus of AI and human English writing, it can accurately classify 85% of AI and 99% of human texts using a 0.65 threshold; to reduce false positives, a threshold of 0.65 or higher is advised.

Table 10 :
Cross-lingual performance of zero-shot detection models.

Table 19
shows how the text-generation LLMs correlate based on the detectors' performances, separated per train language. In Table 20, mean performance results with standard-deviation values are provided for each LLM, aggregated per train language.

Table 11 :
Cross-lingual performance of all detection models fine-tuned on the English language (evaluated per LLM).

Table 12 :
Cross-lingual performance of all detection models fine-tuned on the Spanish language (evaluated per LLM).

Table 13 :
Cross-lingual performance of all detection models fine-tuned on the Russian language (evaluated per LLM).

Table 14 :
Cross-lingual performance of all detection models fine-tuned on all three train languages (evaluated per LLM).

Table 16 :
Cross-generator performance of all detection models fine-tuned on the English language (evaluated per LLM).

Table 17 :
Cross-generator performance of all detection models fine-tuned on the Spanish language (evaluated per LLM).