Coherent or Not? Stressing a Neural Language Model for Discourse Coherence in Multiple Languages

In this study, we investigate the capability of a Neural Language Model (NLM) to distinguish between coherent and incoherent text, where the latter has been artificially created to gradually undermine local coherence within the text. While previous research on coherence assessment using NLMs has primarily focused on English, we extend our investigation to multiple languages. We employ a consistent evaluation framework to compare the performance of monolingual and multilingual models in both in-domain and out-of-domain settings. Additionally, we explore the model's performance in a cross-language scenario.


Introduction
Coherence is a fundamental aspect of a well-organized text and can be defined as "a semantic property of discourse, based on the interpretation of each individual sentence relative to the interpretation of other sentences" (Van Dijk, 1977). To be fully coherent, a discourse must exhibit both local and global coherence, where the former mainly concerns the relationships between adjacent or nearby sentences, whereas the latter focuses on the discourse-level relations connecting remote sentences. Modeling discourse coherence has a long history in the NLP community, particularly in the "pre-deep-learning" era, when a great deal of work was inspired by the Centering Theory framework (Grosz et al., 1995), such as the popular entity-grid approach for measuring local coherence (Barzilay and Lapata, 2008).
The long-standing interest in coherence modeling has also been motivated by the large variety of downstream applications that can benefit from measuring coherence in text, such as automatic essay scoring in language learning scenarios (Lai and Tetreault, 2018; Mesgar and Strube, 2018), language assessment in clinical settings (Elvevåg et al., 2007; Iter et al., 2018), and readability assessment (Muangkammuen et al., 2020). A further emerging scenario, which is closer to our study, is research on the interpretability of modern deep neural networks. In this respect, while the majority of existing tasks and benchmarks on which NLMs are evaluated focus on properties acquired from stand-alone sentences, the models' ability to capture discourse and pragmatic phenomena is still unclear. Few exceptions are represented by recent works such as (Shen et al., 2021; Chen et al., 2019; Farag et al., 2020), which introduced dedicated test suites aimed at measuring whether neural sentence representations show sensitivity to discourse phenomena spanning multiple sentences. Our paper intends to provide a novel contribution to the current body of literature by investigating whether and to what extent NLMs in multiple languages are able to distinguish a coherent piece of text from an incoherent one, where the latter has been artificially created to undermine local coherence at gradual levels of difficulty. While all previous work on coherence assessment using NLMs has focused on English, we probed these models for multiple languages using the same evaluation framework and compared the performance achieved by monolingual and multilingual models in both in-domain and out-of-domain settings, as well as in a cross-language scenario.
Our Contributions This paper makes the following contributions: i) we devised a new task to model discourse coherence understanding; ii) we compiled two new multilingual datasets (freely available) representative of two different domains and levels of complexity, containing coherent and incoherent passages (artificially manipulated); iii) we assessed how a multilingual NLM, XLM-RoBERTa-base, performs on the task and compared its performance against the same model without pretraining, in order to measure the impact of pretraining on the task at hand; iv) we evaluated the task performance in cross-domain and cross-language settings.

Table 1: Examples of prompt-target pairs (the first coherent, the second incoherent).

Prompt: In 1998, an intense flare was observed. The star has also been a target of plans for interstellar travel such as Project Daedalus. In 2005, astronomers using data from the Green Bank Telescope discovered a superbubble so large that it extends beyond the plane of the galaxy.
Target: It is called the Ophiuchus Superbubble. ✓

Prompt: What do they do? Well, let's first check and make sure they're really amnesiac. We ask these amnesiac patients to tell us which one they own, which one they chose last time, which one is theirs.
Target: Here's what normal controls do: they synthesize happiness. ✗

Coherence Evaluation Framework
We formulated the task of coherence modeling as a binary classification problem: given a short piece of text (hereafter referred to as the prompt) along with an individual sentence (the target), the model is asked to predict whether the target is contiguous to the prompt, i.e., whether joining the two yields a coherent or an incoherent text. See Table 1 for reference. More specifically, we designed two task configurations, namely forward and backward. In the forward configuration, the model is asked to assess whether the target follows the closing sentence of the prompt; in the backward one, whether it precedes the initial sentence of the prompt. Regardless of the direction, the negative target was always selected either from the same document as the prompt or randomly from a different document. When selecting the target from the same document, we specifically chose it as the 5th or 10th sentence preceding the first or following the last sentence of the prompt.
By systematically manipulating the distance from the prompt, we intended to introduce incremental degrees of complexity into the task, assuming that candidates closer to the prompt would be more likely to be misleading.
We tested our approach on the following languages: English, French, Italian, Portuguese, and Spanish.
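For concreteness, the following Python sketch illustrates how a single prompt-target pair could be assembled in each configuration. The helper functions and index conventions are our own simplified illustration, not the authors' released code.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    prompt: str   # four consecutive sentences, joined
    target: str   # candidate sentence to be judged
    label: int    # 1 = contiguous (coherent), 0 = not

def forward_pair(sents, i, distance=None):
    """Prompt = sents[i:i+4]. With distance=None the target is the sentence
    immediately following the prompt (positive item); with distance=5 or 10
    it is the 5th/10th sentence after the prompt's last sentence (negative)."""
    prompt = " ".join(sents[i:i + 4])
    j = i + 4 if distance is None else i + 3 + distance
    return Pair(prompt, sents[j], int(distance is None))

def backward_pair(sents, i, distance=None):
    """Mirror case: the positive target immediately precedes the prompt's
    first sentence; negatives sit 5 or 10 sentences before it."""
    prompt = " ".join(sents[i:i + 4])
    j = i - 1 if distance is None else i - distance
    return Pair(prompt, sents[j], int(distance is None))
```

Out-of-document negatives are simpler still: the target is drawn at random from a different document, with the label set to 0.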

Dataset construction
For each language we built two distinct datasets, chosen as representative of the written and spoken modalities: on one side, we exploited the well-known and (ab)used Wikipedia data; on the other, we relied on TED talk transcriptions. The latter can be seen as a middle ground between the written and spoken modalities. Indeed, even if public speeches are delivered orally, they often derive from written scripts and are prepared and rehearsed in advance. It follows that these communicative events lack the typical spontaneity (Chafe, 1994) that characterizes everyday oral communication and do not contain phenomena such as false starts, retracting, and on-line discourse generation; thus, they cannot be considered natural spoken language examples. Nevertheless, TED-style talks represent a different domain with respect to Wikipedia, and in general to 'standard' written language, so we included these transcriptions to test NLMs in a slightly more complex scenario.
As anticipated, the data source used to build the written section of the dataset is Wikipedia. Texts have been automatically extracted from the dumps and cleaned using Wikiextractor (Attardi, 2015).
The spoken section of the dataset has been derived from two sources, both collecting TED talks: the multilingual TEDx Dataset (Salesky et al., 2021) and the TED2020 Dataset (Reimers and Gurevych, 2020). The latter has been used to include English data, which are not present in the TEDx Dataset. We discarded aligned translations in order to collect exclusively original monolingual data.
To ensure consistent analysis for coherence assessment, we extracted passages consisting of four consecutive sentences from each text, considering them as our unit of analysis. As regards the written dataset, we utilized the existing paragraph segmentation in Wikipedia to select four-sentence paragraphs. For the spoken one, given that TED speeches lack such an internal structure, we split all the transcripts into passages of four sentences.
To meet the requirement of identifying negative targets within a maximum distance of 10 sentences from the beginning or end of the prompt, we only retained prompts for which it was possible to retrieve such targets in both directions. Once a prompt was paired with a correct target, it was excluded from being a source for extracting negative items, and vice versa. It is worth noting that the positive items remained the same across all experiments, while the negative items, which shared the prompt but varied in the target sentence, were unique to each experimental variant.
Following these constraints, we ended up with train and test splits of 8,000 and 800 prompt-target pairs, respectively, for each language, domain, and configuration. An example for each configuration in the dataset can be found in Appendix A.
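As an illustration of the two steps described above, here is a minimal Python sketch of how transcripts could be segmented into four-sentence passages and of the retention rule for prompts. The naive regex-based sentence splitter and the function names are our own simplifications, not part of the paper's pipeline.

```python
import re

def split_into_passages(text, size=4):
    """Naively split a transcript into sentences and group them into
    four-sentence passages (TED transcripts lack paragraph structure);
    a real pipeline would use a proper sentence segmenter."""
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sents[i:i + size] for i in range(0, len(sents) - size + 1, size)]

def valid_prompt_starts(sents, window=4, max_dist=10):
    """Start indices of prompts for which in-document negative targets
    (up to max_dist sentences away, in both directions) exist."""
    last_valid = len(sents) - window - max_dist  # room for a forward negative
    return list(range(max_dist, last_valid + 1))
```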

Experimental settings
To evaluate our coherence assessment framework we devised three main sets of experiments. In the first one (in-domain/in-language), we examined the ability of a multilingual NLM to comprehend local coherence for each language and domain. To determine the impact of the linguistic knowledge acquired by the model during pretraining on the task, we compared its performance with a baseline model that lacked pretraining.
Through the second set of experiments we evaluated the generalization abilities of the multilingual model in a cross-language scenario: the model was trained on one language and tested on all the others. We compared the scores with those obtained by the same multilingual model trained on all languages simultaneously, and with those obtained by a monolingual model trained only on the corresponding language.
The last set of experiments (cross-domain) aimed at assessing whether and to what extent the model is able to learn information about coherence that generalizes across datasets: for each language, we thus computed the performance of the multilingual model trained on one domain and tested on the other.
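The resulting experimental grid can be summarized with a short sketch; this is our own compact rendering of the three settings, not code from the paper:

```python
from itertools import product

LANGS = ["en", "fr", "it", "pt", "es"]
DOMAINS = ["wiki", "ted"]

# In-domain/in-language: train and test on the same language and domain.
in_domain = [(lang, dom) for lang, dom in product(LANGS, DOMAINS)]

# Cross-language: train on one language (or on ALL at once), test on each.
cross_language = [(train, test) for train in LANGS + ["ALL"] for test in LANGS]

# Cross-domain: per language, train on one domain and test on the other.
cross_domain = [(lang, tr, te) for lang in LANGS
                for tr, te in [("wiki", "ted"), ("ted", "wiki")]]
```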
As regards the multilingual model used in the experiments, we relied on XLM-RoBERTa-base (Conneau et al., 2020), a multilingual version of RoBERTa-base (Liu et al., 2019) pretrained on 2.5TB of data covering 100 languages (including those under examination). The model consists of 12 layers with 12 attention heads. The monolingual models were chosen so as to be as similar as possible to XLM-RoBERTa-base and available among the models released on Huggingface. Accordingly, we used the original RoBERTa-base (Liu et al., 2019) for English, the BERTIN version of RoBERTa (la Rosa et al., 2022) for Spanish, CamemBERT (Martin et al., 2020) for French, and GilBERTo for Italian. As regards Portuguese, since a reference version of this model is not available for this language, we chose the RoBERTa model most used by the community. For all settings, the passages have been fed to the examined model by simply concatenating the target sentence to the prompt, without using special characters for separation. All experiments were executed using the Huggingface library, and the models were trained for 10 epochs, since experiments with more epochs showed no improvements in terms of convergence. The remaining training hyper-parameters were set to their default values as specified by the Hugging Face framework, with the exception of the learning rate, which was set to 5e-6.
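A minimal sketch of this setup follows, assuming a dataset with prompt, target, and label columns; train_ds and test_ds are placeholders, and this is our reading of the described configuration rather than the authors' released code:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # binary coherence classifier

# For the no-pretraining baseline, the model can instead be built from its
# configuration alone, e.g. with AutoModelForSequenceClassification.from_config.

def encode(batch):
    # Target simply appended to the prompt, with no special separator.
    texts = [p + " " + t for p, t in zip(batch["prompt"], batch["target"])]
    return tokenizer(texts, truncation=True)

args = TrainingArguments(
    output_dir="coherence-xlmr",
    num_train_epochs=10,   # more epochs showed no further convergence gains
    learning_rate=5e-6,    # the only non-default hyper-parameter
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds.map(encode, batched=True),
#                   eval_dataset=test_ds.map(encode, batched=True))
# trainer.train()
```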

Results
Figure 1 reports the results of the multilingual model in the in-domain setting across languages, for both the written (top) and spoken (bottom) domain.
As a general remark, we observe that the baseline model performs at or even below chance level across all task configurations and languages. This suggests that the knowledge acquired in the pre-training phase enables the model to capture information that is involved in local coherence. Such an impact is particularly beneficial when the negative target is sourced from a different text, and especially evident on the Wikipedia data, where the pretrained model achieves an average F-score of 0.94 across all languages, compared to 0.83 on the TED data. Conversely, although still performing better than the baseline, the model's performance significantly decreases when the negative target belongs to the same document as the prompt. This effect becomes more pronounced as the target gets closer to the prompt (either preceding or following it). This suggests that the model tends to rely on explicit lexical clues to detect incoherent passages and may be more easily confounded when the target and the prompt share the same topic. This observation is particularly relevant for TED speeches, where clear topic distinctions are less prominent and the discourse structure is less defined compared to encyclopedic written articles. In this case, even the "easiest" scenario of out-of-document negative targets becomes more challenging, indicating that the model struggles to grasp coherence-related cues beyond lexical or semantic ones.
Taken overall, these results highlight that the out-of-document negative items are extremely easy to detect, whereas similar scores are obtained regardless of the configuration and of the prompt-target distance. Based on this, we decided to conduct the cross-language and cross-domain experiments exclusively using the forward configuration, with negative targets corresponding to the 10th sentence following the prompt.

Table 2 provides a summary of the results of the cross-language experiments. As can be seen, the best overall scores are obtained by the multilingual model fine-tuned on all languages (row ALL), especially for the TED dataset. As expected, training the model on a language different from the target language leads to slightly lower performance, although the differences are not dramatic. Interestingly, the monolingual models perform comparably to the multilingual model, except for Portuguese.

Shifting our focus to the cross-domain classification (Table 3), we observe a considerable decrease in performance for models fine-tuned on one domain and tested on the other, as anticipated. This holds especially when the model is tested on the TED datasets. We can attribute this phenomenon to the less structured nature of TED speeches compared to Wikipedia texts, but also to the fact that Wikipedia texts are part of the base training of the models. This effect is particularly appreciable for French and Portuguese, but less marked on the Italian data.

Conclusion
In this study we carried out a comprehensive series of experiments to evaluate the ability of XLM-RoBERTa-base, one of the leading Neural Language Models (NLMs), to distinguish coherent from incoherent text, where the latter has been artificially created to gradually undermine local coherence. Our findings indicate that NLMs still face challenges in modeling discourse coherence, and that the linguistic knowledge acquired during the pre-training phase provides limited assistance when coherence relies on information not directly related to the topic. As expected, the cross-domain experiments highlighted that model performance degrades with respect to the in-domain classification scenario, particularly when testing on data with a less defined structure, such as TED talks. Interestingly, the generalization ability of the multilingual model holds across different languages, showing results competitive with the monolingual models.

Limitations
We recognize the following main limitations of the present study. Although the approach we devised is not bound to a specific model architecture or language, our study focused on only one neural language model and a limited set of languages, which may limit the generalization of our results. Moreover, we are aware that discourse coherence is a multifactorial phenomenon that can only be partially covered by the devised methodology and dataset.

A Data Sample

Table 4: TED data sample from the forward configuration (negative target, distance -5, label 0, talk 1887#464). Prompt: "Now, you can think of that as the backbone that holds the rest of the molecule together. The three long chains on the right are called fatty acids, and it's subtle differences in the structures of these chains that determine whether a fat is, let's say, solid or liquid; whether or not it goes rancid quickly; and, most importantly, how good or how bad it is for you. Let's take a look at some of these differences." Target: "Thank you for having me."

Table 5: Wikipedia data sample from the forward configuration (negative targets, label 0, article 4183#39#3). Prompt: "Products made from cellulose include rayon and cellophane, wallpaper paste, biobutanol and gun cotton. Sugarcane, rapeseed and soy are some of the plants with a highly fermentable sugar or oil content that are used as sources of biofuels, important alternatives to fossil fuels, such as biodiesel. Sweetgrass was used by Native Americans to ward off bugs like mosquitoes." Target (distance -5): "Others are simple derivatives of botanical natural products." Target (distance -10): "The excavated remains of culled animal bones suggest that people may have gathered at the site for the winter rather than the summer."

Figure 1: Summary of the in-domain classification scores of the multilingual model across languages. Columns represent the F-score obtained in the different classification settings. The colors indicate the distance of the negative target from the prompt: 5 or 10 sentences away from the prompt ('-' preceding / '+' following the prompt); out, the negative target belongs to a different document. The white dash within each column represents the score of the baseline model, i.e., the multilingual model without pretraining.


Table 2: Results of the cross-language experiments.

Table 3: Model performance in the cross-dataset experiments.


Table 6: Detailed in-domain classification scores reported by the XLM-RoBERTa-base model on Wikipedia data. Baseline scores are in italics.

Table 7: Detailed in-domain classification scores reported by the XLM-RoBERTa-base model on TED data. Baseline scores are in italics.