Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks

Although coherence modeling has come a long way in developing novel models, their evaluation on downstream applications for which they are purportedly developed has largely been neglected. With the advancements made by neural approaches in applications such as machine translation (MT), summarization and dialog systems, the need for coherence evaluation of these tasks is now more crucial than ever. However, coherence models are typically evaluated only on synthetic tasks, which may not be representative of their performance in downstream applications. To investigate how representative the synthetic tasks are of downstream use cases, we conduct experiments on benchmarking well-known traditional and neural coherence models on synthetic sentence ordering tasks, and contrast this with their performance on three downstream applications: coherence evaluation for MT and summarization, and next utterance prediction in retrieval-based dialog. Our results demonstrate a weak correlation between the model performances in the synthetic tasks and the downstream applications, motivating alternate training and evaluation methods for coherence models.


Introduction and Related Work
Coherence is an important aspect of discourse that distinguishes a well-written text from a poorly written one that is difficult to comprehend (Halliday and Hasan, 1976). Computational models that can assess coherence have applications in text generation and ranking, such as summarization, machine translation, essay scoring and dialog systems.
Researchers have proposed a number of formal theories of discourse coherence, which have inspired the development of many coherence models, both traditional and neural ones. Inspired by the Centering Theory (Grosz et al., 1995), the entity-based local models (Barzilay and Lapata, 2008; Elsner and Charniak, 2011b) formulate coherence in terms of syntactic roles (e.g., subject, object) of entities in nearby sentences. Another branch of models (Pitler and Nenkova, 2008; Lin et al., 2011; Feng et al., 2014) use coherence relations between adjacent sentences to model local coherence, inspired by the discourse structure theories of Mann and Thompson (1988) and Webber (2004). Other traditional methods include word co-occurrence-based local models (Soricut and Marcu, 2006), topic-based global models (Barzilay and Lee, 2004; Elsner et al., 2007), and syntax-based local and global models (Louis and Nenkova, 2012).
Despite continuous research efforts in developing novel coherence models, their usefulness in downstream applications has largely been ignored. They have been evaluated in mainly two ways. The most common approach has been to evaluate them on synthetic discrimination tasks that involve identifying the right order of the sentences at the local and global levels (Barzilay and Lapata, 2008; Elsner and Charniak, 2011b; Moon et al., 2019). The other (rather infrequent) way has been to assess the impact of coherence score as an additional feature in downstream tasks like readability assessment and essay scoring (Barzilay and Lapata, 2008; Mesgar and Strube, 2018). But the concept of coherence goes beyond these constrained tasks and domains, and so should the models.
Given the recent advances in neural NLP methods, with claims of reaching human parity in machine translation (Hassan et al., 2018), fluency in summarization (Celikyilmaz et al., 2018), or context-consistent response generation (Zhang et al., 2020; Hosseini-Asl et al., 2020), coherence modeling of machine-generated texts, particularly at a document level, is now more crucial than ever (Läubli et al., 2018; Sharma et al., 2019). Traditional task-specific evaluation methods (e.g., BLEU, ROUGE) may not be an accurate reflection of their real-world performance in terms of readability (Paulus et al., 2017; Reiter, 2018). However, it is unclear if existing coherence models are capable of this task, since their performance on downstream applications is rarely studied, even though that is one of the main motivations for their development.
Our main goal in this work is to assess the performance of the existing coherence models not only on standard, challenging synthetic tasks like global and local discrimination, but more importantly on real downstream text generation problems. Specifically, we investigate the performance of coherence models in three different settings:

• Traditional synthetic tasks involving discrimination of real documents from their permutations.

• Coherence evaluation for machine translations and system-generated extractive and abstractive summaries, which are more representative of real-world use cases for coherence models.

• Next utterance ranking for dialogs, which is a downstream application similar to the synthetic task of insertion, but uses conversational data from DSTC 8 (Kim et al., 2019).
We show through experiments that there is only a slight correlation between model performances on synthetic tasks and the real-world use cases. Although models perform strongly in the synthetic tasks, they show poor performance and low correlations with human judgments on distinguishing coherent machine translations and system-generated summaries from incoherent ones. They also fail to perform well on the next utterance ranking task, which is similar to the synthetic task of insertion (Elsner and Charniak, 2011b), even if re-trained with task-specific data.
However, we show that re-training the coherence models with task-specific data for machine translation evaluation leads to improved results and agreements with human judgments. This leads us to conclude that there is a possible mismatch in the task setting that is used to train coherence models. Models trained on traditional synthetic tasks do not seem to be learning features that are useful for downstream applications. We hope that our results will motivate the broadening of the standard of coherence model evaluations to include more downstream tasks, and also motivate the redesigning of the training paradigm for coherence models.

Coherence Models
Advancements in deep learning have inspired researchers to neuralize many of the traditional models. Li and Hovy (2014) model syntax and intersentence relations using a recurrent sentence encoder followed by a fully-connected layer. In a follow-up work, Li and Jurafsky (2017) use generative models to incorporate global topic information with an encoder-decoder architecture. Mohiuddin et al. (2018) propose a neural entity grid model using convolutions over distributed representations of entity transitions. Mesgar and Strube (2018) model change patterns of salient semantic information between sentences.  propose a local discriminative model that retains the advantages of generative models and uses a smaller negative sampling space that can learn against incorrect orderings. Moon et al. (2019) propose a unified model that incorporates sentence syntax, inter-sentence coherence relations, and global topic structures in a single Siamese framework.
We benchmark the performance of five representative coherence models on the tasks discussed above. Our selected models comprise both traditional and neural models; two of them (the Transferable and Unified Neural Models) were the state of the art at the time of submission.
Entity Grid (EGRID). Barzilay and Lapata (2005, 2008) introduced the popular entity-based model for representing and assessing text coherence, motivated by the Centering Theory (Grosz et al., 1995). This model represents a text with a two-dimensional array called an entity grid that captures transitions of discourse entities across sentences. These local entity transitions are used as deciding patterns for text coherence; a local entity transition of length k is a sequence of {S, O, X, -}^k representing grammatical roles (Subject, Object, Other, and Absent, respectively) played by an entity in k consecutive sentences. The salience of the entities, quantified by their occurrence frequency, is also incorporated to identify transitions of important entities. Elsner and Charniak (2011b) improve the basic entity grid by including non-head nouns as entities (with the grammatical role X). Instead of using a coreference resolver, they match the nouns to detect coreferent entities. In our work, we consider this version of the entity grid model.
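As an illustration of how such transition features can be computed, here is a minimal sketch (not the authors' implementation): given an entity grid mapping each entity to its role per sentence, it counts length-k role transitions and normalizes them into a probability vector.

```python
from collections import Counter
from itertools import product

ROLES = ["S", "O", "X", "-"]  # Subject, Object, Other, Absent

def transition_features(grid, k=2):
    """Normalized entity-transition probabilities from an entity grid.

    `grid` maps each entity to its grammatical role in every sentence,
    e.g. {"Microsoft": ["S", "O", "-"]}. Returns one feature per
    possible length-k transition, in a fixed order.
    """
    counts = Counter()
    total = 0
    for roles in grid.values():
        for i in range(len(roles) - k + 1):
            counts[tuple(roles[i:i + k])] += 1
            total += 1
    return {t: counts[t] / total if total else 0.0
            for t in product(ROLES, repeat=k)}

# Toy two-entity, three-sentence grid (entity names hypothetical).
grid = {"Microsoft": ["S", "O", "-"], "market": ["-", "X", "S"]}
feats = transition_features(grid, k=2)  # feats[("S", "O")] == 0.25
```

The resulting vector (16 features for k=2) is what a classifier or ranker consumes in the basic, unlexicalized grid model.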
Neural Entity Grid (NEURALEGRID). A neural version of the entity grid model was proposed by Nguyen and Joty (2017). The grammatical roles in the grid are converted into their distributed representations, and the entity transitions are modeled in the distributed space by performing convolutions over it. The final coherence scores are computed from convolved features that have gone through a spatial max-pooling operation. A global, document-level pairwise loss is used to train the model.
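The pairwise training objective can be illustrated with a generic margin (hinge) ranking loss; the exact formulation used by NEURALEGRID may differ, so this is only a schematic sketch:

```python
def pairwise_ranking_loss(pos_score, neg_score, margin=1.0):
    """Margin (hinge) ranking loss: zero once the coherent document
    outscores the incoherent one by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)

# Coherent document already ahead by more than the margin: no loss.
loss_ok = pairwise_ranking_loss(2.0, 0.5)   # 0.0
# Scores too close: the model is penalized.
loss_bad = pairwise_ranking_loss(0.5, 0.3)  # 0.8
```

Training minimizes this loss over (original document, permuted document) pairs, pushing coherent documents above their incoherent renderings.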
Lexicalized Neural Entity Grid. Mohiuddin et al. (2018) propose an improvement of the neural entity grid (LEXNEUEGRID) by lexicalizing the entity transitions using off-the-shelf word embeddings to achieve better generalization.

Transferable Neural Model (TRANSMODEL).
In order to generalize the coherence model across domains,  propose a transferable neural model that considers coherence at a local level, taking only adjoining sentences as input. Coupled with pre-training of the sentence encoders in a generative fashion, their model demonstrates significant improvements in performance, despite being a local coherence model.
Unified Neural Model (UNIFIEDMODEL). Moon et al. (2019) propose a unified model that captures syntax (as a proxy of intention), discourse relations, entity attention and global topic structures. The syntax is captured by incorporating an explicit language model loss. A bi-linear layer is used to capture the inter-sentential discourse relations, while light-weight convolution is used to capture the attention and topic structures.

Evaluation Tasks and Experiments
In this section, we present the performance of the coherence models on standard synthetic tasks (i.e., global/local discrimination), followed by the experiments where we apply the coherence models trained on the global discrimination task to three downstream tasks (i.e., abstractive summarization, extractive summarization, and machine translation). We then present the results of the coherence models re-trained on the next utterance ranking task.
For each of the coherence models, we conducted experiments with the publicly available code from the respective authors. The three recent methods use word embeddings: LEXNEUEGRID, TRANSMODEL and UNIFIEDMODEL use Word2vec (Mikolov et al., 2013), average GloVe (Pennington et al., 2014), and ELMo (Peters et al., 2018) embeddings, respectively. We use the default settings and hyperparameters suggested by the authors.

Synthetic Tasks
Traditionally, coherence models have been evaluated mostly on synthetic tasks. For comparability with previous work, we use two representative synthetic tasks to compare the coherence models.

Global Discrimination.
Introduced by Barzilay and Lapata (2008), this task asks coherence models to distinguish an original (coherent) document from incoherent renderings generated by random permutations of its sentences. Additionally, we evaluate on inverse discrimination (Mohiuddin et al., 2018), where the sentence order is reversed to create the incoherent version.
Setup. We follow the same experimental settings of the WSJ news dataset as used in previous works (Mohiuddin et al., 2018; Elsner and Charniak, 2011b; Feng et al., 2014). We use 20 random permutations of each document for both training and testing, excluding the permutations that match the original one. Table 1 summarizes the datasets used in the global discrimination task. We randomly select 10% of the training set for development purposes.
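The negative-example generation for this task can be sketched as follows (an illustrative re-implementation, not the original preprocessing code): sample sentence permutations of a document, discarding any that reproduce the original order.

```python
import random

def permutation_negatives(sentences, n=20, seed=0):
    """Generate up to `n` distinct sentence permutations of a document,
    excluding any permutation identical to the original order."""
    rng = random.Random(seed)
    original = tuple(sentences)
    negatives = set()
    attempts = 0
    while len(negatives) < n and attempts < n * 50:
        perm = sentences[:]
        rng.shuffle(perm)
        if tuple(perm) != original:
            negatives.add(tuple(perm))
        attempts += 1
    return [list(p) for p in negatives]

doc = ["s1", "s2", "s3", "s4", "s5"]
negs = permutation_negatives(doc, n=20)
```

Each (original, permuted) pair then serves as one positive-negative training or test instance for the discrimination task.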
Results. Table 2 presents the results in terms of accuracy on the two global discrimination tasks: the standard and the inverse order discrimination. We see that UNIFIEDMODEL achieves the highest accuracy on the standard order discrimination task and TRANSMODEL performs the best on the inverse order discrimination task. The other three models use entity grids, hence they may lose sentence-level syntactic and semantic information.

Local Discrimination.
Local discrimination was proposed by Moon et al. (2019). In this task, two documents differ only in a local context (windows of 3 sentences). In this case, the models need to be sensitive to local changes. We use the same WSJ dataset as used by Moon et al. (2019).
Setup. We use the same WSJ articles used in the global discrimination task (Table 1) to create our local discrimination datasets. We use the code released by Moon et al. (2019) to generate these datasets. 2 Sentences within a local window of size 3 are re-ordered to form a locally incoherent text.
Only articles with more than 10 sentences are included in the dataset. Table 3 summarizes the datasets. We randomly select 10% of the training set for development purposes. Following Moon et al. (2019), we create four datasets for our local discrimination task: D_{w=1}, D_{w=2}, D_{w=3} and D_{w=1,2,3}. D_{w=1} contains the documents where only one randomly selected window is permuted, and D_{w=2} contains the documents where two randomly selected windows are permuted; D_{w=3} is similarly created with 3 windows. D_{w=1,2,3} denotes the concatenation of the three datasets.
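A minimal sketch of how such locally permuted negatives might be created (the authors' released code is the authoritative version; the window-selection details here are assumptions):

```python
import random

def local_permutation(sentences, num_windows=1, window=3, seed=0):
    """Permute sentences inside `num_windows` randomly chosen,
    non-overlapping windows of `window` consecutive sentences,
    leaving the rest of the document intact."""
    rng = random.Random(seed)
    doc = sentences[:]
    starts = list(range(0, len(doc) - window + 1))
    rng.shuffle(starts)
    chosen = []
    for s in starts:
        if all(abs(s - c) >= window for c in chosen):
            chosen.append(s)
        if len(chosen) == num_windows:
            break
    for s in chosen:
        block = doc[s:s + window]
        # Re-shuffle until the window actually changes order.
        while True:
            rng.shuffle(block)
            if block != doc[s:s + window]:
                break
        doc[s:s + window] = block
    return doc

orig = [f"s{i}" for i in range(12)]
neg = local_permutation(orig, num_windows=2)
```

Because only 1-3 short windows change, the negative document stays locally coherent almost everywhere, which is what makes this task harder than global discrimination.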
Results. From Table 4, we see that UNIFIEDMODEL achieves the highest accuracy on all four datasets. A possible reason could be the loss function it uses to train the model. Unlike the other models, UNIFIEDMODEL uses an adaptive pairwise ranking loss which does not penalize the locally coherent sentences. In the local discrimination task, the difference between positive and negative examples is small; they differ only in 1-3 windows, while the other parts are locally coherent. UNIFIEDMODEL's loss function can model this better.

2: https://github.com/taasnim/unified-coherence-model

Coherence Evaluation Tasks
We evaluate the coherence models trained on the global discrimination task on two downstream tasks: machine translation (MT) and summarization coherence evaluation. Note that both the MT and summarization data are from the same domain (news) as the original WSJ training data.

Machine Translation Evaluation
The outputs of neural machine translation (NMT) systems have been shown to be more fluent than their phrase-based predecessors (Castilho et al., 2017). However, recent studies have shown that there is a statistically strong preference for human translations in terms of both adequacy and fluency at a document level (Läubli et al., 2018; Popel et al., 2020). Smith et al. (2016) evaluated traditional (non-neural) coherence models to see if they can distinguish a reference from a system-translated document, and reported very low accuracy. However, the situation has changed with the advancements of neural models; today's coherence models are claimed to be much more accurate.
Our goal therefore is to evaluate the coherence models on how well they can judge the coherence of MT outputs at the document level. To do this, we use the system translations released by the annual Workshop (now Conference) on Machine Translation (WMT) for the years 2017 and 2018. At a document level, reference (human) translations have been shown to be more coherent than MT outputs (Smith et al., 2015, 2016; Läubli et al., 2018). Therefore, we evaluate the performance of the coherence models based on their accuracy in scoring the reference (document) higher than the system translation (document).
We also obtain rankings given by humans in a user study. Fig. 1 shows the layout of the study, where participants were shown four sentences from three candidate translations of the same source text and asked to rank them against each other. One of the given translations is the reference, used as a control to validate our assumption that the reference is more coherent than the system translations. Three participants annotated 100 such samples.
Participants chose the reference as more coherent with an agreement of 0.84, confirming our assumption. 3 We evaluate the system translations by producing a ranking between the different translations of the same source text. To do this, we first obtain scores from the coherence models for the reference and each of the corresponding system translations. Then, we normalize the scores of the system translations by subtracting them from the score of the reference. These normalized coherence scores are used to rank the system translations, which are then used to calculate agreements.
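The ranking procedure just described can be sketched as follows (system names hypothetical):

```python
def rank_systems(ref_score, sys_scores):
    """Rank system translations of one source document by coherence.

    Scores are normalized against the reference (reference - system),
    so a smaller normalized score means the system is closer to the
    reference in coherence. Returns system names, best first.
    """
    normalized = {name: ref_score - s for name, s in sys_scores.items()}
    return sorted(normalized, key=normalized.get)

ranking = rank_systems(0.9, {"sysA": 0.4, "sysB": 0.7, "sysC": 0.1})
# best first: sysB, sysA, sysC
```

The per-document rankings produced this way are what get compared against the human rankings when computing agreement.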
Setup. We use the reference and the system translations provided by WMT2017-2018 as our test data, under the assumption that the reference translations are more coherent than the system translations. This results in a test set of 20,680 reference-system translation document pairs.
Results. We report the accuracy of the coherence models trained on the global discrimination task in distinguishing the more coherent reference text from the less coherent system translations in Table 5. We can see that most models perform worse than a random baseline of 50%, showing that their training on the global discrimination task is not helpful in detecting coherence quality in MT text. The difference in performance is particularly glaring for TRANSMODEL and UNIFIEDMODEL, both of which have over 90% accuracy on the global discrimination tasks, but only manage 48.67% and 43.36% on this task respectively. We also report the agreement with human rankings on the study data in Table 5. Overall, only EGRID has good agreement with human rankings, with all other models doing similarly poorly. 4

3: Traditional correlation measures such as Cohen's Kappa are not robust to skewed distributions of annotations, which was an issue here since the annotators were always more likely to choose the reference as better. Thus, we report the more appropriate Gwet's AC1/gamma coefficient (Gwet, 2008), which controls for this.

Table 5: Machine translation setting results on WMT2017-2018 data. Accuracies: % of times the reference scored higher; AC1 agreements for system translation rankings between annotators and models.
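For reference, Gwet's AC1 for two raters with binary labels can be computed as follows (a small illustrative implementation; the multi-category generalization is analogous but not shown):

```python
def gwet_ac1(rater1, rater2):
    """Gwet's AC1 chance-corrected agreement for two raters, binary 0/1
    labels. More robust than Cohen's kappa under skewed label
    distributions, because the chance term pe = 2*pi*(1-pi) stays small
    when one label dominates."""
    assert len(rater1) == len(rater2) and len(rater1) > 0
    n = len(rater1)
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n  # raw agreement
    pi = (sum(rater1) + sum(rater2)) / (2 * n)  # mean prevalence of label 1
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

# Heavily skewed toy data: AC1 stays high (~0.756), whereas Cohen's
# kappa on the same ratings collapses to 0.
ac1 = gwet_ac1([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])
```

This skew-robustness is exactly why AC1 is preferred here: annotators almost always prefer the reference, so label distributions are far from uniform.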

Abstractive Summarization
Generating coherent summaries has always been a goal in summarization (Nenkova and McKeown, 2011). The widely used automatic evaluation metric ROUGE (Lin, 2004) measures the n-gram overlap between the generated summaries and the reference summaries at a sentence level, and thus is not sufficient for measuring coherence. Kryściński et al. (2019) also recently found almost negligible correlation between ROUGE scores and human judgments on summary coherence, especially for abstractive summaries generated by recent neural summarization models. We therefore propose to evaluate the coherence of summaries using different coherence models and measure their effectiveness on this task.
For abstractive summarization, we use summaries from popular neural abstractive summarization systems on the CNN/DM dataset (Hermann et al., 2015; Nallapati et al., 2016). Since abstractive systems vary in their architectures and loss functions, they may produce very different summaries. We run a human study to validate the rankings given by the coherence models. As discussed, we directly use the coherence models trained on the WSJ dataset for the global discrimination task. The coherence models predict the scores for each system-generated summary in the test set. The scores produced by the models are then used to rank the system-generated summaries of the same original article.

Table 6: AC1 agreements between the two annotators and the models for the pairwise ranking of the system-generated abstractive summaries and extractive summaries, respectively.

Setup
We conducted a user study to validate the effectiveness of the rankings produced by the coherence models. We randomly sampled 10 sets of summaries from the dataset with each set containing four generated summaries of the same article, thus resulting in (4 choose 2) × 10 = 60 pairs of system summaries. Two annotators were asked to rank each pair of the summaries in terms of coherence; see Appendix for the human study interface.
Results. For the user study, the agreement between the two annotators was 0.78, which indicates fairly reliable data. After obtaining the rankings based on the coherence scores produced by the models, we compute the agreements between the systems and the two annotators. From the results in Table 6, we see that EGRID and LEXNEUEGRID show the highest agreement with human judgments. However, despite strong performance on synthetic tasks, models like UNIFIEDMODEL and TRANSMODEL are unable to convert their high accuracy into high human agreement, which demonstrates the inadequacy of the current synthetic tasks.

Extractive Summarization
For evaluating the coherence of extractive summaries, we use the dataset prepared by Barzilay and Lapata (2008) for their coherence model evaluation. The dataset comes with human ratings of the summaries from the Document Understanding Conference (DUC) 2003.
Setup. The dataset from Barzilay and Lapata (2008) provides 16 sets of summaries where each set corresponds to a multi-document cluster and contains summaries generated by 5 systems and 1 human. The human ratings for these summaries based on coherence are also available. 5 We follow the same experimental setup as in abstractive summarization. We use the coherence models trained on the WSJ dataset to produce scores that can be used to obtain the pairwise ranking of generated summaries. Based on the ratings provided by Barzilay and Lapata (2008), we can generate the human pairwise rankings.
Results. We present the agreements between the generated human rankings and the systems in Table 6. We observe the same problem as in abstractive summarization: high accuracy in synthetic tasks does not lead to high human agreement in evaluating downstream summarization systems.

Task-specific Training for Dialog
The global and local discrimination tasks are synthetic, while the MT and summarization coherence evaluation performance may be affected by the difference between the testing and training setup. To control for this, we re-train and test the coherence models on a task-specific setup for next utterance ranking. This task has the advantage of being non-synthetic while providing task specific training data, but also being similar to the synthetic task of insertion, helping us evaluate the generalizability of the coherence model performance.

Next Utterance Ranking
The quality of a dialog depends on various conversational aspects such as engagement, coherence, coverage, conversational depth, and topical diversity (See et al., 2019). Liu et al. (2016) show that commonly used metrics such as BLEU and ROUGE show very weak or no correlation with human judgments. They also suggest using metrics that take dialog context into account. This is particularly important as Sankar et al. (2019) empirically show that current neural dialog systems rarely use conversational history. We therefore propose to evaluate the usefulness of coherence models in dialog systems. We evaluate the models on the Noetic End-to-End Response Selection Challenge II (NOESIS II), a track in the Dialog System Technology Challenges 8 (DSTC 8) (Kim et al., 2019). In this problem, each example consists of a conversational context U = (u_1, ..., u_{|U|}) and a set of potential utterances (candidates) C = {c_1, ..., c_{|C|}} that may occur next in the dialog; the task is to select the correct next utterance r ∈ C.
This task is a nice fit for evaluating coherence models, as a good model should rank a coherent dialog higher than an incoherent one. The correct utterance along with the conversational context forms the coherent example P = (u_1, ..., u_{|U|}, r), while the other candidate utterances c_j ∈ C with the conversational context form the incoherent examples N = (u_1, ..., u_{|U|}, c_j). This is a considerably harder task, as the difference between a coherent and an incoherent dialog is only the last utterance. We train the coherence models with these coherent (P) and incoherent (N) examples. The trained models give a score for each example based on its coherence. We then use our aforementioned assumption (coherence models should score P higher than N) for the evaluation. This task resembles the (synthetic) insertion task (Elsner and Charniak, 2011b) in that the goal here is to find the correct utterance for the last position.
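Constructing the coherent and incoherent dialog examples described above can be sketched as follows (the utterance strings are hypothetical):

```python
def build_examples(context, candidates, correct):
    """Form one coherent dialog (context + correct next utterance) and
    several incoherent dialogs (context + each wrong candidate) from a
    DSTC-style response-selection example."""
    positive = list(context) + [correct]
    negatives = [list(context) + [c] for c in candidates if c != correct]
    return positive, negatives

ctx = ["hi, my wifi keeps dropping", "which driver are you on?"]
pool = ["try reinstalling the driver", "i like turtles", "what is apt?"]
pos, negs = build_examples(ctx, pool, "try reinstalling the driver")
```

The coherence model is then trained (or evaluated) to score `pos` above every element of `negs`, which is what makes the task a ranking problem over the candidate pool.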
Setup. We evaluated the coherence models on both datasets of the DSTC8 response selection track, i.e., the Advising and Ubuntu datasets. 6 The former contains two-party dialogs that simulate a discussion between a student and an academic advisor, while the latter consists of multi-party conversations extracted from the Ubuntu IRC channel (Kummerfeld et al., 2019). For a given conversational context, the goal is to select the next utterance from a candidate pool of 100 utterances, which may or may not contain the correct next utterance. We filter the datasets to suit the settings for coherence models. In our refined datasets, we exclude the conversations that have fewer than 7 or more than 50 utterances in the context. To ensure that we have pairwise coherent and incoherent examples, we only include the conversations that contain the correct next utterance in the candidate pool. Table 7 shows the statistics of our refined datasets for the utterance ranking task.

6: https://github.com/dstc8-track2/NOESIS-II/
Results. Table 8 summarizes the results on the refined datasets for the utterance ranking task. In the last column, we report the percentage of samples in which the coherence models score the positive sample higher than the negative one. All model performances are better than a random baseline, with UNIFIEDMODEL reaching 74.49% on the Ubuntu dataset. Note, however, that because there are 100 negative samples for every positive sample, these accuracies are skewed and not representative of the actual task difficulty.
The DSTC8 challenge ranking considers the average of Recall@1, Recall@5, Recall@10 and Mean Reciprocal Rank (MRR). We report both the official evaluation results and the coherence models' performance even though the latter is tested on the refined datasets. From the results, we see that the overall performance of all the coherence models is quite poor. Despite being re-trained on task specific data, we find that coherence model performance in this task is sub-par.
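The Recall@k and MRR metrics used in the DSTC8 ranking can be sketched as follows (candidate IDs hypothetical):

```python
def recall_at_k(ranked, correct, k):
    """1.0 if the correct utterance appears in the top k of `ranked`,
    else 0.0."""
    return float(correct in ranked[:k])

def mrr(ranked_lists, corrects):
    """Mean reciprocal rank of the correct item over many examples;
    an example contributes 0 when its correct item is absent."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, corrects):
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(ranked_lists)

ranked = ["c3", "c1", "c7"]           # model's ordering, best first
r1 = recall_at_k(ranked, "c1", 1)     # 0.0: correct item not at rank 1
r5 = recall_at_k(ranked, "c1", 5)     # 1.0: correct item within top 5
score = mrr([ranked, ["c2", "c9"]], ["c1", "c9"])  # (1/2 + 1/2) / 2 = 0.5
```

The challenge score averages Recall@1, Recall@5, Recall@10 and MRR over the test examples.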

Task-specific Training for MT
As a special use case, we report the results of retraining the coherence models using machine translation data for coherence evaluation. The aim is to investigate whether changing the usual training setup, that uses negative documents which are only small variations of the positive documents, might help coherence models learn more useful task-specific features.
Setup. Under the assumption that the reference translations are more coherent at the document level than the system translations, we train the coherence models with the reference text as the positive document and the system translation as the negative document, forming positive-negative document pairs. We use the data from WMT-2011 to WMT-2015 for training (28,985 document pairs), WMT-2016 for development (7,647 document pairs), and the same test data (WMT-2017 to WMT-2018; 20,680 document pairs) and study data as used for the previous experiment (§3.2.1).
Results. Table 9 reports the accuracy of the re-trained models and the results of the model ranking comparison against human rankings. Many of the models show improved performance, with the agreements increasing correspondingly. UNIFIEDMODEL has by far the largest accuracy improvement, about 34 percentage points (from 43.36% to 77.35%). It also has the highest agreement with human rankings at 0.82. We surmise that the model's adaptive pairwise ranking loss along with its additional language model loss boosts its performance on in-domain test data.

Discussion
Compared to the downstream tasks of coherence evaluation in MT and extractive and abstractive summarization, the traditional global discrimination task can be considered to be a simpler task (Elsner and Charniak, 2011b), since the difference between the positive and the negative document is a permutation/re-ordering of the sentences. This may be rendering the models unable to learn features that are useful for downstream applications, which are likely to have other, different kinds of errors.

Table 9: Re-trained MT setting results on WMT2017-2018 data. Accuracies: % of times the reference scored higher; AC1 agreements for system translation rankings between annotators and models.
On the next utterance ranking task, the models fail to generalize and perform quite poorly despite task-specific re-training. The best model performance for the synthetic task of insertion, which is similar, also barely reaches 26% (Elsner and Charniak, 2011b;Nguyen and Joty, 2017). This indicates that the training procedures may not be providing the right setting to learn features that are generic enough to apply to tasks in a harder setup.
In the synthetic tasks, the models' self-supervision comes from distinguishing an original coherent document from its incoherent renderings generated by random permutations of its sentences. This permutation-based self-supervision tries to capture document-level language properties. However, it is quite likely that this is simply a poor approximation of real-world coherence problems. Consider, for example, that MT systems mostly translate at the sentence level. Consecutive sentences may lack coherence, but if two system translations of a text are compared, the translations themselves will be in the same order for both. The coherence models are not trained for such (real-world) settings.
Another possibility is that outputs from downstream tasks have different error distributions that are captured to varying degrees by different models, since they are originally designed based on synthetic tasks. That is, models that perform very well on the permutation task might be overfitting on this task, and therefore failing to find coherence issues that are more subtle than shuffled text. Thus, we conclude that the current self-supervision for coherence modeling is not suitable for downstream coherence problems.
When re-trained on machine translation data, most of the model performances improve, implying that a different training setting may be required to make the models applicable to actual downstream tasks. This is not apparent from the evaluation results that are usually reported, which show performances crossing the 90% mark. Elsner and Charniak (2011a) show a similar lack of generalizability and applicability of coherence models in the downstream task of chat disentanglement. Our results suggest that despite nearly a decade of research since, the standard training and testing paradigm for coherence modeling continues to be inadequate in its capability to generalize to real-world use cases and even to similar task settings, and also fails to be indicative of real-world task performance.

Conclusions
We benchmark the performance of representative traditional and neural coherence models on standard synthetic discrimination tasks, and contrast this with their performance on various downstream application tasks in NLP. We show that higher accuracies on synthetic tasks do not translate into better performance on downstream tasks. We demonstrate this for real-world tasks like MT and summarization coherence evaluation, and next utterance ranking. Our results signal a need for change in the way coherence models are typically trained and evaluated.
Other downstream applications like coherence evaluation of language model generated text and tasks such as chat disentanglement are also good candidates for testing coherence models. It would be worthwhile to build a coherence testset that is independent of the training tasks and similar to downstream applications, which could be used by the community to test the generalization ability of their models. In future work, we also hope to investigate the possible training scenarios that will result in more generalizable coherence models which can be used for evaluating downstream tasks.