A Multi-task Learning Framework for Quality Estimation

Quality Estimation (QE) is the task of evaluating machine translation output in the absence of a reference translation. Conventional approaches to QE involve training separate models at different levels of granularity, viz., word-level, sentence-level, and document-level, which sometimes leads to inconsistent predictions for the same input. To overcome this limitation, we focus on jointly training a single model for sentence-level and word-level QE tasks in a multi-task learning framework. Using two multi-task learning-based QE approaches, we show that multi-task learning improves the performance of both tasks. We evaluate these approaches by performing experiments in different settings, viz., single-pair, multi-pair, and zero-shot. We compare the multi-task learning-based approach with baseline QE models trained on single tasks and observe an improvement of up to 4.28% in Pearson's correlation (r) at sentence-level and 8.46% in F1-score at word-level, in the single-pair setting. In the multi-pair setting, we observe improvements of up to 3.04% at sentence-level and 13.74% at word-level, while in the zero-shot setting, we observe improvements of up to 5.26% and 3.05%, respectively. We make the models proposed in this paper publicly available.


Introduction
Quality Estimation (QE) is a sub-task in the Machine Translation (MT) field. It facilitates the evaluation of MT output without a reference translation by predicting its quality rather than measuring its similarity to a reference (Specia et al., 2010). QE is performed at different levels of granularity, viz., word-level QE (Ranasinghe et al., 2021), sentence-level QE (Ranasinghe et al., 2020b), and document-level QE (Ive et al., 2018).
In the sentence-level QE task, current models predict the z-standardized Direct Assessment (DA) score when a source sentence and its translation are provided as inputs. The DA score is a number in the range of 0 to 100 that denotes the quality of the translation and is obtained from multiple human annotators. These scores are then standardized into z-scores, which are used as labels to train the QE model (Graham et al., 2016).
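As a small illustration of the standardization step, the sketch below converts raw DA scores to z-scores per annotator and averages them per segment. The function name, the data layout, the per-annotator normalization, and the use of the population standard deviation are assumptions made for this example; they are not details taken from the WMT annotation pipeline.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_standardize(ratings):
    """ratings: list of (annotator_id, segment_id, raw DA score in [0, 100]).

    Returns {segment_id: mean z-score over the annotators who rated it}.
    """
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    # Per-annotator mean and standard deviation (population std is an assumption).
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    by_segment = defaultdict(list)
    for annotator, segment, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who gave identical scores everywhere carries no signal.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        by_segment[segment].append(z)

    return {segment: mean(zs) for segment, zs in by_segment.items()}


# Toy data: two hypothetical annotators with different scoring ranges.
ratings = [
    ("a1", "s1", 90), ("a1", "s2", 50), ("a1", "s3", 10),
    ("a2", "s1", 70), ("a2", "s2", 60), ("a2", "s3", 50),
]
labels = z_standardize(ratings)
```

Although the two annotators use very different ranges of the 0 to 100 scale, both rank the three segments identically, so after standardization they contribute the same z-score to each segment.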
Unlike the sentence-level QE task, the word-level QE task consists of training a model to predict the 'OK' or 'BAD' tag for each token in a source sentence and its translation. These tags are obtained automatically by comparing the translation with its human post-edits using a token-matching approach. Each source sentence token is tagged as 'OK' if its translation appears in the output and is tagged as 'BAD' otherwise. Similarly, a translation token is assigned an 'OK' tag if it is a correct translation of a source sentence token, and 'BAD' otherwise. Apart from the tokens in the translation, the gaps between the translation tokens are also assigned OK/BAD tags. In case of missing tokens, the gap is tagged as 'BAD', and 'OK' otherwise (Logacheva et al., 2016).
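The translation-side tagging can be illustrated with a simplified sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in for the alignment tooling used to build the actual datasets, so it should be read as an approximation of the idea rather than the procedure of Logacheva et al. (2016). The function name and the convention of N+1 gaps for N tokens are assumptions, and source-side tags are not covered.

```python
from difflib import SequenceMatcher


def tag_translation(mt_tokens, pe_tokens):
    """Tag MT tokens and gaps by matching the MT output against its post-edit.

    Returns (token_tags, gap_tags); gap_tags[i] is the gap before token i,
    and the last entry is the gap after the final token.
    """
    token_tags = ["BAD"] * len(mt_tokens)
    gap_tags = ["OK"] * (len(mt_tokens) + 1)

    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op == "equal":
            # Tokens kept unchanged by the post-editor are correct.
            for i in range(i1, i2):
                token_tags[i] = "OK"
        elif op == "insert":
            # The post-editor added tokens here, so the MT has a missing word.
            gap_tags[i1] = "BAD"
        # "replace" and "delete" leave the affected MT tokens tagged BAD.

    return token_tags, gap_tags


mt = "the temple is near to holy place".split()
pe = "the temple is close to the holy place".split()
token_tags, gap_tags = tag_translation(mt, pe)
```

In this toy pair, "near" is tagged 'BAD' because the post-editor replaced it, and the gap before "holy" is tagged 'BAD' because the post-edit inserts "the" at that position.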
To perform each of these tasks, various deep learning-based approaches are being used (Zerva et al., 2022). While these approaches achieve acceptable performance by focusing on a single task, the learning mechanism ignores information from other QE tasks that might help it do better. By sharing information across related tasks, one can expect the task performance to improve, especially when the tasks are closely related, as is the case with sentence-level and word-level QE. Also, having a separate model for each QE task can cause problems in practical scenarios, such as higher memory and computational requirements. In addition, the different models can produce conflicting information, e.g., a high DA score but many errors at the word level.
In this paper, we utilize two multi-task learning (MTL)-based (Ruder, 2017) approaches for word-level and sentence-level QE tasks with the help of a single deep neural network-based architecture. We perform experiments with existing QE datasets (Specia et al., 2020; Zerva et al., 2022) with both MTL approaches to combine word-level and sentence-level QE tasks. We test the following scenarios: a) single-pair QE, b) multi-pair QE, and c) zero-shot QE. The code and models are made available to the community via GitHub.
To the best of our knowledge, we introduce a novel application of the Nash-MTL (Navon et al., 2022) method to both tasks in Quality Estimation. Our contributions are:
1. Showing that jointly training a single model using MTL for sentence-level and word-level QE tasks improves performance on both tasks. In a single-pair setting, we observe an improvement of up to 3.48% in Pearson's correlation (r) at the sentence-level and 7.17% in F1-score at the word-level.
2. Showing that the MTL-based QE models are significantly more consistent on word-level and sentence-level QE tasks for the same input, as compared to the single-task learning-based QE models.
We discuss the existing literature in Section 2 and the datasets used in Section 3. The MTL-based QE approach is presented in Section 4. The experimental setup is described in Section 5. Section 6 discusses the results in detail, including a qualitative analysis of a few sample outputs. We conclude this article in Section 7, where we also propose future research directions in the area.

Related Work
During the past decade, there has been tremendous progress in the field of machine translation quality estimation, primarily as a result of the shared tasks organized annually by the Conferences on Machine Translation (WMT) since 2012. These shared tasks have produced benchmark datasets on various aspects of quality estimation, including word-level and sentence-level QE. Furthermore, these datasets have led to the development and evaluation of many open-source QE systems like QuEst (Specia et al., 2013), QuEst++ (Specia et al., 2015), deepQuest (Ive et al., 2018), and OpenKiwi (Kepler et al., 2019). Before the neural network era, most quality estimation systems, like QuEst (Specia et al., 2013) and QuEst++ (Specia et al., 2015), were heavily dependent on linguistic processing and feature engineering to train traditional machine-learning algorithms like support vector regression and randomized decision trees (Specia et al., 2013).
In recent years, neural-based QE systems such as deepQuest (Ive et al., 2018) and OpenKiwi (Kepler et al., 2019) have consistently topped the leaderboards in WMT quality estimation shared tasks (Kepler et al., 2019). These architectures revolve around an encoder-decoder Recurrent Neural Network (RNN) (referred to as the 'predictor'), stacked with a bidirectional RNN (the 'estimator') that produces quality estimates. One of the disadvantages of this architecture is that it requires extensive predictor pre-training, which means it depends on large parallel data and is computationally intensive (Ive et al., 2018). This limitation was addressed by TransQuest (Ranasinghe et al., 2020b), which won the WMT 2020 shared task on sentence-level DA. TransQuest eliminated the requirement for a predictor by using cross-lingual embeddings (Ranasinghe et al., 2020b). The authors fine-tuned an XLM-Roberta model on a sentence-level DA task and showed that a simple architecture could produce state-of-the-art results. Later, the TransQuest framework was extended to the word-level QE task (Ranasinghe et al., 2021).
A significant limitation of TransQuest is that it trains separate models for word-level and sentence-level QE tasks. While this approach has produced state-of-the-art results, managing two models requires more computing resources. Furthermore, since the two models are not interconnected, they can provide conflicting predictions for the same translation. To overcome these limitations, we propose a multi-task learning approach to QE.
Multitask architectures have been employed in several problem domains, such as computer vision (Girshick, 2015; Zhao et al., 2018) and natural language processing (NLP). In NLP, tasks such as text classification (Liu et al., 2017), natural language generation (Liu et al., 2019), part-of-speech tagging, and named entity recognition (Collobert and Weston, 2008) have benefited from MTL. In QE too, Kim et al. (2019) developed an MTL architecture using a bilingual BERT model. However, the model does not provide results similar to or better than state-of-the-art QE frameworks such as TransQuest (Ranasinghe et al., 2021). Some of the recent WMT QE shared task submissions also use MTL to develop QE systems (Specia et al., 2020, 2021; Zerva et al., 2022). As these submissions are not all evaluated under the same experimental settings and use different techniques along with MTL, the improvements due to MTL alone are difficult to assess. In this paper, we introduce a novel MTL approach for QE that outperforms TransQuest in both word-level and sentence-level QE tasks, in various experimental settings.

Datasets: WMT 2022
We use data provided in the WMT21 (Specia et al., 2021) and WMT22 (Zerva et al., 2022) Quality Estimation Shared tasks for our experiments. We choose language pairs for which word-level and sentence-level annotations are available for the same source-translation pairs. The data consists of three low-resource language pairs: English-Marathi (En-Mr), Nepali-English (Ne-En), Sinhalese-English (Si-En); three medium-resource language pairs: Estonian-English (Et-En), Romanian-English (Ro-En), Russian-English (Ru-En); and one high-resource language pair: English-German (En-De). For the English-Marathi language pair, the data consists of 20K training instances and 1K instances each for validation and testing. The training set consists of 7K instances for all other language pairs, and the validation and test sets consist of 1K samples each.
Each sample in the word-level QE data for any language pair except English-Marathi consists of a source sentence, its translation, and a sequence of tags for tokens and gaps. For the English-Marathi pair, the WMT22 dataset does not contain tags for gaps between tokens. Therefore, we used the QE corpus builder to obtain annotations for translations using their post-edited versions.

Approach
In this section, we briefly discuss the TransQuest framework, explain the architecture of our neural network, and then discuss the MTL approaches we used for the experimentation, along with the mathematical modeling.

TransQuest Framework
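We use the MonoTransQuest (for sentence-level QE model) (Ranasinghe et al., 2020b) and MicroTransQuest (for word-level QE model) (Ranasinghe et al., 2021) architectures to perform the single-task-based QE experiments. The MonoTransQuest architecture (Figure 1) uses a single XLM-R (Conneau et al., 2020) transformer model. The input of this model is a concatenation of the original sentence and its translation, separated by a special [SEP] token. The inputs are passed to an embedding layer to obtain embeddings for each token. The Direct Assessment (DA) scores are produced by passing the output of the [CLS] token through a softmax layer.
Similarly, the MicroTransQuest architecture presented in Figure 2 also uses the XLM-R transformer. The input to this model is a concatenation of the original sentence and its translation, separated by the [SEP] token. Additionally, [GAP] tokens are added between the translation tokens. Finally, the output for each token is passed through a softmax layer to obtain the OK or BAD tag for that token.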

Model Architecture
Considering the success that transformers have demonstrated in translation quality estimation (Ranasinghe et al., 2020a; Wang et al., 2021), we chose to employ the transformer as a base model for our MTL approach. Our approach learns two tasks jointly: sentence-level and word-level quality estimation.
Figure 3 depicts the architecture of the model used in our approach. The implemented architecture shares hidden layers between the sentence-level and word-level QE tasks. The shared portion includes the XLM-Roberta (Conneau et al., 2020) model, which learns shared representations (and extracts information) across tasks by minimizing a combined loss function. The task-specific heads receive input from the last hidden layer of the transformer language model and predict the output for each task (details are provided in the next two paragraphs).
Sentence-level Quality Estimation Head. Using the hidden representation of the classification token ([CLS]) from the transformer model, we predict the DA score by applying a linear transformation:
$$\hat{y}_{da} = h_{[CLS]} W_{[CLS]} + b_{[CLS]} \tag{1}$$
where $W_{[CLS]} \in \mathbb{R}^{D \times 1}$, $b_{[CLS]} \in \mathbb{R}^{1 \times 1}$, and $D$ is the dimension of the input layer $h$ (the top-most layer of the transformer).
Word-level Quality Estimation Head. We predict the word-level labels (OK/BAD) by applying a linear transformation, followed by a softmax, to the last-hidden-layer representation of every input token:
$$\hat{y}_{word} = \mathrm{softmax}(h_t W_{token} + b_{token}) \tag{2}$$
where $t$ marks which token the model is to label within a $T$-length token sequence, $W_{token} \in \mathbb{R}^{D \times 2}$, and $b_{token} \in \mathbb{R}^{1 \times 2}$. This part is similar to the MicroTransQuest architecture in Ranasinghe et al. (2021).
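The two task-specific heads can be sketched in a few lines of plain Python: a linear map on the [CLS] vector for the DA score, and a linear map followed by a softmax on every token vector for the OK/BAD tags. The toy dimensions, random weights, the position of [CLS] at index 0, and the label order (index 0 for OK, index 1 for BAD) are assumptions for illustration; in the actual model the hidden states come from the shared XLM-Roberta encoder.

```python
import math
import random


def linear(h, W, b):
    """Row vector h (length D) times matrix W (D x C), plus bias b (length C)."""
    return [sum(h[d] * W[d][c] for d in range(len(h))) + b[c] for c in range(len(b))]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_head(hidden, W_cls, b_cls):
    """Predict a DA score from the [CLS] representation (assumed at position 0)."""
    return linear(hidden[0], W_cls, b_cls)[0]


def word_head(hidden, W_token, b_token):
    """Predict [P(OK), P(BAD)] for every position after [CLS]."""
    return [softmax(linear(h_t, W_token, b_token)) for h_t in hidden[1:]]


random.seed(0)
D, T = 8, 5  # toy hidden size and sequence length
hidden = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T + 1)]
W_cls = [[random.gauss(0, 0.1)] for _ in range(D)]            # D x 1
b_cls = [0.0]
W_token = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(D)]  # D x 2
b_token = [0.0, 0.0]

da_score = sentence_head(hidden, W_cls, b_cls)
tag_probs = word_head(hidden, W_token, b_token)
tags = ["OK" if p[0] >= p[1] else "BAD" for p in tag_probs]
```

Both heads read the same `hidden` states, which is what allows the shared encoder to receive training signal from both tasks.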

Multi-Task Learning
We use two MTL approaches to train the QE models. In the first approach, task-specific losses are combined into a single loss by summing them. The second approach considers the gradient conflicts and follows a heuristic-based approach to decide the update direction.
Linear Scalarization (LS). We train the system by minimizing the Mean Squared Error (MSE) for the sentence-level QE task and the cross-entropy loss for the word-level QE task, as defined in Equation 3 and Equation 4, where $y_{da}$ and $y_{word}$ represent the ground-truth labels:
$$\mathcal{L}_{da} = \mathrm{MSE}(y_{da}, \hat{y}_{da}) \tag{3}$$
$$\mathcal{L}_{word} = -\sum_{i=1}^{2} \left(y_{word} \odot \log \hat{y}_{word}\right)[i] \tag{4}$$
where $v[i]$ retrieves the $i$th item in a vector $v$ and $\odot$ indicates element-wise multiplication. To combine the two losses into one objective, the parameters $\alpha$ and $\beta$ are used to balance the importance of the tasks. In this study, we assign equal importance to each task and therefore set $\alpha = \beta = 1$. The final loss is shown in Equation 5:
$$\mathcal{L} = \alpha \mathcal{L}_{da} + \beta \mathcal{L}_{word} \tag{5}$$
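A minimal sketch of the linear scalarization objective follows: MSE on the DA scores, cross-entropy on the OK/BAD tags, and a weighted sum with the α and β weights. Averaging the cross-entropy over tokens and leaving the weighted sum unnormalized are assumptions of this example, as are the function names and the toy values.

```python
import math


def mse_loss(y_true, y_pred):
    """Mean squared error over a batch of DA scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy_loss(y_true_onehot, y_pred_probs):
    """Mean over tokens of -sum_i (y * log y_hat)[i] for two classes (OK, BAD)."""
    total = 0.0
    for y, p in zip(y_true_onehot, y_pred_probs):
        total += -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p))
    return total / len(y_true_onehot)


def combined_loss(l_da, l_word, alpha=1.0, beta=1.0):
    """Weighted sum of the two task losses; alpha = beta = 1 weighs them equally."""
    return alpha * l_da + beta * l_word


# Toy batch: two sentences and two tokens.
l_da = mse_loss([0.5, -0.2], [0.3, 0.0])
l_word = cross_entropy_loss([[1, 0], [0, 1]], [[0.8, 0.2], [0.4, 0.6]])
loss = combined_loss(l_da, l_word)
```

With equal weights the gradient of `loss` is simply the sum of the two task gradients, which is the situation in which conflicting gradients can pull the shared encoder in different directions.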
We set up two baselines: single-task learning-based sentence-level QE and word-level QE models. The sentence-level QE model takes a source sentence and its translation as input and predicts the DA score. We use the MonoTransQuest implementation in Ranasinghe et al. (2020b) for this sentence-level QE model. The word-level QE model predicts whether each token (word) is OK or BAD using a softmax classifier. We use the MicroTransQuest implementation in Ranasinghe et al. (2021) as the word-level QE model.
Nash Multi-Task Learning (Nash-MTL). Joint training of a single model using multi-task learning is known to lower computation costs. However, due to potential conflicts between the gradients of different tasks, joint training typically results in the jointly trained model performing worse than its single-task counterparts. Combining per-task gradients into a joint update direction using a specific heuristic is a popular technique for solving this problem. In this approach, the tasks negotiate for a joint direction of parameter update.

Algorithm 1 Nash-MTL
Input: $\theta_0$: initial parameter vector, $\{\ell_i\}_{i=1}^{K}$: differentiable loss functions, $\eta$: learning rate
Output: $\theta_T$
for $t = 1, \ldots, T$ do
    Compute task gradients

For the MTL problem with parameters $\theta$, the method assumes a sphere $B_\epsilon$ with its center at zero and a radius $\epsilon$. The update vectors $\Delta\theta$ are searched inside this sphere. The problem is framed as a bargaining problem by considering the centre as the point of disagreement and $B_\epsilon$ as the agreement set. For every player, the utility function is $u_i(\Delta\theta) = g_i^{\top} \Delta\theta$, where $g_i$ denotes the gradient vector at $\theta$ of the loss of task $i$. Additional details, theoretical proofs, and empirical results on various tasks can be found in Navon et al. (2022), who proposed this gradient combination.
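To make the bargaining idea concrete for the two tasks considered here, the sketch below computes task weights for K = 2 in closed form. It assumes the optimality condition $G^{\top} G \alpha = 1/\alpha$ (element-wise reciprocal, with $G$ holding the task gradients as columns) reported by Navon et al. (2022); the closed form is derived here only for two tasks and the update is correct up to scaling. It is not the authors' released implementation, which handles the general K-task case with an iterative solver. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nash_weights_two_tasks(g1, g2):
    """Solve G^T G alpha = 1/alpha for two gradients (assumed condition, K = 2)."""
    n1, n2 = math.sqrt(dot(g1, g1)), math.sqrt(dot(g2, g2))
    c = dot(g1, g2)
    denom = n1 * n2 + c  # zero only when the gradients are exactly opposed
    if n1 == 0 or n2 == 0 or denom <= 1e-12:
        raise ValueError("gradients are zero or point in opposite directions")
    alpha1 = math.sqrt(n2 / (n1 * denom))
    alpha2 = math.sqrt(n1 / (n2 * denom))
    return alpha1, alpha2


def nash_update(theta, g1, g2, lr):
    """One gradient step along the weighted combination of both task gradients."""
    a1, a2 = nash_weights_two_tasks(g1, g2)
    direction = [a1 * x + a2 * y for x, y in zip(g1, g2)]
    return [p - lr * d for p, d in zip(theta, direction)]


# Toy gradients that partially conflict (negative inner product).
g_sent = [1.0, 0.0]
g_word = [-0.5, 1.0]
alpha_sent, alpha_word = nash_weights_two_tasks(g_sent, g_word)
direction = [alpha_sent * x + alpha_word * y for x, y in zip(g_sent, g_word)]
theta = nash_update([0.0, 0.0], g_sent, g_word, lr=0.1)
```

Even though the two toy gradients conflict, the combined direction has a positive inner product with both of them, so a small step along its negative decreases both losses to first order.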

Experimental Setup
This section describes the different experiments we perform and the metrics we use to evaluate our approach. We also discuss the training details and mention the computational resources used for training the models.
Experiments. We perform our experiments under three settings: single-pair, multi-pair, and zero-shot. For each setting, we train one sentence-level, one word-level, and two MTL-based QE models. The first two models are the Single-Task Learning (STL)-based QE models (STL QE), and we use their performance as baselines. The TransQuest framework (Ranasinghe et al., 2020b) contains the MonoTransQuest model for the sentence-level QE task and the MicroTransQuest model (Ranasinghe et al., 2021) for the word-level QE task, which we used to reproduce baseline results for all the language pairs investigated in this paper. The next two models are the MTL-based QE models (MTL QE) trained using the two MTL approaches explained in Section 4. For training LS models, we use the Framework for Adapting Representation Models (FARM), while for training Nash-MTL models, we use the implementation shared by the authors. All the experiments use all seven language pairs introduced in Section 3.
In the single-pair setting, we only use the data of one particular language pair for training and evaluation. In the multi-pair setting, however, we combine the training data of all the language pairs and evaluate the model using the test sets of all language pairs. For the transfer-learning experiments (zero-shot setting), we combine the training data of all language pairs except the language pair on which we evaluate the model.
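The three settings differ only in which language pairs contribute training data for a given evaluation pair. The following sketch states that selection rule explicitly; the function name and the setting strings are hypothetical, and the pair codes are those listed in Section 3.

```python
LANGUAGE_PAIRS = ["En-Mr", "Ne-En", "Si-En", "Et-En", "Ro-En", "Ru-En", "En-De"]


def training_pairs(setting, target_pair, all_pairs=LANGUAGE_PAIRS):
    """Return the language pairs whose training data is used when the
    model is evaluated on `target_pair` under the given setting."""
    if target_pair not in all_pairs:
        raise ValueError(f"unknown language pair: {target_pair}")
    if setting == "single-pair":
        return [target_pair]
    if setting == "multi-pair":
        return list(all_pairs)
    if setting == "zero-shot":
        # The evaluation pair is held out entirely from training.
        return [pair for pair in all_pairs if pair != target_pair]
    raise ValueError(f"unknown setting: {setting}")


single = training_pairs("single-pair", "Si-En")
multi = training_pairs("multi-pair", "Si-En")
zero_shot = training_pairs("zero-shot", "Si-En")
```

In the multi-pair setting one model is trained once on all pairs and evaluated on each test set, whereas the zero-shot setting requires one training run per held-out pair.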
Evaluation. We use the Pearson correlation (r) between the predictions and gold-standard annotations to evaluate sentence-level QE, as it is a regression task. Similarly, for word-level QE, which is treated as a token-level classification task, we use the F1-score as the evaluation metric. We test the statistical significance of differences in the primary metrics using the Williams significance test (Graham, 2015).
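Both metrics are simple to compute, as the sketch below shows. Macro-averaging the F1-score over the OK and BAD classes follows the "macro F1-score" mentioned with the single-pair results and is otherwise an assumption; the function names and toy labels are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def f1_for_label(gold, pred, label):
    """F1-score treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=("OK", "BAD")):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


gold_tags = ["OK", "OK", "BAD", "BAD", "OK"]
pred_tags = ["OK", "BAD", "BAD", "OK", "OK"]
word_score = macro_f1(gold_tags, pred_tags)
sent_score = pearson_r([0.9, 0.1, -0.7], [0.8, 0.3, -0.5])
```

Macro-averaging gives the usually rarer 'BAD' class the same weight as 'OK', so a model that tags everything 'OK' is not rewarded.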
Training Details. To maintain uniformity across all the languages, we used an identical set of settings for all the language pairs examined in this work. For the STL and LS-MTL models, we use a batch size of 16. We start with a learning rate of 2e-5 and use 5% of the training data for warm-up. We use early stopping with a patience of 10 steps. The Nash-MTL models are trained using the configuration outlined in Navon et al. (2022). Considering the availability of computational resources, the STL QE models are trained using NVIDIA RTX A5000 GPUs, and the MTL QE models using NVIDIA DGX A100 GPUs. Additional training details are provided in Appendix A.

Results and Discussion
Results of the single-pair, multi-pair, and zero-shot settings are presented in this section. The tables referred to in this section report the performance of the STL, LS-MTL, and Nash-MTL QE models using the Pearson correlation (r) and F1-score for sentence-level and word-level QE, respectively. We could not conduct a direct performance comparison between our QE models and the winning entries of the recent WMT QE shared tasks for the following reasons: (1) the nature of the word-level QE task and its evaluation methodology have changed over the years. Until last year, gaps between translation tokens were a part of the data, and the 'OK' or 'BAD' tags were predicted for them as well, but the WMT22 shared task did not consider these gaps; and (2) not all the language pairs investigated in this paper have been a part of the WMT QE tasks in the same year. Therefore, we establish a standard baseline using the Transformers-based framework, TransQuest, and show improvements.
We also compare the Pearson correlation coefficients obtained by the STL and MTL QE models to assess whether the MTL QE model predictions on both tasks for the same inputs are consistent (Table 4). Furthermore, we perform a qualitative analysis of the output for the En-Mr, Ro-En, and Si-En language pairs, and show some examples in Table 5. We discuss the analysis in detail in subsection 6.4.

Single-Pair Setting
The results for the first experimental setting are presented in Table 1. The MTL QE approaches provide significant performance improvements for all language pairs in the sentence-level and word-level QE tasks over the respective STL QE models. In the word-level QE task, the Nash-MTL QE models outperform the STL and LS-MTL models for all language pairs. Our approach achieves the highest improvement of 8.46% in terms of macro F1-score for the Et-En language pair, while for En-De we observe the least improvement, from the LS-MTL QE model (2.49%). The average improvement in the F1-score from the Nash-MTL model and the LS-MTL model is 6.29% and 4.06%, respectively. For the sentence-level QE task, Pearson correlation (r) between the QE system prediction scores and true labels is used as an evaluation metric. For this task, the MTL QE models again outperform the STL QE models for all language pairs. Here, the En-De Nash-MTL QE model obtains the largest performance improvement of 4.28% over the corresponding STL QE model. A minor performance improvement of 0.33% is observed for the Ro-En language pair using the LS-MTL QE model. The average improvement in Pearson's correlation (r) from the Nash-MTL model and the LS-MTL model is 2.75% and 2.10%, respectively.
Except for the Ro-En Nash-MTL QE model's performance in the sentence-level QE task, the Nash-MTL QE models achieve the largest improvements over the STL and LS-MTL QE models for all language pairs in both tasks. This shows that the bargaining between the gradient update directions for the sentence-level and word-level QE tasks, which the Nash-MTL method arranges, results in effective learning. The results of both tasks also show that we get more improvements for low-resource and mid-resource language pairs than for the high-resource language pair. We additionally report the results obtained by the WMT QE shared task winning systems in Appendix C. The WMT figures are not directly comparable to our results. Although they are higher than ours, our aim is to show that multi-task learning is more effective than single-task learning. Any QE technique can therefore seriously consider adopting MTL in preference to STL. Of course, if the STL figures are already high, the improvement may not be significant, which we have also observed.

Multi-Pair Setting
Table 2 tabulates the results for the multi-pair setting. The multi-pair setting can benefit the word-level QE task due to vocabulary overlap and the sentence-level QE task due to syntactic similarities between the language pairs. In this setting, MTL improves performance for all language pairs in the word-level QE task. Using the LS-MTL QE model, the highest F1-score improvement of 7.63% is observed for the Si-En language pair, while with the Nash-MTL QE model, the best improvement is 13.74%. The least improvement with the LS-MTL QE model is observed for the Ru-En pair (2.76%), while for the Nash-MTL-based QE model it is 2.46%, also for the Ru-En pair.
Though the improvements observed in the word-level QE task in this setting when using MTL QE approaches are even higher than in the single-pair setting, we see the opposite trend in the sentence-level QE task results. At the sentence level, we observe a slight degradation in the results of the En-Mr, En-De, and Ro-En MTL QE models. The largest improvement in Pearson correlation over the STL QE model is 3.04%, obtained by the Nash-MTL QE model. For the Ro-En pair, both MTL QE models fail to bring improvements over the STL QE model. For the Ne-En and Et-En pairs, the LS-MTL QE model outperforms the Nash-MTL QE model. In this setting, the Nash-MTL technique provides results similar to the LS-MTL technique. Also, we observe that the Nash-MTL QE approach benefits the low-resource language pairs the most. We also see higher improvements for the mid-resource language pairs than for the high-resource language pair.

Zero-shot Setting
Table 3 shows the results for the zero-shot setting. The MTL QE models achieve better performance for both tasks than their STL-based counterparts for all the language pairs, except for the En-Mr language pair in the sentence-level QE task. Surprisingly, for the Ne-En pair, the LS-MTL model outperforms the Nash-MTL QE model in the sentence-level QE task by a small margin (0.0053), while for all other language pairs, the Nash-MTL QE models outperform the respective LS-MTL QE models. Similar to the trend in the previous two settings, the MTL QE approaches bring more benefits to the low-resource and mid-resource language pairs than to the high-resource language pair. In Appendix B, for each low-resource language pair, we include a table showing the comparison of the STL, LS-MTL, and Nash-MTL QE models. These tables show that the multi-pair setting helps the low-resource scenario.

Discussion
Consistent Predictions. Improvements shown by the MTL QE models in varied experimental settings on both tasks show that the tasks complement each other. We further assess the potential of the MTL QE models in predicting consistent outputs for both tasks over the same inputs. We do so by computing the correlation between the predicted DA scores and the percentage of tokens in a sentence for which the 'BAD' tag was predicted. Since a higher DA score should coincide with fewer 'BAD' tokens, a stronger negative correlation denotes better consistency. Table 4 shows the Pearson and Spearman correlations between sentence-level and word-level QE predictions on the test sets in the single-pair setting. For all the language pairs, the Nash-MTL QE models show a stronger correlation than the STL QE models. We also perform a qualitative analysis of the STL and MTL QE models for the En-Mr, Ro-En, and Si-En language pairs.
Qualitative Analysis. The first English-Marathi example is shown in Table 5. It contains a poor translation of the source sentence meaning, "The temple is close to the holy place where ages ago the Buddha was born." The STL word-level QE and MTL QE models predict the same output, assigning correct tags to the tokens, yet we observe a significant difference in the sentence-level scores predicted by the models. The STL sentence-level QE model outputs a high score of 0.25, while the score given by the MTL QE model is -0.64. This supports the observation that the MTL QE model outputs are more consistent.
Unlike the STL sentence-level QE models, the MTL QE models predict more justified quality scores when translations have only minor mistakes. The translation in the first Ro-En example in Table 5 is a high-quality translation. In this translation, the word "overwhelming" could have been replaced with a better lexical item. The STL QE model harshly penalizes the translation by predicting a z-score of -0.0164, while the MTL model predicts a more justifiable score (0.8149). Similar behaviour is reflected in the second Si-En example as well (last row). Even though the translation reflects the meaning of the source sentence adequately and is also fluent, the STL QE model predicts a low score of -0.35, while the MTL QE model rates the translation appropriately by predicting a score of 0.66.
We also observed that the MTL QE models have an edge when rating translations with many named entities. This can be seen in the second English-Marathi (Row 3), second Romanian-English (Row 5), and first Sinhala-English (Row 6) examples in Table 5. The translations are of high quality in these examples, and the MTL QE models rate them more appropriately than the STL QE models.

Conclusion and Future Work
In this paper, we showed that jointly training a single, pre-trained cross-lingual transformer over the sentence-level and word-level QE tasks improves performance on both tasks. We evaluated our approach in three different settings: single-pair, multi-pair, and zero-shot. The results on both QE tasks show that the MTL-based models outperform their STL-based counterparts for multiple language pairs in the single-pair setting. Given the performance in the zero-shot setting, we see promising transfer-learning capabilities in our approach. Consistent scores across both QE tasks for the same inputs demonstrate the effectiveness of the MTL method for QE. We release our MTL-based QE models and our code publicly under the CC-BY-SA 4.0 license for further research.
In the future, we wish to extend this work and evaluate the MTL-based QE models in a few-shot setting to assess the effectiveness of transfer learning. Further, we would like to explore the use of word-level QE and sentence-level QE to assist in the task of automatic post-editing for MT. We also wish to explore the use of language relatedness for building multi-pair MTL-based QE models.

Limitations
The experimental results suggest the possibility of our MTL-based QE approach being biased towards the word-level QE task, as the jointly trained QE models show better performance improvements for the word-level QE task than for the sentence-level QE task. Further, we also observe that our approach does not work well for language pairs with English as the source language (En-De and En-Mr). The qualitative analysis of the English-Marathi MTL-based QE model shows that the model performs poorly when inputs are in the passive voice. Our multi-pair setting experiments use all seven language pairs. We do not consider properties like the similarity between the languages, translation directions, etc., to group the language pairs, so it may be possible to achieve comparable performance using a subset of languages. We choose the Nash-MTL approach for MTL-based experiments because it has been compared with around ten other MTL techniques and has been shown to outperform them on different combinations of tasks. In the current work, we have not experimentally analyzed how the Nash-MTL approach gives better improvements than the LS-MTL approach.
and MicroTransQuest (for word-level QE model) (Ranasinghe et al., 2021) architectures to perform the single-task-based QE experiments. The MonoTransQuest architecture (Figure 1) uses a single XLM-R (Conneau et al., 2020) transformer model. The input of this model is a concatenation of the original sentence and its translation. Both these sequences are separated by a special [SEP] token. The inputs are passed to an embedding layer to obtain embeddings for each token. The Direct Assessment (DA) scores are produced by passing the output of the [CLS] token through a softmax layer.

Figure 1: Architecture of the MonoTransQuest sentence-level QE Model.

Figure 3: Architecture of the MTL QE model.

Table 5: Source-translation pairs along with z-standardized DA scores by STL, Nash-MTL QE models, and the ground truth labels.

Table 1: Results obtained for word-level (F1-scores) and sentence-level (Pearson (r)) QE tasks in the single-pair setting. STL: results from the models trained using TransQuest. LS-MTL and Nash-MTL: results obtained using the Linear Scalarization MTL approach and the Nash-MTL-based models, respectively. The first three rows show results for the low-resource language pairs, the next three for mid-resource, and the last for a high-resource language pair. [(*) indicates the improvement is not significant with respect to the baseline score.]

Table 2: Results obtained for word-level and sentence-level QE tasks in the multi-pair setting. [* indicates the improvement is not significant with respect to the baseline score.]

Table 3: Results obtained for word-level and sentence-level QE tasks in the zero-shot setting. [* indicates the improvement is not significant with respect to the baseline score.]

Table 4: Pearson (r) and Spearman (ρ) correlations between sentence-level and word-level QE predictions using STL and Nash-MTL QE models. The sentence-level QE prediction is the z-standardized Direct Assessment (DA) score, and the word-level QE prediction is the bad tag count normalized by sentence length.
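The consistency measure reported in Table 4 can be sketched as follows: for each test sentence, the number of predicted 'BAD' tags is divided by the sentence length, and this ratio is correlated with the predicted DA score. The predictions below are invented toy values, not outputs of the models in this paper, and only the Pearson variant is shown.

```python
import math


def bad_ratio(tags):
    """Fraction of tokens tagged BAD in one sentence."""
    return tags.count("BAD") / len(tags)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Hypothetical predictions for four sentences from a single model.
da_predictions = [0.8, 0.1, -0.6, -1.2]
tag_predictions = [
    ["OK", "OK", "OK", "OK"],
    ["OK", "BAD", "OK", "OK"],
    ["BAD", "BAD", "OK", "OK"],
    ["BAD", "BAD", "BAD", "OK"],
]

ratios = [bad_ratio(tags) for tags in tag_predictions]
consistency = pearson_r(da_predictions, ratios)
```

A value of `consistency` close to -1 means that sentences with more predicted errors receive lower predicted DA scores, which is the behaviour expected from a consistent model.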