D²TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

The many-to-many multimodal summarization (M³S) task aims to generate summaries in any language given document inputs in any language together with the corresponding image sequence; it essentially comprises the multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS and both have attracted increasing attention in recent years, little research pays attention to the M³S task. Besides, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS, or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicit yet complex training objectives. In this paper, we first introduce the general and practical M³S task. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M³S task. Specifically, the dual knowledge distillation method guarantees that the knowledge of MMS and MXLS can be transferred to each other and thus mutually improves both tasks. To offer target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed to discard needless visual information. Extensive experiments in the many-to-many setting show the effectiveness of the proposed approach. Additionally, we contribute a many-to-many multimodal summarization (M³Sum) dataset to the research community.


Introduction
Given a document input in a source language (e.g., English) and its corresponding image sequence, multimodal monolingual summarization (MMS) aims to generate a summary in the same language (i.e., English), while the goal of multimodal cross-lingual summarization (MXLS) is to produce a summary in a different language (e.g., Chinese). With the rapid increase of multimedia data, the MMS (Tjondronegoro et al., 2011; Evangelopoulos et al., 2013; Erol et al., 2003; Li et al., 2017, 2018a; Sanabria et al., 2018; Zhu et al., 2018; Chen and Zhuge, 2018; Li et al., 2020a; Fu et al., 2021; Zhao et al., 2022) and MXLS (Liu et al., 2022) tasks have attracted much attention in the research community, because both tasks help users quickly grasp the core idea of cumbersome multimodal data. Essentially, many-to-many multimodal summarization (M³S) consists of the MMS and MXLS tasks, generating summaries in any language given multimodal inputs in any language, as Fig. 1 shows. Intuitively, the many-to-many setup is more general and practical for application in a multilingual and multimodal world (Wang et al., 2023b).

Figure 1: An example of our M³Sum dataset. Inputs: an article and the corresponding image sequence; output: summaries in different languages. MMS: the summary in the same language as the input article; MXLS: the summary in a different language from the input article. The M³S setting covers both MMS and MXLS.
In the literature, although plenty of studies have been carried out on MMS or MXLS, only one study involves both of them: Liu et al. (2022) devise a triple-stage training framework and distill knowledge from MMS to enhance MXLS while ignoring the performance of MMS. Despite its effectiveness on MXLS, to our knowledge, little research attention has been paid to simultaneously supporting and improving both the MMS and MXLS tasks. Besides, visual features generally include summary-unrelated noise, so the remaining work mainly focuses on improving MMS models by filtering this noise with (a) implicit learning or (b) complex training objectives. For (a), researchers design various fusion methods to effectively model the interactions between textual articles and visual features (Liu et al., 2020; Yu et al., 2021; Palaskar et al., 2019; Zhang et al., 2021a). For (b), to explicitly filter needless visual information, Liang et al. (2022b) present two well-designed auxiliary tasks, i.e., vision-to-summary and masked image modeling. Albeit effective, implicit learning via the MMS objective may limit the potential of visual features, while explicit training objectives are complex and time-consuming to train and apply in the real world.
To address these issues, in this paper we first introduce a more general task, i.e., M³S, which supports both the MMS and MXLS tasks. Further, we propose a Dual knowledge Distillation and Target-oriented Vision enhanced framework, named D²TV, for the new task. Specifically, the dual knowledge distillation approach ensures that knowledge can be transferred from MMS to MXLS and vice versa, thus mutually improving both tasks. Furthermore, to discard summary-unrelated visual information, a target-oriented contrastive objective is devised to directly optimize the visual features. In this way, the model is encouraged to explicitly exploit summary-oriented visual features, thereby yielding more accurate summaries.
To validate the D²TV framework, we provide a Many-to-Many Multimodal Summarization (M³Sum) benchmark dataset by reorganizing the cross-lingual summarization dataset CrossSum (Bhattacharjee et al., 2022) and the MM-Sum dataset (Liang et al., 2022b). M³Sum covers 44 languages and thus involves 44*44 language directions. To evaluate our approach efficiently, we randomly select 4 languages (i.e., English, Indonesian, Russian, and Urdu), which yield 4*4 language directions. We implement our approach on two generative pre-trained language models, i.e., mT5 (Xue et al., 2021) and mBART-50 (Tang et al., 2021). Extensive experiments on both backbones show that our model significantly outperforms related methods in terms of ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020), demonstrating its effectiveness. A human evaluation further confirms the superiority of our approach. In summary, our main contributions are:
• To the best of our knowledge, we are the first to introduce the general many-to-many multimodal summarization (M³S) task, and we contribute a corresponding benchmark dataset.
• We propose a dual knowledge distillation and target-oriented vision modeling framework for the M³S task.
• Experiments on the M³Sum benchmark show that our model sets new state-of-the-art performance, demonstrating the effectiveness of the proposed approach.

Problem Formulation
Given an input article X^{L1} = {x_k^{L1}}_{k=1}^{M} in language L1 and its corresponding visual features V = {v_ij}_{i≤n, j≤m}, where x_k^{L1} denotes the k-th token, M is the number of tokens in the article, and v_ij represents the j-th detected object of the i-th image (n and m are the number of images and of detected objects per image, respectively), the many-to-many multimodal summarization task is defined as:

p(Y^{L2} | X^{L1}, V) = ∏_{t=1}^{N} p(y_t^{L2} | y_{<t}^{L2}, X^{L1}, V),   (1)

where y_{<t}^{L2} indicates the tokens before the t-th time step of the summary Y^{L2} = {y_t^{L2}}_{t=1}^{N} in language L2, and N is the number of tokens in the summary. L1 and L2 can be any language.

The MMS Model
Following Yu et al. (2021) and Liang et al. (2022b), the MMS model is an extension of a pre-trained language model (e.g., mT5 (Xue et al., 2021)) based on the Transformer architecture (Vaswani et al., 2017). As shown in the left part of Fig. 2, it includes four modules: a textual encoder, a visual encoder, text-vision fusion, and a decoder.

Textual Encoder. The textual encoder consists of N_e stacked layers, where each layer contains two sub-layers, a multi-head self-attention sub-layer (SelfAttn) and a position-wise feed-forward network (FFN) sub-layer:

H_T^ℓ = FFN(SelfAttn(H_T^{ℓ-1})),

where H_T^{ℓ-1} and H_T^ℓ ∈ R^{M×d} denote the inputs and outputs of the ℓ-th encoder layer, respectively, H_T^0 is initialized as the embedding of the input tokens X^{L1}, and d is the hidden dimension.

Visual Encoder. Following Yu et al. (2021); Liang et al. (2021, 2022c,a); Zhang et al. (2021a,b), the visual encoder is also a Transformer (Vaswani et al., 2017) encoder with N_v stacked layers; the difference lies in the visual inputs. The image sequence is processed by a Faster R-CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017). Specifically, for the i-th input image, we obtain a set of detected objects, i.e., I_i = {o_{i,1}, o_{i,2}, o_{i,3}, ..., o_{i,m}}, where m is the number of extracted objects and o_{i,*} ∈ R^{d_v}. Each object is captured by a dense feature representation, which can be mapped back to a bounding box / region (i.e., a Region-of-Interest (RoI)). Finally, the image sequence is converted to visual features I = {o_ij}_{i≤n, j≤m}. Following Cho et al. (2021), the RoI bounding-box coordinate embedding E_ij^box, the image-id embedding E_i^img, and the region-id embedding E_j^reg are added to the visual features to keep the order information of the image sequence:

Z = {v_ij} with v_ij = o_ij + E_ij^box + E_i^img + E_j^reg.

Then, they are fed into the visual encoder to better model the intra-modal dynamics and enhance the vision-specific order information:
H_V^ℓ = FFN(SelfAttn(H_V^{ℓ-1})),

where H_V^{ℓ-1} and H_V^ℓ denote the inputs and outputs of the ℓ-th encoder layer, respectively, H_V^0 is initialized as Z = {v_ij}_{i≤n, j≤m}, and d_v is the hidden dimension.

Text-Vision Fusion. Following Yu et al. (2021), the visual features are first injected by cross-modal multi-head attention (CrossMAttn):

Z'_V = CrossMAttn(Q, K, V),

where Q is the projected textual features Q = H_T^{N_e} W^q, K and V are the projected visual features with different weights, i.e., K = H_V^{N_v} W^k and V = H_V^{N_v} W^v, and d_c is the common hidden dimension.
Secondly, a forget gate G is used to filter redundant and noisy information from the visual features:

G = Sigmoid(W^g Concat(Z'_V, H_T^{N_e}) + b^g),  Z_V = G ⊙ Z'_V.

Finally, the vision-guided output Z_{T+V} is obtained by concatenating Z_V with the textual features H_T^{N_e} and linearly projecting the result back to the original dimension d:

Z_{T+V} = W^c Concat(Z_V, H_T^{N_e}) + b^c,

where Concat is the concatenation operation and W^* and b^* are trainable weights.

Decoder. The decoder follows a similar architecture, but each of its N_d layers has an additional multi-head cross-attention (CrossAttn) sub-layer:

H_dec^ℓ = FFN(CrossAttn(SelfAttn(H_dec^{ℓ-1}), Z_{T+V})),

where H_dec^ℓ ∈ R^{N×d} denotes the state of the ℓ-th decoder layer. Then, at each decoding time step t, the top-layer (N_d-th) decoder hidden state Z_{dec,t}^{N_d} is fed into the softmax layer to produce the probability distribution of the next target token:

p(y_t^{L1} | y_{<t}^{L1}, X^{L1}, V) = Softmax(W^o Z_{dec,t}^{N_d} + b^o),

where W^o and b^o are trainable weights.
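The gated fusion step described above can be illustrated with a minimal NumPy sketch; the shapes and random weights are assumptions for demonstration, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
M, d = 5, 8                        # M text tokens, hidden size d
H_T = rng.normal(size=(M, d))      # textual encoder output H^{N_e}_T
Z_Vp = rng.normal(size=(M, d))     # cross-attended visual features Z'_V

# Forget gate: G = Sigmoid(W^g [Z'_V; H_T] + b^g), applied element-wise.
W_g = rng.normal(size=(2 * d, d)) * 0.1
b_g = np.zeros(d)
G = sigmoid(np.concatenate([Z_Vp, H_T], axis=-1) @ W_g + b_g)
Z_V = G * Z_Vp                     # filtered visual features

# Concatenate with the text features and project back to dimension d.
W_c = rng.normal(size=(2 * d, d)) * 0.1
Z_TV = np.concatenate([Z_V, H_T], axis=-1) @ W_c
print(Z_TV.shape)  # (5, 8)
```

Because every gate value lies in (0, 1), the gate can only attenuate visual dimensions, never amplify them, which matches its role of discarding noisy visual information.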
Finally, the MMS loss function is written as:

L_MMS^{L1,L1} = -Σ_{t=1}^{N} log p(y_t^{L1} | y_{<t}^{L1}, X^{L1}, V).   (2)

D²TV Training Framework
Based on the MMS model described in § 2.2, we first introduce the proposed dual knowledge distillation (DKD) method in § 3.1, which improves both the MMS and MXLS tasks. We then present a simple yet effective target-oriented contrastive objective that filters needless visual information in § 3.2. Finally, we describe training and inference in § 3.3.

Teacher→Student. Specifically, for training the student model, given an input X^{L2} = {x_1^{L2}, x_2^{L2}, ..., x_{M1}^{L2}} in language L2 and the corresponding visual features V, the student model generates the cross-lingual summary Y^{L1} = {y_1^{L1}, y_2^{L1}, ..., y_N^{L1}}, where L2 ≠ L1. Then, we train the student model with two objectives as follows:

L_student = L_MXLS^{L2,L1} + α · L_KD^{T→S},   (3)

where L_MXLS^{L2,L1} maximizes the likelihood of the ground-truth tokens and takes the cross-entropy form:

L_MXLS^{L2,L1} = -Σ_{t=1}^{N} log p(y_t^{L1} | y_{<t}^{L1}, X^{L2}, V),   (4)

α is a trade-off factor, and L_KD^{T→S} is the KD loss that penalizes a large distance between the hidden states of the two summaries generated by the student and teacher models:

L_KD^{T→S} = Σ_{t=1}^{N} dist(h_t^T, h_t^S),   (5)

where dist(·,·) is a distance function that evaluates the difference between two representations (e.g., KL divergence or cosine similarity), {h_1^T, h_2^T, ..., h_N^T} denote the contextualized representations produced by the decoder of the teacher model, and {h_1^S, h_2^S, ..., h_N^S} denote the representations from the decoder of the student model.

Student→Teacher. In particular, given an input document X^{L1} = {x_1^{L1}, x_2^{L1}, ..., x_{M2}^{L1}} in language L1 and the corresponding visual features V, the teacher model aims to generate its summary Y^{L1} in the same language. We update the parameters of the teacher model with the following objective:

L_teacher = L_MMS^{L1,L1} + (1-α) · L_KD^{S→T},   (6)

where L_KD^{S→T} = Σ_{t=1}^{N} dist(h_t^S, h_t^T) is the student→teacher KD loss, defined analogously to Eq. 5.   (7)
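The KD term dist(·,·) between the teacher and student decoder states can be sketched as follows, using the cosine-similarity variant mentioned above (an illustrative NumPy sketch with assumed shapes, not the authors' code):

```python
import numpy as np

def kd_loss(h_teacher, h_student):
    """Cosine-distance KD between teacher and student decoder states.

    This is one choice of dist(.,.); a KL divergence over output
    distributions is another option mentioned in the text.
    """
    ht = h_teacher / np.linalg.norm(h_teacher, axis=-1, keepdims=True)
    hs = h_student / np.linalg.norm(h_student, axis=-1, keepdims=True)
    # 1 - cosine similarity, averaged over the summary time steps.
    return float(np.mean(1.0 - np.sum(ht * hs, axis=-1)))

rng = np.random.default_rng(0)
h_T = rng.normal(size=(6, 8))      # teacher states for a 6-token summary
h_S = rng.normal(size=(6, 8))      # student states for the same summary
print(kd_loss(h_T, h_T))           # identical states -> distance 0
print(kd_loss(h_T, h_S) > 0.0)     # differing states -> positive penalty
```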
Finally, to flexibly distill the knowledge in Eq. 3 and Eq. 6, we apply an annealing strategy that dynamically adjusts the balancing factor α during training, where t1 is the current training step, ranging from 0 to the max training step T, and T1 is a hyperparameter. In this manner, the teacher model dominantly guides the student model in the first T1/2 training steps, and the student model gradually distills its knowledge to the teacher; after training step T1, both models distill their knowledge to each other equally.
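The exact annealing formula is not reproduced here; the following is one hypothetical schedule consistent with the behavior described above, where α starts at 1 (teacher dominates), decays linearly, and both directions receive equal weight from step T1 onward:

```python
def alpha(t1, T1):
    """One possible annealing schedule for the KD balancing factor.

    Hypothetical sketch: alpha weights teacher->student distillation
    (Eq. 3) and (1 - alpha) weights student->teacher distillation
    (Eq. 6). It decays linearly from 1.0 to 0.5 by step T1, then stays
    at 0.5 so both models distill to each other equally.
    """
    return max(0.5, 1.0 - 0.5 * t1 / T1)

# Early on the teacher dominates; after T1 the weights are balanced.
print(alpha(0, 1000))     # 1.0
print(alpha(500, 1000))   # 0.75
print(alpha(1000, 1000))  # 0.5
print(alpha(5000, 1000))  # 0.5
```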

Target-oriented Contrastive Objective
The M³S task requires a model to understand and generate text in multiple languages. However, some languages are low-resource and lack enough data to train a good summarizer. Therefore, we take visual features as a bridge between languages and want those features to be summary-oriented, i.e., to discard the noise that does not appear in the summary. To this end, we elaborately design an explicit target-oriented contrastive objective.
Particularly, we pull the visual feature V_i close to its corresponding summary Y_i^{L1} and push apart irrelevant pairs, e.g., (V_i, Y_j^{L1}) where i ≠ j. That is, we treat the paired (V_i, Y_i^{L1}) as the positive sample and the pairs (V_i, Y_j^{L1}) with i ≠ j as negative samples. To obtain the representations of the summary and the image sequence, we apply mean-pooling with a mask operation over the summary output H_{T,sum}^{N_e,L1} of the N_e-th encoder layer and the visual output H_V^{N_v} of the N_v-th encoder layer, respectively. That is, h^{sum} = MeanPool(H_{T,sum}^{N_e,L1}, M_sum), where M_sum ∈ R^N denotes the mask matrix, whose value is either 1 or 0 indicating whether the token is padded. Similarly, we obtain the representation of the image sequence, i.e., h^{vis} = MLP(MeanPool(H_V^{N_v}, M_vis)), where M_vis ∈ R^{n×m} denotes the mask matrix and MLP is a fully-connected layer. Finally, the target-oriented contrastive training objective is defined as (B is the mini-batch size):

L_TCO = -Σ_{i=1}^{B} log [ exp(sim(h_i^{vis}, h_i^{sum})/τ) / Σ_{j=1}^{B} exp(sim(h_i^{vis}, h_j^{sum})/τ) ],   (9)

where sim(·,·) is the cosine similarity and τ denotes a temperature hyperparameter.
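This objective can be sketched in its InfoNCE-style form with NumPy, assuming pre-pooled visual and summary representations (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def target_contrastive_loss(h_vis, h_sum, tau=0.1):
    """InfoNCE-style sketch of the target-oriented contrastive objective.

    Each visual representation h_vis[i] is pulled toward its own summary
    h_sum[i] (positive) and pushed away from the other summaries in the
    mini-batch (negatives), with temperature tau.
    """
    v = h_vis / np.linalg.norm(h_vis, axis=-1, keepdims=True)
    s = h_sum / np.linalg.norm(h_sum, axis=-1, keepdims=True)
    logits = v @ s.T / tau                       # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # -log softmax of positives

rng = np.random.default_rng(0)
B, d = 4, 8
h = rng.normal(size=(B, d))
# Perfectly aligned vision/summary pairs give a lower loss than mismatched ones.
aligned = target_contrastive_loss(h, h)
mismatched = target_contrastive_loss(h, np.roll(h, 1, axis=0))
print(aligned < mismatched)  # True
```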

Training and Inference
At training, we optimize our model with the following overall objective, which sums the dual-KD summarization losses (Eq. 3 and Eq. 6) and the target-oriented contrastive loss (Eq. 9) over all language directions:

L = Σ_{L1,L2} (L_student + L_teacher) + β · L_TCO,   (10)

where K is the number of languages and β is a balancing hyperparameter.
Note that the MMS model and the MXLS model share parameters, so the final model can summarize in any language direction. During inference, the auxiliary training objectives are not involved; only the model itself is used to generate summaries.

M³Sum Dataset
There has been no many-to-many multimodal summarization benchmark dataset until now; we construct one as follows. Based on the CrossSum dataset (Bhattacharjee et al., 2022) and the MM-Sum dataset (Liang et al., 2022b), we build a Many-to-Many Multimodal Summarization (M³Sum) dataset. The original CrossSum dataset was crawled from the BBC website and its quality was verified by Bhattacharjee et al. (2022). However, the lack of associated image sequences in CrossSum makes it impossible to directly conduct research on MMS and MXLS. The original MM-Sum dataset was also crawled from the BBC website and supports multilingual multimodal summarization, but it cannot support cross-lingual summarization due to the lack of cross-lingual alignment. Therefore, we reorganize both datasets and perform cross-lingual alignment through the shared URL of each article.
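The URL-based alignment can be sketched as follows; the field names and URLs below are hypothetical placeholders, not the real dataset schema:

```python
# Hypothetical sketch of the reorganization step: examples from the two
# source datasets are joined through the shared BBC article URL, so each
# aligned example carries both the image sequence (from MM-Sum) and the
# cross-lingual summary (from CrossSum).
crosssum = [
    {"url": "https://www.bbc.com/news/a1", "summary_ru": "..."},
    {"url": "https://www.bbc.com/news/a2", "summary_ru": "..."},
]
mmsum = [
    {"url": "https://www.bbc.com/news/a1", "article": "...", "images": ["i1.jpg"]},
]

by_url = {ex["url"]: ex for ex in mmsum}
aligned = [
    {**by_url[ex["url"]], **ex}   # merge multimodal and cross-lingual fields
    for ex in crosssum
    if ex["url"] in by_url        # keep only URLs present in both datasets
]
print(len(aligned))  # 1: only a1 appears in both toy datasets
```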
According to the dataset size of each language, we follow CrossSum (Bhattacharjee et al., 2022) and use roughly an 80% training / 10% validation / 10% test split. Besides, CrossSum covers 44 languages and thus 44*44 language directions. Tab. 4 of Appendix A shows the detailed statistics of our M³Sum.

Setup and Metrics
Implementation Details. For efficiency, we randomly select 4 languages (i.e., English, Indonesian, Russian, and Urdu), which together cover 16 language directions. Please refer to Appendix B for more details. Metrics. We use ROUGE (Lin, 2004) with a statistical significance test (Koehn, 2004) for a fair comparison. Besides, we apply BERTScore (Zhang et al., 2020) for a comprehensive comparison.

Comparison Models
• MMS: the MMS model trained with the objective in Eq. 2.
• MXLS: the MXLS model trained with the objective in Eq. 4.
• MMS+MXLS: the model jointly trained with the objectives in Eq. 2 and Eq. 4, which is in fact the M³S training objective.
• Vanilla-KD: the model enhanced with unidirectional knowledge distillation, trained with the objectives in Eq. 2, Eq. 4, and Eq. 3.
• D²TV: the proposed model, trained with the objective in Eq. 10.
All of the above models use the multimodal Transformer described in § 2.2 and are built on two strong backbones: mT5 (Xue et al., 2021) and mBART-50 (Tang et al., 2021).

Main Results
Tab. 1 presents the main results in the many-to-many scenario with different backbones. Overall, our model obtains significantly better results than all contrast models in both settings.

Table 2: Ablation study based on mT5 (avg. results of ROUGE-1 / ROUGE-2 / ROUGE-L / BERTScore), where each component is separately added to "MMS+MXLS". "*" denotes the four languages (i.e., English, Indonesian, Russian, and Urdu). "CAT" denotes the complex auxiliary tasks of Liang et al. (2022b). Train (S) denotes how many seconds each model needs to train one step (batch size 32 * 8 GPUs).

Results based on the mT5 backbone. In Tab. 1 (a):
1) In each group (e.g., English→{English, Indonesian, Russian, Urdu}), the MMS model typically performs better when generating monolingual summaries but performs poorly in cross-lingual settings, since it has no access to cross-lingual data during training. The MXLS model shows the mirror-image behavior: it generates better cross-lingual summaries but handles monolingual summaries poorly. In contrast, the "MMS+MXLS" model, as a multi-task model, achieves better results than both the MMS and MXLS models, showing that the two tasks can benefit each other. Based on this finding, dual knowledge distillation is more reasonable than unidirectional knowledge distillation, which our results further demonstrate (see the ablation study). 2) Generally, in each block, our D²TV approach notably outperforms the Vanilla-KD method, showing the effectiveness of dual knowledge distillation and target-oriented contrastive learning. Although our results are slightly worse than the MMS model in the English→English, Indonesian→Indonesian, and Russian→Russian directions of the "*/*/*/*" blocks, our D²TV model balances MMS and MXLS well, as the results in each "Avg." block fully show. 3) On average, our model consistently and significantly surpasses all baselines by large margins (e.g., over the previous best "Vanilla-KD", up to 1.70/0.60/1.50 ROUGE and 0.77 BERTScore points in the Urdu→* directions).
In Tab. 1 (b), we observe similar findings to the mT5-based scenario, demonstrating that our conclusions hold across general pre-trained language models. All these results show the superiority of our approach.

Ablation Study
We conduct ablation studies to investigate how well each component works. The results are shown in Tab. 2. We draw the following conclusions:
• (Row 1 vs. row 0) Incorporating visual features has a positive impact on model performance, demonstrating the importance of the image sequence for summarization.
• (Row 2 vs. row 0) Vanilla KD makes a reasonable contribution, showing that the MMS model indeed helps improve summary quality in terms of both ROUGE and BERTScore, i.e., distilling the knowledge of MMS into MXLS benefits summarization.
• (Row 3 vs. rows 2 & 0) Dual knowledge distillation further improves performance, indicating that the knowledge of MMS and MXLS is mutually beneficial and can thus enhance both tasks.
• (Row 5 vs. rows 4 & 0) Summary-oriented visual features significantly improve summary quality, and our simple TCO achieves performance comparable to CAT with less training time, showing the superiority of the target-oriented contrastive objective.
• (Row 6 vs. row 0) Adding DKD and TCO together yields notable cumulative gains, confirming the effectiveness of the full approach.

Human Evaluation
Following Liang et al. (2022b), we conduct a human study on 50 samples randomly selected from the English→English and Russian→English test sets to further evaluate all models. We invite three Chinese postgraduate students who major in English to compare the generated summaries and assess each summary on three independent aspects: fluency (Flu.), conciseness (Con.), and informativeness (Inf.), scoring each aspect from 1 (worst) to 5 (best). The average results are presented in Tab. 3. Our D²TV substantially outperforms all contrast models under all criteria in both directions, which further shows the effectiveness and superiority of our approach. The Fleiss' Kappa scores (Fleiss and Cohen, 1973) for Flu., Con., and Inf. are 0.74, 0.70, and 0.65, respectively, indicating substantial agreement among the three evaluators. Furthermore, the case study in Appendix C intuitively shows the superiority of our D²TV.
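The agreement statistic used above can be computed with a standard Fleiss' kappa routine; below is a NumPy sketch (the toy rating matrix is illustrative, not the study's actual annotations):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a rating matrix.

    counts[i, j] = number of raters who assigned item i to category j;
    each item is assumed to be rated by the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n, _ = counts.shape
    r = counts[0].sum()                            # raters per item
    P_i = (np.sum(counts ** 2, axis=1) - r) / (r * (r - 1))
    P_bar = P_i.mean()                             # observed agreement
    p_j = counts.sum(axis=0) / (n * r)             # category proportions
    P_e = np.sum(p_j ** 2)                         # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement: all 3 raters pick the same category for every item.
perfect = [[3, 0], [0, 3], [3, 0]]
print(round(fleiss_kappa(perfect), 2))  # 1.0
```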

Related Work
Multimodal Monolingual Summarization (MMS). With the rapid growth of multimedia, many MMS datasets have been built, covering video summarization (Tjondronegoro et al., 2011; Sanabria et al., 2018), movie summarization (Evangelopoulos et al., 2013), meeting-record summarization (Erol et al., 2003), sentence summarization (Li et al., 2017, 2018a), product summarization (Li et al., 2020a), and news summarization (Zhu et al., 2018; Chen and Zhuge, 2018; Hasan et al., 2021; Fu et al., 2021; Liang et al., 2022b). With these data resources extensively used, the MMS task has attracted much attention; existing work mainly focuses on 1) effectively exploiting the additional visual features, generally learned implicitly via the MMS objective, or 2) explicit and complex auxiliary tasks, achieving impressive performance on these high-resource English datasets (Li et al., 2018b, 2020b; Zhu et al., 2020, 2021; Zhang et al., 2021a,b; Yu et al., 2021). In this work, we instead focus on introducing a more general and practical many-to-many multimodal summarization setting and provide a corresponding benchmark dataset. Additionally, we propose a simple yet effective target-oriented contrastive objective to filter needless visual features, i.e., to offer summary-oriented visual features.

Multimodal Cross-lingual Summarization (MXLS).
Only one prior study focuses on the MXLS task: Liu et al. (2022) first propose the task and design a triple-stage training framework that distills knowledge from MMS to enhance MXLS while ignoring the performance of MMS. Different from this work, we introduce the many-to-many multimodal summarization task. Furthermore, we devise a dual knowledge distillation approach to simultaneously improve both the MMS and MXLS tasks.
Knowledge Distillation (KD). KD (Hinton et al., 2015) transfers the knowledge (e.g., soft target outputs) of a stronger model (the teacher) to a smaller model (the student), and has achieved impressive results in the literature (Zhang et al., 2023). In summarization, Zhang et al. (2021b) adopt KD from a vision-language pre-trained model to improve image selection when generating multimodal summaries. Besides, researchers (Nguyen and Luu, 2022; Liu et al., 2022) typically treat the monolingual summarization model as the teacher and the cross-lingual one as the student, because the monolingual model is easier to train well than the cross-lingual one; this shows promising performance on cross-lingual summarization while ignoring the performance of the monolingual task. In this work, we aim to mutually improve both the monolingual and cross-lingual summarization tasks via dual KD rather than improving only the cross-lingual task via unidirectional KD.
Contrastive Learning. Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Hadsell et al., 2006), and has demonstrated its superiority in many fields (Zhou et al., 2023). In summarization, Liu and Liu (2021) use a contrastive loss to post-rank generated summaries and achieve good results on text-only benchmark datasets. Cao and Wang (2021) and Xu et al. (2021) use contrastive learning to improve faithfulness and factuality and observe consistent improvements. Wang et al. (2021) apply contrastive learning to multilingual summarization and obtain promising performance. Differently, we introduce it into the multimodal setting, pulling visual features close to their corresponding summaries to offer summary-oriented visual features. We can therefore improve summary quality from the perspective of the visual features rather than the textual document.

Conclusion
In this paper, we first introduce a more general task, i.e., M³S, which supports both the MMS and MXLS tasks. Further, we propose a dual knowledge distillation and target-oriented vision (D²TV) enhanced framework for the new task. Extensive experiments demonstrate that our model significantly outperforms related baselines in terms of ROUGE, BERTScore, and human evaluation. Furthermore, we contribute a many-to-many multimodal summarization (M³Sum) dataset to the research community.

Limitations
Although we show that our D²TV outperforms the Vanilla-KD model on two strong backbones, i.e., mT5 (Xue et al., 2021) and mBART-50 (Tang et al., 2021), some limitations are worth studying in future work: (1) in this study, we provide 44 languages but conduct experiments on only four of them; future work could extend our method to more languages; (2) with the development of large-scale language models, extending and validating our approach on them is a promising direction.

Ethics Statement
In this section, we consider the potential ethical issues of our model. We propose D²TV, which is trained on publicly available BBC datasets. Therefore, D²TV might produce incorrect summaries in applications and may inherit the biases and toxic behaviors exhibited by the datasets. Besides, we obtained our M³Sum dataset by reorganizing the CrossSum (Bhattacharjee et al., 2022) and MM-Sum (Liang et al., 2022b) datasets, whose permissions are granted to copy, distribute, and modify the contents under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License and the Creative Commons CC0 License, respectively.

Figure 2: The overview of our model architecture. The left part is a general MMS model, which is enhanced by DKD and TCO. As shown in the right part, the (a) dual knowledge distillation (DKD) and (b) target-oriented contrastive objective (TCO) are proposed to improve M³S performance.

Dual Knowledge Distillation

As shown in the right part of Figure 2 (a), our framework involves training both MXLS and MMS models. Essentially, the MXLS model must simultaneously perform machine translation and summarization (Liang et al., 2022d; Wang et al., 2022a,b), while the MMS model only performs summarization. It is therefore harder to train an MXLS model than an MMS model, which is why researchers (Nguyen and Luu, 2022; Liu et al., 2022) take the MMS model as the teacher to help the MXLS student model (i.e., teacher→student distillation). However, once the MXLS model reaches a certain level of multilingual and cross-lingual ability, it can better transfer and share task knowledge among different languages. Therefore, the MXLS model can in turn guide the MMS model to conduct summarization in diverse languages (e.g., English→English, Indonesian→Indonesian, Russian→Russian, and Urdu→Urdu), especially low-resource ones (i.e., student→teacher distillation). That is why we propose DKD to mutually enhance their performance.

Table 1: The block in "*/*/*/*" denotes the MMS results and the block in "*/*/*/*" indicates the MXLS results. The "*/*/*/*" indicates the average (Avg.) score for each model, and the best scores in each block are bold. Our bold results are statistically significantly better than "Vanilla-KD" with t-test p < 0.05. Note that results from different blocks (e.g., the English→English block vs. the Indonesian→English block) cannot be compared to each other because they belong to different language directions. In each MMS block, MMS always surpasses MXLS without exception; in each MXLS block, MXLS always surpasses MMS without exception.