Co-Teaching Student-Model through Submission Results of Shared Task

Shared tasks have a long history and have become the mainstream of NLP research. Most shared tasks require participants to submit only system outputs and descriptions. It is uncommon for a shared task to request submission of the system itself because of license issues and implementation differences. Therefore, many systems are abandoned without being used in real applications or contributing to better systems. In this research, we propose a scheme to utilize all the systems that participated in a shared task. In this scheme, we use all participating system outputs as teachers and train a new model as a student that aims to learn the characteristics of each system. We call this scheme "Co-Teaching." This scheme creates a unified system that performs better than the task's single best system. It requires only the system outputs, with slight extra effort from the participants and organizers. We apply this scheme to the "SHINRA2019-JP" shared task, which has nine participants with various output accuracies, and confirm that the unified system outperforms the best system. Moreover, the code used in our experiments has been released.1


1 Introduction
Shared tasks have a long history and have become the highlight of NLP research (Sundheim, 1995; Tjong Kim Sang and Buchholz, 2000; Ounis et al., 2008; Dang and Owczarzak, 2009). These tasks have contributed to the development of natural language processing technology by attracting researchers interested in being the best task player. The systems are evaluated using the outputs submitted to the task, and participants usually have no obligation to submit the system itself. This limits the participants' contribution once the task is over, because each system is a future asset of its developers. We believe all participating systems have value as a resource, even if they are not the best, and it may be desirable to share them for the sake of innovation in the field as a whole. However, sharing a system is challenging because of license issues and the running environment. Although sharing the system is ideal, sharing system outputs is much easier and requires only slight additional effort from the task participants and organizers.

We propose a scheme to utilize all system outputs in a shared task to build a unified system that is better than the best single system. More specifically, we construct a system that reproduces the participating systems embedded in the submission results by treating the submission results as training data (teacher) and building a new model (student). Here, those submissions include results on both the evaluation data and large unlabeled data. This is an adaptation of the Teacher-Student architecture of model compression methods such as knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015). We call this scheme "Co-Teaching," borrowing from real-world educational terminology, because the group of participants in the shared task act as teachers and teach a common student. This scheme can be applied to most shared tasks, as it only requires submitting the results on the evaluation data and the unlabeled data.

1 https://github.com/k141303/co_teaching_scheme
Its benefits include building a better system for the task, as well as salvaging the effort of the participants who did not produce the top results. The scheme can provide the best-performing system without violating the participant's license, and it is also possible to design a shared task so that it aims to build a single system from the beginning.
In order to prove the effectiveness of the proposed scheme, we conducted an experiment on the SHINRA2019-JP task. The task is to extract values corresponding to predefined attributes from Wikipedia articles to structure the Japanese Wikipedia. SHINRA2019-JP follows the concept of "Resource by Collaborative Contribution (RbCC)" (Sekine et al., 2019), in which a resource (e.g., a knowledge base) is built within the framework of a shared task, and the submission results are made publicly available as a resource. The evaluation data is not announced for this task, and the participants are required to submit results for all Wikipedia articles. There are many system predictions for the unlabeled portion of the data due to this evaluation setting. We use ensemble learning to create better results for this task since the outputs are all public. The details of the SHINRA2019-JP task are described in Sec. 4.1.
"RbCC" is a scheme to create a resource collaboratively, but our proposed Co-Teaching scheme aims to create a system out of many submitted outputs. We trained the student model to build a system using publicly available submission results and the training data distributed to the task participants. The result shows that the proposed system achieves a better score than the best participating system.
The contributions of this paper can be summarized as follows:

• By proposing the Co-Teaching scheme, i.e., building a single system via a shared task, we exhibit a new way of utilizing shared tasks. To the best of our knowledge, there has been no prior effort to utilize the participants' efforts by releasing an integrated system.
• We applied the proposed scheme to an actual shared task, SHINRA2019-JP, and demonstrated that the system produced by the scheme achieves a better score than the best participating system. Additionally, we showed the effectiveness of using the participants' results through ablation tests.
• We enumerated the shared tasks that have been conducted recently in the field of natural language processing and discussed our scheme's applicability.
2 Related Work

2.1 Knowledge Distillation
Knowledge Distillation (Ba and Caruana, 2014; Hinton et al., 2015) is a method mainly used for model compression. Specifically, the outputs of a model with many parameters, or the ensemble outputs of multiple models, are used as training data to learn a new model with fewer parameters. In this setting, the learning source model is called the teacher model, the learning destination is called the student model, and this combination is called the Teacher-Student architecture. The learning itself is called distillation. In many cases, the teacher model is supervised, and the same training data is also used when training the student. In the Teacher-Student architecture, there are two categories of knowledge transfer methods: response-based (Ba and Caruana, 2014; Hinton et al., 2015) and feature-based (Romero et al., 2014). In the response-based method, the student model is trained from the teacher's output. In the feature-based method, the student is trained from the teacher's intermediate outputs and/or weights. Distillation methods can also be categorized into online (Zhang et al., 2018) and offline distillation (Ba and Caruana, 2014; Hinton et al., 2015), depending on whether the teacher parameters are updated while the student is learning. In our study, we cannot access the teacher models. Therefore, we transfer response-based knowledge to a student through offline distillation. Response-based knowledge refers more explicitly to information propagated through the teacher's output probability. If the output probabilities are not included in the submission results of the shared task, as in this paper, the teacher's knowledge can still be extracted from the systems' predictions on additional unlabeled data.
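As a concrete illustration, response-based offline distillation can be sketched as follows. This is a generic sketch using temperature-scaled soft targets in the style of Hinton et al. (2015), not the exact loss used in our experiments (where teacher output probabilities are unavailable); the function names are ours.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    zt = z / T
    e = np.exp(zt - np.max(zt, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions.
    Response-based, offline: the teacher logits are fixed targets."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
```

A higher temperature T exposes more of the teacher's "dark knowledge" (the relative probabilities of incorrect classes), which is exactly the information lost when only hard predictions are submitted.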

2.2 Semi-Supervised Learning
Figure 1: Overview of the Teacher-Student architecture used in Co-Teaching. Here, we do not have access to the systems that participated in the shared task, but we have access to their submission results. In order to reproduce the inaccessible participating systems, we treat those results as teachers and train a student model.

The learning method used in our scheme can be classified as a semi-supervised method, such as Self-Training (Yarowsky, 1995) and Co-Training (Blum and Mitchell, 1998), in that it uses unlabeled data predictions for learning. Self-Training is a method for building a more robust machine learning model by adding high-confidence predictions of a trained model to the training data and retraining the model. Co-Training is an extension of Self-Training, where the instances added to the training data are determined by the label confidence obtained using two or more models. These methods combine a small amount of labeled data and a large amount of unlabeled data during model training. However, to the best of our knowledge, no study has used the results of a shared task to extend the training data. We try simple self-training in Sec. 4.3 to show the benefits of extending the training data with the results of a shared task on unlabeled data.
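A minimal sketch of the Self-Training loop described above, assuming a hypothetical `train_fn` that fits a model on labeled pairs and a model exposing `predict(x) -> (label, confidence)`:

```python
def self_train(train_fn, labeled, unlabeled, rounds=2, threshold=0.9):
    """Self-Training sketch: repeatedly add high-confidence predictions
    on unlabeled data to the training set and retrain."""
    data = list(labeled)
    model = train_fn(data)
    for _ in range(rounds):
        added = []
        for x in unlabeled:
            label, conf = model.predict(x)  # hypothetical interface
            if conf >= threshold:
                added.append((x, label))
        model = train_fn(data + added)      # retrain on extended data
    return model
```

Co-Training follows the same skeleton but decides which instances to add based on the agreement or confidence of two or more models trained on different views of the data.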

3 Proposal Scheme
We propose a "Co-Teaching" scheme to address the system submission, which is problematic due to license protection and operating environment issues in the shared task, even when the system developed by the task participant is valuable. In the proposed scheme, the systems submitted to the shared task are considered teachers, and a new student model is trained through their outputs. We expect that this scheme allows us to build a system that comprehensively and integrally reproduces the teacher characteristics. The training data distributed to participants in the shared task can also be used for student learning. This scheme can be performed even after completing the shared task, as long as the training data distributed in the task and the submission results are available. However, it is better to have more information about the teachers available for training the students. It is desirable to be able to use the prediction probabilities assigned to the teacher outputs, as well as training and unlabeled data predictions. Therefore, cooperation from shared tasks is essential for this scheme to work effectively.
An overview of the Teacher-Student architecture used in the scheme is presented in Fig. 1. Let us assume that we have the outputs {Y^s}_{s∈[S]} of the participating systems for the input space X, where S is the number of participating systems. Each data point can be described as (x_i, {y_i^s}_{s∈[S]}), i ∈ [N], where N is the number of data instances, x_i is the i-th instance, and y_i^s is the output of the s-th teacher for the i-th instance. Some instances also have ground truth labels y_i^g ∈ Y, i ∈ [M], M ≤ N, that were used to train the teachers. In this scheme, our goal is to learn the student model {θ_sh, θ_1, ..., θ_S, θ_out}, where θ_sh is the shared parameter to reproduce the features common among the teachers, θ_s is the private parameter to reproduce each teacher's output, and θ_out is the parameter to output the overall prediction results. Here, we simultaneously reproduce each teacher model f(x; θ_sh, θ_s) : X → Y and learn the overall output from the labeled data f(x; θ_sh, θ_out) : X → Y. The loss function used for learning is written as

L = α · (1/M) Σ_{i∈[M]} L̂_g(f(x_i; θ_sh, θ_out), y_i^g) + (1 − α) · (1/(NS)) Σ_{i∈[N]} Σ_{s∈[S]} L̂_t(f(x_i; θ_sh, θ_s), y_i^s),

where α is a weight that determines the balance of loss between the ground truth labels and the teacher outputs. We train the student model by minimizing the above loss.
Since we also utilize the logit z_i^s of each private layer in addition to the logit z_i^out of the output layer, the final prediction probability p_i of the entire model is calculated as follows:

p_i = softmax( (z_i^out + Σ_{s∈[S]} z_i^s) / (S + 1) ).
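The shared/private parameterization described in this section can be sketched as a toy numpy example; a random linear map stands in for the shared encoder (BERT in our experiments), cross-entropy stands in for the task losses L̂_g and L̂_t, the averaged-logit prediction is our assumption, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, S = 8, 4, 3                         # feature dim, num classes, num teacher systems

W_sh  = rng.normal(size=(D, D))           # shared parameters (stand-in for BERT)
W_out = rng.normal(size=(D, C))           # output head, trained on gold labels
W_s   = [rng.normal(size=(D, C)) for _ in range(S)]  # one private head per teacher

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y):
    """Mean negative log-likelihood; y holds integer class ids."""
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def forward(x):
    h = np.tanh(x @ W_sh)                 # shared representation
    z_out = h @ W_out                     # gold-label head logits
    z_s = [h @ W for W in W_s]            # per-teacher head logits
    return z_out, z_s

def co_teaching_loss(x, y_gold, y_teachers, alpha=0.5):
    """alpha balances the gold-label loss against the averaged teacher losses."""
    z_out, z_s = forward(x)
    loss_g = cross_entropy(softmax(z_out), y_gold)
    loss_t = np.mean([cross_entropy(softmax(z), y) for z, y in zip(z_s, y_teachers)])
    return alpha * loss_g + (1 - alpha) * loss_t

def predict(x):
    """Final prediction: average the output and private logits."""
    z_out, z_s = forward(x)
    return softmax((z_out + sum(z_s)) / (len(z_s) + 1))
```

In the real model the shared encoder is fine-tuned jointly with all heads, so the teachers shape the shared representation that the output head also relies on.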

4 Experiments

4.1 SHINRA2019-JP
SHINRA2019-JP is a shared task to extract the attribute values in Japanese Wikipedia articles. These articles are preclassified into Extended Named Entity (ENE) categories by Suzuki et al. (2018). ENE is a set of Named Entity types defined by Sekine (2008). In this task, participants can access the manually annotated training data and the articles in each category. For the SHINRA shared task evaluation, a portion of the data is hidden, and all participants have to annotate all the data so that the organizer can create unified data for all Wikipedia entries.
A total of nine teams participated in the SHINRA2019-JP task. Some participants submitted results for only a subset of the categories, and six to nine systems submitted results for each category. Various methods were used, including rule-based methods, ML methods using CRF and SVM, deep learning-based methods, and DrQA (Chen et al., 2017).
This task follows the Resource by Collaborative Contribution (RbCC) scheme. Therefore, the task organizers release all submission results as a resource. In this task, participants are not required to submit the prediction probabilities assigned to the system outputs; thus, the organizers do not share these values. The organizers also distributed development data for the City and Lake categories, which are not used for evaluation. In our study, we use those labels for our detailed analysis. The task organizers have not released the evaluation set used in the task. Therefore, we sent our results to the organizers and received the evaluation results.

4.2 Co-Teaching on Shared Task
In order to demonstrate the effectiveness of the proposed scheme, we use the submission results shared by SHINRA2019-JP to train a student model. Although this is an attribute value extraction task, it can be solved as a sequence labeling task because each attribute value contains the offset of its occurrence in the text. We use the IOB2 scheme (Tjong Kim Sang and Veenstra, 1999) to solve the sequence labeling task. That is, we classify the first word of an attribute value as Beginning (B), the following words as Inside (I), and words outside the attribute value as Outside (O). In this task, a word may have multiple attribute labels. Therefore, we use I, O, and B tags for each attribute. More specifically, we classify each word {t_{i,j} ∈ x_i}, j ∈ [K], into y_{i,j}^g and y_{i,j}^s, ∀s ∈ [S], where |y_{i,j}^g| = C, |y_{i,j}^s| = C, K is the sentence length, and C is the number of attributes to be extracted. We use MeCab2 to tokenize Japanese text. Furthermore, we define the student model used in this experiment as shown in Fig. 3. We use BERT-base (Devlin et al., 2019) for θ_sh and a linear layer for each of θ_1, ..., θ_S, θ_out, and apply Dropout (Srivastava et al., 2014) and the GeLU activation function (Hendrycks and Gimpel, 2020) to the output of BERT. BERT is pretrained on Japanese Wikipedia using the same scheme as RoBERTa (Liu et al., 2019). Class Balanced Focal Loss (Cui et al., 2019), a combination of Class Balanced Loss (Cui et al., 2019) and Focal Loss (Lin et al., 2017), is used for the loss functions L̂_t and L̂_g to deal with the class imbalance of IOB2 labels. We also determine α, which balances the loss between the ground truth labels and the teacher outputs, so that the two losses have an equal impact on the entire dataset.

2 https://taku910.github.io/mecab/
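The per-attribute IOB2 encoding described above can be illustrated with a small example; the tokens, attribute names, and helper function are hypothetical:

```python
def iob2_encode(tokens, spans, attributes):
    """Encode attribute-value spans as one IOB2 tag sequence per attribute.
    spans: list of (attribute, start_token, end_token_exclusive).
    Because each attribute has its own tag sequence, one word can carry
    labels for multiple attributes at once."""
    tags = {a: ["O"] * len(tokens) for a in attributes}
    for attr, start, end in spans:
        tags[attr][start] = "B"                 # first word of the value
        for j in range(start + 1, end):
            tags[attr][j] = "I"                 # remaining words of the value
    return tags
```

For example, for the token sequence ["Lake", "Biwa", "is", "in", "Shiga"] with a "name" span over the first two tokens, the "name" tag sequence is B, I, O, O, O.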
In this task, 269 to 308,610 articles are available per category, and 147 to 1,000 of them have ground truth labels. All articles have the results of the systems that participated in each category. To reduce the computational cost, we limited the number of articles to 2,000 per category, including all labeled data. The detailed data statistics are shown in Table 1. A student model is trained for each category, and we use 10% of the labeled data as development data and the rest, including all unlabeled data, as training data.
In this experiment, we also fine-tune BERT using only the training data of SHINRA2019-JP; we call this the Non-Teaching setting. By comparing Co-Teaching and Non-Teaching, we can separate the advantages of the proposed scheme from those of the model structure. Moreover, we integrate the unlabeled-data predictions of the model obtained in the Non-Teaching setting with the training data and retrain the model. This setting is similar to self-training, so we tentatively call it Self-Teaching. By comparing Co-Teaching and Self-Teaching, we can separate the advantage of the proposed method from input expansion using unlabeled data; that is, we can evaluate only the advantage gained from the knowledge extracted from the outputs of the systems that participated in the shared task. These comparisons are a kind of ablation test.
We use the Adam optimizer to train the model in each setting and use mixed precision for computational efficiency. We determine the batch size and the γ used for Class Balanced Loss by grid search over {8, 16, 32} and {0.999, 0.9999, 0.99999}, respectively, and obtain the class statistics used for Class Balanced Loss from the ground truth labels. The other parameters used in this experiment are: Adam learning rate α_lr = 5 × 10^-5, Adam β_1 = 0.9, Adam β_2 = 0.999, and Adam ε = 10^-8. When we use 2,000 articles for training, one training run takes about 10 hours on a single GPU with a batch size of eight, and about five hours on eight GPUs with a batch size of 32. All GPUs are NVIDIA Tesla V100.
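For illustration, here is a hedged numpy sketch of Class Balanced Focal Loss for binary per-attribute tags; the hyperparameter we grid-search over {0.999, 0.9999, 0.99999} corresponds to `beta` in Cui et al.'s class-balanced weighting below, while `gamma` is the focal exponent, and the function signature is ours:

```python
import numpy as np

def class_balanced_focal_loss(probs, targets, class_counts, beta=0.999, gamma=2.0):
    """Class Balanced Focal Loss (Cui et al., 2019) sketch for binary tags.
    probs: predicted probabilities, shape (n, C); targets: 0/1, shape (n, C);
    class_counts: number of positive samples per class, shape (C,)."""
    # Class-balanced weights from the "effective number" of samples,
    # normalized to sum to C so rare classes get up-weighted.
    w = (1.0 - beta) / (1.0 - np.power(beta, class_counts))
    w = w / w.sum() * len(class_counts)
    p_t = np.where(targets == 1, probs, 1.0 - probs)       # prob of the true tag
    focal = -((1.0 - p_t) ** gamma) * np.log(p_t + 1e-12)  # down-weights easy examples
    return np.mean(w * focal)
```

The focal term suppresses the loss from the overwhelming number of easy "O" tags, while the class-balanced weights compensate for attributes with few positive examples.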

4.3 Experimental Results
The experimental results are listed in Table 2. Each score is the across-category macro-average F1 value for each subclass, and the score for each category is the across-attribute micro-average F1 value. For a fair comparison, systems that did not submit predictions for all categories belonging to a subclass were excluded from the table. We can see from the table that Co-Teaching obtained a better score than the best system in the JP-5 and Location subclasses. In particular, there is a great improvement in the Location subclass (i.e., 4.12 points) compared to the best system. The significant difference in scores between the Non-Teaching and Co-Teaching results signifies that this improvement was not obtained from the advantages of the model structure. In addition, Self-Teaching is slightly superior to Non-Teaching, which may be due to the effect of input expansion using unlabeled data. However, Self-Teaching is still significantly inferior to Co-Teaching. This difference suggests that the knowledge derived from the participating systems is more valuable than the input extension.

When we focus on the Non-Teaching results in the Organization subclass, the difference in score from the best system is −6.77, which is significantly inferior. This difference suggests that either the BERT model's structure is incompatible with the Organization subclass or the participants may have used additional knowledge about the Organization subclass. However, in Co-Teaching, the score is equivalent to that of the best system in the Organization subclass thanks to learning from the participating system results. The overall Co-Teaching score is better than the score of the best system (Team ID: 10). This best system consists of two BERT models: one takes plain-text input, as used in this study, and the other takes HTML input recovered from the Wikipedia dump. Therefore, the student model performs better than the teacher model, even though less information is given for each input instance.
This result implies that the system can be obtained indirectly through the proposed scheme without requiring the shared task participants to submit their systems, even if they use additional knowledge. In addition, the score improvement is based on the knowledge gained from the systems other than the best one, demonstrating the potential usefulness of those systems. This result should also motivate task participants.
The scores for each category are displayed in Table 3. The scores are calculated in the same way as in Table 2. Best System means the best score among the system results. In this task, five teams achieved the best score in one or more categories. Thus, in order to obtain the overall best system for this task, we would need all five teams to submit their systems. However, the Co-Teaching score is equivalent to the average of the best systems. This result shows the effectiveness of the proposed scheme.
In all categories, Co-Teaching performed better than Non-Teaching. Teachers do not negatively affect student learning in this situation, as Best System is better than Non-Teaching in most categories.
The table confirms that Co-Teaching performed worse than Best System in 15 out of 33 categories. The correlation between the best and second-best system difference and the improvement over the Best System with Co-Teaching is shown in Fig. 4. The correlation coefficient between these two values is r = −0.519, indicating a negative correlation.3 In situations where only a single teacher is superior, the student model does not learn well. In this study, the losses are averaged over the teachers used for student learning, so the loss of a single superior system may be dominated by the other systems. For further improvement, we may need to apply methods such as MGDA (Sener and Koltun, 2018), used in multitask learning, to dynamically balance the teacher losses during student learning.

Table 3: Experimental results for each category on SHINRA2019-JP. Bold numbers represent the highest score in the right four columns, and underlined numbers designate the highest score among the participating systems.
Our study limited the data used for training the student model to 2,000 articles in each category due to the computational cost. However, many more articles are available for some categories. Therefore, we studied how the student model's score varies in the City category as we use more articles for training, also tracking the score of each output of the student model. As in the other experiments, the batch size and the γ used for Class Balanced Loss are determined using a grid search: the batch size from {8, 16, 32} when 2,000 articles are used for training, from {20, 40, 80} with 5,000 articles, and from {40, 80, 160} with 10,000 articles; γ from {0.999, 0.9999, 0.99999}.
The results are shown in Fig. 5. The model's final output is better when 5,000 articles are used for training than with 2,000 articles. However, when 10,000 articles are used, the score drops. In this study, we determined α to make the effect of the two losses equivalent across the dataset. Therefore, as the unlabeled data increases, the loss between the teacher and student for each instance becomes much smaller than the loss between the ground truth label and the student. When using large amounts of unlabeled data, it may be necessary to constrain α. Surprisingly, the average score of the private layers is consistently higher than that of the output layer trained on the ground truth. This result signifies that the information obtained from the teachers is more valuable than the ground truth labels. However, with an appropriate α, both outputs complement the final output, as in the case of 5,000 articles. In future work, we will examine how to find an appropriate α corresponding to the proportion of unlabeled data.

5 Discussion
Figure 4: Correlation between the best and second-best system difference and the improvement from the best system with Co-Teaching.

In order to apply the Co-Teaching scheme to a shared task, the participants' results must be disclosed. The task organizer needs to obtain permission to publish the results in advance, which may become an obstacle for the participants. However, disclosing the results should be a much lower burden than disclosing the system. For example, a dictionary built for a Named Entity Recognition task may be a valuable resource for a participant's organization, so it could be hard for the organization to disclose the system together with the dictionary. However, the task outputs may be much less valuable to the organization, as it is challenging to reproduce the dictionary from the system results.
If a Co-Teaching scheme is to be incorporated into a shared task, it must be technically possible to design the student model, and it is also desirable that a large amount of unlabeled data be easily obtainable. We discuss the shared tasks that have been conducted in the past from those perspectives. We believe that the conditions are satisfied for classification tasks such as the Sentiment Analysis and Relation Classification tasks in SemEval (Hendrickx et al., 2010), the Dialect Classification task in NADI (Abdul-Mageed et al., 2021), and the word classification task in CoNLL (Tjong Kim Sang and De Meulder, 2003). The conditions are also satisfied for translation tasks such as those in WMT (Barrault et al., 2020) and WAT (Nakazawa et al., 2020), as well as generation tasks such as those in SDP (Chandrasekaran et al., 2020) and FNS (El-Haj et al., 2020). The similarity between these tasks is that the formats of the training data and the task submissions are essentially the same. That is, a student model can be designed with few modifications to a model designed for the training data (e.g., by adding an output layer or decoder for each participating system). However, there are cases where the format of the training data and the task submission differ. For example, the training data in the IWSLT speech translation task (Ansari et al., 2020) consists of an end-to-end speech translation or transcription dataset and a bilingual corpus, but only the final target-language text is submitted. In this case, if a participant uses the latter non-end-to-end dataset, the dataset used and the task submission format are different. However, the task organizers can solve this formatting problem by requesting the participants to submit the transcribed text. In the above tasks, except for the relation classification task, which requires entity pair information, the unlabeled data, such as plain text or speech, is easily obtainable.
The prediction results of the systems developed by task participants for those unlabeled data help make the Co-Teaching scheme work effectively, as shown in the experiments in this paper. As mentioned above, many shared tasks are suitable for the application of the proposed scheme.
We have discussed the design of the student model and the data required in the Co-Teaching scheme, but whether the student model can successfully reproduce the submission system needs to be shown in many future experiments. For these experiments, we need the submission results of many shared tasks. In addition, although it was not available in this experiment, if the output probability of the system is available, further improvement can be expected using the KL-divergence and other methods. We hope that these data become more open in the future.
We have given approximate computational costs at the end of Sec. 4.2. As shown, the effort and cost involved in implementing this scheme are not trivial, because applying the Co-Teaching scheme requires building a student model with sufficient capacity, such as BERT. However, we believe that these costs and efforts are much smaller than the total costs and efforts spent by the shared task participants.
The Teacher-Student architecture that we used in this study is simple. Nevertheless, we were able to demonstrate the usefulness of the Co-Teaching scheme. However, there is room for improvement in the architecture, e.g., preventing score degradation in cases where only a single system is superior. In the future, we would like to compare knowledge distillation methods and develop a more suitable architecture for the Co-Teaching scheme. Also, we aim to conduct detailed validation of our proposed scheme by ablation tests with the removal of each system and stress tests with the addition of noise systems.
Finally, we would like to introduce research with similar motivations. Potthast et al. (2019) developed an architecture for shared tasks called TIRA, consisting of virtual environments for system development and evaluation modules. On TIRA, each participant of the shared task develops a system in the given virtual environment and receives an evaluation by submitting the system to the evaluation module. In this process, TIRA stores the systems that have been submitted. However, third parties cannot access such systems directly. Instead, TIRA provides an API through which they can receive the system results for any input. Thus, the systems are virtually public, but licensing issues are minimized as long as the whole system or any part of it does not leak out of TIRA. Although this research has a different focus, integrating the Co-Teaching scheme into the TIRA architecture would allow seamless student model learning and efficient leveraging of the participants' efforts.

6 Conclusion
In this paper, we proposed a new scheme for shared tasks called "Co-Teaching." It builds a single system from the participants' outputs under the Teacher-Student architecture. We conducted an experiment based on the SHINRA2019-JP shared task to demonstrate the effectiveness of our scheme. As a result, we were able to construct a system whose F1 score is 2.38 points higher than that of the best participating system. We hope that this scheme will be applied to many shared tasks to utilize participants' efforts effectively. Furthermore, we believe shared tasks can become even more useful if they are designed not only as a place for optimizing a given task but also so that an outcome is obtained, such as a resource like a knowledge base (RbCC) or a superior system created from the collection of participants' systems (Co-Teaching).