Cross Encoding as Augmentation: Towards Effective Educational Text Classification

Text classification in education, usually called auto-tagging, is the automated process of assigning relevant tags to educational content such as questions and textbooks. However, auto-tagging suffers from a data scarcity problem, which stems from two major challenges: 1) the tag space is large, and 2) the task is multi-label. Although retrieval approaches reportedly perform well in low-resource scenarios, few efforts have directly addressed the data scarcity problem. To mitigate these issues, we propose CEAA, a novel retrieval approach that provides effective learning in educational text classification. Our main contributions are as follows: 1) we leverage transfer learning from question-answering datasets, and 2) we propose a simple but effective data augmentation method that introduces cross-encoder style texts into a bi-encoder architecture for more efficient inference. An extensive set of experiments shows that, compared to state-of-the-art models, our proposed method is effective in multi-label scenarios and on low-resource tags.


Introduction
Due to the overwhelming amount of educational content available, students and teachers often struggle to find what to learn and what to teach. Auto-tagging, or text classification in education, enables efficient curation of content by automatically assigning relevant tags to educational materials, which aids both students' understanding and teachers' planning (Goel et al., 2022).
However, applying auto-tagging in real-world education is challenging due to data scarcity. This is because auto-tagging has a potentially very large label space, ranging from subject topics to knowledge components (KCs) (Zhang et al., 2015; Koedinger et al., 2012; Mohania et al., 2021; Viswanathan et al., 2022). The resulting data scarcity decreases performance on rare labels during training (Chalkidis et al., 2020; Lu et al., 2020; Snell et al., 2017; Choi et al., 2022).
In this paper, we aim to solve the data scarcity problem by formulating the task as a retrieval problem, following a recent proposal (Viswanathan et al., 2022). This formulation can exploit a language model's ability to understand the tag text, such that even for an unseen tag, the model can capture the relationship between terms in the input content and the labels. However, performance in the auto-tagging context still critically depends on the amount of training data.
To this end, we first propose to leverage the knowledge of language models that are fine-tuned on large question-answering datasets. Our intuition is that a question whose answer is found in a passage can serve as a direct (or indirect) summary of that passage (Nogueira et al., 2019b), and can thus act as an efficient proxy for the gold tag of educational content. Large question-answering datasets therefore provide a better prior over tag spaces. Specifically, we adopt a recent bi-encoder architecture, DPR (Karpukhin et al., 2020), for transfer learning; it performs BERT encoding over the input and candidate label separately and measures the similarity between the final representations. To the best of our knowledge, our work is the first to leverage transfer learning from QA models for text classification tasks.
As a further innovation, we introduce a novel data augmentation method for training a bi-encoder architecture, named CEAA, which adds the cross-encoder view of the input-label pair to the bi-encoder architecture, as shown in Figure 1. By capturing the full interaction between input and labels already at training time, the model can be further optimized to take advantage of token-level interactions that are missing in traditional bi-encoder training. At the same time, the computational efficiency of the bi-encoder is maintained, which allows CEAA to tackle large label spaces, in contrast to existing solutions based on cross-encoder architectures (Urbanek et al., 2019; Wolf et al., 2019; Vig and Ramea, 2019). Experiments show that CEAA provides significant performance gains on most metrics across three different datasets when compared to state-of-the-art models.
We also demonstrate the efficacy of the method in multi-label settings under the constraint of training with only a single label per context.

Related Work
Text classification in the education domain is reportedly difficult, as the tags (or labels) are hierarchical (Xu et al., 2019; Goel et al., 2022; Mohania et al., 2021), grow flexibly, and can be multi-labeled (Medini et al., 2019; Dekel and Shamir, 2010). Though retrieval-based methods have been effective for such long-tailed and multi-label datasets (Zhang et al., 2022; Chang et al., 2019), they relied on vanilla BERT (Devlin et al., 2018) models, leaving room for improvement, for which we leverage question-answering fine-tuned retrieval models (Karpukhin et al., 2020).
Recently, Viswanathan et al. (2022) proposed TagRec++, which uses a bi-encoder framework similar to ours but introduces an additional cross-attention block. However, this architecture loses the efficiency of the bi-encoder in the large taxonomy space of the education domain. Unlike TagRec++, we leverage cross-attention only at training time, via input augmentation.

Problem formulation
In this paper, we address the text classification task, which aims to associate an input text with its corresponding class label, as a retrieval problem. Formally, given a context c and tag candidates T, the goal of the retrieval model is to find the correct (or relevant) tag t ∈ T whose relevance score with the context, s(c, t), is the highest among T or exceeds a threshold. Our focus is therefore to better train the scoring function s(c, t) against the given relevance between the context c and candidate tag t.

Bi-Encoder
In this paper, we use a bi-encoder as the base architecture for the retrieval task, as it is widely used for its fast inference (Karpukhin et al., 2020). Specifically, the bi-encoder consists of two encoders, E_C and E_T, which generate embeddings for the context c and the tag t. The similarity between the context and tag is measured by the dot product of their vectors:

s(c, t) = E_C(c)^T E_T(t).

Both encoders are based on the BERT architecture (Devlin et al., 2018), specifically "bert-base-uncased" provided by HuggingFace (Wolf et al., 2020), which is pretrained with the objective of predicting randomly masked tokens within a sentence. We use the last layer's hidden state of the classification token as the context and tag embeddings.
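As a toy illustration of the scoring above, the sketch below scores one context embedding against a small set of tag embeddings via dot products. The fixed vectors and tag names are hypothetical stand-ins; in the real model they would be the last-layer [CLS] outputs of the two BERT encoders.

```python
# Hypothetical stand-ins for the encoder outputs E_C(c) and E_T(t);
# in the paper these are [CLS] hidden states of two BERT encoders.
context_emb = [0.9, 0.1, 0.0]            # E_C(c)
tag_embs = {
    "physics": [1.0, 0.0, 0.0],          # E_T(t1)
    "history": [0.0, 1.0, 0.0],          # E_T(t2)
    "science": [0.5, 0.5, 0.0],          # E_T(t3)
}

def score(c_emb, t_emb):
    """s(c, t): dot product of context and tag embeddings."""
    return sum(a * b for a, b in zip(c_emb, t_emb))

# Score every candidate tag and retrieve the top-scoring one.
scores = {tag: score(context_emb, emb) for tag, emb in tag_embs.items()}
best_tag = max(scores, key=scores.get)
```

Because each tag is encoded independently of the context, the tag embeddings can be precomputed and indexed once, which is what makes bi-encoder inference fast over large tag spaces.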
For training the bi-encoder, we follow the in-batch negative training of Karpukhin et al. (2020): gold tags from the other contexts inside the batch are treated as negative tags. As tags are often multi-labeled, we use the binary cross-entropy loss

L = - Σ_{i,j} [ y_{i,j} log σ(s(c_i, t_j)) + (1 - y_{i,j}) log(1 - σ(s(c_i, t_j))) ],

where σ is the sigmoid function, s(c_i, t_j) scores the similarity between context c_i and tag t_j, and y_{i,j} is 1 if they are relevant and 0 otherwise. We denote this model variant as Bi-encoder (BERT) below.
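A minimal sketch of in-batch negative training with this loss, assuming a batch of precomputed context and gold-tag embeddings (the embeddings and batch layout here are illustrative, not the actual model outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_in_batch_loss(ctx_embs, tag_embs, labels):
    """In-batch negative training sketch: every gold tag in the batch is
    scored against every context; labels[i][j] = 1 iff tag j is relevant
    to context i, so gold tags of other contexts act as negatives."""
    total, count = 0.0, 0
    for i, c in enumerate(ctx_embs):
        for j, t in enumerate(tag_embs):
            s = sum(a * b for a, b in zip(c, t))   # s(c_i, t_j)
            p = sigmoid(s)
            y = labels[i][j]
            total += -(y * math.log(p + 1e-12)
                       + (1 - y) * math.log(1 - p + 1e-12))
            count += 1
    return total / count

# Batch of two contexts with their gold tags; diagonal labels mean each
# context is relevant only to its own gold tag.
ctx = [[1.0, 0.0], [0.0, 1.0]]
tags = [[1.0, 0.0], [0.0, 1.0]]
labels = [[1, 0], [0, 1]]
loss = bce_in_batch_loss(ctx, tags, labels)
```

Scaling the embeddings so that positive scores grow and negative scores stay low drives the loss down, which is the behavior the training objective rewards.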

Cross-Encoding As Augmentation
The cross-encoder (Nogueira and Cho, 2019) is another method for information retrieval tasks, in which a single BERT model receives the two inputs joined by a special separator token:

s(c, t) = F(BERT([CLS] c [SEP] t)),

where F is a neural function that takes the representation of the given sequence. Cross-encoders perform better than bi-encoders, as they directly compute cross-attention between context and tag throughout the layers (Urbanek et al., 2019; Wolf et al., 2019; Vig and Ramea, 2019). However, relying on this approach is impractical in our scenario, as it requires processing every existing tag for a context at inference time. As a result, this method is typically used for re-ranking (Nogueira et al., 2019a; Qu et al., 2021; Ren et al., 2021).
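The joined-input construction can be sketched as follows; the special-token strings are written out literally here for illustration, whereas a real implementation would use the tokenizer's own special tokens:

```python
def cross_encoder_input(context: str, tag: str) -> str:
    """Join context and tag into one sequence so a single BERT can
    compute full token-level cross-attention between them; F then
    scores the [CLS] representation of this joined sequence."""
    return f"[CLS] {context} [SEP] {tag} [SEP]"

# One joined sequence must be built and re-encoded per candidate tag,
# which is why the cross-encoder is impractical over large tag spaces.
example = cross_encoder_input("Photosynthesis converts light to energy.",
                              "biology > plants")
```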
As shown in Figure 1, we adopt an augmentation method that enables the bi-encoder framework to mimic the cross-encoder's representation learning. Compared to other knowledge distillation methods (Qu et al., 2021; Ren et al., 2021; Thakur et al., 2020), our approach does not require an additional cross-encoder network for training. Furthermore, as the cross-encoding is introduced as an augmentation strategy, it requires no additional memory or architecture modifications while improving test performance.
Specifically, for a context c, we randomly sample one of the tags in the original batch. We then extend the training batch with a context-tag concatenated input [c; t] whose gold tag is "is relevant". Our bi-encoder must thus also classify relevance when an input includes both context and tag, with the score function

s([c; t], t_rel) = E_C([c; t])^T E_T("is relevant").

Since we apply the augmentation via input editing, without an extra teacher cross-encoder model for distillation, we call this method Cross-Encoding As Augmentation (CEAA).
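The batch-extension step can be sketched as below; the pair representation, separator string, and function name are illustrative assumptions, not the paper's exact implementation:

```python
import random

def ceaa_augment(batch, seed=0):
    """CEAA sketch: for each context, sample one tag from the batch and
    append an extra training pair whose 'context' is the concatenation
    [c; t] and whose gold tag is the literal string "is relevant"."""
    rng = random.Random(seed)
    tags = [tag for _, tag in batch]
    augmented = list(batch)                      # keep original pairs
    for context, _ in batch:
        sampled = rng.choice(tags)               # random tag from the batch
        augmented.append((f"{context} [SEP] {sampled}", "is relevant"))
    return augmented

batch = [("What gas do plants absorb?", "biology > plants"),
         ("What is Newton's second law?", "physics > mechanics")]
extended = ceaa_augment(batch)
```

The augmented pairs are trained with the same bi-encoder scoring as the original pairs, so no extra network or memory is needed beyond the larger effective batch.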

Transfer Learning
To overcome data scarcity in auto-tagging tasks, we introduce Bi-encoder (DPR) models that distill knowledge from large question-answering datasets. We argue that the training objective of question answering is similar to context-tag matching in auto-tagging, as a question is a short text that identifies the core of a given context. Therefore, while previous works relied on vanilla BERT, here we explore whether pretraining on question-answering tasks improves performance on auto-tagging. Specifically, we replace the naive BERT encoders with DPR (Karpukhin et al., 2020), which is further optimized on the Natural Questions dataset (Lee et al., 2019; Kwiatkowski et al., 2019) for the open-domain question-answering task of matching document and question representations. To match the typical lengths of the texts, we use "dpr-ctx_encoder-single-nq-base" and "dpr-question_encoder-single-nq-base" as the context and tag encoders, respectively.

Results and Analysis
Overall Accuracy: The main objective of this work is to improve bi-encoder models for better text classification in two respects: transfer learning and CEAA. Regarding the effect of the two different pretrained models, the results in Table 1 show that models initialized from DPR achieve higher performance than models initialized from BERT. Specifically, Bi-encoder (DPR) outperforms Bi-encoder (BERT) on ARC (0.54 > 0.51 in R@1) and QC-Science (0.69 > 0.67 in R@1). On EURLEX57K, performance in both RP@5 and nDCG@5 increases by 0.02. Applying our augmentation method to the bi-encoder (both vanilla BERT and QA-finetuned BERT) improves performance by 0.06, 0.02, and 0.03 points on ARC, QC-Science, and EURLEX57K, respectively. Additionally, Bi-encoder (DPR) + CEAA demonstrates the highest overall performance in most cases (except for R@3 and R@5 on QC-Science, where the differences were small). For example, compared to TagRec++, the current state-of-the-art model on these datasets, our best model improves by 0.05 points in R@1. Figure 2 further shows RP@K and nDCG@K across a range of K on EURLEX57K, where CEAA shows consistently better performance. Notably, the gap from Bi-encoder (BERT) increases with K for both metrics.
Multi-label Generalization: To further highlight differences between single-label and multi-label settings, the two best models, Bi-encoder (DPR) and Bi-encoder (DPR) + CEAA, were trained on a modified single-label EURLEX57K dataset, in which we sampled only a single tag from the multi-label space. When the models are evaluated on the original multi-label dataset, it is important to achieve high nDCG@K for K ≥ 5, since a context in EURLEX57K has five gold tags on average. The results are presented in Figure 3. At K = 1, the models show comparable performance, with values of 0.65, 0.70, and 0.73 for Bi-encoder (DPR), Bi-encoder (DPR) + CEAA, and BERT classification, respectively. Though the classification model performs slightly better than CEAA at low K, its performance degrades significantly for K ≥ 5. Overall, the cross-encoding augmentation helped the model rank related tags higher. From these results, we argue that evaluating against a single-label dataset may not be an appropriate way to compare auto-tagging models: BERT classification initially appeared best, even though it works poorly in multi-label scenarios. This problem is critical, as multi-label issues are prevalent in education. To qualitatively examine which model ranks relevant tags better, we manually checked top-1 failure cases of both Bi-encoder (DPR) and Bi-encoder (DPR) + CEAA. The results in Appendix B.2 show that Bi-encoder (DPR) + CEAA retrieves better candidates than Bi-encoder (DPR) more often. As an interesting example, given the context ["The sector in which employees have more job security is an organized sector"], where the gold tag relates to the economy, Bi-encoder (DPR) + CEAA returns the tag ["human resources"], which is clearly relevant but not among the labeled tags. From these results, we once again confirm that the multi-label problem is severe in auto-tagging tasks and that our model yields significant results beyond the reported performance.
Data Efficiency: To assess the effectiveness of the augmentation on low-resource labels, we measured nDCG@5 on splits of labels based on their frequency in the training data. EURLEX57K considers labels that occur more than 50 times in the training set as frequent and few otherwise; we set the ARC dataset's threshold to 5. Figure 4 shows that both CEAA and transfer learning contribute to better performance on the frequent labels. Further, we observe that the retrieval methods are more effective on rarely occurring tags than standard classification methods. Notably, on ARC, a smaller dataset than EURLEX57K (5K < 45K), the combination of CEAA and transfer learning, CEAA (DPR), achieves the best performance.

Conclusion
In this paper, we discussed the problem of auto-tagging with regard to data scarcity due to its large label space, an issue that is critical in the education domain but also in other domains with a multi-label structure, such as jurisdictional or clinical contexts. We proposed two innovations to address this problem: first, exploiting the knowledge of language models trained on large question-answering datasets; second, applying a novel augmentation for the bi-encoder architecture, inspired by cross-encoders, to better capture the full interaction between inputs and labels while maintaining the bi-encoder's efficiency. A set of experiments demonstrated the effectiveness of our approach, especially in the multi-label setting. Future research will explore re-ranking scenarios in which the bi-encoder trained with our cross-encoding augmentation (CEAA) is reused to effectively re-rank tags with the cross-encoding mechanism, as in Nogueira and Cho (2019).

Limited Size of Language Models
Due to the recent successes of generative large language models as zero-shot (or few-shot) text classifiers (Radford et al., 2019; Brown et al., 2020), one may question the practicality of our method.
Even when disregarding computational efficiency, we argue that applying such large language models to XMC problems is not trivial, as it is challenging to constrain the label space appropriately. For example, even when the tag candidates we want for a task are entailment, neutral, and contradiction, a generative model may output tags outside this range, such as hamburger (Raffel et al., 2020). In-context learning (Min et al., 2022) may alleviate this concern, but with the large label spaces of our application, the token limits of standard language models would be exceeded.

Lack of Knowledge-level Auto-tagging
Though we pursue text classification tasks in the education domain, the classes usually represent only superficial information, such as chapter titles, which neglects deeper relationships between educational contents, such as prerequisite relations between knowledge components. For example, to solve a quadratic mathematical problem, the ability to solve first-order problems is required. However, the available texts carry only the superficial tags, and these concerns were not considered when the public datasets were created. Instructor-driven labeling would be an effective and practical solution for knowledge-level auto-tagging.

Inefficiency of Tag Encoder
One may argue that the performance of a single-BERT system is good enough to cast doubt on using two BERTs in the bi-encoder. In this context, our experiments showed the additional benefit of our approach for low-frequency tags. Nonetheless, the current tag encoder could be made much more efficient by using a BERT with fewer layers, which we will explore in the future.

Ethical Considerations
Incorrect or hidden decision processes of the AI tagging model could result in the wrong learning path.
The system would therefore need to be subject to human monitoring for occasional supervision. At the same time, the potential benefits of properly tagged content are large for both the learner's learning experience and the teacher's labeling cost, as the model can narrow the full tag space down to the top-K candidates.

A.1 Datasets

ARC (et al., 2019): This dataset consists of 7,775 multiple-choice question-and-answer pairs from the science domain. Each item is paired with a classification taxonomy, constructed to categorize questions into coarse-to-fine chapters of a science exam. There are 420 unique labels in total. The dataset is split into train, validation, and test sets of 5,597, 778, and 1,400 samples.

QC-Science (Mohania et al., 2021): This larger dataset consists of 47,832 question-answer pairs, also from the science domain, with 312 unique tags. Each tag is a hierarchical label in the form of subject, chapter, and topic. The train, validation, and test sets consist of 40,895, 2,153, and 4,784 samples.
EURLEX57K (Chalkidis et al., 2019): This dataset contains 57,000 English legislative documents from EUR-LEX, with a split of 45,000, 6,000, and 6,000. Every document is tagged with multi-label concepts from the European Vocabulary. The average number of tags per document is 5, out of 4,271 tags in total. Additionally, the dataset divides the tags into frequent (746), few (3,362), and zero (163), based on whether they appeared more than 50 times, fewer than 50 times but at least once, or never, respectively.

A.2 Details on Evaluation Metric
In this section, we explain the metrics used in the paper. First, Recall@K (R@K) is calculated as

R@K = (1/N) Σ_n S_t(K) / R_n,

where N is the number of test samples, R_n is the number of true tags for sample n, and S_t(K) is the number of true tags within the top-K results. For evaluation on multi-label datasets, we use R-Precision@K (RP@K) (Chalkidis et al., 2019):

RP@K = (1/N) Σ_n S_t(K) / min(K, R_n).

RP@K divides the number of true positives within the top-K by the minimum of K and R_n, resulting in a fairer and more informative comparison in a multi-label setting. nDCG@K (Manning, 2008) is another metric commonly used for such tasks. The difference between RP@K and nDCG@K is that the latter accounts for ranking quality, i.e., the positions of the relevant tags within the top-K retrieved tags:

nDCG@K = (1/N) Σ_n (1/Z_{K_n}) Σ_{k=1}^{K} Rel(n, k) / log_2(k + 1),

where Rel(n, k) is the relevance score between retrieved tag k and sample n as given by the dataset. The values can differ if the dataset provides graded relevance scores; without extra information, Rel(n, k) is one if the tag is relevant and zero otherwise. Z_{K_n} is a normalizing constant: the DCG@K obtained when the optimal top-K true tags are retrieved.
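Under the definitions above, the three metrics can be sketched per sample as follows, assuming binary relevance (the no-extra-information case) and averaging over samples done by the caller:

```python
import math

def recall_at_k(ranked, gold, k):
    """R@K for one sample: fraction of gold tags found in the top-K."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def r_precision_at_k(ranked, gold, k):
    """RP@K: true positives in the top-K divided by min(K, R_n)."""
    return len(set(ranked[:k]) & set(gold)) / min(k, len(gold))

def ndcg_at_k(ranked, gold, k):
    """nDCG@K with binary relevance: DCG of the ranking divided by the
    DCG of an ideal ranking that places gold tags first (Z_{K_n})."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, tag in enumerate(ranked[:k]) if tag in gold)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(gold))))
    return dcg / ideal if ideal > 0 else 0.0
```

For example, with gold tags {a, c} and the ranking [a, b, c], R@2 and RP@2 are both 0.5, while nDCG@2 additionally penalizes c being pushed out of the top 2.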

A.3 Hyperparameter Setting
The architecture we used can handle a maximum of 512 tokens. Therefore, to concatenate tag tokens with context tokens, we set the maximum context length to 490 tokens and truncate longer contexts; the remaining space is used for the concatenated tag tokens. For every dataset, we used 20 contexts per batch; the number of unique tags inside a batch can vary in multi-label settings. During cross-encoder augmentation, we sampled five negative tags and one positive tag per context to be joined. We used the Adam optimizer with a learning rate of 1e-5. For inference, we used the Pyserini framework to index the full set of tag embeddings (Lin et al., 2021).
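The 490/512 token-budget rule above can be sketched as follows (the function name and token-list representation are illustrative; a real pipeline would operate on tokenizer output):

```python
MAX_TOKENS = 512   # model's maximum sequence length
MAX_CONTEXT = 490  # context budget; the remainder is kept for the tag

def truncate_for_concat(context_tokens, tag_tokens):
    """Fit an augmented [c; t] input into the 512-token budget: truncate
    the context to 490 tokens and give the remaining space to the tag."""
    ctx = context_tokens[:MAX_CONTEXT]
    tag = tag_tokens[:MAX_TOKENS - len(ctx)]
    return ctx + tag

# A 600-token context is cut to 490; the tag fills (at most) the rest.
joined = truncate_for_concat(["tok"] * 600, ["tag"] * 40)
```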

B.1 Comments on Poly-encoder
In this section, we discuss the low performance of the Poly-encoder (Humeau et al., 2019) in our main results. Specifically, poly-encoder-16 and poly-encoder-360 were found to perform below TagRec++, where 16 and 360 denote the number of vectors used to represent a context. We suspect the low performance may stem from an implementation issue in adapting the poly-encoder to the classification task. The performance could differ if the 16 or 360 vectors were used to represent the tag rather than the context; we aim to investigate this change in future work.

B.2 Extra Qualitative Result
Table 2 shows the samples we used to probe the potential of the CEAA method in multi-label tasks. The shown results were randomly sampled.

ACL 2023 Responsible NLP Checklist
A For every submission: A1. Did you describe the limitations of your work? Yes; we discuss the limitations in Section 6: the limited size of the language models, the lack of knowledge-level auto-tagging, and the inefficiency of the tag encoder. A2. Did you discuss any potential risks of your work? Yes; we discuss ethical considerations in Section 7, covering the potential impact in the education domain, incorrect or inefficient learning paths for learners, and help for instructors' labeling process.
A3. Do the abstract and introduction summarize the paper's main claims?
Yes, we include the abstract and section 1 as an introduction to summarize the main claim.

A4. Have you used AI writing assistants when working on this paper?
To be honest, we initially used an AI writing assistant ("ChatGPT") for suggestions on better ways of organizing statements in the abstract. However, we soon found that its output either included wrong information or read like a fixed template, so we could already tell which parts came from the assistant. We therefore stopped using it and only used "Grammarly" to check for simple grammatical errors throughout the writing.

B Did you use or create scientific artifacts?
Yes. We used HuggingFace, discussed in Section 3, as well as the QC-Science, ARC, and EURLEX57K datasets and results from TagRec in Section 4, though we are not sure whether these count as scientific artifacts.

B1. Did you cite the creators of artifacts you used?
We added data and result citations in Sections 2, 3, and 4, which link to the references. B2. Did you discuss the license or terms for use and/or distribution of any artifacts? Not applicable. Left blank.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified?For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?Not applicable.Left blank.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?Not applicable.Left blank.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?Not applicable.Left blank.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created?Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results.For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.Not applicable.Left blank.

C Did you run computational experiments?
Left blank.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?Not applicable.Left blank.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values?Not applicable.Left blank.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run?Not applicable.Left blank.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)? Not applicable. Left blank. D Did you use human annotators (e.g., crowdworkers) or research with human participants?
Yes and no: in Section 4.2, we discuss one of the qualitative results obtained after model training. However, we used it only as a hint to investigate the effectiveness of the model in a multi-label setting. We are not sure, but we think this question is more focused on using human annotators' results for quantitative performance statements.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?Not applicable.Left blank.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?Not applicable.Left blank.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating?For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?Not applicable.Left blank.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?Not applicable.Left blank.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?Not applicable.Left blank.

Figure 1: Comparative illustration of encoding methods. CEAA lets the bi-encoder process inputs in which context and tag are given together, computing full token-level interactions between context and tag.

Figure 2: Comparison of models on EURLEX57K with two different metrics.

Figure 3: Multi-label evaluation. All models are trained on the single-label version of EURLEX57K but evaluated in the multi-label setting.

Figure 4: Analysis of data efficiency. We report nDCG for a varying number of training labels on EURLEX57K and ARC.

Table 1: Results of experiments on the ARC, QC-Science, and EURLEX57K datasets. We mainly compare the Bi-encoder with Bi-encoder + CEAA, where each encoder is pretrained with a different training objective (BERT vs. DPR).