A Semantic Feature-Wise Transformation Relation Network for Automatic Short Answer Grading

Automatic short answer grading (ASAG) is the task of assessing students’ short natural language responses to objective questions. It is a crucial component of new education platforms, and could support more widespread use of constructed response questions to replace cognitively less challenging multiple choice questions. We propose a Semantic Feature-wise transformation Relation Network (SFRN) that exploits the multiple components of ASAG datasets more effectively. SFRN captures relational knowledge among the questions (Q), reference answers or rubrics (R), and labeled student answers (A). A relation network learns vector representations for the elements of QRA triples, then combines the learned representations using learned semantic feature-wise transformations. We apply translation-based data augmentation to address the two problems of limited training data, and high data skew for multi-class ASAG tasks. Our model has up to 11% performance improvement over state-of-the-art results on the benchmark SemEval-2013 datasets, and surpasses custom approaches designed for a Kaggle challenge, demonstrating its generality.


Introduction
Educators at every level rely on classroom assessments to evaluate students' knowledge, often through quizzes. Multiple choice questions can be graded automatically, but many studies have shown that short answer, constructed response questions provide greater benefit to students (Lee et al., 2011; McDaniel et al., 2007; Butler and Roediger, 2007; Clariana, 2003). Manual assessment of short answer questions is time-consuming, and prone to bias and error (Galhardi et al., 2020; Bejar, 2012). Automatic Short Answer Grading (ASAG) applies NLP methods to reduce the assessment burden for short answer questions (Burrows et al., 2015), with the potential to assist educators in providing more timely feedback to students on a preferred assessment method. With increased reliance on virtual learning environments and educational technology, the potential impact of ASAG has grown.
The most common ASAG approach classifies students' answers into two or more categories. The key challenge is that the classification problem is inherently relational, involving the relation of the student's answer to the question, as well as to one or more reference answers. Another challenge is that existing datasets are relatively small, especially for the kinds of neural network models that perform best on other NLP tasks. Much of the ASAG work is conducted in industry labs with proprietary methods and datasets. Creation of benchmark datasets, however, has fostered broader interest in the problem from the NLP community. The main benchmark dataset, SemEval-2013 (Dzikovska et al., 2013), covers multiple STEM domains, and has multiple classification tasks regarding both the number of classes, and the level of generalization required. For example, one task addresses unseen answers to known questions (UA), another addresses unseen questions (UQ) within the same subject domain, and a third addresses unseen domains (UD), i.e., student answers to topics and questions not seen in the training data. While this is a rich dataset, it is not large enough to support complex models for all the classification tasks. A somewhat larger dataset, ASAP-SAS, is less well structured, and contains a range of supporting information other than reference answers, such as rubrics. While several challenge models performed well, they were not described in publications. We use both datasets.
Besides the potential benefits for educational technology, we believe ASAG can push NLP research in new directions due to the relational nature of the task, and the graded difficulty of the different classification tasks that ASAG presents. Figure 1 shows a question (Q), reference answer (R), and student answer (A) from SemEval-2013. The color coding of words simulates a Venn diagram of conceptual overlap of all three components (green), versus only Q and R (blue), only R and A (yellow), and only Q and A (red). We hypothesize that a relation network can better exploit these different similarity spaces. The input to our neural network consists of QRA triples, which are then modeled as a 3-way relation.
We present a Semantic Feature-Wise transformation Relation Network (SFRN). It is inspired by vision research using relation networks (Santoro et al., 2017) and feature mapping (Perez et al., 2018) functions, which were both motivated by a desire for greater generalization ability (referred to as reasoning), combined with computational efficiency and low model complexity. SFRN is an end-to-end model with three components. An encoder first encodes each component of a QRA triple, producing vectors for the question, one of a set of reference answers, and the student answer. When there are multiple reference answers, a relation network converts each triple of vectors for a given student answer into a single relation vector, and a learned feature-wise transformation function merges all the relation vectors for the student answer by leveraging attention weights calculated from each QRA triple. Finally, a classifier determines which class the student answer belongs to.
To address data insufficiency and class imbalance, we adopt a simple data augmentation method, back-translation, which was studied for augmentation of paraphrase data (Wieting et al., 2017), and has been used in other NLP problems. Most ASAG datasets are relatively small, especially compared to the typical training sets for large neural networks, and the multiway classification problems have extreme class imbalance. For example, the SemEval-2013 Beetle subset for the 5-way classification task has only 4,146 training samples, and one of the classes has only 195 examples. Here, as in Xie et al. (2020), we apply back-translation to generate examples from existing ones without changing the label. We find that data augmentation is beneficial for simple models, like logistic regression, and for complex models, including our most complex model, SFRN+BERT. SFRN+BERT combined with data augmentation achieves up to 11% performance improvement over the state-of-the-art.
Our contributions are: SFRN+, a novel relation network that outperforms the state-of-the-art; the use of data augmentation to address data insufficiency and imbalance; and the ability to learn ASAG relations from either reference answers or rubrics. The next section presents related work on ASAG, relation networks, and data augmentation. Section 3 presents our SFRN and SFRN+ models. Section 4 describes the datasets and classification tasks. Section 5 presents our data augmentation method. Section 6 presents our experiments and results.
Related Work

ASAG is generally modeled as a classification problem with two or more classes. Burrows et al. (2015) give a thorough overview of benchmark datasets and ASAG systems. Early work (Mohler and Mihalcea, 2009) formulated ASAG as a comparison of semantic text similarity between student and reference answers. A wide range of handcrafted features have been used: POS tag, word and character n-gram features (Heilman and Madnani, 2013), context overlap features (Ott et al., 2013), and graph alignment and lexical semantic similarity features (Mohler et al., 2011; Sultan et al., 2016), for input to SVM or other kinds of classifiers. Recent work applies a combination of deep neural networks with data mining. Attention networks have been used on large proprietary datasets (Ha et al., 2020). Süzen et al. (2020) used text mining to improve similarity results. Contextualized semantic representations like BERT have also been used (Hassan et al., 2018; Sung et al., 2019; Camus and Filighera, 2020). Saha et al. (2018) leveraged both hand-crafted features and sentence embeddings to achieve high performance on many tasks.
Relation Networks (RN) originated as an alternative to other kinds of graph-based neural models to develop relation-based representations, and were designed to overcome the limitations of CNNs and MLPs for reasoning problems in vision, NLP and symbolic domains such as physics (Santoro et al., 2017). RN performance on the visual question answering dataset CLEVR (Johnson et al., 2017) surpassed human performance. RNs have also proven effective at improving object detection models (Hu et al., 2018). Moreover, RNs have become a general framework for few-shot learning (Sung et al., 2018). We adapt RNs for the three-way relation represented by the questions, reference answers, and student answers in ASAG datasets.
RNs typically combine learned representations of relational vectors using vector addition. We combine the relation vectors that represent multiple reference answers for the same question and student answer using a learned relation fusion function. For the fusion function, we use feature-wise transformation, based on its success with multi-modal data, e.g., images and text. Perez et al. (2018) developed FiLM, an approach to merge information from language and visual input. A language vector serves as conditioning input, to control the scaling and shifting of the visual feature map in a feature-wise fashion. Similarly, we train a function with learnable parameters to combine QRA triples for a given student answer. We have not seen feature-wise transformation used for vector combination in ASAG research, which typically relies instead on fixed arithmetic operations or concatenation.
Data augmentation is less common in NLP, compared with computer vision, where image data augmentation is standard. Operations on images such as rotating an image a few degrees, or converting it to grayscale, do not change their essential meaning. In general, data augmentation is used both to increase the size of the training data and to add irrelevant noise to examples to improve the robustness of learned models. Recently, data augmentation has been found to significantly improve performance on NLP tasks as varied as paraphrasing (Wieting et al., 2017), natural language generation (Kedzie and McKeown, 2019), semantic parsing (Cao et al., 2019), and various sentiment and opinion classification tasks (Kobayashi, 2018). One method is to substitute a random word with a synonym drawn from a lexical database like WordNet (Mueller and Thyagarajan, 2016; Zhang et al., 2015; Wei and Zou, 2019), or to use word embeddings to find synonyms (Wang and Yang, 2015; Jiao et al., 2020). Back-translation leverages machine translation to paraphrase a text while retaining the meaning (Wieting et al., 2017; Edunov et al., 2018; Xie et al., 2020). We adopt back-translation for its ease of use, given that machine translation methods have achieved very high performance.

SFRN
A Relation Network (RN) (Santoro et al., 2017) is particularly suitable for ASAG because it is designed to infer higher order generalizations, meaning generalizations that hold across tuples of examples, in a data efficient manner. RNs have been used in vision to efficiently learn generalizations across pairs of objects without having to learn individual weights for all possible object pairs in the data. We extend the RN framework to handle relations across vectorized text triples using a data fusion function.
In its simplest form, an RN is a composite function:

RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j) )    (1)

where O is a set of input objects {o_1, o_2, . . . , o_n}, o_i ∈ ℝ^m, and f_φ and g_θ are functions with trainable parameters. The inner function g_θ learns a relation over tuples, which feeds the outer classifier f_φ with an abstract representation of the tuple objects.
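The composite function above can be sketched concretely. Below is a minimal NumPy illustration with randomly initialized stand-in MLPs; the two-layer structure and all dimensions are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden, out_dim):
    """A tiny two-layer perceptron with fixed random weights (illustration only)."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

m = 8                          # object dimensionality
g_theta = mlp(2 * m, 16, 16)   # relation function g over object pairs
f_phi = mlp(16, 16, 3)         # classifier f over the aggregated relation

def relation_network(objects):
    """RN(O) = f_phi( sum over all ordered pairs (i, j) of g_theta([o_i; o_j]) )."""
    pair_relations = [g_theta(np.concatenate([oi, oj]))
                      for oi in objects for oj in objects]
    return f_phi(np.sum(pair_relations, axis=0))

objects = [rng.standard_normal(m) for _ in range(4)]
logits = relation_network(objects)   # one score per class
```

Note the data efficiency the text describes: g_theta has one set of weights shared across all object pairs, rather than separate weights per pair.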
In this section, we first present the SFRN model to convert the three vectors for each Q, R and A into a relational vector, then we introduce the Semantic Feature-wise Transformation unit that fuses relation vectors. Although encoding the textual Q, R and A inputs into vectors is the first step in the model, we postpone discussion of the different encoders we try out until the last subsection.

Creating the QRA Relation Vectors
The encoded vectors for a given triple are (q, r_j, a), j ∈ {0, 1, . . . , n}, where n is the number of reference answers for a given question q and a student answer a of this question. Corresponding to equation (1) above, the relation vectors l_j are inferred by g_θ, using the concatenation of the QRA vectors from the encoding step as the input (left hashed box in Figure 2):

l_j = g_θ([q; r_j; a])    (2)

where g_θ is a multilayer perceptron (MLP) with learnable parameters θ, producing the relation vectors l_j ∈ L for j ∈ {0, 1, . . . , n} (one per reference answer; see center hashed box in Figure 2).

Relation Fusion
After learning the relation vectors, the RN in Santoro et al. (2017) simply sums them. Following the feature-wise transformations of Perez et al. (2018), we adopt a relation fusion unit, Semantic Feature-wise Transformation (SFT), to learn different weights for fusing each of the learned n relation vectors l_j for one of the k answers a_ik to a question q_i. The concatenation of the three vectors in the QRA triples serves as a conditioning context to learn how to incorporate each of the n relation vectors with its own weights, before classifying the student's answer a_ik to a question q_i. For a given q_i and one answer a_ik, the input to SFT is thus the set of triples C, the set of learned relations L, and for clarity the size n of these sets:

SFT(C, L, n) = Σ_j ( α(c_j) ⊙ l_j + β(c_j) )    (3)

where C is the set of n QRA triples [q_i, r_ij, a_ik], L is the set of n relation vectors from equation (2) for the n reference answers, c_j is the concatenation of a QRA triple consisting of [q_i, r_ij, a_ik], l_j is the corresponding learned relation, ⊙ is element-wise multiplication, and α and β are MLPs. The output of SFT is a fused relation vector that represents all the relational information in the input QRA triples for a given question q_i and student answer a_ik, relative to the reference answers r_ij.
Finally, an MLP function f_φ classifies the output of SFT into one or more classes, depending on the ground truth labeling scheme. Combining equations (2) and (3), the composite function becomes:

SFRN(C) = f_φ( Σ_j ( α(c_j) ⊙ g_θ(c_j) + β(c_j) ) )    (4)

where g_θ = MLP([q, r, a]). Overall, SFRN is a relation-based classifier that takes the QRA triple as input. Since the functions are all MLPs, the whole architecture is simple and end-to-end differentiable.
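Putting equations (2), (3), and (4) together, a compact sketch of the forward pass for one question and one student answer follows. All weights are random stand-ins, and the single-layer tanh "MLPs" and dimensions are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(in_dim, out_dim):
    """Single-layer stand-in for a learned MLP (illustration only)."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.tanh(x @ W)

d = 8           # encoder output dim for each of Q, R, A
h = 16          # relation vector dim
n_refs = 3      # reference answers for this question
n_classes = 5

g_theta = mlp(3 * d, h)       # eq (2): relation over a QRA triple
alpha   = mlp(3 * d, h)       # feature-wise scale, conditioned on the triple
beta    = mlp(3 * d, h)       # feature-wise shift
f_phi   = mlp(h, n_classes)   # eq (4): final classifier

q = rng.standard_normal(d)
a = rng.standard_normal(d)
refs = [rng.standard_normal(d) for _ in range(n_refs)]

# eq (2): one relation vector per reference answer
triples = [np.concatenate([q, r, a]) for r in refs]
relations = [g_theta(c) for c in triples]

# eq (3): fuse relations with learned feature-wise scaling and shifting
fused = np.sum([alpha(c) * l + beta(c) for c, l in zip(triples, relations)],
               axis=0)

# eq (4): classify the fused relation vector
logits = f_phi(fused)
```

Unlike the plain RN sum, each relation vector here gets its own triple-conditioned scale and shift before aggregation.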

SFRN Encoder
We experiment with different SFRN encoders. Our baseline SFRN model uses LSTM (Hochreiter and Schmidhuber, 1997), which is relatively easy to train, but prone to overfitting and information loss. We compare LSTM with a BERT-based encoder.
BERT is a deep, pre-trained, transformer-based model that has proven to be extremely powerful when fine-tuned for a wide range of NLP tasks (Devlin et al., 2019). We use the last output layer of the BERT base model, and fine-tune it on the ASAG data. We also use the pre-trained BERT base as an encoder in a logistic regression baseline. We use the bert-base-uncased model pre-trained on the BooksCorpus and Wikipedia, with a 30,000 token vocabulary, and 110 million parameters. For fine-tuning, we pre-process the sentences in our QRA triples by prefixing [CLS] and postfixing [SEP] to the word token lists for each question text, or reference or student answer. Then we take the last layer output (of 12 layers in total) as the vector encodings of the elements of each QRA triple.
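The [CLS]/[SEP] pre-processing described above can be sketched as follows. The whitespace split is a stand-in for BERT's WordPiece tokenizer, and the example texts are invented:

```python
def bert_inputs(question, reference, answer):
    """Format each element of a QRA triple for BERT fine-tuning: prefix [CLS]
    and append [SEP] to each element's token list."""
    def wrap(text):
        # Plain whitespace split as a stand-in for WordPiece tokenization
        return ["[CLS]"] + text.split() + ["[SEP]"]
    return wrap(question), wrap(reference), wrap(answer)

q_toks, r_toks, a_toks = bert_inputs(
    "Why does the gap affect the terminals?",
    "the terminals are separated by a gap",
    "there is a space between the terminals",
)
```

Each of the three wrapped sequences is then encoded separately, and the last-layer outputs serve as the q, r, and a vectors of the triple.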

Datasets
The ASAG datasets we use are SemEval-2013, and the ASAP-SAS Kaggle competition dataset. The two were created for different purposes, and have distinct structures. With ASAP, we use only the components that correspond roughly to the SemEval format, as explained further below.
SemEval-2013 (Dzikovska et al., 2013) provides two training datasets, Beetle and SciEntsBank, that have 2-way, 3-way (Correct, Contradictory, Incorrect) and 5-way (Correct, Partially correct incomplete, Contradictory, Irrelevant, Non domain) class labels. SciEntsBank has three classification tasks comprising unseen answers (UA), unseen questions (UQ) or unseen domains (UD), and Beetle has the first two of these. The 5-way labels were chosen to potentially provide tutoring feedback. The Beetle dataset consists of 56 questions on basic electricity and electronics, with approximately 3,000 student answers. SciEntsBank contains approximately 10,000 answers to 197 assessment questions across 15 different science domains. Each question has from 1 to 14 reference answers.
To test the generality of our method, we also apply it to a dataset that lacks reference answers, Automated Student Assessment Prize Short Answer Scoring (ASAP-SAS), used in a Kaggle competition (Shermis, 2014). It has 10 prompts from science, biology and English Language Arts (ELA), with 17,207 training examples and 5,224 test examples. Responses are rated by two annotators: four prompts are rated in {0, 1, 2, 3}, and others are in {0, 1, 2}. This dataset has a wide array of other information, including rubrics. Therefore we test SFRN on triples that use rubrics in place of reference answers. We exclude all information other than the questions, rubrics, and student answers.
Many secondary and post-secondary STEM courses that use short answer questions rely on rubrics rather than reference answers. We find that SFRN performs as well as the models that use the full resources in ASAP, which shows the generalization ability of SFRN, and makes it potentially more useful to educators. We use the score assigned by the first annotator as the label and evaluate with Quadratic Weighted Kappa (QWK), a chance-corrected measure of agreement between graders, following the Kaggle competition guidelines.
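QWK can be computed from the confusion matrix of the two raters' ordinal scores, penalizing disagreements by the square of their distance. A self-contained sketch (not the official Kaggle scoring script) is:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_labels):
    """QWK for two raters assigning ordinal labels 0..n_labels-1."""
    # Observed confusion matrix
    O = np.zeros((n_labels, n_labels))
    for x, y in zip(rater_a, rater_b):
        O[x, y] += 1
    # Expected matrix under rater independence, scaled to the same total
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: (i - j)^2 / (n - 1)^2
    i, j = np.indices((n_labels, n_labels))
    W = (i - j) ** 2 / (n_labels - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Perfect agreement gives kappa = 1
assert quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4) == 1.0

kappa = quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 1, 2], 3)  # ≈ 0.8
```

This matches the behavior of scikit-learn's `cohen_kappa_score` with `weights="quadratic"`.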
Data Augmentation

Since the SemEval ASAG dataset has limited training data, and high data skew for the 3-way and 5-way tasks, we utilize back-translation to efficiently generate more examples for any given class (Edunov et al., 2018). Back-translation refers to translation from a source language to one or more pivot languages, followed by translation from the pivot(s) back to the source. We found good performance using two one-step pivot languages, French and Chinese. We interleaved use of a state-of-the-art neural machine translation (NMT) system, EasyNMT, with the Google Translation API. This gives us greater control over the number of new examples (Google limits calls to Google Translate), as well as more noise injection for robustness. We randomly select sentence-label pairs to generate variant sentences with the same label. If the EasyNMT back-translation is not different from the source, we call Google Translate. Figure 3 shows two examples of back-translation, one for each pivot language we use. With Chinese as the pivot, the original word gap was converted to distance, and the two prepositional phrase arguments of separated were swapped. With French as the pivot, space replaces gap, and word order is preserved.
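The fallback logic described above can be sketched as below. The translator callables are toy stand-ins for EasyNMT and the Google Translation API, which are not reproduced here:

```python
def back_translate(text, pivot, primary_mt, fallback_mt):
    """Translate text to a pivot language and back. If the primary system
    returns the source unchanged, retry with the fallback system
    (per the text: EasyNMT first, Google Translate as fallback)."""
    candidate = primary_mt(primary_mt(text, "en", pivot), pivot, "en")
    if candidate.strip().lower() == text.strip().lower():
        candidate = fallback_mt(fallback_mt(text, "en", pivot), pivot, "en")
    return candidate

# Toy stand-in translators for illustration only.
def identity_mt(text, src, tgt):
    return text

def toy_mt(text, src, tgt):
    # Mimics the paper's French-pivot example: "gap" comes back as "space"
    swaps = {"gap": "space"} if tgt == "en" else {}
    return " ".join(swaps.get(w, w) for w in text.split())

aug = back_translate("the gap between plates", "fr", identity_mt, toy_mt)
# aug == "the space between plates"
```

The augmented example keeps the original's label, since back-translation is assumed to preserve meaning.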
Through trial and error, we found that data balancing was not effective unless there was a large enough gap between the original size of the rebalanced class and the largest class. There were also limits to the maximum augmentation that worked well. If there was at least a five-fold difference between the size of a class and the largest one, we doubled the size of the small class. Otherwise, data augmentation resulted either in little improvement or in degraded performance. We speculate that increasing the diversity of linguistic form along with the number of examples might allow for greater increases in class size.
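The rebalancing rule above can be sketched as follows. The class counts are hypothetical except for the 195-example class mentioned earlier, and the label names are invented for illustration:

```python
def augmentation_targets(class_counts):
    """Number of back-translated examples to add per class, following the
    heuristic in the text: double a class only if it is at least five times
    smaller than the largest class; otherwise leave it unchanged."""
    largest = max(class_counts.values())
    return {label: (count if largest >= 5 * count else 0)
            for label, count in class_counts.items()}

# Hypothetical Beetle-like 5-way class sizes (only 195 comes from the text)
counts = {"correct": 2000, "partially_correct": 900, "contradictory": 700,
          "irrelevant": 350, "non_domain": 195}
targets = augmentation_targets(counts)
# "non_domain" (2000/195 > 5) is doubled; "partially_correct" is left alone
```

Adding `targets[label]` examples to each class doubles exactly the classes that clear the five-fold threshold.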

Back-translation Experiment
Here we test our back-translation data augmentation on a logistic regression baseline ASAG model with an LSTM encoder (also used in the experiments in section 6). We compare all pairs among five conditions combined with the original Beetle datasets on the 3-way UA task: doubling the data, tripling the data, using French as the pivot language, using Chinese as the pivot language, and combining examples from the French and Chinese back-translations. We find statistically significant improvements over the LR baseline, especially for the comparison of the original dataset with an augmentation that uses both Chinese and French back-translations.
We double and triple the original data to use as controls for comparison with augmentation by back-translation, to verify that it is not size alone that matters. Thus the original and two controls are org, double, triple. The augmented datasets org+fr, org+ch are the same size as double, and org+ch+fr is the same size as triple.
To get average performance results, we repeated ten iterations of training and testing on the Beetle 3-way UA classification. We trained a logistic regression baseline (LR) on the 6 training sets (see section 6.1 for the LR training details), then applied t-tests on the means to compare mean accuracy for all pairs of conditions. The results are shown in Table 1; we reject the null hypothesis that two conditions have equal mean accuracy when p ≤ 0.05. The first two rows of Table 1 show the means and standard deviations over the ten trials for each condition. The remaining rows show the difference in means and p-values for each pair, comparing rows and columns. Table 1 shows that all the data augmentation conditions are significantly better than org (p-values ≤ 0.05). The condition org+ch+fr has the best absolute improvement over org, with a difference in average performance of 7.92. We conclude that back-translation is useful for data augmentation and re-balancing.
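The pairwise comparison can be sketched with SciPy's independent-samples t-test. The per-trial accuracies below are invented for illustration, not the paper's measurements:

```python
from itertools import combinations

from scipy.stats import ttest_ind

# Hypothetical per-condition accuracies over ten trials (illustrative only)
results = {
    "org":       [71.2, 70.8, 71.5, 70.9, 71.1, 70.7, 71.3, 71.0, 70.6, 71.4],
    "org+ch+fr": [79.0, 78.6, 79.3, 78.8, 79.1, 78.5, 79.2, 78.9, 78.7, 79.4],
}

# t-test on the means for every pair of conditions
for a, b in combinations(results, 2):
    t, p = ttest_ind(results[a], results[b])
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"{a} vs {b}: p = {p:.3g} ({verdict})")
```

With ten trials per condition, `ttest_ind` compares the two sample means under an equal-variance assumption by default; `equal_var=False` would give Welch's test instead.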
Using the best performing augmentation (org+ch+fr), we created new datasets, Beetle+ and SciEntsBank+. Beetle+ (N=12,438) is three times the size of the original Beetle (N=4,146), and SciEntsBank+ (N=15,450) is just over three times the size of the original SciEntsBank (N=5,104).

Experiments
The research questions our experiments address are: 1) How well does SFRN perform compared to the state-of-the-art? 2) Do data augmentation and rebalancing improve SFRN performance? For both the SemEval-2013 and ASAP-SAS datasets, we compare the two SFRN variants, with an LSTM encoder (SFRN) or a BERT encoder (SFRN+), against multiple baselines. On SemEval, we also compare performance after training on the augmented SemEval-2013+. Performance metrics are accuracy, and macro-averaged F1 (M-F1). SFRN+ performs competitively on most SemEval-2013 tasks without data augmentation. With data augmentation, SFRN+ outperforms the state-of-the-art on all Beetle tasks, and on most SciEntsBank tasks. On ASAP-SAS, performance is measured using quadratic weighted kappa (QWK). SFRN+ outperforms all baselines, including three Kaggle models that use the full datasets, while SFRN+ uses a subset of the data that is more analogous to the SemEval datasets. We calculated 95% confidence intervals for all results for our models in Tables 2, 3, 4, and 5: all margins of error were at most 2%. To save space, however, only the point estimates are shown in the tables. For ASAP-SAS, following the benchmark models presented with the dataset, we report the best results over all runs.

SemEval-2013 Experiments
On SemEval-2013, we compare SFRN and SFRN+ with eight baselines: 1) LO (Dzikovska et al., 2013), a model based on lexical overlap; 2) ETS (Heilman and Madnani, 2013) and 3) CoMeT (Ott et al., 2013), both of which use handcrafted features; 4) TF+SF (Saha et al., 2018), which combines handcrafted features with deep learning; 5) LR, a logistic regression baseline we developed that uses the same LSTM encoder as SFRN; 6) LR+ (Sung et al., 2019), a logistic regression model that uses the pre-trained BERT-base model with fine-tuning as the encoder (since Sung et al. (2019) report results only for the 3-way SciEntsBank tasks, we re-implemented LR+); 7) RN, a relation network baseline without the relation fusion module, using the same LSTM encoder as SFRN; 8) RN+, a relation network baseline without the relation fusion module, which uses the pre-trained BERT-base model with fine-tuning as the encoder.

We trained the LR, RN and SFRN models that use an LSTM encoder with batches of size 32 and hidden size 256, using cross entropy loss, the Adam optimizer, a step learning rate from 5e-6 to 5e-4, and dropout of 50% on every function. Word lookup used 300D GloVe embeddings (Wikipedia/Gigaword) (Pennington et al., 2014) as input. For LR+, RN+ and SFRN+ with BERT as the encoder, we also used cross entropy loss with the Adam optimizer, but a smaller learning rate of 1e-5 for BERT, and 3e-4 for the g and f functions. We used 20% of the training samples as a dev set. We varied the number of fine-tuning epochs from 5 to 10, depending on performance.

Table 2 gives the results on Beetle. SFRN+ outperforms all the baselines on the 3-way tasks, and on the 2-way UA task. On the 2-way UQ task, it is bested only by LO and ETS. It performs in the mid-range on the 5-way task. We will see, however, that with data augmentation and rebalancing, SFRN+ outperforms all models on all Beetle tasks. Table 3 gives the results on SciEntsBank.
SFRN+ outperforms all baselines on the 3-way UA tasks, but TF+SF outperforms other models by a large margin on 2-way and 3-way UQ and UD. SFRN+ achieves the highest accuracy on the 5-way UA task, and the highest M-F1 on the 3-way UQ, 3-way UD and 5-way UQ tasks. On the other 5-way tasks, ETS performs best.
Tables 2 and 3 also show that SFRN and SFRN+ outperform RN and RN+ on all the sub-tasks, which indicates that the relation fusion module learns an effective method to combine the relation vectors that boosts the model performance.
In the next subsection, we present results after retraining our models on the augmented datasets Beetle+ and SciEntsBank+.

ASAP-SAS Experiments
As a further test of the generalization ability of SFRN and SFRN+, we run experiments on another large scale ASAG dataset, ASAP-SAS, where the training data has a different structure. As mentioned above, for this dataset we use rubrics in place of reference answers in SFRN's QRA triples. We again train four models: LR, LR+, SFRN, and SFRN+. We compare against four published baselines using QWK: 1) human raters; 2) the Kaggle winner (Tandalla), which relies on regular expression matching; 3) AutoP (Ramachandran et al., 2015), a stacked patterns model; 4) the model of Riordan et al. (2017) (Rior), an LSTM network with attention. The four published baselines use the full ASAP-SAS resources, whereas our four models were trained only on questions, rubrics, and student answers. The ASAP-SAS results appear in Table 6. SFRN+ has the best performance, even though it relies on less data than AutoP, Rior or Tandalla. AutoP performs nearly as well as SFRN+. This suggests that SFRN with a BERT encoder not only has strong generalization ability, but also has the flexibility to learn from triples that contain either reference answers or rubrics.

Error Analysis
We carried out an error analysis motivated by the large gaps between accuracy and M-F1 on many tasks. On the 5-way tasks, no model achieved an M-F1 greater than 0.73. Inspection of the per-class performance on the 5-way tasks reveals that our four models all perform far worse on the non-domain class, which is much smaller than three of the other four classes, and consists mainly of very short phrases with little semantic content. We believe that progress on the 5-way tasks will depend on a combination of input from domain experts and more sophisticated data augmentation.

Conclusion
We have presented a new type of relation network, SFRN, that learns relational information from QRA triples for automatic short answer grading (ASAG). It can learn from two types of training data, using reference answers or rubrics. SFRN+, the version with the BERT encoder, outperforms the previous state-of-the-art by 8-11%, depending on the dataset and classification task, when combined with a simple data augmentation method to compensate for the small and unbalanced training data. As relational meaning is central to NLP, our future work will investigate ways to improve SFRN, to understand its behavior, and to apply it to new problems. Another key avenue we aim to explore, however, is how to improve data augmentation and balancing for ASAG, and for other NLP tasks where data is difficult to come by.