Ling-CL: Understanding NLP Models through Linguistic Curricula

We employ a characterization of linguistic complexity from psycholinguistic and language acquisition research to develop data-driven curricula to understand the underlying linguistic knowledge that models learn to address NLP tasks. The novelty of our approach is in the development of linguistic curricula derived from data, existing knowledge about linguistic complexity, and model behavior during training. By analyzing several benchmark NLP datasets, our curriculum learning approaches identify sets of linguistic metrics (indices) that inform the challenges and reasoning required to address each task. Our work will inform future research in all NLP areas, allowing linguistic complexity to be considered early in the research and development process. In addition, our work prompts an examination of gold standards and fair evaluation in NLP.


Introduction
Linguists have devised effective approaches to determine the linguistic complexity of text data (Wolfe-Quintero et al., 1998; Bulté and Housen, 2012; Housen et al., 2019). There is a spectrum of linguistic complexity indices for English, ranging from lexical diversity (Malvern et al., 2004; Yu, 2010) to word sophistication (O'Dell et al., 2000; Harley and King, 1989) to higher-level metrics such as readability, coherence, and information entropy (van der Sluis and van den Broek, 2010). These indices have not been fully leveraged in NLP.
We investigate the explicit incorporation of linguistic complexity of text data into the training process of NLP models, aiming to uncover the linguistic knowledge that models learn to address NLP tasks. Figure 1 shows the data distribution and accuracy trend of RoBERTa-large (Liu et al., 2019) against the linguistic complexity index "verb variation" (ratio of distinct verbs). This analysis is conducted on ANLI (Nie et al., 2020) for individual bins separately. The accuracy trend indicates that verb variation can describe the difficulty of ANLI samples to the model. In addition, the data distribution illustrates potential linguistic disparity in ANLI; see §3.4.

To reveal the linguistic knowledge NLP models learn during their training, we employ known linguistic complexity indices to build multiview linguistic curricula for NLP tasks. A curriculum is a training paradigm that schedules data samples in a meaningful order for iterative training, e.g., by starting with easier samples and gradually introducing more difficult ones (Bengio et al., 2009). Effective curricula improve learning in humans (Tabibian et al., 2019; Nishimura, 2018) and machines (Bengio et al., 2009; Kumar et al., 2010; Zhou et al., 2020; Castells et al., 2020). Curriculum learning has been found effective in many NLP tasks (Settles and Meeder, 2016; Amiri et al., 2017; Platanios et al., 2019; Zhang et al., 2019; Amiri, 2019; Xu et al., 2020; Lalor and Yu, 2020; Jafarpour et al., 2021; Kreutzer et al., 2021; Agrawal and Carpuat, 2022; Maharana and Bansal, 2022). A multiview curriculum is a curriculum that can integrate multiple difficulty scores simultaneously and leverage their collective value (Vakil and Amiri, 2023).
We assume there exists a subset of linguistic complexity indices that are most influential to learning an NLP task by a particular model. To identify these indices for each model and NLP task, we derive a weight factor ρ_i ∈ [-1, 1] for each linguistic index that quantifies how well the index estimates the true difficulty of data samples to the model, determined by the model's instantaneous loss on validation data. By learning these weight factors, we obtain precise estimates that shed light on the core linguistic complexity indices that each model needs at different stages of its training to learn an NLP task. In addition, these estimates can be readily used for linguistic curriculum development, e.g., by training models with linguistically easy samples (with respect to the model) and gradually introducing linguistically challenging samples.
To achieve the above goals, we address two gaps in the existing literature. First, existing curricula are often limited to a single criterion of difficulty and are not applicable to multiview settings, even though difficulty can be realized from multiple perspectives, can vary across a continuum for different models, and can change dynamically as the model improves. Second, existing approaches quantify the difficulty of data based on instantaneous training loss; however, training loss provides noisy estimates of sample difficulty due to data memorization in neural models (Zhang et al., 2017; Arpit et al., 2017). We address both issues as part of this research.
The contributions of this paper are:
• incorporating human-verified linguistic complexity information to establish an effective scoring function for assessing the difficulty of text data with respect to NLP models,
• deriving linguistic curricula for NLP models based on the linguistic complexity of data and model behavior during training, and
• identifying the core sets of linguistic complexity indices that most contribute to learning NLP tasks by models.
We evaluate our approach on several NLP tasks that require significant linguistic knowledge and reasoning to be addressed. Experimental results show that our approach can uncover latent linguistic knowledge that is most important for addressing NLP tasks. In addition, our approach obtains consistent performance gains over competing models. Source code and data are available at https://github.com/CLU-UML/Ling-CL.

We present a framework for multiview curriculum learning using linguistic complexity indices. Our framework estimates the importance of various linguistic complexity indices, aggregates the resulting importance scores to determine the difficulty of samples for learning NLP tasks, and develops novel curricula for training models using complexity indices. The list of all indices used is given in Appendix A.

Correlation Approach
Given linguistic indices {X_j}_{j=1}^k of n data samples, where k is the number of linguistic indices and X_j ∈ R^n, we start by standardizing the indices, Z_j = (X_j - μ_j)/σ_j. We consider importance weight factors {ρ_j}_{j=1}^k for the indices, which are randomly initialized at the start of training. At every validation step, the weights are estimated on the validation dataset by computing the Pearson correlation coefficient between loss and the linguistic indices of the validation samples, ρ_j = r(l, Z_j), where r is the correlation function and l ∈ R^n is the loss of validation samples. The correlation factors are updated periodically. It is important to use validation loss as opposed to training loss because the instantaneous loss of seen data can be affected by memorization in neural networks (Zhang et al., 2017; Arpit et al., 2017; Wang et al., 2020); unseen data points more accurately represent the difficulty of samples for a model. Algorithm 1 presents the correlation approach.
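The correlation step above can be sketched in NumPy as follows; the function and variable names are ours, and the toy data (three hypothetical indices, a loss driven by the first index) is purely illustrative:

```python
import numpy as np

def estimate_index_weights(val_losses, val_indices):
    """Estimate importance weights rho_j as the Pearson correlation between
    validation losses and each standardized linguistic index (Algorithm 1)."""
    Z = (val_indices - val_indices.mean(axis=0)) / val_indices.std(axis=0)
    l = val_losses - val_losses.mean()
    # Pearson r per column: E[l * Z_j] / (std(l) * std(Z_j)), with std(Z_j) = 1
    return (l @ Z) / (len(l) * val_losses.std())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # 3 hypothetical indices
loss = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # loss tracks index 0
rho = estimate_index_weights(loss, X)               # rho[0] near 1, others near 0
```

Recomputing these weights at each validation step, as the paper describes, lets the curriculum track which indices matter at each stage of training.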

Optimization Approach
Let Z ∈ R^{n×k} be the matrix of k linguistic indices computed for n validation samples and l ∈ R^n the corresponding loss vector of the validation samples. We find the optimal weights for linguistic indices to best approximate validation loss:

ρ* = argmin_ρ ||Zρ - l||² + λ_ρ ||ρ||²,

where λ_ρ ∈ R and ρ* ∈ R^k is jointly optimized over all indices. The index that best correlates with loss can be obtained as follows:

j* = argmax_i |r(l, Z_{*i})|,

where Z_{*i} denotes the i-th column of Z. Algorithm 2 presents this approach. We note that the main distinction between the correlation and optimization approaches lies in their scope: the correlation approach operates at the index level, whereas the optimization approach uses the entire set of indices jointly.
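A ridge-style objective of this form has a closed-form solution; the sketch below assumes that objective (the exact form and the value of λ_ρ are not recoverable from the text), with illustrative synthetic data:

```python
import numpy as np

def fit_index_weights(Z, l, lam=0.1):
    """Jointly fit rho over all k indices, assuming a ridge-style objective:
    rho* = argmin_rho ||Z rho - l||^2 + lam * ||rho||^2 (closed form)."""
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ l)

rng = np.random.default_rng(1)
Z = rng.normal(size=(300, 4))                       # standardized indices
l = Z @ np.array([1.5, 0.0, -0.5, 0.0]) + 0.05 * rng.normal(size=300)
rho = fit_index_weights(Z, l)
best = int(np.argmax(np.abs(rho)))                  # most loss-predictive index
```

Unlike the per-index correlation approach, the joint fit accounts for redundancy: two correlated indices share weight rather than each receiving the full credit.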

Scoring Linguistic Complexity
We propose two methods for aggregating linguistic indices {X_j}_{j=1}^k and their corresponding importance factors {ρ_j}_{j=1}^k into a linguistic complexity score. The first method selects the linguistic index with the maximum importance score at each timestep:

S_i = Z_{iĵ}, ĵ = argmax_j ρ_j,

which provides insights into the specific indices that determine the complexity to the model. The second method computes a weighted average of linguistic indices, which serves as a difficulty score:

S_i = Σ_j ρ_j Z_{ij} / Σ_j |ρ_j|,

where S_i ∈ R, standardized so that (μ_S, σ_S) = (0, 1), is an aggregate of the linguistic complexity indices for the input text.
If an index Z_j is negatively correlated with loss, ρ_j will be negative, so ρ_j Z_j will be positively correlated with loss. Therefore, S_i is an aggregate complexity score that is positively correlated with loss. Using a weighted average causes the indices that are most highly correlated with loss to contribute the most to S_i.
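Both aggregation methods can be sketched as follows; the normalization by Σ|ρ_j| and the final re-standardization are our assumptions, chosen so that S is positively correlated with loss and has zero mean and unit variance:

```python
import numpy as np

def aggregate(Z, rho, method="max"):
    """Collapse standardized indices Z (n x k) and importance weights rho
    into one difficulty score per sample."""
    if method == "max":
        j = int(np.argmax(np.abs(rho)))     # single most important index
        return np.sign(rho[j]) * Z[:, j]    # flip sign so score tracks loss
    S = Z @ rho / np.abs(rho).sum()         # weighted average of indices
    return (S - S.mean()) / S.std()         # standardize the aggregate

rng = np.random.default_rng(2)
Z = rng.normal(size=(100, 3))
rho = np.array([0.8, -0.2, 0.1])
S_max = aggregate(Z, rho, "max")            # equals the top index's column
S_avg = aggregate(Z, rho, "avg")            # standardized weighted average
```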

Linguistic Curriculum
We evaluate the quality of weighted linguistic indices as a difficulty score and introduce three new curricula based on moving logistic (Richards, 1959) and Gaussian functions; see Figure 2.

Time-varying Sigmoid
We develop a time-varying sigmoid function to produce weights (Eq. 3). The sigmoid function assigns a low weight to samples with small difficulty scores and a high weight to samples with larger difficulty scores. Weights are used to emphasize or de-emphasize the loss of different samples. For this purpose, we use the training progress t ∈ [0, 1] as a shift parameter that moves the sigmoid function to the left throughout training, so that samples with a small difficulty score are assigned a higher weight in the later stages of training. By the end of training, all samples are assigned a weight close to 1. Additionally, we add a scale parameter β ∈ [1, ∞) that controls the growth rate of the weight (upper bounded by 1) for all samples.
The sigmoid function saturates as the absolute value of its input increases. To account for this, the aggregated linguistic index used as input in Equations 3 and 4 follows the standard scale, with a mean of zero and a variance of one.
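One way to realize this time-varying sigmoid is sketched below; Eq. 3 itself is not reproduced here, and the β value and shift scale are illustrative assumptions:

```python
import numpy as np

def pos_sigmoid_weight(S, t, beta=2.0, shift=5.0):
    """Weight in (0, 1) for difficulty score S at training progress t in [0, 1].
    The sigmoid moves left as t grows, so easy (low-S) samples that start
    with low weight approach weight 1 by the end of training."""
    return 1.0 / (1.0 + np.exp(-beta * (S + shift * t)))

w_easy_start = pos_sigmoid_weight(-1.0, 0.0)   # easy sample, start of training
w_easy_end = pos_sigmoid_weight(-1.0, 1.0)     # same sample, end of training
w_hard_start = pos_sigmoid_weight(1.0, 0.0)    # hard sample, start of training
```

Because S is standardized, a fixed shift scale suffices to saturate the weights of nearly all samples by t = 1.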

Moving Negative-sigmoid
The positive sigmoid function assigns greater weights to samples with a large value of S, i.e., samples that are linguistically more complex. To establish a curriculum that starts with easy samples and gradually proceeds to harder ones, we instead use a negative sigmoid function. As training progresses, the functions are shifted using the parameter t in Equations 5 and 7, so that samples with higher complexity are assigned higher confidence if the positive sigmoid (a) is used, samples with lower complexity are assigned higher confidence if the negative sigmoid (b) is used, and samples with medium complexity are assigned higher confidence if the Gaussian (c) is used. Figure 2 illustrates the time-varying positive and negative sigmoid functions. Over the course of training, larger intervals of linguistic complexity are assigned full confidence, until the end of training when almost all samples have a confidence of one and are fully used in training.
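The negative sigmoid mirrors the positive one; a sketch follows (the exact parameterization of Eq. 5 is not reproduced, and β and the shift scale are illustrative assumptions):

```python
import numpy as np

def neg_sigmoid_weight(S, t, beta=2.0, shift=5.0):
    """Easy (low-S) samples start with weight near 1; shifting the function
    with training progress t raises the weight of harder samples until
    almost all samples are fully used at the end of training."""
    return 1.0 / (1.0 + np.exp(beta * (S - shift * t)))

w_easy_start = neg_sigmoid_weight(-1.0, 0.0)   # high: curriculum starts easy
w_hard_start = neg_sigmoid_weight(1.0, 0.0)    # low: hard samples held back
w_hard_end = neg_sigmoid_weight(1.0, 1.0)      # near 1 by end of training
```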

Time-varying Gaussian Function
We hypothesize that training samples that are neither too hard nor too easy are the most useful in training and should receive the most focus. In fact, samples that are too easy or too hard may contain artifacts that are harmful to training, may contain noise, and may not generalize to the target task. Therefore, we use a Gaussian function to prioritize learning from medium-difficulty samples. The function starts with a variance of 1 and scales up over the course of training, so that easier and harder samples, having lower and higher linguistic complexity values respectively, are assigned increasing weights and are learned by the end of training; γ is the rate of growth of the variance and t is the training progress, see Figure 2.
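A sketch of the time-varying Gaussian under the stated behavior (initial variance of 1, growing with training progress); the exact growth parameterization involving γ is our assumption:

```python
import numpy as np

def gaussian_weight(S, t, gamma=4.0):
    """Peak weight at medium difficulty (S = 0); the standard deviation grows
    from 1 at rate gamma with training progress t, so easier and harder
    samples receive increasing weights as training proceeds."""
    sigma = 1.0 + gamma * t
    return float(np.exp(-S**2 / (2.0 * sigma**2)))

w_medium = gaussian_weight(0.0, 0.0)     # medium samples dominate early on
w_hard_start = gaussian_weight(2.0, 0.0)  # extreme samples start de-emphasized
w_hard_end = gaussian_weight(2.0, 1.0)    # and are weighted in by the end
```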

Weighting-based Curriculum
We define a curriculum by weighting sample losses according to their confidence. Samples that are most useful for training receive higher weights, and those that are redundant or noisy receive smaller weights. Weighting the losses effectively causes the gradient update direction to be dominated by the samples that the curriculum deems most useful. Weights w are computed using Equation 5, 6, or 7 and applied to the per-sample losses ℓ_i, where t is the current training progress and L is the resulting average weighted loss.
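The weighting step itself is straightforward; the mean-based normalization below is an assumption about the elided equation:

```python
import numpy as np

def curriculum_loss(losses, weights):
    """Average weighted loss L: high-confidence samples dominate the
    gradient, while redundant or noisy samples are de-emphasized."""
    return float(np.mean(weights * losses))

losses = np.array([1.0, 2.0, 3.0])     # per-sample losses l_i
weights = np.array([1.0, 1.0, 0.0])    # third sample effectively dropped
L = curriculum_loss(losses, weights)
```

In a training loop, `weights` would come from one of the three time-varying functions above, recomputed as t advances.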

Reducing Redundancy in Indices
We have curated a list of 241 linguistic complexity indices. In the case of a text pair input (e.g., NLI), we concatenate the indices of the two text inputs, for a total of 482. Our initial data analysis reveals significant correlation among these indices in their estimation of linguistic complexity. To optimize computation, avoid redundancy, and ensure no single correlated index skews the complexity aggregation approach (§2.1.3), we propose two methods to select a diverse and distinct set of indices for our study. We consider the choice of using the full set of indices or filtering them as a hyper-parameter.
In the first approach, for each linguistic index, we split the dataset into m partitions based on the index values (similar to Figure 1). Next, using a trained No-CL (§3.3) model, we compute the accuracy for each partition. Then, we find the first-order accuracy trend across these partitions. Linguistic indices with a pronounced slope capture great variance in the data and are considered for our study; we select the top 30% of indices, reducing their count from 482 to 144 for text pair inputs.
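This slope-based filter can be sketched as follows, assuming equal-count bins and a least-squares line over per-bin accuracies (m and the selection threshold are hyperparameters; the synthetic index and labels are illustrative):

```python
import numpy as np

def accuracy_slope(index_values, correct, m=10):
    """Partition samples into m bins by a linguistic index, compute per-bin
    accuracy, and return the slope of the first-order accuracy trend."""
    order = np.argsort(index_values)
    bins = np.array_split(order, m)
    acc = np.array([correct[b].mean() for b in bins])
    slope, _intercept = np.polyfit(np.arange(m), acc, 1)
    return slope

rng = np.random.default_rng(3)
x = rng.uniform(size=1000)                            # hypothetical index
correct = (rng.uniform(size=1000) > x).astype(float)  # harder as x grows
slope = accuracy_slope(x, correct)                    # pronounced negative slope
```

Ranking all indices by |slope| and keeping the top 30% reproduces the filtering described above.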
In the second approach, we compute pair-wise correlations between all indices. Then, we group highly correlated indices, as shown in Figure 3. From each cluster, we select a representative index, aiming to prevent correlated indices from dominating the aggregation approach and to eliminate redundancy. This method narrows our focus to the following 16 key indices: 1) type-token ratio (TTR), 2) semantic richness, 3) ratio of verbs to tokens, 4) mean TTR of all k-word segments, 5) total number of verbs, 6) number of unique words, 7) adverbs per sentence, 8) number of unique words in the first k tokens, 9) ratio of nouns to verbs, 10) semantic noise, 11) lexical sophistication, 12) verb sophistication, 13) clauses per sentence, 14) average SubtlexUS CDlow value per token, 15) adjective variation, and 16) ratio of unique verbs. Please refer to Appendix A for definitions and references to indices.
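A greedy stand-in for this grouping step can be sketched as follows (the paper clusters correlated indices and keeps one representative per cluster; the greedy scheme and the 0.7 threshold here are our simplifications):

```python
import numpy as np

def representative_indices(X, threshold=0.7):
    """Greedily pick one representative per group of highly correlated
    columns of X, so redundant indices cannot dominate the aggregation."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    k = corr.shape[0]
    reps, assigned = [], np.zeros(k, dtype=bool)
    for j in range(k):
        if assigned[j]:
            continue
        reps.append(j)                      # j leads a new cluster
        assigned |= corr[j] >= threshold    # absorb indices correlated with j
    return reps

rng = np.random.default_rng(4)
a, b = rng.normal(size=500), rng.normal(size=500)
X = np.column_stack([a, a + 0.01 * rng.normal(size=500), b])  # cols 0, 1 redundant
reps = representative_indices(X)            # one of the redundant pair dropped
```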

Datasets
We evaluate NLP models in learning the tasks of the following datasets:
• SNLI: Stanford Natural Language Inference (Bowman et al., 2015). The task is to classify a pair of sentences by the relation between them as one of entailment, neutral, or contradiction.
• ANLI: Adversarial Natural Language Inference (Nie et al., 2020). This NLI dataset was created with a model in the loop, by only adding samples to the dataset that fool the model. We train only on the ANLI training set of 162k samples.
• SST-2: Stanford Sentiment Treebank (Socher et al., 2013). The task is to predict the sentiment of a given sentence as positive or negative.
• RTE: Recognizing Textual Entailment (Wang et al., 2018). The task is to determine if a given sentence is entailed by another given sentence.
• AN-Pairs: Adjective-Noun Pairs. The task is to detect if an adjective-noun pair, including pairs that are typically confusing to language learners, is used correctly in the context of a sentence.
• GED: Grammatical Error Detection (Yannakoudakis et al., 2011). The task is to identify word-level grammar errors in given sentences.

Difficulty Scoring Functions
The curriculum learning approaches in §2.2 use difficulty scores or compute confidence to quantify sample difficulty in order to rank sentences. As difficulty scores, we use the aggregate linguistic complexity Ling (see §2.1.3) and Loss (Xu et al., 2020; Wu et al., 2021; Zhou et al., 2020). We take the loss from a proxy model (No-CL in §3.3) by recording all sample losses twice per epoch during training and computing the sample-wise average.

Baselines
We consider a no-curriculum baseline as well as several recent curriculum learning approaches.
• No-CL: no-curriculum uses standard random mini-batch sampling from the whole dataset without sample weighting.
• Sampling (Bengio et al., 2009).
• Loss-CL: computes loss as a difficulty score by recording the losses of samples during the training of No-CL. The loss generated during the early stages of training by an under-trained model is a good measure of the relative difficulty of both training and validation samples.

We compare the above models against our approaches, Ling-CL, which aggregate linguistic indices using weighted-average or max-index aggregation and apply different curriculum strategies: sigmoid, negative-sigmoid, and Gaussian weighting, as well as sampling and competence-based approaches; see §3.3. We test variants of our approach with the correlation method, the optimization method, and index filtering. We report results of the max aggregation approach (§2.1.3) as it performs better than the weighted average and is computationally cheaper.

Evaluation Metrics
Linguistic disparity can be quantified by the extent of asymmetry in the probability distribution of the linguistic complexity of samples in a dataset; see, e.g., Figure 1 in §1. A natural solution for evaluating models is to group samples based on their linguistic complexity. Such grouping is crucial because, if easy samples are overrepresented in a dataset, models can attain unrealistically high performance on that dataset. Therefore, we propose to partition datasets based on a difficulty metric (linguistic index or loss) and compute the balanced accuracy of different models on the resulting groups. This evaluation approach reveals considerable weaknesses in models and in benchmark datasets or tasks that seemed almost "solved," such as the complex task of NLI.
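The proposed evaluation can be sketched as follows, assuming equal-width difficulty ranges (the paper's exact partitioning scheme may differ); the toy data deliberately overrepresents easy samples:

```python
import numpy as np

def balanced_accuracy_by_difficulty(difficulty, correct, m=3):
    """Partition test samples into m equal-width difficulty ranges and average
    per-bin accuracy, so overrepresented easy samples cannot inflate the score."""
    edges = np.linspace(difficulty.min(), difficulty.max(), m + 1)
    bins = np.digitize(difficulty, edges[1:-1])
    return float(np.mean([correct[bins == b].mean() for b in range(m)]))

difficulty = np.array([0.1, 0.2, 0.15, 0.25, 0.3, 0.2, 0.9, 0.95])  # easy-heavy set
correct    = np.array([1.0, 1.0, 1.0,  1.0,  1.0, 1.0, 0.0, 0.0])
plain = float(correct.mean())                                  # inflated: 0.75
balanced = balanced_accuracy_by_difficulty(difficulty, correct, m=2)  # 0.5
```

The gap between the plain and balanced scores is exactly the kind of hidden weakness this evaluation is meant to expose.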

Experimental Settings
We use the transformer model roberta-base (Liu et al., 2019) from the Transformers library (Wolf et al., 2020), run each experiment with at least two random seeds, and report the average performance. We use the AdamW optimizer (Loshchilov and Hutter, 2018) with a learning rate of 1e-5, a batch size of 16, and a weight decay of 1e-2 for all models. The model checkpoint with the best validation accuracy is used for final evaluation. In NLI tasks with a pair of text inputs, the indices of both texts are used. For Ling-CL, we optimize the choice of index importance estimation method and aggregation method. For the baselines, we optimize the parameters of SuperLoss (λ and the moving average method) and the two parameters of the SL-CL and WR-CL models for each dataset. For data selection, we use a warm-up period of 20% of the total training iterations.
Table 1: Balanced accuracy by linguistic index (word rarity). Accuracy is the metric for all datasets except CoLA and GED; CoLA uses Matthews correlation and GED uses the F_0.5 score. Ling-CL uses our aggregate linguistic complexity as the difficulty score, and Loss-CL uses the average loss of a sample throughout a full training run.

Enhanced Linguistic Performance
Table 1 shows the performance of different models when test samples are grouped based on word rarity. The results show that the performance of the baseline models drops severely compared to standard training (No-CL), whereas our Ling-CL approach results in a 4.5 absolute-point improvement in accuracy over the best-performing baseline, averaged across tasks, owing to its effective use of linguistic indices. Appendix D shows the overall results on the entire test sets, as well as results when test samples are grouped based on their loss; we use loss because it is a widely used measure of difficulty in curriculum learning. These groupings allow for a detailed examination of model performance across samples of varying difficulty, providing insights into the strengths and weaknesses of models. For example, the performance on SNLI varies from 89.8 to 90.6. However, when word rarity is used to group data based on difficulty, the performance range drops significantly, to between 74.4 and 83.6, indicating the importance of the proposed evaluation measure. We observe that such grouping does not considerably change the performance on ANLI, which indicates the high quality of the dataset. In addition, it increases model performance on AN-Pairs and GED, which indicates a greater prevalence of harder examples in these datasets.
On average, the optimization approach outperforms the correlation approach by 1.6% ±1.9% accuracy in our experiments. Also notably, on average, the argmax index aggregation outperforms the weighted average by 1.9% ±1.9%, and the filtered indices outperform the full list of indices by 1.4% ±1.1%.

Learning Dynamics for NLP Tasks
Identification of Key Linguistic Indices. We analyze the linguistic indices that most contribute to learning NLP tasks. For this purpose, we use the evaluation approach described in §3.4 for computing balanced accuracy according to linguistic indices. Table 2 shows the top three important linguistic indices for each dataset as identified by our optimization algorithm using the Gaussian curriculum. Importance is measured by the average ρ value. Early, middle, and late stages divide the training progress into three equal thirds; the top index in the early stage is the index with the highest average ρ during the first 33.3% of training. The top indices are those that most accurately estimate the true difficulty of samples, as they most highly correlate with validation loss.
Table 2 shows that different indices are important for different tasks. This means that a single set of linguistic indices cannot serve as a general text difficulty score; instead, important indices should be identified for each task, which can be achieved by our index importance estimation approach (§2.1) and evaluation metric (§3.4).

Analysis of Linguistic Indices for Grammar Tasks

We consider the grammar tasks for analysis. For AN-Pairs (adjective-noun pairs), during the early stage, the top indices are the number of tokens per sentence, the age of acquisition (AoA) of words, and the mean length of sentences. This is meaningful because longer sentences might introduce modifiers or sub-clauses that can create ambiguity or make it more challenging to discern the intended adjective-noun relationship accurately. Regarding AoA, words that are acquired later in life or belong to more specialized domains might pose challenges in accurately judging the correct usage of adjective-noun pairs because of their varying degrees of familiarity and the potential difficulty associated with specific vocabulary choices.
During the middle stage, AoA increases in importance and remains challenging to the model, and the number of adverbs per sentence rises in rank to join the top three indices. In the context of adjective-noun pairs, the presence of multiple adverbs in a sentence can affect the interpretation and intensity of the adjective's meaning, because adverbs often modify verbs, adjectives, or other adverbs. Depending on the specific adverbs used, they may enhance, weaken, or alter the intended relationship between the adjective and the noun. Moreover, the presence of several adverbs can introduce challenges in identifying and correctly interpreting the relationship between adjectives and nouns due to increased syntactic complexity.
In the third stage, the number of adverbs per sentence becomes the most important index, while AoA and the number of tokens per sentence drop out of the top three. In the early stage, AoA and the number of tokens have ρ values of 0.168 and 0.164, respectively; in the late stage, they drop to 0.11 and 0.13, while the number of adverbs per sentence rises from 0.138 in the early stage to 0.181 in the late stage. We see that indices may become dominant not only by increasing their ρ value but also as other indices drop once the model has learned them. Therefore, Ling-CL can determine the order in which to learn linguistic indices, and then learn them sequentially.
Regarding GED, noun variation is the dominant index throughout the training process. Such variation is important because it affects syntactic agreement, subject-verb agreement, modifier placement, and determiner selection. These factors affect grammatical consistency and coherence within the sentence structure, leading to the importance of noun variation throughout training.

Dominant Indices for the CoLA Task. Regarding CoLA, the number of function words is the dominant index at the early stage, and the number of coordinating conjunctions at the middle and late stages of training. These words are crucial in establishing the syntactic structure of a sentence; they directly contribute to agreement and references, coherence, and adherence to grammar rules. We note that T-units (independent/main clauses with their associated subordinate clauses) are higher-order linguistic constituents that provide information about the dependency relations between sub-constituents and the overall coherence of sentences. Indices related to T-units are among the top three crucial indices.
Trends and Relationships between ρ and Balanced Accuracy. We use the GED dataset (§3.1) to analyze the trends of ρ throughout training and the relation between ρ and balanced accuracy. Figure 4 shows the progression of ρ alongside the progression of balanced accuracy for selected linguistic indices. This figure is produced using No-CL. We observe across several indices that ρ is high when balanced accuracy is low, indicating that the index is challenging to the model and is therefore used for learning with a high ρ, and that ρ decreases as the index is learned. However, Figure 4a shows that an increase in balanced accuracy does not necessarily imply a decrease in ρ. In this case, the model performs relatively well on the index, but the index remains predictive of loss: although the average performance increased, the variation in performance among different values of the index remains high. We find that numerous indices follow the same trend of ρ. In Appendix B, we propose a method for clustering ρ curves to effectively uncover patterns and similarities in the learning of different indices. Further analysis of the dynamics of ρ is the subject of future work.
In addition, we find that the ranking of the top indices is almost constant throughout training. This quality may be useful for creating an approach that gathers the index rankings early on and utilizes them for training. Appendix E lists influential indices by their change in ρ across stages of training. We note that the "number of input sentences" index is the least important metric because it is almost constant across samples: 75% of the samples in the datasets consist of a single sentence.

Conclusion and Future Work
We propose a new approach to linguistic curriculum learning. Our approach estimates the importance of multiple linguistic indices and aggregates them, provides effective difficulty estimates through correlation and optimization methods, and introduces novel curricula for using these difficulty estimates to uncover the underlying linguistic knowledge that NLP models learn during training. Furthermore, we present a method for a more accurate and fair evaluation of computational models for NLP tasks according to linguistic indices. In addition, the estimated importance factors provide insights about each dataset and NLP task, the linguistic challenges contained within each task, and the factors that most contribute to model performance on the task. Further analysis of such learning dynamics for each NLP task will shed light on the linguistic capabilities of computational models at different stages of their training.
Our framework and the corresponding tools serve as a guide for assessing linguistic complexity for various NLP tasks and for uncovering the learning dynamics of the corresponding NLP models during training. While we conducted our analysis on seven tasks and extracted insights on the key indices for each task, NLP researchers have the flexibility to either build on our results or apply our approach to other NLP tasks to extract relevant insights. Promising areas for future work include deriving optimal linguistic curricula tailored to each NLP task; examining and enhancing the linguistic capabilities of different computational models, particularly with respect to linguistically complex inputs; and developing challenge datasets that carry a fair distribution of linguistically complex examples for various NLP tasks. In addition, future work could study why specific indices are important, how they connect to the linguistic challenges of each task, and how different linguistic indices jointly contribute to learning a target task. We expect other aggregation functions, such as log-average, exponential-average, and probabilistic selection of the maximum, to be effective approaches for difficulty estimation based on validation loss. Finally, other variations of the proposed Gaussian curriculum could be investigated for model improvement.

Limitations
Our work requires the availability of linguistic indices, which in turn requires expert knowledge. Such availability requirements may not be fulfilled in many languages. Nevertheless, some linguistic complexity indices are language independent, such as the commonly used "word rarity" measure, which facilitates extending our approach to other languages. Moreover, our approach relies on the effectiveness of specific linguistic complexity indices for the target tasks and datasets employed for evaluation; different linguistic complexity indices may not capture all aspects of linguistic complexity and may yield different results for the same task or dataset. In addition, the incorporation of linguistic complexity indices and the generation of data-driven curricula can introduce additional computational overhead during the training process. Finally, our approach does not provide insights into the interactions between linguistic indices during training.

B Clustering ρ Curves

Figure 5 illustrates the process of clustering linguistic indices based on their matching ρ curves. We cluster the indices using hierarchical clustering with complete linkage and flat clustering.

C Linguistic Complexity Indices
We consider linguistic complexity in terms of variability and sophistication in productive vocabulary and grammatical structures in textual content. We employ a characterization of such complexity based on existing findings in language acquisition research (Wolfe-Quintero et al., 1998; Lu, 2010, 2012). Specifically, we obtain 56 complexity measures from Lu (2010) and Lu (2012), including lexical and syntactic measures. Additionally, we use 185 linguistic features from the lingfeat library (Lee et al., 2021), including semantic, lexical, syntactic, discourse, and traditional features. In total, we use 241 indices. For inputs that consist of a pair of sentences, we concatenate the indices for a total of 482 indices.

C.1 Lexical Complexity
In terms of lexical complexity, we consider three dimensions: lexical density, sophistication, and variation, described below.

Lexical density: quantified by the ratio of the number of open-class words to the total number of words in a given text. Texts with higher lexical density are expected to be more complex, as they contain larger amounts of information-carrying words.
Lexical sophistication: measures the proportion of sophisticated (relatively unusual or advanced) words in the input text (O'Dell et al., 2000), e.g., words not in the top K (K = 5000) frequent words of the target dataset or language. Example indices include the ratio of sophisticated lexical words (Linnarud, 1986; Hyltenstam, 1988), sophisticated word types (Wolfe-Quintero et al., 1998), and sophisticated verb types (Harley and King, 1989), with several variations as reported in Appendix A, Table 3. We use the top K most frequent words of each dataset and consider different inflections of the same lemma as one type for computing lexical sophistication.
Lexical variation: refers to the diversity of vocabulary in a given text. Examples include the type-token ratio (Templin, 1957), the ratio of the number of word types to the number of words in the text, and several variations of this metric (Malvern et al., 2004; McKee et al., 2000; McClure, 1991), including the D-measure (Malvern et al., 2004), which determines the lexical variation of an input text by finding the curve that best matches the actual curve of type-token ratio against tokens of the input.
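Two of these lexical measures can be computed for a single text as follows; the whitespace tokenization and toy frequent-word list are simplifications (real implementations use proper tokenizers and lemmatization, as described above):

```python
def lexical_stats(text, frequent_words):
    """Type-token ratio (lexical variation) and the share of word types
    outside a frequent-word list (a crude lexical sophistication proxy)."""
    tokens = text.lower().split()
    types = set(tokens)
    return {
        "ttr": len(types) / len(tokens),
        "sophistication": sum(w not in frequent_words for w in types) / len(types),
    }

frequent = {"the", "a", "mouse", "ate", "cheese"}
stats = lexical_stats("The mouse ate the cheese", frequent)
# 5 tokens, 4 types -> ttr = 0.8; all types are frequent -> sophistication = 0.0
```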

C.2 Syntactic Complexity
Syntactic complexity captures variability and sophistication with respect to grammatical structures. Simple sentences such as "the mouse ate the cheese" can be converted into linguistically complex counterparts, e.g., "the mouse the cat the dog bit chased ate the cheese," which are still well-formed but force readers to suspend their partial understanding of the sentence as they encounter subordinate clauses, substantially increasing cognitive load. We employ syntactic complexity measures that quantify the length of production units at the clausal, sentential, or T-unit levels; indices that reflect the amount of subordination, e.g., T-unit complexity ratio (clauses per T-unit) or dependent clause ratio (dependent clauses per clause); indices that quantify the amount of coordination, e.g., the number of coordinate phrases per clause, T-unit, or complex T-unit; as well as those that quantify the range of surface, syntactic, and morphological structures (e.g., frequency and variety of tensed forms, or extent of affixation) (Wolfe-Quintero et al., 1998; Ortega, 2003). See Appendix A, Table 3.

Table 7 shows the indices with the maximum change in their ρ values between any two stages of training. Only the relative differences and ranking of ρ values are important; therefore, the table displays relative changes in the magnitude of the importance factors. Indices with a large change in magnitude are either influential at an early stage of training and drop in importance later, or vice versa. See our analysis of these indices in §3.7.
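The subordination indices mentioned above reduce to simple ratios over clause and T-unit counts; a minimal sketch, assuming the counts themselves come from a syntactic parser:

```python
def t_unit_complexity_ratio(n_clauses, n_t_units):
    """Clauses per T-unit. A T-unit is one main clause plus any
    subordinate clauses attached to it."""
    return n_clauses / n_t_units if n_t_units else 0.0

def dependent_clause_ratio(n_dependent_clauses, n_clauses):
    """Dependent clauses per clause."""
    return n_dependent_clauses / n_clauses if n_clauses else 0.0
```

For the center-embedded example above ("the mouse the cat the dog bit chased ate the cheese"), three clauses in a single T-unit yield a T-unit complexity ratio of 3.0.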

Figure 1: Data distribution and trend of model accuracy against the linguistic index verb variation, computed on ANLI (Nie et al., 2020) validation data. Samples with greater verb variation are more complex and harder for the model to classify. Such linguistic indices can inform difficulty estimation and linguistic curriculum development for NLP tasks.

Figure 2: At the beginning of training, the sigmoid function with the lowest opacity is used: the right-most curve in (a), the left-most curve in (b), and the middle curve in (c). As training progresses, the function is shifted using the parameter t in (5) and (7), causing samples with higher complexity to be assigned higher confidence in (a), samples with lower complexity to be assigned higher confidence in (b), and samples with medium complexity to be assigned higher confidence in (c).
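The shifting-sigmoid confidence assignment described in this caption can be sketched as follows; the slope k and the exact parameterization are illustrative assumptions, not the paper's equations (5) and (7):

```python
import math

def confidence(complexity, t, mode):
    """Confidence of a sample given its complexity in [0, 1] and training
    progress t in [0, 1]. `mode` selects the curve family of Figure 2.
    The slope k and the shift parameterization are assumptions."""
    k = 10.0  # assumed sigmoid slope
    if mode == "a":
        # Curve shifts left as t grows: higher-complexity samples
        # progressively receive higher confidence.
        return 1.0 / (1.0 + math.exp(-k * (complexity - (1.0 - t))))
    if mode == "b":
        # Mirror image: lower-complexity samples receive higher confidence.
        return 1.0 / (1.0 + math.exp(-k * ((1.0 - complexity) - (1.0 - t))))
    # Mode "c": medium-complexity samples receive the highest confidence.
    return 1.0 / (1.0 + math.exp(-k * (t - abs(complexity - 0.5))))
```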
(a) Pair-wise correlation between indices (b) Clustered correlation matrix.

Figure 3: Eliminating redundancy in linguistic indices. (a) shows the Pearson correlation coefficient between each pair of linguistic indices. (b) is created by reordering the rows and columns of (a) such that mutually correlated indices are clustered into blocks using hierarchical clustering (Kumar et al., 2000). Best seen in color; lighter areas indicate greater correlation among index pairs or groups.

Figure 4: The progression of the estimated importance factors ρ and balanced accuracy for groups of linguistic indices.
Algorithm 1: Correlation Method. Require: D_train, D_val, model Θ, optimizer g, loss function.

This approach uses the easiest subset of the dataset at each stage of training: instead of randomly sampling a mini-batch from the whole dataset, a custom data sampler provides the subset consisting of the easiest α% of data when training progress is at α%.
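The easiest-α% sampler described above can be sketched as below; function names are illustrative:

```python
import random

def easy_subset(scores, progress):
    """Indices of the easiest `progress` fraction of samples, where
    `scores[i]` is the difficulty of sample i and `progress` is the
    training progress alpha in (0, 1]."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = max(1, int(len(order) * progress))
    return order[:n]

def sample_batch(scores, progress, batch_size, rng=random):
    """Draw a mini-batch from the easiest alpha% of the data, rather
    than from the whole dataset."""
    pool = easy_subset(scores, progress)
    return [rng.choice(pool) for _ in range(batch_size)]
```

At 50% training progress, only the easier half of the data is eligible for sampling; by 100% progress, the full dataset is in play.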

Table 2: Top three important linguistic indices at each stage of learning. For datasets with a premise (P) and hypothesis (H), the relevant input segment is indicated in parentheses.

Table 7: Top three moving linguistic indices, i.e., those with the largest change in importance between training stages.