Methods for Estimating and Improving Robustness of Language Models

Despite their outstanding performance, large language models (LLMs) suffer from notorious flaws related to their preference for shallow textual relations over the full semantic complexity of the problem. This proposal traces these flaws to a common denominator: the models' weak ability to generalise outside of the training domain. We survey diverse research directions that provide estimates of a model's generalisation ability and find that incorporating some of these measures into the training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions for enhancing the robustness of LLMs.


Introduction
The advances in language processing observed in recent years, mostly led by large language models (LLMs) based on the transformer architecture (Vaswani et al., 2017), have attracted deserved attention from the scientific community. Studies conclude that LLMs fine-tuned for a specific task can match, or even exceed, human accuracy on complex tasks such as question answering (Rajpurkar et al., 2016), paraphrase identification (Bowman et al., 2015), machine translation (Bahdanau et al., 2016) and others.
In contrast, critical studies demonstrate that many of the models reaching state-of-the-art results on a given task perform poorly on data sets drawn from different distribution(s). The reasons vary and include training data set biases, such as spurious linguistic correlations (McCoy et al., 2019), and differences in text style or typos (Belinkov and Bisk, 2018), where a broad preference of LLMs for fitting non-representative, yet easy-to-learn surface-level relations causes them to under-perform even shallow networks (Bojanowski et al., 2016). A lack of generalisation can also have procedural causes, such as instability of the training process, which leads to convergence to local minima of differing generalisation quality (McCoy et al., 2020). Low robustness of the resulting model towards out-of-distribution (OOD) samples limits its practical usability to samples drawn from the training distribution, which is often impossible to ensure.
Although complex language models give the impression of a black box, an extensive branch of research has demonstrated that the internal representations of LLMs correspond well to human taxonomy in terms of morphological and syntactic decomposition (Clark et al., 2019a), and that the depth of the internal representation correlates well with the complexity of the problem as perceived by humans (Tenney et al., 2019).
The reported agility supports the central presumption of this proposal: that LLMs can avoid the problems mentioned above under additional regularisation. We argue that such regularisation could also strengthen the implicit property of LLMs to learn compositional language features and thus enhance the interpretability of their decision-making.
In this proposal, we survey literature from the broader area of neural networks for the factors behind better generalisation of neural models. We find that many measures reported to correlate well with a model's OOD performance can also enhance its generalisation when utilised within the training objective, as regularisers or as additional components of the training cost function. Inspired by this finding, this proposal outlines a path towards identifying and utilising generalisation measures to enhance the robustness of LLMs towards distribution shift.
We formulate two research questions:
RQ1: "Can we estimate the performance of LLMs on OOD data without collecting annotated data or expert feedback?"
RQ2: "Can we adjust the process of training LLMs to perform better on OOD samples?"
In Section 2.1, we survey the studies aiming to estimate the robustness of neural models, with no restrictions on the domain of application. Subsequently, in Section 2.2, we survey the training techniques reported to enhance the robustness of the trained model. Based on these findings, in Section 3 we identify promising directions and the respective challenges specific to estimating (§3.1) and enhancing (§3.2) the robustness of LLMs.

Applicability
This proposal grounds the notion of model generalisation in the model's ability to perform well on samples drawn from distributions different from the training distribution (OOD). In this context, the term distribution, used interchangeably with domain, is commonly characterised by a specific shared property, such as topic, style, genre, or linguistic register (Ramponi and Plank, 2020).
This proposal focuses on distributional robustness in two branches of applications of current LLMs: generative tasks, where the problem is to generate a sequence of tokens, and discriminative tasks, where the task is to infer a discrete decision for each token or a sequence of tokens. Generative tasks include summarization, dialogue generation or machine translation, while discriminative tasks include classification, extractive question answering or named entity recognition.
In both cases, we propose to estimate the impact of a given adjustment on model generalisation by measuring the difference in the model's performance on a set of distinct OOD domains. We note that such an estimate is still only a pointwise estimate of model generalisation, as some properties of the domains drawn for evaluation remain uncontrolled.
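The comparison described above can be sketched as follows. This is a minimal illustration, not a prescribed protocol: the domain names, predictions and labels are made up, and accuracy stands in for whichever task metric is appropriate.

```python
from statistics import mean

def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def robustness_gain(baseline_preds, adjusted_preds, labels_by_domain):
    """Per-domain accuracy difference (adjusted minus baseline) and its mean."""
    gains = {
        domain: accuracy(adjusted_preds[domain], labels)
        - accuracy(baseline_preds[domain], labels)
        for domain, labels in labels_by_domain.items()
    }
    return gains, mean(list(gains.values()))

# Hypothetical OOD domains with toy labels and predictions (assumptions).
labels = {"news": [1, 0, 1, 1], "subtitles": [0, 0, 1, 1]}
baseline = {"news": [1, 0, 0, 0], "subtitles": [0, 1, 1, 0]}
adjusted = {"news": [1, 0, 1, 0], "subtitles": [0, 0, 1, 0]}

gains, average_gain = robustness_gain(baseline, adjusted, labels)
```

Reporting the per-domain gains alongside their mean keeps visible the cases where an adjustment helps on one domain but hurts on another.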

Estimating Model Robustness (RQ1)
Given a set of true labels for a set of OOD samples X_t from target domain(s) D_t, the robustness of the model M can be estimated using standard quality measures, such as accuracy. This raises questions about the representativeness of the draw of X_t: do the samples cover all the domains of application of M, and are these domains accurately weighted in the evaluation?
The problem is circumvented by generalisation measures based on latent properties of M that do not require any labelled data from D_t. However, such an approach might come at the price of accuracy: according to Jiang et al. (2020), the Spearman's rank correlation of any unsupervised measure with out-of-distribution accuracy does not exceed 0.5 on average. The accuracy of the estimator improves with supervised approaches (Štefánik et al., 2021), but these already require some labelled data.
The situation presents a common dilemma in robustness evaluation: ground-truth evaluation must involve a representative selection of test data. This problem can be avoided with unsupervised estimates based on model properties, but such proxies are burdened by a certain level of inaccuracy. In the following sections, we review the measures introduced directly for evaluating model generalisation (§2.1.1) and those for estimating a model's expected output quality (§2.1.2), which are more commonly used in NLP.
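For reference, the rank correlation between a generalisation measure and OOD accuracy can be computed as in the sketch below. The measure values and accuracies are invented for illustration; the implementation is a plain Spearman coefficient (Pearson correlation of average ranks) and is not taken from the cited studies.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up values for five hypothetical models (assumptions, not reported results).
sharpness_values = [0.9, 0.4, 0.7, 0.2, 0.5]
ood_accuracies = [0.55, 0.71, 0.60, 0.80, 0.74]
rho = spearman(sharpness_values, ood_accuracies)
```

A strongly negative coefficient in such a toy setting would indicate that models with lower values of the measure tend to rank higher in OOD accuracy.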

Generalisation Measures
Traditionally, the ability of neural networks to generalise was related to measures of their capacity, where a lower capacity might imply a smaller generalisation gap (Jiang et al., 2020), i.e. a smaller drop in performance under distribution shift. The capacity can be quantified in terms of complexity given by the number of model parameters, expressive power, or other properties. A standard example of such a measure is the degree of a polynomial: the higher the degree, the better the fit, but at the price of a loss in generalisation. This group of measures is referred to as Vapnik-Chervonenkis dimension (VC-dimension), introduced by Vapnik (1999).
A large body of work aims to find such VC-dimensions that correspond well with OOD performance even with modern, over-parametrised networks. For instance, norm-based approaches (Neyshabur et al., 2015b) propose to use the p-norms used in regularisation of the training as the anchor value of generalisation and support this in theory by connecting such a measure with a limitation of network capacity. Bartlett et al. (2017) conclude that a spectral complexity measure, inferred from the eigenvalues of a matrix of the network weights, can be used as one such complexity measure.
A collateral line of work, starting with Shawe-Taylor et al. (1998), shows that generalisation bounds, denoting a range of expected performance of the given model on an arbitrary test set, can be provably associated with VC-bounds. Harvey et al. (2017) show that the tightness of such bounds for a linear subset of networks can be theoretically found. Furthermore, Dziugaite and Roy (2017) propose a method to optimise PAC-Bayesian bounds, optimising the model for bounds that are as tight as possible.
Despite these proofs, error bounds based on VC-dimensions remain vacuous in practice (Dziugaite and Roy, 2017; Jiang et al., 2020): such estimates of OOD performance are too wide to be used in practice. Additionally, it is now widely observed (Novak et al., 2018; Neyshabur et al., 2015a) that, in practice, the effect of over-parametrisation contrasts with traditional VC-dimension theory and that, in multiple cases, over-parametrisation leads to better reported generalisation (Neyshabur et al., 2019).
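As a rough illustration of norm-based and spectral quantities, the sketch below computes the product of per-layer Frobenius norms and the product of per-layer spectral norms for a toy network. It is a simplification under our own assumptions and does not reproduce the exact measures of Neyshabur et al. (2015b) or Bartlett et al. (2017); the weight matrices are invented.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(w * w for row in matrix for w in row))

def spectral_norm(matrix, iterations=200):
    """Largest singular value, approximated by power iteration on W^T W.

    Starts from the all-ones vector, which is assumed not to be orthogonal
    to the leading right singular vector.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        length = math.sqrt(sum(x * x for x in v))
        if length == 0.0:
            return 0.0
        v = [x / length for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))

def norm_product(layers, norm):
    """Product of a given matrix norm over all layers of the network."""
    result = 1.0
    for weights in layers:
        result *= norm(weights)
    return result

# Toy two-layer network with hand-picked weights (illustrative assumption).
layers = [[[3.0, 0.0], [0.0, 4.0]], [[1.0, 0.0], [0.0, 2.0]]]
frobenius_capacity = norm_product(layers, frobenius_norm)
spectral_capacity = norm_product(layers, spectral_norm)
```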
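The vacuity of such bounds can be illustrated with the complexity term of one classic form of the VC bound, sqrt((d (ln(2n/d) + 1) + ln(4/delta)) / n), where d is the VC-dimension and n the number of training samples. This textbook form is our choice for illustration and is not the specific bound of any study cited above; the values of d, n and delta are assumptions.

```python
import math

def vc_gap_bound(vc_dim, n_samples, delta=0.05):
    """Complexity term of a classic VC bound (meaningful only for vc_dim < n_samples).

    With probability at least 1 - delta:
        test error <= train error + vc_gap_bound(vc_dim, n_samples, delta)
    """
    numerator = vc_dim * (math.log(2 * n_samples / vc_dim) + 1) + math.log(4 / delta)
    return math.sqrt(numerator / n_samples)

# Illustrative values only: a low-capacity and a high-capacity hypothesis class
# evaluated with the same number of training samples.
small_capacity_gap = vc_gap_bound(vc_dim=10, n_samples=10_000)
large_capacity_gap = vc_gap_bound(vc_dim=5_000, n_samples=10_000)
```

Once the term exceeds 1, the bound on an error rate carries no information, which is the situation that over-parametrised networks typically fall into.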
Existing work attempts to ground error bounds in the underlying causal model that describes the target domains of interest. Meinshausen (2018) introduces the notion of a structural equation model (SEM), defining the causal interventions consistent with a given world, and relates domain generalisation to the model's robustness to the interventions defined by such an SEM. Additionally, given that an SEM produces a class of distributions Q, a model M robust on Q is a causal inference model for Q, connecting distributional robustness to a weak form of causal inference (Dziugaite et al., 2021). Similarly, Bühlmann (2018) ascribes the ability of causal inference on Q to any model whose representation is invariant to any domain D ∈ Q and proposes a method for selecting a subset of invariant features from a given set of attributes.
Practical observations of errors suggest that empirical error bounds are in fact significantly tighter than what can be proven in theory. Dziugaite et al. (2021) locate all bounds between two extremes: theoretically supported, yet vacuous bounds of methods based solely on the model's properties (VC-bounds) or behaviour (PAC-Bayesian bounds), and empirical, yet strictly data- and model-dependent evaluation on sample set(s) X_t ∈ D_t.

Quality Estimation
Quality estimation (QE) measures predict model output quality in the absence of a ground-truth reference (Fomicheva et al., 2020). Although not commonly used in this manner, QE measures also reflect model robustness, making this branch of research applicable to OOD performance estimation (RQ1). A significant line of work grounds quality estimation in model confidence, which can be estimated using Bayesian networks (Mackay, 1992), where the standard scalar weights of the network are replaced with random variables, modelling the output distribution. This approach is accurate but not computationally feasible for larger networks. A branch of work approximates parametric distributions (Graves, 2011; Tran et al., 2019), making such uncertainty estimation practically feasible.
Model uncertainty can also be computed by ensembling variations of a given model over multiple trials, commonly referred to as Monte Carlo (MC) methods. Monte Carlo dropout (Gal and Ghahramani, 2016) applies random dropout at inference time over multiple inference trials, yielding an estimate of the distribution of the network output, from which the uncertainty is approximated. Lee et al. (2015) build such ensembles of estimators using bagging, i.e. training the ensembled models on different training subsets.
Model-variational methods fit well into the central PAC-Bayesian theory (Valiant, 1984), stating that if the error of a classifier can be bounded, then the performance of an ensemble of such classifiers can also be bounded from above with an arbitrarily small bound (Guedj, 2019).
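The idea of Monte Carlo dropout can be sketched on a toy logistic model: dropout stays active at inference, and the spread of the sampled outputs serves as an uncertainty estimate. The weights, features, dropout probability and number of passes below are illustrative assumptions rather than settings from Gal and Ghahramani (2016).

```python
import math
import random
from statistics import mean, pstdev

def predict_with_dropout(weights, features, drop_prob, rng):
    """One stochastic forward pass of a toy logistic model with inverted dropout."""
    keep_prob = 1.0 - drop_prob
    logit = sum(
        w * x / keep_prob                  # rescale kept units (inverted dropout)
        for w, x in zip(weights, features)
        if rng.random() < keep_prob        # keep each input with probability keep_prob
    )
    return 1.0 / (1.0 + math.exp(-logit))

def mc_dropout(weights, features, drop_prob=0.5, passes=200, seed=0):
    """Mean prediction and its standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [
        predict_with_dropout(weights, features, drop_prob, rng)
        for _ in range(passes)
    ]
    return mean(samples), pstdev(samples)

# Hypothetical model weights and a single input example.
toy_weights = [2.0, -1.0, 0.5]
toy_features = [1.0, 1.0, 1.0]
mean_prob, uncertainty = mc_dropout(toy_weights, toy_features)
```

With `drop_prob=0.0` every pass is identical and the estimated uncertainty collapses to zero, which is a quick sanity check of the procedure.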
Confidence estimation can also be used to enhance model robustness, with prediction confidence serving as a regulariser of the main objective: in augmentation (Szegedy et al., 2014), confidence calibration (Gong et al., 2021), or training for consistency (Xie et al., 2019). Jiang et al. (2020) propose to measure the regularisation decay of the weights, together with a measure of sharpness, which reflects the amount of change in the model's evaluation when the parameters in a limited neighbourhood of the learnt minimum are perturbed (Keskar et al., 2017). Another measure they introduce reflects the variance of gradients measured on the training set after the first training iteration. Their work is the first large-scale study evaluating the correlation of selected generalisation measures with true OOD performance, and it concludes that the sharpness and gradient-based measures correlate most strongly with the measured OOD performance. Subsequently, Dziugaite et al. (2021) support these findings, identifying sharpness-based and PAC-Bayesian measures as the best-correlated under a similar methodology.
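A sampling-based approximation of a sharpness measure can be sketched as follows: the parameters at a minimum are randomly perturbed within a small box, and the largest observed increase of the loss is reported. This is a simplified stand-in for the measures discussed above, which are defined via a maximisation rather than random sampling; the two quadratic losses, the radius and the number of trials are assumptions.

```python
import random

def sharpness(loss_fn, params, radius=0.1, trials=500, seed=0):
    """Largest sampled loss increase within a box of half-width `radius` around params."""
    rng = random.Random(seed)
    base_loss = loss_fn(params)
    worst_increase = 0.0
    for _ in range(trials):
        perturbed = [p + rng.uniform(-radius, radius) for p in params]
        worst_increase = max(worst_increase, loss_fn(perturbed) - base_loss)
    return worst_increase

def sharp_loss(w):
    """Toy loss with a narrow (sharp) minimum at the origin."""
    return 10.0 * sum(x * x for x in w)

def flat_loss(w):
    """Toy loss with a wide (flat) minimum at the origin."""
    return 0.1 * sum(x * x for x in w)

minimum = [0.0, 0.0]
sharp_value = sharpness(sharp_loss, minimum)
flat_value = sharpness(flat_loss, minimum)
```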

An important application of QE techniques lies in neural machine translation, where avoiding critical errors in translation remains an open problem. Such errors distort the meaning of the translation in a way that may carry health, safety, legal or other implications (Specia et al., 2021). Kim et al. (2017) train a token-level estimator of machine translation output quality concurrently with the neural translation model. Fomicheva et al. (2020) additionally propose to predict output quality from the entropy of the attention activations of a transformer model, but they find this approach no more accurate than the one based on simple output entropy (Kim et al., 2017) or the MC dropout method.
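An output-entropy signal of the kind mentioned above can be sketched as the mean entropy of the per-token output distributions of a generated sequence. The distributions below are invented, and the function is a generic illustration rather than the estimator of any cited work.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one output distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_sequence_entropy(token_distributions):
    """Average per-token entropy of a generated sequence; higher means less confident."""
    return sum(token_entropy(d) for d in token_distributions) / len(token_distributions)

# Toy distributions over a four-word vocabulary for a three-token output (assumptions).
confident_output = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain_output = [[0.25, 0.25, 0.25, 0.25]] * 3

confident_score = mean_sequence_entropy(confident_output)
uncertain_score = mean_sequence_entropy(uncertain_output)
```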

Training Robust Models (RQ2)
The problem of training a model that performs well on out-of-distribution (OOD) samples can be found in the literature under the terms out-of-distribution generalisation, domain generalisation (Gong et al., 2021), distributional robustness (Meinshausen, 2018), or simply generalisation (Foret et al., 2021). The variety of terminology points to the fact that the standards in this branch of research are not yet clearly set.
Despite the imperfect correlations of generalisation measures with measured OOD performance, we find these measures already incorporated in novel training objectives that reach attractive enhancements of model robustness. Neyshabur et al. (2015b) investigate the impact of incorporating norm-based measures into the loss, obtaining generalisation guarantees of the ℓ2-norm. Foret et al. (2021) enrich the cross-entropy loss with a complementary component reflecting the sharpness of the local optimum, based on a difference to local . Keskar et al. (2017) also demonstrate that the sharpness of the objective's optima corresponds to the model's robustness, and that flatter optima can also be reached by adding noise to the update steps through a smaller training batch size.
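A sharpness-aware update in the spirit of Foret et al. (2021) can be sketched as a two-step procedure: first move to an approximately worst-case point within a radius rho along the normalised gradient, then update the original parameters with the gradient taken at that point. The two-step form reflects our reading of that work; the toy quadratic loss, learning rate and rho are illustrative assumptions.

```python
import math

def sam_step(params, grad_fn, lr=0.02, rho=0.05):
    """One sharpness-aware update using the gradient at an adversarially shifted point."""
    grad = grad_fn(params)
    norm = math.sqrt(sum(g * g for g in grad))
    scale = rho / norm if norm > 0 else 0.0
    # Approximate worst-case point within distance rho of the current parameters.
    shifted = [p + scale * g for p, g in zip(params, grad)]
    sharp_grad = grad_fn(shifted)
    # Descend from the original parameters using the gradient at the shifted point.
    return [p - lr * g for p, g in zip(params, sharp_grad)]

def toy_loss(w):
    """Toy quadratic loss with different curvature along each axis."""
    return w[0] ** 2 + 10.0 * w[1] ** 2

def toy_grad(w):
    """Analytic gradient of toy_loss."""
    return [2.0 * w[0], 20.0 * w[1]]

weights = [1.0, 1.0]
initial_loss = toy_loss(weights)
for _ in range(200):
    weights = sam_step(weights, toy_grad)
final_loss = toy_loss(weights)
```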
Objective adjustments creatively utilising PAC-Bayesian measures also confirm the reported correspondence of these measures to generalisation. Hinton (2002) proposes a Product of Experts (PoE) framework where an ensemble of identical shallow estimators eliminates model-specific biases in a product of the ensembled outputs, resulting in superior OOD performance. Sanh et al. (2021) show an application of PoE that eliminates systematic biases on adversarial NLI data sets. Dagaev et al. (2021) adopt a similar approach in debiasing image classification from heuristic shortcuts. Utama et al. (2020) eliminate model reliance on domain-specific attributes in a two-step process: identifying the biased samples by model over-confidence, and subsequently down-weighting them.
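The combination step of a Product of Experts can be sketched as multiplying the class probabilities of the experts and renormalising. The example pairs a main model with a hypothetical bias-only model, a set-up we assume for illustration of PoE-based debiasing; the probabilities are invented.

```python
import math

def product_of_experts(*expert_probs):
    """Multiply the experts' class probabilities and renormalise to sum to 1.

    Assumes all probabilities are strictly positive.
    """
    n_classes = len(expert_probs[0])
    log_scores = [sum(math.log(p[c]) for p in expert_probs) for c in range(n_classes)]
    top = max(log_scores)                      # subtract the maximum for numerical stability
    unnormalised = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def cross_entropy(probs, gold):
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold])

# Hypothetical outputs for one training example whose gold class is 0.
main_probs = [0.6, 0.4]   # main model
bias_probs = [0.9, 0.1]   # bias-only model, confident thanks to a shortcut
combined = product_of_experts(main_probs, bias_probs)

loss_main_only = cross_entropy(main_probs, gold=0)
loss_combined = cross_entropy(combined, gold=0)
```

When the bias-only expert already predicts the gold class confidently, the combined loss is small, so such an example contributes less to the training signal of the main model.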
Rather than encouraging specific model features, others have investigated the impact of specific training strategies, which becomes particularly relevant in multi-step training strategies of LLMs. Wang and Sennrich (2020) Wu et al. (2020) find that addressing multiple biases at once can enhance OOD generalisation, although they draw this conclusion from a single domain.
A different branch of work attempts to enhance robustness through training strategies that exploit knowledge of the domain distinction. Gong et al. (2021) propose to approximately cover the class of all possible target domains D_t by source domains D_s and to learn a calibration of output probabilities from D_s that allows samples of a new target domain D_t to be associated with some known D_s. Others propose to use an adversarial framework, learning a final-layer representation that is indistinguishable across domains.

Research Proposal
Following the referenced studies on the evaluation and enhancement of the generalisation of neural models, this section outlines directions in measuring and improving the robustness of LLMs.

Estimating Model Robustness (RQ1)
Recently, measures of the generalisation of neural networks have attracted increasing attention (Jiang et al., 2020; Dziugaite et al., 2021). However, none of the referenced studies evaluates the measures on LLMs. Especially within the standard pre-training + fine-tuning framework of modern NLP applications, the quality of the measures might differ from the experiments on relatively small convolutional networks trained for image classification from scratch.
Hence, we first focus on evaluating the established generalisation measures, such as the ones based on spectral complexity, variance of gradients or sharpness, in the case of pre-trained LLMs. A major challenge is to scale such experiments to a representative evaluation framework covering a broad set of tasks, domains, and model types. For instance, other training parameters will likely impact the metrics' quality; such covariates will have to be identified and controlled. However, even an extensive evaluation will likely fail to identify some of these covariates; for this reason, we will limit the scope of our results to the estimation and enhancement of robustness with respect to the enumerated covariates, even though this contrasts with the methodology of previous work.
We will give preference to generalisation measures that correspond to linguistic and semantic language properties, as the practical deployment of such measures in evaluation also addresses the desire to enhance the interpretability of the LLMs' behaviour. Examples of linguistically motivated measures are the largest common ancestor between the parse trees of the reference and the hypothesis of a generative model, or the coherence of the output of a discriminative model when a negation is introduced in the input.
In the evaluation of the robustness of generative LLMs, we will prioritise token-level measures over conventional segment-level ones such as BLEU, as incorporating accurate token-level measures in training objectives could complement the classic token-level cross-entropy loss of the sequence-to-sequence objective, with its specific flaws such as exposure bias (Wang and Sennrich, 2020).
The evaluation methodology will closely follow that of Dziugaite et al. (2021), which assesses the correlation of a measure with the measured OOD performance. If these measures reach high correlations, they might be applied directly in training regularisation or model selection. Even measures that do not reach a high correlation can still bear the potential to improve model robustness (Foret et al., 2021).

Training Robust Models (RQ2)
Following the referenced examples of adjusting training objectives with accurate generalisation measures (§2.2), e.g. norm-based measures (Neyshabur et al., 2015b), PAC-Bayesian measures (Sanh et al., 2021; Dagaev et al., 2021; Utama et al., 2020), or the sharpness measure (Foret et al., 2021), we will use the accurate generalisation measures of LLMs (§3.1) as regularisers and complementary objectives of the training. Locatello et al. (2019) theoretically prove that full distributional robustness is not possible without an explicit exposition of both the data and the model biases. Recently, Bengio et al. (2020) theoretically and empirically demonstrated that a model could utilise data biases to expose the underlying causal structure of the data in an experiment where such a structure is known in advance.
We will introduce training objectives that expose domain-specific data biases to the model in more explicit ways. The most direct approach is to complement the task-specific objective with another objective of distinguishing the domain(s) of origin. The domain-distinctive objective can take the form of a binary classifier or a similarity loss of selected model representations (e.g. KL-divergence (Kullback and Leibler, 1951)).
We will investigate the impact of the pre-training and fine-tuning objectives on the model's eventual robustness over multiple application tasks, domains and architectures, in a methodology similar to the evaluation of generalisation measures by Dziugaite et al. (2021).
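One possible shape of such a complementary objective is sketched below: a KL-divergence between normalised representations of two domains is added to the task loss with a weighting factor. Whether the term is used as a penalty or a reward, how the representations are normalised into distributions, and the weight itself are all design choices that we assume here purely for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(task_loss, repr_domain_a, repr_domain_b, weight=0.1):
    """Task loss plus a weighted divergence between two domains' representations."""
    return task_loss + weight * kl_divergence(repr_domain_a, repr_domain_b)

# Hypothetical softmax-normalised representations of two domains.
news_repr = [0.7, 0.2, 0.1]
subtitles_repr = [0.4, 0.4, 0.2]

divergence = kl_divergence(news_repr, subtitles_repr)
total_loss = combined_loss(task_loss=1.25, repr_domain_a=news_repr,
                           repr_domain_b=subtitles_repr)
```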
Additionally, we will replace or complement the objectives of generative LLMs with token-level measures well-correlated with the OOD performance and compare the resulting models with computationally-expensive sentence-level objectives optimising the measures such as BLEU as their objectives.
In the case of discriminative models, we will evaluate robustness to surface-level heuristics using adversarial datasets such as HANS (McCoy et al., 2019) or PAWS (Zhang et al., 2019), designed to expose the commonly learnt biases of LLMs. For generative LLMs, we will evaluate the performance of the model on domain(s) different from the training domain; for instance, we will train a translation model on a parallel corpus of subtitles and evaluate it on the domain of news articles. We will also evaluate the trained model(s) for their inclination to critical errors, measured as the probability of generating a translation containing a severe error (Specia et al., 2021) in enforced generation.

Conclusion
Our work outlines potential directions in enhancing the distributional robustness of LLMs to mitigate a performance drop under distribution shift. We survey and identify accurate generalisation measures (§2.1) and find multiple studies demonstrating that the utilisation of these measures in the training objectives positively impacts model robustness (§2.2).
Following this observation, we propose to identify the generalisation measures best suited to LLMs (§3.1) and outline ways to utilise these measures in the training process. Additionally, we identify a set of other methods reported to enhance the OOD performance of LLMs that we propose to compare against in the outlined methodology for evaluating generalisation measures.
Similarly, we propose methodologies for robustness estimation of both generative and discriminative LLMs (§3.2); these methodologies are based on assessing the model's quality on the domains covered by an explicitly enclosed set of perturbations or adversarial biases.