Modeling the Q-Diversity in a Min-max Play Game for Robust Optimization

Models trained with empirical risk minimization (ERM) have been shown to rely easily on spurious correlations, resulting in poor generalization. Group distributionally robust optimization (group DRO) can alleviate this problem by minimizing the worst-case loss over pre-defined groups. While promising, in practice factors like expensive annotations and privacy preclude the availability of group labels. More crucially, when taking a closer look at the failure modes of out-of-distribution generalization, the typical procedure of reweighting in group DRO loses efficiency. Motivated by these limitations, we reformulate the group DRO framework by proposing Q-Diversity. Characterized by an interactive training mode, Q-Diversity relaxes the group identification from annotation into direct parameterization. Furthermore, a novel mixing strategy across groups is presented to diversify the under-represented groups. In a series of experiments on both synthetic and real-world text classification tasks, results demonstrate that Q-Diversity can consistently improve worst-case accuracy under different distributional shifts, outperforming state-of-the-art alternatives.


Introduction
Deep learning models trained with empirical risk minimization (ERM) often exhibit drops in accuracy when confronted with data from domains that are under-represented in their training data (Arjovsky et al., 2019;Creager et al., 2021). Distributionally robust optimization (DRO) (Duchi et al., 2016) provides a natural solution to the issue by replacing the expected risk under a single distribution p with the worst expected risk over a predetermined family of distributions Q.
However, in DRO, considering that direct gradient descent is hard to apply (Hu et al., 2018), how to model and optimize over Q poses a key challenge. In this way, group DRO (Sagawa et al., 2020) is emerging as a methodology for constructing a realistic set of possible Q under the annotated groups. Crucially, robust optimization over worst groups has become an active area of research. Our code and data are available at https://github.com/CuteyThyme/Q-Diversity.git.
In general, the practical usage of group DRO requires that group identities be fully known. It can then model Q by upweighting or downweighting the average loss of different groups through the course of training. Nevertheless, a key obstacle is that the under-represented groups are often unlabeled, or even unidentified. This makes even detecting such performance gaps, let alone mitigating them, a challenging problem. What is worse, without group labels it becomes infeasible to compute the worst-group loss, so the modeling of Q cannot be established. Although some unsupervised DRO methods for worst-group optimization have been proposed (Liu et al., 2021), their concentration on optimizing high-loss groups may discard a considerable portion of the samples, adversely impacting the overall accuracy.
To address this critical challenge of the current group DRO framework, we present Q-Diversity, a novel unsupervised method for worst-group optimization. To realize group identification without any annotations, we propose to parameterize a classifier as the group assigner that produces the group labels. In particular, by alternately training the group assigner and the final class predictor, we formalize an interactive training mode that makes the identification procedure feasible. Intriguingly, we can treat the classification loss from the predictor as direct supervision that guides the assigner toward better group labeling. With well-estimated groups, the predictor can in turn perform better on the worst group. Once the pseudo-labeled groups are obtained, the typical procedure is to model Q by reweighting the training losses of different groups. Nevertheless, in theory, we point out that simple reweighting cannot handle OOD failure modes, as more diversified samples are needed. Based on these findings, we further propose a novel mixing strategy across groups to diversify the under-performing groups.
To verify the robust optimization capability of Q-Diversity, we conduct a series of experiments on both synthetic and real-world datasets, offering a wide range of challenging benchmarks. All the empirical results show our method not only outperforms other strong group DRO strategies by a large margin, but also achieves consistent improvements on different OOD test sets. Compared to these optimization methods, whether supervised or unsupervised, Q-Diversity shows great superiority with high efficiency. Altogether, our contributions can be summarized as follows:
• Methodological Innovations: In Section 3, we propose Q-Diversity, a group-unlabeled approach that aims to improve the utility for the worst case. Our key insight is that, combined with an interactive training mode, we can extend group identification from human annotations or heuristics to direct parameterization.
• Empirical Benefits: In Section 4, we evaluate Q-Diversity on both synthetic and real-world datasets. Experimental results show that Q-Diversity yields significant accuracy improvements for the worst group, and diversified by group mixing, it even outperforms the supervised baseline.
• Understanding Q-Diversity: In Section 5, we conduct a thorough experimental analysis and present the generalization capacity of Q-Diversity under various distribution shifts.

Problem Setup
We consider the typical text classification problem of predicting labels y ∈ Y from input texts x ∈ X , and training data D is assumed to be drawn from the joint distribution P (X , Y).

Distributionally Robust Optimization
ERM Principle. Given a model family Θ and a loss function ℓ : Θ × X × Y → R+, the standard goal of empirical risk minimization is to find a model θ ∈ Θ that minimizes the expected loss over the empirical distribution P̂ drawn i.i.d. from P:

$$\hat{\theta}_{\mathrm{ERM}} := \arg\min_{\theta \in \Theta} \mathbb{E}_{(x, y) \sim \hat{P}}\left[\ell(\theta; (x, y))\right] \tag{1}$$

When encountering data sampled from a distribution different from P, model performance suffers significantly. Under these circumstances, distributionally robust optimization (Duchi et al., 2016) provides a natural solution by minimizing the worst-case expected risk under a pre-determined family of distributions Q, called the uncertainty set:

$$\min_{\theta \in \Theta} \left\{ \mathcal{R}(\theta) := \sup_{Q \in \mathcal{Q}} \mathbb{E}_{(x, y) \sim Q}\left[\ell(\theta; (x, y))\right] \right\} \tag{2}$$

The uncertainty set Q needs to encode a wide set of distributional shifts to improve model robustness. However, prior knowledge of possible test distributions is hard to acquire, leaving the uncertainty set either not representative or too pessimistic to learn (Hu et al., 2018). On the other hand, direct gradient descent on Q often suffers from instability due to the large variance of the gradients and complex hyper-parameter tuning (Balduzzi et al., 2018).

Practical Group DRO
To overcome these challenges in robust optimization, Sagawa et al. (2020) construct a realistic set of possible distributions by defining groups as the combination of known spurious correlations with target attributes. Taking the MultiNLI dataset as an example, with the known negation attribute spuriously correlated with the label contradiction, we can partition the dataset into groups of {negation, no negation} × {contradiction, entailment, neutral}. By translating the training distribution P into a mixture of m groups P_g, the objective of group DRO can be formulated as a minimization of the empirical worst-group risk over m groups:

$$\hat{\theta}_{\mathrm{DRO}} := \arg\min_{\theta \in \Theta} \left\{ \hat{\mathcal{R}}(\theta) := \max_{g \in \{1, \dots, m\}} \mathbb{E}_{(x, y) \sim \hat{P}_g}\left[\ell(\theta; (x, y))\right] \right\} \tag{3}$$

where each group P̂_g is an empirical distribution over the training data. Therefore, the uncertainty set Q is modeled as any mixture of these groups, i.e., $\mathcal{Q} := \{\sum_{g=1}^{m} q_g P_g\}$.
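To make the difference between the average (ERM) objective and the worst-group objective concrete, here is a minimal sketch. The per-example losses and group ids are made-up toy values, and the function names are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict


def group_risks(losses, groups):
    """Mean loss per group, returned as {group_id: mean_loss}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for loss, g in zip(losses, groups):
        totals[g] += loss
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}


def average_risk(losses):
    """ERM-style objective: mean loss over all training examples."""
    return sum(losses) / len(losses)


def worst_group_risk(losses, groups):
    """Group DRO objective: the largest per-group mean loss."""
    return max(group_risks(losses, groups).values())


# Toy data (made up): group 0 is the majority, group 1 the minority.
losses = [0.1, 0.2, 0.1, 0.2, 2.0, 1.0]
groups = [0, 0, 0, 0, 1, 1]
print(average_risk(losses), worst_group_risk(losses, groups))
```

On this toy input the average risk is dominated by the low-loss majority group, while the worst-group risk exposes the minority group's much higher loss.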
Min-max Play Game. In practice, group DRO solves the above min-max objective as a zero-sum game between two players θ and q. Ideally, the player q can be viewed as the weighted distribution over the m groups that models the uncertainty set Q. At each training iteration, the player q is first reweighted based on the per-group classification loss. Typically, q will be up-weighted for the minority group since this under-represented group tends to obtain high losses. Afterward, by back-propagating the reweighted per-group loss, the player θ, i.e., the model parameters, is updated. Altogether, general group DRO is shaped as the two-stage framework of Equation 4, which alternates an update of q with an update of θ.

The Dark Side. Although the formulation of group DRO keeps the choice of uncertainty set Q exactly tractable, two main issues stand out in its step-by-step procedure. First and foremost, labeling the attributes of all examples to obtain the disjoint groups is prohibitive because of the costly human labor. Second, while intuitive, recent studies (Nguyen et al., 2021) on understanding OOD generalization have revealed that simple reweighting cannot handle the failure modes of distributional shifts. As Figure 1 depicts, because spurious correlations occur in most samples, group identification induces majority groups and minority groups. An ideal classifier based on invariant features has a larger classification margin on the minority group, since group imbalance leaves the closest minority point farther away than the closest majority point. However, an ERM classifier attempts to allocate a balanced margin to the two groups, resulting in a geometric skew that causes OOD generalization to fail. Crucially, Nguyen et al. (2021) point out that only upweighting or oversampling the minority group cannot address the geometric skew since it does not affect the number of unique data points.
To illustrate this phenomenon, we conduct a proof-of-concept experiment on the BiasedSST dataset.

Q-Diversity Modeling
Overview. We address the two limitations of group DRO above by proposing Q-Diversity. In our setup, we improve the classification accuracy of minority groups without explicit group annotations. The overall paradigm is depicted in Figure 3. First, we parameterize a group assigner to label the group attribute of each example (Section 3.1). With the emphasis on group diversity, a novel mixing strategy across the majority and minority groups is applied to relieve geometric skew (Section 3.2). We train the group assigner and the final class predictor interactively (Section 3.3), allowing them to guide each other toward better robust accuracy.

Parameterizing Assigner for Group Identification
The prerequisite for optimizing the worst group is to obtain well-defined groups. However, in real-world scenarios, group annotation for the input data (x, y) is almost inaccessible. Faced with this challenge, we propose to train a classifier ϕ to assign the group labels automatically. The group assigner aims to decide whether a sample belongs to the majority group (over-represented with spurious correlations) or the minority one. More formally, we denote the probability estimate of the assigner on the group attribute g as p̂(g | x, y). The assigned group labels ĝ = arg max_g p̂(g | x, y) can be viewed as a list of latent binary variables, where each ĝ ∈ {0, 1}.
Label Balance Regularization. To make the parameterization feasible, we should avoid degenerate solutions caused by label imbalance across the partition estimated by the Group Assigner. Theoretically and empirically, recent studies reveal that what makes existing group DRO methods sufficient for preventing spurious correlations is compliance with the label balance criterion (Chen et al., 2022). It states that no matter how disparate the group partition is, the predicted label proportion across these groups should be coherent. Adhering to this criterion, we regularize the decision of the Group Assigner with the objective in Equation 5, where KL is the Kullback-Leibler divergence. This regularization makes intuitive sense, as we would like to push the label marginals in the estimated majority group P(y | g = 1) and the minority group P(y | g = 0) close to the original label marginal P(y) in the training data D. Practically, we apply the Bayes rule to compute these conditional label marginals directly from the Assigner's decisions (Equation 6).
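The label balance regularizer described above can be sketched as follows. This is only one form consistent with the description, not necessarily the exact objective of Equation 5: it assumes the penalty is the sum of two KL terms in the direction KL(P(y | g) ‖ P(y)), and it estimates P(y | g) by accumulating the assigner's soft decisions per label. All function names are hypothetical.

```python
import math


def conditional_label_marginal(labels, group_probs, num_classes):
    """Estimate P(y | g) from soft assignments p(g | x_i, y_i) via Bayes' rule."""
    mass = [0.0] * num_classes
    for y, p in zip(labels, group_probs):
        mass[y] += p  # probability mass that label y contributes to group g
    total = sum(mass)
    if total == 0.0:
        raise ValueError("degenerate assignment: the group is empty")
    return [m / total for m in mass]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def label_balance_penalty(labels, p_majority, num_classes):
    """Assumed form: KL(P(y|g=1) || P(y)) + KL(P(y|g=0) || P(y))."""
    n = len(labels)
    prior = [labels.count(c) / n for c in range(num_classes)]
    p_minority = [1.0 - p for p in p_majority]
    p_y_g1 = conditional_label_marginal(labels, p_majority, num_classes)
    p_y_g0 = conditional_label_marginal(labels, p_minority, num_classes)
    return kl_divergence(p_y_g1, prior) + kl_divergence(p_y_g0, prior)


labels = [0, 0, 1, 1]
balanced = label_balance_penalty(labels, [0.5, 0.5, 0.5, 0.5], 2)
degenerate = label_balance_penalty(labels, [1.0, 1.0, 0.0, 0.0], 2)
print(balanced, degenerate)
```

In the toy usage, an assigner that splits every label evenly incurs almost no penalty, whereas an assigner that simply groups by label is penalized heavily, which is the degenerate solution the regularizer is meant to discourage.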

Reweighting Player q under Group Mixing
Assume that each sample (x, y) has been successfully assigned an estimated group attribute ĝ by the Group Assigner. Similar to supervised group DRO, we can partition the training data D into m groups G, where G+ and G− denote the majority and minority groups respectively. As illustrated in Section 2.3, only reweighting the player q is not effective in mitigating geometric skew. Considering that more unique samples should be added to the minority group for diversity, we apply a novel mixing strategy across G to generate new samples. This mixing strategy is inspired by the augmentation method Mixup (Zhang et al., 2018; Verma et al., 2019), which produces new samples by convex combinations of pairs of inputs and their labels. Following this idea, each time we uniformly sample two pairs (x_i, y_i), (x_j, y_j) from G, and the new sample is mixed as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j \tag{7}$$

where λ is the mixing ratio sampled from a Beta(α, α) distribution. Nonetheless, if directly applied, this uniform sampling will inevitably draw samples almost exclusively from the majority group. To ensure diversity is imposed on the minority group rather than the majority one, we require that (x_j, y_j) come from G−, that is, the estimated group attribute of (x_j, y_j) is g_j = 0. Therefore, we attain two kinds of group mixing: pairs with x_i from G+ and x_j from G−, and pairs with both samples from G−. Concerned that the spurious features may still be strongly correlated with the label after mixing, we modify the interpolation tactic of Equation 7. Concretely, when sampling λ, we always assign the larger weight to x_j from G− and the smaller weight to x_i, i.e., λ ← min(λ, 1 − λ).
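The group mixing step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released implementation: inputs are assumed to be fixed-length feature vectors (for text, for example sentence embeddings), labels are assumed to be one-hot lists, and `mix_pair` and `sample_group_mix` are hypothetical names.

```python
import random


def mix_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Interpolate two samples, giving the larger weight to (x_j, y_j)."""
    lam = rng.betavariate(alpha, alpha)
    lam = min(lam, 1.0 - lam)  # lam <= 0.5 weights x_i, (1 - lam) weights x_j
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


def sample_group_mix(features, onehot_labels, group_ids, alpha=1.0, seed=0):
    """Draw x_i uniformly from all groups and x_j only from the minority group (id 0)."""
    rng = random.Random(seed)
    minority = [k for k, g in enumerate(group_ids) if g == 0]
    if not minority:
        raise ValueError("no sample is assigned to the minority group")
    i = rng.randrange(len(features))  # uniform over G
    j = rng.choice(minority)          # restricted to G-
    return mix_pair(features[i], onehot_labels[i], features[j], onehot_labels[j], alpha, rng)


features = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
onehot_labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
group_ids = [1, 1, 0]  # only the last sample is in the estimated minority group
x_mix, y_mix, lam = sample_group_mix(features, onehot_labels, group_ids, alpha=1.0)
print(x_mix, y_mix, lam)
```

Because λ is clipped to at most 0.5, the mixed sample always stays at least as close to the minority sample as to its partner.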

Interactive Training for Robust Optimization
With the automatic group identification and mixing strategy, we can apply the algorithm of supervised group DRO to optimize the min-max play game in Equation 4. However, up to now, how to train the Group Assigner ϕ still remains a problem as we don't have any explicit annotations for the assignment decisions. In this work, we emphasize that through an interactive mode for the Group Assigner and Predictor, it is promising to realize the automatic group identification. Our intuition is that the majority group performance from the Predictor will drop if samples truly from the minority one are misclassified, and guided by this loss, the updated ϕ will re-assign the group labels. For clarity, we present a more vivid illustration shown in Figure 3. Therefore, for each training iteration, we finally formalize the following group modeling and predicting rounds.
Modeling Round. Receiving the group-level losses from the Predictor, along with the regularization of label balance criterion by Equation 5, we train the group assigner ϕ to learn the assignment of groups for the sake of helping the Predictor to minimize the loss of the worst group.
Predicting Round. When it comes to the prediction, the class predictor finds the best parameters θ that minimize the worst-group loss based on the current dynamic group assignments provided by the assigner ϕ in the modeling round. Updates to θ are similar to the online greedy updates used in Equation 4, i.e. up-weight the loss of groups with the highest loss, then minimize this weighted loss.
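The "up-weight the groups with the highest loss, then minimize the weighted loss" step of the predicting round can be illustrated with an exponentiated-gradient update on q, in the spirit of the online algorithm of Sagawa et al. (2020). This is a sketch under that assumption; the exact form of Equation 4 may differ, and the step size and function names are hypothetical.

```python
import math


def update_group_weights(q, group_losses, step_size=0.1):
    """Multiply each group weight by exp(step_size * loss), then renormalize."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(q, group_losses)]
    z = sum(scaled)
    return [s / z for s in scaled]


def weighted_loss(q, group_losses):
    """Loss that the predictor would minimize with respect to its parameters."""
    return sum(w * loss for w, loss in zip(q, group_losses))


q = [0.5, 0.5]
group_losses = [0.2, 1.0]  # toy values: the second group is currently the worst
q_new = update_group_weights(q, group_losses)
print(q_new, weighted_loss(q_new, group_losses))
```

In an interactive loop, this update would be recomputed at every iteration from the per-group losses under the assigner's current group assignments.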

Experiments
In this section, we conduct experiments on a synthetic sentiment classification task with complete spurious correlations and two real-world text classification tasks. Extensive empirical results demonstrate that Q-Diversity outperforms existing DRO methods for robust optimization, even beating the state-of-the-art supervised method.

Experimental Setup
Baselines. We compare the performance of Q-Diversity with the following state-of-the-art baselines. Depending on whether the ground-truth group labels are known a priori, these methods can be categorized as supervised, semi-supervised, or unsupervised.
• ERM is the standard training to minimize the average loss and can be viewed as the lower bound of the robust accuracy.
• Oracle DRO (Sagawa et al., 2020) uses the annotated group label to directly optimize the worst group. Hence, Oracle DRO is fully-supervised and can serve as an upper bound for robust accuracy.
• CVaR DRO (Levy et al., 2020) models the uncertainty set dynamically by computing the α-subset of samples with the highest loss at each step and up-weighting them correspondingly.
• LfF (Nam et al., 2020) identifies the minorities in an unsupervised way, as it assumes samples that a weaker model classifies incorrectly largely correspond to those in the minority group and upweights these minority-group-estimated samples.
• EIIL (Creager et al., 2021) attempts to train a group discovery model to softly assign the training data into groups under which the discovery model would maximally violate the invariant risk minimization (IRM) objective, and hence it can be classified into the unsupervised camp.
• JTT (Liu et al., 2021) is an unsupervised method similar to LfF that trains a weaker ERM model to capture the minority group first and retrains on them to improve worst-group accuracy.
• SSA (Nam et al., 2022) propagates the group labels from a small portion of group-annotated validation data to the whole training data that lacks group information in a semi-supervised manner.
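As a small illustration of the CVaR DRO baseline listed above, the following sketch averages the α-fraction of samples with the highest loss. It is a simplified discrete version written for illustration only; the actual algorithm of Levy et al. (2020) may differ, and `cvar_loss` is a hypothetical name.

```python
import math


def cvar_loss(losses, alpha):
    """Mean of the ceil(alpha * n) largest losses (a simple discrete CVaR)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(losses)))
    top = sorted(losses, reverse=True)[:k]
    return sum(top) / k


batch_losses = [0.1, 0.2, 0.3, 2.0]  # toy values
print(cvar_loss(batch_losses, 0.5), cvar_loss(batch_losses, 1.0))
```

With α = 1 the objective reduces to the ordinary average loss, and smaller α concentrates the objective on the hardest samples.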
Evaluation Metrics. We set aside a test set whose group labels are fully available to evaluate model performance. Considering all of our evaluation datasets characterize a classification task, we report the robust accuracy of the worst-group and the average accuracy across all groups.
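The two metrics can be computed as in the sketch below. It assumes that average accuracy is taken over all test examples and that robust accuracy is the minimum per-group accuracy; the function name and the toy predictions are hypothetical.

```python
from collections import defaultdict


def average_and_robust_accuracy(preds, labels, groups):
    """Return (average accuracy, worst-group accuracy, per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    average = sum(correct.values()) / sum(total.values())
    robust = min(per_group.values())
    return average, robust, per_group


preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 0, 0]
groups = ["maj", "maj", "maj", "maj", "min", "min"]
average, robust, per_group = average_and_robust_accuracy(preds, labels, groups)
print(average, robust, per_group)
```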

Q-Diversity Can Learn Robust Model
For the sake of investigating whether Q-Diversity can help improve model robustness, we first carry out a toy classification task on BiasedSST.

Table 1: Average and robust accuracy on BiasedSST.

Method                             Average   Robust
Oracle DRO (Sagawa et al., 2020)   77.9      67.7
ERM                                95.1      2.15
CVaR DRO (Levy et al., 2020)       92.5      28.1
JTT (Liu et al., 2021)             84.2      35.0
Q-Diversity                        95.9      68.2

BiasedSST (Michel et al., 2022) is a modified SST-2 sentiment classification dataset with a distractor token "so, " prepended to some sentences. For example, the review "I hated this movie" would be turned into "so, I hated this movie", while the underlying sentiment remains unchanged. Similar to the construction of Utama et al. (2020), this distractor, which acts like a backdoor trigger, is added to 95% of the negative reviews and 5% of the positive ones in the training set, rendering a strong spurious correlation between the word so and the negative label. Depending on the positive or negative label and the presence or absence of the distractor, we obtain 4 groups, and accuracy on the group {positive, no distractor} can reflect model robustness.
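The BiasedSST construction described above can be mimicked with a few lines of code. This is a hedged sketch rather than the script of Michel et al. (2022): the label encoding (0 = negative, 1 = positive), the function name, and the seed are assumptions.

```python
import random


def add_distractor(examples, p_negative=0.95, p_positive=0.05, seed=0):
    """Prepend "so, " to a label-dependent fraction of (text, label) examples."""
    rng = random.Random(seed)
    biased = []
    for text, label in examples:
        p = p_negative if label == 0 else p_positive  # assumed: 0 = negative
        has_distractor = rng.random() < p
        new_text = "so, " + text if has_distractor else text
        biased.append((new_text, label, has_distractor))
    return biased


toy = [("I hated this movie", 0), ("great film", 1)]
# Probabilities of 1.0 and 0.0 make this small demonstration deterministic.
demo = add_distractor(toy, p_negative=1.0, p_positive=0.0)
print(demo)
```

The returned boolean flag, combined with the label, gives the four groups used for evaluation.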
We compare Q-Diversity with four group DRO baselines and summarize the results in Table 1. It is clear that although the ERM model achieves a high average accuracy, its performance on the group that does not suffer from the synthetic bias drops almost to zero. This reveals that models trained with ERM can very easily capture this spurious correlation and fail on the minority group. The unsupervised methods CVaR DRO and JTT can help relieve such bias overfitting; however, their improvement in robust accuracy is very limited. As for Q-Diversity, its robust performance matches that of Oracle DRO, while attaining a better trade-off between accuracy and robustness.

Q-Diversity in Practice
In order to cover a broad range of practical scenarios, we present two more challenging real-world datasets as the benchmarks for group robustness.
MultiNLI (Williams et al., 2018) is a multi-genre natural language inference dataset. Given two sentences, a premise and a hypothesis, the goal is to predict whether the hypothesis is entailed by, contradicts, or is neutral with respect to the premise. We use this label as the target attribute (i.e., Y = {contradiction, entailment, neutral}), and use the existence of negating words as the spurious attribute (i.e., A = {negation, no negation}).

Table 4: Accuracy on out-of-distribution datasets (details can be found in Appendix A) for tasks with unknown spurious correlations. Q-Diversity improves over ERM by .5 − 10%, while baselines underperform.
Figure 4: Ablation Studies on the role of mix.
Figure 5: Effect of the mixing α on MultiNLI.
Figure 6: Robust accuracy under noisy labels.

CivilComments-WILDS (Koh et al., 2021) is de-
the baselines need group annotations in the validation set for hyperparameter tuning. For example, JTT has to tune the number of epochs T used to train the weaker model for group identification. When these annotations are unavailable in the validation set, their robust accuracy drops significantly. In comparison, parameterizing the group identification makes Q-Diversity completely annotation-free, and the trainable procedure yields better robust accuracy.

Analysis and Discussion
In this section, we present a detailed analysis on the contribution of the diversified uncertainty set Q to its strong unsupervised performance. Furthermore, we explore the robustness of our method under different distributional shifts and random label noise.

Role of the Diversified Q
We inspect the group diversity under the mixing strategy through an ablation study depicted in Figure 4. Apparently, we can observe significant drops in both datasets when removing this group mixing. These drops reveal that diversifying the minority groups can indeed help improve robust accuracy. In addition, we analyze the influence of the mixing parameter α. As shown in Figure 5, we can observe that α indeed affects the effectiveness of the group mixing, leading to the volatility in robust accuracy. Considering the feature of Beta distribution, the sampled λ will be more concentrated around 0.5 as the α value becomes large, resulting in a relatively balanced weight between the mixed example pairs. The model performance remains stable when α is around 7 ∼ 11.
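The concentration of λ around 0.5 for large α follows from the variance of a symmetric Beta distribution, Var[Beta(α, α)] = 1 / (4(2α + 1)), which shrinks as α grows. A small sketch (the function name is hypothetical) evaluates the corresponding standard deviation for a few values of α:

```python
def symmetric_beta_std(alpha):
    """Standard deviation of Beta(alpha, alpha), whose mean is always 0.5."""
    variance = 1.0 / (4.0 * (2.0 * alpha + 1.0))
    return variance ** 0.5


for alpha in (1.0, 7.0, 11.0):
    print(alpha, round(symmetric_beta_std(alpha), 4))
```

The spread of λ around 0.5 decreases monotonically with α, so larger values of α yield more evenly balanced mixtures of the two samples.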

Generalization to OOD Sets
Since Q-Diversity is a fully unsupervised method, it can be used off the shelf to improve OOD generalization on a new task. We therefore transfer Q-Diversity, along with two other well-performing unsupervised baselines, EIIL and JTT, which are first trained on the MultiNLI and SST2 datasets, to a wide range of OOD datasets where the in-distribution spurious correlations may not hold.
Q-Diversity improves robustness to unknown distributional shifts. Since the group information of these OOD test sets is unknown, we report the average accuracy in Table 4. Strikingly, we observe that across the tasks and datasets, the two baselines even underperform the ERM lower bound. Especially on the SST2 dataset, the average accuracy of EIIL and JTT drops by around 10% and 20%, respectively. We speculate that this failure mode can be attributed to their heuristic group identification, which easily overfits to the in-domain data. In contrast, Q-Diversity outperforms ERM by 0.5%-5% across the datasets on average, revealing its great robustness to different distribution shifts.

Under the Presence of Label Noise
Unsupervised methods like JTT are based on the core idea of up-weighting samples with high losses. Nevertheless, when the training data contains noisy labels, such an approach will likely yield degenerate solutions, since the model tends to up-weight mislabeled samples with high losses. To further explore the application of unsupervised group DRO methods in the presence of noisy labels, we perform experiments by injecting random label flips of varying degrees into the MultiNLI dataset.
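Random label flipping of this kind can be simulated as sketched below. The noise model is an assumption made for illustration (each label is replaced, with probability equal to the noise rate, by a different class chosen uniformly at random); the experiments may use a different protocol, and `flip_labels` is a hypothetical name.

```python
import random


def flip_labels(labels, num_classes, noise_rate, seed=0):
    """Replace each label by a different random class with probability noise_rate."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            candidates = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(y)
    return noisy


clean = [0, 1, 2, 0, 1, 2]  # three classes, as in an NLI task
print(flip_labels(clean, num_classes=3, noise_rate=0.2))
```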
Q-Diversity is more robust to random label noise. As shown in Figure 6, Q-Diversity retains better robust accuracy in the presence of label noise than ERM and Group DRO. Consistent with our assumption, JTT performs poorly even with a low noise rate since it fails to distinguish minorities from mislabeled samples.

Related Work
Group Robust Optimization Standard training with ERM can result in highly variable performance because of subpopulation distribution shifts arising from spurious correlations (Wu and Gui, 2022; Gao et al., 2022). In this context, Sagawa et al. (2020) formally introduce group DRO, with the goal of maximizing worst-group (minority-group) performance within the set of pre-defined groups. While promising, in many practical scenarios group information is not reliably available. Therefore, another line of research focuses on worst-case optimization without group annotations (Zhou et al., 2021). Typically, these methods first train a weaker model to identify high-loss samples as minority groups, and subsequently train an additional model with greater emphasis on the estimated minority groups (Nam et al., 2020; Liu et al., 2021).
Although the unsupervised group DRO methods are developed, they are confined to a two-stage training pipeline. In the two-stage model, a failed first stage can lead to an unsuccessful second stage as errors from the former are propagated to the later one. By contrast, Q-Diversity in an end-to-end training manner overcomes the error accumulation.
The group assigner and constructor cooperate with each other, and interactively, the classification response from the constructor can serve as a weak supervision to guide better group identification.
Diversity and OOD Generalization The geometric skew and the statistical skew have been identified as two mechanisms that hurt out-of-distribution performance in the presence of spurious correlations (Nguyen et al., 2021). Concretely, the geometric skew is caused by the fact that the classification margin of a robust classifier on the minority group tends to be much larger than that on the majority group, while the statistical skew arises because gradient descent converges quickly on spurious correlations unless trained for an exponentially long time. Although upweighting or oversampling the minority samples is straightforwardly effective in mitigating the statistical skew, both fail to address the geometric skew because the set of unique samples is unchanged. Therefore, a wide range of studies have emerged to diversify the input samples or the feature space. Among them, counterfactually-augmented data (CAD), i.e., data generated by minimally perturbing examples to flip the ground-truth label, has been shown to help learn robust features under distribution shifts (Kaushik et al., 2020). However, further investigation (Joshi and He, 2022) reveals that the lack of perturbation diversity limits CAD's effectiveness on OOD generalization. In comparison, deep generative models have been directly leveraged to diversify training data with spurious correlations, although the model complexity is increased greatly.
To create more synthesized samples that address the geometric skew, our method applies interpolation across the majority and minority groups, which shows its advantages in terms of perturbation diversity and time consumption.

Conclusion
In this paper, we present Q-Diversity, an unsupervised method to optimize the worst group for model robustness. The formulation of Q-Diversity extends the annotations of group DRO to an automatic assignment through an interactive training mode. Furthermore, with a novel mixing strategy across groups, Q-Diversity can better counteract the failure modes of OOD generalization. In contrast to previous works that only show effectiveness on a particular dataset, we demonstrate that Q-Diversity offers better generalization capability across various OOD sets. We believe that our work sheds light on limitations of group DRO that have been overlooked before, and can be viewed as a cornerstone for future study of worst-group generalization.

Limitations
Although our unsupervised framework Q-Diversity shows great superiority, we acknowledge the following limitations. (i) Our empirical validations on real-world datasets just follow current benchmarks that focus on the group shifts caused by spurious correlations. Although we conduct experiments on scenarios with noisy labels and various OOD datasets, in practice, apart from superficial clues, a series of other factors that lead to group shifts are worth further exploration. (ii) A better theoretical understanding of how the interactive training mode guides Q-Diversity toward better group identification should be established, and this points out the direction for our future work.

Ethics Statement
Natural Language Processing (NLP) models that perform poorly on a minority group have raised a lot of concerns within the research community and broader society in recent years. In this work, the proposed Q-Diversity is a versatile method that could be employed to train a robust model across groups even when the group information is not available. This is a rather practical scenario as the group information is almost missing during the data collection. We believe that our work is a step towards a suite of algorithms capable of solving a broader class of group DRO problems at scale. Moreover, such an algorithm will empower NLP researchers and engineers to create more reliable and ethical systems.

MultiNLI
Dataset: Description
PI (Liu et al., 2020): selected instances from MultiNLI for testing the hypothesis-only bias in NLI models
LI (Liu et al., 2020): selected instances from MultiNLI for testing logical inference ability of NLI models
ST (Naik et al., 2018): stress set construction for testing the heuristics of NLI models
HANS (McCoy et al., 2019): designed to contain examples where the shallow heuristics (e.g., lexical overlap) fail
WaNLI (Liu et al., 2022): worker-and-AI collaborative dataset with challenging reasoning patterns for NLI task
SNLI (Bowman et al., 2015): a large-scale, widely-used benchmark for NLI task
ANLI (R3) (Nie et al., 2020): an iterative, adversarial human-and-model-in-the-loop solution for NLI dataset

SST2
Dataset: Description
SST2 (Socher et al., 2013): from the GLUE NLU benchmark to classify movie reviews as positive or negative
Senti140 (Go et al., 2009): sentiment classification on Twitter messages
SemEval (Nakov et al., 2013): crowdsourcing on Amazon Mechanical Turk over Twitter dataset for sentiment analysis
Yelp (Asghar, 2016): online reviews consisting of free-form text and a star rating out of 5 for services
ImDB (Maas et al., 2011): a collection of positive and negative reviews from Internet Movie Database
Contrast (Gardner et al., 2020): small but label-changing modifications to the instances for ImDB
CAD (Kaushik et al., 2020): counterfactual datasets constructed over ImDB

Table 5: Details of the out-of-distribution datasets in Table 4.

A Details of the OOD Datasets
We train the model on MultiNLI and SST2 tasks and test it on the corresponding OOD datasets respectively. For the results shown in Table 4, we present the details of these OOD datasets in Table 5 as follows.

ACL 2023 Responsible NLP Checklist
A For every submission:
A1. Did you describe the limitations of your work?
Section Limitation
A2. Did you discuss any potential risks of your work?
Section 5
A3. Do the abstract and introduction summarize the paper's main claims?
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
Section 4

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.