Learning Slice-Aware Representations with Mixture of Attentions

Real-world machine learning systems achieve remarkable performance in terms of coarse-grained metrics such as overall accuracy and F1 score. However, model improvement and development often require fine-grained modeling on individual data subsets, or slices, for instance, the data slices where the models have unsatisfactory results. In practice, there is tangible value in developing models that pay extra attention to critical slices or slices of interest while retaining the original overall performance. This work extends the recent slice-based learning (SBL)~\cite{chen2019slice} with a mixture of attentions (MoA) to learn slice-aware dual attentive representations. We empirically show that the MoA approach outperforms both the baseline method and the original SBL approach on monitored slices in two natural language understanding (NLU) tasks.


Introduction
Though machine learning systems have been achieving excellent performance in terms of coarse-grained metrics like accuracy, they perform poorly or even fail on some individual data subsets (i.e., slices). For instance, many models have difficulties learning classes with only a few samples or samples with challenging structures. Inspecting particular data slices can therefore serve as an important component in model development cycles. The recently proposed slice-based learning (SBL) exhibited compelling results, with more than 3% improvement on pre-defined slices (Chen et al., 2019) in a binary classification task. However, one potential limitation of the existing attention mechanism in SBL is that, in multi-class cases, the attention has difficulty using the experts' confidences appropriately when computing slice distributions (see Sec. 3).

Figure 1: Overview of the slice-aware architecture. (1) Slice functions define the special data slices that we want to monitor; (2) a backbone model performs feature extraction (e.g., BERT); (3) slice indicators are membership functions that predict whether a sample belongs to a slice; (4) slice experts learn slice-specific representations; (5) the shared head is the base-task predictive layer shared across experts; and (6) the proposed mixture of attentions (MoA) learns to attend to the slices of interest. It contains two different attention mechanisms (red boxes): a membership attention and a dot-product attention. The MoA learns to re-weight the expert representation r and the original representation x into a slice-aware representation s (yellow lines). The slice distributions used to re-weight r and x are computed in a deterministic (weighted sum over slices) or stochastic (sampling) way.
In this paper, we extend SBL with a mixture of attentions (MoA) mechanism. Two different attention mechanisms are learned to jointly attend to the defined slices from different representations in different latent subspaces. The first attention is based on slice membership likelihood and/or expert confidence, as in SBL (Chen et al., 2019), which we call membership attention. The second is a dot-product attention based on the representations extracted by the backbone model (e.g., BERT (Devlin et al., 2019)). The MoA approach is akin to multi-head attention (Vaswani et al., 2017), but with different attention types that receive different inputs.
As presented in Figure 1, the two attentions in MoA work jointly to attend to (1) the expert representation r and (2) the representation x extracted by the backbone model, and finally form an attentive representation s. The s is a slice-aware featurization of the samples in the particular data slices and is used for making the final model prediction.
We argue that learning joint attention with MoA from different sources for computing slice distributions is beneficial (Vaswani et al., 2017; Li et al., 2018). We evaluate the effectiveness of our proposed approach on intent detection (Liu et al., 2019) and linguistic acceptability (Warstadt et al., 2018) tasks.
Our main contributions are twofold:
• We extend SBL with MoA. The MoA approach can attend to slices in both deterministic (weighted summation) and stochastic (sampling) ways.
• We conduct extensive experiments on two NLU tasks. The results show that MoA outperforms the baseline and vanilla SBL by up to 9% and 6% on average, respectively, on the defined slices.
Architecture

Figure 1 presents the slice-aware architecture based on SBL (Chen et al., 2019). Let {x_n, y_n}_{n=1}^{N} be a dataset with N samples. We aim to learn a slice-aware representation s from the expert-learned representation r and the backbone-extracted representation x.
We first define slice functions (SFs), as in Table 1, to split the dataset into k slices of interest. Each sample is assigned slice labels γ ∈ {0, 1} over {γ_1, γ_2, ..., γ_k} as supervision data; s1 is the base slice, and s2 to s_k are the slices of interest. The SFs are task-dependent and not assumed to be perfectly accurate; they can be noisy or come from weak supervision sources (Ratner et al., 2016). For the task in Sec. 4.2, SFs are defined to improve the slices where the model has unsatisfactory results compared to the overall performance. For the task in Sec. 4.3, we define the SFs for the slices of interest; alternatively, a 5W1H rule can be used for questions (Kim et al., 2019).
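To make this concrete, a slice function can be any cheap heuristic over a sample. The sketch below uses hypothetical heuristics (short utterances and a 5W1H-style question rule), not the exact SFs from Table 1:

```python
# Minimal sketch of heuristic slice functions (SFs); the heuristics are
# illustrative, not the ones used in the paper.
def sf_short_utterance(text: str) -> int:
    # Slice: very short utterances, which are often harder to classify.
    return 1 if len(text.split()) <= 3 else 0

def sf_question(text: str) -> int:
    # Slice: questions, following a 5W1H-style rule.
    wh = ("what", "who", "where", "when", "why", "how")
    return 1 if text.lower().startswith(wh) or text.endswith("?") else 0

def slice_labels(text: str) -> list:
    # gamma_1 is the base slice (always on); gamma_2..gamma_k come from SFs.
    return [1, sf_short_utterance(text), sf_question(text)]
```

Because the SFs can be noisy, these labels are treated as weak supervision rather than gold annotations.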
Second, we use a backbone model such as BERT to extract a representation x ∈ R^d for each sample.
Third, slice indicators f_i(x; w_{f_i}) predict whether a sample belongs to a particular slice. They are learned with a cross-entropy (CE) loss:

L_ind = Σ_{i=1}^{k} CE(f_i(x), γ_i).

Then, slice experts g_i(x; w_{g_i}), w_{g_i} ∈ R^{d×d}, learn a mapping from x to a slice vector r_i ∈ R^d using only the samples that belong to the slice, followed by a shared head, which is shared across all experts and maps r_i to a prediction ŷ = ϕ(r_i; w_s). g_i and ϕ are learned on the base (original) task with the ground-truth label y:

L_exp = Σ_{i=1}^{k} 1[γ_i = 1] CE(ϕ(g_i(x)), y).

Finally, a mixture of attentions (MoA) (as in Sec. 3) re-weights r and x to form s. The s goes through a final prediction function η on the base task, with loss

L_base = CE(η(s), y).

The total loss is a combination of the losses for the slice indicators, slice experts, and base-task prediction function:

L = L_ind + L_exp + L_base.

The whole model is optimised with backpropagation (Rumelhart et al., 1986) in an end-to-end way.
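Putting the pieces together, the forward pass and total loss can be sketched as follows. This is a simplified PyTorch sketch under our own assumptions (linear indicators/experts, and the MoA step reduced to membership-weighted experts for brevity), not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceAware(nn.Module):
    # Sketch of the slice-aware architecture: d = backbone feature dim,
    # k = number of slices, c = number of classes.
    def __init__(self, d=8, k=3, c=2):
        super().__init__()
        self.indicators = nn.ModuleList(nn.Linear(d, 1) for _ in range(k))
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(k))
        self.shared_head = nn.Linear(d, c)   # phi, shared across experts
        self.final_head = nn.Linear(d, c)    # eta, on the slice-aware s

    def forward(self, x, y, gamma):
        # x: (B, d) backbone features; y: (B,) labels; gamma: (B, k) slice labels.
        h = torch.cat([f(x) for f in self.indicators], dim=1)   # (B, k) logits
        r = torch.stack([g(x) for g in self.experts], dim=2)    # (B, d, k)
        loss_ind = F.binary_cross_entropy_with_logits(h, gamma)
        # Each expert is trained only on the samples inside its slice.
        loss_exp = sum(
            (gamma[:, i] * F.cross_entropy(self.shared_head(r[:, :, i]), y,
                                           reduction="none")).mean()
            for i in range(r.size(2)))
        p = torch.softmax(h, dim=1)                     # membership attention
        s = torch.bmm(r, p.unsqueeze(2)).squeeze(2)     # slice-aware s, (B, d)
        loss_base = F.cross_entropy(self.final_head(s), y)
        return loss_ind + loss_exp + loss_base
```

All three losses share the backbone features, so the whole model trains end-to-end with a single backward pass.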

Methodology
The SBL approach (Chen et al., 2019) proposed slice-residual attention modules (SRAMs) that operate directly on the stacked membership likelihoods H ∈ R^k and the experts' prediction confidences |Y| ∈ R^{c×k}, with c = 1 (i.e., binary classification). The slice distribution (attention weights) is then computed as a = SOFTMAX(H + |Y|). One potential limitation of this mechanism is that the above formulation leads to a shape mismatch in the element-wise addition when c > 2 (i.e., multi-class classification). To circumvent this, we propose a mixture of attentions (MoA) that augments membership attention with a dot-product attention drawing on different information sources.
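The shape issue can be seen concretely with a small numpy illustration (ours, not code from SBL):

```python
import numpy as np

k, c = 4, 3                  # 4 slices, 3 classes (multi-class, c > 2)
H = np.zeros(k)              # stacked membership likelihoods, shape (k,)
Y = np.zeros((c, k))         # experts' prediction confidences, shape (c, k)
# H + |Y| no longer yields a single length-k vector: the addition
# broadcasts to shape (c, k), so SOFTMAX(H + |Y|) is not a single
# distribution over the k slices, as it is when c = 1.
scores = H + np.abs(Y)
```

With c = 1 the confidences collapse to shape (1, k) and the sum is effectively a length-k vector, which is why the original formulation works in the binary case.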

Mixture of Attentions
Let x ∈ R^d be the original representation from the backbone model (e.g., BERT), h_i ∈ R^c (c = 1) the i-th indicator function's prediction, and r_i ∈ R^d the i-th expert-learned representation. Stacking over the k slices, we have h ∈ R^{c×k} and r ∈ R^{d×k}. MoA's goal is to (1) attend to r based on the indicator functions' membership likelihoods and/or the experts' confidences; (2) attend to x with a dot-product attention; and (3) form a new slice-aware attentive representation s ∈ R^d from the weighted (or sampled) r and x.
The slice distributions are computed differently for the two attentions. For the membership attention, the probability is p_1 = SOFTMAX(h) or p_1 = SOFTMAX(h + |ŷ|) ∈ R^k (with c = 1 in binary classification). The membership-weighted slice representation is then computed as s_1 = r · p_1, s_1 ∈ R^d. For the dot-product attention, we aim to learn an attention matrix A = {a_1, ..., a_k}, a ∈ R^d, A ∈ R^{d×k}, which is randomly initialized and learned by standard backpropagation. Intuitively, each a is learned to be a slice prototype (Wang and Niepert, 2019; Roy et al., 2020). The probability over slices is computed as

p_2 = SOFTMAX(x^T A) ∈ R^k.

A new attentive representation s_2 is formed by weighting A with p_2,

s_2 = A · p_2,

or by sampling a column from A according to p_2. The slice-aware vector s is then computed by

s = s_1 ∘ s_2,    (8)

where ∘ is an operator (either ⊕: element-wise addition or ⊗: element-wise multiplication). In the multi-class case, only the membership likelihood is used for p_1.
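The two attentions above can be sketched in a few lines of PyTorch. This is our own illustration under the same notation (the x^T A scoring is the natural reading given A ∈ R^{d×k}), not the authors' code:

```python
import torch

def mixture_of_attentions(x, r, h, A, op="mul"):
    # x: (B, d) backbone features; r: (B, d, k) expert representations;
    # h: (B, k) indicator logits; A: (d, k) learned slice prototypes.
    p1 = torch.softmax(h, dim=1)                     # membership attention, (B, k)
    s1 = torch.bmm(r, p1.unsqueeze(2)).squeeze(2)    # s1 = r . p1, (B, d)
    p2 = torch.softmax(x @ A, dim=1)                 # dot-product attention, (B, k)
    s2 = (A.unsqueeze(0) * p2.unsqueeze(1)).sum(2)   # s2 = A . p2, (B, d)
    return s1 * s2 if op == "mul" else s1 + s2       # s = s1 (x)/(+) s2
```

The `op` switch corresponds to the ⊗ / ⊕ choice between the two attention outputs.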
Eq. (8) can be extended into a more general form, the mixture of attentions (MoA):

s = (r · φ(h)) ∘ (A · φ(x^T A)).    (9)

Note that eq. (9) entails the following transformations (→) and captures the representational differences from r to s and from x to s:

r → s_1 → s,    x → s_2 → s.

The φ(·) is either SOFTMAX, which deterministically computes the slice distributions, or a Monte-Carlo gradient estimator, GUMBEL-SOFTMAX (Gumbel, 1954; Jang et al., 2017; Maddison et al., 2017):

GUMBEL-SOFTMAX(z; τ)_i = exp((z_i + π_i)/τ) / Σ_{j=1}^{k} exp((z_j + π_j)/τ),

where the π_i are i.i.d. samples from GUMBEL(0, 1), that is, π = −log(−log(u)), u ∼ UNIFORM(0, 1), and τ is a temperature that controls the concentration of the slice distribution; a small τ leads to a more confident prediction over slices. This stochastically computes the slice distribution. With Gumbel-Softmax, the slice distribution is a soft sample,

p_1 ∼ GUMBEL-SOFTMAX(h; τ),    p_2 ∼ GUMBEL-SOFTMAX(x^T A; τ),

or a hard (but differentiable) sample,

p_1 ∼ ONE-HOT(arg max(p_1)),    p_2 ∼ ONE-HOT(arg max(p_2)),    (15)

for the membership and dot-product attention, respectively.
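The Gumbel-Softmax estimator itself is standard; a minimal sketch of the soft and hard (straight-through) sampling over the k slices, written by us for illustration:

```python
import torch

def gumbel_softmax(logits, tau=1.0, hard=False):
    # Soft or hard (straight-through) sampling from a Gumbel-Softmax
    # distribution over k slices; logits: (B, k).
    u = torch.rand_like(logits).clamp_min(1e-9)
    g = -torch.log(-torch.log(u))                 # Gumbel(0, 1) noise
    p = torch.softmax((logits + g) / tau, dim=1)  # soft sample
    if hard:
        # Forward pass uses the one-hot argmax; gradients flow through
        # the soft probabilities (straight-through trick).
        one_hot = torch.zeros_like(p).scatter_(1, p.argmax(1, keepdim=True), 1.0)
        p = one_hot + (p - p.detach())
    return p
```

Lowering `tau` sharpens the sampled distribution toward one-hot; the hard variant is exactly one-hot in the forward pass while remaining differentiable.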

Experiments
We perform our experiments on a binary classification task, linguistic acceptability, and a multi-class classification task, intent detection.

Datasets. The linguistic acceptability dataset (Warstadt et al., 2018) is split into train/val/test with 7200/878/1000 samples. As in (Chen et al., 2019), we ensure that the ground-truth label proportions are consistent across splits. We use the F1-score and Matthews correlation coefficient (MCC) (Matthews, 1975) as metrics. The NLU dataset (Liu et al., 2019) for intent detection contains 25k user utterances across 64 intents. We randomly split it into train/val/test with ratio 0.7:0.1:0.2 and use accuracy and F1-score as metrics.

Compared Methods. We implemented and compared the following methods:
• Baseline: A three-layer feed-forward network.
• SBL: The original slice-based learning approach (Chen et al., 2019).
• SBL-MoA: Our approach that extends SBL with a mixture of attentions (MoA).
For SBL-MoA, we developed multiple variants with Gumbel-Softmax: SBL-MoA-S and SBL-MoA-H are the variants with soft and hard sampling from a Gumbel-Softmax distribution, respectively. We also tested how the membership attention and dot-product attention interact with each other via ⊕ (element-wise addition) and ⊗ (element-wise multiplication).
Implementation Details. BERT-base (Devlin et al., 2019) from sentence-transformers (Thakur et al., 2020) is used as the backbone model. We use 128 hidden units for all models, which are implemented in PyTorch (Paszke et al., 2019). A dropout (p=0.5) is applied after the input layer. The models are trained with Adam (learning rate 0.001) (Kingma and Ba, 2014), with weight decay of 0.01 and 0.001 for the two tasks, respectively. All models are trained for a maximum of 500 epochs with early stopping (patience=50). The best models are selected based on performance on the validation sets. The temperature τ = 1.0 is fixed in all experiments.

Results. Table 2 shows that SBL-MoA outperforms the baseline; it outperforms SBL by 5%. We also notice that using the operator ⊗ (element-wise multiplication) between the two attention mechanisms leads to better performance than ⊕. Table 3 demonstrates that both SBL and SBL-MoA improve model performance on the monitored slices, with a similar (slightly better) overall performance on the base task (note that the lift on a slice can contribute negligibly to the overall score due to the small slice size; e.g., for SBL-MoA ⊗, s1 with 122 samples, a 7.0% lift only contributes 122×0.07/5124 ≈ 0.0017). The SBL-MoA variants achieve the best scores and outperform SBL by 1% accuracy and 1.7% F1 on average. Figure 2 illustrates the slice distributions for some random samples, where p_1 and p_2 denote the membership and dot-product attention, respectively. The experiments show that p_1 and p_2 agree on predicting the correct slices. Interestingly, the sample in (d), "write sms to our friends", should in principle be sliced as "base", but both attentions assign high confidence to s3 = "Email". We conjecture that this is because all utterances are encoded with BERT, which captures the similarity between this sample and the utterances in the "Email" slice.

Related Work
SBL (Chen et al., 2019) is a novel programming model for critical data slices. It is an instance of weakly supervised learning (Zhou, 2018; Medlock and Briscoe, 2007), where the weak supervision data are generated from pre-defined labeling functions (Ratner et al., 2016). SBL has shown better predictive performance than mixture of experts (Jacobs et al., 1991) and multi-task learning (Caruana, 1997), with reduced run-time cost and fewer parameters (Chen et al., 2019). The concept of SBL has recently been used in many applications. Penha and Hauff (2020) proposed adapting SBL to improve ranking performance and capture the failures of the ranker model. Wang et al. (2021) recently implemented SBL in a commercial conversational AI system to handle the long-tail problem of imbalanced distributions in customer queries, further improving the performance of the conversational skill routing components (Li et al., 2021; Kim et al., 2018b,a).
Our proposed mixture of attentions (MoA) is an instance of multi-head attention (Vaswani et al., 2017) but with different attention types, and it can be extended to include other attention types. We have shown the effectiveness of this mechanism in determining the slice distributions.

Conclusion
This paper extends SBL with MoA (SBL-MoA) to improve model performance on particular data slices. We empirically show that SBL-MoA yields better slice-level performance than the baseline and vanilla SBL on two NLU tasks: linguistic acceptability and intent detection.