COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP tasks

Transformer architectures are complex, and their use in NLP, while it has engendered many successes, makes their interpretability or explainability challenging. Recent debates have shown that attention maps and attribution methods are unreliable (Pruthi et al., 2019; Brunner et al., 2019). In this paper, we present some of their limitations and introduce COCKATIEL, which successfully addresses some of them. COCKATIEL is a novel, post-hoc, concept-based, model-agnostic XAI technique that generates meaningful explanations from the last layer of a neural net model trained on an NLP classification task, using Non-Negative Matrix Factorization (NMF) to discover the concepts the model leverages to make predictions, and Sensitivity Analysis to accurately estimate the importance of each of these concepts for the model. It does so without compromising the accuracy of the underlying model or requiring a new one to be trained. We conduct experiments on single- and multi-aspect sentiment analysis tasks and show COCKATIEL's superior ability to discover concepts that align with humans' on Transformer models, without any supervision; we objectively verify the faithfulness of its explanations through fidelity metrics; and we showcase its ability to provide meaningful explanations on two different datasets.


Introduction
NLP models have undeniably gotten increasingly more complex since the introduction of the transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019a). This trend, which is also occurring in the domain of Computer Vision, has brought about a need for understanding how these models make their predictions. The presence of bias in these models could indeed be prejudicial in applications where users' lives are at stake (De-Arteaga et al., 2019). Humans should be able to comprehend the reasons behind a model's decisions if these models are to gain general acceptance. Companies also need to ensure that they are deploying algorithms which are free of harmful biases and that the explanations they are obligated to issue are easily understandable by employees and end-users alike (Kop, 2021).
Intelligibility by humans has thus become a key topic in explainable AI. As AI systems become more sophisticated and are deployed in increasingly complex environments, the ability to provide clear and concise explanations of their decisions becomes more pressing.
Researchers have proposed multiple solutions to address this challenge. The most straightforward approach analyzes how each part of the input influences the model's prediction. There are different ways of doing this, through perturbation (Ribeiro et al., 2016; Zeiler and Fergus, 2014) or by leveraging the gradients inside the neural network (Sundararajan et al., 2017a). However, these approaches suffer from being vulnerable to adversarial manipulation (Wang et al., 2020), from only performing partial input recovery (Adebayo et al., 2018), and from a general lack of stability with respect to the input (Ghorbani et al., 2019a). Another research path for transformer models harnesses the information in the attention maps of the transformers' layers to understand how the elements in the input relate to the output, implying that the attention mechanism is inherently interpretable. In spite of initial support for this approach, there has been a recent wave of criticism of attention-based explanations (Jain and Wallace, 2019; Pruthi et al., 2019; Serrano and Smith, 2019).
More in line with our proposed work, researchers in the field of rationalization have designed specific architectures to extract excerpts from whole inputs and predict a model's output based on these rationales (Lei et al., 2016; Jain et al., 2020; Chang et al., 2020; Yu et al., 2019; Bastings et al., 2019; Paranjape et al., 2020). These rationales can be seen as explanations that are sufficiently high-level to be easily understood by humans. However, they require training an entirely new model. Moreover, only one rationale can be found per input text, when there might intuitively be several explanations for a given prediction. Finally, these approaches use architectures that have mostly been left behind since the introduction of the transformer architecture, due to their inferior predictive capabilities.
In line with the project of generating explanations that are meaningful to humans, concept-based explainable AI (XAI) has lately advanced the state of the art. The pioneering method TCAV (Kim et al., 2018) goes beyond widespread attribution methods to create high-level explanations based on hand-picked concepts. More recently, Fel et al. (2022) extended this technique to automatically discover pertinent concepts inside the network's activation space and to find the parts of the input space that most align with each concept. Still, it has only been applied to convolutional architectures for image classification tasks.
In this paper, we present COCKATIEL, a novel technique for generating reliable and meaningful explanations for NLP neural architectures on classification problems. It extends CRAFT (Fel et al., 2022), and our contributions can be summarized as follows:
• We introduce a post-hoc explainability technique that is applicable to any neural network architecture containing non-negative activation functions. The technique is capable of explaining predictions on individual instances as well as providing insights into the model's general behavior.
• We measure COCKATIEL's ability to discover concepts that align with those that humans would employ in a sentiment analysis application. Although the model was not trained on data annotated with these human concepts, COCKATIEL's explanations find them with high accuracy.
• We demonstrate that, in addition to generating concepts that are meaningful to humans, these explanations are faithful to the models: an explanation X provided by method C is faithful to a model M just in case, if X is returned as a putative explanation of M's behavior by C, then X plays a causal role in M's behavior.
• We provide examples of explanations on fine-tuned RoBERTa models (Liu et al., 2019a) and bidirectional LSTMs trained from scratch to show how the concept decomposition can be used to understand the inner workings of complex models.
Related Work

Explaining through rationalization
Finding rationales in text refers to the process of identifying expressions that provide the key reasons or justifications for a particular claim or decision about that text. Lei et al. (2016) defined rationales as "a minimal set of text spans that are sufficient to support a given claim or decision". Rationales should satisfy two desiderata: they should be interpretable, and they should yield nearly the same prediction as the original model.
To do so, they use a generator network that finds interesting excerpts and an encoder network that generates predictions based on them. However, their scheme requires the use of reinforcement learning (Williams, 1992) for the optimization procedure. Bastings et al. (2019) proposed to include a reparametrization trick to allow for better gradient estimations without the need for reinforcement learning techniques, and a sparsity constraint to encourage the retrieval of minimal excerpts. Yu et al. (2019) and Paranjape et al. (2020) studied the problem of producing adequate rationales from a game-theoretic point of view. However, these models can be quite complex to train, as they either require a reparametrization trick or a reinforcement learning procedure. Jain et al. (2020) proposed to solve this problem by introducing a support model capable of producing continuous importance scores for instances of the input text, which the rationale extractor can use to decide whether an excerpt will make a good rationale or not.
All these rationales serve as explanations for single instances, but they do not explain how models predict whole classes. Chang et al. (2019) introduced a rationalization technique that allows for the retrieval of rationales for factual and counterfactual scenarios using three players.
However, none of these techniques are model-agnostic: they require specific architectures (in particular, rather simple ones or LSTMs) and specific training procedures. These architectures have been shown not to produce optimal results.

Concept-based explanations
Concept-based explainability is a growing area of research in AI, focused on generating human-understandable explanations for the decisions made by machine learning models. One popular approach for generating concepts is TCAV (Kim et al., 2018). It uses gradient-based techniques to identify the important features of a model. However, TCAV relies on human input, as it requires the user to manually specify the concepts to be tested. This can be time-consuming and may not always produce the most comprehensive explanations (Ghorbani et al., 2019b).
Another approach, ACE (Ghorbani et al., 2019b), aims to automate the concept extraction process. ACE uses a clustering algorithm to identify interpretable concepts in the model's activations, without the need for human input. While this approach has the potential to greatly reduce the time and effort required for concept extraction, the authors criticize their own reliance on pre-defined clustering algorithms, which may not always produce the most relevant or useful concepts.
An alternative uses matrix factorization techniques, such as non-negative matrix factorization (NMF) (Lee and Seung, 1999), to identify interpretable factors in the data (Zhang et al., 2021; Fel et al., 2022). As presented in Section 3, our strategy is inspired by Fel et al. (2022) and is therefore a concept-based XAI method. In (Fel et al., 2022), the authors developed a framework for generating global and local explanations. They successfully tested the meaningfulness of these explanations and their capacity to help humans understand the model's behavior through psychological experiments. However, this approach has so far only been applied to convolutional neural networks for image classification tasks.
For NLP applications, Bouchacourt and Denoyer (2019) proposed a self-interpretable neural architecture capable of simultaneously generating a prediction on classification tasks and its concept-based explanation. These concepts are learned without supervision from excerpts using a bidirectional LSTM during the training phase of the model, and the predictions are based only on the presence or absence of the individual concepts in the input sentences. Despite its capacity to generate interesting concepts, its low prediction accuracy on the classification task is a serious limitation (see Table 1). Going further, Antognini and Faltings (2021) introduced ConRAT, a technique that includes orthogonality, cosine similarity and knowledge distillation constraints, as well as a concept pruning procedure, to improve both the quality of the extracted concepts and the model's accuracy.

COCKATIEL
In this section, we describe COCKATIEL, our concept-based XAI technique for generating human-understandable explanations of NLP models. It has three main components: (i) it uses Non-Negative Matrix Factorization (NMF) to discover the concepts that the neural network under study leverages to make predictions; (ii) it exploits Sensitivity Analysis to accurately estimate the importance of each of these concepts for the model; and (iii) it uses a black-box explainability technique to generate instance-wise explanations at a per-word and per-clause level. Fig. 2 presents a schematic outline of COCKATIEL.
Notation. In a supervised learning framework, we assume that a neural network model f : X → Y has already been trained for some classification task. We denote by (x_1, ..., x_n) ∈ X^n a set of n input texts and (y_1, ..., y_n) ∈ Y^n their associated labels. We consider f to be the composition f = c ∘ h, where h is the last embedding of x (i.e. the last layer of the feature extractor) and c is the classification function. COCKATIEL factorizes h through NMF, so we require h to be non-negative, i.e. h(x) ≥ 0 for all x ∈ X. This constraint is typically verified when the last layer has an activation function σ such that σ(x) ≥ 0, which is the case in (but not limited to) layers or blocks using ReLU.
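As a rough illustration, the following sketch shows one way to expose h for a Hugging Face RoBERTa classifier. The checkpoint name, the cut point, and the added ReLU are assumptions for illustration (the stock RoBERTa head uses tanh; we modify it as described in Section 4), not the paper's exact implementation.

```python
# A minimal sketch of the f = c ∘ h split, assuming a standard Hugging Face
# RobertaForSequenceClassification checkpoint. Here, h is taken to be the
# output of the first dense layer of the classification head followed by ReLU.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def h(texts):
    """Non-negative embedding h(x): [CLS] features -> first head layer -> ReLU."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        cls = model.roberta(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return torch.relu(model.classifier.dense(cls))         # h(x) >= 0

# The classification function c is then the remaining head layer:
# c = model.classifier.out_proj
```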

Unsupervised concept discovery -"Concept part"
COCKATIEL discovers concepts without supervision by factorizing the neural network's intermediate activations with an NMF algorithm.
Because we factorize h, we can generate explanations from embeddings without needing to deal with the complexities of attention layers (Pruthi et al., 2019), nor do we have to deal with the non-identifiability of transformer models (Brunner et al., 2019). Thus, the concept extraction phase of our method does not depend on the specificities of attention. We will come back to this in Section 3.3 when generating our instance-level explanations.

NMF algorithm:
We choose an excerpt-extraction function τ_1 to generate a database of excerpts coming from texts that the model places in the desired class d_c.
Then, we place ourselves at the model's last layer and extract the activations A = h(X_i) for each excerpt X_i in the database. With this information, we solve the constrained optimization problem posed by the NMF algorithm:

(U, W) = arg min_{U ≥ 0, W ≥ 0} (1/2) ∥A − UW^T∥²_F,

where ∥·∥_F is the Frobenius norm. This allows us to decompose the matrix containing all activations, A ∈ R^{n×p}, into two low-rank matrices U ∈ R^{n×r} and W ∈ R^{p×r}. Intuitively, W is a matrix whose columns represent the concepts that we will use to generate explanations, and U is a matrix containing the coefficients quantifying the presence of each concept. These matrices are built so as to minimize the reconstruction error (1/2) ∥A − UW^T∥²_F, enforcing the relevance of the concepts, under a non-negativity constraint on each matrix, which encourages sparsity in their elements.
It is important to note that these coefficients satisfy u_ij ∈ R₊, so the presence of a concept can be determined by where its value stands in the concept's coefficient distribution. In practice, we have found that fixing a threshold at the quantile representing the 10% highest values leads to accurate and easy-to-interpret explanations.
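To make the concept part concrete, here is a minimal sketch of the factorization and of the 10%-quantile presence rule, assuming `activations` holds the non-negative embeddings h(X_i) of the excerpts; the function names are illustrative.

```python
# Sketch of the "concept part": factorize A ≈ U @ W.T with U, W >= 0.
import numpy as np
from sklearn.decomposition import NMF

def discover_concepts(activations: np.ndarray, n_concepts: int = 20):
    """activations: (n_excerpts, p) non-negative matrix A of embeddings."""
    nmf = NMF(n_components=n_concepts, init="nndsvd", max_iter=500)
    U = nmf.fit_transform(activations)  # (n, r): presence coefficients
    W = nmf.components_.T               # (p, r): one concept per column
    return U, W

def concept_presence(U: np.ndarray, quantile: float = 0.9):
    """Mark concept k as present in an excerpt when its coefficient lies in
    the top 10% of that concept's coefficient distribution (as above)."""
    thresholds = np.quantile(U, quantile, axis=0)
    return U >= thresholds  # boolean (n, r) presence mask
```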
Choice of τ_1: As we want the concepts to be descriptive enough to convey an abstraction but short enough to contain only one, we work with excerpts chosen by an excerpt-extraction function τ_1. The choice of τ_1, which should depend on the dataset and the text's format, heavily impacts the type of explanations that we are able to generate.
We have identified 3 possible τ_1 functions: (i) take the full text; (ii) split the text into sentences (of at least 6 words); (iii) split the text into clauses. Linguistically, it does not make sense to take smaller units such as one or two words, since their meaning is typically too unfocused to provide a real explanation.
We therefore choose τ_1 to suit each use-case. If we want to capture the mood of whole inputs, we can designate the inputs themselves as the excerpts, and then interpret them by leveraging the local part of our method. If we instead wish to extract simpler but more structured concepts, we can choose τ_1 to pick sentences of at least 6 words ending in a full stop. The first condition is necessary in the case of the beer review dataset, which is composed of short sentences containing very simple descriptions. For this dataset, using only very short excerpts would fail to convey the complexity of the ideas conveyed by the concepts. In this paper, we present results using these two excerpt-extraction functions; the sentence-level one is sketched below.
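For instance, the sentence-level τ_1 described above could look like the following sketch (the regex-based splitting is a simplifying assumption; any sentence tokenizer would do).

```python
# A possible tau_1: split a text into sentence excerpts of at least 6 words.
import re

def tau_1_sentences(text: str, min_words: int = 6):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) >= min_words]
```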

Concept importance estimation -"Ranking part"
A common issue when utilizing concept extraction methods is the discrepancy between concepts deemed relevant by humans and those utilized by the model for classification. To mitigate the potential for confirmation bias during the concept analysis phase, we estimate the overall importance of the extracted concepts.
To determine which concepts have the most significant impact on the model output, we use counterfactual reasoning (Peters et al., 2017; Pearl et al., 2016) combined with sensitivity analysis (Cukier et al., 1973; Iooss and Lemaître, 2015). A classic strategy in this area is the use of total Sobol indices (Sobol, 1993). To estimate the importance of a concept U_i, we measure the fluctuations of the model output c(UW^T) in response to perturbations of the concept coefficients U_i. Specifically, we use a sequence of random mask variables M = (M_1, ..., M_r) to introduce concept fluctuations and reconstruct a perturbed activation Ã = (U ⊙ M)W^T. We then propagate this perturbed activation to the model output Y = c(Ã). An important concept will induce a large variance in the model output, while an unused concept will barely change it. This approach captures the importance of a concept, along with its interactions with other concepts, by calculating the expected variance that would remain if all the mask variables except M_i were fixed.

Definition 3.1 (Total Sobol indices). The total Sobol index ST_i, which measures the contribution of a concept U_i as well as its interactions of any order with any other concepts to the model output variance, is given by:

ST_i = E_{M_{∼i}} [ Var_{M_i} ( Y | M_{∼i} ) ] / Var(Y),

where M_{∼i} denotes all mask variables except M_i.
The method for calculating these indices exploits the Sobol-Hoeffding decomposition and is detailed in the supplementary materials (appendix A).
A plethora of techniques already exist to compute this index efficiently (Saltelli et al., 2010; Marrel et al., 2009; Janon et al., 2014; Owen, 2013; Tarantola et al., 2006). Concretely, we estimate the total Sobol indices using the Jansen estimator (Janon et al., 2014), a widely recognized efficient method (Puy et al., 2022). The Jansen estimator is commonly used in conjunction with a Monte Carlo sampling strategy, but we improve over Monte Carlo by using a Quasi-Monte Carlo sampling strategy. This technique generates sample sequences with low discrepancy, resulting in a faster and more stable convergence rate (Gerber, 2015). A sketch of this ranking step is given below.
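The following sketch shows how the ranking part could be implemented with the Jansen estimator over a Sobol (Quasi-Monte Carlo) mask design. Here `classifier` stands for c and is assumed to accept a batch of perturbed activation matrices; the name, the output averaging and the number of designs are illustrative assumptions, not the exact implementation.

```python
# Sketch of the "ranking part": total Sobol indices via the Jansen estimator.
import numpy as np
from scipy.stats import qmc

def total_sobol_indices(classifier, U, W, n_designs: int = 32, seed: int = 0):
    """Estimate ST_i for each of the r concepts; U is (n, r), W is (p, r)."""
    n, r = U.shape
    sampler = qmc.Sobol(d=r, seed=seed)     # low-discrepancy QMC sequence
    M_A = sampler.random(n_designs)         # first mask design, shape (N, r)
    M_B = sampler.random(n_designs)         # second, independent mask design

    def output(masks):
        # Perturb coefficients, rebuild activations, propagate to the output.
        A_tilde = (U[None, :, :] * masks[:, None, :]) @ W.T  # (N, n, p)
        return classifier(A_tilde).mean(axis=-1)             # (N,) summary

    y_A = output(M_A)
    st = np.empty(r)
    for i in range(r):
        M_ABi = M_A.copy()
        M_ABi[:, i] = M_B[:, i]             # resample only mask column i
        st[i] = np.mean((y_A - output(M_ABi)) ** 2) / (2 * np.var(y_A))
    return st
```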

Instance-level explanation generation -"Interpretable elements part"
In this part, we interpret the concepts found previously by finding which words and clauses are associated with each concept. We adapt Occlusion (Zeiler and Fergus, 2014), a black-box attribution method that works by masking each word and looking at the impact on the model output. In our case, to gauge the importance of each word for a given concept, we mask words in a sentence and measure the effect of the new sentence (without those words) on the concept. This operation can be performed at the word or clause level (i.e. masking words or whole clauses) to obtain explanations that are more or less fine-grained depending on the application.
Motivations: Occlusion has been shown to perform particularly well on NLP models (Fel et al., 2021a) and does not suffer from the inefficiency of having to sample a considerable amount of masks for each explanation. Indeed, Fel et al. (2021a) compared Occlusion to other explainability techniques that are commonly used in NLP, and showed that it is more faithful to the model than Saliency (Simonyan et al., 2014), Grad-Input, SmoothGrad (Smilkov et al., 2017), Integrated Gradients (Sundararajan et al., 2017b), and their own Sobol method, on both LSTM and BERT models.
In addition, in the case of transformer models, using a black-box method such as Occlusion avoids manipulating the attention layers between the input and the activation matrix A, where our concepts are located. In doing so, we avoid the non-identifiability problem of transformer models (Pruthi et al., 2019).
Application: Empirically, we perform the following operations. For a sentence X_i, we compute its activation A_i = h(X_i). We keep the matrix W computed with the NMF fixed, and denote by W_k the k-th concept of W. As before, we obtain the importance of sentence X_i for concept k as the concept coefficient U_{i,k}, computed by solving the NMF problem with W fixed. Then, we remove the element j from sentence i to obtain X̃_{i−j} (i.e. we replace the tokenized feature by a zero), so that Ã_{i−j} = h(X̃_{i−j}) with concept coefficients Ũ_{i−j}, and we define:

ϕ(k, i, j) = U_{i,k} − Ũ_{i−j,k}.

Thus, ϕ(k, i, j) quantifies the influence of element j in sentence i on concept k. For the visualisations (see e.g. Fig. 6), we color each element with the color of the concept for which it is most important; the darker the color, the more important the element is for the concept. A sketch of this procedure is given below.
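A minimal sketch of this occlusion procedure, assuming `embed` implements h on a token list and using non-negative least squares to recover the coefficients with W fixed (the exact projection and masking token used in practice may differ):

```python
# Sketch of the "interpretable elements part": per-word occlusion scores.
import numpy as np
from scipy.optimize import nnls

def concept_coefficients(a: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project one activation a (p,) onto the fixed concept basis W (p, r)."""
    u, _ = nnls(W, a)  # solve min ||W u - a|| subject to u >= 0
    return u

def occlusion_importance(embed, tokens, W, k):
    """phi(k, i, j) for every position j of one tokenized sentence."""
    u_full = concept_coefficients(embed(tokens), W)
    phi = np.zeros(len(tokens))
    for j in range(len(tokens)):
        # Zero out token j (approximated here with a padding token).
        occluded = tokens[:j] + ["<pad>"] + tokens[j + 1:]
        u_occ = concept_coefficients(embed(occluded), W)
        phi[j] = u_full[k] - u_occ[k]  # drop in concept k when j is masked
    return phi
```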
Choice of τ_2: Just like in the case of the NMF, the choice of the form of the input elements to occlude will have an impact on the understandability of the explanations. This can be generalized via another excerpt-extraction function τ_2, whose optimal shape will depend on the dataset, the text's format and the learned concepts (i.e. Occlusion should not be applied at a per-clause level if the concepts were learned using a τ_1 providing single words, so this first excerpt-extraction function must be taken into consideration). There is a certain trade-off between the granularity and the interpretability of the explanations, as illustrated in Figure 11 in the appendix, which contains some examples with different choices of τ_2. In general, we advise trying different combinations of τ_i to find the desired level of granularity in the explanations for each use-case.

Experimental evaluation
For all of our results, we fine-tuned RoBERTa-based models (Liu et al., 2019a) on each dataset. We ensured the non-negativity of at least one layer of the model by adding a ReLU activation after the first layer of the 1-hidden-layer, dense MLP of the classification head, as sketched below. For the qualitative analysis, we also tested COCKATIEL's performance on bidirectional LSTM models trained from scratch. More details about the implementations are given in appendix B.
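The modified classification head could look like the following sketch (hidden sizes are illustrative; only the ReLU after the first dense layer matters for the non-negativity of h).

```python
# Sketch of the 1-hidden-layer classification head with a ReLU after its
# first layer, so that the representation fed to the NMF is non-negative.
import torch.nn as nn

classification_head = nn.Sequential(
    nn.Linear(768, 768),  # first dense layer; its output is h(x) after ReLU
    nn.ReLU(),            # guarantees h(x) >= 0, as required by the NMF
    nn.Linear(768, 2),    # final logits: the classification function c
)
```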

Table 1: Objective performance of rationales for the multi-aspect beer reviews. All baselines are trained separately on each aspect rating, except for ConRAT (Antognini and Faltings, 2021), which is trained on the Overall label just like our method. Bold and underline denote the best and second-best results, respectively. Our row (accuracy, then precision/recall/F1 per aspect) reads:

Ours: Acc. 95.2 | Appearance 39.5/58.4/45.5 | Aroma 63.3/56.4/59.7 | Palate 27.3/67.4/38.9 | Taste 26.0/43.5/32.5 | Average 41.4/66.1/50.9
We first analyze the meaningfulness of the discovered concepts by measuring their alignment with human annotations on the different aspects of a multi- or single-aspect sentiment analysis task. Then, we ensure that our explanations are faithful to the model through an adaptation of the insertion and deletion metrics to concept-based XAI. Finally, we showcase some examples of explanations and applications of our method. Following the human-alignment evaluation in Antognini and Faltings (2021), we begin with the beer task.

Alignment with human concepts
Beer Task. We measure the extent to which our concepts overlap with the human annotations for the 4 different aspects of the multi-aspect beer reviews dataset (McAuley et al., 2012). This dataset contains beer reviews with commentary and marks (from 0 to 5) on 5 different aspects: Appearance, Aroma, Palate, Taste and Overall. The model is trained to predict whether the overall score is greater than 3 (i.e. a positive review of the beer) and does not have access to the labels for the other aspects. Additionally, the dataset includes 994 reviews with annotations indicating the position of these aspects in the text. The objective of this evaluation is to look for concepts that align with these annotations and to measure their capacity to predict the location of each aspect. In particular, we searched across the whole annotated dataset for the concepts whose F1 score for the prediction of each aspect was maximal. It is important to note that this does not take into account the extent to which these concepts are important for the model's predictions; it only serves as an automated test of whether the explainability technique is capable of generating understandable concepts.
We calculate the precision, recall and F1 scores for each aspect, and we do so with l = 10 and l = 20 concepts. We remind the reader that, unlike the baselines, our method is a post-hoc technique: the model does not need to be re-trained, and changing the number of concepts takes only a few minutes of compute on a GPU. The alignment search is sketched below.
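A sketch of this alignment search, assuming `presence` is the boolean concept-presence mask from the NMF step and `aspect_labels` marks the excerpts annotated with a given aspect (both names are illustrative):

```python
# Sketch of the human-alignment test: pick, for each aspect, the concept
# whose presence best predicts the human annotations, scored by F1.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def best_concept_for_aspect(presence: np.ndarray, aspect_labels: np.ndarray):
    """presence: (n_excerpts, r) booleans; aspect_labels: (n_excerpts,) booleans."""
    best = (-1.0, 0.0, 0.0, -1)  # (F1, precision, recall, concept index)
    for k in range(presence.shape[1]):
        p, r, f1, _ = precision_recall_fscore_support(
            aspect_labels, presence[:, k], average="binary", zero_division=0
        )
        best = max(best, (f1, p, r, k))
    return best
```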
In Table 1, we compare our results to those obtained with several rationalization techniques: RNP (Lei et al., 2016), RNP-3P (Yu et al., 2019), InvRAT (Chang et al., 2020) and ConRAT (Antognini and Faltings, 2021) on the beer task. We demonstrate not only that our model achieves the highest accuracy, but also that it outperforms all the other methods in its ability to accurately recognize the human annotations, be it by its precision, recall or F1 score.

Evaluation of Explanation Faithfulness
We have demonstrated that we can generate concepts that greatly align with humans', but to legitimately serve as an explainability technique, our method must also guarantee faithfulness. This element is key: the concepts leveraged by the model may not perfectly align with humans' in every task, but we still want the explanation to reflect what the model is doing. An XAI method is said to be faithful if its explanations faithfully convey the information that the model uses to generate its predictions. Ghorbani et al. (2019b) and Zhang et al. (2021) proposed to use an adaptation of the deletion and insertion explainability metrics for concept-based methods. In essence, they proposed to gradually mask/add the concepts (following their importance) and to observe the impact on the logits. If the concepts are indeed important for the model's predictions, the logits should drastically decrease/increase as vital information for the prediction is progressively erased/added.
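A sketch of the deletion variant is shown below; insertion is symmetric (start from all-zero coefficients and add concepts back in order of importance). The `classifier` callable and the logit averaging are illustrative assumptions.

```python
# Sketch of the concept-deletion fidelity curve: zero out concepts from most
# to least important and track the class logit; a faithful ranking makes the
# curve drop quickly.
import numpy as np

def deletion_curve(classifier, U, W, importance_order):
    """importance_order: concept indices sorted by decreasing Sobol index."""
    U_cur = U.copy()
    logits = [classifier(U_cur @ W.T).mean()]
    for k in importance_order:
        U_cur[:, k] = 0.0                         # erase concept k everywhere
        logits.append(classifier(U_cur @ W.T).mean())
    return np.array(logits)
```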
To evaluate explanation faithfulness and present qualitative results, we used the IMDB dataset (Maas et al., 2011), a collection of 50K movie reviews from the Internet Movie Database (IMDB) website. For each review, IMDB specifies whether it is positive or negative (the label). The dataset is balanced, with 25K positive and 25K negative reviews. We used a RoBERTa model to predict the label from the reviews.
In Fig. 5, we showcase the plots for these two fidelity metrics on the IMDB Reviews dataset. We observe that the concepts are indeed important for the model's predictions. In both plots, the curve corresponding to the concepts ranked in order of importance according to our Sobol method is better than a random ranking of these concepts, and much better than the reverse of the Sobol importance order. To obtain statistically significant results, we took 10 sets of 10k reviews and computed the mean and standard deviation for both metrics.

Qualitative evaluation
A model with good accuracy like RoBERTa yields very good explanations. Others, like the LSTM (see appendix C), do not perform as well and do not yield good explanations. This is not a surprise: if the model predicts badly, the concepts it uses to predict will necessarily be bad. Similarly, if the model is very basic, it uses simple concepts to predict. The reviews in IMDB are also well written, which makes it easier to analyse sentences and words in order to properly label the concepts found by the NMF.
In Fig. 6, we can see the 3 most important concepts for each label class. Each of these concepts ("the favorite movie", "technically good/interesting movie", "good comedy or family movie" for the positive class; "the worst movie", "middling movie", "boring/stupid movie" for the negative class) is an idea that seems natural and that structures our vision of why a film would be reviewed positively or negatively.

Figure 6: Concepts generated with l = 20 for a few sentences taken from IMDB reviews. The colored elements are those important for the concept of the corresponding color (calculated with part (iii) of our method). The more colorful the element, the more important it is for the concept (continuously). We have selected the 3 most important concepts for each label (see Fig. 3). The name of each concept is chosen manually in view of the important elements corresponding to it.

Conclusion
In this paper, we revisited concept-based explainability techniques and presented COCKATIEL, a post-hoc, model-agnostic method capable of generating meaningful and faithful explanations for NLP models trained on classification tasks. The method has three parts: (i) a concept part, using Non-Negative Matrix Factorization to discover the concepts; (ii) a ranking part, using total Sobol indices to measure the influence of each concept; and (iii) an interpretable elements part, using a black-box attribution method to quantify the impact of each element of each concept.
We measured COCKATIEL's ability to discover concepts that align with those of humans and obtained better scores than state-of-the-art methods. We demonstrated that, in addition to generating meaningful concepts for humans, these explanations are faithful to the models. Finally, we gave some qualitative examples of explanations for different models to illustrate the method in practice.

Limitations
We have demonstrated that COCKATIEL is capable of generating meaningful explanations that align with human concepts, and that these explanations tend to be rather faithful to the model.
The concepts extracted by the NMF are abstract, and we interpret them using part (iii) of the method. However, for this interpretation, we rely on our own understanding of the concept as suggested by the examples of words or clauses associated with it. This part therefore requires human supervision, and the result may differ depending on who is looking. One way to add objectivity to this concept-labeling task would be to leverage topic modeling to find a common theme for each concept.
In addition, τ_1 and τ_2 were chosen empirically to allow for an adequate concept complexity/human understandability trade-off in our examples. We recognize that this choice might not be optimal in every situation, as more complex concepts may be advantageous in some cases, and more easily understandable ones in others. We surmise that this choice might also depend on the number of concepts and on the model's expressivity.
Finally, we have studied the meaningfulness and fidelity of our generated concepts, but ideally, simulatability should also be tested. This property measures an explanation's capacity to help humans predict the model's behavior, and has recently caught the attention of the XAI community (Fel et al., 2021b; Shen and Huang, 2020; Nguyen, 2018; Hase and Bansal, 2020). We leave this analysis for future work.

Ethics Statement
This work contributes to the field of explainability, which has strong links with the field of fairness, because explaining a model makes it possible to understand its biases. Transformers are a type of model that is little studied in explainability and yet widely used. COCKATIEL is a tool to explain transformers and can therefore help avoid deploying models that are biased against minorities.
It is important to remark that this need for understanding automatic decisions is starting to be enforced by law, for instance by the so-called AI Act of the European Union. As a consequence, companies need to ensure that they are deploying algorithms which are free of harmful biases and that the explanations they are obligated to issue are easily understandable by employees and end-users alike.

B Implementation Details
We trained 3 different models. For each model, we performed a single run, and we split the datasets into 70% for training, 10% for validation and 20% for testing.

B.1 Trained RoBERTa on Beer dataset
We used a RoBERTa base model pretrained by Liu et al. (2019b) and available on Hugging Face (all the information on the pretraining can be found in the paper). The model was pretrained on the union of five datasets:
• BookCorpus (Zhu et al., 2015), a dataset containing 11,038 unpublished books;
• English Wikipedia (excluding lists, tables and headers);
• CC-News (Mackenzie et al., 2020), a dataset containing 63 million English news articles crawled between September 2016 and February 2019;
• OpenWebText (Radford et al., 2019), an open-source recreation of the WebText dataset used to train GPT-2;
• Stories (Trinh and Le, 2018), a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.
We then fine-tuned the model on the Beer dataset. The model was trained on 2 GPUs for 10 epochs with a batch size of 32 and a sequence length of 512. The optimizer was AdamW with a learning rate of 1e-5, β_1 = 0.9, β_2 = 0.98, and ϵ = 1e-6.

B.2 Trained RoBERTa on IMDB dataset
We used a RoBERTa model already fine-tuned on IMDB from Hugging Face. This model used the pretraining presented above and was fine-tuned for 2 epochs with a batch size of 16 and an Adam optimizer with a learning rate of 2e-5, β_1 = 0.9, β_2 = 0.999 and ϵ = 1e-8.

B.3 Trained LSTM on IMDB dataset
We created our LSTM with the architecture given in the model summary below. We then trained it on the IMDB dataset. The model was trained on 2 GPUs for 5 epochs with a batch size of 128 and a sequence length of 512. The optimizer was Adam with a learning rate of 1e-4.

C LSTM example
LSTMs are much less complex than RoBERTa, and as such, we can expect them to leverage fewer and much simpler concepts for their predictions.
In particular, COCKATIEL identified 3 concepts that monopolized the importance score for each class on the RoBERTa model. For the positive class, we had "the favorite movie", "technically good/interesting movie" and "good comedy or family movie". For the negative class, we had "the worst movie", "middling movie" and "boring movie".
In contrast, in the case of the LSTM (see Figure 8), COCKATIEL detected a single important concept per predicted class. For the positive class, this concept mostly encompasses the positive language elements, and for the negative class, the negative ones. This is a much more basic view of the review classification problem, and COCKATIEL allows us to confirm our intuitions about the richness of the embedding learned by the LSTM.

D Other examples of COCKATIEL explanations for RoBERTa
We have selected the 3 most important concepts for each label (see Fig. 3). The name of each concept is chosen manually in view of the important elements corresponding to it. We split the text into clauses for occlusion using the flair library's SequenceTagger implementation.

Figure 1: An illustration of COCKATIEL. Given some sentences from IMDB reviews, COCKATIEL (i) identifies concepts for prediction, (ii) ranks them, and (iii) gives the most important elements for each concept (to help us interpret the concept).

Figure 2: Overview of our method: COCKATIEL can be divided into three phases. (i) The first step is assembling the concept base. We propose to do this by constituting a database of whole input texts or excerpts thereof, projecting each of these elements into the embedding of the model of our choice, h(x), and using the NMF algorithm to decompose the resulting non-negative matrix into two low-rank, non-negative matrices: U and W. (ii) Once U and W have been computed, we can compute the total Sobol indices for the concept base's columns by masking the coefficients and looking at their effect on the classifier's output: c((U ⊙ M)W^T). (iii) Finally, we propose to retrieve the influence of each word of the instance under study on each concept through Occlusion, that is, by applying masks to each word (or clause) in the input and quantifying the changes in each of the concept coefficients.

Figure 3: Concept importance: the global influence of the NMF concepts on the predictions of the RoBERTa model, measured using Sobol indices. There are different concepts for each class (positive and negative label).

Figure 4: Concepts generated with l = 20 for a beer review. The colors depict the aspects for each annotated concept. COCKATIEL is trained only on the label, and we use the NMF part of the method to find the annotated concepts. For other example reviews, see appendix D.

Figure 5: (Upper) Deletion curve for RoBERTa on IMDB Reviews (lower is better). (Lower) Insertion curve for RoBERTa on IMDB Reviews (higher is better).
LSTM architecture used in appendix B.3 (PyTorch model summary):

SentimentRNN(
  (embedding): Embedding(1001, 512)
  (lstm): LSTM(512, 128, num_layers=4, batch_first=True, bidirectional=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc_1): Linear(in_features=128, out_features=128, bias=True)
  (relu): ReLU()
  (fc_2): Linear(in_features=128, out_features=2, bias=True)
  (sig): Softmax(dim=1)
)

Figure 7: Concept importance: the global influence of the NMF concepts on the predictions of the LSTM model, measured using Sobol indices. We have different concepts for each class (positive and negative label).

Figure 8: Concepts generated for an LSTM model with l = 5 for a few sentences out of IMDB reviews. The colored elements are those important for the concept of the corresponding color (calculated with part (iii) of our method). The more colorful the element, the more important it is for the concept (continuously). We have selected the most important concept for each label (see Fig. 7). The name of each concept is chosen manually in view of the important elements corresponding to it.

Figure 9: Concepts generated with l = 20 for some beer reviews with the RoBERTa model. The colors depict the aspects for each annotated concept. COCKATIEL is trained only on the label, and we use the NMF part of the method to find the annotated concepts.

Figure 10: Concepts generated for a RoBERTa model with l = 20 for a few sentences taken out of IMDB reviews. The colored elements are those important for the concept of the corresponding color (calculated with part (iii) of our method). The more colorful the element, the more important it is for the concept (continuously). We have selected the 3 most important concepts for each label (see Fig. 3). The name of each concept is chosen manually in view of the important elements corresponding to it.

Figure 11: Concepts generated for a RoBERTa model with l = 20 for a few sentences taken out of IMDB reviews. The excerpts chosen by the excerpt-extraction function τ_1 are sentences in both cases (so the concepts are the same). The colored elements are those considered most important for the concept of the corresponding color (calculated with part (iii) of our method). We compare the visualisations of the same sentences with two different excerpt-extraction functions τ_2: words (on the left) and clauses (on the right). We split the text into clauses for occlusion using the flair library's SequenceTagger implementation.