SEAL: Interactive Tool for Systematic Error Analysis and Labeling

With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, these models often fail systematically on tail data or rare groups that are not obvious from aggregate evaluation. Identifying such problematic data groups is even harder when there are no explicit labels (e.g., ethnicity, gender), and it is further compounded for NLP datasets by the lack of visual features to characterize failure modes (e.g., Asian males, animals indoors, waterbirds on land). This paper introduces an interactive Systematic Error Analysis and Labeling (SEAL) tool that uses a two-step approach: first identify high-error slices of data, and then introduce methods that give human-understandable semantics to those under-performing slices. We explore a variety of methods for producing coherent semantics for the error groups, using language models for semantic labeling and a text-to-image model for generating visual features. The SEAL toolkit and a demo screencast are available at https://huggingface.co/spaces/nazneen/seal.


Introduction
Machine learning systems that seemingly perform well on average can still make systematic errors on important subsets of data. Examples include systems performing poorly for marginalized groups in chatbots (Stuart-Ulin, 2018), recruiting tools (Hamilton, 2018), cloud products (Kayser-Bril, 2020), ad targeting (Hao, 2019), credit services (Knight, 2019), and image cropping (Hamilton, 2020). Discovering and labeling systematic errors in ML systems is an open research problem whose solution would enable building robust models that generalize across subpopulations of data.
Uncovering underperforming groups of data in an ML system is not straightforward. First, the high-dimensional space of the representations learned by deep learning models makes it difficult to identify such groups of systematic errors. Second, it is difficult to extract and label the hidden semantic information in high-error groups without a human-in-the-loop setup. Identifying systematic model failures requires practitioners to think creatively about model evaluation (Ribeiro et al., 2020; Wu et al., 2019; Goel et al., 2021b; Kiela et al., 2021; Yuan et al., 2022). However, current approaches are mostly limited to examining and manipulating model mispredictions. The onus of identifying which group or subset of data to evaluate still falls on the practitioner, making the process inefficient and prone to oversight. Recent work on fine-grained error analysis, such as Domino (Eyuboglu et al., 2022) and Spotlight (d'Eon et al., 2022), provides solutions to this problem but focuses on image datasets, which are easier to visualize.
Error analysis for text data is less explored and more challenging. It also highlights the need to provide semantic summaries of text, which we tackle in SEAL. For example, NLP models could underperform on hundreds of possible input types: longer inputs, inputs from non-native speakers, inputs from topic domains underrepresented in training, etc. This is a huge barrier to entry for most non-expert ML users who wish to gain a better understanding of their models and datasets with existing tools. Model evaluation should ideally give actionable insights into a model's performance on a dataset, in the form of data curation (Liang and Zou, 2022) or model patching (Goel et al., 2021a).
Our desideratum is a tool that summarizes failures of a model on textual data in a concise, coherent, and human-interpretable way. Systematic Error Analysis and Labeling (SEAL) is an interactive tool that (1) identifies candidate groups of data with high systematic errors and (2) generates semantic labels for those groups. For (1), we use k-means++ on the subset of evaluation data with the highest loss. Semantic labeling uses LLMs (such as GPT-3) in a zero-shot setting to identify concepts or topics common to the examples in a candidate group. We also explored using a text-to-image model, Dalle-mini (Dayma et al., 2021), to generate visual features for high-error clusters. Semantic descriptions (via labeling or visual features) of such systematic model errors not only enable practitioners to better understand the failure modes of their model during evaluation but also give actionable insight into fixing them via some form of model patching or data augmentation.

SEAL
We present Systematic Error Analysis and Labeling (SEAL), an interactive visualization tool that provides rich data point comparison for text classification systems, enabling fine-grained understanding of model performance on data groups, as shown in Figure 2. It comes pre-loaded with model outputs for the most-downloaded HuggingFace (HF) models and datasets, as well as scripts for loading data from any dataset provided by the Datasets API and extracting embeddings from any HF-compatible model.1

Error Discovery and Analysis
Identifying model failures via error discovery is a crucial step in engineering robust systems that generalize to diverse subsets of data. SEAL uses the model's loss on a datapoint as a proxy for potential bugs or errors. Past work has examined model behavior on individual datapoints for mapping training datasets (Swayamdipta et al., 2020); we leverage information about model behavior on individual evaluation datapoints in a similar fashion. We use quantiles to divide the model's loss region for further analysis. For example, Figure 2 shows the 0.99 loss quantile for the distilbert-base-uncased model (Sanh et al., 2019) on the yelp_polarity (Zhang et al., 2015) sentiment classification dataset. The SEAL interface allows the user to control the loss quantile for fine-grained analysis using the widget on the side panel.
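As a concrete illustration, this quantile-based slicing step can be sketched in a few lines; the losses below are synthetic stand-ins for per-example evaluation losses, not SEAL's actual data:

```python
# Sketch of SEAL's first step: treat per-example loss as an error signal
# and keep only examples above a user-chosen loss quantile (0.99 here).
import numpy as np

rng = np.random.default_rng(0)
losses = rng.exponential(scale=0.5, size=10_000)  # stand-in for eval losses

quantile = 0.99
threshold = np.quantile(losses, quantile)
high_loss_idx = np.where(losses > threshold)[0]   # the high-error slice

print(len(high_loss_idx))  # roughly 1% of the evaluation set
```

The resulting index set is what the clustering step below operates on.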
SEAL uses k-means++ to cluster the high-loss candidate datapoints from the above step. Meng et al. (2022) used k-means for topic discovery on an entire dataset and showed that the clusters are stable only when k is very high (k >> 100) because of the scale of the embedding space. In contrast, SEAL only clusters the very high loss slice (> 0.98 quantile).
We use the representations of the model's final hidden layer (before the softmax) as embeddings. If the evaluation dataset selected by the user has ground-truth annotations, SEAL groups the clusters by error type (false positives and false negatives for binary classification). The visualization component of the SEAL interface shows the error clusters and their types using colors and symbols, respectively. We use the standard heuristic of setting the number of clusters in k-means++ to approximately √(n/2), where n is the group size.
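A minimal sketch of this clustering step, using scikit-learn's k-means++ initialization; the random embeddings stand in for final-hidden-layer representations, and the √(n/2) rule of thumb is the heuristic mentioned above:

```python
# Sketch of SEAL's grouping step: k-means++ over the embeddings of the
# high-loss slice, with k chosen by the common sqrt(n/2) rule of thumb.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))  # stand-in for layer embeddings
n = len(embeddings)
k = max(2, int(np.sqrt(n / 2)))           # heuristic number of clusters

km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
cluster_ids = km.fit_predict(embeddings)  # one cluster id per datapoint
```

Each cluster is then a candidate systematic-error group to be labeled in the next step.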

Semantic Error Labeling
Semantic error labeling identifies the underlying concept or topic connecting the datapoints in an error group. Systematic errors can be mathematically modeled and fixed by data curation; contrast this with random errors, which can be neither modeled nor fixed this way. Past work analyzing NLP models has shown systematic errors on various tasks, including sentiment classification, natural language inference, and reading comprehension (McCoy et al., 2019; Kaushik et al., 2020; Jia and Liang, 2017). SEAL uses pretrained LLMs (such as GPT-3 (Ouyang et al., 2022) or Bloom (BigScience, 2022)) for semantic labeling of error clusters, which can surface such systematic bugs in model performance. We craft a prompt consisting of an instruction and the examples in the clusters extracted in the previous step, as follows.
def build_prompt(content, task):
    instruction = ("In this task, we'll assign a short and precise label "
                   "to a group of documents based on the topics or concepts "
                   f"most relevant to these documents. The documents are "
                   f"all subsets of a {task} dataset.")
    return instruction + "\n\n" + "\n".join(content)

Here task is the task under consideration, for example 'sentiment classification' in our case. The content argument is a dataframe or dataframe column containing, as strings, the dataset content that the model uses for classification. We experimented with this prompt design in the few-shot setting before adapting it to zero-shot.
For the results and use-case discussion in Section 3, we use the OpenAI GPT-3 API2 via the CLI. The maximum token length is limited to 4000, so we truncate the prompt to that length before feeding it to the model. We observed that for many larger groups of high-loss examples (> 25), SEAL labels degenerate to generic output such as "customer reviews of products", "movie reviews", "restaurant reviews", etc. To prevent this and to generate coherent group labels, we sub-cluster the bigger error groups until their size is < 25. We verified the group labels by running the LDA topic model of Blei et al. (2003) on the examples in each cluster after a pre-processing step. The pre-processing included tokenizing, lemmatizing, and removing stopwords. For each dataset domain, we also removed a domain word list: 'movie, watch, film, character' for the IMDB dataset; 'food, place, location, service, time, room, restaurant' for the Yelp dataset; and 'book, author, pages, read, product' for the Amazon dataset. The concept tokens in the labels assigned by GPT-3 were among the top-6 topics for these datasets.

SEAL also supports querying the dalle-mini API to generate visual features that support error discovery.3 We augment the semantic labels generated by the LLM with a text-to-image model such as dalle-mini. The goal is to further support systematic error discovery, especially for users who are not domain experts in the dataset they are using. For example, it is easy to imagine what 'frozen custard' is, but it might not be obvious what 'hooters slot club' is or what a 'waterfront business in Phoenix, AZ' means. As shown in Figure 3, the visual features help with further analysis and provide clear actionable insights.

Table 1: Results obtained from using SEAL on three sentiment classification datasets. The columns show the group labels generated by GPT-3, the size of the group in the overall evaluation set, and the group accuracy.
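The sub-clustering step can be sketched as a simple recursive split; this is an illustrative reimplementation under our own assumptions, not SEAL's exact code:

```python
# Sketch of the sub-clustering step: groups larger than 25 examples are
# recursively split with k-means++ until every group is small enough for
# the LLM prompt to yield a specific (non-generic) label.
import numpy as np
from sklearn.cluster import KMeans

MAX_GROUP_SIZE = 25

def split_until_small(embeddings, indices=None):
    if indices is None:
        indices = np.arange(len(embeddings))
    if len(indices) <= MAX_GROUP_SIZE:
        return [indices]
    # Split the oversized group in two and recurse on each half.
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(embeddings[indices])
    groups = []
    for c in (0, 1):
        groups.extend(split_until_small(embeddings, indices[labels == c]))
    return groups

rng = np.random.default_rng(0)
groups = split_until_small(rng.normal(size=(120, 16)))
```

Every returned group then fits within the size limit before being sent to the labeling prompt.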

System Architecture
The interface is implemented as a Streamlit3 application with customized HTML/JavaScript components that handle interactions in the tool. We use the Altair library, customized with HTML/JavaScript and CSS, for richer interactive visualization of embeddings. The visual component of the tool lets a user interactively hover over data points and get information about the content, label, prediction, loss, and cluster (as in Figure 5). All the data preprocessing is powered by the Pandas library, and all manipulations of the data (such as extracting the layer embeddings, clustering, etc.) are stored as DataFrames, providing a single interface for users to extend with custom data-processing functions. We also provide preprocessing scripts to generate and cache all data required by SEAL to ensure fast response times in the interface. The scripts include code to run inference (a forward pass) on any HF dataset and model, as well as a hook to extract learned representations from any layer of a loaded model. The workflow in SEAL also enables users to interactively visualize data points with high loss, using the Streamlit slider widget to control the loss quantile that is highlighted in the interface.

Results and Case Study
In this section, we discuss some results using the SEAL pipeline and walk through a case study for an interactive analysis with the tool.

Experimental Results
Table 1 shows the results obtained using SEAL on three sentiment classification datasets, Amazon (McAuley and Leskovec, 2013), Yelp (Zhang et al., 2015), and IMDB (Maas et al., 2011), for Distilbert (Sanh et al., 2019) and Albert (Lan et al., 2020). For each dataset block in the table, we select the subset of group labels that were not generic ("customer reviews", "book reviews") and either had proper names in them, such as "LensCrafters" or "Eragon", or common nouns with properties, such as "trashy movies", "fine dining", "overpriced chain restaurants".4 We then measured model performance on all examples in the evaluation dataset that matched the group description to obtain the group accuracy. Table 2 shows the content of a random sample of examples in the error categories discovered using SEAL.

Figure 4: Snapshot of SEAL showing the table of examples with the highest loss and their clusters.
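The group-accuracy computation can be sketched as follows; the toy dataframe and the naive keyword match stand in for matching examples against a group description:

```python
# Sketch of group accuracy: take all evaluation examples matching a group
# description (a keyword match here, for illustration) and measure
# accuracy on that slice.
import pandas as pd

eval_df = pd.DataFrame({
    "text": ["trashy movie, loved it", "fine dining at its best",
             "trashy movies are my guilty pleasure", "decent burger"],
    "label": [1, 1, 1, 0],
    "pred":  [0, 1, 0, 0],
})

group = eval_df[eval_df["text"].str.contains("trashy movie")]
group_accuracy = (group["label"] == group["pred"]).mean()
print(group_accuracy)  # accuracy on the matched slice only
```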
An unintended but interesting use case of SEAL is discovering mislabeled candidate examples. We found that some groups have labels describing a sentiment, such as "trashy movies" or "terrible food", but the opposite ground-truth sentiment. On further investigation, we found that many of these groups indeed have noisy labels and the model is actually predicting the correct sentiment. Table 3 in the appendix shows a sample of such mislabeled candidate examples from each dataset studied in this paper.
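This mislabel-discovery heuristic can be sketched as follows; the cue-word list and helper function are illustrative assumptions, not SEAL's actual implementation:

```python
# Sketch of the mislabel-discovery use case: flag groups whose generated
# label carries a clear sentiment that contradicts the ground truth of
# most examples in the group.
NEGATIVE_CUES = {"trashy", "terrible", "worst", "awful"}

def is_mislabel_candidate(group_label: str, ground_truth_majority: int) -> bool:
    words = set(group_label.lower().split())
    label_is_negative = bool(words & NEGATIVE_CUES)
    # A negative-sounding group label over positive (1) ground truth
    # suggests the annotations, not the model, may be wrong.
    return label_is_negative and ground_truth_majority == 1

print(is_mislabel_candidate("Terrible food", 1))  # likely noisy labels
print(is_mislabel_candidate("Fine dining", 1))    # nothing suspicious
```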
Limitations. SEAL relies on the semantic robustness of the labeling LLM, such as GPT-3. We did not test cluster labeling on NLP tasks that require understanding semantic phenomena or function words.

Case Study
SEAL, with its interactive interface, enables practitioners to discover possible systematic errors in their models. In this section, we walk through a case study using the SEAL interface.

Mathematical robustness of SEAL
In this section, we provide theoretical guarantees for the stability of the semantic labels generated by the SEAL pipeline. More specifically, our stability theorem states that a small perturbation of the input of the SEAL pipeline causes only a small, bounded difference in the semantic labels. An implication of our theoretical result is that, even if two users use different versions of an evaluation set (e.g., a different split, or a smaller subset), SEAL will generate similar semantic labels.
More formally, we ask: how does a small change in the input dataset {(x_i, y_i)}_{i=1}^n affect the semantic label tuple M ≜ {m_k}_{k=1}^K? Here, K denotes the number of explanations, and m_k ≜ (w_k, s_k, a_k) encodes the k-th explanation message, where w_k, s_k, and a_k represent the sentence vector, the number of data points explained by this message, and the average accuracy among those data points. We show that, under some assumptions, the output of SEAL, i.e., the set of the m_k, is relatively robust to randomness in the input dataset.

Content / Label / Pred

Club reviews
Being from Southern California, the "scene" is so much fun. There are several clubs to go to and any night is a great time. That brings us to the Phoenix scene and The Cash. Oh wait, there is no scene for the ladies. Not going to bash them to hard, because it's the only consistent place that we have. Yes it caters to the Country music crowd, but they do play spurts of other music through out the weekend evenings. The mixed drinks could be better, but the prices are reasonable. (label: 0, pred: 1)

I used to come here for years, maybe about a year back.. the best weekend drinkfests back then: Fridays were ladies night (dollar well, wines and domestics, $2 you call its, and no cover). Saturdays were free beer night (draft bud light, coors light and pbr til they gave out 1,000 of each.. again, no cover). Was always packed and played a decent variety of music; pitchers for beer pong were also always dirt cheap. And despite, the bartenders were way personable and fun. I'm not trying to sound like a cheapskate, as I am in the service industry myself.. but there must've been a change of ownership since my prior experiences. [..] (label: 0, pred: 1)

Dentist reviews
Thank you for all the emails you sent me on my review! I was surprised at how many responses I recieved from people searching for the right dentist.. I shared my new dentist information and even got some movie tickets from my dentist for the referrals! I find it funny how since I wrote this review how many people have reviewed with 5 stars... They must have a lot of friends and family! I hope everyone reads my review and picks the right dentist for your needs! Happy Holidays (label: 0, pred: 1)

After dealing with a two week long migraine and severe pressure and pain in my face, I called around looking for an ENT that could get me in ASAP. Dr. Simms was available for a same day appointment and I scheduled with him for that afternoon. [..]
To be more precise, we need a distance metric on the explanation message space.
Definition 4.1. Given any two semantic label tuples M = {m_k}_{k=1}^K and M' = {m'_k}_{k=1}^K, define the distance

d_max(M, M') ≜ max_{1≤i≤K} min_{1≤j≤K} ( ∥m_i − m'_j∥_2 + ∥m_j − m'_i∥_2 ).

Remark. The ℓ2 distance ∥·∥_2 is defined on the vectorized explanation. In other words, we concatenate the sentence vector, the number of data points, and the accuracy value into one single vector, and then measure the distance between two explanation messages as the distance between their corresponding expanded vectors.
Here, a small distance value d_max implies a small difference in the explanation word vector, the size of each cluster, and the accuracy within each cluster. To see this, note that a small distance implies that for any messages m_i and m_j in M, one can find two other messages m'_i and m'_j in M' which are close to them. That is to say, for any message in M, there is a message in M' approximately equal to it. Now we can answer the question raised above.
Theorem 1. Let S and T denote two sets of n data points drawn i.i.d. from some data distribution P. Suppose the probability space of P is compact with size B, and the density function is bounded. Let M_S and M_T be the semantic label tuples generated by SEAL with inputs S and T. If S and T differ in o(√n) data points, and the clustering algorithm gives the exact optimal solution, then d_max(M_S, M_T) = o(1) as n → ∞.

The proof of this theorem is in the Appendix. It implicitly relies on Lipschitz continuity of the sentence-generation network, which holds for most DNNs with finite input space. This indicates that SEAL is robust to small perturbations of the input dataset: a small shift in the input dataset leads only to a small change in the explanations. Such smooth explanation change is particularly useful when users gradually update their dataset.

Conclusion
In this work we introduced SEAL, an interactive visualization tool for discovering systematic errors and labeling them. Through case studies, we showed how SEAL can efficiently identify the systematic failures of state-of-the-art sentiment classification models on well-known datasets. We released a set of pre-computed model outputs to enable easy, out-of-the-box use, especially for non-coding audiences such as domain experts. We hope this work will positively contribute to the ongoing efforts in building tools for systematic error analysis and model debugging.

Ethics Statement
Many datasets currently used and open-sourced by the NLP community are mainly crawled from the web and are therefore not representative of a majority of geographies. Biases can distill into the parameters of models trained on such biased datasets and may be further amplified in the generated model outputs. All datasets we experimented with are in English, and all models are trained on English datasets.
We use GPT-3 for semantic labeling, and it is well known that LLMs such as GPT-3 can generate toxic, harmful, and hateful content that might also have percolated into our tool. Similarly, the semantic similarity metrics used in our tool, including BERTScore and the word embeddings, carry the biases of the data they were trained on. We ask our users to be aware of these ethical issues, which might affect their analyses.

A Appendix: Proofs
Proof. Here we give the proof of Theorem 1. To proceed, we need a few lemmas.
Lemma 2 (adapted from Proposition 5.1 in Rakhlin and Caponnetto (2006)). Assume the density of P (with respect to the Lebesgue measure λ over Z) is bounded away from 0, i.e., dP > µ dλ for some µ > 0. Suppose the clusterings A and B are minimizers of the k-means objective W(C) over the sets S and T, respectively, and that at most o(√n) data points differ between the two datasets S and T sampled from P. Then

max_{1≤i≤K} min_{1≤j≤K} ( ∥c_S,i − c_T,j∥ + ∥c_S,j − c_T,i∥ ) = o(1),

where c_S,i and c_T,i are the centers of the i-th clusters generated from S and T, respectively.
Lemma 3. Assume the density of P (with respect to the Lebesgue measure λ over Z) is bounded away from 0, i.e., dP > µ dλ for some µ > 0. Suppose max_{1≤i≤K} min_{1≤j≤K} ( ∥c_S,i − c_T,j∥ + ∥c_S,j − c_T,i∥ ) ≤ ε, and that the ML model generating the sentence vector is Lipschitz continuous with parameter β. Then d_max(M_S, M_T) ≤ c·ε, for a constant c that depends only on β, K, and B.
Proof. We first note that, by the triangle inequality, the inner minimization in d_max(M_S, M_T) is over sums of the form ∥w_S,i − w_T,j∥_2 + ∥w_S,j − w_T,i∥_2 + ∥s_S,i − s_T,j∥_2 + ∥s_S,j − s_T,i∥_2 + ∥a_S,i − a_T,j∥_2 + ∥a_S,j − a_T,i∥_2. Since min_j {a_j + b_j + c_j} ≤ max{3 min_j a_j, 3 min_j b_j, 3 min_j c_j}, the inner minimization is bounded by 3 times the maximum of

min_{1≤j≤K} ∥w_S,i − w_T,j∥_2 + ∥w_S,j − w_T,i∥_2,
min_{1≤j≤K} ∥s_S,i − s_T,j∥_2 + ∥s_S,j − s_T,i∥_2,
min_{1≤j≤K} ∥a_S,i − a_T,j∥_2 + ∥a_S,j − a_T,i∥_2.

Now let us consider these terms separately.

1. min_j ∥w_S,i − w_T,j∥_2 + ∥w_S,j − w_T,i∥_2: By Lipschitz continuity, the distance between two sentence vectors can be bounded by the distance between their corresponding cluster centers. More precisely, ∥w_S,i − w_T,j∥_2 ≤ β ∥c_S,i − c_T,j∥, and thus min_j ∥w_S,i − w_T,j∥_2 + ∥w_S,j − w_T,i∥_2 ≤ β min_j ( ∥c_S,i − c_T,j∥ + ∥c_S,j − c_T,i∥ ). By the assumption, the right-hand side is bounded by βε.

2. min_j ∥s_S,i − s_T,j∥_2 + ∥s_S,j − s_T,i∥_2: By the assumption, for any given i we can find j such that ∥c_S,i − c_T,j∥ + ∥c_S,j − c_T,i∥ ≤ ε; that is, the distance between the matched cluster centers is at most ε. Since the distribution space is bounded by B, at most 2εB data points are clustered differently between any such pair of clusters. As there are K clusters, in total at most 2εK²B data points are clustered differently. This gives a natural upper bound of order εK²B on the difference in cluster sizes.

3. min_j ∥a_S,i − a_T,j∥_2 + ∥a_S,j − a_T,i∥_2: Applying a similar argument as in 2, in total at most 2εK²B data points are clustered differently, so at most 2εK²B data points affect the accuracy value, which bounds this term by a quantity of order εK²B as well.

Combining these results, we obtain a bound of the form c·ε, with c depending only on β, K, and B. This bound is independent of i, so we can take the maximum over i, which gives d_max(M_S, M_T) ≤ c·ε and completes the proof.
Combining the above two lemmas directly proves the robustness statement.

B Appendix: More Examples
Group label

Amazon
Customer reviews for a product that has been discontinued
Another reviewer recently advised that this is the model to look for. I was just advised at a well known retailer that this model has been discontinued. Is this true or is this a classic bait-and-switch technique? Their current weekly sales circular features this model at a sale price. When you get to the store, they don't have it but when they look it up in their computer, it shows up as "Discontinued". It is difficult to relate reviews to actual products when the reviews you base your buying decision on could be about (a) different model(s) from the one you actually buy online or in-store. The Creative Labs' own website does not give model numbers so they are adding to the confusion.
The software mentioned on my May 16th review IS called "AVID Xpress" -not "AVID Express" -when my review was edited someone changed the spelling, possibly thinking it was a typo/mistake?
Although this show is very fascinating I find every episode to be almost the same. Starting with Morgan Freeman stating "when I was a young boy..." then something he did to get in trouble, or something he witnessed that ruined his fragile eggshell mind. Followed by rhetorical questions and theories, and tons and tons of examples. The examples even have examples. Maybe I just understand this stuff and the show really dumbs it down, but I feel like I wasted money investing in season 3. Which by the way, although not currently available, (I don't have cable and I still had the privilege of watching this before the DVD came out) but I will still probably end up buying it on DVD which is cheaper than I already paid for the electronic proprietary/DRM version on Amazon Unbox

Unreliable book reviews
I do not intend to review content here. This new edition is so full of typographical errors that sometimes the reader will have to intuit what the author really wrote. It is clear that the proofreaders of this edition were not actually reading; they were simply following the little red lines under the "misspelled" words. This has resulted in some truly bizarre apparent statements by the author, unreproducible here due to copyright laws. Disclaimer - I have not purchased this book, merely checked it out of the library.
It's been several years since I've read "Silent Spring," one of the most significant environmental books ever written, but I must respond to the posting by "seem," which is titled "murderous, over the top propaganda" (I correctly your misspelling of the last word): His recommendation to read "DDT: A Case Study in Scientific Fraud" was put out by the Heartland Institute and is, in itself, a "fraud."The Heartland Institute is one of the most pro-chemical, pro-industry, anti-environmental and right-wing organizations around.Nothing they put out should be believed for a second.
Shame on all the booksellers selling this ten dollar book for $75 and up!Devorss is re-publishing this book in August!I took note of the sellers AND WILL NEVER BUY FROM THEM!

Yelp
Terrible dry cleaners in Phoenix
I went here for the first time on First Fridays, yeah so what. I promise that I won't hang out here all the time and ruin it for all you true Bikini lovers. My mini pitcher was $3.50 and then 5 minutes later a chick walked up and got charged $6.00 for two mini pitchers, hmmm, male discrimination or they can't do simple math? I'll only go back when it's 110 outside and want to put a buzz on early in the afternoon.
Mediocre dry cleaning.I want to like this business..why? 1.I like to support Yelp advertisers 2.prime location!!!It is literally around the corner from me and I will probably still go there once in awhile out of convenience.Once or twice I called rushing to get there before they closed and they waited a minute over closing time which was very nice of them.However, this review is simply based off of satisfaction with my clothing.Almost every time I have come there I have to ask to redo my shirts.It drive me nuts because the employees are nice about it.When I woke up today and had 50 dollars worth of clothing needed to be dry cleaned I drove 20 minutes to my old favorite cleaners in Arcadia.I knew that I trust them with my clothes and after years, never had to deal with such an inconvenience.I'm sorry but had to only give 3 stars.I might be back one more time..only when I have to.John I read your message and appreciate that so I updated my review out of appreciation towards your response.I want to come back because it is convenient.Thanks for caring This place is tiny and has more high-end expensive beads than other stores in Phoenix.I've found some really special items here.You shouldn't expect to buy more than a few strands at a time, as it just isn't affordable.Go somewhere else for quantity, and just get a few things to spice up your mix from Bead World.

IMDB
Terrible movies
You have to be awfully patient to sit through a film with one-liners so flat and unfunny that you wonder what all the fuss was about when WHISTLING IN THE DARK opened to such an enthusiastic greeting from audiences in the 1940s. On top of some weak one-liners and ordinary sight gags, the plot is as far-fetched as the tales The Fox (Red Skelton) tells his radio audience. You have to wonder why anyone would think he could come up with a real-life solution on how to commit the perfect crime and get away with it. But then, that's how unrealistic the comedy is. But - if you're a true Red Skelton fan and enjoy a look back at how comedies were made in the '40s - you can at least enjoy the amiable cast supporting him. Ann Rutherford and Virginia Grey do nicely as his love interest and Conrad Veidt, as always, makes an interesting villain. One of his more amusing moments is his reaction to Skelton explaining the mysteries of wearing turbans. "I never knew that," he muses, impressed by a minor point that is cleverly introduced. All in all, typical nonsense that requires you to accept the lack of credibility and just accept the gags as they are. Not always easy for a discriminating viewer as many of them simply fall flat, the way many comedies of this era do because the novelty of the sight gags and one-liners has simply worn off.
If they gave out awards for the most depraved and messed-up movies in the world, Japanese cinema would clean up: their exploitation cinema wipes the floor with most other contenders, the most extreme examples being absolutely jaw-dropping exercises in bad taste, nauseating gore, freakish weirdness, and misogynistic sex. Guts of a Beauty is a prime example of such whacked out filth, offering discerning viewers just over an hour of full-on debauchery and gratuitous violence topped off with some very insane J-splatter goodness. The film opens with a young woman named Yoshimi, whose search for her missing sister has led her into the hands of some nasty yakuza, who proceed to rape her and shoot her full of strong dope called Angel Rain [...]

European Union movie is disappointing and full of clichés
**SPOILERS AHEAD** It is really unfortunate that a movie so well produced turns out to be such a disappointment. I thought this was full of (silly) clichés and that it basically tried to hard. To the (American) guys out there: how many of you spend your time jumping on your girlfriend's bed and making monkey sounds? To the (married) girls: how many of you have suddenly gone from prudes to nymphos overnight - but not with your husband? To the French: would you really ask about someone being "à la fac" when you know they don't speak French? Wouldn't you use a more common word like "université"? I lived in France for a while and I sort of do know and understand [...]
Obviously made on the cheap to capitalize on the notorious "Mandingo," this crassly pandering hunk of blithely rancid Italian sexploitation junk really pours on the sordid stuff with a commendable lack of taste and restraint: The evil arrogant white family who own and operate a lavish slave plantation spend a majority of the screen time engaging in hanky panky both each other and their various slaves[...]

Figure 1 :
Figure 1: SEAL interactive tool for discovering systematic errors in model performance. Steps 1 and 2 include extracting the model embeddings and clustering datapoints with high loss. Steps 3 and 4 include semantic labeling of error groups and generating visual features to support debugging.

Figure 2 :
Figure 2: SEAL interface showing high-error groups for the distilbert-base-uncased model evaluated on the yelp_polarity dataset. The interface comprises various components: (a) examples from the dataset in the high-error groups (sorted by loss), (b) statistics of tokens in high-error groups relative to the entire evaluation set, (c) an interactive 2-D visualization of the model embeddings showing groups of errors in color and low-loss groups in gray. The colors indicate different error clusters. If the dataset has annotated classes, the visualization includes symbols to represent those classes (⋄ and • in the above figure). The panel on the left has multiple widgets that a user can control to interactively understand their model's mispredictions relative to the rest of the model's outputs. Apart from the dataset and model, the user can select the loss quantile they want to examine for systematic errors, whether they want SEAL to group those errors using k-means++ (and with how many clusters), and how many data points they want to visualize at a time in the visual component of the interface, downsampled proportionally to the group size (we use Altair for plotting, which supports a maximum of 5,000 data points visualized at once).

Figure 3 :
Figure 3: Examples of visualizations generated using Dalle-mini (Craiyon) for a sample of error groups.

Figure 5 :
Figure 5: Snapshot from the SEAL interface highlighting a group of high-loss examples that are candidates for a systematic error type in which reviews describe customer experiences exceeding their expectations of the place.

Table 2: Random sample from under-performing groups discovered by SEAL for the Yelp dataset. Results for other datasets are in Table 4 in the appendix. 0 and 1 indicate negative and positive sentiment classes respectively. Reviews ending in [..] have been truncated to save space.

The wait time itself wasn't bad - 10-15 minutes after completing paperwork. Dr. Simms was personable enough and after evaluating me, told me that he would like to treat for a sinus infection with antibiotics and prednisone. As I had just moved and newly became a student, I didn't yet have health insurance set up. [..]

's on a recommendation from my parents. Living in San Diego, I never go to chain Mexican places - there are just too many other places to try. I was expecting Cozymel's to be okay, nothing great. We went for lunch, and I was happy to see a whole page of lunch specials for about $8. Usually, an enchilada combo plate could set you back close to $15 at a Mexican chain. Not here (during lunch at least). I ordered the taco salad with black beans instead of meat. It came in an enormous flour tortilla shell - tostada style. [..] (label: 1, pred: 0)

I still can't get over how I paid $2.99 for a coffee and 3 doughnuts! What a deal. I was debating whether or not to go to Krispy Kreme or Winchells but decided on the latter since it wasn't a chain and I could get Krispy Kreme elsewhere... Winchell's shares space with Subway which was a little random but I didn't have any problem with it because the woman helping me and what I assume to be the owner were both very nice and sweet. I hadn't eaten doughnuts in a little over a year so I decided to go with a boston creme (one of my favorites) and got a chocolate glazed chocolate doughnut for my sister and a glazed for my friend. [..]

Table 3 :
Mislabeled candidate examples for the three sentiment classification datasets. All the examples have ground truth labeled as positive. Examples ending in [..] have been truncated to save space.