IFAN: An Explainability-Focused Interaction Framework for Humans and NLP Models

Interpretability and human oversight are fundamental pillars of deploying complex NLP models into real-world applications. However, applying explainability and human-in-the-loop methods requires technical proficiency. Despite existing toolkits for model understanding and analysis, options to integrate human feedback are still limited. We propose IFAN, a framework for real-time explanation-based interaction with NLP models. Through IFAN's interface, users can provide feedback on selected model explanations, which is then integrated through adapter layers to align the model with human rationale. We show the system to be effective in debiasing a hate speech classifier with minimal impact on performance. IFAN also offers a visual admin system and API to manage models (and datasets) as well as control access rights. A demo is live at https://ifan.ml.


Introduction
As Natural Language Processing (NLP) systems continue to improve in performance, they are increasingly adopted in real-world applications (Khurana et al., 2022). Large Language Models (LLMs)-such as GPT-3 (Brown et al., 2020), BLOOM (Scao et al., 2022a), and T5 (Raffel et al., 2020)-are undoubtedly the main protagonists of recent advances in the field. They substantially outperform previous solutions while being directly applicable to virtually any NLP task.
There are, however, strong concerns given the black-box nature of such architectures (Madsen et al., 2022; Mosca et al., 2022a). In fact, their large scale and high complexity are substantial drawbacks in terms of transparency, accountability, and human oversight. Beyond ethical considerations, legal guidelines from the European Union now explicitly define these interpretability factors as essential for any deployed AI system (European Commission, 2020). Research efforts in eXplainable Artificial Intelligence (XAI) (Arrieta et al., 2020; Mosca et al., 2022b) and Human-in-the-Loop (HitL) machine learning (Monarch, 2021) have thus been on the rise, producing solutions that aim at mitigating the current lack of interpretability. Most notably, the recent literature contains a number of toolkits and frameworks to analyze, understand, and improve complex NLP models (Wallace et al., 2019; Liu et al., 2021). Some of them even offer low-code interfaces for stakeholders who do not possess the otherwise required technical proficiency. Nonetheless, current options to collect human rationale and provide it as feedback to the model are still limited.
We propose IFAN, a novel low-to-no-code framework for interacting in real time with NLP models via explanations. Our contribution can be summarized as follows: (1) IFAN offers an interface for users to provide feedback on selected model explanations, which is then integrated via parameter-efficient adapter layers.
(2) Our live platform also offers a visual administration system and API to manage models, datasets, and users as well as their corresponding access rights.
(3) We show the effectiveness of our framework in debiasing a hate speech classifier and propose a feedback-rebalancing step to mitigate the model's forgetfulness across updates.
IFAN's demo is accessible at ifan.ml together with its documentation. Full access is available with login credentials, which we can provide upon request. A supplementary video showcase can be found online.

HitL with Model Explanations
Human-in-the-Loop (HitL) machine learning studies how models can be continuously improved with human feedback (Monarch, 2021). While a large part of the HitL literature deals with label-focused feedback such as active learning, more recent works explore how explanations can be leveraged to provide more detailed human rationale (Lertvittayakumjorn and Toni, 2021).
Combining classical HitL (Wang et al., 2021) with explanations to construct human feedback for the model (Han et al., 2020) has been referred to as Explanation-Based Human Debugging (EBHD) (Lertvittayakumjorn and Toni, 2021). Good examples are Ray et al. (2019), Selvaraju et al. (2019), and Strout et al. (2019), which show improvements in performance and interpretability when iteratively providing models with human rationale.
A more NLP-focused EBHD approach is Yao et al. (2021), where the authors leverage explanations to debug and refine two transformer instances-BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Concretely, word saliency explanations at different levels of granularity are shown to humans, who in turn provide suggestions in the form of natural language. The annotators' feedback is converted into first-order logic rules, which are later utilized to condition learning on new samples.

Interactive NLP Analysis Platforms
In the recent literature, we can find strong contributions in terms of software and digital toolkits to analyze and explain NLP models (Wallace et al., 2019; Hoover et al., 2020) as well as to further refine them via parameter-efficient fine-tuning (Beck et al., 2022).
For instance, Liu et al. (2021) propose EXPLAINABOARD, an interactive explainability-focused leaderboard for NLP models. In more detail, it allows researchers to run diagnostics about the strengths and weaknesses of a given model, compare different architectures, and closely analyze predictions as well as recurring model mistakes. Similarly, the LANGUAGE INTERPRETABILITY TOOL by Tenney et al. (2020) is an open-source platform and API to visualize and understand NLP models. In particular, it provides a browser-based interface integrating local explanations as well as counterfactual examples to enable model interpretability and error analysis.
Finally, Beck et al. (2022) release ADAPTERHUB PLAYGROUND, a no-code platform for few-shot learning with language models. Specifically, the authors built an intuitive interface where users can easily perform prediction and training of complex NLP models on several natural language tasks.

IFAN
The Interaction Framework for Artificial and Natural Intelligence (IFAN) is a web-based platform for inspecting and controlling text processing models. Its main goal is to decrease the opacity of NLP systems and integrate explanation-based HitL into their development pipeline. Through our interface, stakeholders can test and explain models' behavior and-when encountering anomalies in predictions or explanations-fix them on-site by providing feedback.
The main blocks of the platform are presented in Figure 2. The Backbone part contains all machine learning development components-datasets and models. We adopt HuggingFace formats (see 3.3 and 3.4) (Wolf et al., 2020) and wrap the entire backbone as a Docker image for deployment. The User Interface is the visual component of the platform, where all the human-machine interaction takes place. Here, developers also have access to additional visual resources to configure details about models, datasets, and users.

The connection between the backbone and the user interface is managed by the Admin component. All user data and rights, as well as samples receiving feedback, are stored in a PostgreSQL database instance. The communication is handled via Python Django, which integrates everything w.r.t. user authentication, API calls/responses, state logs, and location of backbone resources. In the next sections, we provide a more detailed description of the main platform components.

User Interface
Our frontend is built with Bootstrap and JavaScript. Currently, the pages available in our UI are the following:

Landing Page Here, users get a short introduction to IFAN. We briefly explain our platform's goals, the concept of HitL, and how our framework can be integrated into the development of NLP models.
Documentation It provides a detailed description of all the UI components together with screenshots and guidelines. Here, users can find specific instructions on how to configure and interact with our platform.
Feedback This is the main interaction page. Here, users can run a model on an input sample either taken from the dataset or written by themselves. Then, they can load the model's prediction and explanations and provide feedback on both the label and the features' relevance.
Report This page has limited access (see 3.2). Developers can evaluate models before and after feedback incorporation on a chosen dataset as well as inspect misclassified samples.
Configuration This page has limited access (see 3.2). Here, developers can configure and manage the platform. More specifically, users can be created, modified, and deleted as well as upgraded or downgraded in their roles and access rights. Developers can also manage models and datasets and specify the currently active ones.
Account Settings Each authorized user can view, edit, export, and delete their account data (GDPR compliance) as well as reset their login password.

Users
The platform separates users into three tiers: developers, annotators, and unauthorized users (Table 1).
Unauthorized users do not possess login credentials and have limited access to the platform. They can visualize model predictions and explanations, but their feedback is not considered. Normal users (or annotators) are identified by their credentials and can thus actively engage with the model. During a HitL iteration, they can use the feedback page with pre-configured datasets and models, test the model on a text sample, view explanations, and provide feedback if needed.
Developers have full access and can configure all aspects of the platform. More specifically, they have access to the report and configuration pages (see 3.1) and can thus manage everything regarding users, roles, API access, models, and datasets.
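For illustration, the access tiers of Table 1 can be thought of as a simple role-to-permission mapping, as in the sketch below; the permission names are our own illustrative labels, not IFAN's actual identifiers.

```python
# Illustrative encoding of the access tiers in Table 1.
ACCESS_RIGHTS = {
    "unauthorized": {"view_predictions", "view_explanations"},
    "annotator":    {"view_predictions", "view_explanations",
                     "submit_feedback"},
    "developer":    {"view_predictions", "view_explanations",
                     "submit_feedback", "manage_models",
                     "manage_datasets", "manage_users", "api_access"},
}

def can(role: str, action: str) -> bool:
    """Check whether a user role is allowed to perform an action."""
    return action in ACCESS_RIGHTS.get(role, set())
```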

Datasets
Before exploring a model's behavior, the active dataset must be specified via the configuration page (see 3.1). This is the dataset from which the text examples for model testing are sampled.

Table 2 shows two examples of datasets already available on our platform:

HateXplain (Mathew et al., 2021): a dataset for hate speech classification with 3-class hate type labels, target community annotations, and rationales.

GYAFC (Rao and Tetreault, 2018): a formality detection dataset corresponding to two-class classification: formal and informal.

We conform to a standard format by using the HuggingFace Datasets library. Developers interacting with our platform are strongly encouraged to adhere to this standard when uploading new datasets and making them available to the interface.

Models

To incorporate feedback into our models, we utilize adapter layers (Houlsby et al., 2019), a parameter-efficient fine-tuning technique. Adapters are integrated on top of each language model unit (e.g., transformer block) and are trained on the human feedback while all other model weights remain frozen. Adapters can also be disabled to recover the original state of the model.
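For illustration, below is a minimal PyTorch sketch of this setup, assuming a BERT classifier; the bottleneck size, module names, and the hook-based wiring are our own illustrative choices, not IFAN's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby et al., 2019): down-project,
    nonlinearity, up-project, plus a residual connection."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
for p in model.parameters():
    p.requires_grad = False  # freeze all original model weights

# One adapter per transformer block, attached via forward hooks so the
# base model code stays untouched and adapters can be detached at will.
adapters = nn.ModuleList(
    Adapter(model.config.hidden_size) for _ in model.bert.encoder.layer)

def attach(block, adapter):
    def hook(module, inputs, output):
        # BertLayer returns a tuple; rewrite its hidden states in place
        return (adapter(output[0]),) + output[1:]
    return block.register_forward_hook(hook)

handles = [attach(b, a) for b, a in zip(model.bert.encoder.layer, adapters)]

# Only adapter parameters are passed to the optimizer for the feedback step.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
```

Calling `handle.remove()` on each hook detaches the adapters and restores the original model, mirroring the "disable" behavior described above.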

Explanations & Feedback Mechanism
Users can evaluate the active model on the active dataset through the Feedback page. They may input text in three ways: (i) writing a text sample themselves; and, if authorized, (ii) sampling a random text from the active dataset; or (iii) sampling a random misclassified text from the test split of the active dataset. Users receive the classification result and the model's confidence. They can assess the result and correct any misclassification.
To further inspect the model's behavior, we provide two types of explanations-local and global. For local explanations on a text sample, we display the features relevant to each output class (Figure 4). We attribute scores using the LIME framework (Ribeiro et al., 2016) and-to filter out weak correlations-highlight as relevant only tokens with a score above the threshold θ = 0.1. On the global side, we list the most influential unigrams for each output class. These can be inspected to extract insights about which keywords and patterns the model focuses on at the dataset level. For all unigrams present in a dataset, the corresponding classification scores are calculated, and the top-scoring tokens are displayed on the page. Annotators can easily edit the highlighted tokens and send the updated explanation as feedback. We store the result-i.e., the highlighted relevant parts-and use it to fine-tune the adapter layers. Freezing all other model weights minimizes the computational effort of the feedback step.
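As an illustration, the snippet below shows how the local-explanation step could look with the LIME package; the threshold matches the value above, while the model name and class names are placeholders for the currently active classifier.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

# Placeholder classifier; IFAN would load the currently active model.
clf = pipeline("text-classification", model="bert-base-uncased", top_k=None)

def predict_proba(texts):
    """LIME expects an (n_samples, n_classes) probability matrix."""
    outputs = clf(list(texts))
    return np.array([[c["score"] for c in sorted(out, key=lambda c: c["label"])]
                     for out in outputs])

explainer = LimeTextExplainer(class_names=["non-toxic", "toxic"])
explanation = explainer.explain_instance(
    "an input sample to explain", predict_proba, num_features=10)

THETA = 0.1  # relevance threshold from the text above
relevant = [(token, score) for token, score in explanation.as_list()
            if abs(score) > THETA]
print(relevant)  # tokens that would be highlighted in the UI
```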
Regarding the fine-tuning procedure, directly using the highlighted feedback text for adapter fine-tuning causes significant losses in the original model performance. We propose mixing the feedback with original samples to mitigate this effect, which allows effective feedback incorporation while reducing model forgetfulness (see Section 4 for more details).
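The sketch below illustrates one way to implement this mixing step, with counts matching the case study in Section 4; the field names and the fixed ratio are assumptions, since the rebalancing ratio is a tunable hyperparameter (see Limitations).

```python
import random

def build_rebalanced_set(feedback_samples, original_dataset,
                         repeats=3, n_original=500, seed=0):
    """Mix annotated feedback with class-balanced original samples."""
    rng = random.Random(seed)
    feedback = feedback_samples * repeats  # e.g. 40 n-grams -> 120 samples
    toxic = [s for s in original_dataset if s["label"] == 1]
    benign = [s for s in original_dataset if s["label"] == 0]
    mixed = (feedback
             + rng.sample(toxic, n_original // 2)
             + rng.sample(benign, n_original // 2))
    rng.shuffle(mixed)
    return mixed  # used to fine-tune the adapter layers only
```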

Backbone API
We expose our backbone's API to make all essential dataset/model management functions available. These provide a high-level interface for additional experiments dealing with model evaluation, explanation, and feedback. The API is built with the Python framework FastAPI; more details can be found in Appendix A.
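To give a flavor of such an interface, the following is a minimal FastAPI sketch with two endpoints named after the functional groups from Appendix A; the routes, request schema, and return values are illustrative assumptions rather than IFAN's actual API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="IFAN backbone (illustrative sketch)")

class PredictIn(BaseModel):
    path: str  # which backbone model to load
    text: str  # input sample

@app.post("/prediction")
def predict(req: PredictIn):
    # Would load the model at req.path and classify req.text.
    return {"label": "toxic", "confidence": 0.99}

@app.post("/explanation")
def explain(req: PredictIn, explainer: str = "lime"):
    # Would run the chosen explainer (e.g. LIME) on the model's prediction.
    return {"tokens": [], "scores": []}
```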

Case Study
We carried out a case study to test the applicability of IFAN.We chose a hate speech detection task based on the HateXplain dataset (Mathew et al., 2021).The goal of the experiment was to use our framework to debias a given hate speech detector.
First, we converted the original dataset into a binary classification task with the labels "toxic" and "non-toxic". We chose the Jewish subgroup as the target for our debiasing process. We fine-tuned BERT (Devlin et al., 2019) on this task and provided it with feedback. Additional experiments with BLOOM and other LLMs are provided in Appendices D and E, respectively.
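Since HateXplain is originally annotated with three classes (hate, offensive, normal), a relabeling step is implied; a minimal sketch is given below, where mapping everything non-normal to "toxic" is our assumption rather than a detail specified in the text.

```python
# Hypothetical relabeling of HateXplain's 3-way annotation into the
# binary scheme used in the case study; the exact mapping is an assumption.
def to_binary(label_3way: str) -> int:
    # "hatespeech" and "offensive" -> toxic (1); "normal" -> non-toxic (0)
    return 0 if label_3way == "normal" else 1
```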
We annotated 24 random misclassified samples: the 12 with the highest and the 12 with the lowest confidence scores (see Appendix C.1). We invited 3 annotators to participate in the annotation process. The n-grams modified by the annotators were saved and used to create a new training dataset for the adapters. As a result, we collected 40 annotated n-grams and repeated them to obtain 120 training samples. To complete the new training set, we balanced these samples with 500 original samples (250 toxic, 250 non-toxic) randomly selected from the HateXplain dataset. The results are presented in Table 3. We observe that the non-balanced training dataset, which only contains feedback on the most confidently misclassified samples, resulted in a significant decrease in performance. While the inclusion of feedback on the least confident samples caused a slight decline in the overall F1 score, adapter training on the balanced feedback led to an improvement in the precision score for the Jewish target group.
Figure 5 shows the changes in the detector while fine-tuning with the collected feedback. When rebalancing the feedback, only the annotated samples change drastically, while the performance on the original texts is only slightly affected. A detailed comparison between fine-tuning on non-balanced and balanced feedback is given in Appendix C.2.

Limitations & Future Work
As of now, our feedback system is limited to applications in the sequence-to-class format. Work on extending the platform to further tasks through LLM prompting is currently in progress (see Appendix E).
At the same time, we currently offer a limited set of explanation, feedback, and management options, which we plan to extend in the immediate future. A small user study has been conducted (Appendix B) to collect feedback about the platform and improve its user-friendliness. We intend to continue iterating on new features through trials with developers and laymen.
Finally, our experiments do not yet show clear trends w.r.t. the correlation between performance and feedback hyperparameters. Indeed, further research and trials have to be carried out to establish optimal choices for the number of feedback samples, fine-tuning epochs, and the rebalancing ratio.

Conclusion
This work proposes IFAN, a framework focused on real-time explanation-based interaction between NLP models and human annotators. Our contribution is motivated by the limited options among existing tools to interpret and control NLP models.
IFAN is composed of three main units. The Backbone unifies all the machine learning pipelines and exposes an API for accessibility. The User Interface-organized into landing, documentation, feedback, report, and configuration pages-provides an intuitive visual component to interact with models. Finally, the Admin component controls the connection between the two previous units.
Additionally, we introduce a feedback mechanism that takes advantage of adapter layers to efficiently and iteratively fine-tune models on the downstream task. Our experiments show the framework's effectiveness at debiasing a hate speech classifier with minimal performance loss.
We believe IFAN to be a valuable step towards enabling the interpretable and controllable deployment of NLP models-allowing users with no technical proficiency to interact with and provide feedback to deployed NLP systems. Regarding future work, we set as a priority extending the framework to more NLP tasks as well as integrating additional model analysis features and feedback mechanisms.

Ethical Considerations
In this work, we presented experiments on hate speech model debiasing. Hate speech detection is a task that requires particular care to ensure fair outcomes. One persistent issue is bias, especially against minority groups, stemming from prejudices. We aimed to show an example of how a model can be debiased with respect to a target group. With a conscientious selection of annotators and feedback, we hope that our proposed platform will serve to efficiently adapt NLP models to a diverse world.
For these reasons, we also believe that the interpretability and controllability of modern NLP models and systems are fundamental pillars for their ethical and safe deployment (European Commission, 2020). This work aims to have a positive impact on both aspects, as it provides a tool to explain models and provide them with feedback. By reducing the technical proficiency required to interact with NLP systems, we hope to facilitate the process of providing valuable human rationales to influence complex models.
Ensuring high quality of the human feedback is challenging (Al Kuwatly et al., 2020), and exposing models to external influence can be exploited by adversarial agents (Mosca et al., 2022a). Especially with a very small crowd of annotators, a few people can exert a strong influence on the model. A restrictive access-rights management system like IFAN's already mitigates these issues. We believe that additional security features, as well as tracking annotators' impact, are key future work to foster trustworthiness.
Previous works mention that users can feel discouraged and frustrated when interacting with poor models and badly designed interfaces, which can also affect feedback quality (Lertvittayakumjorn and Toni, 2021). This can be addressed by integrating user studies into the development process in order to design more intuitive interfaces and improve the overall user experience.
On the opposite end of the spectrum, plausible explanations can make humans overestimate a model's capabilities and trust systems that are not yet ready for deployment. In this case, a more diverse and complementary set of explanations for users (Madsen et al., 2022), as well as comprehensive model reports for developers, are core goals to provide a more complete picture of the models to be deployed.

A Backbone API Endpoints
Figure 6 shows the auto-generated docs for our backbone's REST API, which serve as a guide for interacting with our backbone. Endpoints are divided into functional groups: models, datasets, prediction, explanation, and feedback. Currently, this page is only accessible within our institution's network for security reasons. Developers with direct API access (specifiable on the configuration page, see 3.1) can make requests directly to this high-level interface for additional (larger-scale) experiments. Once again, the API has been built with the Python framework FastAPI.
Figure 7 shows the documentation for the explanation endpoint. Here, we can inspect the details of the endpoint, such as the required parameters-i.e., the path to the model, the explainer to be used (e.g., LIME), and the model's prediction as the request body.

B User Study
We performed a small user study evaluating the usability of our platform with users from different backgrounds. We gathered a total of nine testers, whom we organize into three categories based on their familiarity with the underlying methodologies: laymen, computer scientists, and experts, with three testers assigned to each category. Laymen are general users with no exposure to computer science. Computer scientists are those who studied computer science or a related discipline and work in corresponding fields, such as software development. Those who also possess knowledge in the domains of machine learning and NLP are considered experts.
We asked the users to try out IFAN and then answer several questions about the website's usability. The questionnaire had the following structure:
• The first question asks about the competency of the test person, categorized as computer scientist, natural language processing expert, or layman.
• Users assign a score between 1 and 5 to the local explanation on the left side of the feedback page.
• A textual input field for the users to hand in more detailed feedback on the previous rating.
• Users assign a score between 1 and 5 to the global explanation on the right side of the feedback page.
• A textual input field for the users to hand in more detailed feedback on the previous rating.
• Users rate the overall usability of the website with a score between 1 and 5.
• A textual input field for the users to hand in more detailed feedback on the previous rating.
The average ratings given by users were: (i) local explanation: 3.88; (ii) global explanation: 3.2; (iii) general usability: 3.33. The problems that users mentioned were the availability of only one form of explanation and feedback, an occasionally misleading markup interface, and waiting times. While some of these problems are already fixed in the current version of the platform, we plan to consistently improve the platform's usability. At the same time, we note a positive result: users with different backgrounds were overall comfortable using IFAN and found it useful for increasing the controllability of NLP models. We believe this confirms IFAN's utility for stakeholders, developers, and annotators in future applications.

C Case Study
In this section, we provide additional information about the experiment conducted under Case Study (Section 4).
C.1 Annotated Samples

(Warning: this section contains hate speech for illustrative purposes only. We have no intention to offend the reader or any target group.)

Here, we provide the list of samples that were used for manual annotation (Section 4). These samples were misclassified by the hate speech detection model. Table 4 lists the misclassified samples with the lowest confidence, and Table 5 those with the highest confidence. All samples and their true labels are taken from the original HateXplain dataset (Mathew et al., 2021). For the label assignment details, refer to the HateXplain dataset description.

C.2 Feedback Mechanisms Comparison
In Section 4, we report the results of the model trained on feedback in two setups: (i) without balancing and (ii) with balancing using original samples from the HateXplain dataset. The comparison between these two setups is visualized in Figures 8 and 9. We tested our approach on local feedback for the Jewish target group samples as well as samples containing the "Jewish" keyword. In both cases, the training procedure runs more stably with balancing: the model's performance on other samples from the HateXplain dataset changes only slightly, and the adjustment of its behavior on the annotated samples proceeds more rapidly.

D BLOOM for Hate Speech Classification and Human Feedback Incorporation
In addition to the results with BERT for hate speech classification (Section 4), we provide the results of experiments with a BLOOM model (Scao et al., 2022b) (Table 6). Once again, fine-tuning the adapter layers with human feedback improves model performance on the Jewish target group. In the case of BLOOM, the overall F1 score also improves, confirming the utility of human feedback and the possibility of extending it to other models. Also, Figure 10 shows how the label of a misclassified sample (e.g., "Um ... he's jewish, not Muslim") changes as the model is fine-tuned with the provided feedback.

E Large Language Models for Hate Speech Classification and Explanation
In addition to classification via linear layers on top of transformer blocks, we test the ability of large language models to perform text classification and explain their predictions. Potentially, this approach allows further work extending the framework to arbitrary NLP tasks.
For each model, we optimized the prompts to perform hate speech detection (the full templates for all three models are reproduced later, alongside the figure and table captions). Despite their specific differences, all prompts follow the same core idea: (i) mention the main task, which can be extracted from the dataset metadata; (ii) provide general information about the labels; and (iii) provide some examples for each label from the dataset. Potentially, this prompt design can be used for any classification task.
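As a sketch of this recipe, the hypothetical helper below assembles such a prompt from the task name, the label set, and per-label examples, mirroring the gpt2-alpaca template reproduced later; all names are illustrative.

```python
def build_prompt(task_name, labels, examples, input_text):
    """Assemble a few-shot classification prompt:
    task description, label set, demonstrations, then the query."""
    lines = [f"Instruction: perform the {task_name} task on the text "
             f"using the following labels: {', '.join(map(str, labels))}."]
    for text, label in examples:  # few-shot demonstrations per label
        lines.append(f"Text: {text} The text is {label}")
    lines.append(f"Text: {input_text} The text is")  # model completes the label
    return "\n".join(lines)

# Example usage with HateXplain-style binary labels:
prompt = build_prompt(
    "text-classification", [0, 1],
    [("some benign example", 0), ("some toxic example", 1)],
    "input text to classify")
```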

F Supplementary Video Demo
A supplementary video showcase can be found on YouTube. For completeness, we also point to an additional version of the demo, also on YouTube, dating back to March 2023.

Figure 1 :
Figure 1: IFAN in brief. The interface allows NLP models and users to interact through predictions, explanations, and feedback. IFAN also provides developers with (1) a manager for models and datasets, (2) model API access, and (3) reports about the model.


Figure 2 :
Figure 2: Overall schema of IFAN: (i) the user selects a dataset or writes a customized input; (ii) the user then selects a model to inspect; (iii) through the UI, annotators can check the model's prediction on a sample along with two types of explanations, local and global; (iv) if there is some misbehavior, annotators can provide feedback; (v) the feedback is stored and then used to fine-tune the model.

Figure 3 :
Figure 3: The proposed architecture for the models integrated into IFAN: addition of adapter layers that are trainable on the provided human feedback.
Figure 3 sketches an overview of the architecture used.

Figure 4 :
Figure 4: An example of the results and local explanations that annotators obtain on the Feedback page.

Figure 5 :
Figure 5: Sample confidence variation as the model is fine-tuned with human feedback (results of the case study using the IFAN platform). We can observe that for both experiments with balanced training data, the overall model performance changes only slightly while the model's behavior on the Jewish target group improves.

Figure 6 :
Figure 6: Screenshot of the Swagger UI for our backbone API endpoints.

Figure 7 :
Figure 7: Screenshot of the explanation endpoint from our backbone API's Swagger UI.
(a) Training without feedback balancing.(b) Training with feedback balancing.

Figure 8 :
Figure 8: Comparison of the training procedure with and without feedback balancing. Here, the results of local feedback on the least confident misclassified samples from the Jewish target group are shown. We can observe that training with a balanced dataset runs more stably, without significant influence on the model's overall domain knowledge.

Figure 9 :
Figure 9: Comparison of the training procedure with and without feedback balancing. Here, the results of local feedback on misclassified samples containing the "jewish" keyword are shown. We can observe that training with the rebalanced dataset runs more stably, without significant influence on the model's overall domain knowledge.

Figure 10 :
Figure 10: Tracking how the labels of misclassified samples are corrected during feedback incorporation.
Prompt templates used for hate speech classification with LLMs (Appendix E):

vicgalle/gpt2-alpaca
Instruction: perform the text-classification task on the text using the following labels: 0 or 1.
Text: would be hilarious if the un building was turned into homeless shelter or hell why not a refugee center The text is 0
Text: [DATASET SAMPLE] The text is 1
Text: [INPUT TEXT] The text is

chavinlo/alpaca-native
This is an overall text-classification. For the given TEXT, CLASSIFY the LABEL of that. The LABEL can be 0 or 1. Just output the LABEL. NOTHING ELSE!
TEXT: [5 DATASET SAMPLES] LABEL: 0
TEXT: [5 DATASET SAMPLES] LABEL: 1
TEXT: [INPUT TEXT]. LABEL:

chavinlo/gpt4-x-alpaca
INSTRUCTION: Given the following DATASET DESCRIPTION, EXTRACT the TASK of it, and PERFORM the TASK on the INPUT TEXT. The FINAL LABEL could be 0 or 1. Just output the FINAL LABEL. NOTHING ELSE!
DATASET DESCRIPTION: HateXplain is the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive, or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which the labelling decision (as hate, offensive, or normal) is based.
CONTEXT:
TEXT: [DATASET SAMPLE] FINAL LABEL: 0
TEXT: [DATASET SAMPLE] FINAL LABEL: 1
INPUT TEXT:
TEXT: [INPUT TEXT]. FINAL LABEL:

Table 1 :
Different levels of access to IFAN.

Table 2 :
Examples of datasets tested with IFAN.

Table 3 :
Results of the case study on debiasing the hate speech classification model. We compare different strategies for feedback incorporation. Pr_J denotes the precision score on the Jewish target group.

Table 4 :
Misclassified samples with the lowest confidence, used for manual annotation.

Table 5 :
Misclassified samples with the highest confidence, used for manual annotation.

Table 6 :
Results of LLM inference for hate speech classification. F1_J denotes the F1 score on the Jewish target group.

Table 7 :
Results of LLM inference for hate speech classification.