Inseq: An Interpretability Toolkit for Sequence Generation Models

Past work in natural language processing interpretability focused mainly on popular classification tasks while largely overlooking generation settings, partly due to a lack of dedicated tools. In this work, we introduce Inseq, a Python library to democratize access to interpretability analyses of sequence generation models. Inseq enables intuitive and optimized extraction of models' internal information and feature importance scores for popular decoder-only and encoder-decoder Transformer architectures. We showcase its potential by using it to highlight gender biases in machine translation models and to locate factual knowledge inside GPT-2. Thanks to its extensible interface supporting cutting-edge techniques such as contrastive feature attribution, Inseq can drive future advances in explainable natural language generation, centralizing good practices and enabling fair and reproducible model evaluations.


Introduction
Recent years saw an increase in studies and tools aimed at improving our behavioral or mechanistic understanding of neural language models (Belinkov and Glass, 2019). In particular, feature attribution methods became widely adopted to quantify the importance of input tokens in relation to models' inner processing and final predictions (Madsen et al., 2022b). Many studies applied such techniques to modern deep learning architectures, including Transformers (Vaswani et al., 2017), leveraging gradients (Baehrens et al., 2010; Sundararajan et al., 2017), attention patterns (Xu et al., 2015; Clark et al., 2019) and input perturbations (Zeiler and Fergus, 2014; Feng et al., 2018) to quantify input importance, often leading to controversial outcomes in terms of faithfulness, plausibility and overall usefulness of such explanations (Adebayo et al., 2018; Jain and Wallace, 2019; Jacovi and Goldberg, 2020; Zafar et al., 2021). However, feature attribution techniques have mainly been applied to classification settings (Atanasova et al., 2020; Wallace et al., 2020; Madsen et al., 2022a; Chrysostomou and Aletras, 2022), with relatively little interest in the more convoluted mechanisms underlying generation. Classification attribution is a single-step process resulting in one importance score per input token, often allowing for intuitive interpretations in relation to the predicted class. Sequential attribution (we use sequence generation to refer to all iterative tasks including, but not limited to, natural language generation) instead involves a computationally expensive multi-step iteration producing a matrix A_ij representing the importance of every input i in the prediction of every generation outcome j (Figure 1). Moreover, since previous generation steps causally influence following predictions, they must be dynamically incorporated into the set of attributed inputs throughout the process.
Lastly, while classification usually involves a limited set of classes and simple output selection (e.g. argmax after softmax), generation routinely works with large vocabularies and non-trivial decoding strategies (Eikema and Aziz, 2020). These differences limited the use of feature attribution methods for generation settings, with relatively few works improving attribution efficiency (Vafa et al., 2021;Ferrando et al., 2022) and explanations' informativeness (Yin and Neubig, 2022).
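To make the multi-step procedure concrete, the following toy sketch attributes each generated token to a growing set of inputs via occlusion. The "model" below is an invented stand-in (not Inseq code, and not a real attribution method implementation): it simply prefers the next token whose id equals the sum of the current tokens modulo the vocabulary size.

```python
def toy_next_token_probs(tokens, vocab_size=10):
    # Toy stand-in for a language model: the "likely" next token is the
    # sum of the current tokens modulo the vocabulary size.
    target = sum(tokens) % vocab_size
    probs = [0.1 / (vocab_size - 1)] * vocab_size
    probs[target] = 0.9
    return probs

def attribute_generation(source, steps, occlusion_token=0):
    # Sequential attribution: at each step, attribute the chosen token to
    # every input (source + previously generated prefix) by occlusion.
    prefix, matrix = [], []
    for _ in range(steps):
        tokens = source + prefix
        probs = toy_next_token_probs(tokens)
        chosen = max(range(len(probs)), key=probs.__getitem__)
        # Importance of input i = probability drop when token i is occluded.
        row = []
        for i in range(len(tokens)):
            occluded = tokens[:i] + [occlusion_token] + tokens[i + 1:]
            row.append(probs[chosen] - toy_next_token_probs(occluded)[chosen])
        matrix.append(row)
        prefix.append(chosen)  # the prefix grows, so later rows are longer
    return matrix, prefix
```

Note how each row of the attribution matrix is longer than the previous one: generated tokens join the set of attributed inputs, which is exactly what makes sequential attribution more expensive than single-step classification attribution.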
In this work, we introduce Inseq, a Python library to democratize access to interpretability analyses of generative language models. Inseq centralizes access to a broad set of feature attribution methods, sourced in part from the Captum (Kokhlikyan et al., 2020) framework, enabling a fair comparison of different techniques for all sequence-to-sequence and decoder-only models in the popular Transformers library (Wolf et al., 2020). Thanks to its intuitive interface, users can easily integrate interpretability analyses into sequence generation experiments with just 3 lines of code (Figure 2). Nevertheless, Inseq is also highly flexible, including cutting-edge attribution methods with built-in post-processing features (§ 4.1), supporting customizable attribution targets and enabling constrained decoding of arbitrary sequences (§ 4.2). In terms of usability, Inseq greatly simplifies access to local and global explanations with built-in support for a command line interface (CLI), optimized batching enabling dataset-wide attribution, and various methods to visualize, serialize and reload attribution outcomes and generated sequences (§ 4.3). Ultimately, Inseq aims to make sequence generation models first-class citizens in interpretability research and to drive future advances in interpretability for generative applications.

Related Work
Feature Attribution for Sequence Generation Work on feature attribution for sequence generation has mainly focused on machine translation (MT). Bahdanau et al. (2015) showed how attention weights of neural MT models encode interpretable alignment patterns. Alvarez-Melis and Jaakkola (2017) adopted a perturbation-based framework to highlight biases in MT systems. Ding et al.
Tools for NLP Interpretability Although many post-hoc interpretability libraries were released recently, only a few support sequential feature attribution. Notable examples are LIT (Tenney et al., 2020), a structured framework for analyzing models across modalities, and Ecco (Alammar, 2021), a library specialized in interactive visualizations of model internals. LIT is an all-in-one GUI-based tool to analyze model behaviors on entire datasets. However, the library does not provide built-in support for Transformers models, has a steep learning curve due to its advanced UI, and does not support interactive evaluation of individual examples. All these factors limit LIT's usability for researchers working with custom models, needing access to extracted scores, or being less familiar with interpretability research. On the other hand, Ecco is closer to our work, being based on Transformers and having started to support encoder-decoder models concurrently with Inseq's development. Despite a marginal overlap in their functionalities, the two libraries provide orthogonal benefits: Inseq's flexible interface makes it especially suitable for methodical quantitative analyses involving repeated evaluations, while Ecco excels in qualitative analyses aimed at visualizing model internals. Other popular tools such as ERASER (DeYoung et al., 2020), Thermostat (Feldhus et al., 2021), transformers-interpret (Pierse, 2021) and ferret (Attanasio et al., 2022) do not support sequence models.

Design
Inseq combines sequence models sourced from Transformers (Wolf et al., 2020) and attribution methods mainly sourced from Captum (Kokhlikyan et al., 2020). While only text-based tasks are currently supported, the library's modular design (detailed further in Appendix B) would enable the inclusion of other modeling frameworks (e.g. fairseq (Ott et al., 2019)) and modalities (e.g. speech) without requiring substantial redesign. Optional dependencies include Datasets (Lhoest et al., 2021) and Rich.

Guiding Principles
Research and Generation-oriented Inseq should support interpretability analyses of a broad set of sequence generation models without focusing narrowly on specific architectures or tasks. Moreover, the inclusion of new, cutting-edge methods should be prioritized to enable fair comparisons with well-established ones.
Scalable The library should provide an optimized interface to a wide range of use cases, models and setups, ranging from interactive attributions of individual examples using toy models to compiling statistics of large language models' predictions for entire datasets.
Beginner-friendly Inseq should provide built-in access to popular frameworks for sequence generation modeling and be fully usable by non-experts at a high level of abstraction, providing sensible defaults for supported attribution methods.
Extensible Inseq should support a high degree of customization for experienced users, with out-of-the-box support for user-defined solutions to enable future investigations into models' behaviors.

Feature Attribution and Post-processing
At its core, Inseq provides a simple interface to apply feature attribution techniques to sequence generation tasks. We categorize methods into three groups: gradient-based, internals-based and perturbation-based, depending on their underlying approach to importance quantification. Table 1 presents the full list of supported methods. Aside from popular model-agnostic methods, Inseq notably provides built-in support for attention weight attribution and the cutting-edge Discretized Integrated Gradients method (Sanyal and Ren, 2021). Moreover, multiple methods allow for the importance attribution of custom intermediate model layers, simplifying studies on representational structures and information mixing in sequential models, such as our case study of Section 5.2.
Source and target-side attribution When using encoder-decoder architectures, users can set the attribute_target parameter to include or exclude the generated prefix in the attributed inputs. In most cases, including it is desirable to account for recently generated tokens when explaining model behaviors, such as the decision to terminate generation (Figure 2). However, attributing the source side separately can also prove useful, for example, to derive word alignments from importance scores.
Post-processing of attribution outputs Aggregation is often a fundamental step in attribution-based analyses to obtain a single importance score for every token pair in the attribution matrix. Inseq includes several Aggregator classes to aggregate attributions across various dimensions (e.g. merging subword tokens, or collapsing granular neuron-level scores into coarse-grained token-level ones) and allows chaining multiple aggregators using the AggregatorPipeline class. Finally, multi-example aggregation is also available (PairAggregator) to simplify contrastive analyses, such as the one of Section 5.1.
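A minimal sketch of how such aggregation steps might compose is shown below. The function names and the pipeline helper are hypothetical illustrations of the idea, not Inseq's actual Aggregator API:

```python
import math

def aggregate_neurons(scores):
    # Collapse granular neuron-level scores into one token-level score
    # per token, here via the L2 norm of each token's score vector.
    return [math.sqrt(sum(v * v for v in vec)) for vec in scores]

def merge_subwords(token_scores, word_spans):
    # Merge subword-token scores into word-level scores (sum per span).
    return [sum(token_scores[i] for i in span) for span in word_spans]

def run_pipeline(scores, steps):
    # Chain aggregation steps, mirroring the idea of an aggregator pipeline.
    for step in steps:
        scores = step(scores)
    return scores
```

For example, neuron-level scores for three tokens can first be collapsed to token level and then merged when the last two tokens are subwords of a single word, yielding one score per word.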

Customizing generation and attribution
Upon calling attribute, Inseq first generates target tokens using Transformers and then attributes them step by step. If a custom target is specified alongside model inputs, the generation step is instead skipped, and the provided target is attributed by constraining the decoding of its tokens. Constrained attribution can be used, among other things, for contrastive comparisons of minimal pairs and to obtain model justifications for desired outputs.
Custom step functions A step function computes a score of interest (e.g. probabilities, entropy) at every attribution step by using models' internal information. Inseq provides access to multiple built-in step functions (Table 1, S) and allows users to create and register new custom ones.
Step scores are computed together with the attribution, returned as separate sequences in the output, and visualized alongside importance scores (e.g. the p(y_t | y_<t) row in Figure 1).
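The registration mechanism can be pictured as a simple name-to-function mapping. The sketch below is an illustrative stand-in (the registry and decorator names are invented, not Inseq's actual interface), registering an entropy score over the next-token distribution at one step:

```python
import math

STEP_FUNCTIONS = {}

def register_step_function(name):
    # Decorator storing a scoring function under a given identifier,
    # so it can later be looked up by name at every attribution step.
    def decorator(fn):
        STEP_FUNCTIONS[name] = fn
        return fn
    return decorator

@register_step_function("entropy")
def entropy(probs):
    # Entropy (in nats) of the model's next-token distribution:
    # high values indicate the model is uncertain at this step.
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform distribution over four tokens, for instance, yields the maximal entropy log(4).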
Step functions as attribution targets For methods relying on model outputs to predict input importance (gradient and perturbation-based), feature attributions are commonly obtained from the model's output logits or class probabilities (Bastings et al., 2022). However, recent work showed the effectiveness of using targets such as the probability difference of a contrastive output pair to answer interesting questions like "What inputs drive the prediction of y rather than ŷ?" (Yin and Neubig, 2022). In light of these advances, Inseq users can leverage any built-in or custom-defined step function as an attribution target, enabling advanced use cases like contrastive comparisons and uncertainty-weighted attribution using MC Dropout (Gal and Ghahramani, 2016).
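As an illustration of a contrastive attribution target, the snippet below computes the probability difference between two candidate tokens from raw logits. This is a generic sketch of the quantity described by Yin and Neubig (2022), not tied to Inseq's API; in a gradient-based method, this scalar would replace the usual logit or probability as the value whose gradient is propagated back to the inputs:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_prob_diff(logits, target_id, contrast_id):
    # Contrastive target: p(y) - p(y_hat), answering "what drives the
    # prediction of y rather than y_hat?" when used as attribution target.
    probs = softmax(logits)
    return probs[target_id] - probs[contrast_id]
```

By construction the score is antisymmetric: swapping the target and contrast tokens flips its sign.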

Usability Features
Batched and span-focused attributions The library provides built-in batching capabilities, enabling users to go beyond single sentences and attribute even entire datasets in a single function call.
When the attribution of a specific span of interest is needed, Inseq also allows specifying a start and end position for the attribution process. This functionality greatly accelerates the attribution process for studies on localized phenomena (e.g. pronoun coreference in MT models).

CLI, Serialization and Visualization
The Inseq library offers an API to attribute single examples or entire Datasets from the command line and save resulting outputs and visualizations to a file. Attribution outputs can be saved and loaded in JSON format with their respective metadata to easily identify the provenance of contents. Attributions can be visualized in the command line or IPython notebooks and exported as HTML files.
Quantized Model Attribution Supporting the attribution of large models is critical given recent scaling tendencies (Kaplan et al., 2020). All models allowing for quantization using bitsandbytes (Dettmers et al., 2022) can be loaded in 8-bit directly from Transformers, and their attributions can be computed normally using Inseq. A minimal manual evaluation of 8-bit attribution outputs for the study of Section 5.2 shows minimal discrepancies compared to full-precision results.

Gender Bias in Machine Translation
In the first case study, we use Inseq to investigate gender bias in MT models. Studying social biases embedded in these models is crucial to understand and mitigate the representational and allocative harms they might engender (Blodgett et al., 2020). Savoldi et al. (2021) note that the study of bias in MT could benefit from explainability techniques to identify spurious cues exploited by the model and the interaction of different features that can lead to intersectional bias.
Synthetic Setup: Turkish to English The Turkish language uses the gender-neutral pronoun o, which can be translated into English as either "he", "she", or "it", making it interesting for studying gender bias in MT when paired with a language such as English, for which models tend to choose a gendered pronoun form. Previous works leveraged translations from gender-neutral languages to expose gender bias in translation systems (Cho et al., 2019; Prates et al., 2020; Farkas and Németh, 2022). We repeat this simple setup using a Turkish-to-English MarianMT model (Tiedemann, 2020) and compute different metrics to quantify gender bias using Inseq. We select 49 Turkish occupation terms verified by a native speaker (see Appendix E) and use them to infill the template sentence "O bir " (He/She is a(n) ). For each translation, we compute attribution scores for the source Turkish pronoun (x_pron) and occupation (x_occ) tokens when generating the target English pronoun (y_pron), using Integrated Gradients (IG), Gradients (∇), and Input × Gradient (I×G). We also collect target pronoun probabilities (p(y_pron)), rank the 49 occupation terms using these metrics, and finally compute Kendall's τ correlation with the percentage of women working in the respective fields, using U.S. labor statistics as in previous works (e.g., Caliskan et al., 2017; Rudinger et al., 2018). Table 2 presents our results.
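The correlation analysis can be reproduced with a compact Kendall's τ implementation. The sketch below uses the τ-a variant without tie correction (a simplification), and the occupation scores and labor percentages are invented toy numbers, not the paper's data:

```python
def kendall_tau(xs, ys):
    # Kendall's tau-a: (concordant - discordant) / number of pairs.
    # Simplified: assumes no ties between compared values.
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented toy numbers: per-occupation attribution scores vs. the
# percentage of women employed in each occupation.
occ_attribution = [0.9, 0.4, 0.7, 0.1]
pct_female = [12.0, 55.0, 30.0, 80.0]
```

In this toy setting the two rankings are exactly reversed, so τ = -1: higher attribution scores for the occupation co-occur with male-dominated professions, the kind of pattern the case study probes for.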
In the base case, we correlate the different metrics with how much the gender distribution deviates from an equal distribution (50-50%) for each occupation (i.e., the gender bias irrespective of its direction). We observe a strong gender bias, with "she" being chosen only for 5 out of 49 translations and gender-neutral variants never being produced by the MT model. We find a low correlation between pronoun probability and the degree of gender stereotype associated with the occupation. (For multi-token occupation terms, e.g., bilim insanı (scientist), we use the attribution score of the first token. For IG, we set the approximation steps to ensure convergence Δ < 0.05; all methods use the L2 norm to obtain token-level attributions.) Moreover, we note a weaker correlation for IG compared to the other two methods. For those, attribution scores for x_occ show significant correlations with labor statistics, supporting the view that heavily gender-stereotyped occupations strongly influence the choice of pronouns in the target.
In the gender-swap case, we use the PairAggregator class to contrastively compare attribution scores and probabilities when translating the pronoun as "She" or "He". We correlate the resulting scores with the percentage of women working in the respective occupation and find strong correlations for p(y_pron), supporting the validity of contrastive approaches in uncovering gender bias.
Qualitative Example: English to Dutch We conduct a qualitative analysis of biased MT outputs, showing how attributions can help develop hypotheses about models' behavior. Table 3 (top) shows the I×G attributions for an English-to-Dutch translation using M2M-100 (418M, Fan et al., 2021). The model mistranslates the pronoun "her" into the masculine form zijn (his). We find that the wrongly translated pronoun exhibits high probability but assigns little importance to the source occupation term "teacher". Instead, we find relatively high importance for the preceding word and for leraar (male teacher). This suggests a strong prior bias for masculine variants, shown by the pronoun zijn and the noun leraar, as a possible cause of this mistranslation. When considering the contrastive example obtained by swapping leraar with its gender-neutral variant leerkracht (Table 3, bottom), we find increased importance of the target occupation in determining the correctly-gendered target pronoun haar (her). Our results highlight the tendency of MT models to attend to inputs sequentially rather than relying on context, hinting at the known benefits of context-aware models for pronoun translation (Voita et al., 2018).

Identifying Factual Knowledge in GPT-2 Layers with Contrastive Attribution
For our second case study, we experiment with a layer attribution method to locate factual knowledge encoded in the layers of GPT-2 XL (1.5B parameters; Radford et al., 2019). Specifically, we aim to reproduce the results of Meng et al. (2022), showing the influence of intermediate layers in mediating the recall of factual statements such as "The Eiffel Tower is located in the city of → Paris". Meng et al. (2022) estimate the effect of network components in the prediction of factual statements as the difference in probability of a correct target (e.g. Paris), given a corrupted subject embedding (e.g. for Eiffel Tower), before and after restoring clean activations for some input tokens at different layers of the network. Apart from the obvious importance of final token states in terminal layers, their results highlight the presence of an early site associated with the last subject token playing an important role in recalling the network's factual knowledge (Figure 3, top). To verify these results, we adopt the contrastive attribution paradigm proposed by Yin and Neubig (2022) to attribute minimal pairs of correct and wrong factual targets (e.g. Paris vs. Rome for the example above), using Layer Gradient × Activation, a layer-specific variant of Input × Gradient, to propagate gradients up to intermediate network activations instead of reaching input tokens. The resulting attribution scores hence answer the question "How important are layer L activations for prefix token t in predicting the correct factual target over a wrong one?". We compute attribution scores for 1000 statements taken from the Counterfact Statement dataset (Meng et al., 2022) and present averaged results in Figure 3 (bottom). Our results closely match those of the original authors, providing further evidence of how attribution methods can be used to identify salient network components and guide model editing, as shown by Dai et al. (2022).

Figure 3: Top: Estimated causal importance of GPT-2 XL layers for predicting factual associations, as reported by Meng et al. (2022). Bottom: Average GPT-2 XL Gradient × Layer Activation scores obtained with Inseq using contrastive factual pairs as attribution targets.
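To convey the idea behind Gradient × Activation with a contrastive target, the toy below uses a linear model whose gradient can be written analytically; the model, its dimensions and numbers are illustrative assumptions, not GPT-2 or Inseq's implementation:

```python
def layer_grad_x_activation(hidden, readout, correct, wrong):
    # Toy "model": mean-pool per-token layer activations, then apply a
    # linear readout to obtain logits. The contrastive target is
    # f = logit[correct] - logit[wrong]. For this linear model,
    # df/d(hidden[t]) = (readout[correct] - readout[wrong]) / T for
    # every token t, so Gradient x Activation has a closed form.
    num_tokens = len(hidden)
    grad = [(wc - ww) / num_tokens
            for wc, ww in zip(readout[correct], readout[wrong])]
    # One attribution score per prefix token: sum over the layer's
    # dimensions of gradient * activation.
    return [sum(g * h for g, h in zip(grad, row)) for row in hidden]
```

By linearity, the per-token scores sum exactly to the contrastive logit difference, so each score can be read as that token's layer activations' share of the preference for the correct factual target over the wrong one.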

Conclusion
We introduced Inseq, an easy-to-use yet versatile toolkit for interpreting sequence generation models. While many libraries focus on the study of classification models, Inseq is the first tool explicitly aimed at analyzing systems for tasks such as machine translation, code synthesis and dialogue generation. Researchers can easily add interpretability evaluations to their studies using our library, with the goal of identifying unwanted biases and interesting phenomena in their models' predictions. We plan to provide continued support and explore further developments for Inseq (planned developments are listed in Appendix F), with the ultimate goal of providing simple and centralized access to a comprehensive set of thoroughly-tested implementations for the interpretability community. In conclusion, we believe that Inseq has the potential to drive real progress in explainable language generation by accelerating the development of new analysis techniques, and we encourage members of this research field to join our development efforts. (Figure 6 of Appendix D presents some example attributions.)

Acknowledgments
We thank Ece Takmaz for verifying the Turkish word list used in the study of Section 5.

Broader Impact and Ethics Statement
Reliability of Attribution Methods The plausibility and faithfulness of the attribution methods supported by Inseq are an active matter of debate in the research community: such methods provide no clear-cut guarantees in identifying specific model behaviors and are prone to users' own biases (Jacovi and Goldberg, 2020). We emphasize that explanations produced with Inseq should not be adopted in high-risk and user-facing contexts. We encourage Inseq users to critically approach results obtained from our toolkit and validate their effectiveness on a case-by-case basis.

Technical Limitations and Contributions
Inseq currently does not provide explicit ways of evaluating the quality of produced attributions. Moreover, many recent methods still need to be included due to the rapid pace of interpretability research in natural language processing and the small size of our development team.
To foster an open and inclusive development environment, we encourage all interested users and new methods' authors to contribute to the development of Inseq by adding their interpretability methods of interest.
Gender Bias Case Study The case study of Section 5.1 assumes a simplified concept of binary gender to allow for a more straightforward evaluation of the results. However, we encourage other researchers to consider non-binary gender and different marginalized groups in future bias studies. We acknowledge that measuring bias in language models is complex and that care must be taken in its conceptualization and validation (Blodgett et al., 2020; Bommasani and Liang, 2022), even more so in multilingual settings (Talat et al., 2022). For this reason, we do not claim to provide a definitive bias analysis of these MT models, especially in light of the aforementioned faithfulness issues of attributions. The study's primary purpose is to demonstrate how attribution methods could be used for exploring social biases in sequence-to-sequence models and to showcase the related Inseq functionalities.

A Authors' Contributions
Gabriele Sarti Organized and led the project, developed the first public release of the Inseq library, and conducted the case study of Section 5.2.
Nils Feldhus Developed perturbation-based methods and contributed to the writing and validation of the case study of Section 5.2.
Ludwig Sickert Developed the attention-based attribution method and contributed to the writing and revision of the paper.
Oskar van der Wal Conducted the experiments in the gender bias case study of Section 5.1 and contributed to the writing and revision of the paper.

Figure 4 presents the Inseq hierarchy of models and attribution methods. The model-method connection is established to enable out-of-the-box attribution with the selected method. The presence of framework-specific and architecture-specific classes enables the straightforward extension of Inseq to new modeling backbones and network architectures.

C Example of Pair Aggregation for Contrastive MT Comparison
An example of a gendered translation pair using the synthetic template of Section 5.1 is shown in Figure 5, highlighting a large drop in probability when switching the gendered pronoun for highly gender-stereotypical professions, in line with the results of Table 2.

Table 4 shows the list of occupation terms used in the gender bias case study (Section 5.1). We correlate the ranking of occupations based on the selected attribution metrics and probabilities with U.S. labor statistics (bls_pct_female column). The example in Table 3 was taken from the BUG dataset (Levy et al., 2021).