Shahar Katz

2026

Safeguarding Language Models via Self-Destruct Trapdoor
Shahar Katz | Bar Alon | Ariel Shaulov | Lior Wolf | Mahmood Sharif
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The potential misuse and misalignment of language models (LMs) is a central safety concern. This work presents Self-Destruct, a novel mechanism to restrict specific behaviors in LMs by leveraging overlooked properties of the underlying hardware. We observe that the LM frameworks use limited-precision formats (e.g., BF16), which are vulnerable to overflow errors during matrix multiplications. Exploiting this property, Self-Destruct replaces selected weights in pre-trained LM layers with values that act as traps, triggering a system error only when the model engages in targeted behaviors, such as harmful text generation, while leaving normal functionality unaffected. Unlike post hoc filters, this safeguard is embedded directly within the model, introduces neither inference overhead nor auxiliary models, and requires only a set of examples for calibration. Extensive experiments with five LM families demonstrate that Self-Destruct provides competitive protection against jailbreak attacks while preserving accuracy on standard benchmarks. In addition, we also show that Self-Destruct is versatile, helping mitigate biased text generation and enable model fingerprinting, highlighting the potential of hardware-aware safeguards as an efficient, low-overhead complement to existing LM defenses.

2025

Reversed Attention: On The Gradient Descent Of Attention Layers In GPT
Shahar Katz | Lior Wolf
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward pass of LMs, the backward pass of attention has been largely overlooked. In this work, we study the mathematics of the backward pass of attention, revealing that it implicitly calculates an attention matrix we refer to as “Reversed Attention”. We visualize Reversed Attention and examine its properties, demonstrating its ability to elucidate the models’ behavior and edit dynamics. In an experimental setup, we showcase the ability of Reversed Attention to directly alter the forward pass of attention, without modifying the model’s weights, using a novel method called “attention patching”. In addition to enhancing the comprehension of how LMs configure attention layers during backpropagation, Reversed Attention maps contribute to a more interpretable backward pass.

Segment-Based Attention Masking for GPTs
Shahar Katz | Liran Ringel | Yaniv Romano | Lior Wolf
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Causal masking is a fundamental component in Generative Pre-Trained Transformer (GPT) models, playing a crucial role during training. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial “prefill” phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. The Segment-by-Segment scheme entails no additional computational overhead. When integrated using lightweight fine-tuning into already trained models such as Llama and Qwen, MAS quickly improves the models’ performance.
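
The block-wise masking described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `segment_prefill_mask` and the `segment_ids` input are hypothetical, and the sketch assumes that each prefill token may attend to every earlier token and to every token in its own block.

```python
def segment_prefill_mask(segment_ids):
    """Return mask[i][j] = True when prefill token i may attend to token j.

    Assumption: a token sees all earlier tokens (causal part) plus all
    tokens that share its segment id (non-causal part within a block).
    """
    n = len(segment_ids)
    return [
        [j <= i or segment_ids[j] == segment_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical prompt: 2 system-prompt tokens (block 0), 3 user tokens (block 1).
segment_ids = [0, 0, 1, 1, 1]
mask = segment_prefill_mask(segment_ids)

# Print one row per query token; "x" marks an allowed key position.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy prompt, the first system token can already see the second one, while neither system token can see the user block; generated answer tokens would fall back to the usual causal rule.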

2024

Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Shahar Katz | Yonatan Belinkov | Mor Geva | Lior Wolf
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models’ vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs’ backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes’ inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs’ neurons.
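
The low-rank structure mentioned in this abstract can be illustrated with the standard backpropagation identity for a linear layer y = W x: the gradient with respect to W is a sum of outer products between the backward-pass signal and the forward-pass input, one per token. The sketch below is only an illustration of that identity, not the paper's method; the helper names and the toy numbers are assumptions.

```python
def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[ui * vj for vj in v] for ui in u]


def weight_gradient(inputs, deltas):
    """Gradient of the loss w.r.t. W for y_i = W x_i.

    inputs: forward-pass input vectors x_i, one per token.
    deltas: backward-pass vectors dL/dy_i, one per token.
    The result is sum_i outer(delta_i, x_i), so its rank is at most
    the number of tokens.
    """
    rows, cols = len(deltas[0]), len(inputs[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for x, d in zip(inputs, deltas):
        term = outer(d, x)
        grad = [[g + t for g, t in zip(gr, tr)] for gr, tr in zip(grad, term)]
    return grad


# Toy example with a single token: the gradient is exactly rank 1.
inputs = [[1.0, 2.0, 3.0]]   # x, the layer's forward input
deltas = [[0.5, -1.0]]       # dL/dy, the layer's backward input
grad = weight_gradient(inputs, deltas)
print(grad)
```

With one token, every row of the gradient is a multiple of x, which is what makes it possible to describe the whole matrix by a pair of vectors.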

2023

VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers
Shahar Katz | Yonatan Belinkov
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models (LMs) to their vocabulary, a transformation that makes them more human interpretable. In this paper, we investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input. By analyzing the tokens they represent through this projection, we identify patterns in the information flow inside the attention mechanism. Based on our discoveries, we create a tool to visualize a forward pass of Generative Pre-trained Transformers (GPTs) as an interactive flow graph, with nodes representing neurons or hidden states and edges representing the interactions between them. Our visualization simplifies huge amounts of data into easy-to-read plots that can reflect the models’ internal processing, uncovering the contribution of each component to the models’ final prediction. Our visualization also unveils new insights about the role of layer norms as semantic filters that influence the models’ output, and about neurons that are always activated during forward passes and act as regularization vectors.

Co-authors

Yaniv Romano 1

Mahmood Sharif 1

Ariel Shaulov 1
