Evaluating Attribution Methods using White-Box LSTMs

Yiding Hao

doi:10.18653/v1/2020.blackboxnlp-1.28

Abstract

Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
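The idea of a network "whose behavior is understood a priori" can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy task (does a string over {a, b} contain more a's than b's), the hand-set weights, the constant `BIG`, the helper names `one_hot`, `logit`, and `occlusion`, and the choice of occlusion as the attribution method. The sketch hand-builds a one-unit LSTM whose cell state acts as a counter, then scores each token by how much the output changes when that token is blanked out.

```python
import numpy as np

BIG = 50.0                 # hypothetical constant, large enough to saturate sigmoid and tanh
VOCAB = {"a": 0, "b": 1}   # hypothetical two-symbol alphabet


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def one_hot(token):
    v = np.zeros(len(VOCAB))
    v[VOCAB[token]] = 1.0
    return v


def logit(inputs):
    """Run a hand-set one-unit LSTM and return its final hidden state."""
    c = 0.0
    h = 0.0
    for x in inputs:
        # All gate weights are zero and all gate biases are BIG,
        # so the input, forget, and output gates stay (almost exactly) open.
        i = sigmoid(BIG)
        f = sigmoid(BIG)
        o = sigmoid(BIG)
        # Candidate value: about +1 for "a", about -1 for "b", 0 for a blank input.
        g = np.tanh(BIG * (x[0] - x[1]))
        c = f * c + i * g      # the cell state counts (#a - #b)
        h = o * np.tanh(c)     # positive exactly when a's outnumber b's
    return h


def occlusion(tokens):
    """Score each token as logit(full input) - logit(input with that token blanked)."""
    inputs = [one_hot(t) for t in tokens]
    base = logit(inputs)
    scores = []
    for t in range(len(inputs)):
        occluded = list(inputs)
        occluded[t] = np.zeros(len(VOCAB))
        scores.append(base - logit(occluded))
    return scores


if __name__ == "__main__":
    for token, score in zip("aab", occlusion("aab")):
        print(token, round(float(score), 3))
```

In this toy, every token moves the counter by exactly one, so a natural expectation is that all tokens receive scores of equal magnitude. The occlusion scores for `aab` do not have equal magnitude, because the final `tanh` compresses larger counts. This is only an observation about the sketch, not a reproduction of the paper's experiments.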

Anthology ID: 2020.blackboxnlp-1.28
Volume: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month: November
Year: 2020
Address: Online
Editors: Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupała, Dieuwke Hupkes, Yuval Pinter, Hassan Sajjad
Venue: BlackboxNLP
Publisher: Association for Computational Linguistics
Pages: 300–313
URL: https://aclanthology.org/2020.blackboxnlp-1.28/
DOI: 10.18653/v1/2020.blackboxnlp-1.28
Cite (ACL): Yiding Hao. 2020. Evaluating Attribution Methods using White-Box LSTMs. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 300–313, Online. Association for Computational Linguistics.
Cite (Informal): Evaluating Attribution Methods using White-Box LSTMs (Hao, BlackboxNLP 2020)
PDF: https://aclanthology.org/2020.blackboxnlp-1.28.pdf
Video: https://slideslive.com/38939768

Export citation

BibTeX

@inproceedings{hao-2020-evaluating,
    title = "Evaluating Attribution Methods using White-Box {LSTM}s",
    author = "Hao, Yiding",
    editor = "Alishahi, Afra  and
      Belinkov, Yonatan  and
      Chrupa{\l}a, Grzegorz  and
      Hupkes, Dieuwke  and
      Pinter, Yuval  and
      Sajjad, Hassan",
    booktitle = "Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.blackboxnlp-1.28/",
    doi = "10.18653/v1/2020.blackboxnlp-1.28",
    pages = "300--313",
    abstract = "Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hao-2020-evaluating">
    <titleInfo>
        <title>Evaluating Attribution Methods using White-Box LSTMs</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Yiding</namePart>
        <namePart type="family">Hao</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Afra</namePart>
            <namePart type="family">Alishahi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yonatan</namePart>
            <namePart type="family">Belinkov</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Grzegorz</namePart>
            <namePart type="family">Chrupała</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Dieuwke</namePart>
            <namePart type="family">Hupkes</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yuval</namePart>
            <namePart type="family">Pinter</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Hassan</namePart>
            <namePart type="family">Sajjad</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Online</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.</abstract>
    <identifier type="citekey">hao-2020-evaluating</identifier>
    <identifier type="doi">10.18653/v1/2020.blackboxnlp-1.28</identifier>
    <location>
        <url>https://aclanthology.org/2020.blackboxnlp-1.28/</url>
    </location>
    <part>
        <date>2020-11</date>
        <extent unit="page">
            <start>300</start>
            <end>313</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Evaluating Attribution Methods using White-Box LSTMs
%A Hao, Yiding
%Y Alishahi, Afra
%Y Belinkov, Yonatan
%Y Chrupała, Grzegorz
%Y Hupkes, Dieuwke
%Y Pinter, Yuval
%Y Sajjad, Hassan
%S Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F hao-2020-evaluating
%X Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
%R 10.18653/v1/2020.blackboxnlp-1.28
%U https://aclanthology.org/2020.blackboxnlp-1.28/
%U https://doi.org/10.18653/v1/2020.blackboxnlp-1.28
%P 300-313

Markdown (Informal)

[Evaluating Attribution Methods using White-Box LSTMs](https://aclanthology.org/2020.blackboxnlp-1.28/) (Hao, BlackboxNLP 2020)