Revisiting Visual Grounding

Erik Conser; Kennedy Hahn; Chandler Watson; Melanie Mitchell

doi:10.18653/v1/W19-1804

Abstract

We revisit a particular visual grounding method: the “Image Retrieval Using Scene Graphs” (IRSG) system of Johnson et al. Our experiments indicate that the system does not effectively use its learned object-relationship models. We also look closely at the IRSG dataset, as well as the widely used Visual Relationship Dataset (VRD) that is adapted from it. We find that these datasets exhibit bias that allows methods that ignore relationships to perform relatively well. We also describe several other problems with the IRSG dataset, and report on experiments using a subset of the dataset in which the biases and other problems are removed. Our studies contribute to a more general effort: that of better understanding what machine-learning methods that combine language and vision actually learn and what popular datasets actually test.

Anthology ID: W19-1804
Volume: Proceedings of the Second Workshop on Shortcomings in Vision and Language
Month: June
Year: 2019
Address: Minneapolis, Minnesota
Editors: Raffaella Bernardi, Raquel Fernandez, Spandana Gella, Kushal Kafle, Christopher Kanan, Stefan Lee, Moin Nabi
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 37–46
URL: https://aclanthology.org/W19-1804/
DOI: 10.18653/v1/W19-1804
Cite (ACL): Erik Conser, Kennedy Hahn, Chandler Watson, and Melanie Mitchell. 2019. Revisiting Visual Grounding. In Proceedings of the Second Workshop on Shortcomings in Vision and Language, pages 37–46, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal): Revisiting Visual Grounding (Conser et al., NAACL 2019)
PDF: https://aclanthology.org/W19-1804.pdf

Export citation

BibTeX
MODS XML
Endnote
Preformatted

@inproceedings{conser-etal-2019-revisiting,
    title = "Revisiting Visual Grounding",
    author = "Conser, Erik  and
      Hahn, Kennedy  and
      Watson, Chandler  and
      Mitchell, Melanie",
    editor = "Bernardi, Raffaella  and
      Fernandez, Raquel  and
      Gella, Spandana  and
      Kafle, Kushal  and
      Kanan, Christopher  and
      Lee, Stefan  and
      Nabi, Moin",
    booktitle = "Proceedings of the Second Workshop on Shortcomings in Vision and Language",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-1804/",
    doi = "10.18653/v1/W19-1804",
    pages = "37--46",
    abstract = "We revisit a particular visual grounding method: the ``Image Retrieval Using Scene Graphs'' (IRSG) system of Johnson et al. Our experiments indicate that the system does not effectively use its learned object-relationship models. We also look closely at the IRSG dataset, as well as the widely used Visual Relationship Dataset (VRD) that is adapted from it. We find that these datasets exhibit bias that allows methods that ignore relationships to perform relatively well. We also describe several other problems with the IRSG dataset, and report on experiments using a subset of the dataset in which the biases and other problems are removed. Our studies contribute to a more general effort: that of better understanding what machine-learning methods that combine language and vision actually learn and what popular datasets actually test."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="conser-etal-2019-revisiting">
    <titleInfo>
        <title>Revisiting Visual Grounding</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Erik</namePart>
        <namePart type="family">Conser</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Kennedy</namePart>
        <namePart type="family">Hahn</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chandler</namePart>
        <namePart type="family">Watson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Melanie</namePart>
        <namePart type="family">Mitchell</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2019-06</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Second Workshop on Shortcomings in Vision and Language</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Raffaella</namePart>
            <namePart type="family">Bernardi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Raquel</namePart>
            <namePart type="family">Fernandez</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Spandana</namePart>
            <namePart type="family">Gella</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Kushal</namePart>
            <namePart type="family">Kafle</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Christopher</namePart>
            <namePart type="family">Kanan</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Stefan</namePart>
            <namePart type="family">Lee</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Moin</namePart>
            <namePart type="family">Nabi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Minneapolis, Minnesota</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>We revisit a particular visual grounding method: the “Image Retrieval Using Scene Graphs” (IRSG) system of Johnson et al. Our experiments indicate that the system does not effectively use its learned object-relationship models. We also look closely at the IRSG dataset, as well as the widely used Visual Relationship Dataset (VRD) that is adapted from it. We find that these datasets exhibit bias that allows methods that ignore relationships to perform relatively well. We also describe several other problems with the IRSG dataset, and report on experiments using a subset of the dataset in which the biases and other problems are removed. Our studies contribute to a more general effort: that of better understanding what machine-learning methods that combine language and vision actually learn and what popular datasets actually test.</abstract>
    <identifier type="citekey">conser-etal-2019-revisiting</identifier>
    <identifier type="doi">10.18653/v1/W19-1804</identifier>
    <location>
        <url>https://aclanthology.org/W19-1804/</url>
    </location>
    <part>
        <date>2019-06</date>
        <extent unit="page">
            <start>37</start>
            <end>46</end>
        </extent>
    </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Revisiting Visual Grounding
%A Conser, Erik
%A Hahn, Kennedy
%A Watson, Chandler
%A Mitchell, Melanie
%Y Bernardi, Raffaella
%Y Fernandez, Raquel
%Y Gella, Spandana
%Y Kafle, Kushal
%Y Kanan, Christopher
%Y Lee, Stefan
%Y Nabi, Moin
%S Proceedings of the Second Workshop on Shortcomings in Vision and Language
%D 2019
%8 June
%I Association for Computational Linguistics
%C Minneapolis, Minnesota
%F conser-etal-2019-revisiting
%X We revisit a particular visual grounding method: the “Image Retrieval Using Scene Graphs” (IRSG) system of Johnson et al. Our experiments indicate that the system does not effectively use its learned object-relationship models. We also look closely at the IRSG dataset, as well as the widely used Visual Relationship Dataset (VRD) that is adapted from it. We find that these datasets exhibit bias that allows methods that ignore relationships to perform relatively well. We also describe several other problems with the IRSG dataset, and report on experiments using a subset of the dataset in which the biases and other problems are removed. Our studies contribute to a more general effort: that of better understanding what machine-learning methods that combine language and vision actually learn and what popular datasets actually test.
%R 10.18653/v1/W19-1804
%U https://aclanthology.org/W19-1804/
%U https://doi.org/10.18653/v1/W19-1804
%P 37-46

Markdown (Informal)

[Revisiting Visual Grounding](https://aclanthology.org/W19-1804/) (Conser et al., NAACL 2019)

ACL

Erik Conser, Kennedy Hahn, Chandler Watson, and Melanie Mitchell. 2019. Revisiting Visual Grounding. In Proceedings of the Second Workshop on Shortcomings in Vision and Language, pages 37–46, Minneapolis, Minnesota. Association for Computational Linguistics.