VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Zejun Li (李泽君); Ruipu Luo; Jiwen Zhang; Minghui Qiu; Xuan-Jing Huang (黄萱菁); Zhongyu Wei

doi:10.18653/v1/2025.naacl-long.192

Anthology ID: 2025.naacl-long.192
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3769–3798
URL: https://aclanthology.org/2025.naacl-long.192/
DOI: 10.18653/v1/2025.naacl-long.192
Bibkey: li-etal-2025-vocot
Cite (ACL): Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. 2025. VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3769–3798, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models (Li et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.192.pdf

BibTeX

@inproceedings{li-etal-2025-vocot,
    title = "{V}o{C}o{T}: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models",
    author = "Li, Zejun  and
      Luo, Ruipu  and
      Zhang, Jiwen  and
      Qiu, Minghui  and
      Huang, Xuanjing  and
      Wei, Zhongyu",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.192/",
    doi = "10.18653/v1/2025.naacl-long.192",
    pages = "3769--3798",
    ISBN = "979-8-89176-189-6"
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="li-etal-2025-vocot">
    <titleInfo>
        <title>VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Zejun</namePart>
        <namePart type="family">Li</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ruipu</namePart>
        <namePart type="family">Luo</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jiwen</namePart>
        <namePart type="family">Zhang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Minghui</namePart>
        <namePart type="family">Qiu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Xuanjing</namePart>
        <namePart type="family">Huang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Zhongyu</namePart>
        <namePart type="family">Wei</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2025-04</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Luis</namePart>
            <namePart type="family">Chiruzzo</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Alan</namePart>
            <namePart type="family">Ritter</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Lu</namePart>
            <namePart type="family">Wang</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Albuquerque, New Mexico</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-8-89176-189-6</identifier>
    </relatedItem>
    <identifier type="citekey">li-etal-2025-vocot</identifier>
    <identifier type="doi">10.18653/v1/2025.naacl-long.192</identifier>
    <location>
        <url>https://aclanthology.org/2025.naacl-long.192/</url>
    </location>
    <part>
        <date>2025-04</date>
        <extent unit="page">
            <start>3769</start>
            <end>3798</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
%A Li, Zejun
%A Luo, Ruipu
%A Zhang, Jiwen
%A Qiu, Minghui
%A Huang, Xuanjing
%A Wei, Zhongyu
%Y Chiruzzo, Luis
%Y Ritter, Alan
%Y Wang, Lu
%S Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-189-6
%F li-etal-2025-vocot
%R 10.18653/v1/2025.naacl-long.192
%U https://aclanthology.org/2025.naacl-long.192/
%U https://doi.org/10.18653/v1/2025.naacl-long.192
%P 3769-3798

Markdown (Informal)

[VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models](https://aclanthology.org/2025.naacl-long.192/) (Li et al., NAACL 2025)
