On the Robustness of Agentic Function Calling

Ella Rabinovich; Ateret Anaby Tavor

doi:10.18653/v1/2025.trustnlp-main.20

Abstract

Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

Anthology ID: 2025.trustnlp-main.20
Volume: Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month: May
Year: 2025
Address: Albuquerque, New Mexico
Editors: Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galystan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 298–304
URL: https://aclanthology.org/2025.trustnlp-main.20/
DOI: 10.18653/v1/2025.trustnlp-main.20
Bibkey: rabinovich-anaby-tavor-2025-robustness
Cite (ACL): Ella Rabinovich and Ateret Anaby Tavor. 2025. On the Robustness of Agentic Function Calling. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 298–304, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): On the Robustness of Agentic Function Calling (Rabinovich & Anaby Tavor, TrustNLP 2025)
PDF: https://aclanthology.org/2025.trustnlp-main.20.pdf

Export citation

BibTeX

@inproceedings{rabinovich-anaby-tavor-2025-robustness,
    title = "On the Robustness of Agentic Function Calling",
    author = "Rabinovich, Ella  and
      Anaby Tavor, Ateret",
    editor = "Cao, Trista  and
      Das, Anubrata  and
      Kumarage, Tharindu  and
      Wan, Yixin  and
      Krishna, Satyapriya  and
      Mehrabi, Ninareh  and
      Dhamala, Jwala  and
      Ramakrishna, Anil  and
      Galystan, Aram  and
      Kumar, Anoop  and
      Gupta, Rahul  and
      Chang, Kai-Wei",
    booktitle = "Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.trustnlp-main.20/",
    doi = "10.18653/v1/2025.trustnlp-main.20",
    pages = "298--304",
    ISBN = "979-8-89176-233-6",
    abstract = "Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="rabinovich-anaby-tavor-2025-robustness">
    <titleInfo>
        <title>On the Robustness of Agentic Function Calling</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Ella</namePart>
        <namePart type="family">Rabinovich</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ateret</namePart>
        <namePart type="family">Anaby Tavor</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2025-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Trista</namePart>
            <namePart type="family">Cao</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Anubrata</namePart>
            <namePart type="family">Das</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Tharindu</namePart>
            <namePart type="family">Kumarage</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Yixin</namePart>
            <namePart type="family">Wan</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Satyapriya</namePart>
            <namePart type="family">Krishna</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ninareh</namePart>
            <namePart type="family">Mehrabi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jwala</namePart>
            <namePart type="family">Dhamala</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Anil</namePart>
            <namePart type="family">Ramakrishna</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Aram</namePart>
            <namePart type="family">Galystan</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Anoop</namePart>
            <namePart type="family">Kumar</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Rahul</namePart>
            <namePart type="family">Gupta</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Kai-Wei</namePart>
            <namePart type="family">Chang</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Albuquerque, New Mexico</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-8-89176-233-6</identifier>
    </relatedItem>
    <abstract>Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.</abstract>
    <identifier type="citekey">rabinovich-anaby-tavor-2025-robustness</identifier>
    <identifier type="doi">10.18653/v1/2025.trustnlp-main.20</identifier>
    <location>
        <url>https://aclanthology.org/2025.trustnlp-main.20/</url>
    </location>
    <part>
        <date>2025-05</date>
        <extent unit="page">
            <start>298</start>
            <end>304</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T On the Robustness of Agentic Function Calling
%A Rabinovich, Ella
%A Anaby Tavor, Ateret
%Y Cao, Trista
%Y Das, Anubrata
%Y Kumarage, Tharindu
%Y Wan, Yixin
%Y Krishna, Satyapriya
%Y Mehrabi, Ninareh
%Y Dhamala, Jwala
%Y Ramakrishna, Anil
%Y Galystan, Aram
%Y Kumar, Anoop
%Y Gupta, Rahul
%Y Chang, Kai-Wei
%S Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
%D 2025
%8 May
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-233-6
%F rabinovich-anaby-tavor-2025-robustness
%X Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.
%R 10.18653/v1/2025.trustnlp-main.20
%U https://aclanthology.org/2025.trustnlp-main.20/
%U https://doi.org/10.18653/v1/2025.trustnlp-main.20
%P 298-304

Markdown (Informal)

[On the Robustness of Agentic Function Calling](https://aclanthology.org/2025.trustnlp-main.20/) (Rabinovich & Anaby Tavor, TrustNLP 2025)
