@inproceedings{yamamoto-etal-2025-analysis,
title = "Analysis of Voice Activity Detection Errors in {API}-based Streaming {ASR} for Human-Robot Dialogue",
author = "Yamamoto, Kenta and
Takeda, Ryu and
Komatani, Kazunori",
editor = "Torres, Maria Ines and
Matsuda, Yuki and
Callejas, Zoraida and
del Pozo, Arantza and
D'Haro, Luis Fernando",
booktitle = "Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology",
month = may,
year = "2025",
address = "Bilbao, Spain",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.iwsds-1.26/",
pages = "245--253",
ISBN = "979-8-89176-248-0",
abstract = "In human-robot dialogue systems, streaming automatic speech recognition (ASR) services (e.g., Google ASR) are often utilized, with the microphone positioned close to the robot{'}s loudspeaker. Under these conditions, both the robot{'}s and the user{'}s utterances are captured, resulting in frequent failures to detect user speech. This study analyzes voice activity detection (VAD) errors by comparing results from such streaming ASR to those from standalone VAD models. Experiments conducted on three distinct dialogue datasets showed that streaming ASR tends to ignore user utterances immediately following system utterances. We discuss the underlying causes of these VAD errors and provide recommendations for improving VAD performance in human-robot dialogue."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="yamamoto-etal-2025-analysis">
<titleInfo>
<title>Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kenta</namePart>
<namePart type="family">Yamamoto</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ryu</namePart>
<namePart type="family">Takeda</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kazunori</namePart>
<namePart type="family">Komatani</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-05</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology</title>
</titleInfo>
<name type="personal">
<namePart type="given">Maria</namePart>
<namePart type="given">Ines</namePart>
<namePart type="family">Torres</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuki</namePart>
<namePart type="family">Matsuda</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zoraida</namePart>
<namePart type="family">Callejas</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Arantza</namePart>
<namePart type="family">del Pozo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="given">Fernando</namePart>
<namePart type="family">D’Haro</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Bilbao, Spain</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-248-0</identifier>
</relatedItem>
<abstract>In human-robot dialogue systems, streaming automatic speech recognition (ASR) services (e.g., Google ASR) are often utilized, with the microphone positioned close to the robot’s loudspeaker. Under these conditions, both the robot’s and the user’s utterances are captured, resulting in frequent failures to detect user speech. This study analyzes voice activity detection (VAD) errors by comparing results from such streaming ASR to those from standalone VAD models. Experiments conducted on three distinct dialogue datasets showed that streaming ASR tends to ignore user utterances immediately following system utterances. We discuss the underlying causes of these VAD errors and provide recommendations for improving VAD performance in human-robot dialogue.</abstract>
<identifier type="citekey">yamamoto-etal-2025-analysis</identifier>
<location>
<url>https://aclanthology.org/2025.iwsds-1.26/</url>
</location>
<part>
<date>2025-05</date>
<extent unit="page">
<start>245</start>
<end>253</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue
%A Yamamoto, Kenta
%A Takeda, Ryu
%A Komatani, Kazunori
%Y Torres, Maria Ines
%Y Matsuda, Yuki
%Y Callejas, Zoraida
%Y del Pozo, Arantza
%Y D’Haro, Luis Fernando
%S Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
%D 2025
%8 May
%I Association for Computational Linguistics
%C Bilbao, Spain
%@ 979-8-89176-248-0
%F yamamoto-etal-2025-analysis
%X In human-robot dialogue systems, streaming automatic speech recognition (ASR) services (e.g., Google ASR) are often utilized, with the microphone positioned close to the robot’s loudspeaker. Under these conditions, both the robot’s and the user’s utterances are captured, resulting in frequent failures to detect user speech. This study analyzes voice activity detection (VAD) errors by comparing results from such streaming ASR to those from standalone VAD models. Experiments conducted on three distinct dialogue datasets showed that streaming ASR tends to ignore user utterances immediately following system utterances. We discuss the underlying causes of these VAD errors and provide recommendations for improving VAD performance in human-robot dialogue.
%U https://aclanthology.org/2025.iwsds-1.26/
%P 245-253
Markdown (Informal)
[Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue](https://aclanthology.org/2025.iwsds-1.26/) (Yamamoto et al., IWSDS 2025)
ACL
Kenta Yamamoto, Ryu Takeda, and Kazunori Komatani. 2025. Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue. In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 245–253, Bilbao, Spain. Association for Computational Linguistics.