Do audio and visual tokenizers capture backchannels?

Benoit Favre; Auriane Boudin

Abstract

Audio and video tokenizers are autoencoders trained to represent the content of recordings as a sequence of vectors. They are widely used to interface large language models with non-textual modalities. While they enable advanced applications such as video generation, their limitations in the context of multimodal conversation are not well understood. This work focuses on backchannels, which listeners use to signal to the speaker that they are listening. This feedback is essential to maintaining conversational flow. Using linear probing, we evaluate whether a representative set of audio and video tokenizers encodes backchannels. Results show that audio tokenizers capture the phenomenon relatively well, whereas backchannels are not linearly separable in the representations of video tokenizers. However, joint representations obtained by concatenating the representations of both modalities improve accuracy significantly over audio-only representations, which argues for training multimodal tokenizers.
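As a rough illustration of what linear probing on concatenated representations involves, the sketch below fits a logistic-regression probe on frozen feature vectors and compares an audio-only probe with a joint one. Everything in it is an assumption made for illustration: the features and labels are synthetic, the names `train_linear_probe`, `accuracy`, and `make_synthetic` are hypothetical, and none of it reproduces the authors' tokenizers, data, or training setup.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, epochs=200, lr=0.5):
    """Fit logistic regression (a linear probe) with full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    hits = 0
    for x, label in zip(X, y):
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += predicted == label
    return hits / len(X)


def make_synthetic(n, rng):
    # ASSUMPTION: stand-ins for frozen tokenizer embeddings; the label depends
    # on one audio dimension and one video dimension, purely to exercise the code.
    audio = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    video = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    labels = [1 if a[0] + v[0] > 0 else 0 for a, v in zip(audio, video)]
    return audio, video, labels


rng = random.Random(0)
a_tr, v_tr, y_tr = make_synthetic(400, rng)
a_te, v_te, y_te = make_synthetic(200, rng)

# Joint representation: concatenate the per-example audio and video vectors.
joint_tr = [a + v for a, v in zip(a_tr, v_tr)]
joint_te = [a + v for a, v in zip(a_te, v_te)]

acc_audio = accuracy(*train_linear_probe(a_tr, y_tr), a_te, y_te)
acc_joint = accuracy(*train_linear_probe(joint_tr, y_tr), joint_te, y_te)
print(f"audio-only probe accuracy: {acc_audio:.2f}")
print(f"joint probe accuracy:      {acc_joint:.2f}")
```

Because the probe is linear and the underlying features stay frozen, its accuracy reflects how linearly accessible the label is in the representation rather than what a more expressive classifier could extract from it.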

Anthology ID: 2026.iwsds-1.6
Volume: Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Month: February
Year: 2026
Address: Trento, Italy
Editors: Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
Venue: IWSDS
Publisher: Association for Computational Linguistics
Pages: 64–75
URL: https://aclanthology.org/2026.iwsds-1.6/
Cite (ACL): Benoit Favre and Auriane Boudin. 2026. Do audio and visual tokenizers capture backchannels? In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 64–75, Trento, Italy. Association for Computational Linguistics.
Cite (Informal): Do audio and visual tokenizers capture backchannels? (Favre & Boudin, IWSDS 2026)
PDF: https://aclanthology.org/2026.iwsds-1.6.pdf
