Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation

Ahmed Njifenjou; Virgile Sucal; Bassam Jabaian; Fabrice Lefèvre

Abstract

The prevailing paradigm in the field of Open-Domain Dialogue (ODD) agents focuses predominantly on a few high-resource languages such as English or Chinese. Furthermore, the financial and temporal investments required to crowd-source such datasets in multiple languages are substantial. Fortunately, advancements in Large Language Models (LLMs), specifically instruction-tuning, have enabled them to execute tasks based on natural language instructions. Additionally, these models can operate in several languages within a single thread. Consequently, to generate new data samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating ODD data in multiple target languages using LLMs, with demonstrations provided in a single source language. By eschewing explicit Machine Translation in this approach, we better preserve language-specific nuances and cultural specificity. We apply this methodology to the PersonaChat dataset. To further improve the openness of the generated dialogues and mimic real-life scenarios, we added the notion of speech events, corresponding to the type of conversation the speakers are involved in, and that of common ground, which represents the premises of a conversation.

Anthology ID: 2025.sigdial-1.55
Volume: Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month: August
Year: 2025
Address: Avignon, France
Editors: Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin
Venue: SIGDIAL
SIG: SIGDIAL
Publisher: Association for Computational Linguistics
Pages: 697–749
URL: https://aclanthology.org/2025.sigdial-1.55/
Cite (ACL): Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, and Fabrice Lefèvre. 2025. Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 697–749, Avignon, France. Association for Computational Linguistics.
Cite (Informal): Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation (Njifenjou et al., SIGDIAL 2025)
PDF: https://aclanthology.org/2025.sigdial-1.55.pdf
