Omer Vandsburger


2024

pdf bib
Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts
Lotem Golany | Filippo Galgani | Maya Mamo | Nimrod Parasol | Omer Vandsburger | Nadav Bar | Ido Dagan
Findings of the Association for Computational Linguistics: EMNLP 2024

Automating data generation with Large Language Models (LLMs) has become increasingly popular. In this work, we investigate the feasibility and effectiveness of LLM-based data generation in the challenging setting of source-grounded information-seeking dialogs, with response attribution, over long documents. Our source texts consist of long and noisy meeting transcripts, adding to the task complexity. Since automating attribution remains difficult, we propose a semi-automatic approach: dialog queries and responses are generated with LLMs, followed by human verification and identification of attribution spans. Using this approach, we created MISeD – Meeting Information Seeking Dialogs dataset – a dataset of information-seeking dialogs focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance compared to off-the-shelf models, even those of larger size. Finetuning on MISeD gives comparable response generation quality to finetuning on fully manual data, while improving attribution quality and reducing time and effort.