BehanceQA: A New Dataset for Identifying Question-Answer Pairs in Video Transcripts

Amir Pouran Ben Veyseh, Viet Lai, Franck Dernoncourt, Thien Nguyen


Abstract
Question-Answer (QA) pairs are an effective way of storing knowledge for future retrieval. Identifying mentions of questions and their answers in text is therefore necessary for knowledge construction and retrieval systems. QA identification has been well studied in the NLP community, but most prior work is restricted to formal written documents such as papers or websites. As a result, questions and answers that appear in informal or noisy documents have not been adequately studied. One domain that can benefit significantly from QA identification is livestreaming video transcripts, which contain abundant QA pairs that provide valuable knowledge for future users and services. Because video transcripts are often transcribed automatically for scale, they are prone to errors. Combined with the informal nature of discussion in a video, this means prior QA identification systems might not perform well in this domain. To enable comprehensive research in this domain, we present a large-scale QA identification dataset annotated by humans over transcripts of 500 hours of streamed videos. We use Behance.net to collect the videos and their automatically obtained transcripts. Furthermore, we conduct an extensive analysis of the annotated dataset to understand the complexity of QA identification for livestreaming video transcripts. Our experiments show that the annotated dataset presents unique challenges for existing methods and that more research is needed to find more effective ones. The dataset and the models developed in this work will be publicly released for future research.
Anthology ID:
2022.lrec-1.796
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
7321–7327
URL:
https://aclanthology.org/2022.lrec-1.796
Cite (ACL):
Amir Pouran Ben Veyseh, Viet Lai, Franck Dernoncourt, and Thien Nguyen. 2022. BehanceQA: A New Dataset for Identifying Question-Answer Pairs in Video Transcripts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7321–7327, Marseille, France. European Language Resources Association.
Cite (Informal):
BehanceQA: A New Dataset for Identifying Question-Answer Pairs in Video Transcripts (Pouran Ben Veyseh et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.796.pdf
Code:
amirveyseh/behanceqa