Zero-shot cross-lingual identification of direct speech using distant supervision

Murathan Kurfalı, Mats Wirén


Abstract
Prose fiction typically consists of passages alternating between the narrator’s telling of the story and the characters’ direct speech in that story. Detecting direct speech is crucial for the downstream analysis of narrative structure, and may seem easy at first thanks to quotation marks. However, typographical conventions vary across languages, and as a result, almost all approaches to this problem have been monolingual. In contrast, the aim of this paper is to provide a multilingual method for identifying direct speech. To this end, we created a training corpus by using a set of heuristics to automatically find texts in which quotation marks are used sufficiently consistently. We then removed the quotation marks and developed a sequence classifier based on multilingual-BERT which classifies each token as belonging to narration or speech. Crucially, because the classifier was trained with the quotation marks removed, it was forced to learn the linguistic characteristics of direct speech rather than the typography of quotation marks. The results of the proposed model in the zero-shot setting are comparable to the strong supervised baselines, indicating that this is a feasible approach.
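The distant-supervision step described in the abstract (use the quotation marks to derive token labels, then remove them from the text) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `strip_quotes_and_label`, the label names `N` and `S`, and the simple regular-expression tokenizer are all assumptions made for illustration, and the sketch assumes the quotation marks in the input are already balanced and consistent.

```python
import re

# Hypothetical label names; the abstract only says each token is classified
# as belonging to narration or speech.
NARRATION, SPEECH = "N", "S"


def strip_quotes_and_label(text, open_q="“", close_q="”"):
    """Drop the quotation marks and label each remaining token by whether
    it appeared inside them (assumed tokenization: words and punctuation)."""
    # Match quotation marks first, then word runs, then single punctuation marks.
    pattern = re.compile(
        f"{re.escape(open_q)}|{re.escape(close_q)}|\\w+|[^\\w\\s]"
    )
    labelled = []
    in_speech = False
    for match in pattern.finditer(text):
        piece = match.group()
        if piece in (open_q, close_q):
            if open_q == close_q:
                # Identical opening and closing marks: toggle the state.
                in_speech = not in_speech
            else:
                in_speech = piece == open_q
            continue  # the quotation mark itself is removed from the output
        labelled.append((piece, SPEECH if in_speech else NARRATION))
    return labelled


if __name__ == "__main__":
    for token, label in strip_quotes_and_label("“Come in,” she said."):
        print(token, label)
```

The resulting token and label pairs contain no quotation marks, so a classifier trained on them has to rely on cues other than typography, which is the motivation stated in the abstract.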
Anthology ID:
2020.latechclfl-1.12
Volume:
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
December
Year:
2020
Address:
Online
Editors:
Stefania DeGaetano, Anna Kazantseva, Nils Reiter, Stan Szpakowicz
Venue:
LaTeCHCLfL
Publisher:
International Committee on Computational Linguistics
Pages:
105–111
URL:
https://aclanthology.org/2020.latechclfl-1.12
Cite (ACL):
Murathan Kurfalı and Mats Wirén. 2020. Zero-shot cross-lingual identification of direct speech using distant supervision. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 105–111, Online. International Committee on Computational Linguistics.
Cite (Informal):
Zero-shot cross-lingual identification of direct speech using distant supervision (Kurfalı & Wirén, LaTeCHCLfL 2020)
PDF:
https://aclanthology.org/2020.latechclfl-1.12.pdf