GOLEM: GOld Standard for Learning and Evaluation of Motifs

W. Victor Yarlott, Anurag Acharya, Diego Castro Estrada, Diana Gomez, Mark Finlayson


Abstract
Motifs are distinctive, recurring, widely used idiom-like words or phrases, often originating from folklore, whose meaning are anchored in a narrative. Motifs have significance as communicative devices because they concisely imply a constellation of culturally relevant information. Their broad usage suggests their cognitive importance as touchstones of cultural knowledge. We present GOLEM, the first dataset annotated for motific information. The dataset comprises 7,955 English articles (2,039,424 words). The corpus identifies 26,078 motif candidates across 34 motif types from three cultural or national groups: Jewish, Irish, and Puerto Rican. Each motif candidate is labeled with the type of usage (Motific, Referential, Eponymic, or Unrelated), resulting in 1,723 actual motific instances. Annotation was performed by individuals identifying as members of each group and achieved a Fleiss’ kappa of >0.55. We demonstrate that classification of candidate type is a challenging task for LLMs using a few-shot approach; recent models such as T5, FLAN-T5, GPT-2, and Llama 2 (7B) achieved a performance of 41% accuracy at best. These data will support development of new models and approaches for detecting (and reasoning about) motific information in text. We release the corpus, the annotation guide, and the code to support other researchers building on this work.
Anthology ID:
2024.lrec-main.689
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
7801–7813
Language:
URL:
https://aclanthology.org/2024.lrec-main.689
DOI:
Bibkey:
Cite (ACL):
W. Victor Yarlott, Anurag Acharya, Diego Castro Estrada, Diana Gomez, and Mark Finlayson. 2024. GOLEM: GOld Standard for Learning and Evaluation of Motifs. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7801–7813, Torino, Italia. ELRA and ICCL.
Cite (Informal):
GOLEM: GOld Standard for Learning and Evaluation of Motifs (Yarlott et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.lrec-main.689.pdf