Tiny But Mighty: A Crowdsourced Benchmark Dataset for Triple Extraction from Unstructured Text

Muhammad Salman, Armin Haller, Sergio J. Rodriguez Mendez, Usman Naseem


Abstract
In the context of Natural Language Processing (NLP) and Semantic Web applications, constructing Knowledge Graphs (KGs) from unstructured text plays a vital role. Several techniques have been developed for KG construction from text, but the lack of standardized datasets hinders the evaluation of triple extraction methods; existing KG construction approaches are evaluated on structured data or through manual inspection. To overcome this limitation, this work introduces a novel dataset specifically designed to evaluate KG construction techniques from unstructured text. Our dataset consists of a diverse collection of compound and complex sentences meticulously annotated by human annotators with potential triples (subject, verb, object). The annotations underwent further scrutiny by expert ontologists to ensure accuracy and consistency. For evaluation, the proposed F-measure criterion offers a robust way to quantify relatedness and assess alignment between extracted triples and ground-truth triples, providing a valuable tool for measuring the performance of triple extraction systems. By providing a diverse collection of high-quality triples, our proposed benchmark dataset offers a comprehensive training and evaluation set for refining the performance of state-of-the-art language models on the triple extraction task. Furthermore, this dataset supports various KG-related tasks, such as named entity recognition, relation extraction, and entity linking.
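
To make the evaluation criterion concrete, the sketch below shows how a triple-level precision, recall, and F-measure could be computed between extracted and ground-truth triples. It is a minimal illustration only: the paper's criterion scores relatedness between triples, whereas the `matches` function here is a hypothetical stand-in that assumes case-insensitive exact matching on all three slots.

```python
# Minimal sketch of a triple-level F-measure.
# Assumption: exact matching; the paper's criterion instead quantifies
# relatedness/alignment, for which `matches` is a hypothetical placeholder.

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)


def matches(pred: Triple, gold: Triple) -> bool:
    """Placeholder matcher: case-insensitive exact match on all three slots."""
    return all(p.strip().lower() == g.strip().lower() for p, g in zip(pred, gold))


def triple_f_measure(predicted: List[Triple], gold: List[Triple]) -> float:
    """Precision/recall/F1 over extracted vs. ground-truth triples."""
    if not predicted or not gold:
        return 0.0
    # Precision: fraction of extracted triples matched by some gold triple.
    precision = sum(any(matches(p, g) for g in gold) for p in predicted) / len(predicted)
    # Recall: fraction of gold triples matched by some extracted triple.
    recall = sum(any(matches(p, g) for p in predicted) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example usage (illustrative data only):
pred = [("Marie Curie", "discovered", "radium")]
gold = [("Marie Curie", "discovered", "radium"),
        ("Marie Curie", "won", "the Nobel Prize")]
print(triple_f_measure(pred, gold))  # precision 1.0, recall 0.5 -> F1 ~ 0.667
```

Any relatedness-aware scoring (e.g., a semantic similarity threshold per slot) would slot in by replacing `matches`; the aggregation into precision, recall, and F-measure stays the same.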
Anthology ID:
2024.isa-1.10
Volume:
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Harry Bunt, Nancy Ide, Kiyong Lee, Volha Petukhova, James Pustejovsky, Laurent Romary
Venues:
ISA | WS
Publisher:
ELRA and ICCL
Pages:
71–81
URL:
https://aclanthology.org/2024.isa-1.10
Cite (ACL):
Muhammad Salman, Armin Haller, Sergio J. Rodriguez Mendez, and Usman Naseem. 2024. Tiny But Mighty: A Crowdsourced Benchmark Dataset for Triple Extraction from Unstructured Text. In Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024, pages 71–81, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Tiny But Mighty: A Crowdsourced Benchmark Dataset for Triple Extraction from Unstructured Text (Salman et al., ISA-WS 2024)
PDF:
https://aclanthology.org/2024.isa-1.10.pdf