The State of Relation Extraction Data Quality: Is Bigger Always Better?

Erica Cai, Brendan O’Connor


Abstract
Relation extraction (RE) extracts structured tuples of relationships (e.g., friend, enemy) between entities (e.g., Sherlock Holmes, John Watson) from text, with exciting potential applications. Hundreds of RE papers have been published in recent years; do their evaluation practices inform these goals? We review recent surveys and a sample of recent RE methods papers, compiling 38 datasets currently being used. Unfortunately, many have frequent label errors, and ones with known problems continue to be used. Many datasets focus on producing labels for a large number of relation types, often through error-prone annotation methods (e.g., distant supervision or crowdsourcing), and many recent papers rely exclusively on such datasets. We draw attention to a promising alternative: datasets with a small number of relations, often in specific domains like chemistry, finance, or biomedicine, where it is possible to obtain high-quality expert annotations; such data can more realistically evaluate RE performance. The research community should consider more often using such resources.
Anthology ID:
2024.findings-acl.470
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7893–7906
URL:
https://aclanthology.org/2024.findings-acl.470
Cite (ACL):
Erica Cai and Brendan O’Connor. 2024. The State of Relation Extraction Data Quality: Is Bigger Always Better? In Findings of the Association for Computational Linguistics ACL 2024, pages 7893–7906, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
The State of Relation Extraction Data Quality: Is Bigger Always Better? (Cai & O’Connor, Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.470.pdf