What’s in a (dataset’s) name? The case of BigPatent

Silvia Casola, Alberto Lavelli, Horacio Saggion


Abstract
Sharing datasets and benchmarks has been crucial for rapidly improving Natural Language Processing models and systems. Documenting datasets’ characteristics (and any modification introduced over time) is equally important to avoid confusion and make comparisons reliable. Here, we describe the case of BigPatent, a dataset for patent summarization that exists in at least two rather different versions under the same name. While previous literature has not clearly distinguished between the versions, their differences do not lie only at the surface level but also modify the dataset’s core nature and, thus, the complexity of the summarization task. While this paper describes a specific case, we aim to shed light on new challenges that might emerge in resource sharing and advocate for comprehensive documentation of datasets and models.
Anthology ID:
2022.gem-1.34
Volume:
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Antoine Bosselut, Khyathi Chandu, Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Yacine Jernite, Jekaterina Novikova, Laura Perez-Beltrachini
Venue:
GEM
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
399–404
URL:
https://aclanthology.org/2022.gem-1.34
DOI:
10.18653/v1/2022.gem-1.34
Cite (ACL):
Silvia Casola, Alberto Lavelli, and Horacio Saggion. 2022. What’s in a (dataset’s) name? The case of BigPatent. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 399–404, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
What’s in a (dataset’s) name? The case of BigPatent (Casola et al., GEM 2022)
PDF:
https://aclanthology.org/2022.gem-1.34.pdf
Video:
https://aclanthology.org/2022.gem-1.34.mp4