NYTAC-CC: A Climate Change Subcorpus of New York Times Articles

Francesca Grasso, Ronny Patz, Manfred Stede


Abstract
Over the past decade, the analysis of discourses on climate change (CC) has attracted growing interest within the social sciences and the NLP community. Textual resources are crucial for understanding how narratives about this phenomenon are crafted and delivered. However, there is still a scarcity of datasets that cover CC in news media in a representative way. This paper presents a CC-specific subcorpus extracted from the New York Times Annotated Corpus of 1.8 million articles, marking the first CC analysis of this data. The subcorpus was created by combining different text selection methods to ensure representativeness and reliability, and the result was further validated using ClimateBERT. To provide initial insights into the CC subcorpus, we discuss the results of a topic modeling experiment (LDA). These results show the diversity of contexts in which CC is discussed in news media over time, which is relevant for various downstream tasks.
Anthology ID:
2024.clicit-1.48
Volume:
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month:
December
Year:
2024
Address:
Pisa, Italy
Editors:
Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
403–409
URL:
https://aclanthology.org/2024.clicit-1.48/
Cite (ACL):
Francesca Grasso, Ronny Patz, and Manfred Stede. 2024. NYTAC-CC: A Climate Change Subcorpus of New York Times Articles. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 403–409, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal):
NYTAC-CC: A Climate Change Subcorpus of New York Times Articles (Grasso et al., CLiC-it 2024)
PDF:
https://aclanthology.org/2024.clicit-1.48.pdf