Constructing a Sentiment-Annotated Corpus of Austrian Historical Newspapers: Challenges, Tools, and Annotator Experience

Lucija Krusic


Abstract
This study presents the development of a sentiment-annotated corpus of historical newspaper texts in Austrian German, addressing a gap in annotated corpora for Natural Language Processing in the field of Digital Humanities. Three annotators categorised 1005 sentences from two 19th-century periodicals into four sentiment categories: positive, negative, neutral, and mixed. The annotators, Masters and PhD students in Linguistics and Digital Humanities, are considered semi-experts and have received substantial training during this annotation study. Three tools were used and compared in the annotation process: Google Sheets, Google Forms and Doccano, and resulted in a gold standard corpus. The analysis revealed a fair to moderate inter-rater agreement (Fleiss’ kappa = 0.405) and an average percentage agreement of 45.7% for full consensus and 92.5% for majority vote. As majority vote is needed for the creation of a gold standard corpus, these results are considered sufficient, and the annotations reliable. The study also introduced comprehensive guidelines for sentiment annotation, which were essential to overcome the challenges posed by historical language and context. The annotators’ experience was assessed through a combination of standardised usability tests (NASA-TLX and UEQ-S) and a detailed custom-made user experience questionnaire, which provided qualitative insights into the difficulties and usability of the tools used. The questionnaire is an additional resource that can be used to assess usability and user experience assessments in future annotation studies. The findings demonstrate the effectiveness of semi-expert annotators and dedicated tools in producing reliable annotations and provide valuable resources, including the annotated corpus, guidelines, and a user experience questionnaire, for future sentiment analysis and annotation of Austrian historical texts. The sentiment-annotated corpus will be used as the gold standard for fine-tuning and evaluating machine learning models for sentiment analysis of Austrian historical newspapers with the topic of migration and minorities in a subsequent study.
Anthology ID:
2024.nlp4dh-1.6
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
51–62
Language:
URL:
https://aclanthology.org/2024.nlp4dh-1.6
DOI:
Bibkey:
Cite (ACL):
Lucija Krusic. 2024. Constructing a Sentiment-Annotated Corpus of Austrian Historical Newspapers: Challenges, Tools, and Annotator Experience. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 51–62, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
Constructing a Sentiment-Annotated Corpus of Austrian Historical Newspapers: Challenges, Tools, and Annotator Experience (Krusic, NLP4DH 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.nlp4dh-1.6.pdf