WikiFirst: A Genre-Fixed, Content-controlled Corpus for Evaluating Content Effects in Authorship Analysis

Dung Nguyen, G. Çağatay Sat, Evgeny Pyshkin, John Blake


Abstract
This paper presents the design and construction of WikiFirst, a corpus for investigating the impact of content variation on authorship similarity under a fixed genre. Prior work has investigated individual authorial style and the impact of genre, but the role of content has remained underexplored due to the lack of suitable data. We address this gap by constructing a Wikipedia-based corpus consisting exclusively of first revisions authored by non-anonymous editors, thereby ensuring high authorship certainty while maintaining a stable encyclopaedic genre.
Anthology ID:
2026.latechclfl-1.31
Volume:
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Diego Alves, Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, Janis Pagel, Stan Szpakowicz
Venues:
LaTeCH-CLfL | WS
SIG:
SIGHUM
Publisher:
Association for Computational Linguistics
Pages:
323–327
URL:
https://aclanthology.org/2026.latechclfl-1.31/
Cite (ACL):
Dung Nguyen, G. Çağatay Sat, Evgeny Pyshkin, and John Blake. 2026. WikiFirst: A Genre-Fixed, Content-controlled Corpus for Evaluating Content Effects in Authorship Analysis. In Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026, pages 323–327, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
WikiFirst: A Genre-Fixed, Content-controlled Corpus for Evaluating Content Effects in Authorship Analysis (Nguyen et al., LaTeCH-CLfL 2026)
PDF:
https://aclanthology.org/2026.latechclfl-1.31.pdf
Supplementary material:
2026.latechclfl-1.31.SupplementaryMaterial.zip