RuDSI: Graph-based Word Sense Induction Dataset for Russian

Anna Aksenova, Ekaterina Gavrishina, Elisei Rykov, Andrey Kutuzov


Abstract
We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). RuDSI is completely data-driven (based on texts from the Russian National Corpus), with no external word senses imposed on annotators. We analyze the dataset, describe our annotation workflow, show how graph clustering parameters affect the dataset, report the performance that several baseline WSI methods obtain on RuDSI, and discuss possibilities for improving these scores.
Anthology ID:
2022.textgraphs-1.9
Volume:
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Dmitry Ustalov, Yanjun Gao, Alexander Panchenko, Marco Valentino, Mokanarangan Thayaparan, Thien Huu Nguyen, Gerald Penn, Arti Ramesh, Abhik Jana
Venue:
TextGraphs
Publisher:
Association for Computational Linguistics
Pages:
77–88
URL:
https://aclanthology.org/2022.textgraphs-1.9
Cite (ACL):
Anna Aksenova, Ekaterina Gavrishina, Elisei Rykov, and Andrey Kutuzov. 2022. RuDSI: Graph-based Word Sense Induction Dataset for Russian. In Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing, pages 77–88, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
RuDSI: Graph-based Word Sense Induction Dataset for Russian (Aksenova et al., TextGraphs 2022)
PDF:
https://aclanthology.org/2022.textgraphs-1.9.pdf
Code
kategavrishina/rudsi
Data
RUSSE