2024
The Echoes of the ‘I’: Tracing Identity with Demographically Enhanced Word Embeddings
Ivan Smirnov
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers
Identity is one of the most commonly studied constructs in social science. However, despite extensive theoretical work on identity, there remains a need for additional empirical data to validate and refine existing theories. This paper introduces a novel approach to studying identity by enhancing word embeddings with socio-demographic information. As a proof of concept, we demonstrate that our approach successfully reproduces and extends established findings regarding gendered self-views. Our methodology can be applied in a wide variety of settings, allowing researchers to tap into a vast pool of naturally occurring data, such as social media posts. Unlike similar methods already introduced in computer science, our approach allows for the study of differences between social groups. This could be particularly appealing to social scientists and may encourage the faster adoption of computational methods in the field.
Russian Learner Corpus: Towards Error-Cause Annotation for L2 Russian
Daniil Kosakin | Sergei Obiedkov | Ivan Smirnov | Ekaterina Rakhilina | Anastasia Vyrenkova | Ekaterina Zalivina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Russian Learner Corpus (RLC) is a large collection of learner texts in Russian written by native speakers of over forty languages. Learner errors in part of the corpus are manually corrected and annotated. Diverging from conventional error classifications, which typically focus on isolated lexical and grammatical features, the RLC error classification intends to highlight learners’ strategies employed in the process of text production, such as derivational patterns and syntactic relations (including agreement and government). In this paper, we present two open datasets derived from RLC: a manually annotated full-text dataset and a dataset with crowdsourced corrections for individual sentences. In addition, we introduce an automatic error annotation tool that, given an original sentence and its correction, locates and labels errors according to a simplified version of the RLC error-type system. We evaluate the performance of the tool on manually annotated data from RLC.
2020
Fake news detection for the Russian language
Gleb Kuzmin | Daniil Larionov | Dina Pisarevskaya | Ivan Smirnov
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)
In this paper, we trained and compared different models for fake news detection in Russian. For this task, we used the following language features: bag-of-n-grams, bag of Rhetorical Structure Theory features, and BERT embeddings. We also compared the scores of our models with the human score on this task and showed that our models perform better at fake news detection. We investigated the nature of fake news by dividing it into two non-overlapping classes: satire and fake news. As a result, we obtained a set of models for fake news detection; the best of these models achieved an F1-score of 0.889 on the test set for the 2-class task and 0.9076 for the 3-class task.
2019
Towards the Data-driven System for Rhetorical Parsing of Russian Texts
Elena Chistova | Maria Kobozeva | Dina Pisarevskaya | Artem Shelmanov | Ivan Smirnov | Svetlana Toldova
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
Results of the first experimental evaluation of machine learning models trained on Ru-RSTreebank, the first Russian corpus annotated within the RST framework, are presented. Various lexical, quantitative, morphological, and semantic features were used. In rhetorical relation classification, an ensemble of a CatBoost model with selected features and a linear SVM model provides the best score (macro F1 = 54.67 ± 0.38). We find that most of the important features for rhetorical relation classification are related to discourse connectives derived from the connectives lexicon for Russian and from other sources.
Semantic Role Labeling with Pretrained Language Models for Known and Unknown Predicates
Daniil Larionov | Artem Shelmanov | Elena Chistova | Ivan Smirnov
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
We build the first full pipeline for semantic role labelling of Russian texts. The pipeline implements predicate identification, argument extraction, argument classification (labeling), and global scoring via integer linear programming. We train supervised neural network models for argument classification using FrameBank, a semantically annotated Russian corpus. However, we note that this resource provides annotations for only a very limited set of predicates. We combat the problem of annotation scarcity by introducing two models that rely on different sets of features: one for “known” predicates that are present in the training set and one for “unknown” predicates that are not. We show that the model for “unknown” predicates can alleviate the lack of annotation by using pretrained embeddings. We perform experiments with various types of embeddings (word2vec, FastText, ELMo, and BERT) and show that embeddings generated by deep pretrained language models are superior to classical shallow embeddings for argument classification of both “known” and “unknown” predicates.
2016
Building a learner corpus for Russian
Ekaterina Rakhilina | Anastasia Vyrenkova | Elmira Mustakimova | Alina Ladygina | Ivan Smirnov
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition