Andrey Kutuzov - ACL Anthology

Andrey Kutuzov

2026

DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova | Andrey Kutuzov | Khonzoda Umarova
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving other researchers free to choose their own target words using the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field.

The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
Nina Tahmasebi | Pierluigi Cassotti | Syrielle Montariol | Andrey Kutuzov | Netta Huebscher | Elena Spaziani | Naomi Baes
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

2025

The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. When evaluated on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

Small Languages, Big Models: A Study of Continual Training on Languages of Norway
David Samuel | Vladislav Mikhailov | Erik Velldal | Lilja Øvrelid | Lucas Georges Gabriel Charpentier | Andrey Kutuzov | Stephan Oepen
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Vladislav Mikhailov | Tita Enstad | David Samuel | Hans Christian Farsethås | Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Findings of the Association for Computational Linguistics: ACL 2025

This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets – of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-created prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pretrained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Explaining novel senses using definition generation with open language models
Mariia Fedorova | Andrey Kutuzov | Francesco Periti | Yves Scherrer
Findings of the Association for Computational Linguistics: EMNLP 2025

We apply definition generators based on open-weights large language models to the task of creating explanations of novel senses, taking target word usages as an input. To this end, we employ the datasets from the AXOLOTL’24 shared task on explainable semantic change modeling, which features Finnish, Russian and German. We fine-tune and publicly release open-source models that outperform the best submissions to the aforementioned shared task, which employed closed proprietary LLMs. In addition, we find that encoder-decoder definition generators perform on par with their decoder-only counterparts.

2024

Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Andrey Kutuzov | David Alfter | Francesco Periti | Pierluigi Cassotti | Netta Huebscher
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

AXOLOTL’24 Shared Task on Multilingual Explainable Semantic Change Modeling
Mariia Fedorova | Timothee Mickus | Niko Partanen | Janine Siewert | Elena Spaziani | Andrey Kutuzov
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
Pinzhen Chen | Shaoxiong Ji | Nikolay Bogoychev | Andrey Kutuzov | Barry Haddow | Kenneth Heafield
Findings of the Association for Computational Linguistics: EACL 2024

Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which is then used to tune LLMs through either low-rank adaptation or full-parameter training. Under a controlled computation budget, comparisons show that multilingual tuning is on par with or better than tuning a model for each language. Furthermore, multilingual tuning with downsampled data can be as powerful and more robust. Our findings serve as a guide for expanding language support through instruction tuning.

A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert | Graeme Nail | Nikolay Arefyev | Marta Bañón | Jelmer van der Linde | Shaoxiong Ji | Jaume Zaragoza-Bernabeu | Mikko Aulamo | Gema Ramírez-Sánchez | Andrey Kutuzov | Sampo Pyysalo | Stephan Oepen | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Enriching Word Usage Graphs with Cluster Definitions
Andrey Kutuzov | Mariia Fedorova | Dominik Schlechtweg | Nikolay Arefyev
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. They are generated from scratch by fine-tuned encoder-decoder language models. The conducted human evaluation has shown that these definitions match the existing clusters in WUGs better than the definitions chosen from WordNet by two baseline systems. At the same time, the method is straightforward to use and easy to extend to new languages. The resulting enriched datasets can be extremely helpful for moving on to explainable semantic change modeling.

Definition generation for lexical semantic change detection
Mariia Fedorova | Andrey Kutuzov | Yves Scherrer
Findings of the Association for Computational Linguistics: ACL 2024

We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as ‘senses’, and the change score of a target word is retrieved by comparing their distributions in the two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and makes it possible to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.

2023

Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel | Andrey Kutuzov | Lilja Øvrelid | Erik Velldal
Findings of the Association for Computational Linguistics: EACL 2023

While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

NorBench – A Benchmark for Norwegian Language Models
David Samuel | Andrey Kutuzov | Samia Touileb | Erik Velldal | Lilja Øvrelid | Egil Rønningstad | Elina Sigdel | Anna Palatkina
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.

Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Haim Dubossarsky | Andrey Kutuzov | Simon Hengchen | David Alfter | Francesco Periti | Pierluigi Cassotti
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis
Mario Giulianelli | Iris Luden | Raquel Fernandez | Andrey Kutuzov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label. We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users — historical linguists, lexicographers, or social scientists — to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the ‘definitions as representations’ paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a new promising type of lexical representation for NLP.

2022

NorDiaChange: Diachronic Semantic Change Dataset for Norwegian
Andrey Kutuzov | Samia Touileb | Petter Mæhlum | Tita Enstad | Alexandra Wittemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and post-war events, oil and gas discovery in Norway, and technological developments. The annotation was done using the DURel framework and two large historical Norwegian corpora. NorDiaChange is published in full under a permissive licence, complete with raw annotation data and inferred diachronic word usage graphs (DWUGs).

Contextualized embeddings for semantic change detection: Lessons learned
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Northern European Journal of Language Technology, Volume 8

We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method outperforming previously described contextualized approaches. This method is used as a basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail with examples, and their linguistic categorization is proposed. Our conclusion is that pre-trained contextualized language models are prone to confounding changes in lexicographic senses with changes in contextual variance; this naturally stems from their distributional nature, but differs from the types of issues observed in methods based on static embeddings. Additionally, they often merge together syntactic and semantic aspects of lexical entities. We propose a range of possible future solutions to these issues.

RuDSI: Graph-based Word Sense Induction Dataset for Russian
Anna Aksenova | Ekaterina Gavrishina | Elisei Rykov | Andrey Kutuzov
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). RuDSI is completely data-driven (based on texts from Russian National Corpus), with no external word senses imposed on annotators. We present and analyze RuDSI, describe our annotation workflow, show how graph clustering parameters affect the dataset, report the performance that several baseline WSI methods obtain on RuDSI and discuss possibilities for improving these scores.

SemEval 2022 Task 10: Structured Sentiment Analysis
Jeremy Barnes | Laura Oberlaender | Enrica Troiano | Andrey Kutuzov | Jan Buchmann | Rodrigo Agerri | Lilja Øvrelid | Erik Velldal
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this paper, we introduce the first SemEval shared task on Structured Sentiment Analysis, for which participants are required to predict all sentiment graphs in a text, where a single sentiment graph is composed of a sentiment holder, target, expression and polarity. This new shared task includes two subtracks (monolingual and cross-lingual) with seven datasets available in five languages, namely Norwegian, Catalan, Basque, Spanish and English. Participants submitted their predictions on a held-out test set and were evaluated on Sentiment Graph F1. Overall, the task received over 200 submissions from 32 participating teams. We present the results of the 15 teams that provided system descriptions and our own expanded analysis of the test predictions.

Do Not Fire the Linguist: Grammatical Profiles Help Language Models Detect Semantic Change
Mario Giulianelli | Andrey Kutuzov | Lidia Pivovarova
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

Morphological and syntactic changes in word usage — as captured, e.g., by grammatical profiles — have been shown to be good predictors of a word’s meaning change. In this work, we explore whether large pre-trained contextualised language models, a common tool for lexical semantic change detection, are sensitive to such morphosyntactic changes. To this end, we first compare the performance of grammatical profiles against that of a multilingual neural language model (XLM-R) on 10 datasets, covering 7 languages, and then combine the two approaches in ensembles to assess their complementarity. Our results show that ensembling grammatical profiles with XLM-R improves semantic change detection performance for most datasets and languages. This indicates that language models do not fully cover the fine-grained morphological and syntactic signals that are explicitly represented in grammatical profiles. Interesting exceptions are the test sets where the time spans under analysis are much longer than the time gap between them (for example, century-long spans with a one-year gap between them). Morphosyntactic change is slow, so grammatical profiles do not detect change in such cases. In contrast, language models, thanks to their access to lexical information, are able to detect fast topical changes.

Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Andrey Kutuzov | Simon Hengchen | Haim Dubossarsky | Lars Borin
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

2021

Multilingual ELMo and the Effects of Corpus Sampling
Vinit Ravishankar | Andrey Kutuzov | Lilja Øvrelid | Erik Velldal
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Multilingual pretrained language models are rapidly gaining popularity in NLP systems for non-English languages. Most of these models feature an important corpus sampling step in the process of accumulating training data in different languages, to ensure that the signal from better resourced languages does not drown out poorly resourced ones. In this study, we train multiple multilingual recurrent language models, based on the ELMo architecture, and analyse both the effect of varying corpus size ratios on downstream performance and the performance difference between monolingual models for each language and broader multilingual language models. As part of this effort, we also make these trained models available for public use.

Large-Scale Contextualised Language Modelling for Norwegian
Andrey Kutuzov | Jeremy Barnes | Erik Velldal | Lilja Øvrelid | Stephan Oepen
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We present the ongoing NorLM initiative to support the creation and use of very large contextualised language models for Norwegian (and in principle other Nordic languages), including a ready-to-use software environment, as well as an experience report for data preparation and training. This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks. In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian. For additional background and access to the data, models, and software, please see: http://norlm.nlpl.eu

Grammatical Profiling for Semantic Change Detection
Andrey Kutuzov | Lidia Pivovarova | Mario Giulianelli
Proceedings of the 25th Conference on Computational Natural Language Learning

Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words. We demonstrate that it can be used for semantic change detection and even outperforms some distributional semantic methods. We present an in-depth qualitative and quantitative analysis of the predictions made by our grammatical profiling system, showing that they are plausible and interpretable.

Three-part diachronic semantic change dataset for Russian
Andrey Kutuzov | Lidia Pivovarova
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

We present a manually annotated lexical semantic change dataset for Russian: RuShiftEval. Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods, while the previous work either used only two time periods, or different sets of target words. The paper describes the composition and annotation procedure for the dataset. In addition, it is shown how the ternary nature of RuShiftEval makes it possible to trace specific diachronic trajectories: ‘changed at a particular time period and stable afterwards’ or ‘was changing throughout all time periods’. Based on the analysis of the submissions to the recent shared task on semantic change detection for Russian, we argue that correctly identifying such trajectories can be an interesting sub-task itself.

Representing ELMo embeddings as two-dimensional text online
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We describe a new addition to the WebVectors toolkit which is used to serve word embedding models over the Web. The new ELMoViz module adds support for contextualized embedding architectures, in particular for ELMo models. The provided visualizations follow the metaphor of ‘two-dimensional text’ by showing lexical substitutes: words which are most semantically similar in context to the words of the input sentence. The system allows the user to change the ELMo layers from which token embeddings are inferred. It also conveys corpus information about the query words and their lexical substitutes (namely their frequency tiers and parts of speech). The module is well integrated into the rest of the WebVectors toolkit, providing lexical hyperlinks to word representations in static embedding models. Two web services have already implemented the new functionality with pre-trained ELMo models for Russian, Norwegian and English.

2020

Word Sense Disambiguation for 158 Languages using Word Embeddings Only
Varvara Logacheva | Denis Teslenko | Artem Shelmanov | Steffen Remus | Dmitry Ustalov | Andrey Kutuzov | Ekaterina Artemova | Chris Biemann | Simone Paolo Ponzetto | Alexander Panchenko
Proceedings of the Twelfth Language Resources and Evaluation Conference

Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al. (2018), enabling WSD in these languages. Models and system are available online.

UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection
Andrey Kutuzov | Mario Giulianelli
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.

RuSemShift: a dataset of historical lexical semantic change in Russian
Julia Rodina | Andrey Kutuzov
Proceedings of the 28th International Conference on Computational Linguistics

We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sentence contexts extracted from the Russian National Corpus. Additionally, we report the performance of several distributional approaches on RuSemShift, achieving promising results, which at the same time leave room for other researchers to improve.

2019

ÚFAL-Oslo at MRP 2019: Garage Sale Semantic Parsing
Kira Droganova | Andrey Kutuzov | Nikita Mediankin | Daniel Zeman
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

This paper describes the ÚFAL-Oslo system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP, Oepen et al. 2019). The submission is based on several third-party parsers. Within the official shared task results, the submission ranked 11th out of 13 participating systems.

Measuring Diachronic Evolution of Evaluative Adjectives with Word Embeddings: the Case for English, Norwegian, and Russian
Julia Rodina | Daria Bakshandaeva | Vadim Fomin | Andrey Kutuzov | Samia Touileb | Erik Velldal
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We measure the intensity of diachronic semantic shifts in adjectives in English, Norwegian and Russian across 5 decades. This is done in order to test the hypothesis that evaluative adjectives are more prone to temporal semantic change. To this end, 6 different methods of quantifying semantic change are used. Frequency-controlled experimental results show that, depending on the particular method, evaluative adjectives either do not differ from other types of adjectives in terms of semantic change or appear to actually be less prone to shifting (particularly, to ‘jitter’-type shifting). Thus, in spite of many well-known examples of semantically changing evaluative adjectives (like ‘terrific’ or ‘incredible’), it seems that such cases are not specific to this particular type of words.

Making Fast Graph-based Algorithms with Graph Metric Embeddings
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.

Learning Graph Embeddings from WordNet-based Similarity Measures
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.
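
The general idea of fitting node vectors to a user-defined pairwise similarity can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the path-shaped graph, the similarity 1 / (1 + shortest path distance), the dot product as the vector similarity, the squared-error objective, and all function names and hyperparameters.

```python
import random


def shortest_path_similarity(n_nodes):
    """Toy similarity on a path graph 0-1-...-(n-1): 1 / (1 + |i - j|)."""
    return {
        (i, j): 1.0 / (1.0 + abs(i - j))
        for i in range(n_nodes)
        for j in range(i + 1, n_nodes)
    }


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_squared_error(emb, sims):
    """Average squared gap between vector dot products and graph similarities."""
    return sum((dot(emb[i], emb[j]) - s) ** 2 for (i, j), s in sims.items()) / len(sims)


def train(sims, n_nodes, dim=4, epochs=500, lr=0.05, seed=0):
    """Fit node vectors by plain SGD so that dot(v_i, v_j) approaches sims[(i, j)]."""
    rng = random.Random(seed)
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_nodes)]
    initial_error = mean_squared_error(emb, sims)
    pairs = list(sims.items())
    for _ in range(epochs):
        rng.shuffle(pairs)
        for (i, j), target in pairs:
            err = dot(emb[i], emb[j]) - target
            vi, vj = emb[i][:], emb[j][:]  # copies, so both updates use old values
            for k in range(dim):
                emb[i][k] -= lr * err * vj[k]
                emb[j][k] -= lr * err * vi[k]
    return emb, initial_error, mean_squared_error(emb, sims)


if __name__ == "__main__":
    sims = shortest_path_similarity(5)
    emb, before, after = train(sims, n_nodes=5)
    print(f"error before training: {before:.4f}, after: {after:.4f}")
```

Once such vectors are trained, the similarity of two nodes is a single dot product and needs no graph traversal, which is the kind of saving the abstract refers to.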

To Lemmatize or Not to Lemmatize: How Word Normalisation Affects ELMo Performance in Word Sense Disambiguation
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

In this paper, we critically evaluate the widespread assumption that deep learning NLP models do not require lemmatized input. To test this, we trained versions of contextualised word embedding ELMo models on raw tokenized corpora and on the corpora with word tokens replaced by their lemmas. Then, these models were evaluated on the word sense disambiguation task. This was done for the English and Russian languages. The experiments showed that while lemmatization is indeed not necessary for English, the situation is different for Russian. It seems that for rich-morphology languages, using lemmatized training and testing data yields small but consistent improvements, at least for word sense disambiguation. This means that the decisions about text pre-processing before training ELMo should consider the linguistic nature of the language in question.

One-to-X Analogical Reasoning on Word Embeddings: a Case for Diachronic Armed Conflict Prediction from News Texts
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We extend the well-known word analogy task to a one-to-X formulation, including one-to-none cases, when no correct answer exists. The task is cast as a relation discovery problem and applied to historical armed conflicts datasets, attempting to predict new relations of type ‘location:armed-group’ based on data about past events. As the source of semantic information, we use diachronic word embedding models trained on English news texts. A simple technique to improve diachronic performance in such a task is demonstrated, using a threshold based on a function of cosine distance to decrease the number of false positives; this approach is shown to be beneficial on two different corpora. Finally, we publish a ready-to-use test set for one-to-X analogy evaluation on historical armed conflicts data.
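
A minimal sketch of how a cosine-distance threshold can yield zero, one or several answers for one query. The abstract says only that the threshold is "based on a function of cosine distance", so the fixed cutoff, the function names and the toy two-dimensional vectors below are illustrative assumptions, not the authors' procedure.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def one_to_x_answers(predicted, candidates, threshold):
    """Return all candidate names within `threshold` of the predicted vector.

    The result may be empty (a one-to-none case), hold one name, or hold many.
    `candidates` maps a name to its vector; names are sorted by distance.
    """
    scored = [(name, cosine_distance(predicted, vec)) for name, vec in candidates.items()]
    kept = [(name, dist) for name, dist in scored if dist <= threshold]
    return [name for name, _ in sorted(kept, key=lambda item: item[1])]


if __name__ == "__main__":
    # Hypothetical candidate vectors, for illustration only.
    candidates = {"group_a": [1.0, 0.0], "group_b": [0.0, 1.0], "group_c": [1.0, 0.1]}
    print(one_to_x_answers([1.0, 0.0], candidates, threshold=0.1))
    print(one_to_x_answers([-1.0, -1.0], candidates, threshold=0.1))
```

Without the threshold, a nearest-neighbour lookup always returns some candidate, even when the query has no correct answer; the cutoff is what allows an empty result.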

2018

Diachronic word embeddings and semantic shifts: a survey
Andrey Kutuzov | Lilja Øvrelid | Terrence Szymanski | Erik Velldal
Proceedings of the 27th International Conference on Computational Linguistics

Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shift detection. We start with discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges facing this emerging subfield of NLP, as well as prospects and possible applications.

Unsupervised Semantic Frame Induction using Triclustering
Dmitry Ustalov | Alexander Panchenko | Andrey Kutuzov | Chris Biemann | Simone Paolo Ponzetto
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.

2017

Redefining Context Windows for Word Embedding Models: An Experimental Study
Pierre Lison | Andrey Kutuzov
Proceedings of the 21st Nordic Conference on Computational Linguistics

Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

In this demo we present WebVectors, a free and open-source toolkit helping to deploy web services which demonstrate and visualize distributional semantic models (widely known as word embeddings). WebVectors can be useful in a very common situation when one has trained a distributional semantics model for one’s particular corpus or language (tools for this are now widespread and simple to use), but then there is a need to demonstrate the results to the general public over the Web. We show its abilities using live web services featuring distributional models for English, Norwegian and Russian.

Tracing armed conflicts with diachronic word embedding models
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the Events and Stories in the News Workshop

Recent studies have shown that word embedding models can be used to trace time-related (diachronic) semantic shifts in particular words. In this paper, we evaluate some of these approaches on the new task of predicting the dynamics of global armed conflicts on a year-to-year basis, using a dataset from the conflict research field as the gold standard and the Gigaword news corpus as the training data. The results show that much work still remains in extracting ‘cultural’ semantic shifts from diachronic word embedding models. At the same time, we present a new task complete with an evaluation set and introduce the ‘anchor words’ method which outperforms previous approaches on this set.

Universal Dependencies-based syntactic features in detecting human translation varieties
Maria Kunilovskaya | Andrey Kutuzov
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

Clustering of Russian Adjective-Noun Constructions using Word Embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Lidia Pivovarova
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper presents a method of automatic construction extraction from a large corpus of Russian. The term ‘construction’ here means a multi-word expression in which a variable can be replaced with another word from the same semantic class, for example, ‘a glass of [water/juice/milk]’. We deal with constructions that consist of a noun and its adjective modifier. We propose a method of grouping such constructions into semantic classes via 2-step clustering of word vectors in distributional models. We compare it with other clustering techniques and evaluate it against A Russian-English Collocational Dictionary of the Human Body that contains manually annotated groups of constructions with nouns meaning human body parts. The best performing method is used to cluster all adjective-noun bigrams in the Russian National Corpus. Results of this procedure are publicly available and can be used for building a Russian construction dictionary as well as to accelerate theoretical studies of constructions.

Word vectors, reuse, and replicability: Towards a community repository of large-text resources
Murhaf Fares | Andrey Kutuzov | Stephan Oepen | Erik Velldal
Proceedings of the 21st Nordic Conference on Computational Linguistics

Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let us map between members of the relation. The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold standard data for the time span 1994–2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.

2016

Neural Embedding Language Models in Semantic Clustering of Web Search Results
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, a new approach towards semantic clustering of the results of ambiguous search queries is presented. We propose using distributed vector representations of words trained with the help of prediction-based neural embedding models to detect senses of search queries and to cluster search engine results pages according to these senses. The words from titles and snippets together with semantic relationships between them form a graph, which is further partitioned into components related to different query senses. This approach to search engine results clustering is evaluated against a new manually annotated evaluation data set of Russian search queries. We show that in the task of semantically clustering search results, prediction-based models slightly but consistently outperform traditional count-based ones, with the same training corpora.

Exploration of register-dependent lexical semantics using word embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Anna Marakasova
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We present an approach to detect differences in lexical semantics across English language registers, using word embedding models from the distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC corpus are employed to compare lists of nearest associates for particular words and draw conclusions about their semantic shifts depending on the register in which they are used. The models are evaluated on the task of register classification with the help of the deep inverse regression approach. Additionally, we present a demo web service featuring most of the described models and allowing users to explore word meanings in different English registers and to detect register affiliation for arbitrary texts. The code for the service can be easily adapted to any set of underlying models.

Redefining part-of-speech classes with distributional semantic models
Andrey Kutuzov | Erik Velldal | Lilja Øvrelid
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

2015

Semi-automated typical error annotation for learner English essays: integrating frameworks
Andrey Kutuzov | Elizaveta Kuzmenko
Proceedings of the fourth workshop on NLP for computer-assisted language learning

2014

Russian Error-Annotated Learner English Corpus: a Tool for Computer-Assisted Language Learning
Elizaveta Kuzmenko | Andrey Kutuzov
Proceedings of the third workshop on NLP for computer-assisted language learning

2013

Improving English-Russian sentence alignment through POS tagging and Damerau-Levenshtein distance
Andrey Kutuzov
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Co-authors

Vladislav Mikhailov 5

Nikolay Arefyev 4

Chris Biemann 4

Mario Giulianelli 4

Syrielle Montariol 4

Alexander Panchenko 4

Lidia Pivovarova 4

Nina Tahmasebi 4

Marta Bañón 3

Pierluigi Cassotti 3

Petter Mæhlum 3

Francesco Periti 3

Sampo Pyysalo 3

Gema Ramírez-Sánchez 3

Jörg Tiedemann 3

Samia Touileb 3

Jaume Zaragoza-Bernabeu 3

Ona de Gibert 3

Jeremy Barnes 2

Laurie Burchell 2

Mohammad Dorgham 2

Haim Dubossarsky 2

Hans Christian Farsethås 2

Liane Guillou 2

Jindřich Helcl 2

Simon Hengchen 2

Erik Henriksson 2

Netta Huebscher 2

Veronika Laippala 2

Bhavitvya Malik 2

Farrokh Mehryary 2

Amanda Myntti 2

Oleksiy Oliynyk 2

Dayyán O’Brien 2

Simone Paolo Ponzetto 2

Yves Scherrer 2

Elena Spaziani 2

Pavel Stepachev 2

Dmitry Ustalov 2

Rodrigo Agerri 1

Anna Aksenova 1

Ekaterina Artemova 1

Daria Bakshandaeva 1

Magnus Breder Birkenes 1

Nikolay Bogoychev 1

Rolv-Arild Braaten 1

Svein Arne Brygfjeld 1

Javier De La Rosa 1

Kira Droganova 1

Raquel Fernández 1

Ekaterina Gavrishina 1

Lucas Georges Gabriel Charpentier 1

Jon Atle Gulla 1

Kenneth Heafield 1

Mateusz Klimaszewski 1

Ville Komulainen 1

Maria Kunilovskaya 1

Joona Kytöniemi 1

Varvara Logacheva 1

Anna Marakasova 1

Nikita Mediankin 1

Timothee Mickus 1

Aslak Sira Myhre 1

Laura Oberlaender 1

Anna Palatkina 1

Niko Partanen 1

Vinit Ravishankar 1

Steffen Remus 1

Egil Rønningstad 1

Dominik Schlechtweg 1

Artem Shelmanov 1

Janine Siewert 1

Terrence Szymanski 1

Denis Teslenko 1

Enrica Troiano 1

Khonzoda Umarova 1

Jelmer Van Der Linde 1

Tereza Vojtěchová 1

Freddy Wetjen 1

Alexandra Wittemann 1

Wilfred Østgulen 1

Venues