Dmitry Nikolaev


pdf bib
Adverbs, Surprisingly
Dmitry Nikolaev | Collin Baker | Miriam R. L. Petruck | Sebastian Pad
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

This paper begins with the premise that adverbs are neglected in computational linguistics. This view derives from two analyses: a literature review and a novel adverb dataset to probe a state-of-the-art language model, thereby uncovering systematic gaps in accounts for adverb meaning. We suggest that using Frame Semantics for characterizing word meaning, as in FrameNet, provides a promising approach to adverb analysis, given its ability to describe ambiguity, semantic roles, and null instantiation.

pdf bib
The Universe of Utterances According to BERT
Dmitry Nikolaev | Sebastian Padó
Proceedings of the 15th International Conference on Computational Semantics

It has been argued that BERT “rediscovers the traditional NLP pipeline”, with lower layers extracting morphosyntactic features and higher layers creating holistic sentence-level representations. In this paper, we critically examine this assumption through a principle-component-guided analysis, extracing sets of inputs that correspond to specific activation patterns in BERT sentence representations. We find that even in higher layers, the model mostly picks up on a variegated bunch of low-level features, many related to sentence complexity, that presumably arise from its specific pre-training objectives.

pdf bib
The argument–adjunct distinction in BERT: A FrameNet-based investigation
Dmitry Nikolaev | Sebastian Padó
Proceedings of the 15th International Conference on Computational Semantics

The distinction between arguments and adjuncts is a fundamental assumption of several linguistic theories. In this study, we investigate to what extent this distinction is picked up by a Transformer-based language model. We use BERT as a case study, operationalizing arguments and adjuncts as core and non-core FrameNet frame elements, respectively, and tying them to activations of particular BERT neurons. We present evidence, from English and Korean, that BERT learns more dedicated representations for arguments than for adjuncts when fine-tuned on the FrameNet frame-identification task. We also show that this distinction is already present in a weaker form in the vanilla pre-trained model.

pdf bib
Additive manifesto decomposition: A policy domain aware method for understanding party positioning
Tanise Ceron | Dmitry Nikolaev | Sebastian Padó
Findings of the Association for Computational Linguistics: ACL 2023

Automatic extraction of party (dis)similarities from texts such as party election manifestos or parliamentary speeches plays an increasing role in computational political science. However, existing approaches are fundamentally limited to targeting only global party (dis)-similarity: they condense the relationship between a pair of parties into a single figure, their similarity. In aggregating over all policy domains (e.g., health or foreign policy), they do not provide any qualitative insights into which domains parties agree or disagree on. This paper proposes a workflow for estimating policy domain aware party similarity that overcomes this limitation. The workflow covers (a) definition of suitable policy domains; (b) automatic labeling of domains, if no manual labels are available; (c) computation of domain-level similarities and aggregation at a global level; (d) extraction of interpretable party positions on major policy axes via multidimensional scaling. We evaluate our workflow on manifestos from the German federal elections. We find that our method (a) yields high correlation when predicting party similarity at a global level and (b) provides accurate party-specific positions, even with automatically labelled policy domains.

pdf bib
Representation biases in sentence transformers
Dmitry Nikolaev | Sebastian Padó
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Variants of the BERT architecture specialised for producing full-sentence representations often achieve better performance on downstream tasks than sentence embeddings extracted from vanilla BERT. However, there is still little understanding of what properties of inputs determine the properties of such representations. In this study, we construct several sets of sentences with pre-defined lexical and syntactic structures and show that SOTA sentence transformers have a strong nominal-participant-set bias: cosine similarities between pairs of sentences are more strongly determined by the overlap in the set of their noun participants than by having the same predicates, lengthy nominal modifiers, or adjuncts. At the same time, the precise syntactic-thematic functions of the participants are largely irrelevant.


pdf bib
Word-order Typology in Multilingual BERT: A Case Study in Subordinate-Clause Detection
Dmitry Nikolaev | Sebastian Pado
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The capabilities and limitations of BERT and similar models are still unclear when it comes to learning syntactic abstractions, in particular across languages. In this paper, we use the task of subordinate-clause detection within and across languages to probe these properties. We show that this task is deceptively simple, with easy gains offset by a long tail of harder cases, and that BERT’s zero-shot performance is dominated by word-order effects, mirroring the SVO/VSO/SOV typology.


pdf bib
On the Relation between Syntactic Divergence and Zero-Shot Performance
Ofir Arviv | Dmitry Nikolaev | Taelin Karidi | Omri Abend
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We explore the link between the extent to which syntactic relations are preserved in translation and the ease of correctly constructing a parse tree in a zero-shot setting. While previous work suggests such a relation, it tends to focus on the macro level and not on the level of individual edges—a gap we aim to address. As a test case, we take the transfer of Universal Dependencies (UD) parsing from English to a diverse set of languages and conduct two sets of experiments. In one, we analyze zero-shot performance based on the extent to which English source edges are preserved in translation. In another, we apply three linguistically motivated transformations to UD, creating more cross-lingually stable versions of it, and assess their zero-shot parsability. In order to compare parsing performance across different schemes, we perform extrinsic evaluation on the downstream task of cross-lingual relation extraction (RE) using a subset of a standard English RE benchmark translated to Russian and Korean. In both sets of experiments, our results suggest a strong relation between cross-lingual stability and zero-shot parsing performance.


pdf bib
Fine-Grained Analysis of Cross-Linguistic Syntactic Divergences
Dmitry Nikolaev | Ofir Arviv | Taelin Karidi | Neta Kenneth | Veronika Mitnik | Lilja Maria Saeboe | Omri Abend
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The patterns in which the syntax of different languages converges and diverges are often used to inform work on cross-lingual transfer. Nevertheless, little empirical work has been done on quantifying the prevalence of different syntactic divergences across language pairs. We propose a framework for extracting divergence patterns for any language pair from a parallel corpus, building on Universal Dependencies. We show that our framework provides a detailed picture of cross-language divergences, generalizes previous approaches, and lends itself to full automation. We further present a novel dataset, a manually word-aligned subset of the Parallel UD corpus in five languages, and use it to perform a detailed corpus study. We demonstrate the usefulness of the resulting analysis by showing that it can help account for performance patterns of a cross-lingual parser.

pdf bib
SegBo: A Database of Borrowed Sounds in the World’s Languages
Eitan Grossman | Elad Eisen | Dmitry Nikolaev | Steven Moran
Proceedings of the Twelfth Language Resources and Evaluation Conference

Phonological segment borrowing is a process through which languages acquire new contrastive speech sounds as the result of borrowing new words from other languages. Despite the fact that phonological segment borrowing is documented in many of the world’s languages, to date there has been no large-scale quantitative study of the phenomenon. In this paper, we present SegBo, a novel cross-linguistic database of borrowed phonological segments. We describe our data aggregation pipeline and the resulting language sample. We also present two short case studies based on the database. The first deals with the impact of large colonial languages on the sound systems of the world’s languages; the second deals with universals of borrowing in the domain of rhotic consonants.

pdf bib
Classifying Syntactic Errors in Learner Language
Leshem Choshen | Dmitry Nikolaev | Yevgeni Berzak | Omri Abend
Proceedings of the 24th Conference on Computational Natural Language Learning

We present a method for classifying syntactic errors in learner language, namely errors whose correction alters the morphosyntactic structure of a sentence. The methodology builds on the established Universal Dependencies syntactic representation scheme, and provides complementary information to other error-classification systems. Unlike existing error classification methods, our method is applicable across languages, which we showcase by producing a detailed picture of syntactic errors in learner English and learner Russian. We further demonstrate the utility of the methodology for analyzing the outputs of leading Grammatical Error Correction (GEC) systems.