A Dataset for Noun Compositionality Detection for a Slavic Language

Dmitry Puzyrev, Artem Shelmanov, Alexander Panchenko, Ekaterina Artemova


Abstract
This paper presents the first gold-standard resource for Russian annotated with compositionality information for noun compounds. The compound phrases are collected from the Universal Dependencies treebanks according to part-of-speech patterns, such as ADJ+NOUN or NOUN+NOUN, using the gold-standard annotations. Each compound phrase is annotated by two experts and a moderator according to the following schema: the phrase can be compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted as either compositional or non-compositional). We conduct an experimental evaluation of models and methods for predicting the compositionality of noun compounds in unsupervised and supervised setups. We show that methods from previous work, when evaluated on the proposed Russian-language resource, achieve performance comparable to results on English corpora.
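The pattern-based collection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' extraction code: it assumes that a candidate compound is a pair of adjacent tokens whose gold UPOS tags match ADJ+NOUN or NOUN+NOUN, and the names `PATTERNS`, `read_sentences`, `extract_candidates`, and `SAMPLE` are hypothetical. The input is assumed to be standard CoNLL-U text, the format in which Universal Dependencies treebanks are distributed.

```python
from typing import Iterator, List, Set, Tuple

# Assumed UPOS patterns, taken from the examples named in the abstract.
PATTERNS: Set[Tuple[str, str]] = {("ADJ", "NOUN"), ("NOUN", "NOUN")}

Token = Tuple[str, str, str]  # (form, lemma, upos)


def read_sentences(conllu_text: str) -> Iterator[List[Token]]:
    """Yield sentences as lists of (form, lemma, upos) from CoNLL-U text."""
    sentence: List[Token] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep only plain word lines; skip multiword ranges ("1-2")
        # and empty nodes ("1.1"), whose IDs are not plain integers.
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence


def extract_candidates(
    conllu_text: str, patterns: Set[Tuple[str, str]] = PATTERNS
) -> List[Tuple[str, str]]:
    """Return lemma pairs of adjacent tokens whose UPOS tags match a pattern."""
    pairs: List[Tuple[str, str]] = []
    for sent in read_sentences(conllu_text):
        for (_, lemma1, pos1), (_, lemma2, pos2) in zip(sent, sent[1:]):
            if (pos1, pos2) in patterns:
                pairs.append((lemma1, lemma2))
    return pairs


# Tiny hand-written sample (not taken from any treebank).
SAMPLE = "\n".join(
    [
        "# text = Железная дорога закрыта",
        "\t".join(["1", "Железная", "железный", "ADJ", "_", "_", "2", "amod", "_", "_"]),
        "\t".join(["2", "дорога", "дорога", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
        "\t".join(["3", "закрыта", "закрыть", "VERB", "_", "_", "0", "root", "_", "_"]),
        "",
    ]
)

if __name__ == "__main__":
    print(extract_candidates(SAMPLE))
```

Matching on adjacency alone is the simplest reading of "part-of-speech patterns"; a stricter variant could also require a dependency link between the two tokens, using the HEAD and DEPREL columns of the same file.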
Anthology ID:
W19-3708
Volume:
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | BSNLP | WS
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
56–62
URL:
https://aclanthology.org/W19-3708
DOI:
10.18653/v1/W19-3708
Cite (ACL):
Dmitry Puzyrev, Artem Shelmanov, Alexander Panchenko, and Ekaterina Artemova. 2019. A Dataset for Noun Compositionality Detection for a Slavic Language. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 56–62, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A Dataset for Noun Compositionality Detection for a Slavic Language (Puzyrev et al., 2019)
PDF:
https://aclanthology.org/W19-3708.pdf
Code
slangtech/ru-comps
Data
Universal Dependencies