Workshop on Balto-Slavic Natural Language Processing (2019)


pdf (full)
bib (full)
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Michał Marcińczuk | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber

pdf bib
Unsupervised Induction of Ukrainian Morphological Paradigms for the New Lexicon: Extending Coverage for Named Entities and Neologisms using Inflection Tables and Unannotated Corpora
Bogdan Babych

The paper presents an unsupervised method for quickly extending a Ukrainian lexicon by generating paradigms and morphological feature structures for new Named Entities and neologisms, which are not covered by existing static morphological resources. This approach addresses a practical problem of modelling paradigms for entities created by the dynamic processes in the lexicon: this problem is especially serious for highly-inflected languages in domains with specialised or quickly changing lexicon. The method uses an unannotated Ukrainian corpus and a small fixed set of inflection tables, which can be found in traditional grammar textbooks. The advantage of the proposed approach is that updating the morphological lexicon does not require training or linguistic annotation, allowing fast knowledge-light extension of an existing static lexicon to improve morphological coverage on a specific corpus. The method is implemented in an open-source package on a GitHub repository. It can be applied to other low-resourced inflectional languages which have internet corpora and linguistic descriptions of their inflection system, following the example of inflection tables for Ukrainian. Evaluation results shows consistent improvements in coverage for Ukrainian corpora of different corpus types.

pdf bib
Multiple Admissibility: Judging Grammaticality using Unlabeled Data in Language Learning
Anisia Katinskaia | Sardana Ivanova

We present our work on the problem of Multiple Admissibility (MA) in language learning. Multiple Admissibility occurs in many languages when more than one grammatical form of a word fits syntactically and semantically in a given context. In second language (L2) education - in particular, in intelligent tutoring systems/computer-aided language learning (ITS/CALL) systems, which generate exercises automatically - this implies that multiple alternative answers are possible. We treat the problem as a grammaticality judgement task. We train a neural network with an objective to label sentences as grammatical or ungrammatical, using a “simulated learner corpus”: a dataset with correct text, and with artificial errors generated automatically. While MA occurs commonly in many languages, this paper focuses on learning Russian. We present a detailed classification of the types of constructions in Russian, in which MA is possible, and evaluate the model using a test set built from answers provided by the users of a running language learning system.

pdf bib
Numbers Normalisation in the Inflected Languages: a Case Study of Polish
Rafał Poświata | Michał Perełkiewicz

Text normalisation in Text-to-Speech systems is a process of converting written expressions to their spoken forms. This task is complicated because in many cases the normalised form depends on the context. Furthermore, when we analysed languages like Croatian, Lithuanian, Polish, Russian or Slovak there is additional difficulty related to their inflected nature. In this paper we want to show how to deal with this problem for one of these languages: Polish, without having a large dedicated data set and using solutions prepared for other NLP tasks. We limited our study to only numbers expressions, which are the most common non-standard words to normalise. The proposed solution is a combination of morphological tagger and transducer supported by a dictionary of numbers in their spoken forms. The data set used for evaluation is based on the part of 1-million word subset of the National Corpus of Polish. The accuracy of the described approach is presented with a comparison to a simple baseline and two commercial systems: Google Cloud Text-to-Speech and Amazon Polly.

pdf bib
What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian
Nikola Ljubešić | Kaja Dobrovoljc

We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation between the former state-of-the-art for these three languages and one of the best performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline. Our experiments show significant improvements in morphosyntactic annotation, especially on categories where either semantic knowledge is needed, available through word embeddings, or where long-range dependencies have to be modelled. On the other hand, on the task of lemmatisation no improvements are obtained with the neural solution, mostly due to the heavy dependence of the task on the lookup in an external lexicon, but also due to obvious room for improvements in the Stanford NLP pipeline’s lemmatisation.

pdf bib
AGRR 2019: Corpus for Gapping Resolution in Russian
Maria Ponomareva | Kira Droganova | Ivan Smurov | Tatiana Shavrina

This paper provides a comprehensive overview of the gapping dataset for Russian that consists of 7.5k sentences with gapping (as well as 15k relevant negative sentences) and comprises data from various genres: news, fiction, social media and technical texts. The dataset was prepared for the Automatic Gapping Resolution Shared Task for Russian (AGRR-2019) - a competition aimed at stimulating the development of NLP tools and methods for processing of ellipsis. In this paper, we pay special attention to the gapping resolution methods that were introduced within the shared task as well as an alternative test set that illustrates that our corpus is a diverse and representative subset of Russian language gapping sufficient for effective utilization of machine learning techniques.

pdf bib
Creating a Corpus for Russian Data-to-Text Generation Using Neural Machine Translation and Post-Editing
Anastasia Shimorina | Elena Khasanova | Claire Gardent

In this paper, we propose an approach for semi-automatically creating a data-to-text (D2T) corpus for Russian that can be used to learn a D2T natural language generation model. An error analysis of the output of an English-to-Russian neural machine translation system shows that 80% of the automatically translated sentences contain an error and that 53% of all translation errors bear on named entities (NE). We therefore focus on named entities and introduce two post-editing techniques for correcting wrongly translated NEs.

pdf bib
Data Set for Stance and Sentiment Analysis from User Comments on Croatian News
Mihaela Bošnjak | Mladen Karan

Nowadays it is becoming more important than ever to find new ways of extracting useful information from the evergrowing amount of user-generated data available online. In this paper, we describe the creation of a data set that contains news articles and corresponding comments from Croatian news outlet 24 sata. Our annotation scheme is specifically tailored for the task of detecting stances and sentiment from user comments as well as assessing if commentator claims are verifiable. Through this data, we hope to get a better understanding of the publics viewpoint on various events. In addition, we also explore the potential of applying supervised machine learning models toautomate annotation of more data.

pdf bib
A Dataset for Noun Compositionality Detection for a Slavic Language
Dmitry Puzyrev | Artem Shelmanov | Alexander Panchenko | Ekaterina Artemova

This paper presents the first gold-standard resource for Russian annotated with compositionality information of noun compounds. The compound phrases are collected from the Universal Dependency treebanks according to part of speech patterns, such as ADJ+NOUN or NOUN+NOUN, using the gold-standard annotations. Each compound phrase is annotated by two experts and a moderator according to the following schema: the phrase can be either compositional, non-compositional, or ambiguous (i.e., depending on the context it can be interpreted both as compositional or non-compositional). We conduct an experimental evaluation of models and methods for predicting compositionality of noun compounds in unsupervised and supervised setups. We show that methods from previous work evaluated on the proposed Russian-language resource achieve the performance comparable with results on English corpora.

pdf bib
The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Laska Laskova | Michał Marcińczuk | Lidia Pivovarova | Pavel Přibáň | Josef Steinberger | Roman Yangarber

We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

pdf bib
BSNLP2019 Shared Task Submission: Multisource Neural NER Transfer
Tatiana Tsygankova | Stephen Mayhew | Dan Roth

This paper describes the Cognitive Computation (CogComp) Group’s submissions to the multilingual named entity recognition shared task at the Balto-Slavic Natural Language Processing (BSNLP) Workshop. The final model submitted is a multi-source neural NER system with multilingual BERT embeddings, trained on the concatenation of training data in various Slavic languages (as well as English). The performance of our system on the official testing data suggests that multi-source approaches consistently outperform single-source approaches for this task, even with the noise of mismatching tagsets.

pdf bib
TLR at BSNLP2019: A Multilingual Named Entity Recognition System
Jose G. Moreno | Elvys Linhares Pontes | Mickael Coustaty | Antoine Doucet

This paper presents our participation at the shared task on multilingual named entity recognition at BSNLP2019. Our strategy is based on a standard neural architecture for sequence labeling. In particular, we use a mixed model which combines multilingualcontextual and language-specific embeddings. Our only submitted run is based on a voting schema using multiple models, one for each of the four languages of the task (Bulgarian, Czech, Polish, and Russian) and another for English. Results for named entity recognition are encouraging for all languages, varying from 60% to 83% in terms of Strict and Relaxed metrics, respectively.

pdf bib
Tuning Multilingual Transformers for Language-Specific Named Entity Recognition
Mikhail Arkhipov | Maria Trofimova | Yuri Kuratov | Alexey Sorokin

Our paper addresses the problem of multilingual named entity recognition on the material of 4 languages: Russian, Bulgarian, Czech and Polish. We solve this task using the BERT model. We use a hundred languages multilingual model as base for transfer to the mentioned Slavic languages. Unsupervised pre-training of the BERT model on these 4 languages allows to significantly outperform baseline neural approaches and multilingual BERT. Additional improvement is achieved by extending BERT with a word-level CRF layer. Our system was submitted to BSNLP 2019 Shared Task on Multilingual Named Entity Recognition and demonstrated top performance in multilingual setting for two competition metrics. We open-sourced NER models and BERT model pre-trained on the four Slavic languages.

pdf bib
Multilingual Named Entity Recognition Using Pretrained Embeddings, Attention Mechanism and NCRF
Anton Emelyanov | Ekaterina Artemova

In this paper we tackle multilingual named entity recognition task. We use the BERT Language Model as embeddings with bidirectional recurrent network, attention, and NCRF on the top. We apply multilingual BERT only as embedder without any fine-tuning. We test out model on the dataset of the BSNLP shared task, which consists of texts in Bulgarian, Czech, Polish and Russian languages.

pdf bib
JRC TMA-CC: Slavic Named Entity Recognition and Linking. Participation in the BSNLP-2019 shared task
Guillaume Jacquet | Jakub Piskorski | Hristo Tanev | Ralf Steinberger

We report on the participation of the JRC Text Mining and Analysis Competence Centre (TMA-CC) in the BSNLP-2019 Shared Task, which focuses on named-entity recognition, lemmatisation and cross-lingual linking. We propose a hybrid system combining a rule-based approach and light ML techniques. We use multilingual lexical resources such as JRC-NAMES and BABELNET together with a named entity guesser to recognise names. In a second step, we combine known names with wild cards to increase recognition recall by also capturing inflection variants. In a third step, we increase precision by filtering these name candidates with automatically learnt inflection patterns derived from name occurrences in large news article collections. Our major requirement is to achieve high precision. We achieved an average of 65% F-measure with 93% precision on the four languages.

pdf bib
Building English-to-Serbian Machine Translation System for IMDb Movie Reviews
Pintu Lohar | Maja Popović | Andy Way

This paper reports the results of the first experiment dealing with the challenges of building a machine translation system for user-generated content involving a complex South Slavic language. We focus on translation of English IMDb user movie reviews into Serbian, in a low-resource scenario. We explore potentials and limits of (i) phrase-based and neural machine translation systems trained on out-of-domain clean parallel data from news articles (ii) creating additional synthetic in-domain parallel corpus by machine-translating the English IMDb corpus into Serbian. Our main findings are that morphology and syntax are better handled by the neural approach than by the phrase-based approach even in this low-resource mismatched domain scenario, however the situation is different for the lexical aspect, especially for person names. This finding also indicates that in general, machine translation of person names into Slavic languages (especially those which require/allow transcription) should be investigated more systematically.

pdf bib
Improving Sentiment Classification in Slovak Language
Samuel Pecar | Marian Simko | Maria Bielikova

Using different neural network architectures is widely spread for many different NLP tasks. Unfortunately, most of the research is performed and evaluated only in English language and minor languages are often omitted. We believe using similar architectures for other languages can show interesting results. In this paper, we present our study on methods for improving sentiment classification in Slovak language. We performed several experiments for two different datasets, one containing customer reviews, the other one general Twitter posts. We show comparison of performance of different neural network architectures and also different word representations. We show that another improvement can be achieved by using a model ensemble. We performed experiments utilizing different methods of model ensemble. Our proposed models achieved better results than previous models for both datasets. Our experiments showed also other potential research areas.

pdf bib
Sentiment Analysis for Multilingual Corpora
Svitlana Galeshchuk | Ju Qiu | Julien Jourdan

The paper presents a generic approach to the supervised sentiment analysis of social media content in Slavic languages. The method proposes translating the documents from the original language to English with Google’s Neural Translation Model. The resulted texts are then converted to vectors by averaging the vectorial representation of words derived from a pre-trained Word2Vec English model. Testing the approach with several machine learning methods on Polish, Slovenian and Croatian Twitter datasets returns up to 86% of classification accuracy on the out-of-sample data.