<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="1400">
    <title>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</title>
    <editor>Toma&#x17E; Erjavec</editor>
    <editor>Jakub Piskorski</editor>
    <editor>Lidia Pivovarova</editor>
    <editor>Jan &#x160;najder</editor>
    <editor>Josef Steinberger</editor>
    <editor>Roman Yangarber</editor>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-14</url>
    <bibtype>book</bibtype>
    <bibkey>BSNLP:2017</bibkey>
  </paper>

  <paper id="1401">
    <title>Toward Pan-Slavic NLP: Some Experiments with Language Adaptation</title>
    <author><first>Serge</first><last>Sharoff</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;2</pages>
    <url>http://www.aclweb.org/anthology/W17-1401</url>
    <abstract>There is great variation in the amount of NLP resources available for Slavonic
	languages. For example, the Universal Dependency treebank (Nivre et al., 2016)
	has about 2 MW of training resources for Czech, more than 1 MW for Russian,
	but only 950 words for Ukrainian and none for Belarusian, Bosnian or
	Macedonian. Similarly, the Autodesk Machine Translation dataset only covers
	three Slavonic languages (Czech, Polish and Russian). In this talk I will
	discuss a general approach, which can be called Language Adaptation, by
	analogy with Domain Adaptation. In this approach, a model for a particular language
	processing task is built by lexical transfer of cognate words and by learning a
	new feature representation for a lesser-resourced (recipient) language starting
	from a better-resourced (donor) language. More specifically, I will demonstrate
	how language adaptation works in such training scenarios as Translation Quality
	Estimation, Part-of-Speech tagging and Named Entity Recognition.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sharoff:2017:BSNLP</bibkey>
  </paper>

  <paper id="1402">
    <title>Clustering of Russian Adjective-Noun Constructions using Word Embeddings</title>
    <author><first>Andrey</first><last>Kutuzov</last></author>
    <author><first>Elizaveta</first><last>Kuzmenko</last></author>
    <author><first>Lidia</first><last>Pivovarova</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>3&#8211;13</pages>
    <url>http://www.aclweb.org/anthology/W17-1402</url>
    <abstract>This paper presents a method of automatic construction extraction from a large
	corpus of Russian. The term 'construction' here means a multi-word expression
	in which a variable can be replaced with another word from the same semantic
	class, for example, 'a glass of [water/juice/milk]'. We deal with constructions
	that consist of a noun and its adjective modifier. We propose a method of
	grouping such constructions into semantic classes via 2-step clustering of word
	vectors in distributional models. We compare it with other clustering
	techniques and evaluate it against A Russian-English Collocational Dictionary
	of the Human Body that contains manually annotated groups of constructions with
	nouns meaning human body parts.
	The best performing method is used to cluster all adjective-noun bigrams in the
	Russian National Corpus. Results of this procedure are publicly available and
	can be used for building a Russian construction dictionary as well as for
	accelerating theoretical studies of constructions.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kutuzov-kuzmenko-pivovarova:2017:BSNLP</bibkey>
  </paper>

  <paper id="1403">
    <title>A Preliminary Study of Croatian Lexical Substitution</title>
    <author><first>Domagoj</first><last>Alagi&#x107;</last></author>
    <author><first>Jan</first><last>&#x160;najder</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>14&#8211;19</pages>
    <url>http://www.aclweb.org/anthology/W17-1403</url>
    <abstract>Lexical substitution is the task of determining a meaning-preserving replacement
	for a word in context. We report on a preliminary study of this task for the
	Croatian language on a small-scale lexical sample dataset, manually annotated
	using three different annotation schemes. We compare the annotations, analyze
	the inter-annotator agreement, and observe a number of interesting language
	specific details in the obtained lexical substitutes. Furthermore, we apply a
	recently-proposed, dependency-based lexical substitution model to our dataset.
	The model achieves a P@3 score of 0.35, which indicates the difficulty of the
	task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>alagic-vsnajder:2017:BSNLP</bibkey>
  </paper>

  <paper id="1404">
    <title>Projecting Multiword Expression Resources on a Polish Treebank</title>
    <author><first>Agata</first><last>Savary</last></author>
    <author><first>Jakub</first><last>Waszczuk</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>20&#8211;26</pages>
    <url>http://www.aclweb.org/anthology/W17-1404</url>
    <abstract>Multiword expressions (MWEs) are linguistic objects containing two or more
	words and showing idiosyncratic behavior at different levels. Treebanks with
	annotated MWEs enable studies of such properties, as well as training and
	evaluation of MWE-aware parsers. However, few treebanks contain full-fledged
	MWE annotations. We show how this gap can be bridged in Polish by projecting
	three MWE resources onto a constituency treebank.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>savary-waszczuk:2017:BSNLP</bibkey>
  </paper>

  <paper id="1405">
    <title>Lexicon Induction for Spoken Rusyn &#8211; Challenges and Results</title>
    <author><first>Achim</first><last>Rabus</last></author>
    <author><first>Yves</first><last>Scherrer</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>27&#8211;32</pages>
    <url>http://www.aclweb.org/anthology/W17-1405</url>
    <abstract>This paper reports on challenges and results in developing NLP resources for
	spoken Rusyn. As a Slavic minority language, Rusyn has hardly any existing
	resources to make use of. We propose to build a morphosyntactic dictionary for
	Rusyn, combining existing resources from the etymologically close Slavic
	languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to
	Rusyn by using vowel-sensitive Levenshtein distance, hand-written
	language-specific transformation rules, and combinations of the two. Compared
	to an exact match baseline, we increase the coverage of the resulting
	morphological dictionary by up to 77.4% relative (42.9% absolute), which
	results in an increase in tagging recall of 11.6% relative (9.1% absolute). Our
	research confirms and expands the results of previous studies showing the
	effectiveness of using NLP resources from neighboring languages for low-resourced
	languages.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rabus-scherrer:2017:BSNLP</bibkey>
  </paper>

  <paper id="1406">
    <title>The Universal Dependencies Treebank for Slovenian</title>
    <author><first>Kaja</first><last>Dobrovoljc</last></author>
    <author><first>Toma&#x17E;</first><last>Erjavec</last></author>
    <author><first>Simon</first><last>Krek</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>33&#8211;38</pages>
    <url>http://www.aclweb.org/anthology/W17-1406</url>
    <abstract>This paper introduces the Universal Dependencies Treebank for Slovenian. We
	overview the existing dependency treebanks for Slovenian and then detail the
	conversion of the ssj500k treebank to the framework of Universal Dependencies
	version 2. We explain the mapping of part-of-speech categories, morphosyntactic
	features, and the dependency relations, focusing on the more problematic
	language-specific issues. We conclude with a quantitative overview of the
	treebank and directions for further work.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>dobrovoljc-erjavec-krek:2017:BSNLP</bibkey>
  </paper>

  <paper id="1407">
    <title>Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages</title>
    <author><first>Tanja</first><last>Samard&#x17E;i&#x107;</last></author>
    <author><first>Mirjana</first><last>Starovi&#x107;</last></author>
    <author><first>&#x17D;eljko</first><last>Agi&#x107;</last></author>
    <author><first>Nikola</first><last>Ljube&#x161;i&#x107;</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>39&#8211;44</pages>
    <url>http://www.aclweb.org/anthology/W17-1407</url>
    <abstract>The paper documents the procedure of building a new Universal Dependencies
	(UDv2) treebank for Serbian starting from an existing Croatian UDv1 treebank
	and taking into account the other Slavic UD annotation guidelines. We describe
	the automatic and manual annotation procedures, discuss the annotation of
	Slavic-specific categories (case governing quantifiers, reflexive pronouns,
	question particles) and propose an approach to handling deverbal nouns in
	Slavic languages.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>samardvzic-EtAl:2017:BSNLP</bibkey>
  </paper>

  <paper id="1408">
    <title>Spelling Correction for Morphologically Rich Language: a Case Study of Russian</title>
    <author><first>Alexey</first><last>Sorokin</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>45&#8211;53</pages>
    <url>http://www.aclweb.org/anthology/W17-1408</url>
    <abstract>We present an algorithm for automatic correction of spelling errors on the
	sentence level, which uses a noisy channel model and feature-based reranking of
	hypotheses. Our system is designed for Russian and clearly outperforms the
	winner of the SpellRuEval-2016 competition. We show that language model size has
	the greatest influence on spelling correction quality. We also experiment with
	different types of features and show that morphological and semantic
	information also improves the accuracy of spellchecking.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sorokin:2017:BSNLP</bibkey>
  </paper>

  <paper id="1409">
    <title>Debunking Sentiment Lexicons: A Case of Domain-Specific Sentiment Classification for Croatian</title>
    <author><first>Paula</first><last>Gombar</last></author>
    <author><first>Zoran</first><last>Medi&#x107;</last></author>
    <author><first>Domagoj</first><last>Alagi&#x107;</last></author>
    <author><first>Jan</first><last>&#x160;najder</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>54&#8211;59</pages>
    <url>http://www.aclweb.org/anthology/W17-1409</url>
    <abstract>Sentiment lexicons are widely used as an intuitive and inexpensive way of
	tackling sentiment classification, often within a simple lexicon word-counting
	approach or as part of a supervised model. However, it is an open question
	whether these approaches can compete with supervised models that use only
	word-representation features. We address this question in the context of
	domain-specific sentiment classification for Croatian. We experiment with the
	graph-based acquisition of sentiment lexicons, analyze their quality, and
	investigate how effectively they can be used in sentiment classification. Our
	results indicate that, even with as few as 500 labeled instances, a supervised
	model substantially outperforms a word-counting model. We also observe that
	adding lexicon-based features does not significantly improve supervised
	sentiment classification.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gombar-EtAl:2017:BSNLP</bibkey>
  </paper>

  <paper id="1410">
    <title>Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text</title>
    <author><first>Nikola</first><last>Ljube&#x161;i&#x107;</last></author>
    <author><first>Toma&#x17E;</first><last>Erjavec</last></author>
    <author><first>Darja</first><last>Fi&#x161;er</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>60&#8211;68</pages>
    <url>http://www.aclweb.org/anthology/W17-1410</url>
    <abstract>In this paper we present the adaptation of a state-of-the-art tagger for South
	Slavic languages to non-standard texts, taking the Slovene language as an example.
	We investigate the impact of introducing in-domain training data as well as
	additional supervision through external resources or tools like word clusters
	and word normalization. We remove more than half of the error of the standard
	tagger when applied to non-standard texts by training it on a combination of
	standard and non-standard training data, while enriching the data
	representation with external resources removes an additional 11 percent of the
	error. The final configuration achieves tagging accuracy of 87.41% on the full
	morphosyntactic description, which is, nevertheless, still quite far from the
	accuracy of 94.27% achieved on standard text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ljubevsic-erjavec-fivser:2017:BSNLP</bibkey>
  </paper>

  <paper id="1411">
    <title>Comparison of Short-Text Sentiment Analysis Methods for Croatian</title>
    <author><first>Leon</first><last>Rotim</last></author>
    <author><first>Jan</first><last>&#x160;najder</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>69&#8211;75</pages>
    <url>http://www.aclweb.org/anthology/W17-1411</url>
    <abstract>We focus on the task of supervised sentiment classification of short and
	informal texts in Croatian, using two simple yet effective methods: word
	embeddings and string kernels. We investigate whether word embeddings offer any
	advantage over corpus- and preprocessing-free string kernels, and how these
	compare to bag-of-words baselines. We conduct a comparison on three different
	datasets, using different preprocessing methods and kernel functions. Results
	show that, on two out of three datasets, word embeddings outperform string
	kernels, which in turn outperform word and n-gram bag-of-words baselines.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rotim-vsnajder:2017:BSNLP</bibkey>
  </paper>

  <paper id="1412">
    <title>The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages</title>
    <author><first>Jakub</first><last>Piskorski</last></author>
    <author><first>Lidia</first><last>Pivovarova</last></author>
    <author><first>Jan</first><last>&#x160;najder</last></author>
    <author><first>Josef</first><last>Steinberger</last></author>
    <author><first>Roman</first><last>Yangarber</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>76&#8211;85</pages>
    <url>http://www.aclweb.org/anthology/W17-1412</url>
    <abstract>This paper describes the outcomes of the first challenge on multilingual named
	entity recognition that aimed at recognizing mentions of named entities in web
	documents in Slavic languages, their normalization/lemmatization, and
	cross-language matching. It was organised in the context of the 6th
	Balto-Slavic Natural Language Processing Workshop, co-located with the EACL
	2017 conference. Although eleven teams signed up for the evaluation, due to the
	complexity of the task(s) and the short time available for elaborating a solution,
	only two teams submitted results on time. The reported evaluation figures
	reflect the relatively higher level of complexity of named entity-related tasks
	in the context of processing texts in Slavic languages. Since the duration of
	the challenge extends beyond the publication date of this paper, an updated
	picture of the participating systems and their corresponding performance can be
	found on the web page of the challenge.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>piskorski-EtAl:2017:BSNLP</bibkey>
  </paper>

  <paper id="1413">
    <title>Liner2 — a Generic Framework for Named Entity Recognition</title>
    <author><first>Micha&#x142;</first><last>Marci&#x144;czuk</last></author>
    <author><first>Jan</first><last>Koco&#x144;</last></author>
    <author><first>Marcin</first><last>Oleksy</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>86&#8211;91</pages>
    <url>http://www.aclweb.org/anthology/W17-1413</url>
    <abstract>In this paper we present an adaptation of the Liner2 framework to solve the BSNLP
	2017 shared task on multilingual named entity recognition. The tool is tuned to
	recognize and lemmatize named entities for Polish.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>marcinczuk-kocon-oleksy:2017:BSNLP</bibkey>
  </paper>

  <paper id="1414">
    <title>Language-Independent Named Entity Analysis Using Parallel Projection and Rule-Based Disambiguation</title>
    <author><first>James</first><last>Mayfield</last></author>
    <author><first>Paul</first><last>McNamee</last></author>
    <author><first>Cash</first><last>Costello</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>92&#8211;96</pages>
    <url>http://www.aclweb.org/anthology/W17-1414</url>
    <abstract>The 2017 shared task at the Balto-Slavic NLP workshop requires identifying
	coarse-grained named entities in seven languages, identifying each entity&#8217;s
	base form, and clustering name mentions across the multilingual set of
	documents. The fact that no training data is provided to systems for building
	supervised classifiers further adds to the complexity. To complete the task we
	first use publicly available parallel texts to project named entity recognition
	capability from English to each evaluation language. We ignore entirely the
	subtask of identifying non-inflected forms of names. Finally, we create
	cross-document entity identifiers by clustering name mentions using a
	procedure-based approach.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mayfield-mcnamee-costello:2017:BSNLP</bibkey>
  </paper>

  <paper id="1415">
    <title>Comparison of String Similarity Measures for Obscenity Filtering</title>
    <author><first>Ekaterina</first><last>Chernyak</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>97&#8211;101</pages>
    <url>http://www.aclweb.org/anthology/W17-1415</url>
    <abstract>In this paper we address the problem of filtering obscene lexis in Russian
	texts. We use string similarity measures to find words similar or identical to
	words from a stop list and establish both a test collection and a baseline for
	the task. Our experiments show that a novel string similarity measure based on
	the notion of an annotated suffix tree outperforms some of the other well-known
	measures.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chernyak:2017:BSNLP</bibkey>
  </paper>

  <paper id="1416">
    <title>Stylometric Analysis of Parliamentary Speeches: Gender Dimension</title>
    <author><first>Justina</first><last>Mandravickaite</last></author>
    <author><first>Tomas</first><last>Krilavi&#x10D;ius</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>102&#8211;107</pages>
    <url>http://www.aclweb.org/anthology/W17-1416</url>
    <abstract>The relation between gender and language has been studied by many authors;
	however, some uncertainty remains regarding the influence of gender on language
	usage in the professional environment. Often, the studied data sets are too
	small, or the texts of individual authors are too short, to successfully capture
	differences in language usage with respect to gender. This study draws on a
	larger corpus of transcripts of speeches in the Lithuanian Parliament (1990-2013)
	to explore gender-related language differences in political debates via stylometric
	analysis. The experimental setup consists of stylistic features that indicate
	lexical style and do not require external linguistic tools, namely the most
	frequent words, in combination with unsupervised machine learning algorithms.
	Results show that gender differences in language use persist in the professional
	environment, not only in the usage of function words and preferred linguistic
	constructions, but in the topics presented as well.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mandravickaite-krilavivcius:2017:BSNLP</bibkey>
  </paper>

  <paper id="1417">
    <title>Towards Never Ending Language Learning for Morphologically Rich Languages</title>
    <author><first>Kseniya</first><last>Buraya</last></author>
    <author><first>Lidia</first><last>Pivovarova</last></author>
    <author><first>Sergey</first><last>Budkov</last></author>
    <author><first>Andrey</first><last>Filchenkov</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>108&#8211;118</pages>
    <url>http://www.aclweb.org/anthology/W17-1417</url>
    <abstract>This work deals with ontology learning from unstructured Russian text. We
	implement one of the components of the Never-Ending Language Learner and introduce
	algorithm extensions aimed at capturing the specificity of a morphologically rich
	free-word-order language. We demonstrate that this method can be successfully
	applied to Russian data. In addition, we perform several experiments
	comparing different settings of the training process. We demonstrate that
	utilizing morphological features significantly improves the system's precision,
	while using seed patterns helps to improve the coverage.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>buraya-EtAl:2017:BSNLP</bibkey>
  </paper>

  <paper id="1418">
    <title>Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style</title>
    <author><first>Ben</first><last>Verhoeven</last></author>
    <author><first>Iza</first><last>&#x160;krjanec</last></author>
    <author><first>Senja</first><last>Pollak</last></author>
    <booktitle>Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>119&#8211;125</pages>
    <url>http://www.aclweb.org/anthology/W17-1418</url>
    <abstract>We present results of the first gender classification experiments on Slovene
	text to our knowledge. Inspired by the TwiSty corpus and experiments (Verhoeven
	et al., 2016), we employed the Janes corpus (Erjavec et al., 2016) and its
	gender annotations to perform gender classification experiments on Twitter text
	comparing a token-based and a lemma-based approach. We find that the
	token-based approach (92.6% accuracy), containing gender markings related to
	the author, outperforms the lemma-based approach by about 5%. Especially in the
	lemmatized version, we also observe stylistic and content-based differences in
	writing between men (e.g. more profane language, numerals and beer mentions)
	and women (e.g. more pronouns, emoticons and character flooding). Many of our
	findings corroborate previous research on other languages.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>verhoeven-vskrjanec-pollak:2017:BSNLP</bibkey>
  </paper>

</volume>

