Roman Yangarber


2022

Semi-automatically Annotated Learner Corpus for Russian
Anisia Katinskaia | Maria Lebedeva | Jue Hou | Roman Yangarber
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present ReLCo, the Revita Learner Corpus, a new semi-automatically annotated learner corpus for Russian. The corpus was collected while several thousand L2 learners were performing exercises using the Revita language-learning system. All errors were detected automatically by the system and annotated by type. Part of the corpus was annotated manually; this part was created for further experiments on automatic assessment of grammatical correctness. The Learner Corpus provides valuable data for studying patterns of grammatical errors, experimenting with grammatical error detection and grammatical error correction, and developing new exercises for language learners. Automating the collection and annotation makes the process of building the learner corpus much cheaper and faster, in contrast to the traditional approach of building learner corpora. We make the data publicly available.

Applying Gamification Incentives in the Revita Language-learning System
Jue Hou | Ilmari Kylliäinen | Anisia Katinskaia | Giacomo Furlan | Roman Yangarber
Proceedings of the 9th Workshop on Games and Natural Language Processing within the 13th Language Resources and Evaluation Conference

We explore the importance of gamification features in a language-learning platform designed for intermediate-to-advanced learners. Our main thesis is: learning toward advanced levels requires a massive investment of time. If the learner engages in more practice sessions, and if the practice sessions are longer, we can expect the results to be better. This principle appears to be tautologically self-evident. Yet, keeping the learner engaged in general, and building gamification features in particular, requires substantial efforts on the part of developers. Our goal is to keep the learner engaged in long practice sessions over many months, rather than for the short term. This creates a conflict: in academic research on language learning, resources are typically scarce, and gamification usually is not considered an essential priority for allocating resources. We argue in favor of giving serious consideration to gamification in the language-learning setting, as a means of enabling in-depth research. In this paper, we introduce several gamification incentives in the Revita language-learning platform. We discuss the problems in obtaining quantitative measures of the effectiveness of gamification features.

2021

Assessing Grammatical Correctness in Language Learning
Anisia Katinskaia | Roman Yangarber
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

We present experiments on assessing the grammatical correctness of learners’ answers in a language-learning System (references to the System, and the links to the released data and code are withheld for anonymity). In particular, we explore the problem of detecting alternative-correct answers: when more than one inflected form of a lemma fits syntactically and semantically in a given context. We approach the problem with the methods for grammatical error detection (GED), since we hypothesize that models for detecting grammatical mistakes can assess the correctness of potential alternative answers in a learning setting. Due to the paucity of training data, we explore the ability of pre-trained BERT to detect grammatical errors and then fine-tune it using synthetic training data. In this work, we focus on errors in inflection. Our experiments (a) show that pre-trained BERT performs worse at detecting grammatical irregularities for Russian than for English; (b) show that fine-tuned BERT yields promising results on assessing the correctness of grammatical exercises; and (c) establish a new benchmark for Russian. To further investigate its performance, we compare fine-tuned BERT with one of the state-of-the-art models for GED (Bell et al., 2019) on our dataset and RULEC-GEC (Rozovskaya and Roth, 2019). We release the manually annotated learner dataset, used for testing, for general use.

Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych | Olga Kanishcheva | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Vasyl Starko | Josef Steinberger | Roman Yangarber | Michał Marcińczuk | Senja Pollak | Pavel Přibáň | Marko Robnik-Šikonja
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Bogdan Babych | Zara Kancheva | Olga Kanishcheva | Maria Lebedeva | Michał Marcińczuk | Preslav Nakov | Petya Osenova | Lidia Pivovarova | Senja Pollak | Pavel Přibáň | Ivaylo Radev | Marko Robnik-Šikonja | Vasyl Starko | Josef Steinberger | Roman Yangarber
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

2020

Toward a Paradigm Shift in Collection of Learner Corpora
Anisia Katinskaia | Sardana Ivanova | Roman Yangarber
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the first version of the longitudinal Revita Learner Corpus (ReLCo), for Russian. In contrast to traditional learner corpora, ReLCo is collected and annotated fully automatically, while students perform exercises using the Revita language-learning platform. The corpus currently contains 8 422 sentences exhibiting several types of errors (grammatical, lexical, orthographic, etc.), which were committed by learners during practice and were automatically annotated by Revita. The corpus provides valuable information about patterns of learner errors and can be used as a language resource for a number of research tasks, while its creation is much cheaper and faster than for traditional learner corpora. A crucial advantage of ReLCo is that it grows continually while learners practice with Revita, which opens the possibility of creating an unlimited learner resource with longitudinal data collected over time. We make the pilot version of the Russian ReLCo publicly available.

Neural Disambiguation of Lemma and Part of Speech in Morphologically Rich Languages
José María Hoya Quecedo | Maximilian Koppatz | Roman Yangarber
Proceedings of the Twelfth Language Resources and Evaluation Conference

We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages. We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser—with no manual disambiguation or data annotation. We assume that the morphological analyser produces multiple analyses for ambiguous words. The idea is to train recurrent neural networks on the output that the morphological analyser produces for unambiguous words. We present performance on POS and lemma disambiguation that reaches or surpasses the state of the art—including supervised models—using no manually annotated data. We evaluate the method on several morphologically rich languages.

2019

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Michał Marcińczuk | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Laska Laskova | Michał Marcińczuk | Lidia Pivovarova | Pavel Přibáň | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

Modeling language learning using specialized Elo rating
Jue Hou | Maximilian Koppatz | José María Hoya Quecedo | Nataliya Stoyanova | Roman Yangarber
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Automatic assessment of the proficiency levels of the learner is a critical part of Intelligent Tutoring Systems. We present methods for assessment in the context of language learning. We use a specialized Elo formula in conjunction with educational data mining. We simultaneously obtain ratings for the proficiency of the learners and for the difficulty of the linguistic concepts that the learners are trying to master. From the same data we also learn a graph structure representing a domain model capturing the relations among the concepts. This application of Elo provides ratings for learners and concepts which correlate well with subjective proficiency levels of the learners and difficulty levels of the concepts.

Tools for supporting language learning for Sakha
Sardana Ivanova | Anisia Katinskaia | Roman Yangarber
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper presents an overview of the available linguistic resources for the Sakha language, and presents new tools for supporting language learning for Sakha. The essential resources include a morphological analyzer, digital dictionaries, and corpora of Sakha texts. Based on these resources, we implement a language-learning environment for Sakha in the Revita CALL platform. We extended an earlier, preliminary version of the morphological analyzer/transducer, built on the Apertium finite-state platform. The analyzer currently has an adequate level of coverage, between 86% and 89% on two Sakha corpora. Revita is a freely available online language learning platform for learners beyond the beginner level. We describe the tools for Sakha currently integrated into the Revita platform. To the best of our knowledge, at present, this is the first large-scale project undertaken to support intermediate-advanced learners of a minority Siberian language.

Projecting named entity recognizers without annotated or parallel corpora
Jue Hou | Maximilian Koppatz | José María Hoya Quecedo | Roman Yangarber
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.

2018

Comparison of Representations of Named Entities for Document Classification
Lidia Pivovarova | Roman Yangarber
Proceedings of the Third Workshop on Representation Learning for NLP

We explore representations for multi-word names in text classification tasks, on Reuters (RCV1) topic and sector classification. We find that: the best way to treat names is to split them into tokens and use each token as a separate feature; NEs have more impact on sector classification than topic classification; replacing NEs with entity types is not an effective strategy; representing tokens by different embeddings for proper names vs. common nouns does not improve results. We highlight the improvements over state-of-the-art results that our CNN models yield.

Benchmarks and models for entity-oriented polarity detection
Lidia Pivovarova | Arto Klami | Roman Yangarber
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

We address the problem of determining entity-oriented polarity in business news. This can be viewed as classifying the polarity of the sentiment expressed toward a given mention of a company in a news article. We present a complete, end-to-end approach to the problem. We introduce a new dataset of over 17,000 manually labeled documents, which is substantially larger than any currently available resources. We propose a benchmark solution based on convolutional neural networks for classifying entity-oriented polarity. Although our dataset is much larger than those currently available, it is small on the scale of datasets commonly used for training robust neural network models. To compensate for this, we use transfer learning—pre-train the model on a much larger dataset, annotated for a related but different classification task, in order to learn a good representation for business text, and then fine-tune it on the smaller polarity dataset.

Revita: a Language-learning Platform at the Intersection of ITS and CALL
Anisia Katinskaia | Javad Nouri | Roman Yangarber
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

HCS at SemEval-2017 Task 5: Polarity detection in business news using convolutional neural networks
Lidia Pivovarova | Llorenç Escoter | Arto Klami | Roman Yangarber
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Task 5 of SemEval-2017 involves fine-grained sentiment analysis on financial microblogs and news. Our solution for determining the sentiment score extends an earlier convolutional neural network for sentiment analysis in several ways. We explicitly encode a focus on a particular company, we apply a data augmentation scheme, and use a larger data collection to complement the small training data provided by the task organizers. The best results were achieved by training a model on an external dataset and then tuning it using the provided training dataset.

Grouping business news stories based on salience of named entities
Llorenç Escoter | Lidia Pivovarova | Mian Du | Anisia Katinskaia | Roman Yangarber
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In news aggregation systems focused on broad news domains, certain stories may appear in multiple articles. Depending on the relative importance of the story, the number of versions can reach dozens or hundreds within a day. The text in these versions may be nearly identical or quite different. Linking multiple versions of a story into a single group brings several important benefits to the end-user: reducing the cognitive load on the reader, as well as signaling the relative importance of the story. We present a grouping algorithm, and explore several vector-based representations of input documents: from a baseline using keywords, to a method using salience, a measure of importance of named entities in the text. We demonstrate that features beyond keywords yield substantial improvements, verified on a manually-annotated corpus of business news stories.

Revita: a system for language learning and supporting endangered languages
Anisia Katinskaia | Javad Nouri | Roman Yangarber
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge goes beyond the date of the publication of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.

2016

A Novel Evaluation Method for Morphological Segmentation
Javad Nouri | Roman Yangarber
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Unsupervised learning of morphological segmentation of words in a language, based only on a large corpus of words, is a challenging task. Evaluation of the learned segmentations is a challenge in itself, due to the inherent ambiguity of the segmentation task. There is no way to posit unique “correct” segmentation for a set of data in an objective way. Two models may arrive at different ways of segmenting the data, which may nonetheless both be valid. Several evaluation methods have been proposed to date, but they do not insist on consistency of the evaluated model. We introduce a new evaluation methodology, which enforces correctness of segmentation boundaries while also assuring consistency of segmentation decisions across the corpus.

From alignment of etymological data to phylogenetic inference via population genetics
Javad Nouri | Roman Yangarber
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

Modeling language evolution with codes that utilize context and phonetic features
Javad Nouri | Roman Yangarber
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

2015

The 5th Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Hristo Tanev | Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing

Online Extraction of Russian Multiword Expressions
Mikhail Kopotev | Llorenç Escoter | Daria Kormacheva | Matthew Pierce | Lidia Pivovarova | Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing

2014

Measuring Language Closeness by Modeling Regularity
Javad Nouri | Roman Yangarber
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

2013

Automatic Detection of Stable Grammatical Features in N-Grams
Mikhail Kopotev | Lidia Pivovarova | Natalia Kochetkova | Roman Yangarber
Proceedings of the 9th Workshop on Multiword Expressions

Event representation across genre
Lidia Pivovarova | Silja Huttunen | Roman Yangarber
Workshop on Events: Definition, Detection, Coreference, and Representation

Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski | Lidia Pivovarova | Hristo Tanev | Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Adapting the PULS event extraction framework to analyze Russian text
Lidia Pivovarova | Mian Du | Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Combined analysis of news and Twitter messages
Mian Du | Jussi Kangasharju | Ossi Karkulahti | Lidia Pivovarova | Roman Yangarber
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

2012

Using context and phonetic features in models of etymological sound change
Hannes Wettig | Kirill Reshetnikov | Roman Yangarber
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

2011

MDL-based Models for Alignment of Etymological Data
Hannes Wettig | Suvi Hiltunen | Roman Yangarber
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Relevance Prediction in Information Extraction using Discourse and Lexical Features
Silja Huttunen | Arto Vihavainen | Peter von Etter | Roman Yangarber
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

Probabilistic Models for Alignment of Etymological Data
Hannes Wettig | Roman Yangarber
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

Assessment of Utility in Web Mining for the Domain of Public Health
Peter von Etter | Silja Huttunen | Arto Vihavainen | Matti Vuorinen | Roman Yangarber
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

Filtering news for epidemic surveillance: towards processing more languages with fewer resources
Gaël Lejeune | Antoine Doucet | Roman Yangarber | Nadine Lucas
Proceedings of the 4th Workshop on Cross Lingual Information Access

2008

Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Sivaji Bandyopadhyay | Thierry Poibeau | Horacio Saggion | Roman Yangarber
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization

2006

Proceedings of the Workshop on Information Extraction Beyond The Document
Mary Elaine Califf | Mark A. Greenwood | Mark Stevenson | Roman Yangarber
Proceedings of the Workshop on Information Extraction Beyond The Document

2005

Redundancy-based Correction of Automatically Extracted Facts
Roman Yangarber | Lauri Jokipii
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Extracting Information about Outbreaks of Infectious Epidemics
Roman Yangarber | Lauri Jokipii | Antti Rauramo | Silja Huttunen
Proceedings of HLT/EMNLP 2005 Interactive Demonstrations

2003

Counter-Training in Discovery of Semantic Patterns
Roman Yangarber
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

2002

Diversity of Scenarios in Information Extraction
Silja Huttunen | Roman Yangarber | Ralph Grishman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Unsupervised Learning of Generalized Names
Roman Yangarber | Winston Lin | Ralph Grishman
COLING 2002: The 19th International Conference on Computational Linguistics

Complexity of Event Structure in IE Scenarios
Silja Huttunen | Roman Yangarber | Ralph Grishman
COLING 2002: The 19th International Conference on Computational Linguistics

2000

Unsupervised Discovery of Scenario-Level Patterns for Information Extraction
Roman Yangarber | Ralph Grishman | Pasi Tapanainen
Sixth Applied Natural Language Processing Conference

Automatic Acquisition of Domain Knowledge for Information Extraction
Roman Yangarber | Ralph Grishman | Pasi Tapanainen | Silja Huttunen
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1998

Japanese IE System and Customization Tool
Chikashi Nobata | Satoshi Sekine | Roman Yangarber
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998

Transforming Examples into Patterns for Information Extraction
Roman Yangarber | Ralph Grishman
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998

Deriving Transfer Rules from Dominance-Preserving Alignments
Adam Meyers | Roman Yangarber | Ralph Grishman | Catherine Macleod | Antonio Moreno-Sandoval
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

NYU: Description of the Proteus/PET System as Used for MUC-7 ST
Roman Yangarber | Ralph Grishman
Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998

Deriving Transfer Rules from Dominance-Preserving Alignments
Adam Meyers | Roman Yangarber | Ralph Grishman | Catherine Macleod | Antonio Moreno-Sandoval
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

Using NOMLEX to Produce Nominalization Patterns for Information Extraction
Adam Meyers | Catherine Macleod | Roman Yangarber | Ralph Grishman | Leslie Barrett | Ruth Reeves
The Computational Treatment of Nominals

1996

Alignment of Shared Forests for Bilingual Corpora
Adam Meyers | Roman Yangarber | Ralph Grishman
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics