2024
Intelligent Tutor to Support Teaching and Learning of Tatar
Alsu Zakirova | Jue Hou | Anisia Katinskaia | Anh-Duc Vu | Roman Yangarber
Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024)
This paper presents our work on tools to support the Tatar language, using Revita, a web-based Intelligent Tutoring System for language teaching and learning. The system allows the users — teachers and learners — to upload arbitrary authentic texts, and automatically creates exercises based on these texts that engage the learners in active production of language. It provides graduated feedback when they make mistakes, and performs continuous assessment, based on which the system selects exercises for the learners at the appropriate level. The assessment also helps the students maintain their learning pace, and helps the teachers to monitor their progress. The paper describes the functionality currently implemented for Tatar, which enables learners — who possess basic proficiency beyond the beginner level — to improve their competency, using texts of their choice as learning content. Support for Tatar is being developed to increase public interest in learning the language of this important regional minority, as well as to provide tools for improving fluency to “heritage speakers” — those who have substantial passive competency, but lack active fluency and need support for regular practice.
Probing the Category of Verbal Aspect in Transformer Language Models
Anisia Katinskaia | Roman Yangarber
Findings of the Association for Computational Linguistics: NAACL 2024
We investigate how pretrained language models (PLM) encode the grammatical category of verbal aspect in Russian. Encoding of aspect in transformer LMs has not been studied previously in any language. A particular challenge is posed by “alternative contexts”: where either the perfective or the imperfective aspect is suitable grammatically and semantically. We perform probing using BERT and RoBERTa on alternative and non-alternative contexts. First, we assess the models’ performance on aspect prediction, via behavioral probing. Next, we examine the models’ performance when their contextual representations are substituted with counterfactual representations, via causal probing. These counterfactuals alter the value of the “boundedness” feature—a semantic feature, which characterizes the action in the context. Experiments show that BERT and RoBERTa do encode aspect—mostly in their final layers. The counterfactual interventions affect perfective and imperfective in opposite ways, which is consistent with grammar: perfective is positively affected by adding the meaning of boundedness, and vice versa. The practical implications of our probing results are that fine-tuning only the last layers of BERT on predicting aspect is faster and more effective than fine-tuning the whole model. The model has high predictive uncertainty about aspect in alternative contexts, which tend to lack explicit hints about the boundedness of the described action.
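A probing classifier of the kind used in such studies can be sketched as follows. This is a minimal illustration, not the paper's setup: the "representations" are toy two-dimensional vectors standing in for frozen BERT/RoBERTa hidden states, and the labels and all names are illustrative.

```python
import math

def train_probe(reps, labels, epochs=200, lr=0.1):
    """Logistic-regression probe trained on frozen representations.
    `reps` are toy stand-ins for hidden-layer vectors; a real probe
    would read them out of the layers of a pretrained transformer."""
    dim = len(reps[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(reps, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of log-loss w.r.t. the logit z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(w, b, x):
    """1 = perfective, 0 = imperfective (labels are illustrative)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy "layer representations" for perfective (1) and imperfective (0) contexts.
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_probe(reps, labels)
```

The probe's accuracy on held-out contexts is then read as a measure of how much aspect information a given layer encodes.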
Cross-lingual Named Entity Corpus for Slavic Languages
Jakub Piskorski | Michał Marcińczuk | Roman Yangarber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents a corpus manually annotated with named entities for six Slavic languages — Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017–2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits — single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models — XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
GPT-3.5 for Grammatical Error Correction
Anisia Katinskaia | Roman Yangarber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.
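One of the evaluation signals mentioned above, comparing semantic embeddings of the source and corrected sentences, can be sketched as follows. This is a generic sketch, not the paper's actual pipeline: the embedding vectors would come from a sentence encoder (not reproduced here), and the threshold value is illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def preserves_semantics(emb_source, emb_corrected, threshold=0.9):
    """Flag corrections whose embedding drifts too far from the source,
    a sign that the model rewrote the sentence rather than correcting it.
    The threshold is an illustrative value, not one from the paper."""
    return cosine(emb_source, emb_corrected) >= threshold
```

A correction flagged this way is a candidate over-correction: grammatical, perhaps, but no longer a faithful version of the learner's sentence.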
What Do Transformers Know about Government?
Jue Hou | Anisia Katinskaia | Lari Kotilainen | Sathianpong Trangcasanchai | Anh-Duc Vu | Roman Yangarber
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers, and data from two morphologically rich languages. Our experiments show that information about government is encoded across all transformer layers, but predominantly in the early layers of the model. We find that, for both languages, a small number of attention heads encode enough information about the government relations to enable us to train a classifier capable of discovering new, previously unknown types of government, never seen in the training data. Currently, data is lacking for the research community working on grammatical constructions, and government in particular. We release the Government Bank—a dataset defining the government relations for thousands of lemmas in the languages in our experiments.
2023
Effects of sub-word segmentation on performance of transformer language models
Jue Hou | Anisia Katinskaia | Anh-Duc Vu | Roman Yangarber
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation — Morfessor and StateMorph. We train the models for several languages — including ones with very rich morphology — and compare their performance with different segmentation algorithms, vocabulary sizes, and model sizes. The results show that training with morphological segmentation allows the LMs to: (1) achieve lower perplexity, (2) converge more efficiently in terms of training time, and (3) achieve equivalent or better evaluation scores on downstream tasks. Lastly, we show that (4) LMs of smaller size using morphological segmentation can perform comparably to models of larger size trained with BPE — both in terms of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) have an impact on sustainability, since they reduce the model cost; and while (2) reduces cost only in the training phase, (4) does so also in the inference phase.
Linguistic Constructs Represent the Domain Model in Intelligent Language Tutoring
Anisia Katinskaia | Jue Hou | Anh-duc Vu | Roman Yangarber
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
This paper presents the development of the AI-based language-learning platform, Revita. It is an intelligent online tutor, developed to support learners of multiple languages, from lower-intermediate toward advanced levels. It has been in pilot use with hundreds of students at several universities, whose feedback and needs shape the development. One of the main emerging features of Revita is the system of linguistic constructs to represent the domain knowledge. The system of constructs is developed in collaboration with experts in language pedagogy. Constructs define the types of exercises, the content of the feedback, and enable detailed modeling and evaluation of learner progress.
Grammatical Error Correction for Sentence-level Assessment in Language Learning
Anisia Katinskaia | Roman Yangarber
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
The paper presents experiments on using a Grammatical Error Correction (GEC) model to assess the correctness of answers that language learners give to grammar exercises. We explored whether a GEC model can be applied in the language learning context for a language with complex morphology. We empirically check a hypothesis that a GEC model corrects only errors and leaves correct answers unchanged. We perform a test on assessing learner answers in a real but constrained language-learning setup: the learners answer only fill-in-the-blank and multiple-choice exercises. For this purpose, we use ReLCo, a publicly available manually annotated learner dataset in Russian (Katinskaia et al., 2022). In this experiment, we fine-tune a large-scale T5 language model for the GEC task and estimate its performance on the RULEC-GEC dataset (Rozovskaya and Roth, 2019) to compare with top-performing models. We also release an updated version of the RULEC-GEC test set, manually checked by native speakers. Our analysis shows that the GEC model performs reasonably well in detecting erroneous answers to grammar exercises and can potentially be used for the best-performing error types in a real learning setup. However, under the aforementioned hypothesis it struggles to assess answers which human annotators tagged as alternative-correct. This is in large part due to a still low recall in correcting errors, and the fact that the GEC model may modify even correct words—it may generate plausible alternatives, which are hard to evaluate against the gold-standard reference.
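The hypothesis being checked (that a GEC model corrects only errors and leaves correct answers unchanged) translates into a simple decision rule for assessment. The sketch below is illustrative: `toy_gec` is a stub standing in for the fine-tuned T5 model, and the example strings are invented.

```python
def assess_answer(learner_answer, gec_model):
    """Assess a learner's answer under the hypothesis that a GEC model
    changes only erroneous input. `gec_model` is any callable mapping a
    sentence to its correction (here a stub, not a trained model)."""
    correction = gec_model(learner_answer)
    if correction == learner_answer:
        return ("accepted", None)        # model proposed no change
    return ("rejected", correction)      # model changed something

# Stub "GEC model" that fixes one known adjective-noun agreement error.
def toy_gec(sentence):
    return sentence.replace("хорошие книга", "хорошая книга")
```

As the abstract notes, the rule is only an approximation: a real GEC model may also rewrite correct words, so string equality between input and output understates the set of acceptable answers.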
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
Jakub Piskorski | Michał Marcińczuk | Preslav Nakov | Maciej Ogrodniczuk | Senja Pollak | Pavel Přibáň | Piotr Rybak | Josef Steinberger | Roman Yangarber
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
Slav-NER: the 4th Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages
Roman Yangarber | Jakub Piskorski | Anna Dmitrieva | Michał Marcińczuk | Pavel Přibáň | Piotr Rybak | Josef Steinberger
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper describes Slav-NER: the 4th Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. This version of the Challenge covers three languages and five entity types. It is organized as part of the 9th Slavic Natural Language Processing Workshop, co-located with the EACL 2023 Conference. Seven teams registered and three participated actively in the competition. Performance for the named entity recognition and normalization tasks reached 90% F1 measure, much higher than reported in the first edition of the Challenge, but similar to the results reported in the latest edition. Performance for the entity linking task for individual languages reached the range of 72-80% F1 measure. Detailed evaluation information is available on the Shared Task web page.
Question Answering and Question Generation for Finnish
Ilmari Kylliäinen | Roman Yangarber
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Recent advances in the field of language modeling have improved the state-of-the-art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large QA/QG model training resources, which has prevented experimenting with state-of-the-art QA/QG fine-tuning methods. We present the first neural QA and QG models that work with Finnish. To train the models, we automatically translate the SQuAD dataset and then use normalization methods to reduce the amount of problematic data created during the translation. Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models to both QA and QG and evaluate their performance. To the best of our knowledge, the resulting dataset is the first large-scale QA/QG resource for Finnish. This paper also sets the initial benchmarks for Finnish-language QA and QG.
2022
Applying Gamification Incentives in the Revita Language-learning System
Jue Hou | Ilmari Kylliäinen | Anisia Katinskaia | Giacomo Furlan | Roman Yangarber
Proceedings of the 9th Workshop on Games and Natural Language Processing within the 13th Language Resources and Evaluation Conference
We explore the importance of gamification features in a language-learning platform designed for intermediate-to-advanced learners. Our main thesis is: learning toward advanced levels requires a massive investment of time. If the learner engages in more practice sessions, and if the practice sessions are longer, we can expect the results to be better. This principle appears to be tautologically self-evident. Yet, keeping the learner engaged in general—and building gamification features in particular—requires substantial efforts on the part of developers. Our goal is to keep the learner engaged in long practice sessions over many months—rather than for the short-term. This creates a conflict: In academic research on language learning, resources are typically scarce, and gamification usually is not considered an essential priority for allocating resources. We argue in favor of giving serious consideration to gamification in the language-learning setting—as a means of enabling in-depth research. In this paper, we introduce several gamification incentives in the Revita language-learning platform. We discuss the problems in obtaining quantitative measures of the effectiveness of gamification features.
Semi-automatically Annotated Learner Corpus for Russian
Anisia Katinskaia | Maria Lebedeva | Jue Hou | Roman Yangarber
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present ReLCo—the Revita Learner Corpus—a new semi-automatically annotated learner corpus for Russian. The corpus was collected while several thousand L2 learners were performing exercises using the Revita language-learning system. All errors were detected automatically by the system and annotated by type. Part of the corpus was annotated manually—this part was created for further experiments on automatic assessment of grammatical correctness. The Learner Corpus provides valuable data for studying patterns of grammatical errors, experimenting with grammatical error detection and grammatical error correction, and developing new exercises for language learners. Automating the collection and annotation makes the process of building the learner corpus much cheaper and faster, in contrast to the traditional approach of building learner corpora. We make the data publicly available.
2021
Assessing Grammatical Correctness in Language Learning
Anisia Katinskaia | Roman Yangarber
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
We present experiments on assessing the grammatical correctness of learners’ answers in a language-learning System (references to the System, and the links to the released data and code are withheld for anonymity). In particular, we explore the problem of detecting alternative-correct answers: when more than one inflected form of a lemma fits syntactically and semantically in a given context. We approach the problem with the methods for grammatical error detection (GED), since we hypothesize that models for detecting grammatical mistakes can assess the correctness of potential alternative answers in a learning setting. Due to the paucity of training data, we explore the ability of pre-trained BERT to detect grammatical errors and then fine-tune it using synthetic training data. In this work, we focus on errors in inflection. Our experiments show (a) that pre-trained BERT performs worse at detecting grammatical irregularities for Russian than for English; (b) that fine-tuned BERT yields promising results on assessing the correctness of grammatical exercises; and (c) establish a new benchmark for Russian. To further investigate its performance, we compare fine-tuned BERT with one of the state-of-the-art models for GED (Bell et al., 2019) on our dataset and RULEC-GEC (Rozovskaya and Roth, 2019). We release the manually annotated learner dataset, used for testing, for general use.
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych | Olga Kanishcheva | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Vasyl Starko | Josef Steinberger | Roman Yangarber | Michał Marcińczuk | Senja Pollak | Pavel Přibáň | Marko Robnik-Šikonja
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Bogdan Babych | Zara Kancheva | Olga Kanishcheva | Maria Lebedeva | Michał Marcińczuk | Preslav Nakov | Petya Osenova | Lidia Pivovarova | Senja Pollak | Pavel Přibáň | Ivaylo Radev | Marko Robnik-Sikonja | Vasyl Starko | Josef Steinberger | Roman Yangarber
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.
2020
Toward a Paradigm Shift in Collection of Learner Corpora
Anisia Katinskaia | Sardana Ivanova | Roman Yangarber
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present the first version of the longitudinal Revita Learner Corpus (ReLCo), for Russian. In contrast to traditional learner corpora, ReLCo is collected and annotated fully automatically, while students perform exercises using the Revita language-learning platform. The corpus currently contains 8 422 sentences exhibiting several types of errors—grammatical, lexical, orthographic, etc.—which were committed by learners during practice and were automatically annotated by Revita. The corpus provides valuable information about patterns of learner errors and can be used as a language resource for a number of research tasks, while its creation is much cheaper and faster than for traditional learner corpora. A crucial advantage of ReLCo is that it grows continually while learners practice with Revita, which opens the possibility of creating an unlimited learner resource with longitudinal data collected over time. We make the pilot version of the Russian ReLCo publicly available.
Neural Disambiguation of Lemma and Part of Speech in Morphologically Rich Languages
José María Hoya Quecedo | Koppatz Maximilian | Roman Yangarber
Proceedings of the Twelfth Language Resources and Evaluation Conference
We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages. We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser—with no manual disambiguation or data annotation. We assume that the morphological analyser produces multiple analyses for ambiguous words. The idea is to train recurrent neural networks on the output that the morphological analyser produces for unambiguous words. We present performance on POS and lemma disambiguation that reaches or surpasses the state of the art—including supervised models—using no manually annotated data. We evaluate the method on several morphologically rich languages.
2019
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Michał Marcińczuk | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Laska Laskova | Michał Marcińczuk | Lidia Pivovarova | Pavel Přibáň | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.
Modeling language learning using specialized Elo rating
Jue Hou | Koppatz Maximilian | José María Hoya Quecedo | Nataliya Stoyanova | Roman Yangarber
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Automatic assessment of the proficiency levels of the learner is a critical part of Intelligent Tutoring Systems. We present methods for assessment in the context of language learning. We use a specialized Elo formula in conjunction with educational data mining. We simultaneously obtain ratings for the proficiency of the learners and for the difficulty of the linguistic concepts that the learners are trying to master. From the same data we also learn a graph structure representing a domain model capturing the relations among the concepts. This application of Elo provides ratings for learners and concepts which correlate well with subjective proficiency levels of the learners and difficulty levels of the concepts.
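The core idea, rating learners and concepts against each other, can be sketched with the standard Elo update. The paper's specialized formula is not reproduced here; the scale constants, K-factor, and all names below are the conventional illustrative choices.

```python
def elo_expected(r_learner, r_concept):
    """Expected probability that the learner answers an exercise on this
    concept correctly, given the two current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_concept - r_learner) / 400.0))

def elo_update(r_learner, r_concept, correct, k=32.0):
    """Update both ratings after one answer.
    `correct` is 1.0 for a right answer, 0.0 for a wrong one."""
    delta = k * (correct - elo_expected(r_learner, r_concept))
    # The learner gains rating by beating difficult concepts; the
    # concept's difficulty rating moves in the opposite direction.
    return r_learner + delta, r_concept - delta

# Toy usage: a learner repeatedly answering one concept correctly.
learner, concept = 1500.0, 1500.0
for _ in range(10):
    learner, concept = elo_update(learner, concept, correct=1.0)
```

Each correct answer raises the learner's rating and lowers the concept's difficulty rating by the same amount, so the two rating scales stay calibrated against each other.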
Tools for supporting language learning for Sakha
Sardana Ivanova | Anisia Katinskaia | Roman Yangarber
Proceedings of the 22nd Nordic Conference on Computational Linguistics
This paper presents an overview of the available linguistic resources for the Sakha language, and presents new tools for supporting language learning for Sakha. The essential resources include a morphological analyzer, digital dictionaries, and corpora of Sakha texts. Based on these resources, we implement a language-learning environment for Sakha in the Revita CALL platform. We extended an earlier, preliminary version of the morphological analyzer/transducer, built on the Apertium finite-state platform. The analyzer currently has an adequate level of coverage, between 86% and 89% on two Sakha corpora. Revita is a freely available online language learning platform for learners beyond the beginner level. We describe the tools for Sakha currently integrated into the Revita platform. To the best of our knowledge, at present, this is the first large-scale project undertaken to support intermediate-advanced learners of a minority Siberian language.
Projecting named entity recognizers without annotated or parallel corpora
Jue Hou | Maximilian Koppatz | José María Hoya Quecedo | Roman Yangarber
Proceedings of the 22nd Nordic Conference on Computational Linguistics
Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.
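The projection step, carrying annotations from English documents onto matched Finnish ones, can be sketched as follows. This is a simplified illustration, not the paper's implementation: the entity names are invented, and the suffix pattern is a crude stand-in for the limited linguistic rules the paper uses to resolve matches.

```python
import re

def project_entities(en_entities, fi_text):
    """Project English NE annotations onto a matched Finnish document by
    surface-form matching. `en_entities` is a list of (name, type) pairs.
    Finnish inflects names, so each name is matched together with an
    optional suffix of word characters (e.g. 'Nokia' matches 'Nokian')."""
    projected = []
    for name, etype in en_entities:
        for m in re.finditer(re.escape(name) + r"\w*", fi_text):
            projected.append((m.group(0), etype, m.start()))
    return projected

en_entities = [("Nokia", "ORG"), ("Microsoft", "ORG")]
fi_text = "Nokian ja Microsoftin yhteistyö laajenee."
spans = project_entities(en_entities, fi_text)
```

The projected spans then serve as (noisy) training instances for a sequence labeler such as the BiLSTM-CRF mentioned in the abstract.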
2018
Benchmarks and models for entity-oriented polarity detection
Lidia Pivovarova | Arto Klami | Roman Yangarber
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
We address the problem of determining entity-oriented polarity in business news. This can be viewed as classifying the polarity of the sentiment expressed toward a given mention of a company in a news article. We present a complete, end-to-end approach to the problem. We introduce a new dataset of over 17,000 manually labeled documents, which is substantially larger than any currently available resources. We propose a benchmark solution based on convolutional neural networks for classifying entity-oriented polarity. Although our dataset is much larger than those currently available, it is small on the scale of datasets commonly used for training robust neural network models. To compensate for this, we use transfer learning—pre-train the model on a much larger dataset, annotated for a related but different classification task, in order to learn a good representation for business text, and then fine-tune it on the smaller polarity dataset.
Revita: a Language-learning Platform at the Intersection of ITS and CALL
Anisia Katinskaia | Javad Nouri | Roman Yangarber
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Comparison of Representations of Named Entities for Document Classification
Lidia Pivovarova | Roman Yangarber
Proceedings of the Third Workshop on Representation Learning for NLP
We explore representations for multi-word names in text classification tasks, on Reuters (RCV1) topic and sector classification. We find that: the best way to treat names is to split them into tokens and use each token as a separate feature; NEs have more impact on sector classification than topic classification; replacing NEs with entity types is not an effective strategy; representing tokens by different embeddings for proper names vs. common nouns does not improve results. We highlight the improvements over state-of-the-art results that our CNN models yield.
2017
HCS at SemEval-2017 Task 5: Polarity detection in business news using convolutional neural networks
Lidia Pivovarova | Llorenç Escoter | Arto Klami | Roman Yangarber
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Task 5 of SemEval-2017 involves fine-grained sentiment analysis on financial microblogs and news. Our solution for determining the sentiment score extends an earlier convolutional neural network for sentiment analysis in several ways. We explicitly encode a focus on a particular company, we apply a data augmentation scheme, and use a larger data collection to complement the small training data provided by the task organizers. The best results were achieved by training a model on an external dataset and then tuning it using the provided training dataset.
Grouping business news stories based on salience of named entities
Llorenç Escoter | Lidia Pivovarova | Mian Du | Anisia Katinskaia | Roman Yangarber
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
In news aggregation systems focused on broad news domains, certain stories may appear in multiple articles. Depending on the relative importance of the story, the number of versions can reach dozens or hundreds within a day. The text in these versions may be nearly identical or quite different. Linking multiple versions of a story into a single group brings several important benefits to the end-user: reducing the cognitive load on the reader, as well as signaling the relative importance of the story. We present a grouping algorithm, and explore several vector-based representations of input documents: from a baseline using keywords, to a method using salience, a measure of the importance of named entities in the text. We demonstrate that features beyond keywords yield substantial improvements, verified on a manually-annotated corpus of business news stories.
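A vector-based grouping of this kind can be sketched as follows. The greedy scheme, the threshold, and the toy salience vectors are all illustrative, not the paper's exact algorithm or data.

```python
def salience_sim(a, b):
    """Cosine similarity over sparse {entity: salience} vectors."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_stories(docs, sim, threshold=0.5):
    """Greedy grouping: attach each document to the first group whose seed
    document is similar enough, otherwise start a new group."""
    groups = []
    for d in docs:
        for g in groups:
            if sim(d, g[0]) >= threshold:
                g.append(d)
                break
        else:
            groups.append([d])
    return groups

docs = [{"Nokia": 1.0}, {"Nokia": 0.9, "Elisa": 0.1}, {"Apple": 1.0}]
groups = group_stories(docs, salience_sim)
```

Two versions of the Nokia story end up in one group, while the Apple story starts its own; replacing `salience_sim` with a keyword-based similarity gives the baseline representation the abstract compares against.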
Revita: a system for language learning and supporting endangered languages
Anisia Katinskaia | Javad Nouri | Roman Yangarber
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge extends beyond the publication date of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.
2016
pdf
bib
From alignment of etymological data to phylogenetic inference via population genetics
Javad Nouri
|
Roman Yangarber
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning
pdf
bib
abs
A Novel Evaluation Method for Morphological Segmentation
Javad Nouri
|
Roman Yangarber
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Unsupervised learning of morphological segmentation of words in a language, based only on a large corpus of words, is a challenging task. Evaluation of the learned segmentations is a challenge in itself, due to the inherent ambiguity of the segmentation task. There is no way to posit a unique “correct” segmentation for a set of data in an objective way. Two models may arrive at different ways of segmenting the data, which may nonetheless both be valid. Several evaluation methods have been proposed to date, but they do not insist on consistency of the evaluated model. We introduce a new evaluation methodology, which enforces correctness of segmentation boundaries while also assuring consistency of segmentation decisions across the corpus.
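The two requirements named in the abstract — correct boundaries and corpus-wide consistency — can be illustrated with a minimal sanity check. This sketch is an assumption-laden illustration, not the paper's actual evaluation metric: it only verifies that each analysis reconstructs its word and that a word type is never given two different analyses.

```python
def check_segmentations(segmentations):
    # segmentations: list of (word, morphs) pairs, where morphs is a
    # list of the segments proposed for that word.
    # Check 1 (boundary correctness): segments must concatenate back
    # to the surface word.
    # Check 2 (consistency): the same word type must receive the same
    # analysis everywhere it occurs in the corpus.
    seen = {}
    for word, morphs in segmentations:
        if "".join(morphs) != word:
            return False, f"boundaries do not reconstruct {word!r}"
        if word in seen and seen[word] != morphs:
            return False, f"inconsistent analyses for {word!r}"
        seen[word] = morphs
    return True, "ok"
```

A real metric would also need to score consistency of *morph* usage across different words (e.g., the plural `-s` segmented the same way in `cats` and `dogs`), which is where the ambiguity the abstract discusses makes the problem genuinely hard.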
pdf
bib
Modeling language evolution with codes that utilize context and phonetic features
Javad Nouri
|
Roman Yangarber
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning
2015
pdf
bib
The 5th Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski
|
Lidia Pivovarova
|
Jan Šnajder
|
Hristo Tanev
|
Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing
pdf
bib
Online Extraction of Russian Multiword Expressions
Mikhail Kopotev
|
Llorenç Escoter
|
Daria Kormacheva
|
Matthew Pierce
|
Lidia Pivovarova
|
Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing
2014
pdf
bib
Measuring Language Closeness by Modeling Regularity
Javad Nouri
|
Roman Yangarber
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants
2013
pdf
bib
Automatic Detection of Stable Grammatical Features in N-Grams
Mikhail Kopotev
|
Lidia Pivovarova
|
Natalia Kochetkova
|
Roman Yangarber
Proceedings of the 9th Workshop on Multiword Expressions
pdf
bib
Event representation across genre
Lidia Pivovarova
|
Silja Huttunen
|
Roman Yangarber
Workshop on Events: Definition, Detection, Coreference, and Representation
pdf
bib
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski
|
Lidia Pivovarova
|
Hristo Tanev
|
Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
pdf
bib
Adapting the PULS event extraction framework to analyze Russian text
Lidia Pivovarova
|
Mian Du
|
Roman Yangarber
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing
pdf
bib
Combined analysis of news and Twitter messages
Mian Du
|
Jussi Kangasharju
|
Ossi Karkulahti
|
Lidia Pivovarova
|
Roman Yangarber
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction
2012
pdf
bib
Using context and phonetic features in models of etymological sound change
Hannes Wettig
|
Kirill Reshetnikov
|
Roman Yangarber
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH
2011
pdf
bib
Relevance Prediction in Information Extraction using Discourse and Lexical Features
Silja Huttunen
|
Arto Vihavainen
|
Peter von Etter
|
Roman Yangarber
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
pdf
bib
Probabilistic Models for Alignment of Etymological Data
Hannes Wettig
|
Roman Yangarber
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
pdf
bib
MDL-based Models for Alignment of Etymological Data
Hannes Wettig
|
Suvi Hiltunen
|
Roman Yangarber
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
2010
pdf
bib
Assessment of Utility in Web Mining for the Domain of Public Health
Peter von Etter
|
Silja Huttunen
|
Arto Vihavainen
|
Matti Vuorinen
|
Roman Yangarber
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents
pdf
bib
Filtering news for epidemic surveillance: towards processing more languages with fewer resources
Gaël Lejeune
|
Antoine Doucet
|
Roman Yangarber
|
Nadine Lucas
Proceedings of the 4th Workshop on Cross Lingual Information Access
2008
pdf
bib
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Sivaji Bandyopadhyay
|
Thierry Poibeau
|
Horacio Saggion
|
Roman Yangarber
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
2006
pdf
bib
Proceedings of the Workshop on Information Extraction Beyond The Document
Mary Elaine Califf
|
Mark A. Greenwood
|
Mark Stevenson
|
Roman Yangarber
Proceedings of the Workshop on Information Extraction Beyond The Document
2005
pdf
bib
Redundancy-based Correction of Automatically Extracted Facts
Roman Yangarber
|
Lauri Jokipii
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
pdf
bib
Extracting Information about Outbreaks of Infectious Epidemics
Roman Yangarber
|
Lauri Jokipii
|
Antti Rauramo
|
Silja Huttunen
Proceedings of HLT/EMNLP 2005 Interactive Demonstrations
2003
pdf
bib
Counter-Training in Discovery of Semantic Patterns
Roman Yangarber
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics
2002
pdf
bib
Unsupervised Learning of Generalized Names
Roman Yangarber
|
Winston Lin
|
Ralph Grishman
COLING 2002: The 19th International Conference on Computational Linguistics
pdf
bib
Complexity of Event Structure in IE Scenarios
Silja Huttunen
|
Roman Yangarber
|
Ralph Grishman
COLING 2002: The 19th International Conference on Computational Linguistics
pdf
bib
Diversity of Scenarios in Information Extraction
Silja Huttunen
|
Roman Yangarber
|
Ralph Grishman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2000
pdf
bib
Unsupervised Discovery of Scenario-Level Patterns for Information Extraction
Roman Yangarber
|
Ralph Grishman
|
Pasi Tapanainen
Sixth Applied Natural Language Processing Conference
pdf
bib
Automatic Acquisition of Domain Knowledge for Information Extraction
Roman Yangarber
|
Ralph Grishman
|
Pasi Tapanainen
|
Silja Huttunen
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics
1998
pdf
bib
NYU: Description of the Proteus/PET System as Used for MUC-7 ST
Roman Yangarber
|
Ralph Grishman
Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998
pdf
bib
Japanese IE System and Customization Tool
Chikashi Nobata
|
Satoshi Sekine
|
Roman Yangarber
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998
pdf
bib
Transforming Examples into Patterns for Information Extraction
Roman Yangarber
|
Ralph Grishman
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998
pdf
bib
Deriving Transfer Rules from Dominance-Preserving Alignments
Adam Meyers
|
Roman Yangarber
|
Ralph Grishman
|
Catherine Macleod
|
Antonio Moreno-Sandoval
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2
pdf
bib
Deriving Transfer Rules from Dominance-Preserving Alignments
Adam Meyers
|
Roman Yangarber
|
Ralph Grishman
|
Catherine Macleod
|
Antonio Moreno-Sandoval
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics
pdf
bib
Using NOMLEX to Produce Nominalization Patterns for Information Extraction
Adam Meyers
|
Catherine Macleod
|
Roman Yangarber
|
Ralph Grishman
|
Leslie Barrett
|
Ruth Reeves
The Computational Treatment of Nominals
1996
pdf
bib
Alignment of Shared Forests for Bilingual Corpora
Adam Meyers
|
Roman Yangarber
|
Ralph Grishman
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics