2022
Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadić, Vanja Štefanec, Maciej Ogrodniczuk, Bartłomiej Nitoń, Piotr Pęzik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Dan Tufiș, Radovan Garabík, Simon Krek, Andraž Repar
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and their named entities are annotated. The annotations are provided uniformly for each language-specific corpus, while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
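The column layout described in the abstract can be illustrated with a minimal reading sketch. It assumes only the general CoNLL-U Plus convention that the file declares its columns in a `# global.columns` comment; the three extra column names (`EX:NE`, `EX:NP`, `EX:IATE`), the sample sentence and the function name `read_conllu_plus` are hypothetical and are not taken from the corpora themselves.

```python
from typing import Dict, Iterable, Iterator, List

def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} token dictionaries."""
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header names every column, standard and extra alike.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, raw text, ...)
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
            sentence.append(dict(zip(columns, fields)))
    if sentence:  # file ended without a trailing blank line
        yield sentence

# Hypothetical sample: ten CoNLL-U columns plus three made-up extra columns.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC EX:NE EX:NP EX:IATE",
    "# text = Warszawa rośnie",
    "\t".join(["1", "Warszawa", "Warszawa", "PROPN", "_", "_", "2", "nsubj", "_", "_", "B-LOC", "_", "_"]),
    "\t".join(["2", "rośnie", "rosnąć", "VERB", "_", "_", "0", "root", "_", "_", "O", "_", "_"]),
    "",
])

sentences = list(read_conllu_plus(SAMPLE.splitlines()))
```

Reading the column names from the header rather than hard-coding them means the same sketch would also handle a file with a different number of extra columns.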
HerBERT Based Language Model Detects Quantifiers and Their Semantic Properties in Polish
Marcin Woliński, Bartłomiej Nitoń, Witold Kieraś, Jakub Szymanik
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The paper presents a tool for automatic markup of quantifying expressions, their semantic features, and scopes. We explore the idea of using a BERT-based neural model for the task (in this case HerBERT, a model trained specifically for Polish). The tool is trained on the recent manually annotated Corpus of Polish Quantificational Expressions (Szymanik and Kieraś, 2022). We discuss how it performs against human annotation and present results of automatic annotation of a 300 million sub-corpus of the National Corpus of Polish. Our results show that language models can effectively recognise the semantic category of quantification as well as identify key semantic properties of quantifiers, such as monotonicity. Furthermore, the algorithm we have developed can be used for building semantically annotated quantifier corpora for other languages.
2020
New Developments in the Polish Parliamentary Corpus
Maciej Ogrodniczuk, Bartłomiej Nitoń
Proceedings of the Second ParlaCLARIN Workshop
This short paper presents the current (as of February 2020) state of preparation of the Polish Parliamentary Corpus (PPC), an extensive collection of transcripts of Polish parliamentary proceedings dating from 1919 to the present. The most evident developments compared to the 2018 version are harmonization of metadata, standardization of document identifiers, uploading the contents of all documents and metadata to the database (to enable easier modification, maintenance and future development of the corpus), linking utterances to the political ontology, linking corpus texts to source data, and processing historical documents.
The MARCELL Legislative Corpus
Tamás Váradi, Svetla Koeva, Martin Yamalov, Marko Tadić, Bálint Sass, Bartłomiej Nitoń, Maciej Ogrodniczuk, Piotr Pęzik, Verginica Barbu Mititelu, Radu Ion, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, Andraz Repar, Matjaž Rihtar, Janez Brank
Proceedings of the Twelfth Language Resources and Evaluation Conference
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
2018
Deep Neural Networks for Coreference Resolution for Polish
Bartłomiej Nitoń, Paweł Morawiecki, Maciej Ogrodniczuk
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Improving Polish Mention Detection with Valency Dictionary
Maciej Ogrodniczuk, Bartłomiej Nitoń
Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017)
This paper presents the results of an experiment integrating information from a valency dictionary of Polish into a mention detection system. Two types of information are acquired: positions of syntactic schemata for nominal and verbal constructs, and secondary prepositions present in the schemata. The syntactic schemata are used to prevent (for verbal realizations) or encourage (for nominal groups) constructing mentions from phrases filling multiple schema positions, while the secondary prepositions are used to filter out artificial mentions created from their nominal components. Mention detection is evaluated against the manual annotation of the Polish Coreference Corpus in two settings: taking into account only mention heads or exact borders.
2016
Accessing and Elaborating Walenty - a Valence Dictionary of Polish - via Internet Browser
Bartłomiej Nitoń, Tomasz Bartosiak, Elżbieta Hajnicz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This article presents Walenty, a new valence dictionary of Polish predicates, concentrating on its creation process and access via an Internet browser. The dictionary contains two layers, syntactic and semantic. The syntactic layer describes the syntactic and morphosyntactic constraints that predicates put on their dependants. The semantic layer shows how predicates and their arguments are involved in the situation described in an utterance. These two layers are connected, representing how semantic arguments can be realised on the surface. Walenty also contains a powerful phraseological (idiomatic) component. Walenty has been created, and can be accessed remotely, with a dedicated tool called Slowal. In this article, we focus on the most important functionalities of this system. First, we show how to access the dictionary and how the built-in filtering system (covering both syntactic and semantic phenomena) works. Then we describe the process of creating the dictionary with the Slowal tool, which both supports and controls the work of lexicographers.
2014
Measuring Readability of Polish Texts: Baseline Experiments
Bartosz Broda, Bartłomiej Nitoń, Włodzimierz Gruszczyński, Maciej Ogrodniczuk
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Measuring the readability of a text is the first sensible step towards its simplification. In this paper we present an overview of the most common approaches to automatic measuring of readability. Of those described, we implemented and evaluated the Gunning FOG index and the Flesch-based Pisarek method. We also present two other approaches. The first one is based on measuring the distributional lexical similarity of a target text and comparing it to reference texts. In the second one, we propose a novel method for automating the Taylor test, which, in its base form, requires performing a large number of surveys. The automation of the Taylor test is performed using a technique called statistical language modelling. We have developed a free web-based system and constructed plugins for the most common text editors, namely Microsoft Word and OpenOffice.org. The inner workings of the system are described in detail. Finally, extensive evaluations are performed for Polish, a Slavic, highly inflected language. We show that Pisarek's method is highly correlated with the Gunning FOG index, even if different in form, and that both the similarity-based approach and the automated Taylor test achieve high accuracy. The merits of using either of them are discussed.
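As a rough illustration of one of the measures named in the abstract, the standard Gunning FOG formula, 0.4 × (average sentence length + percentage of hard words), can be sketched as follows. This is not the authors' implementation: the vowel-run syllable counter is a crude heuristic, the sentence splitter is naive, and the default hard-word threshold of three syllables is the usual English convention, exposed as a parameter because the threshold suited to Polish is an assumption left to the reader.

```python
import re

# Polish vowel letters; each maximal run is treated as one syllable (heuristic).
VOWELS = "aąeęioóuy"

def count_syllables(word: str) -> int:
    """Approximate the syllable count by counting runs of vowel letters."""
    return max(1, len(re.findall(f"[{VOWELS}]+", word.lower())))

def gunning_fog(text: str, hard_syllables: int = 3) -> float:
    """Return 0.4 * (words per sentence + percentage of hard words)."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters (Unicode-aware, no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    hard = sum(1 for w in words if count_syllables(w) >= hard_syllables)
    return 0.4 * (len(words) / len(sentences) + 100.0 * hard / len(words))

easy_score = gunning_fog("Ala ma kota. To jest dom.")
hard_score = gunning_fog("Lokomotywa stoi.")
```

Short sentences with short words score low, while a single long word in a two-word sentence pushes the score up sharply, which is why the index is normally applied to longer passages.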