Maciej Piasecki


2024

Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
Albert Sawczyn | Katsiaryna Viarenich | Konrad Wojtasik | Aleksandra Domogała | Marcin Oleksy | Maciej Piasecki | Tomasz Kajdanowicz
Findings of the Association for Computational Linguistics: ACL 2024

Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, utilizing structured knowledge graphs (KG), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in human labor, and modern assisting tools like Large Language Models (LLM) are not utilized to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, and novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models.

BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
Konrad Wojtasik | Kacper Wołowiec | Vadim Shishkin | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR), garnering considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing the research in this NLP area. In this work, inspired by mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish, and we introduced the BEIR-PL benchmark – a new benchmark which comprises 13 datasets, facilitating further development, training and evaluation of modern Polish language models for IR tasks. We executed an evaluation and comparison of numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. BEIR-PL is included in the MTEB Benchmark and is also available, together with trained models, at https://huggingface.co/clarin-knext.

2023

Wordnet for Definition Augmentation with Encoder-Decoder Architecture
Konrad Wojtasik | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 12th Global Wordnet Conference

Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be relatively easily applied in other domains like insertion, deletion or substitution, mostly result in changing the sentence meaning significantly and obtaining an incorrect example. Wordnets are potentially a perfect source of rich and high quality data that when integrated with the powerful capacity of generative models can help to solve this complex task. In this work, we use plWordNet, which is a wordnet of the Polish language, to explore the capability of encoder-decoder architectures in data augmentation of sense glosses. We discuss the limitations of generative methods and perform qualitative review of generated data samples.

Word Sense Disambiguation Based on Iterative Activation Spreading with Contextual Embeddings for Sense Matching
Arkadiusz Janz | Maciej Piasecki
Proceedings of the 12th Global Wordnet Conference

Many knowledge-based solutions were proposed to solve the Word Sense Disambiguation (WSD) problem with limited annotated resources. Such WSD algorithms are able to cover very large sense repositories, but are still outperformed by supervised ones on benchmark data. In this paper, we start with an analysis identifying key properties and issues in the application of spreading activation algorithms in knowledge-based WSD, e.g. the influence of the network local structures, interaction with context information and sense frequency. Taking our observations as a point of departure, we introduce a novel solution with new context-to-sense matching using BERT embeddings, an iterative parallel spreading activation function and selective sense alignment using contextual BERT embeddings. The proposed solution obtains performance beyond the state-of-the-art for the contemporary knowledge-based WSD approaches for both English and Polish data.

Lexicalised and non-lexicalized multi-word expressions in WordNet: a cross-encoder approach
Marek Maziarz | Łukasz Grabowski | Tadeusz Piotrowski | Ewa Rudnicka | Maciej Piasecki
Proceedings of the 12th Global Wordnet Conference

Focusing on the recognition of multi-word expressions (MWEs), we address the problem of recording MWEs in WordNet. In fact, not all MWEs recorded in that lexical database can without doubt be considered lexicalised (e.g. elements of wordnet taxonomy, quantifier phrases, certain collocations). In this paper, we use a cross-encoder approach to improve our earlier method of distinguishing between lexicalised and non-lexicalised MWEs found in WordNet using custom-designed rule-based and statistical approaches. We achieve an F1-measure for the class of lexicalised word combinations close to 80%, easily beating two baselines (random and a majority class one). The language model also proves to be better than a feature-based logistic regression model.

Wordnet-oriented recognition of derivational relations
Wiktor Walentynowicz | Maciej Piasecki
Proceedings of the 12th Global Wordnet Conference

Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and predict senses of derivates (derived words). In this work, we analyse different methods of representing derivational forms obtained from WordNet – from quantitative vectors to contextual learned embedding methods – and compare ways of classifying the derivational relations occurring between them. Our research focuses on the explainability of the obtained representations and results. The data source for our research is plWordNet, which is the wordnet of the Polish language and includes a rich set of derivation examples.

2022

Deep Neural Representations for Multiword Expressions Detection
Kamil Kanclerz | Maciej Piasecki
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Effective methods for multiword expressions detection are important for many technologies related to Natural Language Processing. Most contemporary methods are based on the sequence labeling scheme applied to an annotated corpus, while traditional methods use statistical measures. In our approach, we want to integrate the concepts of those two approaches. We present a novel weakly supervised multiword expressions extraction method which focuses on their behaviour in various contexts. Our method uses a lexicon of English multiword lexical units acquired from The Oxford Dictionary of English as a reference knowledge base and leverages neural language modelling with deep learning architectures. In our approach, we do not need a corpus annotated specifically for the task. The only required components are: a lexicon of multiword units, a large corpus, and a general contextual embeddings model. We propose a method for building a silver dataset by spotting multiword expression occurrences and acquiring statistical collocations as negative samples. Sample representation has been inspired by representations used in Natural Language Inference and relation recognition. Very good results (F1=0.8) were obtained with CNN network applied to individual occurrences followed by weighted voting used to combine results from the whole corpus. The proposed method can be quite easily applied to other languages.

2021

A (Non)-Perfect Match: Mapping plWordNet onto PrincetonWordNet
Ewa Rudnicka | Wojciech Witkowski | Maciej Piasecki
Proceedings of the 11th Global Wordnet Conference

The paper reports on the methodology and final results of a large-scale synset mapping between plWordNet and Princeton WordNet. Dedicated manual and semi-automatic mapping procedures as well as interlingual relation types for nouns, verbs, adjectives and adverbs are described. The statistics of all types of interlingual relations are also provided.

Neural Language Models vs Wordnet-based Semantically Enriched Representation in CST Relation Recognition
Arkadiusz Janz | Maciej Piasecki | Piotr Wątorski
Proceedings of the 11th Global Wordnet Conference

Neural language models, including transformer-based models, that are pre-trained on very large corpora became a common way to represent text in various tasks, including recognition of textual semantic relations, e.g. Cross-document Structure Theory. Pre-trained models are usually fine tuned to downstream tasks and the obtained vectors are used as an input for deep neural classifiers. No linguistic knowledge obtained from resources and tools is utilised. In this paper we compare such universal approaches with a combination of rich graph-based linguistically motivated sentence representation and a typical neural network classifier applied to a task of recognition of CST relation in Polish. The representation describes selected levels of the sentence structure including description of lexical meanings on the basis of the wordnet (plWordNet) synsets and connected SUMO concepts. The obtained results show that in the case of difficult relations and medium size training corpus semantically enriched text representation leads to significantly better results.

2020

Brand-Product Relation Extraction Using Heterogeneous Vector Space Representations
Arkadiusz Janz | Łukasz Kopociński | Maciej Piasecki | Agnieszka Pluwak
Proceedings of the Twelfth Language Resources and Evaluation Conference

Relation Extraction is a fundamental NLP task. In this paper we investigate the impact of underlying text representation on the performance of neural classification models in the task of Brand-Product relation extraction. We also present the methodology of preparing annotated textual corpora for this task and we provide valuable insight into the properties of Brand-Product relations existing in textual corpora. The problem is approached from the practical angle of applying Relation Extraction to facilitate commercial Internet monitoring.

2019

plWordNet 4.1 - a Linguistically Motivated, Corpus-based Bilingual Resource
Agnieszka Dziob | Maciej Piasecki | Ewa Rudnicka
Proceedings of the 10th Global Wordnet Conference

The paper presents the latest release of the Polish WordNet, namely plWordNet 4.1. The most significant developments since 3.0 version include new relations for nouns and verbs, mapping semantic role-relations from the valency lexicon Walenty onto the plWordNet structure and sense-level inter-lingual mapping. Several statistics are presented in order to illustrate the development and contemporary state of the wordnet.

A Comparison of Sense-level Sentiment Scores
Francis Bond | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 10th Global Wordnet Conference

In this paper, we compare a variety of sense-tagged sentiment resources, including SentiWordNet, ML-Senticon, plWordNet emo and the NTU Multilingual Corpus. The goal is to investigate the quality of the resources and see how well the sentiment polarity annotation maps across languages.

Sparse Coding in Authorship Attribution for Polish Tweets
Piotr Grzybowski | Ewa Juralewicz | Maciej Piasecki
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The study explores application of a simple Convolutional Neural Network for the problem of authorship attribution of tweets written in Polish. In our solution we use two-step compression of tweets using Byte Pair Encoding algorithm and vectorisation as an input to the distributional model generated for the large corpus of Polish tweets by word2vec algorithm. Our method achieves results comparable to the state-of-the-art approaches for the similar task on English tweets and expresses a very good performance in the classification of Polish tweets. We tested the proposed method in relation to the number of authors and tweets per author. We also juxtaposed results for authors with different topic backgrounds against each other.

Word Sense Disambiguation based on Constrained Random Walks in Linked Semantic Networks
Arkadiusz Janz | Maciej Piasecki
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Word Sense Disambiguation remains a challenging NLP task. Due to the lack of annotated training data, especially for rare senses, the supervised approaches are usually designed for specific subdomains limited to a narrow subset of identified senses. Recent advances in this area have shown that knowledge-based approaches are more scalable and obtain more promising results in all-words WSD scenarios. In this work we present a faster WSD algorithm based on the Monte Carlo approximation of sense probabilities given a context using constrained random walks over linked semantic networks. We show that the local semantic relatedness is mostly sufficient to successfully identify correct senses when an extensive knowledge base and a proper weighting scheme are used. The proposed methods are evaluated on English (SenseEval, SemEval) and Polish (Składnica, KPWr) datasets.

Tagger for Polish Computer Mediated Communication Texts
Wiktor Walentynowicz | Maciej Piasecki | Marcin Oleksy
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper we present a morpho-syntactic tagger dedicated to Computer-mediated Communication texts in Polish. Its construction is based on an expanded RNN-based neural network adapted to work on noisy texts. Among several techniques, the tagger utilises fastText embedding vectors, sequential character embedding vectors, and Brown clustering for the coarse-grained representation of sentence structures. In addition, a set of manually written rules was proposed for post-processing. The system was trained to disambiguate descriptions of words in relation to Parts of Speech tags together with the full morphological information in terms of values for the different grammatical categories. We also present an evaluation of several model variants on the gold standard annotated CMC data, a comparison to the state-of-the-art taggers for Polish, and an error analysis. The proposed tagger shows significantly better results in this domain and demonstrates the viability of adaptation.

2018

Towards Mapping Thesauri onto plWordNet
Marek Maziarz | Maciej Piasecki
Proceedings of the 9th Global Wordnet Conference

plWordNet, the wordnet of Polish, has become a very comprehensive description of the Polish lexical system. This paper presents a plan of its semi-automated integration with thesauri, terminological databases and ontologies, as a further necessary step in its development. This will improve linking of plWordNet into Linked Open Data, and facilitate applications in, e.g., WSD, keyword extraction or automated metadata generation. We present an overview of resources relevant to Polish and a plan for their linking to plWordNet.

Implementation of the Verb Model in plWordNet 4.0
Agnieszka Dziob | Maciej Piasecki
Proceedings of the 9th Global Wordnet Conference

The paper presents an expansion of the verb model for plWordNet – the wordnet of Polish. A modified system of constitutive features (register, aspect and verb classes), synset and lexical relations is presented. Special attention is given to the proposed new relations and changes in the verb classification. We also discuss the results of its verification by application to the description of a relatively large sample of Polish verbs. The model introduces a new class of relations, namely non-constitutive synset relations that are shared among lexical units, but describe rather than define synsets. The proposed model is compared to the entailment relations in other wordnets, and to the description of verbs based on valency frames.

Towards Emotive Annotation in plWordNet 4.0
Monika Zaśko-Zielińska | Maciej Piasecki
Proceedings of the 9th Global Wordnet Conference

The paper presents an approach to building a very large emotive lexicon for Polish based on plWordNet. An expanded annotation model is discussed, in which lexical units (word senses) are annotated with basic emotions, fundamental human values and sentiment polarisation. The annotation process is performed manually in the 2+1 scheme by pairs of linguists and psychologists. Guidelines referring to the usage in corpora, substitution tests as well as linguistic properties of lexical units (e.g. derivational associations) are discussed. Application of the model in a substantial extension of the emotive annotation of plWordNet is presented. The achieved high inter-annotator agreement shows that with a relatively small workload a promising emotive resource can be created.

WordnetLoom – a Multilingual Wordnet Editing System Focused on Graph-based Presentation
Tomasz Naskręt | Agnieszka Dziob | Maciej Piasecki | Chakaveh Saedi | António Branco
Proceedings of the 9th Global Wordnet Conference

The paper presents a new re-built and expanded, version 2.0 of WordnetLoom – an open wordnet editor. It facilitates work on a multilingual system of wordnets, is based on efficient software architecture of thin client, and offers more flexibility in enriching wordnet representation. This new version is built on the experience collected during the use of the previous one for more than 10 years of plWordNet development. We discuss its extensions motivated by the collected experience. A special focus is given to the development of a variant for the needs of MultiWordnet of Portuguese, which is based on a very different wordnet development model.

Lexical Perspective on Wordnet to Wordnet Mapping
Ewa Rudnicka | Francis Bond | Łukasz Grabowski | Maciej Piasecki | Tadeusz Piotrowski
Proceedings of the 9th Global Wordnet Conference

The paper presents a feature-based model of equivalence targeted at (manual) sense linking between Princeton WordNet and plWordNet. The model incorporates insights from lexicographic and translation theories on bilingual equivalence and draws on the results of earlier synset-level mapping of nouns between Princeton WordNet and plWordNet. It takes into account all basic aspects of language such as form, meaning and function and supplements them with (parallel) corpus frequency and translatability. Three types of equivalence are distinguished, namely strong, regular and weak depending on the conformity with the proposed features. The presented solutions are language-neutral and they can be easily applied to language pairs other than Polish and English. Sense-level mapping is a more fine-grained mapping than the existing synset mappings and is thus of great potential to human and machine translation.

Wordnet-based Evaluation of Large Distributional Models for Polish
Maciej Piasecki | Gabriela Czachor | Arkadiusz Janz | Dominik Kaszewski | Paweł Kędzia
Proceedings of the 9th Global Wordnet Conference

The paper presents the construction of large-scale test datasets for word embeddings on the basis of a very large wordnet. They were then applied to the evaluation of word embedding models and used to assess and compare the usefulness of different word embeddings extracted from a very large corpus of Polish. We also analysed and compared several publicly available models described in the literature. In addition, several large word embedding models built on the basis of a very large Polish corpus are presented.

Recognition of Hyponymy and Meronymy Relations in Word Embeddings for Polish
Gabriela Czachor | Maciej Piasecki | Arkadiusz Janz
Proceedings of the 9th Global Wordnet Conference

Word embeddings have been used for the extraction of the hyponymy relation in several approaches, but it was also recently shown that they should not, in fact, work. In our work we verified both claims using a very large wordnet of Polish as a gold standard for lexico-semantic relations and word embeddings extracted from a very large corpus of Polish. We showed that a hyponymy extraction method based on linear regression classifiers trained on clusters of vectors can be successfully applied on a large scale. We also presented a possible explanation for contradictory findings in the literature. Moreover, in order to show the feasibility of the method we extended it to the recognition of meronymy.

Context-sensitive Sentiment Propagation in WordNet
Jan Kocoń | Arkadiusz Janz | Maciej Piasecki
Proceedings of the 9th Global Wordnet Conference

In this paper we present a comprehensive overview of recent methods of the sentiment propagation in a wordnet. Next, we propose a fully automated method called Classifier-based Polarity Propagation, which utilises a very rich set of features, where most of them are based on wordnet relation types, multi-level bag-of-synsets and bag-of-polarities. We have evaluated our solution using manually annotated part of plWordNet 3.1 emo, which contains more than 83k manual sentiment annotations, covering more than 41k synsets. We have demonstrated that in comparison to existing rule-based methods using a specific narrow set of semantic relations our method has achieved statistically significant and better results starting with the same seed synsets.

Classifier-based Polarity Propagation in a WordNet
Jan Kocoń | Arkadiusz Janz | Maciej Piasecki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Graph-Based Approach to Recognizing CST Relations in Polish Texts
Paweł Kędzia | Maciej Piasecki | Arkadiusz Janz
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

This paper presents a supervised approach to the recognition of Cross-document Structure Theory (CST) relations in Polish texts. In the proposed approach, a graph-based representation is constructed for sentences. Graphs are built on the basis of lexicalised syntactic-semantic relations extracted from text. Similarity between sentences is calculated from the graphs, and the similarity values are input to classifiers trained by Logistic Model Tree. Several different graph configurations, as well as graph similarity methods, were analysed for this task. The approach was evaluated on a large open corpus annotated manually with 17 types of selected CST relations. The configuration of experiments was similar to those known from SEMEVAL and we obtained very promising results.

Recognition of Genuine Polish Suicide Notes
Maciej Piasecki | Ksenia Młynarczyk | Jan Kocoń
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this article we present the results of recent research on the recognition of genuine Polish suicide notes (SNs). We provide a useful method to distinguish between SNs and other types of discourse, including counterfeited SNs. The method uses a wide range of word-based and semantic features and it was evaluated using the Polish Corpus of Suicide Notes, which contains 1244 genuine SNs, expanded with a manually prepared set of 334 counterfeited SNs and 2200 letter-like texts from the Internet. We utilized the algorithm to create class-related sense dictionaries to improve the result of SN classification. The obtained results show that there are fundamental differences between genuine SNs and counterfeited SNs. The applied method of sense dictionary construction appeared to be the best way of improving the model.

2016

plWordNet 3.0 – a Comprehensive Lexical-Semantic Resource
Marek Maziarz | Maciej Piasecki | Ewa Rudnicka | Stan Szpakowicz | Paweł Kędzia
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We have released plWordNet 3.0, a very large wordnet for Polish. In addition to what is expected in wordnets – richly interrelated synsets – it contains sentiment and emotion annotations, a large set of multi-word expressions, and a mapping onto WordNet 3.1. Part of the release is enWordNet 1.0, a substantially enlarged copy of WordNet 3.1, with material added to allow for a more complete mapping. The paper discusses the design principles of plWordNet, its content, its statistical portrait, a comparison with similar resources, and a partial list of applications.

plWordNet in Word Sense Disambiguation task
Maciej Piasecki | Paweł Kędzia | Marlena Orlińska
Proceedings of the 8th Global WordNet Conference (GWC)

The paper explores the application of plWordNet, a very large wordnet of Polish, in weakly supervised Word Sense Disambiguation (WSD). Because plWordNet provides only partial descriptions by glosses and usage examples, and does not include sense-disambiguated glosses, PageRank-based WSD methods perform slightly worse than for English. However, we show that the use of weights for the relation types and the order in which lexical units have been added for sense re-ranking can significantly improve WSD precision. The evaluation was done on two Polish corpora (KPWr and Składnica) including manual WSD. We discuss the fundamental difference in the construction of both corpora and very different test results.

plWordNet 3.0 – Almost There
Maciej Piasecki | Stan Szpakowicz | Marek Maziarz | Ewa Rudnicka
Proceedings of the 8th Global WordNet Conference (GWC)

It took us nearly ten years to get from no wordnet for Polish to the largest wordnet ever built. We started small but quickly learned to dream big. Now we are about to release plWordNet 3.0-emo – complete with sentiment and emotions annotated – and a domestic version of Princeton WordNet, larger than WordNet 3.1 by nearly ten thousand newly added words. The paper retraces the road we travelled and talks a little about the future.

2015

A Procedural Definition of Multi-word Lexical Units
Marek Maziarz | Stan Szpakowicz | Maciej Piasecki
Proceedings of the International Conference Recent Advances in Natural Language Processing

Extraction of the Multi-word Lexical Units in the Perspective of the Wordnet Expansion
Maciej Piasecki | Michał Wendelberger | Marek Maziarz
Proceedings of the International Conference Recent Advances in Natural Language Processing

A Large Wordnet-based Sentiment Lexicon for Polish
Monika Zaśko-Zielińska | Maciej Piasecki | Stan Szpakowicz
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

Ruled-based, Interlingual Motivated Mapping of plWordNet onto SUMO Ontology
Paweł Kędzia | Maciej Piasecki
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we study a rule-based approach to mapping plWordNet onto SUMO Upper Ontology on the basis of the already existing mappings: plWordNet – the Princeton WordNet – SUMO. Information acquired from the inter-lingual relations between plWordNet and Princeton WordNet and the relations between Princeton WordNet and SUMO ontology are used in the proposed rules. Several mapping rules together with the matching examples are presented. The automated mapping results were evaluated in two steps: (i) we automatically checked formal correctness of the mappings for the pairs of plWordNet synset and SUMO concept, (ii) a subset of 160 mapping examples was manually checked by two+one linguists. We analyzed types of the mapping errors and their causes. The proposed rules showed very high precision, especially when the errors in the resources are taken into account. Because both wordnets were constructed independently, the obtained rules are not trivial and they reveal the differences between both wordnets and both languages.

plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources
Marek Maziarz | Maciej Piasecki | Ewa Rudnicka | Stan Szpakowicz
Proceedings of the Seventh Global Wordnet Conference

Registers in the System of Semantic Relations in plWordNet
Marek Maziarz | Maciej Piasecki | Ewa Rudnicka | Stan Szpakowicz
Proceedings of the Seventh Global Wordnet Conference

2013

Evaluation of baseline information retrieval for Polish open-domain Question Answering system
Michał Marcińczuk | Adam Radziszewski | Maciej Piasecki | Dominik Piasecki | Marcin Ptak
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Beyond the Transfer-and-Merge Wordnet Construction: plWordNet and a Comparison with WordNet
Marek Maziarz | Maciej Piasecki | Ewa Rudnicka | Stan Szpakowicz
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Information Spreading in Expanding Wordnet Hypernymy Structure
Maciej Piasecki | Radosław Ramocki | Michał Kaliński
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

A Strategy of Mapping Polish WordNet onto Princeton WordNet
Ewa Rudnicka | Marek Maziarz | Maciej Piasecki | Stan Szpakowicz
Proceedings of COLING 2012: Posters

Tools for plWordNet Development. Presentation and Perspectives
Bartosz Broda | Marek Maziarz | Maciej Piasecki
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Building a wordnet is a serious undertaking. Fortunately, Language Technology (LT) can improve the process of wordnet construction both in terms of quality and cost. In this paper we present LT tools used during the construction of plWordNet and their influence on the lexicographer's work-flow. LT is employed in plWordNet development on every possible step: from data gathering through data analysis to data presentation. Nevertheless, every decision requires input from the lexicographer, but the quality of supporting tools is an important factor. Thus a limited evaluation of usefulness of employed tools is carried out on the basis of questionnaires.

Recognition of Polish Derivational Relations Based on Supervised Learning Scheme
Maciej Piasecki | Radoslaw Ramocki | Marek Maziarz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents the construction of Derywator, a language tool for the recognition of Polish derivational relations. It was built on the basis of machine learning in a way following the bootstrapping approach: a limited set of derivational pairs described manually by linguists in plWordNet is used to train Derywator. The tool is intended to be applied in semi-automated expansion of plWordNet with new instances of derivational relations. The training process is based on the construction of two transducers working in opposite directions: one for prefixes and one for suffixes. Internal stem alternations are recognised, recorded in the form of mapping sequences and stored together with the transducers. Raw results produced by Derywator next undergo corpus-based and morphological filtering. A set of derivational relations defined in plWordNet is presented. Results of tests for different derivational relations are discussed. The problem of the necessary corpus-based semantic filtering is analysed. The presented tool depends only to a very small extent on hand-crafted knowledge for a particular language: only a table of possible alternations and the morphological filtering rules must be exchanged, and this should not take longer than a couple of working days.

Constraint Based Description of Polish Multiword Expressions
Roman Kurc | Maciej Piasecki | Bartosz Broda
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an approach to the description of Polish Multi-word Expressions (MWEs) which is based on expressions in the WCCL language of morpho-syntactic constraints instead of grammar rules or transducers. For each MWE, its basic morphological form and the base forms of its constituents are specified, and each MWE is also assigned to a class on the basis of its syntactic structure. For each class a WCCL constraint is defined which is parametrised by string variables referring to MWE constituent base forms or inflected forms. The constraint specifies a minimal set of conditions that must be fulfilled in order to recognise an occurrence of the given MWE in text with high accuracy. Our formalism is focused on the efficient description of large MWE lexicons for use in text processing. The formalism allows for the relatively easy representation of flexible word order and discontinuous constructions. Moreover, the MWE grammatical structure does not need to be fully specified: only those aspects of a particular MWE structure that help to reach the target recognition accuracy need to be selected. On the basis of a set of simple heuristics, a WCCL-based representation of MWEs can be automatically generated from a list of MWE base forms. The proposed representation was applied on a practical scale to the description of a large set of Polish MWEs included in plWordNet.
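
The idea of a class-level constraint parametrised by constituent base forms, with optional flexible word order and discontinuity, can be illustrated with a minimal Python sketch. This is not WCCL syntax and not the authors' implementation: the function `find_mwe`, its parameters `max_gap` and `fixed_order`, and the example sentence are all assumptions made for illustration, and the matching is a simple greedy search over lemmas only.

```python
from itertools import permutations

def find_mwe(lemmas, constituents, max_gap=0, fixed_order=True):
    """Return token positions of one occurrence of the MWE, or None.

    lemmas       -- base forms of the tokens in a sentence
    constituents -- base forms of the MWE constituents (the constraint's parameters)
    max_gap      -- hypothetical parameter: tokens allowed between constituents
    fixed_order  -- hypothetical parameter: False permits any constituent order
    """
    if fixed_order:
        orders = [tuple(constituents)]
    else:
        orders = sorted(set(permutations(constituents)))
    for order in orders:
        for start, lemma in enumerate(lemmas):
            if lemma != order[0]:
                continue
            positions = [start]
            for wanted in order[1:]:
                prev = positions[-1]
                # Look at the next token and up to max_gap tokens beyond it.
                window = range(prev + 1, min(len(lemmas), prev + 2 + max_gap))
                nxt = next((i for i in window if lemmas[i] == wanted), None)
                if nxt is None:
                    break  # this starting point cannot be completed
                positions.append(nxt)
            else:
                return positions  # every constituent was matched
    return None

# Illustrative lemmatised sentence containing "brać udział" with one intervening token.
sentence = ["on", "brać", "wczoraj", "udział", "w", "spotkanie"]
contiguous = find_mwe(sentence, ["brać", "udział"])
discontinuous = find_mwe(sentence, ["brać", "udział"], max_gap=1)
```

In this sketch a "class" corresponds to one choice of `max_gap` and `fixed_order`, so a whole lexicon of MWEs sharing a syntactic structure reuses the same settings and differs only in the `constituents` list.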

2010

Resource and Service Centres as the Backbone for a Sustainable Service Infrastructure
Peter Wittenburg | Nuria Bel | Lars Borin | Gerhard Budin | Nicoletta Calzolari | Eva Hajicova | Kimmo Koskenniemi | Lothar Lemnitzer | Bente Maegaard | Maciej Piasecki | Jean-Marie Pierrel | Stelios Piperidis | Inguna Skadina | Dan Tufis | Remco van Veenendaal | Tamas Váradi | Martin Wynne
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Currently, research infrastructures are being designed and established in many disciplines since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on knowledge and work from many projects carried out in recent years and aims to build stable and robust services that can be used by researchers. Service centres that have the potential to be persistent and that adhere to the criteria established by CLARIN will play an important role here. In the last year of the so-called preparatory phase these centres are developing four use cases that can demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criterion of being cross-national.

Building a Node of the Accessible Language Technology Infrastructure
Bartosz Broda | Michał Marcińczuk | Maciej Piasecki
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

A limited prototype of the CLARIN Language Technology Infrastructure (LTI) node is presented. The node prototype provides several types of web services for Polish. The functionality encompasses morpho-syntactic processing, shallow semantic processing of corpora on the basis of the SuperMatrix system and plWordNet browsing. We take the prototype as the starting point for a discussion of the requirements that must be fulfilled by the LTI. Some possible solutions are proposed for less frequently discussed problems, e.g. streaming processing of language data on the remote processing node. We experimentally investigate how to tackle several of the many requirements discussed. Aspects such as processing large volumes of data, an asynchronous mode of processing and scalability of the architecture to a large number of users received special attention in the constructed prototype of the Web Service for morpho-syntactic processing of Polish called TaKIPI-WS (http://plwordnet.pwr.wroc.pl/clarin/ws/takipi/). TaKIPI-WS is a distributed system with an asynchronous model of request handling and multi-agent-based processing. It consists of three layers: WS Interface, Database and Daemons. The role of the Database is to store and exchange data between the Interface and the Daemons. The Daemons (i.e. taggers) are responsible for executing the requests queued in the database. Results of the performance tests are also presented in the paper.
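
The division of work between the three layers can be sketched in a few lines of Python. This is only an illustration of the queue-based, asynchronous pattern the abstract describes, not the TaKIPI-WS code: the table schema, the function names (`make_database`, `submit`, `fetch_result`, `daemon_step`) and the placeholder `toy_tagger` are all assumptions, and an in-memory SQLite database stands in for the real Database layer.

```python
import sqlite3

def make_database():
    """Database layer: stores requests and results shared by the other layers."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE requests ("
        "id INTEGER PRIMARY KEY, text TEXT, status TEXT, result TEXT)"
    )
    return db

def submit(db, text):
    """Interface layer: queue a request and return a ticket without waiting."""
    cur = db.execute(
        "INSERT INTO requests (text, status) VALUES (?, 'queued')", (text,)
    )
    db.commit()
    return cur.lastrowid

def fetch_result(db, ticket):
    """Interface layer: poll for a result; None means it is not ready yet."""
    row = db.execute(
        "SELECT status, result FROM requests WHERE id = ?", (ticket,)
    ).fetchone()
    if row is None:
        raise KeyError(ticket)
    status, result = row
    return result if status == "done" else None

def daemon_step(db, tagger):
    """Daemon layer: process the oldest queued request; False if queue is empty."""
    row = db.execute(
        "SELECT id, text FROM requests WHERE status = 'queued' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    ticket, text = row
    db.execute(
        "UPDATE requests SET status = 'done', result = ? WHERE id = ?",
        (tagger(text), ticket),
    )
    db.commit()
    return True

def toy_tagger(text):
    """Placeholder for a real tagger: attaches a dummy tag to every token."""
    return " ".join(f"{token}/tok" for token in text.split())

# The client submits, the daemon works independently, and the client polls later.
db = make_database()
ticket = submit(db, "Ala ma kota")
before = fetch_result(db, ticket)      # request still queued
daemon_step(db, toy_tagger)
after = fetch_result(db, ticket)       # result now available
```

Because the Interface and the Daemons communicate only through the Database, more daemon processes could in principle be added to serve more users without changing the interface functions; a real deployment would also need locking so that two daemons do not pick up the same request.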

2008

Corpus-based Semantic Relatedness for the Construction of Polish WordNet
Bartosz Broda | Magdalena Derwojedowa | Maciej Piasecki | Stanislaw Szpakowicz
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The construction of a wordnet, a labour-intensive enterprise, can be significantly assisted by automatic grouping of lexical material and discovery of lexical semantic relations. The objective is to ensure high quality of automatically acquired results before they are presented for lexicographers’ approval. We discuss a software tool that suggests synset members using a measure of semantic relatedness with a given verb or adjective; this extends previous work on nominal synsets in Polish WordNet. Syntactically-motivated constraints are deployed on a large morphologically annotated corpus of Polish. Evaluation has been performed via the WordNet-Based Similarity Test and additionally supported by human raters. A lexicographer also manually assessed a suitable sample of suggestions. The results compare favourably with other known methods of acquiring semantic relations.