Filip Gralinski

Also published as: Filip Graliński


2024

pdf bib
Two Approaches to Diachronic Normalization of Polish Texts
Kacper Dudzic | Filip Gralinski | Krzysztof Jassem | Marek Kubis | Piotr Wierzchon
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture. The training and evaluation data prepared for the task are discussed in detail, along with experiments conducted to compare the proposed normalization solutions. A quantitative and qualitative analysis is made. It is shown that at the current stage of inquiry into the problem, the rule-based solution outperforms the neural one on 3 out of 4 variants of the prepared dataset, although in practice both approaches have distinct advantages and disadvantages.

pdf bib
POLygraph: Polish Fake News Dataset
Daniel Dzienisiewicz | Filip Graliński | Piotr Jabłoński | Marek Kubis | Paweł Skórzewski | Piotr Wierzchon
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the “fake-or-not” dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the “fake-they-say” dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project.

2022

pdf bib
Challenging America: Modeling language in longer time scales
Jakub Pokrywka | Filip Graliński | Krzysztof Jassem | Karol Kaczmarek | Krzysztof Jurkiewicz | Piotr Wierzchon
Findings of the Association for Computational Linguistics: NAACL 2022

The aim of the paper is to apply, for historical texts, the methodology used commonly to solve various NLP tasks defined for contemporary data, i.e. pre-train and fine-tune large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin, and paired with their textual contents retrieved by an OCR tool. Three, publicly available, ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch from the historical texts. We also discuss the issues of discrimination and hate-speech present in the historical American texts.

2020

pdf bib
ApplicaAI at SemEval-2020 Task 11: On RoBERTa-CRF, Span CLS and Whether Self-Training Helps Them
Dawid Jurkiewicz | Łukasz Borchmann | Izabela Kosmala | Filip Graliński
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents the winning system for the propaganda Technique Classification (TC) task and the second-placed system for the propaganda Span Identification (SI) task. The purpose of TC task was to identify an applied propaganda technique given propaganda text fragment. The goal of SI task was to find specific text fragments which contain at least one propaganda technique. Both of the developed solutions used semi-supervised learning technique of self-training. Interestingly, although CRF is barely used with transformer-based language models, the SI task was approached with RoBERTa-CRF architecture. An ensemble of RoBERTa-based models was proposed for the TC task, with one of them making use of Span CLS layers we introduce in the present paper. In addition to describing the submitted systems, an impact of architectural decisions and training schemes is investigated along with remarks regarding training models of the same or better quality with lower computational budget. Finally, the results of error analysis are presented.

pdf bib
From Dataset Recycling to Multi-Property Extraction and Beyond
Tomasz Dwojak | Michał Pietruszka | Łukasz Borchmann | Jakub Chłędowski | Filip Graliński
Proceedings of the 24th Conference on Computational Natural Language Learning

This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled - a newly developed public dataset, and the task of multiple-property extraction. It uses the same data as WikiReading but does not inherit its predecessor’s identified disadvantages. In addition, we provide a human-annotated test set with diagnostic subsets for a detailed analysis of model performance.

pdf bib
Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines
Łukasz Borchmann | Dawid Wisniewski | Andrzej Gretkowski | Izabela Kosmala | Dawid Jurkiewicz | Łukasz Szałkiewicz | Gabriela Pałka | Karol Kaczmarek | Agnieszka Kaliska | Filip Graliński
Findings of the Association for Computational Linguistics: EMNLP 2020

We propose a new shared task of semantic retrieval from legal texts, in which a so-called contract discovery is to be performed – where legal clauses are extracted from documents, given a few examples of similar clauses from other legal acts. The task differs substantially from conventional NLI and shared tasks on legal information extraction (e.g., one has to identify text span instead of a single document, page, or paragraph). The specification of the proposed task is followed by an evaluation of multiple solutions within the unified framework proposed for this branch of methods. It is shown that state-of-the-art pretrained encoders fail to provide satisfactory results on the task proposed. In contrast, Language Model-based solutions perform better, especially when unsupervised fine-tuning is applied. Besides the ablation studies, we addressed questions regarding detection accuracy for relevant text fragments depending on the number of examples available. In addition to the dataset and reference results, LMs specialized in the legal domain were made publicly available.

2019

pdf bib
GEval: Tool for Debugging NLP Datasets and Models
Filip Graliński | Anna Wróblewska | Tomasz Stanisławek | Kamil Grabowski | Tomasz Górecki
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

This paper presents a simple but general and effective method to debug the output of machine learning (ML) supervised models, including neural networks. The algorithm looks for features that lower the evaluation metric in such a way that it cannot be ascribed to chance (as measured by their p-values). Using this method – implemented as MLEval tool – you can find: (1) anomalies in test sets, (2) issues in preprocessing, (3) problems in the ML model itself. It can give you an insight into what can be improved in the datasets and/or the model. The same method can be used to compare ML models or different versions of the same model. We present the tool, the theory behind it and use cases for text-based models of various types.

2016

pdf bib
“He Said She Said” ― a Male/Female Corpus of Polish
Filip Graliński | Łukasz Borchmann | Piotr Wierzchoń
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Gender differences in language use have long been of interest in linguistics. The task of automatic gender attribution has been considered in computational linguistics as well. Most research of this type is done using (usually English) texts with authorship metadata. In this paper, we propose a new method of male/female corpus creation based on gender-specific first-person expressions. The method was applied on CommonCrawl Web corpus for Polish (language, in which gender-revealing first-person expressions are particularly frequent) to yield a large (780M words) and varied collection of men’s and women’s texts. The whole procedure for building the corpus and filtering out unwanted texts is described in the present paper. The quality check was done on a random sample of the corpus to make sure that the majority (84%) of texts are correctly attributed, natural texts. Some preliminary (socio)linguistic insights (websites and words frequently occurring in male/female fragments) are given as well.

2011

pdf bib
How to Distinguish a Kidney Theft from a Death Car? Experiments in Clustering Urban-Legend Texts
Roman Grundkiewicz | Filip Graliński
Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition

2010

pdf bib
Computational Lexicography of Multi-Word Units. How Efficient Can It Be?
Filip Graliński | Agata Savary | Monika Czerepowicka | Filip Makowiecki
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

2009

pdf bib
An Environment for Named Entity Recognition and Translation
Filip Graliński | Krzysztof Jassem | Michał Marcińczuk
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

2004

pdf bib
Some Notes on Generative Capacity of Dependency Grammar
Tomasz Obrebski | Filip Gralinski
Proceedings of the Workshop on Recent Advances in Dependency Grammar

2000

pdf bib
POLENG–Adjusting a Rule-Based Polish–English Machine Translation System by Means of Corpus Analysis
Krzysztof Jassem | Filip Graliński | Grzegorz Krynicki
5th EAMT Workshop: Harvesting Existing Resources