Jakub Pokrywka


2024

pdf bib
kubapok@LT-EDI 2024: Evaluating Transformer Models for Hate Speech Detection in Tamil
Jakub Pokrywka | Krzysztof Jassem
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

We describe the second-place submission for the shared task organized at the Fourth Workshop on Language Technology for Equality, Diversity, and Inclusion (LT-EDI-2024). The task focuses on detecting caste/migration hate speech in Tamil. The included texts involve the Tamil language in both Tamil script and transliterated into Latin script, with some texts also in English. Considering different scripts, we examined the performance of 12 transformer language models on the dev set. Our analysis revealed that for the whole dataset, the model google/muril-large-cased performs the best. We used an ensemble of several models for the final challenge submission, achieving 0.81 for the test dataset.

2022

pdf bib
Challenging America: Modeling language in longer time scales
Jakub Pokrywka | Filip Graliński | Krzysztof Jassem | Karol Kaczmarek | Krzysztof Jurkiewicz | Piotr Wierzchon
Findings of the Association for Computational Linguistics: NAACL 2022

The aim of the paper is to apply, for historical texts, the methodology used commonly to solve various NLP tasks defined for contemporary data, i.e. pre-train and fine-tune large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin, and paired with their textual contents retrieved by an OCR tool. Three, publicly available, ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch from the historical texts. We also discuss the issues of discrimination and hate-speech present in the historical American texts.