Krzysztof Jurkiewicz
2025
Oddballness: universal anomaly detection with language models
Filip Gralinski | Ryszard Staruch | Krzysztof Jurkiewicz
Proceedings of the 31st International Conference on Computational Linguistics
We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric defined in this paper: oddballness. Oddballness measures how “strange” a given token is according to the language model. We demonstrate in grammatical error detection tasks (a specific case of text anomaly detection) that oddballness is better than just considering low-likelihood events, if a totally unsupervised setup is assumed.
2022
Challenging America: Modeling language in longer time scales
Jakub Pokrywka | Filip Graliński | Krzysztof Jassem | Karol Kaczmarek | Krzysztof Jurkiewicz | Piotr Wierzchon
Findings of the Association for Computational Linguistics: NAACL 2022
The aim of the paper is to apply to historical texts the methodology commonly used to solve various NLP tasks defined for contemporary data, i.e., pre-training and fine-tuning large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin, and paired with their textual contents retrieved by an OCR tool. Three publicly available ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch on the historical texts. We also discuss the issues of discrimination and hate speech present in the historical American texts.
Co-authors
- Filip Gralinski 2
- Krzysztof Jassem 1
- Karol Kaczmarek 1
- Jakub Pokrywka 1
- Ryszard Staruch 1