Sven Schlarb


2022

Zero-shot Event Causality Identification with Question Answering
Daria Liakhovets | Sven Schlarb
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

Extracting event causality from text, and implicit causality in particular, is a challenging task. Causality is often treated as a specific relation type and can be considered part of a relation extraction or relation classification task. Many causality identification tasks are framed as multiple-choice classification, in which the most plausible cause is selected from a set of alternatives. Since powerful Question Answering (QA) systems pretrained on large text corpora are available, we investigated a zero-shot QA-based approach to event causality extraction, using a Wikipedia-based dataset containing event descriptions (articles) and annotated causes. We aimed to evaluate to what extent the reading comprehension ability of a QA pipeline can be used for event-related causality extraction from plain text without any additional training. We also discuss some evaluation challenges and limitations of the data. We compared the performance of a two-step pipeline consisting of passage retrieval and extractive QA with that of a QA-only pipeline on event-associated articles and on mixed ones. Our systems achieved average cosine semantic similarity scores of 44–45% in different settings.

2021

DreamDrug - A crowdsourced NER dataset for detecting drugs in darknet markets
Johannes Bogensperger | Sven Schlarb | Allan Hanbury | Gábor Recski
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present DreamDrug, a crowdsourced dataset for detecting mentions of drugs in noisy user-generated item listings from darknet markets. Our dataset contains nearly 15,000 manually annotated drug entities in over 3,500 item listings scraped from the darknet market platform “DreamMarket” in 2017. We also train and evaluate baseline models for detecting these entities, using contextual language models fine-tuned in a few-shot setting and on the full dataset, and examine the effect of pretraining on in-domain unannotated corpora.