Roman Kern


2022

Impact of Training Instance Selection on Domain-Specific Entity Extraction using BERT
Eileen Salhofer | Xing Lan Liu | Roman Kern
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

State-of-the-art performance for entity extraction tasks is achieved by supervised learning, specifically by fine-tuning pretrained language models such as BERT. As a result, annotating application-specific data is the first step in many use cases. However, no practical guidelines are available for annotation requirements. This work supports practitioners by empirically answering two frequently asked questions: (1) how many training samples to annotate? (2) which examples to annotate? We found that BERT achieves up to 80% F1 when fine-tuned on only 70 training examples, especially in the biomedical domain. The key features for guiding the selection of high-performing training instances are identified to be pseudo-perplexity and sentence length. The best training dataset constructed using our proposed selection strategy achieves an F1 score equivalent to that of a random selection with twice the sample size. The requirement of only a small amount of training data implies cheaper implementations and opens the door to a wider range of applications.
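
As a rough illustration of one of the selection features named in the abstract, the sketch below scores candidate sentences by pseudo-perplexity under a masked language model and ranks them together with sentence length. The model name (bert-base-uncased), the candidate sentences, and the ranking rule are assumptions for illustration, not the paper's actual setup.

```python
# Minimal sketch: pseudo-perplexity of a sentence under a masked language model.
# Mask each token in turn, score the true token, and average the negative
# log-likelihoods. Model choice and ranking rule are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-uncased"  # assumption; the paper targets domain-specific corpora
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_perplexity(sentence: str) -> float:
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return float(torch.exp(torch.tensor(sum(nlls) / len(nlls))))

# Rank candidate training sentences by pseudo-perplexity, then by length
# (the ranking direction here is an assumption, not the paper's finding).
candidates = ["The protein binds to the receptor.", "Cells were incubated overnight."]
ranked = sorted(candidates, key=lambda s: (pseudo_perplexity(s), -len(s.split())))
print(ranked)
```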

Causal Investigation of Public Opinion during the COVID-19 Pandemic via Social Media Text
Michael Jantscher | Roman Kern
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Understanding the needs and fears of citizens, especially during a pandemic such as COVID-19, is essential for any government or legislative entity. An effective COVID-19 strategy further requires that the public understand and accept the restriction plans imposed by these entities. In this paper, we explore a causal mediation scenario in which we emphasize the use of NLP methods in combination with methods from economics and the social sciences. Based on sentiment analysis of tweets towards the current COVID-19 situation in the UK and Sweden, we conduct several causal inference experiments and attempt to decouple the effect of government restrictions on mobility behavior from the effect that arises from public perception of a country's COVID-19 strategy. To avoid biased results, we control for valid country-specific epidemiological and time-varying confounders. Comprehensive experiments show that not all changes in mobility are caused by the policies countries implement; they are also driven by individuals' support in the fight against this pandemic. We find that social media texts are an important source for capturing citizens' concerns and trust in policy makers and are suitable for evaluating the success of government policies.
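
The sketch below is not the paper's pipeline; it only illustrates the general idea of combining NLP with regression-style adjustment: score tweet sentiment with an off-the-shelf classifier and regress mobility change on policy stringency while adjusting for sentiment and simple confounders. The input file, column names, and regression specification are hypothetical.

```python
# Illustrative sketch (hypothetical data and columns, not the paper's setup):
# sentiment scoring of tweets plus an OLS adjustment for confounders.
import pandas as pd
import statsmodels.formula.api as smf
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # generic model; the paper's setup may differ

df = pd.read_csv("tweets_mobility.csv")  # hypothetical per-day aggregates
df["sentiment"] = [
    s["score"] if s["label"] == "POSITIVE" else -s["score"]
    for s in sentiment(df["tweet_text"].tolist(), truncation=True)
]

# Mobility change ~ restriction stringency + public sentiment + confounders
model = smf.ols(
    "mobility_change ~ stringency + sentiment + new_cases + C(weekday)", data=df
).fit()
print(model.summary())
```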

2019

Know-Center at SemEval-2019 Task 5: Multilingual Hate Speech Detection on Twitter using CNNs
Kevin Winter | Roman Kern
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper presents the Know-Center system submitted for Task 5 of the SemEval-2019 workshop. Given a Twitter message in either English or Spanish, the task is to first detect whether it contains hateful speech and second, to determine the target and level of aggression used. For this purpose, our system utilizes word embeddings and a neural network architecture consisting of both dilated and traditional convolution layers. We achieved average F1-scores of 0.57 and 0.74 for English and Spanish, respectively.
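
A minimal sketch of a comparable architecture is shown below: a text CNN that combines a standard and a dilated 1D convolution over word embeddings, followed by max pooling and a sigmoid output for binary hate-speech detection. Vocabulary size, dimensions, and the single-output head are assumptions; the submitted system's exact configuration may differ.

```python
# Sketch of a text CNN with standard and dilated convolutions over embeddings.
# All sizes and the binary output head are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size=20000, max_len=50, embed_dim=100):
    inputs = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    conv = layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    dilated = layers.Conv1D(64, 3, padding="same", dilation_rate=2, activation="relu")(x)
    x = layers.Concatenate()([conv, dilated])
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # hateful / not hateful
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```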

2017

Know-Center at SemEval-2017 Task 10: Sequence Classification with the CODE Annotator
Roman Kern | Stefan Falk | Andi Rexha
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our participation in SemEval-2017 Task 10. We competed in Subtasks 1 and 2, which consist of identifying all the key phrases in scientific publications and labeling them with one of three categories: Task, Process, and Material. These scientific publications are drawn from the Computer Science, Material Sciences, and Physics domains. We followed a supervised approach for both subtasks, using a sequential classifier (Conditional Random Fields, CRF). To generate our solution we used a web-based application implemented in the EU-funded research project CODE. Our system achieved an F1 score of 0.39 for Subtask 1 and 0.28 for Subtask 2.
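
As an illustration of the sequential-classifier approach, the sketch below tags tokens with BIO labels for key phrases using a linear-chain CRF via sklearn-crfsuite. The hand-crafted features, toy data, and hyperparameters are assumptions rather than the system's actual configuration.

```python
# Minimal sketch: BIO-style key phrase tagging with a linear-chain CRF.
# Features, toy data, and hyperparameters are illustrative assumptions.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training data: one sentence with a "Material" key phrase
sentences = [["We", "anneal", "the", "silicon", "wafer", "."]]
labels = [["O", "O", "O", "B-Material", "I-Material", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```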

2016

Know-Center at SemEval-2016 Task 5: Using Word Vectors with Typed Dependencies for Opinion Target Expression Extraction
Stefan Falk | Andi Rexha | Roman Kern
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Identifying Referenced Text in Scientific Publications by Summarisation and Classification Techniques
Stefan Klampfl | Andi Rexha | Roman Kern
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

2014

A Study of Scientific Writing: Comparing Theoretical Guidelines with Practical Implementation
Mark Kröll | Gunnar Schulze | Roman Kern
Proceedings of the COLING Workshop on Synchronic and Diachronic Approaches to Analyzing Technical Language

2013

Using Factual Density to Measure Informativeness of Web Documents
Christopher Horn | Alisa Zhila | Alexander Gelbukh | Roman Kern | Elisabeth Lex
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

KnCe2013-CORE: Semantic Text Similarity by use of Knowledge Bases
Hermann Ziak | Roman Kern
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2010

KCDC: Word Sense Induction by Using Grammatical Dependencies and Sentence Phrase Structure
Roman Kern | Markus Muhr | Michael Granitzer
Proceedings of the 5th International Workshop on Semantic Evaluation