Thomas Proisl

2020

EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus
Thomas Proisl | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Andreas Blombach | Stefan Evert
Proceedings of the Twelfth Language Resources and Evaluation Conference

The EmpiriST corpus (Beißwenger et al., 2016) is a manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC (computer-mediated communication) data. We extend the corpus with manually created annotation layers for word form normalization, lemmatization and lexical semantics. All annotations have been independently performed by multiple human annotators. We report inter-annotator agreements and results of baseline systems and state-of-the-art off-the-shelf tools.

pdf bib abs

A Corpus of German Reddit Exchanges (GeRedE)
Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Thomas Proisl
Proceedings of the Twelfth Language Resources and Evaluation Conference

GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.

2019

pdf bib

The_Illiterati: Part-of-Speech Tagging for Magahi and Bhojpuri without even knowing the alphabet
Thomas Proisl | Peter Uhrig | Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Sefora Mammarella
Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers

2018

pdf bib

SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts
Thomas Proisl
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

Albanian Part-of-Speech Tagging: Gold Standard and Evaluation
Besim Kabashi | Thomas Proisl
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

Delta vs. N-Gram Tracing: Evaluating the Robustness of Authorship Attribution Methods
Thomas Proisl | Stefan Evert | Fotis Jannidis | Christof Schöch | Leonard Konle | Steffen Pielström
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs

EmotiKLUE at IEST 2018: Topic-Informed Classification of Implicit Emotions
Thomas Proisl | Philipp Heinrich | Besim Kabashi | Stefan Evert
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

EmotiKLUE is a submission to the Implicit Emotion Shared Task. It is a deep learning system that combines independent representations of the left and right contexts of the emotion word with the topic distribution of an LDA topic model. EmotiKLUE achieves a macro average F₁score of 67.13%, significantly outperforming the baseline produced by a simple ML classifier. Further enhancements after the evaluation period lead to an improved F₁score of 68.10%.

2016

pdf bib abs

A Proposal for a Part-of-Speech Tagset for the Albanian Language
Besim Kabashi | Thomas Proisl
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Part-of-speech tagging is a basic step in Natural Language Processing that is often essential. Labeling the word forms of a text with fine-grained word-class information adds new value to it and can be a prerequisite for downstream processes like a dependency parser. Corpus linguists and lexicographers also benefit greatly from the improved search options that are available with tagged data. The Albanian language has some properties that pose difficulties for the creation of a part-of-speech tagset. In this paper, we discuss those difficulties and present a proposal for a part-of-speech tagset that can adequately represent the underlying linguistic phenomena.

pdf bib

SoMaJo: State-of-the-art tokenization for German web and social media texts
Thomas Proisl | Peter Uhrig
Proceedings of the 10th Web as Corpus Workshop

State-of-the-art dependency representations such as the Stanford Typed Dependencies may represent the grammatical relations in a sentence as directed, possibly cyclic graphs. Querying a syntactically annotated corpus for grammatical structures that are represented as graphs requires graph matching, which is a non-trivial task. In this paper, we present an algorithm for graph matching that is tailored to the properties of large, syntactically annotated corpora. The implementation of the algorithm is built on top of the popular IMS Open Corpus Workbench, allowing corpus linguists to re-use existing infrastructure. An evaluation of the resulting software, CWB-treebank, shows that its performance in real world applications, such as a web query interface, compares favourably to implementations that rely on a relational database or a dedicated graph database while at the same time offering a greater expressive power for queries. An intuitive graphical interface for building the query graphs is available via the Treebank.info project.

2010

pdf bib abs

Using High-Quality Resources in NLP: The Valency Dictionary of English as a Resource for Left-Associative Grammars
Thomas Proisl | Besim Kabashi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In Natural Language Processing (NLP), the quality of a system depends to a great extent on the quality of the linguistic resources it uses. One area where precise information is particularly needed is valency. The unpredictable character of valency properties requires a reliable source of information for syntactic and semantic analysis. There are several (electronic) dictionaries that provide the necessary information. One such dictionary that contains especially detailed valency descriptions is the Valency Dictionary of English. We will discuss how the Valency Dictionary of English in machine-readable form can be used as a resource for NLP. We will use valency descriptions that are freely available online via the Erlangen Valency Pattern Bank which contains most of the information from the printed dictionary. We will show that the valency data can be used for accurately parsing natural language with a rule-based approach by integrating it into a Left-Associative Grammar. The Valency Dictionary of English can therefore be regarded as being well suited for NLP purposes.