Sabine Bartsch

2023

Presenting an Annotation Pipeline for Fine-grained Linguistic Analyses of Multimodal Corpora
Elena Volkanovska | Sherry Tan | Changxu Duan | Debajyoti Chowdhury | Sabine Bartsch
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

Corpus Annotation Graph Builder (CAG): An Architectural Framework to Create and Annotate a Multi-source Graph
Roxanne El Baff | Tobias Hecking | Andreas Hamm | Jasper W. Korte | Sabine Bartsch
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Graphs are a natural representation of complex data, as their structure allows users to discover (often implicit) relations among the nodes intuitively. Applications build graphs in an ad-hoc fashion, usually tailored to specific use cases, which limits their reusability. To address this, we present the Corpus Annotation Graph (CAG) architectural framework, based on a create-and-annotate pattern that enables users to build uniformly structured graphs from diverse data sources and extend them with automatically extracted annotations (e.g., named entities, topics). The resulting graphs can be used for further analyses across multiple downstream tasks (e.g., node classification). Code and resources are publicly available on GitHub and installable from PyPI with the command pip install cag.

LaTeX Rainbow: Universal LaTeX to PDF Document Semantic & Layout Annotation Framework
Changxu Duan | Zhiyin Tan | Sabine Bartsch
Proceedings of the Second Workshop on Information Extraction from Scientific Publications

2021

TUDA-CCL at SemEval-2021 Task 1: Using Gradient-boosted Regression Tree Ensembles Trained on a Heterogeneous Feature Set for Predicting Lexical Complexity
Sebastian Gombert | Sabine Bartsch
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this paper, we present our systems submitted to SemEval-2021 Task 1 on lexical complexity prediction. The aim of this shared task was to create systems able to predict the lexical complexity of word tokens and bigram multiword expressions within a given sentence context, expressed as a continuous value indicating how difficult the respective expression is to understand. Our approach relies on gradient-boosted regression tree ensembles fitted on a heterogeneous feature set that combines linguistic features, static and contextualized word embeddings, psycholinguistic norm lexica, WordNet, word and character bigram frequencies, and inclusion in wordlists, yielding a model able to assign a word or multiword expression a context-dependent complexity score. We show that contextualized string embeddings in particular help with predicting lexical complexity.

2020

MultiVitaminBooster at PARSEME Shared Task 2020: Combining Window- and Dependency-Based Features with Multilingual Contextualised Word Embeddings for VMWE Detection
Sebastian Gombert | Sabine Bartsch
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

In this paper, we present MultiVitaminBooster, a system implemented for the PARSEME shared task on semi-supervised identification of verbal multiword expressions (edition 1.2). We treat the detection of verbal multiword expressions as a token classification task that decides whether or not a token is part of a verbal multiword expression. For this purpose, we train gradient boosting-based models. We encode tokens as feature vectors that combine multilingual contextualized word embeddings provided by the XLM-RoBERTa language model with a more traditional linguistic feature set relying on context windows and dependency relations. Our system was ranked 7th in the official open-track ranking of the shared task evaluations, although an encoding-related bug distorted these results. For this reason, we carry out further unofficial evaluations; the unofficial versions of our system would have achieved higher ranks.

2016

EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora
Michael Beißwenger | Sabine Bartsch | Stefan Evert | Kay-Michael Würzner
Proceedings of the 10th Web as Corpus Workshop
