Pius Von Däniken

Also published as: Pius von Däniken

2026

Do NOT Classify and Count: Hybrid Attribute Control Success Evaluation
Felix Matthias Saaro | Pius von Däniken | Mark Cieliebak | Jan Milan Deriu
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluating attribute control success in controllable text generation and related generation tasks typically relies on pretrained classifiers. We show that this widely used classify-and-count approach yields biased and inconsistent results, with estimates varying significantly across classifiers. We frame control success estimation as a quantification task and apply a hybrid Bayesian method that combines classifier predictions with a small number of human labels for calibration. To test our approach, we collected a two-modality test dataset consisting of 600 human-rated samples and 60,000 automatically rated samples. Our experiments show that our approach produces robust estimates of control success across both text and text-to-image generation tasks, offering a principled alternative to current evaluation practices.

2025

A Measure of the System Dependence of Automated Metrics
Pius Von Däniken | Jan Milan Deriu | Mark Cieliebak
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.

ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos
Patrick Giedemann | Pius von Däniken | Jan Milan Deriu | Alvaro Rodrigo | Anselmo Peñas | Mark Cieliebak
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.

2024

Language Models Explore the Linguistics of Chess
Lars Schmid | Jerome Maag | Mark Cieliebak | Pius von Däniken
Proceedings of the 9th edition of the Swiss Text Analytics Conference

Role-Playing LLMs in Professional Communication Training: The Case of Investigative Interviews with Children
Don Tuggener | Teresa Schneider | Ariana Huwiler | Tobias Kreienbühl | Simon Hischier | Pius von Däniken | Susanna Niehaus
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

Favi-Score: A Measure for Favoritism in Automated Preference Ratings for Generative AI Evaluation
Pius Von Däniken | Jan Deriu | Don Tuggener | Mark Cieliebak
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generative AI systems have become ubiquitous for all kinds of modalities, which makes the issue of the evaluation of such models more pressing. One popular approach is preference ratings, where the generated outputs of different systems are shown to evaluators who choose their preferences. In recent years the field shifted towards the development of automated (trained) metrics to assess generated outputs, which can be used to create preference ratings automatically. In this work, we investigate the evaluation of the metrics themselves, which currently rely on measuring the correlation to human judgments or computing sign accuracy scores. These measures only assess how well the metric agrees with the human ratings. However, our research shows that this does not tell the whole story. Most metrics exhibit a disagreement with human system assessments which is often skewed in favor of particular text generation systems, exposing a degree of favoritism in automated metrics. This paper introduces a formal definition of favoritism in preference metrics, and derives the Favi-Score, which measures this phenomenon. In particular we show that favoritism is strongly related to errors in final system rankings. Thus, we propose that preference-based metrics ought to be evaluated on both sign accuracy scores and favoritism.

Annotation Tool for Dataset Creation
Patrick Giedemann | Pius von Däniken | Jan Milan Deriu
Proceedings of the 9th edition of the Swiss Text Analytics Conference

2023

Correction of Errors in Preference Ratings from Automated Metrics for Text Generation
Jan Deriu | Pius von Däniken | Don Tuggener | Mark Cieliebak
Findings of the Association for Computational Linguistics: ACL 2023

A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreements with human judgments. In this paper, we propose to apply automated metrics for Text Generation in a preference-based evaluation protocol. The protocol features a statistical model that incorporates various levels of uncertainty to account for the error-proneness of the metrics. We show that existing metrics are generally over-confident in assigning significant differences between systems. As a remedy, the model allows combining human ratings with automated ratings. We show that it can reduce the required amounts of human ratings to arrive at robust and statistically significant results by more than 50%, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of the evaluation protocol for three text generation tasks: dialogue systems, machine translation, and text summarization.

2022

Improving NL-to-Query Systems through Re-ranking of Semantic Hypothesis
Pius von Däniken | Jan Deriu | Eneko Agirre | Ursin Brunner | Mark Cieliebak | Kurt Stockinger
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

On the Effectiveness of Automated Metrics for Text Generation Systems
Pius von Däniken | Jan Deriu | Don Tuggener | Mark Cieliebak
Findings of the Association for Computational Linguistics: EMNLP 2022

A major challenge in the field of Text Generation is evaluation, because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.

Probing the Robustness of Trained Metrics for Conversational Dialogue Systems
Jan Deriu | Don Tuggener | Pius Von Däniken | Mark Cieliebak
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper introduces an adversarial method to stress-test trained metrics for the evaluation of conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that they all are susceptible to giving high scores to responses generated by rather simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans.

2020

LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts
Don Tuggener | Pius von Däniken | Thomas Peetz | Mark Cieliebak
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present LEDGAR, a multilabel corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Due to the rather large labelset of over 12’000 labels annotated in almost 100’000 provisions in over 60’000 contracts, we believe the corpus to be of interest for research in the field of Legal NLP, (large-scale or extreme) text classification, as well as for legal studies. We discuss several methods to sample subcorpora from the corpus and implement and evaluate different automatic classification approaches. Finally, we perform transfer experiments to evaluate how well the classifiers perform on contracts stemming from outside the corpus.

The lack of time-efficient and reliable evaluation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic the conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like behavior the longest, i.e. Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g. fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.

ZHAW-InIT - Social Media Geolocation at VarDial 2020
Fernando Benites | Manuela Hürlimann | Pius von Däniken | Mark Cieliebak
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We describe our approaches for the Social Media Geolocation (SMG) task at the VarDial Evaluation Campaign 2020. The goal was to predict geographical location (latitudes and longitudes) given an input text. There were three subtasks corresponding to German-speaking Switzerland (CH), Germany and Austria (DE-AT), and Croatia, Bosnia and Herzegovina, Montenegro and Serbia (BCMS). We submitted solutions to all subtasks but focused our development efforts on the CH subtask, where we achieved third place out of 16 submissions with a median distance of 15.93 km and had the best result of 14 unconstrained systems. In the DE-AT subtask, we ranked sixth out of ten submissions (fourth of 8 unconstrained systems) and for BCMS we achieved fourth place out of 13 submissions (second of 11 unconstrained systems).

TRANSLIT: A Large-scale Name Transliteration Resource
Fernando Benites | Gilbert François Duivesteijn | Pius von Däniken | Mark Cieliebak
Proceedings of the Twelfth Language Resources and Evaluation Conference

Transliteration is the process of expressing a proper name from a source language in the characters of a target language (e.g. from Cyrillic to Latin characters). We present TRANSLIT, a large-scale corpus with approx. 1.6 million entries in more than 180 languages with about 3 million variations of person and geolocation names. The corpus is based on various public data sources, which have been transformed into a unified format to simplify their usage, plus a newly compiled dataset from Wikipedia. In addition, we apply several machine learning methods to establish baselines for automatically detecting transliterated names in various languages. Our best systems achieve an accuracy of 92% on identification of transliterated pairs.

2019

TwistBytes - Identification of Cuneiform Languages and German Dialects at VarDial 2019
Fernando Benites | Pius von Däniken | Mark Cieliebak
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro averaged F-1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro averaged F-1 of 74.7%.

2018

Twist Bytes - German Dialect Identification with Data Mining Optimization
Fernando Benites | Ralf Grubenmann | Pius von Däniken | Dirk von Grünigen | Jan Deriu | Mark Cieliebak
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

We describe our approaches used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify which of four dialects spoken in the German-speaking part of Switzerland a sentence belonged to. We adopted two different meta classifier approaches and used some data mining insights to improve the preprocessing and the meta classifier parameters. In particular, we focused on using different feature extraction methods and how to combine them, since they influenced the performance of the system very differently. Our system achieved second place out of 8 teams, with a macro averaged F-1 of 64.6%.

SB-CH: A Swiss German Corpus with Sentiment Annotations
Ralf Grubenmann | Don Tuggener | Pius von Däniken | Jan Deriu | Mark Cieliebak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets
Pius von Däniken | Mark Cieliebak
Proceedings of the 3rd Workshop on Noisy User-generated Text

We present our system for the WNUT 2017 Named Entity Recognition challenge on Twitter data. We describe two modifications of a basic neural network architecture for sequence tagging. First, we show how we exploit additional labeled data, where the Named Entity tags differ from the target task. Then, we propose a way to incorporate sentence level features. Our system uses both methods and ranked second for entity level annotations, achieving an F1-score of 40.78, and second for surface form annotations, achieving an F1-score of 39.33.