Andreas Hotho - ACL Anthology

2025

LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
Jan Pfister | Julia Wunderle | Andreas Hotho
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We transparently create two German-only decoder models, LLäMmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved at equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models’ learning process. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.

Die SuperGLEBer at GermEval 2025 Shared Tasks: Growing Pains - When More Isn’t Always Better
Julia Wunderle | Jan Pfister | Andreas Hotho
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops

Assessing the State of the Art in Scene Segmentation
Albin Zehe | Elisabeth Fischer | Andreas Hotho
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The detection of scenes in literary texts is a recently introduced segmentation task in computational literary studies. Its goal is to partition a fictional text into segments that are coherent across the dimensions time, space, action and character constellation. This task is very challenging for automatic methods, since it requires a high-level understanding of the text. In this paper, we provide a thorough analysis of the State of the Art and challenges in this task, identifying and solving a problem in the training procedure for previous approaches, analysing the generalisation capabilities of the models and comparing the BERT-based SotA to current Llama models, as well as providing an analysis of what causes errors in the models. Our change in training procedure provides a significant increase in performance. We find that Llama-based models are more robust to different types of texts, while their overall performance is slightly worse than that of BERT-based models.

SALT at SemEval-2025 Task 2: A SQL-based Approach for LLM-Free Entity-Aware-Translation
Tom Völker | Jan Pfister | Andreas Hotho
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Entity-aware machine translation faces significant challenges when translating culturally-adapted named entities that require knowledge beyond the source text. We present SALT (SQL-based Approach for LLM-Free Entity-Aware-Translation), a parameter-efficient system for the SemEval-2025 Task 2. Our approach combines SQL-based entity retrieval with constrained neural translation via logit biasing and explicit entity annotations. Despite its simplicity, it achieves state-of-the-art performance (First Place) among approaches not using gold-standard data, while requiring far less computation than LLM-based methods. Our ablation studies show simple SQL-based retrieval rivals complex neural models, and strategic model refinement outperforms increased model complexity. SALT offers an alternative to resource-intensive LLM-based approaches, achieving comparable results with only a fraction of the parameters.

CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval
Dominik Benchert | Severin Meßlinger | Sven Goller | Jonas Kaiser | Jan Pfister | Andreas Hotho
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

The focus of SemEval-2025 Task 7 is the retrieval of relevant fact-checks for social media posts across multiple languages. We approach this task with an enhanced bi-encoder retrieval setup, which is designed to match social media posts with relevant fact-checks using synthetic data from LLMs. We explored and analyzed two main approaches for generating synthetic posts: one based on existing fact-checks, the other on existing posts. Our approach achieved an S@10 score of 89.53% for the monolingual task and 74.48% for the crosslingual task, ranking 16th out of 28 and 13th out of 29, respectively. Without data augmentation, scores would have been 88.69% (17th) and 72.93% (15th).

BARTABSA++: Revisiting BARTABSA with Decoder LLMs
Jan Pfister | Tom Völker | Anton Vlasjuk | Andreas Hotho
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

We revisit the BARTABSA framework for aspect-based sentiment analysis with modern decoder LLMs to assess the importance of explicit structure modeling today. Our updated implementation - BARTABSA++ - features architectural enhancements that boost performance and training stability. Systematic testing with various encoder-decoder architectures shows that BARTABSA++ with BART-Large achieves state-of-the-art results, even surpassing a finetuned GPT-4o model. Our analysis indicates the encoder’s representational quality is vital, while the decoder’s role is minimal, explaining the limited benefits of scaling decoder-only LLMs for this task. These findings underscore the complementary roles of explicit structured modeling and large language models, indicating structured approaches remain competitive for tasks requiring precise relational information extraction.

2024

SuperGLEBer: German Language Understanding Evaluation Benchmark
Jan Pfister | Andreas Hotho
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We assemble a broad Natural Language Understanding benchmark suite for the German language and consequently evaluate a wide array of existing German-capable models in order to create a better understanding of the current state of German LLMs. Our benchmark consists of 29 different tasks ranging over different types such as document classification, sequence tagging, sentence similarity, and question answering, on which we evaluate 10 different German-pretrained models, thereby charting the landscape of German LLMs. In our comprehensive evaluation we find that encoder models are a good choice for most tasks, but also that the largest encoder model does not necessarily perform best for all tasks. We make our benchmark suite and a leaderboard publicly available at https://supergleber.professor-x.de and encourage the community to contribute new tasks and evaluate more models on it (https://github.com/LSX-UniWue/SuperGLEBer).

OtterlyObsessedWithSemantics at SemEval-2024 Task 4: Developing a Hierarchical Multi-Label Classification Head for Large Language Models
Julia Wunderle | Julian Schubert | Antonella Cacciatore | Albin Zehe | Jan Pfister | Andreas Hotho
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

For our submission for Subtask 1, we developed a custom classification head that is designed to be applied atop a Large Language Model. We reconstructed the hierarchy across multiple fully connected layers, allowing us to incorporate previous foundational decisions in subsequent, more fine-grained layers. To find the best hyperparameters, we conducted a grid search, and to compete in the multilingual setting, we translated all documents to English.

Pollice Verso at SemEval-2024 Task 6: The Roman Empire Strikes Back
Konstantin Kobs | Jan Pfister | Andreas Hotho
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

We present an intuitive approach for hallucination detection in LLM outputs that is modeled after how humans would go about this task. We engage several LLM “experts” to independently assess whether a response is hallucinated. For this we select recent and popular LLMs smaller than 7B parameters. By analyzing the log probabilities for tokens that signal a positive or negative judgment, we can determine the likelihood of hallucination. Additionally, we enhance the performance of our “experts” by automatically refining their prompts using the recently introduced OPRO framework. Furthermore, we ensemble the replies of the different experts in a uniform or weighted manner, which builds a quorum from the expert replies. Overall this leads to accuracy improvements of up to 10.6 p.p. compared to the challenge baseline. We show that a Zephyr 3B model is well suited for the task. Our approach can be applied in the model-agnostic and model-aware subtasks without modification and is flexible and easily extendable to related tasks.
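The scoring and ensembling steps described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the two-token "yes"/"no" setup, and the example weights are assumptions made here, and the prompt refinement via OPRO is not covered.

```python
import math


def hallucination_probability(logp_yes: float, logp_no: float) -> float:
    """Turn one expert's log probabilities for a positive ("hallucinated")
    and a negative judgment token into a probability of hallucination.

    Assumption: only these two tokens are compared, so the two values are
    renormalized against each other (a two-way softmax).
    """
    m = max(logp_yes, logp_no)  # subtract the max for numerical stability
    p_yes = math.exp(logp_yes - m)
    p_no = math.exp(logp_no - m)
    return p_yes / (p_yes + p_no)


def ensemble(probabilities, weights=None):
    """Combine the experts' probabilities by a uniform or weighted average.

    With weights=None every expert counts equally; otherwise each
    probability is scaled by its (hypothetical) expert weight.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("need exactly one weight per expert")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, probabilities)) / total


# Two hypothetical experts that disagree about the same response.
expert_a = hallucination_probability(math.log(0.6), math.log(0.2))
expert_b = hallucination_probability(math.log(0.1), math.log(0.3))
uniform_vote = ensemble([expert_a, expert_b])
weighted_vote = ensemble([expert_a, expert_b], weights=[3.0, 1.0])
```

A response could then be flagged as hallucinated when the ensembled probability exceeds a chosen threshold; how the threshold and the weights are selected is left open here.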

2023

Pointer Networks: A Unified Approach to Extracting German Opinions
Julia Wunderle | Jan Pfister | Andreas Hotho
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

Jack-Ryder at SemEval-2023 Task 5: Zero-Shot Clickbait Spoiling by Rephrasing Titles as Questions
Dirk Wangsadirdja | Jan Pfister | Konstantin Kobs | Andreas Hotho
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

In this paper, we describe our approach to the clickbait spoiling task of SemEval 2023. The core idea behind our system is to leverage pre-trained models capable of Question Answering (QA) to extract the spoiler from article texts based on the clickbait title without any task-specific training. Since these titles are often not phrased as questions, we automatically rephrase the clickbait titles as questions in order to better suit the pretraining task of the QA-capable models. Also, to fit as much relevant context into the model’s limited input size as possible, we propose to reorder the sentences by their relevance using a semantic similarity model. Finally, we evaluate QA as well as text generation models (via prompting) to extract the spoiler from the text. Based on the validation data, our final model selects each of these components depending on the spoiler type and achieves satisfactory zero-shot results. The ideas described in this paper can easily be applied in fine-tuning settings.

2022

LSX_team5 at SemEval-2022 Task 8: Multilingual News Article Similarity Assessment based on Word- and Sentence Mover’s Distance
Stefan Heil | Karina Kopp | Albin Zehe | Konstantin Kobs | Andreas Hotho
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper introduces our submission for the SemEval 2022 Task 8: Multilingual News Article Similarity. The task of the competition consisted of the development of a model capable of determining the similarity between pairs of multilingual news articles. To address this challenge, we evaluated the Word Mover’s Distance in conjunction with word embeddings from ConceptNet Numberbatch and term frequencies of WorldLex, as well as the Sentence Mover’s Distance based on sentence embeddings generated by pretrained transformer models of Sentence-BERT. To facilitate the comparison of multilingual articles with Sentence-BERT models, we deployed a Neural Machine Translation system. All our models achieve stable results in multilingual similarity estimation without learning parameters.

WueDevils at SemEval-2022 Task 8: Multilingual News Article Similarity via Pair-Wise Sentence Similarity Matrices
Dirk Wangsadirdja | Felix Heinickel | Simon Trapp | Albin Zehe | Konstantin Kobs | Andreas Hotho
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We present a system that creates pair-wise cosine and arccosine sentence similarity matrices using multilingual sentence embeddings obtained from pre-trained SBERT and Universal Sentence Encoder (USE) models respectively. For each news article sentence, it searches for the most similar sentence in the other article and computes an average score. Further, a convolutional neural network calculates a total similarity score for the article pairs on these matrices. Finally, a random forest regressor merges the previous results into a final score that can optionally be extended with a publishing date score.
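The first two steps of this abstract (a pair-wise cosine similarity matrix, then a best-match average) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' system: the embeddings are toy vectors rather than SBERT or USE outputs, averaging the best matches in both directions is a choice made here, and the arccosine variant, the CNN, and the random forest are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(article_a, article_b):
    """Rows index sentences of article A, columns index sentences of article B."""
    return [[cosine(u, v) for v in article_b] for u in article_a]


def best_match_score(matrix):
    """Average, over every sentence of both articles, of the similarity to
    its most similar sentence in the other article.

    Assumption: both articles contain at least one sentence.
    """
    row_best = [max(row) for row in matrix]        # A -> best sentence in B
    col_best = [max(col) for col in zip(*matrix)]  # B -> best sentence in A
    scores = row_best + col_best
    return sum(scores) / len(scores)


# Toy 2-dimensional "sentence embeddings" (hypothetical values).
article_a = [[1.0, 0.0], [0.0, 1.0]]
article_b = [[1.0, 0.0]]
matrix = similarity_matrix(article_a, article_b)
score = best_match_score(matrix)
```

In this toy case one sentence of article A has a perfect counterpart and the other has none, so the averaged score lands between the two extremes.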

SenPoi at SemEval-2022 Task 10: Point me to your Opinion, SenPoi
Jan Pfister | Sebastian Wankerl | Andreas Hotho
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Structured Sentiment Analysis is the task of extracting sentiment tuples in a graph structure commonly from review texts. We adapt the Aspect-Based Sentiment Analysis pointer network BARTABSA to model this tuple extraction as a sequence prediction task and extend its output grammar to account for the increased complexity of Structured Sentiment Analysis. To predict structured sentiment tuples in languages other than English we swap BART for a multilingual mT5 and introduce a novel Output Length Regularization to mitigate overfitting to common target sequence lengths, thereby improving the performance of the model by up to 70%. We evaluate our approach on seven datasets in five languages including a zero-shot crosslingual setting.

2021

Detecting Scenes in Fiction: A new Segmentation Task
Albin Zehe | Leonard Konle | Lea Katharina Dümpelmann | Evelyn Gius | Andreas Hotho | Fotis Jannidis | Lucas Kaufmann | Markus Krug | Frank Puppe | Nils Reiter | Annekea Schreiber | Nathalie Wiedmer
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper introduces the novel task of scene segmentation on narrative texts and provides an annotated corpus, a discussion of the linguistic and narrative properties of the task and baseline experiments towards automatic solutions. A scene here is a segment of the text where time and discourse time are more or less equal, the narration focuses on one action and location and character constellations stay the same. The corpus we describe consists of German-language dime novels (550k tokens) that have been annotated in parallel, achieving an inter-annotator agreement of gamma = 0.7. Baseline experiments using BERT achieve an F1 score of 24%, showing that the task is very challenging. An automatic scene segmentation paves the way towards processing longer narrative texts like tales or novels by breaking them down into smaller, coherent and meaningful parts, which is an important stepping stone towards the reconstruction of plot in Computational Literary Studies but also can serve to improve tasks like coreference resolution.

2020

Where to Submit? Helping Researchers to Choose the Right Venue
Konstantin Kobs | Tobias Koopmann | Albin Zehe | David Fernes | Philipp Krop | Andreas Hotho
Findings of the Association for Computational Linguistics: EMNLP 2020

Whenever researchers write a paper, the same question occurs: “Where to submit?” In this work, we introduce WTS, an open and interpretable NLP system that recommends conferences and journals to researchers based on the title, abstract, and/or keywords of a given paper. We adapt the TextCNN architecture and automatically analyze its predictions using the Integrated Gradients method to highlight words and phrases that led to the recommendation of a scientific venue. We train and test our method on publications from the fields of artificial intelligence (AI) and medicine, both derived from the Semantic Scholar dataset. WTS achieves an Accuracy@5 of approximately 83% for AI papers and 95% in the field of medicine. It is open source and available for testing on https://wheretosubmit.ml.
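The Accuracy@5 figure reported in this abstract is a top-k metric: a paper counts as correct if its true venue appears among the first k recommendations. A minimal sketch, assuming ranked recommendation lists and a single gold venue per paper (the function name and the sample venues are illustrative, not taken from WTS):

```python
def accuracy_at_k(ranked_predictions, gold_labels, k=5):
    """Fraction of items whose gold label is among the top-k ranked predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need exactly one ranked list per gold label")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Hypothetical venue rankings for two papers, best recommendation first.
ranked = [
    ["ACL", "EMNLP", "NAACL"],
    ["ICML", "NeurIPS", "ICLR"],
]
gold = ["EMNLP", "AAAI"]
top2 = accuracy_at_k(ranked, gold, k=2)
top1 = accuracy_at_k(ranked, gold, k=1)
```

Larger values of k make the metric more lenient, which is why a top-5 accuracy is typically higher than a plain top-1 accuracy on the same predictions.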

Improving Sentiment Analysis with Biofeedback Data
Daniel Schlör | Albin Zehe | Konstantin Kobs | Blerta Veseli | Franziska Westermeier | Larissa Brübach | Daniel Roth | Marc Erich Latoschik | Andreas Hotho
Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)

Humans frequently are able to read and interpret emotions of others by directly taking verbal and non-verbal signals in human-to-human communication into account or to infer or even experience emotions from mediated stories. For computers, however, emotion recognition is a complex problem: Thoughts and feelings are the roots of many behavioural responses and they are deeply entangled with neurophysiological changes within humans. As such, emotions are very subjective, are often expressed in a subtle manner, and depend heavily on context. For example, machine learning approaches for text-based sentiment analysis often rely on incorporating sentiment lexicons or language models to capture the contextual meaning. This paper explores if and how we can further enhance sentiment analysis using biofeedback of humans who are experiencing emotions while reading texts. Specifically, we record the heart rate and brain waves of readers who are presented with short texts which have been annotated with the emotions they induce. We use these physiological signals to improve the performance of a lexicon-based sentiment classifier. We find that the combination of several biosignals can improve the ability of a text-based classifier to detect the presence of a sentiment in a text on a per-sentence level.

2019

Team Xenophilius Lovegood at SemEval-2019 Task 4: Hyperpartisanship Classification using Convolutional Neural Networks
Albin Zehe | Lena Hettinger | Stefan Ernst | Christian Hauptmann | Andreas Hotho
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system for the SemEval 2019 Task 4 on hyperpartisan news detection. We build on an existing deep learning approach for sentence classification based on a Convolutional Neural Network. Modifying the original model with additional layers to increase its expressiveness and finally building an ensemble of multiple versions of the model, we obtain an accuracy of 67.52% and an F1 score of 73.78% on the main test dataset. We also report on additional experiments incorporating handcrafted features into the CNN and using it as a feature extractor for a linear SVM.

2018

ClaiRE at SemEval-2018 Task 7: Classification of Relations using Embeddings
Lena Hettinger | Alexander Dallmann | Albin Zehe | Thomas Niebler | Andreas Hotho
Proceedings of the 12th International Workshop on Semantic Evaluation

In this paper we describe our system for SemEval-2018 Task 7 on classification of semantic relations in scientific literature for clean (subtask 1.1) and noisy data (subtask 1.2). We compare two models for classification, a C-LSTM which utilizes only word embeddings and an SVM that also takes handcrafted features into account. To adapt to the domain of science we train word embeddings on scientific papers collected from arXiv.org. The hand-crafted features consist of lexical features to model the semantic relations as well as the entities between which the relation holds. Classification of Relations using Embeddings (ClaiRE) achieved an F1 score of 74.89% for the first subtask and 78.39% for the second.

2004

Clustering Concept Hierarchies from Text
Philipp Cimiano | Andreas Hotho | Steffen Staab
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
