Jan Pfister

2025

SALT at SemEval-2025 Task 2: A SQL-based Approach for LLM-Free Entity-Aware-Translation
Tom Völker | Jan Pfister | Andreas Hotho
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Entity-aware machine translation faces significant challenges when translating culturally-adapted named entities that require knowledge beyond the source text. We present SALT (SQL-based Approach for LLM-Free Entity-Aware-Translation), a parameter-efficient system for the SemEval-2025 Task 2. Our approach combines SQL-based entity retrieval with constrained neural translation via logit biasing and explicit entity annotations. Despite its simplicity, it achieves state-of-the-art performance (First Place) among approaches not using gold-standard data, while requiring far less computation than LLM-based methods. Our ablation studies show simple SQL-based retrieval rivals complex neural models, and strategic model refinement outperforms increased model complexity. SALT offers an alternative to resource-intensive LLM-based approaches, achieving comparable results with only a fraction of the parameters.

pdf bib

Die SuperGLEBer at GermEval 2025 Shared Tasks: Growing Pains - When More Isn’t Always Better
Julia Wunderle | Jan Pfister | Andreas Hotho
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops

pdf bib abs

CAIDAS at SemEval-2025 Task 7: Enriching Sparse Datasets with LLM-Generated Content for Improved Information Retrieval
Dominik Benchert | Severin Meßlinger | Sven Goller | Jonas Kaiser | Jan Pfister | Andreas Hotho
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

The focus of SemEval-2024 Task 7 is the retrieval of relevant fact-checks for social media posts across multiple languages. We approach this task with an enhanced bi-encoder retrieval setup, which is designed to match social media posts with relevant fact-checks using synthetic data from LLMs. We explored and analyzed two main approaches for generating synthetic posts. Either based on existing fact-checks or on existing posts. Our approach achieved an S@10 score of 89.53% for the monolingual task and 74.48% for the crosslingual task, ranking 16th out of 28 and 13th out of 29, respectively. Without data augmentation, scores would have been 88.69 (17th) and 72.93 (15th).

pdf bib abs

BARTABSA++: Revisiting BARTABSA with Decoder LLMs
Jan Pfister | Tom Völker | Anton Vlasjuk | Andreas Hotho
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

We revisit the BARTABSA framework for aspect-based sentiment analysis with modern decoder LLMs to assess the importance of explicit structure modeling today. Our updated implementation - BARTABSA++ - features architectural enhancements that boost performance and training stability.Systematic testing with various encoder-decoder architectures shows that BARTABSA++ with BART-Large achieves state-of-the-art results, even surpassing a finetuned GPT-4o model.Our analysis indicates the encoder’s representational quality is vital, while the decoder’s role is minimal, explaining the limited benefits of scaling decoder-only LLMs for this task. These findings underscore the complementary roles of explicit structured modeling and large language models, indicating structured approaches remain competitive for tasks requiring precise relational information extraction.

pdf bib abs

LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
Jan Pfister | Julia Wunderle | Andreas Hotho
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We transparently create two German-only decoder models, LLäMmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved in equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models’ learning process.Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.

2024

pdf bib abs

OtterlyObsessedWithSemantics at SemEval-2024 Task 4: Developing a Hierarchical Multi-Label Classification Head for Large Language Models
Julia Wunderle | Julian Schubert | Antonella Cacciatore | Albin Zehe | Jan Pfister | Andreas Hotho
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

For our submission for Subtask 1, we developed a custom classification head that is designed to be applied atop of a Large Language Model. We reconstructed the hierarchy across multiple fully connected layers, allowing us to incorporate previous foundational decisions in subsequent, more fine-grained layers. To find the best hyperparameters, we conducted a grid-search and to compete in the multilingual setting, we translated all documents to English.

pdf bib abs

SuperGLEBer: German Language Understanding Evaluation Benchmark
Jan Pfister | Andreas Hotho
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We assemble a broad Natural Language Understanding benchmark suite for the German language and consequently evaluate a wide array of existing German-capable models in order to create a better understanding of the current state of German LLMs. Our benchmark consists of 29 different tasks ranging over different types such as document classification, sequence tagging, sentence similarity, and question answering, on which we evaluate 10 different German-pretrained models, thereby charting the landscape of German LLMs. In our comprehensive evaluation we find that encoder models are a good choice for most tasks, but also that the largest encoder model does not necessarily perform best for all tasks. We make our benchmark suite and a leaderboard publically available at https://supergleber.professor-x.de and encourage the community to contribute new tasks and evaluate more models on it (https://github.com/LSX-UniWue/SuperGLEBer).

pdf bib abs

Pollice Verso at SemEval-2024 Task 6: The Roman Empire Strikes Back
Konstantin Kobs | Jan Pfister | Andreas Hotho
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

We present an intuitive approach for hallucination detection in LLM outputs that is modeled after how humans would go about this task. We engage several LLM “experts” to independently assess whether a response is hallucinated. For this we select recent and popular LLMs smaller than 7B parameters. By analyzing the log probabilities for tokens that signal a positive or negative judgment, we can determine the likelihood of hallucination. Additionally, we enhance the performance of our “experts” by automatically refining their prompts using the recently introduced OPRO framework. Furthermore, we ensemble the replies of the different experts in a uniform or weighted manner, which builds a quorum from the expert replies. Overall this leads to accuracy improvements of up to 10.6 p.p. compared to the challenge baseline. We show that a Zephyr 3B model is well suited for the task. Our approach can be applied in the model-agnostic and model-aware subtasks without modification and is flexible and easily extendable to related tasks.

2023

pdf bib abs

Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.

pdf bib

Pointer Networks: A Unified Approach to Extracting German Opinions
Julia Wunderle | Jan Pfister | Andreas Hotho
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

pdf bib abs

Jack-Ryder at SemEval-2023 Task 5: Zero-Shot Clickbait Spoiling by Rephrasing Titles as Questions
Dirk Wangsadirdja | Jan Pfister | Konstantin Kobs | Andreas Hotho
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

In this paper, we describe our approach to the clickbait spoiling task of SemEval 2023.The core idea behind our system is to leverage pre-trained models capable of Question Answering (QA) to extract the spoiler from article texts based on the clickbait title without any task-specific training. Since oftentimes, these titles are not phrased as questions, we automatically rephrase the clickbait titles as questions in order to better suit the pretraining task of the QA-capable models. Also, to fit as much relevant context into the model’s limited input size as possible, we propose to reorder the sentences by their relevance using a semantic similarity model. Finally, we evaluate QA as well as text generation models (via prompting) to extract the spoiler from the text. Based on the validation data, our final model selects each of these components depending on the spoiler type and achieves satisfactory zero-shot results. The ideas described in this paper can easily be applied in fine-tuning settings.

2022

pdf bib abs

SenPoi at SemEval-2022 Task 10: Point me to your Opinion, SenPoi
Jan Pfister | Sebastian Wankerl | Andreas Hotho
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Structured Sentiment Analysis is the task of extracting sentiment tuples in a graph structure commonly from review texts. We adapt the Aspect-Based Sentiment Analysis pointer network BARTABSA to model this tuple extraction as a sequence prediction task and extend their output grammar to account for the increased complexity of Structured Sentiment Analysis. To predict structured sentiment tuples in languages other than English we swap BART for a multilingual mT5 and introduce a novel Output Length Regularization to mitigate overfitting to common target sequence lengths, thereby improving the performance of the model by up to 70%. We evaluate our approach on seven datasets in five languages including a zero shot crosslingual setting.