Darina Gold

2025

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
Viktor Hangya | Fabian Küch | Darina Gold
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut

2021

VL-BERT+: Detecting Protected Groups in Hateful Multimodal Memes
Piush Aggarwal | Michelle Espranita Liman | Darina Gold | Torsten Zesch
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

This paper describes our submission (winning solution for Task A) to the Shared Task on Hateful Meme Detection at WOAH 2021. We build our system on top of a state-of-the-art system for binary hateful meme classification that already uses image tags such as race, gender, and web entities. We add further metadata such as emotions and experiment with data augmentation techniques, as hateful instances are underrepresented in the data set.

2020

Decomposing and Comparing Meaning Relations: Paraphrasing, Textual Entailment, Contradiction, and Specificity
Venelin Kovatchev | Darina Gold | M. Antonia Marti | Maria Salamo | Torsten Zesch
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a methodology for decomposing and comparing multiple meaning relations (paraphrasing, textual entailment, contradiction, and specificity). The methodology includes SHARel - a new typology that consists of 26 linguistic and 8 reason-based categories. We use the typology to annotate a corpus of 520 sentence pairs in English and we demonstrate that unlike previous typologies, SHARel can be applied to all relations of interest with a high inter-annotator agreement. We analyze and compare the frequency and distribution of the linguistic and reason-based phenomena involved in paraphrasing, textual entailment, contradiction, and specificity. This comparison allows for a much more in-depth analysis of the workings of the individual relations and the way they interact and compare with each other. We release all resources (typology, annotation guidelines, and annotated corpus) to the community.

2019

RELATIONS - Workshop on meaning relations between phrases and sentences
Venelin Kovatchev | Darina Gold | Torsten Zesch
RELATIONS - Workshop on meaning relations between phrases and sentences

Divide and Extract – Disentangling Clause Splitting and Proposition Extraction
Darina Gold | Torsten Zesch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Proposition extraction from sentences is an important task for information extraction systems. Evaluation of such systems usually conflates two aspects: splitting complex sentences into clauses and the extraction of propositions. It is thus difficult to independently determine the quality of the proposition extraction step. We create a manually annotated proposition dataset from sentences taken from restaurant reviews that distinguishes between clauses that need to be split and those that do not. The resulting proposition evaluation dataset allows us to independently compare the performance of proposition extraction systems on simple and complex clauses. Although performance drastically drops on more complex sentences, we show that the same systems perform best on both simple and complex clauses. Furthermore, we show that specific kinds of subordinate clauses pose difficulties to most systems.

Annotating and analyzing the interactions between meaning relations
Darina Gold | Venelin Kovatchev | Torsten Zesch
Proceedings of the 13th Linguistic Annotation Workshop

Pairs of sentences, phrases, or other text pieces can hold semantic relations such as paraphrasing, textual entailment, contradiction, specificity, and semantic similarity. These relations are usually studied in isolation and no dataset exists where they can be compared empirically. Here we present a corpus annotated with these relations and the analysis of these results. The corpus contains 520 sentence pairs, annotated with these relations. We measure the annotation reliability of each individual relation and we examine their interactions and correlations. Among the unexpected results revealed by our analysis is that the traditionally considered direct relationship between paraphrasing and bi-directional entailment does not hold in our data.

Co-authors

Michelle Espranita Liman 1

M. Antònia Martí 1

Maria Salamó 1

Venues

WOAH 1
