Annemarie Friedrich - ACL Anthology

2025

Efficient Multi-Agent Collaboration with Tool Use for Online Planning in Complex Table Question Answering
Wei Zhou | Mohsen Mesgar | Annemarie Friedrich | Heike Adel
Findings of the Association for Computational Linguistics: NAACL 2025

Complex table question answering (TQA) aims to answer questions that require complex reasoning, such as multi-step or multi-category reasoning, over data represented in tabular form. Previous approaches demonstrate notable performance by leveraging either closed-source large language models (LLMs) or fine-tuned open-weight LLMs. However, fine-tuning LLMs requires high-quality training data, which is costly to obtain. The use of closed-source LLMs poses accessibility challenges and leads to reproducibility issues. In this paper, we propose Multi Agent Collaboration with Tool use (MACT), a framework that requires neither fine-tuning nor closed-source models. In MACT, a planning agent and a coding agent that also make use of tools collaborate for TQA. MACT outperforms previous SoTA systems on three out of four benchmarks and performs comparably to the larger and more expensive closed-source model GPT-4 on two benchmarks, even when using only open-weight models without any fine-tuning. Our extensive analyses prove the effectiveness of MACT’s multi-agent collaboration in TQA. We release our code publicly.

Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Findings of the Association for Computational Linguistics: ACL 2025

In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and model types from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.

Overview of the SustainEval 2025 Shared Task: Identifying the Topic and Verifiability of Sustainability Report Excerpts
Jakob Prange | Charlott Jakob | Patrick Göttfert | Raphael Huber | Pia Wenzel Neves | Annemarie Friedrich
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops

G-MACT at SemEval-2025 Task 8: Exploring Planning and Tool Use in Question Answering over Tabular Data
Wei Zhou | Mohsen Mesgar | Annemarie Friedrich | Heike Adel
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper describes our system submitted to SemEval-2025 Task 8 “Question Answering over Tabular Data.” The shared task focuses on tackling real-life table question answering (TQA) involving extremely large tables with the additional challenges of interpreting complex questions. To address these issues, we leverage a framework of Multi-Agent Collaboration with Tool use (MACT), a method that combines planning and tool use. The planning module breaks down a complex question by designing a step-by-step plan. This plan is translated into Python code by a coding model, and a Python interpreter executes the code to generate an answer. Our system demonstrates competitive performance in the shared task and is ranked 5th out of 38 in the open-source model category. We provide a detailed analysis of our model, evaluating the effectiveness and the efficiency of each component, and identify common error patterns. Our paper offers essential insights and recommendations for future advancements in developing TQA systems.
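
The plan, code, execute loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `plan`, `write_code`, and `execute` are hypothetical names, the planning and coding models are replaced by stubs that return fixed output, and the table is a toy list of dictionaries.

```python
from typing import Dict, List

Table = List[Dict[str, object]]


def plan(question: str) -> List[str]:
    # Hypothetical stand-in for a planning model: returns a fixed step-by-step plan.
    return [
        "Keep only the rows whose country is 'DE'",
        "Sum the 'sales' values of the remaining rows",
    ]


def write_code(steps: List[str]) -> str:
    # Hypothetical stand-in for a coding model: returns fixed Python code for the plan.
    return (
        "rows = [r for r in table if r['country'] == 'DE']\n"
        "answer = sum(r['sales'] for r in rows)\n"
    )


def execute(code: str, table: Table) -> object:
    # Run the generated code in its own namespace and read back the 'answer' variable.
    namespace = {"table": table}
    exec(code, namespace)
    return namespace["answer"]


table = [
    {"country": "DE", "sales": 3},
    {"country": "FR", "sales": 5},
    {"country": "DE", "sales": 4},
]
steps = plan("What are the total sales in DE?")
print(execute(write_code(steps), table))  # prints 7
```

In a real system, generated code would need to run in a sandbox; calling `exec` directly is only acceptable here because the code string is fixed.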

LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews
Christian Jaumann | Andreas Wiedholz | Annemarie Friedrich
Findings of the Association for Computational Linguistics: ACL 2025

The scientific literature is growing rapidly, making it hard to keep track of the state-of-the-art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR’s inclusion and exclusion criteria, yet existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5-10 pp. in mean average precision. Our code and data are publicly available.
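
For readers unfamiliar with the metric reported in this abstract, mean average precision (MAP) over ranked candidate lists can be computed as in the following minimal sketch. The function names and the toy relevance lists are illustrative assumptions and are not taken from the LGAR code. The sketch also assumes that every relevant paper appears somewhere in its ranking.

```python
from typing import List


def average_precision(ranked_relevance: List[int]) -> float:
    """Average precision for one ranking of binary relevance labels (1 = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at positions holding a relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[int]]) -> float:
    """Mean of the per-ranking average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy rankings, e.g. one per systematic review (illustrative data only).
rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(round(mean_average_precision(rankings), 4))  # prints 0.6667
```

The first ranking scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.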

Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges
Bolei Ma | Yuting Li | Wei Zhou | Ziwei Gong | Yang Janet Liu | Katja Jasinskaja | Annemarie Friedrich | Julia Hirschberg | Frauke Kreuter | Barbara Plank
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding pragmatics—the use of language in context—is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatic phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models
Christian Jaumann | Annemarie Friedrich | Rainer Lienhart
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTScore. Our code is publicly available.

RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Proceedings of the 4th Table Representation Learning Workshop

Tables can be represented either as text or as images. Previous works on table question answering (TQA) typically rely on only one representation, neglecting the potential benefits of combining both. In this work, we explore integrating textual and visual table representations using multi-modal large language models (MLLMs) for TQA. Specifically, we propose RITT, a retrieval-assisted framework that first identifies the most relevant part of a table for a given question, then dynamically selects the optimal table representations based on the question type. Experiments demonstrate that our framework significantly outperforms the baseline MLLMs by an average of 13 Exact Match and surpasses two text-only state-of-the-art TQA methods on four TQA benchmarks, highlighting the benefits of leveraging both textual and visual table representations.

Coling-UniA at GermEval 2025 Shared Task on Candy Speech Detection: Retrieval Augmented Generation for Identifying Expressions of Positive Attitudes in German YouTube Comments
Georg Hofmann | Annemarie Friedrich
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops

PAP2PAT: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs
Valentin Knappich | Anna Hätty | Simon Razniewski | Annemarie Friedrich
Findings of the Association for Computational Linguistics: ACL 2025

Dealing with long and highly complex technical text is a challenge for Large Language Models (LLMs), which still have to unfold their potential in supporting expensive and time-intensive processes like patent drafting. Within patents, the description constitutes more than 90% of the document on average. Yet, its automatic generation remains understudied. When drafting patent applications, patent attorneys typically receive invention reports (IRs), which are usually confidential, hindering research on LLM-supported patent drafting. Often, pre-publication research papers serve as IRs. We leverage this duality to build PAP2PAT, an open and realistic benchmark for patent drafting consisting of 1.8k patent-paper pairs describing the same inventions. To address the complex long-document patent generation task, we propose chunk-based outline-guided generation using the research paper as invention specification. Our extensive evaluation using PAP2PAT and a human case study show that LLMs can effectively leverage information from the paper, but still struggle to provide the necessary level of detail. Fine-tuning leads to more patent-style language, but also to more hallucination. We release our data and code.

p²-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Table question answering (TQA) focuses on answering questions based on tabular data. Developing TQA systems targets effective interaction with tabular data for tasks such as cell retrieval and data analysis. While recent work has leveraged fine-tuning to improve TQA systems, existing approaches often under-utilize available data and neglect the potential of post-training for further gains. In this work, we introduce p²-TQA, a process-based preference learning framework for TQA post-training. p²-TQA automatically constructs process-based preference data via a table-specific pipeline, eliminating the need for manual or costly data collection. It then optimizes models through contrastive learning on the collected data. Experiments show that p²-TQA effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets with only 8,000 training instances. Furthermore, models enhanced with p²-TQA achieve competitive results against larger, more complex state-of-the-art TQA systems, while maintaining up to five times higher efficiency.

2024

AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports
Lukas Lange | Marc Müller | Ghazaleh Haratinezhad Torbati | Dragan Milchevski | Patrick Grau | Subhash Chandra Pujari | Annemarie Friedrich
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely-used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out-of-context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.

FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, understanding the underlying cause and nature of this issue remains predominantly unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.

QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios
Timo Pierre Schrader | Lukas Lange | Simon Razniewski | Annemarie Friedrich
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Reasoning is key to many decision-making processes. It requires consolidating a set of rule-like premises that are often associated with degrees of uncertainty and observations to draw conclusions. In this work, we address both the case where premises are specified as numeric probabilistic rules and situations in which humans state their estimates using words expressing degrees of certainty. Existing probabilistic reasoning datasets simplify the task, e.g., by requiring the model to only rank textual alternatives, by including only binary random variables, or by making use of a limited set of templates that result in less varied text. In this work, we present QUITE, a question answering dataset of real-world Bayesian reasoning scenarios with categorical random variables and complex relationships. QUITE provides high-quality natural language verbalizations of premises together with evidence statements and expects the answer to a question in the form of an estimated probability. We conduct an extensive set of experiments, finding that logic-based models outperform out-of-the-box large language models on all reasoning types (causal, evidential, and explaining-away). Our results provide evidence that neuro-symbolic models are a promising direction for improving complex reasoning. We release QUITE and code for training and experiments on GitHub.

OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context
Steffen Kleinle | Jakob Prange | Annemarie Friedrich
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

2023

A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches
Annemarie Friedrich | Nianwen Xue | Alexis Palmer
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Aspectual meaning refers to how the internal temporal structure of situations is presented. This includes whether a situation is described as a state or as an event, whether the situation is finished or ongoing, and whether it is viewed as a whole or with a focus on a particular phase. This survey gives an overview of computational approaches to modeling lexical and grammatical aspect along with intuitive explanations of the necessary linguistic concepts and terminology. In particular, we describe the concepts of stativity, telicity, habituality, perfective and imperfective, as well as influential inventories of eventuality and situation types. Aspect is a crucial component of semantics, especially for precise reporting of the temporal structure of situations, and future NLP approaches need to be able to handle and evaluate it systematically.

Proceedings of the 1st Workshop on Teaching for NLP
Annemarie Friedrich | Stefan Grünewald | Margot Mieskes | Jannik Strötgen | Christian Wartena
Proceedings of the 1st Workshop on Teaching for NLP

Is the Answer in the Text? Challenging ChatGPT with Evidence Retrieval from Instructive Text
Sophie Henning | Talita Anthonio | Wei Zhou | Heike Adel | Mohsen Mesgar | Annemarie Friedrich
Findings of the Association for Computational Linguistics: EMNLP 2023

Generative language models have recently shown remarkable success in generating answers to questions in a given textual context. However, these answers may suffer from hallucination, wrongly cite evidence, and spread misleading information. In this work, we address this problem by employing ChatGPT, a state-of-the-art generative model, as a machine-reading system. We ask it to retrieve answers to lexically varied and open-ended questions from trustworthy instructive texts. We introduce WHERE (WikiHow Evidence REtrieval), a new high-quality evaluation benchmark of a set of WikiHow articles exhaustively annotated with evidence sentences to questions that comes with a special challenge: All questions are about the article’s topic, but not all can be answered using the provided context. We interestingly find that when using a regular question-answering prompt, ChatGPT neglects to detect the unanswerable cases. When provided with a few examples, it learns to better judge whether a text provides answer evidence or not. Alongside this important finding, our dataset defines a new benchmark for evidence retrieval in question answering, which we argue is one of the necessary next steps for making large language models more trustworthy.

A Survey of Methods for Addressing Class Imbalance in Deep-Learning Based Natural Language Processing
Sophie Henning | William Beluch | Alexander Fraser | Annemarie Friedrich
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Many natural language processing (NLP) tasks are naturally imbalanced, as some target categories occur much more frequently than others in the real world. In such scenarios, current NLP models tend to perform poorly on less frequent classes. Addressing class imbalance in NLP is an active research topic, yet, finding a good approach for a particular task and imbalance scenario is difficult. In this survey, the first overview on class imbalance in deep-learning based NLP, we first discuss various types of controlled and real-world class imbalance. Our survey then covers approaches that have been explicitly proposed for class-imbalanced NLP tasks or, originating in the computer vision community, have been evaluated on them. We organize the methods by whether they are based on sampling, data augmentation, choice of loss function, staged learning, or model design. Finally, we discuss open problems and how to move forward.

MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain
Timo Pierre Schrader | Matteo Finco | Stefan Grünewald | Felix Hildebrand | Annemarie Friedrich
Proceedings of the Second Workshop on Information Extraction from Scientific Publications

MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain
Timo Pierre Schrader | Teresa Bürkle | Sophie Henning | Sherry Tan | Matteo Finco | Stefan Grünewald | Maira Indrikova | Felix Hildebrand | Annemarie Friedrich
Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)

Scientific publications follow conventionalized rhetorical structures. Classifying the Argumentative Zone (AZ), e.g., identifying whether a sentence states a Motivation, a Result or Background information, has been proposed to improve processing of scholarly documents. In this work, we adapt and extend this idea to the domain of materials science research. We present and release a new dataset of 50 manually annotated research articles. The dataset spans seven sub-topics and is annotated with a materials-science focused multi-label annotation scheme for AZ. We detail corpus statistics and demonstrate high inter-annotator agreement. Our computational experiments show that using domain-specific pre-trained transformer-based text encoders is key to high classification performance. We also find that AZ categories from existing datasets in other domains are transferable to varying degrees.

Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)
Munir Georges | Aaricia Herygers | Annemarie Friedrich | Benjamin Roth
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
Jakob Prange | Annemarie Friedrich
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

BoschAI @ Causal News Corpus 2023: Robust Cause-Effect Span Extraction using Multi-Layer Sequence Tagging and Data Augmentation
Timo Pierre Schrader | Simon Razniewski | Lukas Lange | Annemarie Friedrich
Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text

Understanding causality is a core aspect of intelligence. The Event Causality Identification with Causal News Corpus Shared Task addresses two aspects of this challenge: Subtask 1 aims at detecting causal relationships in texts, and Subtask 2 requires identifying signal words and the spans that refer to the cause or effect, respectively. Our system, which is based on pre-trained transformers, stacked sequence tagging, and synthetic data augmentation, ranks third in Subtask 1 and wins Subtask 2 with an F1 score of 72.8, corresponding to a margin of 13 pp. to the second-best system.

2022

MiST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text
Sophie Henning | Nicole Macher | Stefan Grünewald | Annemarie Friedrich
Findings of the Association for Computational Linguistics: EMNLP 2022

Modal verbs (e.g., can, should or must) occur highly frequently in scientific articles. Decoding their function is not straightforward: they are often used for hedging, but they may also denote abilities and restrictions. Understanding their meaning is important for accurate information extraction from scientific text. To foster research on the usage of modals in this genre, we introduce the MIST (Modals In Scientific Text) dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function. We systematically evaluate a set of competitive neural architectures on MIST. Transfer experiments reveal that leveraging non-scientific data is of limited benefit for modeling the distinctions in MIST. Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs, yet, classifiers trained on scientific data generalize to some extent to unseen scientific domains.

Three Real-World Datasets and Neural Computational Models for Classification Tasks in Patent Landscaping
Subhash Pujari | Jannik Strötgen | Mark Giereth | Michael Gertz | Annemarie Friedrich
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Patent Landscaping, one of the central tasks of intellectual property management, includes selecting and grouping patents according to user-defined technical or application-oriented criteria. While recent transformer-based models have been shown to be effective for classifying patents into taxonomies such as CPC or IPC, there is yet little research on how to support real-world Patent Landscape Studies (PLSs) using natural language processing methods. With this paper, we release three labeled datasets for PLS-oriented classification tasks covering two diverse domains. We provide a qualitative analysis and report detailed corpus statistics. Most research on neural models for patents has been restricted to leveraging titles and abstracts. We compare strong neural and non-neural baselines, proposing a novel model that takes into account textual information from the patents’ full texts as well as embeddings created based on the patents’ CPC labels. We find that for PLS-oriented classification tasks, going beyond title and abstract is crucial, CPC labels are an effective source of information, and combining all features yields the best results.

2021

A Corpus Study of Creating Rule-Based Enhanced Universal Dependencies for German
Teresa Bürkle | Stefan Grünewald | Annemarie Friedrich
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

In this paper, we present a first attempt at enriching German Universal Dependencies (UD) treebanks with enhanced dependencies. Similarly to the converter for English (Schuster and Manning, 2016), we develop a rule-based system for deriving enhanced dependencies from the basic layer, covering three linguistic phenomena: relative clauses, coordination, and raising/control. For quality control, we manually correct or validate a set of 196 sentences, finding that around 90% of added relations are correct. Our data analysis reveals that difficulties arise mainly due to inconsistencies in the basic layer annotations. We show that the English system is in general applicable to German data, but that adapting to the particularities of the German treebanks and language increases precision and recall by up to 10%. Comparing the application of our converter on gold standard dependencies vs. automatic parses, we find that F1 drops by around 10% in the latter setting due to error propagation. Finally, an enhanced UD parser trained on a converted treebank performs poorly when evaluated against our annotations, indicating that more work remains to be done to create gold standard enhanced German treebanks.

A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich | Torsten Zesch
Proceedings of the Fifth Workshop on Teaching NLP

It is generally agreed upon in the natural language processing (NLP) community that ethics should be integrated into any curriculum. Being aware of and understanding the relevant core concepts is a prerequisite for following and participating in the discourse on ethical NLP. We here present ready-made teaching material in the form of slides and practical exercises on ethical issues in NLP, which is primarily intended to be integrated into introductory NLP or computational linguistics courses. By making this material freely available, we aim at lowering the threshold to adding ethics to the curriculum. We hope that increased awareness will enable students to identify potentially unethical behavior.

Negation-Instance Based Evaluation of End-to-End Negation Resolution
Elizaveta Sineva | Stefan Grünewald | Annemarie Friedrich | Jonas Kuhn
Proceedings of the 25th Conference on Computational Natural Language Learning

In this paper, we revisit the task of negation resolution, which includes the subtasks of cue detection (e.g. “not”, “never”) and scope resolution. In the context of previous shared tasks, a variety of evaluation metrics have been proposed. Subsequent works usually use different subsets of these, including variations and custom implementations, rendering meaningful comparisons between systems difficult. Examining the problem both from a linguistic perspective and from a downstream viewpoint, we here argue for a negation-instance based approach to evaluating negation resolution. Our proposed metrics correspond to expectations over per-instance scores and hence are intuitively interpretable. To render research comparable and to foster future work, we provide results for a set of current state-of-the-art systems for negation resolution on three English corpora, and make our implementation of the evaluation scripts publicly available.

Coordinate Constructions in English Enhanced Universal Dependencies: Analysis and Computational Modeling
Stefan Grünewald | Prisca Piccirilli | Annemarie Friedrich
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

In this paper, we address the representation of coordinate constructions in Enhanced Universal Dependencies (UD), where relevant dependency links are propagated from conjunction heads to other conjuncts. English treebanks for enhanced UD have been created from gold basic dependencies using a heuristic rule-based converter, which propagates only core arguments. With the aim of determining which set of links should be propagated from a semantic perspective, we create a large-scale dataset of manually edited syntax graphs. We identify several systematic errors in the original data, and propose to also propagate adjuncts. We observe high inter-annotator agreement for this semantic annotation task. Using our new manually verified dataset, we perform the first principled comparison of rule-based and (partially novel) machine-learning based methods for conjunction propagation for English. We show that learning propagation rules is more effective than hand-designing heuristic rules. When using automatic parses, our neural graph-parser based edge predictor outperforms the currently predominant pipelines using a basic-layer tree parser plus converters.

RobertNLP at the IWPT 2021 Shared Task: Simple Enhanced UD Parsing for 17 Languages
Stefan Grünewald | Frederik Tobias Oertel | Annemarie Friedrich
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

This paper presents our multilingual dependency parsing system as used in the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies. Our system consists of an unfactorized biaffine classifier that operates directly on fine-tuned XLM-R embeddings and generates enhanced UD graphs by predicting the best dependency label (or absence of a dependency) for each pair of tokens. To avoid sparsity issues resulting from lexicalized dependency labels, we replace lexical items in relations with placeholders at training and prediction time, later retrieving them from the parse via a hybrid rule-based/machine-learning system. In addition, we utilize model ensembling at prediction time. Our system achieves high parsing accuracy on the blind test data, ranking 3rd out of 9 with an average ELAS F1 score of 86.97.
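
The idea of replacing lexical items in dependency labels with placeholders can be sketched as follows. This is an illustration under stated assumptions, not the RobertNLP code: the placeholder string, the set of lexicalizable base relations, the list of non-lexical subtypes, and both function names are hypothetical. The sketch also restores the lexical item from a stored value, whereas the system described in the abstract retrieves it from the parse.

```python
from typing import Optional, Tuple

PLACEHOLDER = "[lex]"  # hypothetical placeholder token
# Assumed base relations whose enhanced labels may carry a lexical item (e.g. "nmod:of").
LEXICALIZED_BASES = {"obl", "nmod", "advcl", "acl", "conj"}
# Assumed ordinary (non-lexical) subtypes that must be left untouched (e.g. "acl:relcl").
NON_LEXICAL_SUBTYPES = {"relcl", "tmod", "npmod", "poss", "agent"}


def delexicalize(label: str) -> Tuple[str, Optional[str]]:
    """Replace the lexical part of a label with a placeholder; return (label, lexical item)."""
    parts = label.split(":")
    if (
        len(parts) < 2
        or parts[0] not in LEXICALIZED_BASES
        or parts[1] in NON_LEXICAL_SUBTYPES
    ):
        return label, None
    return f"{parts[0]}:{PLACEHOLDER}", ":".join(parts[1:])


def relexicalize(label: str, lexical: Optional[str]) -> str:
    """Put a lexical item back into a delexicalized label."""
    return label if lexical is None else label.replace(PLACEHOLDER, lexical)


for enhanced_label in ["nmod:of", "conj:and", "acl:relcl", "nsubj"]:
    delex, item = delexicalize(enhanced_label)
    print(enhanced_label, "->", delex, "->", relexicalize(delex, item))
```

Collapsing labels such as `nmod:of` and `nmod:in` into a single placeholder label keeps the label set small, which is the sparsity concern the abstract mentions.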

Applying Occam’s Razor to Transformer-Based Dependency Parsing: What Works, What Doesn’t, and What is Really Necessary
Stefan Grünewald | Annemarie Friedrich | Jonas Kuhn
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

The introduction of pre-trained transformer-based contextualized word embeddings has led to considerable improvements in the accuracy of graph-based parsers for frameworks such as Universal Dependencies (UD). However, previous works differ in various dimensions, including their choice of pre-trained language models and whether they use LSTM layers. With the aims of disentangling the effects of these choices and identifying a simple yet widely applicable architecture, we introduce STEPS, a new modular graph-based dependency parser. Using STEPS, we perform a series of analyses on the UD corpora of a diverse set of languages. We find that the choice of pre-trained embeddings has by far the greatest impact on parser performance and identify XLM-R as a robust choice across the languages in our study. Adding LSTM layers provides no benefits when using transformer-based embeddings. A multi-task training setup outputting additional UD features may contort results. Taking these insights together, we propose a simple but widely applicable parser architecture and configuration, achieving new state-of-the-art results (in terms of LAS) for 10 out of 12 diverse languages.

2020

The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain
Annemarie Friedrich | Heike Adel | Federico Tomazic | Johannes Hingerl | Renou Benteau | Anika Marusczyk | Lukas Lange
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.

RobertNLP at the IWPT 2020 Shared Task: Surprisingly Simple Enhanced UD Parsing for English
Stefan Grünewald | Annemarie Friedrich
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

This paper presents our system at the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies. Using a biaffine classifier architecture (Dozat and Manning, 2017) which operates directly on fine-tuned RoBERTa embeddings, our parser generates enhanced UD graphs by predicting the best dependency label (or absence of a dependency) for each pair of tokens in the sentence. We address label sparsity issues by replacing lexical items in relations with placeholders at prediction time, later retrieving them from the parse in a rule-based fashion. In addition, we ensure structural graph constraints using a simple set of heuristics. On the English blind test data, our system achieves a very high parsing accuracy, ranking 1st out of 10 with an ELAS F1 score of 88.94%.

Unifying the Treatment of Preposition-Determiner Contractions in German Universal Dependencies Treebanks
Stefan Grünewald | Annemarie Friedrich
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

HDT-UD, the largest German UD treebank by a large margin, as well as the German-LIT treebank, currently do not analyze preposition-determiner contractions such as zum (= zu dem, “to the”) as multi-word tokens, which is inconsistent with both the UD guidelines and other German UD corpora (GSD and PUD). In this paper, we show that harmonizing corpora with regard to this highly frequent phenomenon using a lookup-table-based approach leads to a considerable increase in automatic parsing performance.
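
To make the lookup-table idea concrete, here is a minimal sketch of how a contraction such as zum could be expanded into a multi-word token. The table entries, the name `CONTRACTIONS`, and the function `split_contractions` are illustrative assumptions and are not taken from the paper; a real converter would also have to deal with ambiguous forms and with updating the dependency annotation.

```python
# Hypothetical lookup table: contraction -> (preposition, determiner).
# These are common German contractions; the table used in the paper may differ.
CONTRACTIONS = {
    "zum": ("zu", "dem"),
    "zur": ("zu", "der"),
    "im": ("in", "dem"),
    "am": ("an", "dem"),
    "vom": ("von", "dem"),
    "beim": ("bei", "dem"),
    "ins": ("in", "das"),
}


def split_contractions(tokens):
    """Pair each surface token with its syntactic words.

    Contractions found in the table are expanded into two words,
    mirroring the multi-word token idea in UD; all other tokens map
    to themselves. The lookup ignores context, which is a simplification.
    """
    result = []
    for token in tokens:
        parts = CONTRACTIONS.get(token.lower())
        if parts is None:
            result.append((token, [token]))
        else:
            result.append((token, list(parts)))
    return result


analysis = split_contractions(["Sie", "geht", "zum", "Bahnhof"])
for surface, words in analysis:
    print(surface, "->", " ".join(words))
```

In this sketch the surface form is kept alongside the expanded words, so the original text remains recoverable.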

ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation
Hanna Wecker | Annemarie Friedrich | Heike Adel
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

This paper adds to the ongoing discussion in the natural language processing community on how to choose a good development set. Motivated by the real-life necessity of applying machine learning models to different data distributions, we propose a clustering-based data splitting algorithm. It creates development (or test) sets which are lexically different from the training data while ensuring similar label distributions. Hence, we are able to create challenging cross-validation evaluation setups while abstracting away from performance differences resulting from label distribution shifts between training and test data. In addition, we present a Python-based tool for analyzing and visualizing data split characteristics and model performance. We illustrate the workings and results of our approach using a sentiment analysis and a patent classification task.

2019

Proceedings of the 13th Linguistic Annotation Workshop
Annemarie Friedrich | Deniz Zeyrek | Jet Hoek
Proceedings of the 13th Linguistic Annotation Workshop

2017

Classification of telicity using cross-linguistic annotation projection
Annemarie Friedrich | Damyana Gateva
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper addresses the automatic recognition of telicity, an aspectual notion. A telic event includes a natural endpoint (“she walked home”), while an atelic event does not (“she walked around”). Recognizing this difference is a prerequisite for temporal natural language understanding. In English, this classification task is difficult, as telicity is a covert linguistic category. In contrast, in Slavic languages, aspect is part of a verb’s meaning and even available in machine-readable dictionaries. Our contributions are as follows. We successfully leverage additional silver standard training data in the form of projected annotations from parallel English-Czech data as well as context information, improving automatic telicity classification for English significantly compared to previous work. We also create a new data set of English texts manually annotated with telicity.

Annotating tense, mood and voice for English, French and German
Anita Ramm | Sharid Loáiciga | Annemarie Friedrich | Alexander Fraser
Proceedings of ACL 2017, System Demonstrations

2016

Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)
Annemarie Friedrich | Katrin Tomanek
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

Situation entity types: automatic classification of clause-level aspect
Annemarie Friedrich | Alexis Palmer | Manfred Pinkal
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Linking discourse modes and situation entity types in a cross-linguistic corpus study
Kleio-Isidora Mavridou | Annemarie Friedrich | Melissa Peate Sørensen | Alexis Palmer | Manfred Pinkal
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

Discourse-sensitive Automatic Identification of Generic Expressions
Annemarie Friedrich | Manfred Pinkal
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Semantically Enriched Models for Modal Sense Classification
Mengfei Zhou | Anette Frank | Annemarie Friedrich | Alexis Palmer
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

Annotating genericity: a survey, a scheme, and a corpus
Annemarie Friedrich | Alexis Palmer | Melissa Peate Sørensen | Manfred Pinkal
Proceedings of the 9th Linguistic Annotation Workshop

Automatic recognition of habituals: a three-way classification of clausal aspect
Annemarie Friedrich | Manfred Pinkal
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization
Annemarie Friedrich | Marina Valeeva | Alexis Palmer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present LQVSumm, a corpus of about 2000 automatically created extractive multi-document summaries from the TAC 2011 shared task on Guided Summarization, which we annotated with several types of linguistic quality violations. Examples of such violations include pronouns that lack antecedents or ungrammatical clauses. We give details on the annotation scheme and show that inter-annotator agreement is good given the open-ended nature of the task. The annotated summaries have previously been scored for Readability on a numeric scale by human annotators in the context of the TAC challenge; we show that the number of linguistic quality violations in a summary correlates with these intuitively assigned numeric scores. At the system level, the average number of violations marked in a system’s summaries achieves higher correlation with the Readability scores than current supervised state-of-the-art methods for assigning a single readability score to a summary. It is our hope that our corpus facilitates the development of methods that not only judge the linguistic quality of automatically generated summaries as a whole, but also allow for detecting, labeling, and fixing particular violations in a text.

Automatic prediction of aspectual class of verbs in context
Annemarie Friedrich | Alexis Palmer
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Situation Entity Annotation
Annemarie Friedrich | Alexis Palmer
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2012

Suffix Trees as Language Models
Casey Redd Kennington | Martin Kay | Annemarie Friedrich
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Suffix trees are data structures that can be used to index a corpus. In this paper, we explore how some properties of suffix trees naturally provide the functionality of an n-gram language model with variable n. We explain these properties of suffix trees, which we leverage for our Suffix Tree Language Model (STLM) implementation, and explain how a suffix tree implicitly contains the data needed for n-gram language modeling. We also discuss the kinds of smoothing techniques appropriate to such a model. We then show that our suffix-tree language model implementation is competitive when compared to the state-of-the-art language model SRILM (Stolcke, 2002) in statistical machine translation experiments.
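
The claim that a suffix structure implicitly holds n-gram counts for any n can be illustrated with a small sketch. It uses an uncompressed word-level suffix trie rather than a true suffix tree, applies no smoothing, and is not the STLM implementation; the class `SuffixTrie` and its methods are hypothetical names chosen for this example.

```python
class SuffixTrie:
    """Uncompressed word-level suffix trie with occurrence counts.

    Every suffix of the corpus is inserted, so the node reached by
    following an n-gram from the root stores how often that n-gram
    occurs, for any n up to max_depth.
    """

    def __init__(self, tokens, max_depth=None):
        n = len(tokens)
        # The root count is the corpus size, i.e. the count of the empty n-gram.
        self.root = {"count": n, "children": {}}
        for start in range(n):
            end = n if max_depth is None else min(n, start + max_depth)
            node = self.root
            for word in tokens[start:end]:
                node = node["children"].setdefault(
                    word, {"count": 0, "children": {}}
                )
                node["count"] += 1

    def count(self, ngram):
        """Return the number of occurrences of the given word sequence."""
        node = self.root
        for word in ngram:
            node = node["children"].get(word)
            if node is None:
                return 0
        return node["count"]

    def prob(self, word, context):
        """Unsmoothed relative frequency of word given the context words."""
        denom = self.count(context)
        if denom == 0:
            return 0.0
        return self.count(list(context) + [word]) / denom


trie = SuffixTrie("the cat sat on the mat".split())
print(trie.count(["the"]))                # unigram count
print(trie.count(["cat", "sat", "on"]))   # trigram count from the same structure
print(trie.prob("cat", ["the"]))          # bigram relative frequency
```

Because the same structure answers queries of any length, the context size can vary per query. A practical model would add smoothing and a compressed representation.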

A Comparison of Knowledge-based Algorithms for Graded Word Sense Assignment
Annemarie Friedrich | Nikos Engonopoulos | Stefan Thater | Manfred Pinkal
Proceedings of COLING 2012: Posters

Co-authors

Sophie Henning 4

Timo Pierre Schrader 4

Simon Razniewski 3

Teresa Bürkle 2

Alexander Fraser 2

Felix Hildebrand 2

Christian Jaumann 2

Melissa Peate Sørensen 2

Jannik Strötgen 2

Talita Anthonio 1

William Beluch 1

Renou Benteau 1

Nikos Engonopoulos 1

Damyana Gateva 1

Munir Georges 1

Michael Gertz 1

Patrick Göttfert 1

Ghazaleh Haratinezhad Torbati 1

Aaricia Herygers 1

Johannes Hingerl 1

Julia Hirschberg 1

Georg Hofmann 1

Raphael Huber 1

Maira Indrikova 1

Charlott Jakob 1

Katja Jasinskaja 1

Casey Kennington 1

Steffen Kleinle 1

Valentin Knappich 1

Frauke Kreuter 1

Rainer Lienhart 1

Yang Janet Liu 1

Sharid Loáiciga 1

Nicole Macher 1

Anika Marusczyk 1

Kleio-Isidora Mavridou 1

Margot Mieskes 1

Dragan Milchevski 1

Pia Wenzel Neves 1

Frederik Tobias Oertel 1

Prisca Piccirilli 1

Barbara Plank 1

Subhash Chandra Pujari 1

Subhash Pujari 1

Benjamin Roth 1

Elizaveta Sineva 1

Stefan Thater 1

Katrin Tomanek 1

Federico Tomazic 1

Marina Valeeva 1

Christian Wartena 1

Andreas Wiedholz 1

Torsten Zesch 1

Venues