Anette Frank - ACL Anthology

Anette Frank

2025

Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons
Frederick Riemenschneider | Anette Frank
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multilingual language models (MLLMs) have demonstrated remarkable abilities to transfer knowledge across languages, despite being trained without explicit cross-lingual supervision. We analyze the parameter spaces of three MLLMs to study how their representations evolve during pre-training, observing patterns consistent with compression: models initially form language-specific representations, which gradually converge into cross-lingual abstractions as training progresses. Through probing experiments, we observe a clear transition from uniform language identification capabilities across layers to more specialized layer functions. For deeper analysis, we focus on neurons that encode distinct semantic concepts. By tracing their development during pre-training, we show how they gradually align across languages. Notably, we identify specific neurons that emerge as increasingly reliable predictors for the same concepts across languages. This alignment manifests concretely in generation: once an MLLM exhibits cross-lingual generalization according to our measures, we can select concept-specific neurons identified from, e.g., Spanish text and manipulate them to guide token predictions. Remarkably, rather than generating Spanish text, the model produces semantically coherent English text. This demonstrates that cross-lingually aligned neurons encode generalized semantic representations, independent of the original language encoding.

From Argumentation to Deliberation: Perspectivized Stance Vectors for Fine-grained (Dis)agreement Analysis
Moritz Plenz | Philipp Heinisch | Janosch Gehring | Philipp Cimiano | Anette Frank
Findings of the Association for Computational Linguistics: NAACL 2025

Debating over conflicting issues is a necessary first step towards resolving conflicts. However, intrinsic perspectives of an arguer are difficult to overcome by persuasive argumentation skills. Proceeding from a debate to a deliberative process, where we can identify actionable options for resolving a conflict requires a deeper analysis of arguments and the perspectives they are grounded in - as it is only from there that one can derive mutually agreeable resolution steps. In this work we develop a framework for a deliberative analysis of arguments in a computational argumentation setup. We conduct a fine-grained analysis of perspectivized stances expressed in the arguments of different arguers or stakeholders on a given issue, aiming not only to identify their opposing views, but also shared perspectives arising from their attitudes, values or needs. We formalize this analysis in Perspectivized Stance Vectors that characterize the individual perspectivized stances of all arguers on a given issue. We construct these vectors by determining issue- and argument-specific concepts, and predict an arguer’s stance relative to each of them. The vectors allow us to measure a modulated (dis)agreement between arguers, structured by perspectives, which allows us to identify actionable points for conflict resolution, as a first step towards deliberation.

2024

Compositional Structured Explanation Generation with Dynamic Modularized Reasoning
Xiyan Fu | Anette Frank
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

In this work, we propose a new task, compositional structured explanation generation (CSEG), to facilitate research on compositional generalization in reasoning. Despite the success of language models in solving reasoning tasks, their compositional generalization capabilities are under-researched. Our new CSEG task tests a model’s ability to generalize from generating entailment trees with a limited number of inference steps – to more steps, focusing on the length and shapes of entailment trees. CSEG is challenging in requiring both reasoning and compositional generalization abilities, and by being framed as a generation task. Besides the CSEG task, we propose a new dynamic modularized reasoning model, MORSE, that factorizes the inference process into modules, where each module represents a functional unit. We adopt modularized self-attention to dynamically select and route inputs to dedicated heads, which specializes them to specific functions. Using CSEG, we compare MORSE to models from prior work. Our analyses show that the task is challenging, but that the dynamic reasoning modules of MORSE are effective, showing competitive compositional generalization abilities in a generation setting.

Graph Language Models
Moritz Plenz | Anette Frank
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While Language Models (LMs) are the workhorses of NLP, their interplay with structured knowledge graphs (KGs) is still actively researched. Current methods for encoding such graphs typically either (i) linearize them for embedding with LMs – which underutilize structural information, or (ii) use Graph Neural Networks (GNNs) to preserve the graph structure – but GNNs cannot represent text features as well as pretrained LMs. In our work we introduce a novel LM type, the Graph Language Model (GLM), that integrates the strengths of both approaches and mitigates their weaknesses. The GLM parameters are initialized from a pretrained LM to enhance understanding of individual graph concepts and triplets. Simultaneously, we design the GLM’s architecture to incorporate graph biases, thereby promoting effective knowledge distribution within the graph. This enables GLMs to process graphs, texts, and interleaved inputs of both. Empirical evaluations on relation classification tasks show that GLM embeddings surpass both LM- and GNN-based baselines in supervised and zero-shot setting, demonstrating their versatility.

The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning
Xiyan Fu | Anette Frank
Findings of the Association for Computational Linguistics: EMNLP 2024

While LLMs have emerged as performant architectures for reasoning tasks, their compositional generalization capabilities have been questioned. In this work, we introduce a Compositional Generalization Challenge for Graph-based Commonsense Reasoning (CGGC) that goes beyond previous evaluations that are based on sequences or tree structures – and instead involves a reasoning graph: It requires models to generate a natural sentence based on given concepts and a corresponding reasoning graph, where the presented graph involves a previously unseen combination of relation types. To master this challenge, models need to learn how to reason over relation tupels within the graph, and how to compose them when conceptualizing a verbalization. We evaluate seven well-known LLMs using in-context learning and find that performant LLMs still struggle in compositional generalization. We investigate potential causes of this gap by analyzing the structures of reasoning graphs, and find that different structures present varying levels of difficulty for compositional generalization. Arranging the order of demonstrations according to the structures’ difficulty shows that organizing samples in an easy-to-hard schema enhances the compositional generalization ability of LLMs.

On Measuring Faithfulness or Self-consistency of Natural Language Explanations
Letitia Parcalabescu | Anette Frank
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models’ inner workings – but rather their self-consistency at output level.Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks – including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model’s input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests.

Exploring Continual Learning of Compositional Generalization in NLI
Xiyan Fu | Anette Frank
Transactions of the Association for Computational Linguistics, Volume 12

Compositional Natural Language Inference (NLI) has been explored to assess the true abilities of neural models to perform NLI. Yet, current evaluations assume models to have full access to all primitive inferences in advance, in contrast to humans that continuously acquire inference knowledge. In this paper, we introduce the Continual Compositional Generalization in Inference (C2Gen NLI) challenge, where a model continuously acquires knowledge of constituting primitive inference tasks as a basis for compositional inferences. We explore how continual learning affects compositional generalization in NLI, by designing a continual learning setup for compositional NLI inference tasks. Our experiments demonstrate that models fail to compositionally generalize in a continual scenario. To address this problem, we first benchmark various continual learning algorithms and verify their efficacy. We then further analyze C2Gen, focusing on how to order primitives and compositional inference types, and examining correlations between subtasks. Our analyses show that by learning subtasks continuously while observing their dependencies and increasing degrees of difficulty, continual learning can enhance composition generalization ability.1

2023

Similarity-weighted Construction of Contextualized Commonsense Knowledge Graphs for Knowledge-intense Argumentation Tasks
Moritz Plenz | Juri Opitz | Philipp Heinisch | Philipp Cimiano | Anette Frank
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Arguments often do not make explicit how a conclusion follows from its premises. To compensate for this lack, we enrich arguments with structured background knowledge to support knowledge-intense argumentation tasks. We present a new unsupervised method for constructing Contextualized Commonsense Knowledge Graphs (CCKGs) that selects contextually relevant knowledge from large knowledge graphs (KGs) efficiently and at high quality. Our work goes beyond context-insensitive knowledge extraction heuristics by computing semantic similarity between KG triplets and textual arguments. Using these triplet similarities as weights, we extract contextualized knowledge paths that connect a conclusion to its premise, while maximizing similarity to the argument. We combine multiple paths into a CCKG that we optionally prune to reduce noise and raise precision. Intrinsic evaluation of the quality of our graphs shows that our method is effective for (re)constructing human explanation graphs. Manual evaluations in a large-scale knowledge selection setup verify high recall and precision of implicit CSK in the CCKGs. Finally, we demonstrate the effectiveness of CCKGs in a knowledge-insensitive argument quality rating task, outperforming strong baselines and rivaling a GPT-3 based system.

Graecia capta ferum victorem cepit. Detecting Latin Allusions to Ancient Greek Literature
Frederick Riemenschneider | Anette Frank
Proceedings of the Ancient Language Processing Workshop

Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and identification of identical sentences across Ancient Greek, Latin, and English. We generate new training data by automatically translating English into Ancient Greek texts. Further, we present a case study, demonstrating SPhilBERTa’s capability to facilitate automated detection of intertextual parallels. Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and identification of identical sentences across Ancient Greek, Latin, and English. We generate new training data by automatically translating English into Ancient Greek texts. Further, we present a case study, demonstrating SPhilBERTa’s capability to facilitate automated detection of intertextual parallels.

AMR4NLI: Interpretable and robust NLI measures from semantic graphs
Juri Opitz | Shira Wein | Julius Steen | Anette Frank | Nathan Schneider
Proceedings of the 15th International Conference on Computational Semantics

The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of *contextualized embeddings* and *semantic graphs* (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.

SETI: Systematicity Evaluation of Textual Inference
Xiyan Fu | Anette Frank
Findings of the Association for Computational Linguistics: ACL 2023

We propose SETI (Systematicity Evaluation of Textual Inference), a novel and comprehensive benchmark designed for evaluating pre-trained language models (PLMs) for their systematicity capabilities in the domain of textual inference. Specifically, SETI offers three different NLI tasks and corresponding datasets to evaluate various types of systematicity in reasoning processes. In order to solve these tasks, models are required to perform compositional inference based on known primitive constituents. We conduct experiments of SETI on six widely used PLMs. Results show that various PLMs are able to solve unseen compositional inferences when having encountered the knowledge of how to combine primitives, with good performance. However, they are considerably limited when this knowledge is unknown to the model (40-100 % points decrease). Furthermore, we find that PLMs are able to improve dramatically once exposed to crucial compositional knowledge in minimalistic shots. These findings position SETI as the first benchmark for measuring the future progress of PLMs in achieving systematicity generalization in the textual inference.

With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Julius Steen | Juri Opitz | Anette Frank | Katja Markert
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a cartesian product of input and generated sentences, or supporting them with a question-generation/answering step. In this work we show that pure NLI models _can_ outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) Augmenting NLI training data toadapt NL inferences to the specificities of faithfulness prediction in dialogue;(2) Making use of both entailment and contradiction probabilities in NLI, and(3) Using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.

SMARAGD: Learning SMatch for Accurate and Rapid Approximate Graph Distance
Juri Opitz | Philipp Meier | Anette Frank
Proceedings of the 15th International Conference on Computational Semantics

The similarity of graph structures, such as Meaning Representations (MRs), is often assessed via structural matching algorithms, such as Smatch (Cai & Knight 2013). However, Smatch involves a combinatorial problem that suffers from NP-completeness, making large-scale applications, e.g., graph clustering or search, infeasible. To alleviate this issue, we learn SMARAGD: Semantic Match for Accurate and Rapid Approximate Graph Distance. We show the potential of neural networks to approximate Smatch scores, i) in linear time using a machine translation framework to predict alignments, or ii) in constant time using a Siamese CNN to directly predict Smatch scores. We show that the approximation error can be substantially reduced through data augmentation and graph anonymization.

ACCEPT at SemEval-2023 Task 3: An Ensemble-based Approach to Multilingual Framing Detection
Philipp Heinisch | Moritz Plenz | Anette Frank | Philipp Cimiano
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes the system and experimental results of an ensemble-based approach tomultilingual framing detection for the submission of the ACCEPT team to the SemEval-2023 Task 3 on Framing Detection (Subtask 2). The approach is based on an ensemble that combines three different methods: a classifier based on large language models, a classifier based on static word embeddings, and an approach that uses external commonsense knowledge graphs, in particular, ConceptNet. The results of the three classification heads are aggregated into an overall prediction for each frame class. Our best submission yielded a micro F1-score of 50.69% (rank 10) and a macro F1-score of 50.20% (rank 3) for English articles. Our experimental results show that static word embeddings and knowledge graphs are useful components for frame detection, while the ensemble of all three methods combines the strengths of our three proposed methods. Through system ablations, we show that the commonsenseguided knowledge graphs are the outperforming method for many languages.

Exploring Large Language Models for Classical Philology
Frederick Riemenschneider | Anette Frank
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advances in NLP have led to the creation of powerful language models for many languages including Ancient Greek and Latin. While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English. We evaluate all models on morphological and syntactic tasks, including lemmatization, which demonstrates the added value of T5’s decoding abilities. We further define two probing tasks to investigate the knowledge acquired by models pre-trained on Classical texts. Our experiments provide the first benchmarking analysis of existing models of Ancient Greek. Results show that our models provide significant improvements over the SoTA. The systematic analysis of model types can inform future research in designing language models for Classical languages, including the development of novel generative tasks. We make all our models available as community resources, along with a large curated pre-training corpus for Ancient Greek, to support the creation of a larger, comparable model zoo for Classical Philology.

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Letitia Parcalabescu | Anette Frank
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves similar accuracy on a VL task to a multimodal one, indicates that so-called unimodal collapse occurred. However, accuracy-based tests fail to detect e.g., when the model prediction is wrong, while the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies in which proportions a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models for their average degree of multimodality, and (2) to measure for individual models the contribution of individual modalities for different tasks and datasets. Experiments with six VL models – LXMERT, CLIP and four ALBEF variants – on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the wide-spread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at https://github.com/Heidelberg-NLP/MM-SHAP.

2022

Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Juri Opitz | Anette Frank
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

Data Augmentation for Improving the Prediction of Validity and Novelty of Argumentative Conclusions
Philipp Heinisch | Moritz Plenz | Juri Opitz | Anette Frank | Philipp Cimiano
Proceedings of the 9th Workshop on Argument Mining

We address the problem of automatically predicting the quality of a conclusion given a set of (textual) premises of an argument, focusing in particular on the task of predicting the validity and novelty of the argumentative conclusion. We propose a multi-task approach that jointly predicts the validity and novelty of the textual conclusion, relying on pre-trained language models fine-tuned on the task. As training data for this task is scarce and costly to obtain, we experimentally investigate the impact of data augmentation approaches for improving the accuracy of prediction compared to a baseline that relies on task-specific data only. We consider the generation of synthetic data as well as the integration of datasets from related argument tasks. We show that especially our synthetic data, combined with class-balancing and instance-specific learning rates, substantially improves classification results (+15.1 points in F₁-score). Using only training data retrieved from related datasets by automatically labeling them for validity and novelty, combined with synthetic data, outperforms the baseline by 11.5 points in F₁-score.

MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning
Constantin Eichenberg | Sidney Black | Samuel Weinbach | Letitia Parcalabescu | Anette Frank
Findings of the Association for Computational Linguistics: EMNLP 2022

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state of the art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2 % of the number of samples used to train SimVLM.

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu | Michele Cafagna | Lilitta Muradjan | Anette Frank | Iacer Calixto | Albert Gatt
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

Strategies for framing argumentative conclusion generation
Philipp Heinisch | Anette Frank | Juri Opitz | Philipp Cimiano
Proceedings of the 15th International Conference on Natural Language Generation

A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation – through the Lens of Semantic Similarity Rating
Laura Zeidler | Juri Opitz | Anette Frank
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

Evaluating the quality of generated text is difficult, since traditional NLG evaluation metrics, focusing more on surface form than meaning, often fail to assign appropriate scores. This is especially problematic for AMR-to-text evaluation, given the abstract nature of AMR.Our work aims to support the development and improvement of NLG evaluation metrics that focus on meaning by developing a dynamic CheckList for NLG metrics that is interpreted by being organized around meaning-relevant linguistic phenomena. Each test instance consists of a pair of sentences with their AMR graphs and a human-produced textual semantic similarity or relatedness score. Our CheckList facilitates comparative evaluation of metrics and reveals strengths and weaknesses of novel and traditional metrics. We demonstrate the usefulness of CheckList by designing a new metric GraCo that computes lexical cohesion graphs over AMR concepts. Our analysis suggests that GraCo presents an interesting NLG metric worth future investigation and that meaning-oriented NLG metrics can profit from graph-based metric components using AMR.

Overview of the 2022 Validity and Novelty Prediction Shared Task
Philipp Heinisch | Anette Frank | Juri Opitz | Moritz Plenz | Philipp Cimiano
Proceedings of the 9th Workshop on Argument Mining

This paper provides an overview of the Argument Validity and Novelty Prediction Shared Task that was organized as part of the 9th Workshop on Argument Mining (ArgMining 2022). The task focused on the prediction of the validity and novelty of a conclusion given a textual premise. Validity is defined as the degree to which the conclusion is justified with respect to the given premise. Novelty defines the degree to which the conclusion contains content that is new in relation to the premise. Six groups participated in the task, submitting overall 13 system runs for the subtask of binary classification and 2 system runs for the subtask of relative classification. The results reveal that the task is challenging, with best results obtained for Validity prediction in the range of 75% F1 score, for Novelty prediction of 70% F1 score and for correctly predicting both Validity and Novelty of 45% F1 score. In this paper we summarize the task definition and dataset. We give an overview of the results obtained by the participating systems, as well as insights to be gained from the diverse contributions.

SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
Juri Opitz | Anette Frank
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Models based on large-pretrained language models, such as S(entence)BERT, provide effective and efficient sentence embeddings that show high correlation to human similarity ratings, but lack interpretability. On the other hand, graph metrics for graph-based meaning representations (e.g., Abstract Meaning Representation, AMR) can make explicit the semantic aspects in which two sentences are similar. However, such metrics tend to be slow, rely on parsers, and do not reach state-of-the-art performance when rating sentence similarity. In this work, we aim at the best of both worlds, by learning to induce Semantically Structured Sentence BERT embeddings (S³BERT). Our S³BERT embeddings are composed of explainable sub-embeddings that emphasize various sentence meaning features (e.g., semantic roles, negation, or quantification). We show how to i) learn a decomposition of the sentence embeddings into meaning features, through approximation of a suite of interpretable semantic AMR graph metrics, and how to ii) preserve the overall power of the neural embeddings by controlling the decomposition learning process with a second objective that enforces consistency with the similarity ratings of an SBERT teacher model. In our experimental studies, we show that our approach offers interpretability – while preserving the effectiveness and efficiency of the neural sentence embeddings.

2021

Reconstructing Implicit Knowledge with Language Models
Maria Becker | Siting Liang | Anette Frank
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

In this work we propose an approach for generating statements that explicate implicit knowledge connecting sentences in text. We make use of pre-trained language models which we refine by fine-tuning them on specifically prepared corpora that we enriched with implicit information, and by constraining them with relevant concepts and connecting commonsense knowledge paths. Manual and automatic evaluation of the generations shows that by refining language models as proposed, we can generate coherent and grammatically sound sentences that explicate implicit knowledge which connects sentence pairs in texts – on both in-domain and out-of-domain test data.

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu | Albert Gatt | Anette Frank | Iacer Calixto
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.

Translate, then Parse! A Strong Baseline for Cross-Lingual AMR Parsing
Sarah Uhrig | Yoalli Garcia | Juri Opitz | Anette Frank
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

In cross-lingual Abstract Meaning Representation (AMR) parsing, researchers develop models that project sentences from various languages onto their AMRs to capture their essential semantic structures: given a sentence in any language, we aim to capture its core semantic content through concepts connected by manifold types of semantic relations. Methods typically leverage large silver training data to learn a single model that is able to project non-English sentences to AMRs. However, we find that a simple baseline tends to be overlooked: translating the sentences to English and projecting their AMR with a monolingual AMR parser (translate+parse,T+P). In this paper, we revisit this simple two-step base-line, and enhance it with a strong NMT system and a strong AMR parser. Our experiments show that T+P outperforms a recent state-of-the-art system across all tested languages: German, Italian, Spanish and Mandarin with +14.6, +12.6, +14.3 and +16.0 Smatch points

Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
Juri Opitz | Anette Frank
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Systems that generate natural language text from abstract meaning representations such as AMR are typically evaluated using automatic surface matching metrics that compare the generated texts to reference texts from which the input meaning representations were constructed. We show that besides well-known issues from which such metrics suffer, an additional problem arises when applying these metrics for AMR-to-text evaluation, since an abstract meaning representation allows for numerous surface realizations. In this work we aim to alleviate these issues by proposing ℳℱ𝛽, a decomposable metric that builds on two pillars. The first is the principle of meaning preservation ℳ : it measures to what extent a given AMR can be reconstructed from the generated sentence using SOTA AMR parsers and applying (fine-grained) AMR evaluation metrics to measure the distance between the original and the reconstructed AMR. The second pillar builds on a principle of (grammatical) form ℱ that measures the linguistic quality of the generated text, which we implement using SOTA language models. In two extensive pilot studies we show that fulfillment of both principles offers benefits for AMR-to-text evaluation, including explainability of scores. Since ℳℱ𝛽 does not necessarily rely on gold AMRs, it may extend to other text generation tasks.

What is Multimodality?
Letitia Parcalabescu | Nils Trost | Anette Frank
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on representations and information that are relevant for a given machine learning task. With our new definition of multimodality we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards NLU.

Weisfeiler-Leman in the Bamboo: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity
Juri Opitz | Angel Daza | Anette Frank
Transactions of the Association for Computational Linguistics, Volume 9

Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: Some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step. In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity metrics. Bamboo maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric’s robustness against meaning-altering and meaning- preserving graph transformations. We show the benefits of Bamboo by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.

Generating Hypothetical Events for Abductive Inference
Debjit Paul | Anette Frank
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Abductive reasoning starts from some observations and aims at finding the most plausible explanation for these observations. To perform abduction, humans often make use of temporal and causal inferences, and knowledge about how some hypothetical situation can result in different outcomes. This work offers the first study of how such knowledge impacts the Abductive NLI task – which consists in choosing the more likely explanation for given observations. We train a specialized language model LMI that is tasked to generate what could happen next from a hypothetical scenario that evolves from a given event. We then propose a multi-task model MTL to solve the Abductive NLI task, which predicts a plausible explanation by a) considering different possible events emerging from candidate hypotheses – events generated by LMI – and b) selecting the one that is most similar to the observed outcome. We show that our MTL model improves over prior vanilla pre-trained LMs fine-tuned on Abductive NLI. Our manual evaluation and analysis suggest that learning about possible next events from different hypothetical scenarios supports abductive inference.

Grounding Plural Phrases: Countering Evaluation Biases by Individuation
Julia Suter | Letitia Parcalabescu | Anette Frank
Proceedings of the Second Workshop on Advances in Language and Vision Research

Phrase grounding (PG) is a multimodal task that grounds language in images. PG systems are evaluated on well-known benchmarks, using Intersection over Union (IoU) as evaluation metric. This work highlights a disconcerting bias in the evaluation of grounded plural phrases, which arises from representing sets of objects as a union box covering all component bounding boxes, in conjunction with the IoU metric. We detect, analyze and quantify an evaluation bias in the grounding of plural phrases and define a novel metric, c-IoU, based on a union box’s component boxes. We experimentally show that our new metric greatly alleviates this bias and recommend using it for fairer evaluation of plural phrases in PG tasks.

COCO-EX: A Tool for Linking Concepts from Texts to ConceptNet
Maria Becker | Katharina Korfhage | Anette Frank
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

In this paper we present COCO-EX, a tool for Extracting Concepts from texts and linking them to the ConceptNet knowledge graph. COCO-EX extracts meaningful concepts from natural language texts and maps them to conjunct concept nodes in ConceptNet, utilizing the maximum of relational information stored in the ConceptNet knowledge graph. COCOEX takes into account the challenging characteristics of ConceptNet, namely that – unlike conventional knowledge graphs – nodes are represented as non-canonicalized, free-form text. This means that i) concepts are not normalized; ii) they often consist of several different, nested phrase types; and iii) many of them are uninformative, over-specific, or misspelled. A commonly used shortcut to circumvent these problems is to apply string matching. We compare COCO-EX to this method and show that COCO-EX enables the extraction of meaningful, important rather than overspecific or uninformative concepts, and allows to assess more relational information stored in the knowledge graph.

COINS: Dynamically Generating COntextualized Inference Rules for Narrative Story Completion
Debjit Paul | Anette Frank
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Despite recent successes of large pre-trained language models in solving reasoning tasks, their inference capabilities remain opaque. We posit that such models can be made more interpretable by explicitly generating interim inference rules, and using them to guide the generation of task-specific textual outputs. In this paper we present Coins, a recursive inference framework that i) iteratively reads context sentences, ii) dynamically generates contextualized inference rules, encodes them, and iii) uses them to guide task-specific output generation. We apply to a Narrative Story Completion task that asks a model to complete a story with missing sentences, to produce a coherent story with plausible logical connections, causal relationships, and temporal dependencies. By modularizing inference and sentence generation steps in a recurrent model, we aim to make reasoning steps and their effects on next sentence generation transparent. Our automatic and manual evaluations show that the model generates better story sentences than SOTA baselines, especially in terms of coherence. We further demonstrate improved performance over strong pre-trained LMs in generating commonsense inference rules. The recursive nature of holds the potential for controlled generation of longer sequences.

Explainable Unsupervised Argument Similarity Rating with Abstract Meaning Representation and Conclusion Generation
Juri Opitz | Philipp Heinisch | Philipp Wiesenbach | Philipp Cimiano | Anette Frank
Proceedings of the 8th Workshop on Argument Mining

When assessing the similarity of arguments, researchers typically use approaches that do not provide interpretable evidence or justifications for their ratings. Hence, the features that determine argument similarity remain elusive. We address this issue by introducing novel argument similarity metrics that aim at high performance and explainability. We show that Abstract Meaning Representation (AMR) graphs can be useful for representing arguments, and that novel AMR graph metrics can offer explanations for argument similarity ratings. We start from the hypothesis that similar premises often lead to similar conclusions—and extend an approach for AMR-based argument similarity rating by estimating, in addition, the similarity of conclusions that we automatically infer from the arguments used as premises. We show that AMR similarity metrics make argument similarity judgements more interpretable and may even support argument quality judgements. Our approach provides significant performance improvements over strong baselines in a fully unsupervised setting. Finally, we make first steps to address the problem of reference-less evaluation of argumentative conclusion generations.

CO-NNECT: A Framework for Revealing Commonsense Knowledge Paths as Explicitations of Implicit Knowledge in Texts
Maria Becker | Katharina Korfhage | Debjit Paul | Anette Frank
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

In this work we leverage commonsense knowledge in form of knowledge paths to establish connections between sentences, as a form of explicitation of implicit knowledge. Such connections can be direct (singlehop paths) or require intermediate concepts (multihop paths). To construct such paths we combine two model types in a joint framework we call Co-nnect: a relation classifier that predicts direct connections between concepts; and a target prediction model that generates target or intermediate concepts given a source concept and a relation, which we use to construct multihop paths. Unlike prior work that relies exclusively on static knowledge sources, we leverage language models finetuned on knowledge stored in ConceptNet, to dynamically generate knowledge paths, as explanations of implicit knowledge that connects sentences in texts. As a central contribution we design manual and automatic evaluation settings for assessing the quality of the generated paths. We conduct evaluations on two argumentative datasets and show that a combination of the two model types generates meaningful, high-quality knowledge paths between sentences that reveal implicit knowledge conveyed in text.

2020

Implicit Knowledge in Argumentative Texts: An Annotated Corpus
Maria Becker | Katharina Korfhage | Anette Frank
Proceedings of the Twelfth Language Resources and Evaluation Conference

When speaking or writing, people omit information that seems clear and evident, such that only part of the message is expressed in words. Especially in argumentative texts it is very common that (important) parts of the argument are implied and omitted. We hypothesize that for argument analysis it will be beneficial to reconstruct this implied information. As a starting point for filling knowledge gaps, we build a corpus consisting of high-quality human annotations of missing and implied information in argumentative texts. To learn more about the characteristics of both the argumentative texts and the added information, we further annotate the data with semantic clause types and commonsense knowledge relations. The outcome of our work is a carefully designed and richly annotated dataset, for which we then provide an in-depth analysis by investigating characteristic distributions and correlations of the assigned labels. We reveal interesting patterns and intersections between the annotation categories and properties of our dataset, which enable insights into the characteristics of both argumentative texts and implicit knowledge in terms of structural features and semantic information. The results of our analysis can help to assist automated argument analysis and can guide the process of revealing implicit information in argumentative texts automatically.

Social Commonsense Reasoning with Multi-Head Knowledge Attention
Debjit Paul | Anette Frank
Findings of the Association for Computational Linguistics: EMNLP 2020

Social Commonsense Reasoning requires understanding of text, knowledge about social events and their pragmatic implications, as well as commonsense reasoning skills. In this work we propose a novel multi-head knowledge attention model that encodes semi-structured commonsense inference rules and learns to incorporate them in a transformer-based reasoning cell. We assess the model’s performance on two tasks that require different reasoning skills: Abductive Natural Language Inference and Counterfactual Invariance Prediction as a new task. We show that our proposed model improves performance over strong state-of-the-art models (i.e., RoBERTa) across both reasoning tasks. Notably we are, to the best of our knowledge, the first to demonstrate that a model that learns to perform counterfactual reasoning helps predicting the best explanation in an abductive reasoning task. We validate the robustness of the model’s reasoning capabilities by perturbing the knowledge and provide qualitative analysis on the model’s knowledge incorporation capabilities.

AMR Similarity Metrics from Principles
Juri Opitz | Letitia Parcalabescu | Anette Frank
Transactions of the Association for Computational Linguistics, Volume 8

Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical Smatch metric (Cai and Knight, 2013) aligns the variables of two graphs and assesses triple matches. The recent SemBleu metric (Song and Gildea, 2019) is based on the machine-translation metric Bleu (Papineni et al., 2002) and increases computational efficiency by ablating the variable-alignment. In this paper, i) we establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR; ii) we undertake a thorough analysis of Smatch and SemBleu where we show that the latter exhibits some undesirable properties. For example, it does not conform to the identity of indiscernibles rule and introduces biases that are hard to control; and iii) we propose a novel metric S2 match that is more benevolent to only very slight meaning deviations and targets the fulfilment of all established criteria. We assess its suitability and show its advantages over Smatch and SemBleu.

X-SRL: A Parallel Cross-Lingual Semantic Role Labeling Dataset
Angel Daza | Anette Frank
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Even though SRL is researched for many languages, major improvements have mostly been obtained for English, for which more resources are available. In fact, existing multilingual SRL datasets contain disparate annotation styles or come from different domains, hampering generalization in multilingual learning. In this work we propose a method to automatically construct an SRL corpus that is parallel in four languages: English, French, German, Spanish, with unified predicate and role annotations that are fully comparable across languages. We apply high-quality machine translation to the English CoNLL-09 dataset and use multilingual BERT to project its high-quality annotations to the target languages. We include human-validated test sets that we use to measure the projection quality, and show that projection is denser and more precise than a strong baseline. Finally, we train different SOTA models on our novel corpus for mono- and multilingual SRL, showing that the multilingual annotations improve performance especially for the weaker languages.

2019

An Argument-Marker Model for Syntax-Agnostic Proto-Role Labeling
Juri Opitz | Anette Frank
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Semantic proto-role labeling (SPRL) is an alternative to semantic role labeling (SRL) that moves beyond a categorical definition of roles, following Dowty’s feature-based view of proto-roles. This theory determines agenthood vs. patienthood based on a participant’s instantiation of more or less typical agent vs. patient properties, such as, for example, volition in an event. To perform SPRL, we develop an ensemble of hierarchical models with self-attention and concurrently learned predicate-argument markers. Our method is competitive with the state-of-the art, overall outperforming previous work in two formulations of the task (multi-label and multi-variate Likert scale pre- diction). In contrast to previous work, our results do not depend on gold argument heads derived from supplementary gold tree banks.

Dissecting Content and Context in Argumentative Relation Analysis
Juri Opitz | Anette Frank
Proceedings of the 6th Workshop on Argument Mining

When assessing relations between argumentative units (e.g., support or attack), computational systems often exploit disclosing indicators or markers that are not part of elementary argumentative units (EAUs) themselves, but are gained from their context (position in paragraph, preceding tokens, etc.). We show that this dependency is much stronger than previously assumed. In fact, we show that by completely masking the EAU text spans and only feeding information from their context, a competitive system may function even better. We argue that an argument analysis system that relies more on discourse context than the argument’s content is unsafe, since it can easily be tricked. To alleviate this issue, we separate argumentative units from their context such that the system is forced to model and rely on an EAU’s content. We show that the resulting classification system is more robust, and argue that such models are better suited for predicting argumentative relations across documents.

Automatic Accuracy Prediction for AMR Parsing
Juri Opitz | Anette Frank
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Abstract Meaning Representation (AMR) represents sentences as directed, acyclic and rooted graphs, aiming at capturing their meaning in a machine readable format. AMR parsing converts natural language sentences into such graphs. However, evaluating a parser on new data by means of comparison to manually created AMR graphs is very costly. Also, we would like to be able to detect parses of questionable quality, or preferring results of alternative systems by selecting the ones for which we can assess good quality. We propose AMR accuracy prediction as the task of predicting several metrics of correctness for an automatically generated AMR parse – in absence of the corresponding gold parse. We develop a neural end-to-end multi-output regression model and perform three case studies: firstly, we evaluate the model’s capacity of predicting AMR parse accuracies and test whether it can reliably assign high scores to gold parses. Secondly, we perform parse selection based on predicted parse accuracies of candidate parses from alternative systems, with the aim of improving overall results. Finally, we predict system ranks for submissions from two AMR shared tasks on the basis of their predicted parse accuracy averages. All experiments are carried out across two different domains and show that our method is effective.

Assessing the Difficulty of Classifying ConceptNet Relations in a Multi-Label Classification Setting
Maria Becker | Michael Staniek | Vivi Nastase | Anette Frank
RELATIONS - Workshop on meaning relations between phrases and sentences

Commonsense knowledge relations are crucial for advanced NLU tasks. We examine the learnability of such relations as represented in ConceptNet, taking into account their specific properties, which can make relation classification difficult: a given concept pair can be linked by multiple relation types, and relations can have multi-word arguments of diverse semantic types. We explore a neural open world multi-label classification approach that focuses on the evaluation of classification accuracy for individual relations. Based on an in-depth study of the specific properties of the ConceptNet resource, we investigate the impact of different relation representations and model variations. Our analysis reveals that the complexity of argument types and relation ambiguity are the most important challenges to address. We design a customized evaluation method to address the incompleteness of the resource that can be expanded in future work.

Ranking and Selecting Multi-Hop Knowledge Paths to Better Predict Human Needs
Debjit Paul | Anette Frank
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

To make machines better understand sentiments, research needs to move from polarity identification to understanding the reasons that underlie the expression of sentiment. Categorizing the goals or needs of humans is one way to explain the expression of sentiment in text. Humans are good at understanding situations described in natural language and can easily connect them to the character’s psychological needs using commonsense knowledge. We present a novel method to extract, rank, filter and select multi-hop relation paths from a commonsense knowledge resource to interpret the expression of sentiment in terms of their underlying human needs. We efficiently integrate the acquired knowledge paths in a neural model that interfaces context representations with knowledge using a gated attention mechanism. We assess the model’s performance on a recently published dataset for categorizing human needs. Selectively integrating knowledge paths boosts performance and establishes a new state-of-the-art. Our model offers interpretability through the learned attention map over commonsense knowledge paths. Human evaluation highlights the relevance of the encoded knowledge.

Translate and Label! An Encoder-Decoder Approach for Cross-lingual Semantic Role Labeling
Angel Daza | Anette Frank
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose a Cross-lingual Encoder-Decoder model that simultaneously translates and generates sentences with Semantic Role Labeling annotations in a resource-poor target language. Unlike annotation projection techniques, our model does not need parallel data during inference time. Our approach can be applied in monolingual, multilingual and cross-lingual settings and is able to produce dependency-based and span-based SRL annotations. We benchmark the labeling performance of our model in different monolingual and multilingual settings using well-known SRL datasets. We then train our model in a cross-lingual setting to generate new SRL labeled data. Finally, we measure the effectiveness of our method by using the generated data to augment the training basis for resource-poor languages and perform manual evaluation to show that it produces high-quality sentences and assigns accurate semantic role annotations. Our proposed architecture offers a flexible method for leveraging SRL data in multiple languages.

Discourse-Aware Semantic Self-Attention for Narrative Reading Comprehension
Todor Mihaylov | Anette Frank
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this work, we propose to use linguistic annotations as a basis for a Discourse-Aware Semantic Self-Attention encoder that we employ for reading comprehension on narrative texts. We extract relations between discourse units, events, and their arguments as well as coreferring mentions, using available annotation tools. Our empirical evaluation shows that the investigated structures improve the overall performance (up to +3.4 Rouge-L), especially intra-sentential and cross-sentential discourse relations, sentence-internal semantic role relations, and long-distance coreference relations. We show that dedicating self-attention heads to intra-sentential relations and relations connecting neighboring sentences is beneficial for finding answers to questions in longer contexts. Our findings encourage the use of discourse-semantic annotations to enhance the generalization capacity of self-attention models for reading comprehension.

2018

DeModify: A Dataset for Analyzing Contextual Constraints on Modifier Deletion
Vivi Nastase | Devon Fritz | Anette Frank
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge
Todor Mihaylov | Anette Frank
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a neural reading comprehension model that integrates external commonsense knowledge, encoded as a key-value memory, in a cloze-style setting. Instead of relying only on document-to-question interaction or discrete features as in prior work, our model attends to relevant external knowledge and combines this knowledge with the context representation before inferring the answer. This allows the model to attract and imply knowledge from an external knowledge source that is not explicitly stated in the text, but that is relevant for inferring the answer. Our model improves results over a very strong baseline on a hard Common Nouns dataset, making it a strong competitor of much more complex models. By including knowledge explicitly, our model can also provide evidence about the background knowledge used in the RC process.

A Sequence-to-Sequence Model for Semantic Role Labeling
Angel Daza | Anette Frank
Proceedings of the Third Workshop on Representation Learning for NLP

We explore a novel approach for Semantic Role Labeling (SRL) by casting it as a sequence-to-sequence process. We employ an attention-based model enriched with a copying mechanism to ensure faithful regeneration of the input sequence, while enabling interleaved generation of argument role labels. We apply this model in a monolingual setting, performing PropBank SRL on English language data. The constrained sequence generation set-up enforced with the copying mechanism allows us to analyze the performance and special properties of the model on manually labeled data and benchmarking against state-of-the-art sequence labeling models. We show that our model is able to solve the SRL argument labeling task on English data, yet further structural decoding constraints will need to be added to make the model truly competitive. Our work represents the first step towards more advanced, generative SRL labeling setups.

Addressing the Winograd Schema Challenge as a Sequence Ranking Task
Juri Opitz | Anette Frank
Proceedings of the First International Workshop on Language Cognition and Computational Models

The Winograd Schema Challenge targets pronominal anaphora resolution problems which require the application of cognitive inference in combination with world knowledge. These problems are easy to solve for humans but most difficult to solve for machines. Computational models that previously addressed this task rely on syntactic preprocessing and incorporation of external knowledge by manually crafted features. We address the Winograd Schema Challenge from a new perspective as a sequence ranking task, and design a Siamese neural sequence ranking model which performs significantly better than a random baseline, even when solely trained on sequences of words. We evaluate against a baseline and a state-of-the-art system on two data sets and show that anonymization of noun phrase candidates strongly helps our model to generalize.

Classifying Semantic Clause Types With Recurrent Neural Networks: Analysis of Attention, Context & Genre Characteristics
Maria Becker | Michael Staniek | Vivi Nastase | Alexis Palmer | Anette Frank
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]

SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning with Semantic Role Labeling
Ana Marasović | Anette Frank
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

For over a decade, machine learning has been used to extract opinion-holder-target structures from text to answer the question “Who expressed what kind of sentiment towards what?”. Recent neural approaches do not outperform the state-of-the-art feature-based models for Opinion Role Labeling (ORL). We suspect this is due to the scarcity of labeled training data and address this issue using different multi-task learning (MTL) techniques with a related task which has substantially more data, i.e. Semantic Role Labeling (SRL). We show that two MTL models improve significantly over the single-task model for labeling of both holders and targets, on the development and the test sets. We found that the vanilla MTL model, which makes predictions using only shared ORL and SRL features, performs the best. With deeper analysis we determine what works and what might be done to make further improvements for ORL.

2017

Assessing SRL Frameworks with Automatic Training Data Expansion
Silvana Hartmann | Éva Mújdricza-Maydt | Ilia Kuznetsov | Iryna Gurevych | Anette Frank
Proceedings of the 11th Linguistic Annotation Workshop

We present the first experiment-based study that explicitly contrasts the three major semantic role labeling frameworks. As a prerequisite, we create a dataset labeled with parallel FrameNet-, PropBank-, and VerbNet-style labels for German. We train a state-of-the-art SRL tool for German for the different annotation styles and provide a comparative analysis across frameworks. We further explore the behavior of the frameworks with automatic training data generation. VerbNet provides larger semantic expressivity than PropBank, and we find that its generalization capacity approaches PropBank in SRL training, but it benefits less from training data expansion than the sparse-data affected FrameNet.

Story Cloze Ending Selection Baselines and Data Examination
Todor Mihaylov | Anette Frank
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

This paper describes two supervised baseline systems for the Story Cloze Test Shared Task (Mostafazadeh et al., 2016a). We first build a classifier using features based on word embeddings and semantic similarity computation. We further implement a neural LSTM system with different encoding strategies that try to model the relation between the story and the provided endings. Our experiments show that a model using representation features based on average word embedding vectors over the given story words and the candidate ending sentences words, joint with similarity features between the story and candidate ending representations performed better than the neural models. Our best model based on achieves an accuracy of 72.42, ranking 3rd in the official evaluation.

A Mention-Ranking Model for Abstract Anaphora Resolution
Ana Marasović | Leo Born | Juri Opitz | Anette Frank
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Resolving abstract anaphora is an important, but difficult task for text understanding. Yet, with recent advances in representation learning this task becomes a more tangible aim. A central property of abstract anaphora is that it establishes a relation between the anaphor embedded in the anaphoric sentence and its (typically non-nominal) antecedent. We propose a mention-ranking model that learns how abstract anaphors relate to their antecedents with an LSTM-Siamese Net. We overcome the lack of training data by generating artificial anaphoric sentence–antecedent pairs. Our model outperforms state-of-the-art results on shell noun resolution. We also report first benchmark results on an abstract anaphora subset of the ARRAU corpus. This corpus presents a greater challenge due to a mixture of nominal and pronominal anaphors and a greater range of confounders. We found model variants that outperform the baselines for nominal anaphors, without training on individual anaphor data, but still lag behind for pronominal anaphors. Our model selects syntactically plausible candidates and – if disregarding syntax – discriminates candidates using deeper features.

Classifying Semantic Clause Types: Modeling Context and Genre Characteristics with Recurrent Neural Networks and Attention
Maria Becker | Michael Staniek | Vivi Nastase | Alexis Palmer | Anette Frank
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Detecting aspectual properties of clauses in the form of situation entity types has been shown to depend on a combination of syntactic-semantic and contextual features. We explore this task in a deep-learning framework, where tuned word representations capture lexical, syntactic and semantic features. We introduce an attention mechanism that pinpoints relevant context not only for the current instance, but also for the larger context. Apart from implicitly capturing task relevant features, the advantage of our neural model is that it avoids the need to reproduce linguistic features for other languages and is thus more easily transferable. We present experiments for English and German that achieve competitive performance. We present a novel take on modeling and exploiting genre information and showcase the adaptation of our system from one language to another.

What do we need to know about an unknown word when parsing German
Bich-Ngoc Do | Ines Rehbein | Anette Frank
Proceedings of the First Workshop on Subword and Character Level Models in NLP

We propose a new type of subword embedding designed to provide more information about unknown compounds, a major source for OOV words in German. We present an extrinsic evaluation where we use the compound embeddings as input to a neural dependency parser and compare the results to the ones obtained with other types of embeddings. Our evaluation shows that adding compound embeddings yields a significant improvement of 2% LAS over using word embeddings when no POS information is available. When adding POS embeddings to the input, however, the effect levels out. This suggests that it is not the missing information about the semantics of the unknown words that causes problems for parsing German, but the lack of morphological information for unknown words. To augment our evaluation, we also test the new embeddings in a language modelling task that requires both syntactic and semantic information.

Universal Dependencies are Hard to Parse – or are They?
Ines Rehbein | Julius Steen | Bich-Ngoc Do | Anette Frank
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

Deriving Players & Themes in the Regesta Imperii using SVMs and Neural Networks
Juri Opitz | Anette Frank
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Multilingual Modal Sense Classification using a Convolutional Neural Network
Ana Marasović | Anette Frank
Proceedings of the 1st Workshop on Representation Learning for NLP

A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures
Richard Eckart de Castilho | Éva Mújdricza-Maydt | Seid Muhie Yimam | Silvana Hartmann | Iryna Gurevych | Anette Frank | Chris Biemann
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. On a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task.

Argumentative texts and clause types
Maria Becker | Alexis Palmer | Anette Frank
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

Modal Sense Classification At Large: Paraphrase-Driven Sense Projection, Semantically Enriched Classification Models and Cross-Genre Evaluations
Ana Marasović | Mengfei Zhou | Alexis Palmer | Anette Frank
Linguistic Issues in Language Technology, Volume 14, 2016 - Modality: Logic, Semantics, Annotation, and Machine Learning

Modal verbs have different interpretations depending on their context. Their sense categories – epistemic, deontic and dynamic – provide important dimensions of meaning for the interpretation of discourse. Previous work on modal sense classification achieved relatively high performance using shallow lexical and syntactic features drawn from small-size annotated corpora. Due to the restricted empirical basis, it is difficult to assess the particular difficulties of modal sense classification and the generalization capacity of the proposed models. In this work we create large-scale, high-quality annotated corpora for modal sense classification using an automatic paraphrase-driven projection approach. Using the acquired corpora, we investigate the modal sense classification task from different perspectives.

Implicit Semantic Roles in a Multilingual Setting
Jennifer Sikos | Yannick Versley | Anette Frank
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

Combining Semantic Annotation of Word Sense & Semantic Roles: A Novel Annotation Scheme for VerbNet Roles on German Language Data
Éva Mújdricza-Maydt | Silvana Hartmann | Iryna Gurevych | Anette Frank
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a VerbNet-based annotation scheme for semantic roles that we explore in an annotation study on German language data that combines word sense and semantic role annotation. We reannotate a substantial portion of the SALSA corpus with GermaNet senses and a revised scheme of VerbNet roles. We provide a detailed evaluation of the interaction between sense and role annotation. The resulting corpus will allow us to compare VerbNet role annotation for German to FrameNet and PropBank annotation by mapping to existing role annotations on the SALSA corpus. We publish the annotated corpus and detailed guidelines for the new role annotation scheme.

Discourse Relation Sense Classification Using Cross-argument Semantic Similarity Based on Word Embeddings
Todor Mihaylov | Anette Frank
Proceedings of the CoNLL-16 shared task

2015

Seed-Based Event Trigger Labeling: How far can event descriptions get us?
Ofer Bronstein | Ido Dagan | Qi Li | Heng Ji | Anette Frank
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Inducing Implicit Arguments from Comparable Texts: A Framework and Its Applications
Michael Roth | Anette Frank
Computational Linguistics, Volume 41, Issue 4 - December 2015

Analyzing Sentiment in Classical Chinese Poetry
Yufang Hou | Anette Frank
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Semantically Enriched Models for Modal Sense Classification
Mengfei Zhou | Anette Frank | Annemarie Friedrich | Alexis Palmer
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

2014

Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)
Johan Bos | Anette Frank | Roberto Navigli
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

2013

Automatically Identifying Implicit Arguments to Improve Argument Linking and Coherence Modeling
Michael Roth | Anette Frank
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

Predicate-specific Annotations for Implicit Role Binding: Corpus Annotation, Data Analysis and Evaluation Experiments
Tatjana Moor | Michael Roth | Anette Frank
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Short Papers

2012

Casting Implicit Role Linking as an Anaphora Resolution Task
Carina Silberer | Anette Frank
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Aligning Predicate Argument Structures in Monolingual Comparable Texts: A New Corpus for a New Task
Michael Roth | Anette Frank
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
Michael Roth | Anette Frank
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
Matthias Hartung | Anette Frank
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Assessing Interpretable, Attribute-related Meaning Representations for Adjective-Noun Phrases in a Similarity Prediction Task
Matthias Hartung | Anette Frank
Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics

2010

Assessing the Challenge of Fine-Grained Named Entity Recognition and Classification
Asif Ekbal | Eva Sourjikova | Anette Frank | Simone Paolo Ponzetto
Proceedings of the 2010 Named Entities Workshop

A Semi-supervised Type-based Classification of Adjectives: Distinguishing Properties and Relations
Matthias Hartung | Anette Frank
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a semi-supervised machine-learning approach for the classification of adjectives into property- vs. relation-denoting adjectives, a distinction that is highly relevant for ontology learning. The feasibility of this classification task is evaluated in a human annotation experiment. We observe that token-level annotation of these classes is expensive and difficult. Yet, a careful corpus analysis reveals that adjective classes tend to be stable, with few occurrences of class shifts observed at the token level. As a consequence, we opt for a type-based semi-supervised classification approach. The class labels obtained from manual annotation are projected to large amounts of unannotated token samples. Training on heuristically labeled data yields high classification performance on our own data and on a data set compiled from WordNet. Our results suggest that it is feasible to automatically distinguish adjectives denoting properties and relations, using small amounts of annotated data.

Using NLP Methods for the Analysis of Rituals
Nils Reiter | Oliver Hellwig | Anand Mishra | Anette Frank | Jens Burkhardt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper gives an overview of an interdisciplinary research project that is concerned with the application of computational linguistics methods to the analysis of the structure and variance of rituals, as investigated in ritual science. We present motivation and prospects of a computational approach to ritual research, and explain the choice of specific analysis techniques. We discuss design decisions for data collection and processing and present the general NLP architecture. For the analysis of ritual descriptions, we apply the frame semantics paradigm with newly invented frames where appropriate. Using scientific ritual research literature, we experimented with several techniques of automatic extraction of domain terms for the domain of rituals. As ritual research is a highly interdisciplinary endeavour, a vocabulary common to all sub-areas of ritual research can is hard to specify and highly controversial. The domain terms extracted from ritual research literature are used as a basis for a common vocabulary and thus help the creation of ritual specific frames. We applied the tf*idf, 2 and PageRank algorithm to our ritual research literature corpus and two non-domain corpora: The British National Corpus and the British Academic Written English corpus. All corpora have been part of speech tagged and lemmatized. The domain terms have been evaluated by two ritual experts independently. Interestingly, the results of the algorithms were different for different parts of speech. This finding is in line with the fact that the inter-annotator agreement also differs between parts of speech.

Computing EM-based Alignments of Routes and Route Directions as a Basis for Natural Language Generation
Michael Roth | Anette Frank
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

A Structured Vector Space Model for Hidden Attribute Meaning in Adjective-Noun Phrases
Matthias Hartung | Anette Frank
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Identifying Generic Noun Phrases
Nils Reiter | Anette Frank
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Creating an Annotated Corpus for Generating Walking Directions
Stephanie Schuldes | Michael Roth | Anette Frank | Michael Strube
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

A NLG-based Application for Walking Directions
Michael Roth | Anette Frank
Proceedings of the ACL-IJCNLP 2009 Software Demonstrations

2008

Projection-based Acquisition of a Temporal Labeller
Kathrin Spreyer | Anette Frank
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

A Resource-Poor Approach for Linking Ontology Classes to Wikipedia Articles
Nils Reiter | Matthias Hartung | Anette Frank
Semantics in Text Processing. STEP 2008 Conference Proceedings

Formalising Multi-layer Corpora in OWL DL - Lexicon Modelling, Querying and Consistency Control
Aljoscha Burchardt | Sebastian Padó | Dennis Spohr | Anette Frank | Ulrich Heid
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

2007

A Semantic Approach To Textual Entailment: System Evaluation and Task Analysis
Aljoscha Burchardt | Nils Reiter | Stefan Thater | Anette Frank
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing

2006

Contextual phenomena and thematic relations in database QA dialogues: results from a Wizard-of-Oz Experiment
Núria Bertomeu | Hans Uszkoreit | Anette Frank | Hans-Ulrich Krieger | Brigitte Jörg
Proceedings of the Interactive Question Answering Workshop at HLT-NAACL 2006

SALTO - A Versatile Multi-Level Annotation Tool
Aljoscha Burchardt | Katrin Erk | Anette Frank | Andrea Kowalski | Sebastian Pado
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the SALTO tool. It was originally developed for the annotation of semantic roles in the frame semantics paradigm, but can be used for graphical annotation of treebanks with general relational information in a simple drag-and-drop fashion. The tool additionally supports corpus management and quality control.

The SALSA Corpus: a German Corpus Resource for Lexical Semantics
Aljoscha Burchardt | Katrin Erk | Anette Frank | Andrea Kowalski | Sebastian Padó | Manfred Pinkal
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the SALSA corpus, a large German corpus manually annotated with manual role-semantic annotation, based on the syntactically annotated TIGER newspaper corpus. The first release, comprising about 20,000 annotated predicate instances (about half the TIGER corpus), is scheduled for mid-2006. In this paper we discuss the annotation framework (frame semantics) and its cross-lingual applicability, problems arising from exhaustive annotation, strategies for quality control, and possible applications.

2005

The TIGER 700 RMRS Bank: RMRS Construction from Dependencies
Kathrin Spreyer | Anette Frank
Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC-2005)

2004

Constraint-based RMRS Construction from Shallow Grammars
Anette Frank
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Corpus-based Induction of an LFG Syntax-Semantics Interface for Frame Semantic Processing
Anette Frank | Jirí Semecky
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora

2003

Integrated Shallow and Deep Parsing: TopP Meets HPSG
Anette Frank | Markus Becker | Berthold Crysmann | Bernd Kiefer | Ulrich Schäfer
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

2002

A Stochastic Topological Parser for German
Markus Becker | Anette Frank
COLING 2002: The 19th International Conference on Computational Linguistics

An Integrated Archictecture for Shallow and Deep Processing
Berthold Crysmann | Anette Frank | Bernd Kiefer | Stefan Mueller | Guenter Neumann | Jakub Piskorski | Ulrich Schaefer | Melanie Siegel | Hans Uszkoreit | Feiyu Xu | Markus Becker | Hans-Ulrich Krieger
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

1999

From parallel grammar development towards machine translation – a project overview
Anette Frank
Proceedings of Machine Translation Summit VII

We give an overview of a MT research project jointly undertaken by Xerox PARC and XRCE Grenoble. The project builds on insights and resources in large-scale development of parallel LFG grammars. The research approach towards translation focuses on innovative computational technologies which lead to a flexible translation architecture. Efficient processing of "packed" ambiguities not only enables ambiguity preserving transfer. It is at the heart of a flexible architectural design, open for various extensions which take the right decisions at the right time.

1998

Syntactic and Semantic Transfer with F-Structures
Michael Dorna | Anette Frank | Josef van Genabith | Martin C. Emele
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Syntactic and Semantic Transfer with F-Structures
Michael Dorna | Anette Frank | Josef van Genabith | Martin C. Emele
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

1995

Principle Based Semantics for HPSG
Anette Frank | Uwe Reyle
Seventh Conference of the European Chapter of the Association for Computational Linguistics

Co-authors

Philipp Heinisch 7

Matthias Hartung 5

Alexis Palmer 5

Aljoscha Burchardt 4

Ana Marasović 4

Todor Mihaylov 4

Markus Becker 3

Iryna Gurevych 3

Silvana Hartmann 3

Katharina Korfhage 3

Éva Mújdricza-Maydt 3

Sebastian Padó 3

Frederick Riemenschneider 3

Michael Staniek 3

Iacer Calixto 2

Berthold Crysmann 2

Michael Dorna 2

Martin C. Emele 2

Andrea Kowalski 2

Hans-Ulrich Krieger 2

Ulrich Schäfer 2

Kathrin Spreyer 2

Hans Uszkoreit 2

Josef van Genabith 2

Núria Bertomeu 1

Chris Biemann 1

Ofer Bronstein 1

Jens Burkhardt 1

Michele Cafagna 1

Richard Eckart De Castilho 1

Constantin Eichenberg 1

Annemarie Friedrich 1

Yoalli Garcia 1

Janosch Gehring 1

Oliver Hellwig 1

Brigitte Jörg 1

Ilia Kuznetsov 1

Katja Markert 1

Philipp Meier 1

Lilitta Muradjan 1

Stefan Müller 1

Roberto Navigli 1

Günter Neumann 1

Manfred Pinkal 1

Jakub Piskorski 1

Simone Paolo Ponzetto 1

Nathan Schneider 1

Stephanie Schuldes 1

Jiří Semecký 1

Melanie Siegel 1

Jennifer Sikos 1

Carina Silberer 1

Eva Sourjikova 1

Michael Strube 1

Stefan Thater 1

Yannick Versley 1

Samuel Weinbach 1

Philipp Wiesenbach 1

Seid Muhie Yimam 1

Laura Zeidler 1

Venues