Juri Opitz - ACL Anthology

Juri Opitz

2026

Similar, but why? A Toolkit for Explaining Text Similarity
Juri Opitz | Andrianos Michail | Lucas Moeller | Sebastian Padó | Simon Clematide
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Explaining text similarity and developing interpretable models are emerging research challenges (Opitz et al., 2025). We release XPLAINSIM, a Python package that unifies three complementary approaches for explaining textual similarity in an easily accessible way: 1. a token attribution method that explains how individual word interactions contribute to the predicted similarity of any embedding model; 2. a method for inferring structured neural embedding spaces that capture explainable aspects of text, and 3. a symbolic approach that explains textual similarity transparently through parsed meaning representations. We demonstrate the value of our package through intuitive examples and three focused empirical research studies. The first study evaluates interpretability methods for constructing cross-lingual token alignments. The second investigates how modern information retrieval methods handle stop words. The third sheds more light on a long-standing question in computational linguistics: the distinction between relatedness and similarity. XPLAINSIM is available at https://github.com/flipz357/XPLAINSIM.

2025

Lexical borrowing, the adoption of words from one language into another, is a ubiquitous linguistic phenomenon influenced by geopolitical, societal, and technological factors. This paper introduces ConLoan–a novel contrastive dataset comprising sentences with and without loanwords across 10 languages. Through systematic evaluation using this dataset, we investigate how state-of-the-art machine translation and language models process loanwords compared to their native alternatives. Our experiments reveal that these systems show systematic preferences for loanwords over native terms and exhibit varying performance across languages. These findings provide valuable insights for developing more linguistically robust NLP systems.

Interpretable Text Embeddings and Text Similarity Explanation: A Survey
Juri Opitz | Lucas Moeller | Andrianos Michail | Sebastian Padó | Simon Clematide
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Text embeddings are a fundamental component in many NLP tasks, including classification, regression, clustering, and semantic search. However, despite their ubiquitous application, challenges persist in interpreting embeddings and explaining similarities between them.In this work, we provide a structured overview of methods specializing in inherently interpretable text embeddings and text similarity explanation, an underexplored research area. We characterize the main ideas, approaches, and trade-offs. We compare means of evaluation, discuss overarching lessons learned and finally identify opportunities and open challenges for future research.

Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems
Mousumi Akter | Tahiya Chowdhury | Steffen Eger | Christoph Leiter | Juri Opitz | Erion Çano
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems

Natural Language Processing RELIES on Linguistics
Juri Opitz | Shira Wein | Nathan Schneider
Computational Linguistics, Volume 51, Issue 3 - September 2025

Large Language Models have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES, which encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-à-vis systems of human language.

Adapting Multilingual Embedding Models to Historical Luxembourgish
Andrianos Michail | Corina Raclé | Juri Opitz | Simon Clematide
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models face challenges with historical content due to OCR noise and outdated spellings. This study examines multilingual embeddings for cross-lingual semantic search in historical Luxembourgish (LB), a low-resource language. We collect historical Luxembourgish news articles from various periods and use GPT-4o for sentence segmentation and translation, generating 20,000 parallel training sentences per language pair. Additionally, we create a semantic search (Historical LB Bitext Mining) evaluation set and find that existing models perform poorly on cross-lingual search for historical Luxembourgish. Using our historical and additional modern parallel training data, we adapt several multilingual embedding models through contrastive learning or knowledge distillation and increase accuracy significantly for all models. We release our adapted models and historical Luxembourgish-German/French/English bitexts to support further research.

PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
Andrianos Michail | Simon Clematide | Juri Opitz
Proceedings of the 31st International Conference on Computational Linguistics

The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models in a paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we create PARAPHRASUS, a benchmark designed for multi-dimensional assessment, benchmarking and selection of paraphrase detection models. We find that paraphrase detection models under our fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset. Furthermore, PARAPHRASUS allows prompt calibration for different use cases, tailoring LLM models to specific strictness levels. PARAPHRASUS includes 3 challenges spanning over 10 datasets, including 8 repurposed and 2 newly annotated; we release it along with a benchmarking library at https://github.com/impresso/paraphrasus

Cheap Character Noise for OCR-Robust Multilingual Embeddings
Andrianos Michail | Juri Opitz | Yining Wang | Robin Meister | Rico Sennrich | Simon Clematide
Findings of the Association for Computational Linguistics: ACL 2025

The large amount of text collections digitized by imperfect OCR systems requires semantic search models that perform robustly on noisy input. Such collections are highly heterogeneous, with varying degrees of OCR quality, spelling conventions and other inconsistencies —all phenomena that are underrepresented in the training data of standard embedding models, with ramifications for their generalization. In our paper, we show that this problem can be alleviated with a simple and inexpensive method that does not require supervision or in-domain training. Specifically, we fine-tune existing multilingual models using noisy texts and a contrastive loss. We show that these models show considerable improvements across different noise conditions. Control experiments indicate minimal, and occasionally positive, impact on standard similarity tasks. These findings suggest that embedding models can be inexpensively adapted for cross-lingual semantic search in heterogeneous, digitized corpora. We publicly release our code, datasets, and models at https://github.com/impresso/ocr-robust-multilingual-embeddings.

Sentence Smith: Controllable Edits for Evaluating Text Embeddings
Hongji Li | Andrianos Michail | Reto Gubelmann | Simon Clematide | Juri Opitz
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Controllable and transparent text generation has been a long-standing goal in NLP. Almost as long-standing is a general idea for addressing this challenge: Parsing text to a symbolic representation, and generating from it. However, earlier approaches were hindered by parsing and generation insufficiencies. Using modern parsers and a safety supervision mechanism, we show how close current methods come to this goal. Concretely, we propose the framework for English, which has three steps: 1. Parsing a sentence into a semantic graph. 2. Applying human-designed semantic manipulation rules. 3. Generating text from the manipulated graph. A final entailment check (4.) verifies the validity of the applied transformation. To demonstrate our framework’s utility, we use it to induce hard negative text pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can evaluate text embedding models in a fine-grained way, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that our transparent generation process produces texts of good quality. Notably, our way of generation is very resource-efficient, since it relies only on smaller neural networks.

2024

A Survey of AMR Applications
Shira Wein | Juri Opitz
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In the ten years since the development of the Abstract Meaning Representation (AMR) formalism, substantial progress has been made on AMR-related tasks such as parsing and alignment. Still, the engineering applications of AMR are not fully understood. In this survey, we categorize and characterize more than 100 papers which use AMR for downstream tasks— the first survey of this kind for AMR. Specifically, we highlight (1) the range of applications for which AMR has been harnessed, and (2) the techniques for incorporating AMR into those applications. We also detect broader AMR engineering patterns and outline areas of future work that seem ripe for AMR incorporation. We hope that this survey will be useful to those interested in using AMR and that it sparks discussion on the role of symbolic representations in the age of neural-focused NLP research.

Schroedinger’s Threshold: When the AUC Doesn’t Predict Accuracy
Juri Opitz
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Area Under Curve measure (AUC) seems apt to evaluate and compare diverse models, possibly without calibration. An important example of AUC application is the evaluation and benchmarking of models that predict faithfulness of generated text. But we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, yielding significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and prepare it for actual application), we explore different calibration modes, testing calibration data and method.

At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs areconcise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages?ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategiesto approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when rankingshort summaries, but may not help as much when ranking systems or longer summaries.

A Survey of Meaning Representations – From Theory to Practical Utility
Zacchary Sadeddine | Juri Opitz | Fabian Suchanek
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Symbolic meaning representations of natural language text have been studied since at least the 1960s. With the availability of large annotated corpora, and more powerful machine learning tools, the field has recently seen several new developments. In this survey, we study today’s most prominent Meaning Representation Frameworks. We shed light on their theoretical properties, as well as on their practical research environment, i.e., on datasets, parsers, applications, and future challenges.

A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice
Juri Opitz
Transactions of the Association for Computational Linguistics, Volume 12

Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called ‘macro’ metrics to rank systems (e.g., ‘macro F1’) but do not clearly specify what they would expect from such a ‘macro’ metric. This is problematic, since picking a metric can affect research findings and thus any clarity in the process should be maximized. Starting from the intuitive concepts of bias and prevalence, we perform an analysis of common evaluation metrics. The analysis helps us understand the metrics’ underlying properties, and how they align with expectations as found expressed in papers. Then we reflect on the practical situation in the field, and survey evaluation practice in recent shared tasks. We find that metric selection is often not supported with convincing arguments, an issue that can make a system ranking seem arbitrary. Our work aims at providing overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.

2023

With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Julius Steen | Juri Opitz | Anette Frank | Katja Markert
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a cartesian product of input and generated sentences, or supporting them with a question-generation/answering step. In this work we show that pure NLI models _can_ outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) Augmenting NLI training data toadapt NL inferences to the specificities of faithfulness prediction in dialogue;(2) Making use of both entailment and contradiction probabilities in NLI, and(3) Using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.

SMARAGD: Learning SMatch for Accurate and Rapid Approximate Graph Distance
Juri Opitz | Philipp Meier | Anette Frank
Proceedings of the 15th International Conference on Computational Semantics

The similarity of graph structures, such as Meaning Representations (MRs), is often assessed via structural matching algorithms, such as Smatch (Cai & Knight 2013). However, Smatch involves a combinatorial problem that suffers from NP-completeness, making large-scale applications, e.g., graph clustering or search, infeasible. To alleviate this issue, we learn SMARAGD: Semantic Match for Accurate and Rapid Approximate Graph Distance. We show the potential of neural networks to approximate Smatch scores, i) in linear time using a machine translation framework to predict alignments, or ii) in constant time using a Siamese CNN to directly predict Smatch scores. We show that the approximation error can be substantially reduced through data augmentation and graph anonymization.

Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Daniel Deutsch | Rotem Dror | Steffen Eger | Yang Gao | Christoph Leiter | Juri Opitz | Andreas Rücklé
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

SMATCH++: Standardized and Extended Evaluation of Semantic Graphs
Juri Opitz
Findings of the Association for Computational Linguistics: EACL 2023

The Smatch metric is a popular method for evaluating graph distances, as is necessary, for instance, to assess the performance of semantic graph parsing systems. However, we observe some issues in the metric that jeopardize meaningful evaluation. E.g., opaque pre-processing choices can affect results, and current graph-alignment solvers do not provide us with upper-bounds. Without upper-bounds, however, fair evaluation is not guaranteed. Furthermore, adaptions of Smatch for extended tasks (e.g., fine-grained semantic similarity) are spread out, and lack a unifying framework. For better inspection, we divide the metric into three modules: pre-processing, alignment, and scoring. Examining each module, we specify its goals and diagnose potential issues, for which we discuss and test mitigation strategies. For pre-processing, we show how to fully conform to annotation guidelines that allow structurally deviating but valid graphs. For safer and enhanced alignment, we show the feasibility of optimal alignment in a standard evaluation setup, and develop a lossless graph compression method that shrinks the search space and significantly increases efficiency. For improved scoring, we propose standardized and extended metric calculation of fine-grained sub-graph meaning aspects. Our code is available at https://github.com/flipz357/smatchpp

AMR4NLI: Interpretable and robust NLI measures from semantic graphs
Juri Opitz | Shira Wein | Julius Steen | Anette Frank | Nathan Schneider
Proceedings of the 15th International Conference on Computational Semantics

The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of *contextualized embeddings* and *semantic graphs* (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.

Similarity-weighted Construction of Contextualized Commonsense Knowledge Graphs for Knowledge-intense Argumentation Tasks
Moritz Plenz | Juri Opitz | Philipp Heinisch | Philipp Cimiano | Anette Frank
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Arguments often do not make explicit how a conclusion follows from its premises. To compensate for this lack, we enrich arguments with structured background knowledge to support knowledge-intense argumentation tasks. We present a new unsupervised method for constructing Contextualized Commonsense Knowledge Graphs (CCKGs) that selects contextually relevant knowledge from large knowledge graphs (KGs) efficiently and at high quality. Our work goes beyond context-insensitive knowledge extraction heuristics by computing semantic similarity between KG triplets and textual arguments. Using these triplet similarities as weights, we extract contextualized knowledge paths that connect a conclusion to its premise, while maximizing similarity to the argument. We combine multiple paths into a CCKG that we optionally prune to reduce noise and raise precision. Intrinsic evaluation of the quality of our graphs shows that our method is effective for (re)constructing human explanation graphs. Manual evaluations in a large-scale knowledge selection setup verify high recall and precision of implicit CSK in the CCKGs. Finally, we demonstrate the effectiveness of CCKGs in a knowledge-insensitive argument quality rating task, outperforming strong baselines and rivaling a GPT-3 based system.

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
Christoph Leiter | Juri Opitz | Daniel Deutsch | Yang Gao | Rotem Dror | Steffen Eger
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

Generative large language models (LLMs) have seen many breakthroughs over the last year. With an increasing number of parameters and pre-training data, they have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Strategies employed in this context differ in the choice of input prompts, the selection of samples for demonstration, and the methodology used to construct scores grading the generations. Approaches often differ in the input prompts, the samples that are selected for demonstration and the construction process of scores from the output. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore such approaches for machine translation evaluation and summarization eval- uation. Specifically, we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We test the approaches of the participants on a new reference-free test-set spanning 3 language pairs for machine transla- tion as well as a summarization dataset. Further, we present an overview of the approaches taken by the participants, present their results on the test set and analyze paths for future work. Fi- nally, as a separate track, we perform a human evaluation of the plausibility of explanations given by the LLMs and its effect on model performance. We make parts of our code and datasets available.

2022

Overview of the 2022 Validity and Novelty Prediction Shared Task
Philipp Heinisch | Anette Frank | Juri Opitz | Moritz Plenz | Philipp Cimiano
Proceedings of the 9th Workshop on Argument Mining

This paper provides an overview of the Argument Validity and Novelty Prediction Shared Task that was organized as part of the 9th Workshop on Argument Mining (ArgMining 2022). The task focused on the prediction of the validity and novelty of a conclusion given a textual premise. Validity is defined as the degree to which the conclusion is justified with respect to the given premise. Novelty defines the degree to which the conclusion contains content that is new in relation to the premise. Six groups participated in the task, submitting overall 13 system runs for the subtask of binary classification and 2 system runs for the subtask of relative classification. The results reveal that the task is challenging, with best results obtained for Validity prediction in the range of 75% F1 score, for Novelty prediction of 70% F1 score and for correctly predicting both Validity and Novelty of 45% F1 score. In this paper we summarize the task definition and dataset. We give an overview of the results obtained by the participating systems, as well as insights to be gained from the diverse contributions.

Strategies for framing argumentative conclusion generation
Philipp Heinisch | Anette Frank | Juri Opitz | Philipp Cimiano
Proceedings of the 15th International Conference on Natural Language Generation

Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Juri Opitz | Anette Frank
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation – through the Lens of Semantic Similarity Rating
Laura Zeidler | Juri Opitz | Anette Frank
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

Evaluating the quality of generated text is difficult, since traditional NLG evaluation metrics, focusing more on surface form than meaning, often fail to assign appropriate scores. This is especially problematic for AMR-to-text evaluation, given the abstract nature of AMR.Our work aims to support the development and improvement of NLG evaluation metrics that focus on meaning by developing a dynamic CheckList for NLG metrics that is interpreted by being organized around meaning-relevant linguistic phenomena. Each test instance consists of a pair of sentences with their AMR graphs and a human-produced textual semantic similarity or relatedness score. Our CheckList facilitates comparative evaluation of metrics and reveals strengths and weaknesses of novel and traditional metrics. We demonstrate the usefulness of CheckList by designing a new metric GraCo that computes lexical cohesion graphs over AMR concepts. Our analysis suggests that GraCo presents an interesting NLG metric worth future investigation and that meaning-oriented NLG metrics can profit from graph-based metric components using AMR.

SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
Juri Opitz | Anette Frank
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Models based on large-pretrained language models, such as S(entence)BERT, provide effective and efficient sentence embeddings that show high correlation to human similarity ratings, but lack interpretability. On the other hand, graph metrics for graph-based meaning representations (e.g., Abstract Meaning Representation, AMR) can make explicit the semantic aspects in which two sentences are similar. However, such metrics tend to be slow, rely on parsers, and do not reach state-of-the-art performance when rating sentence similarity. In this work, we aim at the best of both worlds, by learning to induce Semantically Structured Sentence BERT embeddings (S³BERT). Our S³BERT embeddings are composed of explainable sub-embeddings that emphasize various sentence meaning features (e.g., semantic roles, negation, or quantification). We show how to i) learn a decomposition of the sentence embeddings into meaning features, through approximation of a suite of interpretable semantic AMR graph metrics, and how to ii) preserve the overall power of the neural embeddings by controlling the decomposition learning process with a second objective that enforces consistency with the similarity ratings of an SBERT teacher model. In our experimental studies, we show that our approach offers interpretability – while preserving the effectiveness and efficiency of the neural sentence embeddings.

Data Augmentation for Improving the Prediction of Validity and Novelty of Argumentative Conclusions
Philipp Heinisch | Moritz Plenz | Juri Opitz | Anette Frank | Philipp Cimiano
Proceedings of the 9th Workshop on Argument Mining

We address the problem of automatically predicting the quality of a conclusion given a set of (textual) premises of an argument, focusing in particular on the task of predicting the validity and novelty of the argumentative conclusion. We propose a multi-task approach that jointly predicts the validity and novelty of the textual conclusion, relying on pre-trained language models fine-tuned on the task. As training data for this task is scarce and costly to obtain, we experimentally investigate the impact of data augmentation approaches for improving the accuracy of prediction compared to a baseline that relies on task-specific data only. We consider the generation of synthetic data as well as the integration of datasets from related argument tasks. We show that especially our synthetic data, combined with class-balancing and instance-specific learning rates, substantially improves classification results (+15.1 points in F₁-score). Using only training data retrieved from related datasets by automatically labeling them for validity and novelty, combined with synthetic data, outperforms the baseline by 11.5 points in F₁-score.

Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Daniel Deutsch | Can Udomcharoenchaikit | Juri Opitz | Yang Gao | Marina Fomicheva | Steffen Eger
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

2021

Explainable Unsupervised Argument Similarity Rating with Abstract Meaning Representation and Conclusion Generation
Juri Opitz | Philipp Heinisch | Philipp Wiesenbach | Philipp Cimiano | Anette Frank
Proceedings of the 8th Workshop on Argument Mining

When assessing the similarity of arguments, researchers typically use approaches that do not provide interpretable evidence or justifications for their ratings. Hence, the features that determine argument similarity remain elusive. We address this issue by introducing novel argument similarity metrics that aim at high performance and explainability. We show that Abstract Meaning Representation (AMR) graphs can be useful for representing arguments, and that novel AMR graph metrics can offer explanations for argument similarity ratings. We start from the hypothesis that similar premises often lead to similar conclusions—and extend an approach for AMR-based argument similarity rating by estimating, in addition, the similarity of conclusions that we automatically infer from the arguments used as premises. We show that AMR similarity metrics make argument similarity judgements more interpretable and may even support argument quality judgements. Our approach provides significant performance improvements over strong baselines in a fully unsupervised setting. Finally, we make first steps to address the problem of reference-less evaluation of argumentative conclusion generations.

Translate, then Parse! A Strong Baseline for Cross-Lingual AMR Parsing
Sarah Uhrig | Yoalli Garcia | Juri Opitz | Anette Frank
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

In cross-lingual Abstract Meaning Representation (AMR) parsing, researchers develop models that project sentences from various languages onto their AMRs to capture their essential semantic structures: given a sentence in any language, we aim to capture its core semantic content through concepts connected by manifold types of semantic relations. Methods typically leverage large silver training data to learn a single model that is able to project non-English sentences to AMRs. However, we find that a simple baseline tends to be overlooked: translating the sentences to English and projecting their AMR with a monolingual AMR parser (translate+parse,T+P). In this paper, we revisit this simple two-step base-line, and enhance it with a strong NMT system and a strong AMR parser. Our experiments show that T+P outperforms a recent state-of-the-art system across all tested languages: German, Italian, Spanish and Mandarin with +14.6, +12.6, +14.3 and +16.0 Smatch points

Weisfeiler-Leman in the Bamboo: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity
Juri Opitz | Angel Daza | Anette Frank
Transactions of the Association for Computational Linguistics, Volume 9

Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: Some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step. In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity metrics. Bamboo maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric’s robustness against meaning-altering and meaning- preserving graph transformations. We show the benefits of Bamboo by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.

Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
Juri Opitz | Anette Frank
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Systems that generate natural language text from abstract meaning representations such as AMR are typically evaluated using automatic surface matching metrics that compare the generated texts to reference texts from which the input meaning representations were constructed. We show that besides well-known issues from which such metrics suffer, an additional problem arises when applying these metrics for AMR-to-text evaluation, since an abstract meaning representation allows for numerous surface realizations. In this work we aim to alleviate these issues by proposing ℳℱ𝛽, a decomposable metric that builds on two pillars. The first is the principle of meaning preservation ℳ : it measures to what extent a given AMR can be reconstructed from the generated sentence using SOTA AMR parsers and applying (fine-grained) AMR evaluation metrics to measure the distance between the original and the reconstructed AMR. The second pillar builds on a principle of (grammatical) form ℱ that measures the linguistic quality of the generated text, which we implement using SOTA language models. In two extensive pilot studies we show that fulfillment of both principles offers benefits for AMR-to-text evaluation, including explainability of scores. Since ℳℱ𝛽 does not necessarily rely on gold AMRs, it may extend to other text generation tasks.

2020

AMR Similarity Metrics from Principles
Juri Opitz | Letitia Parcalabescu | Anette Frank
Transactions of the Association for Computational Linguistics, Volume 8

Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical Smatch metric (Cai and Knight, 2013) aligns the variables of two graphs and assesses triple matches. The recent SemBleu metric (Song and Gildea, 2019) is based on the machine-translation metric Bleu (Papineni et al., 2002) and increases computational efficiency by ablating the variable-alignment. In this paper, i) we establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR; ii) we undertake a thorough analysis of Smatch and SemBleu where we show that the latter exhibits some undesirable properties. For example, it does not conform to the identity of indiscernibles rule and introduces biases that are hard to control; and iii) we propose a novel metric S2 match that is more benevolent to only very slight meaning deviations and targets the fulfilment of all established criteria. We assess its suitability and show its advantages over Smatch and SemBleu.

AMR Quality Rating with a Lightweight CNN
Juri Opitz
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Structured semantic sentence representations such as Abstract Meaning Representations (AMRs) are potentially useful in various NLP tasks. However, the quality of automatic parses can vary greatly and jeopardizes their usefulness. This can be mitigated by models that can accurately rate AMR quality in the absence of costly gold data, allowing us to inform downstream systems about an incorporated parse’s trustworthiness or select among different candidate parses. In this work, we propose to transfer the AMR graph to the domain of images. This allows us to create a simple convolutional neural network (CNN) that imitates a human judge tasked with rating graph quality. Our experiments show that the method can rate quality more accurately than strong baselines, in several quality dimensions. Moreover, the method proves to be efficient and reduces the incurred energy consumption.

2019

Dissecting Content and Context in Argumentative Relation Analysis
Juri Opitz | Anette Frank
Proceedings of the 6th Workshop on Argument Mining

When assessing relations between argumentative units (e.g., support or attack), computational systems often exploit disclosing indicators or markers that are not part of elementary argumentative units (EAUs) themselves, but are gained from their context (position in paragraph, preceding tokens, etc.). We show that this dependency is much stronger than previously assumed. In fact, we show that by completely masking the EAU text spans and only feeding information from their context, a competitive system may function even better. We argue that an argument analysis system that relies more on discourse context than the argument’s content is unsafe, since it can easily be tricked. To alleviate this issue, we separate argumentative units from their context such that the system is forced to model and rely on an EAU’s content. We show that the resulting classification system is more robust, and argue that such models are better suited for predicting argumentative relations across documents.

Automatic Accuracy Prediction for AMR Parsing
Juri Opitz | Anette Frank
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Abstract Meaning Representation (AMR) represents sentences as directed, acyclic and rooted graphs, aiming at capturing their meaning in a machine readable format. AMR parsing converts natural language sentences into such graphs. However, evaluating a parser on new data by means of comparison to manually created AMR graphs is very costly. Also, we would like to be able to detect parses of questionable quality, or preferring results of alternative systems by selecting the ones for which we can assess good quality. We propose AMR accuracy prediction as the task of predicting several metrics of correctness for an automatically generated AMR parse – in absence of the corresponding gold parse. We develop a neural end-to-end multi-output regression model and perform three case studies: firstly, we evaluate the model’s capacity of predicting AMR parse accuracies and test whether it can reliably assign high scores to gold parses. Secondly, we perform parse selection based on predicted parse accuracies of candidate parses from alternative systems, with the aim of improving overall results. Finally, we predict system ranks for submissions from two AMR shared tasks on the basis of their predicted parse accuracy averages. All experiments are carried out across two different domains and show that our method is effective.

An Argument-Marker Model for Syntax-Agnostic Proto-Role Labeling
Juri Opitz | Anette Frank
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Semantic proto-role labeling (SPRL) is an alternative to semantic role labeling (SRL) that moves beyond a categorical definition of roles, following Dowty’s feature-based view of proto-roles. This theory determines agenthood vs. patienthood based on a participant’s instantiation of more or less typical agent vs. patient properties, such as, for example, volition in an event. To perform SPRL, we develop an ensemble of hierarchical models with self-attention and concurrently learned predicate-argument markers. Our method is competitive with the state-of-the art, overall outperforming previous work in two formulations of the task (multi-label and multi-variate Likert scale pre- diction). In contrast to previous work, our results do not depend on gold argument heads derived from supplementary gold tree banks.

2018

Addressing the Winograd Schema Challenge as a Sequence Ranking Task
Juri Opitz | Anette Frank
Proceedings of the First International Workshop on Language Cognition and Computational Models

The Winograd Schema Challenge targets pronominal anaphora resolution problems which require the application of cognitive inference in combination with world knowledge. These problems are easy to solve for humans but most difficult to solve for machines. Computational models that previously addressed this task rely on syntactic preprocessing and incorporation of external knowledge by manually crafted features. We address the Winograd Schema Challenge from a new perspective as a sequence ranking task, and design a Siamese neural sequence ranking model which performs significantly better than a random baseline, even when solely trained on sequences of words. We evaluate against a baseline and a state-of-the-art system on two data sets and show that anonymization of noun phrase candidates strongly helps our model to generalize.

Induction of a Large-Scale Knowledge Graph from the Regesta Imperii
Juri Opitz | Leo Born | Vivi Nastase
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We induce and visualize a Knowledge Graph over the Regesta Imperii (RI), an important large-scale resource for medieval history research. The RI comprise more than 150,000 digitized abstracts of medieval charters issued by the Roman-German kings and popes distributed over many European locations and a time span of more than 700 years. Our goal is to provide a resource for historians to visualize and query the RI, possibly aiding medieval history research. The resulting medieval graph and visualization tools are shared publicly.

2017

A Mention-Ranking Model for Abstract Anaphora Resolution
Ana Marasović | Leo Born | Juri Opitz | Anette Frank
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Resolving abstract anaphora is an important, but difficult task for text understanding. Yet, with recent advances in representation learning this task becomes a more tangible aim. A central property of abstract anaphora is that it establishes a relation between the anaphor embedded in the anaphoric sentence and its (typically non-nominal) antecedent. We propose a mention-ranking model that learns how abstract anaphors relate to their antecedents with an LSTM-Siamese Net. We overcome the lack of training data by generating artificial anaphoric sentence–antecedent pairs. Our model outperforms state-of-the-art results on shell noun resolution. We also report first benchmark results on an abstract anaphora subset of the ARRAU corpus. This corpus presents a greater challenge due to a mixture of nominal and pronominal anaphors and a greater range of confounders. We found model variants that outperform the baselines for nominal anaphors, without training on individual anaphor data, but still lag behind for pronominal anaphors. Our model selects syntactically plausible candidates and – if disregarding syntax – discriminates candidates using deeper features.

2016

Using Linear Classifiers for the Automatic Triage of Posts in the 2016 CLPsych Shared Task
Juri Opitz
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Deriving Players & Themes in the Regesta Imperii using SVMs and Neural Networks
Juri Opitz | Anette Frank
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Co-authors

Daniel Deutsch 4

Yang Gao (扬高) 3

Christoph Leiter 3

Lucas Moeller 2

Sebastian Padó 2

Nathan Schneider 2

Rico Sennrich 2

Mousumi Akter 1

Alessia Battisti 1

Khyathi Chandu 1

Tahiya Chowdhury 1

Miruna Clinciu 1

Marina Fomicheva 1

Yingqiang Gao 1

Yoalli Garcia 1

Sebastian Gehrmann 1

Reto Gubelmann 1

Anne Göhring 1

Micha David Hess 1

Saad Mahamood 1

Ana Marasović 1

Katja Markert 1

Philipp Meier 1

Robin Meister 1

Peshmerge Morad 1

Marcel Nawrath 1

Agnieszka Nowak 1

Maria Christina Panagiotopoulou 1

Letitia Parcalabescu 1

Stefano Perrella 1

Corina Raclé 1

Leonardo F. R. Ribeiro 1

Andreas Rücklé 1

Zacchary Sadeddine 1

Anastassia Shaitarova 1

Fabian Suchanek 1

Can Udomcharoenchaikit 1

Danilo C. Walenta 1

Philipp Wiesenbach 1

Laura Zeidler 1

Elena Álvarez-Mellado 1

Venues