Laura Kallmeyer

2025

On the Relation Between Fine-Tuning, Topological Properties, and Task Performance in Sense-Enhanced Embeddings
Deniz Ekin Yavas | Timothée Bernard | Benoit Crabbé | Laura Kallmeyer
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Topological properties of embeddings, such as isotropy and uniformity, are closely linked to their expressiveness, and improving these properties enhances the embeddings’ ability to capture nuanced semantic distinctions. However, fine-tuning can reduce the expressiveness of the embeddings of language models. This study investigates the relation between fine-tuning, topology of the embedding space, and task performance in the context of sense knowledge enhancement, focusing on identifying the topological properties that contribute to the success of sense-enhanced embeddings. We experiment with two fine-tuning methods: *Supervised Contrastive Learning (SCL)* and *Supervised Predictive Learning (SPL)*. Our results show that SPL, the most standard approach, exhibits varying effectiveness depending on the language model and is inconsistent in producing successful sense-enhanced embeddings. In contrast, SCL achieves this consistently. Furthermore, while the embeddings with only increased *sense-alignment* show reduced task performance, those that also exhibit high *isotropy* and balance *uniformity* with *sense-alignment* achieve the best results. Additionally, our findings indicate that supervised and unsupervised tasks benefit from these topological properties to varying degrees.

Modelling Expectation-based and Memory-based Predictors of Human Reading Times with Syntax-guided Attention
Lukas Mielczarek | Timothée Bernard | Laura Kallmeyer | Katharina Spalek | Benoit Crabbé
Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)

The correlation between reading times and surprisal is well known in psycholinguistics and is easy to observe. There is also a correlation between reading times and structural integration, which is, however, harder to detect (Gibson, 2000). This correlation has been studied using parsing models whose outputs are linked to reading times. In this paper, we study the relevance of memory-based effects in reading times and how to predict them using neural language models. We find that integration costs significantly improve surprisal-based reading time prediction. Inspired by Timkey and Linzen (2023), we design a small-scale autoregressive transformer language model in which attention heads are supervised by dependency relations. We compare this model to a standard variant by checking how well each model’s outputs correlate with human reading times and find that predicted attention scores can be effectively used as proxies for syntactic integration costs to predict self-paced reading times.

Psycholinguistically motivated Construction-based Tree Adjoining Grammar
Shingo Hattori | Laura Kallmeyer | Rainer Osswald
Proceedings of the Second International Workshop on Construction Grammars and NLP

This paper proposes a formal framework based on Tree Adjoining Grammar (TAG) that aims to incorporate central tenets of Construction Grammar while integrating mechanisms from a psycholinguistically motivated variant of TAG. Central ideas are (i) to give TAG-inspired tree representation to various constructions including schematic constructions like argument structure constructions, (ii) to link schematic constructions that are extensions of each other within a network of constructions, (iii) to make the derivation proceed incrementally, (iv) to allow the prediction of upcoming constructions during derivation and (v) to introduce the incremental extension of schematic constructions to larger ones via extension trees in a usage-based manner. The final point is the major novel contribution, which can be conceptualized as the on-the-fly traversal of the inheritance links in the network of constructions. Moreover, we present first experiments towards a parser implementation. We report preliminary results of extracting constructions from the Penn Treebank and automatically identifying constructions to be added during incremental parsing, based on a generative language model (GPT-2).

Cococorpus: a corpus of copredication
Long Chen | Deniz Ekin Yavaş | Laura Kallmeyer | Rainer Osswald
Proceedings of the 21st Joint ACL - ISO Workshop on Interoperable Semantic Annotation (ISA-21)

While copredication has been widely investigated as a linguistic phenomenon, there is a notable lack of systematically annotated data to support empirical and quantitative research. This paper gives an overview of the ongoing construction of Cococorpus, a corpus of copredication, describes the annotation methodology and guidelines, and presents preliminary findings from the annotated data. Currently, the corpus contains 1500 gold-standard manual annotations including about 200 sentences with copredications. The annotated data not only supports the empirical validation for existing theories of copredication, but also reveals regularities that may inform theoretical development.

Proceedings of the 16th International Conference on Computational Semantics
Kilian Evang | Laura Kallmeyer | Sylvain Pogodalla
Proceedings of the 16th International Conference on Computational Semantics

The Proper Treatment of Verbal Idioms in German Discourse Representation Structure Parsing
Kilian Evang | Rafael Ehren | Laura Kallmeyer
Proceedings of the 16th International Conference on Computational Semantics

Existing datasets for semantic parsing lack adequate representations of potentially idiomatic expressions (PIEs), i.e., expressions consisting of two or more lexemes that can occur with either a literal or an idiomatic reading. As a result, we cannot test semantic parsers for their ability to correctly distinguish between the two cases, and to assign appropriate meaning representations. We address this situation by combining two semantically annotated resources to obtain a corpus of German sentences containing literal and idiomatic occurrences of PIEs, paired with meaning representations whose concepts and roles reflect the respective literal or idiomatic meaning. Experiments with a state-of-the-art semantic parser show that given appropriate training data, it can learn to predict the idiomatic meanings and improve performance also for literal readings, even though predicting the correct concepts in context remains challenging. We provide additional insights through evaluation on synthetic data.

2024

To Leave No Stone Unturned: Annotating Verbal Idioms in the Parallel Meaning Bank
Rafael Ehren | Kilian Evang | Laura Kallmeyer
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

Idioms present many challenges to semantic annotation in a lexicalized framework, which leads to them being underrepresented or inadequately annotated in sembanks. In this work, we address this problem with respect to verbal idioms in the Parallel Meaning Bank (PMB), specifically in its German part, where only some idiomatic expressions have been annotated correctly. We first select candidate idiomatic expressions, then determine their idiomaticity status and whether they are decomposable or not, and then we annotate their semantics using WordNet senses and VerbNet semantic roles. Overall, inter-annotator agreement is very encouraging. A difficulty, however, is to choose the correct word sense. This is not surprising, given that English synsets are many and there is often no unique mapping from German idioms and words to them. Besides this, there are many subtle differences and interesting challenging cases. We discuss some of them in this paper.

Dissecting Paraphrases: The Impact of Prompt Syntax and supplementary Information on Knowledge Retrieval from Pretrained Language Models
Stephan Linzbach | Dimitar Dimitrov | Laura Kallmeyer | Kilian Evang | Hajira Jabeen | Stefan Dietze
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Pre-trained Language Models (PLMs) are known to contain various kinds of knowledge. One method to infer relational knowledge is through the use of cloze-style prompts, where a model is tasked to predict missing subjects or objects. Typically, designing these prompts is a tedious task because small differences in syntax or semantics can have a substantial impact on knowledge retrieval performance. Simultaneously, evaluating the impact of either prompt syntax or information is challenging due to their interdependence. We designed CONPARE-LAMA – a dedicated probe, consisting of 34 million distinct prompts that facilitate comparison across minimal paraphrases. These paraphrases follow a unified meta-template enabling the controlled variation of syntax and semantics across arbitrary relations. CONPARE-LAMA enables insights into the independent impact of either syntactical form or semantic information of paraphrases on the knowledge retrieval performance of PLMs. Extensive knowledge retrieval experiments using our probe reveal that prompts following clausal syntax have several desirable properties in comparison to appositive syntax: i) they are more useful when querying PLMs with a combination of supplementary information, ii) knowledge is more consistently recalled across different combinations of supplementary information, and iii) they decrease response uncertainty when retrieving known facts. In addition, range information can boost knowledge retrieval performance more than domain information, even though domain information is more reliably helpful across syntactic forms.

Multilingual Nonce Dependency Treebanks: Understanding how Language Models Represent and Process Syntactic Structure
David Arps | Laura Kallmeyer | Younes Samih | Hassan Sajjad
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by perplexity scores of autoregressive (ALM) and masked language models (MLM). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that the performance declines on both MLMs and ALMs with respect to the original test data. However, most of the performance is retained, suggesting that the probe indeed learns syntax independently from semantics.

Improving Word Sense Induction through Adversarial Forgetting of Morphosyntactic Information
Deniz Ekin Yavas | Timothée Bernard | Laura Kallmeyer | Benoît Crabbé
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

This paper addresses the problem of word sense induction (WSI) via clustering of word embeddings. It starts from the hypothesis that contextualized word representations obtained from pre-trained language models (LMs), while being a valuable source for WSI, encode more information than is necessary for the identification of word senses, and that some of this information affects performance negatively in unsupervised settings. We investigate whether using contextualized representations that are invariant to these ‘nuisance features’ can increase WSI performance. For this purpose, we propose an adaptation of the adversarial training framework proposed by Jaiswal et al. (2020) to erase specific information from the representations of LMs, thereby creating feature-invariant representations. We experiment with erasing (i) morphological and (ii) syntactic features. The results of subsequent clustering for WSI show that these features indeed act like noise: using feature-invariant representations, compared to using the original representations, increases clustering-based WSI performance. Furthermore, we provide an in-depth analysis of how the information about the syntactic and morphological features of words relates to and affects WSI performance.

2023

DEplain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification
Regina Stodden | Omar Momen | Laura Kallmeyer
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text simplification is an intralingual translation task in which documents or sentences of a complex source text are simplified for a target audience. The success of automatic text simplification systems is highly dependent on the quality of parallel data used for training and evaluation. To advance sentence simplification and document simplification in German, this paper presents DEplain, a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”, or in German, “Einfache Sprache”). DEplain consists of a news-domain (approx. 500 document pairs, approx. 13k sentence pairs) and a web-domain corpus (approx. 150 aligned documents, approx. 2k aligned sentence pairs). In addition, we are building a web harvester and experimenting with automatic alignment methods to facilitate the integration of non-aligned and to-be-published parallel documents. Using this approach, we are dynamically increasing the web-domain corpus, so it is currently extended to approx. 750 document pairs and approx. 3.5k aligned sentence pairs. We show that using DEplain to train a transformer-based seq2seq text simplification model can achieve promising results. We make available the corpus, the adapted alignment methods for German, the web harvester and the trained models here: https://github.com/rstodden/DEPlain.

Using Masked Language Model Probabilities of Connectives for Stance Detection in English Discourse
Regina Stodden | Laura Kallmeyer | Lea Kawaletz | Heidrun Dorgeloh
Proceedings of the 10th Workshop on Argument Mining

This paper introduces an approach which operationalizes the role of discourse connectives for detecting argument stance. Specifically, the study investigates the utility of masked language model probabilities of discourse connectives inserted between a claim and a premise that supports or attacks it. The research focuses on a range of connectives known to signal support or attack, such as because, but, so, or although. By employing a LightGBM classifier, the study reveals promising results in stance detection in English discourse. While the proposed system does not aim to outperform state-of-the-art architectures, the classification accuracy is surprisingly high, highlighting the potential of these features to enhance argument mining tasks, including stance detection.

Increasing The Performance of Cognitively Inspired Data-Efficient Language Models via Implicit Structure Building
Omar Momen | David Arps | Laura Kallmeyer
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Improving Low-resource RRG Parsing with Structured Gloss Embeddings
Roland Eibers | Kilian Evang | Laura Kallmeyer
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

Treebanking for local languages is hampered by the lack of existing parsers to generate pre-annotations. However, it has been shown that reasonably accurate parsers can be bootstrapped with little initial training data when use is made of the information in interlinear glosses and translations that language documentation data for such treebanks typically comes with. In this paper, we improve upon such a bootstrapping model by representing glosses using a combination of morphological feature vectors and pre-trained lemma embeddings. We also contribute a mapping from glosses to Universal Dependencies morphological features.

Unsupervised Semantic Frame Induction Revisited
Younes Samih | Laura Kallmeyer
Proceedings of the 15th International Conference on Computational Semantics

This paper addresses the task of semantic frame induction based on pre-trained language models (LMs). The current state of the art is to directly use contextualized embeddings from models such as BERT and to cluster them in a two-step clustering process (first lemma-internal, then over all verb tokens in the data set). We propose not to use the LM’s embeddings as such but rather to refine them via a transformer-based denoising autoencoder. The resulting embeddings yield competitive results when clustered in a single pass. This shows clearly that the autoencoder already concentrates on the information that is relevant for distinguishing event types.

Data-Driven Frame-Semantic Parsing with Tree Wrapping Grammar
Tatiana Bladier | Laura Kallmeyer | Kilian Evang
Proceedings of the 15th International Conference on Computational Semantics

We describe the first experimental results for data-driven semantic parsing with Tree Rewriting Grammars (TRGs) and semantic frames. While several theoretical papers previously discussed approaches for modeling frame semantics in the context of TRGs, this is the first data-driven implementation of such a parser. We experiment with Tree Wrapping Grammar (TWG), a grammar formalism closely related to Tree Adjoining Grammar (TAG), developed for formalizing the typologically inspired linguistic theory of Role and Reference Grammar (RRG). We use a transformer-based multi-task architecture to predict semantic supertags which are then decoded into RRG trees augmented with semantic feature structures. We present experiments for sentences in different genres for English data. We also discuss our compositional semantic analyses using TWG for several linguistic phenomena.

Identifying Semantic Argument Types in Predication and Copredication Contexts: A Zero-Shot Cross-Lingual Approach
Deniz Ekin Yavas | Laura Kallmeyer | Rainer Osswald | Elisabetta Jezek | Marta Ricchiardi | Long Chen
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Identifying semantic argument types in predication contexts is not a straightforward task for several reasons, such as inherent polysemy, coercion, and copredication phenomena. In this paper, we train monolingual and multilingual classifiers with a zero-shot cross-lingual approach to identify semantic argument types in predications using pre-trained language models as feature extractors. We train classifiers for different semantic argument types and for both verbal and adjectival predications. Furthermore, we propose a method to detect copredication using these classifiers through identifying the argument semantic type targeted in different predications over the same noun in a sentence. We evaluate the performance of the method on copredication test data with Food•Event nouns for 5 languages.

2022

TS-ANNO: An Annotation Tool to Build, Annotate and Evaluate Text Simplification Corpora
Regina Stodden | Laura Kallmeyer
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce TS-ANNO, an open-source web application for manual creation and for evaluation of parallel corpora for text simplification. TS-ANNO can be used for i) sentence-wise alignment, ii) rating alignment pairs (e.g., w.r.t. grammaticality, meaning preservation, ...), iii) annotating alignment pairs w.r.t. simplification transformations (e.g., lexical substitution, sentence splitting, ...), and iv) manual simplification of complex documents. For evaluation, TS-ANNO calculates inter-annotator agreement of alignments (i) and annotations (ii).

A Frame-Based Model of Inherent Polysemy, Copredication and Argument Coercion
Long Chen | Laura Kallmeyer | Rainer Osswald
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

The paper presents a frame-based model of inherently polysemous nouns (such as ‘book’, which denotes both a physical object and an informational content) in which the meaning facets are directly accessible via attributes and which also takes into account the semantic relations between the facets. Predication over meaning facets (as in ‘memorize the book’) is then modeled as targeting the value of the corresponding facet attribute while coercion (as in ‘finish the book’) is modeled via specific patterns that enrich the predication. We use a compositional framework whose basic components are lexicalized syntactic trees paired with semantic frames and in which frame unification is triggered by tree composition. The approach is applied to a variety of combinations of predications over meaning facets and coercions.

Improving Low-resource RRG Parsing with Cross-lingual Self-training
Kilian Evang | Laura Kallmeyer | Jakub Waszczuk | Kilu von Prince | Tatiana Bladier | Simon Petitjean
Proceedings of the 29th International Conference on Computational Linguistics

This paper considers the task of parsing low-resource languages in a scenario where parallel English data and also a limited seed of annotated sentences in the target language are available, as for example in bootstrapping parallel treebanks. We focus on constituency parsing using Role and Reference Grammar (RRG), a theory that has so far been understudied in computational linguistics but that is widely used in typological research, in particular in the context of low-resource languages. Starting from an existing RRG parser, we propose two strategies for low-resource parsing: first, we extend the parsing model into a cross-lingual parser, exploiting the parallel data in the high-resource language and unsupervised word alignments by providing internal states of the source-language parser to the target-language parser. Second, we adopt self-training, thereby iteratively expanding the training data, starting from the seed, by including the most confident new parses in each round. Both in simulated scenarios and with a real low-resource language (Daakaka), we find substantial and complementary improvements from both self-training and cross-lingual parsing. Moreover, we also experimented with using gloss embeddings in addition to token embeddings in the target language, and this also improves results. Finally, starting from what we have for Daakaka, we also consider parsing a related language (Dalkalaen) where glosses and English translations are available but no annotated trees at all, i.e., a no-resource scenario with respect to syntactic annotations. We start with a cross-lingual parser trained on Daakaka with glosses and use self-training to adapt it to Dalkalaen. The results are surprisingly good.

Probing for Constituency Structure in Neural Language Models
David Arps | Younes Samih | Laura Kallmeyer | Hassan Sajjad
Findings of the Association for Computational Linguistics: EMNLP 2022

In this paper, we investigate to what extent contextual neural language models (LMs) implicitly learn syntactic structure. More concretely, we focus on constituent structure as represented in the Penn Treebank (PTB). Using standard probing techniques based on diagnostic classifiers, we assess the accuracy of representing constituents of different categories within the neuron activations of an LM such as RoBERTa. In order to make sure that our probe focuses on syntactic knowledge and not on implicit semantic generalizations, we also experiment on a PTB version that is obtained by randomly replacing constituents with each other while keeping syntactic structure, i.e., a semantically ill-formed but syntactically well-formed version of the PTB. We find that 4 pretrained transformer LMs obtain high performance on our probing tasks even on manipulated data, suggesting that semantic and syntactic knowledge in their representations can be separated and that constituency information is in fact learned by the LM. Moreover, we show that a complete constituency tree can be linearly separated from LM representations.

This paper describes the first release of RRGparbank, a multilingual parallel treebank for Role and Reference Grammar (RRG) containing annotations of George Orwell’s novel 1984 and its translations. The release comprises the entire novel for English and a constructionally diverse and highly parallel sample (“seed”) for German, French and Russian. The paper gives an overview of annotation decisions that have been taken and describes the adopted treebanking methodology. Finally, as a possible application, a multilingual parser is trained on the treebank data. RRGparbank is one of the first resources to apply RRG to large amounts of real-world data. Furthermore, it enables comparative and typological corpus studies in RRG. And, finally, it creates new possibilities of data-driven NLP applications based on RRG.

An Analysis of Attention in German Verbal Idiom Disambiguation
Rafael Ehren | Laura Kallmeyer | Timm Lichte
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

In this paper, we examine a BiLSTM architecture for disambiguating verbal potentially idiomatic expressions (PIEs) as to whether they are used in a literal or an idiomatic reading, with respect to the explainability of its decisions. Concretely, we extend the BiLSTM with an additional attention mechanism and track the elements that get the highest attention. The goal is to better understand which parts of an input sentence are particularly discriminative for the classifier’s decision, based on the assumption that these elements receive higher attention than others. In particular, we investigate POS tags and dependency relations to PIE verbs for the tokens with the maximal attention. It turns out that the elements with maximal attention are oftentimes nouns that are the subjects of the PIE verb. For longer sentences, however (i.e., sentences containing, among others, more modifiers), the highest-attention word often stands in a modifying relation to the PIE components. This is particularly frequent for PIEs classified as literal. Our study shows that an attention mechanism can contribute to the explainability of classification decisions that depend on specific cues in the sentential context, as is the case for PIE disambiguation.

2021

Implicit representations of event properties within contextual language models: Searching for “causativity neurons”
Esther Seyffarth | Younes Samih | Laura Kallmeyer | Hassan Sajjad
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

This paper addresses the question of to what extent neural contextual language models such as BERT implicitly represent complex semantic properties. More concretely, the paper shows that the neuron activations obtained from processing an English sentence provide discriminative features for predicting the (non-)causativity of the event denoted by the verb in a simple linear classifier. A layer-wise analysis reveals that the relevant properties are mostly learned in the higher layers. Moreover, further experiments show that approx. 10% of the neuron activations are enough to predict causativity with a relatively high accuracy.

Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)
Kilian Evang | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

Combining text and vision in compound semantics: Towards a cognitively plausible multimodal model
Abhijeet Gupta | Fritz Günther | Ingo Plag | Laura Kallmeyer | Stefan Conrad
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

Bootstrapping Role and Reference Grammar Treebanks via Universal Dependencies
Kilian Evang | Tatiana Bladier | Laura Kallmeyer | Simon Petitjean
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

2020

Corpus-based Identification of Verbs Participating in Verb Alternations Using Classification and Manual Annotation
Esther Seyffarth | Laura Kallmeyer
Proceedings of the 28th International Conference on Computational Linguistics

English verb alternations allow participating verbs to appear in a set of syntactically different constructions whose associated semantic frames are systematically related. We use ENCOW and VerbNet data to train classifiers to predict the instrument subject alternation and the causative-inchoative alternation, relying on count-based and vector-based features as well as perplexity-based language model features, which are intended to reflect each alternation’s felicity by simulating it. Beyond the prediction task, we use the classifier results as a source for a manual annotation step in order to identify new, unseen instances of each alternation. This is possible because existing alternation datasets contain positive, but no negative instances and are not comprehensive. Over several sequences of classification-annotation steps, we iteratively extend our sets of alternating verbs. Our hybrid approach to the identification of new alternating verbs reduces the required annotation effort by only presenting annotators with the highest-scoring candidates from the previous classification. Due to the success of semi-supervised and unsupervised features, our approach can easily be transferred to further alternations.

Statistical Parsing of Tree Wrapping Grammars
Tatiana Bladier | Jakub Waszczuk | Laura Kallmeyer
Proceedings of the 28th International Conference on Computational Linguistics

We describe an approach to statistical parsing with Tree-Wrapping Grammars (TWG). TWG is a tree-rewriting formalism which includes the tree-combination operations of substitution, sister-adjunction and tree-wrapping substitution. TWGs can be extracted from constituency treebanks and aim at representing long distance dependencies (LDDs) in a linguistically adequate way. We present a parsing algorithm for TWGs based on neural supertagging and A* parsing. We extract a TWG for English from the treebanks for Role and Reference Grammar and discuss first parsing results with this grammar.

Supervised Disambiguation of German Verbal Idioms with a BiLSTM Architecture
Rafael Ehren | Timm Lichte | Laura Kallmeyer | Jakub Waszczuk
Proceedings of the Second Workshop on Figurative Language Processing

Supervised disambiguation of verbal idioms (VID) poses special demands on the quality and quantity of the annotated data used for learning and evaluation. In this paper, we present a new VID corpus for German and perform a series of VID disambiguation experiments on it. Our best classifier, based on a neural architecture, yields an error reduction across VIDs of 57% in terms of accuracy compared to a simple majority baseline.

Do you Feel Certain about your Annotation? A Web-based Semantic Frame Annotation Tool Considering Annotators’ Concerns and Behaviors
Regina Stodden | Behrang QasemiZadeh | Laura Kallmeyer
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this system demonstration paper, we present an open-source web-based application with a responsive design for modular semantic frame annotation (SFA). Besides letting experienced and inexperienced users do suggestion-based and slightly-controlled annotations, the system keeps track of the time and changes during the annotation process and stores the users’ confidence in the current annotation. This collected metadata can be used to get insights regarding the difficulty of an annotation with the same type or frame, or can be used as an input of an annotation cost measurement for an active learning algorithm. The tool was already used to build a manually annotated corpus with semantic frames and their arguments for task 2 of SemEval 2019 regarding unsupervised lexical frame induction (QasemiZadeh et al., 2019). Although English sentences from the Wall Street Journal corpus of the Penn Treebank were annotated for this task, it is also possible to use the proposed tool for the annotation of sentences in other languages.

pdf bib abs
A multi-lingual and cross-domain analysis of features for text simplification
Regina Stodden | Laura Kallmeyer
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

In text simplification and readability research, several features have been proposed to estimate or simplify a complex text, e.g., readability scores, sentence length, or proportion of POS tags. These features have, however, mainly been developed for English. In this paper, we investigate their relevance for Czech, German, English, Spanish, and Italian text simplification corpora. Our multi-lingual and multi-domain corpus analysis shows that the relevance of different features for text simplification differs per corpus, language, and domain. For example, the relevance of lexical complexity differs across all languages, that of the BLEU score across all domains, and that of 14 features within the web domain corpora. Overall, the negative statistical tests regarding the other features across and within domains and languages lead to the assumption that text simplification models may be transferable between different domains or different languages.

pdf bib
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories
Kilian Evang | Laura Kallmeyer | Rafael Ehren | Simon Petitjean | Esther Seyffarth | Djamé Seddah
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

pdf bib
Automatic Extraction of Tree-Wrapping Grammars for Multiple Languages
Tatiana Bladier | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

pdf bib abs
SemEval-2019 Task 2: Unsupervised Lexical Frame Induction
Behrang QasemiZadeh | Miriam R. L. Petruck | Regina Stodden | Laura Kallmeyer | Marie Candito
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper presents Unsupervised Lexical Frame Induction, Task 2 of the International Workshop on Semantic Evaluation in 2019. Given a set of prespecified syntactic forms in context, the task requires that verbs and their arguments be clustered to resemble semantic frame structures. Results are useful in identifying polysemous words, i.e., those whose frame structures are not easily distinguished, as well as discerning semantic relations of the arguments. Evaluation of unsupervised frame induction methods fell into two tracks: Task A) Verb Clustering based on FrameNet 1.7; and B) Argument Clustering, with B.1) based on FrameNet’s core frame elements, and B.2) on VerbNet 3.2 semantic roles. The shared task attracted nine teams, of whom three reported promising results. This paper describes the task and its data, reports on methods and resources that these systems used, and offers a comparison to human annotation.

pdf bib abs
Towards a Compositional Analysis of German Light Verb Constructions (LVCs) Combining Lexicalized Tree Adjoining Grammar (LTAG) with Frame Semantics
Jens Fleischhauer | Thomas Gamerschlag | Laura Kallmeyer | Simon Petitjean
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Complex predicates formed of a semantically ‘light’ verbal head and a noun or verb which contributes the major part of the meaning are frequently referred to as ‘light verb constructions’ (LVCs). In the paper, we present a case study of LVCs with the German posture verb stehen ‘stand’. In our account, we model the syntactic as well as semantic composition of such LVCs by combining Lexicalized Tree Adjoining Grammar (LTAG) with frames. Starting from the analysis of the literal uses of posture verbs, we show how the meaning components of the literal uses are systematically exploited in the interpretation of stehen-LVCs. The paper constitutes an important step towards a compositional and computational analysis of LVCs. We show that LTAG allows us to separate constructional from lexical meaning components and that frames enable elegant generalizations over event types and related constraints.

pdf bib abs
A Neural Graph-based Approach to Verbal MWE Identification
Jakub Waszczuk | Rafael Ehren | Regina Stodden | Laura Kallmeyer
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

We propose to tackle the problem of verbal multiword expression (VMWE) identification using a neural graph parsing-based approach. Our solution involves encoding VMWE annotations as labellings of dependency trees and, subsequently, applying a neural network to model the probabilities of different labellings. This strategy can be particularly effective when applied to discontinuous VMWEs and, thanks to dense, pre-trained word vector representations, to VMWEs unseen during training. Evaluation of our approach on three PARSEME datasets (German, French, and Polish) shows that it achieves performance on par with the previous state-of-the-art (Al Saied et al., 2018).

2018

pdf bib
Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks
Mohammed Attia | Younes Samih | Ali Elkahky | Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
German and French Neural Supertagging Experiments for LTAG Parsing
Tatiana Bladier | Andreas van Cranenburgh | Younes Samih | Laura Kallmeyer
Proceedings of ACL 2018, Student Research Workshop

We present ongoing work on data-driven parsing of German and French with Lexicalized Tree Adjoining Grammars. We use a supertagging approach combined with deep learning. We show the challenges of extracting LTAG supertags from the French Treebank, introduce the use of left- and right-sister-adjunction, present a neural architecture for the supertagger, and report experiments on n-best supertagging for French and German.

pdf bib abs
Coarse Lexical Frame Acquisition at the Syntax–Semantics Interface Using a Latent-Variable PCFG Model
Laura Kallmeyer | Behrang QasemiZadeh | Jackie Chi Kit Cheung
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

We present a method for unsupervised lexical frame acquisition at the syntax–semantics interface. Given a set of input strings derived from dependency parses, our method generates a set of clusters that resemble lexical frame structures. Our work is motivated not only by its practical applications (e.g., to build, or expand the coverage of, lexical frame databases), but also by the aim of gaining linguistic insight into frame structures with respect to lexical distributions in relation to grammatical structures. We model our task using a hierarchical Bayesian network and employ tools and methods from latent variable probabilistic context free grammars (L-PCFGs) for statistical inference and parameter fitting, for which we propose a new split and merge procedure. We show that our model outperforms several baselines on a portion of the Wall Street Journal sentences that we have newly annotated for evaluation purposes.

pdf bib abs
TRAPACC and TRAPACCS at PARSEME Shared Task 2018: Neural Transition Tagging of Verbal Multiword Expressions
Regina Stodden | Behrang QasemiZadeh | Laura Kallmeyer
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

We describe the TRAPACC system and its variant TRAPACCS that participated in the closed track of the PARSEME Shared Task 2018 on labeling verbal multiword expressions (VMWEs). TRAPACC is a modified arc-standard transition system based on Constant and Nivre’s (2016) model of joint syntactic and lexical analysis in which the oracle is approximated using a classifier. For TRAPACC, the classifier consists of a data-independent dimension reduction and a convolutional neural network (CNN) for learning and labelling transitions. TRAPACCS extends TRAPACC by replacing the softmax layer of the CNN with a support vector machine (SVM). We report the results obtained for 19 languages, for 8 of which our system yields the best results compared to other participating systems in the closed-track of the shared task.

2017

pdf bib abs
Projection Aléatoire Non-Négative pour le Calcul de Word Embedding / Non-Negative Randomized Word Embedding
Behrang Qasemizadeh | Laura Kallmeyer | Aurelie Herbelot
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

We propose a word embedding method which is based on a novel random projection technique. We show that weighting methods such as positive pointwise mutual information (PPMI) can be applied to our models after their construction and at a reduced dimensionality. Hence, the proposed technique can efficiently transfer words onto semantically discriminative spaces while demonstrating high computational performance, besides benefits such as ease of update and a simple mechanism for interoperability. We report the performance of our method on several tasks and show that it yields competitive results compared to neural embedding methods in monolingual corpus-based setups.

Arabic dialects do not just share a common koiné, but there are shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.

pdf bib abs
HHU at SemEval-2017 Task 2: Fast Hash-Based Embeddings for Semantic Word Similarity Assessment
Behrang QasemiZadeh | Laura Kallmeyer
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes the HHU system that participated in Task 2 of SemEval 2017, Multilingual and Cross-lingual Semantic Word Similarity. We introduce our unsupervised embedding learning technique and describe how it was employed and configured to address the problems of monolingual and multilingual word similarity measurement. This paper reports on empirical evaluations on the benchmark provided by the task’s organizers.

The automated processing of Arabic dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a neural segmenter can be trained on only 350 annotated tweets, without any normalization or use of lexical features or lexical resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that rely on additional resources.

pdf bib
Depictives in English: An LTAG Approach
Benjamin Burkhardt | Timm Lichte | Laura Kallmeyer
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms

pdf bib
Combining Predicate-Argument Structure and Operator Projection: Clause Structure in Role and Reference Grammar
Laura Kallmeyer | Rainer Osswald
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms

pdf bib
Modeling Quantification with Polysemous Nouns
Laura Kallmeyer | Rainer Osswald
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

pdf bib
Random Positive-Only Projections: PPMI-Enabled Incremental Semantic Space Construction
Behrang QasemiZadeh | Laura Kallmeyer
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

pdf bib
Argument linking in LTAG: A constraint-based implementation with XMG
Laura Kallmeyer | Timm Lichte | Rainer Osswald | Simon Petitjean
Proceedings of the 12th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+12)

pdf bib abs
CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings
Mohammed Attia | Suraj Maharjan | Younes Samih | Laura Kallmeyer | Thamar Solorio
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. On the test set, our system achieves an F-measure of 88.1% (79.0% for TRUE only) for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is better suited to deterministic answers than to handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the distribution of labels in the training data.

pdf bib
Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Younes Samih | Suraj Maharjan | Mohammed Attia | Laura Kallmeyer | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
SAWT: Sequence Annotation Web Tool
Younes Samih | Wolfgang Maier | Laura Kallmeyer
Proceedings of the Second Workshop on Computational Approaches to Code Switching

We present several parsing algorithms for Range Concatenation Grammars (RCG), including a new Earley-style algorithm, within the deductive parsing paradigm. Our work is motivated by the recent interest in this type of grammar and fills a gap in the existing literature.

pdf bib
Convertir des grammaires d’arbres adjoints à composantes multiples avec tuples d’arbres (TT-MCTAG) en grammaires à concaténation d’intervalles (RCG) [Converting tree tuple multicomponent tree adjoining grammars (TT-MCTAGs) into range concatenation grammars (RCGs)]
Laura Kallmeyer | Yannick Parmentier
Traitement Automatique des Langues, Volume 50, Numéro 1 : Varia [Varia]

pdf bib
A Polynomial-Time Parsing Algorithm for TT-MCTAG
Laura Kallmeyer | Giorgio Satta
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
An Earley Parsing Algorithm for Range Concatenation Grammars
Laura Kallmeyer | Wolfgang Maier | Yannick Parmentier
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
An Incremental Earley Parser for Simple Range Concatenation Grammar
Laura Kallmeyer | Wolfgang Maier
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Synchronous Rewriting in Treebanks
Laura Kallmeyer | Wolfgang Maier | Giorgio Satta
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib abs
Convertir des grammaires d’arbres adjoints à composantes multiples avec tuples d’arbres (TT-MCTAG) en grammaires à concaténation d’intervalles (RCG)
Laura Kallmeyer | Yannick Parmentier
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article studies the relation between multicomponent tree adjoining grammars with tree tuples (TT-MCTAG), a formalism used in computational linguistics, and range concatenation grammars (RCG). RCGs are known to describe exactly the class PTIME; it has furthermore been shown that ‘simple’ RCGs are even equivalent to linear context-free rewriting systems (LCFRS), in other words, they are mildly context-sensitive. TT-MCTAG has been proposed to model free word order languages. In general, these languages are NP-complete. In this article, we define an additional constraint on the derivations allowed by the TT-MCTAG formalism. We then show how this restricted form of TT-MCTAG can be converted into an equivalent simple RCG. The result is interesting for theoretical reasons (since it shows that the restricted form of TT-MCTAG is mildly context-sensitive), but also for practical reasons (the transformation proposed here has been used to implement a parser for TT-MCTAG).

pdf bib abs
Developing a TT-MCTAG for German with an RCG-based Parser
Laura Kallmeyer | Timm Lichte | Wolfgang Maier | Yannick Parmentier | Johannes Dellert
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Developing linguistic resources, in particular grammars, is known to be a complex task in itself, because of (amongst others) redundancy and consistency issues. Furthermore, some languages can prove hard to describe because of specific characteristics, e.g., the free word order in German. In this context, we present (i) a framework for describing tree-based grammars, and (ii) an actual fragment of a core multicomponent tree-adjoining grammar with tree tuples (TT-MCTAG) for German developed using this framework. This framework combines a metagrammar compiler and a parser based on range concatenation grammar (RCG) to check, respectively, the consistency and the correctness of the grammar. The German grammar being developed within this framework already deals with a wide range of scrambling and extraction phenomena.

pdf bib
TuLiPA: Towards a Multi-Formalism Parsing Environment for Grammar Engineering
Laura Kallmeyer | Timm Lichte | Wolfgang Maier | Yannick Parmentier | Johannes Dellert | Kilian Evang
Coling 2008: Proceedings of the workshop on Grammar Engineering Across Frameworks

pdf bib
Factorizing Complementation in a TT-MCTAG for German
Timm Lichte | Laura Kallmeyer
Proceedings of the Ninth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+9)

pdf bib
TuLiPA: A syntax-semantics parsing environment for mildly context-sensitive formalisms
Yannick Parmentier | Laura Kallmeyer | Wolfgang Maier | Timm Lichte | Johannes Dellert
Proceedings of the Ninth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+9)

2006

pdf bib
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms
Tilman Becker | Laura Kallmeyer
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
Quantifier Scope in German: An MCTAG Analysis
Laura Kallmeyer | Maribel Romero
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
Licensing German Negative Polarity Items in LTAG
Timm Lichte | Laura Kallmeyer
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
Constraint-Based Computational Semantics: A Comparison between LTAG and LRS
Laura Kallmeyer | Frank Richter
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

2005

pdf bib abs
A Descriptive Characterization of Multicomponent Tree Adjoining Grammars
Laura Kallmeyer
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Multicomponent Tree Adjoining Grammars (MCTAG) is a formalism that has been shown to be useful for many natural language applications. The definition of MCTAG, however, is problematic since it refers to the process of the derivation itself: a simultaneity constraint must be respected concerning the way the members of the elementary tree sets are added. Looking only at the result of a derivation (i.e., the derived tree and the derivation tree), this simultaneity is no longer visible and therefore cannot be checked. That is, this way of characterizing MCTAG does not allow abstracting away from the concrete order of derivation. Therefore, in this paper, we propose an alternative definition of MCTAG that characterizes the trees in the tree language of an MCTAG via the properties of the derivation trees the MCTAG licenses.

pdf bib
Tree-Local Multicomponent Tree-Adjoining Grammars with Shared Nodes
Laura Kallmeyer
Computational Linguistics, Volume 31, Number 2, June 2005

2004

pdf bib abs
Tree-local MCTAG with Shared Nodes: An Analysis of Word Order Variation in German and Korean
Laura Kallmeyer | SinWon Yoon
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Tree Adjoining Grammars (TAG) are known not to be powerful enough to deal with scrambling in free word order languages. The TAG-variants proposed so far in order to account for scrambling are not entirely satisfying. Therefore, an alternative extension of TAG is introduced based on the notion of node sharing. Considering data from German and Korean, it is shown that this TAG-extension can adequately analyse scrambling data, also in combination with extraposition and topicalization.

pdf bib
LTAG Analysis for Pied-Piping and Stranding of wh-Phrases
Laura Kallmeyer | Tatjana Scheffler
Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
Tree-local MCTAG with Shared Nodes: Word Order Variation in German and Korean
Laura Kallmeyer | SinWon Yoon
Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
LTAG Semantics with Semantic Unification
Laura Kallmeyer | Maribel Romero
Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms

pdf bib
LTAG Semantics for Questions
Maribel Romero | Laura Kallmeyer | Olga Babko-Malaya
Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms