International Conference on Computational Semantics (2023) - ACL Anthology

International Conference on Computational Semantics (2023)

Volumes

Proceedings of the 15th International Conference on Computational Semantics IWCS 34 papers
Proceedings of the Fourth International Workshop on Designing Meaning Representations DMR 14 papers
Proceedings of the 4th Workshop on Inquisitiveness Below and Beyond the Sentence Boundary InqBnB 7 papers
Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19) ISA 11 papers
Proceedings of the 4th Natural Logic Meets Machine Learning Workshop NALOMA 8 papers

pdf (full)
bib (full) Proceedings of the 15th International Conference on Computational Semantics

Proceedings of the 15th International Conference on Computational Semantics
Maxime Amblard | Ellen Breitholtz

Can current NLI systems handle German word order? Investigating language model performance on a new German challenge set of minimal pairs
Ines Reinig | Katja Markert

Compared to English, German word order is freer and therefore poses additional challenges for natural language inference (NLI). We create WOGLI (Word Order in German Language Inference), the first adversarial NLI dataset for German word order that has the following properties: (i) each premise has an entailed and a non-entailed hypothesis; (ii) premise and hypotheses differ only in word order and necessary morphological changes to mark case and number. In particular, each premise and its two hypotheses contain exactly the same lemmata. Our adversarial examples require the model to use morphological markers in order to recognise or reject entailment. We show that current German autoencoding models fine-tuned on translated NLI data can struggle on this challenge set, reflecting the fact that translated NLI datasets will not mirror all necessary language phenomena in the target language. We also examine performance after data augmentation as well as on related word order phenomena derived from WOGLI. Our datasets are publically available at https://github.com/ireinig/wogli.

Contextual Variability depends on Categorical Specificity rather than Conceptual Concreteness: A Distributional Investigation on Italian data
Giulia Rambelli | Marianna Bolognesi

A large amount of literature on conceptual abstraction has investigated the differences in contextual distribution (namely “contextual variability”) between abstract and concrete concept words (“joy” vs. “apple”), showing that abstract words tend to be used in a wide variety of linguistic contexts. In contrast, concrete words usually occur in a few very similar contexts. However, these studies do not take into account another process that affects both abstract and concrete concepts alike: “specificity, that is, how inclusive a category is (“ragdoll” vs. “mammal”). We argue that the more a word is specific, the more its usage is tied to specific domains, and therefore its contextual variability is more limited compared to generic words. In this work, we used distributional semantic models to model the interplay between contextual variability measures and i) concreteness, ii) specificity, and iii) the interaction between the two variables. Distributional analyses on 662 Italian nouns showed that contextual variability is mainly explainable in terms of specificity or by the interaction between concreteness and specificity. In particular, the more specific a word is, the more its contexts will be close to it. In contrast, generic words have less related contexts, regardless of whether they are concrete or abstract.

Probing BERT’s ability to encode sentence modality and modal verb sense across varieties of English
Jonas Wagner | Sina Zarrieß

In this research, we investigate whether BERT can differentiate between modal verb senses and sentence modalities and whether it performs equally well on different varieties of English. We fit probing classifiers under two conditions: contextualised embeddings of modal verbs and sentence embeddings. We also investigate BERT’s ability to predict masked modal verbs. Additionally, we classify separately for each modal verb to investigate whether BERT encodes different representations of senses for each individual verb. Lastly, we employ classifiers on data from different varieties of English to determine whether non-American English data is an additional hurdle. Results indicate that BERT has different representations for distinct senses for each modal verb, but does not represent modal sense independently from modal verbs. We also show that performance in different varieties of English is not equal, pointing to a necessary shift in the way we train large language models towards more linguistic diversity. We make our annotated dataset of modal sense in different varieties of English available at https://github.com/wagner-jonas/VEM.

Dense Paraphrasing for Textual Enrichment
Jingxuan Tu | Kyeongmin Rim | Eben Holderness | Bingyang Ye | James Pustejovsky

Understanding inferences from text requires more than merely recovering surface arguments, adjuncts, or strings associated with the query terms. As humans, we interpret sentences as contextualized components of a narrative or discourse, by both filling in missing information, and reasoning about event consequences. In this paper, we define the process of rewriting a textual expression (lexeme or phrase) such that it reduces ambiguity while also making explicit the underlying semantics that is not (necessarily) expressed in the economy of sentence structure as Dense Paraphrasing (DP). We apply the DP techniques on the English procedural texts from the cooking recipe domain, and provide the scope and design of the application that involves creating a graph representation of events and generating hidden arguments through paraphrasing. We provide insights on how this DP process can enrich a source text by showing that the dense-paraphrased event graph is a good resource to large LLMs such as GPT-3 to generate reliable paraphrases; and by experimenting baselines for automaticDP generation. Finally, we demonstrate the utility of the dataset and event graph structure by providing a case study on the out-of-domain modeling and different DP prompts and GPT models for paraphrasing.

Towards Unsupervised Compositional Entailment with Multi-Graph Embedding Models
Lorenzo Bertolini | Julie Weeds | David Weir

Compositionality and inference are essential features of human language, and should hence be simultaneously accessible to a model of meaning. Despite being theory-grounded, distributional models can only be directly tested on compositionality, usually through similarity judgements, while testing for inference requires external resources. Recent work has shown that knowledge graph embeddings (KGE) architectures can be used to train distributional models capable of learning syntax-aware compositional representations, by training on syntactic graphs. We propose to expand such work with Multi-Graphs embedding (MuG) models, a new set of models learning from syntactic and knowledge-graphs. Using a phrase-level inference task, we show how MuGs can simultaneously handle syntax-aware composition and inference, and remain competitive distributional models with respect to lexical and compositional similarity.

Gender-tailored Semantic Role Profiling for German
Manfred Klenner | Anne Göhring | Alison Kim | Dylan Massey

In this short paper, we combine the semantic perspective of particular verbs as casting a positive or negative relationship between their role fillers with a pragmatic examination of how the distribution of particular vulnerable role filler subtypes (children, migrants, etc.) looks like. We focus on the gender subtype and strive to extract gender-specific semantic role profiles: who are the predominant sources and targets of which polar events - men or women. Such profiles might reveal gender stereotypes or biases (of the media), but as well could be indicative of our social reality.

Implicit causality in GPT-2: a case study
Minh Hien Huynh | Tomas Lentz | Emiel van Miltenburg

This case study investigates the extent to which a language model (GPT-2) is able to capture native speakers’ intuitions about implicit causality in a sentence completion task. Study 1 reproduces earlier results (showing that the model’s surprisal values correlate with the implicit causality bias of the verb; Davis and van Schijndel 2021), and then examine the effects of gender and verb frequency on model performance. Study 2 examines the reasoning ability of GPT-2: Is the model able to produce more sensible motivations for why the subject VERBed the object if the verbs have stronger causality biases? For this study we took care to avoid human raters being biased by obscenities and disfluencies generated by the model.

Multi-purpose neural network for French categorial grammars
Gaëtan Margueritte | Daisuke Bekki | Koji Mineshima

Categorial grammar (CG) is a lexicalized grammar formalism that can be used to identify and extract the semantics of natural language sentences. However, despite being used actively to solve natural language understanding tasks such as natural language inference or recognizing textual entailment, most of the tools exploiting the capacities of CG are available in a limited set of languages. This paper proposes a first step toward developing a set of tools enabling the use of CG for the French language by proposing a neural network tailored for part-of-speech and type-logical-grammar supertagging, located at the frontier between computational linguistics and artificial intelligence. Experiments show that our model can compete with state-of-the art models while retaining a simple architecture.

Experiments in training transformer sequence-to-sequence DRS parsers
Ahmet Yildirim | Dag Haug

This work experiments with various configurations of transformer-based sequence-to-sequence neural networks in training a Discourse Representation Structure (DRS) parser, and presents the results along with the code to reproduce our experiments for use by the community working on DRS parsing. These are configurations that have not been tested in prior work on this task. The Parallel Meaning Bank (PMB) English data sets are used to train the models. The results are evaluated on the PMB test sets using Counter, the standard Evaluation tool for DRSs. We show that the performance improves upon the previous state of the art by 0.5 (F1 %) for PMB 2.2.0 and 1.02 (F1 %) for PMB 3.0.0 test sets. We also present results on PMB 4.0.0, which has not been evaluated using Counter in previous research.

Unsupervised Semantic Frame Induction Revisited
Younes Samih | Laura Kallmeyer

This paper addresses the task of semantic frame induction based on pre-trained language models (LMs). The current state of the art is to directly use contextualized embeddings from models such as BERT and to cluster them in a two step clustering process (first lemma-internal, then over all verb tokens in the data set). We propose not to use the LM’s embeddings as such but rather to refine them via some transformer-based denoising autoencoder. The resulting embeddings allow to obtain competitive results while clustering them in a single pass. This shows clearly that the autoendocer allows to already concentrate on the information that is relevant for distinguishing event types.

Towards Ontologically Grounded and Language-Agnostic Knowledge Graphs
Walid Saba

Knowledge graphs (KGs) have become the standard technology for the representation of factual information in applications such as recommendation engines, search, and question-answering systems. However, the continual updating of KGs, as well as the integration of KGs from different domains and KGs in different languages, remains to be a major challenge. What we suggest here is that by a reification of abstract objects and by acknowledging the ontological distinction between concepts and types, we arrive at an ontologically grounded and language-agnostic representation that can alleviate the difficulties in KG integration.

The Universe of Utterances According to BERT
Dmitry Nikolaev | Sebastian Padó

It has been argued that BERT “rediscovers the traditional NLP pipeline”, with lower layers extracting morphosyntactic features and higher layers creating holistic sentence-level representations. In this paper, we critically examine this assumption through a principle-component-guided analysis, extracing sets of inputs that correspond to specific activation patterns in BERT sentence representations. We find that even in higher layers, the model mostly picks up on a variegated bunch of low-level features, many related to sentence complexity, that presumably arise from its specific pre-training objectives.

Sparser is better: one step closer to word embedding interpretability
Simon Guillot | Thibault Prouteau | Nicolas Dugue

Sparse word embeddings models (SPINE, SINr) are designed to embed words in interpretable dimensions. An interpretable dimension is such that a human can interpret the semantic (or syntactic) relations between words active for a dimension. These models are useful for critical downstream tasks in natural language processing (e.g. medical or legal NLP), and digital humanities applications. This work extends interpretability at the vector level with a more manageable number of activated dimensions following recommendations from psycholinguistics. Subsequently, one of the key criteria to an interpretable model is sparsity: in order to be interpretable, not every word should be represented by all the features of the model, especially if humans have to interpret these features and their relations. This raises one question: to which extent is sparsity sustainable with regard to performance? We thus introduce a sparsification procedure to evaluate its impact on two interpretable methods (SPINE and SINr) to tend towards sustainable vector interpretability. We also introduce stability as a new criterion to interpretability. Our stability evaluations show little albeit non-zero variation for SPINE and SINr embeddings. We then show that increasing sparsity does not necessarily interfere with performance. These results are encouraging and pave the way towards intrinsically interpretable word vectors.

Semantically Informed Data Augmentation for Unscoped Episodic Logical Forms
Mandar Juvekar | Gene Kim | Lenhart Schubert

Unscoped Logical Form (ULF) of Episodic Logic is a meaning representation format that captures the overall semantic type structure of natural language while leaving certain finer details, such as word sense and quantifier scope, underspecified for ease of parsing and annotation. While a learned parser exists to convert English to ULF, its performance is severely limited by the lack of a large dataset to train the system. We present a ULF dataset augmentation method that samples type-coherent ULF expressions using the ULF semantic type system and filters out samples corresponding to implausible English sentences using a pretrained language model. Our data augmentation method is configurable with parameters that trade off between plausibility of samples with sample novelty and augmentation size. We find that the best configuration of this augmentation method substantially improves parser performance beyond using the existing unaugmented dataset.

Meaning-Text Theory within Abstract Categorial Grammars: Toward Paraphrase and Lexical Function Modeling for Text Generation
Marie Cousin

The meaning-text theory is a linguistic theory aiming to describe the correspondence between the meaning and the surface form of an utterance with a formal device simulating the linguistic activity of a native speaker. We implement a version of a model of this theory with abstract categorial grammars, a grammatical formalism based on lambda-calculus. This implementation covers the syntax-semantic interface of the meaning-text theory, i.e., not only the three semantic, deep-syntactic and surface-syntactic representation levels of the theory, but also their interface (i.e., the transformation from one level to another). This implementation hinges upon abstract categorial grammars composition in order to encode level interfaces as transduction operate.

Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation
Shira Wein | Zhuxin Wang | Nathan Schneider

Identifying semantically equivalent sentences is important for many NLP tasks. Current approaches to semantic equivalence take a loose, sentence-level approach to “equivalence,” despite evidence that fine-grained differences and implicit content have an effect on human understanding and system performance. In this work, we introduce a novel, more sensitive method of characterizing cross-lingual semantic equivalence that leverages Abstract Meaning Representation graph structures. We find that parsing sentences into AMRs and comparing the AMR graphs enables finer-grained equivalence measurement than comparing the sentences themselves. We demonstrate that when using gold or even automatically parsed AMR annotations, our solution is finer-grained than existing corpus filtering methods and more accurate at predicting strictly equivalent sentences than existing semantic similarity metrics.

The Importance of Context in the Evaluation of Word Embeddings: The Effects of Antonymy and Polysemy
James Fodor | Simon De Deyne | Shinsuke Suzuki

Word embeddings are widely used for diverse applications in natural language processing. Despite extensive research, it is unclear when they succeed or fail to capture human judgements of semantic relatedness and similarity. In this study, we examine a range of models and experimental datasets, showing that while current embeddings perform reasonably well overall, they are unable to account for human judgements of antonyms and polysemy. We suggest that word embeddings perform poorly in representing polysemy and antonymy because they do not consider the context in which humans make word similarity judgements. In support of this, we further show that incorporating additional context into transformer embeddings using general corpora and lexical dictionaries significantly improves the fit with human judgments. Our results provide insight into two key inadequacies of word embeddings, and highlight the importance of incorporating word context into representations of word meaning when accounting for context-free human similarity judgments.

RaTE: a Reproducible automatic Taxonomy Evaluation by Filling the Gap
Phillippe Langlais | Tianjian Lucas Gao

Taxonomies are an essential knowledge representation, yet most studies on automatic taxonomy construction (ATC) resort to manual evaluation to score proposed algorithms. We argue that automatic taxonomy evaluation (ATE) is just as important as taxonomy construction. We propose RaTE, an automatic label-free taxonomy scoring procedure, which relies on a large pre-trained language model. We apply our evaluation procedure to three state-of-the-art ATC algorithms with which we built seven taxonomies from the Yelp domain, and show that 1) RaTE correlates well with human judgments and 2) artificially degrading a taxonomy leads to decreasing RaTE score.

The Universal Anaphora Scorer 2.0
Juntao Yu | Michal Novák | Abdulrahman Aloraini | Nafise Sadat Moosavi | Silviu Paun | Sameer Pradhan | Massimo Poesio

The aim of the Universal Anaphora initiative is to push forward the state of the art both in anaphora (coreference) annotation and in the evaluation of models for anaphora resolution. The first release of the Universal Anaphora Scorer (Yu et al., 2022b) supported the scoring not only of identity anaphora as in the Reference Coreference Scorer (Pradhan et al., 2014) but also of split antecedent anaphoric reference, bridging references, and discourse deixis. That scorer was used in the CODI-CRAC 2021/2022 Shared Tasks on Anaphora Resolution in Dialogues (Khosla et al., 2021; Yu et al., 2022a). A modified version of the scorer supporting discontinuous markables and the COREFUD markup format was also used in the CRAC 2022 Shared Task on Multilingual Coreference Resolution (Zabokrtsky et al., 2022). In this paper, we introduce the second release of the scorer, merging the two previous versions, which can score reference with discontinuous markables and zero anaphora resolution.

The Sequence Notation: Catching Complex Meanings in Simple Graphs
Johan Bos

Current symbolic semantic representations proposed to capture the semantics of human language have served well to give us insight in how meaning is expressed. But they are either too complicated for large-scale annotation tasks or lack expressive power to play a role in inference tasks. What we propose is a meaning representation system that it is interlingual, model-theoretic, and variable-free. It divides the labour involved in representing meaning along three levels: concept, roles, and contexts. As natural languages are expressed as sequences of phonemes or words, the meaning representations that we propose are likewise sequential. However, the resulting meaning representations can also be visualised as directed acyclic graphs.

Bridging Semantic Frameworks: mapping DRS onto AMR
Siyana Pavlova | Maxime Amblard | Bruno Guillaume

A number of graph-based semantic representation frameworks have emerged in recent years, but there are few parallel annotated corpora across them. We want to explore the viability of transforming graphs from one framework into another to construct parallel datasets. In this work, we consider graph rewriting from Discourse Representation Structures (Parallel Meaning Bank (PMB) variant) to Abstract Meaning Representation (AMR). We first build a gold AMR corpus of 102 sentences from the PMB. We then construct a rule base, aided by a further 95 sentences. No benchmark for this task exists, so we compare our system’s output to that of state-of-the-art AMR parsers, and explore the more challenging cases. Finally, we discuss where the two frameworks diverge in encoding semantic phenomena.

Data-Driven Frame-Semantic Parsing with Tree Wrapping Grammar
Tatiana Bladier | Laura Kallmeyer | Kilian Evang

We describe the first experimental results for data-driven semantic parsing with Tree Rewriting Grammars (TRGs) and semantic frames. While several theoretical papers previously discussed approaches for modeling frame semantics in the context of TRGs, this is the first data-driven implementation of such a parser. We experiment with Tree Wrapping Grammar (TWG), a grammar formalism closely related to Tree Adjoining Grammar (TAG), developed for formalizing the typologically inspired linguistic theory of Role and Reference Grammar (RRG). We use a transformer-based multi-task architecture to predict semantic supertags which are then decoded into RRG trees augmented with semantic feature structures. We present experiments for sentences in different genres for English data. We also discuss our compositional semantic analyses using TWG for several linguistic phenomena.

The argument–adjunct distinction in BERT: A FrameNet-based investigation
Dmitry Nikolaev | Sebastian Padó

The distinction between arguments and adjuncts is a fundamental assumption of several linguistic theories. In this study, we investigate to what extent this distinction is picked up by a Transformer-based language model. We use BERT as a case study, operationalizing arguments and adjuncts as core and non-core FrameNet frame elements, respectively, and tying them to activations of particular BERT neurons. We present evidence, from English and Korean, that BERT learns more dedicated representations for arguments than for adjuncts when fine-tuned on the FrameNet frame-identification task. We also show that this distinction is already present in a weaker form in the vanilla pre-trained model.

Collecting and Predicting Neurocognitive Norms for Mandarin Chinese
Le Qiu | Yu-Yin Hsu | Emmanuele Chersoni

Language researchers have long assumed that concepts can be represented by sets of semantic features, and have traditionally encountered challenges in identifying a feature set that could be sufficiently general to describe the human conceptual experience in its entirety. In the dataset of English norms presented by Binder et al. (2016), also known as Binder norms, the authors introduced a new set of neurobiologically motivated semantic features in which conceptual primitives were defined in terms of modalities of neural information processing. However, no comparable norms are currently available for other languages. In our work, we built the Mandarin Chinese norm by translating the stimuli used in the original study and developed a comparable collection of human ratings for Mandarin Chinese. We also conducted some experiments on the automatic prediction of the Chinese Binder Norms based on the word embeddings of the corresponding words to assess the feasibility of modeling experiential semantic features via corpus-based representations.

Error Exploration for Automatic Abstract Meaning Representation Parsing
Maria Boritchev | Johannes Heinecke

Following the data-driven methods of evaluation and error analysis in meaning representation parsing presented in (Buljan et al., 2022), we performed an error exploration of an Abstract Meaning Representation (AMR) parser. Our aim is to perform a diagnosis of the types of errors found in the output of the tool in order to implement adaptation and correction strategies to accommodate these errors. This article presents the exploration, its results, the strategies we implemented and the effect of these strategies on the performances of the tool. Though we did not observe a significative rise on average in the performances of the tool, we got much better results in some cases using our adaptation techniques.

Unsupervised Methods for Domain Specific Ambiguity Detection. The Case of German Physics Language
Vitor Fontanella | Christian Wartena | Gunnar Friege

Many terms used in physics have a different meaning or usage pattern in general language, constituting a learning barrier in physics teaching. The systematic identification of such terms is considered to be useful for science education as well as for terminology extraction. This article compares three methods based on vector semantics and a simple frequency-based baseline for automatically identifying terms used in general language with domain-specific use in physics. For evaluation, we use ambiguity scores from a survey among physicists and data about the number of term senses from Wiktionary. We show that the so-called Vector Initialization method obtains the best results.

Definition Modeling : To model definitions. Generating Definitions With Little to No Semantics
Vincent Segonne | Timothee Mickus

Definition Modeling, the task of generating definitions, was first proposed as a means to evaluate the semantic quality of word embeddings—a coherent lexical semantic representations of a word in context should contain all the information necessary to generate its definition. The relative novelty of this task entails that we do not know which factors are actually relied upon by a Definition Modeling system. In this paper, we present evidence that the task may not involve as much semantics as one might expect: we show how an earlier model from the literature is both rather insensitive to semantic aspects such as explicit polysemy, as well as reliant on formal similarities between headwords and words occurring in its glosses, casting doubt on the validity of the task as a means to evaluate embeddings.

SMARAGD: Learning SMatch for Accurate and Rapid Approximate Graph Distance
Juri Opitz | Philipp Meier | Anette Frank

The similarity of graph structures, such as Meaning Representations (MRs), is often assessed via structural matching algorithms, such as Smatch (Cai & Knight 2013). However, Smatch involves a combinatorial problem that suffers from NP-completeness, making large-scale applications, e.g., graph clustering or search, infeasible. To alleviate this issue, we learn SMARAGD: Semantic Match for Accurate and Rapid Approximate Graph Distance. We show the potential of neural networks to approximate Smatch scores, i) in linear time using a machine translation framework to predict alignments, or ii) in constant time using a Siamese CNN to directly predict Smatch scores. We show that the approximation error can be substantially reduced through data augmentation and graph anonymization.

AMR4NLI: Interpretable and robust NLI measures from semantic graphs
Juri Opitz | Shira Wein | Julius Steen | Anette Frank | Nathan Schneider

The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of *contextualized embeddings* and *semantic graphs* (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.

Use Defines Possibilities: Reasoning about Object Function to Interpret and Execute Robot Instructions
Mollie Shichman | Claire Bonial | Austin Blodgett | Taylor Hudson | Francis Ferraro | Rachel Rudinger

Language models have shown great promise in common-sense related tasks. However, it remains unseen how they would perform in the context of physically situated human-robot interactions, particularly in disaster-relief sce- narios. In this paper, we develop a language model evaluation dataset with more than 800 cloze sentences, written to probe for the func- tion of over 200 objects. The sentences are divided into two tasks: an “easy” task where the language model has to choose between vo- cabulary with different functions (Task 1), and a “challenge” where it has to choose between vocabulary with the same function, yet only one vocabulary item is appropriate given real world constraints on functionality (Task 2). Dis- tilBERT performs with about 80% accuracy for both tasks. To investigate how annotator variability affected those results, we developed a follow-on experiment where we compared our original results with wrong answers chosen based on embedding vector distances. Those results showed increased precision across docu- ments but a 15% decrease in accuracy. We con- clude that language models do have a strong knowledge basis for object reasoning, but will require creative fine-tuning strategies in order to be successfully deployed.

SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation
Bhathiya Hemanthage | Christian Dondrup | Phil Bartie | Oliver Lemon

SimpleMTOD is a simple language model which recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks. SimpleMTOD is built on a large-scale transformer-based auto-regressive architecture, which has already proven to be successful in uni-modal task-oriented dialogues, and effectively leverages transfer learning from pretrained GPT-2. In-order to capture the semantics of visual scenes, we introduce both local and de-localized tokens for objects within a scene. De-localized tokens represent the type of an object rather than the specific object itself and so possess a consistent meaning across the dataset. SimpleMTOD achieves a state-of-the-art BLEU score (0.327) in the Response Generation sub-task of the SIMMC 2.0 test-std dataset while performing on par in other multimodal sub-tasks: Disambiguation, Coreference Resolution, and Dialog State Tracking. This is despite taking a minimalist approach for extracting visual (and non-visual) informa- tion. In addition the model does not rely on task-specific architectural changes such as classification heads.

Grounding and Distinguishing Conceptual Vocabulary Through Similarity Learning in Embodied Simulations
Sadaf Ghaffari | Nikhil Krishnaswamy

We present a novel method for using agent experiences gathered through an embodied simulation to ground contextualized word vectors to object representations. We use similarity learning to make comparisons between different object types based on their properties when interacted with, and to extract common features pertaining to the objects’ behavior. We then use an affine transformation to calculate a projection matrix that transforms contextualized word vectors from different transformer-based language models into this learned space, and evaluate whether new test instances of transformed token vectors identify the correct concept in the object embedding space. Our results expose properties of the embedding spaces of four different transformer models and show that grounding object token vectors is usually more helpful to grounding verb and attribute token vectors than the reverse, which reflects earlier conclusions in the analogical reasoning and psycholinguistic literature.

Interactive Acquisition of Fine-grained Visual Concepts by Exploiting Semantics of Generic Characterizations in Discourse
Jonghyuk Park | Alex Lascarides | Subramanian Ramamoorthy

Interactive Task Learning (ITL) concerns learning about unforeseen domain concepts via natural interactions with human users. The learner faces a number of significant constraints: learning should be online, incremental and few-shot, as it is expected to perform tangible belief updates right after novel words denoting unforeseen concepts are introduced. In this work, we explore a challenging symbol grounding task—discriminating among object classes that look very similar—within the constraints imposed by ITL. We demonstrate empirically that more data-efficient grounding results from exploiting the truth-conditions of the teacher’s generic statements (e.g., “Xs have attribute Z.”) and their implicatures in context (e.g., as an answer to “How are Xs and Ys different?”, one infers Y lacks attribute Z).

pdf (full)
bib (full) Proceedings of the Fourth International Workshop on Designing Meaning Representations

Proceedings of the Fourth International Workshop on Designing Meaning Representations
Julia Bonn | Nianwen Xue

Structural and Global Features for Comparing Semantic Representation Formalisms
Siyana Pavlova | Maxime Amblard | Bruno Guillaume

The area of designing semantic/meaning representations is a dynamic one with new formalisms and extensions being proposed continuously. It may be challenging for users of semantic representations to select the relevant formalism for their purpose or for newcomers to the field to select the features they want to represent in a new formalism. In this paper, we propose a set of structural and global features to consider when designing formalisms, and against which formalisms can be compared. We also propose a sample comparison of a number of existing formalisms across the selected features, complemented by a more entailment-oriented comparison on the phenomena of the FraCaS corpus.

Evaluation of Universal Semantic Representation (USR)
Kirti Garg | Soma Paul | Sukhada | Riya Kumari | Fatema Bawahir

Universal Semantic Representation (USR) is designed as a language-independent information packaging system that captures information at three levels: (a) Lexico-conceptual, (b) Syntactico-Semantic, and (c) Discourse. Unlike other representations that mainly encode predicates and their argument structures, our proposed representation captures the speaker’s vivakṣā- how the speaker views the activity. The idea of “speaker’s vivakṣā is inspired by Indian Grammatical Tradition. There can be some amount of idiosyncrasy of the speaker in the annotation since it is the speaker’s view- point that has been captured in the annotation. Hence the evaluation metrics of such resources need to be also thought through from scratch. This paper presents an extensive evaluation procedure of this semantic representation from two perspectives (a) Inter- Annotator Agreement and (b) one downstream task, namely multilingual Natural Language Generation. We also qualitatively evaluate the experience of natural language generation by manual parsing of USR, so as to understand the readability of USR. We have achieved above 80% Inter-Annotator Agreement for USR annotations and above 80% semantic closeness in multi-lingual generation tasks suggesting the reliability of USR annotations and utility for multi-lingual generations. The qualitative evaluation also suggests high readability and hence the utility of USR as a semantic representation.

Comparing UMR and Cross-lingual Adaptations of AMR
Shira Wein | Julia Bonn

Abstract Meaning Representation (AMR) is a popular semantic annotation schema that presents sentence meaning as a graph while abstracting away from syntax. It was originally designed for English, but has since been extended to a variety of non-English versions of AMR. These cross-lingual adaptations, to varying degrees, incorporate language-specific features necessary to effectively capture the semantics of the language being annotated. Uniform Meaning Representation (UMR) on the other hand, the multilingual extension of AMR, was designed specifically for cross-lingual applications. In this work, we discuss these two approaches to extending AMR beyond English. We describe both approaches, compare the information they capture for a case language (Spanish), and outline implications for future work.

Abstract Meaning Representation for Grounded Human-Robot Communication
Claire Bonial | Julie Foresta | Nicholas C. Fung | Cory J. Hayes | Philip Osteen | Jacob Arkin | Benned Hedegaard | Thomas Howard

To collaborate effectively in physically situated tasks, robots must be able to ground concepts in natural language to the physical objects in the environment as well as their own capabilities. We describe the implementation and the demonstration of a system architecture that sup- ports tasking robots using natural language. In this architecture, natural language instructions are first handled by a dialogue management component, which provides feedback to the user and passes executable instructions along to an Abstract Meaning Representation (AMR) parser. The parse distills the action primitives and parameters of the instructed behavior in the form of a directed a-cyclic graph, passed on to the grounding component. We find AMR to be an efficient formalism for grounding the nodes of the graph using a Distributed Correspondence Graph. Thus, in our approach, the concepts of language are grounded to entities in the robot’s world model, which is populated by its sensors, thereby enabling grounded natural language communication. The demonstration of this system will allow users to issue navigation commands in natural language to direct a simulated ground robot (running the Robot Operating System) to various landmarks observed by the user within a simulated environment.

Annotating Situated Actions in Dialogue
Christopher Tam | Richard Brutti | Kenneth Lai | James Pustejovsky

Actions are critical for interpreting dialogue: they provide context for demonstratives and definite descriptions in discourse, and they continually update the common ground. This paper describes how Abstract Meaning Representation (AMR) can be used to annotate actions in multimodal human-human and human-object interactions. We conduct initial annotations of shared task and first-person point-of-view videos. We show that AMRs can be interpreted by a proxy language, such as VoxML, as executable annotation structures in order to recreate and simulate a series of annotated events.

From Sentence to Action: Splitting AMR Graphs for Recipe Instructions
Katharina Stein | Lucia Donatelli | Alexander Koller

Accurately interpreting the relationships between actions in a recipe text is essential to successful recipe completion. We explore using Abstract Meaning Representation (AMR) to represent recipe instructions, abstracting away from syntax and sentence structure that may order recipe actions in arbitrary ways. We present an algorithm to split sentence-level AMRs into action-level AMRs for individual cooking steps. Our approach provides an automatic way to derive fine-grained AMR representations of actions in cooking recipes and can be a useful tool for downstream, instructional tasks.

Meaning Representation of English Prepositional Phrase Roles: SNACS Supersenses vs. Tectogrammatical Functors
Wesley Scivetti | Nathan Schneider

This work compares two ways of annotating semantic relations expressed in prepositional phrases: semantic classes in the Semantic Network of Adposition and Case Supersenses (SNACS), and tectogrammatical functors from the Prague English Dependency Treebank (PEDT). We compare the label definitions in the respective annotation guidelines to determine expected mappings, then check how well these work empirically using Wall Street Journal text. In the definitions we find substantial overlap in the distributions of the two schemata with respect to participants and circumstantials, but substantial divergence for configurational relationships between nominals. This is borne out by the empirical analysis. Examining the data more closely for participants and circumstantials reveals that there are some unexpected, yet systematic divergences between definitionally aligned groups.

QA-Adj: Adding Adjectives to QA-based Semantics
Leon Pesahov | Ayal Klein | Ido Dagan

Identifying all predicate-argument relations in a sentence has been a fundamental research target in NLP. While traditionally these relations were modeled via formal schemata, the recent QA-SRL paradigm (and its extensions) present appealing advantages of capturing such relations through intuitive natural language question-answer (QA) pairs. In this paper, we extend the QA-based semantics framework to cover adjectival predicates, which carry important information in many downstream settings yet have been scarcely addressed in NLP research. Firstly, based on some prior literature and empirical assessment, we propose capturing four types of core adjectival arguments, through corresponding question types. Notably, our coverage goes beyond prior annotations of adjectival arguments, while also explicating valuable implicit arguments. Next, we develop an extensive data annotation methodology, involving controlled crowdsourcing and targeted expert review. Following, we create a high-quality dataset, consisting of 9K adjective mentions with 12K predicate-argument instances (QAs). Finally, we present and analyze baseline models based on text-to-text language modeling, indicating challenges for future research, particularly regarding the scarce argument types. Overall, we suggest that our contributions can provide the basis for research on contemporary modeling of adjectival information.

The long and the short of it: DRASTIC, a semantically annotated dataset containing sentences of more natural length
Dag Haug | Jamie Yates Findlay | Ahmet Yildirim

This paper presents a new dataset with Discourse Representation Structures (DRSs) annotated over naturally-occurring sentences. Importantly, these sentences are more varied in length and on average longer than those in the existing gold-standard DRS dataset, the Parallel Meaning Bank, and we show that they are therefore much harder for parsers. We argue, though, that this provides a more realistic assessment of the difficulties of DRS parsing.

UMR Annotation of Multiword Expressions
Julia Bonn | Andrew Cowell | Jan Hajič | Alexis Palmer | Martha Palmer | James Pustejovsky | Haibo Sun | Zdenka Uresova | Shira Wein | Nianwen Xue | Jin Zhao

Rooted in AMR, Uniform Meaning Representation (UMR) is a graph-based formalism with nodes as concepts and edges as relations between them. When used to represent natural language semantics, UMR maps words in a sentence to concepts in the UMR graph. Multiword expressions (MWEs) pose a particular challenge to UMR annotation because they deviate from the default one-to-one mapping between words and concepts. There are different types of MWEs which require different kinds of annotation that must be specified in guidelines. This paper discusses the specific treatment for each type of MWE in UMR.

MR4AP: Meaning Representation for Application Purposes
Bastien Giordano | Cédric Lopez

Despite the significant progress made in Natural Language Processing (NLP) thanks to deep learning techniques, efforts are still needed to model explicit, factual, and accurate meaning representation formalisms. In this article, we present a comparative table of ten formalisms that have been proposed over the last thirty years, and we describe and put forth our own, Meaning Representation for Application Purposes (MR4AP), developed in an industrial context with a definitive applicative aim.

Claim Extraction via Subgraph Matching over Modal and Syntactic Dependencies
Benjamin Rozonoyer | David Zajic | Ilana Heintz | Michael Selvaggio

We propose the use of modal dependency parses (MDPs) aligned with syntactic dependency parse trees as an avenue for the novel task of claim extraction. MDPs provide a document-level structure that links linguistic expression of events to the conceivers responsible for those expressions. By defining the event-conceiver links as claims and using subgraph pattern matching to exploit the complementarity of these modal links and syntactic claim patterns, we outline a method for aggregating and classifying claims, with the potential for supplying a novel perspective on large natural language data sets. Abstracting away from the task of claim extraction, we prototype an interpretable information extraction (IE) paradigm over sentence- and document-level parse structures, framing inference as subgraph matching and learning as subgraph mining. We make our code open-sourced at https://github.com/BBN-E/nlp-graph-pattern-matching-and-mining.

Which Argumentative Aspects of Hate Speech in Social Media can be reliably identified?
Damián Ariel Furman | Pablo Torres | José A. Rodríguez | Laura Alonso Alemany | Diego Letzen | Vanina Martínez

The expansion of Large Language Models (LLMs) into more serious areas of application, involving decision-making and the forming of public opinion, calls for a more thoughtful treatment of texts. Augmenting them with explicit and understandable argumentative analysis could foster a more reasoned usage of chatbots, text completion mechanisms or other applications. However, it is unclear which aspects of argumentation can be reliably identified and integrated by them. In this paper we propose an adaptation of Wagemans (2016)’s Periodic Table of Arguments to identify different argumentative aspects of texts, with a special focus on hate speech in social media. We have empirically assessed the reliability with which each of these aspects can be automatically identified. We analyze the implications of these results, and how to adapt the proposal to obtain reliable representations of those that cannot be successfully identified.

pdf (full)
bib (full) Proceedings of the 4th Workshop on Inquisitiveness Below and Beyond the Sentence Boundary

Proceedings of the 4th Workshop on Inquisitiveness Below and Beyond the Sentence Boundary
Valentin D. Richard | Floris Roelofsen

Short answers as tests: A post-suppositional view on wh-questions and answers
Linmin Zhang

This paper explores a post-suppositional view on wh-questions and their answers with dynamic semantics. Inspired by Brasoveanu (2013); Charlow (2017); Bumford (2017), I propose a unified treatment of items like modified numerals, focus items, and wh-items: they (i) introduce a discourse referent (dref) in a non-deterministic way and (ii) impose definiteness tests (and additional tests) in a delayed, post-suppositional manner at the sentential / discourse level. Thus, with a question like “who smiled”, the (maximally informative) dref “the one(s) who smiled” is derived. A short answer like “Mary and Max” is considered another post-supposition-like, delayed test, checking whether the dref “the one(s) who smiled” is identical to (or includes) the sum “Mary⊕Max”. I analyze various question-related phenomena to see how far this proposal can go.

Referential Transparency and Inquisitiveness
Jonathan Ginzburg | Andy Lücking

The paper extends a referentially transparent approach which has been successfully applied to the analysis of declarative quantified NPs to wh-phrases. This uses data from dialogical phenomena such as clarification interaction, anaphora, and incrementality as a guide to the design of wh-phrase meanings.

Uninquisitive questions
Tom Roberts

The sort of denotation a sentence is assigned is typically motivated by assumptions about the discourse function of sentences of that kind. For example, the notion that utterances which are functionally inquisitive (asking a question) suggest denotations which are semantically inquisitive (expressing the multiple licit responses to that question) is the cornerstone of interrogative meaning in frameworks like Alternative Semantics (Hamblin, 1973) and Inquisitive Semantics (Ciardelli et al., 2018). This paper argues that at least some kinds of questions systematically do not involve utterances with inquisitive content, based on novel observations of the Estonian discourse particle ega. Though ega is often labeled a ‘question particle’, it is used in both assertions and questions with sharply divergent discourse effects. I suggest that the relevant difference between assertive and questioning uses of ega is not semantic or sentence type-related, but rather reflects an interaction between a unified semantics for declaratives ega-sentences and different contexts of use. I then show that if we assume that ega presupposes that some aspect of the discourse context implicates the negation of ega’s prejacent, and that it occurs only in declarative sentences, we can derive its interpretation across a range of contexts: with the right combination of ingredients, we can ask questions with semantically uninquisitive sentences.

mage as a bias particle in interrogatives
Maryam Mohammadi

This paper investigates Farsi particle ‘mage’ in interrogatives, including both polar and constituent/Wh questions. I will show that ‘mage’ requires both contextual evidence and speaker’s prior belief in the sense that they contradict each other. While in polar questions (PQs) both types of bias can be straightforwardly expressed through the uttered proposition (cf. Mameni 2010), Wh-questions (WhQs) do not provide such a propositional object. To capture this difference, I propose Answerhood as the relevant notation that provides the necessary object source for ‘mage’ (inspired by Theiler 2021). The proposal establishes the felicity conditions and the meaning of ‘mage’ in relation to the (contextually) restricted answerhood in both polar and constituent questions.

Dynamic Questions: Evidence from Mandarin Think–”Xiang”
Anshun Zheng

This paper investigates the clausal embedding pattern of the Mandarin verb “xiang” (think) and reveals its internal anti-interrogative nature, with the possibility of “xiang Q” in certain cases. Through various stativity tests, I establish that the results are consistent with the generalization proposed by Özyıldız(2021), with “minor” deviations observed in the stativity of “xiang P” and the correlation with neg-raising. Additionally, I employ a semantic shift perspective to explain instances of neg-raising failure. Overall, this study sheds light on the unique characteristics of the verb “xiang” and contributes to a better cross-linguistic understanding of CP selection.

The indefinite-interrogative affinity in sign languages: the case of Catalan Sign Language
Raquel Veiga Busto | Floris Roelofsen | Alexandra Navarrete González

Prior studies on spoken languages have shown that indefinite and interrogative pronouns may be formally very similar. Our research aims to understand if sign languages exhibit this type of affinity. This paper presents an overview of the phenomenon and reports on the results of two studies: a cross-linguistic survey based on a sample of 30 sign languages and an empirical investigation conducted with three deaf consultants of Catalan Sign Language (LSC). Our research shows that, in sign languages, certain signs have both existential and interrogative readings and it identifies the environments that make existential interpretations available in LSC.

pdf (full)
bib (full) Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19)

Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19)
Harry Bunt

The DARPA Wikidata Overlay: Wikidata as an ontology for natural language processing
Elizabeth Spaulding | Kathryn Conger | Anatole Gershman | Rosario Uceda-Sosa | Susan Windisch Brown | James Pustejovsky | Peter Anick | Martha Palmer

With 102,530,067 items currently in its crowd-sourced knowledge base, Wikidata provides NLP practitioners a unique and powerful resource for inference and reasoning over real-world entities. However, because Wikidata is very entity focused, events and actions are often labeled with eventive nouns (e.g., the process of diagnosing a person’s illness is labeled “diagnosis”), and the typical participants in an event are not described or linked to that event concept (e.g., the medical professional or patient). Motivated by a need for an adaptable, comprehensive, domain-flexible ontology for information extraction, including identifying the roles entities are playing in an event, we present a curated subset of Wikidata in which events have been enriched with PropBank roles. To enable richer narrative understanding between events from Wikidata concepts, we have also provided a comprehensive mapping from temporal Qnodes and Pnodes to the Allen Interval Temporal Logic relations.

Semantic annotation of Common Lexis Verbs of Contact in Bulgarian
Maria Todorova

The paper presents the work on the selection, semantic annotation and classification of a group of verbs from WordNet, characterized with the semantic primitive ‘verbs of contact’ that belong to the common Bulgarian lexis. The selection of the verb set using both different criteria: statistical information from corpora, WordNet Base concepts and AoA as a criterion, is described. The focus of the work is on the process of the verbs’ of contact semantic annotation using the combined information from two language resources - WordNet and FrameNet. The verbs of contact from WordNet are assigmed semantic frames from FrameNet and then grouped in semantic subclasses using both their place in the WordNet hierarchy, the semantic restrictions on their frame elements and the corresponding syntactic realization. At the end we offer some conclusions on the classification of ‘verbs of contact’ in semantic subtypes.

Appraisal Theory and the Annotation of Speaker-Writer Engagement
Min Dong | Alex Fang

In this work, we address the annotation of language resources through the application of the engagement network in appraisal theory. This work represents an attempt to extend the advances in studies of speech and dialogue acts to encompass the latest notion of stance negotiations in discourse, between the writer and other sources. This type of phenomenon has become especially salient in contemporary media communication and requires some timely research to address emergent requirement. We shall first of all describe the engagement network as proposed by Martin and White (2005) and then discuss the issue of multisubjectivity. We shall then propose and describe a bi-step procedure towards better annotation before discussing the benefits of engagement network in the assessment of speaker-writer stance. We shall finally discuss issues of annotation consistency and reliability.

metAMoRphosED, a graphical editor for Abstract Meaning Representation
Johannes Heinecke

This paper presents a graphical editor for directed graphs, serialised in the PENMAN format, as used for annotations in Abstract Meaning Representation (AMR). The tool supports creating and modifying of AMR graphs and other directed graphs, adding and deletion of instances, edges and literals, renaming of concepts, relations and literals, setting a “top node” and validating the edited graph.

Personal noun detection for German
Carla Sökefeld | Melanie Andresen | Johanna Binnewitt | Heike Zinsmeister

Personal nouns, i.e. common nouns denoting human beings, play an important role in manifesting gender and gender stereotypes in texts, especially for languages with grammatical gender like German. Automatically detecting and extracting personal nouns can thus be of interest to a myriad of different tasks such as minimizing gender bias in language models and researching gender stereotypes or gender-fair language, but is complicated by the morphological heterogeneity and homonymy of personal and non-personal nouns, which restrict lexicon-based approaches. In this paper, we introduce a classifier created by fine-tuning a transformer model that detects personal nouns in German. Although some phenomena like homonymy and metalinguistic uses are still problematic, the model is able to classify personal nouns with robust accuracy (f1-score: 0.94).

ISO 24617-2 on a cusp of languages
Krzysztof Hwaszcz | Marcin Oleksy | Aleksandra Domogała | Jan Wieczorek

The article discusses the challenges of cross-linguistic dialogue act annotation, which involves using methods developed for one language to annotate conversations in another language. The article specifically focuses on the research on dialogue act annotation in Polish, based on the ISO standard developed for English. The article examines the differences between Polish and English in dialogue act annotation based on selected examples from DiaBiz.Kom corpus, such as the use of honorifics in Polish, the use of inflection to convey meaning in Polish, the tendency to use complex sentence structures in Polish, and the cultural differences that may play a role in the annotation of dialogue acts. The article also discusses the creation of DiaBiz.Kom, a Polish dialogue corpus based on ISO 24617-2 standard applied to 1100 transcripts.

Towards Referential Transparent Annotations of Quantified Noun Phrases
Andy Luecking

Using recent developments in count noun quantification, namely Referential Transparency Theory (RTT), the basic structure for annotating quantification in the nominal domain according to RTT is presented. The paper discusses core ideas of RTT, derives the abstract annotation syntax, and exemplifies annotations of quantified noun phrases partly in comparison to QuantML.

The compositional semantics of QuantML annotations
Harry Bunt

This paper discusses some issues in the semantic annotation of quantification phenomena in general, and in particular in the markup language QuantML, which has been proposed to form part of an ISO standard annotation scheme for quantification in natural language data. QuantML annotations have been claimed to have a compositional semantic interpretation, but the formal specification of QuantML in the official ISO documentation does not provide sufficient detail to judge this. This paper aims to fill this gap.

An Abstract Specification of VoxML as an Annotation Language
Kiyong Lee | Nikhil Krishnaswamy | James Pustejovsky

VoxML is a modeling language used to map natural language expressions into real time visualizations using real-world semantic knowledge of objects and events. Its utility has been demonstrated in embodied simulation environmens and in agent-object interactions in situated human-agent communicative. It is enriched to work with notions of affordances, both Gibsonian and Telic, and habitat for various interactions between the rational agent (human) and an object. This paper aims to specify VoxML as an annotation language in general abstract terms. It then shows how it works on annotating linguistic data that express visually perceptible human-object interactions. The annotation structures thus generated will be interpreted against the enriched minimal model created by VoxML as a modeling language while supporting the modeling purposes of VoxML linguistically.

How Good is Automatic Segmentation as a Multimodal Discourse Annotation Aid?
Corbyn Terpstra | Ibrahim Khebour | Mariah Bradford | Brett Wisniewski | Nikhil Krishnaswamy | Nathaniel Blanchard

In this work, we assess the quality of different utterance segmentation techniques as an aid in annotating collaborative problem solving in teams and the creation of shared meaning between participants in a situated, collaborative task. We manually transcribe utterances in a dataset of triads collaboratively solving a problem involving dialogue and physical object manipulation, annotate collaborative moves according to these gold-standard transcripts, and then apply these annotations to utterances that have been automatically segmented using toolkits from Google and Open-AI’s Whisper. We show that the oracle utterances have minimal correspondence to automatically segmented speech, and that automatically segmented speech using different segmentation methods is also inconsistent. We also show that annotating automatically segmented speech has distinct implications compared with annotating oracle utterances — since most annotation schemes are designed for oracle cases, when annotating automatically-segmented utterances, annotators must make arbitrary judgements which other annotators may not replicate. We conclude with a discussion of how future annotation specs can account for these needs.

pdf (full)
bib (full) Proceedings of the 4th Natural Logic Meets Machine Learning Workshop

Proceedings of the 4th Natural Logic Meets Machine Learning Workshop
Stergios Chatzikyriakidis | Valeria de Paiva

Evaluating Large Language Models with NeuBAROCO: Syllogistic Reasoning Ability and Human-like Biases
Risako Ando | Takanobu Morishita | Hirohiko Abe | Koji Mineshima | Mitsuhiro Okada

This paper investigates whether current large language models exhibit biases in logical reasoning, similar to humans. Specifically, we focus on syllogistic reasoning, a well-studied form of inference in the cognitive science of human deduction. To facilitate our analysis, we introduce a dataset called NeuBAROCO, originally designed for psychological experiments that assess human logical abilities in syllogistic reasoning. The dataset consists of syllogistic inferences in both English and Japanese. We examine three types of biases observed in human syllogistic reasoning: belief biases, conversion errors, and atmosphere effects. Our findings demonstrate that current large language models struggle more with problems involving these three types of biases.

SpaceNLI: Evaluating the Consistency of Predicting Inferences In Space
Lasha Abzianidze | Joost Zwarts | Yoad Winter

While many natural language inference (NLI) datasets target certain semantic phenomena, e.g., negation, tense & aspect, monotonicity, and presupposition, to the best of our knowledge, there is no NLI dataset that involves diverse types of spatial expressions and reasoning. We fill this gap by semi-automatically creating an NLI dataset for spatial reasoning, called SpaceNLI. The data samples are automatically generated from a curated set of reasoning patterns (see Figure 1), where the patterns are annotated with inference labels by experts. We test several SOTA NLI systems on SpaceNLI to gauge the complexity of the dataset and the system’s capacity for spatial reasoning. Moreover, we introduce a Pattern Accuracy and argue that it is a more reliable and stricter measure than the accuracy for evaluating a system’s performance on pattern-based generated data samples. Based on the evaluation results we find that the systems obtain moderate results on the spatial NLI problems but lack consistency per inference pattern. The results also reveal that non-projective spatial inferences (especially due to the “between” preposition) are the most challenging ones.

Does ChatGPT Resemble Humans in Processing Implicatures?
Zhuang Qiu | Xufeng Duan | Zhenguang Cai

Recent advances in large language models (LLMs) and LLM-driven chatbots, such as ChatGPT, have sparked interest in the extent to which these artificial systems possess human-like linguistic abilities. In this study, we assessed ChatGPT’s pragmatic capabilities by conducting three preregistered experiments focused on its ability to compute pragmatic implicatures. The first experiment tested whether ChatGPT inhibits the computation of generalized conversational implicatures (GCIs) when explicitly required to process the text’s truth-conditional meaning. The second and third experiments examined whether the communicative context affects ChatGPT’s ability to compute scalar implicatures (SIs). Our results showed that ChatGPT did not demonstrate human-like flexibility in switching between pragmatic and semantic processing. Additionally, ChatGPT’s judgments did not exhibit the well-established effect of communicative context on SI rates.

Recurrent Neural Network CCG Parser
Sora Tagami | Daisuke Bekki

The two contrasting approaches are end-to-end neural NLI systems and linguistically-oriented NLI pipelines consisting of modules such as neural CCG parsers and theorem provers. The latter, however, faces the challenge of integrating the neural models used in the syntactic and semantic components. RNNGs are frameworks that can potentially fill this gap, but conventional RNNGs adopt CFG as the syntactic theory. To address this issue, we implemented RNN-CCG, a syntactic parser that replaces CFG with CCG. We then conducted experiments comparing RNN-CCG to RNNGs with/without POS tags and evaluated their behavior as a first step towards building an NLI system based on RNN-CCG.

TTR at the SPA: Relating type-theoretical semantics to neural semantic pointers
Staffan Larsson | Robin Cooper | Jonathan Ginzburg | Andy Luecking

This paper considers how the kind of formal semantic objects used in TTR (a theory of types with records, Cooper 2013) might be related to the vector representations used in Eliasmith (2013). An advantage of doing this is that it would immediately give us a neural representation for TTR objects as Eliasmith relates vectors to neural activity in his semantic pointer architecture (SPA). This would be an alternative using convolution to the suggestions made by Cooper (2019) based on the phasing of neural activity. The project seems potentially hopeful since all complex TTR objects are constructed from labelled sets (essentially sets of ordered pairs consisting of labels and values) which might be seen as corresponding to the representation of structured objects which Eliasmith achieves using superposition and circular convolution.

Triadic temporal representations and deformations
Tim Fernando

Triadic representations that temporally order events and states are described, consisting of strings and sets of strings of bounded but refinable granularities. The strings are compressed according to J.A. Wheeler’s dictum it-from-bit, with bits given by statives and non-statives alike. A choice of vocabulary and constraints expressed in that vocabulary shape representations of cause-and-effect with deformations characteristic, Mumford posits, of patterns at various levels of cognitive processing. These deformations point to an ongoing process of learning, formulated as grammatical inference of finite automata, structured around Goguen and Burstall’s institutions.

Discourse Representation Structure Parsing for Chinese
Chunliu Wang | Xiao Zhang | Johan Bos

Previous work has predominantly focused on monolingual English semantic parsing. We, instead, explore the feasibility of Chinese semantic parsing in the absence of labeled data for Chinese meaning representations. We describe the pipeline of automatically collecting the linearized Chinese meaning representation data for sequential-to-sequential neural networks. We further propose a test suite designed explicitly for Chinese semantic parsing, which provides fine-grained evaluation for parsing performance, where we aim to study Chinese parsing difficulties. Our experimental results show that the difficulty of Chinese semantic parsing is mainly caused by adverbs. Realizing Chinese parsing through machine translation and an English parser yields slightly lower performance than training a model directly on Chinese data.