The paper is a corpus study of the factors involved in disambiguating potential scope ambiguity in sentences with negation and universal quantifier, such as “I don’t want talk to all these people”, which can alternatively mean ‘I don’t want to talk to any of these people’ and ‘I don’t want to talk to some of these people’. The relevant factors are demonstrated to be largely different from those involved in disambiguating lexical polysemy. They include the syntactic function of the constituent containing “all” quantifier (subject, direct complement, adjunct), as well as the deepness of its embedding; the status of the main predicate and “all” constituent with respect to the information structure of the 6utterance (topic vs. focus, given vs. new information); pragmatic implicatures pertaining to the situations described in the utterances.
In this talk, I will outline a range of challenges presented by multiword expressions in terms of (lexicalist) precision grammar engineering, and different strategies for accommodating those challenges, in an attempt to strike the right balance in terms of generalisation and over- and under-generation.
Microsyntactic linguistic units, such as syntactic idioms and non-standard syntactic constructions, are poorly represented in linguistic resources, mostly because the former are elements occupying an intermediate position between the lexicon and the grammar and the latter are too specific to be routinely tackled by general grammars. Consequently, many such units produce substantial gaps in systems intended to solve sophisticated computational linguistics tasks, such as parsing, deep semantic analysis, question answering, machine translation, or text generation. They also present obstacles for applying advanced techniques to these tasks, such as machine learning. The paper discusses an approach aimed at bridging such gaps, focusing on the development of monolingual and multilingual corpora where microsyntactic units are to be tagged.
An excellent example of a phenomenon bridging a lexicon and a grammar is provided by grammaticalized alternations (e.g., passivization, reflexivity, and reciprocity): these alternations represent productive grammatical processes which are, however, lexically determined. While grammaticalized alternations keep lexical meaning of verbs unchanged, they are usually characterized by various changes in their morphosyntactic structure. In this contribution, we demonstrate on the example of reciprocity and its representation in the valency lexicon of Czech verbs, VALLEX how a linguistic description of complex (and still systemic) changes characteristic of grammaticalized alternations can benefit from an integration of grammatical rules into a valency lexicon. In contrast to other types of grammaticalized alternations, reciprocity in Czech has received relatively little attention although it closely interacts with various linguistic phenomena (e.g., with light verbs, diatheses, and reflexivity).
Language-endowed intelligent agents benefit from leveraging lexical knowledge falling at different points along a spectrum of compositionality. This means that robust computational lexicons should include not only the compositional expectations of argument-taking words, but also non-compositional collocations (idioms), semi-compositional collocations that might be difficult for an agent to interpret (e.g., standard metaphors), and even collocations that could be compositionally analyzed but are so frequently encountered that recording their meaning increases the efficiency of interpretation. In this paper we argue that yet another type of string-to-meaning mapping can also be useful to intelligent agents: remembered semantic analyses of actual text inputs. These can be viewed as super-specific multi-word expressions whose recorded interpretations mimic a person’s memories of knowledge previously learned from language input. These differ from typical annotated corpora in two ways. First, they provide a full, context-sensitive semantic interpretation rather than select features. Second, they are are formulated in the ontologically-grounded metalanguage used in a particular agent environment, meaning that the interpretations contribute to the dynamically evolving cognitive capabilites of agents configured in that environment.
Universal Dependencies is an initiative to develop cross-linguistically consistent grammatical annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning and parsing research from a language typology perspective. It assumes a dependency-based approach to syntax and a lexicalist approach to morphology, which together entail that the fundamental units of grammatical annotation are words. Words have properties captured by morphological annotation and enter into relations captured by syntactic annotation. Moreover, priority is given to relations between lexical content words, as opposed to grammatical function words. In this position paper, I discuss how this approach allows us to capture similarities and differences across typologically diverse languages.
Human communication is a multimodal activity, involving not only speech and written expressions, but intonation, images, gestures, visual clues, and the interpretation of actions through perception. In this paper, we describe the design of a multimodal lexicon that is able to accommodate the diverse modalities that present themselves in NLP applications. We have been developing a multimodal semantic representation, VoxML, that integrates the encoding of semantic, visual, gestural, and action-based features associated with linguistic expressions.
Valency slot filling is a semantic glue, which brings together the meanings of words. As regards the position of an argument in the dependency structure with respect to its predicate, there exist three types of valency filling: active (canonical), passive, and discontinuous. Of these, the first type is studied much better than the other two. As a rule, canonical actants are unambiguously marked in the syntactic structure, and each actant corresponds to a unique syntactic position. Linguistic information on which syntactic function an actant might have (subject, direct or indirect object), what its morphological form should be and which prepositions or conjunctions it requires, can be given in the lexicon in the form of government patterns, subcategorization frames, or similar data structures. We concentrate on non-canonical cases of valency filling in Russian, which are characteristic of non-verbal parts of speech, such as adverbs, adjectives, and particles, in the first place. They are more difficult to handle than canonical ones, because the position of the actant in the tree is governed by more complicated rules. A valency may be filled by expressions occupying different syntactic positions, and a syntactic position may accept expressions filling different valencies of the same word. We show how these phenomena can be processed in a semantic analyzer.
Verbenet is a French lexicon developed by “translation” of its English counterpart — VerbNet (Kipper-Schuler, 2005)—and treatment of the specificities of French syntax (Pradet et al., 2014; Danlos et al., 2016). One difficulty encountered in its development springs from the fact that the list of (potentially numerous) frames has no internal organization. This paper proposes a type system for frames that shows whether two frames are variants of a given alternation. Frame typing facilitates coherence checking of the resource in a “virtuous circle”. We present the principles underlying a program we developed and used to automatically type frames in VerbeNet. We also show that our system is portable to other languages.
We present an attempt to automatically identify Czech deverbative nouns using several methods that use large corpora as well as existing lexical resources. The motivation for the task is to extend a verbal valency (i.e., predicate-argument) lexicon by adding nouns that share the valency properties with the base verb, assuming their properties can be derived (even if not trivially) from the underlying verb by deterministic grammatical rules. At the same time, even in inflective languages, not all deverbatives are simply created from their underlying base verb by regular lexical derivation processes. We have thus developed hybrid techniques that use both large parallel corpora and several standard lexical resources. Thanks to the use of parallel corpora, the resulting sets contain also synonyms, which the lexical derivation rules cannot get. For evaluation, we have manually created a small, 100-verb gold data since no such dataset was initially available for Czech.
We present an interdisciplinary study on the interaction between the interpretation of noun-noun deverbal compounds (DCs; e.g., task assignment) and the morphosyntactic properties of their deverbal heads in English. Underlying hypotheses from theoretical linguistics are tested with tools and resources from computational linguistics. We start with Grimshaw’s (1990) insight that deverbal nouns are ambiguous between argument-supporting nominal (ASN) readings, which inherit verbal arguments (e.g., the assignment of the tasks), and the less verbal and more lexicalized Result Nominal and Simple Event readings (e.g., a two-page assignment). Following Grimshaw, our hypothesis is that the former will realize object arguments in DCs, while the latter will receive a wider range of interpretations like root compounds headed by non-derived nouns (e.g., chocolate box). Evidence from a large corpus assisted by machine learning techniques confirms this hypothesis, by showing that, besides other features, the realization of internal arguments by deverbal heads outside compounds (i.e., the most distinctive ASN-property in Grimshaw 1990) is a good predictor for an object interpretation of non-heads in DCs.
We show how to turn a large-scale syntactic dictionary into a dependency-based unification grammar where each piece of lexical information calls a separate rule, yielding a super granular grammar. Subcategorization, raising and control verbs, auxiliaries and copula, passivization, and tough-movement are discussed. We focus on the semantics-syntax interface and offer a new perspective on syntactic structure.
This paper presents our ongoing work on compilation of English multi-word expression (MWE) lexicon. We are especially interested in collecting flexible MWEs, in which some other components can intervene the expression such as “a number of” vs “a large number of” where a modifier of “number” can be placed in the expression and inherit the original meaning. We fiest collect possible candidates of flexible English MWEs from the web, and annotate all of their occurrences in the Wall Street Journal portion of Ontonotes corpus. We make use of word dependency strcuture information of the sentences converted from the phrase structure annotation. This process enables semi-automatic annotation of MWEs in the corpus and simultanaously produces the internal and external dependency representation of flexible MWEs.
The paper presents a contrastive description of reflexive possessive pronouns “svůj” in Czech and “svoj” in Russian. The research concerns syntactic, semantic and pragmatic aspects. With our analysis, we shed a new light on the already investigated issue, which comes from a detailed comparison of the phenomenon of possessive reflexivization in two typologically and genetically similar languages. We show that whereas in Czech, the possessive reflexivization is mostly limited to syntactic functions and does not go beyond the grammar, in Russian it gets additional semantic meanings and moves substan-tially towards the lexicon. The obtained knowledge allows us to explain heretofore unclear marginal uses of reflexives in each language.
A specific language as used by different speakers and in different situations has a number of more or less distant varieties. Extending the notion of non-standard language to varieties that do not fit an explicitly or implicitly assumed norm or pattern, we look for methods and tools that could be applied to this domain. The needs start from the theoretical side: categories usable for the analysis of non-standard language are not readily available, and continue to methods and tools required for its detection and diagnostics. A general discussion of issues related to non-standard language is followed by two case studies. The first study presents a taxonomy of morphosyntactic categories as an attempt to analyse non-standard forms produced by non-native learners of Czech. The second study focusses on the role of a rule-based grammar and lexicon in the process of building and using a parsebank.