Verbs like make, have and get present challenges for applications requiring automatic word sense discrimination. These verbs are both highly frequent and polysemous, with semantically “full” readings, as in make dinner, and “light” readings, as in make a request. Lexical resources like WordNet encode dozens of senses, making discrimination difficult and inviting proposals for reducing the number of entries or grouping them into coarser-grained supersenses. We propose a data-driven, linguistically-based approach to establishing a motivated sense inventory, focusing on make to establish a proof of concept. From several large, syntactically annotated corpora, we extract nouns that are complements of the verb make, and group them into clusters based on their Word2Vec semantic vectors. We manually inspect, for each cluster, the words with vectors closest to the centroid as well as a random sample of words within the cluster. The results show that the clusters reflect an intuitively plausible sense discrimination of make. As an evaluation, we test whether words within a given cluster cooccur in coordination phrases, such as apples and oranges, as prior work has shown that such conjoined nouns are semantically related. Conversely, noun complements from different clusters are less likely to be conjoined. Thus, coordination provides a similarity metric independent of the contextual embeddings used for clustering. Our results pave the way for a WordNet sense inventory that, while not inconsistent with the present one, would reduce it significantly and hold promise for improved automatic word sense discrimination.
Pre-trained language models encode undesirable social biases, which are further exacerbated in downstream use. To this end, we propose MABEL (a Method for Attenuating Gender Bias using Entailment Labels), an intermediate pre-training approach for mitigating gender bias in contextualized representations. Key to our approach is the use of a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs from natural language inference (NLI) datasets. We also introduce an alignment regularizer that pulls identical entailment pairs along opposite gender directions closer. We extensively evaluate our approach on intrinsic and extrinsic metrics, and show that MABEL outperforms previous task-agnostic debiasing approaches in terms of fairness. It also preserves task performance after fine-tuning on downstream tasks. Together, these findings demonstrate the suitability of NLI data as an effective means of bias mitigation, as opposed to only using unlabeled sentences in the literature. Finally, we identify that existing approaches often use evaluation settings that are insufficient or inconsistent. We make an effort to reproduce and compare previous methods, and call for unifying the evaluation settings across gender debiasing methods for better future comparison.
Coordination is a phenomenon of language that conjoins two or more terms or phrases using a coordinating conjunction. Although coordination has been explored extensively in the linguistics literature, the rules and constraints that govern its structure are still largely elusive and widely debated amongst linguists. This paper presents a study of two-termed unlike coordinations in particular, where the two conjuncts of the coordination phrase form valid constituents but have distinct categories. We conducted a syntactic analysis of the phrasal categories that can be conjoined in such unlike coordinations through a computational corpus-based approach, utilizing the Corpus of Contemporary American English (COCA) as the main data source, as well as the Penn Treebank (PTB). The results show that the two conjuncts within unlike coordinations display different properties based on their position, supporting an antisymmetric view of the structure of coordination. This research provides new data and perspectives through the use of statistical techniques that can help shape future theories and models of coordination.
We report on the development of ASLNet, a wordnet for American Sign Language (ASL). ASLNet V1.0 is currently under construction by mapping easy-to-translate ASL lexical nouns to Princeton WordNet synsets. We describe our data model and mapping approach, which can be extended to any sign language. Analysis of the 390 synsets processed to date indicates the success of our procedure yet also highlights the need to supplement our mapping with the “merge” method. We outline our plans for upcoming work to remedy this, which include use of ASL free-association data.
Recent years have seen a surge in research on the biases in word embeddings with respect to gender and, to a lesser extent, race. Few of these studies, however, have given attention to the critical intersection of race and gender. In this case study, we analyze the dimensions of gender and race in contextualized word embeddings of given names, taken from BERT, and investigate the nature and nuance of their interaction. We find that these demographic axes, though typically treated as physically and conceptually separate, are in fact interdependent and thus inadvisable to consider in isolation. Further, we show that demographic dimensions predicated on default settings in language, such as in pronouns, may risk rendering groups with multiple marginalized identities invisible. We conclude by discussing the importance and implications of intersectionality for future studies on bias and debiasing in NLP.
We describe the release of a new wordnet for English based on the Princeton WordNet, but now developed under an open-source model. In particular, this version of WordNet, which we call English WordNet 2019, which has been developed by multiple people around the world through GitHub, fixes many errors in previous wordnets for English. We give some details of the changes that have been made in this version and give some perspectives about likely future changes that will be made as this project continues to evolve.
We discuss the creation of ASLNet by aligning the Princeton WordNet (PWN) with SignStudy, an online database of American Sign Language (ASL) signs. This alignment will have many immediate benefits for first and second-sign language learners as well as ASL researchers by highlighting semantic relations among signs. We begin to address the interesting theoretical question of to what extent the wordnet-style organization of the English lexicon (and those of wordnets in other spoken languages) is applicable to ASL, and whether ASL requires positing additional, language or modality-specific relations among signs. Significantly, the mapping of SignStudy and PWN provides a bridge between ASL and the worldwide wordnet community, which comprises speakers of dozens of languages working in academic and language technology settings.
We describe a project to link the Princeton WordNet to 3D representations of real objects and scenes. The goal is to establish a dataset that helps us to understand how people categorize everyday common objects via their parts, attributes, and context. This paper describes the annotation and data collection effort so far as well as ideas for future work.
We present a fully unsupervised method for automated construction of WordNets based upon recent advances in distributional representations of sentences and word-senses combined with readily available machine translation tools. The approach requires very few linguistic resources and is thus extensible to multiple target languages. To evaluate our method we construct two 600-word testsets for word-to-synset matching in French and Russian using native speakers and evaluate the performance of our method along with several other recent approaches. Our method exceeds the best language-specific and multi-lingual automated WordNets in F-score for both languages. The databases we construct for French and Russian, both languages without large publicly available manually constructed WordNets, will be publicly released along with the testsets.
This paper introduces the motivation for and design of the Collaborative InterLingual Index (CILI). It is designed to make possible coordination between multiple loosely coupled wordnet projects. The structure of the CILI is based on the Interlingual index first proposed in the EuroWordNet project with several pragmatic extensions: an explicit open license, definitions in English and links to wordnets in the Global Wordnet Grid.
While gender identities in the Western world are typically regarded as binary, our previous work (Hicks et al., 2015) shows that there is more lexical variety of gender identity and the way people identify their gender. There is also a growing need to lexically represent this variety of gender identities. In our previous work, we developed a set of tools and approaches for analyzing Twitter data as a basis for generating hypotheses on language used to identify gender and discuss gender-related issues across geographic regions and population groups in the U.S.A. In this paper we analyze the coverage and relative frequency of the word forms in our Twitter analysis with respect to the National Transgender Discrimination Survey data set, one of the most comprehensive data sets on transgender, gender non-conforming, and gender variant people in the U.S.A. We then analyze the coverage of WordNet, a widely used lexical database, with respect to these identities and discuss some key considerations and next steps for adding gender identity words and their meanings to WordNet.
Many new wordnets in the world are constantly created and most take the original Princeton WordNet (PWN) as their starting point. This arguably central position imposes a responsibility on PWN to ensure that its structure is clean and consistent. To validate PWN hierarchical structures we propose the application of a system of test patterns. In this paper, we report on how to validate the PWN hierarchies using the system of test patterns. In sum, test patterns provide lexicographers with a very powerful tool, which we hope will be adopted by the global wordnet community.
We propose an automatic approach towards determining the relative location of adjectives on a common scale based on their strength. We focus on adjectives expressing different degrees of goodness occurring in French product (perfumes) reviews. Using morphosyntactic patterns, we extract from the reviews short phrases consisting of a noun that encodes a particular aspect of the perfume and an adjective modifying that noun. We then associate each such n-gram with the corresponding product aspect and its related star rating. Next, based on the star scores, we generate adjective scales reflecting the relative strength of specific adjectives associated with a shared attribute of the product. An automatic ordering of the adjectives “correct” (correct), “sympa” (nice), “bon” (good) and “excellent” (excellent) according to their score in our resource is consistent with an intuitive scale based on human judgments. Our long-term objective is to generate different adjective scales in an empirical manner, which could allow the enrichment of lexical resources.
The MASC project has produced a multi-genre corpus with multiple layers of linguistic annotation, together with a sentence corpus containing WordNet 3.1 sense tags for 1000 occurrences of each of 100 words produced by multiple annotators, accompanied by indepth inter-annotator agreement data. Here we give an overview of the contents of MASC and then focus on the word sense sentence corpus, describing the characteristics that differentiate it from other word sense corpora and detailing the inter-annotator agreement studies that have been performed on the annotations. Finally, we discuss the potential to grow the word sense sentence corpus through crowdsourcing and the plan to enhance the content and annotations of MASC through a community-based collaborative effort.
We analyze how different conceptions of lexical semantics affect sense annotations and how multiple sense inventories can be compared empirically, based on annotated text. Our study focuses on the MASC project, where data has been annotated using WordNet sense identifiers on the one hand, and FrameNet lexical units on the other. This allows us to compare the sense inventories of these lexical resources empirically rather than just theoretically, based on their glosses, leading to new insights. In particular, we compute contingency matrices and develop a novel measure, the Expected Jaccard Index, that quantifies the agreement between annotations of the same data based on two different resources even when they have different sets of categories.
In this paper we discuss noun compounding, a highly generative, productive process, in three distinct languages: Czech, English and Zulu. Derivational morphology presents a large grey area between regular, compositional and idiosyncratic, non-compositional word forms. The structural properties of compounds in each of the languages are reviewed and contrasted. Whereas English compounds are head-final and thus left-branching, Czech and Zulu compounds usually consist of a leftmost governing head and a rightmost dependent element. Semantic properties of compounds are discussed with special reference to semantic relations between compound members which cross-linguistically show universal patterns, but idiosyncratic, language specific compounds are also identified. The integration of compounds into lexical resources, and WordNets in particular, remains a challenge that needs to be considered in terms of the compounds syntactic idiosyncrasy and semantic compositionality. Experiments with processing compounds in Czech, English and Zulu are reported and partly evaluated. The obtained partial lists of the Czech, English and Zulu compounds are also described.
To answer the critical need for sharable, reusable annotated resources with rich linguistic annotations, we are developing a Manually Annotated Sub-Corpus (MASC) including texts from diverse genres and manual annotations or manually-validated annotations for multiple levels, including WordNet senses and FrameNet frames and frame elements, both of which have become significant resources in the international computational linguistics community. To derive maximal benefit from the semantic information provided by these resources, the MASC will also include manually-validated shallow parses and named entities, which will enable linking WordNet senses and FrameNet frames within the same sentences into more complex semantic structures and, because named entities will often be the role fillers of FrameNet frames, enrich the semantic and pragmatic information derivable from the sub-corpus. All MASC annotations will be published with detailed inter-annotator agreement measures. The MASC and its annotations will be freely downloadable from the ANC website, thus providing maximum accessibility for researchers from around the globe.
We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (Kybots). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.
Arabic WordNet is a lexical resource for Modern Standard Arabic based on the widely used Princeton WordNet for English (Fellbaum, 1998). Arabic WordNet (AWN) is based on the design and contents of the universally accepted Princeton WordNet (PWN) and will be mappable straightforwardly onto PWN 2.0 and EuroWordNet (EWN), enabling translation on the lexical level to English and dozens of other languages. We have developed and linked the AWN with the Suggested Upper Merged Ontology (SUMO), where concepts are defined with machine interpretable semantics in first order logic (Niles and Pease, 2001). We have greatly extended the ontology and its set of mappings to provide formal terms and definitions for each synset. The end product would be a linguistic resource with a deep formal semantic foundation that is able to capture the richness of Arabic as described in Elkateb (2005). Tools we have developed as part of this effort include a lexicographer's interface modeled on that used for EuroWordNet, with added facilities for Arabic script, following Black and Elkateb's earlier work (2004). In this paper we describe our methodology for building a lexical resource in Arabic and the challenge of Arabic for lexical resources.
This paper introduces a recently initiated project that focuses on building a lexical resource for Modern Standard Arabic based on the widely used Princeton WordNet for English (Fellbaum, 1998). Our aim is to develop a linguistic resource with a deep formal semantic foundation in order to capture the richness of Arabic as described in Elkateb (2005). Arabic WordNet is being constructed following methods developed for EuroWordNet (Vossen, 1998). In addition to the standard wordnet representation of senses, word meanings are also being defined with a machine understandable semantics in first order logic. The basis for this semantics is the Suggested Upper Merged Ontology and its associated domain ontologies (Niles and Pease, 2001). We will greatly extend the ontology and its set of mappings to provide formal terms and definitions for each synset. Tools to be developed as part of this effort include a lexicographer's interface modeled on that used for EuroWordNet, with added facilities for Arabic script, following Black and Elkateb's earlier work (2004).