Colexification occurs when two distinct concepts are lexified by the same word. The term covers both polysemy and homonymy. We posit and investigate the hypothesis that no pair of concepts is colexified in every language. We test our hypothesis by analyzing colexification data from BabelNet, Open Multilingual WordNet, and CLICS. The results show that our hypothesis is supported by over 99.9% of colexified concept pairs in these three lexical resources.
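The universality check described above amounts to intersecting the sets of colexified concept pairs across languages. A minimal sketch, using made-up toy data rather than actual BabelNet, OMW, or CLICS entries:

```python
# Toy check of the hypothesis that no concept pair is colexified in
# every language. The data below is illustrative only.
colexified = {
    "English": {("tree", "wood"), ("skin", "bark")},
    "French": {("tree", "wood"), ("language", "tongue")},
    "Turkish": {("language", "tongue"), ("skin", "bark")},
}

def universal_pairs(colex_by_lang):
    """Return the concept pairs colexified in every language."""
    langs = list(colex_by_lang.values())
    common = set(langs[0])
    for pairs in langs[1:]:
        common &= pairs
    return common

# An empty result means the hypothesis holds for this toy sample.
print(universal_pairs(colexified))  # set()
```

On real data the same intersection would be computed over millions of pairs, and the hypothesis predicts the result stays (near) empty.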
The results reported in this paper aim to increase the presence of the Uzbek language on the Internet and its usability within IT applications. We describe the initial development of a wordnet for the Uzbek language compatible with Princeton WordNet, which we call UZWORDNET. In its current version, UZWORDNET contains 28,140 synsets, 64,389 senses and 20,683 words; its estimated accuracy is 75.98%. To the best of our knowledge, it is the largest wordnet for Uzbek existing to date, and the second wordnet ever developed for the language.
This paper proposes the implementation of WordNets for five South African languages, namely, Sepedi, Setswana, Tshivenda, isiZulu and isiXhosa, to be added to the Open Multilingual WordNet (OMW) in the Natural Language Toolkit (NLTK). The African WordNets are converted from Princeton WordNet (PWN) 2.0 to 3.0 to match the synsets in PWN 3.0. After conversion, there were 7157, 11972, 1288, 6380, and 9460 lemmas for Sepedi, Setswana, Tshivenda, isiZulu and isiXhosa respectively. Setswana, isiXhosa and Sepedi contain more lemmas than 8 of the languages in OMW, and isiZulu contains more lemmas than 7 of the languages in OMW. A library has been published for continuous development of African WordNets in OMW using NLTK.
Deciding whether a semantically ambiguous word is homonymous or polysemous is equivalent to establishing whether it has any pair of senses that are semantically unrelated. We present novel methods for this task that leverage information from multilingual lexical resources. We formally prove the theoretical properties that provide the foundation for our methods. In particular, we show how the One Homonym Per Translation hypothesis of Hauer and Kondrak (2020a) follows from the synset properties formulated by Hauer and Kondrak (2020b). Experimental evaluation shows that our approach sets a new state of the art for homonymy detection.
This paper describes the development of an online lexical resource to help detection systems regulate and curb the use of offensive words online. With the growing prevalence of social media platforms, many conversations are now conducted online. The increase of online conversations for leisure, work and socializing has led to an increase in harassment. In particular, we create a specialized sense-based vocabulary of Japanese offensive words for the Open Multilingual Wordnet. This vocabulary expands on an existing list of Japanese offensive words and provides categorization and proper linking to synsets within the multilingual wordnet. This paper then discusses the evaluation of the vocabulary as a resource for representing and classifying offensive words and as a possible resource for offensive word use detection in social media.
In this paper we present a system that exploits different pre-trained Language Models for assigning domain labels to WordNet synsets without any kind of supervision. Furthermore, the system is not restricted to using a particular set of domain labels. We exploit the knowledge encoded within different off-the-shelf pre-trained Language Models and task formulations to infer the domain label of a particular WordNet definition. The proposed zero-shot system achieves a new state of the art on the English dataset used in the evaluation.
We propose a novel method of homonymy-polysemy discrimination for three Indo-European languages (English, Spanish and Polish). Support vector machines and LASSO logistic regression were successfully used in this task, outperforming baselines. The feature set utilised lemma properties, gloss similarities, graph distances and polysemy patterns. The proposed ML models performed equally well for English and for the other two languages, which constituted the test sets. The algorithms not only ruled out most cases of homonymy but were also efficacious in distinguishing between closer and more indirect semantic relatedness.
We report on the development of ASLNet, a wordnet for American Sign Language (ASL). ASLNet V1.0 is currently under construction by mapping easy-to-translate ASL lexical nouns to Princeton WordNet synsets. We describe our data model and mapping approach, which can be extended to any sign language. Analysis of the 390 synsets processed to date indicates the success of our procedure yet also highlights the need to supplement our mapping with the “merge” method. We outline our plans for upcoming work to remedy this, which include use of ASL free-association data.
Words are defined based on their meanings in various ways in different resources. Aligning word senses across monolingual lexicographic resources increases domain coverage and enables integration and incorporation of data. In this paper, we explore the application of classification methods using manually-extracted features along with representation learning techniques in the task of word sense alignment and semantic relationship detection. We demonstrate that the performance of classification methods varies dramatically depending on the type of semantic relationship, due to the nature of the task, while outperforming previous experiments.
Manipulation-relevant common-sense knowledge is crucial to support action-planning for complex tasks. In particular, instrumentality information of what can be done with certain tools can be used to limit the search space, which grows exponentially with the number of viable options. Typical sources for such knowledge, structured common-sense knowledge bases such as ConceptNet or WebChild, provide a limited amount of information which also varies drastically across different domains. Considering the recent success of pre-trained language models such as BERT, we investigate whether common-sense information can be extracted directly from semi-structured text with an acceptable annotation effort. Concretely, we compare the common-sense relations obtained from ConceptNet versus those extracted with BERT from large recipe databases. In this context, we propose a scoring function, based on the WordNet taxonomy, to match specific terms to more general ones, enabling a rich evaluation against a set of ground-truth relations.
The Global Wordnet Formats have been introduced to enable wordnets to have a common representation that can be integrated through the Global WordNet Grid. As a result of their adoption, a number of shortcomings of the format were identified, and in this paper we describe the extensions to the formats that address these issues. These include: ordering of senses, dependencies between wordnets, pronunciation, syntactic modelling, relations, sense keys, metadata and RDF support. Furthermore, we provide some perspectives on how these changes help in the integration of wordnets.
This paper introduces Wn, a new Python library for working with wordnets. Unlike previous libraries, Wn is built from the beginning to accommodate multiple wordnets — for multiple languages or multiple versions of the same wordnet — while retaining the ability to query and traverse them independently. It is also able to download and incorporate wordnets published online. These features are made possible through Wn’s adoption of standard formats and methods for interoperability, namely the WN-LMF schema (Vossen et al., 2013; Bond et al., 2020) and the Collaborative Interlingual Index (Bond et al., 2016). Wn is open-source, easily available, and well-documented.
We present here the results of a morphosemantic analysis of the verb-noun pairs in the Princeton WordNet as reflected in the standoff file containing pairs annotated with a set of 14 semantic relations. We have automatically distinguished between zero-derivation and affixal derivation in the data, identified the affixes, and manually checked the results. The data show that for each semantic relation one affix prevails in creating new words, although we cannot speak of its specificity with respect to that relation. Moreover, certain pairs of verb-noun semantic primes are better represented for each semantic relation, and some semantic clusters (in the form of WordNet subtrees) take shape as a result. We thus employ a large-scale, data-driven, linguistically motivated analysis, afforded by the rich derivational and morphosemantic description in WordNet, to capture finer regularities in the process of derivation as represented in the semantic properties of the words involved and as reflected in the structure of the lexicon.
FrameNet (Lowe, 1997; Baker et al., 1998; Fillmore and Atkins, 1998; Johnson et al., 2001) is a computational lexicography project that aims to offer insight into the semantic relationships between predicate and arguments. Having uses in many NLP applications, FrameNet has proven itself as a valuable resource. The main goal of this study is laying the foundation for building a comprehensive and cohesive Turkish FrameNet that is compatible with other resources like PropBank (Kara et al., 2020) or WordNet (Bakay et al., 2019; Ehsani, 2018; Ehsani et al., 2018; Parlar et al., 2019; Bakay et al., 2020) in the Turkish language.
The vast majority of the existing approaches for taxonomy enrichment apply word embeddings, as they have proven to accumulate contexts (in a broad sense) extracted from texts which are sufficient for attaching orphan words to the taxonomy. On the other hand, apart from being large lexical and semantic resources, taxonomies are graph structures. Combining word embeddings with the graph structure of a taxonomy could be of use for predicting taxonomic relations. In this paper we compare several graph-based approaches for attaching new words to an existing taxonomy with one that relies on fastText embeddings. We test all methods on Russian and English datasets, but they could also be applied to other wordnets and languages.
The paper reports on the methodology and final results of a large-scale synset mapping between plWordNet and Princeton WordNet. Dedicated manual and semi-automatic mapping procedures as well as interlingual relation types for nouns, verbs, adjectives and adverbs are described. The statistics of all types of interlingual relations are also provided.
Supervised approaches usually achieve the best performance in the Word Sense Disambiguation problem. However, the unavailability of large sense-annotated corpora for many low-resource languages makes these approaches inapplicable to them in practice. In this paper, we mitigate this issue for the Persian language by proposing a fully automatic approach for obtaining Persian SemCor (PerSemCor), a Persian Bag-of-Words (BoW) sense-annotated corpus. We evaluated PerSemCor both intrinsically and extrinsically and showed that it can be effectively used as a training set for Persian supervised WSD systems. To encourage future research on Persian Word Sense Disambiguation, we release PerSemCor at http://nlp.sbu.ac.ir.
Dictionary-based methods in sentiment analysis have recently received scholarly attention, the most comprehensive examples of which can be found in English. However, many other languages lack polarity dictionaries, or the existing ones are small in size, as in the case of SentiTurkNet, the first and only polarity dictionary in Turkish. Thus, this study aims to extend the content of SentiTurkNet by comparing the two available WordNets in Turkish, namely KeNet and the TR-wordnet of BalkaNet. To this end, a current Turkish polarity dictionary has been created relying on 76,825 synsets matching KeNet, where each synset has been annotated with one of three polarity labels: positive, negative and neutral. Meanwhile, the comparison of KeNet and the TR-wordnet of BalkaNet has revealed their weaknesses, such as the repetition of the same senses, lack of necessary merges of items belonging to the same synset, and the presence of redundant narrower versions of synsets, which are discussed in light of their potential for improving the current lexical databases of Turkish.
Currently, there are two available wordnets for Turkish: TR-wordnet of BalkaNet and KeNet. As the more comprehensive wordnet for Turkish, KeNet includes 76,757 synsets. KeNet has both intralingual semantic relations and is linked to PWN through interlingual relations. In this paper, we present the procedure adopted in creating KeNet, give details about our approach in annotating semantic relations such as hypernymy and discuss the language-specific problems encountered in these processes.
In the paper, we present the process of adding morphological information to the Polish WordNet (plWordNet). We describe the reasons for this connection and the intuitions behind it. We also draw attention to the specificity of Polish morphology. We show in which tasks morphological information is important and how existing methods can be developed by extending them with combined morphological information based on WordNet.
The paper presents the project Semantic Network with a Wide Range of Semantic Relations and its main achievements. The ultimate objective of the project is to expand Princeton WordNet with conceptual frames that define the syntagmatic relations of verb synsets and the semantic classes of nouns felicitous to combine with particular verbs. At this stage of the work: a) over 5,000 WordNet verb synsets have been supplied with manually evaluated FrameNet semantic frames, b) 253 semantic types have been manually mapped to the appropriate WordNet concepts providing detailed ontological representation of the semantic classes of nouns.
The Princeton WordNet for the English language has been used worldwide in NLP projects for many years. With the OMW initiative, wordnets for different languages of the world are being linked via identifiers. This parallel development and linking opens up new multilingual application perspectives. The development of a wordnet for the German language also falls within this context. To save development time, existing resources were combined and recompiled. The result was then evaluated and improved. In a relatively short time, a resource was created that can be used in projects and continuously improved and extended.
In the paper we compare the structure of the Russian language thesaurus RuWordNet with the data of a psychosemantic experiment to identify semantically close words. The aim of the study is to find out to what extent the structure of RuWordNet corresponds to the intuitive ideas of native speakers about the semantic proximity of words. Words of the mental sphere were chosen for the experiment, and the respondents were asked to list synonyms to a given word. As a result of the experiment, we found that the respondents mentioned not only synonyms but also words that are in paradigmatic relations with the stimuli. In 95% of cases, the words characterized in the experiment as semantically close were also close according to the thesaurus. In the other cases, additions to the thesaurus were proposed.
In the paper, we deal with the problem of unsupervised text document clustering for the Polish language. Our goal is to compare the modern approaches based on language modeling (doc2vec and BERT) with the classical ones, i.e., TF-IDF and wordnet-based. The experiments are conducted on three datasets containing qualification descriptions. The experimental results showed that wordnet-based similarity measures can compete with, and even outperform, modern embedding-based approaches.
We present our progress in developing a novel algorithm to extract synonyms from bilingual dictionaries. Identification and usage of synonyms play a significant role in improving the performance of information access applications. The idea is to construct a translation graph from translation pairs, then to extract and consolidate cyclic paths to form bilingual sets of synonyms. The initial evaluation of this algorithm illustrates promising results in extracting Arabic-English bilingual synonyms. In the evaluation, we first converted the synsets in the Arabic WordNet into translation pairs (i.e., losing word-sense memberships). Next, we applied our algorithm to rebuild these synsets. We compared the original and extracted synsets obtaining an F-Measure of 82.3% and 82.1% for Arabic and English synsets extraction, respectively.
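The cycle-based idea above can be sketched in a few lines: build an undirected translation graph from word pairs, then treat same-language words connected through short cycles (here, words sharing at least two translations, forming a 4-cycle) as candidate synonym sets. The data, the cycle length, and the threshold are illustrative assumptions, not the paper's exact algorithm:

```python
# Minimal sketch of cycle-based bilingual synonym extraction.
from collections import defaultdict

pairs = [  # toy (Arabic, English) translation pairs
    ("kitab", "book"), ("kitab", "volume"),
    ("mujallad", "volume"), ("mujallad", "book"),
    ("qalam", "pen"),
]

# Undirected translation graph; nodes are (language, word) pairs.
graph = defaultdict(set)
for ar, en in pairs:
    graph[("ar", ar)].add(("en", en))
    graph[("en", en)].add(("ar", ar))

def candidate_synsets(graph):
    """Same-language words sharing >= 2 translations -> candidate synonyms."""
    synsets = []
    nodes = sorted(graph)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if u[0] == v[0] and len(graph[u] & graph[v]) >= 2:
                synsets.append({u[1], v[1]})
    return synsets

print(candidate_synsets(graph))
```

Here "kitab" and "mujallad" share the translations "book" and "volume", so both the Arabic pair and the English pair fall into candidate synsets, while "qalam"/"pen" (a single edge, no cycle) do not.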
Neural language models, including transformer-based models, that are pre-trained on very large corpora have become a common way to represent text in various tasks, including recognition of textual semantic relations such as those of Cross-document Structure Theory (CST). Pre-trained models are usually fine-tuned to downstream tasks and the obtained vectors are used as input for deep neural classifiers; no linguistic knowledge obtained from resources and tools is utilised. In this paper we compare such universal approaches with a combination of a rich, graph-based, linguistically motivated sentence representation and a typical neural network classifier, applied to the task of recognising CST relations in Polish. The representation describes selected levels of the sentence structure, including description of lexical meanings on the basis of wordnet (plWordNet) synsets and connected SUMO concepts. The obtained results show that in the case of difficult relations and a medium-sized training corpus, the semantically enriched text representation leads to significantly better results.
Natural Language Processing tools and resources have been so far mainly created and trained for standard varieties of language. Nowadays, with the use of large amounts of data gathered from social media, other varieties and registers need to be processed, which may present other challenges and difficulties. In this work, we focus on English and we present a preliminary analysis by comparing the TwitterAAE corpus, which is annotated for ethnicity, and WordNet by quantifying and explaining the online language that WordNet misses.
A WordNet is a thesaurus containing a structured list of words organized according to their meanings. A WordNet represents word senses, all the meanings a single lemma may have, the relations between these senses, and their definitions. Another line of study within Natural Language Processing is sentiment analysis, with which datasets can be scored according to the emotion they contain. Using the data gathered for the Tourism WordNet, we performed a domain-specific sentiment analysis study by annotating the data. In this paper, we propose a method to facilitate Natural Language Processing tasks such as sentiment analysis performed in specific domains via creating a domain-specific subset of an original Turkish dictionary. As a preliminary study, we have created a WordNet for the tourism domain with 14,000 words and validated it on simple tasks.
WordNet is the most widely used lexical resource for English, while Wikidata is one of the largest knowledge graphs of entities and concepts available. While there is a clear difference in the focus of these two resources, there is also significant overlap, and as such a complete linking of these resources would have many uses. We propose the development of such a linking, first by means of hapax legomenon links and secondly by the use of natural language processing techniques. We show that this can be done with high accuracy, but that human validation is still necessary. This has resulted in over 9,000 links being added between these two resources.
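The hapax-legomenon linking step can be illustrated simply: a label that occurs exactly once in each resource can be matched with high confidence, since there is no ambiguity on either side. The label lists below are toy data, not actual WordNet or Wikidata content:

```python
# Illustrative hapax-legomenon linking between two label inventories.
from collections import Counter

wordnet_labels = ["bank", "bank", "aardvark", "quasar"]
wikidata_labels = ["bank", "aardvark", "aardvark", "quasar"]

def hapax_links(labels_a, labels_b):
    """Link labels occurring exactly once in both resources."""
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    return sorted(label for label in count_a
                  if count_a[label] == 1 and count_b.get(label) == 1)

print(hapax_links(wordnet_labels, wikidata_labels))  # ['quasar']
```

Ambiguous labels ("bank" appears twice in one resource, "aardvark" in the other) are left for the NLP-based second stage, which must disambiguate among candidates.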
This paper presents the work in progress toward the creation of a family of WordNets for Sanskrit, Ancient Greek, and Latin. Building on previous attempts in the field, we elaborate on these efforts by bridging WordNet relational semantics with theories of meaning from Cognitive Linguistics. We discuss some of the innovations we have introduced to the WordNet architecture to better capture the polysemy of words, as well as Indo-European language family-specific features. We conclude the paper by framing our work within the larger picture of resources available for ancient languages and showing that WordNet-backed search tools have the potential to redefine the kinds of questions that can be asked of ancient language corpora.
The paper describes work in progress in the DanNet2 project financed by the Carlsberg Foundation. The project's aim is to extend the original Danish wordnet, DanNet, in several ways. The main focus is on extending the coverage and description of adjectives, a part of speech that was rather sparsely described in the original wordnet. We describe the methodology and initial work of semi-automatically transferring adjectives from the Danish Thesaurus to the wordnet, with the aim of easily enlarging the coverage from 3,000 to approximately 13,000 adjectival synsets. Transfer is performed by manually encoding all missing adjectival subsection headwords from the thesaurus and thereafter employing a semi-automatic procedure where adjectives from the same subsection are transferred to the wordnet either 1) as near synonyms of the section's headword, 2) as hyponyms of the section's headword, or 3) as members of the same synset as the headword. We also discuss how to deal with the problem of multiple representations of the same sense in the thesaurus, and present other types of information from the thesaurus that we plan to integrate, such as thematic and sentiment information.
In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sense tagging. The main goal is to lead students to discover how we can represent meaning and where the limits of our current theories lie. A subsidiary goal is to create sense tagged corpora and an accompanying linked lexicon (in our case wordnets). We present the results of tagging several texts and suggest some ways in which the tagging process could be improved. Two authors of this paper present their own experience as students. Overall, students reported that they found the tagging an enriching experience. The annotated corpora and changes to the wordnet are made available through the NTU multilingual corpus and associated wordnets (NTU-MC).
This paper describes ongoing work aiming at adding pronunciation information to lexical semantic resources, with a focus on open wordnets. Our goal is not only to add a new modality to those semantic networks, but also to mark the heteronyms listed in them with the pronunciation information associated with their different meanings. This work could contribute in the longer term to the disambiguation of multi-modal resources combining text and speech.
In this paper we compare the Oxford Lexico and Merriam-Webster dictionaries with Princeton WordNet with respect to the description of semantic (dis)similarity between polysemous and homonymous senses that could be inferred from them. WordNet lacks any explicit description of polysemy or homonymy, but as a network of linked senses it may be used to compute semantic distances between word senses. To compare WordNet with the dictionaries, we transformed sample entry microstructures of the latter into graphs and cross-linked them with the equivalent senses of the former. We found that the dictionaries are in high agreement with each other if one considers polysemy and homonymy altogether, and in moderate concordance if one focuses merely on polysemy descriptions. Measuring shortest path lengths on WordNet gave results comparable to those on the dictionaries in predicting semantic dissimilarity between polysemous senses, but was less felicitous in recognising homonymy.
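The shortest-path measure used above can be illustrated on a small sense graph: senses connected by short paths are closely related, long or absent paths suggest weak relatedness or homonymy. The graph below is made up for the example and uses hypothetical sense identifiers, not actual WordNet data:

```python
# Toy shortest-path dissimilarity between senses in a sense graph.
from collections import deque

edges = [  # undirected links between toy word senses
    ("bank.n.01", "riverside.n.01"),
    ("bank.n.02", "institution.n.01"),
    ("institution.n.01", "organization.n.01"),
    ("riverside.n.01", "geography.n.01"),
]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def path_length(graph, start, goal):
    """BFS shortest-path length; None if the senses are unconnected."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

# The two "bank" senses live in disconnected components here,
# consistent with treating them as homonyms.
print(path_length(graph, "bank.n.01", "bank.n.02"))  # None
```

On the real WordNet graph, paths almost always exist, so a length threshold (rather than disconnection) would separate the "related" from the "unrelated" cases.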