Fausto Giunchiglia


2022

Language Diversity: Visible to Humans, Exploitable by Machines
Gábor Bella | Erdenebileg Byambadorj | Yamini Chandrashekar | Khuyagbaatar Batsuren | Danish Cheema | Fausto Giunchiglia
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

The Universal Knowledge Core (UKC) is a large multilingual lexical database with a focus on language diversity and covering over two thousand languages. The aim of the database, as well as its tools and data catalogue, is to make the abstract notion of linguistic diversity visually understandable for humans and formally exploitable by machines. The UKC website lets users explore millions of individual words and their meanings, but also phenomena of cross-lingual convergence and divergence, such as shared interlingual meanings, lexicon similarities, cognate clusters, or lexical gaps. The UKC LiveLanguage Catalogue, in turn, provides access to the underlying lexical data in a computer-processable form, ready to be reused in cross-lingual applications.

ZiNet: Linking Chinese Characters Spanning Three Thousand Years
Yang Chi | Fausto Giunchiglia | Daqian Shi | Xiaolei Diao | Chuntao Li | Hao Xu
Findings of the Association for Computational Linguistics: ACL 2022

Modern Chinese characters have evolved over the past 3,000 years. To date, tens of thousands of glyphs of ancient characters have been discovered, which must be deciphered by experts to interpret unearthed documents. Experts usually need to compare each ancient character under examination with similar known ones across all historical periods. However, this comparison is inevitably limited by human memory and experience: it often takes a lot of time, and the associations found are limited to a small scope. To help researchers discover glyph-similar characters, this paper introduces ZiNet, the first diachronic knowledge base describing relationships and evolution of Chinese characters and words. In addition, powered by the knowledge of radical systems in ZiNet, this paper introduces glyph similarity measurement between ancient Chinese characters, which could capture similar glyph pairs that are potentially related in origins or semantics. Results show strong positive correlations between scores from the method and from human experts. Finally, qualitative analysis and implicit future applications are presented.

2021

The Quality of Lexical Semantic Resources: A Survey
Hadi Khalilia | Abed Alhakim Freihat | Fausto Giunchiglia
Proceedings of The Fourth International Conference on Natural Language and Speech Processing (ICNLSP 2021)

The Dimensions of Lexical Semantic Resource Quality
Hadi Khalilia | Abed Alhakim Freihat | Fausto Giunchiglia
Proceedings of The Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) co-located with ICNLSP 2021

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology
Khuyagbaatar Batsuren | Gábor Bella | Fausto Giunchiglia
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g. to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilingual morphological databases, we announce the release of MorphyNet, a high-quality resource with 15 languages, 519k derivational and 10.1M inflectional entries, and a rich set of morphological features. MorphyNet was extracted from Wiktionary using both hand-crafted and automated methods, and was manually evaluated to be of a precision higher than 98%. Both the resource generation logic and the resulting database are made freely available and are reusable as stand-alone tools or in combination with existing resources.

Is this Enough?-Evaluation of Malayalam Wordnet
Nandu Chandran Nair | Maria-chiara Giangregorio | Fausto Giunchiglia
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

The quality of a product is the degree to which the product meets the customer’s expectations, and this must also hold for lexical semantic resources. Conducting a periodic evaluation of resources is essential to ensure that they meet a native speaker’s expectations and are free from errors. This paper defines the possible mistakes in a lexical semantic resource and explains the steps applied to quantify Malayalam wordnet quality. Malayalam is one of the classical languages of India. We hope to isolate the lower-quality part of the wordnet and use crowdsourcing to improve it.

Deep Attention Diffusion Graph Neural Networks for Text Classification
Yonghao Liu | Renchu Guan | Fausto Giunchiglia | Yanchun Liang | Xiaoyue Feng
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Text classification is a fundamental task with broad applications in natural language processing. Recently, graph neural networks (GNNs) have attracted much attention due to their powerful representation ability. However, most existing methods for text classification based on GNNs consider only one-hop neighborhoods and low-frequency information within texts, which cannot fully utilize the rich context information of documents. Moreover, these models suffer from over-smoothing issues if many graph layers are stacked. In this paper, a Deep Attention Diffusion Graph Neural Network (DADGNN) model is proposed to learn text representations, bridging the chasm of interaction difficulties between a word and its distant neighbors. Experimental results on various standard benchmark datasets demonstrate the superior performance of the present approach.

2020

Exploring the Language of Data
Gábor Bella | Linda Gremes | Fausto Giunchiglia
Proceedings of the 28th International Conference on Computational Linguistics

We set out to uncover the unique grammatical properties of an important yet so far under-researched type of natural language text: that of short labels typically found within structured datasets. We show that such labels obey a specific type of abbreviated grammar that we call the Language of Data, with properties significantly different from the kinds of text typically addressed in computational linguistics and NLP, such as ‘standard’ written language or social media messages. We analyse orthography, parts of speech, and syntax over a large, bilingual, hand-annotated corpus of data labels collected from a variety of domains. We perform experiments on tokenisation, part-of-speech tagging, and named entity recognition over real-world structured data, demonstrating that models adapted to the Language of Data outperform those trained on standard text. These observations point in a new direction to be explored as future research, in order to develop new NLP tools and models dedicated to the Language of Data.

A Major Wordnet for a Minority Language: Scottish Gaelic
Gábor Bella | Fiona McNeill | Rody Gorman | Caoimhin O Donnaile | Kirsty MacDonald | Yamini Chandrashekar | Abed Alhakim Freihat | Fausto Giunchiglia
Proceedings of the 12th Language Resources and Evaluation Conference

We present a new wordnet resource for Scottish Gaelic, a Celtic minority language spoken by about 60,000 people, most of whom live in Northwestern Scotland. The wordnet contains over 15 thousand word senses and was constructed by merging ten thousand new, high-quality translations, provided and validated by language experts, with an existing wordnet derived from Wiktionary. This new, considerably extended wordnet, currently among the 30 largest in the world, targets multiple communities: language speakers and learners, linguists, and computer scientists solving problems related to natural language processing. By publishing it as a freely downloadable resource, we hope to contribute to the long-term preservation of Scottish Gaelic as a living language, both offline and on the Web.

2019

Building the Mongolian WordNet
Khuyagbaatar Batsuren | Amarsanaa Ganbold | Altangerel Chagnaa | Fausto Giunchiglia
Proceedings of the 10th Global Wordnet Conference

This paper presents the Mongolian Wordnet (MOW) and a general methodology for constructing it from various sources, e.g. lexical resources and expert translations. As of today, the MOW contains 23,665 synsets, 26,875 words, 2,979 glosses, and 213 examples. The manual evaluation of the resource estimated its quality at 96.4%.

CogNet: A Large-Scale Cognate Database
Khuyagbaatar Batsuren | Gabor Bella | Fausto Giunchiglia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper introduces CogNet, a new, large-scale lexical database that provides cognates (words of common origin and meaning) across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, it presents statistics about the cognate data and some initial insights into it, hinting at a possible future exploitation of the resource by various fields of linguistics.

2017

TrentoTeam at SemEval-2017 Task 3: An application of Grice Maxims in Ranking Community Question Answers
Mohammed R. H. Qwaider | Abed Alhakim Freihat | Fausto Giunchiglia
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we present the TrentoTeam system, which participated in Task 3 at SemEval-2017 (Nakov et al., 2017). We concentrated our work on applying Grice Maxims (used in many state-of-the-art machine learning applications (Vogel et al., 2013; Kheirabadi and Aghagolzadeh, 2012; Dale and Reiter, 1995; Franke, 2011)) to ranking the answers to a question by answer relevancy. In particular, we created a ranker system based on relevancy scores assigned by 3 main components: named entity recognition, similarity score, and sentiment analysis. Our system obtained results comparable to those of machine learning systems.

2016

A Taxonomic Classification of WordNet Polysemy Types
Abed Alhakim Freihat | Fausto Giunchiglia | Biswanath Dutta
Proceedings of the 8th Global WordNet Conference (GWC)

WordNet represents polysemous terms by capturing the different meanings of these terms at the lexical level, but without emphasis on the polysemy types such terms belong to. State-of-the-art polysemy approaches identify several polysemy types in WordNet, but they do not explain how to classify and organize them. In this paper, we present a novel approach for classifying the polysemy types that exploits taxonomic principles, which in turn allow us to discover a set of polysemy structural patterns.

1984

NAtural Language driven Image Generation
Giovanni Adorni | Mauro Di Manzo | Fausto Giunchiglia
10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics