Federico Sangati

2024

Emojilingo: Harnessing AI to Translate Words into Emojis
Francesca Chiusaroli | Federico Sangati | Johanna Monti | Maria Laura Pierucci | Tiberio Uricchio
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper presents an AI experiment of translation in emoji conducted on a glossary from Dante Alighieri’s Comedy. The experiment is part of a project aiming to build up an automated emojibased pivot language providing an interlingua as a tool for linguistic simplification, accessibility, and international communication: Emojilingo. The present test involves human (Emojitaliano) and machine (Chat-GPT) translations in a comparative analysis to devise an automated integrated model highlighting emojis’ expressive ability in transferring senses, clarifying semantic obscurities and ambiguities, and simplifying language. A first preliminary evaluation highlights Chat-GPT’s ability to deal with a classic archaic literary vocabulary, also raising issues on managing criteria for better grasping the meanings and forms and about the multicultural extent of content transfer.

We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.

pdf bib abs

The Challenge of the TV game La Ghigliottina to NLP
Federico Sangati | Antonio Pascucci | Johanna Monti
Workshop on Games and Natural Language Processing

In this paper, we describe a Telegram bot, Mago della Ghigliottina (Ghigliottina Wizard), able to solve La Ghigliottina game (The Guillotine), the final game of the Italian TV quiz show L’Eredità. Our system relies on linguistic resources and artificial intelligence and achieves better results than human players (and competitors of L’Eredità too). In addition to solving a game, Mago della Ghigliottina can also generate new game instances and challenge the users to match the solution.

pdf bib abs

From Linguistic Resources to Ontology-Aware Terminologies: Minding the Representation Gap
Giulia Speranza | Maria Pia di Buono | Johanna Monti | Federico Sangati
Proceedings of the Twelfth Language Resources and Evaluation Conference

Terminological resources have proven crucial in many applications ranging from Computer-Aided Translation tools to authoring softwares and multilingual and cross-lingual information retrieval systems. Nonetheless, with the exception of a few felicitous examples, such as the IATE (Interactive Terminology for Europe) Termbank, many terminological resources are not available in standard formats, such as Term Base eXchange (TBX), thus preventing their sharing and reuse. Yet, these terminologies could be improved associating the correspondent ontology-based information. The research described in the present contribution demonstrates the process and the methodologies adopted in the automatic conversion into TBX of such type of resources, together with their semantic enrichment based on the formalization of ontological information into terminologies. We present a proof-of-concept using the Italian Linguistic Resource for the Archaeological domain (developed according to Thesauri and Guidelines of the Italian Central Institute for the Catalogue and Documentation). Further, we introduce the conversion tool developed to support the process of creating ontology-aware terminologies for improving interoperability and sharing of existing language technologies and data sets.

pdf bib abs

In this work, we report on a crowdsourcing experiment conducted using the V-TREL vocabulary trainer which is accessed via a Telegram chatbot interface to gather knowledge on word relations suitable for expanding ConceptNet. V-TREL is built on top of a generic architecture implementing the implicit crowdsourding paradigm in order to offer vocabulary training exercises generated from the commonsense knowledge-base ConceptNet and – in the background – to collect and evaluate the learners’ answers to extend ConceptNet with new words. In the experiment about 90 university students learning English at C1 level, based on Common European Framework of Reference for Languages (CEFR), trained their vocabulary with V-TREL over a period of 16 calendar days. The experiment allowed to gather more than 12,000 answers from learners on different question types. In this paper we present in detail the experimental setup and the outcome of the experiment, which indicates the potential of our approach for both crowdsourcing data as well as fostering vocabulary skills.

2019

pdf bib abs

v-trel: Vocabulary Trainer for Tracing Word Relations - An Implicit Crowdsourcing Approach
Verena Lyding | Christos Rodosthenous | Federico Sangati | Umair ul Hassan | Lionel Nicolas | Alexander König | Jolita Horbacauskiene | Anisia Katinskaia
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we present our work on developing a vocabulary trainer that uses exercises generated from language resources such as ConceptNet and crowdsources the responses of the learners to enrich the language resource. We performed an empirical evaluation of our approach with 60 non-native speakers over two days, which shows that new entries to expand Concept-Net can efficiently be gathered through vocabulary exercises on word relations. We also report on the feedback gathered from the users and an expert from language teaching, and discuss the potential of the vocabulary trainer application from the user and language learner perspective. The feedback suggests that v-trel has educational potential, while in its current state some shortcomings could be identified.

2018

pdf bib

Advances in Multiword Expression Identification for the Italian language: The PARSEME Shared Task Edition 1.1
Johanna Monti | Silvio Ricardo Cordeiro | Carlos Ramisch | Federico Sangati | Agata Savary | Veronika Vincze
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

DialettiBot: a Telegram Bot for Crowdsourcing Recordings of Italian Dialects
Federico Sangati | Ekaterina Abramova | Johanna Monti
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

2017

pdf bib

PARSEME-It Corpus An annotated Corpus of Verbal Multiword Expressions in Italian
Johanna Monti | Maria Pia Di Buono | Federico Sangati
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

pdf bib abs

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

2016

pdf bib abs

This paper summarizes the preliminary results of an ongoing survey on multiword resources carried out within the IC1207 Cost Action PARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogs and the inventory of multiword datasets on the SIGLEX-MWE website, multiword resources are scattered and difficult to find. In many cases, language resources such as corpora, treebanks, or lexical databases include multiwords as part of their data or take them into account in their annotations. However, these resources need to be centralized to make them accessible. The aim of this survey is to create a portal where researchers can easily find multiword(-aware) language resources for their research. We report on the design of the survey and analyze the data gathered so far. We also discuss the problems we have detected upon examination of the data as well as possible ways of enhancing the survey.

pdf bib abs

D(H)ante: A New Set of Tools for XIII Century Italian
Angelo Basile | Federico Sangati
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we describe 1) the process of converting a corpus of Dante Alighieri from a TEI XML format in to a pseudo-CoNLL format; 2) how a pos-tagger trained on modern Italian performs on Dante’s Italian 3) the performances of two different pos-taggers trained on the given corpus. We are making our conversion scripts and models available to the community. The two other models trained on the corpus performs reasonably well. The tool used for the conversion process might turn useful for bridging the gap between traditional digital humanities and modern NLP applications since the TEI original format is not usually suitable for being processed with standard NLP tools. We believe our work will serve both communities: the DH community will be able to tag new documents and the NLP world will have an easier way in converting existing documents to a standardized machine-readable format.

2015

pdf bib

Multiword Expression Identification with Recurring Tree Fragments and Association Measures
Federico Sangati | Andreas van Cranenburgh
Proceedings of the 11th Workshop on Multiword Expressions

2013

pdf bib abs

Incremental Tree Substitution Grammar for Parsing and Sentence Prediction
Federico Sangati | Frank Keller
Transactions of the Association for Computational Linguistics, Volume 1

In this paper, we present the first incremental parser for Tree Substitution Grammar (TSG). A TSG allows arbitrarily large syntactic fragments to be combined into complete trees; we show how constraints (including lexicalization) can be imposed on the shape of the TSG fragments to enable incremental processing. We propose an efficient Earley-based algorithm for incremental TSG parsing and report an F-score competitive with other incremental parsers. In addition to whole-sentence F-score, we also evaluate the partial trees that the parser constructs for sentence prefixes; partial trees play an important role in incremental interpretation, language modeling, and psycholinguistics. Unlike existing parsers, our incremental TSG parser can generate partial trees that include predictions about the upcoming words in a sentence. We show that it outperforms an n-gram model in predicting more than one upcoming word.

2011

pdf bib

Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP
Federico Sangati | Willem Zuidema
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib

Discontinuous Data-Oriented Parsing: A mildly context-sensitive all-fragments grammar
Andreas van Cranenburgh | Remko Scha | Federico Sangati
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages

2010

pdf bib

How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies
Daniil Umanski | Federico Sangati
Proceedings of the ACL 2010 Student Research Workshop

pdf bib abs

Efficiently Extract Rrecurring Tree Fragments from Large Treebanks
Federico Sangati | Willem Zuidema | Rens Bod
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we describe FragmentSeeker, a tool which is capable to identify all those tree constructions which are recurring multiple times in a large Phrase Structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees of a given treebank and computes the list of fragments which they both share. We describe two different notions of fragments we will use, i.e. standard and partial fragments, and provide the implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank for which we present quantitative and qualitative analysis on the obtained recurring structures, as well as provide empirical time performance. Finally we propose possible ways our tool could contribute to different research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.

pdf bib

A Probabilistic Generative Model for an Intermediate Constituency-Dependency Representation
Federico Sangati
Proceedings of the ACL 2010 Student Research Workshop