Roberto Bartolini

2022

Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus
Tommaso Agnoloni | Roberto Bartolini | Francesca Frontini | Simonetta Montemagni | Carlo Marchetti | Valeria Quochi | Manuela Ruisi | Giulia Venturi
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

This paper describes the process of acquisition, cleaning, interpretation, coding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic covering the COVID-19 period and a former period for reference and comparison according to the CLARIN ParlaMint guidelines and prescriptions. The corpus contains 1199 sessions and 79,373 speeches, for a total of about 31 million words and was encoded according to the ParlaCLARIN TEI XML format, as well as in CoNLL-UD format. It includes extensive metadata about the speakers, the sessions, the political parties and Parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the universal dependencies guidelines. Named entity classification was also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus deposited and archived in CLARIN repository together with all other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used both for research and educational purposes.

2018

pdf bib

The LREC Workshops Map
Roberto Bartolini | Sara Goggi | Monica Monachini | Gabriella Pardelli
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib abs

This proposal describes a new way to visualise resources in the LREMap, a community-built repository of language resource descriptions and uses. The LREMap is represented as a force-directed graph, where resources, papers and authors are nodes. The analysis of the visual representation of the underlying graph is used to study how the community gathers around LRs and how LRs are used in research.

2014

pdf bib abs

From Synsets to Videos: Enriching ItalWordNet Multimodally
Roberto Bartolini | Valeria Quochi | Irene De Felice | Irene Russo | Monica Monachini
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper describes the multimodal enrichment of ItalWordNet action verbs entries by means of an automatic mapping with an ontology of action types instantiated by video scenes (ImagAct). The two resources present important differences as well as interesting complementary features, such that a mapping of these two resources can lead to a an enrichment of IWN, through the connection between synsets and videos apt to illustrate the meaning described by glosses. Here, we describe an approach inspired by ontology matching methods for the automatic mapping of ImagAct video scened onto ItalWordNet sense. The experiments described in the paper are conducted on Italian, but the same methodology can be extended to other languages for which WordNets have been created, since ImagAct is done also for English, Chinese and Spanish. This source of multimodal information can be exploited to design second language learning tools, as well as for language grounding in video action recognition and potentially for robotics.

Semantic annotation of text requires the dynamic merging of linguistically structured information and a world model, usually represented as a domain-specific ontology. On the other hand, the process of engineering a domain-ontology through semi-automatic ontology learning system requires the availability of a considerable amount of semantically annotated documents. Facing this bootstrapping paradox requires an incremental process of annotation-acquisition-annotation, whereby domain-specific knowledge is acquired from linguistically-annotated texts and then projected back onto texts for extra linguistic information to be annotated and further knowledge layers to be extracted. The presented methodology is a first step in the direction of a full virtuous circle where the semantic annotation platform and the evolving ontology interact in symbiosis. As a case study we have chosen the semantic annotation of product catalogues. We propose a hybrid approach, combining pattern matching techniques to exploit the regular structure of product descriptions in catalogues, and Natural Language Processing techniques which are resorted to analyze natural language descriptions. The semantic annotation involves the access to the ontology, semi-automatically bootstrapped with an ontology learning tool from annotated collections of catalogues.

pdf bib abs

A Bilingual Corpus of Inter-linked Events
Tommaso Caselli | Nancy Ide | Roberto Bartolini
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the creation of a bilingual corpus of inter-linked events for Italian and English. Linkage is accomplished through the Inter-Lingual Index (ILI) that links ItalWordNet with WordNet. The availability of this resource, on the one hand, enables contrastive analysis of the linguistic phenomena surrounding events in both languages, and on the other hand, can be used to perform multilingual temporal analysis of texts. In addition to describing the methodology for construction of the inter-linked corpus and the analysis of the data collected, we demonstrate that the ILI could potentially be used to bootstrap the creation of comparable corpora by exporting layers of annotation for words that have the same sense.

pdf bib abs

UFRA: a UIMA-based Approach to Federated Language Resource Architecture
Riccardo Del Gratta | Roberto Bartolini | Tommaso Caselli | Monica Monachini | Claudia Soria | Nicoletta Calzolari
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we address the issue of developing an interoperable infrastructure for language resources and technologies. In our approach, called UFRA, we extend the Federate Database Architecture System adding typical functionalities caming from UIMA. In this way, we capitalize the advantages of a federated architecture, such as autonomy, heterogeneity and distribution of components, monitored by a central authority responsible for checking both the integration of components and user rights on performing different tasks. We use the UIMA approach to manage and define one common front-end, enabling users and clients to query, retrieve and use language resources and technologies. The purpose of this paper is to show how UIMA leads from a Federated Database Architecture to a Federated Resource Architecture, adding to a registry of available components both static resources such as lexicons and corpora and dynamic ones such as tools and general purpose language technologies. At the end of the paper, we present a case-study that adopts this framework to integrate the SIMPLE lexicon and TIMEML annotation guidelines to tag natural language texts.

2006

pdf bib abs

In this paper we present an original approach to natural language query interpretation which has been implemented withinthe FuLL (Fuzzy Logic and Language) Italian project of BC S.r.l. In particular, we discuss here the creation of linguisticand ontological resources, together with the exploitation of existing ones, for natural language-driven database access andretrieval. Both the database and the queries we experiment with are Italian, but the methodology we broach naturally extends to other languages.