Simonetta Montemagni

Also published as: S. Montemagni

In this paper we describe experiments on a corpus derived from an authoritative historical Italian dictionary, the Grande dizionario della lingua italiana (‘Great Dictionary of the Italian Language’, GDLI for short). Thanks to the digitization and structuring of this dictionary, we have been able to set up the first nucleus of a diachronic annotated corpus. The corpus selects, according to specific criteria and distinguishing between prose and poetry, some of the quotations that illustrate the different definitions and sub-definitions within the entries. The GDLI presents a huge collection of quotations covering the entire history of the Italian language, ranging from the Middle Ages to the present day. The corpus was enriched with linguistic annotation and used to train and evaluate NLP models for POS tagging and lemmatization, with promising results.

pdf bib abs

Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus
Tommaso Agnoloni | Roberto Bartolini | Francesca Frontini | Simonetta Montemagni | Carlo Marchetti | Valeria Quochi | Manuela Ruisi | Giulia Venturi
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

This paper describes the acquisition, cleaning, interpretation, encoding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic, carried out according to the CLARIN ParlaMint guidelines and prescriptions. The collection covers the COVID-19 period and an earlier period for reference and comparison. The corpus contains 1,199 sessions and 79,373 speeches, for a total of about 31 million words, and was encoded in the ParlaCLARIN TEI XML format as well as in CoNLL-UD format. It includes extensive metadata about the speakers, the sessions, the political parties and the parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity classification was also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology, with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus, deposited and archived in the CLARIN repository together with all other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used for both research and educational purposes.
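As a rough illustration of the token-level annotation layers listed above (tokens, lemmas, POS tags, dependency syntax), the sketch below reads one sentence in the ten-column CoNLL-U layout used by Universal Dependencies. It assumes that the corpus's "CoNLL-UD" files follow that layout. The function name and the sample sentence are invented for illustration and are not taken from the corpus.

```python
from typing import Dict, List

# The ten tab-separated columns of the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Return one dict per syntactic word in a single CoNLL-U sentence block."""
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens


# Hypothetical sample sentence, made up for this sketch.
SAMPLE = "\n".join([
    "# text = Il Senato approva.",
    "1\tIl\til\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tSenato\tsenato\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tapprova\tapprovare\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
])

tokens = parse_conllu_sentence(SAMPLE)
lemmas = [t["lemma"] for t in tokens]
```

Each returned dict exposes the lemma, POS tag, head index and dependency relation of one token, which is the information the abstract says was added automatically.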

2021

pdf bib

Trattamento automatico della lingua a supporto dell’editoria: primi esperimenti con il Devoto-Oli Junior (Automatic Language Treatment to Support Publishing: First Experiments with the Devoto-Oli Junior)
Irene Dini | Felice Dell’Orletta | Fabio Ferri | Biancamaria Gismondi | Simonetta Montemagni
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

2020

pdf bib

Quantitative Linguistic Investigations across Universal Dependencies Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Petya Osenova | Kiril Simov | Giulia Venturi
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

pdf bib

Risorse e strumenti per le varietà storiche dell’italiano: il progetto TrAVaSI (Resources and Tools for the Historical Varieties of Italian: the TrAVaSI Project)
Manuel Favaro | Marco Biffi | Simonetta Montemagni
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

pdf bib abs

“Voices of the Great War” is the first large corpus of Italian historical texts dating back to the period of the First World War. This corpus differs from other existing resources in several respects. First, from the linguistic point of view, it accounts for the wide range of varieties in which Italian was articulated in that period, namely from the diastratic (educated vs. uneducated writers), diaphasic (low/informal vs. high/formal registers) and diatopic (regional varieties, dialects) points of view. Second, from the historical perspective, through a collection of texts belonging to different genres, it represents different views on the war and the various styles of narrating war events and experiences. The final corpus is balanced along various dimensions, corresponding to the textual genre, the language variety used, the author type and the typology of conveyed contents. The corpus is fully annotated with lemmas, parts of speech, terminology, and named entities. Significant corpus samples representative of the different “voices” have also been enriched with meta-linguistic and syntactic information. The layer of syntactic annotation forms the first nucleus of an Italian historical treebank complying with the Universal Dependencies standard. The paper illustrates the final resource, the methodology and tools used to build it, and the Web interface for navigating it.

pdf bib abs

Profiling-UD: a Tool for Linguistic Profiling of Texts
Dominique Brunato | Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce Profiling-UD, a new text analysis tool inspired by the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features, spanning different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual, since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applied studies in which they were successfully used for text and author profiling.

2019

pdf bib

Building an Italian Written-Spoken Parallel Corpus: a Pilot Study
Elisa Dominutti | Lucia Pifferi | Felice Dell’Orletta | Simonetta Montemagni | Valeria Quochi
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

2018

pdf bib abs

We evaluate two cross-lingual techniques for adding enhanced dependencies to existing treebanks in Universal Dependencies. We apply a rule-based system developed for English and a data-driven system trained on Finnish to Swedish and Italian. We find that both systems are accurate enough to bootstrap enhanced dependencies in existing UD treebanks. In the case of Italian, results are even on par with those of a prototype language-specific system.

pdf bib

Italian in the Trenches: Linguistic Annotation and Analysis of Texts of the Great War
Irene De Felice | Felice Dell’Orletta | Giulia Venturi | Alessandro Lenci | Simonetta Montemagni
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

Universal Dependencies and Quantitative Typological Trends. A Case Study on Word Order
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs

Assessing the Impact of Incremental Error Detection and Correction. A Case Study on the Italian Universal Dependency Treebank
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Maria Simi | Giulia Venturi
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Detection and correction of errors and inconsistencies in “gold treebanks” are becoming increasingly central topics in corpus annotation. The paper illustrates a new incremental method for enhancing treebanks, with particular emphasis on the extension of error patterns across different textual genres and registers. The impact and role of corrections have been assessed in a dependency parsing experiment carried out with four different parsers, with promising results. For both evaluation datasets, parser performance increases in terms of the standard LAS and UAS measures, in terms of a more focused measure taking into account only the relations involved in error patterns, and at the level of individual dependencies.
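For readers unfamiliar with the two standard measures mentioned above, the sketch below shows how they are conventionally defined. UAS is the share of tokens whose predicted head is correct. LAS additionally requires the dependency label to match. This is not the paper's evaluation script: the function name and the toy gold and predicted analyses are assumptions made for illustration.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two token-aligned analyses of the same text."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS counts tokens whose head is correct, whatever the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label are both correct.
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_hits / n, labeled_hits / n


# Toy four-token example (hypothetical): one wrong label, one wrong head.
GOLD = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "punct")]
PRED = [(2, "det"), (3, "obj"), (0, "root"), (2, "punct")]

uas, las = attachment_scores(GOLD, PRED)
```

In this toy case three of four heads are correct (UAS 0.75) and two of four head-label pairs are correct (LAS 0.5). A measure restricted to the relations involved in error patterns, as in the abstract, could be obtained by filtering the token pairs before counting.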

pdf bib

Bootstrapping Enhanced Universal Dependencies for Italian
Maria Simi | Simonetta Montemagni
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

2017

pdf bib

Dangerous Relations in Dependency Treebanks
Chiara Alzetta | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

pdf bib

Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
Simonetta Montemagni | Joakim Nivre
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib

Identifying Predictive Features for Textual Genre Classification: the Key Role of Syntax
Andrea Cimino | Martijn Wieling | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

2016

pdf bib abs

CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence
Alessia Barbagli | Pietro Lucisano | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the CItA corpus (Corpus Italiano di Apprendenti L1), a collection of essays written by Italian L1 learners during the first and second year of lower secondary school. The corpus was built in the framework of an interdisciplinary study jointly carried out by computational linguists and experimental pedagogists, aimed at tracking the development of written language competence over the years and in relation to students’ background information.

pdf bib abs

ALT Explored: Integrating an Online Dialectometric Tool and an Online Dialect Atlas
Martijn Wieling | Eva Sassolini | Sebastiana Cucurullo | Simonetta Montemagni
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we illustrate the integration of an online dialectometric tool, Gabmap, with an online dialect atlas, the Atlante Lessicale Toscano (ALT-Web). By using a newly created URL-based interface to Gabmap, ALT-Web is able to take advantage of the sophisticated dialect visualization and exploration options incorporated in Gabmap. For example, maps showing the distribution in the Tuscan dialect area of a specific dialectal form (selected via the ALT-Web website) are easily obtainable. Furthermore, the complete ALT-Web dataset as well as subsets of the data (selected via the ALT-Web website) can be automatically uploaded and explored in Gabmap. Integrating these two online applications effectively and dynamically combines macro- and micro-analyses of dialectal data (offered by Gabmap and ALT-Web, respectively).

2015

pdf bib

NLP–Based Readability Assessment of Health–Related Texts: a Case Study on Italian Informed Consent Forms
Giulia Venturi | Tommaso Bellandi | Felice Dell’Orletta | Simonetta Montemagni
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

pdf bib

Design and Annotation of the First Italian Corpus for Text Simplification
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 9th Linguistic Annotation Workshop

2014

pdf bib abs

T2K^2: a System for Automatically Extracting and Organizing Knowledge from Texts
Felice Dell’Orletta | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present T2K^2, a suite of tools for automatically extracting domain-specific knowledge from collections of Italian and English texts. T2K^2 (Text-To-Knowledge v2) relies on a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine learning, which are dynamically integrated to provide an accurate and incremental representation of the content of vast repositories of unstructured documents. Extracted knowledge ranges from domain-specific entities and named entities to the relations connecting them, and can be used for indexing document collections with respect to different information types. T2K^2 also includes linguistic profiling functionalities aimed at supporting the user in constructing the acquisition corpus, e.g. in selecting texts belonging to the same genre or characterized by the same degree of specialization, or in monitoring the added value of newly inserted documents. T2K^2 is a web application, accessible from any browser through a personal account, that has been tested in a wide range of domains.

pdf bib abs

Less is More? Towards a Reduced Inventory of Categories for Training a Parser for the Italian Stanford Dependencies
Maria Simi | Cristina Bosco | Simonetta Montemagni
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Stanford Dependencies (SD) are now a de facto standard for dependency annotation. The goal of this paper is to explore the pros and cons of different strategies for generating SD-annotated Italian texts to enrich the existing Italian Stanford Dependency Treebank (ISDT). To this end, we compare the performance of a statistical parser (DeSR) trained on a simpler resource (the augmented version of the Merged Italian Dependency Treebank, or MIDT+), whose output was automatically converted to SD, with the results of the same parser trained directly on ISDT. Experiments carried out to test the reliability and effectiveness of the two strategies show that a parser trained on the reduced dependency repertoire, whose output can be easily converted to SD, performs slightly better than a parser trained directly on ISDT. A non-negligible advantage of the first strategy is that semi-automatic extensions of the training resource are carried out more easily and consistently with a reduced dependency tag set. Preliminary experiments on generating the collapsed and propagated SD representation are also reported.

pdf bib

Assessing the Readability of Sentences: Which Corpora and Features?
Felice Dell’Orletta | Martijn Wieling | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

2013

pdf bib

Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank
Cristina Bosco | Simonetta Montemagni | Maria Simi
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

pdf bib

Linguistic Profiling of Texts Across Textual Genres and Readability Levels. An Exploratory Study on Italian Fictional Prose
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib

Unsupervised Linguistically-Driven Reliable Dependency Parses Detection and Self-Training for Adaptation to the Biomedical Domain
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

pdf bib

Linguistic Profiling based on General–purpose Features and Native Language Identification
Andrea Cimino | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

2012

pdf bib abs

Enriching the ISST-TANL Corpus with Semantic Frames
Alessandro Lenci | Simonetta Montemagni | Giulia Venturi | Maria Grazia Cutrullà
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper describes the design and the results of a manual annotation methodology for enriching the ISST-TANL Corpus, derived from the Italian Syntactic-Semantic Treebank (ISST), with Semantic Frame information. The main issues encountered in applying the English FrameNet annotation criteria to an Italian corpus are discussed, together with the choice of anchoring the semantic annotation layer to the underlying dependency syntactic structure. The results of a case study aimed at extending and specialising this methodology for the annotation of a corpus of legislative texts are also discussed.

pdf bib

Genre-oriented Readability Assessment: a Case Study
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Workshop on Speech and Language Processing Tools in Education

2011

pdf bib

READ–IT: Assessing Readability of Italian Texts with a View to Text Simplification
Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

pdf bib

ULISSE: an Unsupervised Algorithm for Detecting Reliable Dependency Parses
Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

2010

pdf bib abs

A Contrastive Approach to Multi-word Extraction from Domain-specific Corpora
Francesca Bonin | Felice Dell’Orletta | Simonetta Montemagni | Giulia Venturi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a novel approach to multi-word terminology extraction that combines a well-known automatic term recognition approach, the C-NC value method, with a contrastive ranking technique. The latter is aimed at refining the results either by filtering out noise due to common words or by discerning between semantically different types of terms within heterogeneous terminologies. Unlike other contrastive methods proposed in the literature, which focus on single terms to overcome the sparsity problem of multi-word terms, the proposed contrastive function is able to handle variation in low-frequency events by operating directly on pre-selected multi-word terms. This methodology has been tested in two case studies carried out in the History of Art and Legal domains. Evaluation of the results showed that the proposed two-stage approach significantly improves multi-word term extraction. In particular, for the legal domain it provides an answer to a well-known problem in the semi-automatic construction of legal ontologies, namely that of singling out law terms from terms of the specific domain being regulated.

pdf bib

Contrastive Filtering of Domain-Specific Multi-Word Terms from Different Types of Corpora
Francesca Bonin | Felice Dell’Orletta | Giulia Venturi | Simonetta Montemagni
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf bib abs

As interest grows in the NLP community in developing treebanks for languages other than English, efforts are being made to evaluate the impact of the different annotation strategies used to represent particular languages or to support particular tasks. This paper contributes to the debate on how the resources used for training and development influence the performance of parsing systems. It presents a comparative analysis of the results achieved by three different dependency parsers developed and tested on two treebanks for the Italian language, namely TUT and ISST-TANL, which differ significantly in both corpus composition and the dependency representations adopted.

pdf bib abs

A Resource and Tool for Super-sense Tagging of Italian Texts
Giuseppe Attardi | Stefano Dei Rossi | Giulia Di Pietro | Alessandro Lenci | Simonetta Montemagni | Maria Simi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

A SuperSense Tagger is a tool for the automatic analysis of texts that assigns to each noun, verb, adjective and adverb a semantic category within a general taxonomy. The developed tagger, based on a statistical model (Maximum Entropy), required the creation of an Italian annotated corpus, to be used as a training set, and the improvement of various existing tools. The results obtained significantly improved on the state of the art for this particular task.

2008

pdf bib abs

Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora
Paul Thompson | Philip Cotter | John McNaught | Sophia Ananiadou | Simonetta Montemagni | Andrea Trabucco | Giulia Venturi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper reports on the design and construction of a bio-event annotated corpus which was developed with a specific view to the acquisition of semantic frames from biomedical corpora. We describe the adopted annotation scheme and the annotation process, which is supported by a dedicated annotation tool. The annotated corpus contains 677 abstracts of biomedical research articles.

pdf bib abs

Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
Alessandro Lenci | Barbara McGillivray | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we report experiments on the unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on shallow-parsed corpora using a limited number of search heuristics that do not rely on any previous lexico-syntactic knowledge about SCFs. Although preliminary, the reported results are in line with state-of-the-art lexical acquisition systems. The question of whether verbs sharing similar SCF distributions also share similar semantic properties was explored by clustering verbs that share frames with the same distribution using the Minimum Description Length (MDL) principle. First experiments in this direction were carried out on Italian verbs, with encouraging results.

pdf bib abs

Ontology Learning and Semantic Annotation: a Necessary Symbiosis
Emiliano Giovannetti | Simone Marchi | Simonetta Montemagni | Roberto Bartolini
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Semantic annotation of text requires the dynamic merging of linguistically structured information and a world model, usually represented as a domain-specific ontology. On the other hand, engineering a domain ontology through a semi-automatic ontology learning system requires the availability of a considerable amount of semantically annotated documents. Facing this bootstrapping paradox requires an incremental process of annotation-acquisition-annotation, whereby domain-specific knowledge is acquired from linguistically annotated texts and then projected back onto texts, so that extra linguistic information can be annotated and further knowledge layers extracted. The presented methodology is a first step in the direction of a full virtuous circle where the semantic annotation platform and the evolving ontology interact in symbiosis. As a case study we have chosen the semantic annotation of product catalogues. We propose a hybrid approach, combining pattern matching techniques, which exploit the regular structure of product descriptions in catalogues, with Natural Language Processing techniques, which are used to analyze natural language descriptions. The semantic annotation involves access to the ontology, which is semi-automatically bootstrapped with an ontology learning tool from annotated collections of catalogues.

2006

pdf bib abs

Dialectal resources on-line: the ALT-Web experience
Nella Cucurullo | Simonetta Montemagni | Matilde Paoli | Eugenio Picchi | Eva Sassolini
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper presents an on-line dialectal resource, ALT-Web, which gives access to the linguistic data of the Atlante Lessicale Toscano, a specially designed linguistic atlas in which lexical data have both a diatopic and diastratic characterisation. The paper focuses on: the dialectal data representation model; the access modalities to the ALT dialectal corpus; ontology-based search.

pdf bib

Probing the Space of Grammatical Variation: Induction of Cross-Lingual Grammatical Constraints from Treebanks
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006

pdf bib abs

Searching treebanks for functional constraints: cross-lingual experiments in grammatical relation assignment
Felice Dell’Orletta | Alessandro Lenci | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper reports on a detailed quantitative analysis of distributional language data for both Italian and Czech, highlighting the relative contribution of a number of distributed grammatical factors to the sentence-based identification of subjects and direct objects. The work is based on a Maximum Entropy model of the stochastic resolution of conflicting grammatical constraints, and is demonstrably capable of putting explanatory theoretical accounts to the challenging test of an extensive, usage-based empirical verification.