Roberto Zanoli

2025

2024

Understanding High-complexity Technical Documents with State-of-Art Models
Bernardo Magnini | Roberto Zanoli
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Technical documents, particularly those in civil engineering, contain crucial information that supports critical decision-making in construction, transportation and infrastructure projects. Large language models (LLMs) offer a promising solution for automating the extraction and comprehension of technical documents, potentially transforming our interaction with technical information. However, LLMs may encounter significant challenges when processing technical documents due to their complex structure, specialized terminology and reliance on graphical and visual elements. Moreover, LLMs are known to sometimes produce unexpected or incorrect analyses, a phenomenon referred to as hallucination. This study explores the potential of state-of-the-art LLMs, specifically GPT-4omni, to automate the comprehension of technical documents. The evaluation was performed on two types of PDF documents. The first type is selectable text PDFs, which are extractable and editable, focusing on civil engineering documents from the Italian state railways. The second type is scanned OCR PDFs, where text is derived from scanning or OCR, specifically focusing on the design of an outdoor swimming pool. These documents include textual and visual elements such as tables, figures and photos. Our findings suggest that GPT-4omni has a high potential for real-world use, although it may still be susceptible to producing misleading information.

IDRE: AI Generated Dataset for Enhancing Empathetic Chatbot Interactions in Italian Language.
Simone Manai | Laura Gemme | Roberto Zanoli | Alberto Lavelli
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper introduces IDRE (Italian Dataset for Rephrasing with Empathy), a novel automatically generated Italian linguistic dataset. IDRE comprises typical chatbot user utterances in the healthcare domain, corresponding chatbot responses, and empathetically enhanced chatbot responses. The dataset was generated using the Llama2 language model and evaluated by human raters based on predefined metrics. The IDRE dataset offers a comprehensive and realistic collection of Italian chatbot-user interactions suitable for training and refining chatbot models in the healthcare domain. This facilitates the development of chatbots capable of natural and productive conversations with healthcare users. Notably, the dataset incorporates empathetically enhanced chatbot responses, enabling researchers to investigate the effects of empathetic language on fostering more positive and engaging human-machine interactions within healthcare settings. The methodology employed for the construction of the IDRE dataset can be extended to generate phrases in additional languages and domains, thereby expanding its applicability and utility. The IDRE dataset is publicly available for research purposes.

This paper describes the KnowledgeStore, a large-scale infrastructure for the combined storage and interlinking of multimedia resources and ontological knowledge. Information in the KnowledgeStore is organized around entities, such as persons, organizations and locations. The system allows (i) to import background knowledge about entities, in form of annotated RDF triples; (ii) to associate resources to entities by automatically recognizing, coreferring and linking mentions of named entities; and (iii) to derive new entities based on knowledge extracted from mentions. The KnowledgeStore builds on state of art technologies for language processing, including document tagging, named entity extraction and cross-document coreference. Its design provides for a tight integration of linguistic and semantic features, and eases the further processing of information by explicitly representing the contexts where knowledge and mentions are valid or relevant. We describe the system and report about the creation of a large-scale KnowledgeStore instance for storing and integrating multimedia contents and background knowledge relevant to the Italian Trentino region.

2010

Entity Mention Detection using a Combination of Redundancy-Driven Classifiers
Silvana Marianela Bernaola Biggio | Manuela Speranza | Roberto Zanoli
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present an experimental framework for Entity Mention Detection in which two different classifiers are combined to exploit Data Redundancy attained through the annotation of a large text corpus, as well as a number of Patterns extracted automatically from the same corpus. In order to recognize proper name, nominal, and pronominal mentions we not only exploit the information given by mentions recognized within the corpus being annotated, but also given by mentions occurring in an external and unannotated corpus. The system was first evaluated in the Evalita 2009 evaluation campaign obtaining good results. The current version is being used in a number of applications: on the one hand, it is used in the LiveMemories project, which aims at scaling up content extraction techniques towards very large scale extraction from multimedia sources. On the other hand, it is used to annotate corpora, such as Italian Wikipedia, thus providing easy access to syntactic and semantic annotation for both the Natural Language Processing and Information Retrieval communities. Moreover a web service version of the system is available and the system is going to be integrated into the TextPro suite of NLP tools.

2008

The TextPro Tool Suite
Emanuele Pianta | Christian Girardi | Roberto Zanoli
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present TextPro, a suite of modular Natural Language Processing (NLP) tools for analysis of Italian and English texts. The suite has been designed so as to integrate and reuse state of the art NLP components developed by researchers at FBK. The current version of the tool suite provides functions ranging from tokenization to chunking and Named Entity Recognition (NER). The system's architecture is organized as a pipeline of processors wherein each stage accepts data from an initial input or from an output of a previous stage, executes a specific task, and sends the resulting data to the next stage, or to the output of the pipeline. TextPro performed the best on the task of Italian NER and Italian PoS Tagging at EVALITA 2007. When tested on a number of other standard English benchmarks, TextPro confirms that it performs as a state-of-the-art system. Distributions for Linux, Solaris and Windows are available, for both research and commercial purposes. A web-service version of the system is under development.
