Roser Saurí

Also published as: Roser Sauri

2024

Fine-Tuning Open Access LLMs for High-Precision NLU in Goal-Driven Dialog Systems
Lluís Padró | Roser Saurí
Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024

This paper presents a set of experiments on fine-tuning LLMs to produce high-precision semantic representations for the NLU component of a dialog system front-end. The aim of this research is threefold: First, we want to explore the capabilities of LLMs on real, industry-based use cases that involve complex data and strict requirements on results. Since the LLM output should be usable by the application back-end, the produced semantic representation must satisfy strict format and consistency requirements. Second, we want to evaluate the cost-benefit of open-source LLMs, that is, the feasibility of running this kind of model on machines affordable to small and medium-sized enterprises (SMEs), in order to assess how far these organizations can go without depending on the large players controlling the market, and with a moderate use of computation resources. Finally, we also want to assess the language scalability of LLMs in this kind of application; specifically, whether a multilingual model is able to transfer patterns learnt from one language to others (with special attention to under-resourced languages), thus reducing the required training data and computation costs. This work was carried out within an R&D context of assisting a real company in defining its NLU model strategy, and thus the results have a practical, industry-level focus.

2020

In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the project to existing standards.

2017

Proceedings of the Workshop Computational Semantics Beyond Events and Roles
Eduardo Blanco | Roser Morante | Roser Saurí
Proceedings of the Workshop Computational Semantics Beyond Events and Roles

2016

Towards a Linguistic Ontology with an Emphasis on Reasoning and Knowledge Reuse
Artemis Parvizi | Matt Kohl | Meritxell Gonzàlez | Roser Saurí
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Dictionaries division at Oxford University Press (OUP) is aiming to model, integrate, and publish lexical content for 100 languages, focussing on digitally under-represented languages. While there are multiple ontologies designed for linguistic resources, none had adequate features for meeting our requirements, chief of which was the capability to losslessly capture diverse features of many different languages in a dictionary format, while supplying a framework for inferring relations such as translation and derivation between the data. Building on valuable features of existing models, and working with OUP monolingual and bilingual dictionary datasets, we have designed and implemented a new linguistic ontology. The ontology has been reviewed by a number of computational linguists, and we are working to move more dictionary data into it. We have also developed APIs to surface the linked data to dictionary websites.

Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)
Eduardo Blanco | Roser Morante | Roser Saurí
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)

2014

The NewSoMe Corpus: A Unifying Opinion Annotation Framework across Genres and in Multiple Languages
Roser Saurí | Judith Domingo | Toni Badia
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the NewSoMe (News and Social Media) Corpus, a set of subcorpora with annotations on opinion expressions across genres (news reports, blogs, product reviews and tweets) and covering multiple languages (English, Spanish, Catalan and Portuguese). NewSoMe is the result of an effort to increase the opinion corpus resources available in languages other than English, and to build a unifying annotation framework for analyzing opinion in different genres, including controlled text, such as news reports, as well as different types of user-generated content (UGC). Given the broad design of the resource, most of the annotation effort was carried out using crowdsourcing platforms: Amazon Mechanical Turk and CrowdFlower. This created an excellent opportunity to investigate the feasibility of crowdsourcing methods for annotating large amounts of text in different languages.

2013

FBM: Combining lexicon-based ML and heuristics for Social Media Polarities
Carlos Rodríguez-Penagos | Jordi Atserias Batalla | Joan Codina-Filbà | David García-Narbona | Jens Grivolla | Patrik Lambert | Roser Saurí
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

Are You Sure That This Happened? Assessing the Factuality Degree of Events in Text
Roser Saurí | James Pustejovsky
Computational Linguistics, Volume 38, Issue 2 - June 2012

2010

In this paper, we present a brief snapshot of the state of affairs in the computational processing of Catalan and the initiatives that are starting to take place in an effort to bring the field a step forward, by making better and more efficient use of existing resources and tools, by bridging the gap between research and market, and by establishing periodic meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in bringing together a fair representation of the research in the area, and received attention from both industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a harvesting procedure which will hopefully build the seed of a portal-catalogue-observatory of language resources and technologies in Catalan.

SemEval-2010 Task 13: TempEval-2
Marc Verhagen | Roser Saurí | Tommaso Caselli | James Pustejovsky
Proceedings of the 5th International Workshop on Semantic Evaluation

2006

SlinkET: A Partial Modal Parser for Events
Roser Saurí | Marc Verhagen | James Pustejovsky
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present SlinkET, a parser for identifying contexts of event modality in text, developed within the TARSQI (Temporal Awareness and Reasoning Systems for Question Interpretation) research framework. SlinkET is grounded in TimeML, a specification language for capturing temporal and event-related information in discourse, which provides an adequate foundation to handle event modality. SlinkET builds on top of a robust event recognizer, and provides each relevant event with a value that specifies the degree of certainty about its factuality; e.g., whether it has happened or holds (factive or counter-factive), whether it is being reported or witnessed by somebody else (evidential), or whether it is introduced as a possibility (modal). It is based on well-established technology in the field (namely, finite-state techniques), and informed with corpus-induced knowledge that relies on basic information, such as morphological features, POS, and chunking. SlinkET is under continuing development and currently achieves an F1-measure of 70%.

Classification of Discourse Coherence Relations: An Exploratory Study using Multiple Knowledge Sources
Ben Wellner | James Pustejovsky | Catherine Havasi | Anna Rumshisky | Roser Saurí
Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue