Diana Santos


2024

Literary similarity of novels in Portuguese
Diana Santos
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1

2016

QUEMDISSE? Reported speech in Portuguese
Cláudia Freitas | Bianca Freitas | Diana Santos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents some work on direct and indirect speech in Portuguese using corpus-based methods: we report on a study whose aim was to identify (i) Portuguese verbs used to introduce reported speech and (ii) syntactic patterns used to convey reported speech, in order to enhance the performance of a quotation extraction system, dubbed QUEMDISSE?. In addition, (iii) we present a Portuguese corpus annotated with reported speech, using the lexicon and rules provided by (i) and (ii), and discuss the process of their annotation and what was learned.

2015

Um novo corpo e os seus desafios (A new corpus and the challenges it offers)
Diana Santos
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

2012

Págico: Evaluating Wikipedia-based information retrieval in Portuguese
Cristina Mota | Alberto Simões | Cláudia Freitas | Luís Costa | Diana Santos
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

How do people behave in their everyday information seeking tasks, which often involve Wikipedia? Are there systems which can help them, or do a similar job? In this paper we describe Págico, an evaluation contest with the main purpose of fostering research in these topics. We describe its motivation, the collection of documents created, the evaluation setup, the topics and how they were chosen, the participation, as well as the measures used for evaluation and the gathered resources. The task, which lies between information retrieval and question answering, can be further described as answering questions related to Portuguese-speaking culture in the Portuguese Wikipedia, in a number of different themes and geographic and temporal angles. This initiative allowed us to create interesting datasets and perform some assessment of Wikipedia, while also improving a public-domain open-source system for further Wikipedia-based evaluations. In the paper, we provide examples of questions, report the results obtained by the participants, and provide some discussion on complex issues.

Folheador: browsing through Portuguese semantic relations
Hugo Gonçalo Oliveira | Hernani Costa | Diana Santos
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

2010

GikiCLEF: Crosscultural Issues in Multilingual Information Access
Diana Santos | Luís Miguel Cabral | Corina Forascu | Pamela Forner | Fredric Gey | Katrin Lamm | Thomas Mandl | Petya Osenova | Anselmo Peñas | Álvaro Rodrigo | Julia Schulz | Yvonne Skalban | Erik Tjong Kim Sang
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we describe GikiCLEF, the first evaluation contest that, to our knowledge, was specifically designed to expose and investigate cultural and linguistic issues involved in structured multimedia collections and searching, and which was organized under the scope of CLEF 2009. GikiCLEF evaluated systems that answered hard questions for both human and machine, in ten different Wikipedia collections, namely Bulgarian, Dutch, English, German, Italian, Norwegian (Bokmål and Nynorsk), Portuguese, Romanian, and Spanish. After a short historical introduction, we present the task, together with its motivation, and discuss how the topics were chosen. Then we provide another description from the point of view of the participants. Before disclosing their results, we introduce the SIGA management system explaining the several tasks which were carried out behind the scenes. We quantify in turn the GIRA resource, offered to the community for training and further evaluating systems with the help of the 50 topics gathered and the solutions identified. We end the paper with a critical discussion of what was learned, advancing possible ways to reuse the data.

Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese
Cláudia Freitas | Cristina Mota | Diana Santos | Hugo Gonçalo Oliveira | Paula Carvalho
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present Second HAREM, the second edition of an evaluation campaign for Portuguese, addressing named entity recognition (NER). This second edition also included two new tracks: the recognition and normalization of temporal entities (proposed by a group of participants, and hence not covered in this paper) and ReRelEM, the detection of semantic relations between named entities. We summarize the setup of Second HAREM by showing the preserved distinctive features and discussing the changes compared to the first edition. Furthermore, we present the main results achieved and describe the available resources and tools developed under this evaluation, namely, (i) the golden collections, i.e. a set of documents whose named entities and semantic relations between those entities were manually annotated, (ii) the Second HAREM collection (which contains the unannotated version of the golden collection), as well as the participating systems' results on it, (iii) the scoring tools, and (iv) SAHARA, a Web application that allows interactive evaluation. We end the paper by offering some remarks about what was learned.

Experiments in Human-computer Cooperation for the Semantic Annotation of Portuguese Corpora
Diana Santos | Cristina Mota
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a system to aid human annotation of semantic information in the scope of the project AC/DC, called corte-e-costura. This system leverages the human annotation effort by providing the annotator with a simple system that applies rules incrementally. Our goal was twofold: first, to develop an easy-to-use system that required a minimum of learning on the part of the linguist; second, to provide a straightforward way of checking the results obtained, in order to immediately evaluate the results of the rules devised. After explaining the motivation for its development from scratch, we present the current status of the AC/DC project and provide a quantitative description of its material as far as semantic annotation is concerned. We then present the corte-e-costura system in detail, providing the result of our first experiments with the semantic fields of colour and clothing. We end the paper with some discussion of future work as well as of the experience gained.

2009

Relation detection between named entities: report of a shared task
Cláudia Freitas | Diana Santos | Cristina Mota | Hugo Gonçalo Oliveira | Paula Carvalho
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

2008

An Evaluation Resource for Geographic Information Retrieval
Thomas Mandl | Fredric Gey | Giorgio Di Nunzio | Nicola Ferro | Mark Sanderson | Diana Santos | Christa Womser-Hacker
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present an evaluation resource for geographic information retrieval developed within the Cross Language Evaluation Forum (CLEF). The GeoCLEF track is dedicated to the evaluation of geographic information retrieval systems. The resource encompasses more than 600,000 documents, 75 topics so far, and more than 100,000 relevance judgments for these topics. Geographic information retrieval requires an evaluation resource which represents realistic information needs and which is geographically challenging. Some experimental results and analysis are reported.

Portuguese-English Word Alignment: some Experiments
Diana Santos | Alberto Simões
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe some studies of Portuguese-English word alignment, focusing on (i) measuring the importance of the coupling between dictionaries and corpus; (ii) assessing the relevance of using syntactic information (POS and lemma) or just word forms, and (iii) taking into account the direction of translation. We first provide some motivation for the studies, as well as insist on separating type from token alignment. We then briefly describe the resources employed: the EuroParl and COMPARA corpora, and the alignment tools, NATools, introducing some measures to evaluate the two kinds of dictionaries obtained. We then present the results of several experiments, comparing sizes, overlap, translation fertility and alignment density of the several bilingual resources built. We also describe preliminary data as far as quality of the resulting dictionaries or alignment results is concerned.

What’s in a Colour? Studying and Contrasting Colours with COMPARA
Diana Santos | Maria do Rosário Silva | Susana Inácio
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present contrastive colour studies done using COMPARA, the largest edited parallel corpus in the world (as far as we know). The studies were the result of semantic annotation of the corpus in this domain. We chose to start with colour because it is a relatively contained lexical category and the subject of many arguments in linguistics. We begin by explaining the criteria involved in the annotation process, not only for the colour categories but also for the colour groups created in order to do finer-grained analyses, presenting also some quantitative data regarding these categories and groups. We proceed to compare the two languages according to the diversity of available lexical items, morphological and syntactic properties, and then try to understand the translation of colour. We end by explaining how any user who wants to do serious studies using the corpus can collaborate in enhancing the corpus and making their semantic annotations widely available as well.

2006

HAREM: An Advanced NER Evaluation Contest for Portuguese
Diana Santos | Nuno Seco | Nuno Cardoso | Rui Vilela
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we provide an overview of the first evaluation contest for named entity recognition in Portuguese, HAREM, which features several original traits and provided the first state of the art for the field in Portuguese, as well as a public-domain evaluation architecture.

Annotating COMPARA, a Grammar-aware Parallel Corpus
Diana Santos | Susana Inácio
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe the annotation of COMPARA, currently the largest post-edited parallel corpus that includes Portuguese. We describe the motivation, the results so far, and the way the corpus is being annotated. We also provide the first grounded results about syntactic ambiguity in Portuguese. Finally, we discuss some interesting problems in this connection.

Corpógrafo V3 - From Terminological Aid to Semi-automatic Knowledge Engineering
Luís Sarmento | Belinda Maia | Diana Santos | Ana Pinto | Luís Cabral
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we will present Corpógrafo, a mature web-based environment for working with corpora, for terminology extraction, and for ontology development. We will explain Corpógrafo’s workflow and describe the most important information extraction methods used, namely its term extraction, and definition / semantic relations identification procedures. We will describe current Corpógrafo users and present a brief overview of the XML format currently used to export terminology databases. Finally, we present future improvements for this tool.

The Multilingual Question Answering Track at CLEF
Bernardo Magnini | Danilo Giampiccolo | Lili Aunimo | Christelle Ayache | Petya Osenova | Anselmo Peñas | Maarten de Rijke | Bogdan Sacaleanu | Diana Santos | Richard Sutcliffe
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents an overview of the Multilingual Question Answering evaluation campaigns which have been organized at CLEF (Cross Language Evaluation Forum) since 2003. Over the years, the competition has registered a steady increase in the number of participants and languages involved. In fact, from the original eight groups which participated in the 2003 QA track, the number of competitors in 2005 rose to twenty-four. Also, the performance of the systems has steadily improved, and the average of the best performances in 2005 saw an increase of 10% with respect to the previous year.

2004

On the Problems of Creating a Golden Standard of Inflected Forms in Portuguese
Diana Santos | Anabela Barreiro
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The Corpógrafo – a Web-based Environment for Corpora Research
Luís Sarmento | Belinda Maia | Diana Santos
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

What is my Style? Using Stylistic Features of Portuguese Web Texts to Classify Web Pages According to Users’ Needs
Rachel Aires | Aline Manfrin | Sandra Aluísio | Diana Santos
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Floresta Sintá(c)tica: A treebank for Portuguese
Susana Afonso | Eckhard Bick | Renato Haber | Diana Santos
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Evaluation of parsed corpora: Experiments in user-transparent and user-visible evaluation
Diana Santos | Caroline Gasperin
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

Evaluating CETEMPúblico, a Free Resource for Portuguese
Diana Santos | Paulo Rocha
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

2000

An Evaluation of the Translation Corpus Aligner, with special reference to the language pair English-Portuguese
Diana Santos | Signe Oksefjell
Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999)

Providing Internet Access to Portuguese Corpora: the AC/DC Project
Diana Santos | Eckhard Bick
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1994

Bilingual Alignment and Tense
Diana Santos
Second Workshop on Very Large Corpora

In this paper, I describe an annotation of tense transfer in parallel English and Portuguese texts. Even though the primary aim of the study is to compare the tense and aspect systems of the two languages, it also raises some questions as far as bilingual alignment in general is concerned. First, I present a detailed list of clausal mismatches, which shows that intra-sentential alignment is not an easy task. Subsequently, I present a detailed quantitative description of the translation pairs found and discuss some possible conclusions for the translation of tense. Finally, I discuss some theoretical problems related to translation.

1992

A Tense and Aspect Calculus
Diana Santos
COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics

1990

Lexical gaps and idioms in machine translation
Diana Santos
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics