2024
Compiling and Exploring a Portuguese Parliamentary Corpus: ParlaMint-PT
José Aires | Aida Cardoso | Rui Pereira | Amalia Mendes
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024
As part of the project ParlaMint II, a new corpus of the sessions of the Portuguese Parliament from 2015 to 2022 has been compiled, encoded and annotated following the ParlaMint guidelines. We report on the contents of the corpus and on the specific nature of the political settings in Portugal during the time period covered. Two subcorpora were designed to enable comparisons of political speeches before and after the COVID-19 pandemic. We discuss the pipeline used to download the original texts, preprocess and encode them in XML, and annotate them. This new resource covers a period of changes in the political system in Portugal and will be an important source of data for political and social studies. Finally, we have explored the political stance on immigration in the ParlaMint-PT corpus.
Investigating the Generalizability of Portuguese Readability Assessment Models Trained Using Linguistic Complexity Features
Soroosh Akef | Amália Mendes | Detmar Meurers | Patrick Rebuschat
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1
Multiple Discourse Relations in English TED Talks and Their Translation into Lithuanian, Portuguese and Turkish
Deniz Zeyrek | Giedrė Valūnaitė Oleškevičienė | Amalia Mendes
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024
2022
The PALMA Corpora of African Varieties of Portuguese
Tjerk Hagemeijer | Amália Mendes | Rita Gonçalves | Catarina Cornejo | Raquel Madureira | Michel Généreux
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present three new corpora of urban varieties of Portuguese spoken in Angola, Mozambique, and São Tomé and Príncipe, where Portuguese is increasingly being spoken as a first and second language in different multilingual settings. Given the scarcity of linguistic resources available for the African varieties of Portuguese, these corpora provide new, contemporary data for the study of each variety and for comparative research on African, Brazilian and European varieties, thereby improving our understanding of processes of language variation and change in postcolonial societies. The corpora consist of transcribed spoken data, complemented by a rich set of metadata describing the setting of the audio recordings and sociolinguistic information about the speakers. They are annotated with POS and lemma information and made available on the CQPweb platform, which allows for sophisticated data searches. The corpora are already being used for comparative research on constructions in the domain of possession and location involving the argument structure of intransitive, monotransitive and ditransitive verbs that select Goals, Locatives, and Recipients.
2020
TED-MDB Lexicons: Tr-EnConnLex, Pt-EnConnLex
Murathan Kurfalı | Sibel Ozer | Deniz Zeyrek | Amália Mendes
Proceedings of the First Workshop on Computational Approaches to Discourse
In this work, we present two new bilingual discourse connective lexicons, namely, for Turkish-English and European Portuguese-English created automatically using the existing discourse relation-aligned TED-MDB corpus. In their current form, the Pt-En lexicon includes 95 entries, whereas the Tr-En lexicon contains 133 entries. The lexicons constitute the first step of a larger project of developing a multilingual discourse connective lexicon.
Infrastructure for the Science and Technology of Language PORTULAN CLARIN
António Branco | Amália Mendes | Paulo Quaresma | Luís Gomes | João Silva | Andrea Teixeira
Proceedings of the 1st International Workshop on Language Technology Platforms
This paper presents the PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language, which is part of the European research infrastructure CLARIN ERIC as its Portuguese national node, and belongs to the Portuguese National Roadmap of Research Infrastructures of Strategic Relevance. It encompasses a repository, where resources and metadata are deposited for long-term archiving and access, and a workbench, where Language Technology tools and applications are made available through different modes of interaction, among many other services. It is an asset of utmost importance for the technological development of natural languages and for their preparation for the digital age, contributing to ensure the citizenship of their speakers in the information society.
2018
A Multi- versus a Single-classifier Approach for the Identification of Modality in the Portuguese Language
João Sequeira | Teresa Gonçalves | Paulo Quaresma | Amália Mendes | Iris Hendrickx
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank
Deniz Zeyrek | Amália Mendes | Murathan Kurfalı
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Error annotation in a Learner Corpus of Portuguese
Iria del Río | Amália Mendes
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
A Lexicon of Discourse Markers for Portuguese – LDM-PT
Amália Mendes | Iria del Rio | Manfred Stede | Felix Dombek
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Modality annotation for Portuguese: from manual annotation to automatic labeling
Amália Mendes | Iris Hendrickx | Liciana Ávila | Paulo Quaresma | Teresa Gonçalves | João Sequeira
Linguistic Issues in Language Technology, Volume 14, 2016 - Modality: Logic, Semantics, Annotation, and Machine Learning
We investigate modality in Portuguese and we combine a linguistic perspective with an application-oriented perspective on modality. We design an annotation scheme reflecting theoretical linguistic concepts and apply this scheme to a small corpus sample to show how it deals with real world language usage. We present two schemes for Portuguese, one for spoken Brazilian Portuguese and one for written European Portuguese. Furthermore, we use the annotated data not only to study the linguistic phenomena of modality, but also to train a practical text mining tool to detect modality in text automatically. The modality tagger uses a machine learning classifier trained on automatically extracted features from a syntactic parser. As we only have a small annotated sample available, the tagger was evaluated on 11 modal verbs that are frequent in our corpus and that denote more than one modal meaning. Finally, we discuss several valuable insights into the complexity of the semantic concept of modality that derive from the process of manual annotation of the corpus and from the analysis of the results of the automatic labeling: ambiguity and the semantic and syntactic properties typically associated with one modal meaning in context, and also the interaction of modality with negation and focus. The knowledge gained from the manual annotation task leads us to propose a new unified scheme for modality that applies to the two Portuguese varieties and covers both written and spoken data.
Towards error annotation in a learner corpus of Portuguese
Iria del Río | Sandra Antunes | Amália Mendes | Maarten Janssen
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition
The COPLE2 corpus: a learner corpus for Portuguese
Amália Mendes | Sandra Antunes | Maarten Janssen | Anabela Gonçalves
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present the COPLE2 corpus, a learner corpus of Portuguese that includes written and spoken texts produced by learners of Portuguese as a second or foreign language. The corpus currently includes a total of 182,474 tokens and 978 texts, classified according to the CEFR scales. The original handwritten productions are transcribed in TEI-compliant XML format and keep a record of all the original information, such as reformulations, insertions and corrections made by the teacher, while the recordings are transcribed and aligned with EXMARaLDA. The TEITOK environment enables different views of the same document (XML, student version, corrected version), a CQP-based search interface, and POS tagging, lemmatization and normalization of the tokens, and will soon be used for error annotation in stand-off format. The corpus has already been a source of data for phonological, lexical and syntactic interlanguage studies and will be used for a data-informed selection of language features for each proficiency level.
2015
Towards a Unified Approach to Modality Annotation in Portuguese
Luciana Beatriz Ávila | Amália Mendes | Iris Hendrickx
Proceedings of the Workshop on Models for Modality Annotation
2014
An evaluation of the role of statistical measures and frequency for MWE identification
Sandra Antunes | Amália Mendes
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We report on an experiment to evaluate the role of statistical association measures and frequency for the identification of MWE. We base our evaluation on a lexicon of 14,000 MWE comprising different types of word combinations: collocations, nominal compounds, light verbs + predicate, idioms, etc. These MWE were manually validated from a list of n-grams extracted from a 50 million word corpus of Portuguese (a subcorpus of the Reference Corpus of Contemporary Portuguese), using several criteria: syntactic fixedness, idiomaticity, frequency and the Mutual Information (MI) measure, although no threshold was established, either in terms of group frequency or MI. We report on MWE that were selected on the basis of their syntactic and semantic properties even though the MI, or both the MI and the frequency, show low values; these are difficult cases for establishing a cut-off point. We analyze the MI values of the MWE selected in our gold dataset and, for some specific cases, compare these values with two other statistical measures.
The Gulf of Guinea Creole Corpora
Tjerk Hagemeijer | Michel Généreux | Iris Hendrickx | Amália Mendes | Abigail Tiny | Armando Zamora
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d'Ambô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. In order to tackle these problems, the compiled written and transcribed spoken data collected during field work trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded metadata for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.
2013
MWE in Portuguese: Proposal for a Typology for Annotation in Running Text
Sandra Antunes | Amália Mendes
Proceedings of the 9th Workshop on Multiword Expressions
Annotating the Interaction between Focus and Modality: the case of exclusive particles
Amália Mendes | Iris Hendrickx | Agostinho Salgueiro | Luciana Ávila
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse
2012
Introducing the Reference Corpus of Contemporary Portuguese Online
Michel Généreux | Iris Hendrickx | Amália Mendes
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We present our work in processing the Reference Corpus of Contemporary Portuguese and its publication online. After discussing how the corpus was built and our choice of meta-data, we turn to the processes and tools involved for the cleaning, preparation and annotation to make the corpus suitable for linguistic inquiries. The Web platform is described, and we show examples of linguistic resources that can be extracted from the platform for use in linguistic studies or in NLP.
Modality in Text: a Proposal for Corpus Annotation
Iris Hendrickx | Amália Mendes | Silvia Mencarelli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We present an annotation scheme for modality in Portuguese. In our annotation scheme we have tried to combine a more theoretical linguistic viewpoint with a practical annotation scheme that will also be useful for NLP research but is not geared towards one specific application. Our notion of modality focuses on the attitude and opinion of the speaker, or of the subject of the sentence. We validated the annotation scheme on a corpus sample of approximately 2000 sentences that we fully annotated with modal information using the MMAX2 annotation tool to produce XML annotation. We discuss our main findings and give attention to the difficult cases that we encountered as they illustrate the complexity of modality and its interactions with other elements in the text.
2010
Complex Predicates Annotation in a Corpus of Portuguese
Iris Hendrickx | Amália Mendes | Sílvia Pereira | Anabela Gonçalves | Inês Duarte
Proceedings of the Fourth Linguistic Annotation Workshop
Proposal for MWE Annotation in Running Text
Iris Hendrickx | Amália Mendes | Sandra Antunes
Proceedings of the Fourth Linguistic Annotation Workshop
2006
Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project
Florbela Barreto | António Branco | Eduardo Ferreira | Amália Mendes | Maria Fernanda Bacelar do Nascimento | Filipe Nunes | João Ricardo Silva
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents the TagShare project and the linguistic resources and tools for the shallow processing of Portuguese developed in its scope. These resources include a 1 million token corpus that has been accurately hand annotated with a variety of linguistic information, as well as several state of the art shallow processing tools capable of automatically producing that type of annotation. At present, the linguistic annotations in the corpus are sentence and paragraph boundaries, token boundaries, morphosyntactic POS categories, values of inflection features, lemmas and named entities. Hence, the set of tools comprises a sentence chunker, a tokenizer, a POS tagger, nominal and verbal analyzers and lemmatizers, a verbal conjugator, a nominal inflector, and a named entity recognizer, some of which underlie several online services.
COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions
Amália Mendes | Sandra Antunes | Maria Fernanda Bacelar do Nascimento | João Miguel Casteleiro | Luísa Pereira | Tiago Sá
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of MW units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, and manually validated with the help of lexical association measures. MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations composed of words that tend to occur together and that constitute syntactic dependencies, although with a low degree of fixedness. This new resource has a two-fold objective: (i) to be an important research tool which supports the development of MW expression typologies and their lexicographic treatment; (ii) to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.
Corpus-based extraction and identification of Portuguese Multiword Expressions
Sandra Antunes | Maria Fernanda Bacelar do Nascimento | João Miguel Casteleiro | Amália Mendes | Luísa Pereira | Tiago Sá
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
This presentation reports on an on-going project aimed at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50 million word corpus compiled for this project; they were then statistically interpreted using lexical association measures and manually validated. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, such as collocations. We aim to achieve two main objectives with this resource: firstly, to build on the large set of data of different types of MW expressions, thus revising existing typologies of collocations and integrating them in a larger theory of MW units; secondly, to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
2004
Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Maria Fernanda Bacelar do Nascimento | Amália Mendes | Luísa Pereira
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL's webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million words, taken by sampling from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from CRPC for specific CLUL projects; b) a published sample of "Português Fundamental", a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.