2006
Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project
Florbela Barreto | António Branco | Eduardo Ferreira | Amália Mendes | Maria Fernanda Bacelar do Nascimento | Filipe Nunes | João Ricardo Silva
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents the TagShare project and the linguistic resources and tools for the shallow processing of Portuguese developed in its scope. These resources include a 1 million token corpus that has been accurately hand-annotated with a variety of linguistic information, as well as several state-of-the-art shallow processing tools capable of automatically producing that type of annotation. At present, the linguistic annotations in the corpus are sentence and paragraph boundaries, token boundaries, morphosyntactic POS categories, values of inflection features, lemmas and named entities. Hence, the set of tools comprises a sentence chunker, a tokenizer, a POS tagger, nominal and verbal analyzers and lemmatizers, a verbal conjugator, a nominal inflector, and a named-entity recognizer, some of which underlie several online services.
COMBINA-PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions
Amália Mendes | Sandra Antunes | Maria Fernanda Bacelar do Nascimento | João Miguel Casteleiro | Luísa Pereira | Tiago Sá
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese multiword (MW) expressions. The objective of this on-going project is to compile a large lexical database of MW units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, and manually validated with the help of lexical association measures. MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations composed of words that tend to occur together and that constitute syntactic dependencies, although with a low degree of fixedness. This new resource has a two-fold objective: (i) to be an important research tool that supports the development of MW expression typologies and their lexicographic treatment; (ii) to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.
The African Varieties of Portuguese: Compiling Comparable Corpora and Analyzing Data-Derived Lexicon
Maria Fernanda Bacelar do Nascimento | José Bettencourt Gonçalves | Luísa Pereira | Antónia Estrela | Afonso Pereira | Rui Santos | Sancho M. Oliveira
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Linguistic Resources for the Study of the Portuguese African Varieties is an ongoing project that aims at the compilation, treatment, analysis and distribution of a corpus of the African varieties of Portuguese, with 3 million words of written and spoken texts, made up of five comparable subcorpora corresponding to the varieties of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe. This material will allow intra- and inter-corpus comparative studies, which will reveal variations that result from the discursive and pragmatic differences of each corpus, as well as aspects of linguistic unity or diversity that characterise the Portuguese spoken in these five African countries. The five corpora are comparable in size (600,000 words each), in chronology (the last 30 years) and in types and genres (24,000 spoken words and c. 580,000 written words, the latter drawn from newspapers, literature and varia). The corpus is automatically annotated and, after the extraction of alphabetical lists of lexical forms, these data will be automatically lemmatised. Five separate vocabulary lists, one for each variety, will be established. A tool for word extraction and preferential calculus according to predefined indexes is being developed to support lexicon comparison across the African varieties of Portuguese. Concordance extraction will also be performed.
Corpus-based extraction and identification of Portuguese Multiword Expressions
Sandra Antunes | Maria Fernanda Bacelar do Nascimento | João Miguel Casteleiro | Amália Mendes | Luísa Pereira | Tiago Sá
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
This presentation reports on an on-going project aimed at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50 million word corpus compiled for this project; they were then statistically interpreted using lexical association measures and manually validated. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, such as collocations. We aim to achieve two main objectives with this resource. The first is to build on the large set of data of different types of MW expressions, thus revising existing typologies of collocations and integrating them in a larger theory of MW units. The second is to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
2004
Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Maria Fernanda Bacelar do Nascimento | Amália Mendes | Luísa Pereira
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL's webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million words, sampled from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from CRPC for specific CLUL projects; b) a published sample of "Português Fundamental", a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.
The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages
Emanuela Cresti | Fernanda Bacelar do Nascimento | Antonio Moreno Sandoval | Jean Veronis | Philippe Martin | Khalid Choukri
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters referring to the event’s structure (dialogue vs. monologue) and the sociological domain of use (family-private vs. public). The four Romance corpora are tagged with respect to terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the most relevant cues for identifying the relevant linguistic domains in spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle: each textual string ending with a terminal break is aligned, through the WinPitch speech software, to its acoustic counterpart, generating the database of all utterances.
2002
The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus
Emanuela Cresti | Massimo Moneglia | Fernanda Bacelar do Nascimento | Antonio Moreno Sandoval | Jean Veronis | Philippe Martin | Khalid Choukri | Valerie Mapelli | Daniele Falavigna | Antonio Cid | Claude Blum
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2000
Portuguese Corpora at CLUL
Maria Fernanda Bacelar do Nascimento | Luisa Pereira | João Saramago
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)