The African Varieties of Portuguese: Compiling Comparable Corpora and Analyzing Data-Derived Lexicon
Maria Fernanda Bacelar do Nascimento | José Bettencourt Gonçalves | Luísa Pereira | Antónia Estrela | Afonso Pereira | Rui Santos | Sancho M. Oliveira
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Linguistic Resources for the Study of the Portuguese African Varieties is an ongoing project that aims at the constitution, treatment, analysis and availability of a corpus of the African varieties of Portuguese, with 3 million words of written and spoken texts, constituted by five comparable subcorpora, corresponding to the varieties of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe. This material will allow intra and intercorpora comparative studies, which will make visible variations that result from discursive and pragmatic differences of each corpus and aspects of linguistic unity or diversity that characterise the spoken Portuguese of this referred five African countries. The five corpora are comparable in size (600,000 words each), in chronology (the last 30 years) and in types and genres (24,000 spoken words and c. 580,000 written words, the last belonging to newspapers, literature and varia). The corpus is automatically annotated and after the extraction of alphabetical lists of lexical forms, these data will be automatically lemmatised. Five separated lists of vocabulary for each variety will be established. A tool for word extraction and preferential calculus according to predefined indexes in order to achieve lexicon comparison of the African Portuguese Varieties is being developed. Concordances extraction will be also performed.