Luiz C. Genoves, Jr.
2004
The Lácio-Web: Corpora and Tools to Advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools
Sandra Aluisio | Gisele Montilha Pinheiro | Aline M. P. Manfrin | Leandro H. M. de Oliveira | Luiz C. Genoves, Jr. | Stella E. O. Tagnin
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
In this paper we discuss the five requirements for building large, publicly available corpora that guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation and annotation scheme; 3) a friendly and didactic interface; 4) the need to serve as support for several types of research; 5) the need to offer an array of associated tools. We also present the features that make the Lácio-Web corpora interesting and novel, as well as the limitations of the project, such as corpora size and balance and the non-inclusion of spoken texts in the project’s reference corpus.