José Pedro Ferreira
Casa de la Lhéngua: a set of language resources and natural language processing tools for Mirandese
José Pedro Ferreira | Cristiano Chesi | Daan Baldewijns | Fernando Miguel Pinto | Margarita Correia | Daniela Braga | Hyongsil Cho | Amadeu Ferreira | Miguel Dias
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper describes the construction of language resources and NLP tools for Mirandese, a minority language spoken in north-eastern Portugal, now available on a community-led portal, Casa de la Lhéngua. The resources were developed as part of a collaborative citizenship project led by Microsoft, in the context of the creation of the first TTS system for Mirandese. Development efforts encompassed the compilation of a corpus of over 1M tokens and the construction of a grapheme-to-phoneme (GTP) system and of syllable-division, inflection and Part-of-Speech (POS) tagging modules, leading to the creation of an inflected lexicon of about 200,000 entries with phonetic transcription, detailed POS tagging, syllable division, and stress mark-up. Alongside these tasks, which were made easier by adapting and reusing existing tools for closely related languages, a casting call for voice talents was held among the speaking community, and the first Mirandese speech database for speech synthesis was recorded. These resources were combined to fulfil the requirements of a well-tested statistical parametric synthesis model, leading to an intelligible voice font. These language resources are freely available at Casa de la Lhéngua, with the aim of promoting the development of real-life applications and fostering linguistic research on Mirandese.
The Common Orthographic Vocabulary of the Portuguese Language: a set of open lexical resources for a pluricentric language
José Pedro Ferreira | Maarten Janssen | Gladis Barcellos de Oliveira | Margarita Correia | Gilvan Müller de Oliveira
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper outlines the design principles and choices, as well as the ongoing development process, of the Common Orthographic Vocabulary of the Portuguese Language (VOC), a large-scale electronic lexical database adopted by the Instituto Internacional da Língua Portuguesa of the Community of Portuguese-Speaking Countries (CPLP) to implement a spelling reform that is currently taking place. Given the different available resources and lexicographic traditions within the CPLP countries, a range of solutions was adopted for different countries and integrated into a common development framework. Although lexicographic resources have always been published to implement spelling reforms for Portuguese, VOC represents a paradigm change, switching from idiosyncratic, closed-source, paper-format official resources to standardized, open, free, web-accessible and reusable ones. We start by outlining the context that justifies the development of the resource and its requirements, then focus on the methodology, workflow and tools used, showing how a collaborative project on a common web-based platform and administration interface makes the creation of such a long-sought and ambitious project possible.