Corpus-based extraction and identification of Portuguese Multiword Expressions

Sandra Antunes; Maria Fernanda Bacelar do Nascimento; João Miguel Casteleiro; Amália Mendes; Luísa Pereira; Tiago Sá

Abstract

This presentation reports on an ongoing project aimed at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50-million-word corpus compiled for this project, statistically interpreted using lexical association measures, and then manually validated. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, such as collocations. We aim to achieve two main objectives with this resource. The first is to build on the large set of data on different types of MW expressions in order to revise existing typologies of collocations and integrate them into a larger theory of MW units. The second is to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
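To make the notion of a lexical association measure concrete, the sketch below scores adjacent word pairs with pointwise mutual information (PMI). The abstract does not say which measures the project applied, so PMI is an assumption chosen only because it is a common example; the function name `pmi_bigrams`, the `min_count` threshold, and the toy token list are likewise hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative only: PMI is an assumed stand-in for the unspecified
    association measures mentioned in the abstract.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs get unreliable PMI values, so skip them.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy input invented for this example (not corpus data).
toy_tokens = "o primeiro ministro falou e o primeiro ministro saiu".split()
for pair, score in pmi_bigrams(toy_tokens):
    print(pair, round(score, 3))
```

In a workflow like the one described, a ranked candidate list of this kind would still go through manual validation, since a high association score alone does not establish that a pair is a genuine MW expression.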

Anthology ID: 2006.jeptalnrecital-poster.2
Volume: Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters
Month: April
Year: 2006
Address: Leuven, Belgique
Editors: Piet Mertens, Cédrick Fairon, Anne Dister, Patrick Watrin
Venue: JEP/TALN/RECITAL
Publisher: ATALA
Pages: 389–397
URL: https://aclanthology.org/2006.jeptalnrecital-poster.2/
Cite (ACL): Sandra Antunes, Maria Fernanda Bacelar do Nascimento, João Miguel Casteleiro, Amália Mendes, Luísa Pereira, and Tiago Sá. 2006. Corpus-based extraction and identification of Portuguese Multiword Expressions. In Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters, pages 389–397, Leuven, Belgique. ATALA.
Cite (Informal): Corpus-based extraction and identification of Portuguese Multiword Expressions (Antunes et al., JEP/TALN/RECITAL 2006)
PDF: https://aclanthology.org/2006.jeptalnrecital-poster.2.pdf