Montse Marquina


2012

Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Maite Melero | Marta R. Costa-Jussà | Judith Domingo | Montse Marquina | Martí Quixal
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present work in progress aimed at building tools for the normalization of User-Generated Content (UGC). As we will see, the task requires revisiting the initial steps of NLP processing, since UGC (micro-blogs, blogs, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Finally, we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.

The I3MEDIA speech database: a trilingual annotated corpus for the analysis and synthesis of emotional speech
Juan María Garrido | Yesika Laplaza | Montse Marquina | Andrea Pearman | José Gregorio Escalada | Miguel Ángel Rodríguez | Ana Armenta
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article presents the I3Media corpus, a trilingual (Catalan, English, Spanish) speech database of neutral and emotional material collected for analysis and synthesis purposes. The corpus is made up of six different subsets of material: a neutral subcorpus, containing emotionless utterances; a ‘dialog’ subcorpus, containing typical call center utterances; an ‘emotional’ corpus, a set of sentences representative of pure emotional states; a ‘football’ subcorpus, including utterances imitating a football broadcasting situation; an ‘SMS’ subcorpus, including readings of SMS texts; and a ‘paralinguistic elements’ corpus, including recordings of interjections and paralinguistic sounds uttered in isolation. The corpus was read by professional speakers (male, in the case of Spanish and Catalan; female, in the case of the English corpus), carefully selected to meet criteria of language competence, voice quality and acting conditions. It is the result of a collaboration between the Speech Technology Group at Telefónica Investigación y Desarrollo (TID) and the Speech and Language Group at Barcelona Media Centre d'Innovació (BM), as part of the I3Media project.