Michel Généreux

Also published as: Michel Genereux

2022

We present three new corpora of urban varieties of Portuguese spoken in Angola, Mozambique, and São Tomé and Príncipe, where Portuguese is increasingly being spoken as first and second language in different multilingual settings. Given the scarcity of linguistic resources available for the African varieties of Portuguese, these corpora provide new, contemporary data for the study of each variety and for comparative research on African, Brazilian and European varieties, hereby improving our understanding of processes of language variation and change in postcolonial societies. The corpora consist of transcribed spoken data, complemented by a rich set of metadata describing the setting of the audio recordings and sociolinguistic information about the speakers. They are annotated with POS and lemma information and made available on the CQPweb platform, which allows for sophisticated data searches. The corpora are already being used for comparative research on constructions in the domain of possession and location involving the argument structure of intransitive, monotransitive and ditransitive verbs that select Goals, Locatives, and Recipients.

2014

pdf bib abs

A corpus of European Portuguese child and child-directed speech
Ana Lúcia Santos | Michel Généreux | Aida Cardoso | Celina Agostinho | Silvana Abalada
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntax and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.

pdf bib abs

We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa dAmbô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. In order to tackle these problems, the compiled written and transcribed spoken data collected during field work trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded meta data for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.

2012

pdf bib abs

Introducing the Reference Corpus of Contemporary Portuguese Online
Michel Généreux | Iris Hendrickx | Amália Mendes
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present our work in processing the Reference Corpus of Contemporary Portuguese and its publication online. After discussing how the corpus was built and our choice of meta-data, we turn to the processes and tools involved for the cleaning, preparation and annotation to make the corpus suitable for linguistic inquiries. The Web platform is described, and we show examples of linguistic resources that can be extracted from the platform for use in linguistic studies or in NLP.

pdf bib

Contrasting Objective and Subjective Portuguese Texts from Heterogeneous Sources
Michel Généreux | William Martinez
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

2011

pdf bib abs

Système d’analyse de la polarité de dépêches financières (A polarity sentiment analysis system financial news items)
Michel Généreux
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Nous présentons un système pour la classification en continu de dépêches financières selon une polarité positive ou négative. La démonstration permettra ainsi d’observer quelles sont les dépêches les plus à même de faire varier la valeur d’actions cotées en bourse, au moment même de la démonstration. Le système traitera de dépêches écrites en anglais et en français.

2010

pdf bib

Résumé automatique de textes d’opinion [Automatic Summarization of Opinionated Texts]
Aurélien Bossard | Michel Genereux | Thierry Poibeau
Traitement Automatique des Langues, Volume 51, Numéro 3 : Opinions, sentiments et jugements d’évaluation [Opinions, sentiment and evaluative language]

pdf bib abs

Segmentation Automatique de Lettres Historiques
Michel Généreux | Rita Marquilhas | Iris Hendrickx
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Cet article présente une approche basée sur la comparaison fréquentielle de modèles lexicaux pour la segmentation automatique de textes historiques Portugais. Cette approche traite d’abord le problème de la segmentation comme un problème de classification, en attribuant à chaque élément lexical présent dans la phase d’apprentissage une valeur de saillance pour chaque type de segment. Ces modèles lexicaux permettent à la fois de produire une segmentation et de faire une analyse qualitative de textes historiques. Notre évaluation montre que l’approche adoptée permet de tirer de l’information sémantique que des approches se concentrant sur la détection des frontières séparant les segments ne peuvent acquérir.

2009

pdf bib abs

Résumé automatique de textes d’opinions
Michel Généreux | Aurélien Bossard
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Le traitement des langues fait face à une demande croissante en matière d’analyse de textes véhiculant des critiques ou des opinions. Nous présentons ici un système de résumé automatique tourné vers l’analyse d’articles postés sur des blogues, où sont exprimées à la fois des informations factuelles et des prises de position sur les faits considérés. Nous montrons qu’une approche classique à base de traits de surface est tout à fait efficace dans ce cadre. Le système est évalué à travers une participation à la campagne d’évaluation internationale TAC (Text Analysis Conference) où notre système a réalisé des performances satisfaisantes.

pdf bib

CBSEAS, a Summarization System – Integration of Opinion Mining Techniques to Summarize Blogs
Aurélien Bossard | Michel Généreux | Thierry Poibeau
Proceedings of the Demonstrations Session at EACL 2009

This paper presents a method for guiding semantic parsers based on a statistical model. The parser is example driven, that is, it learns how to interpret a new utterance by looking at some examples. It is mainly predicated on the idea that similarities exist between contexts in which individual parsing actions take place. Those similarities are then used to compute the degree of certainty of a particular parse. The treatment of word order and the disambiguation of meanings can therefore be learned.