A corpus of European Portuguese child and child-directed speech

Ana Lúcia Santos; Michel Généreux; Aida Cardoso; Celina Agostinho; Silvana Abalada

Abstract

We present a corpus of European Portuguese child and child-directed speech. This corpus results from the expansion of an existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) using the CLAN software (MacWhinney, 2000). The corpus itself is a valuable resource for the study of lexical, syntactic and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and to generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS tagger trained on written language can be applied to spoken material with minimal effort, with only a small number of written rules assisting the statistical model.

Anthology ID: L14-1426
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1488–1491
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/514_Paper.pdf
Cite (ACL): Ana Lúcia Santos, Michel Généreux, Aida Cardoso, Celina Agostinho, and Silvana Abalada. 2014. A corpus of European Portuguese child and child-directed speech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1488–1491, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): A corpus of European Portuguese child and child-directed speech (Santos et al., LREC 2014)
