A translated corpus of 30,000 French SMS

Cédrick Fairon, Sébastien Paumier


Abstract
The development of communication technologies has contributed to the appearance of new forms of written language, which scientists have to study in light of their peculiarities (typing or viewing constraints, synchronicity, etc.). In the particular case of SMS (Short Message Service), studies are complicated by a lack of data, mainly due to technical constraints and privacy considerations. In this paper, we present a corpus of 30,000 French SMS collected through a project in Belgium named “Faites don de vos SMS à la science” (Give your SMS to Science). This corpus is unique in its quality, its size and the fact that the SMS have been manually translated into “standard” French. We first describe the collection process and discuss the writers' profiles. Then we explain in detail how the translation was carried out.
Anthology ID:
L06-1148
Volume:
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month:
May
Year:
2006
Address:
Genoa, Italy
Editors:
Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/270_pdf.pdf
Cite (ACL):
Cédrick Fairon and Sébastien Paumier. 2006. A translated corpus of 30,000 French SMS. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal):
A translated corpus of 30,000 French SMS (Fairon & Paumier, LREC 2006)
PDF:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/270_pdf.pdf