TweetNorm_es: an annotated corpus for Spanish microtext normalization

Iñaki Alegria, Nora Aranberri, Pere Comas, Víctor Fresno, Pablo Gamallo, Lluis Padró, Iñaki San Vicente, Jordi Turmo, Arkaitz Zubiaga


Abstract
In this paper we introduce TweetNorm_es, an annotated corpus of Spanish-language tweets, which we make publicly available under the terms of the CC-BY license. The corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort by several research groups. We describe the methodology used to build the corpus and the guidelines followed in the annotation process. We also give a brief overview of the Tweet-Norm shared task, the first evaluation setting in which the corpus was used.
Anthology ID:
L14-1379
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2274–2278
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/442_Paper.pdf
Cite (ACL):
Iñaki Alegria, Nora Aranberri, Pere Comas, Víctor Fresno, Pablo Gamallo, Lluis Padró, Iñaki San Vicente, Jordi Turmo, and Arkaitz Zubiaga. 2014. TweetNorm_es: an annotated corpus for Spanish microtext normalization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2274–2278, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
TweetNorm_es: an annotated corpus for Spanish microtext normalization (Alegria et al., LREC 2014)