Linguistic Resources for Handwriting Recognition and Translation Evaluation

Zhiyi Song, Safa Ismael, Stephen Grimes, David Doermann, Stephanie Strassel


Abstract
We describe efforts to create corpora to support the development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructure for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect and annotate handwritten samples of pre-processed Arabic and Chinese data that has already been translated into English and is used in the GALE program. To date, LDC has recruited more than 600 scribes and has collected, annotated, and released more than 225,000 handwriting images. Most linguistic resources created for these programs will be made available to the larger research community through publication in LDC's catalog. The Phase 1 MADCAT corpus is now available.
Anthology ID:
L12-1463
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3951–3955
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/785_Paper.pdf
Cite (ACL):
Zhiyi Song, Safa Ismael, Stephen Grimes, David Doermann, and Stephanie Strassel. 2012. Linguistic Resources for Handwriting Recognition and Translation Evaluation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3951–3955, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Linguistic Resources for Handwriting Recognition and Translation Evaluation (Song et al., LREC 2012)