Building The Sense-Tagged Multilingual Parallel Corpus

Shan Wang, Francis Bond


Abstract
Sense-annotated parallel corpora play a crucial role in natural language processing. This paper introduces our progress in creating such a corpus for Asian languages using English as a pivot, which is the first such corpus for these languages. Two sets of tools have been developed for sequential and targeted tagging, which are also easy to set up for any new language in addition to those we are annotating. This paper also briefly presents the general guidelines for doing this project. The current results of monolingual sense-tagging and multilingual linking are illustrated, which indicate the differences among genres and language pairs. All the tools, guidelines and the manually annotated corpus will be freely available at compling.ntu.edu.sg/ntumc.
Anthology ID:
L14-1700
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/916_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Shan Wang and Francis Bond. 2014. Building The Sense-Tagged Multilingual Parallel Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Building The Sense-Tagged Multilingual Parallel Corpus (Wang & Bond, LREC 2014)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/916_Paper.pdf