D(H)ante: A New Set of Tools for XIII Century Italian

Angelo Basile, Federico Sangati


Abstract
In this paper we describe 1) the process of converting a corpus of Dante Alighieri from a TEI XML format in to a pseudo-CoNLL format; 2) how a pos-tagger trained on modern Italian performs on Dante’s Italian 3) the performances of two different pos-taggers trained on the given corpus. We are making our conversion scripts and models available to the community. The two other models trained on the corpus performs reasonably well. The tool used for the conversion process might turn useful for bridging the gap between traditional digital humanities and modern NLP applications since the TEI original format is not usually suitable for being processed with standard NLP tools. We believe our work will serve both communities: the DH community will be able to tag new documents and the NLP world will have an easier way in converting existing documents to a standardized machine-readable format.
Anthology ID:
L16-1450
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
2825–2828
Language:
URL:
https://aclanthology.org/L16-1450
DOI:
Bibkey:
Copy Citation:
PDF:
https://aclanthology.org/L16-1450.pdf