David McKelvie


The MATE Workbench Annotation Tool, a Technical Description
Amy Isard | David McKelvie | Andreas Mengel | Morten Baun Møller
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)


Using SGML as a Basis for Data-Intensive NLP
David McKelvie | Chris Brew | Henry Thompson
Fifth Conference on Applied Natural Language Processing


TEI-Conformant Structural Markup of a Trilingual Parallel Corpus in the ECI Multilingual Corpus 1
David McKelvie | Henry S. Thompson
Second Workshop on Very Large Corpora

In this paper we provide an overview of the ACL European Corpus Initiative (ECI) Multilingual Corpus 1 (ECI/MC1). We focus on one subcorpus of the ECI/MC1, the trilingual corpus of International Labour Organisation reports, and discuss the problems involved in TEI-compliant structural markup and preliminary alignment of this large corpus. We discuss gross structural alignment down to the level of text paragraphs. We see this as a necessary first step in corpus preparation before detailed (possibly automatic) alignment of text is possible. We try to generalise our experience with this corpus to illustrate the process of preliminary markup of large corpora, which in their raw state can be in an arbitrary format (e.g. printers' tapes, proprietary word-processor formats), noisy (not fully parallel, with structure obscured by spelling mistakes), and full of poorly documented formatting instructions, and whose structure is present but far from explicit. We illustrate these points by reference to other parallel subcorpora of ECI/MC1. We attempt to define some guidelines for the development of corpus annotation toolkits which would aid this kind of structural preparation of large corpora.