TEI-Conformant Structural Markup of a Trilingual Parallel Corpus in the ECI Multilingual Corpus 1

David McKelvie, Henry S. Thompson


Abstract
In this paper we provide an overview of the ACL European Corpus Initiative (ECI) Multilingual Corpus 1 (ECI/MC1). In particular, we look at one subcorpus in the ECI/MC1, the trilingual corpus of International Labour Organisation reports, and discuss the problems involved in TEI-compliant structural markup and preliminary alignment of this large corpus. We discuss gross structural alignment down to the level of text paragraphs. We see this as a necessary first step in corpus preparation before detailed (possibly automatic) alignment of text is possible. We try to generalise our experience with this corpus to illustrate the process of preliminary markup of large corpora which in their raw state can be in an arbitrary format (e.g. printers' tapes, proprietary word-processor formats), noisy (not fully parallel, with structure obscured by spelling mistakes), and full of poorly documented formatting instructions, and whose structure is present but far from explicit. We illustrate these points by reference to other parallel subcorpora of ECI/MC1. We attempt to define some guidelines for the development of corpus annotation toolkits which would aid this kind of structural preparation of large corpora.
Anthology ID:
1994.vlc-1.1
Volume:
Second Workshop on Very Large Corpora
Year:
1994
Venue:
VLC
Pages:
7–18
URL:
https://aclanthology.org/1994.vlc-1.1
Cite (ACL):
David McKelvie and Henry S. Thompson. 1994. TEI-Conformant Structural Markup of a Trilingual Parallel Corpus in the ECI Multilingual Corpus 1. In Second Workshop on Very Large Corpora, pages 7–18.
Cite (Informal):
TEI-Conformant Structural Markup of a Trilingual Parallel Corpus in the ECI Multilingual Corpus 1 (McKelvie & Thompson, VLC 1994)
PDF:
https://aclanthology.org/1994.vlc-1.1.pdf