Bengt Dahlqvist
The English-Swedish-Turkish Parallel Treebank
Beáta Megyesi
Bengt Dahlqvist
Éva Á. Csató
Joakim Nivre
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment. In addition, we use basic language resource kits for the linguistic analysis of the languages involved. The annotation is carried out on various layers, from morphological and part-of-speech analysis to dependency structures. The tools used for linguistic annotation, e.g., the HunPos tagger and MaltParser, are freely available data-driven resources, trained on existing corpora and treebanks for each language. The parallel treebank is used in teaching and linguistic research to study the relationship between the structurally different languages. In order to study the treebank, several tools have been developed for the visualization of the annotation and alignment, allowing search for linguistic patterns.
Swedish-Turkish Parallel Treebank
Beáta Megyesi
Bengt Dahlqvist
Eva Pettersson
Joakim Nivre
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper, we describe our work on building a parallel treebank for a less studied and typologically dissimilar language pair, namely Swedish and Turkish. The treebank is a balanced syntactically annotated corpus containing both fiction and technical documents. In total, it consists of approximately 160,000 tokens in Swedish and 145,000 in Turkish. The texts are linguistically annotated using different layers, from part-of-speech tags and morphological features to dependency annotation. Each layer is automatically processed by using basic language resources for the languages involved. The sentences and words are aligned, and the alignments are partly manually corrected. We create the treebank by reusing and adjusting existing tools for automatic annotation and alignment, and for their correction and visualization. The treebank was developed within the project supporting research environment for minor languages, which aims to create representative language resources for language pairs dissimilar in language structure. Therefore, effort is put into developing a general method for the formatting and annotation procedure, as well as into using tools that can easily be applied to other language pairs.
The Swedish-Turkish Parallel Corpus and Tools for its Creation
Beáta Megyesi
Bengt Dahlqvist
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)