2024
EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research
Marc Kupietz | Piotr Banski | Nils Diewald | Beata Trawinski | Andreas Witt
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024
2022
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Piotr Banski | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
2020
Corpus Query Lingua Franca part II: Ontology
Stefan Evert | Oleg Harlamov | Philipp Heinrich | Piotr Banski
Proceedings of the Twelfth Language Resources and Evaluation Conference
The present paper outlines the projected second part of the Corpus Query Lingua Franca (CQLF) family of standards: CQLF Ontology, which is currently in the process of standardization at the International Organization for Standardization (ISO), in its Technical Committee 37, Subcommittee 4 (TC37SC4) and its national mirrors. The first part of the family, ISO 24623-1 (henceforth CQLF Metamodel), was successfully adopted as an international standard at the beginning of 2018. The present paper reflects the state of the CQLF Ontology at the moment of submission for the Committee Draft ballot. We provide a brief overview of the CQLF Metamodel, present the assumptions and aims of the CQLF Ontology, its basic structure, and its potential extended applications. The full ontology is expected to emerge from a community process, starting from an initial version created by the authors of the present paper.
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Piotr Bański | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen | Ines Pisetta
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
2018
Lightweight Grammatical Annotation in the TEI: New Perspectives
Piotr Bański | Susanne Haaf | Martin Mueller
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Corpus Query Lingua Franca (CQLF)
Piotr Bański | Elena Frick | Andreas Witt
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The present paper describes Corpus Query Lingua Franca (ISO CQLF), a specification designed at ISO Technical Committee 37 Subcommittee 4 “Language resource management” for the purpose of facilitating the comparison of properties of corpus query languages. We give an overview of the motivation for this endeavour and present its aims and its general architecture. CQLF is intended as a multi-part specification; here, we concentrate on the basic metamodel that provides a frame that the other parts fit in.
KorAP Architecture ― Diving in the Deep Sea of Corpus Data
Nils Diewald | Michael Hanl | Eliza Margaretha | Joachim Bingel | Marc Kupietz | Piotr Bański | Andreas Witt
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.
2014
Access control by query rewriting: the case of KorAP
Piotr Bański | Nils Diewald | Michael Hanl | Marc Kupietz | Andreas Witt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present an approach to an aspect of managing complex access scenarios to large and heterogeneous corpora that involves handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user's permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages, by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.
2012
The New IDS Corpus Analysis Platform: Challenges and Prospects
Piotr Bański | Peter M. Fischer | Elena Frick | Erik Ketzan | Marc Kupietz | Carsten Schnober | Oliver Schonefeld | Andreas Witt
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The present article describes the first stage of the KorAP project, launched recently at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany. The aim of this project is to develop an innovative corpus analysis platform to tackle the increasing demands of modern linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse primary data and annotations in the petabyte range, while at the same time allowing an undistorted view of the primary linguistic data, and thus fully satisfying the demands of a scientific tool. An additional important aim of the project is to make corpus data as openly accessible as possible in light of unavoidable legal restrictions, for instance through support for distributed virtual corpora, user-defined annotations and adaptable user interfaces, as well as interfaces and sandboxes for user-supplied analysis applications. We discuss our motivation for undertaking this endeavour and the challenges that face it. Next, we outline our software implementation plan and describe development to date.
Evaluating Query Languages for a Corpus Processing System
Elena Frick | Carsten Schnober | Piotr Bański
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper documents a pilot study conducted as part of the development of a new corpus processing system at the Institut für Deutsche Sprache in Mannheim and in the context of the ISO TC37 SC4/WG6 activity on the suggested work item proposal Corpus Query Lingua Franca. We describe the first phase of our research: the initial formulation of functionality criteria for query language evaluation and the results of the application of these criteria to three representatives of corpus query languages, namely COSMAS II, Poliqarp, and ANNIS QL. In contrast to previous work on query language evaluation that compares a range of existing query languages against a small number of queries, our approach analyses only three query languages against criteria derived from a suite of 300 use cases that cover diverse aspects of linguistic research.
2009
A Repository of Free Lexical Resources for African Languages: The Project and the Method
Piotr Bański | Beata Wójtowicz
Proceedings of the First Workshop on Language Technologies for African Languages
Stand-off TEI Annotation: the Case of the National Corpus of Polish
Piotr Bański | Adam Przepiórkowski
Proceedings of the Third Linguistic Annotation Workshop (LAW III)
2008
Enhancing an English-Polish Electronic Dictionary for Multiword Expression Research
Piotr Bański | Radosław Moszczyński
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes a project aimed at converting a legacy representation of English idioms into an XML-based format. The project is set in the context of a large electronic English-Polish dictionary which contains several hundred formalized idiom descriptions and which has been released under the terms of a free license. In short, the project consists of three phases: cleaning up the dictionary markup, extracting the legacy idiom representations, and converting them into TEI P5 XML constrained by a RelaxNG grammar created for this purpose and constituting a module that can be included as part of the TEI P5 schema. The paper contains general descriptions of the individual phases and several examples of XML-encoded idioms. It also suggests some directions for further research, which include abstracting the XML-ized idiom representations into general syntactic patterns and using the representations to automatically identify idioms in tagged corpora.
2004
A Search Tool for Corpora with Positional Tagsets and Ambiguities
Adam Przepiórkowski | Zygmunt Krynicki | Łukasz Dębowski | Marcin Woliński | Daniel Janus | Piotr Bański
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)