Peter Wittenburg

Also published as: P. Wittenburg

This contribution presents The Language Archive (TLA), a new unit at the MPI for Psycholinguistics, discussing the current developments in management of scientific data, considering the need for new data research infrastructures. Although several initiatives worldwide in the realm of language resources aim at the integration, preservation and mobilization of research data, the state of such scientific data is still often problematic. Data are often not well organized and archived and not described by metadata ― even unique data such as field-work observational data on endangered languages is still mostly on perishable carriers. New data centres are needed that provide trusted, quality-reviewed, persistent services and suitable tools and that take legal and ethical issues seriously. The CLARIN initiative has established criteria for suitable centres. TLA is in a good position to be one of such centres. It is based on three essential pillars: (1) A data archive; (2) management, access and annotation tools; (3) archiving and software expertise for collaborative projects. The archive hosts mostly observational data on small languages worldwide and language acquisition data, but also data resulting from experiments.

pdf bib abs

Towards Automatic Gesture Stroke Detection
Binyam Gebrekidan Gebre | Peter Wittenburg | Przemyslaw Lenkiewicz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Automatic annotation of gesture strokes is important for many gesture and sign language researchers. The unpredictable diversity of human gestures and video recording conditions require that we adopt a more adaptive case-by-case annotation model. In this paper, we present a work-in progress annotation model that allows a user to a) track hands/face b) extract features c) distinguish strokes from non-strokes. The hands/face tracking is done with color matching algorithms and is initialized by the user. The initialization process is supported with immediate visual feedback. Sliders are also provided to support a user-friendly adjustment of skin color ranges. After successful initialization, features related to positions, orientations and speeds of tracked hands/face are extracted using unique identifiable features (corners) from a window of frames and are used for training a learning algorithm. Our preliminary results for stroke detection under non-ideal video conditions are promising and show the potential applicability of our methodology.

2011

pdf bib

2010

pdf bib abs

Annotation of digital recordings in humanities research still is, to a large extend, a process that is performed manually. This paper describes the first pattern recognition based software components developed in the AVATecH project and their integration in the annotation tool ELAN. AVATecH (Advancing Video/Audio Technology in Humanities Research) is a project that involves two Max Planck Institutes (Max Planck Institute for Psycholinguistics, Nijmegen, Max Planck Institute for Social Anthropology, Halle) and two Fraunhofer Institutes (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS, Sankt Augustin, Fraunhofer Heinrich-Hertz-Institute, Berlin) and that aims to develop and implement audio and video technology for semi-automatic annotation of heterogeneous media collections as they occur in multimedia based research. The highly diverse nature of the digital recordings stored in the archives of both Max Planck Institutes, poses a huge challenge to most of the existing pattern recognition solutions and is a motivation to make such technology available to researchers in the humanities.

pdf bib abs

Virtual Language Observatory: The Portal to the Language Resources and Technology Universe
Dieter Van Uytvanck | Claus Zinn | Daan Broeder | Peter Wittenburg | Mariano Gardellini
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Over the years, the field of Language Resources and Technology (LRT) has developed a tremendous amount of resources and tools. However, there is no ready-to-use map that researchers could use to gain a good overview and steadfast orientation when searching for, say corpora or software tools to support their studies. It is rather the case that information is scattered across project- or organisation-specific sites, which makes it hard if not impossible for less-experienced researchers to gather all relevant material. Clearly, the provision of metadata is central to resource and software exploration. However, in the LRT field, metadata comes in many forms, tastes and qualities, and therefore substantial harmonization and curation efforts are required to provide researchers with metadata-based guidance. To address this issue a broad alliance of LRT providers (CLARIN, the Linguist List, DOBES, DELAMAN, DFKI, ELRA) have initiated the Virtual Language Observatory portal to provide a low-barrier, easy-to-follow entry point to language resources and tools; it can be accessed via http://www.clarin.eu/vlo

pdf bib abs

We describe our computer-supported framework to overcome the rule of metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the categories can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be grounded in its use of categories. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently, further promote systematic use of metadata for existing resources and tools to be shared.

pdf bib abs

An Evolving eScience Environment for Research Data in Linguistics
Claus Zinn | Peter Wittenburg | Jacquelijn Ringersma
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The amount of research data in the Humanities is increasing at fast speed. Metadata helps describing and making accessible this data to interested researchers within and across institutions. While metadata interoperability is an issue that is being recognised and addressed, the systematic and user-driven provision of annotations and the linking together of resources into new organisational layers have received much less attention. This paper gives an overview of our evolving technological eScience environment to support such functionality. It describes two tools, ADDIT and ViCoS, which enable researchers, rather than archive managers, to organise and reorganise research data to fit their particular needs. The two tools, which are embedded into our institute's existing software landscape, are an initial step towards an eScience environment that gives our scientists easy access to (multimodal) research data of their interest, and empowers them to structure, enrich, link together, and share such data as they wish.

Currently, research infrastructures are being designed and established in many disciplines since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on knowledge and work from many projects that were carried out during the last years and wants to build stable and robust services that can be used by researchers. Here service centres will play an important role that have the potential of being persistent and that adhere to criteria as they have been established by CLARIN. In the last year of the so-called preparatory phase these centres are currently developing four use cases that can demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criteria of being cross-national.

2008

pdf bib abs

Exploring and Enriching a Language Resource Archive via the Web
Marc Kemps-Snijders | Alex Klassmann | Claus Zinn | Peter Berck | Albert Russel | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The download first, then process paradigm is still the predominant working method amongst the research community. The web-based paradigm, however, offers many advantages from a tool development and data management perspective as they allow a quick adaptation to changing research environments. Moreover, new ways of combining tools and data are increasingly becoming available and will eventually enable a true web-based workflow approach, thus challenging the download first, then process paradigm. The necessary infrastructure for managing, exploring and enriching language resources via the Web will need to be delivered by projects like CLARIN and DARIAH.

pdf bib abs

A Grid of Regional Language Archives
Paul Trilsbeek | Daan Broeder | Tobias Valkenhoef | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

About two years ago, the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, started an initiative to install regional language archives in various places around the world, particularly in places where a large number of endangered languages exist and are being documented. These digital archives make use of the LAT archiving framework that the MPI has developed over the past nine years. This framework consists of a number of web-based tools for depositing, organizing and utilizing linguistic resources in a digital archive. The regional archives are in principle autonomous archives, but they can decide to share metadata descriptions and language resources with the MPI archive in Nijmegen and become part of a grid of linked LAT archives. By doing so, they will also take advantage of the long-term preservation strategy of the MPI archive. This paper describes the reasoning behind this initiative and how in practice such an archive is set up.

pdf bib abs

Annotation by Category: ELAN and ISO DCR
Han Sloetjes | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Data Category Registry is one of the ISO initiatives towards the establishment of standards for Language Resource management, creation and coding. Successful application of the DCR depends on the availability of tools that can interact with it. This paper describes the first steps that have been taken to provide users of the multimedia annotation tool ELAN, with the means to create references from tiers and annotations to data categories defined in the ISO Data Category Registry. It first gives a brief description of the capabilities of ELAN and the structure of the documents it creates. After a concise overview of the goals and current state of the ISO DCR infrastructure, a description is given of how the preliminary connectivity with the DCR is implemented in ELAN.

pdf bib abs

Foundation of a Component-based Flexible Registry for Language Resources and Technology
Daan Broeder | Thierry Declerck | Erhard Hinrichs | Stelios Piperidis | Laurent Romary | Nicoletta Calzolari | Peter Wittenburg
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Within the CLARIN e-science infrastructure project it is foreseen to develop a component-based registry for metadata for Language Resources and Language Technology. With this registry it is hoped to overcome the problems of the current available systems with respect to inflexible fixed schema, unsuitable terminology and interoperability problems. The registry will address interoperability needs by refering to a shared vocabulary registered in data category registries as they are suggested by ISO.

pdf bib abs

ISOcat: Corralling Data Categories in the Wild
Marc Kemps-Snijders | Menzo Windhouwer | Peter Wittenburg | Sue Ellen Wright
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To achieve true interoperability for valuable linguistic resources different levels of variation need to be addressed. ISO Technical Committee 37, Terminology and other language and content resources, is developing a Data Category Registry. This registry will provide a reusable set of data categories. A new implementation, dubbed ISOcat, of the registry is currently under construction. This paper shortly describes the new data model for data categories that will be introduced in this implementation. It goes on with a sketch of the standardization process. Completed data categories can be reused by the community. This is done by either making a selection of data categories using the ISOcat web interface, or by other tools which interact with the ISOcat system using one of its various Application Programming Interfaces. Linguistic resources that use data categories from the registry should include persistent references, e.g. in the metadata or schemata of the resource, which point back to their origin. These data category references can then be used to determine if two or more resources share common semantics, thus providing a level of interoperability close to the source data and a promising layer for semantic alignment on higher levels.

pdf bib abs

CLARIN: Common Language Resources and Technology Infrastructure
Tamás Váradi | Steven Krauwer | Peter Wittenburg | Martin Wynne | Kimmo Koskenniemi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper provides a general introduction to the CLARIN project, a large-scale European research infrastructure project designed to establish an integrated and interoperable infrastructure of language resources and technologies. The goal is to make language resources and technology much more accessible to all researchers working with language material, particularly non-expert users in the Humanities and Social Sciences. CLARIN intends to build a virtual, distributed infrastructure consisting of a federation of trusted digital archives and repositories where language resources and tools are accessible through web services. The CLARIN project consists of 32 partners from 22 countries and is currently engaged in the preparatory phase of developing the infrastructure. The paper describes the objectives of the project in terms of its technical, legal, linguistic and user dimensions.

2006

pdf bib abs

Metadata Profile in the ISO Data Category Registry
Freddy Offenga | Daan Broeder | Peter Wittenburg | Julien Ducret | Laurent Romary
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Metadata descriptions of language resources become an increasing necessity since the shear amount of language resources is increasing rapidly and especially since we are now creating infrastuctures to access these resources via the web through integrated domains of language resource archives. Yet, the metadata frameworks offered for the domain of language resources (IMDI and OLAC), although mature, are not as widely accepted as necessary. The lack of confidence in the stability and persistence of the concepts and formats introduced by these metadata sets seems to be one argument for people to not invest the time needed for metadata creation. The introduction of these concepts into an ISO standardization process may convince contributors to make use of the terminology. The availability of the ISO Data Category Registry that includes a metadata profile will also offer the opportunity for researchers to construct their own metadata set tailored to the needs of the project at hand, but nevertheless supporting interoperability.

pdf bib abs

LEXUS, a web-based tool for manipulating lexical resources lexicon
Marc Kemps-Snijders | Mark-Jan Nederhof | Peter Wittenburg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

LEXUS provides a flexible framework for the maintaining lexical structure and content. It is the first implementation of the Lexical Markup Framework model currently being developed at ISO TC37/SC4. Amongst its capabilities are the possibility to create lexicon structures, manipulate content and use of typed relations. Integration of well established Data Category Registries is supported to further promote interoperability by allowing access to well established linguistic concepts. Advanced linguistic functionality is offered to assist users in cross lexica operations such as search and comparison and merging of lexica. To enable use within various user groups the look and feel of each lexicon may be customized. In the near future more functionality will be added including integration with other tools accessing lexical content.

pdf bib abs

Comparison of Resource Discovery Methods
Alex Klassmann | Freddy Offenga | Daan Broeder | Romuald Skiba | Peter Wittenburg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

It is an ongoing debate whether categorical systems created by some experts are an appropriate way to help users finding useful resources in the internet. However for the much more restricted domain of language documentation such a category system might still prove reasonable if not indispensable. This article gives an overview over the particular IMDI category set and presents a rough evaluation of its practical use at the Max-Planck-Institute Nijmegen.

pdf bib abs

Ontology-based Language Archive Utilization
Peter Berck | Hans-Jörg Bibiko | Marc Kemps-Snijders | Albert Russel | Peter Wittenburg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

At the MPI for Psycholinguistics a large archive with language resources has been created with contributions from many different individual researchers and research projects. All of these resources, in particular annotated media streams and multimedia lexica, are accessible via the web and can be utilized with the help of web-based utilization frameworks. Therefore, the archive lends itself to motivate users to operate across the boundaries of single corpora and to support cross-language work. This, however, can only be done when the problems of interoperability, in particular at the level of linguistic encoding, can be solved in an efficient way. Two Max-Planck-Institutes are cooperating to build a framework that allows users to easily create their own practical ontologies and if wanted to relate their concepts to central ontologies.

pdf bib abs

Language Archiving, Resource Management LAMUS is a web-based service that allows researchers to deposit their language resources into a language resources archive. It was developed at the MPI for Psycholinguistics for stricter control of the archive coherence and consistency and allowing wider use of the archiving facilities without increasing the workload for archive and corpus managers. LAMUS is based on the use of IMDI metadata standard for language resources and offers metadata search and browsing over the archive.

pdf bib abs

ELAN: a Professional Framework for Multimodality Research
Peter Wittenburg | Hennie Brugman | Albert Russel | Alex Klassmann | Han Sloetjes
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Utilization of computer tools in linguistic research has gained importance with the maturation of media frameworks for the handling of digital audio and video. The increased use of these tools in gesture, sign language and multimodal interaction studies has led to stronger requirements on the flexibility, the efficiency and in particular the time accuracy of annotation tools. This paper describes the efforts made to make ELAN a tool that meets these requirements, with special attention to the developments in the area of time accuracy. In subsequent sections an overview will be given of other enhancements in the latest versions of ELAN that makes it a useful tool in multimodality research.

pdf bib abs

Technologies for a Federation of Language Resource Archives
Daan Broeder | Freddy Offenga | Peter Wittenburg | Peter van der Kamp | David Nathan | Sven Strömqvist
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The DAM-LR project aims at virtually integrating various European language resource archives that allow users to navigate and operate in a single unified domain of language resources. This type of integration introduces Grid technology to the humanities disciplines and forms a federation of archives. It is the basis for establishing a research infrastructure for language resources which will finally enable eHumanities. Currently, the complete architecture is designed based on a few well-known components and some components are already tested. Based on the technological insights gathered and due to discussions within the international DELAMAN network the ethical and organizational basis for such a federation is defined.

pdf bib abs

An API for accessing the Data Category Registry
Marc Kemps-Snijders | Julien Ducret | Laurent Romary | Peter Wittenburg
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Central Ontologies are increasingly important to manage interoperability between different types of language resources. This was the reason for ISO to set up a new committee ISO TC37/SC4 taking care of language resource management issues. Central to the work of this committee is the definition of a framework for a central registry of data categories that are important in the domain of language resources. This paper describes an application programming interface that was designed to request services from this data category registry. The DCR is operational and the described API has already been tested from a lexicon application.

pdf bib abs

Foundations of Modern Language Resource Archives
Peter Wittenburg | Daan Broeder | Wolfgang Klein | Stephen Levinson | Laurent Romary
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A number of serious reasons will convince an increasing amount of researchers to store their relevant material in centers which we will call "language resource archives". They combine the duty of taking care of long-term preservation as well as the task to give access to their material to different user groups. Access here is meant in the sense that an active interaction with the data will be made possible to support the integration of new data, new versions or commentaries of all sorts. Modern Language Resource Archives will have to adhere to a number of basic principles to fulfill all requirements and they will have to be involved in federations to create joint language resource domains making it even simpler for the researchers to access the data. This paper makes an attempt to formulate the essential pillars language resource archives have to adhere to.

2004

pdf bib

Towards Metadata Interoperability
Peter Wittenburg | Daan Broeder | Paul Buitelaar
Proceeedings of the Workshop on NLP and XML (NLPXML-2004): RDF/RDFS and OWL in Language Technology

pdf bib

Using Profiles for IMDI Metadata Creation
Daan Broeder | Peter Wittenburg | Onno Crasborn
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib

Cross-Disciplinary Integration of Metadata Descriptions
Peter Wittenburg | Greg Gulrajani | Daan Broeder | Marcus Uneson
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib abs

Architecture for Distributed Language Resource Management and Archiving
Peter Wittenburg | Heidi Johnson | Markus Buchhorn | Hennie Brugman | Daan Broeder
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

An architecture is presented that provides an integrated framework for managing, archiving and accessing language resources. This architecture was discussed in the DELAMAN network – a world-wide network of archives holding material about endangered languages. Such a framework will be built upon a metadata infrastructure, a mechanism to resolve unique resource identifiers, user and access rights management components. These components are closely related and have to be based on redundant and distributed services. For all these components existing middleware seems to be available, however, it has to be checked how they can interact with each other.

pdf bib

pdf bib