2020
Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN
Franciska de Jong | Bente Maegaard | Darja Fišer | Dieter van Uytvanck | Andreas Witt
Proceedings of the Twelfth Language Resources and Evaluation Conference
CLARIN is a European Research Infrastructure providing access to language resources and technologies for researchers in the humanities and social sciences. It supports the use and study of language data in general and aims to increase the potential for comparative research of cultural and societal phenomena across the boundaries of languages and disciplines, all in line with the European agenda for Open Science. Data infrastructures such as CLARIN have recently embarked on the emerging frameworks for the federation of infrastructural services, such as the European Open Science Cloud and the integration of services resulting from multidisciplinary collaboration in federated services for the wider SSH domain. In this paper we describe the interoperability requirements that arise through the existing ambitions and the emerging frameworks. The interoperability theme will be addressed at several levels, including organisation and ecosystem, design of workflow services, data curation, performance measurement and collaboration.
A Shared Task of a New, Collaborative Type to Foster Reproducibility: A First Exercise in the Area of Language Science and Technology with REPROLANG2020
António Branco | Nicoletta Calzolari | Piek Vossen | Gertjan Van Noord | Dieter van Uytvanck | João Silva | Luís Gomes | André Moreira | Willem Elbers
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we introduce a new type of shared task, one that is collaborative rather than competitive, designed to support and foster the reproduction of research results. We also describe the first event running such a novel challenge, present the results obtained, discuss the lessons learned and ponder on future undertakings.
CLARIN: Distributed Language Resources and Technology in a European Infrastructure
Maria Eskevich | Franciska de Jong | Alexander König | Darja Fišer | Dieter Van Uytvanck | Tero Aalto | Lars Borin | Olga Gerassimenko | Jan Hajic | Henk van den Heuvel | Neeme Kahusk | Krista Liin | Martin Matthiesen | Stelios Piperidis | Kadri Vider
Proceedings of the 1st International Workshop on Language Technology Platforms
CLARIN is a European Research Infrastructure providing access to digital language resources and tools from across Europe and beyond to researchers in the humanities and social sciences. This paper focuses on CLARIN as a platform for the sharing of language resources. It zooms in on the service offer for the aggregation of language repositories and the value proposition for a number of communities that benefit from the enhanced visibility of their data and services as a result of integration in CLARIN. The enhanced findability of language resources is serving the social sciences and humanities (SSH) community at large and supports research communities that aim to collaborate based on virtual collections for a specific domain. The paper also addresses the wider landscape of service platforms based on language technologies which has the potential of becoming a powerful set of interoperable facilities to a variety of communities of use.
2018
CLARIN: Towards FAIR and Responsible Data Science Using Language Resources
Franciska de Jong | Bente Maegaard | Koenraad De Smedt | Darja Fišer | Dieter Van Uytvanck
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2012
Semantic metadata mapping in practice: the Virtual Language Observatory
Dieter Van Uytvanck | Herman Stehouwer | Lari Lampen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In this paper we present the Virtual Language Observatory (VLO), a metadata-based portal for language resources. It is completely based on the Component Metadata (CMDI) and ISOcat standards. This approach allows for the use of heterogeneous metadata schemas while maintaining the semantic compatibility. We describe the metadata harvesting process, based on OAI-PMH, and the conversion from several formats (OLAC, IMDI and the CLARIN LRT inventory) to their CMDI counterpart profiles. Then we focus on some post-processing steps to polish the harvested records. Next, the ingestion of the CMDI files into the VLO facet browser is described. We also include an overview of the changes since the first version of the VLO, based on user feedback from the CLARIN community. Finally there is an overview of additional ideas and improvements for future versions of the VLO.
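The OAI-PMH harvesting step mentioned in this abstract can be illustrated with a minimal sketch. It is not the VLO's actual harvester: the endpoint `https://example.org/oai`, the sample response and the helper names `list_records_url` and `parse_list_records` are assumptions made for illustration. The sketch only shows how a `ListRecords` request is formed and how record identifiers and a resumption token are read from a response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(
            "oai:ListRecords/oai:record/oai:header/oai:identifier", OAI_NS
        )
    ]
    token_element = root.find("oai:ListRecords/oai:resumptionToken", OAI_NS)
    has_token = token_element is not None and token_element.text
    return identifiers, (token_element.text if has_token else None)


# Invented sample response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:rec1</identifier></header></record>
    <record><header><identifier>oai:example.org:rec2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

identifiers, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai", "oai_dc", token)
```

A harvester would repeat the request with each returned token until a response carries no token; conversion of the harvested records to CMDI is a separate step that this sketch does not cover.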
Proper Language Resource Centers
Willem Elbers | Daan Broeder | Dieter van Uytvanck
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Language resource centers allow researchers to reliably deposit their structured data together with associated metadata and run services operating on this deposited data. We are looking into possibilities to create long-term persistency of both the deposited data and the services operating on this data. The challenges to be solved, both technical and non-technical, include the need to replicate more than just the data; the proper identification of digital objects in a distributed environment by means of persistent identifiers; and the set-up of a proper authentication and authorization domain, including the management of the authorization information on the digital objects. We acknowledge the investment that most language resource centers have made in their current infrastructure. Therefore one of the most important requirements is loose coupling with existing infrastructures without the need to make many changes. This shift from a single language resource center to a federated environment of many language resource centers is discussed in the context of a real-world center: The Language Archive, supported by the Max Planck Institute for Psycholinguistics.
Standardizing a Component Metadata Infrastructure
Daan Broeder | Dieter van Uytvanck | Maria Gavrilidou | Thorsten Trippel | Menzo Windhouwer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the status of the standardization efforts of a Component Metadata approach for describing Language Resources with metadata. Different linguistic and Language & Technology communities such as CLARIN, META-SHARE and NaLiDa use this component approach and see its standardization as a matter for cooperation that has the possibility to create a large interoperable domain of joint metadata. Starting with an overview of the component metadata approach together with the related semantic interoperability tools and services, such as the ISOcat data category registry and the relation registry, we explain the standardization plan and efforts for component metadata within ISO TC37/SC4. Finally, we present information about uptake and plans of the use of component metadata within the three mentioned linguistic and L&T communities.
Citing on-line Language Resources
Daan Broeder | Dieter van Uytvanck | Gunter Senft
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Although the possibility of referring to or citing on-line data from publications is seen, at least theoretically, as an important means to provide immediate, testable proof or a simple illustration of a line of reasoning, the practice is not yet widespread and no extensive experience has been gained about the possibilities and problems of referring to raw data sets. This paper makes a case for investigating the possibility of and need for persistent data visualization services that facilitate the inspection and evaluation of the cited data.
2010
A Data Category Registry- and Component-based Metadata Framework
Daan Broeder | Marc Kemps-Snijders | Dieter Van Uytvanck | Menzo Windhouwer | Peter Withers | Peter Wittenburg | Claus Zinn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe our computer-supported framework to overcome the rule of metadata schism. It combines the use of controlled vocabularies, managed by a data category registry, with a component-based approach, where the categories can be combined to yield complex metadata structures. A metadata scheme devised in this way will thus be grounded in its use of categories. Schema designers will profit from existing prefabricated larger building blocks, motivating re-use at a larger scale. The common base of any two metadata schemes within this framework will solve, at least to a good extent, the semantic interoperability problem, and consequently, further promote systematic use of metadata for existing resources and tools to be shared.
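The idea that a shared base of data categories makes two metadata schemes interoperable can be sketched as follows. This is an illustration only, not the framework's implementation: the two schemas, the category identifiers such as `DC-title`, and the helpers `field_mapping` and `convert` are all hypothetical.

```python
# Each schema links its own field names to data category identifiers.
# The identifiers are invented; a real registry would issue persistent ones.
SCHEMA_A = {"Title": "DC-title", "Lang": "DC-language", "Creator": "DC-creator"}
SCHEMA_B = {
    "resourceName": "DC-title",
    "languageName": "DC-language",
    "licence": "DC-licence",
}


def field_mapping(source_schema, target_schema):
    """Map source fields to target fields that reference the same category."""
    target_by_category = {
        category: field for field, category in target_schema.items()
    }
    return {
        field: target_by_category[category]
        for field, category in source_schema.items()
        if category in target_by_category
    }


def convert(record, source_schema, target_schema):
    """Carry over only the values whose fields share a data category."""
    mapping = field_mapping(source_schema, target_schema)
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping
    }


record_a = {"Title": "Corpus X", "Lang": "Dutch", "Creator": "N.N."}
record_b = convert(record_a, SCHEMA_A, SCHEMA_B)
```

Fields without a shared category (`Creator` and `licence` here) are simply not mapped, which mirrors the abstract's claim that the common base addresses interoperability "at least to a good extent" rather than completely.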
Virtual Language Observatory: The Portal to the Language Resources and Technology Universe
Dieter Van Uytvanck | Claus Zinn | Daan Broeder | Peter Wittenburg | Mariano Gardellini
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Over the years, the field of Language Resources and Technology (LRT) has developed a tremendous amount of resources and tools. However, there is no ready-to-use map that researchers could use to gain a good overview and steadfast orientation when searching for, say corpora or software tools to support their studies. It is rather the case that information is scattered across project- or organisation-specific sites, which makes it hard if not impossible for less-experienced researchers to gather all relevant material. Clearly, the provision of metadata is central to resource and software exploration. However, in the LRT field, metadata comes in many forms, tastes and qualities, and therefore substantial harmonization and curation efforts are required to provide researchers with metadata-based guidance. To address this issue a broad alliance of LRT providers (CLARIN, the Linguist List, DOBES, DELAMAN, DFKI, ELRA) have initiated the Virtual Language Observatory portal to provide a low-barrier, easy-to-follow entry point to language resources and tools; it can be accessed via http://www.clarin.eu/vlo
2008
Language-Sites: Accessing and Presenting Language Resources via Geographic Information Systems
Dieter Van Uytvanck | Alex Dukers | Jacquelijn Ringersma | Paul Trilsbeek
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The emerging area of Geographic Information Systems (GIS) has proven to add an interesting dimension to many research projects. Within the language-sites initiative we have brought together a broad range of links to digital language corpora and resources. Via Google Earth's visually appealing 3D interface, users can spin the globe, zoom into an area they are interested in and directly access the relevant language resources. This paper focuses on several ways of relating the map and the online data (lexica, annotations, multimedia recordings, etc.). Furthermore, we discuss some of the implementation choices that have been made, as well as future challenges. In addition, we use practical examples to show how scholars (both linguists and anthropologists) are using GIS tools to fulfill their specific research needs. This illustrates how both scientists and the general public can benefit from geography-based access to digital language data.