Ineke Schuurman


2024

SignON – a Co-creative Machine Translation for Sign and Spoken Languages (end-of-project results, contributions and lessons learned)
Dimitar Shterionov | Vincent Vandeghinste | Mirella Sisto | Aoife Brady | Mathieu De Coster | Lorraine Leeson | Andy Way | Josep Blat | Frankie Picron | Davy Landuyt | Marcello Scipioni | Aditya Parikh | Louis Bosch | John O’Flaherty | Joni Dambre | Caro Brosens | Jorn Rijckaert | Víctor Ubieto | Bram Vanroy | Santiago Gomez | Ineke Schuurman | Gorka Labaka | Adrián Núñez-Marcos | Irene Murtagh | Euan McGill | Horacio Saggion
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

SignON, a 3-year Horizon 2020 project addressing the lack of technology and services for MT between sign languages (SLs) and spoken languages (SpLs), ended in December 2023. SignON was unprecedented. Not only did it address the wider complexity of the aforementioned problem – from research and development of recognition, translation and synthesis, through development of easy-to-use mobile applications and a cloud-based framework to do the “heavy lifting”, to establishing ethical, privacy and inclusiveness policies and operation guidelines – but it also engaged with the deaf and hard of hearing communities in an effective co-creation approach where these main stakeholders drove the development in the right direction and had the final say. Currently we are witnessing advances in natural language processing for SLs, including MT. SignON was one of the largest projects that contributed to this surge with 17 partners and more than 60 consortium members, working in parallel with other international and European initiatives, such as project EASIER and others.

2023

A Linked Data Approach for linking and aligning Sign Language and Spoken Language Data
Thierry Declerck | Sam Bigeard | Fahad Khan | Irene Murtagh | Sussi Olsen | Mike Rosner | Ineke Schuurman | Andon Tchechmedjiev | Andy Way
Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages

We present work dealing with a Linked Open Data (LOD)-compliant representation of Sign Language (SL) data, with the goal of supporting the cross-lingual alignment of SL data and their linking to Spoken Language (SpL) data. The proposed representation is based on activities of groups of researchers in the field of SL who have investigated the use of Open Multilingual Wordnet (OMW) datasets for (manually) cross-linking SL data or for linking SL and SpL data. Another group of researchers is proposing an XML encoding of articulatory elements of SLs and (manually) linking those to an SpL lexical resource. We propose an RDF-based representation of those various data. This unified formal representation offers a semantic repository of information on SL and SpL data that could be accessed for supporting the creation of datasets for training or evaluating NLP applications dealing with SLs, thinking for example of Machine Translation (MT) between SLs and between SLs and SpLs.

SignON: Sign Language Translation. Progress and challenges.
Vincent Vandeghinste | Dimitar Shterionov | Mirella De Sisto | Aoife Brady | Mathieu De Coster | Lorraine Leeson | Josep Blat | Frankie Picron | Marcello Paolo Scipioni | Aditya Parikh | Louis ten Bosch | John O’Flaherty | Joni Dambre | Jorn Rijckaert | Bram Vanroy | Victor Ubieto Nogales | Santiago Egea Gomez | Ineke Schuurman | Gorka Labaka | Adrián Núnez-Marcos | Irene Murtagh | Euan McGill | Horacio Saggion
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

SignON (https://signon-project.eu/) is a Horizon 2020 project, running from 2021 until the end of 2023, which addresses the lack of technology and services for the automatic translation between sign languages (SLs) and spoken languages, through an inclusive, human-centric solution, hence contributing to the repertoire of communication media for deaf, hard of hearing (DHH) and hearing individuals. In this paper, we present an update of the status of the project, describing the approaches developed to address the challenges and peculiarities of SL machine translation (SLMT).

Are there just WordNets or also SignNets?
Ineke Schuurman | Thierry Declerck | Caro Brosens | Margot Janssens | Vincent Vandeghinste | Bram Vanroy
Proceedings of the 12th Global Wordnet Conference

For Sign Languages (SLs), can we create a SignNet, like a WordNet for spoken languages: a network of semantic relations between constitutive elements of SLs? We first discuss approaches that link SL data to wordnets, or integrate such elements with some adaptations into the structure of WordNet. Then, we present requirements for a SignNet, which is built on SL data and then linked to WordNet.

2016

Improving Text-to-Pictograph Translation Through Word Sense Disambiguation
Leen Sevens | Gilles Jacobs | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

AfriBooms: An Online Treebank for Afrikaans
Liesbeth Augustinus | Peter Dirix | Daniel van Niekerk | Ineke Schuurman | Vincent Vandeghinste | Frank Van Eynde | Gerhard van Huyssteen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy to use online linguistic search engine for Afrikaans for use by researchers and students in the humanities and social sciences. The search tool is based on a similar development for Dutch, i.e. GrETEL, a user-friendly search engine which allows users to query a treebank by means of a natural language example instead of a formal search instruction.

2015

Natural Language Generation from Pictographs
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

Extending a Dutch Text-to-Pictograph Converter to English and Spanish
Leen Sevens | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

2014

Experiences with the ISOcat Data Category Registry
Daan Broeder | Ineke Schuurman | Menzo Windhouwer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The ISOcat Data Category Registry has been a joint project of both ISO TC 37 and the European CLARIN infrastructure. In this paper the experiences of using ISOcat in CLARIN are described and evaluated. This evaluation clarifies the requirements of CLARIN with regard to a semantic registry to support its semantic interoperability needs. A simpler model based on concepts instead of data categories and a simpler workflow based on community recommendations will address these needs better and offer the required flexibility.

Linking Pictographs to Synsets: Sclera2Cornetto
Vincent Vandeghinste | Ineke Schuurman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Social inclusion of people with Intellectual and Developmental Disabilities can be promoted by offering them ways to independently use the internet. People with reading or writing disabilities can use pictographs instead of text. We present a resource in which we have linked a set of 5710 pictographs to lexical-semantic concepts in Cornetto, a Wordnet-like database for Dutch. We show that, by using this resource in a text-to-pictograph translation system, we can greatly improve the coverage compared with a baseline where words are converted into pictographs only if the word equals the filename.

Linguistic resources and cats: how to use ISOcat, RELcat and SCHEMAcat
Menzo Windhouwer | Ineke Schuurman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Within the European CLARIN infrastructure ISOcat is used to enable both humans and computer programs to find specific resources even when they use different terminology or data structures. In order to do so, it should be clear which concepts are used in these resources, both at the level of metadata for the resource and at the level of its content, and what is meant by them. The concepts can be specified in ISOcat. SCHEMAcat enables us to relate the concepts used by a resource, while RELcat enables us to type these relationships and add relationships beyond resource boundaries. This way these three registries together allow us (and the programs) to find what we are looking for.

2013

Example-Based Treebank Querying with GrETEL–Now Also for Spoken Dutch
Liesbeth Augustinus | Vincent Vandeghinste | Ineke Schuurman | Frank Van Eynde
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Beyond SoNaR: towards the facilitation of large corpus building efforts
Martin Reynaert | Ineke Schuurman | Véronique Hoste | Nelleke Oostdijk | Maarten van Gompel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well as to the budget, employing current state-of-the-art tools, standards and best practices. By doing so we aim to pass on insights that may be beneficial for anyone considering undertaking an effort towards building a large, varied yet balanced corpus for use by the wider research community. Various issues are discussed that come into play while compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and steps to be taken in the preprocessing of data from widely different origins. We describe FoLiA, a new XML format geared at rich linguistic annotations. We also explain the rationale behind the investment in the high-quality semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated work flow between web services and the potential role for ISOcat. We list tips for potential corpus builders, tricks they may want to try and further recommendations regarding technical developments future corpus builders may wish to hope for.

2010

Interacting Semantic Layers of Annotation in SoNaR, a Reference Corpus of Contemporary Written Dutch
Ineke Schuurman | Véronique Hoste | Paola Monachesi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper reports on the annotation of a corpus of 1 million words with four semantic annotation layers, including named entities, co-reference relations, semantic roles and spatial and temporal expressions. These semantic annotation layers can benefit from the manually verified part of speech tagging, lemmatization and syntactic analysis (dependency tree) information layers which resulted from an earlier project (Van Noord et al., 2006) and will thus result in a deeply syntactically and semantically annotated corpus. This annotation effort is carried out in the framework of a larger project which aims at the collection of a 500-million word corpus of contemporary Dutch, covering the variants used in the Netherlands and Flanders, the Dutch speaking part of Belgium. All the annotation schemes used were (co-)developed by the authors within the Flemish-Dutch STEVIN-programme as no previous schemes for Dutch were available. They were created taking into account standards (either de facto or official (like ISO)) used elsewhere.

Cultural Aspects of Spatiotemporal Analysis in Multilingual Applications
Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we want to point out some issues arising when a natural language processing task involves several languages (like multilingual, multidocument summarization and the machine translation aspects involved) which are often neglected. These issues are of a more cultural nature, and may even come into play when several documents in a single language are involved. We pay special attention to those aspects dealing with the spatiotemporal characteristics of a text. Correct automatic selection of (parts of) texts such as handling the same eventuality, presupposes spatiotemporal disambiguation at a rather specific level. The same holds for the analysis of the query. For generation and translation purposes, spatiotemporal aspects may be relevant as well. At the moment English (both the British and American variants) and Dutch (the Flemish and Dutch variant) are covered, all taking into account the perspective of a contemporary, Flemish user. In our approach the cultural aspects associated with for example the language of publication and the language used by the user play a crucial role.

2008

From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk | Martin Reynaert | Paola Monachesi | Gertjan Van Noord | Roeland Ordelman | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.

Spatiotemporal Annotation Using MiniSTEx: how to deal with Alternative, Foreign, Vague and/or Obsolete Names?
Ineke Schuurman
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We are currently developing MiniSTEx, a spatiotemporal annotation system to handle temporal and/or geospatial information directly and indirectly expressed in texts. In the end the aim is to locate all eventualities in a text on a time axis and/or a map to ensure an optimal base for automatic temporal and geospatial reasoning. MiniSTEx was originally developed for Dutch, keeping in mind that it should also be useful for other European languages, and for multilingual applications. In order to meet these desiderata we need the MiniSTEx system to be able to draw the conclusions human readers would also draw, e.g. based on their (spatiotemporal) world knowledge, i.e. the common knowledge such readers share. Therefore, notions like “background knowledge”, “intended audience”, and “present-day user” play a major role in our approach. The world knowledge MiniSTEx uses is contained in interconnected tables in a database. At the moment it is used for Dutch and English. Special attention will be paid to the problems we face when looking at older texts or recent historical or encyclopedic texts, i.e. texts with lots of references to times and locations that are not compatible with our current maps and calendars.

Evaluation of a Machine Translation System for Low Resource Languages: METIS-II
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman | Stella Markantonatou | Sokratis Sofianopoulos | Marina Vassiliou | Olga Yannoutsou | Toni Badia | Maite Melero | Gemma Boleda | Michael Carl | Paul Schmidt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.

2007

Demonstration of the Dutch-to-English METIS-II MT system
Peter Dirix | Vincent Vandeghinste | Ineke Schuurman
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Antal van den Bosch | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe a case study in the reuse and transfer of tools in language resource development, from a corpus of spoken Dutch to a corpus of written Dutch. Once tools for a particular language have been developed, it is logical, but not trivial, to reuse them for other types or registers of the language than the tools were originally designed for. This paper reviews the decisions and adaptations necessary to make this particular transfer from spoken to written language, focusing on a part-of-speech tagger and a lemmatizer. While the lemmatizer can be transferred fairly straightforwardly, the tagger needs to be adapted considerably. We show how it can be adapted without starting from scratch. We describe how the part-of-speech tagset was adapted and how the tagger was retrained to deal with written-text phenomena it had not been trained on earlier.

METIS-II: Machine Translation for Low Resource Languages
Vincent Vandeghinste | Ineke Schuurman | Michael Carl | Stella Markantonatou | Toni Badia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the resources needed in the described system. Several approaches are presented.

Syntactic Annotation of Large Corpora in STEVIN
Gertjan van Noord | Ineke Schuurman | Vincent Vandeghinste
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as an important tool for constructing the syntactic annotations, as well as a number of other annotation tools and guidelines. For the full STEVIN corpus, automatically derived syntactic annotations will be provided in a later phase of the programme. A number of arguments is provided suggesting that such a resource can be very useful for applications in information extraction, ontology building, lexical acquisition, machine translation and corpus linguistics.

2005

METIS-II: Example-based Machine Translation Using Monolingual Corpora - System Description
Peter Dirix | Ineke Schuurman | Vincent Vandeghinste
Workshop on example-based machine translation

The METIS-II project is an example-based machine translation system, making use of minimal resources and tools for both source and target language, making use of a target-language (TL) corpus, but not of any parallel corpora. In the current paper, we discuss the view of our team on the general philosophy and outline of the METIS-II system.

Example-based Translation Without Parallel Corpora: First Experiments on a Prototype
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman
Workshop on example-based machine translation

For the METIS-II project (IST, start: 10-2004 – end: 09-2007) we are working on an example-based machine translation system, making use of minimal resources and tools for both source and target language, i.e. making use of a target language corpus, but not of any parallel corpora. In the current paper, we present the results of the first experiments with our approach (CCL) within the METIS consortium: the translation of noun phrases from Dutch to English, using the British National Corpus as a target language corpus. Future research is planned along similar lines for the sentence as is presented here for the noun phrase.

2004

Linguistic Annotation of the Spoken Dutch Corpus: If We Had To Do It All Over Again
Ineke Schuurman | Wim Goedertier | Heleen Hoekstra | Nelleke Oostdijk | Richard Piepenbrock | Machteld Schouppe
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

After the successful completion of the Spoken Dutch Corpus (1998–2003) the time is ripe to take some time to sit back and reflect on our achievements and the procedures underlying them in order to learn from our experiences. In this paper we will in particular pay attention to issues affecting the levels of linguistic annotation, but some more general issues deserve to be treated as well (bug reporting, consistency). We will try to come up with solutions, but sometimes we want to invite further discussion from other researchers.

2003

CGN, an annotated corpus of spoken Dutch
Ineke Schuurman | Machteld Schouppe | Heleen Hoekstra | Ton van der Wouden
Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003

2002

Syntactic Analysis in the Spoken Dutch Corpus (CGN)
Ton van der Wouden | Heleen Hoekstra | Michael Moortgat | Bram Renmans | Ineke Schuurman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)