Eric Sanders - ACL Anthology

Eric Sanders

2024

Corpus Creation and Automatic Alignment of Historical Dutch Dialect Speech
Martijn Bentum | Eric Sanders | Antal P.J. van den Bosch | Douwe Zeldenrust | Henk van den Heuvel
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Dutch Dialect Database (also known as the ‘Nederlandse Dialectenbank’) contains dialectal variations of Dutch that were recorded all over the Netherlands in the second half of the twentieth century. A subset of these recordings of about 300 hours were enriched with manual orthographic transcriptions, using non-standard approximations of dialectal speech. In this paper we describe the creation of a corpus containing both the audio recordings and their corresponding transcriptions and focus on our method for aligning the recordings with the transcriptions and the metadata.

FAIRification of LeiLanD
Eric Sanders | Sara Petrollino | Gilles R. Scheifer | Henk van den Heuvel | Christopher Handy
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

LeiLanD (Leiden Language Data) is a searchable catalogue initiated by the Leiden University Centre for Linguistics (LUCL) with the support of CLARIAH. The catalogue contains metadata about language datasets collected at LUCL and other institutes of Leiden University. This paper describes a project to FAIRify the datasets increasing their findability and accessibility through a standardised metadata format CMDI so as to obtain a rich metadata description for all resources and to make them findable through CLARIN’s Virtual Language Observatory. The paper describes the creation of the catalogue and the steps that led from unstructured metadata to CMDI standards. This FAIRifi- cation of LeiLanD has enhanced the findability and accessibility of incredibly diverse collection of language datasets.

2022

Correlating Political Party Names in Tweets, Newspapers and Election Results
Eric Sanders | Antal van den Bosch
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

Twitter has been used as a textual resource to attempt to predict the outcome of elections for over a decade. A body of literature suggests that this is not consistently possible. In this paper we test the hypothesis that mentions of political parties in tweets are better correlated with the appearance of party names in newspapers than to the intention of the tweeter to vote for that party. Five Dutch national elections are used in this study. We find only a small positive, negligible difference in Pearson’s correlation coefficient as well as in the absolute error of the relation between tweets and news, and between tweets and elections. However, we find a larger correlation and a smaller absolute error between party mentions in newspapers and the outcome of the elections in four of the five elections. This suggests that newspapers are a better starting point for predicting the election outcome than tweets.

2020

Optimising Twitter-based Political Election Prediction with Relevance andSentiment Filters
Eric Sanders | Antal van den Bosch
Proceedings of the Twelfth Language Resources and Evaluation Conference

We study the relation between the number of mentions of political parties in the last weeks before the elections and the election results. In this paper we focus on the Dutch elections of the parliament in 2012 and for the provinces (and the senate) in 2011 and 2015. With raw counts, without adaptations, we achieve a mean absolute error (MAE) of 2.71% for 2011, 2.02% for 2012 and 2.89% for 2015. A set of over 17,000 tweets containing political party names were annotated by at least three annotators per tweet on ten features denoting communicative intent (including the presence of sarcasm, the message’s polarity, the presence of an explicit voting endorsement or explicit voting advice, etc.). The annotations were used to create oracle (gold-standard) filters. Tweets with or without a certain majority annotation are held out from the tweet counts, with the goal of attaining lower MAEs. With a grid search we tested all combinations of filters and their responding MAE to find the best filter ensemble. It appeared that the filters show markedly different behaviour for the three elections and only a small MAE improvement is possible when optimizing on all three elections. Larger improvements for one election are possible, but result in deterioration of the MAE for the other elections.

2016

Can Tweets Predict TV Ratings?
Bridget Sommerdijk | Eric Sanders | Antal van den Bosch
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We set out to investigate whether TV ratings and mentions of TV programmes on the Twitter social media platform are correlated. If such a correlation exists, Twitter may be used as an alternative source for estimating viewer popularity. Moreover, the Twitter-based rating estimates may be generated during the programme, or even before. We count the occurrences of programme-specific hashtags in an archive of Dutch tweets of eleven popular TV shows broadcast in the Netherlands in one season, and perform correlation tests. Overall we find a strong correlation of 0.82; the correlation remains strong, 0.79, if tweets are counted a half hour before broadcast time. However, the two most popular TV shows account for most of the positive effect; if we leave out the single and second most popular TV shows, the correlation drops to being moderate to weak. Also, within a TV show, correlations between ratings and tweet counts are mostly weak, while correlations between TV ratings of the previous and next shows are strong. In absence of information on previous shows, Twitter-based counts may be a viable alternative to classic estimation methods for TV ratings. Estimates are more reliable with more popular TV shows.

Palabras: Crowdsourcing Transcriptions of L2 Speech
Eric Sanders | Pepi Burgos | Catia Cucchiarini | Roeland van Hout
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We developed a web application for crowdsourcing transcriptions of Dutch words spoken by Spanish L2 learners. In this paper we discuss the design of the application and the influence of metadata and various forms of feedback. Useful data were obtained from 159 participants, with an average of over 20 transcriptions per item, which seems a satisfactory result for this type of research. Informing participants about how many items they still had to complete, and not how many they had already completed, turned to be an incentive to do more items. Assigning participants a score for their performance made it more attractive for them to carry out the transcription task, but this seemed to influence their performance. We discuss possible advantages and disadvantages in connection with the aim of the research and consider possible lessons for designing future experiments.

Curation of Dutch Regional Dictionaries
Henk van den Heuvel | Eric Sanders | Nicoline van der Sijs
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the process of semi-automatically converting dictionaries from paper to structured text (database) and the integration of these into the CLARIN infrastructure in order to make the dictionaries accessible and retrievable for the research community. The case study at hand is that of the curation of 42 fascicles of the Dictionaries of the Brabantic and Limburgian dialects, and 6 fascicles of the Dictionary of dialects in Gelderland.

2014

The Dutch LESLLA Corpus
Eric Sanders | Ineke van de Craats | Vanja de Lint
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the Dutch LESLLA data and its curation. LESLLA stands for Low-Educated Second Language and Literacy Acquisition. The data was collected for research in this field and would have been disappeared if it were not saved. Within the CLARIN project Data Curation Service the data was made into a spoken language resource and made available to other researchers.

Vulnerability in Acquisition, Language Impairments in Dutch: Creating a VALID Data Archive
Jetske Klatter | Roeland van Hout | Henk van den Heuvel | Paula Fikkert | Anne Baker | Jan de Jong | Frank Wijnen | Eric Sanders | Paul Trilsbeek
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The VALID Data Archive is an open multimedia data archive (under construction) with data from speakers suffering from language impairments. We report on a pilot project in the CLARIN-NL framework in which five data resources were curated. For all data sets concerned, written informed consent from the participants or their caretakers has been obtained. All materials were anonymized. The audio files were converted into wav (linear PCM) files and the transcriptions into CHAT or ELAN format. Research data that consisted of test, SPSS and Excel files were documented and converted into CSV files. All data sets obtained appropriate CMDI metadata files. A new CMDI metadata profile for this type of data resources was established and care was taken that ISOcat metadata categories were used to optimize interoperability. After curation all data are deposited at the Max Planck Institute for Psycholinguistics Nijmegen where persistent identifiers are linked to all resources. The content of the transcriptions in CHAT and plain text format can be searched with the TROVA search engine.

2012

An Oral History Annotation Tool for INTER-VIEWs
Henk van den Heuvel | Eric Sanders | Robin Rutten | Stef Scagliola | Paula Witkamp
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a web-based tool for retrieving and annotating audio fragments of e.g. interviews. Our collection contains 250 interviews with veterans of Dutch conflicts and military missions. The audio files of the interviews were disclosed using ASR technology focussed at keyword retrieval. Resulting transcripts were stored in a MySQL database together with metadata, summary texts, and keywords, and carefully indexed. Retrieved fragments can be made audible and annotated. Annotations can be kept personal or be shared with other users. The tool and formats comply with CLARIN standards. A demo version of the tool is available at http://wwwlands2.let.kun.nl/spex/annotationtooldemo.

Collecting and Analysing Chats and Tweets in SoNaR
Eric Sanders
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper a collection of chats and tweets from the Netherlands and Flanders is described. The chats and tweets are part of the freely available SoNaR corpus, a 500 million word text corpus of the Dutch language. Recruitment, metadata, anonymisation and IPR issues are discussed. To illustrate the difference of language use between the various text types and other parameters (like gender and age) simple text analysis in the form of unigram frequency lists is carried out. Furthermore a website is presented with which users can retrieve their own frequency lists.

2010

The VeteranTapes: Research Corpus, Fragment Processing Tool, and Enhanced Publications for the e-Humanities
Henk van den Heuvel | René van Horik | Stef Scagliola | Eric Sanders | Paula Witkamp
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Enhanced Publications are a new way to publish scientific and other results in an electronic article. The advantage of EPs is that the relation between the article and the underlying data facilitate the peer review process and other quality assessment activities. Due to the link between de publication and the research data the publication can be much richer than a paper edition permits. We present an example of EPs in which links are made to interview fragments that include transcripts, audio segments, annotations and metadata. EPs call for a new paradigm of research methodology in which digital persistent access to research data are a central issue. In this contribution we highlight 1. The research data as it is archived and curated, 2. the concept ""enhanced publication"" and its scientific value, 3. the ""fragment fitter tool"", a language processing tool to facilitate the creation of EPs, 4. IPR issues related to the re-use of the interview data.

2008

Recording Speech of Children, Non-Natives and Elderly People for HLT Applications: the JASMIN-CGN Corpus.
Catia Cucchiarini | Joris Driesen | Hugo Van hamme | Eric Sanders
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Within the framework of the Dutch-Flemish programme STEVIN, the JASMIN-CGN (Jongeren, Anderstaligen en Senioren in Mens-machine Interactie Corpus Gesproken Nederlands) project was carried out, which was aimed at collecting speech of children, non-natives and elderly people. The JASMIN-CGN project is an extension of the Spoken Dutch Corpus (CGN) along three dimensions. First, by collecting a corpus of contemporary Dutch as spoken by children of different age groups, elderly people and non-natives with different mother tongues, an extension along the age and mother tongue dimensions was achieved. In addition, we collected speech material in a communication setting that was not envisaged in the CGN: human-machine interaction. One third of the data was collected in Flanders and two thirds in the Netherlands. In this paper we report on our experiences in collecting this corpus and we describe some of the important decisions that we made in the attempt to combine efficiency and high quality.

The IFADV Corpus: a Free Dialog Video Corpus
Rob van Son | Wieneke Wesseling | Eric Sanders | Henk van den Heuvel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Research into spoken language has become more visual over the years. Both fundamental and applied research have progressively included gestures, gaze, and facial expression. Corpora of multi-modal conversational speech are rare and frequently difficult to use due to privacy and copyright restrictions. A freely available annotated corpus is presented, gratis and libre, of high quality video recordings of face-to-face conversational speech. Annotations include orthography, POS tags, and automatically generated phonemes transcriptions and word boundaries. In addition, labeling of both simple conversational function and gaze direction has been a performed. Within the bounds of the law, everything has been done to remove copyright and use restrictions. Annotations have been processed to RDBMS tables that allow SQL queries and direct connections to statistical software. From our experiences we would like to advocate the formulation of best practises for both legal handling and database storage of recordings and annotations.

LILA: Cellular Telephone Speech Databases from Asia
Eric Sanders | Asuncion Moreno | Herbert Tropf | Lynette Melnar | Nurit Dekel | Breanna Gillies | Niklas Paulsson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The goal of the LILA project was the collection of speech databases over cellular telephone networks of five languages in three Asian countries. Three languages were recorded in India: Hindi by first language speakers, Hindi by second language speakers and Indian English. Furthermore, Mandarin was recorded in China and Korean in South-Korea. The databases are part of the SpeechDat-family and follow the SpeechDat rules in many respects. All databases have been finished and have passed the validation tests. Both Hindi databases and the Korean database will be available to the public for sale.

2004

Collection of SLR in the Asian-Pacific Area
Asunción Moreno | Khalid Choukri | Phil Hall | Henk van den Heuvel | Eric Sanders | Francesco Senia | Herbert Tropf
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The goal of this project (LILA) is the collection of a large number of spoken databases for training Automatic Speech Recognition Systems for telephone applications in the Asian Pacific area. Specifications follow those of SpeechDat-like databases. Utterances will be recorded directly from calls made either from fixed or cellular telephones and are composed by read text and answers to specific questions. The project is driven by a consortium composed by a large number of industrial companies. Each company is in charge of the production of two databases. The consortium shares the databases produced in the project. The goal of the project should be reached within the year 2005.

SLR Validation: Current Trends and Developments
Henk van den Heuvel | Dorota Iskra | Eric Sanders | Folkert de Vriend
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper deals with the quality evaluation (validation) of Spoken Language Resources (SLR). The current situation in terms of relevant validation criteria and procedures is briefly presented. Next, a number of validation issues related to new data formats (XML-based annotations, UTF-16 encoding) are discussed. Further, new validation cycles that were introduced in a series of new projects like SpeeCon and OrienTel are addressed: prompt sheet validation, lexicon validation and pre-release validation. Finally, SPEX's current and future

2000

SLR Validation: Present State of Affairs and Prospects
Henk van den Heuvel | Lou Boves | Khalid Choukri | Simo Goddijn | Eric Sanders
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)