A larger-scale evaluation resource of terms and their shift direction for diachronic lexical semantics
Astrid van Aggelen
Jacco van Ossenbruggen
Proceedings of the 22nd Nordic Conference on Computational Linguistics
Determining how words have changed their meaning is an important topic in Natural Language Processing. However, evaluations of methods to characterise such change have been limited to small, handcrafted resources. We introduce an English evaluation set which is larger, more varied, and more realistic than those seen to date, with terms derived from a historical thesaurus. Moreover, the dataset is unique in that it represents change as a shift from the term of interest to a WordNet synset. Using the synset lemmas, we can use this set to evaluate (standard) methods that detect change between word pairs, as well as (adapted) methods that detect the change between a term and a sense overall. We show that performance on the new dataset is much lower than previously reported, setting a new standard.
A Corpus of Images and Text in Online News
Martin van Harmelen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In recent years, several datasets have been released that include images and text, giving impetus to new methods that combine natural language processing and computer vision. However, there is a need for datasets of images in their natural textual context. The ION corpus contains 300K news articles published between August 2014 and August 2015 in five online newspapers from two countries. The one-year coverage across multiple publishers ensures a broad scope in terms of topics, image quality and editorial viewpoints. The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher’s website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image. Neither the article text nor the images themselves are included in the corpus. Instead, the images are distributed as high-dimensional feature vectors extracted from a Convolutional Neural Network, anticipating their use in computer vision tasks. The article text is represented as a list of automatically generated entity and topic annotations in the form of Wikipedia/DBpedia pages. This facilitates the selection of subsets of the corpus for separate analysis or evaluation.
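The per-article fields listed above, and the selection of subsets by annotation, can be illustrated with a minimal sketch. The key names (`url`, `datePublished`, `headline`, `imageUrl`, `caption`, `annotations`), the example URLs and the two helper functions are hypothetical; the corpus's actual JSON-LD vocabulary may differ.

```python
import json

# Hypothetical records: key names and values are illustrative assumptions,
# not the actual JSON-LD vocabulary or content of the ION corpus.
SAMPLE = """
[
  {"url": "https://example.org/news/1", "datePublished": "2014-09-01",
   "headline": "Example headline A",
   "imageUrl": "https://example.org/img/1.jpg",
   "caption": "Example caption",
   "annotations": ["http://dbpedia.org/resource/Amsterdam"]},
  {"url": "https://example.org/news/2", "datePublished": "2015-02-10",
   "headline": "Example headline B",
   "imageUrl": null,
   "caption": null,
   "annotations": ["http://dbpedia.org/resource/London"]}
]
"""


def select_by_annotation(articles, resource):
    """Return the articles annotated with the given DBpedia resource URI."""
    return [a for a in articles if resource in a.get("annotations", [])]


def with_image(articles):
    """Return the articles that have an image URL (the image is optional)."""
    return [a for a in articles if a.get("imageUrl")]


articles = json.loads(SAMPLE)
amsterdam = select_by_annotation(
    articles, "http://dbpedia.org/resource/Amsterdam"
)
print(len(amsterdam), len(with_image(articles)))
```

Filtering on the annotation list rather than on article text mirrors the corpus design, in which the text itself is not distributed.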
A Common Multimedia Annotation Framework for Cross Linking Cultural Heritage Digital Collections
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In the context of the CATCH research program, which is currently being carried out at a number of large Dutch cultural heritage institutions, our ambition is to combine and exchange heterogeneous multimedia annotations between projects and institutions. As a first step, we designed an Annotation Meta Model (AMM): a simple but powerful RDF/OWL model mainly addressing the anchoring of annotations to segments of the many different media types used in the collections of the archives, museums and libraries involved. The model includes support for the annotation of annotations themselves, and of segments of annotation values, so that annotations can be layered and projects can process each other's annotation data as the primary data for further annotation. On the basis of the AMM, we designed an application programming interface for accessing annotation repositories and implemented it both as a software library and as a web service. Finally, we report on our experiences with applying the model, API and repository when developing web applications for collection managers in cultural heritage institutions.
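The layering idea described above, in which an annotation is anchored either to a media segment or to another annotation, can be sketched with plain data classes. This is not the AMM's RDF/OWL vocabulary: the class names, fields and the `root_media` helper are hypothetical and only illustrate the anchoring structure.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Segment:
    """A (hypothetical) anchor: a media object, optionally a time range in it."""
    media_uri: str
    start: Optional[float] = None
    end: Optional[float] = None


@dataclass(frozen=True)
class Annotation:
    """An annotation whose target is a media segment or another annotation."""
    ident: str
    target: Union[Segment, "Annotation"]
    value: str


def root_media(annotation: Annotation) -> str:
    """Follow the chain of layered annotations down to the underlying media."""
    target = annotation.target
    while isinstance(target, Annotation):
        target = target.target
    return target.media_uri


# One annotation on a video segment, and a second one layered on top of it.
base = Annotation("a1", Segment("http://example.org/video/1", 10.0, 20.0), "windmill")
layered = Annotation("a2", base, "label disputed by curator")
print(root_media(layered))
```

Because the target of an annotation may itself be an annotation, one project's output can serve as the primary data for another project's annotations, which is the exchange scenario the abstract describes.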