A Corpus of Images and Text in Online News

Laura Hollink; Adriatik Bedjeti; Martin van Harmelen; Desmond Elliott

Abstract

In recent years, several datasets that include both images and text have been released, spurring new methods that combine natural language processing and computer vision. However, there is still a need for datasets of images in their natural textual context. The ION corpus contains 300K news articles published between August 2014 and August 2015 in five online newspapers from two countries. The one-year coverage across multiple publishers ensures a broad scope in terms of topics, image quality and editorial viewpoints. The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher’s website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image. Neither the article text nor the images themselves are included in the corpus. Instead, the images are distributed as high-dimensional feature vectors extracted from a Convolutional Neural Network, anticipating their use in computer vision tasks. The article text is represented as a list of automatically generated entity and topic annotations in the form of Wikipedia/DBpedia pages. These annotations make it easier to select subsets of the corpus for separate analysis or evaluation.
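
As a rough illustration of the record structure and subset selection described in the abstract, the following Python sketch parses two made-up article records and filters them by annotation. Every field name used here (`url`, `datePublished`, `headline`, `image`, `caption`, `annotations`) and both sample records are assumptions for illustration only; the vocabulary of the actual JSON-LD files may differ and should be checked against the corpus itself.

```python
import json

# Hypothetical records: the keys and values below are invented for this sketch
# and are NOT taken from the ION corpus or its real JSON-LD vocabulary.
SAMPLE = """
[
  {"url": "https://example.org/news/1",
   "datePublished": "2014-09-03",
   "headline": "Example headline about a football match",
   "image": {"url": "https://example.org/img/1.jpg",
             "caption": "Players celebrate."},
   "annotations": ["http://dbpedia.org/resource/Association_football"]},
  {"url": "https://example.org/news/2",
   "datePublished": "2015-01-15",
   "headline": "Example headline about an election",
   "image": null,
   "annotations": ["http://dbpedia.org/resource/Election"]}
]
"""


def select_by_annotation(articles, resource):
    """Return the articles whose annotation list contains the given DBpedia URI."""
    return [a for a in articles if resource in a.get("annotations", [])]


def with_image(articles):
    """Return the articles that have an image entry (the image is optional)."""
    return [a for a in articles if a.get("image")]


articles = json.loads(SAMPLE)
football = select_by_annotation(
    articles, "http://dbpedia.org/resource/Association_football"
)
illustrated = with_image(articles)

print([a["headline"] for a in football])
print(len(illustrated))
```

Filtering on annotation URIs rather than on headline keywords mirrors the abstract's point that the entity and topic annotations are what make subset selection practical, since the article text itself is not distributed.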

Anthology ID: L16-1219
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1377–1382
URL: https://aclanthology.org/L16-1219/
Cite (ACL): Laura Hollink, Adriatik Bedjeti, Martin van Harmelen, and Desmond Elliott. 2016. A Corpus of Images and Text in Online News. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1377–1382, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): A Corpus of Images and Text in Online News (Hollink et al., LREC 2016)
PDF: https://aclanthology.org/L16-1219.pdf
