Multi30K: Multilingual English-German Image Descriptions

We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent advances in image description have been demonstrated on English-language datasets almost exclusively, but image description should not be limited to English. This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) descriptions crowdsourced independently of the original English descriptions. We outline how the data can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks.


Introduction
Image description is one of the core challenges at the intersection of Natural Language Processing (NLP) and Computer Vision (CV) (Bernardi et al., 2016). This task has only received attention in a monolingual English setting, helped by the availability of English datasets, e.g. Flickr8K (Hodosh et al., 2013), Flickr30K (Young et al., 2014), and MS COCO (Chen et al., 2015). However, the possible applications of image description, such as searching for images using natural language or providing alternative-description text for visually impaired Web users, are useful in all languages.
We introduce a large-scale dataset of images paired with sentences in English and German as an initial step towards studying the value and the characteristics of multilingual-multimodal data. Multi30K is an extension of the Flickr30K dataset (Young et al., 2014) with 31,014 German translations of English descriptions and 155,070 independently collected German descriptions. The translations were collected from professionally contracted translators, whereas the descriptions were collected from untrained crowdworkers. The key difference between these corpora is the relationship between the sentences in different languages. In the translated corpus, we know there is a strong correspondence between the sentences in both languages. In the descriptions corpus, we only know that the sentences, regardless of the language, are supposed to describe the same image.
A dataset of images paired with sentences in multiple languages broadens the scope of multimodal NLP research. Image description with multilingual data can also be seen as machine translation in a multimodal context. This opens up new avenues for researchers in machine translation (Koehn et al., 2003; Chiang, 2005; Sutskever et al., 2014; Bahdanau et al., 2015, inter alia) to work with multilingual multimodal data. Image-sentence ranking using monolingual multimodal datasets (Hodosh et al., 2013, inter alia) is also a natural task for multilingual modelling.
The only existing datasets of images paired with multilingual sentences were created by professionally translating English into the target language: IAPR-TC12 with 20,000 English-German described images (Grubinger et al., 2006), and the Pascal Sentences Dataset of 1,000 Japanese-English described images (Funaki and Nakayama, 2015). The Multi30K dataset is larger than both of these and contains both independent and translated sentences. We hope this dataset will be of broad interest across NLP and CV research and anticipate that these communities will put the data to use in a broader range of tasks than we can foresee.
(a) Translations
1. Brick layers constructing a wall.
1. Trendy girl talking on her cellphone while gliding slowly down the street 2. Ein schickes Mädchen spricht mit dem Handy während sie langsam die Straße entlangschwebt.
(b) Independent descriptions
1. The two men on the scaffolding are helping to build a red brick wall.
1. There is a young girl on her cellphone while skating.
Figure 1: Multilingual examples in the Multi30K dataset. The independent sentences are all accurate descriptions of the image but do not contain the same details in both languages, such as shirt colour or the scaffolding. In the second translation pair (bottom left) the translator has translated "glide" as "schweben" ("to float"), probably due to not seeing the image context (see Section 2.1 for more details).

The Multi30K Dataset
The Flickr30K Dataset contains 31,014 images sourced from online photo-sharing websites (Young et al., 2014). Each image is paired with five English descriptions, which were collected from Amazon Mechanical Turk. The dataset contains 145,000 training, 5,070 development, and 5,000 test descriptions. The Multi30K dataset extends the Flickr30K dataset with translated and independent German sentences.

Translations
The translations were collected from professional English-German translators contracted via an established Language Service in Germany. Figure 1 presents an example of the differences between the types of data. We collected one translated description per image, resulting in a total of 31,014 translations. To ensure an even distribution over description length, the English descriptions were chosen based on their relative length, with an equal number of longest, shortest, and median length source descriptions. We paid a total of €23,000 to collect the data (€0.06 per word).
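The length-balanced selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: it assumes one description is picked per image and that images are cycled through the shortest, median, and longest choice in turn, with length measured in whitespace-separated words. The function name `select_for_translation` and the input layout are hypothetical.

```python
from typing import Dict, List


def select_for_translation(descriptions: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick one description per image, cycling shortest / median / longest.

    Assumption: `descriptions` maps an image id to its English descriptions,
    and length is the number of whitespace-separated words.
    """
    selected = {}
    for i, (image_id, sentences) in enumerate(sorted(descriptions.items())):
        # Rank this image's descriptions from shortest to longest.
        ranked = sorted(sentences, key=lambda s: len(s.split()))
        # Image 0 -> shortest, image 1 -> median, image 2 -> longest, repeat.
        position = {0: 0, 1: len(ranked) // 2, 2: len(ranked) - 1}[i % 3]
        selected[image_id] = ranked[position]
    return selected
```

Cycling over images yields roughly equal numbers of shortest, median, and longest source sentences; other balancing schemes would satisfy the same description.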
Translators were shown an English-language sentence and asked to produce a correct and fluent translation for it in German, without seeing the image. We decided against showing the images to translators to make this as close as possible to a standard translation task, also making the data collected here distinct from the independent descriptions collected as described in Section 2.2.

Independent Descriptions
The descriptions were collected from crowdworkers via the Crowdflower platform. We collected five descriptions per image in the Flickr30K dataset, resulting in a total of 155,070 sentences. Workers were presented with a translated version of the data collection interface used by Hodosh et al. (2013), as shown in Figure 2. We translated the interface to make the task as similar as possible to the crowdsourcing of the English sentences. The instructions were translated by one of the authors and checked by a native German Ph.D. student. In total, 185 crowdworkers took part in the task over a period of 31 days. We split the task into 1,000 randomly selected images per day to control the quality of the data and to prevent worker fatigue. Workers were required to have a German-language skill certification and be at least a Crowdflower Level 2 Worker, i.e. to have participated in at least 10 different Crowdflower jobs, passed at least 100 quality-control questions, and have a job acceptance rate of at least 85%.
The descriptions were collected in batches of five images per job. Each image was randomly selected from the complete set of 1,000 images for that day, and workers were limited to writing at [...]
During the collection of the data, we assessed the quality both by manually checking a subset of the descriptions and with automated checks. We inspected the submissions of users who wrote sentences with fewer than five words, and users with high type-to-token ratios (to detect repetition). We also used a character-level 6-gram language model (LM) to flag descriptions with high perplexity, which was very effective at catching nonsense sentences. In general we did not have to ban or reject many users and overall description quality was high.
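The automated checks above can be approximated with a few lines of code. The sketch below is illustrative only: the tokenisation (whitespace splitting), the add-one smoothing in the character model, and all names (`is_short`, `type_token_ratio`, `CharNgramLM`) are assumptions, and the text does not say which LM toolkit or thresholds were actually used.

```python
import math
from collections import Counter
from typing import Iterable


def is_short(sentence: str, min_words: int = 5) -> bool:
    """True if the sentence has fewer than `min_words` whitespace tokens."""
    return len(sentence.split()) < min_words


def type_token_ratio(sentences: Iterable[str]) -> float:
    """Distinct word types divided by total tokens over a worker's sentences."""
    tokens = [t.lower() for s in sentences for t in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


class CharNgramLM:
    """Character n-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, n: int = 6):
        self.n = n
        self.ngrams = Counter()    # counts of context + next character
        self.contexts = Counter()  # counts of the (n-1)-character context
        self.vocab = set()

    def _pad(self, text: str) -> str:
        # Start padding gives the first characters a full-length context.
        return "\x02" * (self.n - 1) + text + "\x03"

    def fit(self, texts: Iterable[str]) -> "CharNgramLM":
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.contexts[context] += 1
                self.ngrams[context + padded[i]] += 1
        return self

    def perplexity(self, text: str) -> float:
        padded = self._pad(text)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        log_prob = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            numerator = self.ngrams[context + padded[i]] + 1
            log_prob += math.log(numerator / (self.contexts[context] + vocab_size))
        n_predictions = len(padded) - (self.n - 1)
        return math.exp(-log_prob / n_predictions)
```

In use, the model would be fit on text assumed to be clean German, and descriptions whose perplexity exceeds a hand-chosen threshold would be queued for manual inspection rather than rejected automatically.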

Translated vs. Independent Descriptions
We now analyse the differences between the translated and the description corpora. For this analysis, all sentences were stripped of punctuation and truecased using the Moses truecaser.pl script trained over the Europarl v7 and News Commentary v11 English-German parallel corpora.
Table 1 shows the differences between the corpora. The German translations are longer than the independent descriptions (11.1 vs. 9.6 words), while the English descriptions selected for translation are slightly shorter, on average, than the Flickr30K average (11.9 vs. 12.3). When we compare the German translation dataset against an equal number of sentences from the German descriptions dataset, we find that the translations also have more word types (19.3K vs. 17.6K), and more singleton types, i.e. types occurring only once (11.3K vs. 10.2K; in both datasets singletons comprise 58% of the vocabulary). The translations thus have a wider vocabulary, despite being generated by a smaller number of authors. The English datasets (all descriptions vs. those selected for translation) show a similar trend, indicating that these differences may be a result of the decision to select equal numbers of short, medium, and long English sentences for translation.
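Statistics of the kind reported above (average length, word types, singletons) can be computed with a short script. The following is a sketch under simplifying assumptions: it strips ASCII punctuation and splits on whitespace, and it does not reproduce the Moses truecasing step, so its counts would not exactly match Table 1. The function name `corpus_stats` is hypothetical.

```python
import string
from collections import Counter
from typing import Dict, List

# Translation table that deletes ASCII punctuation (an assumed approximation
# of the punctuation stripping described in the text).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def corpus_stats(sentences: List[str]) -> Dict[str, float]:
    """Token, type, singleton, and length statistics for a list of sentences."""
    counts = Counter()
    n_tokens = 0
    n_chars = 0
    for sentence in sentences:
        tokens = sentence.translate(_PUNCT_TABLE).split()
        counts.update(tokens)
        n_tokens += len(tokens)
        n_chars += sum(len(t) for t in tokens)
    return {
        "sentences": len(sentences),
        "tokens": n_tokens,
        "chars": n_chars,
        "types": len(counts),
        # Singletons are word types that occur exactly once in the corpus.
        "singletons": sum(1 for c in counts.values() if c == 1),
        "avg_words": n_tokens / len(sentences) if sentences else 0.0,
    }
```

To mirror the comparison in the text, the description corpus would first be subsampled to the same number of sentences as the translation corpus, since type and singleton counts grow with corpus size.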

English vs. German
The English image descriptions are generally longer than the German descriptions, both in terms of number of words and characters. Note that the difference is much smaller when measuring characters: German uses 22% fewer words but only 2.5% fewer characters. However, we observe a different pattern in the translation corpora: German uses 6.6% fewer words than English but 17.1% more characters. The vocabularies of the German description and translation corpora are more than twice as large as those of the English corpora. Additionally, the German corpora have two to three times as many singletons. This is likely due to richer morphological variation in German, as well as word compounding.

Discussion
The Multi30K dataset is immediately suitable for research on a wide range of tasks, including but not limited to automatic image description, image-sentence ranking, multimodal and multilingual semantics, and machine translation.

Multi30K for Image Description
Deep neural networks for image description typically integrate visual features into a recurrent neural network language model (Vinyals et al., 2015; Xu et al., 2015, inter alia). Elliott et al. (2015) demonstrated how to build multilingual image description models that learn and transfer features between monolingual image description models. They performed a series of experiments on the IAPR-TC12 dataset (Grubinger et al., 2006) of images aligned with German translations, showing that both English and German image description could be improved by transferring features from a multimodal neural language model trained to generate descriptions in the other language. The Multi30K dataset will enable further research in this direction, allowing researchers to work with larger datasets with multiple references per image.

Multi30K for Machine Translation
Machine translation is typically performed using only textual data, for example news data, the Europarl corpora, or corpora harvested from the Web (CommonCrawl, Wikipedia, etc.). The Multi30K dataset makes it possible to further develop machine translation in a setting where multimodal data, such as images or video, are observed alongside text. The potential advantages of using multimodal information for machine translation include the ability to better deal with ambiguous source text and to avoid (untranslated) out-of-vocabulary words in the target language (Calixto et al., 2012).
Hitschler and Riezler (2016) have demonstrated the potential of multimodal features in a target-side translation reranking model. Their approach is initially trained over large text-only translation corpora and then fine-tuned with a small amount of in-domain data, such as our dataset. We expect a variety of translation models can be adapted to take advantage of multimodal data as features in a log-linear model or as feature vectors in neural machine translation models.
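To make the log-linear option concrete, the sketch below reranks an n-best list by a weighted sum of feature values, where one feature is a visual score. It is a generic illustration and not the model of Hitschler and Riezler (2016): the feature names `text_model` and `visual`, the `Hypothesis` class, and the toy numbers in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    """One candidate translation with its named feature values."""
    text: str
    features: Dict[str, float]


def loglinear_score(hypothesis: Hypothesis, weights: Dict[str, float]) -> float:
    """Weighted sum of features; features without a weight contribute zero."""
    return sum(weights.get(name, 0.0) * value
               for name, value in hypothesis.features.items())


def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> List[Hypothesis]:
    """Return the n-best list sorted from highest to lowest log-linear score."""
    return sorted(nbest, key=lambda h: loglinear_score(h, weights), reverse=True)
```

For example, given two candidates with made-up features `{"text_model": -1.0, "visual": 0.2}` and `{"text_model": -1.2, "visual": 0.9}`, setting the `visual` weight to zero prefers the first, while equal weights prefer the second. In practice the weights would be tuned on held-out in-domain data.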

Conclusions
We introduced Multi30K: a large-scale multilingual multimodal dataset for interdisciplinary machine learning research. Our dataset is an extension of the popular Flickr30K dataset with descriptions and professional translations in German. The descriptions were collected from a crowdsourcing platform, while the translations were collected from professionally contracted translators. These differences are deliberate and part of the larger scope of studying multilingual multimodal data in different contexts. The descriptions were collected as similarly as possible to the original Flickr30K dataset by translating the instructions used by Young et al. (2014) into German. The translations were collected without showing the images to the translators to keep the task as close to a standard translation task as possible.
There are substantial differences between the translated and the description datasets. The translations contain approximately the same number of tokens and have sentences of approximately the same length in both languages. These properties make them suited to machine translation models. The description datasets are very different in terms of average sentence lengths and the number of word types per language. This is likely to cause different engineering and scientific challenges because the descriptions are independently collected corpora instead of a sentence-level aligned corpus.
In the future, we want to study multilingual multimodality over a wider range of languages, for example beyond the Indo-European family. We call on the community to engage with us on creating massively multilingual multimodal datasets.

Figure 2: The German instructions shown to crowdworkers were translated from the original instructions.

Table 1: Corpus-level statistics about the translation and the description data.