Josiah Wang

2022

pdf bib abs
MultiSubs: A Large-scale Multimodal and Multilingual Dataset
Josiah Wang | Josiel Figueiredo | Lucia Specia
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are possible for a text fragment and a sentence; (iii) the sentences are free-form and real-world like; (iv) the parallel texts are multilingual. We also set up a fill-in-the-blank game for humans to evaluate the quality of the automatic image selection process of our dataset. Finally, we propose a fill-in-the-blank task to demonstrate the utility of the dataset, and present some baseline prediction models. The dataset will benefit research on visual grounding of words especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.

2019

pdf bib abs
Transformer-based Cascaded Multimodal Speech Translation
Zixiu Wu | Ozan Caglayan | Julia Ive | Josiah Wang | Lucia Specia
Proceedings of the 16th International Conference on Spoken Language Translation

This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the experiments, the MMT model varies in terms of the way of integrating the visual context (simple conditioning vs. attention), the type of visual features exploited (pooled, convolutional, action categories) and the underlying architecture. For the latter, we explore both the canonical transformer and its deliberation version with additive and cascade variants which differ in how they integrate the textual attention. Upon conducting extensive experiments, we found that (i) the explored visual integration schemes often harm the translation performance for the transformer and additive deliberation, but considerably improve the cascade deliberation; (ii) the transformer and cascade deliberation integrate the visual modality better than the additive deliberation, as shown by the incongruence analysis.

pdf bib abs
VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions
Pranava Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The metric is also able to take into account the relative importance of objects mentioned in human reference descriptions during evaluation. Even if these human reference descriptions are not available, VIFIDEL can still reliably evaluate system descriptions. The metric achieves high correlation with human judgments on two well-known datasets and is competitive with metrics that depend on and rely exclusively on human references.

2018

pdf bib abs
Object Counts! Bringing Explicit Detections Back into Image Captioning
Josiah Wang | Pranava Swaroop Madhyastha | Lucia Specia
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.

pdf bib abs
Defoiling Foiled Image Captions
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect subtle perturbations in captions. In such contexts, encoding sufficiently descriptive image information becomes a key challenge. In this paper, we demonstrate that it is possible to solve this task using simple, interpretable yet powerful representations based on explicit object information over multilayer perceptron models. Our models achieve state-of-the-art performance on a recently published dataset, with scores exceeding those achieved by humans on the task. We also measure the upper-bound performance of our models using gold standard annotations. Our study and analysis reveals that the simpler model performs well even without image information, suggesting that the dataset contains strong linguistic bias.

pdf bib abs
End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn ‘distributional similarity’ in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the ‘image’ side of image captioning, and vary the input image representation but keep the RNN text generation model of a CNN-RNN constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) experience virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our experiments all point to one fact: that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.

2017

pdf bib
Sheffield MultiMT: Using Object Posterior Predictions for Multimodal Machine Translation
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the Second Conference on Machine Translation

2016

pdf bib abs
Cross-validating Image Description Datasets and Evaluation Metrics
Josiah Wang | Robert Gaizauskas
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems. However, not much work has been done to analyse and understand both datasets and the metrics. In this paper, we propose using a leave-one-out cross validation (LOOCV) process as a means to analyse multiply annotated, human-authored image description datasets and the various evaluation metrics, i.e. evaluating one image description against other human-authored descriptions of the same image. Such an evaluation process affords various insights into the image description datasets and evaluation metrics, such as the variations of image descriptions within and across datasets and also what the metrics capture. We compute and analyse (i) human upper-bound performance; (ii) ranked correlation between metric pairs across datasets; (iii) lower-bound performance by comparing a set of descriptions describing one image to another sentence not describing that image. Interesting observations are made about the evaluation metrics and image description datasets, and we conclude that such cross-validation methods are extremely useful for assessing and gaining insights into image description datasets and evaluation metrics for image descriptions.

pdf bib
SHEF-Multimodal: Grounding Machine Translation on Images
Kashif Shah | Josiah Wang | Lucia Specia
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Don’t Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation
Josiah Wang | Robert Gaizauskas
Proceedings of the 9th International Natural Language Generation conference

Josiah Wang

2022

2019

2018

2017

2016

2015

2014

Co-authors

Venues