Aykut Erdem


2022

Multi3Generation: Multitask, Multilingual, Multimodal Language Generation
Anabela Barreiro | José GC de Souza | Albert Gatt | Mehul Bhatt | Elena Lloret | Aykut Erdem | Dimitra Gkatzia | Helena Moniz | Irene Russo | Fabio Kepler | Iacer Calixto | Marcin Paprzycki | François Portet | Isabelle Augenstein | Mirela Alhasani
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action – Multi3Generation (CA18231), an interdisciplinary network of research groups working on different aspects of language generation. This “meta-paper” will serve as a reference for citations of the Action in future publications. It presents the objectives, challenges and links to the achieved outcomes.

CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions
Tayfun Ates | M. Ateşoğlu | Çağatay Yiğit | Ilker Kesen | Mert Kobas | Erkut Erdem | Aykut Erdem | Tilbe Goksun | Deniz Yuret
Findings of the Association for Computational Linguistics: ACL 2022

Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step in this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include previously studied descriptive and counterfactual questions. Additionally, inspired by the Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark.

2021

Cross-lingual Visual Pre-training for Multimodal Machine Translation
Ozan Caglayan | Menekse Kuyu | Mustafa Sercan Amac | Pranava Madhyastha | Erkut Erdem | Aykut Erdem | Lucia Specia
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Pre-trained language models have been shown to substantially improve performance on many natural language tasks. Although the early focus of such models was single-language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.

2019

Procedural Reasoning Networks for Understanding Multimodal Procedures
Mustafa Sercan Amac | Semih Yagcioglu | Aykut Erdem | Erkut Erdem
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.

2018

RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
Semih Yagcioglu | Aykut Erdem | Erkut Erdem | Nazli Ikizler-Cinbis
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned sets of images. With over 36K automatically generated question-answer pairs, we design a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events and making sense of procedural knowledge. Our preliminary results indicate that RecipeQA will serve as a challenging test bed and an ideal benchmark for evaluating machine comprehension systems. The data and leaderboard are available at http://hucvl.github.io/recipeqa.

2017

Re-evaluating Automatic Metrics for Image Captioning
Mert Kilickaya | Aykut Erdem | Nazli Ikizler-Cinbis | Erkut Erdem
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore the utilization of the recently proposed Word Mover’s Distance (WMD) document metric for the purpose of image captioning. Our findings outline the differences and/or similarities between metrics and their relative robustness by means of extensive correlation, accuracy and distraction based evaluations. Our results also demonstrate that WMD provides strong advantages over other metrics.

2016

Leveraging Captions in the Wild to Improve Object Detection
Mert Kilickaya | Nazli Ikizler-Cinbis | Erkut Erdem | Aykut Erdem
Proceedings of the 5th Workshop on Vision and Language

2015

A Distributed Representation Based Query Expansion Approach for Image Captioning
Semih Yagcioglu | Erkut Erdem | Aykut Erdem | Ruket Cakici
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)