Byoung-Tak Zhang


pdf bib
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Ahjeong Seo | Gi-Cheon Kang | Joonhan Park | Byoung-Tak Zhang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understand the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question’s intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN.


pdf bib
Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment
Woo Suk Choi | Kyoung-Woon On | Yu-Jung Heo | Byoung-Tak Zhang
Proceedings of the First Workshop on Advances in Language and Vision Research

Scene graph is a graph representation that explicitly represents high-level semantic knowledge of an image such as objects, attributes of objects and relationships between objects. Various tasks have been proposed for the scene graph, but the problem is that they have a limited vocabulary and biased information due to their own hypothesis. Therefore, results of each task are not generalizable and difficult to be applied to other down-stream tasks. In this paper, we propose Entity Synset Alignment(ESA), which is a method to create a general scene graph by aligning various semantic knowledge efficiently to solve this bias problem. The ESA uses a large-scale lexical database, WordNet and Intersection of Union (IoU) to align the object labels in multiple scene graphs/semantic knowledge. In experiment, the integrated scene graph is applied to the image-caption retrieval task as a down-stream task. We confirm that integrating multiple scene graphs helps to get better representations of images.


pdf bib
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Gi-Cheon Kang | Jaeseo Lim | Byoung-Tak Zhang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Visual dialog (VisDial) is a task which requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the series of questions should be able to capture a temporal context from a dialog history and utilizes visually-grounded information. Visual reference resolution is a problem that addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to find the references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. DAN consists of two kinds of attention modules, REFER and FIND. Specifically, REFER module learns latent relationships between a given question and a dialog history by employing a multi-head attention mechanism. FIND module takes image features and reference-aware representations (i.e., the output of REFER module) as input, and performs visual grounding via bottom-up attention mechanism. We qualitatively and quantitatively evaluate our model on VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.

pdf bib
CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication
Jin-Hwa Kim | Nikita Kitaev | Xinlei Chen | Marcus Rohrbach | Byoung-Tak Zhang | Yuandong Tian | Dhruv Batra | Devi Parikh
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel “crosstalk” evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.


pdf bib
Text Chunking by Combining Hand-Crafted Rules and Memory-Based Learning
Seong-Bae Park | Byoung-Tak Zhang
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics


pdf bib
A Comparative Evaluation of Data-driven Models in Translation Selection of Machine Translation
Yu-Seop Kim | Jeong-Ho Chang | Byoung-Tak Zhang
COLING 2002: The 19th International Conference on Computational Linguistics


pdf bib
Word Sense Disambiguation by Learning from Unlabeled Data
Seong-Bae Park | Byoung-Tak Zhang | Yung Taek Kim
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Reducing Parsing Complexity by Intra-Sentence Segmentation based on Maximum Entropy Model
Sung Dong Kim | Byoung-Tak Zhang | Yung Taek Kim
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora


pdf bib
Compound noun decomposition using a Markov model
Jongwoo Lee | Byoung-Tak Zhang | Yung Taek Kim
Proceedings of Machine Translation Summit VII

A statistical method for compound noun decomposition is presented. Previous studies on this problem showed some statistical information are helpful. But applying statistical information was not so systemic that performance depends heavily on the algorithm and some algorithms usually have many separated steps. In our work statistical information is collected from manually decomposed compound noun corpus to build a Markov model for composition. Two Markov chains representing statistical information are assumed independent: one for the sequence of participants' lengths and another for the sequence of participants ' features. Besides Markov assumptions, least participants preference assumption also is used. These two assumptions enable the decomposition algorithm to be a kind of conditional dynamic programming so that efficient and systemic computation can be performed. When applied to test data of size 5027, we obtained a precision of 98.4%.


pdf bib
Morphological Analysis and Synthesis by Automated Discovery and Acquisition of Linguistic Rules
Byoung-Tak Zhang | Yung-Taek Kim
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics