Seongho Choi

Also published as: SeongHo Choi


2022

pdf bib
Language-agnostic Semantic Consistent Text-to-Image Generation
SeongJun Jung | Woo Suk Choi | Seongho Choi | Byoung-Tak Zhang
Proceedings of the Workshop on Multilingual Multimodal Learning

Recent GAN-based text-to-image generation models have advanced that they can generate photo-realistic images matching semantically with descriptions. However, research on multi-lingual text-to-image generation has not been carried out yet much. There are two problems when constructing a multilingual text-to-image generation model: 1) language imbalance issue in text-to-image paired datasets and 2) generating images that have the same meaning but are semantically inconsistent with each other in texts expressed in different languages. To this end, we propose a Language-agnostic Semantic Consistent Generative Adversarial Network (LaSC-GAN) for text-to-image generation, which can generate semantically consistent images via language-agnostic text encoder and Siamese mechanism. Experiments on relatively low-resource language text-image datasets show that the model has comparable generation quality as images generated by high-resource language text, and generates semantically consistent images for texts with the same meaning even in different languages.

pdf bib
Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Minjoon Jung | SeongHo Choi | JooChan Kim | Jin-Hwa Kim | Byoung-Tak Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Video corpus moment retrieval (VCMR) is the task to retrieve the most relevant video moment from a large video corpus using a natural language query. For narrative videos, e.g., drama or movies, the holistic understanding of temporal dynamics and multimodal reasoning are crucial. Previous works have shown promising results; however, they relied on the expensive query annotations for the VCMR, i.e., the corresponding moment intervals. To overcome this problem, we propose a self-supervised learning framework: Modal-specific Pseudo Query Generation Network (MPGN).First, MPGN selects candidate temporal moments via subtitle-based moment sampling. Then, it generates pseudo queries exploiting both visualand textual information from the selected temporal moments. Through the multimodal information in the pseudo queries, we show that MPGN successfully learns to localize the video corpus moment without any explicit annotation. We validate the effectiveness of MPGN on TVR dataset, showing the competitive results compared with both supervised models and unsupervised setting models.