Lei Ji


Hashing based Efficient Inference for Image-Text Matching
Rong-Cheng Tu | Lei Ji | Huaishao Luo | Botian Shi | Heyan Huang | Nan Duan | Xian-Ling Mao
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

GEM: A General Evaluation Benchmark for Multimodal Tasks
Lin Su | Nan Duan | Edward Cui | Lei Ji | Chenfei Wu | Huaishao Luo | Yongfei Liu | Ming Zhong | Taroon Bharti | Arun Sacheti
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Hierarchical Context-aware Network for Dense Video Event Captioning
Lei Ji | Xianglin Guo | Haoyang Huang | Xilin Chen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Dense video event captioning aims to generate a sequence of descriptive captions, one for each event in a long untrimmed video. Video-level context provides important information and helps the model generate consistent and less redundant captions across events. In this paper, we introduce a novel Hierarchical Context-aware Network for dense video event captioning (HCN) to capture context from various aspects. In detail, the model leverages local and global context with different mechanisms to jointly learn to generate coherent captions. The local context module performs full interaction between neighboring frames, and the global context module selectively attends to previous or future events. According to our extensive experiments on both the YouCook2 and ActivityNet Captioning datasets, the video-level HCN model outperforms the event-level context-agnostic model by a large margin. The code is available at https://github.com/KirkGuo/HCN.

Control Image Captioning Spatially and Temporally
Kun Yan | Lei Ji | Huaishao Luo | Ming Zhou | Nan Duan | Shuai Ma
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Generating image captions with user intention is an emerging need. The recently published Localized Narratives dataset takes mouse traces as another input to the image captioning task, which is an intuitive and efficient way for a user to control what to describe in the image. However, how to effectively employ traces to improve generation quality and controllability is still under exploration. This paper aims to solve this problem by proposing a novel model called LoopCAG, which connects Contrastive constraints and Attention Guidance in a Loop manner, imposing explicit spatial and temporal constraints on the generation process. Specifically, each generated sentence is temporally aligned to the corresponding trace sequence through a contrastive learning strategy. In addition, each generated text token is supervised to attend to the correct visual objects under heuristic spatial attention guidance. Comprehensive experimental results demonstrate that our LoopCAG model learns better correspondence among the three modalities (vision, language, and traces) and achieves state-of-the-art performance on the trace-controlled image captioning task. Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.


GRACE: Gradient Harmonized and Cascaded Labeling for Aspect-based Sentiment Analysis
Huaishao Luo | Lei Ji | Tianrui Li | Daxin Jiang | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2020

In this paper, we focus on the imbalance issue, which is rarely studied in aspect term extraction and aspect sentiment classification when they are treated as sequence labeling tasks. In addition, previous works usually ignore the interaction between aspect terms when labeling polarities. We propose a GRadient hArmonized and CascadEd labeling model (GRACE) to solve these problems. Specifically, a cascaded labeling module is developed to enhance the interchange between aspect terms and improve the attention to sentiment tokens when labeling sentiment polarities. The polarity sequence is designed to depend on the generated aspect term labels. To alleviate the imbalance issue, we extend the gradient harmonized mechanism used in object detection to aspect-based sentiment analysis by adjusting the weight of each label dynamically. The proposed GRACE adopts a post-pretraining BERT as its backbone. Experimental results demonstrate that the proposed model achieves consistent improvements on multiple benchmark datasets and produces state-of-the-art results.

A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Frank F. Xu | Lei Ji | Botian Shi | Junyi Du | Graham Neubig | Yonatan Bisk | Nan Duan
Proceedings of the First International Workshop on Natural Language Processing Beyond Text

Watching instructional videos is a common way to learn about procedures. Video captioning is one way of automatically collecting such knowledge. However, it provides only an indirect, overall evaluation of multimodal models, with no finer-grained quantitative measure of what they have learned. We instead propose a benchmark of structured procedural knowledge extracted from cooking videos. This work is complementary to existing tasks, but requires models to produce interpretable structured knowledge in the form of verb-argument tuples. Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations. Our analysis shows that the proposed task is challenging and that standard modeling approaches such as unsupervised segmentation, semantic role labeling, and visual action detection perform poorly when forced to predict every action of a procedure in a structured form.


Dense Procedure Captioning in Narrated Instructional Videos
Botian Shi | Lei Ji | Yaobo Liang | Nan Duan | Peng Chen | Zhendong Niu | Ming Zhou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by dense video captioning, we propose a model to generate procedure captions from narrated instructional videos, which are sequences of step-wise clips with descriptions. Previous works on dense video captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained, complementary semantic textual information. In this paper, we introduce a framework to (1) extract procedures with a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of the transcript within each extracted procedure. Experiments show that our model achieves state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.