2024
pdf
bib
abs
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
|
Yang Ye
|
Bin Zhu
|
Jiaxi Cui
|
Munan Ning
|
Peng Jin
|
Li Yuan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Additionally, our Video-LLaVA also achieves superior performance on a broad range of 9 image benchmarks. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into multi-modal inputs for the LLM.
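The core idea, aligning image and video features in one shared visual space so that a single projection maps both modalities into the LLM's embedding space, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration; the module names (`AlignBeforeProject`, `image_encoder`, `video_encoder`) are ours, and the actual Video-LLaVA training recipe and pre-aligned encoders are not reproduced here.

```python
import torch
import torch.nn as nn

class AlignBeforeProject(nn.Module):
    """Sketch of 'alignment before projection': image and video encoders are
    assumed to emit features in one shared visual space, so a single shared
    projection maps both modalities into the LLM embedding space."""

    def __init__(self, image_encoder: nn.Module, video_encoder: nn.Module,
                 visual_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Encoders are assumed to be pre-aligned, so their outputs are comparable.
        self.image_encoder = image_encoder
        self.video_encoder = video_encoder
        # One projection shared by images and videos.
        self.projection = nn.Sequential(
            nn.Linear(visual_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, images=None, videos=None):
        tokens = []
        if images is not None:
            tokens.append(self.image_encoder(images))   # (B, N_img, visual_dim)
        if videos is not None:
            tokens.append(self.video_encoder(videos))   # (B, N_vid, visual_dim)
        unified = torch.cat(tokens, dim=1)               # unified visual tokens
        return self.projection(unified)                  # ready to prepend to text embeddings
```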
pdf
bib
abs
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao
|
Haoran Tang
|
Jinfa Huang
|
Peng Jin
|
Can Zhang
|
Ruyang Liu
|
Long Chen
|
Xiaodan Liang
|
Li Yuan
|
Ge Li
Findings of the Association for Computational Linguistics: ACL 2024
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. In addition, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that our RAP achieves superior or comparable performance to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.
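A minimal sketch of the low-rank modulation idea, assuming frozen per-frame CLIP features and a rank-r bottleneck with a per-frame saliency gate; the class name `LowRankFrameModulator` and the gating formulation are our illustrative assumptions, not the released RAP code.

```python
import torch
import torch.nn as nn

class LowRankFrameModulator(nn.Module):
    """Hypothetical low-rank adapter: refines frozen per-frame CLIP features
    through a rank-r bottleneck and a per-frame gate, so salient frames are
    emphasized while temporally redundant ones are down-weighted."""

    def __init__(self, dim: int = 512, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # low-rank down-projection
        self.up = nn.Linear(rank, dim, bias=False)     # low-rank up-projection
        self.gate = nn.Linear(dim, 1)                  # scalar saliency gate per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) from a frozen CLIP backbone
        delta = self.up(self.down(frame_feats))           # cheap learnable refinement
        saliency = torch.sigmoid(self.gate(frame_feats))  # (batch, num_frames, 1)
        return frame_feats + saliency * delta             # modulated frame features
```

Only the adapter parameters are trained; the backbone stays frozen, which is what keeps the fine-tuning parameter-efficient.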
pdf
bib
abs
Towards Multi-Relational Multi-Hop Reasoning over Dense Temporal Knowledge Graphs
Jian Liu
|
Zihe Liu
|
Xueqiang Lyu
|
Peng Jin
|
Jinan Xu
Findings of the Association for Computational Linguistics: ACL 2024
Temporal knowledge graph reasoning has emerged as a crucial task for answering time-dependent questions within a knowledge graph (KG). Despite tremendous progress, present research is impeded by the sparsity of temporal KGs and an over-reliance on simple single-relational reasoning patterns. To overcome these challenges, we introduce MulQuestions, a new temporal KG reasoning benchmark featuring over 200k entities and 960k questions designed to facilitate complex, multi-relational and multi-hop reasoning. Additionally, we propose a new model adept at conducting pattern-aware and time-sensitive reasoning across temporal KGs. The model’s efficacy is confirmed through rigorous evaluations, showcasing its effectiveness in sparse data conditions and adeptness at handling questions with long reasoning chains. We have made our benchmark and model publicly accessible at [https://anonymous].
pdf
bib
abs
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Zhongwei Wan
|
Ziang Wu
|
Che Liu
|
Jinfa Huang
|
Zhihong Zhu
|
Peng Jin
|
Longyue Wang
|
Li Yuan
Findings of the Association for Computational Linguistics: EMNLP 2024
Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference, as the growth of their multimodal Key-Value (KV) cache with increasing input length challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships, together with related textual contexts. The predominance of image tokens means traditional optimizations for LLMs’ KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce **LOOK-M**, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill the model allocates more attention to text than to image features, and based on this multimodal interaction observation we propose a text-prior method to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies based on merging KV pairs. **LOOK-M** demonstrates that with a significant reduction in KV cache memory usage, such as an **80%** reduction in some cases, it not only achieves approximately **1.3x** faster decoding but also maintains or even **enhances** performance across a variety of long-context multimodal tasks.
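A toy sketch of a text-prior eviction step with compensatory merging, under our own simplifying assumptions (per-token accumulated attention scores are available, text KV pairs are always kept, evicted image KV pairs are mean-merged into their nearest kept neighbor); this is not the authors' exact LOOK-M algorithm.

```python
import torch

def compress_multimodal_kv(keys, values, is_text, attn_scores, keep_ratio=0.2):
    """Toy text-prior KV compression: keep all text-token KV pairs, keep only
    the top-scoring image-token KV pairs, and merge each evicted image pair
    into its most similar kept pair to limit information loss.

    keys, values: (seq_len, dim); is_text: (seq_len,) bool;
    attn_scores: (seq_len,) accumulated attention received by each token.
    """
    keep_mask = is_text.clone()
    img_idx = torch.nonzero(~is_text, as_tuple=False).squeeze(-1)
    if img_idx.numel() > 0:
        n_keep = max(1, int(keep_ratio * img_idx.numel()))
        top_img = img_idx[attn_scores[img_idx].topk(n_keep).indices]
        keep_mask[top_img] = True                       # retain most-attended image tokens

    kept = torch.nonzero(keep_mask, as_tuple=False).squeeze(-1)
    evicted = torch.nonzero(~keep_mask, as_tuple=False).squeeze(-1)
    new_k, new_v = keys[kept].clone(), values[kept].clone()

    if evicted.numel() > 0:
        # Compensatory merge: fold each evicted KV pair into its nearest kept pair.
        sim = keys[evicted] @ new_k.T                   # (n_evicted, n_kept)
        nearest = sim.argmax(dim=-1)
        for i, tgt in enumerate(nearest.tolist()):
            new_k[tgt] = 0.5 * (new_k[tgt] + keys[evicted[i]])
            new_v[tgt] = 0.5 * (new_v[tgt] + values[evicted[i]])
    return new_k, new_v
```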
2020
pdf
bib
abs
CN-HIT-IT.NLP at SemEval-2020 Task 4: Enhanced Language Representation with Multiple Knowledge Triples
Yice Zhang
|
Jiaxuan Lin
|
Yang Fan
|
Peng Jin
|
Yuanchao Liu
|
Bingquan Liu
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper describes our system for SemEval-2020 Task 4: Commonsense Validation and Explanation. For this task, it is clear that external knowledge, such as a knowledge graph, can help a model understand commonsense in natural language statements. However, how to select the right triples for a given statement remains unsolved, so reducing the interference of irrelevant triples on model performance is a central concern. This paper adopts a modified K-BERT as the language encoder to enhance language representation with triples from knowledge graphs. Experiments show that our method outperforms models without external knowledge and is slightly better than the original K-BERT. We obtained an accuracy of 0.97 in Subtask A, ranking 1st of 45, and an accuracy of 0.948, ranking 2nd of 35.
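For intuition, a minimal sketch of K-BERT-style triple injection: knowledge triples are spliced into the statement right after their anchor entity, and each injected position remembers its anchor so a visible matrix can later restrict its attention to that anchor. The function name, the toy KG, and the anchor bookkeeping are illustrative assumptions; the actual system's triple selection and soft-position/visible-matrix details are not shown.

```python
from typing import Dict, List, Tuple

def inject_triples(tokens: List[str],
                   kg: Dict[str, List[Tuple[str, str]]],
                   max_per_entity: int = 2):
    """Splice up to `max_per_entity` (relation, tail) pairs after each token
    that matches a KG entity; record the anchor index of every position so a
    visible matrix can keep injected tokens local to their entity."""
    out, anchors = [], []
    for tok in tokens:
        out.append(tok)
        anchor_pos = len(out) - 1
        anchors.append(anchor_pos)
        for rel, tail in kg.get(tok, [])[:max_per_entity]:
            out.extend([rel, tail])
            anchors.extend([anchor_pos, anchor_pos])   # injected tokens point to their entity
    return out, anchors

# Example: the injected triples attach only to the matched entity "penguins".
kg = {"penguins": [("can", "swim"), ("cannot", "fly")]}
print(inject_triples("penguins can fly".split(), kg))
```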
2014
pdf
bib
Multi-view Chinese Treebanking
Likun Qiu
|
Yue Zhang
|
Peng Jin
|
Houfeng Wang
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
2012
pdf
bib
abs
CLTC: A Chinese-English Cross-lingual Topic Corpus
Yunqing Xia
|
Guoyu Tang
|
Peng Jin
|
Xia Yang
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Cross-lingual topic detection within text is a feasible solution to overcoming the language barrier in information access. This paper presents a Chinese-English cross-lingual topic corpus (CLTC), in which 90,000 Chinese articles and 90,000 English articles are organized into 150 topics. Compared with TDT corpora, CLTC has three advantages. First, CLTC is larger, which makes it possible to evaluate large-scale cross-lingual text clustering methods. Second, articles are evenly distributed across topics, so the corpus can be used to produce test datasets for different purposes. Third, CLTC can be used as a cross-lingual comparable corpus to develop methods for cross-lingual information access. A preliminary evaluation with the CLTC corpus indicates that it is effective for evaluating cross-lingual topic detection methods.
pdf
bib
SemEval-2012 Task 4: Evaluating Chinese Word Similarity
Peng Jin
|
Yunfang Wu
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
2010
pdf
bib
SemEval-2010 Task 18: Disambiguating Sentiment Ambiguous Adjectives
Yunfang Wu
|
Peng Jin
Proceedings of the 5th International Workshop on Semantic Evaluation
pdf
bib
SemEval-2 Task 15: Infrequent Sense Identification for Mandarin Text to Speech Systems
Peng Jin
|
Yunfang Wu
Proceedings of the 5th International Workshop on Semantic Evaluation
pdf
bib
The Chinese Persons Name Diambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News
Ying Chen
|
Peng Jin
|
Wenjie Li
|
Chu-Ren Huang
CIPS-SIGHAN Joint Conference on Chinese Language Processing
pdf
bib
LSTC System for Chinese Word Sense Induction
Peng Jin
|
Yihao Zhang
|
Rui Sun
CIPS-SIGHAN Joint Conference on Chinese Language Processing
2009
pdf
bib
Estimating and Exploiting the Entropy of Sense Distributions
Peng Jin
|
Diana McCarthy
|
Rob Koeling
|
John Carroll
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
2007
pdf
bib
SemEval-2007 Task 05: Multilingual Chinese-English Lexical Sample
Peng Jin
|
Yunfang Wu
|
Shiwen Yu
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
pdf
bib
PKU: Combining Supervised Classifiers with Features Selection
Peng Jin
|
Danqing Zhu
|
Fuxin Li
|
Yunfang Wu
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
pdf
bib
Building Chinese Sense Annotated Corpus with the Help of Software Tools
Yunfang Wu
|
Peng Jin
|
Tao Guo
|
Shiwen Yu
Proceedings of the Linguistic Annotation Workshop