Yang Ye


2024

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin | Yang Ye | Bin Zhu | Jiaxi Cui | Munan Ning | Peng Jin | Li Yuan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Additionally, our Video-LLaVA also achieves superior performance on a broad range of 9 image benchmarks. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM.

2022

A Hierarchical N-Gram Framework for Zero-Shot Link Prediction
Mingchen Li | Junfan Chen | Samuel Mensah | Nikolaos Aletras | Xiulong Yang | Yang Ye
Findings of the Association for Computational Linguistics: EMNLP 2022

Knowledge graphs typically contain a large number of entities but often cover only a fraction of all relations between them (i.e., incompleteness). Zero-shot link prediction (ZSLP) is a popular way to tackle the problem by automatically identifying unobserved relations between entities. Most recent approaches use textual features of relations (e.g., surface name or textual descriptions) as auxiliary information to improve the encoded representation. These methods lack robustness as they are bound to support only tokens from a fixed vocabulary and are unable to model out-of-vocabulary (OOV) words. Subword units such as character n-grams have the capability of generating more expressive representations for OOV words. Hence, in this paper, we propose a Hierarchical N-gram framework for Zero-Shot Link Prediction (HNZSLP) that leverages character n-gram information for ZSLP. Our approach works by first constructing a hierarchical n-gram graph from the surface name of relations. Subsequently, a new Transformer-based network models the hierarchical n-gram graph to learn a relation embedding for ZSLP. Experimental results show that our proposed HNZSLP method achieves state-of-the-art performance on two standard ZSLP datasets.

2007

Aspect marker generation in English-to-Chinese machine translation
Yang Ye | Karl-Michael Schneider | Steven Abney
Proceedings of Machine Translation Summit XI: Papers

Sentence Level Machine Translation Evaluation as a Ranking
Yang Ye | Ming Zhou | Chin-Yew Lin
Proceedings of the Second Workshop on Statistical Machine Translation

2006

Latent Features in Automatic Tense Translation between Chinese and English
Yang Ye | Victoria Li Fossum | Steven Abney
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

How and Where do People Fail with Time: Temporal Reference Mapping Annotation by Chinese and English Bilinguals
Yang Ye | Steven Abney
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006

2005

Tense Tagging for Verbs in Cross-Lingual Context: A Case Study
Yang Ye | Zhu Zhang
Second International Joint Conference on Natural Language Processing: Full Papers