Jason Kuen
2023
A Critical Analysis of Document Out-of-Distribution Detection
Jiuxiang Gu | Yifei Ming | Yi Zhou | Jason Kuen | Vlad Morariu | Handong Zhao | Ruiyi Zhang | Nikolaos Barmpalios | Anqi Liu | Yixuan Li | Tong Sun | Ani Nenkova
Findings of the Association for Computational Linguistics: EMNLP 2023
Large-scale pre-training is widely used in recent document understanding tasks. During deployment, one may expect that models should trigger a conservative fallback policy when encountering out-of-distribution (OOD) samples, which highlights the importance of OOD detection. However, most existing OOD detection methods focus on single-modal inputs such as images or texts. While documents are multi-modal in nature, it is underexplored if and how multi-modal information in documents can be exploited for OOD detection. In this work, we first provide a systematic and in-depth analysis on OOD detection for document understanding models. We study the effects of model modality, pre-training, and fine-tuning across various types of OOD inputs. In particular, we find that spatial information is critical for document OOD detection. To better exploit spatial information, we propose a spatial-aware adapter, which serves as a parameter-efficient add-on module to adapt transformer-based language models to the document domain. Extensive experiments show that adding the spatial-aware adapter significantly improves the OOD detection performance compared to directly using the language model and achieves superior performance compared to competitive baselines.
2022
Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns
Zihan Wang | Jiuxiang Gu | Jason Kuen | Handong Zhao | Vlad Morariu | Ruiyi Zhang | Ani Nenkova | Tong Sun | Jingbo Shang
Findings of the Association for Computational Linguistics: ACL 2022
We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning only approach yields a slightly worse but still competitive model. Then we compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns—during fine-tuning—different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the adaptive axis attention method identifies important tokens—for each task and model layer—and focuses attention on those. It does not require pre-training to accommodate the sparse patterns and demonstrates competitive and sometimes better performance against fixed sparse attention patterns that require resource-intensive pre-training.