Peng Tang

2024

Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements like key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained with solely DocKD-generated data is not only comparable to those trained with human-annotated data on in-domain tasks but also significantly excel them on out-of-domain tasks.

Encoder-decoder transformer models have achieved great success on various vision-language (VL) and language tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including shared generation head and adaptation modules, to keep accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence of the current layer at each individual decoding step. Considering different number of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures the features from different decoding steps are semantically aligned. We evaluate our approach with three state-of-the-art encoder-decoder transformer models on various VL and language tasks. We show our approach can reduce overall inference latency by 20%-74% with comparable or even higher accuracy compared to baselines.

pdf bib abs
LoRAExit: Empowering Dynamic Modulation of LLMs in Resource-limited Settings using Low-rank Adapters
Jiacheng Liu | Peng Tang | Xiaofeng Hou | Chao Li | Pheng-Ann Heng
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have exhibited remarkable performance across various natural language processing tasks. However, deploying LLMs on resource-limited settings remains a challenge. While early-exit techniques offer an effective approach, they often require compromised training methods that result in sub-optimal performance. On the other hand, multi-model methods achieve improved results but suffer from significant inference latency and memory consumption. In this paper, we propose LoRAExit, a novel dynamic inference architecture that leverages low-rank adaptors for efficient deployment of LLMs. LoRAExit decouples the training of multiple exit interfaces, enabling the separate optimization of each exit, thereby fundamentally addressing the performance issues of early-exit networks. Moreover, we introduce a superior-exit guided distillation method that effectively utilizes models of different sizes, thereby further enhancing the performance of early exits. Experimental results demonstrate that LoRAExit significantly improves LLM performance when deployed on resource-limited settings.

pdf bib abs
Multiple-Question Multiple-Answer Text-VQA
Peng Tang | Srikar Appalaraju | R. Manmatha | Yusheng Xie | Vijay Mahadevan
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. However, in industry applications, users may come up with multiple questions about a single image. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers. MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines. Specifically, on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%), DocVQA (+1.1%) absolute improvements over the previous state-of-the-art approaches.

Co-authors

Sungnyun Kim 1

Tian Li 1

Chao Li 1

Venues

Fix data