Jing Xie
2024
Enhancing Incremental Summarization with Structured Representations
EunJeong Hwang | Yichao Zhou | James Bradley Wendt | Beliz Gunel | Nguyen Vo | Jing Xie | Sandeep Tata
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) often struggle to process extensive input contexts, which can lead to redundant, inaccurate, or incoherent summaries. Recent methods have used unstructured memory to process these contexts incrementally, but they still suffer from information overload due to the volume of unstructured data they handle. In our study, we introduce structured knowledge representations (GU_json), which improve summarization performance by 40% and 14% on two public datasets. Most notably, we propose the Chain-of-Key strategy (CoK_json), which dynamically updates or augments these representations with new information rather than recreating the structured memory for each new source. This method further improves performance by 7% and 4% on the same datasets.
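As a rough illustration of the update-or-augment idea behind CoK_json (this is a minimal sketch, not the paper's actual prompts or memory schema; `call_llm` and the JSON layout are hypothetical stand-ins), an incremental loop over sources might look like:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; substitute your provider's API here."""
    raise NotImplementedError

def chain_of_key_update(memory: dict, new_source: str) -> dict:
    """Update or augment a structured JSON memory with one new source,
    instead of rebuilding the whole memory from scratch each time."""
    prompt = (
        "You maintain a JSON memory of key facts.\n"
        f"Current memory:\n{json.dumps(memory, indent=2)}\n\n"
        f"New source:\n{new_source}\n\n"
        "Update existing keys whose values changed and add keys for new "
        "information. Return only the full updated JSON object."
    )
    return json.loads(call_llm(prompt))

def incremental_summarize(sources: list[str]) -> str:
    memory: dict = {}
    for src in sources:  # process a long context one chunk at a time
        memory = chain_of_key_update(memory, src)
    # The final summary is generated from the compact structured memory
    # rather than from the full concatenated context.
    return call_llm(f"Write a coherent summary of:\n{json.dumps(memory)}")
```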
2023
Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models
Yichao Zhou | James Bradley Wendt | Navneet Potti | Jing Xie | Sandeep Tata
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
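The selection step lends itself to a small illustration. Below is a minimal uncertainty-sampling sketch in Python; it is a generic stand-in, not the paper's custom active learning strategy, and the function names and the closest-to-0.5 heuristic are assumptions:

```python
import numpy as np

def select_for_yes_no_labeling(confidences: np.ndarray, budget: int) -> np.ndarray:
    """Pick the candidate extractions whose predicted probability of being
    correct is closest to 0.5, i.e. where a binary yes/no label from a
    human is most informative. `confidences` holds the model's probability
    that each candidate extraction is correct."""
    uncertainty = -np.abs(confidences - 0.5)  # 0.5 -> most uncertain
    return np.argsort(uncertainty)[-budget:]  # indices of top-`budget` items

# Example: with a labeling budget of 3, the most ambiguous candidates win.
scores = np.array([0.99, 0.52, 0.10, 0.48, 0.95, 0.60])
print(select_for_yes_no_labeling(scores, budget=3))  # e.g. [5 1 3]
```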
Technical Report on Ancient Chinese Machine Translation Based on mRASP Model
Wenjing Liu | Jing Xie
Proceedings of ALT2023: Ancient Language Translation Workshop
Abstract: Objective: This paper aims to improve machine translation of ancient Chinese classics, in order to better support research on ancient texts and the spread of Chinese culture. Methods: Starting from the multilingual machine translation pre-training model mRASP, we fine-tuned the model on two downstream tasks, translating classical Chinese into modern Chinese (a2m) and into English (a2e), using the ancient-to-modern Chinese parallel corpora and the ancient-to-English parallel corpus of Pre-Qin+ZiZhiTongJian, and evaluated the fine-tuned models with the BLEU metric. Results: The BLEU-4 scores on the three downstream tasks, 24_histories_a2m, Pre-Qin+ZiZhiTongJian_a2m, and Pre-Qin+ZiZhiTongJian_a2e, were 17.38, 13.69, and 12.90, respectively.
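For context on the reported metric, here is a minimal sketch of a corpus-level BLEU-4 computation using the sacrebleu library; the file paths are placeholders, and this is not the paper's actual evaluation script:

```python
import sacrebleu

# Placeholder paths; one hypothesis/reference per line, line-aligned.
with open("a2m.hyp", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("a2m.ref", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects a list of reference streams (one per reference set).
# For a modern-Chinese target (the a2m task) one would typically also pass
# tokenize="zh" so n-grams are computed over Chinese characters.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU-4: {bleu.score:.2f}")  # sacrebleu's default BLEU uses 4-grams
```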
Co-authors
- Yichao Zhou 2
- James Bradley Wendt 2
- Sandeep Tata 2
- Navneet Potti 1
- EunJeong Hwang 1