Jie Cai


2024

pdf bib
DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models
Ranchi Zhao | Zhen Leng Thai | Yifan Zhang | Shengding Hu | Jie Zhou | Yunqi Ba | Jie Cai | Zhiyuan Liu | Maosong Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for DecorateLM using a large language model and distill data engineering expertise into a compact 1.2 billion parameter small language model (SLM). We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM. Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.

2023

pdf bib
XDailyDialog: A Multilingual Parallel Dialogue Corpus
Zeming Liu | Ping Nie | Jie Cai | Haifeng Wang | Zheng-Yu Niu | Peng Zhang | Mrinmaya Sachan | Kaiping Peng
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

High-quality datasets are significant to the development of dialogue models. However, most existing datasets for open-domain dialogue modeling are limited to a single language. The absence of multilingual open-domain dialog datasets not only limits the research on multilingual or cross-lingual transfer learning, but also hinders the development of robust open-domain dialog systems that can be deployed in other parts of the world. In this paper, we provide a multilingual parallel open-domain dialog dataset, XDailyDialog, to enable researchers to explore the challenging task of multilingual and cross-lingual open-domain dialog. XDailyDialog includes 13K dialogues aligned across 4 languages (52K dialogues and 410K utterances in total). We then propose a dialog generation model, kNN-Chat, which has a novel kNN-search mechanism to support unified response retrieval for monolingual, multilingual, and cross-lingual dialogue. Experiment results show the effectiveness of this framework. We will make XDailyDialog and kNN-Chat publicly available soon.

2019

pdf bib
A Label Informative Wide & Deep Classifier for Patents and Papers
Muyao Niu | Jie Cai
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this paper, we provide a simple and effective baseline for classifying both patents and papers to the well-established Cooperative Patent Classification (CPC). We propose a label-informative classifier based on the Wide & Deep structure, where the Wide part encodes string-level similarities between texts and labels, and the Deep part captures semantic-level similarities via non-linear transformations. Our model trains on millions of patents, and transfers to papers by developing distant-supervised training set and domain-specific features. Extensive experiments show that our model achieves comparable performance to the state-of-the-art model used in industry on both patents and papers. The output of this work should facilitate the searching, granting and filing of innovative ideas for patent examiners, attorneys and researchers.

2012

pdf bib
A Multigraph Model for Coreference Resolution
Sebastian Martschat | Jie Cai | Samuel Broscheit | Éva Mújdricza-Maydt | Michael Strube
Joint Conference on EMNLP and CoNLL - Shared Task

2011

pdf bib
Unrestricted Coreference Resolution via Global Hypergraph Partitioning
Jie Cai | Éva Mújdricza-Maydt | Michael Strube
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

2010

pdf bib
Evaluation Metrics For End-to-End Coreference Resolution Systems
Jie Cai | Michael Strube
Proceedings of the SIGDIAL 2010 Conference

pdf bib
End-to-End Coreference Resolution via Hypergraph Partitioning
Jie Cai | Michael Strube
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)