Mingjie Zhan


2024

pdf bib
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs
Zimu Lu | Aojun Zhou | Houxing Ren | Ke Wang | Weikang Shi | Junting Pan | Mingjie Zhan | Hongsheng Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have exhibited great potential in mathematical reasoning. However, there remains a performance gap in this area between existing open-source models and closed-source models such as GPT-4. In this paper, we introduce MathGenie, a novel method for generating diverse and reliable math problems by leveraging the ground-truth solutions of the seed data. We augment these ground-truth solutions and use a specially finetuned model to translate these augmented solutions back into new questions. Subsequently, we generate code-integrated solutions for these questions. To ensure the correctness of the code-integrated solutions, we employ rationale-based verification for filtering. Then, we finetune various pretrained models, ranging from 7B to 70B, on the newly curated data, resulting in a family of models known as MathGenie. These models consistently outperform previous open-source models across five representative mathematical reasoning datasets, achieving state-of-the-art performance. In particular, MathGenie-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score.

pdf bib
Empowering Character-level Text Infilling by Eliminating Sub-Tokens
Houxing Ren | Mingjie Zhan | Zhongyuan Wu | Hongsheng Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In infilling tasks, sub-tokens, representing instances where a complete token is segmented into two parts, often emerge at the boundaries of prefixes, middles, and suffixes. Traditional methods focused on training models at the token level, leading to sub-optimal performance in character-level infilling tasks during the inference stage. Alternately, some approaches considered character-level infilling, but they relied on predicting sub-tokens in inference, yet this strategy diminished ability in character-level infilling tasks due to the large perplexity of the model on sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling tasks by utilizing a line-level format to avoid predicting any sub-token in inference. In addition, we incorporate two special tokens to signify the rest of the incomplete lines, thereby enhancing generation guidance. Extensive experiments demonstrate that our proposed approach surpasses previous methods, offering a significant advantage. Code is available at https://github.com/SenseLLM/FIM-SE.

2023

pdf bib
VCSUM: A Versatile Chinese Meeting Summarization Dataset
Han Wu | Mingjie Zhan | Haochen Tan | Zhaohui Hou | Ding Liang | Linqi Song
Findings of the Association for Computational Linguistics: ACL 2023

Compared to news and chat summarization, the development of meeting summarization is hugely decelerated by the limited data. To this end, we introduce a versatile Chinese meeting summarization dataset, dubbed VCSum, consisting of 239 real-life meetings, with a total duration of over 230 hours. We claim our dataset is versatile because we provide the annotations of topic segmentation, headlines, segmentation summaries, overall meeting summaries, and salient sentences for each meeting transcript. As such, the dataset can adapt to various summarization tasks or methods, including segmentation-based summarization, multi-granularity summarization and retrieval-then-generate summarization. Our analysis confirms the effectiveness and robustness of VCSum. We also provide a set of benchmark models regarding different downstream summarization tasks on VCSum to facilitate further research.

pdf bib
Reconstruct Before Summarize: An Efficient Two-Step Framework for Condensing and Summarizing Meeting Transcripts
Haochen Tan | Han Wu | Wei Shao | Xinyun Zhang | Mingjie Zhan | Zhaohui Hou | Ding Liang | Linqi Song
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Meetings typically involve multiple participants and lengthy conversations, resulting in redundant and trivial content. To overcome these challenges, we propose a two-step framework, Reconstruct before Summarize (RbS), for effective and efficient meeting summarization. RbS first leverages a self-supervised paradigm to annotate essential contents by reconstructing the meeting transcripts. Secondly, we propose a relative positional bucketing (RPB) algorithm to equip (conventional) summarization models to generate the summary. Despite the additional reconstruction process, our proposed RPB significantly compresses the input, leading to faster processing and reduced memory consumption compared to traditional summarization methods. We validate the effectiveness and efficiency of our method through extensive evaluations and analyses. On two meeting summarization datasets, AMI and ICSI, our approach outperforms previous state-of-the-art approaches without relying on large-scale pre-training or expert-grade annotating tools.

2020

pdf bib
DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding
Zilong Wang | Mingjie Zhan | Xuebo Liu | Ding Liang
Findings of the Association for Computational Linguistics: EMNLP 2020

Form understanding depends on both textual contents and organizational structure. Although modern OCR performs well, it is still challenging to realize general form understanding because forms are commonly used and of various formats. The table detection and handcrafted features in previous works cannot apply to all forms because of their requirements on formats. Therefore, we concentrate on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features. We consider the form structure as a tree-like or graph-like hierarchy of text fragments. The parent-child relation corresponds to the key-value pairs in forms. We utilize the state-of-the-art models and design targeted extraction modules to extract multimodal features from semantic contents, layout information, and visual images. A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation. We adopt an asymmetric algorithm and negative sampling in our model as well. We validate our method on two benchmarks, MedForm and FUNSD, and extensive experiments demonstrate the effectiveness of our method.