Zhongyi Ye

2025

Large language models (LLMs) rely on massive amounts of training data; however, the quantity of empirically observed data is limited. To alleviate this issue, many LLMs leverage synthetic data to increase the quantity of training data. Despite significant advancements in LLMs, the efficiency and scalability of data synthesis during pre-training remain insufficiently explored. In this work, we propose a novel data synthesis framework, Cognitive Combination Synthesis (CCS), designed to achieve highly efficient and scalable data synthesis. Specifically, our methodology mimics human cognitive behaviors by recombining and interconnecting heterogeneous data from diverse sources, thereby enhancing advanced reasoning capabilities in LLMs. Extensive experiments demonstrate that: (1) effective data organization is essential, and our mapping-based combination learning approach significantly improves data utilization efficiency; (2) by enhancing data diversity, accuracy, and complexity, our synthetic data scales beyond 100B tokens, revealing CCS’s strong scalability. Our findings highlight the impact of data organization methods on LLM learning efficiency and the significant potential of scalable synthetic data to enhance model reasoning capabilities.

2023

This paper describes the submissions of the research group USTC-NELSLIP to the 2023 IWSLT Offline Speech Translation competition, which involves translating spoken English into written Chinese. We utilize both cascaded models and end-to-end models for this task. To improve the performance of the cascaded models, we introduce Whisper to reduce errors in the intermediate source language text, achieving a significant improvement in ASR recognition performance. For end-to-end models, we propose Stacked Acoustic-and-Textual Encoding extension (SATE-ex), which feeds the output of the acoustic decoder into the textual decoder for information fusion and to prevent error propagation. Additionally, we improve the performance of the end-to-end system in translating speech by combining the SATE-ex model with the encoder-decoder model through ensembling.

2022

This paper describes USTC-NELSLIP’s submissions to the IWSLT 2022 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese, and English to Japanese. We describe both cascaded architectures and end-to-end models, which can directly translate source speech into target text. In the cascaded condition, we investigate the effectiveness of different model architectures with robust training and achieve an improvement of 2.72 BLEU over last year’s best system on the MuST-C English-German test set. In the end-to-end condition, we build models based on Transformer and Conformer architectures, achieving an improvement of 2.26 BLEU over last year’s best end-to-end system. The end-to-end system has obtained promising results, but it still lags behind our cascaded models.

Co-authors

Dan Liu 1

Venues

IWSLT 2
EMNLP 1
