Yihong Chen

Also published as: YiHong Chen

2025

English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same can not be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that documents machine-translated from a high-quality English corpus, can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web corpus, into nine languages. resulting in a 1.7-trillion-token corpus, which we call TransWebEdu and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this corpus. Across Non-English understanding and reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding fewer than 5% of TransWebLLM’s training tokens as domain-specific data for continued pretraining yields state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh for understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus and models under Open Source Initiative-approved licenses.

2024

pdf bib abs

Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that necessitate a combination of task planning and the usage of external tools, such as weather and calculator APIs. However, real-world industrial systems present prevalent challenges in task planning and tool usage: numerous APIs in the real system make it intricate to invoke the appropriate one, while the inherent limitations of LLMs pose challenges in orchestrating an accurate sub-task sequence and API-calling order. This paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents in industry. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs among the extensive API set; (2) the Demo Selector retrieves task-level demonstrations, which is further used for in-context learning to aid LLMs in accurately decomposing subtasks and effectively invoking hard-to-distinguish APIs; (3) LLM Finetuner tunes a base LLM to enhance its capability for task planning and API calling. We validate our methods using a real-world industry system and an open-sourced academic dataset, demonstrating the efficacy of each individual component as well as the integrated framework. The code is available at here.

2023

pdf bib abs

Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training
Kelly Marchisio | Patrick Lewis | Yihong Chen | Mikel Artetxe
Findings of the Association for Computational Linguistics: ACL 2023

Prior work shows that it is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings, while keeping the transformer body frozen. Despite learning a small subset of parameters, this approach is not compute-efficient, as training the new embeddings requires a full forward and backward pass over the entire model. We propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model’s parameters. New language-specific embeddings can then be efficiently trained over the mini-model and plugged into the aligned large model for rapid cross-lingual transfer. We explore two approaches to learn mini-models: MINIJOINT, which jointly pretrains the primary model and the mini-model using a single transformer with a secondary MLM head at a middle layer; and MINIPOST, where we start from a regular pretrained model, build a mini-model by extracting and freezing a few layers, and learn a small number of parameters on top. Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches the performance of the standard approach using up to 2.3x less compute on average.

2020

pdf bib abs

Despite the continuing efforts to improve the engagingness and consistency of chit-chat dialogue systems, the majority of current work simply focus on mimicking human-like responses, leaving understudied the aspects of modeling understanding between interlocutors. The research in cognitive science, instead, suggests that understanding is an essential signal for a high-quality chit-chat conversation. Motivated by this, we propose Pˆ2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding. Specifically, Pˆ2 Bot incorporates mutual persona perception to enhance the quality of personalized dialogue generation. Experiments on a large public dataset, Persona-Chat, demonstrate the effectiveness of our approach, with a considerable boost over the state-of-the-art baselines across both automatic metrics and human evaluations.