Yibo Zhang


2024

pdf bib
Can Public Large Language Models Help Private Cross-device Federated Learning?
Boxin Wang | Yibo Zhang | Yuan Cao | Bo Li | Hugh McMahan | Sewoong Oh | Zheng Xu | Manzil Zaheer
Findings of the Association for Computational Linguistics: NAACL 2024

We study (differentially) private federated learning (FL) of language models. The language models in cross-device FL are relatively small, which can be trained with meaningful formal user-level differential privacy (DP) guarantees when massive parallelism in training is enabled by the participation of a moderate size of users. Recently, public data has been used to improve privacy-utility trade-offs for both large and small language models. In this work, we provide a systematic study of using large-scale public data and LLMs to help differentially private training of on-device FL models, and further improve the privacy-utility tradeoff by techniques of distillation. Moreover, we propose a novel distribution matching algorithm with theoretical grounding to sample public data close to private data distribution, which significantly improves the sample efficiency of (pre-)training on public data. The proposed method is efficient and effective for training private models by taking advantage of public data, especially for customized on-device architectures that do not have ready-touse pre-trained models.

2023

pdf bib
TextObfuscator: Making Pre-trained Language Model a Privacy Protector via Obfuscating Word Representations
Xin Zhou | Yi Lu | Ruotian Ma | Tao Gui | Yuran Wang | Yong Ding | Yibo Zhang | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2023

In real-world applications, pre-trained language models are typically deployed on the cloud, allowing clients to upload data and perform compute-intensive inference remotely. To avoid sharing sensitive data directly with service providers, clients can upload numerical representations rather than plain text to the cloud. However, recent text reconstruction techniques have demonstrated that it is possible to transform representations into original words, suggesting that privacy risk remains. In this paper, we propose TextObfuscator, a novel framework for protecting inference privacy by applying random perturbations to clustered representations. The random perturbations make the representations indistinguishable from surrounding clustered representations, thus obscuring word information while retaining the original word functionality. To achieve this, we utilize prototypes to learn clustered representation, where tokens of similar functionality are encouraged to be closer to the same prototype during training. Additionally, we design different methods to find prototypes for token-level and sentence-level tasks, which can improve performance by incorporating semantic and task information. Experimental results on token and sentence classification tasks show that TextObfuscator achieves improvement over compared methods without increasing inference cost.

2000

pdf bib
Query Translation in Chinese-English Cross-Language Information Retrieval
Yibo Zhang | Le Sun | Lin Du | Yufang Sun
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora