Shenglan Li

2025

EmByte: Decomposition and Compression Learning for Small yet Private NLP
Shenglan Li | Jia Xu | Mengjiao Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent breakthroughs in natural language processing (NLP) have come with escalating model sizes and computational costs, posing significant challenges for deployment in real-time and resource-constrained environments. We introduce EMBYTE, a novel byte-level tokenization model that achieves substantial embedding compression while preserving NLP accuracy and enhancing privacy. At the core of EMBYTE is a new Decompose-and-Compress (DeComp) learning strategy that decomposes subwords into fine-grained byte embeddings and then compresses them via neural projection. DeComp enables EMBYTE to be shrunk down to any vocabulary size (e.g., 128 or 256), drastically reducing embedding parameter count by up to 94% compared to subword-based models without increasing sequence length or degrading performance. Moreover, EMBYTE is resilient to privacy threats such as gradient inversion attacks, due to its byte-level many-to-one mapping structure. Empirical results on GLUE, machine translation, sentiment analysis, and language modeling tasks show that EMBYTE matches or surpasses the performance of significantly larger models, while offering improved efficiency. This makes EMBYTE a lightweight and generalizable NLP solution, well-suited for deployment in privacy-sensitive or low-resource environments.

Large Language Models (LLMs) have demonstrated impressive capabilities in text generation but raise concerns regarding potential copyright infringement. While prior research has explored mitigation strategies like content filtering and alignment, the impact of adversarial persuasion techniques in eliciting copyrighted content remains underexplored. This paper investigates how structured persuasion strategies, including logical appeals, emotional framing, and compliance techniques, can be used to manipulate LLM outputs and potentially increase copyright risks. We introduce a structured persuasion workflow, incorporating query mutation, intention-preserving filtering, and few-shot prompting, to systematically analyze the influence of persuasive prompts on LLM responses. Through experiments on state-of-the-art LLMs, including GPT-4o-mini and Claude-3-haiku, we quantify the effectiveness of different persuasion techniques and assess their implications for AI safety. Our results highlight the vulnerabilities of LLMs to adversarial persuasion and provide empirical evidence of the increased risk of generating copyrighted content under such influence. We conclude with recommendations for strengthening model safeguards and future directions for enhancing LLM robustness against manipulation. Code is available at https://github.com/Rongite/Persuasion.

2024

Do LLMs Know to Respect Copyright Notice?
Jialiang Xu | Shenglan Li | Zhaozhuo Xu | Denghui Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Prior studies show that LLMs sometimes generate content that violates copyright. In this paper, we study another important yet underexplored problem: will LLMs respect copyright information in user input and behave accordingly? The research problem is critical, as a negative answer would imply that LLMs will become the primary facilitator and accelerator of copyright infringement behavior. We conducted a series of experiments using a diverse set of language models, user prompts, and copyrighted materials, including books, news articles, API documentation, and movie scripts. Our study offers a conservative evaluation of the extent to which language models may infringe upon copyrights when processing user input containing protected material. This research emphasizes the need for further investigation and the importance of ensuring LLMs respect copyright regulations when handling user input to prevent unauthorized use or reproduction of protected content. We also release a benchmark dataset serving as a test bed for evaluating infringement behaviors by LLMs and stress the need for future alignment.

Co-authors

Jikai Long 1

Jia Xu 1

Mengjiao Zhang 1

Venues

Findings 2
EMNLP 1
