Letian Peng

2025

pdf bib abs
Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM’s Nest
Letian Peng | Zilong Wang | Feng Yao | Jingbo Shang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM’s pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.

2024

pdf bib abs
Answer is All You Need: Instruction-following Text Embedding via Answering the Question
Letian Peng | Yuwei Zhang | Zilong Wang | Jayanth Srinivasa | Gaowen Liu | Zihan Wang | Jingbo Shang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This work aims to build a text embedder that can capture characteristics of texts specified by user instructions clarifying the similarity criterion. While previous methods improve general task awareness by injecting the instruction information into encoding, they fail to be sensitive to clearer criteria like “evaluate similarity based on emotion”. We instead propose a different viewpoint, which treats the instruction as a “question” about the input text and encodes the expected answers to obtain the representation accordingly. Intuitively, texts with the same (implicit) semantics would share similar answers following the instruction, thus leading to more similar representations. Specifically, we propose InBedder that instantiates this learning-to-answer idea by only fine-tuning language models via abstractive question answering tasks. Despite its simplicity, InBedder demonstrates significantly improved instruction-following capabilities according to our proposed instruction awareness tests and instruction robustness tests, when applied to language models with large language models (LLMs) (e.g., llama-2-7b) and smaller encoder-based LMs (e.g., roberta-large). Additionally, our qualitative analysis of clustering outcomes, achieved by applying diverse instructions to the same unlabeled corpus, demonstrates a high degree of interpretability in the clusters formed.

Recent advances in Automated Theorem Proving have shown the effectiveness of leveraging a (large) language model that generates tactics (i.e. proof steps) to search through proof states. The current model, while trained solely on successful proof paths, faces a discrepancy at the inference stage, as it must sample and try various tactics at each proof state until finding success, unlike its training which does not incorporate learning from failed attempts. Intuitively, a tactic that leads to a failed search path would indicate that similar tactics should receive less attention during the following trials. In this paper, we demonstrate the benefit of training models that additionally learn from failed search paths. Facing the lack of such trial-and-error data in existing open-source theorem-proving datasets, we curate a dataset on intuitionistic propositional logic theorems and formalize it in Lean, such that we can reliably check the correctness of proofs. We compare our model trained on relatively short trial-and-error information (TrialMaster) with models trained only on the correct paths and discover that the former solves more unseen theorems with lower trial searches.

pdf bib abs
Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification
Letian Peng | Yi Gu | Chengyu Dong | Zihan Wang | Jingbo Shang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

For extremely weak-supervised text classification, pioneer research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may end up with very limited or even no samples for the minority classes. Recent works have started to generate the relevant texts by prompting LLMs using the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution (i.e., similar to the corpus where the text classifier will be applied) data, leading to ungeneralizable classifiers. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus, which have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analysis and case studies to comprehend the property of text grafting.

pdf bib abs
Incubating Text Classifiers Following User Instruction with Nothing but LLM
Letian Peng | Zilong Wang | Jingbo Shang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In this paper, we aim to generate text classification data given arbitrary class definitions (i.e., user instruction), so one can train a text classifier without any human annotation or raw corpus. Recent advances in large language models (LLMs) lead to pioneer attempts to individually generate texts for each class via prompting. In this paper, we propose Incubator, the first framework that can handle complicated and even mutually dependent classes (e.g., "TED Talk given by Educator" and "Other"). Specifically, our Incubator is a fine-tuned LLM that takes the instruction of all class definitions as input, and in each inference, it can jointly generate one sample for every class. First, we tune Incubator on the instruction-to-data mappings that we obtained from classification datasets and descriptions on Hugging Face together with in-context augmentation by GPT-4. To emphasize the uniformity and diversity in generations, we refine Incubator by fine-tuning with the cluster centers of semantic textual embeddings of the generated samples. We compare Incubator on various classification tasks with strong baselines such as direct LLM-based inference and training data generation by prompt engineering. Experiments show Incubator is able to (1) outperform previous methods on traditional benchmarks, (2) take label interdependency and user preference into consideration, and (3) enable logical text mining by incubating multiple classifiers

pdf bib abs
Controllable Data Augmentation for Few-Shot Text Mining with Chain-of-Thought Attribute Manipulation
Letian Peng | Yuwei Zhang | Jingbo Shang
Findings of the Association for Computational Linguistics: ACL 2024

Prompting large language models (LLMs) for data augmentation has recently become a common practice in few-shot NLP tasks. In this paper, we propose Chain-of-Thought Attribute Manipulation (CoTAM), a novel approach that generates new data from existing examples by only tweaking in the user-provided, task-specific attribute, e.g., sentiment polarity or topic in movie reviews. Instead of conventional latent representation controlling, we leverage the chain-of-thought prompting to directly edit the text in three steps, (1) attribute decomposition, (2) manipulation proposal, and (3) sentence reconstruction. Extensive results on various tasks, such as text (pair) classification and aspect-based sentiment analysis, verify the superiority of CoTAM over other LLM-based augmentation methods with the same number of training examples for both fine-tuning and in-context learning. Remarkably, the 2D visualization of the augmented dataset using principle component analysis revealed a human-recognizable decision boundary that is likely hinted by the attribute manipulation, demonstrating the potential of our proposed approach.

2023

pdf bib abs
Contextualized Semantic Distance between Highly Overlapped Texts
Letian Peng | Zuchao Li | Hai Zhao
Findings of the Association for Computational Linguistics: ACL 2023

Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation. Better evaluation of the semantic distance between the overlapped sentences benefits the language system’s understanding and guides the generation. Since conventional semantic metrics are based on word representations, they are vulnerable to the disturbance of overlapped components with similar representations. This paper aims to address the issue with a mask-and-predict strategy. We take the words in the longest common sequence (LCS) as neighboring words and use masked language modeling (MLM) from pre-trained language models (PLMs) to predict the distributions in their positions. Our metric, Neighboring Distribution Divergence (NDD), represents the semantic distance by calculating the divergence between distributions in the overlapped parts. Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts. Based on the discovery, we further implement an unsupervised and training-free method for text compression, leading to a significant improvement on the previous perplexity-based method. The high compression rate controlling ability of our method even enables NDD to outperform the supervised state-of-the-art in domain adaption by a huge margin. Further experiments on syntax and semantics analyses verify the awareness of internal sentence structures, indicating the high potential of NDD for further studies.

pdf bib abs
Less than One-shot: Named Entity Recognition via Extremely Weak Supervision
Letian Peng | Zihan Wang | Jingbo Shang
Findings of the Association for Computational Linguistics: EMNLP 2023

We study the named entity recognition (NER) problem under the extremely weak supervision (XWS) setting, where only one example entity per type is given in a context-free way. While one can see that XWS is lighter than one-shot in terms of the amount of supervision, we propose a novel method X-NER that can outperform the state-of-the-art one-shot NER methods. We first mine entity spans that are similar to the example entities from an unlabelled training corpus. Instead of utilizing entity span representations from language models, we find it more effective to compare the context distributions before and after the span is replaced by the entity example. We then leverage the top-ranked spans as pseudo-labels to train an NER tagger. Extensive experiments and analyses on 4 NER datasets show the superior end-to-end NER performance of X-NER, outperforming the state-of-the-art few-shot methods with 1-shot supervision and ChatGPT annotations significantly. Finally, our X-NER possesses several notable properties, such as inheriting the cross-lingual abilities of the underlying language models.

pdf bib abs
Chride at SemEval-2023 Task 10: Fine-tuned Deberta-V3 on Detection of Online Sexism with Hierarchical Loss
Letian Peng | Bosung Kim
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Sexism is one of the most concerning problems in the internet society. By detecting sexist expressions, we can reduce the offense toward females and provide useful information to understand how sexism occurs. Our work focuses on a newly-published dataset, EDOS, which annotates English sexist expressions from Reddit and categorizes their specific types. Our method is to train a DeBERTaV3 classifier with all three kinds of labels provided by the dataset, including sexist, category, and granular vectors. Our classifier predicts the probability distribution on vector labels and further applies it to represent category and sexist distributions. Our classifier uses its label and finer-grained labels for each classification to calculate the hierarchical loss for optimization. Our experiments and analyses show that using a combination of loss with finer-grained labels generally achieves better performance on sexism detection and categorization. Codes for our implementation can be found at https://github.com/KomeijiForce/SemEval2023_Task10.

Co-authors

Yi Gu 1

Venues

Fix author