Yufei Tian - ACL Anthology

Yufei Tian

2026

Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Li-Chun Lu | Miri Liu | Pin Chun Lu | Yufei Tian | Shao-Hua Sun | Nanyun Peng
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We examine, analyze, and compare four representative creativity measures, namely perplexity, LLM-as-a-Judge, the Creativity Index (CI; measuring n-gram overlap with web corpora), and syntactic templates (detecting repetition of common part-of-speech patterns), across diverse creative domains such as creative writing, unconventional problem-solving, and research ideation. For each domain, we compile datasets with human-aligned creative and uncreative examples and evaluate each metric’s ability to discriminate between the two sets. Our analyses reveal limited consistency across both domains and metrics: metrics that distinguish creativity in one domain fail in others (e.g., CI correctly distinguishes in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI suggests one set to be more creative, while perplexity indicates the other set to be more creative). We highlight key limitations, such as perplexity reflecting fluency rather than novelty; LLM-as-a-Judge producing inconsistent judgments under minor prompt variations and exhibiting bias towards particular labels; CI primarily measuring lexical diversity, with high sensitivity to implementation choices; and syntactic templates being ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity. We release the datasets and evaluation code: https://github.com/lichun-19/creative_eval.

2025

SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation
Yufei Tian | Jiao Sun | Nanyun Peng | Zizhao Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With an LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed a dendrogram. Given proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.

REFFLY: Melody-Constrained Lyrics Editing Model
Songyan Zhao | Bingxuan Li | Yufei Tian | Nanyun Peng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Automatic melody-to-lyric (M2L) generation aims to create lyrics that align with a given melody. While most previous approaches generate lyrics from scratch, revision (editing a plain text draft to fit the melody) offers a much more flexible and practical alternative. This enables broad applications, such as generating lyrics from flexible inputs (keywords, themes, or full text that needs refining to be singable), song translation (preserving meaning across languages while keeping the melody intact), or style transfer (adapting lyrics to different genres). This paper introduces REFFLY (REvision Framework For LYrics), the first revision framework for editing and generating melody-aligned lyrics. We train the lyric revision module using our curated synthesized melody-aligned lyrics dataset, enabling it to transform plain text into lyrics that align with a given melody. To further enhance the revision ability, we propose training-free heuristics aimed at preserving both semantic meaning and musical consistency throughout the editing process. Experimental results demonstrate the effectiveness of REFFLY across various tasks (e.g., song translation), showing that our model outperforms strong baselines, including Lyra (CITATION) and GPT-4, by 25% in both musicality and text quality.

2024

Detecting Machine-Generated Long-Form Content with Latent-Space Variables
Yufei Tian | Zeyu Pan | Nanyun Peng
Findings of the Association for Computational Linguistics: EMNLP 2024

The increasing capability of large language models (LLMs) to generate fluent long-form texts is presenting new challenges in distinguishing these outputs from those of humans. Existing zero-shot detectors that primarily focus on token-level distributions are vulnerable to real-world domain shifts, including different decoding strategies, variations in prompts, and attacks. We propose a more robust method that incorporates abstract elements, such as topic or event transitions, as key deciding factors, by training a latent-space model on sequences of events or topics derived from human-written texts. On three different domains, machine generations that are originally inseparable from humans’ on the token level can be better distinguished with our latent-space model, leading to a 31% improvement over strong baselines such as DetectGPT. Our analysis further reveals that, unlike humans, modern LLMs such as GPT-4 select event triggers and transitions differently, an inherent disparity that persists regardless of the generation configurations adopted in real time.

Are Large Language Models Capable of Generating Human-Level Narratives?
Yufei Tian | Tenghao Huang | Miri Liu | Derek Jiang | Alexander Spangher | Muhao Chen | Jonathan May | Nanyun Peng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

As daily reliance on large language models (LLMs) grows, assessing their generation quality is crucial to understanding how they might impact our communications. This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression. We introduce a novel computational framework to analyze narratives through three discourse-level aspects: i) story arcs, ii) turning points, and iii) affective dimensions, including arousal and valence. By leveraging expert and automatic annotations, we uncover significant discrepancies between LLM- and human-written stories. While human-written stories are suspenseful, arousing, and diverse in narrative structures, LLM stories are homogeneously positive and lack tension. Next, we measure narrative reasoning skills as a precursor to generative capacities, concluding that most LLMs fall short of human abilities in discourse understanding. Finally, we show that explicit integration of the aforementioned discourse features can enhance storytelling, as demonstrated by an improvement of over 40% in neural storytelling in terms of diversity, suspense, and arousal. Such advances promise to facilitate greater and more natural roles for LLMs in human communication.

MacGyver: Are Large Language Models Creative Problem Solvers?
Yufei Tian | Abhilasha Ravichander | Lianhui Qin | Ronan Le Bras | Raja Marjieh | Nanyun Peng | Yejin Choi | Thomas Griffiths | Faeze Brahman
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.

2023

Evaluating Large Language Models on Controlled Generation Tasks
Jiao Sun | Yufei Tian | Wangchunshu Zhou | Nan Xu | Qian Hu | Rahul Gupta | John Wieting | Nanyun Peng | Xuezhe Ma
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, and multilingual tasks, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks, including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models. We conclude that *large language models struggle at meeting fine-grained hard constraints*.

Harnessing Black-Box Control to Boost Commonsense in LM’s Generation
Yufei Tian | Felix Zhang | Nanyun Peng
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a commonsense score to a sentence by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.

Unsupervised Melody-to-Lyrics Generation
Yufei Tian | Anjali Narayan-Chen | Shereen Oraby | Alessandra Cervone | Gunnar Sigurdsson | Chenyang Tao | Wenbo Zhao | Yiwen Chen | Tagyoung Chung | Jing Huang | Nanyun Peng
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic melody-to-lyric generation is a task in which song lyrics are generated to go with a given melody. It is of significant practical interest and more challenging than unconstrained lyric generation, as the music imposes additional constraints on the lyrics. The training data is limited as most songs are copyrighted, resulting in models that underfit the complicated cross-modal relationship between melody and lyrics. In this work, we propose a method for generating high-quality lyrics without training on any aligned melody-lyric data. Specifically, we design a hierarchical lyric generation framework that first generates a song outline and then the complete lyrics. The framework enables disentanglement of training (based purely on text) from inference (melody-guided text generation) to circumvent the shortage of parallel data. We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints as guidance during inference. The two-step hierarchical design also enables content control via the lyric outline, a much-desired feature for democratizing collaborative song creation. Experimental results show that our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines, for example SongMASS, a SOTA model trained on a parallel dataset, with a 24% relative overall quality improvement based on human ratings. Our code is available at https://github.com/amazon-science/unsupervised-melody-to-lyrics-generation.

2022

Zero-shot Sonnet Generation with Discourse-level Planning and Aesthetics Features
Yufei Tian | Nanyun Peng
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Poetry generation, and creative language generation in general, usually suffers from the lack of large training data. In this paper, we present a novel framework to generate sonnets that does not require training on poems. We design a hierarchical framework which plans the poem sketch before decoding. Specifically, a content planning module is trained on non-poetic texts to obtain discourse-level coherence; then a rhyme module generates rhyme words and a polishing module introduces imagery and similes for aesthetic purposes. Finally, we design a constrained decoding algorithm to impose the meter-and-rhyme constraint on the generated sonnets. Automatic and human evaluations show that our multi-stage approach, without training on poem corpora, generates more coherent, poetic, and creative sonnets than several strong baselines.

Go Back in Time: Generating Flashbacks in Stories with Event Temporal Prompts
Rujun Han | Hong Chen | Yufei Tian | Nanyun Peng
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Stories or narratives are composed of a sequence of events. To compose interesting stories, professional writers often leverage a creative writing technique called *flashback* that inserts past events into current storylines, as we commonly observe in novels and plays. However, it is challenging for machines to generate *flashback* as it requires a solid understanding of event **temporal order** (e.g. *feeling hungry* before *eat*, not vice versa), and the creativity to arrange storylines so that earlier events do not always appear first in **narrative order**. Two major issues in existing systems exacerbate the challenges: 1) temporal bias in pretraining and story datasets that leads to monotonic event temporal orders; 2) lack of explicit guidance that helps machines decide where to insert *flashbacks*. We propose to address these issues using structured storylines to encode events and their pair-wise temporal relations (before, after and vague) as **temporal prompts** that guide how stories should unfold temporally. We leverage a Plan-and-Write framework enhanced by reinforcement learning to generate storylines and stories end-to-end. Evaluation results show that the proposed method can generate more interesting stories with *flashbacks* while maintaining textual diversity, fluency, and temporal coherence.

AmbiPun: Generating Humorous Puns with Ambiguous Context
Anirudh Mittal | Yufei Tian | Nanyun Peng
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we propose a simple yet effective way to generate pun sentences that does not require any training on existing puns. Our approach is inspired by humor theories holding that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT-3 to generate context words and then generate puns incorporating context words from both concepts. Human evaluation shows that our method successfully generates puns 52% of the time, outperforming well-crafted baselines and the state-of-the-art models by a large margin.

A Unified Framework for Pun Generation with Humor Principles
Yufei Tian | Divyanshu Sheth | Nanyun Peng
Findings of the Association for Computational Linguistics: EMNLP 2022

We propose a unified framework to generate both homophonic and homographic puns, bridging the split between the two in existing work. Specifically, we incorporate three linguistic attributes of puns into the language models: ambiguity, distinctiveness, and surprise. Our framework consists of three parts: 1) a context words/phrases selector to promote the aforementioned attributes, 2) a generation model trained on non-pun sentences to incorporate the context words/phrases into the generation output, and 3) a label predictor that learns the structure of puns and is used to steer the generation model at inference time. Evaluation results on both pun types demonstrate the efficacy of our model over strong baselines.

Paraphrase Generation as Unsupervised Machine Translation
Xiaofei Sun | Yufei Tian | Yuxian Meng | Nanyun Peng | Fei Wu | Jiwei Li | Chun Fan
Proceedings of the 29th International Conference on Computational Linguistics

In this paper, we propose a new paradigm for paraphrase generation by treating the task as unsupervised machine translation (UMT), based on the assumption that there must be pairs of sentences expressing the same meaning in a large-scale unlabeled monolingual corpus. The proposed paradigm first splits a large unlabeled corpus into multiple clusters, and trains multiple UMT models using pairs of these clusters. Then, based on the paraphrase pairs produced by these UMT models, a unified surrogate model can be trained to serve as the final model to generate paraphrases, which can be directly used for testing in the unsupervised setup, or be finetuned on labeled datasets in the supervised setup. The proposed method offers merits over machine-translation-based paraphrase generation methods, as it avoids reliance on bilingual sentence pairs. It also allows humans to intervene in the model so that more diverse paraphrases can be generated using different filtering criteria. Extensive experiments on existing paraphrase datasets for both the supervised and unsupervised setups demonstrate the effectiveness of the proposed paradigm.

2021

Identifying Distributional Perspectives from Colingual Groups
Yufei Tian | Tuhin Chakrabarty | Fred Morstatter | Nanyun Peng
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Discrepancies exist among different cultures or languages. A lack of mutual understanding among different colingual groups about the perspectives on specific values or events may lead to uninformed decisions or biased opinions. Thus, automatically understanding the group perspectives can provide essential background for many natural language processing tasks. In this paper, we study colingual groups and use language corpora as a proxy to identify their distributional perspectives. We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages. On a held-out set of diverse topics, including marriage, corruption, democracy, etc., our model achieves high correlation with human judgements regarding intra-group values and inter-group differences.

HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge
Yufei Tian | Arvind krishna Sridhar | Nanyun Peng
Findings of the Association for Computational Linguistics: EMNLP 2021

A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task of sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between each component in such hyperboles. We then leverage commonsense and counterfactual inference to generate hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles with a high success rate, intensity, funniness, and creativity.
