Sullam Jeoung

2026

Prompts are the interface for eliciting the capabilities of large language models (LLMs). Understanding their structure and components is critical for analyzing LLM behavior and optimizing performance. However, the field lacks a comprehensive framework for systematic prompt analysis and understanding. We introduce PromptPrism, a linguistically-inspired taxonomy that enables prompt analysis across three hierarchical levels: functional structure, semantic component, and syntactic pattern. By applying linguistic concepts to prompt analysis, PromptPrism bridges traditional language understanding and modern LLM research, offering insights that purely empirical approaches might miss. We show the practical utility of PromptPrism by applying it to three applications: (1) a taxonomy-guided prompt refinement approach that automatically improves prompt quality and enhances model performance across a range of tasks; (2) a multi-dimensional dataset profiling method that extracts and aggregates structural, semantic, and syntactic characteristics from prompt datasets, enabling comprehensive analysis of prompt distributions and patterns; (3) a controlled experimental framework for prompt sensitivity analysis by quantifying the impact of semantic reordering and delimiter modifications on LLM performance. Our experimental results validate the effectiveness of our taxonomy across these applications, demonstrating that PromptPrism provides a foundation for refining, profiling, and analyzing prompts.

2025

pdf bib abs

Semantic Networks Extracted from Students’ Think-Aloud Data are Correlated with Students’ Learning Performance
Pingjing Yang | Sullam Jeoung | Jennifer Cromley | Jana Diesner
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

When students reflect on their learning from a textbook via think-aloud processes, network representations can be used to capture the concepts and relations from these data. What can we learn from the resulting network representations about students’ learning processes, knowledge acquisition, and learning outcomes? This study brings methods from entity and relation extraction using classic and LLM-based methods to the application domain of educational psychology. We built a ground-truth baseline of relational data that represents relevant (to educational science), textbook-based information as a semantic network. Among the tested models, SPN4RE and LUKE achieved the best performance in extracting concepts and relations from students’ verbal data. Network representations of students’ verbalizations varied in structure, reflecting different learning processes. Correlating the students’ semantic networks with learning outcomes revealed that denser and more interconnected semantic networks were associated with more elaborated knowledge acquisition. Structural features such as the number of edges and surface overlap with textbook networks significantly correlated with students’ posttest performance.

Since the advent of large language models (LLMs), prompt engineering has been a crucial step for eliciting desired responses for various Natural Language Processing (NLP) tasks. However, prompt engineering remains an impediment for end users due to rapid advances in models, tasks, and associated best practices. To mitigate this, Automatic Prompt Optimization (APO) techniques have recently emerged that use various automated techniques to help improve the performance of LLMs on various tasks. In this paper, we present a comprehensive survey summarizing the current progress and remaining challenges in this field. We provide a formal definition of APO, a 5-part unifying framework, and then proceed to rigorously categorize all relevant works based on their salient features therein. We hope to spur further research guided by our framework.

pdf bib abs

Tales of Morality: Comparing Human- and LLM-Generated Moral Stories from Visual Cues
Rezvaneh Rezapour | Sullam Jeoung | Zhiwen You | Jana Diesner
Findings of the Association for Computational Linguistics: EMNLP 2025

Do moral values align between images, the stories humans write about them, and the narratives generated by large language models (LLMs)? This question matters because stories are central to how humans communicate moral values, yet little is known about how people and LLMs perform this task in a multimodal (text and image) setting. We present a systematic comparison of moral values represented in human- and LLM-generated narratives based on images annotated by humans for moral content. Our analysis shows that while human stories reflect a balanced distribution of moral foundations and coherent narrative arcs, LLMs disproportionately emphasize the Care foundation and often lack emotional resolution. Even with moral conditioning, these biases persist in LLMs. We introduce a novel dataset and framework for evaluating moral storytelling in vision-language models, highlighting key challenges in aligning AI with human moral reasoning across cultures.

2024

pdf bib abs

Name-based gender prediction has traditionally categorized individuals as either female or male based on their names, using a binary classification system. That binary approach can be problematic in the cases of gender-neutral names that do not align with any one gender, among other reasons. Relying solely on binary gender categories without recognizing gender-neutral names can reduce the inclusiveness of gender prediction tasks. We introduce an additional gender category, i.e., “neutral”, to study and address potential gender biases in Large Language Models (LLMs). We evaluate the performance of several foundational and large language models in predicting gender based on first names only. Additionally, we investigate the impact of adding birth years to enhance the accuracy of gender prediction, accounting for shifting associations between names and genders over time. Our findings indicate that most LLMs identify male and female names with high accuracy (over 80%) but struggle with gender-neutral names (under 40%), and the accuracy of gender prediction is higher for English-based first names than non-English names. The experimental results show that incorporating the birth year does not improve the overall accuracy of gender prediction, especially for names with evolving gender associations. We recommend using caution when applying LLMs for gender identification in downstream tasks, particularly when dealing with non-binary gender labels.

pdf bib abs

Extractive Summarization via Fine-grained Semantic Tuple Extraction
Yubin Ge | Sullam Jeoung | Jana Diesner
Proceedings of the 17th International Natural Language Generation Conference

Traditional extractive summarization treats the task as sentence-level classification and requires a fixed number of sentences for extraction. However, this rigid constraint on the number of sentences to extract may hinder model generalization due to varied summary lengths across datasets. In this work, we leverage the interrelation between information extraction (IE) and text summarization, and introduce a fine-grained autoregressive method for extractive summarization through semantic tuple extraction. Specifically, we represent each sentence as a set of semantic tuples, where tuples are predicate-argument structures derived from conducting IE. Then we adopt a Transformer-based autoregressive model to extract the tuples corresponding to the target summary given a source document. In inference, a greedy approach is proposed to select source sentences to cover extracted tuples, eliminating the need for a fixed number. Our experiments on CNN/DM and NYT demonstrate the method’s superiority over strong baselines. Through the zero-shot setting for testing the generalization of models to diverse summary lengths across datasets, we further show our method outperforms baselines, including ChatGPT.

2023

pdf bib abs

StereoMap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models
Sullam Jeoung | Yubin Ge | Jana Diesner
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have been observed to encode and perpetuate harmful associations present in the training data. We propose a theoretically grounded framework called StereoMap to gain insights into their perceptions of how demographic groups have been viewed by society. The framework is grounded in the Stereotype Content Model (SCM); a well-established theory from psychology. According to SCM, stereotypes are not all alike. Instead, the dimensions of Warmth and Competence serve as the factors that delineate the nature of stereotypes. Based on the SCM theory, StereoMap maps LLMs’ perceptions of social groups (defined by socio-demographic features) using the dimensions of Warmth and Competence. Furthermore, the framework enables the investigation of keywords and verbalizations of reasoning of LLMs’ judgments to uncover underlying factors influencing their perceptions. Our results show that LLMs exhibit a diverse range of perceptions towards these groups, characterized by mixed evaluations along the dimensions of Warmth and Competence. Furthermore, analyzing the reasonings of LLMs, our findings indicate that LLMs demonstrate an awareness of social disparities, often stating statistical data and research findings to support their reasoning. This study contributes to the understanding of how LLMs perceive and represent social groups, shedding light on their potential biases and the perpetuation of harmful associations.

pdf bib abs

Detection and Mitigation of the Negative Impact of Dataset Extractivity on Abstractive Summarization
Yubin Ge | Sullam Jeoung | Ly Dinh | Jana Diesner
Findings of the Association for Computational Linguistics: ACL 2023

In text summarization, extractivity is defined as a measurement of the degree of overlap between a source document and its summary. Previous research has shown that the extractivity level of training data can influence both output extractivity and the amount of factual information (i.e. faithfulness) in outputs for abstractive summarization. However, it remains unclear if and how extractivity impacts the performance of abstractive models. In this work, we investigate the relationship between dataset extractivity and model performance by comparing the performance of trained models under different degrees of extractivity. We find that while low levels of extractivity can improve performance, as extractivity increases, performance is negatively impacted. Furthermore, through an analysis of the model’s copy continuity of content, we discover that higher extractivity leads to a greater tendency for the model to copy text continuously from the source document rather than identifying and summarizing important content that should be covered in the target summary. To address these issues, we propose a simple and effective method to design copy labels for fixing the model’s copying behaviors and train the model with a copy mechanism. The experimental results illustrate the effectiveness of our strategy in alleviating the negative impact on model performance resulting from high dataset extractivity, and that our method outperforms several competitive baselines.

pdf bib abs

Unlearning Bias in Language Models by Partitioning Gradients
Charles Yu | Sullam Jeoung | Anish Kasi | Pengfei Yu | Heng Ji
Findings of the Association for Computational Linguistics: ACL 2023

Recent research has shown that large-scale pretrained language models, specifically transformers, tend to exhibit issues relating to racism, sexism, religion bias, and toxicity in general. Unfortunately, these pretrained language models are used almost universally in downstream tasks, and natural language processing is often applied to make real-world predictions. Thus, debiasing these language models as early in development as possible is increasingly crucial for preventing unintentional harms caused by natural language systems. To this end, we propose a new technique called partitioned contrastive gradient unlearning (PCGU), a gray-box method for debiasing pretrained masked language models. PCGU aims to optimize only the weights that contribute most to a specific domain of bias, doing so by computing a first-order approximation based on the gradients of contrastive sentence pairs. Our experiments show that PCGU is both low-cost and seems particularly effective at pinpointing the sources of implicit social bias in large pretrained transformers. Although we train using PCGU in the gender-profession domain only, we find that doing so can also partially mitigate bias across other domains. All code for our implementation and experiments can be found at https://github.com/CharlesYu2000/PCGU-UnlearningBias.

pdf bib abs

Examining the Causal Impact of First Names on Language Models: The Case of Social Commonsense Reasoning
Sullam Jeoung | Jana Diesner | Halil Kilicoglu
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

As language models continue to be integrated into applications of personal and societal relevance, ensuring these models’ trustworthiness is crucial, particularly with respect to producing consistent outputs regardless of sensitive attributes. Given that first names may serve as proxies for (intersectional) socio-demographic representations, it is imperative to examine the impact of first names on commonsense reasoning capabilities. In this paper, we study whether a model’s reasoning given a specific input differs based on the first names provided. Our underlying assumption is that the reasoning about Alice should not differ from the reasoning about James. We propose and implement a controlled experimental framework to measure the causal effect of first names on commonsense reasoning, enabling us to distinguish between model predictions due to chance and caused by actual factors of interest. Our results indicate that the frequency of first names has a direct effect on model prediction, with less frequent names yielding divergent predictions compared to more frequent names. To gain insights into the internal mechanisms of models that are contributing to these behaviors, we also conduct an in-depth explainable analysis. Overall, our findings suggest that to ensure model robustness, it is essential to augment datasets with more diverse first names during the configuration stage.

2022

pdf bib abs

What changed? Investigating Debiasing Methods using Causal Mediation Analysis
Sullam Jeoung | Jana Diesner
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Previous work has examined how debiasing language models affect downstream tasks, specifically, how debiasing techniques influence task performance and whether debiased models also make impartial predictions in downstream tasks or not. However, what we don’t understand well yet is why debiasing methods have varying impacts on downstream tasks and how debiasing techniques affect internal components of language models, i.e., neurons, layers, and attentions. In this paper, we decompose the internal mechanisms of debiasing language models with respect to gender by applying causal mediation analysis to understand the influence of debiasing methods on toxicity detection as a downstream task. Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics, and to focus on changes in the behavior of certain components of the models, e.g.,first two layers of language models, and attention heads.

pdf bib abs

Raison d’être of the benchmark dataset: A Survey of Current Practices of Benchmark Dataset Sharing Platforms
Jaihyun Park | Sullam Jeoung
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

This paper critically examines the current practices of benchmark dataset sharing in NLP and suggests a better way to inform reusers of the benchmark dataset. As the dataset sharing platform plays a key role not only in distributing the dataset but also in informing the potential reusers about the dataset, we believe data-sharing platforms should provide a comprehensive context of the datasets. We survey four benchmark dataset sharing platforms: HuggingFace, PaperswithCode, Tensorflow, and Pytorch to diagnose the current practices of how the dataset is shared which metadata is shared and omitted. To be specific, drawing on the concept of data curation which considers the future reuse when the data is made public, we advance the direction that benchmark dataset sharing platforms should take into consideration. We identify that four benchmark platforms have different practices of using metadata and there is a lack of consensus on what social impact metadata is. We believe the problem of missing a discussion around social impact in the dataset sharing platforms has to do with the failed agreement on who should be in charge. We propose that the benchmark dataset should develop social impact metadata and data curator should take a role in managing the social impact metadata.