Juho Kim


2024

pdf bib
LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
Jieun Han | Haneul Yoo | Junho Myung | Minsun Kim | Hyunseung Lim | Yoonsu Kim | Tak Yeon Lee | Hwajung Hong | Juho Kim | So-Yeon Ahn | Alice Oh
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

In the context of English as a Foreign Language (EFL) writing education, LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, challenges arise in assessing LLM-as-a-tutor due to differing standards between educational and general use cases. To bridge this gap, we integrate pedagogical principles to assess student-LLM interaction. First, we explore how LLMs can function as English tutors, providing effective essay feedback tailored to students. Second, we propose three criteria to evaluate LLM-as-a-tutor specifically designed for EFL writing education, emphasizing pedagogical aspects. In this process, EFL experts evaluate the feedback from LLM-as-a-tutor regarding (1) quality and (2) characteristics. On the other hand, EFL learners assess their (3) learning outcomes from interaction with LLM-as-a-tutor. This approach lays the groundwork for developing LLMs-as-a-tutor tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.

pdf bib
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman | Yoonjoo Lee | Aakanksha Naik | Pao Siangliulue | Raymond Fok | Juho Kim | Daniel S Weld | Joseph Chee Chang | Kyle Lo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

When conducting literature reviews, scientists often create literature review tables—tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and contrast the papers. Can we automatically generate these tables using language models (LMs)? In this work, we introduce a framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps. To enable experimentation, we address two main challenges: First, we overcome a lack of high-quality datasets to benchmark table generation by curating and releasing arxivDIGESTables, a new dataset of 2,228 literature review tables extracted from ArXiv papers that synthesize a total of 7,542 research papers. Second, to support scalable evaluation of model generations against human-authored reference tables, we develop DecontextEval, an automatic evaluation method that aligns elements of tables with the same underlying aspects despite differing surface forms. Given these tools, we evaluate LMs’ abilities to reconstruct reference tables, finding this task benefits from additional context to ground the generation (e.g. table captions, in-text references). Finally, through a human evaluation study we find that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.

pdf bib
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis
Nayeon Lee | Chani Jung | Junho Myung | Jiho Jin | Jose Camacho-Collados | Juho Kim | Alice Oh
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Most hate speech datasets neglect the cultural diversity within a single language, resulting in a critical shortcoming in hate speech detection. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. To construct CREHate, we follow a two-step procedure: 1) cultural post collection and 2) cross-cultural annotation. We sample posts from the SBIC dataset, which predominantly represents North America, and collect posts from four geographically diverse English-speaking countries (Australia, United Kingdom, Singapore, and South Africa) using culturally hateful keywords we retrieve from our survey. Annotations are collected from the four countries plus the United States to establish representative labels for each country. Our analysis highlights statistically significant disparities across countries in hate speech annotations. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%. Qualitative analysis shows that label disagreement occurs mostly due to different interpretations of sarcasm and the personal bias of annotators on divisive topics. Lastly, we evaluate large language models (LLMs) under a zero-shot setting and show that current LLMs tend to show higher accuracies on Anglosphere country labels in CREHate.Our dataset and codes are available at: https://github.com/nlee0212/CREHate

pdf bib
Observing the Southern US Culture of Honor Using Large-Scale Social Media Analysis
Juho Kim | Michael Guerzhoy
Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)

A culture of honor refers to a social system where individuals’ status, reputation, and esteem play a central role in governing interpersonal relations. Past works have associated this concept with the United States (US) South and related with it various traits such as higher sensitivity to insult, a higher value on reputation, and a tendency to react violently to insults. In this paper, we hypothesize and confirm that internet users from the US South, where a culture of honor is more prevalent, are more likely to display a trait predicted by their belonging to a culture of honor. Specifically, we test the hypothesis that US Southerners are more likely to retaliate to personal attacks by personally attacking back. We leverage OpenAI’s GPT-3.5 API to both geolocate internet users and to automatically detect whether users are insulting each other. We validate the use of GPT-3.5 by measuring its performance on manually-labeled subsets of the data. Our work demonstrates the potential of formulating a hypothesis based on a conceptual framework, operationalizing it in a way that is amenable to large-scale LLM-aided analysis, manually validating the use of the LLM, and drawing a conclusion.

2022

pdf bib
Interactive Children’s Story Rewriting Through Parent-Children Interaction
Yoonjoo Lee | Tae Soo Kim | Minsuk Chang | Juho Kim
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)

Storytelling in early childhood provides significant benefits in language and literacy development, relationship building, and entertainment. To maximize these benefits, it is important to empower children with more agency. Interactive story rewriting through parent-children interaction can boost children’s agency and help build the relationship between parent and child as they collaboratively create changes to an original story. However, for children with limited proficiency in reading and writing, parents must carry out multiple tasks to guide the rewriting process, which can incur a high cognitive load. In this work, we introduce an interface design that aims to support children and parents to rewrite stories together with the help of AI techniques. We describe three design goals determined by a review of prior literature in interactive storytelling and existing educational activities. We also propose a preliminary prompt-based pipeline that uses GPT-3 to realize the design goals and enable the interface.

2018

pdf bib
Teaching Syntax by Adversarial Distraction
Juho Kim | Christopher Malon | Asim Kadav
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

Existing entailment datasets mainly pose problems which can be answered without attention to grammar or word order. Learning syntax requires comparing examples where different grammar and word order change the desired classification. We introduce several datasets based on synthetic transformations of natural entailment examples in SNLI or FEVER, to teach aspects of grammar and word order. We show that without retraining, popular entailment models are unaware that these syntactic differences change meaning. With retraining, some but not all popular entailment models can learn to compare the syntax properly.