2024
pdf
bib
abs
Towards Context-Based Violence Detection: A Korean Crime Dialogue Dataset
Minju Kim
|
Heuiyeen Yeen
|
Myoung-Wan Koo
Findings of the Association for Computational Linguistics: EACL 2024
In order to enhance the security of society, there is rising interest in artificial intelligence (AI) to help detect and classify in advanced violence in daily life. The field of violence detection has introduced various datasets, yet context-based violence detection predominantly focuses on vision data, with a notable lack of NLP datasets. To overcome this, this paper presents the first Korean dialogue dataset for classifying violence that occurs in online settings: the Korean Crime Dialogue Dataset (KCDD). KCDD contains 22,249 dialogues created by crowd workers assuming offline scenarios. It has four criminal classes that meet international legal standards and one clean class (Serious Threats, Extortion or Blackmail, Harassment in the Workplace, Other Harassment, and Clean Dialogue). Plus, we propose a strong baseline for the proposed dataset, Relationship-Aware BERT. The model shows that understanding varying relationships among interlocutors improves the performance of crime dialogue classification. We hope that the proposed dataset will be used to detect cases of violence and aid people in danger. The KCDD dataset and corresponding baseline implementations can be found at the following link:
https://sites.google.com/view/kcdd.
pdf
bib
abs
SELF-EXPERTISE: Knowledge-based Instruction Dataset Augmentation for a Legal Expert Language Model
Minju Kim
|
Haein Jung
|
Myoung-Wan Koo
Findings of the Association for Computational Linguistics: NAACL 2024
The advent of instruction-tuned large language models (LLMs) has significantly advanced the field of automatic instruction dataset augmentation. However, the method of generating instructions and outputs from inherent knowledge of LLM can unintentionally produce hallucinations — instances of generating factually incorrect or misleading information. To overcome this, we propose SELF-EXPERTISE, automatically generating instruction dataset in the legal domain from a seed dataset. SELF-EXPERTISE extracts knowledge from the outputs of the seed dataset, and generates new instructions, inputs, and outputs. In this way, the proposed method reduces hallucination in automatic instruction augmentation. We trained an SELF-EXPERTISE augmented instruction dataset on the LLaMA-2 7B model to construct Korean legal specialized model, called LxPERT. LxPERT has demonstrated performance surpassing GPT-3.5-turbo in both in-domain and out-of-domain datasets. The SELF-EXPERTISE augmentation pipeline is not only applicable to the legal field but is also expected to be extendable to various domains, potentially advancing domain-specialized LLMs.
pdf
bib
abs
Pearl: A Review-driven Persona-Knowledge Grounded Conversational Recommendation Dataset
Minjin Kim
|
Minju Kim
|
Hana Kim
|
Beong-woo Kwak
|
SeongKu Kang
|
Youngjae Yu
|
Jinyoung Yeo
|
Dongha Lee
Findings of the Association for Computational Linguistics: ACL 2024
Conversational recommender systems are an emerging area that has garnered increasing interest in the community, especially with the advancements in large language models (LLMs) that enable sophisticated handling of conversational input. Despite the progress, the field still has many aspects left to explore. The currently available public datasets for conversational recommendation lack specific user preferences and explanations for recommendations, hindering high-quality recommendations. To address such challenges, we present a novel conversational recommendation dataset named PEARL, synthesized with persona- and knowledge-augmented LLM simulators. We obtain detailed persona and knowledge from real-world reviews and construct a large-scale dataset with over 57k dialogues. Our experimental results demonstrate that PEARL contains more specific user preferences, show expertise in the target domain, and provides recommendations more relevant to the dialogue context than those in prior datasets. Furthermore, we demonstrate the utility of PEARL by showing that our downstream models outperform baselines in both human and automatic evaluations. We release our dataset and code.
pdf
bib
abs
Cactus: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory
Suyeon Lee
|
Sunghwan Kim
|
Minju Kim
|
Dongjin Kang
|
Dongil Yang
|
Harim Kim
|
Minseok Kang
|
Dayi Jung
|
Min Hee Kim
|
Seungbeen Lee
|
Kyong-Mee Chung
|
Youngjae Yu
|
Dongha Lee
|
Jinyoung Yeo
Findings of the Association for Computational Linguistics: EMNLP 2024
Recently, the demand for psychological counseling has significantly increased as more individuals express concerns about their mental health. This surge has accelerated efforts to improve the accessibility of counseling by using large language models (LLMs) as counselors. To ensure client privacy, training open-source LLMs faces a key challenge: the absence of realistic counseling datasets. To address this, we introduce Cactus, a multi-turn dialogue dataset that emulates real-life interactions using the goal-oriented and structured approach of Cognitive Behavioral Therapy (CBT).We create a diverse and realistic dataset by designing clients with varied, specific personas, and having counselors systematically apply CBT techniques in their interactions. To assess the quality of our data, we benchmark against established psychological criteria used to evaluate real counseling sessions, ensuring alignment with expert evaluations.Experimental results demonstrate that Camel, a model trained with Cactus, outperforms other models in counseling skills, highlighting its effectiveness and potential as a counseling agent.We make our data, model, and code publicly available.
2023
pdf
bib
abs
Enhancing Task-Oriented Dialog System with Subjective Knowledge: A Large Language Model-based Data Augmentation Framework
Haein Jung
|
Heuiyeen Yeen
|
Jeehyun Lee
|
Minju Kim
|
Namo Bang
|
Myoung-Wan Koo
Proceedings of The Eleventh Dialog System Technology Challenge
As Task-Oriented Dialog (TOD) systems have advanced, structured DB systems, which aim to collect relevant knowledge for answering user’s questions, have also progressed. Despite these advancements, these methods face challenges when dealing with subjective questions from users. To overcome this, DSTC11 released a subjective-knowledge-based TOD (SK-TOD) dataset and benchmark. This paper introduces a framework that effectively solves SK-TOD tasks by leveraging a Large Language Model (LLM). We demonstrate the proficient use of LLM for each sub-task, including an adapters-based method and knowledge-grounded data augmentation. Our proposed methods, which utilize LLM as an efficient tool, outperform baseline performance and approaches that directly use LLM as a one-step sub-task solver, showing superior task-specific optimization.
2022
pdf
bib
abs
BotsTalk: Machine-sourced Framework for Automatic Curation of Large-scale Multi-skill Dialogue Datasets
Minju Kim
|
Chaehyeong Kim
|
Yong Ho Song
|
Seung-won Hwang
|
Jinyoung Yeo
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
To build open-domain chatbots that are able to use diverse communicative skills, we propose a novel framework BotsTalk, where multiple agents grounded to the specific target skills participate in a conversation to automatically annotate multi-skill dialogues. We further present Blended Skill BotsTalk (BSBT), a large-scale multi-skill dialogue dataset comprising 300K conversations. Through extensive experiments, we demonstrate that our dataset can be effective for multi-skill dialogue systems which require an understanding of skill blending as well as skill grounding. Our code and data are available at https://github.com/convei-lab/BotsTalk.