Wonseok Hwang


2024

pdf bib
NESTLE: a No-Code Tool for Statistical Analysis of Legal Corpus
Kyoungyeon Cho | Seungkum Han | Young Rok Choi | Wonseok Hwang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

The statistical analysis of large scale legal corpus can provide valuable legal insights. For such analysis one needs to (1) select a subset of the corpus using document retrieval tools, (2) structure text using information extraction (IE) systems, and (3) visualize the data for the statistical analysis. Each process demands either specialized tools or programming skills whereas no comprehensive unified “no-code” tools have been available. Here we provide NESTLE, a no-code tool for large-scale statistical analysis of legal corpus. Powered by a Large Language Model (LLM) and the internal custom end-to-end IE system, NESTLE can extract any type of information that has not been predefined in the IE system opening up the possibility of unlimited customizable statistical analysis of the corpus without writing a single line of code. We validate our system on 15 Korean precedent IE tasks and 3 legal text classification tasks from LexGLUE. The comprehensive experiments reveal NESTLE can achieve GPT-4 comparable performance by training the internal IE module with 4 human-labeled, and 192 LLM-labeled examples.

2022

pdf bib
Data-efficient end-to-end Information Extraction for Statistical Legal Analysis
Wonseok Hwang | Saehee Eom | Hanuhl Lee | Hai Jin Park | Minjoon Seo
Proceedings of the Natural Legal Language Processing Workshop 2022

Legal practitioners often face a vast amount of documents. Lawyers, for instance, search for appropriate precedents favorable to their clients, while the number of legal precedents is ever-growing. Although legal search engines can assist finding individual target documents and narrowing down the number of candidates, retrieved information is often presented as unstructured text and users have to examine each document thoroughly which could lead to information overloading. This also makes their statistical analysis challenging. Here, we present an end-to-end information extraction (IE) system for legal documents. By formulating IE as a generation task, our system can be easily applied to various tasks without domain-specific engineering effort. The experimental results of four IE tasks on Korean precedents shows that our IE system can achieve competent scores (-2.3 on average) compared to the rule-based baseline with as few as 50 training examples per task and higher score (+5.4 on average) with 200 examples. Finally, our statistical analysis on two case categories — drunk driving and fraud — with 35k precedents reveals the resulting structured information from our IE system faithfully reflects the macroscopic features of Korean legal system.

2021

pdf bib
Spatial Dependency Parsing for Semi-Structured Document Information Extraction
Wonseok Hwang | Jinyeong Yim | Seunghyun Park | Sohee Yang | Minjoon Seo
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Cost-effective End-to-end Information Extraction for Semi-structured Document Images
Wonseok Hwang | Hyunji Lee | Jinyeong Yim | Geewook Kim | Minjoon Seo
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A real-world information extraction (IE) system for semi-structured document images often involves a long pipeline of multiple modules, whose complexity dramatically increases its development and maintenance cost. One can instead consider an end-to-end model that directly maps the input to the target output and simplify the entire process. However, such generation approach is known to lead to unstable performance if not designed carefully. Here we present our recent effort on transitioning from our existing pipeline-based IE system to an end-to-end system focusing on practical challenges that are associated with replacing and deploying the system in real, large-scale production. By carefully formulating document IE as a sequence generation task, we show that a single end-to-end IE system can be built and still achieve competent performance.