Hyeonbin Hwang
2025
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim | Juyoung Suk | Ji Yong Cho | Shayne Longpre | Chaeeun Kim | Dongkeun Yoon | Guijin Son | Yejin Cho | Sheikh Shafayat | Jinheon Baek | Sue Hyun Park | Hyeonbin Hwang | Jinkyung Jo | Hyowon Cho | Haebin Shin | Seongyun Lee | Hanseok Oh | Noah Lee | Namgyu Ho | Se June Joo | Miyoung Ko | Yoonjoo Lee | Hyungjoo Chae | Jamin Shin | Joel Jang | Seonghyeon Ye | Bill Yuchen Lin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Seungone Kim | Juyoung Suk | Ji Yong Cho | Shayne Longpre | Chaeeun Kim | Dongkeun Yoon | Guijin Son | Yejin Cho | Sheikh Shafayat | Jinheon Baek | Sue Hyun Park | Hyeonbin Hwang | Jinkyung Jo | Hyowon Cho | Haebin Shin | Seongyun Lee | Hanseok Oh | Noah Lee | Namgyu Ho | Se June Joo | Miyoung Ko | Yoonjoo Lee | Hyungjoo Chae | Jamin Shin | Joel Jang | Seonghyeon Ye | Bill Yuchen Lin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria-like helpfulness and harmlessness-which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
2024
Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
Hyeonbin Hwang | Doyoung Kim | Seungone Kim | Seonghyeon Ye | Minjoon Seo
Findings of the Association for Computational Linguistics: EMNLP 2024
Hyeonbin Hwang | Doyoung Kim | Seungone Kim | Seonghyeon Ye | Minjoon Seo
Findings of the Association for Computational Linguistics: EMNLP 2024
Training on large amounts of rationales (i.e., CoT Fine-tuning) has been found effective for improving mathematical reasoning of large language models (LLMs). However, acquiring human-authored solutions or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study the problem of whether LLMs could self-improve mathematical reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test set, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT). Our code is available here]9.
Search
Fix author
Co-authors
- Seungone Kim 2
- Minjoon Seo 2
- Seonghyeon Ye 2
- Jinheon Baek 1
- Hyungjoo Chae 1
- Ji Yong Cho 1
- Yejin Cho 1
- Hyowon Cho 1
- Namgyu Ho 1
- Joel Jang 1
- Jinkyung Jo 1
- Se June Joo 1
- Doyoung Kim 1
- Chaeeun Kim 1
- Miyoung Ko 1
- Seongyun Lee 1
- Noah Lee 1
- Yoonjoo Lee 1
- Moontae Lee 1
- Kyungjae Lee 1
- Bill Yuchen Lin 1
- Shayne Longpre 1
- Graham Neubig 1
- Hanseok Oh 1
- Sue Hyun Park 1
- Sheikh Shafayat 1
- Haebin Shin 1
- Jamin Shin 1
- Guijin Son 1
- Juyoung Suk 1
- Sean Welleck 1
- Dongkeun Yoon 1