Yiwei Qin
2024
InFoBench: Evaluating Instruction Following Ability in Large Language Models
Yiwei Qin
|
Kaiqiang Song
|
Yebowen Hu
|
Wenlin Yao
|
Sangwoo Cho
|
Xiaoyang Wang
|
Xuansheng Wu
|
Fei Liu
|
Pengfei Liu
|
Dong Yu
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models’ (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs’ compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR’s higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
2023
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Yiwei Qin
|
Weizhe Yuan
|
Graham Neubig
|
Pengfei Liu
Findings of the Association for Computational Linguistics: EMNLP 2023
Modern embedding-based metrics for evaluation of generated text generally fall into one of two paradigms: discriminative metrics that are trained to directly predict which outputs are of higher quality according to supervised human annotations, and generative metrics that are trained to evaluate text based on the probabilities of a generative model. Both have their advantages; discriminative metrics are able to directly optimize for the problem of distinguishing between good and bad outputs, while generative metrics can be trained using abundant raw text. In this paper, we present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as backbone. We perform an extensive empirical comparison with other existing metrics on 5 datasets, 19 languages and 280 systems, demonstrating the utility of our method. Experimental results show that: T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
2021
Automating Claim Construction in Patent Applications: The CMUmine Dataset
Ozan Tonguz
|
Yiwei Qin
|
Yimeng Gu
|
Hyun Hannah Moon
Proceedings of the Natural Legal Language Processing Workshop 2021
Intellectual Property (IP) in the form of issued patents is a critical and very desirable element of innovation in high-tech. In this position paper, we explore the possibility of automating the legal task of Claim Construction in patent applications via Natural Language Processing (NLP) and Machine Learning (ML). To this end, we first create a large dataset known as CMUmine™and then demonstrate that, using NLP and ML techniques the Claim Construction in patent applications, a crucial legal task currently performed by IP attorneys, can be automated. To the best of our knowledge, this is the first public patent application dataset. Our results look very promising in automating the patent application process.