Thomas Palmeira Ferraz

2024

No Argument Left Behind: Overlapping Chunks for Faster Processing of Arbitrarily Long Legal Texts
Israel Fama | Bárbara Bueno | Alexandre Alcoforado | Thomas Palmeira Ferraz | Arnold Moya | Anna Helena Reali Costa
Proceedings of the 15th Brazilian Symposium in Information and Human Language Technology


From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning
Alexandre Alcoforado | Thomas Palmeira Ferraz | Lucas Hideki Okamura | Israel Campos Fama | Arnold Moya Lavado | Bárbara Dias Bueno | Bruno Veloso | Anna Helena Reali Costa
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1


Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post “in a funny tone” with “no hashtag”). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
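
The decompose, critique and refine loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `decrim_loop`, the `Critique` record, the `max_rounds` cap, and all four callables are hypothetical names chosen for illustration, and the toy stand-ins below replace what would be LLM calls in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    """Hypothetical record of a Critic verdict on one constraint."""
    constraint: str
    satisfied: bool
    feedback: str = ""


def decrim_loop(
    instruction: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    critic: Callable[[str, str], Critique],
    refine: Callable[[str, str, List[Critique]], str],
    max_rounds: int = 3,  # assumption: the abstract does not state a round limit
) -> str:
    """Sketch of a decompose / critique / refine self-correction loop."""
    response = generate(instruction)
    # Decompose: split the instruction into a list of constraints.
    constraints = decompose(instruction)
    for _ in range(max_rounds):
        # Critique: check the current response against every constraint.
        failed = [
            verdict
            for verdict in (critic(response, c) for c in constraints)
            if not verdict.satisfied
        ]
        if not failed:
            break  # the Critic found nothing left to refine
        # Refine: revise the response using feedback on the failed constraints.
        response = refine(instruction, response, failed)
    return response


# Toy stand-ins (assumptions, for demonstration only).
def toy_generate(instruction: str) -> str:
    return "Coffee time! #coffee"


def toy_decompose(instruction: str) -> List[str]:
    return ["no hashtag"]


def toy_critic(response: str, constraint: str) -> Critique:
    ok = "#" not in response
    return Critique(constraint, ok, "" if ok else "remove hashtags")


def toy_refine(instruction: str, response: str, failed: List[Critique]) -> str:
    # Drop every whitespace-separated token that starts with '#'.
    return " ".join(w for w in response.split() if not w.startswith("#"))


result = decrim_loop(
    "Write a social media post about coffee with no hashtag",
    toy_generate, toy_decompose, toy_critic, toy_refine,
)
print(result)
```

In this sketch the Critic decides both when refinement is needed (any failed constraint) and where (which constraints are passed to `refine`), mirroring the role the abstract assigns to it; a real system would back each callable with a model call.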