Large language models (LLMs) have proven highly effective for a wide range of tasks, including code generation. Recently, advancements in their capabilities have shown promise in areas like mathematical reasoning, chain-of-thought processes and self-reflection. However, their effectiveness in domains requiring nuanced understanding of financial contexts, such as accounting, remains unclear. In this study, we evaluate how well LLMs perform in generating code for domain-specific languages (DSLs) in accounting, using Beancount as a case study. We create a set of tasks based on common financial ratios, to evaluate the numeracy and financial literacy of LLMs. Our findings reveal that while LLMs are state-of-the art in generative tasks, they struggle severely with accounting, often producing inaccurate calculations and misinterpreting financial scenarios. We characterize these shortcomings through a comprehensive evaluation, shedding light on the limitations of LLMs in understanding and handling money-related tasks.
Differential diagnosis (DDx) is vital for physicians and challenging due to the existence of numerous diseases and their complex symptoms. Model training for this task is generally hindered by limited data access due to privacy concerns. To address this, we present DDxGym, a specialized OpenAI Gym environment for clinical differential diagnosis. DDxGym formulates DDx as a natural-language-based reinforcement learning (RL) problem, where agents emulate medical professionals, selecting examinations and treatments for patients with randomly sampled diseases. This RL environment utilizes data labeled from online resources, evaluated by medical professionals for accuracy. Transformers, while effective for encoding text in DDxGym, are unstable in online RL. For that reason we propose a novel training method using an auxiliary masked language modeling objective for policy optimization, resulting in model stabilization and significant performance improvement over strong baselines. Following this approach, our agent effectively navigates large action spaces and identifies universally applicable actions. All data, environment details, and implementation, including experiment reproduction code, are made publicly available.