The Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models (2025)
Proceedings of the First Workshop on Ethical Concerns in Training, Evaluating and Deploying Large Language Models
Damith Premasiri | Tharindu Ranasinghe | Hansi Hettiarachchi
TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Arjun Damerla | Jimin Lim | Yanxi Jiang | Nam Nguyen Hoai Le | Nikil Selladurai
Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains under-explored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments through purely textual feedback (“you earned a token”), without access to numerical cues or explicit probabilities, requiring the model to infer latent reward structures purely from linguistic cues and to adapt accordingly. We evaluate four open-source LLMs and compare their performance to standard decision-making algorithms such as Thompson Sampling, Epsilon-Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, significantly outperforming both the larger LLMs and the traditional methods. Our findings suggest that probabilistic reasoning can emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
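To make the setup concrete, the following is a minimal sketch of a text-only bandit environment paired with one of the classical baselines named in the abstract (epsilon-greedy). The arm probabilities, horizon, epsilon value, and the no-reward phrasing are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a text-only bandit evaluation (arm probabilities, horizon,
# epsilon, and the no-reward phrasing are illustrative assumptions).
import random

class TextBanditEnv:
    """Multi-armed bandit whose only feedback is a natural-language string."""

    def __init__(self, reward_probs):
        self.reward_probs = reward_probs  # latent; never revealed to the agent

    def pull(self, arm):
        success = random.random() < self.reward_probs[arm]
        # No numbers or probabilities are exposed; language is the sole signal.
        return "you earned a token" if success else "you earned nothing"

def epsilon_greedy(env, n_arms, steps=1000, epsilon=0.1):
    """Classical baseline that parses rewards out of the textual feedback."""
    pulls, wins = [0] * n_arms, [0] * n_arms
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)  # explore
        else:                               # exploit current estimate
            arm = max(range(n_arms),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
        pulls[arm] += 1
        wins[arm] += "token" in env.pull(arm)  # reward recovered from text alone
    return pulls

pulls = epsilon_greedy(TextBanditEnv([0.2, 0.5, 0.8]), n_arms=3)
print(f"best-arm selection rate: {pulls[2] / sum(pulls):.3f}")  # arm 2 is best
```

In the paper's benchmark, an LLM agent would stand in for `epsilon_greedy`, receiving the same feedback strings as conversation turns and choosing the next arm in natural language.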
CoVeGAT: A Hybrid LLM & Graph‐Attention Pipeline for Accurate Citation‐Aligned Claim Verification
Max Bader | Akshatha Arunkumar | Ohan Ahmad | Maruf Hassen | Charles Duong | Kevin Zhu
Modern LLMs often generate fluent text yet fabricate, misquote, or misattribute evidence. To quantify this flaw, we built a balanced Citation-Alignment Dataset of 500 genuine, expert-verified claim–quote pairs and 500 minimally perturbed false variants from news, legal, scientific, and literary sources. We then propose CoVeGAT, which converts claims and citations into SVO triplets (with trigram fallback), scores each pair via an LLM-driven chain of verification, and embeds them in a weighted semantic graph. A Graph Attention Network over BERT embeddings issues strict pass/fail judgments on alignment. Zero-shot evaluation of seven top LLMs (e.g., GPT-4o, Gemini 1.5, Mistral 7B) reveals a trade-off: decisive models reach 82.5% accuracy but err confidently, while cautious ones fall below 50%. A MiniLM + RBF kernel baseline, by contrast, achieves 96.4% accuracy, underscoring the power of simple, interpretable methods.
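The abstract's strongest result is the simple MiniLM + RBF kernel baseline. One plausible reading of that baseline, sketched below under stated assumptions, is an RBF-kernel SVM over concatenated MiniLM claim/quote embeddings; the encoder name, pair-feature construction, and toy examples are assumptions, not the authors' implementation.

```python
# Hedged sketch of a "MiniLM + RBF kernel" alignment baseline. The encoder
# choice, feature construction, and toy data are assumptions; the paper's
# exact setup may differ.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a standard MiniLM encoder

def pair_features(claims, quotes):
    """Concatenate claim and quote embeddings into one feature vector per pair."""
    return np.concatenate([encoder.encode(claims), encoder.encode(quotes)], axis=1)

# Hypothetical stand-ins for the dataset's genuine and perturbed pairs.
claims = ["The court ruled the statute unconstitutional.",
          "The court upheld the statute."]
quotes = ["We hold that the statute is unconstitutional.",
          "We hold that the statute is unconstitutional."]
labels = [1, 0]  # 1 = claim aligned with its quote, 0 = misattributed

clf = SVC(kernel="rbf")                            # RBF kernel over embeddings
clf.fit(pair_features(claims, quotes), labels)
print(clf.predict(pair_features(claims, quotes)))  # strict pass/fail judgments
```

Trained on the full 1,000-pair dataset rather than this toy data, such a kernel classifier is the kind of simple, interpretable method the abstract credits with 96.4% accuracy.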
TVS Sidekick: Challenges and Practical Insights from Deploying Large Language Models in the Enterprise
Paula Reyero Lobo | Kevin Johnson | Bill Buchanan | Matthew Shardlow | Ashley Williams | Sam Attwood
Many enterprises are increasingly adopting Artificial Intelligence (AI) to make internal processes more competitive and efficient. In response to public concern and new regulations for the ethical and responsible use of AI, implementing AI governance frameworks could help to integrate AI within organisations and mitigate the associated risks. However, rapid technological advances and the lack of shared ethical AI infrastructures create barriers to their practical adoption in businesses. This paper presents a real-world AI application at TVS Supply Chain Solutions, reporting on the experience of developing an AI assistant underpinned by large language models and on the ethical, regulatory, and sociotechnical challenges of deploying it for enterprise use.