Archit Sharma


2023

APTSumm at BioLaySumm Task 1: Biomedical Breakdown, Improving Readability by Relevancy Based Selection
A.s. Poornash | Atharva Deshmukh | Archit Sharma | Sriparna Saha
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

In this paper, we tackle a lay summarization task that aims to produce lay summaries of biomedical articles. The task was presented by BioLaySumm at the BioNLP Workshop at ACL 2023 (Goldsack et al., 2023). Our proposed models provide a three-step abstractive approach for summarizing biomedical articles. Our methodology involves breaking down the original document into distinct sections, generating candidate summaries for each subsection, and finally re-ranking and selecting the top-performing paragraph for each section. We run ablation studies to establish that each step in our pipeline is critical for improving the quality of the lay summary. This model achieved the second-highest rank in terms of readability scores (Luo et al., 2022). Our work distinguishes itself from previous studies by considering not only the content of the paper but also its structure, resulting in more coherent and comprehensible lay summaries. We hope that our model for generating lay summaries of biomedical articles will be a useful resource for individuals across various domains, including academia, industry, and healthcare, who require rapid comprehension of key scientific research.
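
The three-step shape described in this abstract (split into sections, generate candidates per section, re-rank and keep the best) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the blank-line section format, the sentence-level candidate generator (an extractive stand-in for an abstractive model), and the Jaccard-overlap relevance score are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercased word set used by the toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sections(document):
    """Step 1 (assumed format): blocks separated by blank lines,
    with the first line of each block acting as the section heading."""
    sections = {}
    for block in re.split(r"\n\s*\n", document.strip()):
        heading, _, body = block.partition("\n")
        sections[heading.strip()] = body.strip()
    return sections


def generate_candidates(section_text):
    """Step 2 (placeholder): treat each sentence as a candidate.
    A real system would call an abstractive summarizer here."""
    parts = re.split(r"(?<=[.!?])\s+", section_text)
    return [p.strip() for p in parts if p.strip()]


def relevance(candidate, reference):
    """Step 3 score (assumption): Jaccard overlap with a reference text."""
    a, b = tokens(candidate), tokens(reference)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize(document, reference):
    """Run the three steps and keep the top-scoring candidate per section."""
    summary = {}
    for heading, text in split_sections(document).items():
        candidates = generate_candidates(text)
        if candidates:
            summary[heading] = max(candidates, key=lambda c: relevance(c, reference))
    return summary
```

Calling `summarize(document, reference)` returns a mapping from section heading to the selected candidate. In this sketch, `reference` is a hypothetical text, such as the article title, used only to score relevance.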

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
Katherine Tian | Eric Mitchell | Allan Zhou | Archit Sharma | Rafael Rafailov | Huaxiu Yao | Chelsea Finn | Christopher Manning
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model’s conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.
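
For readers unfamiliar with the metric named in this abstract, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below shows that common binned formulation. The function name, the equal-width binning, and the default of 10 bins are assumptions, and the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

For example, four answers each given with confidence 0.9, of which three are correct, yield an ECE of about 0.15. The same function can score either verbalized confidences or conditional probabilities, so the two can be compared on identical answers.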