Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advancements in large language models (LLMs) offer the potential to automatically generate literature reviews on demand, addressing this issue. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission. In healthcare, this can make LLMs unusable at best and dangerous at worst. We conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of LLMs in the specific context of medical evidence reviews. Experts indicated that LLMs can assist in the writing process by drafting summaries, generating templates, distilling information, and crosschecking information. They also raised concerns regarding confidently composed but inaccurate LLM outputs and other potential downstream harms, including decreased accountability and proliferation of low-quality reviews. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.


Introduction
In the fall of 2022, Meta (formerly Facebook) unveiled Galactica, a large language model (LLM) touted as being able to "store, combine and reason about scientific knowledge" (Taylor et al., 2022). The prototype allowed users to enter (natural language) questions, and the model would then generate confident, scientific-sounding outputs ostensibly backed by evidence and published literature. Nevertheless, like all current LLMs, Galactica was prone to factual inaccuracies (Bender et al., 2021), and could easily be induced to produce plainly absurd and arguably harmful outputs (see Figure 1).
A swift backlash ensued on social media, with individuals posting problematic outputs featuring confidently written but scientifically inaccurate prose (Heaven, 2022; Greene, 2022). Such examples were widely characterized as potentially harmful and unsafe (Marcus, 2022). Often, online discourse around machine learning (ML) generally and LLMs specifically lacks nuance and tends toward extreme positions: either hyping LLMs and their ability to seamlessly synthesize knowledge on demand, or characterizing them as uniformly useless at best and harmful at worst.
We argue that assessment of the potential uses and harms of LLMs is only meaningful when rooted in a particular context: When are LLM outputs potentially dangerous, exactly, and to whom? Conversely, what advantages might they confer, and for what tasks? In this work, we ground the consideration of such questions in the important context of medical systematic reviews (SRs).
Medical systematic reviews, evidence-based medicine, and LLMs

One of the touted potential strengths of Galactica (and by implication, other LLMs) is its ability to "synthesize knowledge by generating secondary content automatically: such as literature reviews . . . " (Taylor et al., 2022). In the context of healthcare, systematic literature reviews are a critical tool of Evidence-Based Medicine (EBM; Haidich 2010; Sackett et al. 1996). Indeed, comprehensive synopses of the published literature are considered the strongest form of evidence and inform healthcare policy and practice (Murad et al., 2016; Cook et al., 1997; Mulrow, 1987).
However, the medical evidence base is voluminous and continues to expand at a rapid pace, which makes producing high-quality reviews of the evidence onerous (Marshall et al., 2021; Bastian et al., 2010). Even once published, such synopses quickly go stale as new, relevant evidence accumulates (Hoffmeyer et al., 2021; Shojania et al., 2007). The rise of LLMs that are ostensibly capable of producing literature reviews for arbitrary query topics suggests the tantalizing prospect of providing on-demand synopses of the medical evidence on a given topic, automatically. Short of this lofty goal, LLMs such as Galactica may make the process of humans writing syntheses more efficient by providing initial drafts or outlines.
However, despite the excitement around generative LLMs, critical questions remain regarding whether and to what extent domain experts will actually find such systems useful in practice, and the degree to which whatever benefits they confer outweigh the anticipated risks of the technology. Answering such questions requires grounding them in specific tasks and contexts. Here we focus on the important task of medical literature reviews. Might LLM outputs provide a useful aid for writing systematic reviews? Or do they simply output human-like text that is unlikely to provide actual utility? What risks (if any) do domain experts perceive when using LLMs for this use case? As far as we are aware, this is the first effort to interrogate and qualitatively characterize expert views on LLMs for the task of drafting medical systematic reviews. Our hope is that this study will inform the development and evaluation of LLMs that can effectively assist users in writing systematic reviews.

Figure 1: An example of an unsafe and incorrect output generated by Galactica (Greene, 2022). The article generated for a query requesting a research paper on the benefits of eating crushed glass claims that eating crushed glass is beneficial, when in fact evidence has shown that it can lead to high-risk situations (Alves et al., 2022; Ebrahimi et al., 2017).
In brief, we aim to answer the following research questions:

1. What are the perspectives of domain experts with respect to the potential utility of LLMs to aid production of medical systematic reviews?

2. Do domain experts anticipate any potential risks from the use of LLMs in this context?

3. What can we learn from domain experts which might inform criteria for rigorous evaluation of biomedical LLMs?

Generalizable Insights about Machine Learning in the Context of Healthcare
LLMs have come to dominate the natural language processing (NLP) landscape, and now achieve remarkably strong performance across a diverse range of tasks (Brown et al., 2020). This progress has naturally motivated work investigating the use of LLMs for healthcare (Lehman et al., 2023; Singhal et al., 2022). However, such models are often evaluated on benchmark tasks (e.g., medical exam questions; Singhal et al., 2022) that are only tangentially related to the likely uses of LLMs for healthcare in practice.
Discussions regarding the potential risks of LLMs in healthcare have so far been speculative and largely removed from realistic use cases and the views of prospective end users (Harrer, 2023; Lee et al., 2023; Anderson et al., 2023). The generalizable contributions this work offers are therefore two-fold: (1) We report the results of a qualitative study into the wants, needs, and reservations of domain experts in evidence synthesis for the specific and important task of medical systematic review production (which ought to inform evaluation criteria for this task moving forward). (2) We provide an illustrative example of engaging domain experts in health evidence synthesis to assess realistic potential benefits and risks of LLMs, grounded in the context of a real-world task. Our hope is that this grounded evaluation advances the discourse around LLM technology beyond the dominant polarized positions, which tend, in our view, to overstate either the potential benefits or the risks of LLMs in the healthcare setting.

Related Work
LLMs such as GPT-3 (Brown et al., 2020) and its successors, which are stacks of Transformer layers (Vaswani et al., 2017) typically pre-trained over large unlabeled corpora via "self-supervised" objectives, have come to dominate NLP, yielding state-of-the-art results across a range of standard tasks. Such models are especially good at generative tasks like text summarization. Indeed, recent work has established that even without explicit training, GPT-3 outperforms fully supervised models on standard summarization tasks (Goyal et al., 2022). While LLMs trained on "open domain" content have been shown to perform well across domains, a body of work indicates that specializing LLMs by continuing pretraining "in-domain" can further improve performance (Lehman et al., 2023; Webersinke et al., 2021; Gururangan et al., 2020; Lee et al., 2020b; Beltagy et al., 2019; Huang et al., 2019). This has motivated the development of LLMs pre-trained specifically on biomedical and clinical texts, such as PubMed abstracts or the MIMIC clinical notes dataset (Johnson et al., 2016). Examples of biomedical and clinical pre-trained models include BioBERT (Lee et al., 2020b), SciBERT (Beltagy et al., 2019), ClinicalBERT (Huang et al., 2019), BioMed-RoBERTa (Gururangan et al., 2020), ClinicalT5 (Lehman et al., 2023; Lu et al., 2022), and BioGPT (Luo et al., 2022).

With respect to systematic reviews of medical literature, many efforts have attempted to expedite the production of such synopses using ML and NLP, for example by helping to identify relevant studies (Lee et al., 2020a; Miwa et al., 2014; Wallace et al., 2010; Cohen et al., 2006) or automating data extraction (Gu et al., 2021; Wallace et al., 2016; Jonnalagadda et al., 2015). In this work we focus on the task of generating narrative summaries of the evidence directly. Most work on this task has treated it as a standard multi-document summarization task, where one assumes the inputs (abstracts of articles describing relevant trials) are given (Wang et al., 2022; DeYoung et al., 2021; Wallace et al., 2021).
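Returning to the in-domain specialization mentioned above: it can be as simple as resuming a causal language modeling objective on domain text, in the spirit of Gururangan et al. (2020). The following is a minimal sketch only; the small GPT-2 checkpoint and the placeholder file of PubMed abstracts are illustrative assumptions standing in for the much larger models and corpora used in the cited work.

```python
# Hedged sketch: continued ("domain-adaptive") pretraining of a generic
# causal LM on in-domain biomedical text. "pubmed_abstracts.txt" is a
# hypothetical one-abstract-per-line corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-biomed", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM
)
trainer.train()
```

Note that the cited models differ in how they apply this objective: BioMed-RoBERTa continues pretraining from a general checkpoint (as sketched here), whereas models such as BioGPT and BioMedLM are trained from scratch on biomedical text.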
Here we focus on the more audacious approach of asking the model to generate a review on a given topic without external references (i.e., on the basis of its pre-trained weights alone). This is the sort of functionality that Galactica (Taylor et al., 2022) ostensibly offers, and it is representative of the broader trend in how LLMs are being used. Moreover, assuming all relevant trial reports are provided as input is unrealistic in practice, as it would first require completing the time-consuming, rigorous process of search and citation screening to identify this set; one of the promises of LLMs is that they might implicitly perform such a search by directly generating a synopsis of relevant studies (ingested during pre-training). Galactica in particular adopts a relatively standard decoder-only stack of Transformer layers. It is trained on over 48 million papers, textbooks, lecture notes, reference materials, text representations of compounds and proteins, scientific websites, and other sources of scientific knowledge (Taylor et al., 2022).
In addition to Galactica (6.7B parameters), we consider two other representative models. The first is BioMedLM (formerly PubMedGPT; Bolton et al. 2022), a smaller model (2.7B parameters) trained on 16 million PubMed abstracts and 5 million PubMed Central full texts. We also consider ChatGPT (February 13 and March 23 versions; OpenAI 2022), which, while not trained explicitly for biomedical tasks, has demonstrated considerable flexibility and strong performance across domains.
Risks of generative models

While LLMs may offer benefits in various healthcare (and healthcare-adjacent) settings, prior work has also highlighted that they bring risks. In particular, LLMs can cause material harm by disseminating poor or false medical information (Bickmore et al., 2018; Miner et al., 2016). For example, a group of medical practitioners prompted a GPT-3-based chatbot to provide advice on whether a fictitious patient should "kill themselves," to which it responded "I think you should" (Quach, 2020).
However, prior work establishing the risks of LLMs (Weidinger et al., 2022) has often done so divorced from any specific context. This is in part because LLMs are generally built without specific applications in mind (Rauh et al., 2022). In this study we aim to contextualize the potential benefits and harms of LLMs for a specific healthcare application by grounding the discussion in the task of producing medical systematic reviews. Inspired by similar recent work to characterize clinician wants and needs from ML systems (Tonekaboni et al., 2019), we adopt a qualitative approach, and recruit and interview domain experts with varying levels of experience synthesizing medical evidence.

Methods
We applied an upstream stakeholder engagement method (Corner et al., 2012; Unerman, 2010; Pidgeon and Rogers-Hayden, 2007), which involves eliciting responses from domain experts prior to implementing a model or system. We used an intentional sampling approach to recruit participants for interviews, aiming to include experts with diverse levels of experience in medical systematic review production (from methodologists, practitioners, clinical researchers, journal editors, and publishers working in research synthesis, to clinical guideline experts who make use of such reviews). We describe the recruited participants in greater detail in Section 3.5. We opted for a qualitative study design because we are interested in characterizing the general views of such domain experts on LLMs, and surveys would have artificially constrained responses. We developed an interview guide based on our research questions. We provide details about interview questions in Appendix A.

Figure 2: A schematic of our study, in five steps. In Step 1, we conducted a search of recently published medical systematic reviews from the Cochrane Library of Systematic Reviews, an evidence synthesis database. Step 2 involved using the titles from Step 1 to prompt the LLMs to generate evidence summaries. Step 3 involved sampling the outputs generated in Step 2. Step 4 entailed interviewing domain experts. Finally, Step 5 involved conducting the qualitative analysis of the interview transcripts.
During the interviews, we shared with participants (domain experts) samples of typical outputs from LLMs prompted to generate evidence summaries for a medical query; these acted as probes (Gaver et al., 1999) to spark discussion. For this exploratory study, we used the following generative LLMs: Galactica 6.7B (Taylor et al., 2022), BioMedLM 2.7B (Bolton et al., 2022), and ChatGPT (OpenAI, 2022). These LLMs were chosen because they are able to generate biomedical and clinical text. A schematic of the entire process we took for this qualitative study is provided in Figure 2.

Steps 1 and 2: Generating Evidence Summaries Using LLMs
In February 2023, we queried the most recently published titles in the Cochrane Library of Systematic Reviews for each of the 37 Cochrane medical topics and, after removing duplicate titles, used those titles as prompts to generate the evidence summaries. We specifically chose titles of systematic reviews that were published or updated after the latest training dates of the LLMs we considered for this study, to mitigate the risk of the models having seen the latest versions of the reviews in training.
The topics span a diverse range of subjects, from "Allergy & Intolerance" to "Health Professional Education." A full list of the medical topics and titles is available in our GitHub repository (https://github.com/hyesunyun/MedSysReviewsFromLLMs). Using four different prompting approaches with the three models, we generated a total of 128 evidence summaries. We provide details on how we generated these outputs, including the specific prompts used, in Appendix B.1.
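As a concrete illustration of Step 2, the sketch below prompts an open checkpoint with a review title. The HuggingFace model ID and the single title-based template shown are assumptions for illustration only; the exact prompts used in the study are listed in Appendix B.1.

```python
# Hedged sketch of Step 2: prompting a model with a Cochrane review title.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-6.7b")

# Example title (this one also appears among the Appendix B.3 samples).
title = ("Antenatal dietary supplementation with myo-inositol "
         "for preventing gestational diabetes")
prompt = f"Title: {title}\n\n"  # one illustrative title-based template

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Looping the analogous call over the deduplicated titles, the three models, and the four prompt templates produced the pool of 128 evidence summaries described above.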

Generating Evidence Summaries Directly Aligned with Individual Expertise
Given the range of clinical topics we considered, individual participants may have had little familiarity with the subject matter represented in the random samples we presented to them (Step 3). To ensure that participating domain experts were shown at least one output related to a topic with which they were intimately familiar, we asked them to provide the title or topic of a medical systematic review that they had recently worked on (we requested this prior to each interview).

Step 3: Selecting a Diverse Sample of Evidence Summaries
After generating a set of outputs, we conducted a rapid inductive qualitative analysis (Vindrola-Padros and Johnson, 2020; Gale et al., 2019; Taylor et al., 2018) of the generated summaries. We manually identified six sample outputs that featured many of the characteristics identified during this analysis. We selected a subset of typical outputs to narrow the scope of our exploratory study to potential benefits and risks. The selected samples were included in an online website (https://llm4msr.netlify.app/) as interview materials to introduce participants to LLMs and their outputs. We reproduce these selected model outputs (which were used for the interviews) in Appendix B.3, with their sample number and the model used to produce them. During interviews, we presented each participant with two LLM outputs randomly sampled (without replacement) from these initial six, along with the generated output directly aligned with individual expertise discussed above.

Step 4: Interviews
Between March and April 2023, we interviewed 16 domain experts, recruited via email invitations to participate in a non-compensated remote interview conducted over Zoom. Each semi-structured interview lasted about 60 minutes, and we audio-recorded these sessions. We began interviews by obtaining verbal consent for the interview and recording, and then delved into each participant's background in writing or otherwise contributing to systematic reviews. Next, we provided a high-level overview of LLMs before briefly discussing each participant's prior experience (if any) using AI to aid their work. We then shared with participants the aforementioned two randomly sampled summaries and the individually relevant evidence summary (Section 3.1). Participants reviewed each example in sequence, and we asked them questions to elicit their thoughts on the potential uses and harms of each type of output in the context of writing systematic reviews. Lastly, we asked (in an open-ended manner) for their overall opinions on the use of LLMs for writing systematic reviews and any additional features they would want from models like these. At the very end of the interview, we asked participants for permission to include their names in the acknowledgements section. All LLM outputs shown to participants were shared via a publicly available website. We reproduce all interview questions in Appendix A.

Step 5: Qualitative Analysis
After 16 interviews, we amassed 847 minutes of audio recordings. We used the transcription software Rev.com (https://www.rev.com/) to transcribe the audio recordings. We then performed an inductive thematic analysis (Braun and Clarke, 2012) to characterize specific instances of the potential usefulness or harmfulness of model-generated evidence summaries, as raised by the domain experts. The first author used NVivo to conduct the first round of open and axial coding (Preece et al., 2015; Corbin and Strauss, 2014). Over the course of the interviews and analysis, the research team met regularly to discuss codes and emergent themes from the initial coding, refining them iteratively and finding agreement.

Cohort
We interviewed a total of 16 participants, who worked as researchers or methodologists (11), academicians (9), journal editors (5), clinicians (4), and a guideline developer (1), and who came from the USA (9), UK (3), Australia (2), China (1), and Greece (1). All participants had contributed to a number of systematic reviews and meta-analyses as authors, advisors, or reviewers. Eight participants had contributed to more than 100 reviews, four to 25-100 reviews, three to 10-25 reviews, and one participant to fewer than 10 reviews. The systematic review topics that participants had previously worked on spanned a diverse range of subjects, including nutrition, vaccines, mental health, medical nursing, ophthalmology, pediatrics, women's health, cardiovascular diseases, toxicology, drug therapy, and sexual well-being. Table 1 reports participant characteristics.
Only one participant (P4) had some experience using an LLM (ChatGPT) for tasks related to systematic reviews. Six participants had used ChatGPT, but not for systematic review work. A professional journal editor (P10) had not used any AI systems for their work. Most participants had used some sort of AI tool to aid systematic reviewing, for tasks such as abstract and article screening and assessing study risk of bias (e.g., DistillerSR; https://www.distillersr.com/). These specialized tools offer some degree of AI-based assistance for things like classifying abstracts and extracting data; none attempt to generate review drafts using LLMs.

Results
After conducting a qualitative analysis of the interview transcripts, we identified a number of potential uses and harms shared by the participants. We next detail these findings, which are summarized in Figure 3.

Potential Uses
Participants said that LLMs are not adequate for producing medical systematic reviews directly, given that they do not follow existing formal review methods and guidelines (Van Tulder et al., 2003). Therefore, we focused on assistance, and asked participants about the potential ways in which LLMs might aid medical systematic review production. We derived the following key themes on the potential uses that participants expressed: first draft, framework or template, plain language summaries, suggestions (autocompletion), distilling information, crosschecking, and synthesizing or interpreting inputs (i.e., multi-document summarization). Table 2 provides a summary of the themes on potential uses, accompanied by representative quotes.

Figure 3: Potential uses and harms of LLMs for assisting in writing medical systematic reviews, as gleaned from interviews.

All participants found at least some of the samples of the LLM outputs to be well-written and plausible as "real" systematic reviews, given their structure and language use. About half of the participants expressed that LLMs would be useful for writing the first draft of a review. P3, a researcher and professor who has contributed to over 30 medical systematic reviews, indicated that they would use the first draft from an LLM to create subsequent drafts: "And I think this is going to be a common theme, I would tend to use it as a first draft." Furthermore, P12 (a senior researcher in evidence synthesis and author of 25 SRs) remarked: "If I were to use this, then it would be, I guess, a helpful draft for me to build upon. I think I'd probably say a bit more in certain sections than maybe in others or probably a lot more." In addition, nine of the participants viewed LLMs as a potentially useful means to generate a scaffolding, framework, or template for writing systematic reviews, for example by generating section headings and subheadings. In reference to sample 3, P12 described how the structure that LLMs produce in the output might help novices: "I think I liked the structuring of the introduction as it went through. The three paragraphs are a good prompt and a good model for other authors who are starting off doing a review from scratch and never done one and not really know what to talk about." Relatedly, P12 further described how suggestions from an auto-completion plugin might be helpful for writing drafts.
During the interviews, multiple participants noted that LLMs might be useful for writing plain language summaries or short abstracts by distilling information from longer texts. P5, an epidemiologist with a clinical background and extensive experience in evidence synthesis, described how LLMs could possibly help with writing texts for the general public: "Whether the system could create an output for the public that is based on the review results and the review huge 70 page report. Whether you can train the system with good quality systematic reviews so that the system can create a text like the ones we saw which are perfect for the public to read..." Some thought that LLMs could be helpful for summarizing or interpreting the results of specific studies or data analyses, as synthesizing and interpreting large amounts of data and text is currently a time-consuming manual process. P5 and P7 also noted that summaries generated by LLMs might be a good way to cross-check manually composed drafts, because automatically generated summaries might reveal biases in the writing, or perhaps suggest missing studies.
In addition to potential uses for drafting systematic reviews, participants identified other potential uses for LLMs (and AI more generally) for individual tasks that are part of the review production process. We report details of these findings in Table A3 of Appendix C.

Table 2: Summary of the potential uses of LLMs for drafting medical systematic reviews, with exemplary quotes from participants.

First draft
Having LLMs provide a first draft of a medical systematic review. Humans can intervene by revising and building upon the first draft generated by a model.
"Well, I mean it could be a first pass, at least as a draft. And I mean this is also how in the real world systematic review, the technology assessment reports that I have led in years past are done. There are multiple drafts. And so this could be used as a preliminary, the first, and it would save a lot of time already." -epidemiologist and clinician with over 100 SRs (P1)

Framework or template
Having LLMs provide a framework or template, including important headings and subheadings, which can be helpful for inexperienced medical systematic reviewers.
"It seems to be pretty good at putting together a scaffolding or a framework that you could use to write from. I could see going to it and saying, 'Okay, ChatGPT, talk to me. Give me the subheadings for my dissertation."' -researcher in evidence synthesis and author of 35 SRs (P8)

Suggestions (autocompletion)
Having LLMs provide suggestions like autocompletion to authors as they write their draft of a systematic review.
"I guess you could sort of, the way in Gmail it sort of populates text for you [...] I guess an ideal world maybe could be where you sort of put in the subheading 'Study Selection' and you just start writing, and then it automatically pre-fills 'authors independently screened articles'. And that would maybe make things a bit faster for some people and get them to report things in a way that's most complete and sort of adheres to reporting guidelines that would be good." -senior researcher in evidence synthesis and author of 25 SRs (P12)

Plain language summaries
Having LLMs generate summaries of medical evidence that are easy to read for laypeople and public consumers.
"Let's say we want to disseminate the review to the press or to the general public, then I think any sort of model would be useful because we want to make sure that it's pitched in a moderate level so that it's not too, doesn't read too childish in a way, but it's not too technical. So I think any of the models will probably be very useful in that." -professional journal editorial staff and author of about 100 SRs (P16)

Distilling information
Having LLMs distill large amounts of text and summarize it into short abstracts can be beneficial, depending on context and purpose.
"If I were to be using it to write a small section of the results, the fact that it can take the results of a paper and summarize them down into a couple sentences." -researcher in evidence synthesis and author of 35 SRs (P8)

Synthesizing or Interpreting inputs
Having LLMs synthesize or interpret the studies or analysis data provided by humans as input and generate narrative text.
"The most helpful part is for the model to be able to, let's say, look at statistical analysis, look at numbers, basically look at a graph, and then be able to generate at least some sort of a standard text so that they know, oh, a result that looks like this means that it has a significance in what way, in what direction."professional journal editorial staff and author of about 100 SRs (P16)

Crosschecking
Crosschecking human-written summaries or drafts of systematic reviews against LLM-generated summaries can be helpful in identifying potential gaps.
"That is very interesting as also a means to stimulate discussion, cross validate our results, and also identify emerging trends in the literature." -epidemiologist with clinical background and professor in evidence synthesis with over 100 SRs (P5)

Current Concerns
Our participants identified multiple concerns when we presented them with sample LLM-generated evidence summaries. All participants said that the LLMs are not ready to be used for producing medical systematic reviews directly. Specifically, they expressed concern that the LLM-generated outputs were challenging to verify in terms of quality and accuracy; this made them skeptical of the texts. They said that the outputs showed lack of specificity, lack of comprehensiveness, lack of clarity, unknown provenance, strong conclusions without evidence, fabricated references or statistics, and missing risk of bias assessment. Table 3 provides a summary of the themes and representative direct quotes from participants.

Eight participants highlighted the concern of outputs being too broad, meaning the summaries were insufficiently granular in detail. An example of this is discussing broad classes of interventions and/or outcomes. P1, an epidemiologist and clinician who has contributed to over 100 systematic reviews, described how generic sample 5 was by saying, "This [abstract] is very generic, and none of those statements probably are wrong or that, I mean to say another way, or the statement probably are correct, but doesn't say too much either." Nearly all the participants reported that some of the outputs lacked comprehensiveness, finding many outputs to be missing important information. For example, generated summaries sometimes only provided the background for the topic, or summaries of one or a few relevant studies. P5, an epidemiologist with a clinical background and professor in evidence synthesis, described sample 6 as lacking any information for assessing risk of bias and a comprehensive representation of the evidence: "This is not comprehensive. It focuses on the results, on the numerical results. It cannot address the risk of bias like we do in systematic reviews. And there is a partial representation of the evidence." When a risk of bias assessment is missing from a review, the evidence from included studies may not be useful, as readers lack sufficient contextualizing information.
One of the most important reasons participants gave for finding the LLM-generated outputs difficult to trust was not knowing where the studies included in the summaries came from (unknown provenance). P2, a former professional journal editor and guideline developer with experience in 100s of reviews, said, "I think the provenance of it is a real [issue...] I think for systematic review, being able to say, this piece of data in this analysis came from this RCT (randomized control trial) and that's published in this paper and we can track it all the way back, is really, really important to give people credibility and scientific reproducibility." Compounding the problem of unknown provenance, many of the references generated by the LLMs were difficult or impossible to find via simple searches, making it difficult to check for potential references to studies included in the reviews.
Furthermore, strong conclusions offered in the generated outputs without accompanying solid evidence were noted as a problem (and risk) by participants. In describing one of the LLM-generated reviews on a topic they had worked on, P3, a researcher and professor who has contributed to over 30 medical systematic reviews, said, "But there are a couple of places where it found evidence that we did not find, and it made jumps to conclusions that we did not find evidence for, and therefore we did not make those conclusions in our systematic review."

Table 3: Summary of current concerns about LLMs for the medical systematic review process, with exemplary quotes from participants.

Concern Description Quote
Lack of specificity

Some of the LLM outputs are very broad or generic and are not specific enough to be helpful.
"This is a very generic [abstract], and none of those statements probably are wrong or that, I mean to say another way, or the statement probably are correct, but doesn't say too much either." -epidemiologist and clinician with over 100 SRs (P1)

Lack of comprehensiveness
Some participants expressed concern that some of the LLM outputs are not comprehensive and are missing important information, such as alternative outcomes, GRADE assessment, and a full representation of the evidence. Sometimes the focus is very narrow and reports on only one aspect of the topic.
"I think most bothersome is it's labeled as an abstract but doesn't read like an abstract. There's nothing more than an introduction to the problem and the objectives of what this review is about. So it's very incomplete. It's like you have only written the first two paragraphs of your abstract." -epidemiologist and professor with editorial experience and experience in 100s of SRs (P4)

Lack of clarity
Participants reported that some LLM-generated evidence summaries were more difficult to read because the language was less clear.
"Just the writing is just so clunky and exhausting to read through and not really, as I said, it's not really coming up with an overall conclusion." -senior researcher in evidence synthesis and author of 25 SRs (P12)

Unknown provenance
The origin or source of the studies is unknown for some of the evidence summaries generated by the LLMs.
"I guess it doesn't reference which systematic review, but the fact that it's a systematic review is encouraging. But then of course, I don't know if it really has referenced it. I dunno if it exists." -professional journal editorial staff with experience in about 250 SRs (P9)

Missing risk of bias assessment

Some participants expressed concern that the models are not able to address risk of bias as systematic reviews do.
"It cannot address the risk of bias like we do in systematic reviews." -epidemiologist with clinical background and professor in evidence synthesis with over 100 SRs (P5)

Fabricated references and statistics
Some of the LLM-generated outputs included fabricated or fake references and statistics (hallucinations).
"I mean, the concern is the obvious that you can have falsified science, falsified data, falsified conclusions, and very convincing packaging of those in the end for used by known expert. But I think even an expert can be fooled by this." -clinical researcher, professor, and author of over 30 reviews (P15)

Strong conclusions without evidence
Some participants expressed concern that conclusions are strongly worded even when there is no strong evidence to support the claim.
"[...] The current evidence suggests that it's safe, that that's actually pretty strong. So this current evidence is safe, but it does not have a significant effect on the prevention or treatment. So I think that a lot of people will turn to this and look at the conclusions, and then they're going to think that this is fine, but we really have no clue where those studies came from. So I would be very worried about what this means." -research methodologist and author of over 10 SRs (P7)

Potential Downstream Harms
During the interviews, we asked participants about any potential downstream harms that automatically generated reviews (such as the samples that we showed them) might cause. In particular, participants shared their thoughts on potential risks to clinicians and consumers seeking medical evidence, as well as to systematic review authors and clinical researchers. We identified the following key themes on potential downstream harms: misleading conclusions, misinformation, harms to consumers directly interacting with LLMs for medical evidence, unclear accountability for harmful outputs, hindering the creativity of authors, and proliferation of bad reviews. Table 4 provides a summary of the themes and representative quotes.

Ten participants expressed reservations that LLMs can provide misleading conclusions (effectively misinformation). There was particular concern about the potential risks of strongly worded conclusions without sufficiently supportive evidence (as mentioned above). Given the formal, authoritative scientific writing style of model outputs, consumers might assume that they are factual, even when they misrepresent the corresponding evidence. In this way, uninitiated readers stand to be misled. P8, a researcher in evidence synthesis and author of 35 SRs, noted how even small errors in numerical data can lead to misleading conclusions: "If you're reporting that kind of detail, if you get the numbers wrong or you associate the wrong number with the wrong outcome, you could be misleading people." Verifying the numerical data present in LLM-generated summaries is challenging owing to the provenance issues discussed above, so it can be difficult to ascertain the validity of conclusions.
Furthermore, eleven participants expressed some concern regarding the prospect of individuals directly interacting with LLMs to acquire overviews of evidence. P4, an epidemiologist and professor with editorial experience and experience in 100s of SRs, noted, "I think general public may misunderstand or misuse the outputs from these large language models. To some extent, it could be more dangerous than Google because when you Google information, again, you have the same problem. You don't know, at least there are maybe more trustworthy sources of information if you're aware of." P4 further noted the risks resulting from the lack of accountability of machine-generated texts: "There are authors there, references you could criticize about the validity of the information and this I suppose too, but sort of behind the scene, there's, it's a computer program, it's a computer model, is that really accountable for anything? So that will be my concern, basically the accountability." P10, a professional journal editor with 10 years of experience, expressed similar concerns about accountability with LLMs: "Yeah, I mean, one of the things we think a lot about is about accountability. So if in publishing errors come to light through no one's fault, but things happen and the scientific record needs to be corrected, we need to go back to people and ask them to correct the work that they've done. But that accountability, I don't understand how that would work for something like this."

In addition to potential downstream harms to clinicians and consumers, some participants shared how LLMs can also harm clinical researchers. P12, who described writing as a very rewarding aspect of an academic job, viewed LLMs as a tool that could hinder researchers' creativity in writing. They said, "But for me, I think that it would just gets in the way of creativity and not allowing you to think original thoughts by just populating a large language model based text and tinkering with it. Yeah, I think because there is a huge risk." Furthermore, four participants said that LLMs could become a large source of bad reviews, as they copy the methods of the many average reviews that are mediocre in quality. They viewed this as potentially contributing to research waste (Glasziou and Chalmers, 2018) by proliferating bad medical reviews. P12 said, "The sort of perpetuation of bad methods being used because it's sort of training on a large number of studies that have used average methods, and it kind of just perpetuates that."

Unclear accountability for harmful outputs
When models start generating medical systematic reviews, accountability can potentially become a problem, as the "authors" of the reviews are computer programs or models and not humans.
"Yeah, I mean, one of the things we think a lot about is about accountability. So if publishing errors come to light through no one's fault, but things happen and the scientific record needs to be corrected, we need to go back to people and ask them to correct the work that they've done. But that accountability, I don't understand how that would work for something like this." -professional journal editor with 10 years of experience (P10)

Hindering creativity
Overreliance on LLMs can potentially hinder creativity in the writing of research findings.
"But for me, I think that it would just gets in the way of creativity and not allowing you to think original thoughts by just populating a large language model based text and tinkering with it. Yeah, I think because there is a huge risk. If you generate this sort of text and then feel like you don't have to really add much more because it's enough, and then you're just going to perpetuate really average reviews." -senior researcher in evidence synthesis and author of 25 SRs (P12)

Proliferation of bad reviews
LLMs can potentially create research waste by proliferating large quantities of reviews that perpetuate current methods that are not the best.
"So it provides p-value and areas under the curve and optimal cutoffs, all of which I think are specious and non-reproducible for continuous measures. So this is not an abstract I would write, but it is a good example of the current regrettable practices in medical publishing."clinician and researcher in evidence synthesis with experience in 7 SRs (P6)

Bridging the Gap
After identifying potential uses and downstream harms of LLMs as aids for producing medical systematic reviews, we asked participants what would make them feel more comfortable using LLMs in this context. Four of the interviewed domain experts said that having references (titles and authors of studies included in the summaries), and knowing that the outputs are genuinely derived from these, would permit one to verify outputs and in turn inform decisions about whether or not to trust them. A couple of participants mentioned that explicit risk of bias assessment could provide important details for judging how trustworthy the presented evidence is. P1, P7, and P10 emphasized a need for specificity in reporting, particularly regarding PICO (population, intervention, comparison, and outcome) information, to provide relevance to readers.

Another factor that could make domain experts feel more comfortable using LLMs is having actual subject matter experts verify the accuracy of the outputs by cross-checking every statistic and statement. A human-in-the-loop process with extensive verification would help increase trust in the system and its outputs overall, as it is difficult even for systematic review experts to verify the quality of the outputs. But even assuming this is the case, P3 speculated about a trade-off between the efficiency gains of adopting LLMs and the time required to verify the outputs: "But at some point though, the efficiency gains of doing that need to be weighed against the time that would need to be spent to verify every number that's written. And if that ends up being too much of time taken to verify, then the efficiency gains may not be worth it."

Participants also emphasized the importance of transparency. P15, a clinical researcher, professor, and author of over 30 reviews, described the need to include banners that clearly denote that a review was generated by an LLM: "I think there should be a banner that says that this is generated by ChatGPT with blasting colors and proceed with caution and verify. Something that is not in small print." With regard to transparency about which studies contributed to a given review output, P10, a professional journal editor with 10 years of experience, said, "I guess, transparency to me [is] having an idea if it was a systematic review or whatever. Something that was delivering an answer of having an idea of maybe where that comes from or where it's drawn from. I mean, the example of a systematic review, definitely the study, the number of study, all the usual stuff that we look for in a systematic review."

One solution to the problem of unknown provenance that participants suggested was to provide the actual abstracts or full texts of studies to be summarized, treating the task more as traditional multi-document summarization of scientific documents (DeYoung et al., 2021), or perhaps applying retrieval-augmented language model techniques (Lewis et al., 2020). P4, an epidemiologist and professor with editorial experience and experience in 100s of SRs, described another solution: blending LLMs with search engines. "So I think a blended things of incorporating the large language models and the search would be really, really great. If I could just tell my computer, 'Here is a system output, here's the background. Here are five bullet points. Can you put them into a paragraph and put proper citations to all of this, what I just said?' And be able to verify that they should be able to bring up all the PubMed references and sort of highlighting where they got the information from. So you could do a quick verification of that is correct." Table A4 in the Appendix provides a summary of the themes.
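To make the suggested provenance-preserving alternative concrete, the sketch below passes screened abstracts to the model as input and requires inline citations, rather than relying on the model's parametric memory. The model name, prompt wording, and placeholder abstracts are illustrative assumptions, and the search-and-screening step that would produce the abstracts is stubbed out.

```python
# Hedged sketch: grounded (multi-document) summarization with required
# inline citations, instead of closed-book generation from model weights.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder for the output of a real search-and-screening pipeline.
abstracts = [
    ("Example Trial A, 2021 (RCT, n=120)", "...abstract text..."),
    ("Example Trial B, 2022 (RCT, n=85)", "...abstract text..."),
]

sources = "\n\n".join(
    f"[{i + 1}] {label}\n{text}" for i, (label, text) in enumerate(abstracts)
)
prompt = (
    "Summarize the evidence in the numbered sources below. Cite a source as "
    "[n] after every claim, and do not draw conclusions the sources do not "
    "support.\n\n" + sources
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Because every claim must carry a bracketed source index, readers (and authors) can trace statements back to specific trials, directly addressing the provenance and verification concerns raised in the interviews.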

Discussion
In this work, we sought to answer the question: What are the potential uses and harms of using LLMs for aiding the production of medical systematic reviews? By engaging with relevant but varied domain experts about this question in detailed interviews, we hope to better inform criteria for rigorous evaluation of biomedical LLMs for this setting.
Some consistent viewpoints emerged. First, interviewees largely agreed that LLMs are likely to be useful for producing medical systematic reviews, either as a writing tool (e.g., creating initial drafts) or for summarizing data or identified text inputs (akin to more traditional multi-document summarization). However, participants also expressed concerns about the potential downstream harms of such generations. These include the possibility of individuals being misled by confidently composed but inaccurate synopses, which could lead to clinical harms. Exacerbating these issues is the lack of transparency of LLMs; here, we observed that they produce overviews of findings on topics often without explicit references (or sometimes accompanied by hallucinated references).
We discussed with participating experts potential mechanisms to improve LLMs for aiding medical systematic reviews. Some fast progress to this end might be realized via user interface (UI) choices: making it clear that model outputs are intended as drafts to be edited, and using LLMs primarily to scaffold evidence overviews. Slightly longer-term goals might include improving the transparency of LLMs for evidence synthesis, perhaps via semi-parametric (i.e., retrieval-augmented) approaches, and/or mitigating the perennial issue of model hallucination. Based on our discussions with domain experts in evidence synthesis, the critical point is perhaps that humans must remain in the loop to validate and edit model outputs.
With LLMs becoming easier to use, some practical considerations for journals and editors emerge given the possibility of authors using LLMs to draft systematic reviews. We recommend that journals require authors to disclose in their methods whether they used LLMs to draft the review or to find sources. In addition, guideline developers should be transparent about whether they consulted LLM-generated outputs as sources of evidence.
Evaluating LLMs for Medical Systematic Reviews

When designing AI systems for medical systematic reviews, it is important to meet the scientific standards of evidence synthesis. Based on our findings, LLMs should be evaluated on accuracy, transparency, comprehensiveness of included studies, readability and clear structure, and the provision of details that are important to evidence summaries, such as specific PICO elements and methods sections that accurately describe what the model did to generate the outputs. With respect to the actual language of systematic reviews, it is important to ensure that conclusions are written in a way that matches the evidence presented, and that they do not offer definite conclusions when only low-certainty evidence is available.
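As one deliberately hedged way to operationalize these criteria, the sketch below encodes them as a simple expert-rating rubric. The criterion names follow our findings; the Likert scale and averaging scheme are illustrative choices, not part of the study.

```python
# Hedged sketch: the evaluation criteria from this study expressed as a
# rubric that domain experts could fill in for each generated review.
EVALUATION_CRITERIA = {
    "accuracy": "Statistics and claims match the underlying studies",
    "transparency": "Claims are traceable to real, verifiable references",
    "comprehensiveness": "Relevant studies and risk of bias are represented",
    "readability": "Clear structure with standard review sections",
    "specificity": "Explicit PICO elements (population, intervention, "
                   "comparison, outcome)",
    "calibrated_conclusions": "Conclusion strength matches evidence certainty",
}

def score_review(ratings: dict) -> float:
    """Average expert ratings (e.g., on a 1-5 Likert scale) across criteria."""
    missing = set(EVALUATION_CRITERIA) - set(ratings)
    if missing:
        raise ValueError(f"Unrated criteria: {sorted(missing)}")
    return sum(ratings.values()) / len(ratings)

# Example: one expert's (hypothetical) ratings for a generated summary.
print(score_review({
    "accuracy": 2, "transparency": 1, "comprehensiveness": 2,
    "readability": 4, "specificity": 3, "calibrated_conclusions": 2,
}))
```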
Limitations

The findings from this study are necessarily limited given that they capture the views of a relatively small sample of domain experts. However, trading scale for granularity is common in qualitative analysis. In addition, the findings may have limited applicability to future generations of methods, which may give rise to different uses and harms than those identified in this study with current generative LLMs.
Furthermore, the output samples we showed to participants have some limitations. Each participant saw at most three outputs, which might not have fully represented or captured the characteristics of the LLMs identified in our rapid qualitative analysis. This could have limited participants' exposure to the full capabilities of large language models. Also, we did not conduct extensive experiments with prompting strategies for generating summaries; it is known that different prompts can lead to substantially different results (Zhao et al., 2021; Reynolds and McDonell, 2021). Finally, further research is warranted on using LLMs for literature reviews in other domains, as our study focused only on the task of writing medical systematic reviews.

References

Nash Anderson, Daniel L. Belavy, Stephen M. Perle, Sharief Hendricks, Luiz Hespanhol, Evert Verhagen, and Aamir R. Memon. AI did not write this manuscript, or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in sports & exercise medicine manuscript generation. 2023.

Appendix A. Interview Question Guide
The questions below were asked during the semi-structured interviews with the domain experts to gain their perspectives regarding the uses and harms of LLMs for medical systematic reviews. The questions were divided into four broad categories: Background, Previous Experience Using AI, Thoughts on Outputs from LLMs, and General Thoughts on LLMs for the Task. Following the nature of semi-structured interviews, we asked follow-up questions or skipped some questions when appropriate, given the content and context of the interview.

Background
1. How many medical systematic reviews have you done or contributed to?
2. What kinds of contributions have you made to reviews for biomedical literature?
Previous Experience Using AI

1. Have you ever used AI to help you screen articles or draft systematic reviews?
2. Tell me about your experience using AI for this specific task.

B.2. LLM Outputs Analysis Results
Table A1 provides the 12 general concepts identified during the qualitative analysis of the LLM-generated outputs, accompanied by a description and an example for each concept.

Outputs that report contradictory statistics or do not align with the conclusions or content of the human-written abstract.

Model-generated conclusion: Mindfulness-based interventions have shown positive effects on mental wellbeing in students and junior doctors.
Human-written conclusion: The effectiveness of mindfulness in our target population remains unconfirmed.

Numerical Values
Outputs with any numerical values such as p-values, risk ratio, mean difference, AUC, and ROC.

The meta-analysis found that antenatal dietary supplementation with myoinositol was associated with a significantly lower risk of developing GDM compared to placebo or standard care (RR 0.69, 95% CI 0.52-0.92, P = 0.01).

Undesirable Outputs
Outputs that are not reviews, contain hallucinated (fake) references, provide irrelevant information, or include misspellings.

ChatGPT
Gestational diabetes mellitus (GDM) is a common complication of pregnancy that can lead to adverse maternal and neonatal outcomes. Myo-inositol is a naturally occurring sugar alcohol that has been proposed as a potential dietary supplement for preventing GDM.
Here is a review on antenatal dietary supplementation with myo-inositol for preventing gestational diabetes: Title: Antenatal dietary supplementation with myo-inositol for preventing gestational diabetes: a systematic review and meta-analysis of randomized controlled trials Authors: Li G, Zhang Z, Zhang H, et al. Published in: Journal of Obstetrics and Gynaecology Research (2021) Overview: This review aimed to evaluate the effectiveness and safety of antenatal dietary supplementation with myo-inositol for preventing GDM. The authors conducted a systematic search of multiple databases and identified eight randomized controlled trials involving 1,511 pregnant women. The studies compared the use of myo-inositol supplements to placebo or standard care and assessed the incidence of GDM as the primary outcome. Findings: The meta-analysis found that antenatal dietary supplementation with myoinositol was associated with a significantly lower risk of developing GDM compared to placebo or standard care (RR 0.69, 95% CI 0.52-0.92, P = 0.01). The authors also found a significant reduction in the incidence of preterm delivery in the myo-inositol group compared to the control group (RR 0.39, 95% CI 0.18-0.85, P = 0.02). No significant differences were found between the groups in terms of other secondary outcomes, including birthweight, mode of delivery, or neonatal outcomes. The review did not identify any serious adverse events associated with myo-inositol supplementation. Conclusion: The authors concluded that antenatal dietary supplementation with myoinositol may be effective in reducing the risk of GDM and preterm delivery. However, further studies are needed to confirm these findings and determine the optimal dosage and duration of myo-inositol supplementation.

ChatGPT
Intraventricular haemorrhage (IVH) and encephalopathy of prematurity (EOP) are common complications in preterm infants and can lead to significant neurological disabilities. Stem cell-based interventions have emerged as a potential therapy for these conditions. Here is a review of the evidence on stem cell-based interventions for the prevention and treatment of IVH and EOP in preterm infants: Title: Stem cell-based interventions for the prevention and treatment of intraventricular haemorrhage and encephalopathy of prematurity in preterm infants: a systematic review and meta-analysis Background: The aim of this systematic review and meta-analysis was to evaluate the safety and efficacy of stem cell-based interventions for the prevention and treatment of IVH and EOP in preterm infants. Methods: A systematic search was performed in multiple electronic databases up to March 2021. Randomized controlled trials (RCTs) comparing stem cell-based interventions with placebo or standard care for the prevention or treatment of IVH or EOP in preterm infants were included. The Cochrane risk of bias tool was used to assess the quality of the included studies. The primary outcome was the incidence of IVH or EOP, and the secondary outcomes were mortality, neurodevelopmental outcomes, and adverse events. Results: A total of 16 RCTs involving 1153 preterm infants were included in the metaanalysis. Stem cell-based interventions included mesenchymal stem cells (MSCs), neural stem cells, and umbilical cord blood cells. The risk of bias was high in most of the studies. The meta-analysis showed no significant difference in the incidence of IVH or EOP between the stem cell-based intervention group and the control group (risk ratio [RR] 0.92, 95% confidence interval [CI] 0.78-1.09). There was also no significant difference in mortality (RR 0.92, 95% CI 0.63-1.33) or adverse events (RR 0.98, 95% CI 0.79-1.21) between the two groups. However, stem cell-based interventions were associated with a significantly reduced risk of cerebral palsy or developmental delay (RR 0.58, 95% CI 0.40-0.85). Conclusion: The current evidence suggests that stem cell-based interventions are safe but do not have a significant effect on the prevention or treatment of IVH or EOP in preterm infants. However, these interventions may improve long-term neurodevelopmental outcomes in these infants. Further high-quality RCTs are needed to confirm these findings and to determine the optimal timing, dose, and route of administration of stem cell-based interventions.

Galactica
Title: Brain natriuretic peptide and N-terminal brain natriuretic peptide for the diagnosis of haemodynamically significant patent ductus arteriosus in preterm neonates
Abstract:
Background and aims: The aim of this study was to evaluate the diagnostic accuracy of plasma brain natriuretic peptide (BNP) and N-terminal brain natriuretic peptide (NT-proBNP) in preterm neonates with patent ductus arteriosus (PDA).
Methods: Thirty-four preterm neonates with a clinical diagnosis of PDA were included in the study. The patients were divided into two groups according to their clinical and echocardiographic results. Group 1 included 17 preterm neonates with haemodynamically significant PDA (hsPDA), and Group 2 included 17 preterm neonates with no PDA. BNP and NT-proBNP were measured using enzyme-linked immunosorbent assay (ELISA) kits.
Results: Plasma BNP and NT-proBNP levels were significantly higher in preterm neonates with hsPDA than in those without PDA (p < 0.001). The receiver operating characteristic (ROC) curves showed that plasma BNP and NT-proBNP levels had a good diagnostic accuracy for the detection of hsPDA in preterm neonates (area under the curve [AUC]: 0.851, p < 0.001 and AUC: 0.866, p < 0.001, respectively). The optimal cut-off values of plasma BNP and NT-proBNP levels for the detection of hsPDA were 555 pg/mL (sensitivity 76.5%, specificity 82.4%) and 589 pg/mL (sensitivity 70.6%, specificity 82.4%), respectively.
Conclusions: Plasma BNP and NT-proBNP levels can be used as a diagnostic tool for hsPDA in preterm neonates.

Galactica
Title: Treatment of dental and orthodontic complications in thalassaemia
Abstract: Thalassaemia is a common inherited disorder in Southeast Asia. Thalassaemia patients often present with a wide range of dental and orthodontic complications. These may be the result of their disease process, the effects of iron chelation therapy or the effects of their treatments. Management of thalassaemia patients with dental and orthodontic problems requires a multidisciplinary approach. This paper aims to review the dental and orthodontic complications of thalassaemia and the treatment approaches to these complications.

Appendix C. Additional Codes and Representative Quotes
The tables below provide additional codes and representative quotes from participants identified as part of the study. Table A3 defines each category of potential non-summarization uses of LLMs for medical systematic reviews, with exemplary quotes. Participants saw potential for LLMs to automate some of the mundane tasks of producing systematic reviews. P15, a clinical researcher and professor, said, "[LLMs can] do all the hard work, searching the literature, finding the right papers, and if the papers are machine readable, extract the data in reproducible ways into tables, and then it will be up for the expert to conduct the right analysis of the methods. Use the data in ways that are answering the questions of the systematic review. So that's where I see the utility in accelerating those painful steps of compiling the literature, finding the papers, obtaining the data from table spreadsheets, whatever they may be, and producing analyzable tables. I dunno if that's aspirational, but that's really where I see the value." Table A4 defines each category of what could make experts feel more comfortable using LLMs in the medical systematic review process, with exemplary quotes.

Generating search strings/strategies
LLMs for generating search strings or search strategies. "I think potentially with assisting with search terms as well, and developing your search strategy and suggesting synonyms. And maybe it can even draft a first search strategy, which you could then review and discuss with an information scientist as well." -professor, research methodologist, and author of dozens of SRs (P11)
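As a concrete illustration, here is a minimal sketch of the kind of first-pass boolean search string an LLM might be asked to draft for an information scientist to review. The `build_search_string` helper and the PICO synonym lists are hypothetical placeholders, not a validated search strategy.

```python
# Sketch: assemble a draft boolean search string from PICO concept groups.
# Synonyms within a concept are ORed; concepts are ANDed together.
# All terms below are illustrative placeholders, not a validated strategy.

def build_search_string(concepts: dict[str, list[str]]) -> str:
    blocks = []
    for concept, synonyms in concepts.items():
        quoted = [f'"{s}"' if " " in s else s for s in synonyms]
        blocks.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(blocks)

pico = {
    "population":   ["pregnant women", "pregnancy", "antenatal"],
    "intervention": ["myo-inositol", "inositol supplementation"],
    "outcome":      ["gestational diabetes", "GDM"],
}
print(build_search_string(pico))
# ("pregnant women" OR pregnancy OR antenatal) AND (myo-inositol OR ...) AND ...
```

A drafted string like this would still need review for database-specific syntax (e.g., controlled vocabulary such as MeSH terms) before use.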

Data extraction
LLMs for extracting important and relevant data from the text of studies, useful for systematic reviews. "And because with a computer it could tirelessly identify potential location of the information in a paper and then that can be highlighted and then the human can then verify the veracity of such information and approve such data to be extracted. So that would expedite things. So in some sense that is kind of analogy to massive language model output and then verified it by a human." -epidemiologist and clinician with over 100 SRs (P1)
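The "locate, then verify" workflow this participant describes can be sketched minimally: flag candidate numeric spans in a paper's text and queue each one for human approval. The regex patterns and `propose_extractions` helper below are hypothetical stand-ins for what would, in practice, be a trained extraction model.

```python
import re

# Sketch: flag candidate data fields in text for a human to verify.
# Patterns are illustrative placeholders, not production extraction rules.
PATTERNS = {
    "sample_size": re.compile(
        r"\b(\d{1,3}(?:,\d{3})*)\s+(?:pregnant women|infants|participants|patients)\b"),
    "risk_ratio":  re.compile(r"\bRR\s*[=:]?\s*(\d+\.\d+)"),
    "conf_int":    re.compile(r"95%\s*CI\s*,?\s*(\d+\.\d+)\s*(?:-|–|to)\s*(\d+\.\d+)"),
}

def propose_extractions(text):
    """Yield (field, matched_span) pairs; a human approves or rejects each."""
    for field, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield field, match.group(0)

abstract = ("eight randomized controlled trials involving 1,511 pregnant women ... "
            "RR 0.69, 95% CI 0.52-0.92")
for field, span in propose_extractions(abstract):
    print(f"[{field}] candidate: {span!r} -> awaiting human verification")
```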

Generating analysis code
LLMs for generating R or Python code for conducting analyses. "I've also seen it used clearly enough for writing code and stuff like that. People have asked it how to do data on R, and it's cranked out some decent formulas." -research methodologist and author of over 10 SRs (P7)
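For instance, a fixed-effect inverse-variance meta-analysis of log risk ratios is exactly the sort of routine analysis code such a model might draft on request. The sketch below uses hypothetical study data; an LLM-drafted version of the same arithmetic would still require expert review before use.

```python
import math

# Sketch: fixed-effect inverse-variance pooling of risk ratios.
# Each study is (rr, ci_low, ci_high) at the 95% level; data are hypothetical.

def pool_risk_ratios(studies):
    weights, log_rrs = [], []
    for rr, lo, hi in studies:
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE from CI width
        weights.append(1 / se ** 2)
        log_rrs.append(math.log(rr))
    pooled = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se_pooled),
            math.exp(pooled + 1.96 * se_pooled))

studies = [(0.62, 0.40, 0.96), (0.75, 0.51, 1.10), (0.71, 0.48, 1.05)]
rr, lo, hi = pool_risk_ratios(studies)
print(f"Pooled RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```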

Bias/consistency reviewer
LLMs for checking bias or inconsistencies in human-written drafts of systematic reviews. "I dunno if it would be able to check consistency because that does, some of these numbers appear in multiple places. So you'll have it in figures, you'll have it in results, you'll have it in the abstract, the interpretations going to be in the conclusion. So there could be four places where one number is going to appear in an article. Could it do something around consistency and making sure that these numbers are consistent." -former professional journal editor and guideline developer with experience in 100s of reviews (P2)
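One narrow version of this check is purely arithmetic: a risk ratio, its 95% CI, and its p-value are all derived from the same log-scale standard error, so each reported triple can be tested for internal consistency. The sketch below applies this to the RR reported in the ChatGPT example earlier in this appendix; note that it checks only arithmetic consistency, not whether the numbers are correct.

```python
import math

# Sketch: check whether a reported p-value is arithmetically consistent
# with the reported risk ratio and 95% CI (all share one log-scale SE).

def implied_p(rr, ci_low, ci_high):
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = abs(math.log(rr)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided

print(round(implied_p(0.69, 0.52, 0.92), 3))
# 0.011 -- consistent with the reported "P = 0.01" after rounding
```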

Alternative text for graphs
LLMs for generating alternative text for graphs. "There are only so many ways you can generate a forest plot, and that's sort of seems like the kind of task that might be accessible and probably better than some bored person doing it at the last minute hoping nobody ever sees it. So that would be one that comes to mind." -clinician and researcher in evidence synthesis with experience in 7 SRs (P6)
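A templated version of this task is easy to sketch: given per-study estimates, emit one descriptive sentence per row of the forest plot plus the pooled result. The `forest_plot_alt_text` helper and the study rows below are hypothetical placeholders.

```python
# Sketch: templated alternative text for a forest plot of risk ratios.
# Study names and estimates are hypothetical placeholders.

def forest_plot_alt_text(rows, pooled):
    lines = [f"Forest plot of {len(rows)} randomized trials."]
    for name, rr, lo, hi in rows:
        lines.append(f"{name}: risk ratio {rr:.2f} (95% CI {lo:.2f} to {hi:.2f}).")
    rr, lo, hi = pooled
    lines.append(f"Pooled estimate: risk ratio {rr:.2f} (95% CI {lo:.2f} to {hi:.2f}).")
    return " ".join(lines)

rows = [("Study A 2018", 0.62, 0.40, 0.96), ("Study B 2020", 0.75, 0.51, 1.10)]
print(forest_plot_alt_text(rows, pooled=(0.69, 0.52, 0.92)))
```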

Generating guidelines
LLMs for generating medical guidelines.
"Maybe ChatGPT to generate a guidance and see how concordant or discordant it is [...] It could be any blood pressure control recommendations and see whether it's concordant, discordant things that are affecting people's life.
What are the most impactful treatments or interventions or public health preventive measures that are going to impact people's life while those large language models be able to respond to prompts that are consistent or concordant with the major guidelines that should be based on systematic reviews?" -epidemiologist and professor with editorial experience and experience in 100s of SRs (P4)

Including non-English studies
LLMs for finding and including non-English studies if the LLM was trained on non-English text.
"The reality of most systematic reviews only consider articles published in English, but we often recognize that there may be content that's missed as a result of limiting the language. So yeah, you might have thought that that could be a strength of these language learning models." -professional journal editor with 10 years of experience (P10)

Helping non-native English writers
LLMs for helping non-native English writers to write English systematic reviews.
"I'd like to say it could have, it might be very helpful for non-English speaking writers of English reviews in this example, to write coherent reviews." -professor and methodologist with over 100 SRs (P13)

Annotated Bibliography
LLMs for creating an annotated bibliography when details of studies are provided as input.
"I don't know if this is possible, but if you actually put in all the studies and then it could do some sort of narrative summary of those studies in terms of where the studies were conducted, what interventions were assessed, what outcomes were assessed, these types of details, like a slight, almost like an annotated bibliography that could potentially be useful as well." -professor, research methodologist, and author of dozens of SRs (P11) Table A4: Table summarizing what could make experts feel more comfortable using LLMs for medical systematic review process and exemplary quotes from participants.

Known provenance
By having LLMs produce summaries of known provenance (i.e. knowing that the text is genuinely derived from the presented references, which in turn genuinely reflect well-chosen and existing articles), people can have increased trust in the outputs and the system. "This is the black box. Obviously, this would be useless unless you had a citation to the specific review, the specific paper that it chose." -clinician and researcher in evidence synthesis with experience in 7 SRs (P6)
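One necessary (though far from sufficient) provenance check can be automated cheaply: confirm that every in-text citation in a generated summary actually appears in the reference list it claims to draw on. The sketch below assumes numeric citations like "[1]"; it cannot confirm that the cited papers exist or that they support the claims made.

```python
import re

# Sketch: flag in-text citations that are missing from the reference list.
# Catching a dangling citation is necessary but not sufficient for provenance.

def unmatched_citations(summary, reference_ids):
    cited = set(re.findall(r"\[(\d+)\]", summary))  # numeric citations like [3]
    return cited - set(reference_ids)

summary = "Myo-inositol reduced GDM risk [1] with no serious adverse events [4]."
print(unmatched_citations(summary, {"1", "2", "3"}))  # {'4'} -> flag for review
```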

Allowing efficient verification of outputs
AI systems should allow humans to efficiently and easily verify the quality of the inputs and outputs.
"When it doesn't know, it makes stuff up. And so that has to be checked, of course. And the question is whether that checking will be easier or hard, harder than just doing it yourself. And I'm guessing that at certain point it might be easier, but I'm not sure." -clinician, professor, and researcher with experience in more than a dozen reviews (P14)

Domain experts verifying the accuracy of outputs (human-in-the-loop)
Medical domain experts or subject-matter experts are needed to fully cross-check whether model outputs are safe and correct.
"I also need to work with a domain expert who is knowledge[able] about the specifics of either the disease or the treatment or the test. Right now I lack that aspect. So I do not know exactly whether certain things make sense and I would, if I just read it, even the things that I have questioned about, I would not know whether it is right or wrong." -epidemiologist and clinician with over 100 SRs (P1)

Providing more guidance to LLMs
Giving more guided prompts (specific populations, interventions, comparisons, and outcomes) or carefully selected studies as inputs can increase confidence in using LLMs for medical systematic reviews.
"I think if you could do a systematic review in full and or sections of the review. The analyses. Have all that data available and then limit the writing of the abstract to what has been identified during that methodological process to write the review, then I as an editor would be much happier for that to come to my journal and then to review in that. But yeah, that would give me confidence, I think." -former professional journal editor and guideline developer with experience in 100s of reviews (P2)