Risks and NLP Design: A Case Study on Procedural Document QA



Introduction
Much of the current discussion about AI, in both the research community and the broader public, focuses on the tension between deployment of systems whose behavior is nearly indistinguishable from humans (Clark et al., 2021, inter alia) and understanding the potential consequences of such deployment, including fairness, reliability, and other social and ethical implications (Tan et al., 2021; Jacobs and Wallach, 2021; Manheim and Kaplan, 2019; Raso et al., 2018, inter alia). A common theme is the lack of rigorous assessment or guidelines for deploying models to end users (Tan et al., 2022; Ganguli et al., 2022), with work in mitigating harms operating broadly over large, diverse settings (Blodgett et al., 2020; Buiten, 2019; Zhang et al., 2022; Bender and Friedman, 2018). We propose specializing the analysis of potential harms to more concrete applications, with identifiable user communities, to help close the gap between theoretical research on harms and research on real systems and users. We focus on question answering grounded in procedural documents (ProcDocQA): instructions written for a user to follow in order to complete a real-world task without supervision or assistance (e.g., cooking recipes), with discrete states of success. ProcDocQA can further be characterized along dimensions of RISK OF HARM to the user and environment and the EXPERTISE required for a user to complete the procedure successfully. By articulating clear user goals (i.e., executing an instruction successfully), designers can more systematically assess the interplay of risks and system performance.
We introduce the first version of a Risk-Aware Design Questionnaire (RADQ), providing questions whose answers will be actionable for NLP designers of user-facing applications, and conduct a case study in cooking recipes to illustrate how system designs evolve with the discovery of new risks. The case study shows how, despite zero-shot GPT-3 text-davinci-003 (Brown et al., 2020) achieving performance that is quantitatively on par with human-written answers, a deeper inspection of multiple answers per question reveals errors that will require application-specific resolutions. In light of these errors, we resurface research directions neglected over the past decade and show how work in risk management and communication, visualization, and uncertainty might help inform application-specific error mitigations.

ProcDocQA
Question answering is a mature NLP task with a diverse set of research datasets spanning many text and information domains, but the risks and harms of question answering systems are underexplored, with work primarily in open-domain web question answering (Su et al., 2019; Dhuliawala et al., 2022), user experience of a system (e.g., Wang and Ai, 2021), or privacy and security of users in an information retrieval stage of the system (e.g., Wu et al., 2021). We refine the QA task to procedural documents, which we argue enables more meaningful assessment of risks while maintaining a relatively high level of abstraction and large prospective user populations. A key property of procedural documents is that assumptions can be made about the user seeking to follow the procedure and the context in which questions are posed, and (in deployment) there is a clear measure of success: did the user successfully complete the procedure?
Assumptions about the user allow us to characterize genres and procedures within ProcDocQA along dimensions of RISK OF HARM to the user and environment, concrete harms to specific entities that are more easily conceptualized than broad abstract harms to populations or society (as in Tan et al., 2022, Lee et al., 2020, and Straw and Callison-Burch, 2020), and EXPERTISE, the skill required to successfully complete a procedure (Figure 1). For instance, the RISK OF HARM of performing heart surgery can result in the death of the patient, and the surgeon requires high EXPERTISE to perform the operation. Doing laundry has a range in EXPERTISE due to the knowledge required to launder a variety of fabrics (e.g., jeans vs. a suit jacket), but there is low RISK OF HARM (e.g., damaged clothing). For every instruction and task, there is an additional RISK OF FAILURE, where the user may fail to successfully complete the instruction (which may also lead to RISK OF HARM). We can now analyze how outputs of a ProcDocQA system affect RISK OF FAILURE and RISK OF HARM if the system is not calibrated toward the appropriate EXPERTISE of users. Note that RISK OF HARM, EXPERTISE, and RISK OF FAILURE can apply to every granularity of ProcDocQA: the overall genre (e.g., cooking), specific tasks (e.g., baking cookies), and individual instructions (e.g., chop onions).

Risk-Aware Design Questionnaire
The RISK OF HARM and EXPERTISE levels illustrate, at a high level, how different end-user scenarios might affect QA system design; namely, a system working with high RISK OF HARM tasks may want to require high-confidence answers verifiable by retrieved sources. Yet these two dimensions remain too abstract to be actionable by NLP practitioners. Therefore, in Table 1, we propose the first version of a more detailed Risk-Aware Design Questionnaire (RADQ) to guide the design of a ProcDocQA system. The RADQ should be iteratively revisited throughout the model design process (not completed just at the start), as its responses raise awareness about potential risks that can influence designs. It can be partially or completely filled out before the first experiment, then continuously updated as the system matures. Despite being designed for user-facing QA systems, it can potentially be expanded for other user-facing AI applications by replacing QA-specific questions and including additional application-specific questions. For example, in a restaurant recommendation system, we might remove Q1.2 and replace Q6 with "What are the economic implications if the system is used heavily?"

Case Study: ProcDocQA for Recipes
We present a case study on cooking recipes, a genre of procedural documents whose tasks span a large range of RISK OF HARM and required EXPERTISE, but narrow the scope to home-style recipes, which require less EXPERTISE and have lower RISK OF HARM than professional-style recipes. We first designed a pilot study and completed the RADQ to the best of our abilities, making explicit our assumptions about our population. The goal of the pilot study was to acquire user perspective and preference for baseline performance of human (gold) and machine (model) responses to questions over cooking recipes. Next, informed by the results of the user study, we analyzed model decoding responses and identified concerning behaviors that should influence model design decisions. In §4, we return to the RADQ and propose ways in which the model design could be updated to be more RISK OF HARM and RISK OF FAILURE aware.

RADQ Initial Completion
We describe how completing the RADQ to the best of our abilities contributed to user study questions and designs. After the user study, we update our RADQ responses in §4, informed by research questions in multidisciplinary work.
Q1 Users are home cooks who range in experience from novice to advanced. Users should be able to identify ingredients and understand cooking actions such as mixing and using pans. Various cookware, utensils, knives, appliances, and food ingredients are required, and potential harms include property damage such as scorching the ceiling, bodily harm such as cuts, and mental harms such as consuming unpleasant products.
Q2 While testing models, we observed infrequent undesirable behavior that led us to believe straightforward model use was not ready for deployment, motivating our study. Consider the following: Question: Where do I go to buy a grit cooking mix in beijing China?
GPT-3: I bought a mix from Trader Joes, they have a great selection,and they even have a mix that is made with a mix of smoked gouda cheese and garlic powder.
We observe the known AI risk of bias in recommending a specific brand (Trader Joes) that may not exist in the geographic locale (Beijing) and the use of the personal pronoun "I," which could deceive the user into believing a human authored the response. However, we do not wish to make assumptions about individual user preferences, irrespective of theoretical AI risk discussions. As a result, we explicitly ask in our post-survey how users feel about receiving specific brand recommendations from humans and machines, and how a human-presenting machine should respond.

Q3
We observe frequent hedging in human responses (e.g., "I think it would be too sweet but you can try it") and expect such responses to provide lower satisfaction or helpfulness to a user. Answers that are too vague or too precise may increase RISK OF FAILURE because users are unable to effectively use such responses. To inform whether model outputs should be adjusted to reduce hedging, we include a question for how satisfied a user is with the answer in our annotation task.
Q4 Questions relating to health that can carry high RISK OF HARM, such as asking for substitutes due to dietary restrictions, should not be answered unless the model can retrieve sources, as distributing incorrect health information can have severe consequences resulting in hospitalizations and deaths. Incorrect answers for low EXPERTISE and low RISK OF HARM tasks such as making pancakes can result in a failed final product, but the highest RISK OF HARM to the user, burns, stems from incorrectly operating necessary appliances rather than from procedure failure.
Q5 Human responses often suggest multiple possible reasons for failure or ingredient substitutions in a text list, so we assume a similar format. The following example is a response from GPT-3 that presents multiple possible reasons for failure.
Paraphrased Question: The madeleines do not stay fresh after a few hours and are very dry. I tried freezing them, but they are worse. Tips?
GPT-3: [...] First, make sure you are using fresh, high-quality ingredients. Second, be sure to measure your ingredients carefully. Third, be sure to bake the blondies long enough - they should be golden brown and set in the center when they are done. Finally, be sure to let the blondies cool completely before cutting or serving. I hope this helps!

Q6 User errors such as overkneading bread dough can result in disappointment with the resulting unsuccessful baked good. Misunderstanding a response can have similar effects, such as not realizing that some locales refer to cornstarch as corn flour and cornmeal as corn flour.

User Perspective Study
We use GPT-3 text-davinci-003 with default parameters in a zero-shot setting to generate answers for questions from a custom dataset of blog recipes collected from CommonCrawl (Appendix A.1). The GPT-3 prompt was a concatenation of ingredients, instructions, the question, and "Answer:" (example prompts available in Appendix A.1, Table 3).
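The prompt construction described above can be sketched as follows; the field labels and example strings are our own illustration, not the exact template from Appendix A.1:

```python
def build_prompt(ingredients, instructions, question):
    # Concatenate recipe context and question, ending with "Answer:"
    # so the model completes the prompt with an answer.
    return "\n".join([
        "Ingredients:",
        "\n".join(ingredients),
        "Instructions:",
        instructions,
        "Question: " + question,
        "Answer:",
    ])

prompt = build_prompt(
    ["1/2 cup tahini paste", "2-4 tbsp oil"],
    "Blend sesame seeds until smooth, adding oil as needed.",
    "How many sesame seeds do I need?",
)
```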
Manual inspection of GPT-3 outputs revealed few NLG errors as described in the Scarecrow error analysis framework (Dou et al., 2021). Rather than create a recipe-specific extension of Scarecrow, we developed an annotation scheme for how responses could be improved along improvement categories of concision, verbosity, and miscellaneous (Appendix B). Items within improvement categories were cooking-specific (e.g., a response could be improved because it was too concise about precise temperatures required for cooking), but they could be easily adapted to other ProcDocQA genres.
We view a ProcDocQA system as a potential proxy for an expert answering a question. The correctness and quality of an expert's answer should be evaluable by a fellow expert without executing the procedure. Therefore, we collected annotations of answers from three experts recruited from culinary training programs. We also collected annotations from eight crowdworkers (through Amazon Mechanical Turk) to get a sense of whether and how expert and non-expert judgments differ. All annotators were located in the USA.
Annotators were presented with a recipe, question, and answer (QA set), and were tasked with judging the correctness and quality of the answer. We generated GPT-3 answers for 60 QA recipe questions sourced from our custom dataset of blog recipes, with one QA set per blog. Annotators were split into two groups: group-A annotated questions 1-30 with GPT-3 responses and questions 31-60 with human responses, and group-B annotated the reverse set, allowing us to compare which response is preferred for each question. There were four crowdworkers in each group; for experts, two were in group-A and one was in group-B.
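The two-group counterbalancing can be sketched as follows (the group labels and helper function are our own):

```python
def assign_sources(num_questions=60, split=30):
    # Counterbalanced design: each group judges every question exactly once,
    # with the answer source (GPT-3 vs. human) swapped between the groups.
    group_a = {q: "gpt3" if q <= split else "human"
               for q in range(1, num_questions + 1)}
    group_b = {q: "human" if src == "gpt3" else "gpt3"
               for q, src in group_a.items()}
    return group_a, group_b

group_a, group_b = assign_sources()
```

Pairing the two groups' judgments per question is what allows comparing, for each question, whether the GPT-3 or the human answer is preferred.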
All annotators were presented with the 60 QA items in random order without any indication as to who or what generated the answer. Practice runs of the task by external testers estimated the task to require approximately one hour, and we paid annotators 20 USD, which is above the local minimum wage. The most common type of question asked was about ingredient substitutions, followed by ingredient and instruction clarification (Appendix Figure 13). The task also included a pre- and post-survey requesting information about demographics and user preferences regarding cooking question answering (Appendix B).

Results
Overall, GPT-3 had strong performance, performing similarly to the human baseline, as judged by both crowdworkers and experts (Figure 2; responses in i. and ii. were judged by experts, and iii. and iv. by crowdworkers; GPT-3 generated the responses in i. and iii., and human-written answers were judged in ii. and iv.). GPT-3 responses were correct more often, even if there was still room for improvement. Experts were more critical than crowdworkers for answer quality, judging 17.1% of GPT-3 responses correct but improvable vs. 12.9% by crowdworkers. Crowdworkers gave 94.4% of GPT-3 responses the highest satisfaction rating on a Likert scale from 1-5, as compared to 90.3% of human responses, and experts gave 53.8% of GPT-3 responses a satisfaction rating of 5, as compared to 40.0% of human responses. Inter-annotator agreement about answer correctness was low for each group (Krippendorff's α < .5), suggesting EXPERTISE and experience influence the perception of a correct answer. Both GPT-3 and human responses were generally considered too concise: 52-55% of correct answers annotated by experts had room for improvement in the concise category, as did 80-85% of those annotated by crowdworkers (Appendix B.3). Example annotation responses can be found in Appendix B.4. Using a paired Student's t-test, we did not find statistically significant differences between GPT-3 and the original human responses in judgments for ways to improve or satisfaction with responses.
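For reference, the paired test statistic can be computed as below; the satisfaction scores here are synthetic, purely to illustrate the computation (in practice, scipy.stats.ttest_rel returns the statistic and p-value directly):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    # Paired t-test: t = mean(d) / (sd(d) / sqrt(n)), where d holds the
    # per-question differences; compare t against the t distribution
    # with n-1 degrees of freedom to obtain a p-value.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Synthetic 1-5 Likert satisfaction scores for the same eight questions.
gpt3 = [5, 4, 5, 5, 3, 5, 4, 5]
human = [5, 5, 4, 5, 3, 4, 4, 5]
t = paired_t_statistic(gpt3, human)  # small |t| -> no significant difference
```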

MultiDecoding Analysis
Low error rates in GPT-3 responses, as rated by human annotators, imply that we only have a small sample of errors for analyzing potential harmful impacts. Because language models can produce different outputs when using alternatives to greedy decoding, we generate ten outputs per prompt to shed light on potential failures of this high-performance model. Examining these decodings, the first author discovered several frequent classes of errors: output instability, recommendations, leading question agreement, hallucination, and language style, in addition to the Scarecrow errors "needs Google," "off prompt," "self-contradiction," and outputs that do not answer the question.
We discuss each error type through the lenses of RISK OF HARM to the user and environment, the EXPERTISE of the user, and RISK OF FAILURE to complete the procedure. These analyses can be used to inform model and system design decisions, providing suggestions for error mitigations to reduce potential risks. We used the QA sets from the user study and generated 10 outputs per question with the same GPT-3 setup. Only 13.3% of prompts had 10 error-free outputs (Table 2).
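Sampling the ten decodings per prompt can be sketched as below. We only build the request parameters here; the parameter names follow the legacy OpenAI Completions API (openai.Completion.create(**params)) as we understand it, and the temperature value is our assumption standing in for "default parameters", not a detail from the study:

```python
def multidecode_params(prompt, n=10):
    # One request returning n independently sampled completions,
    # surfacing output instability that a single greedy answer hides.
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        "n": n,               # number of decodings per prompt
        "temperature": 0.7,   # assumed sampling temperature (illustrative)
        "max_tokens": 256,
    }

params = multidecode_params("Ingredients: ...\nQuestion: ...\nAnswer:")
```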
Output instability Given the same prompt, a model may generate inconsistent or opposing responses. Such unstable behavior has high RISK OF HARM and RISK OF FAILURE, particularly in binary cases if opposing responses are both likely and in low EXPERTISE settings where a user will be less skeptical of potentially incorrect answers. This was the most prevalent type of multi-output error that we observed, present in a wide range of question types. In our cooking domain, we counted giving different diagnostic reasons for a failure and giving different substitution ingredients as instances of this error. Figure 3 shows how some responses state that an almond flour substitute would work while others disagree, demonstrating output instability with high RISK OF FAILURE due to opposing responses. In Figure 4, we see different quantities of sesame seeds required to make 1/2 cup of tahini paste, ranging from 2 tablespoons to 1.5 cups (24 tablespoons). Responding with a low quantity has high RISK OF FAILURE, but RISK OF HARM is low because the result of the failure is making not enough tahini paste, requiring the user to repeat the task with more sesame seeds. Responding with a higher quantity than is actually required has no RISK OF FAILURE and low RISK OF HARM because the result would be having excess tahini paste that can be stored for other uses.
Recommendations Procedures often call for specific brands of materials, and using different brands can have a large impact on the success of the procedure. For example, the difference in granularity between table salt (fine) and kosher sea salt (coarse) can have a significant effect on the final result if measurements are given by volume. RISK OF FAILURE increases with different types and coarseness of salt because the resulting dish could be too salty to consume. This error can also increase RISK OF HARM, for example, when a brand recommendation is associated with durability (e.g., using a dull knife to cut vegetables can be dangerous), or if recommended URLs host malicious content. Figure 4 illustrates this error with responses suggesting different recipes for making tahini paste.

Leading question agreement Leading questions in ProcDocQA contain suggested answers within the question, changing the intent of the question to both verification of existing knowledge and a request for new knowledge. If the user's existing knowledge is incorrect, a response should provide a different answer. However, we see cases where generations attend too strongly to the incorrect existing knowledge. Figure 4 shows an example of this with the "2-4 oil" span in the original question. Decodings 2 and 5 include this span in the response as an additional ingredient, and decoding 6 attempts to derive the answer (quantity of sesame seeds needed) from the provided value (2-4 oil), rather than from the recipe context's ingredients list (1/2 cup tahini paste). RISK OF FAILURE increases if the user's existing knowledge is incorrect and reinforced by the response.
Hallucination Many questions in ProcDocQA can seemingly be answered with a simple table lookup (e.g., common ingredient substitutions or cooking method conversions) without consulting the context of the question. This is not often the case, as can be seen by the extensive work in automatically recommending ingredient substitutions (e.g., Liu et al., 2018; Ooi et al., 2015; Pacífico et al., 2021). Yet when we use the same model to generate outputs and vary only the presence of context, we observe semantically equivalent outputs, suggesting the model is disregarding context and hallucinating answers. Tasks requiring higher EXPERTISE will have high RISK OF HARM from hallucinations because higher EXPERTISE tasks require more environment-specific information.
Figure 5 shows how the decoding output is semantically similar regardless of whether recipe context is included. Teff is a gluten-free grain used optionally in a small amount in this recipe, indicating that all-purpose flour has properties essential to the success of this recipe. Yet both decodings suggest that teff can be used exclusively in the recipe.

Language style Language style can influence RISK OF FAILURE if there is a mismatch with the intended audience, such as when a context requires precise answers, the users have low EXPERTISE, and the language expressed is imprecise. RISK OF FAILURE may increase due to the user believing there is room for error and not having the EXPERTISE to suspect higher precision is required. Domains with high RISK OF HARM or high EXPERTISE requirements may generally prefer highly precise language that inspires confidence because there are higher requirements for the success of the procedure. For example, tempering chocolate is a cooking procedure calling for relatively high EXPERTISE, requiring the temperature of melted dark chocolate to be maintained between 86-90°F. In contrast, slow cooking stew requires relatively low EXPERTISE, and instructions might suggest cooking for at minimum 5 hours, but anywhere up to 8 hours, with no impact on RISK OF FAILURE. The Figure 5 decodings hedge, using less certain language when stating that using teff flour "should" or "can" work, indicating to the user that the substitution has some RISK OF FAILURE. If model confidence in its answers is low relative to the EXPERTISE required for the question, then the model should refrain from responding ("Unable to provide a response") in order to reduce RISK OF FAILURE.

(Figure 6: The reference answer suggests problems with yeast health and rising environment temperature, whereas the decodings also suggest issues with rising time, gluten development in the kneading process, dough temperature, and cooking time.)

RADQ Post-Study Update
Informed by our user perspective study and multioutput error analysis, we update our RADQ responses from §3.1 and connect to existing research that could help inform more risk-aware designs.
Q2 Though we were initially skeptical when we observed explicit recommendations of specific brands in answers, users preferred them even with the knowledge that the recommendation comes from an automated system (Appendix B.1). Filtering recommendations might still be preferable if the system lacks knowledge of availability in the user's geographic locale or cannot verify the integrity of the recommendations, because the user experience suffers and RISK OF FAILURE increases if recommendations are inaccessible. Work in QA answer verification (e.g., Wang et al., 2018) and fact verification (e.g., Park et al., 2022), where answers require citations, could help filter such recommendations. Further work in balancing user preferences with theoretical harms of AI bias is needed to support development of practical, safe systems while maximizing user experience.
Q3 We expected hedged responses to provide low satisfaction, yet this was not the case for either crowdworkers or experts: 79% of answers with the highest satisfaction contained hedging, 2% higher than all other answers. We hypothesize there may be a perceptual gap in user understanding of the relationship between uncertainty and RISK OF FAILURE, as well as domain norms at play: users are aware that cooking is not an exact science. Work in risk communication and management (e.g., Renn et al., 1996; Bier and Lin, 2013), where qualitative risk perception studies balance out quantitative risk models to guide risk communications, can help inform when hedging is appropriate.
Q2, Q5 GPT-3's output instability within just 10 generations was surprisingly common, despite the case study (which used only the first generation) indicating the high quality and correctness of model output. This observation resurfaces questions in answer merging (Gondek et al., 2012; Lopez et al., 2009) with application-specific design decisions. Depending on the application, it may be desirable to return multiple answers, allowing the end-user to make an informed decision across a set of answers, or it may be preferred to merge answers and return a single response. Work in uncertainty visualization (e.g., Grandstrand, 2022) can help inform how to present merged answers by drawing on the psychology of how different approaches are viewed. Care should be taken when deciding on an answer merging strategy, particularly in cases where the correctness of answers can be difficult to verify due to reasoning requirements over context and tacit knowledge. Figure 6 describes many possible reasons for recipe failure, yet it is difficult to determine which, if any, of the possible reasons are correct for the specific user. Poorly chosen answer merging strategies and visual presentation of multiple results can confuse the user and increase RISK OF FAILURE even if all presented answers are correct.
The second version of our recipe QA system may include:
• EXPERTISE estimator for recipes (which may already be provided), for calibrating language style edits (e.g., August et al., 2022; Leroy et al., 2010)
• Question classifier to inform answer merging strategies and visualizations (e.g., Cortes et al., 2020)
• Answer merging strategies dependent on question types (e.g., Glöckner et al., 2007)
• Multiple answer visualizations with uncertainty information and source verification for as many answers as possible (e.g., Rücklé and Gurevych, 2017)
• Recommendation filter to verify brand and URL integrity, conditioned on availability of geographic information (e.g., Provos et al., 2008)

Conclusions
On the surface, vanilla GPT-3 presents itself as a powerful system ready for deployment as (among other things) a cooking recipe question answering system, with no additional filtering or adaptation of its outputs needed. However, multiple generations over the same question revealed several types of errors with varying degrees of RISK OF HARM and RISK OF FAILURE relative to the EXPERTISE of the users. To address these errors, system designers should draw on application-specific attributes and incorporate work from other disciplines, such as risk management and communication, which discuss the psychology and perception of risks by users. Methodologically, we encourage reporting error analysis across multiple outputs of generative model-based systems and using tools like the RADQ to explicitly document discussions of user and environment risks relative to target users in the specialized setting, working toward a deployable system.

Limitations
Cooking recipes constitute a single genre within ProcDocQA, with a well-grounded task and a large range in RISK OF HARM and user EXPERTISE. Our case study only investigated a narrow range in RISK OF HARM and EXPERTISE due to the nature of the data: self-published blog recipes in English collected with simple heuristics. The first version of the RADQ was informed by theoretical AI risk frameworks and our CookingQA case study; we anticipate the questionnaire evolving greatly when informed by other QA domains with different levels of RISK OF HARM and EXPERTISE. This work only considers immediate risks to humans; longitudinal risks, such as the propagation of information, are an open research topic.
We position ProcDocQA as a domain with more measurable success due to the progress states within a procedure, but there are tasks whose progress states are more difficult to measure, such as following general health, exercise, and life advice articles.
This work contributes to risk mitigation by concretizing risks in user-aware scenarios. Potential risks of misusing or misunderstanding this work include research concerns of being too applications-driven.

Ethics Statement
User studies were conducted after review by our institution's IRB, and participants were paid a fair wage in accordance with our local government. We had minimal computational costs, and no personally identifiable information was used from our publicly collected recipe dataset.

A.1 Cooking Dataset
The custom dataset collected for finetuning UnifiedQA consisted of 105k recipes from 192 blogs extracted from CommonCrawl, accessed on July 29, 2022. Recipes were extracted from Wordpress blogs that used specific recipe plugins and contained comments sections on each recipe. Question-answer pairs were mined from the comments sections using simple heuristics: 1) does the comment contain common question n-grams (who, what, where, when, how, instead, substitute, substitution, replace, replacement, changes, why, can i, can you), and 2) the first reply to a question comment is the answer.
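A minimal sketch of this mining heuristic (function and variable names are ours; matching is a crude substring check, in keeping with "simple heuristics"):

```python
QUESTION_NGRAMS = [
    "who", "what", "where", "when", "how", "instead", "substitute",
    "substitution", "replace", "replacement", "changes", "why",
    "can i", "can you",
]

def mine_qa_pairs(comment_threads):
    # comment_threads: list of threads, each a list with the top-level
    # comment first and its replies after. A top-level comment containing
    # a question n-gram is paired with its first reply as the answer.
    pairs = []
    for thread in comment_threads:
        if len(thread) < 2:
            continue  # no reply means no mined answer
        lowered = thread[0].lower()
        if any(ngram in lowered for ngram in QUESTION_NGRAMS):
            pairs.append((thread[0], thread[1]))
    return pairs

threads = [
    ["Can I use honey instead of sugar?", "Yes, use about 3/4 the amount."],
    ["Looks delicious!"],
]
pairs = mine_qa_pairs(threads)
```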

B Annotation Task
The categories annotators could select for how a response could be improved:

Concise: Ingredient quantities; Cooking temperatures/times; Nutritional information; Technical details; Expected intermediate or final results; States the recipe has been updated without specifying how in the answer

Verbose: Ingredient quantities; Cooking temperatures/times; Nutritional information; Technical details; Expected intermediate or final results; Made up details/processes (that may be irrelevant)

Misc.: Is too dependent on the specific setting (e.g., ingredients, tools, and skills) of the person following the recipe; Does not inspire confidence in whether to trust the answer to the question

Every category had an "Other" option where annotators could add custom categories.

B.1 Annotator Recruitment
Eight crowdworkers were recruited on Amazon Mechanical Turk (AMT). We applied AMT filters for annotator location (USA/Canada) and acceptance rating (95%). Three experts were recruited from culinary training programs. Two experts had graduated the program and were currently working as professional chefs; one expert was still in the program. Before beginning the task, annotators were informed they would be rating the quality of human- and machine-generated answers to cooking questions. They gave consent to participate in the experiment and could withdraw at any time.
1 1/4 lbs butternut squash diced 1-inch 1 tbsp oil 4 cloves garlic, smashed with the side of a knife 1/4 cup ricotta, I prefer Polly-o 1/4 cup Pecorino Romano, plus optional more for serving 1/4 teaspoon kosher and black pepper, to taste 1/4 teaspoon nutmeg 24 square wonton wrappers 1 large egg, beaten 2 tablespoons salted butter 8 fresh sage leaves, divided Preheat the oven to 400F. Place butternut, 4 sage leaves and garlic on a sheet pan and toss with 1 tablespoon oil. Season with 1/4 teaspoon salt and pepper, to taste. Roast until tender, about 35 minutes. Transfer to a bowl and mash with a fork until very smooth (a blender would work too). Mix in ricotta and pecorino, season with nutmeg, 1/4 teaspoon salt and black pepper. Place the wonton wrapper on a work surface, brush the edge lightly with egg wash and add 1 tablespoon filling onto the center. Fold over into a triangle and press the edges to seal. Cover with a damp cloth while you make the rest. Chop remaining sage leaves. Place butter and sage in a medium saucepan and melt over low heat. Keep warm over very low heat. Bring a large pot of salted water to a boil. Add half of the ravioli (they are very delicate) and cook until they rise to the surface, about 2 minutes. Use a slotted spoon to remove and add to the pan with the butter. Repeat with the remaining ravioli. Gently toss raviolis with the butter until warm, 1 to 2 minutes. Top with black pepper and serve with additional Pecorino Romano, if desired. Question: Hello!!! Can you use frozen butternut squash that's already cubed? (Just to save time so I don't have to peel, cube myself?)
if so, how would you recommend going about it? Answer:

1 tablespoon olive oil 2 skinless and boneless chicken breast fillets, halved horizontally to make four fillets Salt, to season 14 ounces | 400 grams sliced mushrooms 2 teaspoons butter 1 large french shallot, finely chopped (normal shallot for U.S readers) 1/2 cup (about 130ml) champagne (or sparkling white wine) 2/3 cup milk (or heavy / thickened cream) 1 teaspoon of cornstarch (corn flour) - only if using milk Fresh chopped parsley, to garnish Heat the olive oil in a skillet or pan over medium heat. Season each chicken filet with a pinch of salt. Sear chicken on both sides, for about 3-5 minutes each side (depending on thickness), until golden all over. Transfer chicken to a plate. Fry the mushrooms in the butter for a further 3-5 minutes, or until just beginning to soften. Transfer to the same plate as the chicken. Cover and keep warm. Add the shallot into the pan and cook for 4 minutes, while occasionally stirring. Pour in the champagne; stir well, while scraping any food bits from the bottom of the pan for added flavour. Season with salt and pepper and allow to simmer for about 10 minutes. Add milk (or cream) and simmer for 2-3 minutes. If using milk: mix the cornstarch with a tablespoon of water; pour into the centre of the pan and mix throughout the sauce until slightly thickened. Add the chicken and mushrooms into the sauce along with any cooking juices. Coat in the creamy sauce. Sprinkle with parsley and serve! Question: Do you have any suggestions for a mushroom substitute? That's a recipe deal breaker for me, otherwise this looks great. Answer:

Crock Pot Maple Dijon Chicken Drumsticks
Does anyone know how I could convert this recipe to work in an electric pressure cooker (Instant Pot)? I would just brown the chicken and set the manual pressure to 15 mins or use the poultry function if you have the 7-in-1 Instant Pot.
To convert this recipe to work in an electric pressure cooker, you will need to reduce the amount of liquid in the recipe. Maple syrup, balsamic vinegar, and Dijon mustard are all quite thick, so you may need to reduce the amount of each by half. You will also need to reduce the cooking time. Chicken legs usually cook through in about 30 minutes in an electric pressure cooker, so you will need to adjust the cooking time accordingly.

Keto Tomato Tart

B.2 Survey Results
We summarize the annotator survey results from which conclusions were drawn.
Crowdworkers Six annotators had 7+ years of home baking and cooking experience, and all but one rated their cooking expertise as intermediate. Baking expertise had a larger range, from beginner (3) to intermediate (5). Two annotators requested generic brands when it is known that a machine is responding. Three annotators wanted the machine to make it obvious that a machine is responding, regardless of whether the respondent's identity is disclosed. There was an even split between trusting human and machine responses equally and trusting a human more.
Experts Two experts had 1-3 years of professional baking and cooking experience, and one expert had 3-7 years of professional cooking experience. Baking expertise was rated intermediate by all three; two experts stated they had intermediate-advanced cooking expertise, and one considered themself an expert. One expert wanted generic brands from a known machine, and one expert did not mind having specific or generic brands. Only one expert wanted a machine to make it obvious that a machine is responding when the respondent is unknown, and another expert wanted the same when it is known that a machine is responding. Two experts stated they trust human responses over an AI response, and one trusts both human and AI responses equally.

B.3 Improvement Results
Experts were more critical than crowdworkers about how responses could be improved and provided many custom suggestions for improving responses (Table 5).

B.4 Example Annotation Responses
Sample responses to a question in the annotation task (Figure 14) are as follows:

Figure 1 :
Figure 1: Dimensions characterizing procedural documents that can assist with estimating potential harms: RISK OF HARM to the user or environment and the EXPERTISE required for the user to successfully complete the procedure.

Figure 2 :
Figure 2: Annotators judged answers for correctness and could state their uncertainty about the answer's correctness. Correct answers were judged for how they could be improved; perfect answers required no change. Responses in i. and ii. were judged by experts, and iii. and iv. by crowdworkers. GPT-3 generated the responses in i. and iii.; human-written answers were judged in ii. and iv. Inter-annotator agreement about answer correctness was low for each group (Krippendorff's α < .5), suggesting that EXPERTISE and experience influence the perception of a correct answer.
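The α < .5 figure refers to Krippendorff's alpha over the nominal correctness judgments. As an illustrative sketch (not necessarily the exact computation used for the paper's numbers), alpha for nominal labels can be computed from per-question rating lists:

```python
from collections import Counter

def krippendorff_alpha(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, each inner list holding the category labels
    assigned by the annotators who rated that unit (missing ratings are
    simply omitted). Returns alpha in (-1, 1]; 1 = perfect agreement.
    """
    # Only units with at least two ratings contribute pairable values.
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)          # total pairable values
    overall = Counter(v for u in units for v in u)

    # Observed disagreement: within-unit label pairs that differ.
    d_o = 0.0
    for u in units:
        counts = Counter(u)
        pairs = sum(c1 * c2 for a, c1 in counts.items()
                            for b, c2 in counts.items() if a != b)
        d_o += pairs / (len(u) - 1)
    d_o /= n

    # Expected disagreement: pairs drawn from the pooled label counts.
    d_e = sum(c1 * c2 for a, c1 in overall.items()
                      for b, c2 in overall.items() if a != b)
    d_e /= n * (n - 1)

    return 1.0 if d_e == 0 else 1.0 - d_o / d_e
```

For example, two units rated identically by two annotators yield α = 1, while mixed labels pull α toward (or below) zero.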

Figure 3 :
Figure 3: Output instability error. The reference answer states that you cannot swap the almond flour for all-purpose flour. Decoding 1 agrees, while decodings 2-4 state the opposite. All decodings suggest different usage of almond extract. Decodings 2 and 4 also give contrasting information regarding the absorbency of almond flour.
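One illustrative way to flag this kind of instability automatically for yes/no questions is a simple polarity heuristic over the sampled decodings. This is a sketch, not the paper's method, and the cue phrases are hypothetical:

```python
import re

# Hypothetical cue lists for answer polarity; a deployed system would
# need far more robust stance detection.
YES = re.compile(r"^\s*(yes|sure|absolutely|you can)\b", re.IGNORECASE)
NO = re.compile(r"^\s*(no|unfortunately|you can't|you cannot)\b", re.IGNORECASE)

def polarity(answer):
    """Classify an answer's leading stance as 'yes', 'no', or 'unknown'."""
    if YES.search(answer):
        return "yes"
    if NO.search(answer):
        return "no"
    return "unknown"

def is_unstable(decodings):
    """True if the sampled decodings disagree on yes/no polarity."""
    labels = {polarity(d) for d in decodings} - {"unknown"}
    return len(labels) > 1
```

Under this heuristic, the decodings in Figure 3 (one agreeing with the reference, others stating the opposite) would be flagged as unstable.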

Figure 4 :
Figure 4: Leading-question agreement, hallucination, and recommendation errors. The question includes the contextual information "2-4 oil", which decodings 2 and 5 use within their responses. Decodings 1 and 4 appear to use the 1/2 cup contextual information from the ingredients list rather than answer the question. Decodings 2 and 4 recommend different recipe URLs that do not exist.

Figure 5 :
Figure 5: Hallucination and language style errors. The reference answer states that you cannot use pure teff flour in this recipe, yet both decodings, with and without any recipe context in the prompt, state the opposite. However, the recipe only optionally uses teff at ≈25% of the total flour content by weight, suggesting that you can't use all teff, regardless of any knowledge about the properties of teff (a dense gluten-free grain).

Figure 8 :
Figure 8: Screen shown when a response answers the question.

Figures 9 and 10 :
Figures 9 and 10: Screens shown when annotators state a response does and does not answer the question, respectively.

Figure 11 :
Figure 11: Screen shown if the annotator is unsure if the response answers the question.

Figure 13 :
Figure 13: Distribution of the types of questions in the 60 questions annotated by both crowdworkers and experts. Misc. contains many infrequent custom question types; Storage was a frequently used custom type.

Figure 14 :
Figure 14: Example annotation question with responses described in B.4.

Table 1 :
Risk-Aware Design Questionnaire for ProcDocQA, easily adaptable to other user-facing applications. Q1.1: Who are the users of the procedural document, and what are the prerequisites for a user to be able to complete the procedure successfully? Q1.2: What tools and materials are required for the task, and what are the potential harms to the agent or environment if tools and materials are handled incorrectly?

Table 2 :
Percentage of prompts for which each error was present in at least one of the ten responses generated, when comparing the outputs to each other. N = 60. Multiple error types could be present for each prompt.
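Assuming per-response error annotations are stored as one set of labels per generated response for each prompt (a hypothetical data layout), the table's "present in at least one response" percentages can be reproduced with a small aggregation:

```python
from collections import defaultdict

def error_presence_rates(per_prompt_errors):
    """Percentage of prompts where each error appears in >= 1 response.

    per_prompt_errors: dict mapping prompt_id -> list of sets, one set
    of error labels per generated response (e.g. 10 decodings).
    """
    counts = defaultdict(int)
    for responses in per_prompt_errors.values():
        # Union of errors observed across this prompt's responses.
        seen = set().union(*responses) if responses else set()
        for err in seen:
            counts[err] += 1
    n = len(per_prompt_errors)
    return {err: 100.0 * c / n for err, c in counts.items()}
```

For example, an error occurring in one of two prompts (in any of its responses) yields a rate of 50.0.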

Table 3 :
Example GPT-3 prompt(s). Hi! I just made this overnight last night... it smells delicious and I am very excited to eat it... however there is a lot floating around in there... looks curdled, but I know that it is not as there is no dairy in there. I still want to eat it... is it OK to eat, and is this supposed to happen? Thank you!

Table 4 :
Sample questions and answers from the recipes dataset.