FIREBALL: A Dataset of Dungeons and Dragons Actual-Play with Structured Game State Information

Dungeons & Dragons (D&D) is a tabletop roleplaying game with complex natural language interactions between players and hidden state information. Recent work has shown that large language models (LLMs) that have access to state information can generate higher quality game turns than LLMs that use dialog history alone. However, previous work used game state information that was heuristically created and was not a true gold standard game state. We present FIREBALL, a large dataset containing nearly 25,000 unique sessions from real D&D gameplay on Discord with true game state info. We recorded game play sessions of players who used the Avrae bot, which was developed to aid people in playing D&D online, capturing language, game commands and underlying game state information. We demonstrate that FIREBALL can improve natural language generation (NLG) by using Avrae state information, improving both automated metrics and human judgments of quality. Additionally, we show that LLMs can generate executable Avrae commands, particularly after finetuning.


Introduction
Dungeons & Dragons (D&D) (Gygax and Arneson, 1974) is a tabletop roleplaying game in which players assume the roles of characters in a fantasy adventure. Play is conducted primarily through natural language, with players roleplaying as their characters and describing their actions. Meanwhile another player, the Dungeon Master (DM), controls the fictional story world: setting obstacles, goals, and adventures, controlling monsters, and interpreting players' actions in the context of the rules of the game. Although the DM makes a lot of the final decisions, the game is ultimately a collaborative storytelling experience.

*Work done while at the University of Pennsylvania.

Figure 1: Examples of our Utterance to Command task (top), which takes in an utterance and a game state to produce an Avrae command, and State to Narration task (bottom), which produces a narration given a dialogue history and game state information.
Due to its use of natural language as actions, each individual player must maintain a personal understanding of the game world, which they build from conversational history and Theory of Mind (Martin et al., 2018; Zhou et al., 2022). The natural language action space also means that an AI system must adequately perform tasks such as language generation, language understanding, and planning (Callison-Burch et al., 2022).
Although AI's capabilities in this space are still nascent, Callison-Burch et al. (2022) have shown that D&D dialog can be improved by adding state information into the input of a large language model (LLM). However, the state information presented in that work was heuristically created using regular expressions and machine learning classifiers, so it cannot be considered ground-truth state information. Our work is unique in providing true gold-standard game states.
We use this data for two tasks: Utterance to Command and State to Narration. In the first task, a model is given a game state and turn of the game (roleplay in natural language), and must predict the corresponding command that matches the intent of the roleplay. The second task is a constrained creative natural language generation task: given a state change resulting from a command execution, generate a narration that describes the results. Figure 1 demonstrates both tasks.
Our contributions are as follows:

• We present FIREBALL, a dataset of over 8M gameplay utterances, 2.1M commands, 1.2M gameplay states, and 160K unique actors (player and non-player characters).¹ This is the first dataset of this size that includes detailed game state and character information for each scenario.

• We show that large language models such as GPT-3 can extract relevant information from natural language in order to produce commands that can be run by the game environment.

• We demonstrate that LLMs, when finetuned on this dataset, generate more grounded narrative text than language models tuned without game state information.

By incorporating structured state information into language understanding and NLG tasks, we hope to help pave the way toward more ambitious creative generation goals, such as consistent long-form narrative generation and video games that can convert language input into discrete game actions or generate narrations and dialogues based on the game state.

Related Work
Previous papers have outlined the challenges of Dungeons & Dragons as an AI problem and examined various aspects of the game (Ellis and Hendler, 2017; Martin et al., 2018). Subsequently, a number of datasets in the D&D space have been created (Louis and Sutton, 2018; Rameshkumar and Bailey, 2020; Si et al., 2021; Callison-Burch et al., 2022; Papazov et al., 2022), but these datasets either do not include game state information or include only an inexact game state, which lacks grounded, verified attributes, such as those included in our dataset. Others have looked at using AI for subsets of D&D gameplay, such as generating spell descriptions (Newman and Liu, 2022) or simulating combat (Gama Vila Nova et al., 2019).
In addition to the gold-labelled game state, we include all of the attributes from the papers above.

¹A subset of our FIREBALL dataset is available here: https://github.com/fireball-anonymous/fireball-preview

Playing D&D Using Avrae
In Dungeons & Dragons, a group of players each create and play as a character. Characters have classes (such as wizard or barbarian), fantasy races (such as Elves, Gnomes, and Dragonborn), hit points (denoting their health), statistics that govern how well they can do certain actions, and inventories of items (armor, weapons, potions, etc). These game state elements are stored on a character sheet (Figure 2). One player takes on the role of the Dungeon Master (DM). This special player creates the world in which the story is told, role plays all of the characters the other players interact with, and acts as arbiter of the rules of the game.
There are two main modes of gameplay, in-combat and out-of-combat, which have different styles of play. In-combat play simulates battles between characters and monsters, and involves turn taking and tracking stats. Out-of-combat play is characterized by freeform collaborative storytelling. Both modes of the game involve rolling dice to determine the success of players' actions (like attempting to attack a monster). The die's outcome is then modified with a stat that represents the character's skill in performing that particular action (e.g., +3 for acrobatics). If the dice roll plus the modifier meets or exceeds a threshold that the rules or the DM determines (called a difficulty class, or DC), then the action succeeds. Otherwise it fails. The DM then narrates the results.
For example, if an attack hits a monster, the DM might narrate the player's blow and how the monster reacts, taking into account information like the attack's damage type and how many hits the monster has taken previously.
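The check mechanic described above can be sketched in a few lines of Python. The meet-or-beat comparison follows the standard 5e rule; the function names are our own, not part of any game tooling:

```python
import random

def check_result(roll: int, modifier: int, dc: int) -> bool:
    """A check succeeds when the d20 roll plus the skill modifier
    meets or beats the difficulty class (DC)."""
    return roll + modifier >= dc

def ability_check(modifier: int, dc: int, rng=random) -> bool:
    """Simulate a full check, e.g. Acrobatics +3 against DC 12."""
    return check_result(rng.randint(1, 20), modifier, dc)
```

For instance, a character with +3 acrobatics who rolls a 10 against DC 12 succeeds (10 + 3 = 13 ≥ 12), while a roll of 8 fails.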
Traditionally, D&D is played in person with characters' stats written out on physical character sheets and monster stats referenced from books containing hundreds of prewritten "stat blocks". To track a stat that changes frequently like hit points, players and DMs use paper and pencil or whiteboards, performing math in their head and writing over the previous value when it changes. Some players also use maps and miniatures to track where characters and monsters are located relative to one another and to aid immersion in the game world. Since the beginning of the pandemic, a large number of groups have moved online using tools like Discord (a messaging program), virtual tabletops that simulate maps, and game state trackers like Avrae rather than physical mediums.
Avrae is a Discord bot that was designed to help people play D&D online. It allows players to import their character sheets, allows DMs to access a database of monsters, and simulates dice rolls.
During combat, Avrae tracks the game state. This state contains detailed information including the list of participants in the battle, their stat blocks, their current hit points, and their available actions. Avrae allows players to execute commands representing their characters' actions. It performs a simulated dice roll, adds the player's modifiers, and determines the success or failure of the roll. Avrae then updates the game state, adjusting things like hit points and the turn order.
A simplified example of interacting with Avrae might look like the following:

Player: Filgo crouches down in the bush, loosing an arrow at the dire wolf charging towards him.

Player: !attack longbow -t DW1
Avrae: (Rolls dice and displays the results of the attack and damage dealt, including the new health of the dire wolf.)

Dungeon Master: Your arrow flies true and the beast lets out a shrill howl as it pierces its matted fur. It's low on health now, so on its turn it'll retreat.
In actual play, an average of 3-8 players (including the DM) take turns interacting with Avrae. By instrumenting Avrae to record these commands and messages before and after a user's inputted command, we collect a rich set of structured gameplay. Appendix B contains a full list of recorded events and their descriptions.

Dataset
We worked with the developer of Avrae to instrument it to collect game transcripts and log game state information. The data collection was approved by Wizards of the Coast, the game company that owns D&D and Avrae, as well as by our institution's IRB and the Bot Safety team at Discord. Any players who participated in our study provided their informed consent.

Data Collection
Participants were recruited from English-speaking "play-by-post" D&D Discord servers, where players and Dungeon Masters play by taking turns posting in a Discord text channel to describe their moves. For each actor (player or monster), we record a detailed state; Table 6 in the Appendix lists all available attributes and their potential relevance to NLG tasks. Similarly, recorded actions include the detailed results of each dice roll, such as whether a given attack hit its target or a spell succeeded (a list of all action attributes is available in Appendix C).
In the following sections, we refer to our data as triples consisting of a command and its corresponding state change, any relevant utterances before the command ("preceding" utterances), and any relevant utterances after the command ("following" utterances). Each of these commands corresponds to an action that an actor in combat can take, such as attacking with a weapon, casting a spell, or using a special ability.

Utterance-Action Alignment
To align utterances with their corresponding state changes, we match each utterance with its chronologically nearest state change. Utterances that occur before their corresponding state change are tagged as motivating the command (the "preceding" utterances), and utterances that occur after it are tagged as narration of the state change (the "following" utterances). These alignments create a prototypical triple as described above. Within each triple, we discard any utterance containing fewer than five words.
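The alignment procedure can be sketched as follows. The event representation (timestamped tuples) and the `Triple` container are illustrative simplifications, not our actual pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Triple:
    """One command/state change with its surrounding utterances."""
    command: str
    time: float
    preceding: list = field(default_factory=list)
    following: list = field(default_factory=list)

def align(utterances, state_changes):
    """Attach each utterance to its chronologically nearest state change.

    `utterances` is a list of (time, text) pairs; `state_changes` is a
    list of (time, command) pairs. Utterances earlier than their nearest
    change become "preceding" (motivating) text, later ones become
    "following" narration. Utterances with fewer than five words are
    discarded.
    """
    triples = [Triple(command=c, time=t) for t, c in state_changes]
    for u_time, text in utterances:
        if len(text.split()) < 5:
            continue  # drop very short utterances
        nearest = min(triples, key=lambda tr: abs(tr.time - u_time))
        if u_time < nearest.time:
            nearest.preceding.append(text)
        else:
            nearest.following.append(text)
    return triples
```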

Authorship Filtering
Within each triple, we identify the user who issued the commands, and the Dungeon Master hosting the combat. We discard any utterances within each triple which are not authored by one of these users. Additionally, we discard any triple where the commands originate from multiple different actors, which may occur if a single user is controlling multiple different creatures in a group. Finally, we discard any triple which has neither any "preceding" utterances nor "following" utterances.
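The authorship filters above can be sketched as below. The dictionary layout and field names are illustrative, not our actual data schema:

```python
def filter_triple(triple, command_author, dm, command_actors):
    """Apply the authorship filters: keep only utterances written by the
    command's author or the Dungeon Master, discard triples whose
    commands came from multiple actors, and discard triples left with
    no surrounding utterances at all."""
    if len(set(command_actors)) > 1:
        return None  # one user controlling multiple creatures: discard
    allowed = {command_author, dm}
    preceding = [u for u in triple["preceding"] if u["author"] in allowed]
    following = [u for u in triple["following"] if u["author"] in allowed]
    if not preceding and not following:
        return None  # no context left: discard
    return {**triple, "preceding": preceding, "following": following}
```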

IC/OOC Classification
We further distill the set of "following" utterances by training GPT-3 Ada (Brown et al., 2020) to distinguish "in-character" (IC) utterances from "out-of-character" (OOC) utterances. In-character utterances are what the player says when speaking as their character or describing their character's actions. They might look like this: Filgo puts a hand on his axe, uneasy after the shaking he'd felt from the ground.
"Is someone there?"  Meanwhile, out-of-character utterances occur across players/the DM when not speaking as any particular character. This dialog might be to discuss rules or strategy, or might be unrelated to gameplay entirely. Out-of-character utterances might look like this: How much health do you have left?
I'll move back 30 feet after.
BRB, going to the bathroom.
To distinguish between these categories, we finetuned a classifier that was pretrained on Giant in the Playground data (Callison-Burch et al., 2022) on a hand-labelled set of 750 utterances randomly sampled from our dataset. The classifier achieved an accuracy of 94% on a validation set of 125 utterances. We then applied the classifier to each utterance in our dataset and discarded any out-of-character utterances from the "following" set, since in-character text is usually more interesting and evocative. Finally, we also removed sections of utterances contained in parentheses, which usually indicate OOC speech, from the "following" set.
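The final parenthetical-removal step can be approximated with a simple regex; this sketch assumes non-nested parentheses:

```python
import re

# Parenthesized spans in "following" utterances usually mark
# out-of-character speech, so they are stripped out.
PAREN_OOC = re.compile(r"\([^)]*\)")

def strip_parenthetical_ooc(utterance: str) -> str:
    """Remove parenthesized spans and collapse leftover whitespace."""
    return re.sub(r"\s{2,}", " ", PAREN_OOC.sub("", utterance)).strip()
```

For example, `strip_parenthetical_ooc('Filgo nocks an arrow. (BRB, bathroom) He fires!')` keeps only the in-character text.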

Dataset Size
Our dataset contains 25k unique combat scenarios, including 8M utterances from 3.6k unique authors covering 1.3M unique combat states. Table 1 contains a breakdown of the distribution of commands in our dataset, organized by command category.

Utterance to Command Task
Our first task aims to predict the game command that a player or Dungeon Master intended to use, given the utterances since their last turn in combat. To successfully predict a command from an utterance, a model must be able to predict the user's intent, which actors the user intended to target, and ground both these predictions in the game state. For example, in the scenario illustrated in Figure 1, a dwarf named Filgo is fighting a Dire Wolf. On his turn, his player narrates that Filgo attacks with his axe, then runs the command to target the monster with his attack. Notice how, in this example, the player references the target dire wolf by its creature type ("the wolf"), rather than its name in the game state ("DW1").
To accomplish this task, we provide the models with the state information included in our dataset, namely the list of actors participating in combat and any information about those actors, such as their monster type and current hit points. The full prompt for the example mentioned above is available in Appendix F.
After our distillation passes, our dataset contains 120,000 aligned utterance-command pairs. We examine the accuracy of predicted commands on both a token level and by injecting predicted commands into the Avrae system. Finally, we also examine the performance of models without game state information included to demonstrate the importance of the game state.

Models
We use GPT-3 (Brown et al., 2020) Davinci models (as of Dec. 2022) as a base. Finetuned models use standard Davinci, while few-shot models use Davinci-002. For the Utterance to Command generation task, we evaluate four main treatments:

• FT + S: The base model is finetuned on a sample of 30K examples from FIREBALL with state information presented in the prompt.

• FT: The base model is finetuned on a sample of 30K examples from FIREBALL without any state information presented in the prompt.

• FS + S: The base model is presented 3 exemplars (few-shot) sampled from FIREBALL with relevant state information.

• FS: The base model is presented 3 exemplars sampled from FIREBALL without any state information.
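A prompt for the state-aware treatments might be assembled as in the sketch below. The exact template we used is given in Appendix F, so this layout and these field names are only illustrative:

```python
def build_prompt(utterance, actors=None):
    """Assemble a command-prediction prompt.

    When `actors` is provided (the "+S" treatments), the combat state is
    serialized into the prompt before the utterance; otherwise only the
    utterance is shown. Field names here are hypothetical.
    """
    lines = []
    if actors is not None:
        lines.append("Actors:")
        lines += [f"- {a['name']} ({a['type']}, {a['hp']} HP)" for a in actors]
    lines += ["Utterance:", utterance, "Command:"]
    return "\n".join(lines)
```

Usage: `build_prompt("Filgo attacks the wolf with his axe", [{"name": "DW1", "type": "dire wolf", "hp": "12/37"}])` yields a prompt ending in `Command:`, which the model completes with a command such as `!attack axe -t DW1`.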

Evaluation
Each of the generation tasks is evaluated independently. Since the command generation task is more akin to a structured generation task, we evaluate only objective correctness rather than subjective quality of generated text. We evaluate the generated commands using several quantitative metrics. First, we evaluate whether the generated command is a valid Avrae command by simply passing the command to Avrae and checking for successful execution. Similar to Chen et al. (2021), we calculate a pass rate metric that determines the proportion of generations that constitute valid Avrae commands. To calculate this metric, we have each model generate commands for 1000 utterances randomly sampled from a held-out test set, and count the proportion of generations that Avrae is able to successfully execute.
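The pass rate computation reduces to a simple proportion; `try_execute` below is a hypothetical stand-in for submitting a command to Avrae and checking for successful execution:

```python
def pass_rate(commands, try_execute):
    """Fraction of generated commands the game engine accepts.

    `try_execute` is a callable returning True when a command runs
    without error (a stand-in for the real Avrae round trip).
    """
    if not commands:
        return 0.0
    return sum(1 for c in commands if try_execute(c)) / len(commands)
```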
Second, we evaluate what proportion of generated commands would result in the desired state update through a number of hand-written unit tests. Each test accepts a predefined combat state, an utterance, and the corresponding model-generated command, and validates assertions on the combat state update. We took 10 common scenarios seen in D&D for these unit tests, generating 10 commands for each scenario-model pair. Since these generations sometimes repeat, we take the n unique commands from each set of generations, run each through Avrae, and measure what proportion pass the handwritten assertions on the updated combat state.
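The de-duplication and scoring step can be sketched as follows; `passes_assertions` is a hypothetical stand-in for running one command through Avrae and checking a unit test's assertions on the updated state:

```python
def unit_test_pass_rate(generations, passes_assertions):
    """Proportion of unique generated commands that satisfy a
    scenario's handwritten assertions."""
    unique = list(dict.fromkeys(generations))  # de-duplicate, keep order
    return sum(1 for c in unique if passes_assertions(c)) / len(unique)
```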
Lastly, we perform a qualitative analysis to better understand the nature of the grounding. We took two popular spells (Bardic Inspiration and Fireball) and hand constructed a scenario for each. We then perturb the prompts that we present to the model to study the model's sensitivity to inputs.

Results & Discussion
Table 2 displays the objective evaluations of the four treatments in the Utterance to Command generation task. We see that the model which was finetuned on FIREBALL with the state information (FT+S) significantly outperforms all other models (i.e., both models without the state information and the few-shot models). Within the perturbation results (detailed in Appendix D), we notice that the FT+S model can accurately gauge the state of the actors in combat. For example, when asked to cast a Fireball at injured enemies, it successfully parses the prompt to find the subset of enemies that were injured and only targets them in the resulting command. Further, the model can accurately determine which of the prepared spells corresponds to an utterance. For example, an utterance that would have generated the Fireball spell generates the spell Burning Hands instead if Fireball is removed from the prepared spell list.

State to Narration Task
In this task, we want to generate a narrative utterance describing the effects of a player's actions, given all of the state changes since the start of the player's turn in combat.
For example, in the scenario illustrated in Figure 1, a party is fighting a Sea Hag. On the cleric's turn, she attacks the hag with her mace, but misses. After seeing the result of her action, she narrates the miss, referencing the result as the hag dodging the attack. The full prompt for this example is available in Appendix E.
After our distillation and filter passes (Sections 4.2-4.4), our dataset contains 43,000 aligned state-utterance pairs. To examine the importance of the game states provided in our dataset, we compare our results to methods that do not include the game state, such as dialog continuation (Callison-Burch et al., 2022) and predicting the narration given only the command that was run (Papazov et al., 2022).

Models
We finetuned four GPT-3 (Brown et al., 2020) models on different data to determine the effect of state and dialog history inputs on generation. Each model uses Davinci (as of Dec. 2022) as a base model, using 20,000 state-utterance pairs.
• DIALOG: Our first baseline model. This model is only given the last 5 messages of chat history, and finetuned to predict the next utterance that continues the dialog. It is not given any information about the game.

• COMMAND: Our second baseline model. The model is only given the command that was run to take the player's action, and finetuned to predict the corresponding utterance.

• FIREBALL-SHORT: Similar to DIALOG, but also contains the mechanical description of the action's results.

• FIREBALL-FULL: All information given to FIREBALL-SHORT plus the full actor list, target list, and detailed attributes of the caster.

Table 3: Perplexity, BERTScore, and ROUGE-1 scores of our models and human-written responses.

Automated Evaluation
For the combat State to Narration generation task, we leverage standard text generation metrics: perplexity using a GPT-2 model (Radford et al., 2019) as a baseline, BERTScore (Zhang et al., 2019), and ROUGE (Lin, 2004). All metrics aside from perplexity are calculated using the human narration as a reference. The results of our automated evaluation are available in Table 3. We note that automated metrics are not particularly suited for evaluation of creative natural language generation. Perplexity is a measure of how "unexpected" a sequence is to a language model, which does not directly correlate with the quality of creative generation. Furthermore, BERTScore and ROUGE evaluate similarity to a reference, which is an imperfect fit for our task, where two narrations can differ substantially yet both be of high quality. These limitations are evident in the disparity in results between automated and human evaluation, which is expected given previous work that reached similar conclusions (Sagarkar et al., 2018; DeLucia et al., 2021).
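As a concrete illustration of one reference-based metric, ROUGE-1 F1 is simply unigram overlap between candidate and reference. This minimal pure-Python sketch is for exposition only; actual evaluations should use an established implementation:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```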

Human Evaluation
We also perform a human evaluation to assess the quality of the generated utterances. In total, we recruited 45 evaluators from the Avrae user base based on their experience with Avrae and D&D. All evaluators had played D&D using Avrae before: 37 had used Avrae for over a year and 37 had been the Dungeon Master of a game using Avrae. Evaluators were rewarded with a set of codes for digital goods on D&D Beyond with a market value of $36 for completing the rating task. We provided each evaluator a version of the context that was provided to the models: the last fifteen messages sent in a channel, the casting actor and their description, a list of actors in combat, and the current state of those actors. Along with each context, we provided one generated utterance from each model along with the true utterance sent by a human. The evaluators were asked to rate each output along three dimensions, following the evaluation procedure used for the Meena LM (Kulshreshtha et al., 2020) and the D&D Dialogue dataset (Callison-Burch et al., 2022):

• Does the response make sense? (yes/no)

• Is the response specific? (yes/no)

• How interesting is the response? (10-point scale)

Each evaluator rated 3 to 7 scenarios randomly drawn from a set of 75, with at least 3-way redundancy for each scenario. The full annotator instructions and a mockup of the annotation interface are given in Appendix G.
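With at least 3-way redundancy per scenario, per-item ratings can be aggregated by majority vote on the binary questions and a mean over the interestingness scores. This scheme is illustrative, not necessarily the exact procedure we used:

```python
from statistics import mean

def aggregate_ratings(ratings):
    """Aggregate redundant ratings for one scenario-model pair:
    majority vote on the yes/no questions, mean of the 1-10 scores."""
    sense = sum(r["sense"] for r in ratings) > len(ratings) / 2
    specific = sum(r["specific"] for r in ratings) > len(ratings) / 2
    return {"sense": sense, "specific": specific,
            "interesting": mean(r["interesting"] for r in ratings)}
```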

Results & Discussion
The results of our human evaluation are tabulated in Table 4. Both FIREBALL models outperform the baseline models in sensibility and specificity by an average of 15 percentage points (significant to p < 0.01), and on average perform similarly to a human (p > 0.5). A detailed analysis of significance can be found in Appendix H. Generally, models that were aware of the game context (including COMMAND) were more interesting than the model tasked with simply continuing the chat history, and comparable to human performance.
It may seem unusual for the human performance to be so low, only making sense to an experienced D&D player about 50% of the time. One explanation could be that human-written responses were more likely to refer to background knowledge not provided in the model's context, and therefore may have caused raters to mark the response as nonsensical. We compiled qualitative feedback from our evaluators to provide some insight into why this may be, as well as to identify some common failure cases of our models. We summarize some of the recurring themes here.
Removing Player Agency. The most common theme among evaluators' feedback was that they did not want the model to take control of their character away from the player. One evaluator noted that "several [narrations] had player characters acting or speaking in response to the action. That's something I would never want a human DM doing unprompted and it might be frustrating to have the bot look like it's trying to control what my character does or says." This problem extended to the Dungeon Master's role, as well; one evaluator mentioned that some AI responses would specify creatures' movement and "drag [their] encounters down the hallway." Player agency is an especially challenging aspect of the game to maintain while training language models with real player utterances, as the training data available naturally makes decisions for and speaks as the player. Multiple evaluators suggested that language models should be used to provide assistance and inspiration for players rather than directly acting on behalf of the player, allowing a human to make the final decision for their characters.
Hit Point Reasoning. Another common failure case was models' tendency to describe any damage to a target, regardless of the target's true remaining health, as a kill. According to one evaluator, "I ran into a scenario where the text was absolutely great, but incredibly wrong. A player scored a critical hit on a Mind Flayer with a bow, and left the mind flayer bloodied. The long text stated the scenario beautifully, even got the weapon right... but said it killed the Mind Flayer." We noticed this failure case was more prevalent among the DIALOG and COMMAND models, demonstrating that our state-aware models were able to appropriately ground the generated narrations. Nevertheless, even the state-aware models often made this mistake. We have seen in our dataset that humans tend to only narrate critical portions of a fight, like when an enemy is killed. This reporting bias creates a disproportionately large number of narrations in our training set that reference a creature being killed. Therefore, the models trained on this data skew towards narrating that monsters are killed.

Table 5 shows a scenario that exemplifies many of these aspects:

Last Human Utterance: "Dogs... so many of them making it more annoying then anything." Umbrage didn't want to waste anything big. So he pulled out his pipes, taking the attack of opportunity as he would start to play. "Fear me!"

Command Run: !i cast "Cause Fear" -t dd1 -t dd3 -t dd4 [...]

DIALOG: The music was certainly not pleasant, but it seemed to have an effect on the dogs. They became panicked for a moment, but were able to maintain their aggression. It was just enough for the party to act.

COMMAND: Cuco proceeds to try and scare the ducks. So far it wasn't working.

FIREBALL-SHORT: As the first dog fell, the others seemed to flee once more in pure terror of the seemingly unending tunes of the pipes.

FIREBALL-FULL: The dogs were in the middle of a panic. But still they fought on!

Human (Original Narration): Moving back into the melee with the one he was originally fighting. As he drew ready.

Without the guidance of previous dialog turns, the COMMAND model hallucinates names for the caster and the targeted creatures. The DIALOG, FIREBALL-SHORT, and FIREBALL-FULL models produce plausible next utterances based on the details provided in the player's utterance but exhibit some of the discussed failure cases: FIREBALL-SHORT acts as the Dungeon Master to narrate that the dogs have fled, and it references the first dog dying, which it did not (in the game state, the dog still has a full 39/39 hit points). For reference, the true next utterance as written by the player does not mention any effect of the spell, instead focusing on the character's movement. The full game state and chat history associated with this example is included in Appendix I.

Conclusions
We have demonstrated how the FIREBALL dataset can be used to predict game commands that correctly match a player's intent and generate cohesive and grounded narration from the results of a game action. Our Utterance to Command model is capable of translating roleplay into game-specific actions and can aid novice users, reducing the amount of time players spend looking up documentation and allowing them to play the game more. Our State to Narration model can help inspire the Dungeon Master and take some of the cognitive load off of repetitive writing tasks, allowing them to focus on creating an enjoyable experience for the players. FIREBALL opens the door to multiple exciting avenues of research, and we're excited to see how future work utilizes our unique dataset of state-augmented gameplay.

Limitations
Dungeons & Dragons is a very complex game to capture completely, and there are certain aspects that FIREBALL does not take into account. For example, FIREBALL's scenarios are recorded independently of the overarching narrative context they take place in, do not record players' inventory, and do not account for any movement or placement on a map. Our models are not able to play D&D autonomously, but doing so is not the goal. Instead, D&D models can be used to assist and inspire the humans playing.
Our models do not take into account the generation of profanity or sensitive topics; these were filtered out post-hoc. D&D is a game played by players of all ages that often contains violent or profane descriptions, and unfiltered generations may be unsuitable for young players. There are previous instances of roleplaying games that incorporate language models being used to generate sexual content³ that would require age restrictions and content warnings.
GPT-3 may be prohibitively expensive for everyday use; in our experiments, we were unable to use the full set of data we had available for fine-tuning due to budget constraints.

(Fragment of the actor attribute table from Appendix B.) Limitation on the number of times the character can use an action before resting: certain abilities can only be used a certain number of times. Armor Class (e.g., 18): how difficult the character is to hit; provides information about how dire a situation might be for more interesting text generation. Actors are also commonly referred to by a combination of these attributes (e.g., "the prone dwarf").

C Action Attributes
Actions consist of a tree of effects, such as rolling to hit, dealing damage, or rolling a saving throw.

D Utterance To Command Generation Perturbation Details
For the perturbation experiments, we selected 2 specific scenarios: one in which the player chooses to cast a Bardic Inspiration spell to target a single member of their party, and another in which the player chooses to cast Fireball, perhaps the most canonical spell in D&D. For both of these scenarios, we took the combat state and prompt from their respective unit tests and then perturbed them in order to test the ability of the model to react to various modifications in input. We specifically study the model's responses to the following scenarios:

• Targeting and association: can the model pick out intended targets from nicknames/character classes/races in the input? E.g., can it determine that "Inspires the druid" and "Inspires Noxxis" should result in targeting the same character (if Noxxis is a druid)?

• Can it recognize a spell from a creative description of its effects?

• Does it attend to spells in the prepared spell list?
While we do not perform exhaustive quantitative analysis, a preliminary analysis indicates that the finetuned model that includes the state information can react to changes in the prepared spell list. This prompt generates the command !cast fireball -t OR1 -t OR2 -t OR3 -t OR4, but if the Fireball spell is removed from the prompt, it generates the command !cast "burning hands" -t or1 -t or2 -t or3 -t or4 instead. Similarly, in the case of the Bardic Inspiration example, it is able to replace Bardic Inspiration with Healing Word. The model seems to be able to reliably differentiate between healthy and injured enemies: asking the model to cast Fireball at the injured enemies generates the appropriate command. It also seems to be able to target based on character classes and races. However, it does not always target the correct number of enemies; asking the model to target "2 injured orcs" leads to targeting all the injured orcs. Similarly, asking the model to target based on party roles fails: asking the model to target "the casters" does not correctly target the spellcasters in the party. While these results are promising, we leave exhaustive quantitative evaluation to future work.

E Full State to Narration Prompt
In other words, do you think that the response accurately narrates the last action the character actually took and its results?
The response is "specific" if it flows logically from the specific action and result taken by the character, in the greater context provided.
Note: It is possible for a response to "make sense" (by being cohesive, consistent, and plausible in and of itself) but still be marked "not specific" when it is not a logical next step in the overall game progression.
Note: "Specific" for the purposes of this task does not refer to how detailed the response is; a response can be fairly general in its language but still qualify as "specific" when it is a logical next step in the overall game progression.

How interesting is the response? (10 is best)
Rank a response as more "Interesting" if it would likely catch someone's attention or arouse curiosity in the game, or if it is insightful, creative, or witty with respect to the game. If the response is monotonous and predictable, rank it lower. If anything seems off (not fluent, confusing, illogical, out of context, or wrong according to the rules of D&D), rank it lower.


H Human Evaluation Significance
We use the Student's t-test to calculate significance for our three rated dimensions. The results are tabulated below, with bold indicating p < 0.001, italics indicating p < 0.01, and † indicating p < 0.05:
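As a concrete, stdlib-only illustration of the test used here, the pooled two-sample Student's t statistic can be computed as follows. The rating arrays are toy values, not our evaluation data.

```python
import math
from statistics import mean, variance

def students_t(a, b):
    """Pooled-variance two-sample Student's t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    # Pooled sample variance: assumes the two groups share a common variance.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Toy 1-10 "interesting" ratings for two systems (illustrative only).
t, df = students_t([7, 8, 6, 9, 7], [5, 6, 4, 6, 5])
print(round(t, 2), df)  # 3.48 8
```

The p-value is then read from the t distribution with `df` degrees of freedom; in practice a library routine such as SciPy's `scipy.stats.ttest_ind` returns both the statistic and the p-value directly.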

I Example of Full State to Narration Context
The following is the full prompt provided to the FIREBALL-FULL model in the example provided in Table 5.
History: Player 1: *Holawynn would back up 35 if she could. The satyr knew where she should be standing during this fight. She starts up with a twilight flame to conserve slots.* Player 1: *It misses out of sheer unluck. A shame that it was also kinda crap.* Player 0: The hounds charged. Each managing a singular bite on their targets. It would seem they were all wanting to eat some tender flesh of the adventures who passed through their masters lair! Player 2: *Kaska looked about the combat and swung her weapon to Yala's aid, Baki attacking the dog that wanted to eat his bacon.* Player 0: "Dogs... so many of them making it more annoying then anything." *Umbrage didn't want to waste anything big. So he pulled out his pipes, taking the attack of opportunity as he would start to play.* "**Fear me!**"
---
Description: __**5'10" (180cm) | 180 lb. | Chromatic Dragonborn | Fighter (Battle Master)/Bard**__
> Young, lean but strong overall build. They're a blue chromatic dragonborn who's always seen in armor and formal decorated robes. With silvery blond hair that is usually hidden behind a helm. Umbrage has many choices of weaponry, not one to pick or choose when it comes to the field of battle. But his most favored would be that horn of his. A rustic and old warhorn, The ivory it's made from is something unusual, even going so far as to be able to tap into the wave around him by chance.
>
> With the passing battles, many new scars are shown upon scales. But the only one that bothers him the most and that is always is kept hidden. Is the injury found upon his neck. The cause must have been something heavy enough to leave a lasting imprint, but Umbrage would never tell what it was. Shocking anyone who