BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?

Language models have seen significant growth in the size of their corpora, leading to notable performance improvements. Yet, there has been limited progress in developing models that handle smaller, more human-like datasets. As part of the BabyLM shared task, this study explores the impact of reinforcement learning from human feedback (RLHF) on language models pretrained from scratch with a limited training corpus. Comparing two GPT-2 variants, the larger model performs better in storytelling tasks after RLHF fine-tuning. These findings suggest that RLHF techniques may be more advantageous for larger models due to their higher learning and adaptation capacity, though more experiments are needed to confirm this finding. These insights highlight the potential benefits of RLHF fine-tuning for language models trained on limited data, enhancing their ability to maintain narrative focus and coherence while adhering better to initial instructions in storytelling tasks. The code for this work is publicly available at https://github.com/Zephyr1022/BabyStories-UTSA.


Introduction
The recent growth in the size of large language models (LLMs) has enhanced natural language processing capabilities, from information extraction (Agrawal et al., 2022) to language generation (Stiennon et al., 2020). However, the majority of research has been concentrated on environments with high computational power and a large number of parameters, leaving the emergence of these capabilities largely uninvestigated in low-data and low-resource settings (Brown et al., 2020; Fedus et al., 2022). Although some studies have looked into the relationship between model size, training volume, and performance for LLMs, they have primarily focused on scaling laws in high-compute settings (Hoffmann et al., 2022). Investigations into the effects of pretraining at a smaller scale have been limited (Huebner et al., 2021; Deshpande et al., 2023). Therefore, it is worth exploring strategies that maximize the efficiency of pretraining, especially under the constraints of limited data availability.
Storytelling is a fundamental human activity used to share information, impart lessons, and keep loved ones informed about our daily lives (Bietti et al., 2019). Teachers leverage children's love for stories and their desire to tell them, using storytelling to promote cognitive and literacy development. Storytelling is a critical bridge between the oral language skills of early childhood and the more mature language skills associated with reading and writing. The recent BabyLM shared task aims to address these challenges (Warstadt et al., 2023). Hence, we report our submission to the shared task in this paper. Specifically, our study aims to understand whether we can pretrain a language model from scratch on the same amount of linguistic data available to a child, modeling a smaller, reduced-vocabulary language. We are interested in assessing a particular model's effectiveness and potential for enhancement. In particular, we investigate whether the model can demonstrate high performance and whether its performance can be further improved using reinforcement learning from human feedback (RLHF) (Fernandes et al., 2023). This process is analogous to how teachers instruct children in storytelling, providing feedback to encourage them to develop more coherent and reasonable narratives. Implementing RLHF has shown promising results in aligning foundation models with human preferences. By using RLHF, models can undergo subtle yet significant improvements, such as refining tone (Liu, 2023), reducing biases and toxic elements (Bai et al., 2022), and enabling domain-specific content generation (Bang et al., 2023). The primary goal of this research is to explore whether a small pretrained model, with its limited data size, can also benefit from RLHF, thus potentially improving its overall performance.
Small language models (SLMs) trained on large datasets have been observed to perform poorly, generating incoherent and repetitive text. Conversely, training large language models on limited data can lead to overfitting, making smaller models a potential solution (Warstadt et al., 2020c). Inspired by how humans acquire language and by the BabyLM shared task, we explore downsizing the language used in models to observe the effects of pretraining. The main questions are whether small language models can generate coherent English text or whether this ability is limited to larger, more complex models. We also ask whether the limited capacity of small models to memorize linguistic features (such as syntax, semantics, morphology, and phonology) leads to less creative outputs compared to larger models. Linguistic features are crucial for understanding and generating text, and a broader grasp of them may enable more creative language use. Larger models, with their increased capacity, might capture a wider range of these features, possibly leading to more creative and nuanced language outputs. Conversely, small models might only learn basic or frequent linguistic patterns, potentially limiting their creative language generation capabilities. Previous research indicates that models can learn linguistic features with limited pretraining data but need more data to prioritize linguistic generalizations over superficial ones (Warstadt et al., 2020c). Some models fail to effectively use the linguistic features they learn during fine-tuning for natural language understanding tasks. This study investigates whether GPT-2 models of varying sizes can acquire specific language patterns when fine-tuned with reinforcement learning from human feedback, with the aim of enhancing the models' storytelling abilities.
In summary, in this paper, we pretrain a GPT-2 Base model with 125M parameters from scratch and compare it with the larger GPT-2 Large model, which has 774M parameters, making it approximately six times larger. Both models are trained using the limited dataset provided by the BabyLM Challenge, which consists of approximately 100M words (Warstadt et al., 2023). The dataset encompasses various sources, including child-directed speech, transcribed speech from multiple sources, children's books, and Wikipedia. Subsequently, we use RLHF to fine-tune both models, evaluate their ability to acquire new linguistic features through human feedback, and perform a human evaluation of the generated stories.

Related Work
Research has shown that smaller models tend to underperform when trained on large datasets, making the study of model downscaling a non-trivial problem (Turc et al., 2019). Previous investigations into smaller models have primarily centered on distillation processes (Sanh et al., 2019), with the aim of maximizing performance while reducing the number of parameters. Huebner et al. (2021) is one of the papers most relevant to our work: they found that a small language model trained on child-directed speech can yield results comparable to larger language models on specific probing tasks. In another study, Deshpande et al. (2023) trained several models to explore scaling in low-compute environments, assessing their performance on a modified version of GLUE.
Our research, however, is driven by a desire to understand whether small pretrained models can benefit from Reinforcement Learning from Human Feedback (RLHF), potentially improving their overall performance despite their limited data size. Two previous studies relate directly to this work: the first employed human ranking feedback to train summarization models using reinforcement learning (RL) (Stiennon et al., 2020), and the second used stories to generate a value-aligned reward signal for RL agents, aimed at mitigating hallucination behavior (Riedl and Harrison, 2016).

Data
In this section, we describe the pretraining data used for the language models and the data used for the reward model.

Pretraining Data
We pretrain GPT-2 models using the dataset from the STRICT track of the BabyLM Challenge (Warstadt et al., 2023), which includes various types of corpora, both spoken-based and written-based. Examples of the spoken-based corpora include CHILDES (MacWhinney, 2000), the dialogue portion of the British National Corpus (BNC), OpenSubtitles (Lison and Tiedemann, 2016), the QCRI Educational Domain Corpus (Abdelali et al., 2014), and the Switchboard Dialog Act Corpus (Stolcke et al., 2000). The written-based corpora include the Children's Book Test (Hill et al., 2016), the Children's Stories Text Corpus, the Standardized Project Gutenberg Corpus (Gerlach and Font-Clos, 2020), Wikipedia, and Simple Wikipedia. The children's book and Wikipedia corpora stand in contrast to the dialogue- or subtitle-based corpora, which mostly consist of transcribed speech, the primary language input for children. Wikipedia, in particular, is a compilation of written language rather than spoken dialogue. Most of its articles are composed by professionals who possess subject-matter expertise and adhere to rigorous standards of grammatical correctness. Together, these corpora contain approximately 100 million words, corresponding to the linguistic competence expected at the onset of adolescence (around 13 years old).

Reward Model Data
In this paper, we construct a reward model dataset for reinforcement learning by selecting 100 sentences from the STRICT track of the BabyLM Challenge dataset. These sentences, serving as prompts, are derived from two subsets of the BabyLM dataset, the Standardized Project Gutenberg and Simple Wikipedia development sets, with the requirement that each sentence include characters and a plot. These prompts are then used to generate two short stories each from the GPT-2 Base and GPT-2 Large models, beginning with the prefix "write me a story starting with". To enhance story diversity, we set a maximum length of 128 tokens and enforce a minimum of 10 new tokens in the generated stories. The generation code uses a beam size of 7 to improve story quality by exploring various potential continuations.
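For concreteness, the decoding setup described above can be summarized as a small configuration sketch. The keyword names below follow Hugging Face `generate()` conventions and are our assumption; the paper's exact generation code may differ.

```python
# Hypothetical decoding configuration mirroring the settings in the text:
# beam size 7, a 128-token length cap, and at least 10 new tokens per story.
generation_kwargs = {
    "num_beams": 7,        # explore several candidate continuations
    "max_length": 128,     # cap on the total sequence length in tokens
    "min_new_tokens": 10,  # force at least 10 newly generated tokens
}

# Each selected sentence is prefixed with the storytelling instruction.
prompt_prefix = "write me a story starting with "
```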
The purpose of collecting feedback is to align the model's behavior with some goal behavior. For example, we aim for the model to generate stories that are consistent with the background plot, coherent, non-repetitive, and free of nonsensical sentences, and that maintain a clear topic or logical structure. Rating the quality of a story accurately is challenging due to its potentially subjective nature and the varying expectations of readers regarding emotional connection and engagement. Rather than directly estimating a generated story's quality through scale-based annotation, we treat quality as a latent variable to be inferred from relative comparisons. Following prior work in NLP on annotating social aspects of language (Pei and Jurgens, 2020), we adopt a method similar to Best-Worst Scaling (BWS) (Louviere et al., 2015; Kiritchenko and Mohammad, 2016) to generate comparison data on people's preferences. Intuitively, it is easier for annotators to identify the best and worst stories in a set than to provide numerical assessments. The process involves asking two student annotators to choose, from sets of stories, the best (most preferred) and worst (least preferred) story in each choice set. We provide four stories for the annotators to choose from. This method provides more information per choice set than traditional preference methods and enables a more precise ranking of items in terms of preference. For instance, if we have stories A, B, C, and D, and A is ranked as the best while D is ranked as the worst, then we create the following pairs: A > B, A > C, A > D, B > D, and C > D, resulting in a total of 500 pairs for reward model training from 100 best-worst annotations. A > B means that the model should learn to assign a higher score to A because it was ranked higher than B; this is inferred because A was marked as the best story.
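The pair-expansion step described above can be sketched in a few lines; this is a minimal illustration (stories stand in as strings, and `best`/`worst` are the annotators' picks for one choice set):

```python
def bws_pairs(stories, best, worst):
    """Expand one best-worst annotation over a choice set into
    ranked (winner, loser) pairs for reward model training."""
    pairs = [(best, s) for s in stories if s != best]  # best beats everything else
    pairs += [(s, worst) for s in stories
              if s not in (best, worst)]               # middle stories beat worst
    return pairs

# A choice set of four stories yields five pairs,
# so 100 annotated sets yield 500 training pairs.
pairs = bws_pairs(["A", "B", "C", "D"], best="A", worst="D")
```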

Agreement for Reward Model Data Annotation
Krippendorff's alpha, introduced by Krippendorff (1970), is a statistical measure commonly used to assess the level of agreement between two or more annotators across various categories. Its advantage lies in its versatility: it applies not only to nominal data but to other measurement scales as well, such as the rankings produced by Best-Worst Scaling.
In our case, two graduate student annotators annotated the human feedback data, which yielded a Krippendorff's alpha agreement score of .4657. To address disagreements, the two annotators discussed each story example together, reconciled their differences, and unanimously selected the best and worst stories for the given story prompt.
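For reference, Krippendorff's alpha for nominal labels can be computed with a short script. This is a sketch of the standard coincidence-matrix formulation, not the exact tool we used:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items, each a list of the labels that
    the annotators assigned to that item.
    """
    units = [u for u in units if len(u) >= 2]  # items with at least two labels
    coincidence = Counter()
    for u in units:
        for a, b in permutations(u, 2):        # ordered label pairs within a unit
            coincidence[(a, b)] += 1.0 / (len(u) - 1)
    n = sum(coincidence.values())
    totals = Counter()
    for (a, _), w in coincidence.items():
        totals[a] += w
    observed = sum(w for (a, b), w in coincidence.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 - observed / expected
```

Perfect agreement gives alpha = 1, and systematic disagreement drives it toward (or below) 0.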

Method
This section describes the development of the tokenizer, the language model configuration, the objective of pretraining from scratch, and the process of fine-tuning using reinforcement learning with human feedback.

Tokenizer
Our model uses a sub-word vocabulary built with Byte-Pair Encoding (BPE) (Sennrich et al., 2016), an approach initially developed for text compression. This technique was later applied by OpenAI for tokenization during the pretraining stage of the GPT model (Radford et al., 2019). Rather than maintaining the original vocabulary size of 50,257 used in the GPT-2 model, we developed a custom tokenizer with a vocabulary size of 32,001. This custom tokenizer is trained on the collective set of all training corpora from the STRICT track of the BabyLM Challenge, applying the ByteLevelBPETokenizer from the Hugging Face Tokenizers library.
Prior research informed our decision to significantly reduce the vocabulary size. Studies suggest a vocabulary size of about 32,000 tokens is a good balance for a single-language model (Kudo, 2018). This size balances the model's proficiency in handling less common words against its computational efficiency.
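To illustrate what BPE training does, the core merge rule can be sketched in pure Python. The actual tokenizer is trained with the Hugging Face Tokenizers library; this toy version only shows how the most frequent adjacent symbol pair is counted and merged:

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs over a vocabulary of
    (symbol tuple -> word frequency) and return the most frequent pair."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Apply one BPE merge: replace every occurrence of `pair`
    with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: "hug" x10 and "pug" x5; the first learned merge is ("u", "g").
vocab = {("h", "u", "g"): 10, ("p", "u", "g"): 5}
pair = most_frequent_pair(vocab)
vocab = merge_pair(vocab, pair)
```

Repeating this loop until the vocabulary reaches the target size (32,001 in our case) yields the merge table the tokenizer applies at inference time.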

Model Architecture and Configuration
We pretrained the models in our experiments using the default configuration settings of GPT-2 (Radford et al., 2019). In these settings, we employed a context length of 1042 tokens and set the maximum training epoch limit to 15. The restriction to 15 epochs was primarily due to constraints on training time and GPU resources. We trained the GPT-2 Base model on an NVIDIA GeForce GTX 1080 Ti, while the GPT-2 Large model was trained on an NVIDIA RTX A6000 GPU. The training time was approximately 72 hours for the base model and around 216 hours for the large model. To train, we used the Lion optimizer (Chen et al., 2023), configured with a learning rate of 1e-5 and a weight decay of 1e-2. We also integrated Triton, a GPU programming language detailed by Tillet and Cox (2019), to optimize hardware performance, and implemented mixed-precision computations using the bfloat16 format for efficient resource utilization (Wang and Kanwar, 2019).
For model selection, we chose the best model across all epochs based on the average score on two datasets: the Question-answering Natural Language Inference (QNLI) benchmark (Demszky et al., 2018) and the SST-2 Binary Classification Benchmark (Socher et al., 2013). We evaluated the models' performance on these benchmarks using the F1 score. Additionally, the perplexity scores on the validation dataset were 24.10 for the GPT-2 Base model and 22.73 for the GPT-2 Large model.

Reward Model
The reward model (RM) is designed to capture human preferences. Ideally, we would fine-tune the language model using reinforcement learning with human annotations for every output it returns. However, due to practical constraints such as workload and time limitations, it is not feasible for humans to provide feedback at every optimization iteration. A more effective alternative is to train a reward model that simulates the evaluation process carried out by humans. This RM evaluates any text and assigns a scalar reward value, where higher values indicate higher-quality samples. Following Stiennon et al. (2020), training a reward model typically involves a paired-comparison dataset of two responses generated for the same input.
To train our reward models, we initialize the weights of the reward model by leveraging the pretrained GPT-2 Large model described above, then add a randomly initialized linear head that outputs a scalar value to form the reward model r_θ(x, y). We train this model to predict which generated story y ∈ {y_0, y_1} is preferred, where y_0 is the chosen (good) response to the prompt as labeled by our annotators and y_1 is the rejected (bad) response; in practice, this is where our annotators ranked y_0 > y_1. The model is trained using the loss function

loss(θ) = -(1/|D|) Σ_i log σ(r_θ(x^i, y_0^i) - r_θ(x^i, y_1^i)),

where σ is the sigmoid function, D is the set of all training triplets (x, y_0, y_1) in our dataset, and i denotes the index of a specific data point in D. Intuitively, the model learns to give a larger score to the higher-ranked response. We configured the reward model to run for a maximum of 10 epochs, with a learning rate of 1e-5.
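The per-pair term of this loss reduces to a one-line function; the sketch below uses scalar scores standing in for the reward model's outputs:

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    # -log sigmoid(r(x, y_0) - r(x, y_1)): small when the chosen
    # story already outscores the rejected one, large otherwise.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Equal scores give a loss of log 2 ≈ 0.693, and the loss shrinks as the chosen story's score pulls ahead of the rejected one.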

Proximal Policy Optimization
After training the reward model, we treat its logit output as a reward signal and optimize the policy model's outputs using reinforcement learning, specifically the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017).
During the RL fine-tuning phase with PPO, we use the learned reward function to provide feedback to the language model. In particular, we maximize the KL-penalized objective

E_{x ~ D, y ~ π_RL(·|x)} [ r(x, y) - β log(π_RL(y|x) / π_SFT(y|x)) ],

where r(x, y) is the reward model's output and β is a hyper-parameter controlling the deviation from the initial policy. Our optimization focuses on the policy π_RL(y|x) using Proximal Policy Optimization (PPO), initialized from the pretrained language model policy π_SFT(y|x) (Stiennon et al., 2020; Rafailov et al., 2023).
To encourage exploration and prevent the policy from collapsing into a single mode, the optimization includes the Kullback-Leibler (KL) divergence term. This term also discourages the policy from generating outputs that differ significantly from those seen by the reward model during training, thereby maintaining coherence in the generated text. Without this penalty, the optimization might generate gibberish text that tricks the reward model into providing a high reward. In our implementation, we used the trlX library with its default settings. The algorithm was executed with a maximum of 5 epochs and a sequence length of 512, and the run spanned around 208 hours. We used the default hyperparameters provided by the trlX library, which employs Ray Tune for hyperparameter tuning. This choice was primarily driven by the significant time and GPU resource constraints associated with training the PPO model, making it pragmatic to leverage the preconfigured settings of trlX. Although we experimented with modifications to some hyperparameters, the outcomes were less satisfactory than with the default settings. Hence, restricting training to 5 epochs was in alignment with these considerations, balancing computational feasibility against the pursuit of meaningful reward training.
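At each PPO step, the KL-penalized reward can be sketched as follows. Here `beta` is an illustrative value rather than the trlX default, and the log-probabilities are per-sequence:

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    # R(x, y) = r(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)):
    # the penalty grows as the fine-tuned policy drifts away from
    # the pretrained (reference) model on the sampled story y.
    return rm_score - beta * (logp_policy - logp_ref)
```

When the policy assigns the same probability to a story as the reference model, the penalty vanishes and the reward model's score passes through unchanged; stories the reference model finds unlikely are penalized, which is what blocks reward hacking via gibberish.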

Evaluation Metrics and Datasets
To assess the performance of our models, we employed various automated evaluation metrics used in the BabyLM shared task and our own human evaluation.The BabyLM shared task had two major sets of evaluations: zero-shot evaluation and fine-tuned evaluation.We describe each evaluation task below.
Zero-shot Evaluation. BLiMP, introduced by Warstadt et al. (2020a), is a series of zero-shot tasks included in the evaluation. BLiMP assesses the ability of language models to handle category membership, provide congruent answers to specific types of questions, and recognize grammatical questions. It serves as a behavioral probe, containing pairs of test sentences that isolate particular phenomena in syntax and morphology, such as island effects and determiner-noun agreement. Essentially, BLiMP is a challenge set designed to evaluate the linguistic knowledge of language models, focusing on major grammatical phenomena in English. The BLiMP Supplement benchmark consists of BLiMP-style minimal pairs that focus on aspects not covered by BLiMP, including discourse-level acceptability across multiple speakers and question formation.
Fine-tuned Evaluation. Two datasets are used for the fine-tuned evaluation: SuperGLUE and the Mixed Signals Generalization Set (MSGS). SuperGLUE (Wang et al., 2019), an advanced version of GLUE (Wang et al., 2018), is a benchmark for assessing progress in general-purpose language understanding technologies. It comprises a public leaderboard and a single-number performance metric covering various tasks. These include CoLA, which evaluates the grammatical acceptability of English sentences; SST-2, which predicts the sentiment of movie review sentences; MRPC, which determines semantic equivalence between sentence pairs; QQP, another task focused on semantic equivalence; MNLI and MNLI-mm, which predict the relationship between a premise and a hypothesis sentence; QNLI, which matches a question to a paragraph containing the answer; RTE, which determines if a sentence entails a given hypothesis; BoolQ, which answers yes/no questions about a text passage; MultiRC, which identifies true and false answers given a context paragraph and a question; and WSC, a coreference resolution task. These tasks, designed to be challenging, represent a broad spectrum of language understanding capabilities, making SuperGLUE a robust tool for evaluating language models. The MSGS dataset, introduced by Warstadt et al. (2020b), is a diagnostic tool designed to evaluate the preferences of language models for either linguistic features, such as specific syntactic constructions, or surface features, such as the presence of a word in a certain position. The primary objective of the MSGS tasks is to determine whether a pretrained model leans more toward linguistic or surface generalizations during fine-tuning. Fine-tuning on self-supervised linguistic tasks proves effective because it equips models with features beneficial for language understanding. Furthermore, pretrained models are not only capable of representing these linguistic features but also tend to use them preferentially during fine-tuning.
To maintain consistency and ensure fair comparisons, we adopted the default hyperparameter settings recommended by Gao et al. (2021). Our only modification was adjusting the batch size to 32 due to GPU limitations. These evaluation procedures allowed us to thoroughly assess the models' capabilities and compare their performance across different tasks. Our experiments report the average scores of all performance metrics across tasks.
Human Evaluation. Inspired by TinyStories (Eldan and Li, 2023), we assess four key story generation metrics: grammar (how grammatically correct the story is), creativity (how original and inventive the story is), consistency with the story's beginning (how well the story adheres to the given prompt), and plot coherence (whether the plot of the story makes sense). We randomly selected 100 prompts from the ROCStories dataset (Mostafazadeh et al., 2016). Each prompt consisted of a story title and the first sentence. We fed these prompts to the model, which generated short stories based on them. To assess the quality of the generated stories, we enlisted a graduate student evaluator. The evaluator was presented with the story's beginning (title + first sentence) and the completed story generated by the model, and was asked to rate the completed story on a scale of 1 to 10 for grammar, creativity, consistency with the story's beginning, and plot coherence. This human evaluation provided valuable insights into the model's performance across these critical dimensions.

Results
In this section, we report the results of the automated BabyLM metrics and our human evaluation for story generation.
Performance on BLiMP benchmarks. As shown in Table 1, the GPT2-Large and GPT2-Large-PPO models outperform the GPT2-Base variants on the BLiMP task with an average score of 73.9, excelling in many specific tasks. For example, GPT2-Large does well on tasks such as Island Effects, NPI Licensing, and Subject-Verb Agreement, whereas GPT2-Large-PPO stands out on the QA Congruence (tricky) task. The GPT2-Base and GPT2-Base-PPO models score lower, with averages of 71.2 and 70.9, respectively, suggesting that model size (base versus large) plays a crucial role in determining performance. For the BLiMP benchmark, however, PPO training has little impact on model performance, though more experiments on different architectures could potentially point in a different direction.
Performance on SuperGLUE benchmarks. In Table 2, we report the performance of the models on the SuperGLUE benchmarks, which assess a range of language understanding abilities. The GPT2-Large-PPO model stands out with the highest average score of 66.8, underlining the potential for enhanced performance when larger models are fine-tuned with PPO. The other models present comparable average scores across the SuperGLUE tasks. Compared to the Majority Label baseline, the GPT-2 models exhibit varied levels of performance enhancement across tasks. Specifically, the GPT2-Base model outperforms the baseline in SST-2, QQP (F1), MNLI, MNLI-mm, QNLI, and BoolQ, and the GPT2-Base-PPO model surpasses the baseline in the same tasks. The GPT2-Large model demonstrates superior performance over the baseline in SST-2, MRPC (F1), MNLI, MNLI-mm, QNLI, BoolQ, and WSC, while the GPT2-Large-PPO model outperforms the majority baseline in all tasks except CoLA and MultiRC, with significant improvements in SST-2, MNLI-mm, and QNLI of 34.1, 28.3, and 44.2 points, respectively. Overall, performance across models and tasks exhibits considerable variability, showing that different models may excel in distinct language understanding domains. The superior scores of the GPT2-Large-PPO model suggest that larger models fine-tuned with PPO can achieve enhanced performance, yet further examination reveals inconsistencies. Finally, we note that PPO training improves only the GPT2-Large model, suggesting that PPO training may require a model with a minimum number of parameters to work in the limited data setting. However, more experiments are needed to confirm this finding.

Table 3: Performance on MSGS benchmarks. The MSGS shortcuts correspond to the respective tasks as follows: CR_RTP maps to control_raising_relative_token_position, CR_LC to control_raising_lexical_content_the, SC_RP to syntactic_category_relative_position, SC_LC to syntactic_category_lexical_content_the, MV_RTP to main_verb_relative_token_position, and MV_LC to main_verb_lexical_content_the. The shortcuts RP_C, LC_C, SC_C, CR_C, and MV_C correspond to relative_position_control, lexical_content_the_control, syntactic_category_control, control_raising_control, and main_verb_control, respectively. The overall largest scores are in bold.

Performance on MSGS benchmarks. Table 3 shows the results of testing GPT-2 models of different sizes on the MSGS benchmark. These results help us understand how well the models use and generalize different linguistic and surface features. Among the models, GPT2-Base outperforms the others with the highest average score of 83.0. This suggests that GPT2-Base, despite being a smaller model, has effectively learned to generalize across a range of linguistic and surface features, which may be due to the model's efficient use of its limited parameters rather than overfitting to less important details in the training data.
Performance on Age-of-Acquisition benchmarks. According to Portelance et al. (To Appear), a smaller mean absolute deviation (MAD) score indicates a better alignment between the model's predictions and the actual average age of acquisition (AoA) of words in children. Table 4 shows similar MAD scores across all models for all word categories (Overall, Nouns, Predicates, and Function words). This suggests that all models exhibit similar levels of accuracy in predicting the AoA of words, and their word-learning sequences align closely with the natural language acquisition patterns observed in children.
Performance on Human Evaluation. In Table 5, we report the results of our human evaluation. The findings indicate that the GPT2-Base and GPT2-Large models exhibit comparable average grammar scores. However, the GPT2-Base-PPO model performs significantly worse (p-value < 0.001) than GPT2-Base in the grammar and creativity evaluations. This result is consistent with the BabyLM automated evaluation metrics, where GPT2-Base-PPO generally underperforms GPT2-Base. Table 6 shows several examples from our TinyStories-style analysis. Specifically, GPT2-Base-PPO tends to generate repetitive and lengthy stories, likely contributing to its poorer grammar and creativity performance. Furthermore, when comparing the GPT2-Large and GPT2-Large-PPO models in Table 5, their performance levels for Grammar and Creativity are similar, showing that PPO had minimal impact on the large model for both metrics.
We also find significant differences in Consistency (Const.) and Plot Coherence (PCoh) between GPT2-Large and GPT2-Large-PPO. Intuitively, these metrics evaluate a generative model's capability to follow the story's given beginning rather than just create content. Our findings indicate that the performance scores for the GPT2-Base and GPT2-Base-PPO models are fairly similar, but both are lower than those of the GPT2-Large variants. Again, this indicates that the large models outperform the smaller models, even though we trained on a relatively small dataset. Moreover, the GPT2-Large-PPO model significantly improves consistency and plot coherence scores compared to the standard GPT2-Large model. This suggests that large models (at least GPT2-Large in our case) can leverage the reward model to generate better outputs than the smaller GPT2-Base.
We analyze the large model outputs in Table 6. In the second story from Table 6, the beginning of the story is set as "Awkward I was driving into the McDonald's beside school." Distinct differences can be seen when comparing the narrative continuations generated by the GPT2-Large and GPT2-Large-PPO models. The GPT2-Large model diverges from the initial context, transitioning abruptly from the act of driving into McDonald's to a sudden need to return to work. This abrupt shift disrupts the narrative flow and does not connect seamlessly with the story's beginning. On the other hand, the GPT2-Large-PPO model manages to retain focus on the primary activity of driving in its generated story. Although it introduces an inconsistency by stating that the character does not know how to drive, it maintains the plot around the theme of a character recklessly driving without knowing how. This suggests that the GPT2-Large-PPO model adheres more strongly to the initial instructions and makes a better attempt at following them.
Summary of Findings and Limitations. Overall, we found that GPT2-Large generally works better than GPT2-Base, with and without PPO. Also, PPO made significant improvements to the model's consistency and plot coherence on the storytelling task when used with the large model. However, PPO generally hurt performance with the smaller GPT2-Base model. There were several limitations to our study. First, a major limitation of this work is the lack of comparison with architectures beyond GPT-2; moreover, comparisons to even larger models should be made in the future. We were limited by the computational resources required for large-scale testing during the BabyLM shared task timeline. Next, we had a limited-size reward model dataset; future work should explore the impact of reward model dataset size and variety. Additionally, the study did not explore in depth the hyperparameter tuning for the reward model and the loss function. Exploring different hyperparameter settings and examining alternative methods for reward training, such as varying the weighting of the loss terms, could yield different results and improve the model's performance on the storytelling task. Finally, we had only one annotator for the human evaluation, and the evaluation was limited in size; a more extensive human study could reveal more intricate differences between the models.

Conclusion
In this study, we investigated whether a small pretrained model, trained on a limited amount of data, can benefit from RLHF and thereby improve its overall performance. We evaluated two variants of the GPT-2 model: the GPT-2 Base model with 125M parameters and the larger GPT-2 Large model with 774M parameters. Both variants were pretrained on the 100M-word BabyLM Challenge dataset. We then fine-tuned both models using RLHF and evaluated their ability to acquire new linguistic patterns and storytelling ability, including generating coherent and creative English text while adhering to the story background. We observe that RLHF has little or even a negative effect on the smaller model, whereas the substantial increase in model parameters noticeably enhances the larger model's performance in storytelling tasks after RLHF fine-tuning. In summary, our experiments shed light on the behavior of small language models fine-tuned using RLHF to perform storytelling tasks in a limited dataset setting.

Table 6: Performance comparison of various models on grammar, creativity, consistency with the beginning of the story, and plot coherence. The scores in parentheses represent the evaluations for Grammar, Creativity, Consistency, and Plot, respectively.