Kevin Li


2024

Using Curriculum Masking Based on Child Language Development to Train a Large Language Model with Limited Training Data
Evan Lucas | Dylan Gaines | Tagore Rao Kosireddy | Kevin Li | Timothy C. Havens
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

In this paper, we detail our submissions to the Strict and Strict-Small tracks of the 2024 BabyLM Challenge. We approach this challenge with two methodologies: i) use of a novel dataset, and ii) development of a pre-training technique that fuses child language acquisition with traditional masked language modeling, which we call curriculum masking. The novel dataset used for this task is based on user submissions to the Reddit forum (i.e., subreddit) “Explain Like I’m Five”, which explains diverse concepts using simple language. Curriculum masking works by creating learning phases based on a standard child language development timeline: the words masked for the model start with simple nouns and gradually expand to include more complex parts of speech. We show that internet-based training data yields a small improvement in evaluation scores over the baseline training data. Our proposed curriculum masking pre-training method is conceptually novel and also shows improved rates of learning over typical masked language modeling pre-training, potentially allowing for good performance with fewer total epochs on smaller training datasets. Code for the curriculum masking implementation is shared at https://github.com/evan-person/curriculumMaskingBabyLM2024.
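
The phase structure described above lends itself to a compact implementation. The following is a minimal sketch of curriculum masking, not the authors' code (which is available at the repository linked in the abstract): the POS-tag phases, the 15% mask rate, and the use of NLTK's tagger are all illustrative assumptions.

```python
import random

import nltk

# Fetch tokenizer/tagger resources; names differ across NLTK versions,
# so we attempt both and let missing names fail quietly.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger",
                 "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

# Hypothetical curriculum: each phase widens the set of maskable
# Penn Treebank POS tags, loosely following a child language
# development order (nouns first, then verbs, then everything).
PHASES = [
    {"NN", "NNS", "NNP", "NNPS"},                  # phase 0: nouns
    {"NN", "NNS", "NNP", "NNPS",
     "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},     # phase 1: + verbs
    None,                                          # phase 2: all tokens
]

def curriculum_mask(text, phase, mask_token="[MASK]", rate=0.15):
    """Mask each token eligible in the current phase with probability `rate`."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    allowed = PHASES[phase]
    masked, targets = [], []
    for word, tag in tagged:
        eligible = allowed is None or tag in allowed
        if eligible and random.random() < rate:
            masked.append(mask_token)
            targets.append(word)   # token the model must recover
        else:
            masked.append(word)
            targets.append(None)   # ignored by the MLM loss
    return masked, targets

print(curriculum_mask("The small dog chased a red ball.", phase=0))
```

In a real pre-training loop, the phase index would presumably advance on a schedule tied to training steps or epochs, so early updates only ever mask nouns and later updates mask the full vocabulary.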

Team MLab at SemEval-2024 Task 8: Analyzing Encoder Embeddings for Detecting LLM-generated Text
Kevin Li | Kenan Hasanaliyev | Sally Zhu | George Altshuler | Alden Eberts | Eric Chen | Kate Wang | Emily Xia | Eli Browne | Ian Chen
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper explores solutions to the challenges posed by the widespread use of LLMs, particularly in the context of distinguishing human-written from machine-generated text. Focusing on Subtask B of SemEval 2024 Task 8, we compare the performance of RoBERTa and DeBERTa models. Subtask B involved identifying not only whether a text was written by a human or a machine but also which specific LLM generated it; our DeBERTa model outperformed the RoBERTa baseline by over 10% in leaderboard accuracy. The results highlight the rapidly growing capabilities of LLMs and the importance of keeping up with the latest advancements. Additionally, our paper presents PCA and t-SNE visualizations that showcase the DeBERTa model’s ability to cluster the outputs of different LLMs effectively. These findings contribute to understanding and improving AI methods for detecting machine-generated text, allowing us to build more robust and traceable AI systems in the language ecosystem.
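
As a rough illustration of the visualization step, here is a minimal sketch that mean-pools encoder embeddings and projects them with PCA followed by t-SNE. The microsoft/deberta-v3-base checkpoint, the pooling choice, and the toy texts with placeholder source labels are assumptions for the sketch, not the paper's actual pipeline or data.

```python
import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base").eval()

def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state    # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy stand-ins for texts produced by different source LLMs.
texts = ["The cat sat quietly on the mat.",
         "Cats often nap on warm, soft surfaces.",
         "Quantum computers operate on qubits.",
         "A qubit can exist in superposition.",
         "Stock indices fell sharply on Friday.",
         "Equities slid amid interest-rate fears."]
labels = ["A", "A", "B", "B", "C", "C"]   # placeholder source labels

features = embed(texts)
features = PCA(n_components=5).fit_transform(features)  # denoise first
points = TSNE(n_components=2, perplexity=2.0).fit_transform(features)

# Scatter the 2-D projection, one color per source model.
for source in sorted(set(labels)):
    idx = [i for i, lab in enumerate(labels) if lab == source]
    plt.scatter(points[idx, 0], points[idx, 1], label=source)
plt.legend()
plt.savefig("embedding_clusters.png")
```

Running PCA before t-SNE is a common choice to reduce noise and computation before the nonlinear projection; with a realistic corpus the perplexity would be raised well above the toy value used here.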

2023

Francis Bacon at SemEval-2023 Task 4: Ensembling BERT and GloVe for Value Identification in Arguments
Kenan Hasanaliyev | Kevin Li | Saanvi Chawla | Michael Nath | Rohan Sanda | Justin Wu | William Huang | Daniel Yang | Shane Mion | Kiran Bhat
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

In this paper, we discuss our efforts on SemEval-2023 Task 4, a task to classify the human value categories that an argument draws on. Arguments consist of a premise, a conclusion, and the premise’s stance on the conclusion. Our team experimented with GloVe embeddings and fine-tuning BERT. We found that an ensemble of BERT and GloVe with Ridge Regression worked best.
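
For a concrete picture of how such an ensemble can be wired up, below is a minimal sketch combining BERT and GloVe features under ridge regression, assuming mean-pooled gensim GloVe vectors, the frozen bert-base-uncased [CLS] embedding, and equal-weight score averaging with a 0.5 threshold. The feature construction, weights, and toy labels are placeholders rather than the paper's configuration, and the fine-tuning of BERT mentioned in the abstract is not shown here.

```python
import gensim.downloader as api
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

glove = api.load("glove-wiki-gigaword-50")   # 50-d pretrained GloVe
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def glove_features(texts):
    """Average the GloVe vectors of in-vocabulary words."""
    rows = []
    for text in texts:
        vecs = [glove[w] for w in text.lower().split() if w in glove]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(50))
    return np.stack(rows)

def bert_features(texts):
    """Use the [CLS] position of BERT's last hidden layer."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**batch).last_hidden_state[:, 0].numpy()

# Toy multi-label data: one column per human value category.
texts = ["We must protect nature.",
         "Traditions deserve our respect.",
         "Everyone should have equal rights.",
         "Safety matters more than profit."]
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])   # placeholder labels

ridge_glove = Ridge(alpha=1.0).fit(glove_features(texts), y)
ridge_bert = Ridge(alpha=1.0).fit(bert_features(texts), y)

# Ensemble: average the two models' scores, then threshold per label.
scores = (0.5 * ridge_glove.predict(glove_features(texts))
          + 0.5 * ridge_bert.predict(bert_features(texts)))
print((scores > 0.5).astype(int))
```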