Richard Diehl Martinez



2024

Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing
Richard Diehl Martinez | Zebulon Goriely | Andrew Caines | Paula Buttery | Lisa Beinborn
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias.
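
To make the idea of distributing the learning signal across syntactically similar tokens concrete, here is a minimal sketch of a smoothing-style cross-entropy loss. It assumes a precomputed, row-stochastic token-similarity matrix (`syntactic_sim`) and a mixing weight `alpha`; both names are hypothetical, and this is not the authors' released implementation.

```python
# Minimal sketch (not the paper's code): a smoothed cross-entropy in which
# part of the probability mass of the one-hot target is redistributed over
# syntactically similar tokens, analogous in spirit to label smoothing.
import torch
import torch.nn.functional as F

def syntactic_smoothing_loss(logits, targets, syntactic_sim, alpha=0.1):
    """
    logits:        (batch, vocab) unnormalised model outputs
    targets:       (batch,) gold token ids
    syntactic_sim: (vocab, vocab) row-stochastic matrix; row i weights tokens
                   by their syntactic similarity to token i (hypothetical input)
    alpha:         fraction of probability mass moved to similar tokens
    """
    vocab = logits.size(-1)
    one_hot = F.one_hot(targets, vocab).float()
    # Soft target: keep (1 - alpha) on the gold token, spread alpha over
    # syntactically similar tokens.
    soft_target = (1 - alpha) * one_hot + alpha * syntactic_sim[targets]
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()
```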

SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons | Richard Diehl Martinez | Dan Jurafsky
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

Transliterating Sumerian is a key step in understanding Sumerian texts, but remains a difficult and time-consuming task. With more than 100,000 known texts and comparatively few specialists, manually maintaining up-to-date transliterations for the entire corpus is impractical. While many transliterations have been published online thanks to the dedicated effort of previous projects, the lack of a comprehensive, easily accessible dataset that pairs digital representations of source glyphs with their transliterations has hindered the application of natural language processing (NLP) methods to this task. To address this gap, we present SumTablets, the largest collection of Sumerian cuneiform tablets structured as Unicode glyph–transliteration pairs. Our dataset comprises 91,606 tablets (totaling 6,970,407 glyphs) with associated period and genre metadata. We release SumTablets as a Hugging Face Dataset. To construct SumTablets, we first preprocess and standardize publicly available transliterations. We then map them back to a Unicode representation of their source glyphs, retaining parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We leverage SumTablets to implement and evaluate two transliteration approaches: 1) weighted sampling from a glyph’s possible readings, and 2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the potential use of deep learning methods in Assyriological research.
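
As an illustration of the first baseline (weighted sampling from a glyph's possible readings), the sketch below picks a reading for each glyph in proportion to its corpus frequency. The `reading_counts` table and its counts are hypothetical stand-ins, not the released SumTablets schema.

```python
# Minimal sketch of a weighted-sampling transliteration baseline.
import random

# For each cuneiform glyph, a map from possible transliterated readings to
# (made-up) counts in a training corpus.
reading_counts = {
    "𒀭": {"an": 120, "dingir": 300},
    "𒆠": {"ki": 450, "ke": 30},
}

def transliterate_by_sampling(glyphs, counts, rng=random):
    """Sample a reading for each glyph in proportion to its corpus frequency."""
    out = []
    for g in glyphs:
        readings = counts.get(g)
        if not readings:
            out.append("<unk>")
            continue
        choices, weights = zip(*readings.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(out)

print(transliterate_by_sampling(["𒀭", "𒆠"], reading_counts))
```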

Tending Towards Stability: Convergence Challenges in Small Language Models
Richard Diehl Martinez | Pietro Lesci | Paula Buttery
Findings of the Association for Computational Linguistics: EMNLP 2024

Increasing the number of parameters in language models is a common strategy to enhance their performance. However, smaller language models remain valuable due to their lower operational costs. Despite their advantages, smaller models frequently underperform compared to their larger counterparts, even when provided with equivalent data and computational resources. Specifically, their performance tends to degrade in the late pretraining phase. This is anecdotally attributed to their reduced representational capacity. Yet, the exact causes of this performance degradation remain unclear. We use the Pythia model suite to analyse the training dynamics that underlie this phenomenon. Across different model sizes, we investigate the convergence of the Attention and MLP activations to their final state and examine how the effective rank of their parameters influences this process. We find that nearly all layers in larger models stabilise early in training - within the first 20% - whereas layers in smaller models exhibit slower and less stable convergence, especially when their parameters have lower effective rank. By linking the convergence of layers’ activations to their parameters’ effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
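
For readers unfamiliar with effective rank, the sketch below computes one standard definition, the exponential of the entropy of the normalised singular value spectrum; the paper's exact estimator may differ.

```python
# Minimal sketch of effective rank for a 2-D parameter matrix.
import torch

def effective_rank(weight: torch.Tensor, eps: float = 1e-12) -> float:
    s = torch.linalg.svdvals(weight)            # singular values
    p = s / (s.sum() + eps)                     # normalise to a distribution
    entropy = -(p * torch.log(p + eps)).sum()   # Shannon entropy (natural log)
    return torch.exp(entropy).item()

# Example: a rank-8 matrix has an effective rank of at most 8,
# far below the 512 its shape would allow.
low_rank = torch.randn(512, 8) @ torch.randn(8, 512)
print(effective_rank(low_rank))
print(effective_rank(torch.randn(512, 512)))    # much larger
```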

2023

CLIMB – Curriculum Learning for Infant-inspired Model Building
Richard Diehl Martinez | Hope McGovern | Zebulon Goriely | Christopher Davis | Andrew Caines | Paula Buttery | Lisa Beinborn
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

2021

Attention-based Contextual Language Model Adaptation for Speech Recognition
Richard Diehl Martinez | Scott Novotney | Ivan Bulyko | Ariya Rastrow | Andreas Stolcke | Ankur Gandhe
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021