Brandon Duderstadt

2025

Wikivecs: A Fully Reproducible Vectorization of Multilingual Wikipedia
Brandon Duderstadt
Proceedings of the 2nd Workshop on Advancing Natural Language Processing for Wikipedia (WikiNLP 2025)

Dense vector representations have become foundational to modern natural language processing (NLP), powering diverse workflows from semantic search and retrieval augmented generation to content comparison across languages. Although Wikipedia is one of the most comprehensive and widely used datasets in modern NLP research, it lacks a fully reproducible and permissively licensed dense vectorization.In this paper, we present Wikivecs, a fully reproducible, permissively licensed dataset containing dense vector embeddings for every article in Multilingual Wikipedia. Our pipeline leverages a fully reproducible and permissively licensed multilingual text encoder to embed Wikipedia articles into a unified vector space, making it easy to compare and analyze content across languages.Alongside these vectors, we release a two-dimensional data map derived from the vectors, enabling visualization and exploration of Multilingual Wikipedia’s content landscape.We demonstrate the utility of our dataset by identifying several content gaps between English and Russian Wikipedia.

pdf bib abs

Statistical inference on black-box generative models in the data kernel perspective space
Hayden Helm | Aranyak Acharyya | Youngser Park | Brandon Duderstadt | Carey Priebe
Findings of the Association for Computational Linguistics: ACL 2025

Generative models are capable of producing human-expert level content across a variety of topics and domains. As the impact of generative models grows, it is necessary to develop statistical methods to understand collections of available models. These methods are particularly important in settings where the user may not have access to information related to a model’s pre-training data, weights, or other relevant model-level covariates. In this paper we extend recent results on representations of black-box generative models to model-level statistical inference tasks. We demonstrate that the model-level representations are effective for multiple inference tasks.

2024

pdf bib abs

Tracking the perspectives of interacting language models
Hayden Helm | Brandon Duderstadt | Youngser Park | Carey Priebe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) are capable of producing high quality information at unprecedented rates. As these models continue to entrench themselves in society, the content they produce will become increasingly pervasive in databases that are, in turn, incorporated into the pre-training data, fine-tuning data, retrieval data, etc. of other language models. In this paper we formalize the idea of a communication network of LLMs and introduce a method for representing the perspective of individual models within a collection of LLMs. Given these tools we systematically study information diffusion in the communication network of LLMs in various simulated settings.

2023

pdf bib abs

Large language models (LLMs) have recently achieved human-level performance on a range of professional and academic benchmarks.The accessibility of these models has lagged behind their performance.State-of-the-art LLMs require costly infrastructure; are only accessible via rate-limited, geo-locked, and censored web interfaces; and lack publicly available code and technical reports.In this paper, we tell the story of GPT4All, a popular open source repository that aims to democratize access to LLMs.We outline the technical details of the original GPT4All model family, as well as the evolution of the GPT4All project from a single model into a fully fledged open source ecosystem.It is our hope that this paper acts as both a technical overview of the original GPT4All models as well as a case study on the subsequent growth of the GPT4All open source ecosystem.

pdf bib abs

Towards Explainable and Accessible AI
Brandon Duderstadt | Yuvanesh Anand
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)

Large language models (LLMs) have recently achieved human-level performance on a range of professional and academic benchmarks. Unfortunately, the explainability and accessibility of these models has lagged behind their performance. State-of-the-art LLMs require costly infrastructure, are only accessible via rate-limited, geo-locked, and censored web interfaces, and lack publicly available code and technical reports. Moreover, the lack of tooling for understanding the massive datasets used to train and produced by LLMs presents a critical challenge for explainability research. This talk will be an overview of Nomic AI’s efforts to address these challenges through its two core initiatives: GPT4All and Atlas

Co-authors

Venues

Fix author