Manuel Brack


2024

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth | Manuel Brack | Patrick Schramowski | Kristian Kersting | Samuel Weinbach
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-Free, which directly embeds words through sparse activation patterns over character triplets and does not require a reference corpus. T-Free inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-Free shows significant improvements in cross-lingual transfer learning.

Occiglot at WMT24: European Open-source Large Language Models Evaluated on Translation
Eleftherios Avramidis | Annika Grützner-Zahn | Manuel Brack | Patrick Schramowski | Pedro Ortiz Suarez | Malte Ostendorff | Fabio Barth | Shushen Manakhimova | Vivien Macketanz | Georg Rehm | Kristian Kersting
Proceedings of the Ninth Conference on Machine Translation

This document describes the submission of the very first version of the Occiglot open-source large language model to the General MT Shared Task of the 9th Conference on Machine Translation (WMT24). Occiglot is an open-source, community-based LLM based on Mistral-7B, which went through language-specific continual pre-training and subsequent instruction tuning, including instructions relevant to machine translation. We examine the automatic metric scores for translating the WMT24 test set and provide a detailed, linguistically motivated analysis. Although Occiglot performs worse than many of the other system submissions, we observe that it performs better than Mistral-7B, on which it is based, which indicates the positive effect of the language-specific continual pre-training and instruction tuning. We see the submission of this very early version of the model as a motivation to unite community forces and pursue future LLM research on the translation task.

Community OSCAR: A Community Effort for Multilingual Web Data
Manuel Brack | Malte Ostendorff | Pedro Ortiz Suarez | José Javier Saiz | Iñaki Lacunza Castilla | Jorge Palomar-Giner | Alexander Shvets | Patrick Schramowski | Georg Rehm | Marta Villegas | Kristian Kersting
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

The development of large language models (LLMs) relies heavily on extensive, high-quality datasets. Publicly available datasets focus predominantly on English, leaving other language communities behind. To address this issue, we introduce Community OSCAR, a multilingual dataset initiative designed to narrow the gap between English and non-English data availability. Through a collective effort, Community OSCAR covers over 150 languages with 45 billion documents, totaling over 345 TiB of data. Initial results indicate that Community OSCAR provides valuable raw data for training LLMs and enhancing the performance of multilingual models. This work aims to contribute to the ongoing advancements in multilingual NLP and to support a more inclusive AI ecosystem by making high-quality, multilingual data more accessible to those working with low-resource languages.

2023

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
Manuel Brack | Patrick Schramowski | Kristian Kersting
Proceedings of the ART of Safety: Workshop on Adversarial testing and Red-Teaming for generative AI