Leo Z. Liu

2023

Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Leo Z. Liu | Tim Dettmers | Xi Victoria Lin | Veselin Stoyanov | Xian Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE) have proven effective in scaling up Transformers model size for pretraining large language models. By only activating part of the FFN parameters conditioning on input, S-FFN improves generalization performance while keeping training and inference costs (in FLOPs) fixed. In this work, we analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method under a general conceptual framework of sparse neural memory. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. We found a simpler selection method — Avg-K that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining compared to existing MoE architectures including Switch Transformer (Fedus et al., 2021) and HashLayer (Roller et al., 2021).

pdf bib

Learning to translate by learning to communicate
C.M. Downey | Xuhui Zhou | Leo Z. Liu | Shane Steinert-Threlkeld
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)

2021

pdf bib abs

Probing Across Time: What Does RoBERTa Know and When?
Leo Z. Liu | Yizhong Wang | Jungo Kasai | Hannaneh Hajishirzi | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2021

Models of language trained on very large corpora have been demonstrated useful for natural language processing. As fixed artifacts, they have become the object of intense study, with many researchers “probing” the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Recent work applied several probes to intermediate training stages to observe the developmental process of a large-scale model (Chiang et al., 2020). Following this effort, we systematically answer a question: for various types of knowledge a language model learns, when during (pre)training are they acquired? Using RoBERTa as a case study, we find: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.

Co-authors

Xi Victoria Lin 1

Noah A. Smith 1

Shane Steinert-Threlkeld 1

Veselin Stoyanov 1

Yizhong Wang 1

Xuhui Zhou 1

Venues

Fix author