Vladislav Lialin


2023

pdf bib
Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale.
Vijeta Deshpande | Dan Pechi | Shree Thatte | Vladislav Lialin | Anna Rumshisky
Findings of the Association for Computational Linguistics: ACL 2023

In recent years, language models have drastically grown in size, and the abilities of these models have been shown to improve with scale. The majority of recent scaling laws studies focused on high-compute high-parameter count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break of the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute-cost (FLOPs) below 2.2 × 1015 FLOPs. We also find that adding layers does not always benefit downstream performance.Our filtered pre-training data, reduced English vocabulary, and code are available at https://github.com/text-machine-lab/mini_bertgithub.com/text-machine-lab/mini_bert

pdf bib
Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning
Namrata Shivagunde | Vladislav Lialin | Anna Rumshisky
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language model probing is often used to test specific capabilities of models. However, conclusions from such studies may be limited when the probing benchmarks are small and lack statistical power. In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500) inspired by psycholinguistic studies. We dramatically extend existing NEG-136 and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another version of extended negation dataset (NEG-1500-SIMP-TEMP), created using template-based generation. It consists of 770 sentence pairs. We evaluate 22 models on the extended datasets, seeing model performance dip 20-57% compared to the original smaller benchmarks. We observe high levels of negation sensitivity in models like BERT and ALBERT demonstrating that previous findings might have been skewed due to smaller test sets. Finally, we observe that while GPT3 has generated all the examples in ROLE-1500 is only able to solve 24.6% of them during probing. The datasets and code are available on Github.

2022

pdf bib
Life after BERT: What do Other Muppets Understand about Language?
Vladislav Lialin | Kevin Zhao | Namrata Shivagunde | Anna Rumshisky
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing pre-trained transformer analysis works usually focus only on one or two model families at a time, overlooking the variability of the architecture and pre-training objectives. In our work, we utilize the oLMpics bench- mark and psycholinguistic probing datasets for a diverse set of 29 models including T5, BART, and ALBERT. Additionally, we adapt the oLMpics zero-shot setup for autoregres- sive models and evaluate GPT networks of different sizes. Our findings show that none of these models can resolve compositional questions in a zero-shot fashion, suggesting that this skill is not learnable using existing pre-training objectives. Furthermore, we find that global model decisions such as architecture, directionality, size of the dataset, and pre-training objective are not predictive of a model’s linguistic capabilities.

pdf bib
Learning to Ask Like a Physician
Eric Lehman | Vladislav Lialin | Katelyn Edelwina Legaspi | Anne Janelle Sy | Patricia Therese Pile | Nicole Rose Alberto | Richard Raymund Ragasa | Corinna Victoria Puyat | Marianne Katharina Taliño | Isabelle Rose Alberto | Pia Gabrielle Alfonso | Dana Moukheiber | Byron Wallace | Anna Rumshisky | Jennifer Liang | Preethi Raghavan | Leo Anthony Celi | Peter Szolovits
Proceedings of the 4th Clinical Natural Language Processing Workshop

Existing question answering (QA) datasets derived from electronic health records (EHR) are artificially generated and consequently fail to capture realistic physician information needs. We present Discharge Summary Clinical Questions (DiSCQ), a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. We analyze this dataset to characterize the types of information sought by medical experts. We also train baseline models for trigger detection and question generation (QG), paired with unsupervised answer retrieval over EHRs. Our baseline model is able to generate high quality questions in over 62% of cases when prompted with human selected triggers. We release this dataset (and all code to reproduce baseline model results) to facilitate further research into realistic clinical QA and QG: https://github.com/elehman16/discq.