Akhila Yerukola


2021

pdf bib
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann | Tosin Adewumi | Karmanya Aggarwal | Pawan Sasanka Ammanamanchi | Anuoluwapo Aremu | Antoine Bosselut | Khyathi Raghavi Chandu | Miruna-Adriana Clinciu | Dipanjan Das | Kaustubh Dhole | Wanyu Du | Esin Durmus | Ondřej Dušek | Chris Chinenye Emezue | Varun Gangal | Cristina Garbacea | Tatsunori Hashimoto | Yufang Hou | Yacine Jernite | Harsh Jhamtani | Yangfeng Ji | Shailza Jolly | Mihir Kale | Dhruv Kumar | Faisal Ladhak | Aman Madaan | Mounica Maddela | Khyati Mahajan | Saad Mahamood | Bodhisattwa Prasad Majumder | Pedro Henrique Martins | Angelina McMillan-Major | Simon Mille | Emiel van Miltenburg | Moin Nadeem | Shashi Narayan | Vitaly Nikolaev | Andre Niyongabo Rubungo | Salomey Osei | Ankur Parikh | Laura Perez-Beltrachini | Niranjan Ramesh Rao | Vikas Raunak | Juan Diego Rodriguez | Sashank Santhanam | João Sedoc | Thibault Sellam | Samira Shaikh | Anastasia Shimorina | Marco Antonio Sobrevilla Cabezudo | Hendrik Strobelt | Nishant Subramani | Wei Xu | Diyi Yang | Akhila Yerukola | Jiawei Zhou
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.

pdf bib
Data Augmentation for Voice-Assistant NLU using BERT-based Interchangeable Rephrase
Akhila Yerukola | Mason Bretan | Hongxia Jin
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce a data augmentation technique based on byte pair encoding and a BERT-like self-attention model to boost performance on spoken language understanding tasks. We compare and evaluate this method with a range of augmentation techniques encompassing generative models such as VAEs and performance-boosting techniques such as synonym replacement and back-translation. We show our method performs strongly on domain and intent classification tasks for a voice assistant and in a user-study focused on utterance naturalness and semantic similarity.

2019

pdf bib
Do Massively Pretrained Language Models Make Better Storytellers?
Abigail See | Aneesh Pappu | Rohun Saxena | Akhila Yerukola | Christopher D. Manning
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Large neural language models trained on massive amounts of text have emerged as a formidable strategy for Natural Language Understanding tasks. However, the strength of these models as Natural Language Generators is less clear. Though anecdotal evidence suggests that these models generate better quality text, there has been no detailed study characterizing their generation abilities. In this work, we compare the performance of an extensively pretrained model, OpenAI GPT2-117 (Radford et al., 2019), to a state-of-the-art neural story generation model (Fan et al., 2018). By evaluating the generated text across a wide variety of automatic metrics, we characterize the ways in which pretrained models do, and do not, make better storytellers. We find that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.