Bryan McCann


pdf bib
SummEval: Re-evaluating Summarization Evaluation
Alexander R. Fabbri | Wojciech Kryściński | Bryan McCann | Caiming Xiong | Richard Socher | Dragomir Radev
Transactions of the Association for Computational Linguistics, Volume 9

Abstract The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations; 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics; 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format; 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics; and 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/Daily Mail dataset annotated by both expert judges and crowd-source workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.

pdf bib
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Gustavo Aguilar | Bryan McCann | Tong Niu | Nazneen Rajani | Nitish Shirish Keskar | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2021

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, though robust to misspellings, pure character-level models often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement of the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We integrate it further with BERT through pre-training while keeping BERT transformer parameters fixed–and thus, providing a practical method. Finally, we show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.

pdf bib
GeDi: Generative Discriminator Guided Sequence Generation
Ben Krause | Akhilesh Deepak Gotmare | Bryan McCann | Nitish Shirish Keskar | Shafiq Joty | Richard Socher | Nazneen Fatema Rajani
Findings of the Association for Computational Linguistics: EMNLP 2021

pdf bib
Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models
Tianxing He | Bryan McCann | Caiming Xiong | Ehsan Hosseini-Asl
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

In this work, we explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders (e.g., Roberta) for natural language understanding (NLU) tasks. Our experiments show that EBM training can help the model reach a better calibration that is competitive to strong baselines, with little or no loss in accuracy. We discuss three variants of energy functions (namely scalar, hidden, and sharp-hidden) that can be defined on top of a text encoder, and compare them in experiments. Due to the discreteness of text data, we adopt noise contrastive estimation (NCE) to train the energy-based model. To make NCE training more effective, we train an auto-regressive noise model with the masked language model (MLM) objective.


pdf bib
Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation
Tianlu Wang | Xi Victoria Lin | Nazneen Fatema Rajani | Bryan McCann | Vicente Ordonez | Caiming Xiong
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.

pdf bib
The Thieves on Sesame Street are Polyglots - Extracting Multilingual Models from Monolingual APIs
Nitish Shirish Keskar | Bryan McCann | Caiming Xiong | Richard Socher
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Pre-training in natural language processing makes it easier for an adversary with only query access to a victim model to reconstruct a local copy of the victim by training with gibberish input data paired with the victim’s labels for that data. We discover that this extraction process extends to local copies initialized from a pre-trained, multilingual model while the victim remains monolingual. The extracted model learns the task from the monolingual victim, but it generalizes far better than the victim to several other languages. This is done without ever showing the multilingual, extracted model a well-formed input in any of the languages for the target task. We also demonstrate that a few real examples can greatly improve performance, and we analyze how these results shed light on how such extraction methods succeed.

pdf bib
Evaluating the Factual Consistency of Abstractive Text Summarization
Wojciech Kryscinski | Bryan McCann | Caiming Xiong | Richard Socher
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The most common metrics for assessing summarization algorithms do not account for whether summaries are factually consistent with source documents. We propose a weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and generated summaries. Training data is generated by applying a series of rule-based transformations to the sentences of source documents.The factual consistency model is then trained jointly for three tasks: 1) predict whether each summary sentence is factually consistent or not, 2) in either case, extract a span in the source document to support this consistency prediction, 3) for each summary sentence that is deemed inconsistent, extract the inconsistent span from it. Transferring this model to summaries generated by several neural models reveals that this highly scalable approach outperforms previous models, including those trained with strong supervision using datasets from related domains, such as natural language inference and fact checking. Additionally, human evaluation shows that the auxiliary span extraction tasks provide useful assistance in the process of verifying factual consistency. We also release a manually annotated dataset for factual consistency verification, code for training data generation, and trained model weights at


pdf bib
Neural Text Summarization: A Critical Evaluation
Wojciech Kryscinski | Nitish Shirish Keskar | Bryan McCann | Caiming Xiong | Richard Socher
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Text summarization aims at compressing long documents into a shorter form that conveys the most important parts of the original document. Despite increased interest in the community and notable research effort, progress on benchmark datasets has stagnated. We critically evaluate key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlight three primary shortcomings: 1) automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation, 2) current evaluation protocol is weakly correlated with human judgment and does not account for important characteristics such as factual correctness, 3) models overfit to layout biases of current datasets and offer limited diversity in their outputs.

pdf bib
BERT is Not an Interlingua and the Bias of Tokenization
Jasdeep Singh | Bryan McCann | Richard Socher | Caiming Xiong
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

Multilingual transfer learning can benefit both high- and low-resource languages, but the source of these improvements is not well understood. Cananical Correlation Analysis (CCA) of the internal representations of a pre- trained, multilingual BERT model reveals that the model partitions representations for each language rather than using a common, shared, interlingual space. This effect is magnified at deeper layers, suggesting that the model does not progressively abstract semantic con- tent while disregarding languages. Hierarchical clustering based on the CCA similarity scores between languages reveals a tree structure that mirrors the phylogenetic trees hand- designed by linguists. The subword tokenization employed by BERT provides a stronger bias towards such structure than character- and word-level tokenizations. We release a subset of the XNLI dataset translated into an additional 14 languages at to assist further research into multilingual representations.

pdf bib
Explain Yourself! Leveraging Language Models for Commonsense Reasoning
Nazneen Fatema Rajani | Bryan McCann | Caiming Xiong | Richard Socher
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Deep learning models perform poorly on tasks that require commonsense reasoning, which often necessitates some form of world-knowledge or reasoning over information not immediately present in the input. We collect human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations in a new dataset called Common Sense Explanations (CoS-E). We use CoS-E to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework. CAGE improves the state-of-the-art by 10% on the challenging CommonsenseQA task. We further study commonsense reasoning in DNNs using both human and auto-generated explanations including transfer to out-of-domain tasks. Empirical results indicate that we can effectively leverage language models for commonsense reasoning.