James O’ Neill
Also published as: James O’Neill
2023
Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models
James O’Neill | Sourav Dutta
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to multilingual models XLM-R Base and InfoXLM Base and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.
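For readers unfamiliar with the float-to-integer reduction mentioned in this abstract, the following is a minimal sketch of generic symmetric per-tensor 8-bit quantization. It is not the SDQ method, which the abstract does not specify in detail. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen only for illustration.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale.

    Generic symmetric post-training quantization; an illustrative
    assumption, not the SDQ procedure from the paper.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weight values, used only to exercise the functions.
weights = [0.82, -1.27, 0.003, 0.5]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, which is where the roughly fourfold size reduction from 32-bit floats comes from. The rounding error introduced here is the kind of error that, according to the abstract, accumulates across layers.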
2022
Aligned Weight Regularizers for Pruning Pretrained Neural Networks
James O’ Neill | Sourav Dutta | Haytham Assem
Findings of the Association for Computational Linguistics: ACL 2022
Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization.
2021
I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Review
James O’Neill | Polina Rozenshtein | Ryuichi Kiryo | Motoko Kubota | Danushka Bollegala
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese languages. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation on English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.
2020
Do not let the history haunt you: Mitigating Compounding Errors in Conversational Question Answering
Angrosh Mandya | James O’ Neill | Danushka Bollegala | Frans Coenen
Proceedings of the Twelfth Language Resources and Evaluation Conference
The Conversational Question Answering (CoQA) task involves answering a sequence of inter-related conversational questions about a contextual paragraph. Although existing approaches employ human-written ground-truth answers for answering conversational questions at test time, in a realistic scenario, the CoQA model will not have any access to ground-truth answers for the previous questions, compelling the model to rely upon its own previously predicted answers for answering the subsequent questions. In this paper, we find that compounding errors occur when using previously predicted answers at test time, significantly lowering the performance of CoQA systems. To solve this problem, we propose a sampling strategy that dynamically selects between target answers and model predictions during training, thereby closely simulating the situation at test time. Further, we analyse the severity of this phenomenon as a function of the question type, conversation length and domain type.
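The sampling idea described in this abstract can be sketched roughly as follows. This is an illustrative assumption about how such a selection might look, not the authors' implementation: the function name `build_history`, the `p_gold` parameter, and the fixed per-turn probability are all hypothetical, and the abstract does not state how the selection probability is set or varied.

```python
import random


def build_history(gold_answers, predicted_answers, p_gold, rng):
    """Pick, per previous turn, the gold answer or the model's prediction.

    p_gold is a hypothetical probability of using the ground-truth answer;
    lowering it exposes the model to its own errors during training.
    """
    history = []
    for gold, pred in zip(gold_answers, predicted_answers):
        # rng.random() is in [0, 1), so p_gold=1.0 always picks gold
        # and p_gold=0.0 always picks the prediction.
        history.append(gold if rng.random() < p_gold else pred)
    return history


# Hypothetical answers for three earlier turns of one conversation.
gold = ["in Paris", "1889", "Gustave Eiffel"]
pred = ["in France", "1889", "an engineer"]

rng = random.Random(0)
all_gold = build_history(gold, pred, p_gold=1.0, rng=rng)
all_pred = build_history(gold, pred, p_gold=0.0, rng=rng)
mixed = build_history(gold, pred, p_gold=0.5, rng=rng)
```

With `p_gold=0.0` the training history matches the test-time situation the abstract describes, where only the model's own earlier predictions are available.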
2017
NUIG at EmoInt-2017: BiLSTM and SVR Ensemble to Detect Emotion Intensity
Vladimir Andryushechkin | Ian Wood | James O’ Neill
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
This paper describes the entry NUIG in the WASSA 2017 (8th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis) shared task on emotion recognition. The NUIG system used an SVR (SVM regression) and BLSTM ensemble, utilizing primarily n-grams (for SVR features) and tweet word embeddings (for BLSTM features). Experiments were carried out on several other candidate features, some of which were added to the SVR model. Parameter selection for the SVR model was run as a grid search whilst parameters for the BLSTM model were selected through a non-exhaustive ad-hoc search.