2024
INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee | Aashka Trivedi | Masayasu Muraoka | Muthukumaran Ramasubramanian | Takuma Udagawa | Iksha Gurung | Nishan Pantha | Rong Zhang | Bharath Dandala | Rahul Ramachandran | Manil Maskey | Kaylin Bugbee | Michael M. Little | Elizabeth Fancher | Irina Gerasimov | Armin Mehrabian | Lauren Sanders | Sylvain V. Costes | Sergi Blanco-Cuaresma | Kelly Lockhart | Thomas Allen | Felix Grezes | Megan Ansdell | Alberto Accomazzi | Yousef El-Kurdi | Davis Wertheimer | Birgit Pfitzmann | Cesar Berrospi Ramis | Michele Dolfi | Rafael Teixeira De Lima | Panagiotis Vagenas | S. Karthik Mukkavilli | Peter W. J. Staar | Sanaz Vahidinia | Ryan McGranaghan | Tsengdar J. Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained using a diverse set of datasets to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation for applications which have latency or resource constraints. We also created three new scientific benchmark datasets, Climate-Change NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SciBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings: as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
2023
Muted: Multilingual Targeted Offensive Speech Identification and Visualization
Christoph Tillmann | Aashka Trivedi | Sara Rosenthal | Santosh Borse | Rong Zhang | Avirup Sil | Bishwaranjan Bhattacharjee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Offensive language such as hate, abuse, and profanity (HAP) occurs in various content on the web. While previous work has mostly dealt with sentence-level annotations, there have been a few recent attempts to identify offensive spans as well. We build upon this work and introduce MUTED, a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity. MUTED can leverage any transformer-based HAP-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. In addition, we use the spaCy library to identify the specific targets and arguments for the words predicted by the attention heat maps. We present the model’s performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text. Finally, we demonstrate our proposed visualization tool on multilingual inputs.
A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models
Takuma Udagawa | Aashka Trivedi | Michele Merler | Bishwaranjan Bhattacharjee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models have become a vital component in modern NLP, achieving state-of-the-art performance in a variety of tasks. However, they are often inefficient for real-world deployment due to their expensive inference costs. Knowledge distillation is a promising technique to improve their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer language models. Our target of study includes Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through our extensive experiments, we study the effectiveness of each method for various student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation and explain the potential reasons behind its success. Moreover, we show that HS transfer remains a competitive baseline, especially under a sophisticated layer mapping strategy, while OD transfer consistently lags behind other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications.
A Simple Yet Strong Domain-Agnostic De-bias Method for Zero-Shot Sentiment Classification
Yang Zhao | Tetsuya Nasukawa | Masayasu Muraoka | Bishwaranjan Bhattacharjee
Findings of the Association for Computational Linguistics: ACL 2023
Zero-shot prompt-based learning has made much progress in sentiment analysis, and considerable effort has been dedicated to designing high-performing prompt templates. However, two problems exist. First, large language models are often biased toward their pre-training data, leading to poor performance on prompt templates that models have rarely seen. Second, in order to adapt to different domains, re-designing prompt templates is usually required, which is time-consuming and inefficient. To remedy both shortcomings, we propose a simple yet strong data construction method to de-bias a given prompt template, yielding a large performance improvement in sentiment analysis tasks across different domains, pre-trained language models, and prompt templates. Also, we demonstrate the advantage of using domain-agnostic generic responses over the in-domain ground-truth data.
2020
Visual Objects As Context: Exploiting Visual Objects for Lexical Entailment
Masayasu Muraoka | Tetsuya Nasukawa | Bishwaranjan Bhattacharjee
Findings of the Association for Computational Linguistics: EMNLP 2020
We propose a new word representation method derived from visual objects in associated images to tackle the lexical entailment task. Although it has been shown that the Distributional Informativeness Hypothesis (DIH) holds on text, in which the DIH assumes that a context surrounding a hyponym is more informative than that of a hypernym, it has never been tested on visual objects. Since our perception is tightly associated with language, it is meaningful to explore whether the DIH holds on visual objects. To this end, we consider visual objects as the context of a word and represent a word as a bag of visual objects found in images associated with the word. This allows us to test the feasibility of the visual DIH. To better distinguish word pairs in a hypernym relation from other relations such as co-hypernyms, we also propose a new measurable function that takes into account both the difference in the generality of meaning and similarity of meaning between words. Our experimental results show that the DIH holds on visual objects and that the proposed method combined with the proposed function outperforms existing unsupervised representation methods.
Predictive Model Selection for Transfer Learning in Sequence Labeling Tasks
Parul Awasthy | Bishwaranjan Bhattacharjee | John Kender | Radu Florian
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
Transfer learning is a popular technique to learn a task using less training data and fewer compute resources. However, selecting the correct source model for transfer learning is a challenging task. We demonstrate a novel predictive method that determines which existing source model would minimize error for transfer learning to a given target. This technique does not require learning for prediction, and avoids the computational costs of trial and error. We have evaluated this technique on nine datasets across diverse domains, including newswire, user forums, air flight booking, cybersecurity news, etc. We show that it performs better than existing techniques such as fine-tuning over vanilla BERT, or curriculum learning over the largest dataset on top of BERT, resulting in average F1 score gains in excess of 3%. Moreover, our technique consistently selects the best model using fewer tries.