Bharath Dandala


2024

INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee | Aashka Trivedi | Masayasu Muraoka | Muthukumaran Ramasubramanian | Takuma Udagawa | Iksha Gurung | Nishan Pantha | Rong Zhang | Bharath Dandala | Rahul Ramachandran | Manil Maskey | Kaylin Bugbee | Michael M. Little | Elizabeth Fancher | Irina Gerasimov | Armin Mehrabian | Lauren Sanders | Sylvain V. Costes | Sergi Blanco-Cuaresma | Kelly Lockhart | Thomas Allen | Felix Grezes | Megan Ansdell | Alberto Accomazzi | Yousef El-Kurdi | Davis Wertheimer | Birgit Pfitzmann | Cesar Berrospi Ramis | Michele Dolfi | Rafael Teixeira De Lima | Panagiotis Vagenas | S. Karthik Mukkavilli | Peter W. J. Staar | Sanaz Vahidinia | Ryan McGranaghan | Tsengdar J. Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained using a diverse set of datasets to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation for applications with latency or resource constraints. We also created three new scientific benchmark datasets, Climate-Change NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (retrieval), to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SciBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings: as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
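For readers unfamiliar with the two training techniques the abstract names, the sketch below illustrates an in-batch-negatives contrastive objective (commonly used for text embedding models) and a soft-label distillation loss. It is a minimal sketch of the general techniques, not INDUS's actual setup: the function names, temperature values, and embedding dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """In-batch-negatives contrastive loss: the positive passage for each
    query sits at the same batch index; all other passages are negatives.
    Temperature is an illustrative value, not the paper's setting."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature        # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Generic soft-label knowledge distillation: match the student's
    softened output distribution to the teacher's."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

# Toy usage with random tensors standing in for encoder outputs.
queries, passages = torch.randn(8, 768), torch.randn(8, 768)
print(info_nce_loss(queries, passages).item())
print(distillation_loss(torch.randn(8, 10), torch.randn(8, 10)).item())
```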

2016

Scoring Disease-Medication Associations using Advanced NLP, Machine Learning, and Multiple Content Sources
Bharath Dandala | Murthy Devarakonda | Mihaela Bornea | Christopher Nielson
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Effective knowledge resources are critical for developing successful clinical decision support systems that alleviate the cognitive load on physicians in patient care. In this paper, we describe two new methods for building a knowledge resource of disease-to-medication associations. These methods use fundamentally different content and are based on advanced natural language processing and machine learning techniques. One method uses distributional semantics on large medical text, and the other uses data mining on a large number of patient records. The methods are evaluated using 25,379 unique disease-medication pairs extracted from 100 de-identified longitudinal patient records of a large multi-provider hospital system. We measured recall (R), precision (P), and F scores for positive and negative association prediction, along with coverage and accuracy. While individual methods performed well, a combined stacked classifier achieved the best performance, indicating the limitations and unique value of each resource and method. In predicting positive associations, the stacked combination significantly outperformed the baseline (a distant semi-supervised method on large medical text), achieving F scores of 0.75 versus 0.55 on the pairs seen in the patient records, and 0.69 versus 0.35 on unique pairs.
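To make the stacking idea concrete, here is a minimal scikit-learn sketch of a stacked classifier that combines two base predictors through a meta-learner. The features, estimators, and synthetic data below are placeholder assumptions; the paper's actual feature sets and learners are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: column 0 stands in for a distributional-semantics
# similarity score from medical text, column 1 for a co-occurrence statistic
# mined from patient records; label 1 = positive disease-medication pair.
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("text_sim", LogisticRegression()),
        ("ehr_mining", DecisionTreeClassifier(max_depth=3)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner over base outputs
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```

The meta-learner sees each base model's cross-validated predictions rather than the raw features, which is what lets stacking exploit the complementary strengths of heterogeneous resources, as the abstract reports.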

2013

Sense Clustering Using Wikipedia
Bharath Dandala | Chris Hokamp | Rada Mihalcea | Razvan Bunescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Multilingual Word Sense Disambiguation Using Wikipedia
Bharath Dandala | Rada Mihalcea | Razvan Bunescu
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia
Bharath Dandala | Rada Mihalcea | Razvan Bunescu
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)