Sumit Kumar

2025

pdf bib abs
AdaBioBERT: Adaptive Token Sequence Learning for Biomedical Named Entity Recognition
Sumit Kumar | Tanmay Basu
Proceedings of the 24th Workshop on Biomedical Language Processing

Accurate identification and labeling of biomedical entities, such as diseases, genes, chemical and species, within scientific texts are crucial for understanding complex relationships. We propose Adaptive BERT or AdaBioBERT, a robust named entity recognition (NER) model that builds upon BioBERT (Biomedical Bidirectional Encoded Representation from Transformers) based on an adaptive loss function to learn different types of biomedical token sequence. This adaptive loss function combines the standard Cross Entropy (CE) loss and Conditional Random Field (CRF) loss to optimize both token level accuracy and sequence-level coherence. AdaBioBERT captures rich semantic nuances by leveraging pre-trained contextual embeddings from BioBERT. On the other hand, the CRF loss of AdaBioBERT ensures proper identification of complex multi-token biomedical entities in a sequence and the CE loss can capture the simple unigram entities in a sequence. The empirical analysis on multiple standard biomedical coprora demonstrates that AdaBioBERT performs better than the state of the arts for most of the datasets in terms of macro and micro averaged F1 score.’

Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.

2020

pdf bib abs
CLPLM: Character Level Pretrained Language Model for ExtractingSupport Phrases for Sentiment Labels
Raj Pranesh | Sumit Kumar | Ambesh Shekhar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

In this paper, we have designed a character-level pre-trained language model for extracting support phrases from tweets based on the sentiment label. We also propose a character-level ensemble model designed by properly blending Pre-trained Contextual Embeddings (PCE) models- RoBERTa, BERT, and ALBERT along with Neural network models- RNN, CNN and WaveNet at different stages of the model. For a given tweet and associated sentiment label, our model predicts the span of phrases in a tweet that prompts the particular sentiment in the tweet. In our experiments, we have explored various model architectures and configuration for both single as well as ensemble models. We performed a systematic comparative analysis of all the model’s performance based on the Jaccard score obtained. The best performing ensemble model obtained the highest Jaccard scores of 73.5, giving it a relative improvement of 2.4% over the best performing single RoBERTa based character-level model, at 71.5(Jaccard score).

Sumit Kumar

2025

2020

2004

Co-authors

Venues