2023
TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
Shubhra Kanti Karmaker Santu | Dongji Feng
Findings of the Association for Computational Linguistics: EMNLP 2023
While LLMs have shown great success in understanding and generating text in traditional conversational settings, their potential for performing ill-defined complex tasks is largely under-studied and yet to be benchmarked. However, conducting such benchmarking studies is challenging because of the large variations in LLMs’ performance when different prompt types/styles are used and different degrees of detail are provided in the prompts. To address this issue, this paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks. This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study, enabling meaningful comparisons across different studies. Also, by establishing a common standard through this taxonomy, researchers will be able to draw more accurate conclusions about LLMs’ performance on a specific complex task.
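For illustration, here is a hypothetical sketch of what prompts at increasing levels of detail might look like for a single task; the level definitions below are invented for this example and are not the taxonomy's actual categories (see the paper for those).

```python
# Illustrative only: hypothetical prompt variants for one task at
# increasing levels of detail, in the spirit of a prompt-detail taxonomy.
# These level definitions are NOT taken from the TELeR paper.
PROMPT_LEVELS = {
    0: "",  # no directive; the raw input is the entire prompt
    1: "Summarize the following meeting transcript.",
    2: ("Summarize the following meeting transcript in one paragraph, "
        "covering the main decisions and action items."),
    3: ("You are a meeting assistant. Summarize the following transcript "
        "in one paragraph. List every decision and action item, name the "
        "owner of each action item, and do not include small talk."),
}

def build_prompt(level: int, transcript: str) -> str:
    """Prepend the level-specific directive to the task input."""
    return f"{PROMPT_LEVELS[level]}\n\n{transcript}".strip()
```

A benchmarking study could then report which prompt level was used, making results comparable across studies.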
On Evaluation of Bangla Word Analogies
Mousumi Akter | Souvika Sarkar | Shubhra Kanti Karmaker Santu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
This paper presents a benchmark dataset of Bangla word analogies for evaluating the quality of existing Bangla word embeddings. Despite being the seventh most widely spoken language in the world, Bangla is still a low-resource language, and popular NLP models often struggle to perform well on Bangla datasets. A robust evaluation set, which is currently missing, is therefore crucial for benchmarking and guiding future research on improving Bangla word embeddings. To address this issue, we introduce a new evaluation set of 16,678 unique word analogies in Bangla, as well as a translated and curated Bangla version of the original Mikolov dataset (10,594 samples). Our experiments with different state-of-the-art embedding models reveal that current Bangla word embeddings struggle to achieve high accuracy on both datasets, demonstrating a significant gap in multilingual NLP research.
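For context, word-analogy benchmarks of this kind are typically scored with the vector-offset method: for an item a : b :: c : d, compute vec(b) − vec(a) + vec(c) and check whether the nearest word by cosine similarity is d. A minimal sketch, assuming embeddings are stored as a word-to-vector dictionary:

```python
import numpy as np

def solve_analogy(a, b, c, embeddings):
    """Return the word maximizing cos(vec(b) - vec(a) + vec(c), vec(d)),
    excluding the three query words, as is standard for this benchmark."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    query /= np.linalg.norm(query)
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = float(vec @ query) / float(np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy_accuracy(analogies, embeddings):
    """Fraction of (a, b, c, d) items answered correctly."""
    hits = sum(solve_analogy(a, b, c, embeddings) == d
               for a, b, c, d in analogies)
    return hits / len(analogies)
```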
Zero-Shot Multi-Label Topic Inference with Sentence Encoders and LLMs
Souvika Sarkar | Dongji Feng | Shubhra Kanti Karmaker Santu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
In this paper, we conducted a comprehensive study with the latest Sentence Encoders and Large Language Models (LLMs) on the challenging task of “definition-wild zero-shot topic inference”, where users define or provide the topics of interest in real-time. Through extensive experimentation on seven diverse data sets, we observed that LLMs, such as ChatGPT-3.5 and PaLM, demonstrated superior generality compared to other LLMs, e.g., BLOOM and GPT-NeoX. Furthermore, Sentence-BERT, a BERT-based classical sentence encoder, outperformed PaLM and achieved performance comparable to ChatGPT-3.5.
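A minimal sketch of the general recipe (not the paper's exact protocol): embed the document and the user-defined topic phrases with a sentence encoder, then assign every topic whose cosine similarity clears a threshold. The model name and threshold below are assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT-style encoder

def infer_topics(text, topics, threshold=0.35):
    """Multi-label zero-shot inference: return the subset of user-defined
    topics whose embedding is close to the document embedding."""
    doc_emb = model.encode(text, convert_to_tensor=True)
    topic_embs = model.encode(topics, convert_to_tensor=True)
    sims = util.cos_sim(doc_emb, topic_embs)[0]  # one score per topic
    return [t for t, s in zip(topics, sims) if float(s) >= threshold]

print(infer_topics(
    "The central bank raised interest rates to curb inflation.",
    ["economy", "sports", "monetary policy"],
))
```

Because the topics are supplied at query time rather than fixed at training time, no re-training is needed when users change the label set.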
2022
SEM-F1: an Automatic Way for Semantic Evaluation of Multi-Narrative Overlap Summaries at Scale
Naman Bansal | Mousumi Akter | Shubhra Kanti Karmaker Santu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Recent work has introduced an important yet relatively under-explored NLP task called Semantic Overlap Summarization (SOS), which entails generating a summary from multiple alternative narratives that conveys the common information provided by those narratives. Previous work also published a benchmark dataset for this task by collecting 2,925 alternative narrative pairs from the web and manually annotating 411 different reference summaries with the help of human annotators. In this paper, we focus exclusively on the automated evaluation of the SOS task using this benchmark dataset. More specifically, we first borrow the popular ROUGE metric from the text-summarization literature and conduct a systematic study to evaluate the SOS task. Our experiments reveal that ROUGE is not suitable for this novel task; we therefore propose a new sentence-level, precision-recall-style automated evaluation metric called SEM-F1 (Semantic F1), inspired by the benefits of the sentence-wise annotation technique using overlap labels reported in previous work. Our experiments show that the proposed SEM-F1 metric yields higher correlation with human judgment and higher inter-rater agreement than the ROUGE metric.
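A simplified sketch of a sentence-level precision-recall metric in this spirit (see the paper for SEM-F1's exact formulation), with `sim` standing in for any sentence-similarity function, e.g., cosine similarity over sentence embeddings:

```python
import numpy as np

def sem_f1_sketch(system_sents, reference_sents, sim):
    """Precision scores each system sentence against its best-matching
    reference sentence; recall is the symmetric direction; F1 combines them."""
    precision = np.mean([max(sim(s, r) for r in reference_sents)
                         for s in system_sents])
    recall = np.mean([max(sim(r, s) for s in system_sents)
                      for r in reference_sents])
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```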
Learning to Generate Overlap Summaries through Noisy Synthetic Data
Naman Bansal | Mousumi Akter | Shubhra Kanti Karmaker Santu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Semantic Overlap Summarization (SOS) is a novel and relatively under-explored seq-to-seq task which entails summarizing common information from multiple alternative narratives. One of the major challenges in solving this task is the lack of existing datasets for supervised training. To address this challenge, we propose a novel data augmentation technique that allows us to create a large amount of synthetic data for training a seq-to-seq model to perform the SOS task. Through extensive experiments using narratives from the news domain, we show that models fine-tuned on the synthetic dataset provide significant performance improvements over pre-trained vanilla summarization techniques and come close to models fine-tuned on the gold training data, which demonstrates the effectiveness of our proposed data augmentation technique for training seq-to-seq models on the SOS task.
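One plausible way such synthetic triples could be constructed (a hypothetical sketch; the paper's actual augmentation procedure may differ): build two narratives from one source document so that their shared sentences form a noisy overlap-summary target.

```python
import random

def make_synthetic_example(sentences, shared_frac=0.4, seed=None):
    """Split a document's sentences into a shared core plus two disjoint
    remainders, yielding two narratives and a noisy overlap target.
    `shared_frac` is an illustrative hyperparameter, not the paper's."""
    rng = random.Random(seed)
    sents = sentences[:]
    rng.shuffle(sents)
    n_shared = max(1, int(len(sents) * shared_frac))
    shared, rest = sents[:n_shared], sents[n_shared:]
    half = len(rest) // 2
    narrative_a = shared + rest[:half]
    narrative_b = shared + rest[half:]
    rng.shuffle(narrative_a)
    rng.shuffle(narrative_b)
    return {
        "narrative_a": " ".join(narrative_a),
        "narrative_b": " ".join(narrative_b),
        "overlap_summary": " ".join(shared),  # noisy supervision target
    }
```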
Analogy-Guided Evolutionary Pretraining of Binary Word Embeddings
R. Alexander Knipper | Md. Mahadi Hassan | Mehdi Sadi | Shubhra Kanti Karmaker Santu
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
As low-powered computing paradigms (Neuromorphic Computing, Spiking Neural Networks, etc.) become more popular, learning binary word embeddings has become increasingly important for supporting NLP applications at the edge. Existing binary word embeddings are mostly derived from pretrained real-valued embeddings through simple transformations, which often break the semantic consistency and the so-called “arithmetic” properties learned by the original, real-valued embeddings. This paper addresses this limitation by introducing a new approach that learns binary embeddings from scratch, preserving both the semantic relationships between words and the arithmetic properties of the embeddings themselves. To achieve this, we propose a novel genetic algorithm that learns the relationships between words from existing word analogy datasets while carefully ensuring that the arithmetic properties of those relationships are preserved. Evaluating our generated 16-, 32-, and 64-bit binary word embeddings on Mikolov’s word analogy task shows that, more than 95% of the time, the best fit for the analogy is ranked among the top 5 most similar words in terms of cosine similarity.
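A toy sketch of the evolutionary idea (operators, hyperparameters, and data below are illustrative, not the paper's): evolve a population of binary embedding matrices whose fitness is how often analogy arithmetic b − a + c retrieves the correct word under cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BITS, POP = 50, 16, 20

def fitness(E, analogies):
    """Fraction of analogies whose target d is the nearest word (cosine)
    to the arithmetic query b - a + c over the binary vectors."""
    hits = 0
    norms = np.linalg.norm(E, axis=1) + 1e-9
    for a, b, c, d in analogies:
        q = E[b] - E[a] + E[c]
        sims = E @ q / (norms * (np.linalg.norm(q) + 1e-9))
        sims[[a, b, c]] = -np.inf  # exclude the query words
        hits += int(np.argmax(sims) == d)
    return hits / len(analogies)

# Toy analogy set over word indices, standing in for real analogy data.
analogies = [tuple(rng.choice(VOCAB, 4, replace=False)) for _ in range(30)]
population = rng.integers(0, 2, size=(POP, VOCAB, BITS))

for generation in range(100):
    scores = np.array([fitness(E, analogies) for E in population])
    parents = population[np.argsort(scores)[-POP // 2:]]  # selection
    children = parents.copy()
    flips = rng.random(children.shape) < 0.01             # bit-flip mutation
    children[flips] ^= 1
    population = np.concatenate([parents, children])
```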
Exploring Universal Sentence Encoders for Zero-shot Text Classification
Souvika Sarkar | Dongji Feng | Shubhra Kanti Karmaker Santu
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
The Universal Sentence Encoder (USE) has recently gained much popularity as a general-purpose sentence encoding technique. As the name suggests, USE is designed to be fairly general and has indeed been shown to achieve superior performance on many downstream NLP tasks. In this paper, we present an interesting “negative” result on USE in the context of zero-shot text classification, a challenging task which has recently gained much attention. More specifically, we found several cases of zero-shot text classification where topic-based inference outperformed USE-based inference in terms of F1 score. Further investigation revealed that USE struggles to perform well on datasets with a large number of labels with high semantic overlap, whereas topic-based classification handles the same datasets well.
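A minimal sketch of USE-based zero-shot classification as described: embed the input text and the candidate label names with the public Universal Sentence Encoder release, then pick the most similar label by cosine similarity.

```python
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def zero_shot_classify(text, labels):
    """Return the label whose USE embedding is closest to the text's."""
    vecs = embed([text] + labels).numpy()
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs[1:] @ vecs[0]  # cosine similarity of each label to the text
    return labels[int(np.argmax(sims))]

print(zero_shot_classify("The team won the championship.",
                         ["sports", "finance", "politics"]))
```

Intuitively, the failure mode reported above arises when several label embeddings land nearly equidistant from the text embedding.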
Semantic Overlap Summarization among Multiple Alternative Narratives: An Exploratory Study
Naman Bansal | Mousumi Akter | Shubhra Kanti Karmaker Santu
Proceedings of the 29th International Conference on Computational Linguistics
In this paper, we introduce an important yet relatively unexplored NLP task called Semantic Overlap Summarization (SOS), which entails generating a single summary from multiple alternative narratives that conveys the common information provided by those narratives. As no benchmark dataset is readily available for this task, we created one by collecting 2,925 alternative narrative pairs from the web and then went through the tedious process of manually creating 411 different reference summaries by engaging human annotators. To evaluate this novel task, we first conducted a systematic study borrowing the popular ROUGE metric from the text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations to create 200 document-level and 1,518 sentence-level ground-truth overlap labels. Our experiments show that the sentence-wise annotation technique with three overlap labels, i.e., Absent (A), Partially-Present (PP), and Present (P), yields higher correlation with human judgment and higher inter-rater agreement than the ROUGE metric.
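For readers unfamiliar with agreement statistics over categorical labels such as A/PP/P, here is a small sketch using Cohen's kappa (the label sequences below are made up for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' sentence-level overlap labels for the same eight sentences.
rater_1 = ["P", "A", "PP", "P", "A", "PP", "P", "A"]
rater_2 = ["P", "A", "PP", "PP", "A", "PP", "P", "A"]

# Kappa corrects raw agreement for chance; 1.0 means perfect agreement.
print(cohen_kappa_score(rater_1, rater_2))
```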
2019
TILM: Neural Language Models with Evolving Topical Influence
Shubhra Kanti Karmaker Santu | Kalyan Veeramachaneni | Chengxiang Zhai
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
The content of text data is often influenced by contextual factors that evolve over time (e.g., the content of social media is often influenced by the topics covered in major news streams). Existing language models do not consider the influence of such related evolving topics and are thus not optimal. In this paper, we propose to incorporate such topical influence into a language model to both improve its accuracy and enable cross-stream analysis of topical influences. Specifically, we propose a novel language model called the Topical Influence Language Model (TILM), an extension of a neural language model that captures the influence on the content of one text stream by the evolving topics in another related (or possibly the same) text stream. Experimental results on six different text streams comprised of conference paper titles show that incorporating evolving topical influence into a language model is beneficial, and TILM outperforms multiple baselines on the challenging task of text forecasting. In addition to serving as a language model, TILM further enables interesting analyses of topical influence among multiple text streams.
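One simple way to condition a neural language model on an evolving topic signal (a minimal sketch under assumptions; this is not the TILM architecture itself): concatenate the current topic vector of the related stream to every word embedding before the recurrent layer.

```python
import torch
import torch.nn as nn

class TopicConditionedLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, topic_dim=32, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim + topic_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, topic_vec):
        # tokens: (batch, seq_len); topic_vec: (batch, topic_dim), e.g. the
        # topic distribution of the related stream at the current time step.
        x = self.embed(tokens)
        topic = topic_vec.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.rnn(torch.cat([x, topic], dim=-1))
        return self.out(h)  # next-word logits

model = TopicConditionedLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 12)), torch.randn(4, 32))
```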