Cryptocurrency has attracted considerable public attention and opinion worldwide. Users have varied information needs on this topic, and publicly available information is a good resource for satisfying them. In this paper, we investigate public opinion on cryptocurrency and Bitcoin on two social media platforms, Twitter and Reddit. We have created a multi-level dataset, CryptOpiQA, and drawn insights from it. The dataset contains both gold-standard (manually annotated) and silver-standard (inferred from the gold standard) labels. As part of this dataset, we have also created a question-answering sub-corpus. We have used state-of-the-art LLMs and advanced techniques such as retrieval-augmented generation (RAG) to improve question-answering (QnA) results. We believe this dataset and analysis will be useful to the research community for studying user opinions and question answering on cryptocurrency.
Large Language Models (LLMs) have impacted many real-life tasks. To examine the efficacy of LLMs in a high-stakes domain like law, we have applied state-of-the-art LLMs to two popular tasks, Statute Prediction and Judgment Prediction, on Indian Supreme Court cases. We see that while LLMs exhibit excellent predictive performance in Statute Prediction, their performance in Judgment Prediction falls below that of many standard models. The explanations that LLMs generate alongside their predictions are of moderate to decent quality. We also see evidence of gender and religious bias in the LLM-predicted results. In addition, we present a note from a senior legal expert on the ethical concerns of deploying LLMs in these critical legal tasks.
Code-Switching (CS) is extremely common in communities with societal multilingualism, where speakers switch between two or more languages when interacting with each other. Linguists have studied CS extensively in spoken language for several decades, but with the popularity of social media and less formal Computer Mediated Communication, we now see a sharp rise in the use of CS in text form. This poses interesting challenges and creates a need for computational processing of such code-switched data. As with any Computational Linguistic analysis or Natural Language Processing tool or application, we need annotated data for understanding, processing, and generating code-switched language. In this study, we focus on CS between English and Hindi in tweets extracted from the Twitter stream of Hindi-English bilinguals. We present a scheme for annotating the pragmatic functions of CS in Hindi-English (Hi-En) code-switched tweets, based on a linguistic analysis and some initial experiments.