Proceedings of the 1st Workshop on NLP for Science (NLP4Science)

Lotem Peled-Cohen, Nitay Calderon, Shir Lissak, Roi Reichart (Editors)


Anthology ID:
2024.nlp4science-1
Month:
November
Year:
2024
Address:
Miami, FL, USA
Venue:
NLP4Science
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2024.nlp4science-1
PDF:
https://aclanthology.org/2024.nlp4science-1.pdf

TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation
Miriam Horovicz | Roni Goldshmidt

As large language models (LLMs) become increasingly prevalent in critical applications, the need for interpretable AI has grown. We introduce TokenSHAP, a novel method for interpreting LLMs by attributing importance to individual tokens or substrings within input prompts. This approach adapts Shapley values from cooperative game theory to natural language processing, offering a rigorous framework for understanding how different parts of an input contribute to a model’s response. TokenSHAP leverages Monte Carlo sampling for computational efficiency, providing interpretable, quantitative measures of token importance. We demonstrate its efficacy across diverse prompts and LLM architectures, showing consistent improvements over existing baselines in alignment with human judgments, faithfulness to model behavior, and consistency. Our method’s ability to capture nuanced interactions between tokens provides valuable insights into LLM behavior, enhancing model transparency, improving prompt engineering, and aiding in the development of more reliable AI systems. TokenSHAP represents a significant step towards the necessary interpretability for responsible AI deployment, contributing to the broader goal of creating more transparent, accountable, and trustworthy AI systems. Open-source code: https://github.com/ronigold/TokenSHAP
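
To make the idea of Monte Carlo Shapley estimation over tokens concrete, here is a minimal sketch. It is a generic permutation-sampling estimator, not the authors' implementation: the function names, the sample count, and especially `toy_value` (a hypothetical stand-in for a score that compares the model's response to a partial prompt with its response to the full prompt) are assumptions made for illustration.

```python
import random


def monte_carlo_shapley(tokens, value_fn, num_samples=2000, seed=0):
    """Estimate one Shapley value per token position by sampling permutations."""
    rng = random.Random(seed)
    n = len(tokens)
    totals = [0.0] * n
    for _ in range(num_samples):
        order = list(range(n))
        rng.shuffle(order)
        included = set()
        previous = value_fn([])
        for position in order:
            included.add(position)
            # Keep the surviving tokens in their original prompt order.
            subset = [tokens[i] for i in sorted(included)]
            current = value_fn(subset)
            # Marginal contribution of this token given the tokens added so far.
            totals[position] += current - previous
            previous = current
    return [total / num_samples for total in totals]


def toy_value(subset):
    """Hypothetical value function (assumption): stands in for a similarity
    score between the response to a partial prompt and to the full prompt."""
    score = 0.0
    if "not" in subset:
        score += 1.0
    if "good" in subset:
        score += 0.5
    if "not" in subset and "good" in subset:
        score += 0.5  # interaction between the two tokens
    return score


tokens = ["the", "movie", "is", "not", "good"]
attributions = monte_carlo_shapley(tokens, toy_value)
for token, phi in zip(tokens, attributions):
    print(f"{token:>6}: {phi:.3f}")
```

In this toy setting the attributions sum to the value of the full prompt minus the value of the empty prompt, and the interaction bonus is split between "not" and "good". With a real LLM, each call to the value function would require a model query, which is why sampling rather than exhaustive enumeration of subsets matters.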

Prediction of CRISPR On-Target Effects via Deep Learning
Condy Bao | Fuxiao Liu

Since the advent of CRISPR-Cas9, a groundbreaking gene-editing technology that enables precise genomic modifications via a short RNA guide sequence, there has been a marked increase in the accessibility and application of this technology across various fields. The success of CRISPR-Cas9 has spurred further investment and led to the discovery of additional CRISPR systems, including CRISPR-Cas13. Distinct from Cas9, which targets DNA, Cas13 targets RNA, offering unique advantages for gene modulation. We focus on Cas13d, a variant known for its collateral activity where it non-specifically cleaves adjacent RNA molecules upon activation, a feature critical to its function. We introduce DeepFM-Crispr, a novel deep learning model developed to predict the on-target efficiency and evaluate the off-target effects of Cas13d. This model harnesses a large language model to generate comprehensive representations rich in evolutionary and structural data, thereby enhancing predictions of RNA secondary structures and overall sgRNA efficacy. A transformer-based architecture processes these inputs to produce a predictive efficacy score. Comparative experiments show that DeepFM-Crispr not only surpasses traditional models but also outperforms recent state-of-the-art deep learning methods in terms of prediction accuracy and reliability.

What an Elegant Bridge: Multilingual LLMs are Biased Similarly in Different Languages
Viktor Mihaylov | Aleksandar Shtedritski

This paper investigates biases of Large Language Models (LLMs) through the lens of grammatical gender. Drawing inspiration from seminal works in psycholinguistics, particularly the study of gender’s influence on language perception, we leverage multilingual LLMs to revisit and expand upon the foundational experiments of Boroditsky (2003). Employing LLMs as a novel method for examining psycholinguistic biases related to grammatical gender, we prompt a model to describe nouns with adjectives in various languages, focusing specifically on languages with grammatical gender. In particular, we look at adjective co-occurrences across gender and languages, and train a binary classifier to predict grammatical gender given adjectives an LLM uses to describe a noun. Surprisingly, we find that a simple classifier can not only predict noun gender above chance but also exhibit cross-language transferability. We show that while LLMs may describe words differently in different languages, they are biased similarly.

PsychoLex: Unveiling the Psychological Mind of Large Language Models
Mohammad Amin Abbasi | Farnaz Sadat Mirnezami | Hassan Naderi

This paper explores the intersection of psychology and artificial intelligence through the development and evaluation of specialized Large Language Models (LLMs). We introduce PsychoLex, a suite of resources designed to enhance LLMs’ proficiency in psychological tasks in both Persian and English. Key contributions include the PsychoLexQA dataset for instructional content and the PsychoLexEval dataset for rigorous evaluation of LLMs in complex psychological scenarios. Additionally, we present the PsychoLexLLaMA model, optimized specifically for psychological applications, demonstrating superior performance compared to general-purpose models. The findings underscore the potential of tailored LLMs for advancing psychological research and applications, while also highlighting areas for further refinement. This research offers a foundational step towards integrating LLMs into specialized psychological domains, with implications for future advancements in AI-driven psychological practice.

Two-Stage Graph-Augmented Summarization of Scientific Documents
Rezvaneh Rezapour | Yubin Ge | Kanyao Han | Ray Jeong | Jana Diesner

Automatic text summarization helps to digest the vast and ever-growing amount of scientific publications. While transformer-based solutions like BERT and SciBERT have advanced scientific summarization, lengthy documents pose a challenge due to the token limits of these models. To address this issue, we introduce and evaluate a two-stage model that combines an extract-then-compress framework. Our model incorporates a “graph-augmented extraction module” to select order-based salient sentences and an “abstractive compression module” to generate concise summaries. Additionally, we introduce the *BioConSumm* dataset, which focuses on biodiversity conservation, to support underrepresented domains and explore domain-specific summarization strategies. Out of the tested models, our model achieves the highest ROUGE-2 and ROUGE-L scores on our newly created dataset (*BioConSumm*) and on the *SUMPUBMED* dataset, which serves as a benchmark in the field of biomedicine.

GCD-TM: Graph-Driven Community Detection for Topic Modelling in Psychiatry Texts
Anusuya Krishnan | Isaias Mehari Ghebrehiwet

Psychiatry texts provide critical insights into patient mental states and therapeutic interactions. These texts are essential for understanding psychiatric conditions, treatment dynamics, and patient responses. However, the complex and diverse nature of psychiatric communications poses significant challenges for traditional topic modeling methods. The intricate language, subtle psychological nuances, and varying lengths of text segments make it difficult to extract coherent and meaningful topics. Conventional approaches often struggle to capture the depth and overlap of themes present in these texts. In this study, we present a novel approach to topic modeling that addresses these limitations by reformulating the problem as a community detection task within a graph constructed from the text corpus. Our methodology includes lemmatization for data standardization, TF-IDF vectorization to create a term-document matrix, and cosine similarity computation to produce a similarity matrix. This matrix is then binarized to form a graph, on which community detection is performed using the Louvain method. The detected communities are subsequently analyzed with Latent Dirichlet Allocation (LDA) to extract topics. Our approach outperforms traditional topic modeling methods, offering more accurate and interpretable topic extraction with improved coherence and lower perplexity.
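
The graph-construction steps described above (TF-IDF vectors, cosine similarity, binarization into a graph, then grouping documents) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' code: the sample documents and the 0.1 threshold are made up, lemmatization and the final LDA step are omitted, and plain connected components are swapped in for the Louvain method to keep the example dependency-free.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(docs, threshold=0.1):
    """Binarize the similarity matrix: connect documents at or above threshold."""
    vectors = tfidf_vectors(docs)
    adjacency = {i: set() for i in range(len(docs))}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Simplified stand-in for Louvain: group documents by connectivity."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(sorted(component))
    return components


docs = [
    "patient reports low mood and poor sleep",
    "poor sleep and low mood persist this week",
    "therapist discussed medication dose changes",
    "medication dose was adjusted by therapist",
]
communities = connected_components(build_graph(docs))
print(communities)
```

On a realistic corpus the binarized graph is usually one large connected component, which is where a modularity-based method such as Louvain becomes necessary; each detected community would then be passed to a topic model.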

SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions
Sameera Horawalavithana | Sai Munikoti | Ian Stewart | Henry Kvinge | Karl Pazdernik

Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored for aligning existing foundation models with scientific disciplines, concepts, and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow multimodal instructions generated from scientific publications. To test our methodology, we train a large multimodal model LLaMA-SciTune that connects a vision encoder and LLM for science-focused visual and language understanding. LLaMA-SciTune significantly outperforms the state-of-the-art models in the generated figure types and captions in SciCap and VisText benchmarks. In comparison to the models that are finetuned with synthetic data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark. Our results demonstrate that human-generated scientific multimodal instructions remain highly valuable in tuning LLMs to perform well on science tasks, despite their lower volume and relative scarcity compared to synthetic data.

RACER: An LLM-powered Methodology for Scalable Analysis of Semi-structured Mental Health Interviews
Satpreet Harcharan Singh | Kevin Jiang | Kanchan Bhasin | Ashutosh Sabharwal | Nidal Moukaddam | Ankit Patel

Semi-structured interviews (SSIs) are a commonly employed data-collection method in healthcare research, offering in-depth qualitative insights into subject experiences. Despite their value, manual analysis of SSIs is notoriously time-consuming and labor-intensive, in part due to the difficulty of extracting and categorizing emotional responses, and challenges in scaling human evaluation for large populations. In this study, we develop RACER, a Large Language Model (LLM) based expert-guided automated pipeline that efficiently converts raw interview transcripts into insightful domain-relevant themes and sub-themes. We used RACER to analyze SSIs conducted with 93 healthcare professionals and trainees to assess the broad personal and professional mental health impacts of the COVID-19 crisis. RACER achieves moderately high agreement with two human evaluators (72%), which approaches the human inter-rater agreement (77%). Interestingly, LLMs and humans struggle with similar content involving nuanced emotional, ambivalent/dialectical, and psychological statements. Our study highlights the opportunities and challenges in using LLMs to improve research efficiency and opens new avenues for scalable analysis of SSIs in healthcare research.

Soft Measures for Extracting Causal Collective Intelligence
Maryam Berijanian | Spencer Dork | Kuldeep Singh | Michael Riley Millikan | Ashlin Riggs | Aadarsh Swaminathan | Sarah L. Gibbs | Scott E. Friedman | Nathan Brugnone

Understanding and modeling collective intelligence is essential for addressing complex social systems. Directed graphs called fuzzy cognitive maps (FCMs) offer a powerful tool for encoding causal mental models, but extracting high-integrity FCMs from text is challenging. This study presents an approach using large language models (LLMs) to automate FCM extraction. We introduce novel graph-based similarity measures and evaluate them by correlating their outputs with human judgments through the Elo rating system. Results show positive correlations with human evaluations, but even the best-performing measure exhibits limitations in capturing FCM nuances. Fine-tuning LLMs improves performance, but existing measures still fall short. This study highlights the need for soft similarity measures tailored to FCM extraction, advancing collective intelligence modeling with NLP.
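
For readers unfamiliar with the Elo rating system mentioned above, the following sketch shows how pairwise preferences can be turned into ratings using the standard Elo update. The item names, the starting rating of 1500, the K-factor of 32, and the list of judgments are all hypothetical; the abstract does not specify how the authors configured their ratings.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical pairwise judgments: (preferred item, other item).
ratings = {"fcm_a": 1500.0, "fcm_b": 1500.0, "fcm_c": 1500.0}
judgments = [("fcm_a", "fcm_b"), ("fcm_a", "fcm_c"), ("fcm_b", "fcm_c")]

for winner, loser in judgments:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

The resulting ratings give a single scalar per item, which can then be correlated with the scores an automatic similarity measure assigns to the same items.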

Hypothesis Generation with Large Language Models
Yangqiaoyu Zhou | Haokun Liu | Tejes Srivastava | Hongyuan Mei | Chenhao Tan

Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation by painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of large language models (LLMs) to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve the quality of hypotheses. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform supervised learning by 12.1% and 11.6% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks.
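
A bandit-inspired reward of the kind mentioned above can be illustrated with a generic upper-confidence-bound (UCB) score: observed accuracy plus an exploration bonus that shrinks as a hypothesis is tested on more examples. This is a sketch of the general technique, not necessarily the paper's exact reward; the function name, the `alpha` weight, and the example statistics are assumptions.

```python
import math


def ucb_reward(num_correct, num_seen, step, alpha=0.5):
    """UCB-style score: accuracy so far plus an exploration bonus."""
    if num_seen == 0:
        return float("inf")  # untested hypotheses are tried first
    accuracy = num_correct / num_seen
    exploration = alpha * math.sqrt(math.log(step) / num_seen)
    return accuracy + exploration


# Hypothetical bookkeeping: hypothesis -> (correct predictions, examples seen).
stats = {
    "hypothesis_a": (8, 10),
    "hypothesis_b": (4, 5),
    "hypothesis_c": (30, 50),
}
step = 65  # total number of examples processed so far

ranked = sorted(stats, key=lambda h: ucb_reward(*stats[h], step), reverse=True)
print(ranked)
```

Here `hypothesis_a` and `hypothesis_b` have the same observed accuracy, but `hypothesis_b` has been evaluated on fewer examples, so it receives the larger bonus and is ranked first. That is the exploration side of the tradeoff.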

Dreaming with ChatGPT: Unraveling the Challenges of LLMs Dream Generation
Harel Berger | Hadar King | Omer David

Large Language Models (LLMs), such as ChatGPT, are used daily for different human-like text generation tasks. This motivates us to ask: Can an LLM generate human dreams? For this research, we explore this new avenue through the lens of ChatGPT, and its ability to generate valid dreams. We have three main findings: (i) ChatGPT-4o, the new version of ChatGPT, generated all requested dreams. (ii) Generated dreams meet key psychological criteria of dreams. We hope our work will set the stage for developing a new task of dream generation for LLMs. This task can help psychologists evaluate patients’ dreams based on their demographic factors.

LLMs and NLP for Generalized Learning in AI-Enhanced Educational Videos and Powering Curated Videos with Generative Intelligence
Naina Chaturvedi

Authors: Naina Chaturvedi, Rutgers University; Ananda Gunawardena, Rutgers University. Contact: cnaina1601@gmail.com or nc832@cs.rutgers.edu
The rapid advancement of Large Language Models (LLMs) and Natural Language Processing (NLP) technologies has opened new frontiers in educational content creation and consumption. This paper explores the intersection of these technologies with instructional videos in computer science education, addressing the crucial aspect of generalization in NLP models within an educational context. With 78% of computer science students utilizing YouTube to supplement traditional learning materials, there’s a clear demand for high-quality video content. However, the challenge of finding appropriate resources has led 73% of students to prefer curated video libraries. We propose a novel approach that leverages LLMs and NLP techniques to revolutionize this space, focusing on the ability of these models to generalize across diverse educational content and contexts. Our research utilizes the cubits.ai platform, developed at Princeton University, to demonstrate how generative AI, powered by advanced LLMs, can transform standard video playlists into interactive, AI-enhanced learning experiences. We present a framework for creating AI-generated video summaries, on-demand questions, and in-depth topic explorations, all while considering the challenges posed by LLMs trained on vast, often opaque datasets. Our approach not only enhances student engagement but also provides a unique opportunity to study how well these models generalize across different educational topics and student needs. Drawing insights from computer science courses at Princeton and Rutgers Universities, we highlight the transformative potential of AI-enhanced videos in promoting active learning, particularly in large classes.
This research contributes to the ongoing dialogue about generalization in NLP while simultaneously demonstrating practical applications in educational technology. By bridging these domains, we aim to establish a shared platform for state-of-the-art generalization testing in NLP within an educational framework. Our findings not only demonstrate how educators can enhance existing video playlists using AI but also provide insights into the challenges and opportunities of using LLMs in educational settings. This work serves as a cornerstone for catalyzing research on generalization in the NLP community, particularly focusing on the application and evaluation of LLMs in adaptive, personalized learning environments.
Keywords: Instructional videos; AI-enhanced learning; Large Language Models (LLMs); Natural Language Processing (NLP); generalization in NLP; computer science education; cubits.ai platform; AI-generated content; interactive video experiences; video summarization; on-demand questions; personalized learning; active learning; data-driven insights; generative AI; educational technology; adaptive learning environments

The Moral Foundations Weibo Corpus
Renjie Cao | Miaoyan Hu | Jiahan Wei | Baha Ihnaini

Moral sentiments expressed in natural language significantly influence both online and offline environments, shaping behavioral styles and interaction patterns, including social media self-presentation, cyberbullying, adherence to social norms, and ethical decision-making. To effectively measure moral sentiments in natural language processing texts, it is crucial to utilize large, annotated datasets that provide nuanced understanding for accurate analysis and model training. However, existing corpora, while valuable, often face linguistic limitations. To address this gap in the Chinese language domain, we introduce the Moral Foundation Weibo Corpus. This corpus consists of 25,671 Chinese comments on Weibo, encompassing six diverse topic areas. Each comment is manually annotated by at least three systematically trained annotators based on ten moral categories derived from a grounded theory of morality. To assess annotator reliability, we present the kappa test results, a gold standard for measuring consistency. Additionally, we apply several of the latest large language models to supplement the manual annotations, conducting analytical experiments to compare their performance and report baseline results for moral sentiment classification.

Why So Serious: Humor and its Association with Treatment Measurements Process and Outcome
Matan Kenigsbuch | Natalie Shapira

Humor is an important social construct with various roles in human communication, yet clinicians remain divided on its appropriateness and effectiveness. Despite its importance, empirical research on humor in psychotherapy is limited. This study explores the theoretical concept of “humor” by examining the operational variable of “laughs” within psychotherapy. Method: We analyzed transcriptions from 872 psychotherapy sessions involving 68 clients treated by 59 therapists. Clients self-reported their symptoms and state of well-being before each session, while both clients and therapists provided self-reports on their therapeutic alliance after each session. Through text analysis, we extracted the number of laughs and words for each session. We investigated the within-client associations between laughs and symptoms, well-being, therapeutic alliance, and clients’ number of words. Results: We found concurrent session-level associations between laughs and well-being, symptoms, and the number of words. However, no significant associations were observed between laughs and the therapeutic alliance, either from the perspective of the therapist or the client.

Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings
Mojtaba Yousefi | Jack Collins

This study examines the alignment of Conference on Computer Vision and Pattern Recognition (CVPR) research with the principles of the “bitter lesson” proposed by Rich Sutton. We analyze two decades of CVPR abstracts and titles using large language models (LLMs) to assess the field’s embracement of these principles. Our methodology leverages state-of-the-art natural language processing techniques to systematically evaluate the evolution of research approaches in computer vision. The results reveal significant trends in the adoption of general-purpose learning algorithms and the utilization of increased computational resources. We discuss the implications of these findings for the future direction of computer vision research and its potential impact on broader artificial intelligence development. This work contributes to the ongoing dialogue about the most effective strategies for advancing machine learning and computer vision, offering insights that may guide future research priorities and methodologies in the field.

Personalized-ABA: Personalized Treatment Plan Generation for Applied Behavior Analysis using Natural Language Processing
Aman Kumar | Mareiko Au | Raj Semlawat | Malavica Sridhar | Hitesh Gurnani

Autism Spectrum Disorder (ASD) is a neurological and developmental disability that affects how an individual learns, communicates, and interacts with others. Applied Behavior Analysis (ABA) is a gold standard therapy for children and adults suffering from ASD to improve their learning, social, and communication skills. Today, 1 in 36 children is diagnosed with ASD, with expectations that this rate will only continue to rise. The supply of certified ABA providers is alarmingly insufficient to meet the needs of children with ASD. In fact, waitlists to receive ABA therapy in the United States exceed 10 months in most states. Clinicians or Board Certified Behavior Analysts (BCBAs) are now experiencing intense bottlenecks around diagnostic evaluations and developing treatment plans quickly enough to support timely access to care. Over the past few years, Artificial Intelligence has changed the way industries operate by offering powerful ways to process, analyze, generate, and predict data. In this paper, we have addressed the problem of both time and supply restrictions faced by ABA providers by proposing a novel method for personalized treatment plan generation and program prediction by leveraging the capabilities of Deep Learning and Large Language Models (LLMs). Additionally, we have introduced two separate models for behavior program prediction (F1-Score: 0.671) and skill acquisition program predictions (ROUGE-1 Score: 0.476) which will help ABA providers in treatment plan implementation. Results are promising: an AI-generated treatment plan demonstrates a high similarity (Average Similarity Score: 0.915) to the original treatment plan written by a BCBA. Finally, as we partnered with a multi-state ABA provider in building this product, we ran a single-blind study that concluded that BCBAs prefer an AI-generated treatment plan 65 percent of the time compared to a BCBA-generated one.

Exploring Scientific Hypothesis Generation with Mamba
Miaosen Chai | Emily Herron | Erick Cervantes | Tirthankar Ghosal

Generating scientifically grounded hypotheses is a challenging frontier task for generative AI models in science. The difficulty arises from the inherent subjectivity of the task and the extensive knowledge of prior work required to assess the validity of a generated hypothesis. Large Language Models (LLMs), trained on vast datasets from diverse sources, have shown a strong ability to utilize the knowledge embedded in their training data. Recent research has explored using transformer-based models for scientific hypothesis generation, leveraging their advanced capabilities. However, these models often require a significant number of parameters to manage long sequences, which can be a limitation. State Space Models, such as Mamba, offer an alternative by effectively handling very long sequences with fewer parameters than transformers. In this work, we investigate the use of Mamba for scientific hypothesis generation. Our preliminary findings indicate that Mamba achieves performance comparable to transformer-based models of similar size on a higher-order complex task like hypothesis generation. We have made our code available here: https://github.com/fglx-c/Exploring-Scientific-Hypothesis-Generation-with-Mamba

Benchmarking Automated Theorem Proving with Large Language Models
Vanessa Lama | Catherine Ma | Tirthankar Ghosal

Theorem proving presents a significant challenge for large language models (LLMs) due to the requirement for formal proofs to be rigorously checked by proof assistants, such as Lean, eliminating any margin for error or hallucination. While existing LLM-based theorem provers attempt to operate autonomously, they often struggle with novel and complex theorems where human insights are essential. Lean Copilot is a novel framework that integrates LLM inference into the Lean proof assistant environment. In this work, we benchmark the performance of several LLMs, including general and math-specific models, for theorem proving using the Lean Copilot framework. Our initial investigation suggests that a general-purpose large model like LLaMa-70B still has an edge over math-specific smaller models for the task under consideration. We provide useful insights into the performance of the different LLMs we chose for the task.

The Grid: A semi-automated tool to support expert-driven modeling
Allegra A. Beal Cohen | Maria Alexeeva | Keith Alcock | Mihai Surdeanu

When building models of human behavior, we often struggle to find data that capture important factors at the right level of granularity. In these cases, we must rely on expert knowledge to build models. To help partially automate the organization of expert knowledge for modeling, we combine natural language processing (NLP) and machine learning (ML) methods in a tool called the Grid. The Grid helps users organize textual knowledge into clickable cells along two dimensions using iterative, collaborative clustering. We conduct a user study to explore participants’ reactions to the Grid, as well as to investigate whether its clustering feature helps participants organize a corpus of expert knowledge. We find that participants using the Grid’s clustering feature appeared to work more efficiently than those without it, but written feedback about the clustering was critical. We conclude that the general design of the Grid was positively received and that some of the user challenges can likely be mitigated through the use of LLMs.

Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogisms
Shi Zong | Jimmy Lin

A huge number of benchmarks have been proposed to evaluate how large language models (LLMs) behave on logic inference tasks. However, it remains an open question how to properly evaluate this ability. In this paper, we provide a systematic overview of prior works on the logical reasoning ability of LLMs for analyzing categorical syllogisms. We first investigate all the possible variations for categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by existing datasets. Our results indicate that compared to template-based synthetic datasets, crowdsourcing approaches normally sacrifice the coverage of configurations (i.e., mood and figure) of categorical syllogisms for more language variations, thus bringing challenges to fully testing LLMs under different situations. We then summarize the findings and observations for the performance of LLMs to infer the validity of syllogisms from the current literature. The error rate breakdown analyses suggest that the interpretation of quantifiers seems to be the current bottleneck that limits the performance of the LLMs and is thus worth more attention. Finally, we discuss several points that might be worth considering when researchers plan to release categorical syllogism datasets. We hope our work will provide a timely review of the current literature regarding categorical syllogisms, and motivate more interdisciplinary research between communities, specifically computational linguists and logicians.
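
To illustrate what "mood and figure" configurations are, the sketch below enumerates all 4 x 4 x 4 moods across the 4 figures (256 configurations) and checks each for validity by brute force over which of the eight term regions are populated. It assumes the modern reading in which universal statements carry no existential import; that choice, and all names in the code, are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product

# Term order in each premise: (subject, predicate). The conclusion is always S-P.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}

# A region records membership in (S, M, P); a model says which regions are populated.
REGIONS = list(product([False, True], repeat=3))
TERM_INDEX = {"S": 0, "M": 1, "P": 2}


def holds(form, subj, pred, populated):
    """Truth of an A/E/I/O statement, with no existential import for A and E."""
    i, j = TERM_INDEX[subj], TERM_INDEX[pred]
    both = any(r[i] and r[j] for r in populated)
    subj_only = any(r[i] and not r[j] for r in populated)
    if form == "A":  # All subj are pred
        return not subj_only
    if form == "E":  # No subj are pred
        return not both
    if form == "I":  # Some subj are pred
        return both
    return subj_only  # "O": Some subj are not pred


def is_valid(mood, figure):
    """Valid iff no model makes both premises true and the conclusion false."""
    major, minor = FIGURES[figure]
    for mask in product([False, True], repeat=len(REGIONS)):
        populated = [r for r, on in zip(REGIONS, mask) if on]
        if (holds(mood[0], *major, populated)
                and holds(mood[1], *minor, populated)
                and not holds(mood[2], "S", "P", populated)):
            return False
    return True


configurations = [("".join(m), f) for m in product("AEIO", repeat=3) for f in FIGURES]
valid = [c for c in configurations if is_valid(*c)]
print(len(configurations), len(valid))
```

A dataset that covers only a handful of these configurations leaves most of the space untested, which is the coverage concern the abstract raises about crowdsourced data.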

Individuation in Neural Models with and without Visual Grounding
Alexey Tikhonov | Lisa Bylinina | Ivan P. Yamshchikov

We show differences between a language-and-vision model CLIP and two text-only models — FastText and SBERT — when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.

CogErgLLM: Exploring Large Language Model Systems Design Perspective Using Cognitive Ergonomics
Azmine Toushik Wasi | Mst Rafia Islam

Integrating cognitive ergonomics with LLMs is crucial for improving safety, reliability, and user satisfaction in human-AI interactions. Current LLM designs often lack this integration, resulting in systems that may not fully align with human cognitive capabilities and limitations. This oversight exacerbates biases in LLM outputs and leads to suboptimal user experiences due to inconsistent application of user-centered design principles. Researchers are increasingly leveraging NLP, particularly LLMs, to model and understand human behavior across social sciences, psychology, psychiatry, health, and neuroscience. Our position paper explores the need to integrate cognitive ergonomics into LLM design, providing a comprehensive framework and practical guidelines for ethical development. By addressing these challenges, we aim to advance safer, more reliable, and ethically sound human-AI interactions.