AstroLLaMA: Towards Specialized Foundation Models in Astronomy

Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than LLaMA-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions, and extracts higher-quality embeddings, than state-of-the-art foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.


Introduction
The advent of Large Language Models (LLMs) has sparked interdisciplinary interest thanks to a confluence of factors: the accumulation of massive datasets, leaps in computational power, and breakthroughs in neural architectures. Flagship models like GPT-4 (OpenAI, 2023), PaLM (Chowdhery et al., 2022) and LLaMA (Touvron et al., 2023; Meta, 2023) have exhibited exceptional versatility in a variety of tasks from logical reasoning and comprehension to creative writing, often accomplished via methods like prompting, fine-tuning, and human-in-the-loop reinforcement learning.
The astronomy discipline presents both a unique challenge and a fertile ground for the application of LLMs. First, the corpus of scholarly texts in astronomy likely constitutes but a minuscule portion of the data on which generic LLMs are trained, resulting in limitations such as hallucinations in favor of more "generic" responses. Second, the nature of astronomical research often involves cross-disciplinary insights due to universally applicable physical processes. When well-curated, LLMs could meaningfully assist in hypothesis generation.
Existing studies based on in-context prompting and instruction learning, primarily involving GPT-4, have already demonstrated significant potential for generating substantive hypotheses (Ciucȃ and Ting, 2023; Ciucȃ et al., 2023). Further, the astronomy community's "open sky" policy, which grants public access to the majority of its datasets either immediately or after a brief proprietary period (Almeida et al., 2023; Fabricius et al., 2021), pairs well with the wealth of resources available in archives like NASA's Astrophysics Data System (Accomazzi et al., 2015; Borgman and Wofford, 2021). Such an open-access policy can facilitate deep engagement with the astronomical literature.
Despite their general capabilities, LLMs frequently lag behind specialized, smaller models in domain-specific applications. This disparity stems from two primary factors: (i) the eclectic nature of the training datasets, which dilutes the focus on specialized subjects, and (ii) the design ethos of LLMs as "foundation models" meant for subsequent fine-tuning tailored to specific tasks. The existing landscape for fine-tuned LLMs in astronomy remains limited, however. To our knowledge, the only existing specialized model is astroBERT (Grezes et al., 2021), which has 110 million parameters and was trained on nearly 400,000 ADS papers. But as a non-generative model, the utility of astroBERT remains limited to discriminative tasks. Motivated by these gaps, we present AstroLLaMA, a state-of-the-art generative language model fine-tuned from LLaMA-2. Our model leverages a corpus of 300,000 astronomy abstracts from arXiv and boasts an architecture approximately 67 times larger than that of astroBERT. AstroLLaMA aspires to build upon astroBERT's foundation by offering improved performance in generating specialized information.

AstroLLaMA
In this section, we discuss AstroLLaMA's implementation, focusing on the curation of its dataset, base model architecture, and fine-tuning settings.

Dataset
We derive our dataset from the arXiv repository, available on Kaggle.† Our curated subset focuses on papers classified under the astrophysics category (astro-ph), resulting in a collection of 326,238 articles spanning from April 1992 to July 2023. We extract these papers' abstracts to form a corpus consisting of approximately 95 million tokens. The median length of these abstracts is 291 tokens. To enable effective model evaluation, we randomly designate 20% of this curated dataset for testing.

† https://www.kaggle.com/Cornell-University/arxiv
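The curation step above can be sketched in a few lines of Python. This is an illustrative, stdlib-only sketch, not our exact pipeline: the field names ("categories", "abstract") follow the public arXiv metadata schema on Kaggle, the file path is a placeholder, and the random seed is arbitrary.

```python
import json
import random

def load_astro_abstracts(metadata_path):
    """Filter the Kaggle arXiv metadata dump (one JSON record per line)
    down to abstracts of papers tagged with the astro-ph category."""
    abstracts = []
    with open(metadata_path) as f:
        for line in f:
            record = json.loads(line)
            if "astro-ph" in record.get("categories", ""):
                abstracts.append(record["abstract"].strip())
    return abstracts

def train_test_split(items, test_frac=0.2, seed=42):
    """Randomly hold out a fraction of the corpus for evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    return items[n_test:], items[:n_test]
```

Applied to the full corpus, `train_test_split` with `test_frac=0.2` reproduces the 80/20 train/test partition described above.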

Base Model
Our base model is LLaMA-2, a 6.7-billion-parameter model developed by Meta (Meta, 2023). Originally trained on a corpus containing 2 trillion tokens, LLaMA-2 features a context window of 4,096 tokens. For tokenization, the model employs a byte-pair encoding strategy (Sennrich et al., 2016; Kudo and Richardson, 2018), incorporating a vocabulary set of 32,000 unique tokens.

Fine-tuning Settings
For the fine-tuning phase, we rely on our curated training set described in Section 2.1, which includes 77 million tokens. Special [BOS] (Beginning Of Sequence) and [EOS] (End Of Sequence) tokens are prepended and appended to each training sequence. These sequences are then concatenated and divided into fixed-length chunks, each comprising 512 tokens.
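The wrap-concatenate-chunk procedure can be sketched as follows. This is a minimal illustration on raw token-id lists; the token ids for [BOS]/[EOS] (1 and 2, following LLaMA's convention) are assumed, and in practice the tokenizer handles this step.

```python
BOS, EOS = 1, 2      # LLaMA's special-token ids (assumed)
CHUNK_LEN = 512

def pack_sequences(tokenized_abstracts, chunk_len=CHUNK_LEN):
    """Wrap each tokenized abstract in [BOS]/[EOS], concatenate into one
    stream, and slice into fixed-length training chunks. A trailing
    remainder shorter than chunk_len is dropped."""
    stream = []
    for tokens in tokenized_abstracts:
        stream.append(BOS)
        stream.extend(tokens)
        stream.append(EOS)
    return [stream[i:i + chunk_len]
            for i in range(0, len(stream) - chunk_len + 1, chunk_len)]
```

Packing sequences this way avoids padding waste: every position in every 512-token chunk contributes to the causal language modeling loss.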
The fine-tuning process follows the causal language modeling objective employed during the model's pre-training phase. We use the AdamW optimizer (Loshchilov and Hutter, 2018) with hyperparameters β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁵ and a batch size of 32. The learning rate follows a cosine schedule with a linear warmup to a peak value of 3 × 10⁻⁴ in the first 10% of the optimization steps, decaying to a final learning rate of 10% of the peak. Additional settings include weight decay and gradient clipping values of 0.1 and 1.0, respectively.
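The learning-rate schedule just described (linear warmup over the first 10% of steps to 3 × 10⁻⁴, then cosine decay to 10% of the peak) can be written down explicitly. This is a sketch of the schedule's shape, not the trainer code itself:

```python
import math

def learning_rate(step, total_steps, peak=3e-4,
                  warmup_frac=0.10, final_ratio=0.10):
    """Linear warmup to `peak` over the first `warmup_frac` of steps,
    then cosine decay down to `final_ratio * peak`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak * step / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = final_ratio * peak                       # 10% of the peak
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup the rate equals the peak exactly, and at the final step the cosine term vanishes, leaving the 3 × 10⁻⁵ floor.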
We fine-tune LLaMA-2 over nearly three epochs, corresponding to about 230 million processed tokens, using four NVIDIA A100 GPUs, each equipped with 40 GB of VRAM. To maximize resource efficiency, we employ 4-bit quantization and utilize LoRA, a technique based on low-rank matrix decomposition (Hu et al., 2021). We set LoRA's hyperparameter α and dropout rate to 32 and 0.05, respectively. The entire process is facilitated through the Hugging Face Python library.
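The low-rank update at the heart of LoRA can be illustrated in a framework-free sketch. This toy class is not our training code (which uses the Hugging Face ecosystem); the dimensions and the initialization constant for A are hypothetical, but the structure, y = Wx + (α/r)·B(Ax) with W frozen and B zero-initialized, is the technique of Hu et al. (2021).

```python
def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class LoRALinear:
    """Minimal LoRA adapter around a frozen weight matrix W (d_out x d_in):
    y = W x + (alpha / r) * B (A x), with A (r x d_in) and B (d_out x r).
    B starts at zero, so fine-tuning begins exactly at the base model."""
    def __init__(self, W, r, alpha=32.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen base weights
        self.A = [[0.01] * d_in for _ in range(r)]    # trainable (toy init)
        self.B = [[0.0] * r for _ in range(d_out)]    # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))    # rank-r correction
        return [b + self.scale * u for b, u in zip(base, update)]
```

Because only A and B are trained, the number of trainable parameters scales with r·(d_in + d_out) rather than d_in·d_out, which is what makes fine-tuning a 7-billion-parameter model feasible on four 40 GB GPUs alongside 4-bit quantization of the frozen weights.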

Fine-Tuning Evaluation
Fig. 1 depicts the performance of AstroLLaMA during its fine-tuning phase. Here, we present perplexity, a commonly used metric for evaluating causal language models. Perplexity is defined as the exponentiation of the training loss, with lower values indicating a better fit.

Original abstract:
The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140° of the southern sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been conclusively identified. This stellar stream would reveal the distance and 6D kinematics of the MS, constraining its formation and the past orbital history of the Clouds. We have been conducting a spectroscopic survey of the most distant and luminous red giant stars in the Galactic outskirts. From this dataset, we have discovered a prominent population of 13 stars matching the extreme angular momentum of the Clouds, spanning up to 100° along the MS at distances of 60-120 kpc. Furthermore, these kinematically-selected stars lie along a [α/Fe]-deficient track in chemical space from -2.5 < [Fe/H] < -0.5, consistent with their formation in the Clouds themselves. We identify these stars as high-confidence members of the Magellanic Stellar Stream. Half of these stars are metal-rich and closely follow the gaseous MS, whereas the other half are more scattered and metal-poor. We argue that the metal-rich stream is the recently-formed tidal counterpart to the MS, and speculate that the metal-poor population was thrown out of the SMC outskirts during an earlier interaction between the Clouds. The Magellanic Stellar Stream provides a strong set of constraints - distances, 6D kinematics, and birth locations - that will guide future simulations towards unveiling the detailed history of the Clouds.

Completed by GPT-4:
The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140° of the southern sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been conclusively identified. This stellar stream would reveal the distance and 6D kinematics of the MS, constraining its formation and the past orbital history of the Clouds. We propose a novel approach to search for the elusive stellar counterpart of the Magellanic Stream. By employing advanced photometric and spectroscopic techniques, combined with state-of-the-art data mining and machine learning algorithms, we aim to systematically analyze and cross-reference multi-wavelength survey data to identify potential candidate stars belonging to the hypothesized stellar stream. Our methodology involves leveraging the distinctive chemical signatures and proper motions associated with stars originating from the Magellanic Clouds, allowing us to differentiate them from the foreground and background contaminants. If successful, this discovery would not only provide essential insights into the origin and evolution of the Magellanic Stream, but also enhance our understanding of the interaction between the Magellanic Clouds and the Milky Way. Furthermore, the proposed approach establishes a powerful framework for investigating other unresolved galactic structures and tracing their underlying dynamical histories.

Completed by LLaMA-2:
The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140° of the southern sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been conclusively identified. This stellar stream would reveal the distance and 6D kinematics of the MS, constraining its formation and the past orbital history of the Clouds. We find a stellar stream in the MS, with a 6D velocity of 100 km s⁻¹, and a distance of 100 kpc. The stream is 100 pc wide and 1000 pc long, and is consistent with the MS. The stream is 1000 times more massive than the Magellanic Clouds, and is likely to be a tidal stream from the Large Magellanic Cloud.

Completed by AstroLLaMA:
The Magellanic Stream (MS) - an enormous ribbon of gas spanning 140° of the southern sky trailing the Magellanic Clouds - has been exquisitely mapped in the five decades since its discovery. However, despite concerted efforts, no stellar counterpart to the MS has been conclusively identified. This stellar stream would reveal the distance and 6D kinematics of the MS, constraining its formation and the past orbital history of the Clouds. We 1) present a new method to select MS candidate stars from wide field optical imaging, 2) apply this method to the MS, 3) compare the results to existing catalogs, and 4) identify the most promising MS candidate stars. We apply our method to 2547 stars in the Gaia EDR3 catalog that have Gaia proper motions consistent with the MS and use these as a pilot sample to optimize our method. We then apply our method to the entire Gaia EDR3 catalog to identify 46,000 stars that are candidates for the MS, and we use the Gaia-ESO Survey to identify 13,000 of these stars that are likely to be MS candidates. We identify 266 stars with (G-K) colours and [Fe/H] that are consistent with previous estimates of the MS. These stars have proper motions and radial velocities that are consistent with the MS, and we place them on a 3D map of the MS. We find a previously unknown stellar stream, which we call the Southeastern Stream (SES), in the direction of the Clouds and MS. The SES has an M31 distance of 32.2+/-0.5 kpc and a 6D velocity of 136+/-10 km/s. The SES is about 20 times brighter in stars than the MS and about 40 times brighter in stars than the SMC. The SES is also 2.7 times more metal-rich than the MS, which suggests that it is a separate population.

Figure 2: Completion of an abstract from the arXiv database (ID: 2306.15719) using three different models: GPT-4, LLaMA-2, and AstroLLaMA. Each model is prompted with the same short text snippet, highlighted in their respective boxes. GPT-4 tends to produce more generic statements, lacking domain-specific nuance. AstroLLaMA demonstrates the most robust completion, offering more relevant concepts and deeper insights specific to the field of astronomy, thus significantly outperforming LLaMA-2 and GPT-4.
Our initial observations reveal that LLaMA-2 performs suboptimally on our dataset, with an average perplexity close to 10. By the conclusion of the third epoch, AstroLLaMA achieves an average perplexity of 6.55. This represents a 32.5% reduction in perplexity compared to the base LLaMA-2 model, signifying a substantial improvement in the model's predictive accuracy.
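The perplexity metric used throughout this evaluation is simply the exponential of the mean per-token negative log-likelihood (i.e., of the training loss):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood).
    `token_nlls` holds the cross-entropy loss of each predicted token."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

As a sanity check, a model that assigns uniform probability over a 32,000-token vocabulary has per-token loss log(32000) and hence perplexity 32,000; a perfect model (zero loss) has perplexity 1. AstroLLaMA's value of 6.55 means its predictions are, on average, as good as choosing uniformly among about 6.5 tokens.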

Results
As illustrated in the previous section, AstroLLaMA outperforms its non-fine-tuned counterpart, LLaMA-2, in terms of context-awareness during token prediction within astronomy abstracts. To delve deeper into the advantages of fine-tuning, we assess AstroLLaMA's general abilities in two key aspects: text generation and embedding space quality. We compare its performance against multiple models, including LLaMA-2, GPT-4 and GPT-3 (ada-002), to provide a comprehensive evaluation.
Regarding text generation, we task AstroLLaMA, LLaMA-2 and GPT-4 with completing various astronomy-related abstracts, an example of which is presented in Fig. 2. Each model is given the first few sentences of an abstract as a prompt, allowing us to gauge its ability to comprehend the context and generate a meaningful continuation. For GPT-4, we utilize ChatGPT and specifically prompt it to limit the completion to a single paragraph. AstroLLaMA and LLaMA-2 are deployed using standard sampling methods, with the temperature set to 0.3 and a maximum of 1,024 new tokens. We find that altering the temperature setting does not substantively improve LLaMA-2's results.
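The standard temperature-sampling step used here can be made concrete. This is a generic, stdlib-only sketch of one decoding step (divide logits by the temperature, softmax, draw a token), not the Hugging Face generation code itself; the logit values are made up.

```python
import math
import random

def sample_next_token(logits, temperature=0.3, rng=None):
    """One step of temperature sampling: scale logits by 1/temperature,
    apply a numerically stable softmax, and draw one token id."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs
```

Lower temperatures sharpen the distribution toward the most likely token, which is why a conservative setting of 0.3 keeps completions on-topic while still allowing some variation.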
Our observations largely echo the patterns depicted in Fig. 2. LLaMA-2 often deviates from the intended context after generating only a short and often off-topic continuation, resulting in inferior completions. While GPT-4 produces more coherent text, its responses are too generic to capture the nuanced understanding required in the astronomy domain. Even when explicitly prompted to focus on astronomy-related topics, GPT-4's generated text remains largely off-target or generically applicable rather than domain-specific.
In stark contrast, AstroLLaMA exhibits remarkable context-awareness in its completions, showing a deep understanding of astronomical concepts. For example, in Fig. 2, AstroLLaMA comprehends that an effective search for stars in the Magellanic Stream involves a three-step process: initial wide-field imaging, followed by refinement using astrometric data from Gaia, and then further curation with spectroscopic data. The model also understands that Gaia-ESO is surveying the southern sky and hence can observe (part of) the Magellanic Stream. It also demonstrates nuanced knowledge of the Magellanic Stream, understanding the importance of bifurcation within the stream. As a result, it appropriately completes the text by discussing the southeastern stream and exploring metallicity differences to ascertain their origins.
Regarding embedding space quality, we assess the models' ability to reflect semantic similarities among astronomy texts. We randomly choose 10,000 abstracts from our dataset and embed them using AstroLLaMA and GPT-3. Specifically, we use OpenAI's API to invoke the text embedding function for GPT-3 (ada-002). To get text embeddings from AstroLLaMA, we pass an input through the model and extract its final hidden states, which contain embeddings for all tokens in the input. Then, we omit the [BOS] token and take the average of all other tokens' embeddings to get the final result. Finally, for each pair of abstracts we calculate the cosine similarity (the normalized dot product) of their vector embeddings.
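The pooling and comparison steps just described can be sketched as follows. This is a minimal illustration on plain Python lists; in practice the hidden states come from the model's forward pass.

```python
import math

def mean_pool(hidden_states):
    """Average the final-layer hidden states of all tokens except the
    first ([BOS]) to obtain a single abstract-level embedding."""
    tokens = hidden_states[1:]                    # drop the [BOS] token
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

def cosine_similarity(u, v):
    """Normalized dot product between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Each abstract is thus reduced to one vector via `mean_pool`, and every pair of abstracts is scored with `cosine_similarity`, yielding the pairwise distributions analyzed in Fig. 3.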
The top panel of Fig. 3 presents the distribution of these pairwise similarities for the two embedding methods. We find that the embeddings by GPT-3 are overly generic, with similarities clustering around relatively high values of 0.7-0.9, suggesting a lack of discriminative power (most papers are embedded very similarly). AstroLLaMA's embeddings, on the other hand, exhibit much higher variance within each bin. This suggests that our fine-tuned model is more adept at representing the specialized semantic variance inherent to the field of astronomy, which may enable a more granular representation of astronomical content and can facilitate better document retrieval and semantic analysis.
The bottom panel of Fig. 3 provides two representative examples where AstroLLaMA and GPT-3 classifications diverge. In the first example, GPT-3 fixates on the keyword 'magnetized,' resulting in an inflated similarity score, despite the contexts being markedly different. AstroLLaMA, on the other hand, successfully distinguishes between these disparate contexts. In the second example, AstroLLaMA accurately identifies that the study of Spitzer is closely related to star formation. GPT-3, however, fails to make this connection due to the absence of matching keywords.

Limitations and Future Directions
In this work, we introduce AstroLLaMA, a 7-billion-parameter language model fine-tuned on a dataset encompassing over 300,000 abstracts of astronomical research papers. Compared to its base model, LLaMA-2, and even GPT-4, a current state-of-the-art general LLM, AstroLLaMA exhibits marked improvements in generating high-quality abstracts with a competent grasp of relevant information in this literature. AstroLLaMA is not without limitations, nevertheless. The most salient is the model's knowledge gaps in certain areas of astronomy: in Fig. 2, AstroLLaMA's estimation of potential star candidates from Gaia-ESO data is notably inaccurate. To address such issues, we are in the process of enriching AstroLLaMA's training set with not just abstracts but the full LaTeX sources of existing astronomy articles, thereby expanding the token count by approximately two orders of magnitude. Another concern lies in the model's tendency to generate hallucinated or fictitious numerical data, an issue likely attributable to our focus on reducing perplexity rather than explicitly steering the model towards factual accuracy. The release of AstroLLaMA aims to facilitate community engagement, both for addressing these inaccuracies and for refining its balance between "faithfulness" (respecting scientific evidence and accuracy) and "creativity" (being able to come up with interesting hypotheses).
AstroLLaMA stands as a compelling prototype for specialized LLMs in astronomy, showing superior context-aware capabilities compared to GPT-4 despite having far fewer parameters. It not only paves the way for improved performance in tasks like question-answering, scientific summarization and hypothesis generation, but also extends to multi-modal models (Liu et al., 2023). We have made AstroLLaMA's weights and its training data publicly available† for researchers interested in leveraging LLMs for astronomy-centric applications. Along with this, we are establishing various "playgrounds" on Hugging Face to invite interested readers to further adapt and refine this robust starting point for a variety of relevant downstream tasks.

Figure 1 :
Figure 1: Learning curve of AstroLLaMA during its fine-tuning on the arXiv astrophysics dataset. The figure tracks the evolution of perplexity, a measure of the model's next-token prediction performance. The light blue curve shows the training perplexity at each AdamW update step, while the dark black curve provides a smoothed average taken over 10-step intervals.

Figure 3 :
Figure 3: Top: Distribution of pairwise cosine similarities among 10,000 randomly selected abstracts from our corpus, divided into 10 equal bins based on similarity levels from GPT-3. Bottom: Two representative examples illustrating divergent cosine similarity values when comparing AstroLLaMA and GPT-3 embeddings.