SAINE: Scientific Annotation and Inference Engine of Scientific Research

We present SAINE, a Scientific Annotation and Inference ENgine built on a set of standard open-source software packages, such as Label Studio and MLflow. We show that our annotation engine can support the development of more accurate classifications. Building on our previous work on hierarchical discipline classification, we demonstrate the application of SAINE to understanding the space of scholarly publications. A user study of our annotation results shows that user input collected with the help of our system can help us better understand the classification process. We believe that our work will help foster greater transparency and a better understanding of scientific research. Our annotation and inference engine can further support downstream meta-science projects. We welcome collaboration and feedback from the scientific community on these projects. The demonstration video can be accessed at https://youtu.be/yToO-G9YQK4. A live demo website is available at https://app.heartex.com/user/signup/?token=e2435a2f97449fa1 upon free registration.


Introduction
A precise classification of publications across and within disciplines is key not only for fast and comprehensive search that guides researchers to relevant material, but also for identifying the novelty of research, the standing and significance of scholars, and the relative growth of fields of work.
Machine learning is developing into not merely one but the customary approach to establishing such a classification. Clearly, one would expect a search geared towards identifying a high-quality corpus of keywords to benefit crucially from supervision. Existing classifications of academic output are based on a blend of (supervised) author-chosen and (unsupervised) machine-chosen keyword lists, where the composition of the blend is unknown to the researcher.
Prevailing systems of keywords for academic publications are lists based on abstracts in a discipline, field, and subfield, distilled from:
• unsupervised machine learning (from word or phrase frequencies);
• supervised learning (mostly from keyword self-reporting by authors);
• semi-supervised learning (a mixture of the two; e.g., as done by the Microsoft Academic Graph (MAG) described in Sinha et al. (2015) and Wang et al. (2019, 2020)).
For designing an annotation and inference engine that helps establish a classification system of scientific publications, one would aim to develop a tool with the following features: (1) a simple user interface with clear annotation instructions; (2) a reproducible pipeline across various disciplines; (3) good support for inference tailored to downstream tasks (e.g., model retraining) in meta-science studies.
Among the existing open-source annotation tools, Label Studio (Tkachenko et al., 2020-2022) suits those needs. Note that Gayoso-Cabada et al. (2019) have extensively reviewed annotation tools that facilitate classification tasks. However, the reviewed tools are either not open-source or are domain-specific and hence do not share the aforementioned targeted features.
In this system demonstration, we utilize a set of standard open-source software, mainly Label Studio (Tkachenko et al., 2020-2022), MLflow, and FastAPI, to configure an annotation and inference engine for scientific publication annotation. We illustrate the benefit of using supervised learning based on pre-established keyword lists and abstracts, and show how annotators can help us better understand the importance of supervised learning in establishing a classification of academic publications.
This system is built on top of the hitherto largest-scale multi-class hierarchical classification study across all academic research disciplines in both single-label and multi-label settings (cf. Rao et al. (2023)). There, we built a supervised hierarchical classification system that associates every publication with at least one, and potentially several, disciplines, fields, and subfields.
With the annotations above, we conduct a small user study with domain experts using our annotation engine. We then invoke our inference engine to fine-tune the base models in Rao et al. (2023). The comparison between the base and fine-tuned models shows that the proposed annotation and inference system can benefit the development of more accurate classifications.
To summarize, this paper presents a scientific annotation and inference engine called SAINE, which is based on open-source software such as Label Studio and MLflow. The main contributions are: (1) a demonstration of using SAINE to understand the space of scholarly publications, particularly for hierarchical discipline classification; (2) a user study showing that user input collected with the help of SAINE can help better understand the classification process; (3) evidence, from the comparison between base and fine-tuned models, that SAINE can benefit the development of more accurate classifications; (4) the potential of SAINE to support downstream meta-science projects and to foster greater transparency and understanding of scientific research.
Overall, the paper demonstrates the benefits of supervised learning and the importance of a simple user interface with clear annotation instructions, reproducible pipelines, and good inference support for scientific publication annotation. The live demo website and demonstration video are available for those interested in further exploring SAINE. The codebase for development is publicly available under this link and collocates with the codebase of Rao et al. (2023).
In Figure 1 we illustrate the workflow in SAINE by assigning the roles of "Administrator", "Annotators", "Label Studio", and "MLflow" to each task in the pipeline. The remaining sections are organized as follows. Section 2 introduces the functionality of Label Studio, its fit to our annotation needs, and our annotation guidelines for experts. Section 3 specifies the annotation design for the field of Economics and discusses the annotation results. Section 4 discusses the integration of annotation results into the pre-trained base models and the fine-tuned ones with MLflow. We devote Section 5 to preliminary experiments on improving annotation efficiency. We then conclude this system demonstration with a discussion of system limitations, ethics, and broader impact.

Annotation Engine: Label Studio

Label Studio also provides integration with various machine learning (ML) models. Although we do not use the integrated ML functions, Label Studio allows us to export the annotation results in JSON, with which we improve the classification models using the annotated data in the inference engine. Overall, Label Studio offers a powerful and customizable annotation platform that can handle the relevant annotation tasks, facilitate efficient collaboration among experts, and efficiently compute inter-annotator agreement (IAA).
The project manager uses an administrative panel (Figure 4 in Appendix B) to assign annotation tasks to each registered annotator and can monitor the annotation progress.The manager can also adjust the assigned annotations based on individual progress, as well as inspect tasks by annotation progress and IAA metrics.

Annotation Guidelines
When a publication is annotated, each annotator is provided with the abstract, the keywords offered by MAG, and the assigned category based on the keywords provided by MAG. The categories of a discipline classification (such as the Journal of Economic Literature (JEL) classification in Economics) are assigned to MAG publications on the basis of the keywords. Therefore, MAG's keywords help us identify potential misalignments and better understand the classifiers we built.
The annotation samples provided in the annotation engine are stratified-sampled (sampling ratio: 2e-5) across all classes of the training set introduced by Rao et al. (2023) for one discipline. Each annotator is required to judge whether a category is correctly assigned to an abstract. If not, the annotator is required to select the suitable one from a predefined list. The annotator is also required to evaluate the MAG-generated keywords and make corrections (by removing unqualified keywords and marking suitable keywords from the abstract). Figure 2 shows two annotations of one publication. Label Studio makes it easy to navigate among the annotations generated by various annotators on an identical instance. Note that, as discussed in Rao et al. (2023), our multi-class hierarchical classification system is modularized in both single-label and multi-label settings. The current annotation engine is equipped with both annotation functionalities. For the sake of the system demonstration and the user study in Section 3, we discuss the single-label setting. More details on the multi-label setting are provided in Appendix C.
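
To make this concrete, the following is a minimal sketch of how such a single-label annotation project could be created programmatically with the Label Studio SDK; the labeling configuration, server URL, API key, and category list are illustrative placeholders rather than our production setup.

```python
# Minimal sketch (assumed setup): create a single-label annotation project
# via the Label Studio SDK. URL, API key, and the category list are
# placeholders, not the actual SAINE configuration.
from label_studio_sdk import Client

LABEL_CONFIG = """
<View>
  <Text name="abstract" value="$abstract"/>
  <Header value="Is the MAG-assigned category correct?"/>
  <Choices name="verdict" toName="abstract" choice="single">
    <Choice value="Agree"/>
    <Choice value="Disagree"/>
  </Choices>
  <Choices name="category" toName="abstract" choice="single">
    <Choice value="Macroeconomics"/>  <!-- one Choice per JEL category -->
    <Choice value="Public Economics"/>
  </Choices>
</View>
"""

ls = Client(url="http://localhost:8080", api_key="<your-token>")
project = ls.start_project(
    title="SAINE ECON single-label",
    label_config=LABEL_CONFIG,
)
project.import_tasks([{"abstract": "We study monetary policy ..."}])
```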

Implementation: User Study in Economics
We now use Economics as a discipline to show how we utilize the annotation engine to collect expert annotations.

Annotation Design
We invited three expert economists from the Chair of Applied Economics at ETH Zurich to join the annotation project by accessing this link. Annotation guidelines are given here. Of the three experts, one annotated all provided instances (Annotator 1), one annotated 10% of the instances (Annotator 2), and one annotated a subset of instances with an ex ante denomination in Urban and Spatial Economics only (Annotator 3). Each annotator received a user panel like the one in Figure 5 in Appendix B.

Annotation Results in Label Studio
Altogether, 788 instances of abstracts and keywords from MAG had to be annotated for single-label classification. In Economics, a standardized field and subfield system with keywords exists: the Journal of Economic Literature (JEL) classification system. This system is known to all academic economists and serves as a guiding principle to associate an article or a topic with a specific subfield of Economics. The subfields in the JEL categories are associated with keywords.
We report the annotation time and IAA scores that are automatically calculated by Label Studio (see the official documentation for the steps). The final task agreement score is calculated by averaging the IAA scores over all annotation pairs. Table 1 shows the IAA scores among the three experts. Annotators 1, 2, and 3 annotated 788, 181, and 99 instances, respectively. The annotation overlap between the pairs of annotators is 4 instances, or 7% of the overlapping instances (Annotators 2 and 3); 99 instances, or 100% (Annotators 1 and 3); and 181 instances, or 100% (Annotators 1 and 2). The median annotation time per instance of Annotators 1-3 was 17.7s, 29.8s, and 40.9s, respectively. The annotators were entitled to disapprove of the MAG-assigned category upon suggesting an alternative category. Marking and filling in missing keywords is time-consuming; reading the MAG-generated keywords can speed up annotation to some extent. However, all annotators reported that the MAG-provided keywords could be a source of error for wrongly assigned categories. As discussed among the annotators after they completed the annotations separately, category assignment worked best for Mathematical & Quantitative Methods and worst for Macroeconomics and Public Economics.
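
Label Studio computes these agreement scores internally; purely to illustrate the averaging described above, a simplified re-implementation might look as follows (Label Studio's actual matching function is configurable and may differ).

```python
from itertools import combinations

def pairwise_agreement(a: dict, b: dict):
    """Fraction of identically labeled instances among those both annotated.

    a and b map instance IDs to the chosen category label.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no overlap: agreement undefined for this pair
    return sum(a[i] == b[i] for i in shared) / len(shared)

def task_agreement(annotators: list[dict]) -> float:
    """Average all defined pairwise IAA scores into one task-level score."""
    scores = [pairwise_agreement(x, y) for x, y in combinations(annotators, 2)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)
```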

Inference Engine: Incorporating Annotation Results into the Existing Classification Pipeline
We illustrate the pipeline using the discipline of Economics as discussed in Section 3.

Post-processing of Annotation Results
We downloaded the annotation results of all experts in JSON and post-processed them following the protocol below, before feeding them into the pre-trained base models of the various neural networks discussed in Rao et al. (2023). In total, we obtained 1,068 partly overlapping annotations (incl. "Skip", "(Dis)agree", keywords, and added categories). The numbers of instances of "Agree", "Disagree", and "Not ECON" are 498, 297, and 268, respectively.
The post-processing procedure is structured as follows.
(1) We removed abstracts that were incorrectly classified as belonging to Economics from the sample (206 of 788 instances). Additionally, we deleted 5 instances due to unusable annotations, e.g., no annotator labeled the sample (all chose "Skip"), or an annotator chose "Disagree" but did not select a new category.
(2) For each remaining instance, we counted the percentages of "Agree" and "Disagree" verdicts relative to the label generated on the basis of MAG keywords. If strictly more experts agreed than disagreed with MAG, the original label was preserved (351 of the 577 valid instances). Otherwise, we took the label suggested by the majority of annotating experts (226 of the 577 valid instances).
(3) In the case of ties, we randomly picked a label from the suggested annotations (22 of the 226 category-renewed instances).
Following this protocol, we obtained 561 instances with expert-curated labels to fine-tune the base models.
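
As a minimal sketch under an assumed data layout (not our actual JSON export schema), steps (2)-(3) of the protocol can be expressed as follows.

```python
import random
from collections import Counter

def resolve_label(mag_label: str, verdicts: list[str], suggestions: list[str]) -> str:
    """Resolve one instance's label from expert verdicts.

    verdicts: one "Agree"/"Disagree" per annotator, relative to the MAG label.
    suggestions: alternative categories chosen by disagreeing annotators.
    """
    if verdicts.count("Agree") > verdicts.count("Disagree"):
        return mag_label                      # step (2): strict majority agrees
    counts = Counter(suggestions)
    top = max(counts.values())
    tied = [cat for cat, n in counts.items() if n == top]
    return random.choice(tied)                # step (3): break ties at random
```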

Fine-tuning Pre-trained Base Models
We used the 561 expert-generated labels as a fine-tuning set for the base models reported in Rao et al. (2023) for the discipline of Economics (model-1). We compared the inference performance of the base model (Model in Table 2) with that of the fine-tuned model (Model_FT in Table 2) across various neural network architectures: Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and Transformer. To benchmark the performance differences between Model and Model_FT, we created a small test set from the Social Science Research Network (SSRN), a website that provides a platform for researchers to share and distribute their research papers and other scholarly work in the social sciences and related fields. We decided to use the Economics SSRN publications because they come with human-curated JEL categories, keywords, and abstracts.
Concretely, we built a crawler to download the Economics publication space on SSRN, where all contained research articles in Economics are multi-category-indexed. That is, each publication there is indexed by at least one JEL code, and multiple JEL codes per publication are allowed. In principle, we could easily validate with our multi-label engine, but we focus on single-label classification for this user study.
To create this test set, we randomly sampled 10 instances from each of the 19 JEL field classes, which resulted in a sample of 190 test instances. In the implementation of hierarchical classification reported in Rao et al. (2023), we used MLflow to track and manage ML experiments, with which we saved all pre-trained base models. Based on them, we can seamlessly integrate model fine-tuning and inference with various models. The inference engine API is implemented using FastAPI with help from Pydantic. We illustrate the batch inference API in Figure 3, with which users can feed the test set into various models (base or fine-tuned) and obtain predictions. In Appendix D we provide more details about the inference engine.
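
For illustration, a stripped-down version of such an endpoint, assuming models are loadable from the MLflow registry, could look as follows; the route name matches Appendix D, but the request schema and model URIs are our assumptions.

```python
# Sketch of a batch inference endpoint (assumed schema, not the exact SAINE API).
import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class BatchRequest(BaseModel):
    model_uri: str        # e.g. "models:/econ-cnn/1" -- hypothetical registry URI
    texts: list[str]      # abstracts to classify

@app.post("/batch_inference_by_model")
def batch_inference_by_model(req: BatchRequest):
    model = mlflow.pyfunc.load_model(req.model_uri)  # base or fine-tuned model
    predictions = model.predict(req.texts)           # assumes a text -> label pyfunc
    return {"predictions": list(predictions)}
```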

Benefits of Expert Annotations
We present the results of the user studies in Table 2. Specifically, we inspect two types of statistics: the correct predictions of the base and fine-tuned models in Columns (1)-(2), and the identical predictions of the base and fine-tuned models in Column (4). Since each publication is multi-JEL-category-indexed, we count a prediction as "correct" if the indices include the predicted category. Column (1) refers to the base model trained with the model type specified in Column (6). Column (2) presents the results of the fine-tuned (supervised) model. Column (4) reports, out of the 190 test instances, how many received identical predictions from the base and fine-tuned models. We see that fine-tuning with user-generated results has brought benefits to all models except the DNN, because the DNN predicts only one class (the dominating one) for all test examples. The RNN is the best performer when considering the benefits of expert supervision, because its ∆ in correct predictions has increased the most according to Column (3). Interestingly, fine-tuning a pre-trained Transformer model may not always result in a significant performance improvement, as a comparison with the other base models shows. However, the current fine-tuning set is too small to draw firm conclusions in this regard.

Discussions: Improving Annotation Efficiency
We share preliminary results of improving annotation efficiency based on the annotators' feedback.

Similarity between Articles and Scholars
We try to match the best-suited scholars to the articles to annotate by extracting keywords from the top-cited articles of these scholars and scoring them on the cosine similarity with the article keywords. The results are promising: this approach can reduce the workload for the scholars while improving the quality of the annotations by assigning the best-suited scholars to the process. More details on the implementation are given in Appendix E.

Table 4: Results of multi-label prediction. Y/N: "Yes"/"No"; A/D/B: "Agree"/"Disagree"/"Blank". "Cat1,2,3" are the three labels predicted by our classification system. If the system predicts that an abstract does not belong to ECON, we no longer ask whether the LLM agrees with our model-predicted categories. Responses that do not contain the specified keywords are considered "blank". The dataset contains a total of 42 entries with non-empty Cat3. In this subset, Vicuna-13B uniformly classifies all entries as being in the ECON domain, whereas Vicuna-7B predicts 33 of these entries to be within the ECON domain.

LLM as Annotators
In light of our commitment to total project transparency, we have opted to utilize the Vicuna 7B and 13B models (Chiang et al., 2023), both of which are publicly available for non-commercial use and are fine-tuned from LLaMA (Touvron et al., 2023), explicitly tailored for QA tasks. For their predictions on single-label and multi-label classification, see Tables 3 and 4. The details of the experimental protocols are in Appendix F. Overall, we observe that even one of the best LLMs performs poorly in the single-label setting, but it shows potential as a keyword extractor and as an annotator for multi-label classification.

Conclusions
In this system demonstration, we utilize a set of standard open-source software (mainly Label Studio (Tkachenko et al., 2020-2022), MLflow, and FastAPI) to configure an annotation and inference engine for scientific publications (SAINE). This system is built on top of the hitherto largest multi-class hierarchical classification study across all disciplines in both single-label and multi-label settings (cf. Rao et al. (2023)). We illustrate the functionality of the system with a user study in Economics and show that expert input into our system can help better understand the classification process, which benefits the development of a stronger model in the next iteration. We plan to open-source the data and codebase and invite collaborative work in the direction of meta-science.

Limitations
Label Studio has some limitations in incorporating existing ML pipelines into the annotation engine, especially when using custom code. We will discuss this with the developers at Label Studio and see how we can bring the annotation engine and the ML pipeline closer to each other.
In terms of annotator selection, at the moment we have to select the experts for each discipline manually. However, we have performed experiments to rank annotators by their field expertise and to find the best annotation tasks based on the similarity between the space of academic publications and the space of articles to annotate (Appendix E). One future idea is to automatically compute an associative score between a third-party academic product, such as Google Scholar, and the publication space. For instance, the project PeopleMap provides interesting techniques to generate researcher profiles based on their research interests and publications, taking as input the Google Scholar profile URLs of researchers. At this stage, the Label Studio developers suggest that we add a self-declarative questionnaire for each annotator, which can be used as metadata on annotators when quantifying the annotation confidence score. Due to time constraints, we have not yet added this questionnaire, as the experts in the current user study were selected by our project PI and have strong expertise in Economics.
In terms of annotation effort, we have benchmarked annotation quality using LLMs, which shows that human annotators are still needed to control quality. Considering our annotators' feedback that keyword extraction is time-consuming for humans, it makes sense to use LLMs as an annotation-assisting engine for keyword extraction at this stage. We have evaluated the LLM-generated keywords: some are quite generic given the context, and others are good fits. We plan to do a systematic evaluation of LLM-generated keywords using the WOS-46985 benchmark dataset. In terms of label prediction, we see in Tables 3 and 4 that Vicuna performs poorly on the single-label task; we will need to complete a larger sample of the multi-label task to gauge its value, despite its superior performance on the 100 tasks we evaluated against human performance.

Ethics Statement
We acknowledge that our system may involve processing potentially sensitive data (such as annotator profiles), and we take data privacy and ethical considerations very seriously. In accordance with the ethical guidelines of the ACM Code of Ethics, we will take steps to protect the privacy of annotators once the annotation engine is in the beta stage. We have also made efforts to ensure that our system and its annotations are unbiased and fair. We believe that our work will help foster greater transparency and understanding in scientific research, and we welcome collaboration and feedback from the scientific community to further advance the ethical and responsible use of AI in research.

Broader Impact Statement
Our annotation engine and inference engine can further support downstream meta-science projects. We list a few interesting questions we can answer using our pipeline (Rao et al. (2023) and the annotation and inference engine):
1. [For students.] Which fields of research are more impactful/growing?
2. [For policy makers.] How to design education for cross-/inter-/pluri-disciplinary studies?
3. [For department and tenure committees.] How to benchmark the output and impact levels of an untenured scholar?
4. [For funding institutions.] How to measure/quantify inter-/pluri-disciplinary standards for institutions such as SNIS and SNSF, which emphasize the interdisciplinarity of research?
5. [For librarians.] How can one effectively organize bibliographical resources across disciplines and departments in one university?

We plan to add other disciplines covered by Rao et al. (2023) to our annotation engine. We would also like to incorporate subjective (self-declaration) and objective measurements (e.g., Google Scholar profile integration) into the annotation pipeline. This may help develop confidence scores for individual annotations/annotators.

Acknowledgments

We thank Mr. Prakhar Bhandari and Ms. Piriyakorn Piriyatamwong for their technical support to our project. We appreciate that Label Studio has offered us an academic license for the project, which allows us to invite more experts to contribute in the long run. The user agreement and terms of an academic license are listed here.

A Our Hierarchical Classification System
We provide an overview of Rao et al. (2023). The paper introduces a modularized three-level hierarchical classification system designed to automatically categorize scholarly publications based on their abstracts. The system operates within a hierarchical label set consisting of disciplines, fields, and subfields, enabling multi-class classification. This approach facilitates a systematic categorization of research activities, considering both knowledge production and impact through citations. The system distinguishes 44 disciplines, 718 fields, and 1,485 subfields, leveraging a vast collection of abstract snippets from the Microsoft Academic Graph. By utilizing various neural network models, such as DNNs, RNNs (using GRUs), CNNs, and Transformers, through batch training, the system achieves high classification accuracy rates exceeding 90% in both single-label and multi-label settings.
The modular design of the system allows for flexibility and easy integration of new models, with CNNs identified as the most efficient performer among the models. The system consists of three components: the first component (L1) handles discipline classification, the second component (L2) focuses on field classification, and the third component (L3) specializes in subfield classification. Each component operates on the output of the previous level, enabling a granular categorization of research activities and capturing the interdisciplinary nature of certain topics.
In the classification process, the system assigns publications to disciplines, fields, and subfields based on their abstracts. It computes conditional probabilities to determine the relevance of each label given the previous-level labels. This hierarchical approach improves the alignment of research texts with disciplines, enables automated classification, and captures interdisciplinarity.
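
One way to write this factorization, with $x$ the abstract and $d$, $f$, $s$ the discipline, field, and subfield labels (our notation, assuming each level conditions on the previous-level outputs as described):

```latex
% Hierarchical decomposition of the joint label probability (illustrative notation):
P(d, f, s \mid x) \;=\; P(d \mid x) \cdot P(f \mid d, x) \cdot P(s \mid f, d, x)
```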
The system incorporates both single-label and multi-label settings. In the single-label setting, each publication is assigned to a single category, while in the multi-label setting, publications can be assigned to multiple categories simultaneously. The multi-label classification assumes label independence and employs binary cross-entropy loss for training. To ensure a balanced distribution of relevant and irrelevant samples, stratified sampling is maintained for label sets.
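
Under the stated label-independence assumption, the multi-label objective is the binary cross-entropy summed over the label set; in standard notation (ours, not reproduced from the paper), with $y_j \in \{0,1\}$ the gold indicator and $\hat{y}_j$ the predicted probability for label $j$:

```latex
% Binary cross-entropy over independent labels:
\mathcal{L} \;=\; -\sum_{j} \Big[\, y_j \log \hat{y}_j \;+\; (1 - y_j) \log\big(1 - \hat{y}_j\big) \,\Big]
```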
Performance evaluation of the classification system includes metrics such as categorical accuracy, precision, and recall. The system's ability to accurately classify research texts, align them with relevant disciplines, and capture interdisciplinarity contributes to its value in indexing and analyzing scientific publications.
Overall, the proposed system, with its modular design and pretrained models, serves as a solid foundation for future applications in scientific publication indexing and analysis.

B Label Studio Functionalities
In Figure 4 we demonstrate the administrative panel of the project manager. The "Filters" and "Order (Annotation results)" tabs make it easy to inspect tasks by annotation progress (e.g., "Annotators", "Agreement", "Completed", "Total annotations per task"). In Figure 5, we demonstrate the user panel shown to each expert annotator. An annotator has no access to additional information about the annotations made by the other annotators. As an expert, one can only see how many annotations have been gathered per instance among the experts together.

C Multi-label Annotation Engine
The setup of multi-label annotation is similar to the single-label setting. In the multi-label setting, annotators are required to mark "(Dis)agree" for each suggested JEL category (we provide at most three categories) and then select additional JEL categories, where multiple choices are allowed.

D Inference Engine
We provide two types of API calls for inference: (1) inference_by_model and (2) batch_inference_by_model. The only difference between the two is that API (2) allows text-label predictions in batches, which requires a JSON sequence as input. Figure 7 demonstrates the user interface.
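
A hypothetical client-side call to API (2) might look as follows; the host, port, and payload fields are illustrative, since the actual schema is defined by our Pydantic models.

```python
import requests

payload = {
    "model_uri": "models:/econ-cnn/1",   # hypothetical MLflow model reference
    "texts": ["Abstract one ...", "Abstract two ..."],
}
resp = requests.post("http://localhost:8000/batch_inference_by_model", json=payload)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]} -- one label per input text
```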

E Similarity between Authors and Articles

E.1 Keyword Extraction
The process of keyword extraction from the top 25 cited articles of each author and from individual publication articles involves the following steps and methods. The scholars we picked are five renowned economists; we use Google Scholar to download their profiles and publications. We start with the abstracts from these articles as the primary source of information. Firstly, we perform initial cleaning and preprocessing on these abstracts. This involves the removal of non-alphanumeric characters, conversion of text to lowercase, and tokenization of the text into individual words. We also remove common words, known as stop words, which do not contribute much to the overall meaning of the text. Finally, we discard words that are fewer than three characters long, as these are typically not meaningful. This cleaning process results in a simplified and standardized version of the original text that is more suitable for further analysis.
Secondly, we ensure that all our data is in English to maintain consistency.For this, we employ a language detection function.If a text is not in English, we translate it using a translation pipeline, which is a model capable of accurately translating text from various languages to English.To handle potential memory issues with larger texts, we split the text into smaller chunks, translate each chunk separately, and then concatenate them back together.
The cleaned and translated text is then passed through KeyBERT (Grootendorst, 2020), a minimalistic transformer-based keyphrase extraction technique, to extract keywords from the text. Apart from KeyBERT, we also tried other keyword extraction techniques such as YAKE (Campos, 2020) and RAKE (Chaddha, 2020). We then compared which of these techniques extracted the best keywords by generating scores for the tasks on the benchmark dataset WOS-46985 discussed in Rao et al. (2022). We found that KeyBERT achieved the best scores for the extracted keywords when compared to the reference data. KeyBERT uses BERT, a state-of-the-art transformer model for natural language processing, to convert words into high-dimensional vectors, or embeddings. These embeddings capture the semantic meaning of the words and their context. KeyBERT then identifies clusters in these embeddings to find the most representative or "key" phrases.
We extract 250 keywords for each author by combining and analyzing the abstracts of their top 25 cited articles, which gives us a broad representation of their research interests. For individual publication articles, we extract 15 keywords to capture the essence of each specific article. We tried 5, 10, and 15 keywords per article, and 15 gave the best results.
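
A condensed sketch of the cleaning and extraction steps follows; the cleaning is simplified, and the parameters shown (n-gram range, top_n) are illustrative defaults rather than our tuned configuration.

```python
import re
from keybert import KeyBERT

def clean(text: str) -> str:
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())         # strip non-alphanumerics
    return " ".join(w for w in text.split() if len(w) >= 3)  # drop very short tokens

kw_model = KeyBERT()  # BERT embeddings under the hood

def article_keywords(abstract: str, top_n: int = 15):
    """15 keywords per article; for an author profile, concatenate the top-25
    cited abstracts and call with top_n=250."""
    return kw_model.extract_keywords(
        clean(abstract),
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=top_n,
    )
```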
By following this methodology, we ensure the extraction of the most relevant and informative keywords for each author and individual article, providing us with a valuable understanding of the research landscape and the interests of the authors.

E.2 Similarity Scores
In our work, we present an innovative method that allows for a comprehensive understanding of the relationship between authors, publications, and research categories.This approach uses a function which not only identifies the top authors relevant to a particular article but also uncovers the top fields of research or "categories" connected to the article and hence its best annotator(s).
The function employs the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique to transform text data into a numerical representation that can be processed by machine learning algorithms.For a given publication, it uses this technique to compare the article's abstract to those of top-cited authors, generating a list of the most similar authors.
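
A minimal sketch of this ranking step, assuming each author is represented by the concatenated abstracts of their top-cited articles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_authors(article_abstract: str, author_corpora: dict[str, str]):
    """Return (author, similarity) pairs, best-suited annotator first."""
    names = list(author_corpora)
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(
        [article_abstract] + [author_corpora[n] for n in names]
    )
    sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    return sorted(zip(names, sims), key=lambda pair: -pair[1])
```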
Subsequently, the function identifies the top research categories linked to the publication by analyzing the keywords in its abstract.It applies the same process to the top-ranked author's 25 most cited articles.The result is a set of top categories that best align with the publication and the most relevant author, providing a deeper understanding of their research focus.This novel approach offers a multidimensional view of the research landscape, establishing clear links between authors, their publications, and research fields.

E.3 Plotting the Author-Article Similarities
In our research, we have developed a method for visualizing the semantic proximity between a specific publication and the top 25 cited author publications across all authors.This is accomplished through a function that maps the abstracts of the documents into a two-dimensional space using Word2Vec for word embeddings and PCA for dimensionality reduction.The resulting plot provides a graphical representation of how closely related the content of a given publication is to the influential works of various authors.In Figure 8 we show an example plot produced using this method.
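
A sketch of the plotting function, assuming each document is embedded as the mean of its word vectors before projection (the Word2Vec training details are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

def plot_similarity(target_abstract: str, author_abstracts: list[str]):
    docs = [d.lower().split() for d in [target_abstract] + author_abstracts]
    w2v = Word2Vec(sentences=docs, vector_size=100, min_count=1)

    def embed(tokens):
        # Mean word vector as a simple document embedding.
        return np.mean([w2v.wv[t] for t in tokens], axis=0)

    points = PCA(n_components=2).fit_transform(np.stack([embed(d) for d in docs]))
    plt.scatter(points[1:, 0], points[1:, 1], label="authors' works")
    plt.scatter(points[0, 0], points[0, 1], marker="*", s=200, label="target publication")
    plt.legend()
    plt.show()
```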
We invite the reader to observe the distribution of points, where the spatial proximity reflects the semantic similarity between the given publication and the authors' works. This method offers an intuitive way to understand the knowledge structure and the implicit connections between different research articles. We intend to incorporate all the above-mentioned changes into Label Studio to decrease the workload of the annotators and to increase the overall efficiency and accuracy of the process. Authors will be selected by ranking them and choosing the one with the highest similarity score to annotate the document. The selected author will be given a union set of five categories, drawn from the author's publications and the publication itself, that align most closely with the publication.

F LLM Annotation

F.1 LLM Selection
In our work, we utilize the v1.1 model weights for Vicuna-7B and 13B. All inference tasks are executed on two sets of RTX 3090 GPUs (24 GB of memory each). The parameters employed during response generation are as follows: max_length set to 100,000, do_sample enabled (True), and temperature set to 0.7.
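
For reference, loading the weights and generating with these parameters might look as follows with Hugging Face Transformers; the hub model ID and loading options are our assumptions.

```python
# Sketch only: model ID and device placement are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.1"  # assumed hub ID for the v1.1 weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "..."  # instruction + abstract, as in Figure 9
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_length=100_000,   # generation parameters as reported above
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```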
Our vision is to build a completely open-source pipeline, so we have disregarded LLMs such as GPT-4 (OpenAI, 2023), which only provide API access; instead, we prefer open-source alternatives such as LLaMA. We have explored non-LLaMA-based LLMs such as OpenChatKit (TogetherComputer, 2023), but we encountered issues related to the stability of their output. We notice that these models sometimes produce inconsistent responses for the same data point (i.e., annotating one publication with keywords and labels), alternating between "Agree" and "Disagree" without providing logically coherent reasoning.
Among the multitude of LLaMA-based LLMs, we identify Vicuna as a model specifically fine-tuned for question-answering tasks, making it an apt choice for our project. Furthermore, Vicuna's exceptional performance, underscored by its highest Elo rating in the Chatbot Arena (Zheng et al., 2023), convinced us to choose it as our annotator.

F.2 Single-label
Figure 9 presents the standardized prompt template we employ to query the LLM regarding its agreement with the category predicted by our model for each data point in the ECON single-label dataset. Note that certain segments of the prompts remain fixed and repetitive, a feature we refer to as "instructions". This design is necessitated by the LLM's inherent propensity to forget previous text: inputting the instruction just once may compromise the quality of responses for subsequent data points. For instance, the model might cease to incorporate crucial keywords such as "Agree", "Disagree", or "NOT ECON". Therefore, we find it essential to provide the instruction with each data point.
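
Since Figure 9 is not reproduced here, the following illustrates the structure only; the instruction wording is a hypothetical stand-in for the actual template.

```python
INSTRUCTION = (
    "You will see an abstract and a predicted JEL category. Answer with exactly "
    'one of "Agree", "Disagree", or "NOT ECON".\n'
)  # hypothetical wording; the real instruction is shown in Figure 9

def build_prompt(abstract: str, predicted_category: str) -> str:
    # The instruction is prepended to EVERY data point to counter forgetting.
    return (
        f"{INSTRUCTION}"
        f"Abstract: {abstract}\n"
        f"Predicted category: {predicted_category}\n"
        "Do you agree with this category?"
    )
```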

F.3 Multi-label
For the ECON multi-label dataset, we engage the LLM with up to five prompts per data point (Figure 10). The first prompt asks whether the given abstract is relevant to the field of Economics. If the answer is negative, we terminate further inquiry. If the LLM confirms the economic relevance, we proceed to query the model's agreement with up to three categories our model previously predicted. Lastly, we ask the LLM about any additional categories to which it believes the abstract may belong, beyond those predicted by our model. This final query is intended primarily as a preparatory measure for future keyword analysis. As in the single-label setting, the persistent recurrence of identical instructions within the prompt template is designed to mitigate the LLM's inherent forgetfulness. When we compare the output of the LLM with those of the human annotators on 100 annotations, we see a significant overlap in the categories assigned by the two methodologies, which shows the potential of employing LLMs for multi-label tasks.
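
The five-prompt protocol can be summarized in the following sketch; the prompt wordings and the llm callable are hypothetical stand-ins for the template in Figure 10.

```python
def build_relevance_prompt(abstract: str) -> str:
    # Hypothetical wording; the real template is shown in Figure 10.
    return f'Is the following abstract relevant to Economics? Answer "Yes" or "No".\n{abstract}'

def build_agreement_prompt(abstract: str, cat: str) -> str:
    return f'Does this abstract belong to the category "{cat}"? Answer "Agree" or "Disagree".\n{abstract}'

def build_extra_prompt(abstract: str) -> str:
    return f"List any additional JEL categories this abstract may belong to.\n{abstract}"

def annotate_multilabel(llm, abstract: str, predicted_cats: list[str]):
    """Up to five prompts per data point, mirroring the protocol above."""
    # Prompt 1: is the abstract relevant to Economics at all?
    if "yes" not in llm(build_relevance_prompt(abstract)).lower():
        return None  # negative answer: terminate further inquiry
    # Prompts 2-4: agreement with up to three model-predicted categories.
    verdicts = {c: llm(build_agreement_prompt(abstract, c)) for c in predicted_cats[:3]}
    # Prompt 5: additional categories, preparatory for future keyword analysis.
    extra = llm(build_extra_prompt(abstract))
    return verdicts, extra
```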

Figure 4: Administrative Panel of Annotation Tasks in Label Studio.

Figure 5: Annotator Panel of the Assigned Annotation Tasks in Label Studio.

Figure 6: Publication Annotation Engine in a Multi-label Setting.

Figure 8: A two-dimensional representation of the semantic proximity between a specific publication and the top 25 cited author publications. Each point represents an author's work, and the spatial distribution reflects the semantic similarity to the given publication.

Table 1: Annotator Agreement Matrix Among Three Expert Annotators.

Table 2: Results of the User Study. FT: Fine-tuned.

Table 3: Results of single-label prediction. Vicuna outputs that do not contain the keywords "Agree", "Disagree", or "NOT ECON" are labeled as "blank".