SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis

5G is the fifth-generation cellular network protocol. It is the state-of-the-art global wireless standard that enables an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Therefore, its development, analysis, and security are critical. However, all approaches to 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations, are completely manual. To reduce such manual effort, in this paper we curate SPEC5G, the first-ever public 5G dataset for NLP research. The dataset contains 3,547,587 sentences with 134M words, from 13094 cellular network specifications and 13 online websites. By leveraging large-scale pre-trained language models that have achieved state-of-the-art results on NLP tasks, we use this dataset for security-related text classification and summarization. Security-related text classification can be used to extract relevant security-related properties for protocol testing. Summarization, on the other hand, can help developers and practitioners understand the high-level ideas of the protocol, which is itself a daunting task. Our results show the value of our 5G-centric dataset for automating 5G protocol analysis. We believe that SPEC5G will enable a new research direction into automatic analyses of the 5G cellular network protocol and numerous related downstream tasks. Our data and code are publicly available.


INTRODUCTION
The deployment of the 5G cellular network protocol has generated a lot of enthusiasm in academia and industry because of its promise of enabling innovative applications such as self-driving vehicles, remote surgery, and industrial IoT. 5G cellular networks are predicted to have more than 1.7 billion subscribers and account for 25% of the worldwide mobile technology market by 2025 [2]. Unfortunately, 5G protocol development and analysis are completely manual tasks requiring critical domain expertise. We observe that for 5G there is a vast, largely unutilized resource of information available in the form of specifications [1] and numerous tutorials on the Internet.
Recently, a few approaches have been proposed that leverage natural language processing (NLP) and machine learning (ML) to detect risky operations in some of the specifications of 4G LTE [12] and to analyze change requests [11]. These approaches are very limited, not generalizable, and not open-sourced. Automatic and systematic analysis of 5G networks is still a difficult task. One major problem is the lack of high-quality datasets to train ML models, which would enable the automation of different 5G-related downstream tasks, e.g., security-related text classification, protocol summarization, semantic analysis, and automatic programming. In this paper, we address this need by introducing SPEC5G, a high-quality dataset of the 5G protocol specifications. 5G is not a single wireless technology but an umbrella term for the fifth generation of wireless communication, encompassing hundreds of different protocols at different layers of the protocol stack. Some of these protocols are VoWiFi, the Diameter protocols, cellular IoT, IKE, 5G-AKA, and many more. SPEC5G is a complete dataset that covers all these protocols and therefore has the potential to impact protocols affecting billions of devices. Such a high-quality dataset would benefit numerous applications in different domains, such as security testing, policy enforcement, automatic code generation, and protocol summarization.
To show the viability of our SPEC5G dataset, we use it for two downstream tasks (shown in Figure 1). First, we use it for security-text classification. In previous 5G security testing [5,17,21], the properties are manually extracted from the specifications. Using security-text classification, we can automatically identify texts that specify important security properties to be used for formal verification and other testing approaches. Second, we use SPEC5G for the paragraph summarization task. The 5G specification is large and complex, with specialized jargon, mostly due to backward compatibility requirements. Therefore, it is daunting for a software developer to understand the high-level ideas of the protocol specification. With the summarization task, we show that it is possible to summarize and simplify these high-level ideas. To achieve both tasks, we created two expert-annotated datasets: one for summarization and one for classification. The summarization dataset contains 713 long articles and their concise summaries; the classification dataset contains 2401 sentences and their class labels (Non-Security, Security, Undefined). Both datasets were annotated by multiple domain experts to ensure quality and fairness. Along with SPEC5G, these two expert-annotated datasets will be open-sourced to enhance research.
On the whole, our contributions are three-fold. First, we create SPEC5G, the first-ever 5G dataset of 3,547,587 sentences, by preprocessing the 5G specifications and scraping data from different 5G tutorials on the Internet. Second, we create two expert-annotated datasets for the baseline security-text classification and summarization tasks. We conduct an extensive evaluation of these datasets using several NLP models on the downstream tasks. The results show that the models pre-trained on SPEC5G outperform all baseline models. Third, all these research artifacts have been made available via a public repository for future research. To the best of our knowledge, this is the first-ever public 5G dataset created for NLP research.

[Figure 1: Example downstream tasks on a specification excerpt ("The use of ciphering in a network is an operator option. In this subclause, for the ease of description, it is assumed that ciphering is used, unless explicitly indicated otherwise. Operation of a network without ciphering is achieved by configuring the AMF."): summarization yields "Ciphering is operator dependent and operation of a network without ciphering is achieved by configuring the AMF."; classification labels sentences such as "Confidentiality protection uses an XPK to encrypt the data." as security-relevant.]

RELATED WORK
The introduction of the attention-based transformer architecture by [54] ushered in the era of transformer-based language models (LMs) in NLP. A range of high-performing transformer-based language models have since been proposed, each with its own specific use cases. To train such LMs, large, high-quality datasets are critical. Based on the English Wikipedia and books, researchers have built high-quality datasets such as Wikidata and BookCorpus [65]. In the following, we discuss the research relevant to our work.

Cellular Networks Research Using NLP
CREEK [11] uses BERT models for detecting security-relevant change requests; for this, it pre-trains BERT with a subset of 4G LTE specifications (1546 out of 13094). ATOMIC [12] designs a framework to semantically analyze LTE documents using NLP, obtaining a set of hazard indicators for generating test cases under a given threat model. These are the first steps in applying NLP techniques to analyze cellular network specifications. In a technical blog post from Ericsson [3], the authors adapt LMs to the telecom domain and create a telecom question-answering dataset. Though promising, these approaches are ad-hoc, do not generalize, and are closed-source, accentuating the need for a complete and public dataset for 5G.

Summarization
Following BookCorpus and Wikidata, researchers have built summarization datasets such as Wikilarge [61], Wikismall [66], and others [16,26]. Such datasets are widely used in sentence summarization. Early summarization models mostly relied on statistical machine translation [38,56]; [40] improved the machine translation model to obtain a new summarization model. [47] and [39] investigated how to simplify sentences to different difficulty levels, while [52] and [25] proposed sentence alignment methods to improve sentence summarization. There are also a number of corpora related to summarization. [59] provided a large-scale, human-annotated corpus of scientific papers: over 1,000 papers in the ACL Anthology with their citation networks (e.g., citation sentences, citation counts) and comprehensive manual summaries. Another dataset was created for the Computational Linguistics Scientific Document Summarization Shared Task, which started as a pilot in 2014 [22] and is now a well-developed challenge in its fourth year [23,24]. [15] introduced a new dataset for summarization of computer science publications by exploiting a large resource of author-provided summaries.

Sentence Classification
The Corpus of Linguistic Acceptability (CoLA) [55] consists of English acceptability judgments drawn from books and journal articles on linguistic theory; each example is a sequence of words annotated with whether it is a grammatical English sentence. The Stanford Sentiment Treebank [50] consists of sentences from movie reviews and human annotations of their sentiment. SciCite [13] is a large dataset of citation intents for the automated analysis of scientific papers by identifying the intent of a citation (e.g., background information, use of methods, comparing results). Researchers have also leveraged other large datasets, such as DEFT [51] and ACL-ARC [7], for sentence classification tasks. [14] introduced and released CSABSTRUCT, a dataset of manually annotated sentences from computer science abstracts for Sequential Sentence Classification (SSC). Paper Field [49] is built from the Microsoft Academic Graph and maps paper titles to one of 7 fields of study: geography, politics, economics, business, sociology, medicine, and psychology. DBpedia extracts structured content from Wikipedia; a pre-processed extract with taxonomic, hierarchical categories (classes) for 343k Wikipedia articles is a popular baseline for text classification tasks.

DATASET CURATION
In this section, we discuss the collection and preparation of our dataset. A significant amount of data was collected from the 3GPP website [1].

3GPP
The 3rd Generation Partnership Project (3GPP) is an umbrella organization that hosts several organizations from different countries. 3GPP is globally considered the issuer of standards for cellular network protocols. These standards are publicized as releases, e.g., LTE standards were made public from Release 8 and 5G standards from Release 15. The current release is Release 19.
A large number of meeting minutes and Technical Reports (TR) can be found on the 3GPP FTP server [1]. 3GPP also releases sets of Technical Specifications (TS), which subsequent releases extend with features and bug fixes. Figure 2 shows the count of specification documents per release.

Dataset Collection
As stated earlier, a significant portion of the dataset was collected from the 3GPP FTP server. Automated NLP tasks have been hindered in the 5G domain because of noisy data in the standard documentation. The specification documents often contain embedded code, tables, lists with definitions of varying terminologies, flow diagrams, finite state machines, and so on, which makes it hard to build models that reason and perform well on downstream applications. Thus, to enable downstream NLP tasks, we perform extensive preprocessing. Furthermore, we scrape data from 13 blogs and forums on the Internet; details about the web sources can be found in Table 1. We extracted approximately 17 GB of text data from specification releases and web portals using Python web scrapers and Selenium [4]. We then applied a set of standard and domain-specific preprocessing steps to obtain the final dataset.
3.2.1 Preprocessing. 5G specifications and web data contain a variety of materials encompassing method and framework documentation, pseudocode, high-level implementations, numerous parameters, field constitutions, and so on. First, the raw data go through standard NLP preprocessing, e.g., removing extra whitespaces, tabs, certain Unicode characters introduced by scraping, HTML tags, etc. We then extend the preprocessing to handle special cases such as code snippets, tables, figures, references to other specification documents, etc. The preprocessing steps are listed below.
• Sentences containing code (e.g., consecutive '{{', '}}', '((', '))') were removed.
• Remaining HTML tags left over after web scraping were removed.
• Sentences consisting of consecutive digits and dots refer to (sub)section headers; these were removed.
• Malformed text originating from figures and tables when parsing .doc or .pdf files was filtered out.
• Sentences containing Unicode characters appearing as raw text were removed.
• Multiple consecutive newlines, tabs, whitespaces, and delimiters were collapsed into one.
• Leading numbers, dots, interpuncts, and hyphens left over from (un)ordered lists were removed.
• Extra whitespace after opening parentheses, curly braces, and brackets was removed; similarly before closing ones.
• 3GPP specifications contain numerous mentions of specification documents (e.g., TS 24.301). These do not add any useful features for learning and were renamed to "specification document".
• Sentences with a high proportion of digits typically come from embedded code; if more than 20% of a sentence's characters are digits, the sentence is filtered out.
• A few special cases (e.g., "e.g.", "i.e.") are handled so that they are not treated as sentence boundaries.
• An additional newline was added after the text of each document/web page, to ensure that downstream applications (e.g., summarization) are not affected by unrelated text from multiple documents.
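As an illustration, a few of the filters above can be sketched in Python. This is a minimal sketch rather than our actual pipeline; the function name and the exact regular expressions are ours:

```python
import re

def clean_sentence(sentence):
    """Minimal sketch of a few of the filters listed above.
    Returns the cleaned sentence, or None if it should be dropped."""
    # Drop sentences that look like embedded code.
    if re.search(r"\{\{|\}\}|\(\(|\)\)", sentence):
        return None
    # Drop sentences dominated by digits (more than 20% of characters).
    if sentence and sum(ch.isdigit() for ch in sentence) / len(sentence) > 0.20:
        return None
    # Mentions of other specification documents carry no useful
    # features for learning, so normalize them.
    sentence = re.sub(r"\bTS \d+\.\d+\b", "specification document", sentence)
    # Collapse runs of whitespace, tabs, and newlines into one space.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence or None
```

In the real pipeline each filter runs over every extracted sentence; a sentence survives only if no removal rule fires.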

Dataset Statistics.
Our final processed dataset contains 3,547,587 sentences with a total of 134M words. Figure 3 shows the distribution of the number of sentences per document and Figure 4 shows the distribution of tokens per sentence.
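Corpus-level statistics like these can be reproduced with a single pass over the sentence-split corpus. The sketch below is a toy illustration (the variable names are ours):

```python
def corpus_stats(sentences):
    """Count sentences and whitespace-delimited words in a
    sentence-split corpus given as a list of strings."""
    n_sentences = len(sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_sentences, n_words

# Toy example; the real corpus has 3,547,587 sentences and ~134M words.
demo = ["The AMF selects a ciphering algorithm.",
        "Ciphering is an operator option."]
print(corpus_stats(demo))  # → (2, 11)
```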

Annotation
To demonstrate the effectiveness of SPEC5G, we additionally create and annotate two datasets specific to two NLP tasks: summarization and sentence classification.
3.3.1 Summarization Dataset. Articles were selected from SPEC5G to create this dataset. An article is defined as a sequential collection of sentences. Here we apply another round of manual processing to ensure semantic correctness among the sentences of each article. The final curated dataset contains 713 articles, each with 1-12 sentences. The distribution of sentences per article is shown in Figure 5. This dataset was subsequently labeled by 9 domain experts; each label is itself a smaller set of sentences that summarizes the article. The annotation (summarizing) task varied in difficulty, and the annotators made insightful comments about the articles they found challenging. Based on these comments, another round of manual data cleaning was done, resulting in a very high-quality test set for protocol specification summarization. For the rest of this paper, we refer to this annotated dataset as 5GSum.

3.3.2 Security Classification. Similar to the summarization task, we randomly selected and annotated 2401 sentences from our SPEC5G dataset for multi-class classification. We categorized the data into 3 classes: Non-Security (0), Security (1), and Undefined (2). To discard human bias, the dataset was labeled by 9 domain experts. We used an 85-5-10 split for train, validation, and test data, with 2040, 120, and 241 samples respectively. For the rest of the paper, we refer to this dataset as 5GSC.
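The paper does not describe the exact sampling procedure, but a shuffled 85-5-10 split reproduces the reported counts. A sketch, with function and parameter names ours:

```python
import random

def split_dataset(samples, seed=42):
    """Sketch of an 85-5-10 train/validation/test split.
    The seed value is an arbitrary assumption for reproducibility."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.85 * n)
    n_val = int(0.05 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# With the 2401 annotated sentences this yields 2040 / 120 / 241 samples:
train, val, test = split_dataset(list(range(2401)))
print(len(train), len(val), len(test))  # → 2040 120 241
```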
Among the 3 classes, the fewest samples belong to class 2 (Undefined: 484), while the largest class (Non-Security: 1303) has only about 3 times as many samples; hence, the dataset is not highly imbalanced. This non-uniformity is expected, since most of the specification text should not relate to security issues, and a large number of Undefined statements in the 5G specifications would rather indicate inconsistencies in the implementation.

Annotation Guidelines
Below are the specific guidelines that were given to the annotators to ensure the standard of annotation.
3.4.1 Sentence Classification Guidelines. The annotators were given general guidelines and suggested steps to follow while annotating, to keep the data annotation consistent. They were also provided with rules and tips.
General Guidelines: For this task, an annotator is given a set of sentences. Based on the methods, fields, variables, and/or entities mentioned in the sentence, the annotator's objective is to identify whether the sentence implies a potential security concern or a sophisticated operation that might have vulnerable consequences.
Steps:
(1) Read the sentence carefully.
(2) Identify the items and the operation that involve a security issue.
(3) Decide the label based on the following:
a) Non-Security: The expressed operation cannot be exploited; the sentence does not describe any security hazard, any underspecified criteria, or any complicated, flawed operations.
b) Security: The text discusses situations or operations that might be risky, or involves certain properties or variables by exploiting which one can seriously block the operations, harm the entity, or breach privacy.
c) Undefined: The discussed operation is not clear; the sentence does not express all the parties or variables involved; or the sentence entails some previous operation unavailable to the annotator, without which the annotator cannot decide about the potential risk.
(4) If a sentence does not express any proper context, is semantically incorrect, or has no items with a sentiment expressed towards them, add a comment and proceed to the next data point.

Rules and Tips:
• Select all items in the sentence that have a security hazard.
• If there are multiple such cases, you may choose any or all of them. • Optionally, you may provide a comment about your rationalization or feedback about the data (e.g., errors, unclear descriptions.)

3.4.2 Summarization Guidelines. Similar to the classification guidelines, the annotators were given general guidelines and suggested steps for annotating the summarization dataset, along with rules and tips.
General Guidelines: For this task, an annotator is given a set of articles. Based on the methods, fields, variables, and/or entities mentioned in the article, the annotator's objective is to summarize the article without losing important information, correctness, or contextuality.
Steps:
(1) Read the article carefully.
(2) Identify the sentences that convey important information.
(3) Summarize the article by doing the following:
a) Deletion: Delete a sentence if it does not convey any important information.
b) Merge and shorten: Merge consecutive sentences if they convey continued information, and make the merged sentences concise.
c) Rephrase and shorten: Rephrase a sentence to make it simpler, and shorten it if possible.
(4) If the sentences do not express any proper context or are semantically incorrect, add a comment and proceed to the next article.

Rules and Tips:
• Select all items in the article that have important information.
• Make the sentences simpler and concise keeping the important information. • Under each article is a comments box. Optionally, you can provide article-specific feedback in this box. This may include a rationalization of your choice, a description of an error within the article, or the justification of another answer which was also plausible. In general, any relevant feedback would be useful and will help in improving this task.

TASKS
In this section, we define the downstream tasks: summarization and security sentence classification. Moreover, we discuss the relevance of these downstream tasks with respect to 5G.

Task 1: Summarization
Text summarization is the simplification of an original text into a more understandable text while keeping the main meaning unchanged [33,53]. It can provide convenience for non-native speakers [20,42,43] and non-expert readers [19,48]. In the case of 5G standard documents, summarization can help developers and practitioners understand the high-level idea of the protocol, which can be very time-consuming without it. The document-level text summarization task can be defined as follows. Let C be an original complex article consisting of m sentences, denoted as C = c1, c2, ..., cm. Document-level summarization aims to simplify C into n sentences, which form the simplified article S, denoted as S = s1, s2, ..., sn, where n is not necessarily equal to m. S retains the primary meaning of C and is more straightforward than C, making it easier for people to understand. The operations for sentence-level summarization include word reservation and deletion, synonym replacement, etc. [57]. In our definition, document-level summarization may lose information, but not important information. [64] pointed out that sentence deletion is a prevalent phenomenon in document summarization. We believe that information with little relevance to the primary meaning should be removed to improve readability.
The objective is to simplify a paragraph without losing important information. Task 1 becomes more challenging when it requires the model to reason about effects that are not explicitly stated in the text.

Task 2: Sentence Classification
Text classification is a classic topic for natural language processing, in which one needs to assign predefined categories to free-text documents. The range of text classification research goes from designing the best features to choosing the best possible machine learning classifiers [31,35,63].
Multi-class sentence classification can be defined as follows. Given a sentence s ∈ S, where S is some high-dimensional sentence space, and a finite set of categories or classes C = {c1, c2, ..., ck}, the objective of multi-class sentence classification is to find a function F mapping sentences to categories, F: S → C. Given a dataset D of N training samples {(s_i, c_i)}, i = 1, ..., N, we aim to learn a function F̂ that approximates F.
For protocol analysis, an important part is property-guided testing [5,21]. Up to this point, the properties are manually extracted, and the testing is entirely manual. The security classification task aims to label the security-related sentences that in turn can be used as properties and enable semi-automated testing.

EXPERIMENTS AND EVALUATION
Next, we detail the experimental setup for both pre-training and fine-tuning, followed by a discussion on the performance of baselines and pre-trained models.

Experiment Setup
Our experiments were set up as follows. Baseline Models: For baseline models, we used the base versions of BERT [18], RoBERTa [32], XLNet [58], BART [29], GPT2 [44], T5 [45], ALBERT [27], CamemBERT [34], LongFormer [6], and Pegasus [60]; the large versions of GPT2 and mBART [30]; the medium version of GPT2; and DistilGPT2 and DistilBERT [46]. Pretrained Models: We pre-trained three models, BERT-base, ROBERTa-base, and XLNet-base, on the SPEC5G dataset; we refer to them as BERT5G, ROBERTa5G, and XLNet5G. The reasons for training these three models are discussed in Section 5.4. We then fine-tuned the pre-trained models for the downstream tasks. The details of pre-training and fine-tuning are discussed in Section 5.2. Training Hardware: We used Google Colab Pro+ to pre-train and fine-tune the models. Around 3,000 compute units (CUs) of a Premium GPU (A100) with the High-RAM configuration were consumed to complete all our experiments. A compute unit is Colab's unit of measurement for the resources consumed, the product of two factors: (1) memory (GB), the size of the server allocated to the task, and (2) duration (hours), how long the server is used. That is, 1 CU = 1 GB of memory × 1 hour. We used around 80-90% of the GPU during training. By this definition, we used roughly 100 hours of 30 GB GPU time.
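The compute-unit arithmetic above is a simple product; a trivial check (the function name is ours):

```python
def compute_units(memory_gb, hours):
    """Colab compute units = allocated memory (GB) x duration (hours)."""
    return memory_gb * hours

# A 30 GB High-RAM configuration used for ~100 hours consumes ~3000 CUs.
print(compute_units(30, 100))  # → 3000
```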
Pre-training BERT took around 36 hours; pre-training RoBERTa and XLNet took around 24 hours each. Fine-tuning each model took around 5-6 hours. We used the Hugging Face standard pipeline to pre-train and fine-tune the models.

Training Details
To pre-train the BERT Masked Language Model (MLM), we used the Adam optimizer with ε = 10^-8 and trained the model for 10 epochs. The learning rate was 5 × 10^-5, and we set aside 10% of the data as a validation set to inspect model performance every 50k steps. BERTFastTokenizer was used to tokenize the dataset. We used the same parameters to pre-train the ROBERTa MLM and trained that model for 5 epochs, using the ROBERTa BPE tokenizer. For pre-training the XLNet Permutation Language Model (PLM), we used the Adam optimizer in the same setting. Since XLNet requires approximately 5 times more compute than BERT or ROBERTa, we trained it for 1 epoch, using the SentencePiece tokenizer.
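As a concrete illustration of the MLM objective used in pre-training, BERT-style input corruption can be sketched in plain Python. This is a simplified sketch, not our training code; the function, its defaults, and the 80/10/10 split follow the well-known BERT recipe:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Sketch of BERT-style MLM corruption: each selected token becomes
    [MASK] 80% of the time, a random vocabulary token 10% of the time,
    and is left unchanged 10% of the time."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # position not scored by the MLM loss
    return inputs, labels
```

During pre-training, the loss is computed only at positions where a label is set, so the model learns to reconstruct specification text from its context.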
When fine-tuning the classification models, we set the learning rate to 2 × 10^-5, the weight decay to 0.01, and the batch size to 16. The Hugging Face standard pipeline with the AutoModel class was used for sequence classification. Each model was trained for 15 epochs.
We used the bert-extractive-summarizer [36] to generate summaries with BERT-base. The Hugging Face standard pipeline libraries were used to generate summaries with the sequence-to-sequence models, i.e., PEGASUS and T5, which come with default summarization capabilities. The Hugging Face AutoModel library was used to generate summaries with RoBERTa-base, RoBERTa5G, XLNet5G, and BERT5G. Another Hugging Face library, TransformerSummarizer, was used to generate summaries with XLNet, GPT2, GPT2-base, GPT2-medium, GPT2-large, and DistilGPT2.

Performance Metric
To measure the performance of the sentence classification task, we use standard metrics: accuracy, precision, recall, and F1-score. Here, we discuss the metrics for the summarization task in detail.
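For reference, the classification metrics reduce to simple per-class counts. A plain-Python sketch of macro-averaged F1 over the three 5GSC classes (function and variable names are ours):

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Per-class precision/recall/F1, macro-averaged, for a 3-class
    task such as Non-Security (0) / Security (1) / Undefined (2)."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice a library such as scikit-learn computes the same quantities; the sketch just makes the definition explicit.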

Summarization Metrics.
To measure the quality of summarization, we use both automatic and human evaluation metrics. For automatic evaluation, we use the commonly used ROUGE score. Human Evaluation Metrics: Due to their significant dissonance with human judgment, automatic evaluation metrics are often considered unreliable for summarization quality evaluation. Hence, we also resort to human evaluation. The human annotators rate each summary on a scale from 1 (worst) to 5 (best) on three coarse attributes. Simplicity: how simple and concise the generated or annotated inferences are. Correctness: whether the generated or annotated inferences are correct both grammatically and from a protocol point of view; this is very important for summarizing network protocol specifications. Contextuality: whether the generated or annotated inferences fit the context.
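The ROUGE score admits a compact illustration. A bare-bones ROUGE-1 F-measure based on unigram overlap is sketched below; real evaluations use a ROUGE library with stemming and ROUGE-2/ROUGE-L variants:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Sketch of ROUGE-1 F-measure: unigram overlap between a
    candidate summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```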

Performance
In this section, we report the performance of the baseline language models to characterize our dataset. Summarization: We compared our pre-trained models with the baseline models to show the robustness of our dataset. We report the models' mid-score F-measures of ROUGE-1, ROUGE-2, and ROUGE-L in Table 2 and Figure 6, and the human evaluation scores in Table 4, to compare the baseline models with the pre-trained models. BERT5G outperforms all the models, even though BERT-base was not the best-performing baseline. This shows the quality of our dataset for learning technical specifications. Sentence Classification: We report the performance of the models on the sentence classification task in Table 3 and Figure 7. [10] showed that for sentence classification with relatively few classes, BERT, ROBERTa, and XLNet perform best. Therefore, we pre-trained these three language models (BERT-base, ROBERTa-base, and XLNet-base) on the SPEC5G dataset. These models, along with the other baselines, were then fine-tuned on the 5GSC dataset to compare their classification performance. We observe that BERT5G, ROBERTa5G, and XLNet5G outperform their corresponding baselines by a significant margin. Additionally, BERT5G outperforms all other models in precision and F1-score, and XLNet5G has the highest recall. Interestingly, the baseline GPT2 has the highest accuracy; despite that, we do not choose GPT2 for pre-training. The reason is discussed in Section 6.

RESULT ANALYSIS
Here we analyze the results of our experiments. Performance Improvements Due to SPEC5G: The primary objective of our work is to introduce an anchor 5G dataset that can pave the way for future NLP research on 5G. The models pre-trained on SPEC5G and fine-tuned on the respective tasks achieve significant performance improvements, suggesting that such a dataset should be considered the gold standard for pre-training models before deploying them for more sophisticated, 5G-oriented NLP applications.
Best Pre-trained Model: Evidently, from the scores on both tasks, BERT5G is the best-performing model. It is not surprising that XLNet5G, the pre-trained version of the more recent BERT variant XLNet, is a close competitor. While GPT2 remains a good choice for our summarization task, we do not recommend it for security classification: despite a high accuracy score, GPT2 failed to achieve a contending recall or F1-score. We observe that GPT2 classifies the Non-Security samples well (53 out of 70 test samples correct), which dominate the dataset distribution, while poorly classifying samples from Security (11 out of 24 correct) and Undefined (4 out of 11 correct). This is the reason for its higher accuracy yet low recall and F1.

[Table 3: Performance of baseline models on the classification task over the 5GSC dataset; also plotted in Figure 7 for visual comparison.]

Results of Human Evaluation for Summarization:
We randomly sample 40 inferences generated by each pre-trained model and its non-pre-trained counterpart, along with the corresponding gold inferences. These inferences are then manually rated by three independent annotators using the human-evaluation metrics. As shown in Table 4, we observe that the fine-tuned models perform similarly on SPEC5G but fail to reach gold-annotation performance. Moreover, as expected, the pre-trained models significantly outperform their non-pre-trained counterparts. We provide some examples of the generated inferences in Figure 8. Inspection of the model-generated inferences reveals that keywords from the technical specifications appear more frequently in inferences generated by models pre-trained on SPEC5G.

ETHICAL CONSIDERATIONS
In this section, we discuss the ethical considerations while curating the SPEC5G, 5GSum, and 5GSC datasets.
In regards to the datasets being released, all information is in the public domain and is not subject to any copyrights. The dataset does not contain any sensitive information either. The 3GPP releases these specifications through various public statements and releases. We use these sources to create the dataset.
To pre-train, we use different language models. It has been reported that pre-trained masked language models encode unfair social biases, such as gender, racial, and religious biases [8]. Since we are dealing with a technical domain, we believe these biases do not affect our results. Furthermore, we manually inspected randomly sampled model outputs and found no evidence of such biases.
In the case of annotations, the annotators for SPEC5G are all Ph.D. students doing active research in the area of networks. These annotators were provided with specific guidelines (discussed in detail in Section 3.4) and were strictly asked not to write any toxic content (hateful or offensive toward any gender, race, sex, or religion). They were asked to consider gender-neutral settings whenever possible.

DISCUSSION
Here we discuss some limitations we faced and the reasons behind some of the choices we made.

Underspecifications in the Standards
In this paper, we introduce SPEC5G, a dataset aimed at the automated analysis of the 5G protocol, and show its usefulness in two downstream tasks. The performance on these downstream tasks, in turn, depends on the 5G standards themselves. In some cases, the standards are intentionally underspecified and contain ambiguities, mainly to give vendors flexibility in implementation design and performance enhancement. As a result, the SPEC5G dataset may include some of these underspecified behaviors from the standards. Such textual ambiguities can be resolved with human expertise, which is precisely how we handle the two downstream tasks in this paper. Alternatively, they could be addressed with NLP methods that exploit unlabeled data and human knowledge, a direction we plan to pursue in the future.

Automation
The aim of SPEC5G is to help automate the manual-intensive tasks of 5G protocol development, analysis, and testing using state-of-the-art NLP techniques. However, it is evident that fully automating such tasks is still not possible because of the manual annotation, which requires domain expertise. Despite the limited annotated data, we show that it is still possible to achieve fairly good results on two downstream tasks. While complete automation of 5G-related tasks may not be achievable, we hope SPEC5G can substantially reduce the manual effort that characterizes the current state of the art.

Choice of Pre-trained Models
The motivation behind our choice of models for pre-training, fine-tuning, and the downstream tasks was to demonstrate the quality of our dataset (SPEC5G). Hence, we only used pre-existing models for the downstream tasks and did not measure the performance of simple baselines such as the lead-3 extractive baseline (taking the first three sentences of an article as the summary) or the SummaRuNNer extractive model [37], nor did we make any other effort to improve downstream-task performance. We picked models that perform well on both downstream tasks. The chosen models are all encoder-only to maintain consistency between the experiments. Nevertheless, encoder-decoder or decoder-only models would also benefit from pre-training, as the dataset has proven effective in helping models learn technical specifications. For example, GPT-2 would likely also benefit from pre-training and outperform its base version, but here we chose only the three best-performing base models to pre-train.
It is well known that pre-training on domain-specific data can improve the performance of downstream tasks in that domain; BioBERT [28] and LEGAL-BERT [9] are two examples. In our case, however, when pre-trained with unprocessed data from the same specifications and website contents, the models' performance improvement was negligible compared to pre-training on the preprocessed final SPEC5G dataset. Even after the first step of preprocessing, BERT improved by only 2.73%, XLNet by 1.96%, and RoBERTa by 0.061% in F1 score. Thus, although pre-training is commonly known to improve downstream tasks, the proposed dataset evidently amplifies that effect. The performance improvement of the base models after pre-training on our dataset indicates that the models could learn and sufficiently generalize their knowledge of technical specifications, which is an indicator of a quality dataset.
The main purpose of designing the downstream tasks was solely to demonstrate the quality of the SPEC5G dataset; that is why we performed the downstream tasks only on the selected base models and on their pre-trained versions. Exploring the downstream tasks in detail, along with criteria for selecting models in the technical-specification domain, is a future research direction that our SPEC5G dataset will support.

Annotator Agreement
In total, nine annotators annotated the summarization dataset. Each was given 70 non-overlapping, distinct articles, so there was no disagreement between annotators. Another round of manual cleaning was done by two meta-annotators, who went through the whole dataset to ensure summarization quality and consistency; they also addressed the comments and suggestions made by the annotators in the first round and made the necessary changes (updates/deletions). For example, in the first round of annotation, annotators left comments such as "The paragraph is vague", "Independent sentences", "The paragraph does not have a logical flow. It cannot be further summarized", and "It is not clear what the paragraph is talking about". The meta-annotators addressed these comments by manually correcting or removing the articles.
For the classification task, three annotators (called A1, A2, and A3 here) separately annotated the dataset: A1 and A2 had 320 examples each, and A3 had 321 examples. In the second step, they were assigned to re-evaluate one another's annotations (A1 re-evaluating the labels assigned by A3, A2 re-evaluating A1, and A3 re-evaluating A2). These re-evaluations surfaced disagreements on several labels, which were finally resolved through joint discussion. For example, for the sentence "The AMF shall not indicate to the SMF to release the emergency PDU session.", A2 assigned the label Security while A3 assigned Undefined. This disagreement was later resolved by discussing the reasoning behind the respective labels.
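Pairwise re-evaluation of this kind is often accompanied by a quantitative agreement statistic. As an illustrative sketch only (we resolved disagreements by discussion and do not report a coefficient here), Cohen's kappa between two annotators' label lists can be computed with the standard library alone; the label values in any usage are made-up stand-ins, not our actual annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labeled the same sequence of items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labeled at random
    # according to their own observed label frequencies.
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

A kappa of 1.0 indicates perfect agreement, while 0.0 indicates agreement no better than chance, which is why kappa is preferred over raw percent agreement when label distributions are skewed.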

Downstream Task Dataset Size
The main goal of this work is to create a dataset of 5G network protocol specifications (SPEC5G). To demonstrate the quality of the dataset, we performed the downstream tasks with models trained on it and justified our claim that language models can learn technical specifications effectively from this dataset. While the downstream-task datasets may seem quite small, recent high-quality, manually annotated datasets have similar sizes: the COUGH dataset (1,236 labeled sentences) [62] and the YASO dataset (2,215 labeled sentences) [41]. Thus, the current size is comparable to its contemporaries. To overcome the selection bias of the relatively short test set, the test points were randomly sampled over three different runs of each model, and the models were run with three different random seeds, which showed low standard deviation in the performance metrics. The randomness in the test set therefore mitigates the selection bias. Moreover, this dataset can easily serve as a seed, alongside our trained models, for semi-automatic annotation with minimal human effort; our work enables this direction of using language models on technical specification documents. The summarization dataset was used only as a test set for models that can already summarize articles, measuring their performance on summarizing network protocol specifications. As expected, models pre-trained on the SPEC5G dataset performed significantly better than their base versions on this task because of their newly gained knowledge of network protocol specifications.
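The resampling-over-seeds protocol described above can be illustrated with a short stdlib-only sketch. This is a hedged illustration, not our evaluation code: `score_fn`, the numeric test pool, and the sample size are hypothetical stand-ins for a real model and test set.

```python
import random
import statistics

def evaluate_with_resampling(score_fn, test_pool, sample_size, seeds=(0, 1, 2)):
    """Score a model on independently resampled test subsets, one per seed,
    and report the mean and standard deviation of the resulting metric."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)              # deterministic sampling per seed
        subset = rng.sample(test_pool, sample_size)
        scores.append(score_fn(subset))
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical stand-ins: a pool of 320 test items and a dummy scoring function.
pool = list(range(320))
mean, std = evaluate_with_resampling(lambda s: sum(s) / len(s), pool, sample_size=40)
```

A low standard deviation across such resampled runs suggests the reported metric is not an artifact of one lucky test split, which is the argument the paragraph above makes.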

CONCLUSION
In this work, we created SPEC5G, a new dataset for 5G; 5GSum, an expert-annotated dataset for 5G protocol summarization; and 5GSC, an expert-annotated dataset for 5G security text classification. To show the usefulness of SPEC5G for protocol-specification learning by language models, we designed security sentence classification and summarization tasks for state-of-the-art language models to solve.

APPENDIX 11: PERFORMANCE EVALUATION OF DOWNSTREAM TASKS
We report the performance evaluation and the metrics used for it in this section.

Performance Metrics
For automatic evaluation, we use the commonly used ROUGE score.
ROUGE Score: ROUGE-N measures the number of matching n-grams between the model-generated text and a reference. An n-gram is simply a grouping of tokens/words: a unigram (1-gram) consists of a single word, and a bigram (2-gram) consists of two consecutive words. In ROUGE-N, N denotes the n-gram size being used. ROUGE-1 measures the match rate of unigrams between the model output and the reference; ROUGE-2 and ROUGE-3 use bigrams and trigrams, respectively.
Recall: Recall counts the number of overlapping n-grams found in both the model output and the reference, then divides this number by the total number of n-grams in the reference.

Precision: Precision is calculated in almost exactly the same way as recall, but rather than dividing by the reference n-gram count, the overlap is divided by the model-output n-gram count:

precision = overlapping n-grams / n-grams in model output

With both recall and precision available, the ROUGE F1 score is computed with the standard formula:

F1 = 2 * (precision * recall) / (precision + recall)

ROUGE-L: ROUGE-L measures the longest common subsequence (LCS) between the model output and the reference, counting the number of tokens in the longest sequence shared by both. The idea is that a longer shared sequence indicates greater similarity between the two texts. The recall and precision calculations apply just as before, but with the n-gram match replaced by the LCS length.
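The recall, precision, and F1 definitions above can be sketched in a few lines of pure Python. This is a minimal illustration of the formulas, assuming whitespace tokenization; it is not the official ROUGE implementation, which additionally applies stemming and other normalization.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 via clipped n-gram overlap counts."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())        # matching n-grams, clipped per type
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (for ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and F1 based on LCS length."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    precision = lcs / max(len(c), 1)
    recall = lcs / max(len(r), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, scoring the candidate "the cat sat" against the reference "the cat sat down" with ROUGE-1 gives precision 1.0 (all three candidate unigrams appear in the reference) and recall 0.75 (three of the four reference unigrams are covered).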

Examples
We list some example annotated data for both tasks.
Label 1: Security
Sentence 2: SIGN_VAR shall be included in the channel quality report.
Article 2: IMSI-catching attacks have threatened all generations (2G/3G/4G) of mobile telecommunication for decades. As a result of facilitating backward compatibility for legacy reasons, this privacy problem appears to have persisted. However, the 3GPP has now decided to address this issue, albeit at the cost of backward compatibility. In case of identification failure via a 5G-GUTI, unlike earlier generations, 5G security specifications do not allow plaintext transmissions of the SUPI over the radio interface. Instead, an Elliptic Curve Integrated Encryption Scheme (ECIES)-based privacy-preserving identifier containing the concealed SUPI is transmitted. This concealed SUPI is known as SUCI (Subscription Concealed Identifier).
Summary 2: Unlike earlier generations, in the case of identification failure via a 5G-GUTI, 5G security specifications do not allow plaintext transmissions of SUPI over the radio interface. Instead, an Elliptic Curve Integrated Encryption Scheme (ECIES)-based privacy-preserving identifier containing the concealed SUPI (also known as SUCI) is transmitted.
Article 3: A SUPI is usually a string of 15 decimal digits. The first three digits represent the Mobile Country Code (MCC), while the next two or three form the Mobile Network Code (MNC), identifying the network operator. The remaining (nine or ten) digits are known as the Mobile Subscriber Identification Number (MSIN) and represent the individual user of that particular operator. SUPI is equivalent to IMSI, which uniquely identifies the ME and is also a string of 15 digits.
Article 4: Next-generation 5G cellular systems will operate in frequencies ranging from around 500 MHz up to 100 GHz. Until now, with LTE and Wi-Fi technologies, we were operating below 6 GHz, and the channel models were designed and evaluated for operation at frequencies only as high as 6 GHz. The new 5G systems are to operate in bands above 6 GHz, where existing channel models are not valid; hence, accurate radio propagation models for these higher frequencies are needed, which requires new channel models. The requirements of the new channel model that can support 5G operation across frequency bands up to 100 GHz are based on the existing 3GPP channel models, along with extensions to cover additional 5G modeling requirements.
Summary 4: 5G will operate in frequencies ranging from around 500 MHz up to 100 GHz. Until now, 4G and Wi-Fi were operating below 6 GHz, and the channel models were designed and evaluated for operation at frequencies only as high as 6 GHz.
Article 5: Carrier Aggregation (CA) increases the bandwidth by combining several carriers. Each aggregated carrier is referred to as a Component Carrier (CC). 5G NR CA supports up to 16 contiguous and non-contiguous CCs with different numerologies in the FR1 and FR2 bands. A carrier aggregation configuration includes the type of carrier aggregation (intra-band, contiguous or not, or inter-band), the number of bands, and the bandwidth class. The CA Bandwidth Class is a letter that defines the minimum and maximum bandwidth along with the number of component carriers.