Hierarchical Catalogue Generation for Literature Review: A Benchmark

Scientific literature review generation aims to extract and organize important information from an abundant collection of reference papers and produce corresponding reviews, yet generated reviews often lack a clear and logical hierarchy. We observe that a high-quality catalogue-guided generation process can effectively alleviate this problem. Therefore, we present an atomic and challenging task named Hierarchical Catalogue Generation for Literature Review as the first step for review generation, which aims to produce the hierarchical catalogue of a review paper given various references. We construct a novel English Hierarchical Catalogues of Literature Reviews Dataset with 7.6k literature review catalogues and 389k reference papers. To accurately assess model performance, we design two evaluation metrics that measure informativeness and the semantic and structural similarity to the ground truth. Our extensive analyses verify the high quality of our dataset and the effectiveness of our evaluation metrics. We further benchmark diverse experiments on state-of-the-art summarization models like BART and large language models like ChatGPT to evaluate their capabilities, and discuss potential directions for this task to motivate future research.


Introduction
Today's researchers can publish their work not only in traditional venues like conferences and journals but also in e-preprint libraries and mega-journals, which are fast, convenient, and easy to access (Fire and Guestrin, 2019). This enables the rapid development and sharing of academic achievements. For example, statistics show that more than 50,000 publications appeared in 2020 in response to COVID-19 (Wang and Lo, 2021). As a result, researchers are overwhelmed by considerable reading given the explosive growth in the number of scientific papers, which calls for more focus on scientific literature review generation (Altmami and Menai, 2022).
Pioneer studies on scientific literature review generation explore citation sentence generation (Xing et al., 2020; Luu et al., 2020; Ge et al., 2021; Wu et al., 2021) and related work generation (Hoang and Kan, 2010; Hu and Wan, 2014; Li et al., 2022; Wang et al., 2022). However, these methods can only generate short summaries, while a literature review is required to provide a comprehensive and sufficient overview of a particular topic (Webster and Watson, 2002). Benefiting from the development of language modeling (Lewis et al., 2019), recent works directly attempt survey generation (Mohammad et al., 2009; Jha et al., 2015; Shuaiqi et al., 2022) but usually suffer from disorganized generation without hierarchical guidance. As illustrated in Figure 1(A), we take gpt-3.5-turbo, a large language model trained on massive amounts of diverse data including scientific papers, as the summarizer to generate scientific literature reviews. Direct generation can lead to disorganized reviews with content repetition and logical confusion.
Since hierarchical guidance is effective for text generation (Yao et al., 2019), as shown in Figure 1(B), we observe that the catalogue, which represents the author's understanding and organization of existing research, is also beneficial for scientific literature review generation. Therefore, the generation process can be divided into two steps: first generating a hierarchical catalogue, and then generating each part of the review. Figure 1(C) reveals that even state-of-the-art language models cannot produce a reliable catalogue, leaving a valuable and challenging problem for scientific literature review generation.
[Figure 1 appears here: three example reviews of "A Survey of Large Language Models" generated by gpt-3.5-turbo under settings (A), (B), and (C); see the Figure 1 caption later in this document.]

To enable the capability of generating reasonable hierarchical catalogues, we propose a novel and challenging task of Hierarchical Catalogue Generation for scientific Literature Review, named HiCatGLR, which is a first step towards automatic review generation. We construct the first benchmark for HiCatGLR by gathering the HTML format of survey papers' catalogues along with abstracts of their reference papers from Semantic Scholar. After meticulous filtering and manual screening, we obtain the final Hierarchical Catalogue Dataset (HiCaD) with 7.6k references-catalogue pairs, which is the first to decompose the review generation process and seek to explicitly model a hierarchical catalogue. The resulting HiCaD has an average of 81.1 reference papers for each survey paper, resulting in an average input length of 21,548 (Table 1), along with carefully curated hierarchical catalogues as outputs.
Due to the structural nature of catalogues, traditional metrics like BLEU and ROUGE cannot accurately reflect their generation quality. We therefore design two novel evaluation metrics for catalogue generation: Catalogue Edit Distance Similarity (CEDS) measures the similarity to the ground truth, and Catalogue Quality Estimate (CQE) measures the degree of catalogue standardization based on the frequency of catalogue template words in the results.
To evaluate the performance of various methods on our proposed HiCatGLR, we study both end-to-end (one-step) generation and step-by-step generation of the hierarchical catalogues, where the former generally works better and the latter allows more focus on target-level headings. We benchmark different methods under fine-tuned or zero-shot settings to observe their capabilities, including recent large language models. In summary, our contributions are threefold:
• We observe the significant effect of hierarchical guidance on literature review generation and propose a new task, HiCatGLR (Hierarchical Catalogue Generation for Literature Review), with the corresponding dataset HiCaD.
• We design evaluation metrics for the informativeness and structural accuracy of generated catalogues, whose effectiveness is ensured with detailed analyses.
• We study both fine-tuned and zero-shot settings to evaluate models' abilities on HiCatGLR, including large language models.
Datasets: HiCaD

We now introduce HiCaD, including the task definitions, data sources, and processing procedures. We also provide an overall statistical analysis.

Definitions
The input of this task is the combination of the title t of the target survey S and representative information of reference articles {R_1, R_2, ..., R_n}, where n is the number of reference articles cited by S. Considering the cost of data collection and experiments, we take abstracts as the representation of the corresponding references. Besides, we restrict each abstract to 256 words, where the exceeding part is truncated. The output is the catalogue C = {c_1, c_2, ..., c_k}, where k is the number of catalogue items. Each item c_i consists of a level mark l_i ∈ {L_1, L_2, L_3}, which represents the level of the catalogue item, and the content {w^i_1, w^i_2, ..., w^i_p} with p words. As shown in Figure 1, "Pre-training" is a first-level heading, "Architecture" is a second-level heading, and "Mainstream Architectures" is a third-level heading. In our experiments, we only keep up to third-level headings and do not consider lower-level headings.
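To make the input-output format concrete, the following is a minimal sketch of one references-catalogue pair as it could be represented in code; the field names are illustrative and not the released dataset schema.

```python
# Illustrative example of one HiCaD sample; field names are hypothetical,
# not the official dataset schema.
sample = {
    "title": "a survey of large language models",
    "references": [
        {"title": "attention is all you need",
         "abstract": "the dominant sequence transduction models ..."},  # truncated to 256 words
        # ... up to n reference papers
    ],
    "catalogue": [
        ("<L1>", "pre-training"),
        ("<L2>", "architecture"),
        ("<L3>", "mainstream architectures"),
        # ... k items, each a (level mark, heading text) pair
    ],
}
```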

Dataset Construction
Our dataset is collected from two sources: arXiv and Semantic Scholar. We keep papers containing the word "survey" or "review" in the title and remove ones with "book review" and "comments". We finally select 11,435 papers that are considered to be review papers. It is straightforward to use a crawler to get all 11,435 papers in PDF format according to the arxiv-id. However, extracting catalogues from PDF files is difficult because structural information is usually dropped during conversion. Therefore, we turn to ar5iv to get the papers in HTML format. This website serves articles from arXiv as responsive HTML web pages converted from LaTeX via LaTeXML. Since some authors do not upload their LaTeX source, we have to skip these papers and collect 8,397 papers.
For the output part, we obtain the original catalogues by cleaning up the HTML files. Then we replace the serial number of each heading with the level mark <L_i> using regular expressions. For the input, we collate the list of reference papers and only keep the valid papers whose titles and abstracts exist. We convert all words to lowercase for subsequent generation and evaluation. Finally, after removing data with fewer than 5 catalogue items or fewer than 10 valid references, we obtain 7,637 references-catalogue pairs. We count the fields to which each paper belongs (Table 6). We choose the computer science field with the largest number of papers for the experiments and split its 4,507 papers into training (80%), validation (10%), and test (10%) sets.
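As an illustration of the heading normalization step, the sketch below replaces a leading serial number with a level mark using a regular expression; the exact patterns used to build HiCaD are not specified, so this is only one plausible implementation.

```python
import re

def add_level_mark(heading: str) -> str:
    """Replace a leading serial number (e.g. '3.2.1') with a level mark <L1>/<L2>/<L3>.

    A rough sketch; the exact rules used for HiCaD may differ.
    """
    m = re.match(r"^\s*(\d+(?:\.\d+){0,9})\.?\s+(.*)$", heading)
    if not m:
        return heading                     # unnumbered headings are kept as-is
    depth = m.group(1).count(".") + 1      # '3' -> 1, '3.2' -> 2, '3.2.1' -> 3
    if depth > 3:
        return ""                          # headings below the third level are dropped
    return f"<L{depth}> {m.group(2).lower()}"

print(add_level_mark("3.2 Architecture"))  # -> "<L2> architecture"
```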

Dataset Statistics and Analysis
Taking popular scientific datasets as examples, we present the characteristics of different multi-document scientific summarization tasks in Table 1. Multi-XScience, proposed by Lu et al. (2020), focuses on writing the related work section of a paper based on its abstract, with 4.4 articles cited on average. BigSurvey-MDS is the first large-scale multi-document scientific summarization dataset using review papers' introduction sections as targets (LIU et al., 2022), whereas previous work usually takes the related work section as the target. Both BigSurvey-MDS and our HiCatGLR task have more than 70 references, resulting in over 10,000 words of input, while their outputs are still on the scale of a standard text paragraph, similar to Multi-XScience. A natural difference between our task and others is that our output contains hierarchical structures, which place high demands on logic and conciseness for generation.
To measure how abstractive our target catalogues are, we present the proportion of novel n-grams in the target summaries that do not appear in the source (Table 2). The abstractiveness of HiCaD is lower than that of BigSurvey-MDS and Multi-XScience, which suggests that writing catalogues focuses on extracting keywords from references. This conclusion is in line with the common sense that a literature review is closer to reorganization than to innovation. Therefore, our task especially challenges summarizing ability rather than generative ability.
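For reference, novel n-gram proportions of the kind reported in Table 2 can be computed roughly as follows; the whitespace tokenization is an assumption and may differ from the paper's setup.

```python
def novel_ngram_ratio(source: str, target: str, n: int = 2) -> float:
    """Proportion of target n-grams that never appear in the source.

    A simple whitespace-tokenized sketch of the abstractiveness measure.
    """
    def ngrams(text: str):
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    src = set(ngrams(source))
    tgt = ngrams(target)
    if not tgt:
        return 0.0
    return sum(g not in src for g in tgt) / len(tgt)
```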
We also analyze the share of each level of the catalogue in the whole catalogue. Figure 2 shows the value and proportional relationship of the average number of catalogue items as well as the average word length at different levels. It can be seen that the second-level headings have the most weight, accounting for 44.32% of the average number of items and 48.50% of the average word length. Table 3 shows the weight of the headings at each level in the catalogue from the perspective of word coverage. We calculate ROUGE scores between different levels of headings (L_1, L_2, L_3) and the overall catalogue "Total". Similar to the above, the second-level headings have the highest ROUGE-1 score of 57.9 against the entire catalogue. Moreover, the ROUGE-1 score of 18.5 between the first- and second-level headings (L_1-L_2) indicates some overlap between the two levels.
The low ROUGE scores of L_1-L_3 and L_2-L_3 reveal that there are indeed different usage and wording distributions between different levels.

Catalogue Quality Estimate
There are some fixed template headings in a catalogue, such as introduction, methods, and conclusion, which usually do not carry information about domain knowledge or the references. The larger the percentage of template words in the catalogue, the less valid information it provides. Therefore, the proportion of template words can indicate the information content of the catalogue to some extent. We collate a list of template words and calculate the percentage of template words among the catalogue items as the Catalogue Quality Estimate (CQE). The CQE metric thus measures the informativeness of the generated catalogue through template-word statistics. Table 7 lists all template words. The CQE of oracle catalogues in the test set is 11.1%.
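A minimal sketch of CQE under one plausible reading (counting template headings among catalogue items) is given below; the official implementation may instead count template words, and the template list here is only a small subset of Table 7.

```python
TEMPLATE_WORDS = {"introduction", "background", "conclusion", "discussion",
                  "future work", "methods", "summary"}  # subset; Table 7 lists the full set

def cqe(catalogue: list[tuple[str, str]]) -> float:
    """Catalogue Quality Estimate: share of catalogue items that are template headings.

    One plausible reading of the metric; a lower value means more informative headings.
    """
    if not catalogue:
        return 0.0
    hits = sum(content.strip().lower() in TEMPLATE_WORDS for _, content in catalogue)
    return hits / len(catalogue)
```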

Catalogue Edit Distance Similarity
Catalogue Edit Distance Similarity (CEDS) measures the semantic and structural similarity between the generated and oracle catalogues. Traditional automatic evaluation methods in summarization tasks (such as ROUGE (Lin, 2004) and BERTScore (Zhang* et al., 2020)) can only measure semantic similarity. However, catalogues are texts with a hierarchical structure, so the level of headings in catalogues also matters.
We are inspired by the edit distance commonly used as a similarity metric for ordered labeled trees. An ordered labeled tree is one in which the order from left to right among siblings is significant (Zhang and Shasha, 1989). The tree edit distance (TED) between two ordered labeled trees T_a and T_b is defined as the minimum number of node edit operations that transform T_a into T_b (Paaßen, 2018). There are three edit operations on an ordered labeled tree: deletion, insertion, and modification, each with a cost of one. Analogously, the catalogue edit distance (CED) is defined as the minimum cost of item edit operations that transform a catalogue C_a into another catalogue C_b, where each entry in the catalogue is a node. The difference is that, for modification, we compute the cost between two items according to their similarity: Distance(x, y) = min(1, α × (1 − Similarity(x, y))).
We leverage BERTScore to obtain the similarity between two catalogue items based on embeddings from SciBERT (Beltagy et al., 2019); SciBERT is more suitable for scientific literature than BERT because it was pretrained on a large multi-domain corpus of scientific publications. We set the hyperparameter α = 1.2 in our experiments. Based on CED, we define the Catalogue Edit Distance Similarity (CEDS) between the generated and oracle catalogues. We use the library Python Edit Distances proposed by Paaßen et al. (2015) for the implementation of the algorithms. A detailed example of node alignment and the conversion process between two catalogues is given in Appendix F.
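To illustrate the idea, the following is a simplified, sequence-based sketch of CED with the modification cost defined above; the actual metric operates on an ordered labeled tree via the Python Edit Distances library, and the SciBERT-based BERTScore similarity is stubbed out here.

```python
def catalogue_edit_distance(gen, ref, alpha=1.2, similarity=None):
    """Simplified, sequence-based sketch of Catalogue Edit Distance (CED).

    Each catalogue item is a (level, text) node. Insertion/deletion cost 1;
    modification cost min(1, alpha * (1 - similarity)). The real metric runs
    on an ordered labeled tree; this flat dynamic program is only an
    approximation for illustration.
    """
    if similarity is None:
        # Placeholder: the paper uses BERTScore with SciBERT embeddings here.
        similarity = lambda x, y: 1.0 if x == y else 0.0

    def sub_cost(a, b):
        if a[0] != b[0]:   # level mismatch: approximate the tree-structure constraint
            return 1.0
        return min(1.0, alpha * (1.0 - similarity(a[1], b[1])))

    m, n = len(gen), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # delete
                          d[i][j - 1] + 1,                                   # insert
                          d[i - 1][j - 1] + sub_cost(gen[i - 1], ref[j - 1]))  # modify
    return d[m][n]
```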

Models
We now introduce our explorations in hierarchical catalogue generation for literature review, including end-to-end and step-by-step approaches.

End-to-End Approach
One of the main challenges in generating the catalogue is handling long input contexts. The intuitive and straightforward method is an end-to-end model, and we experiment with two models that specialize in processing long text. We consider an encoder-decoder model for the end-to-end approach. The model takes the title of a survey and the information from its reference papers as input before generating the survey catalogue. The title and references are concatenated together and fed to the encoder. However, existing transformer-based models with O(N^2) computational complexity show an explosion in computational overhead as the input length increases significantly.
We choose Fusion-in-Decoder (FiD) (Izacard and Grave, 2021), a framework specially designed for long contexts in open-domain question answering, to handle end-to-end catalogue generation. As shown in Figure 3, the framework processes the combination of the survey title and the information from each reference paper independently in the encoder. The decoder then attends to the concatenation of all representations from the encoder. Unlike previous encoder-decoder models, the FiD model processes papers separately in the encoder. Therefore, the input can be extended to a large number of contexts since self-attention is only performed over one reference paper at a time. This allows the computation time of the model to grow linearly with the length of the input. Besides, the joint processing of reference papers in the decoder better facilitates the interaction of multiple papers.
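The sketch below illustrates the fusion-in-decoder idea on top of BART from HuggingFace Transformers: each (title, reference) pair is encoded independently and the concatenated encoder states are fused in the decoder. It is a toy version under stated assumptions, not the SEGENC-based implementation used in the paper; `title`, `reference_abstracts`, and the label string are placeholders.

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

title = "a survey of domain adaptation for neural machine translation"
reference_abstracts = ["abstract of reference 1 ...", "abstract of reference 2 ..."]

# Encode each (title, reference) pair independently -> (n_refs, seq_len, hidden)
passages = [f"{title} </s> {abstract}" for abstract in reference_abstracts]
enc = tok(passages, padding=True, truncation=True, max_length=512, return_tensors="pt")
encoder_out = model.get_encoder()(input_ids=enc.input_ids,
                                  attention_mask=enc.attention_mask)

# Fuse: concatenate all passage representations into one long sequence for the decoder
hidden = encoder_out.last_hidden_state.reshape(1, -1, model.config.d_model)
mask = enc.attention_mask.reshape(1, -1)

# Teacher-forced step on a placeholder target catalogue
labels = tok("<L1> introduction <L1> related work", return_tensors="pt").input_ids
loss = model(encoder_outputs=(hidden,), attention_mask=mask, labels=labels).loss
```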

Step-by-Step Approach
Another main challenge in generating the catalogue is modeling relationships between catalogue items at different levels. Motivated by Tan et al. (2021), we explore an effective approach for incremental generation. Progressive generation divides the complicated problem of generating a complete catalogue into more manageable steps, namely the generation of the hierarchical levels of the catalogue. Different from generating everything in one step, progressive generation allows the model to perform high-level abstract planning and then shift attention to increasingly concrete details. Figure 5 illustrates the generation process. Concretely, in the first step, we input the survey title and the information about its reference documents and generate the first-level headings L_1 of the catalogue. Then, L_1 is added to the input, and the next step generates the first two levels of headings L_{1,2}. Finally, the whole catalogue is generated in the same manner. The generation process corresponds to a decomposition of the conditional probability as P(C | x) = P(L_1 | x) · P(L_{1,2} | x, L_1) · P(C | x, L_{1,2}), where x denotes the input and L_i denotes the i-th level headings of the target catalogue.
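A schematic of this progressive loop is sketched below; `generate` stands for any conditional generator (a fine-tuned model or an LLM call), and the prompt wording is purely illustrative.

```python
def generate_catalogue_stepwise(generate, title, references):
    """Progressive catalogue generation sketch: level-1 headings first, then
    levels 1-2, then the full three-level catalogue.

    `generate(prompt)` is a placeholder for any conditional text generator.
    """
    context = f"title: {title}\nreferences: {references}"

    level1 = generate(f"{context}\ngenerate the first-level headings:")
    level12 = generate(f"{context}\nfirst-level headings: {level1}\n"
                       f"generate the first- and second-level headings:")
    full = generate(f"{context}\nheadings so far: {level12}\n"
                    f"generate the full three-level catalogue:")
    return full
```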

Experiments
We study the performance of multiple models on the HiCaD dataset. A detailed analysis of the generation quality is provided, including validation of the correlation of the proposed evaluation metrics with human evaluation and with ROUGE.

Baselines
Due to the large input length, we choose an encoder-decoder transformer model that can handle long text and its backbone model to implement FiD, besides various extractive models in our experiments. (I) LexRank (Erkan and Radev, 2004) is an unsupervised extractive summarization approach based on graph-based centrality scoring of sentences. (II) TextRank (Mihalcea and Tarau, 2004) is a graph-based ranking algorithm improved from Google's PageRank (Page et al., 1999) for keyword extraction and document summarization, which uses co-occurrence information (semantics) between words within a document to extract keywords. (III) BART (Lewis et al., 2019) is a pre-trained sequence-to-sequence Transformer model trained to reconstruct the original input text from corrupted text with a denoising auto-encoder. (IV) Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020) is built on BART but adopts sparse global attention mechanisms in the encoder, which relaxes the input size limitation of the BART model from 1,024 tokens to 16,384 tokens.

Implementation Details
We use dropout with probability 0.1 and a learning rate of 4e-5. The optimizer is Adam with β1 = 0.9 and β2 = 0.999. We also adopt learning rate warmup and decay. During decoding, we use beam search with a beam size of 4 and tri-gram blocking to reduce repetition. We adopt the implementations of LED from HuggingFace's Transformers (Wolf et al., 2020) and FiD from SEGENC (Vig et al., 2022). We set the maximum input length to 16,384 tokens. All models are trained on one NVIDIA A100-PCIE-80GB GPU.
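For orientation, the reported hyperparameters map roughly onto the following HuggingFace configuration sketch; values not stated in the text (batch size, epochs, warmup steps, output length) are placeholders.

```python
from transformers import LEDForConditionalGeneration, Seq2SeqTrainingArguments

# Rough mapping of the reported hyperparameters; placeholders are marked.
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")
model.config.dropout = 0.1
model.config.num_beams = 4
model.config.no_repeat_ngram_size = 3   # tri-gram blocking
model.config.max_length = 512           # placeholder output length

args = Seq2SeqTrainingArguments(
    output_dir="hicad-led",
    learning_rate=4e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    warmup_steps=500,                    # placeholder; "warmup and decay" per the paper
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,       # placeholder
    num_train_epochs=5,                  # placeholder
)
```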

Results
As shown in Table 4, we calculate the ROUGE scores, BERTScore, Catalogue Edit Distance Similarity (CEDS), and Catalogue Quality Estimate (CQE) between the generated catalogues and the oracle ones. We remove all level mark symbols (e.g., <L_1>) for evaluation. To compare the ability of different methods to generate headings at each level in detail, we also calculate ROUGE scores between each level of generated headings and the corresponding level of the oracle ones.
We first analyze the performance of traditional extractive approaches. Since these extractive models cannot generate catalogues with an explicit hierarchy as abstractive methods do, we only calculate the ROUGE scores of the extracted results against the entire oracle catalogues (column Total in Table 4). We take the ROUGE scores (11.9/4.5/11.4) between the titles of literature reviews and the entire oracle catalogues as a threshold, because titles are the most concise and precise content related to reviews. LexRank (10.5/1.7/10.0) and TextRank (12.9/1.2/12.1) only achieve results similar to this threshold. This means extractive methods are not suitable for hierarchical catalogue generation.
Comparing all evaluation scores, the end-to-end approach achieves higher similarity with the ground truth than the step-by-step approach on the whole catalogue. Generally, models with more parameters perform better. However, there are still duplication and hierarchy errors in the current results (see case studies in Appendix F).
Large language models have shown excellent performance on many downstream tasks due to massive and diverse pre-training. We also test two representative large language models: Galactica (Taylor et al., 2022) and ChatGPT (gpt-3.5-turbo). Galactica (GAL) is a large language model for science that achieves state-of-the-art results on many scientific tasks. Its corpus includes over 48 million papers, textbooks, scientific websites, and so on. ChatGPT is the model best recognized by the community and is also trained on an ultra-large-scale corpus. The corpora used by these two models in the pre-training phase contain scientific literature, and thus we consider that the models' knowledge covers the reference papers. In terms of evaluation results, instruction understanding, and answer readability, ChatGPT generates far better results than GAL. It is worth noting that large language models cannot outperform models (LED-large) specially trained for this task. This reveals that simply stacking knowledge and parameters may not be a good solution for catalogue generation, which requires stronger capabilities in logic and induction. See Appendix D for specific details and analysis.

Human Evaluation
To demonstrate the validity of our proposed evaluation metrics, we conduct consistency tests between CEDS and human evaluation. We generate 50 catalogues for each implementation, for a total of 450 samples, and each sample is evaluated by three professional evaluators. We skip Galactica-6.7b for subsequent experiments since it can hardly generate reasonable catalogues. Evaluators are required to assess the quality of the catalogues based on their similarity (ranging from one to five, where five represents the most similar) to the oracle ones.
First, we test the human evaluations and the corresponding CEDS values for normality. Under the Shapiro-Wilk test, the p-values for the two sets are 0.520 and 0.250, both greater than 0.05 (Table 8). That means these groups of data can be considered normally distributed, which enables further Pearson correlation analysis. The Pearson correlation analysis shows that the p-value between CEDS and the human scores is 0.027, smaller than 0.05 (Table 9). The r-value is 0.634, which represents a strong positive correlation. Therefore, we consider CEDS a valid evaluation indicator.
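The tests above correspond to standard SciPy calls, sketched below with made-up placeholder scores rather than the actual evaluation data.

```python
from scipy import stats

# Placeholder scores for illustration only; not the paper's evaluation data.
human_scores = [3.1, 2.4, 4.0, 3.6, 2.8, 3.3, 2.2, 3.9, 3.0, 2.6, 3.4, 3.7]
ceds_scores  = [0.41, 0.33, 0.55, 0.49, 0.37, 0.44, 0.30, 0.52, 0.40, 0.35, 0.45, 0.50]

# Shapiro-Wilk: p > 0.05 -> no evidence against normality, so Pearson is applicable
print(stats.shapiro(human_scores))
print(stats.shapiro(ceds_scores))

# Pearson correlation between CEDS and human judgements
r, p = stats.pearsonr(ceds_scores, human_scores)
print(f"r = {r:.3f}, p = {p:.3f}")
```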

Automatic Evaluation
We also conduct pair-wise consistency tests with Pearson's correlation coefficient between all the measured automatic indicators (Table 5), where only ROUGE-L among the ROUGE variants is computed.
First, a noticeable trend is that the ROUGE-L of the first-level headings (L1RL) is not correlated with any other metric. We infer that this is due to the ease of generating first-level headings, which perform similarly across methods and reach a bottleneck. The second- and third-level headings (L2RL, L3RL) exhibit an extremely strong positive correlation with the total catalogue (TotalRL), which suggests that the effectiveness of generating second- and third-level headings drives the overall performance of catalogue generation. Second, the Catalogue Quality Estimate (CQE) negatively correlates with ROUGE-L at three levels (L2RL, L3RL, TotalRL). This indicates that domain knowledge and reference information are mainly found in the second- and third-level headings, which is in line with human perception. Finally, we study the difference between TotalRL and BERTScore. We find that TotalRL and BERTScore are not correlated with each other, but each is correlated with CEDS. This means that CEDS can combine both the ROUGE and BERTScore signals and better describe the similarity between generated results and oracle ones.

Related Work
Survey generation. Our work belongs to multi-document summarization, which aims to reduce the reading burden of researchers. Survey generation is the most challenging task in multi-document scientific summarization. Early work was mainly extractive, based on various content selection approaches (Mohammad et al., 2009; Jha et al., 2015). The results generated by unsupervised selection models have significant coherence and duplication problems.
To make the generated results have a clear hierarchy, Hoang and Kan (2010) additionally input an associated topic hierarchy tree that describes a target paper's topics to drive the creation of an extractive related work section. Sun and Zhuge (2019) propose a template-based framework for automatic survey paper generation. It allows users to compose a template tree that consists of two types of nodes, dimension nodes and topic nodes. Shuaiqi et al. (2022) train classifiers based on BERT (Devlin et al., 2018) to conduct category-based alignment, where each sentence from academic papers is annotated into five categories: background, objective, method, result, and other. Next, each research topic's sentences are summarized and concatenated together. However, these template trees are either inflexible or require users with background knowledge to provide them, which is not friendly to beginners. This also defeats the original purpose of automatic summarization, which is to aid reading.
Text structure generation. There are some efforts involving the automatic generation of text structures. Xu et al. (2021) employ a hierarchical model to learn structure among paragraphs, selecting informative sentences to generate surveys, but this structure is not explicit. Tan et al. (2021) generate long text by producing domain-specific keywords and then refining them into complete passages; the keywords can be considered a structural guide for the subsequent generation. Trairatvorakul et al. reorganize the hierarchical structures of three to six articles to generate a hierarchical structure for a survey. Fu et al. (2022) generate slides from a multi-modal document; they generate slide titles, i.e., the structure of one target document, via summarizing and placing objects. These efforts do not really model the relationships between multiple documents, which is far from the hierarchical structure of a review paper. This paper focuses on the generation of a hierarchical catalogue for a literature review.

Traditional automatic evaluation methods for summarization calculate similarity among unstructured texts from the perspective of overlapping units and contextual word embeddings (Papineni et al., 2002; Banerjee and Lavie, 2005; Lin, 2004; Zhang* et al., 2020; Zhao et al., 2019). Tasks involving structured text generation are mostly measured by accuracy, such as SQL statement generation (He et al., 2019; Wang et al., 2019), which ignores semantic information. Assessing the similarity of catalogues to standard answers should consider both structural and semantic information.

Conclusion
In this work, we observe the significant effect of hierarchical guidance on literature review generation, introduce a new task called hierarchical catalogue generation for literature review (HiCatGLR), and develop the HiCaD dataset from arXiv and the Semantic Scholar dataset. We empirically evaluate two methods with eight implementations for catalogue generation and find that the end-to-end model based on LED-large achieves the best results. CQE and CEDS are designed to measure the quality of the generated results in terms of informativeness, semantics, and hierarchical structure. Our analysis illustrates how CEDS resolves some of the limitations of current metrics. Dataset and code are available at https://github.com/zhukun1020/HiCaD. In the future, we plan to explore a better understanding of the input content and solutions to heading-level errors, repetition, and semantic conflict between headings.

Limitations
The hierarchical catalogue generation task can help with literature review generation. However, there is still much room for improvement at the current stage, especially in generating the second- and third-level headings, which requires a comprehensive understanding of all corresponding references. Besides, models will need to understand each reference completely rather than just its abstract in the future. Our dataset, HiCaD, does not cover a comprehensive range of domains. In the future, we need to continue to expand the review data to other fields, such as medicine. Currently, we only experiment on a single domain due to the limitations of the collected data resources, whereas knowledge transfer across domains matters for catalogue generation.

D.1 Galactica
Galactica has five sizes ranging from 125M to 120B parameters, and the standard 6.7B size is the largest we can use. We conducted a one-shot test on our test set. Two generation samples are shown in Figure 4. It can be seen that there is a lot of repetition in the results.

D.2 ChatGPT
The prompt we use to generate a catalogue with ChatGPT is: Your task is to write a table of contents for the review paper by recalling relevant papers, organizing and classifying them according to the given review paper topic. Only the first, second and third level headings need to be written, no detailed explanation is required. Please ensure that your catalogue is well-structured, clear, and concise, and accurately represents the topic's main research findings and methodologies.
Title: A Survey of Large Language Models
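A sketch of how this prompt could be issued to gpt-3.5-turbo through the ChatCompletion endpoint of the legacy OpenAI Python client is shown below; the decoding parameters actually used are not reported, and the API key is a placeholder.

```python
import openai

openai.api_key = "sk-..."  # placeholder

prompt = (
    "Your task is to write a table of contents for the review paper by recalling "
    "relevant papers, organizing and classifying them according to the given review "
    "paper topic. Only the first, second and third level headings need to be written, "
    "no detailed explanation is required. Please ensure that your catalogue is "
    "well-structured, clear, and concise, and accurately represents the topic's main "
    "research findings and methodologies.\n\n"
    "Title: A Survey of Large Language Models\n"
    "Table of contents:"
)

# Sketch only: default decoding settings; the paper does not state temperature etc.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```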

F Case Study
Table 10 and Figure 6 present examples of the best catalogues generated by the end-to-end and step-by-step models for the title "a survey of domain adaptation for neural machine translation". Table 10 gives an example of the alignment between catalogue items when calculating the Catalogue Edit Distance (CED). If an item does not match any other item, the required action for this item is insertion or deletion, so the cost is 1. If an item matches another item, the required action is modification, whose cost is calculated according to the similarity. It is worth noting that Node[2] in the generated result does not match Node[1] in the ground truth, even though they have exactly the same content, because the former is a secondary heading and the latter is a primary heading. If the alignment were forced, the cost would be greater than in the current result. This example shows that CEDS not only measures catalogues from a semantic perspective but also takes into account the hierarchical structure of the catalogue.
The parts marked in red in Figure 6 are the generation problems we observe. The first problem is duplicated content within a single catalogue item, e.g. "monolingual monolingual data". The second is a hierarchy error between sibling nodes; for example, "evaluation metrics" conceptually contains "automatic evaluation". Finally, the heading "applications of nmt domain adaptation" recurs. In summary, there are still duplication and hierarchy errors in the current results.

Figure 2: The value and proportional relationship of the average number of catalogue items as well as average word length at different levels.

Figure 3: Architecture of the end-to-end approach.

Figure 4: Two catalogue generation results of GAL model.

Figure 5: Architecture of the step-by-step approach.

Figure 6: Examples of end-to-end and step-by-step.
Figure 1: Examples of scientific literature review generated by gpt-3.5-turbo, given the title "A Survey of Large Language Models." (A) represents direct generation with the title only. Duplicated contents (underlined) exist in two different chapters without proper hierarchical guidance. Besides, the title "Gaps in Knowledge" should be part of "Critique of LLMs" rather than alongside it. (B) denotes the review based on both the title and the oracle (human-written) catalogue. Under the guidance of the hierarchical catalogue, the model can generate transparent and rational reviews. (C) consists of two steps: first generating a pseudo catalogue given the title and then obtaining the entire review. It is apparent that the quality of the generated catalogue is not satisfying, since traditional language modeling methods, "statistical" and "rule-based," should have belonged to the chapter "Development of LLMs" rather than "Types of LLMs." This consequently causes the degeneration of reviews due to the unreliable catalogue.

Table 1: Comparison of our HiCaD dataset to other multi-document scientific summarization tasks and their datasets. Pairs means the number of examples. Refs stands for the average number of input papers per sample. Sents and Words indicate the average number of sentences and words in the input or output, calculated by concatenating all input or output sources. Form is the form of the output text.

Table 2: The proportion of novel n-grams in target summaries across different summarization datasets.

Table 3: ROUGE scores between different levels of headings. Total represents the whole catalogue.

Table 4: Automatic evaluation results on HiCaD. Bold indicates the best value in each setting. ‡ represents the second best result and † represents the third best.

Table 5: Results of Pearson's consistency test. ** represents a significant correlation at the 0.01 level (two-tailed). * denotes a significant correlation at the 0.05 level (two-tailed).

Table 7: All template words used to calculate the Catalogue Quality Estimate (CQE).


Table 9: Consistency evaluation results between the proposed metrics and the corresponding human judgements.