Efficient Mind-Map Generation via Sequence-to-Graph and Reinforced Graph Refinement

A mind-map is a diagram that represents the central concept and key ideas of a document in a hierarchical way. Converting plain text into a mind-map reveals its key semantic structure and makes it easier to understand. Given a document, the existing automatic mind-map generation method extracts the relationship of every sentence pair to build a directed semantic graph for the document. The computational cost grows rapidly with the length of the document, since the number of sentence pairs is quadratic in the number of sentences. Moreover, it is difficult to capture the overall semantics. To address these challenges, we propose an efficient mind-map generation network that converts a document into a graph via sequence-to-graph. To guarantee a meaningful mind-map, we design a graph refinement module that adjusts the relation graph in a reinforcement learning manner. Extensive experimental results demonstrate that the proposed approach is more effective and efficient than the existing methods: the inference time is reduced by thousands of times. The case studies verify that the generated mind-maps better reveal the underlying semantic structures of the document.


Introduction
A mind-map is a hierarchical diagram that depicts the central concept linked with the major ideas, with further ideas branching out from these (Kudelić et al., 2011). It is organized in cognitive structures and is much easier to understand than plain text (Dhindsa et al., 2011). Thus, in practice, it can be utilized for education, resource organization, and planning. Many tools, such as FreeMind, MindGenius, and Visual Mind, can help people make mind-maps manually (Kudelić et al., 2011). To save human labor, automatic methods have been proposed to generate a mind-map from text. Early methods focus on analyzing the semantic relations within a sentence by pre-defined rules (Brucks and Schommer, 2008; Rothenberger et al., 2008) or a syntactic parser (Elhoseiny and Elgammal, 2012). Recently, researchers propose to generate a mind-map automatically by detecting the semantic relations across sentences in the document. This mines the structured diagram of the document, in which a node represents the meaning of a sentence (in the form of the entire sentence or its keywords) and an edge represents the governing relationship between a precursor and its successor. We illustrate the two types of mind-map in Figure 1, i.e. the salient-sentence-based mind-map (SSM) and the key-snippet-based mind-map (KSM). A promising pipeline approach first converts the whole document into a relation graph and then prunes the extra edges to obtain the mind-map (see Figure 2(a)). However, the first phase tends to be inefficient, since it needs to predict all the governing scores at the sentence-pair level (see Figure 2(b)). The number of sentence pairs grows quadratically with the length of the document, which raises the computational complexity. In addition, each governing score in the graph is computed separately, without considering the overall semantics of the document. The sequential information of all sentences might be helpful to mine the hierarchical and structured semantics.

* Honglei Guo is the corresponding author.
We propose an efficient mind-map generation network (EMGN) to address the above issues (see Figure 2(c)). The proposed method encodes all sentences sequentially and generates the graph via sequence-to-graph. This makes the first phase more efficient and allows multiple documents to be processed in parallel. Training the model requires relation labels for all sentence pairs in each graph, but manual annotation is costly. We therefore exploit DistilBert (Sanh et al., 2019) to automatically annotate a graph for each document, which provides pseudo labels to train our model. In advance, DistilBert is fine-tuned to detect the governing relation between two sentences; its performance indicates it can serve as an "annotator" with high confidence.
Moreover, a meaningful mind-map tends to organize the major ideas of a document close to the root node. To achieve this goal, we design a graph refinement module to adjust the generated graph by using the documents with highlights. The highlights written by the editors summarize the key ideas of a document. During training, we leverage this prior human knowledge as a reference to refine the governing scores in the graph via self-critical reinforcement learning (Rennie et al., 2017).
In summary, the main contributions of this paper are as follows.
• We propose an efficient mind-map generation method that considers document-level semantics via sequence-to-graph.
• In the training phase, we design a graph refinement module that refines the generated graph by leveraging manual highlights and self-critical reinforcement learning.
• Extensive experimental results demonstrate that the proposed method generates better-performing mind-maps efficiently. The inference time is reduced by thousands of times compared with the existing approaches.
Methodology

Problem Definition

Given a document D = {s_1, s_2, ..., s_N} with N sentences, each sentence s_k = {w_k^1, w_k^2, ..., w_k^{L_k}}, where L_k is the length of the sentence. We define mind-map generation as a two-phase task, D → G → M,
where the input text D is first processed to obtain the relation graph G, and the graph is then pruned to obtain the final mind-map M. The detailed methodology for this two-phase task is described in the following sections. Concretely, we depict the network architecture of the first phase D → G in Figure 3. We generate the relation graph from a document with the graph detector (§2.2); the graph is simultaneously refined to make the generated mind-map more meaningful (§2.3). For the second phase G → M, we generate two types of mind-maps based on the graph (§2.5).

Graph Detector
As shown in Figure 3, the graph detector aims to extract the relation graph for an input document. It considers the overall semantics and obtains the graph efficiently.

Sentence Encoder
Given a sentence s_k, we first map it into an embedding sequence {e_k^1, e_k^2, ..., e_k^{L_k}} through a pre-trained GloVE embedding matrix (Pennington et al., 2014). Then we exploit a Bi-directional LSTM (BiLSTM) (Graves et al., 2013) to encode the embedding sequence, and obtain the sentence representation via a max-pooling operation over the hidden states.

Figure 3: The network architecture of the proposed approach for converting the document to a graph (Phase I). SE, DE, and S2G refer to the sentence encoder, document encoder, and sequence-to-graph modules, respectively.

Document Encoder
The sequential information of sentences indicates the semantic coherence and logical structure of a document. This information is essential in understanding the entire document and extracting a clear mind-map. To model the sentence-level context, we encode the vector representations of all sentences {s k } N k=1 with another BiLSTM and obtain H = {h 1 , h 2 , ..., h N }.
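The two-level encoding described above can be sketched as follows. This is a minimal PyTorch illustration: the class name, vocabulary size, and dimensions are made up for the example, and the real model fine-tunes GloVE embeddings rather than using random ones.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sketch of the sentence encoder (word-level BiLSTM + max-pooling)
    followed by the document encoder (sentence-level BiLSTM)."""
    def __init__(self, vocab_size=1000, emb_dim=50, hidden=25):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # stand-in for GloVE
        self.sent_lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.doc_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                bidirectional=True)

    def forward(self, doc_tokens):
        # doc_tokens: (num_sentences N, max_len L) token ids of one document
        word_states, _ = self.sent_lstm(self.emb(doc_tokens))  # (N, L, 2h)
        sent_vecs = word_states.max(dim=1).values  # max-pool words -> (N, 2h)
        H, _ = self.doc_lstm(sent_vecs.unsqueeze(0))  # sentence-level context
        return H.squeeze(0)                           # (N, 2h)

enc = HierarchicalEncoder()
H = enc(torch.randint(0, 1000, (6, 12)))  # 6 sentences, 12 tokens each
print(H.shape)                            # torch.Size([6, 50])
```

The whole document is processed in one forward pass, which is what lets the method batch multiple documents rather than enumerating sentence pairs.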

Sequence-to-Graph
In a graph G, a node represents a sentence from the document. G_{i,j} is the governing score between sentences s_i and s_j, which indicates the probability that s_i semantically implies s_j. The graph is directed, since G_{i,j} and G_{j,i} describe different governing relationships. Inspired by (Dozat and Manning, 2017; Zhang et al., 2019), we utilize sequence-to-graph to process the sentence-level sequence into a graph efficiently. Concretely, we first compute the representations of all sentences when they are regarded as the start or end nodes of edges. Exploiting separate parameters helps learn distinct representations for a sentence.
where MLP is a linear transformation. Then we calculate the governing scores in G ∈ R^{N×N} with a bilinear or biaffine operation.

Figure 4: The graph refinement module adjusts the governing scores in G with the help of highlights. Sampling decisions and greedily selecting by argmax are both consistent with the mind-map detector (§2.5). This builds a bridge between learning a graph (Phase I) and extracting a mind-map from the graph (Phase II).
where Bilinear and Biaffine are defined as below.
where U and W are parameter matrices and b is the bias. σ is the sigmoid operation, guaranteeing that each governing score is between 0 and 1.
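The scoring step can be sketched in NumPy as below. The hidden size, parameter initialization, and the plain linear maps standing in for the MLPs are all illustrative, not the paper's actual values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N, d = 5, 8                      # N sentences, hidden size d (illustrative)
H = rng.normal(size=(N, d))      # contextual sentence representations

# Separate linear maps give each sentence distinct "start" and "end" views.
W_s, W_e = rng.normal(size=(d, d)), rng.normal(size=(d, d))
R_start, R_end = H @ W_s, H @ W_e

# Bilinear scoring: G[i, j] = sigma(r_i^start . U . r_j^end)
U = rng.normal(size=(d, d))
G_bilinear = sigmoid(R_start @ U @ R_end.T)

# Biaffine scoring adds pairwise linear terms and a bias.
w = rng.normal(size=(2 * d,))
linear = (R_start @ w[:d])[:, None] + (R_end @ w[d:])[None, :]
G_biaffine = sigmoid(R_start @ U @ R_end.T + linear + 0.1)

print(G_bilinear.shape)          # (5, 5)
```

All N² scores come out of a few matrix products over the N sentence vectors, instead of N² separate sentence-pair forward passes.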

Graph Refinement
According to (Buzan, 2006), a mind-map is organized as a tree structure with the central concept as its root node; the major ideas are connected directly to the root, and other ideas branch out from the major ideas. Therefore, a clear mind-map tends to organize the main opinions of a document close to the root node. To achieve this goal, we leverage the human-written highlights to refine the graph G via a reinforcement learning (RL) algorithm (Williams, 1992), specifically self-critical RL (Rennie et al., 2017). The main idea is depicted in Figure 4.
Concretely, the graph detector module can be considered an agent that follows a policy function to decide an action given a state. We regard an input document as the state and the extracted graph G as the action distribution. After we sample selected sentences over the graph, a delayed reward is calculated, indicating the similarity between the selected sentences and the highlights. Maximizing the expected reward helps refine the governing scores in the graph G. Next, we introduce the graph refinement module in detail.

Policy

The policy is described as below.
where Θ denotes the parameters of the graph detector, D is the document, and G is the extracted graph. We sample sentences over the graph as follows.

Sampled Decisions
The main reason why RL can improve the reward is that it accords with the trial-and-error process, which samples and updates the parameters accordingly. To bridge with the strategy in the second phase (§2.5), i.e. detecting a mind-map from a graph, we sample the upper nodes of the graph in the same way. At first, we sample a sentence as the root node of the mind-map.
where rowsum is row-wise summation. Its result is a salience score measuring how strongly a sentence governs all other sentences. A larger salience score indicates that a sentence is more likely to represent the key ideas of the document. We sample a root node based on the multinomial distribution g_0. Next, we remove the sampled root from the graph and cluster the remaining nodes into two sets, obtaining two subgraphs, i.e., G_1 and G_2. Similar to the root node, the two child nodes are sampled based on the distributions g_1 = softmax(rowsum(G_1)) and g_2 = softmax(rowsum(G_2)), respectively.
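The sampling procedure can be sketched as below (NumPy). The even split of the remaining nodes is a deliberate simplification standing in for the k-means clustering used in the actual pipeline.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
G = rng.random((6, 6))               # toy governing-score graph

# Root: the row-sum of G gives each sentence's salience as a governor.
g0 = softmax(G.sum(axis=1))          # multinomial distribution over sentences
root = rng.choice(len(g0), p=g0)

# Remove the root and split the rest into two subgraphs.  The paper
# clusters with k-means; an even split is used here purely for brevity.
rest = [i for i in range(G.shape[0]) if i != root]
half = len(rest) // 2
children = []
for idx in (rest[:half], rest[half:]):
    sub = G[np.ix_(idx, idx)]
    g = softmax(sub.sum(axis=1))     # g_1 and g_2 over the subgraphs
    children.append(idx[rng.choice(len(idx), p=g)])

print(root, children)
```

Sampling (rather than always taking the argmax) is what gives the RL refinement its exploration, as discussed above.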
The reason why we sample three sentences is that the average number of sentences in the highlights is around 3.55. We also found that sampling more nodes does not improve performance; a possible reason is that more upper nodes introduce noise when comparing with the highlights.

Reward

The definition of the reward is crucial for RL, as it determines the optimization objective. To ensure that the upper nodes of the mind-map represent the central concept and major ideas of the document, we treat the manual highlights as a reference. The ROUGE score (Lin, 2004) between the sampled decisions and the highlights A is used to define the reward. Among the multiple variants of the ROUGE score (Lin, 2004), ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) are the most commonly used. We employ the average of these ROUGE variants to define the reward function.
Assuming the sampled sentences are D_s with D_s ⊆ D, the reward is computed as follows.
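The reward can be sketched in plain Python as below. This is a simplified ROUGE with whitespace tokenization; the official ROUGE implementation includes stemming and other details omitted here.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def f1(overlap, hyp_count, ref_count):
    if overlap == 0:
        return 0.0
    p, r = overlap / hyp_count, overlap / ref_count
    return 2 * p * r / (p + r)

def rouge_n(hyp, ref, n):
    h, r = ngrams(hyp, n), ngrams(ref, n)
    if not h or not r:
        return 0.0
    overlap = sum(min(h.count(g), r.count(g)) for g in set(h))
    return f1(overlap, len(h), len(r))

def lcs_len(a, b):
    # dynamic-programming longest common subsequence, for ROUGE-L
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else \
                max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def reward(sampled_sentences, highlights):
    """Sim(D_s, A): average of ROUGE-1, ROUGE-2 and ROUGE-L F-scores."""
    hyp = " ".join(sampled_sentences).split()
    ref = " ".join(highlights).split()
    r1 = rouge_n(hyp, ref, 1)
    r2 = rouge_n(hyp, ref, 2)
    rl = f1(lcs_len(hyp, ref), len(hyp), len(ref))
    return (r1 + r2 + rl) / 3

print(reward(["the cat sat"], ["the cat sat"]))  # 1.0 for an exact match
```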
RL Refine Loss

According to (Sutton et al., 2000), the RL training objective is to maximize the expected reward. Therefore, we define the RL loss for graph refinement as minimizing the negative reward (see Eq. (9)). More concretely, assume the sampled sentences D_s = {a_0, a_1, a_2}, where a_0 is the root and a_1, a_2 are independent child nodes. Based on conditional independence (Dawid, 1979), we have p(D_s) = g_0 · g_1 · g_2, where g_i denotes the probability of the sampled sentence under the distribution g_i. When we only sample one sentence as the root node, p(D_s) = g_0.

Algorithm 1 (fragment): compute L_g by Eq. (11); compute L_r by Eq. (10); compute the joint loss L by Eq. (12); update the temporary batch loss L̄ ← L̄ + L; optimize Θ by L̄/|B_k|; repeat until performance on the validation set does not improve in 3 epochs.
To reduce the variance caused by sampling, we associate the reward with a reference reward, or baseline (Rennie et al., 2017), defined as b = Sim(D_b, A), where D_b is the set of sentences chosen greedily by the argmax operation on the multinomial distributions. With the likelihood-ratio trick, the optimization objective can be formulated as:
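The resulting self-critical objective for one sampled decision reduces to a few lines. The function name and the toy probabilities/rewards below are hypothetical, chosen only to illustrate the sign behavior.

```python
import numpy as np

def refine_loss(g_probs, sampled_reward, baseline_reward):
    """Self-critical loss: L_r = -(r - b) * log p(D_s), with
    p(D_s) = g_0 * g_1 * g_2 under conditional independence."""
    log_p = np.sum(np.log(g_probs))   # log of g_0 * g_1 * g_2
    return -(sampled_reward - baseline_reward) * log_p

# If the sampled decision beats the greedy baseline (r > b), minimizing
# the loss raises the sampled sentences' probabilities; if r < b, it
# lowers them.
loss = refine_loss([0.5, 0.4, 0.6], sampled_reward=0.7, baseline_reward=0.5)
print(loss > 0)   # True: r > b and log p < 0, so the loss is positive
```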

Training
We train the graph detector module by a combination of two optimizing objectives, i.e., fitting the pseudo graphs annotated by DistilBert and refining the generated graphs.
Since it costs too much to manually annotate the relation labels in the graph, we automatically annotate a pseudo graph Y with DistilBert (Sanh et al., 2019). In advance, DistilBert is fine-tuned on sentence pairs constructed from news articles. In this way, our method takes advantage of not only the prior knowledge in the pre-trained model but also the local semantic associations of sentence pairs. The fine-tuning details of DistilBert are introduced in §3.2.1. The proposed model fits the pseudo graph with a mean square error (MSE) loss.
where Y is the pseudo graph. Then we combine the MSE loss and graph refinement as an overall training objective to optimize the parameters Θ. The entire training process of the proposed model is described in Algorithm 1.
where λ balances the effect of graph refinement.
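Putting the two objectives together gives a short NumPy sketch. The reduction follows the description above (L = L_g + λ·L_r with λ = 0.01 as in the experiments); the toy matrices are illustrative.

```python
import numpy as np

def mse_loss(G, Y):
    """L_g: mean squared error between the predicted graph G and the
    pseudo graph Y annotated by DistilBert."""
    return np.mean((G - Y) ** 2)

def joint_loss(G, Y, rl_refine_loss, lam=0.01):
    """L = L_g + lambda * L_r, with lambda balancing graph refinement."""
    return mse_loss(G, Y) + lam * rl_refine_loss

G = np.array([[0.0, 0.8], [0.2, 0.0]])   # predicted governing scores
Y = np.array([[0.0, 1.0], [0.0, 0.0]])   # pseudo labels
print(round(joint_loss(G, Y, rl_refine_loss=0.5), 4))  # 0.025
```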

Mind-Map Detector
In this section, we introduce how to generate a mind-map from a graph, i.e. G → M. The graph G covers all the sentences in the document and might therefore be redundant. To highlight the major ideas, we convert the graph into a mind-map through a previously proposed pruning strategy that removes the extra edges. The algorithm works recursively to determine the governing relationships of sentences. First, it chooses a governor by picking the highest row-wise aggregation score in the graph. Then, excluding the governor, it clusters the remaining nodes into two sub-groups with the k-means algorithm.
The sub-groups are processed recursively to extract the final mind-map; we enclose the full algorithm in the Appendix. We extract two types of mind-map, i.e. the salient-sentence-based mind-map (SSM) and the key-snippet-based mind-map (KSM). Given the graph G of a document, we first prune it into an SSM, and then extract the key phrases of each sentence (Rose et al., 2010) to obtain the KSM. Therefore, SSM and KSM share the same structure; the only difference is that a node in SSM is a sentence, while a node in KSM is its key phrases. In the case of KSM, if no key phrase is found, the whole sentence is kept in the mind-map.
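The recursive pruning can be sketched as follows (NumPy). Splitting the remaining nodes evenly by salience is a simplification of the k-means step; everything else follows the recursive governor-then-split scheme described above.

```python
import numpy as np

def extract_mindmap(G, nodes=None, edges=None, parent=None):
    """Recursively pick a governor, split the rest into two groups, and
    recurse.  The paper clusters with k-means; here the remaining nodes
    are split evenly by salience purely to keep the sketch short."""
    if nodes is None:
        nodes, edges = list(range(G.shape[0])), []
    if not nodes:
        return edges
    salience = {i: G[i, nodes].sum() for i in nodes}
    governor = max(nodes, key=salience.get)   # highest row-wise aggregate
    if parent is not None:
        edges.append((parent, governor))
    rest = sorted((i for i in nodes if i != governor), key=salience.get)
    half = len(rest) // 2
    for group in (rest[:half], rest[half:]):
        if group:
            extract_mindmap(G, group, edges, governor)
    return edges

rng = np.random.default_rng(2)
edges = extract_mindmap(rng.random((6, 6)))
print(len(edges))   # 5 edges: a tree over 6 nodes
```

Since every node except the top-level root gains exactly one parent edge, the result is always a tree.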

Dataset
We build an evaluation benchmark with 135 articles randomly selected from CNN news articles (Hermann et al., 2015; Cheng and Lapata, 2016). The benchmark contains about 98,181 words, and the average length of a news article is about 727 words. Two experts manually annotate the ground-truth mind-maps for these articles; if one of the experts disagrees with any content of a mind-map, they discuss until reaching a consensus. In the experiments, the benchmark is split into a testing set D_t with 120 articles and a validation set D_v with 15 articles.

Automatically Annotate Graphs for EMGN Training

Sentence Pairs for Fine-tuning DistilBert

To save time in fine-tuning and subsequent annotation, we choose DistilBert as the "annotator" to obtain the relationships of all sentence pairs in the graph.
To construct the training pairs for fine-tuning DistilBert, we first randomly select 90k CNN news articles D_news, which have no overlap with the benchmark. Each news article consists of content and highlights. Because the highlights summarize the major concepts of the news, they are regarded as governors. To find sentence pairs with governing relationships, we exploit TFIDF as the similarity metric: a highlight governs each sentence of a paragraph when it is similar to one or several sentences in that paragraph. The negative samples are generated randomly.
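The pairing heuristic can be sketched with a toy TF-IDF in plain Python. The tokenization, IDF smoothing, similarity threshold, and example texts below are all illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Toy TF-IDF over whitespace tokens (illustrative only)."""
    docs = [t.lower().split() for t in texts]
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    return [{w: c / len(d) * math.log((1 + n) / (1 + df[w]))
             for w, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    num = sum(u[w] * v.get(w, 0.0) for w in u)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

highlight = "senate poll shows a dead heat"
sentences = ["a new poll shows the race is a dead heat",
             "the weather was sunny in boston"]
vecs = tfidf_vectors([highlight] + sentences)

# The highlight "governs" the sentences it is sufficiently similar to
# (threshold 0.1 is an arbitrary choice for the example).
pairs = [(highlight, s) for s, v in zip(sentences, vecs[1:])
         if cosine(vecs[0], v) > 0.1]
print(len(pairs))   # 1: only the poll sentence is paired
```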
In this way, we build a large-scale training corpus with 641,476 pairs from these news articles. We split the pairs into 600k for training and 41,476 for testing.

Fine-tuning DistilBert

Using the training pairs, we fine-tune DistilBert for 3 epochs with a learning rate of 5e-5 and a training batch size of 32. The accuracy and F1 on the testing pairs are both above 99.35%. Thus, DistilBert can annotate pseudo graphs with high confidence.

Annotate Pseudo Graphs

We select 44,450 articles from D_news by setting the maximum sentence length in an article to 50 and the maximum number of sentences to 50. After annotating these articles with DistilBert, they are exploited to train our mind-map generation model EMGN.

Table 1: Evaluation results of the salient-sentence-based mind-map (SSM) and key-snippet-based mind-map (KSM) in terms of R-1 (%), R-2 (%), R-L (%) and the average score (%). The marker † refers to p-value<0.01 when comparing with DistilBert. The marker ‡ refers to p-value<0.01 when comparing with EMGN-GR.

Mind-Map Evaluation
We evaluate a generated mind-map by computing its tree similarity with the human-annotated mind-map. We first remove the weak edges from the generated mind-map to ensure that it has the same number of edges as the annotation. The similarity between two edges is computed as below, with Sim defined as in Eq. (7).
Then, for each edge in the annotation, the strategy finds the most similar edge in the generated mind-map. The final score is the average similarity over all greedily selected pairs.
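The evaluation can be sketched in plain Python. Because the paper's exact edge-similarity combination is not reproduced here, averaging the governor-side and successor-side similarities is an assumption, and the exact-match `sim` is only a stand-in for Sim in Eq. (7).

```python
def edge_sim(edge_a, edge_b, sim):
    """Similarity of two directed edges: compare governor with governor
    and successor with successor, then average (a simplifying assumption)."""
    return 0.5 * (sim(edge_a[0], edge_b[0]) + sim(edge_a[1], edge_b[1]))

def tree_similarity(gold_edges, pred_edges, sim):
    """For each annotated edge, greedily take the most similar predicted
    edge, then average over the annotated edges."""
    scores = [max(edge_sim(g, p, sim) for p in pred_edges)
              for g in gold_edges]
    return sum(scores) / len(scores)

# Toy sentence similarity: exact match (stand-in for Sim in Eq. (7)).
sim = lambda a, b: 1.0 if a == b else 0.0
gold = [("s0", "s1"), ("s0", "s2")]
pred = [("s0", "s1"), ("s3", "s2")]
print(tree_similarity(gold, pred, sim))  # (1.0 + 0.5) / 2 = 0.75
```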

Implementation Details
We initialize the word embeddings with 50-dimensional GloVE vectors (Pennington et al., 2014) and fine-tune them during training. All other parameters are initialized by sampling from a normal distribution N(0, 0.02). The hidden size of the BiLSTMs is set to 25×2. The models are optimized by Adam (Kingma and Ba, 2015) with a learning rate of 1e-4, a batch size of 64, and λ set to 0.01. We employ an early-stopping strategy: training stops if the evaluation score on the validation set D_v does not improve in 3 epochs, and the best model is selected for evaluation on the testing set D_t. For all baselines and our model, the reported results are the average scores of 5 runs.
The full results are presented in the Appendix.

Compared Methods
We validate the effectiveness of the proposed method by comparing with the following baselines.
• Random: We randomly sample a graph G for an input document. Each governing score G i,j ranges from zero to one.
• LexRank: It computes the governing score of a sentence pair as the cosine similarity of their TFIDF vectors, following the well-known LexRank algorithm (Erkan and Radev, 2004), an extension of PageRank for document summarization.
• MRDMF: This is the state-of-the-art semantic mind-map generation work. It presents a multi-perspective recurrent detector to extract the governing relationships and then prunes the extra edges.
• DistilBert (Sanh et al., 2019): It is a lighter version of BERT (Devlin et al., 2019). It provides the pseudo graphs for our method training.

Method Variants
The proposed full model is the efficient mind-map generation network (EMGN), with bilinear operation in the sequence-to-graph module. We explore the impact of individual modules by comparing with its variants. Minus (-) means removing the module from the full model.
• EMGN(root): It only samples the root node when refining the graph.
• EMGN(root)+greedy: It chooses the root node by greedily selecting the sentence with the maximum similarity to the highlights.
• EMGN-GR: It removes the graph refinement (GR) module from EMGN, leaving only the graph detector module for sequence-to-graph.

Experimental Results
Overall Results

The experimental results for the two types of mind-maps are displayed in Table 1. By comparing EMGN(root) and EMGN(root)+greedy, we see that EMGN(root) gains clear improvements. A possible reason is that EMGN(root)+greedy only greedily enlarges the governing scores for one specific sentence, i.e. the one with the maximum ROUGE similarity to the highlights, which ignores the exploration of other nodes. EMGN(root) performs better by sampling more sentences and comparing them relatively. Finally, EMGN performs slightly better than EMGN(root), showing that refining more upper nodes achieves better performance than refining only the root node.

Effects of the Document Length

Table 2 displays the evaluation results obtained by splitting the testing set D_t by document length, i.e. the number of sentences in a document. EMGN consistently achieves the best performance. Comparing the results on the two subsets, we see that the scores are highly related to the number of sentences in the article; extracting meaningful mind-maps for longer articles remains very challenging.

Further Analysis
Inference Time

To validate the efficiency of the proposed method, we compare the inference time on the testing set D_t and validation set D_v (see Table 3). Since all methods share Phase II, we only report the inference time of Phase I in Table 3 to show the merits of the proposed method. The total inference time of Phase II is around 23.54 seconds on D_t and 2.94 seconds on D_v.
We set the batch size of MRDMF and DistilBert to 256 (256 sentence pairs in a batch), and the batch size of the EMGN-related methods to 32 (32 documents in a batch). We observe that the inference time of the existing methods, e.g. MRDMF and DistilBert, is more than 3,000 times that of our method. As depicted in Figure 2, the main reason is that we significantly reduce the computational complexity of building the relation graph from the sentence-pair level to the document level.

Loss and Reward in Graph Refinement
The graph refinement aims to optimize the upper nodes of the mind-map to better reveal the major ideas of the document. We achieve this goal by optimizing the reward of the sampled decisions. In Figure 5, we plot the average loss L_r and average reward (Eq. 10) for each epoch of the training process. Figure 5(a) shows that the loss L_r gradually converges during training for both EMGN(root) and EMGN. In Figure 5(b), the reward gradually increases over the training epochs and finally reaches a relatively stable value. These training curves further prove that the proposed graph refinement module improves the similarity between the upper nodes and the human-written highlights.

Case Study

In Figure 6, we depict the mind-maps generated by the various methods for a CNN news article (available at https://www.cnn.com/2014/09/15/politics/new-hampshire-senate-poll/index.html). Compared with the manually annotated mind-map, MRDMF chooses an inaccurate sentence as the root node. DistilBert and EMGN both generate mind-maps that represent the major ideas of the document. However, some relations between nodes in the DistilBert mind-map are meaningless, such as the governing relation from sentence 8 to sentence 2. EMGN generates a mind-map that captures the central concept and grasps the directed relations among sentences, because our method considers the sequential information of the article and understands it as a whole. The case study further verifies that our method effectively generates a mind-map that reveals the underlying semantic structure of a document.

[Figure 6 content: the example document (sentences 0-11 of the CNN article) and the extracted key snippets for each sentence.]

Related Works
A mind-map is a hierarchical diagram that reveals the semantic structure of a document. The salient-sentence-based mind-map (SSM) is similar to extractive summarization (Zhong et al., 2020), which aims to choose key sentences from the document to describe its main ideas. Similar yet distinct, mind-map generation reveals not only the main ideas but also the key semantic and logical structure of the document. One previous work, LexRank (Erkan and Radev, 2004), computes an adjacency matrix over the graph representation of sentences based on intra-sentence cosine similarity. However, the lexical similarity of some semantically related sentence pairs may be zero, so a mind-map generated from this graph representation tends to be less meaningful, as the experiments also indicate (see §3.4). In addition, a few extractive summarization works employ graph techniques. For instance, a bipartite graph over sentence and entity nodes (Parveen and Strube, 2015) and a weighted graph with topic nodes (Parveen et al., 2015) have been proposed to improve the ranking of sentences in a document. Recently, researchers propose to build a heterogeneous graph to learn the correlations between word and sentence nodes, which helps select better summarizing sentences. Although these works involve learning graph knowledge, such graphs can hardly derive a mind-map that focuses on the governing relationships between sentences.
Another related direction is the policy-based reinforcement learning (Williams, 1992;Sutton et al., 2000). Previous methods (Xiong et al., 2017;Xiao et al., 2020) usually affect the training of the main task by a policy network with separate parameters. Different from them, we directly regard the main network as the policy network and its output graph as the action distribution. Then the main network is optimized simultaneously when maximizing the expected reward.

Conclusion
We propose an efficient mind-map generation network that converts a document into a graph via sequence-to-graph. To ensure a meaningful mind-map, we design a graph refinement module that adjusts the graph by leveraging highlights in a reinforcement learning manner. Extensive experimental results demonstrate that the proposed approach is more effective and efficient than the existing methods, reducing the inference time by thousands of times. The case studies further verify that the generated mind-maps can reveal the underlying semantic structures of a document.

A Software and Hardware
We use PyTorch to implement all models (Python 3.5). The operating system is Red Hat Enterprise Linux 7.8. DistilBert is trained on a Tesla K80; all other models are trained on a GTX 980. We compare the inference time of all models in the same software and hardware environment.

In Table 4, Table 5 and Table 6, we display the full experimental results, including the average score and the standard deviation of 5 runs.

B.2 Effects of Training Data Scale
We also investigate the impact of the training data size on performance. In total, we annotate pseudo graph labels for 44,450 documents with DistilBert. The performance curves of EMGN with different training scales are depicted in Figure 7. They show that training the proposed model EMGN does not require too many labeled documents: the performance scores improve significantly when the data scale grows from 1,000 to 2,000, and the results grow steadily as more training data is added. A possible explanation is that, compared with the ground-truth graph, the pseudo graph labels from DistilBert are still less accurate and might have redun-