DESCGEN: A Distantly Supervised Dataset for Generating Entity Descriptions

Short textual descriptions of entities provide summaries of their key attributes and have been shown to be useful sources of background knowledge for tasks such as entity linking and question answering. However, generating entity descriptions, especially for new and long-tail entities, can be challenging since relevant information is often scattered across multiple sources with varied content and style. We introduce DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description. DESCGEN consists of 37K entity descriptions from Wikipedia and Fandom, each paired with nine evidence documents on average. The documents were collected using a combination of entity linking and hyperlinks into the entity pages, which together provide high-quality distant supervision. Compared to other multi-document summarization tasks, our task is entity-centric, more abstractive, and covers a wide range of domains. We also propose a two-stage extract-then-generate baseline and show that there exists a large gap (19.9% in ROUGE-L) between state-of-the-art models and human performance, suggesting that the data will support significant future work.


Introduction
Entity knowledge has been shown to play an important role in various applications including language modeling (Peters et al., 2019), open-domain question answering (Xu et al., 2016), and dialogue generation (Qin et al., 2019). Recent studies suggest that such entity knowledge can be provided by simple textual descriptions (Chen et al., 2019), which can be incorporated to improve downstream task performance (Nie et al., 2018; Logeswaran et al., 2019).1 However, manually curating entity descriptions is labor-intensive, and it is challenging to keep pace with the ever-growing emergence of new entities.

1 Data and code available at github.com/swj0419/DESCGEN

Figure 1: Example source documents and the target entity description for Carl Menger.
Doc 1: ...Are bitcoins, then, really worth anything? According to Carl Menger's subjective theory of value, they are worth whatever individuals choose to believe they are worth. It is clear that many individuals value this new medium of exchange highly...
Doc 2: ...The Austrian School of Economics has its roots outside of Austria, particularly in the French economists Jean Baptiste Say and Claude-Frederic Bastiat. The Austrian School proper began with Carl Menger, who challenged the British labor theory of value. To learn more about Austrian Economics go to the website of The Ludwig von Mises Institute...
Doc 3: ...Karl Menger was born on January 13, 1902, in Vienna. His father was the famous Austrian economist Carl Menger, who was one of the founders of marginal utility theory...
Entity Description: Carl Menger (February 23, 1840 - February 26, 1921) was an Austrian economist and the founder of the Austrian School of economics. He contributed to the development of the marginal utility theory and to the formulation of a subjective theory of value.
In this paper, we present a new dataset, DESCGEN, for automatically generating entity descriptions from relevant documents and mentions, which provides high-quality supervision for a highly abstractive version of this task that targets early description of new entities as they emerge. For example, in Figure 1, machines are required to generate a description of Carl Menger given multiple documents mentioning him.
DESCGEN contains 37K entity descriptions extracted from Wikipedia and Fandom.2 Fandom allows us to capture the key challenge of generating descriptions for emerging entities that are not in Wikipedia because they are less popular or have just been introduced to the public. To obtain source documents for the entities, we collect web documents and news articles in which entity mentions are linked using web hyperlinks or an entity linker. Our dataset is distantly supervised in that these heuristically collected documents are not guaranteed to contain all the facts required to generate the description, as would be the case for natural text collections describing emerging entities. We also carefully annotate a subset of 1,000 examples to support more reliable evaluation (see Table 2 for dataset statistics).
Unlike multi-document summarization, which assumes that the set of documents to be summarized is written on the same topic (Zopf et al., 2016), DESCGEN only assumes that the source documents mention the entity. In contrast to an existing entity summarization benchmark (Liu et al., 2018, WikiSum), DESCGEN is more abstractive and better approximates the challenges faced when describing new entities. Section 4.4 provides more details on these comparisons. Overall, our documents for generating a description can cover a much wider range of topics as well as text genres, including news, blog posts, and scientific articles. For instance, Documents 1 and 2 mentioning Carl Menger in Figure 1 discuss bitcoins and the Austrian School of Economics.
Finally, we also propose a two-stage method that first extracts salient sentences relevant to the entity and then abstracts them into a description. We test a range of models to establish baseline results with both automatic and human evaluation. The best model, based on BART (Lewis et al., 2020b), achieves 28.2% ROUGE-L F-measure, a significant gap below the human performance of 48.1%, suggesting there is great room for future improvement. In summary, our contributions include:

• We propose a new dataset, DESCGEN, that includes challenging, abstractive entity summaries. Our dataset contains over 37K pairs of entity descriptions and their associated documents, along with a human-annotated subset of 1,000 pairs.
• We conduct an extensive analysis of the properties of the dataset and identify its challenges: extractive content selection from large amounts of text and abstractive generation from it, particularly for emerging entities.
• We present a two-stage method and benchmark various models on our dataset, aiming to facilitate future work on this dataset.

Related work
Existing Entity Description Generation Tasks and Datasets Previous work (Novikova et al., 2017; Cheng et al., 2020; Trisedya et al., 2020) mainly takes as input structured data such as knowledge graphs to generate entity descriptions. However, knowledge graphs, often mined from text corpora, are overwhelmingly incomplete for real-world entities and may not be updated in real time (Dong et al., 2014). We therefore focus on generating descriptions from natural language sources such as web text and news, because they are often primary sources for entities and have better coverage of entities across multiple domains. DESCGEN is most related to WikiSum, a recent dataset for generating Wikipedia summaries from textual sources (Liu et al., 2018). WikiSum source documents primarily come from high-quality articles cited in the Wikipedia pages, which makes their data more extractive (Section 4.4). In contrast, we collect our source documents heuristically from web text and news, providing a better proxy for emerging entities for which high-quality citation sources may not be available. In addition, their evaluation is conducted only on distantly supervised test data, whereas our experiments demonstrate that manually annotated data allows for much better evaluation of model performance (Table 7).
Multi-document summarization aims to condense a cluster of thematically-related documents into a short and informative summary. A wide range of multi-document summarization datasets have been built for the Document Understanding and Text Analysis Conferences (Over and Yen, 2004; Owczarzak and Dang, 2011), news (Fabbri et al., 2019), events (Gholipour Ghalandari et al., 2020), and Wikipedia summaries (Liu et al., 2018). Recent work has studied both extractive (Yasunaga et al., 2017; Nallapati et al., 2017; Tohalino and Amancio, 2018) and abstractive summarization (Banerjee et al., 2015; Chali et al., 2017; Nayeem et al., 2018). However, existing datasets typically are not entity-focused and assume the input documents are at least loosely centered around a coherent topic or event.
Wikipedia generation Our work is also related to research on generating Wikipedia articles. For instance, Sauper and Barzilay (2009) learn to build content templates using an integer linear program to generate full articles. Similarly, Banerjee and Mitra (2016) generate Wikipedia pages by building a topic classifier to assign web-retrieved contents into relevant sections. We focus on a different task: generating a short text description that can identify and best summarize an entity.

Dataset Collection
Task definition Given a collection of documents D = {D_i | i = 1, ..., n} with mentions linked to the same entity e, the goal is to generate a description of e. For example, Figure 1 shows a description of an entity (Carl Menger) and three source documents with mentions.

Distant supervision
We make use of existing knowledge bases, such as Wikipedia and Fandom, to collect entity descriptions. To obtain source documents and mentions for each entity, we use a combination of hyperlinks to Wikipedia pages and an entity linker that links entity mentions in text. Our dataset is distantly supervised in that these heuristically collected documents are not guaranteed to contain all the facts required to generate the description. To analyze the quality of the distant supervision, we collect a smaller verified set of entity descriptions using human annotators. In contrast with our work, WikiSum (Liu et al., 2018) used documents cited in the Wikipedia pages or web pages returned by Google as source documents to generate Wikipedia lead sections. Because high-quality citation sources constitute a substantial part of their overall documents (75%), their dataset is less abstractive than DESCGEN and unsuited for emerging entities for which citations are not available.
Sources We paired entity descriptions with source documents from three sources: Wikilinks, RealNews, and Fandom using distant supervision.
To capture the challenge of emerging entities, we retrieve source documents that are not in Wikipedia using Wikilinks and RealNews. We also include specialized entities in Fandom that do not have Wikipedia pages. For quality control, we filter out entities for which the unigram recall of the entity description against its concatenated source documents is lower than 0.6.
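This unigram-recall filter can be sketched as follows (a minimal illustration; the function names and the set-based tokenization are our assumptions, not the released code):

```python
import re

def unigram_recall(description: str, documents: list[str]) -> float:
    """Fraction of unique description unigrams that also appear in the
    concatenated source documents."""
    tokenize = lambda text: re.findall(r"\w+", text.lower())
    desc_tokens = set(tokenize(description))
    doc_tokens = set(tokenize(" ".join(documents)))
    if not desc_tokens:
        return 0.0
    return len(desc_tokens & doc_tokens) / len(desc_tokens)

def keep_entity(description: str, documents: list[str], threshold: float = 0.6) -> bool:
    """Drop entities whose description is poorly supported by the documents."""
    return unigram_recall(description, documents) >= threshold
```

An entity is kept only when at least 60% of its description's vocabulary is attested somewhere in its source documents.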

Distantly supervised data collection
Wikilinks Wikilinks (Singh et al., 2012) is a large dataset designed for cross-document coreference. It consists of non-Wikipedia web pages (discovered using the Google search index) containing entities that are hyperlinked to Wikipedia. For each entity, we retrieve a collection of web pages in Wikilink with the anchor text linked to it and use the lead section of target Wikipedia page as its description. We further parse the HTML texts of the web pages and extract contents as source documents.
RealNews To expand the collection of source documents, we extract entity mentions from RealNews (Zellers et al., 2019), a large corpus of news articles from Common Crawl. We first conduct a longest-prefix match between entity surface forms and text tokens via a trie, a prefix-tree structure that supports efficient string searching. More specifically, we build a trie of entity names in which each node is a word and its children indicate all possible continuations of the prefix. After retrieving candidates for entity mentions, we use an off-the-shelf entity linking model (Gupta et al., 2017) to rank the candidates and add the corresponding news articles as source documents of the top-ranked candidate.
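The longest-prefix match over a word-level trie can be sketched as follows (class and method names are ours, for illustration only):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next word -> TrieNode
        self.entity = None   # entity id if a full name ends at this node

class EntityTrie:
    """Word-level trie over entity names for longest-prefix matching."""

    def __init__(self, names):
        # names: iterable of (entity_id, surface_name) pairs
        self.root = TrieNode()
        for entity, name in names:
            node = self.root
            for word in name.lower().split():
                node = node.children.setdefault(word, TrieNode())
            node.entity = entity

    def longest_match(self, tokens, start):
        """Return (entity, end_index) for the longest entity name starting
        at tokens[start], or None if no name matches there."""
        node, best = self.root, None
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i].lower())
            if node is None:
                break
            if node.entity is not None:
                best = (node.entity, i + 1)  # prefer the longest match so far
        return best
```

Scanning a document left to right with `longest_match` yields candidate mentions, which the entity linker then ranks.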
Fandom Fandom3 is a collection of encyclopedias centered around particular subjects and themes such as movies, TV shows, and games. It contains specialized entities that require domain experts with background knowledge to make edits. Entities and their source documents can be automatically extracted via internal links. We keep only entities without Wikipedia pages, which can be viewed as new or emerging entities. The description of an entity is extracted from the lead section of its Fandom page. We collect data from the 32 largest Fandom wikis.

Human-authored entity descriptions
Entity descriptions extracted from Wikipedia and Fandom have been authored and edited by multiple community contributors, largely independently of our source documents. To better analyze how descriptions sourced from documents in our dataset contrast with those from Wikipedia and Fandom, we collected additional entity descriptions via Upwork,4 a freelancing platform. We provided the entity and its source documents to annotators on Upwork and asked them to write entity descriptions. The annotators were also asked to mark the sentences they used to write each description. Each entity was assigned to 2 annotators. We collected 500 entity descriptions for dev examples and 500 for test examples.
We control the quality of the crowdsourced descriptions by filtering out annotators who produced low-quality descriptions. We asked every candidate to annotate the same 20 examples and used two criteria for narrowing down candidates: (1) missing key information in descriptions and (2) unjustified information in descriptions that cannot be inferred from the source documents alone. Eventually, we filtered out 4 annotators and accepted 7 qualified annotators. The total annotation cost was around $3,500.

Experimental setup
All 37K entity description and document pairs in the dataset are randomly split into train, development, and test sets. In addition to the automatically collected descriptions from Wikipedia and Fandom, we add the human-authored descriptions (Section 3.2) as verified subsets of the dev and test splits. Table 3 shows basic statistics of the final dataset. We report model performance on both the automatically collected descriptions (distant) and the human-authored descriptions (verified).
The next section provides a detailed analysis of the data quality, including inter-annotator agreement and the degree of abstraction.

4 https://www.upwork.com/

Dataset Analysis
An analysis of the data shows that DESCGEN contains a high proportion of emerging entities from diverse domains, and is more abstractive than other multi-document summarization datasets.

Inter-annotator agreement
Each entity in the verified subset has two descriptions written by two annotators. Following previous work (Chen et al., 2015), we quantify inter-annotator agreement on descriptions by treating one of the descriptions as the prediction and the other as the reference to compute ROUGE (Lin, 2004) and METEOR (Denkowski and Lavie, 2014). Table 4 shows high inter-annotator agreement of 47.7 in terms of ROUGE-L.
We additionally measure the agreement on content selection using sentences marked by annotators. In particular, agreement is achieved when both annotators selected the exact same sentences in all source documents for an entity. Cohen's Kappa is 0.38, which indicates high agreement (Brennan and Prediger, 1981) considering the strict criterion of reaching agreement.
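Cohen's Kappa over binary sentence-selection labels can be computed with the standard two-rater formulation (a self-contained sketch; real analyses could equally use an off-the-shelf statistics library):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary labels (1 = sentence
    selected, 0 = not selected), one label per candidate sentence."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of sentences labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal rates.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

A kappa of 0 means agreement at chance level; 1 means perfect agreement.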

Comparison between human-authored and Wikipedia/Fandom descriptions
To understand how human-authored descriptions differ from Wikipedia and Fandom descriptions in terms of content and style, we compare them using automatic metrics (ROUGE) and manual evaluation; results are shown in Table 6. We find that the differences between the two sources of descriptions are mainly caused by paraphrasing and missing information. This suggests that even for entities that have very different human-authored and extracted descriptions, most of the information in the Wikipedia/Fandom descriptions is present in the documents.

Extraction vs abstraction
Generating entity descriptions involves extracting essential information about the entity and condensing it into a short description. To measure how much DESCGEN requires paraphrasing and compression, we quantify the extractive nature of our dataset by measuring the extractive fragment coverage and density defined in Grusky et al. (2018). Let A be the concatenation of the source documents, S the description, and F(A, S) the set of shared token sequences (extractive fragments) in A and S. Extractive fragment coverage computes the percentage of words in the summary that appear in the source documents:

Coverage(A, S) = (1 / |S|) * Σ_{f ∈ F(A, S)} |f|

Likewise, extractive fragment density is related to the average length of the shared token sequences:

Density(A, S) = (1 / |S|) * Σ_{f ∈ F(A, S)} |f|^2

For example, an entity description with high coverage and low density shares many individual words with the source documents but almost no long phrases.
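A simplified sketch of the two statistics, with a greedy fragment matcher standing in for the algorithm of Grusky et al. (2018) (the quadratic search is for clarity, not efficiency; function names are ours):

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match each summary position to the longest token sequence
    that also occurs somewhere in the article."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # word never appears in the article
    return fragments

def coverage_and_density(article_tokens, summary_tokens):
    """Coverage = mean fragment length share; density weights fragments
    quadratically, so long copied spans dominate."""
    frags = extractive_fragments(article_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in frags) / n
    density = sum(len(f) ** 2 for f in frags) / n
    return coverage, density
```

On a summary that copies a two-word phrase out of three words, coverage is 2/3 while density is 4/3, reflecting the single short copied span.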
We compare our dataset with several multi-document summarization datasets, including CNN/Daily Mail, Multi-News (Fabbri et al., 2019), and WikiSum (Liu et al., 2018). Figure 2 presents the density and coverage distributions. The densities of Multi-News, CNN/Daily Mail, and WikiSum are high, showing that there is much copying of long sequences from the source documents. DESCGEN shows high coverage but low density, suggesting that it is not common to copy long sequences and that the data overall is much more abstractive.

Baselines
In this section, we introduce several new baseline methods, building on state-of-the-art pre-trained models. The input documents can be long (Section 8), making it computationally infeasible to train end-to-end models. We instead introduce a pipelined approach that generates an entity description in two stages. In the first, extractive stage, a selector identifies representative sentences relevant to the entity from the multiple source documents. In the second, abstractive stage, a neural generation model fuses the selected sentences into a description of the entity. We compare a number of different approaches for each stage, as summarized in the subsections below.

Extractive stage
Trivial concatenates all sentences that mention the entity, along with one sentence before and after each. The content is truncated to the first 1,000 tokens to fit the token limit of models in the abstractive stage.
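A minimal sketch of this Trivial selector (sentence splitting is assumed to have been done already; whitespace tokenization and the function name are simplifications of ours):

```python
def trivial_select(documents, entity, max_tokens=1000):
    """Concatenate every sentence mentioning the entity together with one
    sentence of context on each side, then truncate to max_tokens."""
    selected = []
    for sentences in documents:          # each document is a list of sentences
        keep = set()
        for i, sent in enumerate(sentences):
            if entity.lower() in sent.lower():
                keep.update({i - 1, i, i + 1})   # mention plus neighbors
        selected += [sentences[i] for i in sorted(keep)
                     if 0 <= i < len(sentences)]
    tokens = " ".join(selected).split()
    return " ".join(tokens[:max_tokens])         # fit the abstractive model's limit
```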
Cheating ranks sentences according to their unigram recall against the description and selects the top 15 sentences. This heuristic demonstrates the effect of extraction on final performance.
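The Cheating oracle can be sketched as follows (whitespace tokenization and the function name are simplifications of ours):

```python
def cheating_select(sentences, description, top_k=15):
    """Rank candidate sentences by unigram recall against the gold
    description and keep the top_k (an oracle upper bound, since it
    peeks at the reference)."""
    desc = set(description.lower().split())

    def recall(sent):
        toks = set(sent.lower().split())
        return len(desc & toks) / len(desc) if desc else 0.0

    ranked = sorted(sentences, key=recall, reverse=True)
    return ranked[:top_k]
```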
BERT with a classifier uses a linear layer stacked on top of the BERT outputs to predict whether a sentence should be selected. The model is trained on our training dataset, in which sentences are labeled by the Cheating method.

Abstractive stage
We compare three pre-trained language generation models, BART (Lewis et al., 2020b), T5 (Raffel et al., 2019), and MARGE (Lewis et al., 2020a), to generate abstractive entity descriptions. We fine-tune these models on our training dataset in a sequence-to-sequence fashion.
T5 is a text-to-text transformer pre-trained on a multi-task mixture of unsupervised and supervised tasks. We consider models of two sizes: base and large containing 220M and 770M parameters respectively. We use the Hugging Face version. 5 BART introduces a denoising autoencoder combining a bidirectional encoder and auto-regressive decoder. It is trained by reconstructing text corrupted with a noising function. We consider the base model with 139M parameters.
MARGE is a multilingual sequence-to-sequence model trained by reconstructing target documents from retrieved paraphrased documents in other languages. It has around 960M parameters.

Evaluation metrics
Following other summarization tasks, we evaluate the quality of generated descriptions with ROUGE F1 (Lin, 2004), which measures the overlap of unigrams (R-1), bigrams (R-2), and the longest matching sequence of words (R-L). In addition, we evaluate content selection by unigram and bigram recall to assess the importance of the extractive stage. Lastly, in addition to automatic evaluation, we also conduct human evaluation of non-redundancy, fluency, informativeness, and faithfulness.

Table 7: R-1, R-2, and R-L scores on the distant-supervision and verified dev/test sets.

Table 9: Manual evaluation scores on a scale from 1 (very poor) to 5 (very high). All these models use BERT in the extractive stage.
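ROUGE-L reduces to an F-measure over the longest common subsequence of tokens. A minimal single-reference sketch (actual evaluations use the standard ROUGE toolkit, which adds stemming and multi-reference handling):

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: F-measure over the longest common subsequence (LCS)."""
    p, r = prediction.lower().split(), reference.lower().split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if p[i - 1] == r[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```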

Experimental results
Automatic evaluation In Table 8, we report the experimental results for the extractive stage. We observe that BERT consistently outperforms the unsupervised Trivial method, suggesting that training a model to predict sentence relevance brings immediate improvements in content selection. Meanwhile, the performance of BERT still lags behind the upper bound defined by Cheating by 1.7-7.3% in unigram recall. Table 7 presents ROUGE scores of the various baselines in the abstractive stage. T5-large and BART show similar performance and outperform the other models on both the distant-supervision and verified subsets, by a small margin. Increasing model size from T5-base (220M parameters) to T5-large (770M parameters) leads to a relatively large performance gain. The human baseline is superior to all the models, maintaining an R-L score over 33 on the distant-supervision subset and 48 on the verified subset. The large gap between the human baseline and the best-performing model shows there is much room for future work.
Manual evaluation We presented two human assessors with source documents and descriptions generated by the different abstractive models and asked them to rate the descriptions in terms of non-redundancy (does the description avoid repeating information?), fluency (is the description well-formed and grammatically correct?), informativeness (does the description capture the salient information about the entity?), and faithfulness (is the description faithful to the source text?). We compared BART, T5-Large, and T5-Base. For each model, we selected 100 descriptions and showed the outputs to the assessors side by side without revealing which model generated them. The score for each description was averaged between the two assessors. As can be seen from Table 9, BART shows strong performance on all dimensions except fluency. Overall, all three models can generate fluent descriptions (high fluency) but struggle with producing accurate statements (low faithfulness). In most cases of low faithfulness, we observe that the model directly copies words from the input that are not relevant to the entity, or synthesizes information that is not directly inferable from the input.

Table 10 (excerpt): Wikipedia description
Carl Menger (February 23, 1840 - February 26, 1921) was an Austrian economist and the founder of the Austrian School of economics. He contributed to the development of the marginal utility theory and to the formulation of a subjective theory of value.

Analysis
In this section, we perform qualitative and quantitative analysis of baseline results to better understand strengths and weaknesses of models, and hypothesize avenues for future work.

Case study
A qualitative analysis of model predictions suggests that these models tend not to generate novel words in the description, and mostly copy words from the original text. The entity-centric nature of DESCGEN makes extractive content selection difficult, as evidenced by the gap between BERT extraction and the Cheating model (Section 6.2). For example, Table 10 shows the model-generated entity descriptions for Carl Menger using the source documents from Figure 1. BART, one of the best-performing baselines, generates the description with the highest overlap with the Wikipedia description, but it still misses some important facts. T5-Base and MARGE confuse Carl Menger with his son, and incorrectly include information that does not describe the target entity.

Entity knowledge in pre-trained models
BART, T5, and MARGE are language models pretrained on text corpora including Wikipedia and Common Crawl. The parameters of the models appear to contain substantial linguistic and factual information (Petroni et al., 2019;Peters et al., 2018).
In particular, we wonder whether entity-related knowledge is captured in the pretraining stage and investigate the following questions: (a) Can the model memorize entity descriptions during pretraining? (b) Does the memorized knowledge improve model performance on generating entity descriptions? To investigate these questions, we test the model's ability to write a description given only the entity name instead of the source documents. We train the model on our training dataset to adapt to the style of Wikipedia in the same way as before. The results are shown in Table 11. Considering the name-only baselines, we can see that all of them perform worse on Fandom entities than on Wikipedia entities. However, the regular baselines perform similarly on Fandom and Wikipedia. This result suggests that facts about entities learned during pretraining have much less influence on model performance when source documents are provided.

Entity type
To understand how the performance of the models varies with different types of entities, we report the performance breakdown for different entity types in Table 12. Among domains in Wikipedia, our model obtains low scores on group and company, suggesting that they are more challenging than other domains. In Fandom, entities from the game domain prove to be the most difficult.
In summary, our analysis suggests there is room for improvement in extractive content selection and abstractive generation, particularly for new and emerging entities from less popular domains.

Conclusion
In this work, we introduce DESCGEN, a new dataset for generating entity descriptions from mentions. DESCGEN contains 37K pairs of entity descriptions from Wikipedia and Fandom, and 481K automatically gathered source documents based on distant supervision. We also present a clean human-authored subset of 1,000 pairs for testing. We show that, compared to existing benchmarks, DESCGEN requires more abstractive summaries, which we argue better approximate the challenge of describing emerging entities. We also show that the performance of state-of-the-art models is far from human levels, suggesting that our task remains a significant challenge with room for improvement. Our study points to an interesting research direction on modeling entity knowledge from contexts. We hope it will facilitate future work on incorporating entity knowledge into downstream tasks and generating descriptions for emerging entities.

Doc 1: ...It sometimes gets confusing in the global village, where technology, finance, cross-cultural interactions, and expanding ethnic diasporas are tearing apart the relationship between borders and making multiple identities possible. Hence, Ang Lee is a Taiwanese artist who directs American films, but he is also an American film director of Chinese movies. As a member of the Sinosphere, enlarged by fifty million overseas Chinese, Ang is not only a creative individual who makes our world more interesting and prosperous. He also helps to bridge between nations and cultures and to produce a Sino-American synergy that is more conducive to peace than a contingency of Chinese and U.S. diplomats...

Doc 2: ...The Life of Pi. One of the most interesting film adaptations set for release in 2012 is Brokeback Mountain fame. Suraj Sharma, who has no previous acting experience, will play the central character, Piscine Patel. Based on the novel by Yann Martel, it is being brought to the big screen by Ang Lee...

Doc 3: ...Comic character Hulk is Dr. Bruce Banner, who becomes a green monster with powerful strength after an experiment went bad, or well, depending on who you ask. In 2003, director Ang Lee's film Hulk brought this character to the big screen, but was poorly received by Hulk's fans...

Wikipedia Description
Ang Lee (born October 23, 1954, P'ing-tung county, Taiwan) is a Taiwan-born film director who transitioned from directing Chinese films to major English-language productions.

Human-authored Description
Ang Lee is a Taiwanese director who directs American and Chinese films. He is the director of the Life of Pi and Hulk and regarded as part of the Second New Wave of Taiwanese directors.

BART: Ang Lee is a Taiwanese film director and screenwriter.
T5-base: Ang Lee is a Taiwanese film director.
T5-large: Ang Lee is a Taiwanese film directors and screenwriter

Doc 1: ...In the summer of 1994, Arthur managed to get himself and his family (as well as Harry and Hermione) tickets for the 1994 Quidditch World Cup from Ludovic Bagman because Arthur had helped Otto Bagman, Ludo's brother, out of a minor scrape. Arthur was among the Weasleys who fetched Harry from the Dursley family via the Floo Network. While there, he expressed his fascination at various Muggle artefacts in the Dursleys' house. The group sat in the Top Box, where they were confronted by the Malfoy family, who were there by a personal invitation from the Minister himself, though both Arthur and Lucius were able to restrain themselves out of respect for Cornelius Fudge...

Doc 2: ...Before working at the Ministry, he was a Beater for both the Wimbourne Wasps and the English National Quidditch team. He had a brother named Otto Bagman. He also tended to play dirty when gambling and betting as he tried to find loopholes or even pay in fake money/gold...

Doc 3: ...A lawn mower is found in the Muggle Studies classroom at Hogwarts School of Witchcraft and Wizardry. Arthur once helped Ludovic Bagman's brother, Otto Bagman, by smoothing over a problem involving a lawn mower enchanted with magical powers. As thanks, Ludo got Arthur prime tickets to the 1994 Quidditch World Cup...

Fandom Description
Otto Bagman was the brother of Ludovic Bagman. He once had a problem with a magical lawn mower, a Muggle artifact. Arthur Weasley helped him out with the problem, and was rewarded by Ludo with tickets to the 1994 Quidditch World Cup final.

Human-authored Description
Otto Bagman is the brother of Ludovic Bagman. He had a problem involving a lawn mower enchanted with magical powers. He was helped by Arthur and gave Arthur prime tickets to the 1994 Quidditch World Cup.

BART: Otto Bagman was a fictional character in the 1994 film Harry Potter.
T5-base: Otto Bagman was an English footballer who played for the Wimbourne Wasps and the English National Quidditch team. He also played dirty when gambling and betting as he tried to find loopholes or even pay in fake money.
T5-large: Otto Bagman was a brother of Ludovic Bagman.

Table 13: Examples of entity descriptions generated by our model. Red text indicates incorrect information in predictions, while green text indicates information in the Wikipedia and human-authored descriptions that was not covered by any of the model predictions.