Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage?

This survey/position paper discusses ways to improve the coverage of resources such as WordNet. Rapp estimated correlations, ρ, between corpus statistics and psycholinguistic norms. ρ improves with quantity (corpus size) and quality (balance). One million words is enough for simple estimates (unigram frequencies), but at least 100x more is required for good estimates of word associations and embeddings. Given such estimates, WordNet's coverage is remarkable: WordNet was developed on SemCor, a small sample (200k words) from the Brown Corpus. Knowledge Graph Completion (KGC) attempts to infer missing links from subsets of the graph. But Rapp's size estimates suggest it would be more profitable to collect more data than to infer missing information that is not there.


Quantity (Size) and Quality (Balance)
How large does a corpus need to be, and for which tasks? In the early 1980s, corpora were about 1M words. The Brown Corpus (Kučera and Francis, 1967; Kučera, 1979, 1982) was large enough for first order statistics (counts of words), but not for second order statistics (word associations and counts of pairs of words).
The Brown Corpus was a balanced corpus. That is, the corpus was intended to be a representative sample of the text that a system will see at inference time. The 1M word Brown Corpus consists of 500 samples of 2000 words each, representative of contemporary American English of the 1960s.
Over time, balanced corpora became larger. When the community increased the size of balanced corpora from 1M words for the Brown Corpus to 100M words for the British National Corpus (BNC) (Aston and Burnard, 1998; Burnard, 2002), it was known that 1M words was too small for second order statistics (collocations and word associations), but it was hoped that 100M would be sufficient.
Around this time, Church and Hanks (1990) used an unbalanced sample of 44M words from the AP (Associated Press) to make the case for PMI (pointwise mutual information). Given the estimates in Table 1, it appears in retrospect that 44M words were just barely enough to make the case for PMI.
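To make the computation concrete, here is a minimal sketch of PMI estimation from co-occurrence counts, in the spirit of Church and Hanks (1990). The tokenized corpus, window size, and frequency cutoff are illustrative assumptions, not details from the original paper.

```python
# Minimal sketch of PMI estimation from a tokenized corpus.
# Window size and min_count are illustrative assumptions.
import math
from collections import Counter

def pmi_table(tokens, window=5, min_count=5):
    """Estimate PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) from co-occurrence counts."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pairs[(x, y)] += 1
    n = len(tokens)
    n_pairs = sum(pairs.values())
    table = {}
    for (x, y), c in pairs.items():
        if unigrams[x] < min_count or unigrams[y] < min_count:
            continue  # PMI is unstable for rare words, especially in small corpora
        p_xy = c / n_pairs
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table
```

The min_count cutoff is the crux: PMI estimates for rare pairs are noisy, which is why tens of millions of words were needed to make the case.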
It was also believed that quality (balance) mattered, but there were few, if any, empirical studies to justify such beliefs. It was extremely controversial when engineers such as Mercer questioned these deeply held beliefs in 1985 with: "there is no data like more data." Most people working on corpus-based methods in lexicography were deeply committed to balance as a matter of faith, and were deeply troubled by Mercer's heresy.
More recently, Rapp (2014a,b) provided some empirical evidence that bears on this debate. He used 5 corpora to study quantity (sample size) and quality (balance). In addition to the two balanced corpora mentioned above, Brown and BNC, Rapp looked at 3 unbalanced corpora:
1. 300M words of Wikipedia (Wiki)
2. 2B words of web pages (ukWaC)
3. 4B words of newswire (Gigaword)
This study used correlations, ρ, to compare statistical summaries with psycholinguistic norms: familiarity (Coltheart, 1981), association (Kiss et al., 1973) and relatedness (Fernald, 1896). We will refer to unigram statistics and familiarity norms as first order; statistics on pairs of words (such as PMI) and the other norms will be referred to as second order. In Table 1, ρ1 refers to correlations of first order quantities and ρ2 refers to correlations of second order quantities.

Table 1: ρ1 and ρ2 increase with quantity (N) and quality (balance: top 2 rows). Results from (Rapp, 2014b).

Rapp (2014a,b) showed that both ρ1 and ρ2 increase with quantity and quality, as shown in Table 1. We suggest two simple rules of thumb:
1. Balance trade-off: ρ1 over N balanced words ≈ ρ1 over 100N unbalanced words
2. First order is 100x easier than second order: ρ1 on N words > ρ2 on 100N words
Among the unbalanced corpora, web pages (ukWaC) have a relatively large ρ, better than Wiki and Giga, though not as good as BNC. Note that 1B words of web pages has a better ρ2 than 100M words of BNC.
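Rapp's methodology can be sketched in a few lines: compute a corpus statistic for each word and correlate it, via Spearman's ρ, with a human rating. In the sketch below (assuming scipy), the norms mapping is a hypothetical stand-in for resources such as the familiarity norms of Coltheart (1981).

```python
# Sketch of a first order comparison in the style of Rapp (2014a,b):
# correlate log corpus frequency with a familiarity norm. The `norms`
# mapping is a hypothetical placeholder for data such as Coltheart (1981).
import math
from collections import Counter
from scipy.stats import spearmanr

def rho1(tokens, norms):
    """Spearman correlation between log frequency and familiarity ratings."""
    freq = Counter(tokens)
    words = [w for w in norms if w in freq]
    log_freq = [math.log(freq[w]) for w in words]
    ratings = [norms[w] for w in words]
    rho, _ = spearmanr(log_freq, ratings)
    return rho
```

A second order comparison is analogous, but correlates pair statistics (such as the PMI table above) with association or relatedness norms; that is the harder estimation problem.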
It is hard to know what will happen for much (1000x) larger corpora, but one might expect diminishing returns. Of course, extrapolating estimates like these by 10x or more is known to be risky (Efron and Thisted, 1976). Figures 1a-b of (Rapp, 2014b) suggest that while ρ is increasing almost everywhere, there may be some deceleration (negative second derivative), especially for large N .
Although Rapp's estimates predate much of the work on embeddings, we expect these estimates of quantity and quality to hold for static embeddings (Mikolov et al., 2013; Pennington et al., 2014) and contextual embeddings (Devlin et al., 2019; Sun et al., 2020), assuming the connection between PMI and Word2vec in Levy and Goldberg (2014).
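That connection, briefly: Levy and Goldberg (2014) show that skip-gram with negative sampling implicitly factorizes a word-context PMI matrix shifted by log k, where k is the number of negative samples. The sketch below makes the factorization explicit under a simplifying assumption (a dense co-occurrence matrix): build shifted positive PMI (SPPMI) and take a truncated SVD.

```python
# Sketch of the PMI / Word2vec connection (Levy and Goldberg, 2014).
# `counts` is assumed to be a dense word-by-context co-occurrence matrix.
import numpy as np

def sppmi_svd(counts, k=5, dim=100):
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log(k), 0)  # shift by log k, clip at zero
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])    # rows are word embeddings
```

If this view is right, embeddings inherit the data requirements of PMI: they are second order objects, and Rapp's 100x rule of thumb should apply.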
In addition to size and balance, there are many other factors to consider. Different languages are different, and languages are constantly evolving. Variation is to be expected over time (Hamilton et al., 2016; Szymanski, 2017) and space, as well as across sociolinguistic factors: demographics, gender bias (Pearce, 2008; Drozd et al., 2016; Sheng et al., 2019; Nissim et al., 2020; Kumar et al., 2020), etc.
In addition to language change, topics and domains are also constantly evolving. Obviously, news, Wikipedia and web pages are very different from social media (Twitter) and academic writing (ACL Anthology (Radev et al., 2013), ArXiv, PubMed). The Brown Corpus predates social media, and most publications in repositories such as PubMed, the ACL Anthology and ArXiv (Church, 2017). The Brown Corpus also predates huge changes in technology (computers and cell phones), the news media (cable TV and the Internet), and modern medicine (e.g., COVID-19, SARS, HIV, affordable DNA sequencing). Nevertheless, many resources in our field are still based on the Brown Corpus, including the Penn TreeBank (Marcus et al., 1993) and SemCor.

WordNet Coverage and SemCor
WordNet (Miller et al., 1990; Miller, 1995; Fellbaum, 1998; Miller and Fellbaum, 2007; Vossen and Fellbaum, 2021) is widely cited because of accessibility as well as coverage. Why is the coverage as good as it is, and how can it be improved? Unlike other methods for constructing lexical resources (Lenat, 1995; Sinclair, 1989; Hanks, 2008), WordNet was developed in tandem with SemCor, a small subset of the Brown Corpus, tagged with pointers into WordNet. The team constantly tracked coverage, as indicated by the reference to 96% below:

[SemCor] starts with the corpus and proceeds through it word by word... This procedure has the advantage of immediately revealing deficiencies in the lexicon: not only missing words (which could be found more directly), but also missing senses and indistinguishable definitions-deficiencies that would not surface so quickly with [alternatives]... we ... adopted the [SemCor] approach for the bulk of our semantic tagging... over several months ... estimates of ... coverage have been slowly improving... it is currently averaging a little better than 96%. (Miller et al., 1993)

The SemCor process helped manage growth. In 1993, they were adding almost 1k concepts per month. The number of synsets (word senses) nearly doubled from 63k in 1993 to 118k today. In addition, the process led to the creation of SemCor 3.0, a subset of about 20% of the Brown Corpus tagged with WordNet senses.
While SemCor has much to recommend it, there are also some obvious concerns. SemCor is only 200k words, probably not enough given Rapp's estimates above. Coverage of WordNet could be improved by building something like SemCor, but based on a larger corpus of more modern material. Alternatively, it might be possible to combine small annotated corpora with larger unannotated corpora.

Knowledge Graph Completion (KGC)
An alternative suggestion for improving WordNet coverage is Knowledge Graph Completion (KGC) (Nguyen, 2017; Wang et al., 2017; Yu et al., 2019). A standard KGC benchmark is WN18. WN18 is a graph G = (V, E). There are 41k vertices, V. Each vertex is a WordNet synset, a pointer to a set of synonymous lemmas in WordNet. There are 118k such synsets in WordNet.
The edges, E, connect two vertices with one of 18 relations. The relations also come from WordNet. Some relations are more frequent than others.
Many of the relations come in pairs, as shown in Table 2. By construction, if x is-a y, then there will be a hypernym link from x to y, as well as a hyponym link from y to x. We will refer to the backward links as inverses.
The KGC task is to infer subsets of these graphs from other subsets of these graphs. That is, KGC splits E randomly into three sets: train, validation and test. WN18 consists of 141k edges in train, 5k in validation and 5k in test.
For each set, we have a set of input features, X, and a set of output labels, Y. The standard procedure uses X_train and Y_train to fit a model. This model is used to predict Ŷ_test from X_test. The predicted values, Ŷ_test, are compared with the gold labels, Y_test, to compute a score.
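A minimal sketch of this protocol, assuming edges are (head, relation, tail) triples and leaving the model unspecified; the split sizes match WN18. Published KGC results typically use ranking metrics such as hits@10; the score here simply checks which gold edges were recovered.

```python
# Sketch of the standard KGC protocol on WN18-style data. Edges are
# (head, relation, tail) triples; the model itself is left abstract.
import random

def split_edges(edges, n_valid=5000, n_test=5000, seed=0):
    """Randomly split E into train / validation / test, as in WN18."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    test, valid = edges[:n_test], edges[n_test : n_test + n_valid]
    train = edges[n_test + n_valid :]
    return train, valid, test

def score(predicted, gold):
    """Fraction of gold test edges recovered by the model's predictions."""
    predicted = set(predicted)
    return sum(e in predicted for e in gold) / len(gold)
```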

Information Leakage in KGC
Some of the leakage in the WordNet benchmark, WN18, is well-known and some is not. WN18RR is a reduced subset of WN18 that corrects for the known leakage (Dettmers et al., 2018). The correction removes the 7 inverse relations on the right hand side of Table 2, resulting in the test set shown in Table 3. Before the correction, there are 5000 edges over 18 relations in the WN18 test set. After the correction, there are 3134 edges over 11 relations in the WN18RR test set.

Unfortunately, there is even more leakage in WN18RR that has not been previously reported. Note that "derivationally related forms" also come in pairs. By construction, derivationally related links are symmetric: xRy ⇒ yRx. That is, if there is an edge in one direction, then there will also be an inverse edge in the reverse direction. This symmetry will leak information between train and test because it is likely that one member of the pair appears in train and the other appears in test. Table 4 shows that many of these pairs are indeed leaking information in this way. The table shows how these "derivationally related" edges, xRy, and their inverses, yRx, are distributed across the WN18RR test, train and validation splits.
In particular, every one of the 1074 derivationally related edges in the WN18RR test set also appears somewhere in WN18RR in the reverse direction. The 1074 reversed edges are split across test (24), train (1011) and valid (39).
Because of this leakage, a system can do very well on this benchmark without learning anything useful about WordNet: simply reverse the edges in the training and validation sets and predict that those reversed edges will appear in the test set (unless they have already been seen in the training or validation sets). Such a system will correctly predict 1 − 24/1074 ≈ 98% of the derivationally related edges (1074/3134 = 34% of the test set), missing only the 24 pairs whose two directions both fall in test. A sketch of this baseline follows.
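The sketch below implements the reversal baseline. It assumes triples are (head, relation, tail) tuples and uses the relation identifier "_derivationally_related_form" as it appears in the released benchmark files.

```python
# Sketch of the reversal baseline described above. Triples are
# (head, relation, tail); REL names the symmetric relation in WN18RR.
REL = "_derivationally_related_form"

def reversal_baseline(train, valid, test):
    """Reverse known edges and predict them in test, unless already seen."""
    known = set(train) | set(valid)
    predicted = {(t, r, h) for h, r, t in known if r == REL} - known
    gold = {(h, r, t) for h, r, t in test if r == REL}
    return len(gold & predicted) / len(gold)  # ~1 - 24/1074 ≈ 98% on WN18RR
```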
One could correct for this leakage by removing the redundant edges, just as redundant edges were removed to reduce WN18 to WN18RR. A sketch of such a correction, applied to the full edge set before splitting, follows.
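```python
def remove_symmetric_duplicates(edges, rel=REL):
    """Keep only one direction of each symmetric pair, before splitting."""
    kept, seen = [], set()
    for h, r, t in edges:
        if r == rel and (t, r, h) in seen:
            continue  # the reverse direction has already been kept
        kept.append((h, r, t))
        seen.add((h, r, t))
    return kept
```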

Corpus Sizes
Size is perhaps a more serious problem than leakage. Learning edges in WN18 is a second order task. As shown in Table 1, second order tasks typically require a corpus of 100M words or more. Unfortunately, WN18 is based on SemCor (indirectly, via WordNet). SemCor is a 200k word sample, too small for second order tasks. Inferences on downstream graphs (such as WN18) are unlikely to capture associations on pairs of words.
KGC is learning subsets of WordNet from other subsets of WordNet. But given Table 1, to improve WordNet, we need more data, not less. Modern corpora are 1000x larger than SemCor, and more representative of text from this century. We believe it is more profitable to collect more data (and more representative data) than to infer information that is not in the WordNet graph (or the underlying SemCor corpus).
KGC can be viewed as similar to downsampling in speech, where there is a well-known difference between upsampling and downsampling. In speech, it is relatively easy to downsample a waveform from 16 kHz down to telephone bandwidth (8 kHz), but harder to invert the process (upsampling). That is, we can always throw away information by low pass filtering and decimating. But it is harder to recover the high frequency information after it has been thrown away.
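The analogy can be made concrete with a short scipy sketch (illustrative parameters, not a claim about any particular corpus): decimate a 16 kHz signal containing a 6 kHz tone down to 8 kHz, resample back, and observe that the energy above the 4 kHz cutoff is gone.

```python
# Sketch of the downsampling analogy. Decimation (low-pass filter + keep
# every other sample) is easy; upsampling cannot restore the high band.
import numpy as np
from scipy.signal import decimate, resample

fs = 16_000
t = np.arange(fs) / fs                      # one second of audio
x = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 6_000 * t)

x8 = decimate(x, 2)                         # low-pass filter + decimate to 8 kHz
x16 = resample(x8, len(x))                  # upsample back to 16 kHz

spectrum = np.abs(np.fft.rfft(x16))
freqs = np.fft.rfftfreq(len(x16), 1 / fs)
high_band = spectrum[freqs > 4_000].max()   # ~0: the 6 kHz tone is gone
```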
SemCor can be viewed as a small sample of contemporary language, downsampled with a strong bias favoring American English of the 1960s. Rapp's estimates suggest there is more information in larger corpora than in smaller corpora. Thus, the downsampling process is throwing away information that cannot be recovered. Obviously, KGC cannot recover this information, and it is also unlikely to learn anything about language use since the 1960s, let alone other dialects and languages.

Conclusions
What is needed to improve WordNet coverage? We started with Rapp's estimates of ρ, correlations between corpus statistics and psycholinguistic norms. ρ improves with quantity (corpus size) and quality (balance). Unbalanced corpora need to be about 100x larger than balanced corpora. Estimates of second order quantities (word associations and edges in WordNet) require at least 100x more data than first order quantities (frequency/familiarity). Rapp's estimates suggest there is more information in larger samples than in smaller samples.
WordNet is based on SemCor. It is remarkable that WordNet works as well as it does, given Rapp's estimates. One approach to improving coverage is Knowledge Graph Completion (KGC), which attempts to learn missing links from subsets of the graph. The KGC benchmarks, WN18 and WN18RR, are deeply flawed: information is leaking between training and test sets, some of it previously reported and some not. But more seriously, if SemCor is already too small and dated, data collection is more likely to succeed than attempts to infer information that is not there.