Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage?

Kenneth Church, Yuchen Bian


Abstract
This survey/position paper discusses ways to improve coverage of resources such as WordNet. Rapp estimated correlations, rho, between corpus statistics and psycholinguistic norms. These correlations improve with quantity (corpus size) and quality (balance). 1M words is enough for simple estimates (unigram frequencies), but at least 100x more is required for good estimates of word associations and embeddings. Given such estimates, WordNet’s coverage is remarkable. WordNet was developed on SemCor, a small sample (200k words) from the Brown Corpus. Knowledge Graph Completion (KGC) attempts to learn missing links from subsets. But Rapp’s estimates of sizes suggest it would be more profitable to collect more data than to infer missing information that is not there.
Anthology ID:
2021.emnlp-main.501
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6210–6215
URL:
https://aclanthology.org/2021.emnlp-main.501
DOI:
10.18653/v1/2021.emnlp-main.501
Cite (ACL):
Kenneth Church and Yuchen Bian. 2021. Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6210–6215, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage? (Church & Bian, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.501.pdf
Video:
https://aclanthology.org/2021.emnlp-main.501.mp4