DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections

Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, Ilya Eckstein


Abstract
This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision. We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities – strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by results, our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and are also able to scale to very large corpora. Finally, we make our datasets and pre-trained models publicly available. This includes Reviews2Movielens, mapping the ~1B word corpus of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), as well as Reddit Movie Suggestions with natural language queries and corresponding community recommendations.
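To make the training idea concrete, here is a minimal, hypothetical sketch (PyTorch; not the authors' released code) of a joint objective in the spirit the abstract describes: a single entity embedding is trained both to predict words drawn from any text associated with that entity and to be recoverable from an encoding of that text. The class name, the simple bag-of-words encoder, and all dimensions are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEntityWordSketch(nn.Module):
    """Illustrative only: one embedding per entity, trained with a joint
    word-prediction and entity-prediction loss over entity-associated text."""
    def __init__(self, num_entities, vocab_size, dim=256):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)     # learned entity representation
        self.word_head = nn.Linear(dim, vocab_size)           # entity -> words in its documents
        self.text_encoder = nn.EmbeddingBag(vocab_size, dim)  # stand-in for a real text encoder
        self.entity_head = nn.Linear(dim, num_entities)       # encoded text -> entity

    def forward(self, entity_ids, word_ids, target_words):
        # entity_ids: (batch,) entity indices
        # word_ids: (batch, doc_len) word indices from text associated with each entity
        # target_words: (batch,) words sampled from that text to reconstruct
        ent = self.entity_emb(entity_ids)
        word_loss = F.cross_entropy(self.word_head(ent), target_words)
        text = self.text_encoder(word_ids)
        entity_loss = F.cross_entropy(self.entity_head(text), entity_ids)
        return word_loss + entity_loss

The paper's actual training strategies use richer encoders and prediction schemes; the sketch only conveys the joint word-and-entity prediction objective over entity-level context.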
Anthology ID: 2021.eacl-main.217
Volume: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month: April
Year: 2021
Address: Online
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 2540–2549
URL: https://aclanthology.org/2021.eacl-main.217
DOI: 10.18653/v1/2021.eacl-main.217
Cite (ACL):
Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, and Ilya Eckstein. 2021. DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2540–2549, Online. Association for Computational Linguistics.
Cite (Informal):
DOCENT: Learning Self-Supervised Entity Representations from Large Document Collections (Zemlyanskiy et al., EACL 2021)
PDF: https://aclanthology.org/2021.eacl-main.217.pdf
Data: IMDb Movie Reviews, MovieLens