Multifaceted Domain-Specific Document Embeddings

Julian Risch, Philipp Hager, Ralf Krestel


Abstract
Current document embeddings require large training corpora but fail to learn high-quality representations when confronted with a small number of domain-specific documents and rare terms. Further, they transform each document into a single embedding vector, making it hard to capture different notions of document similarity or explain why two documents are considered similar. In this work, we propose our Faceted Domain Encoder, a novel approach to learn multifaceted embeddings for domain-specific documents. It is based on a Siamese neural network architecture and leverages knowledge graphs to further enhance the embeddings even if only a few training samples are available. The model identifies different types of domain knowledge and encodes them into separate dimensions of the embedding, thereby enabling multiple ways of finding and comparing related documents in the vector space. We evaluate our approach on two benchmark datasets and find that it achieves the same embedding quality as state-of-the-art models while requiring only a tiny fraction of their training data. An interactive demo, our source code, and the evaluation datasets are available online at https://hpi.de/naumann/s/multifaceted-embeddings, and a screencast is available on YouTube: https://youtu.be/HHcsX2clEwg.
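
The abstract describes the architecture only at a high level. The sketch below illustrates the general idea of encoding each knowledge-graph facet into its own block of the document embedding and training two weight-sharing encoders (a Siamese setup) against gold similarity scores. It is a minimal PyTorch illustration under our own assumptions: the FacetedEncoder class, the facet names, the GRU encoder, the masked pooling, and the regression objective are all hypothetical and are not taken from the released philipphager/faceted-domain-encoder implementation.

```python
# Minimal sketch (not the authors' implementation) of the core idea behind the
# Faceted Domain Encoder: a Siamese encoder that writes each knowledge-graph
# facet (e.g. "disease", "chemical") into its own slice of the document
# embedding, so documents can be compared per facet or overall.
# All class, facet, and dimension names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FacetedEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, facets: list, facet_dim: int):
        super().__init__()
        self.facets = facets
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
        # One projection head per facet; their outputs are concatenated.
        self.heads = nn.ModuleDict({f: nn.Linear(2 * emb_dim, facet_dim) for f in facets})

    def forward(self, token_ids: torch.Tensor, facet_mask: torch.Tensor) -> torch.Tensor:
        # token_ids:  (batch, seq_len)           token indices
        # facet_mask: (batch, seq_len, n_facets) 1.0 where a token belongs to a facet
        #             (e.g. obtained via knowledge-graph entity linking)
        hidden, _ = self.gru(self.embedding(token_ids))          # (batch, seq, 2*emb_dim)
        facet_vectors = []
        for i, facet in enumerate(self.facets):
            mask = facet_mask[:, :, i].unsqueeze(-1)             # (batch, seq, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)
            facet_vectors.append(self.heads[facet](pooled))      # (batch, facet_dim)
        # Each facet occupies its own, interpretable block of the final embedding.
        return torch.cat(facet_vectors, dim=-1)                  # (batch, n_facets*facet_dim)


def facet_similarity(a: torch.Tensor, b: torch.Tensor, facet_dim: int) -> torch.Tensor:
    """Cosine similarity computed per facet block; returns (batch, n_facets)."""
    a_blocks = a.view(a.size(0), -1, facet_dim)
    b_blocks = b.view(b.size(0), -1, facet_dim)
    return F.cosine_similarity(a_blocks, b_blocks, dim=-1)


# Siamese training step on hypothetical data: the same encoder (shared weights)
# embeds both documents, and a regression loss pulls their predicted similarity
# toward a gold similarity score, as in sentence-similarity benchmarks.
encoder = FacetedEncoder(vocab_size=10_000, emb_dim=128,
                         facets=["disease", "chemical", "gene"], facet_dim=64)
doc_a = torch.randint(1, 10_000, (2, 50))
doc_b = torch.randint(1, 10_000, (2, 50))
mask_a = torch.randint(0, 2, (2, 50, 3)).float()
mask_b = torch.randint(0, 2, (2, 50, 3)).float()
gold = torch.tensor([0.9, 0.2])                                  # gold similarity in [0, 1]

emb_a, emb_b = encoder(doc_a, mask_a), encoder(doc_b, mask_b)
pred = F.cosine_similarity(emb_a, emb_b, dim=-1)
loss = F.mse_loss(pred, gold)
loss.backward()
print(facet_similarity(emb_a, emb_b, facet_dim=64))              # per-facet similarities
```

Because every facet maps to a fixed block of dimensions, the per-facet similarities make it possible to explain which kind of domain knowledge two documents share, which is the interpretability benefit the abstract highlights.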
Anthology ID:
2021.naacl-demos.9
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations
Month:
June
Year:
2021
Address:
Online
Editors:
Avi Sil, Xi Victoria Lin
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
78–83
URL:
https://aclanthology.org/2021.naacl-demos.9
DOI:
10.18653/v1/2021.naacl-demos.9
Cite (ACL):
Julian Risch, Philipp Hager, and Ralf Krestel. 2021. Multifaceted Domain-Specific Document Embeddings. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pages 78–83, Online. Association for Computational Linguistics.
Cite (Informal):
Multifaceted Domain-Specific Document Embeddings (Risch et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-demos.9.pdf
Video:
https://aclanthology.org/2021.naacl-demos.9.mp4
Code:
philipphager/faceted-domain-encoder
Data:
BIOSSES