Semantic Similarity Frontiers: From Concepts to Documents

David Jurgens, Mohammad Taher Pilehvar


Abstract
Semantic similarity forms a central component in many NLP systems, from lexical semantics to part-of-speech tagging to social media analysis. Recent years have seen a renewed interest in developing new similarity techniques, buoyed in part by work on embeddings and by SemEval tasks in Semantic Textual Similarity and Cross-Level Semantic Similarity. The increased interest has led to hundreds of techniques for measuring semantic similarity, which makes it difficult for practitioners to identify which state-of-the-art techniques are applicable and easily integrated into projects, and for researchers to identify which aspects of the problem require future research.
This tutorial synthesizes the current state of the art for measuring semantic similarity for all types of conceptual or textual pairs and presents a broad overview of current techniques, what resources they use, and the particular inputs or domains to which the methods are most applicable. We survey methods ranging from corpus-based approaches operating on massive or domain-specific corpora to those leveraging structural information from expert-based or collaboratively constructed lexical resources. Furthermore, we review work on multiple similarity tasks, from sense-based comparisons to word, sentence, and document-sized comparisons, and highlight general-purpose methods capable of comparing multiple types of inputs. Where possible, we also identify techniques that have been demonstrated to successfully operate in multilingual or cross-lingual settings.
Our tutorial provides a clear overview of currently available tools and their strengths for practitioners who need out-of-the-box solutions, and provides researchers with an understanding of the limitations of the current state of the art and what open problems remain in the field. Given the breadth of available approaches, participants will also receive a detailed bibliography of approaches (including those not directly covered in the tutorial), annotated according to each approach's abilities, and pointers to where open-source implementations of the algorithms may be obtained.
Anthology ID:
D15-2001
Volume:
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Month:
September
Year:
2015
Address:
Lisbon, Portugal
Editors:
Wenjie Li, Khalil Sima'an
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/D15-2001
Cite (ACL):
David Jurgens and Mohammad Taher Pilehvar. 2015. Semantic Similarity Frontiers: From Concepts to Documents. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, Lisbon, Portugal. Association for Computational Linguistics.
Cite (Informal):
Semantic Similarity Frontiers: From Concepts to Documents (Jurgens & Pilehvar, EMNLP 2015)