Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification

Tu Vu, Mohit Iyyer


Abstract
While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective outperforms paragraph reconstruction, yielding (1) higher downstream classification accuracies on benchmark datasets, (2) faster training, and (3) better generalization.
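To make the sentence content task concrete: given a paragraph and a candidate sentence, the probe must decide whether the sentence occurs in that paragraph. The toy sketch below (not the authors' actual probe, which trains a classifier over learned embeddings; function names here are hypothetical) shows why a bag-of-words signal already suffices for this membership decision: a sentence drawn from the paragraph has all of its words covered by the paragraph's word counts.

```python
from collections import Counter

def bow(text):
    """Lowercased bag-of-words counts for a whitespace-tokenized string."""
    return Counter(text.lower().split())

def sentence_in_paragraph(paragraph, sentence):
    """Heuristic probe: predict that `sentence` occurs in `paragraph`
    iff the paragraph's bag of words covers the sentence's bag of words."""
    p_counts, s_counts = bow(paragraph), bow(sentence)
    return all(p_counts[w] >= c for w, c in s_counts.items())

paragraph = "The cat sat on the mat . It purred loudly ."
pos = "It purred loudly ."        # sentence taken from the paragraph
neg = "The dog barked at noon ."  # sentence from a different paragraph

print(sentence_in_paragraph(paragraph, pos))  # True
print(sentence_in_paragraph(paragraph, neg))  # False
```

The paper's finding is that a strong reconstruction-trained paragraph embedding fails at exactly this kind of membership query, while even crude lexical-overlap features like the above handle it easily.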
Anthology ID: P19-1638
Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2019
Address: Florence, Italy
Editors: Anna Korhonen, David Traum, Lluís Màrquez
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6331–6338
URL: https://aclanthology.org/P19-1638
DOI: 10.18653/v1/P19-1638
Cite (ACL): Tu Vu and Mohit Iyyer. 2019. Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6331–6338, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification (Vu & Iyyer, ACL 2019)
PDF: https://aclanthology.org/P19-1638.pdf
Poster: P19-1638.Poster.pdf
Code: tuvuumass/SCoPE