Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis

Anh Dang, Abidalrahman Moh’d, Aminul Islam, Rosane Minghim, Michael Smit, Evangelos Milios


Abstract
This paper introduces a new large-scale n-gram corpus created specifically from social media text. Two characteristics distinguish this corpus: its monthly temporal attribute and its source, 1.65 billion user-generated comments on Reddit. The usefulness of the corpus is demonstrated and evaluated with a novel Topic-based Latent Semantic Analysis (TLSA) algorithm. The experimental results show that unsupervised TLSA outperforms all state-of-the-art unsupervised and semi-supervised methods on the SemEval 2015 paraphrase and semantic similarity in Twitter tasks.
Anthology ID:
C16-1335
Volume:
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Month:
December
Year:
2016
Address:
Osaka, Japan
Venue:
COLING
Publisher:
The COLING 2016 Organizing Committee
Pages:
3553–3564
URL:
https://aclanthology.org/C16-1335
Cite (ACL):
Anh Dang, Abidalrahman Moh’d, Aminul Islam, Rosane Minghim, Michael Smit, and Evangelos Milios. 2016. Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3553–3564, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis (Dang et al., COLING 2016)
PDF:
https://aclanthology.org/C16-1335.pdf
Data
PIT