Optimized Web-Crawling of Conversational Data from Social Media and Context-Based Filtering

Annapurna P Patil, Rajarajeswari Subramanian, Gaurav Karkal, Keerthana Purushotham, Jugal Wadhwa, K Dhanush Reddy, Meer Sawood


Abstract
Building chatbots requires a large amount of conversational data. In this paper, a web crawler is designed to fetch multi-turn dialogues from websites such as Twitter, YouTube and Reddit in the form of a JavaScript Object Notation (JSON) file. Tools such as the Twitter Application Programming Interface (API), the LXML library, and the JSON library are used to crawl Twitter, YouTube and Reddit to collect conversational chat data. The data obtained in raw form cannot be used directly, as it contains only text and metadata such as author name and time, which provide more information on the chat data being scraped. The collected data has to be formatted for the intended use case, and Python's JSON library allows us to format it easily. The scraped dialogues are further filtered based on the context of a search keyword, without introducing bias and with flexible strictness of classification.
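The formatting step mentioned in the abstract can be illustrated with a minimal sketch using Python's standard json module. This is not the authors' implementation: the function name format_dialogue, the field names (source, turns, author, time, text), and the sample records are all assumptions made for illustration, since the abstract does not specify the output schema.

```python
import json


def format_dialogue(source, turns):
    """Package scraped turns as one multi-turn dialogue record.

    The schema here (source/turns/author/time/text) is hypothetical;
    the paper's actual JSON layout is not given in the abstract.
    """
    return {
        "source": source,
        "turns": [
            {
                "author": t["author"],
                "time": t["time"],
                # Trim stray whitespace left over from scraping.
                "text": t["text"].strip(),
            }
            for t in turns
        ],
    }


# Made-up sample turns standing in for crawler output.
raw_turns = [
    {"author": "user_a", "time": "2020-12-01T10:00:00Z",
     "text": " Any tips for training a chatbot? "},
    {"author": "user_b", "time": "2020-12-01T10:05:00Z",
     "text": "Start with clean multi-turn data."},
]

record = format_dialogue("reddit", raw_turns)

# ensure_ascii=False keeps non-ASCII characters readable in the output.
serialized = json.dumps(record, indent=2, ensure_ascii=False)
print(serialized)
```

Keeping each turn's author and time next to its text means the metadata travels with the dialogue, so a later filtering stage could use it without re-crawling.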
Anthology ID:
2020.icon-workshop.5
Volume:
Proceedings of the Workshop on Joint NLP Modelling for Conversational AI @ ICON 2020
Month:
December
Year:
2020
Address:
Patna, India
Editors:
Praveen Kumar G S, Siddhartha Mukherjee, Ranjan Samal
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
33–39
URL:
https://aclanthology.org/2020.icon-workshop.5
Cite (ACL):
Annapurna P Patil, Rajarajeswari Subramanian, Gaurav Karkal, Keerthana Purushotham, Jugal Wadhwa, K Dhanush Reddy, and Meer Sawood. 2020. Optimized Web-Crawling of Conversational Data from Social Media and Context-Based Filtering. In Proceedings of the Workshop on Joint NLP Modelling for Conversational AI @ ICON 2020, pages 33–39, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
Optimized Web-Crawling of Conversational Data from Social Media and Context-Based Filtering (Patil et al., ICON 2020)
PDF:
https://aclanthology.org/2020.icon-workshop.5.pdf