Md. Saiful Islam

Also published as: Md Saiful Islam


2021

pdf bib
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Tahmid Hasan | Abhik Bhattacharjee | Md. Saiful Islam | Kazi Mubasshir | Yuan-Fang Li | Yong-Bin Kang | M. Sohel Rahman | Rifat Shahriyar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts
Khondoker Ittehadul Islam | Sudipta Kar | Md Saiful Islam | Mohammad Ruhul Amin
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, education, and agriculture. These comments are labeled with one of the polarity labels, namely positive, negative, and neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features provide superior performance than neural network and pretrained language models. We have made the dataset and accompanying models presented in this paper publicly available at https://git.io/JuuNB.

pdf bib
Syntax and Themes: How Context Free Grammar Rules and Semantic Word Association Influence Book Success
Henry Gorelick | Biddut Sarker Bijoy | Syeda Jannatus Saba | Sudipta Kar | Md Saiful Islam | Mohammad Ruhul Amin
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we attempt to improve upon the state-of-the-art in predicting a novel’s success by modeling the lexical semantic relationships of its contents. We created the largest dataset used in such a project containing lexical data from 17,962 books from Project Gutenberg. We utilized domain specific feature reduction techniques to implement the most accurate models to date for predicting book success, with our best model achieving an average accuracy of 94.0%. By analyzing the model parameters, we extracted the successful semantic relationships from books of 12 different genres. We finally mapped those semantic relations to a set of themes, as defined in Roget’s Thesaurus and discovered the themes that successful books of a given genre prioritize. At the end of the paper, we further showed that our model demonstrate similar performance for book success prediction even when Goodreads rating was used instead of download count to measure success.

pdf bib
A Study on Using Semantic Word Associations to Predict the Success of a Novel
Syeda Jannatus Saba | Biddut Sarker Bijoy | Henry Gorelick | Sabir Ismail | Md Saiful Islam | Mohammad Ruhul Amin
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Many new books get published every year, and only a fraction of them become popular among the readers. So the prediction of a book success can be a very useful parameter for publishers to make a reliable decision. This article presents the study of semantic word associations using the word embedding of book content for a set of Roget’s thesaurus concepts for book success prediction. In this work, we discuss the method to represent a book as a spectrum of concepts based on the association score between its content embedding and a global embedding (i.e. fastText) for a set of semantically linked word clusters. We show that the semantic word associations outperform the previous methods for book success prediction. In addition, we present that semantic word associations also provide better results than using features like the frequency of word groups in Roget’s thesaurus, LIWC (a popular tool for linguistic inquiry and word count), NRC (word association emotion lexicon), and part of speech (PoS). Our study reports that concept associations based on Roget’s Thesaurus using word embedding of individual novel resulted in the state-of-the-art performance of 0.89 average weighted F1-score for book success prediction. Finally, we present a set of dominant themes that contribute towards the popularity of a book for a specific genre.

pdf bib
Hitting your MARQ: Multimodal ARgument Quality Assessment in Long Debate Video
Md Kamrul Hasan | James Spann | Masum Hasan | Md Saiful Islam | Kurtis Haut | Rada Mihalcea | Ehsan Hoque
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The combination of gestures, intonations, and textual content plays a key role in argument delivery. However, the current literature mostly considers textual content while assessing the quality of an argument, and it is limited to datasets containing short sequences (18-48 words). In this paper, we study argument quality assessment in a multimodal context, and experiment on DBATES, a publicly available dataset of long debate videos. First, we propose a set of interpretable debate centric features such as clarity, content variation, body movement cues, and pauses, inspired by theories of argumentation quality. Second, we design the Multimodal ARgument Quality assessor (MARQ) – a hierarchical neural network model that summarizes the multimodal signals on long sequences and enriches the multimodal embedding with debate centric features. Our proposed MARQ model achieves an accuracy of 81.91% on the argument quality prediction task and outperforms established baseline models with an error rate reduction of 22.7%. Through ablation studies, we demonstrate the importance of multimodal cues in modeling argument quality.

2020

pdf bib
BanFakeNews: A Dataset for Detecting Fake News in Bangla
Md Zobaer Hossain | Md Ashraful Rahman | Md Saiful Islam | Sudipta Kar
Proceedings of the 12th Language Resources and Evaluation Conference

Observing the damages that can be done by the rapid propagation of fake news in various sectors like politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely being developed for English where low resource languages remain out of the focus. But the risks spawned by fake and manipulative news are not confined by languages. In this work, we propose an annotated dataset of ≈ 50K news that can be used for building automated fake news detection systems for a low resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state of the art NLP techniques to identify Bangla fake news. To create this system, we explore traditional linguistic features and neural network based methods. We expect this dataset will be a valuable resource for building technologies to prevent the spreading of fake news and contribute in research with low resource languages. The dataset and source code are publicly available at https://github.com/Rowan1697/FakeNews.