Nafis Sadeq


2023

Unsupervised Improvement of Factual Knowledge in Language Models
Nafis Sadeq | Byungkyu Kang | Prarit Lamba | Julian McAuley
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Masked language modeling (MLM) plays a key role in pretraining large language models. But the MLM objective is often dominated by high-frequency words that are sub-optimal for learning factual knowledge. In this work, we propose an approach for influencing MLM pretraining in a way that can improve language model performance on a variety of knowledge-intensive tasks. We force the language model to prioritize informative words in a fully unsupervised way. Experiments demonstrate that the proposed approach can significantly improve the performance of pretrained language models on tasks such as factual recall, question answering, sentiment analysis, and natural language inference in a closed-book setting.

2022

InforMask: Unsupervised Informative Masking for Language Model Pretraining
Nafis Sadeq | Canwen Xu | Julian McAuley
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal because it allocates an equal masking rate to all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.
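The idea of using PMI to pick informative tokens can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how PMI is aggregated into a per-token score, so the sentence-level co-occurrence counts, the sum-of-PMI scoring rule, and the function names `pmi_table`, `token_scores`, and `choose_mask_positions` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """PMI for every unordered token pair, using sentence-level co-occurrence.

    Assumption: two tokens "co-occur" if they appear in the same sentence.
    """
    n = len(sentences)
    tok_count = Counter()
    pair_count = Counter()
    for sent in sentences:
        uniq = sorted(set(sent))  # sorted so each pair has one canonical order
        tok_count.update(uniq)
        pair_count.update(combinations(uniq, 2))
    pmi = {}
    for (a, b), c in pair_count.items():
        p_ab = c / n
        p_a = tok_count[a] / n
        p_b = tok_count[b] / n
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


def token_scores(sent, pmi):
    """Score each position by the sum of its PMI with the other tokens.

    Assumption: this aggregation is a stand-in, not the paper's scoring rule.
    """
    scores = []
    for i, t in enumerate(sent):
        total = 0.0
        for j, u in enumerate(sent):
            if i != j and t != u:
                key = (t, u) if t < u else (u, t)
                total += pmi.get(key, 0.0)
        scores.append(total)
    return scores


def choose_mask_positions(sent, pmi, k=1):
    """Return the k highest-scoring positions, in sentence order."""
    scores = token_scores(sent, pmi)
    ranked = sorted(range(len(sent)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["the", "capital", "of", "france", "is", "paris"],
    ["the", "capital", "of", "japan", "is", "tokyo"],
    ["the", "cat", "is", "here"],
    ["the", "dog", "is", "here"],
]
pmi = pmi_table(corpus)
positions = choose_mask_positions(corpus[0], pmi, k=2)
masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(corpus[0])]
print(" ".join(masked))
```

In this toy setting, tokens that appear in every sentence (such as "the" and "is") have a PMI of zero with everything else, so they are never preferred for masking, whereas rarer, mutually predictive tokens score higher.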

2020

Preparation of Bangla Speech Corpus from Publicly Available Audio & Text
Shafayat Ahmed | Nafis Sadeq | Sudipta Saha Shubha | Md. Nahidul Islam | Muhammad Abdullah Adnan | Mohammad Zuberul Islam
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic speech recognition systems require a large annotated speech corpus. The manual annotation of a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus for Bangladeshi Bangla. We have used publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a huge amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We have leveraged speaker diarization, gender detection, etc. to prepare the annotated corpus. We have also prepared a synthetic speech corpus for handling out-of-vocabulary word problems in the Bangla language. Our corpus is suitable for training with Kaldi. Experimental results show that the use of our corpus in addition to the Google Speech corpus (229 hours) significantly improves the performance of the ASR system.

Improving End-to-End Bangla Speech Recognition with Semi-supervised Training
Nafis Sadeq | Nafis Tahmid Chowdhury | Farhan Tanvir Utshaw | Shafayat Ahmed | Muhammad Abdullah Adnan
Findings of the Association for Computational Linguistics: EMNLP 2020

Automatic speech recognition systems usually require a large annotated speech corpus for training. The manual annotation of a large corpus is very difficult. It can be very helpful to use unsupervised and semi-supervised learning methods in addition to supervised learning. In this work, we focus on using a semi-supervised training approach for Bangla Speech Recognition that can exploit large unpaired audio and text data. We encode speech and text data in an intermediate domain and propose a novel loss function based on the global encoding distance between encoded data to guide the semi-supervised training. Our proposed method reduces the Word Error Rate (WER) of the system from 37% to 31.9%.
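For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below uses that standard definition; the abstract does not describe the authors' evaluation script, so the function names `word_error_rate` and `relative_reduction` and the example strings are illustrative assumptions. It also shows that the reported drop from 37% to 31.9% is 5.1 points absolute, or roughly 13.8% relative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to its old value."""
    return (before - after) / before


# Hypothetical transcripts: one substitution and one deletion over four words.
print(word_error_rate("a b c d", "a x c"))

# Figures reported in the abstract: WER falls from 37% to 31.9%.
print(round(100 * relative_reduction(37.0, 31.9), 1))
```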

2019

Customizing Grapheme-to-Phoneme System for Non-Trivial Transcription Problems in Bangla Language
Sudipta Saha Shubha | Nafis Sadeq | Shafayat Ahmed | Md. Nahidul Islam | Muhammad Abdullah Adnan | Md. Yasin Ali Khan | Mohammad Zuberul Islam
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Grapheme-to-phoneme (G2P) conversion is an integral part of various text and speech processing systems, such as text-to-speech and speech recognition systems. The existing methodologies for G2P conversion in the Bangla language are mostly rule-based. However, data-driven approaches have proved superior to rule-based approaches for large-scale G2P conversion in other languages, such as English and German. Because the performance of data-driven approaches to G2P conversion depends largely on the pronunciation lexicon on which the system is trained, in this paper we investigate developing an improved training lexicon by identifying and categorizing the critical cases in the Bangla language and including those cases in the training lexicon to build a robust G2P conversion system for Bangla. Additionally, we have incorporated nasal vowels in our proposed phoneme list. Our methodology outperforms other state-of-the-art approaches for G2P conversion in the Bangla language.