Modern digital personal assistants interact with users through voice. Therefore, they heavily rely on automatic speech recognition (ASR) in order to convert speech to text and perform further tasks. We introduce EBERT, which stands for EmbraceBERT, with the goal of extracting more robust latent representations for the task of noisy ASR text classification. Conventionally, BERT is fine-tuned for downstream classification tasks using only the [CLS] starter token, with the remaining tokens being discarded. We propose using all encoded transformer tokens and further encode them using a novel attentive embracement layer and multi-head attention layer. This approach uses the otherwise discarded tokens as a source of additional information and the multi-head attention in conjunction with the attentive embracement layer to select important features from clean data during training. This allows for the extraction of a robust latent vector resulting in improved classification performance during testing when presented with noisy inputs. We show the impact of our model on both the Chatbot and Snips corpora for intent classification with ASR error. Results, in terms of F1-score and mean between 10 runs, show that our model significantly outperforms the baseline model.
Recently, intelligent dialog systems and smart assistants have attracted the attention of many, and development of novel dialogue agents have become a research challenge. Intelligent agents that can handle both domain-specific task-oriented and open-domain chit-chat dialogs are one of the major requirements in the current systems. In order to address this issue and to realize such smart hybrid dialogue systems, we develop a model to discriminate user utterance between task-oriented and chit-chat conversations. We introduce a hybrid of convolutional neural network (CNN) and a lateral multiple timescale gated recurrent units (LMTGRU) that can represent multiple temporal scale dependencies for the discrimination task. With the help of the combined slow and fast units of the LMTGRU, our model effectively determines whether a user will have a chit-chat conversation or a task-specific conversation with the system. We also show that the LMTGRU structure helps the model to perform well on longer text inputs. We address the lack of dataset by constructing a dataset using Twitter and Maluuba Frames data. The results of the experiments demonstrate that the proposed hybrid network outperforms the conventional models on the chat discrimination task as well as performed comparable to the baselines on various benchmark datasets.
A novel character-level neural language model is proposed in this paper. The proposed model incorporates a biologically inspired temporal hierarchy in the architecture for representing multiple compositions of language in order to handle longer sequences for the character-level language model. The temporal hierarchy is introduced in the language model by utilizing a Gated Recurrent Neural Network with multiple timescales. The proposed model incorporates a timescale adaptation mechanism for enhancing the performance of the language model. We evaluate our proposed model using the popular Penn Treebank and Text8 corpora. The experiments show that the use of multiple timescales in a Neural Language Model (NLM) enables improved performance despite having fewer parameters and with no additional computation requirements. Our experiments also demonstrate the ability of the adaptive temporal hierarchies to represent multiple compositonality without the help of complex hierarchical architectures and shows that better representation of the longer sequences lead to enhanced performance of the probabilistic language model.