Lama Nachman


pdf bib
CueBot: Cue-Controlled Response Generation for Assistive Interaction Usages
Shachi H. Kumar | Hsuan Su | Ramesh Manuvinakurike | Max Pinaroc | Sai Prasad | Saurav Sahay | Lama Nachman
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)

Conversational assistants are ubiquitous among the general population, however, these systems have not had an impact on people with disabilities, or speech and language disorders, for whom basic day-to-day communication and social interaction is a huge struggle. Language model technology can play a huge role in empowering these users and help them interact with others with less effort via interaction support. To enable this population, we build a system that can represent them in a social conversation and generate responses that can be controlled by the users using cues/keywords. We build models that can speed up this communication by suggesting relevant cues in the dialog response context. We also introduce a keyword-loss to lexically constrain the model response output. We present automatic and human evaluation of our cue/keyword predictor and the controllable dialog system to show that our models perform significantly better than models without control. Our evaluation and user study shows that keyword-control on end-to-end response generation models is powerful and can enable and empower users with degenerative disorders to carry out their day-to-day communication.

pdf bib
Cue-bot: A Conversational Agent for Assistive Technology
Shachi H Kumar | Hsuan Su | Ramesh Manuvinakurike | Maximilian C. Pinaroc | Sai Prasad | Saurav Sahay | Lama Nachman
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Intelligent conversational assistants have become an integral part of our lives for performing simple tasks. However, such agents, for example, Google bots, Alexa and others are yet to have any social impact on minority population, for example, for people with neurological disorders and people with speech, language and social communication disorders, sometimes with locked-in states where speaking or typing is a challenge. Language model technologies can be very powerful tools in enabling these users to carry out daily communication and social interactions. In this work, we present a system that users with varied levels of disabilties can use to interact with the world, supported by eye-tracking, mouse controls and an intelligent agent Cue-bot, that can represent the user in a conversation. The agent provides relevant controllable ‘cues’ to generate desirable responses quickly for an ongoing dialog context. In the context of usage of such systems for people with degenerative disorders, we present automatic and human evaluation of our cue/keyword predictor and the controllable dialog system and show that our models perform significantly better than models without control and can also reduce user effort (fewer keystrokes) and speed up communication (typing time) significantly.


pdf bib
Incremental temporal summarization in multi-party meetings
Ramesh Manuvinakurike | Saurav Sahay | Wenda Chen | Lama Nachman
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

In this work, we develop a dataset for incremental temporal summarization in a multiparty dialogue. We use crowd-sourcing paradigm with a model-in-loop approach for collecting the summaries and compare the data with the expert summaries. We leverage the question generation paradigm to automatically generate questions from the dialogue, which can be used to validate the user participation and potentially also draw attention of the user towards the contents then need to summarize. We then develop several models for abstractive summary generation in the Incremental temporal scenario. We perform a detailed analysis of the results and show that including the past context into the summary generation yields better summaries.

pdf bib
Semi-supervised Interactive Intent Labeling
Saurav Sahay | Eda Okur | Nagib Hakim | Lama Nachman
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves a definition of intents and entities, collection of task-relevant data, annotating the data with intents and entities, and then repeating the same process over and over again for adding any functionality/enhancement to the SDS. In this work, we showcase an Intent Bulk Labeling system where SDS developers can interactively label and augment training data from unlabeled utterance corpora using advanced clustering and visual labeling methods. We extend the Deep Aligned Clustering work with a better backbone BERT model, explore techniques to select the seed data for labeling, and develop a data balancing method using an oversampling technique that utilizes paraphrasing models. We also look at the effect of data augmentation on the clustering process. Our results show that we can achieve over 10% gain in clustering accuracy on some datasets using the combination of the above techniques. Finally, we extract utterance embeddings from the clustering model and plot the data to interactively bulk label the samples, reducing the time and effort for data labeling of the whole dataset significantly.


pdf bib
Low Rank Fusion based Transformers for Multimodal Sequences
Saurav Sahay | Eda Okur | Shachi H Kumar | Lama Nachman
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of~(CITATION) and~(CITATION), we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for the Multimodal Sentiment and Emotion Recognition results on CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have lesser parameters, train faster and perform comparably to many larger fusion-based architectures.

pdf bib
Audio-Visual Understanding of Passenger Intents for In-Cabin Conversational Agents
Eda Okur | Shachi H Kumar | Saurav Sahay | Lama Nachman
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

Building multimodal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort in autonomous vehicle (AV) interaction systems. To this end, understanding passenger intents from spoken interactions and vehicle vision systems is an important building block for developing contextual and visually grounded conversational agents for AV. Towards this goal, we explore AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling multimodal passenger-vehicle interactions. In this work, we discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input together with the non-verbal/acoustic and visual input from inside and outside the vehicle. Our experimental results outperformed text-only baselines as we achieved improved performances for intent detection with multimodal approach.


pdf bib
Multimodal Relational Tensor Network for Sentiment and Emotion Classification
Saurav Sahay | Shachi H Kumar | Rui Xia | Jonathan Huang | Lama Nachman
Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)

Understanding Affect from video segments has brought researchers from the language, audio and video domains together. Most of the current multimodal research in this area deals with various techniques to fuse the modalities, and mostly treat the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, Relational Tensor Network, where we use the inter-modal interactions within a segment (intra-segment) and also consider the sequence of segments in a video to model the inter-segment inter-modal interactions. We also generate rich representations of text and audio modalities by leveraging richer audio and linguistic context alongwith fusing fine-grained knowledge based polarity scores from text. We present the results of our model on CMU-MOSEI dataset and show that our model outperforms many baselines and state of the art methods for sentiment classification and emotion recognition.