Proceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI)
Understanding when and why neural ranking models fail for an IR task via error analysis is an important part of the research cycle. Here we focus on the challenges of (i) identifying categories of difficult instances (a pair of question and response candidates) for which a neural ranker is ineffective and (ii) improving neural ranking for such instances. To address both challenges we resort to slice-based learning for which the goal is to improve effectiveness of neural models for slices (subsets) of data. We address challenge (i) by proposing different slicing functions (SFs) that select slices of the dataset—based on prior work we heuristically capture different failures of neural rankers. Then, for challenge (ii) we adapt a neural ranking model to learn slice-aware representations, i.e. the adapted model learns to represent the question and responses differently based on the model’s prediction of which slices they belong to. Our experimental results (the source code and data are available at https://github.com/Guzpenha/slice_based_learning) across three different ranking tasks and four corpora show that slice-based learning improves the effectiveness by an average of 2% over a neural ranker that is not slice-aware.
A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering
Svitlana Vakulenko | Shayne Longpre | Zhucheng Tu | Raviteja Anantha
The dependency between an adequate question formulation and correct answer selection is a very intriguing but still underexplored area. In this paper, we show that question rewriting (QR) of the conversational context allows to shed more light on this phenomenon and also use it to evaluate robustness of different answer selection approaches. We introduce a simple framework that enables an automated analysis of the conversational question answering (QA) performance using question rewrites, and present the results of this analysis on the TREC CAsT and QuAC (CANARD) datasets. Our experiments uncover sensitivity to question formulation of the popular state-of-the-art question answering approaches. Our results demonstrate that the reading comprehension model is insensitive to question formulation, while the passage ranking changes dramatically with a little variation in the input question. The benefit of QR is that it allows us to pinpoint and group such cases automatically. We show how to use this methodology to verify whether QA models are really learning the task or just finding shortcuts in the dataset, and better understand the frequent types of error they make.
Conversational Question Answering (ConvQA) is a Conversational Search task in a simplified setting, where an answer must be extracted from a given passage. Neural language models, such as BERT, fine-tuned on large-scale ConvQA datasets such as CoQA and QuAC have been used to address this task. Recently, Multi-Task Learning (MTL) has emerged as a particularly interesting approach for developing ConvQA models, where the objective is to enhance the performance of a primary task by sharing the learned structure across several related auxiliary tasks. However, existing ConvQA models that leverage MTL have not investigated the dynamic adjustment of the relative importance of the different tasks during learning, nor the resulting impact on the performance of the learned models. In this paper, we first study the effectiveness and efficiency of dynamic MTL methods including Evolving Weighting, Uncertainty Weighting, and Loss-Balanced Task Weighting, compared to static MTL methods such as the uniform weighting of tasks. Furthermore, we propose a novel hybrid dynamic method combining Abridged Linear for the main task with a Loss-Balanced Task Weighting (LBTW) for the auxiliary tasks, so as to automatically fine-tune task weighting during learning, ensuring that each of the task’s weights is adjusted by the relative importance of the different tasks. We conduct experiments using QuAC, a large-scale ConvQA dataset. Our results demonstrate the effectiveness of our proposed method, which significantly outperforms both the single-task learning and static task weighting methods with improvements ranging from +2.72% to +3.20% in F1 scores. Finally, our findings show that the performance of using MTL in developing ConvQA model is sensitive to the correct selection of the auxiliary tasks as well as to an adequate balancing of the loss rates of these tasks during training by using LBTW.