Similarity metrics for text corpora are becoming critical due to the tremendous growth in the number of generative models. These similarity metrics measure the semantic gap between human and machine-generated text on the corpus level. However, standard methods for evaluating the characteristics of these metrics have yet to be established. We propose a set of automatic measures for evaluating the characteristics of semantic similarity metrics for text corpora. Our measures allow us to sensibly compare and identify the strengths and weaknesses of these metrics. We demonstrate the effectiveness of our evaluation measures in capturing fundamental characteristics by comparing it to a collection of classical and state-of-the-art metrics. Our measures revealed that recent metrics are becoming better in identifying semantic distributional mismatch while classical metrics are more sensitive to perturbations in the surface text levels.
Prominent questions about the role of sensory vs. linguistic input in the way we acquire and use language have been extensively studied in the psycholinguistic literature. However, the relative effect of various factors in a person’s overall experience on their linguistic system remains unclear. We study this question by making a step forward towards a better understanding of the conceptual perception of colors by color-blind individuals, as reflected in their spontaneous linguistic productions. Using a novel and carefully curated dataset, we show that red-green color-blind speakers use the “red” and “green” color terms in less predictable contexts, and in linguistic environments evoking mental image to a lower extent, when compared to their normal-sighted counterparts. These findings shed some new and interesting light on the role of sensory experience on our linguistic system.
We focus on the task of Frequently Asked Questions (FAQ) retrieval. A given user query can be matched against the questions and/or the answers in the FAQ. We present a fully unsupervised method that exploits the FAQ pairs to train two BERT models. The two models match user queries to FAQ answers and questions, respectively. We alleviate the missing labeled data of the latter by automatically generating high-quality question paraphrases. We show that our model is on par and even outperforms supervised models on existing datasets.
Data balancing is a known technique for improving the performance of classification tasks. In this work we define a novel balancing-viageneration framework termed BalaGen. BalaGen consists of a flexible balancing policy coupled with a text generation mechanism. Combined, these two techniques can be used to augment a dataset for more balanced distribution. We evaluate BalaGen on three publicly available semantic utterance classification (SUC) datasets. One of these is a new COVID-19 Q&A dataset published here for the first time. Our work demonstrates that optimal balancing policies can significantly improve classifier performance, while augmenting just part of the classes and under-sampling others. Furthermore, capitalizing on the advantages of balancing, we show its usefulness in all relevant BalaGen framework components. We validate the superiority of BalaGen on ten semantic utterance datasets taken from real-life goaloriented dialogue systems. Based on our results we encourage using data balancing prior to training for text classification tasks.