Dynamic Facet Selection by Maximizing Graded Relevance

Dynamic faceted search (DFS), an interactive query refinement technique, is a form of human-computer information retrieval (HCIR). It allows users to narrow down search results through facets, where the facet-to-document mapping is determined at runtime based on the context of the user query instead of pre-indexing the facets statically. In this paper, we propose a new unsupervised approach for dynamic facet generation, namely optimistic facets, which attempts to generate the best possible subset of facets, thereby maximizing expected Discounted Cumulative Gain (DCG), a measure of ranking quality that uses a graded relevance scale. We also release code to generate a new evaluation dataset. Through empirical results on two datasets, we show that the proposed DFS approach considerably improves the document ranking in the search results.


Introduction
Human-computer information retrieval (HCIR) is the study of techniques that incorporate human intelligence into the search process. Through a multi-step search process, it creates opportunities for human feedback by taking the query context into account. Examples of HCIR approaches include faceted search, relevance feedback, automatic query reformulation, illustration by tag clouds, etc.
Faceted Search (FS) (Tunkelang, 2009), a form of HCIR, is a prevalent technique in e-commerce where document retrieval systems are augmented with faceted navigation. Facets are terms that present an overview of the variety of data available given the user query, thereby hinting at the most relevant refinement operations for zooming in on the target information need (Ben-yitzhak et al., 2008).

* Equal contributions.
Traditional facet generation approaches present several drawbacks. Documents must be pre-tagged with an existing taxonomy, adding overhead in content curation and management. Moreover, such static facets lack contextual matching with documents or queries. Figure 1 shows an example of static/traditional facets.
Dynamic Faceted Search (DFS) overcomes such limitations (Dash et al., 2008). For dynamic facets, the facet-to-document mapping is determined at run-time based on the context of the user query instead of pre-indexing the facets statically. In other words, in an information retrieval (IR) system, there is no exclusive list of terms to be considered for dynamic facets, and such facets are not known in advance. There is no pre-existing mapping of facets to the documents (that are indexed in the corresponding IR system). The mapping can only be created at run-time when the query is submitted; the facets are then generated from the search results specific to the given query and are presented to the user along with the relevant documents.
In this paper, we present an approach for generating dynamic facets and selecting the best set of facets to be presented to the user. This allows the user to select relevant facets (if any) to interactively refine their queries, which in turn improves search results at each facet selection iteration. This interaction can be repeated until the user is satisfied with the results presented or no further refinement is possible.
Below we highlight the major contributions of our work:
• a new state-of-the-art unsupervised approach for dynamic facet generation (see Section 3), evaluated on two datasets (see Section 6), and
• a new benchmark dataset, the Stackoverflow-Technotes (or simply Stackoverflow) Benchmark¹ (see Section 5).

Figure 1: Example of static facets used to organize a set of book titles in a digital library.
The rest of the paper is structured as follows. Section 2 includes a brief summary of related work with respect to DFS. Section 3 describes our proposed approaches. The next two sections (4 and 5) describe the experimental settings and datasets. In Section 6, we show the empirical results, both quantitative and qualitative. Finally, Section 7 concludes the paper and highlights perspectives for future work.

Related Work
A closely related research task of facet generation is to generate alternative queries, also known as query suggestion (Mei et al., 2008). Other related tasks are query substitution (Jones et al., 2006) and query refinement (Kraft and Zien, 2004). The main difference between these tasks and facet generation is that facets are not alternative/substitute/refined queries but rather a way to organize the search results obtained using the original query.
Another related task is query expansion (Xu and Croft, 1996), where the goal is to add related words to a query in order to increase the number of returned documents and improve recall accordingly. In contrast, the selection of facets allows users to narrow down search results.
There is a considerable amount of work on faceted search (Zheng et al., 2013; Kong, 2016). For brevity, here we focus on DFS only. DFS can be divided into two categories. First, DFS on databases (Basu Roy et al., 2008; Kim et al., 2014; Vandic et al., 2018). Databases have rich meta-data in the form of tables, attributes, dimensions, etc. DFS on databases focuses on selecting the best possible attributes from the meta-data to be presented as facets.

¹ We provide the code for automatically creating the dataset using publicly available data, and also for running the simulated automatic evaluation. It can be found at https://github.com/IBM/Stackoverflow-Technotes-dataset.
Our contributions are in the second category -DFS on textual data. An early approach was proposed by Ben-yitzhak et al. (2008), where the generated dynamic facets are constrained by the ability to sum pre-defined Boolean expressions. Dash et al. (2008) proposed an approach, given a keyword as query, to dynamically select a small set of "interesting" attributes and present their aggregation to a user. Their work is focused on evaluating the execution time rather than result re-ranking. Dakka and Ipeirotis (2008) proposed an approach using external resources, namely WordNet and Wikipedia, to generate facets given a query.
Our proposed DFS approach on text generates dynamic facets that are terms (which are not restricted), not just aggregated values, and does not rely on any external resource. Input queries can be natural language texts, not restricted to keywords.
In a recent relevant work, Mihindukulasooriya et al. (2020) proposed an unsupervised DFS approach that exploits different types of word embedding models to extract so-called flat and typed facets. The typed facets are organized in hierarchies, while the flat facets are simply a list of facets without hierarchy. They show empirically that both sets of facets yield similar results.

Proposed Dynamic Facet Generation
Given a ranked set of search results from a traditional search engine, our proposed approach, namely Optimistic facet set selection, tracks document ranking changes produced by selecting each candidate facet, and uses this information to select a subset of best possible facets.
We use the following notations in this section:
• R = [d_1, ..., d_n] is the ranked list of n document results for the initial query, q_0, returned by the initial traditional IR component/search engine.
• C = {f_1, ..., f_c} is a set of c terms to be considered as facet candidates.
• F ⊂ C is a set of k facets generated by the system as output, where k can be set by the user or the interactive search system.

Facet candidate generation
Given a user query and the respective search results (i.e., documents) from a search engine, we extract the terms from those candidate documents with a frequency above a threshold θ_freq. Let us limit the expected number of dynamic facets to k. Given a pre-trained word embedding model (for the indexed document collection), the cosine similarity, sim(q_0, t), between the query and each term t is computed. Up to the top c terms with a minimum similarity score of θ_sim are kept as facet candidates.²
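The candidate generation step above can be sketched as follows; the tokenizer, the plain-vector representation of the embedding model, and all function names are illustrative assumptions, not the authors' implementation:

```python
import re
import math
from collections import Counter

def candidate_facets(query_vec, doc_texts, term_vecs, theta_freq=3, theta_sim=0.5, c=50):
    """Keep frequent terms from the search results whose embedding is
    close enough (cosine similarity >= theta_sim) to the query embedding."""
    # Count term frequencies across the retrieved documents.
    counts = Counter(t for doc in doc_texts for t in re.findall(r"\w+", doc.lower()))
    frequent = [t for t, n in counts.items() if n >= theta_freq and t in term_vecs]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    # Keep up to the top-c terms whose similarity to the query clears the threshold.
    scored = [(t, cos(query_vec, term_vecs[t])) for t in frequent]
    scored = [(t, s) for t, s in scored if s >= theta_sim]
    scored.sort(key=lambda ts: ts[1], reverse=True)
    return [t for t, _ in scored[:c]]
```

In practice the `term_vecs` lookup would be a pre-trained embedding model over the indexed collection, and the frequency count could be read directly from the index.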

Optimistic Facet Set Selection
Our algorithm is built on two key assumptions:
• Optimism: the user will select the best facet, i.e., the one that attains the best Discounted Cumulative Gain (DCG) (or other graded relevance measure).
• Relevance Probability: how likely a document is to be relevant is approximated by its rank in the initial search results.

Each candidate facet, f, is associated with some change in the scores of the document results, δ_f, and hence some new ranking of the document results, R_f. We use the filter strategy: search results that do not contain f are discarded, which changes the ranks of the remaining results. Experimenting with an alternative strategy, computing the change in BM25 score (Robertson and Zaragoza, 2009) if f is added to the query, resulted in lower performance.

² We set θ_freq = 3, θ_sim = 0.5, and c = max(k², 50).
Suppose p_i is the probability that the i-th ranked document in the initial retrieval is relevant. We fit a curve to estimate p_i independently of the query or document results and find this probability to be roughly proportional to the inverse of the rank plus its square root, i.e., p_i ∝ 1/(i + √i). Figure 2 shows the empirical probability of relevance and the fitted curve.
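A minimal sketch of this fitted prior, assuming (as stated above) that p_i is roughly proportional to 1/(i + √i); the function name and the choice to normalize over the n retrieved results are ours:

```python
import math

def relevance_probs(n):
    """Prior probability that the i-th ranked result is relevant,
    approximated as proportional to 1/(i + sqrt(i)) and normalized
    over the n retrieved results."""
    raw = [1.0 / (i + math.sqrt(i)) for i in range(1, n + 1)]
    z = sum(raw)
    return [p / z for p in raw]
```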
A facet set has a minimum possible rank for each document: the lowest rank that can be achieved by selecting any facet in the set, or no facet. We denote this list of ranks as R_min = [r_1, r_2, ..., r_n], where r_j = min(j, min_{f∈F} R_f(j)). The list of ranks R_min is closely connected with our optimistic assumption. If, for example, the single relevant document is at initial rank j, then r_j is the rank it will have after the user sees the initial results and optionally selects the best facet.
Consider the case (a majority in our datasets) where only one document is relevant. Then the expected DCG under the optimistic assumption is given by Equation 2:

E(DCG_F) = Σ_{j=1..n} p_j / log2(r_j + 1)   (2)

DCG is a standard metric in IR to measure the overall quality of the search results. DCG depends only on the ranks of the relevant (rel_i = 1) documents. Intuitively, we optimize DCG in expectation by providing facets that produce different and likely rankings for the returned documents.
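For the single-relevant-document case, the expected DCG can be computed as in the sketch below; `rank_after`, mapping an initial rank j to the best rank reachable via any facet in the set (i.e., min_{f∈F} R_f(j)), is a hypothetical representation:

```python
import math

def expected_dcg(p, rank_after):
    """E(DCG_F) for the single-relevant-document case: if the relevant
    document sits at initial rank j (with prior probability p[j-1]),
    the optimistic user lands it at rank r_j = min(j, best rank under
    any facet in the set), contributing 1/log2(r_j + 1)."""
    total = 0.0
    for j, pj in enumerate(p, start=1):
        r_j = min(j, rank_after.get(j, j))
        total += pj / math.log2(r_j + 1)
    return total
```

With an empty facet set (`rank_after = {}`) this reduces to the expected DCG of the initial ranking, which any useful facet set should improve on.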
We select a facet set to approximately optimize E(DCG_F) using greedy and local search. Both the greedy and local search phases of facet set selection rely on a function, Best(C, F, f*, s*), to select the facet candidate that improves E(DCG_F) the most. The greedy phase adds k facet candidates to the facet set, each time adding the facet that maximizes the set score. Local search then tries to swap each facet in the facet set for some better facet candidate; this is repeated until E(DCG_F) does not improve. Algorithm 1 shows pseudocode for these functions.
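A rough sketch of the greedy and local-search phases, assuming each facet is summarized by the new ranks it induces (`facet_ranks`, a hypothetical structure mapping facet -> {initial rank: new rank}); this illustrates the idea rather than reproducing the paper's Algorithm 1:

```python
import math

def set_score(p, facet_ranks, facet_set):
    # E(DCG_F): each initial rank j can reach min(j, best rank via any chosen facet).
    total = 0.0
    for j, pj in enumerate(p, start=1):
        r_j = min([j] + [facet_ranks[f].get(j, math.inf) for f in facet_set])
        total += pj / math.log2(r_j + 1)
    return total

def select_facets(p, facet_ranks, k, max_rounds=10):
    """Greedy phase: add the facet that most improves E(DCG_F), k times.
    Local search phase: try swapping each chosen facet for a better candidate."""
    chosen = []
    for _ in range(min(k, len(facet_ranks))):
        best = max(facet_ranks,
                   key=lambda f: set_score(p, facet_ranks, chosen + [f])
                   if f not in chosen else -1.0)
        chosen.append(best)
    for _ in range(max_rounds):  # repeat swaps until no improvement
        improved = False
        for i in range(len(chosen)):
            for f_new in facet_ranks:
                if f_new in chosen:
                    continue
                trial = chosen[:i] + [f_new] + chosen[i + 1:]
                if set_score(p, facet_ranks, trial) > set_score(p, facet_ranks, chosen):
                    chosen, improved = trial, True
        if not improved:
            break
    return chosen
```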

Experiments
Evaluation Settings: We use the simulated-user-based automatic evaluation, called ORACLE, proposed by Mihindukulasooriya et al. (2020). For each iteration of the faceted search, the system presents a list of ranked search results and facets to the ORACLE, which selects the facet that retrieves the target document at the highest rank.

TechQA Benchmark
The first dataset is an existing benchmark of real-world user questions in English in the domain of technical customer support, named the TechQA dataset (Castelli et al., 2020). We choose this dataset because the most recent faceted search work that we are aware of (Mihindukulasooriya et al., 2020) is evaluated on it. The RoBERTa-based state-of-the-art IR approach (Liu et al., 2019) that we use as one of the baselines also used this dataset. The TechQA dataset has 160 answerable questions in the Dev split and is aligned with a corpus of 801,998 publicly available IBM Technotes documents. We evaluate our approaches on these questions while treating the corresponding Technotes documents (containing the answers) as the corpus.

Proposed Stackoverflow Benchmark
In addition to the TechQA benchmark, we create a new dataset in the technical support domain to verify the generality of our approach. This allows us to evaluate it on a different benchmark containing real-world queries, which are often noisy and not curated.
We are releasing the corresponding benchmark generation code to the research community as part of this work. The dataset contains a total of 883 queries. It was created from Stackoverflow forum threads: we only considered those queries where the accepted answer posts contain link(s) to documents in the Technotes corpus (the same corpus as mentioned in the TechQA Benchmark). Figure 4 shows an example of an entry in the dataset, which includes an "id" field containing the id of a question post, a "title" field with the title of the question post, a "body" field which is the body part of the question post, and a "relevant docids" field with a set of Technotes IDs extracted from the corresponding accepted answer post.
The procedure described above is generic and can be replicated for other forums and corpora with similar characteristics.

Results
We implemented the flat facets proposed by Mihindukulasooriya et al. (2020) to compare with our results on both datasets. We use BM25 (Robertson and Zaragoza, 2009) as the IR baseline for the Stackoverflow benchmark. For the TechQA dataset, we use the state-of-the-art IR approach of Zhang et al. (2020), built using RoBERTa (Liu et al., 2019), as the baseline. Zhang et al. (2020) generously shared with us their system's output for the TechQA-DR (i.e., document retrieval) task mentioned in their paper. We feed this output as input into our system as well as into our implementation of Mihindukulasooriya et al. (2020) to extract facets from the corresponding search results.
For a given query, we consider at most 50 search results retrieved by the IR baseline. The ORACLE then accepts up to 5 facets generated by a DFS approach, and chooses only one facet (i.e., a single interaction with the DFS system) as a filter. If a search result does not contain this facet, it is discarded, which changes the ranks of some of the remaining search results.

Quantitative Evaluation
We use three standard evaluation metrics: Discounted Cumulative Gain (DCG), Mean Reciprocal Rank (MRR), and Hits@K. For Hits@K, we report the absolute number of queries where the expected document is ranked within the top-K results. Table 1 empirically compares our DFS approach against other systems. As evident from the results, optimistic DFS demonstrates a considerable edge over the DFS approach of Mihindukulasooriya et al. (2020) on both datasets in every single metric. Furthermore, our approach significantly improves the results of the underlying strong IR baselines on both datasets.
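For the common case of a single expected document per query, the three metrics might be computed as in the sketch below (helper names are ours; Hits@K is an absolute count, as reported in the text):

```python
import math

def dcg(rank):
    # Single relevant document at the given 1-based rank.
    return 1.0 / math.log2(rank + 1)

def evaluate(ranks, k=5):
    """ranks: 1-based rank of the expected document per query
    (None if it does not appear in the results)."""
    found = [r for r in ranks if r is not None]
    mean_dcg = sum(dcg(r) for r in found) / len(ranks)
    mrr = sum(1.0 / r for r in found) / len(ranks)
    hits_at_k = sum(1 for r in found if r <= k)  # absolute number of queries
    return mean_dcg, mrr, hits_at_k
```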

Qualitative Evaluation
For the qualitative evaluation, we selected a sample set of 22 random queries from the Stackoverflow dataset. We asked a Subject Matter Expert (SME), who is a customer support agent in the field, to manually inspect the facets (produced by optimistic DFS) for each selected query.
According to the SME, a facet is considered useful if it is contextually related but not already mentioned in the user's (short) query (i.e., the 'title' in Figure 4), and either appears in (i) the fully specified query, aka 'post' (i.e., the 'body' in Figure 4), or (ii) the target document. Table 2 shows a sample subset of "User Query", their corresponding "Top 5 Dynamically Generated Facets", "Additional Relevant Facets Present in Post" that the system could have considered ranking higher to place in the top 5, and "SME Recommended Facets" that the system should have presented (even though they are not seen in the post), as they are relevant for the corresponding user query. The values in the last two columns are provided by the SME.
The SME marked the dynamically generated facets into the four following categories: • "Facets seen in Post" (highlighted in italic font) - facets seen in the post body that our algorithm also generated, e.g., 'ClearCase Remote Client (CCRC)'; • "Facets seen in Post and relevant for query" (highlighted in bold italic font) - relevant facets seen in the post body that our algorithm also generated, e.g., 'ClearCase Remote Client'; • "Facets unseen in Post" (highlighted in underline) - facets unseen in the post body that our algorithm also generated, e.g., 'Rational ClearCase SCM Adapter', 'rad', 'source control'; • "Facets unseen in Post and relevant for query" (highlighted in bold underline) - relevant facets unseen in the post that our algorithm also generated, e.g., 'dynamic views'.
In summary, 22 randomly chosen queries, with 5 facets each generated by Optimistic DFS, were evaluated by the SME. On average, our system generated 89% "Facets unseen in Post", out of which 25% are relevant for their queries. Among the 11% "Facets seen in Post", 82% were found to be relevant for their queries.


Conclusion
In this paper, we propose Optimistic facet set selection, a new unsupervised approach for dynamic facet generation for interactive search. It outperforms the existing state of the art on two publicly available benchmarks, one of which we release as part of this work.
We believe this new dataset will be useful to the research community for training and evaluating interactive models. Currently, our proposed approach does not have an active learning component and does not explicitly learn from user feedback (e.g., by fine-tuning an NLP model). However, we think our approach will serve as a strong baseline for future interactive search approaches.
In the future, we plan to investigate the following: • how to leverage the proposed algorithm to generate facets automatically grouped by types;
• how dynamic facets can be generated using language models as Knowledge Bases.
Our vision is to transform the interactive search experience into a learnable knowledge discovery process.