Determinantal Beam Search

Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be viewed as a subset optimization problem, albeit one where the corresponding set function does not reflect interactions between candidates. Empirically, this leads to sets often exhibiting high overlap, e.g., strings may differ by only a single word. Yet in use-cases that call for multiple solutions, a diverse or representative set is often desired. To address this issue, we propose a reformulation of beam search, which we call determinantal beam search. Determinantal beam search has a natural relationship to determinantal point processes (DPPs), models over sets that inherently encode intra-set interactions. By posing iterations in beam search as a series of subdeterminant maximization problems, we can turn the algorithm into a diverse subset selection process. In a case study, we use the string subsequence kernel to explicitly encourage n-gram coverage in text generated from a sequence model. We observe that our algorithm offers competitive performance against other diverse set generation strategies in the context of language generation, while providing a more general approach to optimizing for diversity.


Introduction
The decoding of neural sequence models is a fundamental component of many tasks in NLP. Yet, many proposed decoding methods aim to produce only a single solution; further, decoding strategies that provide a set, such as beam search, admit high overlap between solutions. Such approaches fail to reflect that for many NLP tasks there can be multiple correct solutions, or that we may desire a diverse set of solutions. As it stands, standard beam search chooses items based purely on individual scores, with no means of encoding interactions between candidates; this is the limitation that we address in this work.
We derive determinantal beam search, a novel generalization of beam search that casts subset selection as a subdeterminant optimization problem. Specifically, we formulate each iteration of beam search as a subdeterminant maximization problem parameterized by a positive semi-definite matrix that encodes interactions between the possible candidates; standard beam search is recovered by a specific diagonal matrix. This framing creates a natural paradigm for taking the relationships between candidates into account during the decoding process, and can thus assign higher scores to diversified sets; we show how this approach relates to k-determinantal point processes (DPPs). Given the wealth of research on efficient kernel computation (Rousu and Shawe-Taylor, 2005; Farhan et al., 2017) and DPP inference strategies (Li et al., 2016; Han et al., 2017; Chen et al., 2018), we find the impact on runtime to be quite reasonable in comparison to standard decoding techniques.
In a case study on neural machine translation (NMT), we demonstrate how to make use of the string subsequence kernel (Lodhi et al., 2002) to encode the notion of n-gram diversity in the language generation process, allowing us to derive an elegant diverse beam search. Under this scheme, we observe that determinantal beam search generates more diverse sets than standard beam search with minimal trade-off in terms of BLEU. We see improved performance over stochastic beam search (SBS; Kool et al., 2019), which is reported to encourage diversity, and a slight improvement over Vijayakumar et al. (2018)'s diverse beam search (DBS) while providing a more general approach to optimizing for intra-set diversity.

Neural Sequence Models
Neural sequence models are probability distributions $p(y \mid x)$ over sequences $y$ in an output space $\mathcal{Y}$ conditioned on an input $x$ (which may be, e.g., a source sentence or an image). Here we define $\mathcal{Y}$ as the set of all valid sequences derived from a vocabulary $\mathcal{V}$ that are bookended by distinguished BOS and EOS tokens, indicating the beginning and end of the sequence, respectively. Typically, the sequence length is upper-bounded by some value $n_{\max} \in \mathbb{Z}_+$, which may depend on $x$. In this work, we consider locally normalized models, i.e., where $p$ is a probability distribution over $\bar{\mathcal{V}} \overset{\text{def}}{=} \mathcal{V} \cup \{\text{EOS}\}$ conditioned on previously generated tokens $y_{<t}$. The probability of the full sequence $y = \langle y_1, y_2, \ldots \rangle$ is then calculated via the chain rule of probability:
$$p(y \mid x) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x) \tag{1}$$
where $y_{<1} = y_0 \overset{\text{def}}{=} \text{BOS}$. Our model $p$ is typically parameterized by a neural network with weights $\theta$. As we do not focus on the underlying model itself in this work, we omit the dependence of $p$ on the parameters $\theta$.
We define the decoding problem as the search for the highest-scoring $y$ among all sequences in $\mathcal{Y}$ according to the model $p(y \mid x)$, which is also called maximum-a-posteriori (MAP) inference:
$$y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \log p(y \mid x) \tag{2}$$
where the log transform of $p$ is used by convention. We further define the set decoding problem as the search for a set $Y^\star$ of a specified cardinality $k$ among all valid subsets $\{Y \subseteq \mathcal{Y} \mid |Y| = k\}$ that has the highest score where, by overloading, we define
$$p(Y \mid x) \overset{\text{def}}{=} \prod_{y \in Y} p(y \mid x) \tag{3}$$
Similarly to Eq. (2), the set-decoding problem is then defined as:
$$Y^\star = \operatorname*{argmax}_{Y \subseteq \mathcal{Y},\ |Y| = k} \log p(Y \mid x) \tag{4}$$
However, as has been noted in the literature, there are a number of issues with both Eq. (2) and (4). First, as $\mathcal{Y}$ may be an exponentially large (in $\mathcal{V}$) space and $p$ is typically non-Markovian, we cannot efficiently search over $\mathcal{Y}$, much less over $\mathcal{Y}^k$. Second, specifically for language generation tasks, these might not be useful objectives.
Degenerate Objective. It is important to note that the highest-probability solutions under neural sequence models are not always high-quality; specifically for tasks involving language generation, e.g., machine translation, prior work has shown the tendency for MAP decoding to lead to generic or degenerate solutions (Stahlberg and Byrne, 2019; Meister et al., 2020; Eikema and Aziz, 2020) while superior solutions assigned only slightly lower probability are often overlooked (Holtzman et al., 2020). Consequently, heuristic search methods or alternative objectives are frequently employed for decoding language generators.

Beam Search
A common heuristic to approximate the decoding problem in Eq. (2) is to sequentially choose the token $y_t$ at each time step $t$ that maximizes $p(y_t \mid y_{<t}, x)$ until the EOS token is generated or the maximum sequence length $n_{\max}$ is reached. This procedure is known as greedy search. Beam search is an oft-employed generalization of greedy search that returns $k$ candidates and explores more of the search space. In this work, we focus on a framing of beam search as iterative subset selection, which allows for a remarkably concise formulation of the algorithm. Given an initial set $Y_0$ containing only the BOS token, we choose subsequent $Y_t$ for $t \in \{1, \ldots, n_{\max}\}$ according to the following recursion:
$$Y_t = \operatorname*{argmax}_{Y'_t \subseteq B_t,\ |Y'_t| = k} \log p(Y'_t \mid x) \tag{5}$$
where we are constrained to only extending candidates present in the beam set, which we define as
$$B_t \overset{\text{def}}{=} \{\, y_{<t} \circ y \mid y_{<t} \in Y_{t-1} \text{ and } y \in \bar{\mathcal{V}} \,\} \tag{6}$$
where $\circ$ is used to indicate string concatenation. Note that candidates in $Y_{t-1}$ already ending in EOS are simply added directly to $B_t$, i.e., EOS $\circ$ EOS = EOS. Under this definition, we have the cardinality constraint $|B_t| \le |\bar{\mathcal{V}}| \cdot k$.
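The subset-selection view of a single beam search iteration can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `beam_step`, `toy_model`, and the token strings are hypothetical names, and the toy model ignores its prefix. Because the set score is a sum of individual log-probabilities, the best size-k subset of the candidate set is simply the k highest-scoring candidates.

```python
import heapq
import math

def beam_step(beam, next_token_logprobs, k, eos="EOS"):
    """One beam search iteration framed as top-k subset selection.

    beam: list of (tokens, logprob) pairs, i.e. the previous set Y_{t-1}.
    next_token_logprobs: hypothetical callable mapping a token tuple to a
        dict {token: log p(token | prefix, x)}.
    """
    candidates = []  # the candidate set B_t
    for tokens, score in beam:
        if tokens[-1] == eos:
            # Finished hypotheses are carried over unchanged.
            candidates.append((tokens, score))
            continue
        for tok, lp in next_token_logprobs(tokens).items():
            candidates.append((tokens + (tok,), score + lp))
    # The set score is additive, so the best size-k subset consists of
    # the k individually highest-scoring candidates.
    return heapq.nlargest(k, candidates, key=lambda c: c[1])

def toy_model(prefix):
    # Hypothetical prefix-independent distribution, for illustration only.
    return {"a": math.log(0.6), "b": math.log(0.3), "EOS": math.log(0.1)}

beam = [(("BOS",), 0.0)]
for _ in range(2):
    beam = beam_step(beam, toy_model, k=2)
```

Note that nothing in this step looks at how similar the selected candidates are to one another, which is the limitation the rest of the paper addresses.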

A Determinantal Reformulation
We now introduce an alternative, equivalent notation for Eq. (5) using matrices and determinants that will shed light on the straightforward generalization of beam search that we present as the primary contribution of this paper. We define a timestep-dependent diagonal matrix $D \in \mathbb{R}^{|B_t| \times |B_t|}$ where we take the diagonal entry
$$D_{ii} \overset{\text{def}}{=} p(y^{(i)}_{\le t} \mid x) \tag{7}$$
Here $y^{(i)}_{\le t}$ is the $i$th candidate in $B_t$ according to a unique mapping of every element $y_{\le t} \in B_t$ to an integer between 1 and $|B_t|$. Furthermore, we use the notation $D_{Y_t}$, where $Y_t \subseteq B_t$, to indicate the submatrix that only contains those rows and columns corresponding to the elements of $Y_t$. We may now rewrite Eq. (5) as
$$Y_t = \operatorname*{argmax}_{Y'_t \subseteq B_t,\ |Y'_t| = k} \log \det(D_{Y'_t}) \tag{8}$$
where equivalence follows from the definition of the determinant for diagonal matrices. Formally, Eq. (8) is known as the subdeterminant maximization problem (Klee et al., 1995; Ebrahimi et al., 2017), which, as the name suggests, refers to the problem of finding the determinant-maximizing subset of a matrix. While the notation introduced in Eq. (8) may seem contrived, it allows us to perform the subsequent generalization.
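The equivalence can be spelled out in one line. The following is a short worked derivation, assuming (as in the set-decoding objective) that the score of a set is the product of the probabilities of its members; the index set notation is ours.

```latex
% For any subset Y of the candidate set B_t, D_Y is diagonal, so its
% determinant is the product of its diagonal entries:
\log\det(D_Y)
  = \log \prod_{i \,:\, y^{(i)}_{\le t} \in Y} D_{ii}
  = \sum_{i \,:\, y^{(i)}_{\le t} \in Y} \log p\big(y^{(i)}_{\le t} \mid x\big)
  = \log p(Y \mid x).
```

Maximizing the log-determinant of a diagonal submatrix therefore selects exactly the same set as maximizing the summed log-probabilities.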

Determinantal Beam Search
We are now in a position to ask the fundamental question of this work: What happens if we replace the diagonal matrix $D$ with a non-diagonal matrix? This substitution allows us to account for interactions between the elements in the beam. Formally, we consider a timestep-dependent positive semi-definite (PSD) matrix $D + w \cdot K$ where the off-diagonal matrix $K$ indicates the strength of the interactions between candidates. The nonnegative weight $w \ge 0$ controls the importance of these interactions during the decoding process. In this case, the beam search recursion becomes:
$$Y_t = \operatorname*{argmax}_{Y'_t \subseteq B_t,\ |Y'_t| = k} \log \det(D_{Y'_t} + w \cdot K_{Y'_t}) \tag{9}$$
Clearly, we recover beam search when $w = 0$; however, we can now select subsets based additionally on candidate interactions. That is, Eq. (9) now has an interpretation as a diversity objective function (Indyk et al., 2014) when $K$ is chosen wisely. Due to the presence of the log, Eq. (9) is only well defined when the matrix $D_Y + w \cdot K_Y$ is PSD.
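To make the effect of the interaction term concrete, the sketch below evaluates the log-determinant objective by brute force on a toy problem with three candidates. The probabilities and the similarity matrix are made-up illustrative values, and exhaustive enumeration is only feasible at this tiny scale; `detbs_objective` and `best_subset` are hypothetical helper names.

```python
from itertools import combinations
import numpy as np

def detbs_objective(D, K, subset, w):
    """log det(D_Y + w * K_Y) for the index list `subset`."""
    idx = np.ix_(subset, subset)
    sign, logdet = np.linalg.slogdet(D[idx] + w * K[idx])
    return logdet if sign > 0 else -np.inf

def best_subset(D, K, k, w):
    """Exhaustively search all size-k subsets (toy sizes only)."""
    n = D.shape[0]
    return max(combinations(range(n), k),
               key=lambda S: detbs_objective(D, K, list(S), w))

# Candidates 0 and 1 are likely but nearly identical; candidate 2 is
# less likely but dissimilar to both. All values are illustrative.
D = np.diag([0.30, 0.28, 0.20])
K = np.array([[1.00, 0.95, 0.10],
              [0.95, 1.00, 0.10],
              [0.10, 0.10, 1.00]])

top_k = best_subset(D, K, k=2, w=0.0)    # plain beam search choice
diverse = best_subset(D, K, k=2, w=1.0)  # interaction-aware choice
```

With w = 0 the two most probable candidates are chosen; with w = 1 the near-duplicate is replaced by the dissimilar candidate, since the large off-diagonal entry shrinks the determinant of the pair (0, 1).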

Constructing K
One simple way to construct $K$ is as a Gram matrix, where each $i, j$ element of $K$ is computed via a kernel function $\mathcal{K} : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ that maps two items in a space $\mathcal{S}$ to a real number. Specifically, we define $K_{ij} = \mathcal{K}(s_i, s_j)$ where $s_i, s_j \in \mathcal{S}$ are the $i$th and $j$th elements of $\mathcal{S}$, respectively. In slight abuse of notation, we overload the kernel function $\mathcal{K}$ to take a set $S$ such that $K = \mathcal{K}(S)$ is the kernel matrix resulting from pairwise computation over elements of $S$. Following from Mercer's theorem, the matrix $K = \mathcal{K}(S)$ is necessarily PSD; since $D$ is diagonal with nonnegative entries and $w \ge 0$, the matrix $D_Y + w \cdot K_Y$ is thus PSD for any $Y \subseteq S$. The efficient computation of kernel functions is a well-studied problem, largely due to the prevalence of kernels in various machine learning techniques. For example, dynamic programming techniques are often employed in the computation of $\mathcal{K}(S)$ (Rousu and Shawe-Taylor, 2005), or approximate low-rank kernel matrices can be used in place of $\mathcal{K}(S)$ (Si et al., 2017).
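As a rough illustration of the Gram-matrix construction, the sketch below builds K from a pairwise kernel and checks numerically that it is symmetric and PSD. The RBF kernel and the 2-d points are stand-ins chosen only because the RBF kernel is a well-known valid kernel; they are not the kernel or data used in this paper, and `gram_matrix` and `rbf` are hypothetical names.

```python
import numpy as np

def gram_matrix(items, kernel):
    """Pairwise kernel evaluations K[i, j] = kernel(items[i], items[j])."""
    n = len(items)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            # A kernel is symmetric, so each pair is computed once.
            K[i, j] = K[j, i] = kernel(items[i], items[j])
    return K

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel, used here purely as an example kernel."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * diff @ diff))

items = [(0.0, 0.0), (0.1, 0.0), (2.0, 1.0)]
K = gram_matrix(items, rbf)

# A PSD matrix has no negative eigenvalues (up to round-off).
eigvals = np.linalg.eigvalsh(K)
is_psd = bool(eigvals.min() > -1e-10)
```

Adding a nonnegative diagonal matrix to such a K, as in D + w · K, keeps all eigenvalues nonnegative, which is what makes the log-determinant objective well defined.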

Relation to DPPs
One interpretation of Eq. (9) is as a determinantal point process (DPP). Specifically, it is a k-DPP (Kulesza and Taskar, 2011) in the L-ensemble parameterization where we have $L = D + w \cdot K$. This interpretation as a k-DPP gives us a very clear understanding of why Eq. (9) yields a diverse beam search. The diagonal entries encode quality, which tells how "good" each candidate on the beam is, while the off-diagonal entries encode how similar two elements are and, thus, how much they should be repulsed. For an overview of DPPs we refer the reader to .
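The quality/repulsion trade-off is easiest to see for a pair of candidates. The following worked example assumes a kernel normalized to have unit diagonal and writes $q_1, q_2$ for the two candidates' probabilities and $s \in [0, 1]$ for their similarity; these symbols are introduced here only for illustration.

```latex
% L-ensemble restricted to two candidates:
L_{\{1,2\}} = D_{\{1,2\}} + w K_{\{1,2\}}
  = \begin{pmatrix} q_1 + w & w s \\ w s & q_2 + w \end{pmatrix},
\qquad
\det L_{\{1,2\}} = (q_1 + w)(q_2 + w) - w^2 s^2 .
```

For fixed quality terms, the determinant decreases as the similarity $s$ grows, so a pair of near-duplicates scores lower than an equally probable but dissimilar pair; setting $w = 0$ recovers the product $q_1 q_2$ of standard beam search.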

Computing Log-Determinants
Unfortunately, computing the argmax in Eq. (9) is an NP-hard problem (Ko et al., 1995). However, as the subdeterminant maximization problem has many applications, there has been much research on efficient algorithms for approximating log-determinants in the context of, e.g., determinantal point processes (Gillenwater et al., 2012; Han et al., 2017). One such algorithm uses a first-order approximation of the log-determinant function (Han et al., 2017). The work of Chen et al. (2018) uses a greedy, iterative approach; by updating the Cholesky factorization of the kernel matrix incrementally, the algorithm reduces inference time to $O(k^2 |S|)$ to return $k$ candidates from set $S$. Pseudocode for the latter approach can be found in Chen et al. (2018); pseudocode for the algorithm in log-space (probabilistic models are often worked with in log-space for numerical stability) can be found in App. A.
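For intuition, here is a compact NumPy sketch of greedy MAP inference with incremental Cholesky-style updates, in the spirit of Chen et al. (2018). It works in ordinary (not log) space and is not the log-space Algorithm 1 referred to above; `greedy_map` and its `eps` stopping threshold are our own illustrative choices.

```python
import numpy as np

def greedy_map(L, k, eps=1e-10):
    """Greedily pick up to k indices that approximately maximize det(L_Y).

    d2[i] tracks the marginal gain det(L_{Y + i}) / det(L_Y) of adding
    item i to the current selection Y; each step costs O(k * n).
    """
    n = L.shape[0]
    c = np.zeros((k, n))                   # incremental Cholesky rows
    d2 = np.diag(L).astype(float).copy()   # initial gains are L_ii
    selected = []
    for t in range(k):
        j = int(np.argmax(d2))
        if d2[j] < eps:                    # no item adds volume; stop early
            break
        selected.append(j)
        # New Cholesky row for every item, given that j has been added.
        e = (L[j] - c[:t, j] @ c[:t]) / np.sqrt(d2[j])
        c[t] = e
        d2 = d2 - e ** 2                   # Schur-complement update of gains
        d2[selected] = -np.inf             # never reselect chosen items
    return selected

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
L = A @ A.T + 0.1 * np.eye(8)              # a random PSD test matrix
picked = greedy_map(L, k=3)
```

In determinantal beam search the matrix passed in would be D + w · K over the candidate set; the greedy result is an approximation to the argmax, not a guaranteed optimum.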

Runtime Analysis
We consider the runtime of selecting $k$ candidates at any given time step in the recursion of Eq. (9). At each time step, we must first construct the matrix $K$. This computation is highly dependent on the set interactions being modeled; as such, let $O(c(k))$ be a runtime bound for $K$'s computation when our search uses a beam size of $k$. Once we have constructed our matrix $D + w \cdot K$, we must next select $k$ items. The set of hypotheses at any time step has size at most $k|\mathcal{V}|$. While, as discussed in §3.3, finding the size-$k$ subset that exactly maximizes Eq. (9) has exponential runtime, we assume approximate methods are employed. Using the method given by Chen et al. (2018), approximate MAP inference takes $O(k^3 |\mathcal{V}|)$ time to return $k$ items from a set of size $k|\mathcal{V}|$. Thus, the runtime at each iteration of determinantal beam search under these conditions would be $O(c(k) + k^3 |\mathcal{V}|)$. Note that standard beam search runs in $O(k|\mathcal{V}| \log(k|\mathcal{V}|))$ time at each iteration. As $k$ is generally small ($\le 20$) and the impact of $c(k)$ can be made reasonable (§3.1), the practical increase in runtime is typically only moderate.
We may also sample from the k-DPP modeled by Eq. (9) rather than taking the approximate mode; this would only require changing the inference algorithm and can be done in a similarly efficient manner (Li et al., 2016). We focus on deterministic methods in this work as we aim to find the objective-maximizing set. Further, as beam search is already a heuristic approach, approximating the maximization in Eq. (9) does not have any theoretical implications for the results of our algorithm.

Case Study: Diverse Beam Search
We now consider the task of language generation, where our vocabulary $\bar{\mathcal{V}}$ is a set of words and $\mathcal{Y}$ is the set of all valid strings derived from $\bar{\mathcal{V}}$. When the space of our kernel function is $\mathcal{S} = B_t$, one simple way of modeling interactions is through a string subsequence kernel (Lodhi et al., 2002).

Computing the String Kernel
The string subsequence kernel, proposed by Lodhi et al. (2002), is a function over two strings $s$ and $t$ computed as:
$$\mathcal{K}(s, t) = \sum_{u \in \mathcal{V}^n} \ \sum_{\mathbf{i} \,:\, u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})} \sum_{\mathbf{j} \,:\, u = t[\mathbf{j}]} \lambda^{l(\mathbf{j})} \tag{10}$$
where $\mathcal{V}^n$ is the set of all finite strings of length $n$ over the alphabet $\mathcal{V}$; $\mathbf{i}$ (or $\mathbf{j}$) denotes a vector of indices $\mathbf{i} = (i_1, \ldots, i_{|u|})$ where $1 \le i_1 < \cdots < i_{|u|} \le |s|$; $l(\mathbf{i}) \overset{\text{def}}{=} i_{|u|} - i_1 + 1$ is the length of the span that the subsequence $u$ covers in $s$; and $\lambda \in (0, 1]$ is a decay factor which serves as a penalty for gaps within a compared subsequence. Direct computation of Eq. (10) requires enumerating all $|\mathcal{V}|^n$ strings in $\mathcal{V}^n$, but efficient dynamic programs can be utilized. In this work, we employ the trie-based methods of Rousu and Shawe-Taylor (2005) to compute Eq. (10). Under this scheme, the computation of the kernel between two strings $s$ and $t$ is $O(n \cdot M \cdot \log(\max(|s|, |t|)))$, where $n$ is the chosen subsequence length (a hyperparameter) and $M$ is the number of words that strings $s$ and $t$ have in common. Note that $|s|$, and thus $M$, are bounded by the time step $t$. Further, we can reuse many of the computations between subsequent decoding rounds due to the iterative nature of both beam search and the subsequence kernel computations. Additionally, since the magnitude of Eq. (10) is influenced by the lengths of $s$ and $t$, we normalize the kernel as follows:
$$\mathcal{K}_{\text{norm}}(s, t) = \frac{\mathcal{K}(s, t)}{\sqrt{\mathcal{K}(s, s) \cdot \mathcal{K}(t, t)}}$$
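For short strings, the kernel can be evaluated directly from its definition, which is useful as a reference when testing a faster implementation. The sketch below is a naive enumeration over index tuples, not the trie-based method of Rousu and Shawe-Taylor (2005); the function names are hypothetical, and returning 0 for strings shorter than n is our own convention.

```python
from collections import defaultdict
from itertools import combinations
import math

def subsequence_features(tokens, n, lam):
    """Map each length-n subsequence u to the sum of lam**span over all
    index tuples at which u occurs (gaps allowed) in `tokens`."""
    feats = defaultdict(float)
    for idx in combinations(range(len(tokens)), n):
        u = tuple(tokens[i] for i in idx)
        span = idx[-1] - idx[0] + 1        # l(i) = i_|u| - i_1 + 1
        feats[u] += lam ** span
    return feats

def ssk(s, t, n, lam):
    """Unnormalized string subsequence kernel between token lists."""
    fs = subsequence_features(s, n, lam)
    ft = subsequence_features(t, n, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

def ssk_normalized(s, t, n, lam):
    """Length-normalized kernel; defined as 0 if either string has no
    subsequence of length n (an assumption made for this sketch)."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom > 0 else 0.0

# Character-level example: the only shared 2-subsequence is ("c", "a").
k_raw = ssk(list("cat"), list("car"), n=2, lam=0.5)
k_norm = ssk_normalized(list("cat"), list("car"), n=2, lam=0.5)
```

The same functions apply to word-level inputs by passing lists of words, e.g. `"a raffle was held".split()`, which matches the setting of this case study more closely than characters.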

Integration into DetBS
The string subsequence kernel gives us a straightforward method for decoding diverse sets of strings from language generators. We construct the matrix using the dynamic program mentioned above to compute $\mathcal{K}(B_t)$. Intuitively, we can expect the argmax of $D + w \cdot \mathcal{K}(B_t)$, i.e., the size-$k$ set corresponding to the objective-maximizing submatrix, to have higher subsequence diversity as $w$ is increased. This is perhaps most easily seen when viewing our problem as a k-DPP: if strings $y^{(i)}$ and $y^{(j)}$ have high overlap, this will be reflected in the matrix $\mathcal{K}(B_t)$ at position $i, j$. Higher values of $\mathcal{K}(B_t)_{i,j} = \mathcal{K}(y^{(i)}, y^{(j)})$ lead to lower probability of both $y^{(i)}$ and $y^{(j)}$ being in the set drawn according to the k-DPP parameterized by $D + w \cdot \mathcal{K}(B_t)$, which follows from the properties of DPPs outlined in §2.2. In short, higher values of $\mathcal{K}(y^{(i)}, y^{(j)})$ decrease the value of $\log \det(D_Y + w \cdot \mathcal{K}(B_t)_Y)$ for sets $Y$ containing both $y^{(i)}$ and $y^{(j)}$, which makes $Y$ less likely to be chosen in the recursion of Eq. (9).

Experiments
In our experiments, we explore the use of determinantal beam search as a diverse decoding strategy for language generation.

Baselines
Various diverse decoding strategies exist in the NLP literature. We first discuss those strategies that we employ as baselines in our experiments.
Standard Beam Search. Beam search is one of the most widely used decoding algorithms in NLP, where many problems require efficient strategies for decoding solutions from structured predictors. Specifically, for language generation tasks, beam search has repeatedly proved its effectiveness at decoding state-of-the-art solutions (Wu et al., 2016;Serban et al., 2017;Edunov et al., 2018;Yang et al., 2019). We refer back to §2.1 for the algorithm.
Stochastic Beam Search. Kool et al. (2019) propose stochastic beam search (SBS), a decoding technique that samples without replacement from sequence models according to their distribution over the entire space $\mathcal{Y}$. For random sampling methods such as SBS, it is customary to use a sampling temperature $T > 0$ at generation time to control for the peakiness of the sampling distribution. This results in the generalized softmax:
$$p_T(y_t \mid y_{<t}, x) = \frac{\exp\big(\log p(y_t \mid y_{<t}, x) / T\big)}{\sum_{y' \in \bar{\mathcal{V}}} \exp\big(\log p(y' \mid y_{<t}, x) / T\big)}$$
where larger $T$ may lead to more diverse sets simply due to additional smoothing.
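The smoothing effect of the temperature can be checked numerically. The following is a small sketch of temperature scaling applied to a vector of next-token log-probabilities; `temperature_softmax` and the example distribution are illustrative, not taken from the experiments.

```python
import numpy as np

def temperature_softmax(logprobs, T):
    """Rescale log-probabilities by 1/T and renormalize."""
    z = np.asarray(logprobs, dtype=float) / T
    z -= z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logprobs = np.log([0.7, 0.2, 0.1])
p_same = temperature_softmax(logprobs, T=1.0)    # unchanged distribution
p_flat = temperature_softmax(logprobs, T=2.0)    # smoother
p_sharp = temperature_softmax(logprobs, T=0.5)   # peakier
```

Values of T above 1 move probability mass toward less likely tokens, while values below 1 concentrate it on the most likely token.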
Diverse Beam Search. Vijayakumar et al. (2018) propose a modification to the standard beam search algorithm, which they term diverse beam search (DBS), to alleviate lack of diversity. The algorithm further divides the beam into $G$ groups $B_t^1, \ldots, B_t^G$, where $G$ is a hyperparameter of the algorithm, and optimizes for diversity between the different groups by subtracting a similarity term $\Delta(y_{\le t}, B_t^g)$ from the decoding objective. Specifically, $\Delta(y_{\le t}, B_t^g)$ represents the degree of similarity between a hypothesis $y_{\le t}$ and a group of hypotheses $B_t^g$. They find that $G = k$, i.e., each group contains a single hypothesis, and the Hamming distance similarity metric lead to the best results; we use these settings in our experiments. Note that under this scheme, the solution set may have duplicates if the diversity penalty is not large enough.
Notably, under the above experimental settings, the runtimes of diverse beam search and our algorithm are the same, up to computation of the Hamming loss and string kernel, respectively. However, while string kernel computations in our algorithm can be done in parallel, the diversity penalty in diverse beam search must be computed sequentially for each hypothesis, as it is based on the previously chosen groups.

Setup
We run experiments on neural machine translation (NMT) models trained on the WMT'14 (Bojar et al., 2014) En-Fr and the WMT'19 (Barrault et al., 2019) De-En datasets; for reproducibility, we use the pretrained models made available by fairseq (https://github.com/pytorch/fairseq/tree/master/examples/translation). We evaluate on the newstest set from the respective datasets, each containing 3003 sentences. Further details can be found in App. B.
Figure 1: Averaged n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'14 En-Fr and WMT'19 De-En newstest using various decoding strategies. The free parameter for each strategy is either the softmax temperature or the weight of the diversity parameter (see §5.2).
For determinantal beam search (DetBS), we perform a hyperparameter search (precise details likewise in App. B) over λ and n, the decay factor and subsequence length, respectively. Search is performed for fixed w = 0.1 and k = 10 on validation sets for both languages; we omit a search over the entire space of w, k, λ, n so as to not create an unfair advantage for DetBS in comparison with the other decoding strategies, for which no hyperparameters are tuned. We use subsequence length n = 2 and λ ∈ {0.1, 0.3} for De-En and En-Fr, respectively.
We decode sets of size k ∈ {5, 10, 20} with each strategy, comparing sentence-level BLEU and n-gram coverage $d_n$ averaged across n ∈ {1, 2, 3, 4} in the decoded sets, where we define $d_n$ as
$$d_n = \frac{\#\text{ of unique } n\text{-grams in } k \text{ strings}}{\#\text{ of } n\text{-grams in } k \text{ strings}} \tag{14}$$
While $d_n$ has a more natural interpretation as coverage of different n-grams, the above quantity is often referred to as n-gram diversity in the literature and so we transition to this term for consistency. Following the experimental setup of Kool et al. (2019), we vary sampling temperature T ∈ {0.1, 0.2, . . . , 0.8} in the case of beam search and stochastic beam search and diversity weight w ∈ {0.1, 0.2, . . . , 0.8} in the case of diverse beam search. For DetBS, we observe that larger sets require a smaller diversity penalty to achieve good n-gram diversity: in Fig. 1 we show results for DetBS with the string subsequence kernel for w ∈ {0.01, 0.02, . . . , 0.1, 0.2, 0.3, 0.4} for k = 5, w ∈ {0.01, 0.02, . . . , 0.15} for k = 10, and w ∈ {0.01, 0.02, . . . , 0.05} for k = 20. To observe how BLEU is affected by larger diversity coefficients under DetBS, we explore a finer grain of weights for DetBS in App. C.

Results
Fig. 1 shows the sentence-level BLEU score and averaged n-gram diversity on the newstest set for different decoding strategies; Tab. 2 shows explicit coverage of 1, 2, 3, 4-grams and averaged across 1, 2, 3, 4-grams for different decoding strategies when BLEU is controlled for. The three lines per decoding strategy in Fig. 1 represent the minimum, median, and maximum sentence-level BLEU score out of the k translation options, averaged across the corpus. We consider median BLEU to be the best metric of set text-quality, as a good diverse decoding algorithm should not completely sacrifice BLEU for the sake of diversity. The plots are analogous to those in Kool et al. (2019). On both datasets and across different set sizes, results indicate that DetBS generates diverse sets of strings while maintaining high median and maximum BLEU scores. We see similar or higher n-gram diversity in comparison to DBS for the same median BLEU and a notably better n-gram diversity vs. BLEU trade-off than standard beam search and SBS. Further, the highest quality translation (shown by max BLEU) does not appear to be sacrificed when the diversity parameter is increased for DetBS. In contrast, there is a notable drop-off for generation strategies in which diversity is controlled for using temperature. We show samples of generated text in Tab. 1.

Table 1: Samples of generated text for two source sentences: (a) "Zum Abschluss wurde eine Tombola verlost." and (b) "Die Wahrheit zu sagen ist aber kein Verbrechen."
Beam Search (T = 0.6), source (a):
• A raffle was held to close the event.
• A raffle was held to conclude the event.
• A raffle was held at the end.
• At the end a raffle was held.
• A raffle was held to close the draw.
Beam Search (T = 0.6), source (b):
• But telling the truth is not a crime.
• Telling the truth is not a crime.
• But telling the truth isn't a crime.
• But telling the truth is no crime.
• However, telling the truth is not a crime.
Diverse Beam Search (w = 0.4), source (a):
• To conclude, a raffle was held.
• A raffle was held to close the event.
• A raffle was held to close the event.
• A raffle was held to close the event.
• At the end of the event, a raffle was held.
Diverse Beam Search (w = 0.4), source (b):
• But telling the truth is not a crime.
• But telling the truth is not a crime.
• But telling the truth is not a crime.
• But telling the truth is not a crime.
• Telling the truth, however, is not a crime.
Determinantal Beam Search (w = 0.12), source (a):
• Finally, a raffle was held.
• A raffle was held at the end.
• At the end a raffle was held.
• To conclude, a raffle was held.
• A raffle was held to close the event.
Determinantal Beam Search (w = 0.12), source (b):
• But telling the truth is not a crime.
• But telling the truth isn't a crime.
• Telling the truth is not a crime.
• However, telling the truth is not a crime.
• But to tell the truth is not a crime.
Note: For each decoding strategy, we choose the diversity parameter corresponding to the most diverse set that had median BLEU 28.5 ± 0.05.
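The n-gram diversity metric of Eq. (14) is simple to compute for a decoded set. The sketch below pools n-grams over all k strings, as the definition suggests; whitespace tokenization and the function names are assumptions made for illustration and may differ from the tokenization used in the reported experiments.

```python
def ngram_diversity(strings, n):
    """d_n: unique n-grams divided by total n-grams, pooled over the set."""
    total, unique = 0, set()
    for s in strings:
        toks = s.split()  # assumption: simple whitespace tokenization
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

def averaged_diversity(strings, orders=(1, 2, 3, 4)):
    """Average of d_n over the given n-gram orders."""
    return sum(ngram_diversity(strings, n) for n in orders) / len(orders)

decoded = ["A raffle was held at the end .",
           "At the end a raffle was held ."]
d1 = ngram_diversity(decoded, 1)
d_avg = averaged_diversity(decoded)
```

A set of identical strings drives d_n toward 1/k, while a set with no repeated n-grams attains d_n = 1, which is why duplicated hypotheses are penalized heavily by this metric.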

Related Work
Our work is built upon much of the subset optimization literature in machine learning. We base our algorithm on the subdeterminant maximization problem (Agarwal et al., 2004), which has been used to find core sets (a concept originating in computational geometry concerning the existence of a small, representative set of core items) in data summarization problems (Mirzasoleiman et al., 2013), nearest neighbor search (Abbar et al., 2013) and streaming algorithms (Indyk et al., 2014) inter alia. Informally, we can connect our problem to the notion of decoding a core set from sequence models. To the best of our knowledge, our work is the first to use this concept when decoding sequence models. Wang and Chan (2019) incorporate DPPs into a reinforcement learning objective to optimize for diverse text when training image captioning models. We optimize for diversity during decoding, rather than training, which makes our methods applicable with out-of-the-box models and allows us to avoid highly hyperparameter-sensitive techniques, like minimum-risk training or reinforcement learning-based algorithms, while achieving the same goal. While the application of our methods at training time is an interesting research direction, we foresee technical challenges corresponding to such approaches that may outweigh their benefits.
Table 2: Coverage (as defined in Eq. (14)) of 1, 2, 3, 4-grams and averaged across 1, 2, 3, 4-grams as well as median BLEU for k = 20 on the newstest dataset. For each decoding strategy, we report metrics on the generated set that has highest (average) $d_n$, where we set the constraint that median BLEU for the set is still within 1 point of the highest median BLEU (across decoding strategies and diversity parameters).
As a decoding method, our work is closest to that of Vijayakumar et al. (2018), who propose a variation of beam search (described in §5.3). However, their algorithm lacks theoretical motivation and is not guaranteed to provide a non-overlapping set; the same solution may appear multiple times in the decoded set if the diversity penalty is not large enough, as shown in Tab. 2. Additionally, groups at each time step t must be processed in order since the score of all hypotheses considered for group g + 1 depends on hypotheses in groups 1, . . . , g, which creates a large bottleneck under the recommended settings of G = k.
Random sampling strategies for decoding neural sequence models have received much attention in recent years. While techniques such as stochastic beam search and the UniqueRandomizer (Shi et al., 2020) are convenient for creating statistical estimators and have uses in reinforcement learning techniques due to their clear probabilistic interpretation, there are no diversity guarantees for the set of generated sequences. Tam (2020) likewise adapts beam search, proposing a k-means clustering version that clusters solutions by averaged word embeddings. As there is no clear interpretation of distance between averaged word embeddings, though, it is unclear if the method can explicitly optimize for any tangible notion of coverage or diversity.

Conclusion
We propose determinantal beam search (DetBS): a new way of framing beam search that allows us to optimize set generation for diversity and coverage rather than simply individual scores. Formally, we redefine beam search as an iterative subdeterminant maximization problem where we select the approximately maximizing set according to the PSD matrix parameterizing our score function. This gives us the ability to encode the notion of intra-set diversity into the beam search optimization problem. We discuss and experiment with efficient methods for inference and kernel computation that make DetBS an efficient decoding strategy in practice. We use DetBS in the context of language generation, where we explicitly encourage n-gram coverage through the string subsequence kernel. In our NMT experiments, we find DetBS generates much more diverse sets of strings than standard beam search and stochastic beam search with a small trade-off in median BLEU. We observe competitive performance compared with diverse beam search.

A Log-Space Computations
Algorithm 1 Fast Greedy MAP Inference with log-space parameterization (Chen et al., 2018). We transform computations according to (Li and Eisner, 2009).
Hyperparameters. As we use the string subsequence kernel of §4 in DetBS, there are a number of hyperparameters that can be adjusted beyond the diversity weight w: the decay factor λ indicates the degree to which interior gaps are penalized and subsequence length n indicates the length of the considered substrings u. For each language, we perform a search over these two hyperparameters for set size k = 10 and diversity coefficient w = 0.1 on validation sets. We use a grid search over n ∈ {2, 3, 4, 5, 6, 7, 8} and λ ∈ {0.1, 0.3, 0.5, 0.7, 1.0}. We choose the configuration that yields the highest (average n-gram diversity) × BLEU, using this configuration in all subsequent experiments. While there may be better performing hyperparameters under different k and w, we omit searching over the entire space to create a fairer comparison with the other decoding strategies.

Dataset          n    decay (λ)
WMT'14 En-Fr     2    0.3
WMT'19 De-En     2    0.1

Interestingly, larger values of n did not improve performance, and were more computationally expensive; small values of n and decay λ appear to offer the best BLEU vs. n-gram diversity trade-off.

Dataset and Model Statistics
We use a convolutional sequence-to-sequence model trained according to Gehring et al. (2017)

C Additional Results
Figure 2: n-gram diversity vs. minimum, median and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'19 De-En newstest using a larger range of the diversity weight w.