Minimizing Annotation Effort via Max-Volume Spectral Sampling

We address the annotation data bottleneck for sequence classification. Specifically we ask the question: if one has a budget of N annotations, which samples should we select for annotation? The solution we propose looks for diversity in the selected sample, by maximizing the amount of information that is useful for the learning algorithm, or equivalently by minimizing the redundancy of samples in the selection. This is formulated in the context of spectral learning of recurrent functions for sequence classification. Our method represents unlabeled data in the form of a Hankel matrix, and uses the notion of spectral max-volume to find a compact sub-block from which annotation samples are drawn. Experiments on sequence classification confirm that our spectral sampling strategy is in fact efficient and yields good models.


Introduction
In recent years the field of NLP has witnessed great progress on supervised machine learning methods for sequence classification. However, most of these methods require large amounts of annotated training data. Because of this, whenever a new NLP application needs to be developed, data annotation becomes the main bottleneck in terms of cost and time. For example, a defense research analyst might wish to quickly train a text classifier to detect emergent socio-political events in a given conflict area. Since there might be only a few experts on the subject, their time will be costly. Therefore, the expert should be able to train models fast, with minimal annotation effort.
To address the annotated data bottleneck, researchers have proposed active learning approaches that develop sampling strategies designed to minimize the number of annotations required to train a model (Settles, 2009; Wang and Shang, 2014; Zhang et al., 2016; Siddhant and Lipton, 2018). Most active learning proposals are based on two main strategies. The first strategy uses model uncertainty and selects samples for which the prediction of the current model is the least confident. This strategy might not work very well during the first iterations of active learning, when the predictions of the model are unstable. Furthermore, the model uncertainty criterion cannot be applied in the first iteration, when no model has been trained and one needs to resort to other cold-start sampling strategies (Yuan et al., 2020). To overcome the limitations of the uncertainty approach, other researchers have proposed sampling strategies that attempt to maximize the diversity of the selected samples (Shao et al., 2019).

Figure 1: Representing unlabeled data in the form of a Hankel matrix can be very effective to uncover latent structure of the data. We present a sampling technique to leverage this structure.
Besides the selection strategy, another dimension of an active learning method is the type of annotation feedback that it exploits. For example, in text classification the annotations can consist of labels for complete texts, phrases, sentences, features, rules or labeling functions (McCallum and Nigam, 1999;Settles et al., 2008;Druck et al., 2009;Ratner et al., 2017;Safranchik et al., 2020).
In this paper we focus on the problem of training sequence classification models under an annotation budget constraint and with no prior trained model for the task. This is sometimes referred to as the cold-start problem. We consider a setting in which the annotation feedback is at the level of phrases. Our goal is to develop an efficient algorithm that can answer the question: if one has a budget of N annotations, which phrases should we select to annotate?
Notice that active sampling under an annotation budget is different from the classical active learning scenario. In the classical setting, the learning algorithm has access to a large unlabeled data pool, and in a series of iterations it alternates between sampling data to annotate and training a new model. In contrast, when training under budget constraints, the focus is on the initial setting, when there is no model that can guide the selection and all that is available to the selection algorithm is the unlabeled data pool. The second difference is that our goal is to find the optimal set of size N, i.e. the selection criterion should be able to score a set or batch of phrases. In that sense our work is more related to core-set approaches (Sener and Savarese, 2017).
Our proposed solution for the problem of learning under a budget constraint follows a diversity sampling strategy. That is, given a fixed budget, our batch selection method attempts to maximize the amount of useful information contained in the batch or, equivalently, to minimize annotation redundancy in the selected batch. More precisely, our approach is inspired by methods that minimize annotation redundancy by uncovering latent structure in the input domain (Dasgupta and Hsu, 2008).
Intuitively, imagine a classification problem with k classes. If we had access to a clustering of the data into k groups that perfectly align with the target classes only k labeled points would be needed. That is, we would label a representative sample for each cluster. Of course, the perfect clustering might not exist but the point is that by discovering relevant latent structure one can minimize annotation redundancy and ask only for the annotations that are really necessary.
We take this basic idea and translate it to the sequence classification setting. Essentially, our method induces an implicit soft clustering of phrases (i.e., subsequences) so that we only need annotations for one phrase in each cluster. Following the classical distributional principle, we consider two phrases to be similar if they can appear in similar contexts.
Our technical contribution exploits ideas from the theory of spectral and Hankel-based learning methods for estimating recurrent sequence prediction functions with linear state dynamics (Hsu et al., 2009; Bailly et al., 2009; Balle et al., 2014; Rabusseau et al., 2019). We reduce the problem of training sequence classification models under annotation budget constraints to the problem of selecting a high-volume matrix sub-block (i.e. a sub-block of high rank) from a Hankel matrix (computed from the unlabeled pool) that captures key statistics of the sequence domain distribution. See Figure 1 for a sketch. To our knowledge, reducing sampling strategies to spectral matrix operations is a novel technical approach. Recent methods for cold-start sampling with an annotation budget have considered clustering embeddings of sentences derived from BERT (Yuan et al., 2020), either as a single-shot sampling (like our method) or by iteratively fine-tuning the embeddings used for sampling.
We highlight two main contributions:
• We propose a notion of sample diversity based on structural properties of low-rank Hankel matrices. Using this notion we derive a phrase-sampling algorithm for learning under annotation budget constraints, i.e. the cold-start challenge.
• In experiments, we compare our spectral sampling strategy to recent active learning methods for fine-tuning BERT-based sentence classifiers, that also seek diversity in the sampling step. Our results show that under strict budget constraints a simple latent-state model (in our case, linear RNNs) can outperform the neural BERT-based approach, despite the fact that our models are strictly less expressive and are not pre-trained.
The paper is organized as follows: Section 2 starts with a description of linear recurrent sequence functions, and then provides the spectral learning background necessary to understand our sampling strategy, and in particular the concept of Hankel matrices. Section 3 presents the proposed phrase selection method based on selecting a max-volume sub-block from a Hankel matrix representing the unlabeled pool. Section 4 presents our experiments on text classification. Finally, Section 5 concludes the paper.

Linear RNNs and Hankel Matrices
In this paper we work with simple Recurrent Neural Networks that use linear functions (matrix products) to compute hidden-state vectors along the sequence. In this setting, we describe connections to spectral learning, and specifically to the Hankel matrix of a recurrent sequence model. This is a central tool to derive the sampling strategy.
A Recurrent Neural Network (RNN) takes as input a sequence x and outputs a vector of k real numbers, f : Σ^* → R^k, where x = x_1 ⋯ x_n is a sequence of length n over some finite alphabet Σ. We denote by Σ^* the set of all finite sequences, and we use it as the domain of our functions. An RNN with hidden dimension d and output dimension k is defined as a tuple ⟨h_0, W_h, b_h, W_y, b_y⟩ that computes hidden-state vectors along the sequence as

h_t = γ_h(W_h [h_{t-1}; e_{x_t}] + b_h),

where e_σ is an indicator vector of the current symbol σ ∈ Σ that selects the appropriate column weights in W_h. The parameters W_y ∈ R^{k×d}, b_y ∈ R^k model the k function values given the hidden vectors:

f(x) = γ_y(W_y h_n + b_y).

The functions γ_h and γ_y are activation functions, and in this paper we focus on simply using the identities.
The most common use of general RNNs in NLP is language modeling. In this case the model is set to predict the conditional probability of the next symbol (with k = |Σ| symbols), and by means of the chain rule, the model defines a distribution over the language and is trained to generate sequences left-to-right. Another popular use is sequence classification, where the model computes a classification score for each of the k labels of a task, given an input sequence x_{1:n}.
Another application of RNNs, which is less common in the literature, is to frame language modeling as a density estimation task, where the model estimates the probability of a full sentence directly. In this case, the RNN predicts a single score (i.e. k = 1) which corresponds to the probability of the input x_{1:n}, and we can regard this as a regression learning problem, i.e. learn a real-valued function that approximates the target probabilities given the hidden state vector of the input sequence.
Finally, instead of modeling the probability of full sequences, RNNs can directly approximate the moments of the distribution. That is, learn a function from Σ^* → R that estimates the expected number of times a subsequence x_{1:n} is observed in a random sequence sampled from the target distribution. Modeling moments, such as ngram statistics, has the advantage that the target statistics are less sparse than full sequences, even for long ngrams.
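As a concrete illustration of these moment statistics, the sketch below computes empirical ngram expectations from a toy corpus; the function name ngram_expectations and the toy sentences are ours, not the paper's.

```python
# Sketch (our own, not the paper's code): empirical ngram expectations.
# f_T(x) is the average number of times the contiguous phrase x occurs in a
# sequence of the training sample T.
from collections import Counter

def ngram_expectations(corpus, max_len=4):
    """corpus: list of token lists. Returns {ngram tuple: expected count per sequence}."""
    counts = Counter()
    for seq in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    total = len(corpus)
    return {gram: c / total for gram, c in counts.items()}

corpus = [["the", "movie", "was", "great"], ["the", "plot", "of", "the", "movie"]]
f_T = ngram_expectations(corpus, max_len=2)
print(f_T[("the",)])          # 1.5: "the" occurs 3 times over 2 sentences
print(f_T[("the", "movie")])  # 1.0: once in each sentence
```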

Linear RNNs and Hankel Matrices
We now focus on linear RNNs where the activation functions γ_h and γ_y are simply the identity function. We describe some interesting properties of linear RNNs that we exploit in the context of sampling.
A linear RNN N can be rewritten into a Weighted Finite Automaton (WFA) of this form:

f(x_{1:n}) = α_0^⊤ A_{x_1} A_{x_2} ⋯ A_{x_n} B,    (4)

where α_0 ∈ R^d is an initial state vector; A_σ ∈ R^{d×d} are the transition matrices associated with each symbol σ ∈ Σ; and B ∈ R^{d×k} is a matrix of state-to-output weights. One can verify that a linear RNN N with d hidden states and k outputs can be rewritten as a WFA ⟨α_0, {A_σ}, B⟩ of dimension d' = d + 2 and k outputs.

Note that under Eq. 4 the computation of a linear RNN is not necessarily in a forward manner (left-to-right), but can also be in a backward manner (right-to-left). Given an input sequence x_{1:n}, one can define forward vectors for prefixes of the sequence, α_t^⊤ = α_{t-1}^⊤ A_{x_t}, and backward matrices for suffixes of the sequence, β_t = A_{x_t} β_{t+1} with β_{n+1} = B. Then, for any position 1 ≤ t ≤ n we have that f(x_{1:n}) = α_t^⊤ β_{t+1}.

We now focus on linear RNNs that compute a single output value, i.e. k = 1. We can represent a linear RNN using a Hankel matrix. The Hankel matrix of a sequence prediction function f is a bi-infinite matrix H_f ∈ R^{Σ^* × Σ^*} indexed by prefixes and suffixes of the language, such that H_f(p, s) = f(p · s). A central result establishes that for a WFA with d dimensions and k = 1 that computes a function f, the rank of H_f is at most d. This is because WFAs and linear RNNs factor the computation of f into products of prefix and suffix vectors, which are of dimension d. The reverse also holds: if a Hankel matrix H_f has rank d, then there is a WFA with d states that computes the associated function f. Next we describe spectral learning, which uses this result. See (Rabusseau et al., 2019) for further connections between WFAs and linear RNNs, and (Quattoni and Carreras, 2020) for an application of WFAs to NLP sentence classification tasks.
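For illustration, the numpy sketch below evaluates a toy WFA in the form of Eq. 4 and checks the prefix/suffix split f(x) = α_t^⊤ β_{t+1}; the parameters are random and the function names are ours.

```python
# Sketch (our own) of WFA evaluation as in Eq. 4, with a toy random WFA.
import numpy as np

def wfa_eval(alpha0, A, B, x):
    """f(x) = alpha0^T A_{x_1} ... A_{x_n} B for a symbol sequence x."""
    state = alpha0
    for sym in x:
        state = state @ A[sym]        # forward vector alpha_t
    return state @ B

def wfa_eval_split(alpha0, A, B, x, t):
    """Same value via the prefix/suffix factorization f(x) = alpha_t^T beta_{t+1}."""
    alpha = alpha0
    for sym in x[:t]:                  # forward pass over the prefix x_{1:t}
        alpha = alpha @ A[sym]
    beta = B
    for sym in reversed(x[t:]):        # backward pass over the suffix x_{t+1:n}
        beta = A[sym] @ beta
    return alpha @ beta

rng = np.random.default_rng(0)
d, k = 3, 1
A = {"a": rng.normal(size=(d, d)), "b": rng.normal(size=(d, d))}
alpha0, B = rng.normal(size=d), rng.normal(size=(d, k))
x = ["a", "b", "a"]
assert np.allclose(wfa_eval(alpha0, A, B, x), wfa_eval_split(alpha0, A, B, x, t=2))
```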

The Spectral Method
Spectral learning is based on learning a low-rank Hankel matrix of the target distribution. Here we provide a high level description of the method; for a complete derivation and the theory justifying the algorithm we refer the reader to the works by Hsu et al. (2009) and Balle et al. (2014).
At training time, we are given a sample of sequences T from the distribution and we want to estimate f. We denote by f_T(x) the empirical subsequence expectation of x in T. Using f_T, the spectral method estimates a WFA A with d states, where d is a parameter of the algorithm, such that f_A is a good approximation of f. The method reduces the learning problem to computing an SVD of the training Hankel matrix, which collects the observed expectations f_T. The method is as follows: (1) Select a set of prefixes P and a set of suffixes S that serve as the row and column indices of the Hankel matrix, respectively. For example, select all subsequences up to a certain size n.
(2) Create a Hankel matrix H ∈ R^{|P|×|S|} for the basis (P, S). Each entry is indexed by a prefix p ∈ P and a suffix s ∈ S, and its value is the evaluation of f_T on the concatenation of the prefix and the suffix, i.e. H(p, s) = f_T(p · s).
(3) Compute a d-rank factorization of H. Compute the truncated SVD of H, i.e. H ≈ UΣV^⊤, resulting in a matrix P = UΣ ∈ R^{|P|×d} and a matrix S = V^⊤ ∈ R^{d×|S|}. Then H ≈ PS is a d-rank factorization of H, with P and S being projection matrices of prefixes and suffixes (respectively) to d-dimensional embeddings.
(4) Recover the WFA A of d states using the previous Hankel factors P and S (details omitted).
The steps above are only a sketch of the method; a full description can be found in (Balle et al., 2014). The main computation is dominated by step (3), the SVD of the Hankel matrix, which is at most cubic in the size of the matrix.
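As a sketch of steps (1)-(3), the snippet below builds a finite Hankel matrix from empirical expectations and computes its d-rank factorization with a truncated SVD; the helper names are ours, and step (4), the WFA recovery from P and S, is omitted here as in the text (see Balle et al., 2014).

```python
# Sketch (our own) of the Hankel construction and rank-d SVD factorization.
import numpy as np

def build_hankel(f_T, prefixes, suffixes):
    """H(p, s) = f_T(p . s); concatenations not present in f_T are treated as 0."""
    H = np.zeros((len(prefixes), len(suffixes)))
    for i, p in enumerate(prefixes):
        for j, s in enumerate(suffixes):
            H[i, j] = f_T.get(tuple(p) + tuple(s), 0.0)
    return H

def rank_d_factorization(H, d):
    """Truncated SVD H ~= P @ S, with P = U Sigma (prefix embeddings) and S = V^T."""
    U, sigma, Vt = np.linalg.svd(H, full_matrices=False)
    P = U[:, :d] * sigma[:d]   # |prefixes| x d
    S = Vt[:d, :]              # d x |suffixes|
    return P, S
```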
One could imagine a Hankel matrix of infinite size, which would capture the statistics of all of the training subsequences. The theory behind spectral learning shows that this infinite Hankel matrix, when representing a function computable by a minimal WFA of d states, has rank d. Furthermore, the theory shows that any sub-block of the infinite Hankel matrix that preserves the rank (i.e. that has rank d) is sufficient to learn the target WFA. This observation sheds light on step (1) of the algorithm above: it attempts to define a finite sub-block of the infinite Hankel matrix (by defining a finite basis of prefixes P and suffixes S) that preserves its rank.
In theory, we can define a Hankel matrix that captures all of the data by setting both P and S to be all subsequences found in any training sequence. However, this results in a very large Hankel matrix, which directly affects the cost of the SVD in step (3). There exist techniques to handle this computational bottleneck (Quattoni et al., 2017).

Phrase Sampling via Max-Volume Optimization
In this section we describe a deterministic phrase sampling method for sequence classification. We assume an unlabeled pool of sequences, and the goal is to select an annotation batch. Once selected, the batch is first annotated, and then a model is learned from it. The sampling strategy we describe selects phrases for annotation, i.e. subsequences of sentences (ngrams) found in the unlabeled pool. Notice that when learning a sequence model with the spectral method, we only use the information contained in the selected Hankel sub-block. In the previous section we discussed the importance of selecting a small sub-block of the Hankel matrix for computational efficiency. In that setting, it is assumed that there is enough training data to compute all the entries of the Hankel matrix, and a sub-block of H is selected to ease the computation of the SVD, which is required to infer the model parameters.
The problem that we address in this paper is different: the focus is annotation efficiency, not computational efficiency. In our case, we assume that we need to estimate the Hankel matrices H_l for each label l. Initially, we do not have samples to compute any of their entries, so we ask the question: Is there a way to select the samples to annotate so that they provide the most information about the target class distributions? Or equivalently, is there a way to request annotations so that they give us the most information about H_l?
Our solution uses an approximation of H_l given by the Hankel matrix H_U associated with a language model of the unlabeled distribution. We use H_U to pick the most informative entries of H_l, i.e. those for which we will request annotations. Intuitively, each prefix in the matrix is described as a distribution over suffixes, and analogously each suffix as a distribution over prefixes. The proposed approach selects the prefixes and suffixes that are the most uncorrelated, so that annotating their concatenations provides the most information about H_l. In essence, this selects a set of representative prefix and suffix prototypes in the latent space of prefix and suffix embeddings derived from H_U.
The difference between the computational and sampling problems has an analogue in recommendation systems based on collaborative filtering. There, one creates a matrix where rows are users and columns are movies, and each entry holds the rating given by a user to a movie. Some entries are observed and some are missing, the matrix is assumed to be low-rank, and the goal is to complete the matrix and predict the ratings that users will give to unseen movies. In this context, the computational challenge is to perform the SVD of a potentially very large matrix, which is required for low-rank matrix completion. In contrast, the sampling problem assumes that we can query users for ratings on specific movies. The optimal sampling question is: what is the most informative subset of user-movie ratings to request, so that from the chosen subset we can predict unseen user-movie ratings?

The Max-Volume Hankel Sub-block
First, we are interested in having an annotation budget. This budget could be defined in terms of the number of tokens to annotate. However, for reasons that will become apparent, in our spectral approach it is more natural to define a budget on the size of the sub-block; the number of tokens to annotate is then determined by it.
We redefine the spectral algorithm to work with Hankel sub-blocks of size b × b, where b is the budget. Given a large Hankel matrix, it is known that finding the sub-block of size b of maximum rank is NP-hard (Peeters, 2003). Fortunately there exist reasonable approximations. A popular approach, often used in the context of reconstructing low-rank matrices under computational constraints, is to search for the sub-block of highest volume, where the volume of a matrix is defined as the absolute value of its determinant, i.e. the product of its singular values (Bebendorf, 2000; Çivril and Magdon-Ismail, 2009; Cortinovis et al., 2019). While finding a sub-block of maximum volume is also NP-hard, there exist efficient and widely used approximation algorithms. In this paper we use an iterative algorithm based on LU factorization (Miranian and Gu, 2003). It iteratively factors matrices of size n × b, where n is the number of rows/columns of the original Hankel matrix and b is the budget. In our experiments, this routine takes a few minutes to converge.
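For illustration, the sketch below picks an approximately high-volume b × b sub-block with a simple greedy pivoted-QR heuristic; this is a stand-in we use for exposition, not a reimplementation of the LU-based routine of Miranian and Gu (2003).

```python
# Sketch (our own): greedy selection of a high-volume sub-block via pivoted QR.
import numpy as np
from scipy.linalg import qr

def high_volume_subblock(H, b):
    """Return row and column indices of an approximately maximum-volume b x b sub-block of H."""
    # Pick b columns whose span is large (column-pivoted QR on H).
    _, _, col_piv = qr(H, mode="economic", pivoting=True)
    cols = np.sort(col_piv[:b])
    # Pick b rows the same way, working on the transpose of the selected columns.
    _, _, row_piv = qr(H[:, cols].T, mode="economic", pivoting=True)
    rows = np.sort(row_piv[:b])
    return rows, cols

# rows index the selected prefixes and cols the selected suffixes;
# abs(np.linalg.det(H[np.ix_(rows, cols)])) is the volume being (approximately) maximized.
```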

Max-Volume Sub-block for Sampling
We now turn to using max-volume as a sampling strategy for sequence classification, under an annotation budget. The classifiers we use are ensembles of linear RNNs, with one model for each label trained to estimate the class-specific moments. We could attempt to select a Hankel sub-block for each label, but the sub-block selection methods we described would require a large Hankel matrix specific to each label, which in turn would require large labeled training data.
The main idea behind our sampling strategy is to have a single sub-block that is common to all the labels, and to make this selection using the distribution of unlabeled sequences in the domain. Specifically, we first consider a Hankel matrix H_U of the sequences in the unlabeled pool, where the value of an entry H_U(p, s) is the expected number of times the phrase p · s is observed in a sequence sampled from the unlabeled pool. This corresponds to a Hankel matrix for language modeling, since it estimates the domain distribution. This Hankel matrix is used to select a max-volume sub-block satisfying the given budget b, resulting in a set of b prefixes P and a set of b suffixes S. This basis is then used to define a Hankel matrix specific to each label.

Filling in Hankel Matrices
For each label l, we need to fill all the entries of the associated Hankel matrix H_l, defined over the max-volume basis. The value of one entry H_l(p, s) corresponds to the expected number of times the phrase p · s is observed in a sequence sampled from the distribution of sequences of label l. For each possible phrase defined by the basis, and for each label l, we would need such a statistic. It seems unrealistic to ask an annotator for this kind of feedback.
Instead, we note that because of the Zipfian nature of language, rather than requiring the actual expectation of a phrase, in many cases it suffices to know whether that expectation is 0 or not, i.e. whether or not that phrase can appear in sequences of that class. Put differently, we postulate that most of the information is in the sparsity pattern of the moments in the Hankel matrix, rather than in their real values.
Designing an annotation strategy to fill sparsity patterns is much easier. For each prefix p ∈ P and suffix s ∈ S we consider the phrase q = p · s. If q does not appear in the unlabeled pool we set H_l(p, s) = 0 for all labels l. Otherwise, if q does appear in the unlabeled pool, we ask the annotator for its class labels. We use multilabel-style feedback where we allow a phrase q to belong to multiple classes simultaneously, and set H_l(p, s) = 1 for all such positive labels l, and 0 for the rest of the labels. Algorithm 1 describes the sampling strategy.
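A minimal sketch of this filling step is shown below, assuming an annotate() callback that plays the role of the annotator query in Algorithm 1; the data structures and names are ours.

```python
# Sketch (our own) of filling the per-label Hankel matrices with 0/1 feedback.
import numpy as np

def fill_label_hankels(prefixes, suffixes, observed_phrases, labels, annotate):
    """
    prefixes, suffixes: the max-volume basis (lists of token tuples).
    observed_phrases: set of phrases (token tuples) occurring in the unlabeled pool.
    annotate(phrase) -> set of labels the phrase can belong to (the annotator query).
    Returns {label: b x b Hankel matrix with 0/1 entries}.
    """
    b = len(prefixes)
    H = {l: np.zeros((b, b)) for l in labels}
    for i, p in enumerate(prefixes):
        for j, s in enumerate(suffixes):
            q = tuple(p) + tuple(s)
            if q not in observed_phrases:
                continue              # phrase never observed: H_l(p, s) stays 0 for all l
            for l in annotate(q):     # multilabel feedback: 1 for each positive class
                H[l][i, j] = 1.0
    return H
```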
We would like to note that once we have selected an informative basis for the sequence classification task at hand, other annotation feedback strategies might be used to fill the necessary Hankel statistics, for example by generating phrases. In this work we picked the simplest strategy from which we obtained good performance; further work will explore other strategies.

Experiments
We evaluate the spectral sampling method on two sentence classification tasks. We compare our sampling strategy to recent active learning methods for fine-tuning BERT-based sentence classifiers (Yuan et al., 2020), which also seek diversity in the sample.

Algorithm 1: Phrase Sampling via Max-Volume Optimization
Data: Unlabeled data pool U, basis budget b, a set L of k target labels
1. Compute the Hankel matrix H_U, whose rows and columns are indexed by prefixes and suffixes.
2. Find a maximum-volume sub-block of H_U and the corresponding basis (P, S), where |P| = |S| = b.
3. Construct the set of queries Q by listing all phrases p · s obtained by concatenating a prefix p ∈ P with a suffix s ∈ S, such that p · s is observed in U.
4. For every phrase q ∈ Q, ask the annotator to provide feedback in the form of an indicator vector z ∈ {0, 1}^k, where z_l denotes that q can be a phrase of sentences of class l ∈ L.
Result: A set of labeled phrases {(q, z) : q ∈ Q, z ∈ {0, 1}^k}

Data. We use two common datasets for sentence classification: the IMDB dataset of movie reviews (Maas et al., 2011), where the goal is to predict if a movie review is positive or negative; and the AG News dataset (Zhang et al., 2015), a news topic classification task.

Evaluation. We report performance on the test partition. As an evaluation metric we use the F1 score, the harmonic mean of precision and recall. We report the F1 performance of the model as a function of the total number of tokens annotated, defined as Σ_{q ∈ Q} |q|, where Q is the set of annotated samples. Since our sampling method is controlled by a budget on the basis size, we run the method for increasing budgets and measure the number of tokens of each batch of samples.
Linear RNN Classifiers. We trained one linear RNN for each class, modeling the distribution of sequences of that class. To train them, we use the spectral method of moments, and set the number of hidden states to 10 for each of them. We could, in principle, exploit models with larger state spaces, and if large quantities of data were available we would indeed have observed a performance improvement by exploiting larger state spaces. However, we decided to use a small state space since our main goal is to be able to train models with small annotation budgets, and under this setting simpler models can be learned more robustly. Given an input sequence x, the linear RNN classifiers provide scores for each ngram (i.e. substring) of x and each class. To make predictions, we use a simple ensemble technique similar to (Mesnil et al., 2014): we consider all the ngrams of x up to length 4 and compute an aggregate prediction score for each label l ∈ L.

Simulated Annotation. Our sampling method produces a batch of phrases (i.e. subsequences of unlabeled examples) for annotation. While doing evaluations with human annotators would be ideal, it is also very costly. Instead we follow the standard evaluation of active learning methods, which uses the unlabeled pool together with the true labels to simulate the feedback that could be provided by a human annotator. While this is by no means perfect, it is a natural low-cost approximation to the human evaluation experiment. More precisely, for a given phrase q to be annotated we look at the unlabeled data pool and retrieve the sentences in which q appears. Then we take all the labels of such sentences and set them as positive labels for q, forming an indicator vector.
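Returning to the ngram-ensemble prediction described above, the sketch below aggregates per-ngram class scores by summation, which is one plausible reading of that aggregation; the exact form of the original scoring equation (e.g. any normalization) is not reproduced, and score_fn is an assumed interface to the class-specific linear RNN scores.

```python
# Sketch (our own) of the ngram-ensemble classification rule; summing scores over
# ngrams up to length 4 is our assumption about the aggregation, not the paper's exact formula.
def classify(x, labels, score_fn, max_len=4):
    """Return the label whose aggregate score over all ngrams of x (up to max_len) is highest."""
    def aggregate(label):
        total = 0.0
        for n in range(1, max_len + 1):
            for i in range(len(x) - n + 1):
                total += score_fn(label, tuple(x[i:i + n]))
        return total
    return max(labels, key=aggregate)
```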

Comparison to Max-Volume Oracles
We first test the max-volume sampling using oracle configurations that have access to fully labeled data. The oracle max-volume is as follows. Since we have fully labeled data, we can consider class-specific Hankel matrices for each label. Thus, for each label we will compute the max-volume sub-block. We call this the class-oracle setting. Then, based on the discussion in Section 3.3, we consider two variants depending on how we fill the selected sub-blocks with target values. In class-oracle expectations the values are the expected counts of the corresponding phrases in the unlabeled pool. In class-oracle occurrences we only consider the sparsity pattern, setting 1 if the expected count is non-zero, and 0 otherwise. Figure 2 shows F1 performance in terms of the number of annotated tokens. We clearly see that using occurrences behaves very similarly to using actual expectations. This confirms our hypothesis, and enables training our models from simple binary phrase occurrence feedback. The same figure also shows the curve for our proposed sampling method, which estimates the max-volume sub-block using unlabeled sequences. We can see it follows the same trend as the oracles. This confirms that using the underlying domain distribution to inform about sub-blocks of maximum information is indeed an effective working hypothesis.

Comparisons for Fixed Annotation Budgets
We now compare our strategy for sampling under budget constraints with two baselines. The first baseline samples complete examples at random, and the second samples random phrases of length less than 10.
We also compare to three active learning methods analyzed by Yuan et al. (2020) that also look for diversity in the queried samples: BERT-KM generates samples based on a k-means clustering of BERT embeddings of sentences, while BADGE and ALPS actively refine the BERT embeddings to the target task after getting labels for each batch of samples. The idea behind BADGE (Ash et al., 2020) is to use gradient representations of the sentences in the unlabeled pool, since gradients are indicators of changes in the model and are therefore useful to promote diversity. The ALPS method is a variant that uses the masked language modeling loss of BERT to promote gradient diversity for sampling purposes. In all, these methods represent recent BERT-based approaches for cold-start sampling. Our comparison follows the same setting as Yuan et al. (2020). Figure 3 shows the comparison. The main observation is that max-volume sampling is much more efficient than the two baselines. Compared to the BERT-based samplers, we also see that max-volume sampling is more efficient for low budget settings, even though after some iterations ALPS, BADGE and BERT-KM eventually outperform our method. One possible reason is that these methods exploit information that is not in the max-volume sub-block. Table 1 shows in more detail the performance of max-volume sampling in terms of the size of the basis and the total number of tokens to be labeled.

Conclusions
Sequence distributions that can be modeled with latent state models have low-rank signatures. That is, the whole distribution can be learned from statistics over a small number of key phrases. The main contribution of our work is to show how we can leverage that property to design efficient sampling strategies for sequence classification under annotation budget constraints.
The idea is quite simple: while for a given category we cannot know a priori (that is without labeled sequences) its low-rank signature (and key phrases), we can try to estimate the signature from unlabeled domain data. Using that approximation we can design an efficient way of selecting phrases to label. Our experiments show that with this strategy we can obtain reasonable sequence classification models under small budget constraints. To the best of our knowledge our proposal is the first sampling strategy to implicitly exploit low-rank embeddings of domain phrases.
Once a low-rank Hankel signature has been found we could imagine several different annotation strategies for estimating the relevant statistics. This work is just a first step where we consider one of the simplest of such strategies. However, future work should explore the space of annotation strategies taking into account what feedback would result in the best estimation, and what is easiest for the human annotator. We see this work as opening the door for future research on interactive machine learning for sequence modeling where the annotation feedback strategy is designed to exploit the structural properties of the domain.
Our sampling strategy contrasts with recent work in active learning, which exploits BERT-based embeddings. We empirically observed that the performance of our approach is better for very low annotation budgets, but eventually the neural approaches improve and gradually achieve state-of-the-art results. Thus, one natural question for future research is whether our sampling strategy can be coupled with more expressive neural classifiers. A second related question is how to use the spectral models trained with max-volume sampling to warm-start neural approaches.