SPECTRA: Sparse Structured Text Rationalization

Selective rationalization aims to produce decisions along with rationales (e.g., text highlights or word alignments between two sentences). Commonly, rationales are modeled as stochastic binary masks, requiring sampling-based gradient estimators, which complicates training and requires careful hyperparameter tuning. Sparse attention mechanisms are a deterministic alternative, but they lack a way to regularize the rationale extraction (e.g., to control the sparsity of a text highlight or the number of alignments). In this paper, we present a unified framework for deterministic extraction of structured explanations via constrained inference on a factor graph, forming a differentiable layer. Our approach greatly eases training and rationale regularization, generally outperforming previous work in terms of predictive performance and plausibility of the extracted rationales. We further provide a comparative study of stochastic and deterministic methods for rationale extraction for classification and natural language inference tasks, jointly assessing their predictive power, quality of the explanations, and model variability.


Introduction
Selective rationalization (Lei et al., 2016; Bastings et al., 2019; Swanson et al., 2020) is a powerful explainability method, in which we construct models (rationalizers) that produce an explanation or rationale (e.g., text highlights or alignments; Zaidan et al., 2007) along with the decision.
A major drawback of rationalizers, perhaps the main one, is that it is difficult to train the generator and the predictor jointly under instance-level supervision (Jain et al., 2020). Hard attention mechanisms that stochastically sample rationales employ regularization to encourage sparsity and contiguity, and make it necessary to estimate gradients using the score function estimator (SFE), also known as REINFORCE (Williams, 1992), or reparameterized gradients (Kingma and Welling, 2014; Jang et al., 2017). Both of these factors substantially complicate training, requiring sophisticated hyperparameter tuning, and lead to brittle and fragile models that exhibit high variance over multiple runs. Other works use strategies such as top-k to map token-level scores to rationales, but also require gradient estimation to train both modules jointly (Paranjape et al., 2020; Chang et al., 2020). In turn, sparse attention mechanisms (Treviso and Martins, 2020) are deterministic and have exact gradients, but lack a direct way to control sparsity and contiguity in the rationale extraction. This raises the question: how can we build an easy-to-train, fully differentiable rationalizer that allows for flexible constrained rationale extraction?
To answer this question, we introduce sparse structured text rationalization (SPECTRA), which employs LP-SparseMAP (Niculae and Martins, 2020), a constrained structured prediction algorithm, to provide a deterministic, flexible and modular rationale extraction process. We exploit our method's inherent flexibility to extract highlights and interpretable text matchings with a diverse set of constraints.
Our contributions are: • We present a unified framework for deterministic extraction of structured rationales ( §3) such as constrained highlights and matchings; • We show how to add constraints on the rationale extraction, and experiment with several structured and hard constraint factors, exhibiting the modularity of our strategy; • We conduct a rigorous comparison between deterministic and stochastic rationalizers ( §4) for both highlights and matchings extraction.
Experiments on selective rationalization for sentiment classification and natural language inference (NLI) tasks show that our proposed approach achieves better or competitive performance and similarity with human rationales, while exhibiting less variability and easing rationale regularization when compared to previous approaches.

Table 1: Positioning of our approach in the literature of rationalization for highlights extraction. Our method is an easy-to-train fully differentiable deterministic rationalizer that allows for flexible rationale regularization.

Background

Rationalization for Highlights Extraction
Rationalization models for highlights extraction, also known as select-predict or explain-predict models (Jacovi and Goldberg, 2021; Zhang et al., 2021b), are based on a cooperative framework between a rationale generator and a predictor: the generator component encodes the input text and extracts a "rationale" (e.g., a subset of highlighted words), and the predictor classifies the input conditioned only on the extracted rationale. Typically, this is done by obfuscating the words that are not in the rationale with a binary mask.
Highlights Extraction. We consider a standard text classification or regression setup, in which we are given an input sequence x ∈ R^{D×L}, where D is the embedding size and L is the sequence length (number of words), and we want to predict its corresponding label y ∈ R for regression or y ∈ {1, . . . , C} for classification. A generator model, gen, encodes the input text x into token-level scores. Then, a rationale z, e.g., a binary mask over the tokens, is extracted based on these scores. Subsequently, the predictor model makes a prediction conditioned only on the rationale, ŷ = pred(z ⊙ x), where ⊙ denotes the Hadamard (elementwise) product.
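The select-predict pipeline above can be sketched in a few lines of NumPy. The linear generator, the thresholding mask, and the pooling predictor below are toy stand-ins chosen for brevity, not the architecture used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 4, 6                      # embedding size, sequence length (toy values)

x = rng.normal(size=(D, L))      # input word embeddings
w_gen = rng.normal(size=D)       # toy linear generator

s = w_gen @ x                    # token-level scores, shape (L,)
z = (s > 0).astype(float)        # toy binary mask (stand-in for the extractor)

x_masked = x * z                 # Hadamard masking: zero out non-rationale tokens
y_hat = np.tanh(x_masked.sum())  # toy predictor conditioned only on the rationale

# every token outside the rationale contributes nothing to the prediction
assert np.allclose(x_masked[:, z == 0], 0.0)
```

The point of the sketch is the information flow: the predictor only ever sees `x_masked`, so the mask `z` is, by construction, the explanation of the prediction.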
End-to-end Training and Testing Procedure. While most rationalization methods deterministically select the rationale at test time, there are differences in how these models are trained. For instance, Lei et al. (2016) and Bastings et al. (2019) use stochastic binary variables (Bernoulli and HardKuma, respectively) and sample the rationale z ∼ gen(x) ∈ {0, 1}^L, whereas Treviso and Martins (2020) make a continuous relaxation of these binary variables and define the rationale as a sparse probability distribution over the tokens, z = sparsemax(gen(x)) or z = α-entmax(gen(x)). In the latter approach, instead of a binary vector, we have z ∈ △^{L−1}, where △^{L−1} := {p ∈ R^L : 1⊤p = 1, p ≥ 0} is the (L−1)-dimensional probability simplex. Words receiving non-zero probability are considered part of the rationale.

¹ Our library for rationalization is available at https://github.com/deep-spin/spectra-rationalization.
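For intuition, sparsemax has a simple closed form: a Euclidean projection onto the simplex that thresholds the scores, producing exact zeros. A minimal NumPy implementation, written here as an illustrative sketch rather than the library version used in the experiments:

```python
import numpy as np

def sparsemax(s):
    """Euclidean projection of a score vector s onto the probability simplex
    (Martins and Astudillo, 2016); unlike softmax, it can return exact zeros."""
    s = np.asarray(s, dtype=float)
    z = np.sort(s)[::-1]                      # scores in decreasing order
    k = np.arange(1, s.size + 1)
    cssv = np.cumsum(z) - 1.0
    support = z - cssv / k > 0                # tokens kept in the support
    tau = cssv[support][-1] / k[support][-1]  # threshold
    return np.maximum(s - tau, 0.0)

z = sparsemax([2.0, 1.5, -1.0, 0.1])
assert np.isclose(z.sum(), 1.0)   # a valid probability distribution...
assert (z == 0).any()             # ...with exact zeros outside the rationale
```

Tokens with non-zero probability in the output form the rationale support; the two lowest-scoring tokens above are zeroed out entirely.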
Rationalizers that use hard attention mechanisms or heuristics to extract the rationales are distinctively hard to train end-to-end, as they require marginalization over all possible rationales, which is intractable in practice. Thus, recourse to sampling-based gradient estimation is a necessity, either via REINFORCE-style training, which exhibits high variance (Lei et al., 2016; Chang et al., 2020), or via reparameterized gradients (Bastings et al., 2019; Paranjape et al., 2020). This renders training these models a complex and cumbersome task, and the resulting models are often brittle, owing to their high sensitivity to hyperparameter changes and to sampling variability. On the other hand, existing rationalizers that use sparse attention mechanisms (Treviso and Martins, 2020), such as sparsemax attention, while deterministic and end-to-end differentiable, do not have a direct handle to constrain the rationale in terms of sparsity and contiguity. In this paper, we endow them with these capabilities, as shown in Table 1, where we position our work in the literature on highlights extraction.
Constrained Rationale Extraction. Existing rationalizers are extractive: they select and extract words or word pairs to form the rationale. Since a rationalizer that extracts the whole input would be meaningless as an explainer, they must have a length constraint or a sparsity-inducing component. Moreover, rationale extraction is designed to encourage the selection of contiguous words, as there is some evidence that this improves readability (Jain et al., 2020). Some works introduce regularization terms on the binary mask, such as the ℓ1 norm and the fused-lasso penalty, to encourage sparse and compact rationales (Lei et al., 2016; Bastings et al., 2019). Others use hard constraints through heuristics: top-k, which yields sparse but not necessarily contiguous rationales, or selecting the chunk of text with a pre-specified length that has the highest total score over all possible spans of that length (Chang et al., 2020; Paranjape et al., 2020; Jain et al., 2020). Sparse attention mechanisms can also be used to extract rationales, but since the rationales are constrained to lie in the simplex, controlling the number of selected tokens while simultaneously promoting contiguity is non-trivial.
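The two hard-constraint heuristics mentioned above, top-k selection and the fixed-length contiguous span with the highest total score, can be sketched as follows (the helper names are our own, for illustration):

```python
import numpy as np

def topk_mask(s, k):
    """Sparse but possibly non-contiguous: keep the k highest-scoring tokens."""
    z = np.zeros_like(s)
    z[np.argsort(s)[-k:]] = 1.0
    return z

def best_span_mask(s, length):
    """Contiguous: the span of the given length with the highest total score."""
    totals = np.convolve(s, np.ones(length), mode="valid")  # sliding-window sums
    start = int(np.argmax(totals))
    z = np.zeros_like(s)
    z[start:start + length] = 1.0
    return z

s = np.array([0.1, 2.0, -0.5, 1.8, 1.9, 0.0])
print(topk_mask(s, 3))       # picks the three highest-scoring tokens
print(best_span_mask(s, 3))  # picks the best window of three adjacent tokens
```

Note how the two constraints can disagree: top-k grabs the isolated high-scoring token, while the span constraint sacrifices it for adjacency, which is exactly the sparsity/contiguity tension discussed above.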

Rationalization for Matchings Extraction
For this task, we consider a natural language inference setup in which classification is made based on two input sentences: a premise x^P ∈ R^{D×L_P} and a hypothesis x^H ∈ R^{D×L_H}, where L_P and L_H are the sequence lengths of the premise and hypothesis, respectively, and D is the embedding size. A generator model (gen) encodes x^P and x^H separately and then computes pairwise costs between the encoded representations to produce a score matrix S ∈ R^{L_P×L_H}. The score matrix S is then used to compute an alignment matrix Z ∈ R^{L_P×L_H}, where z_ij = 1 if the i-th premise word is aligned to the j-th word in the hypothesis. Z subsequently acts as a sparse mask to obtain text representations that are aggregated with the original encoded sequences and fed to a predictor to obtain the output predictions.

Structured Prediction on Factor Graphs
Finding the highest scored rationale under the constraints described above is a structured prediction problem, which involves searching over a very large and combinatorial space. We assume that a rationale z can be represented as an L-dimensional binary vector. For example, in highlights extraction, L is the number of words in the document and z is a binary mask selecting the relevant words; and in the extraction of matchings, L = L P × L H and z is a flattened binary vector whose entries indicate if a premise word is aligned to a word in the hypothesis. We let Z ⊆ {0, 1} L be the set of rationales that satisfy the given constraints, and let s = gen(x) ∈ R L be a vector of scores.
Factor Graph. In the sequel, we consider problems that consist of multiple interacting subproblems. Niculae and Martins (2020) present structured differentiable layers, which decompose a given problem into simpler subproblems, instantiated as local factors that must agree when overlapped. Formally, we assume a factor graph F, where each factor f ∈ F corresponds to a subset of variables. We denote by z_f = (z_i)_{i∈f} the vector of variables corresponding to factor f. Each factor has a local score function h_f(z_f). Examples are hard constraint factors, which take the form

h_f(z_f) = 0 if z_f ∈ Z_f, and −∞ otherwise,   (1)

where Z_f is a polyhedral set imposing hard constraints (see Table 2 for examples); and structured factors, which define more complex functions with structural dependencies on z_f, such as

h_f(z_f) = Σ_i r_{i,i+1} z_i z_{i+1},   (2)

where the r_{i,i+1} ∈ R are edge scores, which together define a sequential factor. We require that, for any factor, the following local subproblem is tractable:

ẑ_f = argmax_{z_f} ( s_f⊤ z_f + h_f(z_f) ).   (3)

MAP inference. The problem of identifying the highest-scoring global structure, known as maximum a posteriori (MAP) inference, is written as:

ẑ = argmax_{z ∈ Z} score(z; s),  with  score(z; s) = s⊤z + Σ_{f∈F} h_f(z_f).   (4)

The objective being maximized is the global score function score(z; s), which combines information coming from all factors. The solution of the MAP problem is a vector ẑ whose entries are zeros and ones. However, it is often difficult to obtain an exact maximization algorithm for complex structured problems that involve interacting subproblems imposing global agreement constraints.
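To make the global objective concrete, here is a tiny brute-force MAP computation with unary token scores plus a sequential pairwise factor that rewards adjacent selected tokens (a toy score function of our own choosing). Exhaustive search over all 2^L masks is only feasible for tiny L, which is precisely why the relaxations below are needed:

```python
import itertools
import numpy as np

s = np.array([1.0, -0.2, 0.8, -1.5])   # unary token scores from the generator
r = 0.5                                 # edge score rewarding adjacent selected pairs

def score(z):
    """Global score: unary contributions plus pairwise contiguity rewards."""
    unary = float(np.dot(s, z))
    pairwise = r * sum(z[i] * z[i + 1] for i in range(len(z) - 1))
    return unary + pairwise

# MAP inference by exhaustive search over all binary masks
z_map = max(itertools.product([0, 1], repeat=len(s)), key=score)
assert z_map == (1, 1, 1, 0)  # the pairwise reward pulls in the weak middle token
```

Without the pairwise factor the MAP would be the non-contiguous mask (1, 0, 1, 0); the contiguity reward makes it worthwhile to include the slightly negative second token.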
Gibbs distribution and sampling. The global score function can be used to define a Gibbs distribution p(z; s) ∝ exp(score(z; s)). The MAP in (4) is the mode of this distribution. Sometimes (e.g., in stochastic rationalizers) we want to sample from this distribution, ẑ ∼ p(z; s). Exact, unbiased samples are often intractable to obtain, and approximate sampling strategies have to be used, such as perturb-and-MAP (Papandreou and Yuille, 2011; Corro and Titov, 2019a,b). These strategies necessitate gradient estimators for end-to-end training, which are often obtained via REINFORCE (Williams, 1992) or reparameterized gradients (Kingma and Welling, 2014; Jang et al., 2017).
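As a sanity check of the perturb-and-MAP idea, consider the special case of independent binary variables: perturbing each state's score with Gumbel noise and taking the (here trivial) MAP yields exact Gibbs samples; with interacting factors the same recipe gives only approximate samples. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([1.2, -0.5, 0.0])          # unary scores; variables assumed independent

def perturb_and_map(s, rng):
    # One Gumbel perturbation per variable state, then per-variable MAP.
    g1 = rng.gumbel(size=s.shape)       # noise for state z_i = 1 (score s_i)
    g0 = rng.gumbel(size=s.shape)       # noise for state z_i = 0 (score 0)
    return (s + g1 > g0).astype(float)

samples = np.stack([perturb_and_map(s, rng) for _ in range(20000)])
freq = samples.mean(axis=0)             # empirical marginal of z_i = 1
sigmoid = 1.0 / (1.0 + np.exp(-s))      # exact Gibbs marginal in this case
assert np.allclose(freq, sigmoid, atol=0.02)
```

The empirical frequencies match the sigmoid of the scores, confirming that in this degenerate case perturb-and-MAP is an exact sampler; the hard part, and the source of approximation error, is the coupling introduced by structured factors.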
LP-MAP inference. In many cases, the MAP problem (4) is intractable due to the overlapping interaction of the factors f ∈ F. A commonly used relaxation is to replace the integer constraints z ∈ {0, 1}^L by continuous constraints z ∈ [0, 1]^L, leading to:

ẑ = argmax_{z ∈ [0,1]^L} score(z; s).   (5)

The problem above is known as LP-MAP inference (Wainwright and Jordan, 2008). In some cases (for example, when the factor graph F does not have cycles), LP-MAP inference is exact, i.e., it gives the same results as MAP inference. In general this does not happen, but for many problems in NLP, LP-MAP relaxations are often nearly optimal (Koo et al., 2010; Martins et al., 2015). Importantly, computing (5) in a hidden layer may render the network unsuitable for gradient-based training, as with MAP inference.
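The relaxation can be visualized with an off-the-shelf LP solver (SciPy here, purely for illustration). For a single cycle-free factor, such as an exactly-one (XOR) constraint, the LP optimum lands on an integral vertex and thus coincides with exact MAP:

```python
import numpy as np
from scipy.optimize import linprog

s = np.array([0.3, 1.1, -0.4])   # unary scores for three binary variables

# LP-MAP for a single XOR factor (exactly one active variable):
#   maximize s^T z   subject to   sum(z) = 1,  0 <= z <= 1.
# linprog minimizes, so we negate the scores.
res = linprog(c=-s, A_eq=np.ones((1, 3)), b_eq=[1.0], bounds=[(0, 1)] * 3)

# The relaxation is tight here: the solution is the integral MAP vertex.
assert np.allclose(res.x, [0.0, 1.0, 0.0], atol=1e-6)
```

With overlapping factors the feasible region gains fractional vertices and the LP optimum may no longer be integral, which is where the ℓ2-regularized variant discussed next becomes useful.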
LP-SparseMAP inference. The optimization problem associated with LP-SparseMAP is the ℓ2-regularized LP-MAP (Niculae and Martins, 2020):

ẑ = argmax_{z ∈ [0,1]^L} score(z; s) − (1/2)‖z‖².   (6)

Unlike MAP and LP-MAP, the LP-SparseMAP relaxation is suitable for training with gradient backpropagation. Moreover, it favors sparse vectors ẑ, i.e., vectors with only a few non-zero entries. One of the most appealing features of this method is its modularity: an arbitrarily complex factor graph can be instantiated as long as a MAP oracle is provided for each of the constituting factors. This approach generalizes SparseMAP (Niculae et al., 2018), which requires an exact MAP oracle for the factor graph in its entirety; indeed, LP-SparseMAP recovers SparseMAP when there is a single factor, F = {f}. By only requiring a MAP oracle for each f ∈ F, LP-SparseMAP makes it possible to instantiate more expressive factor graphs for which MAP is typically intractable. Table 2 lists the logic constraint factors used in this paper.
Table 2: Logic constraint factors used in this paper and the constraints they impose, e.g., XOR (exactly one active variable), AtMostOne (at most one active variable), and BUDGET (at most B active variables).

Deterministic Structured Rationalizers
The idea behind our approach for selective rationalization is very simple: leverage the inherent flexibility and modularity of LP-SparseMAP for constrained, deterministic and fully differentiable rationale extraction.

Highlights Extraction
Model Architecture. We use the model setting described in §2. First, a generator model produces token-level scores s_i, i ∈ {1, . . . , L}. We propose replacing the current rationale extraction mechanisms (e.g., sampling from a Bernoulli distribution, or using sparse attention mechanisms) with an LP-SparseMAP extraction layer that computes token-level values ẑ ∈ [0, 1]^L, which are then used to mask the original sequence for prediction. Due to LP-SparseMAP's propensity for sparsity, many entries in ẑ will be zero, approximating the behavior expected of a binary mask.
Factor Graphs. The definition of the factor graph F is central to the rationale extraction, as each of the local factors f ∈ F imposes constraints on the highlight. We start by instantiating a factor graph with L binary variables (one for each token) and a pairwise factor for every pair of contiguous tokens,

h_{i,i+1}(z_i, z_{i+1}) = r_{i,i+1} z_i z_{i+1},   (8)

which yields the binary pairwise MRF of §2.3. Instantiating this factor with non-negative edge scores, r_{i,i+1} ≥ 0, encourages contiguity in the rationale extraction. Making use of the modularity of the method, we impose sparsity by further adding a BUDGET factor (see Table 2):

Σ_{i=1}^{L} z_i ≤ B.   (9)

The size of the rationale is thus constrained to be, at most, B% of the input document size. Intuitively, the lower the B, the shorter the extracted rationales will be. Notice that this graph is composed of L local factors, so LP-SparseMAP would have to enforce agreement between all of them in order to compute z. Interestingly, factor graph representations are usually not unique. In our work, we instantiate an equivalent formulation of the factor graph in Eq. 9 that consists of a single factor, H:SeqBudget. This factor can be seen as an extension of the LP-Sequence model of Niculae and Martins (2020): a linear-chain Markov factor with MAP provided by the Viterbi algorithm (Viterbi, 1967; Rabiner, 1989). The difference resides in the additional budget constraint that is incorporated in the MAP decoding. This constraint can be handled by augmenting the number of states in the dynamic program to keep track of how many words of the budget have already been consumed at each time step, leading to time complexity O(LB).
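The budget-augmented Viterbi decoding can be sketched as a dynamic program whose state tracks how much of the budget has been consumed so far. The sketch below uses a single shared edge score r (a simplifying assumption of ours) and checks itself against brute-force enumeration:

```python
import itertools
import numpy as np

def map_seq_budget(s, r, B):
    """Highest-scoring binary mask with at most B ones, scoring s_i per
    selected token plus r per adjacent selected pair. Viterbi with an
    extra 'budget used' dimension: O(L*B) states."""
    L = len(s)
    NEG = -1e18
    # best[b][z] = best score over prefixes ending in state z with b tokens used
    best = [[NEG] * 2 for _ in range(B + 1)]
    best[0][0] = 0.0
    if B >= 1:
        best[1][1] = s[0]
    for i in range(1, L):
        new = [[NEG] * 2 for _ in range(B + 1)]
        for b in range(B + 1):
            # z_i = 0: budget unchanged, any previous state allowed
            new[b][0] = max(best[b][0], best[b][1])
            # z_i = 1: consume one budget unit; bonus r if previous token selected
            if b >= 1:
                new[b][1] = max(best[b - 1][0] + s[i],
                                best[b - 1][1] + s[i] + r)
        best = new
    return max(best[b][z] for b in range(B + 1) for z in range(2))

def brute_force(s, r, B):
    return max(np.dot(s, z) + r * sum(z[i] * z[i + 1] for i in range(len(z) - 1))
               for z in itertools.product([0, 1], repeat=len(s)) if sum(z) <= B)

rng = np.random.default_rng(1)
s = rng.normal(size=8)
assert np.isclose(map_seq_budget(s, 0.5, 3), brute_force(s, 0.5, 3))
```

This illustrates why the single-factor formulation is attractive: the budget and contiguity constraints are handled jointly inside one tractable MAP oracle, instead of being reconciled across L separate factors.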

Matchings Extraction
Model Architecture. Our architecture is inspired by ESIM (Chen et al., 2017). First, a generator model encodes the two documents x^P and x^H separately to obtain the encodings (h^P_1, . . . , h^P_{L_P}) and (h^H_1, . . . , h^H_{L_H}), respectively. Then, we compute pairwise dot-product alignment scores between the encoded representations to produce a score matrix S ∈ R^{L_P×L_H} such that s_ij = ⟨h^P_i, h^H_j⟩. We use LP-SparseMAP to obtain Z, a constrained structured symmetrical alignment in which z_ij ∈ [0, 1], as described later. Then, we "augment" each word in the premise and hypothesis with the corresponding aligned weighted average, computing h̃^P_i = Σ_j z_ij h^H_j and h̃^H_j = Σ_i z_ij h^P_i, and separately feed these vectors to another encoder and pool to obtain representations r^P and r^H. Finally, the feature vector r = [r^P, r^H, r^P − r^H, r^P ⊙ r^H] is fed to a classification head for the final prediction. We also experiment with a strategy in which we assume that the hypothesis is known and the premise is masked for faithful prediction. We set h^P_i = Σ_j z_ij h^H_j, such that the only information about the premise available to the model comes from the alignment and its masking of the encoded representation.
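A minimal NumPy sketch of the score matrix and the faithful masking step; the hard row-wise argmax alignment below is a toy stand-in for the LP-SparseMAP layer, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, LP, LH = 4, 5, 3
hP = rng.normal(size=(LP, D))            # encoded premise words
hH = rng.normal(size=(LH, D))            # encoded hypothesis words

S = hP @ hH.T                            # pairwise scores s_ij = <h^P_i, h^H_j>
assert S.shape == (LP, LH)

# Stand-in alignment: one active hypothesis word per premise word
Z = np.zeros((LP, LH))
Z[np.arange(LP), S.argmax(axis=1)] = 1.0

# Faithful variant: each premise word is rebuilt only from aligned
# hypothesis words, h^P_i = sum_j z_ij h^H_j
hP_tilde = Z @ hH
assert hP_tilde.shape == (LP, D)
```

In the faithful variant, `hP_tilde` replaces the premise encoding entirely, so the predictor can only access the premise through the extracted alignment, which is what makes the alignment a rationale rather than a side output.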
Factor Graphs. We instantiate three different factor graphs for matchings extraction. The first, M:XorAtMostOne, is the same as the LP-Matching factor used in Niculae and Martins (2020), with one XOR factor per row and one AtMostOne factor per column:

F = {XOR(z_{i,1}, . . . , z_{i,L_H}) : 1 ≤ i ≤ L_P} ∪ {AtMostOne(z_{1,j}, . . . , z_{L_P,j}) : 1 ≤ j ≤ L_H}.

This requires an active alignment for each word of the premise, since the i-th word in the premise must be connected to the hypothesis. The j-th word in the hypothesis, however, is not constrained to be aligned to any word in the premise. In the second factor graph, M:AtMostOne2, we relax the XOR restriction on the premise words to an AtMostOne restriction. The expected output is a sparser matching, since there is no requirement of an active alignment for each word of the premise. The third factor graph, M:Budget, allows us more refined control over the sparsity of the resulting matching, by adding an extra global BUDGET factor (with budget B) to the factor graph of M:AtMostOne2, so that the resulting matching has at most B active alignments.
Stochastic Matchings Extraction. Prior work for selective rationalization of text matching uses constrained variants of optimal transport to obtain the rationale (Swanson et al., 2020). Their model is end-to-end differentiable using the Sinkhorn algorithm (Cuturi, 2013a). Thus, in order to provide a comparative study of stochastic and deterministic methods for rationalization of text matchings, we implement a perturb-and-MAP rationalizer (§2.3). We perturb the scores s_ij by computing S̃ = S + P, in which each element of P contains random samples from the Gumbel distribution, p_ij ∼ G(0, 1). We utilize these perturbed scores to compute non-symmetrical alignments from the premise to the hypothesis and vice-versa, such that their entries are in [0, 1]. At test time, we obtain the most probable matchings, such that their entries are in {0, 1}. These matchings are such that every word in the premise is connected to a single word in the hypothesis and vice-versa.
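The perturbation step can be sketched as follows (toy shapes; the row-wise argmax plays the role of the premise-to-hypothesis MAP under a one-alignment-per-premise-word constraint):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))                  # pairwise alignment scores

P = rng.gumbel(size=S.shape)                 # p_ij ~ Gumbel(0, 1)
S_tilde = S + P                              # perturbed scores

# MAP under the perturbation, premise-to-hypothesis direction:
# each premise word aligns to its highest-scoring hypothesis word.
Z = np.zeros_like(S)
Z[np.arange(S.shape[0]), S_tilde.argmax(axis=1)] = 1.0
assert (Z.sum(axis=1) == 1).all()            # exactly one alignment per row
```

Because the noise enters additively before a deterministic decoding step, resampling `P` yields different hard matchings, which is what supplies the stochasticity for the comparative study.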

Highlights for Sentiment Classification
Data and Evaluation. We used the SST, AgNews, IMDB, and Hotels datasets for text classification and the BeerAdvocate dataset for regression. The statistics and details of all datasets can be found in §A. The specified rationale lengths, as a percentage of each document, for the strategies that impose fixed sparsity are 20% for the SST, AgNews, and IMDB datasets, 15% for the Hotels dataset, and 10% for the BeerAdvocate dataset. We evaluate end-task performance (Macro F1 for classification tasks and MSE for regression) and, for the datasets that contain human annotations, the match with those annotations through the token-level F1 score (DeYoung et al., 2019).
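Token-level F1 treats the predicted and human rationale masks as sets of selected tokens. A small reference implementation (our own sketch, not the DeYoung et al. code):

```python
import numpy as np

def token_f1(pred, gold):
    """Token-level F1 between a predicted and a human rationale mask."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    overlap = (pred * gold).sum()      # tokens selected by both
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()   # fraction of predicted tokens that are gold
    recall = overlap / gold.sum()      # fraction of gold tokens that are predicted
    return 2 * precision * recall / (precision + recall)

assert token_f1([1, 1, 0, 0], [0, 1, 1, 0]) == 0.5
```

Since the metric scores exact token overlap, methods tuned to sentence-level annotations (as in Hotels and BeerAdvocate) are rewarded for contiguous selections, which is relevant when interpreting Table 5.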
Baselines. We compare our results with four versions of the stochastic rationalizer of Lei et al. (2016): the original one, SFE, which uses the score function estimator to estimate the gradients; a second one, SFE w/ Baseline, which uses SFE with a moving-average baseline variance reduction technique; a third, Gumbel, in which we employ the Gumbel-Softmax reparameterization (Jang et al., 2017) to reparameterize the Bernoulli variables; and a fourth, HardKuma, in which we employ HardKuma variables (Bastings et al., 2019) instead of Bernoulli variables and use reparameterized gradients for end-to-end training. Moreover, the latter rationalizer employs a Lagrangian relaxation to solve the constrained optimization problem of targeting specific sparsity rates. We also experimented with two deterministic strategies that use sparse attention mechanisms: a first that utilizes sparsemax (Martins and Astudillo, 2016), and a second that utilizes fusedmax (Niculae and Blondel, 2019), which encourages the network to attend to contiguous segments of text by adding a total variation regularizer inspired by the fused lasso. Fusedmax is a natural deterministic counterpart of the constrained rationalizer proposed by Lei et al. (2016), since the regularization encourages both sparsity and contiguity; to the best of our knowledge, its use for this task is new. Similarly to Jain et al. (2020), we found that the stochastic rationalizer of Lei et al. (2016) and its variants (SFE, SFE w/ Baseline, and Gumbel) require cumbersome hyperparameter search and tend to degenerate in such a way that the generated rationales are either the whole input text or empty. Thus, at inference time, we follow the strategy proposed by Jain et al. (2020) and restrict the generated rationale to a specified length via two mappings: contiguous, in which we select the span of the specified length whose cumulative token-level score is the highest among all spans of that length; and top-k, in which the k tokens with the highest token-level scores are selected. Contrary to Jain et al. (2020), for the rationalizer of Bastings et al. (2019) (HardKuma), we carefully tuned both the model hyperparameters and the hyperparameters of the Lagrangian relaxation algorithm, so as to use the deterministic test-time policy that they propose. 2 All implementation details can be found in §C. We also report the full-text baselines for each dataset in §D.

Matchings for Natural Language Inference

Baselines. We compare our results with variants of constrained optimal transport for selective rationalization employed by Swanson et al. (2020): relaxed 1:1, which is similar in nature to our proposed M:AtMostOne2 factor, and exact k = 4, similar to our proposed M:Budget with budget B = 4. We also replicate the LP-Matching implementation of Niculae and Martins (2020), which consists of the original ESIM model described in §3.2 with Z as the output of the LP-SparseMAP problem with an M:XorAtMostOne factor. Importantly, both these models aggregate the encoded premise representation with the information that comes from the alignment. All implementation details can be found in §C. We also report the ESIM baselines in §D.

Extraction of Text Highlights
Predictive Performance. We report the predictive performances of all models in Table 3.

Table 4: Average size of the extracted rationales using the HardKuma stochastic rationalizer and deterministic sparse attention mechanisms. We report mean and min/max average size across five random seeds.

We
observe that the deterministic rationalizers that use sparse attention mechanisms generally outperform the stochastic rationalizers while exhibiting lower variability across different random seeds and different datasets. In general and as expected, for the stochastic models, the top-k strategy for rationale extraction outperforms the contiguous strategy. As reported in Jain et al. (2020), strategies that impose a contiguous mapping trade coherence for performance on the end-task. Our experiments also show that HardKuma is the stochastic rationalizer least prone to variability across different seeds, faring competitively with the deterministic methods. The strategy proposed in this paper, H:SeqBudget, fares competitively with the deterministic methods and generally outperforms the stochastic methods. Moreover, similarly to the other deterministic rationalizers, our method exhibits lower variability across different runs. We show examples of highlights extracted by SPECTRA in §G.

Quality of the Rationales
Rationale Regularization. We report in Table 4 the average size of the extracted rationales (proportion of words not zeroed out) across datasets for the stochastic HardKuma rationalizer and for each rationalizer that uses sparse attention mechanisms. The latter strategies do not have any mechanism to regularize the sparsity of the extracted rationales, which leads to variability in the rationale extraction. This is especially the case for the fusedmax strategy, as it pushes adjacent tokens to be given the same attention probability, which may lead to rationale degeneration when the attention weights are similar across all tokens. HardKuma, on the other hand, employs a Lagrangian relaxation algorithm to target a predefined sparsity level; we found that careful hyperparameter tuning is required across different datasets. While the average size of its extracted rationales generally does not exhibit considerable variability, some random seeds led to degeneration (the model extracts empty rationales). Remarkably, our proposed strategy utilizes the BUDGET factor to set a predefined desired rationale length, regularizing the rationale extraction while still applying a deterministic policy that exhibits low variability across different runs and datasets (Table 3).
Matching with Human Annotations. We report token-level F1 scores in Table 5 to evaluate the quality of the rationales on the datasets for which we have human annotations for the test set. We observe that our proposed strategy and HardKuma outperform all the other methods when it comes to matching the human annotations. This was to be expected considering the results shown in Tables 3 and 4: the variability of the stochastic methods across runs is also reflected in the token-level F1 scores; and although the rationalizers that use sparse attention mechanisms are competitive with our proposed strategy, the lack of regularization of the rationale extraction leads to variably sized rationales, which is also reflected in poorer matchings. We also observe that, when degeneration does not occur, HardKuma generally extracts high-quality rationales in terms of matching the human annotations. It is also worth remarking that the sparsemax and top-k strategies are not expected to fare well on this metric, because human annotations for these datasets are at the sentence level. Our strategy, however, not only pushes for sparser rationales but also encourages contiguity in the extraction.

Table 5: Evaluation of the rationales through matching with human annotations, for stochastic and deterministic methods. We report mean token-level F1 scores and min/max across five random seeds.

Extraction of Text Matchings
Predictive Performance. We report the predictive performances of all models in Table 6. Both the strategies that use the LP-SparseMAP extraction layer and our proposed stochastic matchings extractor outperform the OT variants for matchings extraction. We observe that, contrary to the text highlights experiments, the stochastic matchings extraction model does not exhibit noticeably higher variability compared to the deterministic models.
In general, the faithful models are competitive with the non-faithful models. Since the former are constrained to utilize only the information from the premise that comes through the alignments, these results demonstrate the effectiveness of the alignment extraction. As expected, there is a slight trade-off between how constrained the alignment is and the model's predictive performance, which is more noticeable for the more constrained factor graphs.

Related Work

Our work adds that comparison and contributes an easy-to-train fully differentiable rationalizer that allows for flexible constrained rationale extraction. Our strategy for rationalization, based on sparse structured prediction on factor graphs, constitutes a unified framework for deterministic extraction of different structured rationales.
Structured Prediction on Factor Graphs. Kim et al. (2017) incorporate structured models in attention mechanisms as a way to model rich structural dependencies, leading to a dense probability distribution over structures. Niculae et al. (2018) propose SparseMAP, which yields a sparse probability distribution over structures and can be computed using calls to a MAP oracle, making it applicable to problems (e.g., matchings) for which marginal inference is intractable but MAP is not. However, the requirement of an exact MAP oracle prohibits its application to more expressive structured models, such as loopy graphical models and logic constraints. This limitation is overcome by LP-SparseMAP (Niculae and Martins, 2020) via a local polytope relaxation, extending the previous method to sparse differentiable optimization in any factor graph with arbitrarily complex structure. Other tractable and efficient relaxations for matchings exist, such as the entropic regularization leading to Sinkhorn's algorithm (Cuturi, 2013b), and have been used for rationalization (Swanson et al., 2020); in our work, we use LP-SparseMAP for rationale extraction. Our approach for rationalization focuses on learning and explaining with latent structure extracted by structured prediction on factor graphs. (2013) propose models that jointly extract and compress sentences. Our work differs in that our setting is completely unsupervised and we need to differentiate through the extractive layers.

Conclusions
We have proposed SPECTRA, an easy-to-train fully differentiable rationalizer that allows for flexible constrained rationale extraction. We have provided a comparative study of stochastic and deterministic approaches to rationalization, showing that SPECTRA generally outperforms previous rationalizers on text classification and natural language inference tasks. Moreover, it does so while exhibiting less variability than stochastic methods and easing regularization of the rationale extraction when compared to previous deterministic approaches. Our approach constitutes a unified framework for deterministic extraction of different types of structured rationales. We hope that our work spurs future research on rationalization for different structured explanations.

A Datasets for Highlights Extraction
We used five datasets: four for text classification (SST, AgNews, IMDB, Hotels) (Socher et al., 2013; Del Corso et al., 2005; Maas et al., 2011; Wang et al., 2010) and one for regression (BeerAdvocate) (McAuley et al., 2012). The Hotels and BeerAdvocate datasets contain data instances for multiple aspects; in this work, we use the Hotels' location aspect and the BeerAdvocate's appearance aspect. These two datasets contain sentence-level rationale annotations for their test sets, and for them we use the splits of Bao et al. (2018). For all other datasets, we use the splits in Wolf et al. (2020). For IMDB and AgNews, we randomly selected 10% and 15% of examples from the training set, respectively, to be used as validation data. For the datasets without human annotations, we used the same sparsity level (20%) that Jain et al. (2020) use for AgNews and SST; for BeerAdvocate, we used the sparsity levels of Lei et al. (2016) and Yu et al. (2019); and for Hotels, we selected a sparsity level of 15% (human annotations average around a 10% sparsity level).
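For concreteness, the sparsity level of a highlight is simply the fraction of input tokens that it selects, and a target sparsity level translates into a token budget per example. A minimal sketch (the helper names are ours, purely illustrative):

```python
def sparsity_level(mask):
    """Fraction of tokens selected in a binary highlight mask."""
    return sum(mask) / len(mask)

def budget_from_sparsity(n_tokens, sparsity):
    """Token budget for one example at a given sparsity level.

    Rounds to the nearest integer and always allows at least one token,
    so that short inputs still receive a non-empty rationale.
    """
    return max(1, int(round(n_tokens * sparsity)))
```

For instance, a 30-token AgNews headline at the 20% sparsity level gives a budget of 6 highlighted tokens.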

B Datasets for Matchings Extraction
For natural language inference (NLI), we used SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). For MNLI, we split the MNLI matched validation set into equally sized validation and test sets. Table 9 shows statistics for each dataset and the alignment budget used for the M:Budget factor.
For SNLI, we set the budget B to 4 to compare with the OT approach (OT exact k = 4) of Swanson et al. (2020). For MNLI, we set B to 6, since the average premise length in MNLI is around 50% larger than that of SNLI. We also conduct experiments with the HANS dataset (McCoy et al., 2019), a controlled evaluation set designed to detect whether NLI systems exploit linguistic heuristics such as lexical overlap, subsequence, and constituent heuristics; a detailed description of each heuristic can be found in the original paper. The dataset also includes 30,000 HANS-like examples that can be used to augment existing NLI training sets such as SNLI or MNLI.
C Implementation Details
Training for all highlights-extraction methods except HardKuma is stopped if macro F1 (for classification) or MSE (for regression) does not improve for 5 epochs. For matchings extraction, training is stopped if macro F1 does not improve for 3 epochs. For HardKuma, we train until the maximum number of epochs, because the rationale length can vary considerably during training due to the Lagrangian relaxation algorithm employed at training time; we found that early stopping would often favour models that selected almost all of the input text. Unlike Jain et al. (2020), we carefully tuned both the model and the Lagrangian relaxation hyperparameters for this rationalizer. This had a large impact on performance, as HardKuma performed poorly with the top-k and contiguous strategies at inference time. Even though some careful tuning is required and degeneration can occur for some random seeds, HardKuma is still much less cumbersome to tune than the variants of the rationalizer of Lei et al. (2016). We hypothesize that this is mostly due to two factors: the Lagrangian relaxation algorithm controls the average rationale size, and reparameterized gradient estimates exhibit less variance than those obtained with the score function estimator.
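The patience-based criterion above can be sketched as a small helper. This is an illustrative implementation of the stopping rule we describe, not code from our training pipeline; the class and attribute names are ours.

```python
class EarlyStopping:
    """Stop training when the validation metric fails to improve
    for `patience` consecutive epochs.

    Use higher_is_better=True for macro F1 and
    higher_is_better=False for losses such as MSE.
    """

    def __init__(self, patience=5, higher_is_better=True):
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best = None
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's validation metric; return True to stop."""
        improved = (
            self.best is None
            or (value > self.best if self.higher_is_better else value < self.best)
        )
        if improved:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With patience 5 for highlights extraction and 3 for matchings extraction, this reproduces the criterion used for all methods except HardKuma.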
All models for highlights extraction have 1.8M trainable parameters. Models for faithful and non-faithful selective rationalization of text matchings have 1.7M and 1.8M trainable parameters, respectively.

C.2 SPECTRA Sparsity Regularization
During training, we apply a temperature term T in the sparsemax and fusedmax operators, set within {0.05, 0.1, 0.2}. The total variation regularization for fusedmax is set to 0.7.
For the models that use the LP-SparseMAP extraction layer, we likewise use a temperature term T set within {0.05, 0.1, 0.2} during training. Moreover, for H:SeqBudget, we set the transition scores within {0.001, 0.005} for all datasets. All hyperparameter searches were conducted manually.
The LP-SparseMAP problem can be interpreted as ℓ2-regularized LP-MAP, and its output corresponds to a probability distribution over a sparse set of structures. Therefore, LP-MAP can be seen as LP-SparseMAP with the scores divided by a temperature parameter that approaches zero. Applying this procedure at test time yields the LP-MAP solution, which is generally an outer relaxation of MAP (Martins et al., 2015). When inference in the factor graph is exact, the LP-MAP solutions are integer (i.e., LP-MAP yields the true MAP); when inference is not exact, solutions may be fractional. Thus, in this test-time setting, LP-SparseMAP solutions may be either a soft or a discrete selection of parts of the input. We used a temperature parameter of 10⁻³ at validation and test time.
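The effect of the temperature can be seen already in the single-factor case. Below is a minimal NumPy sketch of sparsemax (Martins and Astudillo, 2016) with a temperature term: at moderate temperatures the output is a sparse probability distribution, and as the temperature approaches zero it collapses to a one-hot (argmax/MAP) solution, mirroring the zero-limit behaviour described above. This illustrates the principle only; our actual extraction layer is LP-SparseMAP.

```python
import numpy as np

def sparsemax(z, temperature=1.0):
    """Sparsemax: Euclidean projection of z / temperature onto the
    probability simplex. Lower temperatures push the output toward
    a one-hot (argmax / MAP) solution."""
    z = np.asarray(z, dtype=float) / temperature
    z_sorted = np.sort(z)[::-1]          # scores in decreasing order
    cssv = np.cumsum(z_sorted)           # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv    # entries that stay nonzero
    k_max = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_max  # simplex-projection threshold
    return np.maximum(z - tau, 0.0)
```

For example, sparsemax([0.5, 0.0]) returns the sparse-but-soft distribution [0.75, 0.25], while the same scores at temperature 10⁻³ return the one-hot [1, 0].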

C.3 Computing Infrastructure
Our infrastructure consists of 2 machines with the specifications shown in Table 10. The machines were used interchangeably, and all experiments were executed on a single GPU. We did not observe significant differences in the execution time of our models across machines.

Table 11: Model predictive performance across datasets using full text. We report mean and min/max F1 scores across five random seeds on the test sets for all datasets except Beer, for which we report MSE.

The computational time of SPECTRA depends on several factors inherited from the use of LP-SparseMAP as the extractive method. Generally, the larger the number of local factors f ∈ F, the more costly it is to compute a solution, and it may be necessary to increase the number of iterations for LP-SparseMAP to converge to a solution on which all factors agree. We set this number to 10 at training time, following Niculae and Martins (2020); during inference, we set a maximum of 1000 iterations. For highlights extraction, H:SeqBudget consists of a single factor, so the solution is found within a single iteration. For matchings extraction, our factor graphs consist of multiple local factors imposing hard constraints that must agree in the final matching: M:XorAtMostOne and M:AtMostOne2 consist of L_P + L_H local factors, and M:Budget adds a global budget factor to the factor graph of M:AtMostOne2, yielding a more complex overall problem. Faster times would be achieved with smaller maximum numbers of iterations.

Table 16: Model predictive performance across datasets and different budget values for the SPECTRA method for matchings extraction. We report F1 scores on the test sets for all datasets. These results correspond to a single random seed.

Figure 1 shows examples of highlights extracted by the SPECTRA model on the AgNews and Beer datasets.
Interestingly, when compared to the human annotations on the Beer dataset, we notice that SPECTRA usually avoids highlighting stopwords. While the resulting explanations lose no relevant meaning relative to the human explanations, this slightly hurts the overlap with the human annotations.

Dollar Rises Vs Euro on Asset Flows Data NEW YORK (Reuters) -
The dollar extended gains against the euro on Monday after a report on flows into U.S. assets showed enough of a rise in foreign investments to offset the current account gap for the month.

Stocks Climb on Drop in Consumer Prices NEW YORK - Stocks rose for a second straight session Tuesday as a drop in consumer prices allowed investors to put aside worries about inflation, at least for the short term. With gasoline prices falling to eight-month lows, the Consumer Price Index registered a small drop in July, giving consumers a respite from soaring energy prices...

Highlights extracted with SPECTRA for AgNews

an amber pour with hints of pink and yellow . fluffy head , good lacing . smells of high citrus ( gf , lemon ) and some leafy flower plants . hops are in there somewhere . taste has the hops ; nice crispness and flavor medium body and great mouthfeel , leaves clean with enough taste residue to want more .

Highlights extracted with SPECTRA for Beer

hazy bright orange in color with a fluffy white head that quickly dissipates , leaving delicate lace . way too orange looking . aroma is very mild wheat and subtle spice completely dominated by artificial orange . smells like tang . flavor ditto . tastes like an artificially flavored witte . there 's no way the orange flavor is authentic . mouthfeel is actually nice and creamy with that good wheaty quality . too bad it tastes like an orange soda .

Hamilton
i had this on-tap at tank 's taproom in tampa , . appearance : a deep amber body with a just darker than white head , good lacing with ok retention . smell : very very pale malt aroma . taste : just like autumn . toasty malts with a solid hop presence . mouthfeel : very crisp and lager like . drinkability : good . don't drink and review .
Figure 1: Examples of extracted highlights (green-shaded input tokens) with SPECTRA for AgNews and Beer documents. For the Beer rationales, we show the human annotations in bold italic (mismatches with the human annotations are shaded in red).

H Matchings Extracted with SPECTRA
Synthetic Matchings. In Figure 2 we show the matchings extracted with the three different SPECTRA factors used in the paper for a synthetic score matrix. The M:XorAtMostOne factor constrains the alignment matrix Z ∈ ℝ^{L_P × L_H} so that each row i of Z satisfies ∑_{n=1}^{L_H} z_{in} = 1. For M:AtMostOne2, each row i satisfies ∑_{n=1}^{L_H} z_{in} ≤ 1. Finally, the more constrained M:Budget additionally requires ∑_{i=1}^{L_P} ∑_{n=1}^{L_H} z_{in} ≤ B, where B is the budget value. Examples extracted from HANS. We show in Figure 6 examples of matchings extracted with SPECTRA for the model trained on MNLI augmented with HANS-like examples (Augmented). For all these examples, the original MNLI model without augmentation (Vanilla) classified the examples as entailment, whereas the Augmented model correctly classified them as non-entailment. Interestingly, the obtained matchings highlight the use of the heuristics that HANS aims to target. However, the Augmented model is able to process the information from the matchings in such a way that it correctly classifies most non-entailment examples (see Table 7). Figure 6: Examples of matchings extracted with SPECTRA (Augmented) that highlight the three linguistic heuristics of HANS: lexical overlap, constituent, and subsequence heuristics. The premise is shown on the left and the hypothesis on the right.
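The three constraint sets can be illustrated with simple hard (MAP-like) selections on a score matrix. The sketch below is a greedy NumPy illustration of what each factor permits, not the LP-SparseMAP solver itself (which returns sparse probability distributions over such structures); the function names are ours.

```python
import numpy as np

def xor_at_most_one(scores):
    """M:XorAtMostOne-style selection: exactly one alignment per
    premise row (the row-wise argmax)."""
    Z = np.zeros_like(scores)
    Z[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1.0
    return Z

def at_most_one(scores):
    """M:AtMostOne2-style selection: at most one alignment per row;
    a row stays empty when its best score is not positive."""
    Z = xor_at_most_one(scores)
    Z[scores.max(axis=1) <= 0] = 0.0
    return Z

def budget(scores, B):
    """M:Budget-style selection: at most one alignment per row and at
    most B alignments overall, keeping the B highest-scoring rows."""
    Z = at_most_one(scores)
    if Z.sum() > B:
        row_best = scores.max(axis=1)
        keep = np.argsort(row_best)[::-1][:B]
        mask = np.zeros(scores.shape[0], dtype=bool)
        mask[keep] = True
        Z[~mask] = 0.0
    return Z
```

Moving from the first to the third function, each constraint set strictly shrinks the space of admissible alignment matrices, which is exactly the progression shown in Figure 2.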