Explanation Regeneration via Information Bottleneck

Explaining the black-box predictions of NLP models naturally and accurately is an important open problem in natural language generation. Free-text explanations are expected to contain sufficient, carefully selected evidence that forms a supportive argument for the prediction. Thanks to the superior generative capacity of large pretrained language models, recent work built on prompt engineering enables explanation generation without task-specific training. However, explanations generated through single-pass prompting often lack sufficiency and conciseness. To address this problem, we develop EIB, an information bottleneck method that produces refined explanations that are sufficient and concise. Our approach regenerates the free-text explanation by polishing the single-pass output of the pretrained language model while retaining the information that supports the content being explained. Experiments on two out-of-domain tasks verify the effectiveness of EIB through automatic evaluation and thoroughly conducted human evaluation.


Introduction
Natural language explanations have attracted a lot of attention as a way to uncover the rationales behind black-box predictions. Thanks to the power of large pretrained language models (PLM) (Brown et al., 2020; Zhang et al., 2022), prompting methods proposed in recent studies achieve impressive results in generating free-text explanations (Wei et al.; Lampinen et al., 2022). A clear advantage of such methods is that they involve no additional training on task-specific datasets.
In this paper, we regard a free-text explanation as a description of the relationship between an input context and a hypothesis, e.g., a question and an answer.

[Figure 1: Although the PLM generates an informative explanation hypothesis (x, e.g., for the prompt "Why did you predict 'interested students' for question 'what helps someone be a good teacher'?": "interested students help someone be a good teacher because interested students are the ones who will be learning. I am an old teacher, and I will be motivated by students who actively take lessons. Interested students give teachers confidence, support ..."), this explanation contains redundant or inessential information which may interfere with the holistic understanding of the relationship between question and answer. In comparison, the polished explanation (x′: "Because interested students are the ones who will be learning, giving teachers motivation to prepare more lessons and helping them become good teachers."), improved upon the initial hypothesis, is more concise and reasonable.]

Although it is difficult to state that one
explanation is superior to all others due to the different desiderata of the tasks being explained, this does not prevent us from answering the question "what makes a good explanation" from a practical view. Previous research (Yu et al., 2019; Miller, 2019) points out that constructed explanations should satisfy several semantic constraints: (i) avoid undesirable content, such as repeating the context's statement, (ii) ensure adequate background support, and (iii) emphasize selective evidence. Current machine-generated explanations still exhibit defects with respect to these constraints (Kassner and Schütze, 2020; Welleck et al., 2022). Single-pass prompting methods cast the entire burden of satisfying these constraints on a PLM that "starts from scratch". This inspires us to investigate how to discard the dross and retain the essence of current PLMs' outputs.
We propose an explanation generation approach via the information bottleneck theory (Tishby et al., 2000), named EIB, which can refine explanations prompted from a PLM into more meaningful, sufficient, and concise ones. It works in two phases, as illustrated in Figure 1. First, given an NLP task sample (e.g., a QA pair), EIB uses a large PLM to produce an initial explanation hypothesis x by framing the task sample into a prompt input. Second, a refiner improves the quality of the explanation hypothesis along the axes of the aforementioned characteristics (i.e., meaningfulness, sufficiency, and conciseness). The refiner is trained following the information bottleneck principle. Concretely, it learns a minimal sufficient bottleneck representation of the explanation x, while being maximally explainable about the sample (i.e., the QA pair) by introducing an information loss (Ethayarajh et al., 2022). With the learned bottleneck representation in hand, a generator learns to produce a new explanation. We propose a simple and general procedure for training the refiner by pairing synthetic explanation hypotheses with gold references from existing datasets. EIB is a general explanation generation framework and can be applied to different NLP tasks with no task-specific supervision.
We demonstrate the effectiveness of EIB in generating explanations on two popular NLP tasks: commonsense question answering and natural language inference. Experiments show that EIB significantly improves the explanation candidates prompted from a PLM, making them more concise while retaining the information useful for explaining task samples. Automatic evaluation and carefully designed human evaluation demonstrate the performance of EIB. Furthermore, an analysis of the evaluations reveals a pressing demand for better metrics to judge explanations more reliably. We publicly release our code and data.

Method
Prompting Recently, writing explanations by prompting large PLMs has become a competitive approach. Given an NLP task sample z consisting of an input z_i and an output z_o, we can infer its explanation x by prompting a PLM: x = PLM(f(z_i, z_o)), where the function f(·, ·) transforms z into prompt format through predefined templates. For example, given a QA sample with question z_i: Can elephants be put in the fridge? and answer z_o: no, the prompt will be "The question is can elephants be put in the fridge? The answer is no because.".
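To make this prompting step concrete, here is a minimal sketch using a small HuggingFace causal LM as a stand-in for the much larger OPT-13B used in our experiments; the template function and decoding settings are illustrative.

```python
# Minimal prompting sketch. GPT-2 stands in for OPT-13B; qa_to_prompt is an
# illustrative template f(z_i, z_o).
from transformers import AutoModelForCausalLM, AutoTokenizer

def qa_to_prompt(question: str, answer: str) -> str:
    return f"The question is {question} The answer is {answer} because"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = qa_to_prompt("can elephants be put in the fridge?", "no")
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
# The continuation after the prompt is the initial explanation hypothesis x.
explanation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(explanation)
```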
Although prompting has achieved remarkable success, machine-generated explanations still have room for improvement, as discussed in the introduction. We therefore seek to go a step further, exploring an effective way to improve explanation quality in terms of meaningfulness, sufficiency, and conciseness.
Formulation Suppose we have a sample z and its explanation hypothesis x. We aim to refine x into a better x′ which can: (1) reduce irrelevant information in x (conciseness), and (2) preserve and supplement useful information for inferring z (meaningfulness, sufficiency). We divide the explanation regeneration task into two problems: refinement and generation.
First, we model the refinement problem from an information-theoretic view, i.e., we learn an internal representation t of the initial explanation x, defined by a stochastic mapping p_θ(t | x), such that t is maximally compressive about the (noisy) x while being maximally expressive about z:

    max_θ I(t, z)  s.t.  I(x, t) ≤ I_c,    (1)

This process can be formulated with the information bottleneck principle (IB) (Tishby and Zaslavsky; Alemi et al., 2017). IB defines the characteristics of an optimal representation in terms of the fundamental tradeoff between having a concise representation and one with good predictive power, which is equivalent to minimizing the following objective function:

    L_IB = β I(x, t) − I(t, z),    (2)

where β is a Lagrange multiplier. A large β corresponds to high compression, and hence low mutual information between t and z. Given a bottleneck representation t, our second goal is to generate a free-text explanation x′ based on t. Therefore, we pack a log-likelihood objective for language modeling together with L_IB as the objective function of the whole model, and train it on an automatically constructed synthetic dataset:

    L = L_IB − E[ log p_ψ(x′ | t, z, x) ].    (3)

The overall proposed EIB is illustrated in Figure 2.
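To make the objective concrete, the following schematic sketch shows how the three terms of Eq. 3 would be combined during training; the individual terms are assumed to be computed by the modules described in the following subsections, and the function name and signature are illustrative.

```python
import torch

def eib_training_loss(kl_xt: torch.Tensor,         # variational upper bound on I(x, t)
                      nll_z_given_t: torch.Tensor,  # -log p(z | t), bounding -I(t, z)
                      nll_xprime: torch.Tensor,     # -log p(x' | t, z, x), generator LM loss
                      beta: float = 1e-4) -> torch.Tensor:
    # L_IB = beta * I(x, t) - I(t, z)  (Eq. 2); adding the generator
    # negative log-likelihood gives the full objective (Eq. 3).
    return beta * kl_xt + nll_z_given_t + nll_xprime
```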
In the following, we present the optimization and training with respect to (i) explanation compression, which distills a bottleneck representation from the initial explanation, (ii) information preservation, which keeps the distilled bottleneck representation expressive about the explained sample, and (iii) explanation regeneration from the distilled bottleneck representation, which produces a better explanation than the initial one.

Explanation Compression
Vectorization Suppose we have an explanation candidate x that needs to be improved. We first use a parameter-fixed L-layer PLM to encode x and aggregate the hidden states of the L layers into a sequence of vectors X ∈ R^{|x|×d}, where each d-dimensional vector x_i is a weighted sum, via attention weights, of the hidden representations of position i across layers. We utilize the representations of all layers instead of only the last layer in order to combine more information.
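A minimal sketch of this vectorization step follows, assuming a softmax-normalized learnable weight per layer; the paper only states that the aggregation is an attention-weighted sum over layers, so this particular parameterization is an assumption.

```python
# Encode x with a frozen PLM and aggregate all layers' hidden states.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LayerAggregator(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, all_hidden):  # tuple of (B, S, H), one entry per layer
        stacked = torch.stack(all_hidden, dim=0)          # (L, B, S, H)
        w = torch.softmax(self.layer_logits, dim=0)       # (L,)
        return (w[:, None, None, None] * stacked).sum(0)  # (B, S, H)

tok = AutoTokenizer.from_pretrained("gpt2")
enc = AutoModel.from_pretrained("gpt2")
for p in enc.parameters():
    p.requires_grad = False  # parameter-fixed encoder

ids = tok("an explanation candidate", return_tensors="pt")
out = enc(**ids, output_hidden_states=True)
agg = LayerAggregator(len(out.hidden_states))
X = agg(out.hidden_states)  # the sequence of vectors X
```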
Compression Our first goal is to denoise irrelevant information in X and obtain a highly compact representation T. The compression part of L_IB can be upper-bounded as:

    I(x, t) ≤ E_x [ KL( p_θ(t | x) ‖ q(t) ) ],

where q(t_i) is the prior distribution of the bottleneck vector t_i and p_θ(t_i | x_i) is the stochastic mapping from the distribution of the initial explanation hypothesis. Making the bound as tight as possible given θ yields a compressed representation T distilled from the initial X.
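A sketch of one standard way to realize this bound is shown below, assuming a Gaussian posterior with a standard-normal prior and mean pooling over X; both are assumptions for illustration, as the paper does not spell out the parameterization.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, hidden: int = 768, num_vectors: int = 12):
        super().__init__()
        self.num_vectors, self.hidden = num_vectors, hidden
        self.to_mu = nn.Linear(hidden, num_vectors * hidden)
        self.to_logvar = nn.Linear(hidden, num_vectors * hidden)

    def forward(self, X):                      # X: (B, S, H)
        pooled = X.mean(dim=1)                 # crude pooling, an assumption
        mu = self.to_mu(pooled).view(-1, self.num_vectors, self.hidden)
        logvar = self.to_logvar(pooled).view(-1, self.num_vectors, self.hidden)
        # Reparameterized sample t ~ p_theta(t | x)
        t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # KL( N(mu, sigma^2) || N(0, I) ), summed over bottleneck vectors
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=(1, 2))
        return t, kl.mean()  # kl.mean() is the compression term in Eq. 2
```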

Information Preservation
The second goal of IB in Eq. 2 is to maximize I(t, z), which encourages a high log-likelihood p_φ(z | t), ensuring that T does not lose the predictive features of X needed to explain Z:

    I(t, z) ≥ E_{p(t,z)} [ log p_φ(z | t) ] + H(z).

However, p_φ(z | t) is hard to estimate because we would have to iterate over all possible z. Furthermore, the length of z is not fixed and cannot be precisely aligned to the number of bottleneck vectors T.
Optimization We extend recent work in information theory (Xu et al., 2020; Ethayarajh et al., 2022), which generalizes Shannon's information theory to quantify the predictive V-information between two random variables, subject to a family of computational constraints V. V-information reflects the ease with which V can predict z given t.
In this paper, we use V_ψ to denote the computational constraints, i.e., an autoregressive model GPT-2 (Radford et al., 2019). Measuring I(t, z) then amounts to quantifying the usable information under V_ψ. I(t, z) can be approximated by the difference between an unconditional entropy H_V(z) and a conditional entropy H_V(z | t) with respect to the computation-bounded parameters:

    I(t, z) ≈ H_{V_φ}(z) − H_{V_ψ}(z | t),

where φ and ψ are optimizable parameters and t acts as a learnable prefix (Li and Liang, 2021) to a GPT-2.
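A sketch of how the conditional term H(z | t) can be estimated with the bottleneck vectors fed to GPT-2 as a prefix is given below; the helper name and masking scheme are illustrative, and the unconditional term H(z) would be estimated the same way without the prefix.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def nll_of_z_given_prefix(t, z_text):
    # t: (1, K, H) bottleneck vectors used as a learnable prefix.
    ids = tok(z_text, return_tensors="pt").input_ids            # (1, S)
    z_emb = lm.transformer.wte(ids)                             # (1, S, H)
    inputs_embeds = torch.cat([t, z_emb], dim=1)                # prefix + z
    # Score only the z positions; prefix positions (-100) are excluded
    # from the loss, so the mean NLL estimates H(z | t).
    labels = torch.cat(
        [torch.full(t.shape[:2], -100, dtype=torch.long), ids], dim=1)
    return lm(inputs_embeds=inputs_embeds, labels=labels).loss
```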
Optimizing this lower bound on I(t, z) requires T to have enough capacity to support predicting z, while remaining compact because I(x, t) is minimized at the same time.

Explanation Regeneration
With the distilled bottleneck representation T in hand, the remaining task is to translate this compact representation into a new explanation x′ that may differ from the initial explanation x while achieving clear quality improvements.
Translating the high-dimensional matrix T into a discrete and readable explanation is not an easy task. To tackle this challenge, we use explanation datasets from various NLP tasks and build a training corpus by pairing each human-written explanation with a synthetic imperfect version of it, which allows us to train EIB on the explanation regeneration task. Finally, to generate a new explanation autoregressively, a generator (GPT-2) is optimized with a language modeling loss: log p_ψ(x′ | t, z, x), where t serves as a learnable prefix input.
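For completeness, here is a sketch of greedy decoding from such a prefix, reusing the lm and tok objects and the inputs_embeds mechanism from the sketch above; the helper is illustrative and assumes batch size 1.

```python
import torch

def generate_from_prefix(lm, tok, prefix_embeds, max_new_tokens=40):
    # prefix_embeds: (1, K + S, H) = bottleneck prefix T plus embedded z and x.
    generated = []
    embeds = prefix_embeds
    for _ in range(max_new_tokens):
        logits = lm(inputs_embeds=embeds).logits[:, -1]  # next-token logits
        next_id = logits.argmax(-1)                      # greedy step
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        next_emb = lm.transformer.wte(next_id)[:, None]  # (1, 1, H)
        embeds = torch.cat([embeds, next_emb], dim=1)
    return tok.decode(generated)
```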

Sample z: There are two statements and select which one is true. <s> Sentence 1 is people get dry while taking a shower. Sentence 2 is people get wet while taking a shower. Output: Sentence 2 is true.
Synthetic x: It is also said that the high level of chlorine in the water will make people wet while taking a shower or a bath. (sentence-level replacement, span-level infilling)
Target x′: Water make people wet while taking a shower.
Source: Sen-Making (Wang et al., 2019)

Table 1: An example of the constructed MIXEXPL dataset. The explanation hypothesis x is synthesized by two operations applied to the target explanation x′.

Training Dataset Construction.
Now we detail the automatic construction of the training dataset for optimizing EIB. Analyzing the explanations generated by state-of-the-art models (Zhang et al., 2022; Brown et al., 2020), we find that, compared to humans, machines still struggle to generate informative explanations with adequate rationales in fewer words, especially when prompts are long and complex.
Specifically, for each gold explanation x′ from six tasks, we randomly choose 2, 3, or 4 of the five operation types and apply them to the ground truth x′ to obtain x, guided by the explanation properties we expect the model to learn. For informativeness, we use token- and sentence-level repetition. For sufficiency, we use token- and sentence-level replacement, negation, and shuffle. For conciseness, we use span- and sentence-level infilling. The five operations are listed below; a toy sketch of two of them follows the list.
• Repetition: Redundant text should be avoided in explanations. Given a good explanation, we either repeat an n-gram (n = 1, 2, 3, 4) within a random sentence or randomly select a sentence to repeat.
• Replacement: Irrelevant token spans or sentences cause explanations to wrongly describe the expected rationales. We replace a random 15% of the keywords in a randomly chosen explanation sentence with their antonyms, or randomly replace an explanation sentence with one sampled from the other gold explanations.
• Negation: Negation words are crucial for explaining accurately without conflicting with the task sample in context. We perform negation alteration by adding or removing negation words for randomly selected verbs of the explanation, using the rules defined by Guan and Huang (2020).
• Shuffle: Temporal and causal relationships play a crucial role in explaining clearly and logically. We randomly reorder the sentences of an explanation to create logical issues.
• Infilling: Selecting crucial evidence relevant to the task at hand facilitates concise explanations. We augment the gold explanation with relevant but inessential content by retrieving similar sentences from other explanations using Contriever (Izacard et al., 2021) or by expanding an explanation sentence with GLM (Du et al., 2022).
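As referenced above, here is a toy sketch of two of these operations (repetition and shuffle); the sampling choices and helpers are illustrative rather than our exact construction pipeline.

```python
import random

def repeat_ngram(sentence: str, n: int) -> str:
    # Duplicate a random n-gram in place to simulate redundancy.
    toks = sentence.split()
    if len(toks) < n:
        return sentence
    i = random.randrange(len(toks) - n + 1)
    return " ".join(toks[: i + n] + toks[i : i + n] + toks[i + n :])

def shuffle_sentences(explanation: str) -> str:
    # Reorder sentences to create temporal/logical issues.
    sents = [s for s in explanation.split(". ") if s]
    random.shuffle(sents)
    return ". ".join(sents)

gold = "Water makes people wet. A shower sprays water."
noisy = shuffle_sentences(repeat_ngram(gold, n=2))
```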
Finally, we build a training corpus MIXEXPL of tuples (task sample, synthetic explanation, gold explanation), and train EIB on MIXEXPL. Table 1 displays an instance of the MIXEXPL corpus.
During inference, given an NLP sample (which can come from any NLP task, even one not included in MIXEXPL) and a prompt suffix such as because, we first use a PLM to generate an initial explanation hypothesis x. Then we use the trained EIB framework to produce a new explanation oriented towards sufficiency and conciseness. The prompting formats and examples are illustrated in Appendix C.1, Table 12.

Experiment Setup
Our experiments are organized into three sets: we first evaluate the quality of explanations generated by EIB on different tasks and compare against various baselines that lack explicit refinement towards sufficiency and conciseness (§3.2). We then analyze the performance improvement brought by the information bottleneck with training on the synthetic dataset MIXEXPL (§3.4). Lastly, we qualitatively assess the current state of explanation generation and the challenges for evaluation (§3.5).
Human Evaluation Metrics Human evaluation has very high priority for open-ended text generation (Zhang et al., 2020; Goyal et al., 2022; Li et al., 2022), and the explanation generation task is no exception. From the free-text language aspect, we evaluate (i) Grammaticality and (ii) Factuality. From the open-ended explanation aspect, we measure (iii) New Information, i.e., whether the explanation is informative beyond the task sample, (iv) Sufficiency, and (v) Conciseness.

Automatic Evaluation Metrics We report reference-based metrics, including BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and BERTScore (Zhang et al., 2020), and the diversity metric Distinct-n (Li et al., 2016). Besides, we measure the proportion of distinct tokens in an explanation that do not occur in the given task sample (Novelty). We report the average length (AVGLEN) of explanations to provide a hint of conciseness.
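A sketch of the two n-gram-based measures, as we read their descriptions above (whitespace tokenization is an assumption):

```python
def distinct_n(text: str, n: int) -> float:
    # Distinct-n: ratio of unique n-grams to total n-grams in the text.
    toks = text.split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def novelty(explanation: str, sample: str) -> float:
    # Novelty: fraction of explanation tokens absent from the task sample.
    exp_toks, sample_toks = explanation.split(), set(sample.split())
    new = [t for t in exp_toks if t not in sample_toks]
    return len(new) / max(len(exp_toks), 1)
```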
Datasets We consider evaluating EIB in a universal setting and use two NLP tasks excluded from the training corpus MIXEXPL (§2.4) to analyze the generalization abilities of EIB. (i) ECQA (Aggarwal et al., 2021) for commonsense question answering: we formulate QA pairs into prompts to steer a large PLM, i.e., OPT-13B (Zhang et al., 2022), and generate initial explanation candidates as input to EIB. (ii) e-SNLI (Camburu et al., 2018) for natural language inference, where the premise, hypothesis, and inference label are packed into the prompt input. Dataset statistics are shown in Table 2.
Baselines We compare EIB with the following baselines: (i) SUPERVISED, a GPT-2 Small fine-tuned on the target domain (i.e., ECQA and e-SNLI). (ii) PROMPTING, the prompt-based zero-shot learning framework with a PLM (OPT-13B). (iii) PROMPTING-Filter, which pairs PROMPTING with an acceptability filter trained on human binary judgments to determine which of eight explanation candidates from the PLM is plausible (Wiegreffe et al., 2022). (iv) BOTTLESUM, a reference-free summarization method (West et al., 2019) that uses the information bottleneck to extract highlight spans from a given paragraph (here, the initial explanation candidates generated by the PLM).

Training Details The backbone language models used in EIB are initialized from GPT-2 Small (Radford et al., 2019) with default parameters. During training, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5e-5. We train for 20 epochs with early stopping, using mini-batches of size 32. For each explanation candidate, we average over 5 i.i.d. samples from the compression distribution t to reduce the variance of the stochastic gradient, where the compression weight β is set to 1e-4 (Equation 2). Each bottleneck vector t_i has dimension 768, with a fixed length of 12 vectors.
Explanations are generated by greedy decoding using the HuggingFace library (Wolf et al., 2019).

EIB vs. Baselines

Overall Results Table 3 shows the results. We observe that EIB significantly outperforms PROMPTING and PROMPTING-Filter on the two test tasks, and this superiority is consistent across different explanation attributes, especially for the factuality, sufficiency, and conciseness metrics (p < 0.05, sign test).
Explanations polished by EIB are more concise and sufficient while maintaining good information coverage and quality, achieving over 44% improvement on explanation refinement on the ECQA dataset, with a similar gain in the e-SNLI setting. The disparity in Grammar between the PROMPTING/PROMPTING-Filter methods and EIB is negligible; the slight deviations observed may be attributed to the comparatively concise predictions generated by EIB, which leave less room for errors. EIB also substantially improves explanation quality over the edit-based method BOTTLESUM on both tasks, while being more fluent and grammatical, and far more efficient: EIB (0.69 s/sample) infers much faster than BOTTLESUM (55.01 s/sample).
Notably, although EIB never sees any test-domain data during training, it achieves performance comparable to SUPERVISED on explanation generation, thanks to the knowledge retrieved from the gigantic PLM and the further refinement optimization towards sufficient and concise explanations. We also evaluate pairwise comparisons between the PLM and EIB on explanation generation and investigate the effectiveness of EIB on larger language models (i.e., GPT-3 175B). See Appendices B.1 and B.2 for more details.
Notably, the α values indicate that the level of agreement among annotators is not particularly high, a finding consistent with that of Wiegreffe et al. (2022), likely due to the subjective nature of the task. Further information on evaluation quality control can be found in Appendix A.

Fine-grained Explanation Quality
We further analyze EIB's capacity to satisfy the semantic requirements of free-text explanations under three explanation-level evaluation features: new information, sufficiency, and conciseness. Figure 3 reports results on the ECQA dataset.
Sufficiency Among all explanations labeled as sufficient, EIB achieves a better trade-off between sufficiency and conciseness, likely because of its optimization towards explanation refinement and polishing, pruning irrelevant information while retaining sample-relevant evidence. For explanations labeled as "introducing new information" (middle figure), EIB significantly outperforms the prompting-based method with larger proportions of concise and factual explanations. This indicates that EIB improves the quality of newly-introduced information, packaging it in concise and convincing statements.

Conciseness
We examine the main reasons why explanations are identified as "redundant": Bad denotes copying the preceding context or repeating itself, and Middle denotes containing off-topic content. Compared to PROMPTING, these redundancy issues are largely alleviated by EIB, with the proportion of distinct tokens in explanations rising from 72.16% to 85.24%.

Comparison on Automatic Metrics
Overall Results For comprehensive comparisons, we also investigate the performance of different methods on various automatic metrics. Results are shown in Table 4. SUPERVISED performs best among all methods. Our conjecture is that there are spurious correlations in the test task datasets (Kavumba et al., 2022); e.g., for e-SNLI, gold explanations tend to use "... a paraphrase of ..." to explain samples with "entailment" labels. Among the unsupervised methods, EIB improves generation quality on most metrics over the edit-based method (BOTTLESUM) and the prompting methods. EIB improves on vector-based metrics (BERTScore) and n-gram-based metrics (Distinct and Novelty) within a shorter length, leading to more sufficient and concise explanations.

Table 5: Ablation study on the effectiveness of the information preservation objective and the information bottleneck principle on the ECQA dataset. We report BERTScore, BLEU-4, Distinct-2, Novelty-2, and average length.
Premise: The festivities of the latin celebration has brought many visitors and performers to the city.
Hypothesis: The city is completely devoid of people.
Label: Contradiction
Human: If the festivities brought many visitors and performers, it cannot be devoid of people.
SUPERVISED: The Latin celebration is not entirely devoid of people.
BOTTLESUM: People. The inference is that the city is full of people. The.
PROMPTING: There are people. The inference is that the city is full of people.
+EIB: There are people. The implication is that the city is full of people.
PROMPTING-Filter: Because the city is completely devoid of people. Now, let's look at the second example. Premise is the festivities of the latin celebration.
+EIB: Premise is the celebrations of the latin celebration. People gather at the city's main square.

Effectiveness of Refinement
The information bottleneck principle and the information preservation objective (§2.2) play key roles in refining imperfect explanation candidates into sufficient and concise ones, as shown in Table 5. Removing the information preservation objective causes an obvious decrease in reference-based metrics such as BERTScore, demonstrating that the proposed objective is beneficial for producing correct and concise explanations without losing on-topic information. To ablate the effect of IB as a whole, we train a baseline on MIXEXPL without the IB loss of Equation 2 (w/o refinement); the resulting drop indicates that IB is very useful for generating sufficient and concise explanations. A similar trend on the e-SNLI dataset is included in Appendix B.3, Table 10.

Qualitative Analysis and Discussion
Cases Table 6 displays an example of explanation generation for an NLI sample. The explanation generated by EIB is compelling: a more sufficient and concise version of the initial explanation candidate from prompting. Specifically, EIB corrects the explanation generated by PROMPTING-Filter, which initially contradicted the context, making it factual and sufficient.
Challenges The evaluation quality has a huge impact on designing explanation generation methods.
We aim to answer: are existing automatic metrics well-suited to evaluating zero-shot explanations? Figure 4 shows the agreement between automatic and human metrics on the ECQA task. On the language-level metric (grammar), both BLEU and BERTScore are strongly consistent with human votes. However, for the explanation-level metrics (sufficiency and conciseness), there is obvious disagreement between automatic and human metrics. The situation is worse for the simple n-gram-matching BLEU: a noticeable percentage of explanations with low BLEU scores nevertheless receive positive human judgments. For BERTScore, the issue is alleviated but still present.
Our finding is consistent with recent work (Goyal et al., 2022). Conventional evaluation difficulties in open-ended text generation also apply to the explanation domain. Evaluating explanation generation, especially in unsupervised settings, will require a new framework distinct from conventional automatic metrics.

Related Work
Textual explanations in free-text form are more expressive and generally more readable (Rajani et al., 2019). Recent methods for free-text explanation generation can be divided into two types: supervised learning on labeled datasets (Inoue et al., 2021; Zhou et al., 2021; Fernandes et al., 2022) and unsupervised learning with large-scale pretrained language models (PLM) (Latcinnik and Berant, 2020; Wiegreffe et al., 2022; Menick et al., 2022; Zelikman et al., 2022; Chowdhery et al., 2022). The success of zero-shot models (Zhang et al., 2022; Brown et al., 2020) drives research towards reference-free approaches and saves annotation costs. A common strategy for encouraging a PLM to produce explanations is to directly describe the input sample as context to the PLM, which offers no guarantee of supportive and well-organized explanations in a single pass (Camburu et al., 2020; Tan, 2021; Jung et al., 2022; Ye and Durrett, 2022). By contrast, EIB learns to distill task-relevant information from the initial explanations of the PLM and regenerates sufficient and concise explanations with distant supervision from an automatically constructed dataset.
The information bottleneck (IB) provides an information-theoretic perspective for explaining the performance of neural networks (Tishby et al., 2000). IB measures the mutual information between random variables and is powerful especially for unsupervised learning (Oord et al., 2018); it has been adapted to various NLP downstream applications (West et al., 2019; Paranjape et al., 2020; Li and Liang, 2021; Ju et al., 2021; Sclar et al., 2022), balancing the tradeoff between discarding task-irrelevant information and meeting the task objective. We are interested in refining unqualified explanation candidates into sufficient and concise ones with the guidance of the explained tasks by managing the two IB objectives. To the best of our knowledge, we are the first to apply the information bottleneck principle to generating explanations that adhere to explanatory criteria.

Conclusion
Natural language explanations have attracted a lot of attention because free-text explanations are more expressive and generally more readable. However, the quality of machine-generated explanations still faces challenges, e.g., inadequate evidence or redundant expressions, even with large PLMs. In this work, we propose to produce sufficient and concise explanations via the information bottleneck theory (IB), where explanations are regenerated by refining the single-pass outputs from a PLM while keeping the information that supports the explained samples, under a tradeoff between the IB objectives. We automatically construct pseudo-parallel data for training EIB to autoregressively generate new explanations. Experiments on two tasks show that EIB is effective for generating sufficient and concise explanations. Moreover, our extensive analysis shows that automatic evaluation of free-text explanations remains extremely difficult, and more persuasive evaluation frameworks are needed to complement conventional automatic metrics.

Limitations
Extension to Varied Task Formats. In this work, we limit our experiments to generating free-text explanations given a complete task sample. In future work, we aim to extend our method to more diverse settings, e.g., controllable explanation generation or the joint generation of task prediction and explanation. Besides, more work is needed to assess EIB's robustness and generalization when applying it to diverse NLP domains, which may differ in sample type, topic, or even preferred explanation attributes.
More Lightweight Learning Paradigm. The performance of EIB is tied to the quality of other systems and datasets, mainly the backbone language models and the automatically constructed training corpus MIXEXPL. The predictions of our method are also restricted by the capacity of EIB's generator, for which we use the GPT-2 Small architecture. This may be remedied by designing specific interactions with larger PLMs (e.g., in-context learning) and other sources for explanation-related knowledge distillation (e.g., logical composition), for example, designing more effective prompts to induce better explanation-related knowledge from the PLM and relieve the training pressure.
Diverse Combination with PLMs. While our paper focuses on explanation generation given zero-shot prompting outputs, we believe EIB extends easily to few-shot prompting baselines, since single-pass generation without parameter updates is also characteristic of conventional few-shot settings. Currently, EIB still requires parameter optimization. Future work can explore more flexible plug-and-play methods to distill sufficient and concise explanations on top of large PLMs.
Evaluation Quality and Consistency. Quality estimation for natural language explanation generation depends largely on human evaluation due to its open-ended characteristics. Current automatic evaluation metrics are neither convincing nor reliable compared to human evaluation. However, reproducing human evaluation results across different works can be difficult. This suggests that better automatic evaluation metrics are desperately needed for free-text explanation generation. We leave improving evaluation quality to future work.
A.1 Human Evaluation Demonstration

Question: What happens when snow on a mountain becomes heavy? Answer: avalanches. Explanation (to be evaluated): Avalanches are natural events that occur when snow slides down a mountain slope. They can happen anywhere on a mountain slope. Avalanches are distinct from slush flows and serac collapses. They are also different from large scale movements of ice.

• Grammar: Is the explanation fluent to read, without any grammar errors?
• Factuality: Is the explanation consistent with commonsense knowledge, without conflicting with the explained sample or with itself?
• New Information: Does the explanation provide new information not stated in the task sample? Merely restating the task sample's information is not enough. If provided, the newly-introduced information should be compatible with the "why question" between the input and output of the task sample. Explanations are supposed to provide enough evidence to describe the relationship between sample input and output.
• Sufficiency: Is the explanation adequate as evidence for answering "why this [output] is assigned to this [sample input]"?
• Conciseness: Does the explanation avoid redundancies or irrelevant information (i.e., hallucination or nonsense) about the task sample? Explanations should give the selective and comprehensive reason over all possibilities, not enumerate the complete set.

Figure 5: Demonstration of the head-by-head human evaluation pipeline. Given a task sample (e.g., QA) and an explanation candidate to be evaluated, annotators evaluate the candidate along 5 aspects. Two options exist for the Grammar and New Information metrics, while three-point scales are used for the other metrics.

A.2 Crowd-sourcing Instruction Details
Head-by-head Evaluation of Table 3 We show annotators the task sample (input and output) and different explanations (six from models and one human-written ground truth) and ask them to score each explanation along five evaluation attributes. We instruct annotators to pretend the sample output is correct even if they disagree with it, and to judge the explanation based on the given output. For each choice of each evaluated criterion, we detail the corresponding definition to help annotators detect errors in explanations. The human annotation process is illustrated in Figure 5. In practice, the annotation tasks were conducted online using shared Google files.
Head-to-head Evaluation of Table 7 We present annotators with the task sample and instruct them to select which of two explanations better explains it. We ask them to ignore minor grammar and spelling mistakes, such as improper capitalization.

A.3 Quality Control
We hire native English speakers from North America as annotators to guarantee a high level of English proficiency. Annotators were pre-screened through a pilot qualification study: we showed them the annotation requirements together with three examples annotated by us (the authors) and required them to evaluate five representative samples. On average, annotators took approximately five minutes to complete and quickly check a single instance. We pay $2 for every instance (6 explanations from models and 1 human-written ground truth).
We individually reviewed the submitted annotations of the qualification study and provided annotators with feedback to correct any misconceptions or confusion about the task. Annotators who performed well on the qualification study and demonstrated a comprehensive understanding of the task and annotation guidelines were permitted to participate in the main round of human evaluation. In the end, 3 annotators participated in the human evaluation.
Every few batches, we check the evaluation quality and the time taken per annotator, to prevent any annotator from completing the tasks unreasonably quickly and introducing inadvertent annotation errors. We maintained continuous communication with annotators throughout the human evaluation process to address queries and clarify intended behavior. To track quality throughout the evaluation, we compute inter-annotator agreement using Krippendorff's α and hire new annotators to re-annotate if disagreement is high (α < 0.3).
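A sketch of this agreement check, assuming the third-party krippendorff package (pip install krippendorff); the example ratings are illustrative.

```python
# Agreement check sketch using the third-party `krippendorff` package.
# Rows are annotators, columns are rated items; np.nan marks missing ratings.
import numpy as np
import krippendorff

ratings = [
    [1, 2, 3, 3, 2],
    [1, 2, 3, 2, 2],
    [np.nan, 2, 3, 3, 1],
]
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
if alpha < 0.3:
    print("Low agreement: re-annotate this batch with new annotators.")
```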
Figures 6-8 show the annotation guidelines we provide to crowd annotators. We ask them to read these guidelines before starting the qualification test, and to contact us promptly if they have any questions during annotation.

B.1 Head-to-head Human Evaluations
We investigate whether the explanations regenerated by EIB better support the explained task samples than the initial explanation candidates overall. We perform a head-to-head comparison of generations from prompting a PLM (OPT-13B (Zhang et al., 2022)) vs. regenerations from EIB. We present three annotators with a task sample (input and output) and two explanations for the sample, and ask them to make a preferential selection by answering "which explanation better explains the task sample?". Annotators choose one option from three alternatives: the explanations are equivalent, explanation 1 is superior, or explanation 2 is superior.
Results are shown in Table 7. We find that, for both tasks, generations refined towards sufficiency and conciseness outperform the single-pass generations obtained by prompting the PLM. These results provide evidence that explanation refinement and regeneration are necessary for effectively explaining given samples, because the special attributes of explanations differ from those of general language sentences.
B.2 EIB vs. Few-shot GPT-3

Furthermore, we investigate the effectiveness of EIB on larger PLMs. We use the predicted explanations of GPT-3 Davinci (175B) reported by Wiegreffe et al. (2022), where each prompt consists of 8-24 randomly selected human-written examples. Annotators assess 100 samples of the ECQA dataset. The human evaluation results are shown in Table 8. Larger-scale GPT-3 (175B) performs much better than the smaller OPT (13B) in producing meaningful and qualified explanations. EIB refines the initial explanations generated by GPT-3 and can further improve their quality. EIB is also much smaller than GPT-3: it improves explanation quality with training FLOPs (46.420G) and model parameters (38.645M) that are orders of magnitude lower. We display an example in Table 9 for illustration: EIB keeps the important content of the initial explanation from GPT-3, abandons parallel sentences learned from the few-shot context, and adds further support to form a sufficient explanation.

B.3 Ablation Study
Results in Table 10 show that the full model significantly improves explanation quality across different aspects, demonstrating the benefits of the information bottleneck for explanation regeneration. Besides, our proposed information preservation loss ensures the usability of the bottleneck representation, with an obvious improvement on the reference-based metrics, e.g., BERTScore rises from 84.47 (w/o info preservation) to 85.86 (EIB).

B.4 Performance on MIXEXPL
We also evaluate EIB on the test split of MIXEXPL and on the five training tasks included in MIXEXPL to verify the effectiveness of training and the generalization of the designed framework. Results are shown in Table 11. The strong results on these test sets indicate that EIB is well-trained on the MIXEXPL corpus.

C Qualitative Examples
C.1 Prompting Format to PLM

At inference time, the explanation candidates fed to EIB are prompted from a large-scale pretrained language model (PLM). The prompting formats for the test tasks (ECQA and e-SNLI) are illustrated in Table 12. We use OPT-13B as the PLM. The explanation candidates are generated by greedy decoding and top-p sampling (p = 0.9). For each example, we display one explanation candidate from greedy decoding and three candidates from top-p sampling.

C.2 Additional Cases
More examples generated by PLM and EIB for the ECQA and e-SNLI tasks are shown in Table 13.

Figure 3: Comparison between PROMPTING and EIB under different explanation-level criteria. EIB significantly outperforms the single-pass prompting method, producing meaningful explanations while remaining reliable and concise.

Figure 4: The distribution of human evaluation scores across different ranges of the automatic metrics BLEU and BERTScore. Colour spans along the x-axis represent human votes, ranging from 1 (worst) to 3 (best).

Figure 5 answer options. Factuality: factually false or conflicting with context/itself; unsure; factually true. New Information: none introduced beyond what is already present in the task sample; introduced. Sufficiency: explaining by copying the task sample; wrongly explaining; sufficiently describing the evidence. Conciseness: redundancy (pure copy or repetition); containing unnecessary information; concise. Grammar: is the explanation fluent to read without any grammar errors?

Figure 2: Illustration of our method. Given a task sample z and an explanation candidate x which may be noisy, (i) a refiner first compresses x into bottleneck vectors T via a tunable stochastic mapping; (ii) an information objective optimizes the compression, ensuring T is predictive of z; (iii) a generator produces a sufficient and concise explanation based on the bottleneck representation T, z, and x. The right side shows an example of EIB.

Table 2: Statistics of training and inference datasets.

Table 3: Human evaluation of explanation quality on two out-of-domain tasks, along with Krippendorff's α. PROMPTING-EIB and PROMPTING-Filter-EIB use the initial explanation candidates produced by PROMPTING and PROMPTING-Filter, respectively, as model inputs. The blue-grey chunk denotes the observed improvements of *-EIB compared with the corresponding large-scale pretrained language model *. †/‡: results significantly outperform those of the corresponding pretrained language model * (sign test with p-value < 0.05/0.01).

Table 4: Automatic evaluation of explanations generated by different models on the complete test splits of the two datasets. Except for the AVGLEN metric, metric values are displayed as percentages. Results where the EIB model outperforms its base PLM are in grey-blue. †, ‡ represent significant improvement over the results of the corresponding pretrained language model * with p-value < 0.05/0.01, respectively (sign test).

Table 6: Example from the e-SNLI dataset. Inherited information from the explanations of PLMs is colored in blue. Newly-added semantics are denoted in orange. See Table 13, Appendix C.2 for additional examples.

Table 8: Human evaluation of explanation quality for OPT (13B), GPT-3 (175B), and EIB on the ECQA task.

Few-shot Prompting: Let's explain classification decisions.\n\nquestion: Where can someone view a county highway as a line?\ncountry, from the sky, michigan, map, or cross counties by car? map\nwhy? The only context in which an entire county road could be represented meaningfully as a single line would be on a map, e.g., a map of roads in the county, or a GPS application on a smartphone.\n... (we omit the middle examples for brevity)
GPT-3: Playing baseball is a lot like any other sport, there is always a risk of injury. Errors are not a risk in baseball. Happiness is not a risk in baseball. Sore muscles are not a risk in baseball. Fun is not a risk in baseball.
+EIB: Playing baseball is a lot like any other sport, there is always a risk. The risk of injury is a risk in baseball. Sore muscles are a risk in baseball.
Reference: Sports is always a risk.

Table 9: Case study. GPT-3's prediction is provided by Wiegreffe et al. (2022). Inherited information from the explanations of GPT-3 is colored in blue. Newly-added semantics are denoted in orange.

Table 10: Ablation study comparing the effectiveness of the information preservation objective and the information bottleneck principle on the ECQA and e-SNLI datasets.

Table 11: The performance of EIB on the test set of MIXEXPL, as well as on the individual test sets of the five constituent tasks. Besides CIDEr and AVGLEN, metrics are formatted as percentage values.