Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations

Human-annotated labels and explanations are critical for training explainable NLP models. However, unlike human-annotated labels, whose quality is easier to calibrate (e.g., with a majority vote), human-crafted free-form explanations can be quite subjective. Before blindly using them as ground truth to train ML models, a vital question needs to be asked: How do we evaluate a human-annotated explanation's quality? In this paper, we build on the view that the quality of a human-annotated explanation can be measured based on its helpfulness (or impairment) to the ML models' performance on the NLP tasks for which the annotations were collected. In comparison to the commonly used Simulatability score, we define a new metric that takes into consideration the helpfulness of an explanation to model performance at both fine-tuning and inference. With the help of a unified dataset format, we evaluate the proposed metric on five datasets (e.g., e-SNLI) with two model architectures (T5 and BART), and the results show that our proposed metric can objectively evaluate the quality of human-annotated explanations, while Simulatability falls short.


Introduction
Despite the advances in recent large-scale language models (LLMs) (Devlin et al., 2019; Radford et al., 2019; Lewis et al., 2019; Raffel et al., 2020), which can exhibit close-to-human performance on many natural language processing (NLP) tasks (e.g., Question Answering (Rajpurkar et al., 2016; Kočiskỳ et al., 2018; Mou et al., 2021), Natural Language Inference (Bowman et al., 2015; Williams et al., 2017; Wang et al., 2018), and Text Generation (Duan et al., 2017; Yao et al., 2022)), humans are often eager to know how State-of-the-Art (SOTA) models arrive at a prediction. Researchers working on natural language explanations turned to human annotators for help by recruiting crowdworkers or domain experts to annotate both the labels and the corresponding natural language explanations (Camburu et al., 2018; Rajani et al., 2019; Aggarwal et al., 2021); human-annotated explanations can then be leveraged to either boost models' prediction performance or train models to generate human-understandable explanations.
However, the quality issue of human-annotated explanations has yet to be explored. Researchers often leverage popular Natural Language Generation (NLG) metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) to evaluate the similarity between model-generated and human-annotated explanations, with a strong assumption that human-annotated ones are the gold standard. However, unlike providing labels for classification or multiple-choice QA tasks, different people may come up with distinct natural language explanations for the same observation. Two such explanations can both be correct even though the BLEU or ROUGE similarity may be low. Furthermore, human-given natural language explanations can often be subjective and task-dependent. As a result, human-annotated explanations should not simply be treated as the gold standard; instead, we take the view that the core value of explanations should be based on how much help they provide towards the model prediction, rather than on notions of semantic similarity or word-matching.

Table 1: Task description and core statistics for five popular large-scale datasets with human-annotated natural language explanations that are included in our evaluation.
To summarize our contributions in this paper: 1. We provide an objective evaluation to quantify human-annotated explanations' helpfulness towards model performance. Our evaluation metric extends the Simulatability score (Doshi-Velez and Kim, 2017), and we propose a prompt-based unified data format that converts classification or multiple-choice tasks into a unified multiple-choice generation task format to minimize the influence of structural variations across different tasks.
2. Through an evaluation with five datasets and two models, we show that our metric ranks explanation quality consistently across all five datasets on both model architectures, while the Simulatability score (our baseline) falls short.
3. Our evaluation supports the hypothesis that human explanations can still benefit model prediction, even when they were criticized as low-quality by human evaluations in prior literature.

Datasets with Natural Language Explanations
Despite the development of new model architectures with ever larger parameter counts, these "black boxes" unavoidably lack the ability to explain their predictions; this has led to increased efforts in the community to leverage human-annotated explanations to either train models with explanations or teach them to self-rationalize. For example, Wiegreffe and Marasovic (2021) reviewed 65 datasets and provided a three-class taxonomy of explanations: highlights, free-text, and structured. We focus on five large public datasets with free-text human-annotated explanations at the instance level (Table 1). We double-checked these datasets' licenses, and no personally identifiable information exists.
One prominent dataset is CoS-E, with its two versions CoS-E v1.0 and CoS-E v1.11 (Rajani et al., 2019). It extended the Commonsense Question-Answering (CQA, v1.0 and v1.11 versions) dataset (Talmor et al., 2018) by adding human-annotated explanations for the correct answer label. However, a few recent works suggest that CoS-E's explanation quality is poor: Narang et al. (2020) independently hand-labeled new explanations for CoS-E and found a very low BLEU score between the original explanations and the new ones. To improve explanation quality, ECQA (Aggarwal et al., 2021) collected single-sentence explanations for each candidate answer and summarized them into a natural language explanation for every data instance in the CQA v1.11 dataset. Sun et al. (2022) showed that CoS-E explanations are not as good as ECQA explanations, as human evaluators did not believe CoS-E explanations provide additional information to support their decision-making. The fourth dataset is e-SNLI (Camburu et al., 2018), which consists of explanations for the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). Finally, the fifth dataset is ComVE (Wang et al., 2020), which asks which one of two sentences is against commonsense. Later, we evaluate the human-annotated explanations in the above five datasets with our metric and an established baseline, the Simulatability score.
It is worth mentioning that we do not include datasets such as SBIC (Sap et al., 2019) or E-δ-NLI (Brahman et al., 2021): the former does not provide explanations for all the data, while the latter leverages various sources other than human annotation to augment the δ-NLI (Rudinger et al., 2020) dataset with explanations.

Evaluation Metric for Explanations
Many commonly used evaluation metrics for text-based content, like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), treat human-annotated answers as the absolute gold standard without questioning or attempting to evaluate their quality. One established evaluation metric, the Simulatability score, derives from Human Simulatability (Doshi-Velez and Kim, 2017) and can examine gold explanations. It simply measures the change in a baseline model's prediction performance depending on whether the explanation is provided as input. Previous works (Chandrasekaran et al., 2018; Yeung et al., 2020; Hase et al., 2020; Wiegreffe et al., 2020; Poursabzi-Sangdeh et al., 2021; Rajagopal et al., 2021) have demonstrated the usefulness of the Simulatability score for evaluating explanation quality. However, this metric has two inherent disadvantages. First, it only considers the helpfulness of explanations as input during prediction on a baseline model, whereas we show through our experiments in Section 4 that explanations provide different degrees of helpfulness during fine-tuning and inference. Second, model performance can also differ when we transform the original task into another task, such as turning a classification task into a multiple-choice task with a different input data format.
In order to objectively evaluate human-annotated explanations, we define a new evaluation metric based on the Simulatability score that addresses both drawbacks: it considers the helpfulness of explanations both at fine-tuning and at inference, with the help of a unified structure that minimizes the impact of task differences. Other works (Carton et al., 2020) attempted to evaluate and categorize different characteristics of explanations, but many of them (Chan et al., 2022a; DeYoung et al., 2020) still treat human-annotated explanations as the gold standard.

Usage of Explanations for SOTA models
Existing works have explored circumstances in which explanations could improve model performance; for example, Hase and Bansal (2021) argued that explanations are most suitable for use as model input for prediction, and Kumar and Talukdar (2020) proposed a system to generate label-specific explanations specifically for the NLI task. Some recent works have tried to generate better explanations with a self-rationalization setting (Wiegreffe et al., 2020; Marasović et al., 2021), where a model is asked to generate the prediction label and explanation at the same time. We conduct a preliminary experiment to find the best model setting to leverage explanations in Section 4.1. Many recent works (Paranjape et al., 2021; Liu et al., 2021; Chen et al., 2022) explore the usage of prompts to complete explanations, generate additional information for the original task, or examine whether generated explanations can provide robustness to adversarial attacks. Ye and Durrett (2022) showed that simply plugging explanations into a prompt does not always boost in-context learning performance, and model-generated explanations can be unreliable for few-shot learning. Another related line of research focuses on extracting or generating explanations with a unified framework (Chan et al., 2022b) or with a teachable reasoning system that generates chains of reasoning (Dalvi et al., 2022).

Unified Structure
While popular metrics like BLEU and ROUGE can evaluate text coherence and similarity, one critical aspect of explanations is how beneficial they can be for model performance. Thus, we want to develop a metric that can objectively evaluate explanations' utility towards model performance. Furthermore, we expect that such a metric can systematically demonstrate how good or bad the explanations are; for example, it could objectively measure what 'noisy' means in a human study (e.g., from previous works on CoS-E).
With the advantage of sequence-to-sequence models like T5 that can map different types of language tasks into generation tasks, we can control and minimize the influence of varying task formats on model performance while evaluating the helpfulness of explanations by leveraging a unified data format. We observe that existing datasets with human-annotated explanations are mostly either multiple-choice tasks or classification tasks, and a classification task can be viewed as a multiple-choice task where the labels are the choices. Inspired by several previous works that manipulated prompts for sequence-to-sequence models (Marasović et al., 2021; Liu et al., 2021), we incorporate a few well-defined words as template-based prompts for the unified data structure to indicate the task content and corresponding explanations.
Examples shown in Figure 1 explain how we map various tasks into a unified multiple-choice generation task. We propose two settings: no explanations (Baseline) and explanations as additional input (Infusion). Here we explain how each prompt addresses a different part of the data content: 1) 'explain:' is followed by the question content, 2) 'choice-n:' is followed by each candidate answer, and 3) a special token '<sep>' separates the explanations from the task content, while the explanations in Infusion are led by 'because' so that the model knows that the explanation text explains the task content. For datasets like CoS-E and ECQA, we leverage the original task as the question content. On the other hand, we define fixed question prompts for e-SNLI: "what is the relation between [Premise] and [Hypothesis]?", and for ComVE: "which sentence is against commonsense?" to specify corresponding tasks to models.
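The mapping above can be sketched as follows. This is a minimal illustration of the unified format under the template described in the text; the function and variable names are ours, not the authors' code.

```python
SEP = "<sep>"  # special separator token between task content and explanation

def to_unified(question, choices, explanation=None):
    """Map a classification / multiple-choice instance into the unified
    prompt format: 'explain: <question> choice-1: ... choice-n: ...'.
    Under Infusion, append '<sep> because <explanation>'; under Baseline,
    pass explanation=None and no explanation text is added."""
    parts = [f"explain: {question}"]
    parts += [f"choice-{i}: {c}" for i, c in enumerate(choices, start=1)]
    text = " ".join(parts)
    if explanation is not None:  # Infusion setting
        text += f" {SEP} because {explanation}"
    return text

def esnli_question(premise, hypothesis):
    """e-SNLI uses a fixed question template built from premise and hypothesis."""
    return f"what is the relation between {premise} and {hypothesis}?"
```

For ComVE, the fixed question "which sentence is against commonsense?" would play the same role as `esnli_question` here.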

Utilizing Explanations as Part of Input vs Part of Output
Recent works have explored various circumstances in which human-annotated explanations could help; for example, Hase and Bansal (2021) argued that explanations as additional input would best suit performance improvement. Marasović et al. (2021) proposed self-rationalizing models, which generate explanations along with labels; however, they did not compare prediction accuracy with a baseline. We hypothesize that leveraging explanations as additional input alongside the original task input allows models to use explanations for better prediction, while the self-rationalization setting complicates the prediction task for the models and may lead to a performance decrease. In addition, the generated explanations from self-rationalization systems are not explicitly used for label prediction. To test our hypothesis, we conduct a preliminary experiment on the CoS-E v1.0 and ECQA datasets. We fine-tune three T5-base models on each dataset with three different settings: Baseline, Infusion, and explanations as additional output (Self-Rationalization hereinafter). For each model, we maintain the same setting during fine-tuning and inference; for example, a model fine-tuned with Infusion will also take data under Infusion during inference. We leverage the unified structure for Baseline and Infusion shown in Figure 1 and make minor adjustments for the self-rationalization setting accordingly (shown in Appendix A).
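For concreteness, the three settings can be sketched as (input, target) pairs. The exact templates are given in Figure 1 and Appendix A; the strings below are our illustrative assumptions, not the authors' exact formats.

```python
def make_example(question_with_choices, label, explanation, setting):
    """Return an illustrative (input, target) pair for one data instance
    under each of the three settings compared in the preliminary experiment."""
    if setting == "baseline":
        # no explanation anywhere
        return question_with_choices, label
    if setting == "infusion":
        # explanation as additional *input*; target stays the label
        return f"{question_with_choices} <sep> because {explanation}", label
    if setting == "self-rationalization":
        # explanation as additional *output*; model must generate both
        return question_with_choices, f"{label} because {explanation}"
    raise ValueError(f"unknown setting: {setting}")
```

The contrast is visible directly in the pairs: Infusion changes only the input, Self-Rationalization changes only the target.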
The experiment results are shown in Table 2. We notice that the self-rationalization setting performs worse than the Baseline, in line with our hypothesis. On the other hand, the Infusion setting surprisingly achieves significant improvement on CoS-E, which was considered 'noisy' by previous works, demonstrating that the CoS-E explanations are indeed helpful to models. The Infusion setting also approaches nearly complete correctness on the ECQA dataset.

Explanations as Partial Input During Fine-Tuning
To examine the utility of explanations to the models during fine-tuning, we perform an in-depth experiment with the Baseline and Infusion settings while varying the amount of training data used for fine-tuning. First, we randomly select nine sub-datasets with amounts of data ranging from 10% to 90% of the training data in each dataset used in the first preliminary experiment. Then, for each sub-dataset, we fine-tune three models with different random seeds for sampling and fine-tuning and report the averaged prediction performance. As a result, for each of the CoS-E v1.0 and ECQA datasets, we obtain 60 models fine-tuned with varying amounts of data under the Baseline and Infusion settings (including the models fine-tuned on the full training data), and then perform prediction with both the Baseline and Infusion settings. We maintain the same hyper-parameters across all models fine-tuned for this experiment and report them in Appendix B.1.
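The sampling scheme can be sketched as follows. This is a minimal illustration under our reading of the setup (10 data amounts, counting the full training set, times 3 seeds gives 30 fine-tuning runs per setting, i.e. 60 models per dataset across Baseline and Infusion); the helper name and defaults are our assumptions.

```python
import random

def sample_subsets(train_data, seeds=(0, 1, 2),
                   fractions=tuple(i / 10 for i in range(1, 11))):
    """Draw one random training subset per (fraction, seed) pair.
    Fractions cover 10%..90% plus the full data (1.0); with 3 seeds this
    yields 10 * 3 = 30 subsets, hence 30 fine-tuning runs per setting."""
    subsets = {}
    for frac in fractions:
        k = int(len(train_data) * frac)
        for seed in seeds:
            rng = random.Random(seed)  # independent RNG per run
            subsets[(frac, seed)] = rng.sample(train_data, k)
    return subsets
```

Each subset would then be used to fine-tune one model per setting, and accuracies are averaged over the three seeds.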
Figure 3: The formula of our Treu metric. M denotes a model, where the superscript denotes the predict setting and the subscript denotes the fine-tune setting (M^{predict setting}_{finetune setting}). The Simulatability score only considers the second part of our formula.
The two diagrams in Figure 2 show the experiment results on the two datasets (detailed results in Table 4 in the appendix). Different colors denote different fine-tuning and inference settings. We conclude with a few interesting observations: 1. By looking at the yellow (model fine-tuned with Infusion and predicting with Baseline) and green (model fine-tuned and predicting with Infusion) lines, we notice that adding more training data during fine-tuning does not significantly improve model performance, suggesting that the fine-tuning process does not teach the model the new knowledge conveyed in the explanations.
2. By comparing the yellow and blue (model fine-tuned and predicting with Baseline) lines in each diagram, we notice that the models fine-tuned with Infusion perform worse than the baseline models when no explanations are given during inference, demonstrating that fine-tuning with Infusion teaches the models to rely on the explanations to predict.
3. By comparing the red (model fine-tuned with Baseline and predicting with Infusion) and blue lines in each diagram, we observe that the baseline models for CoS-E perform worse when predicting with explanations, whereas the baseline models for ECQA consistently and significantly exceed baseline performance. This demonstrates that the helpfulness of explanations to baseline models is much lower in CoS-E than in ECQA, which is aligned with previous works.
4. By comparing the green and blue lines in both diagrams, we notice that explanations in CoS-E contribute a substantial improvement during inference for models fine-tuned with the Infusion setting. This observation shows that explanations in CoS-E can help models during fine-tuning, even though they were considered 'noisy' by humans in previous works.
5. By comparing the red and green lines in both diagrams, we observe that, in order to take full advantage of explanations, it is beneficial to fine-tune a model even with a small amount of data that incorporates the explanations; such fine-tuning can lead to a substantial improvement.
This experiment shows that explanations provide different degrees of utility during fine-tuning and inference. Thus, we should consider both situations when evaluating the helpfulness of explanations.

Our Treu Metric
Based on our observations from the preliminary experiments, we propose a novel evaluation metric that extends the Simulatability score. Figure 3 shows the formula of our Treu metric: it evaluates the helpfulness of explanations as the sum of two parts. At fine-tuning, two models are fine-tuned with the Baseline and Infusion settings, respectively, and we calculate the difference in prediction accuracy, with each model evaluated using the same data format it was fine-tuned with. At inference, we fine-tune only one model with the Baseline setting and calculate the difference in prediction accuracy between the Infusion and Baseline settings.
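Under this description, the Treu score can be written as a two-term sum. The function below is our reading of the formula in Figure 3; the accuracy argument names are our shorthand for the fine-tune/predict settings.

```python
def treu_score(acc_bb, acc_ii, acc_bi):
    """Treu = (fine-tuning term) + (inference term).
    acc_bb: accuracy of the model fine-tuned and predicting with Baseline
    acc_ii: accuracy of the model fine-tuned and predicting with Infusion
    acc_bi: accuracy of the Baseline-fine-tuned model predicting with Infusion
    The fine-tuning term compares the two models in their own formats;
    the inference term is the classic Simulatability score. Since each
    term is a difference of accuracies in [0, 1], the score lies in [-2, 2]."""
    finetune_term = acc_ii - acc_bb
    inference_term = acc_bi - acc_bb  # Simulatability
    return finetune_term + inference_term
```

A positive fine-tuning term with a negative inference term (as we later observe for e-SNLI) is exactly the case the intermediate scores are meant to expose.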
The second part of our metric is exactly the Simulatability metric. We observe that fine-tuning a model with data that incorporates explanations can provide substantial benefits; however, the Simulatability score fails to account for this component and only considers the performance improvement of a model that uses explanations at inference without being fine-tuned on them first. For models fine-tuned with the Baseline setting, we believe pre-trained SOTA large-scale models can understand additional content in the input to a certain extent. Adding explanations to the input during inference shows whether they help a baseline model without additional supervision, while models fine-tuned with the Infusion setting rely more heavily on the explanation part of the input for inference.
A positive score demonstrates that the explanations provide overall helpfulness for better prediction, while a negative score does not necessarily mean the explanations are unhelpful: it indicates that the explanations lead to a performance drop in at least one part of the evaluation, and researchers can further analyze the intermediate score of each part. Since each part is a difference of accuracies, the score theoretically ranges from -2 to 2.

Evaluation
We evaluate human-annotated natural language explanations across five popular datasets using our Treu metric and the Simulatability score. To verify that our metric is less biased by different model architectures and to examine how models fine-tuned with different settings influence prediction performance, we perform experiments on both T5 and BART models. The proposed unified data format is applied to the experiments for both our metric and the Simulatability score, making the latter a more robust baseline.
We maintain the same fine-tuning hyper-parameters for all the models in the experiment (details in Appendix B.2). The only exception is the e-SNLI dataset, which has about 10x the training data (549,367 instances) of the other datasets; therefore, we fine-tune models on e-SNLI for only two epochs. Furthermore, for BART we leverage the special token '<s>', which was already used during pre-training, instead of adding the special token '<sep>' to the BART tokenizer during fine-tuning. We present the evaluation results in Table 3.
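The special-token handling can be sketched as follows, assuming the Hugging Face `transformers` tokenizer API (`add_special_tokens`); the helper name and its exact placement in the pipeline are our assumptions.

```python
def add_separator(tokenizer, is_bart):
    """Return the separator token to use in the unified format.
    BART reuses its pre-trained '<s>' token; for T5 we register '<sep>'
    as an additional special token so it is never split into subwords."""
    if is_bart:
        return "<s>"  # already in the vocabulary from pre-training
    tokenizer.add_special_tokens({"additional_special_tokens": ["<sep>"]})
    # after adding a token, the model's embedding matrix must be resized:
    # model.resize_token_embeddings(len(tokenizer))
    return "<sep>"
```

Reusing a pre-trained token for BART avoids introducing a randomly initialized embedding that would need to be learned from scratch during fine-tuning.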

Findings
Our results support the intuition that human-annotated explanations can still benefit model prediction, even when they were evaluated as low-quality by humans in prior literature. Comparing the models' prediction results across the two architectures, all models fine-tuned on T5-base outperform those fine-tuned on BART-base with the same setting, mostly by a significant margin.
Despite apparent performance differences between model architectures, by looking at the ordering of datasets in both tables, which is based on our Treu score, we can observe that the Treu score provides the same ranking of explanation quality for the five datasets across the two model architectures. According to the Treu score (Table 3), explanations in ECQA have the best quality among the five datasets. In particular, explanations in ECQA are much better than those in both CoS-E datasets, which is consistent with the consensus of previous works. It is worth noticing that both CoS-E datasets achieve positive Treu scores, though significantly lower than those for ECQA, demonstrating that explanations in the CoS-E datasets still have positive overall helpfulness for models' prediction performance even though they were considered 'low quality and noisy' in human experiments (Sun et al., 2022).

Table 3: Evaluation results of human-annotated explanations in five datasets with our Treu score and the Simulatability score. The tables above and below correspond to models fine-tuned on T5-base and BART-base, respectively. The Simulatability score only considers M^{predict=Baseline}_{finetune=Baseline} and M^{predict=Infusion}_{finetune=Baseline}, while our Treu score additionally considers M^{predict=Infusion}_{finetune=Infusion}.
Our Treu score ranks explanation quality consistently across all five datasets on both models, while the Simulatability score falls short: it provides two distinct rankings, ordering e-SNLI and ComVE in reverse on BART compared with the T5 models (Table 3). This indicates that the Simulatability score can be more affected by model architecture, even with the unified data structure.
One advantage of using our Treu score to evaluate the quality of explanations is that we can analyze the score by class or through intermediate results from either fine-tuning or inference. For instance, we observe that the Treu scores for e-SNLI with both T5 and BART models are negative, indicating that the helpfulness of explanations in e-SNLI could be limited. However, looking into the intermediate results, although the baseline models perform significantly worse when predicting with Infusion than with the Baseline setting, the models fine-tuned with Infusion still outperform the baseline models when predicting with Infusion, showing that the explanations do provide improvements under this setting. When we further decompose the Treu score of e-SNLI by category, we obtain 0.13/-0.483/0.094 on T5-base and 0.015/-0.227/-0.271 on BART-base, corresponding to entailment/neutral/contradiction.
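The class-level decomposition amounts to computing each Treu term within each label class. The sketch below uses our own helper names and assumes predictions and gold labels are available per setting.

```python
from collections import defaultdict

def per_class_accuracy(preds, golds):
    """Accuracy restricted to each gold-label class, e.g. entailment /
    neutral / contradiction for e-SNLI."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, g in zip(preds, golds):
        total[g] += 1
        correct[g] += int(p == g)
    return {label: correct[label] / total[label] for label in total}

def per_class_treu(acc_bb, acc_ii, acc_bi):
    """Apply the two-term Treu formula within each class.
    Inputs are dicts mapping label -> accuracy for each
    fine-tune/predict setting (bb, ii, bi as in the overall metric)."""
    return {c: (acc_ii[c] - acc_bb[c]) + (acc_bi[c] - acc_bb[c])
            for c in acc_bb}
```

This is how a negative overall score can be traced to a single class (e.g. 'neutral' in e-SNLI) rather than to the explanations as a whole.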
We speculate that the helpfulness of human-annotated explanations to models depends heavily on the task (e.g., the 'contradiction' label category) and the explanation format (e.g., counterfactual styles). We notice that the models fine-tuned on T5 and BART have more than a 40% prediction accuracy drop on data with 'neutral' labels when they are fine-tuned with Baseline and predict with Infusion. In addition, we observe that the fine-tuned BART models have about a 40% prediction accuracy drop on data with ground-truth 'contradiction' labels. We suspect human annotators behave differently when providing explanations for different categories in e-SNLI. For instance, humans tend to provide counterfactual explanations or use negation connotations to explain why two sentences fall into the 'neutral' or 'contradiction' categories. Some representative examples for each class are provided in Appendix D. Such a tendency to use negation connotations in explanations for specific categories may make it harder for the models to interpret the information and eventually lead to false predictions.
From Table 3, ComVE ranks worst among the five datasets in both tables, indicating that the explanations in ComVE are the least helpful for the models to either fine-tune or predict with. Since the ComVE task asks models to predict which sentence is more likely against commonsense, the question itself implies a negation connotation. Likewise, many ComVE explanations contain negation, such as the one in Figure 1. Negation has always been a difficult concept for machines. Although both T5 and BART models fine-tuned with the Baseline setting can perform relatively well on ComVE, the addition of explanations that largely contain negation during inference is likely to create more difficulties for the models and eventually lead to false predictions.
Our hypothesis about counter-examples and negation in human-annotated explanations finds support in many recent works. A recent analysis (Joshi et al., 2022) claimed that negation connotations have high necessity but low sufficiency for describing the relation between features and labels. In addition, counterfactually augmented data may prevent models from learning unperturbed robust features and exacerbate spurious correlations (Joshi and He, 2021). Therefore, we suggest that human annotators avoid using counter-examples when providing explanations; using precise words to describe the degree of relation between concepts is preferable and more helpful to models.
Nevertheless, these models can correctly use explanations for all categories after being fine-tuned with the Infusion setting. It is worth pointing out that ECQA explanations are summarized from positive and negative properties of each candidate choice and thus also contain negation words; however, those negation words mostly appear in negative properties of wrong choices. As a result, the pre-trained baseline models can leverage ECQA explanations with Infusion during prediction and achieve performance improvements. Since we are the first to discover such a class-level drop on e-SNLI by using the Treu score, we only propose our hypothesis here and leave a definitive study to future work.

Conclusion
In this paper, we objectively evaluate human-annotated natural language explanations from the perspective of measuring their helpfulness towards models' prediction. We conduct two preliminary experiments and, based on the findings from this preliminary study, define an evaluation metric that considers the explanations' helpfulness at both the fine-tuning and inference stages. We also propose a unified prompt-based data format that minimizes the influence of task differences by mapping various tasks into a unified multiple-choice generation task. Our experiments with human-annotated explanations in five popular large-scale datasets over two sequence-to-sequence model architectures demonstrate that our metric can consistently reflect the relative ranking of explanation quality among the five datasets, while the Simulatability score falls short. Our work opens many directions for future work, and we recommend that researchers perform similar quality checks when collecting human-annotated explanations in the future.

Limitations
In this paper, we evaluate the quality of human-annotated natural language explanations in terms of models' prediction performance on multiple datasets. Although a natural next step would be to generalize our evaluation metric to the helpfulness of model-generated explanations, we would like to caution that our metric and evaluation experiment require the models to generate explanations for the train-split data and then use the data with generated explanations to fine-tune a second model with the Infusion setting, which may not be suitable for explanation-generation systems that are themselves trained on the train-split data.

Ethics Statement
We do not see potential ethical concerns or misuse of the proposed evaluation method. One potential risk, though minimal, could be the misinterpretation of the findings of this paper. We would like to caution readers that a higher score of our metric may not necessarily reflect a higher quality perceived by humans, as the evaluation metric only measures the explanation's benefit from the modeling perspective, and it is only one of the many possible ways of automatically evaluating the quality of natural language explanations.

A Implementation of self-rationalization format
We show the implementation of the self-rationalization setting proposed by Marasović et al. (2021) and present it together with our proposed unified structure for the Baseline and Infusion settings in Figure 4.

B Experiment Hyper-Parameters
We perform all the computational experiments on a Google Colab instance with a single Nvidia V100 GPU and 50 Gigabytes of RAM.

B.1 Hyper-parameter for Preliminary Experiment
For the preliminary experiment of utilizing explanations as part of the input vs. part of the output, we use the following hyper-parameters for all models with different data structures: max_len: 512, target_max_len: 64, train_batch_size: 1, learning_rate: 5e-5, num_train_epochs: 12.

B.2 Hyper-parameter for Explanation Evaluation with five Datasets
For the evaluation of human-annotated explanations on the five datasets, we maintain the following hyper-parameters for all models: max_len: 512, target_max_len: 64, train_batch_size: 1, learning_rate: 5e-5, num_train_epochs: 12. The only exception is the e-SNLI dataset, which has about 10x the training data (549,367 instances) of the other datasets; therefore, we fine-tune models on e-SNLI for only two epochs.
C Results for Preliminary Experiment - Explanations as Partial Input During Fine-tuning

We use three random seeds to select the data subsets and fine-tune the models for the preliminary experiment on explanations as partial input during fine-tuning. The detailed results of each experiment and the average accuracies are reported in Table 4.

D Examples of different explanations for each category in e-SNLI dataset
From our evaluation results, we suspect that human annotators behave differently when explaining data of different categories in e-SNLI. For instance, human annotators may explain why two sentences are in an 'entailment' relation by describing the shared information or similarities conveyed by both sentences, which is easy for models to understand. However, humans tend to provide counter-examples or negations to explain why two sentences are unrelated (neutral) or contradictory, rather than explaining their reasoning in a positive way. Table 5 shows representative examples for each category.

Table 5: Representative examples of human-annotated explanations for each category in e-SNLI.

entailment
- Premise: An old man with a package poses in front of an advertisement. Hypothesis: A man poses in front of an ad. Explanation: The word "ad" is short for the word "advertisement".
- Premise: A man reads the paper in a bar with green lighting. Hypothesis: The man is inside. Explanation: In a bar means the man could be inside.

neutral
- Premise: An old man with a package poses in front of an advertisement. Hypothesis: A man poses in front of an ad for beer. Explanation: Not all advertisements are ad for beer.
- Premise: A woman with a green headscarf, blue shirt and a very big grin. Hypothesis: The woman is young. Explanation: The woman could've been old rather than young.
- Premise: A man reads the paper in a bar with green lighting. Hypothesis: The man is reading the sportspage. Explanation: The man could be reading something other than the sportspage.

contradiction
- Premise: A woman with a green headscarf, blue shirt and a very big grin. Hypothesis: The woman has been shot. Explanation: There can be either a woman with a very big grin or a woman who has been shot.
- Premise: A man playing an electric guitar on stage. Hypothesis: A man playing banjo on the floor. Explanation: The man can't play on stage if he is on the floor.
- Premise: A couple walk hand in hand down a street. Hypothesis: A couple is sitting on a bench. Explanation: The couple cannot be walking and sitting at the same time.