STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective.


Introduction
Recent advances in NLP demonstrate the effectiveness of applying large-scale pre-trained language models to downstream tasks (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Lan et al., 2020; Brown et al., 2020; He et al., 2021). While these models have achieved state-of-the-art results on many NLP benchmarks, they struggle when given limited training data. For instance, Devlin et al. (2019) find that BERT is prone to degenerate performance on small datasets. While enormous language models like GPT-3 (Brown et al., 2020) can solve a new task from only a few examples without any fine-tuning, their performance still lags far behind state-of-the-art fine-tuning results. Manually annotating large amounts of training data would likely improve performance but can be prohibitively expensive for many tasks and domains. In this paper, we propose STraTA, an approach that combines two complementary methods, Self-Training and Task Augmentation, to effectively leverage unlabeled data, which is comparatively cheap to obtain. At a high level, task augmentation exploits unlabeled text from the domain of a given target task to simulate a large amount of in-domain training data for the auxiliary task of natural language inference (NLI), which is then used to train a given model before applying it to the target task. To achieve this, we first build an NLI data generator by fine-tuning a pre-trained generative language model on the MNLI dataset (Williams et al., 2018) in a text-to-text format. Then, given a target task (e.g., sentiment analysis) with unlabeled texts (e.g., his acting was really awful), we use the NLI data generator to generate NLI examples (e.g., [his acting was really awful, he gave an incredible performance, contradiction]). We show that task augmentation alone can significantly improve downstream performance across different tasks, generally outperforming other fine-tuning approaches, such as target-task language model fine-tuning (Howard and Ruder, 2018; Gururangan et al., 2020) and intermediate-task fine-tuning on MNLI (Phang et al., 2019), in both high- and low-data regimes.

Figure 2: An illustration of our Self-Training with Task Augmentation (STraTA) approach. In task augmentation, we train an NLI data generation model and use it to synthesize a large amount of in-domain NLI training data for each given target task, which is then used for auxiliary (intermediate) fine-tuning. Our self-training algorithm iteratively learns a better model using a concatenation of labeled and pseudo-labeled examples. At each iteration, we always start with the auxiliary-task model produced by task augmentation and train on a broad distribution of pseudo-labeled data.
Having obtained a strong auxiliary-task model with task augmentation, STraTA uses this model as the base model for self-training. Specifically, at each iteration, the base model is fine-tuned using the available labeled data for the target task. Then, the resulting model's predictions on unlabeled examples are used as pseudo-labels to augment the original labeled data set. The newly formed labeled data set is then used to learn a better model in the next iteration, and this procedure is repeated until a stopping criterion is reached. While self-training has been extensively studied (Rosenberg et al., 2005; McClosky et al., 2006; He et al., 2020; Xie et al., 2020b; Du et al., 2021), our experiments reveal that using a strong base model and training on a broad distribution of pseudo-labeled data are key factors for its successful deployment in NLP. (We use the term "unlabeled text" to refer to pieces of text, e.g., sentences, from the target domain, and the term "unlabeled examples" to refer to examples that can be annotated using the set of class labels for the target task.)
Using our STraTA approach, we are able to significantly improve sample efficiency, in terms of both performance and variance, across 12 NLP benchmark datasets. For instance, on the SST-2 sentiment dataset (Socher et al., 2013), with only 8 training examples per class, we achieve comparable results to standard fine-tuning with 67K training examples (see Figure 1).
Our main contributions are as follows: 1. We propose task augmentation, a novel data augmentation-based fine-tuning method, and show its effectiveness in comparison to other competing fine-tuning approaches.
2. We propose a simple yet effective self-training algorithm and highlight important ingredients for successful self-training, which we hope will enable the wider adoption of self-training in NLP.
3. With STraTA, we demonstrate the effectiveness of combining task augmentation and self-training in improving sample efficiency across NLP benchmarks.

Task augmentation
Labeled data is often expensive and time-consuming to obtain, which motivates approaches that learn from both labeled and unlabeled data. More formally, assume we are given a target task T with a labeled data set L_T = {(x_i, y_i)}_{i=1}^M and an unlabeled data set U_T = {x_j}_{j=1}^N. The unlabeled data U_T can be created artificially by removing the ground-truth labels y from L_T (as in our main experiments), or it can come from additional unlabeled texts from the target domain or from related datasets/domains (see Section 5). Our methods, task augmentation and self-training, take advantage of the unlabeled data U_T to maximize performance on the target task T, even when the number of labeled examples M is small (e.g., M = 16). In this section, we present a framework and implementation for task augmentation, which uses natural language inference (NLI) as an auxiliary (intermediate) training task to improve downstream performance.
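This setup can be made concrete with a minimal sketch (an illustration of ours; `make_unlabeled` is not a name from the paper):

```python
# Minimal sketch of the data setup: a small labeled set L_T and a large
# unlabeled pool U_T. Here U_T is created artificially by dropping the
# ground-truth labels, as in the paper's main experiments.

def make_unlabeled(labeled_data):
    """Strip ground-truth labels from (text, label) pairs to form U_T."""
    return [text for text, _ in labeled_data]

# Toy target task T with M = 2 labeled examples.
L_T = [("his acting was really awful", "negative"),
       ("a tremendously moving fable", "positive")]
U_T = make_unlabeled(L_T)  # in practice U_T would contain many more texts
print(U_T)
```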

A framework for task augmentation
Task augmentation builds on a recent body of NLP research on intermediate-task training (Phang et al., 2019; Vu et al., 2020), in which a pre-trained language model, such as BERT, is fine-tuned on an auxiliary task before the target task. In previous work on intermediate fine-tuning, the auxiliary dataset is a fixed, target-task-independent dataset, such as MNLI or SQuAD (Rajpurkar et al., 2016). An obvious limitation of this choice is the domain mismatch between the auxiliary and target tasks, which our proposed task augmentation method addresses. More specifically, we fine-tune a pre-trained generative language model and use it to synthesize a large amount of in-domain training data from U_T for an auxiliary task A, which is then used to improve the performance of a model on the target task T (Figure 2, left). In this work, we choose NLI as the auxiliary task for two main reasons: (1) NLI has been shown to be an effective auxiliary task for a variety of target tasks (Conneau et al., 2017; Phang et al., 2019), and (2) existing NLI datasets contain large training sets, which allows us to train a reliable data generator.
Generating synthetic NLI data: To obtain an NLI data generator, we fine-tune the pre-trained T5-3B model (Raffel et al., 2020) on MNLI, which contains 393K sentence pairs labeled as {entailment, contradiction, neutral}. We cast each MNLI training example (sent_A, sent_B) → label into a text-to-text format (label, sent_A) → sent_B to obtain fine-tuning examples that look like [entailment, the facts are accessible to you → you have access to the facts].[5] We fine-tune T5 on this dataset with a constant learning rate of 0.001 for 2^16 = 65,536 steps using the Adafactor optimizer (Shazeer and Stern, 2018). The fine-tuned T5 data generator produces augmented examples for all target datasets. Specifically, at inference time, we feed the model an NLI label (e.g., entailment) and an unlabeled sentence x_j from the target domain to produce an output sentence x_k: (entailment, x_j) → x_k (see Appendix B for example outputs). Data for intermediate fine-tuning is then formed by creating examples like (x_j, x_k) → entailment. This approach has several advantages: (1) training labels are free, and (2) through overgeneration, a large amount of in-domain NLI training data can be produced even for target tasks with small datasets.

Overgeneration and filtering: Following Puri et al. (2020), we perform overgeneration and filtering to increase the quantity and quality of the synthetic NLI training data. Concretely, we generate 100 output samples per input with top-k (k = 40) sampling (duplicates are removed) and use a BERT model fine-tuned on MNLI (in the original format) as an NLI classifier to filter the synthetic training examples. We keep a synthetic example if the NLI classifier produces the same label as the one fed to the NLI data generator and is also confident about its prediction.[6] For all experiments, we perform intermediate fine-tuning on examples from both the original MNLI dataset and the final filtered task augmentation dataset.[7]
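The text-to-text casting and the filtering step can be sketched as follows (a simplified sketch of ours: the real pipeline fine-tunes T5-3B as the generator and uses a BERT NLI classifier, both of which are stubbed out here, and `tau` plays the role of the confidence threshold τ):

```python
def to_generation_format(premise, hypothesis, label):
    """Cast an MNLI example (sent_A, sent_B) -> label into the
    text-to-text format (label, sent_A) -> sent_B used to fine-tune T5."""
    return f"{label}: {premise}", hypothesis

def filter_synthetic(candidates, nli_classifier, tau=0.7):
    """Keep a synthetic (premise, hypothesis, label) triple only if a
    separately trained NLI classifier reproduces the intended label
    with confidence above the threshold tau."""
    kept = []
    for premise, hypothesis, label in candidates:
        pred_label, confidence = nli_classifier(premise, hypothesis)
        if pred_label == label and confidence >= tau:
            kept.append((premise, hypothesis, label))
    return kept

# Stub classifier for illustration: a real setup would run a BERT model
# fine-tuned on MNLI and return its argmax label and probability.
def stub_classifier(premise, hypothesis):
    return ("contradiction", 0.9)

candidates = [
    ("his acting was really awful", "he gave an incredible performance",
     "contradiction"),  # classifier agrees and is confident: kept
    ("his acting was really awful", "the film is long", "neutral"),  # dropped
]
print(filter_synthetic(candidates, stub_classifier))
```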
[5] We fine-tune a separate T5 model per class label. To overcome biases in MNLI, where the hypotheses are usually shorter than the premises, we also include reversed examples: (reversed label, sent_B) → sent_A.
[6] We keep an example when its predicted probability exceeds a threshold τ. We choose τ from {0.3, 0.4, ..., 0.9} for each target task based on performance on the original MNLI development set.
[7] A two-stage intermediate fine-tuning procedure, where the model is first trained on the synthetic data before being fine-tuned on the original data, typically works better; this is used in our experiments.


Self-training
While task augmentation uses unlabeled texts to produce synthetic data for an intermediate task, self-training is a complementary approach that improves a model by training directly on the target task using pseudo-labeled examples. In this section, we explore a simple self-training algorithm in which a model learns to improve itself using its predictions on unlabeled examples from a given target task. Our method differs from traditional self-training methods in that we leverage a strong base model and allow it to learn from all available pseudo-labeled examples at every iteration, regardless of model confidence. Formally, given a target task T with a small labeled data set L = {(x_i, y_i)}_{i=1}^M and an unlabeled data set U = {x_j}_{j=1}^N, where M ≪ N, we summarize our self-training algorithm in Algorithm 1.

Algorithm 1: Our self-training algorithm
Initialization: Set t = 0. Form a base model f_0, initialized with pre-trained parameters from a pre-training/intermediate fine-tuning stage, and learn a teacher model f_1 by training f_0 on the original labeled data set L.
repeat
  t = t + 1
  1. Use the current teacher model f_t to annotate (for t = 1) or re-annotate (for t > 1) all of the examples in U to obtain a set Ũ of pseudo-labeled examples.
  2. Add the whole set Ũ of pseudo-labeled examples to the original labeled data set L to form a new labeled data set.
  3. Learn a student model f_{t+1} by training the base model f_0 on the current labeled data set, and optionally fine-tune it on L. The resulting student model f_{t+1} is used as the teacher for the next iteration.
until convergence or the maximum number of iterations is reached
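Algorithm 1 can be sketched in code as follows (a schematic of ours, with a stub `train_base` standing in for fine-tuning BERT from the fixed base model f_0; early stopping and the optional extra fine-tuning on L are omitted):

```python
def self_train(train_base, labeled, unlabeled, max_iters=30):
    """Schematic version of Algorithm 1.

    train_base(data) -> predict_fn: trains a fresh copy of the base
    model f_0 on `data` and returns a prediction function.
    """
    # Learn the initial teacher f_1 on the original labeled set L.
    teacher = train_base(labeled)
    for _ in range(max_iters):
        # Step 1: (re-)annotate ALL unlabeled examples with the teacher.
        pseudo = [(x, teacher(x)) for x in unlabeled]
        # Step 2: add the whole pseudo-labeled set to L (no confidence
        # filtering), and Step 3: train a new student from the SAME f_0.
        student = train_base(labeled + pseudo)
        teacher = student  # the student becomes the next teacher
    return teacher

# Toy instantiation: a 1-D threshold "model" whose decision boundary is
# the midpoint between the class means of its training data.
def train_base(data):
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    boundary = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > boundary)

labeled = [(0.0, 0), (10.0, 1)]     # M = 2 labeled examples
unlabeled = [1.0, 2.0, 8.0, 9.0]    # N > M unlabeled examples
model = self_train(train_base, labeled, unlabeled, max_iters=3)
print(model(1.5), model(8.5))
```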
Starting with a strong base model: An important ingredient in self-training algorithms is the base model f_0. Successful self-training typically requires a good base model that can provide a large proportion of "correct" predictions, or pseudo-labels, on unlabeled examples; otherwise, errors can be propagated or magnified by later stages of self-training. At each self-training iteration, we always start from the same base model f_0, which is initialized with pre-trained parameters from a pre-training/intermediate fine-tuning stage (e.g., the auxiliary-task training stage in task augmentation), and then fine-tune all of its parameters using the available labeled and pseudo-labeled data. We find empirically that starting from the base model f_0 works better than starting from the model f_{t-1} obtained in the previous iteration. He et al. (2020) find that further fine-tuning the resulting model on the original labeled data set L improves machine translation models; we use development set performance to determine whether or not to perform this fine-tuning step for each dataset.

Self-training on a broad distribution of pseudo-labeled data: Another important factor is the selection of pseudo-labeled examples at each self-training iteration. Traditional self-training approaches usually select a small set of examples on which the current teacher model f_t is sufficiently confident (e.g., the probability of the predicted class label is above a threshold) and add them to the labeled data set at each iteration until the unlabeled data pool U is exhausted. This can be problematic, as state-of-the-art language models like BERT are overconfident and poorly calibrated (Jiang et al., 2021). In preliminary experiments, we tried several calibration methods, including temperature scaling (Guo et al., 2017), label smoothing (Müller et al., 2019), and confidence penalties (Pereyra et al., 2017), but none of them fully addressed this problem. Instead, we encourage learning from a "natural" broad distribution of pseudo-labeled data by adding the whole set Ũ of pseudo-labeled examples to the original labeled data set L at each self-training iteration. (We find that removing the examples with the lowest-confidence pseudo-labels can be helpful for some tasks; a development set, when available, can be used to assess whether this filtering is necessary.) At each iteration t > 1, we also re-annotate all of the examples in the original unlabeled data pool U with f_t, as we expect f_t to be better than f_{t-1}.
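The contrast between the two selection strategies can be made concrete (a schematic of ours; `pseudo` is assumed to hold (example, predicted label, confidence) triples produced by the teacher):

```python
def select_confident(pseudo, threshold=0.9):
    """Traditional self-training: keep only high-confidence pseudo-labels.
    This narrows the training distribution, which is risky because models
    like BERT are overconfident and poorly calibrated."""
    return [(x, label) for x, label, conf in pseudo if conf >= threshold]

def select_all(pseudo):
    """STraTA's choice: keep the whole 'natural' distribution of
    pseudo-labels, regardless of confidence."""
    return [(x, label) for x, label, _ in pseudo]

pseudo = [("great film", 1, 0.99),
          ("it was fine", 1, 0.55),
          ("awful pacing", 0, 0.62)]
print(len(select_confident(pseudo)))  # only 1 example survives the filter
print(len(select_all(pseudo)))        # all 3 are used
```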

Experiments
We perform experiments across 12 different NLP datasets and three different data regimes (including a few-shot setting). Task augmentation consistently improves over prior fine-tuning approaches in all three regimes, and the combination of self-training and task augmentation, STraTA, results in higher performance and lower variance than competing approaches when given only 8 labeled examples per class from each dataset.

Datasets & data regimes
The datasets used in our study (Table 1) come from two common language understanding benchmarks: GLUE (Wang et al., 2019b) and SentEval (Conneau and Kiela, 2018). (Appendix A contains more details about the characteristics and associated evaluation metrics for each dataset.) Due to restricted test set access for GLUE datasets, we hold out a small subset of the training set for validation and report results on the original development set. The training set without ground-truth labels is used as unlabeled data U_T for each task. We consider three data regimes by varying the amount of labeled training data across the downstream tasks.

Setup
As in Devlin et al. (2019), our input format for all tasks contains a [CLS] token followed by a single text segment or a concatenation of text segments (e.g., a premise-hypothesis pair) separated by a [SEP] token. We feed the final [CLS] representation into a task-specific classification layer and fine-tune all parameters end-to-end on the downstream tasks. For both fine-tuning and self-training, we perform early stopping based on development set performance. We use the Transformers library (Wolf et al., 2019) and its recommended hyperparameters for all experiments.
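The input format can be illustrated with a toy whitespace tokenizer (a sketch of ours; a real setup would use the WordPiece tokenizer from the Transformers library rather than `str.split`):

```python
def build_input(segment_a, segment_b=None):
    """Build a BERT-style input: [CLS] A [SEP] (optionally B [SEP]),
    with naive whitespace tokenization standing in for WordPiece."""
    tokens = ["[CLS]"] + segment_a.split() + ["[SEP]"]
    if segment_b is not None:
        tokens += segment_b.split() + ["[SEP]"]
    return tokens

# Single-segment task (e.g., sentiment analysis):
print(build_input("his acting was awful"))
# Paired-segment task (e.g., an NLI premise-hypothesis pair):
print(build_input("the facts are accessible", "you have access to the facts"))
```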

Methods
We experiment with task augmentation (TA) and self-training (ST) individually, as well as the combined approach STraTA, which uses the auxiliary-task model from task augmentation as the base model for self-training. We compare our methods to the following baselines:

LMFT & ITFT_MNLI: We compare our methods against commonly-used fine-tuning approaches, including target-task language model fine-tuning (LMFT; Howard and Ruder, 2018; Gururangan et al., 2020), in which a model is first trained with the language model objective on task-specific unlabeled data before being fine-tuned on the target task, and intermediate-task fine-tuning on MNLI (ITFT_MNLI; Phang et al., 2019), which first trains a model on MNLI before fine-tuning it on the target task.
Prompt-based/entailment-based fine-tuning approaches: We also include results from recent work on prompt-based (LM-BFF; Gao et al., 2021) and entailment-based (EFL; Wang et al., 2021) fine-tuning, which has been shown to outperform the GPT-3-style "in-context learning" approach (Brown et al., 2020) for few-shot learning. These approaches do not assume access to task-specific unlabeled data and are not directly comparable to our methods due to differences in model architecture and experimental settings. Du et al. (2021) propose a data augmentation method called SentAugment, which retrieves a large amount of "in-domain" data for a given task from a large bank of Web sentences. A base model trained using task-specific labeled data is then applied to obtain pseudo-labels for the retrieved sentences, which are then added to the original training set to train a better model. Their approach is complementary to ours, and combining the two is a promising direction for future work.

Results and Discussion
Table 2 shows the main results of our experiments with task augmentation and self-training. Below, we first provide an overview of these results before analyzing them in more detail.

Baselines: LMFT is not always helpful and can even hurt performance (e.g., on QNLI, a task built from Wikipedia, which is part of BERT's pre-training data). Du et al. (2021) also observe a decrease in performance when using LMFT with task-specific in-domain unlabeled data retrieved from the Web. ITFT_MNLI significantly outperforms LMFT in many cases, particularly on target tasks closely related to MNLI.
Task augmentation significantly improves results on downstream tasks: The first three blocks of Table 2 show the results for TA, which improves almost all target tasks across all three data regimes. TA even improves results on SNLI in the FULL regime, where a large amount of labeled data is available (570K examples). The data regime significantly impacts the average absolute performance gain over vanilla BERT_LARGE across target tasks, which is lowest in the FULL regime (+2.7%) and highest in the FEW-SHOT regime (+13.0%). SNLI (+41.7%) and RTE (+23.9%) benefit the most from TA in the FEW-SHOT regime. TA also significantly outperforms both LMFT and ITFT_MNLI, particularly in the low-data regimes (+16.4% and +4.8%, respectively).
Adding self-training further boosts downstream performance when task-specific unlabeled examples are available: The third block of Table 2 shows that in the FEW-SHOT regime, adding ST to TA, which results in STraTA, further boosts downstream performance. In particular, STraTA performs best across target tasks, achieving up to a +44.2% absolute improvement on SNLI over BERT_LARGE. Overall, STraTA provides an average absolute performance gain of +20.9% and +18.4% for BERT_BASE and BERT_LARGE, respectively. Using ST alone also leads to large improvements over the vanilla BERT models; however, the performance gain largely depends on the target task.

Comparison to recent published work: The last three rows of Table 2 and the last two rows of Table 3 show results from recent published work. Broadly, our methods lead to better performance compared to these approaches. However, due to differences in evaluation methodology (e.g., models, training/development data subsets, number of random restarts, and other factors), we refrain from explicitly ranking the approaches.

Analysis of few-shot learning results
Having established the effectiveness of both task augmentation and self-training in the few-shot setting, we conduct a series of analysis experiments in this section to explore the source of the observed improvements. (While Wang et al. (2021) report results for LM-BFF and EFL across 5 random data subsets using a fixed set of seeds, Du et al. (2021) tried 10 seeds for each of their 5 random data subsets and report the mean of the top 3 seeds. To be more comparable to Du et al. (2021), we report the mean of our top 3 random seeds in Table 3.)

Table 4 (columns: Model, SST-2, SciTail): Our approach yields improvements even when starting with a randomly-initialized model, but pre-training helps considerably.
Sample efficiency with task augmentation and self-training: see Figure 1.

STraTA improves a randomly-initialized base model: Table 4 shows that our STraTA approach does not require a powerful pre-trained base model to exhibit improvements: when applied to a randomly-initialized Transformer model (RAND_BASE) with the same architecture as BERT_BASE, RAND_BASE + STraTA outperforms the vanilla BERT_BASE by a large margin on SST-2, while being competitive on SciTail. Additionally, BERT_BASE + STraTA substantially outperforms the vanilla BERT_LARGE by 24% and 17.5% on SST-2 and SciTail, respectively.
Self-training on a broad distribution of pseudo-labeled data: Previous self-training algorithms (Rosenberg et al., 2005; McClosky et al., 2006; Sohn et al., 2020; Du et al., 2021) typically add a small set of unlabeled examples with the highest-confidence pseudo-labels to the labeled data set L at each iteration. In contrast, our approach adds all pseudo-labeled examples to L at every iteration, regardless of confidence. We compare the two approaches; in the traditional approach, once examples have been added they are not removed, and this process is repeated until the unlabeled set U is exhausted. This approach works well for the first several self-training iterations (3-5), but labeling accuracy then begins to degrade. In contrast, our algorithm (right plot) gradually and consistently improves labeling accuracy before converging at some iteration. These results suggest that strong base models benefit from including even significantly noisy pseudo-labels in self-training, as opposed to training on a narrow distribution of high-confidence predictions.
Does self-training work with out-of-domain/distribution (OOD) unlabeled examples? We investigate this question by applying self-training on top of BERT_BASE + TA. We consider SOURCE → TARGET task pairs where training data from the source task, without ground-truth labels, is used as OOD unlabeled data for the target task. We experiment with several task pairs, including MNLI → SciTail, SST-2 → CR, QQP → MRPC, and MNLI → RTE. The results are shown in Table 5.

Towards realistic evaluation in few-shot learning: In real-world low-resource scenarios, it is often impractical to rely on a development set (Oliver et al., 2018; Kann et al., 2019). With so little data, it may be more effective to use all labeled data for training. To examine the applicability of our methods to this real-world setting, we consider an evaluation that does not make use of a development set. Rather than using early stopping, we fine-tune each model for a fixed number of 512 steps. We checkpoint every 30 steps and evaluate a single model obtained by averaging the last 5 model checkpoints. For self-training, we perform a fixed number of 30 self-training iterations, each following the same fine-tuning procedure. Table 6 summarizes our results. Broadly, all models perform worse in this setting than when a development set is available. Our STraTA approach still provides significant improvements over BERT_BASE, but performs much worse than the same method used with a development set. We conjecture that this is because, without a development set, the model achieves somewhat lower accuracy at each self-training iteration, and these errors compound through later iterations.
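The checkpoint-averaging step can be sketched as follows (an illustration of ours, with checkpoints represented as flat name-to-weight dictionaries rather than real model state):

```python
def average_checkpoints(checkpoints):
    """Average the last k model checkpoints parameter-wise (k = 5 in the
    development-set-free evaluation described above)."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = sum(ckpt[name] for ckpt in checkpoints) / n
    return averaged

# Toy example with two "checkpoints" of a two-parameter model:
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
print(average_checkpoints(ckpts))  # {'w': 2.0, 'b': 1.0}
```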

Related Work
Improving language model fine-tuning: Fine-tuning has been the most common approach for applying pre-trained language models to downstream tasks. However, it typically requires a target dataset of thousands to hundreds of thousands of examples to work well (Yogatama et al., 2019; Brown et al., 2020). Many methods have been proposed to improve the performance and stability of pre-trained language models on small datasets, including language model fine-tuning on unlabeled data from the target domain (Howard and Ruder, 2018; Gururangan et al., 2020), intermediate-task fine-tuning (Phang et al., 2019), multi-task pre-finetuning (Aghajanyan et al., 2021a), better design choices and training strategies (Mosbach et al., 2021; Zhang et al., 2021), and regularization-oriented techniques (Jiang et al., 2020; Aghajanyan et al., 2021b). Most related to our work is research on intermediate-task fine-tuning that makes use of data-rich tasks (Phang et al., 2019), tasks that require complex reasoning and inference (Pruksachatkun et al., 2020), and beneficial relationships among tasks (Vu et al., 2020).
Few-shot learning: Our work also relates to research in few-shot learning. In previous work, fine-tuning is combined with other learning strategies to improve few-shot performance, including consistency training (Xie et al., 2020a), meta-learning (Bansal et al., 2020), self-training (Du et al., 2021; Sun et al., 2020), and contrastive learning (Gunel et al., 2021). Other work has focused on prompt-based/entailment-based few-shot learning approaches (Brown et al., 2020; Gao et al., 2021; Tam et al., 2021; Wang et al., 2021). Notably, Brown et al. (2020) demonstrate remarkable few-shot learning performance with a single frozen GPT-3 model, although its performance still lags far behind state-of-the-art fine-tuning results.
Semi-supervised learning: Another area upon which our work builds is semi-supervised learning (SSL). Recent work has combined self-training with other techniques, e.g., noise injection (He et al., 2020; Xie et al., 2020b) and consistency regularization with pseudo-labeling (Sohn et al., 2020), to develop powerful SSL algorithms. Du et al. (2021) show that self-training improves upon language model pre-training.

Limitations & Conclusion
Task augmentation and self-training provide complementary ways to leverage task-specific unlabeled data for improved downstream performance. While task augmentation utilizes unlabeled texts to synthesize a large amount of in-domain data for an auxiliary training task, self-training uses a model's predictions on unlabeled examples to improve the model itself. By combining these methods in STraTA, we are able to substantially improve sample efficiency across 12 NLP benchmark datasets. That said, each method has its own limitations. While our implementation uses NLI as the auxiliary task in task augmentation, there are target tasks for which NLI may not be helpful (e.g., grammatical acceptability judgments, as shown in Wang et al. (2019a)). Additionally, other auxiliary tasks may yield larger improvements (e.g., QNLI benefits more from QA tasks (Vu et al., 2020)). We leave the exploration of other auxiliary tasks to future work. Finally, our self-training algorithm (like prior approaches) assumes access to task-specific unlabeled examples, which might be non-trivial to acquire for some applications.

A Datasets
The datasets used in our experiments come from two common language understanding benchmarks: GLUE (Wang et al., 2019b) and SentEval (Conneau and Kiela, 2018). See Table 7 for details about dataset characteristics. We report F1 scores for QQP and MRPC, Spearman correlations for STS-B and SICK-R, and accuracy scores for the other tasks.
B Example outputs of our NLI data generator
Table 8 shows example outputs of our NLI data generator using unlabeled texts from SST-2 and SciTail.

C Full experiment results across data regimes
We report the full results for our experiments in the FULL, LIMITED, and FEW-SHOT data regimes in Table 9, Table 10, and … .

Table 8: Example outputs of our NLI data generator.

Unlabeled texts from SST-2

Input: a cast of a-list brit actors
entailment → the cast includes british actors.
neutral → the cast includes a-list brit actors such as nicole kidman and charlie chaplin.
contradiction → the cast was all new and unknown.

Input: both heartbreaking and heartwarming ... just a simple fable done in an artless sytle , but it 's tremendously moving
entailment → the story is moving and touching.
neutral → the artless style used to make the fable was a favorite of many.
contradiction → the fable is artful and sophisticated but it is also heartbreaking.

Input: gaping plot holes sink this ' sub ' -standard thriller and drag audience enthusiasm to crush depth
entailment → this movie's plot holes seriously affect the enjoyment of the movie.
neutral → the movies pace is too slow for most audiences.
contradiction → the movie is a fun, easy romp of a thriller.

Input: putting the primitive murderer inside a high-tech space station unleashes a pandora 's box of special effects that run the gamut from cheesy to cheesier to cheesiest
entailment → the special effects are all over the place.
entailment → putting a murderer in a space station leads to all sorts of cheesy special effects.
entailment → the murderer was inside a space station.

Input: vera has created a provocative , absorbing drama that reveals the curse of a self-hatred instilled by rigid social mores .
neutral → vera has created a provocative drama that is absorbing and has a great cast.
neutral → vera has created a provocative drama that shows the curse of self-hatred and shows the bonds between women.
neutral → vera's drama, which has been loved by millions, is about the curse of self-hatred.

Input: filmmaking from one of french cinema 's master craftsmen
contradiction → the filmmaking is from a newcomer to the french cinema.
contradiction → the filmmaking is from a new blood in the american cinema.
contradiction → the filmmaking of one of italian cinema's master craftsmen

Unlabeled texts from SciTail

Input: Surface tension is responsible for the fact that small insects can walk on water.
entailment → small insects can walk on water due to surface tension.
neutral → it's difficult for large insects to walk on water.
contradiction → small insects cannot walk on water because of surface tension.

Input: Tamiflu inhibits spread of virus.
entailment → the virus is less spread when tamiflu is used.
neutral → tamiflu is a good way to treat hiv.
contradiction → tamiflu promotes viral spread.

Input: Animals can transport the waste material known as urea in the circulatory system and store it safely at high concentrations.
entailment → animals can transport waste material in their circulatory systems.
neutral → animals can store urea in their bloodstreams for up to a year.
contradiction → only plants can transport the waste material known as urea in their circulatory systems.

Input: A number of applications of biomass, wind, hydropower and solar thermal are presently cost competitive with fossil fuels.
entailment → many biomass applications are now cost competitive with fossil fuels.
entailment → many alternatives to fossil fuels are cost competitive.
entailment → some biofuels are now cost competitive with fossil fuels.

Input: A cell wall is not present in animal cells.
neutral → in contrast, plant cells have a cell wall.
neutral → in addition to not having a cell wall, animal cells also lack mitochondria.
neutral → in animal cells, there is no cell wall, said the biologist.

Input: A bathymetric map can show the features of the bottom of a body of water.
contradiction → a bathymetric map shows the top of a body of water.
contradiction → a bathymetric map shows the features of the sky.
contradiction → a bathymetric map shows what is on the surface of the water.