Data Augmentation for Hypernymy Detection

The automatic detection of hypernymy relationships represents a challenging problem in NLP. The successful application of state-of-the-art supervised approaches using distributed representations has generally been impeded by the limited availability of high quality training data. We have developed two novel data augmentation techniques which generate new training examples from existing ones. First, we combine the linguistic principles of hypernym transitivity and intersective modifier-noun composition to generate additional pairs of vectors, such as “small dog - dog” or “small dog - animal”, for which a hypernymy relationship can be assumed. Second, we use generative adversarial networks (GANs) to generate pairs of vectors for which the hypernymy relation can also be assumed. We furthermore present two complementary strategies for extending an existing dataset by leveraging linguistic resources such as WordNet. Using an evaluation across 3 different datasets for hypernymy detection and 2 different vector spaces, we demonstrate that both of the proposed automatic data augmentation and dataset extension strategies substantially improve classifier performance.


Introduction
The detection of hypernymy relationships between terms represents a challenging commonsense inference problem and is a major component in recognising paraphrase and textual entailment in larger units of text. Consequently, it is important for Question-Answering, Text Simplification and Automatic Summarization. For example, There are lots of cars and vans at the port today. might be adequately summarised by There are lots of vehicles at the port today. as car and van both lexically entail, i.e. they are both hyponyms of the more general term vehicle.
Furthermore, the recognition and discovery of hyponym-hypernym relations is a foundational part of constructing taxonomies, which has a range of practical applications in a variety of domains such as Healthcare (Barisevičius et al., 2018) or Fashion.
Supervised methods have, however, been severely hampered by a lack of adequate training data. Not only has a paucity of labelled data been an obstacle to the adoption of deep neural networks and other more complex supervised methods, but two compounding problem-specific issues have been identified. First, there is a need to avoid lexical overlap between the training and test sets in order to avoid the lexical memorisation problem (Weeds et al., 2014; Levy et al., 2015a), where a supervised method simply learns the relationships between lexemes rather than generalising to their distributional features. Second, the performance of classifiers given just the hypernym word (at training and testing) has been shown to be almost as good as performance given both words (Weeds et al., 2014; Shwartz et al., 2017). This suggests that classifiers are learning the distributional features that make something a more general term or a more specific term. Our conjecture is that in order to learn the more complex function, more complex machinery, and hence more labelled data, are required.
In computer vision or speech recognition, it is common to use data augmentation to increase the size of the training set (Shrivastava et al., 2017; Park et al., 2019). The idea is that there are certain transformations of the data under which the class label remains invariant. For example, rotating an image does not change whether that image contains a face or not. By providing a supervised classifier with rotated examples, it can better generalise.
In this work, we consider the use of linguistic transformations to augment existing datasets for hypernymy detection. The challenge is to identify transformations that can be applied to the representations of two words that are known to be in a hypernym relationship, such that the entailment relation still holds between the transformed representations. We propose two ways to achieve this.
Our first augmentation technique is based on the hypothesis that lexical entailment is transitive and therefore invariant under certain compositions. For example, if A entails B and B entails C then A also entails C. Suitable candidates for A can be found by composing common intersective adjectives with the noun B. For example, if we know that car entails vehicle, then we can augment the dataset with fast car entails car and fast car entails vehicle.
Our second augmentation technique is based on the hypothesis that lexical entailment is invariant within a certain threshold of similarity. If A entails B, A′ is very similar to A and B′ is very similar to B, then A′ will also entail B′. In order to obtain vectors which are sufficiently similar to the words in the training data, we apply generative adversarial networks (GANs) to create realistic-looking synthetic vectors, from which we choose the most similar to the words in the training data.
We evaluate the proposed techniques on three hypernymy detection datasets. The first two are standard benchmark tasks in this area (Weeds et al., 2014; Baroni et al., 2012), both of which are generated from WordNet (Fellbaum, 1998). However, since many of the approaches to hypernymy classification involve vector space models which have been specialised using the entirety of WordNet, we need to guard against the danger that evaluations are simply measuring how well WordNet has been encoded, rather than how well the general hypernymy relationship has been learned. In light of this, we introduce a new dataset (that we call HP4K) which does not rely on WordNet in its construction.
We evaluate our two data augmentation techniques against two methods for increasing the size of the training data which rely on finding or mining more non-synthetic examples. First, we consider the extraction of additional examples from WordNet. Second, we consider extracting examples automatically from a Wikipedia corpus using Hearst Patterns (Hearst, 1992). This provides what one would expect to be an upper bound on what can reasonably be achieved with a similar number of synthetic examples generated using our data augmentation techniques.
Our contributions are thus threefold. First, we have identified two novel data augmentation techniques for the task of hypernymy detection which have the potential to generate almost limitless quantities of synthetic data. Second, we show, rather surprisingly, that adding synthetic data is more effective than adding non-synthetic data in almost all cases. Third, we release a new benchmark evaluation dataset for the lexical entailment task that is not dependent on WordNet.

Related Work
Data augmentation has recently become a very popular research topic in NLP and has successfully been applied in machine translation systems (Sennrich et al., 2016; Fadaee et al., 2017; Wang et al., 2018; Tong et al., 2019; Matos Veliz et al., 2019; Xia et al., 2019; Gao et al., 2019; Li and Specia, 2019), but also for tasks such as relation extraction (Can et al., 2019; Yan et al., 2019), text classification (Wei and Zou, 2019), or natural language inference (Kang et al., 2018; Junghyun et al., 2020). Most similar to our usage of GANs for data augmentation is the proposal of Kang et al. (2018), who leverage a GAN-based setup together with WordNet for data augmentation for natural language inference.
While to the best of our knowledge this work represents the first application of data augmentation for lexical entailment, a number of alternative approaches have been proposed. Most proposals rely on supervised methods for injecting an external source of knowledge into distributional representations. Starting with retro-fitting (Faruqui et al., 2015), vector-specialization methods modify existing representations to embed desired features (Vulić and Mrkšić, 2018; Rei et al., 2018; Kamath et al., 2019; Glavaš and Vulić, 2019).

Data Augmentation Strategies for Hypernymy Detection
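Given a labelled dataset $D_X$ of triples $\langle x^{(i)}_{hypo}, x^{(i)}_{hyper}, y^{(i)} \rangle$, with $x^{(i)}_{hypo}, x^{(i)}_{hyper} \in X$ and $y^{(i)} \in \{0, 1\}$, we define data augmentation as adding additional hyponym-hypernym triples $\langle {x'}^{(j)}_{hypo}, {x'}^{(j)}_{hyper}, y^{(j)} \rangle$ coming from an automatically generated augmentation set $A_{X'}$, such that ${x'}^{(j)}_{hypo}, {x'}^{(j)}_{hyper} \in X'$ and $y^{(j)} \in \{0, 1\}$, to the existing training set of $D$. We ensure that the data augmentation does not introduce any lexical overlap with the existing test set, i.e. $X_{test} \cap X' = \emptyset$, where $X_{test}$ denotes the vocabulary of the test set.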
Data augmentation strategies in NLP can roughly be divided into two categories: linguistically grounded augmentation and artificial augmentation. In the former, which has been the dominant paradigm in NLP, any additional instances that are added to a training set have an actual surface form representation, i.e. the data points correspond to actual words or sentences (Kim et al., 2019; Kumar et al., 2019; Gao et al., 2019; Andreas, 2020; Croce et al., 2020). The latter adds instances that are fully or partly artificial, meaning they do not correspond to any words or sentences. In this work we propose methods for both categories: data augmentation via distributional composition adds data points grounded in real language to a training set, whereas data augmentation based on GANs infers plausible points in latent space which, however, do not correspond to any real linguistic objects.
Furthermore, we distinguish between data augmentation and dataset extension, where in the former case we only leverage knowledge from the existing dataset and in the latter case we rely on expanding the training set with additionally mined hyponym-hypernym pairs. Below, we discuss two ways of augmenting and two ways of extending a training set. We make use of a cleaned October 2013 Wikipedia dump (Wilson, 2015) as a reference corpus to determine word and bigram frequencies.
Distributional Composition based Augmentation. We take a modified noun as being in a hypernymy relation with the unmodified noun. For example, we treat the pairs ⟨fast car, car⟩ and ⟨car, vehicle⟩ as expressing the same semantic relation when the modifier-noun compound is composed with an intersective composition function.
We focus on adjective-noun (AN) and noun-noun (NN) compounds extracted from our reference corpus, where each AN or NN phrase occurred at least 50 times. We filtered out pairs with non-subsective adjectives using a wordlist from Nayak et al. (2014). We consider two strategies for automatically constructing positive hyponym-hypernym pairs: simple positive cases such as ⟨small dog, dog⟩ or ⟨fast car, car⟩; and gapped positive cases that mimic the transitivity of hypernym relations, where we pair the hypernym of an existing hyponym-hypernym pair with a compound hyponym. For example, if ⟨car, vehicle⟩ is in the training data, we combine car with one of its modifiers to create the pair ⟨fast car, vehicle⟩.
We construct negative pairs from the simple positive cases using two strategies: creating compositional co-hyponyms such as ⟨fast car, red car⟩, where we keep the head noun fixed and pair it with two different modifiers; and creating perturbed simple positive examples, such as ⟨small dog, cat⟩, where we select the incorrect hypernym (e.g. cat) from the n most similar nouns to the head noun of the composed hyponym (e.g. dog). We apply the same methodology to the perturbed gapped positive examples, replacing the correct hypernym with a noun from the top n neighbours of the compositional hyponym's head noun. For example, given a positive pair such as ⟨dog, animal⟩, this would result in negative examples such as ⟨small dog, vehicle⟩, where the hyponym dog is paired with a modifier and the hypernym animal is replaced with one of its neighbours, in this case, vehicle.
In neural word embeddings, an additive composition function approximates the intersection of the corresponding feature spaces (Tian et al., 2017); hence, by creating positive pairs such as ⟨small dog, dog⟩, we encode the distributional inclusion hypothesis (Weeds et al., 2004; Geffet and Dagan, 2005) in the augmentation set.
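The pair construction described above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function names, the `modifiers` dictionary (a hypothetical mapping from a noun to modifiers attested in the reference corpus) and the string representation of phrases are all assumptions. The perturbed negatives are omitted because they additionally require a nearest-neighbour lookup.

```python
from itertools import combinations


def compose(phrase, vectors):
    """Compose a phrase vector by averaging the vectors of its words."""
    words = phrase.split()
    return [sum(dim) / len(words) for dim in zip(*(vectors[w] for w in words))]


def augment_by_composition(positive_pairs, modifiers):
    """Build synthetic (hyponym, hypernym, label) triples.

    positive_pairs: (hyponym, hypernym) nouns known to be in a hypernymy relation.
    modifiers: hypothetical dict mapping a noun to its attested modifiers.
    """
    augmented = []
    for hypo, hyper in positive_pairs:
        mods = modifiers.get(hypo, [])
        for mod in mods:
            phrase = f"{mod} {hypo}"
            augmented.append((phrase, hypo, 1))   # simple positive
            augmented.append((phrase, hyper, 1))  # gapped positive (transitivity)
        # Compositional co-hyponyms: same head noun, two different modifiers.
        for m1, m2 in combinations(mods, 2):
            augmented.append((f"{m1} {hypo}", f"{m2} {hypo}", 0))
    return augmented
```

For the pair ⟨car, vehicle⟩ with the modifiers fast and red, this yields two simple positives, two gapped positives and one co-hyponym negative. `compose` uses averaging, which is the composition function the Models section reports using.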
GAN based Augmentation. We create an augmentation set using Generative Adversarial Networks (Goodfellow et al., 2014). GANs consist of two model components, the generator and the discriminator, which are typically implemented as neural networks. The generator's task is to create data that mimics the distribution of the original data, while the discriminator's task is to distinguish between data coming from the real distribution and synthetic data coming from the generator. Both components are trained jointly until the generator succeeds in creating realistic data. Using GANs for data augmentation has been shown to be a successful strategy for a number of computer vision tasks (Shrivastava et al., 2017; Frid-Adar et al., 2018; Neff, 2018). Our goal is to create synthetic hyponym-hypernym pairs that are similar to real examples. Unlike most other scenarios involving GANs for NLP tasks, our generated vectors do not need to correspond to actual words.
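For reference, the joint training described above is usually written as the minimax game of Goodfellow et al. (2014). The symbols below ($G$ for the generator, $D$ for the discriminator, $p_{\text{data}}$ for the distribution of real vectors and $p_z$ for the noise distribution) follow that formulation and are not notation defined in this paper.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

In our setting $x$ would be a real noun vector and $G(z)$ a synthetic one.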
For our model, GANDALF, we used a list of ≈40K nouns for which we had vector representations as the "real" data input, and sampled the synthetic vectors from a Gaussian distribution, optimising a binary cross-entropy error criterion for the generator and the discriminator, which are both simple feedforward networks with a single hidden layer. We provide GANDALF's full model details in Appendix A. As an additional quality check for the generated vectors, we tested whether a logistic regression classifier could distinguish the synthetic and non-synthetic vectors. Typically, the accuracy of the classifier was between 0.55 and 0.65, meaning the classifier is barely able to distinguish between "real" vectors and generated ones.
Figure 1 illustrates the training loop of GANDALF as well as the selection process for constructing an augmented training set. Essentially, once GANDALF has been trained, the generator is used to create a large collection of synthetic noun vectors. To augment a dataset $D_X$, for each triple $\langle x_{hypo}, x_{hyper}, y \rangle \in D_X$ we find the $n$ synthetic vectors most similar to $x_{hypo}$ and the $n$ synthetic vectors most similar to $x_{hyper}$, and for each of the $n^2$ synthetic vector pairs $\langle x'_{hypo}, x'_{hyper} \rangle$ we create the triple $\langle x'_{hypo}, x'_{hyper}, y \rangle$. The augmented training set is formed by randomly sub-sampling this set of triples.
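The selection step can be sketched as follows, assuming the generator output is already available as a list of vectors. Cosine similarity is an assumption here, since the text only says "most similar", and the function names and the `seed` parameter are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, candidates, n):
    """Return the n candidate vectors most similar to the query."""
    return sorted(candidates, key=lambda c: cosine(query, c), reverse=True)[:n]


def gan_augment(triples, synthetic, n, sample_size, seed=0):
    """Create synthetic triples from (hypo_vec, hyper_vec, label) triples.

    For each real triple, the n nearest synthetic vectors of the hyponym are
    paired with the n nearest synthetic vectors of the hypernym (n * n pairs),
    each inheriting the label of the real triple. The pooled triples are then
    randomly sub-sampled.
    """
    pool = []
    for hypo, hyper, label in triples:
        for s_hypo in nearest(hypo, synthetic, n):
            for s_hyper in nearest(hyper, synthetic, n):
                pool.append((s_hypo, s_hyper, label))
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))
```

A brute-force sort is enough for a sketch; with a large collection of synthetic vectors an approximate nearest-neighbour index would be the more practical choice.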
WordNet based Extension. WordNet (Fellbaum, 1998) is a large manually curated lexical resource, covering a wide range of lexical relations between words, where groups of semantically similar words form "synsets". For each synset we extract all hypernyms and hyponyms of a given lexeme, and add each as a positive hyponym-hypernym pair if both the original lexeme and the extracted hypernym/hyponym occur at least 30 times in our reference corpus.
We construct negative pairs based on distributional similarity, where we calculate the pairwise cosine similarities between all lexemes in the positive set. Subsequently we use all antecedent (LHS) lexemes from the extracted positive pairs and select the top n most similar words for each antecedent as negative examples.
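The negative sampling step can be sketched as below. The function name and the `vectors` dictionary are hypothetical, and skipping candidates that are already known positives for the antecedent is an assumption that the text does not state.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_negatives(positive_pairs, vectors, n):
    """Pair each antecedent with its n most similar lexemes as negatives."""
    positives = set(positive_pairs)
    lexemes = sorted({w for pair in positive_pairs for w in pair})
    negatives = []
    for antecedent in sorted({hypo for hypo, _ in positive_pairs}):
        ranked = sorted(
            (w for w in lexemes if w != antecedent),
            key=lambda w: cosine(vectors[antecedent], vectors[w]),
            reverse=True,
        )
        # Assumption: never emit a known positive pair as a negative example.
        kept = [w for w in ranked if (antecedent, w) not in positives][:n]
        negatives.extend((antecedent, w, 0) for w in kept)
    return negatives
```

Because the negatives are drawn from close distributional neighbours, they tend to be hard cases such as co-hyponyms rather than random word pairs.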
Pattern based Extension. Hearst Patterns (Hearst, 1992) are textual patterns such as a car is-a vehicle, which can be automatically mined from text corpora in an unsupervised way. This has recently been shown to deliver strong performance on the hypernymy detection task (Roller et al., 2018). In this work, we leverage Hearst Patterns to mine additional hyponym-hypernym pairs in order to extend a training set. We treat any extracted noun pairs as additional positive examples and create the negative pairs in the same way as for the WordNet-based approach above.
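To make the idea concrete, the sketch below mines candidate pairs with three simplified surface patterns. It is only illustrative: the patterns are an assumed subset, they match single tokens without part-of-speech tagging or lemmatisation, and the names `PATTERNS` and `extract_candidates` are hypothetical.

```python
import re

# Each pattern exposes the named groups "hypo" and "hyper".
# Single-token matches only; no POS tagging, chunking or lemmatisation.
PATTERNS = [
    re.compile(r"\b(?P<hyper>\w+) such as (?:an? )?(?P<hypo>\w+)", re.IGNORECASE),
    re.compile(r"\b(?:an? )?(?P<hypo>\w+) is an? (?P<hyper>\w+)", re.IGNORECASE),
    re.compile(r"\b(?P<hypo>\w+) and other (?P<hyper>\w+)", re.IGNORECASE),
]


def extract_candidates(text):
    """Return lower-cased (hyponym, hypernym) candidates found in the text."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("hypo").lower(), match.group("hyper").lower()))
    return pairs
```

Patterns this loose also match sentences that do not express hypernymy at all, which is one source of the noisy candidates that make manual annotation or filtering necessary.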

Experiments
We evaluate our models on the datasets Weeds (Weeds et al., 2014) and LEDS (Baroni et al., 2012): well-studied and frequently used benchmarks for the hypernymy detection task (Roller et al., 2014; Vilnis and McCallum, 2014; Roller and Erk, 2016; Carmona and Riedel, 2017; Shwartz et al., 2017). Since both datasets use WordNet during construction, this can give rise to a bias in favour of those models that also make use of WordNet. To address this concern, we have created a new entailment dataset, HP4K, that makes use of Hearst Patterns, and is manually annotated, thereby avoiding the use of WordNet.
Weeds: The dataset is based on nouns sampled from WordNet, where each noun had to occur at least 100 times in Wikipedia and its predominant sense had to account for more than 50% of the occurrences in SemCor (Miller et al., 1993). We use the predefined split of Weeds et al. (2014), which avoids any lexical overlap between the training and evaluation sets.
LEDS: As there is no predefined training/evaluation split for this dataset, we make use of the 20-fold cross-validation methodology of Roller and Erk (2016), which avoids any lexical overlap between training and evaluation sets.
HP4K: We extracted Hearst Patterns from our reference Wikipedia corpus and randomly selected 4500 unigram pairs. Subsequently, we manually annotated each pair according to whether it constitutes a correct hyponymy-hypernymy relation or not. The labelling was carried out by 4 experienced annotators, all domain experts familiar with the problem of hypernymy detection. We split the annotators into two teams, with each team annotating one half of the dataset. The initial round of annotations resulted in a Cohen's κ score of 0.714, indicating substantial agreement (Viera and Garrett, 2005). Conflicts were resolved on a cross-team basis such that team A would resolve team B's annotation conflicts and vice versa.
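As a reminder of how the agreement statistic is defined, Cohen's κ compares the observed agreement p_o between two annotators with the agreement p_e expected by chance, κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label sequences; the function name is hypothetical and the code makes no claim about how the scores of the two annotation teams were aggregated.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With toy labels `[1, 1, 0, 0]` and `[1, 0, 0, 0]`, observed agreement is 0.75 and chance agreement is 0.5, giving κ = 0.5.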
During annotation we noticed that positive pairs typically fall into one of two categories. Either they were "true" subtype-supertype relations, such as ⟨dog, animal⟩, or they were individual-class relationships where the hyponym is typically a named entity and represents a specific instance of the more general class, as for example in ⟨Nirvana, band⟩.
Negative pairs were of a more diverse nature and included a range of different relations, such as co-hyponyms, meronyms or reversed hyponym-hypernym pairs. Negative pairs can also consist of two random nouns or two nouns without any semantic relation, due to some amount of noise in extracting candidates solely on the basis of Hearst Patterns. Table 1 shows positive and negative examples from the dataset.
HP4K consists of 4369 pairs with a class distribution of 45:55 (positive : negative). Subsequently we split the dataset into a training and evaluation set, ensuring that there is no lexical overlap between the two sets. This resulted in a training set of size 3426 and an evaluation set of size 943.
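One way to obtain a split without lexical overlap is to treat the pairs as edges of a graph over words and to assign whole connected components to either side. The sketch below does this with a small union-find structure. It is an assumption for illustration only: the text does not describe how the HP4K split was produced, and the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def lexical_split(triples, test_fraction=0.2, seed=0):
    """Split (hypo, hyper, label) triples so that no word occurs in both sets."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    # Words that co-occur in a pair must end up on the same side.
    for hypo, hyper, _ in triples:
        parent[find(hypo)] = find(hyper)

    components = defaultdict(list)
    for triple in triples:
        components[find(triple[0])].append(triple)

    groups = list(components.values())
    random.Random(seed).shuffle(groups)
    target = test_fraction * len(triples)
    train, test = [], []
    for group in groups:
        # Fill the test set first, then send the remaining components to train.
        (test if len(test) < target else train).extend(group)
    return train, test
```

This keeps every pair, but the test set can overshoot the requested fraction, and a single very large component makes a balanced split impossible. An alternative is to partition the vocabulary first and discard pairs that straddle the two halves.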

Models
We conduct experiments with two distributional vector space models, word2vec (Mikolov et al., 2013) and HyperVec (Nguyen et al., 2017). HyperVec is based on word2vec's skip-gram architecture and leverages WordNet to optimise the word representations for the hypernym detection task. Hierarchical information is encoded in the norm of the learned vectors, such that lexemes higher up in the hypernymy hierarchy have larger norms than lexemes in lower parts.
For word2vec we use the 300-dimensional pre-trained Google News vectors (footnote 8), and for HyperVec we trained 100-dimensional embeddings on an October 2013 Wikipedia dump (Wilson, 2015), using the recommended settings of Nguyen et al. (2017), as our augmentation sets contained many words that were OOV in the pre-trained HyperVec vectors (footnote 9).
In our experiments, we consider a supervised scenario where a classifier predicts a hyponym-hypernym relation between two given word embeddings. We use two different models as classifiers: a logistic regression classifier (LR) and a 3-layer feedforward neural network (FF). In both cases, the classifier takes the aggregated hypothesised hyponym-hypernym pair as input and predicts whether the pair is in a hyponym-hypernym relation. We report a detailed overview of the model parameterisation in Appendix A.
The two models share the same procedure for aggregating the word embeddings of the hypothesised hyponym-hypernym pair. For data augmentation based on distributional composition, we use vector averaging as composition function, which gave substantially better performance than addition in preliminary experiments.

Results
For the FF network, we performed 10-fold cross-validation on the Weeds and HP4K training sets to tune hyperparameters. As our evaluation for LEDS is based on a 20-fold cross-validation split, rather than a pre-defined training/evaluation split as for Weeds and HP4K, the same procedure for hyperparameter tuning is not straightforwardly applicable without exposing the model to some of the evaluation data. However, we found that the top parameterisations for Weeds and HP4K were quite similar and therefore applied hyperparameters to the FF model for LEDS that performed well in 10-fold cross-validation on Weeds and HP4K. For data augmentation and dataset extension, we consider the following amounts of additional data: {0.2K, 1K, 2K, 4K, 10K, 20K, 40K}. All augmentation sets are balanced between positive and negative pairs.
Figure 2 shows the increase in absolute points of accuracy for the LR and FF models, as well as both vector spaces, averaged across all datasets. While overall both data augmentation and dataset extension have a positive impact, the gains are larger for the FF model, suggesting that a higher-capacity model is necessary to leverage the additional information from the augmentation source more effectively. Furthermore, before starting our experiments we expected that extending an existing dataset with WordNet would represent an upper bound on performance, given that WordNet is manually annotated and curated. However, in our experiments data augmentation by either distributional composition or GANDALF regularly surpassed the performance of the WordNet-based extension technique.
The effect of data augmentation and dataset extension in absolute points of accuracy on each dataset individually for the FF model is shown in Figure 3. It highlights consistent improvements across the board, with only a single performance degradation in the case of extending the LEDS dataset with Hearst Patterns when using HyperVec-based word representations. The results per dataset for the LR model are presented in Appendix A and show that the LR model is less effective in leveraging the augmented data, causing more frequent performance drops. This suggests that models with more capacity are able to make more efficient use of additional data and are more robust in the presence of noise, which is inevitably introduced by automatic methods.
8 Available from: https://code.google.com/archive/p/word2vec/.
9 We used the HyperVec code from www.ims.uni-stuttgart.de/data/hypervec.
Table 2 compares our FF model using word2vec embeddings with all proposed techniques for augmenting or extending a dataset. Our techniques are able to outperform a non-augmented model by 4-6 points in accuracy, representing a relative error reduction of 14%-26%. While the primary objective in this work is to improve an existing model setup with data augmentation, our augmented models compare favourably with previously published results (footnote 10). In general, data augmentation by distributional composition or by GANDALF overcomes two key weaknesses of simply extending a dataset with more data from WordNet or Hearst Patterns. First, many of the hyponym-hypernym pairs we mined from WordNet contain low-frequency words, which may have poor representations in our vector space models. Second, while using Hearst Patterns typically returned higher frequency words, the retrieved candidates frequently did not represent hyponymy-hypernymy relationships.
10 We note that due to the use of different performance metrics and cross-validation splits, direct model-to-model comparisons are difficult on the LEDS and Weeds datasets. Thus we only compare to approaches that use the same evaluation protocol as we do.
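The relative error reduction quoted above relates an absolute accuracy gain to the error of the non-augmented baseline. Assuming the usual definition with error taken as one minus accuracy (the symbols $\mathrm{acc}_{\text{base}}$ and $\mathrm{acc}_{\text{aug}}$ are introduced here for illustration only), it can be written as:

```latex
\mathrm{RER}
  = \frac{(1 - \mathrm{acc}_{\text{base}}) - (1 - \mathrm{acc}_{\text{aug}})}{1 - \mathrm{acc}_{\text{base}}}
  = \frac{\mathrm{acc}_{\text{aug}} - \mathrm{acc}_{\text{base}}}{1 - \mathrm{acc}_{\text{base}}}
```

Under this definition the same absolute gain corresponds to a larger relative reduction when the baseline is already strong, which is why the two ranges do not scale proportionally.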

Analysis
The concrete amount of data augmentation, i.e. the number of additional hyponym-hypernym pairs that are added to the training set, represents a tuneable parameter. Figure 4 shows the effect of varying amounts of data augmentation for the FF model, using word2vec representations, across all datasets. We note that all amounts of additional augmentation data share the same quality, i.e. it is not the case that a smaller augmentation set consists of "better data" or contains less noise than a larger set. For the Weeds and LEDS datasets, peak performance is typically achieved with smaller amounts of additional data, whereas for the HP4K dataset optimal performance is achieved with larger amounts of augmentation data. One explanation for the different augmentation characteristics of the HP4K dataset in comparison to the other two datasets is its independence of WordNet during the development of the dataset.

Data Augmentation in Space
In order to visualise what area of the vector space the GANDALF vectors and the composed vectors inhabit, we created a t-SNE (van der Maaten and Hinton, 2008) projection of the vector spaces in Figure 5. For the visualisation we produced the nearest neighbours of standard word2vec embeddings and augmentation embeddings for 5 exemplary words and projected them into the same space. Figure 5 shows that the generated augmentation points, marked with an "x", fit in with the real neighbours and do not deviate much from the "natural" neighbourhood of a given word. GANDALF vectors typically inhabit the edges of a neighbour cluster, whereas composed vectors are frequently closer to the cluster centroid.
Table 3 lists the nearest neighbours for the example words. For word2vec and the composed augmentation vectors, we simply list the nearest neighbours of each query word. For GANDALF we list the nearest neighbours of the generated vector that correspond to actual words. For example, if the vector GANDALF-234 is closest to the representations of sugar, GANDALF-451 and mountain, we only list sugar and mountain as neighbours of GANDALF-234. The composed neighbours for each word are typically closely related to the original query, e.g. raw sugar for sugar, or zoo animal for animal. The GANDALF neighbours, on the other hand, have a much weaker association with the query word, but are frequently still related to it in some way, as in the example of akpeteshie, a spirit made from sugar cane, as a neighbour for sugar.

Data Augmentation as Regularisation
In the past, a prominent criticism of distributional methods for hypernymy detection was that such models were found to frequently identify features of a prototypical hypernym in the distributional representations, rather than being able to dynamically focus on the relevant features that are indicative of a hypernymy relation for a specific pair of words (Weeds et al., 2014; Levy et al., 2015b). We therefore briefly investigate whether data augmentation can be used as a regularisation mechanism that helps prevent models from overfitting on prototypical hypernym features.
Table 4 shows the results on the Weeds dataset using a hypernym-only FF model with word2vec representations, in comparison to the same model variant that makes use of the hyponym and the hypernym. Ideally, we would hope to see weak performance for the hypernym-only model and strong performance for the full model. This would indicate that the classifier does not rely on prototypical features in the hypernym, but is able to focus on specific features in a given hyponym-hypernym pair. For data augmentation by distributional composition there appears to be a correlation between the performance of the hypernym-only and the full model, i.e. a stronger model on the whole dataset also results in better performance for the hypernym-only model. Hence augmentation by distributional composition might not be effective in helping the model to generalise in its current form. For augmentation with GANDALF, however, performance for the full model improves, while performance of the hypernym-only model slightly drops, suggesting that the evoked GANDALF representations have a regularisation effect, while also improving generalisation. Hence, a fruitful avenue for future work will be further leveraging data augmentation for regularisation.

Conclusion
In NLP, in contrast to computer vision, data augmentation has not been applied as standard due to the apparent lack of universal rules for label-invariant language transformations.
We have considered the problem of hypernymy detection, and proposed two novel techniques for data augmentation. These techniques rely on semantic rules rather than an external knowledge source, and have the potential to generate almost limitless synthetic data for this task. We demonstrate that, in most cases, these techniques perform better than extending the training set with additional non-synthetic data drawn from an external knowledge source. Our results are consistent across evaluation benchmarks, word vector spaces and classification architectures. We have also shown that our approach is effective even when the word vector space model has already been specialised for hypernymy detection.
Since WordNet is widely used as a source of information about semantic relations, we have proposed a new evaluation benchmark that is independent of WordNet. Whilst results are lower across the board on this dataset, suggesting that it is more difficult than the others, we see the same pattern of increasing performance with a more complex classifier and the use of data augmentation.
Future work includes leveraging data augmentation for more complex models and the extension of our approach to a multilingual setup as well as domains with a more specialised vocabulary such as Healthcare or Fashion.

A.1 GANDALF Model Details
The generator and discriminator in GANDALF are single-layer feedforward networks, with tanh activations and a dropout ratio (Srivastava et al., 2014) of 0.3. We used ADAM (Kingma and Ba, 2014) to optimise a binary cross-entropy error criterion with a learning rate of 0.0002 and β values of 0.5 and 0.999. We found that GANDALF required quite a bit of wizardry to achieve strong performance and we found the website https://github.com/soumith/ganhacks very helpful. For example, we applied label noise and soft labels (Salimans et al., 2016) and used a batch normalisation layer (Ioffe and Szegedy, 2015), which had the largest impact on model performance. GANDALF is implemented in PyTorch (Paszke et al., 2017) and we release our code on https://github.com/tttthomasssss/le-augmentation.

A.2 Model Details
For our linear model we use the logistic regression classifier implemented in scikit-learn (Pedregosa et al., 2011). Our neural network model is a 3-layer feedforward model implemented in PyTorch (Paszke et al., 2017).
We tuned the parameters of the feedforward neural network by 10-fold cross-validation on the respective training sets, except for LEDS, where we chose the parameters on the basis of a model that performed well on Weeds and HP4K. Our parameter grid consisted of activation function: {tanh, relu}, dropout: {0.0, 0.1, 0.3} and hidden layer sizes, where we considered {200-200-200, 200-100-50, 200-50-30} for HyperVec and {600-600-600, 600-400-200, 600-300-100, 600-200-50} for word2vec. We furthermore considered 3 different aggregation functions: diff (Weeds et al., 2014), which simply takes the elementwise difference of the embedding pair; asym (Roller et al., 2014), which is the concatenation of the difference and the squared difference of the embedding pair; and concat-asym (Roller and Erk, 2016), which is the concatenation of the embedding pair, their difference, and their squared difference. We trained all models for 30 epochs with early stopping and used ADAM with a learning rate of 0.01 to optimise a cross-entropy error criterion.
Figure 6 below shows the effect of data augmentation in terms of points of accuracy for the logistic regression classifier per vector space model and dataset. Unlike for the higher-capacity feedforward model, data augmentation frequently causes performance to go down for the simpler linear model. This suggests that more complex models are required to fully leverage the additional information from the augmentation sets.
Table 5 below gives an overview of the complete results for both classifiers and vector space models, across all datasets. It shows the consistent positive effect of data augmentation on the more complex feedforward model in comparison to the logistic regression classifier, which is less robust to the small amounts of noise that are inevitably introduced by the automatic augmentation algorithm.
Table 5: Accuracy scores for the data augmentation strategies (DC and GAN), and the two dataset extension strategies (WN and HP), and the baseline that neither uses augmentation nor extension (None). Boldfaced results denote top performance per vector space and dataset, underlined results denote improved performance in comparison to the baseline without data augmentation.
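The three aggregation functions considered in A.2 can be sketched as follows. The direction of the difference (hypernym minus hyponym) is an assumption, since the text only speaks of "the elementwise difference", and the function names simply mirror the labels used above.

```python
def diff(hypo, hyper):
    """Element-wise difference of the embedding pair.

    Assumption: computed as hypernym minus hyponym.
    """
    return [b - a for a, b in zip(hypo, hyper)]


def asym(hypo, hyper):
    """Concatenation of the difference and the squared difference."""
    d = diff(hypo, hyper)
    return d + [x * x for x in d]


def concat_asym(hypo, hyper):
    """Concatenation of both embeddings, their difference and squared difference."""
    return list(hypo) + list(hyper) + asym(hypo, hyper)
```

For d-dimensional embeddings the three functions produce classifier inputs of size d, 2d and 4d respectively, so the choice of aggregation also changes the size of the input layer.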