HypMix: Hyperbolic Interpolative Data Augmentation

Interpolation-based regularization methods for data augmentation have proven effective for various tasks and modalities. These methods involve performing mathematical operations over the raw input samples or their latent state representations, vectors that often possess complex hierarchical geometries. However, these operations are performed in the Euclidean space, which simplifies these representations and may lead to distorted and noisy interpolations. We propose HypMix, a novel model-, data-, and modality-agnostic interpolative data augmentation technique operating in the hyperbolic space, which captures the complex geometry of input and hidden state hierarchies better than its contemporaries. We evaluate HypMix on benchmark and low resource datasets across speech, text, and vision modalities, showing that HypMix consistently outperforms state-of-the-art data augmentation techniques. In addition, we demonstrate the use of HypMix in semi-supervised settings. We further probe into the adversarial robustness of HypMix and the qualitative inferences we draw from it, which elucidate the efficacy of Riemannian hyperbolic manifolds for interpolation-based data augmentation.


Introduction
Deep learning methods have improved the state-of-the-art in a wide range of tasks. Yet, when only limited training data is available, they are prone to overfitting (Zou and Gu, 2019). Numerous data augmentation techniques have been proposed, which involve performing operations such as cropping or rotation (Lecun et al., 1998), or paraphrasing (Kumar et al., 2019) individual examples. However, these methods are modality- or dataset-dependent and require domain expertise. Compared to such alteration-based methods, interpolation-based approaches such as Mixup (Zhang et al., 2018) have shown improved performance and generalizability across different modalities. Mixup generates virtual training samples from convex combinations of individual inputs and labels to expand the training distribution. Performing Mixup over the latent representations of inputs has led to further improvements, as the hidden states of deep neural networks carry more information than raw input samples (Verma et al., 2019a; Chen et al., 2020a). However, most data augmentation methods can only utilize existing labeled data.
Semi-supervised learning methods, on the other hand, can leverage unlabeled data for training. Several semi-supervised methods use interpolation-based regularization over unlabeled samples to predict soft labels, and combine them with existing labeled samples to increase the overall training data (Verma et al., 2019b; Chen et al., 2020b). Semi-supervised methods use consistency-based regularization (Miyato et al., 2019), which makes model predictions robust to perturbations of unlabeled samples. However, current semi-supervised learning methods do not generalize across modalities or datasets.
Existing data-augmentation and semi-supervised learning methods operate in the Euclidean space, which is a simplified representative geometry. Representations across modalities inherently possess properties that the Euclidean space is incapable of modeling, and can be better expressed using the more general hyperbolic space (Ganea et al., 2018). The interference of sound waves is hyperbolic in nature, generating hyperboloid waveforms (Khan and Panigrahi, 2016). Natural language text exhibits hierarchical structure in a variety of respects, and embeddings are more expressive when represented in the hyperbolic space (Dhingra et al., 2018). Data augmentation using Möbius operations over images has shown more diversification and generalization compared to Euclidean operations (Zhou et al., 2021). Performing interpolative operations in the hyperbolic space over representations with complex geometry can thus lead to more suitable representations for model training.
Building on prior research in limited data and data augmentation studies, and the hyperbolic characteristics of speech, text, and vision, we propose HYPMIX: a model-, data-, and modality-agnostic interpolative regularization method operating in the hyperbolic space. We further extend HYPMIX to semi-supervised settings, which is especially effective in extremely low resource environments. We probe the effectiveness of HYPMIX through extensive experiments over three different tasks for supervised and semi-supervised settings on benchmark and low resource datasets across speech, text, and vision in different languages with varying class label distributions. HYPMIX outperforms current state-of-the-art modality- and task-specific data augmentation methods across all datasets in both supervised and semi-supervised conditions.
Our contributions can be summarized as: • We propose HYPMIX, a novel model-, data-, and modality-agnostic interpolative regularization-based data augmentation method operating in the hyperbolic space.
• We devise a novel Möbius Gyromidpoint Label Estimation (MGLE) method to predict soft labels for unlabeled data, and extend HYPMIX to a hyperbolic semi-supervised learning method.
• HYPMIX outperforms several strong baselines and Euclidean counterparts across speech, text, and vision across benchmark and low-resource datasets, including semi-supervised settings for Urdu and Arabic tasks.
• We further probe the effectiveness of HYPMIX in comparison to existing methods through layerwise ablation studies and adversarial robustness.

Background and Related Work
Data Augmentation enables use of limited training data, with approaches involving modifying the individual training instances, such as cropping (Simonyan and Zisserman, 2015) or paraphrasing (Wei and Zou, 2019; Kumar et al., 2019). Mixup techniques (Zhu et al., 2019) perform interpolation among input samples and have proven to perform better than modifying individual instances, as they incorporate the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets. Recent works (Jindal et al., 2020a; Verma et al., 2019a) perform Mixup operations over hidden state representations of input samples instead of the inputs, as high-level representations are often low-dimensional and carry more useful information than raw inputs. However, latent interpolation methods have not been generalized across modalities and operate in the simplified Euclidean space, which is unable to capture the complex characteristics possessed by latent state representations.
Semi-supervised Learning methods leverage unlabeled data which is typically available in larger quantities (Clark et al., 2018). Consistency regularization methods for semi-supervised learning predict soft labels for unlabeled data and train models on different permutations of labeled and unlabeled data (Verma et al., 2019b;Chen et al., 2020a). Chen et al. (2020b) uses a label guessing strategy on different augmentations of unlabeled data and combines it with labeled data for training models. However, these methods perform label prediction for unlabeled data using Euclidean operations.
Hyperbolic Learning has proven to be effective in representing information where relations among data points possess a hierarchical and tree-like nature (Aldecoa et al., 2015). Learning in the hyperbolic space has been applied to various natural language processing (Dhingra et al., 2018; Gulcehre et al., 2019; Tay et al., 2018) and computer vision tasks (Khrulkov et al., 2020; Peng et al., 2020), as well as graph (Chami et al., 2019), sequence (Tay et al., 2018), and financial (Sawhney et al., 2021) learning. However, the ability of the hyperbolic space to model complex representations while performing interpolative operations across modalities is unexplored.

Figure 1: Overview of HYPMIX and MIXH applied at layer k over hidden representations of inputs x_i and x_j. We perform the forward pass for the inputs up to layer k, and use the mixed representation for the continued pass.

Methodology: HYPMIX
We first formulate the task and introduce the hyperbolic space (Ganea et al., 2018) and Mixup (Zhang et al., 2018) (§3.1). Using the hyperbolic operations, we then introduce Mixup in the hyperbolic space (§3.2), and extend it to operate on the hidden state representations of neural networks. We call the resulting approach HYPMIX. An overview of the steps is presented in Figure 1. We formulate HYPMIX for both supervised (§3.3) and semi-supervised (§3.4) methods. We test HYPMIX on classification tasks across speech, text, and vision.

Preliminaries
Hyperbolic Space is a non-Euclidean geometry with constant negative curvature. We use the Poincaré ball model of the hyperbolic space (Ganea et al., 2018), defined as (B, g^B_x), where the manifold B = {x ∈ R^n : ||x|| < 1} is endowed with the Riemannian metric g^B_x = λ_x^2 g^E, where the conformal factor λ_x = 2 / (1 − ||x||^2) and g^E = diag[1, ..., 1] is the Euclidean metric tensor. We denote the tangent space centered at point x as T_xB. We use the Möbius gyrovector space to generalize standard mathematical operations to the hyperbolic space:

Möbius Addition ⊕ for a pair of points x, y ∈ B is given by

x ⊕ y = ((1 + 2⟨x, y⟩ + ||y||^2) x + (1 − ||x||^2) y) / (1 + 2⟨x, y⟩ + ||x||^2 ||y||^2)

where ⟨·, ·⟩ denotes the Euclidean inner product given by ⟨x, y⟩ = x_0 y_0 + x_1 y_1 + ... + x_{n−1} y_{n−1}, and || · || denotes the norm given by ||x|| = √⟨x, x⟩. We define the exponential and logarithmic maps to project vectors between the Euclidean and hyperbolic spaces.

Exponential Mapping maps a tangent vector v ∈ T_xB to the point exp_x(v) on the Poincaré ball,

exp_x(v) = x ⊕ (tanh(λ_x ||v|| / 2) v / ||v||)

Logarithmic Mapping maps a point y ∈ B to the point log_x(y) on the tangent space at x,

log_x(y) = (2 / λ_x) artanh(||−x ⊕ y||) (−x ⊕ y) / ||−x ⊕ y||

For exponential and logarithmic mapping, we choose the tangent space center x = 0 and use exp_0(·) and log_0(·).
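As an illustrative sketch (not the authors' code), the Poincaré ball operations above can be written in NumPy; the helper names `mobius_add`, `exp0`, and `log0` are our own, and we fix the curvature to 1 and the tangent space center to x = 0 as in the paper:

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition x ⊕ y on the Poincare ball (curvature 1)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    den = 1 + 2 * xy + x2 * y2
    return num / den

def exp0(v):
    """Exponential map at the origin: tangent vector -> point on the ball."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros_like(v)
    return np.tanh(n) * v / n     # lambda_0 = 2, so tanh(lambda_0 * n / 2) = tanh(n)

def log0(y):
    """Logarithmic map at the origin: point on the ball -> tangent vector."""
    n = np.linalg.norm(y)
    if n < 1e-12:
        return np.zeros_like(y)
    return np.arctanh(n) * y / n
```

By construction, log_0 inverts exp_0, and the origin is the identity of Möbius addition.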

Möbius Scalar Multiplication multiplies a point x ∈ B by a scalar r ∈ R,

r ⊗ x = tanh(r artanh(||x||)) x / ||x||

Möbius Gyromidpoint M_g calculates the hyperbolic weighted average of gyrovectors {x_1, ..., x_n} with weights {α_1, ..., α_n},

M_g(x_1, ..., x_n; α_1, ..., α_n) = (1/2) ⊗ (Σ_i α_i λ_{x_i} x_i / Σ_j α_j (λ_{x_j} − 1))

Mixup (Zhang et al., 2018) involves training a neural network on convex combinations of pairs of instances and their labels. For two labeled data points (x_i, y_i) and (x_j, y_j), Mixup uses linear interpolation with mixing ratio r to generate the synthetic sample x' and corresponding mixed label y',

x' = r x_i + (1 − r) x_j,   y' = r y_i + (1 − r) y_j

By leveraging these hyperbolic operations and Mixup, we define Mixup in the hyperbolic space.
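Continuing the illustrative sketch (our own helper names, curvature fixed to 1), Möbius scalar multiplication, the gyromidpoint, and Euclidean Mixup can be written as:

```python
import numpy as np

def mobius_scalar_mul(r, x):
    """Mobius scalar multiplication r ⊗ x on the Poincare ball (curvature 1)."""
    n = np.linalg.norm(x)
    if n < 1e-12:
        return np.zeros_like(x)
    return np.tanh(r * np.arctanh(n)) * x / n

def gyromidpoint(points, weights):
    """Mobius gyromidpoint M_g of gyrovectors `points` with `weights`."""
    lam = np.array([2.0 / (1.0 - np.dot(p, p)) for p in points])  # conformal factors
    w = np.asarray(weights, dtype=float)
    num = sum(wi * li * p for wi, li, p in zip(w, lam, points))
    den = np.sum(w * (lam - 1.0))
    return mobius_scalar_mul(0.5, num / den)

def mixup(a, b, r):
    """Euclidean Mixup: convex combination with mixing ratio r."""
    return r * a + (1 - r) * b
```

A useful sanity check is that the gyromidpoint of a single point is the point itself, and r = 1 in Möbius scalar multiplication acts as the identity.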

Formulating Mixup in Hyperbolic Space
For inputs possessing complex geometrical properties, performing mathematical operations in the Euclidean space often leads to vectorial distortions, which can be stabilized by using the hyperbolic space (Ganea et al., 2018). To minimize these distortions, we formulate MIXH, Mixup in the hyperbolic space, by leveraging hyperbolic operations as building blocks. First, we replace Euclidean addition (+) and the scalar product (·) with their Möbius counterparts: addition (⊕) and scalar multiplication (⊗), respectively. We then transform inputs to the hyperbolic space using the exponential mapping exp_0(·), perform Mixup to generate convex combinations of pairs of inputs x_i, x_j, and map the result back to the Euclidean space using the logarithmic mapping log_0(·). Formally,

mix_H(x_i, x_j) = log_0(r ⊗ exp_0(x_i) ⊕ (1 − r) ⊗ exp_0(x_j))

We now extend MIXH as a generalizable interpolative regularizer over hidden state representations across neural network layers.
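Putting the pieces together, a minimal NumPy sketch of MIXH (our own illustrative implementation, assuming curvature 1 and the origin-centered maps of §3.1) is:

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition on the Poincare ball (curvature 1)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def mobius_scalar_mul(r, x):
    """Mobius scalar multiplication r ⊗ x."""
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n < 1e-12 else np.tanh(r * np.arctanh(n)) * x / n

def exp0(v):
    """Exponential map at the origin."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-12 else np.tanh(n) * v / n

def log0(y):
    """Logarithmic map at the origin."""
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n < 1e-12 else np.arctanh(n) * y / n

def mix_h(xi, xj, r):
    """MIXH: map inputs to the ball, interpolate with Mobius ops, map back."""
    return log0(mobius_add(mobius_scalar_mul(r, exp0(xi)),
                           mobius_scalar_mul(1 - r, exp0(xj))))
```

As with Euclidean Mixup, the endpoints are recovered at r = 1 and r = 0, and mixing a point with itself returns the point.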

HYPMIX: Interpolative MIXH
Previous works (Chen et al., 2020b; Jindal et al., 2020b) applying interpolation-based regularization in the latent space of neural networks operate in the Euclidean space, which cannot capture the complex geometries of hidden state vectors (Tifrea et al., 2019). To better model the fine-grained information present in latent representations using the hyperbolic space, we extend MIXH to the hidden representation space. Let f_θ(·) denote any general base model with parameters θ having N layers, where f_{θ,n}(·) denotes the n-th layer of the model, h_n is the hidden state vector at layer n for n ∈ [1, N], and h_0 denotes the input vector. We introduce HYPMIX as hyperbolic interpolation at a layer k ∼ [1, N], for which we first calculate the latent representations separately for the inputs up to the k-th layer. For input samples x_i, x_j, we let h^i_n, h^j_n denote their respective hidden state representations at layer n of f_θ(·),

h^i_n = f_{θ,n}(h^i_{n−1}),   h^j_n = f_{θ,n}(h^j_{n−1}),   n ∈ [1, k]

We then apply MIXH over the individual hidden state representations h^i_k, h^j_k from layer k as:

h'_k = mix_H(h^i_k, h^j_k)

The mixed hidden representation h'_k is used as the input for the continuing forward pass,

h'_n = f_{θ,n}(h'_{n−1}),   n ∈ [k + 1, N]

We define HYPMIX(f_θ(·), r, k) for a layer k and mixing ratio r to obtain the final hidden layer representation h'_N.

Supervised Network Optimization For classification, we apply a perceptron g_φ(·) with parameters φ to calculate the class logits from the final hidden state output h'_N. We optimize the model using the KL-divergence loss (KL) to bring the model output distribution closer to the mixed label distribution. We minimize the loss L between the mixed label y' and the logits obtained from HYPMIX,

L = KL(mix(y_i, y_j) || g_φ(HYPMIX(x_i, x_j)))
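The interpolated forward pass above can be sketched as follows; `hypmix_forward` and the toy layers are our own illustrative names, and `mix_fn` stands in for MIXH (a plain Euclidean mix is used in the example only to keep the sketch short and easily checkable):

```python
import numpy as np

def hypmix_forward(layers, xi, xj, k, r, mix_fn):
    """Forward x_i and x_j separately through layers 1..k, mix their hidden
    states at layer k with mix_fn, then continue one forward pass to layer N."""
    hi, hj = xi, xj
    for layer in layers[:k]:        # independent passes up to layer k
        hi, hj = layer(hi), layer(hj)
    h = mix_fn(hi, hj, r)           # interpolate in the hidden space
    for layer in layers[k:]:        # continued pass on the mixed state
        h = layer(h)
    return h

# Toy two-layer "network" for illustration.
layers = [lambda h: 2.0 * h, lambda h: h + 1.0]
euc_mix = lambda a, b, r: r * a + (1 - r) * b
```

With k = 1 and the Euclidean mix, the output is r · 2x_i + (1 − r) · 2x_j + 1; substituting MIXH for `mix_fn` gives the hyperbolic interpolation described in the text.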

Hyperbolic Semi-supervised Learning
Semi-supervised training methods leverage unlabeled data to improve training in limited or low resource settings (Verma et al., 2019b). We extend HYPMIX to effectively utilize p labeled data points, X^l = {x^l_1, x^l_2, ..., x^l_p}, and q unlabeled data points, X^u = {x^u_1, x^u_2, ..., x^u_q}, using a semi-supervised training strategy in the hyperbolic space (Figure 2).
We first use existing data augmentation techniques across different modalities to increase the unlabeled training data X^u. For an unlabeled sample x^u_s, we generate Z augmented samples using different augmentation methods such as back-translation (Edunov et al., 2018) and combine them to generate unlabeled augmented sets, X^a = {X^{a,1}, X^{a,2}, ..., X^{a,Z}}, where X^{a,z} = {x^u_{1,z}, x^u_{2,z}, ..., x^u_{q,z}} for z ∈ [1, Z].

Möbius Gyromidpoint Label Estimation (MGLE) predicts soft logits for unlabeled and augmented data in the hyperbolic space, allowing us to combine the unlabeled data with the limited training data using HYPMIX. For an unlabeled sample x^u_s and corresponding augmented samples {x^u_{s,1}, x^u_{s,2}, ..., x^u_{s,Z}}, we compute the Möbius gyromidpoint M_g of the hyperbolically mapped outputs, where weight w_o is applied to the original unlabeled sample. The weights control the contribution of different augmentation techniques based on their augmentation quality. We map the predicted output logits back to the Euclidean space using log_0(·) to obtain the soft logits y^u_s,

y^u_s = log_0(M_g(exp_0(g_φ(f_θ(x^u_s))), exp_0(g_φ(f_θ(x^u_{s,1}))), ..., exp_0(g_φ(f_θ(x^u_{s,Z}))); w_o, w_1, ..., w_Z))

We sharpen the output y^u_s with a temperature hyperparameter T to prevent it from being too uniform when the model predictions are close to random,

y^u_s ← (y^u_s)^{1/T} / ||(y^u_s)^{1/T}||_1

where || · ||_1 is the l_1-norm of the vector.
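A compact sketch of MGLE and the sharpening step (our own illustrative helper names, curvature 1): `logits` below is the list of model outputs for the original unlabeled sample and its Z augmentations, assumed to lie in the tangent space, and sharpening assumes a nonnegative, probability-like vector:

```python
import numpy as np

def exp0(v):
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n < 1e-12 else np.tanh(n) * v / n

def log0(y):
    n = np.linalg.norm(y)
    return np.zeros_like(y) if n < 1e-12 else np.arctanh(n) * y / n

def mobius_scalar_mul(r, x):
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n < 1e-12 else np.tanh(r * np.arctanh(n)) * x / n

def gyromidpoint(points, weights):
    lam = np.array([2.0 / (1.0 - np.dot(p, p)) for p in points])
    w = np.asarray(weights, dtype=float)
    num = sum(wi * li * p for wi, li, p in zip(w, lam, points))
    return mobius_scalar_mul(0.5, num / np.sum(w * (lam - 1.0)))

def mgle(logits, weights):
    """MGLE sketch: gyromidpoint of hyperbolically mapped logits of the
    original sample and its augmentations, mapped back to Euclidean space."""
    return log0(gyromidpoint([exp0(z) for z in logits], weights))

def sharpen(y, T):
    """Temperature sharpening: y^(1/T), l1-normalized (y nonnegative)."""
    p = np.power(y, 1.0 / T)
    return p / np.abs(p).sum()
```

When the model agrees on the sample and all its augmentations, MGLE returns that shared prediction, while T < 1 peaks the label distribution.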
Semi-supervised Network Optimization For optimizing the model in semi-supervised settings, we use the training set X = X^l ∪ X^u ∪ X^a with labels Y = Y^l ∪ Y^u ∪ Y^a, where Y^u is used for both unlabeled and augmented inputs, i.e., Y^a = Y^u. We then uniformly sample two elements x_i, x_j ∼ X and the corresponding labels y_i, y_j ∼ Y, and apply HYPMIX(x_i, x_j). We optimize the model using the KL-divergence loss L over the model outputs and the mixed labels mix(y_i, y_j),

L = KL(mix(y_i, y_j) || g_φ(HYPMIX(x_i, x_j)))

Datasets

We consider benchmark and low-resource datasets across speech, text, and vision spanning a varying number of classes, languages, and class imbalances for a comprehensive evaluation of HYPMIX.
We choose these datasets based on existing works across different task settings and baselines for a fair comparison with HYPMIX. We also choose datasets with comparatively lower language resources, different structures, and different language roots, leading to a more diverse evaluation of HYPMIX. We summarize dataset statistics in Table 1. We follow the same preprocessing across all datasets as in previous works (Jindal et al., 2020b; Chen et al., 2020b; Verma et al., 2019a).

Task Setup
We evaluate HYPMIX on three different settings for an extensive analysis: supervised training with limited training data, semi-supervised training with low resource data, and a fully supervised setup with complete training data.
Speech Following previous works, we use EnvNet-v2 with strong augmentation (Tokozume et al., 2018) as our base architecture f θ (·) followed by a fully connected layer g φ (·). We modify MIXH to account for the auditory perception and amplitude of speech signals (Tokozume et al., 2018). We use Fourier and Inverse Fourier Transform to generate augmented samples. We compare HYPMIX with the current state-of-the-art method Speechmix (Jindal et al., 2020b) across multiple settings.
Text Following Chen et al. (2020b), we use BERT-base (Devlin et al., 2019) as the backbone architecture (f_θ(·)) for English datasets and BERT-base-arabic (Safaya et al., 2020) for the Arabic dataset. We use a two-layer MLP with hidden size 128 as the classifier (g_φ(·)) and generate augmented data using back-translation (Edunov et al., 2018).
Vision Following Verma et al. (2019a), we use the same backbone architecture (f_θ(·)) and a linear layer as the classifier (g_φ(·)). We compare HYPMIX with manifold mixup (Verma et al., 2019a) for different settings.

Training Setup
Speech We use Nesterov's accelerated gradient (Sutskever et al., 2013) with a momentum of 0.9, weight decay of 5e-4, learning rate of 0.01, and mini-batch size of 64 for 2000 epochs. For ESC-10, we train the model on 5 folds, and for UrbanSound8k, on 10 folds, to report the average error rate. We randomly sample the mixing ratio from a uniform distribution, r ∼ U(0, 1). For semi-supervised training, we use 50 unlabeled samples from each class.
Text We use the AdamW (Loshchilov and Hutter, 2019) optimizer with a learning rate of 1e-5 for the BERT encoder and 1e-3 for the MLP. We follow Chen et al. (2020b) to sample the mixing ratio r from a beta distribution based on the number of labeled samples. For the semi-supervised setting, we use 1000 unlabeled samples from each class.
Vision We use Nesterov's accelerated gradient (Sutskever et al., 2013) with a momentum of 0.9, learning rate of 0.1, and batch size of 100 to train for 2000 epochs. Following Verma et al. (2019a), we sample the mixing ratio r ∼ Beta(2, 2), where Beta denotes the Beta distribution.

Supervised Training with Limited Data
We compare HYPMIX in a limited training data setup with baseline methods in Table 2. We observe that Euclidean mixup techniques (EUCMIX) improve performance over base models, indicating the importance of using the latent representation space of neural network architectures to perform interpolative regularization (Verma et al., 2019a). HYPMIX further improves performance (p < 0.01) over Euclidean methods across modalities, validating that the hyperbolic space is able to better capture the complex geometry of latent representations for different inputs when performing interpolative operations.
HYPMIX shows maximum improvement when applied to extremely small training sets, with sample counts on the order of n = 10. This is in line with works (Zhou et al., 2021) suggesting that the variation generated by Möbius operations is much higher than that of Euclidean operations, leading to far more diverse samples from a small training set. This paves the way for better utilization of low resource datasets for downstream tasks across different modalities by leveraging the hyperbolic space. For all modalities, the relative improvement over the baseline architecture reduces with an increasing number of labeled samples per class (n). This is in line with works (Verma et al., 2019b; Chen et al., 2020b) observing similar trends, suggesting that with an increase in the number of labeled samples, the overall diversity of interpolative representations saturates, leading to lower relative improvements.
Across modalities, we observe maximum improvement when HYPMIX is applied to speech datasets, since speech waves inherently possess a hyperbolic nature (Khan and Panigrahi, 2016), and their interpolative augmentation closely resembles hyperbolic wave interference (Chaturvedi et al., 1998). The improvement due to HYPMIX on text datasets aligns with works stating that text inherently displays tree-like hierarchical characteristics and can be better represented using Riemannian geometry (Tifrea et al., 2019). The improvements on vision datasets are in line with works suggesting that performing augmentation operations over images using Möbius operations improves generalization while increasing diversity compared to simplified Euclidean operations (Zhou et al., 2021). Improvements across different modalities, datasets, and base architectures indicate the modality-, data-, and model-agnostic nature of HYPMIX.

Semi-Supervised Results: Low-Resource
We probe the effect of using hyperbolic semi-supervised learning (§3.4) for low resource datasets using HYPMIX in Table 3. Semi-supervised learning shows significant improvements over the supervised counterparts trained with limited data for both Euclidean and hyperbolic (HYPMIX) representations, indicating the importance of using unlabeled and augmented data as additional training data. For both Euclidean and hyperbolic methods, we see larger improvements as the number of labeled samples n increases, due to the increased number of permutations of labeled-labeled, unlabeled-labeled, and unlabeled-unlabeled samples encountered during training. We observe greater improvements when semi-supervised training is applied in the hyperbolic space with HYPMIX (Figure 3) for both speech and text as compared to EUCMIX, indicating that the hyperbolic space is able to generate less noisy, yet more diverse samples by effectively modeling the complex latent space representations.

Across modalities, speech datasets that are augmented with simpler methods such as mathematical transforms show larger improvements than text datasets that are augmented with more complicated methods like back-translation. We attribute this difference to the proximity of augmented unlabeled samples to the original unlabeled data distribution, suggesting that better augmentation methods and controlling the weights for Möbius Gyromidpoint Label Estimation (MGLE, §3.4) based on augmentation quality are important factors for the performance of semi-supervised methods.

Layer wise Ablation
We experiment with different sets of layers from which we uniformly sample k to perform HYPMIX. We experiment with the best performing layer sets from corresponding previous works (Jindal et al., 2020b;Chen et al., 2020b) for a fair comparison.
Speech Table 4 compares the error rates on the ESC-10 dataset for Speechmix (Jindal et al., 2020b) and HYPMIX. We observe that HYPMIX achieves the best performance when the layer set has layers performing a max-pool operation in EnvNet-v2. These layers capture different features of sound such as frequency response and auditory perception (Tokozume et al., 2018), suggesting that HYPMIX is able to extend the training distribution by modeling various combinations of latent speech vectors representing different auditory features using hyperbolic interpolation.

Layer set        EUCMIX  HYPMIX
{7, 9, 12}       25.9    22.7
{6, 7, 9, 12}    27.8    21.8
{6, 7, 9}        28.1    24.9

Table 5: Layer-wise ablation (% Error rate) on AG News with n = 10 labeled samples per class.
Text We compare different layer sets of BERT-base (Devlin et al., 2019) for performing HYPMIX on text datasets in Table 5. Layers {3, 4, 6, 7, 9, 12} of BERT-base contain the most information about different aspects of natural language (Jawahar et al., 2019). We experiment with different combinations of the layer set {3, 4, 6, 7, 9, 12}. EUCMIX achieves its best result when using the set {7, 9, 12} for interpolation, the layers containing semantic and syntactic information. HYPMIX is able to better capture the syntax tree information present in layer 6 (Jawahar et al., 2019) and shows higher improvements when the mixup layer is chosen from {6, 7, 9, 12}, validating the ability of the hyperbolic space to model hierarchical information better than the Euclidean space (Ganea et al., 2018).
During the layer-wise ablation study, we observe that even though there is intersection between the optimum layer sets of EUCMIX and HYPMIX, they are not exactly the same. This leads to interesting questions regarding the representations that the Euclidean and hyperbolic spaces capture, and how the hyperbolic space can be further exploited for modeling complex geometries.

Supervised HYPMIX with Complete Data
We compare the performance of HYPMIX on three benchmark and low resource speech datasets in Table 6 by applying BC Learning (Tokozume et al., 2018) and Speechmix (Jindal et al., 2020b) over EnvNet-v2 with strong augmentation (Tokozume et al., 2018). We observe that mixup-based approaches, i.e., BC Learning and Speechmix, improve performance over the standard learning models, validating the importance of interpolative acoustic mixup based on the auditory perception of input samples. HYPMIX achieves state-of-the-art performance (p < 0.01) across all three datasets, suggesting that the hyperbolic representation better models the latent representation of speech signals and acoustic wave interference compared to the Euclidean space. We also present the results of HYPMIX-Input, where we perform HYPMIX over the raw inputs instead of latent representations. HYPMIX-Input outperforms Speechmix on two datasets, suggesting that the hyperbolic input space itself is able to generate more diverse synthetic samples compared to Euclidean methods.

Adversarial Robustness

Adversarial attacks provide inputs to models specifically designed to confuse them. We compare the robustness of HYPMIX and HYPMIX-Input with BC Learning (Tokozume et al., 2018) and Speechmix (Jindal et al., 2020b) by performing white-box adversarial attacks using the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and the Iterative Fast Gradient Sign Method (I-FGSM) (Kurakin et al., 2016) in Table 7. We observe that HYPMIX is more robust by 6.1% and HYPMIX-Input by 5.8% compared to their Euclidean counterparts, indicating that the hyperbolic space helps the model generalize better and makes it more resistant to adversarial examples.

Cost of Hyperbolic Operations
HYPMIX requires additional hyperbolic transformations, such as the exponential and logarithmic mappings and hyperbolic tangent operations, on top of EUCMIX. However, on a GPU, these can be carried out in parallel. Hence, HYPMIX requires longer time only by a constant factor compared to EUCMIX, with the individual operations having similar time complexity, on the order of the input dimensions of the latent representations of the samples to be mixed. We compare the per-iteration time taken by HYPMIX with EUCMIX in Table 8.

Conclusion and Future Work
Drawing inspiration from works showing that speech, text, and vision data inherently possess hyperbolic characteristics and can be better represented in the hyperbolic space, we propose HYPMIX, a model-, data-, and modality-agnostic interpolative regularization method operating in the hyperbolic space. We devise a Möbius Gyromidpoint Label Estimation (MGLE) technique to predict labels for unlabeled training data and combine it with HYPMIX to formulate a hyperbolic semi-supervised learning method. HYPMIX outperforms existing methods on benchmark and low resource datasets across speech, text, and vision in supervised and semi-supervised settings with complete and limited training data. HYPMIX is also more robust to white-box adversarial attacks than Euclidean methods. Being model-, data-, and modality-agnostic, HYPMIX can be extended to downstream tasks across modalities and to interpolative augmentation for data such as sequences and graphs. As future work, we plan to evaluate HYPMIX on larger datasets and a variety of tasks such as the GLUE and SuperGLUE benchmarks, and tasks comprising multimodal settings.

Acknowledgements
This work has been supported by the German Federal Ministry of Education and Research (BMBF) as a part of the Junior AI Scientists program under the reference 01-S20060. Ramit Sawhney is supported by ShareChat. We thank the anonymous reviewers for their valuable inputs.