Transductive Learning for Textual Few-Shot Classification in API-based Embedding Models

Proprietary and closed APIs are becoming increasingly common for processing natural language, and are impacting the practical applications of natural language processing, including few-shot classification. Few-shot classification involves training a model to perform a new classification task with a handful of labeled data. This paper presents three contributions. First, we introduce a scenario where the embedding of a pre-trained model is served through a gated API with compute-cost and data-privacy constraints. Second, we propose transductive inference, a learning paradigm that has been overlooked by the NLP community. Transductive inference, unlike traditional inductive learning, leverages the statistics of unlabeled data. We also introduce a new parameter-free transductive regularizer based on the Fisher-Rao loss, which can be used on top of the gated API embeddings. This method fully utilizes unlabeled data, does not share any labels with the third-party API provider, and could serve as a baseline for future research. Third, we propose an improved experimental setting and compile a benchmark of eight datasets involving multiclass classification in four different languages, with up to 151 classes. We evaluate our methods using eight backbone models, along with an episodic evaluation over 1,000 episodes, which demonstrates the superiority of transductive inference over the standard inductive setting.

Despite the success of the scaling paradigm, significant challenges remain, especially when the many practical constraints of real-world scenarios have to be met: labeled data can be severely limited (i.e., the few-shot scenario (Song et al., 2022; Ye et al., 2021)), data privacy is critical for many industries and has become the subject of a growing body of regulation (Commission, 2016, 2020), and compute costs need to be optimized (Strubell et al., 2019). Furthermore, these challenges are made even more complex as stronger foundation models are now available only through APIs (e.g., OpenAI's GPT-3, GPT-4, or ChatGPT, Anthropic's Claude, or Google's PaLM (Chowdhery et al., 2022)), which has led to some of their parameters being concealed, presenting new challenges for model adaptation (Solaiman, 2023). This paper is centered on the fundamental task of few-shot text classification, specifically focusing on cloud-based/API access. Specifically, we formulate three requirements for API-based few-shot learning (FSL) (see Fig. 1): (R1) Black-box scenario. We focus on learning from models that are opaquely deployed in production, where the end-user only has access to the output of the encoder, i.e., the text embedding produced by the final layer of the network. (R2) Low resources / computation time. AI systems are often required to make rapid predictions at high frequencies in various real-world applications. Therefore, any few-shot classifier used in such scenarios should have low training and inference times and require minimal computational resources.
(R3) Limited Data Sharing. When utilizing API models, data sharing becomes a major concern. In the current landscape, providers offer increasingly opaque procedures for training their networks. As a result, users prefer to share as little information as possible, such as the labeling schema and annotated data, to safeguard their data privacy. Shortcomings of existing works. While numerous previous studies have addressed the popular few-shot classification setting, to our knowledge no existing line of work adequately satisfies the three API requirements described above. In particular, prompt-based FSL (Schick and Schütze, 2020a) and parameter-efficient fine-tuning FSL (Houlsby et al., 2019) both require access to the model's gradients, while in-context learning scales poorly with the task's size (e.g., the number of shots and the number of classes) (Chen et al., 2021b; Min et al., 2021, 2022; Brown et al., 2020) and requires full data sharing. Instead, we focus on methods that can operate within API-based constraints.
Under the R1, R2, and R3 requirements, standard inductive learning (Liu et al., 2022) may be quite limiting. To mitigate labeled data scarcity while retaining API compliance, we revisit transduction (Vapnik, 1999) in the context of textual few-shot classification. Specifically, in the context of FSL, transductive FSL (Liu et al., 2019a) advocates leveraging the unlabeled test samples of a task as an additional source of information on the underlying task's data distribution in order to better define decision boundaries. Such an additional source essentially comes for free in many offline applications, including sentiment analysis for customer feedback, legal document classification, or text-based medical diagnosis.
Our findings corroborate recent findings in computer vision (Liu et al., 2019a; Ziko et al., 2020; Lichtenstein et al., 2020; Boudiaf et al., 2020; Hu et al., 2021b) that substantial gains can be obtained from using transduction over induction, opening new avenues of research for the NLP community. However, the transductive gain comes at the cost of introducing additional hyperparameters that must be carefully tuned. Motivated by Occam's razor principle, we propose a novel hyperparameter-free transductive regularizer based on Fisher-Rao distances and demonstrate the strongest predictive performance across various benchmarks and models while keeping hyperparameter tuning minimal. We believe that this parameter-free transductive regularizer can serve as a baseline for future research.

Contributions
In this paper, we make several contributions to the field of textual FSL. Precisely, our contributions are threefold: A new textual few-shot scenario. We present a new scenario for FSL using textual API-based models that accurately captures real-world constraints. Our new scenario opens up new research avenues and opportunities to address the challenges associated with FSL using API-based models, paving the way for improved performance in practical applications.
A novel transductive baseline. Our paper proposes a transductive FSL algorithm that utilizes a novel parameter-free Fisher-Rao-based loss. By leveraging only the network's embedding (R1), our approach enables fast and efficient predictions (R2) without the need to share the labeling schema or the labels of few-shot examples, making it compliant with (R3). This method marks a significant step forward in the field of FSL. A truly improved experimental setting. Previous studies on textual few-shot classification (Schick and Schütze, 2020b, 2022; Mahabadi et al., 2022; Tam et al., 2021; Gao et al., 2020) have predominantly assessed their algorithms on classification tasks with a restricted number of labels (typically fewer than five). We take a step forward and create a benchmark that is more representative of real-world scenarios. Our benchmark relies on a total of eight datasets, covering multiclass classification tasks with up to 151 classes, across four different languages. Moreover, we further enhance the evaluation process by not only considering 10 classifiers trained with 10 different seeds (Logan IV et al., 2021; Mahabadi et al., 2022), but also by relying on episodic evaluation over 1,000 episodes (Hospedales et al., 2021). Our results clearly demonstrate the superiority of transductive methods.

Few-shot learning in NLP
Numerous studies have tackled the task of FSL in Natural Language Processing (NLP) by utilizing pre-trained language models (Devlin et al., 2018; Liu et al., 2019b; Radford et al., 2019; Yang et al., 2019). These methods can be classified into three major categories: prompt-based, parameter-efficient tuning, and in-context learning.

[Figure 1 caption, partial: ... (R3). This scenario allows tuning a classification head g_ϕ (using induction or transduction) at low computational cost (R2) while retaining all support labels locally.]
Prompt-based FSL. Prompt-based FSL involves the use of natural language prompts or templates to guide the model to perform a specific task (Ding et al., 2021; Liu et al., 2023). For example, the seminal work of (Schick and Schütze, 2020a) proposed a model called PET, which uses a pre-defined set of prompts to perform various NLP tasks such as text classification. It also imposes the choice of a verbalizer, which strongly impacts classification performance (Cui et al., 2022; Hu et al., 2021a). However, recent studies have questioned the benefits of prompt-based learning due to the high variability in performance caused by the choice of prompt (Liu et al., 2022). To address this issue, researchers have proposed prompt tuning, which adds a few learnable parameters on top of the prompt (Lester et al., 2021). Nevertheless, these approaches face limitations when learning from an API: (i) encoder access for gradient computation is infeasible (as in R1), (ii) prompting requires sending data and labels, which raises privacy concerns (as in R3), and (iii) labeling new points is time-consuming and expensive, since all shots must be sent for each query and the cost of API calls is determined by the number of input tokens transmitted. Parameter-efficient fine-tuning. These methods, such as adapters (Houlsby et al., 2019; Pfeiffer et al., 2020), keep most of the model's parameters fixed during training and only update small feedforward networks inserted within the larger model architecture. A recent example is T-FEW (Liu et al., 2022), which adds learned vectors that rescale the network's internal activations. It additionally requires a set of manually created prompts for each dataset, making it hard to use in practice. Relying on parameter-efficient fine-tuning methods with an API is not possible due to the need to compute gradients of the encoder (as per R1) and the requirement to send both the labeling schema and the labels, which violates R3.
In-Context Learning (ICL). In-context learning models utilize input-to-output training examples as prompts to make predictions, without any parameter updates (Wei et al., 2022). These models, such as text-davinci, rely solely on the provided examples to generate predictions, without any additional training. However, a significant drawback of this approach is that the user must supply the input, labeled examples, and task description, which becomes prohibitively expensive when the number of classes or shots increases, is slow (Liu et al., 2022) (R2), and raises data privacy concerns (as highlighted in R3). Additionally, the inability to reuse text embeddings for new tasks or with new labels without querying the model's API limits practicality and scalability, making reusable encoding unfeasible for in-context learning models. Meta-learning. Meta-learning approaches have long stood as the de-facto paradigm for FSL (Snell et al., 2017; Rusu et al., 2019; Sung et al., 2018b; Lee et al., 2019; Raghu et al., 2019; Sun et al., 2019a). In meta-learning, the objective is to provide the model with the intrinsic ability to learn in a data-efficient manner. For instance, MAML (Finn et al., 2017a; Antoniou et al., 2018), arguably the most popular meta-learning method, trains a model such that it can be fine-tuned end-to-end using only a few supervised samples while retaining high generalization ability. Unlike the three previous lines of work, meta-learning methods operate by modifying the pre-training procedure and therefore assume access to both the training data and the model, which wholly breaks both R1 and R3.

Inductive vs transductive learning
Learning an inductive classifier on embeddings generated by an API-based model, as proposed by (Snell et al., 2017), is a common baseline for performing FSL. This approach is prevalent in NLP, where a parametric model is trained on labeled data to infer general rules that are then applied to label new, unseen data (known as inductive learning (Vapnik, 1999)). However, in FSL scenarios with limited labeled data, this approach can be highly ambiguous and lead to poor generalization.
Transduction offers an attractive alternative to inductive learning (Sain, 1996). Unlike inductive learning, which infers general rules from training data, transduction involves finding rules that work specifically for the unlabeled test data. By utilizing more data, such as unlabeled test instances, and aiming for a more localized rule rather than a general one, transductive learning has shown promise and practical benefits in computer vision (Boudiaf et al., 2020, 2021; Ziko et al., 2020). Transductive methods yield substantially better performance than their inductive counterparts by leveraging the statistics of the query set (Dhillon et al., 2019). However, this approach has not yet been explored in the context of textual data.

Problem Statement
Let Ω be the considered vocabulary; we denote by Ω* its Kleene closure, i.e., the set of token sequences of arbitrary length over Ω. Given an input space X with X ⊆ Ω* and a latent space Z, we consider a pretrained backbone model f_θ : X → Z = R^d, where θ ∈ Θ represents the parameters of the encoder and d is the embedding dimension. In the API-based setting, we are unable to access the internal structure of f_θ, as stated in R1. However, we do have access to the final encoder embedding (see R1).
The objective of few-shot classification is to learn a classifier from limited labeled data that generalizes to new, unseen tasks or classes. To accomplish this, few-shot tasks are randomly sampled from a test dataset with a set of unseen classes Y_test. Each task involves a few labeled examples from K different classes chosen at random among Y_test. These labeled examples constitute the support set S = {(x_i, y_i)}_{i∈I_S}, of size |S| = N_S × K. Additionally, each task has an unlabeled query set Q = {x_i}_{i∈I_Q} composed of unseen examples from each of the K classes. I_S and I_Q denote the indices drawn during the sampling of the support and query sets, respectively. Pre-trained models use few-shot techniques and the labeled support sets to adapt to the task at hand and are evaluated on their performance on the unlabeled query sets.
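As a concrete illustration, the episode construction described above can be sketched as follows; this is a minimal sketch, and the function and variable names are ours rather than from the paper's code:

```python
import random
from collections import defaultdict

def sample_episode(texts, labels, num_classes=5, n_support=5, n_query=5, seed=0):
    """Sample one N-shot K-way episode: a labeled support set S and an
    unlabeled query set Q drawn from K randomly chosen classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(texts, labels):
        by_class[y].append(x)
    chosen = rng.sample(sorted(by_class), num_classes)
    support, query = [], []
    for y in chosen:
        pool = rng.sample(by_class[y], n_support + n_query)
        support += [(x, y) for x in pool[:n_support]]
        query += pool[n_support:]  # query labels are held out, as in transduction
    return support, query
```

Note that the query examples are returned without labels: the transductive methods of Sec. 3.2 only see their embeddings.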
Remark. Setting the values of N and K in textual FSL is not standardized, as discussed in Sec. 3.1. Therefore, in all of our experiments, we rely on settings (N, K) ∈ {5, 10}².

Proposed Transductive Method
NLP few-shot classifiers rely only on inductive inference, while computer vision has shown significant performance improvements using transductive inference for FSL. Transductive inference succeeds in FSL because it jointly classifies all unlabeled query samples of a single task, leading to more efficient and accurate classification than inductive methods that classify one sample at a time. Let us begin by introducing some basic notation and definitions before presenting our new transductive loss based on the Fisher-Rao distance.
In the API-based few-shot classification setting, our goal is to train a classification head g_ϕ : Z → R^K that maps feature representations to the posterior distribution space for making predictions. To simplify the equations in the rest of the paper, we write, for each i ∈ I_S ∪ I_Q and each class k, the posterior predictions and the class marginals within Q as

p_ik = P(Y = k | X = x_i; θ, ϕ),    \hat{p}_k = (1/|Q|) Σ_{i∈I_Q} p_ik,

where X and Y are the random variables associated with the raw features and labels, respectively, and Y_Q denotes the restriction of the r.v. Y to the set Q.
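In code, these quantities are simply row-wise softmax posteriors over the head's logits and their average over the query set (a sketch under our naming, not the authors' code):

```python
import numpy as np

def posteriors_and_marginal(query_logits):
    """Compute p_ik = P(Y = k | X = x_i) via a numerically stable softmax,
    and the class marginal within Q as the mean posterior over query samples."""
    z = query_logits - query_logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p, p.mean(axis=0)  # (|Q| x K posteriors, K-dim marginal)
```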
For training the classification head in the transductive setting, prior research aims at finding

ϕ* = argmin_ϕ  CE(ϕ; S) − λ · R_Q(ϕ),

with CE(ϕ; S) = −(1/|S|) Σ_{i∈I_S} Σ_{k=1}^{K} y_ik log(p_ik) being the cross-entropy supervision on the support set (in which y_ik is the k-th coordinate of the one-hot encoded label vector associated with sample i) and R_Q being a transductive regularizer on the query set Q.
Note that this transductive regularization has been proposed in the literature based on the InfoMax principle (Cardoso, 1997; Linsker, 1988), and the inductive loss can be recovered by setting λ = 0. In what follows, we review the regularizers introduced in previous work.
Entropic Minimization (H). An effective regularizer for transductive FSL can be derived from the field of semi-supervised learning, drawing inspiration from the approach introduced in (Grandvalet and Bengio, 2004). This regularizer, proposed in (Dhillon et al., 2019), utilizes the conditional Shannon entropy (Cover, 1999) of the forecasts on query samples at test time to enhance model generalization. Formally:

R^H_Q = −Ĥ(Y_Q | X_Q) = (1/|Q|) Σ_{i∈I_Q} Σ_{k=1}^{K} p_ik log(p_ik),

so that maximizing R^H_Q minimizes the conditional entropy of the query predictions. Mutual Information Maximization (I). A promising alternative to entropic minimization for addressing the challenges of transductive FSL is to adopt the InfoMax principle. (Boudiaf et al., 2020) extended this idea, introduced in (Hu et al., 2017), and proposed as regularizer a surrogate of the mutual information:

R^I_Q(α) = Ĥ(Y_Q) − α Ĥ(Y_Q | X_Q),

where Ĥ(Y_Q) = −Σ_{k=1}^{K} \hat{p}_k log(\hat{p}_k). Limitations of existing strategies. Despite their effectiveness, the previous methods have a few limitations that should be taken into account. One of these limitations is the need to fine-tune the weight of the different entropies through the hyperparameter α. This tuning process can be time-consuming and may require extensive experimentation to achieve optimal results. Additionally, recent studies have shown that relying solely on the conditional-entropy term, i.e., the entropic minimization scenario above, can lead to suboptimal performance in FSL.
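Both regularizers reduce to a few lines of NumPy given the query posteriors. One common formulation (TIM-style, following Boudiaf et al., 2020) writes the conditional entropy and the mutual-information surrogate as below; the signs and weighting here reflect our reading, and the code is our sketch rather than the authors':

```python
import numpy as np

EPS = 1e-12

def cond_entropy(p):
    """H(Y_Q | X_Q): average Shannon entropy of the query posteriors (|Q| x K)."""
    return float(-np.mean(np.sum(p * np.log(p + EPS), axis=1)))

def marginal_entropy(p):
    """H(Y_Q): entropy of the class marginal (mean posterior) within Q."""
    m = p.mean(axis=0)
    return float(-np.sum(m * np.log(m + EPS)))

def infomax_surrogate(p, alpha=1.0):
    """Mutual-information surrogate H(Y_Q) - alpha * H(Y_Q | X_Q)."""
    return marginal_entropy(p) - alpha * cond_entropy(p)
```

With uniform posteriors both entropies equal log K, so the surrogate vanishes at α = 1; confident, diverse predictions drive it up.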

A Fisher-Rao Based Regularizer
In the FSL scenario, minimizing parameter tuning is crucial. Motivated by this, in this section we introduce a new parameter-free transductive regularizer that fits into the InfoMax framework. Additionally, our loss inherits the attractive properties of the Fisher-Rao distance between soft predictions q := (q_1, . . ., q_K) and p := (p_1, . . ., p_K), which is given by (Picot et al., 2023):

d_FR(q, p) = 2 arccos( Σ_{k=1}^{K} √(q_k p_k) ).

The proposed transductive regularizer, denoted by R^FR_Q, for each single few-shot task, can be described as measuring the Fisher-Rao distance between pairs of query samples:

R^FR_Q = −(1/|Q|) Σ_{i∈I_Q} log( (1/|Q|) Σ_{j∈I_Q} cos( d_FR(p_i, p_j)/2 ) ),

where d_FR(p_i, p_j) is the Fisher-Rao distance between the pair of soft predictions (p_i, p_j). Furthermore, this expression yields a surrogate of the Mutual Information, as shown by the following result, which to the best of our knowledge is new.
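The distance itself, and a pairwise query term built from it, can be computed directly from soft predictions. The snippet below is our sketch of one plausible implementation; the pairwise aggregation follows the surrogate derived in the appendix and may differ in detail from the paper's exact form:

```python
import numpy as np

def fisher_rao_distance(p, q):
    """d_FR(p, q) = 2 * arccos(sum_k sqrt(p_k * q_k)) on the probability simplex."""
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)  # Bhattacharyya coefficient
    return 2.0 * np.arccos(bc)

def fisher_rao_regularizer(p):
    """Pairwise Fisher-Rao term over the query posteriors p (|Q| x K).
    Uses cos(d_FR / 2) = Bhattacharyya coefficient to avoid the arccos."""
    bc = np.sqrt(p) @ np.sqrt(p).T          # |Q| x |Q| pairwise coefficients
    return float(-np.mean(np.log(bc.mean(axis=1) + 1e-12)))
```

Identical distributions are at distance 0, while disjoint one-hot predictions reach the maximal distance π.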
Theorem 1. (Fisher-Rao as a surrogate to maximize Mutual Information) Let (p_i)_{i∈I_Q} be a collection of soft predictions corresponding to the query samples. Then, it holds that ∀ 0 ≤ α ≤ 1:

R^FR_Q ≤ R^I_Q(α) ≤ Ĥ(Y_Q).

Proof: Further details are relegated to App. A.
Theorem 1 shows that, like R^I_Q, the regularizer R^FR_Q can be exploited to maximize the Mutual Information. However, R^FR_Q is parameter-free and thus does not require tuning α.
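To make the overall recipe concrete, the sketch below fine-tunes a linear head on frozen embeddings with a cross-entropy term on S plus a transductive term on Q. For readability it uses the conditional-entropy transductive term with hand-derived gradients; all names are ours, and this is an illustrative sketch rather than the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_transductive_head(z_s, y_s, z_q, num_classes, lam=0.1, lr=0.5, steps=200):
    """Gradient descent on a linear head W minimizing
    CE(support) + lam * mean conditional entropy(query).
    z_s, z_q: frozen API embeddings; y_s: integer support labels."""
    d = z_s.shape[1]
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((d, num_classes))
    onehot = np.eye(num_classes)[y_s]
    for _ in range(steps):
        p_s = softmax(z_s @ W)
        p_q = softmax(z_q @ W)
        g_s = (p_s - onehot) / len(z_s)                 # CE gradient wrt logits
        logp = np.log(p_q + 1e-12)
        h = -np.sum(p_q * logp, axis=1, keepdims=True)  # per-sample entropy
        g_q = lam * (-p_q * (logp + h)) / len(z_q)      # entropy gradient wrt logits
        W -= lr * (z_s.T @ g_s + z_q.T @ g_q)
    return W
```

Swapping the query-side gradient for one derived from R^I_Q or R^FR_Q yields the other transductive variants; the structure of the loop is unchanged.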

Additional Few-shot Inductive Baseline
In addition to the transductive methods of Sec. 3.2, we explore three additional inductive methods for few-shot classification: prototypical networks, linear probing, and a semi-supervised classifier.
Prototypical Networks (PT). PT learns a metric space in which the distance between two points corresponds to their degree of similarity. During inference, the distance between the query example and each class prototype is computed, and the predicted label is that of the closest prototype. PT has been widely used in NLP and is considered a strong baseline (Snell et al., 2017; Sun et al., 2019b; Gao et al., 2019). Linear Probing (CE). Fine-tuning a linear head on top of a pretrained model is a popular approach to learning a classifier and was originally proposed in (Devlin et al., 2018).
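A minimal nearest-prototype classifier over frozen embeddings looks as follows (the standard recipe from Snell et al., 2017; the code itself is our sketch):

```python
import numpy as np

def prototype_predict(support_z, support_y, query_z):
    """Compute one prototype per class (mean support embedding) and assign
    each query to the class of its nearest prototype (squared Euclidean)."""
    classes = np.unique(support_y)
    protos = np.stack([support_z[support_y == c].mean(axis=0) for c in classes])
    d2 = ((query_z[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]
```

No optimization is involved, which is why PT is the fastest baseline in Tab. 4.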
Semi-supervised Baseline (SSL). We additionally propose a semi-supervised baseline that proceeds in two steps. In the first step, a classifier is trained on the support set S and used to label Q. In the second step, the final classifier is trained on both S and Q with the pseudo-labels obtained in the first step.
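The two-step procedure can be sketched with a nearest-centroid base classifier standing in for the trained head (an illustrative simplification; the paper's SSL baseline trains its own classifier):

```python
import numpy as np

def _centroid_predict(z_train, y_train, z_test, classes):
    protos = np.stack([z_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((z_test[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]

def ssl_predict(support_z, support_y, query_z):
    """Step 1: a classifier fit on S pseudo-labels Q.
    Step 2: the final classifier, fit on S plus pseudo-labeled Q, predicts Q."""
    classes = np.unique(support_y)
    pseudo = _centroid_predict(support_z, support_y, query_z, classes)
    all_z = np.concatenate([support_z, query_z])
    all_y = np.concatenate([support_y, pseudo])
    return _centroid_predict(all_z, all_y, query_z, classes)
```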

Datasets
Benchmarking the performance of FSL methods on diverse sets of datasets is critical to robustly evaluate their generalization capabilities as well as their potential for real-world applications. Previous work on FSL (Karimi Mahabadi et al., 2022; Perez et al., 2021) mainly focuses on datasets with a reduced number of classes (i.e., K < 5). Motivated by practical considerations, we build a new benchmark composed of datasets with a larger number of classes. Specifically, we choose GoEmotions (Demszky et al., 2020), TweetEval (Barbieri et al., 2020), Clinc (Larson et al., 2019), Banking (Casanueva et al., 2020), and the Multilingual Amazon Reviews Corpus (Keung et al., 2020). These datasets cover a wide range of text classification scenarios and are of varying difficulty. A summary of the datasets used can be found in Tab. 1.

Model Choice
The selection of an appropriate backbone model is a critical factor in achieving high performance in few-shot NLP tasks. To ensure the validity and robustness of our findings, we include a diverse range of transformer-based backbone models in our study, including: 1. Three variants of RoBERTa-based models (Liu et al., 2019b). Similar to BERT, RoBERTa is pretrained using the cloze task (Taylor, 1953). We consider two sizes of the RoBERTa model, namely RoBERTa (B) with 124M parameters and RoBERTa (L) with 355M parameters, as well as Distil-RoBERTa, a lighter version of RoBERTa trained through a distillation process (Hinton et al., 2015), with a total of 82M parameters.
4. The text-davinci model: to mimic the typical setting of API-based models, we also conduct experiments on text-davinci, which is only accessible through OpenAI's API.

Evaluation Framework
Prior research in textual FSL typically samples a small number of tasks (usually fewer than 10) per dataset. In contrast, we utilize an episodic learning framework that generates a large number of N-shot K-way tasks. This framework has gained popularity through inductive meta-learning approaches, such as those proposed by (Finn et al., 2017b; Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018a; Mishra et al., 2017; Rusu et al., 2019; Oreshkin et al., 2018), as it mimics the few-shot environment during evaluation and improves model robustness and generalization. In this context, episodic training implies that a different model is initialized for each generated few-shot task and that all tasks are processed independently in parallel. This approach allows the computation of more reliable performance statistics by evaluating the generalization capabilities of each method on a more diverse set of tasks. To account for the model's generalization ability, we average the results for each dataset over 1,000 episodes, with the considered classes varying in every episode. For each experiment, we report the F1-score.
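Per-episode scoring then reduces to a macro-F1 over the episode's classes, averaged across the sampled episodes. The sketch below (our naming, not the paper's code) accepts any classifier that maps a support set and query embeddings to predictions:

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

def episodic_score(episodes, predict_fn):
    """Average macro-F1 of `predict_fn` over an iterable of
    (support_z, support_y, query_z, query_y) episodes."""
    f1s = [macro_f1(q_y, predict_fn(s_z, s_y, q_z), np.unique(s_y))
           for s_z, s_y, q_z, q_y in episodes]
    return float(np.mean(f1s))
```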

Case Study of text-davinci
In this experiment, we investigate the performance of text-davinci in both its language-model and embedding-model forms. We assess its classification capabilities using the aforementioned baselines and explore the language model's performance in an in-context learning (ICL) setup with prompting.
Takeaways. From Tab. 2, we observe that SSL performs comparably to CE, which is simpler to use and will be considered the baseline in the remainder of our study. Although ICL slightly outperforms CE, its implementation comes at a significant cost. In ICL, each class requires N shots, forcing the user to send a long input query with additional instructions. This query length becomes prohibitive as the number of classes increases; on average, it is 58 times longer than using the embedding-based API in our benchmark. The lengthy input makes ICL time-consuming at generation (violating R2), requires the user to provide labels (violating R3), and prevents the reuse of embeddings for future use (e.g., retrieval, clustering). Additionally, ICL is 60 times more expensive than CE. Thus, we discard ICL for the remainder of this study.

Overall Results
Global results: To evaluate the effectiveness of the various few-shot methods, we conducted a comprehensive analysis of their classification performance across all datasets, backbones, and considered N-shot/K-way scenarios. Results are reported in Tab. 3. An interesting observation is that the transductive approaches I and FR outperform their inductive counterparts (CE and PT). Notably, vanilla entropy minimization, which relies solely on H, consistently underperforms in all considered scenarios. Our analysis reveals that FR surpasses traditional fine-tuning based on cross-entropy by a margin of 3.7%.
Mono-lingual experiment: In order to thoroughly analyze the performance of each method, we conducted a per-dataset study, beginning with a focus on the mono-lingual datasets. Fig. 2 reveals that the global trends observed in Tab. 3 remain consistent across datasets of varying difficulty. Notably, we observe consistent improvements achieved by transductive regularizers (such as I or FR) over CE. However, the relative improvement is highly dependent on the specific dataset: FR achieves +6.5% F1-score on Banking, but only a modest +1.5% on Tweet. A strong baseline generally suggests highly discriminative features for the task, and therefore a strong upside in leveraging additional unlabeled features, and vice versa. We therefore hypothesize that the potential gains from transduction correlate with the baseline's performance.

Study Under Different Data-Regime
In this experiment, we investigated the performance of the different loss functions under varying numbers of 'ways' and 'shots'. As shown in Fig. 3, increasing the number of classes ('ways') led to a decrease in F1, while increasing the number of examples per class ('shots') led to an improvement in F1. This can be explained by the fact that having more data enables the classifier to better discern the characteristics of each class. Interestingly, the relationship between the number of shots and classification F1 is not the same for all classes or all loss functions. Fig. 3 shows that some loss functions (e.g., FR on Banking) benefited greatly from adding a few shots, while others did not show as much improvement. However, this variability depends on the specific dataset and language being used, as different classes may have different levels of complexity and variability, and some may be inherently easier or harder to classify than others.

Ablation Study On Backbones
In this experiment, we examined how the different loss functions perform as the number of parameters in the models increases. The results, presented in Fig. 4, show the average performance across the experiments, organized by loss function. We observe an inverse scaling law for both the RoBERTa and XLM-RoBERTa families of models: increasing the number of parameters led to a decrease in performance for all losses tested. However, within the same family, the superiority of FR remains consistent. An interesting finding from Fig. 4 is that transductive regularization with FR outperforms the other methods on text-davinci. This highlights the effectiveness of FR in improving model performance and suggests that transductive regularization may be a promising approach for optimizing language models.

Practical Considerations
In this experiment, we adopt a practical standpoint and evaluate the efficiency of working with an API model, specifically text-davinci. In Tab. 4, we report the training speed of one episode on a Mac CPU. Overall, we observe that the transductive losses are slower, as they necessitate computing the loss on the query set, whereas PT is the fastest as it does not involve any optimization. Furthermore, we note that FR is comparable in speed to I. To put these results into perspective, we can compare our methods with existing approaches (in light of R2). For instance, PET (Schick and Schütze, 2020a) entails a training time of 20 minutes on an A100, while ADAPET (Tam et al., 2021) necessitates 10 minutes on the same hardware.

Conclusions
This paper presents a novel FSL framework that utilizes API models while meeting the critical constraints of real-world applications (i.e., R1, R2, R3). This approach is particularly appealing as it shifts the computational requirements (R2), eliminating the need for heavy computation on the user side and reducing the cost of embedding: embedding over 400k sequences costs as little as 7 dollars. In this scenario, our research highlights the potential of transductive losses, which have previously been disregarded by the NLP community. A candidate loss is the Fisher-Rao distance, which is parameter-free and could serve as a simple baseline in the future.
We are optimistic that our research will have a positive impact on society. Nonetheless, it is essential to acknowledge the limitations of API-based few-shot classification models despite their promising results in various tasks. First, the performance of the introduced methods is heavily dependent on the quality of the available API models: if the API models do not provide sufficiently informative embeddings, the introduced methods may struggle to accurately classify input texts. Second, the black-box nature of the backbone limits the interpretability of API-based few-shot classification methods, which may hinder their adoption. Ultimately, the aim of this work is to establish a baseline for future research on transductive inference; as a result, not all existing transductive methods are compared in this study.

A Proof of Theorem 1
In this Appendix, we prove the inequality (Eq. 6) provided in Theorem 1. The right-hand side of (Eq. 6) follows straightforwardly from the definition of R^I_Q(α) and the non-negativity of the Shannon entropy. In order to prove the first inequality, we need the following intermediate result.
For any arbitrary random variable (r.v.) X and countable r.v. Y, and any real number β, consider the quantity I_β(X; Y), in which the auxiliary r.v. X⋆ follows the same distribution as X. Notice that I_1(X; Y) = I(X; Y), where I(X; Y) is the Shannon Mutual Information. Lemma 1. For any arbitrary r.v. X and countable r.v. Y, we have I_β(X; Y) ≤ I(X; Y). Proof of the lemma: We must show that the difference I(X; Y) − I_β(X; Y) is nonnegative. To this end, we expand this difference; the resulting inequality follows by applying Jensen's inequality to the function t → − log(t).
Proof of Theorem 1: From Lemma 1, using Jensen's inequality, we obtain a lower bound on I(X; Y), where inequality (15) follows by applying Lemma 1 and inequality (16) follows from the convexity of the function t → − log(t) for any 0 ≤ β ≤ 1. Finally, it is not difficult to check from the definition of the Fisher-Rao distance given by expression (3) that

Σ_k √(P(k|x) P(k|x′)) = cos( d_FR(P(·|x), P(·|x′)) / 2 ).

Using this identity in the bound above, and setting β = 1/2, we obtain a lower bound on I(X; Y). The inequality (6) immediately follows by replacing the distribution of the r.v. X with the empirical distribution on the query set and P(y|x) with the soft prediction corresponding to the feature x, which concludes the proof.

B.1 Preliminary Classification Results
Preliminary Experiment. In our experiments, the backbone models are of utmost importance. Our objective in this preliminary experiment is to assess the efficacy of these models when fine-tuning only the model head across a variety of datasets. Through this evaluation, we aim to gain insight into their generalization abilities and any dataset-specific factors that may influence their performance. This information can be used to analyze the performance of the different models in the few-shot scenario described in Sec. 5. We present the results of this experiment in Tab. 5, noting that all classes were considered, which differs from the episodic training approach detailed in Sec. 5.

B.2 A Dive into text-davinci Results
text-davinci appears to be the backbone providing the most informative a priori embeddings in Tab. 5 and can be considered the prime model for API-based FSL, meeting the current requirements in this area. It is thus a typical candidate for applications that must satisfy criteria (R1)-(R3). We therefore put a special emphasis on its results. Fig. 6 (top) details the text-davinci results of the experiments conducted on the monolingual datasets. These plots confirm the trends that emerged in Tab. 5, Tab. 3, and Fig. 2, namely: the superiority of transductive approaches (FR and I) over inductive ones (CE and PT), the underperformance of the entropy-minimization-based strategy (H), and the greater amount of information conveyed by text-davinci embeddings compared with other backbones, resulting in higher F1 scores on all datasets.
These phenomena persist in the multilingual setting, as illustrated in Fig. 6 (bottom), stressing the superiority of transductive approaches (and especially FR) on presumably universal tasks, beyond English-centered ones, and without the language-specific engineering required by prompting-based strategies.
Note that in both settings, the entropy-minimization-based strategy (H) appears to be capped at a 15% F1 score, gaining nothing from stronger backbone embeddings and independently of the dataset difficulty.
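The two transductive penalties compared above can be written down concretely. The sketch below is illustrative, not the paper's exact objective: `entropy_term` is the standard entropy-minimization penalty (H), and `fr_pairwise_term` averages the −log cos(d_FR/2) surrogate over pairs of query soft predictions to illustrate the Fisher-Rao quantity being penalized; the exact FR loss is the one given by Eq. (6).

```python
import numpy as np

def fisher_rao(p, q):
    """Fisher-Rao distance between two categorical distributions."""
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)  # Bhattacharyya coefficient
    return 2.0 * np.arccos(bc)

def entropy_term(P, eps=1e-12):
    """Average Shannon entropy of the query soft predictions (the H penalty)."""
    return -np.mean(np.sum(P * np.log(P + eps), axis=1))

def fr_pairwise_term(P):
    """Average pairwise -log cos(d_FR / 2) over query soft predictions.

    Illustrative surrogate only: it is zero when all query predictions
    coincide and grows as they spread apart on the simplex.
    """
    n = len(P)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = fisher_rao(P[i], P[j])
            total += -np.log(max(np.cos(d / 2.0), 1e-12))
    return total / (n * n)
```

Both terms are computed purely from the soft predictions on the unlabeled query set, so no label ever leaves the client side, consistent with (R3).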

B.3 Multilingual Experiment
To provide an exhaustive analysis, we repeat the experiment of Sec. 5.2 with multilingual models on the Amazon dataset. While the two Romance languages (French and Spanish) yield similar results, with an F1 gain of 2.8% for FR over CE, German and English exhibit an F1 improvement of almost 4%.

B.4 Importance of Model Backbones on Monolingual Experiment
In this section, we report the results of our experiment aggregated per backbone. The goal is to understand how the different losses behave with each backbone. The results are presented in Fig. 10. While the trends observed in the previous charts hold for the majority of backbones, some models are exceptions. For example, although transductive methods generally outperform inductive ones, the CE-based method performs slightly better than I for XLM-RoBERTa-xl. Additionally, while FR is the most effective method for the majority of backbones, it is surpassed by I for the all-distilroberta-v1 model. Furthermore, the inverse-scaling-law behavior can be observed per dataset for the RoBERTa (B/L) and XLM-RoBERTa (B/L) models. In general, it is interesting to note that although model performance is constrained by dataset difficulty, the ranking of the methods is consistent across all four datasets for each considered backbone.

B.4.1 Results Per Language
In this experiment, we report the performance of the different losses on the Amazon dataset, averaging the results over the number of shots, ways, and model backbones. The results are presented in Tab. 6.
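The N-way K-shot episodes behind these averages follow the standard episodic protocol. The sketch below shows one plausible sampler under that protocol (names and the `n_query` parameter are illustrative, not the paper's exact code): sample N classes, then K labeled support examples and a batch of unlabeled query examples per class.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode: disjoint support/query index sets.

    `labels` holds the integer class label of every pooled example.
    Returns (support_idx, query_idx).
    """
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                  # K labeled shots
        query.extend(idx[k_shot:k_shot + n_query])    # unlabeled queries
    return np.array(support), np.array(query)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 40)   # toy pool: 10 classes, 40 examples each
s, q = sample_episode(labels, n_way=5, k_shot=5, n_query=15, rng=rng)
```

Inductive methods fit the head on the support set only, while transductive methods additionally exploit the statistics of the query set; per-episode F1 scores are then averaged over the 1,000 episodes.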
Our observations indicate that the transductive regularization improves the results over the inductive baseline (i.e., CE) for all languages, with a substantially higher gain for German. Additionally, we note that the improvements obtained with FR are more consistent. This further demonstrates that transductive losses can be useful in few-shot NLP. In the future, we would like to explore the application of transductive inference to other NLP tasks such as sequence generation (Pichler et al., 2022; Colombo et al., 2019, 2021d,b), classification tasks (Chapuis et al., 2020; Colombo et al., 2022d,b; Himmi et al., 2023), NLG evaluation (Colombo et al., 2021e, 2022c, 2021c,a,b), and Safe AI (Colombo et al., 2022a; Picot et al., 2022a,b; Darrin et al., 2022, 2023).

Figure 1: API-based FSL scenario. The black-box API provides embeddings from the pretrained encoder f_θ. The black-box scenario discards existing inductive approaches and in-context learning methods due to the inaccessibility of the model's parameters (R1) and privacy concerns (R3). This scenario allows tuning a classification head g_ϕ (using induction or transduction) at low computational cost (R2) while retaining all support labels locally.

Figure 2: Performance on the monolingual datasets.

Figure 3: The effect of ways and shots on test performance on monolingual (left) and multilingual (right) datasets.
Figure 4:

Table 2: Aggregated performance over K, N, and the different datasets for text-davinci. |x| stands for the average input length.

Table 3: Aggregated performance over K, N, the different datasets, and the considered backbones.

Table 4: Training time for one episode on an M1 CPU.

Table 5: Preliminary experiment results. Accuracy of the different backbones.

Table 6: Global results for the multilingual Amazon dataset.