3.3. Feature Preprocessing
As mentioned in
Section 2, many works hypothesize, explicitly or not, that the features from the same class are aligned with a specific distribution (often Gaussian-like). However, this assumption is rarely verified experimentally. In fact, it is very likely that features obtained using the backbone architecture are not Gaussian. Indeed, the features are usually obtained after applying a ReLU function [
39] and exhibit a positive and yet skewed distribution mostly concentrated around 0 (more details can be found in the next section).
Multiple works in the domain [
20,
35] discuss different statistical methods (e.g., batch normalization) to better fit the features to a model. Although these methods may have provable benefits for some distributions, they can worsen the process if applied to an unexpected input distribution. This is why we propose to preprocess the obtained raw feature vectors so that they better align with typical distribution assumptions in the field. Denote
${f}_{\phi}\left(\mathbf{x}\right)=[{f}_{\phi}^{1}\left(\mathbf{x}\right),\dots ,{f}_{\phi}^{h}\left(\mathbf{x}\right),\dots ,{f}_{\phi}^{d}\left(\mathbf{x}\right)]\in {\left({\mathbb{R}}^{+}\right)}^{d},\mathbf{x}\in {\mathbf{D}}_{novel}$ as the obtained features on
${\mathbf{D}}_{novel}$, and let
${f}_{\phi}^{h}\left(\mathbf{x}\right),1\le h\le d$ denote its value at the h-th position. The preprocessing methods applied in our proposed algorithms are as follows:
(E) Euclidean normalization. Also known as L2-normalization and widely used in many related works [
19,
35,
37], this step scales the feature vectors to a common norm so that high-variance feature vectors do not dominate the others. Euclidean normalization is given by:
$f_{\phi}(\mathbf{x}) \leftarrow \frac{f_{\phi}(\mathbf{x})}{\parallel f_{\phi}(\mathbf{x}){\parallel}_{2}}$ (1)
(P) Power transform. The power transform method [
1,
40] simply consists of taking the power of each coordinate of the feature vector. The formula is given by:
$f_{\phi}(\mathbf{x}) \leftarrow {\left(f_{\phi}(\mathbf{x})+\epsilon \right)}^{\beta}$ (2)
where
$\epsilon = 1\times {10}^{-6}$ is used to make sure that
${f}_{\phi}\left(\mathbf{x}\right)+\epsilon $ is strictly positive in every position, and
$\beta $ is a hyperparameter. The rationale of this preprocessing is that the power transform, often used in combination with Euclidean normalization, reduces the skew of the distribution and maps it to a close-to-Gaussian distribution, adjusted by
$\beta $. After experiments, we found that
$\beta =0.5$ gives the most consistent results across our considered experiments, corresponding to a square-root function that is widely used on features [
41]. We will analyze this ability and the effect of power transform in more detail in
Section 4. Note that the power transform can only be applied if the considered feature vectors contain non-negative entries, which will always be the case in the remainder of this work.
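As an illustration of this skew-reducing behavior, the following sketch applies the power transform with β = 0.5 to synthetic ReLU-like features (the exponential draw is only a stand-in for real backbone outputs; all values are illustrative):

```python
import numpy as np

def power_transform(features, beta=0.5, eps=1e-6):
    """Element-wise power transform; assumes non-negative inputs."""
    return np.power(features + eps, beta)

def skewness(x):
    """Sample skewness of a flattened array."""
    x = x.ravel()
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Stand-in for post-ReLU features: non-negative, concentrated near 0
rng = np.random.default_rng(0)
raw = rng.exponential(scale=0.1, size=(1000, 64))

transformed = power_transform(raw)
# The transformed distribution is markedly less skewed than the raw one
```

With β = 0.5 this is simply a square root, which compresses large values and stretches the mass near 0, pulling the distribution toward a Gaussian-like shape.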
(M) Mean subtraction. With mean subtraction, each sample is translated using
$\mathbf{m}\in {\left({\mathbb{R}}^{+}\right)}^{d}$, the projection center. This is often used in combination with Euclidean normalization in order to reduce the task bias and better align the feature distributions [
20]. The formula is given by:
$f_{\phi}(\mathbf{x}) \leftarrow f_{\phi}(\mathbf{x}) - \mathbf{m}$ (3)
The projection center is often computed as the mean values of feature vectors related to the problem [
20,
35]. In this paper, we compute it either as the mean feature vector of the base dataset (denoted as
${\mathrm{M}}_{\mathrm{b}}$) or the mean vector of the novel dataset (denoted as
${\mathrm{M}}_{\mathrm{n}}$), depending on the few-shot settings. In both cases, the rationale is to use a proxy for the exact mean of the feature vectors on the considered task.
In our proposed method, we deploy these preprocessing steps in the following order: Power transform (P) on the raw features, followed by a Euclidean normalization (E). Then, we perform mean subtraction (M) followed by another Euclidean normalization at the end. The resulting abbreviation is PEME, in which M can be either ${\mathrm{M}}_{\mathrm{b}}$ or ${\mathrm{M}}_{\mathrm{n}}$, as mentioned above. In our experiments, we found that using ${\mathrm{M}}_{\mathrm{b}}$ in the case of inductive fewshot learning and ${\mathrm{M}}_{\mathrm{n}}$ in the case of transductive fewshot learning consistently led to the most competitive results. More details on why we used this methodology are available in the experiment section.
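The chain above can be sketched in a few lines of NumPy (the helper names `l2n` and `peme` are ours for illustration, not from the original implementation):

```python
import numpy as np

def l2n(f):
    """(E) Euclidean normalization of each row (one row per sample)."""
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def peme(features, center=None, beta=0.5, eps=1e-6):
    """PEME preprocessing sketch: Power transform, Euclidean normalization,
    Mean subtraction, Euclidean normalization.

    `center` is the projection center m: the base-dataset mean (M_b) when
    provided, otherwise the mean of the given novel features (M_n).
    """
    f = np.power(features + eps, beta)   # (P) reduce skew
    f = l2n(f)                           # (E)
    m = f.mean(axis=0) if center is None else center
    f = f - m                            # (M) subtract projection center
    return l2n(f)                        # (E)

# Toy non-negative "backbone features" for a 5-way task (75 samples)
rng = np.random.default_rng(1)
feats = peme(rng.exponential(scale=0.1, size=(75, 64)))
```

The output vectors all lie on the unit hypersphere, which is what the later cosine-distance cost in Section 3.4 relies on.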
When facing an inductive problem, a simple classifier such as a Nearest-Class-Mean classifier (NCM) can be used directly after this preprocessing step. The resulting methodology is denoted PE${\mathrm{M}}_{\mathrm{b}}$E-NCM. However, in the case of transductive settings, we also introduce an iterative procedure, denoted BMS for Boosted Min-size Sinkhorn, meant to leverage the joint distribution of unlabeled samples. The resulting methodology is denoted PE${\mathrm{M}}_{\mathrm{n}}$E-BMS. The details of the BMS procedure are presented thereafter.
3.4. Boosted Min-size Sinkhorn
In the case of transductive few-shot learning, we introduce a method that iteratively refines the estimated probability that each unlabeled sample belongs to each of the considered classes. This method is largely based on the one we introduced in [
1], except it does not require priors about sample distributions in each of the considered classes. Denoting
$i\in [1,\dots ,l+u]$ as the sample index in
${\mathbf{D}}_{novel}$ and
$j\in [1,\dots ,n]$ as the class index, the goal is to maximize the following log-posterior function:
$L(\theta) = \sum_{i=1}^{l+u} \log P\big(l(\mathbf{x}_{i}) \mid \mathbf{x}_{i};\theta \big) = \sum_{i=1}^{l+u} \log \frac{P\big(\mathbf{x}_{i} \mid l(\mathbf{x}_{i});\theta \big)\, P\big(l(\mathbf{x}_{i})\big)}{P(\mathbf{x}_{i};\theta)}$ (4)
Here,
$l\left({\mathbf{x}}_{i}\right)$ denotes the class label for sample
${\mathbf{x}}_{i}\in \mathbf{Q}\cup \mathbf{S}$,
$P({\mathbf{x}}_{i};\theta )$ denotes the marginal probability, and
$\theta $ represents the model parameters to estimate. Assuming a Gaussian distribution on the input features for each class, here we define
$\theta =\{{\mathbf{w}}_{j}\}_{j=1}^{n}$, where
${\mathbf{w}}_{j}\in {\mathbb{R}}^{d}$ stands for the weight parameters for class
j. We observe that Equation (
4) can be related to the cost function utilized in optimal transport [
42], which is often considered to solve classification problems with constraints on the sample distribution over classes. To that end, we use the well-known Sinkhorn [
43] mapping method. The algorithm computes a class allocation matrix over the novel-class data for a minimum Wasserstein distance. Namely, an allocation matrix
$\mathbf{P}\in {\mathbb{R}}_{+}^{(l+u)\times n}$ is defined where
$\mathbf{P}[i,j]$ denotes the assigned portion for sample
i to class
j, and it is computed as follows:
$\tilde{\mathbf{P}} = \underset{\mathbf{P}\in \mathbb{U}(\mathbf{p},\mathbf{q})}{\mathrm{arg\,min}}\ \sum_{i,j}\mathbf{P}[i,j]\,\mathbf{C}[i,j] + \frac{1}{\lambda} H(\mathbf{P})$ (5)
where
$\mathbb{U}(\mathbf{p},\mathbf{q})\subset {\mathbb{R}}_{+}^{(l+u)\times n}$ is the set of positive matrices whose rows sum to
$\mathbf{p}$ and the columns sum to
$\mathbf{q}$,
$\mathbf{p}$ denotes the distribution of the amount that each sample uses for class allocation, and
$\mathbf{q}$ denotes the distribution of the amount of samples allocated to each class. Therefore,
$\mathbb{U}(\mathbf{p},\mathbf{q})$ contains all the possible ways of allocation. In the same equation,
$\mathbf{C}$ can be viewed as a cost matrix of the same size as
$\mathbf{P}$; each element in
$\mathbf{C}$ indicates the cost of its corresponding position in
$\mathbf{P}$. We will define the particular formula of the cost function for each position
$\mathbf{C}[i,j],\forall i,j$ in detail later in this section. As for the second term on the right of (
5), it stands for the entropy of
$\tilde{\mathbf{P}}$:
$H(\tilde{\mathbf{P}})={\sum}_{i,j}\tilde{\mathbf{P}}[i,j]\log \tilde{\mathbf{P}}[i,j]$, regularized by a hyperparameter
$\lambda $. Increasing
$\lambda $ would force the entropy to become smaller, so that the mapping is less diluted. This term also makes the objective function strictly convex [
43,
44] and thus practical and efficient to compute. From Lemma 2 in [
43], the result of the Sinkhorn allocation has the typical form
$\mathbf{P}=\mathrm{diag}\left(\mathbf{u}\right)\cdot \exp (-\lambda \mathbf{C})\cdot \mathrm{diag}\left(\mathbf{v}\right)$. It is worth noting that here we assume a soft class allocation, meaning that each sample can be “sliced” into different classes. We will present our proposed method in detail in the following paragraphs.
Given all that is presented above, in this paper, we propose an Expectation–Maximization (
EM) [
45] based method, which alternates between updating the allocation matrix
$\mathbf{P}$ and estimating the parameter
$\theta $ of the designed model, in order to minimize Equation (
5) and maximize Equation (
4). To begin, we define a weight matrix
$\mathbf{W}$ with
n columns (i.e., one per class) and
d rows (i.e., one per dimension of feature vectors), and for column
j in
$\mathbf{W}$, we denote it as the weight parameters
${\mathbf{w}}_{j}\in {\mathbb{R}}^{d}$ for class
j in correspondence with Equation (
4). It is initialized as follows:
$\mathbf{w}_{j} = \frac{\mathbf{z}_{j}}{\parallel \mathbf{z}_{j}{\parallel}_{2}}, \forall j$ (6)
where
$\mathbf{z}_{j} = \frac{1}{s}\sum_{\mathbf{x}\in \mathbf{S},\ l(\mathbf{x})=j} f_{\phi}(\mathbf{x})$ (7)
We can see that $\mathbf{W}$ contains the average of the support feature vectors for each class, followed by an L2-normalization of each column so that $\parallel {\mathbf{w}}_{j}{\parallel}_{2}=1,\forall j$.
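This initialization amounts to computing normalized class prototypes from the support set. A minimal sketch, assuming row-wise features and integer support labels (the helper name is ours):

```python
import numpy as np

def init_class_weights(support_feats, support_labels, n_classes):
    """W has one column per class: the L2-normalized mean of the
    preprocessed support vectors belonging to that class."""
    d = support_feats.shape[1]
    W = np.zeros((d, n_classes))
    for j in range(n_classes):
        z_j = support_feats[support_labels == j].mean(axis=0)  # mean support vector
        W[:, j] = z_j / np.linalg.norm(z_j)                    # unit-norm column
    return W

rng = np.random.default_rng(2)
S_feats = rng.normal(size=(25, 16))        # toy 5-way, 5-shot support set
S_labels = np.repeat(np.arange(5), 5)
W = init_class_weights(S_feats, S_labels, 5)
```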
Then, we iterate multiple steps that we describe thereafter.
As previously stated, the proposed algorithm is an
EM-like one that iteratively updates the model parameters for optimal estimates. Therefore, this step, along with the Min-size Sinkhorn presented in the next step, is considered as the
Estep of our proposed method. The goal is to find membership probabilities for the input samples; namely, we compute
$\mathbf{P}$ that minimizes Equation (
5).
Here, we assume Gaussian distributions in which features in each class have the same variance and are independent of one another (covariance matrix
$\mathrm{\Sigma}=\mathbf{I}{\sigma}^{2}$). We observe that, ignoring the marginal probability, Equation (
4) can be boiled down to negative L2 distances between extracted samples
${f}_{\phi}\left({\mathbf{x}}_{i}\right),\forall i$ and
${\mathbf{w}}_{j},\forall j$, which is initialized in Equation (
6) in our proposed method. Therefore, based on the fact that
${\mathbf{w}}_{j}$ and
${f}_{\phi}\left({\mathbf{x}}_{i}\right)$ are both normalized to be unit length vectors (
${f}_{\phi}\left({\mathbf{x}}_{i}\right)$ being preprocessed using PEME introduced in the previous section), here we define the cost between sample
i and class
j as follows:
$\mathbf{C}[i,j] = 1 - f_{\phi}{(\mathbf{x}_{i})}^{\top}\,\mathbf{w}_{j}$ (8)
which corresponds to the cosine distance.
In [
1], we proposed a Wasserstein distance-based method in which the Sinkhorn algorithm is applied at each iteration so that the class prototypes are updated iteratively to find their best estimates. Although the method showed promising results, it relies on the distribution of the query set being known, e.g., a uniform distribution among classes in the query set. This is not ideal, given that any priors about
$\mathbf{Q}$ should supposedly remain unknown when applying a method. The methodology introduced in this paper can be seen as a generalization of that introduced in [
1] that does not require priors about
$\mathbf{Q}$.
In the classical settings, the Sinkhorn algorithm aims at finding the optimal matrix
$\mathbf{P}$, given the cost matrix
$\mathbf{C}$ and the regularization parameter
$\lambda $ presented in Equation (
5). Typically, it initializes
$\mathbf{P}$ from a softmax operation over the rows in
$\mathbf{C}$, then it iterates between normalizing columns and rows of
$\mathbf{P}$, until the resulting matrix becomes close to doubly stochastic according to
$\mathbf{p}$ and
$\mathbf{q}$. However, in our case, we do not know the distribution of samples over classes. To address this, we first introduce the parameter
k, initialized so that
$k\leftarrow s$, meant to track an estimate of the cardinality of the class containing the fewest samples in the considered task. Then, we propose the following modification to be applied to the matrix
$\mathbf{P}$ once initialized: we normalize each row as in the classical case but only normalize the columns of
$\mathbf{P}$ for which the sum is less than the previously computed min-size
k [
20]. This ensures that at least
k elements are allocated to each class, but not exactly
k samples as in the balanced case.
The principle of this modified Sinkhorn solution is presented in Algorithm 1.
Algorithm 1 Min-size Sinkhorn
Inputs: $\mathbf{C}$, $\mathbf{p}={\mathbf{1}}_{l+u}$, $\mathbf{q}=k{\mathbf{1}}_{n}$, $\lambda $
Initialization: $\mathbf{P}=\mathrm{Softmax}(-\lambda \mathbf{C})$ (row-wise)
for $iter=1$ to 50 do
 $\mathbf{P}[i,:]\leftarrow \mathbf{p}\left[i\right]\cdot \frac{\mathbf{P}[i,:]}{{\sum}_{j}\mathbf{P}[i,j]},\forall i$
 $\mathbf{P}[:,j]\leftarrow \mathbf{q}\left[j\right]\cdot \frac{\mathbf{P}[:,j]}{{\sum}_{i}\mathbf{P}[i,j]}$ if ${\sum}_{i}\mathbf{P}[i,j]<\mathbf{q}\left[j\right],\forall j$
end for
return $\mathbf{P}$
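A minimal NumPy sketch of this procedure (λ and the toy cost values are illustrative, not tuned):

```python
import numpy as np

def min_size_sinkhorn(C, k, lam=5.0, n_iters=50):
    """Min-size Sinkhorn sketch: rows are normalized to p = 1 at every
    step, but a column is only rescaled (up to q[j] = k) when its sum
    falls short of the current min-size estimate k."""
    n_samples, n_classes = C.shape
    p = np.ones(n_samples)        # each sample distributes a mass of 1
    q = k * np.ones(n_classes)    # at least k samples per class
    # row-wise softmax of the negated, scaled costs
    X = -lam * C
    E = np.exp(X - X.max(axis=1, keepdims=True))
    P = E / E.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        P *= (p / P.sum(axis=1))[:, None]              # normalize rows to p
        col = P.sum(axis=0)
        P *= np.where(col < q, q / col, 1.0)[None, :]  # boost deficient columns
    return P

rng = np.random.default_rng(3)
C = 1.0 - rng.uniform(size=(20, 4))  # toy cosine-distance costs
P = min_size_sinkhorn(C, k=3)
```

Because only columns below the min-size are rescaled, the column sums end up at least k, while columns that naturally attract more mass are left free, unlike the balanced case.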

This step is considered the
M-step of the proposed algorithm, in which we use a variant of the logistic regression algorithm to find the model parameter
$\theta $ in the form of weight parameters
${\mathbf{w}}_{j}$ for each class. Note that
${\mathbf{w}}_{j}$, if normalized, is equivalent to the prototype for class
j in this case. Given the fact that in Equation (
4), we also take into account the marginal probability, it can be further broken down as:
$P\big(l(\mathbf{x}_{i})=j \mid \mathbf{x}_{i};\theta \big) = \frac{\exp \big(-\frac{1}{2{\sigma}^{2}}\parallel f_{\phi}(\mathbf{x}_{i})-\mathbf{w}_{j}{\parallel}_{2}^{2}\big)}{\sum_{j^{\prime}=1}^{n}\exp \big(-\frac{1}{2{\sigma}^{2}}\parallel f_{\phi}(\mathbf{x}_{i})-\mathbf{w}_{j^{\prime}}{\parallel}_{2}^{2}\big)}$ (9)
We observe that Equation (
4) corresponds to applying a softmax function on the negative logits computed through an L2distance function between samples and class prototypes (normalized). This fits the formulation of a linear hypothesis between
${f}_{\phi}\left({\mathbf{x}}_{i}\right)$ and
${\mathbf{w}}_{j}$ for logit calculations, hence the rationale for utilizing logistic regression in our proposed method. Note that, contrary to classical logistic regression, we implement here a form of self-distillation. Indeed, we use the soft labels contained in
$\mathbf{P}$ instead of onehot class indicator targets, and these targets are refined iteratively.
The procedure of this step is as follows: now that we have a refined allocation matrix
$\mathbf{P}$, we first initialize the weights
${\mathbf{w}}_{j}$ as follows:
$\mathbf{w}_{j} = \frac{\mathbf{u}_{j}}{\parallel \mathbf{u}_{j}{\parallel}_{2}}, \forall j$ (10)
where
$\mathbf{u}_{j} = \frac{\sum_{i}\mathbf{P}[i,j]\, f_{\phi}(\mathbf{x}_{i})}{\sum_{i}\mathbf{P}[i,j]}$ (11)
We can see that elements in
$\mathbf{P}$ are used as coefficients for feature vectors to linearly adjust the class prototypes [
1]. Similar to Equation (
6), here
${\mathbf{w}}_{j}$ is the normalized, newly computed class prototype, a vector of length 1.
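This soft-weighted prototype update can be sketched as follows (helper name ours; NumPy stands in for the actual framework):

```python
import numpy as np

def update_prototypes(feats, P):
    """Columns of W become the P-weighted mean of the feature vectors,
    then each column is L2-normalized to unit length."""
    U = feats.T @ P                  # (d, n): sum_i P[i, j] * f(x_i)
    U = U / P.sum(axis=0)[None, :]   # divide by the mass allocated per class
    return U / np.linalg.norm(U, axis=0, keepdims=True)

rng = np.random.default_rng(4)
feats = rng.normal(size=(30, 8))     # toy preprocessed features
P = rng.uniform(size=(30, 3))        # toy soft allocations
W = update_prototypes(feats, P)
```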
Next, we further adjust the weights by applying a logistic regression, where the optimization is performed by minimizing the following loss:
$L(\mathbf{W}) = -\frac{1}{l+u}\sum_{i=1}^{l+u}\sum_{j=1}^{n}\mathbf{P}[i,j]\,\log \frac{\exp \left(\mathbf{S}[i,j]\right)}{\sum_{j^{\prime}=1}^{n}\exp \left(\mathbf{S}[i,j^{\prime}]\right)}$ (12)
where
$\mathbf{S}\in {\mathbb{R}}^{(l+u)\times n}$ contains the logits, and each element is computed as:
$\mathbf{S}[i,j] = \kappa \cdot {\mathbf{w}}_{j}^{\top} f_{\phi}(\mathbf{x}_{i})$ (13)
Note that $\kappa $ is a scaling parameter; it can also be seen as a temperature parameter that adjusts the confidence associated with each sample. It is learnt jointly with $\mathbf{W}$.
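The resulting objective is a cross-entropy between the soft targets in the allocation matrix and the softmax of the scaled logits. A sketch (κ is fixed here for illustration, although it is learnt in the actual procedure):

```python
import numpy as np

def soft_cross_entropy(feats, W, P, kappa=10.0):
    """Cross-entropy between soft targets P and the softmax of the
    scaled logits S[i, j] = kappa * w_j . f(x_i)."""
    S = kappa * (feats @ W)                      # logits, shape (l+u, n)
    S_shift = S - S.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = S_shift - np.log(np.exp(S_shift).sum(axis=1, keepdims=True))
    return -(P * log_softmax).sum(axis=1).mean()

# Sanity check: with all-zero logits the loss reduces to log(n)
P_uniform = np.full((4, 5), 0.2)
loss = soft_cross_entropy(np.zeros((4, 3)), np.zeros((3, 5)), P_uniform)
```

In the actual M-step, W and κ would then be updated by SGD on this loss for e epochs, re-projecting the columns of W to the unit hypersphere after each epoch.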
The deployed logistic regression comes with hyperparameters on its own. In our experiments, we use an SGD optimizer with a gradient step of
$0.1$ and
$0.8$ as the momentum parameter, and we train over
e epochs. Here, we point out that
$e\ge 0$ is considered an influential hyperparameter in our proposed algorithm:
$e=0$ indicates a simple update of
$\mathbf{W}$ as the normalized adjusted class prototypes (Equation (
10)) computed from
$\mathbf{P}$ in Equation (
11), without further adjustment by logistic regression. In addition, note that when
$e>0$, we project columns of
$\mathbf{W}$ to the unit hypersphere at the end of each epoch.
Estimating the class minimum size
We can now refine our estimate of the min-size
k for the next iteration. To this end, we first compute the predicted label of each sample as follows:
$\widehat{\ell}\left(\mathbf{x}_{i}\right) = \underset{j}{\mathrm{arg\,max}}\ \mathbf{P}[i,j],\ \forall i$ (14)
which can be seen as the current (temporary) class prediction.
Then, we compute:
$k \leftarrow \underset{1\le j\le n}{\min}\ {k}_{j}$ (15)
where
${k}_{j}=\#\{i,\widehat{\ell}\left({\mathbf{x}}_{i}\right)=j\}$,
$\#\{\cdot \}$ denoting the cardinality of a set.
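This min-size refinement reduces to an argmax over the allocation matrix followed by a class count (a sketch; helper name ours):

```python
import numpy as np

def estimate_min_size(P):
    """k = cardinality of the smallest class under hard argmax labels."""
    labels = P.argmax(axis=1)                        # temporary class predictions
    counts = np.bincount(labels, minlength=P.shape[1])
    return int(counts.min())                         # size of the smallest class

P = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7]])
k = estimate_min_size(P)   # labels are [0, 0, 1], so k = 1
```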
Summary of the proposed method: all steps of the proposed method are summarized in Algorithm 2. In our experiments, we also report the results obtained when using a prior about
$\mathbf{Q}$ as in [
1]. In this case,
k does not have to be estimated throughout the iterations and can be replaced with the actual exact targets for the Sinkhorn. We denote this priordependent version PE
${\mathrm{M}}_{\mathrm{n}}$E-BMS* (with an added *).
Algorithm 2 Boosted Min-size Sinkhorn (BMS)
Parameters: $\lambda ,e$
Inputs: preprocessed ${f}_{\phi}\left(\mathbf{x}\right)$, $\forall \mathbf{x}\in {\mathbf{D}}_{novel}=\mathbf{Q}\cup \mathbf{S}$
Initializations: $\mathbf{W}$ as the normalized mean vectors over the support set for each class (Equation (6)); min-size $k\leftarrow s$
for $iter=1$ to 20 do
 Compute the cost matrix $\mathbf{C}$ using $\mathbf{W}$ (Equation (8)). # E-step
 Apply Min-size Sinkhorn to compute $\mathbf{P}$ (Algorithm 1). # E-step
 Update the weights $\mathbf{W}$ using $\mathbf{P}$ with logistic regression (Equations (10)–(13)). # M-step
 Estimate the class predictions $\widehat{\ell}$ and the min-size $k$ using $\mathbf{P}$ (Equations (14) and (15)).
end for
return $\widehat{\ell}$
