1. Introduction
Self-supervised learning (SSL) has emerged as a dominant paradigm for representation learning by leveraging the underlying structure of data without the need for human-annotated labels [1,2]. Methods such as SimCLR [1], BYOL [3], VICReg [2], and Barlow Twins [4] have demonstrated remarkable performance by enforcing objectives such as invariance to augmentations, feature decorrelation, and variance preservation. However, relying on Euclidean representations in standard self-supervised learning objectives often assumes a relatively simple geometric structure in the latent space. After multiple layers of nonlinear transformation, this assumption becomes questionable, since latent representations are likely to inhabit a highly non-linear manifold, poorly characterized by standard second-order statistics or distances. This motivates our kernelized formulation, which enables learning in an implicitly defined high-dimensional feature space that captures the underlying manifold structure.
Several recent works have incorporated kernels into SSL objectives, e.g., [5,6,7,8,9]. These methods typically replace similarity metrics or introduce kernel-based dependence criteria within contrastive or predictive frameworks. In contrast, our approach performs a structural lifting of VICReg itself: the variance and covariance penalties are rederived from the covariance operator in RKHS, rather than modified heuristically. This distinction is important, as it preserves the collapse-prevention mechanism of VICReg while redefining its geometry.
Kernel methods, long celebrated for their ability to implicitly map data into high-dimensional feature spaces via the kernel trick [10], offer a compelling avenue to address this limitation. In supervised settings, the transformation from linear to nonlinear models via kernelization (exemplified by the transition from linear SVM to kernel SVM) has been foundational in classical machine learning. Inspired by this paradigm, we ask: can core SSL losses be systematically lifted into a Reproducing Kernel Hilbert Space (RKHS)? We answer this question by showing how Euclidean-space objectives can be replaced with their RKHS counterparts, exemplified through a kernelized version of VICReg.
We propose Kernel-VICReg, which computes invariance, variance, and covariance entirely in RKHS via double-centered kernels and Hilbert–Schmidt norms. While we focus on VICReg as a concrete example (and note that Barlow Twins admits an analogous kernelization), the same RKHS lifting could be extended to contrastive frameworks like SimCLR or predictive ones like BYOL with suitable cross-kernel formulations.
Experiments across multiple datasets show that Kernel-VICReg can yield more effective representations than its Euclidean counterpart under our evaluation protocol and can improve stability in settings where Euclidean VICReg collapses. These results suggest that integrating kernel methods into SSL is a promising direction.
1.1. Related Work: Kernel Methods in Self-Supervised Learning
While several recent SSL methods incorporate kernel functions, they fundamentally differ from Kernel VICReg in their application and scope. Existing approaches generally fall into the following categories:
Use of Nonlinear Dependence in SSL: Some methods, such as [5,8], use the Hilbert–Schmidt Independence Criterion (HSIC) [11] to model nonlinear dependence of samples or features in RKHS.
Kernels as Regularizers: Some methods, such as [9], integrate kernel tools, such as Maximum Mean Discrepancy (MMD) [12], into existing SSL losses to align feature distributions. In these cases, the kernel is used as a supplementary distance metric, but the underlying architecture remains Euclidean.
Implicit Kernel Analysis: Other works, such as [13], use kernels primarily as a theoretical lens to analyze the training dynamics or “spectral bias” of standard SSL objectives like SimCLR [1] or Barlow Twins [4].
In contrast, our proposed Kernel VICReg represents a systematic lifting of an entire loss function of an existing SSL method, i.e., VICReg, into the RKHS. Unlike methods that use kernels for specific terms, we kernelize the variance, invariance, and covariance terms simultaneously. This allows the model to operate on double-centered kernel matrices and Hilbert–Schmidt norms, capturing nonlinear dependencies across the entire loss function rather than just regularizing a single component. To the best of our knowledge, this is the first work to provide a complete kernelized derivation of the VICReg framework.
1.2. Positioning and Novelty
Recent works have explored the use of kernels in self-supervised learning, including kernel-based contrastive objectives and kernelized predictive frameworks. However, these methods typically replace similarity measures or dependence criteria within existing objectives (e.g., kernelized contrastive alignment or kernel dependence maximization), rather than systematically lifting the entire regularization structure of a non-contrastive SSL method into RKHS.
Our contribution is structurally different. We show that all three components of VICReg—invariance, variance preservation, and covariance decorrelation—admit principled RKHS counterparts derived from the covariance operator in Hilbert space. This yields a unified formulation in which:
1. invariance is expressed via cross-kernel trace distances,
2. variance preservation is tied to eigenvalues of the centered kernel matrix, and
3. covariance decorrelation becomes a Hilbert–Schmidt norm penalty.
To our knowledge, a full operator-level lifting of VICReg into RKHS using covariance operators and double-centered Gram matrices has not been previously derived. Importantly, our formulation is not a kernel substitution in the similarity function but a redefinition of the geometry in which the SSL objective is defined.
The proposed Kernel VICReg framework provides a more robust geometric constraint in the RKHS, while standard Euclidean methods may suffer from dimensional collapse when the projection head is not sufficiently wide. Our kernelized approach enhances robustness to representation collapse by leveraging the infinite-dimensional nature of the Hilbert space to maintain feature variance.
3. Kernel VICReg
Standard VICReg operates in Euclidean space by enforcing variance preservation, decorrelating feature dimensions, and ensuring consistency between views of the same sample. Our proposed Kernel VICReg extends these principles to the Reproducing Kernel Hilbert Space (RKHS), allowing for non-linear representations without explicit feature mappings. We rigorously derive the kernelized counterparts of VICReg’s variance, covariance, and invariance losses.
3.1. Covariance in RKHS
Before deriving the kernelized VICReg loss components, we establish the fundamental result that the covariance operator in RKHS is proportional to the double-centered kernel matrix. The formulations derived and explained in this section will be used in kernelizing the loss terms of VICReg.
Let $\phi(\boldsymbol{x})$ be the implicit feature mapping of a sample $\boldsymbol{x}$ in the RKHS $\mathcal{H}$, where $\phi: \mathcal{X} \to \mathcal{H}$ is the pulling function into RKHS. The kernel function in the RKHS $\mathcal{H}$ is defined as:
$$k(\boldsymbol{x}_i, \boldsymbol{x}_j) := \langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle_{\mathcal{H}},$$
where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in the RKHS $\mathcal{H}$.
The covariance operator in RKHS is defined as [14,15]:
$$\boldsymbol{C}_{\mathcal{H}} := \frac{1}{b} \sum_{i=1}^{b} \big(\phi(\boldsymbol{x}_i) - \boldsymbol{\mu}_{\phi}\big)\big(\phi(\boldsymbol{x}_i) - \boldsymbol{\mu}_{\phi}\big)^{\top} \in \mathbb{R}^{t \times t},$$
where $t$ is the dimensionality of the RKHS (possibly infinite), $\breve{\boldsymbol{\Phi}} := [\phi(\boldsymbol{x}_1) - \boldsymbol{\mu}_{\phi}, \dots, \phi(\boldsymbol{x}_b) - \boldsymbol{\mu}_{\phi}]$ is the centered feature map of the batch, and $\boldsymbol{\mu}_{\phi} := \frac{1}{b}\sum_{i=1}^{b}\phi(\boldsymbol{x}_i)$ is the mean feature vector of the batch in RKHS. If we let $\boldsymbol{\Phi} := [\phi(\boldsymbol{x}_1), \dots, \phi(\boldsymbol{x}_b)]$, then Equation (8) can be restated in matrix form:
$$\boldsymbol{C}_{\mathcal{H}} = \frac{1}{b}\,\breve{\boldsymbol{\Phi}}\,\breve{\boldsymbol{\Phi}}^{\top} = \frac{1}{b}\,\boldsymbol{\Phi}\,\boldsymbol{H}\,\boldsymbol{\Phi}^{\top},$$
where $\boldsymbol{H} := \boldsymbol{I} - \frac{1}{b}\boldsymbol{1}\boldsymbol{1}^{\top}$ is the centering matrix and $\breve{\boldsymbol{\Phi}} = \boldsymbol{\Phi}\boldsymbol{H}$.
Using the kernel trick, we define the kernel matrix $\boldsymbol{K} \in \mathbb{R}^{b \times b}$ whose $(i,j)$-th element is:
$$K_{ij} := k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \phi(\boldsymbol{x}_i)^{\top}\phi(\boldsymbol{x}_j).$$
From Equation (9), the kernel matrix can be stated as:
$$\boldsymbol{K} = \boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}.$$
The double-centered kernel matrix is:
$$\breve{\boldsymbol{K}} := \boldsymbol{H}\boldsymbol{K}\boldsymbol{H} = \boldsymbol{H}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\boldsymbol{H} \overset{(a)}{=} \boldsymbol{H}^{\top}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\boldsymbol{H} = (\boldsymbol{\Phi}\boldsymbol{H})^{\top}(\boldsymbol{\Phi}\boldsymbol{H}) = \breve{\boldsymbol{\Phi}}^{\top}\breve{\boldsymbol{\Phi}},$$
where step $(a)$ considers that the centering matrix $\boldsymbol{H}$ is symmetric. The double-centered kernel matrix will be used in the following.
The following explains the relation of the covariance operator and the double-centered kernel in RKHS. The squared Hilbert–Schmidt norm of the covariance operator in RKHS is:
$$\|\boldsymbol{C}_{\mathcal{H}}\|_{\mathrm{HS}}^2 = \mathrm{tr}\big(\boldsymbol{C}_{\mathcal{H}}^{\top}\boldsymbol{C}_{\mathcal{H}}\big) = \frac{1}{b^2}\,\mathrm{tr}\big(\boldsymbol{\Phi}\boldsymbol{H}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\boldsymbol{H}\boldsymbol{\Phi}^{\top}\big) \overset{(a)}{=} \frac{1}{b^2}\,\mathrm{tr}\big(\boldsymbol{H}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\boldsymbol{H}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\big) = \frac{1}{b^2}\,\mathrm{tr}\big(\breve{\boldsymbol{K}}^{2}\big), \qquad (14)$$
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and step $(a)$ follows from the cyclic property of the trace operator. For easier computation on a computer, it is possible to restate Equation (14) using the Frobenius norm:
$$\|\boldsymbol{C}_{\mathcal{H}}\|_{\mathrm{HS}}^2 = \frac{1}{b^2}\,\|\breve{\boldsymbol{K}}\|_F^2. \qquad (15)$$
From Equation (14), the covariance operator in RKHS is proportional to the double-centered kernel matrix, in the sense that they share the same nonzero spectrum up to the factor $1/b$:
$$\lambda_i(\boldsymbol{C}_{\mathcal{H}}) = \frac{1}{b}\,\lambda_i(\breve{\boldsymbol{K}}), \qquad \|\boldsymbol{C}_{\mathcal{H}}\|_{\mathrm{HS}} = \frac{1}{b}\,\|\breve{\boldsymbol{K}}\|_F. \qquad (16)$$
This key result enables the kernelization of VICReg’s variance and covariance regularization terms, as discussed next.
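To make this correspondence concrete, the following minimal PyTorch sketch (ours, not the authors' released code) builds a linear-kernel Gram matrix for a random batch, double-centers it, and verifies numerically that the nonzero eigenvalues of $\breve{\boldsymbol{K}}/b$ coincide with the eigenvalues of the empirical covariance matrix:

```python
import torch

torch.manual_seed(0)
b, p = 8, 5                                # batch size, embedding dimension
Z = torch.randn(b, p)                      # batch of embeddings (rows are samples)

# Empirical covariance matrix C = (1/b) Zc^T Zc, with Zc the column-centered batch.
Zc = Z - Z.mean(dim=0, keepdim=True)
C = Zc.T @ Zc / b                          # (p x p)

# Linear-kernel Gram matrix K = Z Z^T and its double-centered version H K H.
K = Z @ Z.T                                # (b x b)
H = torch.eye(b) - torch.ones(b, b) / b
K_centered = H @ K @ H

# The nonzero eigenvalues of K_centered / b coincide with the eigenvalues of C.
print(torch.linalg.eigvalsh(C))                    # p eigenvalues of the covariance matrix
print(torch.linalg.eigvalsh(K_centered)[-p:] / b)  # matching top-p kernel eigenvalues
```

With a nonlinear kernel, the covariance matrix can no longer be formed explicitly, but the kernel-side quantities in this check remain computable, which is exactly what the losses below exploit.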
3.2. Kernelized Variance Regularization
The variance regularization term in VICReg prevents representation collapse by ensuring that the variance along each feature dimension remains sufficiently large, i.e., above a threshold $\gamma$. In RKHS, the variance of feature dimensions corresponds to the eigenvalues of $\boldsymbol{C}_{\mathcal{H}}$. Since $\boldsymbol{C}_{\mathcal{H}}$ is proportional to $\breve{\boldsymbol{K}}$ according to Equation (16), we define the kernelized variance loss as:
$$\mathcal{L}_{\mathrm{var}} := \frac{1}{b}\sum_{i=1}^{b}\max\!\Big(0,\; \gamma - \sqrt{\tfrac{\lambda_i}{b} + \epsilon}\Big), \qquad (17)$$
where $\max(0, \cdot)$ is the standard hinge loss, $\lambda_i$ are the eigenvalues of $\breve{\boldsymbol{K}}$, $\gamma$ is a threshold for the minimum desired standard deviation, and $\epsilon$ is a small positive number preventing numerical instabilities. The proof of why $\lambda_i / b$ is understood as a variance in RKHS will be developed in Section 4.1.
It is noteworthy that computing the eigenvalues here is not a concern in terms of computation time because the double-centered kernel matrix is $b \times b$, where $b$ is the batch size, which is usually not a very large number.
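As an illustration, a minimal PyTorch sketch of this term (our own; helper names and default values such as `gamma=1.0` are illustrative, and the normalization follows the form of Equation (17) above):

```python
import torch

def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """RBF Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    return torch.exp(-torch.cdist(x, y) ** 2 / (2.0 * sigma ** 2))

def double_center(K: torch.Tensor) -> torch.Tensor:
    """Double-center a Gram matrix: H K H with H = I - (1/b) 1 1^T."""
    b = K.shape[0]
    H = torch.eye(b, device=K.device) - torch.ones(b, b, device=K.device) / b
    return H @ K @ H

def kernel_variance_loss(K_centered: torch.Tensor, gamma: float = 1.0,
                         eps: float = 1e-6) -> torch.Tensor:
    """Hinge penalty on RKHS standard deviations, estimated as sqrt(lambda_i / b + eps)."""
    b = K_centered.shape[0]
    eigvals = torch.linalg.eigvalsh(K_centered).clamp(min=0.0)  # ascending, clipped for safety
    std = torch.sqrt(eigvals / b + eps)
    return torch.relu(gamma - std).mean()

# Usage: z holds the projector outputs of one augmented view.
z = torch.randn(256, 128)
loss_var = kernel_variance_loss(double_center(rbf_kernel(z, z)))
```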
3.3. Kernelized Covariance Regularization
To prevent redundancy in representations, VICReg penalizes off-diagonal elements of the covariance matrix (see Equation (5)). Building on Equation (15), the kernelized covariance loss can be defined as:
$$\mathcal{L}_{\mathrm{cov}} := \|\boldsymbol{C}_{\mathcal{H}}\|_{\mathrm{HS}} = \frac{1}{b}\,\|\breve{\boldsymbol{K}}\|_F. \qquad (18)$$
Because of the direct relation between covariance and correlation, this regularization enforces decorrelation between features in RKHS.
The choice of using the Hilbert–Schmidt norm $\|\boldsymbol{C}_{\mathcal{H}}\|_{\mathrm{HS}}$ in Equation (18) rather than its square $\|\boldsymbol{C}_{\mathcal{H}}\|_{\mathrm{HS}}^2$ (as initially suggested by the expansion in Equation (15)) is a deliberate design choice aimed at improving optimization stability. Mathematically, if $\sigma_i$ are the singular values of the covariance operator, the squared norm $\sum_i \sigma_i^2$ penalizes large correlations quadratically, which can lead to vanishing gradients for smaller correlation values during the late stages of training. By using the square root form (the norm itself), the gradient magnitude remains more consistent:
$$\frac{\partial}{\partial \sigma_i}\Big(\sum_j \sigma_j^2\Big) = 2\sigma_i, \qquad \text{whereas} \qquad \frac{\partial}{\partial \sigma_i}\sqrt{\sum_j \sigma_j^2} = \frac{\sigma_i}{\sqrt{\sum_j \sigma_j^2}}.$$
Empirically, we observed that this formulation provides a more balanced optimization landscape, preventing the covariance term from being dominated by a few large off-diagonal correlations and ensuring that all dimensions of the RKHS embedding are decorrelated effectively. This is consistent with recent findings in SSL literature suggesting that normalization of loss terms can lead to smoother convergence and better numerical stability in high-dimensional feature spaces.
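The effect can be seen directly with autograd in a small sketch (ours; a linear kernel and an artificially small-scale batch stand in for a late-training, nearly decorrelated state):

```python
import torch

def kernel_covariance_loss(K_centered: torch.Tensor, squared: bool = False) -> torch.Tensor:
    """Hilbert-Schmidt penalty ||C_H||_HS = ||K_centered||_F / b, optionally squared."""
    b = K_centered.shape[0]
    hs = torch.linalg.matrix_norm(K_centered, ord="fro") / b
    return hs ** 2 if squared else hs

torch.manual_seed(0)
b = 256
z = (0.05 * torch.randn(b, 128)).requires_grad_(True)   # small embeddings: weak correlations
H = torch.eye(b) - torch.ones(b, b) / b
Kc = H @ (z @ z.T) @ H                                   # linear kernel, double-centered

g_norm, = torch.autograd.grad(kernel_covariance_loss(Kc), z, retain_graph=True)
g_sq,   = torch.autograd.grad(kernel_covariance_loss(Kc, squared=True), z)
print(g_norm.norm().item(), g_sq.norm().item())          # the squared form's gradient is far smaller
```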
3.4. Kernelized Invariance Term
The invariance term of the loss function minimizes the mean squared error between corresponding samples $\boldsymbol{z}_i$ and $\boldsymbol{z}'_i$, i.e., different views of the same sample. Consider the following three kernel matrices:
$$(\boldsymbol{K}_{zz})_{ij} := k(\boldsymbol{z}_i, \boldsymbol{z}_j), \qquad (\boldsymbol{K}_{z'z'})_{ij} := k(\boldsymbol{z}'_i, \boldsymbol{z}'_j), \qquad (\boldsymbol{K}_{zz'})_{ij} := k(\boldsymbol{z}_i, \boldsymbol{z}'_j).$$
Given the kernel matrices $\boldsymbol{K}_{zz}$ and $\boldsymbol{K}_{z'z'}$ for the two augmented views and their cross-kernel matrix $\boldsymbol{K}_{zz'}$, the distance of the views in RKHS is defined as [16]:
$$\mathcal{L}_{\mathrm{inv}} := \frac{1}{b}\Big(\mathrm{tr}(\boldsymbol{K}_{zz}) + \mathrm{tr}(\boldsymbol{K}_{z'z'}) - 2\,\mathrm{tr}(\boldsymbol{K}_{zz'})\Big) = \frac{1}{b}\sum_{i=1}^{b}\big\|\phi(\boldsymbol{z}_i) - \phi(\boldsymbol{z}'_i)\big\|_{\mathcal{H}}^2.$$
This loss term pushes the corresponding, i.e., augmented, instances toward each other in the RKHS and pulls the non-corresponding instances away from each other. This enforces consistency across augmentations in RKHS.
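A minimal sketch of this term under the definitions above (ours; only the diagonals of the three Gram matrices are actually needed for the traces):

```python
import torch

def rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    return torch.exp(-torch.cdist(x, y) ** 2 / (2.0 * sigma ** 2))

def kernel_invariance_loss(z: torch.Tensor, z_prime: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """(1/b) [tr(K_zz) + tr(K_z'z') - 2 tr(K_zz')] = mean squared RKHS distance between views."""
    b = z.shape[0]
    tr_zz = rbf(z, z, sigma).diagonal().sum()              # = b for the RBF kernel
    tr_pp = rbf(z_prime, z_prime, sigma).diagonal().sum()  # = b for the RBF kernel
    tr_zp = rbf(z, z_prime, sigma).diagonal().sum()        # similarity of corresponding pairs
    return (tr_zz + tr_pp - 2.0 * tr_zp) / b

z, z_prime = torch.randn(256, 128), torch.randn(256, 128)
loss_inv = kernel_invariance_loss(z, z_prime)
```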
3.5. Overall Kernel VICReg Loss
Combining all three regularization terms, the final Kernel VICReg loss is given by:
$$\mathcal{L}_{\mathrm{KVICReg}} := \lambda\,\mathcal{L}_{\mathrm{inv}} + \mu\,\mathcal{L}_{\mathrm{var}} + \nu\,\mathcal{L}_{\mathrm{cov}}, \qquad (21)$$
where $\lambda$, $\mu$, and $\nu$ are hyperparameters controlling the contributions of invariance, variance, and covariance. Note that the best values of $\lambda$, $\mu$, and $\nu$ differ between VICReg and Kernel VICReg. The best hyperparameters can be found per dataset, as is done for the original VICReg.
By reformulating VICReg in RKHS, our method enables self-supervised learning in high-dimensional implicit feature spaces without explicit feature extraction, making it a powerful framework for non-linear representation learning.
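Putting the three terms together, the following end-to-end sketch illustrates the objective (our own composition; the default weights mirror common VICReg settings, and the per-view averaging of the variance and covariance terms is an assumption rather than a detail stated above):

```python
import torch

def rbf(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2.0 * sigma ** 2))

def double_center(K):
    b = K.shape[0]
    H = torch.eye(b, device=K.device) - torch.ones(b, b, device=K.device) / b
    return H @ K @ H

def kernel_vicreg_loss(z, zp, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-6, sigma=1.0):
    b = z.shape[0]
    K_zz, K_pp, K_zp = rbf(z, z, sigma), rbf(zp, zp, sigma), rbf(z, zp, sigma)

    # Invariance: mean squared RKHS distance between corresponding views.
    inv = (K_zz.diagonal().sum() + K_pp.diagonal().sum() - 2.0 * K_zp.diagonal().sum()) / b

    # Variance and covariance terms, computed per view and averaged.
    var, cov = 0.0, 0.0
    for K in (K_zz, K_pp):
        Kc = double_center(K)
        eig = torch.linalg.eigvalsh(Kc).clamp(min=0.0)
        var = var + 0.5 * torch.relu(gamma - torch.sqrt(eig / b + eps)).mean()
        cov = cov + 0.5 * torch.linalg.matrix_norm(Kc, ord="fro") / b

    return lam * inv + mu * var + nu * cov

z, zp = torch.randn(256, 128), torch.randn(256, 128)   # embeddings of two augmented views
loss = kernel_vicreg_loss(z, zp)
```

In practice these embeddings come from the shared encoder and projector applied to the two augmented views, and the loss is minimized with standard stochastic gradient methods (see Section 4.5).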
4. Discussions
4.1. Relation of Kernelized Variance Term with Kernel PCA
There is a close relation between the kernelized variance term in Kernel VICReg and kernel Principal Component Analysis (kernel PCA) [14]. In standard PCA, the eigenvalues of the covariance matrix $\boldsymbol{C}$ are calculated using the following eigenvalue problem:
$$\boldsymbol{C}\,\boldsymbol{u}_i = \lambda_i\,\boldsymbol{u}_i,$$
where $\lambda_i$ and $\boldsymbol{u}_i$ are the $i$-th eigenvalue and eigenvector of the covariance matrix $\boldsymbol{C}$, respectively.
In kernel PCA, the eigenvalue problem of the double-centered kernel matrix $\breve{\boldsymbol{K}}$ is considered:
$$\breve{\boldsymbol{K}}\,\boldsymbol{v}_i = \lambda_i\,\boldsymbol{v}_i, \qquad (23)$$
where $\lambda_i$ and $\boldsymbol{v}_i$ are the $i$-th eigenvalue and eigenvector of the double-centered kernel matrix $\breve{\boldsymbol{K}}$, respectively.
Each eigenvector $\boldsymbol{v}_i$ gives a principal direction in the feature space $\mathcal{H}$. According to the representation theory, any function in the RKHS lies in the span of all points in the RKHS [17]:
$$\boldsymbol{U} = \breve{\boldsymbol{\Phi}}\,\boldsymbol{A},$$
where $\breve{\boldsymbol{\Phi}}$ is the centered feature matrix from Equation (9) and $\boldsymbol{A}$ is the matrix of coefficients in the linear combination.
On the one hand, the variance along the principal directions in the feature space is:
$$\mathrm{var} = \mathrm{tr}\big(\boldsymbol{U}^{\top}\boldsymbol{C}_{\mathcal{H}}\boldsymbol{U}\big) = \frac{1}{b}\,\mathrm{tr}\big(\boldsymbol{A}^{\top}\breve{\boldsymbol{\Phi}}^{\top}\breve{\boldsymbol{\Phi}}\breve{\boldsymbol{\Phi}}^{\top}\breve{\boldsymbol{\Phi}}\boldsymbol{A}\big) = \frac{1}{b}\,\mathrm{tr}\big(\boldsymbol{A}^{\top}\breve{\boldsymbol{K}}^{2}\boldsymbol{A}\big).$$
For one of the coefficient vectors $\boldsymbol{\alpha}_i$, this equation becomes:
$$\mathrm{var}_i = \frac{1}{b}\,\boldsymbol{\alpha}_i^{\top}\breve{\boldsymbol{K}}^{2}\boldsymbol{\alpha}_i,$$
where the trace is dropped because the trace of a scalar is equal to itself.
Assume that the coefficient vector $\boldsymbol{\alpha}_i$ is (proportional to) the $i$-th eigenvector $\boldsymbol{v}_i$ of the double-centered kernel matrix. Thus, the variance becomes:
$$\mathrm{var}_i = \frac{1}{b}\,\boldsymbol{v}_i^{\top}\breve{\boldsymbol{K}}^{2}\boldsymbol{v}_i. \qquad (26)$$
Squaring the double-centered kernel matrix in Equation (23) gives:
$$\breve{\boldsymbol{K}}^{2}\boldsymbol{v}_i = \lambda_i^{2}\,\boldsymbol{v}_i. \qquad (27)$$
Substituting Equation (27) in Equation (26) provides:
$$\mathrm{var}_i = \frac{\lambda_i^{2}}{b}\,\boldsymbol{v}_i^{\top}\boldsymbol{v}_i. \qquad (28)$$
The coefficient vector is usually normalized to have $\lambda_i\,\boldsymbol{v}_i^{\top}\boldsymbol{v}_i = 1$ (so that the principal direction $\boldsymbol{u}_i = \breve{\boldsymbol{\Phi}}\boldsymbol{v}_i$ has unit length in the feature space):
$$\boldsymbol{v}_i^{\top}\boldsymbol{v}_i = \frac{1}{\lambda_i}, \qquad (29)$$
where the eigenvector in the feature space is assumed to be normalized to have unit length, i.e., $\|\boldsymbol{u}_i\|_{\mathcal{H}} = 1$. According to Equation (29), Equation (28) becomes:
$$\mathrm{var}_i = \frac{\lambda_i}{b}. \qquad (30)$$
This proves why $\lambda_i / b$ is understood as a variance in RKHS and used in Equation (17).
On the other hand, left-multiplying Equation (23) by $\boldsymbol{v}_i^{\top}$ gives:
$$\boldsymbol{v}_i^{\top}\breve{\boldsymbol{K}}\,\boldsymbol{v}_i = \lambda_i\,\boldsymbol{v}_i^{\top}\boldsymbol{v}_i.$$
Therefore, if the eigenvector is instead normalized to have unit length, i.e., $\boldsymbol{v}_i^{\top}\boldsymbol{v}_i = 1$, then:
$$\lambda_i = \boldsymbol{v}_i^{\top}\breve{\boldsymbol{K}}\,\boldsymbol{v}_i. \qquad (31)$$
Putting Equation (31) in Equation (30) provides:
$$\mathrm{var}_i = \frac{1}{b}\,\boldsymbol{v}_i^{\top}\breve{\boldsymbol{K}}\,\boldsymbol{v}_i,$$
which expresses the RKHS variance along the $i$-th principal direction directly as a quadratic form of the double-centered kernel matrix.
This analysis shows that the kernelized variance regularization in the proposed Kernel VICReg can be considered as kernel PCA. A similar analysis can be discussed to analyze the variance regularization term in VICReg as PCA.
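This relation can be checked numerically with an explicit (linear-kernel) example (a small sketch, ours): scaling the coefficient vector so that the feature-space direction has unit norm makes the variance of the projected data equal to $\lambda_i / b$.

```python
import torch

torch.manual_seed(0)
b, p = 64, 10
Z = torch.randn(b, p)
Zc = Z - Z.mean(dim=0, keepdim=True)

# Double-centered (linear) kernel matrix and its top eigenpair.
H = torch.eye(b) - torch.ones(b, b) / b
Kc = H @ (Z @ Z.T) @ H
eigvals, eigvecs = torch.linalg.eigh(Kc)            # ascending order
lam, v = eigvals[-1], eigvecs[:, -1]

# Kernel-PCA scaling: alpha = v / sqrt(lam) gives a unit-norm direction u in feature space.
alpha = v / torch.sqrt(lam)
u = Zc.T @ alpha                                     # explicit principal direction (linear kernel)
projections = Zc @ u                                 # centered projections of the batch onto u

print(projections.pow(2).mean().item())              # empirical variance along u
print((lam / b).item())                              # equals lambda / b
```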
4.2. Connection to HSIC and Independence
The squared Hilbert–Schmidt norm of the RKHS covariance operator (used in our covariance loss) is closely related to the Hilbert–Schmidt Independence Criterion (HSIC), a well-established kernel-based dependence measure. In this view, our covariance loss can be interpreted as minimizing feature dependence in RKHS, encouraging the learning of diverse and disentangled features. This theoretical grounding strengthens the regularization effect of the Kernel VICReg loss beyond simple decorrelation.
4.3. Kernel Choice as Inductive Bias
Different kernels induce different geometric priors. For example, the RBF kernel emphasizes local smoothness, the Laplacian kernel allows sharper decision boundaries, and the rational quadratic interpolates between them. Our experiments reveal that no single kernel is optimal across all datasets; instead, performance depends on the match between the dataset structure and the kernel-induced geometry. This makes Kernel VICReg not only robust but also adaptable to task-specific data distributions. A compatible extension to reduce sensitivity to a single kernel choice is to use kernel mixtures, e.g., a convex combination of RBF and Laplacian kernels, or more broadly multiple kernel learning (MKL); see the sketch below. This is orthogonal to our main contribution (lifting VICReg into RKHS), since the objective depends on Gram matrices and can directly operate on a mixture Gram matrix.
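For instance, under this view the mixture is just another Gram matrix (sketch, ours; the mixing weight `alpha` is a hypothetical hyperparameter):

```python
import torch

def rbf(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2.0 * sigma ** 2))

def laplacian(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y, p=1) / sigma)

def mixture_gram(x, y, alpha=0.5, sigma=1.0):
    """Convex combination of RBF and Laplacian Gram matrices (still a valid PSD kernel)."""
    return alpha * rbf(x, y, sigma) + (1.0 - alpha) * laplacian(x, y, sigma)

z = torch.randn(256, 128)
K = mixture_gram(z, z, alpha=0.3)   # drop-in replacement for a single-kernel Gram matrix
```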
4.4. Comparison of Approaches in Kernel VICReg and Graph-Based Embedding
There exist recent works on kernel–graph integration and spectral clustering (e.g., [18,19]) analyzing how kernel methods interact with graph structure learning and spectral objectives. These approaches focus primarily on unsupervised clustering or graph-based embedding construction. In contrast, our work does not construct or learn a graph structure; instead, we lift a non-contrastive self-supervised regularization objective into RKHS. The role of kernels here is not spectral clustering, but redefining variance and covariance operators in an implicit feature space. Our formulation in Kernel VICReg is therefore complementary rather than competitive with kernel–graph hybrids.
4.5. Comparison of Optimization Objective in Kernel VICReg and Variational Inference
Although our formulation pulls the VICReg objective to an RKHS via kernel covariance operators, the resulting loss remains a deterministic, differentiable functional of the network parameters. Unlike variational self-supervised methods that introduce latent variables and evidence lower bounds (ELBO), our objective does not involve probabilistic latent modeling or variational inference.
Concretely, the kernelized variance and covariance terms are computed through empirical covariance operators constructed from mini-batch embeddings. These operators depend smoothly on the encoder parameters, and gradients are obtained via automatic differentiation.
Therefore, the optimization problem is a standard stochastic minimization of the deterministic loss in Equation (21), optimized using stochastic gradient descent. No variational bound, alternating optimization, or EM-style procedure is introduced by the RKHS lifting.
4.6. Theoretical Properties of Kernel VICReg
4.6.1. Non-Collapse in RKHS
Proposition 1 (Non-Collapse in RKHS). Let $\breve{\boldsymbol{K}}$ denote the double-centered kernel matrix of a batch. If the kernelized variance regularization enforces
$$\lambda_i \;\ge\; b\,(\gamma^2 - \epsilon) \;>\; 0$$
for all eigenvalues $\lambda_i$ of $\breve{\boldsymbol{K}}$ on the centered batch span (i.e., all but the trivial zero eigenvalue), then the covariance operator in RKHS is strictly positive definite on the span of the batch, and representational collapse (i.e., rank-one embedding) is prevented.
Proof. From Equation (16), we have $\lambda_i(\boldsymbol{C}_{\mathcal{H}}) = \lambda_i(\breve{\boldsymbol{K}})/b$. If all such eigenvalues satisfy $\lambda_i > 0$, then $\breve{\boldsymbol{K}}$ is full rank on the batch span. Thus the covariance operator has a strictly positive spectrum, implying that no direction in RKHS has zero variance. A collapsed representation corresponds to $\lambda_i = 0$ for all but at most one $i$, which contradicts the enforced lower bound. □
Remark 1. Proposition 1 demonstrates that Kernel VICReg enforces spectral spread in RKHS, while Euclidean VICReg only enforces coordinate-wise variance. This is a theoretical distinction between VICReg and Kernel VICReg.
4.6.2. Nonlinear Variance Capture in RKHS
Theorem 1 (Nonlinear Variance Capture in RKHS). Let $\mathcal{M} \subset \mathbb{R}^p$ be a compact nonlinear manifold. Assume that $\mathcal{M}$ is not contained in any proper affine subspace of $\mathbb{R}^p$, but its nonlinear structure cannot be captured by second-order Euclidean statistics (i.e., PCA does not linearize $\mathcal{M}$).
Let $k$ be a universal kernel (e.g., Gaussian RBF or Laplacian) with feature map $\phi: \mathbb{R}^p \to \mathcal{H}$. Then:
1. The feature map $\phi$ is injective on $\mathcal{M}$.
2. The image $\phi(\mathcal{M})$ lies in a linear subspace of $\mathcal{H}$ whose covariance operator encodes nonlinear structure of $\mathcal{M}$.
3. The eigenvalues of the centered kernel matrix correspond to nonlinear principal components of $\mathcal{M}$ (kernel PCA).
4. Therefore, enforcing lower bounds on eigenvalues of $\breve{\boldsymbol{K}}$ preserves nonlinear modes of variation that are invisible to Euclidean covariance regularization.
Remark 2. Theorem 1 connects Kernel VICReg to kernel PCA theory, as was also discussed in Section 4.1. This theorem does not claim that RKHS variance always strictly dominates Euclidean variance, but rather that for universal kernels, nonlinear structure becomes linearly representable in feature space, allowing spectral regularization to act on intrinsic manifold directions.
4.6.3. Spectral Stability in Small Batches
Theorem 2 (Spectral Stability of Centered Kernel Matrices). Let $\boldsymbol{x}_1, \dots, \boldsymbol{x}_b$ be i.i.d. samples drawn from a distribution $P$. Let $k$ be a bounded kernel satisfying:
$$\sup_{\boldsymbol{x}}\, k(\boldsymbol{x}, \boldsymbol{x}) \;\le\; \kappa \;<\; \infty.$$
Let $\boldsymbol{K}$ denote the Gram matrix, and let $\breve{\boldsymbol{K}} = \boldsymbol{H}\boldsymbol{K}\boldsymbol{H}$ be its double-centered version, where $\boldsymbol{H} = \boldsymbol{I} - \frac{1}{b}\boldsymbol{1}\boldsymbol{1}^{\top}$. Let $\mathcal{C}$ denote the population centered Gram operator restricted to the sample span. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$$\Big\|\tfrac{1}{b}\breve{\boldsymbol{K}} - \mathcal{C}\Big\| \;\le\; \frac{c\,\kappa\,\sqrt{\log(2/\delta)}}{\sqrt{b}}, \qquad (35)$$
for some universal constant $c > 0$. The bound (35) shows that eigenvalue estimates in RKHS concentrate at rate $\mathcal{O}(1/\sqrt{b})$, providing stability guarantees for small batch regimes.
Corollary 1. Under the above conditions, each eigenvalue of $\frac{1}{b}\breve{\boldsymbol{K}}$ concentrates around its population counterpart at rate $\mathcal{O}(1/\sqrt{b})$. Therefore, it provides stability guarantees for small-batch regimes where $b$ is not large.
4.7. Scalability and Large-Scale Approximations
A primary concern in kernel-based methods is the computational complexity associated with the Gram (kernel) matrix. In Kernel VICReg, the construction of the kernel matrix and the eigenvalue decomposition required for the variance loss incur complexities of $\mathcal{O}(b^2 p)$ and $\mathcal{O}(b^3)$, respectively, where $b$ is the batch size and $p$ is the embedding dimensionality. While modern hardware efficiently handles standard batch sizes (e.g., up to 2048), scaling to “cognitive computing” levels with massive datasets requires approximation strategies.
4.7.1. Scalability by Nyström Method
To address this, Kernel VICReg can be integrated with the Nyström method [20,21], which approximates the full kernel matrix using $m$ landmark points ($m \ll b$):
$$\boldsymbol{K} \;\approx\; \boldsymbol{K}_{bm}\,\boldsymbol{K}_{mm}^{-1}\,\boldsymbol{K}_{bm}^{\top},$$
where $\boldsymbol{K}_{bm} \in \mathbb{R}^{b \times m}$ is the matrix of kernel evaluations between the batch samples and the landmarks, and $\boldsymbol{K}_{mm} \in \mathbb{R}^{m \times m}$ contains the kernel evaluations among the landmarks. This reduces the complexity to $\mathcal{O}(b m^2)$.
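A minimal sketch of this idea (ours; uniform landmark subsampling and the jitter value are assumptions). Keeping the Nyström factor explicit avoids ever forming the full $b \times b$ matrix, so the eigenvalues needed by the variance term come from an $m \times m$ problem:

```python
import torch

def rbf(x, y, sigma=1.0):
    return torch.exp(-torch.cdist(x, y) ** 2 / (2.0 * sigma ** 2))

def nystrom_factor(z, m=128, sigma=1.0, jitter=1e-6):
    """Return L (b x m) with K ~= L L^T, so downstream costs stay O(b m^2)."""
    idx = torch.randperm(z.shape[0])[:m]
    landmarks = z[idx]
    K_bm = rbf(z, landmarks, sigma)                                   # (b x m)
    K_mm = rbf(landmarks, landmarks, sigma) + jitter * torch.eye(m)   # (m x m)
    evals, evecs = torch.linalg.eigh(K_mm)
    K_mm_inv_sqrt = evecs @ torch.diag(evals.clamp(min=jitter).rsqrt()) @ evecs.T
    return K_bm @ K_mm_inv_sqrt                                       # (b x m)

z = torch.randn(4096, 128)
L = nystrom_factor(z, m=128)
Lc = L - L.mean(dim=0, keepdim=True)          # centering applied in the factored form
eigvals = torch.linalg.eigvalsh(Lc.T @ Lc)    # nonzero eigenvalues of H K H, via an m x m problem
```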
4.7.2. Scalability by Random Fourier Features
Alternatively, Random Fourier Features (RFF) [22] can be employed for shift-invariant kernels (e.g., the RBF kernel). RFF maps the embeddings $\boldsymbol{z} \in \mathbb{R}^p$ to a low-dimensional randomized feature space $\boldsymbol{\psi}(\boldsymbol{z}) \in \mathbb{R}^{D}$ such that the kernel is approximated by a linear inner product:
$$k(\boldsymbol{z}_i, \boldsymbol{z}_j) \;\approx\; \boldsymbol{\psi}(\boldsymbol{z}_i)^{\top}\boldsymbol{\psi}(\boldsymbol{z}_j), \qquad \boldsymbol{\psi}(\boldsymbol{z}) = \sqrt{\tfrac{2}{D}}\,\big[\cos(\boldsymbol{\omega}_1^{\top}\boldsymbol{z} + \theta_1), \dots, \cos(\boldsymbol{\omega}_D^{\top}\boldsymbol{z} + \theta_D)\big]^{\top},$$
where $\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_D$ are sampled from the kernel's spectral density (and $\theta_1, \dots, \theta_D$ are random phases). By utilizing RFF, the Kernel VICReg objective simplifies to a linear form in the $\boldsymbol{\psi}$-space, achieving complexity that is linear in the batch size. This makes the framework suitable for large-scale distributed applications.
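A minimal sketch of RFF for the RBF kernel (ours; the number of features D, the bandwidth, and the fixed seed are illustrative):

```python
import math
import torch

def rff_features(z: torch.Tensor, D: int = 512, sigma: float = 1.0, seed: int = 0) -> torch.Tensor:
    """psi(z) with psi(x)^T psi(y) ~= exp(-||x - y||^2 / (2 sigma^2))."""
    g = torch.Generator().manual_seed(seed)
    p = z.shape[1]
    W = torch.randn(p, D, generator=g) / sigma        # omega ~ N(0, I / sigma^2), the RBF spectral density
    phase = 2.0 * math.pi * torch.rand(D, generator=g)
    return math.sqrt(2.0 / D) * torch.cos(z @ W + phase)

z = torch.randn(256, 128)
psi = rff_features(z)
K_approx = psi @ psi.T                                 # approximates the RBF Gram matrix
```

In the $\boldsymbol{\psi}$-space, the variance and covariance terms reduce to VICReg-style operations on a $D$-dimensional linear embedding, so the cost grows linearly with the batch size.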
4.8. Ethical Considerations and Potential Biases
While Kernel VICReg provides a robust framework for nonlinear representation learning, it is essential to consider the ethical implications regarding data bias. Kernel methods are inherently sensitive to the distribution of the training data. In domains such as medical imaging or cognitive computing, if a specific group is underrepresented, the kernel matrix may fail to capture the local geometry of that sub-population, potentially leading to “feature exclusion”, where the RKHS mapping reinforces existing biases.
Furthermore, the choice of kernel (e.g., the RBF kernel width ) acts as a scale-dependent filter. If the data from minority groups exhibits different variance scales, a global kernel parameter might sub-optimally encode their features compared to the majority group. To mitigate these risks in sensitive applications, we suggest the use of adaptive kernels or group-fairness constraints within the variance-covariance terms, ensuring that the geometric embedding remains equitable across diverse demographic groups.
5. Experimental Results
We evaluate Kernel VICReg on a range of benchmark datasets to assess its ability to learn rich, non-linear representations in self-supervised settings. Our experiments span small-scale datasets (MNIST, CIFAR-10), mid-scale transfer learning (STL-10), and large-scale benchmarks (TinyImageNet, ImageNet100). For all experiments, we use a ResNet-18 backbone as the encoder, followed by a two-layer MLP projector. The model is trained using the Adam optimizer with a batch size of 256 and cosine learning rate scheduling. Each dataset is augmented following standard protocols: random cropping, horizontal flipping, and color jitter for natural image datasets.
To investigate the effect of different kernels in the Reproducing Kernel Hilbert Space (RKHS), we implement Kernel VICReg using four kernels: linear, radial basis function (RBF), Laplacian, and rational quadratic (RQ). The kernel matrices are computed batch-wise, with double-centering applied to ensure zero-mean embeddings in RKHS. We evaluate the learned representations using linear probing, where a logistic regression classifier is trained atop frozen embeddings, and transfer learning, where encoders pretrained on CIFAR-10 are evaluated on STL-10.
5.1. Comparison with Baselines
We compare Kernel VICReg against a diverse set of prominent self-supervised learning methods, including contrastive (SimCLR, MoCo), clustering-based (SwAV), and non-contrastive frameworks (BYOL, DINO, SimSiam, Barlow Twins, and VICReg). All baselines are implemented with the same ResNet-18 encoder to ensure fairness, and we reuse reported numbers or reproduce them when necessary using identical augmentation and optimization settings.
While the field of self-supervised learning (SSL) is rapidly evolving with newer non-contrastive architectures, the primary objective of this work is to provide a theoretical and methodological framework for lifting the VICReg objective into the Reproducing Kernel Hilbert Space (RKHS). Our evaluations focus on comparing Kernel VICReg against its direct Euclidean counterpart and established foundational SSL baselines (e.g., SimCLR, Barlow Twins, and VICReg) to isolate the impact of kernelization. By demonstrating consistent improvements over the original VICReg across multiple datasets, we establish the validity of the proposed kernelized variance-invariance-covariance constraints.
Table 1 summarizes the top-1 linear evaluation accuracy on ImageNet100 and TinyImageNet. While VICReg exhibits competitive performance on ImageNet100, it collapses on TinyImageNet due to its sensitivity to small datasets with high intra-class variance. In contrast, Kernel VICReg remains stable across all settings, with the Laplacian and RQ kernels achieving the best performance, demonstrating the robustness of RKHS-based regularization.
Table 2 presents results on MNIST and CIFAR-10. Kernel VICReg consistently outperforms its Euclidean counterpart, particularly on MNIST, where the Laplacian kernel attains the highest accuracy. On CIFAR-10, the RQ kernel achieves the best performance, indicating that kernel choice adapts to data complexity.
Finally, Table 3 reports transfer learning results on STL-10 using encoders pretrained on CIFAR-10. Kernel VICReg transfers better than VICReg, highlighting its generalization capabilities in low-label regimes.
Note that while several variations and incremental improvements to the VICReg architecture have been proposed since its inception in 2022, this study focuses on the foundational task of extending the core VICReg objective into the RKHS. By comparing our method directly against the standard Euclidean VICReg in Table 2 and Table 3, we isolate the performance gains attributable to the kernelized covariance and variance constraints. This controlled comparison is essential for validating the theoretical derivation of Kernel VICReg.
5.2. Further Analysis and Insights
To better understand how kernelization affects the structure of learned representations, we visualize the embedding spaces on the MNIST dataset using UMAP for three models: original VICReg, Kernel VICReg with a linear kernel, and Kernel VICReg with a Laplacian kernel (see Figure 1). The UMAP plots reveal differences in cluster geometry across these methods. Representations from standard VICReg exhibit some class separation, but the clusters are elongated and lack compactness, suggesting anisotropic variance and potential feature collapse. In VICReg, the red cluster is split into two separate clusters, whereas this does not occur with our method.
Kernel VICReg with a linear kernel improves upon this, producing tighter and more separated clusters, indicating that even without explicit nonlinearity in the kernel, RKHS-based decorrelation provides better structure. However, the most striking improvement appears with the Laplacian kernel: clusters become nearly circular and uniformly spaced, exhibiting strong isometry. This implies that the Laplacian-induced RKHS preserves pairwise relations and local structure more effectively, leading to embeddings with more consistent intra-class variance and improved inter-class margins.
These visualizations are qualitative; quantitative comparisons are provided by the linear-probe and transfer results in Table 1 and Table 3. As shown earlier in Table 1, Kernel VICReg maintains stable and competitive accuracy across both large-scale (ImageNet100) and small-scale (TinyImageNet) benchmarks. Notably, VICReg collapses on TinyImageNet, consistent with its known sensitivity to datasets exhibiting high intra-class variance or insufficient regularization. In contrast, kernelized versions (especially with Laplacian and rational quadratic kernels) perform robustly, demonstrating the benefits of nonlinear geometric alignment in RKHS.
We further analyze robustness and sensitivity to kernel hyperparameters. Figure 2 shows that polynomial kernels can be highly sensitive to their hyperparameters under distribution shifts, while Figure 3 demonstrates that RBF performance is non-monotonic in the bandwidth parameter $\sigma$.
Furthermore, the results in Table 3 highlight the generalization strength of kernel-based methods in transfer learning. Embeddings trained on CIFAR-10 and evaluated on STL-10 show that the RQ and Laplacian kernels outperform VICReg and even the linear kernel variant. These findings support the hypothesis that kernel-induced representations better capture underlying data manifolds, resulting in improved performance on downstream tasks with distributional shifts.
Overall, Kernel VICReg offers a principled extension of VICReg that gracefully incorporates nonlinearity through RKHS-based loss formulations. The improved cluster geometry, resilience to collapse, and higher transfer accuracy together suggest that kernelized self-supervision is a promising direction for representation learning beyond Euclidean limitations.
5.3. Implementation Details
We implemented Kernel VICReg in PyTorch 2.6.0 with a modular design that allows kernel choice and parameter tuning through command-line flags. The training pipeline consists of three components: (i) a backbone encoder (either ResNet-18 or a simple CNN), (ii) a multi-layer perceptron projector, and (iii) the proposed kernelized VICReg loss. Two augmented views are generated per sample following standard SSL protocols, and their embeddings are compared via the kernel-based losses.
Kernels and Centering. We implemented five kernels (linear, polynomial, radial basis function (RBF), Laplacian, and rational quadratic (RQ)) with automatic double-centering to operate in RKHS. For scale-sensitive kernels (RBF, Laplacian, and RQ), the bandwidth parameter is adaptively estimated using the median heuristic unless explicitly specified.
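For reference, one common variant of the median heuristic used for such bandwidth estimation looks as follows (a sketch, ours; the exact variant used in our code may differ):

```python
import torch

def median_heuristic_bandwidth(z: torch.Tensor) -> float:
    """Set sigma to the median pairwise Euclidean distance within the batch."""
    return torch.pdist(z).median().item()

z = torch.randn(512, 128)
sigma = median_heuristic_bandwidth(z)   # passed to the RBF / Laplacian / RQ kernels
```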
Kernel VICReg Loss. The loss consists of three terms: (i) invariance, computed as the trace distance between within-view and cross-view Gram matrices; (ii) variance, computed from eigenvalues of the double-centered kernel matrix, penalizing directions with variance below a threshold; and (iii) covariance, computed as the squared Hilbert–Schmidt norm of the covariance operator, equivalent to the sum of squared off-diagonal entries in the kernel covariance. The implementation returns all three contributions separately, allowing monitoring of invariance, variance, and covariance during training.
Training Setup. We conducted experiments on MNIST, CIFAR-10, STL-10, TinyImageNet, and ImageNet100. Unless otherwise noted, the encoder was a ResNet-18 backbone (with adjusted stem for CIFAR) followed by a 3-layer MLP projector with hidden dimension 1024. Models were trained using the Adam optimizer with a batch size of 512 and a cosine learning rate schedule. Augmentations followed standard SSL protocols: random crop, color jitter, blur, and horizontal flip for natural images; affine transforms for MNIST.
Hyperparameters. Table 4 reports the best-performing values of $\lambda$, $\mu$, and $\nu$ found under our limited search for each dataset/kernel. We include these values to document the observed dataset dependence rather than to claim a single universally optimal setting. This dependence is expected because the kernel acts as an inductive bias that changes the geometry of the objective in RKHS.
Practical kernel-selection heuristics and reduced tuning. As a concise rule of thumb, Laplacian typically favors sharper/local structure (edge- and texture-dominated data), RBF favors smoother geometry and can be more forgiving under higher noise, and RQ is a practical middle ground when both local and global (multi-scale) structure matter. To reduce the tuning burden in practice, one can fix $\lambda$ to a standard VICReg-scale value and perform small one-dimensional sweeps over $\mu$ (to prevent spectral collapse relative to the threshold $\gamma$) and $\nu$ (to discourage redundancy), rather than jointly grid-searching all of $\lambda$, $\mu$, and $\nu$ together with the kernel parameters.
Evaluation. Representation quality was measured via linear probing: embeddings from the frozen encoder were extracted, and a linear classifier was trained with SGD for 100 epochs. In addition, we visualized embedding geometry using UMAP and tracked the dynamics of the top eigenvalues of the centered kernel matrix across epochs to study the behavior of the variance term.
5.4. Computational Complexity and Empirical Overhead
While Kernel VICReg introduces a nonlinear mapping via the RKHS, it is important to quantify the overhead relative to the standard Euclidean VICReg. Let $b$ denote the batch size and $p$ the dimensionality of the embeddings. Standard VICReg computes a $p \times p$ covariance matrix in $\mathcal{O}(b p^2)$, whereas Kernel VICReg computes the Gram matrix in $\mathcal{O}(b^2 p)$ and its eigenvalue decomposition in $\mathcal{O}(b^3)$.
Figure 4 and Figure 5 further quantify how kernel choice scales in latency and memory as the embedding dimension and batch size increase.
The results indicate that for standard SSL batch sizes, the overhead is marginal. As discussed in Section 4.7, for larger “cognitive computing” scales where $b$ increases significantly, the quadratic-to-cubic bottleneck in $b$ can be effectively mitigated using Nyström approximations or Random Fourier Features, which reduce the complexity back to a linear or quasi-linear relationship with batch size.
6. Conclusions
We introduced Kernel VICReg, a principled extension of VICReg that pulls self-supervised learning objectives from Euclidean space to Reproducing Kernel Hilbert Space (RKHS). By reformulating invariance, variance, and covariance terms in kernel space using double-centered kernel matrices and Hilbert–Schmidt norms, our method captures complex nonlinear structures without requiring explicit feature mappings.
Our empirical results across diverse datasets demonstrate the robustness and effectiveness of this kernelized formulation. Kernel VICReg outperforms its Euclidean counterpart, particularly in challenging regimes such as TinyImageNet, where standard VICReg collapses. Moreover, kernel-induced representations exhibit superior generalization in transfer learning tasks, as evidenced by improvements on STL-10. Visualization through UMAP further reveals that kernelization promotes more compact, isometric cluster structures, especially under the Laplacian kernel.
These findings suggest that kernel methods offer a natural and powerful means to enhance self-supervised learning. While our work focuses on VICReg, the framework readily extends to other SSL objectives such as Barlow Twins and SimCLR, opening promising avenues for future research in kernelized SSL. This study contributes to bridging classical kernel theory and modern representation learning by showing that the integration of RKHS structure meaningfully improves both stability and expressiveness in self-supervised models.