1. Introduction
Kernel methods provide an effective and natural way to bring distances and similarities directly into learning problems. A kernel matrix can encode how close, comparable, or compatible two observations are, even when the objects are not conveniently described by ordinary Euclidean coordinates. This makes kernels particularly useful for classification, regression, and similarity-based learning: geometric or relational information is transformed into an inner-product representation in a reproducing kernel Hilbert space (RKHS), where linear operations become available, nonlinear decision rules can be handled through linear algorithms, and standard classifiers can work from similarities rather than only from raw coordinates [1, 2, 3]. Their practical success depends heavily on the choice of the kernel, and in many applications a single kernel is not rich enough to capture all relevant aspects of the problem. For that reason, combinations of kernels or similarities have been extensively studied, both as a way of fusing heterogeneous information and as a mechanism for improving discrimination in supervised tasks [4, 5, 6].
A central difficulty appears when the combination rule produces only a matrix on a finite sample. This happens, for instance, in several label-aware or similarity-driven fusion rules [6, 7]: the practitioner obtains a combined empirical matrix on the training set, but not a genuine kernel function that can be evaluated at new points. Once the learning algorithm leaves the training sample, the combined object is no longer directly available. In consequence, the resulting model cannot be deployed consistently unless one reconstructs an out-of-sample extension of the combined kernel.
Recent work has confirmed that this passage from finite kernel or proximity information to an out-of-sample evaluator remains an active problem. Fanuel et al. [8] constructed data-dependent positive-semidefinite kernels with explicit out-of-sample formulas for positive-semidefinite embeddings, while Münch et al. [9] studied spectral fusion of heterogeneous, possibly indefinite, proximity data. These approaches are complementary to the present paper: we start from a combined matrix already produced by a supervised or multi-source rule and seek a functional representation of that final matrix, up to the selected spectral rank, without using labels for new points.
To make the background and scope more explicit, we position the problem at the intersection of three related literatures. Multiple-kernel learning and kernel-alignment methods study how to combine kernels for supervised learning [4, 5, 10, 11]. Similarity and proximity learning study how finite pairwise matrices, including non-Euclidean or indefinite objects, can be used or rectified for learning [7, 12]. Kernel approximation and data-dependent RKHS bases study how finite spectral or interpolation information can be evaluated stably away from the sample [8, 9, 13, 14]. The present paper uses the first two lines of work as sources of finite combined matrices and the third as the mechanism for obtaining deployable out-of-sample evaluators. This positioning clarifies why the empirical question is not only whether a combination improves classification, but also whether the combined matrix can be converted into a stable kernel on unseen observations.
This article addresses precisely that matrix-to-kernel step. Starting from a positive-semidefinite combined matrix, we construct one out-of-sample extension through a finite feature representation and data-dependent bases in the empirical RKHS [13, 14, 15]. Such an extension is not unique: a finite matrix admits infinitely many positive-semidefinite extensions outside the observed sample. The contribution of the paper is therefore not to recover “the” true kernel, but to define a controlled inductive extension specified by empirical coordinates, a PSD coefficient matrix, and a regression model for those coordinates. The resulting kernel can then be evaluated between training and test points without access to test labels.
Figure 1 summarizes the conceptual pipeline studied in this paper. The figure emphasizes that the objective is not merely to train a classifier on a fixed matrix, but to transform sample-based kernel or dissimilarity constructions into evaluable functions.
Contributions.
The paper makes four concrete contributions.
1. It formulates the passage from a combined sample matrix to an out-of-sample kernel as a reconstruction problem with three target properties: interpolation on the sample, positive definiteness, and practical out-of-sample evaluation.
2. It proposes a reconstruction strategy based on data-dependent bases in the empirical RKHS, with Newton and SVD bases as the two practically relevant choices.
3. It provides a reproducible experimental protocol that separates base kernels, matrix combination, kernel reconstruction, and downstream classification.
4. It validates the method on completed strict runs for Synthetic, Breast Cancer, Ionosphere, Telco Customer Churn, and Chickenpieces, covering ordinary vector data, heterogeneous tabular data, and relational dissimilarity data.
Illustrative Example: From a Sample-Based Construction to an Out-of-Sample Function
Before turning to the full experimental protocol, it is useful to isolate the main objective on a simple one-dimensional example. The central issue is not primarily to propose a classifier competing with highly specialized alternatives, but rather to transform a construction originally defined only on a finite sample into a function that can be evaluated at previously unseen points.
Suppose that we observe a finite set of sample points and a kernel matrix defined on these points. Consider the eigenvalue–eigenfunction pairs associated with the corresponding integral operator, or, in practice, the finite-sample approximation derived from the kernel matrix. A truncated representation keeps finitely many of these pairs and is a linear combination of the retained eigenfunctions. Once the coefficients have been determined from the sample, this expression defines an object that can be evaluated not only at the observed points, but also at a new point, provided that the basis functions admit an out-of-sample extension.
This is precisely the situation considered throughout the paper. In many applications, one starts from kernels or dissimilarity-based constructions that are explicitly available only on the training sample. The contribution of the proposed framework is to show how such constructions, including ad hoc kernel combinations, can be turned into evaluable functions outside the sample by means of finite basis representations. In this sense, the classification experiments should be interpreted mainly as a validation that the resulting extensions remain informative and competitive, rather than as an attempt to outperform specialized classification methods.
Figure 2 fixes the idea visually. A smooth target function is observed only at finitely many input locations. From a kernel-based basis constructed on the sample, one obtains a truncated approximation that interpolates or approximates the observed data and can then be evaluated at new points. The role of the truncation parameter is to control the balance between fidelity to the sample-based construction and regularity of the resulting extension. The same principle underlies the methodology developed below for combined kernels.
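The one-dimensional idea can be made concrete with a small numerical sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the Gaussian base kernel, its width, the sine target, the helper names, and the Nyström-type continuation of the eigenvectors (which is available here only because the base kernel can be evaluated at new points).

```python
import numpy as np

def gaussian_kernel(a, b, width=0.3):
    """Gaussian kernel matrix between 1-D point sets a and b (illustrative choice)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-(diff ** 2) / (2.0 * width ** 2))

def truncated_extension(x_train, y_train, x_new, n_terms, width=0.3):
    """Fit a truncated eigen-expansion on the sample and evaluate it at x_new."""
    K = gaussian_kernel(x_train, x_train, width)
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:n_terms]      # leading eigenpairs
    lam, U = eigvals[order], eigvecs[:, order]
    coef = U.T @ y_train                             # least-squares coefficients (U is orthonormal)
    # Nystrom-type continuation of eigenvector j: k(x, X) u_j / lambda_j
    U_new = gaussian_kernel(x_new, x_train, width) @ U / lam
    return U_new @ coef

x_train = np.linspace(0.0, 1.0, 12)
y_train = np.sin(2.0 * np.pi * x_train)              # hypothetical smooth target
x_new = np.array([0.33, 0.71])                       # points outside the sample
approx_at_sample = truncated_extension(x_train, y_train, x_train, n_terms=6)
approx_at_new = truncated_extension(x_train, y_train, x_new, n_terms=6)
```

Increasing `n_terms` moves the approximation towards interpolation of the sample, while small values give a smoother but less faithful extension; very small eigenvalues make the division by `lam` unstable, which is one reason to truncate.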
The main message of the paper is deliberately cautious. The proposed reconstruction is especially reliable for sum-type and modular combinations, where the out-of-sample kernel remains close to the oracle combination and classification performance is preserved. In contrast, more abrupt label-driven combinations such as pick-out are harder to reconstruct faithfully and do not consistently improve classification once they are extended outside the sample.
The rest of the paper is organized as follows. Section 2 introduces the RKHS framework and formulates the reconstruction problem. Section 3 presents the proposed method. Section 4 describes the experimental protocol. Section 5 reports the numerical results, and Section 6 discusses what they imply for the use of combined kernels in practice.
3. Proposed Reconstruction Method
3.1. Spectral Decomposition of the Combined Matrix
Consider the rectified combined matrix on the training sample. Its thin eigendecomposition consists of a diagonal matrix that contains the retained positive matrix eigenvalues and a matrix whose columns are the corresponding orthonormal eigenvectors. In the computational method we use the associated matrix-coordinate functions. This convention differs from the Mercer-operator normalization, whose eigenvalues carry a different scaling. Both conventions are equivalent if the scaling is used consistently, but (23) is preferable algorithmically because it reconstructs the combined matrix directly with the matrix eigenvalues.
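The equivalence of the two scalings can be checked numerically. The sketch below assumes one common form of each convention, since the exact formulas are not recoverable from the text: orthonormal eigenvectors paired with matrix eigenvalues, versus eigenvectors multiplied by the square root of the sample size paired with eigenvalues divided by the sample size. The helper `thin_eigendecomposition` and the random rank-3 stand-in matrix are hypothetical.

```python
import numpy as np

def thin_eigendecomposition(K, tol=1e-10):
    """Positive eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(0.5 * (K + K.T))
    keep = eigvals > tol * max(eigvals.max(), 1.0)
    order = np.argsort(eigvals[keep])[::-1]
    return eigvals[keep][order], eigvecs[:, keep][:, order]

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 3))
K = B @ B.T                                  # rank-3 PSD stand-in for a combined matrix
lam, U = thin_eigendecomposition(K)
n = K.shape[0]

# Matrix convention: orthonormal eigenvectors weighted by matrix eigenvalues.
K_matrix = (U * lam) @ U.T
# Mercer-type convention (assumed form): eigenvectors times sqrt(n), eigenvalues over n.
K_mercer = ((np.sqrt(n) * U) * (lam / n)) @ (np.sqrt(n) * U).T
```

Both products return the same matrix; the matrix convention simply avoids carrying the sample-size factors through the rest of the algorithm.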
3.2. Extending Empirical Coordinates
The previous step gives only sample coordinates. To evaluate a new object we must learn a map from the available object representation to those coordinates. For each coordinate h we fit a scalar regression problem from the object representation to the sample values of that coordinate or, in relational data, a regression from the available dissimilarities to the training objects. The fitted coordinate functions are then collected into an extended coordinate vector. This is the formal step that turns a sample matrix into an inductive object. It also makes clear why the construction is not unique: changing the coordinate model, the retained rank, or the basis used for coordinate regression changes the resulting extension while leaving the same training matrix as the target.
The same step can be described through Newton or SVD bases. Consider a data-dependent basis with sample matrix V. Each coordinate vector may be projected onto that basis by solving a linear system in V that is stabilized by a small numerical ridge. The resulting continuation evaluates the basis functions at the new point and combines them with the fitted coefficients. When V is square and nonsingular, this is an exact change of coordinates. When a rank cap is imposed or a numerical ridge is added, it is a stable projection of the empirical coordinate onto the retained basis.
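A minimal sketch of the projection step follows. It assumes that the projection takes the form of ridge-regularized normal equations, which is one standard reading of "projection with a small numerical ridge" but is not confirmed by the recoverable text. The function names, the ridge value, and the random matrices standing in for V and for the coordinate vectors are hypothetical.

```python
import numpy as np

def project_onto_basis(V, Z, ridge=1e-10):
    """Coefficients A such that V @ A approximates the coordinate matrix Z."""
    gram = V.T @ V + ridge * np.eye(V.shape[1])
    return np.linalg.solve(gram, V.T @ Z)

def continue_coordinates(V_new, A):
    """Evaluate continued coordinates from basis values at new points."""
    return V_new @ A

rng = np.random.default_rng(1)
V = rng.standard_normal((6, 6)) + 3.0 * np.eye(6)   # square, well-conditioned basis matrix
Z = rng.standard_normal((6, 2))                      # two empirical coordinate vectors

A_exact = project_onto_basis(V, Z, ridge=0.0)        # exact change of coordinates
A_capped = project_onto_basis(V[:, :3], Z)           # rank cap: keep three basis columns
residual = Z - V[:, :3] @ A_capped                   # part of Z outside the retained basis
```

With the full square basis the sample coordinates are reproduced exactly; with the rank cap the residual is (up to the ridge) orthogonal to the retained basis columns, which is what makes the step a projection.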
3.3. Finite-Feature Kernel and Validity
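After the coordinates have been extended, the reconstructed kernel is obtained by pairing the extended coordinate vectors of two points through the retained eigenvalues. More generally, for a basis representation we use a finite-feature kernel in which the basis vectors of the two points are paired through a coefficient matrix C. The coefficient matrix is diagonal only in coordinates aligned with the retained eigenvectors. In a pivoted Newton basis, for example, C is generally dense but still positive semidefinite.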
The following proposition is the basic mathematical guarantee used throughout the paper.
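Proposition 1 (PSD extension, interpolation, and coordinate error). Let a feature map with finitely many components be given and let C be a positive-semidefinite coefficient matrix. Then the bilinear form that pairs the feature vectors of two points through C is a positive-semidefinite finite-feature kernel. Let G be the sample matrix whose rows are the feature vectors of the training points. If the product of G, C, and the transpose of G equals the combined training matrix, then the kernel interpolates that matrix on the sample. If the fitted sample coordinates deviate from G, then the deviation of the reconstructed sample matrix from the target is controlled by the size of the coordinate error.
Proof. For any finite collection of points and real coefficients, the associated quadratic form is non-negative because C is positive semidefinite. The interpolation statement follows immediately from the factorization of the training matrix through G and C. For the error bound, expand the reconstructed sample matrix around G and apply submultiplicativity of the Frobenius and spectral norms. □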
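The three statements of the proposition can be checked on a small random instance. The bound used below is one inequality of the type described in the proof, obtained by expanding the perturbed product and applying the mixed Frobenius/spectral submultiplicativity; it is an assumption that the paper's bound has this shape, and all names and sizes are illustrative.

```python
import numpy as np

def finite_feature_kernel(G_left, C, G_right):
    """Kernel block with entries phi(x)^T C phi(y) for stacked feature rows."""
    return G_left @ C @ G_right.T

def spec(M):
    return np.linalg.norm(M, 2)        # spectral norm

def fro(M):
    return np.linalg.norm(M, "fro")    # Frobenius norm

rng = np.random.default_rng(2)
G = rng.standard_normal((10, 4))            # sample features, one row per training point
R = rng.standard_normal((4, 4))
C = R @ R.T                                 # PSD coefficient matrix
K = finite_feature_kernel(G, C, G)          # target matrix, interpolated by construction

E = 1e-2 * rng.standard_normal(G.shape)     # coordinate-fitting error on the sample
K_fit = finite_feature_kernel(G + E, C, G + E)

error = fro(K_fit - K)
# K_fit - K = E C G^T + G C E^T + E C E^T, bounded term by term.
bound = spec(C) * (2.0 * spec(G) * fro(E) + spec(E) * fro(E))
min_eig = np.linalg.eigvalsh(0.5 * (K_fit + K_fit.T)).min()
```

The perturbed kernel stays positive semidefinite for any coordinate error, because the construction never leaves the form of a feature matrix times C times its transpose; only the agreement with the target matrix degrades.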
Equation (27) also clarifies the interpolation issue. If the full effective rank is retained and the coordinate model is exact on the sample, then the reconstructed kernel reproduces the combined matrix on the training sample. If the spectrum is truncated, the PSD part is clipped, or the coordinate regression is imperfect, then agreement is exact only for the retained fitted coordinates. For this reason the article reports predictive metrics together with PSD and reconstruction diagnostics rather than claiming a unique exact reconstruction.
3.4. Out-of-Sample Basis Evaluation
To evaluate the reconstructed kernel on new points it remains to compute the extended coordinate vector, or equivalently the basis vector when a non-spectral basis is used. In our implementation this is done by regressing each coordinate on the representation available for the objects and then predicting the coordinate values on the new points [18]. For vector or tabular data the regressors use the observed variables. For relational data they may use the dissimilarities from the new object to the training objects. For the m-th basis coordinate on the sample, we fit a scalar regression model and obtain its predicted value on the new point. Stacking these predictions gives the predicted basis vector at that point. The train–test block required by a precomputed-kernel classifier is then assembled from the predicted basis matrix on the test set, the coefficient matrix C, and the basis matrix on the training set. In our implementation this regression step uses kernelized regression models, but the theory does not depend on the particular regressor: any method that estimates the coordinates stably can be used.
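The coordinate-regression step can be sketched with kernel ridge regression. This is an illustration under stated assumptions, not the paper's C++ implementation: the Gaussian regression kernel, its `gamma`, the ridge value, the spectral threshold, and the toy data are hypothetical, and an analytic mean of two Gaussian kernels plays the role of the combined matrix so that an oracle block exists for comparison.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    """Gaussian kernel between rows of X and rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_continuation(K_reg_train, K_reg_test_train, Z_train, ridge=1e-6):
    """Predict every coordinate column at test points with kernel ridge regression."""
    n = K_reg_train.shape[0]
    alpha = np.linalg.solve(K_reg_train + ridge * np.eye(n), Z_train)
    return K_reg_test_train @ alpha

rng = np.random.default_rng(3)
X_train, X_test = rng.standard_normal((20, 2)), rng.standard_normal((5, 2))
K_combined = 0.5 * (rbf(X_train, X_train, 0.2) + rbf(X_train, X_train, 1.0))

lam, U = np.linalg.eigh(K_combined)
keep = lam > 1e-8
lam, U = lam[keep], U[:, keep]                    # retained positive spectrum

Z_test = krr_continuation(rbf(X_train, X_train), rbf(X_test, X_train), U)
K_test_train = (Z_test * lam) @ U.T               # block for a precomputed-kernel classifier

K_oracle = 0.5 * (rbf(X_test, X_train, 0.2) + rbf(X_test, X_train, 1.0))
rel_error = np.linalg.norm(K_test_train - K_oracle) / np.linalg.norm(K_oracle)
```

No test label enters any line of the sketch: the regression sees only training coordinates and the representation of the new points. The quantity `rel_error` is the kind of relative Frobenius diagnostic that is available only when an analytic oracle exists.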
This basis-regression view is what differentiates the present method from both Nyström and FK. Nyström extends eigenfunctions directly through the original kernel sections. FK reconstructs the target eigensystem through the eigensystems of the base kernels. In contrast, our method first changes coordinates to a stable basis attached to the combined matrix and only then performs the out-of-sample continuation. In the language of the kernel approximation and fusion literature, the procedure combines two ingredients that are often treated separately [6, 13, 14]: the interpolation-oriented perspective of alternate bases and the kernel-fusion perspective of matrix combinations. The whole point is that these are not separate technicalities but consecutive parts of the same pipeline.
3.5. Algorithmic Summary
The whole procedure can be summarized as follows.
1. Compute the base kernel matrices on the training sample and build the combined matrix with the chosen combination rule.
2. Symmetrize and project to the PSD cone if needed; record the relative PSD perturbation.
3. Compute the eigendecomposition (22) and retain the desired positive-rank subspace.
4. Define empirical matrix coordinates from the retained eigenpairs, avoiding the Mercer scaling ambiguity in the practical algorithm.
5. Optionally express those coordinates in a Newton or SVD basis by solving (24).
6. Learn a coordinate map from the available object representation to the retained coordinates.
7. Form the finite-feature kernel through (26) and (27) and assemble the train–test block (28).
8. Classify the test points using a precomputed-kernel SVM [3, 33].
This reorganizes the sample-to-function algorithm from earlier basis-combination work [18] in an explicitly spectral way, so that each stage has a direct interpretation in Mercer terms.
3.6. Complexity, Conditioning, and Practical Remarks
The dominant costs are the eigendecomposition or the Cholesky/SVD factorization of the combined matrix and the set of regressions used to predict the basis coordinates. When only a reduced rank is retained, storage shrinks accordingly and out-of-sample evaluation is reduced to matrix products involving V, C, and the predicted basis matrix on the test set. Newton and SVD therefore provide two practically complementary regimes: Newton is attractive when one wants rank-adaptive, pivoted, or triangular structure; SVD is attractive when one wants the most direct spectral route and a basis already aligned with the dominant sample eigendirections.
This cost profile should be compared with alternative out-of-sample devices with some care. Nyström-type methods are typically cheaper when a small landmark set or a known kernel section is available, but for a matrix-only supervised combination the section is exactly the missing object. Fusion-kernel constructions can also continue a combined matrix, but they do so through the eigenspaces of the base kernels rather than directly through the eigensystem of the combined matrix. The present implementation therefore benchmarks the matrix-to-function step primarily through leakage-free train–test gaps and, where an analytic oracle exists, relative Frobenius and functional-MSE errors. A full large-scale runtime benchmark against all low-rank extension strategies is outside the present experimental scope.
In the experiments, the two reconstructions were consistently close to each other, which indicates that the main limitation often lies in the quality of the combined matrix itself rather than in the basis used for continuation. This is why the article keeps Fusion Kernel only as a secondary computational baseline: for the direct basis-based route, Newton and SVD became the two methods that best capture the stable matrix-to-function transition.
4. Materials and Methods
4.1. Aim of the Experiments
The experiments are intended as a validation of the matrix-to-function step, not as a search for a universally optimal classifier. The question is the following: after a finite kernel matrix has been produced on a training sample, can it be converted into a kernel function that is evaluable on new points, and does the classifier built from that function behave on held-out points as it behaves on the points used to build the extension?
For this reason the predictive metrics reported below should be read as functional diagnostics. We compare the error measured on the training matrix by inner cross-validation with the error obtained after extending the same matrix to the test points. A small discrepancy between these two quantities indicates that the conversion from matrix to function has preserved the relevant geometry of the classifier. We deliberately do not tune the kernel dictionary to obtain optimal classification results. The kernels are the same broad dictionary used in the earlier kernel-combination experiments: linear, cosine, polynomial, multiscale Gaussian, Laplacian, rational-quadratic, class-scale, local-scaling, prototype, and supervised weighted variants. This makes the test more honest: the goal is to show that the reconstruction works even for a historical, non-specialized collection of kernels.
We distinguish three experimental cases.
Case A: analytic combinations. Direct means and alignment-weighted sums of evaluable base kernels already have a train–test formula. They are used as stability references. If the matrix-to-function implementation is sound, these combinations should remain close to the behavior observed on the training matrix.
Case B: matrix-only supervised rules. Pick-out/max–min and percentile rules are constructed only on the training matrix. They are the main stress test for the reconstruction layer, because no test labels are available when the train–test block is generated.
Case C: relational data. Chickenpieces starts from dissimilarity matrices between shape objects. Here the usefulness of the framework is most visible: the input is not a privileged coordinate representation, but a family of pairwise shape comparisons.
The protocol follows a strict split discipline. Training and test sets are formed before any kernel parameter, scaling constant, alignment score, kernel selection, or combination rule is estimated, as required for unbiased model assessment in supervised learning [34, 35]. For label-aware rules, labels are used only on the training set. Test labels are never used to construct test kernels. This point is essential for supervised matrix combinations such as max–min/pick-out.
The computational workflow is implemented as ISO C++17 source code, with accompanying preparation scripts written for R version 4.2.0 or later and Python version 3.10 or later.
4.2. Data Sets
The main vector-data evidence reported in the present tables uses four complete 30-repetition strict runs: Synthetic, Breast Cancer, Ionosphere, and Telco Customer Churn. Telco is the heterogeneous tabular benchmark, while Chickenpieces is reported separately as the relational case in which kernels are built from dissimilarity matrices.
The data sets were selected to cover complementary regimes rather than to form a large-scale leaderboard. Synthetic provides a controlled nonlinear problem in which an oracle train–test block can be computed for smooth analytic rules. Breast Cancer gives a small ordinal biomedical benchmark; Ionosphere gives a continuous radar benchmark; Telco gives a heterogeneous tabular problem with binary, categorical, account, and billing variables; and Chickenpieces gives a pure relational setting in which the objects are available through multiple shape dissimilarities. The resulting collection tests whether the proposed matrix-to-function step behaves consistently across vector, heterogeneous business-table, and dissimilarity-only settings. It does not exhaust the range of possible applications. Larger or more complex data sets would require the same mathematical construction combined with scalable eigensolvers, landmark or randomized low-rank factorizations, and faster coordinate-regression or basis evaluation schemes, as discussed in the complexity remarks and conclusions.
Synthetic. A balanced two-class banana-style simulation inherited from the preliminary experiments [18]. A latent variable and Gaussian noise are generated, and each of the two classes is defined by its own transformation of these quantities. This produces two curved clouds that are not linearly separable and provides a controlled setting in which an oracle train–test kernel block can be computed for diagnostic purposes.
Breast Cancer Wisconsin (Original). A compact biomedical benchmark retained for continuity with the preliminary experiments [36]. The original database contains 699 fine-needle aspirate cases described by nine ordinal cytological attributes measured on a 1–10 scale. After removing the 16 cases with missing BareNuclei, the usable sample contains 683 observations, with benign and malignant diagnoses as the target classes.
Ionosphere. A radar-return classification benchmark from the UCI repository [37]. The data were collected by a phased-array system in Goose Bay, Labrador, and they contain 351 observations with 34 continuous predictors derived from the complex autocorrelation values of 17 pulse numbers. The binary label distinguishes “good” returns, which show ionospheric structure, from “bad” returns.
Telco Customer Churn. A heterogeneous customer-level table distributed as an IBM/Kaggle churn benchmark [38]. The original file contains 7043 customers and 21 columns, including the binary response Churn. Its predictors combine demographic information, subscribed telephone and internet services, contract type, payment method, tenure, and billing amounts. Telco is therefore a natural tabular fusion benchmark: different kernels can be attached to different semantic blocks of variables, rather than merely to different bandwidths on a single homogeneous feature space.
Chickenpieces. A purely relational benchmark consisting of 446 silhouette objects from five chicken-part classes (wing, back, drumstick, thigh-and-back, and breast) [12, 39]. The PRDisData release provides 44 dissimilarity matrices obtained from 11 contour normalizations and four edit-cost settings. Each silhouette is encoded through a contour-string representation. The raw objects are therefore not treated as vectors of coordinates; they are compared through several shape dissimilarities, which are subsequently converted into kernel similarities for classification.
One of the 44 Chickenpieces matrices can be obtained as follows. Fix one contour normalization r and one edit-cost setting c, and consider the resulting contour strings for silhouettes i and j. Since the starting point on a closed contour is arbitrary, the comparison minimizes the edit cost over all cyclic shifts of the second string. In the rotation-invariant cyclic edit distance of Bunke and Bühler [40], the cost of substituting one contour direction for another is weighted according to the two angles involved, whereas insertions and deletions receive the fixed penalty associated with the chosen cost setting c. Computing this value for all pairs gives one 446 × 446 dissimilarity matrix. The 44 matrices arise from the 11 × 4 choices of (r, c); the preprocessing step below turns each of them into RBF and Laplacian similarity kernels.
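The minimization over cyclic shifts can be illustrated with a brute-force sketch. It is a simplified stand-in, not the Bunke and Bühler algorithm: the substitution cost `angle_gap` (smallest absolute angular difference in degrees), the insertion/deletion penalty of 45, and the toy direction strings are all hypothetical choices, because the actual weights are not recoverable from the text.

```python
def edit_cost(a, b, sub_cost, indel):
    """Dynamic-programming edit cost between two sequences."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j] + indel,                          # deletion
                           cur[j - 1] + indel,                       # insertion
                           prev[j - 1] + sub_cost(a[i - 1], b[j - 1])))  # substitution
        prev = cur
    return prev[-1]

def angle_gap(alpha, beta):
    """Hypothetical substitution cost: smallest absolute angular difference in degrees."""
    d = abs(alpha - beta) % 360.0
    return min(d, 360.0 - d)

def cyclic_edit_distance(s_i, s_j, indel=45.0):
    """Minimum edit cost over all cyclic shifts of the second (non-empty) string."""
    shifts = (s_j[k:] + s_j[:k] for k in range(len(s_j)))
    return min(edit_cost(s_i, shifted, angle_gap, indel) for shifted in shifts)

a = [0.0, 90.0, 180.0, 270.0]
b = [180.0, 270.0, 0.0, 90.0]          # same contour, different starting point
unshifted = edit_cost(a, b, angle_gap, 45.0)
shift_invariant = cyclic_edit_distance(a, b)
```

The unshifted cost penalizes the arbitrary starting point, while the cyclic version returns zero for the two encodings of the same contour. The brute-force loop over shifts multiplies the cost of one edit distance by the string length; it is meant only to show the definition.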
Telco requires a separate comment because it is not a purely numerical benchmark. The customer identifier is removed from the model, and the snapshot shown in Table 1 is only illustrative: the full source file contains additional service indicators, including online security, online backup, device protection, technical support, streaming TV, and streaming movies. The displayed rows combine binary variables such as SeniorCitizen, categorical service variables such as InternetService, account variables such as Contract and PaymentMethod, continuous billing variables such as MonthlyCharges and TotalCharges, and the target variable Churn. This mixture makes Telco a natural case for kernel fusion: blocks of variables may encode complementary notions of customer similarity.
For visual context, Figure 3 shows six distinct binary silhouettes from the Chickenpieces benchmark, without contour extraction or contour-string processing. These examples were selected from the public example panel of the data set in [41], avoiding near-duplicate visual shapes. The panel is intended only to indicate the type of binary shape objects used in this benchmark.
The 30-repetition vector experiments use train–test sizes for Synthetic, for Breast Cancer, for Ionosphere, and for Telco. Each Telco repetition is based on a stratified subsample of 2000 customers from the full 7043-customer table. Chickenpieces uses the full set of 446 objects and five classes; the repeated splits use a train–test partition.
4.3. Preprocessing
For the vector data sets, numeric variables are standardized using training statistics only. Categorical variables are encoded using a training-based one-hot design. In Telco, the customer identifier is removed, categorical fields are expanded, and the continuous variables tenure, MonthlyCharges, and TotalCharges are standardized on the training split.
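A minimal sketch of training-only preprocessing is given below. The helper names and the toy values are illustrative, and the handling of unseen categories (an all-zero row) is an assumption; the paper does not state how such levels are treated.

```python
import numpy as np

def fit_standardizer(X_train):
    """Mean and standard deviation estimated on the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0                      # constant columns are left unscaled
    return mean, std

def fit_one_hot(values_train):
    """Category levels observed in the training split."""
    return sorted(set(values_train))

def one_hot(values, levels):
    """Encode with training levels; unseen test levels map to an all-zero row."""
    return np.array([[1.0 if v == lev else 0.0 for lev in levels] for v in values])

X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[3.0, 12.0]])
mean, std = fit_standardizer(X_train)
X_test_std = (X_test - mean) / std             # test data never influences mean or std

levels = fit_one_hot(["Month-to-month", "Two year", "Month-to-month"])
contract_test = one_hot(["One year", "Two year"], levels)
```

Fitting the statistics and the level set inside each outer training split is what keeps the later kernel parameters, alignments, and combination rules free of test information.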
For Chickenpieces, each dissimilarity matrix is symmetrized when needed, its diagonal is set to zero, and it is re-scaled by its positive off-diagonal median. Two distance-to-kernel transformations are then used, one of RBF type and one of Laplacian type, with scale parameters estimated from the corresponding training distances. Each kernel is normalized before entering the comparison.
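The relational preprocessing can be sketched as follows. The two transformations use common textbook forms of the RBF and Laplacian maps applied to a dissimilarity; these exact forms, the function names, the value of `sigma`, and the toy matrix are assumptions.

```python
import numpy as np

def prepare_dissimilarity(D):
    """Symmetrize, zero the diagonal, and rescale by the positive off-diagonal median."""
    D = 0.5 * (D + D.T)                        # new array, the input is not modified
    np.fill_diagonal(D, 0.0)
    off = D[~np.eye(D.shape[0], dtype=bool)]
    return D / np.median(off[off > 0])

def distance_kernels(D, sigma):
    """Assumed RBF-type and Laplacian-type transformations of a dissimilarity matrix."""
    return np.exp(-(D ** 2) / (2.0 * sigma ** 2)), np.exp(-D / sigma)

D_raw = np.array([[0.0, 2.0, 4.0],
                  [2.2, 0.0, 6.0],
                  [4.0, 6.0, 0.1]])            # slightly asymmetric, nonzero diagonal
D = prepare_dissimilarity(D_raw)
K_rbf, K_lap = distance_kernels(D, sigma=1.0)
```

A dissimilarity that is not Euclidean can make either matrix indefinite, in which case the PSD projection used elsewhere in the pipeline would still be needed.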
4.4. Kernel Dictionary for Vector Data
The vector experiments use the same deliberately broad dictionary in every split. Table 2 gives the definitions used before trace normalization. All quantities that depend on the data (bandwidths, nearest-neighbor scales, class scales, prototype locations, and variable weights) are estimated on the training split only.
The RBF kernels use empirical distance quantiles as bandwidths, the Laplacian kernel uses the median distance, and the rational-quadratic kernel uses the median Euclidean scale. The class-scale kernels INTRA and INTER estimate their bandwidths from within-class and between-class training distances. LOCAL is a self-tuning kernel with a nearest-neighbor scale [43]. PROTO is a supervised prototype kernel estimated from training labels. All kernels are trace-normalized before combination.
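Trace normalization admits more than one scaling convention, and the one used in the paper is not recoverable here. The sketch below assumes division by the trace of the training matrix and applies the same constant to the train–test block, so that no test quantity enters the normalization; the function name is hypothetical.

```python
import numpy as np

def trace_normalize_blocks(K_train, K_test_train):
    """Scale both blocks by the trace of the training matrix (assumed convention)."""
    scale = np.trace(K_train)
    return K_train / scale, K_test_train / scale

K_train = np.array([[2.0, 1.0], [1.0, 2.0]])
K_test_train = np.array([[4.0, 2.0]])
K_train_n, K_test_train_n = trace_normalize_blocks(K_train, K_test_train)
```

Putting all base kernels on a common scale matters for entrywise rules such as maxima, minima, and percentiles, which would otherwise be dominated by whichever kernel has the largest entries.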
Automatic Selection Protocol
The suffix auto has a fixed meaning throughout the tables. Within each outer training split, kernels are first ranked by centered alignment with the training-label kernel, computed on the training block only. For a candidate number of kernels, the retained index set consists of the top-ranked kernels in that ordering.
The pair is then selected by stratified inner cross-validation on the training split. The vector-data runs use and ; the Chickenpieces runs use the same grid and . For mean_auto, the selected kernels are averaged with uniform weights. For alignment_auto, they are combined with non-negative weights proportional to their positive centered alignments. For matrix-only rules, the same selected kernels are first combined on the training block, the resulting matrix is projected onto the PSD cone when necessary, and the train–test block is obtained by the spectral/KRR continuation described below. The label “best individual” is a descriptive reference chosen from the completed single-kernel summaries; inside each outer split only its SVM penalty is tuned by inner CV. The label “best matrix-only” denotes the predeclared matrix-only rule with the lowest mean inner-CV error for that data set; it is not selected from held-out test performance.
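The ranking and weighting steps can be sketched as follows, using the usual definition of centered alignment (normalized Frobenius inner product of doubly centered matrices). The sketch assumes binary labels coded as -1 and +1, does not perform the inner cross-validation, and uses hypothetical function names and toy kernels.

```python
import numpy as np

def center(K):
    """Double-center a kernel matrix: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K, y):
    """Centered alignment between K and the label kernel y y^T."""
    Kc, Yc = center(K), center(np.outer(y, y))
    return float((Kc * Yc).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Yc)))

def top_q_by_alignment(kernels, y, q):
    """Indices of the q best-aligned kernels (best first) and all scores."""
    scores = np.array([centered_alignment(K, y) for K in kernels])
    return [int(i) for i in np.argsort(-scores)[:q]], scores

def alignment_weights(scores, selected):
    """Non-negative weights proportional to the positive part of the alignments."""
    pos = np.clip(scores[selected], 0.0, None)
    return pos / pos.sum()                 # assumes at least one positive alignment

y = np.array([1.0, 1.0, -1.0, -1.0])       # training labels only
z = np.array([1.0, -1.0, 1.0, -1.0])       # an unrelated grouping
kernels = [np.eye(4), np.outer(y, y) + np.eye(4), np.outer(z, z)]
selected, scores = top_q_by_alignment(kernels, y, q=2)
weights = alignment_weights(scores, selected)
```

In this toy dictionary the kernel built from the labels ranks first, the identity second, and the kernel tied to the unrelated grouping is discarded. Uniform weights over `selected` would give the mean-type combination instead of the alignment-weighted one.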
4.5. Combination Rules
We compare three families of combinations.
1. Direct convex combinations. We use a simple mean and an alignment-weighted mean over the selected kernels. The weights are proportional to the positive centered alignment between each selected kernel and the ideal training label kernel [10, 11].
2. Label-aware matrix combinations. The pick-out/max–min rule is defined entrywise on training pairs and depends on the training labels. Percentile variants replace the maximum and/or minimum by upper and lower empirical percentiles [6]. These rules are supervised and matrix-defined; their test blocks are obtained only through the out-of-sample reconstruction.
3. Hybrid score selection. As an additional diagnostic, we tested a hybrid selection score, but it did not improve the direct mean combinations and is therefore not used as a principal method.
When a label-aware combination is indefinite, it is symmetrized and projected onto the positive semidefinite cone by clipping negative eigenvalues [7, 30].
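A small sketch of a pick-out rule followed by PSD projection is given below. The entrywise definition used here (maximum over kernels for same-class pairs, minimum for different-class pairs) is the usual reading of a max–min rule and is an assumption about the missing formula; the function names and the 3 × 3 matrices are illustrative.

```python
import numpy as np

def pick_out(kernels, labels):
    """Max over kernels for same-class pairs, min over kernels for different-class pairs."""
    stack = np.stack(kernels)                          # shape (n_kernels, n, n)
    same = labels[:, None] == labels[None, :]
    return np.where(same, stack.max(axis=0), stack.min(axis=0))

def project_psd(M):
    """Symmetrize, clip negative eigenvalues, and report the relative perturbation."""
    S = 0.5 * (M + M.T)
    eigvals, eigvecs = np.linalg.eigh(S)
    P = (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T
    return P, np.linalg.norm(P - S) / np.linalg.norm(S)

K1 = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.8], [0.1, 0.8, 1.0]])
K2 = np.array([[1.0, 0.2, 0.7], [0.2, 1.0, 0.3], [0.7, 0.3, 1.0]])
y = np.array([0, 0, 1])                                # training labels only
K_pick = pick_out([K1, K2], y)
K_psd, rel_change = project_psd(K_pick)

# An indefinite example: eigenvalues 3 and -1, the negative one is clipped.
P_indef, change_indef = project_psd(np.array([[1.0, 2.0], [2.0, 1.0]]))
```

The rule needs the labels of both objects in a pair, which is exactly why it cannot be evaluated on a train–test pair and has to be continued by the reconstruction layer. The returned relative perturbation corresponds to the diagnostic recorded when a projection is applied.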
4.6. Out-of-Sample Evaluation
Direct convex combinations have explicit train–test blocks because the base kernels can be evaluated between test and training points. In contrast, pick-out and percentile rules define a matrix only on the training set. For these rules, we use the eigenfunction-based reconstruction described in the theoretical sections. The training matrix is decomposed spectrally, its positive part is retained, empirical eigenvectors are interpreted as traces of eigenfunctions, and the resulting functions are extended through the chosen basis. The C++ runs reported here use a spectral/KRR continuation for these matrix-only rules. This is stricter than evaluating an oracle matrix, because no test label is used when constructing the train–test block. Concretely, after PSD projection we retain positive eigenvalues above a numerical threshold until the retained positive spectrum explains a prescribed fraction of the positive trace. The held-out coordinate matrix is then obtained by kernel ridge continuation, with the direct mean of the selected base kernels as the regression kernel and a fixed ridge, and the reconstructed train–test block is assembled from these held-out coordinates. The percentile variants use fixed upper and lower empirical percentiles. All SVM evaluations use precomputed-kernel C-SVMs with C selected by the same inner-CV grid; Chickenpieces, the only multiclass benchmark, is handled by a one-versus-one multiclass reduction.
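The retention rule can be sketched as a short function. The floor `1e-10` and the trace fraction `0.99` are placeholders chosen for illustration; the values used in the reported runs are not recoverable from the text.

```python
import numpy as np

def retained_rank(eigvals, floor=1e-10, trace_fraction=0.99):
    """Number of leading positive eigenvalues kept by a floor-plus-trace rule."""
    pos = np.sort(eigvals[eigvals > floor])[::-1]      # positive part, descending
    if pos.size == 0:
        return 0
    explained = np.cumsum(pos) / pos.sum()
    rank = int(np.searchsorted(explained, trace_fraction)) + 1
    return min(rank, pos.size)                         # guard against rounding at fraction 1.0

spectrum = np.array([5.0, 3.0, 1.5, 0.5, -0.2, 1e-14])
rank_90 = retained_rank(spectrum, trace_fraction=0.9)
rank_99 = retained_rank(spectrum)
```

Negative eigenvalues and values below the floor never count towards the positive trace, so the rule is unaffected by how large the clipped part of an indefinite matrix was.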
The controlled Synthetic experiment also includes an oracle diagnostic. For rules such as the sum or the mean of base kernels, the true train–test block is available because the analytic combination is known. We can therefore compare the reconstructed block against the oracle block directly. This diagnostic is not used for model selection; it is only a check that the matrix-to-function conversion recovers the off-sample behavior when the training matrix comes from a smooth evaluable combination.
5. Results
5.1. Vector Data Sets: 30-Repetition Validation
Table 3 summarizes the main 30-repetition vector-data experiments. For each data set we report the best individual kernel, the two direct combinations, and the supervised matrix-only rule selected by the inner-CV criterion described above after strict out-of-sample reconstruction. The table is not intended to claim that the combined kernels are optimized classifiers. It is intended to show that a matrix built from the historical kernel dictionary can be converted into an evaluable kernel whose test behavior remains close to the behavior observed on the training matrix.
The results support the intended interpretation. On Synthetic, the simple mean slightly improves the best individual RBF scale, showing that the direct combination can recover a useful multiscale geometry. On Breast, the best individual kernel is the cosine kernel, and both direct combinations remain within about error units of it. On Ionosphere, the best individual kernel is the local-scaling kernel; the matrix-only percentile rule and the direct mean are both close to it. On Telco, the best fixed individual kernel by classification error is the quadratic polynomial kernel, and the direct mean and alignment-weighted combinations remain within about error units of it. The matrix-only percentile rule is considerably worse in Telco, which reinforces the distinction between stable analytic combinations and abrupt supervised matrix transformations. These outcomes are sufficient for the purpose of the article: the base kernels were not selected because they are special optimal classifiers, yet their matrix combinations can still be converted into usable out-of-sample kernels.
The large variability of percentile_inout_auto on the Synthetic data deserves a separate interpretation. This rule is not an analytic kernel combination: it is an entrywise, label-aware matrix transformation based on within-class and between-class empirical percentiles. As a result, small changes in the outer training split can change the selected kernels, the percentile thresholds, and the label-dependent entries of the combined matrix. The spectral/KRR continuation then has to extrapolate coordinate traces of an abrupt, sample-dependent geometry, while test labels are not available. This explains why the standard deviation of this method is much larger than that of mean_auto and alignment_auto. The result should therefore be read as a stability diagnostic for abrupt supervised matrix-only rules, rather than as a failure of direct kernel combination.
The fact that the best individual kernel changes across data sets is also expected. The Synthetic data favor a Gaussian scale adapted to a smooth curved geometry; Breast Cancer favors an angular/cosine similarity on ordinal biomedical attributes; and Telco favors a low-degree polynomial interaction on a heterogeneous tabular encoding. Thus, kernel choice does affect classification performance. However, the direct combinations remain close to the best individual choices, suggesting that stable combinations can reduce the need to commit to one manually selected kernel while preserving a deployable out-of-sample formula.
5.2. Telco Customer Churn
The Telco run consists of 30 independent stratified repetitions with 1400 training and 600 test customers per repetition. Each split uses the same 20-kernel vector dictionary and the same strict inner-CV protocol as the other vector benchmarks.
Table 4 reports the most informative single-kernel rows. The best fixed individual kernel by mean classification error is POLY2, with error . The best balanced accuracy among the fixed individual kernels is obtained by LIN, with balanced accuracy , although its mean error is slightly larger. The prototype kernel has the largest alignment with the training label kernel, but it does not produce the best held-out classifier; this is a useful reminder that alignment is a screening statistic rather than a complete substitute for validation. The supervised weighted RBF is a clear failure mode on this encoding, with balanced accuracy essentially at chance.
The discrepancy between alignment and held-out accuracy is expected on this heterogeneous table. Centered alignment measures agreement with the ideal label kernel on the training block; it does not measure the SVM margin, the stability of local neighborhoods, or how well a similarity can generalize to unseen customers. The prototype kernel obtains the largest alignment because it summarizes customers through class-prototype similarities estimated on the training data, but this compression can remove within-class heterogeneity that is useful for prediction. Numerically, PROTO has the largest alignment () but a worse error () than POLY2 (). The WRBF row is even more informative: its negative alignment (), near-chance balanced accuracy (), and nearly degenerate support-vector count indicate that the supervised diagonal RBF metric is poorly matched to the mixed one-hot and standardized Telco encoding. The automatic combination strategy is therefore not based on alignment alone: alignment is used only to rank candidate kernels, while the number of retained kernels and the SVM penalty are selected by inner cross-validation on the training split.
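The screening statistic discussed above can be illustrated with a short computation. The sketch below assumes the usual centered-alignment definition (normalized Frobenius inner product between the centered kernel matrix and the centered ideal label kernel y y^T); the function name, the toy two-blob data, and the RBF bandwidth are hypothetical and are not taken from the experiments.

```python
import numpy as np

def centered_alignment(K, y):
    """Centered alignment between kernel matrix K and the ideal label kernel y y^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H                               # centered kernel
    Yc = H @ np.outer(y, y) @ H                  # centered ideal label kernel
    denom = np.linalg.norm(Kc) * np.linalg.norm(Yc)
    return float(np.sum(Kc * Yc) / denom) if denom > 0 else 0.0

# Toy data (hypothetical): two Gaussian blobs with labels -1 / +1.
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 20)
X = rng.normal(size=(40, 2)) + 1.5 * y[:, None]
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.5 * sq_dist)                   # assumed bandwidth, for illustration only

a_ideal = centered_alignment(np.outer(y, y), y)  # label kernel aligned with itself
a_rbf = centered_alignment(K_rbf, y)
print(f"ideal label kernel: {a_ideal:.3f}, RBF kernel: {a_rbf:.3f}")
```

The statistic uses only the training block and the training labels, which is why it can rank candidate kernels cheaply but says nothing by itself about margins or behavior on unseen points.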
Table 5 gives the corresponding combination results. The two direct combinations are stable and almost indistinguishable from each other: mean_auto gives error , and alignment_auto gives . Their inner-CV errors are about , so the train–test discrepancy is small. In contrast, all matrix-only supervised rules are substantially worse after spectral/KRR continuation. The best of them is percentile_inout_auto, but its mean error is and its balanced accuracy is . Thus, Telco supports the same conservative conclusion as the other experiments: smooth averages are deployable and stable, whereas label-aware matrix-only operations produce a harder out-of-sample extension problem.
The Telco gap is also linked to the heterogeneous tabular nature of the data. The predictors combine demographic variables, categorical service indicators, contract and payment information, tenure, and billing amounts. Direct convex combinations preserve the smooth train–test structure of the base kernels and the block-level similarities induced by these variables. By contrast, matrix-only supervised rules impose an additional within-class/between-class deformation on the training matrix. In a heterogeneous table, this deformation can overwrite useful local similarities and create a coarse label-driven geometry that is not well aligned with the original covariate structure. The diagnostic in Table 6 shows that the train–test gap is moderate but the absolute error is already high, indicating that the main limitation in Telco is the induced supervised training geometry itself, not only the continuation step.
5.3. Training-Matrix Behavior Versus Out-of-Sample Behavior
Table 6 gives the most direct diagnostic for the matrix-to-function claim. The inner-CV column is measured using only the training matrix, that is, the points on which the matrix-to-function conversion is built. The test column is measured after extending the same construction to new points. The absolute gap is small for the direct mean on all four vector data sets. For the best matrix-only rule it is also small on Breast and Ionosphere, moderate on Telco, and large on Synthetic. Telco is especially informative because the best matrix-only rule has a moderate train–test gap but poor classification performance, so its limitation is the learned supervised matrix itself rather than only the continuation step.
This table is the main experimental evidence for the functionality of the extension. The relevant comparison is not whether the test error is the smallest possible among all classifiers, but whether the classifier obtained after extending the matrix behaves like the classifier assessed on the matrix used to build that extension. The direct mean passes this diagnostic robustly, including on Telco. The matrix-only percentile rules pass it most clearly on Breast and Ionosphere. On Telco the gap is not large, but the absolute error remains high, showing that the supervised matrix transformation did not produce a competitive geometry for this heterogeneous table. Synthetic shows a complementary negative case, where an abrupt supervised transformation can be hard to continue faithfully.
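The inner-CV versus test comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol of the paper: a least-squares kernel classifier replaces the SVM to keep the example dependency-free, the folds are plain (not stratified) K-fold splits, and the data, kernel scales, ridge value, and function names are all hypothetical.

```python
import numpy as np

def rbf(A, B, gamma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mean_kernel(A, B, gammas=(0.1, 1.0)):
    # Direct mean of two RBF scales: evaluable on any pair of point sets.
    return sum(rbf(A, B, g) for g in gammas) / len(gammas)

def ls_error(K_fit, y_fit, K_eval, y_eval, ridge=1e-2):
    # Least-squares kernel classifier (stand-in for the SVM); K_eval is (n_eval, n_fit).
    alpha = np.linalg.solve(K_fit + ridge * np.eye(len(y_fit)), y_fit)
    return float(np.mean(np.sign(K_eval @ alpha) != y_eval))

def inner_cv_error(K, y, n_folds=5, seed=0):
    # Cross-validated error computed from the training matrix only.
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for fold in np.array_split(idx, n_folds):
        fit = np.setdiff1d(idx, fold)
        errs.append(ls_error(K[np.ix_(fit, fit)], y[fit], K[np.ix_(fold, fit)], y[fold]))
    return float(np.mean(errs))

# Toy data (hypothetical): two Gaussian blobs, 60 training and 40 test points.
rng = np.random.default_rng(1)
y_tr = np.repeat([-1.0, 1.0], 30)
y_te = np.repeat([-1.0, 1.0], 20)
X_tr = rng.normal(size=(60, 2)) + 1.5 * y_tr[:, None]
X_te = rng.normal(size=(40, 2)) + 1.5 * y_te[:, None]

K_tr = mean_kernel(X_tr, X_tr)          # training matrix
K_te = mean_kernel(X_te, X_tr)          # train-test block of the extended kernel
cv_err = inner_cv_error(K_tr, y_tr)     # "inner-CV" column
test_err = ls_error(K_tr, y_tr, K_te, y_te)   # "test" column
gap = abs(cv_err - test_err)            # absolute train-test gap
print(f"inner-CV {cv_err:.3f}, test {test_err:.3f}, gap {gap:.3f}")
```

For a matrix-only rule, `K_te` would be the reconstructed block rather than an analytic evaluation; the diagnostic itself is unchanged, and test labels enter only in the final error computation.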
Quantitatively, the average absolute train–test gap of the direct mean over the four vector data sets is , with a maximum of on Synthetic. The corresponding average for the selected matrix-only supervised rows is , more than twice as large, and the largest vector-data gap is for Synthetic percentile_inout_auto. Including the relational stress test makes the contrast sharper: Chickenpieces mean_auto has a gap of , whereas pickout_auto has a gap of . This is the most compact numerical summary of the robustness pattern: smooth direct continuation is stable across data types, while abrupt supervised matrix-only continuation is data-dependent and can become unstable.
The two negative cases in Table 6 should therefore be distinguished. In Synthetic, percentile_inout_auto has both a larger train–test gap and a much larger across-split standard deviation, which points to instability of the abrupt label-aware continuation. In Telco, by contrast, the train–test gap is moderate, but both the inner-CV and test errors are high. This indicates that the matrix-only supervised rule has already produced a less competitive training geometry. These two behaviors support the same methodological conclusion: the reconstruction layer can make matrix-level rules deployable, but it cannot guarantee that every label-aware training matrix defines a stable and useful out-of-sample geometry.
5.4. Numerical Reconstruction Diagnostics
Table 7 records the numerical quantities monitored during the same runs. For analytic direct means, the PSD perturbation is zero and the main diagnostic is the inner-CV/test gap. For matrix-only rules, there is generally no oracle block , so the reported diagnostics are the selected , selected SVM penalty, PSD perturbation, inner-CV error, test error, and their absolute gap. The controlled oracle diagnostic in Table 8 provides the additional Frobenius and MSE block errors in the synthetic case where an oracle block is available.
These diagnostics are chosen to avoid label leakage. For matrix-only supervised rules such as pick-out and the percentile variants, there is generally no genuine oracle block at prediction time, because the rule depends on labels and the labels of the new points are not available. Constructing such a block with test labels would answer a different, retrospective question and would not represent a valid deployable classifier. For this reason, Table 7 reports diagnostics that are available under the strict protocol: PSD perturbation , selected model complexity, inner-CV error, test error, and the train–test gap. These quantities measure both the numerical effect of PSD rectification and the stability of the out-of-sample continuation without using test labels.
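A PSD perturbation of this kind can be computed from the training matrix alone. The sketch below assumes that rectification clips negative eigenvalues to zero and that the perturbation is reported as a relative Frobenius change; both choices, as well as the function name and the small example matrices, are assumptions made for illustration and may differ from the definition used in Table 7.

```python
import numpy as np

def psd_rectify(K):
    """Clip negative eigenvalues to zero; return the rectified matrix and the
    relative Frobenius perturbation ||K_psd - K||_F / ||K||_F (assumed definition)."""
    K_sym = 0.5 * (K + K.T)                       # enforce symmetry first
    w, U = np.linalg.eigh(K_sym)
    K_psd = (U * np.clip(w, 0.0, None)) @ U.T     # U diag(max(w, 0)) U^T
    rel = np.linalg.norm(K_psd - K_sym) / np.linalg.norm(K_sym)
    return K_psd, float(rel)

# An indefinite similarity matrix (eigenvalues 1.9, 1.9, -0.8).
K_indef = np.array([[1.0, 0.9, -0.9],
                    [0.9, 1.0, 0.9],
                    [-0.9, 0.9, 1.0]])
K_fix, rel_indef = psd_rectify(K_indef)

# A matrix that is PSD by construction, as for a direct mean of valid kernels.
A = np.random.default_rng(0).normal(size=(5, 3))
K_good = A @ A.T
_, rel_good = psd_rectify(K_good)

print(f"indefinite matrix: {rel_indef:.4f}, PSD matrix: {rel_good:.2e}")
```

No labels of new points are involved, so the quantity is available under the strict protocol; it is zero up to rounding whenever the combined matrix is already positive semidefinite.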
5.5. Oracle Reconstruction Diagnostic
The controlled Synthetic setting allows a second check. When the combined matrix comes from a known analytic rule such as a sum or a mean of base kernels, the oracle train–test block is available. Table 8 compares this oracle block with the block reconstructed from the training matrix only. Besides the relative Frobenius error, we report the functional block mean squared error.
The sum and mean rows have relative Frobenius error essentially at numerical precision. Pick-out is included as a negative control: it is a sharp label-aware matrix rule and does not correspond to the same smooth analytic continuation. In that row the oracle block uses test labels only for this retrospective negative-control diagnostic; test labels are never used to train, select, or reconstruct the evaluated model.
The oracle diagnostic clarifies the scope of the proposal. When the matrix is produced by a smooth combination of evaluable kernels, the matrix-to-function conversion recovers the off-sample kernel block. When the matrix is produced by an abrupt label-aware operation, the training matrix can still be represented spectrally, but the out-of-sample continuation is a harder modeling problem.
The relative Frobenius error and the functional MSE in Table 8 are therefore reconstruction-accuracy metrics for the case in which a leakage-free oracle block is meaningful. The pick-out row is included only as a negative control to show how an abrupt label-aware matrix rule differs from a smooth analytic combination.
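The oracle comparison can be sketched in a few lines. The code below is an illustration under explicit assumptions, not the implementation behind Table 8: spectral coordinates of the training matrix are continued to new points by kernel ridge regression with a single base RBF kernel as regressor, the relative Frobenius error is normalized by the oracle block, and the block MSE is the mean squared entrywise difference. The function names, kernel scales, ridge value, and toy data are hypothetical.

```python
import numpy as np

def rbf(A, B, gamma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mean_kernel(A, B, gammas=(0.1, 1.0)):
    return sum(rbf(A, B, g) for g in gammas) / len(gammas)

def continue_block(K_train, G_train, G_sections, ridge=1e-8):
    """Reconstruct the (n_new, n_train) block from the training matrix K_train.
    G_train / G_sections: regressor kernel on training pairs and new-vs-training pairs."""
    w, U = np.linalg.eigh(0.5 * (K_train + K_train.T))
    Phi = U * np.sqrt(np.clip(w, 0.0, None))      # spectral coordinates, Phi @ Phi.T ~ K_train
    coef = np.linalg.solve(G_train + ridge * np.eye(len(G_train)), Phi)
    Phi_new = G_sections @ coef                   # continued coordinates of the new points
    return Phi_new @ Phi.T

def block_errors(K_hat, K_oracle):
    diff = K_hat - K_oracle
    rel_frob = np.linalg.norm(diff) / np.linalg.norm(K_oracle)
    mse = np.mean(diff ** 2)                      # assumed normalization: mean over all entries
    return float(rel_frob), float(mse)

rng = np.random.default_rng(3)
X_tr = rng.normal(size=(60, 2))
X_te = rng.normal(size=(25, 2))

K_train = mean_kernel(X_tr, X_tr)                 # only this matrix is given to the reconstruction
oracle = mean_kernel(X_te, X_tr)                  # available because the rule is analytic
recon = continue_block(K_train, rbf(X_tr, X_tr, 1.0), rbf(X_te, X_tr, 1.0))
rel_frob, mse = block_errors(recon, oracle)
print(f"relative Frobenius error {rel_frob:.2e}, block MSE {mse:.2e}")
```

The oracle block is used only to score the reconstruction; it never enters `continue_block`. For a label-aware rule the same call would run unchanged, but no leakage-free `oracle` would exist to compare against.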
5.6. Chickenpieces
Chickenpieces is the most important relational experiment because the input is not a feature matrix but a family of dissimilarities between shapes [12, 39]. Table 9 reports the best individual kernel and the main combinations.
The best individual kernel is a Laplacian transformation of the norm29/cost60 contour dissimilarity. The mean and alignment combinations essentially recover the best individual kernel without selecting it manually: the difference in mean error is less than . This is the strongest evidence that the workflow is useful for relational data, where the natural modeling question is not which coordinate system to use, but which dissimilarity source or kernel transformation should be trusted.
The pick-out rule is not competitive in Chickenpieces under strict out-of-sample reconstruction. This is not a contradiction of the method; it is a diagnostic result. The training matrix created by pick-out contains abrupt label-dependent information, and the experiment shows that such geometry is harder to continue to new silhouettes without test labels.
This result should not be interpreted as a general impossibility statement for label-aware matrix combinations on relational dissimilarity data. It shows that the particular abrupt pick-out rule is not robust in the strict Chickenpieces out-of-sample setting. The direct mean and alignment combinations remain stable because they preserve the similarity structures induced by the distance-to-kernel transformations. Pick-out, on the other hand, builds a sharply label-dependent training matrix, and the labels needed to reproduce that rule are unavailable for new silhouettes. In this sense, pickout_auto functions here as a negative-control or stress-test rule. Relational data remain a natural application of the proposed matrix-to-function framework, but smoother supervised combinations are needed if one wants to exploit label information while preserving stable deployability.
A practical way to improve relational label-aware rules would be to regularize the supervised deformation itself. Examples include convex shrinkage of pick-out towards the direct mean, temperature-smoothed max–min or percentile rules, PSD-constrained supervised weights, graph smoothing of the continued eigentraces, and cross-validation of the continuation ridge or retained spectral rank with a criterion that penalizes large train–test gaps. These modifications would preserve the relational input format while making the label-dependent geometry less discontinuous.
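The first of these modifications, convex shrinkage towards the direct mean, can be sketched as follows. The pick-out rule is assumed here to take the entrywise maximum over base kernels for same-label pairs and the entrywise minimum for different-label pairs; this definition, the shrinkage parameter `alpha`, the function names, and the toy data are assumptions for illustration, and the modification itself has not been evaluated in the experiments above.

```python
import numpy as np

def pickout(kernels, y):
    # Assumed rule: max over base kernels within a class, min across classes.
    stack = np.stack(kernels)
    same = y[:, None] == y[None, :]
    return np.where(same, stack.max(axis=0), stack.min(axis=0))

def shrink_to_mean(kernels, y, alpha):
    # alpha = 0 gives the direct mean, alpha = 1 the unregularized pick-out matrix.
    K_mean = np.mean(np.stack(kernels), axis=0)
    return (1.0 - alpha) * K_mean + alpha * pickout(kernels, y)

# Toy data (hypothetical): two classes, two RBF scales as the base dictionary.
rng = np.random.default_rng(2)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 2)) + 1.0 * y[:, None]
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
kernels = [np.exp(-g * sq) for g in (0.1, 1.0)]

for alpha in (0.0, 0.5, 1.0):
    K_alpha = shrink_to_mean(kernels, y, alpha)
    min_eig = np.linalg.eigvalsh(K_alpha).min()
    print(f"alpha={alpha:.1f}  smallest eigenvalue={min_eig:.3e}")
```

In such a scheme `alpha` would be one more quantity to select by inner cross-validation on the training split, possibly with a criterion that also penalizes the PSD perturbation or the train–test gap.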
5.7. Benchmarking Perspective Relative to Alternative Extension Routes
The experiments provide quantitative benchmarking for the matrix-to-function step, but not a full leaderboard against all possible kernel classifiers. This distinction is important. Existing out-of-sample methods such as Nyström approximations or random features assume that the kernel section to the new point is known or approximable from a known analytic kernel. For matrix-only supervised rules, that section is precisely unavailable without using the unknown test label. The leakage-free comparisons that are meaningful in this setting are therefore: (i) direct analytic combinations versus reconstructed matrix-only rules in held-out classification; (ii) training-matrix inner-CV versus held-out test performance; and (iii) oracle train–test block error in the controlled Synthetic case where an analytic smooth rule is available.
Table 10 summarizes these comparisons. The key numerical message is that the proposed continuation is essentially exact for smooth analytic matrices in the Synthetic oracle diagnostic, while classification robustness depends on the regularity of the matrix geometry being continued.
5.8. Additional Hybrid Score Experiment
We also ran the hybrid selector based on a convex combination of CV-rank and alignment-rank. These runs used only five repetitions and are therefore treated as auxiliary. On Breast, the best score version reached error 0.0322 ± 0.0122; on Synthetic, 0.0787 ± 0.0119; and on Ionosphere, 0.0566 ± 0.0094. These values do not improve the principal direct means in Table 3. The experiment was useful as a diagnostic, but it will not be retained as a main method until it is re-run with the same 30-repetition protocol.
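One possible form of such a hybrid score is sketched below. The exact selector used in the auxiliary runs is not restated here, so the ranking convention (rank 0 is best), the mixing weight `beta`, the function name, and the four candidate values are purely illustrative assumptions; the numbers are made up and are not experimental results.

```python
import numpy as np

def hybrid_scores(cv_errors, alignments, beta=0.5):
    """Convex combination of CV-rank and alignment-rank (lower score is better)."""
    cv_rank = np.argsort(np.argsort(cv_errors))                  # 0 = lowest inner-CV error
    al_rank = np.argsort(np.argsort(-np.asarray(alignments)))    # 0 = highest alignment
    return beta * cv_rank + (1.0 - beta) * al_rank

# Made-up values for four hypothetical candidate kernels k0..k3.
cv_errors = [0.10, 0.08, 0.12, 0.09]
alignments = [0.30, 0.20, 0.35, 0.40]

scores = hybrid_scores(cv_errors, alignments, beta=0.5)
best = int(np.argmin(scores))
print(scores, "-> selected candidate index", best)
```

With `beta = 1` the selector reduces to pure inner-CV ranking and with `beta = 0` to pure alignment ranking, so the weight interpolates between validation and screening. Ties are broken by position in this sketch.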
6. Discussion
The final experiments lead to a deliberately conservative interpretation.
First, the experiments support the central matrix-to-function claim. For the stable mean combination, the classifier evaluated on held-out points behaves like the classifier assessed on the training matrix used to construct the extension. The 30-repetition train–test gaps are small in Synthetic, Breast, Ionosphere, and Telco for the direct mean. The oracle Synthetic diagnostic is even stronger: when the training matrix is generated by a smooth sum or mean of base kernels, the reconstructed train–test block agrees with the oracle block up to numerical precision.
Second, the results should not be read as an optimized classification benchmark. The kernel dictionary was deliberately inherited from earlier kernel-combination work and was not redesigned to make these data sets easy. It contains simple baselines, multiple Gaussian scales, class-scale variants, local scaling, prototype similarities, and a supervised weighted RBF. The fact that the mean or alignment combinations remain close to the best individual kernel is therefore useful evidence: the conversion works even with a non-specialized historical battery of kernels.
Third, alignment is useful as a diagnostic but not as a universal replacement for model selection [10, 11]. In Breast, alignment identifies the cosine kernel, which is also the best individual kernel. In Ionosphere and Telco, however, prototype-type kernels have high alignment but poorer classification performance than simpler alternatives. This confirms that alignment should be treated as a screening measure rather than as a complete performance criterion.
Fourth, the label-aware rules reveal the distinction between a discriminative training matrix and a deployable out-of-sample kernel [6, 18]. Percentile rules extend well in Breast and Ionosphere. In Telco, the best percentile rule has a moderate train–test gap but is still far less accurate than the direct means, so its limitation is the induced matrix geometry. The Synthetic and Chickenpieces results show an even stronger failure mode: abrupt supervised operations can be hard to continue faithfully. Pick-out is therefore best interpreted as a stress test or negative control, not as the empirical claim of the paper.
Fifth, Telco strengthens the heterogeneous-tabular side of the motivation. The customer table mixes categorical service fields, account variables, and billing amounts. The direct mean remains within error units of the best fixed individual kernel, which indicates that the broad dictionary can be aggregated without losing deployability.
Sixth, Chickenpieces strengthens the relational side of the motivation. It shows that the matrix-to-function problem is not merely a tabular-kernel issue. In a relational problem with 44 dissimilarity matrices, the proposed protocol can compare and aggregate many kernels, identify a useful transformation family, and evaluate the resulting kernel out of sample.
Seventh, the benchmarking claim should be interpreted in the appropriate problem class. The state-of-the-art methods most often associated with out-of-sample kernels, such as Nyström approximations and random features, are designed for situations where the kernel section or an analytic base kernel is available. They do not directly solve the deployment of a final label-aware matrix rule whose entries are defined only on the training pairs. For that reason, the strongest quantitative evidence in this paper is not a generic runtime leaderboard but the combination of held-out classification baselines, train–test stability diagnostics, and the Synthetic oracle block comparison. Scaling the same reconstruction to much larger problems will require low-rank or randomized eigensolvers and faster coordinate evaluation, which we regard as an implementation direction rather than a change in the mathematical problem.
The unstable and negative-control cases discussed above clarify the scope of the framework. Kernel choice matters, because different data geometries favor different base kernels; nevertheless, stable direct combinations can aggregate a broad dictionary while keeping an explicit train–test formula. Reconstruction diagnostics also need to be interpreted according to the type of combination: for smooth analytic rules, relative Frobenius error and functional MSE can be computed against an oracle block, whereas for label-aware matrix-only rules the valid deployment diagnostics are PSD perturbation, inner-CV/test gaps, and held-out classification metrics without test-label access. Finally, the poor performance of percentile_inout_auto in Synthetic and pickout_auto in Chickenpieces should be viewed as informative negative controls. They show that a discriminative training matrix is not automatically a stable deployable kernel.
The final message is therefore not that kernel combinations always outperform the best individual kernel. Rather, the contribution is a principled and strictly validated way to move from a finite kernel matrix or a family of kernel matrices to an out-of-sample kernel evaluator. When the combination is smooth and geometrically stable, as in direct sums and means, the evaluator preserves the training-matrix behavior on new points. When the combination is highly label-driven, as in pick-out, the method exposes the difficulty of extending that matrix faithfully.
7. Conclusions
This paper studies inductive finite-rank extensions of out-of-sample kernel functions from empirical kernel matrices, with particular emphasis on matrices produced by kernel-combination rules. The proposed approach uses spectral decompositions and data-dependent RKHS bases to convert a matrix defined on a training sample into a controlled kernel evaluator for new points.
The experiments support three conclusions. First, the conversion from matrix to function is reliable for smooth, stable combinations such as sums and means. Quantitatively, the direct mean is essentially tied with the best individual kernel on Synthetic ( versus error), remains within about error units of the best individual kernel on Telco ( versus ), and matches the best relational Chickenpieces kernel within error units. In the controlled Synthetic oracle diagnostic, smooth sum/mean reconstruction gives relative Frobenius error and functional MSE at numerical scale. Second, the base kernels do not need to be specially engineered for optimal classification in order for the reconstruction idea to work; the broad historical dictionary used in earlier combination experiments is sufficient to obtain stable deployable kernels. Third, purely supervised matrix-only combinations such as pick-out and some percentile rules are harder to deploy: Synthetic percentile_inout_auto has error , the best Telco matrix-only supervised rule has error , and Chickenpieces pickout_auto has error . This limitation is not a contradiction of the reconstruction method; rather, it identifies the boundary between smooth or moderately regular matrix combinations, which are naturally extendable, and abrupt supervised transformations, which require additional regularization or smoother design before they can be recommended as deployable kernels.
The evidential basis for these conclusions is as follows. Table 3 and Table 6 support the stability claim for direct combinations through held-out classification and train–test gap diagnostics. Table 8 supports the reconstruction-accuracy claim for smooth analytic rules through relative Frobenius error and functional MSE. Table 5 and Table 9 delimit the scope of label-aware matrix-only rules by showing that abrupt supervised transformations may perform poorly even when the matrix can be represented spectrally. Table 10 clarifies the relationship with alternative extension routes and explains why methods requiring an analytic test kernel section are not directly applicable to label-aware matrix-only combinations without an oracle. The conclusions are therefore deliberately restricted: the paper does not claim universal superiority of kernel combinations, does not present a large-scale state-of-the-art classifier benchmark, and does not recommend abrupt supervised matrix-only rules as default deployable kernels.
Future work should focus on three directions. The first is computational: large-scale eigensolvers, landmark approximations, and faster basis-evaluation schemes are needed for larger data sets. The second is methodological: smoother supervised combination rules, possibly constrained to remain positive semidefinite and easier to extend, may retain the discrimination of label-aware rules while preserving the stability of direct averages. For relational data, this includes shrinkage of abrupt pick-out matrices towards smooth means, temperature-smoothed max–min rules, and graph-regularized eigentrace continuation. The third direction is comparative: the present reconstruction should be combined and benchmarked with PSD-embedding extensions [8], subspace-fusion rules for heterogeneous proximities [9], polar-decomposition PSD corrections [31], and scalable Nyström or randomized low-rank variants when an analytic kernel section is available.