Article

Integrated Model Selection and Scalability in Functional Data Analysis Through Bayesian Learning

by Wenzheng Tao 1,2, Sarang Joshi 2,3,* and Ross Whitaker 1,2
1 School of Computing, The University of Utah, Salt Lake City, UT 84112, USA
2 Scientific Computing and Imaging Institute, The University of Utah, Salt Lake City, UT 84112, USA
3 Biomedical Engineering, The University of Utah, Salt Lake City, UT 84112, USA
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(5), 254; https://doi.org/10.3390/a18050254
Submission received: 27 February 2025 / Revised: 17 April 2025 / Accepted: 18 April 2025 / Published: 26 April 2025
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract:
Functional data, including one-dimensional curves and higher-dimensional surfaces, have become increasingly prominent across scientific disciplines. They offer a continuous perspective that captures subtle dynamics and richer structures compared to discrete representations, thereby preserving essential information and facilitating the more natural modeling of real-world phenomena, especially in sparse or irregularly sampled settings. A key challenge lies in identifying low-dimensional representations and estimating covariance structures that capture population statistics effectively. We propose a novel Bayesian framework with a nonparametric kernel expansion and a sparse prior, enabling the direct modeling of measured data and avoiding the artificial biases from regridding. Our method, Bayesian scalable functional data analysis (BSFDA), automatically selects both subspace dimensionalities and basis functions, reducing the computational overhead through an efficient variational optimization strategy. We further propose a faster approximate variant that maintains comparable accuracy but accelerates computations significantly on large-scale datasets. Extensive simulation studies demonstrate that our framework outperforms conventional techniques in covariance estimation and dimensionality selection, showing resilience to high dimensionality and irregular sampling. The proposed methodology proves effective for multidimensional functional data and showcases practical applicability in biomedical and meteorological datasets. Overall, BSFDA offers an adaptive, continuous, and scalable solution for modern functional data analysis across diverse scientific domains.

1. Introduction

The emergence of big data across diverse fields, such as biomedicine, finance, and physical modeling, has catalyzed the need for advanced analytical methodologies capable of handling complex, high-dimensional datasets that conventional discrete data analysis approaches cannot always process effectively. Such datasets often require analysis that captures and interprets their continuous and potentially high-dimensional complexities—a central promise of functional data analysis (FDA) [1,2]. Foundational work established FDA’s capacity to treat each observation as an entire function [3], be it a curve, surface, or higher-dimensional structure, thereby extracting richer insights than conventional discrete point analyses. Over the past decade, FDA’s scope has widened significantly to accommodate high-dimensional applications, with theoretical and computational advances emerging across various contexts [4,5].
A pivotal technique within FDA is functional principal component analysis (fPCA), which serves as a dimension reduction tool similar to classical PCA and factor analysis. Unlike classical PCA, however, fPCA operates, in principle, in an infinite-dimensional function space to capture dominant modes of variation and reduce complexity [6]. Despite its conceptual elegance, existing fPCA and similar FDA models often assume that data are observed on a shared, finite grid, relying on heuristic imputation or posterior estimation to handle any missing entries [7,8,9]. This assumption conveniently facilitates the adoption of established linear algebraic methods but compromises the integrity of FDA by introducing significant information loss and high computational demands in high-dimensional applications.
An ideal approach would represent each function at its naturally sampled measurement points rather than forcing all observations onto a shared grid, thus preserving crucial information and avoiding the need for heuristic resampling. This point is critical when considering that, given only a finite number of data points, infinitely many functions can interpolate these points, each reflecting different inductive biases about smoothness or shape [10]. Conventional smoothing or regridding methods (e.g., polynomial interpolation) introduce biases that may distort the underlying function’s actual behavior. In contrast, we will achieve more accurate and unbiased predictions by concurrently updating the function estimation and the population-level statistics governing the estimation, such as those encoded in the covariance operator. Such an approach requires the direct modeling of the function from its original measurement points, rather than imposing artificial grids.
To mitigate these limitations, several studies have proposed alternative strategies. For instance, Ref. [3] developed a nonparametric technique for the estimation of the mean and covariance for functional data under smoothness assumptions while also discussing a continuous formulation and the necessary discretization in practical applications. In [11,12], the authors extended fPCA to sparse and irregular longitudinal designs by smoothing the covariance estimate and then discretizing. Nonetheless, classical discretization steps often result in significant information loss and computational burdens.
As functional data’s size and complexity grew, researchers turned to flexible basis expansions, including sinusoids (Fourier), wavelets, polynomials, and B-splines, for a finite-dimensional representation of functional data that is convenient and accurate in computation, avoiding the drawbacks of explicit approximation and resampling [2,13,14]. However, a core challenge remains in selecting a suitable model. For instance, researchers must choose the number and form (e.g., smoothness) of the basis functions, along with the dimensionality of the representational subspace. The placement of the basis functions is also essential for approximation. Evenly spaced nodes remain popular for their simplicity but may be suboptimal. Alternative node allocations may be better, such as Chebyshev nodes for superior accuracy [15] or sparse grids to reduce the combinatorial growth of the computational complexity [16].
Existing studies tend to rely on choosing the hyperparameters manually [6] or on cross-validation [3,14,17], which is laborious or computationally prohibitive. Others employ approximated cross-validation [13] or marginal likelihood [8], but these still require the exhaustive testing of all candidate models. Methods with sparse Bayesian priors [7,18] instead perform model selection within a single optimization. In [19,20], the authors use shrinkage or sparse priors for data-adaptive basis selection to ensure minimal but effective sets of basis functions. Notably, Ref. [21] proposed modified Bayesian and Akaike information criteria, demonstrating state-of-the-art performance in simulation studies for sparse and dense functional data.
In addition, probabilistic FDA emerges as a sophisticated adaptation of probabilistic methods tailored to incorporate the flexibility of latent variable models to manage functional data. A Bayesian latent factor regression model (LFRM) [18], for example, extends conventional regression to accommodate complex structures and dependencies in functional data, providing a robust framework to handle the complexities inherent in functional data. However, these Bayesian approaches are often limited by the computational demands of Monte Carlo methods in high dimensions [8]. To address increasingly high-dimensional FDA problems, recent efforts have emphasized scalability. For instance, Ref. [6] introduced FDA for images with a fixed basis or grid. In [17], they further reduced the complexity in 2D fPCA via tensor product B-splines. Meanwhile, Ref. [22] applied a Bayesian framework with basis expansion, adaptive regularization, and Gibbs sampling to 2D functional data in the form of EEG studies on children with autism. Furthermore, Ref. [23] leveraged a parsimonious basis representation and variational Bayes to achieve computational efficiency, making it suitable for 3D brain imaging data. A Bayesian nonparametric model [24] leverages variational inference for efficient computation in high-dimensional functional time series and uses an Indian buffet process to automatically select latent factors. Nonetheless, it focuses on 1D functional observations with temporal dependencies and a common sampling grid.
In parallel, the broader method of principal component analysis (PCA) remains a fundamental and effective tool. Classical PCA, rooted in eigendecomposition [25], effectively extracts dominant modes of variation in many settings but does not inherently accommodate the probabilistic nature of real-world data and their inherent uncertainties. Thus, Ref. [26] introduced probabilistic PCA (PPCA), which incorporates a probability distribution to manage these uncertainties more effectively. PCA has since evolved to address missing data [27], model selection [28], and complex data types [29]. In the context of functional data, these concepts motivate new approaches that unify probabilistic methodologies, latent factor models, and kernel expansions for continuous domains [1].
Within Bayesian machine learning, various priors have been proposed for sparse or robust formulations of PCA. Specifically, sparse Bayesian learning (SBL) [30], with its automatic relevance determination (ARD) mechanism [31,32], has proven adept at promoting parsimonious solutions [33]. SBL has emerged in Bayesian PCA [28], applying an iterative method to evaluate the relevance of each component and select the internal dimensionality by disregarding the redundant ones. In [34], the authors applied SBL to optimize the combination of base kernels to enhance model performance. A matrix completion method [35] uses ARD to select the factorization rank and dual graph priors to promote smoothness along rows and columns for the effective interpolation of missing entries, although they are more relaxed than a strict continuity constraint for FDA. These methods often exploit variational techniques or accelerated optimization [36], thereby balancing model complexity with computational tractability. In functional data contexts, where representations are infinite-dimensional, SBL offers a compelling framework for advanced FDA methods by efficiently handling sparse expansions and adaptively adjusting the model complexity.
In summary, despite these efforts to advance functional data analysis, several challenges persist. Existing methods often exhibit limitations in accuracy and efficiency when sampling is sparse, automatic model selection is essential, and the dimensionality is high [23]. Concurrently, probabilistic PCA and SBL frameworks illustrate powerful strategies to incorporate versatility and adaptivity for such data complexities, while their adaptation to FDA is still evolving. These gaps underscore the need for a robust, flexible, and computationally feasible approach, unifying ideas from FDA, PPCA, and SBL, that manages the continuous and high-dimensional nature of modern datasets.

1.1. Contributions

This manuscript proposes a novel Bayesian framework for functional principal component analysis that leverages nonparametric kernel expansions, sparse Bayesian learning for model selection, and efficient variational inference (VI). We abbreviate the proposed method as BSFDA (Bayesian scalable functional data analysis). (The code is available at https://github.com/WeeenZh/BSFDA, accessed on 21 March 2025). BSFDA addresses critical gaps in existing FDA techniques with irregular sampling, high-dimensional scalability, and the selection of both basis functions and principal components. Specifically, our approach offers the following:
  • Joint selection of optimum latent factors and sparse basis functions: This eliminates constraints on parametric representation dimensionality, avoids information loss from discretization, and extends naturally to higher dimensions or non-Euclidean spaces through nonparametric kernel expansion. It further enhances the interpretability by adaptively choosing the model complexity without testing multiple models separately. We achieve these improvements using a Bayesian paradigm that provides robust and accurate posterior estimates while supporting uncertainty quantification.
  • Scalability across domain dimensionality and data size: The proposed method uses VI for faster computation compared to Markov chain Monte Carlo (MCMC) methods, while still being accurate in terms of the estimation of the intrinsic dimensionality and overall covariance structure. BSFDA reduces the overall computation by partitioning the parameters into smaller update groups and introducing a slack variable to further subdivide the weighting matrix (which is part of the kernel structure) into even smaller parts [18], updating fewer blocks at a time and considering all model options. Introducing a slack variable makes the optimization process more efficient by separating different variable groups. This approach scales well with the data size and works efficiently even with large, complex datasets. We demonstrate this on the 4D global oceanic temperature dataset (ARGO), which consists of 127 million data points spanning across the globe for 27 years, with depths of up to 200 m [37].

1.2. Outline

Together, these contributions position our work at the intersection of functional principal component analysis [1] and sparse Bayesian learning [30], enabling the robust, flexible, and computationally feasible analysis of high-dimensional functional data. The remainder of this paper is organized as follows. In Section 2, we describe the proposed Bayesian functional PCA framework in detail, highlighting the nonparametric kernel expansions and sparse Bayesian priors. Next, in Section 3 and Section 4, we discuss the variational inference procedure and the reduced active block updating step, illustrating how these techniques jointly provide scalability and accuracy. In Section 5, we then present extensive empirical studies demonstrating the factor selection accuracy, covariance operator estimation, and performance in large-scale 4D applications. Finally, in Section 6, we conclude with a discussion of potential extensions and open directions, emphasizing the broader implications of our work for large-scale, high-dimensional functional data analysis.

2. Formulation

2.1. Generative Model

In conventional fPCA, the data are assumed to be samples of functions that are elements of an appropriately smooth function space [1]. Using this assumption, data samples acquired at discrete points are typically interpolated to the continuum using tools such as splines, Fourier basis functions, or wavelets. In our work, we assume that the functions $y_i : \mathbb{R}^M \to \mathbb{R}$ are outcomes of an $M$-dimensional stochastic process. As in classical fPCA, we assume that $y_i$ is in a class of functions that can be approximated through a truncated, finite expansion, which is a weighted summation of $K$ kernel functions $\{\phi_k\}_{k=1}^{K}$:
$y_i(x) = \sum_{k=1}^{K} w_{ik}\,\phi_k(x),$  (1)
with $w_{ik}$ being random variables, $\phi_k(x) = \mathcal{K}(x, X_k)$, where $\mathcal{K}$ is the kernel function and $X_k$ is the $k$-th location.
Thus, the $y_i(x)$ are realizations of a finite-dimensional stochastic process. Conventionally, the covariance operator of these functions is discretized, and the leading eigenfunctions form the estimated principal component loadings, following the Karhunen–Loève theorem [1]. By contrast, we establish a flexible Bayesian framework of fPCA following the form of probabilistic PCA [26], where the principal subspace is identifiable up to an arbitrary rotation and does not enforce the orthogonality of the loadings. Nevertheless, it is straightforward to recover classical eigenfunctions from the final covariance estimation over an arbitrary grid in this Bayesian framework.
The observed data are $P$ independent, noisy samples of the functions $\{y_i\}_{i=1}^{P}$ at indices $\{X_i \in \mathbb{R}^{N_i \times M}\}_{i=1}^{P}$, where $N_i$ is the number of measured samples for the $i$th function $y_i$ and $X_{in} \in \mathbb{R}^{M}$ is the location of the $n$th measurement in the domain of the sample. The observations are $\{Y_i\}_{i=1}^{P}$, where $Y_{in} = y_i(X_{in}) + E_{in}$, and $E_{in}$ is white Gaussian noise of variance $\sigma^2$.
We also assume that the functions span a low-dimensional subspace of dimension $J \ll K$. We model this stochastically by assuming that the weights, $w_i \in \mathbb{R}^{K}$, are given by $w_{ik} = \sum_{j=1}^{J} Z_{ij} W_{jk} + \bar{Z}_k$, where $W \in \mathbb{R}^{J \times K}$ are the principal component loading coefficients and $Z_i \in \mathbb{R}^{J}$ are standard normal variables [26]. This model is therefore
$Y_{in} = \sum_{k=1}^{K} \left( \sum_{j=1}^{J} Z_{ij} W_{jk} + \bar{Z}_k \right) \phi_k(X_{in}) + E_{in} = (Z_i W + \bar{Z})\,\Phi_{i \cdot n} + E_{in},$  (2)
where $\Phi_{i \cdot n} = [\phi_1(X_{in}), \ldots, \phi_K(X_{in})]^{T}$ are the evaluations of the basis functions at the $n$-th index of the $i$-th sample function.
The choice of the kernel family usually benefits from knowledge of the dataset’s characteristics, such as the periodicity or domain geometry. Our framework is flexible across various kernel families, but we employ Gaussian kernels for both one-dimensional and multidimensional data by default, with the initial length-scale selection carried out through cross-validation, which will be refined through our sparse prior described below. To avoid disproportionately favoring larger length scales, we normalize each kernel’s scaling coefficient using its square integral over the observational domain.
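As a concrete illustration, the following Python sketch evaluates a Gaussian kernel design matrix at arbitrary measurement locations and applies a square-integral normalization; the exact normalization constant is our reading of the text (consistent with the 4D simulation in Section 5.1.2), not code taken from the BSFDA implementation.

import numpy as np

def gaussian_basis(X, centers, length_scale, normalize=True):
    """Evaluate Gaussian kernel basis functions phi_k(x) = K(x, X_k) at the rows of X.
    X : (N, M) measurement locations; centers : (K, M) kernel locations X_k.
    Returns the (N, K) design matrix Phi with Phi[n, k] = phi_k(X[n])."""
    diff = X[:, None, :] - centers[None, :, :]        # (N, K, M) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)              # squared Euclidean distances
    Phi = np.exp(-0.5 * sq_dist / length_scale ** 2)
    if normalize:
        # Assumed normalization by the kernel's square integral over R^M,
        # int exp(-||x - c||^2 / l^2) dx = (pi * l^2)^(M/2), so that larger
        # length scales are not favoured disproportionately.
        M = X.shape[1]
        Phi = Phi / (np.pi * length_scale ** 2) ** (M / 2)
    return Phi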

2.2. Sparse Prior

For effective model selection, we introduce a sparse prior over the coefficients of the basis functions [28]. The sparse prior in the proposed model is based on automatic relevance detection (ARD) [28]. ARD evaluates the importance of a feature with a precision parameter estimated from the data. The model uses the precisions $\{\alpha_j\}_{j=1}^{J}$ and $\{\beta_k\}_{k=1}^{K}$ for the components and basis functions, respectively, while $\eta$ signifies the overall magnitude of the mean coefficients:
$\bar{Z}_k \sim N\!\left(0, \eta^{-1}\beta_k^{-1}\right), \quad k = 1{:}K$  (3)
$W_{jk} \sim N\!\left(0, \alpha_j^{-1}\beta_k^{-1}\right), \quad j = 1{:}J,\; k = 1{:}K$  (4)
In the model, $\alpha_j$, $\beta_k$, $\eta$, and $\sigma^{-2}$ are all precision parameters, each coming naturally with a conjugate Gamma prior that facilitates efficient posterior optimization. The probabilistic graphical model is depicted in Figure 1. Setting $a_0$, $b_0$ to a small value yields a vague Gamma prior that approximates a noninformative (Jeffreys-type) prior.
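To make the generative model and the ARD prior concrete, the following sketch draws a small synthetic dataset from the model above, reusing the gaussian_basis helper from the previous sketch; all sizes and hyperparameter values are illustrative assumptions rather than settings from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): J components, K bases, P curves, N points per curve.
J, K, P, N, sigma, eta = 3, 20, 50, 30, 0.3, 1.0
alpha = rng.gamma(2.0, 1.0, size=J)    # component precisions alpha_j
beta = rng.gamma(2.0, 1.0, size=K)     # basis-function precisions beta_k

# ARD draws: W_jk ~ N(0, 1/(alpha_j beta_k)), Zbar_k ~ N(0, 1/(eta beta_k)).
W = rng.normal(size=(J, K)) / np.sqrt(np.outer(alpha, beta))
Zbar = rng.normal(size=K) / np.sqrt(eta * beta)

centers = rng.uniform(0.0, 1.0, size=(K, 1))
X_all, Y_all = [], []
for i in range(P):
    Xi = rng.uniform(0.0, 1.0, size=(N, 1))                 # irregular sampling locations
    Phi_i = gaussian_basis(Xi, centers, length_scale=0.1)   # (N, K) basis evaluations
    Zi = rng.normal(size=J)                                  # standard-normal component scores
    Yi = (Zi @ W + Zbar) @ Phi_i.T + sigma * rng.normal(size=N)
    X_all.append(Xi)
    Y_all.append(Yi)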

3. Methods

Based on the proposed formulation in Section 2, we estimate $\Pr[\Theta \mid X, Y, a_0, b_0]$, the posterior of the unobserved values $\Theta = \{Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta\}$. This inference gives the point estimates of $\Theta$ and the posterior predictive distribution of new data. For notational convenience, $X$, $a_0$, and $b_0$ are omitted.
Using Bayes’ theorem, $\Pr[\Theta \mid Y] = \frac{\Pr[Y \mid \Theta]\,\Pr[\Theta]}{\Pr[Y]}$, but the exact posterior distribution is intractable because the evidence $\Pr[Y] = \int \Pr[\Theta, Y]\, d\Theta$ is intractable. Therefore, an approximate inference strategy is proposed. To facilitate this, we utilize variational inference (VI) [38], choosing a surrogate density from a parameterized family, denoted as $\mathcal{Q}$, to approximate the posterior. Compared with classical methods such as Markov chain Monte Carlo (MCMC) sampling, VI is typically faster [38]. In our experiments, VI is about 85 times faster for the original Bayesian PCA formulation [28], as shown in Appendix F.2.

3.1. Variational Bayesian Inference

Variational inference optimizes $Q$ by maximizing the lower bound $\mathcal{L}$ (equivalently, minimizing the KL divergence between the actual and surrogate distributions):
$\mathcal{L} = \mathbb{E}_Q\!\left[\ln \frac{\Pr[\Theta, Y]}{Q(\Theta)}\right] = \ln \Pr[Y] - \mathrm{KL}\big(Q(\Theta)\,\|\,\Pr[\Theta \mid Y]\big).$  (5)
The mean field variational family is used for $Q$. It simplifies the optimization by assuming that the surrogate posterior distributions are independent, allowing each variable in the posterior to be optimized independently: $Q_{\Theta} = \prod_i Q_{\Theta_i}$. The posterior for each variable is chosen to be conjugate, further simplifying the optimization. Thus, the posteriors of the component scores $Z$, the weighting matrix $W$, and the mean weights $\bar{Z}$ are normal distributions. Here, $W$ is vectorized via $\mathrm{vec}(W)$ without altering its normality assumption. Meanwhile, the posteriors of the precision variables of the noise ($\sigma^{-2}$), components $\alpha$, basis functions $\beta$, and mean weights $\eta$ are Gamma distributions:
$Q_Z(Z) = \prod_i Q_{Z_i}(Z_i) = \prod_i N(Z_i \mid \mu_{Z_i}, \Sigma_{Z_i})$  (6)
$Q_W(W) = N(\mathrm{vec}(W) \mid \mu_{\mathrm{vec}(W)}, \Sigma_{\mathrm{vec}(W)})$  (7)
$Q_{\bar{Z}}(\bar{Z}) = N(\bar{Z} \mid \mu_{\bar{Z}}, \Sigma_{\bar{Z}})$  (8)
$Q_{\sigma}(\sigma) = \Gamma(\sigma^{-2} \mid a_{\sigma}, b_{\sigma})$  (9)
$Q_{\alpha}(\alpha) = \prod_j Q_{\alpha_j} = \prod_j \Gamma(\alpha_j \mid a_{\alpha_j}, b_{\alpha_j})$  (10)
$Q_{\beta}(\beta) = \prod_k Q_{\beta_k} = \prod_k \Gamma(\beta_k \mid a_{\beta_k}, b_{\beta_k})$  (11)
$Q_{\eta}(\eta) = \Gamma(\eta \mid a_{\eta}, b_{\eta})$  (12)

Update Steps

In the mean field approximation using the surrogate posterior $Q_{\Theta} = \prod_i Q_{\Theta_i}$ conditioned on observations $Y$, the lower bound is maximized with respect to each unknown $\Theta_i$. With the conjugate prior, the optimal updates (denoted with “←”) make the moments of $Q_{\Theta_i}$ equal to the moments conditioned on the remaining parts of $Q_{\Theta}$ [38]:
$Q_{\Theta_i} \leftarrow \frac{\exp\!\big(\mathbb{E}_{Q/\Theta_i}[\ln \Pr[Y, \Theta]]\big)}{\int \exp\!\big(\mathbb{E}_{Q/\Theta_i}[\ln \Pr[Y, \Theta]]\big)\, d\Theta_i}$  (13)
From Equation (13), detailed update rules for each variable are presented subsequently, and the derivations of these formulas are given in the Appendix part.
Updates for the parameters of the posterior for the precision of components $Q_{\alpha_j}$, $j = 1{:}J$:
$a_{\alpha_j} \leftarrow a_0 + \frac{K}{2},$  (14)
$b_{\alpha_j} \leftarrow b_0 + \frac{1}{2} \sum_{k=1}^{K} \mathbb{E}_{Q/\alpha_j}\!\left[W_{jk}^2 \beta_k\right] = b_0 + \frac{1}{2} \sum_{k=1}^{K} \left(\Sigma_{W_{jk}} + \mu_{W_{jk}}^2\right) \frac{a_{\beta_k}}{b_{\beta_k}},$  (15)
where Equation (14) calculates the corrected degrees of freedom and Equation (15) calculates the corrected sum of squares. As $a_0$ and $b_0$ approach 0, the expectation of the precision $\alpha_j$, which is $\mathbb{E}_{Q_{\alpha_j}}[\alpha_j] = a_{\alpha_j}/b_{\alpha_j}$, is exactly the inverse of the empirical or sample variance.
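As a minimal illustration of how such conjugate updates look in practice, the following sketch implements Equations (14) and (15) with NumPy under the shape/rate Gamma parameterization; the variable names are ours, not the paper's.

import numpy as np

def update_alpha_j(a0, b0, mu_W_j, var_W_j, a_beta, b_beta):
    """Mean-field update of Q(alpha_j) = Gamma(a_alpha_j, b_alpha_j).
    mu_W_j, var_W_j : posterior means and marginal variances of W_{j,1:K};
    a_beta, b_beta  : Gamma parameters of Q(beta_k), so E[beta_k] = a_beta / b_beta."""
    K = mu_W_j.shape[0]
    a_alpha_j = a0 + 0.5 * K                         # corrected degrees of freedom
    # Corrected sum of squares: E[W_jk^2] * E[beta_k], summed over k.
    b_alpha_j = b0 + 0.5 * np.sum((var_W_j + mu_W_j ** 2) * (a_beta / b_beta))
    return a_alpha_j, b_alpha_j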
Updates for the parameters of the posterior of the precision of the mean weights $Q_{\eta}$:
$a_{\eta} \leftarrow a_0 + \frac{K}{2},$  (16)
$b_{\eta} \leftarrow b_0 + \frac{1}{2} \sum_{k=1}^{K} \mathbb{E}_{Q/\eta}\!\left[\bar{Z}_k^2 \beta_k\right] = b_0 + \frac{1}{2} \sum_{k=1}^{K} \left(\Sigma_{\bar{Z}_k} + \mu_{\bar{Z}_k}^2\right) \frac{a_{\beta_k}}{b_{\beta_k}}$  (17)
Updates for the parameters of the posterior of the precision of basis functions $Q_{\beta_k}$, $k = 1{:}K$:
$a_{\beta_k} \leftarrow a_0 + \frac{J + 1}{2},$  (18)
$b_{\beta_k} \leftarrow b_0 + \frac{1}{2}\, \mathbb{E}_{Q/\beta_k}\!\left[\bar{Z}_k^2 \eta + \sum_{j=1}^{J} W_{jk}^2 \alpha_j\right] = b_0 + \frac{1}{2} \left[\left(\Sigma_{\bar{Z}_{kk}} + \mu_{\bar{Z}_k}^2\right) \frac{a_{\eta}}{b_{\eta}} + \sum_{j=1}^{J} \left(\Sigma_{W_{jk}} + \mu_{W_{jk}}^2\right) \frac{a_{\alpha_j}}{b_{\alpha_j}}\right]$  (19)
Updates for the parameters of the posterior of the mean weights $Q_{\bar{Z}}$:
$\Sigma_{\bar{Z}} \leftarrow \left(\mathbb{E}_{Q/\bar{Z}}\!\left[\sigma^{-2} \sum_{i=1}^{P} \Psi_i + \eta\,\mathrm{diag}(\beta)\right]\right)^{-1} = \left(\frac{a_{\sigma}}{b_{\sigma}} \sum_{i=1}^{P} \Psi_i + \frac{a_{\eta}}{b_{\eta}}\,\mathrm{diag}\!\left(\frac{a_{\beta}}{b_{\beta}}\right)\right)^{-1},$  (20)
$\mu_{\bar{Z}} \leftarrow \mathbb{E}_{Q/\bar{Z}}\!\left[\sigma^{-2} \sum_{i=1}^{P} \left(Y_i - Z_i W \Phi_i\right) \Phi_i^{T}\right] \Sigma_{\bar{Z}} = \frac{a_{\sigma}}{b_{\sigma}} \sum_{i=1}^{P} \left(Y_i - \mu_{Z_i} \mu_W \Phi_i\right) \Phi_i^{T}\, \Sigma_{\bar{Z}},$  (21)
where $\mathrm{diag}(\beta)$ denotes the diagonal matrix with diagonal entries given by $\beta$. Equation (20) indicates that the eigenvectors of $\Sigma_{\bar{Z}}$ are solely determined by the sum of Gram matrices $\sum_{i=1}^{P} \Psi_i$, where $\Psi_i = \Phi_i \Phi_i^{T}$, while the eigenvalues of $\Sigma_{\bar{Z}}$ have a negative correlation with the scale of $\sum_{i=1}^{P} \Psi_i$, the prior $\eta\,\mathrm{diag}(\beta)$, and the data-dependent term $\sigma^{-2}$. This is sensible because, for instance, large noise would result in large uncertainty in $\bar{Z}$. In Equation (21), the data residuals, excluding the component scores, are projected into the $K$-dimensional space through the inner product with $\Phi_i$ and summed over all sample functions to calculate the mean weights.
Updates for the parameters of the posterior of the weights $Q_W$:
$\Sigma_{\mathrm{vec}(W)} \leftarrow \left(\mathbb{E}_{Q/W}\!\left[\sigma^{-2} \sum_{i=1}^{P} \Psi_i^{T} \otimes \left(Z_i^{T} Z_i\right) + \mathrm{diag}(\beta) \otimes \mathrm{diag}(\alpha)\right]\right)^{-1} = \left(\frac{a_{\sigma}}{b_{\sigma}} \sum_{i=1}^{P} \Psi_i^{T} \otimes \left(\mu_{Z_i}^{T} \mu_{Z_i} + \Sigma_{Z_i}\right) + \mathrm{diag}\!\left(\frac{a_{\beta}}{b_{\beta}}\right) \otimes \mathrm{diag}\!\left(\frac{a_{\alpha}}{b_{\alpha}}\right)\right)^{-1},$  (22)
$\mu_{\mathrm{vec}(W)} \leftarrow \mathbb{E}_{Q/W}\!\left[\sigma^{-2} \sum_{i=1}^{P} \mathrm{vec}\!\left(\left(\Phi_i \left(Y_i^{T} - \Phi_i^{T} \bar{Z}^{T}\right) Z_i\right)^{T}\right)^{T}\right] \Sigma_{\mathrm{vec}(W)} = \frac{a_{\sigma}}{b_{\sigma}} \sum_{i=1}^{P} \mathrm{vec}\!\left(\left(\Phi_i \left(Y_i^{T} - \Phi_i^{T} \mu_{\bar{Z}}^{T}\right) \mu_{Z_i}\right)^{T}\right)^{T} \Sigma_{\mathrm{vec}(W)}$  (23)
Equation (22) is similar to Equation (20), because it is correlated with $\Phi_i$, its prior $\mathrm{diag}(\beta) \otimes \mathrm{diag}(\alpha)$, and the data-dependent terms $\sigma^{-2}$ and $Z_i$. In Equation (23), the data residual excluding the mean function is used to estimate the expectation of $W$.
Updates for the parameters of the posterior of the component scores $Q_{Z_i}$:
$H_{ijk} \leftarrow \mathbb{E}_{Q/Z_i}\!\left[W_j \Psi_i W_k^{T}\right] = \mathrm{Tr}\!\left(\mathbb{E}_{Q/Z_i}\!\left[W_k^{T} W_j\right] \Psi_i\right) = \mathrm{Tr}\!\left(\left(\Sigma_{[W_k, W_j]} + \mu_{[W_j]}^{T} \mu_{[W_k]}\right) \Psi_i\right), \quad j = 1{:}J,\; k = 1{:}J,$  (24)
$\Sigma_{Z_i} \leftarrow \mathbb{E}_{Q/Z_i}\!\left[\sigma^{-2} W \Psi_i W^{T} + I\right]^{-1} = \left[\frac{a_{\sigma}}{b_{\sigma}} H_i + I\right]^{-1},$  (25)
$\mu_{Z_i} \leftarrow \mathbb{E}_{Q/Z_i}\!\left[\sigma^{-2} \left(Y_i - \bar{Z} \Phi_i\right) \Phi_i^{T} W^{T}\right] \Sigma_{Z_i} = \frac{a_{\sigma}}{b_{\sigma}} \left(Y_i - \mu_{\bar{Z}} \Phi_i\right) \Phi_i^{T} \left(\mu_W\right)^{T} \Sigma_{Z_i},$  (26)
where $H_i$ is a temporary variable denoting the Gram matrix of the weighted kernel functions $W \Phi_i$, and $\Sigma_{[W_k, W_j]}$ denotes the covariance between $W_k^{T}$ and $W_j$ in $Q$.
Updates for the parameters of the posterior of the noise $Q_{\sigma}$:
$a_{\sigma} \leftarrow a_0 + \frac{1}{2} \sum_i N_i,$  (27)
$b_{\sigma} \leftarrow b_0 + \frac{1}{2}\, \mathbb{E}_{Q/\sigma}\!\left[\sum_i \left\| Y_i - (Z_i W + \bar{Z}) \Phi_i \right\|_2^2\right] = b_0 + \frac{1}{2} \sum_i \Big( Y_i Y_i^{T} - 2\, Y_i \left(\mu_{Z_i} \mu_W \Phi_i\right)^{T} - 2\, Y_i \left(\mu_{\bar{Z}} \Phi_i\right)^{T} + 2\, \mu_{Z_i} \mu_W \Psi_i \left(\mu_{\bar{Z}}\right)^{T} + \mathrm{Tr}\!\left(\left(\Sigma_{\bar{Z}} + \left(\mu_{\bar{Z}}\right)^{T} \mu_{\bar{Z}}\right) \Psi_i\right) \Big) + \frac{1}{2}\, \mathrm{vec}\!\left(H^{T}\right)^{T} \sum_i \mathrm{vec}\!\left(\mathrm{vec}(\Psi_i)\, \mathrm{vec}\!\left(\Sigma_{Z_i} + \mu_{Z_i}^{T} \mu_{Z_i}\right)^{T}\right),$  (28)
where $H$ is a temporary variable that is updated by
$H_{j + kJ} \leftarrow \mathbb{E}_{Q/\sigma}\!\left[\mathrm{vec}\!\left(W_k^{T} W_j\right)\right]^{T} = \mathrm{vec}\!\left(\Sigma_{[W_k, W_j]} + \mu_{[W_j]}^{T} \mu_{[W_k]}\right)^{T}, \quad j = 1{:}J,\; k = 1{:}J$  (29)
Nearly noninformative (vague) priors, i.e., with almost zero $a_0$, $b_0$, introduce an inherent identifiability ambiguity in our formulation—specifically, in the product of the precision parameters $\alpha$, $\beta$, and $\eta$ (Equations (20) and (22)). In our model, scaling $\alpha$ and $\eta$ by a specific factor while inversely scaling $\beta$ leaves the product (and hence the lower bound in Equation (5)) unchanged. This inherent ambiguity can lead $\alpha$, $\beta$, and $\eta$ to converge to extreme values, thereby challenging the numerical stability during optimization. To mitigate this issue, we adopt a heuristic constraint to ensure that the smallest values of $\alpha$ and $\beta$ remain within one order of magnitude of each other. Specifically, we enforce $\left| \log_{10} \frac{\min(\alpha)}{\min(\beta)} \right| \leq 1$. If an update to any $\alpha_j$ or $\beta_k$ would violate this constraint, this particular update is skipped, and the rest of the parameters remain updated. This strategy does not alter the algorithm’s overall structure but stabilizes the optimization by preventing unnecessary flexibility in the precision parameters.
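A minimal sketch of this skip rule, assuming the constraint is the absolute log-ratio bound written above and that each candidate update is tested one at a time:

import numpy as np

def precision_update_allowed(alpha, beta, new_value, which, index):
    """Return True if tentatively applying the update keeps
    |log10(min(alpha) / min(beta))| <= 1; otherwise the update should be skipped."""
    alpha, beta = np.array(alpha, dtype=float), np.array(beta, dtype=float)
    if which == "alpha":
        alpha[index] = new_value
    else:
        beta[index] = new_value
    return abs(np.log10(alpha.min() / beta.min())) <= 1.0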

3.2. Scalable Update Strategy

The scalability of our algorithm so far is primarily challenged by the need to optimize the variational lower bound, $\mathcal{L}$, over $K$ basis functions. As indicated by Equation (22), the time complexity is $O(K^6)$ (or, alternatively, $O\!\big(K^2 P \max_i(N_i)\big)$, typically dominated by the former), which becomes prohibitive when $K$ is large. In practice, however, only a small subset of these basis functions is necessary for an accurate representation—those with non-negligible weights under our sparse prior.
To address this, we focus the updates on the subspace of active basis functions, denoted as $K^{(a)}$, which comprises only those functions with non-negligible weights. The remaining basis functions, whose influence is minimal, are held fixed during optimization. Furthermore, the number of active principal components is noted as $J^{(a)}$ and set equal to $K^{(a)}$, ensuring that the model spans the full range of possible ranks from 1 to $K^{(a)}$. Consequently, we optimize $Q^{(a)}$ using updates derived with regard to the objective $K^{(a)}$-dimensional lower bound $\mathcal{L}^{(a)}$ as an efficient surrogate of the full updates of $Q$ with regard to the full lower bound $\mathcal{L}$, using only $K^{(a)}$ active basis functions. Meanwhile, the active dimensionality of the model is adjusted dynamically during optimization by activating or deactivating basis functions based on their precision parameters. For clarity, variables associated with the active subspace are annotated with the superscript $(a)$ (e.g., $a_{\alpha_j}^{(a)} = a_0 + \frac{K^{(a)}}{2}$ versus $a_{\alpha_j} = a_0 + \frac{K}{2}$).

3.2.1. Implicit Factorization

For notational clarity, we reorder the rows and columns of our parameter matrices to separate active components from inactive ones. Specifically, we partition the matrices as follows:
$Z_i = \begin{bmatrix} Z_{iA} & Z_{iB} \end{bmatrix}, \quad \bar{Z} = \begin{bmatrix} \bar{Z}_A & \bar{Z}_B \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_A & \alpha_B \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_A & \beta_B \end{bmatrix}, \quad W = \begin{bmatrix} W_A & W_B \\ W_C & W_D \end{bmatrix}, \quad \Phi_i = \begin{bmatrix} \Phi_{iA} \\ \Phi_{iB} \end{bmatrix},$  (30)
Here, the subscript $A$ denotes variables belonging to the active subspace (i.e., those corresponding to the $K^{(a)}$ basis functions), and $B$, $C$, and $D$ denote the inactive components. Notably, the cross terms $W_B$ and $W_C$ involve both active and inactive components; these are updated implicitly, as proven in the Appendix part.
Following the strategy in [39], a basis function is deemed inactive if its precision exceeds a high threshold, i.e., $\alpha_j > \epsilon^{-1}$ and $\beta_k > \epsilon^{-1}$ as $\epsilon \to 0$. In the limit, the inactive basis functions decouple from the active ones, leading to the following mean field factorization:
$Q_W = Q_{W_A} Q_{W_B} Q_{W_C} Q_{W_D}$  (31)
$Q_{\bar{Z}} = Q_{\bar{Z}_A} Q_{\bar{Z}_B}$  (32)
$Q_{Z_i} = Q_{Z_{iA}} Q_{Z_{iB}}$  (33)
The factorization of $\alpha$ and $\beta$ was already obtained in Equations (10) and (11). These factorizations allow us to decouple the updates for the active subspace, with the proof provided in the Appendix part.
This implies that only the updates for $Q_{Z_{iA}}$, $Q_{W_A}$, $Q_{W_B}$, $Q_{W_C}$, $Q_{\bar{Z}_A}$, $Q_{\alpha_A}$, $Q_{\beta_A}$, $Q_{\sigma}$, and $Q_{\eta}$ are required, as shown in Figure 2. This strategy reduces the computational complexity from $O(K^6)$ to $O\big((K^{(a)})^6\big)$. Moreover, we initialize $W$ as an identity matrix and set the active $\alpha_A$ to all ones and the inactive $\alpha_B$ to infinity. In this way, we can initialize the remaining active dimensions $K^{(a)}$, e.g., $Z_{iA}$, using a modified, multi-instance version of a relevance vector machine [30], which performs fast analytical maximum-likelihood updates, as detailed in Appendix E in the Appendix part.

3.2.2. Low-Dimensional Lower Bound

This section shows how to optimize these active surrogates, e.g., $Q_{\bar{Z}_A}$, using updates of $Q^{(a)}$ with regard to the $K^{(a)}$-dimensional lower bound $\mathcal{L}^{(a)}$, which ultimately optimizes the full lower bound $\mathcal{L}$. To distinguish between the two, we denote the active surrogate posterior for the full model as $Q_{\bar{Z}_A}$ and that for the reduced $K^{(a)}$-dimensional model as $Q_{\bar{Z}_A}^{(a)}$. The active Gaussian surrogate posteriors are shared, e.g., $Q_{\bar{Z}_A} = Q_{\bar{Z}_A}^{(a)} = N(\bar{Z}_A \mid \mu_{\bar{Z}_A}, \Sigma_{\bar{Z}_A})$. This sharing implies that updating $Q^{(a)}$ is equivalent to updating $Q$, so we set the moments of the active distributions of the full model to match those of the reduced model. However, the surrogate posterior Gamma distributions differ between the two models. For example, the update of $\mathbb{E}_{Q^{(a)}}[\alpha_A]$ depends solely on $Q_{W_A}$, whereas $\mathbb{E}_{Q}[\alpha_A]$ also incorporates a cross term $Q_{W_B}$ corresponding to the remaining $(K - K^{(a)})$ dimensions. This difference is reflected in how the scale parameters depend on the number of active versus total basis functions, as shown in Equations (14), (16) and (18). Nonetheless, we prove that, in the limit $\epsilon \to 0$, the fixed point of the $K^{(a)}$-dimensional updates of the complete surrogate $Q$ equals that of the reduced surrogate $Q^{(a)}$. Consequently, the updates for $Q_{\alpha_A}$, $Q_{\beta_A}$, and $Q_{\eta}$ are derived directly from the expectations of the reduced model $Q_{\alpha_A}^{(a)}$, $Q_{\beta_A}^{(a)}$, $Q_{\eta}^{(a)}$:
$\mathbb{E}_{Q}[\alpha_A] \leftarrow \mathbb{E}_{Q^{(a)}}[\alpha_A]\!: \quad b_{\alpha_j} \leftarrow \frac{a_{\alpha_j}}{a_{\alpha_j}^{(a)}}\, b_{\alpha_j}^{(a)}, \quad j \in J^{(a)},$  (34)
$\mathbb{E}_{Q}[\beta_A] \leftarrow \mathbb{E}_{Q^{(a)}}[\beta_A]\!: \quad b_{\beta_k} \leftarrow \frac{a_{\beta_k}}{a_{\beta_k}^{(a)}}\, b_{\beta_k}^{(a)}, \quad k \in K^{(a)},$  (35)
$\mathbb{E}_{Q}[\eta] \leftarrow \mathbb{E}_{Q^{(a)}}[\eta]\!: \quad b_{\eta} \leftarrow \frac{a_{\eta}}{a_{\eta}^{(a)}}\, b_{\eta}^{(a)}$  (36)
These update Equations (34)–(36) are proven to optimize L in Theorems A1 and A2 in the Appendix part.

3.2.3. Heuristic for Activation of Basis Functions

The proposed method selects a relatively small set of basis functions from a potentially extensive set of possibilities. The computational costs are mitigated by recognizing that inactive basis functions do not interact with those that are active (with non-negligible weights). The inactive components are removed in the final model and excluded from the active subspace optimization. However, our method essentially optimizes over the full space, and thus the algorithm allows for the reactivation of inactive basis functions to ensure optimization across the entire subspace. Due to computational constraints, we consider functions for activation sequentially rather than all at once. Thus, we propose Algorithm 1 to introduce unseen basis functions into the active set using a selective strategy akin to the heuristic approach described in [39].
The algorithm selects the top function, $\phi_{B_k}$, from the inactive basis functions $\{\phi_{B_k}\}_k$ by gauging their correlations with the residuals and applying an angle-based threshold $\tau_{\mathrm{ang}}$ relative to the subspace of the active basis functions $\phi_A$. The correlation with the residuals for $\phi_{B_k}$ is measured by $\sum_i \big(\Phi_{iB_k} (Y_i - \mathbb{E}_{Q^{(a)}}[Z_{iA} W_A + \bar{Z}_A] \Phi_{iA})^{T}\big)^2$. The angle-based threshold ensures a meaningful distinction from the active functions. Next, the current active surrogate posterior is expanded by one dimension for the selected $\phi_{B_k}$, with its precision initialized at the numerical maximum $\tau_{\max}$. After this trial optimization, the function is retained if its precision falls below $\tau_{\max}$; otherwise, the algorithm terminates. For efficiency, the trial optimization replaces an existing function whose precision equals $\tau_{\max}$, if one is present.
Algorithm 1 Search for new basis functions to activate
   Sort the inactive basis functions $\{\phi_{B_k}\}_k$ by their correlation with the residuals.
   Filter through $\{\phi_{B_k}\}_k$, selecting the most correlated one as $\phi_{B_k}$.
   Copy the current active surrogate posterior $Q^{(a)}(\Theta)$ to $Q_k^{(a)}(\Theta)$.
   Expand the dimension in $Q_k^{(a)}(\Theta)$ for $\phi_{B_k}$.
   Optimize $Q_k^{(a)}(\Theta)$ for three iterations using the mean field approximation.
   if the expected precision is within the threshold then
          $Q^{(a)}(\Theta) \leftarrow Q_k^{(a)}(\Theta)$.
   end if
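A compact sketch of this search in Python, with the model-specific routines (expand, optimize, expected_precision, angle_ok) left as hypothetical stand-ins for the corresponding BSFDA steps:

import copy
import numpy as np

def try_activate_basis(Q_active, inactive_phis, residual_corr, angle_ok,
                       expand, optimize, expected_precision, tau_max, n_iter=3):
    """Trial activation of one inactive basis function.
    residual_corr[k] : correlation of inactive basis k with the current residuals."""
    # Rank inactive basis functions by residual correlation and keep only those
    # sufficiently distinct (angle-based test) from the active subspace.
    order = np.argsort(residual_corr)[::-1]
    candidates = [k for k in order if angle_ok(inactive_phis[k])]
    if not candidates:
        return Q_active
    k = candidates[0]

    # Trial expansion on a copy of the active surrogate, followed by a few VI iterations.
    Q_trial = expand(copy.deepcopy(Q_active), inactive_phis[k])
    for _ in range(n_iter):
        Q_trial = optimize(Q_trial)

    # Keep the new basis only if its expected precision stays below the threshold.
    if expected_precision(Q_trial, k) < tau_max:
        return Q_trial
    return Q_active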

4. Faster Variant

To enhance the computational efficiency of our primary algorithm, we introduce a faster variant, denoted as BSFDA Fast . This approach leverages conditional independence among the columns of W, enabling separate updates and thereby reducing the computational complexity. Similar strategies have been described in [18,28]. The model is defined with an introduced variable ζ i for the coefficient noise as follows:
$\theta_i = Z_i W + \zeta_i,$  (37)
$\zeta_{ik} \sim N\!\left(0, \varsigma_k^2\, \beta_k^{-1}\right).$  (38)
Similarly to the above, we assign a conjugate Gamma prior to the precision:
$\varsigma_k^{-2} \sim \Gamma(a_0, b_0).$  (39)
This formulation ensures that the columns of $W$ are conditionally independent, allowing the variational distribution to factorize as $Q_W = \prod_k Q_{W_{\cdot k}}$, thereby facilitating separate updates for each column. Consequently, the time complexity is reduced from $O\big((K^{(a)})^6\big)$ to $O\big((K^{(a)})^3\big)$.
To align with the original model, it is necessary for ζ and the associated variance parameters ς to approach zero. Having ς too high would allow the coefficient noise to corrupt the signal, biasing the model toward underestimating the true signal levels, particularly because this noise operates in the coefficient space where it introduces smooth, correlated variation (low entropy, like signals), which is harder to eliminate than high-frequency white noise (maximum entropy). Injecting the same amount of noise leads to the unbiased estimation of the signals but increases the estimation variance. Conversely, as ς decreases, the columns of W become dependent, violating the independence assumption inherent in variational inference. This dependency degrades the approximation quality and slows down the optimization process. Such dependency issues are well documented in both the variational inference and MCMC literature—with recent efforts addressing them via structured VI [38] or blocked/collapsed Gibbs sampling [40]. Empirical validations of this noise impact are conducted with both BSFDA Fast in Section 5.1 and with Bayesian PCA [28] in Appendix F.2 in the Appendix part.
To balance the trade-off between the optimization speed and accuracy, we adopt a strategy of gradually decreasing the values of $\varsigma_k$ during the optimization iterations. Specifically, we initialize $\varsigma_k$ with a relatively large value and linearly decrease it from $10^{-2}$ to $10^{-5}$ over the first half of the iterations. After reaching $10^{-5}$, $\varsigma_k$ is fixed for the remaining iterations. This gradual reduction ensures that the algorithm initially maintains its efficiency, benefiting from the reduced interdependency among the columns of $W$ to accelerate convergence, while later preserving the quality of the approximation by preventing the noise from obscuring the signal components. We unify the scales by scaling the basis functions so that $Z_i$ is standard normal and $W$ is an identity matrix at initialization. Empirical evaluations indicate that the strategy above is effective in most applications.
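A minimal sketch of this schedule; the linear-in-value interpolation over the first half of the iterations is our literal reading of the description above:

def varsigma_schedule(iteration, total_iters, start=1e-2, end=1e-5):
    """Coefficient-noise level for a given iteration: linear decrease from
    `start` to `end` over the first half of the iterations, then held fixed."""
    half = max(total_iters // 2, 1)
    if iteration >= half:
        return end
    return start + (iteration / half) * (end - start)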
By implementing these modifications, BSFDA Fast offers a practical solution that substantially accelerates the algorithm without a significant loss in accuracy, making it well suited for large-scale, high-dimensional functional data analysis.

5. Results

The proposed method demonstrates its effectiveness through simulations and applications to observed datasets.

5.1. Simulation Results

In the simulations, we evaluate the functional data analysis performance in terms of model selection, the estimated covariance accuracy, and extendability to multidimensional domains.
The model selection metric is the accuracy in estimating the number of principal components, which is the dimension of the compact subspace of signal variations. The configuration of the simulations in this section aligns with that established in [21], covering various scenarios. The simulated datasets are derived from a latent generative model with variables $Z_i$ of dimension $r$ for the $i$-th sample function and noise corruption with a standard deviation of $\sigma$: $Y_i = \sum_{j=1}^{r} Z_{ij} f_j(X_i) + g(X_i) + E_i$, $Z_{ij} \sim N(0, v_j)$, $E_i \sim N(0, \sigma^2 I)$, where $\{f_j\}_{j=1}^{r}$ represent eigenfunctions, $\{v_j\}_{j=1}^{r}$ are the eigenvalues, and $g : \mathbb{R} \to \mathbb{R}$ signifies the mean function. Here, we consider five scenarios.
Scenario 1: Data generated with $g(x) = 5(x - 0.6)^2$, $r = 3$, $v = (0.6, 0.3, 0.1)$, $\sigma^2 = 0.2$, $f_1(x) = 1$, $f_2(x) = \sqrt{2}\sin(2\pi x)$, $f_3(x) = \sqrt{2}\cos(2\pi x)$. Here, $v_3 < \sigma^2$, i.e., the noise has a larger variance than the smallest signal.
Scenario 2: Similar to Scenario 1, but the third eigenfunction is replaced by a function with higher frequencies, $f_3(x) = \sqrt{2}\cos(4\pi x)$, and the principal component scores follow a skewed Gaussian mixture model. Specifically, the $j$-th component score has a one-in-three probability of following a $N(2\sqrt{v_j}/3,\, v_j/3)$ distribution and a two-in-three probability of following $N(-\sqrt{v_j}/3,\, v_j)$, for $j = 1, 2, 3$.
Scenario 3: Data generated with $g(x) = 12.5(x - 0.5)^2 - 1.25$, $r = 3$, $v = (4, 2, 1)$, $\sigma^2 = 0.5$, $f_1(x) = 1$, $f_2(x) = \sqrt{2}\cos(2\pi x)$, $f_3(x) = \sqrt{2}\sin(4\pi x)$.
Scenario 4: Same as Scenario 3, but the component scores are generated from a Gaussian mixture model as in Scenario 2.
Scenario 5: Data from $g(x) = 12.5(x - 0.5)^2 - 1.25$, $r = 6$, $v = (4, 3.5, 3, 2.5, 2, 1.5)$, $\sigma^2 = 0.5$, $f_1(x) = 1$, $f_{2k}(x) = \sqrt{2}\sin(2k\pi x)$ for $k = 1, 2, 3$, $f_{2k+1}(x) = \sqrt{2}\cos(2k\pi x)$ for $k = 1, 2$, with the $j$-th component score following $N(0, v_j)$.
In each scenario, the simulations produce 200 sample functions. We investigate three cases with sparse, medium, and dense sampling by assigning the number of observations per sample function $N_i = \{5, 10, 50\}$. Each case in each scenario is repeated 200 times. The method’s performance is compared to that of fpca from [13], the AIC and BIC in the 2022 release of pace [11], the modified AIC and BIC in [21], and all the competing methods in [21]. For fpca, we set the candidate numbers of basis functions as [8, 10, 15, 20] and the candidate dimensions of the process as [2, 3, 4, 5] for Scenarios 1–4 and [4, 5, 6, 7, 8] for Scenario 5. The other parameters are all set to the defaults. Due to its consistent overestimation of the true number of components—likely resulting from interference by correlated noise and less sparse precision priors—we exclude LFRM [18] from further comparisons (see Appendix F.1.1 in the Appendix part).
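For reference, a minimal sketch of the Scenario 1 generator; the sampling locations are assumed uniform on [0, 1], which the text does not state explicitly, while the remaining settings follow the scenario description:

import numpy as np

def simulate_scenario1(n_funcs=200, n_obs=5, seed=0):
    """Generate sparse functional data following Scenario 1."""
    rng = np.random.default_rng(seed)
    v = np.array([0.6, 0.3, 0.1])                # eigenvalues
    sigma = np.sqrt(0.2)                         # noise standard deviation
    X, Y = [], []
    for _ in range(n_funcs):
        x = rng.uniform(0.0, 1.0, size=n_obs)    # assumed uniform sampling locations
        f = np.stack([np.ones_like(x),
                      np.sqrt(2) * np.sin(2 * np.pi * x),
                      np.sqrt(2) * np.cos(2 * np.pi * x)])   # eigenfunctions f_1..f_3
        z = rng.normal(scale=np.sqrt(v))         # component scores Z_ij ~ N(0, v_j)
        y = z @ f + 5.0 * (x - 0.6) ** 2 + sigma * rng.normal(size=n_obs)
        X.append(x)
        Y.append(y)
    return X, Y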
Each estimation uses 10 length scales of functions, which are selected using cross-validation and k-means clustering. This adaptive strategy allows the algorithm to choose distinct length scales at different locations of the definition domain, thereby accommodating varying smoothness characteristics inherent in complex functional data—a level of flexibility that is not possible when using a regular grid that forces a single length scale across the entire domain [18]. Sparse sampling in Scenario 5 uses five length scales to avoid overfitting. Figure 3 shows the length scales and centers of the selected kernel basis functions for three different numbers of sample points, N i , in a random repetition of Scenario 5. The results reveal that the selected length scales mainly concentrate around 0.07, with a few as high as 0.35—suggesting that the lower length scales capture finer, high-frequency variations. The higher length scales model the overall, lower-frequency quadratic mean structure and the constant baseline component. Furthermore, the estimated density functions of the selected length scales exhibit consistent patterns across the three sampling densities, and the method selects 9, 11, and 12 basis functions, respectively, demonstrating the algorithm’s adaptive fidelity and complexity based on the available observations. The Appendix part showcases the uncertainty evaluation in Figure A2.
Table 1, Table 2, Table 3, Table 4 and Table 5 show the results. The results for the first five methods are from [21]. Of 15 cases, the proposed BSFDA exhibits the highest accuracy in 12. In the other three cases, the accuracy of BSFDA is comparable to the best result and is always above 0.950. BSFDA Fast demonstrates performance comparable to that of BSFDA when applied to medium-density and dense datasets with significantly higher efficiency, which we detail in Figure 4. However, its efficacy diminishes with sparse data. This limitation arises because the parameter ς can bias model estimation in scenarios with insufficient data evidence, leading to an underestimation of the signal variance. Consequently, BSFDA Fast tends to underestimate the number of components, particularly those capturing nuanced variations, in the presence of sparse observations. Nonetheless, with adequate data, BSFDA Fast achieves performance on par with that of the original model.

5.1.1. Mean Squared Error in Covariance Operator

The mean squared error across $X_{\mathrm{grid}}$, a grid of 1000 index points,
$\frac{\left\| \mathrm{cov}(X_{\mathrm{grid}}, X_{\mathrm{grid}}) - \widehat{\mathrm{cov}}(X_{\mathrm{grid}}, X_{\mathrm{grid}}) \right\|_F^2}{1000 \times 1000},$  (40)
where $\|\cdot\|_F$ is the Frobenius norm, measures the accuracy of the estimated covariance. The quadratic measure of the error with the Frobenius norm for covariance estimators has been used by [41]. The methods compared include fpca of [13], pace of [11] with the AIC and BIC, and refund-sc of [12]. Only cases in Scenario 5 are used because of the time constraints (e.g., refund-sc takes 6 h for 20 repetitions with 50 points in Scenario 5). As the most challenging, Scenario 5 should provide the most compelling comparison. The results in Table 6 demonstrate that the proposed method is comparable to the best work in terms of the estimated covariance accuracy. Specifically, dense sampling becomes prohibitive for refund-sc. The results highlight the benefit of continuous formulations, as seen in both fpca and the proposed method, over the grid-based optimization in conventional methods. BSFDA Fast again performs comparably well when the data are adequate.
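The metric itself is straightforward to compute; a minimal sketch:

import numpy as np

def covariance_mse(cov_true, cov_est):
    """Squared Frobenius norm of the difference between the true and estimated
    covariance on a shared grid, divided by the number of grid entries."""
    diff = np.asarray(cov_true) - np.asarray(cov_est)
    return np.sum(diff ** 2) / diff.size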

5.1.2. Multidimensional Functional Data Simulation

A simulation experiment with a 4D index set reveals the proposed method’s advantages for high-dimensional data, where the gridding strategies of previous methods are impractical. The settings are as follows, with a length scale $\ell_s = 0.33$:
$Z_i \sim N(0, I) \in \mathbb{R}^{1 \times 3}$  (41)
$\phi_0(x) = (\pi \ell_s^2)^{-2} \exp\!\left( -\frac{1}{2} \left\| \frac{x - [0.5, 0.5, 0.5, 0.5]}{\ell_s} \right\|_2^2 \right)$  (42)
$\phi_1(x) = (\pi \ell_s^2)^{-2} \exp\!\left( -\frac{1}{2} \left\| \frac{x - [0.4, 0.4, 0.4, 0.4]}{\ell_s} \right\|_2^2 \right)$  (43)
$\phi_2(x) = (\pi \ell_s^2)^{-2} \exp\!\left( -\frac{1}{2} \left\| \frac{x - [0.6, 0.6, 0.6, 0.6]}{\ell_s} \right\|_2^2 \right)$  (44)
$y_i(x) = Z_{i0} \sqrt{0.6}\, \big(\phi_0(x) - \phi_1(x)\big) + Z_{i1} \sqrt{0.3}\, \phi_1(x) + Z_{i2} \sqrt{0.4}\, \phi_2(x)$  (45)
The observations include additive noise with a standard deviation of $4.472 \times 10^{-1}$. The cross-validation selects a length scale of 0.405. The estimated noise standard deviation is $4.637 \times 10^{-1}$. The proposed method correctly estimates the number of principal components as three and selects 31 basis functions. As shown in Figure 5, the eigenfunctions are correctly estimated. In addition, the estimated mean function is zero, which is accurate.
Next, we present a convergence comparison between BSFDA and BSFDA Fast under four schedules for the coefficient noise $\varsigma_k$. Specifically, we compare the default diminishing schedule from $10^{-2}$ to $10^{-5}$ with three fixed settings: $10^{-2}$, $10^{-3}$, and $10^{-5}$. We evaluate the covariance error and the discrepancy between the estimated/true dimensionality in one replicate of each sample density in Scenario 5 and the 4D simulation. For the 4D case, we adopt a default initial $\varsigma_k$ of $10^{-3}$. As illustrated in Figure 4, BSFDA Fast achieves comparable accuracy to BSFDA while converging significantly faster than BSFDA in terms of both covariance errors and component estimation for medium and densely sampled data. In the 4D case, BSFDA Fast converges in covariance estimation after approximately 10,000 s and in dimensionality after around 4000 s, compared to roughly 100,000 s and 13,000 s, respectively, for BSFDA. However, for sparse data, BSFDA Fast exhibits reduced estimation accuracy and underestimates the number of components by one. A similar decline in accuracy is observed in the 4D simulation when data sparsity is high. This limitation arises because the introduction of coefficient noise $\varsigma$ biases the model toward eliminating signals that are deemed insignificant. Moreover, when comparing the three fixed-$\varsigma_k$ variants of the fast algorithm, a clear trade-off emerges: a smaller $\varsigma_k$ reduces the overall error but slows down the optimization due to increased dependency among variables. These results collectively demonstrate the effectiveness of our chosen $\varsigma_k$ schedule in BSFDA Fast, as it balances both efficiency and accuracy.

5.2. Results on Public Datasets

The proposed method’s practicality was validated with two application datasets, CD4 and wind speed measurements.

5.2.1. CD4

CD4 data, a classical form of functional data, have received attention in [1,11,13]. CD4 cell counts gauge the immune system’s response to human immunodeficiency virus (HIV) infection, which leads to a progressive reduction in CD4 cell counts. The Multicenter AIDS Cohort Study (MACS) [42] provided the CD4 data. This dataset consists of CD4 percentages from 283 male human subjects that were HIV-positive, each with 1 to 14 repeated measurements over time in years. Subjects were scheduled for reevaluation at least semiannually. However, missed visits caused the sparse and uneven distribution of measurements. The proposed method used five length scales selected from cross-validation and k-means clustering. Finally, the model selected nine basis functions. Figure 6 displays the estimated mean function, eigenfunctions, and curves of the observations. The mean function reflects the overall decreasing tendency with the progression of the disease. The eigenfunctions are obtained by applying the singular value decomposition of the covariance operator that is discretized (for visualization purposes only) with a grid of 50 evenly spaced points over the whole timeline. The first eigenfunction is relatively flat and mainly captures the subject-specific average magnitude of the CD4 counts, consistent with the findings of [1,11,13]. The second eigenfunction captures the simple linear trend of the variations, as described in [13]. The third eigenfunction captures the piece-wise linear time trend with a breakpoint near 2.5 years from the baseline. Refs. [1,11] found similar eigenfunctions.
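A minimal sketch of this visualization step; the quadrature scaling used to convert the discretized operator's singular values and vectors into approximate eigenvalues and unit-norm eigenfunctions is our assumption:

import numpy as np

def eigenfunctions_from_covariance(cov_grid, grid):
    """Eigen-decompose a covariance operator discretized on an evenly spaced grid.
    cov_grid : (G, G) covariance matrix evaluated at the grid points;
    grid     : (G,) evenly spaced index points."""
    dx = grid[1] - grid[0]
    U, s, _ = np.linalg.svd(cov_grid)     # SVD of the symmetric covariance matrix
    eigenvalues = s * dx                  # quadrature scaling for the integral operator
    eigenfunctions = U / np.sqrt(dx)      # approximately unit L2 norm over the domain
    return eigenvalues, eigenfunctions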

5.2.2. Wind Speed

The wind speed data were collected from 110 locations across Utah’s Salt Lake Valley, with between 11 and 1440 measurements per location. The proposed method leverages 10 length scales selected from cross-validation and k-means clustering. Figure 7 illustrates the estimated mean function, curves of the observations, eigenfunctions, and covariance. The horizontal axis represents the seconds starting from 12:00 AM Greenwich Mean Time (GMT) on 15 June 2023, which corresponds to 6:00 PM in Salt Lake City. In Figure 7a, the estimated mean function depicts two pronounced peaks observed approximately at 8:00 PM and 6:00 AM, as well as two troughs around 12:00 AM and 12:00 PM. This pattern aligns with the diurnal cycle, particularly highlighting the thermal activity associated with sunset and sunrise. The peaks during sunset and sunrise are due to the interplay of topographical features, which result in specific breezes, such as the land breeze near the Great Salt Lake and the distinct mountain and valley breezes. The troughs, on the other hand, reflect moments when the atmosphere is at its most stable, with minimal thermal activity disrupting wind patterns. The complexity of the data is distilled and represented using 12 descriptors with 17 basis functions. As Figure 7b shows, the primary eigenfunction is relatively level, indicating that the most significant variation is the location-specific average magnitude. Its profile echoes the influence of sunrise and sunset observed in the mean function, with elevations around 7:00 PM and 5:00 AM and subdued patterns during other times, indicative of similar atmospheric stability. The estimated covariance in Figure 7c highlights variance peaks around 8 PM and 5 AM, as well as a strong correlation between these periods. This underscores the effects of location-specific topographic factors on the wind speed.

5.2.3. Modeling Large-Scale, Dynamic, Geospatial Data

Here, we demonstrate the scalability regarding both the size of the measurements and the dimensionality of our framework. For this, we apply it to the ARGO dataset, which consists of ocean temperature measurements from more than 4000 locations, at multiple depths and time points [37]. ARGO is a nearly global observing system for the ocean temperature, salinity, and other key variables via autonomous profiling floats. As of 2019, ARGO has generated over 338 gigabytes of data from 15,231 floats [37]. We focused on high-quality (“research” mode option in the database API) data from 1998 to 2024 for depths between 0 and 200 m in the open-access snapshot of Argo GDAC of 9 November 2024 [43]. The number of measurement points per year varies widely—from 38,931 to over 11 million, with 127 million in total. Figure 8 illustrates a global map of the sea surface temperature measurements from February 2021, highlighting the dataset’s extensive spatial coverage.  
In our modeling, each year’s data are treated as a single underlying function of four variables: latitude, longitude (on the spherical Earth), depth, and intra-annual time (modeled as a periodic variable). Note that the spatial data lie on a sphere and the time is a circle, assuming the periodicity of the time of the year. Our approach models these measurements holistically—without resorting to moving windows or submodeling—thereby preserving the continuous nature of the data and enabling the extraction of meaningful global, seasonal, and depth-dependent trends. Furthermore, the unique geospatial and temporal structure of the ARGO data, with spatial coordinates on a sphere and time exhibiting periodicity, necessitates specialized modeling techniques. Given that our model is 4D, the 4D kernel is defined as a product of the following kernels, following the design strategy for climatological data in [10]. The geospatial kernel on the sphere is a radial basis function (RBF) on geodesic distances. To ensure periodicity, the temporal kernel is an Exp-Sine-Squared kernel $k(x, x') = \exp\!\left( -\frac{2 \sin^2\!\left( \pi \left| x - x' \right| \right)}{\ell_s} \right)$, where $\ell_s$ is the length scale. For depth, we use a Gaussian kernel.
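A sketch of this product kernel for a single pair of points (latitude, longitude, depth, intra-annual time), using the length scales quoted below; the exact geodesic-distance computation and whether the Exp-Sine-Squared length scale enters squared are assumptions on our part:

import numpy as np

EARTH_RADIUS_KM = 6371.0

def geodesic_rbf(lat1, lon1, lat2, lon2, ls_km=2.0e3):
    """RBF kernel on the great-circle distance between two (lat, lon) points in degrees."""
    p1, l1, p2, l2 = map(np.radians, (lat1, lon1, lat2, lon2))
    cos_angle = np.clip(np.sin(p1) * np.sin(p2) +
                        np.cos(p1) * np.cos(p2) * np.cos(l1 - l2), -1.0, 1.0)
    dist_km = EARTH_RADIUS_KM * np.arccos(cos_angle)
    return np.exp(-0.5 * (dist_km / ls_km) ** 2)

def periodic_time_kernel(t1, t2, ls=3.0, period=1.0):
    """Exp-Sine-Squared kernel for intra-annual time (in years, period 1)."""
    return np.exp(-2.0 * np.sin(np.pi * abs(t1 - t2) / period) ** 2 / ls)

def depth_kernel(d1, d2, ls_m=70.0):
    """Gaussian kernel over depth (metres)."""
    return np.exp(-0.5 * ((d1 - d2) / ls_m) ** 2)

def argo_kernel(x1, x2):
    """Product kernel over x = (lat, lon, depth, time_in_years)."""
    return (geodesic_rbf(x1[0], x1[1], x2[0], x2[1]) *
            depth_kernel(x1[2], x2[2]) *
            periodic_time_kernel(x1[3], x2[3]))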
The numeric data (excluding metadata) as input to the model were approximately 4 GB. For length scale selection, we used Gaussian process regression on a small subset of 2000 randomly selected data in 2016 (medium size of measurements) for a cross-validated RMSE, which we optimized with a grid search. The specific length scales were set as follows: geodesic length scale of 2 × 10 3  km, depth length scale of 70 m, time length scale of 3, and periodicity of 1. For evaluation, we held out 10% of the depth profiles (a single round trip of a buoy from the surface to a depth at the same coordinate) from each year as testing data, following [44]. The total training set contained roughly 114 million points. Because the sample spacing was typically small relative to the selected length scales, we applied agglomerative clustering to 10,000 randomly chosen index points, reducing them to 2000 candidate basis functions. These candidate basis functions—precomputed for efficiency—took roughly 1.7 TB of memory. Computations were performed with 24 threads on a server equipped with 192 Intel® Xeon® Platinum 8360H CPUs @ 3.00 GHz and 3 TB RAM. Initialization was conducted using the modified RVM for 200 iterations for initial basis functions, using stochastic optimization with a 1000 batch size per year. Then, BSFDA Fast was executed for 10,000 iterations, where the heuristic to include new bases also used a 1000 batch size per year. With these computational strategies and heuristics, the entire modeling process was completed in 15.
The proposed approach selected 163 effective basis functions and condensed them into 16 principal components. The final model occupied merely 50 MB of storage. The interpolation yielded a root mean square error (RMSE) of 1.95 and an $R^2$ of 94.2% on the testing data, reflecting a reasonable balance between global dimension reduction and fidelity. The estimated white noise level was also 1.95, indicating that the training data adequately covered the underlying variability in the ARGO observations, and the final model was reasonably generalizable.
Figure 9 presents 2D visualizations of geospatial interpolations at three depths (in decibars, roughly meters) and a specific time (29 May 2021) around 1° S and 30° W, each with three views. We have chosen one measurement as the central point, denoted by the red circle, and selected a narrow window (±1 decibar, ±1 day) around this center. The cyan and fuchsia circles represent training and testing data, respectively, within this window. Their sizes indicate the distance along the unplotted dimensions (depth and time here), reflecting variations in these dimensions. The visualizations show that the temperatures are warmer near the equator and decrease with depth. The match between the interpolated values and actual measurements demonstrates consistency in capturing broad spatial and vertical variations.
Figure 10 complements this by illustrating interpolation in the depth–time slices while holding the geospatial coordinates fixed, focusing on mixed layer characteristics. The “mixed layer” refers to a region of nearly uniform temperature, which is crucial in understanding thermodynamic potential and nutrient cycling [45]. Here, the plot uses a window of 50 km to include actual measurements, and the circle sizes denote the geodesic distance from the chosen center.
We plot every fifth measurement vertically to reduce overlap and improve the clarity. Figure 10a uses the same center point, 1° S and 30° W, as in Figure 9, exhibiting a shallow mixed layer with pronounced vertical gradients. In contrast, Figure 10b adopts a center at a higher latitude, 49° N and 29° W, where the model reveals a deeper mixed layer. The temperature here remains relatively stable below the surface. The dominant variations are cyclic seasonal changes, which are warmer near the surface around September. As is shown, the vertical sequence of the center and the nearby testing sequence match the interpolation closely. These results confirm that the mid-latitudes exhibit a stronger seasonal cycle [45] and that BSFDA Fast accurately approximates the actual measurements.
To our knowledge, this is the first time that the ARGO dataset has been modeled with a full 4D principal component model with the correct domain topology. We incorporate the entire 27-year period, rather than shorter spans (e.g., 2004–2008 or 2007–2016) [44,46,47]. Instead of segmenting the dataset into localized spatiotemporal windows, we process the entire 4D domain (latitude, longitude, depth, and intra-annual time) in a single holistic framework. Previous studies were typically tailored to ARGO datasets and handled each depth, month, or spatial region separately, restricting the correlation estimates to limited windows (e.g., 1000 km and three months) while excluding data with large offsets [44,47]. In addition, they required repeated on-demand model fitting, which can hinder scalability. Our kernel-based framework, by contrast, is broadly applicable to general functional data, requiring only kernel definitions for the domain. Although global dimension reduction inevitably introduces some residual noise, the kernel-based design is extensible to finer spacing or multiple length scales if higher precision is needed. Furthermore, inference with our model is simply the evaluation of the 163 active basis functions weighted by the 16 principal components. Interpolation over a 300 × 300 grid takes only about two seconds. By contrast, previous methods based on Gaussian process regression require a weighted sum of all measured data within a certain window. The parametric representation also facilitates straightforward derivative and integral calculations, which are essential in investigating ocean temperature stratification and heat content [44]. In summary, the ARGO dataset provides an ideal testbed for our method, as it captures the dynamic behavior of high-dimensional geospatial data in a continuous framework. A more comprehensive study of ARGO is beyond the scope of this work. Nonetheless, the results here confirm the clear advantages of the proposed method for large-scale, high-dimensional functional data.
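To make the cost of inference concrete, the following is a hedged sketch of how interpolation with the final parametric model reduces to a single matrix product over the active basis functions weighted by the principal component scores. All shapes, the Gaussian kernel, and the placeholder parameters are illustrative assumptions rather than the fitted ARGO model.

```python
# Hedged sketch: evaluating the fitted low-rank model on a 300 x 300 grid.
import numpy as np

K_active, J = 163, 16                          # active basis functions / principal components
centers = np.random.randn(K_active, 4)         # placeholder kernel centers (x, y, depth, time)
lengthscales = np.array([1.0, 1.0, 1.0, 1.0])  # placeholder length scales

def phi(X):
    """Evaluate the K_active Gaussian kernel basis functions at index points X (N x 4)."""
    d2 = np.zeros((X.shape[0], centers.shape[0]))
    for m in range(X.shape[1]):
        d2 += ((X[:, m:m + 1] - centers[None, :, m]) / lengthscales[m]) ** 2
    return np.exp(-0.5 * d2)                   # N x K_active

W = np.random.randn(J, K_active)               # weights of basis functions in the eigenfunctions
z_bar = np.random.randn(K_active)              # mean-function coefficients
z_new = np.random.randn(J)                     # component scores of one sample function

g = np.linspace(-1.0, 1.0, 300)                # two plotted dimensions; the rest held fixed
gx, gy = np.meshgrid(g, g)
grid = np.column_stack([gx.ravel(), gy.ravel(),
                        np.zeros(gx.size), np.full(gx.size, 0.5)])

field = (phi(grid) @ (z_new @ W + z_bar)).reshape(300, 300)   # one pass over 90,000 query points
```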

6. Discussion and Conclusions

This paper proposes BSFDA, a novel framework for functional data analysis with irregular sampling, integrating model selection and scalability in one unique, coherent, and effective algorithm. Our extensive empirical studies, including both simulations and real-world applications, show that BSFDA offers superior covariance estimation accuracy with remarkable efficiency.
In terms of accuracy, our method excels in model selection, consistently achieving top-tier performance. The accuracy of the covariance operator estimation also rivals that of the best existing methodologies in the field. This shows that our approach can not only handle large and complex datasets but also deliver highly accurate and precise results. Our method's advantage over existing techniques is expected, owing to the inherently iterative coupling of data smoothing and covariance estimation in our approach.
In terms of scalability, our method demonstrates linear growth in time complexity with the size of the dataset, and, impressively, the computations are executed in a small, K ( a ) -dimensional subspace. This property ensures that, as the datasets grow larger and more complex, the performance of our model remains robust and efficient. Additionally, we introduce a faster variant, BSFDA Fast , which performs similarly to BSFDA on medium and dense datasets with a significantly reduced computational cost. This leap in efficiency enabled the full 4D functional modeling, for the first time, of a large-scale oceanic temperature dataset across 27 years (ARGO) [37]. Although BSFDA Fast can underestimate the signal strength under very sparse sampling, the vanilla BSFDA effectively complements and alleviates this issue.
Although the proposed framework proves effective in various real-world scenarios, it relies on a specific scheduling of the coefficient noise levels. This schedule transitions from a faster, more biased model to a slower, less biased one, balancing convergence speed against estimation accuracy. While empirical tests validate its advantages, there remain exploratory directions for further enhancement. For instance, properly incorporating structured variational inference or injecting artificial, compensatory noise into the observations could enable fast inference with reduced bias. However, it would be necessary to address the increased complexity in optimization and the increased variance in the estimation. Additionally, variational inference based on mean field approximations may underestimate the posterior variance, which is acceptable in many tasks [38] but leaves an open question when the independence assumptions are severely violated. Moreover, variational inference prioritizes computational efficiency over strict theoretical optimality [23,38], and the inherent infinite-dimensionality of fPCA further complicates any formal asymptotic analysis, even though our empirical studies demonstrate decreasing errors as the sampling density increases.
Looking ahead, it would be interesting to explore how extensions of regular PCA, such as simplified PCA and robust PCA [25], can be integrated within our proposed framework. Domains such as finance, in particular, include large-time-series datasets that often contain more outliers. These extensions will enhance the flexibility and robustness of our method, further improving its adaptability to various data conditions. In addition, we see potential in examining the extensions of functional PCA, such as time warping, dynamics, and manifold learning [1]. In particular, shape analysis emerges as a direct application of time warping. Such extensions would push the boundaries of what our proposed method could achieve, potentially enabling it to handle an even wider array of data structures and complexities.
In conclusion, our research findings affirm the proposed framework’s effectiveness and adaptability in advanced functional data analysis. Nonetheless, the method’s potential remains broad, and future work promises to widen its scope and refine its performance. By unifying sparse Bayesian learning, kernel-based expansions, and efficient variational inference, BSFDA offers a powerful foundation for large-scale, high-dimensional FDA challenges.

Author Contributions

Conceptualization, S.J. and R.W.; Methodology, W.T., S.J. and R.W.; Software, W.T.; Investigation, S.J. and R.W.; Writing—original draft, W.T.; Writing—review & editing, S.J. and R.W.; Visualization, W.T.; Supervision, S.J. and R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Institutes of Health: 5R01DE032366-02A1.

Data Availability Statement

The CD4 data are openly available at https://rdrr.io/cran/timereg/man/cd4.html (accessed on 17 April 2025). The wind speed data are private. The ARGO data are openly available at https://www.seanoe.org/data/00311/42182/ (accessed on 17 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. System of Notation

Table A1 summarizes the notation used in Section 2 and Section 4, providing a reference for the derivations. All vectors in the table are represented as row vectors.
Table A1. Symbol definitions in formulation.
Symbol | Meaning
y_i | i-th sample function
x ∈ R^M | one M-dimensional index
M | dimension of the index set
K | number of all basis functions
J | number of all components
P | number of sample functions
N_i | number of measurements of the i-th sample function
X_i ∈ R^{N_i × M} | index set of the i-th sample function
Y_i ∈ R^{N_i} | measurements of the i-th sample function
Z_i ∈ R^J | component scores of the i-th sample function
Z̄ ∈ R^K | coefficients of the basis functions in the mean function
E_i ∈ R^{N_i} | measurement errors of the i-th sample function
W ∈ R^{J × K} | weighting matrix of the basis functions in the eigenfunctions
W_{j·} ∈ R^K, W_{·k}^T ∈ R^J | j-th row and k-th column of W
\mathcal{K} | kernel function
α_j | scale parameter of W_{j·} (the j-th component)
β_k | scale parameter of W_{·k} (the k-th basis function)
σ | standard deviation of the measurement errors
η | communal scale parameter of Z̄
{φ_k : R^M → R}_{k=1}^{K} | the union of all the centered kernel functions
Φ_{ikj} = φ_k(X_{ij·}) ∈ R | value of the centered kernel function φ_k at X_{ij·}
θ_i ∈ R^K | coefficients of the i-th sample function
ζ_i ∈ R^K | coefficient noise of the i-th sample function
ς_k | scale parameter of the k-th coefficient noise
Table A2 summarizes the notation used in Section 3.
Table A2. Notation used in formulating the optimization.
Symbol | Meaning
Θ | all latent variables
Q_· | surrogate posterior distribution of variable ·
Q_{/·} | joint surrogate posterior distribution of all variables except ·
μ_·, Σ_· | mean and covariance of · in Q, e.g., μ_{vec(W)} ∈ R^{JK}, Σ_{vec(W)} ∈ R^{JK × JK}
a_·, b_· | shape and rate parameters of Q_·, e.g., a_{β_k}, b_{β_k}
E_Q[·] | expectation of variable · over the density Q
L | lower bound of the surrogate posterior Q with K basis functions
Ψ_i | Gram matrix of the kernel functions for the i-th sample function, Φ_i Φ_i^T
K^{(a)}, K^{(e)} | number of active/effective basis functions
J^{(a)}, J^{(e)} | number of active/effective components
P_i | log likelihood of Y_i in a multisample relevance vector machine
C_i | covariance of Y_i in a multisample relevance vector machine
S_i | posterior covariance of Z_i in a multisample relevance vector machine
P_{Z_i} | log likelihood of (Y_i, Z_i) in a multisample relevance vector machine
ϵ → 0 | the infinitesimal number
τ_· | threshold/tolerance of ·

Appendix B. Variational Update Formulae

As defined in Section 2, we consider the following priors and conditional distributions:
$\Pr[Y \mid Z, W, \bar{Z}, \sigma] = \prod_i \mathcal{N}\!\left(Y_i \mid (Z_i W + \bar{Z})\Phi_i,\ \sigma^2 I\right)$
$\Pr[Z] = \prod_i \mathcal{N}(Z_i \mid 0, I)$
$\Pr[W \mid \alpha, \beta] = \prod_{j,k} \mathcal{N}\!\left(W_{jk} \mid 0,\ \alpha_j^{-1}\beta_k^{-1}\right)$
$\Pr[\bar{Z}] = \prod_k \mathcal{N}\!\left(\bar{Z}_k \mid 0,\ \eta^{-1}\beta_k^{-1}\right)$
$\Pr[\sigma]\Pr[\alpha]\Pr[\beta]\Pr[\eta] = \Gamma(\sigma^{-2} \mid a_0, b_0)\,\prod_{j=1}^{J}\Gamma(\alpha_j \mid a_0, b_0)\,\prod_{k=1}^{K}\Gamma(\beta_k \mid a_0, b_0)\,\Gamma(\eta \mid a_0, b_0)$
For brevity, the joint posterior is shown with the vague Gamma prior parameters a 0 , b 0 , and the observation index X omitted:
Pr [ Z , W , Z ¯ , σ , α , β , η | X , Y , a 0 , b 0 ] = Pr [ Z , W , Z ¯ , σ , α , β , η | Y ] = Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ( Pr [ Y ] ) 1 Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] = Pr [ Y | Z , W , Z ¯ , σ ] Pr [ Z ] Pr [ W | α , β ] Pr [ Z ¯ | η , β ] Pr [ σ ] Pr [ α ] Pr [ β ] Pr [ η ]
Derivation of Equations (14) and (15)
According to Equation (13) and the posterior in Equation (A6), the update formulae for the surrogate distribution Q α j are
Q α j exp ( E Q / α j [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / α j [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d α j exp ( E Q / α j [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / α j [ ln ( Pr [ W | α , β ] Pr [ α ] ) ] ) exp E Q / α j 1 2 k = 1 K ln ( α j ) + W j k 2 α j β k + ( a 0 1 ) ln α j b 0 α j exp K 2 + a 0 1 ln ( α j ) α j 1 2 k = 1 K E Q / α j W j k 2 β k + b 0
where we have omitted terms of which α j is conditionally independent. By definition,
Q α j = exp ln ( Γ ( α j | a α j , b α j ) ) = exp ln b α j a α j Γ ( a α j ) α j a α j 1 exp ( b α j α j ) exp ( a α j 1 ) ln α j b α j α j
By equating Equations (A7) and (A8), the updates for Q α j are
$a_{\alpha_j} \leftarrow \frac{K}{2} + a_0$
$b_{\alpha_j} \leftarrow \frac{1}{2}\sum_{k=1}^{K} E_{Q_{/\alpha_j}}\!\left[W_{jk}^2 \beta_k\right] + b_0$
Derivation of Equations (16) and (17)
According to Equation (13) and the posterior Equation (A6), the update formulae for Q η are
Q η exp ( E Q / η [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / η [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d η exp ( E Q / η [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / η [ ln ( Pr [ Z ¯ | η , β ] Pr [ η ] ) ] ) exp E Q / η 1 2 k = 1 K ln ( η ) + Z ¯ k 2 η β k + ( a 0 1 ) ln η b 0 η exp K 2 + a 0 1 ln ( η ) η 1 2 k = 1 K E Q / η Z ¯ k 2 β k + b 0
where we have omitted terms of which η is conditionally independent. By definition,
Q η = exp ln ( Γ ( η | a η , b η ) ) = exp ln b η a η Γ ( a η ) η a η 1 exp ( b η η ) exp ( a η 1 ) ln η b η η
By equating Equations (A11) and (A12), the updates for Q η are
$a_{\eta} \leftarrow \frac{K}{2} + a_0$
$b_{\eta} \leftarrow \frac{1}{2}\sum_{k=1}^{K} E_{Q_{/\eta}}\!\left[\bar{Z}_k^2 \beta_k\right] + b_0$
Derivation of Equations (18) and (19)
According to Equation (13) and the posterior Equation (A6), the update formulae for Q β k are
Q β k exp ( E Q / β k [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / β k [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d β k exp ( E Q / β k [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / β k [ ln ( Pr [ W · k | α , β k ] Pr [ Z ¯ k | η , β k ] Pr [ β ] ) ] ) exp ( E Q / β k [ 1 2 j = 1 J ln ( α j β k ) + W j k 2 α j β k 1 2 ln ( η β k ) + Z ¯ k 2 η β k + ( a 0 1 ) ln β k b 0 β k ] ) exp ( J + 1 2 + a 0 1 ln ( β k ) β k 1 2 E Q / β k Z ¯ k 2 η + j = 1 J E Q / β k W j k 2 α j + b 0 )
where we have omitted terms of which β k is conditionally independent. By definition,
$Q_{\beta_k} = \exp\!\left(\ln\!\left(\Gamma(\beta_k \mid a_{\beta_k}, b_{\beta_k})\right)\right) = \exp\!\left(\ln\frac{b_{\beta_k}^{a_{\beta_k}}}{\Gamma(a_{\beta_k})}\,\beta_k^{a_{\beta_k}-1}\exp(-b_{\beta_k}\beta_k)\right) \propto \exp\!\left((a_{\beta_k}-1)\ln\beta_k - b_{\beta_k}\beta_k\right)$
By equating Equations (A15) and (A16), the updates for $Q_{\beta_k}$ are
$a_{\beta_k} \leftarrow \frac{J+1}{2} + a_0$
$b_{\beta_k} \leftarrow \frac{1}{2}\left(E_{Q_{/\beta_k}}\!\left[\bar{Z}_k^2\eta\right] + \sum_{j=1}^{J} E_{Q_{/\beta_k}}\!\left[W_{jk}^2\alpha_j\right]\right) + b_0$
Derivation of Equations (20) and (21)
According to Equations (13) and the posterior Equation (A6), the update formulae for Q Z ¯ are
Q Z ¯ exp ( E Q / Z ¯ [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / Z ¯ [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d Z ¯ exp ( E Q / Z ¯ [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / Z ¯ [ ln ( Pr [ Y | Z , W , Z ¯ , σ ] Pr [ Z ¯ | η , β ] ) ] ) exp ( E Q / Z ¯ [ 1 2 i = 1 P N i ln ( 2 π σ 2 ) + σ 2 | | Y ( Z i W + Z ¯ ) Φ i | | 2 2 1 2 k = 1 K ln ( 2 π η β k ) + Z ¯ k 2 η β k ] ) exp ( 1 2 ( Z ¯ E Q / Z ¯ σ 2 i = 1 P Ψ i + η diag ( β ) Z ¯ T 2 E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i W Φ i ) Φ i T Z ¯ T ) )
where we have omitted terms of which Z ¯ is conditionally independent. By definition,
Q Z ¯ = exp ln ( N ( Z ¯ | μ Z ¯ , Σ Z ¯ ) ) = exp 1 2 ln | 2 π Σ Z ¯ | + ( Z ¯ μ Z ¯ ) Σ Z ¯ 1 ( Z ¯ μ Z ¯ ) T exp 1 2 Z ¯ Σ Z ¯ 1 Z ¯ T 2 μ Z ¯ Σ Z ¯ 1 Z ¯ T
By equating Equations (A19) and (A20), the updates for Q Z ¯ are
$\Sigma_{\bar{Z}} \leftarrow \left(E_{Q_{/\bar{Z}}}\!\left[\sigma^{-2}\sum_{i=1}^{P}\Psi_i + \eta\,\mathrm{diag}(\beta)\right]\right)^{-1}$
$\mu_{\bar{Z}} \leftarrow E_{Q_{/\bar{Z}}}\!\left[\sigma^{-2}\sum_{i=1}^{P}\left(Y_i - E_{Q_{/\bar{Z}}}[Z_i W]\,\Phi_i\right)\Phi_i^T\right]\Sigma_{\bar{Z}}$
Derivation of Equations (22) and (23)
According to Equation (13) and the posterior Equation (A6), the update formulae for Q W are
Q W exp ( E Q / W [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / W [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d W exp ( E Q / W [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / W [ ln ( Pr [ Y | Z , W , Z ¯ , σ ] Pr [ W | α , β ] ) ] ) exp E Q / W 1 2 i = 1 P N i ln ( 2 π σ 2 ) + σ 2 | | Y ( Z i W + Z ¯ ) Φ i | | 2 2 exp ( E Q / W [ 1 2 ( ln | 2 π ( diag ( β ) diag ( α ) ) 1 | + vec ( W ) T ( diag ( β ) diag ( α ) ) vec ( W ) ) ] ) exp E Q / W 1 2 σ 2 i = 1 P 2 Y i Φ i T W T Z i T + 2 Z i W Ψ i Z ¯ T + Z i W Ψ i W T Z i T exp E Q / W 1 2 vec ( W ) T ( diag ( β ) diag ( α ) ) vec ( W ) exp 1 2 E Q / W 2 σ 2 i = 1 P vec Φ i ( Φ i T Z ¯ T Y i T ) Z i T T vec ( W ) exp 1 2 vec ( W ) T E Q / W σ 2 i = 1 P Ψ ( Z i T Z i ) + ( diag ( β ) diag ( α ) ) vec ( W )
where we have omitted terms of which W is conditionally independent. By definition,
Q W = exp ln ( N ( vec ( W ) | μ vec ( W ) , Σ vec ( W ) ) ) = exp ( 1 2 ( ln | 2 π Σ vec ( W ) | + ( vec ( W ) T μ vec ( W ) ) Σ vec ( W ) 1 ( vec ( W ) T μ vec ( W ) ) T ) ) exp 1 2 vec ( W ) T Σ vec ( W ) 1 vec ( W ) 2 μ vec ( W ) Σ vec ( W ) 1 vec ( W )
By equating Equations (A23) and (A24), the updates for Q W are
Σ vec ( W ) E Q / W σ 2 i = 1 P Ψ ( Z i T Z i ) + ( diag ( β ) diag ( α ) ) 1 μ vec ( W ) E Q / W σ 2 i = 1 P vec Φ i ( Φ i T Z ¯ T Y i T ) Z i T T Σ vec ( W )
Derivation of Equations (24)–(26)
According to Equation (13) and the posterior Equation (A6), the update formulae for Q Z i are
Q Z i exp ( E Q / Z i [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / Z i [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d Z i exp ( E Q / Z i [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / Z i [ ln ( Pr [ Y i | Z i , W , Z ¯ , σ ] Pr [ Z i ] ) ] ) exp E Q / Z i 1 2 N i ln ( 2 π σ 2 ) + σ 2 | | Y ( Z i W + Z ¯ ) Φ i | | 2 2 + J ln ( 2 π ) + Z i Z i T exp 1 2 Z i E Q / Z i σ 2 W Ψ i W T + I Z i T 2 E Q / Z i σ 2 ( Y i Z i Φ i ) Φ i T W T Z i T
where we have omitted terms of which Z i is conditionally independent. By definition,
Q Z i = exp ln ( N ( Z i | μ Z i , Σ Z i ) ) = exp 1 2 ln | 2 π Σ Z i | + ( Z i μ Z i ) Σ Z i 1 ( Z i μ Z i ) T exp 1 2 Z i Σ Z i 1 Z i T 2 μ Z i Σ Z i 1 Z i T
By equating Equations (A26) and (A27), the updates for $Q_{Z_i}$ are
$\Sigma_{Z_i} \leftarrow \left(E_{Q_{/Z_i}}\!\left[\sigma^{-2} W\Psi_i W^T + I\right]\right)^{-1}$
$\mu_{Z_i} \leftarrow E_{Q_{/Z_i}}\!\left[\sigma^{-2}\left(Y_i - \bar{Z}\Phi_i\right)\Phi_i^T W^T\right]\Sigma_{Z_i}$
Derivation of Equations (27)–(29)
According to Equation (13) and the posterior Equation (A6), the update formulae for Q σ are
Q σ exp ( E Q / σ [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp E Q / σ [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] d σ exp ( E Q / σ [ ln ( Pr [ Z , W , Z ¯ , σ , α , β , η , Y ] ) ] ) exp ( E Q / σ [ ln ( Pr [ Y i | Z i , W , Z ¯ , σ ] Pr [ σ ] ) ] ) exp E Q / σ 1 2 i = 1 P N i ln ( 2 π σ 2 ) + σ 2 | | Y ( Z i W + Z ¯ ) Φ i | | 2 2 exp E Q / σ ( a 0 1 ) ln σ 2 b 0 σ 2 exp ( a 0 + 1 2 i N i 1 ln ( σ 2 ) σ 2 b 0 + 1 2 E Q / σ [ i | | Y i ( Z i W + Z ¯ ) Φ i | | 2 2 ] )
where we have omitted terms of which σ is conditionally independent. By definition,
Q σ = exp ln ( Γ ( σ 2 | a σ , b σ ) ) = exp ln b σ a σ Γ ( a σ ) ( σ 2 ) a σ 1 exp ( b σ σ 2 ) exp ( a σ 1 ) ln σ 2 b σ σ 2
By equating Equations (A30) and (A31), the updates for Q σ are
$a_{\sigma} \leftarrow a_0 + \frac{1}{2}\sum_i N_i$
$b_{\sigma} \leftarrow b_0 + \frac{1}{2}\, E_{Q_{/\sigma}}\!\left[\sum_i \left\| Y_i - (Z_i W + \bar{Z})\Phi_i \right\|_2^2\right]$
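For concreteness, the following is a minimal sketch (under the mean-field factorization above, with illustrative array shapes) of one sweep over the Gamma-distributed scale parameters using the updates in Equations (A9)–(A10), (A13)–(A14), and (A17)–(A18). The second moments E[W_jk²] and E[Z̄_k²] are assumed to have been assembled from Q_W and Q_Z̄ beforehand.

```python
# Minimal sketch of one mean-field sweep over Q_alpha, Q_eta, and Q_beta.
import numpy as np

def update_scales(EW2, EZbar2, E_alpha, E_beta, E_eta, a0=0.0, b0=0.0):
    """EW2: (J, K) array of E_Q[W_jk^2]; EZbar2: (K,) array of E_Q[Zbar_k^2];
    E_alpha, E_beta, E_eta: current expectations of the scale parameters under Q."""
    J, K = EW2.shape

    a_alpha = np.full(J, K / 2.0 + a0)                        # Eq. (A9)
    b_alpha = 0.5 * EW2 @ E_beta + b0                         # Eq. (A10)

    a_eta = K / 2.0 + a0                                      # Eq. (A13)
    b_eta = 0.5 * EZbar2 @ E_beta + b0                        # Eq. (A14)

    a_beta = np.full(K, (J + 1) / 2.0 + a0)                   # Eq. (A17)
    b_beta = 0.5 * (EZbar2 * E_eta + EW2.T @ E_alpha) + b0    # Eq. (A18)

    return (a_alpha, b_alpha), (a_eta, b_eta), (a_beta, b_beta)
```

The precision expectations used elsewhere in the updates are then simply the shape/rate ratios, e.g., E_Q[α_j] = a_{α_j}/b_{α_j}.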

Appendix C. Scalable Update for BSFDA

Appendix C.1. Implicit Factorization

We initialize the inactive precision parameters as
$E_{Q_{\alpha_j}}[\alpha_j] = \epsilon^{-1}, \quad \forall\, j > J^{(a)}$
$E_{Q_{\beta_k}}[\beta_k] = \epsilon^{-1}, \quad \forall\, k > K^{(a)}$
Under these settings and subsequent variational updates (using Equations (A34) and (A35)), in the limit as ϵ 0 , the surrogate distributions satisfy
μ Z i B = 0 , Σ Z i B = ϵ I ( J J ( a ) ) , Σ Z i [ A , B ] = Σ Z i [ B , A ] T = 0
μ Z ¯ B = 0 , Σ Z ¯ [ B , B ] = ϵ I ( K K ( a ) ) , Σ Z ¯ [ A , B ] = Σ Z ¯ [ B , A ] T = 0
μ vec ( W ) B = 0 , μ vec ( W ) C = 0 , μ vec ( W ) D = 0 , Σ vec ( W ) [ B , B ] = ϵ I ( K ( a ) J K ( a ) J ( a ) ) , Σ vec ( W ) [ C , C ] = ϵ I ( J ( a ) K J ( a ) K ( a ) ) , Σ vec ( W ) [ D , D ] = ϵ I ( J K + J ( a ) K ( a ) J K ( a ) J ( a ) K ) ,
Σ vec ( W ) [ x , y ] = 0 , ( x , y ) { ( A , A ) , ( B , B ) , ( C , C ) , ( D , D ) }
For convenience, we initialize Q with the above properties.
Lemma A1.
If Q α j [ α j ] = ϵ , j J ( a ) and Q β k [ β k ] = ϵ , k K ( a ) , then the variational distribution over W factorizes as Q W = Q W A Q W B Q W C Q W D in the limit as ϵ 0 .
Proof. 
We express the distribution as
Q W = N ( vec ( W ) | μ vec ( W ) , Σ vec ( W ) ) = exp 1 2 ln | 2 π Σ vec ( W ) | + μ vec ( W ) Σ vec ( W ) 1 μ vec ( W ) T .
The factorization holds if the off-diagonal block matrices in Σ vec ( W ) , e.g., Σ [ W A , W B ] , are all zero, i.e., the blocks are mutually independent. Initially, this is ensured by the definition for the initial status in Equation (A38). Thus, we only need to show that the statement remains true after Q W is updated, i.e., after Equation (22) is applied with the inactive scale parameters Q α j [ α j ] and Q β k [ β k ] fixed at ϵ . First, we regard Σ [ W A B C ] , i.e., the covariance of the union of W A , W B , W C after vectorization, as one block. By the block matrix inversion formula, we obtain Σ [ W A B C , W D ] ϵ 2 0 and consequently Q W = Q W A B C Q W D . Next, we apply the block matrix inversion formula to Σ [ W A B C ] in Equation (22) and we obtain ( Σ [ W B , W A ] , Σ [ W B , W C ] , Σ [ W C , W A ] , Σ [ W C , W B ] ) ϵ 0 , yielding the desired factorization.    □
Lemma A2.
If Q β k [ β k ] = ϵ , k K ( a ) , then the implicit factorization Q Z ¯ = Q Z ¯ A Q Z ¯ B holds in the limit as ϵ 0 .
Proof. 
The proof is similar to the proof for Lemma A1. Because Q Z ¯ = N ( μ Z ¯ , Σ Z ¯ ) , we need only the off-diagonal block to be zero, i.e., Σ Z ¯ [ A , B ] = 0 . Initially, this is ensured by definition for the initial status in Equation (A37). Q Z ¯ is updated by Equation (20). Applying the block matrix inversion formula with the inactive Q β k [ β k ] , we obtain Σ Z ¯ [ A , B ] ϵ 0 , establishing the factorization.    □
Lemma A3.
If j J ( a ) or k K ( a ) , then E Q / Z i [ W k j W j k ] O ( ϵ ) , j = 1 : J , k = 1 : K in the limit as ϵ 0 .
Proof. 
For the initial status, apparently, the largest E Q / Z i [ W k j W j k ] is E Q / Z i [ W k j 2 ] = ϵ . Because either Q α j [ α j ] = ϵ or Q β k [ β k ] = ϵ , after updates from Equations (22) and (23) are applied, E Q / Z i [ W k j W j k ] = Σ [ W k j , W j k ] + μ vec ( W ) k j μ vec ( W ) j k O ( ϵ ) by the Woodbury matrix identity.    □
Lemma A4.
If E Q α j [ α j ] = a α j b α j = ϵ 1 , j J ( a ) , then the implicit factorization Q Z i = Q Z i A Q Z i B holds in the limit as ϵ 0 .
Proof. 
The proof is similar to the proof for Lemma A1. Because Q Z i = N ( μ Z i , Σ Z i ) , only Σ Z i [ A , B ] = 0 is needed. Initially, this is ensured by definition for the initial status Equation (A36). Q Z is updated by Equations (24) and (25). In Equation (24), when j J ( a ) or k K ( a ) , C i j k = Tr ( E Q / Z i [ W k · T W j · ] Ψ i ) = ( j , k ) E Q / Z i [ W k j W j k ] Ψ i k j O ( ϵ ) 0 applying Lemma A3. Applying the block matrix inversion formula to Equation (25), Σ Z i [ A B ] O ( ϵ ) 0 , thus proving the implicit factorization.    □

Appendix C.2. Scale Parameters

Here, we state the theorems that justify using the updating rules for $Q_{\alpha_j}^{(a)}$ based on $\mathcal{L}^{(a)}$ to update $Q_{\alpha_j}$ (and, similarly, $Q_{\beta_k}^{(a)}$ for $Q_{\beta_k}$ and $Q_{\eta}^{(a)}$ for $Q_{\eta}$), and show that doing so ultimately maximizes $\mathcal{L}$.
Lemma A5.
$\forall\, W_{jk} \in W_B \cup W_C$, i.e., either $j > J^{(a)}$ or $k > K^{(a)}$: after updating $Q_{W_B}$ and $Q_{W_C}$ by Equations (22) and (23), $E_Q[W_{jk}^2] = \frac{b_{\alpha_j} b_{\beta_k}}{a_{\alpha_j} a_{\beta_k}}$.
Proof. 
According to Equations (A34) and (A35), if ( j > J ( a ) ) or ( k > K ( a ) ) , either E Q [ α j ] = ϵ 1 or E Q [ β k ] = ϵ 1 , respectively.
In the limit as ϵ 0 , using Equation (22) and the block matrix inversion formula, we obtain
Σ W j k lim ϵ 0 E Q / W [ diag ( β ) diag ( α ) + σ 2 i ( Ψ i ) ( Z i T Z i ) ] 1 [ j + k M , j + k M ] = lim ϵ 0 E Q / W [ α j β k ] 1 + O ( ϵ 2 ) = E Q / W [ α j β k ] 1 = b α j b β k a α j a β k
In the limit as ϵ 0 and using Equation (23),
μ W j k lim ϵ 0 a σ b σ i vec Φ i ( μ Z ¯ Φ i Y i ) T μ Z i T T Σ vec ( W ) [ 1 , j + k M ] = a σ b σ i vec Φ i ( μ Z ¯ Φ i Y i ) T μ Z i T T Σ vec ( W ) · ( j + k M ) O ( ϵ )
Equation (A40) uses the fact that elements in Σ vec ( W ) · ( j + k M ) are all O ( ϵ ) based on the block matrix inversion formula. Thus,
lim ϵ 0 E Q [ W j k 2 ] = lim ϵ 0 Σ W j k + ( μ W j k ) 2 = lim ϵ 0 Σ W j k + O ( ϵ 2 ) = lim ϵ 0 b α j b β k a α j a β k + O ( ϵ 2 ) = b α j b β k a α j a β k
   □
Lemma A6.
$\forall\, k > K^{(a)}$: after updating $Q_{\bar{Z}_B}$ by Equations (20) and (21), $E_Q[\bar{Z}_k^2] = \frac{b_{\eta}}{a_{\eta}}\,\epsilon$.
Proof. 
If k > K ( a ) , E Q [ β k ] = ϵ 1 .
Then, using Equation (20) and the block matrix inversion formula, we have
Σ Z ¯ k k lim ϵ 0 E Q / Z ¯ i = 1 P σ 2 Ψ i + η diag ( β ) 1 k k = lim ϵ 0 i = 1 P a σ b σ Ψ i + a η b η diag ( a b ) 1 k k = lim ϵ 0 b η b β k a η a β k + O ( ϵ 2 ) = b η b β k a η a β k
Using Equation (21),
μ Z ¯ k lim ϵ 0 E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i W Φ i ) Φ i T Σ Z ¯ 1 k = lim ϵ 0 a σ b σ i = 1 P ( Y μ Z i μ W Φ i ) Φ i T Σ Z ¯ 1 k = lim ϵ 0 a σ b σ i = 1 P ( Y μ Z i μ W Φ i ) Φ i T Σ Z ¯ · k O ( ϵ )
Equation (A43) uses the fact that elements in Σ Z ¯ · k are all O ( ϵ ) .
E Q [ Z ¯ k 2 ] = lim ϵ 0 Σ Z ¯ k k + μ Z ¯ k 2 = lim ϵ 0 Σ Z ¯ k k + O ( ϵ 2 ) = lim ϵ 0 b η b β k a η a β k + O ( ϵ 2 ) = lim ϵ 0 b η a η ϵ + O ( ϵ 2 ) = b η a η ϵ
   □
Theorem A1.
j J ( a ) , updates of Q α j and Q W B will converge at E Q α j [ α j ] = E Q α j ( a ) [ α j ] given that E Q β k [ β k ] = E Q β k ( a ) [ β k ] , k K ( a ) , a 0 = b 0 = 0 and the conditions in Equations (A35) and (A36) are satisfied in the limit as ϵ 0 .
Proof. 
Assume that Q α j ( a ) has just been updated using Equations (14) and (15), i.e., j J ( a )
a α j ( a ) = a 0 + K ( a ) 2
b α j ( a ) = b 0 + 1 2 k = 1 K ( a ) E Q / α j ( a ) [ W j k 2 β k ] = b 0 + 1 2 k = 1 K ( a ) Σ W j k + μ W j k 2 a β k ( a ) b β k ( a )
The updates for Q α derived from L are
b α j b 0 + 1 2 k = 1 K Σ W j k + μ W j k 2 a β k b β k = b 0 + 1 2 k = 1 K ( a ) Σ W j k + μ W j k 2 a β k b β k + 1 2 k = K ( a ) + 1 K Σ W j k + μ W j k 2 a β k b β k = b α j ( a ) + 1 2 k = K ( a ) + 1 K Σ W j k + μ W j k 2 a β k b β k
This involves $W_{jk}$ with $k > K^{(a)}$, which therefore need to be kept updated. Applying Lemma A5 to Equation (A47), we obtain
b α j b α j ( a ) + 1 2 k = K ( a ) + 1 K b α j b β k a α j a β k a β k b β k = b α j ( a ) + 1 2 ( K K ( a ) ) b α j a α j
Applying Equation (A48) iteratively, we obtain a sequence of updates for $b_{\alpha_j}$. Solving
b α j = b α j ( a ) + 1 2 ( K K ( a ) ) b α j K 2
b α j = ( 1 1 2 ( K K ( a ) ) 2 K ) 1 b α j ( a ) = K K ( a ) b α j ( a )
Thus, we find that the sequence will converge at
b α j K K ( a ) b α j ( a )
As a result, E Q α j [ α j ] = a α j b α j = a α j ( a ) b α j ( a ) = E Q α j ( a ) [ α j ] .    □
Theorem A2.
k K ( a ) , updates of Q β k and Q W C will converge at E Q β k [ β k ] = E Q β k ( a ) [ β k ] given that E Q α j [ α j ] = E Q α j ( a ) [ α j ] , j J ( a ) , a 0 = b 0 = 0 and the conditions in Equations (A35) and (A36) are satisfied in the limit as ϵ 0 .
Proof. 
Assume that Q β k ( a ) has just been updated using Equations (18) and (19), i.e.,
a β k ( a ) = a 0 + K ( a ) + 1 2
b β k ( a ) b 0 + 1 2 E Q / β k ( a ) [ Z ¯ k 2 + j = 1 J ( a ) W j k 2 α j ] = b 0 + 1 2 Σ Z ¯ k k + μ Z ¯ k 2 + j = 1 J ( a ) Σ W j k + μ W j k 2 a α j ( a ) b α j ( a )
The update for Q β k derived from L is
b β k b 0 + 1 2 Σ Z ¯ k k + μ Z ¯ k 2 + j = 1 J Σ W j k + μ W j k 2 a α j b α j = b 0 + 1 2 Σ Z ¯ k k + μ Z ¯ k 2 + j = 1 J ( a ) Σ W j k + μ W j k 2 a α j b α j + 1 2 j = J ( a ) + 1 J Σ W j k + μ W j k 2 a α j b α j = b β k ( a ) + 1 2 j = J ( a ) + 1 J Σ W j k + μ W j k 2 a α j b α j
This involves $W_{jk}$ with $j > J^{(a)}$, which therefore need to be kept updated. Applying Lemma A5 to Equation (A54), we obtain
b β k b β k ( a ) + 1 2 j = J ( a ) + 1 J b β k b α j a β k a α j a α j b α j
= b β k ( a ) + 1 2 ( K K ( a ) ) b β k a β k
Applying Equation (A56) in an iterative manner, we will obtain a sequence of b β k . Solving
b β k = b β k ( a ) + 1 2 ( K K ( a ) ) b β k K + 1 2
b β k = ( 1 1 2 ( K K ( a ) ) 2 K + 1 ) 1 b β k ( a ) = K + 1 K ( a ) + 1 b β k ( a )
Thus, we find that the sequence will converge at
b β k K + 1 K ( a ) + 1 b β k ( a )
As a result, E Q [ β k ] = a β k b β k = a β k ( a ) b β k ( a ) = E Q ( a ) [ β k ] .    □
Theorem A3.
Updates of Q η and Q Z ¯ B will converge at E Q η [ η ] = E Q η ( a ) [ η ] given that E Q β k [ β k ] = E Q β k ( a ) [ β k ] , k K ( a ) , a 0 = b 0 = 0 and the conditions in Equations (A35) and (A36) are satisfied in the limit as ϵ 0 .
Proof. 
Assume that Q η ( a ) has just been updated using Equations (16) and (17), i.e.,
a η ( a ) a 0 + K ( a ) 2
b η ( a ) b 0 + 1 2 k = 1 K ( a ) E Q / η [ Z ¯ k 2 β k ] = b 0 + 1 2 k = 1 K ( a ) Σ Z ¯ k + μ Z ¯ k 2 a β k ( a ) b β k ( a )
The update for Q η derived from L is
b η b 0 + 1 2 k = 1 K Σ Z ¯ k + μ Z ¯ k 2 a β k b β k = b 0 + 1 2 k = 1 K ( a ) Σ Z ¯ k + μ Z ¯ k 2 a β k b β k + 1 2 k = K ( a ) + 1 K Σ Z ¯ k + μ Z ¯ k 2 a β k b β k = b η ( a ) + 1 2 k = K ( a ) + 1 K Σ Z ¯ k + μ Z ¯ k 2 a β k b β k
This involves $\bar{Z}_k$ with $k > K^{(a)}$, which therefore need to be kept updated. Applying Lemma A6 to Equation (A62), we obtain
b η b η ( a ) + 1 2 k = K ( a ) + 1 K b η b β k a η a β k a β k b β k
= b η ( a ) + 1 2 ( K K ( a ) ) b η a η
Applying Equation (A64) in an iterative manner, we will obtain a sequence of updates for b η . Solving
b η = b η ( a ) + 1 2 ( K K ( a ) ) b η K 2
b η = ( 1 1 2 ( K K ( a ) ) 2 K ) 1 b η ( a ) = K K ( a ) b η ( a )
Thus, we find that the sequence will converge at
b η K K ( a ) b η ( a )
As a result, E Q [ η ] = b η a η = b η ( a ) a η ( a ) = E Q ( a ) [ η ] .    □
In practice, due to limitations in numerical representation, we restrict the values so that the active precision parameter estimates do not truly reach infinity:
E Q α j [ α j ] τ max , j J ( a )
E Q β k [ β k ] τ max , k K ( a )

Appendix C.3. Weights and Noise

We next describe how to update Q Z A , Q Z ¯ A , Q W A , Q σ in a scalable manner, using computation in the K ( a ) -dimension subspace only.
Theorem A4.
L and L ( a ) share the same update rule for Z i A , i.e.,
H i A j k E Q / Z i [ W A j Φ i A Φ i A T W A k T ] = Tr ( E Q / Z i [ W A k T W A j ] Φ i A Φ i A T ) = T r Σ [ W A k , W A j ] + μ [ W A j ] T μ [ W A k ] Φ i A Φ i A T , j = 1 : J ( a ) , k = 1 : K ( a )
Σ Z i A E Q / Z i [ σ 2 W A Φ i A Φ i A T W A T + I ] 1 = [ a σ b σ H i A + I ] 1
μ i A E Q / Z i [ σ 2 ( Y i Z ¯ Φ i A ) Φ i A T W A T ] Σ Z i A = a σ b σ ( Y i μ Z ¯ A Φ i A ) Φ i A T ( μ W A ) T Σ Z i A
Proof. 
Applying Lemma A3 to Equation (24), we have
H i A j k T r Σ [ W A k , W A j ] + μ [ W A j ] T μ [ W A k ] Φ i A Φ i A T + O ( ϵ ) T r Σ [ W A k , W A j ] + μ [ W A j ] T μ [ W A k ] Φ i A Φ i A T , j = 1 : J ( a ) , k = 1 : K ( a )
Applying the block matrix inversion formula to Equation (25), we have
Σ Z i A E Q / Z i [ σ 2 W A Φ i A Φ i A T W A T + I ] 1 + O ( ϵ 2 ) E Q / Z i [ σ 2 W A Φ i A Φ i A T W A T + I ] 1 = [ a σ b σ H i A + I ] 1
Applying block matrix multiplication and Theorem A5 to Equation (26) conditioned on Equation (A37), we obtain
μ i A E Q / Z i [ σ 2 ( Y i Z ¯ Φ i A ) Φ i A T W A T ] Σ Z i A + O ( ϵ ) a σ b σ ( Y i μ Z ¯ A Φ i A ) Φ i A T ( μ W A ) T Σ Z i A
   □
Theorem A5.
L and L ( a ) share the same update rule for Z ¯ A , i.e.,
Σ Z ¯ A E Q / Z ¯ i = 1 P σ 2 Φ i A Φ i A T + η diag ( β A ) 1 = i = 1 P a σ b σ Φ i A Φ i A T + a η b η diag ( a A b A ) 1
μ Z ¯ A E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i A W A Φ i A ) Φ i A Σ Z ¯ A = a σ b σ i = 1 P ( Y μ i A μ W A Φ i A ) Φ i A Σ Z ¯ A
Proof. 
Applying the block matrix inversion formula to Equation (20) conditioned on E Q / Z ¯ [ β k ] = ϵ 1 , k > K ( a ) , we have
Σ Z ¯ A E Q / Z ¯ i = 1 P σ 2 Φ i A Φ i A T + η diag ( β A ) 1 + O ( ϵ ) E Q / Z ¯ i = 1 P σ 2 Φ i A Φ i A T + η diag ( β A ) 1
Applying block matrix multiplication and Theorem A5 to Equation (21) conditioned on Equation (A36), we have
μ Z ¯ A E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i A W A Φ i A ) Φ i A Σ Z ¯ A + O ( ϵ ) E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i A W A Φ i A ) Φ i A Σ Z ¯ A
   □
Theorem A6.
L and L ( a ) share the same update rule for W A , i.e.,
Σ vec ( W ) E Q / W σ 2 i = 1 P ( Φ i A T Φ i A ) ( Z i A T Z i A ) + diag ( β A ) diag ( α A ) 1 = a σ b σ i = 1 P ( Φ i A T Φ i A ) ( μ i A T μ i A + Σ Z i A ) + diag a A b A diag c A d A 1
μ vec ( W ) E Q / W σ 2 i = 1 P vec Φ i A ( Φ i A T Z ¯ A T Y i T ) Z i A T T Σ vec ( W ) A = a σ b σ i = 1 P vec Φ i A ( Φ i A T μ Z ¯ A T Y i T ) μ i A T T Σ vec ( W ) A
Proof. 
Applying the block matrix inversion formula to Equation (22) conditioned on E Q / Z ¯ [ β k ] = ϵ 1 , k > K ( a ) and E Q / Z ¯ [ α j ] = ϵ 1 , j > J ( a ) , we have
Σ vec ( W ) E Q / W σ 2 i = 1 P ( Φ i A T Φ i A ) ( Z i A T Z i A ) + diag ( β A ) diag ( α A ) 1 + O ( ϵ ) E Q / W σ 2 i = 1 P ( Φ i A T Φ i A ) ( Z i A T Z i A ) + diag ( β A ) diag ( α A ) 1
Applying block matrix multiplication and Theorem A5 to Equation (23) conditioned on Equations (A36) and (A37), we obtain
μ vec ( W ) E Q / W σ 2 i = 1 P vec Φ i A ( Φ i A T Z ¯ A T Y i T ) Z i A T T Σ vec ( W ) A + O ( ϵ ) E Q / W σ 2 i = 1 P vec Φ i A ( Φ i A T Z ¯ A T Y i T ) Z i A T T Σ vec ( W ) A
   □
Theorem A7.
L and L ( a ) share the same update rule for σ, i.e.,
a σ a 0 + 1 2 i N i
b σ b 0 + 1 2 E Q / σ i | | Y i ( Z i A W A + Z ¯ A ) Φ i A | | 2 2 = b 0 + 1 2 i ( Y i Y i T 2 Y i μ i A μ W A Φ i A T 2 Y i μ Z ¯ A Φ i A T + 2 μ i A μ W A Φ i A Φ i A T ( μ Z ¯ A ) T + Tr Σ Z ¯ A + ( μ Z ¯ A ) T μ Z ¯ A Φ i A Φ i A T ) + 1 2 vec ( G A T ) T i vec vec ( Φ i A Φ i A T ) vec ( Σ Z i A + μ i A T μ i A ) T ,
where
G A ( j + k M ) E Q / σ vec ( W A k W A j T ) T = vec ( Σ [ W A k , W A j ] + μ vec ( W ) [ W A j ] T μ vec ( W ) [ W A k ] ) T , j = 1 : K ( a ) , k = 1 : K ( a )
Proof. 
Applying block matrix multiplication and Theorem A5 to Equation (28) conditioned on Equations (A36) and (A37), we have
b σ b 0 + 1 2 E Q / σ i | | Y i ( Z i A W A + Z ¯ A ) Φ i A | | 2 2 + O ( ϵ ) b 0 + 1 2 E Q / σ i | | Y i ( Z i A W A + Z ¯ A ) Φ i A | | 2 2
   □
We have thus shown that $Q_{Z_{iA}}$, $Q_{W_A}$, $Q_{\bar{Z}_A}$, and $Q_{\sigma}$ share the same update formulas as those derived from the low-dimensional lower bound: $Q_{Z_{iA}}^{(a)}$, $Q_{W_A}^{(a)}$, $Q_{\bar{Z}_A}^{(a)}$, and $Q_{\sigma}^{(a)}$. Thus, in practice, it suffices to update $Q^{(a)}$; we can then increase $K^{(a)}$ by including new basis functions. This process provably maximizes the full lower bound $\mathcal{L}$ implicitly through $Q$.

Appendix C.4. Low-Dimensional Lower Bound

We now have updating formulas for the parameters in the active subspace. Q Z i A is updated by Equations (A70)–(A72). Q W A is updated by Equations (A80) and (A81). Q Z ¯ A is updated by Equations (A76) and (A77). Q α A , Q β A , Q η are updated by Theorems A1–A3, with the companion of implicit updates of Q W B , Q W C . Q σ is updated by Equations (A84)–(A86). All the updating rules are identical to those derived from the low-dimensional lower bound L ( a ) with K ( a ) basis functions. Therefore, in practice, all we need is to optimize L ( a ) , with time complexity of O K ( a ) 2 max K ( a ) 4 , P max i ( N i ) , as described in Theorem A8, and then check if a new basis function should be included in the model.
For numerical stability, we scale ϕ , b such that min k ( E Q β [ β k ] ) = min k ( c k d k ) = 1 at the beginning of Algorithm A1.
Theorem A8.
The lower bound L can be optimized using Algorithm A1 with time complexity of O K ( a ) 2 max K ( a ) 4 , P max i ( N i ) .
Proof. 
The proof is a consequence of Theorems A1–A7.    □
Algorithm A1 Variational inference
Require:  μ Z i , Σ Z i , μ vec ( W ) , Σ vec ( W ) , μ Z ¯ , Σ Z ¯ , a σ , b σ , a α j , b α j , a β k , b β k , i , j , k           ▹Multisample RVM
    while True do
         L ( a ) lowerbound ( Q ( a ) )
        Update Q ( a ) with respect to all parameters using mean field approximation
        if  lowerbound ( Q ( a ) ) L ( a ) < τ con  then                                      ▹Insignificant increase
            Search for new basis functions using Algorithm 1
            if not found then                                                                                       ▹Converged
                 break
            end if
        end if
        Remove dimensions whose precision estimates have reached the maximum value τ_max
    end while
    Get rid of dimensions with α_j / min_j(α_j) ≥ τ_eff or β_k / min_k(β_k) ≥ τ_eff

Appendix D. Scalable Update for BSFDAFast

For brevity, we denote the covariance of $\zeta_i$ as $S$, i.e., $\zeta_i \sim \mathcal{N}(0, S)$. $S$ is diagonal with $S_{kk} = \varsigma_k^2\beta_k^{-1}$. The variational update formulas are as follows:
Σ θ i E Q / θ i Φ i Φ i T σ 2 + S 1 1
μ θ i E Q / θ i ( Y i Z ¯ Φ i ) Φ i T σ 2 + Z i W S 1 Σ θ i
Σ Z i E Q / Z i W S 1 W T + I 1
μ Z i E Q / Z i θ i S 1 W T Σ Z i
a ς k a 0 + P 2
b ς k E Q / ς k b 0 + 1 2 i ( θ i k Z i W · k ) 2 β k
Σ W · k E Q / W · k ς k 2 β k i Z i T Z i + β k diag ( α ) 1
μ W · k E Q / W · k ς k 2 β k i ( θ i k Z i ) Σ W · k
a β k a 0 + 1 + K + P 2
b β k E Q / β k b 0 + 1 2 Z ¯ k 2 η + j ( W j k 2 α j ) + i ( θ i k Z i W · k ) 2 ς k 2
Σ Z ¯ E Q / Z ¯ σ 2 i ( Φ i Φ i T ) + η diag ( β ) 1
μ Z ¯ E Q / Z ¯ σ 2 i ( Y i θ i Φ i ) Φ i T Σ Z ¯
a σ 2 a 0 + 1 2 i N i
b σ 2 E Q / σ b 0 + 1 2 i | | Y i ( Z ¯ + θ i ) Φ i | | 2 2
Notably, the columns of W become conditionally independent upon the introduction of the slack variable θ, akin to the strategy described in [18,28]. The surrogate posterior of W then factorizes over the columns, so only the covariance of each column needs to be computed separately, instead of that of the entire W at once. Thus, the computational complexity is significantly reduced. This factorization is introduced on top of the existing factorizations; thus, the low-dimensional optimization strategy of BSFDA also applies to BSFDA Fast. A minimal sketch of the resulting column-wise update is given below.
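The sketch below (illustrative shapes and mean-field expectations, not the exact implementation) shows the column-wise update of $Q_{W_{\cdot k}}$ implied by the $\Sigma_{W_{\cdot k}}$ and $\mu_{W_{\cdot k}}$ formulas above: each column requires only a J × J covariance rather than one JK × JK covariance for the whole matrix.

```python
# Hedged sketch of the per-column W update in BSFDA_Fast.
import numpy as np

def update_W_columns(E_ZtZ_sum, E_theta_Z, E_alpha, E_beta, E_inv_varsigma2):
    """E_ZtZ_sum: (J, J) = sum_i E[Z_i^T Z_i];  E_theta_Z: (K, J), row k = sum_i E[theta_ik] E[Z_i];
    E_alpha: (J,), E_beta and E_inv_varsigma2: (K,) mean-field expectations."""
    J = E_ZtZ_sum.shape[0]
    K = E_beta.shape[0]
    mu_W = np.zeros((J, K))
    Sigma_W = np.zeros((K, J, J))
    for k in range(K):
        # Precision of column k: E[varsigma_k^-2 beta_k] sum_i E[Z_i^T Z_i] + E[beta_k] diag(E[alpha])
        prec = E_inv_varsigma2[k] * E_beta[k] * E_ZtZ_sum + E_beta[k] * np.diag(E_alpha)
        Sigma_W[k] = np.linalg.inv(prec)                       # J x J instead of JK x JK
        mu_W[:, k] = Sigma_W[k] @ (E_inv_varsigma2[k] * E_beta[k] * E_theta_Z[k])
    return mu_W, Sigma_W
```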

Appendix E. Fast Initialization

In order to efficiently obtain a good initialization for the unknowns to be estimated, e.g., Z , Z ¯ , β and σ , we approximate the model so that we can adopt a fast strategy maximizing the marginal likelihood using direct differentiation, which is similar to [39]. This initial β serves to select the K ( a ) basis functions to start with.
We introduce Z ˜ for easier marginalization:
Y i = Z ˜ i Φ i + E i
Z ˜ i k = Z i k β k + Z ¯ k N ( Z ¯ k , β k 1 )
Z ¯ k N ( 0 , β k 1 )
β k Γ ( β k | a 0 , b 0 ) , σ 2 Γ ( σ 2 | a 0 , b 0 )
E i N ( 0 , σ 2 I )
The approximated probabilistic graphical model is shown in Figure A1.
Figure A1. Probabilistic graphical model for the simplified model.

Appendix E.1. Maximum Likelihood Estimation

We apply maximum likelihood estimation for point estimates of Z ¯ , β , σ .
Z ¯ * , β * , σ * arg min Z ¯ , β , σ P ,
where P = ln Pr [ Y | Z ¯ , β , σ ] . Conditioned on these estimates, we can calculate the expectation of Z.
Optimization of  β , Z ¯
We set the differentiation to zero, i.e., P β k = 0 , and obtain
β k θ k , if θ k > 0 , otherwise
where
θ k = i = 1 P s i k 2 i = 1 P ( q i k 2 s i k )
q i k = Φ i k C i / k 1 ( Y Z ¯ Φ i ) T
s i k = Φ i k C i / k 1 Φ i k T
C i / k = C i Φ i k T β k 1 Φ i k
C i = Φ i T diag ( β 1 ) Φ i + σ 2 I = k = 1 K Φ i k T β k 1 Φ i k + σ 2 I
We differentiate P with respect to Z ¯ and zero the derivative, i.e., P Z ¯ = 0 , to obtain
Z ¯ i = 1 P Y i C i 1 Φ i T i = 1 P ( Φ i C i 1 Φ i T ) 1
We approximate Equation (A114) by Z ¯ A i = 1 P Y i C i 1 Φ i A T i = 1 P ( Φ i A C i 1 Φ i A T ) 1 and Z ¯ B 0 . This way, we can apply the update with only the active basis functions.
Optimization of σ :
We use EM to optimize σ . In the E-step,
E Q Z ˜ [ Z ˜ i ] σ 2 ( Y i Z ¯ Φ i ) Φ i T S i
E Q Z ˜ [ Z ˜ i T Z ˜ i ] S i + E Q Z ˜ [ Z ˜ i ] T E Q Z ˜ [ Z ˜ i ] ,
where S i = ( Ψ i σ 2 + diag ( β ) ) 1 .
In the M-step,
σ 2 i = 1 P E Q Z ˜ | | Y i ( Z ˜ i + Z ¯ ) Φ i | | 2 2 i = 1 P N i = i = 1 P ( Y i Z ¯ Φ i ) ( Y i Z ¯ Φ i 2 Φ i T E Q Z ˜ Z ˜ i T ) T + Tr ( E Q Z ˜ Z ˜ i T Z ˜ i Ψ i ) i = 1 P N i
The optimization iterates between the E-step Equations (A115) and (A116) and the M-step Equation (A117).
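A minimal sketch of one such E-step/M-step pass (Equations (A115)–(A117)) is given below, assuming NumPy and illustrative list-of-arrays inputs; it is not the exact implementation, which operates only on the active basis functions.

```python
# Hedged sketch of one EM pass for sigma^2 in the simplified initialization model.
import numpy as np

def em_sigma_step(Y, Phi, Zbar, beta, sigma2):
    """Y: list of (N_i,) measurement vectors; Phi: list of (K, N_i) basis evaluations;
    Zbar: (K,) mean coefficients; beta: (K,) precisions; sigma2: current noise variance."""
    sq_err, n_total = 0.0, 0
    for Y_i, Phi_i in zip(Y, Phi):
        Psi_i = Phi_i @ Phi_i.T                                    # K x K Gram matrix
        S_i = np.linalg.inv(Psi_i / sigma2 + np.diag(beta))        # posterior covariance of Ztilde_i
        resid = Y_i - Zbar @ Phi_i
        EZ_i = (resid @ Phi_i.T) @ S_i / sigma2                    # E-step, Eq. (A115)
        EZZ_i = S_i + np.outer(EZ_i, EZ_i)                         # E-step, Eq. (A116)
        sq_err += resid @ (resid - 2 * EZ_i @ Phi_i) + np.trace(EZZ_i @ Psi_i)
        n_total += Y_i.size
    return sq_err / n_total                                        # M-step, Eq. (A117): new sigma^2
```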
In practice, we need only $E_{Q_{\tilde{Z}}}[\tilde{Z}_{iA}]$, $E_{Q_{\tilde{Z}}}[\tilde{Z}_{iA}^T \tilde{Z}_{iA}]$, and $S_{iA}$, and they can be calculated using the $K^{(a)}$ active basis functions. Thus, similarly to [39], all computations involve only the active basis functions, making the procedure computationally efficient. This is described in Algorithm A2.
P = ln Pr [ Y | Z ¯ , β , σ ] = i = 1 P ln Pr [ Y i | Z ¯ , β , σ ] = i = 1 P P i
P i = Pr [ Y i | Z ˜ i , Z ¯ , β , σ ] Pr [ Z ˜ i | Z ¯ , β ] d Z ˜ i = E Z ˜ i N ( Z ¯ , β ) [ Pr [ Y i | Z ˜ i , σ ] ] = N ( Y i | Z ¯ Φ i , C i )
Pr [ Y i | Z ˜ i , Z ¯ , β , σ ] = N ( Y i | ( Z ˜ i + Z ¯ ) Φ i , σ 2 I )
Algorithm A2 Multisample relevance vector machine
while  P is not converged do
     k a random number that satisfies CosSim ( ϕ k , ϕ A ) τ sim                      ▹ O ( K ( a ) 3 )
     s i k Φ i k C i / k 1 Φ i k T , i                                 ▹ Sparsity factor. O P max ( K ( a ) 3 , max i ( N i ) 2 )
     q i k Φ i k C i / k 1 ( Y Z ¯ A Φ i A ) T , i                                                                  ▹Quality factor. O P max i ( N i ) max ( K ( a ) , max i ( N i ) )
     θ k i = 1 P s i k 2 i = 1 P ( q i k 2 s i k )
    if  θ > 0  then
         β k θ k                                                                                          ▹Precision is finite
    else
         β k                                        ▹Precision is infinite and the dimension is removed
    end if
     Φ i A All Φ i k that has β k < , i
     C i = β k < Φ i k T β k 1 Φ i k + σ 2 I , i
     Z ¯ A i = 1 P Y i C i 1 Φ i A T i = 1 P ( Φ i A C i 1 Φ i A T ) 1                                                                      ▹ O P K ( a ) max K ( a ) , max i ( N i ) 2
     S i A ( Φ i A Φ i A T σ 2 + diag ( β A ) ) 1 , i
     E Q Z ˜ [ Z ˜ i A ] σ 2 ( Y i Z ¯ A Φ i A ) Φ i A T S i A , i                                    ▹ O ( P K ( a ) 2 max i ( N i ) )
     E Q Z ˜ [ Z ˜ i A T Z ˜ i A ] S i A + E Q Z ˜ [ Z ˜ i A ] T E Q Z ˜ [ Z ˜ i A ] , i
     σ i = 1 P ( Y i Z ¯ A Φ i A ) ( Y i Z ¯ A Φ i A 2 Φ i A T E Q Z ˜ Z ˜ i A T ) T + Tr ( E Q Z ˜ Z ˜ i A T Z ˜ i A Φ i A Φ i A T ) i = 1 P N i
end while
We apply Sylvester’s determinant theorem to Equation (A113) and obtain
| C i | = | C i / k | | I + β k 1 Φ i k T C i / k 1 Φ i k |
We apply the Woodbury matrix identity to Equation (A113) and obtain
C i 1 = C i / k 1 C i / k 1 Φ i k T ( β k + Φ i k C i / k 1 Φ i k T ) 1 Φ i k C i / k 1
We first expand P i
P i = ln Pr [ Y i | Z ¯ , σ , β ] = 1 2 i ln | 2 π C i | + ( Y i Z ¯ Φ i ) C i 1 ( Y i Z ¯ Φ i ) T = 1 2 ( N i ln ( 2 π ) + ln | C i / k | + ln | I + β k 1 Φ i k C i / k 1 Φ i k T | + ( Y i Z ¯ Φ i ) C i 1 ( Y i Z ¯ Φ i ) T ( β k + Φ i k C i / k 1 Φ i k T ) 1 | | Φ i k C i / k 1 ( Y Z ¯ Φ i ) T | | 2 2 ) = P i / k + 1 2 ( ln β k ln | β k + s i k | + q i k 2 β k + s i k )
where we plug in Equations (A121) and (A122) and define q i k , s i k in a similar way to [39]. The sparsity factor s i k can be seen to be a measure of the extent to which the basis function ϕ k overlaps those already present in the model under the measurements at index set X i . The quality factor q i k is a measure of the alignment with the error of the model at X i with this basis function excluded. Because we are representing the mean functions using only the active basis functions, i.e., Z ¯ k = 0 when β k = , Equation (A110) uses only the K active basis functions. Similarly, Equation (A111) only uses the K active basis functions.
For computational efficiency, we can compute s i k , q i k using S i k = Φ i k C i 1 Φ i k T , Q i k = Φ i k C i 1 ( Y Z ¯ Φ i ) T in a similar way to [39] as follows:
s i k = Φ i k C i / k 1 Φ i k T = S i k + Φ i k C i / k 1 Φ i k T ( β k + Φ i k C i / k 1 Φ i k T ) 1 Φ i k C i / k 1 Φ i k T = S i k + s i k ( β k + s i k ) 1 s i k s i k = β k + s i k β k S i k
s i k ( 1 1 β k S i k ) 1 S i k = β k S i k β k S i k
q i k = Φ i k C i / k 1 ( Y Z ¯ Φ i ) T = Q i k + Φ i k C i / k 1 Φ i k T ( β k + Φ i k C i / k 1 Φ i k T ) 1 Φ i k C i / k 1 ( Y Z ¯ Φ i ) T = Q i k + s i k ( β k + s i k ) 1 q i k
q i k β k + s i k β k Q i k = β k Q i k β k S i k
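As a concrete, hedged illustration of the bookkeeping derived above, the following sketch computes the sparsity and quality factors for one candidate basis function and one sample function from the quantities $S_{ik}$ and $Q_{ik}$. It assumes NumPy, and the names and shapes are illustrative assumptions rather than the actual implementation.

```python
# Hedged sketch of the sparsity factor s_ik and quality factor q_ik used in the
# multisample RVM initialization; derived from S_ik and Q_ik as shown above.
import numpy as np

def sq_factors(Phi_ik, C_i_inv, residual_i, beta_k):
    """Phi_ik: (N_i,) values of basis function k at the i-th index set;
    C_i_inv: (N_i, N_i) inverse of the marginal covariance C_i (with phi_k included);
    residual_i: (N_i,) Y_i - Zbar @ Phi_i; beta_k: current precision (np.inf if excluded)."""
    S_ik = Phi_ik @ C_i_inv @ Phi_ik          # overlap of phi_k with the current model
    Q_ik = Phi_ik @ C_i_inv @ residual_i      # alignment of phi_k with the current residual
    if np.isinf(beta_k):                      # phi_k not in the model: C_{i/k} = C_i
        return S_ik, Q_ik
    s_ik = beta_k * S_ik / (beta_k - S_ik)
    q_ik = beta_k * Q_ik / (beta_k - S_ik)
    return s_ik, q_ik
```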

Appendix E.2. Optimization of β, Z ¯

Derivation of Equation (A108)
We differentiate P with respect to β k
P β k = i = 1 P 1 2 β k 1 | β k + s i k | 1 q i k 2 ( β k + s i k ) 2 = 1 2 β k 1 i = 1 P ( β k + s i k ) 2 ( β k ( s i k q i k 2 ) + s i k 2 )
We further adopt the approximation $s_{1k} \approx s_{2k} \approx \cdots \approx s_{Pk}$. Because $s_{ik}$ is a discrete measure of the overlap between the basis functions, it should remain roughly invariant across different sampling grids $X_i$, provided that the numbers of measurements are adequate and similar. Alternatively, an expectation maximization scheme could be applied, which is guaranteed to increase the likelihood $\mathcal{P}$ in each iteration until convergence. However, we opt for this gradient-based scheme with approximations because of its speed advantage in obtaining a reasonable initialization. With this approximation, we set the approximated derivative to zero:
P β k 1 2 β k 1 ( β k + s 1 k ) 2 i = 1 P ( β k ( s i k q i k 2 ) + s i k 2 ) = 0
β k θ k = i = 1 P s i k 2 i = 1 P ( q i k 2 s i k )
Because β k is a scale parameter, we need β k > 0 . Consequently, the optimal value for β k to maximize P depends on the sign of θ k . When θ k > 0 , the maximum of P is achieved at β k = θ k .
On the other hand, when $\theta_k \leq 0$, $\mathcal{P}$ is monotonically increasing with respect to $\beta_k$, so we let $\beta_k \to \infty$ in order to maximize $\mathcal{P}$.
More intuitively, Equation (A130) can be regarded as a weighted combination of the per-sample-function estimates of $\beta_k$, automatically assigning more weight to sample functions with more measurements. Therefore, this optimization strategy should provide reasonable estimates even when the sample functions have different numbers of measurements.
Derivation of Equation (A114)
We differentiate P with respect to Z ¯ and zero the derivative to obtain
P Z ¯ = 1 2 i = 1 P 2 Y i C i Φ i T + 2 Z ¯ Φ i C i 1 Φ T = 0
Z ¯ i = 1 P Y i C i 1 Φ i T i = 1 P ( Φ i C i 1 Φ i T ) 1

Appendix E.3. Optimization of σ

Derivation of Equations (A115) and (A116)
We use an expectation maximization strategy with latent variables $\tilde{Z}_i$, similar to that used in [30]. It introduces a surrogate function, the expected complete-data log likelihood $E_{Q_{\tilde{Z}}}[\mathcal{P}_{\tilde{Z}}]$, which is easier to optimize and, in theory, ultimately maximizes $\mathcal{P}$.
For the E-step, we calculate the posterior of Z ˜ i .
ln Pr [ Z ˜ i | Y i , Z ¯ , σ , β ] = ln Pr [ Y i | Z ˜ i , Z ¯ , σ , β ] Pr [ Z ˜ i | β ] Pr [ Y i | Z ¯ , σ , β ]
1 2 Z ˜ i ( Ψ i σ 2 + diag ( β ) ) Z ˜ i T 2 σ 2 ( Y I Z ¯ Φ i ) Φ i T Z ˜ i T
Therefore,
E Q Z ˜ [ Z ˜ i ] σ 2 ( Y i Z ¯ Φ i ) Φ i T S i
E Q Z ˜ [ Z ˜ i T Z ˜ i ] S i + E Q Z ˜ [ Z ˜ i ] T E Q Z ˜ [ Z ˜ i ]
where
S i = ( Ψ i σ 2 + diag ( β ) ) 1
Derivation of Equation (A117)
In the M-step, we maximize $E_{Q_{\tilde{Z}}}[\mathcal{P}_{\tilde{Z}}]$ conditioned on $Q_{\tilde{Z}}$ with respect to $\sigma^2$,
P Z ˜ = i = 1 P ln Pr [ Y i , Z ˜ i | Z ¯ , σ , β ] = i = 1 P ln ( Pr [ Y i | Z ˜ i , Z ¯ , σ ] Pr [ Z ˜ i | β ] ) = 1 2 i = 1 P ( N i ln ( 2 π σ 2 ) + σ 2 | | Y i ( Z ˜ i + Z ¯ ) Φ i | | 2 2 + k = 1 K ln ( 2 π β k 1 ) + Tr ( Z ˜ i diag ( β ) Z ˜ i T ) )
We differentiate $E_{Q_{\tilde{Z}}}[\mathcal{P}_{\tilde{Z}}]$ with respect to $\sigma^2$ and set the derivative to zero:
E Q Z ˜ [ P Z ˜ ] σ 2 = E Q Z ˜ 1 2 i = 1 P N i σ 2 σ 4 | | Y i ( Z ˜ i + Z ¯ ) Φ i | | 2 2 = 0
σ 2 i = 1 P E Q Z ˜ | | Y i ( Z ˜ i + Z ¯ ) Φ i | | 2 2 i = 1 P N i = i = 1 P ( Y i Z ¯ Φ i ) ( Y i Z ¯ Φ i 2 Φ i T E Q Z ˜ Z ˜ i T ) T + Tr ( E Q Z ˜ Z ˜ i T Z ˜ i Ψ i ) i = 1 P N i

Appendix F. Experiments

Appendix F.1. Benchmark Simulation

Figure A2 presents the application of the proposed BSFDA to the simulation benchmark (Scenario 1) outlined in [21]. Even though prior analyses have utilized this benchmark, the current experimental configuration is specifically adapted to highlight the method’s capacity for uncertainty quantification. The experimental design consists of 20 functional observations, each sampled at either three points (with a 20% probability) or 10 points (with an 80% probability), determined via random assignment. The number of sampled functions is decreased from 200 to 20 to underscore the effect and estimation of uncertainties. The actual white noise standard deviation is 0.4472, whereas the estimated standard deviation is 0.4839. The component number is also correctly estimated as 3. The figure depicts the true underlying function, the discrete observational data, and the corresponding functional estimates, accompanied by their respective 95% truncated uncertainty intervals.
Notably, the uncertainty associated with sparsely sampled functions exhibits substantial inflation in regions devoid of observations. In contrast, in sampled regions, the uncertainty aligns closely with that of densely sampled functions, approximating twice the standard deviation of the white noise. Additionally, the uncertainty bounds for the estimated mean function are presented, demonstrating reduced variability relative to individual function estimates.
Table A3. Distributions of the estimated component number r̂ for Scenario 1 (r = 3).
N_i | r̂ | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA_Fast
5 | ≤1 | 0.000 | 0.000 | 0.155 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5 | =2 | 0.008 | 0.405 | 0.335 | 0.565 | 0.215 | 0.000 | 0.000 | 0.000 | 0.000 | 0.985
5 | =3 | 0.000 | 0.580 | 0.380 | 0.410 | 0.735 | 0.650 | 0.880 | 0.645 | 0.995 | 0.015
5 | =4 | 0.121 | 0.010 | 0.115 | 0.010 | 0.045 | 0.335 | 0.120 | 0.235 | 0.005 | 0.000
5 | ≥5 | 0.870 | 0.005 | 0.015 | 0.010 | 0.005 | 0.015 | 0.000 | 0.120 | 0.000 | 0.000
10 | ≤1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =2 | 0.000 | 0.005 | 0.040 | 0.040 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.075
10 | =3 | 0.000 | 0.980 | 0.670 | 0.955 | 0.985 | 0.880 | 0.920 | 0.645 | 1.000 | 0.910
10 | =4 | 0.000 | 0.015 | 0.255 | 0.000 | 0.010 | 0.120 | 0.080 | 0.235 | 0.000 | 0.015
10 | ≥5 | 1.000 | 0.000 | 0.035 | 0.005 | 0.000 | 0.000 | 0.000 | 0.120 | 0.000 | 0.000
50 | ≤1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =3 | 0.000 | 1.000 | 0.830 | 1.000 | 1.000 | 1.000 | 1.000 | 0.890 | 0.980 | 0.945
50 | =4 | 0.000 | 0.000 | 0.150 | 0.000 | 0.000 | 0.000 | 0.000 | 0.060 | 0.020 | 0.050
50 | ≥5 | 1.000 | 0.000 | 0.020 | 0.000 | 0.000 | 0.000 | 0.000 | 0.050 | 0.000 | 0.005
Table A4. Distributions of the estimated component number r̂ for Scenario 2 (r = 3).
N_i | r̂ | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA_Fast
5 | ≤1 | 0.000 | 0.000 | 0.230 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5 | =2 | 0.000 | 0.205 | 0.395 | 0.000 | 0.140 | 0.050 | 0.075 | 0.000 | 0.000 | 0.960
5 | =3 | 0.005 | 0.630 | 0.245 | 0.375 | 0.605 | 0.570 | 0.620 | 0.475 | 1.000 | 0.040
5 | =4 | 0.125 | 0.155 | 0.110 | 0.440 | 0.210 | 0.345 | 0.275 | 0.350 | 0.000 | 0.000
5 | ≥5 | 0.870 | 0.010 | 0.020 | 0.185 | 0.045 | 0.035 | 0.030 | 0.175 | 0.000 | 0.000
10 | ≤1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =2 | 0.000 | 0.000 | 0.170 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =3 | 0.000 | 0.710 | 0.665 | 0.570 | 0.805 | 0.825 | 0.850 | 0.640 | 1.000 | 0.995
10 | =4 | 0.005 | 0.260 | 0.135 | 0.355 | 0.185 | 0.175 | 0.150 | 0.235 | 0.000 | 0.005
10 | ≥5 | 0.995 | 0.030 | 0.030 | 0.075 | 0.010 | 0.000 | 0.000 | 0.125 | 0.000 | 0.000
50 | ≤1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =3 | 0.000 | 0.630 | 0.795 | 0.955 | 0.945 | 1.000 | 1.000 | 0.950 | 1.000 | 0.950
50 | =4 | 0.000 | 0.320 | 0.185 | 0.045 | 0.055 | 0.000 | 0.000 | 0.020 | 0.000 | 0.050
50 | ≥5 | 1.000 | 0.050 | 0.020 | 0.000 | 0.000 | 0.000 | 0.000 | 0.030 | 0.000 | 0.000
Table A5. Distributions of the estimated component number r̂ for Scenario 3 (r = 3).
N_i | r̂ | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA_Fast
5 | ≤1 | 0.000 | 0.000 | 0.335 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5 | =2 | 0.025 | 0.035 | 0.260 | 0.220 | 0.005 | 0.000 | 0.005 | 0.000 | 0.000 | 0.025
5 | =3 | 0.005 | 0.720 | 0.325 | 0.640 | 0.590 | 0.320 | 0.400 | 0.450 | 0.995 | 0.945
5 | =4 | 0.130 | 0.170 | 0.080 | 0.075 | 0.280 | 0.640 | 0.565 | 0.360 | 0.005 | 0.030
5 | ≥5 | 0.840 | 0.075 | 0.000 | 0.065 | 0.125 | 0.030 | 0.030 | 0.190 | 0.000 | 0.000
10 | ≤1 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =2 | 0.015 | 0.000 | 0.035 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =3 | 0.000 | 0.580 | 0.770 | 0.965 | 0.665 | 0.740 | 0.755 | 0.440 | 0.995 | 1.000
10 | =4 | 0.000 | 0.400 | 0.145 | 0.030 | 0.320 | 0.260 | 0.245 | 0.380 | 0.005 | 0.000
10 | ≥5 | 0.985 | 0.020 | 0.045 | 0.005 | 0.015 | 0.000 | 0.000 | 0.180 | 0.000 | 0.000
50 | ≤1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.015 | 0.000
50 | =3 | 0.000 | 1.000 | 0.775 | 1.000 | 1.000 | 1.000 | 1.000 | 0.765 | 0.980 | 0.920
50 | =4 | 0.000 | 0.000 | 0.200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.110 | 0.005 | 0.050
50 | ≥5 | 1.000 | 0.000 | 0.025 | 0.000 | 0.000 | 0.000 | 0.000 | 0.125 | 0.000 | 0.030
Table A6. Distributions of the estimated component number r̂ for Scenario 4 (r = 3).
N_i | r̂ | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA_Fast
5 | ≤1 | 0.000 | 0.000 | 0.315 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5 | =2 | 0.015 | 0.020 | 0.180 | 0.160 | 0.015 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
5 | =3 | 0.015 | 0.710 | 0.410 | 0.640 | 0.560 | 0.515 | 0.575 | 0.370 | 1.000 | 0.975
5 | =4 | 0.145 | 0.185 | 0.070 | 0.095 | 0.260 | 0.450 | 0.390 | 0.515 | 0.000 | 0.025
5 | ≥5 | 0.825 | 0.085 | 0.025 | 0.105 | 0.165 | 0.035 | 0.035 | 0.115 | 0.000 | 0.000
10 | ≤1 | 0.000 | 0.000 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =2 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =3 | 0.000 | 0.830 | 0.775 | 0.920 | 0.900 | 0.750 | 0.760 | 0.350 | 0.995 | 0.990
10 | =4 | 0.000 | 0.150 | 0.190 | 0.045 | 0.085 | 0.250 | 0.240 | 0.380 | 0.005 | 0.010
10 | ≥5 | 1.000 | 0.020 | 0.020 | 0.035 | 0.015 | 0.000 | 0.000 | 0.270 | 0.000 | 0.000
50 | ≤1 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =2 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.010 | 0.000
50 | =3 | 0.000 | 0.945 | 0.835 | 1.000 | 1.000 | 1.000 | 1.000 | 0.730 | 0.950 | 0.935
50 | =4 | 0.000 | 0.055 | 0.140 | 0.000 | 0.000 | 0.000 | 0.000 | 0.160 | 0.040 | 0.055
50 | ≥5 | 1.000 | 0.000 | 0.025 | 0.000 | 0.000 | 0.000 | 0.000 | 0.110 | 0.000 | 0.010
Table A7. Distributions of the estimated component number r̂ for Scenario 5 (r = 6).
N_i | r̂ | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA_Fast
5 | ≤4 | 0.005 | 0.165 | 0.835 | 0.580 | 0.060 | 0.000 | 0.000 | 0.010 | 0.000 | 0.060
5 | =5 | 0.005 | 0.330 | 0.020 | 0.345 | 0.335 | 0.575 | 0.590 | 0.010 | 0.075 | 0.515
5 | =6 | 0.705 | 0.470 | 0.090 | 0.070 | 0.545 | 0.425 | 0.410 | 0.855 | 0.925 | 0.160
5 | =7 | 0.245 | 0.035 | 0.050 | 0.005 | 0.060 | 0.000 | 0.000 | 0.115 | 0.000 | 0.160
5 | ≥8 | 0.040 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.010 | 0.000 | 0.105
10 | ≤4 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
10 | =5 | 0.000 | 0.000 | 0.030 | 0.145 | 0.000 | 0.425 | 0.425 | 0.000 | 0.000 | 0.000
10 | =6 | 0.065 | 0.570 | 0.525 | 0.775 | 0.705 | 0.575 | 0.575 | 0.500 | 1.000 | 0.930
10 | =7 | 0.475 | 0.280 | 0.165 | 0.020 | 0.185 | 0.000 | 0.000 | 0.405 | 0.000 | 0.035
10 | ≥8 | 0.455 | 0.150 | 0.030 | 0.060 | 0.110 | 0.000 | 0.000 | 0.095 | 0.000 | 0.035
50 | ≤4 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
50 | =5 | 0.065 | 0.000 | 0.000 | 0.000 | 0.000 | 0.130 | 0.130 | 0.005 | 0.000 | 0.000
50 | =6 | 0.000 | 0.260 | 0.590 | 0.980 | 0.965 | 0.870 | 0.770 | 0.695 | 0.995 | 0.925
50 | =7 | 0.000 | 0.405 | 0.325 | 0.010 | 0.035 | 0.000 | 0.000 | 0.250 | 0.005 | 0.045
50 | ≥8 | 0.935 | 0.335 | 0.080 | 0.010 | 0.000 | 0.000 | 0.000 | 0.050 | 0.000 | 0.030
Figure A2. Application of the proposed BSFDA to the simulation benchmark from [21], illustrating the true mean function (blue), the observed measurements from two functions sampled at different densities (light blue for sparse, orange for dense), and the corresponding functional estimates with 95% truncated uncertainty intervals.

Appendix F.1.1. Performance of LFRM

To compare the latent factor regression model (LFRM) [18] as a dimension reduction model to ours, i.e., Bayesian scalable functional data analysis (BSFDA), we set the covariates in LFRM to zero, thus assigning standard Gaussian priors to the latent variables, analogously to our approach. We followed the simulation benchmark in [21] to select the number of components, focusing on Scenario 1 with 50 measurements per function (the densest data). Because LFRM does not estimate a mean function, we omitted the mean from the simulation run here.
The following hyperparameters of LFRM need to be determined:
  • Gamma prior for white noise and correlated noise;
  • Length scale;
  • Number of basis functions;
  • Number of iterations.
LFRM, with its default white noise prior, correctly identified the white noise variance (true value 0.2) in all tests, so we retained this default. We tested different Gamma priors for the correlated noise: the default prior, a noninformative-like (vaguer) prior (same mean but 100 times the variance), and a low-noise prior (same variance but 100 times the mean). We kept the number of locations for basis functions at 10, which is the default setting. For the length scale in LFRM, we first used the best estimate from our cross-validation (CV). We then tried all 10 CV-selected length scales, producing 100 basis functions in total; however, this required substantial time, so we performed only two repeated runs for this setting. We kept LFRM's default of 5000 burn-in iterations (25,000 total) with thinning at intervals of 5, verifying convergence through trace plots in line with [18]. Meanwhile, BSFDA was run 200 times as in Section 5.1, LFRM (10 length scales) two times, and all other settings 10 times.
Across repeated trials, LFRM consistently overestimated the true number of components (which was 3). Specifically,
  • Standard LFRM estimated 10–14 components;
  • LFRM with 10 length scales estimated 6–8 components;
  • LFRM with a low-correlated-noise prior estimated 8–15 components;
  • LFRM with a noninformative-like correlated-noise prior estimated 10–14 components.
In contrast, our method BSFDA produced a clear gap in the distribution of the precision parameters, effectively separating effective dimensions from redundant ones.
Several factors may explain LFRM’s performance.
  • Correlated noise interference: The correlated noise can obscure the true signal.
  • Prior specification: LFRM’s precision parameter priors are potentially more informative and less sparsity-inducing than the sparse Bayesian learning priors [30] used in BSFDA.
  • Element-wise vs. column-wise precision: The element-wise precision parameters in LFRM might compensate in a way that reduces the overall sparsity.

Appendix F.2. Variational Inference vs. MCMC

We conducted experiments using both Gibbs sampling (MCMC) and mean-field approximation (variational inference, VI) for the Bayesian PCA simulation [28] under varying noise levels, assuming that the true noise level was known. In our experiments, “satisfactory estimation” was defined as the point at which the fourth smallest precision (i.e., inverse variance) was at least 100 times smaller than the fifth smallest, indicating that the four true signal dimensions (with variances [5, 4, 3, 2]) had been correctly identified. For computational tractability, we capped VI at 200,000 iterations (approximately 200 s) and MCMC sampling at 20,000 iterations (about 20 min), with a burn-in period of 200 iterations and thinning set to 10.
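As a minimal sketch of this stopping criterion, assuming only a vector of estimated precision parameters (one per candidate dimension), the check can be written as follows; the function name and threshold argument are illustrative, not part of our released code.

```python
import numpy as np

def satisfactory_estimation(precisions, n_signal=4, gap_factor=100.0):
    """Return True if the n_signal smallest precisions (i.e., the largest
    variances) are separated from the remaining ones by at least gap_factor,
    mirroring the criterion described above."""
    p = np.sort(np.asarray(precisions, dtype=float))
    # e.g., the fourth smallest precision must be at least 100x smaller than the fifth.
    return p[n_signal - 1] * gap_factor <= p[n_signal]

# Toy check: four signal dimensions (variances 5, 4, 3, 2) plus near-pruned extras.
estimated_precisions = [1 / 5, 1 / 4, 1 / 3, 1 / 2, 1e4, 3e4, 8e4]
print(satisfactory_estimation(estimated_precisions))  # True
```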
Figure A3 illustrates the runtime for VI and MCMC to identify the correct components. Our key findings are as follows:
  • When the noise level was close to the signal level, neither MCMC nor VI found the true dimensionality within the allotted iterations (and probably never would have), because the data were too heavily polluted by noise.
  • As the noise level decreased toward zero, the number of iterations (and the runtime) required for satisfactory estimation increased dramatically; within the set time constraints, VI began to fail around a noise level of 1 × 10⁻⁴ and MCMC sampling around 1 × 10⁻³.
  • Across the 10 noise levels (roughly 3 × 10⁻³ to 2 × 10⁻¹) at which both methods identified the correct dimensionality, VI consistently completed much faster than MCMC sampling: VI was 85.57 ± 50.24 times faster on average, with speedups ranging from 32.46 to 189.12.
Figure A3. Time for variational inference and MCMC to identify the correct components in Bayesian PCA.
These results indicate that both MCMC sampling and VI become slower as the noise decreases. We hypothesize that this slowdown arises because low noise induces strong dependencies in the posterior, a known, long-standing issue that continues to motivate research, e.g., structured VI [38] and blocked/collapsed Gibbs samplers [40]. However, both MCMC and VI work well provided that there are sufficient iterations, which suggests that the dependency induced by very low noise levels poses an optimization challenge rather than a fundamental modeling issue.
In summary, (1) VI is significantly faster than MCMC, (2) both methods slow down as the noise level decreases, and (3) both fail to recover the correct components when the noise is excessively high.

References

1. Wang, J.L.; Chiou, J.M.; Müller, H.G. Functional Data Analysis. Annu. Rev. Stat. Its Appl. 2016, 3, 257–295.
2. Ramsay, J.O.; Silverman, B.W. Applied Functional Data Analysis: Methods and Case Studies; Springer: Berlin/Heidelberg, Germany, 2002.
3. Rice, J.A.; Silverman, B.W. Estimating the Mean and Covariance Structure Nonparametrically When the Data are Curves. J. R. Stat. Soc. Ser. B Stat. Methodol. 1991, 53, 233–243.
4. Aneiros, G.; Cao, R.; Fraiman, R.; Genest, C.; Vieu, P. Recent advances in functional data analysis and high-dimensional statistics. J. Multivar. Anal. 2019, 170, 3–9.
5. Li, Y.; Qiu, Y.; Xu, Y. From multivariate to functional data analysis: Fundamentals, recent developments, and emerging areas. J. Multivar. Anal. 2022, 188, 104806.
6. Happ, C.; Greven, S. Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains. J. Am. Stat. Assoc. 2018, 113, 649–659.
7. Kowal, D.R.; Canale, A. Semiparametric Functional Factor Models with Bayesian Rank Selection. Bayesian Anal. 2023, 18, 1161–1189.
8. Suarez, A.J.; Ghosal, S. Bayesian Estimation of Principal Components for Functional Data. Bayesian Anal. 2017, 12, 311–333.
9. Sun, T.Y.; Kowal, D.R. Ultra-Efficient MCMC for Bayesian Longitudinal Functional Data Analysis. J. Comput. Graph. Stat. 2024, 34, 34–46.
10. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; Adaptive Computation and Machine Learning; MIT Press: Cambridge, MA, USA, 2006; p. 248.
11. Yao, F.; Müller, H.G.; Wang, J.L. Functional Data Analysis for Sparse Longitudinal Data. J. Am. Stat. Assoc. 2005, 100, 577–590.
12. Di, C.Z.; Crainiceanu, C.M.; Caffo, B.S.; Punjabi, N.M. Multilevel functional principal component analysis. Ann. Appl. Stat. 2009, 3, 458–488.
13. Peng, J.; Paul, D. A geometric approach to maximum likelihood estimation of the functional principal components from sparse longitudinal data. J. Comput. Graph. Stat. 2009, 18, 995–1015.
14. Chiou, J.M.; Yang, Y.F.; Chen, Y.T. Multivariate functional principal component analysis: A normalization approach. Stat. Sin. 2014, 24, 1571–1596.
15. Trefethen, L.N. Approximation Theory and Approximation Practice, Extended Edition; SIAM: Philadelphia, PA, USA, 2019.
16. Bungartz, H.J.; Griebel, M. Sparse grids. Acta Numer. 2004, 13, 147–269.
17. Shi, H.; Yang, Y.; Wang, L.; Ma, D.; Beg, M.F.; Pei, J.; Cao, J. Two-Dimensional Functional Principal Component Analysis for Image Feature Extraction. J. Comput. Graph. Stat. 2022, 31, 1127–1140.
18. Montagna, S.; Tokdar, S.T.; Neelon, B.; Dunson, D.B. Bayesian Latent Factor Regression for Functional and Longitudinal Data. Biometrics 2012, 68, 1064–1073.
19. Kowal, D.R.; Bourgeois, D.C. Bayesian Function-on-Scalars Regression for High-Dimensional Data. J. Comput. Graph. Stat. 2020, 29, 629–638.
20. Sousa, P.H.T.O.; Souza, C.P.E.d.; Dias, R. Bayesian adaptive selection of basis functions for functional data representation. J. Appl. Stat. 2024, 51, 958–992.
21. Li, Y.; Wang, N.; Carroll, R.J. Selecting the Number of Principal Components in Functional Data. J. Am. Stat. Assoc. 2013, 108, 1284–1294.
22. Shamshoian, J.; Şentürk, D.; Jeste, S.; Telesca, D. Bayesian analysis of longitudinal and multidimensional functional data. Biostatistics 2022, 23, 558–573.
23. Huo, S.; Morris, J.S.; Zhu, H. Ultra-Fast Approximate Inference Using Variational Functional Mixed Models. J. Comput. Graph. Stat. 2023, 32, 353–365.
24. Liu, Y.; Qiao, X.; Pei, Y.; Wang, L. Deep Functional Factor Models: Forecasting High-Dimensional Functional Time Series via Bayesian Nonparametric Factorization. In Proceedings of the International Conference on Machine Learning, PMLR, Vienna, Austria, 21–27 July 2024; pp. 31709–31727.
25. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. 2016, 374, 20150202.
26. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 1999, 61, 611–622.
27. Ilin, A.; Raiko, T. Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 2010, 11, 1957–2000.
28. Bishop, C.M. Variational Principal Components. In Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN’99, Edinburgh, UK, 7–10 September 1999; IEEE: Piscataway, NJ, USA, 1999; pp. 509–514.
29. Tipping, M.E.; Bishop, C.M. Mixtures of probabilistic principal component analyzers. Neural Comput. 1999, 11, 443–482.
30. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001, 1, 211–244.
31. MacKay, D.J. Bayesian methods for backpropagation networks. In Models of Neural Networks III: Association, Generalization, and Representation; Springer: Berlin/Heidelberg, Germany, 1996; pp. 211–254.
32. Neal, R.M. Bayesian Learning for Neural Networks; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 118.
33. Wipf, D.; Nagarajan, S. A new view of automatic relevance determination. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; Volume 20.
34. Girolami, M.; Rogers, S. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 241–248.
35. Chen, Y.; Cheng, L.; Wu, Y.C. Bayesian low-rank matrix completion with dual-graph embedding: Prior analysis and tuning-free inference. Signal Process. 2023, 204, 108826.
36. Cheng, L.; Yin, F.; Theodoridis, S.; Chatzis, S.; Chang, T.H. Rethinking Bayesian Learning for Data Analysis: The Art of Prior and Inference in Sparsity-Aware Modeling. IEEE Signal Process. Mag. 2022, 39, 18–52.
37. Wong, A.P.S.; Wijffels, S.E.; Riser, S.C.; Pouliquen, S.; Hosoda, S.; Roemmich, D.; Gilson, J.; Johnson, G.C.; Martini, K.; Murphy, D.J.; et al. Argo Data 1999–2019: Two Million Temperature-Salinity Profiles and Subsurface Velocity Observations From a Global Array of Profiling Floats. Front. Mar. Sci. 2020, 7, 700.
38. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
39. Tipping, M.E.; Faul, A.C. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, PMLR, Key West, FL, USA, 3–6 January 2003; pp. 276–283.
40. Park, T.; Lee, S. Improving the Gibbs sampler. Wiley Interdiscip. Rev. Comput. Stat. 2022, 14, e1546.
41. Ledoit, O.; Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 2004, 88, 365–411.
42. Kaslow, R.A.; Ostrow, D.G.; Detels, R.; Phair, J.P.; Polk, B.F.; Rinaldo, C.R., Jr.; Multicenter AIDS Cohort Study. The Multicenter AIDS Cohort Study: Rationale, organization, and selected characteristics of the participants. Am. J. Epidemiol. 1987, 126, 310–318.
43. Argo Float Data and Metadata from Global Data Assembly Centre (Argo GDAC)-Snapshot of Argo GDAC of 9 November 2024. 2024. Available online: https://www.seanoe.org/data/00311/42182/ (accessed on 29 November 2024).
44. Yarger, D.; Stoev, S.; Hsing, T. A functional-data approach to the Argo data. Ann. Appl. Stat. 2022, 16, 216–246.
45. de Boyer Montégut, C.; Madec, G.; Fischer, A.S.; Lazar, A.; Iudicone, D. Mixed layer depth over the global ocean: An examination of profile data and a profile-based climatology. J. Geophys. Res. Ocean. 2004, 109.
46. Roemmich, D.; Gilson, J. The 2004–2008 mean and annual cycle of temperature, salinity, and steric height in the global ocean from the Argo Program. Prog. Oceanogr. 2009, 82, 81–100.
47. Kuusela, M.; Stein, M.L. Locally stationary spatio-temporal interpolation of Argo profiling float data. Proc. R. Soc. A 2018, 474, 20180400.
Figure 1. Probabilistic graphical model for the full model.
Figure 2. Diagrams of the variational inference algorithm for all parameters. The top three diagrams each have a closed loop and a closed-form overall transfer function.
Figure 3. Length scales and centers of selected kernel basis functions in a random repetition for three different m values in Scenario 5.
Figure 4. Convergence plots for Scenario 5 of the Yehua benchmark and the 4D simulation. The upper row displays the covariance error against time, and the lower row illustrates the difference between the estimated and true numbers of components.
Figure 5. Cross-sectional visualization of eigenfunctions (eigenvalues) of the 4D simulation.
Figure 6. Outcomes from the proposed method applied to MACS CD4 datasets. (a) Estimated curves for a random selection of nine sampled functions and the mean function. (b) Estimated eigenfunctions (eigenvalues).
Figure 7. Outcomes from the proposed method applied to a wind speed dataset. (a) Estimated curves for a random selection of 9 sampled functions (denoted by different colors) and the mean function. (b) Estimated eigenfunctions (eigenvalues) denoted as EF. (c) Estimated covariance.
Figure 8. Temperature measurements in February 2021 near the sea surface in the ARGO dataset.
Figure 9. Geodesic interpolation from BSFDA Fast vs. actual ARGO global oceanic measurements at 1, 200, and 300 decibars, at 1° S and 30° W, on May 29. Measurements are represented by circles, with the filling color indicating the temperature. Circle sizes show distance in depth and time from the central point.
Figure 10. Depth–time interpolation from BSFDA Fast vs. actual ARGO global oceanic measurements at two sites focusing on mixed layer behavior. Measurements are represented by circles (green for training and pink for testing data), with the filling color indicating the temperature. Circle sizes show distance in geodesic space from the central point (denoted as red circles). (a) Shallow mixed layers at 1° S and 30° W. (b) Deep mixed layers at 49° N and 29° W.
Table 1. Proportion of accurate estimations for Scenario 1 (r = 3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA Fast
5 | 0.000 | 0.580 | 0.380 | 0.410 | 0.735 | 0.650 | 0.880 | 0.645 | 0.995 | 0.015
10 | 0.000 | 0.980 | 0.670 | 0.955 | 0.985 | 0.880 | 0.920 | 0.645 | 1.000 | 0.910
50 | 0.000 | 1.000 | 0.830 | 1.000 | 1.000 | 1.000 | 1.000 | 0.890 | 0.980 | 0.945
Table 2. Proportion of accurate estimations for Scenario 2 (r = 3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA Fast
5 | 0.005 | 0.630 | 0.245 | 0.375 | 0.605 | 0.570 | 0.620 | 0.475 | 1.000 | 0.040
10 | 0.000 | 0.710 | 0.665 | 0.570 | 0.805 | 0.825 | 0.850 | 0.640 | 1.000 | 0.995
50 | 0.000 | 0.630 | 0.795 | 0.955 | 0.945 | 1.000 | 1.000 | 0.950 | 1.000 | 0.950
Table 3. Proportion of accurate estimations for Scenario 3 (r = 3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA Fast
5 | 0.005 | 0.720 | 0.325 | 0.640 | 0.590 | 0.320 | 0.400 | 0.450 | 0.995 | 0.945
10 | 0.000 | 0.580 | 0.770 | 0.965 | 0.665 | 0.740 | 0.755 | 0.440 | 0.995 | 1.000
50 | 0.000 | 1.000 | 0.775 | 1.000 | 1.000 | 1.000 | 1.000 | 0.765 | 0.980 | 0.920
Table 4. Proportion of accurate estimations for Scenario 4 (r = 3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA Fast
5 | 0.015 | 0.710 | 0.410 | 0.640 | 0.560 | 0.515 | 0.575 | 0.370 | 1.000 | 0.975
10 | 0.000 | 0.830 | 0.775 | 0.920 | 0.900 | 0.750 | 0.760 | 0.350 | 0.995 | 0.990
50 | 0.000 | 0.945 | 0.835 | 1.000 | 1.000 | 1.000 | 1.000 | 0.730 | 0.950 | 0.935
Table 5. Proportion of accurate estimations for Scenario 5 (r = 6).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE2022 | BIC_PACE2022 | fpca | BSFDA | BSFDA Fast
5 | 0.705 | 0.470 | 0.090 | 0.070 | 0.545 | 0.425 | 0.410 | 0.855 | 0.925 | 0.160
10 | 0.065 | 0.570 | 0.525 | 0.775 | 0.705 | 0.575 | 0.575 | 0.500 | 1.000 | 0.930
50 | 0.000 | 0.260 | 0.590 | 0.980 | 0.965 | 0.870 | 0.770 | 0.695 | 0.995 | 0.925
Table 6. Mean squared error of covariance Error CovFunc for Scenario 5.
N_i | AIC_PACE2022 | BIC_PACE2022 | fpca | refund.sc | BSFDA | BSFDA Fast
5 | 12.373 ± 4.026 | 12.377 ± 4.031 | 5.192 ± 6.166 | 8.833 ± 4.730 | 5.814 ± 3.535 | 10.292 ± 12.717
10 | 10.391 ± 2.521 | 10.391 ± 2.521 | 2.098 ± 1.425 | 5.314 ± 3.501 | 2.068 ± 1.427 | 2.656 ± 1.712
50 | 9.054 ± 1.683 | 9.054 ± 1.683 | 1.642 ± 1.240 | N/A | 1.638 ± 1.247 | 1.770 ± 1.275