A Factor Analysis Perspective on Linear Regression in the ‘More Predictors than Samples’ Case

Linear regression (LR) is a core model in supervised machine learning performing a regression task. One can fit this model using either an analytic/closed-form formula or an iterative algorithm. Fitting it via the analytic formula becomes a problem when the number of predictors is greater than the number of samples because the closed-form solution contains a matrix inverse that is not defined in this case. The standard approaches to this issue are the Moore-Penrose inverse and L2 regularization. We propose another solution starting from a machine learning model that, this time, is used in unsupervised learning, performing a dimensionality reduction task or just a density estimation one: factor analysis (FA) with a one-dimensional latent space. The density estimation task represents our focus since, in this case, factor analysis can fit a Gaussian distribution even if the dimensionality of the data is greater than the number of samples; hence, we retain this advantage when creating the supervised counterpart of factor analysis, which is linked to linear regression. We also create its semisupervised counterpart and then extend it to be usable with missing data. We prove an equivalence to linear regression and create experiments for each extension of the factor analysis model. The resulting algorithms are either closed-form solutions or expectation-maximization (EM) algorithms. The latter are linked to information theory by optimizing a function containing a Kullback-Leibler (KL) divergence or the entropy of a random variable.


Introduction
In machine learning, models can be grouped into two categories: probabilistic and nonprobabilistic. Probabilistic models can be classified as generative and discriminative [1]. Examples of classic generative models are naive Bayes and Gaussian mixture models (GMM). Examples of classic discriminative models are linear regression (LR) and logistic regression. The key difference is whether they model the joint probability of the input and the output (generative models) or just the conditional probability of the output given the input (discriminative models). For a classification or a regression task, one may argue that a discriminative model is all you need, but generative models have their advantages: they can sometimes handle missing data, can easily generate new data, can be extended to be unsupervised or semisupervised, etc. ([2] p. 268).
As one may notice, there are generative models for unsupervised learning that have counterparts in supervised learning, even though this is not widely discussed in the literature. One such example is the GMM ([2] p. 339) with its counterpart, the Gaussian joint Bayes model ([2] p. 102), also known as quadratic discriminant analysis. Their training/fitting algorithms are similar, as one may notice, for example, in [3,4]:
• for Gaussian joint Bayes, the maximum likelihood estimates are

π_j = (1/n) Σ_i 1{z_i = j},  µ_j = (Σ_i 1{z_i = j} x_i) / (Σ_i 1{z_i = j}),  Σ_j = (Σ_i 1{z_i = j} (x_i − µ_j)(x_i − µ_j)^T) / (Σ_i 1{z_i = j}),

where x_i is an input observation, (π_j, µ_j, Σ_j) are the parameters of the model, the observable z_i is the class index corresponding to x_i, j is a class index, and 1{z_i = j} is the indicator function, which returns 1 if the condition z_i = j is true and 0 otherwise;
• for expectation-maximization (EM) for the GMM, which optimizes a function concerning a Kullback-Leibler (KL) divergence or the entropy of a random variable (check Appendix A for these details):

E step: w_ij = π_j N(x_i; µ_j, Σ_j) / (Σ_l π_l N(x_i; µ_l, Σ_l)),
M step: π_j = (1/n) Σ_i w_ij,  µ_j = (Σ_i w_ij x_i) / (Σ_i w_ij),  Σ_j = (Σ_i w_ij (x_i − µ_j)(x_i − µ_j)^T) / (Σ_i w_ij),

where x_i is an input observation, (π_j, µ_j, Σ_j) are the parameters of a GMM, the unobservable z_i is the cluster index corresponding to x_i, j is a cluster index, and w_ij is the probability that x_i belongs to cluster j.
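The EM updates listed above can be sketched in code; the following is a minimal one-iteration EM step for a GMM (the helper name `em_step` is ours, and the sketch has no safeguards for empty clusters or degenerate covariance matrices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM (a minimal sketch: no safeguards for
    empty clusters or degenerate covariance matrices)."""
    n, _ = X.shape
    k = len(pi)
    # E step: w[i, j] = probability that x_i belongs to cluster j
    w = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)
    # M step: re-estimate (pi_j, mu_j, Sigma_j) from the responsibilities
    nj = w.sum(axis=0)
    pi = nj / n
    mu = (w.T @ X) / nj[:, None]
    Sigma = np.array([(w[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / nj[j]
                      for j in range(k)])
    return pi, mu, Sigma
```

Iterating `em_step` to convergence yields the usual GMM fitting algorithm; the Gaussian joint Bayes estimates are obtained by replacing the soft responsibilities `w` with the hard indicators 1{z_i = j}.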
This similarity between GMM and Gaussian joint Bayes is intriguing; hence, we decided to further explore this aspect but starting from other supervised-unsupervised counterparts. As a result, we changed the root model into factor analysis (FA) [5] ( [2] p. 381), which is normally used for dimensionality reduction or for density estimation when the dimensionality of the data is greater than the number of samples. Factor analysis is a Gaussian generative model used in unsupervised learning. We aimed at creating its supervised counterpart in order to handle a regression task and then exploit it as much as possible.
After creating the supervised counterpart, we proved a significant property, namely that linear regression is equivalent to (supervised) factor analysis with a one-dimensional latent space when no constraints are imposed on the covariance matrices.
A linear regression model can be fitted via a closed-form solution or an iterative algorithm. When the number of predictors is greater than the number of samples, there is no closed-form solution. There are other solutions to this problem, as we will see.
We were at the point where we knew that factor analysis was linked to linear regression and that it could be used when the number of samples was lower than the dimensionality of the data-from now on, this is denoted as D >> n or n << D. As a result, we shifted our focus from solely exploiting the factor analysis model to highlighting novel linear regression versions applicable in the D >> n regime (linear regression being a widely known and used model):
• linear regression when D >> n,
• semisupervised linear regression when D >> n,
• (semisupervised) linear regression when D >> n with missing data.
The structure of this paper is as follows. In Section 2, we include some theoretical background to enhance the readability of this paper. Section 3 contains related work. In Section 4, we include the models we proposed, starting from factor analysis. Section 5 contains experiments using the proposed models. In Section 6, we conclude the paper and show future directions.
We include the full algorithms in the appendices in a pseudocode format, two of them being instances of the expectation-maximization schema.

Theoretical Background
We started our analysis from two core models in machine learning: linear regression and factor analysis. We will discuss the aspects of those two models that are relevant to understanding the next sections of this paper.

Linear Regression
Let {(x^(i), y^(i)) | i ∈ {1, . . . , n}} be a data set where D is the dimensionality of the input data, {x^(i) | i ∈ {1, . . . , n}} ⊂ R^(D×1) is the input, and {y^(i) | i ∈ {1, . . . , n}} ⊂ R is the output. The linear regression model is as follows:

y = w^T x + b + ε, with w ∈ R^(D×1), b ∈ R, and ε Gaussian noise.

Then, the parameters w and b can be estimated via maximum likelihood as follows:

ŵ = [Σ_i (x^(i) − x̄)(x^(i) − x̄)^T]^(−1) [Σ_i (x^(i) − x̄)(y^(i) − ȳ)],   (1)
b̂ = ȳ − ŵ^T x̄,   (2)

or, equivalently, as follows:

(ŵ, b̂) = (X̃X̃^T)^(−1) X̃ y,   (3)

where x̄ = (x^(1) + · · · + x^(n))/n, ȳ = (y^(1) + · · · + y^(n))/n, X̃ ∈ R^((D+1)×n) is the matrix whose i-th column is x^(i) augmented with a constant 1, and y = (y^(1), . . . , y^(n))^T. A potential problem with Equation (3) is when X̃X̃^T is not invertible. Such a case arises when D + 1 > n, i.e., there are more predictors than samples. Two standard solutions to this problem are the following:
• Let A ∈ R^(a×b) be a matrix. Then, the Moore-Penrose inverse of A can be defined as A⁺ = lim_(α→0⁺) (A^T A + αI)^(−1) A^T, and one replaces (X̃X̃^T)^(−1) X̃ in (3) with (X̃^T)⁺;
• Ridge regression (L2 regularization): (ŵ, b̂) = (X̃X̃^T + αI)^(−1) X̃ y, with α > 0; the bigger the α, the more regularization we add to the model, i.e., the further we move away from overfitting. From this point of view, the first solution using the Moore-Penrose inverse can be interpreted as achieving the asymptotically lowest L2 regularization.
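Both standard solutions can be sketched numerically. The snippet below, on synthetic data of our own choosing, builds the augmented design matrix, computes the Moore-Penrose and ridge solutions, and illustrates that ridge approaches the pseudoinverse solution as α → 0⁺:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 10, 50                          # more predictors than samples: D + 1 > n
X = rng.normal(size=(n, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=n)
Xt = np.hstack([X, np.ones((n, 1))])   # augmented design: a column of ones for the bias

# Solution 1: Moore-Penrose pseudoinverse (minimum-norm least squares)
theta_pinv = np.linalg.pinv(Xt) @ y
assert np.allclose(Xt @ theta_pinv, y, atol=1e-6)   # interpolates: full row rank

# Solution 2: ridge / L2 regularization; invertible for any alpha > 0
alpha = 1.0
theta_ridge = np.linalg.solve(Xt.T @ Xt + alpha * np.eye(D + 1), Xt.T @ y)

# As alpha -> 0+, the ridge solution approaches the pseudoinverse one
theta_tiny = np.linalg.solve(Xt.T @ Xt + 1e-8 * np.eye(D + 1), Xt.T @ y)
assert np.allclose(theta_tiny, theta_pinv, atol=1e-4)
```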

Factor Analysis
The formulas stated in the conclusion of the following Proposition were proved in [5] and are relevant for the factor analysis algorithm; although the matrix Ψ is considered as being diagonal there, the formulas stay the same even if Ψ is not diagonal.

Proposition 2.
Let us consider the following factor analysis model: z ∼ N(0, I), z ∈ R^(d×1); x|z ∼ N(µ + Λz, Ψ), x ∈ R^(D×1), µ ∈ R^(D×1), Λ ∈ R^(D×d), and Ψ ∈ R^(D×D) a diagonal matrix. Then:

x ∼ N(µ, ΛΛ^T + Ψ) and z|x ∼ N(Λ^T C^(−1)(x − µ), I − Λ^T C^(−1) Λ), where C = ΛΛ^T + Ψ.

A factor analysis model can be fitted via an EM algorithm using the maximum likelihood estimation (MLE) principle, and factor analysis can be used as a density estimation technique when the dimensionality of the data is greater than the number of samples [5].
For the algorithms we developed, we will let z ∼ N(µ_z, Σ_z) and not z ∼ N(0, I), because z becomes observed data and we want to learn its parameters rather than impose something unrealistic like z ∼ N(0, I). This generalization leads to the following result.

Proposition 3.
Let us consider the following linear Gaussian system: z ∼ N(µ_z, Σ_z) (latent variable), z ∈ R^(d×1), µ_z ∈ R^(d×1), Σ_z ∈ R^(d×d) a symmetric and positive definite matrix; x|z ∼ N(µ + Λz, Ψ), x ∈ R^(D×1), µ ∈ R^(D×1), Λ ∈ R^(D×d), and Ψ ∈ R^(D×D) a symmetric and positive definite matrix. (If Ψ is a diagonal matrix, then we say that we are in the FA case. If Ψ is a scalar matrix, Ψ = ηI, η ∈ R*₊, then we are in the probabilistic principal component analysis (PPCA) case [7]. If Ψ is any symmetric and positive definite matrix, then we say we are in the unconstrained factor analysis (UncFA) case. While the first two cases are standard (FA and PPCA), the third one, UncFA, is proposed by us.) Then:

x ∼ N(µ + Λµ_z, ΛΣ_zΛ^T + Ψ),
z|x ∼ N(µ_(z|x), Σ_(z|x)), with Σ_(z|x) = (Σ_z^(−1) + Λ^T Ψ^(−1) Λ)^(−1) and µ_(z|x) = Σ_(z|x) [Λ^T Ψ^(−1)(x − µ) + Σ_z^(−1) µ_z].   (4)

The proof can be found in ([8] pp. 9-11).
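The marginal in the conclusion of the proposition can be checked by Monte Carlo sampling from the generative process; the sketch below uses arbitrary parameter values of our own choosing, with d = 1 as in the rest of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, D = 1, 3                            # one-dimensional latent space
mu_z = np.array([2.0])
Sigma_z = np.array([[0.5]])
mu = np.array([1.0, -1.0, 0.0])
Lam = rng.normal(size=(D, d))
Psi = np.diag([0.1, 0.2, 0.3])         # diagonal Psi: the FA case

# Sample the generative process: z ~ N(mu_z, Sigma_z), then x|z ~ N(mu + Lam z, Psi)
m = 200_000
z = rng.multivariate_normal(mu_z, Sigma_z, size=m)
x = mu + z @ Lam.T + rng.multivariate_normal(np.zeros(D), Psi, size=m)

# Marginal predicted by the proposition: x ~ N(mu + Lam mu_z, Lam Sigma_z Lam^T + Psi)
mean_pred = mu + Lam @ mu_z
cov_pred = Lam @ Sigma_z @ Lam.T + Psi
assert np.allclose(x.mean(axis=0), mean_pred, atol=0.05)
assert np.allclose(np.cov(x.T), cov_pred, atol=0.05)
```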

Related Work
Although factor analysis is widely used for dimensionality reduction, its supervised counterpart is, to the best of our knowledge, not present in the literature. What is present is a model called supervised principal component analysis or latent factor regression ([2] p. 405). The idea there is that, for a regression task, not only the input but also the output is generated by a latent variable, and one applies factor analysis to replace the input with a low-dimensional embedding. The key point is that the purpose of supervised principal component analysis is still dimensionality reduction and not at all regression, which is where we want to push the factor analysis model.
There is also a term called linear Gaussian system ( [2] p. 119). This was already presented in Section 2, and it generalizes the factor analysis generative process by considering that z has a learnable mean and covariance matrix, but it does not go further.
Factor analysis is strongly related to principal component analysis (PCA) [9] because by imposing a certain constraint in factor analysis, we get a model called probabilistic principal component analysis [7] that can be fitted using a closed-form solution, which, in an asymptotic case, is also the solution for PCA. Probabilistic PCA can be kernelized using a model called the Gaussian process latent variable model (GPLVM) [10]. This model also has supervised counterparts [11], but, as in the case of FA, the supervised extension targets dimensionality reduction, and the idea is similar to the one in supervised PCA.

Proposed Models
In this section, we propose three models starting from the FA model, each in a new subsection: simple-supervised factor analysis (S2.FA), simple-semisupervised factor analysis (S3.FA), and missing simple-semisupervised factor analysis (MS3.FA). While S2.FA is applicable in the supervised case (regression), S3.FA is meant to be used in a semisupervised context. MS3.FA handles missing input data in a (semi)supervised scenario.
One important remark that will not be restated in this paper is that all the models (FA, S2.FA, S3.FA, MS3.FA) are fitted by maximizing the likelihood (the MLE principle) of the observed data. Another important observation is regarding the names of our proposed algorithms: the algorithms are called "simple-" (S2 = simple-supervised; S3 = simple-semisupervised; MS3 = missing simple-semisupervised) not only because they constitute a simple adaptation of the factor analysis model, but mostly because we created an adaptation of the (simple-)supervised FA model called (simple-)supervised PPCA, and we did not want this model to be confused with the already existing supervised PCA model in the literature. Simple-supervised probabilistic principal component analysis (S2.PPCA) is not discussed in this paper, but it is implemented and usable in the R package that we developed (https://github.com/aciobanusebi/s2fa; accessed on 31 July 2021) along with other undiscussed but related models: simple-semisupervised unconstrained factor analysis (S3.UncFA), simple-semisupervised probabilistic principal component analysis (S3.PPCA), missing simple-semisupervised unconstrained factor analysis (MS3.UncFA), and missing simple-semisupervised probabilistic principal component analysis (MS3.PPCA).

The S2.FA Model
The core of this subsection regards the S2.FA model, but in order to link it with LR, we also need to introduce a model similar to S2.FA: S2.UncFA. This link will make S2.FA a good candidate for replacing LR when D >> n. These three ideas (S2.FA and S2.UncFA, the S2.UncFA-LR link, and replacing LR via S2.FA) are expanded below. The first model that we propose is called simple-supervised factor analysis. It is a linear Gaussian system with slight changes: z ∼ N(µ_z, Σ_z) is now an observed output, z ∈ R^(d×1); x|z ∼ N(µ + Λz, Ψ), x ∈ R^(D×1), µ ∈ R^(D×1), Λ ∈ R^(D×d), and Ψ ∈ R^(D×D) a diagonal matrix. If we do not impose the constraint of Ψ being diagonal, we arrive at simple-supervised unconstrained factor analysis (S2.UncFA): the same model, but with Ψ ∈ R^(D×D) a symmetric and positive definite matrix.
In contrast with the factor analysis model, which is fitted via an EM algorithm, S2.UncFA and S2.FA are fitted via analytic formulas (see Propositions 4 and 5).
Proposition 4.
Let {(x^(i), z^(i)) | i ∈ {1, . . . , n}} be a data set where D is the dimensionality of the input data, d is the dimensionality of the output data (although in the context of this paper d = 1, we decided to present the more general results, d ≥ 1, in order for the reader to gain more insight; this is the reason why we write Σ_z and not just σ²_z, treat z^(i) as a vector and z̄ as a vector mean, etc.), {x^(i) | i ∈ {1, . . . , n}} is the input, and {z^(i) | i ∈ {1, . . . , n}} is the output. We suppose that the data was generated as follows: z ∼ N(µ_z, Σ_z), µ_z ∈ R^(d×1), Σ_z ∈ R^(d×d) a symmetric and positive definite matrix, and x|z ∼ N(µ + Λz, Ψ), µ ∈ R^(D×1), Λ ∈ R^(D×d), Ψ ∈ R^(D×D) a symmetric and positive definite matrix. Then, the parameters in the S2.UncFA algorithm (training phase) can be estimated via maximum likelihood using the following closed-form formulas:

µ̂_z = z̄ = (1/n) Σ_i z^(i),   (5)
Σ̂_z = (1/n) Σ_i (z^(i) − z̄)(z^(i) − z̄)^T,   (6)
Λ̂ = [Σ_i (x^(i) − x̄)(z^(i) − z̄)^T] [Σ_i (z^(i) − z̄)(z^(i) − z̄)^T]^(−1),   (7)
µ̂ = x̄ − Λ̂z̄,   (8)
Ψ̂ = (1/n) Σ_i (x^(i) − µ̂ − Λ̂z^(i))(x^(i) − µ̂ − Λ̂z^(i))^T.   (9)

For the testing/prediction phase, one uses the formula for z|x from (4).

Proposition 5.
[We will denote the parameter Ψ̂ in (9) as Ψ̂_S2.UncFA.] For the S2.FA algorithm, (9) is replaced by

Ψ̂_S2.FA = diag(Ψ̂_S2.UncFA),   (10)

where "diag" takes the diagonal of a matrix and returns the corresponding diagonal matrix.
The proof of Equation (10) is relatively simple, and we skip it for brevity. It can be found in [8] pp. 21-23. For the step-by-step S2.FA algorithm and also for the matrix form of the algorithm, see Appendix B. Linear regression and S2.UncFA have the same prediction function after fitting, as we claim and prove below.
Proposition 6.
Let {(x^(i), z^(i)) | i ∈ {1, . . . , n}} be a data set where D is the dimensionality of the input data, d is the dimensionality of the output data (the same observation as earlier: in the context of this paper, only the "d = 1" case is relevant), {x^(i) | i ∈ {1, . . . , n}} is the input, and {z^(i) | i ∈ {1, . . . , n}} is the output.
One can fit an S2.UncFA model and obtain, via the relationships (5)-(9), µ̂_z, Σ̂_z, Ψ̂, Λ̂, µ̂. Remember that at the test phase (see (4)), the predicted value is

E[z|x] = µ̂_z + Σ̂_zΛ̂^T (Λ̂Σ̂_zΛ̂^T + Ψ̂)^(−1) (x − µ̂ − Λ̂µ̂_z).

One can fit a linear regression model and obtain ŵ, b̂ from (1) and (2). Remember that at the test phase, the predicted value is ŵ^T x + b̂. For d = 1, the two predicted values coincide.

Proof. In the following, let S_xx = (1/n) Σ_i (x^(i) − x̄)(x^(i) − x̄)^T, S_xz = (1/n) Σ_i (x^(i) − x̄)(z^(i) − z̄)^T, and S_zx = S_xz^T. We begin by computing Ψ̂. Since µ̂ = x̄ − Λ̂z̄, we have x^(i) − µ̂ − Λ̂z^(i) = (x^(i) − x̄) − Λ̂(z^(i) − z̄), so expanding (9) gives

Ψ̂ = S_xx − Λ̂S_zx − S_xzΛ̂^T + Λ̂Σ̂_zΛ̂^T.

We continue by computing Λ̂Σ̂_zΛ̂^T. From (6) and (7), Λ̂Σ̂_z = S_xz, so Λ̂Σ̂_zΛ̂^T = S_xzΛ̂^T = (Λ̂S_zx)^T; since (Λ̂Σ̂_zΛ̂^T)^T = Λ̂Σ̂_zΛ̂^T, the matrix Λ̂Σ̂_zΛ̂^T is symmetric and equals both S_xzΛ̂^T and Λ̂S_zx. Replacing these two terms in the expression of Ψ̂ above, we get

Ψ̂ = S_xx − Λ̂Σ̂_zΛ̂^T, and therefore Λ̂Σ̂_zΛ̂^T + Ψ̂ = S_xx = (1/n)XX^T − x̄x̄^T.

Observation: the result is exactly the maximum likelihood estimate of the covariance matrix Σ of the input data set if x ∼ N(µ, Σ): Σ_MLE = (1/n)XX^T − x̄x̄^T. This is natural because, according to (4), we have x ∼ N(µ + Λµ_z, ΛΣ_zΛ^T + Ψ), and there are enough free parameters in ΛΣ_zΛ^T + Ψ to represent any symmetric positive definite matrix.
We return to the initial computation. Since x − µ̂ − Λ̂µ̂_z = x − (x̄ − Λ̂z̄) − Λ̂z̄ = x − x̄, the S2.UncFA prediction becomes

E[z|x] = z̄ + Σ̂_zΛ̂^T S_xx^(−1)(x − x̄) = z̄ + S_zx S_xx^(−1)(x − x̄),

and for d = 1, S_xx^(−1) S_xz = ŵ by (1) and z̄ − ŵ^T x̄ = b̂ by (2), so E[z|x] = ŵ^T x + b̂, which is exactly the LR prediction. ∎

The S2.FA Model: A New Approach for LR When D >> n
Since FA can be used to estimate the density of a data set when D >> n, and S2.UncFA is equivalent to LR, we consider S2.FA as a new approach to extend LR when D >> n besides the two solutions mentioned in Section 2.
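The LR-S2.UncFA equivalence can be checked numerically. The sketch below (synthetic data of our own choosing, with n > D so that the plain LR closed form exists, and d = 1) fits both models via their closed-form formulas and compares the predictions; the posterior mean is computed in the form µ̂_z + Σ̂_zΛ̂^T(Λ̂Σ̂_zΛ̂^T + Ψ̂)^(−1)(x − µ̂ − Λ̂µ̂_z), equivalent to (4):

```python
import numpy as np

rng = np.random.default_rng(2)
n, D = 100, 5                          # n > D so plain LR's closed form exists
X = rng.normal(size=(n, D))
z = X @ rng.normal(size=D) + 0.3 * rng.normal(size=n)   # scalar output (d = 1)

xbar, zbar = X.mean(axis=0), z.mean()
Xc, zc = X - xbar, z - zbar
S_xx = Xc.T @ Xc / n                   # centered input scatter
S_xz = Xc.T @ zc / n                   # input-output cross scatter

# Linear regression, Equations (1)-(2)
w = np.linalg.solve(S_xx, S_xz)
b = zbar - w @ xbar

# S2.UncFA closed-form fit (d = 1; the loading Lam is stored as a D-vector)
Sigma_z = zc @ zc / n                  # scalar
Lam = S_xz / Sigma_z
mu = xbar - Lam * zbar
Psi = S_xx - np.outer(Lam, Lam) * Sigma_z            # unconstrained Psi
C = np.outer(Lam, Lam) * Sigma_z + Psi               # marginal covariance of x

# Both prediction functions on a fresh test point agree
x_test = rng.normal(size=D)
pred_fa = zbar + Sigma_z * (Lam @ np.linalg.solve(C, x_test - xbar))
pred_lr = w @ x_test + b
assert abs(pred_fa - pred_lr) < 1e-8
```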

The S3.FA Model
Factor analysis is a classic generative unsupervised model. Its supervised counterpart is S2.FA, as shown in the previous subsection. Those two can be merged into a semisupervised model that we propose here, named simple-semisupervised factor analysis (S3.FA): the labeled points follow the S2.FA generative process, while for the unlabeled points the output z is treated as latent, as in FA. If we were to speak about Gaussian naive Bayes and the GMM, good hints for combining these two supervised-unsupervised counterparts into a semisupervised model can be found in [12]. We applied those hints to our supervised-unsupervised counterparts, S2.FA and FA, and created an EM algorithm to fit an S3.FA model. For the step-by-step algorithm and the matrix form of the algorithm, see Appendix C. For more elaborate notations and the proof, see [8] pp. 26-34.

The MS3.FA Model
The algorithm that fits an S3.FA model can also be adapted for the case when not all the components of x are known. We call the resulting model missing simple-semisupervised factor analysis (MS3.FA): the generative process is the same, z ∼ N(µ_z, Σ_z), x|z ∼ N(µ + Λz, Ψ), Λ ∈ R^(D×d), and Ψ ∈ R^(D×D) a diagonal matrix; each component of x: x_1, . . . , x_D is either observed or latent.
The resulting algorithm that fits an MS3.FA model is an EM algorithm. For the step-by-step algorithm, see Appendix D. For more elaborate notations and the proof, see [8] pp. 34-37.

Experiments
In this section, we include the experiments we carried out on data with D >> n using the S2.FA, S3.FA, and MS3.FA models, comparing them with other methods. In all the experiments, we computed errors between the real values and the predicted values; the metric we used is mean squared error (MSE):

MSE = (1/N) Σ_(i=1..N) (real_i − predicted_i)²,

where N is the number of the unknown elements whose real and predicted values are real_i and predicted_i, respectively; an unknown element represents an output number for S2.FA and S3.FA or an input/output number for MS3.FA. We ran each experiment five times and computed a 95% confidence interval using the t-distribution. Furthermore, in each experiment, we used the same three data sets: gas sensor array under flow modulation, atp1d, and m5spec. All three of these data sets have multiple outputs, but for each data set, we selected only the first output column that appears in the text data file and used it as the output: ace_conc for the gas sensor array under flow modulation data set, LBL_ALLminpA_fut_001 for atp1d, and the first column in the propvals file for m5spec.
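The error metric and the confidence intervals can be computed as in the following sketch (the function names are ours, and the five run values are illustrative only):

```python
import numpy as np
from scipy import stats

def mse(real, predicted):
    # MSE over the N unknown elements
    real, predicted = np.asarray(real, float), np.asarray(predicted, float)
    return float(np.mean((real - predicted) ** 2))

def t_confidence_interval(values, level=0.95):
    # Confidence interval for the mean of repeated-run errors (t-distribution)
    values = np.asarray(values, dtype=float)
    half = stats.t.ppf((1 + level) / 2, df=len(values) - 1) * stats.sem(values)
    return values.mean() - half, values.mean() + half

runs = [0.90, 1.10, 1.00, 0.95, 1.05]      # hypothetical MSEs from five runs
lo, hi = t_confidence_interval(runs)       # reported as an interval around the mean
```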
We preprocessed each data set simply by dropping the constant columns. Only the second data set has constant columns: from 432 columns, we obtain 370.

The S2.FA Model: Experiment
The experiment concerning S2.FA covers the comparison of the three solutions presented so far for LR when D >> n: the Moore-Penrose inverse, ridge regression (L2 regularization), and S2.FA. Each data set was split into a training part (80%) and a testing part (20%). If the model had hyperparameters (ridge regression), the training part was further split into a new training part (60% of the whole data set) and a validation part (20% of the whole data set) in order to be able to set the hyperparameters; for ridge regression, we used a simple technique: pick the α ∈ {10², 10^1.9, 10^1.8, . . . , 10^(−1.9), 10^(−2)} that attains the minimum validation error. After setting the hyperparameters, we train a new model on the initial training part (80% of the whole data set) and obtain the final model. All of the MSE errors are reported on the testing part and shown in Table 1.
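The hyperparameter selection procedure can be sketched as follows (synthetic data and helper names are ours, and, for simplicity, this sketch also penalizes the bias term, unlike some ridge formulations):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Ridge on the augmented design matrix; the bias column is penalized too
    Xt = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xt.T @ Xt + alpha * np.eye(Xt.shape[1]), Xt.T @ y)

def ridge_predict(theta, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ theta

# Synthetic D >> n data and a 60/20/20 train/validation/test split
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 100))
y = X[:, 0] + 0.1 * rng.normal(size=50)
idx = rng.permutation(50)
tr, va = idx[:30], idx[30:40]          # the remaining 20% would be the test part

# alpha grid: {10^2, 10^1.9, ..., 10^-1.9, 10^-2}
grid = 10.0 ** np.linspace(2, -2, 41)
val_err = [np.mean((ridge_predict(ridge_fit(X[tr], y[tr], a), X[va]) - y[va]) ** 2)
           for a in grid]
best_alpha = grid[int(np.argmin(val_err))]

# Refit on the whole initial training part (train + validation) with the chosen alpha
trva = np.concatenate([tr, va])
theta = ridge_fit(X[trva], y[trva], best_alpha)
```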
As one may notice, the best method is different for each data set, so our general advice is to use all the methods on a given data set and pick the best one.

Table 1. Simple-supervised factor analysis (S2.FA) experiment: mean squared error (MSE) 95% confidence intervals on three data sets using three methods for regression when D >> n; the best MSE means are marked in bold.

The S3.FA Model: Experiment
The experiment concerning S3.FA compares supervised and semisupervised regression methods when D >> n, among them S2.FA, S3.FA, the Moore-Penrose solution, a method based on [16] with the parameter alpha set to 1, and a semisupervised method inspired from [17]. Each data set was split into a training part (80%) and a testing part (20%). We retained from the training part 5%, 10%, 15%, 20%, 25%, . . . , 100% of the output labels. For the supervised methods, we used only the labeled data in the training set, and for the semisupervised methods, we used the full training set when fitting the model. We initialized the S3.FA method with the fitted parameters returned by the S2.FA algorithm. All of the MSE errors are reported on the testing part and shown in Figures 1-3 and in Table 2. The figures contain all the results from using 5% to 100% of the output labels, but we include less information in the table for brevity.

We notice that on the selected data sets, S3.FA returns poorer results even than S2.FA, which uses only the labeled data. As in the previous experiment, the best method is data-dependent. The models that show a greater amount of variability compared to the others are S3.FA (in two data sets) and Moore-Penrose (in one data set). Moreover, as expected, the errors tend to decrease as the percentage of labeled training data points increases; the plots do not help us in this regard, but this decrease can be seen numerically in Table 2.

The MS3.FA Model: Experiment
The experiment concerning MS3.FA includes a comparison of two different types of algorithms for data imputation when D >> n:
• Mean imputation: for a given attribute (input column), compute its mean ignoring the missing values, then replace the missing data on that attribute with this computed mean;
• MS3.FA.
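The mean imputation baseline described above can be sketched in a few lines (the helper name is ours):

```python
import numpy as np

def mean_impute(X):
    """Column-wise mean imputation: replace NaNs in each attribute with
    that attribute's mean over the observed values."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)          # ignores the missing values
    missing = np.isnan(X)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_imp = mean_impute(X)   # the NaNs become the column means: 2.0 and 6.0
```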
We also tried two other R packages, mice [18] and Amelia [19], but they could not be applied successfully on our data sets, perhaps because of their peculiarity: D >> n.
For each data set, we removed 10%, 20%, 30%, 40%, 50%, and 60% of the input cells (we could have added missing data also in the output, but we wanted to focus on the missing input data scenario and not on the semisupervised case) and imputed those via the above-mentioned algorithms. The results are presented in Figures 4-6 and in Table 3.
From these results, we discover that MS3.FA is better than mean imputation on two data sets, and, as expected, the error increases as the percentage of missing data increases.

Figure 6. MS3.FA experiment: MSE 95% confidence intervals on the m5spec data set using two methods for imputing missing data when D >> n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.

Conclusions and Future Work
The initial purpose of this paper was to extend an already existing model: factor analysis. We developed its supervised counterpart (S2.FA) and noticed that the unconstrained version (S2.UncFA) is equivalent to linear regression. Because FA is applied in density estimation when the dimensionality of the data is greater than the number of samples, and because of the already mentioned equivalence, the purpose of the paper became to analyze this new method of applying LR when D >> n, i.e., via S2.FA. Since FA and S2.FA are generative models and are unsupervised-supervised counterparts, we combined both into a new model, S3.FA, as an extension of LR to semisupervised learning when D >> n. The final extension regards missing data; it is called MS3.FA. We developed an R package (s2fa) with these algorithms; it can be found on GitHub. The experimental parts included several comparisons in the D >> n scenario:
• of S2.FA with other techniques extending LR to the D >> n case,
• of S3.FA with other (semi)supervised regression methods,
• of MS3.FA with another data imputation algorithm.
The bottom line is that we do not necessarily recommend S3.FA for semisupervised regression since our results suggest that it gives poor results, but we encourage the consideration of S2.FA for regression and MS3.FA for missing data imputation as algorithms to be compared with others on a given data set.
As for future work, we could further explore the S2.FA, S3.FA, and MS3.FA algorithms when z is a real vector, not just a real number. Moreover, we can experiment with the PPCA version of the algorithms. Questions regarding the time complexity-empirical or not-can be also addressed; we expect the fitting time to be impractical if the number of columns is large. Because there are models such as mixture of factor analyzers [20] and mixture of linear regression models ([21] Section 14.5.1), another research direction involves mixtures of S2.FAs. Another idea would be to investigate the memory resources required by the algorithms when the data set increases and also to consider scalable systems such as Spark [22] for implementation.

Acknowledgments:
We thank Cristian Gaţu and Daniel Stamate for attentive proofreading of and useful comments on previous versions of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

LR      linear regression
FA      factor analysis
UncFA   unconstrained factor analysis
GMM     Gaussian mixture model
EM      expectation-maximization
KL      Kullback-Leibler
MLE     maximum likelihood estimation
ELBO    evidence lower bound
MSE     mean squared error
PCA     principal component analysis
PPCA    probabilistic principal component analysis
GPLVM   Gaussian process latent variable model
S2.FA   simple-supervised factor analysis
S3.FA   simple-semisupervised factor analysis
MS3.FA  missing simple-semisupervised factor analysis

Appendix A. On the Expectation-Maximization Algorithm
This section provides theoretical details on the EM algorithm [23]. These are relevant in order to establish a link between EM and the information theory field. A latent variable model assumes that the data we observed (usually denoted by the random variable X) is not complete: there is also some latent data, modeled via random variables usually denoted by Z. Often, pairs consisting of an observed point and a latent one constitute the complete data set; e.g., in a GMM, the latent data is the cluster number, and such a number is assigned to each point in the observed data set.
Usually, in a latent variable model, the likelihood of the observed data is not tractable (although there are exceptions, like in the GMM or factor analysis), and therefore we cannot maximize it directly. Instead, we maximize a lower bound for the log-likelihood function called the ELBO (evidence lower bound). To simplify the discussion, we consider that we have one single observed data point, x. We will also consider the case where Z is continuous; when Z is discrete, the ∫ sign is replaced by the ∑ sign.
Let q be any distribution over z, where the z values are the possible values of the random variable Z. The log-likelihood of the observed data point x is:

log p(x; θ) = log ∫ p(x, z; θ) dz = log ∫ q(z) [p(x, z; θ)/q(z)] dz ≥ ∫ q(z) log [p(x, z; θ)/q(z)] dz =: ELBO(θ, q),

by Jensen's inequality. Furthermore, the following relationship, important for the E step of the EM algorithm (see below), can be proven:

ELBO(θ, q) = log p(x; θ) − KL(q(z) ∥ p(z|x; θ)),

so the bound is tight, i.e., ELBO(θ, q) = log p(x; θ), exactly when q(z) = p(z|x; θ). Moreover, the following relationship, important for the M step of the EM algorithm (see below), can be proven:

ELBO(θ, q) = E_(z∼q)[log p(x, z; θ)] + H(q),

where H(q) is the entropy of q. Now, instead of carrying out max_θ log p(x; θ), we will execute max_(θ,q) ELBO(θ, q), since the ELBO is a lower bound for the log-likelihood of x, and hence its maximization will not hurt the process of maximizing log p(x; θ).
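These relationships can be verified numerically on a toy model with a discrete latent variable z ∈ {0, 1} (all parameter values below are arbitrary): the ELBO never exceeds log p(x; θ), and it attains it exactly at the posterior.

```python
import numpy as np

# Toy latent variable model: z in {0, 1}, p(z) = priors[z], p(x|z) = N(x; mus[z], 1)
priors = np.array([0.4, 0.6])
mus = np.array([-1.0, 2.0])
x = 0.5

log_joint = np.log(priors) - 0.5 * np.log(2 * np.pi) - 0.5 * (x - mus) ** 2
log_px = np.logaddexp(log_joint[0], log_joint[1])    # log p(x)
posterior = np.exp(log_joint - log_px)               # p(z|x)

def elbo(q):
    # ELBO(q) = E_q[log p(x, z)] + H(q) for a distribution q over z
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * log_joint) - np.sum(q * np.log(q)))

for q0 in (0.1, 0.3, 0.5, 0.9):
    assert elbo([q0, 1 - q0]) <= log_px + 1e-12      # ELBO lower-bounds log p(x)
assert abs(elbo(posterior) - log_px) < 1e-9          # equality at q = p(z|x)
```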
The ELBO can be maximized in at least two ways:
• via (block) coordinate ascent: the resulting meta-algorithm is the EM algorithm. In fact, this is the case for many classic models (EM for the GMM [3], EM for factor analysis [5], etc.). EM is an iterative algorithm, and an iteration encompasses two steps:
1. E step: q^(t) = arg max_q ELBO(θ^(t−1), q), for θ^(t−1) fixed (from the previous iteration). The maximizer is the posterior, q^(t)(·) = p(·|x; θ^(t−1)); in this case, we have log p(x; θ^(t−1)) = ELBO(θ^(t−1), q^(t)). So, we obtained the distribution q^(t) as a posterior distribution. Although in classic models where conjugate priors are used it is tractable to compute p(·|x; θ^(t−1)) (this type of inference is called analytical inference), in other models this is not the case, and a solution to this shortcoming is represented by approximate/variational inference.
2. M step: θ^(t) = arg max_θ ELBO(θ, q^(t)), for q^(t) fixed. Since ELBO(θ, q) = E_q[log p(x, z; θ)] + H(q) and H(q^(t)) does not depend on θ, we have:

θ^(t) = arg max_θ E_(z∼q^(t))[log p(x, z; θ)].

So, we obtained a relatively simpler term to maximize. Note that the maximization is further customized using the probabilistic assumptions at hand.
• via gradient ascent: this is the case of the variational autoencoder [24], which will not be discussed since it is not necessarily relevant to this study.