Logistic Biplot by Conjugate Gradient Algorithms and Iterated SVD

Multivariate binary data are increasingly frequent in practice. Although some adaptations of principal component analysis are used to reduce dimensionality for this kind of data, none of them provide a simultaneous representation of rows and columns (biplot). Recently, a technique named logistic biplot (LB) has been developed to represent the rows and columns of a binary data matrix simultaneously, even though the algorithm used to fit the parameters is too computationally demanding to be useful in the presence of sparsity or when the matrix is large. We propose the fitting of an LB model using nonlinear conjugate gradient (CG) or majorization–minimization (MM) algorithms, and a cross-validation procedure is introduced to select the hyperparameter that represents the number of dimensions in the model. A Monte Carlo study that considers scenarios with several sparsity levels and different dimensions of the binary data set shows that the procedure based on cross-validation is successful in the selection of the model for all algorithms studied. The comparison of the running times shows that the CG algorithm is more efficient in the presence of sparsity and when the matrix is not very large, while the performance of the MM algorithm is better when the binary matrix is balanced or large. As a complement to the proposed methods and to give practical support, a package called BiplotML has been written in the R language. To complete the study, real binary data on gene expression methylation are used to illustrate the proposed methods.


Introduction
In many studies, researchers have a binary multivariate data matrix and aim to reduce dimensions to investigate the structure of the data. For example, in the measurement of brand equity, a set of consumers evaluates the perceptions of quality, perceptions of value, or other brand attributes that can be represented in a binary matrix [1]; in the evaluation of the impact of public policies, the answers used to identify whether the beneficiaries have some characteristics or whether some economic or social conditions have changed from a baseline are usually binary [2][3][4]. Likewise, in biological research, and in particular in the analysis of genetic and epigenetic alterations, the amount of binary data has been increasing over time [5]. In these cases, classical methods to reduce dimensionality, such as principal component analysis (PCA), are not appropriate.
This problem has received considerable attention in the literature; consequently, different extensions of PCA have been proposed. From a probabilistic perspective, Collins et al. [6] provide a generalization of PCA to exponential family data using the generalized linear model framework. This approach suggests the possibility of having proper likelihood loss functions depending on the type of data.
Logistic PCA is the extension of the classic PCA method for binary data and was studied by Schein et al. [7] using the Bernoulli likelihood, where an alternating least squares method is used to estimate the parameters. De Leeuw [8] proposed the calculation of the maximum likelihood estimates of a PCA on the logit or probit scale, using an MM algorithm that iterates a sequence of weighted or unweighted singular value decompositions. Subsequently, Lee et al. [9] introduced sparsity to the loading vectors defined on the logit transform of the success probabilities of the binary observations and estimated the parameters using an iterative weighted least squares algorithm, but the algorithm is computationally too demanding to be useful when the data dimension is high. To solve this problem, a different method was proposed by Lee and Huang [10] using a combined algorithm with coordinate descent and MM to reduce the computational effort. More recently, Landgraf and Lee [11] proposed a formulation that does not require matrix factorization, and they use an MM algorithm to estimate the parameters of the logistic PCA model. Song et al. [12] proposed the fitting of a logistic PCA model using an MM algorithm across a non-convex singular value threshold to alleviate the overfitting issues. However, none of these approaches provides a simultaneous representation of rows and columns to visualize the binary data set, similar to what is called a biplot for continuous data [13].
In cases where the variables of the data matrix are not continuous, a classical linear biplot representation is not suitable. Gabriel [25] proposed the "bilinear regression" to fit a biplot for data with distributions from the exponential family, but the algorithm was not clearly established and was never used in practice. For multivariate binary data, Vicente-Villardon et al. [26] proposed a biplot called Logistic Biplot (LB), which is a dimension reduction technique that generalizes PCA to cope with binary variables and has the advantage of simultaneously representing individuals and variables. In the LB, each individual is represented by a point and each variable by a directed vector, and the orthogonal projection of each point onto these vectors predicts the expected probability that the characteristic occurs. The method is related to logistic regression in the same way that classical biplot analysis is related to linear regression. Likewise, just as linear biplots are related to PCA, the LB is related to Latent Trait Analysis (LTA) or Item Response Theory (IRT).
The authors estimate the parameters of the LB model by a Newton-Raphson algorithm, but this presents some problems in the presence of separation or sparsity. In [27], the method is extended using a combination of principal coordinates and standard logistic regression to approximate the LB model parameters in the genotype classification context and is called an external logistic biplot, but the procedure of estimation is quite inefficient for big data matrices. More recently, in [28], the external logistic biplot method was extended for mixed data types, but the estimation algorithm still has problems with big data matrices or in the presence of sparsity. Therefore, there is a clear need to extend the previous algorithms for the LB model because they are not very efficient and none of them present a procedure for choosing the number of dimensions of the final solution of the model.
In the context of supervised learning, some optimization methods have been successfully implemented for logistic regression. For example, Komarek and Moore [29] developed the Truncated-Regularized Iteratively Reweighted Least Squares (TR-IRLS) technique, which implements a linear Conjugate Gradient (CG) method to approximate the Newton direction. The algorithm is especially useful for large, sparse data sets because it is fast, accurate, robust to linear dependencies, and requires no data preprocessing. Another advantage of the CG method is that it guarantees convergence in a maximum number of steps [30]. On the other hand, when the class imbalance is extreme, the situation is known as the rare events problem or the imbalanced data problem, which presents several challenges to existing classification algorithms [31]. Maalouf and Siddiqi [32] developed a method of Rare Event Weighted Logistic Regression (RE-WLR) for the classification of imbalanced data on large-scale data.
In this paper, we propose the estimation of the parameters of the LB model in two different ways: one of these is to use CG methods, and the other way is to use a coordinate descendent MM algorithm. In addition, we incorporate a cross-validation procedure to estimate the generalization error and thus choose the number of dimensions in the LB model.
Taking into account the latent variables and the model specification that define an LB, a simulation process is carried out that allows for the evaluation of the performance of the algorithms and their ability to identify the correct number of dimensions to represent the multivariate binary data matrix adequately. Besides the proposed methods, the BiplotML package [33] was written in the R language [34] to give practical support to the new algorithms.
The paper is organized into the following sections. Section 2.1 presents the classical biplot for continuous data. Next, Section 2.2 presents the formulation of the LB model. Section 2.3 introduces the proposed adaptation of the CG algorithm and the coordinate descendent MM algorithm to fit the LB model. Section 2.4 describes the model selection procedure (number of dimensions) and introduces our formulation via simulated data. Section 3 presents the performance of the proposed models and an application using real data. Finally, Section 4 presents a discussion of the main results.

Biplot for Continuous Data
Classical biplot methods allow the combined visualization of the rows and columns of a data matrix X = (x 1 , . . . , x n ) T , with x i ∈ R p , i = 1, . . . , n being a vector of the observations of an individual i on p variables in low-dimensional space [13,14].
More generally, if rank(X) = r and the data are centered, given a positive integer k ≤ r, a biplot is an approximation

X = AB^T + E, (1)

where A and B are matrices of rank k, and E is the matrix that contains the approximation errors. In this way, the matrix X can be graphically represented using markers (points or vectors) a_1, ..., a_n for its rows and b_1, ..., b_p for its columns, in such a way that the ij-th element of the matrix, x_ij, is approximated by the inner product a_i^T b_j, and the natural space of parameters is determined by Θ = AB^T.
It is well known that we can reproduce the matrix exactly in dimension r using its singular value decomposition (SVD),

X = UΛV^T,

where U = [u_1, ..., u_r] and V = [v_1, ..., v_r] are the matrices of left and right singular vectors and Λ is the r-dimensional diagonal matrix containing the singular values in decreasing order λ_1 ≥ ... ≥ λ_r > 0. It is also known that the best rank-k approximation of X is obtained from

X_(k) = U_(k) Λ_(k) V_(k)^T,

where the subscript (k) stands for the first k columns of the matrix. We can define a biplot in the form of Equation (1) by taking A = U_(k) Λ_(k)^γ and B = V_(k) Λ_(k)^(1−γ), 0 ≤ γ ≤ 1. Then, Θ = AB^T minimizes the Frobenius norm of the error for any value of γ. Note that, for example, if γ = 1, A contains the coordinates of the individuals on the principal components and B is the projection of the initial coordinate axes onto them. There are algorithms other than the SVD used to calculate the markers, such as alternated regressions [13] or NIPALS [35].
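As a concrete illustration of this factorization, the following sketch (in Python with NumPy; the function and variable names are our own, not from the BiplotML package) computes row and column markers of a rank-k biplot from the SVD of the column-centered data:

```python
import numpy as np

def biplot_markers(X, k=2, gamma=1.0):
    """Row and column markers of a rank-k biplot of a data matrix X.

    A = U_(k) Lambda_(k)^gamma, B = V_(k) Lambda_(k)^(1-gamma),
    so Theta = A B^T is the best rank-k approximation of centred X.
    """
    Xc = X - X.mean(axis=0)                # column-centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :k] * s[:k] ** gamma          # row markers (individuals)
    B = Vt[:k].T * s[:k] ** (1.0 - gamma)  # column markers (variables)
    return A, B

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
A, B = biplot_markers(X, k=2, gamma=1.0)
# A @ B.T is the closest rank-2 approximation (Frobenius norm) of centred X
```

With gamma = 1 the row markers are the principal component scores; the product A @ B.T is the same for any gamma in [0, 1].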

Logistic Biplot
Let X = (x_1, ..., x_n)^T be a binary matrix, with x_i ∈ {0, 1}^p, i = 1, ..., n, and x_ij ∼ Ber(π(θ_ij)), where π(·) is the inverse link function. In this paper, the logit link is used, π(θ_ij) = (1 + exp(−θ_ij))^(−1), which represents the expected probability that character j is present in individual i; the log-odds of π(θ_ij) is θ_ij = log[π(θ_ij)/(1 − π(θ_ij))], which corresponds to the natural parameter of the Bernoulli distribution expressed in exponential family form. Using the probability distribution, we have

P(x_ij; θ_ij) = π(θ_ij)^(x_ij) (1 − π(θ_ij))^(1−x_ij),

and the loss function is obtained as the negative log-likelihood

L(Θ) = −∑_{i=1}^{n} ∑_{j=1}^{p} [x_ij log π(θ_ij) + (1 − x_ij) log(1 − π(θ_ij))].

In this case, it is not appropriate to center the columns because the centered matrix would no longer be made up of elements equal to zero or one. Therefore, we extend the specification of the natural parameter space by introducing variable main effects, or the column offset term µ, for a model-based centering. The canonical parameter matrix Θ = (θ_1, ..., θ_n)^T can be represented by a low-dimensional structure for some integer k ≤ r that satisfies θ_i = µ + ∑_{s=1}^{k} a_is b_s, i = 1, ..., n, which, expressed in matrix form, is written as

Θ = 1_n µ^T + AB^T,

where 1_n is an n-dimensional vector of ones; µ = (µ_1, ..., µ_p)^T; A = (a_1, ..., a_n)^T with a_i ∈ R^k, i = 1, ..., n; B = (b_1, ..., b_k) with b_j ∈ R^p, j = 1, ..., k; and Π = π(Θ) is the predicted matrix whose ij-th element equals π(θ_ij). Now, Θ = logit(Π) is a biplot on the logit scale with log-odds θ_ij = µ_j + a_i^T b_j. In addition to the canonical parameter matrix Θ, the hyperparameter k must also be estimated to select the LB model. In general, in dimensionality reduction methods, the choice of k is problematic and presents a risk of overfitting [36]. We propose the use of a completely unsupervised procedure based on cross-validation for LB model selection to counteract such overfitting.
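The negative log-likelihood above simplifies, cell by cell, to log(1 + exp(θ_ij)) − x_ij θ_ij. A minimal numerically stable sketch in Python (names are illustrative, not the BiplotML implementation):

```python
import numpy as np

def lb_loss(X, mu, A, B):
    """Negative Bernoulli log-likelihood of Theta = 1_n mu^T + A B^T."""
    Theta = mu[None, :] + A @ B.T        # natural parameters (log-odds)
    # each cell contributes log(1 + exp(theta)) - x*theta;
    # np.logaddexp keeps the computation stable for large |theta|
    return float(np.sum(np.logaddexp(0.0, Theta) - X * Theta))

# with Theta = 0 every cell contributes log(2)
X = np.array([[1.0, 0.0]])
print(lb_loss(X, np.zeros(2), np.zeros((1, 1)), np.zeros((2, 1))))  # ≈ 2*log(2)
```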
Cross-validation is operationally simple and allows the selection of the number of dimensions at the point where a numerical criterion is minimized, which will indicate that choosing more dimensions would lead to the overfitting of the model.
Once the parameters have been estimated, the geometry of the biplot regression allows us to obtain the direction vector g_j that projects a row marker of A to predict the values in column j [21,26]. For example, to find the coordinates for a fixed probability π when k = 2, we look for the point (g_j1, g_j2) that predicts π and that is on the biplot axis, i.e., on the line joining the points (0, 0) and (b_j1, b_j2), which is calculated as

(g_j1, g_j2) = [(logit(π) − µ_j) / (b_j1² + b_j2²)] (b_j1, b_j2).

In the logistic biplot, as in the classic PCA biplot, all directions pass through the origin. In a PCA biplot for centered data, the origin represents the mean of each variable, and the arrow shows the direction of increasing values. As the binary data cannot be centered (we keep the column offset term in the model), the origin does not represent any particular value of the probability. In the LB model, we represent the variables with arrows (segments) starting at the point that predicts 0.5 and ending at the point that predicts 0.75.
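The fixed-probability point can be sketched as follows: the point g_j = t·b_j on the biplot axis satisfies µ_j + t·b_j^T b_j = logit(π), and we solve for the scalar t (illustrative Python, our own naming):

```python
import numpy as np

def prob_marker(mu_j, b_j, prob):
    """Point on the biplot axis of variable j that predicts `prob`.

    Solves mu_j + t * (b_j . b_j) = logit(prob) for t and returns g_j = t * b_j.
    """
    t = (np.log(prob / (1.0 - prob)) - mu_j) / np.dot(b_j, b_j)
    return t * b_j

# arrow for a variable: from the 0.5 point to the 0.75 point
b = np.array([0.8, 0.3])
start = prob_marker(mu_j=-0.2, b_j=b, prob=0.50)
end = prob_marker(mu_j=-0.2, b_j=b, prob=0.75)
```

By construction, the predicted log-odds at the returned point, µ_j + g_j^T b_j, equals logit(prob).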

Parameter Estimation
Expressing the loss function given in (5) in terms of µ, A, and B, the parameters can be estimated with the algorithms described in the following subsections.

Estimation Using Nonlinear Conjugate Gradient Algorithms
Let ∇L(Θ) be a matrix whose ijth element equals π(θ ij ) − x ij , which will be expressed in matrix form as ∇L(Θ) = Π − X.
As θ_ij = µ_j + a_i^T b_j, the function L(Θ) involves the matrices A and B through π(θ_ij) = π(µ_j + a_i^T b_j). In this way, it is also possible to calculate the partial derivatives with respect to µ, A, and B (see Appendix A):

∂L/∂µ = (Π − X)^T 1_n,  ∂L/∂A = (Π − X)B,  ∂L/∂B = (Π − X)^T A.

For the model to be identifiable, and to avoid the elements of B becoming too small due to changes in the scale of A rather than changes in the log-likelihood, it is usually required that A^T A = I_k [9]. However, the maximum likelihood estimation of the model with this constraint tends to overfit the data [12]. In this paper, we use an approach that allows us to control the scale through the simultaneous updating of the parameters.
The initialization of the algorithm starts at the point Θ_0 = 1_n µ_0^T + A_0 B_0^T, which can be configured by the user or by using a random initialization. For a random initialization, all elements of A_0 = (a_1^0, ..., a_n^0)^T, B_0 = (b_1^0, ..., b_k^0), and µ_0 can be sampled from the uniform distribution. We choose the first search direction d_0 to be the steepest descent direction at the initial point Θ_0; then, a line search method is used to calculate the step-length parameter α_l, and, using a scalar β_l, we define the rule for updating the direction based on the gradient at every iteration, thus simultaneously updating µ, A, and B and hence the natural parameter space Θ. The pseudocode is written formally in Algorithm 1. This method is considered to have low computational cost because each iteration only requires the evaluation of the gradient and the loss function, so it can be efficient for large data sets [37,38]. In this paper, we use four well-known formulas for the update direction: Fletcher-Reeves (FR), Polak-Ribiere-Polyak (PRP), Hestenes-Stiefel (HS), and Dai-Yuan (DY) [39][40][41][42], which can be written as

β_l^FR = ‖∇L_{l+1}‖² / ‖∇L_l‖²,  β_l^PRP = ∇L_{l+1}^T Δ_l / ‖∇L_l‖²,
β_l^HS = ∇L_{l+1}^T Δ_l / (d_l^T Δ_l),  β_l^DY = ‖∇L_{l+1}‖² / (d_l^T Δ_l),

where Δ_l = ∇L_{l+1} − ∇L_l, and ‖·‖ is the Euclidean norm. These methods have had modifications that combine the formulas [43][44][45][46][47][48][49] and that could be adapted to fit an LB model.

Algorithm 1 CG Algorithm for fitting an LB model. (At each iteration: evaluate the gradient, obtain α_l by a line search, update µ, A, and B, and compute β_{l+1} according to one of the formulas given in (13).)

To achieve convergence in the implementation of the CG algorithm, the step-length obtained with a line search is usually required to be exact or to satisfy the strong Wolfe conditions, with 0 < c_1 < c_2 < 1. As the FR method has been shown to be globally convergent under strong Wolfe line searches with c_2 < 1/2 [50,51], this condition ensures that all directions d_l are descent directions for L.
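A runnable sketch of the CG scheme with the Fletcher-Reeves update is given below (Python with NumPy; we substitute a simple Armijo backtracking line search for the strong Wolfe conditions, with a steepest-descent restart as a safeguard, so this is an illustration of the idea rather than the BiplotML implementation):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_and_grads(X, mu, A, B):
    """Loss and its gradients: dL/dTheta = Pi - X, chained to mu, A, B."""
    Theta = mu[None, :] + A @ B.T
    R = sigmoid(Theta) - X                          # Pi - X
    L = np.sum(np.logaddexp(0.0, Theta) - X * Theta)
    return L, R.sum(axis=0), R @ B, R.T @ A         # dL/dmu, dL/dA, dL/dB

def fit_lb_cg(X, k=2, iters=200, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-0.1, 0.1, p)
    A = rng.uniform(-0.1, 0.1, (n, k))
    B = rng.uniform(-0.1, 0.1, (p, k))
    L, gmu, gA, gB = loss_and_grads(X, mu, A, B)
    dmu, dA, dB = -gmu, -gA, -gB                    # first direction: steepest descent
    for _ in range(iters):
        slope = gmu @ dmu + np.sum(gA * dA) + np.sum(gB * dB)
        if slope >= 0:                              # safeguard: restart if not descending
            dmu, dA, dB = -gmu, -gA, -gB
            slope = -(gmu @ gmu + np.sum(gA**2) + np.sum(gB**2))
        alpha = 1.0                                 # Armijo backtracking line search
        while alpha > 1e-12:
            Lnew, nmu, nA, nB = loss_and_grads(
                X, mu + alpha * dmu, A + alpha * dA, B + alpha * dB)
            if Lnew <= L + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        mu, A, B = mu + alpha * dmu, A + alpha * dA, B + alpha * dB
        gsq_old = gmu @ gmu + np.sum(gA**2) + np.sum(gB**2)
        L, gmu, gA, gB = Lnew, nmu, nA, nB
        beta = (gmu @ gmu + np.sum(gA**2) + np.sum(gB**2)) / max(gsq_old, 1e-12)  # FR
        dmu, dA, dB = -gmu + beta * dmu, -gA + beta * dA, -gB + beta * dB
    return mu, A, B
```

Swapping the β formula for the PRP, HS, or DY expression above changes only the single commented line.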

Estimation Using Coordinate Descendent MM Algorithm
As the optimization problem is non-convex, we also use an MM algorithm to generate solutions that decrease with each iteration. The negative log-likelihood can be majorized by a quadratic function, and the majorized function can then be minimized iteratively.
According to Taylor's theorem, and taking into account the fact that ∇²f(θ_ij) ≤ 1/4, the loss function of a single estimated natural parameter given in (5) is quadratically majorized as

f(θ_ij) ≤ f(θ_ij^(l)) + (π(θ_ij^(l)) − x_ij)(θ_ij − θ_ij^(l)) + (1/8)(θ_ij − θ_ij^(l))².

Consequently, the loss function for the whole canonical matrix of parameters is majorized by

L(Θ) ≤ (1/8) ∑_i ∑_j (θ_ij − z_ij^(l))² + C,

where C is a constant that does not depend on Θ, and z_ij^(l) = θ_ij^(l) + 4(x_ij − π(θ_ij^(l))). Let Z_l be the matrix whose ij-th element equals z_ij^(l). According to [52], the iteratively weighted least squares algorithm can be used with this quadratic upper bound, and thus the majorized function to be minimized is written as

(1/8) ‖Z_l − 1_n µ^T − AB^T‖_F² + C.

In this case, the parameters are estimated sequentially. The algorithm is based on the coordinate descent optimization of the majorized function. When fixing AB^T in Equation (19), µ can be estimated by the vector of mean scores of each column of matrix Z_l, µ = (1/n) Z_l^T 1_n. In this way, the optimization problem becomes min_{A,B} ‖Z_l^c − AB^T‖_F², where Z_l^c = Z_l − 1_n µ^T is the column-centered working matrix; its solution is obtained from the SVD Z_l^c = UΛV^T by setting A_{l+1} = U_(k) Λ_(k) and B_{l+1} = V_(k).
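The full MM iteration thus amounts to an iterated SVD, which can be sketched as follows (illustrative Python following the derivation above; names are our own, not from the BiplotML package):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_lb_mm(X, k=2, iters=100, seed=0):
    """Coordinate-descent MM for the LB model: build the working matrix Z_l,
    centre it with mu, then take a rank-k SVD of the centred matrix."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    A = rng.uniform(-0.1, 0.1, (n, k))
    B = rng.uniform(-0.1, 0.1, (p, k))
    for _ in range(iters):
        Theta = mu[None, :] + A @ B.T
        Z = Theta + 4.0 * (X - sigmoid(Theta))  # z_ij = theta_ij + 4(x_ij - pi_ij)
        mu = Z.mean(axis=0)                     # offset update: column means of Z_l
        Zc = Z - mu[None, :]                    # centred working matrix Z_l^c
        U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
        A = U[:, :k] * s[:k]                    # A_{l+1} = U_(k) Lambda_(k)
        B = Vt[:k].T                            # B_{l+1} = V_(k)
    return mu, A, B
```

Each iteration minimizes the quadratic majorizer exactly, so the loss is non-increasing along the sequence of iterates.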

Simulation Process
The data sets were simulated from a latent variables model with different levels of sparsity and with a low-dimensional structure according to the model presented in Equation (6). To generate the binary matrix X, we used the procedure presented in Algorithm 3.
The offset term µ was used to control the sparsity, while the log-odds Θ was defined to have a low-dimensional structure.
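A hedged sketch of this kind of latent-variable simulation (our own minimal version in Python; the paper's Algorithm 3 may differ in details such as how the offset is calibrated to reach a target proportion of ones):

```python
import numpy as np

def simulate_binary(n, p, k, sparsity_offset=0.0, seed=0):
    """Simulate a binary matrix with a rank-k log-odds structure
    Theta = 1_n mu^T + A B^T; a more negative offset gives a sparser X."""
    rng = np.random.default_rng(seed)
    mu = np.full(p, sparsity_offset)         # column offsets control sparsity
    A = rng.normal(size=(n, k))              # latent row scores
    B = rng.normal(size=(p, k))              # latent column loadings
    Theta = mu[None, :] + A @ B.T
    P = 1.0 / (1.0 + np.exp(-Theta))         # success probabilities
    return (rng.random((n, p)) < P).astype(int)

X = simulate_binary(n=100, p=50, k=3, sparsity_offset=-1.5)
# X.mean() is roughly the proportion of ones (below 0.5 for a negative offset)
```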

Model Performance Assessment
To evaluate the performance of the CG algorithm and the MM algorithm, we used the training error, which is defined as the average misclassification rate when using the low-dimensional structure generated by the LB model with the training set. Each algorithm provided an estimate of µ, A, and B. We used the predicted probability matrix Π = π(1_n µ^T + AB^T) to select p thresholds, 0 < δ_j < 1, j = 1, ..., p, one for each column of X, and then performed the binary classification. As the data matrices could be imbalanced, we used the training error defined through the Balanced Accuracy (BACC), which is most appropriate in these cases [53,54]:

BACC = 1 − (1/2) [TP/(TP + FN) + TN/(TN + FP)],

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
To decide the classification rule from the predicted value, a threshold must be selected for each variable. This threshold value can be selected by minimizing the BACC rate for each variable in the training set and then applying the rule to a test set.
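The per-variable threshold search can be sketched as a grid search over the BACC error (illustrative Python; the grid and the error form 1 − (sensitivity + specificity)/2 are our reconstruction of the criterion described above):

```python
import numpy as np

def bacc_error(x, xhat):
    """Balanced-accuracy error: 1 - (sensitivity + specificity) / 2."""
    tp = np.sum((x == 1) & (xhat == 1)); fn = np.sum((x == 1) & (xhat == 0))
    tn = np.sum((x == 0) & (xhat == 0)); fp = np.sum((x == 0) & (xhat == 1))
    sens = tp / (tp + fn) if tp + fn else 1.0
    spec = tn / (tn + fp) if tn + fp else 1.0
    return 1.0 - 0.5 * (sens + spec)

def best_threshold(x, probs, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the threshold delta_j minimising the BACC error for one column."""
    errs = [bacc_error(x, (probs > d).astype(int)) for d in grid]
    return grid[int(np.argmin(errs))]
```

In use, `best_threshold` is called once per column of the training set, and the resulting δ_j are then applied unchanged to the test set.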
As the training error can be an optimistic measure of the misclassification rate, we also used a cross-validation procedure that allowed for the testing of the models; thus, we used a test data set that was independent of the training data set. As in the supervised models, cross-validation was used to keep track of the overfitting of the model. In our case, the LB model is a non-supervised method to reduce the dimensionality, and cross-validation helps to find the number of dimensions to maintain [36,[55][56][57]. The procedure allowed us to estimate the value of the hyperparameter k to avoid overfitting and evaluate the capacity of the different estimation algorithms to identify the low-dimensional space.
As in PCA, we expect a few dimensions to reproduce the original (binary) data as well as possible; we therefore excluded some observed values, fit the model without those values, and calculated the expected probabilities for the missing data. Using the rule of classifying a missing value as present if the fitted probability is higher than a threshold δ_j, we imputed the selected missing data and calculated the classification error. We performed this procedure M times for different values of k and selected the value that minimized the generalization error.
More formally, we used the exclusion pattern proposed by Wold [55], which consists of a selected sequence of elements x_ij being left out using a diagonal pattern and treated as missing values; in this way, we avoided the problems of excluding a complete row or column, as described in [36,57]. Let W be a binary matrix, where w_ij = 0 if the value is excluded (belongs to the validation set) and w_ij = 1 if it is not (belongs to the training set). The minimization problem thus became

min_{µ,A,B} − log p(X; Θ, W) = − ∑_i ∑_j w_ij [x_ij log π(θ_ij) + (1 − x_ij) log(1 − π(θ_ij))],

so the loss function depends only on the training set. Solving the problem with one of the proposed algorithms, we created a matrix of expected probabilities Π = π(Θ). Then, x̂_ij = 1 when π(θ_ij) > δ_j, and x̂_ij = 0 otherwise, where 0 < δ_j < 1 is the threshold minimizing the BACC rate for variable j in the training set. We used the elements of X with w_ij = 0 to calculate the BACC. This procedure was performed M times, and the generalization error given by the cross-validation was calculated as the mean of the BACC; we considered M = 7 folds, as in [57].
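One simple way to realize Wold's diagonal exclusion pattern is to assign cell (i, j) to fold (i + j) mod M, so that no complete row or column is ever left out at once (a minimal sketch; the original pattern may use a different diagonal stride):

```python
import numpy as np

def wold_folds(n, p, M=7):
    """Diagonal exclusion masks: cell (i, j) is validation in fold (i+j) mod M.

    Returns a list of M binary matrices W; w_ij = 1 marks a training cell
    and w_ij = 0 a validation cell, as in the weighted loss above.
    """
    i, j = np.indices((n, p))
    fold = (i + j) % M
    return [(fold != m).astype(int) for m in range(M)]

masks = wold_folds(6, 4, M=3)
# across the M folds, every cell is excluded exactly once
```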
We also measured the ability of the models based on each algorithm to recover the log-odds Θ; for this, we used the relative mean squared error (RMSE), defined as

RMSE = ‖Θ̂ − Θ‖_F² / ‖Θ‖_F²,

where Θ is the true parameter and Θ̂ is the estimate produced by the LB model using one of the proposed algorithms.

Monte Carlo Study
Binary matrices were simulated with n = 100, 300, 500; p = 50, 100; k = 3; and D = 0.5, 0.3, 0.2, 0.1, where the parameter D represents the proportion of ones in matrix X. The different sparsity levels were simulated to check whether the ability of the algorithms to find the low-dimensional structure was affected by sparsity. The combinations of n, p, k, and D generated the different scenarios; in each scenario, R matrices were simulated independently, and the measures were calculated to evaluate the performance of the algorithms. Finally, the mean of the cross-validation error (cv error) was calculated, as well as the mean of the training error (BACC) and the mean of the relative mean squared error (RMSE), with their respective standard errors. We used a value of R = 30, which generated standard errors of less than 1% in the estimation of the BACC and the cv error.

Figure 1a presents the cross-validation error of the algorithms based on the conjugate gradient and the MM algorithm when the matrix X is balanced. From the cv error, we can see that all models began to overfit when k > 3, so the five estimation methods identified the three underlying dimensions for all balanced scenarios that were simulated. Figure 1b shows the Balanced Accuracy (BACC); notably, the slope decelerated once the three underlying dimensions of the matrix were reached, and thus an elbow criterion could be used as a signal of the number of dimensions to choose. Finally, Figure 1c shows the estimation of the relative mean squared error (RMSE) for the matrix Θ; we see that the algorithms based on the conjugate gradient and the MM algorithm showed similar results when the number of dimensions was less than or equal to 3. In contrast, when the model had more than the three dimensions predefined in the simulation, the algorithms based on the conjugate gradient presented a lower RMSE than the MM algorithm, although, for a fixed value of p, these gaps closed as n increased.
In all the scenarios studied, the error was minimized when the number of underlying dimensions in the space of the variables was reached, so our method identified that a value of k = 3 in the LB model was appropriate to avoid overfitting. As this occurred with all imbalanced data sets, we see that the level of sparsity does not affect the ability of the algorithms to correctly find the low-dimensional space. The training error for imbalanced data sets with different levels of sparsity is shown in Figure 3a-c. In all the studied scenarios, the training error stabilized from the third dimension onward. In this way, the different algorithms allowed the low-dimensional space to be appropriately selected using the elbow method.
The RMSE of the estimation of the log-odds Θ for the different levels of sparsity is shown in Figure 4a-c; we can see that the algorithms presented similar performances. In the scenarios with n = 100 and p = 50, the RMSE increased notably when the number of dimensions was greater than 3, so there were some important gaps between the two approaches, although these differences decreased as the value of p or the value of n increased. On the other hand, the computational performance of the algorithms was measured on a computer with an Intel Core i7-3517U processor with 6 GB of RAM. Table 1 shows the running time in seconds over 100 replications for k = 3 and a stopping criterion of ε = 10^−4. We see that the performances of the different algorithms were competitive, and they converged relatively quickly, stopping when the maximum of the absolute changes in the estimated parameters between two subsequent iterations was less than 10^−4. In general, the CPU times of the CG algorithms were similar and showed a better performance than the MM algorithm when p = 50 and the sparsity level began to be high (D ≤ 0.2), or when n = 100, p = 100, and D = 0.1. In the other cases, the MM algorithm performed better, especially as n and p increased, running up to six times faster than the CG algorithms in balanced scenarios with n = 500 and p = 100.

Application
To apply the proposed methodology, we used data from the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000) [5]. The database contains 926 tumor cell lines with comprehensive measurements of point mutation, CNA, methylation, and gene expression. To illustrate the methods, the methylation data were used, and to facilitate the interpretation of the results, three types of cancer were included: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), and skin cutaneous melanoma (SKCM).
We performed preprocessing to sort the data sets and separate the methylation information into one data set. After preprocessing, the methylation data set has 160 rows and 38 variables; each variable is a CpG island located in the gene promoter area. In this case, code 1 indicates a high level of methylation and 0 indicates a low level; approximately 27% of the binary data matrix are ones.

Figure 5 shows the cross-validation error and training error using the conjugate gradient algorithms and the coordinate descendent MM algorithm. If k = 0, model (6) only considers the term µ, meaning that Θ = 1_n µ^T, where µ_j is the proportion of ones in column j; this case was used as a reference to observe the performance of the algorithms when more dimensions were included by incorporating the row and column markers. The cross-validation error was minimized in three dimensions for the four formulas (FR, HS, PR, and DY) based on the CG algorithm, so k = 3 was the appropriate value to avoid overfitting when using these methods of estimation. When using the MM algorithm, it was found that the LB model generated overfitting for k > 2, so using two dimensions in the LB model is suitable with this estimation method.

An advantage of using a biplot approach is that it allows for a simultaneous representation of rows and columns, which are plotted with points and directed vectors, respectively. Figure 6 shows the biplot obtained for the methylation data using the Fletcher-Reeves conjugate gradient algorithm; the vectors of the variables are represented by arrows (segments) that start at the point that predicts 0.5 and end at the point that predicts 0.75. Therefore, short vectors indicate a rapid increase in probability, and the orthogonal projection of the row markers onto a vector approximates the probability of finding high levels of methylation in the cell line.
The position of the segment, which corresponds to the point that predicts a probability of 0.5, can start on any side of the origin. For example, in Figure 6, the variable DUSP22 points toward the origin; when the points are orthogonally projected in the direction of the vector, most of them are projected beyond the reference point where the segment starts. This means that almost all cell lines of the three groups have high fitted probabilities of having high levels of methylation in that variable.
The cell lines are separated into three clearly identified clusters. In the BRCA type of cancer, variables such as NAPRT1, THY1, or ADCY4 are directed towards the positive part of dimension 1 and therefore have a greater probability of presenting high levels of methylation. The LUAD cell lines are located in the negative part of dimension 2, so these have a high propensity to present high levels of methylation in variables such as HIST1H2BH, ZNF382, and XKR6. Finally, the cell lines for the SKCM cancer type are located in the negative part of dimension 1 and have a greater probability of presenting high levels of methylation in variables such as LOC100130522, CHRFAM7A, or DHRS4L2.

Figure 6. Logistic biplot using a Fletcher-Reeves conjugate gradient algorithm for methylation data.

Table 2 shows the rate of correct classifications for each variable using the measures of sensitivity and specificity; these measures allowed us to determine if the model exhibited a good classification for the two types of data in each variable. Sensitivity measured the true positive rate, specificity measured the true negative rate, and the global measure corresponded to the total rate of correct classifications for each variable.
In general, the model with three dimensions and using the CG algorithm with the FR formula generated high values for sensitivity; only the GSTT1 gene presented a relatively low sensitivity, with 72% true positives. Regarding specificity, the LOC391322 gene obtained the lowest true negatives rate, with 80.9%. Thus, the results of the model are satisfactory.

Conclusions and Discussion
The Logistic Biplot (LB) model is a dimensionality reduction technique that generalizes the PCA to deal with binary variables and has the advantage of simultaneously representing individuals and variables (biplot).
In this paper, we propose and develop a methodology to estimate the parameters of the LB model using nonlinear conjugate gradient algorithms or the descending coordinate MM algorithm. For the selection of the LB model, we have incorporated a cross-validation procedure that allows the choice of the number of dimensions of the model in order to counteract overfitting.
As a complement to the proposed methods and to give practical support, a package called BiplotML [33] has been written in the R language and is available from CRAN; this is a valuable tool that enables the application of the proposed algorithms to data analysis in every scientific field. Our contribution is important because we provide alternatives to solve some problems encountered in the LB model in the presence of sparsity or a big data matrix [23,[26][27][28]. Additionally, a procedure is presented that allows the choice of the number of dimensions, which until now had not been investigated for an LB model.
The proposed algorithms are iterative and have the property that the loss function decreases with each iteration. To study the properties of the proposed algorithms for fitting an LB model, low-rank data sets with k = 3 and different levels of sparsity were generated for n = 100, 300, 500 rows and p = 50, 100 columns. The accuracy of the algorithms was measured using the training error, the generalization error (cv error), and the relative mean squared error (RMSE) of the log-odds. According to the Monte Carlo study, we established that the cross-validation criterion is successful in the estimation of the hyperparameter representing the number of dimensions. This allows the model to be specified so as to avoid overfitting; in this way, we obtain the best performance of the proposed algorithms in terms of recovering the underlying low-rank structure.
The comparison of the running times showed that the algorithms converge quickly. The CG algorithm is more efficient when the matrices are sparse and not very large, while the performance of the MM algorithm is better when the number of rows and columns tends to increase; thus, it is preferable for large matrices.
Finally, we used real data on gene expression methylation to show our approach. The LB model allowed us to carry out a simultaneous projection of rows and columns, where a grouping into three classes was observed, formed by the cell lines of the three types of cancer analyzed. Furthermore, the vectors that represented the variables allowed us to identify those cell lines that were more likely to present high levels of methylation in the different genes.

Institutional Review Board Statement: Ethical review and approval were waived for this paper due to GDSC1000 being open data and completely anonymized; thus, the privacy of any patient data is not compromised.


Data Availability Statement:
The data analyzed in this paper can be found at https://www.cancerrxgene. org/gdsc1000/GDSC1000_WebResources/Home.html (accessed on 14 May 2021).

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: