New Efficient Approach to Solve Big Data Systems Using Parallel Gauss–Seidel Algorithms

Abstract: In order to perform big-data analytics, regression involving large matrices is often necessary. In particular, large-scale regression problems are encountered when one wishes to extract semantic patterns for knowledge discovery and data mining. When a large matrix can be processed in its factorized form, advantages arise in terms of computation, implementation, and data compression. In this work, we propose two new parallel iterative algorithms, as extensions of the Gauss–Seidel algorithm (GSA), to solve regression problems involving many variables. A convergence study, in terms of error bounds, of the proposed iterative algorithms is also performed, and the required computational resources, namely time- and memory-complexities, are evaluated to benchmark the efficiency of the proposed new algorithms. Finally, numerical results from both Monte Carlo simulations and real-world datasets are presented to demonstrate the effectiveness of our proposed new methods.


Introduction
With the advances of computer and internet technologies, tremendous amounts of data are processed and archived in our daily life. Data-generating sources include the Internet of Things (IoT), social websites, smart devices, sensor networks, digital images/videos, multimedia archives for surveillance, business-activity records, web logs, health (medical) records, online libraries, eCommerce data, scientific research projects, smart cities, and so on [1,2]. This is why the quantity of data all over the world has been growing exponentially. The International Telecommunication Union (ITU) predicts that this exponential growth will continue and that, by 2030, overall data traffic for mobile devices alone will reach an astonishing five zettabytes (ZB) per month [3].
In big-data analysis, matrices are utilized extensively in formulating problems with linear structure [4][5][6][7][8][9][10]. For example, matrix-factorization techniques have been applied to topic modeling and text mining [11,12]. As another example, a bicycle demand–supply problem was formulated as a matrix-completion problem by modeling the bike-usage demand as a matrix whose two dimensions were defined as the time interval of a day and the region of a city [13]. For social networks, matrices such as adjacency and Laplacian matrices have been used to encode social-graph relations [14]. A special class of matrices, referred to as low-rank (high-dimensional) matrices, which often have many linearly dependent rows (or columns), is frequently encountered in big-data analytics applications. We list several data-analytics applications involving such high-dimensional, low-rank matrices: (i) system identification: low-rank (Hankel) matrices are used to represent low-order linear, time-invariant systems [15]; (ii) weight matrices: several signal-embedding problems, for example, multidimensional scaling (see [16]).

In this information-technology boom era, problems are often quite large and have to be solved by digital computers subject to finite precision. The proposed new divide-and-iterate method can be applied extensively in data processing for big data. This new approach differs from the conventional divide-and-conquer scheme, since no horizontal computations (mutual iterations among subproblems) exist in the conventional divide-and-conquer approach. Under the same divide-and-iterate framework, this work uses the GSA, instead of the Kaczmarz algorithm (KA) [43], to solve the factorized subsystems in parallel. The rest of this paper is organized as follows. The linear-regression problem and the Gauss–Seidel algorithm are discussed in Section 2.
The proposed new iterative approach to solve a factorized system is presented in Section 3. The convergence proofs for the proposed methods are provided in Section 4. The time- and memory-complexities of our proposed new approach are discussed in Section 5. The numerical experiments for the proposed new algorithms are presented in Section 6. Finally, conclusions are drawn in Section 7.

Solving Linear Regression Using Factorized Matrices and Gauss-Seidel Algorithm
In this section, a linear-regression problem (especially one involving a large matrix) is first formulated using factorized matrices. Then, the Gauss–Seidel algorithm is introduced briefly, as this algorithm is invoked to solve the subproblems involving the factorized matrices in parallel. Finally, the individual solutions to these subproblems are combined to form the final solution.

Linear Regression: Divide-and-Iterate Approach
A linear regression is given by

Vy = c, (1)

where V ∈ C^(m×n) and C denotes the set of complex numbers. Since the matrix V can be decomposed as the product V = WH, where W ∈ C^(m×k) and H ∈ C^(k×n), Equation (1) is equivalent to WHy = c. Generally, the dimension of V is large in the context of big data, so it is not practical to solve the original regression problem directly. We propose to solve the following subproblems alternately:

Wx = c, (2)

and:

Hy = x. (3)

One can obtain the solution to the original linear system in Equation (1) by first solving the sub-linear system given by Equation (2) and then substituting the intermediate solution x into Equation (3) to obtain the final solution y. A linear system Vy = c is called consistent if it has at least one solution; otherwise, it is called inconsistent. The sub-linear systems can be solved by the Gauss–Seidel algorithm, which is briefly introduced in the next subsection.
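The two-stage substitution above can be illustrated with a minimal sketch (not the paper's iterative algorithm): solve Wx = c for the intermediate vector x, then Hy = x for the final solution y. The dimensions and the random data below are arbitrary choices for illustration.

```python
import numpy as np

# Divide-and-iterate idea in its simplest form: solve the two factorized
# sub-systems in sequence instead of the large original system Vy = c.
rng = np.random.default_rng(0)
m, k, n = 8, 3, 6
W = rng.standard_normal((m, k))
H = rng.standard_normal((k, n))
y_true = rng.standard_normal(n)
c = W @ (H @ y_true)                       # consistent right-hand side

x, *_ = np.linalg.lstsq(W, c, rcond=None)  # sub-system (2): Wx = c
y, *_ = np.linalg.lstsq(H, x, rcond=None)  # sub-system (3): Hy = x

# The recombined solution solves the original system Vy = (WH)y = c.
assert np.allclose(W @ H @ y, c)
```

Here the sub-systems are solved directly via least squares for clarity; the paper's contribution is to replace these direct solves with parallel randomized Gauss–Seidel iterations.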

Gauss-Seidel Algorithm and Its Extensions
The Gauss-Seidel algorithm (GSA) is an iterative algorithm for solving linear equations Vy = c. It is named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel, and it is similar to the Jacobi method [44]. The randomized version of the Gauss-Seidel method converges linearly when the system is consistent [45].
Given V ∈ C^(m×n) and c as in Equation (1), the randomized GSA picks column j ∈ {1, 2, · · · , n} of V with probability

p_j = ‖V(,j)‖_2^2 / ‖V‖_F^2,

where V(,j) is the j-th column of the matrix V, ‖·‖_F is the Frobenius norm, and ‖·‖_2 is the Euclidean norm. The solution y is then updated as:

y_(t+1) = y_t + ( (V(,j))^* (c − V y_t) / ‖V(,j)‖_2^2 ) e(j), (4)

where t is the (iteration) index of the solution at the t-th step (iteration), e(j) is the j-th basis vector (a vector with 1 at the j-th position and 0 elsewhere), and * denotes the Hermitian adjoint of a matrix (vector). However, the randomized GSA updated by Equation (4) does not converge when the system of equations is inconsistent [45]. To overcome this problem, an extended GSA (EGSA) was proposed in [46]. The EGSA picks row i ∈ {1, 2, · · · , m} of V with probability

q_i = ‖V(i,)‖_2^2 / ‖V‖_F^2,

where V(i,) represents the i-th row of the matrix V. Consequently, the solution y and an auxiliary vector d are updated by the two-step rule of [46] (Equations (5) and (6)). When the EGSA is applied to a consistent system, it behaves exactly like the GSA. For inconsistent systems, the EGSA has been shown to converge linearly in expectation to the least-squares solution (y_opt = V† c, where † denotes the pseudo-inverse based on the least-squares norm) according to [46].
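A minimal NumPy sketch of the randomized GSA update in Equation (4), assuming the standard column-sampling distribution stated above (the function name and loop structure are our own; the paper's algorithms additionally parallelize the vector operations):

```python
import numpy as np

def randomized_gsa(V, c, T=5000, seed=0):
    """Randomized Gauss-Seidel sketch of Equation (4): pick column j with
    probability ||V[:, j]||^2 / ||V||_F^2, then update coordinate j of y."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    col_sq = np.sum(np.abs(V) ** 2, axis=0)
    probs = col_sq / col_sq.sum()          # column-sampling distribution
    y = np.zeros(n)
    for _ in range(T):
        j = rng.choice(n, p=probs)
        r = c - V @ y                      # current residual
        y[j] += (V[:, j].conj() @ r) / col_sq[j]
    return y

# Demo on a small consistent system: the iterates approach the unique solution.
V = np.random.default_rng(1).standard_normal((30, 5))
y_true = np.arange(5.0)
c = V @ y_true
y_est = randomized_gsa(V, c, T=5000, seed=2)
assert np.linalg.norm(y_est - y_true) < 1e-5
```

For an inconsistent system, this plain update does not converge, which is exactly the case the EGSA of [46] addresses.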

Parallel Random Iterative Approach for Linear Systems
In this section, we propose a novel parallel approach to deal with vector additions/subtractions and inner-product computations. This new parallel approach can speed up the GSA and the EGSA, as stated in Sections 3.1 and 3.2. Suppose that we have p processors (indexed by P_1, P_2, . . ., P_p) available to carry out vector computations in parallel. The data involved in such computations needs to be allocated to the processors in a balanced manner. Such a balanced data load across all processors makes the best use of resources, maximizes the throughput, minimizes the computational time, and mitigates the chance of any processor being overloaded. Here we propose two strategies to assign data evenly, namely (i) cyclic distribution and (ii) block distribution. For the cyclic distribution, we assign the i-th component (0 ≤ i < m) of a length-m vector y to processor P_((i|p)+1), where i|p denotes i modulo p. On the other hand, for the block distribution, we assign the i-th component of a length-m vector y to processor P_(⌊i/β⌋+1), where 0 ≤ i < m, β def= ⌈m/p⌉ specifies the block size, ⌊·⌋ denotes the integer rounding-down (floor) operation, and ⌈·⌉ denotes the integer rounding-up (ceiling) operation. The cyclic and block distributions for four processors are illustrated in Figure 2. For example, the parallel computation of an inner product between two vectors using the cyclic distribution is illustrated by Figure 3.
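The two ownership rules, and the cyclic-distribution inner product, can be sketched as follows (processor indexing is 0-based here for convenience, unlike the 1-based P_1, . . ., P_p in the text, and the "parallel" partial sums are simulated serially):

```python
import numpy as np

def cyclic_owner(i, p):
    """Cyclic distribution: component i goes to processor (i mod p)."""
    return i % p

def block_owner(i, m, p):
    """Block distribution: contiguous blocks of size ceil(m/p)."""
    beta = -(-m // p)                      # ceiling of m / p
    return i // beta

def parallel_inner(u, v, p):
    """Each processor sums the products over the components it owns under
    the cyclic distribution; the partial sums are then combined."""
    partial = [sum(u[i] * v[i] for i in range(len(u)) if i % p == q)
               for q in range(p)]
    return sum(partial)

u = np.arange(8.0)
v = np.ones(8)
assert parallel_inner(u, v, 4) == u @ v    # matches the serial inner product
assert [cyclic_owner(i, 4) for i in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
assert [block_owner(i, 8, 4) for i in range(8)] == [0, 0, 1, 1, 2, 2, 3, 3]
```

With either distribution, each processor holds roughly m/p components, which is what balances the load across processors.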

Consistent Linear Systems
The parallel random iterative algorithm to solve the original system formulated by Equation (1), when that system is consistent, is stated as Algorithm 1. The idea is to solve the sub-system formulated by Equation (2) and the sub-system formulated by Equation (3) alternately using the GSA. The symbols ⊕_p, ⊖_p, and ⊙_p represent parallel vector addition, subtraction, and inner product, respectively, using p processors, and ×_p is the operation that scales a vector by a complex value. The parameter T specifies the number of iterations performed by the proposed algorithms; this quantity can be determined from the error tolerance of the solution (refer to Section 5.1 for a detailed discussion).
[Algorithm 1: parallel randomized Gauss–Seidel iterations alternating between the sub-systems Wx = c and Hy = x; the columns W(,i) and H(,j) are picked with probabilities proportional to their squared Euclidean norms.]
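As a hedged stand-in for the full listing, here is a minimal serial sketch of the alternating scheme behind Algorithm 1 (assuming one randomized GSA step on Wx = c followed by one on Hy = x per pass; the parallel operators are simulated with ordinary NumPy calls, and all names are our own):

```python
import numpy as np

def algorithm1_sketch(W, H, c, T=20000, seed=0):
    """Alternate one randomized Gauss-Seidel update on Wx = c with one on
    Hy = x, using the current intermediate solution x as the right-hand
    side of the second sub-system."""
    rng = np.random.default_rng(seed)
    k, n = H.shape
    w_sq = np.sum(W ** 2, axis=0)
    h_sq = np.sum(H ** 2, axis=0)
    x = np.zeros(k)
    y = np.zeros(n)
    for _ in range(T):
        i = rng.choice(k, p=w_sq / w_sq.sum())       # column of W
        x[i] += W[:, i] @ (c - W @ x) / w_sq[i]
        j = rng.choice(n, p=h_sq / h_sq.sum())       # column of H
        y[j] += H[:, j] @ (x - H @ y) / h_sq[j]
    return x, y

# Demo on a small consistent factorized system.
rng = np.random.default_rng(3)
W = rng.standard_normal((20, 4))
H = rng.standard_normal((4, 6))
c = W @ H @ rng.standard_normal(6)         # consistent by construction
x, y = algorithm1_sketch(W, H, c, T=20000, seed=4)
assert np.linalg.norm(W @ H @ y - c) < 1e-4
```

The x-iterates converge to the solution of Wx = c, and the y-iterates then drive Hy toward x, so the recombined WHy approaches c.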

Inconsistent Linear Systems
If the original system formulated by Equation (1) is not consistent, Algorithm 2 is proposed to solve it instead. Algorithm 2 below is based on the EGSA.
[Algorithm 2: parallel extended Gauss–Seidel iterations for the inconsistent case; the rows W(l,) and columns H(,j) are picked with probabilities proportional to their squared Euclidean norms.]

Convergence Studies
The convergence studies for the two algorithms proposed in Section 3 are established by Theorem 1 for consistent systems and Theorem 2 for inconsistent systems. The lemmas necessary for establishing the main theorems of Section 4.2 are first presented in Section 4.1. All proofs are written using vector operations without the subscript p, because the parallel computations for vector operations lead to the same results regardless of the number of processors p. Without loss of generality, the instances of the subscript p in Algorithms 1 and 2 simply indicate that those computations can be carried out in parallel.

Auxiliary Lemmas
We define the metric ρ_A for a matrix A as:

ρ_A def= 1 − σ_min^2(A) / ‖A‖_F^2, (9)

where σ_min(A) denotes the minimum nontrivial singular value of the matrix A, so that 0 < ρ_A < 1. We present the following lemma, which establishes an inequality related to the error bounds of our proposed iterative algorithms.

Lemma 1.
Let A be a nonzero real matrix. For any vector v in the range of A, i.e., any v that can be obtained as a linear combination of A's columns (taking columns as vectors), we have:

‖A^T v‖_2^2 ≥ σ_min^2(A) ‖v‖_2^2.

Proof. Because the singular values of A and A^T are the same and σ_i(AA^T) ≥ σ_min^2(A), where the subscript i denotes the i-th largest singular value in magnitude, Lemma 1 is proven.
Since the original solution to Equation (1) can be facilitated from solving the factorized linear systems, Lemma 2 below can be utilized to bound the error arising from the solutions to the factorized sub-systems at each iteration.

Lemma 2.
The expected squared norm of Hy_t − Hy_opt, i.e., the error between the result at the t-th iteration and the optimal solution, conditional on the first t iterations, is bounded as follows, where the subscript x of the statistical-expectation operator E_(t,x)[·] indicates that the expectation is taken over the random variable x.
Therefore, Hŷ_(t+1) − Hy_opt and Hŷ_(t+1) − Hy_t are orthogonal to each other. The relation Hy_opt = x_opt is used to establish equality (4). Recall that the expectation E_t[·] is conditional on the first t iterations; the law of iterated expectations [47] is thereby applied to establish equality (5). Since the probability of selecting the j-th column of H is ‖H(,j)‖_2^2/‖H‖_F^2, we obtain equality (6). Inequality (7) comes from the fact that ‖H^*(Hy_opt − Hy_t)‖_2^2 ≥ σ_min^2(H) ‖Hy_opt − Hy_t‖_2^2. Equality (8) results from the definition of ρ_H in Equation (9). Finally, inequality (9) comes from the fact that ‖Ax‖_2^2 ≤ ‖A‖_F^2 ‖x‖_2^2 (by the Cauchy–Schwarz inequality) for any matrix A ∈ C^(m×n) and vector x ∈ C^(n×1).

Lemma 3. Consider a linear, consistent system Vy = c, where V has dimension m × n. If the Gauss–Seidel algorithm (GSA) with an initial guess y_0 ∈ R^n (R denotes the set of real numbers) is applied to solve such a linear, consistent system, the expected squared norm of y_t − y_opt can be bounded as follows:

E[‖y_t − y_opt‖_2^2] ≤ ρ_V^t ‖y_0 − y_opt‖_2^2.

Proof. See Theorem 4.2 in [45].
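The metric ρ_A can be computed numerically; the helper below assumes the reconstruction ρ_A = 1 − σ_min^2(A)/‖A‖_F^2 of Equation (9) and, for simplicity, uses the smallest singular value of a full-rank input:

```python
import numpy as np

def rho(A):
    """Contraction factor rho_A = 1 - sigma_min(A)^2 / ||A||_F^2
    (assumes A has full rank, so sigma_min is the smallest nontrivial
    singular value)."""
    sigma_min = np.linalg.svd(A, compute_uv=False).min()
    return 1.0 - sigma_min ** 2 / np.linalg.norm(A, "fro") ** 2

A = np.diag([3.0, 1.0])                    # singular values 3 and 1
assert np.isclose(rho(A), 1.0 - 1.0 / 10.0)
```

Smaller ρ_A means faster contraction of the expected error per iteration, which is why the bounds below are stated in terms of ρ_W and ρ_H.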
The following lemma is presented to bound the iterative results for solving an inconsistent system using the extended Gauss–Seidel algorithm (EGSA).

Lemma 4.
Consider a linear, inconsistent system Vy = c. If the extended Gauss–Seidel algorithm (EGSA), with an initial guess y_0 within the range of V^T and d_0 ∈ R^n, is applied to solve such a linear, inconsistent system, the expected squared norm of y_t − y_opt can be bounded as follows.

Proof. Decomposing the error yields Equation (19). The expectation of the first term in Equation (19) can be bounded to give Equation (21). The expectation of the second term in Equation (19) is given by an equality chain in which equality (1) comes from the law of iterated expectations applied to the conditional expectations E_(t−1,(,i))[·] (conditional on the i-th column at the (t−1)-th iteration) and E_(t−1,(l,))[·] (conditional on the l-th row at the (t−1)-th iteration), and equality (2) comes from the probability of selecting the l-th row being ‖V(l,)‖_2^2/‖V‖_F^2. From the GSA update rule, we obtain a further inequality, in which equality (1) is established by applying the one-step GSA update, equality (2) is based on the fact that I − (V(l,))^* V(l,)/‖V(l,)‖_2^2 is an orthogonal projection, and inequality (3) comes from Lemma 1. Based on this inequality and the law of iterated expectations, the expectation of the second term in Equation (19) can be bounded as in Equation (24). Combining Equations (19), (21), and (24) completes the bound. Consequently, Lemma 4 is proven.

Convergence Analysis
Having introduced the necessary lemmas in Section 4.1, we now present the main convergence theorems for the two proposed algorithms. Theorem 1 is established for consistent systems, while Theorem 2 is established for inconsistent systems.

Theorem 1.
Let V ∈ C^(m×n) be a low-rank matrix such that V = WH, with a full-rank W ∈ C^(m×k) and a full-rank H ∈ C^(k×n), where k < m and k < n. Suppose that the systems Vy = c and Wx = c have the optimal solutions y_opt and x_opt, respectively. The initial guesses are selected as y_0 ∈ range(H^T) and x_0 ∈ range(W^T). Define ζ def= ρ_W/ρ_H. If Vy = c is consistent, the expected squared error obeys the bound in Equation (26) for ζ ≠ 1 and the bound in Equation (27) for ζ = 1.

Proof. From Lemma 2, we have an iterated inequality for the error. By applying the bound given by Lemma 3 to this inequality, we obtain Equation (29), which for ζ ≠ 1 yields Equation (26). If ρ_W = ρ_H, from the law of iterated expectations, we can rewrite Equation (29) so that the dominant term becomes t ρ_H^t ‖Hy_0 − Hy_opt‖_2^2, which yields Equation (27). Consequently, Theorem 1 is proven.

Theorem 2.
Let V ∈ C^(m×n) be a low-rank matrix such that V = WH, with a full-rank W ∈ C^(m×k) and a full-rank H ∈ C^(k×n), where k < m and k < n. The systems Vy = c and Wx = c have the optimal solutions y_opt and x_opt, respectively. The initial guesses are selected as y_0 ∈ range(H^T), x_0 ∈ range(W^T), and d_0 ∈ C^k. Define ζ def= ρ_W/ρ_H. If Vy = c is inconsistent, the expected squared error obeys the bound in Equation (32) for ζ ≠ 1 and a corresponding bound for ζ = 1.

Proof. By applying the bound in Lemma 4 to Equation (34), we get Equation (35). Next, if ρ_W ≠ ρ_H, by applying the law of iterated expectations, we can rewrite Equation (35) as Equation (36). On the other hand, if ζ = 1, i.e., ρ_W = ρ_H, Equation (36) simplifies to the second bound. Consequently, Theorem 2 is proven.

Complexity Analysis
In this section, the time- and memory-complexity analyses are presented for the algorithms proposed in Section 3. The details are given in the following subsections.

Time-Complexity Analysis
For a consistent system with ρ_W ≠ ρ_H, the error estimate can be bounded by a geometric term of the form C^cons_(ρ_W ≠ ρ_H) ρ_max^t; hence, given an error tolerance ε, one has to iterate the while-loop in Algorithm 1 for at least T^cons_(ρ_W ≠ ρ_H)

times. For each while-loop
in Algorithm 1, we need (2k + 2m)/p arithmetic operations for updating x_t and another (2k + 2n)/p arithmetic operations for updating y_t using p processors. Therefore, given an error limit not exceeding ε, the time-complexity T^cons_(ρ_W ≠ ρ_H) for solving a consistent system with ρ_W ≠ ρ_H can be bounded as in Equation (38). For a consistent system with ρ_W = ρ_H, since the growth rate of the term t ρ_max^t is larger than that of the term ρ_H^t, the error estimate can be bounded by the term in t ρ_max^t (see Theorem 5 in [48]), where C^cons_(ρ_W = ρ_H) is another constant related to the matrices W and H. As proven by Theorem 4 in [48], one has to iterate the while-loop in Algorithm 1 for at least T^cons_(ρ_W = ρ_H) times. For each while-loop iteration in Algorithm 1, the time-complexity here (for ρ_W = ρ_H) is the same as that for solving the consistent system with ρ_W ≠ ρ_H. Therefore, the time-complexity T^cons_(ρ_W = ρ_H) for a consistent system with ρ_W = ρ_H can be bounded as in Equation (41). On the other hand, for an inconsistent system with ρ_W ≠ ρ_H, the error estimate can be bounded similarly, and one has to iterate the while-loop in Algorithm 2 for at least T^inco_(ρ_W ≠ ρ_H) times.
For each while-loop iteration in Algorithm 2, it requires (2m + 2k)/p arithmetic operations for updating d_t, 4k/p arithmetic operations for updating x_t, and another (2n + 2k)/p arithmetic operations for updating y_t using p processors. Hence, given the error tolerance ε, the time-complexity T^inco_(ρ_W ≠ ρ_H) for solving an inconsistent system with ρ_W ≠ ρ_H can be bounded as in Equation (43). For an inconsistent system with ρ_W = ρ_H, we can apply Theorem 5 in [48] to bound the error estimate, where C^inco_(ρ_W = ρ_H) is a constant related to the matrices W and H. Similar to the previous argument for solving a consistent system with ρ_W = ρ_H, one should iterate the while-loop in Algorithm 2 for at least T^inco_(ρ_W = ρ_H)

times. For each while-loop in
Algorithm 2, the time-complexity is the same as that for solving an inconsistent system with ρ_W ≠ ρ_H. Therefore, the time-complexity T^inco_(ρ_W = ρ_H) for solving an inconsistent system with ρ_W = ρ_H can be bounded as in Equation (44). According to the time-complexity analysis earlier in this section, the worst case occurs when ε = 0, since it requires t → ∞ in Equations (38), (41), (43), and (44). On the other hand, the best case occurs when ε is fairly large and all constants (determined from the matrices W and H), C^cons and C^inco, are relatively small, so that a single iteration suffices to make all error estimates less than such an ε.
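The dependence of the iteration count on the error tolerance can be made concrete with a back-of-the-envelope calculation: for a geometric error bound of the form error(t) ≤ C ρ^t, one needs ρ^t ≤ ε/C, i.e., t ≥ log(ε/C)/log(ρ). The constants C and ρ below are illustrative values, not taken from the paper.

```python
import math

def min_iterations(eps, C, rho):
    """Smallest t with C * rho**t <= eps for a geometric error bound."""
    return math.ceil(math.log(eps / C) / math.log(rho))

# Illustrative values: rho = 0.9 and C = 1 require about 110 iterations
# to push the bound below eps = 1e-5.
t = min_iterations(eps=1e-5, C=1.0, rho=0.9)
assert 0.9 ** t <= 1e-5 < 0.9 ** (t - 1)
```

This is exactly why the worst case (ε = 0) forces t → ∞, while a loose tolerance with small constants may need only a single iteration.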

Memory-Complexity Analysis
In the context of big data, the memory usage considered here extends the conventional memory-complexity definition, i.e., the size of the working memory used by an algorithm; we also account for the memory used to store the input data. In this subsection, we demonstrate that our proposed two algorithms, which solve the factorized sub-systems, require much lower memory-complexity than the conventional approach of solving the original system. This memory-efficiency is a main contribution of our work. For a consistent system V_(m×n) factorized as W_(m×k) × H_(k×n), our proposed Algorithm 1 requires (mk + nk + m) memory-units (MUs) to store the inputs W_(m×k), H_(k×n), and c_(m×1). In Algorithm 1, one needs (k + n) MUs to store the probability values used for the column selections. For updating the various vectors, (k + n) MUs are required to store the corresponding updates. Hence, the total number of required MUs is given by (mk + nk + m + 2k + 2n). Alternatively, if one applies Algorithm 1 to solve the original (unfactorized) system, it needs (mn + m + 2n) MUs for storing data.
On the other hand, for an inconsistent system, our proposed Algorithm 2 requires (mk + nk + m) memory units to store the inputs W_(m×k), H_(k×n), and c_(m×1). In Algorithm 2, one needs (m + k + n) MUs to store the probability values used for the row and column selections. For updating the various vectors, (2k + n) MUs are required to store the corresponding updates. Therefore, the total number of required MUs is given by (mk + nk + 2m + 3k + 2n). Alternatively, if one applies Algorithm 2 to solve the original system, it requires (mn + 2m + 2n) MUs to store data.
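The memory-unit counts above are simple to evaluate; the totals below assume the itemized contributions (inputs, sampling probabilities, and update vectors) sum as stated, with illustrative dimensions of our own choosing:

```python
# Memory-unit (MU) counts for the consistent case, per the analysis above.
def mu_factorized_consistent(m, n, k):
    # inputs (mk + nk + m) + probabilities (k + n) + updates (k + n)
    return (m * k + n * k + m) + (k + n) + (k + n)

def mu_original_consistent(m, n):
    # storing V, c, probabilities, and updates for the unfactorized system
    return m * n + m + 2 * n

# Illustrative dimensions with k much smaller than m and n.
m, n, k = 10000, 8000, 500
saving = 1 - mu_factorized_consistent(m, n, k) / mu_original_consistent(m, n)
assert saving > 0.85                       # the factorized form stores far less
```

With k much smaller than m and n, the dominant cost drops from mn to (m + n)k, which is the source of the storage savings reported in Section 6.4.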

Numerical Evaluation
The numerical evaluation of our proposed algorithms is presented in this section. The convergence of the proposed algorithms is studied in Sections 6.1 and 6.2, and their time- and memory-complexities are evaluated in Sections 6.3 and 6.4, respectively.

Convergence Study
First, consider a consistent system. The entries of W, H, and c are drawn from independent and identically distributed (i.i.d.) zero-mean, unit-variance Gaussian random variables, where m = 200, n = 150, and k = 100. We plot the convergence trends of the expected error E[‖Hy_t − Hy_opt‖_2^2] and the actual L_2-errors (shown by the shaded areas) for (ρ_W = 0.997, ρ_H = 0.894) and (ρ_W = 0.988, ρ_H = 0.907) in Figure 4. The convergence subject to ρ_W = 0.997 is slower than that subject to ρ_W = 0.988 because the convergence speed is determined solely by ρ_W according to Equation (26) when ρ_W > ρ_H. Each shaded region spans the actual L_2-errors resulting from one hundred Monte Carlo trials. On the other hand, consider an inconsistent system, to which Algorithm 2 has to be applied. The entries of W, H, and c are again drawn i.i.d. from a zero-mean, unit-variance Gaussian distribution with m = 200, n = 150, and k = 100. We plot the convergence trends of the expected error E[‖Hy_t − Hy_opt‖_2^2] and the actual L_2-errors (shown by the shaded areas) for (ρ_W = 0.894, ρ_H = 0.993) and (ρ_W = 0.882, ρ_H = 0.987) in Figure 5. Again, the convergence subject to ρ_H = 0.993 is slower than that subject to ρ_H = 0.987 because the convergence speed is determined solely by ρ_H according to Equation (32) when ρ_H > ρ_W. Each shaded region spans the actual L_2-errors resulting from one hundred Monte Carlo trials.
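A sketch of this Monte Carlo setup, assuming the metric ρ_A = 1 − σ_min^2(A)/‖A‖_F^2 (our reconstruction of Equation (9)); the exact ρ values reported in the paper depend on the particular random draw:

```python
import numpy as np

# i.i.d. zero-mean, unit-variance Gaussian factors with the paper's
# dimensions m = 200, n = 150, k = 100, and a consistent right-hand side.
rng = np.random.default_rng(0)
m, n, k = 200, 150, 100
W = rng.standard_normal((m, k))
H = rng.standard_normal((k, n))
y_true = rng.standard_normal(n)
c = W @ H @ y_true

rho_W = 1 - np.linalg.svd(W, compute_uv=False).min() ** 2 \
          / np.linalg.norm(W, "fro") ** 2
rho_H = 1 - np.linalg.svd(H, compute_uv=False).min() ** 2 \
          / np.linalg.norm(H, "fro") ** 2
assert 0.0 < rho_W < 1.0 and 0.0 < rho_H < 1.0
```

Each Monte Carlo trial redraws W, H, and c and reruns the algorithm; the shaded bands in the figures span the per-trial error trajectories.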

Validation Using Real-World Data
In addition to the Monte Carlo simulations of Section 6.1, we also validate our proposed new algorithms on real-world data about wine quality and bike rentals, using two datasets from the UCI Machine Learning Repository [49]. The first dataset is related to wine quality, from which we chose the data for red wine only. The owner of this dataset used twelve physicochemical and sensory variables to measure wine quality: 1-fixed acidity, 2-volatile acidity, 3-citric acid, 4-residual sugar, 5-chlorides, 6-free sulfur dioxide, 7-total sulfur dioxide, 8-density, 9-pH value, 10-sulphates, 11-alcohol, and 12-quality (each score ranging from 0 to 10). Consequently, these twelve categories of data form an overdetermined matrix V of size 1599 × 12. If nonnegative matrix factorization is applied to obtain the factorized matrices W and H for k = 5, we have ρ_W = 0.99840647 and ρ_H = 0.99838774. The expected errors E[‖Hy_t − Hy_opt‖_2^2] and the actual L_2-errors for the wine data (denoted by triangles) are depicted in Figure 6, where Algorithm 2 is applied to solve the pertinent linear-regression problem. On the other hand, the second dataset, about a bike-sharing system, includes both categorical and numerical data. Since the underlying problem is linear regression, we work on the numerical attributes of the data only: 1-temperature, 2-feeling temperature, 3-humidity, 4-windspeed, 5-casual counts, 6-registered counts, and 7-total rental-bike counts. The matrix size for this dataset is thus 17,379 × 7, and the corresponding parameters are ρ_W = 0.99599172, ρ_H = 0.98568833, and k = 7. The expected errors E[‖Hy_t − Hy_opt‖_2^2] and the actual L_2-errors for the bike data (denoted by rhombi) are also delineated in Figure 6.
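The nonnegative-matrix-factorization preprocessing step can be mirrored with plain multiplicative updates (Lee–Seung style); the small random matrix below is a stand-in for the 1599 × 12 wine matrix, and the update scheme is our assumption, since the paper does not specify which NMF solver it used:

```python
import numpy as np

# Factorize a nonnegative data matrix V ~ W @ H with inner dimension k = 5
# using multiplicative NMF updates, then W and H can be handed to the
# factorized Gauss-Seidel solver.
rng = np.random.default_rng(0)
V = rng.random((60, 12))                   # stand-in for the wine matrix
k = 5
W = rng.random((60, k)) + 0.1
H = rng.random((k, 12)) + 0.1
for _ in range(500):                       # multiplicative updates keep W, H >= 0
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
assert W.min() >= 0 and H.min() >= 0 and err < 0.6
```

The factorization is only approximate (rank k < 12), so the regression is then run against the factored model WH rather than the raw V.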

Figure 6. Error-convergence comparison for the wine data and the bike-rental data.

Time-Complexity Study
The time-complexity is studied here according to the theoretical analysis in Section 5.1. First, consider an arbitrary consistent system (a random sample drawn from the Monte Carlo trials stated in Section 6.1). The effect of the error tolerance ε on the time-complexity can be visualized in Figure 7: the time-complexity increases as ε decreases. In addition, we investigate the effects of the number of processors p and the dimension k on the time-complexity; the results versus p and k are presented in Figure 8 subject to ε = 10^−5. On the other hand, let us now focus on an arbitrary inconsistent system (an arbitrary Monte Carlo trial as stated in Section 6.1). The corresponding time-complexities for ε = 10^−5 and ε = 10^−4 are delineated in Figure 9. Since one more vector has to be updated in Algorithm 2 compared with Algorithm 1, the time-complexities shown in Figure 9 are higher than those shown in Figure 7 subject to the same ε. Because our derived error-estimate bound is tighter than that presented in [46] for the EGSA, the time-complexity of the proposed method for an inconsistent system is reduced by about 60% relative to that of the approach proposed in [46] subject to the same ε. How the number of processors p and the dimension k affect the time-complexity for inconsistent systems is illustrated in Figure 10 subject to the error tolerance ε = 10^−5. The curves denoted by "ZF" illustrate the theoretical time-complexity error-bounds for solving the original system involving the matrix V without factorization (theoretical results from [46]).
According to [50], we define the spectral radius η(A) of the "iteration matrix" A def= V*V, where V is given by Equation (1), by:

η(A) def= max{|λ_1|, |λ_2|, . . ., |λ_|A||},

where |A| denotes the cardinality of the spectrum of A and λ_1, λ_2, . . ., λ_|A| specify the eigenvalues of A. In Figure 11, we delineate the time-complexities required by the closed-form solution (denoted by "Closed-Form" in the figure) and our proposed iterative Gauss–Seidel approach (denoted by "GS" in the figure) versus the dimension n, for matrices V with different spectral radii η(V*V) = 0.9, 0.5, and 0.1, subject to ε = 10^−10 for an inconsistent system (m = 1.25n). Even under such a small error tolerance ε = 10^−10, the time-complexity required by the closed-form solution to Equation (1) is still much larger than that required by the iterative Gauss–Seidel algorithm proposed in this work, even when only a single processor is used. As shown in Figure 11, the time-complexity advantage of our proposed approach becomes more significant as the dimension grows. Table 1 compares the run-times of the LU matrix-factorization method in [51] and the alternating least-squares (ALS) method in [52] with respect to the different dimensions involved in the factorization step formulated by Equation (1). According to Table 1, the ALS method leads to a shorter run-time than the LU matrix-factorization method. Table 2 compares the run-times of the LAPACK solver [53], our proposed Gauss–Seidel algorithms with the factorization step formulated by Equation (1) (abbreviated as "Fac. Inc." in the tables), and our proposed Gauss–Seidel algorithms without the factorization step (abbreviated as "Fac. Exc." in the tables), for ρ_max = 0.99 and ρ_max = 0.01. If ρ_max = 0.99, the convergence of our proposed Gauss–Seidel algorithms is slow, since ρ_max is close to one, and thus they require a longer time than the LAPACK solver.
However, our proposed method outperforms the LAPACK solver when ρ_max is small, since our proposed Gauss–Seidel algorithms then converge to the solution much faster.
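The spectral radius used in the comparison above is straightforward to compute; the contrived diagonal V below is our own example, chosen so that the eigenvalues of A = V*V are known in advance:

```python
import numpy as np

def spectral_radius(A):
    """Spectral radius: largest eigenvalue magnitude of A."""
    return max(abs(np.linalg.eigvals(A)))

# Contrived V whose Gram matrix A = V*V has eigenvalues 0.9, 0.5, 0.1.
V = np.diag([0.9, 0.5, 0.1]) ** 0.5
A = V.conj().T @ V
assert np.isclose(spectral_radius(A), 0.9)
```

A spectral radius close to one signals slow iterative convergence, matching the ρ_max = 0.99 versus ρ_max = 0.01 behavior reported in Table 2.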

Memory-Complexity Study
The memory-complexity is also investigated here, according to the theoretical analysis stated in Section 5.2. Figure 12 depicts the required memory-complexity for solving an arbitrary consistent system (the same system as used in Section 6.3) using Algorithm 1. The memory-complexity is evaluated for different dimensions: k = 0.2n, k = 0.1n, and k = 0.05n; we further set m = 1.25n. On the other hand, for an arbitrary inconsistent system (again the same as in Section 6.3), all of the aforementioned values of m, n, and k remain the same and Algorithm 2 is applied instead. Figure 13 plots the required memory-complexity for solving an inconsistent system using Algorithm 2. In Figures 12 and 13, for both consistent and inconsistent systems, we also present the required memory-complexity for solving the original system involving the matrix V without factorization. According to Figures 12 and 13, the storage-efficiency can be improved significantly, by 75% to 90% (depending on the dimension k).

Conclusions
For a wide variety of big-data analytics applications, we designed two new efficient parallel algorithms, built upon the Gauss–Seidel algorithm, to solve large linear-regression problems for both consistent and inconsistent systems. This new approach saves computational resources by transforming the original problem into subproblems involving factorized matrices of much smaller dimensions. Meanwhile, theoretical expected-error estimates were derived to study the convergence of the new algorithms for both consistent and inconsistent systems. Two crucial computational-resource metrics, namely time-complexity and memory-complexity, were evaluated for the proposed new algorithms. Numerical results from artificial simulations and real-world data demonstrated the convergence and the efficiency (in terms of computational-resource usage) of the proposed new algorithms; our proposed new approach is much more efficient in both time and memory than the conventional method. Since prevalent big-data applications frequently involve linear-regression problems of tremendous dimensions, our proposed new algorithms can be deemed very impactful and convenient for future big-data computing technology. In the future, we would like to consider the problem of how to factorize the matrix V so as to make ρ_max = max(ρ_W, ρ_H) as small as possible. With a smaller ρ_max, we can expect faster convergence of our proposed Gauss–Seidel algorithms. In general, it is not always possible for the linear system characterized by V to admit a small value of ρ_max; the future research suggested here would help us overcome this main challenge.