Sampling-Based Histogram PCA and Its MapReduce Parallel Implementation on Multicore

Abstract: In existing principal component analysis (PCA) methods for histogram-valued symbolic data, projection results are approximated based on Moore's algebra and fail to reflect the data's true structure, mainly because there is no precise, unified calculation method for the linear combination of histogram data. In this paper, we propose a new PCA method for histogram data that distinguishes itself from various well-established methods in that it projects observations onto the space spanned by principal components more accurately and rapidly by sampling through a MapReduce framework. The new histogram PCA method is implemented under the same assumption of "orthogonal dimensions for every observation" as the existing literature. To project observations, the method first samples from the original histogram variables to acquire single-valued data, on which linear combination operations can be performed. The projection of each observation is then given by the linear combination of the loading vectors and the single-valued samples, which closely approximates the exact projection. Finally, the projection is summarized back into histogram data. These procedures involve complex algorithms and large-scale data, which makes the new method time-consuming. To speed it up, we undertake a parallel implementation of the new method in a multicore MapReduce framework. A simulation study and an empirical study confirm that the new method is effective and time-saving.


Introduction
Generalized Principal Component Analysis (PCA) is an important research tool in Symbolic Data Analysis (SDA) [1]. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PCs) [2][3][4]. PCA is commonly used for dimension reduction [5,6] by specifying a few PCs that account for as much of the variability in the original dataset as possible. It is well known that PCA was primarily developed for single-valued variables. However, the explosion of big data from a wide range of application domains presents new challenges for traditional PCA, as it is difficult to gain insight from massive observations even in a low-dimensional space. Symbolic data analysis [7][8][9][10] has opened a new path for solving this problem.
Symbolic data was first introduced by [7] and aims to summarize large-scale data with conceptual observations described by symbolic descriptors, such as interval data, histogram data, and continuous distribution data. Application fields of symbolic data include economic management, finance, and other domains.

The remainder of this paper is organized as follows. In Section 2, we propose a sampling-based histogram PCA method. Section 3 describes the implementation of the new method in a multicore MapReduce framework. Next, in Section 4, a simulation study of the new method and its parallel implementation is presented; we also compare the execution times of serial and parallel processing. In Section 5, we report the results of our empirical study. Finally, Section 6 outlines our conclusions.

Sampling-Based Histogram PCA
In this section, we first define the inner product operator for histogram-valued data, then derive histogram PCA, and finally project the observations onto the space spanned by principal components using sampling.

Basic Operators
Let X be a data matrix that can be expressed as a vector of observations or of variables:

$$X = (e_1, \ldots, e_n)^T = (X_1, \ldots, X_p), \tag{1}$$

where $e_i = (x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$, denotes the $i$th observation and $X_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, denotes the $j$th variable. The random variable $x_{ij}$ represents the $j$th variable of the $i$th observation, with realization $x_{ij}$ in the classical case and $\xi_{ij}$ in the symbolic case. Next, we give the empirical distribution functions and descriptive statistics for histogram data. For further detail, refer to [9,26].
Let $e_i$, $i = 1, \ldots, n$, be a random sample, and let the $j$th variable have a histogram-valued realization

$$\xi_{ij} = \left\{ [a^l_{ij}, b^l_{ij}),\ p^l_{ij};\ l = 1, \ldots, s_{ij} \right\}, \tag{2}$$

where $[a^l_{ij}, b^l_{ij})$ is the $l$th subinterval of $\xi_{ij}$, $p^l_{ij}$ is its associated relative frequency, and $s_{ij}$ denotes the number of subintervals in $\xi_{ij}$. Then, $a^l_{ij} \le b^l_{ij}$ for all $l = 1, \ldots, s_{ij}$, and $\sum_{l=1}^{s_{ij}} p^l_{ij} = 1$.

Reference [27] extended the empirical distribution function derived by [26] for interval data to histogram data. Based on the assumption that all values within each subinterval $[a^l_{ij}, b^l_{ij})$ are uniformly distributed, they defined the empirical distribution of a point $w$ within subinterval $[a^l_{ij}, b^l_{ij})$ as

$$F_{ij}(w) = \sum_{k=1}^{l-1} p^k_{ij} + p^l_{ij}\,\frac{w - a^l_{ij}}{b^l_{ij} - a^l_{ij}}. \tag{3}$$

Furthermore, assume that each object is equally likely to be observed with probability $1/n$; then, the empirical distribution function of $W$ is

$$F_j(w) = \frac{1}{n} \sum_{i=1}^{n} F_{ij}(w). \tag{4}$$

The empirical density function of $W$ is defined as

$$f_j(w) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{s_{ij}} \frac{p^l_{ij}}{b^l_{ij} - a^l_{ij}}\, \mathbb{1}\!\left(w \in \xi^l_{ij}\right), \tag{5}$$

where $\xi^l_{ij} = [a^l_{ij}, b^l_{ij})$. The symbolic sample mean derived from the density function shown in Equation (5) is

$$\bar{X}_j = \frac{1}{2n} \sum_{i=1}^{n} \sum_{l=1}^{s_{ij}} \left(a^l_{ij} + b^l_{ij}\right) p^l_{ij}. \tag{6}$$

Accordingly, the centralization of $x_{ij}$ is given by

$$\xi'_{ij} = \left\{ [a^l_{ij} - \bar{X}_j,\ b^l_{ij} - \bar{X}_j),\ p^l_{ij};\ l = 1, \ldots, s_{ij} \right\}. \tag{7}$$

After centralization, the symbolic sample variance and covariance are, respectively,

$$S_j^2 = \frac{1}{3n} \sum_{i=1}^{n} \sum_{l=1}^{s_{ij}} p^l_{ij} \left[ (a^l_{ij})^2 + a^l_{ij} b^l_{ij} + (b^l_{ij})^2 \right] \tag{8}$$

and

$$S_{jj'} = \frac{1}{4n} \sum_{i=1}^{n} \left( \sum_{l=1}^{s_{ij}} p^l_{ij} (a^l_{ij} + b^l_{ij}) \right) \left( \sum_{l=1}^{s_{ij'}} p^l_{ij'} (a^l_{ij'} + b^l_{ij'}) \right). \tag{9}$$
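To make these operators concrete, the following is a minimal sketch in Python, assuming each histogram realization is stored as a list of (a, b, p) triples (subinterval [a, b) with relative frequency p); the representation and function names are our own illustrative choices.

```python
def symbolic_mean(column):
    """Symbolic sample mean of one histogram variable, Equation (6)."""
    n = len(column)
    return sum(p * (a + b) / 2.0 for hist in column for (a, b, p) in hist) / n

def centralize(column):
    """Shift every subinterval by the symbolic mean, Equation (7)."""
    m = symbolic_mean(column)
    return [[(a - m, b - m, p) for (a, b, p) in hist] for hist in column]

def symbolic_variance(column):
    """Symbolic sample variance, Equation (8), for centralized data."""
    n = len(column)
    return sum(p * (a * a + a * b + b * b) / 3.0
               for hist in column for (a, b, p) in hist) / n

# Example: two observations of one histogram variable
col = [[(0.0, 1.0, 0.5), (1.0, 2.0, 0.5)],
       [(1.0, 3.0, 1.0)]]
col_c = centralize(col)
print(symbolic_mean(col_c))      # approximately 0 after centralization
print(symbolic_variance(col_c))
```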

The PCA Algorithm of Histogram Data
Based on the operators above, we begin to derive histogram-valued PCs. For simplicity of notation, we assume that all histogram-valued data units have been centralized.
The $k$th histogram-valued PC $Y_k$ ($k = 1, \ldots, p$) is a linear combination of $X_1, X_2, \ldots, X_p$,

$$Y_k = u_{1k} X_1 + u_{2k} X_2 + \cdots + u_{pk} X_p, \tag{10}$$

where $u_k = (u_{1k}, \ldots, u_{pk})^T \in \mathbb{R}^p$ is subject to $u_k^T u_k = 1$ and $u_k^T u_l = 0$, $\forall l \ne k$. Then, the symbolic sample variance of $Y_k$ ($k = 1, \ldots, p$) can be derived from

$$S^2(Y_k) = u_k^T \Sigma u_k, \tag{11}$$

where $\Sigma$ represents the sample covariance matrix of $X_1, X_2, \ldots, X_p$, in which $S_j^2$ and $S_{jj'}$ ($j, j' = 1, \ldots, p$) are given in Equations (8) and (9).

The following process is the same as that for classical PCA (for details, see the original sources). Set $u_1, u_2, \ldots, u_m$ as the eigenvectors of $\Sigma$ corresponding to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$. The derivation of the PC coefficients is thus transformed into the eigendecomposition of the covariance matrix. The contribution rate (CR) of the $m$th PC to the total variance can be measured by

$$CR_m = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}. \tag{12}$$
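As an illustration, the eigendecomposition and contribution rates of Equations (11) and (12) can be computed with numpy; the 3 × 3 covariance matrix below is an assumed example, and histogram_pca is our own helper name.

```python
import numpy as np

def histogram_pca(cov):
    """Eigendecompose the symbolic covariance matrix and report
    loadings, eigenvalues, and contribution rates (Equations (11)-(12))."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cr = eigvals / eigvals.sum()             # contribution rate of each PC
    return eigvecs, eigvals, cr

# Example with an assumed symbolic covariance matrix
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.5],
                  [0.3, 0.5, 1.0]])
U, lam, cr = histogram_pca(Sigma)
print(cr)  # share of total variance explained by each PC
```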

The Projection Process
In existing PCA methods for histogram data, the projection results are approximated based on Moore's algebra [19] and fail to reflect true data structures, mainly because there are no precise, unified methods for linearly combining histogram data.
In this paper, we project observations onto the space spanned by PCs with a sampling method. Using the concept of symbolic data, histogram-valued data can be generated from a mass of numerical data. Conversely, we can also sample from histogram-valued data to obtain numerical data on which linear combination operations can be performed. Once the PC coefficients and the numerical variables corresponding to the original histogram-valued variables are obtained, we can calculate the numerical projections by linearly combining them. Then, histogram-valued projections of the original observations can be summarized from the numerical projections.

Let $x_{ij}$ represent the $i$th observation unit of the $j$th variable $X_j$ in the histogram-valued matrix X; $\xi_{ij}$ in Equation (2) is a realization of $x_{ij}$. Based on the assumption that all values within each subinterval $[a^l_{ij}, b^l_{ij})$ are uniformly distributed, to obtain numerical data corresponding to $\xi_{ij}$, we select $m_i$ samples from the subintervals of $\xi_{ij}$ through random sampling. To maintain the coherence of the numerical sample matrix, the sample sizes of different variables from the same observation are assumed to be equal. After the random sampling process, the numerical sample matrix corresponding to the original histogram-valued matrix, which we call matrix P, can be obtained. As the dimension of X is $n \times p$, the dimension of P is $\left(\sum_{i=1}^{n} m_i\right) \times p$, with columns $P_1, P_2, \ldots, P_p$. According to Equation (10), the $k$th numerical PC $NY_k$ ($k = 1, \ldots, p$) is a linear combination of $P_1, P_2, \ldots, P_p$,

$$NY_k = u_{1k} P_1 + u_{2k} P_2 + \cdots + u_{pk} P_p = P u_k. \tag{13}$$

The calculation in Equation (13) can be performed because $P_1, P_2, \ldots, P_p$ are numerical variables, so the projection approach in Equation (13) is the same as in classic PCA. Finally, we obtain histogram-valued PCs by summarizing the numerical PCs into histograms. In this way, histogram-valued observations are successfully projected onto the space spanned by the PCs.
The above analysis can be summarized by the following algorithm (a minimal sketch of the sampling and projection steps follows the list):

• Step 1: Centralize the histogram-valued matrix X in Equation (1) using Equations (6) and (7), keeping the notation unchanged for simplicity.

• Step 2: Calculate the covariance matrix $\Sigma$ of the centralized histogram data using Equations (8) and (9).

• Step 3: Perform the eigendecomposition of $\Sigma$ to obtain the PC coefficients $u_1, \ldots, u_p$ and the contribution rates in Equation (12).

• Step 4: Obtain the numerical matrix P through random sampling from X.

• Step 5: Compute the numerical PCs $NY_1, \ldots, NY_p$ using Equation (13).

• Step 6: Summarize the numerical PCs into histogram-valued PCs $Y_1, \ldots, Y_p$.
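The following is a minimal Python sketch of Steps 4–6, assuming the (a, b, p) histogram representation used in the earlier sketch; the helper names and the bin count in Step 6 are our own illustrative choices, not part of the original algorithm.

```python
import numpy as np

def sample_histogram(hist, m, rng):
    """Step 4: draw m numerical values from one histogram realization."""
    a, b, p = (np.array(t) for t in zip(*hist))
    idx = rng.choice(len(p), size=m, p=p)   # pick subintervals by weight
    return rng.uniform(a[idx], b[idx])      # uniform within each subinterval

def project_observations(X_hist, U, m, seed=0):
    """Steps 4-6: sample, project with loadings U, summarize as histograms."""
    rng = np.random.default_rng(seed)
    hist_pcs = []
    for obs in X_hist:                      # obs is a list of p histograms
        P_i = np.column_stack([sample_histogram(h, m, rng) for h in obs])
        NY_i = P_i @ U                      # Equation (13): numerical PCs
        # Step 6: summarize each PC's m sampled projections into a histogram
        hist_pcs.append([np.histogram(NY_i[:, k], bins=10)
                         for k in range(U.shape[1])])
    return hist_pcs
```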
During this procedure, it is crucial to guarantee that any column of the matrix P obtained by sampling has the same distribution as the corresponding column in X, or that the distance between these two distributions is small enough to be negligible. Because the empirical cumulative distribution function approximates the true cumulative distribution function well, the difference between these two distributions converges to zero as the sample size goes to infinity. Here, we give some key conclusions from the perspective of asymptotic theory.
The Kolmogorov-Smirnov distance is a useful measure of the global closeness between two distributions:

$$D_m = \sup_x \left| F_m(x) - F(x) \right|, \tag{14}$$

where $F_m$ denotes the empirical distribution function of an $m$-point sample drawn from $F$. The asymptotic properties of the Kolmogorov-Smirnov distance $D_m$ have been studied and many related conclusions have been established. The extreme fluctuations in the convergence of $D_m$ to 0 are characterized by [28]

$$\limsup_{m \to \infty} \sqrt{\frac{m}{2 \log \log m}}\; D_m = \sup_x \sqrt{F(x)\left(1 - F(x)\right)} \quad \text{a.s.} \tag{15}$$

Here, we deal with histogram data, in which case $F(x)$ is not continuous; thus, we cannot evaluate $\sup_x F(x)(1 - F(x))$ in general. According to Equation (15), $D_m$ converges at a rate slower than $m^{-1/2}$ by a factor of $(\log \log m)^{1/2}$.
From the pointwise closeness of $F_m(x)$ to $F(x)$, we can easily observe that

$$F_m(x) \sim AN\!\left( F(x),\; \frac{F(x)\left(1 - F(x)\right)}{m} \right), \tag{16}$$

where AN denotes asymptotic normality. Therefore, both the global and local differences between the histogram obtained from the sampled data and the original one can be controlled when the sample size is large, illustrating the feasibility of the proposed sampling method. It should be noted that the original histogram data, which is not continuous, is treated here as the true distribution; the conclusions above still cover discontinuous cases, although the formulas may become more complex. Related theoretical issues will be considered in future work.
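As a numerical illustration (our own, not from the paper), the following sketch estimates $D_m$ for samples drawn from a histogram treated as the true distribution; the piecewise-linear CDF helper matches the (a, b, p) representation assumed above.

```python
import numpy as np

def hist_cdf(hist, x):
    """Piecewise-linear CDF of a histogram of (a, b, p) triples at points x."""
    F = np.zeros_like(x, dtype=float)
    for a, b, p in hist:
        F += p * np.clip((x - a) / (b - a), 0.0, 1.0)
    return F

def ks_distance(hist, m, seed=0):
    """Kolmogorov-Smirnov distance D_m between the histogram CDF and the
    empirical CDF of an m-point sample, Equation (14)."""
    rng = np.random.default_rng(seed)
    a, b, p = (np.array(t) for t in zip(*hist))
    idx = rng.choice(len(p), size=m, p=p)
    sample = np.sort(rng.uniform(a[idx], b[idx]))
    F = hist_cdf(hist, sample)
    d_plus = np.max(np.arange(1, m + 1) / m - F)   # ECDF above true CDF
    d_minus = np.max(F - np.arange(0, m) / m)      # ECDF below true CDF
    return max(d_plus, d_minus)

hist = [(0.0, 1.0, 0.3), (1.0, 2.0, 0.5), (2.0, 4.0, 0.2)]
for m in (50, 500, 5000):
    print(m, ks_distance(hist, m))  # D_m shrinks roughly like m**(-1/2)
```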

MapReduce for Multicore, Sampling-Based Histogram PCA
The sampling-based histogram PCA approach presented in Section 2 involves complex algorithms and large-scale data, which makes the method time-consuming. To speed it up, we undertake its parallel implementation using a multicore MapReduce framework in this section.

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure that filters and sorts data and a reduce procedure that performs a summary operation [21]. Reference [22] shows that algorithms that can be written in a "summation form" can be easily parallelized on multicore computers. Conveniently, Equations (8), (9), and (13) are written in this form. In addition, the random sampling procedure can also be parallelized. Consequently, we adapt the MapReduce paradigm to implement our histogram PCA algorithm.
The multicore MapReduce framework for histogram PCA is presented in Figure 1, which shows a high-level view of our architecture and how it processes the data. In Step 1, the MapReduce engine calculates the mean of each histogram variable using Equation (6), following the pseudocode shown in Algorithm 1: each map function iterates over every element $\xi_{ij}$ in its input split of observations and emits the partial sums needed for Equation (6). Then, the variables are centralized using Equation (7), setting $\xi_{ij} = \xi_{ij} - \bar{X}_j$ for each element. Every variable has its own engine instance, and every MapReduce task is delegated to its engine. Similarly, calculating the covariance matrix of the histogram data using Equations (8) and (9) in Step 2 (looping over all variable pairs $i, j = 1, \ldots, p$, with the variance case when $i = j$), randomly sampling from X in Step 4 (where each map task draws $P_{ij}$, a numerical sample of size K, from $\xi_{ij}$), and computing the numerical principal components $NY_1, NY_2, \ldots, NY_m$ using Equation (13) in Step 5 can also be delegated to MapReduce engines. The corresponding map and reduce functions are presented in Algorithms 2 and 3.
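Since the paper presents the engines only as pseudocode, the following is a minimal Python sketch of the summation-form mean step of Algorithm 1, using the standard multiprocessing module in place of a real MapReduce cluster; the chunking strategy and function names are our own illustrative assumptions.

```python
from multiprocessing import Pool

# One histogram element is a list of (a, b, p) triples, as in earlier sketches.

def map_partial_mean(chunk):
    """Map: emit (partial weighted-midpoint sum, observation count) per chunk."""
    s = sum(p * (a + b) / 2.0 for hist in chunk for (a, b, p) in hist)
    return s, len(chunk)

def split(lst, k):
    """Split one variable's observations into k roughly equal chunks."""
    step = (len(lst) + k - 1) // k
    return [lst[i:i + step] for i in range(0, len(lst), step)]

def mapreduce_mean(column, n_cores=4):
    """Reduce: combine the partial sums into the symbolic mean, Equation (6)."""
    with Pool(n_cores) as pool:
        parts = pool.map(map_partial_mean, split(column, n_cores))
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total / count

if __name__ == "__main__":  # guard required for multiprocessing on Windows
    col = [[(0.0, 1.0, 0.5), (1.0, 2.0, 0.5)], [(1.0, 3.0, 1.0)]] * 1000
    print(mapreduce_mean(col))  # symbolic mean of the column
```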

Simulation Study
To analyze the performance of the new histogram PCA method and to determine whether MapReduce parallelization can speed it up, we carried out a simulation study involving three examples. The experiments were conducted on a PC with an 8-core Intel i7-6700K 4.00 GHz CPU and 8 GB of RAM running Windows 7.

Example 1
With Example 1, we aim to investigate the accuracy of the histogram PCA algorithm. The simulated symbolic data tables are built as in [29]. Based on the concept of symbolic variables, to obtain n observations associated with a histogram-valued variable $X_k$, we simulate K real values $x^{NUM}_{ik}$ for each unit. These values are then organized into histograms that represent the empirical distribution of each unit. The distribution of the microdata that allows histogram generation, corresponding to each observation of the variables $X_k$ ($k = 1, \ldots, p$), is randomly selected from a mixture of distributions. Without loss of generality, the widths of the subintervals in each histogram are the same for all observations.
Thus, we generate n histogram observations and the corresponding n × K single-value data points. The single-value data points can be viewed as the original data, which we aim to approximate using histogram-valued data. Consequently, experimental PCA results obtained from single-value data can work as an evaluation benchmark for the histogram PCA algorithm.
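A minimal sketch of this construction follows, assuming the mixture components listed in Example 2 (N(1, 1), χ²(1), lnN(0, 0.5), −lnN(0, 0.5)) and equal-width bins; the function names and the bin count are illustrative assumptions.

```python
import numpy as np

def make_histogram(values, bins=10):
    """Summarize K microdata values into an equal-width (a, b, p) histogram."""
    counts, edges = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    return [(edges[l], edges[l + 1], p[l]) for l in range(bins)]

def simulate_table(n=30, p_vars=3, K=5000, seed=0):
    """Generate a histogram table X and its single-value counterpart X_NUM."""
    rng = np.random.default_rng(seed)
    components = [lambda s: rng.normal(1, 1, s),
                  lambda s: rng.chisquare(1, s),
                  lambda s: rng.lognormal(0, 0.5, s),
                  lambda s: -rng.lognormal(0, 0.5, s)]
    X_num = np.empty((n, p_vars, K))
    X_hist = []
    for i in range(n):
        row = []
        for j in range(p_vars):
            vals = components[rng.integers(len(components))](K)  # pick one
            X_num[i, j] = vals
            row.append(make_histogram(vals))
        X_hist.append(row)
    return X_hist, X_num

X_hist, X_num = simulate_table()  # n = 30 observations, p = 3 variables
```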
The detailed process of Example 1 is as follows: (1) Generate a synthetic histogram dataset X with p variables and n observations and the corresponding single-value dataset, denoted as $X^{NUM}$.
(2) Adopt the new histogram PCA method on X and perform classical PCA on $X^{NUM}$. The obtained PC coefficients are denoted as $u_j$ and $u_j^{NUM}$, and the corresponding PC variances are represented by $\lambda_j$ and $\lambda_j^{NUM}$, $j = 1, 2, \ldots, p$.
(3) Compute the following two indicators [30]:

(a) Absolute cosine value (ACV) of the PC coefficients,

$$ACV_j = \frac{\left| u_j^T u_j^{NUM} \right|}{\left\| u_j \right\| \left\| u_j^{NUM} \right\|}.$$

Since the cosine measures the angle between two vectors, the ACV describes the similarity between the PC coefficients $u_j$ and the benchmark $u_j^{NUM}$. A higher ACV indicates better performance of the new histogram PCA method.

(b) Relative error (RE) of the PC variances,

$$RE_j = \frac{\left| \lambda_j - \lambda_j^{NUM} \right|}{\lambda_j^{NUM}}.$$

Taking the PC variances obtained from $X^{NUM}$ as a benchmark, the lower the RE, the better the new histogram PCA method has performed.
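As a small illustration (our own helper functions, not from [30]), the two indicators can be computed as follows:

```python
import numpy as np

def acv(u, u_num):
    """Absolute cosine value between two loading vectors."""
    return abs(u @ u_num) / (np.linalg.norm(u) * np.linalg.norm(u_num))

def rel_error(lam, lam_num):
    """Relative error of a PC variance against its benchmark."""
    return abs(lam - lam_num) / lam_num

# Near-identical loadings give an ACV close to 1
print(acv(np.array([0.7, 0.7, 0.1]), np.array([0.71, 0.69, 0.12])))
print(rel_error(2.05, 2.0))  # 0.025
```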
The parameters for the experiments are set as follows: n = 30, p = 3, K = 50, 100, 500, 1000, 5000. We can compare the performance of the new histogram PCA method with the single-value benchmark based on the resulting PC coefficients and variances.
The comparative results for the PC coefficients, measured by ACV, are shown in Table 1, and the comparative results for the PC variances, measured by RE, are shown in Table 2.

Example 2
In this example, we consider the precision of the sampling-based projection process. The simulated symbolic data tables were built in the same way as in Example 1. To obtain n observations associated with the histogram-valued variable $X_k$, we simulated 5000 real values $x^{NUM}_{ik}$ for each unit. These values were then organized into histograms. The distribution of the microdata was also randomly selected from a mixture of distributions: {N(1, 1), χ²(1), lnN(0, 0.5), −lnN(0, 0.5)}.
As in Example 1, for all observations, the widths of the subintervals in each histogram are the same.
Thus, we generate n histogram observations and the corresponding n × 5000 single-value data points. The histogram dataset was used to compare our sampling-based histogram PCA with the PSPCA method proposed by [16]. The single-value data points can also be viewed as the original data we aim to approximate using histogram-valued data. Consequently, experimental PCA results obtained from single-value data can serve as the evaluation benchmark for the results of the histogram PCA algorithm.
Per Section 2.3, during the projection process, we select $m_i$ single-value samples from the $i$th observation. In this example, we assume that the single-value sample sizes are the same for every observation and denote this common size by M. The detailed process of Example 2 is as follows: (1) Generate a synthetic histogram dataset X with p variables and n observations and the corresponding single-value dataset, denoted as $X^{NUM}$.
(2) Apply the sampling-based histogram PCA and PSPCA methods to X. The obtained histogram PCs are denoted as $Y_j(M)$ and $Y_j^{PSPCA}$, respectively, where M is the sample size for each histogram unit. Classical PCA is performed on $X^{NUM}$; the obtained histogram PCs are denoted $Y_j^{NUM}$.
(3) Compute the Wasserstein distances $d(Y_j(M), Y_j^{NUM})$ and $d(Y_j(M), Y_j^{PSPCA})$ between the resulting PCs [31].
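The Wasserstein distance between two histogram-valued PCs can be computed, for instance, with scipy; the following sketch coarsely approximates each subinterval by its midpoint with weight p, which is an assumption of ours rather than the paper's exact procedure.

```python
from scipy.stats import wasserstein_distance

def hist_wasserstein(h1, h2):
    """1-Wasserstein distance between two (a, b, p) histograms,
    approximating each subinterval by its midpoint (a coarse sketch)."""
    mids = lambda h: [(a + b) / 2.0 for (a, b, _) in h]
    wts = lambda h: [p for (_, _, p) in h]
    return wasserstein_distance(mids(h1), mids(h2), wts(h1), wts(h2))

h1 = [(0.0, 1.0, 0.5), (1.0, 2.0, 0.5)]
h2 = [(0.5, 1.5, 0.5), (1.5, 2.5, 0.5)]
print(hist_wasserstein(h1, h2))  # 0.5: h2 is h1 shifted right by 0.5
```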
The parameters for the experiments are set as n = 30, p = 3, and M = 50, 100, 500, 1000, 5000. We can then compare the performance of our sampling-based histogram PCA method with the single-value benchmark and the PSPCA method based on the resulting PCs, as shown in Figure 2.

The solid line shows the Wasserstein distance between $Y_j(M)$ and $Y_j^{PSPCA}$, and the dotted line denotes the Wasserstein distance between $Y_j(M)$ and $Y_j^{NUM}$. The results in the figure indicate that the projections of sampling-based histogram PCA, PSPCA, and the single-value benchmark are generally close. More specifically, the distances decrease as the random sampling size increases; when the sample size exceeds 1000, the differences are close to zero. These results demonstrate the accuracy of the sampling-based histogram PCA projection.

Example 3
In the third example, we verify the time-saving effect of the multicore MapReduce implementation of the new histogram PCA method. We implemented the method in two different versions: one running under MapReduce and the other a serial implementation without the new framework. The execution times of the two approaches are then compared using 1, 2, 4, and 8 cores.
First, we create observations for each histogram-valued variable $X_k$. The simulated symbolic data tables are built as in Examples 1 and 2. Based on the concept of symbolic variables, to obtain n observations associated with a histogram-valued variable $X_k$, we simulate 5000 real values for each unit. These values are then organized into histograms that represent the empirical distribution of each unit. In all observations, the widths of the histogram subintervals are the same.
To perform the simulation experiment, symbolic tables illustrating different situations were created. In this study, a full factorial design is employed, with the following factors: • the distribution of the microdata that allows histogram generation, corresponding to each observation of the variables $X_k$ ($k = 1, \ldots, p$); • the number of variables p.
Based on the simulated histogram-valued variables, we ran the new histogram PCA method in both parallel and serial versions. The random sampling size from the histogram-valued variables in Step 4 is also 5000, the same as the number of real values used to simulate the histogram-valued variables. The indexes $CR_1$ and $CR_2$ are computed using Equation (12) to analyze the behavior of the new histogram PCA method. Moreover, the execution times of both approaches are compared to evaluate the speedup effect.
As the conclusions for the different situations are almost the same, for the sake of brevity, we only provide the results for p = 3 and the uniform distribution in this section. Table 3 shows the contribution rates of the first and second PCs. As can be seen from the table, $CR_1$ and $CR_2$ together explain most of the original information contained in the histogram-valued matrix, indicating that the new histogram PCA method performs well. Figure 3 demonstrates the speedup of the method for 1, 2, 4, and 8 processing cores under the MapReduce framework. In Figure 3, we can see that for a given number of cores, execution time increases with sample size. In addition, when the sample size is held constant, the execution times of the parallel and serial approaches are nearly equivalent when using 1 core. As the number of cores increases, the execution time of the serial approach is essentially unchanged, whereas the execution time of the parallel approach is gradually reduced; the ratio of the execution times between the serial and parallel approaches is nearly equal to the number of cores. We conclude that the speedup achieved by the parallel approach with a multicore MapReduce framework is significant.

Empirical Study
Nowadays, the rapid development of the Internet has created opportunities for movie sites. The major functions of movie sites are to calculate movie ratings based on user rating data and then to recommend high-quality movies. Since the set of user rating data is massive, symbolic data can be utilized.
In this section, we use real data to evaluate the performance of sampling-based histogram PCA. The data consist of 198,315 ratings from a Chinese movie site collected from October 2009 to May 2014. The ratings come from three different types of users: visitors, normal registered users, and expert users. In the first step, we summarize the movie ratings from the different user types to obtain, for each movie, a scoring histogram for each type of user. The dataset then becomes a histogram symbolic table of 500 observations and 3 variables. Finally, sampling-based histogram PCA is conducted on the generated histogram symbolic table. We computed the CR index of each PC and show the loading plot of the first and second PCs in Table 4 and Figure 4. The results show that the first PC summarizes 73.87% of the total information carried by the original variables. The first PC is positively associated with the rating histograms of visitors, normal registered users, and expert users. Thus, we can simply use the first PC as an integrated embodiment of the three types of user ratings, simplifying the evaluation and comparison of different movies.
Next, we verify the performance of the sampling-based projection process using the projection results on the first PC. As the explanatory power of the first PC is very good, if the scores of the first PC can identify the different characteristics of the movies, the projection results are reasonable.
In this empirical study, clustering analysis is implemented on these movies based on the Wasserstein distances [31] between pairs of projections on the first PC, and the movies are divided into five clusters. During the projection process, we randomly draw 1000 single-value samples from each histogram observation. The projection results of the clusters are shown in Figure 5, where the curves in the first row are the projections of the movies in each cluster. For convenience, we use curves instead of histograms to show a movie's scoring probabilities in each score section. The histograms in the second row are the centers of the five clusters in the first row. The results show that the characteristics of the different categories of movies are distinct: the movies in Cluster 1 receive high ratings; movies in Cluster 2 are distributed relatively evenly across all score sections; movies in Cluster 3 are distributed relatively evenly among the low-score sections; and the movies in Clusters 4 and 5 are low-rated, with scores concentrated around 3 points and between 3 and 4 points, respectively.

As can be seen, the final rating histograms clearly reflect the different types of movies, which indicates that the method proposed in this paper performs effectively.

Conclusions
As there is no precise, unified method for the linear combination of histogram-valued symbolic data, this paper presents a more accurate histogram PCA method that projects observations onto the space spanned by PCs by randomly sampling numerical data from the histogram-valued variables. Furthermore, considering that the sampling algorithm is time-consuming and that the sampling can be done separately, we adopt a MapReduce paradigm to implement the new histogram PCA algorithm. A simulation study and an empirical study were performed to analyze the behavior of the new histogram PCA method and to demonstrate the effect of the MapReduce parallel implementation. The new method performs well and can be sped up significantly through a multicore MapReduce framework.

In practice, the sampling method proposed in this paper for the linear combination of histogram data can be widely applied to other multivariate statistical models for histogram data, such as linear regression and linear discriminant analysis. The sampling method provides a new perspective for these models, but further research is needed.

Conflicts of Interest:
The authors declare no conflict of interest.