Normality Testing of High-dimensional Data Based on Principal Component and Jarque–Bera Statistics

Abstract: The testing of high-dimensional normality is an important issue and has been intensively studied in the literature. It depends on the variance–covariance matrix of the sample, and numerous methods have been proposed to reduce the complexity of the variance–covariance matrix. Principal component analysis (PCA) has been widely used in high dimensions, since it can project high-dimensional data into a lower-dimensional orthogonal space. The normality of the reduced data can then be evaluated by Jarque–Bera (JB) statistics in each principal direction. We propose a combined statistic, the summation of one-way JB statistics based on the independence of each principal direction, to test the multivariate normality of data in high dimensions. The performance of the proposed method is illustrated by the significance level and empirical power on simulated normal and non-normal data. Two real examples show the validity of our proposed method.


Introduction
Normality plays an important role in statistical analysis, and there are numerous methods for normality testing presented in the literature. Koziol [1] and Slate [2] used [...]
This paper is organized as follows. Section 2 provides the theory of principal component analysis and gives the methodologies of statistical inference for multivariate normality. In Section 3, some simulated examples of normal data and non-normal data are used to illustrate the efficiency of our proposed method. Two real examples are then investigated in Section 4 to verify the method's effectiveness.

High-dimensional Normality Test Based on PC-type JB statistic
For observed data $X = (x_{ij})_{n \times p}$ with sample size $n$ and dimension $p$, principal component analysis reduces the dimension of the $p$-variate random vector $X$ through linear combinations, searching for the linear combinations with the largest spread among the observed values of $X$, i.e., the largest variances. Specifically, it searches for the orthogonal directions $\omega_i$ ($i = 1, 2, \ldots, p$), which satisfy
$$\omega_i = \arg\max_{\|\omega\| = 1,\ \omega \perp \omega_1, \ldots, \omega_{i-1}} \operatorname{Var}(X\omega). \quad (1)$$
Denoting by $\Sigma$ the covariance matrix of $X$, the eigenvalues $\lambda_i$ and principal components $\omega_i$ ($i = 1, 2, \ldots, p$) can be obtained by spectral decomposition of $\Sigma$. Therefore, the observed data can be projected onto the resulting lower-dimensional space $\{\omega_1, \omega_2, \ldots, \omega_p\}$ by $z_i = X\omega_i$, which gives the projected observation matrix $z$.
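The spectral-decomposition step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes NumPy is available, uses the sample covariance matrix in place of $\Sigma$, and the helper names `principal_directions` and `project` are hypothetical.

```python
import numpy as np

def principal_directions(X):
    """Eigenvalues (largest first) and matching eigenvectors of the sample covariance."""
    X = np.asarray(X, dtype=float)
    sigma = np.cov(X, rowvar=False)            # p x p sample covariance (assumed estimate of Sigma)
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]          # reorder so the largest variance comes first
    return eigvals[order], eigvecs[:, order]

def project(X, directions, r):
    """Scores z_i = X w_i for the first r directions."""
    return np.asarray(X, dtype=float) @ directions[:, :r]

# Illustrative data only: 200 observations of a 5-dimensional standard normal vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
lam, W = principal_directions(X)
Z = project(X, W, r=3)
print(lam)
print(Z.shape)
```

The columns of `Z` are uncorrelated by construction, with sample variances equal to the leading eigenvalues.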
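For each $z_i$, the skewness and kurtosis can be calculated by
$$S_i = \frac{\frac{1}{n}\sum_{j=1}^{n}(z_{ji} - \bar{z}_i)^3}{\left[\frac{1}{n}\sum_{j=1}^{n}(z_{ji} - \bar{z}_i)^2\right]^{3/2}}, \qquad K_i = \frac{\frac{1}{n}\sum_{j=1}^{n}(z_{ji} - \bar{z}_i)^4}{\left[\frac{1}{n}\sum_{j=1}^{n}(z_{ji} - \bar{z}_i)^2\right]^{2}}, \quad (2)$$
where $\bar{z}_i$ stands for the sample mean. Then, the univariate JB statistic can be given by
$$JB_i = n\left(\frac{S_i^2}{6} + \frac{(K_i - 3)^2}{24}\right).$$
To test the normality of high-dimensional data, $z = (z_1, z_2, \ldots, z_r)$, define
$$JB_{sum} = \sum_{i=1}^{r} JB_i,$$
where $r$ stands for the number of principal components ultimately selected, chosen so that the cumulative proportion of variance $\sum_{i=1}^{r}\lambda_i / \sum_{i=1}^{p}\lambda_i$ reaches a prescribed level. Consider the hypotheses
$H_0$: the data are normally distributed vs. $H_1$: the data are non-normally distributed.
Under the null hypothesis $H_0$, each JB statistic is asymptotically $\chi^2(2)$ distributed [13], so, given the independence of the principal directions, $JB_{sum}$ is asymptotically $\chi^2(2r)$ distributed. For a given significance level $\alpha$, the critical region is
$$R(X) = \{X : JB_{sum} > \chi^2_{1-\alpha}(2r)\}.$$
Based on $JB_{sum}$, an exact critical region $R(X)$ can be deduced, and the test can therefore be implemented based on this critical region.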
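A minimal sketch of this test, using only the Python standard library, is given below. It assumes the projected columns $z_1, \ldots, z_r$ are already available as lists, and it relies on the closed form of the $\chi^2$ tail probability for even degrees of freedom, which applies here because the reference distribution is $\chi^2(2r)$. The function names are hypothetical.

```python
import math

def jb_statistic(z):
    """Univariate Jarque-Bera statistic n * (S^2 / 6 + (K - 3)^2 / 24)."""
    n = len(z)
    mean = sum(z) / n
    m2 = sum((v - mean) ** 2 for v in z) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in z) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in z) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)

def chi2_sf_even(x, df):
    """P(chi2(df) > x) for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def jb_sum_test(columns, alpha=0.05):
    """Sum the per-direction JB statistics and compare with chi2(2r)."""
    jb_sum = sum(jb_statistic(z) for z in columns)
    p_value = chi2_sf_even(jb_sum, 2 * len(columns))
    return jb_sum, p_value, p_value < alpha

# Two made-up score columns, purely for illustration (r = 2).
columns = [[-2.0, -1.0, 0.0, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0, 10.0]]
print(jb_sum_test(columns))
```

Rejecting when the p-value falls below $\alpha$ is equivalent to $JB_{sum}$ exceeding the upper-$\alpha$ quantile of $\chi^2(2r)$.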
Evaluating the performance of the proposed PC-type Jarque–Bera test depends on (1) whether the orthogonal axes are chosen according to the cumulative proportion; and (2) whether the hypothesis is rejected or accepted. Composed from the well-known power function, the error will be [...], where $\alpha$ is the probability of a Type-I error and $\beta$ is the probability of a Type-II error.
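The Type-I error rate $\alpha$ and the power $1 - \beta$ can be estimated empirically as rejection frequencies over repeated samples. The sketch below illustrates the idea with the univariate JB test only, to stay short and dependency-free; it is not the paper's simulation design. The helper names, the sample size, and the 500 replications are assumptions chosen for speed.

```python
import math
import random

def jb_rejects(z, alpha=0.05):
    """True if the univariate JB test rejects normality at level alpha."""
    n = len(z)
    mean = sum(z) / n
    m2 = sum((v - mean) ** 2 for v in z) / n
    m3 = sum((v - mean) ** 3 for v in z) / n
    m4 = sum((v - mean) ** 4 for v in z) / n
    jb = n * ((m3 / m2 ** 1.5) ** 2 / 6.0 + (m4 / m2 ** 2 - 3.0) ** 2 / 24.0)
    # The chi2(2) survival function is exp(-x/2), so the critical value is -2 ln(alpha).
    return jb > -2.0 * math.log(alpha)

def rejection_rate(sampler, n, reps, alpha=0.05, seed=0):
    """Fraction of simulated samples of size n for which the test rejects."""
    rng = random.Random(seed)
    hits = sum(jb_rejects([sampler(rng) for _ in range(n)], alpha) for _ in range(reps))
    return hits / reps

# Under H0 the rejection rate estimates alpha; under H1 it estimates the power 1 - beta.
size = rejection_rate(lambda rng: rng.gauss(0.0, 1.0), n=100, reps=500)
power = rejection_rate(lambda rng: rng.expovariate(1.0), n=100, reps=500)
print(size, power)
```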

Therefore, we can see that the power is a non-decreasing function of the parameter s.

To evaluate the performance of the aforementioned test, some simulation experiments [...] All of these methods are studied on 2000 simulated datasets.
Table 1: Significance level of PC-type Jarque–Bera (JB) testing for normally distributed data for Case I compared with other methods
(2) In the case of p = 30, as in Figure 2 [...]
We may see from [...] 3 we can see the results of these two sets. This time, the p-values of the ten methods are no longer as high as before, meaning that our method performs well in assessing the normality of normal and non-normal data.

Body data example
In this part, we analyze the normality of the body data investigated in [14] to show the consistency of our method with existing methods and earlier conclusions. This dataset contains 100 human individuals, and each individual has 12 measurements of the human body (see [14] for details). As before, the p-values of the PC-type statistics and the $S_k$-type, $K_u$-type, and Kazuyuki's statistics are computed.
Let $B_0$ denote the whole dataset; its multivariate normality can be investigated through the p-values of each method shown in Table 4. Since all the p-values approach 0, we may conclude that this dataset contains non-normal data. Following the discussion in [14], we also investigate six other datasets to show the validity of our proposed method and to compare it with other methods. For convenience, we denote $B_1 = (X_1, X_3, X_8, X_{10}, X_{12})$, $B_2 = (X_1, X_3, X_8, X_{10})$, $B_3 = (X_1, X_8, X_{10}, X_{12})$, $B_4 = (X_3, X_8, X_{10}, X_{12})$, $B_5 = (X_4, X_5, X_6, X_{11})$, and $B_6 = (X_2, X_4, X_6, X_{11})$. This phenomenon indicates that when testing real multivariate normally distributed data, our method results in a slightly higher p-value than the compared $S_k$-type and $K_u$-type statistics, whereas for non-normally distributed data, our method shows a relatively lower p-value. Thus, our PC-type statistic $JB_{sum}$ constitutes a more effective way of testing normality for both normal and non-normal data, with more stable testing results.

Conclusion
The purpose of this paper is to use a JB-type testing method to test high-dimensional normality. The proposed statistic, $JB_{sum}$, is the sum of the univariate JB statistics computed after the dimension reduction performed by PCA.
Through simulated experiments, we find that in both low and high dimensions, $JB_{sum}$ performs well in testing normal and non-normal data, and it is more stable than many of the compared methods. Therefore, it can be used to test normality effectively.
From two real examples, we can also see that our proposed method gives stable results when testing the normality of real datasets, and that its p-values tend to reflect the true normality of the data.

Figures 1-5 show the comparisons of the power for different dimensions p and various sample sizes n.
(1) Figure 1 indicates that in the case of p = 5, $Z^*_{M1}$'s performance is best in all three cases. Though $Z^*_{M2}$ performs well in Case I and Case II, it is not as good in Case V.

Figure 4. Empirical power of proposed PC-type JB testing compared with other methods (p=100)

SPECTF Heart data example
The SPECTF heart dataset [15] provides data on cardiac single photon emission computed tomography (SPECT) images. It describes the diagnosis of these images, and each patient is classified into one of two categories: normal and abnormal. The data contain 267 instances, each belonging to a patient, along with 44 continuous feature patterns summarized from the original SPECT images. The other attribute is a binary variable that indicates the diagnosis of each patient, with 0 for normal and 1 for abnormal.
In this dataset, we simultaneously evaluate the normality of the whole dataset and each class within it. The testing p-value of each method mentioned above is shown in Table 3.
Let $S_0$ denote the whole dataset and $S_1$ and $S_2$ denote the normal class dataset and abnormal class dataset, respectively. We calculate the p-values of our PC-type statistic as well as the $S_k$-type and $K_u$-type statistics and other methods mentioned in [16,17] for these three datasets. Since all ten statistics' p-values for $S_0$ and $S_1$ are very close to 0, which indicates a non-normal distribution of the whole dataset and the abnormal dataset, we will not describe them here.

11971214, 81960309), sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, Ministry of Education of China, and supported by Cooperation Project of Chunhui Plan of the Ministry of Education of China 2018. The authors would also like to thank the Editor-in-Chief and the referees for their suggestions to improve the paper.

Table 1 and Table 2 describe the significance level of the PC-type JB test $JB_{sum}$ compared [...]
From the table above we can conclude that the significance level of $JB_{sum}$ is close [...]

(III) Shifted $\chi^2(1)$: every variable in $X_{n \times p}$ was centralized, with independent and identical distribution $\chi^2(1)$.
(IV) Shifted exp(1): every variable in $X_{n \times p}$ was centralized, with independent and identical distribution exp(1).
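Cases (III) and (IV) can be simulated with the standard library alone, as in the sketch below. It assumes that "centralized" means subtracting the population mean, which equals 1 for both $\chi^2(1)$ and exp(1); the source does not spell this out, and the function names are hypothetical.

```python
import random

def shifted_chi2_1(n, p, rng):
    """n x p matrix of i.i.d. chi2(1) draws, shifted to mean zero."""
    # chi2(1) is a Gamma distribution with shape 1/2 and scale 2, so its mean is 1.
    return [[rng.gammavariate(0.5, 2.0) - 1.0 for _ in range(p)] for _ in range(n)]

def shifted_exp_1(n, p, rng):
    """n x p matrix of i.i.d. exp(1) draws, shifted to mean zero."""
    return [[rng.expovariate(1.0) - 1.0 for _ in range(p)] for _ in range(n)]

rng = random.Random(0)
X3 = shifted_chi2_1(5000, 3, rng)   # Case (III)
X4 = shifted_exp_1(5000, 3, rng)    # Case (IV)
print(len(X3), len(X3[0]), len(X4), len(X4[0]))
```

Both distributions are strongly right-skewed, so these cases serve as non-normal alternatives when estimating empirical power.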
The performance of $JB_{sum}$ compared with the $S_k$-type statistics $\chi^2_{sk}$, $S_{kmax}$ [14], the $K_u$-type statistics [...] [17], Kazuyuki's statistic mJBM [16] and Rie's statistic ZNT [18] is illustrated in Figures 1-5. Since $JB_{sum}$, $\chi^2_{sk}$, and $\chi^2_{ku}$ are based on the sum of $\chi^2$, we call them sum-type. $S_{kmax}$ and $K_{umax}$ come from the maximum of $\chi^2$, and thus we call them max-type.
[...] although $Z^*_{M1}$ and $Z^*_{M2}$ perform better than $JB_{sum}$ in Case IV, they do not maintain stable results like $JB_{sum}$ in Case III. In fact, $JB_{sum}$'s performance is generally better than the other methods mentioned here among all three cases.
(3) In Figure 3, where p = 50, $JB_{sum}$'s performance is best among others except $Z^*_{M2}$. As [...]
(4) With the increase in dimension, as seen in Figure 4, $Z^*_{M1}$ no longer performs as well as before, and mJBM is still not stable enough when n is close to p. Although $K_{umax}$'s performance is better than $JB_{sum}$'s at first, it is surpassed by the latter when [...]
(5) In Figure 5, as in p = 100, the power of $K_{umax}$ is initially higher than $JB_{sum}$, and is eventually surpassed by $JB_{sum}$. Except for $Z^*_{M2}$ and $K_{umax}$, $JB_{sum}$'s performance is [...]
From the phenomenon above, we may conclude that $JB_{sum}$ performs well compared to the other statistics, in that its power is relatively higher than the others and the corresponding simulation results are more stable. Thus, it can be used to test the non-normality of low- or high-dimensional data effectively.

Two Real Examples
In this section, we investigate two real examples to illustrate the performance of our proposed method compared with the nine aforementioned existing methods.
[...] Table 3 that $S_2$'s corresponding p-values are a little different from [...]
In this normal category, we extract some variables and construct a new dataset $S_3$ from several experiments. The selected variables included in $S_3$ are $X_2$, $X_4$, $X_6$, $X_7$, $X_9 \sim X_{12}$, $X_{14} \sim X_{21}$, $X_{23} \sim X_{28}$, $X_{31} \sim X_{34}$, and $X_{37} \sim X_{43}$. We then compute the p-values of this dataset, and the results are shown in Table 3. It can be seen that all normality testing methods have a relatively high p-value, which demonstrates the multivariate [...]

Table 4: p-values of the six statistics of body data

This research was funded by the National Natural Science Foundation of China (No.