Heterogeneous Learning of Functional Clustering Regression and Application to Chinese Air Pollution Data

Clustering algorithms are widely used to mine the heterogeneity among meteorological observations. However, traditional applications suffer from information loss during data processing and pay little attention to the interactions between meteorological indicators. In this paper, we combine the ideas of functional data analysis and clustering regression and propose a functional clustering regression heterogeneity learning model (FCR-HL), which respects the data generation process of meteorological data while incorporating the interactions between meteorological indicators into the analysis of heterogeneity. In addition, we provide an algorithm for FCR-HL that automatically selects the number of clusters and has good statistical properties. In an empirical study of PM2.5 and PM10 concentrations in China, we find that the interaction between PM10 and PM2.5 varies significantly across regions, exhibiting several distinct and significant patterns, which provides meteorologists with new perspectives for further studying the effects between meteorological indicators.


Introduction
Environmental pollution is a pressing global issue that has received close attention from governments worldwide. Among air pollutants, particulate matter (PM) poses the greatest risk to human health [1]. The Organization for Economic Cooperation and Development (OECD) estimates that air pollution will be the leading environmental cause of death by 2050. It is therefore important to mine the regional heterogeneity of air pollution and its internal patterns. Meteorological quantities such as temperature, humidity, atmospheric pressure, and pollutant concentrations change continuously in the atmosphere at a given location. Unfortunately, we cannot observe the underlying continuous curves directly. The usual approach is to sample at a given time interval, yielding discrete time-series data. Obviously, no matter how densely we sample, some information loss is unavoidable. When mining heterogeneity, it is necessary to calculate distances between meteorological data from different regions. There are generally two approaches for discrete meteorological data. The first is to extract representative summary statistics from the time series, e.g., the mean and variance [2,3]; the second is to treat the time series as vectors in an ordinary high-dimensional Euclidean space. The former produces further information loss through data processing, while the latter preserves the full discrete information but suffers from the "curse of dimensionality" when calculating distances between high-dimensional sample observations. Using functional data analysis (FDA) for meteorological data can avoid these problems and brings additional advantages in data processing. First of all, FDA better respects the data generation process: it converts discrete data into functional data by interpolation or smoothing [4].
The advantage of this treatment is that it retains as much information as possible about the variation of the sample in the time domain, while also preserving the characteristics of the curve fluctuations. In addition, with Functional Principal Component Analysis (FPCA), functional data objects can be projected from time domain data to frequency domain data, providing a frequency domain perspective for time series analysis. Intuitively, the meteorological time series is decomposed into a combination of functional components, which is also consistent with the characteristics of meteorological data. For example, temperature is influenced by diurnal and seasonal variations, pollutant concentration is influenced by seasonal variations, traffic peaks, and other cyclical factors. In particular, meteorological data is often a combination of multiple trends over time. After FPCA, we can calculate the distance between sample observations based on the finite component scores, which can effectively avoid the problem of dimensional curse. Therefore, mining the heterogeneity and internal pattern of meteorological data under the perspective of FDA has natural advantages.
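To make the FPCA step above concrete, here is a minimal numpy-only sketch for densely observed curves. The function name `fpca_scores` and the direct eigendecomposition of the discretised covariance are our simplifications for illustration; the estimation used later in the paper (PACE) additionally handles sparse, noisy observations via smoothing and conditional expectation.

```python
import numpy as np

def fpca_scores(X, t, n_components=2):
    """Toy FPCA on densely observed curves.

    X : (n, T) array, each row one curve sampled on the common grid t.
    Returns the mean curve, eigenfunctions, and principal component scores.
    (A sketch only; real meteorological data would first be smoothed.)
    """
    dt = t[1] - t[0]                       # grid spacing for the integrals
    mu = X.mean(axis=0)                    # pointwise mean function u(t)
    Xc = X - mu                            # centred curves
    G = Xc.T @ Xc / len(X)                 # discretised covariance surface G(s, t)
    lam, phi = np.linalg.eigh(G)           # eigenpairs, ascending order
    order = np.argsort(lam)[::-1][:n_components]
    phi = phi[:, order] / np.sqrt(dt)      # normalise so that integral of phi_k^2 is 1
    scores = Xc @ phi * dt                 # xi_ik = integral of (X_i - u) * phi_k
    return mu, phi, scores

# Example: curves built from two known orthonormal components are recovered.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
xi = rng.normal(size=(300, 2))
basis = np.stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                  np.sqrt(2) * np.cos(4 * np.pi * t)])
X = xi @ basis
mu, phi, scores = fpca_scores(X, t, n_components=2)
```

Since the simulated curves lie exactly in a two-dimensional subspace, the two leading components reconstruct them exactly, illustrating how finitely many scores can represent the whole curve.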
Heterogeneous learning is typically achieved through clustering. Clustering algorithms are a classical class of unsupervised learning algorithms that group samples based on the similarity of their observations. They have a wide range of applications in mining meteorological data, such as analysing the spatial and temporal variation of pollutants, monitoring air quality and optimising monitoring networks, and correlating pollutant concentrations with specific synoptic conditions [4]. The clustering methods commonly used for meteorological data are partitioning clustering [5][6][7], hierarchical clustering [8][9][10], fuzzy clustering [11][12][13], and model-based clustering such as the EM method, the SOM method, etc. [14,15].
The above applications and models share a common defect: they all cluster directly on the data themselves, ignoring the potential structural information between related variables, so the heterogeneity between clusters is insufficiently described. In fact, there are complex correlations among air pollutants. For example, nitrogen oxides are correlated with each other [16], and there are complex correlations between the concentrations of PM10 and PM2.5 [17]. These correlations have the potential to improve clustering. Therefore, drawing on the idea of clustering regression, we aim to uncover the potential relationships between different meteorological indicators and use them as auxiliary information to guide the clustering.
Cluster regression was first proposed by Späth [18] and has gained new vitality in the era of big data. Joki et al. [19] introduced the support vector machine into CLR (cluster-wise linear regression), transformed the problem into an unconstrained nonsmooth optimisation problem, and designed a method combining an incremental algorithm and the double bundle method with DC optimisation. Numerical experiments verified the reliability and effectiveness of the method, showing that adding the support vector machine improves the partitioning when there are outliers in the data. Amb et al. [20] designed a CLR algorithm based on DCA (difference of convex algorithm) and an incremental method, solved with the quasi-Newton method, and found on synthetic and real data that the new method can effectively solve the CLR problem at large scale. Da Silva and de Carvalho [21] proposed the W-CLR (weighted clusterwise linear regression) model, which addresses the possible overfitting of the original model and better describes the linear relationships of subspace samples; experiments on W-CLR synthetic and benchmark datasets validate the method. On the application side, Bagirov et al. [22] used monthly rainfall data from 1889 to 2014 at eight geographic locations in Australia and proposed a clusterwise linear regression (CLR) method for monthly precipitation forecasting, showing advantages over multiple linear regression, neural networks, and support vector machines. Torti et al. [23] studied the heterogeneity problem from the CLR perspective on EU mask trade data, selected the optimal number of clusters and the best combination through a two-step method, and obtained an optimal stable solution.
However, the above research is still at the level of linear regression. With increasingly complex data and interrelationships, simple linear regression can hardly describe the potential connections between the data accurately.
This paper combines the dual advantages of the FDA perspective and cluster regression and proposes a functional clustering regression heterogeneity learning method (FCR-HL). In summary, FCR-HL has three advantages: (1) it clusters from the FDA perspective, which better matches the generation process of meteorological data and greatly reduces the information loss from the original data;
(2) the auxiliary information (i.e., the regression relationship between air pollutants) is incorporated into the clustering process to improve the clustering results, and the regression patterns within each group can be identified automatically; (3) we develop an adaptive selection procedure for the number of clusters, effectively avoiding the limitation of manual setting. Since the clustering results directly affect subsequent studies, the proposed model, which optimises the clustering while mining the heterogeneity patterns of different groups, has practical significance and application value.

Methods
The FCR-HL model mainly solves four problems: (1) the clustering optimisation problem: partitioning the data into clusters from the regression perspective incorporates more information and attains a better clustering effect, and solving this optimisation is the key point; (2) the parameter estimation problem: the regression parameters that explain the impact of the covariate on the response variable must be estimated within each cluster; (3) the estimation of the number of clusters, which decides how many clusters are needed, is an important part of the model; (4) the iterative algorithm: it is difficult to solve the partition and the parameter estimation simultaneously, so our model gives an iterative process that solves the three problems above. The following explains the solutions to the four problems, respectively.

Clustering Optimization and Parameter Estimation
Ramsay and Dalzell [24] proposed functional data analysis, which uses non-parametric ideas to fit data and can effectively capture its continuous characteristics. Within FDA, functional regression is an effective and convenient tool. This paper focuses on one typical functional regression model, in which the covariate is functional and the response variable is scalar:

Y_i = b_0 + ∫_0^T X_i(t) b(t) dt + e_i,  (1)

where the response Y_i is a scalar with vector form Y = (Y_1, Y_2, . . . , Y_n)^T, n is the number of observations, and the covariate X_i(t), t ∈ [0, T], is the ith functional trajectory with a bounded upper limit T. Assuming e_i ~ iid N(0, σ²), the Karhunen-Loève expansion of the functional covariate gives Equation (2):

X_i(t) = u(t) + Σ_{k=1}^∞ ξ_ik ϕ_k(t),  (2)

where u(t) = E[X(t)] is the mean function of the covariate and ϕ_k(t) is the eigenfunction corresponding to the kth largest eigenvalue λ_k of the covariance G(s, t) = Cov(X(s), X(t)); the eigenfunctions are mutually orthogonal and satisfy ∫ϕ_k²(t)dt = 1 and ∫ϕ_k(t)ϕ_l(t)dt = 0 for k ≠ l. Using functional principal component analysis (FPCA), the ξ_ik, named the functional principal component scores of X_i(t) − u(t) in the direction of ϕ_k(t), are obtained; they satisfy E[ξ_ik] = 0 and Var[ξ_ik] = λ_k. Substituting Formula (2) into Formula (1), the model can be rewritten as

Y_i = β_0 + Σ_{k=1}^∞ β_k ξ_ik + e_i,  (3)

where b_0 + ∫_0^T u(t)b(t)dt is mapped to the constant parameter β_0, and the projections ∫_0^T b(t)ϕ_k(t)dt onto the ϕ_k(t) are mapped to the parameters β_k. In other words, β_0 combines the mean of Y_i when X_i(t) ≡ 0 with the information on the mean trend of X_i(t), and β_k stands for the effect of the kth mode of deviation of X_i(t) on Y_i. In this way, the auxiliary information between the covariate and the response variable is reflected in the parameters β_k. This paper builds the FCR-HL model on this auxiliary information to cluster the data.
For Equation (3), the summation is truncated at K, which is determined by the AIC criterion of Li et al. [25]: the optimal K minimises the sum of the pseudo-Gaussian log-likelihood and a penalty on K. Writing Y = (Y_1, Y_2, . . . , Y_n)^T, ξ_i = (1, ξ_i1, ξ_i2, . . . , ξ_iK)^T, ξ = (ξ_1, ξ_2, . . . , ξ_n)^T, β = (β_0, β_1, β_2, . . . , β_K)^T, and e = (e_1, e_2, . . . , e_n)^T, the truncated model (3) has the matrix form

Y = ξβ + e.  (4)

The advantage of FPCA is that the infinite-dimensional functional data are converted into low-dimensional data, which allows a linear regression model on the principal component scores. On one hand, this reduces the computational difficulty and algorithmic complexity caused by the curse of dimensionality; on the other hand, it preserves the nonlinear characteristics of the covariate, which are utilised in the regression analysis. At the same time, the principal component scores estimated by FPCA have good statistical properties, especially unbiasedness and consistency, which support the inference on the parameter estimates discussed later.
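The truncation step can be sketched as follows. The exact penalty form used by Li et al. [25] is not reproduced in this section, so the pseudo-Gaussian AIC below, n·log(RSS/n) + 2K, is our assumption for illustration; the function name `select_K` is ours.

```python
import numpy as np

def select_K(scores, y, K_max=6):
    """Pick the truncation level K by a pseudo-Gaussian AIC (a sketch of the
    criterion of Li et al. [25]): minimise n*log(RSS/n) + 2K over K."""
    n = len(y)
    aic = []
    for K in range(1, K_max + 1):
        Z = np.column_stack([np.ones(n), scores[:, :K]])   # design (1, xi_1..xi_K)
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta) ** 2)
        aic.append(n * np.log(rss / n) + 2 * K)
    return int(np.argmin(aic)) + 1

# Toy scores: only the first two components actually drive the response.
rng = np.random.default_rng(1)
scores = rng.normal(size=(400, 6))
y = 1.0 + 2.0 * scores[:, 0] - 1.5 * scores[:, 1] + rng.normal(scale=0.1, size=400)
k = select_K(scores, y)
```

Because component 2 carries a strong signal, the criterion never stops at K = 1; AIC-type criteria can occasionally keep an extra noise component, which is why the paper pairs this step with consistency results.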
The goal of the clustering optimisation in this paper is to cluster the data from the perspective of the regression hyperplane. The FCR-HL model iterates two main steps: first, obtaining the parameter estimates under a given partition; second, reassigning samples based on the parameter estimates. With this two-step iterative algorithm, the optimal regression clustering result can be found.
Firstly, given a partition, the parameters are estimated from the perspective of the regression hyperplane using the partitioned data. Compared with a random partition, this takes the relationship between the covariate and the response variable as auxiliary information for clustering, and the parameters can be estimated more accurately once the partition reflects the heterogeneity in the data. It is assumed that samples from the same cluster satisfy

y_{i_m} = ξ_{i_m}^T β_m + e_{i_m},  i_m ∈ C_m,  m = 1, . . . , M,  (5)

where C = {C_1, C_2, . . . , C_M} are the sub-populations, n_m = |C_m| is the sample size of cluster C_m, and M is the number of clusters, which may grow with the sample size; y_{i_m} are the observed responses belonging to cluster C_m, ξ_{i_m} are the score vectors derived from the observed functional covariate x_{i_m}(t), and β_m = (β_{m0}, β_{m1}, β_{m2}, . . . , β_{mK})^T are the coefficients of cluster C_m. In (5), the unknown functional principal component scores must be obtained first before β_m can be estimated. Since the score estimates directly affect the parameter estimates, we use the PACE (principal component analysis through conditional expectation) method proposed by Yao et al. [26], which is an unbiased and consistent estimation method for the functional principal component scores. PACE gives the estimators ξ̂_ik = Ê(ξ_ik | X_i), where X_i = (X(t_i1), X(t_i2), . . . , X(t_in_i))^T and ξ_ik and X(t_ij) are assumed jointly Gaussian. PACE is then used to estimate the scores and the mean function in Formula (2), and the score estimates ξ̂ and the mean-function estimate û(t) have the following convergence properties:

sup_{t∈[0,T]} |û(t) − u(t)| → 0 in probability, as n → ∞,  (6)

ξ̂_ik → E[ξ_ik | X_i] in probability, as n → ∞,  (7)

where û(t) is obtained by the local linear smoother and h_u is the bandwidth used in that smoother.
Formulas (6) and (7) show that û(t) converges to u(t) and that ξ̂ is an asymptotically unbiased estimate of ξ as n → ∞, which are the good statistical properties mentioned above. Thus ξ can be replaced by the estimates ξ̂, giving the new regression model in Formula (8):

y_{i_m} = ξ̂_{i_m}^T β_m + e_{i_m},  i_m ∈ C_m.  (8)

Based on Formula (8), the log-likelihood of the complete data is

log L_n(M, C, β_1, σ_1², . . . , β_M, σ_M²) = Σ_{m=1}^M Σ_{i∈C_m} log f(y_i; ξ̂_i^T β_m, σ_m²),  (9)

where f(·; μ, σ²) denotes the N(μ, σ²) density. It is difficult to obtain the optimal partition and the parameter estimates in (9) simultaneously by direct maximisation of log L_n(M, C, β_1, σ_1², . . . , β_M, σ_M²), so an iterative method is proposed. First, fixing β_m at β̂_m and σ_m² at σ̂_m², the clustering objective assigns the observation (y_i, x_i(t)) to the cluster that maximises its log-likelihood:

i ∈ Ĉ_m  if  m = argmax_{1≤j≤M} { −(1/2) log(2πσ̂_j²) − (y_i − ξ̂_i^T β̂_j)² / (2σ̂_j²) }.  (10)

To carry out (10), the parameter estimates (β̂_m, σ̂_m²) are needed. The idea is to maximise the within-cluster log-likelihood; Formula (11) is the log-likelihood of the data i ∈ C_m:

l_m(β_m, σ_m²) = Σ_{i∈C_m} [ −(1/2) log(2πσ_m²) − (y_i − ξ̂_i^T β_m)² / (2σ_m²) ].  (11)

Then the parameters are obtained by maximum likelihood estimation:

β̂_m = (ξ̂_{C_m}^T ξ̂_{C_m})^{-1} ξ̂_{C_m}^T Y_{C_m},  (12)

σ̂_m² = (1/n̂_m) Σ_{i∈Ĉ_m} (y_i − ξ̂_i^T β̂_m)²,  (13)

where n̂_m is the sample size of cluster Ĉ_m and we write σ̂_m² = σ̂_{i_m}² for simplicity. The estimates (β̂_m, σ̂_m², Ĉ_m) are then substituted into Formula (9) to obtain the log-likelihood of the complete data. When the partition C_m is fixed, β̂_m and σ̂_m² are the within-cluster maximum likelihood estimators, as in (12) and (13); when β̂_m and σ̂_m² are fixed, the likelihood is maximised by assigning each observation according to (10). Since the log-likelihood increases monotonically along the iterations, a local maximum is reached after a finite number of alternating steps. Furthermore, the parameter estimates derived from this optimisation have good statistical properties.
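The alternating scheme around Formulas (10)-(13) can be sketched in a few lines. This is an illustrative implementation, not the paper's production algorithm: the function name `cluster_regression` is ours, the initial partition is passed in explicitly (the paper's Algorithm 1 constructs it), and an empty-cluster guard is omitted for brevity.

```python
import numpy as np

def cluster_regression(Z, y, M, init, n_iter=100):
    """Two-step iteration sketch: within each cluster, (beta_m, sigma2_m) are
    the within-cluster MLEs (least squares); each point is then reassigned to
    the cluster maximising its Gaussian log-likelihood.  Assumes no cluster
    empties along the way."""
    labels = init.copy()
    for _ in range(n_iter):
        betas, sig2 = [], []
        for m in range(M):
            idx = labels == m
            b, *_ = np.linalg.lstsq(Z[idx], y[idx], rcond=None)
            r = y[idx] - Z[idx] @ b
            betas.append(b)
            sig2.append(max(r @ r / idx.sum(), 1e-8))
        # per-point Gaussian log-likelihood under each cluster's hyperplane
        ll = np.stack([-0.5 * np.log(2 * np.pi * s) - (y - Z @ b) ** 2 / (2 * s)
                       for b, s in zip(betas, sig2)])
        new = ll.argmax(axis=0)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, betas

# Toy data: two groups with opposite slopes, plus a crude slope-sign init.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 600)
g = rng.integers(0, 2, 600)                        # true groups
y = np.where(g == 0, 1 + 3 * x, 1 - 3 * x) + rng.normal(scale=0.1, size=600)
Z = np.column_stack([np.ones(600), x])
init = (x * (y - y.mean()) < 0).astype(int)        # crude initial partition
labels, betas = cluster_regression(Z, y, 2, init)
```

The recovered cluster slopes sit near +3 and -3, and only points near x = 0, where the two hyperplanes intersect, remain ambiguous.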
First, the principal component scores obtained by FPCA map the information in the data onto the principal component directions, and ξ̂ are asymptotically unbiased estimates of ξ. Thus ξ̂ and ξ can be treated as non-random with respect to the response variable Y, and maximum likelihood yields estimates with good statistical properties, for example unbiasedness and an explicit covariance:

E[β̂_m] = β_m,  (16)

Var(β̂_m) = σ_m² (ξ̂_{C_m}^T ξ̂_{C_m})^{-1},  (17)

where the variance of β̂_m can be used to test the significance of the parameters. Only if the variance of β̂_m is estimated correctly are the significance results for the parameter estimates reliable. From Formulas (6), (7), (16), and (17), β̂_m converges to β_m in probability. Therefore, when the data are clustered from the perspective of the regression hyperplane, the obtained optimal number of clusters converges to the true number with probability 1.
In addition, the maximum likelihood estimates above rest on the classical assumption that the error term in Formula (8) is independently and identically normally distributed. Once this assumption fails, the maximum likelihood results become problematic. Thus, for data violating the i.i.d. normal error assumption, a robust estimation (M-estimation) scheme, a generalised maximum likelihood method, is used. A special case of M-estimation corresponds to the Huber distribution, which behaves like a normal distribution near the origin and like an exponential distribution in the tails. The parameter estimates are obtained by minimising the Huber loss:

β̂_m = argmin_{β_m} Σ_{i∈C_m} ρ_c(y_i − ξ̂_i^T β_m),

where ρ_c(t) is the loss function of the Huber distribution, ρ_c(t) = t²/2 for |t| ≤ c and c|t| − c²/2 for |t| > c, with c a fixed constant. Given the parameter estimates β̂_m, the clustering objective for the sample observation (y_{i_m}, x_{i_m}(t)) is

i ∈ Ĉ_m  if  m = argmin_{1≤j≤M} ρ_c(y_i − ξ̂_i^T β̂_j).

The function ρ_c(t) is strictly increasing in |t|, so the same monotone-improvement argument applies. The partition above is conditional on a given number of clusters, so the next step is to give an estimation method for the number of clusters.
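One standard way to compute the Huber M-estimate is iteratively reweighted least squares (IRLS) with a MAD scale estimate; this is a sketch of that generic technique, not code from the paper. The tuning constant c = 1.345 (95% Gaussian efficiency) is a conventional choice, and the function name `huber_fit` is ours.

```python
import numpy as np

def huber_fit(Z, y, c=1.345, n_iter=100):
    """Huber M-estimation sketch via IRLS: weights w_i = min(1, c/|r_i/s|)
    with a robust MAD scale s, starting from the least squares fit."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    for _ in range(n_iter):
        r = y - Z @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # robust scale
        u = np.abs(r / s)
        w = np.where(u <= c, 1.0, c / u)          # Huber psi(u)/u weights
        W = w[:, None] * Z                        # row-weighted design
        beta_new = np.linalg.solve(Z.T @ W, Z.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < 1e-10:
            beta = beta_new
            break
        beta = beta_new
    return beta

# A line with 10% gross outliers: Huber stays near the truth, OLS does not.
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x + rng.normal(scale=0.3, size=n)
y[rng.random(n) < 0.1] += 15                      # contamination
Z = np.column_stack([np.ones(n), x])
beta_h = huber_fit(Z, y)
beta_ls = np.linalg.lstsq(Z, y, rcond=None)[0]
```

The comparison with the plain least squares fit shows why the paper switches to M-estimation when the normal error assumption is doubtful.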

Estimation of the Optimal Number of Clusters
After estimating the model parameters and optimising the clustering scheme, we need to discuss the estimation of the optimal number of clusters. In this paper, an information criterion is used as the clustering loss function within the iterative algorithm, so that the identification of heterogeneity in clusters and the optimal number of clusters can be updated simultaneously.
After applying FPCA, the sample data are (y_1, ξ̂_1), (y_2, ξ̂_2), . . . , (y_n, ξ̂_n). According to the previous analysis, it is assumed that the sample is composed of M subpopulations, and the characteristics of each subpopulation are represented by the regression hyperplane determined by its parameters.
Denote the partition C = {C_m, m = 1, 2, . . . , M} with C_m = {m_1, m_2, . . . , m_{n_m}}. The regression model of each subpopulation is

Y_{C_m} = ξ̂_{C_m} β_m + e_{C_m},  e_{C_m} ~ N(0, σ_m² I_{n_m}),

where n_m is the sample size of cluster C_m and n = Σ_{m=1}^M n_m; Y_{C_m} = (y_{m_1}, y_{m_2}, . . . , y_{m_{n_m}})^T and ξ̂_{C_m} = (ξ̂_{m_1}, ξ̂_{m_2}, . . . , ξ̂_{m_{n_m}})^T are the response vector and the score matrix belonging to C_m, respectively, with ξ̂_{m_j} = (1, ξ̂_{m_j 1}, ξ̂_{m_j 2}, . . . , ξ̂_{m_j K})^T for j = 1, 2, . . . , n_m, and I_{n_m} is the n_m × n_m identity matrix. Note that ξ̂_{C_m} is an n_m × (K + 1) matrix, while Y_{C_m} and e_{C_m} are n_m × 1 vectors. The number of clusters is estimated with the information criterion based on maximum likelihood estimation proposed by Shao and Wu [27], denoted LS-C:

D_n(Ĉ_M) = Σ_{m=1}^M ‖Y_{C_m} − ξ̂_{C_m} β̂_m‖² + A_n q(M),  (24)

where β̂_m is estimated by maximum likelihood in this case, q(M) is a strictly increasing function of M (generally q(M) = MK), and A_n ∝ log(n) or A_n ∝ log log(n). The first part is the residual sum of squares and the second is a penalty depending on M and n. Shao and Wu [27] proved that the minimiser of D_n(Ĉ_M) converges to the correct number of regression hyperplanes (i.e., the number of clusters) with probability 1 when the sample size is large enough (n → ∞). Note that LS-C is based on maximum likelihood estimation; for errors that are not i.i.d. normal, a robust alternative is again needed. Rao et al. [28] constructed the robust information criterion RM-C:

D_n^R(Ĉ_M) = Σ_{m=1}^M Σ_{i∈C_m} ρ_c(y_i − ξ̂_i^T β̂_m) + A_n q(M),

where β̂_m is obtained by M-estimation in this case, and q(M) and A_n are the same as in Formula (24). Minimising LS-C when the error distribution is i.i.d. normal, or RM-C otherwise, yields both the number of clusters and the partition.
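Given the within-cluster residual sums of squares attained at each candidate M, applying the LS-C criterion reduces to a one-line minimisation. The helper name `select_M` and the toy RSS values are ours; the penalty uses A_n = c·log(n) with q(M) = MK, the choice the paper itself adopts in its simulation.

```python
import numpy as np

def select_M(rss_by_M, n, K, c=2.0):
    """LS-C sketch: D_n(M) = RSS_M + c*log(n)*M*K; the minimising M estimates
    the number of clusters (RM-C would replace RSS by the sum of Huber losses)."""
    Ms = sorted(rss_by_M)
    D = [rss_by_M[M] + c * np.log(n) * M * K for M in Ms]
    return Ms[int(np.argmin(D))]

# Toy: RSS drops sharply up to the true M = 3, then only marginally.
rss = {1: 5000.0, 2: 1200.0, 3: 40.0, 4: 38.0, 5: 37.0}
M_hat = select_M(rss, n=500, K=2)
```

The penalty term is what stops the criterion from always preferring more clusters: beyond the true M, the tiny RSS gains no longer offset the log(n)-scaled cost.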
The advantage of the information criteria LS-C and RM-C is that the estimated number of clusters converges to the real number when the sample size is large enough, and the details can be referenced in Shao and Wu [27] and Rao, Wu and Shao [28].

Iterative Algorithm Design
In the FCR-HL model proposed in this paper, parameter estimation, clustering optimization, and the estimation of the number of clusters are all continuously updated in the iterative process, and this section will explain the iterative algorithm.
First, the within-cluster residual sum of squares based on maximum likelihood estimation is denoted RSS, and its counterpart based on M-estimation is denoted RRSS (robust residual sum of squares):

RSS = Σ_{m=1}^M Σ_{i∈C_m} (y_i − ξ̂_i^T β̂_m)²,   RRSS = Σ_{m=1}^M Σ_{i∈C_m} ρ_c(y_i − ξ̂_i^T β̂_m).

Then, for each candidate M, the within-cluster RSS (for least squares regression) or RRSS (for M-estimation-based regression) is computed to approximate the local minimum, and the optimal number of clusters is determined by the information criterion LS-C or RM-C, respectively.
In addition, the regression-based clustering method is easily affected by the initial partition: the global minimum of the information criterion, or a good approximation of it, can be reached only from a good initial partition. Thus, it is necessary to determine the initial partition C_0. Based on the idea proposed by Qian et al. [29], we extend it to handle functional data. Algorithm 1 below shows the iterative initial-partition algorithm.
Algorithm 1 An iterative algorithm for initial partition.
Step 1: Using the FPCA on X(t) to estimate the functional principal component scoreξ.
Step 2: Through mapping the mean function u(t) and the basis function ϕ(t) to the parameter β, respectively, we build a functional regression model where the functional principal component scores are the covariates.
Step 3: Parameters are estimated by maximum likelihood estimation or robust estimation based on the whole data.
Step 4: (1) Set a distance threshold d and a minimum cluster size c.
(2) For l = 1, calculate the distance of each point to the regression hyperplane obtained in Step 3. If the distance is less than d, the point is assigned to C_{0,1}; otherwise it is assigned to the complement C_{0,1}^c. If |C_{0,1}| > c and |C_{0,1}^c| > c, continue; otherwise go to Step 5. (3) For l = l + 1, re-estimate the parameters on the points in ∩_{i=1}^l C_{0,i}^c and calculate the new distances. Points with distance less than d are assigned to C_{0,l+1}, the rest to C_{0,l+1}^c. If |C_{0,l+1}| > c and |C_{0,l+1}^c| > c, repeat (3); otherwise go to Step 5.
Step 5: Obtain the initial partition C_0 = {C_{0,1}, . . . , C_{0,l}, C_{0,l}^c}.
It should be noted that the constants c and d are set based on the data. The initial partition procedure is an iterative hierarchical binary clustering method that fits a regression model, such as least squares, in each iteration. When the regression is robust, with a high breakdown point, Algorithm 1 is highly likely to produce a reasonable initial partition. Given the initial partition, the iterative algorithm of the FCR-HL model is shown in Algorithm 2: the partition iteration algorithm based on the initial partition.
Step 1: Let s = 1; calculate the RSS_0 or RRSS_0 of the initial partition C_0 and the parameter estimates β̂.
Step 2: Let s = s + 1. For each observation (y_i, ξ̂_i) currently in block C_{0,j}, calculate the RSS or RRSS obtained by moving it to each other block C_{0,j'}, j' ≠ j, and obtain the minimum RSS_min or RRSS_min. If RSS_min < RSS_0 (or RRSS_min < RRSS_0), update the partition accordingly.
Step 3: Repeat Step 2 until RSS_0 or RRSS_0 no longer decreases.
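The two algorithms above can be sketched together as follows. This is an illustration under simplifying assumptions: plain least squares replaces the robust fit the paper recommends for Algorithm 1 (so the toy data use unbalanced blocks where least squares suffices), the refinement reassigns all points at once rather than one at a time, and the names `initial_partition` and `refine_partition` are ours.

```python
import numpy as np

def initial_partition(Z, y, d, c):
    """Algorithm 1 sketch: repeatedly fit a regression on the not-yet-assigned
    points, peel off those within distance d of the fitted hyperplane as a new
    block, and stop when either side of a split would hold at most c points."""
    labels = np.full(len(y), -1)
    remaining = np.arange(len(y))
    l = 0
    while True:
        b, *_ = np.linalg.lstsq(Z[remaining], y[remaining], rcond=None)
        near = np.abs(y[remaining] - Z[remaining] @ b) < d
        if near.sum() <= c or (~near).sum() <= c:
            labels[remaining] = l            # Step 5: leftover is the last block
            return labels
        labels[remaining[near]] = l
        remaining = remaining[~near]
        l += 1

def refine_partition(Z, y, labels, n_pass=50):
    """Algorithm 2 sketch: refit each block, move points to the block whose
    hyperplane gives the smallest squared residual, stop when RSS stops dropping."""
    ms = list(np.unique(labels))
    def fit(lab):
        betas = {m: np.linalg.lstsq(Z[lab == m], y[lab == m], rcond=None)[0]
                 for m in ms if np.any(lab == m)}
        rss = sum(float(np.sum((y[lab == m] - Z[lab == m] @ b) ** 2))
                  for m, b in betas.items())
        return betas, rss
    betas, rss0 = fit(labels)
    for _ in range(n_pass):
        keys = list(betas)
        R = np.stack([(y - Z @ betas[m]) ** 2 for m in keys])
        new = np.array(keys)[R.argmin(axis=0)]
        betas1, rss1 = fit(new)
        if rss1 >= rss0:
            break
        labels, betas, rss0 = new, betas1, rss1
    return labels

# Toy data: a majority block and a minority block shifted upward by 10.
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 600)
g = np.array([0] * 500 + [1] * 100)
y = 2 * x + 10.0 * g + rng.normal(scale=0.2, size=600)
Z = np.column_stack([np.ones(600), x])
lab = refine_partition(Z, y, initial_partition(Z, y, d=3.0, c=20))
```

On this toy data the global fit lies close to the majority block, so the distance threshold cleanly peels it off; the refinement pass then polishes any strays.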
In summary, the parameter estimates (β̂_m, σ̂_m²), the estimated number of clusters M̂, and the partition Ĉ_M̂ = {Ĉ_1, Ĉ_2, . . . , Ĉ_M̂} are updated throughout the iterative algorithm, which finally yields the regression clustering result.
In the simulation analysis and the empirical data analysis, the K-means method is used as a comparison model, as it is a representative clustering method that uses only the distances between the observations themselves. Our model instead emphasises the auxiliary information between the response and the covariate and clusters the data from the regression perspective to mine heterogeneity.

Model Comparison Based on Heterogeneity Partitioning
We simulate data from three groups satisfying Y_ij = α_{j0} + ∫_0^1 X_ij(t) α_{j1}(t) dt + ε_i, i = 1, 2, . . . , 500; j = 1, 2, 3. Each group contains 500 units, and t is uniformly designed on [0, 1]. First, the functional covariate is generated as X_ij(t) = Σ_{k=1}^6 ξ_ik ϕ_k(t_ij), where the ξ_ik ~ N(0, 1) are mutually independent over k, and the ϕ_k(t) are cubic spline basis functions. The group-specific coefficient functions α_{j1}(t) differ across the three groups and are built from trigonometric components such as ϕ_1(t) = √2 sin(2πt) and ϕ_2(t) = √2 cos(4πt). After obtaining the functional principal component scores "Score1" and "Score2" of X_ij(t), which are treated as the new explanatory variables in the regression clustering model, the regression analysis is reduced from infinite dimension to a finite one. Since the errors are independently and identically distributed, we use the LS-C information criterion to select the number of clusters, with q(M) = MK and A_n = c log(n), where c = 2.
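A simulation of this kind can be generated in a few lines. For a self-contained sketch we keep only two orthonormal Fourier components rather than the six cubic-spline basis functions of the full design, and the intercepts and coefficient functions below are illustrative choices, not the paper's exact values.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
phi = np.stack([np.sqrt(2) * np.sin(2 * np.pi * t),     # two orthonormal components
                np.sqrt(2) * np.cos(4 * np.pi * t)])

def make_group(n, alpha0, alpha1):
    """Simulate Y_i = alpha0 + integral of X_i(t)*alpha1(t) dt + eps_i,
    with X_i(t) = sum_k xi_ik phi_k(t)."""
    xi = rng.normal(size=(n, 2))
    X = xi @ phi
    integral = X @ alpha1 * (t[1] - t[0])   # Riemann approximation of the integral
    return X, alpha0 + integral + rng.normal(scale=0.1, size=n)

# Two (of three) illustrative groups with different coefficient functions.
X1, y1 = make_group(500, 1.0, 2 * phi[0])
X2, y2 = make_group(500, -1.0, -2 * phi[1])
```

Feeding the pooled (X, y) pairs through FPCA and the clustering regression then reproduces the kind of experiment summarised in Figure 1 and Tables 1 and 2.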
Adopting the FCR-HL model proposed in this paper, the resulting information criterion values are shown in Figure 1 (LS-C values over the candidate K for the simulation data). Figure 1 shows that LS-C reaches its minimum at K = 3 (the scale of the vertical axis is so large that the LS-C values at K = 3 and K = 2 appear close, but they are not), which is the estimated optimal number of clusters and matches the true number. At K = 1, LS-C is largest, meaning that the unpartitioned data perform poorly; this encourages more attention to clustering regression. To show the superiority of the FCR-HL model, Tables 1 and 2 compare its confusion matrix with that of K-means, a popular and widely used clustering method. Each row of a confusion matrix represents a predicted cluster, while each column represents a real cluster. Table 1 shows that 440, 477, and 490 samples were correctly clustered into their respective groups. The confusion matrix of the K-means method in Table 2 indicates that K-means partitions the first group well but cannot effectively separate the second and third groups. By considering the relationship between the response variables and the functional explanatory variables, our model captures how the regression relationships change between clusters, rather than the distances between the observations themselves. The confusion matrices show the improvement obtained when this auxiliary information is added to the partition process. Next, we show the heterogeneity between the clusters.
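The confusion matrices in Tables 1 and 2 follow the convention stated above (rows are predicted clusters, columns are real clusters) and can be computed as follows; the helper name `confusion_matrix` and the toy labels are ours. Note that cluster labels are only identified up to permutation, so in practice predicted labels are first aligned with the real ones (e.g., by matching each predicted cluster to its majority real class).

```python
import numpy as np

def confusion_matrix(pred, true, M):
    """Rows are predicted clusters, columns real clusters (as in Tables 1-2)."""
    C = np.zeros((M, M), dtype=int)
    np.add.at(C, (pred, true), 1)   # count each (predicted, real) pair
    return C

C = confusion_matrix(np.array([0, 0, 1, 1, 2]), np.array([0, 1, 1, 1, 2]), 3)
```

The diagonal then counts correctly clustered samples, which is how the 440/477/490 figures of Table 1 are read off.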

Heterogeneous Hidden Information Mining
Traditional clustering methods, such as K-means, provide only partition results, while our model also provides regression information for each cluster. Moreover, when regression analysis is carried out separately on the different clusters, the heterogeneity can be mined. If regression analysis is instead carried out on the unpartitioned data, the heterogeneity is ignored and the regression results may be inaccurate or even wrong. It is therefore necessary to identify the heterogeneity of the data with the help of the FCR-HL model.
First, the regression results for the data without partition are shown in Table 3. If the data are not clustered, that is, all data are assumed to come from one population, the estimated parameters are all significant at the 0.05 level, while the R², which measures the goodness of fit, is only 0.3371, which is relatively small. This again shows that it is necessary to identify the heterogeneity of the data first and then perform regression analysis within clusters; the results will then be more reliable.
Applying the FCR-HL model to the simulated data with d = 0.02 (adjusted adaptively), we obtain the regression results of the three clusters, shown in Table 4. The parameters in all three groups are significant at the 0.05 level, and the R² values of the three groups are 0.9158, 0.9994, and 0.9981, respectively, indicating much better fits. This shows that the FCR-HL model improves the fitting effect and yields reliable parameter estimates.
Compared with the results given by the K-means method, our model utilises the relationship between the response variable and the functional explanatory variable and incorporates this auxiliary information into the clustering process to improve the accuracy of clustering. In addition, the FCR-HL model can update the parameter estimates by updating the principal component scores and the number of clusters when new data enter the sample.

Climate Data
Using the classic Canadian weather data, which contains the annual temperature change and rainfall information of 35 stations, the annual rainfall is used as the response variable, using the temperature as the explanatory variable to study the influence of temperature on rainfall. Figure 2 shows the temperature of each site: From Figure 2, although the temperature at each site has a similar trend of change, there are differences in the size of the trend change and the time of the change. If we disregard these characteristics and do regression analysis on the pooled data directly, the finding may be contradictory to the truth. Moreover, the relationship between the temperature and rainfall is very important to distinguish which cluster the data belongs to. We use the FCR-HL model to partition the data and explain the impact of temperature on rainfall. The number of sites is small, the maximum likelihood parameter estimation method is no longer applicable, and the robust estimation algorithm will be used to estimate the parameters. Thus, the RM-C is calculated to obtain the optimal number and the cluster results shown in Figure 3. The left picture of Figure 3 is the value of RM-C under different obtained by the iterative algorithm, and it is obvious that RM-C reaches the minimum at = 2, which means that the optimal number of clusters is 2. The sites belonging to the two clusters are shown in Table A3 , which shows that the logarithmic annual rainfall of the sites in cluster 1 are always bigger than sites in cluster 2. This cluster result is convincing.
To clearly illustrate the usefulness of the FCR-HL model, we compare the parameter estimates for the two clusters given by the FCR-HL model with the estimates given by direct regression on the pooled data. The parameter beta in the figure explains the impact of temperature on annual rainfall over time. Although the yellow curve obtained by regression on the pooled data is smoother than the black and blue curves, which correspond to cluster 1 and cluster 2 and are obtained by the FCR-HL model, the latter two curves better reflect the considerable fluctuations in the data. That is, the parameter estimates after clustering highlight the different characteristics of the data from different subpopulations. More specifically, the estimates for cluster 1 are larger than those for cluster 2 when t ≤ 50 and t ≥ 230, which indicates that the influence of temperature on annual rainfall is larger for the sites in cluster 1 than for those in cluster 2 during these periods, whereas the estimates for cluster 1 are smaller than those for cluster 2 when 50 ≤ t ≤ 230, indicating the opposite during this period. In addition, the black and blue curves show considerably different degrees of volatility, which indicates that the influence of temperature on annual rainfall varies both between groups and over time.
In addition, Table 5 reports the parameter estimates of the regression run directly on the pooled data; cluster 1 and cluster 2 stand for the regressions on the partitioned data. The R² values for the pooled data and the two clusters are 0.7567, 0.5692, and 0.6868, respectively. Note that there are only thirty-five sites, and the sample sizes of the two clusters are not large enough to ensure the unbiasedness and consistency of either the score estimators ξ̂ or the regression estimators β̂. Under this condition, the R² values of the clustered data are not improved. However, Table A3 shows the reasonability of the partition given by the FCR-HL model. Although the R² performance is not ideal for the Canadian weather data, we still suggest adopting the proposed model for clustering and heterogeneity learning. From the analyses of the simulated data and the climate data, it can be seen that the FCR-HL model improves the accuracy of clustering and can detect heterogeneity information.

China Air Pollution Data
As typical atmospheric pollutants, inhalable particulate matter such as PM10 and PM2.5 poses risks to human health [1,30] and has attracted considerable attention from researchers. In research on urban air quality, the concentrations of PM10 and PM2.5 are significantly correlated with each other, and this correlation varies across seasons and regions. This paper uses the FCR-HL model to study the heterogeneity and characteristics of air quality in different regions of China. We first obtain national air quality data from the China Meteorological Data Network (https://www.resdc.cn/data.aspx?DATAID=289, accessed on 15 February 2023) and then clean the data. Finally, we obtain hourly PM10 and PM2.5 concentration data for 1602 stations across the country from 1 January 2019 to 31 December 2019. The annual average PM10 concentration of each station is taken as the response variable, and the daily average PM2.5 concentration is taken as the functional explanatory variable. Figure 4 shows the functional representation of the discrete PM2.5 concentration data.
Using the FCR-HL model proposed in this paper and considering the auxiliary information between PM10 and PM2.5, we first perform functional principal component analysis on PM2.5 and build a functional regression model y_i = ξ̂_i^T β + ε_i, where y_i is the PM10 concentration for station i and ξ̂_i is the vector of estimated PM2.5 principal component scores for station i.
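The FPCA-then-regression pipeline can be sketched as follows. This is a simplified grid-based stand-in for the paper's method: the curves are simulated, FPCA is done by eigendecomposing the sample covariance on the observation grid, and all names and dimensions are illustrative assumptions rather than the paper's actual data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "daily curves" observed on a grid (stand-in for PM2.5 curves).
n, T, p = 200, 50, 2
t = np.linspace(0, 1, T)
phi = np.stack([np.sqrt(2) * np.sin(np.pi * t),
                np.sqrt(2) * np.sin(2 * np.pi * t)])      # true components
scores_true = rng.normal(size=(n, p)) * np.array([2.0, 1.0])
X = scores_true @ phi + 0.05 * rng.normal(size=(n, T))    # observed curves
y = 1.5 * scores_true[:, 0] - 0.8 * scores_true[:, 1] + 0.1 * rng.normal(size=n)

# FPCA: eigendecompose the sample covariance of the centered curves.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(cov)
comps = evecs[:, ::-1][:, :p]       # leading eigenfunctions (grid values)
xi_hat = Xc @ comps                 # estimated principal-component scores

# Scalar-on-scores regression y_i = xi_hat_i^T beta + eps_i, fitted by OLS.
beta_hat, *_ = np.linalg.lstsq(xi_hat, y, rcond=None)
r2 = 1 - np.sum((y - xi_hat @ beta_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because the estimated scores span nearly the same subspace as the true scores, the scalar response is recovered well; in FCR-HL this score regression is then fitted separately within each cluster.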
To explain the advantages and necessity of the model more clearly, Figure 5 first shows the time-varying coefficients of the functional regression without partitioning, and Figure 6 then shows the time-varying coefficients of the functional regression within each cluster. The coefficients in Figures 5 and 6 show how the impact of PM2.5 on PM10 changes over time without partitioning and under the partition, respectively. The coefficient indicates how the annual average PM10 concentration changes with the daily PM2.5 concentration, and its sign indicates whether the correlation is positive or negative. Note that the impact of PM2.5 on PM10 means that the PM10 concentration changes as the PM2.5 concentration changes over time. In Figure 6, the iterative clustering algorithm estimates the optimal number of clusters to be 11 and divides the stations into 11 groups; the heterogeneity partition for analyzing the relationship between PM10 and PM2.5 is also shown.
As we can see, the regression coefficients in Figure 5 are much smoother over time. Most of them are small positive values, which would suggest that the influence of PM2.5 on PM10 is almost always positive and small over time for all stations. However, this conclusion may not be accurate, because it ignores the heterogeneities between stations mentioned above. By contrast, the coefficients obtained with the FCR-HL model describe the influence of PM2.5 on PM10 more convincingly. In Figure 6, the regression coefficients of the 11 groups have entirely different characteristics. First, these coefficients take both positive and negative values, which differs from Figure 5. Second, the coefficients of all 11 clusters show steeper variation trends than those in Figure 5. The coefficients also differ from one cluster to another. For example, the coefficient of cluster 1 rises and then falls, with one local maximum and one local minimum between day 0 and day 100, meaning that the impact of PM2.5 on PM10 first increases and then decreases, while the coefficient of cluster 5 only decreases over that interval, with a single local minimum, meaning that the impact of PM2.5 on PM10 is always decreasing there. As another example, the coefficients of cluster 1 and cluster 9 show a similar trend from day 0 to day 100, but their values differ, especially at the local maximum and local minimum. All these differences among the 11 clusters demonstrate the existence of heterogeneity and the importance of the heterogeneity partition. Additionally, the coefficients of each cluster better show the time-varying impact. For the stations of cluster 1, the impact of PM2.5 on PM10 is negative at the beginning and then positive until close to day 100, negative from there until close to day 230, positive until day 300, and negative on the last day.
By analogy, both how the impact of PM2.5 on PM10 varies over time and how large that impact is at any fixed time can be obtained for each of the 11 clusters.
In addition, from (28), the parameter estimates for the pooled data and the grouped data are shown in Tables A1 and A2 (in Appendix A), respectively. Table A1 shows that the functional principal component scores 5 and 6 are not significant in the pooled regression, whose R² is 0.7484. In Table A2, for the 11 groups produced by the FCR-HL model, the functional principal component scores are almost all significant, except score 3 in the 4th group and score 6 in the 7th group, and the R² values of all groups are above 0.9. Note that the functional explanatory variable in the FCR-HL model is expanded via the Karhunen–Loève theorem, so the functional principal component scores contain the essential information of the explanatory variable. Thus, the insignificance of the functional principal component scores in Table A1, which leads to the rejection of important parameters, bears witness to the inaccuracy of regression analysis on pooled data that carry heterogeneity. By contrast, almost all the parameters estimated by the FCR-HL model are significant, which means that almost all the essential auxiliary information has been used for clustering. Additionally, the improvement of R² in all groups also testifies to the efficiency of the FCR-HL model. These comparisons show that the FCR-HL model can effectively partition heterogeneous data and mine its internal information.
The K-means method is also used to compare R² performance with that of the FCR-HL model; the comparison is shown in Table 6. From Table 6, it is obvious that the FCR-HL model significantly improves the goodness of fit of the regression of PM10 on PM2.5. Considering the auxiliary information improves the cluster results, since the distance, which plays the key role in clustering, is defined as the distance between the data and the regression hyperplane rather than between the data points themselves. The results of the FCR-HL model therefore help us better understand how PM2.5 concentrations impact PM10 concentrations.
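The contrast between the two distances can be made concrete with a small sketch. The numbers below are purely illustrative: they show that the nearest-centroid rule used by K-means and the nearest-regression-hyperplane rule used here can assign the same observation to different clusters.

```python
import numpy as np

# One station's principal-component scores and its response
# (e.g., annual mean PM10); values are hypothetical.
xi = np.array([1.0, 2.0])
y = 4.0

# Illustrative cluster summaries: centroids in score space (K-means)
# and per-cluster regression coefficients (regression clustering).
centroids = [np.array([1.1, 2.1]), np.array([-3.0, 0.0])]
betas = [np.array([-1.0, -1.0]), np.array([2.0, 1.0])]

# K-means distance: between the observation and each centroid.
kmeans_dist = [np.linalg.norm(xi - c) for c in centroids]
# Regression distance: between the response and each hyperplane's prediction.
reg_dist = [abs(y - xi @ b) for b in betas]

kmeans_choice = int(np.argmin(kmeans_dist))  # nearest centroid -> cluster 0
fcr_choice = int(np.argmin(reg_dist))        # best-fitting hyperplane -> cluster 1
```

Here the station sits close to cluster 0's centroid, yet its response is explained far better by cluster 1's regression, which is exactly the auxiliary information that the centroid-based distance ignores.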
In summary, the PM2.5 and PM10 data in the empirical analysis are clustered into 11 groups by our model. On the one hand, the impact of PM2.5 on PM10 varies over time and between groups, indicating obvious heterogeneity between groups. On the other hand, the significance of the parameters in Tables A1 and A2 expresses the importance of the partition in avoiding the loss of essential information. In each cluster, the impact of PM2.5 on PM10 fluctuates up and down, which means that the PM10 concentration changes with the PM2.5 concentration accordingly. Moreover, the coefficients in all clusters show that the impact is not always positive.

Discussion
This paper constructs a heterogeneity learning model from the perspective of data clustering, which solves the clustering problem and provides implicit structural information about heterogeneity at the same time. Combining the regression model with the clustering algorithm not only incorporates more effective information into the clustering and improves its accuracy, but also allows a more precise analysis of the relationship between the explanatory variables and the response variables in different clusters (also called subpopulations). In addition, because the complexity of real data prevents classical regression models from capturing the continuous characteristics of the data, functional data analysis techniques are needed to incorporate these continuous characteristics and improve regression clustering. Based on functional data analysis, this paper uses the principal component scores to reconstruct a new regression function. Using the iterative algorithm and an information criterion, we can obtain the number of clusters and the parameter estimates simultaneously. The statistical advantages of the FCR-HL model are, first, that each parameter estimate is consistent and, second, that the iterative algorithm gives cluster results at the same time, which can be updated when new samples are added. The practical advantage is that the model detects heterogeneity in the data and explains how the covariate impacts the response.
In addition, both the data simulations and the empirical results illustrate the effectiveness of the new model and its broad application prospects. Simulated data, case data, and empirical data are used, and the results given by the FCR-HL model are compared with those of the well-known K-means method. The comparisons show that our model partitions the data better, as the regression clustering utilizes the auxiliary information to explain how the regression differs across clusters, while the K-means method focuses on how the distances among the data behave differently across clusters.
In summary, data in environmental research and public health, for example climate data, exhibit heterogeneity, and the FCR-HL model proves to be very powerful in this setting. Two future directions are pointed out. (1) We will analyze other types of air pollution data and public health data to make our model more systematic and comprehensive.
(2) The auxiliary information plays an important role in our model, and data are inextricably linked in complex social networks. Therefore, digging out more useful information, such as network information and text information, and incorporating it into the model will improve the study of heterogeneity learning.

Conclusions
In this paper, we proposed the FCR-HL model to handle air pollution data. First, the starting point of the article lies in heterogeneity learning in data, which we extend to functional data. Second, we introduce the model design, parameter estimation, and the iterative algorithm. To verify the validity of our model, the well-known K-means method and our model are both applied to the simulated data and the climate data, and our model performs better. Finally, an empirical analysis of the air pollution data is conducted. The results show that the impact of PM2.5 on PM10 varies between clusters and over time.
To sum up, the FCR-HL model captures the continuity in the data itself and incorporates auxiliary information to provide multiple pieces of information, including the number of subpopulations and how PM2.5 impacts PM10 over time. Thus, this model may provide effective information for policymaking departments and a new perspective for research.

Conflicts of Interest:
The authors declare no conflict of interest.

Significance: '***', '*', and ' ' represent significance at the significance levels of 0, 0.01, and 1, respectively.

Table A3. Two clusters for the Canadian weather case by using the FCR-HL model.