Estimating Boundary Layer Height from LiDAR Data under Complex Atmospheric Conditions Using Machine Learning

: Reliable estimation of the atmospheric boundary layer height (ABLH) is critical for a range of meteorological applications, including air quality assessment and weather forecasting. Several algorithms have been proposed to detect ABLH from aerosol LiDAR backscatter data. However, most of these focus on cloud-free conditions or use other ancillary instruments due to strong interference from clouds or residual layer aerosols. In this paper, a machine learning method named the Mahalanobis transform K-near-means (MKnm) algorithm is ﬁrst proposed to derive ABLH under complex atmospheric conditions using only LiDAR-based instruments. It was applied to the micro pulse LiDAR data obtained at the Southern Great Plains site of the Atmospheric Radiation Measurement (ARM) program. The diurnal cycles of ABLH from cloudy weather were detected by using the gradient method (GM), wavelet covariance transform method (WM), K-means, and MKnm. Meanwhile, the ABLH obtained by these four methods under cloud or residual layer conditions based on micropulse LiDAR data were compared with the reference height retrieved from radiosonde data. The results show that MKnm was good at tracking the diurnal variation of ABLH, and the ABLHs obtained by it have remarkable correlation coefﬁcients and smaller mean absolute error and mean deviation with the radiosonde-derived ABLHs than those measured by other three methods. We conclude that MKnm is a promising algorithm to estimate ABLH under cloud or residual layer conditions.


Introduction
The atmospheric boundary layer (ABL) is the lowest part of the troposphere near the Earth's surface.Moreover, it is the layer within which energy, water, momentum, and matter in the atmosphere are mixed and exchanged with the surface [1][2][3].Pollutants from the ground are dispersed and trapped within this layer, and fog also occurs in the ABL.It evolves throughout the day and is season influenced by solar radiation and other factors.Therefore, the atmospheric boundary layer height (ABLH) is a crucial parameter in numerical weather prediction models, climate change studies and air quality assessments [4][5][6][7].
Many instruments have been implemented to detect ABLH, such as tethered balloons [8], microwave radiometers [9], radiosonde (RS) [10][11][12][13], LiDAR [14][15][16], and ceilometer [17].RS can estimate ABLH with high precision from the vertical profiles of measured temperature and humidity.Thus, it is the most reliable instrument, and its estimate is often used as a reference value to compare with the ABLH obtained by other instruments.However, the radiosonde test is labor intensive and the limited launch frequency cannot monitor the evolving process of ABLH which is necessary for a range of meteorological applications.LiDAR is an active remote sensing tool with high spatial and temporal resolution, long detection range, and strong anti-interference ability, which can continuously and automatically measure the ABLH.It is assumed that the distribution of aerosols in the ABL is almost uniform and the concentration of aerosols in the ABL is much higher than that in the free troposphere (FT).There is a transition zone between ABL and FT, known as the entrainment zone (EZ).The center of the EZ is the ABLH at which the aerosol concentration decreases sharply [17,18].There are two main kinds of methods to estimate ABLH based on LiDAR: (1) Methods to finding an abrupt change based on vertical distribution information regarding the aerosol concentration reflected by LiDAR backscatter signals, such as the threshold method [19], gradient method (GM), [20] and wavelet covariance transform method (WM) [21]; (2) methods based on statistical analysis that seek the maximum variance of the LiDAR backscatter signal [22].However, these methods operate in cloud-free meteorological conditions, and will misjudge the ABLH from cloud interference.
Machine learning (ML)-which extracts features from the training data-is a powerful tool for classification and regression problems [23,24].Moreover, ML has been widely used in target recognition [25], computer vision [26], and other fields [27][28][29], and has achieved remarkable results.In this paper, we regard the detection of ABLH as a cluster problem and explore how the appropriate algorithm can be used to solve this problem.Toledo proposed a cluster analysis (CA) method to calculate the ABLH by taking advantage of both the vertical distribution of aerosols and the variation in concentration [30].The study indicated that cluster analysis is a reliable method for studying the ABLH.Toledo tested the robustness and capability of six numerical methods usually used to determine ABLH from LiDAR data under different atmospheric conditions [18].The results showed that all methods are in agreement with radiosonde measurements under sea-land breeze and non-dusty conditions.However, the presence of a residual layer (RL) affected the use of these methods and CA performed best under dusty conditions.Caicedo assessed three aerosol LiDAR-retrieved algorithms for ABLH detection and found that there is good agreement between radiosonde and aerosol-derived ABLH in cloud-free conditions [17].The presence of clouds creates difficulties for estimating ABLH for all methods.Rieutord described two machine learning methods that derive ABLH from LiDAR data, which are based on K-means and AdaBoost [31].The case study indicated that both algorithms performed well.Nevertheless, K-means is sensitive to the clustering initial value, and AdaBoost is constrained by its training data.Moreover, the meteorological conditions of low-level clouds (cloud base height below 3 km) were discarded.
As is known, clouds cover about 60% of the globe [32].Estimating ABLH under cloudy conditions is a necessary but challenging task.Krishnamurthy [33] used multiple instruments to define the ABLH under clouds conditions.However, these instruments are not always available for field measurements.Dang [34] proposed a top limit technique to eliminate interference from the cloud layer.However, in that study, the threshold −2 gradient used for cloud location determination was not applicable to other observation sites.Li [14] determined the ABLH based on the WCT method with whale optimization algorithm and the top limit.The empirical threshold −2 was also used as a reference for the detection of boundary layer height information.Zhong [35] estimated the ABLH under multi-layer conditions by using maximum limited height initialization and range restriction.Nevertheless, the parameter was determined by experience which is limited to the research area and is unsuitable for other areas.The applicability of these ABLH estimation methods is limited due to the uncertainty of threshold selection.Therefore, it is necessary to use methods other than threshold to estimate the ABLH.To improve the reliability and convenience of ABLH estimation under clouds or residual layer conditions, we propose a new algorithm, known as Mahalanobis transform K-near-means (MKnm), in which only LiDAR data are used.In this study, the observation datasets are constructed based on the characteristics of the categories.Mahalanobis transform is used to partition the data due to the correlation and different magnitudes of each feature in the observation data, and the interference from outliers is reduced with the data that near the previous centers to update the next centers.The most important factor is the LiDAR backscatter signal gradient which is used as a reference to select the initial clustering value.To verify the effectiveness of the proposed MKnm, micropulse LiDAR data-under cloud or residual layer aerosol conditions-from the Southern Great Plains site C1 were used for testing, and the ABLH defined by MKnm, K-means, VM, and HM were compared with reference ABLH calculated from radiosonde data.
The rest of this article is organized as follows.Section 2 introduces the materials used in our study.ABLH retrieval using existing methods and proposed MKnm are described in Section 3. The results of our study, including comparisons of the methods with radiosonde, are presented in Section 4. In Section 5, we discuss the proposed method.Section 6 concludes the study.

Materials
Ground-based micro pulse LiDAR (MPL) and radiosonde data used in this work were collected from the central facility (36.607322 • N, 97.487643 • W) near Lamont in northcentral Oklahoma at the Southern Great Plains (SGP) site of the Atmospheric Radiation Measurement (ARM) program [33,36,37].The dataset spanned 2 years (2003 and 2004).The instruments described below refer mainly to works during this period.

Micro-Pulse LiDAR (MPL)
The micro-pulse LiDAR (Version 2.0) instrument installed at the SGP site is an eye-safe instrument that can be operated autonomously.It emits a laser pulse at a wavelength of 523 nm.Moreover, with pulse-repetition frequencies of 2.5 KHz, the maximum detection range is up to 60 km.The raw MPL data have a vertical resolution of 30 m and temporal resolution of 30 s.After system dead-time correction, background subtraction, range correction, overlap, after-pulsing, and energy-level normalization, the resulting signal is referred to as normalized relative backscatter (NRB): where n(r) represents the measured signal return in photoelectron counts per second at range r.D[n(r)] is the dead time; n ap (r) is the after-pulsing; n b is background; O c (r) is the overlap factor; C represents the calibration constant of a dimensional system; E is the transmitted laser pulse energy; β is the backscatter coefficient; T is the atmospheric transmittance; NRB is the value-added data product (VAP) of ARM that is used for detecting clouds and aerosols, and the vertical resolution is 90 m.In this work, the interval thresholding technique is used to reduce noise interference [29].NRB data below 4.37 km are used, and data from rain and fog meteorological conditions are discarded.In this study, we focus mainly on demonstrating the ability to detect the ABLH under cloud or residual layer aerosol conditions by the proposed method.

Radiosonde (RS)
Vaisala RS90 radiosonde was equipped at the SGP site that is usually launched four times daily-05 :30,11:30,17:30,and 23:30 UTC [33].It provides vertical variation information of air pressure, temperature, relative humidity.The most widely used approaches to determine the ABLH involve finding the maximal positive gradient of the vertical potential temperature profile or the minimum negative gradient value of the vertical relative humidity profile.Three different algorithms were used to estimate ABLH from RS data in the ARM ABLH VAP: (1) the Heffter method, (2) bulk Richardson number method [38,39], and (3) Liu and Liang method [33].The Liu and Liang method determines ABLH of the convective, stable, and neutral boundary layer from radiosonde soundings profiles collected from 14 major field campaigns around the world [10].It is a robust method and can produce realistic ABLH that can be verified by several thousand additional soundings; its perfor-mance is also validated by substantial literature [14,33].In particular, Krishnamurthy [33] compared these three methods using 1785 cases with radiosonde data and daytime clear or shallow cumulus conditions.From comparisons results, they found that the Liu and Liang technique resulted in the best overall agreement with the Tucker method, and used as a reference to calibrate the RF model in their study.The observation site in our study is same as Krishnamurthy's.Therefore, we also use Liu and Liang method as a reference to evaluate the proposed method.In order to match the ABLH detected by LiDAR data and RS data, the LiDAR data are the average within 10 min of launching radiosonde.

Methods
Figure 1 shows a typical example of NRB under cloudy conditions at SGP. Figure 1a illustrates the vertical distribution profiles of normalized relative LiDAR backscatter (NRB) signals observed at 23:31 UTC, 05 February 2003.Figure 1b,c shows the gradient and relative increase in NRB (RE = (NRB (z + ∆z) − NRB (z))/NRB (z)), respectively.In Figure 1a, NRB almost decreases with altitude until approximately 2.2 km above ground level (AGL) where a rapid increase occurs and a sharp decline at approximately 2.4 km AGL, which correspond to the strongest positive and negative gradient pairs shown in Figure 1b.In addition, In Figure 1c, a relative increase in the NRB-larger than 0.55-is clearly shown between 2.2 km and 2.4 km AGL, indicating that one cloud layer is located between 2.2 km to 2.5 km, which is also confirmed by the cloud data from the ARM SGP site [36].In Figure 1a, the LiDAR signal above the cloud top almost decays to zero due to cloud attenuation.The ABLH determined by RS is 1.082 km.
the ARM ABLH VAP: (1) the Heffter method, (2) bulk Richardson number method [38,39], and (3) Liu and Liang method [33].The Liu and Liang method determines ABLH of the convective, stable, and neutral boundary layer from radiosonde soundings profiles collected from 14 major field campaigns around the world [10].It is a robust method and can produce realistic ABLH that can be verified by several thousand additional soundings; its performance is also validated by substantial literature [14,33].In particular, Krishnamurthy [33] compared these three methods using 1785 cases with radiosonde data and daytime clear or shallow cumulus conditions.From comparisons results, they found that the Liu and Liang technique resulted in the best overall agreement with the Tucker method, and used as a reference to calibrate the RF model in their study.The observation site in our study is same as Krishnamurthy's.Therefore, we also use Liu and Liang method as a reference to evaluate the proposed method.In order to match the ABLH detected by LiDAR data and RS data, the LiDAR data are the average within 10 min of launching radiosonde.

Methods
Figure 1 shows a typical example of NRB under cloudy conditions at SGP. Figure 1a illustrates the vertical distribution profiles of normalized relative LiDAR backscatter (NRB) signals observed at 23:31 UTC, 05 February 2003.Figure 1b,c shows the gradient and relative increase in NRB (RE = (NRB (z + Δz) − NRB (z))/NRB (z)), respectively.In Figure 1a, NRB almost decreases with altitude until approximately 2.2 km above ground level (AGL) where a rapid increase occurs and a sharp decline at approximately 2.4 km AGL, which correspond to the strongest positive and negative gradient pairs shown in Figure 1b.In addition, In Figure 1c, a relative increase in the NRB-larger than 0.55-is clearly shown between 2.2 km and 2.4 km AGL, indicating that one cloud layer is located between 2.2 km to 2.5 km, which is also confirmed by the cloud data from the ARM SGP site [36].In Figure 1a, the LiDAR signal above the cloud top almost decays to zero due to cloud attenuation.The ABLH determined by RS is 1.082 km.

Gradient Method (GM)
As mentioned above, the aerosol concentration in the boundary layer is significantly higher than that in the free atmosphere.There is a marked change in the LiDAR signal when it enters the free atmosphere from the boundary layer.Therefore, the boundary layer height is defined as the position where the LiDAR signal decays fastest or the height at which the minimum gradient is obtained, and is defined as:

Gradient Method (GM)
As mentioned above, the aerosol concentration in the boundary layer is significantly higher than that in the free atmosphere.There is a marked change in the LiDAR signal when it enters the free atmosphere from the boundary layer.Therefore, the boundary layer height is defined as the position where the LiDAR signal decays fastest or the height at which the minimum gradient is obtained, and is defined as: Or where NRB(r) represents the normalized relative backscatter at the height r.Here, we use Equation ( 2) to compute ABLH GM .GM is simple and convenient but is easily disturbed by noise and the aerosol layer structure.

Wavelet Covariance Transform Method (WM)
Wavelet covariance transformation is a method that uses the Haar step function to detect the LiDAR signal step changes.The wavelet covariance function is defined as: where r l and r u are the lower and upper boundaries of the LiDAR profile; a and b are the dilation and translation parameters of the Haar step function.Here, a = 2n∆r, n is a positive integer, ∆r denotes the vertical resolution of LiDAR, and b is set to 90 m step size from 0.3 km to 4.37 km considering the vertical resolution of NRB.
The Haar wavelet function is expressed as: The ABLH obtained by WM is to find the maximum W NRB (a,b) and is shown as: where WM shows good performance in many studies and is often used as the manufacturer's method.However, parameters a and b need to be appropriately selected.

K-Means Method
K-means is a classical unsupervised learning algorithm that partitions objects according to their distance from k cluster centers [40].It is commonly used in many applications since it is simple and computationally efficient.The K-means implementation process to identify ABLH is as follows: Step 1-Build dataset X∈R N×F .N is the number of data points; F is the dimension of data that each dimension represents a feature of clusters.Here, F =3 -altitude r, normalized relative backscatter NRB(r), and the absolute value of the relative change in NRB(r) expressed as|∆NRB(r)|.
Step 2-Normalize the data making each dimension contribute equally to the partition.
Step 3-Choose the number of clusters k, and select the initial cluster center C 1 , . . ., C k at random locations inside the dataset.Here, the number of clusters k is same as that determined by MKnm.
Step 4-Cluster the data X i according to Euclidean distance: where j = 1, . . ., k, indicates the cluster; d(X i ,C j ) is the Euclidean distance between data X i and cluster center C j .
Step 5-Update cluster center using the average of the data in the cluster, defined as: where N j is the number of data points in cluster j.
Steps 4 and 5 are repeated until the maximum number of iterations is reached or the cluster centers stop moving.In this study, the ABLH determine by K-means is located at the category boundary for the first decrease in cluster strength from bottom to top.

MKnm Method
K-means has been used to detect ABLH from LiDAR data.However, it is sensitive to the number of clusters and initial cluster centers.Outliers can also affect cluster centers that keep them away from real values [31,40].Moreover, the Euclidean distance measurement-often used in K-means-does not take into account the relationship between object features [41,42].Therefore, MKnm is proposed to solve these problems.

Algorithm Description
MKnm estimates ABLH by cluster analysis from only the LiDAR signal.One key work of this method is to select features of the clusters.There are at least three clusters to be identified, such as ABL, FT, and cloud (for Figure 1).The algorithm is based on the fact that the intensity of the LiDAR backscatter signal generally decreases with altitude, however, increases sharply when interacting with clouds or uplifting aerosols and decreases rapidly at the upper part of them due to scattering and attenuation [34,35].Therefore, the LiDAR backscatter signal changes significantly when it encounters clouds, as shown in Figure 1.In this study, we choose the feature of clusters as altitude r, normalized relative backscatter NRB(r), and the absolute value of the relative change in NRB(r) expressed as|∆NRB(r)|.Aerosols with adjacent positions, similar concentrations, and varying intensities are classified into one category.This supposes that the ABLH will be of higher accuracy under improved clustering.
In order to accurately partition data, the relationship of each feature should be considered.Two pieces of data that are closely related may not belong to the same cluster, and the magnitude of each feature can also affect the clustering results.MKnm takes the Mahalanobis transform to the raw dataset in order to eliminate effects arising from correlation and different magnitudes from each feature.These processes are expressed as: where Z is the new dataset obtained by taking Mahalanobis transform to raw dataset X; Y is the dataset after projecting X onto U T , which is similar to principal component analysis (PCA); σ i is the standard deviation of the ith feature of Y; U is the orthogonal matrix of the eigenvectors of Σ, expressed as Σ = UQU T ; Σ is the covariance matrix of X.
In the next step, MKnm uses K-near-means to partition dataset Z into k clusters by calculating the Euclidean distance between data and cluster centers.It should appropriately select the number of clusters k and the initial cluster centers C. The aim of this algorithm is to estimate ABLH which means the main focus is to identify ABL and FT.Therefore, a low number of clusters will not extract ABL and FT from NRB.However, a greater number of clusters will only further classify the ABL or FT and increase the cost, which has a minimal effect on ABLH estimation.As mentioned previously, the ABLH will be of higher accuracy under improved clustering. Figure 2 illustrates the clustering under varying numbers of clusters with random initial centers.
From Figure 2, it can be observed that the aerosol layer is further classified with an increase in the number of clusters k.It divides into cloud layer and non-cloud layer at k = 2 and divides into aerosol layer 1, aerosol layer 2, clouds layer, and the attenuation layer above the cloud k = 4.In our study, in order to clustering the aerosol properly, the NRB gradient is chosen as the observed object.The optimal number of clusters is determined by the number of same direction intervals on the gradient of NRB after the process using the interval thresholding technique.It is known that NRB almost decreases with altitude under ideal atmospheric conditions in which the number of same direction intervals is one.It can be identified as ABLH by dividing aerosols into two clusters.Under cloud atmospheric conditions, the NRB gradient is negative until it interacts with clouds or the elevated aerosol layer; it is positive at the lower part and negative at the upper part, corresponding to an enhancement and attenuation of backscatter signals.There are three same direction intervals of the NRB gradient that are needed to divide into three or four clusters to identify ABLH that includes ABL, FT, and cloud or elevated aerosol layer, and the attenuation layer above the cloud may need to be selected.Therefore, the optimal number of clusters is the number of same direction intervals of the NRB gradient or the addition of one.The change in the NRB gradient from Figure 1 is shown in Figure 3.
Remote Sens. 2022, 14, x FOR PEER REVIEW 7 of 19 number of clusters will only further classify the ABL or FT and increase the cost, which has a minimal effect on ABLH estimation.As mentioned previously, the ABLH will be of higher accuracy under improved clustering. Figure 2 illustrates the clustering under varying numbers of clusters with random initial centers.From Figure 2, it can be observed that the aerosol layer is further classified with an increase in the number of clusters k.It divides into cloud layer and non-cloud layer at k = 2 and divides into aerosol layer 1, aerosol layer 2, clouds layer, and the attenuation layer above the cloud k = 4.In our study, in order to clustering the aerosol properly, the NRB gradient is chosen as the observed object.The optimal number of clusters is determined by the number of same direction intervals on the gradient of NRB after the process using the interval thresholding technique.It is known that NRB almost decreases with altitude under ideal atmospheric conditions in which the number of same direction intervals is one.It can be identified as ABLH by dividing aerosols into two clusters.Under cloud atmospheric conditions, the NRB gradient is negative until it interacts with clouds or the elevated aerosol layer; it is positive at the lower part and negative at the upper part, corresponding to an enhancement and attenuation of backscatter signals.There are three same direction intervals of the NRB gradient that are needed to divide into three or four clusters to identify ABLH that includes ABL, FT, and cloud or elevated aerosol layer, and the attenuation layer above the cloud may need to be selected.Therefore, the optimal number of clusters is the number of same direction intervals of the NRB gradient or the addition of one.The change in the NRB gradient from Figure 1 is shown in Figure 3.In Figure 3, there are three same direction intervals.Thus, the optimal num clusters is three or four.Next, we analyze the effect of the initial center in the cluste ysis.Figure 4 illustrates the clustering under different initial centers with a given n of clusters.From the figure, it can be seen that the result of the cluster will be affe the initial center.We can obtain different ABLH with different initial centers fr In Figure 3, there are three same direction intervals.Thus, the optimal number of clusters is three or four.Next, we analyze the effect of the initial center in the cluster analysis.Figure 4 illustrates the clustering under different initial centers with a given number of clusters.From the figure, it can be seen that the result of the cluster will be affected by the initial center.We can obtain different ABLH with different initial centers from the same number of clusters.Therefore, the initial center of clusters needs to be selected carefully.In our study, the maximum of the same direction intervals is selected as the initial centers, and is then fine-tuned.In Figure 3, there are three same direction intervals.Thus, the optimal number of clusters is three or four.Next, we analyze the effect of the initial center in the cluster analysis.Figure 4 illustrates the clustering under different initial centers with a given number of clusters.From the figure, it can be seen that the result of the cluster will be affected by the initial center.We can obtain different ABLH with different initial centers from the same number of clusters.Therefore, the initial center of clusters needs to be selected carefully.In our study, the maximum of the same direction intervals is selected as the initial centers, and is then fine-tuned.Since the cluster center is the mean of the data in the cluster during the update process, the outliers will remove it from the real value.In this study, the data point that nears the cluster center is chosen to update it, defined as: 1 Since the cluster center is the mean of the data in the cluster during the update process, the outliers will remove it from the real value.In this study, the data point that nears the cluster center is chosen to update it, defined as: where C j (n+1) is the center of the jth cluster at time n + 1, k j (n) is the jth cluster at time n, N j (n) is the number of the jth cluster at time n, Z i (n) is the data point i of the jth cluster at time n, H (n) is the mean Euclidean distance between data, and cluster centers at time n.

Flowchart of ABLH Estimated by MKnm
Figure 5 is the flowchart of ABLH estimated under complex atmospheric conditions based on LiDAR data by MKnm.The gradient and relative increases in NRB need to be calculated first to construct dataset X.Indeed, the intensity of the LiDAR backscatter signal usually declines with altitude.However, it increases rapidly when the cloud or elevated aerosol layer is encountered and decreases sharply at the upper boundary of each layer.Therefore, zero-crossing of the derivative occurs at the boundary of the cloud or elevated aerosol layer.Campbell also indicates that the relative increase in NRB will be at least 0.55 when a cloud exists [36].Interval thresholding-which considers the same direction interval as the whole to implement threshold operations-shows better performance for removing noise interference.The NRB gradient that deals with interval thresholding can reflect the distribution of the cloud or elevated aerosol layer in vertical space.Then, the optimal number of clusters K can be determined by calculating the performance metrics.We select the maximum in the same direction interval from the NRB gradient as the initial cluster center.
Considering the correlation of elements contained in the data, a Mahalanobis transform is implemented for the dataset.In addition, data that near the cluster center are used to update the center to reduce interference from outliers.Performance metrics are calculated with the given initial cluster center, and fine-tuning is implemented until 10 times.Finally, the dataset is cluster analyzed using the optimal cluster number and initial cluster center.
In order to estimate the ABLH, the average intensity of each cluster needs to be calculated.The ABLH is located at the category boundary for the first decrease in cluster strength from bottom to top.vated aerosol layer is encountered and decreases sharply at the upper boundary of each layer.Therefore, zero-crossing of the derivative occurs at the boundary of the cloud or elevated aerosol layer.Campbell also indicates that the relative increase in NRB will be at least 0.55 when a cloud exists [36].Interval thresholding-which considers the same direction interval as the whole to implement threshold operations-shows better performance for removing noise interference.The NRB gradient that deals with interval thresholding can reflect the distribution of the cloud or elevated aerosol layer in vertical space.Then, the optimal number of clusters K can be determined by calculating the performance metrics.We select the maximum in the same direction interval from the NRB gradient as the initial cluster center.Considering the correlation of elements contained in the data, a Mahalanobis transform is implemented for the dataset.In addition, data that near the cluster center are used to update the center to reduce interference from outliers.Performance metrics are calculated with the given initial cluster center, and fine-tuning is implemented until 10 times.Finally, the dataset is cluster analyzed using the optimal cluster number and initial cluster center.In order to estimate the ABLH, the average intensity of each cluster needs to be calculated.The ABLH is located at the category boundary for the first decrease in cluster strength from bottom to top.

Performance Metrics
In this study, three indexes were used: (1) the sum of the squared errors (SSE); (2) the mean of the silhouette coefficient (MSC) [43]; and (3) Davies-Bouldin indices (DBI) [44], defined as: where Z is the data in cluster i; k is the number of clusters; C i is the cluster center of cluster i.
where N is the number of data points; a i is the mean distance between data point Z i and other data points with the same cluster.b i is the minimum mean distance between data point Z i and other data points with a different cluster and is defined as: where N Ci is the number in the cluster i; a i represents the similarity between data point Z i and other data points in the cluster.b i represents the dissimilarity between data point Z i and other data points in other clusters.MSC locates in the range of [−1, 1].In general, a bigger MSC indicates a better clustering result, however, does not always work since it is affected by a i and b i .Hence, in our study, the initial centers of clustering are selected based on SSE and MSC.Davies-Bouldin indices also need to be used when selecting the optimal number of clusters, defined as: where k is the number of clusters,µ i is the mean distance between data and their cluster center in cluster i, and d(C i , C j ) is the distance between cluster center C i and cluster center C j .

Results
To assess the validity of the proposed Mknm, this section shows the ABLH obtained different methods and analyzes the ABLH diurnal cycles under cloudy conditions using different methods.Furthermore, ABLHs retrieved by the LiDAR method are compared with radiosonde methods under residual layer or cloudy conditions.

Comparisons of ABLH Retrieval between LiDAR and Radiosonde Methods under Clouds or RL Conditions
In order to further evaluate and quantify the performance of the proposed method, the ABLH detected by four LiDAR-based methods in the presence of RL and cloud layers are compared with ABLH identified by radiosonde.The mean absolute error (MAE) and

Comparisons of ABLH Retrieval between LiDAR and Radiosonde Methods under Clouds or RL Conditions
In order to further evaluate and quantify the performance of the proposed method, the ABLH detected by four LiDAR-based methods in the presence of RL and cloud layers are compared with ABLH identified by radiosonde.The mean absolute error (MAE) and Pearson correlation coefficient (R) are used as scores to assess the quality of methods.They are defined as: where N is the number of data points, Ẑ is the ABLH detected by LiDAR-based methods, and Z RS is the ABLH determined by radiosonde.cov(•) denotes the covariance, and σ(•) denotes the standard deviation.A lower MAE and higher R are better since this indicates that ABLHs detected by LiDAR-based methods are closer to references.Moreover, the mean deviation D-which is the difference between the mean of ABLH detected by LiDAR-based methods and radiosonde-was also used to assess the methods.The residual layer (RL) occurs in the morning or evening and is disconnected from the surface aerosol layer.It has the same properties as the cloud layer when interacting with the LiDAR beam that affects the defined ABLH. Figure 8 shows the comparisons between ABLH results determined by LiDAR-based methods and radiosonde in cloud or RL scenarios at the ARM SGP site from January 2003 to May 2004.Previous studies have indicated that the ABLH derived from LiDAR-based methods is affected by distinct weather patterns and LiDAR-blind areas.Therefore, meteorological conditions including rain, snow, or fog were not considered, and the ABLH below 300 m, determined by radiosonde, was also discarded.This study selected 56 well-defined cases for comparison.Figure 8a compares the ABLH estimated by GM and radiosonde.The ABLH determined by GM is much higher than that obtained from radiosonde due to the presence of clouds or RL, with a correlation coefficient of 0.23, a mean absolute error of 1698 m, and a mean deviation of 1690 m.The comparison results of ABLH-determined by WM and by radiosonde-are shown in Figure 8b, with a correlation coefficient of 0.24, a mean absolute error of 1639 m, and a mean deviation of 1624 m.From Figure 8a,b, it can be seen that ABLH determined by GM and WM both have large errors with ABLH determined by radiosonde.Given the number of clusters, the ABLH determined by cluster analysis (K-means and MKnm) is in close agreement with those by radiosonde as shown in Figure 8c,d.The correlation coefficient between K-means and radiosonde derived ABLH is 0.77, the mean absolute error is 220 m, and the mean deviation is 86 m.The proposed MKnm method obtained the best result, which is consistent with radiosonde, with a correlation coefficient of 0.95, a mean absolute error of 87 m, and a mean deviation of −8 m.   Figure 9 shows a comparison of the ABLH determined by four LiDAR-based methods.From the figure, it can be seen that ABLHs-determined by GM and WM-have the closest correlation, with a correlation coefficient of 0.99.The result is consistent with those described previously since they are methods that find the maximum change in the LiDAR backscatter signal.The cluster analysis methods-K-means and MKnm-are closely  From the figure, it can be seen that ABLHs-determined by GM and WM-have the closest correlation, with a correlation coefficient of 0.99.The result is consistent with those described previously since they are methods that find the maximum change in the LiDAR backscatter signal.The cluster analysis methods-K-means and MKnm-are closely related, with a correlation coefficient of 0.8.However, it also shows that the ABLH-defined by MKnm-varies slightly from that defined by K-means.Compared with MKnm, K-means with random initial center has a larger error.Table 2 shows the comparison results of ABLH-determined by these four LiDAR-based methods.Overall, in most cases, the ABLHs estimated by K-means and MKnm are quite different from those by GM and WM in the presence of clouds or RL.SSE is the sum of the square errors between observations and their cluster center and decreases with an increase in the number of clusters.The magnitude of decline is initially large and gradually decreases.Researchers often use the elbow method to select the optimal number of clusters.Nevertheless, it is easy to fall into local optimum, and the selection of elbow points has a certain uncertainty in which errors are easily produced.In this study, we used SSE as one evaluation index to select the initial cluster centers; because, in the case of the same cluster number K, SSE can better evaluate the clustering effect, which is conducive to selecting better initial centers.MSC combines the cohesion of observations within clusters and the separation of observations between clusters.In terms of good clustering performance, observations of the same cluster should be as similar as possible and vice versa.Thus, a larger MSC is better.The DBI index tests the similarity of observations among clusters, and better clustering results lead to significant differences among clusters, that is, a smaller DBI is better.MSCand DBI are effect evaluation indexes that quantify the quality of clustering which is conducive to selecting the optimal clustering category K and the initial center C.

Define the ABLH after Clustering
According to the aerosol vertical structure distribution characteristics, the ABL is in the lowest layer of the troposphere.Therefore, after clustering, the bottom category is generally ABL, and the upward category adjacent to its potential boundary layer, free atmosphere, cloud, or RL, can be judged according to the aerosol intensity (average) information from these categories.If the intensity of the upward category does not differ greatly from that of the bottom category but is weaker, the upward category is the boundary layer.If the intensity differs greatly from the known boundary layer category and is weaker, the category is free atmosphere.If the intensity differs greatly from the known boundary layer category and is stronger than the known boundary layer category, the category is residual layer aerosol or cloud.Based on this, the boundary layer height can be obtained.

Estimation of ABLH above the Cloud
The ABLH is mostly below the cloud, but in rare cases, it may be coupled to or appear above the cloud [34].In this scenario, it is challenging to define ABLH based on ground-based LiDAR only, and presently, there is no suitable method.In many cases, due to cloud attenuation, even for high-power systems, the LiDAR signal completely decays before reaching the cloud top, and the LiDAR pulse may be unable to penetrate the cloud to obtain the backscattered LiDAR signal generated by aerosols above the cloud top [45].In this case, the LiDAR echo signal on the cloud top will attenuate into background noise, therefore, making it difficult to detect the ABLH on and above the cloud top.As the LiDAR backscattering signal rapidly decays at the upper boundary of the cloud, the gradient-based boundary layer height detection method (gradient method and wavelet method) determines the upper boundary of the cloud as the boundary layer height, and the clustering analysis method also determines the upper boundary of the cloud as the boundary layer height.However, with the weak attenuation of clouds, the LiDAR pulse can penetrate the clouds and obtain aerosol information above the cloud.The methods based on the clustering analysis have great advantages, e.g., they can estimate the ABLH locating at that area.however, the gradient-based method will also define the ABLH as the cloud top.

Conclusions
The estimation of ABLH based on the LiDAR backscatter signal is significantly influenced by complex atmospheric structures such as clouds and uplifting aerosol layers.This paper proposed a new method known as MKnm to define ABLH under cloud or RL conditions.MKnm is a clustering method that aims to overcome the disadvantage of classic clustering methods as K-means.Firstly, it estimates the optimal number of clustering.The NRB gradient reflects the varying aerosol concentration intensity.We can obtain the overall distribution of aerosol concentration intensity in vertical space by implementing interval thresholding.Thus, the optimal number of clustering is related to the number of same direction intervals on the NRB gradient.Secondly, it selects initial centers of clustering.The experiment proved that the ABLH can be affected by initial centers of clustering.In this paper, the interval extreme value was used as the candidate center, and the initial center was determined by fine-tuning based on the evaluation index.Thirdly, Mahalanobis distance was used to partition observations by considering the correlation and different magnitudes of each feature, and the near observations were used to update the centers by eliminating interference from outliers.MKnm, K-means, GM, and WM were compared with RS by taking the LiDAR data under cloud or RL conditions from the ARM SGP C1 site.
The diurnal cycles of ABLH from cloudy weather were detected by using four methods.MKnm and K-means can trace the changing trend of ABLH, however, GM and WM misidentified the cloud top or residual layer as ABLH.MKnm outperformed the K-means which sometimes had a large error.The ABLH estimated by MKnm, K-means, GM, and WM were compared with that defined by radiosonde based on LiDAR data (from January 2003 to May 2004) under cloud or RL conditions.Experimental results showed that GM and WM were similar in estimating ABLH, and the ABLH obtained by both was less related to the ABLH defined by radiosonde; the value was only 0.23-0.24,and the mean absolute error and mean deviation were larger (above 1600 m).However, the ABLH obtained by the clustering analysis method is in close agreement with the ABLH defined by radiosonde-the mean absolute error and mean deviation are small.The proposed MKnm performed best in these methods, with a correlation coefficient of 0.95, a mean absolute error of 87 m, and a mean deviation of −8 m.Overall, MKnm can-with high accuracy-estimate the presence of ABLH in complex weather conditions, and, from the results of this study, the performance of K-means can be improved.

Figure 1 .
Figure 1.Vertical distributions of normalized relative LiDAR backscatter signal on typical cloud conditions: (a) one cloud layer is observed at 23:31UTC, 05 February 2003.Horizontal line with red color depicts the position of ABLH by RS; (b) NRB gradient; (c) relative increase in NRB.

Figure 1 .
Figure 1.Vertical distributions of normalized relative LiDAR backscatter signal on typical cloud conditions: (a) one cloud layer is observed at 23:31UTC, 05 February 2003.Horizontal line with red color depicts the position of ABLH by RS; (b) NRB gradient; (c) relative increase in NRB.

Figure 3 .
Figure 3.The gradient of NRB processed by interval thresholding technique.

Figure 3 .
Figure 3.The gradient of NRB processed by interval thresholding technique.

Figure 6 19 Figure 6 .
Figure 6 shows the ABLH obtained by four methods based on LiDAR data from Figure 1 defined as ABLH-GM, ABLH-WM, ABLH-Kmeans, and ABLH-Mknm.The reference ABLH defined by RS is 1.082 km.Notably, the best result is provided by Mknm, which has the lowest deviate (0.1622 km) and relative deviation (0.1499).The ABLH estimated by GM, WM, and K-means are close to the top of the cloud with a large deviate and relative deviation.The ABLH obtained by GM is highest in all methods which is farthest from the reference ABLH.FOR PEER REVIEW 11 of 19

Figure 7
Figure 7 is the NRB time-height distribution, and the ABLH retrieved by different methods are shown using different symbols.The diurnal cycles in ABLH are detected by four methods with data measured at the ARM SGP site C1 on 2 October 2003.The NCRB are averaged every 10 min, and the convective boundary layer (CBL) in the daytime (from

Figure 6 .
Figure 6.The ABLH obtained by four methods under cloudy conditions.

Figure 7 19 Figure 7 .
Figure 7 is the NRB time-height distribution, and the ABLH retrieved by different methods are shown using different symbols.The diurnal cycles in ABLH are detected by four methods with data measured at the ARM SGP site C1 on 2 October 2003.The NCRB are averaged every 10 min, and the convective boundary layer (CBL) in the daytime (from 11:05-23:45 UTC) is analyzed.Radiosonde launches at 11:30, 17:30, and 23:30, and ABLH determined by RS is shown in Figure 7 as the black diamond.As can be seen from Figure 7, a layer or more clouds appear in the sky throughout the day.ABLHs-calculated by GM and WM-are almost located at the top of the cloud where the strongest change in aerosol concentration occurs.However, ABLHs detected by Kmeans and Mknm are in good agreement with the change in ABLHs, and Mknm can precisely identify ABLHs.From 11:05-14:00, the boundary layer is clearly visible at lower altitudes, and there is a thick high cloud layer above 2.1 km AGL which is shown as a white region.The NCRB is completely attenuated above the cloud top at this time.GM and WM capture the upper edge of the cloud as ABLH with a large error.Kmeans and Mknm are able to correctly identify ABLHs, and Mknm obtains the best results that correctly and accurately capture the change process of ABLHs.Between 17:05 and 19:05 UTC, more than one cloud layer appears in the sky.GM and WM detect the cloud layer, which occurs as the strongest change in aerosol concentration as ABLH.Kmeans and Mknm capture the location as a marked change in aerosol concentration under clouds as ABLH.From 19:25 to 21:05 UTC, the ABLH may couple with the cloud layer due to no sharp change in NRB beside them; the ABLH is located at the upper edge of the cloud estimated by all four methods.It is clear that the ABLH determined by MKnm and Kmeans match very well with that obtained by radiosonde at 11:30 and 23:30 UTC, and MKnm performance best.GM and WM overestimate ABLH as estimated from radiosonde.At 17:30 UTC, there is a low cloud.The ABLH estimated by radiosonde method is the base of the cloud layer where the temperature profile changes rapidly.Considering the continuity of ABLH in two adjacent moments, the ABLH determined by MKnm is selected [34].Remote Sens. 2022, 14, x FOR PEER REVIEW 12 of 19

Figure 7 .
Figure 7. Resulting ABLH for 2 October 2003 with 10 min averages for all methods.Radiosondeestimated ABLHs are shown as the black diamond.

Figure 8 .
Figure 8. Comparisons between ABLH results determined by LiDAR-based methods and radiosonde for 56 cases in cloud or RL conditions.(a) GM and RS; (b) WM and RS; (c) Kmeans and RS; (d) MKnm and RS.The red dotted line is the expected 1:1 reference line.The correlation coefficient, mean absolute error and mean deviation are represented by R, MAE and D, respectively.

Figure 8 .
Figure 8. Comparisons between ABLH results determined by LiDAR-based methods and radiosonde for 56 cases in cloud or RL conditions.(a) GM and RS; (b) WM and RS; (c) Kmeans and RS; (d) MKnm and RS.The red dotted line is the expected 1:1 reference line.The correlation coefficient, mean absolute error and mean deviation are represented by R, MAE and D, respectively.

Figure 9
Figure 9 shows a comparison of the ABLH determined by four LiDAR-based methods.From the figure, it can be seen that ABLHs-determined by GM and WM-have the closest correlation, with a correlation coefficient of 0.99.The result is consistent with those described previously since they are methods that find the maximum change in the LiDAR backscatter signal.The cluster analysis methods-K-means and MKnm-are closely related, with a correlation coefficient of 0.8.However, it also shows that the ABLH-defined by MKnm-varies slightly from that defined by K-means.Compared with MKnm, K-means with random initial center has a larger error.Table2shows the comparison results of ABLH-determined by these four LiDAR-based methods.Overall, in most cases, the ABLHs estimated by K-means and MKnm are quite different from those by GM and WM in the presence of clouds or RL.

Figure 9 .
Figure 9. Intercomparison of the ABLH determined by LiDAR-based methods for 56 cases in cloud or RL conditions.(a) WM and GM; (b) Kmeans and GM; (c) MKnm and GM; (d) Kmeans and WM; (e) MKnm and WM; (f) MKnm and Kmeans.The red dotted line is the expected 1:1 reference line.

Figure 9 .
Figure 9. Intercomparison of the ABLH determined by LiDAR-based methods for 56 cases in cloud or RL conditions.(a) WM and GM; (b) Kmeans and GM; (c) MKnm and GM; (d) Kmeans and WM; (e) MKnm and WM; (f) MKnm and Kmeans.The red dotted line is the expected 1:1 reference line.

1 .
Evaluation Index to Quantify the Quality of Clustering

Table 1 .
Table 1 shows the comparison results of ABLH-determined by LiDAR-based methods and by radiosonde.The results indicate that MKnm performs well under cloud or RL conditions.Correlation coefficients (R), mean absolute error (MAE), mean deviation (D) between ABLH determined by LiDAR-based methods and radiosonde.themeanabsoluteerror is 220 m, and the mean deviation is 86 m.The proposed MKnm method obtained the best result, which is consistent with radiosonde, with a correlation coefficient of 0.95, a mean absolute error of 87 m, and a mean deviation of −8 m.Table1shows the comparison results of ABLH-determined by LiDAR-based methods and by radiosonde.The results indicate that MKnm performs well under cloud or RL conditions.

Table 2 .
Correlation coefficients (R) of ABLH determined by LiDAR-based methods.finedbyMKnm-variesslightly from that defined by K-means.Compared with MKnm, K-means with random initial center has a larger error.Table2shows the comparison results of ABLH-determined by these four LiDAR-based methods.Overall, in most cases, the ABLHs estimated by K-means and MKnm are quite different from those by GM and WM in the presence of clouds or RL.

Table 2 .
Correlation coefficients (R) of ABLH determined by LiDAR-based methods.