Study of the Spatiotemporal Distribution Characteristics of Rainfall Using Hybrid Dimensionality Reduction-Clustering Model: A Case Study of Kunming City, China

Abstract: In recent years, the frequency and intensity of global extreme weather events have gradually increased, leading to significant changes in urban rainfall patterns. The uneven distribution of rainfall has caused varying degrees of water security issues in different regions. Accurately grasping the spatiotemporal distribution patterns of rainfall is crucial for understanding the hydrological cycle and predicting the availability of water resources. This study collected rainfall data at five-minute intervals from 62 rain gauge stations in the main urban area of Kunming City from 2019 to 2021 and constructed an unsupervised hybrid dimensionality reduction-clustering (HDRC) model. The model employs the Locally Linear Embedding (LLE) algorithm from manifold learning for dimensionality reduction of the data samples and uses the dynamic clustering K-Means algorithm for cluster analysis. The results show that the model categorizes the rainfall in the Kunming area into three types. The first type has its rainfall center distributed on the north shore of Dian Lake and the southern part of Kunming's main urban area, with the rainfall distribution gradually developing from the Dian Lake water body towards the land. The second type has its rainfall center in the northern mountainous area of Kunming and shows little spatial dynamic change; the orographic uplift of water vapor by the mountains produces a relatively fixed and concentrated rainfall center. The third type has its rainfall center in the main urban area of Kunming, shows smaller variations in all indicators, occurs mainly in May and September when the temperature is lower, and is related to the urban heat island effect. This research provides a general workflow for spatial rainfall classification, capable of mining the spatiotemporal distribution patterns of regional rainfall from extensive data and generating typical samples of rainfall types.


Introduction
As global climate change intensifies, the frequency and intensity of extreme weather events are on an upward trend [1,2]. Moreover, due to the impact of human activities, urban areas are experiencing significant changes in rainfall patterns [3]. Water contributes to GDP and plays an important role in the economy in general [4], so precipitation variations have a heavy economic impact. These changes not only affect local agricultural production but also pose new challenges for urban water resource management and planning [5]. For instance, the uneven distribution of rainfall may lead to water scarcity in some regions, while others might face flooding disasters [6,7].
Precipitation is a key meteorological forcing factor in numerous geographical studies [8]. Accurately understanding the spatiotemporal distribution of rainfall is crucial for comprehending hydrological cycles, predicting water resource availability, and devising effective flood prevention measures [2,9]. For example, from the perspective of flood risk, the spatiotemporal distribution of rainfall processes is one of the significant factors affecting flood behavior [10]. The evolution of hydrological models from lumped to distributed approaches highlights an essential feature: the ability to respond to changes in the spatiotemporal distribution of rainfall [11,12]. From a temporal perspective, the distribution of rainfall over time can significantly impact the runoff generation mechanisms of the underlying surface. Daily variability, in conjunction with an evolving climate, also plays an important role [13]. The position of peak rainfall during precipitation events notably influences the magnitude of flooding hazards [14]. From a spatial perspective, the spatial distribution of rainfall influences the flow convergence mechanisms of the underlying surface. Conditions at the ground projection of the rainfall center, such as slope, roughness, drainage capacity, and proximity to river channels, produce varying runoff characteristics in specific basins, leading to flood events with distinct spatial features [15].
Traditional rainfall analysis methods primarily rely on statistical techniques to assess rainfall patterns and their impacts on the environment and society [16]. Time series analysis, including moving averages and trend analysis (utilizing methods such as the Mann-Kendall trend test), is widely used to identify long-term trends and temporal correlations in rainfall data [17]. In spatial distribution analysis, Geographic Information System (GIS) technology and Kriging interpolation methods enable researchers to map the spatial distribution of rainfall and predict precipitation at unmeasured locations, revealing geographic characteristics of rainfall patterns [18]. Frequency analysis, such as extreme value analysis and recurrence interval analysis, assesses the frequency and intensity of extreme rainfall events, often employing probability distributions like Gumbel or Weibull for data analysis [19]. Furthermore, integrated approaches, including multivariate and regression analysis, combine various meteorological or geographical factors to analyze their collective impact on rainfall patterns [20]. Although these traditional methods have provided significant insights into rainfall data over the past decades, they exhibit limitations in handling large datasets, identifying complex spatiotemporal dependencies, and analyzing extreme events [21]. Moreover, these methods often start from statistical indicators of rainfall data and require further processing to reconstruct the spatiotemporal course of rainfall events [22].
Consequently, with the advancement of computational technology, an increasing number of studies have begun to explore the use of advanced data analysis techniques to overcome these challenges, offering more in-depth analyses and a more comprehensive understanding. In recent years, the development of artificial intelligence, particularly unsupervised learning, has shown immense potential in analyzing complex meteorological datasets [23][24][25][26]. Unsupervised learning, a branch of machine learning, focuses on automatically discovering information and patterns in unlabeled data [27,28]. This is primarily achieved through two fundamental approaches: dimensionality reduction and clustering. Dimensionality reduction techniques, such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), aim to reduce the dimensions of data while preserving the most crucial information, thereby enhancing data processing efficiency and algorithm performance [29,30]. Clustering algorithms, including K-Means clustering and DBSCAN, reveal the intrinsic structure and patterns of data by grouping it into multiple categories with similar characteristics [31]. By analyzing the characteristics and structure of the data itself without explicit labels, unsupervised learning offers a powerful tool for data mining and knowledge discovery [32,33]. Unsupervised learning methods are now widely used for classifying meteorological data and identifying extreme rainfall events. For instance, Liu Yuan Yuan et al. employed unsupervised learning algorithms to identify rainfall types in the Shenzhen area [34]. Henning Oppel and colleagues utilized a novel unsupervised learning algorithm to explore the patterns of rainfall timing distribution and their correlation with flood type [35]. Andrew Mercer applied unsupervised learning to study the climate patterns of warm-season precipitation in the southeastern United States [36]. However, such studies rarely explore the rainfall patterns of a target region simultaneously from multiple aspects, including the temporal distribution, spatial patterns, and statistical analysis of classification results.
Kunming, located in Yunnan Province, China, experiences a mild subtropical highland monsoon climate due to its geographical location and high altitude. The city enjoys a year-round mild climate with all four seasons resembling spring, and precipitation distribution is uneven, predominantly concentrated in the summer. Precipitation in Kunming is characterized by distinct seasonal patterns and diurnal variations. From a spatiotemporal perspective, precipitation begins to increase in May, peaks in September, and then declines rapidly. Peak rainfall typically occurs in the summer months, especially in July and August. Additionally, precipitation exhibits spatial variations, with mountainous regions surrounding the city receiving more rainfall than the urban plains due to topographical influences. Wang et al. selected precipitation data from Kunming's monitoring stations from 1998 to 2011 to study local extreme weather events [37]. Zhang et al. analyzed forty years of precipitation data up to 2020 to examine the impacts of urbanization on Kunming's rainfall patterns [38]. However, urbanization in this region has not ceased. To our knowledge, studies of the spatiotemporal distribution of urban precipitation in Kunming based on recent data remain insufficient. Previous research has predominantly focused on statistical indicators of precipitation without discussing varying regional rainfall patterns, making it difficult to directly benefit flood safety management in meteorological and hydrological departments.
In light of the aforementioned challenges, this paper proposes the use of an unsupervised HDRC model for the analysis of rainfall events in Kunming, Yunnan Province, China. This model aims to uncover the typical spatiotemporal distribution patterns of rainfall in this region in recent years and to conduct statistical indicator analysis.

Materials and Methods
This study establishes a standardized workflow for analyzing the spatiotemporal distribution characteristics of regional rainfall, including data cleaning, partitioning rainfall events, constructing high-dimensional learning samples, implementing an unsupervised dimensionality reduction-clustering model, and reconstructing the low-dimensional sample space. The specific process is illustrated in Figure 1.

Materials
This study utilizes rainfall data collected every 5 min from 62 rainfall stations in the urban area of Kunming City, provided by the Kunming Meteorological Bureau for the flood seasons from 2019 to 2021. These stations comprehensively cover the main built-up areas of Kunming City, with the distribution of the rain gauge stations illustrated in Figure 2.
Prior to data analysis, it is necessary to clean and screen historical rainfall data to eliminate unreasonable records:

• Records are considered unreasonable if a single station reports more than 10 mm of rainfall in 5 min without any rainfall in the 30 min before and after the event;
• Records are deemed unreasonable if a station records more than 10 mm of rainfall in 5 min while the other stations within the surrounding 5 km × 5 km area report zero rainfall;
• For abnormal records at individual stations, rainfall isohyet maps for the period must be compared to verify the data's reasonableness. If found unreasonable, interpolation results from other stations within a 5 km × 5 km area of the station are used to replace its rainfall record.
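The first screening rule can be sketched as a simple scan over one station's 5 min series. This is a minimal illustration rather than the study's actual cleaning code: the function name `flag_isolated_spikes` and its parameters are hypothetical, and the 30 min window is taken as six 5 min records on each side.

```python
def flag_isolated_spikes(series, threshold_mm=10.0, window_steps=6):
    """Return indices of 5 min records above threshold_mm that have no rain
    in the window_steps records before and after (6 steps = 30 min)."""
    flagged = []
    n = len(series)
    for i, value in enumerate(series):
        if value <= threshold_mm:
            continue
        before = series[max(0, i - window_steps):i]
        after = series[i + 1:min(n, i + 1 + window_steps)]
        # A spike is suspicious only when it is surrounded by dry records.
        if all(v == 0 for v in before) and all(v == 0 for v in after):
            flagged.append(i)
    return flagged


# Hypothetical series: an isolated 12.5 mm record, then a real storm
# in which 11.0 mm is preceded and followed by rain.
series = [0.0] * 8 + [12.5] + [0.0] * 8 + [0.4, 3.0, 11.0, 6.0, 1.0] + [0.0] * 4
print(flag_isolated_spikes(series))
```

Only the isolated record is flagged; the heavy record inside the storm is kept because neighboring records also report rain.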
After initial cleaning of the rainfall data, specific rainfall events must be selected to reduce the sparsity of the dataset. To select as many standardized rainfall events as possible from the existing high-resolution data, the criteria for choosing rainfall events are as follows:
• Eliminate all periods in the annual time series of the study's rainfall stations where rainfall is consistently 0, considering discontinuities in the time series as the start of a new rainfall event;
• Temporally, eliminate rainfall events lasting less than one hour; volumetrically, exclude events where the average cumulative rainfall is less than 2 mm;
• Each rainfall event is downscaled to fit within a one-hour period (divided into twelve 5 min intervals), and the total volume is normalized to standardize the events.
Using the aforementioned methods, 161 rainfall events were selected between 2019 and 2021.
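The selection criteria can be sketched as follows. The helper names (`split_events`, `keep_event`, `rescale_to_bins`) are hypothetical, and the resampling rule is an assumption: the text does not state how an event is compressed into twelve intervals, so this sketch sums records into twelve equal-length bins before normalizing by the event total.

```python
def split_events(records):
    """records: list of time steps, each a list of station rainfall (mm per 5 min).
    A step is wet if any station reports rain; consecutive wet steps form one event."""
    events, current = [], []
    for step in records:
        if any(v > 0 for v in step):
            current.append(step)
        elif current:
            events.append(current)
            current = []
    if current:
        events.append(current)
    return events


def keep_event(event, min_steps=12, min_mean_total_mm=2.0):
    """Keep events of at least one hour (12 steps) whose station-averaged
    cumulative rainfall reaches the 2 mm threshold."""
    n_stations = len(event[0])
    mean_total = sum(sum(step) for step in event) / n_stations
    return len(event) >= min_steps and mean_total >= min_mean_total_mm


def rescale_to_bins(event, n_bins=12):
    """Assumed resampling: sum records into n_bins equal-length time bins,
    then divide by the event total so all entries sum to 1."""
    n_steps, n_stations = len(event), len(event[0])
    bins = [[0.0] * n_stations for _ in range(n_bins)]
    for i, step in enumerate(event):
        b = i * n_bins // n_steps
        for s, v in enumerate(step):
            bins[b][s] += v
    total = sum(sum(row) for row in bins)
    return [[v / total for v in row] for row in bins]


# Hypothetical two-station record: a 2 h event and a 20 min shower.
records = ([[0.0, 0.0]] * 3 + [[0.5, 0.3]] * 24 + [[0.0, 0.0]] * 2
           + [[1.0, 1.0]] * 4 + [[0.0, 0.0]])
events = split_events(records)
kept = [e for e in events if keep_event(e)]
sample = rescale_to_bins(kept[0])
print(len(events), len(kept), len(sample))
```

In this example the short shower is discarded by the one-hour rule and the remaining event becomes a 12 × 2 array of normalized shares.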

Constructing High-Dimensional Data Samples
Following the data cleaning and normalization process, the dataset is transformed into a statistical learning sample set. This set contains the dynamic characteristics of the temporal and spatial dimensions of the rainfall events within the target area, as expressed in Equations (1) and (2):

$$\Omega = \{X^{1}, X^{2}, \dots, X^{n}\} \tag{1}$$

$$X^{i} = \left[P^{i}_{1}, P^{i}_{2}, \dots, P^{i}_{m}\right], \qquad P^{i}_{t} = \left(p_{1,t}, p_{2,t}, \dots, p_{S,t}\right)^{T} \tag{2}$$

Here, $\Omega$ denotes the historical extreme rainfall sample set, comprising $n$ extreme rainfall events. $P^{i}_{t}$ represents the proportion matrix for the $i$th rainfall event at time $t$, and $p_{s,t}$ denotes the percentage of rainfall at the $s$th rainfall station at time $t$ out of the total rainfall at all stations at that time, as defined in Equation (3):

$$p_{s,t} = \frac{r_{s,t}}{\sum_{s'=1}^{S} r_{s',t}} \tag{3}$$

In the equation, $r_{s,t}$ is the rainfall amount at station $s$ at time $t$, with $s = 1, 2, 3, \dots, S$ ($S$ is the number of rainfall stations) and $t = 1, 2, 3, \dots, m$ ($m$ is the number of time intervals).
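As a small illustration of the station-share definition above, the sketch below converts one event (time steps by stations) into per-time-step proportions and flattens the result into a single high-dimensional sample vector. The function names are hypothetical, and mapping a completely dry time step to zeros is an assumption made here to avoid division by zero.

```python
def proportion_matrix(event):
    """event: list of m time steps, each a list of S station amounts (mm).
    Returns p[t][s] = r[t][s] / sum over stations of r[t][.]."""
    out = []
    for step in event:
        total = sum(step)
        # Assumption: a time step with no rain anywhere is mapped to zeros.
        out.append([v / total if total > 0 else 0.0 for v in step])
    return out


def to_sample_vector(event):
    """Flatten the m x S proportion matrix into one vector of length m * S."""
    return [p for row in proportion_matrix(event) for p in row]


# Hypothetical event with three time steps and two stations.
event = [[1.0, 3.0], [2.0, 2.0], [0.0, 0.0]]
print(proportion_matrix(event))
print(len(to_sample_vector(event)))
```

With 62 stations and twelve time intervals, each event would give a vector of 744 entries under this layout, which is the kind of dimensionality that motivates the reduction step.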
Ω represents a high-dimensional data space describing the rainfall processes in a specific region. To enhance computational efficiency and reduce redundancy in the high-dimensional data, and thereby prevent the clustering algorithm from confusing the main components of the rainfall sample space, it is assumed that the rainfall sample vectors of an area with specific climatic features, hydrological distribution, and topography are distributed on a specific low-dimensional manifold. Dimensionality reduction is then performed using the manifold learning algorithm introduced in the next section.

Unsupervised HDRC Models
The unsupervised HDRC model developed in this study comprises two components: first, the application of the Locally Linear Embedding (LLE) algorithm from manifold learning for dimensionality reduction of high-dimensional data samples; second, the implementation of the dynamic clustering K-Means algorithm for clustering analysis of the low-dimensional data samples.

• LLE
The LLE algorithm is a widely utilized unsupervised technique for dimensionality reduction, designed to map high-dimensional data into a lower-dimensional space without altering its intrinsic structure [39]. The primary goal of LLE is to preserve the local structure of the original data as much as possible, discarding irrelevant noise and the global structure through dimensionality reduction. The fundamental concept of LLE involves finding the K nearest neighbors for each data point and representing that point as a linear combination of these neighbors. The low-dimensional representation is then calculated by minimizing the distance between these linear combinations and the data points. LLE stands out as an effective dimensionality reduction tool that maintains the local structural integrity of the data, reduces redundancy present in high-dimensional spaces, and operates efficiently without the need for any prior knowledge or intricate hyperparameter adjustments [40]. The LLE algorithm is implemented in three main steps:

1. In the high-dimensional space, the LLE algorithm identifies the K nearest neighbors of each sample $x_i$ using the Euclidean distance metric. For N data points $\{x_1, x_2, \cdots, x_N\}$, $x_i \in \mathbb{R}^D$, the distance between each sample point $x_i$ and every other sample $x_j$ is $d_{ij} = \lVert x_i - x_j \rVert_2$, and the K samples with the smallest distances form the neighborhood $Q(i)$ of $x_i$.

2. For each sample $x_i$, find the linear relationship with the K nearest neighbors in its neighborhood to obtain the linear relationship weight coefficients. The weights minimize the reconstruction error in Equation (4):

$$J(W) = \sum_{i=1}^{N} \Big\lVert x_i - \sum_{j \in Q(i)} w_{ij} x_j \Big\rVert_2^2 \tag{4}$$

subject to the constraint in Equation (5):

$$\sum_{j \in Q(i)} w_{ij} = 1 \tag{5}$$

Using the constraint, the error can be written in terms of the weight vector $W_i = (w_{i1}, \cdots, w_{iK})^T$ and the local covariance matrix $Z_i$, as shown in Equation (6):

$$J(W) = \sum_{i=1}^{N} W_i^T Z_i W_i, \qquad (Z_i)_{jk} = (x_i - x_j)^T (x_i - x_k), \quad j, k \in Q(i) \tag{6}$$

Next, by minimizing Equation (4) under the constraint of Equation (5), the weight coefficients are obtained, as shown in Equation (7), where $\mathbf{1}_K$ is the K-dimensional vector of ones:

$$W_i = \frac{Z_i^{-1} \mathbf{1}_K}{\mathbf{1}_K^T Z_i^{-1} \mathbf{1}_K} \tag{7}$$

3. Assuming the linear relationship weight coefficients $W_i$ within the K neighborhood remain constant between the high-dimensional and low-dimensional spaces, the sample data are reconstructed in the lower dimension using the weight coefficients $W_i$. The LLE algorithm assumes that mapping high-dimensional samples to a lower-dimensional space preserves the local linear relationships of the samples with constant weight coefficients. The points $\{x_1, x_2, \cdots, x_N\}$, $x_i \in \mathbb{R}^D$, in the high-dimensional space are thus mapped into the low-dimensional space as $\{y_1, y_2, \cdots, y_N\}$, $y_i \in \mathbb{R}^d$, through the weight coefficients W. The mapping $y_i$ of the high-dimensional sample point $x_i$ in the low-dimensional space is obtained by minimizing the mean square deviation; the objective function is given by Equation (8):

$$J(Y) = \sum_{i=1}^{N} \Big\lVert y_i - \sum_{j=1}^{N} w_{ij} y_j \Big\rVert_2^2 \tag{8}$$

In order to obtain normalized low-dimensional data, the constraints are as shown in Equation (9):

$$\sum_{i=1}^{N} y_i = 0, \qquad \frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I \tag{9}$$

where $y_i \in \mathbb{R}^d$ ($d \ll D$). Setting $w_{ij} = 0$ for all points $j$ outside the neighborhood of $i$ expands the matrix of weight coefficients from $W_{N \times K}$ to the square matrix $W_{N \times N}$.

With $Y = [y_1, y_2, \cdots, y_N] \in \mathbb{R}^{d \times N}$, and with $I_i$ and $W_i$ denoting the $i$th columns of the identity matrix and of the square weight matrix (column $i$ of $W$ holds the weights $w_{ij}$ of sample $i$), the mean square deviation is written in the following form, as shown in Equation (10):

$$J(Y) = \sum_{i=1}^{N} \lVert Y I_i - Y W_i \rVert_2^2 = \mathrm{tr}\left( Y (I - W)(I - W)^T Y^T \right) \tag{10}$$

Let $M = (I - W)(I - W)^T$ and Equation (10) becomes Equation (11):

$$J(Y) = \mathrm{tr}\left( Y M Y^T \right) \tag{11}$$

Using the same method, Equation (11) is transformed into Equation (12) by constructing an optimization function with Lagrange multipliers $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_d)$:

$$H(Y) = \mathrm{tr}\left( Y M Y^T \right) - \mathrm{tr}\left( \Lambda \left( Y Y^T - N I \right) \right) \tag{12}$$

Differentiating $H(Y)$ with respect to $Y$ and setting the derivative to 0 gives Equation (13):

$$2 Y M - 2 \Lambda Y = 0 \quad \Longrightarrow \quad M Y^T = Y^T \Lambda \tag{13}$$

It follows that Y is a matrix consisting of the eigenvectors of M. In order to reduce the data to d dimensions, the eigenvectors corresponding to the d smallest non-zero eigenvalues of the matrix M are taken to be the desired low-dimensional mapping Y.
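The three LLE steps can be sketched in NumPy as below. This is an illustrative sketch and not the implementation used in the study: the function name `lle`, the regularization term `reg` (added so that the local matrix stays invertible when K exceeds the input dimension), and the toy helix data are all assumptions. Weights are stored column-wise so that M = (I − W)(I − W)^T matches the form given in the text.

```python
import numpy as np


def lle(X, n_neighbors, n_components, reg=1e-3):
    """Minimal LLE sketch. X has shape (N, D); returns (Y, W) with Y of
    shape (N, n_components) and W the N x N weight matrix (column i holds
    the reconstruction weights of sample i)."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]

    # Step 1: K nearest neighbors by Euclidean distance, excluding the point itself.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)
    neighbors = np.argsort(sq_dist, axis=1)[:, :n_neighbors]

    # Step 2: reconstruction weights that sum to one for each sample.
    W = np.zeros((N, N))
    ones = np.ones(n_neighbors)
    for i in range(N):
        diff = X[i] - X[neighbors[i]]              # (K, D) differences x_i - x_j
        Z = diff @ diff.T                          # local covariance matrix (K, K)
        Z = Z + reg * np.trace(Z) * np.eye(n_neighbors)  # assumed regularization
        w = np.linalg.solve(Z, ones)
        W[neighbors[i], i] = w / w.sum()

    # Step 3: eigenvectors of M for the smallest non-zero eigenvalues.
    I = np.eye(N)
    M = (I - W) @ (I - W).T
    eigvals, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    # Index 0 is the (near) zero eigenvalue with a constant eigenvector; skip it.
    Y = eigvecs[:, 1:n_components + 1] * np.sqrt(N)
    return Y, W


# Hypothetical toy data: 40 points along a helix in three dimensions.
t = np.linspace(0.0, 3.0, 40)
X = np.column_stack([np.cos(t), np.sin(t), t])
Y, W = lle(X, n_neighbors=6, n_components=2)
print(Y.shape)
```

Scaling the eigenvectors by the square root of N makes the embedding satisfy the unit-covariance constraint; the sketch assumes the zero eigenvalue of M is simple, which holds when the neighborhood graph is connected.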

• K-Means
K-Means is a clustering analysis algorithm that divides a set of unlabeled data points into distinct groups or "clusters", defined by the similarity or proximity of data points within each cluster [41]. The fundamental concept of this algorithm is to assign data points to a predetermined number of clusters to minimize the distance between each data point and the centroid of its cluster while maximizing the distance between different clusters [42].
The process involves partitioning n data points into k clusters, where k is a predetermined constant. Clusters are defined based on the similarity or proximity of the data points within them, measured using metrics such as Euclidean distance, Manhattan distance, or cosine similarity. In K-Means, each cluster is characterized by a centroid, which is the mean of all data points within the cluster. The algorithm begins by randomly selecting k data points as the initial centroids and then iterates by assigning the remaining data points to the nearest cluster based on distance, updating the centroids of the clusters and reassigning data points until the allocation stabilizes or a set number of iterations is reached. Finally, the K-Means algorithm returns k clusters and their centroids [43]. The selection of the hyperparameter k is critical and can be initially estimated based on prior knowledge, with adjustments made by evaluating the similarity of clustering outcomes to avoid misclassification [44].
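The iteration described above can be sketched in plain Python. The function name `kmeans`, the fixed `seed`, and the returned within-cluster sum of squares (`inertia`) are assumptions for illustration; squared Euclidean distance is used, and the toy points are hypothetical. Comparing the inertia across several values of k is one common way to explore the choice of k.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: random initial centroids drawn from the data,
    then alternating assignment and centroid update until labels stop changing."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Hypothetical two-dimensional points forming three loose groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
labels, centroids, inertia = kmeans(points, 3)
inertias = {k: kmeans(points, k)[2] for k in range(1, 5)}
print(labels)
print(inertias)
```

Because the initial centroids are random, a single run can stop in a local optimum; repeating the run with several seeds and keeping the lowest inertia is a common safeguard.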
The above method finds the clusters of the rainfall feature space after the manifold learning dimensionality reduction algorithm has been applied, as shown in Equation (14):

$$C = \{C_1, C_2, \cdots, C_r\}, \qquad C_j \subset \{y_1, y_2, \cdots, y_N\} \tag{14}$$

These clusters, while lacking physical dimensionality, represent the projection of the original sample feature space onto a low-dimensional manifold. The LLE algorithm presupposes that local linear relationships within the high-dimensional space are preserved in the low-dimensional space, meaning the linear relationship between a sample $x_i$ and its neighbors in the high-dimensional space remains consistent with that between its projection $y_i$ and the corresponding neighbors in the low-dimensional space. Thus, the clusters $C = \{C_1, C_2, \cdots, C_r\}$ obtained through dynamic clustering are applicable in the physically meaningful high-dimensional sample space. By reverse-mapping these clusters using sample indices, clusters $B = \{B_1, B_2, \cdots, B_r\}$ in the high-dimensional space are identified, enabling the analysis of the spatial and temporal distribution characteristics of rainfall in the high-dimensional space, as shown in Equation (15):

$$B_j = \{x_i \mid y_i \in C_j\}, \qquad j = 1, 2, \cdots, r \tag{15}$$
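Because each low-dimensional point keeps the index of the sample it came from, the reverse mapping reduces to grouping sample indices by cluster label. The sketch below illustrates this; the function names are hypothetical, and using the member-wise mean as a "typical" sample of a cluster is an assumption made for illustration, not a step stated in the text.

```python
def reverse_map(labels):
    """Group sample indices by cluster label: label -> list of indices."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    return clusters


def typical_sample(samples, indices):
    """Assumed summary: element-wise mean of the high-dimensional samples
    that belong to one cluster."""
    members = [samples[i] for i in indices]
    return [sum(col) / len(members) for col in zip(*members)]


# Hypothetical labels from clustering and the matching high-dimensional samples.
labels = [0, 1, 0, 2]
samples = [[0.2, 0.8], [0.5, 0.5], [0.4, 0.6], [0.9, 0.1]]
clusters = reverse_map(labels)
print(clusters)
print(typical_sample(samples, clusters[1]))
```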

Results
Using the rainfall data for Kunming City and the computational process introduced in Section 2, and applying the elbow method together with the prior knowledge of local meteorological and hydrological departments, three types of rainfall events were identified among the 161 rainfall events. The clustering visualization is presented in Figure 3, where Feature Dimension 1 and Feature Dimension 2 represent low-dimensional components obtained through the LLE algorithm and lack specific physical meaning. These dimensions are abstract representations that capture the structure of the data in a reduced form to facilitate the clustering and visualization of the underlying patterns within the rainfall events. The statistical information for the three types of rainfall is shown in Table 1. To spatially resample these three characteristic rainfall types, the Kriging interpolation method was employed. The spatial and temporal distribution maps of the rainfall are depicted in Figures 4-6.
Type Ⅰ rainfall is characterized by rain centers distributed along the northern shore of Dian Lake and the southern part of Kunming's main urban area. As observed in Figure 4, the spatial dynamics of this rainfall type exhibit a progression from the waters of Dian Lake gradually towards the land, posing a primary threat to the western and southern regions of Kunming's main urban area. According to Table 1, this type of rainfall accounts for 79 events, making up 49.07% of the study samples and representing the predominant rainfall pattern in the Kunming area. These rainfall events mainly occur in June, July, August, and September, with an average duration of 4.81 h. The maximum hourly rainfall intensity at the center reaches 26.0 mm/h. The average rainfall duration peaks in July at 6.78 h, and the average rainfall amount is 16.81 mm.
Type Ⅱ rainfall is characterized by its concentration in the northern mountainous region of Kunming City. As depicted in Figure 5, the spatial dynamic change of this rainfall type is relatively minor, with a more fixed and concentrated rainfall center due to the orographic effect of the mountain's windward slopes. The distribution of the rainfall center aligns closely with the contour lines of the mountain, posing a significant threat to the northern mountainous areas and urban regions of Kunming, potentially triggering flash floods and affecting the northern urban areas. According to Table 1, this type of rainfall primarily occurs during the hottest months, July and August, with a higher frequency in August. Among the three rainfall types, it accounts for the smallest proportion of the study sample with 30 events. However, it features the highest maximum hourly rainfall intensity at 33.14 mm/h, marking it as a type of rainfall with highly uneven spatial and temporal distribution.
Type Ⅲ rainfall is centered around the main urban area of Kunming City and is associated with the city's urban heat island effect. According to Table 1, this type of rainfall shows relatively minor variations in its parameters and primarily occurs during the cooler months of May and September, with maximum hourly rainfall intensities of 19.73 mm/h and 19.53 mm/h, respectively. This type of rainfall mainly takes place within the urban boundaries and is a significant cause of urban flooding in Kunming. Figure 6 reveals that rainfall accumulates in the lowland at the intersection of the four central districts of Kunming City, potentially exacerbating flooding issues within the urban environment.

Discussion
To further cross-validate the correlation of the temporal distribution of rainfall among the three different types of rainfall across various regions, this paper categorizes the rain gauge stations within the study area into three groups based on their locations: the first group consists of rain gauge stations located in urban areas, the second group includes stations close to water bodies (Dian Lake), and the third group comprises stations in mountainous areas. We statistically analyzed the maximum cumulative proportion and the average cumulative proportion of different rainfall types at the rain gauge stations in different regions, as shown in Figure 7. Figure 7a corresponds to Type Ⅰ rainfall, where the slope of the curves for all three regions is similar, indicating a relatively even distribution of rainfall, consistent with the prior knowledge that water vapor from the water body enters the plain area. Figure 7b pertains to Type Ⅱ rainfall, which demonstrates that the rainfall in the mountainous area is the greatest, while that in the urban area is the least. Although the mountainous area has the largest maximum rainfall amount, its average rainfall is consistent with that near Dian Lake, suggesting that the rainfall in the mountainous area is short-lived but intense, aligning with the results presented in Section 3. Type Ⅲ rainfall corresponds to urban area rainfall; as seen in Figure 7c, this type of rainfall predominates in urban areas, illustrating the stimulatory effect of the urban heat island on rainfall events.
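One way to compute such regional curves is sketched below. The exact definition of the cumulative proportions is not given in the text, so this sketch assumes that each station's share of the normalized event total is accumulated over time, and that the mean and the maximum are then taken across the stations of each group. The function name, group names, and station indices are hypothetical.

```python
from itertools import accumulate


def cumulative_curves(sample, groups):
    """sample: m time steps x S stations, normalized so all entries sum to 1.
    groups: dict mapping a region name to a list of station indices.
    Returns region -> (mean_curve, max_curve) of per-station cumulative shares."""
    result = {}
    for name, stations in groups.items():
        per_station = [list(accumulate(row[s] for row in sample)) for s in stations]
        mean_curve = [sum(vals) / len(vals) for vals in zip(*per_station)]
        max_curve = [max(vals) for vals in zip(*per_station)]
        result[name] = (mean_curve, max_curve)
    return result


# Hypothetical normalized sample: three time steps, four stations.
sample = [[0.10, 0.05, 0.00, 0.05],
          [0.20, 0.10, 0.05, 0.05],
          [0.10, 0.10, 0.10, 0.10]]
groups = {"urban": [0, 1], "lake": [2], "mountain": [3]}
curves = cumulative_curves(sample, groups)
print(curves["urban"])
```

Comparing the slopes of the mean curves between groups, and the gap between the maximum and mean curves within a group, mirrors the kind of reading applied to Figure 7.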

Conclusions
This study analyzes rainfall data from 62 stations in Kunming City from 2019 to 2021, cleaning and vetting the data at 5 min intervals. After screening, 161 rainfall events were identified, and a rainfall sample database for Kunming City was constructed. Utilizing manifold learning and dynamic clustering algorithms, an unsupervised HDRC model was developed to analyze the spatiotemporal distribution characteristics of rainfall in Kunming City. The rainfall was categorized into three types based on these characteristics. Analyses and summaries were conducted on the spatiotemporal distribution, timing and frequency of occurrence, and potential disaster-prone areas of each type of rainfall. This paper helps fill gaps in the use of unsupervised learning methods for classifying meteorological data and identifying extreme rainfall events. Rainfall patterns in the target region were explored from multiple perspectives, including the temporal distribution, spatial patterns, and statistical analysis of classification results. The overview of the three types of rainfall is as follows:
1. Type Ⅰ rainfall is characterized by a distribution center located between the northern shore of Dian Lake and the southern part of Kunming's main urban area. Rainfall distribution is relatively uniform, posing a primary threat to the western and southern regions of Kunming's main urban area. This type of rainfall comprises 79 events, accounting for 49.07% of the study sample, and is the predominant rainfall type in the Kunming area. It primarily occurs in June, July, August, and September.
2. Type Ⅱ rainfall has its center in the northern mountainous region of Kunming, with minimal spatial dynamic variation.Among the rainfall types, this class, comprising 30 events and accounting for 18.63% of the sample, is the least common.It predominantly occurs in July and August, the warmest months, with August witnessing the highest frequency and peak hourly intensity of 33.14 mm/h, indicating a highly uneven spatiotemporal distribution of precipitation.
3. The precipitation center of Type Ⅲ rainfall is located in the urban area of Kunming and is associated with the urban rain island effect. Variability in rainfall parameters is minimal. This category consists of 52 events, representing 32.30% of the study sample, predominantly occurring in the cooler months of May and September, with peak hourly intensities of 19.73 mm/h and 19.53 mm/h, respectively. This type of rainfall predominantly occurs within urban areas and is the primary cause of urban waterlogging in Kunming.
The unsupervised HDRC model enables rapid classification of rainfall processes in a region, producing various types of typical rainfall events and their characteristic values, thereby facilitating a deeper understanding of the influence of topography, hydrology, and other factors on the local hydrological cycle. In practical applications, predictions of the most likely rainfall types can be made based on the timing of rainfall events and factors such as the direction of moisture sources provided by meteorological forecasts. This allows for the implementation of targeted flood prevention and control measures, such as reinforcing flood prevention arrangements at rainfall centers. This approach has potential for application in cities with varied climate characteristics, water-land distributions, and topographical features, offering valuable decision-making references for local meteorological and flood control departments.

Figure 1. Flow chart of the analysis of spatial and temporal distribution characteristics of rainfall.

Figure 2. Study scope of the rainfall area.

Figure 4. Cumulative rainfall distribution for Type Ⅰ rainfall.

Figure 7. Cumulative rainfall ratios for different regions for three different types of rainfall: (a) Type Ⅰ rainfall; (b) Type Ⅱ rainfall; and (c) Type Ⅲ rainfall.

Table 1. Statistical characteristics of various rainfall types.