Step 1: Given the burstiness and localized nature of SDHR events, rainfall samples are selected based on both rainfall intensity and the spatial extent of the rainfall area, using hourly data from meteorological observation stations. By digitally encoding multiple SDHR events, a high-dimensional array is constructed that captures both the temporal and spatial distribution characteristics of the rainfall. This array enables a comprehensive digital representation of all the SDHR events.
Step 2: The high-dimensional array is then transformed into a fully connected form to construct a set of SDHR sample sequences. The K-shape clustering algorithm is then applied to iteratively optimize the class centers of the sample sequence set. In each iteration, the algorithm calculates the shape similarity between each sample and every class center and assigns the sample to the most similar one, thereby classifying the SDHR events into different groups.
Step 3: Temporal and spatial distribution features of rainfall, such as the coefficient of variation of rainfall time, rainfall development duration, spatial variance coefficient, and spatial centroid of rainfall, are used to perform a comparative analysis of the spatiotemporal distribution characteristics of different types of SDHR events in Beijing.
2.2.1. Digital SDHR Process
(1) Screening of SDHR samples
Currently, there is no unified definition of SDHR. The National Meteorological Center defines SDHR operationally as rainfall exceeding 20 mm/h. However, to identify SDHR events with significant spatiotemporal variations, we adopt the method used by the Beijing Meteorological Center, as outlined in Li Chen’s study [61]. This approach sets screening criteria based on both rainfall intensity and precipitation area. The proportion of stations and precipitation thresholds are determined through a comprehensive analysis of historical rainfall and disaster data in Beijing, ensuring a more accurate representation of SDHR dynamics in urban flood risk assessment.
The specific screening criteria are shown in Table 1. An SDHR event is considered to have occurred if any of the following three conditions is met. The screening process follows a reverse order, beginning with Condition 3. If Condition 3 is not satisfied, the sample is then assessed according to Condition 2. If Condition 2 is also not met, the sample is evaluated based on Condition 1. This sequential approach not only improves computational efficiency but also ensures the independence of each heavy rainfall sample. Using this method, a total of 105 SDHR events that met the screening criteria were identified for the period from 2009 to 2021.
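The reverse-order screening can be illustrated with a minimal Python sketch. The thresholds and station proportions below are placeholders chosen only for illustration (the actual values are those of Table 1), and the names `CONDITIONS` and `matched_condition` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# Placeholder criteria: condition id -> (hourly intensity threshold in mm/h,
# minimum share of stations reaching it). These numbers are NOT from Table 1.
CONDITIONS: Dict[int, Tuple[float, float]] = {
    1: (20.0, 0.10),
    2: (30.0, 0.05),
    3: (50.0, 0.01),
}


def matched_condition(
    hourly_rain_mm: Sequence[float],
    conditions: Dict[int, Tuple[float, float]] = CONDITIONS,
) -> Optional[int]:
    """Return the first condition met, checking Condition 3, then 2, then 1."""
    n_stations = len(hourly_rain_mm)
    if n_stations == 0:
        return None
    for cond_id in sorted(conditions, reverse=True):
        threshold, min_share = conditions[cond_id]
        share = sum(r >= threshold for r in hourly_rain_mm) / n_stations
        if share >= min_share:
            return cond_id  # stop at the first match so a sample is counted once
    return None
```

Returning at the first satisfied condition is what keeps each sample from being counted under more than one condition.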
(2) Digital processing of SDHR samples
The digitalization of SDHR samples aims to represent the spatiotemporal distribution of precipitation in a numerical format. Each rainfall event exhibits a unique data distribution pattern, and this approach enables the application of mathematical tools for further analysis.
For each SDHR sample, a high-dimensional array is constructed in both the temporal and spatial dimensions. Specifically, a rainfall event involves data from $n$ time periods and $s$ rain gauge stations, and digitalizing the event means arranging these observations in an array indexed by station and time period. For a set of $N$ SDHR samples, there will be $N$ such arrays. In this way, a sample set of rainfall processes, denoted as $\varphi$, is established, which provides a digital representation of multiple rainfall events, as shown in the following equation:

$$\varphi = \{Q_1, Q_2, \ldots, Q_N\}, \qquad Q_i = \begin{bmatrix} p_{1,t_1} & p_{1,t_2} & \cdots & p_{1,t_n} \\ p_{2,t_1} & p_{2,t_2} & \cdots & p_{2,t_n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{s,t_1} & p_{s,t_2} & \cdots & p_{s,t_n} \end{bmatrix} \tag{1}$$

Then, $Q_i$ is transformed into $QT_i$ for K-shape clustering analysis:

$$QT_i = \left[ p_{1,t_1}, \ldots, p_{1,t_n}, p_{2,t_1}, \ldots, p_{2,t_n}, \ldots, p_{s,t_1}, \ldots, p_{s,t_n} \right] \tag{2}$$

In these equations, $\varphi$ represents the SDHR sample set, which includes $N$ SDHR events; $Q_i$ is the array of the $i$-th rainfall sample and $QT_i$ is its fully connected form, with $i = 1, 2, \ldots, N$; $p_{s,t_n}$ is the rainfall amount at the $s$-th rain gauge station at the $t_n$-th time; $s$ is the number of rain gauge stations; and $n$ is the number of time periods.
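As an illustration of the digitalization step, the following NumPy sketch builds the flattened sample set from per-event station-by-hour arrays. Two points are assumptions rather than statements of the method: events of different duration are zero-padded along the time axis to a common length, and each array is flattened station by station. The function name `build_sample_set` is hypothetical.

```python
import numpy as np


def build_sample_set(events):
    """Stack N events, each an (s stations x n_i hours) array, into an
    (N, s * n_max) matrix of fully connected sequences."""
    arrays = [np.asarray(q, dtype=float) for q in events]
    s = arrays[0].shape[0]
    if any(a.ndim != 2 or a.shape[0] != s for a in arrays):
        raise ValueError("every event needs the same number of stations")
    n_max = max(a.shape[1] for a in arrays)
    padded = np.zeros((len(arrays), s, n_max))
    for i, a in enumerate(arrays):
        padded[i, :, : a.shape[1]] = a  # assumption: zero-pad short events in time
    # Row-major reshape concatenates the hourly series of station 1, 2, ..., s.
    return padded.reshape(len(arrays), s * n_max)
```

For example, two events observed at two stations for 2 h and 3 h, respectively, yield a 2 × 6 matrix in which each row is one flattened event.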
2.2.2. Classification of SDHR Events Using the K-Shape Algorithm
The K-shape algorithm, proposed by Paparrizos and Gravano [62], is a time series clustering method based on shape similarity. K-shape is well suited to the analysis of SDHR datasets for three reasons. (1) Temporal phase alignment: using a normalized cross-correlation distance metric, it compares time series by shape rather than timing, capturing phase shifts in rainfall peaks that traditional methods such as K-means fail to address. (2) Amplitude normalization: it normalizes rainfall amplitudes, reducing biases from intensity differences (e.g., 50 mm/h vs. 30 mm/h) and focusing on structural similarities. (3) Computational efficiency: it balances accuracy and speed and is less expensive than DTW-based methods, making it scalable to high-resolution SDHR datasets (Table 2).
K-shape is an iterative process that groups time series data by continuously optimizing cluster centers and reassigning members. Each iteration consists of two main steps.
Step 1: Cluster center extraction: The center for each cluster is computed based on the time series data within the cluster.
Step 2: Member assignment: Each time series is compared to all cluster centers, and based on shape similarity, it is assigned to the closest cluster center, thereby updating the cluster membership.
This iterative process continues until the cluster membership stabilizes or the maximum number of iterations is reached. The shape similarity measure and the cluster center extraction method are crucial in this process, as they directly impact the accuracy of the clustering results.
In this study, the K-shape algorithm is applied to the fully connected form of the constructed SDHR sample set φ, resulting in a classification of different types of SDHR events.
(1) Shape similarity measurement method
To efficiently measure the shape similarity between time series, the K-shape algorithm uses cross-correlation as a similarity metric. Specifically, given two fully connected rainfall time series $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_m)$, with $Y$ remaining fixed, the time series $X$ is shifted relative to $Y$ to compute the cross-correlation at various displacements. The relative displacement of the sequences is expressed as follows:

$$X_{(s)} = \begin{cases} (\underbrace{0, \ldots, 0}_{|s|}, x_1, x_2, \ldots, x_{m-s}), & s \ge 0 \\ (x_{1-s}, \ldots, x_{m-1}, x_m, \underbrace{0, \ldots, 0}_{|s|}), & s < 0 \end{cases} \tag{3}$$

where $s \ge 0$ indicates that sequence $X$ has shifted $s$ units to the right, and $s < 0$ indicates that sequence $X$ has shifted $|s|$ units to the left. When $s$ ranges over $-(m-1), \ldots, m-1$, there will be $2m - 1$ possible shift scenarios for the two time series. Then, the cross-correlation for each shift scenario is calculated to assess the degree of shape similarity between the two time series at different shifts.
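The cross-correlation metric between $X$ and $Y$ is defined as follows:

$$CC_w(X, Y) = R_{w-m}(X, Y), \quad w \in \{1, 2, \ldots, 2m-1\} \tag{4}$$

The calculation method for $R_{w-m}(X, Y)$ is as follows:

$$R_k(X, Y) = \begin{cases} \sum_{l=1}^{m-k} x_{l+k}\, y_l, & k \ge 0 \\ R_{-k}(Y, X), & k < 0 \end{cases} \tag{5}$$

where $R_k(X, Y)$ is the local inner product calculation, representing the cross-correlation between sequence $X$ and sequence $Y$ at position $w$. Here, $k = w - m$ refers to the relative offset between sequence $X$ and sequence $Y$.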
The purpose of this step is to identify the value of the parameter $w$ that maximizes the cross-correlation between the time series $X$ and $Y$. This parameter $w$ represents the lag or shift at which the similarity measure between $X$ and $Y$ is the highest. Once the optimal $w$ is determined, we can calculate the corresponding shift of $X$ relative to $Y$, denoted as $X_{(s)}$, where $s$ is the difference between $w$ and $m$: $s = w - m$.
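The shifting and the search for the optimal position can be sketched in NumPy as follows. This is an illustrative sketch under one assumption: for each displacement $s$ it evaluates the inner product of the zero-padded, shifted copy of $X$ with $Y$ directly, so the returned value is the displacement to apply to $X$. The function names are hypothetical.

```python
import numpy as np


def shift(x, s):
    """Shift x by s positions (right if s >= 0, left if s < 0), zero-padding."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    out = np.zeros(m)
    if s >= 0:
        out[s:] = x[: m - s]
    else:
        out[: m + s] = x[-s:]
    return out


def cross_correlation(x, y):
    """Inner product of shifted x with y for s = -(m-1), ..., m-1,
    stored at positions w = s + m = 1, ..., 2m-1 (index w - 1)."""
    m = len(x)
    return np.array([np.dot(shift(x, s), y) for s in range(-(m - 1), m)])


def best_shift(x, y):
    """Displacement s = w - m of x that maximizes the cross-correlation with y."""
    cc = cross_correlation(x, y)
    w = int(np.argmax(cc)) + 1  # 1-based position of the maximum
    return w - len(x)
```

The explicit loop costs on the order of $m^2$ operations and is kept for readability; FFT-based implementations are commonly used to obtain the same sequence faster.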
To enhance computational efficiency and address the differences in amplitude between time series, the K-shape algorithm z-score normalizes the time series and applies coefficient normalization to the cross-correlation sequence. The definition of the normalized cross-correlation is as follows:

$$NCC_c(X, Y) = \frac{CC_w(X, Y)}{\sqrt{R_0(X, X) \cdot R_0(Y, Y)}} \tag{6}$$

The normalized cross-correlation coefficient ranges over [−1, 1], where 1 means complete positive correlation and −1 means complete negative correlation. The Shape-Based Distance (SBD) is obtained from the maximum of the normalized cross-correlation over all positions $w$:

$$SBD(X, Y) = 1 - \max_{w} NCC_c(X, Y) \tag{7}$$

An SBD of 0 signifies identical time series shapes, while a value of 2 indicates complete opposition. Once the centroids are determined, the SBD is used to assign each time series to the nearest cluster based on shape similarity.
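A compact way to evaluate the SBD is sketched below with NumPy. It relies on `np.correlate` in `full` mode to obtain all $2m - 1$ cross-correlation values; z-score normalization is left to the caller, and the helper names (`z_normalize`, `sbd`, `nearest_centroid`) are hypothetical.

```python
import numpy as np


def z_normalize(x):
    """Zero-mean, unit-variance copy of x (all zeros if x is constant)."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)


def sbd(x, y):
    """Shape-based distance: 1 - max_w NCC_c(x, y), a value in [0, 2]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    if denom == 0:
        return 1.0  # convention used here when one series is all zeros
    cc = np.correlate(x, y, mode="full")  # all 2m - 1 shifted inner products
    return float(1.0 - cc.max() / denom)


def nearest_centroid(x, centroids):
    """Index of the centroid with the smallest SBD to x."""
    return int(np.argmin([sbd(x, c) for c in centroids]))
```

Two rainfall sequences that differ only by a time shift or by a constant scale factor therefore have an SBD of (numerically) zero.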
(2) Cluster centroid extraction method
In time series clustering analysis, centroid extraction is a crucial step, as it directly determines the representative data point for each category. This representative data point, or cluster centroid, reflects the overall trend and characteristics of the time series data for that category.
The traditional method for extracting a class centroid involves calculating the arithmetic mean of the corresponding coordinates from all sequences in that category. However, this approach often fails to accurately capture the essential features of time series data. To overcome this limitation, the K-shape algorithm redefines centroid extraction as an optimization problem. For each class, the algorithm seeks to identify a center sequence ($\mu_k^{*}$) that maximizes the similarity, measured by the normalized cross-correlation ($NCC_c$), with all time series in that class. The formula for this optimization is as follows:

$$\mu_k^{*} = \underset{\mu_k}{\arg\max} \sum_{X_i \in P_k} NCC_c(X_i, \mu_k)^2 \tag{8}$$

In this formula, $\mu_k$ denotes the initial cluster center of the $k$-th class, while $P_k$ represents the complete dataset for that class. By integrating this with Equation (6), Equation (8) can be reformulated as follows:

$$\mu_k^{*} = \underset{\mu_k}{\arg\max} \sum_{X_i \in P_k} \left( \frac{\max_{w} CC_w(X_i, \mu_k)}{\sqrt{R_0(X_i, X_i) \cdot R_0(\mu_k, \mu_k)}} \right)^2 \tag{9}$$
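Since all time series $X_i$ have been aligned to the reference sequence (the initial centroid $\mu_k$), the self-correlation term in the denominator can be disregarded. This omission is valid because, during the alignment process, each sequence $X_i$ has been shifted according to its maximum cross-correlation value with the centroid $\mu_k$, thus optimizing $CC_w(X_i, \mu_k)$. At this stage, the relative position between each sequence and the centroid is established, rendering the self-correlation term unnecessary for determining the centroid’s position, and the optimization reduces to

$$\mu_k^{*} = \underset{\mu_k}{\arg\max} \sum_{X_i \in P_k} \left( \sum_{l=1}^{m} x_{il}\, \mu_{kl} \right)^2 \tag{10}$$

Vectorizing Equation (10) yields

$$\mu_k^{*} = \underset{\mu_k}{\arg\max}\; \mu_k^{T} \left( \sum_{X_i \in P_k} X_i X_i^{T} \right) \mu_k \tag{11}$$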
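To facilitate the solution, the optimization problem in Equation (8) is reformulated as a Rayleigh quotient maximization problem. We introduce a matrix $S = \sum_{X_i \in P_k} X_i X_i^{T}$, whose elements are built from the products of all aligned time series. To center the centroid $\mu_k$, we define a centralization matrix $Q$, such that $\mu_k \leftarrow Q \mu_k$ and $Q = I - \frac{1}{m} O$. Here, $I$ denotes the identity matrix, and $O$ is a matrix in which all elements are equal to 1. Furthermore, to ensure that $\mu_k$ has unit norm, Equation (11) is divided by $\mu_k^{T} \mu_k$. Finally, substituting $S$ for $\sum_{X_i \in P_k} X_i X_i^{T}$ results in

$$\mu_k^{*} = \underset{\mu_k}{\arg\max}\; \frac{\mu_k^{T} Q^{T} S\, Q\, \mu_k}{\mu_k^{T} \mu_k} = \underset{\mu_k}{\arg\max}\; \frac{\mu_k^{T} M \mu_k}{\mu_k^{T} \mu_k} \tag{12}$$

Ultimately, the centroid is determined by calculating the eigenvector associated with the maximum eigenvalue of the matrix $M = Q^{T} S Q$.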
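The two alternating steps of this section, member assignment by SBD and centroid extraction by the leading eigenvector of $M$, can be combined in a short NumPy sketch. It is illustrative only and makes several assumptions not stated above: random initial memberships, an all-zero initial centroid, a sign correction of the eigenvector (whose sign is otherwise arbitrary), and z-score normalization of the extracted centroid. All function names are hypothetical.

```python
import numpy as np


def znorm(a):
    """z-score normalize along the last axis (constant series become zeros)."""
    a = np.asarray(a, dtype=float)
    sd = a.std(axis=-1, keepdims=True)
    sd[sd == 0] = 1.0
    return (a - a.mean(axis=-1, keepdims=True)) / sd


def sbd_align(c, x):
    """Return (SBD(c, x), copy of x shifted to best match c)."""
    m = len(x)
    denom = np.sqrt(np.dot(c, c) * np.dot(x, x))
    if denom == 0:
        return 1.0, x.copy()  # nothing to align against (e.g. zero centroid)
    cc = np.correlate(c, x, mode="full")  # index i corresponds to lag i - (m - 1)
    s = int(np.argmax(cc)) - (m - 1)
    aligned = np.zeros(m)
    if s >= 0:
        aligned[s:] = x[: m - s]
    else:
        aligned[: m + s] = x[-s:]
    return float(1.0 - cc.max() / denom), aligned


def extract_shape(aligned):
    """Leading eigenvector of M = Q^T S Q for the aligned members of one cluster."""
    a = np.asarray(aligned, dtype=float)
    m = a.shape[1]
    s_mat = a.T @ a                       # S = sum_i X_i X_i^T
    q = np.eye(m) - np.ones((m, m)) / m   # Q = I - O / m
    _, vecs = np.linalg.eigh(q.T @ s_mat @ q)
    centroid = vecs[:, -1]                # eigh sorts eigenvalues in ascending order
    # Assumption: resolve the arbitrary eigenvector sign towards the members.
    if np.sum((a - centroid) ** 2) > np.sum((a + centroid) ** 2):
        centroid = -centroid
    return znorm(centroid)


def kshape(data, k, max_iter=100, seed=0):
    """Cluster the rows of data into k shape-based groups."""
    x = znorm(data)
    n, m = x.shape
    labels = np.random.default_rng(seed).integers(0, k, size=n)
    centroids = np.zeros((k, m))
    for _ in range(max_iter):
        previous = labels.copy()
        for j in range(k):                # Step 1: cluster center extraction
            members = x[labels == j]
            if len(members) > 0:
                aligned = np.array([sbd_align(centroids[j], r)[1] for r in members])
                centroids[j] = extract_shape(aligned)
        dists = np.array([[sbd_align(c, r)[0] for c in centroids] for r in x])
        labels = dists.argmin(axis=1)     # Step 2: member assignment
        if np.array_equal(labels, previous):
            break                         # memberships have stabilized
    return labels, centroids
```

Applied to the matrix of fully connected sequences (one row per SDHR event), `kshape(data, k)` returns one label per event and one centroid sequence per class; the result depends on the random initialization, so several seeds would normally be compared.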
2.2.3. Analysis Method of Spatiotemporal Characteristics of SDHR
After applying the K-shape clustering method to classify SDHR events, this study uses indicators such as the coefficient of variation of rainfall, the frequency of occurrence of rainfall stations, rainfall growth time, and the spatial centroid of rainfall to perform a comparative analysis of the spatiotemporal distribution characteristics of different types of SDHR in Beijing. The following section provides a detailed introduction to the indicators that characterize the degree of spatiotemporal distribution unevenness, rainfall burstiness, and spatial heterogeneity.
(1) The uneven spatiotemporal distribution of rainfall
This study uses the coefficient of variation of rainfall ($C_v$) to quantify the degree of spatiotemporal distribution unevenness of SDHR in Beijing from 2009 to 2021. The calculation formula for the coefficient of variation of rainfall is as follows:

$$C_v = \frac{\sigma}{\bar{x}} \tag{13}$$

where $\sigma$ is the standard deviation of the values $x_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is their mean. When calculating the temporal coefficient of variation, $x_i$ represents the precipitation amount in the $i$-th time period, $\bar{x}$ is the average precipitation amount over all time periods, and $n$ is the total number of hours of rainfall. When calculating the spatial coefficient of variation, $x_i$ represents the cumulative precipitation at the $i$-th rain gauge station, $\bar{x}$ is the average cumulative precipitation across all stations, and $n$ is the number of rain gauge stations. Based on the annual rainfall data for Beijing from 2009 to 2021, and considering the spatiotemporal distribution of rainfall in northern China, a critical value of $C_v = 1.0$ is selected to assess the uniformity of rainfall distribution, following regional studies of northern China’s convective systems [63]. This threshold statistically distinguishes homogeneous stratiform rainfall ($C_v \le 1.0$) from heterogeneous convective extremes ($C_v > 1.0$) based on their spatiotemporal variability characteristics [63]. When $C_v \le 1.0$, the rainfall is considered uniformly distributed in time or space; when $C_v > 1.0$, the distribution is considered uneven.
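The indicator can be computed as in the following sketch. The use of the population standard deviation (dividing by $n$ rather than $n - 1$) is an assumption, and the function names are hypothetical.

```python
import numpy as np


def coefficient_of_variation(values):
    """C_v = standard deviation / mean of a rainfall series (NaN if the mean is 0)."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    if mean == 0:
        return float("nan")
    return float(x.std() / mean)  # assumption: population std (ddof = 0)


def is_uneven(cv, threshold=1.0):
    """True when C_v exceeds the critical value of 1.0 used in this study."""
    return cv > threshold
```

Passing the hourly precipitation series gives the temporal $C_v$, and passing the cumulative precipitation of each station gives the spatial $C_v$. For instance, four hours with 0, 0, 0, and 40 mm give $C_v = \sqrt{3} \approx 1.73$, which is classified as uneven.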
(2) The burstiness of the rainfall
This study uses the growth time ($GT$) as an indicator to characterize the burstiness of rainfall. Growth time refers to the time required from the start of rainfall to the occurrence of the peak rainfall intensity. The specific expression is as follows:

$$GT = T_{peak} - T_{start} \tag{14}$$

where $T_{peak}$ is the time at which the rainfall peak occurs and $T_{start}$ is the time when rainfall begins. The hourly rainfall at the start of rainfall, $PRE_{start}$, must be greater than 0.1 mm, and the hourly rainfall at the peak time, $PRE_{peak}$, must be no less than 20 mm.
Since this study uses hourly rainfall observation data, the unit of GT is measured in hours. If, within an SDHR event, more than 50% of the stations that experience the event have a growth time of no more than 1 h, the event is considered to exhibit burstiness. Additionally, to characterize the spatial distribution of SDHR burstiness, this study separately calculates the growth time for each of the 110 meteorological stations across all SDHR events. This allows for the identification of regions in Beijing where the burstiness of SDHR is particularly significant.
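The growth time and the event-level burstiness criterion can be sketched as follows. Two choices are assumptions: the start of rainfall is taken as the first hour of the event record exceeding 0.1 mm, and the "stations that experience the event" are taken to be those whose peak hourly rainfall reaches 20 mm. The function names are hypothetical.

```python
import numpy as np


def growth_time(hourly_mm, start_min=0.1, peak_min=20.0):
    """Hours from rainfall start to peak at one station, or None when the
    peak hourly rainfall is below peak_min."""
    x = np.asarray(hourly_mm, dtype=float)
    if x.size == 0:
        return None
    peak = int(np.argmax(x))
    if x[peak] < peak_min:
        return None
    wet = np.flatnonzero(x[: peak + 1] > start_min)
    return int(peak - wet[0])  # wet is non-empty because the peak hour is wet


def is_bursty(event, max_gt=1, min_share=0.5):
    """True when more than min_share of the qualifying stations have GT <= max_gt."""
    gts = [growth_time(row) for row in np.asarray(event, dtype=float)]
    gts = [g for g in gts if g is not None]
    if not gts:
        return False
    return sum(g <= max_gt for g in gts) / len(gts) > min_share
```

Here `event` is a stations-by-hours array of one SDHR event; collecting `growth_time` per station over all events gives the station-level values used to map where burstiness is most pronounced.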
(3) Rainfall spatial heterogeneity characterization
This study uses the rainstorm center location as an indicator to compare the spatial heterogeneity of different types of SDHR events. The spatial centroid of rainfall is used to quantitatively represent the rainstorm center location in a single rainfall event. The centroid is a spatial feature, and the spatial centroid of rainfall refers to the weighted average position of the rainfall amount within a specific time period.
When describing the distribution of rainstorm centers for different SDHR events, this study defines the rainfall spatial centroid as the weighted average position of the rainfall amount at the time of maximum precipitation during an SDHR event. The more dispersed the spatial distribution of rainstorm center locations across different events of a particular type of SDHR, and the wider the distribution range, the stronger the spatial heterogeneity of the rainfall, and the higher the forecasting difficulty. The calculation formula for the rainfall spatial centroid is as follows:
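$$x_c = \frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} p_i}, \qquad y_c = \frac{\sum_{i=1}^{n} p_i y_i}{\sum_{i=1}^{n} p_i} \tag{15}$$

In the formula, $x_c$ and $y_c$ represent the coordinates of the centroid, $x_i$ and $y_i$ are the coordinates of the $i$-th meteorological station, $p_i$ is the precipitation at that station during a specific time period, and $n$ is the number of meteorological stations.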
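A minimal sketch of the centroid calculation is given below. Identifying the time of maximum precipitation with the hour of largest station-summed rainfall is an assumption, and the function names are hypothetical.

```python
import numpy as np


def rainfall_centroid(xs, ys, p):
    """Precipitation-weighted mean position (x_c, y_c); None if no rain fell."""
    xs, ys, p = (np.asarray(v, dtype=float) for v in (xs, ys, p))
    total = p.sum()
    if total <= 0:
        return None
    return float(np.dot(p, xs) / total), float(np.dot(p, ys) / total)


def centroid_at_peak_hour(xs, ys, event):
    """Centroid at the hour with the largest station-summed rainfall.

    event is a (stations x hours) array; the peak-hour definition is an assumption.
    """
    q = np.asarray(event, dtype=float)
    hour = int(np.argmax(q.sum(axis=0)))
    return rainfall_centroid(xs, ys, q[:, hour])
```

For two stations at $x = 0$ and $x = 10$ (both at $y = 0$) receiving 1 mm and 3 mm, the centroid lies at $x_c = 7.5$, closer to the wetter station.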