1. Introduction
The security of water resources in a changing environment has become a research focus, because climate change alters rainfall at a large scale while human activities influence its spatiotemporal characteristics at the regional scale [1,2]. Big cities are of particular concern, as they are hubs of human activity. Through the urban heat island effect and air pollution, human activities contribute to the frequent occurrence of extreme rainstorms [3,4]. Statistics show that 60% of the cities in China suffered from waterlogging from 2014 to 2016 [2]. Studies have shown that urban waterlogging is directly related to the temporal and spatial distribution of rainstorms [5]. For example, the flood peak produced by triangular rainfall with a central or rear rain peak is 30% larger than that produced by uniform rainfall, even when the average rainfall is the same [6]. Studying the spatial and temporal modes of rainfall is therefore of great significance for preventing waterlogging [7].
At present, there are two main approaches to defining the urban rainfall process. The first, based on site monitoring data, applies different rainfall types, such as single peak and double peak, to describe various rainfall processes. Studies using this approach focus on monitoring data from a single station or the average of multiple stations. Pilgrim and Cordery [8] placed the rain peak at its most likely position and set the proportion of the rain peak in the total rainfall to the average of the rain peak proportions over all rainfall events. Keifer and Chu [9] designed the Chicago mode according to the intensity, duration, and frequency of rainfall. Huff [10] designed four modes of rainfall according to the location of the rain peak and the duration of rainfall in Illinois, USA. These methods only consider the total rainfall and extreme rainfall of a single station [11]. However, the data of a single station cannot reflect the spatial characteristics of rainfall, especially in metropolitan areas with obvious spatiotemporal variations in the background environment, such as temperature and wind direction [12]. This approach is widely used in urban planning and construction, but it is increasingly criticized for neglecting the spatial variation of rainfall, especially in large cities. The second approach defines rainfall from a spatial perspective based on geographic theory. The general method is to use a spatial interpolation algorithm to convert station monitoring data into spatial distribution data, for example with the software ANUSPLIN, developed by Hutchinson at the Australian National University [13]. In recent years, with the advancement of satellite and radar technology, spatial rainfall data can be obtained more directly and with higher accuracy, for example from the Global Precipitation Climatology Project (GPCP) and the Tropical Rainfall Measuring Mission (TRMM). These spatial data are well suited to analyzing the distribution of total or cumulative rainfall in a certain period, but they make it difficult to express the correlation between rainfall distributions in different periods. In a real rainfall process, however, the rain belt usually moves rapidly. Neither the spatial distribution of total rainfall nor the time history of rainfall intensity at a single point is enough to accurately describe the dynamic temporal and spatial distribution of a rainfall event, which is very important for the emergency risk management of rainstorms.
To reduce urban storm disasters effectively, it is necessary to express a complete “rainfall process”, which includes not only the rainfall graphs of stations at different locations, but also the spatial relationship of rainfall among these stations in different periods. Rainfall data are therefore multidimensional, with both temporal and spatial characteristics. Some scholars have tried to integrate the temporal and spatial characteristics of rainfall using multidimensional data. For example, based on 3-dimensional mosaic reflectivity data from 10 S-band Doppler radars in Guangdong province, an artificial intelligence (AI) algorithm for automatic hail detection and nowcasting was developed using machine learning (ML) technology [14]. The temporal and spatial distribution characteristics of short-duration rainfall in Shenzhen were analyzed and extracted with the LLE algorithm [15]. Further study of the rainfall process in both temporal and spatial dimensions is needed.
In contrast to studies that take the rainfall data of a single station as the research object, this paper uses ML algorithms to extract the temporal and spatial distribution characteristics of rainfall from the data of all rainfall stations in the research area. The continuous monitoring data of rainfall stations in Beijing from 2004 to 2016 were divided into individual rainfall processes. With these rainfall processes as the research objects, dimensionality reduction, clustering, and reconstruction algorithms were applied to identify the typical spatiotemporal processes of rainfall and then to simulate the rainfall process of each mode.
  2. Study Area and Data Process
According to previous studies, Beijing has become one of the most urbanized cities in China in the past 30 years [16,17]. In 2013, the urban population accounted for 86.3% of the total population in Beijing, far higher than the average of 53.73% for China, causing a rapid expansion of built-up areas [17]. In recent years, severe waterlogging has frequently occurred in Beijing [18]. For example, floods and waterlogging caused serious casualties on 21 July 2012 [19].
The Beijing urban area covers 396 km². Fourteen meteorological monitoring stations have continuously recorded rainfall in the Beijing urban area in recent years. The continuous monitoring data of these 14 rainfall stations, covering 2004 to 2016 at intervals of 5 min, were selected from the database. The 14 rainfall stations are evenly distributed in the urban area, as shown in Figure 1.
Statistics show that 89 rainstorms, defined as rainfall exceeding 30 mm in 1 h, occurred between 2004 and 2016 [20]. To make rainfall processes comparable, rainfalls of different durations had to be standardized. Following previous studies, the 1-h moving average method was used to process each rainfall event, as shown in Figure 2. The data in the red box were selected as the standardized rainfall.
Figure 2 shows the data of a rainstorm process at 14 stations. In Figure 2, “cumulative rainfall” represents the sum of rainfall at the 14 stations over a 1-h period. For example, “527” represents the sum of rainfall at all 14 stations during the first hour of the rainfall. The red dashed outline marks the hour in which the “cumulative rainfall” was larger than in any other hour.
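The selection of the standardized hour described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes each event is stored as one list of 5-min rainfall depths per station and that the 1-h window (12 periods) slides one period at a time. The function names `max_hour_window` and `standardize_event` are hypothetical.

```python
def max_hour_window(rain, window=12):
    """Return (start_index, total) of the wettest run of `window` periods.

    rain: list of per-station lists of 5-min rainfall depths (mm),
          all of the same length (assumed layout).
    """
    n_periods = len(rain[0])
    if n_periods < window:
        raise ValueError("event is shorter than the window")
    # Sum over all stations for each 5-min period.
    per_period = [sum(station[t] for station in rain) for t in range(n_periods)]
    best_start = 0
    best_total = current = sum(per_period[:window])
    # Slide the window one period at a time, updating the running sum.
    for start in range(1, n_periods - window + 1):
        current += per_period[start + window - 1] - per_period[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return best_start, best_total


def standardize_event(rain, window=12):
    """Cut every station series down to the wettest window."""
    start, _ = max_hour_window(rain, window)
    return [station[start:start + window] for station in rain]
```

With 14 stations and the default window of 12 periods, `standardize_event` returns the 14 × 12 block that corresponds to the red box in Figure 2.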
According to the early warning standard issued by the Beijing Flood Control Office, a yellow warning signal is issued to the public when rainfall exceeds 50 mm in one hour, indicating that serious urban disasters may occur. From 2004 to 2016, 22 rainfall events had a maximum 1-h sliding rainfall of more than 50 mm, accounting for 24.7% of the 89 rainstorms. In this paper, these 22 rainstorms, referred to as “short-duration storms”, were selected as samples. Their rainfall data are shown in Table 1.
  3. Methodology
  3.1. Flowchart for Extraction of Rainfall Temporal and Spatial Modes
Most researchers describe a rainstorm by its duration, intensity, total amount, and frequency. In this paper, a new method is introduced to describe a rainstorm. The spatiotemporal process of a rainstorm was constructed as a high-dimensional array, and then principal component analysis (PCA), dynamic clustering (k-means), and reconstruction were applied to analyze the array [21]. The flowchart is shown in Figure 3.
In Figure 3, “m” is the number of samples, that is, the number of rainfall events; “n” is the dimension of each sample, that is, the dimension of the rainfall matrix; and “k” is the dimension of the rainfall samples after dimensionality reduction.
- (1) The rainstorm events were digitized and structured. High-dimensional arrays were established from the temporal and spatial dimensions.
- (2) Principal component analysis was used to map the high-dimensional arrays to low-dimensional arrays.
- (3) Dynamic clustering was used to group the samples into typical modes describing the temporal and spatial distribution of rainstorms.
- (4) With the inverse calculation of principal component analysis, the low-dimensional arrays were restored to high-dimensional arrays to express the spatiotemporal process of each rainstorm mode.
  3.2. Construction of High-Dimensional Array for Rainstorms
A rainstorm was characterized as a spatiotemporal process. The continuous monitoring data were divided into independent rainfall periods separated by intervals of no rainfall longer than 120 min [20]. A rainfall period whose maximum 1-h rainfall was greater than 30 mm was called a rainstorm.
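The event separation rule can be illustrated with a short sketch. It assumes a single series of 5-min rainfall totals (for example, summed over the stations) and treats a dry spell of more than 24 consecutive periods (120 min) as the boundary between events; the function name `split_events` and the parameter `min_dry_periods` are hypothetical.

```python
def split_events(total_rain, min_dry_periods=24):
    """Split a 5-min rainfall series into independent events.

    Returns a list of (start, end) index pairs, end exclusive. Two wet
    periods belong to different events when more than `min_dry_periods`
    dry periods lie between them (24 x 5 min = 120 min by default).
    """
    events = []
    start = None     # first wet period of the current event
    last_wet = None  # most recent wet period seen so far
    for t, depth in enumerate(total_rain):
        if depth <= 0:
            continue
        if start is None:
            start = t
        elif t - last_wet - 1 > min_dry_periods:
            # The dry gap is long enough: close the current event.
            events.append((start, last_wet + 1))
            start = t
        last_wet = t
    if start is not None:
        events.append((start, last_wet + 1))
    return events
```

Each returned index pair can then be screened against the 30 mm and 50 mm hourly thresholds used in this paper.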
We treat the monitoring data from multiple stations in a rainstorm as a multi-dimensional array (shown in Figure 2). Using the processing method introduced in Section 2, each array was converted to a matrix with the same number of rows and columns, 14 × 12 in this paper, where fourteen is the number of stations and twelve is the number of 5-min periods. Each matrix forms one sample in the database of rainstorms ($D$), as shown in Equation (1).

$$ D = \{X_1, X_2, \ldots, X_m\}, \qquad X_i = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{S,1} & x_{S,2} & \cdots & x_{S,N} \end{bmatrix} \tag{1} $$

where $X_i$ represents rainstorm i, m is the number of rainstorms, $x_{s,t_n}$ is the rainfall at station s in period $t_n$, $s = 1, 2, 3, \ldots, S$, $t_n = 1, 2, 3, \ldots, N$, S is the number of stations, and N is the number of periods. The main objective of this paper is to study the “mode” of rainfall, which may also be called its “structure”, so the monitored rainfall data are standardized by Equation (2).

$$ p_{j,t_n} = \frac{x_{j,t_n}}{\sum_{s=1}^{S} x_{s,t_n}} \tag{2} $$

where $p_{j,t_n}$ represents the ratio of the rainfall at station j to the rainfall at all stations in period $t_n$, and S is the number of stations.
Figure 4 is a sample of visual representation of the standardized rainstorm matrix.
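The standardization in Equation (2) can be sketched in a few lines. The sketch assumes a stations × periods list of lists and, as an assumption not stated in the paper, sets every ratio to zero in a period with no rainfall at any station; the function name `normalize_by_period` is hypothetical.

```python
def normalize_by_period(rain):
    """Convert rainfall depths to each station's share of the period total.

    rain: list of per-station lists (stations x periods).
    Periods with zero total rainfall yield 0.0 for every station
    (an assumption made here to avoid division by zero).
    """
    n_periods = len(rain[0])
    # Total rainfall over all stations in each period.
    totals = [sum(station[t] for station in rain) for t in range(n_periods)]
    return [
        [station[t] / totals[t] if totals[t] > 0 else 0.0 for t in range(n_periods)]
        for station in rain
    ]
```

After this step, the values in each wet period sum to one across stations, so the matrix describes where the rain fell rather than how much fell.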
   3.3. Dimensionality Reduction of High Dimensional Array
As shown in Figure 4, each standardized rainstorm sample is an S × N dimensional array. To cluster and analyze the arrays in the rainstorm database, it was necessary to map these high-dimensional arrays to a low-dimensional space [22]. Principal component analysis (PCA) was used to reduce the dimensions of the high-dimensional arrays [23].
Let $U$ denote the arrays in the rainstorm database, arranged as an n × m matrix, where n is the number of rainstorms and m is the number of characteristics, which equals S × N in this paper and covers the monitoring data of each station in all periods. $Y$ is the matrix converted from $U$ by the transformation matrix $W$, which means that the original data are converted from m dimensions to k dimensions (k < m). The steps are as follows:
- (1) The centered matrix $\bar{U}$ was obtained by centralization of the matrix $U$, that is, by subtracting the mean of each column.
- (2) The covariance matrix $M$ of $\bar{U}$ was calculated, with $M \propto \bar{U}^{\mathsf{T}}\bar{U}$.
- (3) The eigenvalues and eigenvectors of the covariance matrix $M$ were calculated.
- (4) The dimension k that retains more than 90% of the information of the original data was obtained. The k eigenvectors with the largest eigenvalues form the columns of the transformation matrix $W$.
- (5) The dimensions were reduced by Equation (3).

$$ Y_{n \times k} = \bar{U}_{n \times m} W_{m \times k} \tag{3} $$

$Y$ is the low-dimensional matrix after transformation, where n is the number of samples and k is the number of dimensions of the new matrix.
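The steps above can be sketched with NumPy. This is an illustrative version under stated assumptions: each 14 × 12 rainstorm matrix is flattened into one row of 168 values, and a singular value decomposition of the centered data replaces the explicit eigendecomposition of the covariance matrix (the two give the same principal directions). The function name `pca_reduce` and the `retain` parameter are hypothetical.

```python
import numpy as np


def pca_reduce(U, retain=0.90):
    """Project samples onto the fewest components keeping `retain` variance.

    U: array-like of shape (n_samples, n_features).
    Returns (Y, W, mean): reduced data, transformation matrix, column means.
    """
    U = np.asarray(U, dtype=float)
    mean = U.mean(axis=0)
    centered = U - mean
    # Rows of vt are the principal directions; s**2 is proportional to
    # the eigenvalues of the covariance matrix.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # Smallest k whose cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, retain)) + 1
    W = vt[:k].T            # shape (n_features, k)
    Y = centered @ W        # shape (n_samples, k)
    return Y, W, mean
```

For the 22 samples used here, at most 21 components can carry variance, so the SVD route also avoids forming a 168 × 168 covariance matrix. The sketch does not guard against input whose samples are all identical, in which case the total variance is zero.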
  3.4. Clustering and Feature Selection
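After the dimensionality reduction of the high-dimensional samples, the k-means clustering algorithm is used to classify the low-dimensional samples, that is, the rows of the reduced matrix $Y$ from Section 3.3 [24].
- (1) r initial cluster centers are set up: $z_1(p), z_2(p), \ldots, z_r(p)$, where p is the number of iterations.
- (2) The distance from each sample $x$ to each cluster center is calculated. If $\lVert x - z_j(p) \rVert < \lVert x - z_i(p) \rVert$ for all $i = 1, 2, \ldots, r$ with $i \neq j$, then $x \in C_j(p)$, where $C_j(p)$ represents cluster j with the center $z_j(p)$.
- (3) The new center of each cluster is calculated. The new center of $C_j(p)$ is calculated by Equation (4).

$$ z_j(p+1) = \frac{1}{N_j} \sum_{x \in C_j(p)} x \tag{4} $$

where $N_j$ is the number of samples contained in the cluster $C_j(p)$, and $x$ is a sample in $C_j(p)$. Using $z_j(p+1)$ as the new cluster center minimizes the clustering criterion function (Equation (5)).

$$ J_j = \sum_{x \in C_j(p)} \lVert x - z_j(p+1) \rVert^{2} \tag{5} $$

where $j = 1, 2, \ldots, r$.
- (4) If $z_j(p+1) \neq z_j(p)$ for any $j = 1, 2, \ldots, r$, then go to step (2); if $z_j(p+1) = z_j(p)$ for all $j = 1, 2, \ldots, r$, the calculation is over.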
In this paper, different r values were tested, and different initial cluster centers were selected for the k-means clustering. Finally, the rainstorms were divided into three modes. The mean of the rainstorms in each cluster was taken as the typical mode of that cluster.
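The clustering procedure can be sketched as below. This is a plain NumPy illustration of the standard k-means iteration, not the code used in the study. As assumptions made for the sketch, the initial centers are drawn at random from the samples with a fixed seed, and a cluster that loses all its members keeps its previous center. The function name `kmeans` is hypothetical.

```python
import numpy as np


def kmeans(Y, r, n_iter=100, seed=0):
    """Cluster the rows of Y into r groups; returns (labels, centers)."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick r distinct samples as initial centers (assumption).
    centers = Y[rng.choice(len(Y), size=r, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest center.
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: move each center to the mean of its members.
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(r)
        ])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Because the result depends on the starting centers, repeating the run with several seeds and several values of r, as described above, is a reasonable way to check that the three-mode partition is stable.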
  3.5. Reconstruction of Rainstorms
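With the inverse calculation of principal component analysis, the low-dimensional arrays were restored to high-dimensional arrays to express the spatiotemporal process of each rainstorm mode. The i clustering centers are reconstructed into an i × m matrix (Equation (6)).

$$ \hat{U}_{i \times m} = Z_{i \times k} W^{\mathsf{T}}_{k \times m} + \mathbf{1}_{i}\, \mu^{\mathsf{T}} \tag{6} $$

where $Z$ holds the i clustering centers as rows, $W$ is the PCA transformation matrix, $\mu$ is the vector of column means removed when the data were centered, i is the number of rainstorm modes, and m is the dimension of the original data.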
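The reconstruction step can be sketched as follows. It assumes that the PCA was applied to centered data, so the column means are added back, and that each sample was flattened row by row (station by station) before the reduction. Both are assumptions of this sketch rather than details confirmed by the paper, and the function name `reconstruct_modes` is hypothetical.

```python
import numpy as np


def reconstruct_modes(centers, W, mean, n_stations, n_periods):
    """Map cluster centers back to stations x periods matrices.

    centers: (i, k) cluster centers in the reduced space.
    W:       (m, k) PCA transformation matrix, m = n_stations * n_periods.
    mean:    (m,) column means removed before the projection.
    """
    centers = np.asarray(centers, dtype=float)
    # Inverse PCA: project back to m dimensions and undo the centering.
    flat = centers @ np.asarray(W).T + np.asarray(mean)
    # Undo the row-major flattening of each rainstorm matrix.
    return flat.reshape(len(centers), n_stations, n_periods)
```

Since only k components are kept, the reconstructed modes are approximations, and small negative ratios may appear and need clipping before plotting.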
  4. Results
The method of extracting rainfall temporal and spatial modes was applied to the monitoring data of 22 rainstorms in the urban area of Beijing. The spatiotemporal distribution can be divided into three modes, which differ clearly in rainfall center, spatial distribution, and time of occurrence. The movement of the rainfall centers of the different modes over the periods is shown in Figure 5. The centroid coordinates for each period were obtained as the geographical center of the stations weighted by rainfall.
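The rainfall-weighted center for one period can be computed as in the sketch below. It assumes planar (projected) station coordinates, which is a reasonable simplification for a city-scale area but is not stated in the paper; the function name `rainfall_centroid` is hypothetical.

```python
def rainfall_centroid(coords, rain):
    """Return the rainfall-weighted mean position for one period.

    coords: list of (x, y) station coordinates in a planar system.
    rain:   list of rainfall values (or ratios) at the same stations.
    Returns None when no rain fell in the period.
    """
    total = sum(rain)
    if total <= 0:
        return None
    # Weight each coordinate by the station's rainfall.
    x = sum(w * cx for w, (cx, _) in zip(rain, coords)) / total
    y = sum(w * cy for w, (_, cy) in zip(rain, coords)) / total
    return x, y
```

Applying the function to each of the 12 periods of a reconstructed mode gives the kind of track of rainfall centers that Figure 5 displays.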
As shown in Figure 5, there are obvious differences among the three modes. The geographical centers of mode 1, mode 2, and mode 3 are located in the northwest, southwest, and southeast, respectively. Mode 1 moves from the northwest to the urban center, mode 2 mainly spreads from the southwest and south to the north and the urban center, and mode 3 is basically concentrated in the urban center.
The matrices of the three modes are shown in Figure 6, which shows the characteristics of rainfall distribution. For example, rainstorms belonging to mode 2 are more concentrated than those belonging to mode 1 and mode 3, with rainfall concentrated at stations 30748000, 30523900, 30504030, and 30523650. The spatial distribution of rainfall in different periods of each mode is shown in Figure 7, Figure 8 and Figure 9, in which the depth of color at each station location represents that station's percentage of the rainfall at all stations in the period.
As shown in Figure 6, the proportion of rainfall differs significantly among stations. In mode 1, rainfall is mainly concentrated in six stations, including stations 3, 6, and 4. The spatial non-uniformity of rainfall is most obvious in mode 2, where only four stations (stations 5, 9, 12, and 4) account for a large proportion of the rainfall. In mode 3, almost all stations have obvious rainfall, except station 1 and station 2.
In mode 1, as shown in Figure 5, Figure 6 and Figure 7, the rainstorm spreads from the northwest mountainous area to the central and eastern parts of the city. Rainfall began in the northwest mountain area while the rest of the city had no rainfall; it then gradually dispersed until rainfall occurred at all stations. There are seven rainstorms of this mode, accounting for 31.8% of all samples. Among them, the single-peak and homogeneous types each account for 43%, and the double-peak type accounts for 14%.
In mode 2, as shown in Figure 5, Figure 6 and Figure 8, the main rainfall was concentrated in the southern and southwestern regions of the city and gradually spread to the northern and central urban areas, with no rainfall in the northwest mountain areas. There are three rainstorms of this mode, accounting for 13.6% of the total sample. Two of them are unimodal and one is homogeneous. In addition, rainfall of this mode occurred between 14:00 and 16:00.
In mode 3, as shown in Figure 5, Figure 6, and Figure 9, the main rainfall was concentrated in the central, eastern, and southern parts of the city and basically did not move. There was no rain in the northwest mountain areas. There were 12 rainstorms of this mode, accounting for 54.5% of the total sample; among them, 62% were of the single-peak type and 38% were of the homogeneous type. This mode was the main rainstorm type in summer in Beijing and mainly occurred from afternoon to evening.
  5. Discussion
How to identify and extract valuable information from multidimensional and massive rainfall monitoring data is a problem faced by many researchers. In this paper, a new approach for rainfall mode recognition is introduced, and different rainfall modes are identified from the massive monitoring data through the algorithms of dimensionality reduction, clustering, and reconstruction of a high dimensional array. This approach can be applied to both multidimensional data analysis and spatiotemporal data mining.
The results show that there are three modes of rainstorms in the Beijing urban area. Rainstorms of mode 1 moved from the northwest to the center of Beijing and then spread to the eastern part of the urban area; rainstorms of mode 2 occurred in the southwestern region of the urban area and gradually moved northward, but there was no rainfall in the mountainous northwest; rainstorms of mode 3 were concentrated in the central and eastern regions and basically did not move. The results are consistent with the actual rainstorm process. This approach provides a framework for analysis, and uncertainty remains in some respects. For example, the reconstructed rainfall has a certain randomness and uncertainty in its spatiotemporal distribution, which deserves attention in future work. Because of the limited availability of data, this paper only used the rainfall data of 13 years (2004 to 2016) in Beijing. It should be noted that urban rainfall is, in most cases, part of a rainfall event covering a larger area, so the temporal and spatial distribution characteristics extracted from the rainfall data of 14 stations may have a certain randomness and uncertainty. More extensive rainfall data should be collected, taking terrain, climate, and other factors into account, as data availability allows. In addition, it is necessary to distinguish the historical evolution of rainfall modes in different periods caused by climate change in the city.
There are several suggestions for practice. First, the spatial and temporal resolution of the rainfall data needs to be fine enough to reflect the temporal and spatial differences within a rainfall event. Second, the study area should include the urban area and as much of the surrounding area as possible, mainly to preserve the integrity of the rainfall process.
  6. Conclusions
In this paper, a high dimensional array is introduced for the study of the spatiotemporal distribution of rainfall, which describes rainfall by storing continuous rainfall monitoring data of all rainfall stations.
Through the establishment of high dimensional arrays of each rainstorm and algorithms, such as dimensionality reduction, clustering, feature extraction, and reconstruction, the spatiotemporal distribution of rainstorms in the flood season of Beijing city was analyzed. It was found that there were three spatiotemporal modes of rainstorms in the urban area of Beijing from 2004 to 2016. Rainstorms of mode 1 moved from the northwest mountain area to the central district, and further spread to the eastern part of the area. Rainstorms of mode 2 concentrated in the southwest of the urban area, gradually spreading to the northern and urban central areas; the northwest mountainous area basically had no rain. Rainstorms of mode 3 concentrated in the central area of the urban area and the eastern and southern regions, and basically did not move. The variation of the centroids of different modes shows a significant difference between the modes. The approach and conclusions in this paper can be applied to the study of rainfall modes in other cities or regions at a different scale, so as to provide assistance for rainfall forecasting and flood prevention.
A limitation of current machine learning algorithms is that they depend heavily on the number and quality of learning samples. If the rainfall stations are dense and the rainfall data are of high quality, this method can achieve good results. In an area with insufficient data and sparse rainfall stations, the results may not be satisfactory. With the increasing density of rainfall stations, better rainfall data quality, and improved machine learning algorithms, the method is expected to give more reasonable and objective results.