Exploring the Spatiotemporal Characteristics and Causes of Rear-End Collisions on Urban Roadways

: Rear-end collisions are caused by drivers misjudging urgent risks while following vehicles ahead in most cases. However, compared with other accident types, rear-end collisions have higher preventability. This study aims to reveal the prone segments and hours of rear-end collisions. First, we extracted 1236 cases from trafﬁc accident records in Harbin from 2015 to 2019. These accidents are classiﬁed as property damage accidents, injury accidents and fatal accidents according to the collision severity. Second, density analysis in GIS was used to demonstrate the spatial distribution of rear-end collisions. The collision spots considering the density and severity were visually displayed. We counted the hourly and seasonal distribution characteristics according to the statistical data. Finally, LightGBM and random forest classiﬁer models were used to evaluate the substantial factors affecting accident severity. The results have potential practical value in rear-end collision warning and prevention.


Introduction
With the rapid development of the economy and society, the urbanizing process is accelerating each year. The urban road network and the relevant infrastructure have been gradually completed. The increasing proportion of motorized travel poses challenges for traffic safety. The frequent collision types are also becoming different from those in the past. In many countries, rear-end collisions are considered to be the most frequent type of total traffic accidents [1]. Statistics show that rear-end collisions account for approximately 30% of all crashes and 20% of all fatal traffic accidents. In China, this proportion is even more severe. For a higher relative crash speed, rear-end collisions tend to result in more severe property damage and personal injury. It has been proven that developing and applying active safe equipment in vehicles is an effective method to prevent traffic accidents. However, identifying hazard segments and improving traffic safety facilities are also important issues.
Different traffic environments, driving performances and involved vehicle types result in different degrees of severity [2]. A number of studies have focused on deep analysis of the rear-end crash course, aiming to find the prominent factors influencing the collision occurrences. According to the results, a series of prevention mechanisms can be proposed to improve the safety level. In addition, some studies propose spatial autocorrelation methods to analyze the potential spatiotemporal patterns of collisions. Considering the multiple factors in the model, it can also assess the consequences of the accidents [3]. Combining information on traffic flow, traffic rules, road geometry and human factors, Bayesian spatiotemporal models [4] and probabilistic models [5] provide a good description of collision occurrence trends. The generalized linear mixed model [6], gravity model [7] and regression model [8] yield good results in the analysis of factors influencing rear-end collision severity. The factors and characteristics of rear-end collisions derived from the The mean center (MC) indicates the average location of the observed sample. It reflects the concentrating trend and the overall offset of the sample data in space. We can use the rear-end collision data to calculate MC. The concentration and offset trend can be obtained by calculating MC as follows: X and Y represent the average coordinates of data in the X and Y directions. N is the total number. x i and y i represent the X and Y coordinates of the ith data point.

Standard Deviational Ellipse
The standard deviational ellipse (SDE) can be used to characterize the spatial distribution of data, including central, dispersion and directional distribution tendencies. The length of the short axis indicates the degree of spatial aggregation. The shorter the axis shows, the more aggregated the data are. The long axis represents the spatial expansion direction as follows: a i = x i − X, b i = y i − Y and X, Y are the mean center coordinates of the data, which can be calculated from Equations (1) and (2). θ is the angle of clockwise rotation of the standard deviation ellipse along the Y-axis. S x and S y represent the long and short semiaxes of the standard deviation ellipse, respectively. In this study, the standard deviation ellipse is obtained from the two-dimensional coordinates of all data points. It can evaluate the aggregation degree and expansion direction of data.

Density Analysis
The urban map can be divided into some small cells with a side length of d, which eventually corresponds to pixel cells on the GIS. k represents the center of the circle. r represents the radius. We can use N k (r) to calculate the number of events around the neighborhood. N k (r) is divided by the neighborhood area to obtain the accident density named D k accident . Similarly, we can obtain the density of the road network, D k road as follows: If q i represents the severity of the ith accident, the density of accident severity named by D k severity in cell k can be obtained as follows:

Clustering Analysis
The outlier analysis can be calculated by the local Moran's I of the data points.
I i is the local Moran's I of point i. m i and m j are the attributes of points i and j. M is the global mean of the attribute. ω i,j is the spatial weight between point i and point j. ω i,j is usually the inverse of the distance between the two points. S 2 i is the second-order sample matrix of all attributes except point i.
Normally, the confidence level of statistical significance is 95%. According to the normal distribution, the range of z should be between −1.96 and +1.96. In terms of statistical significance, if I > 0, this point has the same attribute level as the neighboring points. It is reflected as high-high clustering or low-low clustering. The result depends on the difference between the attribute value of this point and the average value of all points. If I < 0, there is a large difference in attributes between this point and the neighboring points. This point is an outlier.
There were five types of results: high-high clustering (H-H), high-low clustering (H-L), low-high clustering (L-H), low-low clustering (L-L) and nonsignificant. High-high clustering means that this point and its neighbors are all high values. They also have similar attributes. High-low clustering means that this point is a high value and is surrounded by neighboring points of low values. Low-high clustering means that this point is a low value and is surrounded by neighboring points of high values. Low-low clustering means that this point is low value and the neighboring points are also low value. Nonsignificant indicates that the point has no significant relationship with the neighboring points and the attributes are more different. In general, the local Moran's I describes the clustering characteristics between high and low values points in geo-spatial.

LightGBM
LightGBM is an efficient integrated machine learning algorithm developed by the Microsoft team based on the Boosting method. The purpose of the LightGBM algorithm is to find an approximation of a function. This function can minimize the specified loss function. The loss function will determine how well the model fits the data. The LightGBM model integrates multiple regression trees to fit the final model. First the algorithm needs to determine how the objective function is calculated. After that, the optimization problem is to make each tree have the smallest objective function. For this, it is necessary to calculate the gain from the splitting of the leaf nodes in the tree. When the maximum gain of node splitting is got, the feature with the highest gain will be selected as the splitting feature. Continuously this iterating process until a specific condition is met. LightGBM uses histogram algorithms, Leaf-wise growth strategies, and histogram differential acceleration and other methods. It can significantly reduce the complexity of the algorithm and training time consumption. This makes LightGBM has excellent training efficiency and high prediction accuracy.

Random Forest
Random forest (RF) is a supervised data mining algorithm. It is a combination of bagging algorithm and decision tree, which is an integrated algorithm. Its main workflow is as follows: (1) RF uses the bagging algorithm to sample the training set with put back Bootstrap sampling. It forms sub-training set. In other words, each sub-training set is drawn from the original training set by put-back bootstrap sampling. (2) Using the decision tree method, a binary tree corresponding to each sub-training set is formed. (3) The algorithm repeats steps 1 and 2. When the generated tree can accurately classify the samples in the training set, the algorithm will come up with a tree model. Or, until all the attribute features are used up, the algorithm will generate a tree model. After that, the tree model of all processes is combined. This forms a random forest model. (4) For an arbitrary test sample, the final classification result is decided by simple voting.
Random forests have two important randomness. One of them is the sample randomness of Bootstrap sampling, and the other is the random selection of features. Both of two randomness make it possible to greatly reduce the correlation of each tree in the random forest. Thus, it ensures that the random forest has good classification ability.

Data Source
The data come from the Harbin Traffic Safety Management Database. The selected data are from 2015 to 2017 with a total of 8037 cases. The data attributes can be divided into five categories. The fields contained in each category are shown in Table 1. The rear-end collision data were filtered from the database for a total of 1372 pieces. Figure 1 shows the road network of Harbin. External environment Weather, temperature, terrain, road physical separation, roadside protection facilities, road conditions, etc.
Other information Road safety attributes, jurisdiction units, road safety supervision level, etc.

Data Mining
In the data processing, 136 pieces of data are missing some field in the "reason" categories; we deleted this data. A total of 264 pieces of data are missing or incorrect in the "weather" field; we corrected the details of these data by querying the weather records.
After data preprocessing, 1236 pieces of data completed the accident information. We selected the central city of Harbin as the study area. Within this area, a final 1205 samples were screened.
(1) According to the traffic accident severity classification, accidents are classified into three categories: property damage accidents, injury accidents, and fatal accidents. Accident severity can be a label for data, as shown in Table 2.

Data Mining
In the data processing, 136 pieces of data are missing some field in the "reason" categories; we deleted this data. A total of 264 pieces of data are missing or incorrect in the "weather" field; we corrected the details of these data by querying the weather records.
After data preprocessing, 1236 pieces of data completed the accident information. We selected the central city of Harbin as the study area. Within this area, a final 1205 samples were screened.
(1) According to the traffic accident severity classification, accidents are classified into three categories: property damage accidents, injury accidents, and fatal accidents. Accident severity can be a label for data, as shown in Table 2. The data were graded according to the level of severity, as shown in Table 3. (2) In this paper, feature variables were extracted by mining collision data. Nine feature variables were extracted: weather, wind speed, temperature, week, season, time, location, vehicle type, and accident type. The feature variables were coded, as shown in Table 4. (3) This step creates the coordinates' visualization for the collision location by GIS. After coordinate transformation, 1127 pieces of data were finally retained. For the follow study, we chose these 1127 samples with accurate coordinate conversion. Figure 2 shows the spatial distribution of collisions after data processing. The blue points represent collision locations.

Feature Extraction
We used the Pearson correlation coefficient to extract the features ( Figure 3). The Pearson correlation coefficient r utilizes values in the range of (−1, 1). When r ≥ 0.8, it can be regarded as highly correlated. When 0.5 ≤ r < 0.8, it can be regarded as moderately

Feature Extraction
We used the Pearson correlation coefficient to extract the features ( Figure 3). The Pearson correlation coefficient r utilizes values in the range of (−1, 1). When r ≥ 0.8, it can be regarded as highly correlated. When 0.5 ≤ r < 0.8, it can be regarded as moderately correlated. When r < 0.3, it means that the correlation between them is extremely weak. The matrix shows that except for temperature and season, which are moderately correlated, the correlation between each feature is weak. The nine feature variables can be used for analysis.

Feature Extraction
We used the Pearson correlation coefficient to extract the features ( Figure 3). The Pearson correlation coefficient r utilizes values in the range of (−1, 1). When r ≥ 0.8, it can be regarded as highly correlated. When 0.5 ≤ r < 0.8, it can be regarded as moderately correlated. When r < 0.3, it means that the correlation between them is extremely weak. The matrix shows that except for temperature and season, which are moderately correlated, the correlation between each feature is weak. The nine feature variables can be used for analysis.

Spatial Distribution
Applying Equations (1)-(5), the mean center location and ellipse distribution location of collisions are obtained. The spatial distribution trend is shown in Figure 4.

Spatial Distribution
Applying Equations (1)-(5), the mean center location and ellipse distribution location of collisions are obtained. The spatial distribution trend is shown in Figure 4. As shown in Figure 4, most collisions are concentrated in the interface of the two administrative districts of Daoli and Daowai. The mean center is located south of the area.
The length of short axis of the standard deviation ellipse is quite short. It means that the spatial distribution is more concentrated. The direction of the long axis shows that the As shown in Figure 4, most collisions are concentrated in the interface of the two administrative districts of Daoli and Daowai. The mean center is located south of the area.
The length of short axis of the standard deviation ellipse is quite short. It means that the spatial distribution is more concentrated. The direction of the long axis shows that the distribution has a trend of spreading from southwest to northeast.

Density Distribution
Equation (6) can obtain the density of collision points. The maximum normalization was used to process density values. Based on the density distribution map, 0.2, 0.5 and 0.8 are defined as dividing lines; 0~0.2 are defined as a low-density region. This means that there is a very small number of collisions per unit area; 0.2~0.5 is the medium-low region; 0.5~0.8 is the medium-high density region; 0.8~1 is the high-density region. The density calculations are shown in Figure 5. In Figure 5, the low-density areas account for the largest proportion. The high-density areas account for the smallest proportion. Density region statistics are shown in Table  5. In this research, we reviewed the 2020 annual road network density monitoring report for major Chinese cities. The report shows that there is a crucial difference in road network density in different districts of Harbin. The maximum value is 6.4 km/km 2 in Daoli, the minimum is 3.6 km/km 2 in Pingfang and the mean value is 5.0 km/km 2 . To reduce the influence on the results, this paper obtains the accident frequency per unit of road length. The results are shown in Figure 6. In Figure 5, the low-density areas account for the largest proportion. The high-density areas account for the smallest proportion. Density region statistics are shown in Table 5. In this research, we reviewed the 2020 annual road network density monitoring report for major Chinese cities. The report shows that there is a crucial difference in road network density in different districts of Harbin. The maximum value is 6.4 km/km 2 in Daoli, the minimum is 3.6 km/km 2 in Pingfang and the mean value is 5.0 km/km 2 . To reduce the influence on the results, this paper obtains the accident frequency per unit of road length. The results are shown in Figure 6. Sustainability 2022, 14, x FOR PEER REVIEW 11 of 24 Figure 6. The density distribution considering the road network density.
In Figure 6, the location of the distribution density region considering road network density varies greatly. The statistics are shown in Table 6. Daoli and Xiangfang are more densely affected by the road network density, and Songbei has not changed substantially.

Severity Distribution
The traffic safety level of urban roads is also related to casualties and property damage. Using the accident severity as the weight of each collision point, the weighted density of all points is calculated by Equation (8). By dividing weighted density by the density without weight, we obtain the severity distribution on a unit area. The results are shown in Figure 7. In Figure 6, the location of the distribution density region considering road network density varies greatly. The statistics are shown in Table 6. Daoli and Xiangfang are more densely affected by the road network density, and Songbei has not changed substantially.

Severity Distribution
The traffic safety level of urban roads is also related to casualties and property damage. Using the accident severity as the weight of each collision point, the weighted density of all points is calculated by Equation (8). By dividing weighted density by the density without weight, we obtain the severity distribution on a unit area. The results are shown in Figure 7.
Compared with the density distribution considering the road network density above, the spatial distribution has a substantial difference. Table 7 shows the statistical results. In Figure 6, the high-density area is located in the urban center of Pingfang. However, it is located in Daowai, Xiangfang, and Nangang in Figure 7. The medium-high density areas occupy a considerably large area. Most of the areas are located in the urban center of the district. The medium-and low-density areas have a wider distribution, covering an area of 136.2 km 2 . Sustainability 2022, 14, x FOR PEER REVIEW 12 of 24 Compared with the density distribution considering the road network density above, the spatial distribution has a substantial difference. Table 7 shows the statistical results. In Figure 6, the high-density area is located in the urban center of Pingfang. However, it is located in Daowai, Xiangfang, and Nangang in Figure 7. The medium-high density areas occupy a considerably large area. Most of the areas are located in the urban center of the district. The medium-and low-density areas have a wider distribution, covering an area of 136.2 km 2 . Clustering analysis of the data obtained clustering spatial distribution of collisions based on the different levels of severity. The interpretation of the different clustering results is list as follows: (1) High severity clustering (high-high), which means that the spatial area contains many high severity collision points. (2) High-low severity clustering (high-low), which means that the spatial range of many low-severity collision points contains a few high severity collision points. (3) Low-high severity clustering (low-high), which means that the spatial range of many high-severity collision points contains a few low severity collision points. (4) Low severity clustering (low-low), which indicates that the spatial range contains many low-severity collision points. (5) Not-significant implies that there is no significant clustering relationship between the points in the region.   Clustering analysis of the data obtained clustering spatial distribution of collisions based on the different levels of severity. The interpretation of the different clustering results is list as follows: (1) High severity clustering (high-high), which means that the spatial area contains many high severity collision points. (2) High-low severity clustering (high-low), which means that the spatial range of many low-severity collision points contains a few high severity collision points. (3) Low-high severity clustering (low-high), which means that the spatial range of many high-severity collision points contains a few low severity collision points. (4) Low severity clustering (low-low), which indicates that the spatial range contains many low-severity collision points. (5) Not-significant implies that there is no significant clustering relationship between the points in the region. Figure 8 shows the results. A high-high clustering trend is shown in the urban center of Daoli and Daowai. In Songbei and Pingfang, most of the collision points demonstrate a high-low clustering trend. This result indicates that these areas generally have a low severity of collisions. The points with low-low clustering of severity existed only in the suburbs of Daoli.
In all of the above, the collisions are concentrated in the urban center of each district. Both the density and severity of rear-end collisions in the suburbs are low. high-low clustering trend. This result indicates that these areas generally have a low severity of collisions. The points with low-low clustering of severity existed only in the suburbs of Daoli. In all of the above, the collisions are concentrated in the urban center of each district. Both the density and severity of rear-end collisions in the suburbs are low.

Seasonal Distribution
Harbin has distinct seasons. It has a long spring and winter and a short summer. Figure 9 shows the statistics of rear-end collisions from 2015 to 2019 according to the month. The high incidence month is in July, August and September. With August as the boundary, the number of collisions from February to August has an increasing trend. August to December has a decreasing trend. The maximum is 141, which occurs in August. The minimum is 55 in January.

Seasonal Distribution
Harbin has distinct seasons. It has a long spring and winter and a short summer. Figure 9 shows the statistics of rear-end collisions from 2015 to 2019 according to the month. The high incidence month is in July, August and September. With August as the boundary, the number of collisions from February to August has an increasing trend. August to December has a decreasing trend. The maximum is 141, which occurs in August. The minimum is 55 in January.
high-low clustering trend. This result indicates that these areas generally have a low severity of collisions. The points with low-low clustering of severity existed only in the suburbs of Daoli. In all of the above, the collisions are concentrated in the urban center of each district. Both the density and severity of rear-end collisions in the suburbs are low.

Seasonal Distribution
Harbin has distinct seasons. It has a long spring and winter and a short summer. Figure 9 shows the statistics of rear-end collisions from 2015 to 2019 according to the month. The high incidence month is in July, August and September. With August as the boundary, the number of collisions from February to August has an increasing trend. August to December has a decreasing trend. The maximum is 141, which occurs in August. The minimum is 55 in January.  Based on the geographical conditions and climate of Harbin, the season division is as follows: spring: March to May; summer: June to August; autumn: September to October; winter: November to February. By Equations (6) and (7), the density distribution in each season is calculated with or without considering the road network density. Figures 10 and 11 show the results in each season. The density distribution is concentrated in Daoli, Daowai and Xiangfang even in different seasons. These districts are typically prone to rear-end collisions. follows: spring: March to May; summer: June to August; autumn: September to October; winter: November to February. By Equations (6) and (7), the density distribution in each season is calculated with or without considering the road network density. Figures 10 and  11 show the results in each season. The density distribution is concentrated in Daoli, Daowai and Xiangfang even in different seasons. These districts are typically prone to rearend collisions. Based on the geographical conditions and climate of Harbin, the season division is as follows: spring: March to May; summer: June to August; autumn: September to October; winter: November to February. By Equations (6) and (7), the density distribution in each season is calculated with or without considering the road network density. Figures 10 and  11 show the results in each season. The density distribution is concentrated in Daoli, Daowai and Xiangfang even in different seasons. These districts are typically prone to rearend collisions.  Figure 12 reports the time period distribution. The high incidence is concentrated from 6:00 to 10:00 and 16:00 to 20:00. The number is relatively small between 0:00 and 4:00. 10:00 and 16:00 are the boundaries. It has an increasing trend continuously before 10:00. The trend remains unchanged between 10:00 and 16:00. After 16:00, it decreases. The maximum is 159, which occurs from 8:00~10:00. The minimum is 27 from 2:00~4:00. Based on the results above, the classification of time periods is 0:00~5:59, 6:00~11:59, 12:00~17:59 and 18:00~23:59. We can use Equations (6) and (7) to obtain the spatial distribution of each time period. Figure 13 shows the density distribution of each time period. Figure 14 shows the results considering the road network density. Daoli, Daowai and Xiangfang are still the high-density distribution areas. During commuting hours, Pingfang appears in high-density areas.  Figure 12 reports the time period distribution. The high incidence is concentrated from 6:00 to 10:00 and 16:00 to 20:00. The number is relatively small between 0:00 and 4:00. 10:00 and 16:00 are the boundaries. It has an increasing trend continuously before 10:00. The trend remains unchanged between 10:00 and 16:00. After 16:00, it decreases. The maximum is 159, which occurs from 8:00~10:00. The minimum is 27 from 2:00~4:00.  Figure 12 reports the time period distribution. The high incidence is concentrated from 6:00 to 10:00 and 16:00 to 20:00. The number is relatively small between 0:00 and 4:00 10:00 and 16:00 are the boundaries. It has an increasing trend continuously before 10:00 The trend remains unchanged between 10:00 and 16:00. After 16:00, it decreases. The max imum is 159, which occurs from 8:00~10:00. The minimum is 27 from 2:00~4:00. Based on the results above, the classification of time periods is 0:00~5:59, 6:00~11:59 12:00~17:59 and 18:00~23:59. We can use Equations (6) and (7) to obtain the spatial distri bution of each time period. Figure 13 shows the density distribution of each time period Figure 14 shows the results considering the road network density. Daoli, Daowai and Xiangfang are still the high-density distribution areas. During commuting hours, Ping fang appears in high-density areas. Based on the results above, the classification of time periods is 0:00~5:59, 6:00~11:59, 12:00~17:59 and 18:00~23:59. We can use Equations (6) and (7) to obtain the spatial distribution of each time period. Figure 13 shows the density distribution of each time period. Figure 14 shows the results considering the road network density. Daoli, Daowai and Xiangfang are still the high-density distribution areas. During commuting hours, Pingfang appears in high-density areas.

LightBGM Prediction
The classification method follows traffic accident severity classification. We divided the test set as the proportion of 30% and the training set as 70%, 10 cross-validations are performed. The validation set is a uniformly random sample from the training set. This validation set was used for cross-validation. The parameter settings are shown in Table 8. First, we calibrated the number of LightGBM classifier features and the number of decision trees. Then, the accuracy and confusion matrix are selected to evaluate the classification accuracy of the LightGBM classifier. The results are shown in Figure 15 and Table 9.

LightGBM Prediction
The classification method follows traffic accident severity classification. We divided the test set as the proportion of 30% and the training set as 70%, 10 cross-validations are performed. The validation set is a uniformly random sample from the training set. This validation set was used for cross-validation. The parameter settings are shown in Table 8. First, we calibrated the number of LightGBM classifier features and the number of decision trees. Then, the accuracy and confusion matrix are selected to evaluate the classification accuracy of the LightGBM classifier. The results are shown in Figure 15 and Table 9. The accuracy rate of the LightBGM classifier is 85.36%, which can reach a high standard. The confusion matrix illustrates the following conclusions. For property damage accidents, the prediction accuracy rate is 89%. Eleven percent are incorrectly predicted as injury accidents. For injury accidents, the prediction accuracy rate is only 76%. A total of  The accuracy rate of the LightGBM classifier is 85.36%, which can reach a high standard. The confusion matrix illustrates the following conclusions. For property damage accidents, the prediction accuracy rate is 89%. Eleven percent are incorrectly predicted as injury accidents. For injury accidents, the prediction accuracy rate is only 76%. A total of 19% are incorrectly predicted as property damage accidents, and 5% are predicted as fatal accidents. All fatal accidents in the test set are predicted correctly with an 100% accuracy rate.
Using the LightGBM classifier, the predicting characteristic curve is calculated. The AUC value under the curve is 0.89, as shown in Figure 16. When the AUC value is between 0.5 and 1, it indicates that the classifier is a more effective prediction. The accuracy rate of the LightBGM classifier is 85.36%, which can reach a high standard. The confusion matrix illustrates the following conclusions. For property damage accidents, the prediction accuracy rate is 89%. Eleven percent are incorrectly predicted as injury accidents. For injury accidents, the prediction accuracy rate is only 76%. A total of 19% are incorrectly predicted as property damage accidents, and 5% are predicted as fatal accidents. All fatal accidents in the test set are predicted correctly with an 100% accuracy rate.
Using the LightBGM classifier, the predicting characteristic curve is calculated. The AUC value under the curve is 0.89, as shown in Figure 16. When the AUC value is between 0.5 and 1, it indicates that the classifier is a more effective prediction. Some of these predicted results are shown in Figure 17.  Some of these predicted results are shown in Figure 17. The accuracy rate of the LightBGM classifier is 85.36%, which can reach a high standard. The confusion matrix illustrates the following conclusions. For property damage accidents, the prediction accuracy rate is 89%. Eleven percent are incorrectly predicted as injury accidents. For injury accidents, the prediction accuracy rate is only 76%. A total of 19% are incorrectly predicted as property damage accidents, and 5% are predicted as fatal accidents. All fatal accidents in the test set are predicted correctly with an 100% accuracy rate.
Using the LightBGM classifier, the predicting characteristic curve is calculated. The AUC value under the curve is 0.89, as shown in Figure 16. When the AUC value is between 0.5 and 1, it indicates that the classifier is a more effective prediction. Some of these predicted results are shown in Figure 17.

Random Forest Prediction
The dataset contains nine attributing factors for each data item. This paper mainly uses the min-max normalization method (Equation (14)) to transform each factor value. After normalization, all values are in the range of [0, 1].
x max is the maximum value in the feature sample. x min is the minimum value. x * represents the new value after normalization.
The test set and training set were divided by proportions of 30% and 70%. As the same for the LightGBM classifier, RF classifier performed 10 cross-validations. The validation set was derived from the training set. RF parameter settings are shown in Table 10. Less than l × 10 −7 will not generate child nodes Because each experiment has result errors, multiple experiments were used to take the average. Each set of data derived from the experiment is the mean value of 10 experiments.
The number of trees in the RF classifier is an important parameter to be adjusted. To avoid over-fitting, in the random forest model, we selected a series number of trees to test the hyperparameters. Figure 18 shows the changes.

Random Forest Prediction
The dataset contains nine attributing factors for each data item. This paper mainly uses the min-max normalization method (Equation (14) is the maximum value in the feature sample. is the minimum value. * represents the new value after normalization.
The test set and training set were divided by proportions of 30% and 70%. As the same for the LightGBM classifier, RF classifier performed 10 cross-validations. The validation set was derived from the training set. RF parameter settings are shown in Table 10. Minimum impurity of node segmentation l × 10 −7 Less than l × 10 −7 will not generate child nodes Because each experiment has result errors, multiple experiments were used to take the average. Each set of data derived from the experiment is the mean value of 10 experiments.
The number of trees in the RF classifier is an important parameter to be adjusted. To avoid over-fitting, in the random forest model, we selected a series number of trees to test the hyperparameters. Figure 18 shows the changes. In Figure 18, according to the increasing number of decision trees, the prediction accuracy tends to be smoother. The accuracy fluctuates between 81.1% and 81.5%. When more than 500 trees are used, the accuracy stabilizes at 81.4%. Therefore, we selected 1000 In Figure 18, according to the increasing number of decision trees, the prediction accuracy tends to be smoother. The accuracy fluctuates between 81.1% and 81.5%. When more than 500 trees are used, the accuracy stabilizes at 81.4%. Therefore, we selected 1000 trees. After calibrating the number of RF classifier features and decision trees, the accuracy and confusion matrix can be obtained. The results are shown in Table 11 and Figure 19. trees. After calibrating the number of RF classifier features and decision trees, the accuracy and confusion matrix can be obtained. The results are shown in Table 11 and Figure 19. The accuracy rate of the RF classifier is 81.43%. For property damage accidents, the prediction accuracy rate is 91%. Only 8% are incorrectly predicted as injury accidents, and 1% are predicted as fatal accidents. However, the prediction accuracy rate for injury accidents is only 62%. Thirty-eight percent are incorrectly predicted as property damage accidents, and 1% is determined to be fatal accidents. For fatal accidents, the accuracy rate is 98%, and 2% are incorrectly predicted as property damage accidents.
In Figure 20, the AUC value is 0.83. This result shows that the classifier can predict well. Some of these predicted results are shown in Figure 21. The accuracy rate of the RF classifier is 81.43%. For property damage accidents, the prediction accuracy rate is 91%. Only 8% are incorrectly predicted as injury accidents, and 1% are predicted as fatal accidents. However, the prediction accuracy rate for injury accidents is only 62%. Thirty-eight percent are incorrectly predicted as property damage accidents, and 1% is determined to be fatal accidents. For fatal accidents, the accuracy rate is 98%, and 2% are incorrectly predicted as property damage accidents.
In Figure 20, the AUC value is 0.83. This result shows that the classifier can predict well.
trees. After calibrating the number of RF classifier features and decision trees, the accuracy and confusion matrix can be obtained. The results are shown in Table 11 and Figure 19. The accuracy rate of the RF classifier is 81.43%. For property damage accidents, the prediction accuracy rate is 91%. Only 8% are incorrectly predicted as injury accidents, and 1% are predicted as fatal accidents. However, the prediction accuracy rate for injury accidents is only 62%. Thirty-eight percent are incorrectly predicted as property damage accidents, and 1% is determined to be fatal accidents. For fatal accidents, the accuracy rate is 98%, and 2% are incorrectly predicted as property damage accidents.
In Figure 20, the AUC value is 0.83. This result shows that the classifier can predict well.  Some of these predicted results are shown in Figure 21.

Analysis of Rear-End Collision Causes
Based on the above, the LightGBM classifier has higher prediction accuracy. We selected the LightGBM classifier to rank the nine parameters in degree of importance. Table 12 shows the results.

Analysis of Rear-End Collision Causes
Based on the above, the LightGBM classifier has higher prediction accuracy. We selected the LightGBM classifier to rank the nine parameters in degree of importance. Table  12 shows the results. Week 0.0214 9 Accident Type 0.0003 The most important parameter affecting the severity of rear-end collisions is temperature, followed by weather, time, vehicle type, wind speed, season, location, week, and accident type. Therefore, traffic safety managers need to especially focus on environmental factors, such as temperature and weather. It is the key to reduce the probability of rearend collisions in Harbin.

Discussion
The results show that the occurrence of rear-end collisions are strongly characterized by spatial and temporal distributions. In summary, the main points are as follows: (1) It has a high frequency in the urban center of districts.
(2) In the morning, collisions have a high probability.
(3) It shows a significant correlation between spatial features and rear-end collisions severity. (4) The probability of collisions is higher on main roads with district intersections.
In fact, the urban center of Daoli and Daowai is the old urban area of Harbin. Many old roads and one-way roads lead to complicated driving conditions. There are more serious traffic problems in this area. Pingfang shows high density only in commuting time. Working commuting areas in Pingfang are more concentrated than in other districts. The conclusions drawn from the above analysis are the same.
Compared to the existing studies, there are several similar conclusions that can be drawn. Dimitrioua et al. [41] illustrate rear-end collisions potential was presented when  The most important parameter affecting the severity of rear-end collisions is temperature, followed by weather, time, vehicle type, wind speed, season, location, week, and accident type. Therefore, traffic safety managers need to especially focus on environmental factors, such as temperature and weather. It is the key to reduce the probability of rear-end collisions in Harbin.

Discussion
The results show that the occurrence of rear-end collisions are strongly characterized by spatial and temporal distributions. In summary, the main points are as follows: (1) It has a high frequency in the urban center of districts.
(2) In the morning, collisions have a high probability.
(3) It shows a significant correlation between spatial features and rear-end collisions severity. (4) The probability of collisions is higher on main roads with district intersections.
In fact, the urban center of Daoli and Daowai is the old urban area of Harbin. Many old roads and one-way roads lead to complicated driving conditions. There are more serious traffic problems in this area. Pingfang shows high density only in commuting time. Working commuting areas in Pingfang are more concentrated than in other districts. The conclusions drawn from the above analysis are the same.
Compared to the existing studies, there are several similar conclusions that can be drawn. Dimitrioua et al. [41] illustrate rear-end collisions potential was presented when traffic flow and speed standard deviation were higher. Large traffic flow and vehicle speed differences often occur in urban centers and during daytime working hours. Point (1) and (2) confirms the conclusion. Point (3) is similar to the view of Liu and Sharma [12]. The spatial characteristics of the city can be considered as a factor in the occurrence of rear-end collisions. Soltani and Askari [3] obtained the following conclusions. The urban areas that connect to each other areas were determined as clusters with high crash rates. Recreational facilities and schools along the road are associated with the occurrence of traffic accidents. This is similar to the Point (4). In our results, the administrative border area between Daoli and Daowai is the hot spot. It is reasonable to draw such a result considering that Daoli and Dawai, as old urban areas, have a high density of buildings and urban facilities.
Comparing the LightGBM and RF classifiers are can be seen in Table 13. The LightGBM classifier has better prediction accuracy for overall accuracy. The RF classifier has better prediction accuracy for property damage accident prediction accuracy. This paper concludes that temperature shows a substantial effect on the cause of rear-end collisions. The temperature characteristics of Harbin are very important. In autumn, there is a large difference in temperature between day and night. The winter is quite long. There is large amounts of snow and ice on roads. The concentration of rear-end collisions in autumn is linked to the change in temperature. In addition to driving factors, temperature has a considerable influence on the severity of collisions at road intersections [6]. In comparison, rear-end collisions cause factors that exclude season and adverse weather in North Carolina [42]. North Carolina has a relatively mild climate temperature difference, which is markedly different from Harbin.

Conclusions
This paper used spatiotemporal analysis and machine learning to analyze rear-end collision data. The results highlight some characteristics of rear-end collisions. Spatial distribution is characterized by three aspects: spatial distribution, density distribution and severity distribution. The spatial distribution is concentrated in administrative border areas between Daoli and Daowai. Considering the density distribution only, collisions are concentrated in Daoli and Xiangfang. Considering the effect of road network density, collisions are concentrated in Pingfang. An analysis of the severity and clustering of rear-end collisions shows a high-high clustering trend in the urban centers of Daoli and Daowai. In summary, administrative intersection areas with dense traffic are high incidence areas.
The temporal distribution characteristics were analyzed by seasonality and time period. Rear-end collisions mostly occur from July to September. In autumn, rear-end collisions are distributed over the largest range and at high density. In winter, all administrative districts have a high density area. The results of the time period characteristics are more obvious. Regardless of whether road network density is considered, the occurrence of rear-end collisions are concentrated between 6:00~11:59.
We then made predictions by the LightGBM and RF classifiers. Comparing the accuracy of the two classifiers, LightGBM has the highest combined accuracy. We calculated the feature importance ranking of affecting accident severity with the LightGBM classifier. It was concluded that temperature is the most important factor.
In this paper, the data have some limitations in describing the difference in collisions. Data descriptions are only available for the external environment and lack specific details. Therefore, the analysis of the causal factor is not comprehensive enough.
Further studies will gather more details about the collision, such as vehicle conditions, traffic conditions and road types. The LightGBM and RF classifiers belong to machine learning based on the decision tree. The machine learning prediction model can be considered combined with a clustering algorithm to mine the potential causes. This method can be applied to the analysis of traffic accidents in many scenarios, such as traffic accidents at urban road intersections, crashes of motor vehicle and non-vehicle and freeway rear-end accidents. These all can be analyzed using this method for spatiotemporal characteristics and accident causes.