Tropical Cyclone Intensity Prediction Using Deep Convolutional Neural Network

In this study, deep convolutional neural network (CNN) models of simulated tropical cyclone intensity (TCI), minimum central pressure (MCP), and maximum 2 min mean wind speed near the center (MWS) were constructed based on ocean and atmospheric reanalysis data, as well as Best Track tropical cyclone data over 2014–2018. To explore the interpretability of the model structure, sensitivity experiments were designed with various combinations of predictors. The model test results show that the simplified VGG-16 (VGG-16s) outperforms two other general models (LeNet-5 and AlexNet). The results of the sensitivity experiments display good consistency with our hypotheses and physical understanding, which verifies the validity and reliability of the model. Furthermore, the results also suggest that the importance of predictors varies across targets. The top three factors most highly related to TCI are sea surface temperature (SST), temperature at 500 hPa (TEM_500), and the difference in wind speed between 850 hPa and 500 hPa (vertical wind shear speed, VWSS). VWSS, relative humidity (RH), and SST are the most significant factors for MCP. For MWS, SST, TEM_500, and temperature at 850 hPa (TEM_850) outweigh the other variables. This conclusion also implies that deep learning could be an alternative way to conduct intensive and quantitative research.


Introduction
Tropical cyclones (TCs), among the most severe weather systems that develop over tropical oceans, have been drawing wide attention due to their devastating impact on human beings [1]. It has been found that the intensity of a TC mainly depends on three factors [2]: the initial intensity of the TC, the thermodynamic state of the atmosphere, and the heat exchange between the ocean and the TC. However, it is still difficult to predict TC intensity (TCI) accurately due to the limited understanding of TC dynamics and sparse observations over the ocean [3,4]. More importantly, TCs have shown a 13~15% intensifying trend over the past 37 years since the late 1970s [5], and the current situation is not optimistic. Therefore, it is essential to develop advanced models to better understand and predict TCI, as the associated destructive winds, extreme precipitation [6,7], floods [8], landslides [9], and other drastic disasters threaten human life and property.
Based on atmospheric processes and variables, numerous models have been constructed to predict TC tracks and TCI. Prediction models can generally be divided into two types: numerical models and statistical models. Numerical models rely heavily on complex physical processes to forecast TCI and tracks. Shao and Smith [10] improved the prediction of hurricanes Florence and Michael by assimilating atmospheric retrievals from hyperspectral instruments into the Weather Research and Forecasting (WRF) system. Such processes need to handle large amounts of data and consume intensive computational resources. Furthermore, the incomplete understanding of complex physical processes, inaccurate vortex initialization, and heavy computation hinder numerical models from simulating precisely and efficiently [11].
Considering the constraints of numerical models, statistical models are more flexible and consume fewer computational resources, opening up new opportunities for growth under the prosperity of big data. To some extent, traditional statistical models such as standard multiple regression [12], the Generalized Additive Model (GAM) [13], and the Statistical Hurricane Intensity Prediction Scheme (SHIPS) [14] can be used to predict TCs. Nevertheless, ordinary regression still has a limited ability to describe strongly non-linear relations.
With the development of machine learning methods, especially the appearance of neural networks with activation functions, various advanced models have been employed to forecast TCs. Sen et al. [15] applied an Artificial Neural Network (ANN) to improve the performance of non-linear autoregressive models that forecast cyclone disturbances. Furthermore, deep learning techniques such as the Convolutional Neural Network (CNN), originally designed to process images, have been widely applied to learn features from satellite imagery [16,17]. Pradhan et al. [18] designed a CNN based on LeNet-5 [19] to extract features from satellite imagery and estimate TCI. To achieve more accurate predictions, deeper networks were employed and more information was integrated and fed into the network. Higa et al. [20] applied the more complex Visual Geometry Group Network-16 (VGG-16) to capture more features from TC satellite imagery. Giffard-Roisin et al. [21] constructed fused deep learning models consisting of two CNN modules to learn from wind fields and geopotential height fields. However, most of this research focused on satellite imagery and ignored the importance of atmospheric and ocean data. Zhang et al. [22] considered nine predictors when building a CNN, e.g., brightness temperature, relative vorticity, and geopotential height (GEOPH), but sea surface temperature (SST) was not included. In this case, more factors with direct or substantial impacts on TCs should be taken into consideration, such as SST [23,24] and vertical wind shear (VWS) [25], which can more effectively characterize the dynamic and thermodynamic processes of the interaction between the ocean and TCs. Additionally, the interpretability of machine learning is a new field that verifies the reliability of models. However, interpretability experiments are rarely conducted, and some researchers overlook the importance of 'visualizing' the black box.
Correspondingly, having an in-depth understanding of model results improves our perspective on the importance of factors and the mechanisms of TC formation.
In this paper, the data and model construction are described in Section 2. Briefly, 2014–2018 atmospheric and ocean reanalysis data are used to construct the prediction model, which is based on a simplified version of VGG-16 (VGG-16s) [26]. Then, we evaluate its performance by comparing it with two other basic CNN models in Section 3.1. Moreover, we extend the application toward an in-depth understanding of the interpretability of our model results by analyzing the importance of variables in Section 3.2.

Data
The fifth generation of atmospheric reanalysis from ECMWF (ERA5), ranging from 2014 to 2018, was used to provide the necessary meteorological and ocean predictors in our research. The temporal and horizontal resolutions of ERA5 are 3 h and 0.25° × 0.25°, respectively. We mainly focused on TCs that entered the region with latitudes between 1.5°N and 30.25°N and longitudes between 129.75°E and 180°E, which mainly covers the Northwest Pacific Ocean, as shown in the red box in Figure 1. Target data at 6 h intervals within the study region, including TCI, minimum central pressure (MCP), and maximum 2 min mean wind speed near the center (MWS), were obtained from the Best Track (BT) data of the China Meteorological Administration Tropical Cyclone Data Center, Beijing [27,28]; the data are available at http://tcdata.typhoon.org.cn/en/zjljsjj_sm.html (accessed on 15 February 2022). To keep the data structure coherent, we pre-processed the input features and target data to the same temporal resolution (6 h). Ten predictors were utilized in the model: SST, temperature at 850 and 500 hPa (TEM_850 and TEM_500), the average divergence (DIV) at 850 hPa and 1000 hPa, GEOPH at 850 hPa, relative humidity (RH) at 1000 hPa, the meridional and zonal components of wind speed (V and U, respectively) at 850 hPa, and the VWS between 500 and 850 hPa. Note that the VWS is characterized by the differences in wind speed and wind direction between 850 hPa and 500 hPa (VWSS and VWSD, respectively). In addition, U and V contain information on the TC central wind speed and TCI; in other words, U and V are not independent of the targets TCI and MWS. Therefore, these two factors were neither used as model predictors nor included in sensitivity experiments targeting TCI and MWS. The predictors used for each target are shown in Table 1. Five years of data, from 2014 to 2018, were used to train, validate, and test the model.
From the shuffled dataset, a total of 2970 samples were obtained. Of these, 60% made up the training dataset, 20% the validation dataset, and 20% the testing dataset.
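The 60/20/20 split described above can be sketched in plain Python; the seed and helper name are illustrative assumptions, not from the paper:

```python
import random

def split_samples(n_samples, seed=0):
    """Shuffle sample indices and split them 60/20/20 (illustrative helper)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = split_samples(2970)
print(len(train), len(val), len(test))  # 1782 594 594
```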

Model
Generally, a CNN comprises convolutional layers, pooling layers, fully connected layers, and non-linear activation functions. Focusing on deeper neural networks, Simonyan and Zisserman [26] created VGG-16 with 13 convolutional layers to classify high-resolution images. Deeper layers help avoid missing features but also increase the number of parameters; thus, a small kernel of 3 × 3 size is utilized to improve efficiency. Owing to the limited sample size, we reduced the number of convolutional layers to 10, resulting in the simplified VGG-16 (VGG-16s).
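The efficiency argument for small kernels can be checked with a quick parameter count: two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as one 5 × 5 convolution but with fewer weights. A minimal sketch (the channel count of 64 is an assumed example, not from the paper):

```python
def conv_params(kernel, c_in, c_out):
    # Weight count of a conv layer (biases omitted): k * k * c_in * c_out
    return kernel * kernel * c_in * c_out

C = 64  # assumed channel count, for illustration only
one_5x5 = conv_params(5, C, C)      # 25 * C^2 weights
two_3x3 = 2 * conv_params(3, C, C)  # 18 * C^2 weights, same 5x5 receptive field
print(one_5x5, two_3x3)
```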
The architecture of VGG-16s is shown in Figure 2. Before being fed into the network, the data are organized into four dimensions: samples, feature types, the height of the feature map, and the width of the feature map. The spatial grid of each feature can be seen as a channel of a picture, so the shape of an input sample is 10 (8 for TCI) × 115 × 201. In the first convolutional block, 64 filters of 3 × 3 size are used to extract features. After that, a max pooling layer performs down-sampling and reduces the feature map size to 57 × 100. The second, third, fourth, and fifth convolutional blocks process the feature maps similarly, with 128, 256, 512, and 512 filters of 3 × 3 kernel size, and the feature maps are reduced to 28 × 50, 14 × 25, 7 × 12, and 3 × 6, respectively, after max pooling. We used the ReLU function as the activation function, and padding was applied to avoid losing information. Finally, the 3D matrix of size 3 × 6 × 512 is flattened into a vector of length 9216, and three fully connected layers with 4096 nodes each are utilized to output the target.
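The feature-map sizes quoted above follow from 2 × 2 max pooling with stride 2, which floors odd dimensions; a quick check in plain Python:

```python
def pool(h, w):
    # 2x2 max pooling with stride 2 halves each dimension, flooring odd values
    return h // 2, w // 2

h, w = 115, 201  # input feature-map size from the paper
sizes = []
for _ in range(5):  # five convolutional blocks, each ending in max pooling
    h, w = pool(h, w)
    sizes.append((h, w))
print(sizes)        # [(57, 100), (28, 50), (14, 25), (7, 12), (3, 6)]

flat = 3 * 6 * 512  # flattened vector length before the dense layers
print(flat)         # 9216
```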

Experiment Set-Up
In this paper, VGG-16s was developed, and two classical CNN models, LeNet-5 and AlexNet [29], were chosen to evaluate its performance. LeNet-5 has the initial structure of complex CNN models, combining pooling layers with convolutional layers, and was designed for simple image-recognition tasks such as identifying handwritten digits. The modules of AlexNet and LeNet-5 are quite similar; however, AlexNet has roughly 1000 times more parameters to learn than LeNet-5. Consequently, dropout layers are employed to prevent overfitting. To further avoid overfitting, early stopping was applied to find the best epoch, and the dropout rate was set to 0.5. The learning rate was chosen by a learning-rate decay strategy. Other parameters are shown in Table 2. We then used the VGG-16s model to carry out experiments exploring how the importance of predictors differs among the three targets. In detail, sensitivity experiments were conducted by training a set of models with different predictors. For MCP and MWS, models were trained with one predictor removed at a time; the differences in test results between runs reveal the importance of the removed predictor, and the magnitude of the change in the metrics measures the magnitude of that importance. For TCI, models were trained with one predictor at a time. For example, if the model trained with SST outperforms the model trained with RH, SST probably outweighs RH for TCI.
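The two experiment designs can be sketched as configuration generators; the predictor names follow the paper, while the helper functions themselves are hypothetical:

```python
# The ten predictors described in the Data section (U and V are excluded
# from experiments targeting TCI and MWS, as noted above).
PREDICTORS = ["SST", "TEM_850", "TEM_500", "DIV", "GEOPH",
              "RH", "U", "V", "VWSS", "VWSD"]

def leave_one_out(predictors):
    """Remove-one-predictor runs, as used for MCP and MWS."""
    return [[p for p in predictors if p != removed] for removed in predictors]

def single_predictor(predictors):
    """One-predictor-at-a-time runs, as used for TCI (U and V excluded)."""
    return [[p] for p in predictors if p not in ("U", "V")]

print(len(leave_one_out(PREDICTORS)))    # 10 experiments, 9 predictors each
print(len(single_predictor(PREDICTORS))) # 8 experiments, 1 predictor each
```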
To evaluate the performance of the models on the test datasets, the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and symmetric mean absolute percentage error (SMAPE) were used for the regression task. By definition, the lower the MAE, RMSE, and SMAPE, the better the model. Accuracy (ACC), precision (Pre), recall (Rec), and F1-score (F1) are the metrics for the classification task.
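The regression metrics can be written out in plain Python; since the paper does not give its SMAPE formula, the common symmetric form with a factor of 100 is assumed here:

```python
import math

def regression_metrics(y_true, y_pred):
    """R^2, RMSE, MAE, and SMAPE for a regression task (illustrative sketch)."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    # SMAPE in its common symmetric form (an assumption; not stated in the paper)
    smape = 100 / n * sum(2 * abs(p - t) / (abs(t) + abs(p))
                          for t, p in zip(y_true, y_pred))
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return r2, rmse, mae, smape
```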
ACC is the proportion of correctly classified samples among the total samples:

ACC = N_T / N_s,

where N_T represents the number of correctly classified samples and N_s represents the total number of test samples.
Pre indicates the proportion of true positive results among all positive results. For a given category, it is calculated as

Pre = N_TP / (N_TP + N_FP),

where N_TP represents the number of samples correctly classified as the true category and N_FP represents the number of samples incorrectly classified as the true category.
Rec is the number of true positive samples divided by the number of original positive samples. For a given category, it is calculated as

Rec = N_TP / (N_TP + N_FN),

where N_FN represents the number of samples incorrectly classified into the false categories.
To balance Pre and Rec, F1 combines the two indexes; if F1 is high, both the precision and recall of the classifier are good. For a given category, it is calculated as

F1 = 2 · Pre · Rec / (Pre + Rec).

In summary, ACC measures the model performance over all samples; Pre and Rec are complementary metrics that assess false alarms among positive predictions and misses among true positives, respectively; and F1 integrates Pre and Rec. In addition, Pre, Rec, and F1 are computed per TCI level, of which there are seven in this study. Thus, the macro-average method, which averages the values over all categories, was used to aggregate them.
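The classification metrics with macro-averaging can be sketched as follows (a plain-Python illustration, not the authors' code):

```python
def macro_metrics(y_true, y_pred, labels):
    """ACC plus macro-averaged Pre, Rec, and F1 over the given categories."""
    precs, recs, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        precs.append(pre)
        recs.append(rec)
        f1s.append(f1)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    n = len(labels)  # macro-average: plain mean over categories
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n
```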

Model Comparison
The estimation metrics of the different models when predicting the three selected targets are shown in Table 3. Generally, VGG-16s achieved a better prediction performance on all three targets than the other two models. In comparison to LeNet-5, the R² of VGG-16s increased by 33%, while its RMSE, MAE, and SMAPE decreased by 30%, 36%, and 36%, respectively, for the MCP forecast. In terms of predicting MWS, the improvements in the four indices (R², RMSE, MAE, and SMAPE) are 31%, 35%, 41%, and 39% compared to LeNet-5. Figure 3 compares the observed values with the prediction results of MCP and MWS from the three models. The prediction results of VGG-16s (the red dashed lines) are closer to the observed values (the black solid lines) than those of LeNet-5 and AlexNet, consistent with the metric results. Such improvements indicate that deeper layers help the model capture more details and handle multiple channels consisting of various factors. However, VGG-16s showed only a slightly superior performance when classifying TCI. Compared with LeNet-5, the four indicators increased by 15%, 5.8%, 18%, and 10%, respectively. This smaller improvement is possibly due to insufficient predictors, sample imbalance, and an inadequate model structure. Therefore, the model needs to be further optimized and more predictors selected to achieve further improvements. In addition, the macro-average Receiver Operating Characteristic (ROC) [30] curves and the per-category ROC curves of the three models are shown in Figure 4. The area under the macro-average ROC curve increases gradually, and VGG-16s has the best classification performance on the major classes. In summary, VGG-16s performs better than LeNet-5 and AlexNet in predicting the three targets; thus, VGG-16s is applied to the sensitivity experiments. A higher true positive rate (TPR) and a lower false positive rate (FPR) indicate better performance.
The area under the ROC curve (AUC) is also a measure of test performance, where a greater area means better performance.
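The AUC can be computed from sorted (FPR, TPR) points with the trapezoidal rule; the curves below are illustrative limiting cases, not the paper's results:

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given points sorted by FPR (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# A perfect classifier's ROC curve passes through (FPR, TPR) = (0, 1):
print(auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
# The chance-level diagonal:
print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))            # 0.5
```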

Sensitivity Experiments
The sensitivity results are shown in Figures 5 and 6. SST, VWSS, and TEM_500 are the top three important factors in TCI prediction. VWSS, RH, and SST make the largest contributions to MCP among the selected factors. SST, TEM_500, and TEM_850 are the top three factors most closely related to MWS. Common features can be found in the sensitivity results among the three targets. Firstly, SST is a critical variable for all three targets, since ocean heat is a crucial energy source for the formation and intensification of TCs. Generally, a high SST signals that a large amount of ocean water evaporates into the atmosphere, which favors a low-pressure center and an intensifying TC [23,31,32]. Secondly, TEM_500 is important for both MWS and TCI. This can be attributed to the moisture budget and heat exchange. When a wet air mass rises, it releases latent heat that warms the surroundings and maintains the warm core. After that, the drier air continues to rise and becomes colder. Therefore, the lower the upper-air temperature, the more latent heat is released at around 500 hPa, which is possibly similar to the characteristic of the outflow temperature [33,34]. Thirdly, MCP and TCI are highly sensitive to VWSS, which is potentially related to the ventilation effect [35]. A strong VWS may put the upper and lower layers of a TC in different phases, disrupting energy transport and weakening the warm core. Consequently, a high VWSS is not conducive to TC intensification and is usually seen as a key factor impeding TC development. Furthermore, VWS is of great significance to MCP, which suggests that VWS may more directly influence the low-pressure center. Additionally, it is noted that GEOPH is not very important to the three predicted targets, especially TCI and MWS. A possible explanation is that GEOPH is more likely to affect the TC track than its intensity [36].
Moreover, some differences exist among the three targets, particularly between MCP and MWS. The results indicate that RH is more important to MCP than to the other two targets. Statistical data show that rapid TC intensification cases have higher 850~700 hPa relative humidity values [37]. Areas with high RH may imply strong evaporation, an intense air-sea interaction cycle, and a deep lowering of the pressure field [32]. In contrast, the vertical temperature distribution is significant for MWS. However, few previous works have focused on the relations between the temperature at particular levels and MWS. Therefore, this problem should be further investigated with numerical models.

Discussion
A precise TC forecast model could help identify risks and reduce the losses caused by TCs. In this paper, we compared three basic CNN models based on ocean, atmospheric, and TC data, and VGG-16s was found to perform well. More advanced models such as ResNet [38] and efficient methods such as the Synthetic Minority Over-sampling Technique (SMOTE) [39] and under-sampling (US) [40], which address the small-sample problem, should be considered to improve model performance, especially for classification tasks. Additionally, integrated models that combine the merits of numerical weather forecast models and machine learning models could be meaningful for the development of TC forecasting. Moreover, selecting the best predictor group is instrumental to model efficiency. For example, studies indicate that the tropical cyclone heat potential (TCHP) is more closely related to the central pressure than SST [41]. Therefore, TCHP is possibly a more crucial factor to choose as a predictor. Both excessive and insufficient inputs can lead to high computational cost and poor performance. Thus, choosing an appropriate factor-selection method is vital.
Studying interpretability is also critical to convince decision makers and researchers to apply models with confidence. In this study, we explored the interpretability of VGG-16s. The results display how the deep learning model makes predictions on targets, which is an alternative way to make structures more visible and explainable. Furthermore, interpretability techniques of deep learning are worthy of being further studied, not merely focusing on factor importance. For instance, saliency maps [42] are promising to explore the spatial patterns of features.
Scenario studies could be carried out based on advanced models. In this research, an interpretability analysis also provides a reference for finding the most direct impact of TC and identifying the priority of function route of TC. Although some preliminary conclusions were found, various issues are still difficult to understand and elucidate comprehensively. In essence, deep learning is still a black box which adds difficulties to mechanism study. Consequently, it is a vital and profound direction for deep research to not only expand the study field of interpretability, but also to consider how to integrate numerical models.

Conclusions
In summary, VGG-16s with 10 convolutional layers was constructed to simulate the TCI, MCP, and MWS of TCs. We then compared VGG-16s with LeNet-5 and AlexNet, and the results indicate that VGG-16s outperforms the other two models on all three targets, which implies that VGG-16s is probably more appropriate for simulating TCI, MCP, and MWS. The VGG-16s model was further utilized to analyze interpretability through sensitivity studies. The results suggest that the structure of VGG-16s has good interpretability and that the importance of predictors varies across targets, even though all three targets depict TC intensity. The top three factors most highly related to TCI are SST, TEM_500, and VWSS. VWSS, RH, and SST are more critical to MCP. For MWS, SST, TEM_500, and TEM_850 outweigh the other variables. Additionally, the difference in ranking shows that the factors influence TC intensity through distinct routes. More importantly, deep learning provides an alternative approach to exploring TC formation and intensification, which will help promote a more advanced forecast system.