Precipitation is a central topic in weather forecasting. Precipitation prediction is a complex task because it depends on many parameters, such as temperature, humidity, and wind speed and direction, which change over time, and because the relevant atmospheric variables vary with geographical location. To improve the accuracy of precipitation prediction, this paper proposes a rainfall forecasting model based on a Support Vector Machine optimized by Particle Swarm Optimization (PSO-SVM), replacing the linear threshold used in traditional precipitation forecasting. Parameter selection has a critical impact on the predictive accuracy of SVM, so PSO is used to find the optimal parameters for SVM. The PSO-SVM algorithm was trained on historical precipitation data; the resulting forecasts provide useful information that people from all walks of life can use to make informed decisions. Simulations on a set of experimental data show that, other things being equal, the proposed algorithm achieves much better accuracy than a direct prediction model. The results also demonstrate the effectiveness and advantages of the PSO-SVM model in machine learning, and suggest further room for improvement as more relevant attributes become available for predicting the dependent variables.

Weather prediction has always been a challenging problem, and many weather forecasters and experts have devoted themselves to improving the accuracy of prediction. For example, Weather Research and Forecasting (WRF) models, automatic weather stations (AWS), and numerical weather prediction (NWP) models have been proposed and used in practice [

SVM is a machine-learning-based data mining technique used for data classification. For example, a classification algorithm may be used to predict whether a particular day will be sunny or rainy. SVM is one of the most advanced classification techniques [

The PSO algorithm searches for the optimal parameters of the SVM model in the prediction region, which makes it more effective at escaping local optima. Thus, the PSO-SVM model is used for precipitation prediction. A hybrid PSO-optimized SVM (PSO-SVM) model has also been used as an automated learning tool, trained to predict other parameters such as optical density, dissolved oxygen concentration, pH, nitrate concentration and phosphate concentration [

The remainder of this paper is organized as follows: in the following section, an introduction to SVM and PSO is presented.

As society develops, people increasingly demand more accurate weather forecasts, and improving forecast accuracy is an inevitable trend in the development of the forecasting business. In operational forecasting, forecasters still rely on accumulated knowledge and approximate recollection of forecast models for different weather types [

Given separable sample sets

In order to maximize the margin, one only needs to maximize

In this case, the dual of the objective function is a quadratic programming maximization problem. It can be defined as Equation (

Subject to:

This is an extremum problem of quadratic function with inequality constraints, and there exists a unique solution.

The

Based on the above analysis, the basic idea of SVM is as follows: when classifying nonlinear samples, the samples in the original space are mapped to a high-dimensional feature space by a nonlinear mapping, so that they become linearly separable; a linear classification in the high-dimensional space then realizes a nonlinear classification in the original space.
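This kernel idea can be made concrete with the Gaussian RBF kernel used later in this paper. The sketch below (function name and γ value are our own, for illustration) only evaluates the kernel, which implicitly measures similarity in the high-dimensional feature space without ever constructing the mapping:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2).
    Implicitly corresponds to an inner product in an infinite-dimensional
    feature space, so no explicit nonlinear mapping is needed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; distant points decay toward 0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # exp(-0.5 * 25) ≈ 3.7e-6
```

An SVM trained with this kernel separates samples linearly in the implicit feature space, which is the "relative original space nonlinear algorithm" described above.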

As a popular parameter optimization tool, Particle Swarm Optimization (PSO) is also applied in this paper: it is introduced to obtain the optimal parameters of the SVM. PSO is an evolutionary computation technique [

The velocity and position of the particles are updated at each iteration according to the following equations [

The PSO algorithm can handle continuous optimization problems and supports multi-point search. Therefore, we use the PSO algorithm to determine the parameters of SVM and improve the performance of the SVM model.
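The velocity and position updates can be sketched in pure Python. The inertia weight `w` and acceleration coefficients `c1`, `c2` below are conventional illustrative values, not necessarily those used in this paper:

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration:
    v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x = x + v."""
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            r1, r2 = random.random(), random.random()
            velocities[i][d] = (w * velocities[i][d]
                                + c1 * r1 * (pbest[i][d] - positions[i][d])
                                + c2 * r2 * (gbest[d] - positions[i][d]))
            positions[i][d] += velocities[i][d]
    return positions, velocities
```

Each particle is pulled toward both its own best position `pbest` and the swarm's global best `gbest`, which is what enables the multi-point search described above.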

Considering the influence of the training parameters on the generalization performance of SVM, we introduce the Particle Swarm Optimization algorithm to search for the training parameters in the global space. In recent years, a series of learning methods based on the kernel method have been developed, in which linear learners are extended to nonlinear learners by means of kernel functions. However, the problem of selecting the parameters of a kernel function remains [

The exhaustive search of the parameter space (enumeration method) is the most widely used SVM parameter optimization method: each parameter is discretized into a set of levels within a given interval, every combination of levels across the parameters forms a candidate parameter set, and the combination with the minimum expected risk is selected as the optimal parameter value. At the same time, the method combines

However, this method also has shortcomings:

Time-consuming: when the data size becomes large, or the number of parameters exceeds two, the computation takes a very long time.

It is difficult to find the optimal parameters accurately. The method requires a reasonable range to be chosen for each parameter, and the true optimum may lie outside that range. For practical problems, the method can only find the best solution among the candidate parameter combinations, i.e., a local optimum.
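The enumeration method described above amounts to an exhaustive scan over candidate parameter combinations. A minimal sketch, with `evaluate` as a placeholder for the expected-risk (e.g., cross-validation error) estimate:

```python
from itertools import product

def grid_search(c_values, gamma_values, evaluate):
    """Score every (C, gamma) combination exhaustively and keep the best.
    `evaluate` returns an error estimate to be minimized."""
    best_params, best_err = None, float("inf")
    for c, g in product(c_values, gamma_values):
        err = evaluate(c, g)
        if err < best_err:
            best_params, best_err = (c, g), err
    return best_params, best_err

# Toy error surface with a known minimum at C=10, gamma=0.1 (illustrative only).
params, err = grid_search(
    [0.1, 1, 10, 100], [0.01, 0.1, 1],
    evaluate=lambda c, g: (c - 10) ** 2 + (g - 0.1) ** 2,
)
print(params)  # (10, 0.1)
```

The cost is the product of the level counts per parameter, which is why the method becomes time-consuming for more than two parameters and can only ever return the best of the enumerated candidates.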

Combining the PSO algorithm with the SVM model can effectively solve this problem, so we use the PSO algorithm to optimize the parameters in this paper. The algorithm is easy to understand, simple to implement, and has few adjustable parameters, and in many cases it outperforms common intelligent algorithms such as artificial neural networks (ANN) and Genetic algorithms (GA). Based on the previous discussion, taking the Gaussian Radial Basis Function (RBF) as the kernel function, we demonstrate the flowchart of the PSO-SVM algorithm in

The details of PSO for selecting parameters for the SVM model can be stated as the following steps:

Step 1: Initialization parameters

Initialize the parameters of particle swarm optimization, such as the population size and the maximum number of iterations. The initial particle swarm location is

Step 2: Evolution starts

Set gen = 1 and randomly produce a set of

Step 3: Preliminary calculations

Calculate the new position

Step 4: Offspring generation

The global best value is generated according to Equations (

Step 5: Circulation stops

When gen reaches the maximum number of iterations and the stop criterion is met, the optimal parameters of the SVM model are finally obtained; otherwise, return to Step 2. Then, end the training and verification procedure and obtain the result of the PSO-SVM model. The diagram of the SVM-PSO forecasting model is illustrated in
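Steps 1-5 can be sketched end-to-end in pure Python. Here `cv_error` is a stand-in for the cross-validated error of an SVM trained with a candidate (C, γ) pair, and the bounds, swarm size and coefficients are illustrative assumptions, not the settings used in this paper:

```python
import random

def pso_optimize(cv_error, bounds, n_particles=10, max_gen=30,
                 w=0.7, c1=1.5, c2=1.5, seed=42):
    """Search for SVM parameters (e.g., C and gamma) within `bounds`
    by minimizing `cv_error`, following Steps 1-5."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: initialize positions and velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_err = [cv_error(*p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_err[i])
    gbest, gbest_err = pbest[g][:], pbest_err[g]
    # Step 2: evolution loop over generations.
    for _ in range(max_gen):
        for i in range(n_particles):
            # Step 3: compute the new velocity and position, clamped to bounds.
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]),
                                bounds[d][1])
            # Step 4: update personal and global bests.
            err = cv_error(*pos[i])
            if err < pbest_err[i]:
                pbest[i], pbest_err[i] = pos[i][:], err
                if err < gbest_err:
                    gbest, gbest_err = pos[i][:], err
    # Step 5: max_gen reached; return the optimal parameters found.
    return gbest, gbest_err

# Toy error surface with its minimum at C = 8, gamma = 0.2 (illustrative only).
best, best_err = pso_optimize(lambda c, g: (c - 8) ** 2 + (g - 0.2) ** 2,
                              bounds=[(0.1, 100), (0.001, 1)])
```

In the actual forecasting model, `cv_error` would train an RBF-kernel SVM on the training set and return its cross-validation error for the candidate parameters.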

Data collection and preprocessing are the initial stages of the data mining process, and preprocessing is a key stage because only valid data will produce accurate output. For this study, we use one year of ground-based meteorological data from Nanjing Station (ID: [58238]). The dataset contains atmospheric pressure, sea-level pressure, wind direction, wind speed, relative humidity and precipitation, collected every 3 h. We consider only the relevant information and ignore the rest. Principal component analysis (PCA) is used to reduce the dimensionality of the data, thus reducing the data processing time and improving the efficiency of the algorithm. We also performed a data transformation on rainfall.
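As an illustration of the PCA step, here is a minimal pure-Python sketch that extracts only the first principal component by power iteration; a real study would use a library implementation and retain several components:

```python
def first_principal_component(X, iters=100):
    """Estimate the leading PCA direction of data X (list of rows)
    by power iteration on the (unnormalized) covariance matrix."""
    n, d = len(X), len(X[0])
    # Center each column on its mean.
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    v = [1.0] * d                        # initial direction guess
    for _ in range(iters):
        # Multiply v by X^T X (proportional to the covariance matrix).
        scores = [sum(r[j] * v[j] for j in range(d)) for r in Xc]
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]        # renormalize to unit length
    # Return the direction and the 1-D projection of every sample onto it.
    return v, [sum(r[j] * v[j] for j in range(d)) for r in Xc]
```

Projecting the samples onto the leading components keeps most of the variance while shrinking the input dimension, which is what reduces the processing time.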

By inspecting the original data set, we found incorrect values that do not correspond to physical reality. For example, a pressure value of 999,017 or 999,999 that appears repeatedly is abnormal data returned when acquisition fails. We therefore reject samples containing such values and check the other elements until no value exceeds its normal threshold.
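The rejection of such sentinel values can be sketched as a simple filter; the field layout and function name here are illustrative:

```python
def drop_invalid(samples, sentinels=(999017, 999999)):
    """Reject any sample containing a known sentinel code that the station
    returns when data acquisition fails."""
    return [s for s in samples if not any(v in sentinels for v in s)]

raw = [[1031.2, 89, 0], [999999, 113, 0], [1027.3, 999017, 0]]
print(drop_invalid(raw))  # [[1031.2, 89, 0]]
```

A range check per element (rejecting values beyond each variable's normal threshold) can be layered on top of this filter in the same way.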

As shown in

The statistical performance of the classification model can be evaluated by comparison against labeled data sets: the label attribute stores the actual observed values, and the prediction attribute stores the labels predicted by the classification model. To assess the algorithms' results, calibration, cross-validation and external testing were carried out. For each experiment, the dataset was randomly divided into two subsets, training and test, comprising 80% and 20% of the original samples, respectively.

Training set: take 80% of the samples randomly from the dataset as the training set.

Test set: the remaining data in the dataset, containing all the attributes except the rainfall value that the model is supposed to predict. The test set was never used for training any of the models.

In both sets, the attributes and their types remained the same.

The data are classified according to the amount of rainfall: samples with zero rainfall are assigned to class 0, and samples with non-zero rainfall to class 1. There are 2214 valid samples in total, of which 1974 belong to class 0, accounting for about 90% of the data, and 240 belong to class 1, accounting for about 10%.
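The class assignment and the 80/20 split can be sketched as follows (pure Python; rainfall is assumed to be the last column, and the seed is arbitrary):

```python
import random

def label_and_split(samples, train_frac=0.8, seed=1):
    """Binarize rainfall (last column): class 0 = no rain, class 1 = rain;
    then randomly split into training and test sets."""
    labeled = [(row[:-1], 0 if row[-1] == 0 else 1) for row in samples]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    cut = int(len(labeled) * train_frac)
    return labeled[:cut], labeled[cut:]

data = [[1031.2, 89, 0.0], [1030.8, 113, 1.2], [1027.3, 153, 0.0],
        [1026.2, 122, 0.0], [1027.1, 121, 3.4]]
train, test = label_and_split(data)
print(len(train), len(test))  # 4 1
```

With roughly 90% of samples in class 0, the split is strongly imbalanced, which is one reason the accuracy figures later in the paper fluctuate at small sample sizes.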

In general, data need to be normalized when the samples are scattered and span a large range, so that the reduced span facilitates model building and prediction. In SVM modeling, in order to improve prediction accuracy and smooth the training procedure, all of the sample data are normalized to the interval [0,1] using the following linear mapping formula
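Assuming the standard min-max form of this linear mapping, it can be sketched per attribute column:

```python
def normalize_column(values):
    """Linear mapping into [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                    # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = normalize_column([2.0, 6.4, 8.3, 7.1, 4.1])
print(scaled[0], scaled[2])  # 0.0 1.0
```

Applying this to every attribute keeps features with large numeric ranges (e.g., pressure in hPa) from dominating features with small ranges during SVM training.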

In order to verify the predictive ability of the precipitation prediction model based on the proposed PSO-SVM algorithm, the model is compared with the traditional SVM, GA-SVM and AC-SVM (SVM with Ant Colony algorithm) precipitation prediction models. We call the method GA-SVM when the Genetic Algorithm is used to select and optimize the SVM parameters; similarly, the AC-SVM model optimizes the SVM parameters with the Ant Colony algorithm, and the M-SVM model selects the parameters with the enumeration method. Many scholars have used the GA-SVM and AC-SVM models in forecasting problems and achieved good predictive performance [

In the calibration assessment, the models were developed using the training set and validated on the same set. In the cross-validation, a

In this paper, the sample sizes are 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800 and 2000. For each sample size, 80% of the data were randomly selected as the training set for constructing the algorithm model, and the remaining 20% were used as the test set to validate the model accuracy. According to the trial, the sensitivity coefficient was

The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean of the original data. When comparing the dispersion of two data sets whose measurement scales differ greatly, or whose dimensions differ, directly comparing standard deviations is inappropriate; the influence of measurement scale and dimension must first be eliminated. The coefficient of variation achieves this, because it normalizes the standard deviation by the mean.
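The definition can be stated directly in code (the sample series here are illustrative):

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = standard deviation / mean. Being dimensionless, it lets series
    with different units or measurement scales be compared directly."""
    return pstdev(values) / mean(values)

stable = [10.0, 12.0, 8.0, 10.0]       # small spread around its mean
volatile = [100.0, 200.0, 50.0, 10.0]  # large spread around its mean
print(coefficient_of_variation(stable) < coefficient_of_variation(volatile))  # True
```

A smaller CV thus indicates a more stable accuracy series across sample sizes, which is how the stability of the four methods is compared below.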

The smaller the coefficient of variation, the smaller the level of variation (deviation); conversely, the larger it is, the greater the risk. The abscissa is the number of samples, and the ordinate is the accuracy. The accuracy of the traditional mesh optimization method is relatively stable, but it easily falls into a local optimum. With small sample sizes, GA parameter optimization attains higher accuracy, but its stability is poor and strongly influenced by the sample. Ant Colony optimization is the most stable of the four methods, with an accuracy of about 92%; changes in the sample size have minimal impact on its accuracy. The PSO parameter optimization method has the best accuracy for small sample sizes, and as the number of samples increases, its accuracy rises and then stabilizes. The SVM algorithm model trained on the training set contains the best RBF kernel parameter

The prediction accuracy fluctuates greatly when the number of samples is small, reaching its lowest point at 600 samples. As the number of samples increases, the accuracy rises: the M-SVM accuracy peaks at 1200 samples and then begins to decline; the GA-SVM model peaks at 1400 samples and then declines; and the AC-SVM model achieves its best classification accuracy at 800 samples but performs poorly thereafter. The accuracy of the PSO optimization method reaches its maximum at 1600 samples and is essentially better than that of the GA method. For comparison, we also used the plain SVM method and built a model from the training set without optimizing the parameter selection. The accuracy of the test data based on this SVM model is shown in

The model without parameter optimization takes a negligible amount of time to build. The time curves of M-SVM and AC-SVM are linear, with small slopes and slow growth. As the number of samples increases, the curve of the PSO-SVM method rises faster, and its time is positively correlated with the sample size. The GA-SVM curve also rises, with the fastest early growth rate among the four methods, and as the number of samples increases, the time required to build the model becomes extremely unstable. Overall, the GA-SVM method takes the longest to compute.

First of all, for data preprocessing, we need to remove bad and abnormal values before attempting any calculation, then normalize the data to eliminate the effect of differing sample spans and smooth the training process. The data are subjected to fivefold cross-validation to obtain a more stable model. Owing to the particularity of precipitation data, regional precipitation is distributed unevenly in time and space, and the number of precipitation days is much smaller than the total sample. Although the data are preprocessed, their impact on the forecast results cannot be removed completely, and the volatility is greater when the number of samples is small.

We analyze the results of the optimization parameters

In order to further analyze the performance of the PSO-SVM algorithm, based on Particle Swarm Optimization, against the bionic-intelligence-based GA-SVM and AC-SVM algorithms, we take the data set with a sample size of 1000 as an example; the generalization ability of the four algorithms is shown in

The PSO-SVM algorithm is proven to be an effective method for rainfall forecast decisions. SVM is a machine learning method well suited to highly nonlinear problems, and it has a solid theoretical basis. Moreover, the SVM model places no limit on the dimension (number of components) of the input vector, which facilitates handling meteorological problems involving time, space and multiple factors. We used the PSO algorithm to optimize the parameters of the model, established the PSO-SVM model, and compared it with traditional mesh optimization, the Genetic algorithm and the Ant Colony algorithm.

The experimental results indicate that the PSO algorithm achieves higher accuracy and efficiency. The PSO-SVM model is well suited to multi-variable analyses and is particularly important in current problem-solving tasks such as weather forecasting. Furthermore, further research could increase the accuracy of this model; its output is very promising, and there is clear scope for improvement.

Thus, this study is considered significant. If weather warnings reach ordinary people ahead of a natural calamity, they can act as life-saving signals, valuable added services and disaster prevention tools. These models provide a good example of the capability of Support Vector Machine Regression for high-precision, efficient weather forecasting.

The authors would like to thank the anonymous reviewers for their constructive comments, which greatly helped to improve this paper. This research was supported by the National Nature Science Foundation of China (No. 41575155), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

J.D. and Y.L. conceived and designed the experiments; Y.Y. and W.Y. performed the experiments; J.D. and Y.L. analyzed the data; Y.L. contributed analysis tools; J.D. and Y.L. wrote the paper.

The authors declare no conflict of interest.

The flowchart of the Support Vector Machine with Particle Swarm Optimization model.

Diagram of the Support Vector Machine and Particle Swarm Optimization (SVM-PSO) forecasting model.


The Receiver Operating Characteristic (ROC) curve and Area Under Curve (AUC) drawn from the 1000-sample data set.

Original meteorological data.

PRS (hPa) | PRS_Sea (hPa) | WIN_D (°) | WIN_S (0.1 m/s) | TEM (°C) | RHU (1%) | PRE_1h (mm)
---|---|---|---|---|---|---
1031.2 | 1035.8 | 89 | 2.5 | 77 | 2 | 0
1030.8 | 1035.4 | 113 | 2.9 | 61 | 6.4 | 0
1027.3 | 1031.9 | 153 | 2.1 | 49 | 8.3 | 0
1026.2 | 1030.8 | 122 | 2 | 55 | 7.1 | 0
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮
1027.1 | 1031.7 | 121 | 0.7 | 71 | 4.1 | 0

The Mean Square Error (MSE) of four methods.

Number of Samples | M-SVM | GA-SVM | AC-SVM | PSO-SVM
---|---|---|---|---
200 | 0.3162 | 0.3162 | 0.3536 | 0.3536
400 | 0.4183 | 0.3354 | 0.3354 | 0.3354
600 | 0.3873 | 0.3416 | 0.3536 | 0.3651
800 | 0.3708 | 0.3708 | 0.3062 | 0.3354
1000 | 0.3162 | 0.324 | 0.324 | 0.3162
1200 | 0.2814 | 0.25 | 0.25 | 0.2415
1400 | 0.2892 | 0.2809 | 0.2739 | 0.2597
1600 | 0.2622 | 0.25 | 0.2622 | 0.2305
1800 | 0.2934 | 0.2635 | 0.2635 | 0.2582
2000 | 0.3082 | 0.2872 | 0.2872 | 0.2782