Projecting Annual Rainfall Timeseries Using Machine Learning Techniques

Skarlatos, Kyriakos; Bekri, Eleni S.; Georgakellos, Dimitrios; Economou, Polychronis; Bersimis, Sotirios

doi:10.3390/en16031459

by Kyriakos Skarlatos ¹, Eleni S. Bekri ², Dimitrios Georgakellos ¹, Polychronis Economou ²,* and Sotirios Bersimis ¹

¹ Department of Business Administration, University of Piraeus, 18534 Piraeus, Greece

² Department of Civil Engineering, University of Patras, 26504 Patras, Greece

* Author to whom correspondence should be addressed.

Energies 2023, 16(3), 1459; https://doi.org/10.3390/en16031459

Submission received: 29 December 2022 / Revised: 19 January 2023 / Accepted: 26 January 2023 / Published: 2 February 2023

Abstract

Hydropower plays an essential role in Europe’s energy transition and can serve as an important factor in the stability of the electricity system. This is even more crucial in areas that rely strongly on renewable energy production (for instance, solar and wind power), such as the Peloponnese and the Ionian islands in Greece. To safeguard hydropower’s contribution to total energy production, an accurate prediction of the annual precipitation is required. Two valuable tools for obtaining accurate predictions of future observations are, first, a series of sophisticated data preprocessing techniques and, second, advanced machine learning algorithms. In the present paper, a complete procedure is proposed to obtain accurate predictions of meteorological data, such as precipitation. This procedure is applied to the Greek automated weather stations network, operated by the National Observatory of Athens, in the Peloponnese and the Ionian islands in Greece. The proposed prediction algorithm successfully identified the climatic zones based on their different geographic and climatic characteristics for most meteorological stations, resulting in realistic precipitation predictions. For some stations, the algorithm underestimated the annual total precipitation, a weakness also reported by other research works.

Keywords:

hydropower; precipitation; Greece; machine learning; predictions

1. Introduction

The human activity most responsible for global climate change is the burning of fossil fuels for electricity and heat production, which contributes over 75 percent of global greenhouse gas emissions and nearly 90 percent of all carbon dioxide emissions [1]. Only 20 percent of the human population inhabits regions that do not import fossil fuels from other countries and are, therefore, independent and not vulnerable to geopolitical shocks and crises. Within this frame, a constant increase in the use of renewable energy sources, such as hydropower, wind, and solar energy, is required and has been set into force, as promoted by the Renewable Energy Directive (2009/28/EC), revised in 2018 and legally binding since June 2021. Hydropower is the backbone of low-carbon electricity generation, since it provides almost half of it worldwide today [2], and can make a dominant contribution to the EU energy targets for 2020–2030 [3], namely that 32 percent of the EU’s energy consumption be produced from renewable energy by 2030. Additionally, in comparison to the other renewable energy sources, hydropower is characterized by flexibility and storage capacity, providing crucial services to maintain the stability of the electricity system and to effectively integrate a rapidly increasing share of variable renewable energy production, for instance from solar and wind power, into the energy system [4,5].

Topographic and meteorological characteristics of a region (such as temperature and precipitation) determine the energy potential of a hydropower station [6,7]. The design and operation of hydropower plants strongly depend on the available water resources, expressed as runoff. Estimating runoff requires historical runoff timeseries or, instead, historical precipitation timeseries, which are more commonly available. Moreover, under the effects of climate change, which include increases in mean and extreme air and water temperatures, changes in annual and seasonal water availability, and extreme climate-related events [8], precipitation predictions are crucial for reassessing the supply capacity and management resilience of existing hydropower stations [9], as well as for reliably assessing the hydropower potential of new small or large hydropower plants to cover the energy needs of a region.

In order to predict precipitation at various time scales, time series models, such as ARIMA and SARIMA, have been extensively used (see for example [10,11]). Machine learning techniques, as an increasingly popular approach, provide an attractive alternative to traditional methods [12] for predicting hydrometeorological data. A thorough literature review of the machine learning models for predicting rainfall is provided in [13].

Artificial Neural Networks (ANNs), which are also used in the present paper, are among the most popular machine learning algorithms, since various research attempts have resulted in good and accurate rainfall predictions in hydrology [14]. Various types of ANNs have been successfully used to predict the probability of precipitation and/or quantitative precipitation over a time period (see, among others, [15,16,17,18]) or to predict the weather or other meteorological characteristics and events (see, among others, [19,20,21,22,23]). ANNs have also been used to predict photovoltaic power from weather data [24,25,26,27] and the power generated by wind turbines [28,29,30,31].

This paper proposes a complete procedure for obtaining accurate predictions of meteorological data, in order to support electricity production by exploiting hydropower. In Section 2 the study area of this work is presented and the need to obtain accurate precipitation predictions is highlighted, while Section 3 describes the methodology that is used for the analysis of the available meteorological data. In Section 4, the analysis of the available data is presented, while in Section 5 some concluding remarks are given.

2. Motivation

According to the recast Renewable Energy Directive 2018/2001/EU, accelerating the integration of renewable energy sources is required to reach the goal of a 55% reduction in net greenhouse gas emissions by 2030 for a climate-neutral planet by 2050. The fulfillment of this goal becomes even more crucial given the series of measures requested to quickly diminish the EU’s dependence on Russian fossil fuels well before 2030. Within this frame, hydropower’s role in Europe’s energy transition becomes even more important, as has been identified by the EU Commission. According to the predictions in the International Energy Agency (IEA)’s first Hydropower Special Market Report [2], the total installed hydropower capacity in Europe will increase by about 8% by 2030, including modernization and expansion of existing infrastructure. Therefore, to safeguard and increase the reliability of hydropower’s contribution to total energy production, an accurate prediction of the monthly precipitation is essential, especially under the effects of climate change.

Monitoring the measurements of the meteorological network and obtaining accurate predictions in an area like the Peloponnese and the Ionian islands (Greece), which has circa 1.4 million permanent residents (approximately 13% of the total population of Greece, census 2021) and receives more than 16% of the country’s tourist visits (circa 2.6 million visits in 2021) according to InSETE (https://insete.gr/districts/?lang=en#, accessed on 20 December 2022), is of great importance for two main reasons.

First, the installed renewable energy capacity of the regions of Peloponnese, Western Greece, and the Ionian islands is circa 2 GW, according to the last update (May 2022) of the Renewable Energy Sources Operator and Guarantees of Origin. At the same time, the total installed renewable energy capacity of Greece is 9.2 GW (including 3.2 GW of large hydropower projects), which demonstrates the importance of the regions of Peloponnese, Western Greece, and the Ionian islands in the stability of the electricity system since they cover more than 21% of the total installed renewable energy capacity of Greece. It is of note that the installed renewable energy capacity of Greece corresponds to 20% of the country’s final energy consumption and is expected to double by 2030. Second, the expansion of the national transmission system to an ultra-high voltage 400 kV system, which will be the backbone of the Hellenic energy system, sets Peloponnese as a key energy node through its planned interconnection with Crete, Attica, and Northern Greece.

Regarding the total installed hydropower capacity (big and small) in Greece, the Hellenic Public Power Energy Corporation S.A. reports that it amounts to 3217.4 MW. The average annual hydroelectric energy production is approximately 4020 GWh (five-year average), and, depending on the hydrologic year, it covers from 8 to 10% of the total energy production (e.g., 5282 GWh of hydropower production in 2020). As a result, studying and obtaining accurate predictions of the total precipitation of the following period is crucial for maintaining stable energy production. In the examined region, there is one big hydropower station (Ladon) with 70 MW capacity and numerous small hydropower stations with circa 58 MW capacity.

3. Materials and Methods

Let $x_{it}$, $i = 1, 2, \dots, m$, $t = 1, 2, \dots, n_{i}$, be the (monthly) records of some meteorological variable, such as average surface temperature or total precipitation, from $m$ automated weather stations. Each $x_{it}$ is the average or sum of the available daily data. The observed time series may contain missing values and, particularly in the case of monthly records, share a common seasonality period. The operation period and the percentage of missing data may vary significantly between stations, resulting in a different number of observations for each station. As a result, analyzing and forecasting such data with machine learning algorithms first requires careful handling of missing values through suitably designed preprocessing steps. The proposed preprocessing steps are presented next, followed by a discussion of the machine learning algorithms used.

3.1. Data Preprocessing

In meteorological data, the available observations across meteorological stations, as already mentioned, do not cover the same period and are not always continuously recorded. This results in sparse data with, in many cases, a significant percentage of missing values. There are many reasons for these characteristics. For example, regarding the different periods covered by different weather stations, national meteorological services and organizations often operate automated weather stations temporarily, for a limited period and only for testing purposes, or expand their network with new weather stations at different points in time. Stations that operated for a short time period, did not operate at the end of the study period (i.e., they were removed from the network), or contain an extremely large percentage of missing values during their operation period (i.e., they were probably used occasionally for testing purposes) can be excluded from the analysis. For the rest of the stations, data sparsity and missing values may occur because of, among other reasons, (a) different starting dates of operation, which result in time series of different lengths, (b) failures during extreme weather phenomena, and (c) measurement errors.

To address the aforementioned characteristics and overcome the subsequent problems, it is necessary to understand the nature of the data at first and then to examine the behavior of their features. An important tool towards this is the available metadata of each station regarding its geographical position (longitude, latitude, and altitude) and its proximity to other stations. Another important feature is the climate zone in which every station is located and the seasonality and the memory that is present in the meteorological and climate data. Based on these characteristics and features, a two-step imputation procedure is proposed and described below.

3.1.1. Spatial Imputation

Although daily meteorological records of precipitation, temperature, etc. exhibit a high degree of spatial and temporal variability, their monthly or annual aggregate or mean values usually behave more robustly among nearby areas. Nevertheless, in areas with diverse geographical features, defined by a rugged relief and its distribution between the mainland and the sea, as in Greece, monthly and annual values can still vary greatly among nearby areas because of the relation between various measurements, for example total precipitation and station elevation [32]. In such situations, the climate zone in which a station is located can also be used: the missing values of a station are imputed with the corresponding values of its nearest station that belongs to the same zone. Information on the distinct climate zones of an area can be provided by the National Meteorological Services and Organizations of each country or obtained by applying a proper clustering algorithm to the available data.

To order the stations of a given climate zone/cluster based on their distance from a given station, the Haversine distance $d = R \cdot \theta$ can be calculated [33], where $R$ is the radius of the earth ($R = 6378137$ m) and $\theta$ is the central angle between any two points (stations), A and B, with coordinates (latitude, longitude) $(\phi_{A}, \ell_{A})$ and $(\phi_{B}, \ell_{B})$, respectively. The angle $\theta$ is determined by the coordinates of the stations using the following formula:

$$\cos \theta = \sin \phi_{A} \cdot \sin \phi_{B} + \cos \phi_{A} \cdot \cos \phi_{B} \cdot \cos \Delta \ell$$

where $\Delta \ell = \ell_{A} - \ell_{B}$. The Haversine distance is the shortest distance between two points, i.e., the “great-circle distance”, under the assumption of a spherical earth, and provides good results for small distances, such as the ones needed to apply the proposed imputation.
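The distance computation above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the text, not the geosphere routine used in the analysis; the function name, the clamping step, and the use of decimal degrees as input are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6378137.0  # radius R quoted in the text, in metres


def great_circle_distance(lat_a, lon_a, lat_b, lon_b, radius=EARTH_RADIUS_M):
    """Distance d = R * theta between two points given in decimal degrees."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    delta_lon = math.radians(lon_a - lon_b)
    cos_theta = (math.sin(phi_a) * math.sin(phi_b)
                 + math.cos(phi_a) * math.cos(phi_b) * math.cos(delta_lon))
    # Rounding can push cos_theta slightly outside [-1, 1] for near-identical
    # points, which would make acos fail, so the value is clamped (assumption).
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return radius * math.acos(cos_theta)
```

Two points on the equator that are 90 degrees of longitude apart should be a quarter of the circumference away from each other, which gives a quick sanity check of the sketch.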

Given the Haversine distance $d_{ij}$ of station $i$ from every other station $j$, $j = 1, \dots, n_{c}$, $j \neq i$, where $n_{c}$ is the number of stations in the cluster, the stations can be ordered according to their distance from station $i$. If the value $x_{it}$ is missing from station $i$, then $x_{it}$ is replaced by the value $x_{(j)t}$ of the $(j)$th nearest station. The index $(j)$, $j = 1, \dots, n_{c} - 1$, is increased until the first non-missing value is found or until the distance between the stations exceeds a predefined threshold value.
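The nearest-neighbour search described above can be illustrated with the following minimal sketch. The data layout (dictionaries keyed by station name, `None` for a missing record) and the function name are assumptions for the example; donor values are taken from the original records only, so that imputed values are never propagated further, which is a design choice of this sketch rather than something the text specifies.

```python
def spatial_impute(series, distances, max_distance):
    """Fill gaps in each station's series from its nearest same-cluster neighbour.

    series:    dict station -> list of values (None marks a missing record);
               all lists are aligned on the same months.
    distances: dict (station_a, station_b) -> distance within one cluster.
    """
    def dist(a, b):
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    filled = {name: list(values) for name, values in series.items()}
    for name, values in series.items():
        # Neighbours ordered by increasing distance, cut at the threshold.
        ranked = sorted((dist(name, other), other) for other in series if other != name)
        neighbours = [other for d, other in ranked if d <= max_distance]
        for t, value in enumerate(values):
            if value is not None:
                continue
            for other in neighbours:
                if series[other][t] is not None:  # first non-missing value wins
                    filled[name][t] = series[other][t]
                    break
    return filled
```

A value stays missing when no neighbour within the threshold has a record for that month; such gaps are left for the temporal imputation step.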

3.1.2. Temporal Imputation

Following the spatial imputation, a time series imputation algorithm, such as imputation by structural model and Kalman smoothing, can be applied to replace any remaining missing value. Imputation by structural model and Kalman smoothing uses a structural model fitted to the time series by maximum likelihood, together with Kalman filtering, to impute any missing value in the operation period of each station. A similar methodology has been used in [34]. For more details, the reader is referred to [35]. It is of note that a data transformation, such as the log transformation, can also be applied before the imputation of the missing values of each time series to reduce the skewness that is usually observed in meteorological data.

3.2. Machine Learning Prediction Algorithm

Artificial neural networks (ANNs) have recently gained popularity and have proven to be a useful model for classification, clustering, pattern recognition, optimal design, and prediction in a variety of fields [36,37,38,39,40]. Weather forecasting has grown in importance in recent decades. In the majority of situations, the research attempted to build a linear relationship between the input weather data and the related target data. However, with the discovery of nonlinearity in the nature of weather data, the emphasis has switched to the nonlinear prediction of meteorological data. Although there are various research papers in nonlinear statistics for weather forecasting, most of them demand that the nonlinear model be described before estimation. Nevertheless, since meteorological data are nonlinear and have a very irregular trend, Artificial Neural Networks (ANNs) have proven to be a better technique for determining the structural relationship between the various components [41].

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning long-term dependencies in sequence prediction problems [42,43]. While recurrent neural networks usually struggle to capture the influence of the earliest stages on the current observation as the length of the input sequence increases, the special structure of LSTM manages to counterbalance this disadvantage. LSTM neural networks rely on a chain structure that contains several hidden neural networks that consist of different memory blocks called cells. The information retained by the cells and the memory manipulations are handled by three gates, namely the forget gate, the input gate, and the output gate. These gates control how the information in a sequence of data enters, is stored in, and leaves each hidden layer of the network.

The flow of information through the gates is handled by an activation function. One of the most frequently used activation functions in LSTM networks is the sigmoid function, which regulates how much of the information is allowed to pass through each gate [44]. More specifically, values of the sigmoid function close to 1 indicate that a large percentage of the information passes through the gate, whereas values close to 0 indicate that the flow is strictly limited.

The sigmoid and hyperbolic tangent activation functions are often utilized in LSTM networks [42,45]. The forget gate $f_{t}$, input gate $i_{t}$, output gate $o_{t}$, candidate vector $c_{t}'$, cell state $c_{t}$, and hidden state $h_{t}$ are computed using the following formulas:

$$\begin{aligned} f_{t} &= \sigma(w_{xf} x_{t} + w_{hf} h_{t-1} + b_{f}) \\ i_{t} &= \sigma(w_{xi} x_{t} + w_{hi} h_{t-1} + b_{i}) \\ c_{t}' &= \tanh(w_{xc'} x_{t} + w_{hc'} h_{t-1} + b_{c'}) \\ o_{t} &= \sigma(w_{xo} x_{t} + w_{ho} h_{t-1} + b_{o}) \\ c_{t} &= f_{t} * c_{t-1} + i_{t} * c_{t}' \\ h_{t} &= o_{t} * \tanh(c_{t}) \end{aligned}$$

The weights $w_{xf}$, $w_{xi}$, $w_{xc'}$, and $w_{xo}$ are applied to the input $x_{t}$, while $w_{hf}$, $w_{hi}$, $w_{hc'}$, and $w_{ho}$ are applied to the previous hidden state $h_{t-1}$. The * operator represents element-wise multiplication. Furthermore, the biases employed at each gate are $b_{f}$, $b_{i}$, $b_{c'}$, and $b_{o}$.

Capturing such long-term dependencies is an important feature for time series analysis and prediction, since time series present not only a short memory and dependence on previous observations but also persistent behaviors over time due to their seasonality and trend. While the trend of a time series can be handled by the most recent observations, seasonality can often be missed when only a short sequence of observations is used as input to the LSTM network. An important extension that overcomes this problem is the Seasonal LSTM (SLSTM) network, proposed in [46], which incorporates seasonality attributes in the model.
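To make the gate computations of a single LSTM cell concrete, the following sketch evaluates one time step for scalar inputs. It assumes the standard LSTM update, in which the cell state is $c_{t} = f_{t} c_{t-1} + i_{t} c_{t}'$ and the hidden state is $h_{t} = o_{t} \tanh(c_{t})$; the dictionary keys and function names are illustrative and do not correspond to any library API.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One scalar LSTM step; w and b are dicts keyed like the subscripts in the text."""
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + b["f"])        # forget gate
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + b["i"])        # input gate
    c_cand = math.tanh(w["xc"] * x_t + w["hc"] * h_prev + b["c"])   # candidate vector
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + b["o"])        # output gate
    c_t = f_t * c_prev + i_t * c_cand                               # new cell state
    h_t = o_t * math.tanh(c_t)                                      # new hidden state
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate is 0, so the cell simply halves its previous state; this makes the role of the forget gate easy to see.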

A general representation of the feature vector $X$ and target values $Y$ of an SLSTM network is presented in Table 1 for a time series with $n$ available observations. The feature vector is built assuming a seasonality period $s$, $k$ seasonal lags, and $m + 1$ recent values. The target vector is constructed from the $f$ following observations. In Figure 1, a simple graphical representation of the first feature vector $X$ and target values $Y$ for the specific values $s = 12$, $k = 2$, $m = 11$, and $f = 12$ is also presented.
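Since Table 1 is not reproduced here, the exact layout of the feature vectors cannot be confirmed; the sketch below shows one plausible construction. It is an assumption chosen because it uses $k$ seasonal lags and $m + 1$ recent values per row and yields $n - f - k \times s$ rows, the count quoted later for the Amaliada station. The function name and the ordering of entries inside each row are hypothetical.

```python
def build_slstm_rows(series, s, k, m, f):
    """Return (X, Y) rows for a seasonal-lag layout of a univariate series.

    For each 0-based position t, X holds the k seasonal lags
    series[t - k*s], ..., series[t - s] followed by the m + 1 recent values
    series[t - m], ..., series[t]; Y holds the next f observations.
    """
    n = len(series)
    start = max(k * s, m)            # earliest t with all lags available
    rows_x, rows_y = [], []
    for t in range(start, n - f):    # the f targets must stay inside the series
        seasonal = [series[t - j * s] for j in range(k, 0, -1)]
        recent = list(series[t - m:t + 1])
        rows_x.append(seasonal + recent)
        rows_y.append(list(series[t + 1:t + 1 + f]))
    return rows_x, rows_y
```

For a series of 120 monthly values with $s = 12$, $k = 2$, $m = 11$, and $f = 12$, this layout produces 84 rows, each with 14 features and 12 targets.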

In the present setup, the available meteorological data from the different stations can be viewed as time-series panel data. For time-series panel data, one can use two ways to apply the SLSTM model: either by modeling each station data one by one (the univariate approach) or by modeling the stations’ panel data all together at once (the multivariate approach).

3.2.1. Univariate Approach

For the univariate approach, the dataset of each time series is split into a train–test set by selecting the last input rows of data, corresponding to the last $r$ observations, to test the model and the rest to train the LSTM network. The number $r$ of observations included in the test set is determined by the nature and the scope of the analysis. For example, if the purpose of the study is to predict the following 12 monthly records, then the last 12 observations of the time series can be held out for the testing set. Note that each time series, or its transformed version produced by the imputation process, should be rescaled into the range [0,1] before training in order to avoid large inputs that can slow down the learning and convergence of the network.
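The two operations in this paragraph, rescaling to [0,1] and holding out the last $r$ rows, can be sketched as follows. The function names are illustrative, and returning the minimum and maximum alongside the scaled values is an assumption made so that predictions can later be mapped back to the original scale.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; also return (lo, hi) for inverting the scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant series: avoid division by zero
        return [0.0 for _ in values], (lo, hi)
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)


def split_last(rows, r):
    """Hold out the last r rows for testing and keep the rest for training."""
    return rows[:-r], rows[-r:]
```

For instance, holding out the last 12 of 84 rows leaves 72 rows for training.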

3.2.2. Multivariate Approach

For the multivariate approach, the dataset of the available time series is split initially at an 80/20 ratio at the level of entire time series, i.e., 80% of the available time series are selected for training and the remaining 20% for testing. In this split, fixed exogenous regressors can also be incorporated into the analysis. For example, for a set of meteorological time series from $m$ stations, 80% of the stations can be selected for training the model, and their characteristics, for example altitude, climate zone, etc., can serve as input regressors in the LSTM model. These fixed exogenous regressors are added to the model by expanding the feature vector $X$ described in Table 1 and Figure 1. It is noted, once more, that all of the meteorological time series and the regressors first need to be rescaled into the range [0,1] to avoid large inputs that can slow down the learning and convergence of the network.

The multivariate approach also enables the use of stratified repeated random subsampling (stratified bootstrap resampling) (see, among others, [47,48], for similar approaches in different contexts) to partition the dataset repeatedly into training and testing sets (80/20 ratio) a relatively large number of times. This is done not only because there is no real justification to prefer a specific partition, but also because it allows obtaining multiple predictions for each station. These multiple forecasts can be averaged over the splits or used to construct credible intervals for each prediction.
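A minimal sketch of such a repeated stratified partition is given below. It assumes that the strata are the climate zones of the stations and that, within each split, test stations are drawn without replacement; both points, as well as the function names and the fixed seed, are assumptions for illustration and are not stated in the text.

```python
import random
from collections import defaultdict


def stratified_splits(zone_of, n_splits, test_fraction=0.2, seed=0):
    """Yield (train, test) station lists, sampling test stations within each zone."""
    rng = random.Random(seed)
    by_zone = defaultdict(list)
    for station, zone in sorted(zone_of.items()):
        by_zone[zone].append(station)
    for _ in range(n_splits):
        test = []
        for zone in sorted(by_zone):
            members = by_zone[zone]
            n_test = max(1, round(test_fraction * len(members)))
            test.extend(rng.sample(members, n_test))
        train = [s for s in sorted(zone_of) if s not in test]
        yield train, test


def average_forecasts(forecasts):
    """Average the forecasts collected for each station over the splits."""
    return {station: sum(vals) / len(vals) for station, vals in forecasts.items()}
```

Each station ends up in the test set of several splits, so its forecasts can be collected in a list per station and then averaged, or summarized by empirical quantiles to form an interval.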

4. Results and Discussion

As already mentioned, the proposed methodology was applied to precipitation time series from the Peloponnese and the Ionian islands in Greece from 2010 to 2020 to predict the annual precipitation levels for the following year at every station. The imputation process was carried out in R, using the geosphere package to calculate the Haversine distances between the stations and the imputeTS package to impute any remaining missing value in the operation period of each station using Kalman filtering. The SLSTM models used in the analysis were trained using the Python package Keras on top of the TensorFlow backend. Moreover, default and recommended parameters were used, such as the Xavier initializer, which applies the following initialization rule to all weights:

$$W_{i,j}^{[\ell]} \sim N\left(0, \frac{1}{n^{[\ell - 1]}}\right)$$

where $W^{[\ell]}$ denotes the $\ell$th weight matrix and $n^{[\ell - 1]}$ is the number of neurons in layer $\ell - 1$. The analysis was carried out on an Intel® Smart Cache 4.80 GHz CPU with 8 cores, 16 threads, and 79.9 GB of DDR4 memory. The GPU version ran on a workstation with one NVIDIA GeForce RTX 3060 Ti with 8 GB of dedicated GDDR6 memory.
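The initialization rule stated in this section can be illustrated with plain Python. The sketch draws each weight from a normal distribution with mean 0 and variance $1 / n^{[\ell - 1]}$; it illustrates the rule as written here and makes no claim about how Keras implements its initializers. The function name and the seed argument are assumptions.

```python
import math
import random


def xavier_normal(n_prev, n_curr, seed=None):
    """Return an n_curr x n_prev weight matrix with entries drawn from N(0, 1/n_prev)."""
    rng = random.Random(seed)
    std = math.sqrt(1.0 / n_prev)   # the rule gives the variance; gauss() expects the std
    return [[rng.gauss(0.0, std) for _ in range(n_prev)] for _ in range(n_curr)]
```

With 100 neurons in the previous layer, the empirical variance of the sampled weights should be close to 0.01.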

4.1. Data Description

From the Greek automated weather stations network, operated by the Institute for Environmental Research and Sustainable Development of the National Observatory of Athens, which put its first stations into operation in 2007 [49], 64 stations were selected. These 64 stations are located in the Peloponnese and the Ionian islands, and the first of them was put into operation in 2010.

For each of the 64 operating automated weather stations in the area under study (see Figure 2), the altitude (in meters) and the geographical coordinates (in degrees) were available, along with the precipitation records (in millimeters) of each month during their operation. Additionally, based on the study of [50], each station was assigned to the climate zone in which it is located, reflecting the seasonality and the memory present in the precipitation data.

4.2. Data Preprocessing

4.2.1. Data Cleaning

The available monthly precipitation timeseries for the 64 stations were downloaded from the National Observatory of Athens. Stations with records that cover, not necessarily continuously, less than four years were removed. These stations, 11 in total, are mainly test, temporary, or relatively new stations. Eight of these stations were not operating in the last year. Additionally, the operation period of nine stations was less than five years, and these stations were also removed from the analysis.

For the remaining 46 stations, the percentage of the missing values during their operation period was calculated. In six of them, the percentage of the missing values was larger than 10% (the maximum percentage of missing values was 18.27% in the Pinia station) and they were also excluded from further analysis (Figure 3), resulting in a dataset with 40 meteorological stations.
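The screening rules of this subsection can be expressed as a short filter. The sketch below is illustrative only: the data layout (one list of monthly records per station, with `None` for a missing month), the function names, and the way the three thresholds are combined are assumptions, and the station names in any usage are hypothetical.

```python
def missing_percentage(records):
    """Percentage of missing (None) monthly records within the operation period."""
    return 100.0 * sum(v is None for v in records) / len(records)


def keep_stations(series, min_records=48, min_period=60, max_missing_pct=10.0):
    """Keep stations with enough records, a long enough period, and few gaps."""
    kept = {}
    for name, records in series.items():
        n_available = sum(v is not None for v in records)
        if n_available < min_records or len(records) < min_period:
            continue                               # too little data
        if missing_percentage(records) > max_missing_pct:
            continue                               # too many gaps
        kept[name] = records
    return kept
```

The defaults correspond to four years of available records, five years of operation, and at most 10% missing values.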

4.2.2. Data Imputation

Following the study of [50], which covered the same area under study, each station was assigned to one of the four clusters identified based on the precipitation characteristics and patterns of the available records. Any meteorological station that was not included in the aforementioned paper was assigned to a cluster based on Thiessen polygons provided in the same paper.

Next, the Haversine distances were calculated between all of the stations of each cluster, and the stations closest to each given station were identified and ranked. Then, any missing value in the monthly records of a station was imputed by the corresponding value of the closest station that belonged to the same cluster. If this value was also missing from the closest station, the procedure was repeated, in increasing order of distance, until the first non-missing value was found in a station. The imputation was not implemented if the distance between the stations was larger than 30 km.

For illustration purposes, Figure 4 presents the similarity in the precipitation records of two nearby meteorological stations (Haversine distance: 10.64 km), namely Megalopoli (data from 1/2010) and Lykochia (data from 2/2013). In the embedded plot, the year 2018 is presented in detail. In 2018 there was a missing value (May 2018) for the Lykochia station. From the plot, it is clear that the two time series behave similarly, which justifies the spatial imputation presented in Section 3.1.1.

The imputation procedure resulted in a dataset with missing values in only five time series (meteorological stations: Zakynthos, Kalavryta, Kranidi, Lappa, and Oleni) during their operation period with 1, 4, 1, 6, and 4 observations, respectively. The aforementioned missing observations were finally imputed using a basic structural model with a frequency equal to 12 (monthly data) on the log-transformed univariate time series and a local trend model along with the Kalman smoothing.

4.3. LSTM Model Performance

Following the machine learning algorithm presented in Section 3.2, the SLSTM model was used to predict the precipitation of each station. The same layer architecture was adopted for both the univariate and multivariate approaches (see Section 4.3.1 and Section 4.3.2, respectively) with the same hyperparameters. More specifically, the model consists of five layers: the input layer, the LSTM layer, a hidden layer that uses the sigmoid (logistic) activation function, a second hidden layer that also uses the sigmoid activation function, and the output layer (see Figure 5). Since the precipitation presented a 12-month seasonality and the goal was to predict the total precipitation of the following year, the values $s = 12$, $k = 2$, $m = 11$, and $f = 12$ were adopted. The values $k = 2$ and $m = 11$ denote that, for each month (for example December), the 12 most recent observations and the corresponding months from the two previous years (for example the two previous Decembers) were used to create the feature vectors. For the multivariate approach, these vectors were extended with the station’s altitude and the climate zone in which it is located. Min–max normalization was applied in both approaches to the log-transformed data.

4.3.1. Univariate Approach

For the univariate approach, the data were initially grouped and divided into 40 different datasets, each one corresponding to one of the available stations. Next, to determine the values of the hyperparameters of the univariate LSTM model, the dataset was first split into train and test subsets by selecting the last twelve feature vectors $X$ and target values $Y$ (the last value of each $Y$ corresponds to one of the twelve months of the last year, 2020) as the test set. Then, a grid search was conducted over a set of possible hyperparameter values using the Amaliada station. The Amaliada station was selected because of its number of observations (120, corresponding to 10 years of operation, the median of all available stations). The structure of the feature vectors $X$ and the target values $Y$, along with the number of observations, results in a total of $n - f - k \times s = 84$ available data lines, as described in Table 1, for the Amaliada station.

In order to create a unified approach for all stations, the possible values of the hyperparameters were determined as a function of the values $n$, $f$, $k$, and $s$. The number $nd_{\ell}$ of nodes tested for the LSTM layer was selected among the values:

$$nd_{\ell} = \begin{cases} \lfloor \frac{1}{10} (n - f - k \times s) \rfloor & \text{(for Amaliada station = 7)} \\ \lfloor \frac{1}{3} (n - f - k \times s) \rfloor & \text{(for Amaliada station = 24)} \\ n - f - k \times s & \text{(for Amaliada station = 72)} \end{cases}$$

while the number $nd_{h}$ of nodes tested for each hidden layer was selected among the values:

$$nd_{h} = \begin{cases} \lfloor \frac{1}{2} (n - f - k \times s) \rfloor & \text{(for Amaliada station = 36)} \\ 2 (n - f - k \times s) & \text{(for Amaliada station = 144)} \\ 3 (n - f - k \times s) & \text{(for Amaliada station = 216)} \end{cases}$$

where $\lfloor \cdot \rfloor$ denotes the integer part. The values quoted for the Amaliada station are consistent with the 72 data lines that remain for training once the last twelve of the 84 lines are held out for testing. These values were determined so that the number of nodes in the LSTM layer is less than or equal to the number of input data, while the number of nodes in the hidden layers covers a larger range. Finally, the batch sizes tested were 1, 6, and 12, to reflect the nature of the data, i.e., a single month, a semester, and a year. For all the scenarios, the Adam optimizer was adopted [51].
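The candidate values above can be enumerated programmatically. In the sketch below, `base` is a hypothetical name for the number of data lines entering the formulas; setting it to 72 reproduces the values quoted for the Amaliada station (7, 24, 72 and 36, 144, 216). The sketch also assumes that both hidden layers share the same number of nodes, which gives 27 combinations; the text does not state this explicitly.

```python
from itertools import product


def candidate_grid(base):
    """Candidate (batch size, LSTM nodes, hidden nodes) triples for a grid search."""
    lstm_nodes = [base // 10, base // 3, base]
    hidden_nodes = [base // 2, 2 * base, 3 * base]
    batch_sizes = [1, 6, 12]
    return list(product(batch_sizes, lstm_nodes, hidden_nodes))


grid = candidate_grid(72)
```

Each triple in `grid` would then be used to train one model, and the triple with the smallest loss on the held-out rows would be retained.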

The best set of hyperparameters for the batch size, the number of LSTM nodes, and the number of nodes in each of the two hidden layers was determined by using the Mean Square Error (MSE) loss function:

$$MSE = \frac{1}{n} \sum_{i} (Y_{ia} - Y_{ip})^{2} \qquad (1)$$

where $Y_{ia}$ and $Y_{ip}$ are the actual and the predicted values, respectively, of all the target values. In Table 2, the values of the loss function for all combinations of the hyperparameters are presented. The smallest $MSE$ value (0.02660) is marked in bold and corresponds to a batch size equal to 1, 7 nodes for the LSTM layer, and 36 nodes for both hidden layers. This model was then also adopted for the rest of the stations and used to make the prediction for the following year presented in Section 4.4. It is of note that other performance metrics, such as the Mean Absolute Percentage Error (MAPE), can also be adopted. In a short study that was carried out, both metrics selected the same combination of hyperparameters. The CPU usage range and the time required to run the analysis for the selected hyperparameters for the Amaliada station are reported in Table 3.
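The selection step can be illustrated with a short Python sketch that computes the loss of Equation (1) and picks the configuration with the lowest value. The helper names `mse` and `best_configuration` are hypothetical, and only three rows of Table 2 are copied into the example, so this is a sketch of the selection logic rather than the full grid search.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def best_configuration(results):
    """Return the (batch size, LSTM nodes, hidden nodes) key with the lowest loss."""
    return min(results, key=results.get)


# Three rows copied from Table 2 (7 LSTM nodes, 36 hidden nodes).
table2_subset = {
    (1, 7, 36): 0.02660,
    (6, 7, 36): 0.02705,
    (12, 7, 36): 0.02764,
}
best = best_configuration(table2_subset)
```

Swapping `mse` for another metric such as MAPE would only change the values stored in the dictionary; the selection itself stays the same.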

4.3.2. Multivariate Approach

For the multivariate approach, the data were initially split into 80/20 subsets by randomly selecting 32 stations (80% of the 40 available stations) for training and the remaining 8 (20%) for testing. All of the available fixed exogenous regressors, i.e., the station’s altitude and the climate zone in which it is located, were also added to the model by expanding the feature vector $X$. This initial split was used to determine the values of the hyperparameters. In order to prevent over-fitting due to the large amount of input, a rule of thumb was adopted to set the number of neurons in the hidden layers:

$$N_{h} = \frac{N_{s}}{\alpha \times (N_{i} + N_{o})} \qquad (2)$$

where $N_{i}$ is the number of input neurons, $N_{o}$ is the number of output neurons, and $N_{s}$ is the number of samples in the training dataset. The parameter $\alpha$ is an arbitrary scaling factor, usually 2–10. For the current set-up, $\alpha$ was set to 2, 5, and 10, resulting in three different values for the number of nodes in the hidden layers. For the nodes of the LSTM layer, the values $(n - f - k \times s) / 10$, $(n - f - k \times s) / 20$, and $(n - f - k \times s) / 30$ were adopted for testing, where the denominator indicates the range of percentage that was chosen for the initial split. Finally, the batch sizes tested were 1, 6, and 12, following the same reasoning as in the univariate case. For all scenarios, the Adam optimizer was again adopted and the MSE loss function was used to select the best set of hyperparameters.
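The rule of thumb can be sketched as a small Python helper. The sketch assumes the common form of the rule, in which the numbers of input and output neurons are added in the denominator, and it assumes that the result is truncated to an integer. The function name `hidden_nodes` and the sample, input, and output counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def hidden_nodes(n_samples, n_inputs, n_outputs, alpha):
    """Rule-of-thumb hidden layer size: N_s / (alpha * (N_i + N_o)).

    Integer truncation is an assumption of this sketch.
    """
    return n_samples // (alpha * (n_inputs + n_outputs))


# Hypothetical sizes, not taken from the paper.
candidates = {alpha: hidden_nodes(2800, 16, 12, alpha) for alpha in (2, 5, 10)}
```

A larger value of the scaling factor gives a smaller hidden layer, which is why the three tested values of the scaling factor produce three hidden layer sizes in the grid.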

In Table 4, the values of the loss function for all combinations of hyperparameters are presented. The smallest $MSE$ value (0.01576) is marked in bold and corresponds to a batch size equal to 12, $(n - f - k \times s) / 30$ nodes for the LSTM layer, and $N_{h}$ nodes given by Equation (2) with $\alpha = 10$ for both hidden layers.

The best combination of hyperparameters was adopted in the following step of the analysis, i.e., on the multiple splits of the data. More specifically, 1000 random splits were applied to the data. These splits were used to train and test the model over successive partitions of the data and yielded 1000 predictions for the following year, which are used in Section 4.4.
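The repeated splitting can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `random_station_splits`, the station labels, and the fixed seed are hypothetical, and the model training that would follow each split is omitted.

```python
import random


def random_station_splits(stations, n_splits=1000, train_fraction=0.8, seed=0):
    """Return n_splits random (train, test) partitions of the station list."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(stations))
    splits = []
    for _ in range(n_splits):
        shuffled = rng.sample(stations, len(stations))  # shuffled copy
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


# Placeholder labels standing in for the 40 available stations.
stations = [f"station_{i}" for i in range(40)]
splits = random_station_splits(stations)
```

Each partition keeps 32 stations for training and 8 for testing, so fitting the selected model once per partition gives one prediction of the following year per split.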

The CPU usage range and the time required to run the analysis for the selected hyperparameters for the Amaliada station are reported in Table 3. For the multivariate approach, the results correspond to each loop session. Thus, the total time required depended on the total number of iterations. Moreover, it is of note that the time required increased as the number of LSTM nodes or hidden nodes increased.

4.4. Predicting

Both approaches were used to predict the monthly precipitation and, as a consequence, the total precipitation of 2021 by using the data that were available in December 2020, i.e., the 12 monthly measurements of 2020 and the measurements of December 2019 and 2018. For illustration purposes, Figure 6 presents the boxplots of the predictions (in the original scale, i.e., after back-transforming the log data) for each month for the Monemvasia (left plot) and Alagonia (right plot) stations, based on the multivariate approach. In the same plot, the corresponding predictions based on the univariate approach are denoted with blue circles, while the true measured values for the two stations for each month are denoted with orange circles. The boxplots clearly indicate the months with low or high precipitation values.

For comparison purposes, a seasonal ARIMA (SARIMA) model was applied to the two aforementioned stations. The best SARIMA model was selected using the auto.arima function from the forecast package in R. In Table 5, the monthly predictions of the proposed algorithm are reported together with the 12-month-ahead predictions of the best SARIMA model. The actual values (in mm) for the two stations are also presented in Table 5. The absolute differences between each prediction and the actual value are reported in parentheses. The minimum for each month and each station is marked in bold.

From the values in Table 5, the two methods seem to behave in a balanced way with regard to the absolute errors (across all months in the two stations). Moreover, the proposed method underestimates the total (annual) precipitation in both stations. However, it is worth noting that the absolute percentage error for the total (annual) precipitation is more stable for the proposed method (19.43% for the Monemvasia station vs. 14.23% for the Alagonia station) than for the SARIMA models, which present a 7.52% error for the Monemvasia station (drier climate zone) and a 40.95% error for the Alagonia station (rainy climate zone).
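The annual percentage errors can be recomputed from the totals in the last row of Table 5. The Python sketch below does so; the function name `abs_pct_error` and the dictionary layout are hypothetical, while the numbers are copied from Table 5.

```python
def abs_pct_error(actual, predicted):
    """Absolute percentage error of a prediction, in percent."""
    return 100.0 * abs(actual - predicted) / abs(actual)


# Annual totals (in mm) copied from the last row of Table 5.
annual_totals_mm = {
    "Monemvasia": {"actual": 569.1, "univariate": 458.5, "sarima": 611.9},
    "Alagonia": {"actual": 1219.4, "univariate": 1045.8, "sarima": 720.0},
}

errors = {
    station: {
        method: abs_pct_error(values["actual"], values[method])
        for method in ("univariate", "sarima")
    }
    for station, values in annual_totals_mm.items()
}
```

The resulting values agree with the percentages quoted in the text up to rounding in the second decimal place.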

In Figure 7, the kernel density estimates for the annual total precipitation are presented. The kernel density estimates were determined by using the 1000 total available annual predictions for each station. For each station, every single total annual prediction was calculated by adding the corresponding 12 months’ predictions. The stations are arranged according to the climate zone, from drier to the most rainy. The brackets denote the stations that belong to the same climate zone according to the paper by [50]. The white vertical lines in each density plot indicate the first and the third quartile. The blue dots represent the observed values of the total precipitation for each station in 2021.
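The aggregation behind Figure 7 can be sketched in Python: each set of 12 monthly predictions is summed into an annual total, and the first and third quartiles of the 1000 totals are then computed. The sketch uses synthetic uniform numbers in place of real model output, the function names are hypothetical, and the kernel density estimation itself is not reproduced.

```python
import random
import statistics


def annual_totals(monthly_predictions):
    """Sum each set of 12 monthly predictions into one annual total."""
    return [sum(year) for year in monthly_predictions]


def first_and_third_quartile(values):
    """Return (Q1, Q3) of a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q1, q3


# Synthetic stand-in for 1000 predicted years of 12 monthly values (mm).
rng = random.Random(0)
synthetic = [[rng.uniform(0.0, 100.0) for _ in range(12)] for _ in range(1000)]
totals = annual_totals(synthetic)
q1, q3 = first_and_third_quartile(totals)
```

The interval between the two quartiles is the range marked by the white vertical lines in each density plot.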

The proposed prediction algorithm provides useful insight into the annual precipitation at each station. First, the algorithm clearly incorporates the different characteristics of the different climate zones. For example, the predictions for the Alagonia and Stemnitsa stations, both located in high-altitude areas of western Peloponnese, indeed indicate heavy-rainfall regions. On the other hand, the predictions for Isthmos and Didyma, stations that are located in the northern and eastern parts of the Peloponnese, demonstrate that these stations are indeed in the driest region of the study area. Second, the kernel estimates and the corresponding quartiles can be used to estimate credible intervals for these predictions and provide the necessary information in order to study realistic precipitation scenarios.

On the other hand, for some stations the method did not manage to estimate the total precipitation accurately (see, for example, Zakynthos and Patra). This may be due to two reasons. First, in 2021 there was an increase of 36%, on average, in the precipitation in the study area with respect to 2020. Such a change is hard for any prediction algorithm to capture. Second, monthly measurements are often incomplete or unrecorded due to short maintenance periods, power outages, or even malfunctions during a month. This often results in lower monthly precipitation records and, as a consequence, in underestimation of future observations.

From the above analysis, it is clear that the error of model underestimation was larger than that of overestimation for both of the proposed approaches. This is in accordance with similar findings in analysis using ANNs, as, for example, in the study of [52] on precipitation prediction in Iran and the study of [53] regarding precipitation prediction in four cities in Korea.

5. Conclusions

Obtaining accurate predictions of meteorological data, such as precipitation, is crucial for safeguarding the energy production of a large region. This is more crucial if this region strongly relies on renewable energy production. In such cases, the hydropower energy provided by the nearby dams is an important factor to the stability of the electricity system. The proposed algorithm presented in this paper can serve as a valuable tool to predict the precipitation over a large period, for example, a year, in an area using the available records of an automated weather stations network.

The two-step imputation procedure for the data preprocessing algorithm provides a simple but reliable method to increase the amount of available data. Its complexity, if any, relies only on calculating the Haversine distance between all the stations and selecting only the nearby stations that belong to the same climate zone. The two analysis approaches presented in the paper provide also all of the necessary information to make predictions for future measurements. The multivariate approach seems to be more informative and more robust, but the univariate approach can also serve as a point estimate to work around some basic energy production scenarios. The main obstacle for the proposed algorithms, as in every ANN, is the neural complexity and the selection of suitable values of the hyperparameters. However, typical architectures and a reasonable grid search seem to provide a reliable solution.

The proposed methodology serves as a first step in using machine learning techniques to predict annual rainfall, and further in-depth research, for example into different network architectures and models, is needed to provide more accurate predictions. Moreover, more study is required regarding the comparison of the two approaches and the identification of the scenarios in which one approach is likely to be more effective than the other. Additionally, we hope that this study will serve as a base for future studies regarding the scalability (to time-wise and spatially larger datasets) and the sensitivity analysis regarding different architectures, activation functions, or other values of hyperparameters.

Author Contributions

Conceptualization, P.E. and S.B.; methodology, K.S., E.S.B. and P.E. and S.B.; software, K.S. and P.E.; validation, D.G., P.E., and S.B.; formal analysis, K.S., E.S.B., D.G., P.E., and S.B.; resources, K.S., and E.S.B.; data curation, K.S., and E.S.B.; writing—original draft preparation, K.S., E.S.B., and P.E.; writing—review and editing, K.S., E.S.B., D.G., P.E., and S.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors thank the Institute for Environmental Research and Sustainable Development of the National Observatory of Athens and especially its Research Director, Konstantinos Lagouvardos, for providing the data for this analysis.

Conflicts of Interest

No conflicts of interest exist in the submission of this manuscript, and the manuscript is approved by all authors for publication.

References

1. UN. Renewable Energy—Powering a Safer Future. 2022. Available online: https://www.un.org/en/climatechange/raising-ambition/renewable-energy (accessed on 12 November 2022).
2. IEA. Hydropower Special Market Report. 2021. Available online: https://www.iea.org/reports/hydropower-special-market-report/executive-summary (accessed on 12 November 2022).
3. European Commission. Guidance on the Requirements for Hydropower in Relation to EU Nature Legislation. 2018. Available online: https://ec.europa.eu/environment/nature/natura2000/management/docs/hydro_final_june_2018_en.pdf (accessed on 13 October 2022).
4. EU Renewable Energy. Hydropower. 2022. Available online: https://energy.ec.europa.eu/topics/renewable-energy/hydropower_en (accessed on 12 November 2022).
5. Pfeiffer, O.; Nock, D.; Baker, E. Wind energy’s bycatch: Offshore wind deployment impacts on hydropower operation and migratory fish. Renew. Sustain. Energy Rev. 2021, 143, 110885.
6. Corbari, C.; Ravazzani, G.; Perotto, A.; Lanzingher, G.; Lombardi, G.; Quadrio, M.; Mancini, M.; Salerno, R. Weekly Monitoring and Forecasting of Hydropower Production Coupling Meteo-Hydrological Modeling with Ground and Satellite Data in the Italian Alps. Hydrology 2022, 9, 29.
7. Gøtske, E.K.; Victoria, M. Future operation of hydropower in Europe under high renewable penetration and climate change. Iscience 2021, 24, 102999.
8. EEA. EEA Report No 1/2019 Building a Climate-Resilient Low-Carbon Energy System. 2019. Available online: https://www.eea.europa.eu/publications/adaptation-in-energy-system (accessed on 12 November 2022).
9. Bekri, E.S.; Economou, P.; Yannopoulos, P.C.; Demetracopoulos, A.C. Reassessing Existing Reservoir Supply Capacity and Management Resilience under Climate Change and Sediment Deposition. Water 2021, 13, 1819.
10. Aghelpour, P.; Varshavian, V. Evaluation of stochastic and artificial intelligence models in modeling and predicting of river daily flow time series. Stoch. Environ. Res. Risk Assess. 2020, 34, 33–50.
11. Kabbilawsh, P.; Kumar, D.S.; Chithra, N. Forecasting long-term monthly precipitation using SARIMA models. J. Earth Syst. Sci. 2022, 131, 174.
12. Zhou, Z.; Ren, J.; He, X.; Liu, S. A comparative study of extensive machine learning models for predicting long-term monthly rainfall with an ensemble of climatic and meteorological predictors. Hydrol. Process. 2021, 35, e14424.
13. Hussein, E.A.; Ghaziasgar, M.; Thron, C.; Vaccari, M.; Jafta, Y. Rainfall Prediction Using Machine Learning Models: Literature Survey. In Artificial Intelligence for Data Science in Theory and Practice; Alloghani, M., Thron, C., Subair, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 75–108.
14. Mislan, H.; Hardwinarto, S.; Sumaryono, M.A.; Aipassa, M. Rainfall monthly prediction based on artificial neural network: A case study in Tenggarong Station, East Kalimantan-Indonesia. Proc. Comput. Sci. 2015, 59, 142–151.
15. Hsu, K.l.; Gao, X.; Sorooshian, S.; Gupta, H.V. Precipitation estimation from remotely sensed information using artificial neural networks. J. Appl. Meteorol. 1997, 36, 1176–1190.
16. Hall, T.; Brooks, H.E.; Doswell, C.A. Precipitation forecasting using a neural network. Weather Forecast. 1999, 14, 338–345.
17. Moustris, K.P.; Larissi, I.K.; Nastos, P.T.; Paliatsos, A.G. Precipitation forecast using artificial neural networks in specific regions of Greece. Water Resour. Manag. 2011, 25, 1979–1993.
18. Sønderby, C.K.; Espeholt, L.; Heek, J.; Dehghani, M.; Oliver, A.; Salimans, T.; Agrawal, S.; Hickey, J.; Kalchbrenner, N. Metnet: A neural weather model for precipitation forecasting. arXiv 2020, arXiv:2003.12140.
19. Anushka, P.; Upaka, R. Comparison of different artificial neural network (ANN) training algorithms to predict the atmospheric temperature in Tabuk, Saudi Arabia. Mausam 2020, 71, 233–244.
20. Casallas, A.; Ferro, C.; Celis, N.; Guevara-Luna, M.A.; Mogollón-Sotelo, C.; Guevara-Luna, F.A.; Merchán, M. Long short-term memory artificial neural network approach to forecast meteorology and PM2.5 local variables in Bogotá, Colombia. Model. Earth Syst. Environ. 2022, 8, 2951–2964.
21. Saipriya, S.; Chithra, N. Development of Hybrid Wavelet Artificial Neural Network Model for Downscaling Precipitation and Temperature. In Innovative Trends in Hydrological and Environmental Systems; Springer: Berlin/Heidelberg, Germany, 2022; pp. 525–536.
22. Kilsdonk, R.A.; Bomers, A.; Wijnberg, K.M. Predicting Urban Flooding Due to Extreme Precipitation Using a Long Short-Term Memory Neural Network. Hydrology 2022, 9, 105.
23. Docheshmeh Gorgij, A.; Alizamir, M.; Kisi, O.; Elshafie, A. Drought modelling by standard precipitation index (SPI) in a semi-arid climate using deep learning method: Long short-term memory. Neural Comput. Appl. 2022, 34, 2425–2442.
24. Pazikadin, A.R.; Rifai, D.; Ali, K.; Malik, M.Z.; Abdalla, A.N.; Faraj, M.A. Solar irradiance measurement instrumentation and power solar generation forecasting based on Artificial Neural Networks (ANN): A review of five years research trend. Sci. Total Environ. 2020, 715, 136848.
25. Pongpiachan, S.; Wang, Q.; Apiratikul, R.; Tipmanee, D.; Li, Y.; Xing, L.; Li, G.; Han, Y.; Cao, J.; Macatangay, R.C.; et al. An Application of Artificial Neural Network to Evaluate the Influence of Weather Conditions on the Variation of PM2.5-Bound Carbonaceous Compositions and Water-Soluble Ionic Species. Atmosphere 2022, 13, 1042.
26. Kumar, P.M.; Saravanakumar, R.; Karthick, A.; Mohanavel, V. Artificial neural network-based output power prediction of grid-connected semitransparent photovoltaic system. Environ. Sci. Pollut. Res. 2022, 29, 10173–10182.
27. Salem, H.; Kabeel, A.; El-Said, E.M.; Elzeki, O.M. Predictive modelling for solar power-driven hybrid desalination system using artificial neural network regression with Adam optimization. Desalination 2022, 522, 115411.
28. Nazir, M.S.; Alturise, F.; Alshmrany, S.; Nazir, H.M.J.; Bilal, M.; Abdalla, A.N.; Sanjeevikumar, P.; Ali, M.Z. Wind generation forecasting methods and proliferation of artificial neural network: A review of five years research trend. Sustainability 2020, 12, 3778.
29. Wang, Y.; Zou, R.; Liu, F.; Zhang, L.; Liu, Q. A review of wind speed and wind power forecasting with deep neural networks. Appl. Energy 2021, 304, 117766.
30. Rathod, U.H.; Kulkarni, V.; Saha, U.K. On the Application of Machine Learning in Savonius Wind Turbine Technology: An Estimation of Turbine Performance Using Artificial Neural Network and Genetic Expression Programming. J. Energy Resour. Technol. 2022, 144.
31. Malik, H.; Yadav, A.K.; Márquez, F.P.G.; Pinar-Pérez, J.M. Novel application of Relief Algorithm in cascaded artificial neural network to predict wind speed for wind power resource assessment in India. Energy Strategy Rev. 2022, 41, 100864.
32. Barry, R.G. Mountain Weather and Climate; Psychology Press: London, UK, 1992.
33. Sinnott, R.W. Virtues of the Haversine. Sky Telesc. 1984, 68, 158.
34. Bersimis, S.; Triantafyllopoulos, K. Dynamic Non-parametric Monitoring of Air-Pollution. Methodol. Comput. Appl. Probab. 2020, 22, 1457–1479.
35. Hyndman, R.J.; Khandakar, Y. Automatic time series forecasting: The forecast package for R. J. Stat. Softw. 2008, 27, 1–22.
36. Yegnanarayana, B. Artificial Neural Networks; PHI Learning Pvt. Ltd.: Delhi, India, 2009.
37. Abiodun, O.I.; Jantan, A.; Omolara, A.E.; Dada, K.V.; Mohamed, N.A.; Arshad, H. State-of-the-art in artificial neural network applications: A survey. Heliyon 2018, 4, e00938.
38. Molaris, V.; Triantafyllopoulos, K.; Papadakis, G.; Economou, P.; Bersimis, S. The Effect of COVID-19 on minor dry bulk shipping: A Bayesian time series and a neural networks approach. Commun. Stat. Case Stud. Data Anal. Appl. 2021, 7, 624–638.
39. Bersimis, S.; Sgora, A.; Psarakis, S. A robust meta-method for interpreting the out-of-control signal of multivariate control charts using artificial neural networks. Qual. Reliab. Eng. Int. 2022, 38, 30–63.
40. El-kenawy, E.S.M.; Abutarboush, H.F.; Mohamed, A.W.; Ibrahim, A. Advance artificial intelligence technique for designing double T-shaped monopole antenna. CMC-Comput. Mater. Contin. 2021, 69, 2983–2995.
41. Abhishek, K.; Singh, M.; Ghosh, S.; Anand, A. Weather forecasting model using artificial neural network. Proc. Technol. 2012, 4, 311–318.
42. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
43. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 64–67.
44. Zaytar, M.A.; El Amrani, C. Sequence to sequence weather forecasting with long short-term memory recurrent neural networks. Int. J. Comput. Appl. 2016, 143, 7–11.
45. Poornima, S.; Pushpalatha, M. Prediction of rainfall using intensified LSTM based recurrent neural network with weighted linear units. Atmosphere 2019, 10, 668.
46. Yoo, T.W.; Oh, I.S. Time Series Forecasting of Agricultural Products’ Sales Volumes Based on Seasonal Long Short-Term Memory. Appl. Sci. 2020, 10, 8169.
47. James, K.E.; White, R.F.; Kraemer, H.C. Repeated split sample validation to assess logistic regression and recursive partitioning: An application to the prediction of cognitive impairment. Stat. Med. 2005, 24, 3019–3035.
48. Alexopoulos, P.; Skondra, M.; Kontogianni, E.; Vratsista, A.; Frounta, M.; Konstantopoulou, G.; Aligianni, S.I.; Charalampopoulou, M.; Lentzari, I.; Gourzis, P.; et al. Validation of the Cognitive Telephone Screening Instruments COGTEL and COGTEL+ in Identifying Clinically Diagnosed Neurocognitive Disorder Due to Alzheimer’s Disease in a Naturalistic Clinical Setting. J. Alzheimer’s Dis. 2021, 83, 259–268.
49. Lagouvardos, K.; Kotroni, V.; Bezes, A.; Koletsis, I.; Kopania, T.; Lykoudis, S.; Mazarakis, N.; Papagiannaki, K.; Vougioukas, S. The automatic weather stations NOANN network of the National Observatory of Athens: Operation and database. Geosci. Data J. 2017, 4, 4–16.
50. Skamnia, A.; Bekri, E.; Economou, P. Analysis of regional precipitation measurements: The Peloponnese and the Ionian islands case. In Proceedings of the Protection and Restoration of the Environment XVI, Kalamata, Greece, 5–8 July 2021; Ioannis, M., Polychronis, E., Zacharias, I., Yannopoulos, P., Korfiatis, G., Koutsospyros, A., Eds.; pp. 190–198.
51. Dozat, T. Incorporating nesterov momentum into adam. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
52. Aghelpour, P.; Singh, V.P.; Varshavian, V. Time series prediction of seasonal precipitation in Iran, using data-driven models: A comparison under different climatic conditions. Arab. J. Geosci. 2021, 14, 551.
53. Jang, D. An Application of ANN Ensemble for Estimating of Precipitation Using Regional Climate Models. Adv. Civ. Eng. 2021, 2021, 7363471.

Figure 1. The first feature vector $X$ and target values $Y$ of an SLSTM network with $s = 12$, $k = 2$, $m = 11$, and $f = 12$.

Figure 2. The Peloponnese and the Ionian islands meteorological stations network (denoted with red dots) of the Institute for Environmental Research and Sustainable Development of the National Observatory of Athens.

Figure 3. Percentage of missing values in the meteorological stations that were in operation for at least 4 years during the period under study and were active in 2020. The red dotted line facilitates the identification of the stations with more than 10% of missing values during their operation period.

Figure 4. Monthly precipitation time series plots for Megalopoli and Lykochia. In the embedded plot, the year 2018 is presented in detail.

Figure 5. The LSTM neural network architecture used in the analysis.

Figure 6. The boxplots of the predictions for each month for the Monemvasia (left plot) and Alagonia (right plot) stations based on the multivariate approach. The corresponding predictions based on the univariate approach are denoted with a blue circle. The true measured values for the two stations for each month are denoted with orange circles.

Figure 7. The kernel density estimates for the annual total precipitation (sum of 12 months’ predictions) for all stations. The clusters of the stations according to their climate zone are denoted with brackets. The white lines in the density plots indicate the first and the third quartile. The blue dots represent the observed values of the total precipitation for each station in 2021.

Table 1. A general representation of the feature vector $X$ and target values $Y$ of an SLSTM network. The total number of available observations in the time series is denoted with n, s denotes the period of seasonality, k the number of seasonal lags, m the maximum number of lagged values, and f the maximum number of future observations to predict.

No.	X	Y
1	$[x_{1}, x_{1 + s}, \dots x_{1 + k * s}, x_{(k * s + 1) - m - 1}, x_{(k * s + 1) - (m - 2)}, \dots, x_{(k * s + 1) - 1}, x_{k * s + 1}]$	$[x_{(k * s + 1) + 1}, x_{(k * s + 1) + 2} \dots, x_{(k * s + 1) + f}]$
2	$[x_{2}, x_{2 + s}, \dots x_{2 + k * s}, x_{(k * s + 2) - m - 1}, x_{(k * s + 2) - (m - 2)}, \dots, x_{(k * s + 2) - 1}, x_{k * s + 2}]$	$[x_{(k * s + 2) + 1}, x_{(k * s + 2) + 2} \dots, x_{(k * s + 2) + f}]$
…	…	…
t	$[x_{t - k * s}, x_{t - (k - 1) * s}, \dots x_{t - s}, x_{t - m - 1}, x_{t - (m - 2)}, \dots, x_{t - 1}, x_{t}]$	$[x_{t + 1}, x_{t + 2} \dots, x_{t + f}]$
…	…	…

Table 2. Grid search for the univariate approach using the Amaliada station. The smallest $MSE$ value is marked in bold.

Batch Size	LSTM Nodes	Hidden Nodes	$MSE$
1	7	36	0.02660
		144	0.02805
		216	0.02889
	24	36	0.02761
		144	0.02815
		216	0.02798
	72	36	0.02827
		144	0.02820
		216	0.02747
6	7	36	0.02705
		144	0.02779
		216	0.02761
	24	36	0.02753
		144	0.02787
		216	0.02871
	72	36	0.02697
		144	0.02741
		216	0.02803
12	7	36	0.02764
		144	0.02716
		216	0.02756
	24	36	0.02684
		144	0.02775
		216	0.02776
	72	36	0.02717
		144	0.02790
		216	0.02762

Table 3. The CPU usage range and the time required to run the analysis for the selected hyperparameters.

	CPU Usage	Time (In s)
Univariate	12–16%	106.84
Multivariate	17–21%	314.11

Table 4. Grid search for the multivariate approach using the initial split. The smallest $MSE$ value is marked in bold.

Batch Size	LSTM Nodes	Hidden Nodes	$MSE$
1	98	140	0.01722
		280	0.01739
		701	0.01868
	148	140	0.01686
		280	0.01724
		701	0.01864
	296	140	0.01678
		280	0.01668
		701	0.01895
6	98	140	0.01643
		280	0.01755
		701	0.01825
	148	140	0.01599
		280	0.01685
		701	0.01883
	296	140	0.01674
		280	0.01650
		701	0.01839
12	98	140	0.01576
		280	0.01695
		701	0.01900
	148	140	0.01591
		280	0.01642
		701	0.01855
	296	140	0.01560
		280	0.01587
		701	0.01806

Table 5. The monthly actual precipitation values (in mm) and the predictions using the proposed algorithm and the prediction for 12 months ahead based on the best SARIMA model. In the parentheses, the absolute differences between each prediction from the actual value are reported. The minimum for each month and each station are marked in bold.

	Monemvasia Station			Alagonia Station
		Predictions			Predictions
Month	Actual	Univariate	SARIMA $^{*}$	Actual	Univariate	SARIMA $^{* *}$
January	109.6	99.5 (10.1)	120.3 (10.7)	243.4	220.0 (23.4)	78.4 (165.0)
February	29.2	79.1 (49.9)	59.3 (30.1)	69.0	111.2 (42.2)	70.0 (1.0)
March	16.8	40.0 (23.2)	54.1 (37.3)	129.8	25.8 (104.0)	72.8 (57.0)
April	8.7	15.1 (6.4)	11.2 (2.5)	14.2	39.3 (25.1)	64.7 (50.5)
May	0.6	3.7 (3.1)	6.5 (5.9)	0.6	32.9 (32.3)	27.4 (26.8)
June	6.2	11.5 (5.3)	8.9 (2.7)	68.4	33.6 (34.8)	6.6 (61.8)
July	0.0	5.0 (5.0)	0.9 (0.9)	3.2	29.6 (26.4)	35.0 (31.8)
August	0.0	0.1 (0.1)	0.2 (0.2)	5.4	34.9 (29.5)	24.8 (19.4)
September	40.6	8.2 (32.4)	21.9 (18.7)	24.6	82.0 (57.4)	68.8 (44.2)
October	69.0	32.6 (36.4)	71.9 (2.9)	224.6	132.0 (92.6)	60.5 (164.1)
November	95.2	87.3 (7.9)	117.8 (22.6)	145.8	151.5 (5.7)	66.7 (79.1)
December	193.2	76.4 (116.8)	138.9 (54.3)	290.4	153.0 (137.4)	144.3 (146.1)
Total (annual)	569.1	458.5	611.9	1219.4	1045.8	720.0

* SARIMA(0, 0, 0)x(2, 0, 0)₁₂, ** SARIMA(1, 0, 0)x(0, 1, 1)₁₂.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
