
Deep Reinforcement Learning Ensemble for Detecting Anomaly in Telemetry Water Level Data

School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, UK
Hydro-Informatics Institutes, Bangkok 10900, Thailand
Author to whom correspondence should be addressed.
Water 2022, 14(16), 2492;
Submission received: 1 July 2022 / Revised: 6 August 2022 / Accepted: 8 August 2022 / Published: 13 August 2022


Water levels in rivers are measured by various devices installed mostly in remote locations along the rivers, and the collected data are then transmitted via telemetry systems to a data centre for further analysis and utilisation, including producing early warnings for risk situations. So, the data quality is essential. However, the devices in the telemetry station may malfunction and cause errors in the data, which can result in false alarms or missed true alarms. Finding these errors requires experienced humans with specialised knowledge, which is very time-consuming and also inconsistent. Thus, there is a need to develop an automated approach. In this paper, we firstly investigated the applicability of Deep Reinforcement Learning (DRL). The testing results show that whilst they are more accurate than some other machine learning models, particularly in identifying unknown anomalies, they lacked consistency. Therefore, we proposed an ensemble approach that combines DRL models to improve consistency and also accuracy. Compared with other models, including Multilayer Perceptrons (MLP) and Long Short-Term Memory (LSTM), our ensemble models are not only more accurate in most cases, but more importantly, more reliable.

1. Introduction

As climate change becomes more apparent, strong storms that bring heavy rainfall occur with unusual patterns in many parts of the world. They can cause severe floods that result in devastating damage to infrastructure and loss of human life. In Thailand, flooding occurs more frequently and can cause enormous damage and huge economic losses of up to $46.5 billion a year [1]. On the other hand, drought happened in several parts of Thailand in 2015, notably in the Chao Phraya River Basin, the largest river basin in Thailand. This is consistent with a report from the UNDRR (2020) [2] that the ongoing drought crisis from 2015 to 2016 was the most severe drought in Thailand in 20 years. Therefore, it is essential to monitor water levels around the country because they form an important basis for making decisions on early warning.
In order to monitor the water levels in rivers, the Hydro Informatics Institute (HII) has been studying, building, and deploying water level telemetry stations around Thailand since 2013. Every ten minutes, each station transmits the measured data to the HII data centre through cellular or satellite networks. However, the water level data collected from telemetry station sensors might be incorrect due to some factors, such as human or animal activity, malfunctioning equipment, or interference of items surrounding the sensors. Any irregularity in the data might result in an inaccurate decision, such as false alarms or missed true alarms. Although water level data may be manually reviewed before being distributed for further analysis, the procedure necessitates the use of skilled specialists who examine the data from each station and make judgments about any probable abnormalities that may exist. This process is slow, very time-consuming and also unreliable. This motivates us to develop an automated approach that can identify irregularities in a more accurate, efficient, and reliable manner.
In our previous work [3], we studied seven statistics-based models for detecting the anomalies. We found that although an individual model can be used to identify anomalies, it produces too many false alarms in some situations, such as when the water level rises dramatically before a flood occurs. This scenario differs notably from the others, and hence the majority of statistical models identify such points as anomalous. We also created two ensembles, as ensemble methods [4], if constructed properly, have been demonstrated to improve accuracy and reliability over individual models. The first ensemble was built with a simple strategy: it just combines some selected models with majority voting as its decision-making function. However, the test results showed that the simple ensemble models did not work well enough, even though they were usually better than most of the basic individual models. We then developed a complex ensemble method. It builds an ensemble of some simple ensembles selected from the candidates with some criteria, and these simple ensembles’ outputs are combined with a weighted function. The findings indicate that a complex ensemble can improve the accuracy and consistency in recognising both abnormal and normal data.
In recent decades, deep machine learning methods have been demonstrated to be more powerful than conventional machine learning techniques in tackling complex problems such as speech recognition, handwriting recognition, image recognition, and natural language processing. One of these methods, the Long Short-Term Memory (LSTM) [5], outperformed the Multilayer Perceptron (MLP), although trained with only normal data, for detecting anomaly patterns from ECG signals. Moreover, the C-LSTM method, which integrates a convolutional neural network (CNN), performed well in detecting anomalous signals that are difficult to classify in web traffic data, as shown in [6]. Another anomaly detection technique based on deep neural networks, called DeepAnt, was recently proposed; it consists of a time series predictor that uses a CNN to predict the values of the next time step, and an anomaly detector that classifies the predicted values as normal or abnormal [7].
Reinforcement Learning (RL) is an algorithm that imitates the human learning process. It is based on a self-learning process in which an agent learns by interacting with the environment without any assumptions or rules. With the advantage of being able to learn on its own, it can identify unknown anomalies [8], which gives it an edge over other models. RL has been applied to a variety of applications such as games [9,10], robotics [11,12], natural language processing [13,14], computer vision [15], etc. It has also been used in some studies to detect anomalies in data, such as an experiment [16] that shows the use of the deep Q-function network (DQN) algorithm to detect anomalies in time series. A network intrusion detection system (NIDS) based on deep reinforcement learning was developed in [17]. The authors utilised it to identify anomalous traffic on the campus network with a combination of flexible switching, learning, and detection modes. When the detection model performs below the threshold, the model is retrained. In the comparison against three traditional machine learning approaches, their model performed better on two benchmark datasets, NSL-KDD and UNSW-NB15. A binary imbalanced classification model based on deep reinforcement learning (DRL) was introduced in [18]. The authors developed the reward function by setting the rewards for the minority class to be greater than the rewards for the majority class, which made DRL pay more attention to the minority class. They compared it to seven imbalanced learning methods and found that it outperformed other models on text datasets and extremely imbalanced datasets.
Although deep learning and RL methods have achieved excellent results in time series, one common issue is that their performance varies, and it is hard to predict when they do better and when they perform relatively poorly. In order to improve their consistency and accuracy, ensemble methods can be used. One example of such a method is the technique called particle swarm optimization (PSO), which was used in [19] to predict the changing trend of the Mexican Stock Exchange by combining several neural networks. An ensemble of MLP, Backpropagation network (BPN), and LSTM, as shown in [20], was used to make models for detecting anomalous traffic in a network. An ensemble approach that utilises DRL schemes to maximise investment in stock trading was developed in [21]. The authors trained a DRL agent and obtained an ensemble trading strategy using three different actor-critic-based algorithms that outperformed the individual algorithms and two baselines in terms of the risk-adjusted return. Another ensemble RL method, which employed three types of deep neural networks in Q-learning and used ensemble techniques to make the final decision, was suggested in [22] to increase prediction accuracy for short-term wind speed forecasting.
We discovered that none of the DRL methods have been applied to identify anomalies in telemetry water level data. We wonder whether DRL is applicable for identifying abnormalities in telemetry water level data. Even if the final DRL models perform well on training data, there is no guarantee that they will also perform well on testing data. Previous research has shown that combining many models that were trained in different ways may be more accurate than any of the individual models. So, in this paper, we aim to answer the following two research questions.
(Q1) Is DRL applicable and effective for identifying abnormalities in water level data?
(Q2) Can we build some ensembles of DRL to improve accuracy and consistency?
To answer them, in this paper, we conducted an intensive investigation by evaluating the accuracy of DRL models with real-world data. Then we proposed a strategy to build some ensembles by selecting some suitable DRL models. The testing results show that DRL is applicable for identifying abnormalities in telemetry water level data with the advantage of identifying an unknown anomaly. However, the process of training takes a long time. The constructed ensembles not only improve accuracy and consistency, but also reduce the rate of false alarms.
Thus, the main contributions of this paper are:
(C1) DRL models have been demonstrated to be able to detect anomalies in telemetry water level data.
(C2) The ensembles we have constructed in this research, which combine some suitable DRL models and use a weighted decision-making strategy, can improve both accuracy and consistency. The proposed approach has the potential to be further developed and implemented for real-world application.
The rest of the paper is organised as follows: Section 2 overviews related work for anomaly detection. Section 3 describes the methodology. Section 4 presents the experiment design, from data preparation and parameter configurations to evaluation metrics. Results and discussions are provided in Section 5 and Section 6; the conclusion and suggestions for further work are summarised in Section 7.

2. Related Work

There are many methods for detecting anomalies in time series data. One basic approach is to use statistics-based methods, as reviewed in [23,24]. For example, simple and exponential smoothing techniques were used to identify anomalies in a continuous data stream of temperature in an industrial steam turbine [25]. In general, whilst they provide a baseline, they have a disadvantage in handling trends and periodicity, e.g., the water level will dramatically rise before a flood, which differs considerably from the other data points and may lead to an increased false alarm rate. In addition, they can be affected by the type of anomaly, and some work well only for a certain type of problem. For example, for missing and outlier values, when the data is normally distributed, the K-means clustering method [26] is usually used, as it is simple and relatively effective. However, there is unfortunately no general guideline for choosing a method for a given problem.
Change Point Detection (CPD) is an important method for time series analysis. It indicates an unexpected and significant change in the analysed time series stream data and has been studied in many fields, as surveyed in [27,28]. However, CPD alone cannot detect anomalies, since not all detected change points are abnormalities. Many studies are being conducted to solve this problem by integrating CPD with other models to increase anomaly detection effectiveness. For example, the researchers in [29] presented new techniques, called rule-based decision systems, that combine the results of anomaly detection algorithms with CPD algorithms to produce a confidence score for determining whether or not a data item is indeed anomalous. They tested their suggested method using multivariate water consumption data collected from smart metres, and the findings demonstrated that anomaly detection can be improved. Moreover, CPD has been proposed for detecting anomalies in file transfer: it is used to detect the current bandwidth status from the server, which is then used to calculate the expected file transfer time. The server administrator is notified when observed file transfers take longer than expected, which may indicate that something is wrong [30]. The author of [31] investigated the CUSUM algorithm for change point detection to detect SYN flood attacks. The results demonstrated that the proposed algorithm provided robust performance with both high and low intensity attacks. Although change point detection performed well in many domains, the majority of these studies focused on changes in the behaviour of time series data (sequence anomalies) rather than point anomalies, which are our primary research emphasis. Furthermore, water level data at certain stations is strongly periodic with tidal effects, resulting in numerous data points changing from high tides to low tides each day, which is typical behaviour.
In recent decades, machine learning methods, including deep neural networks (DNNs), have been satisfactorily implemented in various hydrological issues such as outlier detection [32,33], water level prediction [34,35], data imputation [36], flood forecasting [37], streamflow estimation [38], etc. For example, in [39], the authors proposed the R-ANFIS (GL) method for modelling multistep-ahead flood forecasts of the Three Gorges Reservoir (TGR) in China, which was developed by combining the recurrent adaptive-network-based fuzzy inference system (R-ANFIS) with the genetic algorithm and the least square estimator (GL). The authors of [40] presented a flood prediction by comparing the expected typhoon tracking and the historical trajectory of typhoons in Taiwan in order to predict hydrographs from rainfall projections impacted by typhoons. The PCA-SOM-NARX approach was developed by [41] to forecast urban floods, combining the advantages of three models. Principal component analysis was used to derive the geographical distributions of urban floods (PCA). To construct a topological feature map, high-dimensional inundation recordings were grouped using a self-organizing map (SOM). To build 10-minute-ahead, multistep flood prediction models, nonlinear autoregressive with exogenous inputs (NARX) was utilised. The results showed that not only did the PCA-SOM-NARX approach produce more stable and accurate multistep-ahead flood inundation depth forecasts, but it was also more indicative of the geographical distribution of inundation caused by heavy rain events. Even though we can use forecasting methods to find anomalies by using prediction error as a threshold to classify data points as normal or not, it may take time to find the suitable threshold for each station.
An autoencoder is an unsupervised learning neural network. It is comprised of two parts: an encoder and a decoder. The encoder uses the concepts of dimension reduction algorithms to convert the original data into the different representations with the underlying structure of the data remaining and ignoring the noise. Meanwhile, the decoder reconstructs the data from the output of the encoder with as close of a resemblance as possible to the original data. An autoencoder is effectively used to solve many applied problems, from face recognition [42,43] and anomaly detection [44,45,46,47] to noise reduction [48,49,50]. In the time series domain, the authors of [51] proposed two autoencoder ensemble frameworks for unsupervised outlier identification in time series data based on sparsely connected recurrent neural networks, which addressed the issues from [52] given the poor results when using an autoencoder with time series data. In one of the frameworks called the Independent Framework, multiple autoencoders are trained independently of one another, whereas in the other framework, the Shared Framework, multiple autoencoders are trained jointly in a manner that is multitask learning. They experimented by using univariate and multivariate real-world datasets. Experimental results revealed that the suggested autoencoder ensembles with a shared framework outperform baselines and state-of-the-art approaches. However, a disadvantage of this method is its high memory consumption when training many autoencoders together. In the hydrological domain, the authors of [53] presented the SAE-RNN model, which combined the stacked autoencoder (SAE) with a recurrent neural network (RNN) for multistep-ahead flood inundation forecasting. 
They started with SAE to encode the high dimensionality of input datasets (flood inundation depths), then utilised an LSTM-based RNN model to predict multistep-ahead flood characteristics based on regional rainfall patterns, and then decoded the output by SAE into regional flood inundation depths. They conducted experiments on datasets of flood inundation depths gathered in Yilan County, Taiwan, and the findings demonstrated that SAE-RNN can reliably estimate regional inundation depths in practical applications.
Time series based on ensemble methods have recently attracted attention. In a study by [54], they introduced the method EN-RTON2, which is an ensemble model with real-time updating using online learning and a submodel for real-time water level forecasts. However, they experimented with fewer datasets, a smaller number of records, and lower data frequency than our datasets. Furthermore, the authors offered no indication of the time necessary for training models and forecasting, which may be inadequate in our case given the number of stations and frequency of data transmission. The ensemble models were proposed by [55], which applied the sliding window based ensemble method to find the anomaly pattern in sensor data for preventing machine failure. They used a combination of classical clustering algorithms and the principle of biclustering to construct clusters representing different types of structure. Then they used these structures in a one-class classifier to detect outliers. The accuracy of these methods was tested on a time series of real-world datasets from the production of industry. The results have verified the accuracy and the validity of the proposed methods.
Despite the fact that numerous studies have used different anomaly detection techniques to tackle problems in many domains, only a few have focused on finding anomalies in water level data. Furthermore, the various employed sensors, installation area, frequency of data transmission, and measurement purposes lead to a variety of types of anomalies. As a result, techniques that perform well with one set of data may not work well with another.

3. Materials and Methods

This section first describes how deep reinforcement learning is constructed for detecting anomalies in water level telemetry data, and then how an ensemble can be built effectively by selecting suitable individual models to improve the accuracy of anomaly detection. The frameworks of these investigations were implemented with Python and their code can be accessed via GitHub (last checked on 5 August 2022).

3.1. Reinforcement Learning (RL)

Reinforcement learning (RL) is a branch of machine learning and it is one of the most active areas of research in artificial intelligence (AI), which is growing rapidly with a wide variety of algorithms. It is goal-oriented learning. The learner, or agent, learns from the result, or rewards, of its actions without being taught what actions to take. The way in which the agent decides which action to perform depends on the policy, which can be in the form of a lookup table or a complex search process. So, a policy function defines the agent’s behaviour in an environment.
Most techniques that are used to find the optimal policy for resolving the RL problem are based on the Markov decision process (MDP), whereby the probability of the next state $s'$ depends only on the current state $s$ and action $a$. It is represented by five important variables [56]:
  • A finite set of states ($S$), which may be discrete or continuous.
  • A finite set of actions ($A$). The agent takes an action $a$ from the action set $A$, $a \in A$.
  • A transition probability ($T(s, a, s')$), which is the probability of getting from state $s$ to another state $s'$ with action $a$.
  • A reward probability ($R(s, a, s') \in \mathbb{R}$), which is the reward after going from state $s$ to another state $s'$ with action $a$.
  • A discount factor ($\gamma$), which controls the relative importance of immediate and future rewards and lies within 0 to 1, $\gamma \in [0, 1]$.
The goal of learning is to maximise the expected cumulative reward in each episode. The agent should try to maximise the reward from any state $s$. The total reward $R$ at state $s$ is the sum of the current reward and the total discounted reward at the next state $s'$, which can be represented as follows:
$$ R(s) = R(s, a, s') + \gamma R(s') $$
The algorithm that has been widely used in RL is Q-learning. It tries to maximize the value of the Q-function, which represents how good it is for an agent to perform a particular action $a$ in a state $s$, and which can be approximated using the Bellman equation, as shown in Equation (1).
$$ NewQ(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \tag{1} $$
where $\alpha$ is the learning rate, $r$ is the reward received, and $\max_{a'} Q(s', a')$ is the highest Q-value among the possible actions from the new state $s'$.
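As an illustration of Equation (1), the following is a minimal tabular sketch rather than the implementation used in this paper. The function name `q_update`, the table shape, and the default values of `alpha` and `gamma` are assumptions chosen only for the example.

```python
import numpy as np


def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to q_table[s, a] in place and return the new value.

    q_table is a (n_states, n_actions) array; alpha and gamma are illustrative
    defaults, not values taken from the paper.
    """
    # Bootstrapped target: immediate reward plus discounted best next-state value.
    td_target = r + gamma * np.max(q_table[s_next])
    # Move the current estimate a fraction alpha towards the target.
    q_table[s, a] += alpha * (td_target - q_table[s, a])
    return q_table[s, a]
```

Starting from an all-zero table, a single update with reward 1.0 moves the chosen entry to `alpha * 1.0`, since both the old estimate and the next-state maximum are zero.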

3.1.1. Deep Q-Learning Network

Q-learning has a limitation: it does not perform well with many states and actions. Furthermore, going through all the actions in each state would be time-consuming. Therefore, the deep Q-learning network [57] (DQN) has been developed to solve those issues by using a neural network (NN). The Q-value is approximated by an NN with weights w, instead of finding the optimal Q-value through all possible state-action pairs, and errors are minimized through gradient descent. The overall process of DRL is depicted in Figure 1.
An agent usually does not know what action is best at the beginning of training. It may select the action that appears best based on its history (exploitation) or may explore new possibilities that may be better or worse (exploration). However, when should an agent “exploit” rather than “explore”? This remains a challenge since, if the chosen action results in a faulty selection, an agent may get stuck in incorrect learning for a time. The epsilon-greedy algorithm is a simple way to balance exploration and exploitation. It does this by randomly choosing between exploration and exploitation, using the hyperparameter $\epsilon$ to switch between a random action and the action with the highest Q-value, as shown in Equation (2). The normal procedure is to begin with $\epsilon = 1.0$ and gradually lower it to a small value, such as 0.01.
$$ a = \begin{cases} \text{select a random action } a & \text{with probability } \epsilon \\ \arg\max_{a} Q(s, a) & \text{otherwise} \end{cases} \tag{2} $$
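The epsilon-greedy rule in Equation (2) can be sketched as follows. This is a hedged illustration: the helper names `select_action` and `decay_epsilon`, and the multiplicative decay rate of 0.995, are assumptions for the example; only the start value of 1.0 and the floor of 0.01 come from the text.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.995, eps_min=0.01):
    """Shrink epsilon multiplicatively, never going below eps_min (assumed schedule)."""
    return max(eps_min, epsilon * decay)
```

With `epsilon = 0.0` the choice is always greedy, and with `epsilon = 1.0` it is always random, which matches the two branches of the rule.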
Moreover, we make a transition from one state $s$ to the next state $s'$ by performing some action $a$ and receive a reward $r$, as $T(s, a, s')$. Consecutive transitions are strongly correlated, so a neural network trained on them in order may overfit. Therefore, we saved the transition information in a buffer called replay memory and trained the DQN with random transitions sampled from the replay memory instead of training with the most recent transitions. This reduces the correlation between the experiences used in each learning step and, in turn, the overfitting of the model.
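A replay memory of this kind can be sketched with a bounded queue. The class below is a minimal illustration under assumed names (`ReplayMemory`, `Transition`, `capacity`); it is not the buffer used in the paper's code.

```python
import random
from collections import deque, namedtuple

# One stored experience: state, action taken, reward received, next state.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Fixed-size buffer that discards the oldest transitions when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering
        # of consecutive transitions.
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)
```

Sampling a mini batch uniformly, rather than taking the last few transitions, is what decorrelates the training examples.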

3.1.2. Deep Reinforcement Learning Model (DRL)

The action of the DRL agent is to determine whether or not a data point is an abnormality. We assigned a value of 1 to the anomaly class and a value of 0 to the normal class. DQN was chosen as our reinforcement learning strategy. When state s is received, an MLP is used as the RL agent’s brain to generate Q-value, which is then followed by the Q function. The epsilon decay approach is used for exploration and exploitation. In order to explore the entire environment space, we use the greedy factor ϵ to determine whether our DRL agent should follow the Q function or randomly select an action.
For each iteration, DQN receives the set of states $S$ and predicts the label for training the DRL model. The transition is stored in replay memory. In each epoch, a mini batch of replay memory is sampled and used to train the model for loss minimization. Moreover, whether the model will learn well or not depends on the reward function, so a good reward function is important for the model’s performance. If we offer a high reward for correctly identifying normal data in datasets, DRL may identify all data as normal in order to get the highest score. If, on the other hand, we give a high reward for finding outliers, DRL might label all data as outliers to get the best score.
Since our datasets are imbalanced, we give a higher reward for the minority class than for the majority class and give a penalty when our model misclassifies [18]. This affects the resulting Q-values, and the model then selects the best action to maximize the rewards. The reward function is defined below:
$$ rewards = \begin{cases} A & \text{predicted anomaly correct} \\ B & \text{predicted wrong} \\ C & \text{predicted normal correct} \end{cases} $$
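The three-case reward can be written as a small function. The numeric values used for A, B, and C below are placeholders assumed only for this sketch (a large reward for a correct anomaly, a negative value as the misclassification penalty, and a small reward for a correct normal point); they are not the values used in the experiments.

```python
def reward(predicted, actual, a=1.0, b=-1.0, c=0.1):
    """Return the reward for one prediction; labels are 1 (anomaly) and 0 (normal).

    a, b, c stand for A, B, C in the reward definition. Their defaults are
    hypothetical and only respect the ordering described in the text:
    the minority (anomaly) class earns more than the majority class.
    """
    if predicted != actual:
        return b  # any misclassification is penalised
    return a if actual == 1 else c
```

Keeping `a` larger than `c` is what discourages the degenerate policy of labelling every point as normal.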
A general issue in training neural networks is to determine how long they should be trained. Too few epochs may result in the model learning insufficiently, whereas too many epochs may result in the model overfitting. So, the performance of the model must be monitored during training by evaluating it on a validation dataset at the end of each epoch and updating the saved model if its performance on the validation data is better than at the previous epochs. In our experiments, we selected five criteria as the conditions for generating the models: four performance metrics and the maximum number of epochs. The four measures are F1-score, the reward of each epoch, accuracy, and validation loss values. In the end, we will have five models: the finished training model ($DRL$), the model with the highest F1-score ($DRL_{F1}$), the model with the highest rewards ($DRL_{Rwd}$), the model with the highest accuracy ($DRL_{Acc}$), and the model with the lowest validation loss values ($DRL_{Valid}$).
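The selection of the five models can be illustrated by choosing, from a per-epoch validation log, the epoch that is best for each criterion. This is a sketch under assumptions: the function name `select_checkpoints` and the dictionary keys `f1`, `reward`, `accuracy`, and `val_loss` are hypothetical, and a real training loop would save the model weights at these epochs rather than only their indices.

```python
def select_checkpoints(history):
    """Map each model variant to the epoch index whose checkpoint it would keep.

    history is a list with one dict per epoch, holding the validation
    metrics 'f1', 'reward', 'accuracy' and 'val_loss' (assumed key names).
    """
    epochs = range(len(history))
    return {
        "DRL": len(history) - 1,  # model at the end of training
        "DRL_F1": max(epochs, key=lambda e: history[e]["f1"]),
        "DRL_Rwd": max(epochs, key=lambda e: history[e]["reward"]),
        "DRL_Acc": max(epochs, key=lambda e: history[e]["accuracy"]),
        "DRL_Valid": min(epochs, key=lambda e: history[e]["val_loss"]),
    }
```

When several epochs tie on a criterion, this sketch keeps the earliest one, which is one possible convention among several.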

3.1.3. Ensemble Methods

In general, the capacity of an individual model is limited and may have only learned some parts of the problem, and hence may make mistakes in the areas where it has not learned sufficiently. Therefore, it can be useful to combine some individual models to form an ensemble to allow them to work collectively to compensate for each other’s weaknesses. Many studies [4,58,59,60] have shown that if an ensemble is built with diverse models and appropriate decision-making functions, it can improve the accuracy of classification and also reliability. In our research, we created multiple ensembles by selecting suitable DRL models that had been generated from the previous experiments. We investigated two combining methods to aggregate the outputs from the member models of an ensemble: simple majority voting and weighted voting algorithms.
  • Majority Voting: The predictions of each model in an ensemble have to be aggregated, and the final prediction is the class that gets the most votes. Each of our ensembles will be built with an odd number of classifiers in order to avoid a tie situation in voting.
  • Weighted Voting: As the performance of individual models usually differs, treating them all the same way in decision making appears illogical, so we devised a weighted voting mechanism to take this difference into consideration when making a final decision in an ensemble. With the weighted voting method, the contribution from a model is weighted by its performance. For a model $m_i$, after it has been trained with the training data, its weight score $w_i$ is derived from its F1-score calculated on the given validation dataset; we then have a set of F1-scores of the models, $F1_m = \{F1_{m_1}, F1_{m_2}, \ldots, F1_{m_M}\}$. Then, these F1-scores are ranked to find the maximum and minimum scores. Finally, we calculate the normalised weighting score $w_i$ for model $m_i$ using the equation below:
    $$ w_i = \frac{F1_{m_i} - \min(F1_m)}{\max(F1_m) - \min(F1_m)}, \quad i = 1, \ldots, M $$
    The output of an ensemble, $\Phi(x)$, is calculated by multiplying the weight with the output of each individual model and taking the argument of the maxima as follows:
    $$ \Phi(x) = \arg\max \sum_{i=1}^{M} w_i \, m_i(x) $$
    where $M$ is the number of models in an ensemble, and $m_i(x)$ is the predicted class of model $i$.
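The two combining methods can be sketched as below. This is an illustration, not the paper's code: the function names are hypothetical, and the weighted vote is implemented under the assumption that each model adds its weight to the class it predicts and the class with the largest total wins. Note that min-max normalisation gives the weakest member a weight of zero, so it does not influence the weighted decision.

```python
import numpy as np


def majority_vote(predictions):
    """predictions: (M, n) array of 0/1 labels from an odd number M of models."""
    predictions = np.asarray(predictions)
    n_models = predictions.shape[0]
    # A point is an anomaly when more than half of the models say so.
    return (2 * predictions.sum(axis=0) > n_models).astype(int)


def normalised_weights(f1_scores):
    """Min-max normalise validation F1-scores into weights in [0, 1]."""
    f1 = np.asarray(f1_scores, dtype=float)
    span = f1.max() - f1.min()
    if span == 0:
        return np.ones_like(f1)  # all models equal: fall back to equal weights
    return (f1 - f1.min()) / span


def weighted_vote(predictions, weights, n_classes=2):
    """Sum each model's weight on its predicted class and take the arg max."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    scores = np.zeros((n_classes, predictions.shape[1]))
    for c in range(n_classes):
        scores[c] = (weights[:, None] * (predictions == c)).sum(axis=0)
    return scores.argmax(axis=0)
```

For three models with validation F1-scores 0.9, 0.6, and 0.3, the weights become 1.0, 0.5, and 0.0, so the best model can only be overruled where the other positive-weight members outweigh it.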

3.2. Data Labelling

Water level data from telemetry stations were unlabelled for anomalies. It is then necessary to assign ground truth labels to all anomalies and normal data points in each time series of water level data in order to train the models with supervised algorithms. This was manually done by a group of the domain experts at the HII in a manner similar to the ensemble approach. Each specialist looked at the data and identified all the anomalies based on their experience. Then their judgements were aggregated by taking a consensus to decide if a data point is an anomaly or not.

3.3. Datasets

Since the DRL algorithm takes a lot of time for training on the computing facilities that we had, we were limited to considering some relatively small datasets. After data preprocessing, eight stations from the HII telemetry water level stations were chosen for use in this experiment: CPY011, CPY012, CPY013, CPY014, CPY015, CPY016, CPY017, and YOM009. We chose the datasets from May and June 2016 for CPY011, CPY012, CPY013, CPY015, CPY016, and CPY017, and from the same months in 2015 for CPY014 and YOM009, because they have a low percentage of missing data. Figure 2 shows the water levels of these eight stations. It is visually clear that station YOM009 has very different behaviour from the others because it is located in a different region.
All the data are normalised and divided into three subsets: the first 60% of a time series for training, the next 20% for validation, and the last 20% for testing. Table 1 shows the statistics of one partition of the data from each station. As can be seen, in general, the rates of anomalies are quite low for most stations, but they vary considerably between stations. For example, they varied from 0.14% to 7.22% in the training data.
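The chronological 60/20/20 partition can be sketched as follows. The normalisation method is not specified in the text, so min-max scaling over the whole series is assumed here purely for illustration; the function names are likewise hypothetical.

```python
import numpy as np


def min_max_normalise(series):
    """Scale a series to [0, 1]; min-max scaling is an assumption of this sketch."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    if hi == lo:
        return np.zeros_like(series)  # constant series: nothing to scale
    return (series - lo) / (hi - lo)


def chronological_split(series, train_pct=60, valid_pct=20):
    """Split a time series in order: first part train, next validation, rest test."""
    n = len(series)
    i = n * train_pct // 100
    j = n * (train_pct + valid_pct) // 100
    return series[:i], series[i:j], series[j:]
```

The split preserves time order and does not shuffle, so the test subset always lies after the training and validation subsets.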

3.4. Evaluation Metrics and Comparison Methods

As our task is basically a classification problem, we chose some commonly used measures, Recall, Precision, and F1, to evaluate the accuracy of the models. They are defined by the following equations, based on the confusion matrix shown in Table 2.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $FP$, $FN$, and $TN$ denote the numbers of true positives (anomalies correctly predicted as anomalies), false positives (normal points incorrectly predicted as anomalies), false negatives (anomalies incorrectly predicted as normal), and true negatives (normal points correctly predicted as normal), respectively.
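As a small worked example of these three measures, the following Python sketch computes them from confusion-matrix counts. The function name and the counts are illustrative assumptions and are not taken from the study's data.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention chosen here).
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 5 anomalies detected, 2 missed, no false alarms.
p, r, f1 = precision_recall_f1(tp=5, fp=0, fn=2)
print(round(p, 4), round(r, 4), round(f1, 4))  # 1.0 0.7143 0.8333
```

Note that $TN$ does not enter any of the three measures, which is why they remain informative when anomalies are rare and normal points dominate the data.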
To make statistical comparisons, we implemented a statistically rigorous test for multiple classifiers across many datasets. This approach was initially described in [61] and is intended to examine the statistical significance of classifiers. This technique takes the strategy of testing the null hypothesis against the alternative hypothesis. The null hypothesis states that no difference exists between the average rankings of k algorithms on N datasets. The alternative hypothesis is that at least one algorithm’s average rank differs.
First, the $k$ methods are ranked according to their performance on each of the $N$ datasets; then, the average rank of each algorithm is calculated. To test the null hypothesis, the Friedman statistic is calculated using Equation (6).
$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j} R_j^2 - \frac{k(k+1)^2}{4}\right] \tag{6}$$
where $R_j$ is the average rank of the $j$th of $k$ algorithms over the $N$ datasets, and the statistic is approximated by a chi-squared distribution with $k-1$ degrees of freedom.
If the null hypothesis is rejected at the selected significance level $\alpha$, the post-hoc Nemenyi test is used to compare all classifiers to each other. The Nemenyi test is similar to the Tukey test for ANOVA and uses a critical difference (CD), which is presented in Equation (7).
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \tag{7}$$
where $q_\alpha$ is the critical value based on the Studentised range statistic divided by $\sqrt{2}$. The results of these tests are often visualised using a critical difference (CD) diagram. Classifiers are shown on a number line based on their average rank across all datasets, and bold CD lines are used to connect classifiers that are not significantly different.
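Equations (6) and (7) can be illustrated with a short, dependency-free Python sketch. The function names and the score table are made up for illustration; the $q_\alpha$ values are the commonly tabulated critical values for the two-tailed Nemenyi test at $\alpha = 0.05$ and should be checked against a published table before being relied on.

```python
import math
from typing import Dict, List, Sequence

# Commonly tabulated q_alpha values (alpha = 0.05), keyed by the number of classifiers k.
Q_ALPHA_005: Dict[int, float] = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949}


def rank_row(scores: Sequence[float]) -> List[float]:
    """Rank the scores of one dataset: highest score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def average_ranks(score_table: Sequence[Sequence[float]]) -> List[float]:
    """Average rank R_j of each algorithm (column) over all datasets (rows)."""
    n, k = len(score_table), len(score_table[0])
    rows = [rank_row(row) for row in score_table]
    return [sum(row[j] for row in rows) / n for j in range(k)]


def friedman_statistic(avg_ranks: Sequence[float], n: int) -> float:
    """Friedman chi-squared statistic, Equation (6)."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def critical_difference(k: int, n: int, q_alpha: float) -> float:
    """Nemenyi critical difference, Equation (7)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Made-up F1-scores: 4 datasets (rows) x 3 algorithms (columns).
table = [[0.90, 0.80, 0.70], [0.85, 0.80, 0.60], [0.70, 0.60, 0.50], [0.95, 0.90, 0.40]]
ranks = average_ranks(table)
chi2 = friedman_statistic(ranks, n=len(table))
print(ranks, chi2)  # [1.0, 2.0, 3.0] 8.0
print(round(critical_difference(k=5, n=8, q_alpha=Q_ALPHA_005[5]), 3))  # 2.157
```

In a CD diagram, two classifiers whose average ranks differ by less than the critical difference are joined by a bar; with five classifiers and eight datasets, as in the last line, that threshold is a little over two rank positions.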
For comparison with our approach, MLP and LSTM models were used with the same number of hidden layers and the same number of neurons in each hidden layer.

4. Experiment Design and Setting

4.1. Four Sets of Experiments

We designed four sets of experiments to test the DRL models and the ensemble models: (1) train various DRL models and test them with different data sampled from the same water level monitoring stations; (2) train various DRL models with the data from one station and then test them with the data from other stations; (3) build several ensembles by selecting different numbers of the DRL models and test them with the testing data from the same stations; and (4) test the ensembles with the data from different stations. The purpose of the cross-station testing is to evaluate the generalisation ability of the DRL models and the ensembles.
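The same-station and cross-station protocol can be sketched as a simple train-once, test-everywhere loop. In the sketch below, the threshold detector stands in for any model (DRL, MLP, or LSTM), and the station names "A" and "B", the data, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Dataset = Tuple[List[float], List[int]]  # (values, labels), where label 1 marks an anomaly


def f1_from_labels(y_true: List[int], y_pred: List[int]) -> float:
    """F1-score computed as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cross_station_f1(
    train_sets: Dict[str, Dataset],
    test_sets: Dict[str, Dataset],
    fit: Callable[[List[float], List[int]], float],
    predict: Callable[[float, List[float]], List[int]],
) -> Dict[Tuple[str, str], float]:
    """Train one model per source station and score it on every target station."""
    results = {}
    for src, (x_train, y_train) in train_sets.items():
        model = fit(x_train, y_train)
        for dst, (x_test, y_test) in test_sets.items():
            results[(src, dst)] = f1_from_labels(y_test, predict(model, x_test))
    return results


# Toy stand-in model: flag anything above the largest normal training value.
def fit_threshold(values: List[float], labels: List[int]) -> float:
    return max(v for v, y in zip(values, labels) if y == 0)


def predict_threshold(threshold: float, values: List[float]) -> List[int]:
    return [1 if v > threshold else 0 for v in values]


train_sets = {"A": ([1.0, 1.1, 5.0, 1.2], [0, 0, 1, 0]), "B": ([3.0, 3.2, 9.0, 3.1], [0, 0, 1, 0])}
test_sets = {"A": ([1.0, 6.0, 1.1], [0, 1, 0]), "B": ([3.0, 3.1, 8.0], [0, 0, 1])}
results = cross_station_f1(train_sets, test_sets, fit_threshold, predict_threshold)
```

Entries with the same source and target station correspond to experiments (1) and (3); the remaining entries correspond to the cross-station experiments (2) and (4). In this toy case the model from station A raises false alarms on station B because the two stations have different normal water levels.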

4.2. Parameter Setting

For the DRL model, a multilayer perceptron was used as the Q-network, with 36 nodes in the input layer, one hidden layer with 18 nodes, and 2 nodes in the output layer. An epsilon-greedy policy was used for exploration, with ϵ ranging from 0.1 to 0.0001. The size of the replay memory was 50,000, and the discount factor of intermediate rewards, γ, was 0.99. The Adam algorithm was used to optimise the parameters of the Q-network, with a learning rate of 0.001. The batch size was 256, and the models were trained with 100, 500, 1000, 5000, and 10,000 episodes. An episode ended when the number of incorrectly identified anomalies was greater than the number of certain anomalies in the training set, or when all the samples in the training set had been used. We set the reward function parameters A, B, and C to 0.9, −0.1, and 0.1, respectively. Furthermore, a window size of 6 was chosen to save time during the training process.
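To make the exploration and replay-memory settings concrete, the following is a minimal Python sketch using the values quoted above. The linear annealing schedule, the coding of the two actions (0 for normal, 1 for anomaly), and all function names are assumptions; the text gives only the range of ϵ, the memory size, and the batch size.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

EPS_START, EPS_END = 0.1, 0.0001   # exploration range from the text
REPLAY_CAPACITY = 50_000           # replay memory size from the text
BATCH_SIZE = 256                   # batch size from the text

Transition = Tuple[Sequence[float], int, float, Sequence[float], bool]


def epsilon_at(step: int, total_steps: int) -> float:
    """Anneal epsilon linearly from EPS_START to EPS_END (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def select_action(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy choice: explore with probability epsilon, otherwise take the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Fixed-size memory: once full, the oldest transitions are discarded automatically.
replay: Deque[Transition] = deque(maxlen=REPLAY_CAPACITY)


def remember(state, action: int, reward: float, next_state, done: bool) -> None:
    replay.append((state, action, reward, next_state, done))


def sample_batch(rng: random.Random) -> List[Transition]:
    """Draw a random mini-batch of stored transitions for one Q-network update."""
    return rng.sample(list(replay), min(BATCH_SIZE, len(replay)))


rng = random.Random(0)
q_values = [0.2, 0.7]  # hypothetical Q-network output for [normal, anomaly]
action = select_action(q_values, epsilon_at(step=0, total_steps=1000), rng)
```

With ϵ at most 0.1, the agent mostly follows the Q-network from the start and explores only occasionally; by the end of the schedule it acts almost entirely greedily.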
For comparison, MLP and LSTM were used with the identical structures as we used in DRL. They were trained using 100 epochs with early stopping to avoid overfitting. For each setting, the experiments were repeated 10 times with variations, and then the means and standard deviations of the results are reported in the next section.

4.3. Computing Facilities

All the experiments were coded in the Python programming language (V3.6) (Python Software Foundation, accessed on 30 June 2022) and TensorFlow 2.8, and run on a personal computer with an Intel Core i5-7500 CPU @ 3.4 GHz, 32 GB RAM, and a 64-bit operating system.

5. Results

5.1. Accuracies of DRL Models

For each station, various DRL models were generated over a range of epochs from 100 to 10,000, with the intention of investigating how well our proposed DRL method learns at the different points of training. The results are shown in Table 3.
Using the CPY011 dataset, we observed that $DRL$ and $DRL_{Rwd}$ with 1000 training iterations not only earned the highest F1-score of 0.8333, with 0.7143 recall and 1.0000 precision, but also provided the highest average F1-score of 0.7433. However, after 1000 epochs of training, the performance of all models, with the exception of $DRL_{Valid}$, decreased and then rose again when 10,000 epochs were used.
The top model for identifying anomalies on the CPY012 dataset is $DRL_{Valid}$, with a maximum F1-score of 0.7826 after 10,000 training epochs. However, $DRL_{Acc}$ obtained the greatest average F1-score, 0.7234. Meanwhile, $DRL_{F1}$ and $DRL_{Acc}$ with 10,000 training epochs delivered the highest F1-score for identifying anomalies in CPY013 data, at 0.8000. Furthermore, $DRL_{F1}$ provided the highest average F1-score of 0.6963.
With just 500 epochs of training on CPY014 data, $DRL_{Rwd}$ and $DRL_{Valid}$ delivered the best F1-score of 0.8571. However, the maximum average F1-score, achieved by $DRL_{F1}$ and $DRL_{Acc}$, was just 0.6733. On CPY015 data, the best models are $DRL_{F1}$ and $DRL_{Acc}$, as their F1-scores were the highest at many training epochs.
$DRL_{Acc}$ was the best model for detecting anomalies in CPY016 data, since it not only had the greatest F1-score at almost every training epoch but also had the highest average F1-score of 0.5714. Meanwhile, every model achieved the best F1-score of 0.8571, with 100 percent recall and 0.7500 precision, when trained with 100 epochs on CPY017, with the exception of the $DRL_{F1}$ model, which achieved an F1-score of just 0.6667. While the best models for detecting anomalies on YOM009 are $DRL_{Acc}$ and $DRL_{Valid}$, which both have the same F1-score of 0.4769, the worst model is $DRL$ trained with 5000 iterations, with a 0.2728 F1-score, 0.5538 recall, and 0.1818 precision.
Figure 3 shows the comparison of the critical differences between the different DRL models. The number associated with each algorithm is the average rank of the DRL models on each type of dataset, and solid bars represent groups of classifiers with no significant difference. There is no statistically significant difference across the models, with $DRL_{Acc}$ ranking first, followed by $DRL_{F1}$, $DRL$, and $DRL_{Rwd}$, and with $DRL_{Valid}$ ranking last.
Figure 4 shows a line graph of the F1-score of each model as the number of training epochs increases. As the number of epochs is increased, the performance of all deep reinforcement learning models using data from CPY012, CPY013, and CPY015 tends to improve. When training with CPY014 data, on the other hand, the F1-score of each model tends to stay the same or go down as the number of epochs goes up. For the models trained with CPY016 data, the F1-score of each model tends to stabilise and slightly decrease, with the exception of $DRL_{Valid}$, which tends to grow after 5000 epochs of training. For the models trained with the CPY017 dataset, the F1-score of $DRL_{F1}$ went up after training with 1000 epochs and then went down. The other models, however, improved when trained with more epochs, even though the performance of some of them went down after 1000 epochs. The F1-scores of the models trained with CPY011 and YOM009 remained stable when training with more epochs.
Figure 5 shows the findings of the best DRL model for each station. The DRL model performs well, capturing the majority of abnormalities in the testing datasets. However, it still did not work well when anomalies occurred in data that changed frequently, such as the anomalies in YOM009 data between 29 June and 1 July 2015, and in CPY015 data on 19 June 2016.

5.2. Performance on the Same Station

We compared the performance of our techniques with MLP and LSTM models on the eight telemetry water level datasets. The data from each station were first divided into training, validation, and testing parts in a 6:2:2 ratio. The results were averaged over ten runs and then compared with the averaged DRL models of each station, as shown in Table 4. $DRL_{F1}$ and $DRL_{Acc}$ had the highest average F1-scores for detecting anomalies on CPY015, with F1-scores of 0.4133. MLP had the greatest average F1-score on CPY011, CPY012, and CPY014, with scores of 0.8505, 0.7822, and 0.8571, respectively. On the other stations, LSTM was the top-performing model. According to the CD diagram in Figure 6, the LSTM model had the highest performance ranking, followed by $DRL_{Acc}$ and MLP.
Since RL models need time to learn until they have enough knowledge to do their task, the time cost is an important consideration. We measured the time spent by the best deep reinforcement learning models (BDRL) and the comparative models, as shown in Table 5. The MLP model requires the least training time per epoch, with an average of 0.30 s, followed by the LSTM model at 0.64 s and the DRL model at 17.56 s. With early stopping, MLP and LSTM needed an average of 12 and 15 training epochs, respectively, while our method requires around 4638 epochs to get optimal results. This means that the MLP model took an average of 2.97 s to train and LSTM took 9.20 s, while DRL took an average of 78,756 s, which is about 22 h.
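As a quick check using only the figures quoted in this paragraph, the total DRL training time in hours and the per-epoch cost ratios work out as

$$\frac{78{,}756\ \mathrm{s}}{3600\ \mathrm{s/h}} \approx 21.9\ \mathrm{h}, \qquad \frac{17.56\ \mathrm{s}}{0.30\ \mathrm{s}} \approx 58.5, \qquad \frac{17.56\ \mathrm{s}}{0.64\ \mathrm{s}} \approx 27.4$$

so one DRL epoch costs roughly 58 times as much as one MLP epoch and roughly 27 times as much as one LSTM epoch. The reported totals are presumably averages over the stations, which would explain why they are not exactly the product of the average time per epoch and the average number of epochs (for example, $0.30 \times 12 = 3.6$ s rather than 2.97 s for MLP).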

5.3. Performance on the Different Station

After generating various models on some stations’ data and testing them with the same stations, we tested these models with the data collected from different stations with the intention of examining their generalisation ability. The F1-scores of each model are provided in Table 6.
Using $DRL_{Rwd}$, the best model trained on CPY011 data, to identify anomalies from other stations, we can see that it works rather well, with F1-scores ranging from 0.4 on CPY014 to 0.65 on CPY013, but it is unable to detect anomalies on CPY015 and YOM009. The BDRL model trained on CPY012, $DRL_{Valid}$, performed well on the CPY013, CPY014, and CPY016 datasets, with F1-scores greater than 0.61; on CPY014 in particular it reached an F1-score of 0.8571, which is higher than on its own dataset. However, it performed poorly on the other stations, with F1-scores lower than 0.4000. Similarly, $DRL_{F1}$, which was trained on CPY013 data, performs well not only on its own dataset but also on the CPY014 dataset, with an F1-score of 0.8571. The BDRL model trained on CPY014, $DRL_{Rwd}$, did the worst when it was used to find anomalies in other stations’ data, with an F1-score of less than 0.23 for every dataset and the lowest F1-score of only 0.0255 for CPY011. The best model trained on CPY015 also performed poorly: its highest F1-score was 0.4138 on CPY011, and it was unable to identify anomalies on CPY014, CPY017, and YOM009. Meanwhile, the best model trained on CPY016 achieved its best result on other stations on CPY013, with an F1-score of 0.5421. The model trained on CPY017 did best at finding anomalies in data from CPY012, CPY013, and CPY014, with F1-scores greater than 0.58. The best model trained on YOM009 achieved low F1-scores on CPY011, CPY015, and CPY017, the lowest being 0.0839. However, on CPY012, CPY013, CPY014, and CPY016 it obtained F1-scores higher than 0.59, which is better than on its own data.
It is worth noting that models trained using CPY014 and CPY015 data perform poorly when used to identify anomalies from other stations. This may be because the actual numbers of anomalies at those stations are relatively low and most of them are extreme outliers, as shown in Figure 2, so the models were trained with only those kinds of anomalies, which may not be enough for the models to learn from. In contrast, YOM009 has a large number and variety of anomalies for the model to learn from; as a result, the model trained on it can identify abnormalities on CPY012, CPY013, CPY014, and CPY016 better than models trained on other stations.
Then, we tested MLP and LSTM using data from different stations to compare our method with the candidate models. Table 7 presents the results of the MLP models when tested with the datasets from the same and different stations. Using the CPY011 dataset, the MLP models achieved the highest F1-score of 0.5430 on CPY016, although they were unable to identify anomalies on CPY014 and YOM009. Similarly, the MLP models trained on CPY012 offered good results, with F1-scores of more than 0.63, with the exception of CPY011, CPY015, and YOM009, which produced F1-scores of less than 0.4. The best MLP trained on the CPY013 dataset provided its highest F1-score on the CPY014 dataset (0.8571) and its lowest on CPY015 (0.2093). Anomalies on the YOM009 dataset were the most difficult for the MLP models trained on CPY014 to detect, with an F1-score of just 0.1818. However, these models produced excellent results in identifying anomalies on CPY017, with a 1.0000 F1-score. Meanwhile, the MLP model trained on the CPY015 dataset performed poorly when detecting abnormalities from other stations. On the other hand, the MLP models trained on CPY016 and CPY017 generated good results when used to identify anomalies from other stations, despite still performing poorly on some stations. In contrast, the MLP model trained on YOM009 worked well when used to detect abnormalities on other stations but performed badly when detecting anomalies on its own data. Furthermore, it performed well on CPY017 data, with a 1.0000 F1-score.
The results of the LSTM models are presented in Table 8. They performed well, with an average F1-score of more than 0.42 for each station except CPY015, which had an average F1-score of 0.1099. However, they performed poorly on some stations: the LSTM trained on CPY016 achieved an F1-score of only 0.1754 when used to detect anomalies on the CPY011 dataset, and the LSTM trained on the CPY015 dataset was unable to detect anomalies on the CPY014, CPY017, and YOM009 datasets. In contrast, the LSTM trained on the CPY014 dataset provided excellent performance when detecting anomalies on CPY017. The LSTM trained on YOM009 did well at finding anomalies from other stations, especially CPY014 and CPY017, with an F1-score of 0.8571.
Furthermore, we generated a bar chart to compare the average F1-score of each model when tested with the data collected from different stations, as shown in Figure 7. When evaluated with data from other stations, the models trained with CPY012 and CPY013 produced an average F1-score greater than 0.4. The models trained on CPY015 performed poorly when used to identify anomalies from other stations, with an average F1-score lower than 0.2. DRL models trained with CPY015 outperform the other models in detecting anomalies in data from other stations. LSTM models trained on CPY011, CPY012, CPY016, and CPY017, on the other hand, outperform the other models in detecting abnormalities on other datasets. When trained with data from CPY013, CPY014, and YOM009, MLP had the best F1-score for finding outliers in other datasets.

5.4. Ensemble Results

Since we have multiple RL models after each epoch of training, and since each model is the best under one of the criteria, we built an ensemble that combines the decisions of the RL models, with the aim of generating a better final decision. For model selection, we either used all five models or selected the three models with the highest F1-scores to build our ensemble model. For decision making, we used majority voting and weighted voting strategies to make a final decision. So, we have four ensemble models for each epoch of training: majority voting ensembles with 3 ($EDRL_3$) and 5 ($EDRL_5$) models, and weighted ensembles with 3 ($WEDRL_3$) and 5 ($WEDRL_5$) models.
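The two voting strategies and the top-three selection can be sketched in Python as follows. The use of each model's F1-score as its voting weight, the rule that ties go to the normal class, and the predictions and scores themselves are assumptions made for illustration; the text above does not specify how the weights are set.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """predictions[m][t] is the 0/1 label of model m at time step t; strict majority wins."""
    n_models = len(predictions)
    return [1 if 2 * sum(col) > n_models else 0 for col in zip(*predictions)]


def weighted_vote(predictions: Sequence[Sequence[int]], weights: Sequence[float]) -> List[int]:
    """Flag an anomaly when the weighted votes for it exceed half of the total weight."""
    total = sum(weights)
    return [
        1 if 2 * sum(w * p for w, p in zip(weights, col)) > total else 0
        for col in zip(*predictions)
    ]


def top_k_indices(f1_scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k models with the highest F1-scores."""
    return sorted(range(len(f1_scores)), key=lambda i: f1_scores[i], reverse=True)[:k]


# Hypothetical outputs of five DRL models over four time steps (1 = anomaly).
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
f1_scores = [0.80, 0.60, 0.75, 0.50, 0.40]  # made-up F1-scores of the five models

top3 = top_k_indices(f1_scores, 3)
edrl5 = majority_vote(preds)
edrl3 = majority_vote([preds[i] for i in top3])
wedrl5 = weighted_vote(preds, f1_scores)
wedrl3 = weighted_vote([preds[i] for i in top3], [f1_scores[i] for i in top3])
print(edrl5, wedrl5)  # [1, 0, 0, 0] [1, 0, 1, 0]
```

In this toy case the third time step is flagged by only two of the five models, so majority voting rejects it, but those two models carry the largest weights, so weighted voting accepts it.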

5.4.1. Performance on the Same Station

The results of our ensemble models, shown in Table 9, demonstrate that the majority voting and weighted voting ensembles generated from the top three DRL models of CPY011 performed best, with an F1-score of 0.8333, while $WEDRL_3$, generated from the DRL models trained with 10,000 epochs, is the best model for detecting anomalies in the CPY012 dataset, with an F1-score of 0.7941. The best-performing ensemble models for CPY013 are $EDRL_3$ and $WEDRL_3$, at 0.8000. The best ensemble model for identifying anomalies in the CPY014 dataset provided an F1-score of 0.8571. With CPY015 data, the models with the highest F1-score are $EDRL_3$, $WEDRL_3$, and $WEDRL_5$. These models were built from the individual DRL models trained for 10,000 iterations. Meanwhile, $WEDRL_3$ got the highest F1-score of 0.5922 for CPY016 by combining the best three DRL models trained over 5000 iterations. With CPY017, $EDRL_5$ outperforms the other ensemble models, with 100 percent in every metric. For YOM009, $WEDRL_5$, generated from the DRL models after 500 epochs of training, offered the highest performance, with an F1-score of 0.5032.
Figure 8 shows line charts of the F1-score of each ensemble model trained using data from each station. The results make clear that the ensemble models not only delivered good performance, with F1-scores that tended to either improve or remain steady, but also reduced false alarms by increasing the precision scores. When we compared the results at each training epoch of the individual DRL models and the ensemble models, as shown in Table 3 and Table 9, we found that the ensemble models performed better than every single DRL model in many training epochs. In particular, $EDRL_5$ on CPY017 with 500 training epochs achieved a score of 1.0000 in every metric, resulting from a 25% increase in accuracy and a 15% increase in F1-score. Meanwhile, $EDRL_5$ on CPY011 with 10,000 training epochs improved on the best individual model, raising the F1-score from 0.75 to 0.8750, reaching 1.00 in recall, and increasing precision by 20%. By combining DRL models trained for only 500 epochs, the ensemble model on YOM009 obtained the highest F1-score of 0.5032.
As shown in Table 10, we compared the average F1-score of each individual DRL model and each ensemble of DRL models against the other neural network models. The LSTM model was the best model when detecting anomalies on CPY013, CPY014, CPY016, and CPY017, while $WEDRL_3$ provided the highest average F1-score on CPY015 and YOM009. The highest F1-score on CPY015 was 0.4134, which was provided by $DRL_{F1}$, $DRL_{Acc}$, $EDRL_3$, and $WEDRL_3$. Although MLP and LSTM beat other models in many datasets, $WEDRL_3$ has the greatest average ranking, as shown in Figure 9. In other words, the ensemble model not only has the potential to improve the performance of a single model, but is also more reliable than a single model in delivering excellent performance.

5.4.2. Performance on the Different Station

We then tested the generalisation ability of the best ensemble ($WEDRL_3$) with the data collected from different stations. The F1-score of each model is given in Table 11. The ensemble model created from the models trained on CPY011 data performed well not only on its own dataset but also on CPY017, with an F1-score of 0.8200. Similarly, $WEDRL_3$ on CPY012 and CPY013 recognised anomalies on CPY014 better than on their own datasets, with F1-scores of 0.8421 and 0.8143, respectively. Conversely, the ensemble models trained on the CPY014, CPY015, and CPY016 datasets performed poorly when used to detect anomalies on other stations. Even though the ensemble model trained on the CPY017 dataset got an F1-score of more than 0.5 on CPY012, CPY013, and CPY014, it did not do well on many stations, with an F1-score of less than 0.3. $WEDRL_3$ scored badly not just on its own dataset but also on others, with F1-scores ranging from 0.0739 on CPY015 to 0.5748 on CPY016.

5.4.3. Ensemble with All Seven Models

Then, to learn more about how well the ensemble worked, we combined our DRL models with the MLP and LSTM models. For model selection, we used all seven models, or selected the five or three models with the highest F1-scores, to build our ensemble models. We used the same voting strategies to make a final decision. So, we have six ensemble models for each epoch of training: majority voting ensembles with 3 (E3), 5 (E5), and 7 (E7) models, and weighted ensembles with 3 (WE3), 5 (WE5), and 7 (WE7) models. The results are displayed in Table 12.
We can see that, on the CPY011 dataset, the ensemble of the top three models (E3) earned the greatest F1-score of 0.9231 with every epoch of training. On CPY012, the greatest F1-score of 0.8438 was obtained by E5 and WE7 with models trained with 10,000 epochs, and E7 with models trained with 500 epochs, while E3 and WE3 models trained with 10,000 epochs performed the best in identifying anomalies on the CPY013 dataset. With the CPY014 dataset, all ensemble models gave an F1-score of 0.8571, with the exception of the ensemble with majority voting of all seven models trained with 10,000 epochs, which performed badly with an F1-score of 0.6667. WE7 surpassed other ensemble models on the CPY015 and CPY016 datasets, with the greatest F1-score of 0.4615 and 0.6704, respectively. Every ensemble model on CPY017 produced outstanding results with a 1.0000 F1-score, particularly E3, WE3, WE5, and WE7, which produced excellent results with all training epochs. The weighted ensemble with 5 models (WE5) trained with 500 epochs performed the best on the YOM009 dataset, with a 0.5032 F1-score.
As indicated in Table 13, we averaged the F1-score of each individual model and ensemble model to compare their performance. E3 not only was the best model, with the greatest average F1-score on all datasets, but also achieved a 1.0000 F1-score on CPY017 and YOM009. Among the models tested on the CPY014 dataset, the best F1-score of 0.8571 was achieved by MLP, LSTM, E3, E5, WE3, WE5, and WE7. In contrast, on the CPY015 dataset, the DRL-based models ($DRL_{F1}$, $DRL_{Acc}$, $EDRL_3$, and $WEDRL_3$) generated the highest F1-score of 0.4134. Furthermore, as shown in Figure 10, a CD diagram was used to make a statistical comparison of our results, which revealed that E3 had the highest ranking, and that the ensemble models built from the pool of seven individual models outperformed both the individual models and the ensemble models created using only DRL models. It also demonstrated the ability of ensemble methods to improve the performance of individual DRL models, because the ensembles differed significantly from the individual models ($DRL$, $DRL_{Rwd}$, and $DRL_{Valid}$).
We then tested the generalisation ability of the ensemble models with the data collected from different stations. The F1-score of each station is given in Table 14. Using ensemble E3 trained on CPY011 data to identify anomalies from other stations, we can see that it works well, with F1-scores of more than 0.5800, but it performed poorly at detecting anomalies on CPY015 and YOM009, with F1-scores of 0.3444 and 0.1017, respectively. E3 on CPY012 performed well when detecting anomalies on CPY014, with a 0.8635 F1-score. Similarly, E3 on CPY013 provided a higher F1-score than on its own dataset when detecting anomalies on CPY012 and CPY014, with F1-scores of 0.8060 and 0.8571, respectively. The best ensemble on CPY014 performed excellently when identifying anomalies on CPY017 data. In contrast, E3 on CPY015 performed poorly on YOM009, with an F1-score of only 0.0437. Although E3 on CPY016 performed well on CPY012, CPY013, and CPY014, with F1-scores higher than 0.6, it performed poorly on CPY011, CPY015, CPY017, and YOM009, with F1-scores lower than 0.45. E3 on CPY017 provided good results, with F1-scores of more than 0.69, except on CPY015, CPY016, and YOM009, where the F1-scores were lower than 0.56. Meanwhile, E3 on YOM009 obtained an F1-score of only 0.43 on its own data, but it performed excellently when detecting anomalies on CPY017 and the other datasets, with F1-scores higher than 0.65, except on CPY015, with an F1-score of 0.2414.

6. Discussion

We can observe that as the number of training epochs increases, the performance of each model does not change monotonically: it rises or falls, then drops and bounces back. This might indicate that the model is still learning or is learning too much; that is, it is difficult to decide when to stop training.
Even though DRL can do better than other models, it is time-consuming (at least 50 times slower than MLP models on average), because we have to train it until it performs well enough and we cannot predict how long that will take. The window size must also be taken into account: a larger window takes more time than a smaller one, and the window size affects the comparison of data within windows that is used to identify anomalies. Additionally, we may add further neural networks to improve the accuracy of our technique, but training will then take longer.
DRL does better than other models when it is trained on datasets with a low number of outliers. This suggests an ability to detect unknown anomalies. However, its performance is still insufficient, which may be due to the imbalance in our datasets. As a result, the models may lack sufficient information to explore and leverage knowledge for adaptive detection of unknown abnormalities.
Moreover, the neural structure that works well with one station may not function well with another. Hence, the problems of this topic include determining the suitable neural structure for each station. Furthermore, the primary parameter that requires further attention is the reward function, since a suitable reward will impact the model’s learning process.
In the case of ensemble models, when all of the individual models in an ensemble perform similarly, majority voting is the best method for determining the final decision. However, when the accuracies of the individual models differ, weighted voting is the best way to utilise the strengths of the good models in making a decision. Furthermore, the ensemble models can also reduce the false alarm rate, as seen from the increased precision scores. It should be noted that, although single models such as LSTM performed well on certain stations, they did poorly on others. As a result, we cannot rely on a single model, since we do not know in advance whether it is the best or not. The ensemble models, on the other hand, are more reliable, even though they may not produce the best accuracy for every station. On the whole, most ensembles, such as $WEDRL_3$, performed consistently well and their accuracies are ranked highly at every station, whilst the individual models (DRL, MLP, and LSTM) are not consistent throughout all the stations.

7. Conclusions

In this research, we firstly investigated how deep reinforcement learning (DRL) can be applied to detect anomalies in water level data and then devised two strategies to construct more effective and reliable ensembles. For DRL, we defined a reward function as it plays a key role in determining the success of an RL. We developed ensemble models with five deep reinforcement learning models, generated by the same DRL algorithm but with different criteria of performance measurement. We tested our ensemble approach on telemetry water level data from eight different stations. We compared our approach to two different neural network models. Moreover, we demonstrate the ability to detect unknown anomalies by using the trained model to detect anomalies from other stations’ data.
The results indicate that the $DRL_{Acc}$ models are the best individual DRL models, but they performed slightly worse than LSTM. When tested on different stations, LSTM still does better than the others, but its accuracy is not satisfactory. Compared with the ensembles of DRL models, LSTM was more accurate at some stations but less accurate at others. On the whole, the statistical results from the CD diagram showed that our ensemble approach with only 3 DRL member models, $WEDRL_3$, was superior. Furthermore, all ensemble models that were built by selecting models from the 5 DRL models, MLP, and LSTM outperformed both the best individual model, LSTM, and the best ensemble using DRL models, $WEDRL_3$. This is supported by the highest F1-scores and the rankings in the CD diagram. It is clear that ensemble methods not only increased the accuracy of a single model but also provided a higher reliability of performance.
In conclusion, DRL is applicable for detecting anomalies in telemetry water level data, with the added benefit of detecting unknown anomalies. Our ensemble construction methods can be used to build ensemble models from selected single DRL models in order to increase accuracy and reliability. In general, the ensembles are consistent in producing more accurate classification, although they may not always achieve the best results. Moreover, they are superior in reducing the number of false alarms in identifying abnormalities in water level data, which is very important in real applications. The next stage in our study will be to develop more effective and efficient techniques for correcting the identified anomalies in the data.

Author Contributions

Conceptualization, T.K. and W.W.; methodology, T.K. and W.W.; formal analysis, T.K. and W.W.; investigation, T.K. and W.W.; resources, T.K.; writing and revision: T.K. and final revision: W.W.; project administration, T.K. All authors have read and agreed to the published version of the manuscript.


This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Owing to restrictions (e.g., privacy), the data presented in this study are available on request from the Hydro-Informatics Institute (HII).


Acknowledgments

The authors would like to thank the Hydro-Informatics Institute of the Ministry of Higher Education, Science, Research and Innovation, Thailand, for providing the scholarship for Thakolpat Khampuengson to do his Ph.D. at the University of East Anglia.

Conflicts of Interest

The authors declare no conflict of interest.


Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
BDRL: Best Deep Reinforcement Learning model
DQN: Deep Q-Learning Network
DRL: Deep Reinforcement Learning
HII: Hydro-Informatics Institute
LSTM: Long Short-Term Memory
MLP: Multilayer Perceptron
NN: Neural Network
RL: Reinforcement Learning


  1. World Bank. Thai Flood 2011: Rapid Assessment for Resilient Recovery and Reconstruction Planning; World Bank: Washington, DC, USA, 2012. [Google Scholar]
  2. UNDRR. Disaster Risk Reduction in Thailand: Status Report 2020; United Nations Office for Disaster Risk Reduction (UNDRR): Geneva, Switzerland, 2020. [Google Scholar]
  3. Khampuengson, T.; Bagnall, A.; Wang, W. Developing Ensemble Methods for Detecting Anomalies in Water Level Data. In Proceedings of the International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK, 15–17 December 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 145–151. [Google Scholar]
  4. Wang, W. Some Fundamental Issues in Ensemble Methods. In Proceedings of the IEEE World Congress on Computational Intelligence, Hong Kong, China, 1–8 June 2008; pp. 2243–2250. [Google Scholar] [CrossRef]
  5. Chauhan, S.; Vig, L. Anomaly detection in ECG time signals via deep long short-term memory networks. In Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Porto, Portugal, 6–9 October 2021; IEEE: Piscataway, NJ, USA, 2015; pp. 1–7. [Google Scholar]
  6. Kim, T.Y.; Cho, S.B. Web traffic anomaly detection using C-LSTM neural networks. Expert Syst. Appl. 2018, 106, 66–76. [Google Scholar] [CrossRef]
  7. Munir, M.; Siddiqui, S.A.; Dengel, A.; Ahmed, S. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. IEEE Access 2018, 7, 1991–2005. [Google Scholar] [CrossRef]
  8. Pang, G.; van den Hengel, A.; Shen, C.; Cao, L. Deep reinforcement learning for unknown anomaly detection. arXiv 2020, arXiv:2009.06847. [Google Scholar]
  9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  10. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
  11. Kormushev, P.; Calinon, S.; Caldwell, D.G. Reinforcement learning in robotics: Applications and real-world challenges. Robotics 2013, 2, 122–148. [Google Scholar] [CrossRef]
  12. Polydoros, A.S.; Nalpantidis, L. Survey of model-based reinforcement learning: Applications on robotics. J. Intell. Robot. Syst. 2017, 86, 153–173. [Google Scholar] [CrossRef]
  13. Sharma, A.R.; Kaushik, P. Literature survey of statistical, deep and reinforcement learning in natural language processing. In Proceedings of the 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 5–6 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 350–354. [Google Scholar]
  14. Luketina, J.; Nardelli, N.; Farquhar, G.; Foerster, J.; Andreas, J.; Grefenstette, E.; Whiteson, S.; Rocktäschel, T. A survey of reinforcement learning informed by natural language. arXiv 2019, arXiv:1906.03926. [Google Scholar]
  15. Le, N.; Rathour, V.S.; Yamazaki, K.; Luu, K.; Savvides, M. Deep reinforcement learning in computer vision: A comprehensive survey. Artif. Intell. Rev. 2021, 55, 2733–2819. [Google Scholar] [CrossRef]
  16. Huang, C.; Wu, Y.; Zuo, Y.; Pei, K.; Min, G. Towards experienced anomaly detector through reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  17. Hsu, Y.F.; Matsuoka, M. A deep reinforcement learning approach for anomaly network intrusion detection system. In Proceedings of the 2020 IEEE 9th International Conference on Cloud Networking (CloudNet), Piscataway, NJ, USA, 9–11 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  18. Lin, E.; Chen, Q.; Qi, X. Deep reinforcement learning for imbalanced classification. Appl. Intell. 2020, 50, 2488–2502. [Google Scholar] [CrossRef]
  19. Pulido, M.; Melin, P.; Castillo, O. Particle swarm optimization of ensemble neural networks with fuzzy aggregation for time series prediction of the Mexican Stock Exchange. Inf. Sci. 2014, 280, 188–204. [Google Scholar] [CrossRef]
  20. Ikram, S.T.; Cherukuri, A.K.; Poorva, B.; Ushasree, P.S.; Zhang, Y.; Liu, X.; Li, G. Anomaly detection using XGBoost ensemble of deep neural network models. Cybern. Inf. Technol. 2021, 21, 175–188. [Google Scholar] [CrossRef]
  21. Yang, H.; Liu, X.Y.; Zhong, S.; Walid, A. Deep reinforcement learning for automated stock trading: An ensemble strategy. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; pp. 1–8. [Google Scholar]
  22. Liu, H.; Yu, C.; Wu, H.; Duan, Z.; Yan, G. A new hybrid ensemble deep reinforcement learning model for wind speed short term forecasting. Energy 2020, 202, 117794. [Google Scholar] [CrossRef]
  23. Rousseeuw, P.J.; Hubert, M. Robust statistics for outlier detection. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011, 1, 73–79. [Google Scholar] [CrossRef]
  24. Zimek, A.; Filzmoser, P. There and back again: Outlier detection between statistical reasoning and data mining algorithms. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1280. [Google Scholar] [CrossRef]
  25. Kumar, A.; Srivastava, A.; Bansal, N.; Goel, A. Real time data anomaly detection in operating engines by statistical smoothing technique. In Proceedings of the 2012 25th IEEE Canadian Conference on Electrical & Computer Engineering (CCECE), Montreal, QC, Canada, 29 April–2 May 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1–5. [Google Scholar]
  26. Lin, J.; Sheng, G.; Yan, Y.; Zhang, Q.; Jiang, X. Online Monitoring Data Cleaning of Transformer Considering Time Series Correlation. In Proceedings of the 2018 IEEE/PES Transmission and Distribution Conference and Exposition (T&D), Denver, CO, USA, 16–19 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–9. [Google Scholar]
  27. Aminikhanghahi, S.; Cook, D.J. A survey of methods for time series change point detection. Knowl. Inf. Syst. 2017, 51, 339–367. [Google Scholar] [CrossRef]
  28. Truong, C.; Oudre, L.; Vayatis, N. Selective review of offline change point detection methods. Signal Process. 2020, 167, 107299. [Google Scholar] [CrossRef]
  29. Apostol, E.S.; Truică, C.O.; Pop, F.; Esposito, C. Change point enhanced anomaly detection for IoT time series data. Water 2021, 13, 1633. [Google Scholar] [CrossRef]
  30. Dao, C.; Liu, X.; Sim, A.; Tull, C.; Wu, K. Modeling data transfers: Change point and anomaly detection. In Proceedings of the 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2–6 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1589–1594. [Google Scholar]
  31. Siris, V.A.; Papagalou, F. Application of anomaly detection algorithms for detecting SYN flooding attacks. In Proceedings of the IEEE Global Telecommunications Conference, 2004. GLOBECOM’04, Dallas, TX, USA, 29 November–3 December 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 4, pp. 2050–2054. [Google Scholar]
  32. Yu, Y.; Zhu, Y.; Li, S.; Wan, D. Time series outlier detection based on sliding window prediction. Math. Probl. Eng. 2014, 2014, 879736. [Google Scholar] [CrossRef]
  33. van de Wiel, L.; van Es, D.M.; Feelders, A.J. Real-Time Outlier Detection in Time Series Data of Water Sensors. In Advanced Analytics and Learning on Temporal Data; Lemaire, V., Malinowski, S., Bagnall, A., Guyet, T., Tavenard, R., Ifrim, G., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 155–170. [Google Scholar]
  34. Yang, J.H.; Cheng, C.H.; Chan, C.P. A time-series water level forecasting model based on imputation and variable selection method. Comput. Intell. Neurosci. 2017, 2017, 8734214. [Google Scholar] [CrossRef]
  35. Park, K.; Jung, Y.; Seong, Y.; Lee, S. Development of Deep Learning Models to Improve the Accuracy of Water Levels Time Series Prediction through Multivariate Hydrological Data. Water 2022, 14, 469. [Google Scholar] [CrossRef]
  36. Vu, M.; Jardani, A.; Massei, N.; Fournier, M. Reconstruction of missing groundwater level data by using Long Short-Term Memory (LSTM) deep neural network. J. Hydrol. 2021, 597, 125776. [Google Scholar] [CrossRef]
  37. Chang, L.C.; Chang, F.J.; Yang, S.N.; Kao, I.F.; Ku, Y.Y.; Kuo, C.L.; Amin, I.M.Z.b.M. Building an intelligent hydroinformatics integration platform for regional flood inundation warning systems. Water 2019, 11, 9. [Google Scholar] [CrossRef]
  38. Liu, Y.; Hou, G.; Huang, F.; Qin, H.; Wang, B.; Yi, L. Directed graph deep neural network for multi-step daily streamflow forecasting. J. Hydrol. 2022, 607, 127515. [Google Scholar] [CrossRef]
  39. Zhou, Y.; Guo, S.; Chang, F.J. Explore an evolutionary recurrent ANFIS for modelling multi-step-ahead flood forecasts. J. Hydrol. 2019, 570, 343–355. [Google Scholar] [CrossRef]
  40. Chang, L.C.; Chang, F.J.; Yang, S.N.; Tsai, F.H.; Chang, T.H.; Herricks, E.E. Self-organizing maps of typhoon tracks allow for flood forecasts up to two days in advance. Nat. Commun. 2020, 11, 1–13. [Google Scholar] [CrossRef]
  41. Chang, L.C.; Liou, J.Y.; Chang, F.J. Spatial-temporal flood inundation nowcasts by fusing machine learning methods and principal component analysis. J. Hydrol. 2022, 612, 128086. [Google Scholar] [CrossRef]
  42. Gao, S.; Zhang, Y.; Jia, K.; Lu, J.; Zhang, Y. Single sample face recognition via learning deep supervised autoencoders. IEEE Trans. Inf. Forensics Secur. 2015, 10, 2108–2118. [Google Scholar] [CrossRef]
  43. Xu, C.; Liu, Q.; Ye, M. Age invariant face recognition and retrieval by coupled auto-encoder networks. Neurocomputing 2017, 222, 62–71. [Google Scholar] [CrossRef]
  44. Pereira, J.; Silveira, M. Unsupervised anomaly detection in energy time series data using variational recurrent autoencoders with attention. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1275–1282. [Google Scholar]
  45. Zhou, C.; Paffenroth, R.C. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 665–674. [Google Scholar]
  46. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 3, 2018. [Google Scholar]
  47. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; Hengel, A.v.d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar]
  48. Jiang, G.; Xie, P.; He, H.; Yan, J. Wind turbine fault detection using a denoising autoencoder with temporal information. IEEE/Asme Trans. Mechatronics 2017, 23, 89–100. [Google Scholar] [CrossRef]
  49. Maas, A.; Le, Q.V.; O’neil, T.M.; Vinyals, O.; Nguyen, P.; Ng, A.Y. Recurrent Neural Networks for Noise Reduction in Robust ASR. 2012. Available online: (accessed on 30 June 2022).
  50. Chiang, H.T.; Hsieh, Y.Y.; Fu, S.W.; Hung, K.H.; Tsao, Y.; Chien, S.Y. Noise reduction in ECG signals using fully convolutional denoising autoencoders. IEEE Access 2019, 7, 60806–60813. [Google Scholar] [CrossRef]
  51. Kieu, T.; Yang, B.; Guo, C.; Jensen, C.S. Outlier Detection for Time Series with Recurrent Autoencoder Ensembles. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 2725–2732. [Google Scholar]
  52. Chen, J.; Sathe, S.; Aggarwal, C.; Turaga, D. Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, TX, USA, 27–29 April 2017; SIAM: Philadelphia, PA, USA, 2017; pp. 90–98. [Google Scholar]
  53. Kao, I.F.; Liou, J.Y.; Lee, M.H.; Chang, F.J. Fusing stacked autoencoder and long short-term memory for regional multistep-ahead flood inundation forecasts. J. Hydrol. 2021, 598, 126371. [Google Scholar] [CrossRef]
  54. Yu, L.; Tan, S.K.; Chua, L.H. Online ensemble modeling for real time water level forecasts. Water Resour. Manag. 2017, 31, 1105–1119. [Google Scholar] [CrossRef]
  55. Iftikhar, N.; Baattrup-Andersen, T.; Nordbjerg, F.E.; Jeppesen, K. Outlier detection in sensor data using ensemble learning. Procedia Comput. Sci. 2020, 176, 1160–1169. [Google Scholar] [CrossRef]
  56. Atienza, R. Advanced Deep Learning with Keras: Apply Deep Learning Techniques, Autoencoders, GANs, Variational Autoencoders, Deep Reinforcement Learning, Policy Gradients, and More; Packt Publishing Ltd.: Birmingham, UK, 2018; p. 272. [Google Scholar]
  57. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  58. Wan, Y.; Chen, J.; Xu, C.Y.; Xie, P.; Qi, W.; Li, D.; Zhang, S. Performance dependence of multi-model combination methods on hydrological model calibration strategy and ensemble size. J. Hydrol. 2021, 603, 127065. [Google Scholar] [CrossRef]
  59. Casciaro, G.; Ferrari, F.; Mazzino, A. Comparing novel strategies of Ensemble Model Output Statistics (EMOS) for calibrating wind speed/power forecasts. arXiv 2021, arXiv:2108.12174. [Google Scholar]
  60. Marathe, A.; Walambe, R.; Kotecha, K. Evaluating the performance of ensemble methods and voting strategies for dense 2D pedestrian detection in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, BC, Canada, 11–17 October 2021, pp. 3575–3584.
  61. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Figure 1. Overall process of DRL (Deep Reinforcement Learning).
Figure 2. Water level data from eight stations: CPY011, CPY012, CPY013, CPY014, CPY015, CPY016, CPY017, and YOM009 (a–h). The different colours show the partitions of the data for training (blue), validation (orange) and testing (green). The anomalies are indicated by red crosses (x).
Figure 3. A critical difference diagram for 5 different DRL models on different datasets of telemetry water level data.
Figure 4. F1-score when increasing the learning epochs at each station.
Figure 5. Anomaly detection from the best DRL model of each station. (a) CPY011 with $DRL$; (b) CPY012 with $DRL_{Valid}$; (c) CPY013 with $DRL_{F1}$; (d) CPY014 with $DRL_{Rwd}$; (e) CPY015 with $DRL_{F1}$; (f) CPY016 with $DRL_{F1}$; (g) CPY017 with $DRL$; (h) YOM009 with $DRL_{Acc}$.
Figure 6. A critical difference diagram of each model.
Figure 7. Bar charts of average F1-scores of the DRL, MLP, and LSTM when tested with the data collected from different stations.
Figure 8. F1-score of the ensemble models when increasing the learning epochs at CPY011, CPY012, CPY013, CPY014, CPY015, CPY016, CPY017, and YOM009 (a–h).
Figure 9. Critical difference diagram.
Figure 10. A critical difference diagram.
Table 1. Demographic summary of the water level data of 8 stations used in this research.
Table 2. Confusion matrix of classification results.
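The recall, precision, and F1 values reported in the result tables follow directly from confusion-matrix counts for the anomaly class. The sketch below shows the arithmetic. The counts are hypothetical, chosen only because they reproduce a triple of the kind that appears in the results (recall 0.8571, precision 0.5455, F1 0.6667); the function name is an assumption, and returning `None` for an undefined F1 is a design choice of this sketch.

```python
from typing import Optional, Tuple


def detection_scores(tp: int, fp: int, fn: int) -> Tuple[float, float, Optional[float]]:
    """Return (recall, precision, F1) for the anomaly (positive) class.

    tp: anomalies correctly flagged
    fp: normal readings wrongly flagged (false alarms)
    fn: anomalies that were missed
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall == 0:
        # F1 is undefined when nothing was detected correctly.
        return recall, precision, None
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Hypothetical counts: 6 of 7 anomalies found, with 5 false alarms.
recall, precision, f1 = detection_scores(tp=6, fp=5, fn=1)
print(f"recall={recall:.4f} precision={precision:.4f} F1={f1:.4f}")
# prints: recall=0.8571 precision=0.5455 F1=0.6667
```

Because F1 is the harmonic mean of precision and recall, a detector that finds most anomalies but raises many false alarms is penalised, which matters when false alarms carry an operational cost.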
Table 3. The performance of DRL when increasing the learning epochs (the best F1-score of each row shown in bold).

| Station | Epochs | $DRL$ Recall | $DRL$ Prec | $DRL$ F1 | $DRL_{F1}$ Recall | $DRL_{F1}$ Prec | $DRL_{F1}$ F1 | $DRL_{Rwd}$ Recall | $DRL_{Rwd}$ Prec | $DRL_{Rwd}$ F1 | $DRL_{Acc}$ Recall | $DRL_{Acc}$ Prec | $DRL_{Acc}$ F1 | $DRL_{Valid}$ Recall | $DRL_{Valid}$ Prec | $DRL_{Valid}$ F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CPY011 | 100 | 0.8571 | 0.5455 | 0.6667 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.5455 | 0.6667 | 0.8571 | 0.7500 | 0.8000 | 0.7143 | 0.6250 | 0.6667 |
| | 500 | 0.8571 | 0.7500 | 0.8000 | 0.8572 | 0.5455 | 0.6667 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.5455 | 0.6667 | 0.8571 | 0.6667 | 0.7500 |
| | 1000 | 0.7143 | 1.0000 | 0.8333 | 0.8571 | 0.6667 | 0.7500 | 0.7143 | 1.0000 | 0.8333 | 0.8571 | 0.6667 | 0.7500 | 0.4286 | 0.2727 | 0.3333 |
| | 5000 | 0.7143 | 0.6250 | 0.6667 | 0.8571 | 0.5000 | 0.6316 | 0.7143 | 0.6250 | 0.6667 | 0.8571 | 0.5000 | 0.6316 | 0.7143 | 0.7143 | 0.7143 |
| | 10,000 | 0.8571 | 0.6667 | 0.7500 | 1.0000 | 0.5833 | 0.7368 | 0.8571 | 0.6667 | 0.7500 | 1.0000 | 0.5833 | 0.7368 | 0.8571 | 0.4000 | 0.5455 |
| | Avg | 0.8000 | 0.7174 | 0.7433 | 0.8857 | 0.6091 | 0.7170 | 0.8000 | 0.7174 | 0.7433 | 0.8857 | 0.6091 | 0.7170 | 0.7143 | 0.5357 | 0.6020 |
| | Std | 0.0782 | 0.1743 | 0.0760 | 0.0639 | 0.0997 | 0.0674 | 0.0782 | 0.1743 | 0.0760 | 0.0639 | 0.0997 | 0.0674 | 0.1749 | 0.1901 | 0.1689 |
| CPY012 | 100 | 0.7059 | 0.4000 | 0.5106 | 0.7647 | 0.7027 | 0.7324 | 0.6764 | 0.3898 | 0.4946 | 0.7059 | 0.6857 | 0.6957 | 0.7647 | 0.7027 | 0.7324 |
| | 500 | 0.7647 | 0.6341 | 0.6933 | 0.7941 | 0.7297 | 0.7606 | 0.7353 | 0.6250 | 0.6757 | 0.7941 | 0.7297 | 0.7606 | 0.7941 | 0.7297 | 0.7606 |
| | 1000 | 0.6765 | 0.6571 | 0.6667 | 0.7647 | 0.6667 | 0.7123 | 0.6176 | 0.6000 | 0.6087 | 0.7647 | 0.6667 | 0.7123 | 0.7647 | 0.4062 | 0.5306 |
| | 5000 | 0.7059 | 0.7059 | 0.7059 | 0.6176 | 0.6176 | 0.6176 | 0.7059 | 0.7273 | 0.7164 | 0.6765 | 0.6970 | 0.6866 | 0.7059 | 0.7273 | 0.7164 |
| | 10,000 | 0.6471 | 0.7586 | 0.6984 | 0.7059 | 0.8000 | 0.7500 | 0.7059 | 0.7742 | 0.7385 | 0.7059 | 0.8276 | 0.7619 | 0.7941 | 0.7714 | 0.7826 |
| | Avg | 0.7000 | 0.6311 | 0.6550 | 0.7294 | 0.7033 | 0.7146 | 0.6882 | 0.6233 | 0.6468 | 0.7294 | 0.7213 | 0.7234 | 0.7647 | 0.6675 | 0.7045 |
| | Std | 0.0436 | 0.1378 | 0.0821 | 0.0702 | 0.0684 | 0.0572 | 0.0446 | 0.1489 | 0.0984 | 0.0483 | 0.0637 | 0.0357 | 0.0360 | 0.1481 | 0.1005 |
| CPY013 | 100 | 0.8710 | 0.3506 | 0.5000 | 0.8710 | 0.6136 | 0.7200 | 0.9032 | 0.3836 | 0.5385 | 0.8710 | 0.6136 | 0.7200 | 0.6774 | 0.1214 | 0.2059 |
| | 500 | 0.6774 | 0.3684 | 0.4773 | 0.8065 | 0.5000 | 0.6173 | 0.7419 | 0.3966 | 0.5169 | 0.8065 | 0.5000 | 0.6173 | 0.8387 | 0.4906 | 0.6190 |
| | 1000 | 0.8065 | 0.5952 | 0.6849 | 0.8065 | 0.5682 | 0.6667 | 0.7097 | 0.6111 | 0.6567 | 0.8065 | 0.5682 | 0.6667 | 0.9677 | 0.5556 | 0.7059 |
| | 5000 | 0.7742 | 0.5714 | 0.6575 | 0.6774 | 0.6774 | 0.6774 | 0.8065 | 0.5682 | 0.6667 | 0.6774 | 0.6364 | 0.6562 | 0.6774 | 0.5250 | 0.5915 |
| | 10,000 | 0.7097 | 0.6667 | 0.6875 | 0.8387 | 0.7647 | 0.8000 | 0.7742 | 0.6857 | 0.7273 | 0.8387 | 0.7647 | 0.8000 | 0.7419 | 0.6571 | 0.6970 |
| | Avg | 0.7678 | 0.5105 | 0.6014 | 0.8000 | 0.6248 | 0.6963 | 0.7871 | 0.5290 | 0.6212 | 0.8000 | 0.6166 | 0.6920 | 0.7806 | 0.4699 | 0.5639 |
| | Std | 0.0770 | 0.1423 | 0.1039 | 0.0736 | 0.1015 | 0.0685 | 0.0743 | 0.1337 | 0.0899 | 0.0736 | 0.0978 | 0.0706 | 0.1237 | 0.2045 | 0.2061 |
| CPY014 | 100 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 0.7500 |
| | 500 | 0.7500 | 0.3750 | 0.5000 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 1.0000 | 0.8571 |
| | 1000 | 0.7500 | 0.3750 | 0.5000 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.3750 | 0.5000 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.5000 | 0.6000 |
| | 5000 | 0.7500 | 0.3333 | 0.4615 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.3333 | 0.4615 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.4286 | 0.5455 |
| | 10,000 | 0.2500 | 0.1667 | 0.2000 | 0.7500 | 0.6000 | 0.6667 | 0.2500 | 0.1667 | 0.2000 | 0.7500 | 0.6000 | 0.6667 | 0.7500 | 0.4286 | 0.5455 |
| | Avg | 0.6500 | 0.4000 | 0.4823 | 0.7500 | 0.6200 | 0.6733 | 0.6500 | 0.5250 | 0.5537 | 0.7500 | 0.6200 | 0.6733 | 0.7500 | 0.6214 | 0.6596 |
| | Std | 0.2236 | 0.2137 | 0.1952 | 0.0000 | 0.1255 | 0.0751 | 0.2236 | 0.3405 | 0.2584 | 0.0000 | 0.1255 | 0.0751 | 0.0000 | 0.2495 | 0.1385 |
| CPY015 | 100 | 0.2353 | 0.4706 | 0.3137 | 0.3235 | 0.5238 | 0.4000 | 0.2353 | 0.4706 | 0.3137 | 0.3235 | 0.5238 | 0.4000 | 0.1765 | 0.3529 | 0.2353 |
| | 500 | 0.2059 | 0.4667 | 0.2857 | 0.3235 | 0.5000 | 0.3929 | 0.2059 | 0.4667 | 0.2857 | 0.3235 | 0.5000 | 0.3929 | 0.1471 | 0.3846 | 0.2128 |
| | 1000 | 0.3824 | 0.4483 | 0.4127 | 0.3824 | 0.5000 | 0.4333 | 0.3824 | 0.4483 | 0.4127 | 0.3824 | 0.5000 | 0.4333 | 0.4118 | 0.4516 | 0.4308 |
| | 5000 | 0.2353 | 0.4706 | 0.3137 | 0.3235 | 0.5238 | 0.4000 | 0.2353 | 0.4706 | 0.3137 | 0.3235 | 0.5238 | 0.4000 | 0.2353 | 0.4706 | 0.3137 |
| | 10,000 | 0.3824 | 0.4483 | 0.4127 | 0.3824 | 0.5200 | 0.4407 | 0.3824 | 0.4483 | 0.4127 | 0.3824 | 0.5200 | 0.4407 | 0.3824 | 0.4483 | 0.4127 |
| | Avg | 0.2883 | 0.4609 | 0.3477 | 0.3471 | 0.5135 | 0.4134 | 0.2883 | 0.4609 | 0.3477 | 0.3471 | 0.5135 | 0.4134 | 0.2706 | 0.4216 | 0.3211 |
| | Std | 0.0868 | 0.0116 | 0.0604 | 0.0323 | 0.0124 | 0.0219 | 0.0868 | 0.0116 | 0.0604 | 0.0323 | 0.0124 | 0.0219 | 0.1202 | 0.0503 | 0.0995 |
| CPY016 | 100 | 0.6636 | 0.3100 | 0.4226 | 0.5981 | 0.5203 | 0.5565 | 0.6916 | 0.2960 | 0.4146 | 0.5981 | 0.5203 | 0.5565 | 0.6168 | 0.4889 | 0.5455 |
| | 500 | 0.6636 | 0.2763 | 0.3901 | 0.6355 | 0.4048 | 0.4945 | 0.6449 | 0.2727 | 0.3833 | 0.5981 | 0.5161 | 0.5541 | 0.5047 | 0.5094 | 0.5070 |
| | 1000 | 0.6355 | 0.2547 | 0.3636 | 0.5981 | 0.5333 | 0.5639 | 0.6449 | 0.2644 | 0.3750 | 0.6168 | 0.5641 | 0.5893 | 0.6168 | 0.4342 | 0.5097 |
| | 5000 | 0.5888 | 0.2727 | 0.3728 | 0.5888 | 0.6238 | 0.6058 | 0.3084 | 0.2089 | 0.2491 | 0.5421 | 0.6105 | 0.5743 | 0.5234 | 0.2902 | 0.3733 |
| | 10,000 | 0.5794 | 0.2366 | 0.3360 | 0.6168 | 0.4177 | 0.4981 | 0.6355 | 0.2208 | 0.3277 | 0.6262 | 0.5447 | 0.5826 | 0.6168 | 0.4177 | 0.4981 |
| | Avg | 0.6262 | 0.2701 | 0.3770 | 0.6075 | 0.5000 | 0.5438 | 0.5851 | 0.2526 | 0.3499 | 0.5963 | 0.5511 | 0.5714 | 0.5757 | 0.4281 | 0.4867 |
| | Std | 0.0402 | 0.0274 | 0.0321 | 0.0187 | 0.0904 | 0.0472 | 0.1562 | 0.0366 | 0.0644 | 0.0326 | 0.0384 | 0.0156 | 0.0567 | 0.0858 | 0.0659 |
| CPY017 | 100 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 |
| | 500 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.3750 | 0.5455 | 0.6667 | 0.5000 | 0.5714 |
| | 1000 | 1.0000 | 0.2143 | 0.3529 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.2000 | 0.3333 | 1.0000 | 0.7500 | 0.8571 | 0.0000 | 0.0000 | - |
| | 5000 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.4286 | 0.6000 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.4286 | 0.6000 | 1.0000 | 0.3333 | 0.5000 |
| | 10,000 | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 0.2857 | 0.4000 | 0.6667 | 0.6667 | 0.6667 | 1.0000 | 0.6000 | 0.7500 | 0.6667 | 0.6667 | 0.6667 |
| | Avg | 0.9333 | 0.5762 | 0.6801 | 0.9333 | 0.4929 | 0.6381 | 0.9333 | 0.5733 | 0.6762 | 1.0000 | 0.5807 | 0.7219 | 0.6667 | 0.4500 | 0.6488 |
| | Std | 0.1491 | 0.2266 | 0.2062 | 0.1491 | 0.1683 | 0.1641 | 0.1491 | 0.2323 | 0.2140 | 0.0000 | 0.1755 | 0.1443 | 0.4082 | 0.2982 | 0.1547 |
| YOM009 | 100 | 0.6308 | 0.3178 | 0.4227 | 0.5692 | 0.3033 | 0.3957 | 0.6462 | 0.3182 | 0.4264 | 0.5692 | 0.3033 | 0.3957 | 0.5692 | 0.3394 | 0.4253 |
| | 500 | 0.5846 | 0.2734 | 0.3725 | 0.6769 | 0.3121 | 0.4272 | 0.6923 | 0.3020 | 0.4206 | 0.4769 | 0.4769 | 0.4769 | 0.4769 | 0.4769 | 0.4769 |
| | 1000 | 0.5385 | 0.2966 | 0.3825 | 0.6769 | 0.2973 | 0.4131 | 0.5538 | 0.3103 | 0.3978 | 0.4615 | 0.3947 | 0.4255 | 0.5846 | 0.3016 | 0.3979 |
| | 5000 | 0.5538 | 0.1818 | 0.2738 | 0.6615 | 0.3644 | 0.4699 | 0.6154 | 0.2581 | 0.3636 | 0.4923 | 0.4103 | 0.4476 | 0.5077 | 0.4177 | 0.4583 |
| | 10,000 | 0.4769 | 0.2627 | 0.3388 | 0.5692 | 0.2741 | 0.3700 | 0.4769 | 0.2605 | 0.3370 | 0.5385 | 0.2917 | 0.3784 | 0.4308 | 0.4308 | 0.4308 |
| | Avg | 0.5569 | 0.2665 | 0.3581 | 0.6307 | 0.3102 | 0.4152 | 0.5969 | 0.2898 | 0.3891 | 0.5077 | 0.3754 | 0.4248 | 0.5138 | 0.3933 | 0.4378 |
| | Std | 0.0570 | 0.0519 | 0.0558 | 0.0565 | 0.0334 | 0.0373 | 0.0839 | 0.0285 | 0.0382 | 0.0449 | 0.0776 | 0.0395 | 0.0640 | 0.0712 | 0.0306 |
Table 4. The mean F1-scores and standard deviation of all DRL, MLP, and LSTM models when testing with the dataset from different stations (the best F1-score of each row is shown in bold).

| Station | $DRL$ | $DRL_{F1}$ | $DRL_{Rwd}$ | $DRL_{Acc}$ | $DRL_{Valid}$ | MLP | LSTM |
|---|---|---|---|---|---|---|---|
| CPY011 | 0.7433 (±0.08) | 0.7170 (±0.07) | 0.7433 (±0.08) | 0.7170 (±0.07) | 0.6020 (±0.17) | 0.8505 (±0.06) | 0.8167 (±0.04) |
| CPY012 | 0.6550 (±0.08) | 0.7146 (±0.06) | 0.6468 (±0.10) | 0.7234 (±0.04) | 0.7045 (±0.10) | 0.7822 (±0.03) | 0.7753 (±0.02) |
| CPY013 | 0.6014 (±0.10) | 0.6963 (±0.07) | 0.6212 (±0.09) | 0.6920 (±0.07) | 0.5639 (±0.21) | 0.6998 (±0.03) | 0.7265 (±0.02) |
| CPY014 | 0.4823 (±0.20) | 0.6733 (±0.08) | 0.5537 (±0.26) | 0.6733 (±0.08) | 0.6596 (±0.14) | 0.8571 (±0.00) | 0.8571 (±0.00) |
| CPY015 | 0.3477 (±0.06) | 0.4134 (±0.02) | 0.3477 (±0.06) | 0.4134 (±0.02) | 0.3211 (±0.10) | 0.2220 (±0.10) | 0.3276 (±0.09) |
| CPY016 | 0.3770 (±0.03) | 0.5438 (±0.05) | 0.3499 (±0.06) | 0.5714 (±0.02) | 0.4867 (±0.07) | 0.5651 (±0.14) | 0.6252 (±0.06) |
| CPY017 | 0.6801 (±0.21) | 0.6381 (±0.16) | 0.6762 (±0.21) | 0.7219 (±0.14) | 0.6488 (±0.15) | 0.9778 (±0.07) | 0.9857 (±0.05) |
| YOM009 | 0.3581 (±0.06) | 0.4152 (±0.04) | 0.3891 (±0.04) | 0.4248 (±0.04) | 0.4378 (±0.03) | 0.2358 (±0.05) | 0.2596 (±0.06) |
Table 5. The number of training epochs and the time spent on each epoch for each model.
Station | Training Epochs | Time (s/epoch) | Total Time (s)
Table 6. The F1-scores of the best DRL models when tested on the dataset from the same station (shown in brackets) and on those from different stations. The average F1-scores and standard deviations for each station were calculated excluding its own score.
Tested Dataset | Trained Dataset
Table 7. The F1-scores of the MLP models when testing with the dataset from the same station (shown in the bracket) and different stations.
Tested Dataset | Trained Dataset
Table 8. The F1-scores of the LSTM models when testing with the dataset from the same station (shown in the bracket) and different stations.
Tested Dataset | Trained Dataset
Table 9. The performance of ensemble models (the best F1-score of each row is shown in bold).

| Station | Epochs | $EDRL_{3}$ Recall | $EDRL_{3}$ Prec | $EDRL_{3}$ F1 | $EDRL_{5}$ Recall | $EDRL_{5}$ Prec | $EDRL_{5}$ F1 | $WEDRL_{3}$ Recall | $WEDRL_{3}$ Prec | $WEDRL_{3}$ F1 | $WEDRL_{5}$ Recall | $WEDRL_{5}$ Prec | $WEDRL_{5}$ F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CPY011 | 100 | 0.8571 | 0.7500 | 0.8000 | 0.7143 | 0.6250 | 0.6667 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.7500 | 0.8000 |
| | 500 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.6667 | 0.7500 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.6667 | 0.7500 |
| | 1000 | 0.7143 | 1.0000 | 0.8333 | 0.7143 | 0.8333 | 0.7692 | 0.7143 | 1.0000 | 0.8333 | 0.8571 | 0.7500 | 0.8000 |
| | 5000 | 0.7143 | 0.6250 | 0.6667 | 0.7143 | 0.6250 | 0.6667 | 0.7143 | 0.7143 | 0.7143 | 0.7143 | 0.7143 | 0.7143 |
| | 10,000 | 0.8571 | 0.6667 | 0.7500 | 1.0000 | 0.7778 | 0.8750 | 0.8571 | 0.6667 | 0.7500 | 0.8571 | 0.6000 | 0.7059 |
| | Avg | 0.8000 | 0.7583 | 0.7700 | 0.8000 | 0.7056 | 0.7455 | 0.8000 | 0.7762 | 0.7795 | 0.8285 | 0.6962 | 0.7540 |
| | Std | 0.0782 | 0.1455 | 0.0650 | 0.1278 | 0.0949 | 0.0863 | 0.0782 | 0.1297 | 0.0471 | 0.0639 | 0.0637 | 0.0451 |
| CPY012 | 100 | 0.7647 | 0.7027 | 0.7324 | 0.7353 | 0.7353 | 0.7353 | 0.7647 | 0.7027 | 0.7324 | 0.7647 | 0.6842 | 0.7222 |
| | 500 | 0.7941 | 0.7297 | 0.7606 | 0.7941 | 0.7297 | 0.7606 | 0.7941 | 0.7297 | 0.7606 | 0.7941 | 0.7105 | 0.7500 |
| | 1000 | 0.7647 | 0.6667 | 0.7123 | 0.7353 | 0.8621 | 0.7937 | 0.7647 | 0.6667 | 0.7123 | 0.7353 | 0.6410 | 0.6849 |
| | 5000 | 0.7059 | 0.7273 | 0.7164 | 0.7059 | 0.7500 | 0.7273 | 0.7059 | 0.7273 | 0.7164 | 0.7353 | 0.7353 | 0.7353 |
| | 10,000 | 0.7059 | 0.8276 | 0.7619 | 0.7059 | 0.8276 | 0.7619 | 0.7941 | 0.7941 | 0.7941 | 0.7353 | 0.8333 | 0.7812 |
| | Avg | 0.7471 | 0.7308 | 0.7367 | 0.7353 | 0.7809 | 0.7558 | 0.7647 | 0.7241 | 0.7432 | 0.7529 | 0.7209 | 0.7347 |
| | Std | 0.0394 | 0.0598 | 0.0236 | 0.0360 | 0.0601 | 0.0261 | 0.0360 | 0.0466 | 0.0342 | 0.0263 | 0.0719 | 0.0355 |
| CPY013 | 100 | 0.8710 | 0.6136 | 0.7200 | 0.8387 | 0.6190 | 0.7123 | 0.8710 | 0.6136 | 0.7200 | 0.9032 | 0.5957 | 0.7179 |
| | 500 | 0.8065 | 0.5000 | 0.6173 | 0.7742 | 0.5714 | 0.6575 | 0.8387 | 0.4906 | 0.6190 | 0.8065 | 0.5556 | 0.6579 |
| | 1000 | 0.8065 | 0.6098 | 0.6944 | 0.8065 | 0.6098 | 0.6944 | 0.9355 | 0.6042 | 0.7342 | 0.9032 | 0.6087 | 0.7273 |
| | 5000 | 0.8065 | 0.5952 | 0.6849 | 0.7097 | 0.5789 | 0.6377 | 0.7097 | 0.6471 | 0.6769 | 0.7742 | 0.6486 | 0.7059 |
| | 10,000 | 0.8387 | 0.7647 | 0.8000 | 0.7419 | 0.6765 | 0.7077 | 0.8387 | 0.7647 | 0.8000 | 0.8387 | 0.7429 | 0.7879 |
| | Avg | 0.8258 | 0.6167 | 0.7033 | 0.7742 | 0.6111 | 0.6819 | 0.8387 | 0.6240 | 0.7100 | 0.8452 | 0.6303 | 0.7194 |
| | Std | 0.0288 | 0.0949 | 0.0660 | 0.0510 | 0.0417 | 0.0328 | 0.0822 | 0.0983 | 0.0674 | 0.0577 | 0.0712 | 0.0467 |
| CPY014 | 100 | 0.7500 | 0.7500 | 0.7500 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 0.7500 | 0.7500 |
| | 500 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | 1000 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.6000 | 0.6667 | 0.7500 | 0.3750 | 0.5000 | 0.7500 | 0.3750 | 0.5000 |
| | 5000 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.5000 | 0.6000 | 0.7500 | 0.3750 | 0.5000 |
| | 10,000 | 0.7500 | 0.6000 | 0.6667 | 0.7500 | 0.6000 | 0.6667 | 0.7500 | 0.6000 | 0.6667 | 0.7500 | 0.7500 | 0.7500 |
| | Avg | 0.7500 | 0.6700 | 0.6948 | 0.7500 | 0.7400 | 0.7295 | 0.7500 | 0.6950 | 0.6962 | 0.7500 | 0.6500 | 0.6714 |
| | Std | 0.0000 | 0.2110 | 0.1097 | 0.0000 | 0.2408 | 0.1196 | 0.0000 | 0.2896 | 0.1584 | 0.0000 | 0.2710 | 0.1625 |
| CPY015 | 100 | 0.3235 | 0.5238 | 0.4000 | 0.2647 | 0.5000 | 0.3462 | 0.3235 | 0.5238 | 0.4000 | 0.2647 | 0.4737 | 0.3396 |
| | 500 | 0.3235 | 0.5000 | 0.3929 | 0.2059 | 0.4667 | 0.2857 | 0.3235 | 0.5000 | 0.3929 | 0.2941 | 0.5263 | 0.3774 |
| | 1000 | 0.3824 | 0.5000 | 0.4333 | 0.4118 | 0.4828 | 0.4444 | 0.3824 | 0.5000 | 0.4333 | 0.3824 | 0.4643 | 0.4194 |
| | 5000 | 0.3235 | 0.5238 | 0.4000 | 0.2353 | 0.4706 | 0.3137 | 0.3235 | 0.5238 | 0.4000 | 0.3235 | 0.5238 | 0.4000 |
| | 10,000 | 0.3824 | 0.5200 | 0.4407 | 0.3824 | 0.4483 | 0.4127 | 0.3824 | 0.5200 | 0.4407 | 0.3824 | 0.5200 | 0.4407 |
| | Avg | 0.3471 | 0.5135 | 0.4134 | 0.3000 | 0.4737 | 0.3605 | 0.3471 | 0.5135 | 0.4134 | 0.3294 | 0.5016 | 0.3954 |
| | Std | 0.0323 | 0.0124 | 0.0219 | 0.0916 | 0.0192 | 0.0666 | 0.0323 | 0.0124 | 0.0219 | 0.0526 | 0.0300 | 0.0390 |
| CPY016 | 100 | 0.5981 | 0.5203 | 0.5565 | 0.6075 | 0.4962 | 0.5462 | 0.5981 | 0.5203 | 0.5565 | 0.6168 | 0.5366 | 0.5739 |
| | 500 | 0.5981 | 0.5203 | 0.5565 | 0.6168 | 0.3952 | 0.4818 | 0.6075 | 0.5603 | 0.5830 | 0.5981 | 0.5333 | 0.5639 |
| | 1000 | 0.6168 | 0.5323 | 0.5714 | 0.6542 | 0.4636 | 0.5426 | 0.6168 | 0.5546 | 0.5841 | 0.6168 | 0.5455 | 0.5789 |
| | 5000 | 0.5701 | 0.6100 | 0.5894 | 0.5514 | 0.4275 | 0.4816 | 0.5701 | 0.6162 | 0.5922 | 0.5888 | 0.3987 | 0.4755 |
| | 10,000 | 0.6168 | 0.4177 | 0.4981 | 0.6262 | 0.4295 | 0.5095 | 0.6262 | 0.5447 | 0.5826 | 0.6449 | 0.5111 | 0.5702 |
| | Avg | 0.6000 | 0.5201 | 0.5544 | 0.6112 | 0.4424 | 0.5123 | 0.6037 | 0.5592 | 0.5797 | 0.6131 | 0.5050 | 0.5525 |
| | Std | 0.0191 | 0.0684 | 0.0343 | 0.0377 | 0.0386 | 0.0314 | 0.0215 | 0.0353 | 0.0135 | 0.0215 | 0.0608 | 0.0434 |
| CPY017 | 100 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 |
| | 500 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 |
| | 1000 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.2308 | 0.3750 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 |
| | 5000 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.5000 | 0.6667 | 1.0000 | 0.5000 | 0.6667 |
| | 10,000 | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 0.6667 | 1.0000 | 0.6000 | 0.7500 | 0.6667 | 0.6667 | 0.6667 |
| | Avg | 0.9333 | 0.6833 | 0.7809 | 0.9333 | 0.6295 | 0.7131 | 1.0000 | 0.6700 | 0.7976 | 0.9333 | 0.6833 | 0.7809 |
| | Std | 0.1491 | 0.1087 | 0.1043 | 0.1491 | 0.2868 | 0.2354 | 0.0000 | 0.1151 | 0.0866 | 0.1491 | 0.1087 | 0.1043 |
| YOM009 | 100 | 0.6308 | 0.3254 | 0.4293 | 0.5846 | 0.3115 | 0.4064 | 0.6154 | 0.3226 | 0.4233 | 0.6462 | 0.3281 | 0.4352 |
| | 500 | 0.4769 | 0.4769 | 0.4769 | 0.6154 | 0.3960 | 0.4819 | 0.4769 | 0.4769 | 0.4769 | 0.6000 | 0.4333 | 0.5032 |
| | 1000 | 0.5538 | 0.3600 | 0.4364 | 0.5538 | 0.3186 | 0.4045 | 0.5231 | 0.3778 | 0.4387 | 0.4923 | 0.3299 | 0.3951 |
| | 5000 | 0.5538 | 0.4091 | 0.4706 | 0.5846 | 0.4086 | 0.4810 | 0.6308 | 0.3981 | 0.4881 | 0.5538 | 0.3830 | 0.4528 |
| | 10,000 | 0.5385 | 0.2991 | 0.3846 | 0.4923 | 0.3048 | 0.3765 | 0.4154 | 0.3971 | 0.4060 | 0.4615 | 0.3158 | 0.3750 |
| | Avg | 0.5508 | 0.3741 | 0.4396 | 0.5661 | 0.3479 | 0.4301 | 0.5323 | 0.3945 | 0.4466 | 0.5508 | 0.3580 | 0.4323 |
| | Std | 0.0548 | 0.0707 | 0.0371 | 0.0467 | 0.0501 | 0.0484 | 0.0914 | 0.0554 | 0.0350 | 0.0757 | 0.0494 | 0.0503 |
Table 10. The mean F1-scores and standard deviations of all of the DRL, MLP, LSTM, and ensemble of DRL-based models when testing with the dataset from different stations (the best F1-score of each station is shown in bold).
Table 10. The mean F1-scores and standard deviations of all of the DRL, MLP, LSTM, and ensemble of DRL-based models when testing with the dataset from different stations (the best F1-score of each station is shown in bold).
| Model | CPY011 | CPY012 | CPY013 | CPY014 | CPY015 | CPY016 | CPY017 | YOM009 |
|---|---|---|---|---|---|---|---|---|
| DRL | 0.7433 (±0.08) | 0.6550 (±0.08) | 0.6014 (±0.10) | 0.4823 (±0.20) | 0.3477 (±0.06) | 0.3770 (±0.03) | 0.6801 (±0.21) | 0.3581 (±0.06) |
| DRL_F1 | 0.7170 (±0.07) | 0.7146 (±0.06) | 0.6963 (±0.07) | 0.6733 (±0.08) | 0.4134 (±0.02) | 0.5438 (±0.05) | 0.6381 (±0.16) | 0.4152 (±0.04) |
| DRL_Rwd | 0.7433 (±0.08) | 0.6468 (±0.10) | 0.6212 (±0.09) | 0.5537 (±0.26) | 0.3477 (±0.06) | 0.3499 (±0.06) | 0.6762 (±0.21) | 0.3891 (±0.04) |
| DRL_Acc | 0.7170 (±0.07) | 0.7234 (±0.04) | 0.6920 (±0.07) | 0.6733 (±0.08) | 0.4134 (±0.02) | 0.5714 (±0.02) | 0.7219 (±0.14) | 0.4248 (±0.04) |
| DRL_Valid | 0.6020 (±0.17) | 0.7045 (±0.10) | 0.5639 (±0.21) | 0.6596 (±0.14) | 0.3211 (±0.10) | 0.4867 (±0.07) | 0.6488 (±0.15) | 0.4378 (±0.03) |
| EDRL3 | 0.7700 (±0.06) | 0.7367 (±0.02) | 0.7033 (±0.07) | 0.6948 (±0.11) | 0.4134 (±0.02) | 0.5544 (±0.03) | 0.7809 (±0.10) | 0.4396 (±0.04) |
| EDRL5 | 0.7455 (±0.09) | 0.7558 (±0.03) | 0.6819 (±0.03) | 0.7295 (±0.12) | 0.3605 (±0.07) | 0.5123 (±0.03) | 0.7131 (±0.24) | 0.4301 (±0.05) |
| WEDRL3 | 0.7795 (±0.05) | 0.7432 (±0.03) | 0.7100 (±0.07) | 0.6962 (±0.16) | 0.4134 (±0.02) | 0.5797 (±0.01) | 0.7976 (±0.09) | 0.4466 (±0.03) |
| WEDRL5 | 0.7540 (±0.05) | 0.7347 (±0.04) | 0.7194 (±0.05) | 0.6714 (±0.16) | 0.3954 (±0.04) | 0.5525 (±0.04) | 0.7619 (±0.10) | 0.4323 (±0.05) |
| MLP | 0.8505 (±0.06) | 0.7822 (±0.03) | 0.6998 (±0.03) | 0.8571 (±0.00) | 0.2220 (±0.10) | 0.5651 (±0.14) | 0.9778 (±0.07) | 0.2358 (±0.05) |
| LSTM | 0.8167 (±0.04) | 0.7753 (±0.02) | 0.7265 (±0.02) | 0.8571 (±0.00) | 0.3276 (±0.09) | 0.6252 (±0.06) | 0.9857 (±0.05) | 0.2596 (±0.06) |
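Because the bold highlighting mentioned in the caption of Table 10 does not survive in plain text, the best mean F1-score per station can be recovered programmatically. The sketch below is one way to do this; it copies the mean F1 values from Table 10 and assumes the columns run from CPY011 to CPY017 followed by YOM009 (an order inferred from the per-station averages reported elsewhere in this section, not stated in the table itself). The names `MEAN_F1`, `STATIONS`, and `best_per_station` are illustrative.

```python
# Assumption: column order inferred from the per-station averages in this section.
STATIONS = ["CPY011", "CPY012", "CPY013", "CPY014",
            "CPY015", "CPY016", "CPY017", "YOM009"]

# Mean F1-scores copied from Table 10 (standard deviations omitted).
MEAN_F1 = {
    "DRL":       [0.7433, 0.6550, 0.6014, 0.4823, 0.3477, 0.3770, 0.6801, 0.3581],
    "DRL_F1":    [0.7170, 0.7146, 0.6963, 0.6733, 0.4134, 0.5438, 0.6381, 0.4152],
    "DRL_Rwd":   [0.7433, 0.6468, 0.6212, 0.5537, 0.3477, 0.3499, 0.6762, 0.3891],
    "DRL_Acc":   [0.7170, 0.7234, 0.6920, 0.6733, 0.4134, 0.5714, 0.7219, 0.4248],
    "DRL_Valid": [0.6020, 0.7045, 0.5639, 0.6596, 0.3211, 0.4867, 0.6488, 0.4378],
    "EDRL3":     [0.7700, 0.7367, 0.7033, 0.6948, 0.4134, 0.5544, 0.7809, 0.4396],
    "EDRL5":     [0.7455, 0.7558, 0.6819, 0.7295, 0.3605, 0.5123, 0.7131, 0.4301],
    "WEDRL3":    [0.7795, 0.7432, 0.7100, 0.6962, 0.4134, 0.5797, 0.7976, 0.4466],
    "WEDRL5":    [0.7540, 0.7347, 0.7194, 0.6714, 0.3954, 0.5525, 0.7619, 0.4323],
    "MLP":       [0.8505, 0.7822, 0.6998, 0.8571, 0.2220, 0.5651, 0.9778, 0.2358],
    "LSTM":      [0.8167, 0.7753, 0.7265, 0.8571, 0.3276, 0.6252, 0.9857, 0.2596],
}


def best_per_station(scores, stations):
    """Return {station: (best mean F1, [all models that reach it])}."""
    best = {}
    for col, station in enumerate(stations):
        top = max(row[col] for row in scores.values())
        winners = [name for name, row in scores.items() if row[col] == top]
        best[station] = (top, winners)
    return best


best = best_per_station(MEAN_F1, STATIONS)
for station, (score, models) in best.items():
    print(f"{station}: {score:.4f} ({', '.join(models)})")
```

Ties are reported as a list rather than broken arbitrarily, since several models share the same mean F1-score at some stations.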
Table 11. The mean F1-scores of the WEDRL3 models when testing with the dataset from the same station (shown in brackets) and different stations.
Tested Dataset | Trained Dataset
Table 12. The performance of the ensemble models built by combining DRL and candidate models (the best F1-score of each row is shown in bold).
| Station | Epochs | E3 Recall | E3 Prec | E3 F1 | E5 Recall | E5 Prec | E5 F1 | E7 Recall | E7 Prec | E7 F1 | WE3 Recall | WE3 Prec | WE3 F1 | WE5 Recall | WE5 Prec | WE5 F1 | WE7 Recall | WE7 Prec | WE7 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CPY011 | 100 | 0.8571 | 1.0000 | 0.9231 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 |
| | 500 | 0.8571 | 1.0000 | 0.9231 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 |
| | 1000 | 0.8571 | 1.0000 | 0.9231 | 0.8571 | 1.0000 | 0.9231 | 0.8571 | 1.0000 | 0.9231 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 |
| | 5000 | 0.8571 | 1.0000 | 0.9231 | 0.7143 | 0.7143 | 0.7143 | 0.8571 | 0.7500 | 0.8000 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 |
| | 10,000 | 0.8571 | 1.0000 | 0.9231 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 |
| | avg | 0.8571 | 1.0000 | 0.9231 | 0.8285 | 0.8143 | 0.8189 | 0.8571 | 0.8214 | 0.8360 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 | 0.8571 |
| | std | 0.0000 | 0.0000 | 0.0000 | 0.0639 | 0.1168 | 0.0774 | 0.0000 | 0.1101 | 0.0546 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| CPY012 | 100 | 0.7647 | 0.9286 | 0.8387 | 0.7647 | 0.8387 | 0.8000 | 0.7941 | 0.7714 | 0.7826 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 |
| | 500 | 0.7647 | 0.9286 | 0.8387 | 0.7941 | 0.7297 | 0.7606 | 0.7941 | 0.9000 | 0.8438 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 |
| | 1000 | 0.7647 | 0.9286 | 0.8387 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 |
| | 5000 | 0.7647 | 0.9286 | 0.8387 | 0.7059 | 0.7500 | 0.7273 | 0.7353 | 0.8333 | 0.7812 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 |
| | 10,000 | 0.7647 | 0.9286 | 0.8387 | 0.7941 | 0.9000 | 0.8438 | 0.7059 | 0.8276 | 0.7619 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7941 | 0.9000 | 0.8438 |
| | avg | 0.7647 | 0.9286 | 0.8387 | 0.7647 | 0.8230 | 0.7914 | 0.7588 | 0.8458 | 0.7990 | 0.7647 | 0.8966 | 0.8254 | 0.7647 | 0.8966 | 0.8254 | 0.7706 | 0.8973 | 0.8291 |
| | std | 0.0000 | 0.0000 | 0.0000 | 0.0360 | 0.0800 | 0.0475 | 0.0383 | 0.0537 | 0.0342 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0131 | 0.0015 | 0.0082 |
| CPY013 | 100 | 0.7742 | 0.7273 | 0.7500 | 0.8387 | 0.6842 | 0.7536 | 0.8065 | 0.6757 | 0.7353 | 0.7419 | 0.6765 | 0.7077 | 0.7742 | 0.6857 | 0.7273 | 0.7742 | 0.6857 | 0.7273 |
| | 500 | 0.7742 | 0.7273 | 0.7500 | 0.8387 | 0.6341 | 0.7222 | 0.7419 | 0.6970 | 0.7188 | 0.7419 | 0.6765 | 0.7077 | 0.7419 | 0.6765 | 0.7077 | 0.7419 | 0.6765 | 0.7077 |
| | 1000 | 0.7742 | 0.7273 | 0.7500 | 0.8710 | 0.6750 | 0.7606 | 0.8065 | 0.6410 | 0.7143 | 0.7419 | 0.6765 | 0.7077 | 0.7742 | 0.6857 | 0.7273 | 0.7742 | 0.6857 | 0.7273 |
| | 5000 | 0.7742 | 0.7273 | 0.7500 | 0.7419 | 0.6765 | 0.7077 | 0.7419 | 0.6765 | 0.7077 | 0.7419 | 0.6765 | 0.7077 | 0.7419 | 0.6765 | 0.7077 | 0.7742 | 0.6857 | 0.7273 |
| | 10,000 | 0.8387 | 0.7647 | 0.8000 | 0.7742 | 0.7742 | 0.7742 | 0.7742 | 0.7742 | 0.7742 | 0.8387 | 0.7647 | 0.8000 | 0.7742 | 0.7273 | 0.7500 | 0.7419 | 0.6765 | 0.7077 |
| | avg | 0.7871 | 0.7348 | 0.7600 | 0.8129 | 0.6888 | 0.7437 | 0.7742 | 0.6929 | 0.7301 | 0.7613 | 0.6941 | 0.7262 | 0.7613 | 0.6903 | 0.7240 | 0.7613 | 0.6820 | 0.7195 |
| | std | 0.0288 | 0.0167 | 0.0224 | 0.0530 | 0.0516 | 0.0277 | 0.0323 | 0.0497 | 0.0267 | 0.0433 | 0.0394 | 0.0413 | 0.0177 | 0.0212 | 0.0175 | 0.0177 | 0.0050 | 0.0107 |
| CPY014 | 100 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | 500 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | 1000 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | 5000 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | 10,000 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 0.6000 | 0.6667 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | avg | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 0.9200 | 0.8190 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 | 0.7500 | 1.0000 | 0.8571 |
| | std | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1789 | 0.0851 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| CPY015 | 100 | 0.3235 | 0.5238 | 0.4000 | 0.2647 | 0.5000 | 0.3462 | 0.2647 | 0.5000 | 0.3462 | 0.3235 | 0.5238 | 0.4000 | 0.2647 | 0.4737 | 0.3396 | 0.2353 | 0.4444 | 0.3077 |
| | 500 | 0.3235 | 0.5000 | 0.3929 | 0.2353 | 0.5000 | 0.3200 | 0.2353 | 0.5000 | 0.3200 | 0.3235 | 0.5000 | 0.3929 | 0.2941 | 0.5263 | 0.3774 | 0.2647 | 0.5000 | 0.3462 |
| | 1000 | 0.3824 | 0.5000 | 0.4333 | 0.4118 | 0.4828 | 0.4444 | 0.2647 | 0.4286 | 0.3273 | 0.3824 | 0.5000 | 0.4333 | 0.3824 | 0.4643 | 0.4194 | 0.3824 | 0.4333 | 0.4062 |
| | 5000 | 0.3235 | 0.5238 | 0.4000 | 0.2647 | 0.5000 | 0.3462 | 0.2647 | 0.5000 | 0.3462 | 0.3235 | 0.5238 | 0.4000 | 0.2941 | 0.5000 | 0.3704 | 0.2941 | 0.5000 | 0.3704 |
| | 10,000 | 0.3824 | 0.5200 | 0.4407 | 0.3824 | 0.4483 | 0.4127 | 0.3529 | 0.4800 | 0.4068 | 0.3824 | 0.5200 | 0.4407 | 0.3824 | 0.5200 | 0.4407 | 0.4412 | 0.4839 | 0.4615 |
| | avg | 0.3471 | 0.5135 | 0.4134 | 0.3118 | 0.4862 | 0.3739 | 0.2765 | 0.4817 | 0.3493 | 0.3471 | 0.5135 | 0.4134 | 0.3235 | 0.4969 | 0.3895 | 0.3235 | 0.4723 | 0.3784 |
| | std | 0.0323 | 0.0124 | 0.0219 | 0.0795 | 0.0225 | 0.0522 | 0.0446 | 0.0309 | 0.0342 | 0.0323 | 0.0124 | 0.0219 | 0.0551 | 0.0274 | 0.0404 | 0.0858 | 0.0315 | 0.0587 |
| CPY016 | 100 | 0.5421 | 0.8529 | 0.6629 | 0.5981 | 0.5470 | 0.5714 | 0.6075 | 0.5462 | 0.5752 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8286 | 0.6554 | 0.5421 | 0.8286 | 0.6554 |
| | 500 | 0.5421 | 0.8529 | 0.6629 | 0.5981 | 0.6154 | 0.6066 | 0.5888 | 0.6632 | 0.6238 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8406 | 0.6591 |
| | 1000 | 0.5421 | 0.8529 | 0.6629 | 0.5794 | 0.5794 | 0.5794 | 0.6075 | 0.5372 | 0.5702 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8286 | 0.6554 | 0.5421 | 0.8286 | 0.6554 |
| | 5000 | 0.5421 | 0.8529 | 0.6629 | 0.5607 | 0.7059 | 0.6250 | 0.5607 | 0.6122 | 0.5854 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8286 | 0.6554 | 0.5607 | 0.8333 | 0.6704 |
| | 10,000 | 0.5421 | 0.8529 | 0.6629 | 0.6075 | 0.5752 | 0.5909 | 0.5981 | 0.4638 | 0.5224 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8286 | 0.6554 | 0.5421 | 0.8529 | 0.6629 |
| | avg | 0.5421 | 0.8529 | 0.6629 | 0.5888 | 0.6046 | 0.5947 | 0.5925 | 0.5645 | 0.5754 | 0.5421 | 0.8406 | 0.6591 | 0.5421 | 0.8310 | 0.6561 | 0.5458 | 0.8368 | 0.6606 |
| | std | 0.0000 | 0.0000 | 0.0000 | 0.0187 | 0.0616 | 0.0215 | 0.0194 | 0.0762 | 0.0363 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0054 | 0.0017 | 0.0083 | 0.0103 | 0.0063 |
| CPY017 | 100 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | 500 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | 1000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | 5000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 0.7500 | 0.8571 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | 10,000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.6667 | 1.0000 | 0.8000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | avg | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9000 | 0.9428 | 0.9333 | 0.8500 | 0.8743 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| | std | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1369 | 0.0783 | 0.1491 | 0.1369 | 0.0745 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| YOM009 | 100 | 0.6308 | 0.3254 | 0.4293 | 0.5846 | 0.3115 | 0.4064 | 0.5538 | 0.3303 | 0.4138 | 0.6154 | 0.3226 | 0.4233 | 0.6462 | 0.3281 | 0.4352 | 0.5846 | 0.3065 | 0.4021 |
| | 500 | 0.4769 | 0.4769 | 0.4769 | 0.6154 | 0.3960 | 0.4819 | 0.3538 | 0.4694 | 0.4035 | 0.4769 | 0.4769 | 0.4769 | 0.6000 | 0.4333 | 0.5032 | 0.6154 | 0.4082 | 0.4908 |
| | 1000 | 0.5538 | 0.3600 | 0.4364 | 0.5538 | 0.3186 | 0.4045 | 0.4769 | 0.3875 | 0.4276 | 0.5231 | 0.3778 | 0.4387 | 0.4923 | 0.3299 | 0.3951 | 0.5538 | 0.3396 | 0.4211 |
| | 5000 | 0.5538 | 0.4091 | 0.4706 | 0.4923 | 0.4384 | 0.4638 | 0.4462 | 0.4531 | 0.4496 | 0.6308 | 0.3981 | 0.4881 | 0.5538 | 0.3789 | 0.4500 | 0.5385 | 0.4430 | 0.4861 |
| | 10,000 | 0.5385 | 0.2991 | 0.3846 | 0.4923 | 0.3048 | 0.3765 | 0.4154 | 0.3293 | 0.3673 | 0.4154 | 0.3971 | 0.4060 | 0.4615 | 0.3158 | 0.3750 | 0.5538 | 0.3186 | 0.4045 |
| | avg | 0.5508 | 0.3741 | 0.4396 | 0.5477 | 0.3539 | 0.4266 | 0.4492 | 0.3939 | 0.4124 | 0.5323 | 0.3945 | 0.4466 | 0.5508 | 0.3572 | 0.4317 | 0.5692 | 0.3632 | 0.4409 |
| | std | 0.0548 | 0.0707 | 0.0371 | 0.0550 | 0.0599 | 0.0443 | 0.0741 | 0.0661 | 0.0305 | 0.0914 | 0.0554 | 0.0350 | 0.0757 | 0.0489 | 0.0500 | 0.0308 | 0.0595 | 0.0440 |
Table 13. The mean F1-scores and standard deviation of all models when testing with the dataset from different stations (the best F1-score of each station is shown in bold).
| Model | CPY011 | CPY012 | CPY013 | CPY014 | CPY015 | CPY016 | CPY017 | YOM009 |
|---|---|---|---|---|---|---|---|---|
| DRL | 0.7433 (±0.08) | 0.6550 (±0.08) | 0.6014 (±0.10) | 0.4823 (±0.20) | 0.3477 (±0.06) | 0.3770 (±0.03) | 0.6801 (±0.21) | 0.3581 (±0.06) |
| DRL_F1 | 0.7170 (±0.07) | 0.7146 (±0.06) | 0.6963 (±0.07) | 0.6733 (±0.08) | 0.4134 (±0.02) | 0.5438 (±0.05) | 0.6381 (±0.16) | 0.4152 (±0.04) |
| DRL_Rwd | 0.7433 (±0.08) | 0.6468 (±0.10) | 0.6212 (±0.09) | 0.5537 (±0.26) | 0.3477 (±0.06) | 0.3499 (±0.06) | 0.6762 (±0.21) | 0.3891 (±0.04) |
| DRL_Acc | 0.7170 (±0.07) | 0.7234 (±0.04) | 0.6920 (±0.07) | 0.6733 (±0.08) | 0.4134 (±0.02) | 0.5714 (±0.02) | 0.7219 (±0.14) | 0.4248 (±0.04) |
| DRL_Valid | 0.6020 (±0.17) | 0.7045 (±0.10) | 0.5639 (±0.21) | 0.6596 (±0.14) | 0.3211 (±0.10) | 0.4867 (±0.07) | 0.6488 (±0.15) | 0.4378 (±0.03) |
| MLP | 0.8505 (±0.06) | 0.7822 (±0.03) | 0.6998 (±0.03) | 0.8571 (±0.00) | 0.2220 (±0.10) | 0.5651 (±0.14) | 0.9778 (±0.07) | 0.2358 (±0.05) |
| LSTM | 0.8167 (±0.04) | 0.7753 (±0.02) | 0.7265 (±0.02) | 0.8571 (±0.00) | 0.3276 (±0.09) | 0.6252 (±0.06) | 0.9857 (±0.05) | 0.2596 (±0.06) |
| EDRL3 | 0.7700 (±0.06) | 0.7367 (±0.02) | 0.7033 (±0.07) | 0.6948 (±0.11) | 0.4134 (±0.02) | 0.5544 (±0.03) | 0.7809 (±0.10) | 0.4396 (±0.04) |
| EDRL5 | 0.7455 (±0.09) | 0.7558 (±0.03) | 0.6819 (±0.03) | 0.7295 (±0.12) | 0.3605 (±0.07) | 0.5123 (±0.03) | 0.7131 (±0.24) | 0.4301 (±0.05) |
| E3 | 0.9231 (±0.00) | 0.8387 (±0.00) | 0.7600 (±0.02) | 0.8571 (±0.00) | 0.4134 (±0.02) | 0.6629 (±0.00) | 1.0000 (±0.00) | 0.4396 (±0.04) |
| E5 | 0.8189 (±0.08) | 0.7914 (±0.05) | 0.7437 (±0.03) | 0.8571 (±0.00) | 0.3739 (±0.05) | 0.5947 (±0.02) | 0.9428 (±0.08) | 0.4266 (±0.04) |
| E7 | 0.8360 (±0.05) | 0.7990 (±0.03) | 0.7301 (±0.03) | 0.8190 (±0.09) | 0.3493 (±0.03) | 0.5754 (±0.04) | 0.8743 (±0.07) | 0.4124 (±0.03) |
| WEDRL3 | 0.7795 (±0.05) | 0.7432 (±0.03) | 0.7100 (±0.07) | 0.6962 (±0.16) | 0.4134 (±0.02) | 0.5797 (±0.01) | 0.7976 (±0.09) | 0.4466 (±0.03) |
| WEDRL5 | 0.7540 (±0.05) | 0.7347 (±0.04) | 0.7194 (±0.05) | 0.6714 (±0.16) | 0.3954 (±0.04) | 0.5525 (±0.04) | 0.7619 (±0.10) | 0.4323 (±0.05) |
| WE3 | 0.8571 (±0.00) | 0.8254 (±0.00) | 0.7262 (±0.04) | 0.8571 (±0.00) | 0.4134 (±0.02) | 0.6591 (±0.00) | 1.0000 (±0.00) | 0.4466 (±0.03) |
| WE5 | 0.8571 (±0.00) | 0.8254 (±0.00) | 0.7240 (±0.02) | 0.8571 (±0.00) | 0.3895 (±0.04) | 0.6561 (±0.00) | 1.0000 (±0.00) | 0.4317 (±0.05) |
| WE7 | 0.8571 (±0.00) | 0.8291 (±0.01) | 0.7195 (±0.01) | 0.8571 (±0.00) | 0.3784 (±0.06) | 0.6606 (±0.01) | 1.0000 (±0.00) | 0.4409 (±0.04) |
Table 14. The F1-scores of the E3 models when testing with the dataset from the same station (shown in brackets) and different stations.
Tested Dataset | Trained Dataset
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Khampuengson, T.; Wang, W. Deep Reinforcement Learning Ensemble for Detecting Anomaly in Telemetry Water Level Data. Water 2022, 14, 2492.
