In this section, we provide details of the performed experiments, including the environment setup, performance metrics, analysis of results, and discussion.
4.1. Environment Setup and Data
For training and testing the NPH, AEB, and UNB methods, we used the same data as in the evaluation of the heuristic method with simulated annealing optimization described in [15]. The data consisted of real signals obtained from sensors placed in the soil, with different variants of trends and with artificially generated anomalies added, namely peaks (P), bumps (B), jumps (J), and instabilities (I) of different widths and amplitudes, inserted at random places. Details regarding the generated data can be found in [15]. Experiments on the NPH, AEB, and UNB methods were performed independently for each anomaly type and each physical parameter (T, M, and pH), with a separate model training and testing experiment for each such combination. Neural network models were trained using our cloud computing facility and one Nvidia H100 card. Training was performed using the Python 3.11.7 interpreter and the TensorFlow 2.18.0 library. The source code for training and testing all described methods is available at https://ieee-dataport.org/documents/rural-iot-soil-data (accessed on 21 August 2025).
Details on the neural network structure, optimization method, and loss function are described in Section 3.2 for the NPH method, Section 3.3 for the AEB method, and Section 3.4 for the UNB method.
Figure 11 shows the MSE loss function plots obtained during training for the AEB and UNB models. In the case of NPH, for each pair (anomaly, physical parameter), two models are required to detect the center and edge of the anomaly.
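As an illustration of this training setup, the following minimal sketch shows how an autoencoder of the AEB type can be trained with an MSE loss in TensorFlow/Keras; the window length, layer sizes, and data shapes here are illustrative assumptions, not the exact architecture from Section 3.3.

```python
import numpy as np
import tensorflow as tf

WINDOW = 144  # illustrative window length (e.g., one day at 10 min sampling)

# Illustrative dense autoencoder; the actual AEB architecture is in Section 3.3.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW,)),
    tf.keras.layers.Dense(32, activation="relu"),    # encoder
    tf.keras.layers.Dense(8, activation="relu"),     # bottleneck
    tf.keras.layers.Dense(32, activation="relu"),    # decoder
    tf.keras.layers.Dense(WINDOW, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")  # MSE loss, as plotted in Figure 11

# x_train: normalized signal windows, shape (n_windows, WINDOW); placeholder here.
x_train = np.random.rand(1024, WINDOW).astype("float32")
history = model.fit(x_train, x_train, epochs=50, batch_size=64,
                    validation_split=0.1, verbose=0)
```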
4.2. Results
The test error values obtained for the NPH method described in Section 3.2 are presented in Table 5. They were calculated for the test data not used in the detector construction process, as described in Section 2.2.
These results were obtained for specific threshold values, based on which the detectors decided whether a given signal fragment was, in fact, an anomaly. By manipulating these thresholds, we can trade off the FPR and FNR errors described in Section 2.3: lower threshold values reduce the FNR and increase the FPR, and vice versa. In this case, the quality of the detection system, regardless of the adopted detection threshold, can be measured by the area under the ROC curve (AUC), where the coordinate axes are the FPR and FNR and the tolerance threshold (or a combination of thresholds) is the curve parameter.
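As an illustration of this trade-off, the sketch below computes per-sample FPR and FNR values over a grid of detection thresholds and approximates the area under the resulting FPR–FNR curve with the trapezoidal rule; the variable names and placeholder data are ours, not part of the published pipeline.

```python
import numpy as np

def fpr_fnr_curve(scores, labels, thresholds):
    """Per-sample FPR and FNR for each detection threshold."""
    pos = labels.astype(bool)   # 1 = anomalous sample (ground truth)
    neg = ~pos
    fpr, fnr = [], []
    for t in thresholds:
        pred = scores >= t
        fpr.append((pred & neg).sum() / max(neg.sum(), 1))   # false alarms
        fnr.append((~pred & pos).sum() / max(pos.sum(), 1))  # missed anomalies
    return np.array(fpr), np.array(fnr)

thresholds = np.linspace(0.0, 1.0, 101)
scores = np.random.rand(10_000)                      # placeholder detector outputs
labels = (np.random.rand(10_000) < 0.05).astype(int) # placeholder ground truth
fpr, fnr = fpr_fnr_curve(scores, labels, thresholds)

# Area under the FPR-FNR curve via the trapezoidal rule (sorted by FPR).
order = np.argsort(fpr)
f, g = fpr[order], fnr[order]
auc = float(np.sum(np.diff(f) * (g[1:] + g[:-1]) / 2.0))
```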
The test error values obtained for the AEB method described in Section 3.3 are presented in Table 6, and those for the UNB method described in Section 3.4 are presented in Table 7.
Comparing the test FPR and FNR error values and the IOU rates for the NPH, AEB, and UNB models presented in Table 5, Table 6, and Table 7, respectively, it can be seen that the error values were significantly smaller for the AEB and UNB methods, whereas the IOU rates were higher; however, this does not necessarily mean that these methods are better than the NPH for anomaly detection on end devices. It is equally important to consider the processing power demand of the end device, the amount of data available for training, and the level of support from the cloud instance for continuous improvement of the model parameters. The weaker results obtained for the NPH method compared with the AEB and UNB methods are probably due to the quality of the heuristics, which were not optimized. On the other hand, heuristics based on human knowledge of the problem may help to deal with small datasets, on which training autoencoders can be difficult. Nevertheless, the results of anomaly segmentation alone with the NPH method were significantly better than those with the SA method: the average IOU value for the test data for the SA method was nearly half of that obtained with the NPH method, as presented in Table 8. The AEB and UNB methods each present certain advantages and disadvantages. For example, one advantage of the latter is its higher prediction accuracy, as confirmed by our test results, while its disadvantage is a higher power demand due to its more extensive computational structure in comparison with the AEB, which may be a crucial consideration for a constrained device.
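For reference, the IOU rate used throughout these comparisons can be computed for 1-D anomaly masks as in the following sketch (the mask layout is illustrative):

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over union of two boolean 1-D anomaly masks."""
    inter = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return inter / union if union else 1.0  # both empty: perfect agreement

# Example: a detected anomaly partially overlapping a true anomaly region.
true_mask = np.zeros(100, dtype=bool); true_mask[40:60] = True
pred_mask = np.zeros(100, dtype=bool); pred_mask[45:65] = True
print(iou(pred_mask, true_mask))  # 0.6 (15-sample overlap / 25-sample union)
```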
Table 9 presents the AUC values obtained for both AEB and UNB methods. These values are presented only for these two methods as, in the case of NPH, the anomaly probabilities were not determined for individual samples due to the heuristic nature of this method.
ROC curves for anomaly segmentation in temperature (T) signals are presented in Figure 12, in moisture (M) signals in Figure 13, and in pH signals in Figure 14. The parameter of each curve is the anomaly probability threshold; the curves run from the upper-right corner to the lower-left corner of the plot. Red indicates the curve for the training data (training sensors) and blue the curve for the test data (test sensors). The circle marks the point on each curve corresponding to the threshold that yields the maximum IOU value for the training data, as determined by Algorithm 3.
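The threshold marked by the circle can be obtained by a simple sweep maximizing the IOU on the training data; the sketch below is a simplified stand-in for Algorithm 3, not its exact formulation.

```python
import numpy as np

def best_threshold(scores, true_mask, thresholds):
    """Pick the threshold maximizing IOU on the training data.

    scores: per-sample anomaly probabilities from the model;
    true_mask: ground-truth anomaly samples (boolean array).
    """
    best_t, best_iou = thresholds[0], -1.0
    for t in thresholds:
        pred = scores >= t
        union = np.logical_or(pred, true_mask).sum()
        score = (np.logical_and(pred, true_mask).sum() / union) if union else 1.0
        if score > best_iou:
            best_t, best_iou = t, score
    return best_t, best_iou
```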
Figure 15, Figure 16, and Figure 17 show plots of the T, M, and pH signals from one of the test sensors, together with the regions of true and detected anomalies determined by the three methods used in this work. Signals from this sensor were used only for testing; thus, the results indicate the generalization ability of the models. The results for the signals used for model training are correspondingly better and more accurate.
4.3. Discussion
All of the previously discussed models for detecting and segmenting anomalies in the analyzed signal fragments, after appropriate optimization/tuning on the cloud instance (performed according to the schemes depicted in Figure 2 and Figure 3) and conversion from Python to C, were integrated with the sensor's software to implement a daily (every 24 h) cleaning cycle, delimited by two consecutive leading edges of the PV signal. Measurements of the individual T, M, and pH signals are carried out with different sampling periods, depending on the dynamics of each signal; in the current implementation of our sensors, this period is approximately 10 min for all signals. After the cleaning process, the sample values of individual signals can be aggregated over longer time intervals without losing information, up to a maximum of 70 min for M, 90 min for T and PV, and 220 min for pH. This aggregation is necessary to keep the byte payload of the data sent to the UAV appropriately small, as required by the LoRaWAN specification (up to 250 bytes per single packet in Europe) [13].
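The following sketch illustrates this kind of aggregation, assuming non-overlapping mean windows over the 10 min samples; the window lengths come from the text above, while the function and variable names are ours.

```python
import numpy as np

SAMPLE_PERIOD_MIN = 10  # current sensor sampling period

def aggregate(samples, window_min):
    """Average consecutive samples over non-overlapping windows."""
    k = window_min // SAMPLE_PERIOD_MIN   # samples per aggregation window
    n = len(samples) // k * k             # drop an incomplete trailing window
    return samples[:n].reshape(-1, k).mean(axis=1)

day = np.random.rand(144)      # one day of 10 min samples (placeholder)
m_agg = aggregate(day, 70)     # moisture: 70 min windows -> ~20 values/day
t_agg = aggregate(day, 90)     # temperature and PV: 90 min -> 16 values/day
ph_agg = aggregate(day, 220)   # pH: 220 min windows -> ~6 values/day
```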
The following steps briefly summarize our daily cleaning cycle, as previously specified in detail in [15]:
1. Samples delayed due to a power gap may have valid values, so they are preserved but marked as “misplaced” for further processing in the cloud.
2. All “absolute error” samples are set to the “empty” value.
3. The values of samples labeled by the anomaly detector as “peak” or “jump” are interpolated from their neighbors.
4. Samples labeled as “bump” are smoothed relative to the values of their adjacent fragments.
5. Finally, samples labeled as “instabilities” are replaced with the signal trend samples, calculated as a daily moving average.
The rationale behind these steps is to first eliminate samples that may affect signal smoothing (steps 1 and 2), then correct outliers and the most abrupt change points (step 3), followed by less abrupt ones (step 4). Smoothing is necessary to avoid the bias that could otherwise be introduced when calculating over the values neighboring more gentle changes, such as bumps and instabilities (steps 4 and 5).
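A minimal sketch of this cleaning cycle is given below, assuming per-sample string labels produced by the detector; the label encoding, helper names, and the 5-sample smoothing window are illustrative assumptions rather than the sensors' actual implementation (which is written in C).

```python
import numpy as np
import pandas as pd

def clean_daily(values, labels, trend_window=144):
    """Illustrative daily cleaning cycle following steps 1-5 above.

    values: one day of samples; labels: per-sample strings such as
    "ok", "misplaced", "absolute_error", "peak", "jump", "bump",
    "instability" (an assumed encoding, not the sensors' actual one).
    """
    s = pd.Series(values, dtype="float64")
    # Steps 1-2: "misplaced" samples are kept as-is; absolute errors are blanked.
    s[labels == "absolute_error"] = np.nan
    # Step 3: interpolate peaks and jumps from their neighbors.
    s[np.isin(labels, ["peak", "jump"])] = np.nan
    s = s.interpolate(limit_direction="both")
    # Step 4: smooth bumps relative to adjacent fragments.
    smoothed = s.rolling(5, center=True, min_periods=1).mean()
    s[labels == "bump"] = smoothed[labels == "bump"]
    # Step 5: replace instabilities with the daily moving-average trend.
    trend = s.rolling(trend_window, center=True, min_periods=1).mean()
    s[labels == "instability"] = trend[labels == "instability"]
    return s.to_numpy()
```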
We evaluated the effectiveness of each of the previously analyzed anomaly detection methods defined in Figure 1 in the same way as in [15], that is, by measuring the average distances between each reference signal (ground truth) and the signals cleaned via the operations specified above. The polar charts shown in Figure 18 present the results obtained for all seven devices (s01–s03, s10, s21–s23), with distances averaged across 10 cleaned signals for every week of the lifetime of each sensor.
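For illustration, a mean absolute difference between the reference and cleaned signals can be assumed as the distance measure, as in the following sketch (the exact metric used in [15] may differ):

```python
import numpy as np

def avg_distance(reference, cleaned):
    """Mean absolute distance between reference and cleaned signals.

    Assumed for illustration; the distance definition in [15] may differ.
    """
    reference, cleaned = np.asarray(reference), np.asarray(cleaned)
    return float(np.mean(np.abs(reference - cleaned)))
```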
As a reference for assessing the effectiveness of time-series cleaning using the detectors based on the NPH, AEB, and UNB neural network models, we used a heuristic method with simulated annealing-optimized (SA) detector parameters for each anomaly class specified in Figure 1. At least two time series were used for each signal (M, T, and pH): one with truly labeled anomalous samples, and another with samples labeled by the detector being optimized. Several independent optimization tasks were created in this way and executed in parallel in the computing cloud, each targeting a specific combination of anomaly and signal types. As a result, the best possible parameter values for the individual anomalies were determined, such that each detected anomaly corresponded as closely as possible to the true anomalies [15].
The graph shows a relatively small improvement in signal quality with NPH over SA, whereas the improvements of AEB and UNB over SA are more significant. However, in the case of detection alone (without taking into account the effects of the cleaning algorithm), the NPH method was clearly better than the SA method, yielding almost twice the IOU value, evidence of which may be found in [14]. It also performed better than the AEB method in some cases (e.g., in segmenting bump anomalies in the moisture signal, as can be seen from Table 5 and Table 6).
The worst segmentation quality was observed for jump anomalies, as indicated by both the IOU values presented in Table 5, Table 6, and Table 7 and the AUC values presented in Table 9. The likely reason is that a jump anomaly involves only a momentary change in the trend of the time series, while the trend outside of the jump appears normal, which makes such anomalies difficult to detect.
The best segmentation quality was achieved for the temperature signal. This is likely due to the greater predictability of the trend and its daily periodicity, which makes it easier to distinguish anomalies from random noise.
As shown in Table 9 and Figure 12, Figure 13 and Figure 14, the segmentation results for the training data were clearly better than those for the test data. This suggests that the test results could be improved through the use of larger amounts of training data.
The comparison of anomaly detection performance against processing time for the three methods is presented in Table 8. The intersection over union (IOU) measure was averaged across 12 detectors, covering the combinations of four anomaly types and three measured physical parameters. It can also be considered a measure of generalization ability, as it was calculated only for the test sensors. The number of addition operations in neural network model processing is approximately the same as the number of multiplications in each method, and both are approximately equal to the number of model parameters. Therefore, the total number of arithmetic operations can be estimated by multiplying the number of multiplications by 2. Memory usage is also roughly proportional to the number of parameters of the neural model.
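These approximations can be turned into a back-of-envelope cost estimate, as in the sketch below; the parameter counts used in the example are illustrative, not the actual AEB/UNB model sizes.

```python
def model_cost(n_params, bytes_per_param=4):
    """Rough per-inference cost from the approximations above.

    Multiplications ~= additions ~= n_params, so total arithmetic
    operations ~= 2 * n_params; memory ~= n_params * parameter size
    (4 bytes for float32 weights).
    """
    return {"ops": 2 * n_params, "memory_bytes": n_params * bytes_per_param}

# Illustrative parameter counts (not the actual AEB/UNB model sizes).
for name, n in [("AEB", 10_000), ("UNB", 2_000_000)]:
    cost = model_cost(n)
    print(name, cost["ops"], "ops,", cost["memory_bytes"] / 1024, "KiB")
```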
The last row of Table 8 provides the average time complexity, expressed as processing time (excluding the time required to load the neural models from disk into working memory). These values are not directly proportional to the number of multiplications (i.e., neural network model parameters) due to other operations, such as input vector normalization, output vector summation, and heuristic segmentation in the case of the NPH method. Individual times were determined using one CPU and one GPU of an Nvidia H100 machine, with the segmentation scripts written in Python. However, in the target microcomputer environment equipped with measurement devices, C code is expected to be used; given the different hardware capabilities and implementation languages, the processing times may be much longer and in different proportions for the individual methods. In terms of anomaly detection accuracy, the best method appears to be the U-Net-based (UNB) model. However, the large number of arithmetic operations required per signal sample can be problematic on a constrained device running an MCU. Therefore, the AEB method seems to be a reasonable choice, as it offers only slightly lower accuracy while requiring nearly 230 times less processing time. On the other hand, while the NPH method appears to be dominated by the other two methods, it has potential for improvement (especially regarding the heuristic component).
Anomaly detection in sensor-generated time series is highly application-specific, and there is no universal method suitable for every scenario. In IoT and industrial contexts, models such as ARIMA, machine learning classifiers, clustering, and deep learning require tailored selection and tuning depending on the nature of the sensor data and the anomaly type [29]. For example, statistical and forecasting models are appropriate when the time-series data exhibit clear trend and seasonal structures, while an Isolation Forest scales well to high-dimensional, sparse, or unbalanced datasets. In environmental monitoring, such as high-frequency water quality sensing, researchers have found that combining regression-based methods, feature-based detection, and rule-based techniques improves performance, as each method excels at capturing different types of anomalies, such as spikes, drift, or missing values [30].
Similarly, when comparing various deep learning frameworks, it has been highlighted that even state-of-the-art models (e.g., autoencoders, graph-based networks, LSTM encoder–decoders) cannot be universally applied across systems, as their effectiveness depends on many factors, such as cross-sensor relations, temporal dependencies, and system-specific characteristics [31]. Autoencoders effectively adapt to complex, non-linear patterns [32]. Advanced multivariate deep models (e.g., InterFusion) are ideal for multi-sensor systems where both temporal patterns and cross-sensor dependencies matter [33]. GAN-based approaches such as TAD-GAN are powerful for anomaly detection, as they can jointly perform realistic time-series generation and discrimination, enabling subtle deviations from normal behavior to be effectively identified [34].
Consequently, the design of anomaly detection systems must be adapted to the specific application context, data characteristics, anomaly types, and operational requirements.