1. Introduction
Reliable high-resolution rainfall data are essential for hydrological and climatic applications. For example, Berne et al. [1] showed that high-resolution rainfall data are required for urban catchment simulation because of the short response times involved, while Huang et al. [2] demonstrated that the quality of hydrologic simulations in a natural catchment improved as the resolution of the rainfall data increased. Low-resolution data cannot capture the evolution of the short convective rainfall events that occur during the summer season [3], especially those leading to flash flooding in rapid-response catchments [4]. The frequency and intensity of short-duration rainfall events are increasing under climate change, further demanding accurate sub-hourly rainfall measurements [4,5]. Collecting uninterrupted rainfall time series at a fine temporal resolution remains an operational challenge because equipment malfunctions, sensor calibration errors, network communication failures, and maintenance downtimes may result in incomplete datasets, affecting the reliability of subsequent analyses [6,7]. Even quality control operations on rainfall data may produce missing values, since outliers and values that are too frequent or too low must be excluded [8]. The need for long and continuous high-resolution rainfall records, together with the possible presence of gaps in these records, makes the imputation of missing values a necessary preprocessing step for obtaining high-quality outcomes from hydrologic models.
While studies have largely addressed the imputation of monthly [9,10] or daily precipitation [11,12], the sub-hourly scale introduces new challenges. Because of rainfall intermittency, the probability of finding no-rainfall values in the records decreases as the time scale coarsens [13]; high temporal resolution rainfall data therefore form zero-inflated time series. This can reduce the performance of missing data imputation models because of the bias introduced by the excess of zero values [14]. On the other hand, spatial correlations of high-resolution rainfall data decrease more rapidly with distance than those of coarser-resolution data [15], reducing the capacity of neighboring rain gauges to aid the imputation of missing data at the target station for high-resolution time series.
The simplest techniques for the imputation of missing data in time series rely on basic statistical indicators such as the mean, median, or mode [16], but these approaches may fall short because of the autocorrelation and persistence present in rainfall records [17]. The availability of neighboring rain gauges allows the use of spatial interpolation techniques such as Inverse Distance Weighting (IDW) [9,18,19,20], possibly with optimized parameters [21], Multiple Linear Regression [9,18], and Ordinary Kriging (OK) [19,20,22,23,24], among others. Geostatistical techniques like OK work better on flat terrain but may perform relatively poorly in mountainous areas [25]. For this reason, Detrended Universal Kriging and Co-Kriging have been introduced to account for data trends and additional explanatory variables such as gauge elevation [24].
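For readers less familiar with IDW, a minimal sketch of how a single missing 10-min value could be filled from neighboring gauges is given below; the function name, the power parameter (p = 2), and the toy coordinates are illustrative assumptions rather than settings taken from the cited studies.

```python
import numpy as np

def idw_impute(target_xy, neighbor_xy, neighbor_values, power=2.0):
    """Estimate a missing value at target_xy from neighboring gauges
    using Inverse Distance Weighting (weights ~ 1 / distance**power)."""
    neighbor_xy = np.asarray(neighbor_xy, dtype=float)
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    distances = np.linalg.norm(neighbor_xy - np.asarray(target_xy, dtype=float), axis=1)
    # If a neighbor coincides with the target location, return its value directly.
    if np.any(distances == 0):
        return float(neighbor_values[distances == 0][0])
    weights = 1.0 / distances**power
    return float(np.sum(weights * neighbor_values) / np.sum(weights))

# Toy example: three neighboring gauges (coordinates in km) and their
# 10-min rainfall depths (mm) at the time step to be imputed.
estimate = idw_impute(
    target_xy=(0.0, 0.0),
    neighbor_xy=[(1.0, 0.0), (0.0, 2.0), (3.0, 3.0)],
    neighbor_values=[1.2, 0.8, 0.0],
)
print(f"IDW estimate: {estimate:.3f} mm")
```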
Recently, machine learning (ML) approaches have emerged as a promising family of tools for missing environmental data imputation [16], obtaining improved results over classic simple regression models [26] and offering greater flexibility. Among the most widely used machine learning models are Multilayer Perceptrons (MLPs), Support Vector Machines (SVMs), and Random Forests (RFs), which have been successfully applied to various environmental and hydrological problems. Several studies have reported that ML models effectively capture nonlinear relationships in rainfall data. Sattari et al. (2020) [27] demonstrated the potential of kernel and tree-based machine learning models, such as Support Vector Regression, Gaussian Process Regression, and RF, for estimating missing rainfall data. Bellido-Jiménez et al. (2021) [28] compared different ML approaches for filling gaps in daily rainfall series in a semiarid area (Andalusia, Spain), finding that the MLP obtained the best results at most stations. Lupi et al. (2023) [29] evaluated various ML models, including linear regression, long short-term memory networks, and convolutional neural networks, for the imputation of missing daily rainfall data; they found these models effective in handling incomplete datasets, with the quality of the input data playing a critical role in model performance. Praveen Kumar and Dwarakish (2025) [30] considered different classic regression and ML approaches for filling gaps in daily rainfall series in the Kali River basin (India), finding that the K-Nearest Neighbor approach performed best among the models compared. From this brief review, it emerges that ML approaches seem suitable for the gap-filling task in rainfall time series under different climates and in various regions.
Of course, ML approaches obtain the best results when they are built to incorporate the structure of the missing data to be imputed [31,32], allowing for an improved representation of the spatial and temporal dynamics that characterize the process [16]. In the case of missing rainfall data, this can be done in different ways. The first strategy integrates additional data, such as wind velocity and direction, elevation, humidity, air pressure, and temperature, allowing for a more robust imputation of the missing data by exploiting the available covariates [14,33]. The second strategy, which copes with zero-inflated time series, consists of breaking the imputation process into two steps: first, time intervals with missing data are classified as rainy or non-rainy; second, missing rainfall values are imputed only in the intervals classified as rainy [14,33,34,35,36]. In the third strategy, lagged data from different time steps are used to incorporate the temporal dynamics of the process to be imputed [16].
Especially relevant for the present discussion are the works by Chivers et al. (2020) [14] and Han et al. (2023) [33], in which both the strategy of incorporating additional data and the two-step approach were exploited. Chivers et al. (2020) [14] integrated sub-hourly (30-min) data collected from 37 weather stations (including precipitation, air pressure and temperature, relative humidity, wind speed and direction, and volumetric water content) with rain gauge records in the United Kingdom (UK). Several ML models (gradient tree boosting, K-nearest neighbors, Random Forest, Support Vector Machines, and neural networks) were tested, adopting a two-step approach. The best performance was achieved with the most comprehensive dataset, i.e., the one that included all the available soil moisture and hydro-meteorological features for each monitoring site and all the rain gauge stations surrounding the target site within a 30 km radius. Han et al. (2023) [33] used daily meteorological data (precipitation, humidity, air pressure and temperature, and wind speed) from 30 weather stations across South Korea to reconstruct missing daily precipitation values. They tested the performance of two ML models, Artificial Neural Networks (ANNs) and RF, for both the classification and regression steps.
Despite these advances, few studies have investigated the application of ML-based imputation to sub-hourly rainfall, particularly under data-scarce conditions. In operational networks, sensors are sparse and additional meteorological variables, including wind speed and direction, and air temperature and pressure, may be limited or absent [28,37]. Under such constraints, imputation at fine temporal resolutions requires techniques that can effectively handle heterogeneous, feature-limited environments.
To address this gap, the present study conducted a comparative analysis of RF, MLP, OK, and IDW for imputing missing 10-min rainfall records, applying two different approaches for each model. In the direct approach, raw data are used directly to feed the regression models. In the two-step approach, an RF model is first used as a classifier to separate rain and no-rain instances, and a regression model is then fed only with the instances classified as rainy. This comparison clarifies the relative strengths and limitations of the models and provides practical guidance for researchers and practitioners working with high-resolution hydrometeorological datasets, specifically in regions with limited access to additional meteorological data.
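As an illustration of the two approaches being compared, the following sketch shows how a two-step pipeline could be assembled with scikit-learn; the variable names, hyperparameters, and rain threshold are assumptions for the example and do not reproduce the exact configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# X contains predictor features (e.g., rainfall at neighboring gauges),
# y the 10-min rainfall depth at the target gauge. Names are illustrative.

def fit_two_step(X_train, y_train, rain_threshold=0.0):
    """Two-step approach: classify rain / no-rain, then regress on rainy cases only."""
    is_rain = y_train > rain_threshold
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, is_rain)
    reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(
        X_train[is_rain], y_train[is_rain]
    )
    return clf, reg

def predict_two_step(clf, reg, X_new):
    """Impute zero for intervals classified as dry, a regressed depth otherwise."""
    y_hat = np.zeros(len(X_new))
    rainy = clf.predict(X_new).astype(bool)
    if rainy.any():
        y_hat[rainy] = reg.predict(X_new[rainy])
    return y_hat

# Direct approach, for comparison: a single regressor fitted on all intervals.
# reg_direct = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
```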
4. Discussion
The results presented clearly demonstrate the value of incorporating a classification step before regression for the imputation of missing data in high-resolution rainfall time series (10-min step). The numerical idea behind this approach is to lighten the regressors' task by restricting imputation to rainy time intervals, thereby avoiding the noise introduced by the numerous no-rain intervals that are present at short time scales because of the intermittency of rainfall. This is further supported by the fundamental physical differences between rainy and no-rain time intervals. During rainy periods, processes such as atmospheric lifting, cloud formation, moisture condensation, and precipitation dynamics dominate. In contrast, dry intervals are governed by entirely different dynamics, including evaporation, atmospheric stability, and the absence of significant convective activity. The differences between these physical regimes make the sets of rainy and dry intervals inherently distinct and non-interchangeable.
4.1. The Classification Step
In this context, the Random Forest model exhibited consistently high performance in classifying rainy and dry periods across all stations in both the Calore River basin and the Metropolitan City of Naples, even without resorting to additional covariates. A short comparison with the existing literature is instructive. Average Accuracy scores for the classification of rain and no-rain time intervals were 0.928 for the Calore River basin and 0.946 for the Metropolitan City of Naples. Average Precision scores also remained high, 0.889 for the Calore River basin and 0.872 for the Metropolitan City of Naples, indicating that the classification model was effective at minimizing false positives. Although the average Recall score was lower, 0.334 for the Calore River basin and 0.345 for the Metropolitan City of Naples, the classifier nevertheless succeeded in isolating a representative subset of rainy time intervals. Finally, the average F1 score and weighted F1 score were 0.483 and 0.961, respectively, in the Calore River basin, and 0.494 and 0.957, respectively, in the Metropolitan City of Naples.
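For reference, the classification metrics discussed in this subsection follow the standard definitions below, with the rainy class taken as positive (TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives); the weighted F1 score averages the per-class F1 values weighted by class support, which explains why it remains high despite the modest Recall on the minority rainy class.

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```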
In contrast, information about precipitation, air pressure and temperature, relative humidity, wind speed and direction, and volumetric water content, at a 30-min step, was integrated in [14] to obtain a weighted F1 score of 0.938, an Accuracy of 0.943, a Recall of 0.648, and a Precision of 0.847 (averaged over all the stations). Similarly, daily meteorological data (precipitation, humidity, air pressure and temperature, and wind speed) were integrated in [33] to obtain an F1 score of 0.68, an Accuracy of 0.83, a Recall of 0.64, and a Precision of 0.77. The comparison with the existing literature shows that, at least under a typical Mediterranean climate, good performance can be obtained in the classification step even for rainfall series with a 10-min time step and without resorting to covariates.
4.2. The Regression Step
Overall, the combination of a high-performing classifier and regression models led to more accurate and stable rainfall predictions across both regions and all modeling approaches. The improvements were especially significant in the Calore River basin, owing to its denser station network, but they also extended to the coastal environment of Naples. The highest performance in the regression step was obtained by the RF in the Calore River basin, with an R2 of 0.541 and an RMSE of 0.109 mm, considering a spatial configuration including all stations. These results align well with those presented in [14], where the overall neural network performance was characterized by an R2 of 0.66 and an RMSE of 0.141 mm. Similarly, an R2 value of 0.53 was obtained in [33].
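For clarity, and assuming the standard definitions, the regression metrics cited above are computed over the n imputed values y_i, their estimates ŷ_i, and the observed mean ȳ as:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```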
The consistent superiority of the All-station configuration in the present work further underscores the importance of maximizing spatial information when it is available, particularly when paired with a method that filters the training data.
4.3. Rainstorm Space-Time Dynamics
Independent of the spatial configuration strategy and of the presence of the initial rain/no-rain classification step, the machine learning regressors (RF and MLP) were flexible enough to accept input data from time intervals other than the target one, allowing the internal temporal structure of the rainfall event to be exploited. Again, this numerical approach has an underlying physical justification. Data recorded at a target station hit by a precipitation cluster during the target time interval are expected to correlate well with rainfall data from the positions along the cluster trajectory at the instants of the cluster passage. In other words, the target data may be better correlated with data from distant positions at lagged time intervals, if these data are all connected by the passage of the same precipitation cluster, and less correlated with data at the same target time interval from surrounding positions that do not lie along the cluster trajectory.
Figure 14 contrasts the use of a 21-time-interval window with the use of data from the target time interval only, within the RF two-step approach, for both the All and 3S scenarios (Calore River basin). Incorporating this temporal context only slightly increased the median R2 values but considerably reduced the dispersion of results across stations, indicating more accurate and consistent predictions. By considering a time window centered on the target time step, the model leverages the temporal structure of the data, improving its ability to estimate missing values.
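To make the sliding-window idea concrete, the sketch below builds, for each target time step, a feature vector from the neighboring gauges over a window of ±10 lags (21 intervals in total, matching the window size discussed for Figure 14); the DataFrame layout, column names, and helper function are assumptions for the illustration.

```python
import pandas as pd

def build_lagged_features(df, target_col, n_lags=10):
    """df: DataFrame indexed by time (10-min step), one column per gauge.
    Returns X with neighbor records shifted by -n_lags..+n_lags and y for the target gauge."""
    neighbors = df.drop(columns=[target_col])
    lagged = {
        f"{col}_lag{lag:+d}": neighbors[col].shift(lag)
        for col in neighbors.columns
        for lag in range(-n_lags, n_lags + 1)
    }
    X = pd.DataFrame(lagged, index=df.index)
    y = df[target_col]
    # Drop rows where the window is incomplete at the series boundaries.
    valid = X.notna().all(axis=1)
    return X[valid], y[valid]

# Example with an assumed DataFrame 'rain' holding 10-min series for gauges "S1"..."S4":
# X, y = build_lagged_features(rain, target_col="S1", n_lags=10)
```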
4.4. Limitations
Despite the encouraging results obtained in a data-scarce environment, it is important to acknowledge that the regression metrics, particularly the R2 values, leave room for further enhancement. The predictive capacity of the models could likely be improved by incorporating additional meteorological variables, such as temperature, relative humidity, wind speed, wind direction, and atmospheric pressure, as input features [14,33]. These covariates could provide additional contextual information that influences rainfall formation and intensity, allowing the models to better distinguish spatial and temporal rainfall variability beyond what can be captured by precipitation data alone. The inclusion of covariates would be particularly beneficial in machine learning frameworks such as RF and MLP, which are well suited to handling multivariate and nonlinear relationships.
Finally, the analysis involved two different areas, a coastal one and an inland mountainous one, within a single region (Campania, Italy). Although this does not undermine the results of the present study, future analyses could benefit from verifying the procedure in additional areas characterized by different topographies and climates.
5. Conclusions
Complete sub-hourly rainfall datasets are required for practical and theoretical purposes, among which are flood modeling, real-time forecasting, and the understanding of short-duration rainfall extremes. Nonetheless, these datasets often contain missing values due to sensor or transmission failures, so filling data gaps in rainfall series is an important preprocessing activity. In this study, two distinct missing data imputation approaches were compared:
- (a) The direct approach (DA) feeds raw data directly into one of the selected imputation models (Multilayer Perceptron, MLP; Random Forest, RF; Inverse Distance Weighting, IDW; or Ordinary Kriging, OK);
- (b) The two-step approach (TSA) first classifies time steps as rain or no-rain with an RF classifier and then applies one of the selected imputation models only to the instances classified as rain, predicting their rainfall depth.
The two approaches were separately applied to three different rainfall gauge spatial configurations, namely all the available stations, the three best correlated stations, and a cluster of well-correlated stations. The results show that:
- (i) The two-step approach exhibited improved accuracy (with a minimum relative gain in R2 of 25% over the direct approach), showing that missing rainfall data imputation benefits from breaking the process into stages in which different physical conditions, such as rain and no-rain, are handled separately.
- (ii) Machine learning models (MLP and RF) outperformed the interpolation methods (IDW and OK), thanks to their ability to approximate any continuous function and to learn complex internal data structures. The highest performance in the regression step was achieved by the RF, with an R2 of 0.541 and an RMSE of 0.109 mm.
- (iii) The spatial configuration that considered all the available stations yielded performance that was better than or equivalent to the other scenarios, underscoring the importance of maximizing spatial information when it is available and providing practical guidance for similar applications.
- (iv) The use of time intervals lagged with respect to the target time interval, which is easily implemented within machine learning models, allowed the dynamics occurring between different rain gauges to be reconstructed and precipitation clusters to be tracked.
The methodology was tested in a region representative of the Mediterranean climate, characterized by heterogeneous rainfall formation phenomena and strong seasonality. In principle, the approach can be applied at different temporal resolutions (e.g., 30-min, hourly, or daily) and replicated in other regions with diverse climatic and topographic conditions. Future research could further explore the performance and adaptability of the proposed approach under different climatic conditions and precipitation regimes to better assess its transferability.
In conclusion, the findings of the present study highlight the benefits derived from adapting the imputation approach to the intermittent nature of the data (two-step approach) and to its space–time dynamics (sliding time window). Although further refinement is still required to improve the accuracy of precipitation predictions, these results confirm the value of machine learning methods for recovering missing sub-hourly rainfall data, even in data-scarce contexts where rain gauges may be the only source of high-resolution rainfall information.