1. Introduction
Understanding river surface flow velocity is essential for hydrodynamic modelling, flood risk management, sediment transport studies, ecosystem monitoring, and hydraulic engineering applications [
1,
2]. Traditional in situ techniques, such as current meters, float tracking, and acoustic Doppler current profilers (ADCP), remain reliable for point or profile-based velocity measurements [
1,
3], but are often costly, time-consuming, and limited in spatial and temporal coverage [
4,
5]. Their use is further constrained in hazardous, remote, or rapidly changing environments, where fast, large-scale, and non-invasive measurements are required [
2,
6]. In this context, unmanned aerial vehicles (UAVs) have emerged as a transformative technology for river monitoring, offering rapid deployment, flexible operation, and the ability to collect high-resolution imagery from previously inaccessible locations [
7]. These advances have led to the growing adoption of image-based velocimetry methods that use UAV-derived video or imagery to retrieve surface velocity information over wide river reaches [
1,
8].
The family of image-based velocimetry techniques includes large-scale particle image velocimetry (LSPIV) [
9], large-scale particle tracking velocimetry (LSPTV), space–time image velocimetry (STIV) [
10,
11], and optical flow methods [
12,
13]. LSPIV and LSPTV rely on visible tracers on the water surface, either naturally occurring, such as foam or floating debris, or artificially introduced particles [
1,
14]. STIV estimates surface flow velocity by analyzing spatiotemporal patterns of tracers in video sequences, providing a non-intrusive alternative for river flow monitoring [
12]. Although these methods can provide accurate velocity fields in turbulent flows, their reliability decreases significantly in homogeneous, low-texture conditions or during floods, when tracer seeding is impractical or unsafe [
4,
14]. Various post-processing algorithms, such as Time Frequency Analysis (TiFA) applied to LSPIV results, aim to improve river surface velocity estimation in low tracer density conditions [
15]. Optical flow, by contrast, estimates pixel displacements directly between consecutive frames, allowing the retrieval of surface velocity vectors without explicit reliance on tracers [
2]. Early studies demonstrated that optical flow could produce spatially consistent velocity fields comparable to LSPIV, and subsequent developments extended its use to UAV imagery [
3,
6,
16].
Recent research has focused on improving the robustness of optical flow approaches for natural rivers by integrating deep learning techniques. Architectures such as Recurrent All-Pairs Field Transforms (RAFT) [
17] and Convolutional Neural Networks using pyramid, warping, and cost volume (PWC-Net) have significantly enhanced the ability to detect motion in challenging conditions, handling platform instability, reflections, and low contrast, which often limit classical methods [
18,
19]. Traditional optical flow methods typically use median filtering, spatial pyramids, and optimization approaches to minimize energy functions [
20], or variational models [
21], and usually process a single case. An important advantage of deep learning models is their good generalization, as networks like RAFT learn features and necessary patterns from data representing diverse environmental conditions [
17,
22]. Such models can operate on RGB video with limited textural content, where traditional particle-based approaches would struggle. Furthermore, real-time implementations have been proposed, such as UAV platforms integrating edge-computing units with convolutional neural networks and optimized optical flow algorithms, demonstrating the feasibility of near-operational hydrometric systems [
23].
There is considerable interest in integrating UAV-based RGB data with other sensors to address the limitations of single-modality observations. Thermal cameras enhance tracer visibility by detecting subtle surface temperature gradients, Doppler radar provides independent flow validation, and bathymetric sonar or light detection and ranging (LiDAR) technology offer geometric context for discharge estimation [
24,
25,
26]. These multi-sensor workflows have proven effective in specific case studies, such as combining thermal and RGB imagery to resolve velocity fields with deviations as low as 0.01 m/s [
27], or using UAV bathymetry alongside velocimetry for discharge estimation in complex morphologies [
28]. However, employing multiple sensors introduces additional complexity, costs, and payload requirements, which may limit practical deployment, particularly in emergency scenarios.
Despite the promise of sensor fusion, the most widely available and practical option remains the use of RGB cameras, which are standard on nearly all UAV platforms. Focusing on RGB video offers unmatched accessibility and deployment speed but introduces challenges that must be addressed to ensure accuracy. River reaches with low turbulence or limited surface tracers provide little textural contrast for tracking, while reflections, glare, and illumination changes reduce the reliability of optical flow estimation [
2,
16]. UAV motion adds further noise, requiring stabilization and geometric corrections such as deshaking, lens distortion correction, and orthorectification [
6]. When these pre-processing steps are adequately implemented, studies have shown that UAV-based RGB velocimetry can achieve accuracy comparable to conventional field instruments. For instance, Eltner et al. [
27] and Torres et al. [
29] demonstrated deviations below 10% compared to ADCP and field measurements, whereas Koutalakis et al. [
30] highlighted limitations in intermittent streams where dense vegetation and poor textural conditions compromised results.
Previous work by Kriščiūnas et al. [
31] introduced a framework for UAV-based river flow velocity determination using optical flow recognition. It highlighted two major challenges: the lack of suitable datasets for training robust models and the inherent limitations of frame-to-frame optical flow estimation. Motivated by these gaps, this paper presents a modified Multi-Frame RAFT architecture (MF-RAFT) with integration of a gated recurrent fusion unit module (Fuse-GRU) to enable river flow prediction from aerial video RGB streams. A comprehensive analysis of the results is conducted for different spatial resolutions, temporal strides, and valid-pixel coverage groups across datasets obtained from three river stretches.
2. Materials and Methods
The general workflow of this study is summarized in
Figure 1. It comprises two complementary dataset preparation paths. Steps 1 and 2 involve in situ data collection and the construction of a physical flow model for the specific river segment using the finite element method (FEM), to obtain reference velocity fields expressed as vector fields. Steps 3 and 4 involve UAV-based RGB video acquisition of the same segment, followed by calibration and georeferencing to correct perspective distortions and ensure spatial consistency. Both dataset branches are integrated in Step 5, where reference and UAV-derived data are combined into a single dataset. This dataset is then used in Step 6 to develop and evaluate an artificial intelligence model for predicting river flow velocity from UAV imagery.
Although presented sequentially in
Figure 1, the methodological details are discussed thematically in the following subsections. Specifically, the optical flow formulation underlying velocity estimation is provided in
Section 2.1, the SWE-FE reference model is described in
Section 2.2, the artificial intelligence model is described in
Section 2.3, the experimental study area and data collection are detailed in
Section 2.4, and the dataset preparation procedures are explained in
Section 2.5.
2.1. Optical Flow Formulation for River Velocity
Optical flow refers to the apparent motion of brightness patterns between two consecutive frames in an image sequence. The classical formulation relies on the brightness constancy assumption, which states that the intensity $I$ of a pixel at location $(x, y)$ and time $t$ remains unchanged as it moves over time. For small displacements, this constraint can be expressed as in Equation (1), following the classical formulation of Horn and Schunck [32].

$$I_x u + I_y v + I_t = 0 \tag{1}$$

Here, $(u, v)$ denotes the horizontal and vertical components of the optical flow vector. Equation (1) provides one constraint for two unknowns, leading to the well-known aperture problem, which is typically addressed by introducing spatial smoothness constraints or more advanced formulations. The terms $I_x$ and $I_y$ represent the change in pixel intensity in the horizontal and vertical directions, respectively. The temporal derivative term $I_t$ represents the change in pixel intensity between consecutive frames and is essential for linking the spatial displacement of features with their motion over time.
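A minimal sketch of how Equation (1) can be solved in practice is given below. It is not the method used in this study: it assumes a single constant flow vector inside one window and resolves the aperture problem by least squares (a Lucas-Kanade-style simplification rather than the Horn and Schunck smoothness term). The synthetic brightness pattern and the function names are hypothetical.

```python
import math

def pattern(x, y):
    # Smooth synthetic brightness pattern with gradients in both directions.
    return math.sin(0.2 * x) + math.cos(0.15 * y) + math.sin(0.1 * (x + y))

def estimate_flow(true_u, true_v, size=40):
    """Least-squares solution of I_x*u + I_y*v + I_t = 0 over one window."""
    # Frame 1 is the pattern; frame 2 is the same pattern shifted by (u, v).
    f1 = [[pattern(x, y) for x in range(size)] for y in range(size)]
    f2 = [[pattern(x - true_u, y - true_v) for x in range(size)]
          for y in range(size)]
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            # Central differences, averaged over both frames.
            ix = 0.25 * (f1[y][x + 1] - f1[y][x - 1]
                         + f2[y][x + 1] - f2[y][x - 1])
            iy = 0.25 * (f1[y + 1][x] - f1[y - 1][x]
                         + f2[y + 1][x] - f2[y - 1][x])
            it = f2[y][x] - f1[y][x]  # temporal derivative
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Solve the 2x2 normal equations [sxx sxy; sxy syy] [u v]^T = -[sxt syt]^T.
    det = sxx * syy - sxy * sxy
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

u, v = estimate_flow(0.3, -0.2)
print(f"estimated flow: u={u:.3f} px, v={v:.3f} px")
```

With only one pixel, the 2x2 system above would be singular, which is the aperture problem in its simplest form; pooling many pixels with differently oriented gradients makes it solvable.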
In the UAV-based river monitoring pipeline, image sequences are calibrated and georeferenced so that each pixel corresponds to a fixed ground location and the spatial resolution remains constant; that is, one pixel (px) represents a predefined constant value $r$ in meters (m) (1 px = $r$ m). Under these conditions, the displacement of features in the image plane can be directly converted into physical velocity vectors. If the flow velocity field within the river segment remains approximately constant during the observation period, the optical flow vectors remain unchanged between consecutive frames, provided that the time interval $\Delta T$ is constant:

$$\mathbf{d}_{1,2} = \mathbf{d}_{2,3} = \dots = \mathbf{d}_{N-1,N} \tag{2}$$

Here, $\mathbf{d}_{i,i+1}$ denotes the displacement vector between frames $i$ and $i+1$, and $N$ is the number of frames. Equation (2) highlights the temporal coherence property, which allows us to extend inference and model training beyond a simple two-frame input to sequences of frames, thereby improving robustness against noise, illumination changes, and local texture deficiencies.
Once the displacement $\mathbf{d}$ is estimated, the corresponding physical flow velocity $\mathbf{V}$ (in meters per second) can be derived as follows:

$$\mathbf{V} = \frac{r \, \mathbf{d}}{\Delta T} \tag{3}$$

Here, $r$ is the pixel-to-meter conversion factor obtained during georeferencing and $\Delta T$ is the frame interval. Equation (3) establishes the link between optical flow displacements and physical velocity, enabling UAV-based optical methods to provide quantitative estimates of river flow.
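The two relations above can be illustrated with a short sketch. It assumes that the frame-to-frame displacements of one pixel are simply averaged to exploit temporal coherence, which is an illustrative simplification and not the learned fusion used by the model; the function names and numeric values are hypothetical.

```python
def mean_displacement(displacements):
    """Average the frame-to-frame displacement vectors (px) of one pixel."""
    n = len(displacements)
    return (sum(d[0] for d in displacements) / n,
            sum(d[1] for d in displacements) / n)

def to_velocity(d_px, r_m_per_px, dt_s):
    """Equation (3): multiply by the pixel size and divide by the frame interval."""
    return (d_px[0] * r_m_per_px / dt_s, d_px[1] * r_m_per_px / dt_s)

# Three noisy displacement estimates of the same steady flow (hypothetical).
noisy = [(3.1, 3.9), (2.9, 4.1), (3.0, 4.0)]
d = mean_displacement(noisy)
# Hypothetical setup: 1 px = 0.01 m, frames 0.1 s apart.
vx, vy = to_velocity(d, r_m_per_px=0.01, dt_s=0.1)
print(f"velocity: ({vx:.2f}, {vy:.2f}) m/s")
```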
2.2. SWE-FE Model
The 2D model for each river segment was developed in the finite element (FE) software COMSOL Multiphysics 6.2, using the shallow water equations (SWE) application mode, following the scheme presented in [
33]. The SWE describe shallow flow in a 2D region, characterized by the river bottom elevation, which varies with the horizontal coordinates, and the water surface height. The SWE modelling results were calibrated and validated using field measurements of flow velocity at 0.6 of the total depth (measured from the surface). Bottom resistance was incorporated as bottom shear, calculated using a hydraulic resistance formula with an empirically determined coefficient. Flow-hindering stresses due to vegetation at specific points were set proportional to velocity and depth, with the coefficient selected according to the pre-determined vegetation type (vegetation-free, bank vegetation, or dense vegetation zones). The coefficients for each river stretch were adjusted by comparing computational results with measured data and selecting those that provided the best match for average flow velocity, the lowest relative velocity errors, and a balance between positive and negative error values. The SWE-FE model was constructed using the discontinuous Galerkin approach [
34] to discretize the SWE, and the Lax-Friedrichs flux [
35] to ensure numerical stability. Numerical integration in time was performed using an explicit Runge-Kutta method until the flow became stationary or near-stationary. The output of the FEM simulation is presented as vectors of flow velocities within the calculation area.
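To make the role of the numerical flux concrete, the sketch below evaluates a local Lax-Friedrichs (Rusanov-type) flux for the one-dimensional shallow water equations. It is a simplified illustration only: the study uses a 2D discontinuous Galerkin discretization in COMSOL, whose internal flux implementation is not reproduced here, and the function names and states are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def physical_flux(h, hu):
    """1D shallow-water flux for the state (depth h, discharge per width hu)."""
    u = hu / h
    return (hu, hu * u + 0.5 * G * h * h)

def lax_friedrichs_flux(left, right):
    """Local Lax-Friedrichs flux between two neighbouring states (h, hu)."""
    (hl, hul), (hr, hur) = left, right
    fl, fr = physical_flux(hl, hul), physical_flux(hr, hur)
    # Largest wave speed |u| + sqrt(g*h) of the two states.
    alpha = max(abs(hul / hl) + math.sqrt(G * hl),
                abs(hur / hr) + math.sqrt(G * hr))
    # Central average of the fluxes plus a dissipation term scaled by alpha.
    return tuple(0.5 * (a + b) - 0.5 * alpha * (qr - ql)
                 for a, b, ql, qr in zip(fl, fr, left, right))

# Still water with a step in depth: mass flux points from deep to shallow.
mass_flux, momentum_flux = lax_friedrichs_flux((2.0, 0.0), (1.0, 0.0))
print(f"mass flux across the interface: {mass_flux:.3f} m^2/s")
```

The dissipation term vanishes when both states are equal, so the scheme stays consistent with the physical flux, while at discontinuities it damps the oscillations that a purely central flux would produce.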
The orthomosaic UAV image of one of the analyzed river stretches (Mūša) is shown in
Figure 2b. The geometry of the SWE-FE model was created using data acquired from both direct field measurements and high-resolution UAV aerial images. The direct measurements were used in the SWE-FE model to define the bottom heights and water surface elevations. The 2D Delaunay triangulation was created from the measured points, with linear shape functions used to interpolate values at the SWE-FE model mesh nodes. The in situ measurements of flow velocity vectors (speed and direction at 0.6 of the depth) were also used as reference values to evaluate the simulation results and adjust the coefficients during the model calibration stage. The vegetation type was manually defined after analysis of the UAV aerial images. The SWE-FE model mesh was generated by taking into account the results of semi-automatic recognition of boulders above and below water from the UAV images [
36]. The polygons defining the banks and the velocity field obtained using the FEM simulation are presented in
Figure 2a.
2.3. AI Model for Processing UAV Video Sequences
Recent advances in optical flow estimation have been largely driven by deep learning models, with RAFT establishing itself as a state-of-the-art architecture due to its accuracy and robustness across diverse benchmarks [
17]. RAFT iteratively refines dense flow fields through recurrent updates, making it suitable for applications requiring high precision. However, its architecture was originally designed for frame-to-frame estimation, which may limit its ability to fully exploit temporal consistency in longer image sequences.
To address this limitation, several studies have investigated multi-frame optical flow methods. For example, VideoFlow proposes a tri-frame optical flow module combined with motion propagation across temporal segments, effectively extending estimation beyond two consecutive frames [
37]. StreamFlow introduces an in-batch multi-frame pipeline to compute flow across several frames simultaneously, reducing computational redundancy and improving temporal coherence [
38]. Other approaches use spatiotemporal learning. The spatiotemporal recurrent transformers for multi-frame optical flow estimation (SSTM) employ recurrent transformers to capture motion dependencies across longer video sequences [
39]. In contrast, the Self-Teaching Multi-frame Unsupervised RAFT with Full-Image Warping (SMURF) method introduces temporal self-supervision into RAFT-like architectures to improve stability and generalization [
40]. These developments demonstrate the feasibility of extending frame-to-frame optical flow into multi-frame paradigms.
In this research, we extend the RAFT architecture by proposing a multi-frame input variant, MF-RAFT, in which the model processes a sequence of $N$ consecutive frames rather than a single frame pair. Each frame is first processed by the standard feature encoder, after which the extracted features from frame pairs are fused using a Fuse-GRU before correlation volume computation (Figure 3). This design integrates temporal information from multiple frames before dense correlations are calculated, thereby enhancing robustness in scenarios where single frame-pair estimation is unstable, such as low-texture river surfaces or variable illumination.
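The idea of recurrent gated fusion can be sketched with a toy example. The code below is not the Fuse-GRU module itself, whose layer structure is given only in Figure 3: it applies a standard GRU update independently to each feature channel with fixed scalar weights, whereas a trained module would use learned convolution kernels. All names and weight values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_fuse(features, w_z=0.5, w_r=0.3, w_h=0.8):
    """Fuse per-frame feature vectors into one vector with a GRU-style update.

    features: list of frames, each a list of channel values.
    w_z, w_r, w_h: hypothetical fixed scalar weights (learned in practice).
    """
    hidden = [0.0] * len(features[0])
    for frame in features:
        fused = []
        for h, x in zip(hidden, frame):
            z = sigmoid(w_z * (x + h))           # update gate
            r = sigmoid(w_r * (x + h))           # reset gate
            cand = math.tanh(w_h * (x + r * h))  # candidate state
            fused.append((1.0 - z) * h + z * cand)
        hidden = fused
    return hidden

# Three frames with two feature channels each (hypothetical values).
frames = [[0.2, -1.0], [0.4, -0.8], [0.3, -0.9]]
print(gru_fuse(frames))
```

Because each step blends the previous state with a bounded candidate, a single outlier frame shifts the fused features only partially, which is the property that motivates fusing before the correlation volume is computed.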
It should be noted that several alternative design choices are possible. The fusion block could, in principle, be applied after the correlation volume, allowing the model to operate on higher-level matching features. However, this approach would significantly increase computational cost due to the dimensionality of the correlation tensor. By placing Fuse-GRU before correlation computation, the model achieves a more favorable balance between temporal context integration and efficiency, enabling practical training and inference on UAV-derived river flow datasets. Importantly, the fused representations ultimately contribute to the estimation of the displacement vectors defined in
Section 2.1, which are subsequently converted into physical velocities.
2.4. Experimental Area and Data Collection
The study area comprises the same river segments examined in [
31], located in Lithuania. The research focused on four shallow river sites with moderate flow velocities and minimal vegetative cover, offering well-distributed surface tracers visible in RGB imagery. These site characteristics supported both ground measurements and modeling using FEM as well as provided suitable conditions for UAV data acquisition. A more detailed description of the geographic context, hydraulic conditions, and field campaign logistics is available in [
31]. The general layout of the study area and objects is shown in
Figure 4.
Flow velocity measurements and UAV-based video data were collected during dedicated field campaigns. Point measurements of ground truth flow velocity were taken at 0.6 of the total depth (from the surface) using a Valeport Model 801 electromagnetic flowmeter (accuracy ±0.005 m/s). Each point measurement was georeferenced using coordinates determined with a GeoMax Zenith 40 GNSS receiver (accuracy ±0.015 m). The duration of a measurement campaign depends on the number of measurement points and can take up to one day. UAV-based video frames were calibrated using ground control markers positioned along the riverbanks. The calibration procedure enabled correction of lens distortions, while georeferencing ensured that each pixel was associated with a fixed geographic location. This step established a constant spatial resolution (1 px = 1 cm) across all sequences, which is a prerequisite for the subsequent conversion of optical flow displacements into physical flow velocities as described in
Section 2.1.
Table 1 summarizes the main characteristics of the data collected at the experimental sites, including river name, measurement date, river discharge (RD), and the position and number of frame sequences (PFS). In the PFS column, the notations S1, S2, S3, and S4 indicate the locations along the river where UAV videos were recorded. The multiplier “×2” denotes that two sequences were acquired at the same location with a 180° rotation, ensuring bidirectional coverage of the river reach.
2.5. Dataset Preparation
Based on the calibrated and georeferenced UAV sequences, a dataset was constructed for training and evaluation of the proposed AI model. Although the sequences were already spatially aligned, additional steps were taken to ensure that the dataset remained unbiased and suitable for machine learning. First, $N_T$ random target points $p_i$ ($i = 1, \dots, N_T$) were selected within the ground control point (GCP)-constrained zone, excluding riverbanks and shoreline areas to avoid non-flow features. Around each selected point, square image patches of varying side lengths were extracted, with the set of considered patch sizes denoted by:

$$S = \{s_1, s_2, \dots, s_{N_S}\} \tag{4}$$

For each pair $(p_i, s_j)$, a corresponding patch was generated and then randomly rotated by an angle $\theta$, with additional random multiples of 90° applied to further diversify orientations. It is important to note that such rotations cannot be achieved by simply cropping the RGB imagery and vector field rasters and applying the same pixel-level transformation. A naive operation of this kind would break the georeferencing and, more critically, invalidate the physical meaning of the velocity vectors. While the georeferenced imagery is stored in a north-up orientation, rotation in pixel space would not preserve the correct coordinate reference system. Furthermore, the velocity components $(u, v)$ are defined relative to the global axes; rotating only the image would leave the vector field inconsistent with the rotated patch. Therefore, the vector field itself must be rotated component-wise by the same angle $\theta$, using a 2D rotation matrix:

$$\begin{pmatrix} u' \\ v' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} \tag{5}$$
Figure 5 schematically illustrates this process, showing that RGB imagery and the associated velocity vectors remain consistent after an arbitrary-angle rotation.
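The component-wise rotation of the velocity vectors can be sketched as follows. The example assumes the standard counter-clockwise rotation matrix in a right-handed coordinate system; in image coordinates, where rows increase downward, the sign of the angle may need to be flipped. The function name is hypothetical.

```python
import math

def rotate_vector(u, v, theta_deg):
    """Rotate one velocity vector counter-clockwise by theta (degrees)."""
    t = math.radians(theta_deg)
    return (u * math.cos(t) - v * math.sin(t),
            u * math.sin(t) + v * math.cos(t))

# A flow of 0.5 m/s along the x axis, rotated together with its patch.
u_rot, v_rot = rotate_vector(0.5, 0.0, 30.0)
print(f"rotated components: ({u_rot:.3f}, {v_rot:.3f}) m/s")
print(f"speed before: 0.500 m/s, after: {math.hypot(u_rot, v_rot):.3f} m/s")
```

The speed is unchanged by the rotation, which is exactly what is lost if only the raster storing $u$ and $v$ is rotated as an image: the pixel positions would move, but each pixel would still hold components expressed in the old axes.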
To introduce angular variability while preserving numerical stability, the initial random rotation $\theta$ was restricted to a bounded interval. This was then combined with an additional one to three random multiples of 90°, effectively distributing patch orientations across the full 360° range. This hybrid strategy enhanced augmentation diversity and ensured that image rotations remained computationally stable and geospatially consistent.
Figure 6 illustrates an example in which a patch, first rotated within the initial interval, is subsequently rotated by an additional multiple of 90°, demonstrating that both the imagery and velocity fields remain properly aligned.
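The two-stage sampling of the rotation angle can be sketched as below. The bounds of the initial interval are passed as a parameter because they are an assumption of this example; the default of -45° to 45° is hypothetical, and the function name is illustrative.

```python
import random

def sample_rotation(rng, base_range=(-45.0, 45.0)):
    """Return (base angle, quarter turns, total angle in degrees).

    base_range is a hypothetical interval for the initial rotation.
    """
    base = rng.uniform(*base_range)
    quarter_turns = rng.randint(1, 3)  # one to three multiples of 90 degrees
    total = (base + 90.0 * quarter_turns) % 360.0
    return base, quarter_turns, total

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
for _ in range(3):
    base, turns, total = sample_rotation(rng)
    print(f"base={base:7.2f} deg, quarter turns={turns}, total={total:7.2f} deg")
```

Quarter turns can be applied to a raster without interpolation, so only the small initial rotation requires resampling of the imagery and the vector field.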
In addition to spatial diversity, temporal variability was incorporated by introducing multiple frame strides. We define a set

$$K = \{k_1, k_2, \dots, k_{N_K}\} \tag{6}$$

where each element specifies the number of frames skipped between consecutive samples. For a given stride $k \in K$, the corresponding physical time interval $\Delta T$ is expressed as:

$$\Delta T = \frac{k}{\mathrm{FPS}} \tag{7}$$

where FPS refers to frames per second.
This construction ensures that the dataset reflects river dynamics across multiple temporal scales: small values of $k$ capture short-term fluctuations, while larger values account for slower, large-scale motions. Moreover, by including several stride values, the resulting dataset is not implicitly tied to a single acquisition frame rate, thereby improving robustness when applying the trained model to UAV surveys or video sequences acquired at different FPS. Such temporal augmentation thus complements spatial augmentation, broadening the variability of training instances while preserving the physical interpretability of the data.
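A short sketch of the stride logic is given below. It assumes that a stride of k means taking every k-th frame, so that the time interval is the stride divided by the frame rate; the stride and frame-rate values are hypothetical and are not those used in this study.

```python
def stride_to_dt(stride, fps):
    """Physical time interval (s) between sampled frames for a given stride."""
    return stride / fps

def sample_frame_indices(start, stride, n_frames):
    """Indices of n_frames frames taken every `stride` frames from `start`."""
    return [start + i * stride for i in range(n_frames)]

# The same stride gives different time intervals at different frame rates.
for fps in (30, 60):
    print(f"stride 3 at {fps} FPS -> dT = {stride_to_dt(3, fps):.3f} s")

print(sample_frame_indices(start=0, stride=3, n_frames=5))
```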
Finally, the total number of dataset instances generated through this procedure can be expressed as:

$$N_{\mathrm{total}} = N_T \cdot N_S \cdot N_K \tag{8}$$

Here, $N_T$ denotes the number of randomly sampled target points, $N_S$ the number of distinct patch sizes, and $N_K$ the number of temporal stride values. This strategy yielded a dataset in which river segments were represented without directional bias (i.e., not constrained to a consistent leftward or rightward flow). At the same time, each instance remained explicitly linked to its underlying river segment, enabling a principled partitioning of training and testing subsets according to distinct segments. Such a design prevents data leakage between training and test sets and ensures that model evaluation reflects true generalization performance under independent conditions.
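The counting rule can be checked by enumerating the combinations explicitly. In the sketch below the target points, patch sizes, and strides are hypothetical placeholders chosen only to keep the example small.

```python
from itertools import product

def enumerate_instances(points, patch_sizes, strides):
    """One dataset instance per (target point, patch size, stride) combination."""
    return list(product(points, patch_sizes, strides))

points = [(12.0, 40.5), (18.3, 22.1), (30.7, 35.0)]  # hypothetical coordinates
patch_sizes = [512, 768]                             # hypothetical side lengths, px
strides = [1, 2, 4]                                  # hypothetical strides

instances = enumerate_instances(points, patch_sizes, strides)
print(len(instances))  # equals 3 * 2 * 3
```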
2.6. Dataset Configuration and Partitioning
The UAV-based video data (
Section 2.4) were used to construct four complementary datasets for model training and evaluation. Independence between training and validation subsets was ensured by either separating entire measurement campaigns (temporal independence) or partitioning spatial zones within the same campaign (spatial independence).
Each video sequence was georeferenced and then converted into dataset instances following the procedure described in
Section 2.5 (Equation (8)). In this study, three temporal stride values were applied,
corresponding to ΔT intervals defined by the UAV acquisition frame rate. A total of
randomly selected target points were generated within the GCP-constrained zone, excluding non-flow regions such as riverbanks or shorelines. For each target point, six spatial resolutions were considered,
m/px, representing ground sampling distances at
m increments. According to Equation (8), this configuration resulted in a total of
independent samples per dataset, ensuring both spatial and temporal variability across multiple scales.
For river-specific datasets, one measurement campaign (
Table 1) was completely withheld for validation, while all remaining measurements from the same and other rivers were used for training. Each position frame sequence (PFS) could include multiple sequences (denoted ×2) corresponding to videos acquired with a 180° rotation at the same location; each sequence was treated independently during dataset construction. The resulting dataset configuration and partitioning are summarized in
Table 2, which is directly derived from the measurement campaigns presented in
Table 1. The number of measured points in the Verknė river was low compared to the other river segments; therefore, it was used for training only, to increase the diversity of the training data.
The partitioning strategy (
Table 2) ensured that validation was conducted on measurements not seen during training or on independent spatial zones within the same campaign. This design prevents overlap between training and validation subsets and allows model performance to be assessed under conditions approximating real deployment scenarios across individual rivers.
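A hold-out split by measurement campaign can be sketched as follows. The record layout and the campaign labels are hypothetical and serve only to show how an entire campaign is withheld so that no validation instance shares a source with the training set.

```python
def split_by_campaign(instances, held_out):
    """Withhold every instance of one campaign for validation."""
    train = [s for s in instances if s["campaign"] != held_out]
    val = [s for s in instances if s["campaign"] == held_out]
    # Guard against leakage: no campaign may appear in both subsets.
    overlap = {s["campaign"] for s in train} & {s["campaign"] for s in val}
    if overlap:
        raise ValueError(f"campaigns in both subsets: {overlap}")
    return train, val

# Hypothetical instance records; labels do not correspond to Table 1.
instances = [
    {"campaign": "river_A_day1", "patch": 0},
    {"campaign": "river_A_day1", "patch": 1},
    {"campaign": "river_A_day2", "patch": 0},
    {"campaign": "river_B_day1", "patch": 0},
]
train, val = split_by_campaign(instances, held_out="river_A_day2")
print(len(train), len(val))
```

Splitting at the campaign level rather than at the patch level matters because patches cut from the same video overlap in space and time, so a random patch-level split would leak near-duplicates into the validation set.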
2.7. Evaluation Metrics of Results
The performance of the proposed method was assessed using complementary measures: training loss, endpoint error (EPE), average angular error (AAE), and flow outlier rate (FL). Together, these metrics capture both the convergence behavior during optimization and the quantitative accuracy of the resulting optical flow fields.
The training loss follows the standard RAFT formulation proposed by Teed and Deng [17], adapted to include valid-pixel masking. It is a multi-iteration sequence loss that computes the pixel-wise Euclidean distance between the predicted and reference flow fields, weighted by an exponentially decaying factor $\gamma$ across recurrent refinement steps. Only pixels within valid flow masks are included in the computation. This loss design encourages gradual convergence while maintaining flow consistency in spatially coherent regions.
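The structure of such a masked sequence loss can be sketched in plain Python. The example follows the description above (Euclidean distance, valid pixels only, later iterations weighted more strongly); the decay value of 0.8 is the default reported for the original RAFT and is an assumption here, and the flattened list layout is hypothetical.

```python
import math

def masked_sequence_loss(predictions, reference, valid, gamma=0.8):
    """Weighted sum of masked mean flow errors over refinement iterations.

    predictions: one flow field per iteration, each a list of (u, v) per pixel.
    reference:   list of (u, v) per pixel.
    valid:       list of booleans, True where the reference flow is defined.
    gamma:       decay factor (assumed value, RAFT default).
    """
    n_iter = len(predictions)
    n_valid = sum(valid)
    loss = 0.0
    for i, flow in enumerate(predictions):
        weight = gamma ** (n_iter - i - 1)  # the last iteration has weight 1
        dist = sum(math.hypot(pu - ru, pv - rv)
                   for (pu, pv), (ru, rv), ok in zip(flow, reference, valid)
                   if ok)
        loss += weight * dist / n_valid
    return loss

reference = [(0.0, 0.0), (0.0, 0.0)]
valid = [True, False]                   # the second pixel is masked out
predictions = [[(1.0, 0.0), (9.0, 9.0)],  # first iteration: 1 px error
               [(0.0, 0.0), (9.0, 9.0)]]  # second iteration: exact
print(masked_sequence_loss(predictions, reference, valid))
```

The large error at the masked pixel does not contribute, and the early iteration is down-weighted, so the loss rewards predictions that become accurate by the final refinement step.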
The endpoint error (EPE) measures the Euclidean distance between the predicted and reference flow vectors in pixel space. For a predicted vector $(u_p, v_p)$ and a reference vector $(u_r, v_r)$, it is defined as:

$$EPE = \sqrt{(u_p - u_r)^2 + (v_p - v_r)^2} \tag{9}$$

This metric directly expresses the discrepancy in terms of pixel displacements. To allow interpretation in physical velocity units, EPE values were converted to meters per second as follows:

$$EPE_{m/s} = \frac{EPE \cdot r}{\Delta T} \tag{10}$$

Here, $r$ denotes the spatial resolution (m/px) determined by the selected zoom ratio, and $\Delta T$ is the physical time interval between frames defined by the chosen temporal stride. Since both $r$ and $\Delta T$ vary across dataset configurations, the physical interpretation of EPE is not constant but depends on the experimental setup. For instance, for one combination of resolution and temporal stride, an error of 1 px corresponds approximately to a velocity error of 0.07 m/s.
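The dependence of the physical error on the configuration can be illustrated with a short sketch. The resolution and time interval below are hypothetical values chosen for illustration and are not taken from the study; the function names are likewise illustrative.

```python
import math

def epe_px(pred, ref):
    """Endpoint error in pixels between two flow vectors (u, v)."""
    return math.hypot(pred[0] - ref[0], pred[1] - ref[1])

def epe_m_per_s(epe, r_m_per_px, dt_s):
    """Convert a pixel-space error to m/s using pixel size and frame interval."""
    return epe * r_m_per_px / dt_s

error = epe_px((4.0, 3.0), (4.0, 2.0))  # 1 px error in the vertical component
# The same pixel error under two hypothetical configurations.
print(f"{epe_m_per_s(error, 0.014, 0.2):.3f} m/s")
print(f"{epe_m_per_s(error, 0.020, 0.1):.3f} m/s")
```

A coarser resolution or a shorter time interval makes the same pixel error correspond to a larger velocity error, which is why pixel-space metrics from different configurations should not be compared directly.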
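The average angular error (AAE) complements magnitude-based evaluation by focusing on directional deviations. For a predicted vector $(u_p, v_p)$ and a reference vector $(u_r, v_r)$, it is computed as:

$$AAE = \arccos\left(\frac{u_p u_r + v_p v_r + 1}{\sqrt{u_p^2 + v_p^2 + 1}\,\sqrt{u_r^2 + v_r^2 + 1}}\right) \tag{11}$$

Here, the additional term 1 in both the numerator and denominator ensures numerical stability for small vector magnitudes. The result, expressed in degrees, reflects the orientation consistency of the velocity field between prediction and reference.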
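A sketch of the angular error is given below. It assumes the widely used form in which the stabilizing term equals 1, i.e., the angle between the vectors $(u, v, 1)$ in three dimensions; whether this exact form was used in the study is an assumption, and the function names are hypothetical.

```python
import math

def angular_error_deg(pred, ref):
    """Angle (degrees) between (u_p, v_p, 1) and (u_r, v_r, 1)."""
    num = pred[0] * ref[0] + pred[1] * ref[1] + 1.0
    den = (math.sqrt(pred[0] ** 2 + pred[1] ** 2 + 1.0)
           * math.sqrt(ref[0] ** 2 + ref[1] ** 2 + 1.0))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, num / den))
    return math.degrees(math.acos(cos_angle))

def average_angular_error(preds, refs):
    """Mean angular error over a list of vector pairs."""
    errors = [angular_error_deg(p, r) for p, r in zip(preds, refs)]
    return sum(errors) / len(errors)

print(angular_error_deg((0.0, 0.0), (0.0, 0.0)))  # defined even for zero flow
print(average_angular_error([(1.0, 0.0), (2.0, 0.0)],
                            [(0.0, 1.0), (2.0, 0.0)]))
```

Note that under this form perpendicular unit vectors do not yield 90°: the added term pulls the result toward smaller angles for short vectors, which is the price paid for stability near zero flow.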
By combining the loss-based convergence criterion with EPE and AAE, the evaluation framework captures both the learning dynamics and the resulting flow accuracy in terms of magnitude, direction, and spatial robustness across various flow regimes.
3. Results
3.1. Model Training
Three models were trained independently, each corresponding to one of the dataset configurations described in
Section 2.6 (Jūra, Mūša, and Šušvė). All training was conducted on an HPC cluster equipped with NVIDIA H100 GPUs, with 64 GB of system memory and 8 CPU cores allocated for data loading and preprocessing. The implementation used the PyTorch 2.6 framework.
Before training, all samples were resized and arranged to fixed dimensions (512 × 512), ensuring consistency for images, ground-truth flow, and masks across the dataset. Each training instance used sequences of ten consecutive frames per sample, with identical data loading parameters to maintain comparability between datasets. Windows were selected deterministically from the start of each sequence, and short sequences were right-padded to reach the full temporal length.
Optimization used a RAFT-style recurrent refinement scheme with 12 update iterations per forward pass. The training objective employed a masked RAFT sequence loss [
17] with exponentially decaying weights across iterations (
), applied only to valid pixels. Validation loss was computed as a masked L1 difference between the upsampled prediction and the reference flow. In addition to the loss, three evaluation metrics were monitored during training: endpoint error (EPE), average angular error (AAE), and the fraction of outlier pixels (FL-error), defined by absolute (3 px) and relative (5%) thresholds.
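The outlier fraction can be sketched as follows. The example assumes the convention common in optical flow benchmarks, in which a pixel counts as an outlier only if its endpoint error exceeds both the absolute and the relative threshold; whether both conditions must hold in this study is an assumption, and the function name is hypothetical.

```python
import math

def fl_outlier_rate(preds, refs, abs_thr=3.0, rel_thr=0.05):
    """Fraction of pixels whose error exceeds both thresholds."""
    outliers = 0
    for (pu, pv), (ru, rv) in zip(preds, refs):
        err = math.hypot(pu - ru, pv - rv)  # endpoint error in px
        mag = math.hypot(ru, rv)            # reference flow magnitude in px
        if err > abs_thr and err > rel_thr * mag:
            outliers += 1
    return outliers / len(preds)

refs = [(10.0, 0.0), (100.0, 0.0), (10.0, 0.0)]
preds = [(10.5, 0.0),   # 0.5 px error: below the absolute threshold
         (104.0, 0.0),  # 4 px error, but only 4% of the reference magnitude
         (14.0, 0.0)]   # 4 px error and 40% of the reference magnitude
print(fl_outlier_rate(preds, refs))
```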
All models were trained for 100 epochs using the AdamW optimizer with a fixed learning rate of . Gradient clipping with a global norm of 1.0 was applied to stabilize optimization. The batch size was adjusted according to the available GPU memory capacity, ensuring full utilization of the H100 hardware without memory overflow. One training epoch takes approximately 90 min.
The training and validation loss curves, along with the evolution of EPE, AAE, and FL-error metrics across epochs, are shown in
Figure 7. These plots illustrate the convergence behavior of all three models and enable a comparative evaluation of training stability and generalization performance.
As shown in
Figure 7a–c, all models converged rapidly within the first 20–30 epochs, followed by stable metric evolution for the remaining epochs. Both EPE and AAE decreased monotonically, indicating consistent improvements in flow accuracy and directional stability. Among the configurations, the river-specific models exhibited similar convergence behavior, confirming that the training framework generalized well across different river environments.
It is important to note that these results are expressed in pixel space, representing relative displacements rather than absolute physical velocities. As both the spatial resolution and the temporal stride vary across dataset configurations, direct comparison in physical units (m/s) is not straightforward. Conversion to physical flow magnitudes depends on these parameters and is discussed in the following section, where velocity predictions are quantitatively analyzed under real-world scaling conditions.
Furthermore, the reference flow fields used for training and validation were derived from the SWE-FE model, which inherently approximates the true hydrodynamic conditions. While the obtained flow fields provide physically consistent supervision, they may still contain systematic deviations due to boundary simplifications, mesh resolution, or numerical diffusion. Consequently, the reported error metrics reflect the model’s consistency with the SWE-FE reference rather than the absolute physical accuracy of the flow.
Despite these limitations, the observed trends provide clear evidence that the learning framework successfully captures the dominant kinematic and directional flow characteristics across various river environments. The resulting river-specific models demonstrate strong generalization capability and serve as a robust foundation for the detailed physical evaluation presented in
Section 3.2.
3.2. Preview of Visual Results
To provide an intuitive understanding of the model’s performance,
Figure 8 presents representative qualitative examples from three UAV test sites. Each case corresponds to a distinct river reach with different flow regimes and surface characteristics. For each site, the first valid frame, predicted velocity magnitude, SWE-FE reference field, and directional vector comparison are shown.
In the Šušvė case, the image resolution is 0.02 m/px, resulting in generally shorter velocity vectors, as each pixel represents a larger physical area and overall flow velocities are lower. Nevertheless, the model successfully reconstructs the full river cross-section and consistently captures the overall flow pattern in agreement with the SWE-FE reference. At the Mūša site, several submerged stones are visible. The SWE-FE model retains near-zero velocities around these obstacles, depending on how precisely the solid boundary was delineated. The proposed model reproduces most of these low-velocity regions but tends to slightly smooth smaller obstacles—a likely consequence of both limited texture and small-scale inaccuracies in the SWE-FE reference contours. In the Jūra sequence, the riverbanks partially overlap in the region of interest, and some vegetation occludes the edges. The visible boundary vectors in this area correspond to zones also present in the training set, which may contribute to better local consistency. Overall, the examples show that at low flow magnitudes, the model achieves mean endpoint errors (EPE) of around 0.3 px, while for higher velocities and coarser resolutions, EPE values reach 1.8–1.9 px. Directional discrepancies remain limited, typically within 10–15°, confirming that the Fuse-GRU architecture maintains good directional stability even under challenging optical conditions.
3.3. Comprehensive Performance Analysis
To comprehensively assess the robustness and behavior of the proposed model under different conditions, a quantitative analysis was conducted across multiple configurations of spatial and temporal parameters. This section examines how variations in spatial resolution, temporal stride, and valid-pixel coverage affect the accuracy and stability of the predicted flow fields. The primary evaluation metric is the endpoint error (EPE), which quantifies the Euclidean distance between predicted and reference flow vectors. For interpretability in physical terms, EPE values were expressed in m/s (see
Section 2.7), allowing direct comparison of the actual flow velocity discrepancies between model predictions and reference data.
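As a minimal sketch of this metric, the snippet below computes the mean endpoint error in pixels and converts it to m/s. The conversion shown (pixel displacement multiplied by the ground sampling distance and divided by the time between the compared frames) is an assumption made for illustration; the formulation actually used is the one defined in Section 2.7. The function names and the values of `gsd_m_per_px`, `fps`, and `stride` are hypothetical.

```python
import numpy as np

def endpoint_error_px(pred, ref, valid=None):
    """Mean Euclidean distance (px) between predicted and reference flow.

    pred, ref: arrays of shape (H, W, 2) holding (u, v) displacements in pixels.
    valid: optional boolean mask of shape (H, W); only True pixels are averaged.
    """
    err = np.linalg.norm(pred - ref, axis=-1)
    if valid is None:
        return float(err.mean())
    return float(err[valid].mean())

def epe_to_m_per_s(epe_px, gsd_m_per_px, fps, stride):
    """Assumed conversion from pixels per frame pair to metres per second."""
    dt = stride / fps  # seconds between the two compared frames
    return epe_px * gsd_m_per_px / dt

# Synthetic fields: every pixel differs by (3, -4) px, i.e. 5 px.
pred = np.zeros((4, 4, 2))
pred[..., 0] = 3.0
ref = np.zeros((4, 4, 2))
ref[..., 1] = 4.0

epe_px = endpoint_error_px(pred, ref)
epe_ms = epe_to_m_per_s(epe_px, gsd_m_per_px=0.02, fps=30.0, stride=5)
print(epe_px, epe_ms)
```

Passing a boolean `valid` mask restricts the average to pixels with reliable reference motion, in the spirit of the flow masks discussed in Section 3.3.3.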
3.3.1. Analysis Across Spatial Resolutions
To evaluate how image scale influences the accuracy of predicted flow fields, a quantitative analysis was performed at multiple spatial resolutions for all three datasets.
Table 3 shows the validation performance of the proposed model at resolutions from 0.010 m/px to 0.020 m/px, including the mean and standard deviation of the physical endpoint error (EPE expressed in m/s), mean absolute percentage error (MAPE), average angular error (AAE), and root mean squared error (RMSE). This comparison enables investigation of how changes in pixel size, and thus the level of visible texture and physical displacement per pixel, influence the accuracy and stability of flow magnitude and direction estimation.
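The remaining metrics can be sketched as follows. The exact definitions are not restated in this section, so the code assumes common conventions: MAPE is computed on velocity magnitudes, AAE is the mean angle between predicted and reference vectors, and RMSE is taken over the vector difference. The function name `flow_metrics` and the sample vectors are hypothetical.

```python
import numpy as np

def flow_metrics(pred, ref, eps=1e-8):
    """Return (MAPE in %, AAE in degrees, RMSE in input units).

    pred, ref: arrays of shape (N, 2) with velocity vectors, e.g. in m/s.
    The definitions below are assumed conventions, not taken from the paper.
    """
    mag_p = np.linalg.norm(pred, axis=1)
    mag_r = np.linalg.norm(ref, axis=1)
    # Relative magnitude error; eps guards against division by zero.
    mape = 100.0 * np.mean(np.abs(mag_p - mag_r) / np.maximum(mag_r, eps))
    # Angle between vectors from the normalised dot product.
    cos = np.sum(pred * ref, axis=1) / np.maximum(mag_p * mag_r, eps)
    aae = np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
    # Root mean squared length of the vector difference.
    rmse = np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1)))
    return float(mape), float(aae), float(rmse)

ref = np.array([[1.0, 0.0], [0.0, 2.0]])
pred = np.array([[0.0, 1.0], [0.0, 1.0]])
mape, aae, rmse = flow_metrics(pred, ref)
print(mape, aae, rmse)
```

Because MAPE divides by the reference magnitude, it is the metric most sensitive to near-zero reference velocities.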
As shown in
Table 3, a gradual improvement in accuracy is observed as the spatial resolution becomes coarser, particularly in the Jūra dataset, where angular errors decrease from 45° at 0.010 m/px to 28° at 0.020 m/px. A similar trend is observed for the physical endpoint error, whereas MAPE remains relatively stable. For Mūša, the same overall tendency is visible, though percentage errors remain higher due to local turbulence and specular reflections that distort optical cues. In contrast, Šušvė shows minimal variation across scales, indicating that smoother, low-velocity flow conditions make the model less sensitive to pixel size. When the resolution decreases, a larger portion of the river scene fits within a single frame, providing richer spatial context for motion interpretation and facilitating smoother, more generalized predictions. Conversely, at higher resolutions, the model observes smaller, localized areas where lighting variations, ripples, or vegetation motion may dominate the signal, occasionally leading to unstable estimates. To further illustrate these effects, representative examples of failure cases are shown in
Figure 9. After filtering out the lowest 15% of velocity values for each case, the physical endpoint error, AAE, and RMSE values do not change significantly. MAPE values drop for the Jūra and Mūša rivers, demonstrating that the model tends to generate proportionally higher errors for low velocities. However, in the Šušvė case, the change in MAPE values is not significant, showing that the model works robustly with respect to velocity in low-complexity river segments without boulders or ripples.
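The effect of removing the slowest samples on MAPE can be illustrated with synthetic data. This sketch assumes that the 15% threshold is applied to the reference speeds and uses a constant absolute error, which is a deliberate simplification; the helper names and the numbers are hypothetical and do not reproduce the values in Table 3.

```python
import numpy as np

def mape(pred_speed, ref_speed, eps=1e-8):
    """Mean absolute percentage error of speed magnitudes, in percent."""
    return float(100.0 * np.mean(np.abs(pred_speed - ref_speed)
                                 / np.maximum(ref_speed, eps)))

def drop_slowest(ref_speed, fraction=0.15):
    """Boolean mask keeping samples above the given quantile of reference speed."""
    threshold = np.quantile(ref_speed, fraction)
    return ref_speed > threshold

ref_speed = np.linspace(0.05, 1.0, 20)  # synthetic reference speeds in m/s
pred_speed = ref_speed + 0.05           # constant absolute error of 0.05 m/s

keep = drop_slowest(ref_speed)
mape_all = mape(pred_speed, ref_speed)
mape_filtered = mape(pred_speed[keep], ref_speed[keep])
print(int(keep.sum()), mape_all, mape_filtered)
```

With a fixed absolute error, the relative error is largest for the slowest samples, so removing them lowers MAPE while leaving absolute metrics largely unchanged. This matches the qualitative behaviour reported for the Jūra and Mūša rivers.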
In the Mūša case (
Figure 9a), dense vegetation and fine-scale surface ripples locally disturb the predicted flow vectors, although the overall direction remains consistent. In the Jūra example (
Figure 9b), a limited field of view, high turbulence, and weak texture information result in large angular deviations between the predicted and reference vectors. These examples demonstrate that, while the model performs robustly in most scenarios, local inconsistencies can arise when visual patterns are ambiguous or physically unstable—particularly in areas where vegetation motion, wind-driven waves, or reflections dominate the observed surface dynamics.
3.3.2. Analysis Across Temporal Strides
To investigate the influence of temporal spacing between frames, the model was evaluated using different temporal strides (Δ_frames = 4, 5, and 6) across all datasets.
Table 4 summarizes the corresponding validation metrics, including the physical endpoint error (EPE expressed in m/s), mean absolute percentage error (MAPE), and average angular error (AAE). This analysis provides insight into how temporal separation affects the model’s ability to maintain flow consistency and accurately track displacements over time.
As shown in
Table 4, the effect of temporal stride varies with river conditions. For the Jūra dataset, increasing Δ_frames from 4 to 6 leads to progressively higher endpoint and angular errors (from 0.335 m/s to 0.391 m/s and from 31° to 41°, respectively), indicating that larger time gaps reduce temporal coherence and make motion correspondence less stable. This behavior is typical of faster and more turbulent flows, where displacements between frames can become too large for reliable optical flow matching. In contrast, Mūša and Šušvė exhibit relatively stable or slightly improved accuracy at larger strides. For Šušvė, where surface motion is smoother and dominated by slow laminar flow, increasing Δ_frames from 4 to 6 reduces the physical endpoint error from 0.157 m/s to 0.138 m/s. This suggests that in low-velocity regimes, a longer temporal gap can enhance the detectability of meaningful motion by emphasizing displacements above the sensor’s noise threshold. If the lowest 15% of velocities are filtered out, the trends remain the same, with significantly smaller MAPE values for the Jūra and Mūša rivers. Overall, these results demonstrate that the optimal temporal stride depends strongly on the flow regime: shorter intervals are preferable for rapid or turbulent motion, while slower flows can tolerate longer separations without significant loss of accuracy.
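The trade-off described above follows from how the inter-frame displacement scales with stride. The sketch below assumes the simple relation displacement = speed × (stride / frame rate) / ground sampling distance. The frame rate of 30 fps and the two example speeds are hypothetical values chosen for illustration, not parameters reported in this section.

```python
def displacement_px(speed_m_per_s, stride, fps, gsd_m_per_px):
    """Expected surface displacement in pixels between frames `stride` apart."""
    return speed_m_per_s * (stride / fps) / gsd_m_per_px

FPS = 30.0   # hypothetical video frame rate
GSD = 0.02   # ground sampling distance in m/px

for speed in (0.1, 1.0):      # slow and fast flow, in m/s
    for stride in (4, 5, 6):  # the strides evaluated in this section
        d = displacement_px(speed, stride, FPS, GSD)
        print(f"{speed:.1f} m/s, stride {stride}: {d:.2f} px")
```

Under these assumptions, slow flow moves by about one pixel or less per frame pair, so a longer stride lifts the displacement further above the noise level. Fast flow already spans several pixels at the shortest stride, and longer strides make matching harder.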
3.3.3. Effect of Valid-Pixel Coverage
Since training and evaluation used flow masks that excluded regions with unreliable or missing motion information, an additional analysis was conducted to quantify the influence of visible (valid) area coverage on model performance. The valid-pixel ratio represents the proportion of pixels included in the loss and metric computation, averaged across validation subsets. Lower coverage indicates that a larger portion of the image was masked out, providing less spatial context for motion estimation.
As shown in
Table 5, there is a clear relationship between coverage ratio and model accuracy. Higher valid-pixel coverage generally leads to lower endpoint and angular errors, confirming that richer spatial context improves both magnitude and directional estimates. For example, in the Jūra dataset, increasing valid coverage from below 60% to above 80% reduces the physical endpoint error from 0.400 m/s to 0.340 m/s and AAE from 42° to 32°. A similar tendency is evident for Šušvė, where frames with more than 80% coverage achieve the lowest errors (physical endpoint error ≈ 0.097 m/s, AAE ≈ 20°). In contrast, the Mūša dataset shows minor changes in the physical endpoint error but a clear decrease in angular error with higher coverage, suggesting that when a larger portion of the river surface is visible, the model can better constrain flow directions even if velocity magnitude errors remain similar. This effect likely arises because vegetation and shadowed regions near the banks were excluded in low-coverage samples, reducing the available texture for reliable optical tracking. If the lowest 15% of velocities are filtered out, the trends remain the same, with significantly smaller MAPE values for the Jūra and Mūša rivers.
These results emphasize that adequate surface visibility is crucial for maintaining both quantitative and directional accuracy. Even with temporal fusion, insufficient valid-pixel coverage limits the model’s ability to infer coherent flow structures, highlighting the importance of high-quality imagery and consistent illumination during UAV data acquisition.
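The valid-pixel ratio used in this analysis can be sketched as the fraction of unmasked pixels in a flow mask. The 60% and 80% limits are taken from the text, while the intermediate bin, the handling of the bin edges, and the helper names are assumptions made for illustration.

```python
import numpy as np

def valid_pixel_ratio(mask):
    """Fraction of pixels marked valid (True) in a boolean flow mask."""
    return float(np.count_nonzero(mask)) / mask.size

def coverage_bin(ratio):
    """Assign a coverage class; the bin edges follow the 60% and 80% limits."""
    if ratio < 0.60:
        return "<60%"
    if ratio <= 0.80:
        return "60-80%"
    return ">80%"

# Synthetic 10 x 10 mask with 7 of 10 columns valid.
mask = np.zeros((10, 10), dtype=bool)
mask[:, :7] = True

ratio = valid_pixel_ratio(mask)
label = coverage_bin(ratio)
print(ratio, label)
```

Grouping validation samples by such a label before averaging the metrics gives a per-bin summary of the kind discussed above.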
3.4. Independent Validation Using Field Measurements
To ensure that the MF-RAFT velocity estimates are physically consistent and generalize beyond the training data, an independent validation was carried out using in situ flow measurements collected at two river segments, Mūša (discharge 0.506 m³/s) and Šušvė (discharge 0.697 m³/s). Notably, the MF-RAFT inference used models trained without these specific segments, ensuring an unbiased evaluation (see
Section 2.6). For comparison, velocity fields were processed with a fixed temporal stride, and only cases with at least 60% visible flow surface were included across all spatial resolutions, as this threshold provides sufficient spatial context for reliable motion estimation and minimizes artifacts from occluded or low-texture regions. The resulting dataset enabled a three-way comparison among the measured velocities, the SWE-FE physical model, and the MF-RAFT predictions, providing an independent assessment of model consistency and accuracy.
The results in
Table 6 show a strong correspondence between the MF-RAFT-derived velocities and the field measurements, confirming that the proposed approach captures the main flow structures with physically meaningful accuracy. In the Šušvė segment (discharge 0.697 m³/s), the physical endpoint error between the measured and physics-based velocities was as low as 0.08 m/s, indicating excellent agreement between the numerical model and direct observations. Compared with the MF-RAFT predictions, both the measured and the SWE-FE-derived velocities remained closely aligned, with physical endpoint errors of 0.17 m/s and 0.15 m/s, respectively. The angular error (AAE) was also small: 4.6° between measured and simulated velocities, and below 10° in comparisons involving the MF-RAFT estimates. This shows that the physical model reproduces realistic directional patterns that MF-RAFT successfully captures. These results suggest that the physical flow behavior in this low-turbulence reach is well approximated by both the SWE-FE model and the proposed inference method.
In contrast, the Mūša segment (discharge 0.506 m³/s) exhibits a more complex flow regime, characterized by submerged stones, shallow zones, and turbulence induced by vegetation. Here, the discrepancies are more pronounced, with physical endpoint errors of 0.28 m/s for the measured versus MF-RAFT comparison, 0.31 m/s for SWE-FE versus measured, and 0.36 m/s for SWE-FE versus MF-RAFT. The angular differences also increase, reaching 14–18°, reflecting the challenges of accurately resolving local vortices and near-bed flow variations.
Notably, the large errors (particularly in the SWE-FE vs. measured case) are mainly due to overestimations by the SWE-FE model near areas of high shear and stone-induced flow separation, where the measured velocities are substantially lower. These localized discrepancies are consistent with the visual examples shown in
Figure 10a (middle segments), where velocities derived using SWE-FE are visibly higher than both measured and MF-RAFT-predicted vectors. All comparisons in
Figure 10 were performed at a uniform image resolution of 0.018 m/px, ensuring a consistent spatial scale across both river sites.
For visualization and quantitative comparison, only points with more than 60% visible surface coverage were included, in accordance with the previously established dependence of accuracy on the valid-pixel ratio. Where more than ten valid measurement points were available, only the ten with the highest visible-area coverage were retained to maintain visual clarity and consistency between segments. In the Šušvė segments, however, this upper limit was not reached, with typically four to seven valid measurement points per area due to the relatively sparse sampling zones and larger bounding regions. It should also be noted that the parsed regions were extracted not only along the main riverbanks but were also constrained by the physical separation of ground markers used during field georeferencing. In the Mūša S1 segment (
Figure 10a), physical measurements did not begin exactly at the upstream boundary of the captured area; therefore, these uppermost points are not visible in the figure.
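The point selection rule described above (coverage above 60%, at most ten points per area, ranked by visible-area coverage) can be sketched as follows. The record layout with `id` and `coverage` keys and the synthetic coverage values are hypothetical.

```python
def select_points(points, min_coverage=0.60, max_points=10):
    """Keep points above the coverage threshold, best-covered first.

    points: list of dicts with hypothetical keys 'id' and 'coverage' (0 to 1).
    Returns at most `max_points` entries sorted by descending coverage.
    """
    eligible = [p for p in points if p["coverage"] > min_coverage]
    eligible.sort(key=lambda p: p["coverage"], reverse=True)
    return eligible[:max_points]

# Sixteen synthetic measurement points with coverage from 0.50 to 0.95.
points = [{"id": i, "coverage": 0.50 + 0.03 * i} for i in range(16)]

selected = select_points(points)
print([p["id"] for p in selected])
```

When fewer than ten points pass the threshold, as in the Šušvė areas, the function simply returns all eligible points.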
Figure 10 presents the results for segments of varying complexity. The Mūša segments (
Figure 10a, middle segments) exhibit high complexity due to the presence of boulders above and below the water, which generate ripples and increased reflections. These effects significantly impact the MF-RAFT predictions and lead to discrepancies in the results compared to both the SWE-FE model and in situ measurements. In the Mūša segments with boulders only below the water (
Figure 10a, top and bottom segments), the MF-RAFT predictions show better agreement with the SWE-FE model results. The Šušvė river (
Figure 10b) displays low complexity, as there are no boulders in the analyzed segment. Consequently, agreement between the SWE-FE model, in situ measurements, and MF-RAFT results is evident at most of the points selected for visualization. Overall, the visual comparisons indicate consistent directional agreement among all three data sources, despite local discrepancies in zones of complex flow. This coherence across both river sites supports the reliability of the proposed MF-RAFT approach for reconstructing realistic velocity patterns under varying hydrodynamic conditions.
4. Discussion
The proposed framework demonstrates that UAV-based RGB video analysis, combined with physically informed data generation and a multiframe deep MF-RAFT model, can produce stable and physically meaningful estimates of river velocity under various hydraulic conditions. Introducing a recurrent fusion module (Fuse-GRU) before correlation computation allows the model to exploit temporal coherence across multiple frames, reducing sensitivity to illumination changes, specular reflections, and low-texture surfaces that typically limit frame-to-frame optical flow. The multiframe RAFT extension thus offers a practical balance between temporal robustness and computational efficiency suitable for UAV hydrometric applications. Because the proposed approach relies on deep learning, its performance is strongly influenced by the quality of the training data. When the training dataset (UAV observations, in situ measurements for the physical model, and the physical model itself) is prepared with sufficient accuracy, the MF-RAFT model is expected to perform reliably. Nevertheless, environmental factors such as lighting conditions or wind can introduce substantial variability and negatively affect the results.
The hybrid data design, linking SWE-FE flow simulations with UAV-derived imagery, was essential for developing a reliable training dataset. Vector fields generated using FEM provided spatially coherent physical references that are otherwise difficult to obtain from field measurements alone. Although FEM solutions are numerical approximations, residual discrepancies were mainly confined to zones of strong shear, obstacle-induced separation, and vegetated areas. In several of these regions, the optical-flow model locally outperformed the SWE-FE baseline when compared with field data, suggesting that the learning-based system was able to extract subtle visual cues reflecting real dynamics beyond the simplified depth-averaged hydrodynamic representation.
Resolution and cadence analyses offered practical insights for UAV mission planning. Within the tested range of ground sampling distances (0.010–0.020 m/px), coarser resolutions improved performance by increasing spatial context and reducing small-scale radiometric artefacts. Similar benefits could be achieved by expanding the network’s input size (e.g., from 512 × 512 px to 512 × 1024 px), although computational efficiency constraints led to the use of smaller patches in this study. Temporal stride effects were dependent on flow regime: shorter intervals benefited fast and turbulent flows, while longer intervals enhanced motion detectability in slow, laminar conditions. Accuracy was strongly influenced by valid-pixel coverage, with more than 80% visible water surface generally yielding the lowest endpoint and angular errors.
Future work should focus on expanding the scale and diversity of the dataset, including automatic generation of the SWE-FE model geometry from UAV-derived orthophotogrammetry and semi-synthetic hydrodynamic representations. Using dynamic sequence lengths could further enhance temporal context, enabling the model to capture longer flow histories and suppress transient disturbances such as wind-driven ripples. Integrating these improvements, along with physics-informed loss functions, uncertainty quantification, and edge-efficient implementations, would strengthen the framework’s applicability for operational, real-time UAV-based river monitoring.
5. Conclusions
This study presents a physics-informed, UAV-based approach for estimating river velocity using an enhanced MF-RAFT architecture. By combining FEM-based hydrodynamic modeling with UAV RGB video, a dense and physically coherent training dataset is created, supporting supervised learning under realistic flow conditions. The novelty of the proposed architecture lies in its ability to predict flow using multiple consecutive frames as input rather than relying solely on pairwise flow estimation [17,40]. While traditional particle-based techniques [4,10] focus on tracking discrete tracers or brightness patterns across two frames, the proposed method considers the entire image sequence as a unified source of information. In addition, the framework aims to estimate the actual river velocity, not merely the river surface velocity [12,15,16]. This is achieved by integrating an SWE-FE model that produces a velocity vector field, which is then linked to observed RGB image features, enabling a more meaningful interpretation of image-derived motion cues. The proposed MF-RAFT model, extended with a pre-correlation Fuse-GRU module, effectively integrates temporal information from consecutive frames, improving stability, robustness, and physical consistency compared with traditional frame-pair methods.
A comprehensive evaluation across multiple rivers demonstrated that the framework reproduces spatially coherent and physically realistic velocity fields, achieving accuracy comparable to in situ measurements and physically based simulations. The method remains resilient under variable illumination, surface texture, and flow regimes, confirming its suitability for UAV-based hydrometric applications. However, severe environmental conditions (e.g., ripples caused by wind) can affect the results. The large computational resources required to train the MF-RAFT model on high-resolution data are another limitation of the proposed framework.
Overall, the developed system marks progress towards operational, data-driven hydrometry that integrates computer vision with physical modeling. By combining UAV flexibility, physics-aware learning, and computational efficiency, it offers a scalable, non-invasive tool for continuous river monitoring, flood risk assessment, and hydraulic model validation. Future developments, including automated dataset generation, dynamic temporal modeling, and real-time edge deployment, will further enhance its potential for large-scale environmental observation and water resource management.