Article

Intelligent Flood Scene Understanding Using Computer Vision-Based Multi-Object Tracking

Xuzhong Yan, Yiqiao Zhu, Zeli Wang, Bin Xu, Liu He and Rong Xia
1 School of Management, Zhejiang University of Technology, Hangzhou 310023, China
2 Engineering Management School, Zhejiang College of Construction, Hangzhou 311231, China
3 Department of Management Science and Engineering, East China University of Science and Technology, Shanghai 200030, China
4 Zhejiang Province Sanjian Construction Group Co., Ltd., Hangzhou 310012, China
5 Zhejiang Construction Investment Group Co., Ltd., Hangzhou 310012, China
6 College of Civil Engineering and Architecture, Zhejiang University, Hangzhou 310058, China
* Author to whom correspondence should be addressed.
Water 2025, 17(14), 2111; https://doi.org/10.3390/w17142111
Submission received: 12 June 2025 / Revised: 8 July 2025 / Accepted: 15 July 2025 / Published: 16 July 2025
(This article belongs to the Special Issue AI, Machine Learning and Digital Twin Applications in Water)

Abstract

Understanding flood scenes is essential for effective disaster response. Previous research has primarily focused on computer vision-based approaches for analyzing flood scenes, capitalizing on their ability to rapidly and accurately cover affected regions. However, most existing methods emphasize static image analysis, with limited attention given to dynamic video analysis. Compared to image-based approaches, video analysis in flood scenarios offers significant advantages, including real-time monitoring, flow estimation, object tracking, change detection, and behavior recognition. To address this gap, this study proposes a computer vision-based multi-object tracking (MOT) framework for intelligent flood scene understanding. The proposed method integrates an optical-flow-based module for short-term undetected mask estimation and a deep re-identification (ReID) module to handle long-term occlusions. Experimental results demonstrate that the proposed method achieves state-of-the-art performance across key metrics, with a HOTA of 69.57%, DetA of 67.32%, AssA of 73.21%, and IDF1 of 89.82%. Field tests further confirm its improved accuracy, robustness, and generalization. This study not only addresses key practical challenges but also offers methodological insights, supporting the application of intelligent technologies in disaster response and humanitarian aid.

1. Introduction

Floods are among the most destructive natural disasters, affecting large populations worldwide each year and causing significant economic losses [1]. Floods pose a serious threat to the development of both urban and rural areas due to the concentration of population, infrastructure, and economic activities [2]. Efficient humanitarian aid and disaster relief are crucial for saving lives in the aftermath of floods. Within the first 72 h after a disaster, when response efforts are most critical [3], efficiently gathering and disseminating real-time disaster information is vital for crisis management and recovery. Flood scene understanding—interpreting visual data from affected areas—is essential for effective emergency response, damage assessment, and humanitarian planning [4]. For instance, identifying houses, buildings, and vehicles helps rescue teams plan operations and assess damage.
In the era of artificial intelligence (AI), leveraging cutting-edge digital and intelligent technologies for flood scene understanding is a critical pathway to enhancing the efficiency and effectiveness of disaster response. Some scholars have proposed methods for flood scene understanding based on the analysis of satellite remote sensing images [5,6]. However, the low spatial and temporal resolutions of satellite images limit their ability to capture localized flood details. Recently, deep learning-based computer vision (DLBCV) techniques have been increasingly utilized for intelligent flood scene understanding using citizen sensors, such as traffic cameras, road webcams, and drones [7,8]. This increase is due to advances in AI [9], improvements in computer hardware, and the rapid development of citizen sensor science. Intelligent flood scene understanding through DLBCV can play a vital role by enabling real-time monitoring, object tracking, and situational awareness, thereby enhancing flood scene analysis and supporting decision making in humanitarian aid and emergency response.
Previous research has focused on several visual recognition tasks for flood scene understanding, including image classification [10], object detection [8,11], instance segmentation [12], visual question answering [13], and semantic segmentation [7,14,15,16,17,18]. For example, Sun et al. [8] utilized an enhanced version of the You Only Look Once version 9 (YOLOv9) algorithm to achieve vehicle object detection in flood scenarios. Pally and Samadi [12] developed a Python 3.9 package called “FloodImageClassifier” for object detection and instance segmentation in flood images. Wan et al. [7] introduced a flood scene semantic segmentation model called DSS-YOLOv8n, which incorporates Distributed Shift Convolution (DSConv) to improve the YOLOv8 nano segmentation model (YOLOv8n-seg). Key features and advantages of the aforementioned research projects are summarized in Table 1.
From Table 1, we identify a significant research gap in the current field of flood scene understanding based on deep learning and computer vision. Specifically, most previous methods focus on static image analysis, with few supporting multi-object tracking (MOT) for dynamic flood scene understanding. Unlike visual recognition tasks such as image classification, object detection, instance segmentation, and semantic segmentation, which focus on single images, MOT extends visual recognition to the video domain by incorporating the time dimension. This enables the tracking and identification of spatiotemporal features of target objects across a video sequence [21]. In deep learning and computer vision-based flood scene understanding, the importance of MOT arises from the following key advantages:
  • Capturing the evolving nature of flood events in real time;
  • Continuously tracking and consistently identifying moving objects;
  • Recovering and re-identifying objects temporarily occluded by waves, debris, or infrastructure within the camera view;
  • Enabling continuous tracking of the same target across different camera views or during a drone’s flight.
From a technical perspective, MOT can be categorized into two paradigms: joint-detection-and-tracking and tracking-by-detection. The former integrates detection and tracking components into a unified framework [22,23,24], while the latter first localizes objects and then associates them with motion and appearance information [25]. In recent years, tracking-by-detection methods have dominated the MOT task [26]. DeepSORT (Deep Simple Online and Realtime Tracker) is a classical tracking-by-detection method that incorporates deep learning-based appearance features for object re-identification and uses Kalman filtering combined with the Hungarian algorithm for data association [27]. StrongSORT builds on DeepSORT by further enhancing robustness and tracking accuracy [26]. It integrates an association refinement module and advanced filtering mechanisms to handle identity switches and occlusions more effectively. Due to its improved reliability, StrongSORT has become a benchmark for MOT [28] and has been widely adopted in industrial applications [29]. However, despite its outstanding performance in conventional MOT tasks, the direct application of StrongSORT in flood scenarios may face significant limitations. First, due to the influence of water flow, objects in flood environments often exhibit highly nonlinear and rapidly shifting motion patterns, leading to increased trajectory prediction errors in StrongSORT. Second, objects that reappear after long-term occlusions caused by waves, vegetation, or infrastructure are challenging for StrongSORT to consistently re-identify and track, thereby compromising its robustness in flood scene applications.
To address the identified research gaps, the primary objective of this study is to propose a deep learning and computer vision-based multi-object tracking (MOT) method for intelligent flood scene understanding in video sequences. To achieve this, we enhanced StrongSORT by integrating an optical-flow-based module for trajectory prediction and a deep re-identification (ReID) module for handling long-term occlusions in flood scenarios. Both performance evaluations and field applications demonstrated that the proposed method offers significant advantages across multiple metrics. This research not only addresses real-world challenges but also makes meaningful methodological contributions. Furthermore, it provides valuable guidance to government agencies and humanitarian organizations on leveraging intelligent technologies to improve disaster response efficiency. The key contributions of this study are threefold:
  • A deep learning and computer vision-based MOT method is proposed for intelligent flood scene analysis across continuous video frames;
  • To address the unique challenges of flood scenes compared to conventional environments, we propose a MOT method that integrates an optical-flow-based module for trajectory prediction and a deep re-identification (ReID) module for handling long-term occlusions in flood scenarios;
  • The proposed method achieves state-of-the-art performance across multiple evaluation metrics in both performance testing and field applications.
The remainder of this paper is organized as follows: Section 2 presents the proposed framework and approach; Section 3 reports the results of performance testing and field applications; Section 4 discusses the research implications; and Section 5 concludes the paper.

2. Methodology

The overall framework of the proposed deep learning and computer vision-based multi-object tracking (MOT) method is depicted in Figure 1. It consists of three core modules, which enable (1) instance segmentation and data association; (2) corner feature extraction and optical flow tracking; and (3) long-term occlusion handling. The first module serves as the foundational tracking-by-detection framework proposed in this study, responsible for providing and associating data for the other two modules. Considering the unique characteristics of flood scenarios compared to typical everyday scenes, flood environments present two specific challenges for multi-object tracking (MOT). First, the highly nonlinear motion of objects driven by water flow increases detection uncertainty and often leads to tracking failures. Second, prolonged occlusions caused by waves, vegetation, or infrastructure significantly hinder the re-identification of objects. To address the first challenge, we introduced an optical-flow-based module for trajectory prediction. To overcome the second, we incorporated a convolutional neural network (CNN)-based re-identification (ReID) module to handle long-term occlusions. By integrating these two modules into the tracking-by-detection framework, the proposed MOT approach can improve tracking performance in flood scenes, enabling stable, continuous, and reliable object tracking based on spatiotemporal features in video streams. Details of these modules are provided in the following sections.
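To make the interplay of the three modules concrete, the sketch below outlines one possible per-frame orchestration of the framework in Figure 1. It is a minimal illustration only: the function and class names (track_flood_video, segment_frame, flow_tracker, reid_cache, associator) are placeholders assumed for exposition and do not correspond to the released implementation.

```python
# Illustrative per-frame orchestration of the three modules (placeholder names only).
def track_flood_video(frames, segment_frame, flow_tracker, reid_cache, associator):
    """Run the tracking-by-detection loop over a flood video (sketch, not released code)."""
    tracks = []
    for frame in frames:
        masks = segment_frame(frame)                                      # Module 1: instance segmentation
        estimated = flow_tracker.estimate_missing(frame, tracks, masks)   # Module 2: optical-flow mask estimation
        masks = reid_cache.reassign_ids(frame, masks)                     # Module 3: long-term occlusion handling
        tracks = associator.update(masks + estimated)                     # StrongSORT-style data association
    return tracks
```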

2.1. Baseline Instance Segmentation and Data Association

A robust instance segmentation model with strong generalization capabilities can generate accurate masks around target objects, thereby mitigating tracking challenges. Considering that different models exhibit varying performance across different scenarios, the instance segmentation model should be easily replaceable to leverage advancements in computer vision and deep learning. For this study, we selected several influential models commonly used in practical applications, including Mask R-CNN (Region-based Convolutional Neural Network) [30], Deformable CNN [31], and Cascade R-CNN [32]. The impacts of these models on tracking performance were evaluated, and the results are discussed in the “Results” section. The data association model used in this framework is StrongSORT (Strong Simple Online and Real-time Tracking) [26], featuring a data association module that accurately links the position and ID information of target objects throughout the video.
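As an illustration of how the baseline segmenter can be kept easily replaceable, the sketch below builds a predictor from a model zoo configuration name. It assumes Detectron2 as the underlying library, which the paper does not specify, so it should be read as one reasonable realization rather than the authors' setup.

```python
# Sketch: build a swappable instance-segmentation predictor (assumes Detectron2).
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

def build_segmenter(config_name="COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml",
                    weights=None, score_thresh=0.5):
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_name))
    cfg.MODEL.WEIGHTS = weights or model_zoo.get_checkpoint_url(config_name)
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = score_thresh
    return DefaultPredictor(cfg)

# Swapping the baseline is then a one-line change, e.g.:
# predictor = build_segmenter("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml")
```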

2.2. Corner Feature Extraction and Optical Flow Tracking

Compared to simple, conventional scenes, flood scenes present a unique MOT challenge due to the highly nonlinear object motion caused by water flow. This can lead to missed detections in the instance segmentation model and, in a tracking-by-detection model, a missed detection results in the termination of object tracking. Once tracking is terminated, it cannot be correctly resumed even if subsequent detection results recover the object’s presence. To address this issue, we introduce an optical-flow-based method for tracking undetected objects. This approach aims to maintain tracking continuity despite the occurrence of missed detections.
When the instance segmentation model correctly detects the target in a video frame (Figure 2a), the mask region is converted to grayscale, and the Shi-Tomasi algorithm is used to generate up to 100 corner points within this grayscale mask (Figure 2b). If a missed detection occurs in a subsequent frame (Figure 2c), the corner features for the current frame are estimated based on the optical flow from the previous frame to the current frame (Figure 2d). This approach allows for the tracking of objects even when they are not detected by the instance segmentation model. Let the mask pixel intensity be denoted as $I_{mask}$ and the translation filter as $w(x, y)$. The optical flow between the current frame and the last frame can be represented by the pixel density variation $E_{mask}(\Delta x, \Delta y)$ induced by the filter movement, as shown in Equation (1). The regions with significant variation are considered the estimated positions of the corner points in the current frame:
$$E_{mask}(\Delta x, \Delta y) = \sum_{x, y} w(x, y)\left[ I_{mask}(x + \Delta x,\, y + \Delta y) - I_{mask}(x, y) \right]^2, \quad (1)$$
where $I_{mask}(x, y)$ represents the pixel intensity at coordinates $(x, y)$, while $I_{mask}(x + \Delta x, y + \Delta y)$ denotes the pixel intensity after translation by $(\Delta x, \Delta y)$. The translation filter $w(x, y)$ refers to a small region centered around the corner features generated within the grayscale mask. By computing the pixel density variation within this region, optical flow tracking of the corner features within the mask can be achieved. Subsequently, the affine transformation matrix between the homogeneous coordinates of the corner features in the current and last frames is computed. This matrix is then used to map the vertex coordinates of the mask in the last frame to the current frame:
$$\begin{bmatrix} x_i^{\mathrm{current\_frame}} \\ y_i^{\mathrm{current\_frame}} \\ 1 \end{bmatrix} = \begin{bmatrix} s_{11} & s_{12} & t_1 \\ s_{21} & s_{22} & t_2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_i^{\mathrm{last\_frame}} \\ y_i^{\mathrm{last\_frame}} \\ 1 \end{bmatrix}, \quad i = 1, 2, \ldots, n, \quad (2)$$
where $s_{11}$, $s_{12}$, $s_{21}$, and $s_{22}$ represent the scaling, shear, and rotation parameters, while $t_1$ and $t_2$ denote the translation parameters. $(x_i^{\mathrm{current\_frame}}, y_i^{\mathrm{current\_frame}})$ and $(x_i^{\mathrm{last\_frame}}, y_i^{\mathrm{last\_frame}})$ refer to the vertex coordinates of the mask in the current and last frames, respectively. When a missed detection occurs in the instance segmentation model for a given frame, the mapped coordinates $(x_i^{\mathrm{current\_frame}}, y_i^{\mathrm{current\_frame}})$ are used for missed detection correction (Figure 2e).
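For readers who wish to reproduce this step, the sketch below shows how the corner extraction, optical flow, and affine mapping of Equations (1) and (2) could be implemented with OpenCV. The function name and parameter choices (e.g., the 0.01 corner quality level and the minimum of three matched points) are illustrative assumptions, not the released implementation.

```python
# Sketch of Section 2.2 with OpenCV: Shi-Tomasi corners, Lucas-Kanade optical flow,
# and an affine transform that maps the last frame's mask vertices to the current frame.
import cv2
import numpy as np

def estimate_mask_vertices(prev_gray, curr_gray, prev_mask, prev_vertices):
    """Propagate a mask's polygon vertices from the last frame to the current frame."""
    # Up to 100 Shi-Tomasi corners inside the previous mask (as in the text).
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.01,
                                      minDistance=5, mask=prev_mask.astype(np.uint8))
    if corners is None:
        return None

    # Sparse optical flow from the last frame to the current frame.
    moved, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, corners, None)
    good_prev = corners[status.flatten() == 1]
    good_curr = moved[status.flatten() == 1]
    if len(good_curr) < 3:
        return None

    # Full affine transform between matched corners (the matrix in Equation (2)).
    affine, _ = cv2.estimateAffine2D(good_prev, good_curr)
    if affine is None:
        return None

    # Map the stored mask vertices into the current frame for missed-detection correction.
    return cv2.transform(prev_vertices.reshape(-1, 1, 2).astype(np.float32), affine)
```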

2.3. Lightweight CNN-Based Occlusion Handling

To mitigate the impact of long-term occlusions on multi-object tracking (MOT), a re-identification (ReID) module based on a lightweight convolutional neural network (CNN) is integrated into the tracking process, as shown in Figure 3. First, during instance segmentation, if the detection of a target is interrupted for 200 consecutive frames, it is assumed that the target has been occluded by objects such as floodwater, vegetation, or buildings. At this point, the algorithm automatically caches the image features and ID of the occluded object’s region of interest (RoI), forming a cache set. Next, when the instance segmentation model detects a new object (which could potentially be a reappearing occluded object), the proposed lightweight CNN computes the similarity between the new object and the cached objects. The object’s ID is then determined based on the similarity score. If the similarity exceeds a predefined threshold of 90%, the new object is identified as the previously occluded object, and the ID of the most similar cached object is reassigned to it. Otherwise, a new ID is assigned to the object. This approach ensures robust tracking continuity even in the presence of long-term occlusions.
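The cache-and-reassign logic described above can be summarized by the minimal sketch below. It assumes the lightweight CNN exposes an L2-normalized feature vector per region of interest; class and method names are illustrative placeholders rather than the released code.

```python
# Sketch of the long-term occlusion cache: store features of lost tracks, then
# reassign IDs to new detections whose cosine similarity exceeds the 90% threshold.
import numpy as np

class ReIDCache:
    def __init__(self, max_lost_frames=200, similarity_threshold=0.90, max_size=100):
        self.max_lost_frames = max_lost_frames            # frames before a track counts as occluded
        self.similarity_threshold = similarity_threshold
        self.max_size = max_size
        self.entries = []                                 # list of (track_id, unit feature vector)

    def cache_occluded(self, track_id, feature):
        """Store the RoI feature of a track lost for >= max_lost_frames frames."""
        if len(self.entries) >= self.max_size:
            self.entries.pop(0)                           # drop the oldest entry when full
        self.entries.append((track_id, feature))

    def match(self, feature, next_new_id):
        """Return a cached ID if similarity exceeds the threshold, else assign a new ID."""
        if not self.entries:
            return next_new_id
        feats = np.stack([f for _, f in self.entries])
        sims = feats @ feature                            # cosine similarity for unit vectors
        best = int(np.argmax(sims))
        if sims[best] >= self.similarity_threshold:
            track_id, _ = self.entries.pop(best)          # re-identified: reuse the cached ID
            return track_id
        return next_new_id
```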
To reduce computational costs and enhance generalization capability, an image scaling method is employed to standardize the input image size to 128 × 64 pixels. The network architecture consists of four convolutional layers (Conv1–Conv4), one max-pooling layer (Max), three residual blocks (Res1–Res3), and a fully connected layer. Each convolutional layer contains 32 channels and uses a 3 × 3 kernel with a stride of 1 and padding of 1 to process the input feature maps, ensuring that the input and output dimensions remain consistent. The max-pooling layer employs a 3 × 3 kernel with a stride of 2 and padding of 1. The three residual blocks Res1–Res3 contain 32, 64, and 128 channels, respectively. The fully connected layer generates a global feature map, which is then used to compute cosine similarity [27] to determine whether a newly detected object corresponds to an object stored in the cache set. Batch normalization and ReLU activation are applied to each convolutional layer. When the cache set reaches its limit, the oldest entry is removed. All detected, estimated, re-identified, and newly detected masks are then associated by StrongSORT for MOT in flood scene video streams.
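A PyTorch sketch of this lightweight ReID network is given below. The text does not fully specify the layer ordering, downsampling positions, or the output feature dimension, so the following is an approximation of the described architecture (Conv1–Conv4, Max, Res1–Res3, fully connected layer), not the authors' implementation.

```python
# Approximate PyTorch realization of the lightweight ReID network described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(conv_bn_relu(in_ch, out_ch), conv_bn_relu(out_ch, out_ch))
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))

class LightweightReID(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.stem = nn.Sequential(conv_bn_relu(3, 32), conv_bn_relu(32, 32),
                                  conv_bn_relu(32, 32), conv_bn_relu(32, 32),  # Conv1-Conv4
                                  nn.MaxPool2d(3, stride=2, padding=1))         # Max
        self.res = nn.Sequential(ResidualBlock(32, 32),                         # Res1
                                 ResidualBlock(32, 64),                         # Res2
                                 ResidualBlock(64, 128))                        # Res3
        self.fc = nn.Linear(128, feature_dim)                                   # global feature

    def forward(self, x):                        # x: (N, 3, 128, 64) resized RoI patches
        x = self.res(self.stem(x))
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return F.normalize(self.fc(x), dim=1)    # unit-length features for cosine similarity
```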

2.4. Pre-Processing

As described in Section 2.1, three foundational models were used: Mask R-CNN, Deformable CNN, and Cascade R-CNN. To optimize the learning process, transfer learning was applied. The baseline instance segmentation networks were pre-trained on the large-scale COCO dataset [33] and fine-tuned using a custom flood dataset. This custom dataset, consisting of 3718 images and 29,766 instance annotations, was built based on FloodNet [13]. At this stage, we annotated four key object categories—people, boats, buildings, and vehicles—for three reasons: (1) they have well-defined boundaries suitable for instance segmentation; (2) they are critical for disaster assessment and relief efforts; and (3) boats, buildings, and vehicles can serve as staging points or operational routes for rescue teams. Notably, the proposed MOT method is not limited by the number of object categories, and future research will expand annotations to include additional categories for broader applicability.
Following standard dataset split principles [33], the labeled images were randomly divided into training (50%), validation (25%), and test (25%) subsets through stratified sampling to ensure class distribution balance across splits. To ensure high-quality annotations, we conducted a thorough quality check on the labels through multiple rounds of cross-verification, identifying and correcting annotation errors across the dataset. Data augmentation techniques, including affine scaling, horizontal flips, and rotation, were applied to the training set to reduce overfitting. The final dataset includes 26,542 instances for training, 846 for validation, and 2378 for testing. Sample annotations are shown in Figure 4.
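The training-set augmentation (affine scaling, horizontal flips, and rotation) could be realized as in the sketch below. The paper does not name an augmentation toolkit, so Albumentations is assumed here, and the parameter ranges are illustrative.

```python
# Sketch of the training-set augmentation described above (assumed Albumentations pipeline).
import albumentations as A

train_augment = A.Compose([
    A.HorizontalFlip(p=0.5),                  # horizontal flips
    A.RandomScale(scale_limit=0.2, p=0.5),    # affine scaling
    A.Rotate(limit=15, p=0.5),                # rotation
])

# Masks are transformed together with the image so instance annotations stay aligned:
# augmented = train_augment(image=image, masks=instance_masks)
```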
In the context of evaluating instance segmentation performance, several key metrics are defined as follows. A prediction is considered a TP only if two conditions are met: (1) the Intersection over Union (IoU) between the predicted mask and the ground-truth mask exceeds the given threshold and (2) the predicted category matches the ground-truth category. Conversely, a prediction is considered an FP if at least one of these conditions is unmet, and a ground-truth instance that is not detected counts as an FN [34]. Precision and Recall are calculated as Precision = TP/(TP + FP) and Recall = TP/(TP + FN). The Average Precision (AP), which is the area under the precision–recall curve, provides a summary measure of the model’s performance across different levels of recall.
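As a small numeric illustration of these definitions (the counts below are made up for the example, not results from the paper):

```python
# Precision/Recall from TP, FP, FN counts, as defined above.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 90 correctly matched masks, 10 spurious masks, 5 missed ground truths.
p, r = precision_recall(tp=90, fp=10, fn=5)
print(f"Precision = {p:.3f}, Recall = {r:.3f}")   # Precision = 0.900, Recall = 0.947
```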
To evaluate the proposed occlusion handling module, both the Market 1501 dataset [35] and the CMOT dataset [36] were used. The Market 1501 dataset encompasses 1501 identities and approximately 30,000 images captured from six different cameras, while the CMOT dataset contains 100 domain videos and 625 identities. To enhance training efficiency, we extracted all annotated instances with the same ID by cropping them based on their bounding box locations and stored them in a unified directory. For evaluating the occlusion handling network, the cropped images from the validation and test sets sharing the same ID were randomly and evenly split into two separate directories: a gallery directory and a query directory.
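The gallery/query protocol is typically scored by Top-1 accuracy: each query feature is matched to its nearest gallery feature by cosine similarity, and a hit is counted when the retrieved identity matches. A minimal sketch of this scoring, assuming L2-normalized feature rows and NumPy arrays of identity labels, is shown below.

```python
# Sketch of gallery/query Top-1 evaluation with cosine similarity.
import numpy as np

def top1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    sims = query_feats @ gallery_feats.T          # (Q x G) cosine similarity for unit vectors
    nearest = np.argmax(sims, axis=1)             # best gallery match for each query
    return float(np.mean(gallery_ids[nearest] == query_ids))
```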
To evaluate the proposed MOT method with optical flow tracking, a dataset of video footage was collected from ten flood-affected regions. The testing dataset includes 9979 frames and 4325 annotations, covering diverse flight attitudes, perspectives, weather conditions, recording times, and geographical areas (Figure 5). This variety ensures a challenging and comprehensive evaluation environment. While the exact geographic locations of the case studies are not disclosed due to data-sharing constraints, the selected areas represent a broad spectrum of flood-prone environments in monsoon-affected regions. These include rural mountainous zones, peri-urban settlements, agricultural lands, and industrial complexes. The severity of the flood events varies across cases, with some areas experiencing river overflows and landslides (e.g., testing video #09), while others depict large-scale urban waterlogging (e.g., testing video #10), submerged road networks (e.g., testing video #05), or flooded farmland (e.g., testing video #06). These scenarios also resulted in significant impacts such as road closures, damaged buildings, disrupted transportation, and potential economic losses. Each target object in the videos was manually annotated with identity, location, and category information following the MOT16 standard [37]. The method’s performance in flood scene understanding was assessed by comparing predictions with ground-truth annotations.
We used a set of standard MOT evaluation metrics, including Higher Order Tracking Accuracy (HOTA), Detection Accuracy (DetA), Association Accuracy (AssA), Localization Accuracy (LocA), Identification F1 Score (IDF1), and Identity Switches (IDSWs) [38]. Specifically, HOTA is defined as the geometric mean of detection accuracy and association accuracy, averaged across localization thresholds; this ensures that detection and association are evenly balanced. DetA is calculated as the Detection Jaccard index averaged over localization thresholds. AssA is determined by the Association Jaccard index averaged over all matching detections and then averaged over localization thresholds. LocA represents the average localization similarity over all matching detections, averaged over localization thresholds. IDF1 measures the ratio of correctly identified detections to the average number of ground-truth and computed detections. Lastly, IDSWs count the number of times that a tracked trajectory changes its matched ground-truth identity. These metrics provide a balanced assessment of detection, association, and localization performance, enabling a comprehensive evaluation of the tracking system’s effectiveness across multiple aspects.
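The relationship between HOTA and its components can be made explicit with the short sketch below, which follows the definition above (geometric mean of DetA and AssA at each localization threshold, then averaged over thresholds); the input arrays are illustrative, not measured values.

```python
# HOTA from per-threshold DetA and AssA values, per the definition above.
import numpy as np

def hota(det_a_per_alpha, ass_a_per_alpha):
    det_a = np.asarray(det_a_per_alpha, dtype=float)
    ass_a = np.asarray(ass_a_per_alpha, dtype=float)
    return float(np.mean(np.sqrt(det_a * ass_a)))   # geometric mean, averaged over alpha
```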

3. Results

In this section, we comprehensively evaluate the performance of the baseline instance segmentation networks, the occlusion handling network, and the MOT method incorporating optical flow tracking and occlusion handling. All experiments were conducted on an ASUS computer (manufactured in Suzhou, China) running Linux, equipped with an Intel i7 CPU and an Nvidia RTX 3090 Ti GPU, with PyTorch 1.10 as the deep learning framework.

3.1. Evaluation of Baseline Instance Segmentation Networks

To mitigate the risk of overfitting during the training phase, three strategies were implemented. First, batch normalization was applied to stabilize the learning process and reduce the likelihood of overfitting. Second, transfer learning with pretrained weights was used to facilitate the extraction of generic features before fine tuning. Third, a learning rate schedule was employed to progressively decrease the learning rate, ensuring more stable convergence. All selected models were pretrained on the COCO dataset and fine-tuned on the custom dataset for 50 epochs (81,250 iterations) with a batch size of 2. A linear warm-up strategy increased the learning rate from 0.001 to 0.0025 over the first 1000 iterations. The learning rate was reduced by a factor of 0.1 after epochs 35 and 45. Other hyperparameters matched those in the original implementations. Training, validation, and testing performance are shown in Figure 6, indicating that all models converged within 50 epochs. The PR curves at ten IoU levels ranging from 0.5 to 0.95 were represented by various symbols (circle, square, triangle down, triangle left, triangle right, triangle up, star, diamond, hexagon, and pentagon). As the IoU threshold decreases, the AP increases, which is expected. Specifically, at an IoU threshold of 0.5, the mean average precision (mAP) on the test set was 95.2% for Mask R-CNN, 95.7% for Deformable CNN, and 96.3% for Cascade R-CNN. These results indicate that the trained baseline instance segmentation models perform robustly with new image data, demonstrating strong generalization capabilities for understanding flood scenes.
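The warm-up and step-decay schedule described above can be expressed as an iteration-wise multiplier, as in the sketch below. The 1625 iterations-per-epoch figure follows from the stated 81,250 iterations over 50 epochs; the optimizer wrapped here is a placeholder, since in practice the schedule is attached to each detector's own optimizer.

```python
# Sketch of the learning-rate schedule: linear warm-up 0.001 -> 0.0025 over the first
# 1000 iterations, then x0.1 decay after epochs 35 and 45 (scheduler stepped per iteration).
import torch

ITERS_PER_EPOCH = 81_250 // 50          # = 1625 iterations per epoch
BASE_LR, WARMUP_START_LR, WARMUP_ITERS = 0.0025, 0.001, 1000

def lr_multiplier(iteration):
    if iteration < WARMUP_ITERS:        # linear warm-up
        frac = iteration / WARMUP_ITERS
        return (WARMUP_START_LR + frac * (BASE_LR - WARMUP_START_LR)) / BASE_LR
    epoch = iteration // ITERS_PER_EPOCH
    if epoch >= 45:
        return 0.01                     # decayed twice (after epoch 45)
    if epoch >= 35:
        return 0.1                      # decayed once (after epoch 35)
    return 1.0

# Placeholder parameter group; in practice this wraps the detector's optimizer.
optimizer = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=BASE_LR, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
```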

3.2. Evaluation of Occlusion Handling Network

The occlusion handling network was trained for ten epochs (1250 iterations) using the cross-entropy loss function and optimized with stochastic gradient descent (SGD). To mitigate overfitting, we implemented the following strategies: (1) batch normalization by adding a normalization layer after each convolutional layer, and (2) a learning rate schedule that reduced the learning rate by a factor of 0.1. The hyperparameters were set as follows: learning rate = 0.01, momentum = 0.9, and weight decay = 0.0005. The network successfully converged within 1250 iterations. The training and testing results are shown in Figure 7. Top 1 accuracy, a key evaluation metric for ReID tasks, was used to assess the model’s performance in handling occlusions. The results demonstrate that the trained occlusion handling network achieves robust performance when encountering new IDs, highlighting its strong generalization capability for occlusion handling in complex scenes.

3.3. Evaluation of MOT Incorporating Optical Flow Tracking

Tracking performance on the test videos is summarized in Table 2, Table 3 and Table 4, comparing our MOT method with the benchmark tracker StrongSORT. As shown in Table 2, for videos #01, #04, #07, #08, and #10, the proposed MOT method using Mask R-CNN achieves notable HOTA improvements of 1.80%, 2.76%, 0.73%, 1.45%, and 1.76%, respectively, over StrongSORT. These gains are reflected in DetA and AssA, demonstrating the method’s effectiveness in both detection and trajectory association. In video #02, despite matching the IDSWs of StrongSORT, the proposed MOT method shows a 4.16% increase in HOTA and a 3.22% increase in IDF1, indicating better identity assignment accuracy.
As shown in Table 3, the proposed MOT method with Deformable CNN outperforms StrongSORT in HOTA, DetA, AssA, and IDF1 in nearly all videos, demonstrating strong tracking and identity accuracy. Notable gains appear in videos #02, #04, #06, and #10, with HOTA improvements of up to 7.12%. Minor drops are observed in video #03 and in the localization accuracy of video #09, but the differences are minimal. In video #05, despite solid gains in HOTA and IDF1, the modest DetA increase suggests potential for enhanced detection accuracy.
Table 4 shows significant HOTA gains in videos #02 (4.94%), #04 (3.17%), #06 (6.17%), and #10 (1.69%) over StrongSORT, with improvements in DetA, AssA, and IDF1, indicating robust performance in complex scenarios. For example, #02 achieves a 3.57% IDF1 increase. Performance is comparable in #03 and #09, despite minor declines in #03 (HOTA −0.53%, IDF1 −1.19%). Video #05 shows HOTA and IDF1 gains (1.78%, 2.40%), but the modest DetA rise (1.60%) suggests detection accuracy needs further optimization in tough conditions. By comparing Table 2, Table 3 and Table 4 horizontally, we find that the proposed MOT method consistently outperforms StrongSORT in terms of HOTA, regardless of the baseline instance segmentation networks. Specifically, the highest improvements are generally observed with Cascade R-CNN, followed by Mask R-CNN. In terms of DetA, AssA, and IDF1, similar patterns emerge, with Cascade R-CNN leading in improvement margins.
Table 5 summarizes the combined tracking performance of the proposed MOT method and StrongSORT across all testing videos, consolidating results from Table 2, Table 3 and Table 4. The proposed MOT method outperforms StrongSORT in key metrics across the three baseline networks. Notably, with Cascade R-CNN, it achieves the highest scores: HOTA (69.57%), DetA (67.32%), AssA (73.21%), and IDF1 (89.82%). Significant improvements in DetA and AssA highlight its effectiveness in object detection and trajectory association. The method also shows superior identity assignment consistency, with fewer identity switches and more reliable tracking. Additionally, it achieves higher FPS rates than StrongSORT across all models, indicating both better performance and efficiency. Based on these results, the proposed MOT method with Cascade R-CNN was selected for field application, demonstrating strong potential for enhancing MOT in complex flood scenes.

3.4. Field Application

The proposed computer vision-based MOT method for intelligent flood scene understanding was tested at four locations in China (names withheld for privacy). Sample frames from these field applications are shown in Figure 8. The videos were captured by rescue personnel using drones. Panoptic segmentation of water, vegetation, mountains, and sky—represented by dark blue, light green, gray, and light blue, respectively—was performed using Panoptic FPN [39]. Simultaneously, the proposed MOT method was applied to detect and track people, boats, buildings, and vehicles, shown with green masks/bounding boxes and red category/ID labels. Figure 8a–c demonstrate the proposed MOT method successfully identifying and tracking buildings in the flood scene video, maintaining consistent IDs across frames. Figure 8d shows the algorithm tracking boats and vehicles. Although the boat was occasionally missed by the instance segmentation model (e.g., in frames 59 and 211), the optical flow tracking and ReID method ensured continuous tracking. Throughout these field tests, most target objects were consistently detected and tracked, showcasing the method’s effectiveness. The results confirm its strong performance across diverse flood scenarios, including varying weather, lighting, and camera angles, proving its adaptability to different environments.

4. Discussion

4.1. Summary of Results and Implications

The proposed computer vision-based multi-object tracking (MOT) method is designed to address the complex and dynamic nature of flood scene understanding. It integrates three complementary modules: (1) instance segmentation and data association, (2) corner feature extraction combined with optical flow tracking, and (3) long-term occlusion handling. Each module was selected for its specific strengths in resolving key challenges, such as detecting partially submerged objects, maintaining track continuity under rapid scene changes, and recovering trajectories after occlusions. This modular design enables the system to achieve greater robustness and accuracy across diverse flood scenarios.
The first module, instance segmentation and data association, was used for its ability to precisely delineate object boundaries, an advantage over traditional bounding-box-based trackers, which often struggle with occlusion and irregular object shapes. Accurate segmentation is particularly valuable in flood scenes, where targets such as partially submerged vehicles or individuals require fine-grained identification.
The second module, corner feature extraction and optical flow tracking, captures motion cues across frames, supporting consistent tracking despite appearance changes caused by water reflection, debris, or motion blur. This feature tracking complements segmentation by maintaining identity continuity, especially in cases of brief occlusions or shape deformation.
The third module addresses long-term occlusion through a sliding window mechanism (set to 200 frames) and a similarity-based re-identification threshold (set to 90%), both determined empirically. These hyperparameters play a critical role in balancing recall and precision by reducing identity switches while maintaining track continuity.
Overall, the effectiveness of the proposed method relies on image features such as stable textures and relatively consistent lighting conditions that enhance both segmentation accuracy and the reliability of tracking.
Compared to benchmark trackers, the proposed method achieves state-of-the-art performance across key metrics, achieving HOTA (69.57%), DetA (67.32%), AssA (73.21%), LocA (81.82%), and IDF1 (89.82%). Field application results further confirm its effectiveness in tracking multiple objects in complex flood scenarios.
The proposed computer vision-based MOT method contains approximately 100 million parameters. All experiments were conducted on a Linux system equipped with an Intel i7 CPU and an Nvidia RTX 3090 Ti GPU using PyTorch, achieving an inference speed of 14.4 FPS. While the instance segmentation and data association module accounts for the majority of the computational cost due to its multi-stage object segmentation and localization processes, the optical flow tracking and long-term occlusion handling modules provide lightweight yet effective missed detection correction and re-identification (ReID) capabilities, respectively, with minimal impact on the overall inference speed.
The proposed method offers significant theoretical, empirical, and strategic benefits. From a theoretical perspective, this study advances the understanding of integrating computer vision-based MOT into intelligent flood scene analysis. The proposed framework (Figure 1) enables the recognition of spatiotemporal features in dynamic flood videos, adapting to challenging environments and accurately identifying target object locations and quantities for damage assessment. This enhances the theoretical foundation for deploying existing technologies and developing innovations to promote AI integration in humanitarian aid operations.
From an empirical perspective, the framework provides valuable insights into the practical performance of computer vision-based MOT in real-world humanitarian aid scenarios. Experimental results highlight its robust performance across diverse flood-affected areas and conditions, contributing to a deeper understanding of how MOT can be effectively applied in complex operational environments.
Strategically, the framework serves as a powerful tool for understanding the dynamics of flood-affected areas and optimizing rescue strategies. For instance, it enables rescue teams to efficiently plan staging points and operational routes based on rapid, dynamic, and comprehensive flood scene analysis. This capability enhances the efficiency and effectiveness of humanitarian aid operations in flood-impacted regions.
While this study primarily focuses on vision-based techniques for flood scene analysis, it is important to acknowledge the role of traditional hydrodynamic modeling in flood management [40,41]. We emphasize the complementary strengths of these approaches: vision-based methods provide wide spatial coverage with minimal reliance on ground data, making them well-suited for large-scale flood monitoring despite some limitations in detailed flow accuracy. In contrast, hydrodynamic models deliver precise water flow simulations but require extensive data and computational resources. Integrating vision-based data with hydrodynamic models could enhance parameter estimation and improve flood impact predictions, fostering a more effective and resilient flood management system.

4.2. Research Limitations and Future Research

Future research should address several limitations. First, due to resource constraints, only three baseline instance segmentation models were evaluated. We plan to include more models, such as those from the YOLO series and transformer-based models. Second, while our study focused on detecting and tracking objects like people, buildings, vehicles, and boats, future research will explore additional categories, including bridges and heavy machinery such as excavators and cranes. Third, a limitation of the current research lies in the empirical selection of certain key parameters in the pipeline, such as the 200-frame window for occlusion handling and the 90% similarity threshold. These values were determined through experimental trials rather than derived from a theoretical foundation. While we observed that these settings yield stable performance across our test datasets, the lack of an automated, data-driven parameter tuning mechanism may limit the generalizability of the method to unseen or more diverse scenarios. Future efforts will investigate adaptive strategies for parameter estimation, such as learning-based approaches or dynamic thresholding methods, which could improve the robustness and applicability of the system across different environments.

5. Conclusions

This study proposes a computer vision-based multi-object tracking (MOT) method to enhance intelligent flood scene understanding, addressing the urgent need for efficient disaster response in increasingly frequent extreme weather events. By leveraging dynamic video analysis over static imagery, the method enables real-time monitoring, object tracking, flood flow estimation, change detection, and behavior recognition. The proposed approach integrates three key modules: (1) instance segmentation and data association, (2) corner feature extraction with optical flow tracking for short-term undetected mask estimation, and (3) a deep re-identification (ReID) module for long-term occlusion handling. These components collectively improve MOT robustness and accuracy in complex flood environments.
The proposed method demonstrates notable improvements in accuracy, robustness, and generalization across multiple evaluation metrics, both in controlled performance tests and field applications. The proposed method outperforms benchmark trackers and achieves state-of-the-art results across several key metrics, including HOTA (69.57%), DetA (67.32%), AssA (73.21%), LocA (81.82%), and IDF1 (89.82%). Its effectiveness was further validated through field applications involving complex flood scenarios with multiple moving objects. These advancements are particularly crucial given the increasing frequency and intensity of flood events due to global warming. The study not only contributes methodologically by addressing key limitations of existing MOT approaches in flood contexts but also provides practical guidance for governmental and humanitarian aid organizations on utilizing intelligent technologies to enhance disaster response efficiency.
Key contributions of this research include (1) a deep learning and computer vision-based MOT method proposed for intelligent flood scene analysis across continuous video frames; (2) addressing the unique challenges of flood scenes compared to conventional environments via a MOT method that integrates an optical-flow-based module for trajectory prediction and a deep re-identification (ReID) module for handling long-term occlusions in flood scenarios; and (3) achieving state-of-the-art performance across multiple evaluation metrics in both performance testing and field applications. Through these contributions, our study underscores the potential of AI-driven solutions in mitigating the impacts of natural disasters and saving lives.
Despite promising results, this study has certain limitations, such as the evaluation of only three baseline models and the focus on specific object categories using only color images. In addition, the hyperparameters were empirically set rather than derived from theoretical foundations. In future research, we aim to extend the approach to a wider range of models and object types and explore data-driven strategies for parameter tuning, as well as investigate the transferability of the method across more diverse domains.
Looking forward, the proposed vision-based flood analysis framework holds promising potential beyond immediate flood scene understanding. Its ability to extract dynamic, high-resolution information in real time makes it a valuable component of early warning systems, enabling faster response and more informed decision making during disaster events. Moreover, the data generated by such systems could support post-event assessments and inform long-term risk reduction strategies, such as urban planning and infrastructure resilience, particularly under the growing threat of extreme weather events due to climate change. Integrating this approach into existing disaster management pipelines may significantly enhance preparedness, response efficiency, and adaptive capacity.

Author Contributions

Formal analysis, Y.Z. and Z.W.; Investigation, B.X. and L.H.; Methodology, X.Y.; Resources, B.X. and L.H.; Software, X.Y. and Z.W.; Supervision, Y.Z.; Visualization, X.Y. and Y.Z.; Writing—original draft, X.Y. and R.X.; Writing—review and editing, X.Y. and R.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 72201247.

Data Availability Statement

The source code for the models in this study is available in GitHub at the following address: https://github.com/XZ-YAN/CVB-MOT-Flood (accessed on 9 July 2025).

Acknowledgments

The authors thank the annotators Yiyuan Chen, Jingquan Zhao, Xiaolei Han, Hang Zhao, and Tieying Xue for their great contributions to this study. During the preparation of this manuscript/study, the authors used ChatGPT-4 for the purposes of improving readability. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. Author Bin Xu was employed by the company Zhejiang Province Sanjian Construction Group Co., Ltd. Author Liu He was employed by the company Zhejiang Construction Investment Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MOT    Multi-Object Tracking
ReID    Re-Identification
AI    Artificial Intelligence
DLBCV    Deep Learning-Based Computer Vision
UAV    Unmanned Aerial Vehicle
mAP    mean Average Precision
IoU    Intersection over Union
YOLO    You Only Look Once
COCO    Common Objects in Context
PR    Precision–Recall
CNN    Convolutional Neural Network
TP    True Positive
FP    False Positive
FN    False Negative
SGD    Stochastic Gradient Descent
DSConv    Distributed Shift Convolution

References

  1. Romero, T.Q.; Leandro, J. A method to devise multiple model structures for urban flood inundation uncertainty. J. Hydrol. 2022, 604, 127246. [Google Scholar] [CrossRef]
  2. Henderson, F.; Steiner, A.; Farmer, J.; Whittam, G. Challenges of community engagement in a rural area: The impact of flood protection and policy. J. Rural. Stud. 2020, 73, 225–233. [Google Scholar] [CrossRef]
  3. Goldschmidt, K.H.; Kumar, S. Humanitarian operations and crisis/disaster management: A retrospective review of the literature and framework for development. Int. J. Disaster Risk Reduct. 2016, 20, 1–13. [Google Scholar] [CrossRef]
  4. Iqbal, U.; Perez, P.; Li, W.; Barthelemy, J. How computer vision can facilitate flood management: A systematic review. Int. J. Disaster Risk Reduct. 2021, 53, 102030. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Liu, P.; Chen, L.; Xu, M.; Guo, X.; Zhao, L. A new multi-source remote sensing image sample dataset with high resolution for flood area extraction: GF-FloodNet. Int. J. Digit. Earth 2023, 16, 2522–2554. [Google Scholar] [CrossRef]
  6. DeVries, B.; Huang, C.; Armston, J. Rapid and robust monitoring of flood events using Sentinel-1 and Landsat data on the Google Earth Engine. Remote Sens. Environ. 2020, 240, 111664. [Google Scholar] [CrossRef]
  7. Wan, J.; Xue, F.; Shen, Y.; Song, H.; Shi, P.; Qin, Y.; Yang, T.; Wang, Q.J. Automatic segmentation of urban flood extent in video image with DSS-YOLOv8n. J. Hydrol. 2025, 655, 132974. [Google Scholar] [CrossRef]
  8. Sun, J.; Xu, C.; Zhang, C.; Zheng, Y.; Wang, P.; Liu, H. Flood scenarios vehicle detection algorithm based on improved YOLOv9. Multimed. Syst. 2025, 31, 74. [Google Scholar] [CrossRef]
  9. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  10. Jackson, J.; Yussif, S.B.; Patamia, R.A.; Sarpong, K.; Qin, Z. Flood or non-flooded: A comparative study of state-of-the-art models for flood image classification using the FloodNet dataset with uncertainty offset analysis. Water 2023, 15, 875. [Google Scholar] [CrossRef]
  11. Iqbal, U.; Riaz, M.Z.B.; Barthelemy, J.; Hutchison, N.; Perez, P. Floodborne objects type recognition using computer vision to mitigate blockage originated floods. Water 2022, 14, 2605. [Google Scholar] [CrossRef]
  12. Pally, R.J.; Samadi, S. Application of image processing and convolutional neural networks for flood image classification and semantic segmentation. Environ. Model. Softw. 2022, 148, 105285. [Google Scholar] [CrossRef]
  13. Rahnemoonfar, M.; Chowdhury, T.; Sarkar, A.; Varshney, D.; Yari, M.; Murphy, R.R. FloodNet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 2021, 9, 89644–89654. [Google Scholar] [CrossRef]
  14. Zhao, J.; Wang, X.; Zhang, C.; Hu, J.; Wan, J.; Cheng, L.; Shi, S.; Zhu, X. Urban waterlogging monitoring and recognition in low-light scenarios using surveillance videos and deep learning. Water 2025, 17, 707. [Google Scholar] [CrossRef]
  15. Wang, Y.; Shen, Y.; Salahshour, B.; Cetin, M.; Iftekharuddin, K.; Tahvildari, N.; Huang, G.; Harris, D.K.; Ampofo, K.; Goodall, J.L. Urban flood extent segmentation and evaluation from real-world surveillance camera images using deep convolutional neural network. Environ. Model. Softw. 2024, 173, 105939. [Google Scholar] [CrossRef]
  16. Rahnemoonfar, M.; Chowdhury, T.; Murphy, R. RescueNet: A high resolution UAV semantic segmentation benchmark dataset for natural disaster damage assessment. Sci. Data 2023, 10, 913. [Google Scholar] [CrossRef] [PubMed]
  17. Jafari, N.H.; Li, X.; Chen, Q.; Le, C.-Y.; Betzer, L.P.; Liang, Y. Real-time water level monitoring using live cameras and computer vision techniques. Comput. Geosci. 2021, 147, 104642. [Google Scholar] [CrossRef]
  18. Liang, Y.; Li, X.; Tsai, B.; Chen, Q.; Jafari, N. V-FloodNet: A video segmentation system for urban flood detection and quantification. Environ. Model. Softw. 2023, 5, 105586. [Google Scholar] [CrossRef]
  19. Muhadi, N.A.; Abdullah, A.F.; Bejo, S.K.; Mahadi, M.R.; Mijic, A. Image segmentation methods for flood monitoring system. Water 2020, 12, 1825. [Google Scholar] [CrossRef]
  20. Notarangelo, N.; Hirano, K.; Albano, R.; Sole, A. Transfer learning with convolutional neural networks for rainfall detection in single images. Water 2021, 13, 588. [Google Scholar] [CrossRef]
  21. Hassan, S.; Mujtaba, G.; Rajput, A.; Fatima, N. Multi-object tracking: A systematic literature review. Multimed. Tools Appl. 2024, 83, 43439–43492. [Google Scholar] [CrossRef]
  22. Yu, E.; Li, Z.; Han, S.; Wang, H. Relationtrack: Relation-aware multiple object tracking with decoupled representation. IEEE Trans. Multimed. 2022, 25, 3686–3697. [Google Scholar] [CrossRef]
  23. Liang, C.; Zhang, Z.; Zhou, X.; Li, B.; Zhu, S.; Hu, W. Rethinking the competition between detection and reid in multiobject tracking. IEEE Trans. Image Process 2022, 31, 3182–3196. [Google Scholar] [CrossRef] [PubMed]
  24. Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  25. He, J.; Huang, Z.; Wang, N.; Zhang, Z. Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5295–5305. [Google Scholar] [CrossRef]
  26. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  27. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar] [CrossRef]
  28. Alkandary, K.; Yildiz, A.S.; Meng, H. A comparative study of YOLO series (v3–v10) with DeepSORT and StrongSORT: A real-time tracking performance study. Electronics 2025, 14, 876. [Google Scholar] [CrossRef]
  29. Sim, H.-S.; Cho, H.-C. Enhanced DeepSORT and StrongSORT for multicattle tracking with optimized detection and re-identification. IEEE Access 2025, 13, 19353–19364. [Google Scholar] [CrossRef]
  30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  31. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef]
  32. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar] [CrossRef]
  33. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar] [CrossRef]
  34. Heydarian, M.; Doyle, T.E.; Samavi, R. MLCM: Multi-Label Confusion Matrix. IEEE Access 2022, 10, 19083–19095. [Google Scholar] [CrossRef]
  35. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar] [CrossRef]
  36. Yan, X.; Jin, R.; Zhang, H.; Gao, H.; Xu, S. Computer vision-based intelligent monitoring of disruptions due to construction machinery arrival delay. J. Comput. Civ. Eng. 2025, 39, 04025011. [Google Scholar] [CrossRef]
  37. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar] [CrossRef]
  38. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2020, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
  39. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6392–6401. [Google Scholar] [CrossRef]
  40. Patro, S.; Chatterjee, C.; Mohanty, S.; Singh, R.; Raghuwanshi, N.S. Flood inundation modeling using MIKE FLOOD and remote sensing data. J. Indian Soc. Remote Sens. 2009, 37, 107–118. [Google Scholar] [CrossRef]
  41. Corti, M.; Francioni, M.; Abbate, A.; Papini, M.; Longoni, L. Analysis and modelling of the September 2022 flooding event in the Misa Basin. Ital. J. Eng. Geol. Environ. 2024, 1, 69–76. [Google Scholar] [CrossRef]
Figure 1. Overall framework of the proposed method.
Figure 2. Corner feature extraction and optical flow-based tracking.
Figure 3. Architecture of the lightweight CNN-based occlusion handling module.
Figure 4. Original annotations and corresponding augmented annotations in the instance segmentation dataset.
Figure 5. Example frames of testing videos.
Figure 6. Loss curves and PR curves of three baseline instance segmentation models. (a) Loss curves and PR curves of Mask R-CNN; (b) loss curves and PR curves of Deformable CNN; (c) loss curves and PR curves of Cascade R-CNN.
Figure 7. Training loss and top 1 accuracy of the occlusion handling network. (a) Training loss curve; (b) top 1 accuracy curves.
Figure 8. Sample frames from the field applications. (a) Sample frames in case 1; (b) sample frames in case 2; (c) sample frames in case 3; (d) sample frames in case 4.
Table 1. Summary of representative computer vision-based flood scene understanding studies in the last 5 years.

References | Objectives | Visual Recognition Tasks | Major Advantages
[19] | Flood monitoring | Semantic segmentation | Capable of achieving a high Dice score (97.70%) and Jaccard Index (95.51%)
[17] | Water level monitoring | Semantic segmentation | Reproducible and accurate from a controlled environment in field applications
[13] | Post-flood scene understanding | Image classification, semantic segmentation, and visual question answering | Introducing a high-resolution UAV 1 imagery dataset “FloodNet” for multiple visual recognition tasks
[20] | Rainfall detection | Image classification | Capable of achieving high accuracy and F1 score
[12] | Water level and aspect ratio monitoring for flood severity and risk estimation | Object detection and instance segmentation | Developing an open-source Python package “FloodImageClassifier” that integrates multiple computer vision models
[11] | Floodborne objects type recognition | Object detection | Enhanced performance in mAP 2
[16] | Disaster damage assessment | Semantic segmentation | Introducing a UAV imagery dataset “RescueNet” for semantic segmentation
[18] | Water level and inundation depth estimation | Instance segmentation and video object segmentation | Balancing speed and segmentation quality, which are insensitive to the input resolution; capable of segmenting flood and reference objects in long video sequences under various weather conditions
[15] | Urban flood extent segmentation | Semantic segmentation | Capable of achieving a very high F1 score, exceeding 0.9
[8] | Vehicle detection in flood scenarios | Object detection | Enhanced performance in accuracy, F1-score, and mAP; model training converges quickly and consumes less memory
[7] | Urban flood extent segmentation | Semantic segmentation | Enhanced performance in both box mAP and mask mAP at 50% recall; demonstrating robustness and generality in complex urban flood scenarios
[14] | Urban waterlogging monitoring | Semantic segmentation | Enhanced performance in mean recall, mean F1-score, and mean IoU 3; showcasing robustness under low-light conditions for all-weather applications
Notes: 1 UAV denotes Unmanned Aerial Vehicle. 2 mAP denotes mean Average Precision. 3 IoU denotes intersection over union.
Table 2. Tracking performance of StrongSORT and the proposed MOT method with Mask R-CNN.

Video | Tracker | HOTA (%) | DetA (%) | AssA (%) | LocA (%) | IDF1 (%) | IDSWs
#01 | StrongSORT | 72.21 | 72.21 | 72.21 | 81.15 | 97.75 | 0
#01 | Proposed MOT | 74.01 | 74.01 | 74.01 | 81.10 | 99.68 | 0
#02 | StrongSORT | 61.73 | 53.62 | 71.06 | 86.78 | 76.36 | 3
#02 | Proposed MOT | 65.89 | 57.41 | 75.63 | 86.89 | 79.58 | 3
#03 | StrongSORT | 58.63 | 54.26 | 63.53 | 81.46 | 79.78 | 1
#03 | Proposed MOT | 58.11 | 53.28 | 63.57 | 81.68 | 78.44 | 0
#04 | StrongSORT | 73.09 | 72.36 | 73.90 | 82.77 | 95.21 | 0
#04 | Proposed MOT | 75.85 | 75.03 | 76.81 | 82.81 | 97.13 | 0
#05 | StrongSORT | 31.56 | 24.14 | 52.71 | 75.11 | 38.69 | 5
#05 | Proposed MOT | 33.81 | 25.50 | 53.80 | 74.79 | 43.42 | 3
#06 | StrongSORT | 63.99 | 61.74 | 66.57 | 81.21 | 88.84 | 2
#06 | Proposed MOT | 69.18 | 67.31 | 71.45 | 81.24 | 93.28 | 2
#07 | StrongSORT | 75.54 | 75.23 | 75.86 | 84.34 | 95.76 | 0
#07 | Proposed MOT | 76.27 | 76.13 | 76.43 | 84.38 | 96.34 | 0
#08 | StrongSORT | 77.27 | 73.64 | 82.34 | 81.09 | 98.69 | 2
#08 | Proposed MOT | 78.72 | 75.00 | 83.93 | 81.11 | 99.63 | 1
#09 | StrongSORT | 72.19 | 72.19 | 72.19 | 79.51 | 98.36 | 1
#09 | Proposed MOT | 72.83 | 72.83 | 72.83 | 79.39 | 99.00 | 1
#10 | StrongSORT | 77.21 | 77.01 | 77.42 | 83.49 | 98.15 | 0
#10 | Proposed MOT | 78.97 | 78.78 | 79.16 | 83.48 | 99.32 | 0
Table 3. Tracking performance of StrongSORT and the proposed MOT method with Deformable CNN.

Video | Tracker | HOTA (%) | DetA (%) | AssA (%) | LocA (%) | IDF1 (%) | IDSWs
#01 | StrongSORT | 70.14 | 70.14 | 70.14 | 81.02 | 96.35 | 0
#01 | Proposed MOT | 71.90 | 71.90 | 71.90 | 80.97 | 97.65 | 0
#02 | StrongSORT | 64.93 | 59.91 | 70.37 | 87.10 | 81.67 | 2
#02 | Proposed MOT | 69.77 | 65.02 | 74.86 | 87.20 | 85.63 | 2
#03 | StrongSORT | 58.93 | 54.29 | 64.20 | 81.59 | 79.67 | 1
#03 | Proposed MOT | 58.21 | 53.21 | 63.94 | 81.77 | 78.38 | 0
#04 | StrongSORT | 72.50 | 71.71 | 73.39 | 82.67 | 94.94 | 0
#04 | Proposed MOT | 75.69 | 74.81 | 76.72 | 82.68 | 97.13 | 0
#05 | StrongSORT | 32.83 | 25.13 | 53.06 | 75.20 | 39.44 | 5
#05 | Proposed MOT | 35.29 | 26.61 | 54.89 | 75.22 | 44.32 | 3
#06 | StrongSORT | 62.89 | 58.75 | 67.61 | 81.27 | 86.14 | 2
#06 | Proposed MOT | 70.01 | 66.27 | 74.32 | 81.52 | 92.24 | 2
#07 | StrongSORT | 74.72 | 73.48 | 75.98 | 84.83 | 94.26 | 0
#07 | Proposed MOT | 75.82 | 74.91 | 76.75 | 84.90 | 95.17 | 0
#08 | StrongSORT | 77.43 | 73.68 | 82.75 | 81.09 | 98.69 | 2
#08 | Proposed MOT | 78.90 | 75.07 | 84.38 | 81.11 | 99.63 | 1
#09 | StrongSORT | 72.23 | 72.23 | 72.23 | 79.68 | 98.87 | 0
#09 | Proposed MOT | 72.85 | 72.85 | 72.85 | 79.58 | 99.25 | 0
#10 | StrongSORT | 77.61 | 77.57 | 77.66 | 83.58 | 98.15 | 0
#10 | Proposed MOT | 79.28 | 79.23 | 79.33 | 83.52 | 99.32 | 0
Table 4. Tracking performance of StrongSORT and the proposed MOT method with Cascade R-CNN.

Video | Tracker | HOTA (%) | DetA (%) | AssA (%) | LocA (%) | IDF1 (%) | IDSWs
#01 | StrongSORT | 72.60 | 72.60 | 72.60 | 80.82 | 98.14 | 0
#01 | Proposed MOT | 74.24 | 74.24 | 74.24 | 80.78 | 99.34 | 0
#02 | StrongSORT | 70.48 | 68.96 | 72.03 | 87.37 | 88.55 | 3
#02 | Proposed MOT | 75.42 | 74.16 | 76.71 | 87.50 | 92.12 | 2
#03 | StrongSORT | 58.94 | 54.36 | 64.08 | 81.48 | 79.89 | 0
#03 | Proposed MOT | 58.41 | 53.42 | 64.06 | 81.70 | 78.70 | 0
#04 | StrongSORT | 72.38 | 71.54 | 73.32 | 82.81 | 94.80 | 1
#04 | Proposed MOT | 75.55 | 74.61 | 76.65 | 82.86 | 96.99 | 0
#05 | StrongSORT | 32.68 | 25.04 | 53.13 | 75.18 | 40.75 | 6
#05 | Proposed MOT | 34.46 | 26.64 | 53.42 | 75.03 | 43.15 | 2
#06 | StrongSORT | 64.37 | 60.45 | 68.79 | 81.43 | 88.32 | 0
#06 | Proposed MOT | 70.54 | 67.12 | 74.41 | 81.43 | 93.72 | 2
#07 | StrongSORT | 75.30 | 74.93 | 75.70 | 84.48 | 95.26 | 2
#07 | Proposed MOT | 76.19 | 76.06 | 76.35 | 84.50 | 96.00 | 0
#08 | StrongSORT | 77.02 | 73.29 | 82.23 | 80.88 | 98.69 | 1
#08 | Proposed MOT | 78.49 | 74.67 | 83.85 | 80.90 | 99.63 | 1
#09 | StrongSORT | 72.05 | 72.05 | 72.05 | 79.47 | 98.87 | 0
#09 | Proposed MOT | 72.65 | 72.65 | 72.65 | 79.37 | 99.25 | 0
#10 | StrongSORT | 78.03 | 77.97 | 78.09 | 84.17 | 98.15 | 0
#10 | Proposed MOT | 79.72 | 79.67 | 79.77 | 84.09 | 99.32 | 0
Table 5. Combined tracking performance of the proposed MOT method across all testing videos.

Baseline Network | Tracker | HOTA (%) | DetA (%) | AssA (%) | LocA (%) | IDF1 (%) | IDSWs | FPS 1
Mask R-CNN | StrongSORT | 66.34 | 63.64 | 70.78 | 81.69 | 86.76 | 14 | 8.6
Mask R-CNN | Proposed MOT | 68.36 | 65.53 | 72.76 | 81.69 | 88.58 | 10 | 14.5
Deformable CNN | StrongSORT | 66.42 | 63.69 | 70.74 | 81.80 | 86.82 | 12 | 8.2
Deformable CNN | Proposed MOT | 68.77 | 65.99 | 72.99 | 81.85 | 88.87 | 8 | 14.3
Cascade R-CNN | StrongSORT | 67.38 | 65.12 | 71.20 | 81.81 | 88.14 | 13 | 8.1
Cascade R-CNN | Proposed MOT | 69.57 | 67.32 | 73.21 | 81.82 | 89.82 | 7 | 14.4
Note: 1 FPS denotes frames per second.