1. Introduction
Globally, flood damage caused by heavy rainfall has been increasing significantly due to climate change [1,2,3,4]. Areas such as semi-basement residential zones, narrow alleyways, underpasses, and lowlands with inadequate drainage infrastructure are particularly vulnerable to flooding. These areas are often inhabited by economically vulnerable populations, so flood events lead to substantial financial losses and challenging recovery efforts [5,6,7]. These residents are exposed to greater risks, as they face relative difficulty in emergency evacuation and have limited capacity to respond swiftly during flood events [8,9].
In 2022, unusually heavy rainfall in Seoul, South Korea, flooded semi-basement residential areas, resulting in the deaths of residents who were unable to evacuate in time. In 2023, the flooding of an underground roadway in Osong, South Korea, likewise caused numerous casualties. In 2024, flash flooding in Valencia, Spain, led to multiple fatalities, including deaths reported in narrow alleyways. These cases indicate that the risks of heavy rainfall and flooding induced by climate change are steadily increasing on a global scale, and that existing disaster prevention systems are insufficient to effectively prevent or respond to all flood-related damage [10,11].
To effectively respond to these emerging risks, it is critical to develop detection methods tailored specifically to flood-vulnerable urban areas. However, research focusing on these areas remains limited. Thus far, research on deep learning-based flood detection and prediction has primarily focused on data from areas near rivers or large-scale inundation zones [12,13]. Moreover, most existing studies have relied on data collected from a distance or on aerial data obtained through unmanned aerial vehicles (UAVs) [14,15,16]. While such approaches may be effective in detecting large-scale river flooding or widespread inundation, they have limitations in accurately identifying localized flood-vulnerable areas within urban environments, such as narrow alleyways or semi-basement residential zones [16,17]. Although some studies have utilized ground-level imagery to detect floods, these images were not specifically tailored to economically and flood-vulnerable areas such as narrow alleyways or semi-basement residences, making it difficult to detect floods effectively in such settings [17,18]. In particular, while UAV-based data can provide high-resolution imagery, it is often difficult to achieve immediate detection and response in rapidly evolving flood situations [19]. Therefore, there is a growing need for data collection and analysis specifically focused on flood-vulnerable areas.
Although damage in urban flood-vulnerable areas is increasing, previous studies have not focused sufficiently on promptly detecting floods in these areas to prevent casualties and property damage. Most studies have relied primarily on remote imagery, and even those using ground-level imagery have rarely targeted economically and flood-vulnerable urban areas. The development of customized flood detection and response systems is therefore essential. To address this gap, this study constructs a specialized ground-level image dataset, AlleyFloodNet, designed to accurately detect flood conditions and support alerts in flood-vulnerable areas such as narrow alleyways, lowlands, and semi-basement residential spaces. Unlike previous studies, this research specifically focuses on ground-level image analysis tailored to these environments. The primary objective of this study is to overcome the limitations of prior research and enable practical responses in real-world disaster situations by fine-tuning various deep learning algorithms to this domain using the AlleyFloodNet dataset.
In this study, various deep learning-based image classification models are fine-tuned on the constructed dataset, whose images were collected from Google and YouTube, adapting each model to accurately classify ground-level imagery from flood-vulnerable urban areas. The performance of the models is evaluated using metrics such as accuracy, precision, recall, and F1 score. Additionally, to further validate the effectiveness and necessity of AlleyFloodNet, this study also utilizes another publicly available ground-level flood classification dataset obtained from Kaggle [20]. Models fine-tuned using this general ground-level imagery dataset—which is not specifically tailored to economically and flood-vulnerable areas—are evaluated on AlleyFloodNet’s test set to emphasize the need for specialized datasets. Misclassifications are also explored to better understand challenging conditions in flood-vulnerable areas and to identify specific visual features that cause detection errors, which can guide future dataset improvements and algorithm refinement. Through these experiments, this study aims to establish the effectiveness and necessity of AlleyFloodNet, ultimately contributing to the development of reliable, real-time flood detection and alert systems for protecting lives and property in economically and flood-vulnerable urban areas.
The rest of this paper is structured as follows. Section 2 (Related Works) systematically analyzes existing flood detection studies from two aspects: dataset construction methods and detection techniques. Section 3 (Materials and Methods) describes the design purpose and criteria of the AlleyFloodNet dataset, data preprocessing methods, and the architectures of the deep learning models used in this study. Section 4 (Experimental Evaluation) details the experimental environment, hyperparameter configurations, performance evaluation methods, the comparative experiment with a general ground-level image dataset, and the misclassification analysis method. Section 5 (Results) presents quantitative evaluation results of the models trained on AlleyFloodNet, comparison results with another dataset, and results of the misclassification analysis. Section 6 (Discussion) interprets the main findings of this research, including the superior performance of AlleyFloodNet and the necessity of viewpoint-specific datasets, and discusses the advantages and significance of AlleyFloodNet, its limitations and ways to address them, and directions for future work. Finally, Section 7 (Conclusions) summarizes the key findings, clarifies the significance and limitations of this study, and suggests directions for future research. The overall workflow of this study is summarized in Figure 1.
2. Related Works
2.1. Dataset-Based Research
The construction of high-quality datasets for flood detection is a prerequisite for reliable analysis, and many researchers have attempted to overcome the limitations of existing flood datasets. For example, Rahnemoonfar et al. pointed out that traditional flood datasets mainly rely on low-resolution satellite imagery with infrequent updates, making rapid damage assessment difficult [16]. To address this, they utilized high-resolution UAV imagery. As a result, they developed a dedicated dataset named FloodNet, which includes detailed imagery of otherwise inaccessible regions and provides pixel-level annotations to distinguish flooded from non-flooded areas, enabling rapid and fine-grained scene analysis. Manaf et al. created a large-scale flood image dataset by integrating multiple benchmark datasets and collecting additional images from the web [21]. Their experiments showed that lightweight CNN models such as MobileNet and Xception outperformed ResNet-50, VGG-16, and Inception-v3, achieving up to 98% accuracy and a 92% F1 score. Karanjit et al. developed FloodIMG, a flood image database collected from various sources including Google searches, social media, traffic cameras, and the USGS [22]. Some studies also collected specialized image data, such as UAV aerial images or urban underpass scenes, to be used as training data for model development [23,24,25,26]. However, most existing datasets focus on large-scale river flooding or general urban environments, lacking specialized data for vulnerable urban areas such as alleyways and semi-basement residences. The AlleyFloodNet dataset proposed in this study aims to address that gap.
2.2. Deep Learning-Based Detection Techniques and Methodologies
A range of deep learning models have been proposed for flood detection, each employing different architectures and techniques depending on the image type and application context. Munawar et al. developed a UAV-based flood detection system by first applying Haar cascade classifiers to identify buildings and roads, and then using these features to train a deep learning model for classifying flood presence [23]. Stateczny et al. proposed a hybrid deep learning model that combines CNN and ResNet architectures. Prior to model training, they applied image preprocessing techniques including K-means clustering and vegetation index calculation to enhance feature quality. Model performance was further improved by optimizing weights through the CHHSSO metaheuristic algorithm [24]. In another study, Munawar et al. extracted 2150 image patches from UAV imagery taken before and after flooding events, and trained a CNN to recognize spatial flood patterns for automated flood assessment [25].
From a model architecture perspective, recent studies have implemented specialized deep learning techniques to enhance flood detection in complex environments, focusing on segmentation accuracy, object-level understanding, and robustness under poor visual conditions. Yoo et al. constructed a U-Net–based segmentation model tailored for identifying flood regions in urban underpasses. The model achieved high accuracy in real-world conditions, demonstrating the effectiveness of fully convolutional architectures in dense urban environments [26]. Zhong et al. implemented an object detection–based flood recognition approach using YOLOv4. By targeting partially submerged features such as vehicle exhaust pipes and pedestrians’ legs, the model inferred inundation status without requiring additional sensors [27]. Vo et al. applied a deep learning–based visual recognition model to detect structural cues of flood vulnerability, such as semi-basement windows or entrances, from urban building facades [28]. Witherow et al. developed a preprocessing-enhanced CNN pipeline that addressed visual noise such as poor lighting, reflection, and occlusions. Their system used edge detection, inpainting, and vehicle removal via R-CNN to extract flood regions with higher accuracy [29]. Zeng et al. enhanced semantic segmentation performance in low-light video data by integrating SRGAN-generated super-resolution images into a DeepLabv3+ framework. This led to superior accuracy in CCTV-based flood detection compared to standard models [30].
This study does not merely propose a new dataset, but establishes AlleyFloodNet as a platform for adapting and revalidating deep learning models in the context of structurally complex and socially vulnerable urban environments. Through fine-tuning high-performing architectures such as ConvNeXt, Vision Transformer, and others, the study demonstrates how existing deep learning methods can be extended to address visual ambiguities, constrained spaces, and non-standard perspectives typical of urban flood-vulnerable areas. AlleyFloodNet thus serves not only as a domain-specific benchmark dataset, but also as a robust experimental framework for advancing flood detection technologies that are resilient to the complexities of real-world urban flooding scenarios.
3. Materials and Methods
3.1. Data Development and Preprocessing
In this study, we developed an original dataset named AlleyFloodNet, composed of flood and non-flood images, to identify the occurrence of flooding using a binary classification approach. Considering that existing datasets mainly consist of either aerial images or general ground-level images not specifically captured in economically vulnerable urban environments, we collected ground-level photographs specifically targeting narrow alleyways, semi-basement residences, and lowlands, areas particularly prone to severe flood damage yet underrepresented in current datasets. The dataset consists of images collected from economically and flood-vulnerable urban areas around the world, including South Korea, Japan, China, Vietnam, the United States, Spain, Italy, and India. These images were captured at close range and from low camera angles via CCTV and smartphones, and were obtained from publicly accessible platforms such as Google and YouTube.
AlleyFloodNet was developed by taking into account the various types of floods that occur in flood-vulnerable areas. Floods can take different forms, such as rapidly flowing and rising water or stagnant and gradually rising water. Depending on the form of the flood, the color of the water can also vary, appearing brown due to mixed soil or remaining relatively clear. Furthermore, the visual characteristics of the water change with the time of day as the amount of light varies, and floods tend to cause more damage under lower lighting conditions. AlleyFloodNet was therefore designed so that models trained on it can classify flood situations effectively across this wide range of conditions, including flood type, water color, and visual changes over time.
AlleyFloodNet was developed by collecting not only photographs of flooding situations taken in alleys and lowlands, but also images captured under non-flooding conditions, to perform binary classification of flooding versus non-flooding. Compared to existing ground-level datasets—which are typically general in nature and include images from various distances and angles—AlleyFloodNet explicitly focuses on economically vulnerable urban areas and contains images captured from close proximity at lower angles, enabling precise detection of flooding in highly localized urban environments such as narrow alleys and semi-basement residences. This allows the deep learning models to quickly detect situations in which water rapidly accumulates, and to accurately assess the actual risk of flooding rather than simply determining the presence of rainfall. Moreover, both flooding and non-flooding data include various objects such as people and vehicles, allowing the model to be trained effectively under diverse environmental conditions. Based on this design, AlleyFloodNet is structured to support effective model training in various flood-prone areas and is expected to contribute to flooding detection in vulnerable regions.
A total of 1110 images were collected for use in this study. The size of AlleyFloodNet was determined with reference to the training set size (637 images) of FloodNet Track1, a well-known dataset for flooding and non-flooding classification. All images were resized to 224 × 224 pixels and normalized using the ImageNet mean and standard deviation. The entire dataset was evenly divided across classes into a training set and a test set at a ratio of 8:2. Additionally, the training set was further split into a training and validation set at a ratio of 8:2, maintaining class balance. As a result, the dataset was divided as follows: the training set consisted of 468 flooding and 380 non-flooding images; the test set included 133 flooding and 129 non-flooding images; and the validation set contained 79 flooding and 90 non-flooding images.
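For illustration, the preprocessing and splitting steps above can be expressed as a minimal PyTorch sketch. The directory layout (`AlleyFloodNet/flood`, `AlleyFloodNet/non_flood`) and the random seed are hypothetical, and a plain `random_split`, unlike the class-balanced split used in this study, does not guarantee stratification:

```python
import torch
from torchvision import datasets, transforms

# Resize to 224 x 224 and normalize with the ImageNet statistics given above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: AlleyFloodNet/{flood,non_flood}/*.jpg
full_set = datasets.ImageFolder("AlleyFloodNet", transform=preprocess)

# 8:2 train/test split, then 8:2 train/validation split of the training portion.
gen = torch.Generator().manual_seed(42)
n_test = int(0.2 * len(full_set))
train_val, test_set = torch.utils.data.random_split(
    full_set, [len(full_set) - n_test, n_test], generator=gen)
n_val = int(0.2 * len(train_val))
train_set, val_set = torch.utils.data.random_split(
    train_val, [len(train_val) - n_val, n_val], generator=gen)
```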
Figure 2 shows example images from the flood class, while Figure 3 presents example images from the non-flood class in the AlleyFloodNet dataset. In this study, ground-level images retrieved by searching for ‘flood’ on Google and YouTube were manually reviewed and labeled. Specifically, images were classified as ‘flood’ if major structures, roads, vehicles, or people were clearly submerged or structural elements were visibly inundated by water. Conversely, images showing merely wet ground surfaces or heavy rainfall without noticeable water accumulation or visible inundation were classified as ‘non-flood’. Through this classification, the AlleyFloodNet dataset facilitates deep learning models to effectively distinguish between flood and non-flood images, thus providing a robust training foundation for developing real-time flood detection and alert systems.
3.2. Deep Learning Models
In this study, several deep learning models were employed to effectively detect flooding occurrences. Deep learning models, particularly Convolutional Neural Networks (CNNs) and Transformer-based architectures, are known to achieve outstanding performance in complex pattern recognition and image classification tasks. However, training these models from scratch can result in overfitting when sufficient data are not available, and requires substantial computational resources [31,32]. Therefore, this study adopted a fine-tuning approach, using pre-trained weights from models already trained on large-scale image datasets such as ImageNet. This approach improves the generalization capability of the models and allows them to effectively capture the specific flooding patterns of economically vulnerable urban areas represented in AlleyFloodNet. The models utilized in this study include widely recognized CNN-based architectures such as AlexNet, VGG-19, ResNet-50, and DenseNet-121, as well as more recent architectures such as Vision Transformer (ViT) and ConvNeXt.
Ground-level images collected from flood-vulnerable areas were clearly labeled into two classes: flooded and non-flooded. Subsequently, these labeled images underwent preprocessing, including resizing and normalization, to match the input specifications of the selected deep learning models. Then, several well-known deep learning models were fine-tuned using this dataset, enabling each model to learn distinct visual patterns for distinguishing flooded from typical urban images.
3.2.1. AlexNet
AlexNet is a deep convolutional neural network that achieved groundbreaking results in the 2012 ImageNet competition. It consists of five convolutional layers and three fully connected layers, using ReLU activation to speed up training and Dropout to reduce overfitting. The model also introduced Local Response Normalization and made effective use of GPU parallelism to accelerate computation. While it played a crucial role in advancing CNN research, it demands more memory and computational power than more recent architectures like VGGNet and ResNet [33].
3.2.2. VGG-19
VGG-19, part of the VGGNet family, is a deep CNN with 19 layers that stood out in the 2014 ILSVRC. It uses multiple small 3 × 3 convolutional filters stacked together, allowing for greater depth and non-linearity while maintaining computational efficiency. Max pooling reduces spatial dimensions, and two fully connected layers with 4096 neurons handle classification. Though simple in design, VGG-19 offers strong performance, but its depth increases computational demands and can lead to vanishing gradients, sometimes addressed with batch normalization [34].
3.2.3. ResNet-50
ResNet-50 is a deep neural network with 50 layers that introduced residual learning to tackle the vanishing gradient problem in very deep models. Its key innovation is the use of skip connections, which allow feature information to bypass layers, enabling more stable and efficient training. The network uses a 1 × 1, 3 × 3, 1 × 1 bottleneck block structure to reduce computation while preserving accuracy. Thanks to this design, ResNet-50 achieves strong performance on large-scale datasets like ImageNet and has become a widely used model for tasks like object detection and image segmentation due to its robustness and generalization capabilities [35].
3.2.4. DenseNet-121
DenseNet-121 is a deep CNN with 121 layers that introduces dense connectivity, where each layer receives input from all previous layers. This design enhances feature reuse, improves gradient flow, and reduces overfitting. It uses 1 × 1 bottleneck layers and global average pooling to minimize parameters and computational cost. Despite its depth, DenseNet-121 is memory-efficient and achieves strong generalization, especially with smaller datasets. It performs comparably or better than models like VGGNet and ResNet while using significantly fewer parameters, making it well-suited for tasks like medical imaging and object recognition [36].
3.2.5. ViT (Vision Transformer)
Vision Transformer (ViT) is an image classification model that replaces traditional CNNs with a Transformer-based architecture. It splits images into fixed-size patches, embeds them, and processes them using a Transformer encoder with self-attention, enabling better capture of long-range dependencies. ViT shows strong performance on large datasets but requires substantial data and computational resources. On smaller datasets, it may underperform compared to CNNs, though this can be mitigated with transfer learning. Recently, hybrid models combining CNNs and Transformers have been developed to address these challenges [37].
3.2.6. ConvNeXt-Large
ConvNeXt is a modern CNN architecture built on ResNet and enhanced with design ideas from Transformer models like Swin Transformer. It incorporates advanced techniques such as layer normalization, large 7 × 7 convolution kernels, and depthwise separable convolutions to improve efficiency and performance. With a deeper architecture and updated normalization strategies, ConvNeXt matches or exceeds the performance of Transformer-based models in tasks like classification, detection, and segmentation. It challenges the notion that CNNs are outdated, proving they remain powerful and competitive in modern deep learning [38].
4. Experimental Evaluation
4.1. Experimental Setup
In this study, experiments were conducted using PyTorch 2.1.0+cu118 in a Google Colab environment equipped with an NVIDIA A100 GPU to evaluate the performance of image classification models. Input images were resized to 224 × 224 pixels and normalized using the mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225) of the ImageNet dataset to enhance model generalization. The data were loaded in batches of size 32 for training, validation, and testing.
All deep learning models, including the ConvNeXt-Large architecture, were imported directly from the torchvision.models module provided by PyTorch. The pre-trained weights from the ImageNet dataset were utilized, and modifications were applied specifically to the final output layers of these models to match the binary classification task (flood vs. non-flood).
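As an illustrative sketch of this step, the snippet below loads ImageNet-pretrained backbones from `torchvision.models` and swaps the final layer for a single-logit head, which pairs with the BCEWithLogitsLoss setup described next. The layer paths follow torchvision's model definitions; the specific weight enums and the ViT variant (`vit_b_16`) are our assumptions:

```python
import torch.nn as nn
from torchvision import models

def build_model(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and replace its final
    classification layer with a single logit (flood vs. non-flood)."""
    if name == "convnext_large":
        m = models.convnext_large(weights=models.ConvNeXt_Large_Weights.IMAGENET1K_V1)
        m.classifier[2] = nn.Linear(m.classifier[2].in_features, 1)
    elif name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        m.fc = nn.Linear(m.fc.in_features, 1)
    elif name == "densenet121":
        m = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        m.classifier = nn.Linear(m.classifier.in_features, 1)
    elif name == "vgg19":
        m = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, 1)
    elif name == "alexnet":
        m = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, 1)
    elif name == "vit_b_16":  # assumed ViT variant; the paper says only "ViT"
        m = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
        m.heads.head = nn.Linear(m.heads.head.in_features, 1)
    else:
        raise ValueError(f"unknown model: {name}")
    return m
```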
During training, BCEWithLogitsLoss was employed as the loss function, and the AdamW optimizer was adopted to effectively update model parameters. We utilized an early-stopping mechanism based on validation loss to prevent overfitting and to save computational resources. Specifically, training was set to stop if no improvement in validation loss was observed for four consecutive epochs. Additionally, the average training time per epoch and inference time per single image were measured by recording execution times using Python’s built-in time module.
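A minimal sketch of this training loop, assuming the `build_model` helper above and standard DataLoaders; the learning rate is an assumed value, as it is not specified here:

```python
import time
import torch

def fine_tune(model, train_loader, val_loader, device, patience=4, max_epochs=50):
    """Fine-tune with BCEWithLogitsLoss and AdamW; stop early once the
    validation loss fails to improve for `patience` consecutive epochs."""
    model.to(device)
    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed lr
    best_val, stale_epochs = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        start = time.time()
        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.float().unsqueeze(1).to(device)  # (B, 1) for BCE
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        epoch_time = time.time() - start  # averaged across epochs for Table 2

        # Validation loss drives checkpointing and early stopping.
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images = images.to(device)
                labels = labels.float().unsqueeze(1).to(device)
                val_loss += criterion(model(images), labels).item() * len(images)
                n += len(images)
        val_loss /= n
        print(f"epoch {epoch}: {epoch_time:.2f}s, val loss {val_loss:.4f}")

        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep best weights
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # early stopping after 4 epochs without improvement
    return model
```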
The training process was monitored in real-time using the tqdm library, which provided continuous feedback on epoch-wise training progress and performance metrics. After completing training, inference time was measured using a single test image to evaluate the model’s applicability for real-time flood detection systems. Additionally, the best-performing model based on validation loss was automatically saved using PyTorch’s built-in torch.save function, ensuring reproducibility and facilitating future reuse. Finally, evaluation metrics such as accuracy, precision, recall, and F1 score were computed using functions provided by the scikit-learn library.
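The metric computation can be sketched as follows, assuming the single-logit models above; the 0.5 decision threshold is our assumption:

```python
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

@torch.no_grad()
def evaluate(model, test_loader, device):
    """Compute the four metrics reported in Section 5 for a single-logit model."""
    model.eval().to(device)
    y_true, y_pred = [], []
    for images, labels in test_loader:
        logits = model(images.to(device)).squeeze(1)
        preds = (torch.sigmoid(logits) > 0.5).long().cpu()  # assumed threshold
        y_pred.extend(preds.tolist())
        y_true.extend(labels.tolist())
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```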
4.2. Evaluation Metrics
To comprehensively evaluate model performance, accuracy, recall, precision, and F1 score were employed as the primary evaluation metrics. Accuracy represents the proportion of correctly classified samples out of the total samples and serves as a useful indicator of overall model performance. However, in the presence of class imbalance, accuracy alone may be insufficient for performance assessment; thus, additional metrics were analyzed in conjunction. Recall indicates the proportion of actual positive samples that are correctly predicted and is particularly important in domains such as medical diagnosis or anomaly detection, where false negatives can have critical consequences. In contrast, precision measures the proportion of predicted positive samples that are truly positive and is particularly relevant in tasks such as financial fraud detection or spam filtering, where reducing false positives is crucial. Lastly, the F1 score considers the balance between precision and recall by computing their harmonic mean, enabling an unbiased evaluation that does not favor either metric. In this study, these evaluation metrics were collectively analyzed to compare the predictive performance of the models and to identify the most optimal model.
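For reference, letting TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively, these metrics are defined as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$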
4.3. Average Fine-Tuning Time per Epoch and Single Image Inference Time
To compare the training speed and real-time detection capability of each model, the average training time per epoch and single image inference time were measured. The average fine-tuning time per epoch was measured by recording the fine-tuning time for each epoch using the AlleyFloodNet dataset and calculating the mean. This enabled analysis of the differences in training speed across models. Single image inference time was calculated by averaging the prediction times for 224 × 224-sized images from the test set after model training was completed. This was used to evaluate the extent to which each model can process data in real-time within a flood detection system.
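A sketch of the single-image timing measurement, assuming a trained model and the test set above. The warm-up pass and CUDA synchronization are our additions, since timing a GPU forward pass with Python's time module is only meaningful after synchronization:

```python
import time
import torch

@torch.no_grad()
def time_single_inference(model, test_set, device, n_runs=100):
    """Average forward-pass latency for one 224 x 224 test image."""
    model.eval().to(device)
    image, _ = test_set[0]
    batch = image.unsqueeze(0).to(device)  # add batch dim -> (1, 3, 224, 224)

    model(batch)  # warm-up so one-off CUDA initialization is not timed
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(n_runs):
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / n_runs
```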
4.4. Comparative Analysis of Flood Datasets and Misclassification Patterns
To further validate the effectiveness, practical utility, and uniqueness of the AlleyFloodNet dataset, this study conducts comparative experiments using the Flood Classification Dataset [20], a publicly available large-scale ground-level flood image dataset from Kaggle. The Flood Classification Dataset comprises 9296 flood images and 3748 non-flood images, including ground-level, near-field images capturing various flooding scenarios worldwide. However, this dataset primarily consists of ground-level imagery collected from general urban environments, without specific consideration of localized spatial contexts or regional vulnerability characteristics.
In contrast, AlleyFloodNet specifically targets localized flooding scenarios in economically and environmentally vulnerable urban settings, such as narrow alleyways, semi-basement residences, and lowlands. Images in AlleyFloodNet are carefully collected considering the spatial context and unique environmental features associated with urban flooding, such as specific architectural structures, narrow and enclosed spaces, and particular camera angles or viewpoints optimized for urban flood detection. Thus, AlleyFloodNet is distinctly specialized for enhancing detection accuracy and real-world applicability in vulnerable urban flood situations, rather than for generalized flood detection scenarios.
Based on these considerations, this study adopts a cross-dataset experimental design, separating 15% of the Flood Classification Dataset’s training data as a validation set, fine-tuning pre-trained deep learning models using the remaining data, and subsequently evaluating model performance on the AlleyFloodNet dataset. The justification for this experimental design is to empirically assess whether a model trained on ground-level imagery from general urban environments can effectively detect localized flooding events in economically and environmentally vulnerable urban regions. By analyzing how accurately models trained on general urban data can perform on the region-specific AlleyFloodNet dataset, we aim to quantitatively demonstrate the necessity and importance of constructing and utilizing localized, specialized datasets in disaster response and urban flood detection scenarios.
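A condensed sketch of this cross-dataset protocol, reusing the hypothetical `build_model`, `fine_tune`, `evaluate`, and `preprocess` helpers from the Section 4.1 sketches; the dataset paths are illustrative:

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical paths; both datasets use the same preprocessing as above.
kaggle_set = datasets.ImageFolder("FloodClassificationDataset", transform=preprocess)
alley_test = datasets.ImageFolder("AlleyFloodNet_test", transform=preprocess)

# Hold out 15% of the Kaggle training data as a validation set.
gen = torch.Generator().manual_seed(42)
n_val = int(0.15 * len(kaggle_set))
kaggle_train, kaggle_val = random_split(
    kaggle_set, [len(kaggle_set) - n_val, n_val], generator=gen)

# Fine-tune on general ground-level imagery only, then evaluate on
# AlleyFloodNet's test set to quantify the domain gap.
model = build_model("convnext_large")
model = fine_tune(model,
                  DataLoader(kaggle_train, batch_size=32, shuffle=True),
                  DataLoader(kaggle_val, batch_size=32),
                  device)
cross_metrics = evaluate(model, DataLoader(alley_test, batch_size=32), device)
```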
5. Results
5.1. Comparison of Model Performance: Accuracy, Precision, Recall, and F1 Score
This study evaluates the performance of six deep learning models—AlexNet, ResNet50, VGG19, Vision Transformer (ViT), DenseNet121, and ConvNeXt-Large—for flood detection using the AlleyFloodNet dataset. The model performances were quantitatively assessed using accuracy, precision, recall, and F1 score, and the detailed results are presented in Table 1.
The results indicate that the ConvNeXt-Large model achieved the highest accuracy (0.9596), recall (0.9767), and F1 score (0.9655). The VGG19 and DenseNet121 models also showed strong performances, both achieving an accuracy of 0.9466 and high precision and recall values. The Vision Transformer (ViT) exhibited competitive performance with an accuracy of 0.9313 and notably high recall (0.9612), though with relatively lower precision (0.9051). Similarly, ResNet50 achieved an accuracy of 0.9275, showing balanced precision (0.9435) and recall (0.9070). AlexNet exhibited comparatively lower overall performance, with an accuracy of 0.9046 and precision of 0.8611, though it maintained high recall (0.9612).
Additionally, training and validation loss/accuracy curves are presented in Figure 4, illustrating rapid convergence and stable validation trends. The confusion matrix of the ConvNeXt-Large model (Figure 5) further demonstrates the model’s robustness, correctly classifying the majority of flood and non-flood images with minimal misclassifications.
5.2. Results of Average Fine-Tuning Time per Epoch and Single Image Inference Time
To evaluate the training speed and real-time applicability of each model, we measured the average fine-tuning time per epoch and the single image inference time. AlexNet and ResNet50 exhibited the fastest training speeds, with average fine-tuning times of approximately 4.73 and 5.58 s per epoch, respectively. In contrast, ConvNeXt-Large required the longest training time at 18.25 s per epoch. Regarding single image inference speed, AlexNet demonstrated the fastest inference at 0.001459 s, followed by ViT with a relatively fast inference time of 0.008279 s. Conversely, DenseNet121 showed the slowest inference speed at 0.100600 s, and ConvNeXt-Large also exhibited a relatively high computational cost with an inference time of 0.012291 s. Thus, although the ConvNeXt-Large model achieved the highest overall performance, it showed limitations in terms of training time and inference speed.
Table 2 compares the average fine-tuning time per epoch and single image inference time.
5.3. Results of Comparative Analysis of Flood Datasets and Misclassification Patterns
Models fine-tuned using AlleyFloodNet generally demonstrated high performance, with ConvNeXt-Large achieving the most outstanding results (accuracy: 0.9596, F1 score: 0.9655). In contrast, the model fine-tuned on the Kaggle Flood Classification Dataset and subsequently evaluated on the AlleyFloodNet test set exhibited a notable decline in performance (accuracy: 0.5611, precision: 0.6750, recall: 0.2093, F1 score: 0.3195). This indicates a significant domain discrepancy between the general urban flood images in the Kaggle dataset and the localized alleyways, semi-basements, and lowland environments specifically targeted by AlleyFloodNet. In other words, general datasets capturing typical urban flood scenes, such as the Kaggle Flood Classification Dataset, may not be sufficient for effectively detecting floods in economically and environmentally vulnerable micro-urban areas. Thus, specialized datasets like AlleyFloodNet are critical for improving detection accuracy in such contexts.
Misclassifications primarily occurred in environments with very low illumination, where even high-performance models struggled to achieve accurate classification due to insufficient visual information. In particular, when dim or distant streetlights illuminated only parts of the scene, the model had difficulty distinguishing between wet surfaces reflecting light and actual flooded areas. Thus, despite the current dataset encompassing diverse environments, its limitation in adequately representing low-light or nighttime flooding scenarios has become apparent. To address these issues, future studies should consider enhancing model robustness through increased collection of nighttime images, data augmentation simulating low-light conditions, or the incorporation of multimodal information such as thermal or infrared imagery.
6. Discussion
In this study, we fine-tuned multiple deep learning models—AlexNet, ResNet50, VGG19, Vision Transformer (ViT), DenseNet121, and ConvNeXt-Large—using the AlleyFloodNet dataset and evaluated their flood detection performance, including training time per epoch and inference speed per image. ConvNeXt-Large demonstrated superior performance across all evaluation metrics, achieving the highest accuracy (0.9596), precision (0.9545), recall (0.9767), and F1 score (0.9655). By assessing both detection accuracy and computational efficiency (training and inference times), this study provides empirical evidence to assist in selecting the most appropriate model for real-time flood detection systems in economically vulnerable urban areas.
Additionally, we conducted comparative experiments using the Kaggle Flood Classification Dataset, another ground-level flood image dataset, to evaluate the unique effectiveness of AlleyFloodNet. The model fine-tuned on the Kaggle dataset and subsequently tested on AlleyFloodNet exhibited significantly reduced performance (accuracy: 0.5611, precision: 0.6750, recall: 0.2093, F1 score: 0.3195). This notable drop in performance underscores the domain discrepancy between general urban flood scenarios in the Kaggle dataset and the localized alleyways, semi-basements, and lowlands targeted by AlleyFloodNet. Thus, it highlights the importance of specialized datasets such as AlleyFloodNet for accurately detecting floods in highly localized urban environments.
Figure 6 visually compares the experimental results obtained using the Kaggle Flood Classification Dataset and AlleyFloodNet, clearly demonstrating the superior performance and effectiveness of AlleyFloodNet in detecting floods within localized urban environments.
Moreover, analysis of misclassified images revealed critical insights into challenging scenarios for flood detection. Most misclassifications occurred in images with ambiguous visual cues—such as partially submerged ground or reflections—and under challenging conditions like low lighting, shadows, or visually unclear boundaries between flooded and non-flooded areas. These findings indicate that future research should specifically address these challenging conditions, potentially through targeted data augmentation techniques or enhanced model architectures designed for robust performance under ambiguous and low-light scenarios.
The practical implications of these findings are significant. Deep learning models fine-tuned using AlleyFloodNet could be integrated into automated flood monitoring systems based on CCTV or smartphone footage. Such systems could continuously monitor economically vulnerable urban areas—including narrow alleyways, semi-basement residences, and lowlands—and rapidly classify images to detect flooding events. Upon detecting flood conditions, these systems could promptly issue location-specific alerts to emergency services and residents, facilitating timely evacuations and targeted disaster responses, thereby enhancing public safety and resilience.
The AlleyFloodNet dataset was constructed by gathering diverse ground-level flood images captured from various global regions through publicly accessible platforms. However, due to practical constraints, collecting data in flood-vulnerable urban areas under diverse weather and rainfall conditions remains challenging. Thus, future research should aim to expand the dataset by incorporating more flood images captured under various rainfall intensities and environmental conditions. This would improve the dataset’s robustness and generalizability, ultimately enhancing the effectiveness of flood detection models in real-world scenarios. Moreover, given the observed differences in training and inference speeds across models, future studies should also investigate lightweight or hybrid model architectures that can deliver high accuracy with reduced computational demands. Such advancements would optimize real-time applicability and enhance the effectiveness of flood detection systems deployed in disaster-prone urban environments.
7. Conclusions
The significance of this study lies in the construction of a specialized, close-range image dataset—AlleyFloodNet—which precisely reflects flood-vulnerable urban areas, including narrow alleyways, semi-basement residences, and lowlands. Unlike conventional satellite- or UAV-based datasets that provide distant-view images, AlleyFloodNet comprises ground-level images collected globally through publicly accessible platforms, specifically designed to effectively capture flood patterns in structurally complex urban environments. The experimental validation conducted with multiple deep learning models demonstrated the dataset’s superior performance in accurately classifying floods in these economically vulnerable urban settings.
Moreover, comparative experiments using the Kaggle Flood Classification Dataset highlighted AlleyFloodNet’s unique strengths. The significant performance gap observed when models fine-tuned on general urban flood imagery were evaluated against AlleyFloodNet underscores the necessity and value of region-specific datasets for effective urban flood detection. Furthermore, analysis of misclassified images provided critical insights into the visual ambiguity and environmental challenges associated with accurate flood detection, emphasizing the need for targeted enhancements such as specialized data augmentation and robust model architectures.
Beyond purely technical advancements, this study holds important societal implications by specifically targeting densely populated areas inhabited by economically vulnerable groups, repeatedly affected by flooding due to climate change. The fine-tuned models utilizing AlleyFloodNet offer a practical foundation for integrating automated real-time flood monitoring systems that use CCTV or smartphone footage to promptly issue location-specific alerts, thereby protecting lives and property.
Future research will focus on expanding AlleyFloodNet by incorporating additional ground-level images captured under diverse rainfall intensities, lighting conditions, and environmental contexts, further enhancing its robustness and generalizability. Additionally, the development of lightweight or hybrid deep learning models, combined with model compression and computational optimization techniques, will be essential to achieving real-time detection capabilities suitable for practical disaster response scenarios. Building upon these findings, subsequent research can develop integrated flood response platforms capable of flood depth prediction, risk visualization, and evacuation route guidance, substantially improving the effectiveness and field applicability of disaster response technologies. Ultimately, these efforts will contribute significantly toward protecting vulnerable populations and enhancing urban resilience against flood disasters intensified by climate change.