1. Introduction
The relationship between human society and water resources dates back to the dawn of civilization. Water, a vital element for sustaining life, drives numerous sectors, from agriculture and food production to transportation and energy generation. Although water covers approximately 71% of the Earth’s surface, fresh water is scarce, representing only approximately 3% of the planet’s total water volume. This irreplaceable resource is essential for human consumption, basic health, and human dignity. It is also a fundamental pillar of climate dynamics, the preservation of rivers and oceans, and the development of rich biodiversity [1].
However, human action drives growing challenges. Continuous global population growth and urban expansion intensify pressure on water resources, resulting in the degradation of water quality and aquatic ecosystems. Water scarcity thus emerges as one of the most pressing global environmental concerns. This scenario is aggravated by practices such as the disorderly occupation of riverbanks, the removal of riparian vegetation, and improper waste disposal. It is intensified by ineffective effluent treatment and the presence of emerging substances, such as microplastics and pharmaceuticals, which compromise ecosystem health and human health [2,3,4,5].
The scenario is particularly critical in regions with precarious sanitation infrastructure, where vulnerable populations depend on polluted water sources for their sustenance, increasing the risk of disease. The United Nations World Water Development Report (UNESCO World Water Assessment Program, 2024) warns that more than 80% of the world’s wastewater is discharged into the environment without adequate treatment [6]. In Brazil, more than 110,000 km of rivers have compromised water quality, and along 83,450 km water abstraction for public supply is unfeasible due to pollution [7]. The quality of river water exerts a direct and far-reaching impact on food production, energy generation, public health, and economic dynamics. The UN’s Agenda 2030, through Sustainable Development Goal 6 (SDG 6), establishes access to quality water and basic sanitation as a crucial target [8].
Rivers can be affected by a wide range of pollutants. Although this work focuses on visible pollutants, it acknowledges that rivers are also affected by organic pollutants (such as biodegradable organic matter), inorganic pollutants (including heavy metals and nutrients), and thermal pollution [9,10,11,12]. Among these, plastic pollutants deserve attention due to their persistence and fragmentation into microplastics, a growing concern for their impacts on the food chain and health [13,14]. Visible litter encompasses human-discarded materials that float or accumulate on riverbanks and riverbeds, such as plastics (bottles, bags, foam), assorted packaging (paper, metal, glass), organic material, textiles, rubber/tires, construction materials, and appliances [9,14,15,16,17,18,19]. Their sources are predominantly anthropogenic, including direct disposal by populations, irregular dumping of urban waste carried by rain, agricultural and rural waste, transport during flood events, and litter left behind by tourists [15,20,21,22,23].
Therefore, early detection and continuous monitoring of water pollution, with a particular focus on visible litter, are imperative for protecting public health, ensuring food and energy security, and maintaining economic stability. However, traditional monitoring methods have limitations, including high costs, long turnaround times, and limited coverage, which hinder the rapid detection of pollutants and effective response to pollution events [24]. Effective management of water resources and environmental preservation therefore requires innovative, more efficient approaches to assess pollution levels precisely [25].
In this context, Computer Vision approaches emerge as a promising strategy for environmental monitoring, as they enable the identification of visual patterns in water bodies with greater agility, scope, and frequency than conventional methods [26]. Such approaches rely on Artificial Intelligence (AI) algorithms and machine learning (ML) techniques that learn to recognize patterns and complex correlations in image data.
Among the models widely used in computer vision tasks, Convolutional Neural Networks (CNNs) stand out. Inspired by the organization of the visual cortex in animals, CNNs are deep neural network architectures designed to process grid-topology data, such as images [26]. However, although they are powerful models, they present limitations when applied in contexts with scarce computational, energy, and connectivity resources. Such conditions are common in environmental monitoring operations, especially in remote or hard-to-reach areas and in areas with limited infrastructure.
Faced with such limitations, low-complexity models have proven to be a viable alternative to traditional CNN architectures [
27], since they offer satisfactory performance in detecting visual patterns while requiring fewer computational resources. Due to their light weight and efficiency, these models have the potential to be embedded in autonomous, low-cost devices, such as Unmanned Aerial Vehicles (UAVs) and remote sensing stations. In this sense, the use of these models in environmental monitoring can enable field applications with greater autonomy and scalability, as well as serve remote and hard-to-reach areas.
Therefore, the main objective of this study is to develop and evaluate a CNN-based model for the automated detection of visible litter in water bodies through river surface image analysis, providing a support tool for water pollution monitoring and environmental management. Specifically, this work aims to: (i) build and organize a labeled image dataset from multiple remote sensing sources; (ii) implement and train a binary classification model using Transfer Learning with the MobileNetV2 architecture; and (iii) evaluate model performance using standard classification metrics, demonstrating the effectiveness and applicability of the proposed solution.
The main contributions and novel aspects of this work can be summarized as follows: (i) the construction of a curated, multi-source labeled dataset for visible litter detection in river environments, combining manually collected images under Creative Commons licensing with a complementary annotated dataset from the Roboflow platform; (ii) the design and evaluation of a lightweight binary classification pipeline based on MobileNetV2 with transfer learning, specifically tailored for deployment in resource-constrained environmental monitoring scenarios; (iii) a rigorous statistical evaluation protocol comprising 20 independent training runs, providing robust estimates of mean performance and variability rather than single-run results; and (iv) the development and public deployment of an interactive web application that demonstrates the practical transition of the trained model into an accessible environmental monitoring tool, including GPS-based geolocation of analyzed images.
2. Materials and Methods
This section describes the methodology used to develop the automated detection model for visible litter in rivers. The computational environment, software tools used, data acquisition and augmentation process, the CNN model architecture based on Transfer Learning, training procedures, and performance evaluation metrics are detailed. It should be noted that the design of this study was influenced by previous studies demonstrating the potential of Convolutional Neural Networks for environmental image classification tasks. In particular, the thesis by Araújo [28] contributed to the understanding of the theoretical and architectural foundations of CNNs and guided the methodological choices adopted here.
2.1. Computational Environment and Tools
For the development and experiments, a Dell notebook (Dell Technologies Inc., Round Rock, TX, USA) equipped with an Intel Core i7 processor (Intel Corporation, Santa Clara, CA, USA), 16 GB of RAM, and a 512 GB SSD was used. The integrated Intel Iris Xe Graphics card powered the graphical environment. The operating system used was Ubuntu 22.04 LTS. The development was conducted in Python 3.11 using the Jupyter Notebook (version 7.1.3) interactive environment, which facilitated experimentation and visualization of intermediate results. The main machine learning and computer vision libraries employed included:
TensorFlow/Keras (versions 2.19.0/3.9.2): Used for the construction, training, and evaluation of the Convolutional Neural Network, providing a high-level API for Deep Learning.
NumPy (version 1.26.0): Fundamental for efficient numerical operations, especially in the handling of arrays and image data matrices.
OpenCV (version 4.11.0.86) and Pillow (PIL) (version 11.2.1): Used for loading, manipulating, and saving images, supporting various preprocessing operations.
Scikit-learn (version 1.4.2): Used for the calculation of evaluation metrics, such as the classification report and confusion matrix.
Matplotlib (version 3.10.0) and Seaborn (version 0.13.2): For visualization of results, including performance graphs and confusion matrices.
2.2. Data Acquisition and Preparation
Creating a robust dataset is fundamental for the effective training of machine learning models. In this study, river images were classified into two categories: polluted, indicating the presence of litter visible to the naked human eye, and not polluted, characterized by the absence of such elements. The complete process of data acquisition, preprocessing, and consolidation is illustrated in the flowchart in Figure 1.
Initially, the dataset was constructed from diverse sources, with the following methodology:
Manually Collected and Selected Images: A significant portion of the images was obtained through searches on platforms with Creative Commons licenses (mainly via Google Images) and other public datasets. For these images, both the ‘polluted’ and ‘not polluted’ classes underwent rigorous manual selection and separation into training, validation, and test subsets. This curation process ensured that each image was correctly categorized and distributed according to the initial proportions of 70% for training, 15% for validation, and 15% for testing, represented in the ‘Initial Organization’ phase of the flowchart.
Complementary Roboflow Dataset: To enrich and expand the database of polluted river images, a previously curated dataset from the Roboflow platform was incorporated. This dataset contributed 1887 images for training (with Data Augmentation already applied), 116 for validation, and 61 for testing, and was integrated during the ‘Dataset Consolidation’ phase. Roboflow is widely recognized for providing annotated databases for computer vision applications.
Regarding dataset quality and integrity, the following controls were applied during data preparation. Near-duplicate images were manually inspected and removed during the curation process; images that were visually identical or near-identical were excluded to prevent data leakage between subsets. A scene-level separation strategy was applied: images originating from the same river location or photographic session were assigned exclusively to a single subset (training, validation, or test) to avoid information overlap between splits. The Roboflow dataset used a pre-defined split that was maintained as provided, and its images were not cross-mixed with the manually curated subsets. While the dataset size is modest—a characteristic common in applied environmental monitoring research—the statistical evaluation across 20 independent runs provides robust performance estimates and reduces the risk of conclusions being sensitive to a single train/test split.
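The scene-level separation described above was carried out manually in this study. As an illustration only, the same rule (every image from one scene goes to exactly one subset) can be sketched in plain Python. The function name `split_by_scene`, the scene labels, and the greedy balancing rule are assumptions made for this example and are not taken from the original pipeline.

```python
import random
from collections import defaultdict


def split_by_scene(image_to_scene, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole scenes to train/val/test so no scene spans two subsets."""
    # Group image names by the scene (river location or photo session) they came from.
    scenes = defaultdict(list)
    for image, scene in image_to_scene.items():
        scenes[scene].append(image)

    # Shuffle scene identifiers reproducibly.
    scene_ids = sorted(scenes)
    random.Random(seed).shuffle(scene_ids)

    names = ["train", "val", "test"]
    total = len(image_to_scene)
    targets = dict(zip(names, (f * total for f in fractions)))
    subsets = {name: [] for name in names}

    for scene_id in scene_ids:
        # Greedy rule (an assumption): place the scene in the subset that is
        # currently furthest below its target size.
        name = max(names, key=lambda n: targets[n] - len(subsets[n]))
        subsets[name].extend(scenes[scene_id])
    return subsets


# Hypothetical example: 40 images taken in 10 scenes of 4 images each.
mapping = {f"img_{i:03d}.jpg": f"scene_{i // 4}" for i in range(40)}
subsets = split_by_scene(mapping)
print({name: len(images) for name, images in subsets.items()})
```

Because scenes are assigned as indivisible units, the resulting proportions only approximate 70/15/15 when scenes are large relative to the dataset.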
All images in the initial set were resized to 224 × 224 pixels and converted to RGB format to ensure compatibility with the model architecture.
It is important to highlight that the model developed in this work operates exclusively on static images (photos) and performs detection offline. The current methodology was not designed for video stream processing or real-time operation; instead, it focuses on the analysis of previously collected photographic data.
Data Augmentation
Given that limited datasets can lead to overfitting, data augmentation techniques were applied to expand the training set artificially and, in some cases, to balance the validation and test subsets. This technique generates new images from the originals through geometric and color transformations, promoting greater diversity and improved model generalization.
For this purpose, two data augmentation generators based on the Keras library were defined:
datagen_train: Applied exclusively to the training set, with intense transformations to increase model robustness. This included zoom, rotation, horizontal and vertical shifts, horizontal flip, brightness variation, and filling of empty regions with constant color.
datagen_val_test: Used in the validation and test sets, with lighter transformations, simulating real acquisition variations without compromising data integrity for evaluation. Includes pixel value normalization and small variations in brightness and orientation.
The augmentation strategy was differentiated for each class and subset, aiming to optimize dataset representativeness and balance:
For the ‘polluted’ class, data augmentation was applied exclusively to its training set. For each original image, two augmented versions were generated.
For the ‘not polluted’ class, data augmentation was applied to all its folders (training, validation, and test) because the validation and test sets initially contained fewer samples than the ‘polluted’ class. Nine augmented versions of each original image were generated for training, and a single augmented copy of each original image was generated for the validation and test sets.
The transformations applied to the validation and test data for the ‘not polluted’ class were limited to brightness changes and horizontal flips to maintain fidelity to the expected variations in real environments.
It is important to clarify the rationale for applying limited data augmentation to the validation and test subsets of the ‘not polluted’ class. This decision was driven exclusively by class imbalance: the ‘not polluted’ category initially contained approximately 80 original images in each evaluation subset, substantially fewer than the approximately 170 ‘polluted’ samples. To address this, a single augmented copy was generated deterministically for each original ‘not polluted’ image in the validation and test subsets—resulting in 160 images per subset—through two conservative transformations: a horizontal flip and a brightness adjustment within the range [0.9, 1.1]. These transformations simulate plausible real-world acquisition variability (e.g., variable illumination and camera orientation) without synthesizing novel visual content. Critically, the augmentation was deterministic and one-to-one: each original image produced exactly one fixed augmented counterpart, preserving the fundamental visual distribution of the subset. No augmentation was applied to the ‘polluted’ class in the evaluation subsets. The authors acknowledge this as a methodological limitation; future work should validate performance on fully unaugmented evaluation sets.
After this augmentation phase, the subsets of ‘polluted’ images (already augmented in training) were consolidated with the corresponding data from the Roboflow dataset (training, validation, and test). Additionally, for the test subset of the ‘polluted’ class, 7 more images were selected, bringing the total to approximately 170 samples for testing in this class. The directories were organized to reflect this final structure, and the generators were configured with the flow_from_directory function to read images directly from disk, resize them to 224 × 224 pixels, and organize them into batches of 16 samples, optimizing data flow during training.
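A minimal sketch of how two such Keras generators might be configured is given below. Only the image size (224 × 224), the batch size (16), the use of `flow_from_directory`, and the evaluation brightness range [0.9, 1.1] come from the text; every other numeric value, the `rescale` normalization, the directory arguments, and the helper name `make_generators` are assumptions for illustration. Note also that `ImageDataGenerator` applies its transformations randomly, so the deterministic one-to-one evaluation copies described above would require a fixed transformation written to disk rather than this on-the-fly configuration.

```python
# Training-time transformations named in the text; the magnitudes are assumed.
TRAIN_AUG = dict(
    rescale=1.0 / 255,          # assumed normalization
    zoom_range=0.2,             # assumed
    rotation_range=20,          # assumed, in degrees
    width_shift_range=0.1,      # assumed horizontal shift
    height_shift_range=0.1,     # assumed vertical shift
    horizontal_flip=True,
    brightness_range=(0.8, 1.2),  # assumed
    fill_mode="constant",       # fill empty regions with a constant color
    cval=0.0,                   # assumed fill value
)

# Lighter transformations for validation/test; brightness range is from the text.
EVAL_AUG = dict(
    rescale=1.0 / 255,          # assumed normalization
    horizontal_flip=True,
    brightness_range=(0.9, 1.1),
)

# Reading parameters stated in the text.
FLOW = dict(target_size=(224, 224), batch_size=16, class_mode="binary")


def make_generators(train_dir, val_dir):
    """Build directory iterators; TensorFlow is imported only when called."""
    import tensorflow as tf

    image = tf.keras.preprocessing.image
    datagen_train = image.ImageDataGenerator(**TRAIN_AUG)
    datagen_val_test = image.ImageDataGenerator(**EVAL_AUG)
    train_flow = datagen_train.flow_from_directory(train_dir, shuffle=True, **FLOW)
    val_flow = datagen_val_test.flow_from_directory(val_dir, shuffle=False, **FLOW)
    return train_flow, val_flow
```

Each directory is expected to contain one subfolder per class, which is how `flow_from_directory` infers the binary labels.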
2.3. Model Development and Training
The classification model was developed using the Transfer Learning technique, which is effective in scenarios with limited datasets. This approach reuses knowledge previously acquired by a model trained on a large database, such as ImageNet, and applies it to a new task.
2.3.1. Model Architecture
For the development of the water pollution classification model, the MobileNetV2 architecture, pre-trained on the ImageNet dataset, was chosen as the base. MobileNetV2 is widely recognized for its low computational cost and satisfactory performance in various computer vision tasks [29,30]. Its design allows it to run efficiently on devices with limited resources, such as cell phones, drones, or embedded systems, making it well suited to applications that require agility and practical feasibility, including real-time environmental monitoring.
A crucial advantage of MobileNetV2, especially relevant for projects with smaller datasets such as this study, is its capacity to utilize transfer learning. By starting with a model pre-trained on ImageNet, the vast knowledge it acquired from millions of images is leveraged. This allows the model to generalize well to our specific task (visible litter detection), even with fewer training examples. This approach significantly reduces the need to build a large labeled dataset from scratch, which is frequently a challenge in real-world applications. The architecture also stands out for its ‘inverted residuals’ structure and the intelligent use of ‘depthwise separable convolutions’, which drastically reduce the number of parameters and operations without compromising learning capacity [
29].
2.3.2. Understanding the Essential Concepts of MobileNetV2
To understand the efficiency of MobileNetV2, it is useful to understand its main building blocks:
Depthwise Separable Convolutions: Unlike standard convolutions that combine filtering and input aggregation in a single step, depthwise separable convolutions divide this process into two phases. First, a ‘depthwise convolution’ applies a single filter to each input channel. Then, a ‘pointwise convolution’ (a 1 × 1 convolution) combines the outputs of the depthwise convolution. This separation drastically reduces computational cost and the number of parameters.
Inverted Residuals: Traditional residual blocks (such as the bottleneck blocks in ResNet) follow a wide, narrow, wide pattern: the shortcut connects the high-dimensional layers, and the bottleneck that reduces dimensionality sits between them. MobileNetV2 inverts this logic. Each block first expands the low-dimensional input to a larger dimension, applies a depthwise convolution, and then projects the result back to a smaller dimension (the bottleneck); the shortcut connects these narrow bottlenecks. This ‘inverted’ approach, with wider intermediate layers, improves information flow and enables linear bottlenecks, which are crucial for performance.
Linear Bottlenecks: The final projection of each block, which produces the bottleneck passed on to the next block, uses a linear activation rather than a non-linear one such as ReLU. This prevents the information loss that non-linear activations can cause in low-dimensional spaces.
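To make the saving from depthwise separable convolutions concrete, the short calculation below compares weight counts for a single layer. The layer shape (3 × 3 kernel, 32 input channels, 64 output channels) is a hypothetical example rather than a specific MobileNetV2 layer, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 32, 64  # hypothetical layer shape
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 18432 2336 7.89
```

For this shape the separable form needs roughly one eighth of the weights, and the number of multiply-accumulate operations per output position shrinks by the same factor.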
2.3.3. Application of MobileNetV2 in This Work
In this project, the layers of the MobileNetV2 base were kept frozen. This means their weights were not updated during training. This approach preserves the general feature extraction patterns previously learned from ImageNet, reducing the risk of overfitting on our new dataset. On this frozen base, custom layers were added, forming the ‘head’ of the model, specifically designed for the binary classification task:
Global Average Pooling layer: Responsible for reducing the dimensionality of the base model output to a fixed vector, preparing the data for the dense layers.
Two Dropout layers with a rate of 30% were employed to mitigate overfitting by randomly deactivating a portion of neurons during training, forcing the model to learn more robust representations.
A dense layer with 64 neurons and ReLU activation function was added, designed to extract more complex and high-level patterns from the features previously extracted by MobileNetV2.
An output layer with a single neuron and Sigmoid activation function was also added, appropriate for binary classification tasks, where the output represents the probability of the image belonging to the ‘polluted’ class. A schematic representation of the complete architecture, including the frozen MobileNetV2 base and the custom classification head, is presented in
Figure 2.
The unique contribution of the proposed pipeline lies in the combination of three design decisions tailored to the environmental monitoring context: (i) the use of a fully frozen MobileNetV2 base, which maximizes knowledge transfer from ImageNet while preventing catastrophic forgetting on the limited domain-specific dataset; (ii) the incorporation of two Dropout layers (rate = 0.30) between the feature extraction backbone and the classification head, providing regularization without architectural complexity; and (iii) the use of class weights during training to compensate for the imbalance between ‘polluted’ and ‘not polluted’ samples. The key hyperparameters were determined as follows: the learning rate of 0.0001 was selected empirically to ensure stable convergence; the batch size of 16 was chosen for computational efficiency on the available hardware; the 64-neuron dense layer was determined through preliminary experimentation as the minimal configuration capable of capturing the discriminative features of visible litter; and the Dropout rate of 0.30 is consistent with common practice in transfer learning.
The model was compiled using the Adam optimizer, configured with a learning rate of 0.0001. The loss function adopted for optimization was Binary Cross-Entropy, and the main metric used for monitoring performance during training was accuracy.
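The architecture and compilation settings described in this section can be sketched in Keras as follows. This is an illustrative reconstruction, not the authors' code: the exact position of the two Dropout layers, the use of `Sequential`, and the names `build_model` and `head_trainable_params` are assumptions. The value 1280 is the channel count of MobileNetV2's final feature map at the default width multiplier.

```python
IMG_SHAPE = (224, 224, 3)
FEATURES = 1280        # MobileNetV2 output channels (width multiplier 1.0)
DENSE_UNITS = 64
DROPOUT_RATE = 0.30
LEARNING_RATE = 1e-4


def head_trainable_params(features=FEATURES, units=DENSE_UNITS):
    """Trainable weights when the base is frozen: dense layer plus output neuron."""
    dense = features * units + units   # weights + biases
    output = units + 1                 # single sigmoid neuron
    return dense + output


def build_model(weights="imagenet"):
    """Frozen MobileNetV2 base with a small binary head (TensorFlow imported lazily)."""
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=IMG_SHAPE, include_top=False, weights=weights
    )
    base.trainable = False  # keep ImageNet features fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(DROPOUT_RATE),   # assumed placement
        tf.keras.layers.Dense(DENSE_UNITS, activation="relu"),
        tf.keras.layers.Dropout(DROPOUT_RATE),   # assumed placement
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


print(head_trainable_params())  # 82049
```

Under these assumptions only about 82 thousand weights are updated during training, which helps explain why a modest dataset can be sufficient.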
2.4. Model Training
To ensure the robustness of the results, training was run 20 times independently. In each run, the model was trained for 20 epochs using the training data, with validation performed on a separate set.
Considering the class imbalance in the training set, class weights were applied, calculated inversely proportional to sample frequency. This technique aims to assign greater relevance to the minority class, favoring more balanced learning.
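One common way to obtain weights that are inversely proportional to class frequency is shown below. The normalization used (total samples divided by the number of classes times the class count) is the 'balanced' convention and is an assumption here, since the text does not state the exact formula. The sample counts are hypothetical and do not indicate which class is the minority in the actual training set.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical training counts keyed by class index (0 and 1).
counts = {0: 1000, 1: 3000}
weights = inverse_frequency_weights(counts)
print(weights)
```

A dictionary of this form, keyed by integer class index, is what the `class_weight` argument of Keras `Model.fit` accepts.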
The accuracy and loss histories for both training and validation were recorded across all runs for subsequent analysis. The model that achieved the best F1-Score on the test set was saved as the final model in .h5 format, which allows it to be loaded later without retraining and reused in future applications or production environments. This representative model was used exclusively for the qualitative analysis presented in Section 3.3; the aggregate metrics in Table 1 reflect the mean performance across all 20 independent runs.
The model output corresponds to the probability of the positive class (‘polluted’). For evaluation purposes, a fixed threshold of 0.5 was adopted: predictions with a value equal to or greater than this limit were considered ‘polluted’; values below were classified as ‘not polluted’.
2.5. Performance Evaluation
The performance of each trained model was evaluated on the test set, composed of images not used during training. For this evaluation, the following classification metrics were employed:
Accuracy: Total proportion of correct predictions, considering both positive and negative cases.
Precision: Proportion of images predicted as ‘polluted’ that are actually polluted; high precision means few false positives.
Recall (Sensitivity): Proportion of actually polluted images that the model identifies as ‘polluted’; high recall means few false negatives.
F1-Score: Harmonic mean between precision and recall, useful especially in contexts with class imbalance or when it is necessary to balance the impacts of false positives and false negatives.
Confusion Matrix: Tabular representation of model performance, indicating true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Classification Report: Detailed summary of precision, recall, and F1-Score metrics for each class, plus support (number of samples per class).
The detailed results of each of the 20 runs, including training, validation, and test metrics, were stored in a text file. After the runs were completed, a summary file was generated containing the means and standard deviations of all calculated metrics to provide a statistically more robust analysis of the overall model performance.
Additionally, the training histories and confusion matrices from all runs were saved as .npy files, enabling complementary analyses and visualizations, such as boxplots, learning curves, ROC curves, and the area under the curve (AUC).
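The aggregation step can be illustrated with the standard library alone. The per-run values below are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def summarize(values):
    """Return the mean and sample standard deviation of per-run metric values."""
    return mean(values), stdev(values)


# Hypothetical test accuracies from five runs (the study used 20).
runs = [0.88, 0.90, 0.91, 0.89, 0.92]
m, s = summarize(runs)
print(f"{m:.4f} ± {s:.4f}")
```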
The source code used in all steps described in this work is available at [31].
3. Results
This section presents the results obtained to evaluate the performance of the convolutional neural network model and the interface developed for visualizing the classifications performed.
3.1. Consolidated Metrics from 20 Runs
Table 1 presents the means and standard deviations of the main metrics across 20 independent runs, with variations in weight initializations and data order, to obtain statistical estimates of the model’s mean performance and to enable analysis of its variability across runs.
Figure 3 provides a visual overview of the distribution of these metrics through boxplots.
The results indicate that the model presents stable performance. The mean accuracy on the training set was 91.53% with a very low standard deviation (±0.27%), evidencing consistent learning. On the validation set, accuracy decreased to 86.92%, with greater variation (±1.69%), as expected, since this set simulates exposure to unseen data.
The mean accuracy on the test set was 89.70%, with a mean F1-score of 90.96%, which reinforces the model’s generalization capacity. The differences between training, validation, and test metrics are small, indicating that the model does not exhibit significant overfitting.
Loss values reinforce this good performance: the mean loss was 0.2132 in training and 0.2918 in validation. The stability of these values across the 20 runs is corroborated by the boxplots in Figure 3, which show narrow boxes and aligned medians, suggesting predictable behavior despite random variation.
In general, the results of this stage indicate a model that combines high performance, low variability, and robustness to different initializations—characteristics that are essential for real-world automated monitoring applications, where consistency is as important as accuracy.
3.2. Convergence Analysis
The learning curves, which detail model performance over epochs, are presented in
Figure 4. This figure consolidates the accuracy and loss graphs for the training and validation sets, displaying both the means across 20 runs and the performance of the best-selected model.
As shown in
Figure 4, the model demonstrates stable convergence. Training accuracy increases steadily, while validation accuracy closely tracks this growth, with a small, controlled difference. This behavior is indicative of a well-adjusted model that effectively generalizes to unseen data. In the loss graph, a continuous reduction in error is evident in both training and validation, with smooth curves and no abrupt fluctuations. The stability of Val Loss at the end of training, without significant growth, corroborates the absence of overfitting. The curves of the best model, shown alongside the means of the 20 runs, confirm that this model was representative of the overall performance.
Regarding the choice of 20 training epochs, this stopping criterion was determined based on preliminary convergence analysis: beyond epoch 15–18, validation loss stabilized without further significant reduction, and both accuracy and loss curves showed plateau behavior across the majority of runs. The variability observed in the best model’s validation curve reflects the stochastic nature of weight initialization and mini-batch sampling; however, the validation accuracy of the selected model falls within ±2% of the mean across all 20 runs (mean: 86.92% ± 1.69%), confirming it is representative of the overall training distribution rather than an outlier.
3.3. Performance of the Best Selected Model
The model with the highest F1-Score on the test set was subjected to a more detailed analysis. It is acknowledged that this selection criterion constitutes a form of test-set-informed model selection for the purpose of qualitative illustration; however, the primary performance results reported in
Table 1 are based on aggregate statistics across all 20 independent runs and are therefore not subject to this selection bias. The test set was not used during any stage of model training or hyperparameter tuning. The results are presented through the confusion matrix, classification report, and evaluation curves.
3.3.1. Confusion Matrix and Classification Report
For an in-depth understanding of the model’s performance in relation to specific hits and misses for each class, the Confusion Matrix (
Figure 5) and the Classification Report (
Table 2) were analyzed.
Figure 5 reveals that the model correctly classified all 171 images of the ‘polluted’ class. Of the 160 ‘not polluted’ images, 137 were correctly identified, while 23 were erroneously labeled as polluted (false positives). This characteristic, where there are no false negatives for the polluted class, is desirable in environmental surveillance systems that prioritize threat detection.
The classification report (
Table 2) confirms and complements these findings, demonstrating the balance between the model’s precision and sensitivity. The highlight is the recall of 1.00 for the ‘polluted’ class, indicating that no cases of pollution were missed—a crucial characteristic in environmental surveillance systems.
The 23 false positives indicate that the model adopts a slightly conservative behavior, which, in this context, can be considered positive. In many environmental applications, it is preferable to investigate a false alarm than to fail to detect a real pollution event. Thus, the model’s behavior is aligned with the logic of preventive alert systems.
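As a worked check, the metrics defined in Section 2.5 can be recomputed from the counts quoted above for the best model (171 true positives, 0 false negatives, 23 false positives, 137 true negatives). The function name is illustrative, and the values are derived here from those counts; Table 2 remains the authoritative source for the reported figures.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive ('polluted') class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts quoted in the text for the best model's confusion matrix.
best = metrics_from_counts(tp=171, fp=23, fn=0, tn=137)
for name, value in best.items():
    print(f"{name}: {value:.4f}")
# precision: 0.8814, recall: 1.0000, f1: 0.9370, accuracy: 0.9305
```

The gap between recall (1.00) and precision (about 0.88) is exactly the false-alarm cost discussed above.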
3.3.2. ROC and Precision–Recall Curves
For a more sophisticated evaluation of the model’s discriminative capacity, ROC (Receiver Operating Characteristic) and Precision–Recall curves were analyzed and are displayed in
Figure 6.
The ROC curve (
Figure 6a) presents an area under the curve (AUC) of 0.996. This value, close to unity, indicates that the model is highly effective at distinguishing between the ‘polluted’ and ‘not polluted’ classes, regardless of the adopted classification threshold. The proximity of the curve to the upper left corner of the graph reinforces the model’s excellent discriminative capacity.
Concurrently, the Precision–Recall curve (
Figure 6b) also demonstrates notable performance, with an Average Precision (AP) of 0.996. This result is particularly relevant in scenarios with class imbalance, as it indicates that the model maintains very high precision even as recall approaches 1. This demonstrates a robust balance between correctly identifying the majority of positive pollution cases and minimizing false alarms.
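The ROC AUC has a simple probabilistic reading: it is the chance that a randomly chosen ‘polluted’ image receives a higher score than a randomly chosen ‘not polluted’ one. The sketch below computes it directly from that definition on hypothetical scores. It is meant to clarify the metric, not to reproduce how the reported value was obtained; scikit-learn's `roc_auc_score` computes the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = polluted) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.10]
auc = roc_auc(labels, scores)
print(round(auc, 4))  # 0.8889 (8 of 9 pairs ranked correctly)
```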
Taken together, these curves complement the analysis of previous metrics, confirming that the developed model is statistically reliable and operationally useful. Its robustness and adaptability across different thresholds make it applicable to various environmental policies, enabling strategies that prioritize either high detection sensitivity (focus on recall) or greater control over false positives (focus on precision), as highlighted by Araújo [
28].
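As a point of reference for how these two summary values are defined, the sketch below computes the area under the ROC curve (through its rank-based formulation) and Average Precision from labels and scores in plain Python. It is not the evaluation code used in this study; the function names and the four-sample toy input are illustrative assumptions.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Average Precision: sum over thresholds of (recall gain) * precision."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        tp_before = tp
        # Consume all samples tied at this score so they share one operating point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        precision = tp / (tp + fp)
        ap += (tp - tp_before) / n_pos * precision
    return ap


if __name__ == "__main__":
    # Toy labels (1 = 'polluted') and scores, for illustration only.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))                      # 0.75
    print(round(average_precision(labels, scores), 4))  # 0.8333
```

Both quantities depend only on the ordering of the scores, which is why they characterize the model independently of any single decision threshold.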
It is thus concluded that the model demonstrated excellent performance in all evaluated aspects, with high metrics, low variation between runs, good generalization capacity, and behavior that favors environmental safety. It is therefore a viable solution for automatic pollution-detection systems in water bodies.
3.4. Impact of Threshold on Classification Metrics
To characterize the model’s behavior across operating points, an exploratory analysis was conducted by applying different decision thresholds (from 0.3 to 0.9) to the network’s probabilistic output. The threshold of 0.5 was established a priori as the standard binary classification boundary for sigmoid-output classifiers and was not selected through optimization on the test set. The analysis presented here uses test set predictions for illustrative purposes, quantifying the trade-off between precision and recall across thresholds. In a production deployment scenario, threshold calibration should be performed on the validation set according to an explicit operational cost function, with the final model evaluated on a separate test set using only the pre-determined threshold.
Figure 7 illustrates the evolution of Precision, Recall, and F1-Score as a function of this threshold variation, while
Table 3 details the numerical values, including the most critical classification errors: false positives (FP) and false negatives (FN). The confusion matrices for each threshold are presented in
Figure 8.
As shown in
Figure 7, while recall remains at 100% at lower thresholds (indicating that all pollution cases were detected), it drops sharply to 0.7, accompanied by a gain in precision. The F1-score, a metric balancing precision and recall, reaches its maximum at a threshold of 0.90. Still, this high precision comes at the cost of introducing false negatives, which may be undesirable in environmental alert systems.
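The threshold sweep described in this section can be sketched as follows. This is a minimal illustration rather than the study's code: the function name, the `>=` comparison at the threshold, the 0.1 step between 0.3 and 0.9, and the toy labels and probabilities are all assumptions.

```python
from typing import Dict, List, Sequence


def sweep_thresholds(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    thresholds: Sequence[float],
) -> List[Dict[str, float]]:
    """Return precision, recall, F1, FP and FN for each decision threshold."""
    rows = []
    for thr in thresholds:
        # Assumption: a sample is labeled 'polluted' when its probability >= thr.
        y_pred = [1 if p >= thr else 0 for p in y_prob]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        rows.append(
            {"threshold": thr, "precision": precision, "recall": recall,
             "f1": f1, "fp": fp, "fn": fn}
        )
    return rows


if __name__ == "__main__":
    # Toy data for illustration only (1 = 'polluted').
    labels = [1, 1, 1, 0, 0]
    probs = [0.95, 0.80, 0.55, 0.60, 0.20]
    for row in sweep_thresholds(labels, probs, [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]):
        print(row)
```

Raising the threshold can only remove positive predictions, so false positives never increase and false negatives never decrease along the sweep; this is the trade-off the figure and table quantify.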
Among the evaluated decision thresholds, 0.50 proved particularly appropriate. It was the only threshold that combined a recall of 1.000 (i.e., zero false negatives) with a moderate number of false positives (23). The absence of false negatives is a crucial characteristic in environmental monitoring applications, where failing to detect a real pollution event is more costly than investigating a false alarm. Although other thresholds show slight improvements in certain metrics, adopting the standard value is justified by its robustness and by the adequate compromise it offers between sensitivity and specificity.
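The cost asymmetry invoked here (a missed pollution event being more costly than a false alarm) can be made explicit when calibrating the threshold on the validation set, as recommended at the start of this section. The sketch below is one possible formulation under stated assumptions: the cost weights, the function names, and the validation data are hypothetical and were not part of this study.

```python
from typing import Sequence


def total_cost(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    threshold: float,
    cost_fp: float,
    cost_fn: float,
) -> float:
    """Operational cost of a threshold: weighted count of FP and FN."""
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    return cost_fp * fp + cost_fn * fn


def calibrate_threshold(
    val_true: Sequence[int],
    val_prob: Sequence[float],
    candidates: Sequence[float],
    cost_fp: float = 1.0,   # hypothetical cost of investigating a false alarm
    cost_fn: float = 5.0,   # hypothetical cost of missing a pollution event
) -> float:
    """Pick the candidate with the lowest cost on the validation set.

    Ties are broken in favor of the lower threshold, i.e. higher sensitivity.
    """
    return min(
        candidates,
        key=lambda thr: (total_cost(val_true, val_prob, thr, cost_fp, cost_fn), thr),
    )


if __name__ == "__main__":
    # Hypothetical validation labels (1 = 'polluted') and probabilities.
    val_true = [1, 1, 1, 0, 0, 0]
    val_prob = [0.90, 0.70, 0.45, 0.60, 0.30, 0.10]
    grid = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    print(calibrate_threshold(val_true, val_prob, grid))  # 0.4
```

The selected threshold would then be fixed before the test set is evaluated, so that the reported test metrics are not influenced by the selection.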
In general, the performance metrics presented reinforce the technical feasibility of the proposed approach and validate its potential for practical water surveillance solutions, especially for remote image monitoring.
3.5. Practical Application of the Model
To demonstrate the practical applicability and potential use of the visible litter classification model in rivers, an interactive web application was developed. This application was built using the Streamlit library, an open-source Python-based tool. The choice of Streamlit is due to its ability to drastically simplify the development of user interfaces for machine learning applications, allowing direct integration of models (such as the one saved in .h5 format from TensorFlow/Keras) with a few lines of code. Its native compatibility with the Python ecosystem facilitates transitioning the trained model to an interactive, functional web environment without requiring in-depth knowledge of front-end development. The main objective of the application is to facilitate user interaction with the system, enabling the upload and analysis of new images and the immediate visualization of results in a real-world environmental monitoring scenario. The application’s operational flow, from image submission to final result display, is illustrated in
Figure 9.
As illustrated in
Figure 9, the application’s image analysis process follows a sequence of automated steps. First, the user loads the image through an uploader that supports multiple formats, including JPG, PNG, and HEIC (common on mobile devices). After upload, the image undergoes preprocessing, which includes format verification, RGB conversion, resizing (224 × 224 pixels), and pixel normalization, preparing it for inference. The preprocessed image is then passed to the trained Convolutional Neural Network, which returns a pollution probability (a value between 0 and 1). In parallel, the application attempts to extract metadata (capture date, latitude, and longitude) from the image; this information is available only if the capture device was configured to record it. Finally, the results are presented to the user: the analyzed image, the percentage probability of pollution, the binary classification (‘Polluted’ or ‘Not Polluted’), and, if available, the location displayed on an interactive map, providing geographic context. When multiple images are processed, the application also offers a general summary indicating the total number of images analyzed and the estimated mean pollution.
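The post-inference steps of this flow (binary labeling, conversion of GPS metadata to map coordinates, and the multi-image summary) can be sketched in plain Python. This is an illustration under assumptions, not the application's source code: the function names, the 0.5 threshold default, and the summary fields are hypothetical, and image loading and model inference are deliberately left out. The only external fact relied upon is that EXIF stores latitude and longitude as degrees, minutes, and seconds with a hemisphere reference (N/S, E/W).

```python
from statistics import mean
from typing import Dict, Sequence


def label_for(probability: float, threshold: float = 0.5) -> str:
    """Map the model's pollution probability to the displayed class."""
    return "Polluted" if probability >= threshold else "Not Polluted"


def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert an EXIF-style (deg, min, sec, hemisphere) value to decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern latitudes and western longitudes are negative in decimal form.
    return -value if ref.upper() in ("S", "W") else value


def summarize(probabilities: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Summary shown when several images are analyzed in one session."""
    if not probabilities:
        raise ValueError("no images were analyzed")
    return {
        "n_images": len(probabilities),
        "n_polluted": sum(1 for p in probabilities if p >= threshold),
        "mean_probability": mean(probabilities),
    }


if __name__ == "__main__":
    # Hypothetical model outputs for three uploaded images.
    probs = [0.92, 0.08, 0.61]
    print([label_for(p) for p in probs])
    print(summarize(probs))
    # Hypothetical GPS tag: 23 deg 33 min 0 sec South.
    print(dms_to_decimal(23, 33, 0, "S"))
```

Keeping these steps separate from the model call makes them easy to check in isolation, independently of the image library or inference backend in use.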
The user interface was designed to be clear and functional, with customized design elements through integrated CSS.
Figure 10 displays the initial screen, where the user can upload an image for analysis.
After processing by the model, the results are presented in a clear, visual format.
Figure 11 illustrates an example prediction: a river image classified as ‘Not Polluted’ and another image classified as ‘Polluted’, highlighting the calculated probability and, when available, the image’s geographic location on a map.
This web application prototyping approach facilitates understanding of the model’s functionality and demonstrates the transition of research results into a practical and accessible tool. The application was designed to be responsive, allowing it to be used on various devices such as smartphones, tablets, and desktop computers. This interactive tool, accessible online [
32], allows users to upload images and automatically receive the network’s prediction, promoting transparency and facilitating experimentation with the developed solution. Although this web application is geared towards direct use on conventional devices and processes static, user-uploaded images rather than live streams, applying the model in contexts such as drones or other embedded systems for real-time inference would require specific adaptations to its integration and data collection.
3.6. Comparison with Related Work
Although a direct numerical benchmark against other architectures was not performed in this study, the performance achieved here is consistent with results reported for MobileNetV2 in related environmental and waste-classification image tasks. In a study classifying plastic waste using transfer learning, MobileNetV2 achieved an accuracy of 97.12%, with precision of 96.31%, recall of 92.69%, and an F1-score of 94.26% [
33], reinforcing the architecture’s suitability for fine-grained visual classification tasks involving environmental waste. In a separate cross-domain benchmark involving marine plastic detection, MobileNetV2 was reported to achieve the strongest cross-domain F1-score among several tested architectures, including larger models such as ResNet-18 and Vision Transformers, suggesting that its inductive biases generalize well to out-of-distribution visual conditions [
34]. These findings, while not directly comparable due to differences in dataset composition and task definition, support the appropriateness of MobileNetV2 as a lightweight yet effective architecture choice for the binary river pollution classification task addressed in this work. A formal head-to-head benchmark against alternative lightweight architectures (e.g., EfficientNet-B0, MobileNetV3) on the same dataset is identified as a valuable direction for future work.
It must be acknowledged that the comparisons presented in this section draw on results reported in the literature for different datasets and do not constitute a controlled benchmark on the same data used in this study. Claims regarding the comparative suitability of MobileNetV2 for this specific river pollution classification task should therefore be interpreted with caution, as performance differences may be influenced by dataset-specific characteristics rather than intrinsic architectural properties.
3.7. Supplementary Evaluation on Unaugmented Data
To further address concerns regarding the use of augmented images in the evaluation subsets, an additional evaluation was conducted using exclusively original, non-augmented images from the validation and test sets. The same best-performing model was evaluated with no transformation beyond pixel normalization (rescaling to [0, 1]). The original (unaugmented) validation set comprised 235 images (80 ‘not polluted’ and 155 ‘polluted’), and the test set comprised 230 images (75 ‘not polluted’ and 155 ‘polluted’), reflecting the actual class distribution of the dataset prior to augmentation.
The results are presented in
Table 4. Performance on unaugmented data remains fully consistent with the originally reported results: test accuracy of 88.26% (vs. 89.70%), F1-score of 91.03% (vs. 90.96%), and Precision of 93.84%. These results confirm that the deterministic augmentation strategy applied to the ‘not polluted’ evaluation subsets did not artificially inflate the reported metrics, and that the model generalizes robustly to original, unmodified images. The reduction in AUC (96.16% vs. 99.60%) is expected and reflects the smaller and more imbalanced evaluation set used in this supplementary analysis, which contains fewer ‘not polluted’ samples than the augmented version.
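The reported percentages can be cross-checked from confusion-matrix counts using the standard metric definitions. In the sketch below, the counts TP = 137, FP = 9, FN = 18, and TN = 66 are not stated in the text; they were inferred here as integer values that match the stated test-set class sizes (155 ‘polluted’, 75 ‘not polluted’) and reproduce the reported accuracy, precision, and F1-score to two decimals. They should be treated as an assumption and verified against the actual confusion matrix.

```python
from typing import Dict


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive ('polluted') class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Inferred counts (assumption, see the note above), not values from the paper.
    m = metrics_from_counts(tp=137, fp=9, fn=18, tn=66)
    print(round(100 * m["accuracy"], 2))   # 88.26
    print(round(100 * m["precision"], 2))  # 93.84
    print(round(100 * m["f1"], 2))         # 91.03
```

Under these inferred counts, the unaugmented test set would contain 18 missed ‘polluted’ images and 9 false alarms at the 0.5 threshold.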
4. Discussion
The results obtained in this study are consistent with findings reported in the literature on the application of CNNs for environmental image classification. The mean test accuracy of 89.7% and F1-score of 90.9%, achieved across 20 independent runs, demonstrate that the proposed approach, based on MobileNetV2 and Transfer Learning, is both robust and generalizable.
The choice of MobileNetV2 proved especially appropriate for this task. While the present study did not include direct on-device benchmarking of inference latency, the architecture’s computational efficiency is well documented in the literature: MobileNetV2 was specifically designed to achieve high accuracy with a substantially reduced number of parameters and floating-point operations compared to standard CNN architectures, through the use of depthwise separable convolutions and inverted residual blocks [
29]. This efficiency has been empirically confirmed in prior deployments on mobile and embedded hardware, where MobileNetV2-based classifiers have demonstrated inference times in the range of tens of milliseconds per image on consumer-grade mobile processors [
29,
30]. This supports its viability for deployment in resource-constrained devices, such as smartphones and UAVs, which are common tools in field environmental monitoring. A direct quantitative benchmarking of inference time and memory footprint for the present model was not performed in this study and is identified as a direction for future work.
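For the latency benchmark identified here as future work, a simple timing harness could take the following form. It is a sketch only: no measurement of this kind was performed in the study, `predict_fn` is a hypothetical callable standing in for whatever inference call is used on the target device, and the warm-up and repeat counts are arbitrary assumptions.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict


def benchmark_latency(
    predict_fn: Callable[[Any], Any],
    sample: Any,
    warmup: int = 5,
    repeats: int = 50,
) -> Dict[str, float]:
    """Time repeated single-sample calls and report latency in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Warm-up calls are discarded so one-off initialization does not bias results.
    for _ in range(warmup):
        predict_fn(sample)
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Nearest-rank 95th percentile on the sorted timings.
    p95_index = max(0, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "runs": float(len(timings_ms)),
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would wrap the model's inference call.
    print(benchmark_latency(lambda x: sum(x), list(range(1000))))
```

Reporting the median together with a high percentile, rather than the mean alone, limits the influence of occasional slow calls on shared or thermally throttled hardware.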
The absence of false negatives in the best model—where all 171 polluted river images were correctly identified—is a particularly relevant outcome for environmental surveillance applications. In such contexts, missing a real pollution event carries significantly greater consequences than generating a false alarm, as highlighted in
Section 3.3.1. This behavior reflects the model’s conservative nature at a threshold of 0.50, which is well-suited to preventive monitoring strategies.
It is important to note that this result refers specifically to the representative model selected for qualitative illustration in
Section 3.3, which was identified based on its F1-score on the test set. This model is not guaranteed to produce zero false negatives in all deployment scenarios; across the 20 independent runs, the mean recall for the ‘polluted’ class was 90.96% (±1.69%), indicating consistently high but not absolute detection performance across different training initializations.
The data augmentation strategy adopted in this work, differentiated by class and subset, effectively addressed the initial class imbalance and contributed to the model’s generalization capacity. The stability of metrics across 20 runs, evidenced by low standard deviations (e.g., ±0.27% for training accuracy), further confirms that the training pipeline is reproducible and reliable.
A limitation of the current approach is that the model operates exclusively on static images in an offline setting and was not designed for real-time video stream processing. Future integration with continuous data-collection systems—such as fixed cameras on bridges or drone-mounted sensors—would require adaptations to the inference pipeline and data transmission. Still, the efficiency of MobileNetV2 makes this transition technically feasible.
An additional limitation concerns the absence of a controlled comparative benchmark against alternative lightweight CNN architectures on the same dataset. While
Section 3.6 contextualizes the results against related literature, a direct comparison with MobileNetV3-Small, EfficientNet-B0, ShuffleNetV2, or ResNet18 under identical experimental conditions was not performed. Future work should include such a benchmark to more rigorously validate the architectural choice.
Compared to traditional monitoring methods, which typically require in-person sampling, laboratory analysis, and high operational costs, the proposed solution offers a scalable, low-cost alternative that can be rapidly deployed across diverse geographic regions. The additional web application interface further lowers the barrier to adoption by non-specialist users such as environmental inspectors and public managers.
5. Conclusions
This work presented the development and evaluation of an intelligent classifier based on a CNN approach for automatically detecting visible pollution on river surfaces in images. For this purpose, a database was constructed, manually labeled, and augmented. For classifier implementation, a Transfer Learning model based on the MobileNetV2 architecture was used to perform binary classification of images as ‘polluted’ or ‘not polluted’ rivers.
The results demonstrate that the model achieved high, consistent performance, with a mean accuracy of 89.7% and an F1-score of 90.9% on the test set. The ROC curve, with an AUC of 0.996, and the Precision–Recall curve, with an AP also of 0.996, indicate that the model is highly effective at distinguishing between classes, even at different decision thresholds. The absence of false negatives in the best run reinforces its applicability in scenarios that require maximum sensitivity, such as environmental alert systems.
From an applied perspective, the results reveal the model’s potential for incorporation into large-scale monitoring systems, especially in regions with limited infrastructure. The computational lightness of the MobileNetV2 architecture makes it viable for execution on devices with limited resources, such as smartphones, and for applications on drones or other embedded systems. With the pipeline adaptations noted in the Discussion, the model could be integrated into on-board software to perform real-time inference, with results that can be used for task automation and remote data transmission.
As its main contribution, this study proposes a low-cost, scalable, and replicable solution for water monitoring that expands the tools supporting environmental management and impact mitigation. Regarding future work, extending the proposed approach to real-time and continuous monitoring scenarios is a natural and technically feasible next step, given the computational efficiency of MobileNetV2 documented in the literature. Specifically, two deployment pathways merit further investigation: (i) UAV-based monitoring, which would require developing frame-extraction and on-board preprocessing pipelines, implementing edge inference on embedded processors, and designing geospatial logging protocols to associate pollution detections with GPS coordinates; and (ii) fixed-camera video stream monitoring, for which temporal filtering mechanisms (such as sliding-window aggregation or change detection algorithms) could be employed to reduce false-positive rates in continuous acquisition scenarios. Beyond real-time deployment, additional future directions include integration with Geographic Information Systems (GIS), expansion of the training dataset across diverse river basins and geographic regions, and investigation of multimodal approaches that combine imagery with contextual metadata (such as location, weather, and seasonal variation) to further improve detection accuracy and robustness.
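The sliding-window aggregation mentioned for fixed-camera streams could take a form such as the one below, in which an alert is raised only when enough recent frames exceed the decision threshold. This is a sketch of one possible design, not something implemented in this work: the class name, the window length, and the minimum number of positive frames are hypothetical parameters that would need tuning on real video data.

```python
from collections import deque
from typing import Deque


class SlidingWindowAlarm:
    """Raise an alert only when enough of the last `window` frames are positive."""

    def __init__(self, window: int = 5, min_positive: int = 3, threshold: float = 0.5):
        if not 1 <= min_positive <= window:
            raise ValueError("min_positive must be between 1 and window")
        self.min_positive = min_positive
        self.threshold = threshold
        # Old frames fall out automatically once the window is full.
        self._recent: Deque[bool] = deque(maxlen=window)

    def update(self, probability: float) -> bool:
        """Add one frame's pollution probability and return the alert state."""
        self._recent.append(probability >= self.threshold)
        return sum(self._recent) >= self.min_positive


if __name__ == "__main__":
    # Hypothetical per-frame probabilities from a fixed camera.
    alarm = SlidingWindowAlarm(window=5, min_positive=3)
    frames = [0.9, 0.1, 0.2, 0.8, 0.1, 0.9, 0.95]
    print([alarm.update(p) for p in frames])
    # [False, False, False, False, False, False, True]
```

Isolated high-probability frames (for example, caused by glare or a passing object) do not trigger an alert on their own, at the price of a short detection delay when pollution first appears.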