1. Introduction
Water quality is an essential aspect of public health and environmental sustainability. The presence of contaminants, such as total suspended solids (TSS), significantly impacts the potability and safety of the water supply, posing substantial challenges to effective monitoring and maintenance [
1]. Water quality also has important consequences for aquatic ecosystems and biodiversity. Turbidity, a key indicator of water quality, is influenced by the concentration of TSS. High turbidity levels can interfere with aquatic habitats, affect species that depend on water for their survival, and contribute to the degradation of river and marine ecosystems [
2]. Ensuring water quality is therefore essential not only to protect human health but also to preserve the integrity and functioning of aquatic ecosystems [
3].
TSS are related to the accumulation of organic and inorganic matter, feed residues, and aquatic microorganisms. They are defined as the mass of suspended material per unit volume of water (mg/L) [
4]. Meanwhile, turbidity is the degree of loss of water transparency due to TSS [
5]. Both tend to increase almost proportionally [
6]. There are different methods to calculate and monitor them. Method 180.1 by the U.S. EPA, known as nephelometry, is based on comparing the intensity of light scattered by a reference sample and the sample being measured. The measurement ranges between 0 and 40 NTU (nephelometric turbidity units) and, to achieve higher values, the samples must be diluted in water and the measurement rescaled [
This method is the most commonly used and is implemented in the majority of commercial turbidimeters, utilizing a light source and a detector, but it has several limitations. Inexpensive turbidimeters often have limited detection ranges and can be influenced by colored dissolved substances or air bubbles, leading to inaccurate readings. Additionally, these turbidimeters typically require multiple data records for comparison, which can be time-consuming and less efficient in dynamic environments [
8,
9,
10].
Recent techniques for turbidity measurement have been developed, such as the method implemented by Zhou and Zhang, which presents a new approach based on ultraviolet–visible near-infrared (UV-VIS-NIR) absorption measurements, achieving a coefficient of determination of 0.99 [
11]. Additionally, Goblirsch et al. implemented fluorescence spectroscopy for turbidity estimation, achieving a sensitive detection of 0.2 NTU [
12]. However, both methods are expensive. Zhue et al. introduced a method using two NIR digital cameras for turbidity measurement, but it requires two data records for estimation and is also expensive [
13]. Cheng et al. proposed a method based on the scattering of light, which effectively eliminates difficult-to-remove air bubbles in the water channel with high accuracy, but it requires a constant calibration process to work effectively [
14].
Advancements in image processing have introduced new methods for assessing turbidity. Digital image processing techniques analyze the gray levels in water images to estimate turbidity levels. For instance, studies have demonstrated how image pixels correlate with water turbidity [
15,
16,
17]. These methods, however, also face challenges such as sensitivity to lighting conditions and image quality.
Convolutional neural networks (CNNs) offer a promising alternative for turbidity measurement. Inspired by the mammalian visual cortex, CNNs are highly effective for image analysis tasks such as classification and detection [18,19]. They are mathematical models that use computational blocks and multiple layers of artificial neurons to approximate any continuous function [20]. CNNs are particularly advantageous in this context because they can handle the complex patterns and high-dimensional data typical of image analysis tasks, offering robust feature extraction capabilities that traditional methods might miss. This makes them suitable for analyzing images of water samples in which suspended solids need to be identified and quantified accurately.
Multiple linear regression (MLR) is another technique that has been used to model and predict water quality parameters. MLR is particularly useful when the relationship between the predictors (input variables) and the response variable (output) is linear, and it is simpler and computationally less intensive than a CNN. However, MLR may not capture complex patterns and interactions in the data as effectively as CNNs. Combining CNNs with MLR can exploit the strengths of both methods, providing a robust framework for turbidity and TSS estimation.
To measure the performance of a CNN coupled with MLR, a loss function is used, and the parameter weights are adjusted to reduce the discrepancy between the model’s predictions and the known data. During the training process, these weights are iteratively updated by optimization algorithms to minimize the loss function [
21,
22]. Ideally, an optimization algorithm produces a fast fit with low memory cost, helps avoid overfitting, and prevents the model from settling into local minima of the loss function. The selection of an optimizer depends on the nature of the database used.
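As a brief illustration of this weight-update cycle, the following is a minimal PyTorch sketch with a placeholder model and random data; the actual study trains AlexNet with each of the twelve optimizers described later, using their default settings.

```python
import torch
import torch.nn as nn

# Placeholder model, loss, and optimizer; stand-ins for the CNN and the
# optimizers compared in this study.
model = nn.Linear(10, 9)                  # stand-in for the CNN
criterion = nn.CrossEntropyLoss()         # loss function to be minimized
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(40, 10)              # one mini-batch (batch size 40)
targets = torch.randint(0, 9, (40,))      # known class labels

optimizer.zero_grad()                     # clear previously accumulated gradients
outputs = model(inputs)                   # forward pass -> logits
loss = criterion(outputs, targets)        # discrepancy between predictions and labels
loss.backward()                           # backpropagate the gradients
optimizer.step()                          # update the weights to reduce the loss
```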
Two optimizers commonly used for TSS and turbidity measurement are Adam and SGD. With the Adam algorithm, Feizi et al. achieved a turbidity estimation accuracy of 97.5%, though only after a large number of epochs, specifically between 150 and 200 [
17]. Nazemi et al. also implemented a CNN to classify turbid water samples, achieving 98.42% accuracy for color images and 94.34% for grayscale images [
23]. On the other hand, Haciefendioglu et al. only reached 87% accuracy [
24], while Li et al. achieved a mean square error of 0.92 [
25]. Additionally, an SGD algorithm has also been implemented in turbidity and TSS tasks. Wan et al. reached an R-squared of 0.931 [
26], and Lopez-Betancur et al. achieved 98.24% accuracy for turbidity and 97.20% for TSS estimation [
27].
Despite obtaining acceptable results, Adam and SGD may not be generalized solutions due to their sensitivity to data distribution and variability. This is related to the nature of the applied database and the specific characteristics of these optimizers [
28]. The selection of an optimizer for the evaluation of TSS and turbidity should be based on the capabilities of the CNN to be trained and the characteristics of the database to be used. Such a selection can also provide a foundation for the development of more efficient and accessible water quality monitoring methods, with a significant impact on water resource management and the protection of aquatic ecosystems.
For this purpose, a comparison of twelve different optimization algorithms available in PyTorch was conducted to identify the most effective ones for estimating suspended solids in liquid samples using a pre-trained AlexNet model. Computationally generated binary images were used for this comparison [
29]. This methodology was adopted to maintain a controlled environment and eliminate variables that could introduce noise and optical aberrations. In this way, we ensure that the optimization algorithms focus on the nature of the database, which consists of black points (suspended solids) on a white background, analogous to liquid samples with suspended solids, as referred to in [
15,
16,
17,
27]. The aim of this research is to identify the most suitable optimizer based on the nature of the database and to provide additional information about the performance of each optimizer.
2. Materials and Methods
This section describes the different optimization algorithms evaluated for the classification task using computationally generated binary images with black points (simulating suspended solids) on a white background.
2.1. CNN and Multiple Linear Regression (MLR) Used
Since the goal is to analyze optimization algorithms, a simple CNN such as AlexNet is used to isolate and evaluate the performance of each optimization algorithm in the task of measuring suspended solids. AlexNet, which is based on convolutional and fully connected layers, exhibits sufficient representational capacity to capture the discriminative features present in such visually simple images. Additionally, its computational efficiency and the ease of transferring pre-trained weights make it an attractive option for this classification problem with binary images. Furthermore, AlexNet has been widely studied and benchmarked in various image classification tasks, making it a well-understood and reliable choice for this research [
30].
Any CNN involves two main steps: feature extraction and classification. The classification step uses neurons to process inputs (features) and compute a response (output), or logits, which are usually normalized with a SoftMax function to obtain class probabilities. The trained CNN’s output vector can be seen as a decoded version of the input image because the model extracts hidden information from the sample. Although the CNN can accurately classify certain liquid samples, it faces challenges with images whose concentrations fall between the trained classes. However, by using the feature vectors (CNN output vectors) to train a multiple linear regression (MLR) model, it is possible to predict the values of any sample; the key is to train the CNN with classes that span the desired dynamic range. In a multiple linear regression model, several independent variables are used to predict a single dependent variable. In this context, the feature vector obtained from the CNN serves as the independent variables, while the number of black pixels is the dependent variable. This MLR approach allows new black-pixel values to be estimated from the logits vector of images not used in the training process, providing a valuable tool for sample analysis and prediction [27]. The general sequence described is shown in Figure 1.
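A minimal sketch of this two-stage pipeline is given below. It assumes a recent torchvision (for the pre-trained AlexNet weights) and scikit-learn's LinearRegression as the MLR stage; the images and pixel counts are placeholders, since the exact training script is not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LinearRegression

NUM_CLASSES = 9  # training classes: 0, 6272, ..., 50,176 black pixels

# Stage 1: the CNN produces a logits vector (feature vector) per image.
cnn = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
cnn.classifier[6] = nn.Linear(4096, NUM_CLASSES)   # output layer resized to 9 classes
cnn.eval()                                         # assume it has already been fine-tuned

def logits_of(batch):                              # batch: (N, 3, 224, 224) tensor
    with torch.no_grad():
        return cnn(batch).numpy()

# Stage 2: the MLR maps the logits vector to a black-pixel count.
# X_train holds the logits of the training images, y_train the known counts.
X_train = logits_of(torch.randn(32, 3, 224, 224))  # placeholder images
y_train = torch.randint(0, 50177, (32,)).numpy()   # placeholder pixel counts
mlr = LinearRegression().fit(X_train, y_train)

# Prediction for an unseen image: CNN logits -> MLR -> estimated pixel count.
estimate = mlr.predict(logits_of(torch.randn(1, 3, 224, 224)))
```

In this arrangement, the nine logits act as the independent variables of the regression and the black-pixel count as the dependent variable, exactly as described above.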
2.2. Optimization Algorithms Evaluated
Optimization algorithms play a critical role in the training of convolutional neural networks (CNNs) by minimizing the loss function and improving the model’s performance. These algorithms adjust the weights of the neural network to reduce the error between predicted and actual outcomes. In this study, we evaluate twelve different optimization algorithms available in PyTorch to identify the most effective ones for estimating suspended solids in liquid samples using a pre-trained AlexNet model. The evaluated algorithms include Adadelta, Adagrad, Adam, AdamW, Adamax, ASGD, LBFGS, NAdam, RAdam, RMSprop, Rprop, and SGD. Each algorithm has unique characteristics and advantages, which are briefly described below.
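For reference, all twelve optimizers can be instantiated from torch.optim with their default hyperparameters as sketched below (a recent torchvision is assumed for the weights enum). The learning rate passed to SGD is an assumption, since the values actually used are those listed in Table 2, and in practice each optimizer would be paired with a freshly initialized model rather than a shared parameter list.

```python
import torch.optim as optim
from torchvision import models

# A single parameter list is reused here only for illustration.
params = list(models.alexnet(weights=models.AlexNet_Weights.DEFAULT).parameters())

optimizers = {
    "Adadelta": optim.Adadelta(params),
    "Adagrad":  optim.Adagrad(params),
    "Adam":     optim.Adam(params),
    "AdamW":    optim.AdamW(params),
    "Adamax":   optim.Adamax(params),
    "ASGD":     optim.ASGD(params),
    "LBFGS":    optim.LBFGS(params),
    "NAdam":    optim.NAdam(params),
    "RAdam":    optim.RAdam(params),
    "RMSprop":  optim.RMSprop(params),
    "Rprop":    optim.Rprop(params),
    "SGD":      optim.SGD(params, lr=0.001),  # lr is an assumption; see Table 2
}
```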
2.2.1. Adadelta
The method adjusts dynamically over time, relying solely on first-order information, and incurs minimal computational overhead compared to basic stochastic gradient descent [
31]. The main advantages of this method include no manual adjustment of the learning rate, insensitivity to hyperparameters, minimal computational requirements, and robustness to large gradients, noise, and the choice of architecture.
2.2.2. Adagrad
This optimizer incorporates data observed in earlier iterations, adapting subgradient methods to the geometry of the data. This method is based on a diagonal approximation of the matrix obtained from the products of subgradients. In essence, the adaptation enhances the effectiveness of the method on certain types of data with sparse gradients compared to previous methods [
32].
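In its standard diagonal form (notation ours, not taken from the cited reference), Adagrad accumulates the squared (sub)gradients and scales each parameter's step individually:

```latex
% g_t: (sub)gradient at step t; eta: base learning rate; epsilon: small constant.
% All products, square roots, and divisions are taken element-wise.
\begin{aligned}
G_t &= G_{t-1} + g_t \odot g_t, \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t .
\end{aligned}
```

Parameters whose accumulated gradient history is small (infrequent, sparse features) therefore receive proportionally larger updates.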
2.2.3. Adam
This method focuses on efficient optimization using estimates of the first and second moments of the gradients, without requiring a large amount of memory. Learning rates are adaptively adjusted for different parameters based on these moment estimates. This can be useful in situations where memory resources are limited or when efficient optimization is sought using low-order gradient information [
33]. The advantages of this optimizer include its ability to adapt to changes in gradient scale, automatically control step sizes during optimization to enhance convergence, and its effectiveness in situations with sparse gradients.
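For reference, the standard Adam update (notation ours) maintains exponentially decaying estimates of the first and second moments of the gradient, corrects their bias, and scales the step accordingly:

```latex
% g_t: gradient at step t; beta_1, beta_2: decay rates; eta: learning rate;
% epsilon: small constant for numerical stability.
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t  = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat{v}_t  = \frac{v_t}{1-\beta_2^{\,t}}, \\
\theta_{t+1} &= \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} .
\end{aligned}
```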
2.2.4. AdamW
This method improves the regularization of the Adam optimization algorithm by separating weight decay from gradient-based updates. It shows that decoupling weight decay simplifies hyperparameter optimization, as it makes the optimal configurations for learning rate and weight decay factor much more independent of one another [
34].
2.2.5. Adamax
Adamax, a variant of Adam, uses the same moment estimates but applies the infinity norm (instead of the L2 norm) to the past gradients when computing the adaptive learning rates. Adamax is preferred in situations where gradients exhibit a wide range of magnitudes [
33].
2.2.6. ASGD
Averaged Stochastic Gradient Descent is based on averaging the parameter iterates over time to smooth the optimization process and improve convergence, particularly in situations where gradients are noisy or the problem conditions are complex [
35].
2.2.7. LBFGS
Limited-memory Broyden–Fletcher–Goldfarb–Shanno is an optimization algorithm whose PyTorch implementation is heavily inspired by MATLAB’s minFunc. It relies on an efficient approximation of the inverse of the Hessian matrix (which describes how gradients change with respect to the model parameters). Instead of storing and manipulating the entire matrix, it keeps only a low-memory approximation built from previous gradients. This makes it particularly suitable for optimization problems in high-dimensional spaces or with limited memory resources [
36].
2.2.8. NAdam
Nesterov-accelerated Adam is based on replacing the momentum component of Adam with Nesterov’s accelerated gradient (NAG). The NAdam algorithm employs first- and second-moment information to adjust the learning rate adaptively and determine the direction of the step. This allows it to perform better in scenarios with steep valleys or high curvature in the loss function, leading to improved convergence speed in non-convex problems and better learning quality [
37].
2.2.9. RAdam
The Rectified Adam algorithm recognizes that, due to the limited number of samples in the early stages of model training, the adaptive learning rate in the Adam model exhibits an undesirably large variance. This can lead the model to converge towards suboptimal local minima. Therefore, RAdam not only rectifies this variance of the adaptive learning rate but also compares favorably with the warmup heuristic [
38].
2.2.10. RMSprop
The operation of RMSprop is based on maintaining a weighted average of the squares of previous gradients. This allows it to be applied in situations where it is not advisable for the learning rate to be constant, such as when dealing with loss functions that have different scales, variable curvature, slow convergence, and oscillation cycles, among others [
39].
2.2.11. Rprop
The resilient propagation algorithm directly adapts the weight step based on local gradient information, according to the sequence of signs of the partial derivatives. Notably, the algorithm is not affected by the magnitude of the gradient, only by its sign, which is very useful in situations where the gradient is highly volatile or difficult to interpret [
40].
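In its standard form (notation ours), Rprop keeps a separate step size for each weight, grows or shrinks it depending on whether the sign of the partial derivative is repeated, and then steps against the current sign:

```latex
% Delta_i: per-parameter step size; eta^+ > 1 > eta^- > 0: increase/decrease factors;
% Delta_min, Delta_max: bounds on the step size; partial_i L denotes the partial
% derivative of the loss with respect to theta_i. Only the sign of the derivative is used.
\Delta_i^{(t)} =
\begin{cases}
\min\!\bigl(\eta^{+}\Delta_i^{(t-1)},\, \Delta_{\max}\bigr), & \text{if } \partial_i L^{(t-1)} \cdot \partial_i L^{(t)} > 0,\\
\max\!\bigl(\eta^{-}\Delta_i^{(t-1)},\, \Delta_{\min}\bigr), & \text{if } \partial_i L^{(t-1)} \cdot \partial_i L^{(t)} < 0,\\
\Delta_i^{(t-1)}, & \text{otherwise,}
\end{cases}
\qquad
\theta_i^{(t+1)} = \theta_i^{(t)} - \operatorname{sign}\!\bigl(\partial_i L^{(t)}\bigr)\, \Delta_i^{(t)} .
```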
2.2.12. SGD
Stochastic Gradient Descent uses the training data in a stochastic manner, employing small, randomly selected data subsets in each iteration, which makes it computationally more efficient. The use of small random subsets is highly beneficial when working with large datasets. However, this same stochasticity can also make it a noisier and less stable algorithm [
41].
A summary of the main characteristics of these algorithms and their relationship with the dataset is described in
Table 1.
2.3. Database
The dataset was created by randomly adding black pixels to white images, resulting in binary images. Nine classes were defined, each labeled by its number of black pixels: 0, 6272, 12,544, 18,816, 25,088, 31,360, 37,632, 43,904, and 50,176 (See
Figure 2). The images were created with dimensions of 224 × 224 pixels, corresponding to the input layer of the CNN used.
A total of 9000 images from nine different classes were generated. Out of these, 7200 images were randomly selected for the training process (800 images per class), while the remaining 1800 images were allocated to the validation dataset (200 images per class). Additionally, 8000 new images for eight additional classes with intermediate pixel concentrations were generated to test the different optimization algorithms. These intermediate classes contained images with black-pixel counts between those of the main classes; they were specifically designed to validate the generalization capability of the model and were not used in training the CNN (See
Figure 3).
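A sketch of how such a dataset could be generated is shown below; the exact generation script is not reproduced in the text, so the directory layout and random seed are assumptions, while the image size, class labels, and random pixel placement follow the description above.

```python
import os
import numpy as np
from PIL import Image

SIZE = 224                                    # matches the CNN input layer
CLASSES = [0, 6272, 12544, 18816, 25088,
           31360, 37632, 43904, 50176]        # black pixels per class
IMAGES_PER_CLASS = 800                        # training images per class

def make_image(n_black, rng):
    """White 224x224 image with n_black randomly placed black pixels."""
    img = np.full((SIZE, SIZE), 255, dtype=np.uint8)             # white background
    idx = rng.choice(SIZE * SIZE, size=n_black, replace=False)   # unique random positions
    img.flat[idx] = 0                                            # set the black pixels
    return Image.fromarray(img).convert("RGB")                   # 3 channels for AlexNet

rng = np.random.default_rng(seed=0)
for n_black in CLASSES:
    out_dir = os.path.join("train", str(n_black))                # assumed layout: train/<label>/
    os.makedirs(out_dir, exist_ok=True)
    for i in range(IMAGES_PER_CLASS):
        make_image(n_black, rng).save(os.path.join(out_dir, f"{i:04d}.png"))
```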
Each image was carefully inspected to ensure it adhered to the specified class definitions. The generation process was automated to maintain consistency and prevent human error. Furthermore, the distribution of black pixels in each image was random to simulate various real-world conditions where suspended solids might not be evenly distributed.
The training process was developed and implemented using a workstation with the specifications described in the orange part of
Table 2. The optimization algorithms were taken from PyTorch’s torch.optim module, and the AlexNet CNN from the torchvision package. For the training of the CNN, the algorithm executed a total of 50 epochs with 5-fold cross-validation for each optimization algorithm listed in
Table 1. The cross-validation technique was used to ensure the robustness of our findings, and statistical tests were applied to compare the performance of different optimization algorithms.
The epoch number was selected by analyzing the loss of training according to previous executions of the training process. The network was trained with the default momentum settings for the optimization algorithms that required this hyperparameter. The batch size was set to 40 to balance computational efficiency and training stability. The hyperparameters used in the experiment are listed in the blue part of
Table 2.
Data augmentation techniques, such as random rotations and flips, were applied to the training images to improve the model’s robustness and generalization capability. The validation set was strictly used to evaluate the performance of the trained models, ensuring an unbiased assessment of their predictive accuracy.
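A possible torchvision setup for the loading and augmentation described above is sketched here; the rotation angle and the exact transform list are assumptions (the text only states that random rotations and flips were applied and that the batch size was 40), and the white fill keeps rotated borders consistent with the white background.

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),                # random flips
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=15, fill=255),  # random rotation, white fill (angle assumed)
    transforms.ToTensor(),
])
val_tf = transforms.ToTensor()                        # no augmentation for validation

train_set = datasets.ImageFolder("train", transform=train_tf)   # assumed folder layout
val_set   = datasets.ImageFolder("val",   transform=val_tf)

train_loader = DataLoader(train_set, batch_size=40, shuffle=True)
val_loader   = DataLoader(val_set,   batch_size=40, shuffle=False)
```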
2.4. Evaluation Metrics
The performance of the proposed method was evaluated for a classification task based on the confusion matrix, which has four important elements: TP for true positives, TN for true negatives, FP for false positives, and FN for false negatives. These elements of the confusion matrix are used to calculate the following performance metrics for evaluating the classifier, as listed in the blue part of
Table 3: accuracy, precision, recall, specificity, and F-score [
27].
Additionally, to evaluate the performance of the MLR, whose task is to estimate the measured number of black pixels, the following metrics are used, as listed in the orange part of
Table 3: coefficient of determination, mean absolute error, and mean square error, where y_predicted is defined as the predicted value, y_true as the true value, and y_mean as the average of the y data [20].
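As an illustration, all of these metrics can be computed with scikit-learn; the arrays below are placeholders, and specificity is derived per class from the confusion matrix since scikit-learn does not provide it directly.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, r2_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics (placeholder labels).
y_true_cls = np.array([0, 1, 2, 2, 1])
y_pred_cls = np.array([0, 1, 2, 1, 1])
accuracy  = accuracy_score(y_true_cls, y_pred_cls)
precision = precision_score(y_true_cls, y_pred_cls, average="macro")
recall    = recall_score(y_true_cls, y_pred_cls, average="macro")
f_score   = f1_score(y_true_cls, y_pred_cls, average="macro")

# Specificity = TN / (TN + FP), averaged over classes.
cm = confusion_matrix(y_true_cls, y_pred_cls)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)
specificity = np.mean(tn / (tn + fp))

# Regression metrics for the MLR stage (placeholder pixel counts).
y_true = np.array([6272.0, 12544.0, 18816.0])
y_pred = np.array([6300.0, 12400.0, 18900.0])
r2  = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```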
3. Results
This research evaluated state-of-the-art optimization algorithms aimed at classifying and estimating the number of black points on a white background image, which is related to suspended solids in liquid samples. The goal for classification was to assess their accuracy, precision, recall, specificity, and F-Score. The training time taken by each optimizer is listed in
Table 4.
The performance metrics were evaluated using an additional validation dataset (See
Figure 3), which consisted of eight classes with 1000 images in each class. This dataset included intermediate classes, which were not used in the training process. The performance metrics are presented in
Table 5. The confusion matrix of the best performing optimization algorithms is presented in
Figure 4.
The classification task was evaluated according to the following accuracy categories: Excellent (0.90–1.00), Good (0.80–0.89), Moderate (0.70–0.79), and Poor (below 0.70) [
42,
43]. The models Adagrad, Rprop, Adamax, SGD, ASGD, and Adadelta achieved 100% classification accuracy. RAdam also achieved excellent accuracy. This achievement may be attributed to the dataset used, which consists of a structured database storing information about the presence or absence of objects, such as suspended solids. The structured nature of the database allows for efficient gradient-based optimization for algorithms such as SGD, ASGD, Rprop, and Adadelta [
44,
45]. Furthermore, it is plausible that many of these objects may be absent for the majority of entries in the database, resulting in a sparse representation of the data. This sparse representation is suitable for algorithms like Adagrad and Adamax [
46]. However, these results are only for the classification task and are not definitive for the estimation of the number of black points on a white background image, which is related to suspended solids. For the estimation task, the aim was to assess their coefficient of determination, mean absolute error, and mean square error. The results are listed in
Table 6. The models that achieved 100% classification accuracy also demonstrated excellent performance in the estimation task. Additionally, two further models (LBFGS and RAdam) are included; although they did not achieve 100% accuracy, they show good and moderate coefficients of determination, respectively.
In addition, the predicted data for each optimization algorithm are shown in
Table 7. The true data represent the number of black pixels in the images that were created. An error bar plot comparing the best models in terms of classification accuracy is shown in
Figure 5, and for the regression coefficient of determination in
Figure 6.
4. Discussion
The results obtained in our research show significant variability in the performance of the optimization algorithms used for the estimation of suspended solids. In particular, the Adagrad, Rprop, Adamax, SGD, and ASGD algorithms proved to be the most effective, achieving 100% accuracy in the classification task and high coefficients of determination (R²) in the estimation task. The top five optimization algorithms do not require momentum to function (See
Table 1 listed in
Section 2); they have their own default optimization strategies. Remember that momentum is an optional feature that can improve the convergence and stability of optimization algorithms. However, momentum can sometimes lead to overshooting the minimum in the loss surface, leading to slower convergence, or becoming stuck in local minima [
47].
SGD has emerged as a standard method for optimizing various types of deep neural networks, primarily because of its capacity, like that of ASGD, to escape local minima (verifiable in the best training times listed in
Table 4) [
48,
49], and its efficiency for large-scale datasets, making it ideal for linear classification problems related to our database’s nature [
50]. Additionally, in recent works, SGD has proven to be an excellent optimization algorithm for suspended solids and turbidity estimation using CNNs, yielding an R² value of 0.931 [26], accuracies of 98.24% and 97.20% for TSS and turbidity, respectively [27], and an accuracy of 94% for a turbidity task [51].
Adamax and Rprop are known for their robustness against noisy gradients and abrupt fluctuations. This allowed them to maintain consistent performance even in the presence of variability in the images. Adamax, due to its adaptive ability, also works well with low-resolution images [
52]. Rprop, in particular, performs well when gradients are very noisy or have abrupt fluctuations, as it focuses only on the sign of the gradient and not its magnitude [
53].
The structure of the dataset, consisting of binary images with uniformly distributed black pixels, favors algorithms that handle sparse and high-dimensional data well. For example, Adagrad adapts the learning rate for each parameter individually based on the history of gradients for that parameter. When a feature is infrequent in the dataset, Adagrad assigns a higher learning rate to that parameter, allowing the algorithm to make larger updates for these infrequent features. Adagrad is particularly effective at handling sparse and high-dimensional features, such as those found in black and white pixel images. This can result in faster convergence and excellent performance, as shown in this study [
54,
55].
In the literature, the Adam optimizer has reached, for turbidity tasks, an AUC of 0.89 (good discrimination between classes) [56], a mean square error of less than 0.05 [57], an R² value of 0.80 [58], and accuracies of 88.45% [59] and 87% [24]. These values are lower than those SGD has been able to provide, as seen in the present study. In the case of Adam, it is important to note that it could achieve better performance if the initial hyperparameters were adjusted according to the observed training trends. However, this research aimed to analyze each optimizer with its default hyperparameters to determine which ones adapt best to datasets with suspended particles. The use of default hyperparameters may have benefited certain algorithms that are well-suited to these initial settings, while others might require fine-tuning to achieve optimal performance.
In this study, computationally generated images were used, where black pixels represent suspended solids in a liquid sample. This methodology was adopted to maintain a controlled environment and eliminate variables that could introduce noise and optical aberrations. The relationship between the number of black pixels and turbidity or suspended solids values is based on previous studies that have demonstrated the feasibility of using computer vision techniques to estimate these parameters. Relevant works establishing a relationship between pixel values and turbidity are presented by Berrocal et al. [
60], and by Gang Dou et al. [
61]. Additionally, this relationship is more detectable through digital image processing techniques, such as those presented by Karnawat and Patil [
62]. In image processing, various features, including the gray levels in the images, are related to the image pixels, and are used to detect the degree of water turbidity, as defined by Feizi et al. [
17].
However, all the simulated suspended solids in this study have the same dimensions, which may not reflect real-world conditions where suspended solids vary in size and color. The system’s performance on real samples, where suspended solids have different sizes and colors, remains to be tested. Furthermore, overlapping effects in real applications have not been considered in this study. In practice, suspended solids can overlap, which could affect the accuracy of turbidity estimation. A possible solution might be to consider turbidity not as a single image but as a combination of images gathered at different times, allowing for a more comprehensive analysis.
This study provides a comprehensive comparison of multiple optimization algorithms in the estimation of suspended solids, filling a gap in the existing literature. The results indicate that algorithms such as Adagrad and Rprop are highly effective for this task, which can guide future research and practical applications in water quality monitoring. Additionally, by using default hyperparameter settings, we demonstrate that it is possible to obtain accurate results without the need for complex adjustments, facilitating implementation in practical environments.
In conclusion, our findings not only highlight the importance of selecting the appropriate optimization algorithm based on the nature of the dataset but also provide a foundation for the development of more efficient and accessible water quality monitoring methods. This can have a significant impact on water resource management and the protection of aquatic ecosystems.
In future research, we hope to extend our approach to real data to further validate our findings and improve their practical applicability. Additionally, addressing the limitations identified in this study, such as varying sizes and colors of suspended solids and overlapping effects, will be critical in enhancing the system’s robustness and reliability in real-world applications.
5. Conclusions
In this paper, a performance comparison of twelve optimization algorithms was conducted on an AlexNet CNN and an MLR to estimate the quantity of black points (suspended solids) distributed randomly on a white background image, which simulates the total suspended solids in liquid samples. The goal was to assess the effectiveness of the different optimizers on image classification and multiple linear regression related to suspended solids in liquid samples. Therefore, AlexNet and the MLR were trained with nine classes ranging from 0 to 50,176 black pixels per image and validated with eight additional classes (not used in the training process) ranging from 3136 to 47,040 black pixels per image.
The results demonstrated that the performance of each optimizer is influenced by the characteristics of our dataset. The three worst-performing optimizers were Adam, AdamW, and NAdam, while the five best were Adagrad, Rprop, Adamax, SGD, and ASGD. The Adagrad optimizer was chosen as the first option because it attained a coefficient of determination (R²) of 0.982, largely owing to its adaptive learning rate for each parameter and its ability to manage sparse and high-dimensional features.
As future work, the top five optimizers could be evaluated with state-of-the-art CNN models and different regression models to further improve the method’s performance. Additionally, it is expected that this research will be helpful in the development of new turbidimeters based on CNN implementations.