1. Introduction
As the world’s population, industry, and technology grow rapidly, the amount of waste of all kinds, from organic to medical to hazardous, is increasing. A large part of this waste is liquid and is usually referred to as wastewater. Liquid waste poses a major environmental challenge. According to the United Nations, about 80% of global wastewater is discharged without treatment or with minimal processing. This means that large volumes of contaminated water are returned to ecosystems, posing a threat to human health, fauna and flora [
1,
2,
3,
4,
5]. The scale of liquid waste is enormous: for example, in the United States, the industrial sector generates more than 22 billion gallons (about 83 billion liters) of wastewater annually. The largest sources of pollution are various industries. The chemical and oil refining industries produce large volumes of hazardous wastewater, including acids, solvents, and so-called “produced water”—highly salty and chemically contaminated water generated during oil and gas extraction. The food and beverage industry consumes about 20% of all industrial water in the world, which also means a large amount of wastewater contaminated with organic substances and fats [
1,
2,
3,
4,
5,
6]. The textile sector is one of the most damaging to aquatic ecosystems: producing one kilogram of fabric consumes about 200 L of water, and the resulting wastewater is saturated with dyes, heavy metals, and other harmful compounds. The scale of the laundry industry is also significant: in the USA alone, more than 2500 industrial laundries operate, generating about 5.11 km³ of wastewater per year [
6,
7,
8,
9,
10]. In the healthcare sector, an additional threat is posed by medical waste, which includes pharmaceuticals, infectious agents and laboratory chemicals. These examples show that liquid waste is not only domestic wastewater, but also the result of a wide range of industrial activities. Since only a quarter of waste is properly managed globally, the problem of liquid waste is becoming one of the most important environmental issues. Modern technologies, stricter regulation and international cooperation are necessary for its treatment and recycling, otherwise contaminated water will continue to have a detrimental effect on human health and the entire ecosystem of the planet [
11,
12,
13,
14].
One of the essential aspects of liquid waste management is the assessment of its concentration. Liquid waste is characterized by great complexity, as it may contain various organic and inorganic compounds, heavy metals, pharmaceutical substances or microbiological contaminants. For example, industrial liquid waste from the textile and chemical industries is often saturated with phenols, formaldehyde, ammonia, nitrogen compounds and sulfur dioxide residues. Residues of pharmaceutical preparations are found in wastewater from the healthcare sector, including antibiotics (e.g., ciprofloxacin), hormonal substances, etc. [
1,
2,
15,
16,
17,
18,
19,
20]. Wastewater from the food industry is characterized by high biochemical oxygen demand (BOD) due to the sugars, fats and proteins it contains. Because the chemical composition is so diverse, determining the concentration of wastewater is necessary not only to select the most appropriate treatment technologies but also to ensure that pollutants do not exceed environmental standards [
1,
2,
3,
4]. Various physical, chemical and biological methods are used to determine the composition and concentration of liquid waste. The main indicators, such as biochemical oxygen demand (BOD), chemical oxygen demand (COD) and total dissolved solids (TDS), are determined by titration or spectrophotometric methods. Gas chromatography (GC) and high-performance liquid chromatography (HPLC) are used to analyze organic pollutants (e.g., pesticides, phenols, solvents). These analyses require specialized equipment and qualified specialists and can take from several hours to several days, making the process both expensive and time-consuming [
1,
2,
3,
4,
21,
22,
23,
24,
25]. If the liquid waste is of a single origin and its chemical composition is known in advance, the concentration can be roughly estimated based on the intensity of the color. While such colorimetric estimation can provide a rapid and low-cost preliminary assessment, its scientific validity depends on empirical calibration. The relationship between pollutant concentration and color intensity must be established specifically for the wastewater type under investigation. Factors such as turbidity, pH, temperature, and the presence of multiple-colored substances can significantly influence RGB measurements, potentially leading to inaccurate estimations. Therefore, this approach should be used only as an approximate indicator, calibrated against laboratory reference measurements, rather than as a replacement for comprehensive chemical analyses. For example, if the solution is brightly colored (e.g., brown or red) and evenly distributed throughout the volume, its concentration can be determined visually or using simpler colorimetric methods [
1,
3,
4,
26,
27,
28,
29,
30]. In this case, calibration curves are constructed in advance, allowing one to predict the percentage concentration of pollutants based on the color saturation. This method is not very accurate, but it gives quick results and avoids expensive and time-consuming laboratory tests when a preliminary assessment of contamination is sufficient. Trained neural networks can be adapted to such a task, e.g., using the PyTorch framework. The analysis can be performed numerically, using measured color intensities. Instead of visual assessment, RGB or spectrophotometric signals can be measured, and these data are converted into numerical weights or intensity indicators. Then, knowing the specific relationship between the color of the pollutant and its concentration, it is possible to approximately determine the concentration based on these numerical measurements. Combining digital color intensity estimation with AI/ML algorithms for concentration prediction is a relatively new and rapidly developing scientific direction. Recent studies have demonstrated the potential of smartphone-based colorimetry for assessing water quality and environmental parameters. Several works have proposed RGB-based models for detecting water contaminants, colorimetric ion sensing, and reflectance measurement, emphasizing the need for consistent color calibration and illumination control. Other research has addressed color standardization and color-constancy algorithms to reduce inter-device variability [
1,
2,
3,
4,
5,
28,
29]. However, existing approaches still lack laboratory-verified ground truth and cross-device validation, which this study addresses through lab-standardized RGB data and multi-device calibration. Prior studies used RGB colorimetry and AI for water quality analysis. This work introduces a multi-output neural network that predicts both pollutant concentration and dominant color. Unlike earlier works, our model is directly calibrated against laboratory-measured chemical parameters (BOD, COD, TOC, TSS), ensuring that predictions have a physical and chemical basis rather than relying solely on empirical color correlations. Additionally, the methodology is designed for rapid, equipment-free preliminary assessment of wastewater and is adaptable for real-time applications, including mobile devices, which has not been addressed in prior studies. This study aims to develop a method for estimating pollutant levels in liquid waste using RGB color values and AI/ML prediction algorithms.
3. Results
Before developing an artificial intelligence (AI) model for identifying the origin of wastewater by color, a preliminary study was conducted to systematically classify various types of wastewater according to their RGB color values (
Table 2). During the study, wastewater collected from different sources—households, food industry, textile, metallurgy and other industrial sectors—was visually and chemically evaluated, and its color components (red, green, blue) were recorded. The obtained data were grouped into a table, in which typical RGB values were assigned to each wastewater concentration interval, as well as the probable type of wastewater and organic or chemical contamination [
1]. Based on this data, the AI model can later approximately determine the origin of wastewater by comparing the color profile of a real sample with previous data. The model created in this way allows for quick and effective prediction of the type of wastewater and its potential environmental hazard, since color analysis becomes the main indicator reflecting chemical and organic contamination. The data presented in
Table 1 show a clear correlation between the wastewater concentration, RGB color values, and the expected type of wastewater and its chemical or organic contamination level. At the lowest concentrations (0–5%), the wastewater is almost transparent, very light (R 249–255, G 250–255, B 251–255), which corresponds to household wastewater with minimal organic and chemical load. This type of wastewater almost does not change the color of the water, since the water component dominates in it, and the concentration of pollutants is very low. As the concentration increases to 6–15%, color changes become noticeable: the red (R) and green (G) components gradually decrease, while the blue (B) remains high (R 236–248, G 240–250, B 244–251). This indicates slightly turbid water, which may be light domestic or food industry wastewater with a low content of organic matter and chemical impurities. In this range, wastewater may already contain some soluble nutrients or detergent residues, which do not yet give an intense color shade [
1,
2,
3]. A further trend of increasing concentration (16–35%) leads to a more pronounced darkening of the color, especially a decrease in the red and green color components (R 209–235, G 219–239, B 229–243), while the blue still remains bright. This indicates light to moderate industrial pollution, possibly from food processing, textiles or other light industrial sources. Color changes in this range signal that the wastewater already contains chemical components, dyes or detergents, which begin to form a visible color difference from clear water. When the concentration reaches 36–55%, the colors darken even more (R 177–208, G 195–218, B 212–228), the red and green components continue to decrease, while the blue remains relatively high, giving the wastewater a dark blue or blue-gray tone. This corresponds to industrial wastewater with moderate to severe chemical contamination, which may contain dyes, chemical solvents or other industrial substances. Chemical reactions are observed in such wastewater, which change the color and increase the viscosity, and their organic and inorganic load becomes significant. At the highest concentrations (56–100%), the color changes become dramatic: the red and green components are strongly reduced (R 0–168, G 122–194) and the blue gradually darkens (B 168–211), creating a dark blue-gray or almost black color. This corresponds to highly concentrated, often highly toxic industrial wastewater, possibly from chemical, textile, metallurgical or other heavy industries. In such wastewater, chemical pollutants dominate and organic and inorganic substances reach dangerous concentrations; therefore, they can pose a risk to the environment and health. The color intensification in this interval is directly related to the increasing concentration of chemicals, and changes in the RGB components reflect both the presence of dyes or metal ions and the possible decrease in biological viability in the water. 
In summary, the table clearly shows that with increasing wastewater concentration: the color gradually darkens, the red and green components decrease, and the blue usually remains high or darkens in later intervals; at the same time, the probable chemical and organic contamination increases, and the type of wastewater changes from almost transparent domestic wastewater to highly concentrated, dark, potentially hazardous industrial discharges. This trend allows us to visually determine the concentration and type of wastewater based on RGB color values and predict its composition and hazard.
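The interval ranges quoted above can be turned into a simple lookup. The following is a minimal sketch, not the authors' procedure: it assumes that a sample is assigned to the interval whose tabulated R, G and B ranges are closest to the measured values, and the names `INTERVALS` and `classify_interval` are hypothetical.

```python
# Concentration intervals and RGB ranges as quoted in the text.
# Each row: (label, (R_min, R_max), (G_min, G_max), (B_min, B_max)).
INTERVALS = [
    ("0-5%", (249, 255), (250, 255), (251, 255)),
    ("6-15%", (236, 248), (240, 250), (244, 251)),
    ("16-35%", (209, 235), (219, 239), (229, 243)),
    ("36-55%", (177, 208), (195, 218), (212, 228)),
    ("56-100%", (0, 168), (122, 194), (168, 211)),
]


def _distance_to_range(value, bounds):
    """Return 0 if value lies inside bounds, else the gap to the nearest bound."""
    low, high = bounds
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0


def classify_interval(r, g, b):
    """Assumed rule: pick the interval with the smallest summed channel gap."""
    def total_gap(row):
        return sum(_distance_to_range(v, rng) for v, rng in zip((r, g, b), row[1:]))

    return min(INTERVALS, key=total_gap)[0]


print(classify_interval(240, 245, 248))
```

The nearest-range rule is one way to handle measurements that fall between tabulated ranges (for example, R values from 169 to 176); other tie-breaking choices are equally possible.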
The concentration of wastewater in this study is defined as the total amount of dissolved and suspended pollutants expressed as a percentage according to laboratory measurements, including biochemical oxygen demand (BOD5), chemical oxygen demand (COD), total organic carbon (TOC), and total suspended solids (TSS). In order to relate the concentration to specific chemical elements, the RGB color components were correlated with the main pollutants: the red (R) component reflects organic matter and nitrogen compounds, the green (G) component correlates with phosphorus and minerals, and the blue (B) component reflects metals, dyes, and other inorganic pollutants. Based on this relationship, it was determined that wastewater in the low concentration range (0–5%) corresponds to clear water of domestic origin, wastewater in the medium range (16–35%) corresponds to light to moderate industrial pollution, and wastewater in the high concentration range (91–100%) corresponds to highly concentrated, potentially hazardous industrial wastewater. In this way, RGB analysis not only allows for a preliminary assessment of the total concentration of pollutants but also provides information about the type of wastewater and its chemical composition, especially with regard to the components that determine color.
The script predicts the chemical composition of wastewater samples using a multilayer perceptron with two hidden layers of 64 neurons each, ReLU activation, and two output heads. The model is optimized with the Adam algorithm at a learning rate of 0.001, and the loss function combines a regression (MSE) and a classification (cross-entropy) component weighted 1:0.5. The input data consist of three normalized color channel intensities (red, green, blue), which are used to form a shared hidden representation. From this representation, the first output head predicts the total concentration of pollutants in the normalized interval [0, 1] and the second assigns the dominant color class (R, G or B). The predicted concentrations are further decomposed into percentages of individual color components, which are compared with reference values based on known chemical composition profiles (nitrogen, phosphorus, total organic matter, urea, minerals, metals, etc.). The performance of the model is evaluated by the mean absolute error in the total concentration regression and the classification accuracy in the color dominance task [
1,
2,
3,
4]. Such an architecture allows for the simultaneous assessment of the total amount of pollutants, identification of dominant components, and automatic assignment of wastewater samples to the appropriate chemical groups.
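The architecture described above might be sketched in PyTorch as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: the class name `MultiHeadRGBNet`, the sigmoid on the concentration head (used here only to keep the output in [0, 1]), the placement of dropout after each hidden layer, and the random training batch are all assumptions.

```python
import torch
from torch import nn


class MultiHeadRGBNet(nn.Module):
    """Shared trunk (2 x 64, ReLU) with a regression head and a classification head."""

    def __init__(self, hidden=64, dropout=0.2, n_classes=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
        )
        # Assumption: a sigmoid keeps the predicted concentration in [0, 1].
        self.conc_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        # Raw logits for the dominant color class (R, G or B).
        self.color_head = nn.Linear(hidden, n_classes)

    def forward(self, rgb):
        h = self.trunk(rgb)
        return self.conc_head(h).squeeze(-1), self.color_head(h)


def combined_loss(conc_pred, conc_true, logits, color_true, cls_weight=0.5):
    """MSE plus cross-entropy, weighted 1:0.5 as stated in the text."""
    mse = nn.functional.mse_loss(conc_pred, conc_true)
    ce = nn.functional.cross_entropy(logits, color_true)
    return mse + cls_weight * ce


torch.manual_seed(0)
model = MultiHeadRGBNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One training step on made-up data (random RGB values already scaled to [0, 1]).
rgb = torch.rand(8, 3)
conc_true = torch.rand(8)
color_true = rgb.argmax(dim=1)  # dominant channel used as the class label

model.train()
conc_pred, logits = model(rgb)
loss = combined_loss(conc_pred, conc_true, logits, color_true)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Sharing the trunk lets both heads learn from the same three inputs, while the 1:0.5 weighting keeps the regression term dominant in the total loss.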
The training and validation losses show a clear trend, which allows us to assess the learning progress of the model and its generalization capabilities (Figure 1). From epoch 0 to 75, the training loss dropped from 0.65 to 0.10 and the validation loss decreased from 0.65 to 0.22. This indicates that the model is learning general data patterns, and that the accuracy of its predictions on both the training and validation sets is consistently improving. However, from approximately epoch 75, the training loss continues to decrease (from 0.10 to 0.05), while the validation loss stops decreasing and starts to increase (from 0.22 to 0.25). This tendency is a classic sign of overfitting, where the model adapts too closely to the details of the training data and loses the ability to generalize and accurately predict unseen data. We conclude that the optimal stopping point is reached at around epoch 70–80, and that further training is no longer useful and may degrade the model’s performance. To avoid overfitting, an early stopping strategy is recommended, in which training is stopped if the validation loss does not improve for a set number of epochs.
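The early stopping rule recommended above can be sketched as a small helper. This is a generic patience-based version, assuming one validation loss per epoch; the function name, the `patience` and `min_delta` parameters, and the example loss values are illustrative and not taken from the study.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has not improved by more
    than min_delta for `patience` consecutive epochs. If that never happens,
    the last epoch index is returned as stop_epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation curve that falls and then rises, as in an overfitting run.
example_losses = [0.65, 0.40, 0.30, 0.22, 0.23, 0.24, 0.25]
stop_epoch, best_epoch = early_stopping_epoch(example_losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice the weights saved at `best_epoch` would be restored after stopping, so that the later, overfitted epochs are discarded.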
The optimal dropout value (p = 0.2) was determined empirically, taking into account theoretical guidelines for regularization according to Bayesian principles. The results showed that this value provides a good balance between training accuracy and generalization, consistent with theoretical principles: as the Bayesian interpretation suggests, a regularization strength of this order is often optimal for achieving stable learning [
1,
2,
3,
4,
5]. In addition, a preliminary analysis was performed using an adaptive dropout method, Concrete Dropout, which allows the network to determine optimal dropout values during training. Although this method was not fully integrated into the final model due to computational cost and complexity, the results confirmed that the optimal values range from approximately 0.18 to 0.25, which is broadly consistent with the empirically chosen values. This reinforces the claim that a dropout value = 0.2 is not only practically effective but also theoretically sound. Future work is planned to further integrate adaptive adjustment methods in order to further reduce overfitting in more complex prediction scenarios.
The prediction results show that the model predicts the total concentration accurately, with MAE = 0.012 and RMSE = 0.018. The model accuracy results are presented in Table 3. These values correspond to a very low mean absolute error and root mean square error, so the model predictions closely agree with the measured concentrations. The MAPE value (2.1%) confirms that the average percentage error is small, which supports applying the model to real data. The coefficient of determination R² = 0.96 indicates that about 96% of the variance in the measured concentrations is explained by the model predictions, which confirms the ability of the model to capture the structure and trends of the data. The accuracy of the classification task (~0.93) indicates that the model reliably distinguishes the dominant color, which is important for pigment identification [
1,
2,
3,
4,
5]. This result suggests that the classification based on RGB components is sufficiently informative and can be used for further predictions of wastewater groups or chemical composition. Overall, the high regression and classification accuracy indicates that the model is able to effectively integrate both digital RGB data and chemical concentration information. This allows it to be applied to both accurate concentration predictions and pigment identification, and the results are sufficiently reliable for practical analyses and further research.
To assess the model’s ability to generalize to new data, a 5-fold cross-validation was performed. The multi-head neural network showed an average MAE = 0.013 ± 0.002, RMSE = 0.019 ± 0.003, and R² = 0.95 ± 0.02, indicating stable model accuracy across different data subsets. In addition, 95% confidence intervals were calculated using the bootstrapping method: MAE range 0.011–0.015 and R² range 0.93–0.97, confirming the reliability of the model’s output. In order to demonstrate the added value of the multi-head network, the results were compared with simpler models: Linear Regression obtained MAE = 0.024, RMSE = 0.035, and R² = 0.88 and Random Forest Regressor obtained MAE = 0.017, RMSE = 0.025, and R² = 0.92. These results show that the multi-head neural network not only predicts the total pollutant concentration more accurately but also allows for the simultaneous determination of the dominant color, which is not possible for simpler models. Therefore, the multi-head network architecture provides a clear advantage in terms of both prediction accuracy and additional chemical information, confirming the reliability of the methodology for preliminary remote sensing applications.
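A bootstrap confidence interval of the kind mentioned above can be sketched with the standard library alone. This is a minimal percentile bootstrap, assuming paired lists of measured and predicted concentrations; the study does not state which bootstrap variant or how many resamples were used, so `n_resamples`, `seed`, and the four illustrative data pairs are assumptions.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` at confidence level 1 - alpha."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample sample indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative normalized concentrations (not the study's data set).
measured = [0.05, 0.20, 0.50, 1.00]
predicted = [0.049, 0.201, 0.502, 1.002]
ci_low, ci_high = bootstrap_ci(measured, predicted, mean_absolute_error)
print(ci_low, ci_high)
```

With only four pairs the interval is coarse; it is meant to show the resampling mechanics rather than to reproduce the reported 0.011–0.015 range.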
To assess the potential risk of label leakage associated with the dominant color class head, we performed an additional analysis. Although this class is derived from the RGB input (according to the largest channel), our multi-head network uses not only absolute RGB values but also their relationship to chemical concentrations determined in laboratory measurements (R → organic/N, G → P/minerals, B → metals/dyes). To test whether the network trivially learns classification from the regression target, an ablation experiment was performed: the model trained without the regression head (classification only) showed similar but slightly lower classification accuracy (~0.91 vs. 0.93), while removing the regression head from the model reduced the accuracy of chemical component predictions. This analysis shows that the classification head does not depend solely on direct RGB classification, and the network learns to associate color proportions with real chemical properties. Therefore, although the dominant color class is partially input-dependent, the network architecture and additional chemical supervision ensure that the classification makes physicochemical sense and that the risk of label leakage is minimal.
To ensure the reliability of the results, the accuracy of the model was not only evaluated in the RGB space but also compared with baseline models in the HSV and Lab color representations. Linear and Ridge regression models trained on the HSV and Lab channels showed a larger mean absolute error (MAE) range (0.020–0.030) and a lower R² (0.85–0.90) than the proposed multi-head neural network, while Random Forest models yielded MAE ~0.018–0.022 and R² ~0.91–0.93. The 95% confidence intervals for MAE were 0.011–0.015 and the R² range was 0.93–0.97, confirming the stability of the network in different data subsets. Despite the high accuracy, the method has limitations: the results are sensitive to lighting conditions and to camera or device characteristics, so any application on different devices requires calibration. Furthermore, the RGB/HSV/Lab methodology is based on data from existing pollutant species and therefore its generalizability to other chemicals may be limited. Field studies and validations are needed to assess the applicability of the methodology under real-world conditions and mixtures of different pollutants. In this regard, it is suggested to integrate device specification control, calibration protocols, and testing with new pollutant combinations before widespread implementation of the system in real-world environments.
During the study, it was observed how the color of the solution changes with increasing concentration from 0 to 100%, analyzing RGB (red, green, blue) data, which allows us to draw conclusions about the color transformation and its possibilities for determining concentration (
Figure 2a). The initial color of the solution at 0% concentration was almost white (R = 255, G = 255, B = 255), indicating that the material is practically transparent or light white, and, at low concentrations, the color changes are minimal. As the concentration increases, the value of the red (R) component decreases almost linearly from 255 to 0, while the green (G) and blue (B) components decrease more slowly, from 255 to 122 and 168, respectively, so the color of the solution gradually changes from light gray or white to a greenish-blue (cyan/teal) hue, and, from 50% concentration, it becomes noticeably darker, acquiring a gray-green with a blue tint, which at high concentrations (90–100%) turns into a dark greenish-blue tone, as the red component almost disappears. The decrease in the red color is the fastest and main factor determining the color shift, while the green and blue components, decreasing more slowly, maintain their respective shades even at high concentrations. These trends allow visual observation of color changes based on the ratio of RGB components without special equipment, and the consistent and almost linear color change makes it possible to use RGB data for concentration visualization, color scale, or visual indicators, allowing for quick and accurate assessment of substance concentration in laboratory studies or chemical experiments without complex measuring devices [
2,
3,
4,
5,
6].
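The reported endpoints of the color transition can be used for a rough forward and inverse mapping. The sketch below is an idealization: it assumes that all three channels change exactly linearly between the reported values at 0% (255, 255, 255) and 100% (0, 122, 168), whereas the text describes the red channel only as almost linear. The function names are hypothetical.

```python
WHITE = (255, 255, 255)  # reported color at 0% concentration
FULL = (0, 122, 168)     # reported color at 100% concentration


def rgb_at(concentration_pct):
    """Interpolate the expected RGB color linearly between the two endpoints."""
    t = min(max(concentration_pct, 0.0), 100.0) / 100.0
    return tuple(round(w + t * (f - w)) for w, f in zip(WHITE, FULL))


def concentration_from_red(r):
    """Invert the assumed linear red-channel trend (255 -> 0) to a percentage."""
    r = min(max(r, 0), 255)
    return 100.0 * (255 - r) / 255


print(rgb_at(100))
print(concentration_from_red(0))
```

Because the red channel spans the full 0–255 range, it gives the finest resolution for the inverse mapping; the green and blue channels cover narrower ranges and are less sensitive.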
These observations relate wastewater concentration (0–100%) to color changes measured by RGB values and are consistent with the tabulated data. At a concentration of 0–5%, wastewater is almost transparent, domestic, with minimal organic load (R = 255, G = 255, B = 255). As the concentration increases, the color changes from light gray to greenish-blue, and, from 50%, it becomes dark gray-green with a blue tint, indicating industrial, potentially toxic wastewater (96–100%: R = 0–59, G = 122–131, B = 168–172). The decrease in the red component is the main factor in the color change. These changes allow visual determination of the type and concentration of wastewater, using RGB data as an indicator without complex instruments.
To verify the reliability of the model predictions, laboratory-measured concentrations were compared with the model predictions (Figure 2b). As the scatter plot shows, the model predictions are consistent with actual measurements over the entire 5–100% concentration range. For example, the 5% actual concentration was predicted as 4.9%, 20% as 20.1%, 50% as 50.2%, and 100% as 100.2%. Most of the predictions differ from the actual measurements by no more than ±0.3%, and the 45° line drawn on the graph visually demonstrates good agreement: points close to the line indicate that the model predictions closely match the measured data. This direct comparison of actual and predicted concentrations supports the conclusion that RGB color intensity analysis combined with a multi-head neural network can reliably provide preliminary estimates of pollutant concentrations. The visualization is consistent with the low MAE (0.012) and RMSE (0.018) and the high R² (0.96), meaning that about 96% of the variance is explained by the model predictions. The use of averages allows for a clear assessment of the general trend across all concentration ranges, confirming that the multi-head neural network reliably predicts both low and high pollutant concentrations, and that the analysis of RGB color components can be used effectively for preliminary assessment of water pollutants without complex laboratory equipment.
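The accuracy measures quoted above follow from their standard definitions. As a worked illustration, the sketch below computes them for the four actual/predicted pairs given in the text. The function name is hypothetical, and MAPE is only defined here for nonzero actual values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in %) and R^2 for paired measurements."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Actual and predicted concentrations (in %) quoted in the text.
actual = [5.0, 20.0, 50.0, 100.0]
predicted = [4.9, 20.1, 50.2, 100.2]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

On these four pairs alone the MAE is 0.15 percentage points. The values in Table 3 refer to the full evaluation set on the normalized [0, 1] scale, so they are not expected to match this small example.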
The RGB-based method shows very high accuracy at low turbidity levels (0–50 mg/L), with MAE of 0.012. As turbidity increases, the error gradually rises, reaching MAE values above 0.05 for TSS > 200 mg/L. This indicates that suspended solids significantly influence the color-based measurements, reducing reliability at high turbidity. For practical applications, the method is recommended for samples with TSS below ~150 mg/L, while higher turbidity requires caution or alternative measurement strategies (
Table 4).
The method performs optimally in near-neutral pH conditions (6–8) with MAE = 0.012. Deviations from neutrality, both acidic (<5) and alkaline (>9), increase errors to MAE = 0.035–0.040. Extreme pH values may alter the solution’s color or the dominant RGB channel, causing mispredictions.
Therefore, preliminary wastewater assessment using RGB values is most reliable for pH between 6 and 8, and caution is advised outside this range (
Table 5).
Temperature has a moderate effect on method accuracy. The RGB method performs stably at room temperature (15–25 °C) with MAE = 0.012. Slight increases in error are observed at lower (10–15 °C) or higher (25–35 °C) temperatures (MAE ~0.018–0.020), while extreme temperatures (<10 °C or >35 °C) lead to more noticeable deviations (MAE 0.028–0.030). These results suggest that controlling sample temperature or applying calibration for temperature variation can improve robustness (
Table 6).
Container and background color strongly affect RGB-based predictions. Optimal accuracy (MAE = 0.012) is achieved with white or neutral backgrounds (
Table 7). Light pastel colors slightly increase MAE (~0.018), while gray or dark backgrounds further reduce accuracy (MAE 0.025–0.035). Colored containers produce the highest errors (MAE 0.040), likely due to reflections or color contamination. Therefore, standardized white or neutral backgrounds are recommended for reliable measurements.
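The recommended operating ranges for turbidity, pH, temperature and background can be collected into a simple pre-measurement check. This is a sketch only: the thresholds are the ones stated in the text (TSS below ~150 mg/L, pH 6–8, 15–25 °C, white or neutral background), while the function name, argument names and warning wording are hypothetical.

```python
def check_measurement_conditions(tss_mg_l, ph, temperature_c, background):
    """Return a list of warnings for conditions outside the recommended ranges."""
    warnings = []
    if tss_mg_l > 150:
        warnings.append("TSS above ~150 mg/L: suspended solids reduce reliability")
    if not 6 <= ph <= 8:
        warnings.append("pH outside 6-8: color or dominant channel may shift")
    if not 15 <= temperature_c <= 25:
        warnings.append("temperature outside 15-25 C: expect a higher error")
    if background.strip().lower() not in {"white", "neutral"}:
        warnings.append("non-white background: use a white or neutral one")
    return warnings


print(check_measurement_conditions(40, 7.0, 20, "white"))
print(check_measurement_conditions(250, 4.5, 38, "blue"))
```

An empty list means that all four conditions fall within the ranges for which the lowest error (MAE = 0.012) was reported.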
Comparing the proposed RGB color intensity and multi-head neural network method with other similar machine learning models, it is obvious that it has several essential advantages. Traditional single-output perceptrons or simple regression models can only predict the total concentration of pollutants but cannot simultaneously determine the dominant color, which would provide additional information about the chemical composition. Separate tree-based methodologies, such as Random Forest or Gradient Boosting, although they accurately predict the concentration, are not suitable for direct classification of RGB numerical data and cannot integrate both regression and classification tasks in a single model [
1,
2,
3,
4,
5]. Convolutional Neural Network (CNN) methods based on color value analysis have high accuracy and are able to exploit spatial RGB relationships, but require large amounts of data and complex infrastructure, which limits their practicality for mobile devices or real-time analysis. LSTM or RNN models for time-series RGB data analysis are well suited to sequence modeling, but for individual single-sample measurements they are often too complex and take a long time to train. The proposed multi-head neural network addresses these problems and provides several important advantages: it simultaneously predicts the total pollutant concentration and dominant color, allows for visual assessment of wastewater characteristics, works with simple RGB numerical data, does not require large image datasets and is easily adaptable to real-time use, including in mobile applications. In addition, the model is characterized by high accuracy (MAE = 0.012, RMSE = 0.018, MAPE = 2.1%, R² = 0.96, classification accuracy ≈ 0.93), which indicates that the predictions almost coincide with the measured data. Changes in RGB components visually reflect the chemical and organic load, so the method is not only accurate but also easy to interpret, allowing preliminary decisions to be made without complex laboratory equipment. Compared to other similar models, this method combines accuracy, speed, and adaptability, making it particularly useful for preliminary wastewater assessments and real-time monitoring systems.
Based on this study, which developed a neural network that uses RGB color intensity measurements to predict the concentration of pollution in liquid waste, we plan to develop a mobile application for smartphones and other mobile devices in the future. This application will allow users, such as environmental professionals, industrial workers or even citizens, to quickly and easily analyze wastewater samples in real time, without the use of expensive laboratory equipment [
3]. Using the device’s camera, the application will take a picture of the wastewater sample, automatically extract the RGB color components and, based on the trained model, approximately determine the total concentration of pollution in percent (from 0% to 100%), the dominant color class (red, green or blue), the possible type of wastewater (e.g., domestic, food industry, textile or chemical origin), and the level of chemical or organic contamination. In addition, it will be able to provide a preliminary hazard assessment based on color changes, such as a decrease in the red component, and offer recommendations for further research, thus facilitating environmental monitoring and rapid response to pollution problems.