1. Introduction
Texture segmentation plays an important role in image analysis and understanding. It involves identifying regions that share the same texture features so that further analysis can be performed on each region separately. An effective segmentation algorithm is useful in many areas, such as industrial monitoring of product quality, medical image analysis, or image retrieval.
This paper focuses on the analysis of images obtained from wastewater treatment plants. In these plants, various processes are carried out to facilitate the removal of organic matter, nitrification, denitrification of the influent, as well as the removal of suspended solids.
In the case of plants such as the ones used in this research, the removal of organic matter and nitrogen in the MBBR (Moving Bed Biofilm Reactor) [
1] is achieved through a biological treatment using AnoxKaldnes technology. AnoxKaldnes™ technology is based on the growth of biomass (in the form of a biofilm) on continuously moving plastic media in the biological reactor. These media are small but have a high specific surface area per unit volume, providing a high level of contact between the wastewater and the biofilm, which thus consumes the organic matter in the wastewater.
In the event of abnormal chemical spills in the water, aeration issues, or the accumulation of suspended solids, large amounts of foam can form in the biological treatment tanks [
2]. The foam has to be removed because it can cover the tanks, causing serious operational problems and preventing proper aeration. Currently, this activity is performed manually at the wastewater treatment plant considered in this research, without any metric indicating the most appropriate time to apply defoamers.
Foam analysis has traditionally been performed in a manual manner in different domains of study. However, manual measurement of some foam and liquid properties can be time-consuming, and certain important geometric and dynamic measures are difficult or impossible to quantify using manual methods.
In the work by Collivignarelli et al. [
2], two on-site foam measurement methods, namely Foam Surface Covered (FSC) and Foam Volume (FV), were described. It is important to note that these methods come with certain limitations, such as the requirement for the camera to be positioned orthogonally to the tank or the necessity of employing devices such as hydrometers to calculate the foam volume.
In the study by Wang et al. [
3], a method based on texture analysis was proposed to effectively segment liquid from foam. Additionally, the approach aims to identify the boundaries of individual bubbles within the foam layer. Furthermore, alternative techniques for foam segmentation based on the watershed method were explored in Refs. [
4,
5]. Forbes and de Jager [
4] presented a novel approach that combines a texture measure with two stages of watershed. This method has been demonstrated to facilitate the segmentation of images containing both large and small bubbles. In the context of Ref. [
5], various implementations of the watershed method are presented for both semantic and instance segmentation. However, it is important to note that these papers primarily concentrate on foams formed by relatively large bubbles, which differ from the characteristics of the foams observed in the wastewater treatment plant tanks that will be the focus of our investigation.
In Ref. [
6], an image processing method is presented for quantifying foam coverage across the entire surface of tanks in sewage plants. However, the paper notes that the effectiveness of this method may be influenced by complex environmental factors.
Given the limitations of the classical methods documented in the literature, deep learning methods were adopted due to their versatility in handling images with diverse characteristics.
To the best of our knowledge, there are currently no deep learning algorithms specifically tailored for foam segmentation. Therefore, the decision was made to employ models created to solve a similar problem, specifically, texture segmentation.
Various deep learning algorithms for segmenting and classifying textures in different domains are documented in the literature. Typically, these models utilize classic Convolutional Neural Networks (CNN) for image segmentation to extract characteristic textures. For instance, in Ref. [
7], diverse deep learning models, such as AlexNet, VGG16, or ResNet34, are utilized to identify various diseases in tomato leaves. The prompt identification and timely treatment of these diseases can mitigate potential losses in tomato production. Similarly, in Ref. [
8], convolutional neural networks are applied to segment breast density in mammographic imaging.
Likewise, in Ref. [
9], an energy-efficient system based on deep convolutional neural networks is created for early smoke detection in both normal and foggy environments. The proposed architecture is a VGG-16 pre-trained on ImageNet. This model is then fine-tuned on a new dataset, created by the authors of the paper, consisting of 72,012 images from 4 classes: nonsmoke, smoke, nonsmoke with fog, and smoke with fog. In 2021, a new work was published [
10], presenting a CNN-based smoke detection and segmentation framework for both clear and hazy environments. For detection, an EfficientNet architecture is introduced, which outperforms the results obtained in Ref. [
9]. Concerning the semantic segmentation of the smoke regions, a DeepLabv3+ model is used. The classification and segmentation models are tested on the dataset presented in Ref. [
9], obtaining better metrics than previous models. Finally, in Ref. [
11], a highly adaptable model for texture segmentation was introduced, namely One Shot Texture Segmentation (OSTS), which does not require fine-tuning on specific problems or a large amount of annotated data for training. The key feature of the proposed architecture is the use of two input images: the image to be segmented and an image of the texture to be segmented. Additionally, this model is trained on a novel dataset named CollTex, created from texture images sourced from the Describable Textures Dataset (DTD) (
https://www.robots.ox.ac.uk/~vgg/data/dtd/ (accessed on 2 December 2022)). The achieved results surpass 90% accuracy on images composed of two textures.
Recognizing potential challenges in acquiring a sufficient quantity of images with adequate variability from an individual wastewater treatment plant, the feasibility of employing federated learning instead of a centralized approach was explored to preserve data privacy and increase communication efficiency [
12]. An analysis of different aggregation methods and major libraries was conducted, examining their functionality, advantages, and disadvantages.
The most commonly used aggregation method is FedAvg (Federated Averaging), which involves selecting a subset of client nodes in each iteration and applying a weighted average of their local models to update the global model. Another prominent method is FedSGD (Federated Stochastic Gradient Descent), which averages the clients' gradients in each round.
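As an illustration, the following is a minimal sketch of one FedAvg aggregation round, assuming PyTorch state dictionaries; the weighting follows the number of local training samples held by each selected client.

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client model parameters (FedAvg).

    client_states: list of state_dict() objects from the selected clients
    client_sizes:  number of local training samples per client
    """
    total = sum(client_sizes)
    global_state = {}
    for key in client_states[0]:
        # Weight each client's parameters by its share of the total data
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```

The server then sends the aggregated state back to the clients, and the process repeats for the configured number of rounds.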
As for the main frameworks or libraries, Flower (
https://flower.dev (accessed on 7 November 2022)), NVIDIA Flare (
https://developer.nvidia.com/flare (accessed on 9 November 2022)), TensorFlow Federated (
https://www.tensorflow.org/federated (accessed on 8 November 2022)), and IBM Federated Learning Framework (
https://www.ibm.com/docs/en/cloud-paks/cp-data/4.7.x?topic=models-federated-learning (accessed on 8 November 2022)) stand out. These libraries, except for NVFlare, are known for their user-friendly nature, making them highly attractive tools for research purposes. However, their commercial use is restricted by the limited security they provide. NVFlare, on the other hand, is specifically designed for developing commercial products and addresses these security concerns.
The fundamental purpose of this initiative is to achieve a more efficient management of the WWTP. Therefore, our main contributions are the following:
The implementation of innovative solutions aimed at reducing costs and improving efficiency. To this end, we introduce a computer vision system that enables real-time monitoring of foam parameters, ensuring timely and accurate treatment interventions;
A comparison between centralized and federated data treatment.
To achieve the objectives, two foam segmentation models, DeepLabv3+ [
10] and OSTS [
11], have been implemented and trained using images from a real WWTP. The former was selected due to DeepLabv3+'s strong performance in smoke segmentation and the visual similarity between smoke and foam; the latter contributes its ability to perform well on small training datasets.
The foam percentage on the tank surface can be calculated from the segmented foam. This is crucial for determining the appropriate time to remove excess foam, thereby improving the response speed to foam accumulation.
Moreover, the foam segmentation methodology is part of a complete computer vision pipeline that includes image capture, the selection of the suitable images for further processing, and the transmission of operationally significant data to the responsible personnel at the WWTP.
Regarding federated learning, the possibility of working with multiple clients has been explored in order to create a flexible model that works in different environments while maintaining the privacy of each client’s data and reducing data communication compared to centralized training.
This study surpasses prior research by leveraging powerful and highly flexible models, utilizing datasets composed of a large number of images with diverse characteristics, and offering a complete remote sensing system based on image analysis to monitor foam coverage on WWTP tanks, which ranges from the automatic and periodic image capturing process to the sending of alerts to WWTP operators.
The rest of this work is organized as follows.
Section 2 provides the datasets that are used throughout the work and describes the proposed methodology. The experimental evaluations and the characteristics of the deployment are described in
Section 3. Finally, this work is concluded in
Section 4 with the key findings and future directions.
3. Results
In this section, the results obtained from the various experiments conducted are described.
First, the outcomes of training the models on publicly available texture datasets are presented. Next, the pretrained models from that step are retrained on the manual dataset. Then, the training results obtained with the final dataset are summarized. Lastly, the two evaluated training paradigms, centralized and federated, are compared.
Regarding the evaluation metrics, Dice and IoU scores are utilized to assess the performance:
IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth. This metric ranges from 0 to 1 (0–100%) with 0 signifying no overlap and 1 meaning perfectly overlapping segmentation;
The Dice metric is defined as twice the area of overlap divided by the total number of pixels in both images. It is equivalent to the F1-score used in classification problems, so it balances precision and recall.
Both metrics are positively correlated: if classifier A is better than classifier B under one metric, it is also better under the other. The difference is that IoU penalizes under- and over-segmentation more heavily than Dice.
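In terms of the set of predicted foam pixels P and the set of ground-truth foam pixels G, the two metrics can be written as

```latex
\mathrm{IoU}(P,G) = \frac{|P \cap G|}{|P \cup G|}, \qquad
\mathrm{Dice}(P,G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad
\mathrm{Dice} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}.
```

The last identity makes the monotone relationship between the two scores explicit.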
3.1. Training with Public Texture Datasets
Since the manual dataset consists of a limited number of images with little variation in lighting conditions, the models were first trained on public datasets and then transferred to the manual dataset via transfer learning.
3.1.1. OSTS
The OSTS model was trained using 2000 collage images, as presented in
Figure 3, of which 1500 formed the training set, 250 were used for validation, and 250 for testing. Regarding the training parameters, the model was trained for 29 epochs, which took approximately 1 h and 15 min.
The metrics on the test set are shown in
Table 1. In addition to the metrics,
Figure 12 presents the segmentation results for three images from the test set.
3.1.2. Deeplabv3+
The DeepLabv3+ model was trained using 395 images of size 128 × 128 from the DeepSmoke dataset presented in Ref. [
10]. Out of these images, 75% were used for training, while the remaining 25% were divided between validation and testing. The model underwent 38 epochs of training, which lasted approximately 11 min.
This model allows the use of different encoders. In order to find the model that provides the best results, the metrics on the test set were compared using MobileNet, ResNet50, and HRNetv2-32 as encoders (
Table 2). Although the metrics for the different models are similar, ResNet50 was chosen as the encoder due to its simplicity.
Figure 13 displays the segmentation results on the test set using this model.
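The paper relies on a DeepLabv3+ implementation with interchangeable encoders; as an illustration only, the following sketch assumes the segmentation_models_pytorch library (encoder availability depends on the installed version), with a single sigmoid-activated output channel for the binary foam/background task.

```python
import segmentation_models_pytorch as smp

def build_deeplab(encoder_name: str = "resnet50"):
    # Binary foam segmentation: one output channel with sigmoid activation
    return smp.DeepLabV3Plus(
        encoder_name=encoder_name,      # e.g. "mobilenet_v2" or "resnet50"
        encoder_weights="imagenet",     # ImageNet-pretrained encoder
        in_channels=3,
        classes=1,
        activation="sigmoid",
    )

# Candidate encoders can be built and evaluated on the same test set
candidates = {name: build_deeplab(name) for name in ["mobilenet_v2", "resnet50"]}
```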
3.2. Transfer Learning to Manual Dataset
Both the OSTS and DeepLabv3+ models were retrained with the manual dataset. The results obtained in the test set are shown in
Table 3.
The segmentation results for three different images are shown for both the OSTS and DeepLabv3+ models (
Figure 14). Both models achieve good metrics, with DeepLabv3+ obtaining slightly better results in both IoU and Dice. Moreover, although the metrics can still be improved, the qualitative results are fairly good, given that it is often very difficult to decide what is and what is not foam.
3.3. Training with the Final Dataset
The installation of the PTZ camera at the WWTP made it possible to capture a sufficient number of images to train the models without the need for transfer learning. Additionally, the variability in lighting conditions (
Figure 6) throughout the day makes it possible to train a robust model capable of handling changes in image brightness. Such diverse images (with varying lighting, shadows, and different weather conditions such as rain or wind) enable the construction of a more flexible model, but they also make it more challenging to obtain a highly accurate predictive model.
To train the best predictive model, images from both tanks of the WWTP were utilized. Specifically, the training process involved 2000 images from each tank, with 1500 images used for training, 250 for validation, and 56 for testing.
Regarding the model, DeepLabv3+ was chosen. This decision was made because, although the OSTS model is also suitable for segmenting images without variations in lighting conditions, the challenge of selecting a valid reference texture for all types of images makes it less effective for the entire image set.
Table 4 and
Figure 15a show, respectively, the metrics obtained by DeepLabv3+ on the test set and the application of the trained model to three images from the test set. The obtained metrics are good, although the variability of lighting conditions in the image dataset prevented the models from achieving higher scores. Moreover, it is important to point out that the difference between the IoU and Dice metrics on the final dataset is smaller than when performing transfer learning. This is because more instances of poor segmentation appear during transfer learning, and IoU tends to penalize these cases more heavily. Furthermore, it can be observed in the test set images that it is not always clear what is and what is not foam, which in general makes this segmentation problem difficult to solve.
3.4. Federated Training
Regarding the model parameters, two clients were selected, each containing the data from one of the tanks in the WWTP. In this case, similar to centralized training, each client was trained with 2000 images.
As shown in
Table 4, the results are similar to those obtained in centralized training. This closeness can be attributed to the similarity of the datasets from both clients. Results of the application of the federated model to the final dataset can be seen in
Figure 15b.
The parameters used in the different training processes are as follows: an image size of 128 × 128, training for up to 200 epochs with early stopping after 10 epochs without improvement in the validation loss, a learning rate of 0.001 decayed by a factor of 0.1 every 20 epochs, a batch size of 32, and Adam as the optimizer.
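A minimal sketch of this configuration, assuming PyTorch; the model, the data loaders, and the per-epoch helpers are placeholders for the components described above.

```python
import torch

def train(model, train_loader, val_loader, train_one_epoch, evaluate):
    """Training loop matching the reported configuration (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Decay the learning rate by a factor of 0.1 every 20 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    best_val, stale = float("inf"), 0
    for epoch in range(200):
        train_one_epoch(model, train_loader, optimizer)
        val_loss = evaluate(model, val_loader)
        scheduler.step()
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            stale += 1
            if stale >= 10:   # early stopping: 10 epochs without improvement
                break
```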
3.5. Deployment
The deployment process was conducted on a server equipped with an NVIDIA GeForce RTX 3060 GPU. It consisted of four steps: image capture, validation of the captured images for correctness or anomalies, prediction of the percentage of foam over the total surface area of each tank, and the transmission of the data and images to the event server, which stores and monitors the foam measurements and notifies the WWTP operator if the amount of foam exceeds a certain threshold (
Figure 16).
3.5.1. Capture of Images
The image capture process is set to acquire 5 images from each tank of the WWTP every 10 min. The captured images are stored in a MinIO (
https://min.io/ (accessed on 17 January 2023)) bucket for further processing by the vision algorithms. The acquisition time for 5 images from each of the 2 camera positions capturing each tank is 24 s.
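A minimal sketch of this storage step, assuming the official minio Python client; the endpoint, credentials, bucket name, and object-naming scheme are illustrative placeholders rather than the deployed configuration.

```python
from datetime import datetime, timezone
from minio import Minio

# Endpoint, credentials, and bucket name are placeholders
client = Minio("minio.local:9000", access_key="...", secret_key="...", secure=False)

def store_capture(image_path: str, tank_id: int) -> str:
    # Object key encodes the tank and UTC capture time, e.g. tank1/20230117T104500.jpg
    key = f"tank{tank_id}/{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.jpg"
    client.fput_object("wwtp-captures", key, image_path, content_type="image/jpeg")
    return key
```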
3.5.2. Image Validation against Anomalies
Method
The purpose of introducing anomaly detection algorithms in the preprocessing stage is to identify images that do not correspond to the expected reality. This can be due to problems with the position of the PTZ camera, the presence of unexpected elements near the tanks, such as people or seagulls, or cyberattacks. Three types of checks are performed to determine whether an image is anomalous:
- (i)
An image is considered anomalous if its size differs from the defined one;
- (ii)
An image is considered anomalous if it is exactly the same as the previous image. To verify whether two images are identical, the Mean Square Error (MSE) between them is calculated; a value of zero indicates that the images are pixel-identical. This check helps detect possible “man-in-the-middle” attacks that provide fake images through loops or repetitions (a minimal sketch of the first two checks is given after this list);
- (iii)
An image is considered anomalous based on similarity if its similarity percentage with a reference image falls below a threshold.
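A minimal sketch of the first two checks, assuming NumPy arrays as produced by the capture step; the expected resolution is an illustrative placeholder.

```python
import numpy as np

EXPECTED_SHAPE = (1080, 1920, 3)   # assumed capture resolution, set per camera

def is_wrong_size(img: np.ndarray) -> bool:
    # Check (i): the image does not have the expected dimensions
    return img.shape != EXPECTED_SHAPE

def is_repeated(img: np.ndarray, previous: np.ndarray) -> bool:
    # Check (ii): an MSE of 0 means the new capture is pixel-identical to the previous one
    if img.shape != previous.shape:
        return False
    mse = np.mean((img.astype(np.float64) - previous.astype(np.float64)) ** 2)
    return mse == 0.0
```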
To further understand the possible anomalies, a more in-depth explanation of the checks is provided.
The first two checks are straightforward. In the case of the third, given a non-anomalous reference image, four areas or patches have been selected as reference zones (marked with a green rectangle in
Figure 17). Then, to check if an image is anomalous based on similarity, the Learned Perceptual Image Patch Similarity (LPIPS [
16]) is calculated between the reference areas of the reference image and the reference areas of the newly captured image. LPIPS computes the similarity between the activations of two image patches in a predefined neural network, and it has been shown to align well with human perception. In this case, a VGG network is used as the reference network. Moreover, to make the method robust to certain occlusions, only the top 3 patches with the highest similarity to the patches of the reference image are considered.
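A minimal sketch of this similarity check, assuming the lpips Python package; the patch-extraction and normalization details are not specified in the paper, so the helpers below only illustrate the top-3 aggregation. Since LPIPS is a distance (lower means more similar), an image is flagged when the aggregated distance exceeds the threshold of 0.3 reported in the validation below.

```python
import lpips

loss_fn = lpips.LPIPS(net="vgg")   # VGG backbone, as used in the paper

def patch_distance(ref_patches, new_patches, keep=3):
    """Mean LPIPS distance over the `keep` best-matching patch pairs.

    ref_patches / new_patches: lists of (1, 3, H, W) tensors with RGB values
    scaled to [-1, 1], one tensor per reference zone.
    """
    dists = [loss_fn(r, n).item() for r, n in zip(ref_patches, new_patches)]
    dists.sort()                    # keep only the most similar patches
    return sum(dists[:keep]) / keep

def similarity_anomaly(ref_patches, new_patches, threshold=0.3):
    # Check (iii): a large aggregated distance flags an anomalous capture
    return patch_distance(ref_patches, new_patches) > threshold
```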
Regarding the algorithm for detecting anomalies, it is based on comparing the newly captured image from the PTZ camera with a reference image. The reference image is set as the first image captured on the previous day. Subsequently, if the first image captured today matches the reference image, it becomes the new reference image. The second image captured today is compared to the first image of the day, provided that it was classified as correct. If the second image of the day is also deemed correct, it becomes the new reference image, and so on. If an image capture is detected as anomalous, it is not selected as the new reference image, and the last image detected as correct remains the reference. The objective of this process is to compare a new image with the most recent reference image to mitigate the influence of lighting changes.
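The update rule can be summarized in a short sketch that reuses the helpers from the previous snippets; extract_patches is an illustrative placeholder for cropping the four reference zones.

```python
def process_capture(new_image, reference_image):
    """Return the (possibly updated) reference image for the next comparison."""
    anomalous = (
        is_wrong_size(new_image)
        or is_repeated(new_image, reference_image)
        or similarity_anomaly(extract_patches(reference_image), extract_patches(new_image))
    )
    if anomalous:
        return reference_image   # an anomalous capture never replaces the reference
    return new_image             # a correct capture becomes the new reference
```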
Validation
In order to validate the method, a dataset consisting of 47 images, including both correct and anomalous images, has been created. The anomalous images encompass variations in size, repeated images, and discrepancies in similarity compared to the reference image patches.
Since the first two cases are trivial, the focus will be on the anomalous images based on their similarity to a reference image. Thus, several investigated cases are presented as examples:
- (i)
Changes in the camera parameters: the reference image was captured with the following camera parameters: pan is set to 281, tilt to 9.81, and zoom to 2.2 (
Figure 17). To simulate anomalous images, captures were taken with varying parameter values. For the pan parameter, images were captured in the range 277–285; for tilt, the dataset contains images with values from 5.81 to 13.81; and for zoom, values range from 1.3 to 3.2. Some examples are shown in
Figure 18, with the differing parameter value highlighted in bold;
- (ii)
Occlusions in the received images: to validate the algorithm’s robustness against certain occlusions, we will refer to
Figure 19a, where one of the patches is occluded by a bird. Since the method considers only the top 3 patches with the highest similarity to a reference image, the occlusion of a patch will not affect the accurate prediction of an image as anomalous or non-anomalous;
- (iii)
Images completely different from the expected ones: to validate this point, a completely different image from the expected ones, presented in
Figure 19b, was introduced, which was detected as anomalous by the applied method.
Once the LPIPS values were calculated for the validation image set, a threshold of 0.3 was set to classify an image as anomalous or non-anomalous. By using this threshold, all images in the validation set were correctly classified as either anomalous or non-anomalous.
3.5.3. Inference
Once the images are captured and verified to be correct, the segmentation process is performed using the corresponding model. The resulting output includes the segmented image, the original image, and the foam percentage over the total surface area of the tank.
Figure 20 displays the segmentation results for images captured under different atmospheric conditions, showcasing the model’s effectiveness in each case.
The time required for anomaly detection and inference is 17 s for every set of 5 images.
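The foam percentage itself reduces to a pixel count over the binary segmentation mask; a minimal sketch, assuming the mask has already been restricted to the tank surface (e.g., after the homography mentioned in Section 3.5.4).

```python
import numpy as np

def foam_percentage(mask: np.ndarray) -> float:
    """Percentage of the tank surface covered by foam, given a binary mask (1 = foam)."""
    return 100.0 * float(mask.sum()) / mask.size
```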
3.5.4. Data Submission to the Event-Server
The Event-Server, or event consumer, serves as a global repository for storing images and alphanumeric values resulting from the application of various AI models. Additionally, the Event-Server is capable of processing these events to generate, record, and send notifications to end users if the event inferred meets the established criteria based on user-defined parameters for a relevant event.
Regarding anomalies, three types of events are being sent based on the detected anomaly: repeated image, incorrect image size, and incorrect camera position or erroneous image. As for inference, an event is sent to record and track the segmentation results, including the following information:
- (i)
The original image with the overlaid segmentation percentage (
Figure 20, left);
- (ii)
The resulting image obtained by performing an AND operation between the original image after applying homography and the foam segmentation result (
Figure 20, right);
- (iii)
A numeric value representing the percentage of the tank occupied by foam.
Additionally, an alert has been set up in the Event Server based on the foam percentage value. This alert triggers a notification to the designated person when the foam percentage exceeds a predefined threshold specified in the alert.
4. Discussion
The calculation of foam percentage on the total surface area of a WWTP tank using computer vision algorithms can become a highly complex problem.
This work presented a complete remote sensing system based on image analysis to segment foam and monitor in real time the amount of foam in WWTP tanks. The developed segmentation models have been applied and validated in tanks of a biological reactor of a wastewater treatment plant with an MBBR (Moving Bed Biofilm Reactor) system. Due to the lack of a large set of tank images in the early stages of the research, the segmentation models were first trained on large public datasets used in similar segmentation applications and then retrained on the small foam dataset. In a later stage, the dataset collected at the plant was large enough to train the models from scratch using only foam images. Additionally, the research carried out a comparison between two types of training: centralized and federated. The proposed methodology involves a complete computer vision pipeline that includes automatic image capture, ensuring the absence of anomalies in the captured images, image processing to segment foam and quantify the foam percentage over the total surface area of a tank, and the transmission of operationally significant data to the responsible personnel at the WWTP to support foam removal at the right time.
Regarding limitations, managing lighting variations emerged as a primary challenge in achieving superior metrics. Additionally, accurately defining what is foam on the tank surface presented another notable limitation.
In this research, two texture segmentation models applied to foam segmentation are presented, yielding results surpassing 85% in the Dice metric and 75% in IoU. Furthermore, the percentage of foam on each tank calculated with the models has been validated by experts from the wastewater treatment plant using different images, thus verifying the proper functioning of the model. The experts’ confidence implies that the system developed in this research is ready to be used in new WWTPs.
The comparison between centralized and federated training yielded similar results as both clients in the federated training have used images from the same WWTP. Nevertheless, the conducted research has allowed for the validation of the federated architecture. Consequently, it can be extrapolated for future use in different wastewater treatment stations.
Furthermore, the methodology explained in this paper has already been integrated into a real WWTP, providing the following benefits:
Reduction of maintenance requirements;
Prevention of oxygen transfer issues;
Avoidance of biomass washout;
Enhanced monitoring and control;
Avoidance of odour problems;
Energy savings;
Mitigation of accident risks.
Moreover, since foam formation can also commonly occur in other types of WWTP reactors and processes, it is important to note that the presented technology can be implemented in plants using other biological treatments, such as Activated Sludge Processes, Anaerobic Reactors, or Sedimentation Tanks. However, such applicability needs to be tested and verified by applying the models in additional WWTPs.
Finally, it is important to state the relevance of this work to achieving SDG 6, as the techniques presented help improve wastewater treatment, which is one of the six outcome targets of the goal.
For future work, the use of the federated framework in different wastewater treatment plants and the comparison of its results with centralized training remain to be explored. A successful outcome when using various WWTPs in the federated training could lead to a global foam segmentation model for WWTPs. Moreover, using distinct models depending on the month of the year may mitigate the impact of atmospheric conditions on image segmentation.