A Multi-View Image Feature Fusion Network Applied in Analysis of Aeration Velocity for WWTP

The instability of the aeration system brings a significant challenge to the management of wastewater treatment plants (WWTP). Using image recognition methods to monitor aeration conditions accurately and enhance management efficiency is a promising way to solve this problem. To improve the efficiency of aeration condition identification and provide support for troubleshooting, we propose a method for aeration velocity condition identification based on a multi-view image feature fusion network (MVNN). Firstly, an experimental platform for simulating aeration tanks is established, and two cameras are used to acquire aeration images from different perspectives. Secondly, an image data set with 10 aeration velocity gradients is constructed and applied to the network’s training. Finally, the MVNN is used to extract and fuse the features of aeration images, and the model’s performance is evaluated on the dataset. Experiments show that the average accuracy of the method is over 98.3%, and the AUC of aeration identification is above 0.98, which indicates that the model has the potential for practical application in WWTP.


Introduction
Wastewater treatment facilities are critical infrastructure in modern cities, and aeration is an essential part of the treatment process. However, the mechanism of the aeration process in the biochemical tank is complicated and involves physical, chemical, and biological reactions [1]. With improper manual operation or untimely monitoring of aeration, WWTP workers find it challenging to ensure the stability and reliability of aeration treatment, which can quickly lead to sludge bulking or excessive foam accidents [2,3]. These abnormal events tend to develop rapidly. If they are not detected in time, the situation worsens and leads to more severe accidents [4].
The aeration system is an indispensable part of wastewater treatment, and ensuring its stable operation is an essential task for the wastewater plant. Regular aeration keeps the dissolved oxygen concentration in the biological treatment unit constant, thus ensuring the normal physiological activities of microorganisms. At present, most wastewater plants rely on gas flowmeters or on manual observation of wastewater characteristics, including changes in chromaticity and the foam on the water surface, to determine aeration conditions. However, these methods have shortcomings. On the one hand, because of the large area of the aeration tank, the existing mechanical instruments can only monitor the aeration velocity in part of the biochemical tank. As wastewater is a heterogeneous fluid, the gas flowmeter cannot identify abnormal working conditions when aeration is unusual in only part of the tank [5]. On the other hand, only experienced engineers can spot inconspicuous anomalies in the aeration tank, and such professional engineers are scarce.
To improve the operation and maintenance efficiency of the aeration process and further enhance the management efficiency of WWTP, researchers have begun to apply image analysis to wastewater treatment, using computer vision to analyze aeration conditions and extend the monitoring coverage of the aeration system. Currently, two main kinds of image recognition technology are used in wastewater treatment: one is based on microscopic images of wastewater, and the other on macroscopic images of wastewater [6,7].
The microscopic image method focuses on analyzing the microscopic scale of wastewater or sludge. Since the operating condition of the sludge system is associated with the effluent quality of the wastewater, Costa et al. [8] used quantitative image analysis (QIA) to calculate the microscopic image features of the sludge. The features mentioned above can comprehensively reflect the properties of sludge, such as aggregation mode and aggregation size, which are closely related to sludge activity. Then, a linear correlation model between image characteristics and water quality parameters, including chemical oxygen demand and ammonium, was developed using partial least squares (PLS). Experiments showed that the microscopic characteristics of the effluent could evaluate the working state of the sludge and the effluent quality of the wastewater. Beyond using integrated microscopic features of the wastewater image, Khan et al. [9] suggested using fewer microscopic features to reduce the complexity of the calculation. For example, without relying on the characteristics of filamentous bacteria, only the floc size was used to analyze the sludge settling performance. By optimizing the combination of floc features using stepwise regression, a linear model of microscopic image features with sludge volume index (SVI) and mixed liquor suspended solids (MLSS) was constructed to achieve efficient analysis of SVI and MLSS. Besides the study of the process of formation of the abnormal conditions, Mesquita et al. [10] focused on four abnormal conditions that are most likely to occur in wastewater treatment due to external disturbances. They found the correlation between sludge settling capacity and sludge image by partial least squares and constructed a model that can directly identify the main anomalous conditions. The macroscopic image method mainly studies the macroscopic visual characteristics of wastewater. 
Since the reduction of image luminance is closely related to the turbidity of the water sample, Mullins et al. [11] used a digital camera to obtain infrared images through the effluent. By extracting the features of the region of interest (ROI) from the pictures, they established a linear relationship between the turbidity and the number of pixels with luminance that exceeded a certain threshold. The results show that the turbidity of wastewater can be measured by light diffusion. Besides turbidity analysis, macroscopic images are also used to study sludge settling performance. To calculate the settled sludge volume (SSV) in real time, researchers have constructed an image acquisition device based on a digital camera [12]. By exploring the pixel value difference between sludge and water during sludge settling, a method for locating the sludge-water interface was constructed. A practical SSV measurement method was obtained by measuring the changes of the sludge-water interface over a certain period, and its feasibility for monitoring abnormal sludge conditions in wastewater treatment was verified. As a typical macroscopic image characteristic, color is also used to analyze abnormal working conditions in wastewater treatment. Liu et al. [13] extracted the RGB color information of wastewater using a camera and transformed it into the hue-saturation-intensity (HSI) color space. By establishing the relationship between HSI characteristics and dissolved oxygen in wastewater, they realized the identification of abnormal dissolved oxygen based on image analysis.
The above methods try to construct the linear relationship between microscopic or macroscopic characteristics of sewage images and abnormal working conditions. However, there are still some obstacles when applying the above methods to practice. On the one hand, microscopic image analysis needs accurate microbial characteristic information. Still, the microbial species in sewage are very complex, and the knowledge that can be verified and analyzed is limited. On the other hand, the macro image analysis method needs to extract information such as the image's color and texture in an exact way. Therefore, this method has strict requirements for image acquisition conditions. The macro image analysis method will lead to more significant errors if the external environment changes drastically, such as illumination and color temperature.
In this study, we propose the MVNN model to fuse the multi-view features of aeration images and classify images with different aeration velocities. Two observations motivate this effort. First, the relationships constructed in the microscopic image method are primarily linear, while the aeration process often shows non-linear characteristics. Second, the macroscopic image method only extracts image features from a single view, which cannot provide a comprehensive description of aeration. The MVNN method is built on effective and more robust strategies, namely the convolutional neural network (CNN) and multi-view learning. The image features from two perspectives are fused and used to classify aeration velocity conditions. Aeration image datasets with practical value are collected through the experimental platform, and different feature extraction methods are used for comprehensive evaluation.
The main contributions of the paper are summarized as follows.

1. To obtain the high-dimensional non-linear features of the aeration condition, CNN is used as the base network for image feature extraction;
2. Based on the multi-view learning strategy, the MVNN model, which can fuse the features of multi-view aeration images, is proposed and used to classify images with different aeration velocities;
3. To illustrate the effectiveness of the MVNN method in feature extraction, comparative experiments are carried out on a large dataset.
The rest of this paper is organized as follows. Section 2 describes the detailed experimental configuration and the implementation of the multi-view feature fusion method, including the composition of the experimental equipment, evaluation metrics, dataset, and the architecture of MVNN. Section 3 presents the experimental results. Conclusions are given in Section 4.

Experimental System and Working Conditions
To obtain comprehensive features of aeration images, the multi-view images are defined in this study as the top-view and side-view images of the aeration platform. The schematic diagram of the aeration experiment platform is shown in Figure 1a, where the flume used to simulate the aeration tank is a transparent cuboid. The flume measures 0.44 m × 0.44 m × 1.0 m (length × width × height). The multi-view image acquisition cameras are connected to the flume through professional camera brackets, ensuring image acquisition stability. The specific parameters of the image acquisition equipment are shown in Table 1.
To simulate actual production conditions as closely as possible, an air compressor provides static pressure for the entire experimental system. Once the equipment is in operation, the air compressor pumps air through the duct into the aeration tray at the bottom of the tank, providing uniform input airflow for the experimental platform. Meanwhile, an electronic gas flowmeter monitors the aeration velocity at the vent in real time and uploads it to the computer. The two cameras are positioned so that their fields of view completely cover the aeration area. The installation of the image acquisition device is shown in Figure 1b. To standardize the experimental procedure, the image acquisition devices are jointly controlled by a program written in Python. The cameras acquire images synchronously at a resolution of 1920 × 1080 pixels. The ambient light was kept stable during the experiment, and white LEDs with a color temperature of 4000-4500 K were used as a supplementary light source, which improves image quality and avoids uneven brightness in the acquired images. The details of the multi-view image acquisition device are shown in Table 2.

Data Preprocessing and Evaluation Metrics
In this study, aeration condition identification is essentially an image classification task solved with deep learning: the model takes images of different aeration velocities as input and outputs the class of the aeration condition. According to the parameters set in the actual operation of the WWTP, we collected 10 kinds of aeration images in total, with the velocity ranging from 1 L/min to 10 L/min in steps of 1 L/min. The types of aeration images are represented by the capital letters A, B, C, D, E, F, G, H, I, and J. To obtain better classification performance, the aeration images are preprocessed before training, including image defogging and noise reduction [14]. Image preprocessing helps to enhance the aeration features and improve the classification accuracy of the model [15].
Under the above experimental conditions, we constructed two aeration image datasets: the aeration velocity images on the side-view camera (AVSV) dataset and the aeration velocity images on the top-view camera (AVTV) dataset. Each dataset contains photographs of 10 aeration velocities for both clear water and wastewater, with 5000 photos of each kind. The datasets are used to train and test the model. In practice, each dataset is divided into a training set and a test set in a ratio of 7:3.
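The 7:3 division described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the file names are hypothetical, and splitting inside each class (a stratified split) is our assumption, since the text gives only the overall ratio.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle samples reproducibly and cut them into train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

def stratified_split(samples_by_class, train_ratio=0.7, seed=0):
    """Apply the same ratio inside every class so each class keeps the 7:3 balance."""
    train, test = [], []
    for label in sorted(samples_by_class):
        tr, te = split_dataset(samples_by_class[label], train_ratio, seed)
        train += [(path, label) for path in tr]
        test += [(path, label) for path in te]
    return train, test

# Hypothetical file names; 10 classes (A-J) with 5000 images each, as in the text.
classes = "ABCDEFGHIJ"
data = {c: [f"{c}_{i:04d}.png" for i in range(5000)] for c in classes}
train_set, test_set = stratified_split(data)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible, so that the single-view and multi-view models can be compared on the same test images.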
To quantify the effectiveness of image feature recognition, we use average accuracy (Avg_acc) as the primary metric for image category-based evaluation [16]. Avg_acc is one of the most important metrics for testing model performance and is defined in Equation (1):

$$\mathrm{Avg\_acc} = \frac{1}{n_s} \sum_{i=1}^{n_s} \frac{n_{ii}}{n_i} \qquad (1)$$

where $n_s$ is the total number of categories of samples (in this paper, there are 10 sample categories for aeration, so $n_s = 10$); $i \in [1, 10]$ is the sample category label; $n_i$ denotes the total number of samples in category $i$; and $n_{ii}$ denotes the total number of samples in category $i$ with prediction result $i$ (i.e., the number of correctly predicted samples in each category).
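As a worked illustration of this metric, the sketch below computes the mean of the per-class accuracies from a confusion matrix. The function name and the toy counts are ours, chosen only to show the arithmetic; they are not results from the experiments.

```python
def average_accuracy(confusion):
    """Mean of per-class accuracies n_ii / n_i.

    confusion[i][j] is the number of samples of true class i predicted as class j,
    so the row sum is n_i and the diagonal entry is n_ii.
    """
    per_class = []
    for i, row in enumerate(confusion):
        n_i = sum(row)
        if n_i == 0:
            raise ValueError(f"class {i} has no samples")
        per_class.append(row[i] / n_i)
    return sum(per_class) / len(per_class)

# Toy 3-class confusion matrix (made-up counts).
toy = [
    [9, 1, 0],   # class 0: 9 of 10 correct
    [0, 10, 0],  # class 1: 10 of 10 correct
    [2, 0, 8],   # class 2: 8 of 10 correct
]
print(round(average_accuracy(toy), 4))  # (0.9 + 1.0 + 0.8) / 3 = 0.9
```

Because every class contributes equally, a class that is recognized poorly lowers the score even when it is a small part of the test set.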
The results of image classification are also evaluated by the receiver operating characteristic (ROC) curve [17]. The ROC curve is usually used to assess the overall performance of a model; the model whose ROC curve is closest to the upper left corner performs best. The area under the ROC curve (AUC) takes a value between 0 and 1 and measures the ability of a model to separate classes. An AUC close to 1 indicates a high-performing model, while an AUC close to 0.5 indicates a model that carries no information [18].
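To make the AUC concrete, the following sketch computes it from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The one-vs-rest treatment of the 10 classes is our assumption, since the text does not state how the multi-class ROC curves were obtained, and the function names are illustrative.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as one half. The double loop is O(P * N), which is fine
    for small illustrations.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, true_labels, cls):
    """Treat class `cls` as positive and all other classes as negative."""
    scores = [row[cls] for row in prob_rows]
    labels = [1 if y == cls else 0 for y in true_labels]
    return binary_auc(scores, labels)

print(binary_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # perfect ranking: 1.0
print(binary_auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # uninformative scores: 0.5
```

A scorer that assigns the same value to every image cannot rank positives above negatives and lands at 0.5, while a value below 0.5 means the ranking is systematically inverted.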

Single-View Feature Extraction Model Based on CNN (SVNN)
Convolutional neural networks (CNN) are a class of deep feed-forward artificial neural networks that have been successfully applied to image analysis in many different disciplines [19]. In recent years, CNN has become the most advanced method of image recognition and shows clear advantages over traditional methods [20-23].
The structure of the CNN constructed in this research is shown in Figure 2. The network mainly consists of 13 convolutional layers (conv); the other layers include pooling layers (pool), fully connected layers (fc), rectified linear unit layers (ReLU), and a softmax layer. First, the original aeration image is reduced to 224 × 224 pixels to improve computational efficiency. Next, 3 × 3 convolution kernels extract features from the image to generate a feature map containing aeration characteristics. A pooling layer then reduces the size of the feature map from 224 × 224 to 112 × 112. This extraction process is repeated five times, and the network finally generates a feature map with a size of 7 × 7 × 512. The feature map is then fed to the fully connected layers, which provide the basis for the subsequent identification of the aeration flow. There are 10 aeration rate levels to be identified. A softmax layer placed after the fully connected layers normalizes the output into a probability for each level using a normalized exponential function. The aeration rate level with the highest predicted probability is passed to the output layer as the final prediction for the input image.
The entire dataset is divided into a training set (70% of the data) and a test set (30% of the data). The training set is used to fit the network parameters. During training, the weights are adjusted to minimize the difference between the target and predicted values. The performance of the trained network is evaluated using the test set. The implementation details of SVNN are as follows: the network is built on PyTorch under the Python 3.6 environment. The Adam optimizer with a learning rate of $10^{-4}$ is used. Due to memory constraints, the batch size is set to 64. The models are trained for 125 epochs, by which point the training and validation errors have converged. The similar trends of the training and validation errors suggest that the proposed neural network is not prone to overfitting. To evaluate the performance of the image classifier, the recognition accuracy of the classifier on each class of aeration image is calculated. The higher the accuracy, the better the recognition performance of the model.
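The spatial sizes quoted for SVNN can be checked with a short shape trace. The sketch below is not the authors' code: the per-block layer counts and the intermediate channel widths follow the common VGG16 layout, which is an assumption on our part (the text fixes only 13 convolutional layers, 3 × 3 kernels, five pooling stages, and the final 7 × 7 × 512 map), as are the padding of 1 and the 2 × 2 pooling window.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output width of a convolution; 3 x 3 with padding 1 keeps the size."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output width of a 2 x 2 pooling step with stride 2 (halves the size)."""
    return (size - kernel) // stride + 1

# (number of conv layers, output channels) per block.
# ASSUMPTION: VGG16-style layout; only the total of 13 conv layers and the
# final 512 channels come from the text.
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_shapes(input_size=224):
    """Return the (height, width, channels) after each conv + pool block."""
    size = input_size
    shapes = []
    for n_conv, channels in BLOCKS:
        for _ in range(n_conv):
            size = conv_out(size)
        size = pool_out(size)
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes()
for block, shape in enumerate(shapes, start=1):
    print(f"block {block}: {shape}")

height, width, channels = shapes[-1]
print("flattened length:", height * width * channels)  # 7 * 7 * 512 = 25088
```

Five halvings take 224 to 112, 56, 28, 14, and 7, which matches the 7 × 7 × 512 feature map passed to the fully connected layers.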

Multi-View Image Feature Fusion Network (MVNN)
If only single-view aeration images are used for feature extraction, there may be problems such as information missing and being easily influenced by collection conditions and the environment. On the contrary, if the image features from two views can be fused, then the neural network can capture more information related to aeration velocity. In addition, feature fusion can make the image recognition network more robust and reduce interference from the external environment.
As the aeration velocity changes, the bubbles in the aeration tank display specific image characteristics. For example, at lower aeration velocity, the bubbles are smaller and more dispersed. At high aeration velocity, the number of bubbles increases, and the bubbles appear to aggregate. Comparing the images from different perspectives reveals the following patterns. The top-view images show the characteristics of the liquid surface in the aeration tank, including sewage chroma, bubble size, and bubble movement speed. The side-view images show the characteristics of the interior of the aeration tank, such as the thickness and the rising dynamics of the bubbles.
Based on the SVNN model proposed in the previous section, an image recognition network based on multi-view image feature fusion is proposed, which is called MVNN. The network structure is shown in Figure 3. It includes two SVNN networks in parallel channels, one for the top-view images and one for the side-view images. The method of feature fusion is as follows. Firstly, each SVNN network extracts the high-dimensional features of its view and outputs a feature vector of dimension 4096 at fc1. Secondly, the two feature vectors are concatenated to obtain the fused feature vector of dimension 8192, as illustrated in Figure 4. Finally, feature fusion and learning are carried out by fc2 and fc3, and the final multi-class result is produced by softmax.
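The fusion step can be illustrated without any deep learning framework. In the minimal sketch below, random vectors stand in for the fc1 outputs of the two branches, and a single randomly initialised linear layer stands in for fc2 and fc3; both are placeholders of ours, not trained components, and all function names are hypothetical. Only the dimensions (4096 per view, 8192 fused, 10 classes) come from the text.

```python
import math
import random

def concat_features(top_feat, side_feat):
    """Fuse two single-view feature vectors by concatenation."""
    return list(top_feat) + list(side_feat)

def linear(x, weights, bias):
    """Fully connected layer: one dot product per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Normalized exponential; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
FEAT_DIM, N_CLASSES = 4096, 10

# Placeholders for the fc1 outputs of the top-view and side-view branches.
top_feat = [rng.random() for _ in range(FEAT_DIM)]
side_feat = [rng.random() for _ in range(FEAT_DIM)]

fused = concat_features(top_feat, side_feat)  # 4096 + 4096 = 8192 values

# One random layer stands in for fc2 and fc3 (illustration only, untrained).
weights = [[rng.uniform(-0.01, 0.01) for _ in range(len(fused))] for _ in range(N_CLASSES)]
bias = [0.0] * N_CLASSES

probs = softmax(linear(fused, weights, bias))
predicted = max(range(N_CLASSES), key=probs.__getitem__)
print(len(fused), round(sum(probs), 6), "ABCDEFGHIJ"[predicted])
```

Concatenation leaves each view's features intact and lets the following fully connected layers learn how to weight them, which is why it can be placed on top of any pair of single-view extractors.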

Side-View Aeration Velocity Image Classification Experiment
The identification model was first tested using the AVSV dataset. The dataset covers two fluids commonly found in biochemical tanks: clear water (water, MLSS = 0 mg/L) and wastewater with sludge (sludge, MLSS = 3500 mg/L). For each liquid, images at 10 aeration gradients are captured by an image acquisition device placed on the side of the reactor. The experiment tests the aeration identification performance of the model under actual operating conditions. Figures 5 and 6 show example images of clear water and of sludge at the stated concentration from the AVSV dataset. As shown, the visual characteristics of the side-view images of the two liquids are different. As the aeration velocity increases, the bubble density in the clear water image increases significantly, showing a tendency to spread from the center of the reactor to the surrounding area, and the liquid gradually becomes opaque. In contrast, the wastewater image always appears opaque because of the large amount of sludge it contains. It is worth noting that two subtle changes can be observed in the wastewater aeration images. When the aeration velocity is low, the bubbles on the liquid surface are large and dispersed, while when the aeration velocity is high, the bubbles become small and dense. In addition, as the aeration velocity increases, the sludge scale on the inner wall of the reactor gradually becomes thicker.
Although the above changes are very subtle, they make it possible for the model to distinguish different aeration images. Table 3 gives the results of the SVNN model in identifying different aeration conditions in clear water (water, MLSS = 0 mg/L) and wastewater with sludge (sludge, MLSS = 3500 mg/L). The experiments reveal that the single-view classifier can accurately divide the aeration images collected at different velocities into 10 categories.
Except for class I (9 L/min), whose recognition accuracy is slightly lower, the recognition rate of the remaining nine classes of aeration images is higher than 92.8%. The relatively low recognition rate of class I (9 L/min) is mainly due to differences in light intensity among its images. The average recognition rates of the method, calculated from the percentage of correctly recognized test samples in the 10 classes, are 97.3% and 97.1% for water and sludge, respectively. This shows that the method is feasible for recognizing aeration rate images under this condition.

Top-View Aeration Rate Image Classification Experiment
To validate the classification accuracy of the SVNN model on top-view aeration images, the model is tested using the AVTV dataset. Compared to the side-view camera, the top-view camera differs in two ways. First, it is closer to the observation angle of hydraulic engineers. Second, the top-view camera is easier to install in actual production scenes. Therefore, the test results of this experimental setting can provide an intuitive reference for image-based aeration recognition in wastewater plants. Figures 7 and 8 show sample images of the AVTV dataset, which are the top-view images of clear water (water, MLSS = 0 mg/L) and wastewater with sludge (sludge, MLSS = 3500 mg/L). The images show that the top-view features of the two liquids are similar. For example, as the aeration velocity increases, the bubbles in the reactor tend to change from small to large and from dense to sparse. However, when the aeration velocity is high, bubbles continuously dissolve and merge on the water surface, and the aeration condition becomes difficult to distinguish. In Figure 8 in particular, the images at 8 L/min, 9 L/min, and 10 L/min are very similar, which brings uncertainty to image recognition.
Table 4 shows the results of the SVNN model in identifying different aeration velocities for water and sludge. The experiments show that the aeration images collected at different aeration velocities can be accurately divided into 10 categories. Comparing the results for each category, the model performs well on both liquids when the aeration velocity is low (A: 1 L/min to F: 6 L/min). However, when the aeration velocity is 7 L/min or higher, the accuracy for both liquids is reduced. For sludge aeration images, this performance loss is more prominent.
In the water classification task, the decrease in accuracy is mainly caused by inter-class misidentification between class H (8 L/min) and class I (9 L/min). In the sludge classification task, the misidentification occurs in class G (7 L/min) through class J (10 L/min), resulting in a decrease in the average accuracy of the model. After an in-depth analysis of the results, we attribute these problems primarily to the imaging characteristics of the top-view sludge images. First, because light is poorly transmitted through sludge, the sludge images are not as rich in features as the water images. Second, the liquid surface features of the sludge images are more similar to each other at higher aeration velocities, increasing the possibility of misclassification. The average recognition accuracy of the SVNN method on the AVTV dataset, calculated from the percentage of correctly recognized test samples in the 10 categories, is 97.0% and 94.7% for water and sludge, respectively. Despite the above limitations, it is therefore still possible to distinguish different aeration velocities of the liquid by this method.

Aeration Image Classification Based on Feature Fusion Method
According to the multi-view image feature fusion method proposed in Section 2.4, we fuse, before the fully connected layer, the two single-view feature vectors extracted by SVNN. First, the aeration image of each view is cut to 224 × 224 pixels as the model input, and the network is trained. Second, under the feature fusion strategy, the MVNN model is used to evaluate the accuracy of aeration image recognition.
The recognition results of different models are given in Table 5. To compare the effects of the two feature extraction strategies on recognition accuracy, the results of the SVNN model on the AVTV and AVSV datasets are added to the table as a baseline. The results show that the MVNN network, which uses a feature fusion strategy to fuse images from different views, significantly improves the accuracy of recognizing images with varying aeration velocities. The accuracy gain of the feature fusion strategy is expressed in percentage points. For each kind of liquid, the best single-view accuracy is taken as the reference and compared with the multi-view accuracy. As shown in Table 5, the performance gain from feature fusion is mainly seen at higher aeration velocities. For example, the most obvious improvements in the water images are in class I (9 L/min) and class J (10 L/min), where the accuracy rises to 92.4% and 98.8%, a gain of 6.8% and 2.0%, respectively. Although the accuracy of class E (5 L/min), class F (6 L/min), and class G (7 L/min) decreases slightly, none of the decreases exceeds 0.5%, and the actual accuracy is still above 99.0%. The significant improvement in accuracy and the slight loss of performance reach a good compromise, and the Avg_acc of water images using MVNN reaches 98.4%.
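The gain calculation described above can be written out as a short sketch. The fused accuracies for classes I and J are the values quoted in the text; the best single-view values (85.6 and 96.8) are back-computed from the quoted gains, and the values for the other view, as well as the view names, are made-up placeholders that exist only to exercise the selection of the better view. They are not entries from Table 5.

```python
def fusion_gain(single_view_acc, fused_acc):
    """Per-class gain, in percentage points, of the fused model over the
    best single-view model for the same class."""
    gains = {}
    for cls, fused in fused_acc.items():
        best_single = max(view[cls] for view in single_view_acc.values())
        gains[cls] = round(fused - best_single, 1)
    return gains

single_view_acc = {
    # Back-computed from the text: 92.4 - 6.8 and 98.8 - 2.0.
    "best_view": {"I": 85.6, "J": 96.8},
    # Hypothetical placeholder values for the weaker view.
    "other_view": {"I": 80.0, "J": 90.0},
}
fused_acc = {"I": 92.4, "J": 98.8}  # water images, as quoted in the text

print(fusion_gain(single_view_acc, fused_acc))  # {'I': 6.8, 'J': 2.0}
```

Measuring against the better of the two single views gives a conservative estimate, since the fused model is credited only for what neither view achieves alone.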
The results of applying the MVNN method to sludge images are similar to those of water images. The original SVNN method has poor recognition performance for high aeration velocity images, while the MVNN method achieves significant improvements. The accuracy of class G (7 L/min) to class J (10 L/min) increased by different magnitudes. The most significant ones are class H (8 L/min) and class I (9 L/min), with accuracy improvements of 1.4% and 3.7%. Meanwhile, class G (7 L/min) and class J (10 L/min) also showed a small increase in accuracy, of 1.3% and 1.2%, respectively. We also notice that the recognition accuracy of class C (3 L/min) has slightly decreased by 1.5%, but it still meets the actual requirement of 98.5%. The Avg_acc of sludge images identified by MVNN reached 98.3%, and the overall accuracy rate was significantly improved, as shown in Figure 9.
Figure 10 shows the ROC-AUC for the SVNN model and the MVNN model when identifying aeration velocity conditions. The AUC values (Fusion water AUC = 0.99, Fusion sludge AUC = 0.98) indicate that the MVNN model achieves superior performance in terms of identification accuracy. Although accuracy decreases for several categories, these losses are small compared to the overall improvement in model accuracy, so the MVNN approach achieves a good performance balance.

[Figure 10: ROC curves. Vertical axis: True Positive Rate. Legend: Fusion water AUC, Fusion sludge AUC, Top water AUC, Side water AUC, Side sludge AUC, Top sludge AUC.]
The results show that the feature fusion strategy has the best practical value in classification accuracy. Because this strategy is entirely independent of the underlying network structure, it can easily be used to extend other network structures to compute fused image representations. The above analysis shows that this algorithm has the advantages of strong resistance to interference and broad applicability. When identifying aeration images in wastewater treatment, using the complementary features of multiple perspectives is the key to improving the identification performance.

Conclusions
To address the difficulty of accurately measuring the aeration velocity of the biochemical tank, an aeration image recognition method based on feature fusion is proposed. Aeration images with complicated backgrounds and water mist interference are enhanced to improve the robustness of image recognition. A simulated aeration platform is built, and a large amount of experimental data is collected. The results are verified using clear water and wastewater images. The experimental results show that the identification accuracy of the MVNN model for the two kinds of liquids is as high as 98.8% and 98.4%, respectively. Further analysis using ROC-AUC values shows that the AUC of the MVNN model is 0.99 for clear water identification and 0.98 for wastewater identification, which demonstrates that the feature fusion method can overcome the insufficient feature information extracted from a single view. Therefore, the MVNN method based on image feature fusion can be applied to recognize aeration conditions.
The above conclusions show that the multi-view feature fusion network has a robust feature extraction capability and high accuracy, which provides a direction for further research. At the same time, the MVNN method can provide real-time aeration status to the managers of sewage treatment plants, which helps them adjust production in time and maximize sewage treatment efficiency. In the future, characteristics of wastewater in other dimensions could be extracted by modifying the network structure or establishing a new model, thus obtaining more measured values. This would make it possible to establish an effective and complete water quality testing system, and the method could also be applied to medicine, food, and other fields.