AQE-Net: A Deep Learning Model for Estimating Air Quality of Karachi City from Mobile Images

Abstract: Air quality has a significant influence on the environment and on health. Instruments that efficiently and inexpensively detect air quality could be extremely valuable for producing air quality indices. This study presents a robust deep learning model, named AQE-Net, for estimating air quality from mobile images. The algorithm extracts features and patterns from scene photographs collected by a camera device and then classifies the images according to air quality index (AQI) level. Additionally, an air quality dataset (KHI-AQI) of high-quality outdoor images was constructed to enable model training and performance assessment. The sample data were collected from an air quality monitoring station in Karachi City, Pakistan, comprising 1001 hourly records that include photographs, PM2.5 levels, and the AQI. This study compares and examines a traditional machine learning algorithm, the support vector machine (SVM), and deep learning models, such as VGG16, InceptionV3, and AQE-Net, on the KHI-AQI dataset. The experimental findings demonstrate that, compared to the other models, AQE-Net achieved more accurate air quality classification results: AQE-Net achieved 70.1% accuracy, while SVM, VGG16, and InceptionV3 achieved 56.2%, 59.2%, and 64.6% accuracy, respectively. In addition, the MSE, MAE, and MAPE values calculated for our model (1.278, 0.542, 0.310) indicate the remarkable efficacy of our approach. The proposed method shows promise as a fast and accurate way to estimate and classify pollution levels from captured photographs alone. This flexible and scalable method of assessment has the potential to fill significant gaps in the air quality data gathered by costly devices around the world.


Introduction
Air pollution has worsened over the past few decades; therefore, it has received significant attention from scholars and policymakers. The air quality index (AQI), which is composed of six pollutants, namely particulate matter 10 (PM10), particulate matter 2.5 (PM2.5), sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), and ozone (O3), is an overall index that depicts levels of air pollution more objectively than an index based on a single contaminant [1,2]. Pakistan has consistently encountered severe air pollution caused by industrial sources and automobile exhaust, particularly in Karachi City, located in southern Pakistan [3]. Consequently, air pollution poses a severe threat to health, and prompt monitoring of air quality is vital to control pollution and immensely useful for protecting human health. Air pollution has a variety of adverse effects; among related work, for example, an attention-based convolutional BiLSTM autoencoder model has been presented for air quality forecasting.
This study proposes a deep CNN model (AQE-Net), based on ResNet, to classify photos by air quality level. Previous CNN-based approaches concentrate almost solely on PM2.5, even though PM2.5 is just one component of air pollution and does not fully reflect overall air quality. Existing studies have also estimated air quality from different angles, and much of this research focuses on particular pollutants. This study contributes both theoretically and practically by taking the AQI as the outcome variable for estimating air quality. Moreover, whereas many studies use satellite images for air quality estimation, this study uses mobile images; further investigation of image-based air quality estimation is therefore needed to boost accuracy and reliability. Our proposed model measures the AQI directly, estimating the ambient air quality more accurately. In this context, this study investigates the connection between air quality and image characteristics by analyzing many fixed-site photographs, builds a prediction model, and estimates air quality anywhere. People can collect pictures easily and quickly using portable terminals such as mobile phones, tablets, and other smart devices and can use this method to estimate the AQI in real time.

Study Area
This study focuses on Pakistan's largest metropolitan city, Karachi, the capital of the Sindh province. It is the twelfth-largest city in the world, with a population of over 12 million people. Karachi comprises seven districts: the Karachi Central, Karachi East, Karachi South, Karachi West, Korangi, Malir, and Keamari districts. Additionally, Clifton is part of the Karachi South district, which is our main research area for this study. Figure 1 shows the Karachi map with all districts, and the location symbol indicates the Clifton area on the map.

Dataset
Because no publicly available image library exists for image-based air quality detection, the KHI-AQI image database was created. The library contains a total of 1001 photographs, a series of scene images captured at varying levels of air quality. To create the dataset, we went through the following stages.
We installed a mobile device with a camera in a fixed position and orientation near the US Consulate General's air quality monitoring station in Karachi to capture images of the surrounding scene. Every hour from 8:00 to 18:00, the camera collected photographs of the sky, which were saved automatically. The information about the air quality image collection point is included in Table 1. Further, Figure 2 depicts examples of scene images from the air quality image library corresponding to different degrees of air pollution. In Figure 2, pictures (a), (b), (c), (d), and (e) were taken at 8:00, 9:00, 10:00, 11:00, and 12:00, respectively, while pictures (f), (g), (h), (i), and (j) were taken at 13:00, 14:00, 15:00, 16:00, and 17:00.
We accessed monitoring station data from the MicroStation for air quality at the United States embassies and consulates in Karachi from 1 August 2021 to 30 October 2021, a total of three months of data [41]. Figure 3 shows the AQI data points for these three months. The hourly AQI values were then translated to levels in accordance with the AQI classification table. We recorded the file name and capture time of the photographs taken at the collection point; each record comprises the AQI value, image name, AQI level, capture time, and other fields. In total, 1001 data points were collected, with the AQI level serving as the image label. As a result, images and observation data tied to geographic location and time were gathered, and a database pairing scene images with air quality records was produced (Table 2).
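To make the level-assignment step concrete, the sketch below maps hourly AQI values to the five categorical labels used as image labels (numbered 1-5 in this study). The numeric breakpoints shown are placeholders borrowed from common AQI bands; the actual boundaries are those of the study's classification table (Table 2).

```python
# Sketch: convert an hourly AQI value to the categorical level used as the image label.
# The breakpoints below are illustrative placeholders; the study's own classification
# table (Table 2) defines the actual boundaries for levels 1-5.
AQI_BREAKPOINTS = [
    (50, 1),    # level 1 ("good" in this study's labeling)
    (100, 2),   # level 2
    (150, 3),   # level 3
    (200, 4),   # level 4
    (500, 5),   # level 5
]

def aqi_to_level(aqi: float) -> int:
    """Return the categorical AQI level (1-5) for an hourly AQI value."""
    for upper_bound, level in AQI_BREAKPOINTS:
        if aqi <= upper_bound:
            return level
    return 5  # values beyond the last breakpoint fall into the highest level
```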

Convolutional Neural Network (CNN)
Fukushima and Miyake [42] proposed an early convolutional neural network (CNN) architecture, which was later updated by LeCun et al. (1989) [43]. Areas in which CNNs have succeeded in recent years include synthetic biomedicine [44], catastrophe detection [45], natural language processing [46], holographic image reconstruction [47], the artificial intelligence program for Go [48], optical fiber communication [49], and so on. Using high-performance computing platforms such as high-performance computers, graphics workstations, and cloud computing platforms, it is now possible to train complicated models on large-scale datasets. A wide array of convolutional neural network models has been developed in this regard, including ZFNet [50], GoogleNet [51], LeNet [52], MobileNets [53], VGGNet [54], the Overfeat model [55], DenseNet [56], SPPNet [57], ResNet [58], AlexNet [59], and so on. A CNN is a multilayer network whose fundamental structure is mostly composed of the following layers: the input layer, the convolutional layer, the pooling layer, the fully connected layer, and the output layer, as illustrated in Figure 4.
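As a concrete illustration of this canonical layer sequence (input, convolution, pooling, fully connected, output), the minimal PyTorch sketch below defines such a network. It is a generic example only, not the architecture used in this paper.

```python
import torch.nn as nn

# Minimal CNN with the canonical layer sequence described above.
# Illustrative only; this is not the AQE-Net architecture.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 112 * 112, num_classes),      # fully connected + output layer
        )

    def forward(self, x):  # x: (N, 3, 224, 224) tensor from the input layer
        return self.classifier(self.features(x))
```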


AQE-NET Model
The training set for the model is defined as {(x_i, y_i)}_{i=1}^{N}, x_i ∈ ℝ^{H×W×C}, y_i ∈ ℕ. The collection of air quality evaluation images and the corresponding set of labels are referred to as {x_i}_{i=1}^{N} and {y_i}_{i=1}^{N}, respectively. When an image x_i is input, the model must produce the air quality level y_i that corresponds to the image x_i, i.e., learn the mapping relationship y_i = F(x_i).
AQE-Net is built by merging a self-supervision module, the Spatial and Context Attention (SCA) block, with the original ResNet18 [58] network structure to design a feature extraction network for air quality images. The residual unit is the unit most susceptible to changes in weight, and the self-supervision module can continually adjust the relevance of feature information, allowing the model to come closer to the global optimum. The third block was expanded to include a scene self-supervision module, while the original structure was left unchanged. The third block consists of two residual structures, each containing an SCA module; the feature map input to the SCA module in the third block has a resolution of 1/16 of the initial input image, resulting in a considerable reduction in the amount of computation required for matrix multiplication. The feature map is processed along three branches to generate different pieces of contextual scene feature information. The first branch determines the correlation among the pixels of the air quality image, and the output of the first branch is matrix-multiplied by the output of the second branch to obtain the similarity between distinct channel maps. The third branch forms feature maps by matrix multiplication with this result, dispersing the relevant feature properties back to the original feature map in order to identify relationships across the complete feature map. The feature map is then aggregated using global average pooling [52] and multiplied with the input feature map to obtain the final output. Recent studies [60,61] have demonstrated that self-supervised learning can significantly increase network performance. Figure 6 shows how the SCA block unit is integrated into the network architecture. Using rich contextual information, the SCA module re-calibrates the feature mapping across channels while simultaneously emphasizing key feature information and suppressing information that is unrelated to the feature mapping. The structure of the SCA is divided into two components: the first encodes the overall scene context into local characteristics, examines the similarity across channels, and increases the representational capacity of the scene; the second integrates spatial context information for each channel to strengthen and accurately manage the dependence between the scenes.
First, the input feature map X ∈ ℝ^{H×W×C} is passed through the Φ(·) and Ψ(·) operations to form new feature maps A ∈ ℝ^{H×W×C₁} and B ∈ ℝ^{H×W×C₁}, as seen in Figure 2. The Φ(·) and Ψ(·) operations are convolutional layers containing batch normalization [62] and ReLU layers [63]. The convolution kernel size is set to 1 × 1 × C₁ to limit the amount of calculation required; with C₁ = (1/16)C, the channel dimension is lowered, which also reduces the number of matrix multiplications required. The feature map A is reshaped to ℝ^{C₁×HW} and then transposed to ℝ^{HW×C₁}. Finally, A and B are matrix-multiplied and the softmax function is applied to produce the channel-correlation map Z ∈ ℝ^{C₁×C₁} (Equation (1)). Here, X_i denotes the pixel with index i in the feature vector, j indexes all possible locations, and Z_ij represents the relationship between each remaining pixel and pixel i. Simultaneously, once the feature map X is passed to K(·), the feature map E ∈ ℝ^{H×W×C₁} is formed, and E is then reshaped and transposed to ℝ^{C₁×HW}. K(·) has the same form as Φ(·) and Ψ(·). To redistribute the correlation information to the original feature map, E is matrix-multiplied with Z, and the result is reshaped into ℝ^{H×W×C₁} to obtain the feature map D ∈ ℝ^{H×W×C₁} (Equation (2)).
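Written out from the definitions above, and assuming a standard softmax-normalized affinity as used in common self-attention formulations (a sketch, not necessarily the paper's exact form), Equations (1) and (2) take the shape:

$$
Z_{ij} = \frac{\exp\!\left(a_i^{\top} b_j\right)}{\sum_{j=1}^{C_1} \exp\!\left(a_i^{\top} b_j\right)}, \qquad
D = \operatorname{reshape}\!\left(Z E_r\right) \in \mathbb{R}^{H \times W \times C_1},
$$

where $a_i, b_j \in \mathbb{R}^{HW}$ are the $i$-th and $j$-th channel vectors of the reshaped maps $A$ and $B$, and $E_r \in \mathbb{R}^{C_1 \times HW}$ is the reshaped output of $K(\cdot)$.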
The spatial attention mechanism aggregates the scene context mapping to create the feature map D, so that connected channels benefit from each other (Equation (2)). To appropriately optimize the correlation between every channel of the feature map X and the other channels, the channels of the feature map D are weighted using a channel-wise module. First, a channel-wise statistic V ∈ ℝ^{C₁} is calculated by global average pooling over the spatial dimensions W × H of the feature map D. Because the feature map X includes C channels, a fully connected layer is then added to map the dimension from C₁ back to ℝ^{C}, where σ denotes the sigmoid activation function applied to its output. Finally, the SCA module's output is the updated feature map G = {g_1, g_2, ..., g_C}, where g_i = F_scale(x_i, z_i) denotes the feature map channel x_i ∈ ℝ^{H×W} multiplied by the weight z_i.
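Written in the same spirit (a sketch assuming a squeeze-and-excitation-style gating, which matches the pooling and sigmoid steps described above):

$$
v_i = \frac{1}{H \times W}\sum_{h=1}^{H}\sum_{w=1}^{W} D_{h,w,i}, \qquad
z = \sigma(W v), \; W \in \mathbb{R}^{C \times C_1}, \qquad
g_i = F_{\mathrm{scale}}(x_i, z_i) = z_i \cdot x_i .
$$

The following PyTorch sketch assembles these pieces into a single module. It is a minimal reading of the description above, not the authors' released implementation; the 1 × 1 convolutions, the C₁ = C/16 reduction, and the reshaping order follow the text, while details such as the exact insertion point inside ResNet18's third block are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCABlock(nn.Module):
    """Sketch of a Spatial and Context Attention (SCA) block as described in the text."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        c1 = max(channels // reduction, 1)          # C1 = C / 16
        def conv_bn_relu(out_ch):
            return nn.Sequential(
                nn.Conv2d(channels, out_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
        self.phi = conv_bn_relu(c1)                 # Φ(·)
        self.psi = conv_bn_relu(c1)                 # Ψ(·)
        self.kap = conv_bn_relu(c1)                 # K(·)
        self.fc = nn.Linear(c1, channels)           # maps the C1 statistic back to C weights

    def forward(self, x):
        n, c, h, w = x.shape
        a = self.phi(x).flatten(2)                  # (N, C1, HW)
        b = self.psi(x).flatten(2)                  # (N, C1, HW)
        e = self.kap(x).flatten(2)                  # (N, C1, HW)
        # Channel-correlation map Z in R^{C1 x C1}
        z = F.softmax(torch.bmm(a, b.transpose(1, 2)), dim=-1)
        # Redistribute correlation information: (N, C1, HW) -> (N, C1, H, W)
        d = torch.bmm(z, e).view(n, -1, h, w)
        # Channel-wise statistic via global average pooling, then sigmoid gating
        v = d.mean(dim=(2, 3))                      # (N, C1)
        weights = torch.sigmoid(self.fc(v))         # (N, C)
        # Re-scale the input feature map channel by channel
        return x * weights.view(n, c, 1, 1)
```

In AQE-Net, such a block would sit alongside each residual structure of the third stage, with the re-scaled feature map passed on to the remaining ResNet18 layers.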

Model Training
The deep learning framework PyTorch [64] was used to implement the model presented in this paper. The model was trained on a server with an Intel(R) Xeon(R) E5-2620 v3 2.40 GHz CPU, a Tesla K80 GPU, and 64-bit Ubuntu as the operating system. Stochastic gradient descent (SGD) was used to optimize the parameters during training, with the momentum β set to 0.9. The mini-batch size was set to 32 to lessen the instability of the stochastic gradient. The initial learning rate was 10−2, reduced by a factor of 10 every 90 epochs, with a weight decay of 10−2. The weights were initialized using the approach in [65]. All models were trained from scratch for 270 epochs. For training, 70% of the photos were chosen at random, with the remaining 30% forming the testing set. To prevent overfitting and to increase the model's accuracy and robustness, the training data were augmented as follows: each image was randomly rotated in the range [0°, 360°] and scaled by a random coefficient in [0.8, 1]; a crop with an aspect ratio of 3/4 or 4/3 of the original was applied to the photograph; finally, each sampled region was normalized to the range [0, 1].
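A minimal torchvision sketch of this augmentation pipeline is shown below; the specific transform classes and the 224-pixel crop size are assumptions, since the paper specifies only the rotation range, scale range, aspect ratios, and [0, 1] normalization.

```python
from torchvision import transforms

# Sketch of the augmentation pipeline described above (assumed transform choices).
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=(0, 360)),  # random rotation in [0°, 360°]
    transforms.RandomResizedCrop(
        size=224,                                 # assumed input resolution
        scale=(0.8, 1.0),                         # random scaling coefficient in [0.8, 1]
        ratio=(3 / 4, 4 / 3),                     # cropping aspect ratio 3/4 or 4/3
    ),
    transforms.ToTensor(),                        # maps pixel values into [0, 1]
])
```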
Moreover, the categorical cross-entropy loss, which is standard for multi-class classification, was used to measure the discrepancy between the predicted and actual labels. The stochastic gradient descent (SGD) optimizer, often regarded as the "classical" optimization algorithm, was used in our proposed model to improve accuracy; at each step, the gradient of the loss function with respect to the network weights is computed and the weights are updated accordingly.
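A short sketch of the corresponding loss and optimizer setup in PyTorch, using the hyperparameters stated above (momentum 0.9, initial learning rate 10−2, weight decay 10−2, learning rate divided by 10 every 90 epochs); a plain ResNet18 stands in for AQE-Net here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Loss/optimizer setup sketch. A plain ResNet18 with 5 output classes stands in for AQE-Net.
model = models.resnet18(num_classes=5)
criterion = nn.CrossEntropyLoss()                 # categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=90, gamma=0.1)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One mini-batch SGD update."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)       # loss between predicted and true labels
    loss.backward()                               # gradient of the loss w.r.t. the weights
    optimizer.step()
    return loss.item()
```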

Model Selection Criteria
The confusion matrix was used for each model to evaluate its predictive performance during testing. The confusion matrix is primarily used to evaluate predicted performance for classification problems. For the purpose of determining the proportion of properly identified samples, the predicted values were compared to the actual values. The model prediction was evaluated using the accuracy, sensitivity, F1 score, and error rate. The metric equations are as follows:
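Using the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) drawn from the confusion matrix, these metrics take their usual forms:

$$
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN},
$$
$$
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}, \qquad
\mathrm{Error\ rate} = 1 - \mathrm{Accuracy}.
$$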

Results
In this study, the standard machine learning technique SVM and the deep learning methods VGG16, InceptionV3, and AQE-Net were contrasted and examined on the KHI-AQI dataset. Furthermore, the accuracy, sensitivity, F1 score, and error rate metrics have been employed for evaluating the performance of deep learning models for classification problems.
The SVM classifier's basic premise is to map the image classification problem into a high-dimensional feature space, where problems that are difficult to classify become linearly separable. A kernel is used to construct a hyperplane in this high-dimensional feature space, which then discriminates between the different air quality levels. An RBF (radial basis function) kernel is employed because the image classification problem is not linearly separable. SVM achieved 56.2% accuracy after training on the KHI dataset, and classifying a single image typically takes 0.0532 s. For the SVM model, the sensitivity, F1 score, and error rate were 0.77, 0.87, and 0.16, respectively.
Following the application of the SVM model to the KHI dataset, we applied the VGG16 model to the same dataset to compare the outcomes. By increasing the depth of the network and using small convolution kernels rather than large ones, VGG improves model accuracy and performs well on image classification. VGG16 obtained an accuracy of 59.2% when predicting the air quality index from photographs, which is 3 percentage points higher than the SVM model. The VGG16 model had a sensitivity of 0.79, an F1 score of 0.88, and an error rate of 0.14, a decrease of 0.02 in error rate compared to the SVM model.
The InceptionV3 model was then tested on the KHI dataset after VGG16. Its accuracy was 64.6%, which is 5.4 percentage points better than VGG16. The sensitivity of InceptionV3 was 0.85, while its F1 score and error rate were 0.96 and 0.05, respectively; the error rate is markedly lower than that of VGG16, while the sensitivity and F1 score are higher.
After the three earlier models, SVM, VGG16, and InceptionV3, we applied our newly proposed model, AQE-Net, to the same dataset to test it and compare the results. Compared to InceptionV3, the accuracy of identifying air quality levels from photos using the AQE-Net model increased by 5.5 percentage points. The proposed AQE-Net model achieves an accuracy of 70.1%, with a sensitivity, F1 score, and error rate of 0.92, 0.96, and 0.03, respectively. Of the SVM, VGG16, InceptionV3, and AQE-Net models applied to the KHI dataset, AQE-Net achieved the highest image classification accuracy. Table 3 lists the prediction time, accuracy, sensitivity, F1 score, and error rate for all of the models used in this research.
The testing dataset contains 201 photos covering the first five air pollution level classes, namely good, unhealthy, poor, severe, and dangerous, represented in the classification problem by the numbers 1 to 5. On the basis of the classification results produced by each model, a confusion matrix was computed; a confusion matrix summarizes the predictions made on a classification task and hence a model's classification accuracy.
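As a rough illustration of such a baseline, the sketch below fits an RBF-kernel SVM with scikit-learn. The features here are simply flattened, randomly generated images used as hypothetical stand-ins, since the paper does not specify its exact feature extraction for the SVM.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical arrays: X holds flattened image features, y holds AQI levels (1-5).
rng = np.random.default_rng(0)
X = rng.random((200, 32 * 32 * 3))     # stand-in for flattened 32x32 RGB photos
y = rng.integers(1, 6, size=200)       # stand-in for AQI level labels

# RBF-kernel SVM baseline, comparable in spirit to the one described above.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
svm_clf.fit(X, y)
print("training accuracy:", svm_clf.score(X, y))
```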
The confusion matrices for the four models deployed in this study, SVM, VGG16, InceptionV3, and AQE-Net, are presented in Figures 7-10. The numbers 1 to 5 on the horizontal axis reflect the predicted values for the test samples, and the values 1 to 5 on the vertical axis represent the actual values of the test samples. In Figures 7-10, the on-diagonal values show the number of correctly classified photos, whereas the off-diagonal values reflect the number of misclassified images. The confusion matrix obtained after applying the SVM model to the KHI-AQI testing dataset is displayed in Figure 7. According to the findings, the SVM model correctly classified 113 out of 201 samples and misclassified 88 samples; in total, 201 samples were included in the test set. The confusion matrix for the KHI-AQI testing dataset using VGG16 is depicted in Figure 8: 119 samples were correctly categorized across all classes, whereas 82 samples were misclassified. Compared to the SVM model, the VGG16 algorithm provided six more correctly categorized results. After VGG16, the InceptionV3 algorithm was applied to the same testing dataset; its confusion matrix is shown in Figure 9. Out of 201 samples, 130 were correctly identified by the InceptionV3 model, while 71 samples were incorrectly classified; InceptionV3 had 11 more correctly classified results than VGG16. Finally, we validated our proposed model, AQE-Net, on the testing dataset. Figure 10 presents the classification results as the confusion matrix generated by AQE-Net: out of 201 classifications, 144 were correct, while 57 were wrong. AQE-Net therefore delivered 14 more correct classifications than InceptionV3. For the categorization of images by AQI level, the overall confusion matrices indicate that AQE-Net is superior to the other models.
In addition, we evaluated the predictive performance of the SVM, VGG16, InceptionV3, and AQE-Net models with the help of three statistical error metrics: mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Table 4 shows the MSE, MAE, and MAPE values. On the testing dataset, the SVM model achieved values of 1.915 MSE, 0.830 MAE, and 0.473 MAPE. With the VGG16 model, the MSE was 1.910, the MAE 0.796, and the MAPE 0.465; the VGG16 model thus produced smaller errors than the SVM model. Following VGG16, we applied the InceptionV3 model, whose MSE, MAE, and MAPE were 1.373, 0.626, and 0.326, respectively, again lower than the errors produced by the earlier SVM and VGG16 models. Overall, the AQE-Net model that we propose had lower errors than the other models employed in this research: AQE-Net achieved 1.278 MSE, 0.542 MAE, and 0.310 MAPE, which is lower than all of the other models. This again shows that the proposed AQE-Net model is superior to the other models.
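For reference, these quantities can be computed from predicted and true labels as in the sketch below; the arrays are hypothetical stand-ins for the 201 test-set predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, mean_absolute_error, mean_squared_error

# Hypothetical stand-ins for the 201 test labels and a model's predictions (levels 1-5).
rng = np.random.default_rng(1)
y_true = rng.integers(1, 6, size=201)
y_pred = np.where(rng.random(201) < 0.7, y_true, rng.integers(1, 6, size=201))

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
accuracy = np.trace(cm) / cm.sum()                 # correctly classified / total
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mape = np.mean(np.abs(y_true - y_pred) / y_true)   # labels are >= 1, so no zero division

print(cm)
print(f"accuracy={accuracy:.3f}  MSE={mse:.3f}  MAE={mae:.3f}  MAPE={mape:.3f}")
```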


Discussion
In this study, all of the sample images were taken with a fixed-point setup, which means each image is acquired at an angle toward the sky, with roughly one-third of the frame occupied by the ground and buildings. The goal is to emulate a common, simple shooting perspective. For monitoring purposes, at least 50% of each frame shows the sky. The photographs in this experiment depict scenes throughout the day (between 7:00 and 19:00). In the evenings, visibility in the images is quite poor because of the low imaging quality, so this experimental model is only appropriate for daytime air quality monitoring and is not suitable for nighttime monitoring. Because the training data were gathered in Karachi, the model's controllability, dependability, and efficiency are good in the local region, and its prediction speed and accuracy are relatively consistent. However, owing to regional climatic and atmospheric differences, the model may not attain the required precision in other areas; it must be retrained and fine-tuned with local image data before it can be used elsewhere. Due to various restrictions, this model cannot match the precision of air quality monitoring stations, but it can serve as a complementary tool. The model's benefit is that individuals can use portable image acquisition equipment to obtain real-time air quality estimates, especially in rural and suburban regions, where monitoring stations are located far from population centers. Future studies should focus on several areas that can be improved. Different weather conditions significantly affect the brightness or darkness of air quality photographs; the model can directly extract the brightness properties of the picture from the data. Humidity, by contrast, has no discernible influence on photos of air pollution, even though it may affect air quality. Future studies can take these considerations into account to increase model accuracy. Finally, we concentrated our investigation on the AQI, a comprehensive indicator of air quality; future studies could instead focus on PM2.5.
The dataset size, the initial learning rate, and the number of layers are three training parameters that affect the results. This section discusses the impact of the MiniBatchsize training parameter. Mini-batch (batch) training involves backpropagating the classification error over groups of pictures [66]. We trained the model with different MiniBatchsize values to see how this parameter affects the model; Tables 5 and 6 give the findings for values of 60 and 10, respectively. Training for various numbers of epochs with a large MiniBatchsize of 60 produced classification rates between 0.4866 and 0.6541. Compared with the results obtained with the smaller batch size (Table 6), this decline in rate helps to explain the memorization issue depicted in Figure 11, where unhealthy images have been misclassified into the poor category. Images (a), (b), (c), and (d) in Figure 11 all belong to the unhealthy class but were mislabeled.
Training for various numbers of epochs with a MiniBatchsize of 10 results in notably higher classification rates, as shown in Table 6. These values, which range from 0.5866 to 0.7014, were computed over a training period of 25,934.44 s. The low MiniBatchsize value makes precise categorization possible.
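In a PyTorch pipeline, varying this parameter corresponds to changing the batch size of the DataLoader, as in the short sketch below; the dataset object is a hypothetical placeholder for the KHI-AQI images.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical placeholder dataset of 1001 image tensors with AQI-level labels (0-4).
dataset = TensorDataset(torch.randn(1001, 3, 224, 224), torch.randint(0, 5, (1001,)))

# The MiniBatchsize experiments above correspond to changing batch_size here.
loader_large_batch = DataLoader(dataset, batch_size=60, shuffle=True)
loader_small_batch = DataLoader(dataset, batch_size=10, shuffle=True)
```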

Conclusions
In recent decades, air pollution has posed major hazards to human health, prompting widespread public concern. However, ambient pollution measures are expensive, so the geographic coverage of air quality monitoring stations is limited. A low-cost, high-efficiency air quality sensor system benefits human health and air pollution prevention. The AQE-Net air quality assessment model, which is based on deep learning, is proposed in this article. Specifically, deep convolutional neural networks are used to extract feature representational information relating to air quality from scene photos, with the information used to identify air quality levels. A comparative examination of our developed model with traditional and deep learning models, such as SVM, VGG16, and InceptionV3, was also carried out on the KHI-AQI dataset. The experimental findings indicated that the AQE-Net model is superior to other models in classifying photos with AQI levels.
This study has certain limitations. The study used a small sample size and focused only on the Karachi region. Future research should use larger datasets and multiple regions so that the findings can be compared with ours. Future studies should also take into account different pollutants, such as PM2.5, PM10, and carbon monoxide (CO). Additionally, it is important to examine seasonal weather conditions when estimating air quality, particularly when visibility is reduced by fog.