PM2.5 Concentration Estimation Based on Image Processing Schemes and Simple Linear Regression.

Fine aerosols with a diameter of less than 2.5 microns (PM2.5) have a significant negative impact on human health. However, the instruments used to measure them are usually expensive and complicated to operate, so a simple and effective way of measuring the PM2.5 concentration is needed. To address this problem, this paper provides an easy alternative approach to PM2.5 concentration estimation. The proposed approach is based on image processing schemes and a simple linear regression model. It takes the difference between images with high and low PM2.5 concentrations and uses that difference to find the image region with the greatest impact. The approach is described in two stages. First, a series of image processing schemes is employed to automatically select the region of interest (RoI) for PM2.5 concentration estimation, and a single feature is obtained from the selected RoI. Second, a simple linear regression model built on that single feature is applied to PM2.5 concentration estimation. The proposed approach is verified with real-world open data released by Taiwan's government. The proposed scheme is not expected to replace component analysis using physical or chemical techniques; rather, it aims to provide a cheaper and easier way to conduct PM2.5 estimation with an acceptable performance. Further work toward this goal is summarized at the end of this paper.


Introduction
Air pollution has been reported to significantly affect human health [1], causing issues such as premature death, bronchitis, asthma, cardiovascular disease, and lung cancer [2]. Pollutants in the air include CO, NO2, and particulate matter. Among them, particulate matter with a diameter of less than 2.5 microns (PM2.5) is a key component that severely affects human health in many ways. For example, PM2.5 aerosols are able to enter the lungs directly through the respiratory tract and affect a person's health [3]. According to a World Health Organization report, more than 90% of the world's population inhales large amounts of pollutants every day, which results in approximately seven million deaths each year [4]. Consequently, PM2.5 concentration estimation is required and has become an important concern for human health [5,6].
The proposed image-based approach provides a valuable alternative to estimating PM2.5 concentration. The main aims of this image-based approach are as follows: (i) to automatically locate the RoI to replace the manual selection of Liu's work [28]; (ii) to use a single feature for linear regression instead of multiple features in PM2.5 concentration estimation, with an acceptable performance; and (iii) to provide a cheaper alternative method with a camera for estimating the PM2.5 concentration.
This paper is organized as follows. The proposed approach is described in Section 2. In Section 3, real-world data are given to verify the proposed approach. Finally, a conclusion is given in Section 4.

The Proposed Approach
There are two stages involved in the proposed approach. In the first stage, a series of image processing schemes are employed to automatically locate the region of interest (RoI) to extract a single feature, which is required in the following stage for PM2.5 concentration estimation. In the second stage, a simple linear regression model is used with the training data, which contains pairs of the single feature obtained through the selected RoI and the actual PM2.5 concentration measurement. The simple linear regression model is then used in PM2.5 concentration estimation with the testing data. The estimated PM2.5 concentration is compared with the actual value and evaluated by performance indices. An overall block diagram for the proposed approach is depicted in Figure 1. The details of the proposed approach are described in the following sections. The proposed automatic RoI selection approach is described in Section 2.1, the simple linear regression model is given in Section 2.2, and three performance indices employed to assess the proposed approach are given in Section 2.3.

Automatic RoI Selection
It should be noted that not all parts of an image are strongly related to the PM2.5 concentration. Therefore, selecting an appropriate RoI is an important step for the successful application of the proposed approach. It is known that some details in an image are blurred when the PM2.5 concentration is high, compared to when it is low. In other words, the pixel values of images with high and low PM2.5 concentrations differ, and they differ more in some regions than in others. This also illustrates that not every feature has a good correlation with the PM2.5 concentration. It motivates us to use the differences in image pairs of low and high PM2.5 concentrations in automatic RoI selection. A flowchart of the proposed automatic RoI selection is depicted in Figure 2. A pair of images, shown in Figure 3a,b, is given to demonstrate how the proposed automatic RoI selection works. Given a pair of images of low and high PM2.5 concentrations, both images are converted into gray-level images. The image with a low PM2.5 concentration is denoted as $I_1$ and the one with a high PM2.5 concentration is denoted as $I_2$. The series of image processing steps used to determine the final RoI is described in the following.

Sobel Edge Detection
As the first step, Sobel edge detection is applied to the image pair, $I_1$ and $I_2$, to extract the high-frequency components [29]. In Sobel edge detection, the gradients along the x-axis and y-axis are denoted as $G_x$ and $G_y$, respectively, and given as
$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * I, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * I, \tag{1}$$
where the standard 3 × 3 Sobel masks are employed, $I$ is the gray-level image, and $*$ denotes two-dimensional convolution. The final magnitude $G_{xy}$ is calculated as
$$G_{xy} = \sqrt{G_x^2 + G_y^2}. \tag{2}$$
The images produced by Sobel edge detection are denoted as $I_{1,s}$ and $I_{2,s}$, and are shown in Figure 4a,b, respectively. In Figure 4, one can see that the low-concentration image $I_1$, after Sobel edge detection, has more details than $I_2$. This shows that more high-frequency components are contained in $I_{1,s}$ than in $I_{2,s}$. We can also see that two buildings on the right of Figure 3a do not appear in Figure 3b, because the PM2.5 concentration of Figure 3b is higher than that of Figure 3a. As a result, the edges of the two buildings are invisible in Figure 4b. The difference between Figure 4a and Figure 4b is shown in Figure 4c. The results show that the PM2.5 concentration has a significant effect on the high-frequency components of images.
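The Sobel step can be illustrated with a minimal sketch in plain Python. It assumes the gray-level image is a list of rows of numbers and leaves border pixels at zero; the source does not state how borders are handled, and the function name `sobel_magnitude` is hypothetical.

```python
import math

# Standard 3x3 Sobel masks for the horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude of a 2-D gray-level image (list of rows).

    Assumption: border pixels are left at 0.0 because the 3x3 window
    does not fit there.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = gy = 0.0
            # The masks are applied without flipping (correlation). Flipping
            # a Sobel mask only changes the sign of the gradient, so the
            # magnitude is the same as with true convolution.
            for i in range(3):
                for j in range(3):
                    pixel = image[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * pixel
                    gy += SOBEL_Y[i][j] * pixel
            out[r][c] = math.hypot(gx, gy)
    return out
```

A vertical step edge gives a strong response at the edge columns, while a flat image gives zero everywhere, which matches the observation that blurred (hazy) regions lose high-frequency content.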

Otsu Thresholding
After Sobel edge detection, Otsu thresholding [30] is applied to the two images in Figure 4a,b to obtain binary images. In Otsu thresholding, the pixels in an image are separated into two groups based on the histogram. By employing statistical properties, the optimal threshold, at which the variance within each group is minimized and the variance between the two groups is maximized, is determined. In Otsu thresholding, the weighted sum of the variances of the two groups is found as
$$\sigma_w^2(t) = w_0(t)\,\sigma_0^2(t) + w_1(t)\,\sigma_1^2(t), \tag{3}$$
where $\sigma_0^2(t)$ and $\sigma_1^2(t)$ represent the variance of each group, and $w_0(t)$ and $w_1(t)$ are the weights of the two groups separated by the threshold $t$, respectively. The weights $w_0(t)$ and $w_1(t)$ are obtained, respectively, as
$$w_0(t) = \sum_{i=0}^{t-1} p(i) \tag{4}$$
and
$$w_1(t) = \sum_{i=t}^{L-1} p(i), \tag{5}$$
where $p(i)$ is the probability of the pixel value $i$ and $L$ is the number of gray levels. The variance between the two groups is given as
$$\sigma_o^2(t) = \sigma^2 - \sigma_w^2(t), \tag{6}$$
where $\sigma^2$ is the variance of the whole image. Equation (6) can be transformed into
$$\sigma_o^2(t) = w_0(t)\,w_1(t)\,\left[\mu_0(t) - \mu_1(t)\right]^2, \tag{7}$$
where $\mu_0(t)$ and $\mu_1(t)$ are the means of the two groups separated by the threshold $t$. The optimal threshold is then the value of $t$ that maximizes $\sigma_o^2(t)$ in Equation (7). The images $I_{1,s}$ and $I_{2,s}$, after Otsu thresholding, are denoted as $I_{1,so}$ and $I_{2,so}$ and shown in Figure 5a and Figure 5b, respectively.
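The threshold search described above can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes the edge image has already been quantized to integer gray levels in [0, 255], it puts pixels below the threshold into group 0, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes w0 * w1 * (mu0 - mu1)**2.

    pixels: flat iterable of integer gray levels in [0, levels).
    Assumption: pixels with value < t form group 0, the rest form group 1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count0 = 0  # number of pixels in group 0
    sum0 = 0    # sum of gray levels in group 0
    for t in range(1, levels):
        count0 += hist[t - 1]
        sum0 += (t - 1) * hist[t - 1]
        count1 = total - count0
        if count0 == 0 or count1 == 0:
            continue  # one group is empty, so no split at this t
        w0, w1 = count0 / total, count1 / total
        mu0, mu1 = sum0 / count0, (sum_all - sum0) / count1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image, t):
    """Map a 2-D gray-level image to 0/1 using threshold t."""
    return [[1 if p >= t else 0 for p in row] for row in image]
```

Maximizing the between-group variance directly avoids computing the two within-group variances for every candidate threshold, which is why the transformed form is the one usually implemented.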

Morphological Dilation
Using the obtained binary images, $I_{1,so}$ and $I_{2,so}$, shown in Figure 5, morphological dilation is applied to expand boundaries and to connect neighboring pixels. The degree of expansion depends on the size of the structuring element. The equation employed for morphological dilation is given below:
$$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \varnothing \,\}, \tag{8}$$
where $A$ is the image to be processed, $B$ represents the structuring element, $\hat{B}$ is the reflection of $B$, and $(\hat{B})_z$ is its translation by $z$.
In the proposed RoI scheme, a 3 × 3 structuring element with all white pixels is used. After morphological dilation, the resulting images are denoted as $I_{1,som}$ and $I_{2,som}$ and shown in Figure 6a and Figure 6b, respectively.
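Dilation with an all-white 3 × 3 structuring element reduces to a simple rule: an output pixel is white if any pixel in its 3 × 3 neighborhood is white. A minimal sketch, assuming a binary image stored as rows of 0/1 and a neighborhood clipped at the image border (the function name is hypothetical):

```python
def dilate3x3(binary):
    """Dilate a binary image (rows of 0/1) with an all-ones 3x3 element."""
    height, width = len(binary), len(binary[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # The output pixel is 1 if any pixel in the 3x3 neighbourhood
            # (clipped at the image border) is 1.
            out[r][c] = int(any(
                binary[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, height))
                for cc in range(max(c - 1, 0), min(c + 2, width))
            ))
    return out
```

A single white pixel grows into a 3 × 3 block, so thin, broken edge fragments from the thresholded Sobel image merge into connected areas.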

Image Subtraction and Labeling
In this step, image subtraction is used to obtain the difference image of $I_{1,som}$ and $I_{2,som}$ in Figure 6, and a labeling scheme is then employed to identify connected pixels. The difference image, denoted as $I_d$, is shown in Figure 7, where pixels with the same value in both images are eliminated and those with different values remain white. To distinguish which pixels are connected, a labeling scheme [31] is applied that marks connected neighboring pixels with the same color. After labeling, the resulting image, denoted as $I_{dl}$, is shown in Figure 8. Finally, the three labeled regions with the largest numbers of pixels are taken as candidate regions of interest.
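The subtraction and labeling step can be sketched as below. The sketch assumes 8-connectivity and a breadth-first flood fill; the source cites a labeling scheme [31] without stating the connectivity, so both choices, like the function names, are assumptions made for illustration.

```python
from collections import deque


def difference_image(a, b):
    """Return 1 where two binary images differ and 0 where they agree."""
    return [[int(pa != pb) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def label_components(binary):
    """Label connected white pixels (8-connectivity assumed).

    Returns (labels, sizes): labels is an image of integer labels starting
    at 1 (0 is background), sizes maps each label to its pixel count.
    """
    height, width = len(binary), len(binary[0])
    labels = [[0] * width for _ in range(height)]
    sizes = {}
    next_label = 0
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                for ny in range(max(y - 1, 0), min(y + 2, height)):
                    for nx in range(max(x - 1, 0), min(x + 2, width)):
                        if binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            sizes[next_label] = count
    return labels, sizes


def top_regions(sizes, k=3):
    """Return the labels of the k largest regions, largest first."""
    return sorted(sizes, key=sizes.get, reverse=True)[:k]
```

Calling `top_regions(sizes)` on the labeled difference image yields the three candidate regions of interest.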

Selected RoI in the Given Pair of Images
Now, the red flow path shown in Figure 2 is described. The difference image of $I_{1,s}$ and $I_{2,s}$, denoted as $I_{sd}$, is obtained by image subtraction. The three candidate regions of interest are then overlaid on $I_{sd}$ to select the pixels of $I_{sd}$ that fall inside each candidate region, and the average pixel value in each candidate region is calculated. The candidate region with the highest average is taken as the final RoI for the given pair of images, $I_1$ and $I_2$. This completes the process of automatic RoI selection given in Figure 2 for the given pair of images.
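Selecting the final RoI for one image pair amounts to a masked average followed by an arg-max. A minimal sketch, assuming the edge difference image and the label image have the same size and that candidate regions are identified by integer labels (all names are hypothetical):

```python
def region_means(edge_diff, labels, candidates):
    """Average edge_diff over the pixels belonging to each candidate label."""
    totals = {label: 0.0 for label in candidates}
    counts = {label: 0 for label in candidates}
    for diff_row, label_row in zip(edge_diff, labels):
        for value, label in zip(diff_row, label_row):
            if label in totals:
                totals[label] += value
                counts[label] += 1
    # Candidates with no pixels are skipped to avoid dividing by zero.
    return {label: totals[label] / counts[label]
            for label in candidates if counts[label]}


def select_roi(edge_diff, labels, candidates):
    """Return the candidate label with the highest average pixel value."""
    means = region_means(edge_diff, labels, candidates)
    return max(means, key=means.get)
```

The region where the Sobel responses of the clear and hazy images differ most on average is the region whose edges PM2.5 degrades most, which is why it is preferred.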

Final RoI Determination
It should be pointed out that the image pair given above is just an example used to show the process of the proposed automatic RoI selection. In practice, 30 images with a low PM2.5 concentration (≤5 µg/m³) and 120 images with a high PM2.5 concentration (≥70 µg/m³) are randomly selected from the training set. In this study, every low-concentration image is paired with every high-concentration image, so 30 × 120 = 3600 image pairs go through the automatic RoI selection process described in Figure 2. By using the averages of the 3600 results, the three candidate regions of interest are determined, as shown in Figure 9. The box plot given in Figure 10 shows the range of average pixel values in each candidate RoI. Since Region 1 has the highest average value, it is selected as the final RoI for estimating the PM2.5 concentration. The average pixel value within the final RoI is the single feature used by the simple linear regression model in the proposed approach.
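The pairing of low- and high-concentration images can be sketched as a random sample followed by a Cartesian product. The record layout `(image_id, pm25)`, the fixed seed, and the function name are assumptions made for illustration.

```python
import random
from itertools import product


def sample_pairs(records, n_low=30, n_high=120,
                 low_max=5.0, high_min=70.0, seed=0):
    """Return every (low_id, high_id) pair from two random samples.

    records: list of (image_id, pm25) tuples from the training set.
    Raises ValueError if there are too few low or high images to sample.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    low = [image_id for image_id, pm in records if pm <= low_max]
    high = [image_id for image_id, pm in records if pm >= high_min]
    low_sample = rng.sample(low, n_low)
    high_sample = rng.sample(high, n_high)
    return list(product(low_sample, high_sample))
```

With the default sizes the function returns 30 × 120 = 3600 pairs, each of which would be passed through the RoI selection process.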

Simple Linear Regression Model
A simple linear regression model, which is a statistical analysis scheme [25], is used to estimate the PM2.5 concentration in the proposed approach. Let $x_i$ be the average pixel value within the final RoI and $y_i$ be the corresponding PM2.5 concentration measurement in the training data, where the subscript $i$ denotes the $i$th sample. It is assumed that these two sequences of data have a linear relation, shown as
$$\hat{y}_i = \alpha + \beta x_i, \tag{9}$$
where $\alpha$ and $\beta$ are coefficients to be determined and $\hat{y}_i$ denotes an estimate of $y_i$ (the corresponding PM2.5 concentration). The estimation error between $y_i$ and $\hat{y}_i$ is given as
$$e_i = y_i - \hat{y}_i. \tag{10}$$
Employing the least squares algorithm to minimize the estimation error, the coefficients $\alpha$ and $\beta$ can be found as
$$\alpha = \frac{1}{N}\left(\sum_{i=1}^{N} y_i - \beta \sum_{i=1}^{N} x_i\right) \tag{11}$$
and
$$\beta = \frac{N\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}, \tag{12}$$
where $N$ is the number of samples. Once the simple linear regression model is obtained, it is employed to estimate the PM2.5 concentration with the testing data.
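The closed-form least-squares fit is short enough to write out directly. The sketch below takes α as the intercept and β as the slope, which is an assumption about the notation, and uses hypothetical function names.

```python
def fit_simple_linear_regression(x, y):
    """Return (alpha, beta) minimizing sum((y_i - alpha - beta * x_i)**2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    beta = (n * sum_xy - sum_x * sum_y) / denom
    alpha = (sum_y - beta * sum_x) / n
    return alpha, beta


def predict(alpha, beta, x):
    """Estimate the PM2.5 concentration for each feature value in x."""
    return [alpha + beta * value for value in x]
```

For example, the points (1, 3), (2, 5), (3, 7), (4, 9) lie exactly on y = 1 + 2x, so the fit recovers an intercept of 1 and a slope of 2.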

Performance Indices
Inherently, an image-based method cannot analyze the composition of the air, as was done in previous works, so it is hard to define a single error parameter that shows its performance. Instead, three overall performance indices are used to evaluate the proposed approach. The first one is the root mean square error (RMSE). It is used to show the error between the recorded value and the value estimated by the proposed method. RMSE is calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}, \tag{13}$$
where $y_i$ and $\hat{y}_i$ are the true and estimated PM2.5 concentrations, respectively. The second performance index is R squared ($R^2$), which has also been used in previous work [28], and is employed to show the correlation between estimated results and measured values. It is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}, \tag{14}$$
where $\bar{y}$ is the mean of $y_i$. $R^2$ indicates the linearity between $y_i$ and $\hat{y}_i$; when the relation is perfectly linear, $R^2 = 1$. The third index is the F-test, which uses a test statistic that follows an F-distribution under the null hypothesis [32]. Its p-value indicates the statistical significance; that is, it determines whether the result is beyond chance or not. The p-value is used as an indicator of statistical significance in the following experiments.
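The first two indices can be computed in a few lines; a minimal sketch with hypothetical function names follows. The F-test p-value is omitted because it needs the F-distribution, which the Python standard library does not provide.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and estimated values."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot
```

As a worked example, for measured values 10, 20, 30, 40 and estimates 12, 18, 33, 37, the squared errors sum to 26, so RMSE = sqrt(26/4) ≈ 2.55 and R² = 1 − 26/500 = 0.948.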

Experimental Results
In this section, the proposed approach is verified by a real-world data set, which is described later in Section 3.1. Then, the results without and with unreliable data exclusion are shown in Sections 3.2 and 3.3, respectively.

Experimental Data Sets
In the experiments, the images were taken from Renwu Environmental Monitoring Station, Kaohsiung City, Taiwan. A consumer camera was set up at the station and took one image every ten minutes between 7:00 AM and 5:00 PM. In total, 10,084 images were collected from May to October 2016. We did not exclude sampled images of sunny or rainy days. The image data were divided into training and testing data, with proportions of 60% and 40%, respectively. The images shown in Figure 3a,b are examples taken from the data set. Furthermore, the hourly PM2.5 concentration and relative humidity (RH) in the corresponding area were obtained from the open data released by the Environmental Protection Administration, Executive Yuan, Taiwan [33]. Using the data, a simple linear regression model was obtained and used to estimate the PM2.5 concentration by employing the proposed approach.

Results with All Data
In this experiment, the whole data set of 10,084 images was used. As described in Section 2, three candidate regions of interest were automatically selected and the final RoI was determined by the highest average pixel value among them. The average pixel value in the final RoI was used as the single feature. To compare the estimation performance, the results for the whole image, Region 1, Region 2, and Region 3 are presented in Figure 11a-d, which shows a scatter plot for each case, with the region under consideration shown in the upper right corner. The three performance indices with all data are displayed in Table 1. Table 1 indicates that Region 1 performed better than the other cases, and all results were statistically significant in the F-test. With all data, the highest $R^2$, 0.41, was achieved by Region 1. One may see that the performance of the whole-image case is inferior to that of each candidate region of interest. Among Regions 1 to 3, the $R^2$ values rank, from high to low, as Region 1, Region 3, and Region 2. This result is consistent with the priority given by the proposed automatic RoI selection. In other words, the proposed automatic RoI selection is appropriate for the given data.

Results with Unreliable Data Exclusion
By conducting experiments, it was observed that two factors may affect the performance of the proposed approach. One is the time difference between when the images are taken and when the PM2.5 concentration is measured. For the data set described in Section 3.1, the images were taken every ten minutes, but the PM2.5 concentration was collected hourly. In other words, six images were related to only one PM2.5 concentration for each hour. When the PM2.5 concentration changes within an hour, this might degrade the estimation performance. To solve this problem, the variance of the six images taken in the same hour was calculated. When the variance was greater than 1, the images were considered unreliable data and discarded.
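The hourly variance check can be sketched as follows. The source does not say which quantity the variance is taken over, so this sketch assumes it is the per-image feature (the average pixel value in the final RoI) and uses the population variance; the record layout and function name are also hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def drop_unstable_hours(samples, max_variance=1.0):
    """Keep only samples from hours whose feature variance is small.

    samples: list of (hour_key, feature) tuples, several per hour.
    Assumption: variance is computed over the per-image feature values.
    """
    by_hour = defaultdict(list)
    for hour, feature in samples:
        by_hour[hour].append(feature)
    stable = {hour for hour, features in by_hour.items()
              if pvariance(features) <= max_variance}
    return [sample for sample in samples if sample[0] in stable]
```

An hour whose six feature values barely move is kept, while an hour in which they drift widely is dropped as a whole, since a single hourly PM2.5 reading cannot describe it well.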
The other factor seen to affect the performance of the proposed approach was the RH. There are many substances in the atmosphere, in addition to PM2.5, that affect visibility, such as sulfur oxides, nitrogen oxides, carbon monoxide, and water droplets. It has been observed that PM2.5 aerosols expand by absorbing water molecules in the air and that this affects visibility [34]. It has also been reported that the RH affects PM2.5 concentration estimation [28]. Consequently, the effect of RH on PM2.5 concentration estimation was considered in the proposed approach.
By conducting experiments, we observed that the estimation performance of the proposed approach was significantly degraded when RH ≥ 65%. Consequently, data with a corresponding RH ≥ 65% were excluded. Moreover, it should be noted that human health is endangered mostly by higher PM2.5 concentrations rather than lower ones. Consequently, data with PM2.5 concentrations below 5 µg/m³ were also excluded. By applying the criteria RH ≥ 65% or PM2.5 concentration below 5 µg/m³, 2361 images were excluded from the given data set. With data exclusion, the three performance indices were recorded and are presented in Table 2 for the same cases as in Table 1. As seen in Table 2, Region 1 again performed better than the other cases, and all results were statistically significant in the F-test. Comparing Tables 1 and 2, one can see that the RMSE and $R^2$ clearly improved in all cases with data exclusion, and Region 1 exhibited the most improvement: its RMSE was reduced from 11.88 to 8.67, while its $R^2$ increased from 0.41 to 0.73. Table 2 also shows that the performance of the whole-image case is inferior to that of each candidate region of interest, as in Table 1. The performance ranking, from high to low, is Region 1, Region 3, and Region 2, which is consistent with the priority given by the proposed automatic RoI selection, as shown in Figure 10. These results again verify the feasibility of the proposed automatic RoI selection scheme in the given experiments. To sum up, the proposed approach with automatic RoI selection and data exclusion is feasible and has an acceptable performance for PM2.5 concentration estimation.
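The two exclusion criteria can be expressed as a single filter. A minimal sketch, assuming each record is a dictionary with hypothetical keys `rh` (in percent) and `pm25` (in µg/m³):

```python
def exclude_unreliable(records, rh_limit=65.0, pm_min=5.0):
    """Drop records with RH >= rh_limit or PM2.5 below pm_min.

    records: list of dicts with keys 'rh' (percent) and 'pm25' (µg/m³).
    """
    return [record for record in records
            if record["rh"] < rh_limit and record["pm25"] >= pm_min]
```

A record is kept only when both conditions hold, which is the complement of the stated rule that a record is excluded when either RH ≥ 65% or the PM2.5 concentration is below 5 µg/m³.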

Conclusions
This paper has presented a simple alternative for estimating the PM2.5 concentration in which a series of image processing schemes and simple linear regression are employed. The proposed method takes the difference between images with high and low PM2.5 concentrations and uses that difference to find the RoI. Two main stages are involved in this approach. The first stage includes a series of image processing schemes, which are used to automatically select the final RoI, from which only a single feature is extracted. The second stage fits a simple linear regression model to that single feature, and PM2.5 concentration estimation is then performed. Using an image data set and an open PM2.5 concentration data set, experiments were conducted to verify the proposed approach. The results indicated that the proposed approach with the automatically selected RoI achieved the best performance, with $R^2 = 0.73$. Although the proposed method is not as direct as the chemical schemes used to analyze the composition of air, the aim of this paper has been fulfilled, i.e., to provide a simple alternative approach for PM2.5 concentration estimation with an acceptable performance. The proposed approach is not expected to replace component analysis using physical or chemical techniques. However, we hope that the proposed method can provide a cheaper and easier way to conduct PM2.5 estimation with an acceptable performance. To achieve this, further work will be conducted and can be summarized as follows:
1. Since the proposed method uses a fixed camera to capture images at the same location, the influence of images taken at different locations on the results of this study needs to be investigated further;
2. Though we have shown that the performance for each candidate RoI is better than that of the whole-image case, it is still worthwhile to seek a better way to find the final RoI in order to improve performance;
3. In this study, the effects of sunny or rainy days are not considered, and they will be researched in the future. In addition, other weather factors, such as solar conditions, will be considered in the PM2.5 concentration estimation from a higher-dimensional perspective.