An Improved Hybrid Transfer Learning-Based Deep Learning Model for PM2.5 Concentration Prediction

Abstract: As residents' living standards improve, continuously improving the accuracy of PM2.5 (particulate matter less than 2.5 µm in diameter) prediction has become an important and challenging task. Deep learning-based networks, such as LSTM and CNN, have achieved good performance in recent years. However, these methods require sufficient data to train the model, so their performance is limited at sites where data are lacking, such as newly constructed monitoring sites. To deal with this problem, this paper proposes an improved deep learning model based on a hybrid transfer learning strategy for predicting PM2.5 concentration. In the proposed model, the maximum mean discrepancy (MMD) is used to select the station in the source domain that is most suitable for migration to the target domain. An improved dual-stage two-phase (DSTP) model is used to extract the spatial–temporal features of the source domain and the target domain. Then the domain adversarial neural network (DANN) is used to find the domain-invariant features between the source and target domains by domain adaptation. Thus, the model trained on source domain site data can be used to assist the prediction at the target site without degradation of the prediction performance due to domain drift. Finally, experiments are conducted. The experimental results show that the proposed model can effectively improve the accuracy of PM2.5 prediction at sites lacking data, and that it outperforms most of the latest models.


Introduction
With the rapid development of industrialization and urbanization in recent decades, the PM2.5 emissions of developing countries have increased substantially. Serious PM2.5 pollution has caused many adverse effects on economic activities. For example, as of 2015, for every 5 µg/m³ increase in PM2.5 concentration, all other things being equal, GDP per capita decreases by about 2500 Chinese yuan [1]. Therefore, accurately predicting PM2.5 concentration is becoming more and more important. The commonly used prediction methods can be divided into two categories: statistical methods and machine learning algorithms. Statistical methods predict air quality by applying statistics-based models, such as the autoregressive integrated moving average (ARIMA) model [2][3][4], the multiple linear regression (MLR) model [5][6][7], and the generalized additive model (GAM) [8][9][10]. However, the linear models among these assume that the relationship between the variables and the target labels is linear, which is not suitable for nonlinear and non-stationary air quality prediction problems.
In order to overcome this limitation, researchers began to adopt nonlinear machine learning methods. For example, Yang et al. [11] used support vector regression (SVR) to predict the PM2.5 concentration in Beijing and verified that the accuracy of the proposed model was better than that of other methods. Li et al. [12] proposed a stacked autoencoder (SAE) model for air quality prediction and demonstrated that the model performed better than linear models such as ARIMA. Feng et al. [13] used an ensemble of back-propagation (BP) neural networks to predict daily biomass combustion pollutant emissions. Zhang et al. [14] used the genetic algorithm (GA) combined with the artificial neural network (ANN) to predict local indoor air quality with two ventilation models. Although these nonlinear machine learning methods have achieved satisfactory performance in predicting air pollution, they are unable to learn the long-term effects of air pollution, because their models are shallow networks with few parameters. The generalization ability of these models on complex prediction problems is therefore limited.
To overcome the limited capacity of models with few parameters, researchers have recently turned to deep neural networks, which are widely used in image processing, natural language understanding, and so forth [15][16][17][18]. For example, Seng et al. [19] proposed a multi-output multi-index supervised learning comprehensive prediction model (MMSL) based on long short-term memory (LSTM) to predict the overall air quality in Beijing. Yan et al. [20] used a CNN-LSTM model based on spatial-temporal clustering to predict the air quality at multiple sites in Beijing. Experiments show that CNN-LSTM and LSTM generally perform better than the BP neural network. Feng et al. [21] proposed a method based on WRF/RNN to predict the air pollutants in Hangzhou over the next 24 h. Qin et al. [22] proposed a dual-stage attention-based recurrent neural network (DA-RNN), where the attention mechanism is used in the input stage of the encoder and decoder, so that the most relevant input features can be selected adaptively. Liu et al. [23] proposed a dual-stage two-phase attention-based recurrent neural network (DSTP-RNN), where a DSTP-based structure was used to enhance the spatial correlation of the exogenous series, and a two-stage attention mechanism was used to generate stable response weights. However, this method only uses the data of one site without considering the influence of the data of other sites on the model. To solve this problem, in our previous work [24], an improved attention-based dual-stage two-phase fully connected (DSTP-FC) model was proposed to improve the accuracy of PM2.5 concentration prediction, where an exogenous series correlation method is used to calculate the relationship between the target series and the exogenous series, and the PM2.5 concentrations are predicted by a modified DSTP model.
Although advanced deep learning methods can achieve good results in air quality prediction, they all need sufficient historical data to train the models. For datasets with very little data, these methods do not provide good prediction results. To address this data shortage problem, Ma et al. [25] proposed a transfer learning-based bidirectional long short-term memory (TL-BiLSTM) network to predict the air quality of new stations lacking data. This method transfers the knowledge learned from the existing air quality monitoring stations to the new monitoring stations to improve the prediction accuracy at the new stations. Fong et al. [26] proposed a transfer learning model combining LSTM and RNN to predict the concentration of air pollutants. Their method inputs the data of all source domain sites into the model for pre-training, and then adds network layers that are trained on the target domain data to predict the air quality of the target domain. Fang et al. [27] proposed a hybrid deep transfer learning strategy based on long short-term memory (LSTM) and domain adversarial neural networks (DANN), where the temporal features of the source and target buildings are extracted by LSTM, and DANN is used to find the domain-invariant features between the source and target buildings through domain adaptation.
The above-mentioned methods have achieved satisfactory performance in the case of new site data shortages, but there are still some problems that should be further studied. For example, the temporal feature extractors of these models are all based on LSTM, which treat all input features equally and fail to pay attention to the important features. The TL-BiLSTM model is a single-site migration, and for the case where the source domain has multiple sites, it is not known which site in the source domain is selected for migration. The LSTM-RNN model inputs the data of all source domain sites into the model for pre-training, which is unsuitable when the number of source sites is large, because this method will input a lot of redundant data, resulting in the over-fitting and calculation problems [28].
To deal with the problems above, an improved hybrid transfer learning-based deep learning model is proposed in this paper for PM2.5 concentration prediction. When the amount of data in the target domain is small, the model cannot be trained well using only the target domain data. If a plain transfer learning method is used instead, the model trained on the source domain data is not applicable to the target domain data when the source and target domain data have different distributions. Thus, the motivation of this study is to use a domain-adaptive transfer learning method to find the domain-invariant features between the source domain and the target domain, and to use the data of both domains to predict the PM2.5 concentration in a target domain with little data.
The main contributions of this paper are summarized as follows: (1) An improved hybrid transfer learning model with a dual-stage two-phase model (DSTP) and a domain adversarial neural network (DANN) is proposed; (2) The maximum mean discrepancy (MMD) is introduced into the air quality prediction based on transfer learning, which is used to select which station in the source domain is most suitable for migration to the target domain; (3) An improved dual-stage two-phase (DSTP) model is used to extract the spatial-temporal features of the source domain and the target domain. Various experiments on several cities in China are conducted, and the results verify the efficiency and the generalization ability of the proposed method.
This paper is organized as follows: Section 2 describes the proposed method and presents the structure of the proposed deep learning-based model; Section 3 presents the experiments and results; Section 4 discusses the performance of different feature extractors, the generalization ability and the robustness of the proposed method, and the setting of hyperparameters; Section 5 provides the conclusion and possible future research directions.

Proposed Model
In this paper, a hybrid transfer learning model is proposed. The input of the model includes historical air quality and meteorological data of source and target domains. Firstly, the source domain site selection method based on MMD is used to find the source domain site closest to the target domain. Then the data of the two sites are input into the improved DSTP model together. A feature extractor based on the DSTP model is used to extract the spatial-temporal features of training data from source and target site data. The obtained spatial-temporal features are input into the domain classification model and the regression prediction model, respectively. In this paper, a domain adversarial neural network (DANN) is used to find the domain invariant features between the source domain and the target domain through adversarial domain adaptation of DSTP feature extractor and domain classifier. Finally, the regression prediction model based on the fully connected layer is used to predict the values of the source and target sites. The test data of the target site are input into the pre-trained DSTP-DANN model for PM2.5 concentration prediction. The framework of the proposed model is shown in Figure 1, which will be described in detail below.

Remark 1.
The method presented in this paper is different from those that fine-tune by freezing the first few layers of the model. This method uses the domain adaptation of an adversarial neural network to conduct transfer learning. DANN combines domain adaptation and feature learning in a training process, so that the features of domain invariance can be predicted. Then, the proposed transfer learning-based model trained by source domain site data can be used to assist in predicting target site data without degradation of the prediction performance due to domain drift.

Site Selection for Source Domain Based on MMD
Because there are many source domain sites, it is necessary to measure the distribution distance between each source domain site and the target domain site, and to select the source domain site closest to the target domain site. Studies have shown that the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space (RKHS) is an effective measure of the distance between two distributions [29]. Given samples from two distributions and a function f, the mean difference is obtained by subtracting the mean of f over one sample from its mean over the other, and MMD is the maximum of this mean difference. For convenience of calculation, the squared form of MMD is generally adopted. The process of using MMD to estimate the difference between two domains is as follows.
The data of a site in the source domain are denoted as:
$$\{x_1, x_2, \ldots, x_n\}$$
where $x_i$ represents the source domain site data and $n$ represents the number of source domain site data. The data of the target site in the target domain are denoted as:
$$\{z_1, z_2, \ldots, z_m\}$$
where $z_j$ represents the target domain site data and $m$ represents the number of target domain site data. The nonlinear mapping function into the reproducing kernel Hilbert space $\mathcal{H}$ is denoted as $\varphi$. Then the squared form of MMD is defined as follows:
$$\mathrm{MMD}^2 = \left\| \frac{1}{n}\sum_{i=1}^{n}\varphi(x_i) - \frac{1}{m}\sum_{j=1}^{m}\varphi(z_j) \right\|_{\mathcal{H}}^{2}$$
The difference in distribution between two domains is the distance between the two data distributions. The smaller the MMD value, the closer the two domains are. MMD has been widely used in transfer learning algorithms [30][31][32]. In the proposed method, the similarity between each source domain site and the target domain site is calculated based on MMD, and the source domain site most suitable for migration to the target domain site is selected.
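The site selection step can be illustrated with a minimal sketch. It assumes a Gaussian (RBF) kernel with bandwidth `sigma` and the biased kernel estimate of squared MMD; the kernel choice, the bandwidth, the function names, and the toy site data are all assumptions made for illustration and are not taken from the paper.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """RBF kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, zs, sigma=1.0):
    """Biased estimate of squared MMD, expanded with the kernel trick:
    mean k(x, x') + mean k(z, z') - 2 * mean k(x, z)."""
    n, m = len(xs), len(zs)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (n * n)
    k_zz = sum(gaussian_kernel(a, b, sigma) for a in zs for b in zs) / (m * m)
    k_xz = sum(gaussian_kernel(a, b, sigma) for a in xs for b in zs) / (n * m)
    return k_xx + k_zz - 2.0 * k_xz

def select_source_site(source_sites, target, sigma=1.0):
    """Return (name, score) of the source site with the smallest squared MMD."""
    scores = {name: mmd_squared(data, target, sigma)
              for name, data in source_sites.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy data: each sample is a small feature vector.
target = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
source_sites = {
    "site_A": [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]],  # close to the target
    "site_B": [[2.0, 2.1], [2.2, 1.9], [1.8, 2.0]],  # shifted distribution
}
best_site, best_score = select_source_site(source_sites, target)
```

The pairwise sums make this quadratic in the number of samples, so for a full year of hourly records a subsample or a vectorized implementation would likely be preferable.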

Spatial-Temporal Features Extraction Based on DSTP
The center site is the site to be predicted, and the best matching site of the center site is determined by the exogenous series correlation method. The main reason to use this DSTP model is that a stable attention weight can be obtained by the DSTP model, which uses a dual-stage attention mechanism in the encoder stage. Thus, temporal and spatial features can be extracted simultaneously [24].
Given the data of all sites, each site contains $n$ exogenous series and a target series (the series to be predicted). Within the window size $S_W$ of the central site collection, the $k$-th exogenous series is represented by:
$$x^k = (x^k_1, x^k_2, \ldots, x^k_{S_W})^\top \in \mathbb{R}^{S_W}$$
All exogenous series within window size $S_W$ are represented by:
$$X = (x^1, x^2, \ldots, x^n)^\top \in \mathbb{R}^{n \times S_W}$$
The target series is represented by:
$$Y = (y_1, y_2, \ldots, y_{S_W})^\top \in \mathbb{R}^{S_W}$$
In this study, the encoder adopts a two-stage attention mechanism, which aims to learn the spatial correlation among the exogenous series of the central site collection, the exogenous series of its matching sites, and the target series. Specifically, the spatial correlation between the exogenous series of the central site collection and the exogenous series of the matching sites is learned in the first stage of attention. In the second stage of attention, the weighted features are studied again, that is, the spatial correlation among the exogenous series of the central site collection, its target series, and the target series of the matching sites. Thus, the two-stage spatial mechanism ensures that the learned spatial correlations are stable. The decoder is a temporal attention mechanism designed to learn the temporal correlation among the encoder hidden states, the target series of the central site collection, and the target series of the matching site.
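As a rough illustration of the windowed input described above, the following sketch cuts aligned hourly series into samples of window size `window`. The function name, the `horizon` argument, and the toy series are hypothetical; the paper's own data pipeline is not shown in the text.

```python
def make_windows(exogenous, target, window, horizon):
    """Split aligned series into (X, Y, label) samples.

    exogenous: list of n series, each a list of length T
    target:    list of length T (the series to be predicted)
    window:    window size S_W
    horizon:   how many steps after the window the label lies (assumed meaning)
    """
    total = len(target)
    samples = []
    for start in range(total - window - horizon + 1):
        end = start + window
        X = [series[start:end] for series in exogenous]  # n rows of length S_W
        Y = target[start:end]                            # past target values
        label = target[end + horizon - 1]                # value to predict
        samples.append((X, Y, label))
    return samples

# Toy example: two exogenous series and a target, 12 hourly values each.
exo = [list(range(12)), list(range(100, 112))]
tgt = [10 * v for v in range(12)]
samples = make_windows(exo, tgt, window=8, horizon=3)
```

With `window=8` and `horizon=3`, each sample pairs eight past hours with the value three hours after the window ends, which is one plausible reading of the experimental setting reported later in the paper.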

First Stage of Attention
The data from the central site and its matching sites are input into the model together, so that the relationships between their exogenous series can be learned, which improves the accuracy of predicting PM2.5 concentrations. The exogenous series correlation method is used to find the matching sites. Given the $k$-th feature $x^k$ of the central site collection at time $t$, the $k$-th feature $x^{(best)k}$ of the exogenous series of the best matching site is obtained by the exogenous series correlation method [24]. The input attention mechanism learns the spatial correlation between the exogenous attributes of the central site collection and those of the matching site through an attention score $f^k_t$, in which $[*:*]$ is the concatenation operation, the weights of the attention module are parameters to learn, and $h^f_{t-1} \in \mathbb{R}^m$ and $s^f_{t-1} \in \mathbb{R}^m$ are the hidden state and cell state of the encoder LSTM unit at the previous time step. After $f^k_t$ is calculated from the $k$-th feature $x^k$ of the current input and the $k$-th feature $x^{(best)k}$ of the best matching site, the Softmax function is used to normalize it into the attention weight, which measures the importance of the $k$-th feature at time $t$. $x_t$ is the combination of all weighted features at moment $t$. Then, the hidden state $h^f_{t-1}$ and $x_t$ are input into the LSTM layer to update the hidden state of the current moment, and $x_t$ is input into the second stage of attention.
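The normalization and weighting step can be sketched as follows. The scores passed in are placeholders: the learned scoring function that produces $f^k_t$ is not reproduced here, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_input_attention(scores, features):
    """Weight each feature value at time t by its normalized attention weight."""
    weights = softmax(scores)
    weighted = [w * x for w, x in zip(weights, features)]
    return weights, weighted

# Placeholder scores for three features at one time step.
weights, weighted = apply_input_attention([2.0, 0.5, -1.0], [35.0, 60.0, 12.0])
```

A feature with a higher score receives a larger share of the unit weight mass, so it contributes more to the input passed on to the encoder LSTM.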

Second Stage of Attention
This module aims to learn the spatial correlation among the exogenous series and the target series of the central site collection and the target series of the matching sites. Specifically, the target series of the central site collection is combined with the exogenous series at the corresponding time, and the target series of the best matching site is added. In the attention score $s^k_t$ of this input attention mechanism, $v_s \in \mathbb{R}^{S_W}$, $W_s \in \mathbb{R}^{S_W \times 2q}$, $U_s \in \mathbb{R}^{S_W \times S_W}$, and $M_s \in \mathbb{R}^{S_W}$ are the parameters to be learned; $h^s_{t-1} \in \mathbb{R}^q$ and $s^s_{t-1} \in \mathbb{R}^q$ are the hidden state and cell state of the encoder LSTM unit at the previous time step; and $q$ is the hidden size of the second attention module.
After $s^k_t$ is calculated, it is normalized by the Softmax function to obtain the attention weights, which are used to form the weighted input $z_t$.
Then, $h^s_{t-1}$ and $z_t$ are input into the LSTM layer to update the hidden state $h^s_t$ at the current moment, and $h^s_t$ is input into the temporal attention stage.

Decoder with Temporal Attention
The decoder with temporal attention adaptively selects the encoder hidden states most relevant to the target series by weighting them. The encoder with spatial attention outputs the hidden states, and the decoder learns their temporal relations through the attention mechanism within a window of size $S_W$. Based on the hidden state $h^d_{t-1} \in \mathbb{R}^p$ and cell state $s^d_{t-1} \in \mathbb{R}^p$ of the decoder LSTM unit at the previous time step, the attention score $d^i_t$ of each encoder hidden state at moment $t$ is calculated, where the weights of the attention module are parameters to learn; $p$ is the hidden size of the third attention module, and $h^s_i \in H^s$ is the $i$-th encoder hidden state of the second attention module. After $d^i_t$ is calculated, it is normalized by the Softmax function to obtain $\gamma^i_t$. The context vector $c_t$ is the weighted sum of the encoder hidden states:
$$c_t = \sum_{i=1}^{S_W} \gamma^i_t h^s_i$$
The temporal relationship between the hidden states of the central site collection and the target series of the matching sites is learned again by concatenating the target series of the matching sites, where $W_T \in \mathbb{R}^{q+1}$ and $b, H_T \in \mathbb{R}$ are the parameters that map the concatenation to the size of the hidden state of the decoder. Then, $y_{t-1}$ and $h^d_{t-1}$ are input into the LSTM layer to update the hidden state $h^d_t$ at the current moment. The final multi-step prediction is computed from the concatenation $[h^d_t : c_t] \in \mathbb{R}^{p+q}$ of the decoder hidden state and the context vector, where $W_y \in \mathbb{R}^{p \times (p+q)}$ and $b_y \in \mathbb{R}^p$ are parameters that map the concatenation to the size of the decoder hidden state; $v_y \in \mathbb{R}^{\tau \times p}$ is the weight and $b_y \in \mathbb{R}^{\tau}$ is the bias, where $\tau$ is the number of future time steps to predict. A linear function produces the final prediction result.

DSTP-DANN Based on Transfer Learning
The training loss of the TL-DSTP-DANN model includes the regression loss and the domain classification loss. The regression loss for PM2.5 prediction is defined as the mean squared error:
$$L_y(\theta_f, \theta_y) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{16}$$
where $n$ is the batch size of the training data; $y_i$ and $\hat{y}_i$ represent the actual and predicted values of PM2.5, respectively. The loss for domain label classification is defined as the binary cross-entropy:
$$L_d(\theta_f, \theta_d) = -\frac{1}{n}\sum_{i=1}^{n}\left[l_i \log \hat{l}_i + (1 - l_i)\log\left(1 - \hat{l}_i\right)\right] \tag{17}$$
where $l_i$ and $\hat{l}_i$ represent the actual domain label and the predicted domain label, respectively. In this study, we set the source domain label to 0 and the target domain label to 1. During training, to obtain domain-invariant features, the two feature distributions should be as similar as possible. The parameter $\theta_f$ of the feature mapping is sought to maximize the loss of the domain classifier, and at the same time, the parameter $\theta_d$ of the domain classifier is sought to minimize the loss of the domain classifier. This simultaneous minimization and maximization cannot be realized directly by an ordinary gradient update in the back-propagation process of neural networks. It is achieved by inserting a gradient reversal layer (GRL) between the feature extractor and the domain classifier.
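The two losses can be written in a few lines of plain Python. This is only a sketch of the standard mean squared error and binary cross-entropy; the clipping constant `eps` is an assumption added to keep the logarithm finite and is not part of the paper.

```python
import math

def mse_loss(y_true, y_pred):
    """Mean squared error over a batch of PM2.5 values."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy for domain labels (0 = source, 1 = target).

    probs are the predicted probabilities of the target domain;
    eps clips them away from 0 and 1 (an implementation assumption).
    """
    n = len(labels)
    total = 0.0
    for l, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += l * math.log(p) + (1.0 - l) * math.log(1.0 - p)
    return -total / n

regression = mse_loss([1.0, 2.0], [1.0, 4.0])
domain = bce_loss([0.0, 1.0], [0.5, 0.5])
```

A domain classifier that outputs 0.5 for every sample yields a loss of ln 2, which is the value the adversarial training pushes toward when the features become indistinguishable across domains.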
In this paper, DANN is used to search for domain-invariant features between source domain and target domain through the domain adaptation of DSTP feature extractor and domain classifier. The main reason for using the DANN is that it can combine domain adaptation and feature learning in a training process, so that the parameters learned can be directly applied to the target domain without reducing its prediction accuracy due to the domain deviation [27].
The idea of this paper is closely related to generative adversarial networks (GANs). The generative model G is equivalent to the feature extractor; its goal is to prevent the domain classifier from correctly identifying the domain labels (the two feature distributions should be as similar as possible). The discriminative model D is equivalent to the domain classifier; its goal is to distinguish whether the extracted features come from the source domain or the target domain. A GAN is trained through the competition between G and D.
During the training process, the two models G and D can be enhanced simultaneously by competing with each other. Because of the existence of discriminant model D, G can learn the similar features of the two distributions well without a lot of prior knowledge and prior distribution, and finally make the data generated by the model achieve the effect of faking the truth (that is, D cannot distinguish whether the features extracted by G come from the source domain or the target domain, so G and D reach a certain Nash equilibrium [33]).
In the proposed TL-DSTP-DANN model, the GRL acts as an identity transform during forward propagation; during backward propagation, it takes the gradient from the subsequent layer and changes its sign. In particular, the GRL can be regarded as a pseudo-function $R_\beta$, whose forward and backward propagation are given by:
$$R_\beta(x) = x \tag{18}$$
$$\frac{dR_\beta}{dx} = -\beta Q \tag{19}$$
where $Q$ is an identity matrix; $\beta$ is a positive hyperparameter that realizes the trade-off between the regression loss and the domain classification loss, and the setting of $\beta$ refers to [34]. Because the difference between the regression loss and the domain classification loss is relatively large, the model loss is the sum of the regression loss and the domain classification loss. The GRL is followed by the domain classifier, and the hyperparameter in the GRL balances the two loss functions. In this paper, the source domain site data are denoted as $D_s = ([x_1, y^s_1], [x_2, y^s_2], \ldots, [x_n, y^s_n])$, where $x_i$ and $y^s_i$ represent the exogenous data and target data of the source domain site, respectively, and $n$ represents the number of source domain data. The target domain site data are denoted as $D_t = ([z_1, y^t_1], [z_2, y^t_2], \ldots, [z_m, y^t_m])$, where $z_i$ and $y^t_i$ represent the exogenous data and target data of the target domain site, respectively, and $m$ represents the number of target domain data. The final objective "pseudo-function" (Equation (20)) combines the regression loss and the domain classification loss computed through the GRL, where $\theta_f$, $\theta_y$, $\theta_d$ denote the network connection weights of the feature extractor, regression predictor, and domain classifier, respectively; $G_f$, $G_y$, $G_d$ represent the feature extractor, regression predictor, and domain classifier, respectively. The gradient descent method with learning rate $\mu$ is used to update the weights of the TL-DSTP-DANN model. The pseudo-code of the proposed TL-DSTP-DANN training process is shown in Algorithm 1.
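The behaviour of the GRL can be illustrated with a hand-written forward and backward pass. The class name and the scalar toy model (one feature weight `theta_f`, one classifier weight `theta_d`) are hypothetical, and the gradient signs follow the standard DANN formulation, which is assumed here to match the paper's update rules.

```python
import math

class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -beta in the backward pass."""

    def __init__(self, beta):
        self.beta = beta

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.beta * g for g in grad_output]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Toy scalar model (assumption): feature = theta_f * x, domain prob = sigmoid(theta_d * feature)
x, theta_f, theta_d, label, beta = 1.0, 0.5, 2.0, 1.0, 0.5
grl = GradientReversal(beta)

feature = theta_f * x
(grl_out,) = grl.forward([feature])
prob = sigmoid(theta_d * grl_out)

d_logit = prob - label                # dL_d/dlogit for binary cross-entropy
grad_theta_d = d_logit * grl_out      # classifier gradient, used to minimize L_d
grad_grl_out = d_logit * theta_d      # gradient arriving at the GRL output
(grad_feature,) = grl.backward([grad_grl_out])
grad_theta_f = grad_feature * x       # reversed gradient for the feature extractor
```

Because the sign is flipped, a plain gradient descent step on `theta_f` increases the domain classification loss while the same step on `theta_d` decreases it, which is the adversarial behaviour described above.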

Algorithm 1 TL-DSTP-DANN model training process.
Input: source domain site data $D_s$, target domain site data $D_t$
Output: parameters of the model $\theta_f$, $\theta_y$, $\theta_d$
1: for i = 1 → n do
2:   Forward:
3:   Calculate the regression loss $L^i_y(\theta_f, \theta_y)$ by Equation (16)
4:   Calculate the loss of domain label classification $L^i_d(\theta_f, \theta_d)$ by Equation (17)
5:   Calculate the loss of "pseudo-function" $L(\theta_f, \theta_y, \theta_d)$ by Equation (20)
6:

Experiment Setting and Data Source
The dataset used in this paper was collected by Microsoft Research's Urban Air project [35]. We select the datasets related to Beijing and Tianjin to evaluate the proposed TL-DSTP-DANN neural network. The distribution of monitoring sites in Beijing and Tianjin is shown in Figure 3, where the red points are the sites in Beijing, and the black point is the Tianjin site to be predicted. For reasons of data collection, the historical data of the Tianjin site cover only one month, whereas the historical data for the Beijing sites run from 1 May 2014 to 30 April 2015 (1 year, 8759 records). Because the Tianjin site has very few data, training on them directly does not give good results. This paper investigates how to use transfer learning from the data of the Beijing sites to predict PM2.5 concentrations at the Tianjin site. The dataset contains air quality records from 36 sites in Beijing and 1 site in Tianjin. Each air quality record contains six pollutants: PM2.5, PM10, SO2, NO2, CO, and O3. Each weather record contains seven items: time, weather, temperature, pressure, humidity, wind speed, and wind direction. Table 1 shows the details of the datasets used in this paper. There are very complex relationships among these factors [28], which is the main reason why a deep learning-based method is used in this study. The first 75% of the Tianjin site data is selected as the training data, and the remaining 25% as the test data. For the data from the Beijing stations, because the PM2.5 data from individual stations are highly correlated, runs of more than one consecutive missing row are filled using the IDW interpolation method [36] based on the PM2.5 concentrations at adjacent stations. A single missing row is filled by linear interpolation. For the data from the Tianjin site, linear interpolation is used directly.
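The gap-filling rule for the Beijing stations can be sketched as below. The distance exponent `power=2`, the fallback to IDW for gaps at the ends of a series, and all names and toy values are assumptions; the paper only states which method applies to which gap length.

```python
def idw(values, distances, power=2):
    """Inverse-distance-weighted average of neighbouring-station values."""
    weights = [1.0 / (d ** power) for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def fill_missing(series, neighbours, distances, power=2):
    """Fill None entries: linear interpolation for a single missing row,
    IDW from neighbouring stations for longer runs.

    series:     list of floats with None for missing rows
    neighbours: list of complete series from adjacent stations (same length)
    distances:  distance from each neighbour to this station
    """
    filled = list(series)
    i = 0
    while i < len(series):
        if series[i] is not None:
            i += 1
            continue
        j = i
        while j < len(series) and series[j] is None:
            j += 1                      # j is the first index after the gap
        interior = i > 0 and j < len(series)
        if j - i == 1 and interior:
            # Single gap: midpoint of the two adjacent observed values.
            filled[i] = (series[i - 1] + series[j]) / 2.0
        else:
            # Longer gap (or gap at a boundary, by assumption): use IDW.
            for t in range(i, j):
                filled[t] = idw([nb[t] for nb in neighbours], distances, power)
        i = j
    return filled

station = [10.0, None, 14.0, None, None, 20.0]
near = [0.0, 0.0, 0.0, 30.0, 30.0, 0.0]
far = [0.0, 0.0, 0.0, 10.0, 10.0, 0.0]
filled = fill_missing(station, [near, far], distances=[1.0, 2.0])
```

In the toy example the nearer station dominates the weighted average, so the two-row gap is filled with values much closer to its readings than to those of the farther station.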
To demonstrate the effectiveness of the proposed method for stations with little data, the TL-DSTP-DANN network is used to model the data and predict the PM2.5 concentration for the 3rd hour in the future. Appropriate hyperparameters are set so that the model produces its best performance. The prediction time step is set to 8. In order to use all source domain data and target domain training data, the batch sizes are set to 275 and 18 for the source and target domain data, respectively. The back-propagation algorithm is used to train all models, with the MSE loss as the regression loss for PM2.5 prediction and the binary cross-entropy loss as the loss for domain label classification. During training, mini-batch stochastic gradient descent is used together with the Adam optimizer, with the maximum number of training epochs set to 120 and the learning rate set to 0.001. Each attention module uses one LSTM layer, and the hidden state sizes of the LSTM networks are set to the same value, that is, m = p = q = 128. In order to improve the prediction accuracy, min-max normalization is applied to the data:
$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
To evaluate the effectiveness of the method, three metrics are used in the experiments: the root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). These three metrics are often used to evaluate the performance of deep learning-based prediction methods [25,36], and are defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
where $y_i$ is the true value and $\hat{y}_i$ is the predicted value. The smaller the values of these three indicators, the higher the prediction accuracy and the better the performance of the model.
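For reference, the normalization and the three metrics can be computed as in the sketch below. MAPE is returned as a fraction rather than a percentage, which is an assumption chosen to match the magnitude of the MAPE values reported later in the paper; the function names are hypothetical.

```python
import math

def min_max_normalize(values):
    """Scale values to [0, 1]; assumes the series is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; true values must be nonzero."""
    n = len(y_true)
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n

scaled = min_max_normalize([10.0, 20.0, 30.0])
errors = (rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
          mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
          mape([100.0, 200.0], [110.0, 180.0]))
```

In practice the minimum and maximum would be taken from the training split only and reused for the test split, so that no information from the test period leaks into the scaling.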

Model Comparison
In this paper, several state-of-the-art models are compared with the proposed model (TL-DSTP-DANN). In these comparison experiments, the hyperparameters of the compared models are set according to the following principle: for the models whose hyperparameters are provided, we use the original hyperparameters directly; otherwise, we tune the hyperparameters to achieve the best performance. The compared models are introduced as follows.
TL-LSTM: LSTM is used to learn from the long-term dependence of PM2.5, and transfer learning is applied to transfer the features of the source domain to the target domain.
TL-BiLSTM [25]: the TL-BiLSTM model is proposed to predict the air quality of new stations lacking data. This approach uses data from existing sites to pre-train the Stacked BiLSTM model. Then, freeze the first few hidden layers of the basic model, and fine-tune the remaining hidden layers using the data of the newly-built sites.
TL-DSTP [24]: This method uses an improved DSTP model to transfer source domain site data to assist in predicting PM2.5 concentrations at target domain sites.
The results of the proposed model compared to the baselines are shown in Table 2. As can be seen from Table 2, the overall results of TL-BiLSTM are better than those of TL-LSTM, indicating that transfer learning with the Stacked BiLSTM model is more accurate than with LSTM. Meanwhile, the overall results of the TL-DSTP model are better than those of TL-BiLSTM, indicating that transfer learning with the improved DSTP model is better than with BiLSTM. Compared with the TL-DSTP model, the proposed TL-DSTP-DANN model reduces MAE, RMSE, and MAPE by 13.09%, 8.90%, and 13.04%, respectively. The results show that the proposed model has good prediction performance for the PM2.5 concentrations at target domain sites with little data. To verify the performance of the proposed method, the true and predicted values of  Table 3. The results show that the TL-CNN-DANN model is the worst because CNN is more suitable for extracting spatial features. The TL-LSTM-DANN model performs better than TL-CNN-DANN because the LSTM model can extract the temporal features of long time series better than the CNN module. The TL-BiLSTM-DANN model performs better than TL-LSTM-DANN because the BiLSTM also considers the information contained in the subsequent time series. The TL-DSTP-DANN model has the best performance among the DANN-based structures, mainly because DSTP can extract spatial-temporal features. The predictions of PM2.5 concentrations at the Tianjin site by the models using different feature extractors are shown in Figure 5, from which it can be seen that the TL-DSTP-DANN model has the smallest error between the predicted and true values.

Test of Generalization Ability for Different Regional Sites
The previous experiment transferred data from Beijing to Tianjin. Because Tianjin and Beijing are very close, that result alone does not show that the effect of transfer learning is universal. Cities farther away from Beijing are therefore selected as target domains to verify the generality of the proposed method. In this study, the PM2.5 concentration in Guangzhou is predicted by transfer learning from the Beijing site data. The details of the Guangzhou dataset are as follows: the time range and the number of samples are the same as those of the Tianjin site. The standard deviation and mean value of the Guangzhou dataset are 22.48 and 37.02, respectively. To facilitate comparison, we extracted the Guangzhou site data for the same period as the previous target domain. Seventy-five percent of the target domain data are used for training, and the remaining 25% of the samples are used for testing. The interpolation of missing values is the same as that in Section 3.1. The experimental results of using different models to predict PM2.5 concentrations in Guangzhou are shown in Table 4. The results in Table 4 show that the indicators of TL-DSTP-DANN are the best, indicating that the proposed method has good generalization ability. The true and predicted values of PM2.5 concentrations at the Guangzhou site for the different models are shown in Figure 6. It can be seen from the figure that, compared with the TL-LSTM and TL-DSTP-DANN models, the predicted values of the TL-BiLSTM and TL-DSTP models differ considerably from the true values in the 24-72 h range. In addition, the predictions of the TL-LSTM model in the 120-168 h range are not ideal, whereas TL-DSTP-DANN predicts the PM2.5 concentration in Guangzhou relatively well. Therefore, the overall performance of the TL-DSTP-DANN model is the best.

Cross Validation Experiment
Cross-validation or Monte Carlo simulation can be used to evaluate the robustness of a model [37]. In this study, a five-fold cross-validation experiment is conducted to further test the robustness of the proposed method (see [38] for details). Because the data are scarce, an early stopping strategy is adopted to prevent overfitting, and the maximum batch of early stopping is 30. In this cross-validation experiment, the target dataset is divided into six subsets. The first subset is used as the training set to predict the second subset, the first two subsets are used as the training set to predict the third subset, and so on, which yields five folds. The average of the results over all folds is used as the final evaluation. The experimental results of the cross-validation are as follows: MAE = 13.67, RMSE = 17.98, and MAPE = 0.22. The results show that the proposed method can achieve good prediction results even when the training data are scarce, which means that the proposed model has good robustness.
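The splitting scheme described above, in which each fold trains on all earlier subsets and tests on the next one, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `expanding_window_splits` is hypothetical, the subsets are assumed to be contiguous and nearly equal in size, and model training with early stopping is omitted.

```python
def expanding_window_splits(n_samples, n_subsets=6):
    """Return (train_indices, test_indices) pairs for forward-chaining CV.

    The data are cut into `n_subsets` contiguous subsets. Fold k trains on
    subsets 1..k and tests on subset k+1, so six subsets give five folds.
    """
    base, extra = divmod(n_samples, n_subsets)
    bounds = [0]
    for i in range(n_subsets):
        # Spread any remainder over the first `extra` subsets.
        bounds.append(bounds[-1] + base + (1 if i < extra else 0))

    splits = []
    for k in range(1, n_subsets):
        train = range(0, bounds[k])
        test = range(bounds[k], bounds[k + 1])
        splits.append((train, test))
    return splits


def average(scores):
    """Final evaluation: the mean of the per-fold scores."""
    return sum(scores) / len(scores)


for train, test in expanding_window_splits(12):
    print(list(train), "->", list(test))
```

Because the training set only ever contains samples that precede the test subset, no future observations leak into training, which is the usual reason for preferring this scheme over shuffled folds on time series.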

Setting of Hyperparameters
The main hyperparameters of the proposed model are the batch sizes of the source domain and target domain data, the time step, and the hidden state size of the LSTM. Most of these hyperparameters follow our previous work [24] and the related literature [27]. Only the time step, the hyperparameter most closely related to the proposed model, is discussed here.
A reasonable setting of the time step has a great influence on experimental accuracy and speed. The larger the time step, the more characteristics of the sample it contains. However, the influence of past data on the current PM2.5 concentration becomes weaker as the time step increases, so a time step that is too large reduces the experimental accuracy. At the same time, the training time increases with the time step. To find an appropriate setting, experiments are conducted with different time steps, and the results are shown in Figure 7. It can be seen from the figure that the comprehensive performance of the proposed model is the best when the time step is 8. Thus, the time step is set to 8 in this study.
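To make the role of the time step concrete, the sketch below builds training samples from a series with a sliding window of length 8. It is only an illustration under simplifying assumptions: a single univariate series and one-step-ahead targets, whereas the model in this study uses multivariate pollutant and meteorological inputs. The function name `make_windows` is hypothetical.

```python
def make_windows(series, time_step=8):
    """Slice a series into (input window, next value) training pairs.

    Each input holds `time_step` consecutive observations and the target is
    the observation that immediately follows the window.
    """
    inputs, targets = [], []
    for start in range(len(series) - time_step):
        inputs.append(series[start:start + time_step])
        targets.append(series[start + time_step])
    return inputs, targets


# Toy series standing in for hourly PM2.5 readings.
series = list(range(12))
inputs, targets = make_windows(series, time_step=8)
print(len(inputs), inputs[0], targets[0])
```

A larger time step gives each sample more history but also leaves fewer samples from a short series and increases the cost of every training step, which matches the trade-off described above.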

Conclusions and Future Work
In this paper, a dual-stage two-phase model and an adversarial domain adaptation hybrid transfer learning strategy are proposed to predict PM2.5 concentration, especially for new sites with relatively little historical data. First, the maximum mean discrepancy (MMD) is introduced into the proposed model to select the most suitable source domain site. Then, data from the source domain and the target domain are fed together into an improved DSTP model, which extracts the spatial-temporal characteristics of both. DANN finds domain invariant features between the source domain and the target domain by fusing the extracted spatial-temporal features. Finally, the PM2.5 concentration in the target domain is predicted by a regression predictor. To evaluate the performance of the proposed model, we use air quality data from the Beijing sites to assist in predicting PM2.5 concentrations at the Tianjin and Guangzhou sites. The main experimental results are as follows: (1) Compared with other transfer learning prediction models (including TL-LSTM, TL-BiLSTM, and TL-DSTP), the proposed TL-DSTP-DANN model reduces MAE, RMSE, and MAPE by more than 8.5%; (2) Transfer learning clearly improves the performance of PM2.5 prediction at newly built monitoring stations with insufficient data; (3) The comprehensive experimental results of the improved DSTP model combined with DANN are better than those of CNN, LSTM, and BiLSTM. Compared with the BiLSTM, the MAE of the improved DSTP model decreases by 12.05%, and the MAPE decreases by 20%.
Our future work will address three issues: (1) The current dataset contains only historical air pollutant concentrations and meteorological data and lacks geographical data, although geographical factors have an impact on PM2.5 concentrations. The proposed method could provide higher prediction accuracy if datasets containing geographical information become available; (2) An algorithm needs to be investigated to determine which of multiple source domain datasets is most suitable for transfer learning to the target domain; (3) It remains to be examined whether the method proposed in this paper can be used to predict other air pollutants, such as O3 and SO2.