Input-Adaptive Proxy for Black Carbon as a Virtual Sensor

Missing data has been a challenge in air quality measurement. In this study, we develop an input-adaptive proxy, which selects input variables of other air quality variables based on their correlation coefficients with the output variable. The proxy uses ordinary least squares regression model with robust optimization and limits the input variables to a maximum of three to avoid overfitting. The adaptive proxy learns from the data set and generates the best model evaluated by adjusted coefficient of determination (adjR2). In case of missing data in the input variables, the proposed adaptive proxy then uses the second-best model until all the missing data gaps are filled up. We estimated black carbon (BC) concentration by using the input-adaptive proxy in two sites in Helsinki, which respectively represent street canyon and urban background scenario, as a case study. Accumulation mode, traffic counts, nitrogen dioxide and lung deposited surface area are found as input variables in models with the top rank. In contrast to traditional proxy, which gives 20–80% of data, the input-adaptive proxy manages to give full continuous BC estimation. The newly developed adaptive proxy also gives generally accurate BC (street canyon: adjR2 = 0.86–0.94; urban background: adjR2 = 0.74–0.91) depending on different seasons and day of the week. Due to its flexibility and reliability, the adaptive proxy can be further extend to estimate other air quality parameters. It can also act as an air quality virtual sensor in support with on-site measurements in the future.


Introduction
According to a report by the Health Effect Institute in 2019 [1], nearly one out of every ten worldwide deaths resulted from exposure to air pollution which is strongly associated with different cardiovascular and respiratory diseases. Air pollution costs heavily for the global economy and recent estimates suggest that 2-5% of gross domestic profit (GDP) is spent on treating diseases linked to air

Methods
This section describes the traditional multiple imputation method and our input-adaptive proxy modified on the foundation of traditional proxy. The workflow for the modified proxy development is shown in Figure 1. Section 2.1 describes the development of the input-adaptive proxy. The criteria of rejecting models and the grading system of the models are explained in detail in Section 2.2 and Section 2.3, respectively.

Proxy Development
The first step of exploring a given data set is to examine the number of relevant variables and its completeness. We then examine the type of distribution for each variable and transform the data when necessary to build a normal distribution for regression techniques in later stage. The most common methods include logarithmic transformation, inverse transformation and square transformation [19]. We calculate the absolute value of Pearson correlation coefficient ( ) to investigate the linear correlation between the response variable and other explained variables by [22]: where ̅ is the arithmetic mean of the output variable and is the mean of the each input variable. is the number of valid data points in the variables of and . The values range from 0 to 1, where 1 indicates the highest degree of correlation. We select input variables with high or moderate ( > 0.1) for the next step of model creation. This procedure is able to lower the number of models and to minimize the computational time. After that, we limit the number of input variables to a maximum of three. Although adding more input variables in the model can usually archive higher coefficient of determination, it increases the chance to get a worse prediction by overfitting. More input variables are also prone to incalculable estimations due to the possibility of data insufficiency [5].
We perform ordinary least squares (OLS) linear regression for each model of maximum three input variables generated in the previous step. In case of a model with input variables, the generalized OLS regression model is [10,23,24]:

Proxy Development
The first step of exploring a given data set is to examine the number of relevant variables and its completeness. We then examine the type of distribution for each variable and transform the data when necessary to build a normal distribution for regression techniques in later stage. The most common methods include logarithmic transformation, inverse transformation and square transformation [19]. We calculate the absolute value of Pearson correlation coefficient (R) to investigate the linear correlation between the response variable and other explained variables by [22]: where x is the arithmetic mean of the output variable and y is the mean of the each input variable. N is the number of valid data points in the variables of x and y. The R values range from 0 to 1, where 1 indicates the highest degree of correlation. We select input variables with high or moderate (R > 0.1) for the next step of model creation. This procedure is able to lower the number of models and to minimize the computational time. After that, we limit the number of input variables to a maximum of three. Although adding more input variables in the model can usually archive higher coefficient of determination, it increases the chance to get a worse prediction by overfitting. More input variables are also prone to incalculable estimations due to the possibility of data insufficiency [5].
We perform ordinary least squares (OLS) linear regression for each model of maximum three input variables generated in the previous step. In case of a model with p input variables, the generalized OLS regression model is [10,23,24]: where Y is the output variable, β 0 is the intercept of the model, β i and X i correspond to the ith regression coefficient and ith input variable of the model, and ε is the random error with expectation 0 and variance σ 2 . We apply an extra regularization by using Tukey's bisquare weighting function ρ [25]: which r i is a function dependent on residuals, leverages from OLS fits and the estimates of the standard deviation of the error terms. A tuning factor c, equal to 4.685, is used [26]. This robust fitting is designed to be not overly affected by violations of OLS assumptions [27], and hence serves as an alternative to the traditional OLS regression [28] especially when data are contaminated with outliers that often take place in field measurements. We run the regression for every model and obtain its evaluation attributes (see more in Section 2.3). According to the attributes, we rank the models from the best performing one to the worst. The traditional proxy predicts the missing data with the estimated intercepts and coefficients only with the best performing model. Nevertheless, since data set in practice is typically incomplete, some data points in the input variables in the best model can possibly be missing, hence the imputation cannot be fully achieved. The input-adaptive proxy further imputes the missing data with the second best performing model, and so on, until all the voids are filled up.
To enhance the reliability, we propose to use multiple imputation instead of single by the bootstrapping method [4,19,29]. Original data set is divided into five subsets each including random 80% of the data set. Separate robust OLS fittings are carried out to each subset. The final regression coefficients and evaluation attributions are computed by calculating the arithmetic mean of the resulting parameters in the five subsets [19]. Concerning the choice of the number of imputations, some used hundreds of data subsets, but it demands hundreds folds of computational time. It has been suggested that three to five imputations are sufficient to obtain excellent results [13]. However, when the percentage of missing data is larger, more imputations may be needed [30].

Model Rejection
Since the models are established by all possible combination of all selected features, some models might include interdependent variables, which might influence the resulting estimation. In atmospheric science field, air pollutants often interact and are dependent with each other. This interdependency is called multicollinearity in statistics [31].
When multicollinearity exists, the variances of the estimated regression coefficients for the independent variables are inflated. The variance inflation factor (VIF) for an estimated regression coefficient β p of a particular variable X p is a factor by which the variance is inflated. This quantifies the severity of multicollinearity in OLS regression analysis. The variance of OLS estimator for a typical regression coefficient β p and VIF p are given by the following equations, respectively [24,31]: where σ 2 is the residual variance and R 2 p is the unadjusted coefficient of determination when X p is regressed against all the other input variables in the model, and represents the proportion of variance Sensors 2020, 20, 182 5 of 23 of that variable that is associated with the other independent variables in the model. If there is no linear relation between X p and the other input variables in the model, R 2 p will be zero and the variance ofβ p will be σ 2 / N i=1 X ip − X p 2 . By dividing this term, VIF p is obtained by (5). It is evident that a higher VIF p value leads to a higher proportion of variance for the X p input, which is related to the other input variables in the model. This means that severe multicollinearity effects are present. Kleinbaum et al. [31] suggested that the effect of multicollinearity can be ignored if the VIF is less than 10. Here, as well as a previous study by Fernández-Guisuraga et al. in 2016 [24], we use VIF = 5 as the threshold for model rejection.
In addition, non-normality in the residuals and heteroscedasticity imply that the amount of error in the model is not consistent across the full range of the observed data. This violates the assumption of using OLS regression. Therefore, we use Lilliefors test to examine whether residuals from the model come from normally distributed population, and reject the model if the null hypothesis is not statistically significant (p < 0.05) [32,33]. It is a modification of the Kolmogorov-Smirnov (K-S) test, which corrects the K-S test for small values at the tails of probability distributions equation needed by calculating z-scores and test statistics based on the empirical distribution function of the z-scores for every single sample member [32,33].

Evaluation Attributes
In order to evaluate and rank the models, adjusted coefficient of determination (ad jR 2 ), followed by mean absolute error (MAE) and root mean square error (RMSE) are used as diagnostic evaluation attributes. ad jR 2 illustrates the linear association between two variables (the estimated output variable by proxy and the measured variable). This attribute also considers the degree of freedom, and adjusts the number of input terms in a model relative to the number of data points. We use ad jR 2 over R 2 because ad jR 2 takes the number of output data points into consideration [24]. In case of similar ad jR 2 performance, we prefer to use the model with higher number of data points input to lower its uncertainty [22]: where N is the number of complete data input to the model, p is the number of input variables, y i andŷ i are ith measured and ith estimated output variable by the model, respectively, and y and y is expected value of the measured valuable and the estimated output, respectively. However, ad jR 2 does not consider the biases in the estimation. Therefore, we further validate the models with MAE and RMSE, where MAE measures the arithmetic mean of the absolute differences between the members of each pair, whilst RMSE calculates the square root of the average squared difference between the forecast and the observation pairs. RMSE is more sensitive to larger errors than MAE [5,24]: where N is the number of complete data input to the model, and y i andŷ i are ith measured and ith estimated response variable by the model, respectively. Models will be first evaluated by ad jR 2 to see how well the estimations fit with the original data with the consideration of degree of freedom. If ad jR 2 have the same score, those models will be ranked by MAE and then RMSE on the basis of absolute error. Apart from ranking, p-values are also checked for each model in order to investigate whether the model is statistically significant. Standard error (SE) and classical p-values are also calculated to examine the significance of estimates of regression coefficients (β).

Case Studies: BC in Street Canyon and Urban Background
In this section, we present the evaluation of our input-adaptive proxy for two real data sets in Helsinki urban region. These two stations represent two different environments and they have different degree of missing data pattern. We chose BC as the variable to be filled as a case study. BC has a range of fine-size particulate matter (PM) and it is emitted as a by-product of combustion. In urban regions, BC is mainly originated from diesel-powered vehicles, wood combustion and long-range transport [34,35]. According to a recent report by WHO [36], BC might not be directly toxic, but it can act as a universal carrier of other chemical components with varying toxicity which can bring severe effects on human health, for example cardiopulmonary, respiratory diseases and diseases that are not related to allergies [37]. BC not only plays an important role in climate change by changing surface albedo when deposited and absorbing solar radiation [38], but also contributes to visibility and local air quality [39]. It has been recommended to include BC alongside with the other air quality parameters in air quality index because BC concentration can associate better with health effects of aerosol particles than just PM, which does not solely originate from combustion sources [36]. Nevertheless, continuous BC measurements are not necessarily carried out in every measurement stations. Therefore, BC is a good case study to deploy the modified proxy as a virtual sensor.

Site Description
Helsinki (60 • 10 N, 24 • 56 E) is the capital of Finland, located on a relatively flat land at the coast of Gulf of Finland. The area of the city is 715 km 2 with about 650,000 inhabitants [40]. The two measurement stations are located at Mäkelänkatu and Kumpula in Helsinki, approximately 1 km from each other ( Figure 2). They represent environments of street canyon and urban background, respectively. The climate type in Helsinki can be classified as continental or maritime based on the air flows and the pressure system [19]. urban regions, BC is mainly originated from diesel-powered vehicles, wood combustion and longrange transport [34,35]. According to a recent report by WHO [36], BC might not be directly toxic, but it can act as a universal carrier of other chemical components with varying toxicity which can bring severe effects on human health, for example cardiopulmonary, respiratory diseases and diseases that are not related to allergies [37]. BC not only plays an important role in climate change by changing surface albedo when deposited and absorbing solar radiation [38], but also contributes to visibility and local air quality [39]. It has been recommended to include BC alongside with the other air quality parameters in air quality index because BC concentration can associate better with health effects of aerosol particles than just PM, which does not solely originate from combustion sources [36]. Nevertheless, continuous BC measurements are not necessarily carried out in every measurement stations. Therefore, BC is a good case study to deploy the modified proxy as a virtual sensor.

Site Description
Helsinki (60°10'N, 24°56'E) is the capital of Finland, located on a relatively flat land at the coast of Gulf of Finland. The area of the city is 715 km 2 with about 650 000 inhabitants [40]. The two measurement stations are located at Mäkelänkatu and Kumpula in Helsinki, approximately 1 km from each other ( Figure 2). They represent environments of street canyon and urban background, respectively. The climate type in Helsinki can be classified as continental or maritime based on the air flows and the pressure system [19]. Mäkelänkatu urabn supersite measurement station (Mäkelänkatu 50) is operated by the Helsinki Region Environmental Services Authority (HSY). The station is located in a street canyon in the immediate vicinity to one of the main streets of the city, 3 km from the Helsinki centre [41]. The street is one of the main roads leading to the city centre and it consists of six lanes and two tramlines. The annual mean traffic volume is 28 000 vehicles per workday. The traffic loads are especially high during rush hours at 8 in the morning and 5 in the afternoon. The buildings on both sides weaken Mäkelänkatu urabn supersite measurement station (Mäkelänkatu 50) is operated by the Helsinki Region Environmental Services Authority (HSY). The station is located in a street canyon in the immediate vicinity to one of the main streets of the city, 3 km from the Helsinki centre [41]. The street is one of the main roads leading to the city centre and it consists of six lanes and two tramlines. The annual mean traffic volume is 28,000 vehicles per workday. The traffic loads are especially high during rush hours at 8 in the morning and 5 in the afternoon. The buildings on both sides weaken the dilution process of pollutants. The supersite measurement station consists of a container (length 8.0 m, width 1.7 m, height 2.7 m), which is equipped with standard air quality measurement devices and other instruments. All the inlets for the measuring devices are located on the top of the container approximately at a height of 2.8 m from the ground level [42].
The Station for Measuring Ecosystem-Atmosphere Relations in Kumpula (SMEAR III) is situated on a rocky hill at 26 m above sea level, about 4 km northeast from the Helsinki centre. The surroundings of this urban background station are heterogeneous, constituting of buildings, small roads, parking lots, patchy forest and low vegetation. Residential activity, including wood combustion, accounts for a large portion of atmospheric pollutant source [19]. Trace gases are measured at a 31-m-high triangular lattice tower. Next to the tower stands a measurement container, where aerosols measurement instrumentation is located. In addition, basic meteorological measurements are made from the rooftop of Physicum building in the University of Helsinki.
Apart from the trace gases, aerosols and meteorological measurements, traffic rates in Helsinki metropolitan area are monitored by the City of Helsinki. The nearest continuous counter station is in proximity to Mäkelänrinne swimming centre, about 600 m south and 1 km north from Kumpula and Mäkelänkatu measurement sites, respectively. The traffic monitoring does not take into account the split between light-and heavy-duty vehicles. It logs number of vehicles along Mäkelänkatu but it is estimated that 30% of traffic may have been diverted to Kumpulankatu before reaching Mäkelänkatu supersite. The actual positions are shown in Figure 2.

Instruments
BC mass concentration was measured by a multi-angle absorption photometer (MAAP) Thermo Scientific 5012 with a PM 1 inlet at a 1-minute time resolution [43]. The MAAP determines the light absorbance from particles deposited on the filter using measurements of both transmittance and reflectance at different angles. The absorbance was converted to BC mass concentration by using a fixed 6.6 m 2 g −1 mass absorption coefficient at wavelength of 637 nm [44].
Lung deposited surface area (LDSA), which is considered as a relevant metric for the negative health effects of aerosol particles [45], were collected with diffusion charging-based Pegasor AQ Urban sensor [46]. Particle mass concentration of diameter less than 2.5 µm (PM 2.5 ) and less than 10 µm (PM 10 ) were measured with continuous ambient particulate monitor TEOM 1405 in Mäkelänkatu and TEOM 1405-D in Kumpula [19]. Aerosol size distribution was also measured with a differential mobility particle sizer (DMPS) in combination of a differential mobility analyser (DMA) and a condensation particle counter (CPC). Mäkelänkatu used Vienna DMA and Airmodus A20 CPC while Kumpula used Twin DMPS (Hauke-type DMA and TSI Model 3025 CPC + Hauke-type DMA and TSI Model 3010 CPC) [47]. The DMPS systems measured particles with diameter 6-800 nm and 3-950 nm, respectively. The DMPS technique is based on the bipolar charging of aerosol particles, followed by classification of particles into size classes according to their electrical mobility (detailed procedures of data treatment of DMPS in [19]).
Nitrogen oxide (NO), nitrogen dioxide (NO 2 ) and their sum NO x in Mäkelänkatu and Kumpula were measured with chemiluminescence analyzers, Horiba APNA-370 and Thermo TEI42S, respectively. Ozone (O 3 ) concentration was taken with UV photometric analyzers Horiba APOA-370 and Thermo Model 49i in Mäkelänkatu and IR-absorption photometer TEI49 in Kumpula. The measurement of carbon monoxide (CO) was collected with non-dispersive IR-absorption analyzer Horiba APMA-360 in Mäkelänkatu and Horiba APMA-370 in Kumpula. Sulphur dioxide (SO 2 ) was only measured in Kumpula with a UV-fluorescence analyzer Horiba APSA-360.
All the meteorological variables, except radiation, in Mäkelänkatu were measured with Vaisala WXT 520 and Vaisala WXT536 weather transmitters. In Kumpula, horizontal wind speed and direction were measured by a Vaisala cup anemometer. Air temperature was measured with a platinum resistant thermometer Pt-100 and relative humidity (RH) with a platimun resistance thermometer and thin film polymer sensor Vaisala DPA500 on the roof of the University of Helsinki building. On the roof, air pressure was measured with a barometer Vaisala DPA500. Global radiation and photosynthetically active radiation (PAR) were also measured at the same place with a net radiometer and photodiode sensor Kipp and Zonen CNR1+PAR lite, respectively. Since there are no radiation measurements in Mäkelänkatu, we used the measurements of global radiation and PAR from Kumpula for the regression step. A list of the measured parameters and their instruments in Mäkelänkatu and Kumpula is demonstrated in Table 1.

Data Pre-Processing
In this study, we retrieved continuous data in the period between 1 January 2017 and 31 December 2018. This equals 730 days in total, which covers all four seasons. Traffic data were logged at 1-hour intervals, except during rush hours when the logging interval is 15 minutes. Hourly values were calculated for the whole measurement period. Global radiation and PAR from Kumpula are also included in Mäkelänkatu data set. Other aerosols, gaseous and meteorological data were also averaged hourly. For easier comparison, the data from all the instruments were synchronized to UTC+2.
Apart from measuring LDSA with the instruments (LDSA ins ), LDSA parameter was calculated from DMPS (LDSA cal ) by a conversion equation with alveolar deposition efficiency and summing up the total LDSA concentration of the aerosol. The particle size-dependent deposition fraction conversion equation was first introduced in the human respiratory tract model on the ICRP report by Bair in 1994 [48]. Kuula et al. [49] revealed that LDSA cal is comparable to the instrumental measurements of LDSA ins by Pegasor. Total particle number concentration (N tot ) and particle modes were extracted from size distribution data. The four particle modes used in this study are nucleation mode, Aitken Sensors 2020, 20, 182 9 of 23 mode, adjusted Aitken mode, which is believed to be the range of BC size distribution [50], and accumulation mode. The range of the size bins is different in Mäkelänkatu and Kumpula, and the total particle number concentration of the whole spectrum of DMPS measurements represents particles with diameter between 6-800 nm and 3-950 nm, respectively. Nucleation mode sums up the particle number concentration of size bins below 0.025 µm (PN 0.025 ). Aitken and adjusted Aitken modes were defined to include particles with diameter between 0.025-0.09 µm (PN 0.09-0.025 ) and 0.030-0.30 µm (PN 0.3-0.03 ), respectively. Accumulation mode was also calculated as size fractions 90 µm or above (PN 1-0.090 ) in both sites.
Since wind direction is a circular variable, we applied trigonometric function sine and cosine to resolve into North-South and East-West vector components, as demonstrated in other literature [19,24]. The numerical values wind direction variables, named as WD-N and WD-E respectively, are zero-direction and eligible for the application of standard linear methods [51]. Variables from two measurement sites were converted into the same unit. PM 2.5 , PM 10 , BC were calculated in mass concentration µg m −3 . Particle modes were reported in number concentration cm −3 and LDSA in µm 2 cm −3 . All gaseous variables were measured in volume concentration ppb.
In order to perform linear regression, response and predictor variables are assumed to be in normal distribution. By checking distribution histograms, most aerosols and gaseous variables, NO, NO 2 , NO x , BC, PM 10 , PM 2.5 , O 3 , SO 2 , CO, two LDSA variables, particle modes and traffic, have skewed distribution and logarithmic transformation were applied as performed in Fernández-Guisuraga's paper in 2016 [24]. Unlike Equation (2), a log-link function was then used to specify the relationship between expected response and other predictors [24]: whereŶ is the output variable, which is BC in this case, β 0 is the intercept of the model, and β i and X i correspond to the ith regression coefficient and ith input variable of the model, respectively.

Data Completeness
The two data sets have different levels of data completeness and their missing data distributions differ. Table 2 outlines the data completeness for the key parameters in the two data sets. In general, data in Mäkelänkatu are more complete (mostly above 90%), except for CO, LDSA cal and particle modes. They cover only 45-80% in winter and spring. LDSA cal and particle modes have the same data completeness because both are converted by DMPS mathematically. The data in Kumpula in the period of spring have relatively low completeness percentage, especially for NO x , BC, PM 10 and LDSA (below 40%). The missing mechanism of most variables is MAR. Traffic data at Mäkelänrinne are used for both sites. The data completeness of traffic data ranges from 69% to 94% in different classes and the data gap is periodic. The meteorological data for both sites are sufficiently complete to support full imputation with the input-adaptive proxy method.

Classification
Since air pollutants show varying patterns due to their meteorological conditions and traffic patterns, we categorized the data into several classes based on seasons and workdays.
A definition of thermal (temperature-based) seasons was used. Spring and autumn are the periods when daily average temperatures are between 0 • C and 10 • C, and winter and summer are when the temperatures are below 0 • C and above 10 • C, respectively [19]. A sinusoid curve was fitted to the daily average temperature time series. The season was determined when the curve passed through the temperature threshold ( Figure 3). According to this definition, winter was found to be from 1 January  Figure 3 presents daily averaged meteorological data for the whole measurement period. Since we used thermal definition to classify season cases, the length of seasons is highly dependent on the daily temperature. Compared with year 2017, the period of the seasonal cycle in 2018 is much shorter, which indicates that the seasonal changes came more quickly. The winter between the year 2017 and 2018 started late in early January in 2018 but the summer season came very early with a high amplitude. It shortened the spring season and lengthened the summer season in 2018. In addition, during 2018 there was a drought episode during May and June. The daily RH mean dropped below 50% for more than one month. Wind speed, air pressure and PAR are comparable for both years, ranging from 1 to 4 ms -1 , from 975 to 1050 hPa and from 0 to 650 W m -2 , respectively. The unexpected heat wave and drought in 2018 may affect the range of confident intervals when we fit the data to the models.  Figure 4 illustrates the monthly (y-axis) and diurnal (x-axis) patterns of measured BC in street canyon in Mäkelänkatu. BC concentrations behave slightly differently in all months on workdays and weekends so we will develop separate proxies. Figure 4a shows two peaks in BC concentration during workdays at 7-9 in the early morning and at 4-6 in the late afternoon in all months due to the elevated traffic counts during working peak hours [46,53,54]. The evening peak during workday is smaller than the morning peak because of the mixing height in the atmosphere [53,54]. The higher mixing height in the evening allows BC to mix into a larger air volume; therefore, a lower concentration at surface is measured. During weekends, only one peak is observed in the late evening (Figure 4b). The evening peak during weekends appears at 17-20 in the wintertime and at 20-22 in the summertime. The boosted nocturnal BC concentration might be owing to the increasing traffic Workdays and weekends are classified for each of the four seasons since they perform differently in terms of traffic rates and corresponding air pollutant concentrations [19,52]. Workday category typically includes all weekdays, excluding all public holidays, while weekend category represents all Saturdays and Sundays, plus public holidays. 502 workdays and 228 weekends were identified during the measurement period. Altogether, the classification generates eight classes and proxies were developed separately.

Results and Discussion
This section describes first the meteorological condition in year 2017 and 2018, in which our training data were measured. We also show the seasonal and diurnal pattern of BC and the correlation between BC and other variables in Section 4.1. Section 4.2 focuses on the evaluation of proxy performance. Figure 3 presents daily averaged meteorological data for the whole measurement period. Since we used thermal definition to classify season cases, the length of seasons is highly dependent on the daily temperature. Compared with year 2017, the period of the seasonal cycle in 2018 is much shorter, which indicates that the seasonal changes came more quickly. The winter between the year 2017 and 2018 started late in early January in 2018 but the summer season came very early with a high amplitude. It shortened the spring season and lengthened the summer season in 2018. In addition, during 2018 there was a drought episode during May and June. The daily RH mean dropped below 50% for more than one month. Wind speed, air pressure and PAR are comparable for both years, ranging from 1 to 4 ms −1 , from 975 to 1050 hPa and from 0 to 650 W m −2 , respectively. The unexpected heat wave and drought in 2018 may affect the range of confident intervals when we fit the data to the models. Figure 4 illustrates the monthly (y-axis) and diurnal (x-axis) patterns of measured BC in street canyon in Mäkelänkatu. BC concentrations behave slightly differently in all months on workdays and weekends so we will develop separate proxies. Figure 4a shows two peaks in BC concentration during workdays at 7-9 in the early morning and at 4-6 in the late afternoon in all months due to the elevated traffic counts during working peak hours [46,53,54]. The evening peak during workday is smaller than the morning peak because of the mixing height in the atmosphere [53,54]. The higher mixing height in the evening allows BC to mix into a larger air volume; therefore, a lower concentration at surface is measured. During weekends, only one peak is observed in the late evening (Figure 4b). The evening peak during weekends appears at 17-20 in the wintertime and at 20-22 in the summertime. The boosted nocturnal BC concentration might be owing to the increasing traffic rates along the daytime, reaching a peak approaching sunset when residents in the city return home. The peak of BC in weekends is much smaller than in workdays. The arithmetic mean of BC in weekends is also approximately 50% less than in workdays (Table 3). In terms of seasonality, considering only workdays, BC concentration in the summer (1.29 ± 0.22 µg m −3 ) has the highest value and spring BC (1.07 ± 0.52 µg m −3 ) has the lowest among all seasons. The difference between the highest and the lowest is about 20%, which is in alignment of the observation by Helin et al. [35] who suggested the lack of BC seasonal variation in traffic environment. and spring BC (1.07 ± 0.52 µg m -3 ) has the lowest among all seasons. The difference between the highest and the lowest is about 20%, which is in alignment of the observation by Helin et al. [35] who suggested the lack of BC seasonal variation in traffic environment.

Variable Characterization
As an urban background station in Kumpula, the source of BC is more varied, not only vehicle exhaust from traffic roads, but also residential wood combustion. Since there is a forest between the station and a nearby road, it may block part of the BC from traffic. The long distance from the source could also increase dilution rate of BC. The two possible factors enhance the relative influence from residential activity. The discrepancies between workdays and weekends are smaller (10-20%) in all seasons because the direct effects by traffic is milder than Mäkelänkatu. Winter appears to have the highest BC (0.67 ± 0.38 µg m −3 ) among all seasons because of the lower mixing and elevated wood combustion by domestic heating [55] (Table 3). Another possible reason is that cold-start of a car engine at low ambient temperature may enhance the BC emissions from light-duty vehicles [56]. The overall BC mass concentration is lower than in street canyon by 30% in the winter and more than 50% in the summer during workdays.    As an urban background station in Kumpula, the source of BC is more varied, not only vehicle exhaust from traffic roads, but also residential wood combustion. Since there is a forest between the station and a nearby road, it may block part of the BC from traffic. The long distance from the source could also increase dilution rate of BC. The two possible factors enhance the relative influence from residential activity. The discrepancies between workdays and weekends are smaller (10-20%) in all seasons because the direct effects by traffic is milder than Mäkelänkatu. Winter appears to have the highest BC (0.67 ± 0.38 µg m −3 ) among all seasons because of the lower mixing and elevated wood combustion by domestic heating [55] (Table 3). Another possible reason is that cold-start of a car engine at low ambient temperature may enhance the BC emissions from light-duty vehicles [56]. The overall BC mass concentration is lower than in street canyon by 30% in the winter and more than 50% in the summer during workdays.
In order to find out the most linearly correlated terms with BC, we computed their absolute Pearson correlation coefficient as in two matrixplots ( Figure 5). The redness of each box represents the degree of correlation. The logarithm of BC is highly correlated with part of the gaseous and aerosol variables in both sites.
The correlations are fairly good in all eight classes in street canyon, in particular, the three NO x variables, particle modes and the two LDSA variables (R = 0.5-0.9). We also notice that the correlation on workdays is much higher than that on weekends. It is due to the fact that a great portion of NO x and the number of particles also come from traffic as BC emissions do [35]. Therefore, the amount of traffic is supposed to be able to explain for the discrepancy between the case on workdays and weekends. Unexpectedly, the correlations with traffic rates are just moderate (R = 0.3-0.7). The reason might be that the traffic counter is few hundred meters north to the supersite and the count does not fully represent the number of vehicles passing the station. Additionally, the traffic counter does not separate between different types of vehicles. Another reason for the low impact of the traffic counts might be that the weather conditions cause fluctuation. Even so, the difference in correlation between workdays and weekends are more significant than other variables in all seasons.
In urban background, the correlations of BC with gaseous compounds and aerosols are generally lower. No clear discrepancies between workdays and weekends are observed. However, in winter and spring, the correlations of BC with the two LDSA variables and accumulation mode are unexpectedly high. The adjusted Aitken mode, unexpectedly, does not exhibit a noticeable correlation with BC as suggested in previous study [50]. The reason might be that the correlation relationship is location-specific and the source of BC in different cities might not be the same.  Figure 6 presents the combined scores without any classification for every measured variable in descending order in both sites. Irrelevant variables ( < 0.1) were discarded from the whole data set and were omitted from further analysis in order to minimize the computational time. We did not pick a higher threshold because the variables with low scores might act as an accessory input variable in proxy models in the combination with other more-correlated variables. Eventually, we selected 19 variables in both Mäkelänkatu and Kumpula for the model creation step. The linear association of BC and meteorological data remain low in both sites [57]. Wind speed ranks top (R = 0.2-0.5) among the other meteorological variables [34,54]. On the other hand, the two wind direction variables (WD-N and WD-E) perform very little correlation with BC (R < 0.1). In street canyon, wind speed appears to correlate better with BC in weekends (R = 0.3-0.5) than in workdays (R = 0.2-0.3) while radiation variables perform in an opposite way. No clear discrepancies between workdays and weekends are observed in the urban background environment. Figure 6 presents the combined R scores without any classification for every measured variable in descending order in both sites. Irrelevant variables (R < 0.1) were discarded from the whole data set and were omitted from further analysis in order to minimize the computational time. We did not pick a higher threshold because the variables with low R scores might act as an accessory input variable in proxy models in the combination with other more-correlated variables. Eventually, we selected 19 variables in both Mäkelänkatu and Kumpula for the model creation step.

Performance of the Imputation Proxies
In general, the traditional proxy in the street canyon gives fairly good continuous estimations, with ranging from 0.86 to 0.94 (Table 4), mostly less than 0.01 and ranging from 0.19 to 0.28. The value of is slightly higher in workdays than in weekends. The reason might be that the number of workday samples is much higher than that weekend samples, which lowers the uncertainties. No obvious seasonal discrepancies are found in terms of model evaluation. A noticeably high , however, is found in autumn weekends. After ranking the different models by evaluation attributes described in Section 2.3, the best performing OLS models give the same results in seven out of eight classes, such that all three input variables are traffic counts, NOx and accumulation mode. The high resemblance is because BC source in street canyon is mainly fuel combustion from traffic, which at the same time emits NOx to the atmosphere, despite the time variation. The only exception is found in winter weekends, in which CO appears to better explain for BC instead of NOx. Due to the similarities in the model selection in all classes, classification may not be needed in Mäkelänkatu to maintain its simplicity.
In Kumpula, the urban background station gives a slightly worse than in the street canyon, ranging from 0.74 to 0.91 ( Table 5). The in spring and autumn weekends is exceptionally high, 0.056 and 0.036, respectively. The in spring and summer is higher than 0.4, which indicates more largely biased fitted points are found in these seasons. Unlike street canyon, the input variables that explain BC are varying in urban background station. Accumulation mode appears to play an important role in explaining BC in most classes. CO and LDSA are also significant in certain seasons. Temperature is the only meteorological input variables to be included in the best performing regression models. Due to the rule of only including three input variables to the model and various BC sources in urban background, the regression proxy could not include all relevant input variables to the models. Besides, in this data set, we did not include any variables as a direct indicator to residential activity; therefore, the performance is poorer.

Performance of the Imputation Proxies
In general, the traditional proxy in the street canyon gives fairly good continuous estimations, with ad jR 2 ranging from 0.86 to 0.94 (Table 4), MAE mostly less than 0.01 and RMSE ranging from 0.19 to 0.28. The value of ad jR 2 is slightly higher in workdays than in weekends. The reason might be that the number of workday samples is much higher than that weekend samples, which lowers the uncertainties. No obvious seasonal discrepancies are found in terms of model evaluation. A noticeably high MAE, however, is found in autumn weekends. After ranking the different models by evaluation attributes described in Section 2.3, the best performing OLS models give the same results in seven out of eight classes, such that all three input variables are traffic counts, NO x and accumulation mode. The high resemblance is because BC source in street canyon is mainly fuel combustion from traffic, which at the same time emits NO x to the atmosphere, despite the time variation. The only exception is found in winter weekends, in which CO appears to better explain for BC instead of NO x . Due to the similarities in the model selection in all classes, classification may not be needed in Mäkelänkatu to maintain its simplicity.
In Kumpula, the urban background station gives a slightly worse ad jR 2 than in the street canyon, ranging from 0.74 to 0.91 ( Table 5). The MAE in spring and autumn weekends is exceptionally high, 0.056 and 0.036, respectively. The RMSE in spring and summer is higher than 0.4, which indicates more largely biased fitted points are found in these seasons. Unlike street canyon, the input variables that explain BC are varying in urban background station. Accumulation mode appears to play an important role in explaining BC in most classes. CO and LDSA are also significant in certain seasons. Temperature is the only meteorological input variables to be included in the best performing regression models. Due to the rule of only including three input variables to the model and various BC sources in urban background, the regression proxy could not include all relevant input variables to the models. Besides, in this data set, we did not include any variables as a direct indicator to residential activity; therefore, the performance is poorer.  Note that the above only show the performance of the traditional proxy, i.e., the best performing model in the input-adaptive proxy. Figure 7 better illustrates how different the two proxies behave. During the period from 1 January to 20 January 2018, the proxy BC fits fairly well and follows the daily cycle. However, it occasionally underestimates at peaks and overestimates at toughs compared to the original data (black line). This serves as one of the drawbacks of using robust OLS fitting in linear regression because this technique penalizes outliers and makes them less influential to the regression in order to optimize the estimation of regression slopes [28]. In this way, the robust method sacrifices the estimation at extreme data points. Additionally, the estimation by traditional proxy is not always available and, in this study, only 64.5% of BC concentration in Mäkelänkatu (Figure 7a) and 54.7% in Kumpula (Figure 7b) are estimable by the method. Note that the above only show the performance of the traditional proxy, i.e., the best performing model in the input-adaptive proxy. Figure 7 better illustrates how different the two proxies behave. During the period from 1 January to 20 January 2018, the proxy BC fits fairly well and follows the daily cycle. However, it occasionally underestimates at peaks and overestimates at toughs compared to the original data (black line). This serves as one of the drawbacks of using robust OLS fitting in linear regression because this technique penalizes outliers and makes them less influential to the regression in order to optimize the estimation of regression slopes [28]. In this way, the robust method sacrifices the estimation at extreme data points. Additionally, the estimation by traditional proxy is not always available and, in this study, only 64.5% of BC concentration in Mäkelänkatu (Figure 7a) and 54.7% in Kumpula (Figure 7b) are estimable by the method. The missing gap will then be substituted by input-adaptive proxy (red line), such that proxy BC gives continuous estimation. The fitting performance does not show obvious difference in the classification of workdays and weekends. The performance in the two locations does not differ either. On top of the diurnal pattern on individual days, both traditional and input-adaptive proxies estimate the averaged diurnal pattern with a satisfactorily high accuracy (Figure 8). In Mäkelänkatu, the diurnal fitting by both traditional and input-adaptive proxies for workdays in all seasons performs well (Figure 8a). The highest deviation in the diurnal cycle takes place in winter weekends (Figure 8b) where only about 20% of data is fitted with the traditional model. However, the input-adaptive proxy shows very similar diurnal cycle as the measured BC in the same class. This is because all data are fitted with the top 10 best models in the input-adaptive proxy ( Figure  9), so remains high.   In Mäkelänkatu, the diurnal fitting by both traditional and input-adaptive proxies for workdays in all seasons performs well (Figure 8a). The highest deviation in the diurnal cycle takes place in winter weekends (Figure 8b) where only about 20% of data is fitted with the traditional model. However, the input-adaptive proxy shows very similar diurnal cycle as the measured BC in the same class. This is because all data are fitted with the top 10 best models in the input-adaptive proxy (Figure 9), so ad jR 2 remains high. model can estimate 65% of the total data points in winter workdays, it is only able to fit less than 20% in winter weekends. However, the rest of the data points in winter weekends manage to be estimated by the other top ten best models; therefore, the overall coefficient of determination remains high. In the urban background site in Kumpula, the probability of using the best model is much lower on weekends than on workdays in all seasons (Figure 9b). Winter appears to have the highest percentage of estimation by the best model. The other seasons do not show obvious deviation.  Table 6 shows the input variables and the corresponding evaluation attributes in winter workdays in Mäkelänkatu as an example. Accumulation mode, which was found to resemble the pattern of BC in Helsinki in previous studies [34,58], is found in every single model in the table. Other variables are NOx, NO2, traffic counts and some meteorological variables. Although meteorological variables have relatively low correlation coefficient with BC, they still manage to play a role as an accessory input variable in the top ten models. Because of the lower dependency of meteorological data with other air quality variables [57], the meteorological variables survive in the model rejection tests and guarantee the estimation in input-adaptive proxy. This serves an example to suggest that individual does not necessarily reflect the contribution in models with more than one input variable. In Kumpula, the estimation has a greater off in diurnal cycle for both proxies with bigger standard derivation. Figure 8c shows a bigger deviation in urban background between the original BC diurnal pattern and that by both proxies compared to street canyon environment. The input-adaptive proxy also appears to have a shift in estimating morning peak in spring and summer workday. The peaks by the input-adaptive proxy appear 1-2 hours earlier than the original data. The reasons remain unknown. For weekends, although there are no as clear diurnal patter as in the others, both traditional and input-adaptive proxies do not model the original data as well as the others, except the winter classification (Figure 8d). Overestimation of traditional proxy BC is found in spring and autumn weekends, and it might be attributed to the insufficiency of successful estimation in each hour and thus influencing the averaged values. Only less than 20% of data are successfully estimated in these two classes (Figure 9). Figure 9 shows the probability of data points fitted by models of input-adaptive proxy with different ranks in both locations. In Mäkelänkatu, about 20-90% of data points are fitted by the best model depending on different seasons (Figure 9a). The pattern of model ranking probability is similar on both workdays and weekends in all seasons except winter. While the best model can estimate 65% of the total data points in winter workdays, it is only able to fit less than 20% in winter weekends. However, the rest of the data points in winter weekends manage to be estimated by the other top ten best models; therefore, the overall coefficient of determination remains high. In the urban background site in Kumpula, the probability of using the best model is much lower on weekends than on workdays in all seasons (Figure 9b). Winter appears to have the highest percentage of estimation by the best model. The other seasons do not show obvious deviation. Table 6 shows the input variables and the corresponding evaluation attributes in winter workdays in Mäkelänkatu as an example. Accumulation mode, which was found to resemble the pattern of BC in Helsinki in previous studies [34,58], is found in every single model in the table. Other variables are NO x , NO 2 , traffic counts and some meteorological variables. Although meteorological variables have relatively low correlation coefficient R with BC, they still manage to play a role as an accessory input variable in the top ten models. Because of the lower dependency of meteorological data with other air quality variables [57], the meteorological variables survive in the model rejection tests and guarantee the estimation in input-adaptive proxy. This serves an example to suggest that individual R does not necessarily reflect the contribution in models with more than one input variable.
Furthermore, the evaluation attributions in the top ten models are very close to each other. The ad jR 2 scores in the top ten models only vary in a range of 2%. If we choose MAE or RMSE over ad jR 2 to be the first ranking criteria, the outcomes of the model rank become very different. The overall performance difference in the street canyon in Mäkelänkatu and the urban background in Kumpula indicates that input-adaptive proxy is location-specific because of their different pollutant sources. Street canyon has been suggested to have relatively constant BC source [35], so the regression performance in street canyon is overall better than in urban background. Model performance also depends on the characteristics of missing data patterns, like gap lengths and number of complete rows, but not the amount of missing data, suggested by Junninen et al. [4] and Junger and Ponce de Leon [9].
However, the proxy accuracy might drift when we train the same type of data set in different periods. Therefore, we should train longer historical data to lower the uncertainties of the proxy. The prolonged drought in year 2018 might influence the BC variation and thus reducing the proxy reliability in the future.
Since this input-adaptive is simple yet reasonable and manages to give continuous estimation, it is applicable to fill up gaps in air quality data and act as a virtual sensor of air pollutant. In case one of the input variables have measurement failure, we can still model the output variable with a relatively reliable estimate. Ideally, this input-adaptive proxy could co-exist side-by-side with on-site measurements in support of each other. In case of unavailability of on-site measurements, input-adaptive proxy can act as a virtual sensor. The computed output variable, in this case BC, is also useful in filling in missing pieces in air quality index.

Conclusions
In this study, we presented a modified method to impute missing data by input-adaptive proxy, which manages to fill up all the voids and give reasonable estimates. We used BC concentration in Mäkelänkatu traffic site and Kumpula urban background site as a case study. We then created models with selected input variables, which were determined by Pearson correlation coefficient. We evaluated and ranked the models by their ad jR 2 , MAE and RMSE. By using the traditional proxy method, about 20-80% of missing BC data was filled up. The input-adaptive proxy selected input variables and managed to fill up all the missing voids. In addition, the overall performance is reliable, with ad jR 2 of 0.86 to 0.94 in the street canyon and 0.74 to 0.91 in the urban background. Its flexibility and reliability ascertain for the input-adaptive proxy to act as a virtual sensor.
Regression performance is better in the street canyon because BC can be mostly explained by traffic counts, NO x and accumulation mode. The BC source in urban background is heterogeneous, which includes traffic and residential wood burning. It is also affected by some meteorological parameters. The diurnal pattern in BC is clearly different on workdays from weekends. The proxy performance in workdays is slightly better than in weekends in both sites. In the street canyon, the proxy works quite similarly in all seasons, while in the urban background station, the proxy performance in spring and summer is poorer than the other seasons. Although the overall regression performance seems promising, the results can be altered by choosing other ranking criteria in the model evaluation scheme.
The input-adaptive proxy could be modified by inserting autoregression term [15,23,59] or time variables [4,10] because the dataset is a time series and air quality variables have a strong time dependency. Although this modification would add slight computational complexity to the model, the eight classes in this study are no longer needed. Another modification is that we could investigate their

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: