Low-Cost Air Quality Sensing towards Smart Homes

The evolution of low-cost sensors (LCSs) has made real-time spatio-temporal mapping of indoor air quality (IAQ) possible, but the diversity of available LCSs makes their selection challenging. Converting individual sensors into a sensing network requires knowledge from diverse research disciplines, which we aim to bring together by making IAQ an advanced feature of smart homes. The aim of this review is to discuss advanced home automation technologies for the monitoring and control of IAQ through networked air pollution LCSs. The key steps that allow conventional homes to be transformed into smart homes are sensor selection, deployment strategies, data processing, and development of predictive models. A detailed synthesis of air pollution LCSs allowed us to summarise their advantages and drawbacks for spatio-temporal mapping of IAQ. We concluded that performance evaluation of LCSs under controlled laboratory conditions prior to deployment is recommended for quality assurance/control (QA/QC); however, routine calibration or statistical techniques applied during operation, especially during long-term monitoring, are required for a network of sensors. The deployment height of sensors could be varied purposefully according to the location and the exposure height of the occupants inside home environments for spatio-temporal mapping. Appropriate data processing tools are needed to handle large amounts of multivariate data and to automate pre-/post-processing tasks, leading to more scalable, reliable and adaptable solutions. The review also showed the potential of machine learning techniques for predicting spatio-temporal IAQ in LCS networked systems.


Introduction
Indoor air pollution is ranked among the top five environmental public health risks that cause morbidity and mortality globally. The majority of people spend more than 90% of their time in indoor environments [1,2], where poor indoor air quality (IAQ) can cause a variety of adverse health effects [3,4]. The time spent indoors increased significantly in 2020 due to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, when people were advised to 'stay home stay safe' to protect health workers [5,6]. Table 1 summarises common indoor air pollutants, their sources, and current guidelines to maintain IAQ. Air pollutants inside indoor environments can be generated from different sources, including occupants' exhalation (carbon dioxide; CO2), activities such as cooking and smoking, emissions from building materials, etc. from which various air

Table 1. The unhealthy exposure thresholds defined for the common indoor and outdoor air pollutants [11–13].

Table 1 columns: Pollutants; Indoor Air; Outdoor Air; References. Surviving cell: N/A [17].

As summarised in Figure 1, the article starts with a brief review of the common indoor air pollutants to develop the background context for the topic areas covered (Section 3), followed by the state-of-the-art air pollution sensing technologies for indoor environments, considering their performance and drawbacks (Section 4). The subsequent section (Section 5) explores optimal deployment strategies to better capture the spatio-temporal distribution of indoor air pollutants and addresses how these strategies could be built to improve the accuracy and reliability of data. Section 6 summarises pre- and post-processing methods and tools for dealing with LCS data. Section 7 describes the suitability of advanced IAQ predictive models for indoor settings. Finally, a summary of the topic areas covered, conclusions and future remarks are presented in Section 8.

Figure 1. Essential steps toward a successful implementation of smart indoor sensor network in achieving appreciable indoor air quality (IAQ) and health benefits to home occupants.

Common Indoor Air Pollutants and Their Sources
IAQ is affected by a diverse range of indoor sources as well as by infiltration of outdoor air pollutants. Each source could impact the overall IAQ, depending on its intensity and operational time (see Table 1). The most common indoor air pollutants arising from indoor occupants, activities and/or materials are CO2, CO, VOCs, and PM in different aerodynamic size fractions, including PM ≤2.5 µm (PM2.5) and ≤10 µm (PM10). Although there can be other pollutants, such as polycyclic aromatic hydrocarbons (PAHs; specifically, benzo[a]pyrene), nitrogen oxides (NOx = NO + NO2), ozone (O3), sulphur dioxide (SO2), formaldehyde (HCHO), radon and persistent organic pollutants (POPs), the presence of all of these components in one place is unlikely. In addition, thermal comfort and related parameters, such as temperature, air velocity, relative humidity (RH), noise and lighting levels, make the living environment pleasant for the occupants when kept under control. Hence, a flow of clean air throughout a building environment is necessary to minimise the risk of accumulation of indoor air pollutants.

Sensor Technology
Assessing the existing IAQ and unexpected changes in its level through continuous measurement is necessary to know the status of IAQ and its effects on the occupants' health. Sensing IAQ with the help of LCSs could serve as the core of smart homes and be counted as one of the major components for maintaining high-quality living standards. The desirable sensors in smart homes should: (i) be sensitive and selective to target pollutants that pose health risks to occupants, for reliable sensing relevant to indoor environments; (ii) be durable, with optimal performance over long-term deployment; (iii) be small in size and maintenance-free, with low power consumption; (iv) be adoptable in complex sensor networks; and (v) work quietly with minimum operating noise [11,12,33–37]. These features enable air pollution sensors to be deployed with relative ease in locations where understanding the air quality level could have a huge impact on human health. However, LCSs come with challenges, which may reduce user trust as well as the accuracy and interpretability of recorded data [12,38]. If their quality remained unchanged under realistic conditions, they could become a game-changer in various IAQ measurements [39].

Electrochemical Sensors
Electrochemical technology is one of the oldest and perhaps most widely used technologies for concentration measurements of gaseous pollutants, using either potentiometric (measuring a difference of potentials) or amperometric (measuring the current of a redox reaction) principles. Fundamentally, electrochemical sensors (ECs) require at least two electrodes (reference and counter electrodes) and operate based on a chemical reaction between a gaseous pollutant in the air and an electrode in an electrolyte. The sensors are coated with a catalyst that provides a high surface area, which promotes reactions [34]. Recent ECs contain a cell with three electrodes (measuring, reference and counter electrodes), which host the reduction/oxidation of chosen gases. In this technology, the sample gas diffuses through the sensor's membranes towards the measuring electrode, which results in an electron transfer (producing an internal current). Recently, some sensor manufacturers (e.g., those of AlphaSense and Membrapor, Wallisellen, Switzerland) have upgraded ECs by adding a fourth electrode to monitor physical changes and measure drift [40].
ECs offer comparatively low cost, high sensitivity with low cross-sensitivity, a low detection limit (~sub-ppm), reasonable response time, and low power consumption (µW) compared to traditional monitors [34]. Additionally, stability with acceptable drift values (between 2% and 15% per year) has been reported for commercial ECs (e.g., Nemoto and SGX Sensortech) [40]. However, they are more complicated, vulnerable to poisoning, larger in size, shorter in life span (~1-3 years), and more expensive than metal oxide semiconductor (MOx) gas sensors (see Section 4.2). As listed in Table 2, ECs have shown interference from changes in meteorology (e.g., air temperature), which has a first-order impact on the electrical output signal for gas concentration (ppb level) and a second-order effect on gas sensitivity. Low temperatures decrease the speed of reaction in electrochemical cells, which reduces their applicability in cold environments (<10 °C). However, solutions exist to compensate for the effects of temperature on background currents (zero currents), which would otherwise have a significant impact on measurements at low concentration levels [41].

Metal Oxide Semiconductor (MOx) Sensors
In MOx sensors, gaseous air pollutants react with the sensor surface and change its electrical properties (resistance or conductivity) [44,45]. The measured change in electrical properties represents the concentration of the target pollutant in the air. Because of advances in fabrication methods and the simplicity of semiconductor sensor devices, MOx gas sensors are moderately low-priced compared to other technologies (cheaper than ECs). MOx sensors are robust, lightweight, long-lasting, sensitive to low-concentration gases (as low as ppb level), and have low power consumption (less than 1 W), although higher than that of PIDs (photoionisation detectors; see Section 4.3) [46–48].
Simple, fast and easily controllable production processes on a large scale make MOx gas sensors a desirable technology for air quality monitoring. MOx gas sensors have been reported to be sensitive to a variety of air pollutants [48], with responses changing with the concentration of gaseous pollutants and the device operating temperature [46]. MOx gas sensors have been used to measure/monitor trace amounts of gaseous pollutants, such as CO, CO2, O3, total VOCs, ammonia (NH3) and NOx [46,49]. However, non-linear output signals, cross-sensitivity to other gases (especially to changes in environmental conditions and other VOC substances in complex mixtures), and poisoning by certain gases or by high doses of target gases (e.g., high concentrations of certain organic compounds and gaseous sulphur-containing substances) have been discussed in the literature [12,48,50,51].

Table 2 (surviving fragment; √ = advantage, × = drawback).
√ Able to identify particle size in the PM10 and PM2.5 fractions.
× Conversion from PM counts to PM mass relies on a theoretical model.
× The measured signal depends on a variety of parameters such as particle shape, colour and density, RH, refractive index, etc.
× Unable to detect ultrafine particles (footnote 6).
Response time 20-120 s.
√ Limited drift over time of the sensor calibration.
× Need for correction for the effects of temperature, RH and pressure.

1 Photoionisation detectors (PIDs) demonstrate a better sensitivity than electrochemical cells for volatile organic compounds (VOCs) (ranging from 100 ppb to 20 ppm).
2 Depends on the air temperature [40].
3 The interference caused by temperature can be compensated.
4 MOx should not be used to measure low concentrations of VOCs in the presence of high concentrations of NO, NO2 or CO. MOx sensors are suitable for sensing VOCs that are not detected by PIDs (e.g., many chlorofluorocarbons (CFCs)) [40].
5 An empirical relation for drift or stability corrections has been suggested [40,44].
6 No LCS is available that can detect ultrafine particles (<100 nm in diameter), because the optical systems are unable to detect particles <300 nm [42].

Note 1: Near real-time monitoring in indoor environments is required to capture immediate incidents and to adopt precautionary and corrective measures, but not all the sensors discussed above respond fast enough to concentration changes. Currently, a reasonable averaging time among the deployed sensors is 30 s and/or 1 min, as per published studies in the literature. In addition, a balance should be maintained between sampling frequency and power source.

Photoionisation Detectors (PIDs)
The PID is another type of LCS, which uses high-energy photons (ultraviolet (UV) light) for the ionisation of gaseous molecules [40]. The main principle is that the gas between the electrodes is ionised by UV light (with photon energies on the order of 10 eV) to produce charged ions. The output signal is proportional to the number of ions produced and hence to the pollutant concentration in the detector. Due to their high sensitivity, PIDs are extensively used for the detection of VOCs; each VOC component has its own ionisation potential (IP). IPs range from easy-to-ionise substances (~7 eV) to extremely difficult-to-ionise substances (~12-16 eV). For example, PIDs effectively detect most hazardous gases, including VOCs (e.g., benzene = 9.25; hexane = 10.13; toluene = 8.82; and xylene = 8.56 eV), due to their low IPs, and offer a range of benefits, such as fast response, small size, ease of use/maintenance, and the ability to detect low concentrations. However, PIDs cannot detect the main air constituents (O2 and N2), CO2, CO, SO2, CH4, and O3 due to their high IPs.
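The detection rule described above (a compound is ionised only if its IP is below the photon energy of the lamp) can be sketched in a few lines. This is a minimal illustration, not a manufacturer's method: the IP values are the four quoted in the text, while the default lamp energy of 10.6 eV and the function names are assumptions made for the example.

```python
# Ionisation potentials (eV) quoted in the text for four VOCs.
IONISATION_POTENTIAL_EV = {
    "benzene": 9.25,
    "hexane": 10.13,
    "toluene": 8.82,
    "xylene": 8.56,
}


def detectable_by_pid(compound, lamp_energy_ev=10.6):
    """Return True if the lamp photons can ionise the compound.

    lamp_energy_ev is an assumed lamp rating, not a value from the text.
    """
    return IONISATION_POTENTIAL_EV[compound] < lamp_energy_ev


def detectable_compounds(lamp_energy_ev):
    """List (alphabetically) the tabulated compounds a given lamp can ionise."""
    return sorted(
        name
        for name, ip in IONISATION_POTENTIAL_EV.items()
        if ip < lamp_energy_ev
    )


if __name__ == "__main__":
    for energy in (9.0, 10.6):
        print(energy, "eV lamp:", detectable_compounds(energy))
```

With these assumptions, a lower-energy lamp would miss hexane and benzene, which is one way to see why the lamp rating matters when selecting a PID for a given set of target VOCs.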

Optical Sensors
Optical sensors, also called light-scattering sensors, are used for the detection of PM. Light-scattering PM sensors measure the optical properties of the particles as an ensemble, which offers fast, real-time responses and minimal drift, and greatly reduces the cost and size of the sensors [52–55]. Their small size, low energy consumption (supply voltage of ~5 V) and ability to generate high-frequency output data during operation make optical sensors good candidates in various applications [56,57]. However, variations in PM2.5 measurements at low concentrations (20-30 µg m⁻³) among different optical PM sensors, relative to reference instruments, could be a major drawback of sensors of this type. This is because the amount of scattered light depends on the size, shape, density, and refractive index of the particles [58]. Despite these limitations, reliable functioning of optical PM sensors in indoor environments with a small spatial scale has been reported [59].
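Because optical sensors count particles rather than weigh them, a mass concentration has to be derived from the counts with a theoretical model. The sketch below shows one simple form of such a conversion, assuming spherical particles of a single uniform density; both assumptions, the default density value and the function name are illustrative and are not taken from any specific sensor.

```python
import math


def mass_concentration_ug_m3(size_bins, density_g_cm3=1.0):
    """Convert binned particle counts to a mass concentration in ug/m3.

    size_bins: iterable of (diameter_um, count_per_cm3) pairs, where the
               diameter is a representative value for the bin.
    density_g_cm3: assumed uniform particle density (hypothetical default).

    With the diameter in um, the count in particles/cm3 and the density in
    g/cm3, the unit conversion factors cancel and the product
    count * density * volume_um3 is already expressed in ug/m3.
    """
    total = 0.0
    for diameter_um, count_per_cm3 in size_bins:
        sphere_volume_um3 = math.pi / 6.0 * diameter_um ** 3
        total += count_per_cm3 * density_g_cm3 * sphere_volume_um3
    return total


if __name__ == "__main__":
    # Hypothetical counts in three size bins (diameter in um, count per cm3).
    bins = [(0.5, 200.0), (1.0, 40.0), (2.5, 2.0)]
    print(round(mass_concentration_ug_m3(bins, density_g_cm3=1.5), 2))
```

The cubic dependence on diameter illustrates why errors in sizing, shape or density assumptions propagate strongly into the reported PM mass.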

Sensor Selection
Putting multiple sensors together onto boards, then calibrating and packaging them as commercial products for indoor (or outdoor) applications, has been a common practice. Such sensor-based products are becoming increasingly available, while information about their lifetime and maintenance is not clearly available. Table 2 (sensors) and Table 3 (commercial sensor-based products) summarise the specifications of technologies on the market whose performance has been evaluated by at least one indoor study. Moreover, the manufacturer's specifications obtained from technical datasheets, such as type of pollutant, technology, measuring range, reported sensor lifetime, sampling mechanism, sampling interval, environmental operating range, and connectivity, have been summarised in Table 3.
Atmosphere 2021, 12, 453
Studies showed that sensor correlations against research-grade instruments could vary before and/or after deployment, even for identical sensors under identical conditions [60–63]. Furthermore, the effects of environmental conditions (temperature and RH) and cross-sensitivities to certain pollutants (e.g., NO2 gas on O3 sensors, NO gas on NO2 sensors, and hydrogen molecules on CO sensors) on sensor readings have been imperfectly addressed [34,38,64–66]. In other words, due to the lack of regulatory bodies, questions are raised about their reported values, reproducibility and comparability. However, significant progress has been made in this direction in the recent past. For example, the Air Quality Sensor Performance Evaluation Centre (AQ-SPEC) operated by the South Coast Air Quality Management District (SCAQMD) [67,68], the US EPA Air Sensor Toolbox [69], and the EU Joint Research Centre (EU JRC) [50,70] programs have been initiated to quantitatively evaluate the performance, stability and quality assurance/control (QA/QC) of sensor-based products. Beyond in-field co-location, field normalisation or field calibration with reference instruments [71–73], recent studies have shown a more convenient alternative that can be utilised to improve the QA/QC of readings. Affordable laboratory facilities, such as the Envilution® chamber, are currently offered by academic and research institutions to calibrate and evaluate the performance of LCSs before and after deployment under controlled environments [73]. Here, a controlled environment is defined as a situation where changes in environmental conditions and pollution concentrations, representing indoor environments, can be simulated (controlled) inside the chamber for testing LCSs. Therefore, LCS performance can be assessed under a combination of indoor variations in environmental parameters and pollution concentrations.
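Calibration against a reference instrument, whether in a chamber or by co-location, often starts from a simple regression between the two sets of readings. The following sketch fits an ordinary least-squares line to paired readings and reports the coefficient of determination. It is only an illustration under the assumption of a linear sensor response; the function names and the sample readings are hypothetical, and real calibrations frequently add temperature and RH terms.

```python
def fit_linear_calibration(sensor, reference):
    """Least-squares fit of reference = slope * sensor + intercept."""
    n = len(sensor)
    mean_x = sum(sensor) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in sensor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sensor, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(sensor, reference, slope, intercept):
    """Coefficient of determination of the calibrated sensor vs. reference."""
    mean_y = sum(reference) / len(reference)
    ss_res = sum(
        (y - (slope * x + intercept)) ** 2 for x, y in zip(sensor, reference)
    )
    ss_tot = sum((y - mean_y) ** 2 for y in reference)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical paired readings (same units) from a co-location period.
    raw = [12.0, 18.5, 25.1, 33.0, 41.2]
    ref = [10.1, 15.0, 20.3, 26.8, 33.0]
    slope, intercept = fit_linear_calibration(raw, ref)
    print(slope, intercept, r_squared(raw, ref, slope, intercept))
```

Repeating the fit before and after deployment gives a simple check on whether the sensor-reference relationship has shifted over time.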
In-field co-location would be an alternative QA/QC measure after deployment. Moreover, routine calibration checks after deployment for simple networks, along with advanced statistical techniques (e.g., data consistency checks, network correlations, and principal components analysis) in complex networks (Section 7), can help the system maintain satisfactory long-term performance. Such platforms, initiatives and programs offer support in obtaining reliable data through the use of appropriate sensors, which could improve personal exposure estimates in home environments.

Table 3. Specifications of sensor-based products (both single- and multiple-purpose units), as reported by manufacturers, that are available on the market and could be used for IAQ and/or personal indoor exposure monitoring systems. The authors strongly advise buyers to check the up-to-date specifications of the sensors prior to selection and do not endorse any brand or product.

Deployment Strategies
Enclosed environments such as homes trap more polluted air than open environments due to the presence of indoor sources, lack of free-flow air circulation and inadequate ventilation. In addition, different exposure levels to indoor air pollutants have been reported for individuals even at the same location [32,74]. Unfortunately, conventional monitoring devices are unable to satisfactorily capture spatial variation and map instantaneous changes in IAQ because of the associated cost, non-scalability, and lack of spatio-temporal mapping of indoor air pollutants [11,30,75,76]. Given this potential and the demand for such technologies, the emergence of LCSs has changed the landscape of IAQ monitoring systems, where specific sensors and sensor-based products are manufactured and designed for indoor applications (see Tables 2 and 3, respectively). Although air pollution sensors have some drawbacks (Table 2), the relatively smaller changes in environmental parameters and the lower complexity of indoor air flow patterns compared to outdoor environments could be beneficial for using them indoors. Table 4 summarises examples of sensor applications in indoor environments, in which less attention has been given to strategies for the spatio-temporal distribution of multiple air pollutants. The objectives of the reviewed studies were limited to the performance evaluation of sensors in enclosed environments, where near-source monitoring or the common practice of sampling at an adult's breathing height is inadequate to assess overall IAQ [35,77–79]. Considering indoor arrangements and the relationship between indoor and outdoor environments [80], here we focus our efforts on developing suitable strategies for indoor environments, in which air pollution sensors play the major role in covering the area.
Air quality sensors should be deployed systematically across a location in order to (i) optimise the cost and the number of sensors according to building layout, space and room features, (ii) ensure the reliability of the sensor network in case of sensor failure, (iii) provide acceptable spatial and temporal coverage of indoor air pollutants, and (iv) minimise the cost associated with computational analysis and prediction models [23,90–92]. There are common suggestions regarding deployment strategies in practical engineering applications, such as selecting sensors according to common indoor sources, considering the impact of outdoor air pollution on indoor air, and deploying sensors along the wall with proper accessibility for calibration or maintenance [35,93]. However, sensor deployment strategies, especially for IAQ applications, are usually determined based on objective functions and sensor applications [90,94]. In general, deployment strategies in indoor environments vary with time and space and can be categorised into (i) engineering and (ii) optimisation methods. In the engineering method, previous experience and rules of thumb are incorporated. Uniform deployment of several sensors in space is a common practice in engineering methods, as can be seen in the studies listed in Table 4, which may be fairly expensive and unfeasible in some cases. Application of this method may result in a lack of (i) spatio-temporal mapping, (ii) control over the response time, and (iii) generalisability to multiple rooms/spaces [90,94,95]. To compensate for the limitations of this method, the optimisation method has recently been developed, in which indoor airflow patterns are taken into account in the deployment of sensors [96–100].
In this method, modelling tools such as computational fluid dynamics (CFD), zonal model, and multi-zone airflow model are utilised along with genetic algorithm, artificial neural networks (ANNs), simulated annealing, and stochastic approximation methods to optimise objective, cost, or fitness functions based on the predefined goals [97,98,101,102]. Although this method could bring precision in choosing the optimal strategy, optimisation methods could be computationally intensive in the large deployment of sensors in multi-zone airflow and CFD-based simulations [95]. Nevertheless, to achieve optimal strategies regardless of sensor locations and to avoid occasional error in the prediction results of small sensor networks (a combination of 3 to 4 sensors as reported by Ren and Cao [79]), systematic sensor deployment methods, such as clustering model of fuzzy C-means (FCM) algorithm based on ANN [103] or based on the genetic algorithm [104] for the efficient prediction of indoor environments could be employed.
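To make the idea of optimising an objective function over sensor locations concrete, the sketch below uses plain exhaustive search instead of the genetic algorithm, simulated annealing or CFD-driven approaches named above. The objective (mean distance from each point of interest to its nearest sensor), the candidate coordinates and the function names are all hypothetical choices made for illustration; a realistic objective would draw on airflow or concentration fields.

```python
from itertools import combinations
import math


def coverage_cost(sensors, points_of_interest):
    """Mean distance from each point of interest to its nearest sensor."""
    return sum(
        min(math.dist(point, sensor) for sensor in sensors)
        for point in points_of_interest
    ) / len(points_of_interest)


def best_placement(candidates, points_of_interest, n_sensors):
    """Exhaustively test every combination of n_sensors candidate locations.

    Feasible only for small problems: the number of combinations grows
    rapidly with the number of candidates, which is why heuristic
    optimisers are used for larger deployments.
    """
    return min(
        combinations(candidates, n_sensors),
        key=lambda combo: coverage_cost(combo, points_of_interest),
    )


if __name__ == "__main__":
    # Hypothetical floor-plan coordinates in metres.
    wall_mounts = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0), (4.0, 5.0)]
    occupied_spots = [(1.0, 1.0), (7.0, 1.0), (4.0, 4.0)]
    chosen = best_placement(wall_mounts, occupied_spots, n_sensors=2)
    print(chosen, round(coverage_cost(chosen, occupied_spots), 2))
```

Swapping the cost function while keeping the search loop unchanged is the basic pattern that the heuristic optimisers follow as well.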
Although no standard values for IAQ exist and the idea of setting guideline values [13,105] is not new, we propose a simple strategy for LCS deployment in typical indoor spaces, built on the evidence base from the relevant published literature (Figure 2). This basic strategy could be considered a generalised plan for cases where developing an optimisation model is not computationally feasible, and could include (i) deployment of environmental and pollutant sensors across the indoor space, with the deployment height set according to the occupants' height; and (ii) deployment of sensors based on the specifications given in Figure 2, in locations where samples drawn by the sensors' induction fan can represent the entire environment. In the absence of a legislative framework for regulating IAQ, such a strategy could help optimise sampling that is representative of indoor environments and can be beneficial in planning appropriate mitigation steps for reducing exposure to indoor air pollutants. However, an optimised network of air pollution LCSs needs to be supported by appropriate data processing (Section 6) and predictive modelling (Section 7) to allow its interpretation and visualisation and to convey meaningful messages to the users in a simple form.

Figure 2. Schematic diagram of a simple home deployment strategy for LCSs as per location, including proposed environmental and air pollution sensors (green boxes) and their associated ranges (a blue box) in a typical indoor space. The representative image of a home building was obtained from free sources using Google image search engine.

Data Processing
A network of LCSs requires a substantial amount of pre- and post-processing of data before presentation to the users. Pre-processed data are the data recorded by LCSs using an initial calibration. Post-processed data are the data transmitted by the sensor to a database, which subsequently undergo QA/QC protocols before being made available to the users. Pre-processed data should generally not be made available to the users until sufficient QA/QC has been performed. QA/QC is essential in LCS monitoring systems and refers to a set of activities and measures taken to ensure that the requirements, objectives and established quality standards are met with a pre-established level of performance and confidence. However, its role is not to guarantee that the data are of the highest possible quality, which is often unreachable and unfeasible. What is sought is to ensure that the data are accurate, reliable, and fit for a particular purpose or application.

Pre-Processing of Low-Cost Sensor (LCS) Data
LCSs are manufactured to measure numerous parameters, including but not limited to (i) date/time; (ii) environmental parameters (e.g., temperature (°C), relative humidity (RH, %), barometric pressure); (iii) gaseous pollutants (concentration by molar ratio or mass); and (iv) particle concentrations, segregated into size fractions in different size bins (µg m⁻³). The amount of data produced by LCSs is often orders of magnitude greater than that of traditional measurement techniques. For example, at an acquisition rate of 1 Hz, the total number of measurements could be 86,400 per day for a single parameter. Monitoring six indoor parameters at a 1-min sampling interval yields 8640 samples in one day per location. For a network with multiple locations, this brings challenges to data management and processing [106–108].
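The sample counts quoted above follow from simple arithmetic, which the short sketch below reproduces and extends to a multi-location network. The function name and the ten-location example are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(n_parameters, sampling_interval_s, n_locations=1):
    """Number of stored values per day for a network of identical nodes."""
    per_location = n_parameters * SECONDS_PER_DAY // sampling_interval_s
    return per_location * n_locations


if __name__ == "__main__":
    print(samples_per_day(1, 1))        # one parameter sampled at 1 Hz
    print(samples_per_day(6, 60))       # six parameters at 1-min intervals
    print(samples_per_day(6, 60, 10))   # same node replicated at 10 locations
```

Even at a 1-min interval, a modest network accumulates tens of thousands of values per day, which motivates the processing infrastructure discussed next.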
Handling of large volume datasets requires an infrastructure to process data. To do so, several tools have been developed to address the processing of such large multivariable dataset, with good performance. One of the available tools is the Apache Spark framework (https://spark.apache.org; accessed on 21 March 2021), which was initially designed to be open-source. The tool supports the processing of large amounts of data using distributed computing for the development of iterative algorithms (like machine learning and graph models), interactive data mining, streaming and time-series applications [109]. The framework supports a set of programming languages such as Java, Python, Scala, and R, while being capable of distributing data and computation with a robust fault tolerance mechanism for both. One of the main current tasks of this emerging smart computing platform includes the processing and streaming of large amounts of data from sensors as well as machine learning tasks [109,110]. This framework is able to offer an optimal model in terms of both processing time and least error rate in working with air quality databases, especially related to smart monitoring [111][112][113][114].
On top of big data frameworks-a descriptor for very large and multivariate timeseries datasets produced by LCS systems-different kinds of tool were designed to tackle specific applications. In the internet of things (IoT) area, multivariate time-series are continuously needed to be pre-processed to guarantee its fitness for their expected usage. Currently, the combination of big data frameworks along with time-series databases, data collectors, data monitors, and data visualisers, has boosted the ability to use data from LCSs to generate useful and reliable IAQ information. For time-series database management and streaming, the open-source InfluxDB platform [115] offers a variety of tools and mechanisms to deal with LCSs time-series datasets [82,116,117]. The capability of open-source InfluxDB has been proved for its time-series functionality, keeping costs as low as possible, making querying archives simple, and connectivity to data collectors like Telegraf and to graphing software like Grafana low-effort [116]. The Telegraf tool [118] offers a plugin architecture that supports the connection between a broad range of data sources to collect and report metrics and events. Grafana [119] has emerged as one of the most used platforms by the industry, offering a rich and extendable web interface to build dashboards on top of data sources and collectors, catch errors and monitor readings, bring compatibility with several languages, tools and frameworks [116].
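As an illustration of how an LCS reading might be handed to a time-series database, the sketch below formats one reading in the InfluxDB line protocol text format (measurement, tags, fields, timestamp). It is a simplified, standalone formatter written for this example rather than a client-library call: it handles only floating-point fields, does not escape backslashes, and the measurement, tag and field names are hypothetical.

```python
def _escape_tag(value):
    """Escape commas, equals signs and spaces in tag/field keys and tag values."""
    return (
        str(value).replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")
    )


def _escape_measurement(name):
    """Escape commas and spaces in the measurement name."""
    return str(name).replace(",", r"\,").replace(" ", r"\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp.

    Tags and fields are sorted by key so the output is deterministic.
    All field values are written as floats (a simplification).
    """
    tag_part = "".join(
        f",{_escape_tag(key)}={_escape_tag(value)}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_tag(key)}={float(value)}"
        for key, value in sorted(fields.items())
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    line = to_line_protocol(
        "iaq",                                        # hypothetical measurement
        {"room": "living room", "sensor": "pm01"},    # hypothetical tags
        {"pm25": 12.5, "co2": 640},
        1616284800000000000,                          # nanosecond timestamp
    )
    print(line)
```

Keeping the location and sensor identity in tags and the readings in fields is one possible layout; it lets dashboards filter by room or device without scanning the values themselves.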
In summary, the pre-processing of datasets involves three recurring problems: data quality, high dimensionality, and the growing amount of data. The measurements provided by LCSs are only useful when these issues are overcome. With increasing interest in big data and its applications, many open-source frameworks capable of processing and storing large amounts of time-dependent data have been developed, as discussed above. These tools help LCS networks to propagate and batch-process data effectively, enabling users to conduct a wide range of experiments concurrently with real-time monitoring of the results.

Post-Processing of LCS Data
Post-processing techniques, such as outlier detection, data cleaning and gap-filling methodologies, can help to identify missing, duplicated and inconsistent data and to eliminate high-frequency noise, improving the quality of measured data [120,121]. To meet the demand for higher data quality in LCS systems, Mahajan and Kumar [106] presented a toolbox known as Sense Your Data: Sensor toolbox. This web-based tool provides easy and efficient functions to analyse air pollution data for both researchers and the general public. The tool offers a data plotter (including a data summary), anomaly/outlier removal and gap-filling. The three algorithms implemented in this tool for data processing are: (i) autoregressive integrated moving average (ARIMA) additive for tasks related to prediction/forecasting [122]; (ii) K-nearest neighbour (K-NN) for anomaly detection [120,123]; and (iii) the ANN model for air pollution time-series data, dealing with forecasting [122] and gap-filling [120]. The two algorithms for gap-filling are: (i) interpolation using the "imputeTS" package [124] to fill the missing values in the dataset; and (ii) the Kalman filter to estimate past, present and future values even when the precise nature of the system is unknown [125].
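The gap-filling and K-NN anomaly detection steps described above can be sketched in a few lines. The example below is a simplified illustration and not the implementation used in the Sense Your Data toolbox. It assumes evenly spaced readings with missing values marked as `None`, fills edge gaps with the nearest valid reading, and scores each reading by its mean distance to its k nearest neighbours in value space. The function names and the sample values are hypothetical.

```python
def fill_gaps_linear(values):
    """Fill None entries by linear interpolation between valid neighbours.

    Assumption: readings are evenly spaced in time. Leading and trailing
    gaps are filled with the nearest valid reading.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("series has no valid readings")
    for i in range(known[0]):                      # leading gap
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):    # trailing gap
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):      # interior gaps
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def knn_anomaly_scores(values, k=3):
    """Score each reading by its mean distance to its k nearest neighbours."""
    if len(values) <= k:
        raise ValueError("need more than k readings")
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Made-up readings for illustration only.
print(fill_gaps_linear([10.0, None, None, 16.0, None]))
scores = knn_anomaly_scores([12.0, 13.0, 12.5, 11.8, 95.0, 12.2])
print(scores.index(max(scores)))  # index of the most isolated reading
```

In practice, a threshold on the score (or a rank-based cut-off) would be needed to decide which readings to flag, and that choice depends on the sensor and pollutant.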
Other anomaly detection techniques specialised for time-series data are the symbolic aggregate approximation algorithm (SAX; [126]) and the cluster-based algorithm for anomaly detection in time-series using Mahalanobis distance (C-AMDATS; [127]). SAX transforms a time-series into a sequence of characters (i.e., a string) using clustering techniques and detects anomalies using the concept of discords [128]. C-AMDATS, in turn, is an unsupervised learning technique that uses clustering methods and the covariance matrix to compute the Mahalanobis distance, determine how a certain pattern differs from the others, and identify the most anomalous patterns using an anomaly rank index. It is a multivariate technique and has been reported to outperform the SAX algorithm on urban air pollution data [127]. Recently, a lightweight Python library for time-series data analysis called Luminol was developed, which implements several anomaly detection algorithms [129]. Luminol has a range of applications, from detecting and correcting network anomalies (the amount of writing, requests, etc.) to health, sensor and IoT applications, which could be valuable for post-processing time-series data from networked LCSs.
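As an illustration of how a time-series can be turned into a string, the sketch below follows the commonly described SAX steps: z-normalisation, piecewise aggregate approximation (segment means), and mapping of each segment mean to a letter using breakpoints that divide the standard normal distribution into equiprobable regions. It is a simplified sketch that assumes a four-letter alphabet and a series length divisible by the number of segments. It does not include the subsequent discord search, and the function name is hypothetical.

```python
import bisect
import statistics


def sax_word(series, n_segments, breakpoints=(-0.67, 0.0, 0.67), alphabet="abcd"):
    """Convert a numeric series into a SAX-style word.

    Assumptions: len(series) is divisible by n_segments, and the default
    breakpoints are the approximate quartiles of the standard normal
    distribution (four-letter alphabet).
    """
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    mean = statistics.mean(series)
    std = statistics.pstdev(series)
    # z-normalise; a flat series maps to all zeros.
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]
    seg_len = len(series) // n_segments
    # Piecewise aggregate approximation: mean of each segment.
    paa = [
        statistics.mean(z[i * seg_len:(i + 1) * seg_len])
        for i in range(n_segments)
    ]
    # Map each segment mean to the letter of the region it falls in.
    return "".join(alphabet[bisect.bisect_right(breakpoints, m)] for m in paa)


# Made-up series: a low plateau followed by a high plateau.
print(sax_word([1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0], n_segments=2))
```

Words computed over sliding windows can then be compared, with unusually rare or distant words treated as candidate anomalies. The window length and alphabet size are tuning choices.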

Predictive Modelling
Developing a predictive model that can forecast changes in IAQ and occupants' exposure is crucial for obtaining concentration profiles of air pollutants in indoor spaces [130]. Predictive modelling is a commonly used technique that analyses historical/current data and generates a model to help predict future outcomes. With the availability of IAQ data collected using LCSs, sophisticated techniques can be employed to develop a predictive model. In the subsequent sections, we review and consolidate the techniques used for predictive modelling and bring together the prevalent best practice and knowledge for developing optimal indoor models, using the previously discussed topics as the foundation for model development.

Types of Indoor Air Quality (IAQ) Predictive Models
IAQ modelling is a non-invasive and inexpensive method to better estimate the spatio-temporal distribution of indoor air pollutants. IAQ is commonly predicted using mechanistic (white box) or statistical (black box) models. Mechanistic models use detailed input parameters to describe the fate and transport of indoor air pollutants via diffusion, convective mass transfer, and sorption. They can be applied to unoccupied microenvironments where detailed information on indoor/outdoor target air pollutants, building layout and ventilation conditions is available or controlled. Mechanistic models have been implemented in several studies to predict indoor PM [131][132][133] as well as VOCs [134][135][136][137]. Mechanistic models can be categorised as single compartment mass balance-based models and CFD models, as described here:

• The single compartment mass balance-based model is a common mechanistic model that has been widely used in studies to explore IAQ with proper validation against real-world data [138][139][140]. Liu and Zhai [94] integrated a probability-based adjoint inverse method into the single compartment mass balance-based model to back-track indoor pollution sources. In the model, interpolation was used to obtain the pollutant concentrations at locations between sensors, and sensor readings were assumed to be always accurate. However, this assumption does not hold for LCSs because of drift error. For example, the uncompensated drift error and standard deviation of a VOC sensor in many environments were about 0.8 and 0.3 ppm per 4 months, respectively [141]. Therefore, Xiang et al. [142] improved the mass balance-based model by considering LCS specifications and optimally compensating drift errors. The corrected model was composed of an optimal indoor concentration prediction and estimation model, which was supported by a hybrid sensor network synthesis algorithm.
• CFD is a well-known mechanistic model that is restrictive in nature due to its exceptional complexity and dependency on many assumptions, approximations, and real observations. Empirical models can be integrated into detailed mathematical models to enhance the accuracy of predictions. CFD models supported by empirical/physics-based models require additional resources and pre-existing knowledge during model development [143][144][145].
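To make the single compartment mass balance idea concrete, the sketch below integrates a generic well-mixed single-zone balance, dC/dt = a·P·C_out + S/V − (a + k)·C, with an explicit Euler scheme and compares the result with the analytical steady state. This is a textbook-style form and is not the specific formulation of the cited studies. The symbols (air exchange rate a, penetration factor P, deposition rate k, indoor source rate S, room volume V) and all numerical values are assumptions chosen for illustration only.

```python
def simulate_indoor_concentration(c0, c_out, air_exchange, penetration,
                                  deposition, source_rate, volume, dt, n_steps):
    """Explicit Euler integration of a well-mixed single-zone mass balance.

    dC/dt = a*P*C_out + S/V - (a + k)*C
    Units are assumed consistent (e.g. ug/m3, 1/h, ug/h, m3, h).
    """
    conc = [c0]
    for _ in range(n_steps):
        c = conc[-1]
        dcdt = (air_exchange * penetration * c_out
                + source_rate / volume
                - (air_exchange + deposition) * c)
        conc.append(c + dt * dcdt)
    return conc


def steady_state(c_out, air_exchange, penetration, deposition, source_rate, volume):
    """Analytical steady state of the same balance (dC/dt = 0)."""
    return ((air_exchange * penetration * c_out + source_rate / volume)
            / (air_exchange + deposition))


# Illustrative values only: 50 m3 room, 0.5 air changes per hour,
# outdoor concentration 20 ug/m3, indoor source 100 ug/h.
profile = simulate_indoor_concentration(
    c0=0.0, c_out=20.0, air_exchange=0.5, penetration=1.0,
    deposition=0.2, source_rate=100.0, volume=50.0, dt=0.01, n_steps=5000,
)
print(round(profile[-1], 3), round(steady_state(20.0, 0.5, 1.0, 0.2, 100.0, 50.0), 3))
```

In a sensor-network setting, parameters such as the source rate are usually unknown and would have to be estimated from measurements, which is where sensor drift enters the problem.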
In statistical models, model parameters are identified using experimental data and the model structure is inferred by applying statistical methods. While mechanistic and empirical/physics-based models are complicated to develop, particularly where there are no established mechanisms, statistical models can help, especially when dealing with large datasets [146]. This technique can deliver reliable outcomes, but the complete lack of physical insight is a significant drawback. Statistical models also appear to be less resource-demanding than other models. However, they need consistent input data streams from data loggers or pollution monitors, so an interruption in the flow of input information could endanger the accuracy of the model [144,147,148].
In addition to traditional statistical models, such as kriging or Gaussian process regression, the use of machine learning techniques has gained increasing attention in statistical IAQ predictions. The common statistical machine learning-based models are multiple linear regression, partial least squares, generalised linear model, decision trees (classification and regression trees), Bayesian hierarchical model, generalised boosting model, support vector machine, random forests, and ANN [72,146,149–153]. Although discussing the details of these methods is not the primary objective of this study, we showcase the most applicable models that can be of use in building predictive models using LCSs.
Linear regression is a statistical method that captures the linear relationships of independent variables to predict the value of a dependent variable, such as forecasting air pollution [154,155]. Partial least squares and generalised linear models provide a general framework for handling regression models for normal/non-normal data that can be applied in IAQ applications [156,157]. Decision trees are simple but successful techniques that predict the target value by learning simple decision rules [152,158]. Bayesian hierarchical modelling is a statistical approach that utilises Bayes' theorem for estimation. The hierarchical approach facilitates the understanding of multi-parameter problems and the development of computational strategies [159,160]. A generalised boosting model is a combination of decision tree-based algorithms and boosting techniques, which repeatedly fits decision trees to improve the accuracy of the model [152]. Support vector machine regression is a method proposed for dealing with non-linear problems [72,161]. Random forest (or random decision forest) regression is a simple, flexible and widely used machine learning algorithm, which can be utilised in both classification and regression applications [162][163][164][165]. ANN is one of the most commonly used machine learning techniques for solving complex problems [70,72,166–169]. ANN has shown the capability of estimating IAQ with an acceptable range of 0.62 < R² < 0.79 using only one hidden layer [167,170,171]. However, there are a few emerging applications of deep neural networks (DNNs), such as recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU), that need exploration [146,172–174]. Table 5 presents a summary of modelling studies for residential settings in which various machine learning techniques are used for the prediction of IAQ parameters.
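As a minimal illustration of the simplest model in this list, and of the R² metric used above to report ANN performance, the sketch below fits an ordinary least squares line with a single predictor and computes the coefficient of determination. It is a single-predictor simplification of the multiple linear regression discussed in the reviewed studies. The function names are hypothetical and the numbers are toy values invented for illustration, not measurements.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (invented): a predictor x and a response y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = fit_simple_linear_regression(x, y)
predictions = [slope * xi + intercept for xi in x]
print(slope, intercept, r_squared(y, predictions))
```

In a real evaluation, R² should be computed on data held out from fitting. Otherwise it tends to overstate predictive performance.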
The development of machine learning and statistical models in recent years (Section 7.1) has offered significant benefits in the prediction of complex indoor environments [72,146,165]. The development of these predictive models requires large-scale data collection provided by the sensor network and adequate computing infrastructure for data processing, analysis and model construction. Although the potential applications of predictive models are vast, implementation efforts have been limited; those using ANN, multiple linear regression, and random forest regression models showed acceptable performance in predicting indoor variables. Nevertheless, further efforts should be undertaken to enhance the performance of these tools in predicting all known indoor air pollutants (Figure 2) using data continuously generated by sensor networks rather than focusing only on proxies.
Based on the review of various predictive modelling techniques, it has been found that statistical models based on machine learning (Table 5) could provide a good fit for indoor air pollution prediction in smart homes. This technique provides a powerful tool for modelling the behaviour of indoor built environments with a complex interplay of the response and predictor variables. The predictive model should also be able to maintain its stability when dealing with inaccurate readings and source generation rate estimates by applying proper weighting factors (a function of sensor drift and source generation rates) to improve the overall prediction accuracy. To do so, mechanistic model techniques are utilised to provide the basis for selecting appropriate parameters for machine learning models on theoretical physics-based principles. However, the uncertainty and potential disadvantages of mechanistic models highlighted in the previous section could endanger the feasibility of the model in multiple buildings or in the case of an occupied building.

Conclusions and Future Remarks
People spend a significant amount of their time in indoor environments, where they are most probably exposed to at least one IAQ problem. IAQ remains mostly unregulated, and maintaining safe IAQ during long-term stays at home to tackle the novel coronavirus pandemic or similar outbreaks is more challenging [184]. Smart homes equipped with air quality LCSs and integrated processing/predicting tools can offer a healthy environment to occupants. Although technologies in this field are continuously evolving, knowledge shared among researchers in different fields is sparse, and smart home components are considered separately due to the diversity of the research field. Here, we reviewed the standard protocols that need to be met to address indoor measurement challenges. We then reviewed data assimilation and data processing tools and predictive modelling techniques to estimate indoor exposure. From the study, the following conclusions are drawn:
• Indoor pollutants are released from different sources at different concentration levels, so LCSs should be selected to suit the target pollutants and their concentrations. The accuracy and diversity of LCSs used in indoor environments is an important focus in deployment strategies of LCSs in smart homes. Proper deployment height is also suggested due to variation in exposure heights among the occupants.
• Deploying networked LCSs to map the spatio-temporal distribution of indoor air pollutants requires optimising the number of deployed LCSs in order to obtain meaningful data and reduce computational time/cost and data handling without losing accuracy. There are limited studies on long-term deployments of sensor networks, especially in indoor residential environments.

• The lack of data reliability and QA/QC is counted as the most important challenge associated with LCSs. We emphasised the important role of laboratory calibration of LCSs. Relying only on initial LCS calibration for long-term deployment, which is a prevalent practice in the reviewed studies, is insufficient; it should be complemented by routine performance testing for networked sensors to succeed. Such performance evaluations help to maintain data quality and to account for manufacturing variability, sensor stability, drift and ageing over time.

• Several open-source tools have been developed for data processing to give network providers the means to deploy large-scale networks with little overhead. As LCSs record large amounts of time-series data, open-source tools such as InfluxDB and Grafana are needed to capture and process recorded measurements and to allow easy visualisation for both the network operator and the occupants. Home-specific internal data servers can offer additional security against external threats.
• A wide range of data processing tools is available with many capabilities, including data cleaning, data plotting and different types of anomaly detection. These tools can increase the confidence in and reliability of the data, improving the services provided by the network providers and the experience of the occupants.

• There is an increasing trend towards the application of machine learning-based statistical models due to the availability of a continuous flow of IAQ data from LCSs. However, exclusively data-based studies have several limitations due to the lack of established knowledge on the selection of desirable parameters, appropriate performance metrics, and the application of different models to different scenarios. Therefore, the best way forward would be to further advance the knowledge of statistical models for IAQ prediction by carrying out larger-scale deployments and considering a wider range of indoor pollutants, backed by the theoretical principles from mechanistic models for modelling the underlying micro-environmental principles and mechanisms.
Making homes smarter is becoming an integral component of the smart city concept. According to the Allied Business Intelligence (ABI) Research report on smart homes [185], almost 300 million smart homes are set to be installed around the world by 2022. Smart homes with respect to IAQ are not a distant dream. This review reveals the benefits of using technological advancement in estimating the effects of long-term exposure to indoor air pollutants and in determining new prevention strategies and control measures for health conditions in smart homes. It contributes to future generations of smart buildings, to the design of smart cities, and to the adoption of smart technologies for IAQ monitoring by the general public in their routine lifestyle. Ongoing projects such as MyGlobalHome [186] aim to develop such an advanced property development platform by connecting developers to consumers of sustainable and connected homes, and seek to bridge the gap between smart technology developers and property developers. The efforts of such projects, along with the support of ongoing research activities concerning air quality sensors, could result in appreciable health benefits to smart home occupants.