Simulated Data to Estimate Real Sensor Events—A Poisson-Regression-Based Modelling

Automatic detection and recognition of Activities of Daily Living (ADL) are crucial for providing effective care to frail older adults living alone. A step forward in addressing this challenge is the deployment of smart home sensors capturing the intrinsic nature of the ADLs performed by these people. As real-life scenarios are characterized by a comprehensive range of ADLs and smart home layouts, deviations are expected in the number of sensor events per activity (SEPA), a variable often used for training activity recognition models. Such models, however, rely on the availability of suitable and representative data collections, which are habitually expensive and resource-intensive to obtain. Simulation tools are an alternative for tackling these barriers; nonetheless, an ongoing challenge is their ability to generate synthetic data representing the real SEPA. Hence, this paper proposes the use of Poisson regression modelling for transforming simulated data into a better approximation of real SEPA. First, synthetic and real data were compared to verify the equivalence hypothesis. Then, several Poisson regression models were formulated for estimating real SEPA using simulated data. The outcomes revealed that real SEPA can be better approximated (R²pred = 92.72%) if synthetic data are post-processed through Poisson regression incorporating dummy variables.


Introduction
Remote sensing is enabling us to understand more about our surroundings, particularly around environmental change. Remote sensing through geospatial data is, however, not typically seen as a means for continuous monitoring. It generally relies on sensors attached to aircraft or satellites for geological mapping or for capturing observations of the Earth. As a result, remote sensing is often associated with collection frequencies measured in months rather than hours or days. There is a vast number of monitoring and inspection applications that would require and benefit from more frequent observation. For these applications, remote sensing and the Internet of Things (IoT) could be used to complement and strengthen each other. Remote sensing and IoT bring together external observations possible only from extrinsic sensors, such as satellite images, and combine/rationalize these findings with data streamed by embedded IoT sensors.
In the literature, many research works combine real and simulated data to feed ADL classification models. For example, Reference [17] implemented a smart-home simulation combining an avatar-based scenario (acquired from real-world data) with probabilistic modelling of sensors. The authors found the simulated and real data to be similar, demonstrating the viability of the probabilistic sampling approach. The proposal presented in this article is an extension of a previous work [18] in which we applied different regression models to simulated data in order to use them as a complement to real data. In our previous work, a simulation was carried out in terms of duration and intensity of sensor events; however, we did not consider the different types of sensors that are associated with each ADL and that are involved in the events forming activity sequences. Precisely, this paper studies a simulation that considers the different types of sensors and applies Poisson regression to improve the results. The main difference from the previous work is that here we hypothesize that data generated by simulation and adjusted by Poisson regression are more similar to real data than unadjusted data. The remainder of this paper is organized as follows: Section 2 presents a review of simulation tools that have been designed for smart environments, whereas Section 3 details Poisson regression modelling. Section 4 describes our experiments, whereas results are reported and discussed in Section 5. Finally, Section 6 details the conclusions and future work.

Review of Simulation Tools for Smart Environments
A comprehensive review of simulation tools has been reported previously within the literature [13,16]. These tools can be split into model-based and interactive approaches. Model-based approaches generally focus on the use of statistical or machine learning techniques to generate synthesised or surrogate data [19,20]. Techniques involved include correlation preservation and amplitude distribution [21] or, more recently, adversarial neural networks [22]. This section will provide an overview of interactive approaches and of approaches used to validate the generated data.
Generally, interactive approaches rely on a human user controlling an avatar around a 2D or 3D virtual environment [16]. As the avatar moves throughout the environment, it interacts with various passive and active virtual sensors and/or actuators, for example activating pressure or presence sensors and turning lights on or off.
The intelligent environment simulation (IE Sim), developed by Synnott et al. [23], is a tool to generate simulated datasets for Activities of Daily Living. It allows the researcher to design smart homes by providing a 2D graphical top-view of the floor plan. The researcher can add different types of sensors, such as temperature sensors and pressure sensors. Using an avatar, the researcher can carry out ADLs, interacting with objects and triggering sensors in the virtual environment. Similarly, Ariani et al. [24] created a smart home simulation tool that collects data from virtual ambient sensors, including binary motion detectors and pressure sensors. The researcher produces the smart home floor plan by drawing shapes on a 2D canvas and can then place sensors onto the floor plan. To simulate the activities and interactions in the smart home, the authors used a pathfinding algorithm which simulates the movement of the inhabitants.
The OpenSHS simulator [13] generates realistic smart home data through a hybrid approach; specifically, it combines both interactive and model-based approaches. Data generated through interactive simulation can then be replicated using a specifically designed algorithm. OpenSHS was demonstrated in generating a dataset for classification as well as for the detection of anomalous activity, such as leaving the front door open. The open-source simulator SIMACT [25] allows for the creation of a 3D environment and the selection and positioning of virtual sensors. These virtual sensors are modelled upon common smart home sensors such as RFID, PIR sensors, and contact sensors. The simulator generates datasets in two modes: (1) an interactive mode, where the avatar is controlled by the user, who can interact with various items in the home; and (2) a model-based mode, where the inhabitants are controlled by pre-defined scripts in which the user defines the completion time of each step and the objects that are interacted with.
As highlighted by Table 1, few of the interactive approaches reported in the literature have compared the accuracy of data generated by the simulator with data generated in a real environment. When doing so, it is important to consider not only which sensors are firing, but also the duration and timing of these sensor events. For instance, making a meal may take longer in the morning than in the evening. Lee et al. [27] developed the PerSim 3D human activity simulator. A contrast between real data gathered within the Gator Tech Smart House and synthetic data produced by PerSim 3D concluded mean data similarities of between 0.78 and 0.81. Another work, comparing real data with data produced by the MASSHA simulator, revealed similarities between 0.8810 and 0.9352 in terms of frequency, and between 0.9827 and 0.9909 in terms of duration, on datasets including single-user ADLs.
Renoux and Klugl [29] presented a framework to generate sensor data from a simulated smart home. The solution used a flexible agent-based simulation tool and constraint-based planning. The authors highlighted that the data generated could be used to test or train algorithms that are then directly usable in real-world applications. Through an evaluation of the solution, the authors showed that the activity plans generated by the simulator show some plausibility. The comparison of these plans with real datasets, however, revealed some issues. In particular, there was a noted discrepancy between the expectation of what an activity plan for a real day looked like, when looking at a complete day, and the actual recorded timeline of a real day. The authors were unclear whether this difference was due to real activity fragmentation or to errors during activity annotation.
As discussed, interactive methods, which mainly depend on an avatar interacting within a virtual environment to produce simulated datasets, have a restricted capability to capture the inherent variations in activity duration and intensity over the course of the day that would be exhibited in a real dataset. This is mainly due to the synthetic nature through which interaction takes place; as a result, simulated datasets may not be reflective of those produced in a real environment. Mendez-Vazquez et al. [30] demonstrated the use of Markov chains describing the order of events, combined with a Poisson distribution to calculate a range of realistic activity times and probability distributions to calculate a range of sensor values, in order to generate a simulated activity dataset. This simulated activity set contained a distribution of activities such as reading, sleeping, walking, and sitting, together with metrics including time and energy expenditure.

Method
Poisson regression (see, e.g., Reference [31]) is a generalized linear model in which the log conditional expected response given the covariates can be expressed as a linear combination of the covariates and a noise term, that is,

log E(Y | X1, X2, . . . , Xn) = β0 + β1 X1 + β2 X2 + . . . + βn Xn + ε,

where Y is the response variable, X1, X2, . . . , Xn are the covariates, and ε is the noise term. The response is a count variable and is assumed to be Poisson distributed. Poisson regression is adopted in this study since it is suitable for modelling the relationship between a group of predictors (in this case: simulated events per activity, simulated activity duration, and simulated events per sensor) and a response variable representing the number of times an event occurs in a finite timestamp (as real SEPA does). Other candidates for modelling the activity durations could be, for example, ordinary regression with a Gaussian response, non-linear regression models, models for data surrogation [32], non-Gaussian models, and so forth. However, the Poisson distribution is fundamentally derived from the assumption of counts. More to the point, when there is no reason to assume dependence between events, resulting in either clustering or regularity, the number of events in each fixed interval is Poisson distributed. Here, the covariates are typically count data (number of events per activity and number of events per sensor), but also time (duration of an activity). The linear combination of these counts (possibly also including interaction terms) is therefore hypothesized to be Poisson rather than Gaussian or otherwise distributed. Gaussian and other continuous distributions are preferable when the data is continuous (such as exact weight, length, etc.). See Reference [33] for further motivations for the choice of model.
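To make the log-linear model concrete, the sketch below fits a Poisson regression by Newton-Raphson (equivalently, iteratively reweighted least squares) on synthetic counts. The covariates and coefficients are invented stand-ins for the paper's predictors; the study itself used a statistical package rather than this code.

```python
import numpy as np

def fit_poisson_regression(X, y, n_iter=50, tol=1e-10):
    """Fit log E(Y|X) = X @ beta by Newton-Raphson.

    X must already contain a column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)             # conditional mean under the log link
        grad = X.T @ (y - mu)             # score of the Poisson log-likelihood
        hess = X.T @ (X * mu[:, None])    # Fisher information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic check: counts generated from a known model are recovered.
rng = np.random.default_rng(42)
n = 2000
x1 = rng.uniform(0, 1, n)   # stand-in for simulated events per activity (scaled)
x2 = rng.uniform(0, 1, n)   # stand-in for simulated activity duration (scaled)
X = np.column_stack([np.ones(n), x1, x2])
true_beta = np.array([0.3, 0.8, -0.5])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_regression(X, y)
```

With 2000 observations, the recovered coefficients land close to the generating values, which is the basic sanity check before trusting such a model on real SEPA data.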

Overdispersion
The (homogeneous intensity) Poisson probability function of a single variable Y is

P(Y = y) = λ^y e^(−λ) / y!,

for y ∈ {0, 1, 2, . . .} and λ ∈ R⁺. A characterizing property of the Poisson distribution is that the single parameter λ stands for both the expectation and the variance. If these two moments differ, this is an indication that the Poisson assumption does not hold. The phenomenon of the variance being larger than the expectation is called overdispersion, typically caused by too few covariate observations or strongly correlated covariates. When the dispersion is on par with the expectation, the term is equidispersion.
To address the problem of overdispersion, one may consider the Poisson distribution as a special case of the Generalized Poisson (GP) distribution, with probability function

P(Y = y) = λ(λ + κy)^(y−1) e^(−λ−κy) / y!,

for y ∈ {0, 1, 2, . . .}, λ ∈ R⁺ and max(−λ/4, −1) < κ < 1. The Poisson distribution then corresponds to the GP distribution with κ = 0 (equidispersion), overdispersion corresponds to κ > 0, and underdispersion corresponds to κ < 0. A random variable X distributed according to the GP distribution has expectation and variance (see Reference [34])

E(X) = λ/(1 − κ),  Var(X) = λ/(1 − κ)³,

which motivates defining the dispersion parameter φ as the deviance of the GP from the Poisson distribution, with φ = 1/(1 − κ)². A hypothesis test for indications of overdispersion may be carried out by utilizing Pearson's goodness-of-fit statistic

χ² = ∑ᵢ₌₁ⁿ (yᵢ − μ̂ᵢ)² / μ̂ᵢ,

which is χ²-distributed with n − 1 degrees of freedom under the null hypothesis H0: κ = 0 against the alternative H1: κ > 0. Here, x_ij denotes the ith observation of the jth covariate X_j, the fitted mean is μ̂ᵢ = exp(∑ⱼ₌₁ᵐ β̂ⱼ x_ij), and the dispersion parameter estimator φ̂ is obtained by maximum likelihood estimation in parallel with the estimation of the regression coefficients [35].
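A minimal numerical illustration of the Pearson dispersion check follows. It assumes fitted means are already available, uses the n − 1 degrees of freedom stated in the text, and contrasts equidispersed Poisson counts with an invented negative-binomial sample as an example of overdispersion:

```python
import numpy as np
from scipy import stats

def pearson_dispersion_test(y, mu):
    """Pearson goodness-of-fit check for overdispersion.

    y: observed counts; mu: fitted Poisson means.  Under H0 (kappa = 0,
    equidispersion) the statistic is approximately chi-square with
    len(y) - 1 degrees of freedom, as in the text.
    """
    chi2 = np.sum((y - mu) ** 2 / mu)
    df = len(y) - 1
    p_value = stats.chi2.sf(chi2, df)
    dispersion = chi2 / df   # ~1 under equidispersion, >1 if overdispersed
    return chi2, p_value, dispersion

rng = np.random.default_rng(1)
mu = rng.uniform(1, 5, 500)

# Equidispersed data: Poisson counts drawn at their true means.
y_eq = rng.poisson(mu)
chi2_eq, p_eq, d_eq = pearson_dispersion_test(y_eq, mu)

# Overdispersed data: negative-binomial counts with the same means
# (mean mu, variance mu * (mu + 2) / 2 > mu).
y_over = rng.negative_binomial(n=2, p=2 / (2 + mu))
chi2_ov, p_ov, d_ov = pearson_dispersion_test(y_over, mu)
```

The dispersion ratio stays near 1 for the Poisson sample and rises well above 1 for the negative-binomial sample, which is exactly the symptom the χ² test is designed to flag.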

Normally Distributed Residuals
One assumption in modelling with Poisson regression is that the residuals follow a normal distribution. To validate this condition, an Anderson-Darling test is conducted to confirm that the residuals do not deviate significantly from the normal distribution. Assuming the expected value of the residuals E(ε) = μ = 0, the standardized residuals are ε̃ᵢ = εᵢ/σ̂, where σ̂ is the estimated standard deviation of the residuals.

The Anderson-Darling test statistic is

A² = −n − (1/n) ∑ᵢ₌₁ⁿ (2i − 1)[ln Φ(ε̃₍ᵢ₎) + ln(1 − Φ(ε̃₍ₙ₊₁₋ᵢ₎))],

where ε̃₍₁₎ ≤ . . . ≤ ε̃₍ₙ₎ are the ordered standardized residuals and Φ(·) denotes the standard normal cdf. This statistic may then be used to reject the null hypothesis that the residuals are normally distributed, in favour of the alternative that they are not, as soon as the value of A² exceeds the Anderson-Darling percentile [36].
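The statistic can be computed directly from this formula. The sketch below assumes the residuals are already standardized and applies the computation to two synthetic samples, one normal and one deliberately skewed:

```python
import numpy as np
from scipy.stats import norm

def anderson_darling_normal(residuals):
    """A^2 statistic for H0: the standardized residuals are standard normal.

    Implements A^2 = -n - (1/n) * sum_{i=1}^n (2i - 1) *
    [ln Phi(e_(i)) + ln(1 - Phi(e_(n+1-i)))] on the sorted residuals.
    """
    e = np.sort(np.asarray(residuals, dtype=float))
    n = len(e)
    cdf = norm.cdf(e)
    i = np.arange(1, n + 1)
    # cdf[::-1] gives Phi(e_(n+1-i)) when paired with index i.
    return -n - np.mean((2 * i - 1) * (np.log(cdf) + np.log(1 - cdf[::-1])))

rng = np.random.default_rng(0)
a2_normal = anderson_darling_normal(rng.normal(size=300))
# Centered exponential sample: mean 0 and variance 1, but strongly skewed.
a2_skewed = anderson_darling_normal(rng.exponential(size=300) - 1)
```

The normal sample yields a small A², while the skewed sample produces a much larger value that would fall well past the usual rejection percentiles.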

Independence
Another assumption is that the covariates are uncorrelated within each sample, that is, that the autocorrelation ρ(h) = 0 for all h ≠ 0. This may be estimated by the sample autocorrelation

ρ̂(h) = ∑ᵢ₌₁ⁿ⁻ʰ (xᵢ − x̄)(xᵢ₊ₕ − x̄) / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²,

for time lags h = 0, 1, 2, . . . , n − 1, where x̄ is the average (1/n) ∑ᵢ₌₁ⁿ xᵢ. To check whether there is evidence of dependence violating the assumptions, one may perform an ordinary t-test of whether or not ρ(h₀) = 0 for some specified lag h₀. Another way to check for evidence of dependence is to utilize a Ljung-Box test, which jointly checks whether the correlations ρ(h) = 0 for all lags h such that |h| ≤ h₀, for some specified bound h₀ [37]. This may be carried out by calculating the test statistic

Q = n(n + 2) ∑ₕ₌₁ʰ⁰ ρ̂(h)² / (n − h),

and rejecting the null hypothesis at significance level α as soon as Q exceeds the 1 − α quantile of the χ² distribution with h₀ degrees of freedom.

Experiment Description
Smart environments were developed to help older people, or people suffering from degenerative disorders (e.g., dementia), to maintain their independence in daily life. This is the case of the Halmstad Intelligent Home (HINT) at Halmstad University (Sweden), where a realistic home environment is provided for underpinning innovations and research studies relating to human behaviour analysis [38]. HINT, a 50 m² apartment, is equipped with a variety of thermal cameras and sensors (PIR, pressure, door contact, contact/touch, and others) capable of supporting (i) emergency detection and on-time response, (ii) detection of deviating behaviour patterns, and (iii) healthcare monitoring [38]. The left and right sides of Figure 1 present some of the spaces within this home lab. The same environment was designed in the IE Sim software, which virtually incorporated the current sensor deployment so that model robustness and reliability could be properly assessed. To compare real and synthetic SEPA, an experiment involving eleven participants was undertaken. Each participant was initially asked to carry out a set of eight ADLs (Go to bed, Use bathroom, Prepare breakfast, Leave house, Get cold drink, Go to office, Get hot drink, and Prepare dinner) in the virtual environment by using a virtual avatar (see Figure 2). A general description of the activities to be performed by users can be found below:

Initial instructions
• Please close each door after passing through.
• Please turn off each domestic appliance after use.
• You will be guided through each activity in sequence; please remember to select the "Stop/Start" button after each activity is complete.
• Time is not an issue in this experiment. Do not worry about needing to take time to re-read an activity description.

Activity 1: Go to bed
You can stay in bed for as long as you want, up to a maximum of 2 min. Then, you have to leave the bedroom, close the door, and press the button.

Activity 2: Use bathroom
You can use the toilet if you need, or just wash your hands. Then, leave the bathroom, close the door, and press the button.

Activity 3: Prepare breakfast
You have to prepare something to eat for breakfast. You can choose between milk with cereal or coffee, or you can prepare both. Then, put the bowl on the table, sit down, and press the button.

Activity 4: Leave house
You can choose to leave the home either from the front door or from the garden door. When you are outside, press the button.

Activity 5: Get cold drink
You can choose between tap water or taking something from the fridge. Then, put the glass with the drink on the kitchen desk and push the button.

Activity 6: Go to Office
You have to go to the office and press the button.

Activity 7: Get hot drink
You can choose between making tea or coffee. Then, put the cup on the kitchen desk and press the button.

Activity 8: Prepare dinner
You have to prepare a soup. Put the bowl on the table and press the button.
Once participants finished the aforementioned activities within the simulator, they were required to undertake the same ADLs at HINT. The resulting data from the real home and the virtual environment were then arranged as two datasets, specifying each sensor event together with its corresponding participant, sensor ID, code, sensor type, and time stamp. The next section presents the comparison between data emanating from HINT and IE Sim in terms of real SEPA. Furthermore, it illustrates how synthetic data can be transformed (using Poisson regression modelling) to approximate the number of events perceived by each sensor in the real environment.

Contrast between Simulated and Real Sensor Events
Paired t-tests (α = 5%, CL = 95%) were performed to contrast the numbers of synthetic and real SEPA considering two sensor types: door and pressure. This study also regarded seven ADLs (Go to bed, Use bathroom, Prepare breakfast, Leave house, Get cold drink, Be in the office, and Prepare dinner) and eleven sensors (Bedroom door, Bed pressure, Bathroom door, Bowl cupboard-Prepare breakfast, Refrigerator-Prepare breakfast, Chair pressure-Prepare breakfast, Chair pressure-Leave house, Refrigerator-Get cold drink, Office chair pressure 3, Bowl cupboard-Prepare dinner, Chair pressure-Prepare dinner) to provide further analysis on how the equivalence between real and synthetic datasets may vary depending on the related ADL and sensor type.

Table 2 describes the results obtained from the contrast between synthetic and real SEPA for the Bedroom door sensor (ADL: Go to bed). In this case, the two-sided CI for the mean difference between the real and synthetic SEPA does not contain zero (Figure 3), suggesting that, for the Bedroom door sensor (ADL: Go to bed), the SEPA generated by the IE Sim simulator are significantly different from those emanating from the real environment at a confidence level of 95%. This result is consistent with the small p-value (0.005) derived from the paired t-test, which does not provide good evidence for the equivalence statement. Specifically, the real SEPA (mean = 4.125 events) were found to be meaningfully higher than the number of events reported by IE Sim (mean = 1.750 events).

The aforementioned analysis was extended to all door sensors so that further insights could be obtained regarding the equivalence between synthetic data and real observations (refer to Table 3). In particular, we found that for 83.33% of the door sensors, the equivalence statement was rejected (p-value < 0.05). It can hence be inferred that the SEPA produced by the IE Sim simulator and the real world are considerably dissimilar (CL = 95%).
It is hypothesized that the difference between the real and simulated sensor activations is due to differences in how individuals interact with objects in the real world versus in the simulation. For example, when entering a room the individual may move through the doorway and then instinctively push the door closed or partially closed, which would trigger an activation of the contact sensor. Conversely, in the simulation, the user is abstracted from what is happening in the environment and has to make a concerted effort to interact with each object/sensor. In this case, they may sometimes forget to do so, meaning the door may be opened but not closed.

Figure 4 and Table 4 present the results of the paired t-test supporting the contrast between the SEPA (Chair pressure) from IE Sim and the real environment when preparing breakfast. Considering that the CI for the mean difference between the compared variables does not include zero, there is not enough support for the equivalence statement (Figure 4). This finding is confirmed by the small p-value (0.004) associated with the null hypothesis, which further suggests (CL = 95%) no statistical similarity between the real data and those produced using the simulator. In particular, the real SEPA (mean = 1.250 events) were found to be meaningfully lower than the SEPA obtained from the IE Sim simulator (mean = 4.875 events). The outcomes emanating from the comparative analysis are detailed in Table 5. Supported by statistical evidence, it was inferred that for 60% of the pressure sensors, the equivalence hypothesis was rejected. Hence, it can be deduced that the SEPA derived from IE Sim and the real world are different (CL = 95%) for most sensors.
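A paired comparison of this kind is straightforward to reproduce; the sketch below uses scipy's paired t-test on illustrative per-participant counts for one hypothetical door sensor (the values are invented, not the study's measurements):

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant SEPA for one door sensor: counts observed in
# the real environment versus counts produced by the simulator.
real_sepa = np.array([4, 5, 4, 3, 5, 4, 4, 4])
sim_sepa = np.array([2, 1, 2, 2, 1, 2, 2, 2])

# Paired t-test on the per-participant differences (alpha = 5%).
t_stat, p_value = stats.ttest_rel(real_sepa, sim_sepa)
mean_diff = np.mean(real_sepa - sim_sepa)
```

With a p-value below 0.05 the equivalence statement would be rejected, mirroring the paper's finding that real SEPA for door sensors tend to exceed the simulated counts.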

Predicting Real SEPA Using Simulated Data: The Application of Poisson Regression
Given that the equivalence hypothesis was rejected for most door and pressure sensors (Section 5.1.1), the next step was to define how the synthetic data could be modified to better approximate the real SEPA. Two types of Poisson regression-based models were proposed to deal with this challenge: sensor-based and dummy-variable-based. The following sub-sections illustrate the results obtained from each of these models, including validation (overdispersion, normality, and independence of residuals) and assessment of predictive ability R-Sq(adj). It is noteworthy that no regression model was defined for Sensor 7: Bowl cupboard (Prepare dinner) due to a lack of sufficient data.

Sensor-Based Poisson Regression Model
Poisson regression models were defined for the above-described sensors by utilizing the Minitab 17® statistical package. The resulting equations were validated (see Sections 3.1-3.3) to ensure their applicability in practical scenarios. The use of sensor-based models is proposed given the diversity of ADLs considered in this study. As mentioned above, models with high predictive ability can be used for complementing real datasets and then training algorithms capable of recognizing ADLs accurately.

Sensor 1: Bedroom Door (ADL: Go to Bed)
In this case, the model (9) was concluded to be statistically significant (p-value = 0.000) at an alpha level of 0.05. This means that at least one predictor coefficient is different from 0, as noted in Equation (9). Furthermore, X1 (simulated events per activity) and interactions including X2 (simulated activity duration) and X3 (simulated events per sensor) were found to be significant. Thus, a model considering these terms may be suitable for predicting Y (real events per sensor).
In this case, the predictive ability of the model was concluded to be satisfactory (R 2 adj = 90.21%). Such a model was also found to have the lowest Akaike Information Criterion -AIC (35.57) and is then concluded to strike a superior balance between data fit and its ability to tackle overfitting.
On the other hand, an Anderson-Darling test was undertaken to assess the normality of the residuals (Figure 5). Considering that p-value = 0.338 and AD = 0.366, it can be assumed that the residuals do not deviate significantly from the normal probability distribution. Also, the independence assumption was validated through an auto-correlation test whose metrics (Max|T| = 0.44) evidenced no dependence among residuals. Lastly, the deviance (p-value = 0.829) and Pearson (p-value = 0.846) coefficients were used to check for overdispersion. Given that both p-values are higher than the significance level (0.05), overdispersion is discarded and the proposed logarithmic equation (Equation (9)) can be suggested for predicting the response variable Y.
Sensor 2: Bed Pressure (ADL: Go to Bed)

Similar to Sensor 1: Bedroom door (Go to bed), the predictive model (10) was concluded to be statistically significant (p-value = 0.000) at 0.05. In this case, X1 (simulated events per activity), X2 (simulated activity duration), and interactions including X1, X2, and X3 (simulated events per sensor) were identified to explain the response variable. Thus, a Poisson-regression-based model incorporating these predictors may be appropriate for obtaining new observations of Y (real events per sensor). Indeed, the predictive ability R-Sq(adj) was calculated to be 93.24%, which ensures reasonable estimations of Y. The AIC index (40.73) also validates this conclusion while confirming the good data fit provided by the model.
On a different tack, Equation (10) was concluded to satisfy the Poisson regression assumptions. First, the normality of the residuals was verified through the Anderson-Darling test statistic (AD = 0.237; p-value = 0.688) and Quantile-Quantile plots (Figure 5). Besides, the auto-correlation was estimated to be ρ(h) = 0 (Max|T| = 0.61) for all h ≠ 0, thereby supporting the independence of the residuals. Ultimately, the Deviance (p-value = 0.721) and Pearson (p-value = 0.716) statistical tests sustained the equidispersion assumption. Based on these results, the logarithmic equation is concluded to be valid for predicting the real number of sensor (Bed pressure) events when going to bed.

Sensor 3: Bathroom Door (Use Bathroom)
The Poisson regression model (11) was also found to be suitable (p-value = 0.000) for approximating the real number of events registered by the Bathroom door sensor when participants used the bathroom. Indeed, the good data fit was evidenced through the adjusted coefficient of determination (R²adj = 95.09%) and the AIC (30.59).
The quality of the model described in Equation (11) was verified by assessing the assumptions explained in Section 3. First, the equidispersion property of the Poisson equation was confirmed through the Deviance (p-value = 0.987) and Pearson (p-value = 0.988) tests, which did not reject the null hypothesis. The normal distribution of the model residuals was evaluated using the Anderson-Darling test and QQ-plots (Figure 5). In this case, the resulting p-value (0.756) and AD coefficient (0.218) provide enough support for the normality assumption. Ultimately, the covariates were concluded to be uncorrelated within each sample (Max|T| = 1.0), which confirms the independence property of the model. In light of the above-mentioned findings, Equation (11) is concluded to be valid for predicting the response Y. Of particular interest is the inclusion of X1 (simulated events per activity) as the only predictor capturing variations in Y.

Sensor 4: Refrigerator (Prepare Breakfast)
A Poisson regression model (12) was also appropriate for estimating the real SEPA when participants opened the refrigerator during breakfast preparation (p-value = 0.000). This finding was checked by estimating R²adj (92.92%) and the AIC (29.98), which reflect high performance concerning prediction ability and data fit, respectively.
To justify the use of this model in practical scenarios, the Poisson regression assumptions were checked. Initially, the residuals were tested for deviation from normality (Figure 5). In this case, the results (p-value = 0.566; AD = 0.271) are in favour of the normality hypothesis. On the other hand, an auto-correlation test was performed to detect potential dependence among the residuals. Given that Max|T| = 0.82, dependence among residuals is discarded at a significance level of 0.05. The assessment of the Poisson model also included evaluating the equidispersion property. The Deviance (p-value = 0.965) and Pearson (p-value = 0.965) goodness-of-fit statistics concluded against the alternative; therefore, the overdispersion phenomenon cannot be sustained. Considering the aforementioned results, the model (Equation (12)) is assumed to be appropriate for predicting the response variable Y. Similar to the previous model, only one variable (X2: simulated activity duration) was identified as a good predictor of the real SEPA for the Refrigerator (Prepare breakfast) sensor.

Sensor 5: Chair Pressure (Prepare Breakfast)
For Chair pressure (Prepare breakfast), two interaction terms (including X1 and X3) were found to be significant at 0.05 and 0.1, respectively: X1·X3 (p-value = 0.049) and X1·X3² (p-value = 0.072). Nevertheless, the best predictive model incorporating these variables (13) (Normality: AD = 0.281, p-value = 0.508, Figure 5; Independence: Max|T| = 1.45; Equidispersion: Deviance p-value = 0.816 and Pearson p-value = 0.816) was not considered acceptable for estimating the real SEPA for this sensor (R²adj = 56.36%). It is then recommended to include other variables explaining the variability of sensor events when sitting on the chair (ADL: Prepare breakfast). Thereby, the predictive ability of the model can be improved to better train activity recognition algorithms focused on this ADL.

Sensor 6: Chair Pressure (Prepare Dinner)
A p-value of 0.016 confirms that Poisson regression modelling (Equation (14)) is suitable for representing chair pressure sensor events upon preparing dinner. In Equation (14), X3 (simulated events per sensor) and a quadratic combination of X1 (SEPA) and X2 (simulated activity duration) were found to explain part of the variability of the response (Y). R²adj (73.11%) and the AIC (11.69) indicate an acceptable ability to predict the real observations (Y) based on synthetic data (X1, X2, and X3).
The suitability of the model presented in Equation (14) was validated through the normality, independence, and equidispersion assumptions (see Section 3). On one hand, the Anderson-Darling test revealed that the residuals follow a normal distribution (AD = 0.191; p-value = 0.846). On the other hand, the auto-correlation analysis revealed no interdependence among residuals (Max|T| = 1.14). Ultimately, the p-values of the Deviance (0.946) and Pearson (0.960) tests evidence no overdispersion within the Poisson distribution. Thus, Equation (14) provides a reasonable approximation of real sensor events when sitting down on a kitchen chair.

Table 6 summarizes the results (prediction performance and validation) of the sensor-based regression models. In this study, most of the models (Equations (9)-(12)) were found to provide excellent predictions of real SEPA, while Equations (14) and (13) were concluded to offer acceptable and non-satisfactory transformations of synthetic data, respectively.

Poisson Regression Incorporating Dummy Variables
Dummy binary variables were also incorporated into the Poisson regression modelling as an alternative for predicting the real SEPA of any sensor. In this case, these parameters Cᵢ denote the presence or absence of a particular sensor i (i = 1, 2, . . . , 6), encoded by the two codes 1 and 0, respectively. The use of these artificial variables facilitates the application of a standard predictive model that can be adapted to different types of sensors [39]. The dummy variables to be included in the standard Poisson regression model were predefined accordingly, with Cᵢ = 1 when an observation belongs to sensor i and Cᵢ = 0 otherwise. In this case, X1 (SEPA), X2 (simulated activity duration), X3 (simulated events per sensor), C2, and C5 were included in the model, either singly or in combined form. Table 7 presents the variables with a significant influence on the real SEPA Y. Equations (15)-(17) (based on three combinations of C2 and C5) consolidate these terms for predicting the real SEPA of the sensors.
3. Table 8 enlists the performance metrics of the above Poisson regression models. Both R 2 (0.9461) and R 2 adj (0.9353) values evidence excellent data fit. In a similar vein, the small difference (0.0108) between these coefficients reveals no overfitting problems. On a different tack, R 2 pred (0.9272) denotes high predictive ability and new observations can be therefore derived for effectively training activity recognition algorithms. This is confirmed by the error standard deviation and PRESS whose values (0.4821 and 10.9856 respectively) are close to 0. Following this, the normality, equidispersion, and independence properties were assessed for validating the proposed Poisson regression models. Initially, the Anderson-Darling test (see Figure 6) was undertaken for defining whether the residuals follow a normal distribution. The results confirmed the normality hypothesis (AD = 0.272; p-value = 0.654) with a mean equals to zero. On the other hand, the randomness test revealed no significant auto-correlations among residuals (Max|T| = 0.96). Indeed, Figure 6 does not evidence the presence of runs nor other non-random patterns. Ultimately, both Pearson (p-value > 0.15) and Deviance (p-value > 0.15) were found to confirm the equidispersion property. The aforementioned outcomes confirm the appropriateness of the dummy-variable-based model for their use in the wild. Upon analyzing the results derived from Poisson regression models, we propose the application of sensor-based models in Go to bed (Bed pressure), Bathroom door (Use bathroom), and Refrigerator (Prepare breakfast). In contrast, the dummy-variable-based model is suggested for predicting the real SEPA derived from the rest of the sensors. The rationale behind this decision is the superior performance provided by this model (R 2 pred = 92.72%) compared to those resulting from sensor-based models (90.21%, 56.36%, and 73.11%).
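A dummy-variable Poisson regression of this kind can be sketched as follows. The design below is illustrative only: the covariate x1 stands in for a simulated-SEPA predictor, c2 and c5 stand in for the sensor dummies C2 and C5, and the coefficients and data are synthetic assumptions rather than the study's fitted values. R2pred is obtained here via explicit leave-one-out refits, i.e., from PRESS.

```python
import numpy as np

def fit_poisson(X, y, n_iter=50):
    """Log-linear Poisson regression fitted via iteratively
    reweighted least squares (IRLS); X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)            # fitted means under the log link
        z = X @ beta + (y - mu) / mu     # working response
        beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))
    return beta

# Hypothetical design: one continuous covariate plus two sensor dummies
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(1, 10, n)                   # e.g., simulated SEPA (assumed)
c2 = rng.integers(0, 2, n).astype(float)     # dummy: sensor 2 present or not
c5 = rng.integers(0, 2, n).astype(float)     # dummy: sensor 5 present or not
X = np.column_stack([np.ones(n), x1, c2, c5])
y = rng.poisson(np.exp(X @ np.array([0.3, 0.15, 0.5, -0.4]))).astype(float)

beta = fit_poisson(X, y)

# PRESS and R^2_pred via explicit leave-one-out refits
loo_resid = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = fit_poisson(X[mask], y[mask])
    loo_resid[i] = y[i] - np.exp(X[i] @ b_i)
press = float(np.sum(loo_resid**2))
r2_pred = 1.0 - press / float(np.sum((y - y.mean()) ** 2))
print(np.round(beta, 2), round(r2_pred, 3))
```

The dummy coefficients shift the log-mean by a sensor-specific constant, which is what allows one standard model to serve several sensor types at once.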

Conclusions
This paper presented the use of Poisson regression modelling for transforming simulated smart home data to provide an improved approximation of real SEPA. To this end, synthetic and real data were first compared to verify the equivalence hypothesis. This analysis indicated that the sensor events per activity produced by the IESim simulator do not tend to be statistically equivalent to real-world data. These results indicate that, whilst interactive simulators provide opportunities to facilitate the collection of data in the absence of a real environment, simulated data may not be truly reflective of data collected in the real world.
Results indicated that real SEPA can be better approximated (R2pred = 92.72%) if synthetic data is post-processed through Poisson regression incorporating dummy variables. Such a model is particularly suggested for predicting the real SEPA derived from three of the sensors cited in this study (Bedroom door, ADL: Go to bed; Chair pressure, ADL: Prepare breakfast; and Chair pressure, ADL: Prepare dinner). In addition, Equation (10) (R2pred = 93.24%), Equation (11) (R2pred = 95.09%), and Equation (12) (R2pred = 92.92%) are recommended for training algorithms recognizing three ADLs: Go to bed, Use bathroom, and Prepare breakfast. Further, the real SEPA from the sensors Bedroom door, Bed pressure, Bathroom door, Refrigerator, Chair pressure (Prepare breakfast), and Chair pressure (Prepare dinner) are well captured by a combination of Poisson modelling with quadratic (see Equations (9)-(11) and (14)-(17)) and cubic (see Equations (12) and (13)) covariates.
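For reference, the R2pred metric used throughout to rank the models is defined from PRESS, the sum of squared leave-one-out prediction errors. The following minimal sketch computes it for an ordinary least-squares fit, where the leave-one-out residuals have a closed form via the hat-matrix leverages; the data are synthetic and purely illustrative (for the Poisson models themselves, explicit leave-one-out refits would be used instead).

```python
import numpy as np

# Synthetic illustrative data (not study data)
rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
y = 2.0 + 0.8 * x + rng.normal(0.0, 0.5, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverages h_ii of the hat matrix H = X (X'X)^{-1} X'
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# PRESS: squared leave-one-out residuals, obtained without
# refitting via the shortcut e_i / (1 - h_ii)
press = float(np.sum((resid / (1.0 - h)) ** 2))

# R^2_pred: share of variability the model explains for *new* observations
r2_pred = 1.0 - press / float(np.sum((y - y.mean()) ** 2))
print(round(r2_pred, 4))
```

Because PRESS penalizes each observation's influence on its own prediction, R2pred guards against the overly optimistic fit that R2 alone can suggest.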
It is important to note the limitations of this study. In particular, the assessment was carried out using simulated data from a single simulator, IESim. Therefore, it is not possible to tell whether simulated data produced by other simulators would yield the same results. It is also not possible to say whether similar techniques would work with model-based approaches to data simulation. Nonetheless, the results from this research highlight the importance of considering the quality of simulated data when modelling solutions for human activity recognition. Future work will investigate the applicability of these findings to data generated by other simulation techniques, including both interactive and model-based approaches.
One appealing idea is the ability to use activity data from one person and transform these data so that they fit the profile of another person. Advances in the theory of transfer learning [40] could provide a means to this end. The relationships making this kind of transformation possible have not been covered in this study but remain an open question for future research.

Acknowledgments:
The authors would like to thank Giselle Paola Polifroni Avendaño for her valuable support during this research. Many thanks also to Jens Lundström for his helpful comments and previous work in the field.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: