SIMADL: Simulated Activities of Daily Living Dataset

Abstract: With the realisation of the Internet of Things (IoT) paradigm, the analysis of the Activities of Daily Living (ADLs) in a smart home environment is becoming an active research domain. The existence of representative datasets is a key requirement to advance the research in smart home design. Such datasets are an integral part of the visualisation of new smart home concepts as well as the validation and evaluation of emerging machine learning models. Machine learning techniques that can learn ADLs from sensor readings are used to classify, predict and detect anomalous patterns. Such techniques require data that represent relevant smart home scenarios for training, testing and validation. However, the development of such machine learning techniques is limited by the lack of real smart home datasets, due to the excessive cost of building real smart homes. This paper provides two datasets for classification and anomaly detection. The datasets are generated using OpenSHS (Open Smart Home Simulator), a simulation tool for dataset generation. OpenSHS records the daily activities of a participant within a virtual environment. Seven participants simulated their ADLs for different contexts, e.g., weekdays, weekends, mornings and evenings. Eighty-four files in total were generated, representing approximately 63 days' worth of activities. Forty-two files form the classification dataset, in which ADLs were simulated for classification problems, and the other forty-two files form the anomaly detection dataset, in which anomalous patterns were simulated and injected.


Introduction
Recent developments in technology have increased the adoption of smart devices and sensors in smart homes. With the realisation of the Internet of Things (IoT) paradigm, the number of these internet-connected devices is likely to grow. In a study conducted by Gartner [1], the number of connected "Things" was estimated at 8.4 billion devices in 2017. This number grew by 31% from 2016, and the study predicts that it will continue to grow and reach 20.4 billion connected devices by 2020. Moreover, the spending on IoT services that provide design, development, and implementation of IoT solutions was estimated to reach $273 billion by the end of 2017.
With the widespread usage of smart devices in smart homes, these environments will generate an enormous amount of streaming data. Analysing these data has the potential to provide novel services to the smart home inhabitants and to improve their standards of living.
Machine learning has been widely applied to develop probabilistic and statistical methods and sequence-learning algorithms to classify and predict the ADLs of inhabitants. Nowadays, machine learning models and their contribution to IoT applications are becoming one of the most active and interesting research areas [2]. The smart home is one of the most prominent applications of the IoT paradigm. There are many advantages to adopting smart home technologies, such as monitoring energy consumption, security, automation, entertainment, eldercare, etc. To implement machine learning techniques in any of these applications, a representative dataset for that application is required. The dataset is used to train and test the machine learning models in order to evaluate and validate their performance.
There are real smart home datasets available in the literature (e.g., [3][4][5]). However, they lack the flexibility to cope with the recent advancements in sensor techniques, and they are costly to build and construct. To the best of the authors' knowledge, there is no real-world dataset targeted at anomaly detection in the context of smart homes.
Smart home simulation tools are an alternative to constructing real smart homes. These tools allow the researcher to design a smart home suitable to the problem that they are investigating and generate a representative dataset. There is less cost and effort involved in the process, and they can cope with new emerging techniques. However, many of these simulation tools are not available in the public domain as open-source projects, or they lack flexibility and accessibility for both the researchers and the participants.
With regard to dataset generation, the simulation tools can be categorised into two approaches: model-based and interactive. The model-based approaches generate datasets using pre-defined scripts that specify the events, the probability of their occurrence, and their duration. The interactive approaches, on the other hand, capture the sensor activities and log them to the dataset in real-time. Examples of model-based approaches include [6][7][8]. Examples of interactive approaches include [9][10][11].
Both approaches have disadvantages for the researchers and the participants alike. Model-based approaches allow the researcher to generate big datasets in short periods of time. However, the generated datasets do not capture the realistic and fine-grained interactions that happen in real smart homes. The interactive approaches can capture these fine-grained interactions because they record the output of the sensors directly to the dataset. However, the interactive approaches produce smaller datasets, and it takes the participants more time to perform their habits. Most of the interactive tools focus on context-awareness applications and not on generating datasets.
OpenSHS is an open-source, 3D, cross-platform simulation tool that follows a hybrid approach, combining the advantages of both approaches. It allows the researcher to design a smart home specific to their research problem and generate a sufficiently large dataset in reasonable time while retaining the fine-grained interactions that the participants are performing.
This paper presents two datasets generated by OpenSHS for classification and anomaly detection problems. The remainder of this paper is structured as follows: Section 2 presents the related work in the literature. Section 3 explains the OpenSHS architecture and how we use it to generate the datasets. Section 4 presents our methodology to generate the two datasets. Section 5 provides a description of the datasets.

Related Work
In this section, we review some of the available real datasets in the literature and the simulation tools that allow the researchers to generate synthetic datasets.

Real Datasets
Alemdar et al. [3] published the ARAS (Activity Recognition with Ambient Sensing) dataset, which is a real dataset for complex multi-resident scenarios. The dataset was captured over two months in two different houses, each of which had two inhabitants. The ARAS dataset was used to assess ADL classification algorithms.
The Centre for Advanced Studies in Adaptive Systems (CASAS) is a project for creating real smart homes for the researchers in this field. Cook et al. [12] designed a simple and lightweight toolkit called "smart home in a box". The components of this toolkit are assembled in a single small box and are easily installed in a home to provide smart tasks. They have installed the toolkit in 32 smart homes and generated several datasets. The datasets are publicly available online [4].
TigerPlace [13] conducted a study on the ageing population. They used passive sensor networks that were installed in 17 flats within an eldercare facility. They used many kinds of sensors, such as motion sensors, pressure sensors, etc. In some of the flats, the collection of data took more than two years.
Some datasets focus on wearable technologies to monitor and acquire the activities performed by the participants. The Smartphone-Based Human Activities Recognition (SBHAR) dataset [14] is one example of such datasets. The authors collected the accelerometer and gyroscope data of 30 participants who performed several ADLs using a smartphone. Casale et al. [15] and Bruno et al. [16] are other examples of similar datasets.
The Intelligent System Laboratory (ISL) [17] generated a dataset from three smart homes in which a single participant was performing his ADLs. The dataset represents around two months' worth of data. The first smart home had 14 sensors, the second had 23 sensors, and the third had 21 sensors.
Using a camera feed to capture a participant's activities is another approach to recognising ADLs. Pirsiavash and Ramanan [18] presented a dataset of one million frames captured from a wearable camera that provides a first-person view. The data were gathered from 20 participants who performed unscripted ADLs in their homes.
The ContextAct@A4H dataset [19] is an example of recent datasets that focus on ADLs. The dataset was generated using a real flat equipped with many sensors of different types. The dataset consists of one week's worth of data captured during the summer season and three weeks' worth captured during the fall season. The authors proposed a new annotation method using temporal logic.

Simulation Tools
Synnott et al. [20] conducted a survey of existing simulation tools for generating datasets in a smart home environment. They found that, due to the cost of sensor technology, availability limitations, time considerations and the difficulty of finding the optimal sensor configuration, simulation tools are valuable assets for smart home research. The authors also identified that most of the available simulation tools focus on context-awareness applications and not on generating representative datasets. Moreover, support for multiple inhabitants was one of the features lacking in current simulation tools.
Cook et al. [21] presented some challenges facing the evaluation of machine learning performance and pervasive computing techniques. The authors identified the need for real datasets and noted the lack of such datasets in the literature.
Bouchard et al. [8] designed a 3D smart home simulator for activity recognition to overcome the limitations of creating real datasets in a smart home. Many pre-recorded scenarios were captured from clinical experiments and used to generate datasets.
To evaluate activity recognition algorithms, researchers require good representative datasets. Due to the high cost of building real smart homes and due to privacy and ethical issues with human subjects, Helal et al. [22] developed an event-driven simulation tool for researchers in the smart home domain. The developed simulator is called "Persim", and it can generate realistic datasets for complex scenarios of the occupant's activities.
An improved version of Persim, called PerSim 3D, was developed by Helal et al. [23]. This tool helps to generate realistic datasets from the inhabitants' activities in a smart home scenario. The major improvement was adding 3D simulations of the inhabitant, sensors and actuators. In addition, the tool provides the researcher with a Graphical User Interface (GUI) to envisage the activities in 3D.
The intelligent environment simulation (IE Sim) was developed by Synnott et al. [9] to generate synthetic datasets that capture the ADLs of smart home users. IE Sim provides the researcher with a 2D graphical interface of the floor plan to design the smart home. The researcher can add different types of sensors, such as temperature sensors, pressure sensors, etc. Then, using an avatar, the simulation can be carried out to capture ADLs. The output of the simulation is a dataset in the homeML [24] file format.

OpenSHS
Most of the available simulation tools follow one of two approaches to generate synthetic datasets: model-based and interactive [20].
The model-based approach relies on already defined statistical models of activities to generate synthetic data. The statistical model determines the order of events, the probability of occurrence, and the duration of activities. The model-based approach makes it easy to generate large datasets in a short period of time. The disadvantage of this approach is that it does not capture the fine-grained interactions and/or unexpected accidents that are common in real activities.
The interactive approach, on the other hand, can capture more interesting interactions and fine-grained details. This approach uses a virtual avatar controlled by a researcher, a human participant or a simulated participant. The avatar moves and interacts with the virtual environment, which is equipped with virtual sensors and/or actuators. These interactions can be passive or active.
An example of an active interaction is opening a door or turning the light on or off. An example of a passive interaction is a pressure sensor installed on the floor that detects the movements of the avatar without the avatar explicitly activating the sensor. The disadvantage of the interactive approach is how long it takes to generate enough data: because of the nature of the approach, the interactions must be captured in real-time.
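The model-based approach described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools: the `MODEL` table, its probabilities and its duration ranges are invented for illustration, and only the activity names are borrowed from the labels used later in this paper. The model fixes the order of events, and the generator samples whether each event occurs and how long it lasts.

```python
import random

# Hypothetical statistical model (an assumption, not from the cited tools):
# each entry is (activity, probability of occurring, (min, max) duration in minutes).
MODEL = [
    ("sleep", 1.0, (360, 480)),
    ("personal", 0.9, (5, 15)),
    ("eat", 0.8, (10, 20)),
    ("work", 0.5, (30, 60)),
]


def generate_events(model, rng):
    """Walk the model in its fixed order, sampling occurrence and duration."""
    events = []
    for label, probability, (low, high) in model:
        # rng.random() is in [0, 1), so probability 1.0 always fires.
        if rng.random() < probability:
            events.append((label, rng.randint(low, high)))
    return events


# A seeded generator makes the synthetic day reproducible.
events = generate_events(MODEL, random.Random(0))
for label, minutes in events:
    print(f"{label}: {minutes} min")
```

Because every value is drawn from the model, such a generator can produce months of data in moments, but it can never emit an interaction that the model does not already describe, which is the limitation noted above.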

OpenSHS Advantages
Most of the simulation tools in the literature are not open-source, with the exception of [8]. This makes it harder for the researcher to acquire the software and modify it to suit the experiment's needs. In addition, having a 3D simulation adds to the realism of the conducted experiment.
OpenSHS is an open-source smart home simulator that allows the participants to simulate their ADLs in a 3D virtual environment. OpenSHS is developed with open-source and cross-platform technologies, which makes it easy for the researcher to modify the tool and extend it according to their needs.
The approach that OpenSHS uses to generate datasets can be thought of as a hybrid of the model-based and interactive approaches. OpenSHS offers a replication mechanism for the recorded ADLs, which allows for the quick generation of large datasets, similar to the model-based approaches. The replications have fine-grained details, as the activities are captured in real-time, similar to the interactive approaches.
OpenSHS has the flexibility to add different activity labels that can be customised by the researcher and tailored to their needs. It also has a fast-forwarding feature which facilitates the simulation of long inactivity periods.
We use OpenSHS to generate the two datasets. One is for ADL classification and prediction problems, and the other is for anomaly detection problems.

Methodology
In this section, we present the design of the smart home and the contexts to be performed by the participants, followed by the aggregation and generation of the datasets.

Smart Home Design
We designed a smart home consisting of a bedroom, living room, bathroom, kitchen, and home office, as shown in Figure 1. Each room has several types of sensors. The smart home is equipped with twenty-nine binary sensors, as shown in Table 1. A binary sensor has two states, on (1) and off (0). The sensors can be divided into two groups, passive and active. The passive sensors do not explicitly require the participant to interact with them. Instead, they react to the participant's movements and position. An example of this type is the carpet sensors, which turn on when the participant walks over them.
The other type of sensors are the active sensors. This type requires an explicit action from the participant to change its state, for example, opening a door or turning on the light.
The activity labels that we decided to include in this dataset are: sleep, eat, personal, work, leisure, and other. The anomaly detection dataset includes an additional label, anomaly.
The participant controls a 3D avatar in first-person view, navigating and performing his/her ADLs in the virtual smart home environment. Throughout the simulation period, OpenSHS captures and records the state of all the smart devices and sensors every second. Some activities take a long time, such as staying at the office for studying. OpenSHS addresses this problem with a fast-forwarding mechanism which enables the participants to quickly perform long, constant activities.
During the simulation, when a participant wants to change his/her activity, they can do so by using the dialogue shown in Figure 2. It is worth noting that, when a participant changes the activity label, the change is not applied to the dataset immediately. The activity label changes only when the state of one of the sensors changes. This approach ensures a clean separation when the participant transitions from one activity to another.
OpenSHS uses the concept of a context, which is a specific time-frame of interest to the researcher to be simulated [25]. In this work, we have chosen to simulate the interactions of the participants in different contexts. The day contexts are "morning" and "evening", and the week contexts are "weekday" and "weekend". On the weekdays, we have two contexts, one in the morning and the other in the evening, and on the weekends we have the same two contexts. Thus, there are four different contexts per participant.
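The deferred labelling rule described above can be sketched as follows. This is an illustrative reading of the text, not the OpenSHS implementation: the function name `label_records`, its arguments and the toy two-sensor rows are all hypothetical. A label chosen in the dialogue is held as pending and takes effect only at the first record whose sensor states differ from the previous record.

```python
def label_records(sensor_rows, label_requests, initial_label):
    """Assign one activity label per recorded second.

    sensor_rows: list of tuples with the binary sensor states for each second.
    label_requests: dict mapping a row index to the label selected at that second.
    A requested label is applied only once a sensor state changes.
    """
    labels = []
    current = initial_label
    pending = None
    previous = None
    for index, row in enumerate(sensor_rows):
        if index in label_requests:
            pending = label_requests[index]
        # Switch labels only when some sensor differs from the previous second.
        if pending is not None and previous is not None and row != previous:
            current = pending
            pending = None
        labels.append(current)
        previous = row
    return labels


# Toy example: "eat" is selected at second 1, but no sensor changes until second 3.
rows = [(0, 0), (0, 0), (0, 0), (1, 0), (1, 0)]
print(label_records(rows, {1: "eat"}, "sleep"))
# ['sleep', 'sleep', 'sleep', 'eat', 'eat']
```

Tying the switch to a sensor event means the boundary between two activities always coincides with an observable change in the sensor readings.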

The Participants
The participants in this work were chosen randomly, but all of them have jobs. They also have experience with first-person games, which eases the learning curve of the tool.
The number of participants was 7, and the average time it took to conduct the simulation was 50 min (min time = 30, max time = 75, std time = 14.43).
For each participant, we followed this procedure:
1. The researcher guides the participant and shows him/her the virtual smart home.
2. The participant is asked to play with the virtual smart home to get familiar with it.
3. The participant's familiarity with the virtual smart home is tested by asking them to perform specific tasks.
4. The actual simulation takes place, and the participant is asked to give us their actual starting times for each context.
5. The participant is asked to complete the usability questionnaire.

The Anomalies
In some contexts, the definition of an anomaly is clear and can be quantified, for example, the heart rate of a patient. A heart rate that ranges from 60 to 100 beats per minute is considered a normal resting heart rate for an adult. However, in the context of an inhabitant's behaviour in their smart home environment, anomalous behaviour is difficult to define and hard to quantify. It becomes much more subjective and varies from one inhabitant to another. Thus, the anomalies in the datasets were not injected after the simulations were conducted on the basis of the researcher's idea of what an anomaly is.
To overcome the issues with defining what an anomaly is for an inhabitant, the researcher left this definition to the persons capable of defining these anomalies, the participants themselves.
Each participant performed an additional simulation that is intended to represent an anomaly from the point of view of the participant. All the anomalies are defined by the participants, and no restrictions were imposed by the researcher. Table 2 shows the anomaly that each participant simulated. Although there are seven anomalies in total, each anomaly is injected into six different contexts based on the user's behaviour.
Table 2. The anomalies defined by the participants.

Dataset Aggregation
To accelerate the process of generating the dataset, the participants are asked to perform several simulations of the same context. Since we record the activities of the participants in real-time, every simulation is different and contains unique information. OpenSHS provides an aggregation algorithm that uses all the real-time recorded simulations to generate a new dataset that is random but produced in a controlled manner [25].
For each participant, we have generated six datasets with unique parameters. The parameters used to generate each dataset are as follows:
The above parameters generated one month's and two months' worth of data. For the one-month set, we have three variants with time-margins of 0, 5, and 10. The same goes for the two-month set. This ensures that the generated datasets are different in the time dimension. Table 3 shows a sample of the final dataset.

Dataset Description
We generated a dataset for classification problems, and a dataset for anomaly detection problems. Each dataset consists of forty-two files, thus totalling eighty-four files. The naming convention used for the dataset files is d{x}-{y}m-{z}tm where:
• x is an index number to uniquely identify a dataset;
• y is the number of months generated; and
• z is the time-margin value.
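As a check on the arithmetic, the forty-two file names of one dataset can be enumerated with a short Python sketch. It assumes that the index x runs over the seven participants (1 to 7), which the text does not state explicitly, and it omits any file extension because none is given; the month and time-margin values are the ones listed in the Dataset Aggregation section.

```python
from itertools import product

# Assumption: x enumerates the seven participants.
PARTICIPANTS = range(1, 8)
MONTHS = (1, 2)            # one-month and two-month sets
TIME_MARGINS = (0, 5, 10)  # the three time-margin variants


def dataset_names():
    """Build every d{x}-{y}m-{z}tm combination."""
    return [
        f"d{x}-{y}m-{z}tm"
        for x, y, z in product(PARTICIPANTS, MONTHS, TIME_MARGINS)
    ]


names = dataset_names()
print(len(names))      # 7 participants x 2 month sets x 3 margins = 42
print(names[:3])       # ['d1-1m-0tm', 'd1-1m-5tm', 'd1-1m-10tm']
```

Two such sets, one for classification and one for anomaly detection, give the eighty-four files reported.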
The classification dataset has a target column of the previously mentioned labels of the activities, while the anomaly detection dataset has an additional label for the anomalous activity.In addition to the twenty-nine binary sensor readings, both datasets have a timestamp column.
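A minimal sketch of reading such a file with the Python standard library is given below. The column names (`Activity`, `timestamp` and the three sensor names), the timestamp format and the three sample rows are invented for illustration; the real files contain twenty-nine sensor columns, and their exact headers should be taken from the files themselves.

```python
import csv
import io

# Invented sample with three sensors only; real files have twenty-nine.
SAMPLE = """bedLight,fridge,mainDoor,Activity,timestamp
0,0,0,sleep,2016-04-01 08:00:00
1,0,0,personal,2016-04-01 08:00:01
1,1,0,eat,2016-04-01 08:00:02
"""


def read_records(handle, label_column="Activity", time_column="timestamp"):
    """Split each row into (timestamp, sensor readings, activity label)."""
    records = []
    for row in csv.DictReader(handle):
        label = row.pop(label_column)
        timestamp = row.pop(time_column)
        # Every remaining column is treated as a binary sensor (0 or 1).
        sensors = {name: int(value) for name, value in row.items()}
        records.append((timestamp, sensors, label))
    return records


records = read_records(io.StringIO(SAMPLE))
print(records[2])
```

To read an actual dataset file, the `io.StringIO` object would be replaced by an open file handle, with `label_column` and `time_column` adjusted to the real headers.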
Table 4 lists the number of records for both datasets, excluding the header record. It is worth noting that, for each file in the classification dataset, OpenSHS generated the final output randomly from the recorded samples. The same procedure was used for the anomaly dataset, with the exception that the anomalous activity was injected in the last quarter of the file. The decision to inject the anomalous activity towards the end of the file was made to allow a model to learn the normal patterns before detecting the anomalous ones in anomaly detection problems.
Figure 3 shows seven bar charts of the classification files. Each bar chart shows the proportions of the training records (the first 60%) and the testing records (the last 40%). Some files do not include all the labels because the participant did not perform that activity, for instance, in the dataset d1_2m_0tm, where the participant did not perform the "work" activity.
Figures 4 and 5 show the frequency of the active sensor readings that are associated with the "leisure" label in the training and testing samples, which shows that there are slight differences between the two. The remaining labels, figures, and dataset files are available online at http://datasets.openshs.org.
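The placement of the anomalous activity can be verified with a small Python sketch. The function name `anomaly_in_last_quarter` is hypothetical, and the label string `"anomaly"` is assumed to match the additional label described earlier; the toy label sequences are invented.

```python
def anomaly_in_last_quarter(labels, anomaly_label="anomaly"):
    """Return True if the first anomalous record lies in the final 25% of the file."""
    positions = [i for i, label in enumerate(labels) if label == anomaly_label]
    if not positions:
        return False
    # Integer form of positions[0] / len(labels) >= 0.75, avoiding float rounding.
    return positions[0] * 4 >= len(labels) * 3


# Toy sequences: twelve records with the anomaly in the last three, then a counterexample.
print(anomaly_in_last_quarter(["sleep"] * 9 + ["anomaly"] * 3))   # True
print(anomaly_in_last_quarter(["anomaly"] + ["sleep"] * 3))       # False
```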
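The 60/40 split and the per-label proportions can be sketched in Python as follows. This is an assumption about how such a split might be computed, not the authors' script: the function names are hypothetical and the ten-record label sequence is invented. The split is chronological, so the records are not shuffled.

```python
from collections import Counter


def chronological_split(records, train_percent=60):
    """Keep the first train_percent of the records for training, the rest for testing."""
    cut = len(records) * train_percent // 100
    return records[:cut], records[cut:]


def label_proportions(labels):
    """Fraction of records carrying each activity label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Toy label sequence standing in for the target column of one file.
labels = ["sleep"] * 4 + ["eat"] * 2 + ["leisure"] * 4
train, test = chronological_split(labels)
print(label_proportions(train))
print(label_proportions(test))
```

With a chronological split, a label that occurs only late in a file appears in the testing portion alone, so comparing the two proportion tables is a quick way to spot such cases.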

Conclusions
This paper introduces two datasets for the smart home research community, one for classification and the other for anomaly detection. The two datasets are generated using a simulation tool (OpenSHS), with which seven participants simulated their ADLs. The generated data amount to 63 days' worth of patterns for both datasets.
Representative smart home datasets, such as the ones presented in this paper, have direct machine learning applications, mainly for the training, testing and validation of new models. Different datasets are needed depending on the machine learning target application, i.e., classification, clustering, prediction or anomaly detection. The contributed datasets can be used to validate machine learning models that perform classification tasks and/or anomaly detection tasks in the smart home domain. Classification and anomaly detection tasks are applicable to many use cases, such as automation, eldercare, healthcare, entertainment, security, etc.
For future work, we will use the developed datasets to visualise smart home designs. This visualisation would allow researchers to identify drawbacks in a smart home environment, which will help to accelerate the development and proposal of new, effective designs. Moreover, within the IoT paradigm, the contributed datasets will be used to test and validate IoT frameworks.

Figure 1. The design of the smart home.

Participant 1: leaving the fridge door open.
Participant 2: leaving the oven on for a long time.
Participant 3: leaving the main door open.
Participant 4: leaving the fridge door open.
Participant 5: leaving the bathroom light on.
Participant 6: leaving the TV on.
Participant 7: leaving the bedroom light on and the wardrobe open.

Figure 4. The sensor readings for the leisure activity in the training sample.

Figure 5. The sensor readings for the leisure activity in the testing sample.

Table 1. All smart home sensors.
Figure 2. The activities selection dialogue.

Table 3. A sample of the final dataset output.

Table 4. The number of records for the forty-two files for both datasets.