A Comprehensive Review of Machine Learning Approaches for Anomaly Detection in Smart Homes: Experimental Analysis and Future Directions

Abstract: Detecting anomalies in human activities is increasingly crucial today, particularly in nuclear family settings, where there may not be constant monitoring of individuals’ health, especially the elderly, during critical periods. Early anomaly detection can help prevent attack scenarios and life-threatening situations. This task becomes notably more complex when multiple ambient sensors are deployed in homes with multiple residents, as opposed to single-resident environments. Additionally, the availability of datasets containing anomalies representing the full spectrum of abnormalities is limited. In our experimental study, we employed eight widely used machine learning and two deep learning classifiers to identify anomalies in human activities. We meticulously generated anomalies, considering all conceivable scenarios. Our findings reveal that the Gated Recurrent Unit (GRU) excels in accurately classifying normal and anomalous activities, while the naïve Bayes classifier demonstrates relatively poor performance among the ten classifiers considered. We conducted various experiments to assess the impact of different training–test splitting ratios, along with a five-fold cross-validation technique, on the performance. Notably, the GRU model consistently outperformed all other classifiers under both conditions. Furthermore, we offer insights into the computational costs associated with these classifiers, encompassing training and prediction phases. Extensive ablation experiments conducted in this study underscore that all these classifiers can effectively be deployed for anomaly detection in two-resident homes.


Introduction and Motivation
The growing number of Internet of Things (IoT) devices has transformed smart homes from a luxury into a necessity. These homes have an integrated ecosystem that enables users to efficiently monitor and control multiple gadgets and systems remotely. In addition to convenience, they also address critical concerns related to security and safety. Smart security systems provide real-time monitoring, alarms, and surveillance, increasing the safety of properties and their occupants. Furthermore, smart equipment can instantly identify anomalies and respond to crises, such as the health issues of the residents, reducing potential hazards. In today's globalized society, smart homes are essential for streamlining daily living, enhancing security, and ensuring the safety and well-being of individuals. Turning homes into smart homes provides such opportunities by utilizing the IoT and placing multiple sensors to monitor home activities. The data recorded from these sensors assist in monitoring the pattern of human activities, as well as detecting any deviations from the regular pattern [1,2]. Anomalies in human activities refer to the unusual execution of activities, such as deviating from the regular pattern or taking an abnormal duration for an activity. These anomalies should be detected to monitor the health of residents, especially elderly people, who require more surveillance and prompt response during critical times [1,3].
The detection of anomalies in human activities from the recorded data can provide important indications regarding the health of residents and aid in the timely prevention of health complications. Detecting anomalies in human activities has various applications, such as guiding dementia patients if they miss essential activities like taking meals or medicines and monitoring the health of those staying alone, especially elderly people [4,5]. Researchers have proposed various methodologies for anomaly detection, including the identification of body positions and actions and the recognition of visual activities [6–8]. Nonetheless, these approaches exhibit certain limitations. For example, in visual activity recognition, the privacy of the resident is compromised, while in body position and action identification, devices must be attached to the resident to record body positions and actions, such as standing, sitting, and walking, which is not always convenient [9–11]. In addition, most research on detecting anomalies in human activity considers a single resident. However, this approach is impractical when multiple residents live together in a house, where the activities of one resident are directly influenced by the activities of others [12–14].
Artificial intelligence has opened the door to more efficient identification of unusual human activities within smart homes. Leveraging machine learning classifiers such as decision trees, naïve Bayes, gradient boosting, random forest, k-nearest neighbors, and Support Vector Machines simplifies the process of recognizing anomalies [7,15–20]. The classifiers work by analyzing sensor data, which enables accurate classification of a resident's normal and abnormal activities. This approach has proven to be highly effective in ensuring the safety and security of smart home residents [21]. More recently, as deep learning has proven effective in classification tasks, researchers have applied neural network (NN)-based techniques such as Convolutional Neural Networks (CNNs) and recurrent neural networks (RNNs) for detecting anomalies in human activities [22–25]. However, multi-activity datasets are often converted into binary class datasets, which can result in imbalanced datasets, as only a few abnormal behaviors are generated and these do not represent all possible anomaly situations [26–28]. In many studies, only normal activities are used for training, and a threshold value is computed by estimating the loss to classify data as normal or anomalous [29,30]. An input is considered an anomaly when the summed loss is greater than the threshold value [22]. Although these techniques classify normal and abnormal activities well for single residents, their performance on multi-resident datasets has not been reported. Furthermore, we could not find a comprehensive resource for assessing the performance and computational complexity of various classifiers for detecting anomalies in multi-resident human activities.
In this review study, we conducted thorough experiments over popular machine learning classifiers, namely decision tree (DT), naïve Bayes (NB), gradient boosting (GB), Light Gradient Boosting (LGB), random forest (RF), k-nearest neighbors (KNN), Support Vector Machine (SVM), and logistic regression (LR), and two RNN-based models utilizing Long Short-term Memory (LSTM) and the Gated Recurrent Unit (GRU) for detecting anomalies in two-resident human activities. The key contributions of this study are as follows:

• We generated 50,000 abnormal activities by considering all potential anomalies that could occur in a two-resident home, significantly enhancing the reliability of our research findings.

• Our research includes a comprehensive guide that examines how varying the training-test splitting ratios and implementing k-fold cross-validation impact the performance of these classifiers.

• We also present a detailed analysis of the computational complexity of these classifiers, spanning from the training phase to making predictions. This analysis illustrates the trade-off between performance and computational costs associated with these algorithms.

• Our research entails a rigorous comparative analysis of these classifiers using the activity recognition using ambient sensing (ARAS) multi-resident smart home dataset. Additionally, we offer insights and recommendations for future researchers in this field, aiming to guide and inform their work on similar topics or within the same domain.

Our research offers valuable insights to fellow researchers by pinpointing the optimal machine learning algorithm within the smart home domain. Additionally, it contributes to the ongoing progress in crafting more efficient and effective anomaly detection methods for two-resident scenarios. These findings carry substantial potential for a wide range of applications across domains such as healthcare, security, and smart home technologies.
The rest of the paper is structured as follows: Section 2 covers the literature review. Section 3 explains different machine learning and deep learning classifiers used in this study and their detailed implementation. Section 4 presents the obtained results and comparative analysis among the classifiers, and Section 5 summarizes the proposed study along with future directions.

Background and Related Work
This study aims to detect anomalies in human behavior in two-resident homes using machine and deep learning techniques. Traditional vision-aided methods pose privacy concerns and require extensive computation due to processing large video data. Researchers have proposed methods utilizing sensor data to make anomaly detection more efficient while ensuring resident privacy.

Machine Learning-Based Human Activity Anomaly Detection
Identifying human activities is very important to automate the monitoring of the health of elderly people. A significant number of studies have been conducted to identify and classify human activities by analyzing the motion from video captured through closed-circuit television (CCTV) or other types of camera systems. Machine learning and deep learning models were widely used in many works to identify anomalies in activities [31,32] along with classification [33–35]. Since placing surveillance cameras to observe the residents raises data privacy concerns, sensor-based observation has become popular. Adrien et al. proposed a method for identifying human activities using the Hidden Markov Model (HMM) and a wearable motion suit with a sensor-attached glove [5]. Lawal, I. A. et al. conducted a similar study by placing sensors on seven body parts and using the Convolutional Neural Network (CNN) model over the collected frequency images [24]. While machine learning and deep learning classifiers work well over sensor data, placing sensors on residents for a long time is inconvenient. Researchers therefore suggest deploying sensors throughout the home to detect human activity and anomalies in order to create a smart home. Fahad et al. utilized SVM, one-vs-one (OSVM), and K-means classifiers to identify human activities and anomalies in the ARAS and the Center for Advanced Studies in Adaptive Systems (CASAS) smart home datasets [36]. Similarly, Gupta et al. established a sensor-based test bed to collect data and used HMM to detect anomalies in user behavior [15]. All these works assumed a single resident at home, but it is more realistic to consider multiple residents, as the activities of one resident are directly affected by the others. Several later studies detected anomalies in user behavior by considering multiple residents. Liang et al. employed machine learning to identify and differentiate between multiple residents' activities accurately while also flagging any unusual activities [9]. Howedi et al. utilized an entropy-based technique to detect anomalies in the presence of visitors [6]. Jakkula et al. introduced an algorithm that leverages temporal pattern discovery to identify irregularities in user activities [37].

Machine Learning-Based Anomaly Detection in Other Domains
In addition to human activity anomaly detection, machine learning models have been applied for anomaly detection in other domains such as cyber security, the IoT, finance, and manufacturing. Jadidi Z. et al. proposed an artificial neural network (ANN)-based model for identifying adversarial attacks in IoT and industrial IoT networks [38]. They reported the effectiveness of the CNN model in detecting adversarial attacks. Another study for detecting anomalies in cyber security was carried out by Vávra J. et al., where they applied four different machine learning (ML) and deep learning (DL) models for protecting industrial control systems [39]. They optimized the hyper-parameters to propose an adaptive anomaly detection system. To identify anomalies in an IoT network, a study was conducted using different machine learning models, and random forest was found to be more effective [40]. Several ML models have been applied to identify anomalies and fraud in finance. Alexander B. et al. conducted a study in which they applied seven supervised and two unsupervised models over general ledger data to identify transaction inconsistencies [41]. Along with supervised ML models, unsupervised models have also been used in several studies for anomaly detection. Schlegl T. et al. proposed an interpretable deep learning model for classifying anomalous and normal torque sequences in a manufacturing system [42].
As summarized in Table 1, most of the listed studies focused on threshold-based anomaly-detection methods. This approach entails determining an appropriate threshold value, which is achieved through a trial-and-error process. The datasets utilized in these studies primarily exhibited normal activities, with the models trained on these data and tested with user-generated anomalies. However, the issue with this approach is that the created anomalies may represent only part of the possible anomalies. Additionally, the studies did not comprehensively analyze the impact of different training-test splitting ratios and k-fold cross-validation on performance. To address these limitations, we conducted a study that utilized popular machine learning classifiers and two recurrent neural network (RNN) techniques on the ARAS dataset to identify anomalies in human activities. Furthermore, we conducted a comparative evaluation of these methods under different settings. To ensure the validity of our results, we generated 50,000 anomalies by considering all possible scenarios.

Table 1. List of related works with their objectives, contributions, and limitations in human activity anomaly detection.

Paper | Objectives | Contributions | Limitations
Adrien et al. [5] | Recognizing the activity based on the wearable sensors' data. | Proposed a probabilistic model based on HMM for single activity detection. | The dataset contains one activity, and features were extracted and selected manually.
Lawal, I. A. et al. [24] | Activity recognition based on the motion signals (accelerometer and gyroscope). | Converted the signals into frequency images and applied CNN models for recognizing activities. | The model cannot differentiate closely related activities.
Fahad et al. [36] | Identifying anomalies based on the number of activities performed each day. | Identified anomalies by considering missing or excess subevents and an unusual duration of an activity using the H2O autoencoder. | Works well for single residents but was not tested for multiple residents; ground truths were generated, but not validated.
Gupta et al. [15] | Classifying human behavior anomalies by utilizing the Internet of Medical Things and smart homes. | Applied the HMM model for identifying anomalies, where data were collected from the authors' own test bed. | HMM works well when the hidden states are few and requires effective feature engineering for better performance.
Liang et al. [9] | Activity recognition of multiple residents using historical activity features. | Different machine learning models like random forest (RF), decision tree (DT), Support Vector Machine (SVM), and k-nearest neighbor (KNN) and neural network models such as Multilayer Perceptron (MLP) and Long Short-term Memory (LSTM) were used to classify human activities. | The considered features are not enough to classify all activities, including anomalies, accurately.
Howedi et al. [6] | Detecting anomalies in human activity in the presence of visitors. | Applied entropy-based models to classify the samples and identify anomalies. | Finding the optimal threshold for classification is difficult and significantly impacts the performance.
Jakkula et al. [37] | | Used temporal features in conjunction with the machine learning model to detect the anomalies in human activities. Generated synthetic data to increase the size of the dataset. | The quality of the synthetic data was not validated, and finding the temporal pattern, including the interval, is challenging.

Analysis and Comparison of Machine Learning-Based Anomaly Detection
This section discusses the applied machine learning and deep learning models, the experiment in detail, and the dataset used. The entire workflow of this study is shown in Figure 1. This study starts with collecting the data and cleaning them to remove null and inconsistent values. Then, we scaled the inputs to remove the bias of a specific activity on the outcome. We further split the data into training and test sets randomly and, finally, trained and tested the machine learning (ML) and deep learning (DL) models. The details of each ML and DL model are described in the following subsections.

Machine Learning Models
Eight machine learning classifiers were applied to detect human activity anomalies in this study.

Decision Tree (DT)
Decision trees (DTs) are a popular machine learning tool that can be used for both classification and regression. Once trained on a dataset, they can classify new, unseen data. A decision tree consists of three types of nodes, namely the root, inner, and leaf nodes, which are connected by edges. Among the many DT algorithms, we used the classification and regression trees (CART) method to identify anomalies. This method recursively splits the dataset into subsets by making binary decisions based on the input features at each stage until it reaches a stopping point or a predetermined depth. Starting at the tree's root node, the algorithm selects the feature that optimally divides the dataset into subsets according to a chosen criterion. The criterion used here is the Gini index, a measure of inequality between 0 and 1 that quantifies how unevenly the classes are distributed. We determined a feature's Gini Gain by calculating the Gini index across all of the values of that feature within the dataset [43,44], as shown in Equation (1):
$$\mathrm{Gini} = 1 - \sum_{j=1}^{n} p_j^{2} \qquad (1)$$
where $p_j$ indicates the likelihood that a dataset sample belongs to class $j$ and $n$ indicates the total number of classes in the dataset. The Gini Split Info calculates the Gini index across all feature values, and the Gini index of the $i$-th feature is computed according to Equation (2).
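The Gini computation described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the weighted split form used in `gini_of_split` is the standard CART definition, assumed here because the paper's Equation (2) is not reproduced in this text.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum_j p_j^2 of a list of class labels (Equation (1))."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def gini_of_split(feature_values, labels):
    """Size-weighted Gini index of the subsets induced by each feature value.

    Assumption: standard CART weighting, not necessarily the paper's Equation (2).
    """
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / total * gini_index(group) for group in groups.values())


def gini_gain(feature_values, labels):
    """Reduction in Gini index obtained by splitting on the feature."""
    return gini_index(labels) - gini_of_split(feature_values, labels)
```

A binary sensor that perfectly separates normal from anomalous samples yields pure subsets, so its gain equals the Gini index of the unsplit data.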

Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a widely used machine learning classifier that separates classes of data points by locating the best hyperplane in a feature space. The main goal of SVM is to locate the hyperplane that maximizes the distance between the classes while minimizing classification errors [45]. The margin, i.e., the distance between the hyperplane and the closest data point of each class (the support vectors), is a crucial quantity in the SVM algorithm. The SVM algorithm looks for the hyperplane with the maximum margin (as illustrated in Equation (3)), as it effectively separates the data and increases the model's ability to generalize to unseen data. In the SVM classifier, a weight or coefficient is assigned to each feature in the dataset, and the classifier learns a decision boundary by optimizing these weights. The best hyperplane is found by solving an optimization problem that minimizes a cost function while ensuring that all data points are accurately classified and the margin is maximized. Suppose the hyperplane defined by $w$ and $b$ is the set of points $H = \{x \mid w^{T}x + b = 0\}$, and the margin is denoted by $\gamma$. The maximum margin would be:
$$\max_{w,b}\ \gamma(w,b), \qquad \gamma(w,b) = \min_{i}\ \frac{y_i\,(w^{T}x_i + b)}{\lVert w \rVert} \qquad (3)$$
where $y_i$ is the class label associated with the $i$-th data point $x_i$. This formulation makes SVMs suitable both for linearly separable data, where a straight line serves as the hyperplane, and for non-linearly separable data, where SVMs can map the data into a higher-dimensional space using a kernel function, allowing them to identify a more complex decision boundary.
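To make the margin concrete, the sketch below evaluates the geometric margin of a given hyperplane on a small labeled set, assuming labels in {-1, +1}. It only evaluates a candidate (w, b); it does not perform the optimization an SVM solver carries out, and the function name is hypothetical.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over all points.

    A positive value means every point lies on the correct side of the
    hyperplane; an SVM searches for the (w, b) that maximizes this value.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )
```

Comparing this value across candidate hyperplanes shows why the maximum-margin one is preferred: it leaves the largest buffer around the closest training points.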

Naïve Bayes (NB) Classifier
The naïve Bayes classifier is a machine learning algorithm that uses probability to classify data. It is based on Bayes' theorem and assumes that features are independent (naïve) [46]. The algorithm first estimates the prior probabilities of each class using training data and then calculates the conditional probabilities of each feature given each class. To make predictions, it combines the prior probabilities with the likelihood of the observed features for each class and selects the class with the highest posterior probability. Naïve Bayes is efficient with high-dimensional data, but may not capture complex relationships between features. The NB classifier uses Bayes' theorem, presented in Equation (4):
$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)} \qquad (4)$$
where $P(C \mid x)$ is the probability of a data point belonging to class $C$ given its features $x$, $P(x \mid C)$ is the probability of features $x$ given class $C$, $P(C)$ is the prior probability, and $P(x)$ is the marginal probability.
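As a worked illustration of Equation (4) under the independence assumption, the following sketch combines class priors with per-feature likelihoods and normalizes by the evidence. The function name and all probability values in the usage note are made up for illustration and are not estimates from the ARAS data.

```python
def naive_bayes_posterior(priors, likelihoods):
    """Posterior P(C | x) for each class under the naive independence assumption.

    priors:      {class: P(C)}
    likelihoods: {class: [P(x_1 | C), P(x_2 | C), ...]} for the observed features
    """
    joint = {}
    for cls, prior in priors.items():
        prob = prior
        for feature_likelihood in likelihoods[cls]:
            prob *= feature_likelihood  # independence: likelihoods multiply
        joint[cls] = prob
    evidence = sum(joint.values())  # P(x), the marginal probability
    return {cls: prob / evidence for cls, prob in joint.items()}
```

For example, with priors of 0.9 (normal) and 0.1 (anomaly) and feature likelihoods [0.2, 0.1] and [0.8, 0.9], respectively, the joint scores are 0.018 and 0.072, so the anomaly class receives posterior 0.8 despite its small prior.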

Gradient Boosting (GB) Classifier
The gradient boosting (GB) classifier is an effective ensemble machine learning approach for classification applications. It works by gradually integrating numerous weak learners (usually decision trees) so that each new learner corrects the errors committed by the preceding ones. During training, GB makes an initial prediction (typically the target variable's mean) and then fits a weak learner to the residuals of past predictions, adjusting the model to reduce residual errors [47]. This procedure is repeated iteratively, with each new learner focusing on the remaining errors as the model's overall accuracy improves. The GB update, which incorporates a weak learner into the ensemble at each iteration, can be expressed as Equation (5):
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (5)$$
where $t$ is the iteration number and $\hat{y}_i^{(t)}$ is the predicted score for the $i$-th training example at iteration $t$. The weak learner added at iteration $t$ is denoted by $f_t$.
This iterative process allows gradient boosting to gradually improve its predictions by incorporating the knowledge of multiple weak learners, eventually leading to a strong ensemble model. Gradient boosting is a popular choice in many machine learning applications due to its high predictive accuracy, robustness against overfitting, and ability to capture complex relationships in data.
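The additive update can be sketched on a toy one-dimensional problem. This is an illustration only, under several assumptions not taken from the paper: squared-error residuals, single-threshold stumps as weak learners, and a shrinkage factor `learning_rate` applied to each learner. Practical classifiers fit gradients of a classification loss instead.

```python
def fit_stump(xs, residuals):
    """Fit a one-threshold regression stump to the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x value: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the target mean, then repeatedly add a stump fit to residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    learners = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        f_t = fit_stump(xs, residuals)
        learners.append(f_t)
        # Additive update: new prediction = previous prediction + scaled f_t(x)
        predictions = [p + learning_rate * f_t(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(f(x) for f in learners)

    return predict
```

With `learning_rate=1.0` the loop matches the unscaled update exactly; smaller values trade more rounds for smoother convergence.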

Light Gradient Boosting Machine (LGBM) Classifier
The Light Gradient Boosting Machine (LGBM) classifier is a fast, efficient, and high-performing gradient boosting system for machine learning and statistical modeling. Because of its histogram-based learning method, the LGBM is well suited for training on huge datasets, as it can calculate gradients more quickly by grouping data points into histogram bins [48]. Using a structure very similar to classic gradient boosting, it gradually adds decision trees to enhance prediction precision. The LGBM is a popular option for many machine learning tasks, including classification, regression, and ranking, as it offers features such as regularization, early stopping, and the handling of missing values.

Random Forest (RF) Classifier
The random forest (RF) classifier is an ensemble machine learning technique that builds numerous decision trees during training and then pools their predictions to boost accuracy and mitigate overfitting. These decision trees are generated using a method called bagging (Bootstrap Aggregating). For each tree in the forest, a bootstrap sample is generated by randomly drawing samples from the training data with replacement. Random forest can be represented as the aggregation of the predicted output ($\hat{y}_i$) from individual decision trees:
$$\hat{y}_i = \frac{1}{N} \sum_{k=1}^{N} f_k(x_i) \qquad (6)$$
In Equation (6), $f_k(x_i)$ is the prediction of the $k$-th decision tree for the $i$-th sample and $N$ represents the number of decision trees in the forest.
To further increase the randomization and variety of the trees, a random subset of features is evaluated for splitting at each decision tree node. Combining bootstrapped samples and feature randomization mitigates the risk of overfitting and improves tree decorrelation. Each tree in the forest makes its own classification or regression during prediction, and the final output is decided by majority vote (classification) or the average (regression) [49]. Due to its effectiveness in many machine learning applications, random forest is frequently employed as an ensemble technique to improve predictive accuracy, resilience, and generalization.
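The three ingredients described above (bootstrap sampling, per-node feature subsets, and vote aggregation) can be sketched as follows. The helper names are hypothetical, and the trees are represented as arbitrary callables rather than fitted CART models.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, subset_size, rng):
    """Pick the feature indices considered for one node split (no repeats)."""
    return rng.sample(range(n_features), subset_size)


def forest_predict(trees, x):
    """Classify x by majority vote over the individual tree predictions."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```

Because each tree sees a different resample and a different feature subset at each split, the errors of individual trees tend to be less correlated, which is what makes the vote more reliable than any single tree.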

k-Nearest Neighbors (KNN) Classifier
The k-nearest neighbors (KNN) classifier is a non-parametric and instance-based machine learning technique used for classification and regression. It is based on the idea of similarities between features. In KNN, each data point is assigned a category based on the classification of its k-nearest neighbors, where the user selects k as a hyper-parameter. The predicted class for a new data point is the majority class among the k-nearest neighbors, as expressed in Equation (7):
$$\hat{y} = \arg\max_{c} \sum_{i=1}^{k} I(y_i = c) \qquad (7)$$
where $c$ ranges over the class labels, $\hat{y}$ is the predicted class, and $y_i$ denotes the class label of the $i$-th nearest neighbor. In this formula, $I(\cdot)$ is the indicator function, which returns 1 if the condition inside is true and 0 otherwise.
In this study, we conducted a comparative analysis and found five neighbors to be the best-performing value of k. The algorithm determines the distance of a new data point from the rest of the dataset, usually using the Euclidean distance metric, to decide its classification. It picks the most frequent label among the k-nearest data points for classification tasks, and for regression tasks, it takes the mean of the selected points. The algorithm's performance is highly dependent on the value of k; smaller values make it vulnerable to local noise, while larger values can lead to over-smoothing of the decision boundary [50]. KNN is computationally cheap during training, but its prediction step can be time-consuming, particularly when working with large datasets, due to the need to calculate distances to all data points.
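The voting rule in Equation (7) can be written directly as a brute-force sketch with Euclidean distance. The function name is hypothetical, and the default `k=5` mirrors the value reported above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    # Distance from the query to every training point (the costly step).
    ranked = sorted(
        ((math.dist(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The sort over all training points makes the prediction cost grow with the dataset size, which is the overhead noted above; practical libraries reduce it with tree- or index-based neighbor search.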

Logistic Regression (LR) Classifier
Logistic Regression (LR) is a popular technique in supervised machine learning, primarily used for binary classification tasks. It can be adapted to handle multi-class classification as well. Unlike a true regression technique, logistic regression employs the logistic (sigmoid) function to represent the likelihood of a binary outcome, such as 0 or 1, as a function of one or more predictor variables. The logistic function converts a linear combination of predictor variables into a probability score between 0 and 1, making it a useful tool for classification [51]. The logistic regression model estimates each predictor's impact on the log odds of the binary outcome, and the coefficients are optimized during the training process to improve the model's performance. Logistic regression employs a threshold, typically 0.5, to categorize data points. Values above the threshold are assigned to one class, while those below the threshold are assigned to the other class.
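The prediction step described above reduces to a linear score, a sigmoid, and a threshold. In the sketch below, the function names are hypothetical and the weights are assumed to be already learned; the training procedure itself is not shown.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class for feature vector x."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)


def predict(weights, bias, x, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0
```

Raising the threshold above 0.5 makes the classifier more conservative about predicting the positive class, which can matter when anomalies are rare.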

Deep Learning Techniques
Since we are working with the temporal (time-dependent) dataset for anomaly detection, we also looked into deep learning techniques (neural networks with deep structures) and applied two recurrent neural network (RNN) techniques named Long Short-term Memory (LSTM) and the Gated Recurrent Unit (GRU) model, which are described below.

Long Short-Term Memory (LSTM) Model
Long Short-term Memory (LSTM) is a type of recurrent neural network architecture that has been designed to solve the vanishing gradient problem. It can effectively capture long-range dependencies in sequential data. LSTMs are composed of memory cells and a network of gates that regulate the flow of information. Each LSTM cell maintains a hidden state, which can capture and store information over extended sequences. It also has a cell state, which selectively retains or forgets information. The gates in an LSTM, namely the forget gate, input gate, and output gate, control the flow of data through mathematical operations such as elementwise multiplication and addition [52]. Figure 2 presents the base structure of LSTM. These gates enable LSTMs to learn and remember patterns in sequential data, making them particularly well suited for tasks such as natural language processing (NLP), speech recognition, and time series prediction. LSTMs can capture complex temporal relationships and have been instrumental in achieving state-of-the-art results in a wide range of sequence modeling tasks. The architecture (and layers) of the LSTM model we used is shown in Figure 3.

Gated Recurrent Unit (GRU) Model
The Gated Recurrent Unit (GRU) is another type of recurrent neural network (RNN) architecture designed to process sequential data. It addresses the issue of vanishing gradients commonly found in traditional RNNs. The GRU cell comprises several components, including a reset gate and an update gate. These components work together to regulate the flow of information within the cell. The reset gate determines which information from the prior hidden state should be reset or forgotten, while the update gate controls the extent to which the new input should impact the update of the hidden state. By combining these gates, GRU cells can selectively update their hidden states, allowing them to capture long-term dependencies in sequential data [48]. Figure 2 presents the base structure of the GRU. In comparison to other RNN variants like the LSTM, the GRU's design is simpler, as it updates the hidden state directly without a dedicated memory cell. Since the GRU is effective at modeling long-term dependencies, it is frequently used for tasks like NLP, speech recognition, and time series analysis. The architecture of the GRU model we used is shown in Figure 3.
For anomaly detection, we employed models based on both the LSTM and GRU, which share the same underlying architecture. We employed a straightforward design consisting of two LSTM/GRU layers, each comprising two LSTM/GRU units.
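The role of the two GRU gates can be seen in a scalar, single-unit sketch of one time step. This is an illustration rather than the model used in the experiments: the parameter names are hypothetical, and the blending convention (update gate weighting the candidate state) is one of two equivalent conventions found in the literature and in libraries.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One GRU time step for a single scalar unit with parameters in dict p."""
    update = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    reset = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much past state is used.
    candidate = math.tanh(p["w_h"] * x + p["u_h"] * (reset * h_prev) + p["b_h"])
    # The update gate blends the previous state with the candidate.
    return (1.0 - update) * h_prev + update * candidate


def run_gru(inputs, p, h0=0.0):
    """Feed a sequence through the cell and return the final hidden state."""
    h = h0
    for x in inputs:
        h = gru_step(x, h, p)
    return h
```

When the update gate stays close to 0, the previous hidden state passes through almost unchanged, which is how the cell can carry information across many time steps.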

Dataset
Within the scope of our research, we made use of a real-world dataset known as activity recognition using ambient sensing (ARAS; https://www.cmpe.boun.edu.tr/aras/, accessed on 19 April 2024). The dataset consists of data recorded by twenty binary sensors strategically positioned in various locations across a two-resident home. Data were captured every second for thirty days from two homes, called House A and House B, with two residents each. The dataset creators labeled the data recorded at each second based on the activity of both residents in the house. On the basis of the readings obtained from the sensors, a total of 27 activities were assigned to both of the occupants. Because the activity of one resident is affected by the activity of the other resident and both cannot activate the same sensor at the same time, we took into account the sensor readings in addition to the activity of one resident to determine whether the other resident's activities were abnormal.
Abnormal activity is defined as any event that triggers an unrelated sensor. For instance, if someone is outside the home, but a sensor inside the home is activated, it is considered abnormal. The various types of activities and sensor placements are detailed in Table 2. In each house, a total of 2,592,000 normal activities for both residents were recorded. Since the dataset does not contain any anomalies, we created 50,000 anomalies (https://github.com/Rahman-Motiur/Anomaly-Detection-in-Smart-Home, accessed on 19 April 2024) by considering all possible combinations of sensor values that may lead to anomalies.

Table 2. Dataset description. This table includes the placement of sensors (20 different locations in the home) and the list of activities that were evaluated.
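The "outside the home but an indoor sensor fires" example can be expressed as a simple rule. The sketch below is a deliberately simplified, hypothetical check and not the generation procedure used for the 50,000 anomalies: it assumes an activity label named "Going Out", assumes every sensor in the reading is located inside the home, and only handles the case where both residents are out.

```python
def is_anomalous(activity_r1, activity_r2, sensors, outside_label="Going Out"):
    """Flag a one-second reading as abnormal under a simplified rule.

    activity_r1, activity_r2: activity labels of the two residents
    sensors: iterable of 0/1 readings from the in-home binary sensors
    Rule (assumption): if both residents are labeled as outside, no in-home
    sensor should be active.
    """
    both_outside = activity_r1 == outside_label and activity_r2 == outside_label
    return both_outside and any(sensors)
```

When only one resident is out, an active sensor may be explained by the other resident, so deciding abnormality then requires checking whether the sensor is related to that resident's current activity.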

Experiments
In our study, we conducted thorough experiments to report on the performance of machine learning and deep learning classifiers in anomaly detection under different circumstances. All the models' hyper-parameters are presented in Table 3. For the experiments, we split the dataset into training and test sets using different ratios (80:20, 70:30, and 60:40) and used 5-fold cross-validation to determine the comparative performance of the models. As the number of anomalies is smaller than the number of normal activities, we used stratified splitting to preserve the class proportions in both the training and test sets. We also used stratified splitting in the k-fold cross-validation so that each fold preserves the same class proportions. We implemented early stopping by monitoring the training loss in order to prevent overfitting of the models.
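As an illustration of stratified splitting, the sketch below partitions sample indices class by class so that the test set keeps roughly the same share of anomalies as the full dataset. It is a standard-library approximation written for this explanation; the function name and the rounding rule are assumptions, and the study's own splitting code may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class, rounded to a whole sample.
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)
```

For example, with 980 normal and 20 anomalous samples and `test_fraction=0.2` (the 80:20 ratio), the test set receives 196 normal and 4 anomalous samples, so the minority class is not lost by chance.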

Computing Platform
Our experiments were conducted on a cluster server consisting of 4 nodes, each with 64 cores, 512 GB of memory, and one NVIDIA A30 Tensor Core GPU (24 GB). We used PyTorch 1.13.1 and the CUDA toolkit 11.2 to implement the models and conduct the experiments.

Evaluation
The performance of the applied classifiers on the ARAS dataset was measured using several metrics: the accuracy, precision, recall, F-1 score, macro average F-1, and weighted average F-1, presented in Equations (8)-(13). The values of these metrics range between 0 and 1, with 1 indicating the best performance. As the dataset is slightly imbalanced, we used the macro average F-1 and weighted average F-1 to ensure reliable performance comparison. The description of each metric and its formula is as follows:
Accuracy: The accuracy measures the proportion of correctly classified predictions among the total number of predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
where TP refers to the number of true positives, TN refers to the number of true negatives, FP represents the number of false positives, and FN denotes the number of false negatives. These measures refer to the number of instances that a classifier has correctly (true) or incorrectly (false) predicted as belonging to the positive or negative class (where positive and negative in this context refer to being or not being in a defined class, respectively).
Precision: Precision measures the proportion of instances that are correctly classified as positive (TP) among all positive predictions made.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$
Recall: This score measures the proportion of true positive predictions among all actual positive instances, whether they are correctly classified as positive or incorrectly classified as negative (FN). Recall is, thus, calculated as the number of true positive predictions divided by the sum of true positive and false negative predictions.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$
F-1 score: The F-1 score is the harmonic mean of the precision and recall.
$$\text{F-1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$
Macro average F-1: This score calculates the F-1 score for each class independently and then takes the unweighted average of these scores. Unweighted average means that this score treats all the classes equally regardless of the number of instances they have.
$$\text{Macro F-1} = \frac{1}{N} \sum_{i=1}^{N} \text{F-1}_i \tag{12}$$
Weighted average F-1: This score calculates the F-1 score for each class independently and then takes the weighted average of these scores, weighted by the number of true instances for each class. In this score, the classes with more instances receive a higher weight in the calculation.
$$\text{Weighted F-1} = \sum_{i=1}^{N} w_i \, \text{F-1}_i \tag{13}$$
In the above measures, TP, TN, FN, and FP denote the numbers of true positives, true negatives, false negatives, and false positives, respectively; N denotes the number of classes; F-1_i denotes the F-1 score of class i; and w_i denotes the weight assigned to class i, that is, its share of the true instances.
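The metric definitions above can be checked on a small example. The following sketch computes the accuracy, per-class F-1, macro average F-1, and weighted average F-1 from two label lists using only the standard library. The function names are illustrative, and treating an undefined precision or recall (zero denominator) as 0 is an assumption of this sketch.

```python
from collections import Counter


def class_f1(y_true, y_pred, positive):
    """Precision, recall, and F-1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(y_true, y_pred):
    """Accuracy plus macro and weighted averages of the per-class F-1 scores."""
    n = len(y_true)
    support = Counter(y_true)  # number of true instances per class
    classes = sorted(support)
    f1_by_class = {c: class_f1(y_true, y_pred, c)[2] for c in classes}
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "macro_f1": sum(f1_by_class.values()) / len(classes),
        "weighted_f1": sum(support[c] / n * f1_by_class[c] for c in classes),
    }
```

With eight normal samples (one misclassified) and two anomalies (one misclassified), the per-class F-1 scores are 0.875 and 0.5. The macro average is 0.6875 while the weighted average is 0.8, which shows how the weighted score is dominated by the majority class.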

Results and Discussion
Our study aims to perform in-depth experiments on ten machine learning and deep learning models to detect anomalies in two-resident home activity. We applied these models to data from the two houses separately to see how different training-test splitting ratios and k-fold cross-validation affected their performance. We also conducted comparative experiments to report the computational cost of these classifiers in time series anomaly detection.

Performance on House A
In House A, two residents live together, and the activities of one resident are classified based on the active presence of the other. Table 4 displays the performance of the ten classifiers on the ARAS dataset using five-fold cross-validation. The results indicate that the Gated Recurrent Unit (GRU) model performed the best, followed by the random forest model, while the Gaussian naïve Bayes model delivered the lowest performance across the metrics. The other classifiers listed in Table 4 also performed reasonably well by comparison. We evaluated the performance of the models on the House A data using different splitting ratios (80:20, 70:30, and 60:40) for training and testing. The results of the applied models are shown in Figure 4. The figure indicates that, for the most part, the performance of each classifier did not vary significantly across different splitting ratios. Figure 4c-e show that the Gaussian naïve Bayes (GNB) and KNN classifiers performed differently in terms of the recall, macro F-1, and weighted F-1 scores for different splitting ratios. The performance of the other classifiers remained consistent across splitting ratios for most of the metrics, suggesting that the splitting ratio does not significantly impact performance. However, there was a significant difference in performance among the classifiers.

Performance on House B
In addition, we conducted comparative experiments on the House B data from the ARAS multi-resident activity dataset. The models were evaluated using five-fold cross-validation, and their performance is presented in Table 5. As shown in the table, the Gated Recurrent Unit (GRU) achieved the highest performance on House B, followed by the decision tree (DT), while the Gaussian naïve Bayes (GNB) had the lowest performance. These results are similar to those observed for House A.
To gather more empirical information, we analyzed the impact of different training-test splitting ratios on the House B data from the ARAS dataset. The comparative results are presented in Figure 5. Similar to our findings for House A, we observed that the GNB and KNN classifiers exhibited varying levels of performance in terms of the recall, macro F-1, and weighted F-1 scores for different splitting ratios. On the other hand, the remaining classifiers displayed more consistent performance across the metrics.
Table 5. Comparative performance of the ten applied machine learning and deep learning classifiers on the House B data from the ARAS multi-resident dataset with 5-fold cross-validation. In the table, an asterisk (*) indicates the overall best performance, (**) indicates the second-best performance, and (+) indicates the worst performance.

Computational Cost Analysis
As part of our study, we examined the computational cost of each classifier on the ARAS multi-resident activity dataset. We measured the time each classifier required to complete training and testing, and Figure 6 displays the results. The chart indicates that the LSTM took the longest time, followed by the GRU, to accomplish the training and validation necessary to achieve the same level of performance as the other classifiers. In contrast, the GNB classifier required the least amount of time to complete the training and testing process.
The above discussion shows that the GRU-based model identified anomalies more precisely than the other classifiers, although it had the second-highest computational cost. The reason is that human activity data are temporal: one activity depends on the activities that happened in the past, so information from past activities is needed to interpret the present activity. The working approach of the GRU model is consistent with this requirement of accessing information from the past during training and testing. That is why the GRU-based model outperformed the other models in terms of overall accuracy in identifying anomalies in human activities.
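A run-time comparison of this kind can be sketched by timing the training and prediction calls separately with a monotonic clock. The `MajorityClassifier` below is a hypothetical stand-in model with `fit` and `predict` methods, included only so that the example runs; it is not one of the ten classifiers evaluated here, and the measurement procedure behind Figure 6 may differ.

```python
import time
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in model that always predicts the most common label."""

    def fit(self, features, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.label_ for _ in features]


def timed(function, *args):
    """Call function(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


def measure_run_time(model, train_x, train_y, test_x):
    """Time training and prediction separately and report both plus the total."""
    _, train_seconds = timed(model.fit, train_x, train_y)
    predictions, predict_seconds = timed(model.predict, test_x)
    return predictions, {
        "train_s": train_seconds,
        "predict_s": predict_seconds,
        "total_s": train_seconds + predict_seconds,
    }
```

When the model runs on a GPU, kernels are typically launched asynchronously, so the device should be synchronized before the clock is read; otherwise the measured time may understate the actual cost.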

Conclusions and Future Directions
In our study, we utilized various machine learning and deep learning models to identify anomalies in human activities. Since the ARAS multi-resident activity dataset only includes normal activities, we generated 50,000 anomalies by considering all possible unusual sensor combinations. We took into account that the activities of one resident directly influence the activities of the other resident and, therefore, used one resident's activity as an input when identifying anomalies in the other resident's activities. To evaluate the performance of the applied classifiers, we examined the effects of five-fold cross-validation and different splitting ratios. Our findings revealed that the Gated Recurrent Unit (GRU) provided the highest performance, followed by the random forest on House A and the decision tree (DT) on House B, while Gaussian naïve Bayes (GNB) yielded the lowest performance. We also found that the splitting ratios of 80:20, 70:30, and 60:40 did not significantly impact performance, except for the k-nearest neighbor (KNN) and GNB classifiers. Additionally, we investigated the computational costs of all applied classifiers from training to prediction. Our findings showed that the Long Short-term Memory (LSTM)-based model took the longest time, followed by the GRU, while the GNB classifier took the least. The study provides a comprehensive guide for researchers, covering both performance and computational costs, for comparing machine learning and deep learning classifiers for anomaly detection in human activities.
Our findings can provide valuable insights for future research on anomaly detection in human activities. Our analysis revealed performance differences among classifiers, indicating that future studies should focus on optimizing models or exploring alternative architectures to address these limitations. For certain uses, the decision tree (DT) and Gaussian naïve Bayes (GNB) classifiers can be improved, or new models can be devised that perform better than the Gated Recurrent Unit (GRU). Our paper also investigates how data splitting ratios affect classifiers such as k-nearest neighbor (KNN) and GNB, which could inform better data-partitioning strategies. Researchers can optimize model training and prediction based on the computational costs of the classifiers. Additionally, exploring ways to reduce the time complexity of the LSTM and GRU models could be beneficial. Further studies can examine the trade-offs between model performance and computational efficiency to help practitioners select the most suitable model for their requirements. Our future objectives include improving anomaly-generation methods, refining existing classifiers, exploring alternative architectures, and optimizing computational costs. These efforts will contribute to developing robust and efficient anomaly-detection systems for human activities.

Figure 2. The base architecture of the GRU is shown on the left side of the figure, and the LSTM is shown on the right side of the figure. In this figure, the input data at time step t are denoted as x_t, the hidden states are shown with h, and the cell states are indicated by C.

Figure 6. Comparative analysis of the run time of different classifiers from training to testing. The run time is computed in seconds.

Table 4. Comparative performance of the ten applied machine learning and deep learning classifiers on the House A data from the ARAS multi-resident dataset with 5-fold cross-validation. In the table, an asterisk (*) indicates the overall best performance, (**) indicates the second-best performance, and (+) indicates the worst performance.