Daily Living Activity Recognition In-The-Wild: Modeling and Inferring Activity-Aware Human Contexts

Advances in smart sensing and computing technologies provide a dynamic opportunity to develop intelligent systems for human activity monitoring and, thus, assisted living. Consequently, many researchers have put effort into implementing sensor-based activity recognition systems. However, recognizing people's natural behavior and physical activities in diverse contexts remains a challenging problem, because human physical activities are often disrupted by changes in the surroundings/environment. Therefore, in addition to physical activity recognition, it is also vital to model and infer the user's context information to better characterize human-environment interactions. This research paper proposes a new scheme for activity recognition in-the-wild, which entails modeling and identifying detailed human contexts (such as human activities, behavioral environments, and phone states) using portable accelerometer sensors. The proposed scheme offers a detailed, fine-grained representation of natural human activities with contexts, which is crucial for effectively modeling human-environment interactions in context-aware applications/systems. The proposed idea is validated through a series of experiments and achieves an average balanced accuracy of 89.43%, which demonstrates its effectiveness.


Introduction
Human beings are the most integral part of the environment and of the ecological units that make up the urban landscape. Generally, human behavior reflects the surroundings, and varying environments can adversely affect human psychology and, thus, behavior. It is crucial to understand how human beings, as the occupants of an environment, react and acclimate to their surroundings. Naturally, human behavior and activity patterns are chaotic and inconsistent, primarily affected by the variability of environments and contexts. Diverse human contexts may lead a person to behave irrationally, giving rise to abnormal user behavior, which may in turn disrupt human physical activity patterns. Therefore, it is crucial to efficiently model and learn human physical activities and interactions in varying contexts to enable context-dependent systems and applications. Recent advancements in sensing and networking technologies, such as the Internet of Things (IoT), have provided a ubiquitous platform to develop intelligent systems for context-aware human-centric computing. The accessibility of real-time data through ubiquitous devices (such as smartphones and smartwatches) has resulted in the proliferation of research work in the field of sensor-based activity recognition (AR) [1][2][3]. The goal of AR is to provide a suitable analysis of human activities from the data acquired from wide-ranging sensors, including video cameras and depth sensors. At the same time, the behavioral context or environment may lead to ample variations in human physical activity patterns. Likewise, the phone context (i.e., the phone position on the human body) during a particular activity execution in-the-wild is also affected by the diversity of human behavioral environments. Changes in phone position significantly alter the activity patterns recorded by the device-embedded inertial sensors, which are sensitive to phone orientation and placement.
These vital differences in the activity patterns (occurring in response to changes in human contexts or phone positions) can be effectively modeled with a supervised machine learning approach to infer the detailed user contexts associated with different activities. Accordingly, the proposed ARW model works on a two-stage supervised classification approach. The first stage identifies the primary physical activities of daily living (PADLs) based on smartphone and smartwatch accelerometers. The second stage infers knowledge about activity-related contexts based on the activity recognized in the first stage, thus providing the notion of activity-aware context recognition. The second stage further consists of two building blocks, i.e., behavioral context recognition (BCR) and phone context recognition (PCR), which independently learn and recognize the specified set of contexts using accelerometer sensors. The outputs from both stages are finally aggregated to form a triplet of information, i.e., {primary physical activity, behavioral context, phone context}. In this manner, the proposed ARW scheme offers a multi-label and fine-grained, context-aware representation of human daily living activities in-the-wild. The coinciding recognition of the participant's physical activity, behavioral/social context, and phone position is essential for modeling human behavior and cognition in their living environments [40]. Inferring phone positions along with the activity can be effective for inferring the habitual behavior of a person in different contexts. Thus, the proposed scheme can be extended to detect and recognize normal/abnormal human behavior, which can further be advantageous in predicting/avoiding health-related risks, as discussed in the existing studies [41][42][43]. In addition, the proposed scheme can serve as a building block for recommender systems and context-aware computing applications.
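The two-stage inference described above can be sketched as follows. This is an illustrative outline rather than the paper's implementation: the function name, model dictionaries, and label strings are our own. The fixed fallbacks for activities without per-activity context models reflect the single context pair that running and bicycling have in the dataset.

```python
def recognize_in_the_wild(features, ppar_model, bcr_models, pcr_models):
    """Two-stage ARW inference: recognize the activity, then the activity-aware contexts."""
    activity = ppar_model(features)  # stage 1: PPAR
    # stage 2: dispatch to the context models trained for the recognized activity;
    # running/bicycling carry a single known context pair, so no model is needed
    behavioral = bcr_models[activity](features) if activity in bcr_models else "exercise"
    phone = pcr_models[activity](features) if activity in pcr_models else "phone in pocket"
    return (activity, behavioral, phone)  # {activity, behavioral context, phone context}
```

In practice, `ppar_model` and the per-activity entries of `bcr_models`/`pcr_models` would be the trained classifiers; plain callables are used here to keep the dispatch logic visible.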
The experiments for the proposed scheme are conducted using the public-domain "ExtraSensory" dataset [38], which involves daily living human activities and the associated contexts in-the-wild. For the AR task in the proposed scheme, six (06) PADLs, including sitting, walking, lying, standing, running, and bicycling, are chosen from the dataset, whereas for context recognition, the fourteen (14) most frequent context labels are selected for identification purposes. These labels provide information regarding the phone positions (such as phone on table or phone in bag/hand/pocket) and the participants' environmental/behavioral aspects (such as the participant's location, social context, and secondary activity) during the primary activity execution in-the-wild. Figure 1 shows how different contexts are associated with the selected PADLs for enabling ARW. The relationship between particular PADLs and the corresponding contexts is established by systematically analyzing the co-occurrences of different activity and context pairs available in the "ExtraSensory" dataset. A boosted decision tree (BDT) is used for evaluating the performance of the proposed ARW model, and the obtained results are compared against those obtained with a neural network (NN) classifier.
This research paper provides the following significant contributions.
• A two-stage model is proposed for ARW, which first identifies the primary physical activity and then uses this label to infer activity-related context information, thus providing a detailed activity representation in-the-wild.
• A methodical approach is conceived and followed to analyze the co-occurrences of different activity-context pairs in the "ExtraSensory" dataset. As a result, a set of the ten (10) most frequent human behavioral contexts and four (04) phone contexts/positions are incorporated with six (06) primary PADLs for ARW. The approach used to analyze and select activity-context pairs for the proposed ARW scheme is reproducible and can be applied to any multi-label dataset.
• An in-depth exploration of the proposed ARW scheme is conducted for feature selection, model selection (i.e., classifier selection), and classifier hyperparameter optimization to attain state-of-the-art recognition performance. Finally, based on the best-case experimental observations and parameters, the performance of the boosted decision tree and neural network classifiers is evaluated in detail for the proposed scheme using smartphone and watch accelerometers.

The remainder of this paper is arranged as follows. Section 2 presents the related works for the proposed scheme. Section 3 provides a stepwise explanation of the proposed methodology. Section 4 investigates and discusses the experimental results for the proposed ARW method in detail. Finally, Section 5 summarizes the research outcomes and provides future recommendations for the proposed scheme.

Related Works
The upsurge in smart systems with evolving sensing capabilities has made sensor-based AR a significant area of interest for researchers in the field of pervasive computing. Considering this, numerous schemes have been proposed for sensor-based AR, which can be classified as ambient, wearable, smartphone-based, and heterogeneous sensor-based AR approaches. Ambient AR systems aim to collect and process continuous data from various sensors installed in the environment for ambient assisted living (AAL). Numerous research studies have used ambient sensors to recognize PADLs and home tasks [11,12,[44][45][46]. Vanus et al. [13] performed the fusion of gas (i.e., carbon dioxide) and audio sensors with humidity and temperature sensors to detect the presence of a person in a smart room. The authors employed a neural network for human detection and achieved an accuracy greater than 95%. Ni et al. [47] proposed an ontology-based method for smart home activity monitoring, utilizing a three-layered approach for context-aware activity modeling. Ghayvat et al. [48] proposed an anomaly prediction model for detecting abnormal activity patterns of elderly people in a smart home environment. Likewise, Muheidat et al. [49] proposed a real-time fall detection scheme based on walking activity pattern monitoring using a sensor pad installed under a carpet. The primary advantages of ambient AR systems are their high accuracy and reliability. However, installing and setting up ambient sensors is complex and expensive, and the sensors are restricted to a particular monitoring area. Hence, such systems cannot monitor natural human activities and behaviors in diverse contexts.
Wearable AR systems entail on-body sensors for recording and monitoring the participant's data. They are advantageous owing to their portability; thus, they can be taken anywhere for continuous activity monitoring, including indoor and outdoor environments. The authors in [50][51][52][53] utilized Inertial Measurement Units (IMUs) and other wearable sensors for activity monitoring. In [54], the authors proposed a probabilistic method using a Bayesian formulation to recognize transition activities, such as stand-to-sit and sit-to-stand, using wearable sensors. They achieved 100% recognition accuracy for two activities. Mehrang et al. [55] utilized random forests for recognizing a number of daily living activities (including household activities) using wearable sensor data from a wrist-mounted accelerometer. In addition, they also used an optical heart rate sensor for the AR task and achieved an accuracy of 89.6 ± 3.9%. The research work in [56] presented "HuMAn", a wearable AR system for the classification of 21 indoor human activities. The authors recorded data from ten subjects in a home environment using wearable sensors and extracted statistical signal attributes to train their proposed AR system. They used a conditional random field (CRF) classifier for the AR task and acquired a best average accuracy of around 95%. Anwary et al. [57] utilized wearable sensors (i.e., an accelerometer and a gyroscope) for monitoring and detecting abnormalities in the gait patterns of the participants. Moreover, wearable sensors have also been utilized for detecting and preventing abrupt human actions, such as falls [20]. The authors in [58] proposed a deep learning model for AR in the mountains using an accelerometer sensor. Nevertheless, wearable sensors often turn out to be a source of disturbance for the subjects during activity execution, which hinders effective AR performance.
The increasing development of smartphone sensing technologies has offered a ubiquitous platform for sensor-based AR. Consequently, smartphone-based AR systems have been proposed in numerous research studies. The studies in [59][60][61][62][63][64] proposed smartphone-based position-dependent and position-independent AR systems. Moreover, position-aware AR systems [34,65,66] have also been proposed, which employ a two-level or multi-level classification approach to identify a physical activity based on phone position recognition. In [67], Esfahani and Malazi presented "PAMS", a position-aware multi-sensor dataset for the AR task, where they achieved an average precision of approximately 88% in recognizing the everyday physical activities in the dataset. Smartphones have also been employed for crowdsourcing and context recognition (such as indoor vs. outdoor or moving vs. stationary) [35,[68][69][70][71]. However, smartphone-based AR systems are not sufficiently capable of detecting or recognizing activities involving hand gestures and arm movements. As a result, heterogeneous sensor systems, which combine multimodal sensors (such as smartphones and wearable sensors) to improve AR performance [29][30][31][72][73], have been used for AR tasks; this is also the case for our proposed scheme. With the evolution of deep learning algorithms in recent years, some authors have made use of these algorithms for the automatic extraction of high-level features from sensor data to achieve promising AR results [74][75][76][77][78]. The survey works in [79,80] investigated the latest trends in sensor-based AR studies based on deep learning models and explained their pros and cons along with future recommendations/implications.
The high computational complexity of deep learning algorithms is a crucial challenge in sensor-based AR studies, as it makes them ineffective for on-device processing on battery-constrained devices, e.g., smartphones and smartwatches. Hence, there is a need to develop schemes that are computationally efficient and can recognize natural user behavior in varying contexts with high accuracy, which is the main aim of the proposed ARW scheme.

Proposed Methodology

Figure 2 provides a block diagram of the proposed methodology for ARW, which entails a two-stage classification model consisting of four crucial steps: (1) data acquisition and preprocessing, (2) feature extraction and feature selection, (3) primary physical activity recognition, and (4) activity-aware context recognition. The subsequent sections present the necessary details for each step of the proposed ARW method.

Data Acquisition and Preprocessing
For the implementation and testing of any AR model, the first step is to acquire data concerning the activities of interest. Many researchers have devoted their efforts to collecting sensor-based datasets for AR, which entail data from different sensing modalities, including on-body wearable sensors, smartphone-embedded sensors, and multimodal heterogeneous sensors [81][82][83][84][85]. Generally, these datasets have been recorded in constrained environments following a specific set of protocols for executing scripted tasks. Therefore, they lack natural user behavior and any information regarding the participant's context. The "ExtraSensory" dataset [38], presented by Vaizman and Ellis, contains in-the-wild human activity data from 60 subjects. Smartphone- and smartwatch-based heterogeneous sensors were used to record natural user behavior regarding six (06) primary PADLs in diverse contexts. As the proposed scheme focuses on recognizing daily living human activities and their context details in-the-wild, the "ExtraSensory" dataset fits well into the proposed pipeline. As a result, we opted to utilize this dataset for the implementation and validation of the proposed ARW model. For the computational efficacy of the proposed method, only the smartphone and smartwatch accelerometer data (collected with sampling rates of 40 Hz and 25 Hz, respectively) are used for ARW. The existing AR studies [61,73] validate the efficient recognition performance of these sensors compared with other inertial sensors, such as the gyroscope or the magnetometer.

Activity-Context Pairs for ARW: Systematic Analysis and Selection
The "ExtraSensory" dataset contains multiple secondary labels for each activity instance, which demonstrate detail regarding the participant's context (for example, secondary activity, location, social and/or behavioral context, and phone state/position) during the primary activity execution. However, the context labels for each activity instance are not consistent as the data collection is conducted in-the-wild. To implement ARW, we systematically analyzed the PADLs and corresponding context labels to find out the most frequent activity-context pairs in the "ExtraSensory" dataset. In this regard, for all participants' data, we counted the frequency of different context labels (including human behavioral contexts and phone positions) that occur in a pair with each of the six selected daily living activities. In the end, we selected ten (10) and four (04) different human behavioral contexts and phone positions for context recognition, respectively, which had maximum frequencies of co-occurrence with the primary PADLs. Further, we tended to discard the activity instances having secondary labels with very few instances (i.e., less than 100), as they are not sufficient to be trained and tested for context recognition. Neglecting these instances has no adverse effect on the overall system training, due to the remaining instances still being very huge in number, i.e., 51,001.
Algorithm 1 shows the steps followed in extracting the activity-context pairs and their frequencies in the "ExtraSensory" dataset. These steps are reproducible and can be adopted for any multi-label dataset. Table 1 presents the list of primary PADLs and activity-context pairs along with their frequencies, which are finally chosen to validate the proposed ARW method. Two activities, including bicycling and running, are linked to only one behavioral context (i.e., exercise) and phone position (i.e., phone in pocket) as no more context labels exist with these activities in the "ExtraSensory" dataset. Likewise, for lying activity, only two phone positions (i.e., phone in hand and phone on table) are available, which are used in further analysis.

Algorithm 1 (inputs and outputs):
Input: userID
Output: prActCtxLabels and freqPrActCtxLabels
% prActCtxLabels and freqPrActCtxLabels hold the labels and counts for all activity-context pairs per user, respectively.
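The counting step of Algorithm 1 can be sketched in a few lines. This is an illustrative reimplementation under our own names (`activity_context_pairs`, `instances`), assuming each instance carries one primary activity label together with its set of secondary context labels:

```python
from collections import Counter

def activity_context_pairs(instances, min_count=100):
    """Count activity-context co-occurrences and drop rare pairs (Algorithm 1 sketch)."""
    pair_counts = Counter()
    for activity, context_labels in instances:
        for ctx in context_labels:
            pair_counts[(activity, ctx)] += 1
    # pairs with fewer than min_count instances are too scarce to train/test reliably
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}
```

Running this per user and merging the counters would reproduce the per-user outputs described above; the 100-instance threshold matches the filtering rule stated in the previous section.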

Signal De-Noising and Segmentation
The raw signals acquired from the accelerometer sensor of the smartphone/smartwatch are exposed to unwanted noise, for example, equipment noise or the noise produced by the subject's unconscious movements. It is vital to de-noise the acquired signals before any further processing and computation. Many signal de-noising techniques have been used in the AR literature, including time-domain and frequency-domain filtering methods. In this study, we employed a time-domain averaging filter (of size 1 × 3) for signal de-noising, which is computationally cheap and capable of eliminating sudden noise, such as spikes, from the acquired signals. The "ExtraSensory" dataset entails activity instances that are pre-segmented and labeled based on a 20-s time window with mutually exclusive samples. Generally, a fixed-size window of 2 s to 5 s is considered sufficient for simple AR, while complex AR deals with a larger window size, with a duration of 15 s to 30 s or more [28,30,86]. The proposed scheme aims to recognize natural PADLs and in-the-wild activity-aware contexts, thus giving rise to complex AR. Hence, in accordance with the "ExtraSensory" dataset, a segmentation window of 20 s is used for feature extraction and classification in the proposed scheme.
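A minimal sketch of the de-noising and segmentation steps, assuming NumPy and the dataset's 40 Hz phone-accelerometer rate (the function names are ours):

```python
import numpy as np

def denoise(axis_signal):
    """Apply the 1 x 3 time-domain averaging filter to one accelerometer axis."""
    kernel = np.ones(3) / 3.0
    return np.convolve(axis_signal, kernel, mode="same")

def segment(axis_signal, fs=40, window_s=20):
    """Split a signal into mutually exclusive 20 s windows, as in "ExtraSensory"."""
    n = fs * window_s  # samples per window (800 at 40 Hz)
    return [axis_signal[i:i + n] for i in range(0, len(axis_signal) - n + 1, n)]
```

The averaging kernel spreads an isolated spike over its neighbors, which is exactly the kind of sudden noise the paper targets; for the 25 Hz watch accelerometer, `fs=25` would be used instead.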

Feature Extraction
After signal de-noising, features are extracted from the segmented data for further processing. Features are summarized representations of the essential signal attributes, which are fed as input into machine learning algorithms to classify a given chunk of data into one of the selected classes. Based on the existing AR studies [30,[87][88][89], the proposed ARW model involves the extraction of twenty (20) time-domain features corresponding to each segmented data chunk. The extracted features include entropy, maximum signal amplitude, minimum amplitude, mean value, standard deviation of the signal, skewness, kurtosis, peak-to-peak value, peak-to-peak-time, median of the signal, maximum latency, minimum latency, latency-amplitude ratio, energy, signal variance, third moment of the signal, fourth moment of the signal, signal peak-to-peak slope, mean of first difference, and mean of second difference. The features are extracted for 3D data from the phone and watch accelerometer, thus resulting in a feature vector of size 1 × 60 per sensor.
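To illustrate, a subset of the listed time-domain features can be computed per axis and concatenated across the three axes; with all twenty features this yields the 1 × 60 vector per sensor. The helper names are ours, and only ten of the twenty features are shown:

```python
import numpy as np

def axis_features(x):
    """Ten of the paper's twenty time-domain features for a single axis (illustrative)."""
    d1 = np.diff(x)
    return [
        x.max(), x.min(), x.mean(), x.std(), np.median(x),
        x.max() - x.min(),        # peak-to-peak value
        np.sum(x ** 2),           # energy
        x.var(),                  # signal variance
        d1.mean(),                # mean of first difference
        np.diff(d1).mean(),       # mean of second difference
    ]

def feature_vector(window_xyz):
    """Concatenate per-axis features for one 3D accelerometer window."""
    return np.concatenate([axis_features(window_xyz[:, k]) for k in range(3)])
```

Extending `axis_features` to the full list (entropy, skewness, kurtosis, latencies, moments, and so on) gives 20 features × 3 axes = 60 values per sensor, matching the stated vector size.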
Feature extraction is followed by feature selection to choose the most discriminating features from the whole set of extracted features. In this regard, we used a filter-based approach for supervised feature selection, known as "Correlation-based Feature Subset Selection" (CfsSubetSel) [90]. This approach favors subsets of features that are highly predictive of the class while having low redundancy (i.e., low inter-correlation) among themselves. After applying CfsSubetSel, the final subset of obtained features is used for classification in the next stage.
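CfsSubetSel is typically run via Weka; as a rough, illustrative stand-in, the following greedy filter scores candidate subsets with the CFS-style merit (feature-class correlation in the numerator, feature-feature redundancy in the denominator). The function name, the fixed subset size, and the use of Pearson correlation are our simplifications, not the paper's exact procedure:

```python
import numpy as np

def cfs_like_selection(X, y, max_features=10):
    """Greedy correlation-based filter in the spirit of CFS (illustrative sketch)."""
    n = X.shape[1]
    # relevance: absolute correlation of each feature with the class labels
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)])
    selected = [int(np.argmax(rel))]
    while len(selected) < max_features:
        best, best_merit = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            subset = selected + [j]
            k = len(subset)
            r_cf = rel[subset].mean()  # mean feature-class correlation
            r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                            for a in subset for b in subset if a < b])
            merit = k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
            if merit > best_merit:
                best, best_merit = j, merit
        selected.append(best)
    return sorted(selected)
```

A faithful CFS implementation would instead stop when the merit no longer improves, which is how the method settles on the 29 and 28 features reported later for the two sensors.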

Primary Physical Activity Recognition
As discussed earlier, the proposed ARW model is based on a two-stage classification approach, where the first stage involves primary physical activity recognition (PPAR), i.e., the classification of six (06) primary PADLs in-the-wild. These activities include lying, sitting, walking, standing, running, and bicycling. Two machine learning classification algorithms, i.e., BDT and NN, are utilized for PPAR in a supervised manner.
A BDT [91] is an ensemble classifier that utilizes a combination of multiple decision trees (instead of a single decision tree) to boost the output prediction performance. The main objective of the BDT algorithm is to sequentially combine a group of weak learners to create a strong learner. Each subsequent tree corrects the errors of the preceding trees, and the final prediction is made based on the entire set of trees. In general, once aptly configured, a BDT is one of the easiest methods for obtaining top recognition performance on wide-ranging machine learning tasks.
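A hedged sketch of this boosting setup using scikit-learn's `GradientBoostingClassifier` (a MART-style gradient-boosted tree ensemble) on synthetic stand-in data; the data and hyperparameter values here are illustrative, not the paper's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 12))            # stand-in for the selected feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic activity labels

# Sequential boosting: each tree fits the loss gradient of the ensemble so far
bdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
scores = cross_val_score(bdt, X, y, cv=5)  # five-fold CV, as used in the paper
```

Each of the 100 shallow trees is a weak learner; lowering `learning_rate` while raising `n_estimators` trades training time for smoother error correction.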
An NN [92] entails a set of interconnected layers, where the input layer is connected to the output layer through feed-forward connections based on an acyclic graph consisting of weighted edges and nodes (i.e., neurons). A number of hidden layers can be inserted between the input and output layers; however, one hidden layer is usually sufficient for most predictive tasks. Each node in a layer is connected to all the nodes in the subsequent layer via weighted edges. Each node in the hidden layers participates in generating the output of the network based on a non-linear activation function. The whole process is loosely inspired by the learning mechanisms of the human brain.
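A sketch of such a network with scikit-learn's `MLPClassifier`, using the sizing rule described later in the paper (hidden size equal to the average of the input and output sizes). The input/output sizes shown are the PPAR example values, and note that `MLPClassifier` selects its own output function (softmax for multiclass):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

n_inputs, n_classes = 29, 6            # e.g., phone-accelerometer features, six PADLs
hidden = (n_inputs + n_classes) // 2   # average of input and output layer sizes

# One fully connected hidden layer with a sigmoid (logistic) activation
nn = MLPClassifier(hidden_layer_sizes=(hidden,), activation="logistic",
                   max_iter=500, random_state=0)
```

With these sizes the hidden layer gets 17 nodes; the same rule applied to a BCR or PCR network simply swaps in that experiment's feature count and class count.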

Activity-Aware Context Recognition
Human physical activity patterns alter with changes in the behavioral environment. These variations in the physical activity patterns can be monitored and tracked using the 3D accelerometer data from a smartphone/smartwatch to learn and identify detailed activity contexts, such as human behavioral contexts and phone contexts. Accordingly, the second stage of the proposed ARW model entails activity-aware context recognition (AACR). The primary objective of AACR is to model and recognize the varying patterns of primary PADLs in diverse contexts to infer details about human behavioral contexts and phone positions in-the-wild. In this manner, the proposed ARW scheme enables activity-related contexts to be inferred based on activity pattern identification. The AACR module comprises two central units, BCR and PCR, which independently infer the human behavioral context and phone context (i.e., phone position) labels, respectively, based on the activity recognized in the first stage (i.e., PPAR). In this aspect, for each selected primary activity, BDT and NN classifiers are trained to identify the relevant activity contexts (as given in Table 1). These classifiers are fed with the physical activity label (recognized in the first stage) and the final feature vector to train the proposed ARW system for context recognition. Both the smartphone and smartwatch accelerometers are used for BCR, while only the smartphone accelerometer is employed for PCR. Overall, for each classifier, four (04) different models are trained for BCR and PCR, corresponding to the four PADLs lying, sitting, walking, and standing. The running and bicycling activities, which each involve only one behavioral context and phone position, are excluded when training the proposed ARW model for AACR.
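The per-activity training of a BCR or PCR unit can be sketched as below; the function and variable names are ours, and a plain decision tree stands in for the tuned BDT/NN classifiers:

```python
from sklearn.tree import DecisionTreeClassifier

CONTEXT_ACTIVITIES = {"lying", "sitting", "walking", "standing"}

def train_aacr_unit(features, activities, context_labels, make_clf=DecisionTreeClassifier):
    """Train one context classifier per primary activity (a BCR or PCR unit)."""
    models = {}
    for act in CONTEXT_ACTIVITIES:
        idx = [i for i, a in enumerate(activities) if a == act]
        if not idx:
            continue  # running/bicycling never reach here: one fixed context each
        X = [features[i] for i in idx]
        y = [context_labels[i] for i in idx]
        models[act] = make_clf().fit(X, y)
    return models
```

Calling this once with behavioral-context labels and once with phone-position labels yields the four BCR and four PCR models per classifier described above; at inference time, the activity recognized by PPAR selects which model to query.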
In the end, the outputs from the BCR and PCR units are aggregated with the output of the first stage (i.e., PPAR) to provide a detailed, in-the-wild representation of daily living human activities. In this way, the proposed ARW scheme can differentiate a large number of context-aware, fine-grained activities produced by different combinations of primary PADLs, human behavioral contexts, and phone positions.

Experimental Results, Performance Analysis, and Discussions
This section discusses the methods of validation and analysis used for assessing the performance of the proposed scheme. In addition, it evaluates and discusses the achieved experimental results in detail, as given in the following sections. The proposed ARW scheme is implemented and validated using the Microsoft Azure machine learning tool [93]. The "AutoML" package of Microsoft Azure is used for model selection based on the "ExtraSensory" dataset, where a set of standard machine learning classifiers (including BDT, NN, k-nearest neighbours (K-NN), naïve Bayes (NB), and support vector machine (SVM)) is investigated for the proposed ARW scheme. In this aspect, the finally selected feature set (obtained using CfsSubetSel) is fed as input into the machine learning classifiers to assess their performance in the PPAR, BCR, and PCR experiments. Table 2 provides the list of finally selected features from each sensor, which are used for experimentation purposes. These features are simply concatenated in the case of sensor fusion.

Notes to Table 2: 1 Feature codes (partially recovered): … kurtosis; F7: skewness; F8: peak-to-peak value; F9: peak-to-peak time; F10: signal median; F11: maximum latency; F12: minimum latency; F13: latency-amplitude ratio; F14: energy; F15: signal variance; F16: 3rd moment of the signal; F17: 4th moment of the signal; F18: signal peak-to-peak slope; F19: mean of 1st difference of the signal; F20: mean of 2nd difference of the signal. 2 A x , A y , and A z represent the x-, y-, and z-axes of the smartphone accelerometer, whereas W x , W y , and W z represent those of the watch accelerometer, respectively. In the case of sensor fusion for PPAR and BCR, the finally selected features from each sensor are combined accordingly.
Following model selection, the BDT is chosen as the primary classifier for the proposed method. Additionally, the NN classifier, which has been successfully adopted in numerous sensor-based AR studies [60,87,94], is employed to assess its performance for the proposed scheme in comparison with the BDT. A one-vs.-all (OVA) classification approach is used for both classifiers, which utilizes an ensemble of C binary classifiers to solve a multiclass problem with C classes. Existing research has demonstrated the effectiveness of the OVA approach for multiclass classification, provided that the underlying binary classifiers are fine-tuned [95]. Following this, a random parameter sweep is performed on the data using five-fold cross-validation to learn the optimal hyperparameters of the selected classifiers for the different recognition experiments. The maximum number of runs for the parameter sweep is set to 10. Finally, the best-tuned model hyperparameters, providing the best performance for the proposed scheme, are chosen for all recognition experiments. Table 3 presents the optimal hyperparameter values obtained for the selected classifiers in the PPAR, BCR, and PCR experiments. In the case of the BDT, multiple additive regression trees (MART) [91] is used as the decision tree algorithm, with gradient descent used for error estimation. A fully connected (FC) hidden layer is used for the NN classifier, with a sigmoid function as the output function. The number of nodes in the hidden layer is set to the average of the input and output layer sizes. The size of the input layer for the different experiments equals the number of input features, given in Table 2. The output size represents the number of classes for each NN, which is six (06) and four (04) for PPAR and PCR, respectively.
In the case of BCR, the size of the output layer is equal to the number of contexts corresponding to the lying, sitting, walking, and standing activities. To evaluate the classification performance, an m-fold cross-validation method (with m = 5) is utilized. This validation scheme trains a model on multiple splits and uses all the data for training and testing across different iterations, thus ensuring a fair performance estimate.
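The random parameter sweep described above can be approximated with scikit-learn's `RandomizedSearchCV` (10 sampled configurations, five-fold cross-validation); the parameter ranges and data here are illustrative, not the values reported in Table 3:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # stand-in feature vectors
y = (X[:, 0] > 0).astype(int)     # stand-in labels

param_space = {                   # illustrative ranges, not the paper's exact sweep
    "n_estimators": [20, 50, 100],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
sweep = RandomizedSearchCV(GradientBoostingClassifier(), param_space,
                           n_iter=10, cv=5, random_state=0)
sweep.fit(X, y)
best = sweep.best_params_         # best-tuned hyperparameters over the 10 runs
```

Each of the 10 runs draws one configuration from `param_space` and scores it with five-fold cross-validation, mirroring the sweep budget used in the paper.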

Performance Evaluation Metrics for Classification
The performance of the BDT and NN classifiers is assessed independently for the PPAR, BCR, and PCR experiments based on accuracy, precision, sensitivity, F1-score, balanced accuracy, and log loss. The mean of the true positive rate (i.e., sensitivity) and the true negative rate (i.e., specificity) is termed the balanced accuracy (BALACC). It is the most crucial measure for assessing the classification performance of a system that entails imbalanced class data [38]. In addition, micro-averaged and macro-averaged metrics (i.e., micro-F1 and macro-F1) are computed for average performance comparison, where the micro-precision and micro-sensitivity values are equal to the micro-F1 scores. To estimate the classification error, the log loss (logarithmic loss) is used, which assesses the uncertainty of a model by comparing its output probabilities with the ground truths. It expresses the penalty for misclassifications and is measured as the difference between two probability distributions, i.e., the true one and the one produced by the proposed model.
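These metrics can be computed with scikit-learn as below. Note that `balanced_accuracy_score` generalizes the sensitivity/specificity mean to the macro-average of per-class recall for multiclass problems; the toy labels and probabilities are purely illustrative:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, log_loss

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

bal = balanced_accuracy_score(y_true, y_pred)         # mean per-class recall
micro_f1 = f1_score(y_true, y_pred, average="micro")  # equals accuracy here
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted class mean

# log loss compares predicted class probabilities with the ground truths
probs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.4, 0.5, 0.1],
         [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
ll = log_loss(y_true, probs)
```

On this toy example the per-class recalls are 2/3, 1, and 1, giving BALACC = 8/9, while micro-F1 collapses to the plain accuracy of 5/6, illustrating why BALACC is the more informative figure on imbalanced data.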

Performance Analysis of Primary Physical Activity Recognition (PPAR)
The first stage of the proposed ARW model incorporates PPAR, where six (06) in-the-wild PADLs are classified based on smartphone and watch accelerometer data using the BDT and NN classifiers. For this purpose, twenty time-domain features are extracted from each sensor channel, which are further subjected to feature selection and reduction using the CfsSubetSel method. As a result, sets of twenty-nine (29) and twenty-eight (28) features (as shown in Table 2) are obtained for the phone and watch accelerometers, respectively, and used for classifier training and testing based on a five-fold cross-validation scheme. Table 4 provides the average numerical results for PPAR, where the BDT classifier achieves the best results in classifying the six selected PADLs. Using the phone and watch accelerometer alone, the BDT achieves macro-F1 scores of 82.9% and 75.2%, respectively, for PPAR, which are 20.1% and 19.9% greater than the corresponding scores attained with the NN classifier. Similarly, the micro-F1 scores are also improved for the BDT classifier. These results also show that the phone accelerometer performs better than the watch accelerometer. With sensor fusion, the macro-F1 score improves by 6.4% and 14.1% compared with the individual phone and watch sensors, respectively. Likewise, in the case of the NN classifier, the macro-F1 value increases to 71.8% with sensor fusion, which is still 17.5% less than the best-case value obtained with the BDT. The best error rate (i.e., an average log loss of 0.787) for PPAR is also obtained with the BDT classifier using the combination of both sensors. Likewise, the values of the other performance measures (i.e., precision and sensitivity) are also better for the BDT than for the NN classifier, with the best results attained with sensor fusion. Figure 3a compares the BALACC values obtained for PPAR using the BDT and NN classifiers.
It is evident from the figure that the NN also underperforms the BDT classifier in terms of BALACC. The best BALACC rate of 93.1% is achieved with the BDT classifier using sensor fusion, which is 10.8% higher than the best value (BALACC = 82.3%) achieved with the NN. Likewise, using the individual sensors for PPAR, the BALACC values achieved with the NN are worse than those obtained with the BDT classifier. These results validate the efficacy of the BDT classifier for recognizing primary PADLs in-the-wild. Furthermore, adding the watch accelerometer to the phone accelerometer yields the best accuracy rate for PPAR. Generally, natural user behavior involves diverse behavioral contexts and phone positions, which may adversely affect the activity patterns recorded by smartphone sensors. For example, when the phone is on a table, it becomes nearly impossible to recognize the participant's activity from the smartphone accelerometer alone. In such cases, the smartwatch accelerometer can help in learning and identifying the user's activities, as the watch is expected to be worn most of the time. To demonstrate the per-class recognition performance for the selected PADLs, Figure 3b displays the confusion matrix for the best-case PPAR performance (obtained with BDT using sensor fusion). The matrix rows and columns denote the ground truths and predicted outputs, respectively. The labels represent the codes for the six (06) primary PADLs as follows: A1: lying, A2: sitting, A3: walking, A4: standing, A5: running, and A6: bicycling. The confusion matrix shows that most of the PADLs are correctly classified with a high percentage. Notably, static activities (such as lying/sitting/standing) are correctly recognized at a rate of more than 90%.
The percentages of correctly classified samples for the walking, running, and bicycling activities are 78.2%, 84.3%, and 83%, respectively, which shows that identifying static activities in-the-wild is easier than identifying dynamic activities. This is due to the inconsistency of dynamic activity patterns across diverse behavioral contexts, which gives rise to misclassifications among different dynamic activities in-the-wild and, as a result, reduces their recognition accuracies.
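To make the feature-extraction step concrete, the following is a minimal sketch of per-channel time-domain feature computation over an accelerometer window; the specific features and window shape shown here are illustrative assumptions, not the paper's exact twenty-feature set (the selected features appear in Table 2):

```python
import numpy as np

def time_domain_features(window):
    """Compute a few illustrative time-domain features for one
    accelerometer channel (a 1-D window of samples)."""
    window = np.asarray(window, dtype=float)
    mean = float(np.mean(window))
    return {
        "mean": mean,
        "std": float(np.std(window)),
        "min": float(np.min(window)),
        "max": float(np.max(window)),
        "range": float(np.max(window) - np.min(window)),
        "rms": float(np.sqrt(np.mean(window ** 2))),        # root mean square
        "mad": float(np.mean(np.abs(window - mean))),       # mean absolute deviation
        "zero_crossings": int(np.sum(np.diff(np.sign(window)) != 0)),
    }

def window_feature_vector(xyz):
    """Concatenate per-axis features of a tri-axial window of shape (n, 3)."""
    xyz = np.asarray(xyz, dtype=float)
    feats = []
    for axis in range(xyz.shape[1]):
        feats.extend(time_domain_features(xyz[:, axis]).values())
    return np.asarray(feats)
```

Each tri-axial window thus yields a fixed-length vector by concatenating the per-axis features, which is the form consumed by the classifiers after feature selection.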

Performance Evaluation of Activity-Aware Context Recognition (AACR)
The proposed ARW scheme entails the recognition of activity-aware contexts in its second stage, where BDT and NN classifiers are trained explicitly for each PADL to infer the associated human behavioral contexts and phone contexts (i.e., phone positions). The second stage consists of two parallel units, i.e., BCR and PCR, which take as input the features extracted from the sensor data as well as the primary activity label (recognized in the first stage) to identify the corresponding behavioral contexts and phone positions independently of each other. As the context labels differ across primary activities, it is vital first to recognize the primary activity and then infer the context information conditioned on it. BCR is performed based on the data from the smartphone and smartwatch accelerometers, whereas PCR utilizes only the smartphone accelerometer. In this regard, the final subset of features selected for each accelerometer (as shown in Table 2) is used to train and test the chosen classifiers for BCR and PCR using a five-fold cross-validation method. The results are evaluated separately for BCR and PCR based on four different PADLs, while the running and bicycling activities are excluded, as each involves only one behavioral context and phone position and thus requires no classification. The following sections discuss the individual experimental results achieved for BCR and PCR.
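The two-stage routing described above can be sketched as follows; the classifier objects, labels, and the `default` fallback for single-context activities (running/bicycling) are illustrative stand-ins, not the actual trained BDT/NN models:

```python
class MajorityClassifier:
    """Toy stand-in for a trained classifier (e.g., a boosted decision tree):
    any object with a .predict(features) method would fit this role."""
    def __init__(self, label):
        self.label = label
    def predict(self, features):
        return self.label

def recognize(features, primary_clf, context_clfs, phone_clfs):
    """Stage 1: classify the primary activity.
    Stage 2: route to activity-specific BCR and PCR classifiers."""
    activity = primary_clf.predict(features)
    # Activities with a single possible context (e.g., running/bicycling)
    # have no second-stage classifier and keep a fixed fallback label.
    context = (context_clfs[activity].predict(features)
               if activity in context_clfs else "default")
    phone = (phone_clfs[activity].predict(features)
             if activity in phone_clfs else "default")
    return activity, context, phone

primary = MajorityClassifier("walking")
contexts = {"walking": MajorityClassifier("shopping")}
phones = {"walking": MajorityClassifier("pocket")}
print(recognize([0.1, 0.2], primary, contexts, phones))
# -> ('walking', 'shopping', 'pocket')
```

The design point this illustrates is that the second-stage classifiers are conditioned on the first-stage label, so each one only has to discriminate among the contexts valid for that activity.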

Behavioral Context Recognition (BCR) Results and Investigation
The average numerical results obtained for BCR, based on four different PADLs, are presented in Table 5. These results show that the BDT classifier performs significantly better than the NN classifier in recognizing activity-aware behavioral contexts. Furthermore, the phone accelerometer recognizes most of the human behavioral contexts associated with the four physical activities better than the watch accelerometer. For BCR based on the sitting, walking, and standing activities, the BDT classifier achieves macro-F1 scores of 97.0%, 74.1%, and 98.6%, respectively, using the phone accelerometer. These scores are 3.5%, 20.6%, and 1.5% better than the corresponding scores obtained with the watch accelerometer. In contrast, for BCR based on the lying activity, the macro-F1 score obtained using the watch accelerometer (i.e., 80.3%) is 4.8% higher than that of the phone accelerometer. The same trend is observed for the average micro-F1 scores. However, the fusion of both sensors provides the best-case BCR performance, where the best average macro-F1 scores of 86.8%, 97.8%, 76.4%, and 98.8% are achieved for BCR based on the lying, sitting, walking, and standing activities, respectively, using the BDT classifier. These values are 19.4%, 6.5%, 22.3%, and 2.8% higher than the corresponding macro-F1 scores obtained with the NN classifier using sensor fusion. The average values of accuracy, precision, sensitivity, and log loss are also better for the BDT than for the NN, which demonstrates the superiority of the BDT classifier over the NN classifier in the BCR experiments. Figure 4 compares the performance of the different sensors for activity-aware BCR in terms of BALACC.
The average BALACC rates achieved for BCR using the smartphone accelerometer with the BDT classifier are 71.9%, 92.9%, 78.2%, and 82.3% based on the lying, sitting, walking, and standing activities, respectively. For BCR based on the lying activity, the BALACC value achieved with the watch accelerometer is 7.2% and 6.2% better than that achieved with the phone accelerometer using the BDT and NN classifiers, respectively. Combining the phone and watch accelerometers increases the BALACC values for BCR, particularly for the lying and walking activities. The overall average BALACC value achieved for BCR (with sensor fusion) using the BDT classifier is 11% higher than with the NN classifier. Based on all these analyses, it is evident that the combination of both accelerometers with a BDT classifier is the best choice for activity-aware BCR. Table 6 provides the confusion matrices for the best-case BCR results obtained with the smartphone and smartwatch combination using the BDT classifier. The table shows that, for the sitting and standing activities, the corresponding human behavioral contexts are correctly classified with an accuracy of more than 90%. However, the individual recognition rates achieved for behavioral contexts associated with the lying and walking activities are lower. In particular, the percentages of correctly classified samples for A1C2 (surfing the internet while lying), A1C3 (watching TV while lying), A3C3 (shopping while walking), and A3C4 (talking while walking) are 68.8%, 75.2%, 46.6%, and 61.9%, respectively. These results depict the difficulty of accurately identifying these human behavioral contexts from the associated physical activity patterns.
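Since BALACC is the headline metric throughout these experiments, the following sketch shows how balanced accuracy can be computed from a confusion matrix laid out as in this paper (ground truth in rows, predictions in columns); the example matrix values are illustrative, not taken from Table 6:

```python
import numpy as np

def balanced_accuracy(cm):
    """Balanced accuracy = unweighted mean of per-class recalls
    (sensitivities), from a confusion matrix with ground truth in
    rows and predictions in columns."""
    cm = np.asarray(cm, dtype=float)
    per_class_recall = cm.diagonal() / cm.sum(axis=1)
    return float(per_class_recall.mean())

# Illustrative imbalanced two-class case where plain accuracy (95/110)
# would overstate performance on the minority class:
cm = np.array([[90, 10],   # majority class: recall 0.9
               [ 5,  5]])  # minority class: recall 0.5
print(round(balanced_accuracy(cm), 3))  # -> 0.7
```

Because every class contributes equally regardless of its sample count, BALACC is a natural choice for the imbalanced in-the-wild class distributions discussed in this paper.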
In general, the sitting and standing activity patterns of human beings vary across different behavioral contexts. For instance, the sitting posture of most people changes when working on a personal computer/laptop versus watching TV. Likewise, the pattern of standing indoors differs somewhat from standing at an outdoor place. These differences are captured by the 3D motion sensor (i.e., accelerometer) of the smartphone/smartwatch, allowing the different behavioral contexts linked with these activities to be modeled and recognized efficiently. As a result, the performance of BCR based on the sitting and standing activity patterns is enhanced. In contrast, the lying activity typically corresponds to a state of relaxation and thus does not often entail explicit body movements, which makes the recognition of the associated human behavioral contexts challenging. In addition, the phone position associated with lying is often on table, which yields unpredictable and inaccurate results for BCR. It is therefore crucial to use the smartwatch in combination with the smartphone for BCR. In the case of the walking activity, recognizing the associated human behavioral contexts becomes hard owing to the dynamic motion patterns of an individual in the same or different physical environments. These changes are often triggered by chaotic human behavior and emotional states that may instinctively alter the gait pattern of a subject. In addition, human behavior varies from one person to another, which makes it impractical to create a general model for BCR based on the walking activity in-the-wild. Table 7 provides the average numerical results for PCR based on the lying, sitting, walking, and standing activity patterns.
Only the phone accelerometer is used in this regard, which provides macro-F1 scores of 83.1%, 91.1%, 71.1%, and 97.4% in recognizing different phone positions based on the lying, sitting, walking, and standing activities, respectively, using the BDT classifier. For the same set of activities, the NN achieves corresponding scores of 49.8%, 34.5%, 31.3%, and 69.8%, which are considerably lower than the BDT results. The values of the other performance parameters (including accuracy, precision, sensitivity, micro-F1, and log loss) are also better for the BDT classifier. In addition, Figure 5 compares the PCR results in terms of BALACC, where the best recognition performance is again obtained using the BDT classifier. The overall average BALACC value for PCR based on the BDT is 13.5% higher than with the NN classifier. Moreover, the figure shows that the average performance of PCR based on the sitting and standing activities is better than for the other activities. Table 8 provides the confusion matrices for PCR based on the four PADLs using the BDT classifier. The individual accuracies of the different phone positions based on each physical activity can be computed from these confusion matrices. The row and column labels of the confusion matrices represent different phone positions (i.e., phone in bag (PB), phone in hand (PH), phone in pocket (PP), and phone on table (PT)). There are only two phone positions (i.e., PH and PT) associated with the lying activity, which are classified with true positive rates of 54.5% and 99.9%, respectively. In the case of the sitting and standing activities, the PB and PT positions obtain very high true positive rates of more than 95%, which indicates that they are easier to recognize than the other phone positions. Likewise, PP is correctly recognized at a rate of more than 95% based on the standing activity.
The recognition of the PB and PT positions based on the walking activity attains a true positive rate of less than 60%, which shows that inferring these phone positions from in-the-wild gait patterns is very challenging. On the other hand, the recognition of PP based on the walking activity is easier, achieving a true positive rate of 89.4%. In general, the proposed scheme achieves satisfactory performance for activity-aware PCR.
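For reference, the macro-F1 and micro-F1 measures reported throughout this section can be derived from such confusion matrices as sketched below; the example matrix is illustrative, not taken from Table 8:

```python
import numpy as np

def f1_scores(cm):
    """Macro-F1 (unweighted mean of per-class F1) and micro-F1 (pooled
    counts; equal to overall accuracy for single-label multi-class
    problems), from a confusion matrix with ground truth in rows and
    predictions in columns. Assumes every class has predictions and
    ground-truth samples (no zero rows/columns)."""
    cm = np.asarray(cm, dtype=float)
    tp = cm.diagonal()
    precision = tp / cm.sum(axis=0)   # per-class: TP / predicted count
    recall = tp / cm.sum(axis=1)      # per-class: TP / ground-truth count
    f1 = 2 * precision * recall / (precision + recall)
    macro_f1 = float(np.mean(f1))
    micro_f1 = float(tp.sum() / cm.sum())
    return macro_f1, micro_f1

macro, micro = f1_scores([[8, 2],
                          [1, 9]])
print(round(micro, 2))  # -> 0.85  (17 correct out of 20)
```

Macro-F1 weights every class equally, so rare contexts pull it down when they are misclassified, whereas micro-F1 is dominated by the frequent classes; reporting both, as this paper does, makes the class-imbalance effects visible.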

Analysis of BDT vs. NN for Proposed ARW Scheme
As indicated by the results presented and discussed in the previous sections, the performance of the BDT is better for all types of recognition experiments (i.e., PPAR, BCR, and PCR). In contrast, the NN fails to provide satisfactory results for the proposed scheme. The best-case average results (i.e., BALACC values) achieved for the PPAR, BCR, and PCR experiments using the BDT classifier are 10.8%, 11%, and 13.5% higher than the corresponding NN results. Generally, the NN classifier fits AR tasks well, and numerous research studies have successfully utilized different variants of NNs (i.e., deep NNs, convolutional NNs, and recurrent NNs) for AR [96][97][98]. However, depending on the underlying data distribution, there can be certain bottlenecks in achieving effective performance for sensor-based AR tasks using an NN. For instance, a large amount of data is required for efficient training of NNs to avoid underfitting, overfitting, or regularization issues. When dealing with imbalanced class data (as is the case with our proposed scheme), where the number of samples for certain classes is very small, the NN classifier performs below par due to a lack of training samples. In addition, training or labeling noise, data standardization/normalization, the cross-validation strategy, poor hyperparameter optimization, and a poor choice of the number of hidden layers and the number of nodes per hidden layer also degrade the performance of the NN. These factors consequently lead to performance degradation of the NN classifier for the proposed scheme. In contrast to the NN, the BDT classifier works well with smaller datasets by utilizing a combination of multiple decision trees to minimize the prediction error. The trees are connected in sequential order, where each tree compensates for the prediction error of the preceding trees to boost the overall recognition performance.
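This sequential error-correction idea can be illustrated with a toy gradient-boosting-style sketch, in which each depth-1 stump is fit to the residual error of the ensemble built so far; the stump learner, learning rate, and data here are illustrative assumptions, not the actual BDT configuration used in the paper:

```python
def fit_stump(x, residual):
    """Find the 1-D threshold split that best fits the residuals
    (least squared error), returning a prediction function."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = sum((r - (lmean if xi <= t else rmean)) ** 2
                  for xi, r in zip(x, residual))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, n_trees=20, lr=0.5):
    """Sequentially add stumps, each correcting the current ensemble's
    residual error, then return the combined predictor."""
    ensemble = []
    pred = [0.0] * len(x)
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        ensemble.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in ensemble)

# Each added stump shrinks the remaining residual, so the ensemble
# converges toward the target values.
model = boost([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0])
```

Each weak learner alone is a poor predictor, but because every new tree is trained on what the ensemble still gets wrong, the combined model's error shrinks with each round, which is the property the discussion above attributes to the BDT.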
The final result is based on an ensemble of all decision trees, which may lead to overfitting in some cases. However, the final (i.e., best-case) recognition performance of our proposed scheme demonstrates the efficacy of the BDT classifier for ARW, making it an apt choice for such experiments. For handling imbalanced class data, resampling techniques, such as the synthetic minority oversampling technique (SMOTE) [99], can be used to achieve good results with classifiers that require a large amount of training data for each class (e.g., NNs). Table 9 summarizes the primary characteristics of some well-known state-of-the-art AR studies and compares them with our proposed ARW method. The comparison is made in terms of activity type, the number of activities recognized, activity occupancy and environment/context, sensing modalities for data acquisition, and machine learning classifiers for AR. The table shows that most of the existing AR studies (such as [100][101][102][103]) emphasize the recognition of simple (or atomic) daily living activities in certain restricted settings or environments. The data collection during activity execution generally takes place at a single location, such as a laboratory or home, where the sensing equipment is installed or carried to record the participants' data. The activities to be recognized by the system are performed in a predefined way as scripted tasks. Moreover, there is a lack of diversity in activity-related contexts. As a result, it is easier for existing studies to achieve efficient AR performance. However, these schemes fail to adapt to natural user behavior, which is indispensable for real-time applications in diverse environments. Only a few AR schemes (such as [37,104]) have worked on learning and identifying natural user activities in indoor and outdoor environments.
The authors in [38] recognized diverse single-label human contexts in-the-wild using heterogeneous sensor data from smartphones and smartwatches. However, single-label activity/context information is not adequate for fine-grained AR. Our proposed ARW scheme offers multi-label activity and context recognition by aggregating the outputs of its different stages (i.e., PPAR, BCR, and PCR) and achieves state-of-the-art results in terms of BALACC. In comparison with most of the existing AR studies presented in Table 9, the proposed method demonstrates efficient recognition of six PADLs, ten behavioral contexts, and four phone contexts in-the-wild. Furthermore, the proposed scheme is computationally beneficial and low-cost, as it depends only on smartphone and smartwatch accelerometer data for recognition. Hence, the efficacy of the proposed ARW scheme over state-of-the-art AR schemes is justified.

Conclusions
This research paper demonstrates a novel two-stage model for sensor-based activity recognition in-the-wild. In the first stage, the proposed scheme classifies six (06) primary physical activities, whereas in the second stage, it infers fourteen (14) activity-aware contexts using the "ExtraSensory" dataset. The outputs from both stages are combined for better cognition and understanding of natural human activities in diverse contexts. Three types of experiments are conducted in this paper: primary physical activity recognition, behavioral context/environment recognition, and phone context recognition. Smartphone and smartwatch accelerometers are utilized to identify daily living human activities and the associated behavioral contexts, whereas phone context recognition relies only on the smartphone accelerometer. A boosted decision tree achieves the best experimental results for the proposed scheme. Although the proposed method achieves a reasonable accuracy rate, it has some limitations. For example, the activities and behavioral contexts considered for experimentation cannot generalize to all real-world use cases, so the proposed scheme cannot handle unforeseen activities and contexts. There are also privacy concerns with the continuous activity/context monitoring of a human being, particularly if an impostor gains access to the device data/output. Furthermore, the continuous monitoring of human beings using smart devices is subject to memory and battery constraints.
These limitations can be addressed in future work. In this respect, our proposed method can be extended to incorporate more sensing modalities for the robust detection/recognition of a larger number of human activities and contexts, which can be helpful for human-environment interaction modeling. Resampling and data augmentation techniques can be applied to cope with imbalanced class data, particularly for activities/contexts that occur less frequently in daily life in-the-wild. Likewise, the proposed scheme can be modified to handle unforeseen activities and contexts. The concurrent recognition of a person's physical activity and behavioral/social context can be crucial for human behavior modeling and cognition in their living environments. Thus, the proposed scheme can also be extended to detect/recognize normal and abnormal human behavior for predicting health-related risks. Knowledge-based systems focusing on human-centered computing can utilize the proposed method for improved decision-making and recommendations. The correlation between human daily living activities and their social/behavioral contexts can be examined in diverse environments to identify the factors giving rise to abnormal behavior.

Funding: This research work is funded by the School of Information Technology, Whitecliffe, Wellington, New Zealand.

Data Availability Statement:
The dataset used for validating this research study is publicly available as the "ExtraSensory" dataset, which is cited in the paper.