The proposed methodology combines machine learning and computer vision techniques to understand human actions in home environments. The approach analyzes human movement and objects to improve the accuracy and efficiency of the semantic classification of home areas.
3.2. Procedures of the Methodology
The proposed methodology consists of five sequential steps, as illustrated in
Figure 2, which are required to enhance location identification through human pose and object detection.
The developed system implements a supervised data collection process, storing the data in a standardized format using text files. Data collection occurs according to the configuration shown in
Figure 1, either through video recordings or live camera input. For instance, to gather data for the “lying on sofa” action, multiple instances of the same action with different human poses are recorded, keeping the camera focused on the action area without any movement. Data are captured at one-second intervals, starting at the first second and concluding at the sixth. During this period, each sample is manually labeled with the corresponding action. To ensure balanced representation, a sufficient amount of data has been collected for each action, with approximately the same number of examples for all actions.
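The capture schedule described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation; `read_frame` and `extract_features` are hypothetical placeholders standing in for the camera input and the pose/object pipeline:

```python
import time

def collect_action_samples(read_frame, extract_features, label,
                           start_s=1, end_s=6, interval_s=1.0):
    """Capture one labeled sample per second, from t = 1 s to t = 6 s.

    `read_frame` and `extract_features` are hypothetical callables for the
    camera input and the pose/object feature pipeline, respectively.
    """
    samples = []
    for t in range(start_s, end_s + 1):
        frame = read_frame()                # video file or live camera
        features = extract_features(frame)  # pose points + detected objects
        # Supervised labeling: every sample in this window gets the action label.
        samples.append({"second": t, "features": features, "label": label})
        time.sleep(interval_s)              # one-second capture interval
    return samples
```

Recording the same action several times with different poses then amounts to repeated calls with the same `label`.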
The dataset is processed to remove incomplete or incorrect data, and then, it is segmented into specific human actions for further analysis.
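A minimal sketch of this cleaning and segmentation step; the per-sample dict layout is a hypothetical schema for illustration, not the paper's text-file format:

```python
def clean_and_segment(samples):
    """Drop incomplete or malformed samples, then group the rest by action.

    A sample is treated as incomplete when it lacks a label or features,
    or when any feature value is missing (None).
    """
    valid = [s for s in samples
             if s.get("label") and s.get("features")
             and all(v is not None for v in s["features"])]
    segments = {}
    for s in valid:
        segments.setdefault(s["label"], []).append(s)
    return segments
```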
As observed in
Figure 3, the full visibility of all pose points is not always achievable. However, the library's model includes calculations to estimate the coordinates of pose points even when their visibility score is low.
Figure 4 illustrates the representation of data for different human pose points. These representations show examples of a person doing different activities: lying (
Figure 4a) or sitting (
Figure 4b) on a sofa, and performing exercises (
Figure 4c). The acquisition of 3D points is achieved using the MediaPipe library, where the input is a 2D RGB image. Various angles and distances have been considered for data collection, with the camera mounted at a fixed height.
Feature extraction techniques will be used to obtain relevant information about objects and human pose.
Figure 5 depicts the structure of a category sample, consisting of 6 data subsections representing the attributes of the category at different time intervals, each separated by one second, encompassing a total of 6 s. (1) These data are the human poses obtained through the MediaPipe library. The content has been structured using a subset of only 9 specific points, a decision made to reduce computational time. These 9 points were selected after extensive data reduction experiments, with the criterion of retaining the points that best represent the most relevant information of the human pose for the classification task. Although a total of 33 points can be obtained, the use of 9 points proved sufficient to achieve optimal performance in the classification process. The obtained data are already scaled and represent 3D positions estimated from a 2D image: the spatial position values (x, y, z) and a visibility value. (2) The presented data correspond to the objects detected using the YOLO v3 object detector, which uses the official COCO object names list, comprising 80 categories. For this version, a selection of only 58 categories has been made, representing objects commonly found in an average household, such as sofas, beds, chairs, tables, and televisions. This selection focuses the analysis on the elements most relevant for object detection in home environments. (3) This value corresponds to the category that describes the human action being represented. For instance, one of the trained actions is “lying on bed”. In total, eight different output classes are used: reading a book, using a laptop, lying on sofa, sitting on sofa, lying on bed, drinking with a cup, working out, and playing a console.
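The layout of one category sample can be sketched as a flattening step. The exact ordering of values within a sample is an illustrative assumption (the paper specifies the contents of each subsection, not their serialization), and the 4-values-per-point count assumes x, y, z plus visibility:

```python
import numpy as np

POSE_POINTS = 9          # reduced subset of MediaPipe's 33 landmarks
VALUES_PER_POINT = 4     # assumed: x, y, z and visibility
NUM_OBJECTS = 58         # household subset of the 80 COCO categories
NUM_STEPS = 6            # one subsection per second over 6 s

def build_sample(pose_steps, object_steps):
    """Flatten 6 one-second subsections into a single feature vector.

    `pose_steps`:   6 arrays of shape (9, 4), one per second.
    `object_steps`: 6 binary vectors of length 58 marking which household
                    objects YOLO detected in that second.
    """
    parts = []
    for pose, objects in zip(pose_steps, object_steps):
        pose = np.asarray(pose, dtype=float)
        assert pose.shape == (POSE_POINTS, VALUES_PER_POINT)
        parts.append(pose.ravel())
        parts.append(np.asarray(objects, dtype=float))
    # Length: 6 * (9*4 + 58) = 564 values, plus the class label stored separately.
    return np.concatenate(parts)
```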
It is essential to clarify that human pose estimation can be performed in either 2D or 3D, with the primary difference being the type of output. With the 2D output, we receive a visual that resembles a stick figure or skeleton representation of the various key points on the body, while with 3D human pose estimation, we receive a representation of the key points in a 3D spatial frame, with the option of a three-dimensional figure instead of its 2D projection. For this study, the 2D model is established first, and then the 3D version is lifted from it.
Machine learning methods are developed to recognize human actions from the features obtained from the human pose and the detected objects. A wide range of machine learning algorithms has been evaluated to determine which is the most effective for the classification task.
Support Vector Machine (SVM): SVMs [
30] can be used for binary classification, multi-class classification, and regression tasks. The choice of kernel type, kernel coefficient, and regularization parameter can have a significant impact on the performance of the model. For example, a linear kernel may be appropriate for linearly separable data, while a non-linear kernel like the radial basis function (RBF) may be better suited for non-linearly separable data. The regularization parameter C controls the tradeoff between maximizing the margin and minimizing the classification error and can be tuned to optimize performance.
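The effect of the kernel choice can be illustrated on a toy problem. This is a generic scikit-learn sketch on synthetic data, not the study's dataset or its tuned parameters: a ring of points around a central cluster is not linearly separable, so the RBF kernel should outperform the linear one:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: inner Gaussian cluster vs. outer ring.
rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.3, size=(60, 2))
angles = rng.uniform(0, 2 * np.pi, size=60)
outer = np.c_[np.cos(angles), np.sin(angles)] * 2 + rng.normal(0, 0.1, (60, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 60 + [1] * 60)

# Same regularization parameter C; only the kernel differs.
linear_acc = SVC(kernel="linear", C=1.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y).score(X, y)
```

Raising C narrows the margin and penalizes training errors more heavily, which is the tradeoff tuned in the experiments.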
Gradient Boosting (GB): GB [
31] is a powerful ensemble method that can be used for classification and regression tasks. The learning rate determines the contribution of each individual tree to the final model, and a smaller learning rate can help prevent overfitting. The maximum depth of the trees and the number of estimators (trees) can also be tuned to optimize performance.
Extreme Gradient Boosting (XGB): XGB [
32] is a popular variant of gradient boosting that is known for its speed and performance. In addition to the hyperparameters mentioned above for gradient boosting, XGBoost also includes regularization parameters like L1 and L2 regularization as well as a subsampling ratio parameter that controls the fraction of observations used to train each individual tree.
Light Gradient Boosting Machine (LGBM): LGBM [
33] is another variant of gradient boosting that is designed for efficient performance on large datasets. In addition to the hyperparameters mentioned above for gradient boosting, LightGBM also includes hyperparameters like the number of leaves per tree and the minimum gain to split a node that can be used to optimize performance.
K-Nearest Neighbors (K-NN): K-NN [
34] is a simple but effective non-parametric method that can be used for classification and regression tasks. The number of neighbors and the distance metric used to compute distances between points are the two main hyperparameters that can be tuned to optimize performance. A larger number of neighbors can help prevent overfitting, while different distance metrics like Euclidean distance or cosine distance may be more appropriate depending on the dataset and problem being solved.
Table 1 summarizes the main hyperparameters of each method and the values that have been used for each of them in the experiments conducted in this study.
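Hyperparameter searches of this kind can be sketched with scikit-learn's `GridSearchCV`. The data, grid values, and classifier below are a generic illustration (here K-NN, with the two hyperparameters discussed above), not the actual grid from Table 1:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the study's data: 8 classes, numeric features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=8, random_state=42)

# Cross-validated search over the number of neighbors and the distance metric.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 9],
                     "metric": ["euclidean", "cosine"]},
                    cv=3)
grid.fit(X, y)
best = grid.best_params_   # combination with the highest mean CV accuracy
```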
The accuracy and efficiency of the machine learning method will be evaluated using the test data. Additionally, the hyperparameter optimization process was applied after obtaining the initial results. To evaluate the performance of the proposed framework, the concept of the confusion matrix [
35] is used. Let
n be the number of different classes; a confusion matrix of size n × n associated with a classifier shows the actual and predicted classification values.
Table 2 illustrates a 2 × 2 confusion matrix in which each cell has a specific interpretation as follows:
True Positives (TP): the number of positive instances classified accurately.
False Positives (FP): the number of actual negative instances classified as positive.
False Negatives (FN): the number of actual positive instances classified as negative.
True Negatives (TN): the number of negative instances classified accurately.
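The four cells can be counted directly from predictions with a one-vs-rest view of a single class, which is how a multi-class confusion matrix reduces to the binary case. A minimal sketch:

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif t == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn
```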
The confusion matrix provides the basis for obtaining various performance measures [
36,
37]. In this study, the following metrics are used to evaluate the performance of the machine learning algorithms.
The accuracy metric is a measure commonly used to evaluate the performance of a classification model. It represents the proportion of correctly classified instances out of the total number of instances in a dataset. The accuracy of a classification model is calculated using the following equation:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
The recall metric, also known as sensitivity or true positive rate, measures the ability of the model to correctly classify instances of a given class out of all the instances that truly belong to that class. The recall of a classification model is calculated using the following equation:
Recall = TP / (TP + FN)
The precision metric is a performance measure that assesses the accuracy of a model’s predictions for each class. It measures the proportion of correctly classified instances for a given class out of the total number of instances predicted to be in that class. The precision of a classification model is calculated using the following equation:
Precision = TP / (TP + FP)
The F1 metric is a performance measure that provides a balanced assessment of the model’s performance by taking into account both the precision and the recall for each class. It is the harmonic mean of precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0. Therefore, this score takes both false positives and false negatives into account. The F1 score of a classification model is calculated using the following equation:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The Matthews Correlation Coefficient (MCC) is a performance measure that quantifies the quality of predictions in multi-class classification tasks. It takes into account the TP, FP, FN, and TN for each class and calculates a correlation coefficient that ranges between −1 and +1, where values of +1, 0, and −1 indicate an accurate prediction, a random prediction, and a mismatch between the predicted and actual classes, respectively. The MCC of a classification model is calculated using the following equation:
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
The Area Under the Curve (AUC) is a metric derived from the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate for different classification thresholds. In multi-class classification, the AUC is typically calculated by averaging the pairwise comparisons between each class and the rest. It represents the probability that a randomly selected instance from one class will be ranked higher by the model than a randomly selected instance from another class. The AUC ranges from 0 to 1, where a higher value indicates better discrimination and overall classification performance.
Cohen’s kappa coefficient (κ) is a statistical measure used to assess the degree of agreement between the predictions of a multi-class classification model and the true class labels, taking into account both correct predictions and misclassifications. A kappa coefficient of 1 represents perfect agreement, while a coefficient close to 0 indicates agreement no better than chance.
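As a consistency check, these metrics can all be computed directly from the four confusion-matrix cells. The sketch below covers the binary case; the study's multi-class results would average such per-class values:

```python
import math

def metrics_from_counts(tp, fp, fn, tn):
    """Standard binary classification metrics from TP, FP, FN, TN."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Cohen's kappa: observed agreement vs. agreement expected by chance.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "mcc": mcc, "kappa": kappa}
```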