Article

Machine Learning-Based Smartphone Grip Posture Image Recognition and Classification

1 Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang 37673, Republic of Korea
2 School of Industrial Engineering, University of Ulsan, Ulsan 44610, Republic of Korea
3 Mechanical, Automotive, and Materials Engineering, University of Windsor, Windsor, ON N9B 3P4, Canada
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5020; https://doi.org/10.3390/app15095020
Submission received: 20 March 2025 / Revised: 26 April 2025 / Accepted: 30 April 2025 / Published: 30 April 2025
(This article belongs to the Special Issue Novel Approaches and Applications in Ergonomic Design III)

Abstract

Uncomfortable smartphone grip postures resulting from inappropriate user interface design can degrade smartphone usability. This study aims to develop a classification model for smartphone grip postures by detecting the positions of the hand and fingers on smartphones using machine learning techniques. Seventy participants (35 males and 35 females with an average age of 38.5 ± 12.2 years) with varying hand sizes participated in the smartphone grip posture experiment. The participants performed four tasks (making calls, listening to music, sending text messages, and web browsing) using nine smartphone mock-ups of different sizes, while cameras positioned above and below their hands recorded their usage. A total of 3278 grip posture images were extracted from the recorded videos and were preprocessed using a skin color and hand contour detection model. The grip postures were categorized into seven types, and three models (MobileNetV2, Inception V3, and ResNet-50), along with an ensemble model, were used for classification. The ensemble-based classification model achieved an accuracy of 95.9%, demonstrating higher accuracy than the individual models: MobileNetV2 (90.6%), ResNet-50 (94.2%), and Inception V3 (85.9%). The classification model developed in this study can efficiently analyze grip postures, thereby improving usability in the development of smartphones and other electronic devices.

1. Introduction

Smartphone grip posture is influenced by the smartphone user interface (UI). An inappropriate UI can induce uncomfortable grip postures, negatively impacting smartphone usability. Smartphone grip posture refers to the positions of the hand and fingers adopted during user interaction with the device. Changes in grip posture, caused by changes in device UI, may affect the device’s usability [1,2]. An improperly designed smartphone UI may lead to uncomfortable grip postures, causing discomfort in the user’s fingers [2,3]. Lee et al. (2018) reported that improper grip posture may place a physical burden on the user’s hand and wrist [4]. Furthermore, Choi et al. (2020) reported that improper grip posture could lead to slips during device use, potentially causing damage to the device [1]. Such improper grip posture may degrade device operability, resulting in high error rates and low task performance [4,5]. Although it is acknowledged that using a smartphone with both hands may enhance stability [6], users prefer a one-handed grip for convenience [7]. Therefore, it is necessary to understand one-handed grip posture. For stable and convenient smartphone use, it is necessary to systematically identify and analyze grip postures and, based on the identified grip posture categories, develop a grip posture classification model needed for designing a smartphone UI with proper usability.
Different grip postures may be identified by three different methods: (1) manual identification by researchers, (2) use of hardware equipment such as 2D/3D cameras, depth cameras, and gloves, and (3) application of computer vision techniques such as image detection and modeling. First, in manual identification, researchers define grip posture classification criteria and manually categorize the grip postures. Choi et al. (2020) identified smartphone grip postures by placing web cameras above and below participants’ hands to record usage tasks [1]. Grip postures were manually classified from the recorded videos based on the number of fingers positioned on each part of the smartphone (L: left, R: right, T: top, B: bottom) and were labeled accordingly. For instance, 3L-1R-1B refers to a grip posture where three fingers are positioned on the left side of the device, one on the right side, and one on the bottom. Next, grip postures can be identified using equipment such as depth cameras or 3D cameras, or by equipping the user’s hand with devices like gloves or markers. Coleca et al. (2015) used a depth camera to capture images of the user’s hand and applied a self-organizing map (SOM) technique to construct a finger link model for the five fingers (thumb, index, middle, ring, and little) [8]. Using the constructed link model, five different hand postures were identified. Achenbach et al. (2023) conducted an experiment using the Manus Prime X data glove worn by participants to recognize 56 different hand gestures [9]. The collected 3480 hand gesture data samples were classified using a voting meta-classifier (VL2) model, which ensembles three models: support vector machine (SVM), random forest (RF), and logistic regression (LR). The classification accuracy was reported to be 95.5%. Lastly, computer vision and model-based detection techniques involve detecting the hand by removing the background, identifying the hand contour based on the captured image, and then classifying the grip posture using a developed classification model. Haria et al. (2017) used image detection techniques to remove the background from hand images and then employed contour extraction and the Haar cascade classifier to distinguish seven hand postures (2-finger, 3-finger, 4-finger, 5-finger, palm, fist, and swipe). The classification accuracy was reported to be between 85% and 95% (2-finger: 94%, 3-finger: 93%, 4-finger: 92%, 5-finger: 92%, palm: 95%, fist: 95%, and swipe: 85%) [10]. Singh et al. [11] detected the overall hand image by identifying skin color in the RGB color space using OpenCV. The convex hull technique was applied to detect finger angles and identify hand shapes, which were classified into 10 forms (A, B, C, D, E, F, G, H, I, and J) using a convolutional neural network (CNN) technique, achieving a reported classification accuracy of 90%.
In smartphone grip posture classification, inducing natural hand grip postures and achieving high model performance are considered important; however, few studies have applied machine learning techniques to classify smartphone grip postures in usage scenarios. Previous studies on identifying grip posture have involved attaching markers to the hand or using gloves; however, wearing such equipment may hinder natural grip postures [1,12]. Choi et al. (2020) ensured natural hand movements by utilizing web cameras without wearable equipment, manually classifying grip postures from videos [1]. However, this method relies on subjective judgment and requires significant effort and time. While many studies have detected natural grip posture based on 2D images, few have focused on smartphone grip posture. Nigam et al. (2019) used a hidden Markov model (HMM) to classify hand gestures into five categories related to smart light control: switch smart light on, switch smart light off, increase saturation of smart light, decrease saturation of smart light, and change smart light’s color [13]. Although such camera-based approaches ensure natural hand movements by not attaching any equipment to the hand, grip postures in smartphone usage scenarios still need to be identified. Ensemble models have demonstrated their ability to ensure diversity among models and achieve better predictive accuracy compared to single models [14,15]. Alnuaim et al. (2022) utilized an ensemble of ResNet-50 and MobileNetV2 models to recognize 32 Arabic hand gestures, achieving an accuracy of 97.5% for ResNet-50 alone and 97.1% for MobileNetV2 alone; the ensemble of both models reached an accuracy of 98.2% [16].
The present study aims to propose a model for detecting and classifying smartphone hand grip posture images using machine learning techniques. To identify smartphone grip posture, an experiment was conducted with participants of various handedness, hand lengths, and hand widths. Grip postures performed during tasks with smartphones of different sizes were recorded using video cameras. Grip posture images were extracted from the recorded videos through frame captures and preprocessed based on skin color. The proposed model detected and classified grip postures, and the model’s accuracy was evaluated in the study.
The rest of this article is organized as follows. Section 2 reviews related work on hand recognition and its applications. Section 3 describes the experimental protocol, as well as the structures and training procedures of the proposed models. Section 4 presents the experimental results, and Section 5 discusses the key findings, their implications, the limitations of the study, and future research directions in detail. Lastly, Section 6 concludes the article with the major findings and a brief summary of the study’s limitations and directions for future research.

2. Related Work

2.1. Preprocessing Techniques for Hand Recognition

In the preprocessing stage, hand data captured using a regular camera—without the use of gloves or markers—was primarily segmented from the background through skin color filtering, contour detection techniques, or a combination of both methods. Tang et al. [17] employed a combination of skin color segmentation and depth-based segmentation to separate the hand from other skin-colored objects for hand gesture tracking. In their study, the YCbCr color space—characterized by the separation of luminance and chrominance and the compactness of the skin cluster—was used for skin color segmentation, achieving a gesture recognition accuracy of 90.7% [18]. Rahmat et al. [19] proposed a method combining two color spaces, HS and CbCr (HS-CbCr), to detect skin regions, and employed a background averaging technique for hand gesture recognition. They reported an accuracy of 96.87%, while noting that recognition performance was affected by lighting conditions, with better results under stable illumination. Yörük et al. (2006) recognized hands using two techniques: the Hausdorff distance of the hand contours and the independent component features of the hand silhouette images [20]. Yao and Fu (2014) developed a hand contour model to simplify the gesture matching process, which can reduce the computational complexity of gesture matching [21]. Minnen and Zafrulla (2011) detected hands by identifying local peaks in depth images and developed a method to distinguish the hand from the forearm based on the radius of the palm [22].
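Several of the studies above rely on chrominance-based skin segmentation. The following is a minimal sketch, in Python with OpenCV, of how a skin mask can be obtained in the YCbCr (YCrCb in OpenCV) color space; the Cr/Cb threshold values are commonly cited rules of thumb and are assumptions, not the exact values used in the cited studies.

```python
# Minimal sketch of YCbCr-based skin segmentation (illustrative only).
import cv2
import numpy as np

def skin_mask_ycbcr(bgr_image: np.ndarray) -> np.ndarray:
    """Return a binary mask of skin-colored pixels using the YCrCb color space."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    # Leave luminance (Y) unconstrained and threshold only the chrominance channels,
    # which is why YCbCr-based segmentation is less sensitive to illumination changes.
    lower = np.array([0, 133, 77], dtype=np.uint8)      # (Y, Cr, Cb) lower bounds (assumed)
    upper = np.array([255, 173, 127], dtype=np.uint8)   # (Y, Cr, Cb) upper bounds (assumed)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Light morphological opening to remove small speckles in the mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```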
In addition, in the preprocessing stage, multiple techniques are often used in combination, and algorithms such as YOLOv7 are also being applied. Sarma and Bhuyan (2022) proposed a pre-processing approach that integrates YCbCr and HSV color-space-based skin segmentation, motion-based three-frame differencing, and a double-tracking system combining particle filtering and CAMShift for robust hand tracking and gesture trajectory extraction [23]. Qi et al. [24] proposed a Gaussian Mixture Model (GMM) based on CbCr-I color components to improve hand segmentation under varying illumination. By applying this model with an adaptive threshold, skin regions were effectively detected and the hand region was extracted as the region of interest (ROI). Furthermore, a novel hand shape distribution feature in polar coordinates was introduced to reduce misclassification among postures with the same number of extended fingers and to enhance the accuracy of hand posture recognition. Ansar et al. [25] emphasized the importance of pre-processing in hand gesture recognition systems, particularly to address challenges such as illumination variation, noise, and complex backgrounds in real-world settings. Their proposed method involved reducing noise and adjusting brightness in the input images, followed by detecting the region of interest (ROI) using directional images. After extracting the hand region, geometric features were obtained via convex hull-based landmark extraction. These features were then used for gesture classification through a convolutional neural network (CNN). The system achieved high recognition accuracy, with results of 93.2% and 90.2% on the MNIST dataset, and 91.6% and 88.14% on the ASL dataset, using one-third and two-thirds training-validation splits, respectively. Dewi et al. [26] adopted the YOLOv7 object detection model—which integrates enhanced layer aggregation (E-ELAN) and reparameterized convolution (RepConvN) to improve accuracy and scalability—for the task of hand detection. Leveraging its efficient architecture, they trained the YOLOv7x variant for 200 epochs on the Oxford Hand dataset and reported a high mean average precision (mAP) of 86.3%, confirming the model’s effectiveness in detecting hand shapes under various conditions.

2.2. Classification Model

For hand recognition, MobileNetV2 (with its lightweight architecture and fast inference speed), ResNet-50 (based on residual learning), and Inception V3 (which leverages various filter sizes for improved image recognition) have been widely used. First, MobileNetV2 adopts a streamlined architecture based on standard operations, enabling both memory-efficient and computationally efficient inference for mobile applications. This model demonstrates superior performance on benchmark datasets such as ImageNet and COCO, achieving a favorable balance between accuracy and complexity. In combination with the SSDLite detection module, MobileNetV2 requires about 20× less computation and 10× fewer parameters than YOLOv2. The network structure also separates model expressiveness—through expansion layers—from capacity—via bottleneck inputs—facilitating effective model scaling [27,28]. Sun et al. [29] proposed a palm vein image recognition approach based on MobileNetV2, incorporating transfer learning and SENet modules to improve robustness and accuracy under conditions such as lighting variation and limited data. By segmenting a region of interest (ROI) rich in vein patterns and enhancing critical feature representation through SENet, their image recognition model achieved a high accuracy of 99.35%. Dabwan et al. [30] developed a sign language recognition system for individuals with hearing and speech impairments using a MobileNetV2-based model. They trained the model on a dataset comprising 24 alphabet gesture classes (excluding J and Z, which require motion), achieving 99.9% validation accuracy and 100% test accuracy.
Next, ResNet-50 enables the effective training of deep neural networks with up to 50 layers, making it well-suited for image recognition by achieving higher accuracy compared to low-depth networks. Residual connections mitigate the vanishing gradient problem and facilitate the training of deeper models, effectively addressing challenges in object recognition [31,32]. Doma and Miriyala [32] applied the ResNet50 model to object detection tasks using the COCO dataset and demonstrated its robustness in identifying and localizing objects under complex conditions. Their results showed that ResNet50 achieved higher accuracy compared to other lightweight models like MobileNet and EfficientNet, particularly in scenarios with occlusion and scale variation. Yildirim et al. (2024) compared the performance of ResNet18, ResNet34, ResNet50, and ResNet101 models for hand recognition using palm and dorsal hand images. Among these, ResNet50 achieved the highest accuracy (92.5%) and the lowest loss rate (0.2372%) [33]. Li [34] proposed a static hand gesture recognition system based on the ResNet50 architecture, enabling control of a web browser through gesture input. The proposed system was constructed by fine-tuning a pre-trained ResNet50 model on a target gesture dataset and achieved an average recognition accuracy of 93.86% on the test set. To demonstrate its practical applicability, eight static gestures were predefined and mapped to web browser commands (e.g., open browser, switch page), enabling gesture-based control of browsing functions.
Lastly, Inception V3 is a high-performance convolutional neural network (CNN) architecture designed to balance computational efficiency and accuracy, employing structural optimizations such as factorized convolutions, aggressive dimension reduction, auxiliary classifiers, and label-smoothing to achieve strong performance with relatively low computational cost and parameter count [35]. Additionally, the Inception-v3 architecture employs 1 × 1 convolutions (pointwise convolutions) in conjunction with parallel convolutional layers of varying kernel sizes, along with an increased depth through additional hidden layers. This design enhances the model’s ability to capture complex and abstract features, making it especially effective for addressing sophisticated recognition tasks [36]. Karsh et al. [37] proposed mIV3Net, a two-stage deep learning model based on Inception V3, for accurate hand gesture recognition even in complex background environments. The model achieved accuracy rates of 97.14%, 99.3%, 97.4%, 99%, and 99.8% on five public datasets (MUGD, ISL, ArSL, NUS-I, NUS-II), reporting up to a 12.58% improvement over existing methods. Hussain et al. [36] fine-tuned an Inception V3 model for hand gesture recognition using 5440 color static gesture images representing 37 poses from the American Sign Language dataset. The model achieved 90% accuracy, 93% precision, 91% recall, and an F1-score of 90%. Anusha et al. [38] classified American Sign Language using 87,000 image data samples by locking 248 layers of the Inception V3 model and training only the last two blocks and the fully connected layers. By applying transfer learning along with data augmentation techniques, they reported a training accuracy of approximately 98.87% and a testing accuracy of 96.43%.

2.3. Ensemble Methods

Ensemble methods achieve higher performance than individual algorithms by leveraging multiple algorithms to process data, generating predictions based on extracted features, and integrating the outcomes through a unified mechanism [39,40]. Sen et al. [41] proposed an ensemble-based CNN approach for hand gesture recognition, where three custom CNN architectures—GoogLeNet-like, VGGNet-like, and AlexNet-like—were trained in parallel and integrated through score-level averaging to improve classification performance. The system was validated on three datasets, including two public infrared gesture datasets and a self-constructed binary image dataset, achieving recognition accuracies of 99.80%, 96.50%, and 99.76%, respectively. Ewe et al. [42] proposed a hybrid model (VGG16-RF) that combines the feature extraction capabilities of a lightweight VGG16 model with the classification performance of a Random Forest, in order to overcome the limitations of manual feature extraction in vision-based hand gesture recognition. On the ASL Alphabet, ASL Numbers, and NUS Hand Posture datasets, the proposed model achieved accuracies of 99.98%, 100%, and 100%, respectively, outperforming traditional deep learning and machine learning models. Rahim et al. [43] proposed a model for Bengali sign language gesture recognition by combining Support Vector Machine (SVM), Random Forest (RF), and Convolutional Neural Network (CNN), and applied a probability-based soft voting technique for ensemble learning. Using a dataset of 190,000 images, the ensemble model achieved an accuracy of 99.50%, outperforming the individual models (CNN: 99.44%, SVM: 96.39%, RF: 95.56%). Jabbar [44] developed an ensemble model for network attack detection based on three classifiers: K-Nearest Neighbors (KNN), Decision Tree (DT), and Gradient Boosting (GB), applying both soft voting and hard voting methods. The individual models achieved accuracies of 86.52% (KNN), 99.95% (DT), and 99.82% (GB), while the ensemble model attained 99.967% accuracy with soft voting and 100% accuracy with hard voting. The study concluded that combining models can enhance overall performance by leveraging the strengths and compensating for the weaknesses of individual classifiers.

3. Methods

3.1. Participants

Seventy Koreans (35 males, 35 females; 20 left-handed, 50 right-handed) in their 20s to 60s (mean ± SD = 38.0 ± 16.2; range = 20~63 years) with varying hand lengths, hand widths, and smartphone UI experience participated in the smartphone grip posture experiment. The participants (mean hand length = 180.6 ± 11.1 mm; mean hand width = 81.2 ± 6.7 mm) were recruited based on the distribution of hand length and hand width from Korean anthropometric data [45]. As shown in Figure 1, participants were divided into nine groups according to hand length and hand width for each gender: for males, the 33rd percentile hand length is 181 mm and the 66th percentile is 188 mm, with hand widths at the 33rd percentile being 83 mm and the 66th percentile being 87 mm; for females, the 33rd percentile hand length is 166 mm and the 66th percentile is 173 mm, with hand widths at the 33rd percentile being 74 mm and the 66th percentile being 78 mm. The dominant hand of each participant was determined using the Edinburgh Handedness Inventory [46]. The hand dominance assessment involved 10 tasks (e.g., writing, drawing, using a spoon), each rated on a 5-point scale (1: always left hand, 2: usually left hand, 3: both equally, 4: usually right hand, 5: always right hand). Additionally, participants were required to have no history of visual or musculoskeletal disorders that might interfere with normal smartphone use and to have at least three years of experience using smartphones. The present study was approved by the Institutional Review Board (IRB) of the Pohang University of Science and Technology (PIRB-2018-E100).

3.2. Apparatus

In this study, nine smartphone mock-ups with different sizes and weights were used to identify changes in grip posture based on different smartphone UIs, and two web cameras (Microsoft Co. Ltd., Redmond, WA, USA) were used to extract grip posture image data. As shown in Figure 2, the nine smartphone mock-ups varied in size and weight (screen size = 3.0~7.0 inches; height = 95~175 mm; width = 56~93 mm; depth = 8~12 mm; weight = 100~190 g) and were produced using a 3D printer, Dimension SST 768 (Stratasys Ltd., Edina, MN, USA), for the grip posture experiment. The smallest mock-up model has a screen size of 3 inches, a height of 95 mm, a width of 56 mm, and a weight of 100 g. Eight additional mock-up models were created by incrementing the screen size by 0.5 inches, the height by 10 mm, the width by 1 mm, and the weight by 10 g from the previous model. To record the natural hand movements of the participants, two web cameras were placed 30 cm above and 30 cm below the hand to capture grip postures during smartphone use. Black cloth was placed on the ceiling and floor of the recording area to minimize background noise in image detection (Figure 3).

3.3. Experiment Procedure

The smartphone grip posture measurement experiment was conducted in four stages: (1) experiment preparation, (2) anthropometric data collection, (3) grip posture measurement, and (4) debriefing. First, in the experiment preparation stage, participants were informed about the purpose and procedures of the experiment, and consent forms were obtained. During the anthropometric data collection, participants’ hand sizes were measured, and their handedness was assessed. Hand length was measured using vernier calipers from the base of the wrist to the tip of the middle finger. In the grip posture measurement stage, grip postures were identified while the participants performed four tasks—calling, sending text messages, web surfing, and music playing—based on previous studies [1]. To prevent learning effects, the order of tasks and the sizes of the smartphone mock-ups were randomized. Additionally, a 3-min break was provided after performing tasks with the nine different smartphone mock-ups to prevent fatigue. Finally, during the debriefing stage, participants were checked for physical fatigue experienced from performing the tasks and were given the opportunity to ask questions regarding the experiment. Additionally, the video recordings were checked to ensure they were correctly saved.

3.4. Grip Posture Classification

Grip posture was classified based on the number of fingers positioned on different parts of the smartphone, following the grip posture classification system for smartphone UI usage established by Choi et al. (2020) [1]. As shown in Figure 4, the parts of the device were labeled as L (left), R (right), T (top), B (bottom), and K (back), and the number of fingers on each part was counted and labeled accordingly. For instance, L3-R1-B1 refers to a grip posture using three fingers on the left side, one finger on the right side, and one finger on the bottom. The types of grip postures identified during UI usage in the experiment were classified into seven categories. For right-hand-dominant participants, the categories are: (1) L2-B1-R1-K1, (2) L2-R1-K2, (3) L2-T1-B1-R1, (4) L3-B1-R1, (5) L3-R1-K1, (6) L3-R1-T1, and (7) L4-R1. These seven categories were also applied to the left hand, with the labels for the left (L) and right (R) sides of the smartphone reversed accordingly.
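To illustrate this labeling convention, the hypothetical helper below composes a label string from per-surface finger counts. The surface ordering used in the label is an assumption for illustration and does not exactly match the ordering of every category listed above.

```python
# Illustrative (hypothetical) helper for the labeling convention described above
# (L: left, R: right, T: top, B: bottom, K: back).
def grip_posture_label(counts: dict) -> str:
    """Build a grip posture label from finger counts, e.g., {"L": 3, "R": 1, "K": 1}."""
    order = ["L", "T", "B", "R", "K"]  # assumed ordering, for illustration only
    # The seven categories above all account for five fingers in total.
    assert sum(counts.values()) == 5, "all five fingers should be accounted for"
    parts = [f"{side}{counts[side]}" for side in order if counts.get(side, 0) > 0]
    return "-".join(parts)

print(grip_posture_label({"L": 4, "R": 1}))          # L4-R1
print(grip_posture_label({"L": 2, "R": 1, "K": 2}))  # L2-R1-K2
```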

3.5. Dataset

Grip posture image data were obtained by capturing images from the recorded videos of smartphone usage. The grip posture capture moments were designated at the points of UI interaction (e.g., pressing the volume key). To ensure accuracy, the captured moments were reviewed by three ergonomics researchers. Images where the user’s hand moved out of the camera range or the camera shook, making it difficult to accurately capture the hand shape, were excluded from the analysis. A total of 3278 grip posture images were extracted, as displayed in Table 1, with variations in the number of images for each category. The frequency of use varies by grip posture, leading to differences in the amount of data for each posture.

3.6. Preprocessing

The captured grip posture image data were processed to remove the background using a skin color and hand contour detection algorithm from Python’s open-source computer vision library (OpenCV). The background of the captured input images was removed using a background subtraction algorithm, which was implemented using OpenCV functions including (1) cv2.cvtColor, (2) cv2.GaussianBlur, (3) cv2.findContours, (4) cv2.boundingRect, and (5) cv2.bitwise operations, applying an adaptive threshold to each image frame [47,48,49]. Although the RGB color space is commonly used for skin color differentiation, ambient light intensity may affect the actual color. When an object is exposed to light, the color channel values of the actual image may increase, making it difficult to find boundaries for skin color [50]. Therefore, in this study, the YCrCb (Y: luminance, Cr: red chrominance, Cb: blue chrominance) color space was used to recognize skin color while minimizing the impact of brightness on skin color recognition [51]. As shown in Figure 5, the hand contour is extracted from the binary-segmented image using the convex hull method, which is based on skin segmentation and convexity defects. Using the extracted hand contour, the bounding box is set, and the image frame is cropped [50]. As shown in Figure 6, left hand grip postures were horizontally flipped to match the categories of the right hand grip postures. For example, the left hand grip posture R4-L1 (with four fingers on the right side and one finger on the left side of the phone) was considered and classified as the same data as the right hand grip posture L4-R1 (with four fingers on the left side and one finger on the right side of the phone).
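The following is a minimal sketch of the contour-based cropping and left-hand mirroring steps described above, assuming a skin mask has already been computed (for example, with the YCrCb thresholding sketched in Section 2.1). Parameter values and the largest-contour heuristic are illustrative assumptions, not the study's exact implementation.

```python
# Sketch of contour extraction, bounding-box cropping, and left-hand mirroring.
import cv2
import numpy as np

def crop_hand_region(bgr_image: np.ndarray, skin_mask: np.ndarray,
                     is_left_hand: bool = False) -> np.ndarray:
    """Isolate the hand with a bitwise mask, crop to its bounding box,
    and horizontally flip left-hand images to match right-hand categories."""
    blurred = cv2.GaussianBlur(skin_mask, (5, 5), 0)
    contours, _ = cv2.findContours(blurred, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        raise ValueError("no hand contour found")
    hand_contour = max(contours, key=cv2.contourArea)    # assume the largest contour is the hand
    x, y, w, h = cv2.boundingRect(hand_contour)
    foreground = cv2.bitwise_and(bgr_image, bgr_image, mask=skin_mask)
    cropped = foreground[y:y + h, x:x + w]
    if is_left_hand:
        cropped = cv2.flip(cropped, 1)                   # mirror so that, e.g., R4-L1 maps to L4-R1
    return cropped
```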

3.7. Model Architectures

To train and classify grip posture images in the present study, three models were used: MobileNetV2, ResNet-50, and Inception V3. The three pre-trained models were fine-tuned for the classification task. In all three models, the layers of the base models pre-trained on ImageNet were frozen. Only the newly added classification layers, which included a GlobalAveragePooling2D layer, a fully connected Dense layer with 1024 units, a Dropout layer, and a final Dense layer for 7-class classification, were trained on the target dataset. MobileNetV2 is an optimized, lightweight deep learning model with an architecture that maximizes computational efficiency and minimizes memory usage. Building on the architecture of MobileNetV1, it introduces the inverted residual with linear bottleneck, utilizing high-efficiency depthwise separable convolutions and ReLU6 activation functions (Figure 7) [27,28,52,53]. MobileNetV2 has advantages in adaptability to different environments and requirements, providing high accuracy and a lightweight framework for image classification [53], making it suitable for classifying grip posture images.
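The sketch below shows the transfer-learning head described above, using MobileNetV2 as an example base (the same head structure was attached to ResNet-50 and Inception V3). The 224 × 224 input size and the ReLU and softmax activations are assumptions; the dropout rate of 0.5 and the learning rate of 0.001 follow Section 3.8.

```python
# Sketch of a frozen ImageNet base with the newly added classification layers.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_grip_classifier(num_classes: int = 7) -> tf.keras.Model:
    base = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False                                 # freeze the pre-trained layers
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(1024, activation="relu")(x)           # fully connected layer with 1024 units
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # 7-class output
    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```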
Next, ResNet-50 is a convolutional neural network with 50 layers that learns residuals instead of directly learning features [31,54]. ResNet-50 addresses the gradient vanishing problem, enabling the construction of deep neural networks [55]. As shown in Figure 8, the stacked layers produce an output defined as y = F(x) + x, where F(x) represents the layer’s output, and the initial input x is added element-wise [56]. This model is particularly effective for image classification tasks, offering benefits such as easy optimization and low computational burden [57].
Lastly, Inception V3 is a CNN-based model comprising both symmetric and asymmetric components [35,58]. These components include convolutional layers, max pooling, average pooling, dropouts, concatenations, and fully connected layers, as shown in Figure 9. By extracting features at multiple levels, Inception V3 enhances computational efficiency and requires less computational power [59].
Furthermore, an ensemble of MobileNetV2, ResNet-50, and Inception V3 is used in this study. Ensemble learning leverages various machine learning algorithms, combining their strengths and compensating for their weaknesses, resulting in superior performance compared to a single algorithm [60]. This study employs the soft voting technique, which integrates the probabilistic outputs of individual models to generate a final prediction. By averaging the predicted probabilities of each algorithm, soft voting effectively balances their contributions, further enhancing the robustness and accuracy of the ensemble model [39]. As shown in Figure 10, ensemble learning integrates multiple machine learning algorithms into a unified framework, leading to improved prediction accuracy [39].
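The following is a minimal sketch of the soft-voting step described above: the class-probability outputs of the three fine-tuned models are averaged, and the class with the highest averaged probability is taken as the final prediction. Equal model weights are assumed.

```python
# Soft voting over the probabilistic outputs of the three models.
import numpy as np

def soft_vote(prob_mobilenet: np.ndarray,
              prob_resnet: np.ndarray,
              prob_inception: np.ndarray) -> np.ndarray:
    """Average per-class probabilities (shape: [n_samples, 7]) and return predicted class indices."""
    averaged = (prob_mobilenet + prob_resnet + prob_inception) / 3.0
    return np.argmax(averaged, axis=1)

# Usage with fine-tuned Keras models (illustrative):
# final_preds = soft_vote(mobilenet.predict(x_test), resnet.predict(x_test), inception.predict(x_test))
```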

3.8. Training/Test Setup

Models were trained using TensorFlow 2.16 (https://www.tensorflow.org/ (accessed on 30 October 2024)) with the Keras API in the Python 3.6 programming language (Python Software Foundation, Wilmington, DE, USA, https://www.python.org/ (accessed on 30 October 2024)) on an NVIDIA GeForce RTX 2060 GPU. Additionally, hyperparameters for each model were configured for training and testing. For MobileNetV2, the learning rate was set to 0.001, the batch size to 32, and the number of epochs to 8. Adam (Adaptive Moment Estimation), a gradient-based optimizer that adjusts the learning rate independently for each parameter, was used to enhance training efficiency. The loss function used was categorical cross-entropy. Dropout was set to 0.5, randomly deactivating 50% of the nodes to prevent overfitting and enhance generalization. For ResNet-50, the batch size was set to 32 and the number of epochs to 8. Similar to MobileNetV2, Adam was used as the optimizer, and categorical cross-entropy was utilized as the loss function. Dropout was set to 0.5 for ResNet-50 as well to prevent overfitting. For Inception V3, the batch size was set to 32 and the number of epochs to 15. Adam was again used as the optimizer, and categorical cross-entropy as the loss function. Dropout was set to 0.5, and a learning rate scheduler was implemented to enhance training efficiency and optimize the model. The learning rate factor of the scheduler was set to 0.2 with patience set to 5, meaning the learning rate was reduced if the monitored value did not improve for 5 epochs. The minimum learning rate was set to 0.0001 to prevent excessive overfitting. The data were divided into training, validation, and test sets, with adjustments made to image size, number of epochs, batch size, dense layers, and dropout. The number of epochs was determined to balance between insufficient learning and potential overfitting. When the epoch count was reduced below the selected value, the model showed higher loss and lower accuracy, indicating underfitting. Conversely, increasing the number of epochs beyond the chosen value led to a rise in training accuracy but a decline in validation accuracy, suggesting a risk of overfitting. The batch size was set to 32 to balance training stability and generalization performance. When smaller batch sizes, such as 16, were used, training became slower and more unstable, while larger batch sizes, such as 64 or 128, led to faster convergence but a decline in generalization performance. The dropout rate was set to 0.5 to prevent overfitting while maintaining sufficient model capacity. Lower dropout rates, such as 0.2, resulted in increased risk of overfitting, whereas higher rates, such as 0.7, led to underutilization of the model’s learning capacity and unstable performance. The preprocessed set of 3278 images was divided into 2551 images for the training set, 309 images for the validation set, and 418 images for the test set.
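As a sketch of the training configuration described above for Inception V3 (batch size 32, 15 epochs, learning rate scheduler with factor 0.2, patience 5, and a minimum learning rate of 0.0001), the snippet below uses Keras's ReduceLROnPlateau callback. The monitored quantity ("val_loss") is an assumption, and the one-hot encoded label arrays are placeholders.

```python
# Illustrative training loop for the Inception V3 configuration.
import tensorflow as tf

def train_inception(model: tf.keras.Model, x_train, y_train, x_val, y_val):
    lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.2, patience=5, min_lr=1e-4)   # values from Section 3.8
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     batch_size=32, epochs=15,
                     callbacks=[lr_scheduler])
```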

3.9. Model Evaluation

The performances of individual models (MobileNetV2, ResNet-50, and Inception V3) and their ensemble model for grip posture image classification were evaluated using a confusion matrix and metrics such as precision, recall, and F1-score. Initially, the number of samples for each model was counted for the four categories in the confusion matrix: false positives (FP), true positives (TP), false negatives (FN), and true negatives (TN). TP indicates samples correctly classified as positive, while FP represents samples incorrectly classified as positive when they actually belong to the negative class. Similarly, FN denotes samples incorrectly classified as negative when they actually belong to the positive class, whereas TN indicates samples correctly classified as negative. Using the number of samples in each of these four categories, precision, recall, and F1-score were computed to evaluate the model’s performance. Precision is the ratio of true positives (TP) to the total number of positive predictions made by the model, reflecting how accurately the model predicts true samples. Recall is the ratio of true positives (TP) to all actual positive cases, indicating the model’s effectiveness in identifying all true cases of the positive class accurately. The F1-score, which combines both precision and recall, is calculated as their harmonic mean. A high F1-score indicates that the model has a good balance between precision and recall, with a perfect model achieving an F1-score of 1 [61,62].
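For reference, precision = TP/(TP + FP), recall = TP/(TP + FN), and the F1-score is the harmonic mean of the two. The snippet below is an illustrative way to compute these per-class metrics with scikit-learn; the toy label lists are placeholders for the true and predicted class indices of the test set.

```python
# Illustrative per-class evaluation using a confusion matrix and derived metrics.
from sklearn.metrics import classification_report, confusion_matrix

def evaluate_predictions(y_true, y_pred):
    print(confusion_matrix(y_true, y_pred))                 # rows: true class, columns: predicted class
    print(classification_report(y_true, y_pred, digits=3))  # per-class precision, recall, F1-score

# Toy example with three classes:
evaluate_predictions([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0])
```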
In this study, we evaluated and validated the performance of the model through k-fold cross-validation and assessed the model’s ability and applicability using a separate validation set. To evaluate the generalization performance of our model, we employed k-fold cross-validation with k = 5. The dataset was divided into five subsets, and for each iteration, four subsets were used for training, while the remaining subset was used for validation. The process was repeated five times, and each performance was averaged to report the overall accuracy.
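A sketch of this 5-fold procedure is shown below. It assumes that `images` is a NumPy array of preprocessed grip posture images, `labels` holds one-hot encoded class labels, and `build_grip_classifier` is the model-construction helper sketched in Section 3.7; the random seed is arbitrary.

```python
# Sketch of 5-fold cross-validation over the preprocessed dataset.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(images, labels, epochs=8, batch_size=32):
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)   # random_state chosen arbitrarily
    accuracies = []
    for train_idx, val_idx in kfold.split(images):
        model = build_grip_classifier(num_classes=7)            # fresh model for each fold
        model.fit(images[train_idx], labels[train_idx],
                  batch_size=batch_size, epochs=epochs, verbose=0)
        _, acc = model.evaluate(images[val_idx], labels[val_idx], verbose=0)
        accuracies.append(acc)
    return float(np.mean(accuracies))                           # average accuracy over the five folds
```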

4. Results

For grip posture classification, three individual classification models (MobileNetV2, ResNet-50, and Inception V3) and an ensemble model of the three models were established. Their overall accuracies were 92.5% for MobileNetV2, 94.3% for ResNet-50, 85.9% for Inception V3, and 95.9% for the ensemble model. First, as shown in Table 2, the accuracy for each class based on MobileNetV2 was as follows: 93.2% for L2-B1-R1-K1, 76.3% for L2-R1-K2, 80.0% for L2-T1-B1-R1, 98.5% for L3-B1-R1, 94.3% for L3-R1-K1, 90.5% for L3-R1-T1, and 96.2% for L4-R1 (Figure 11 and Figure 12).
Second, as shown in Table 3, the accuracy for each class based on ResNet-50 was as follows: 93.2% for L2-B1-R1-K1, 84.2% for L2-R1-K2, 93.3% for L2-T1-B1-R1, 100% for L3-B1-R1, 94.4% for L3-R1-K1, 92.9% for L3-R1-T1, and 96.2% for L4-R1 (Figure 13 and Figure 14).
Third, as shown in Table 4, the accuracy for each class based on Inception V3 was as follows: 89.8% for L2-B1-R1-K1, 52.6% for L2-R1-K2, 60.0% for L2-T1-B1-R1, 97.1% for L3-B1-R1, 89.4% for L3-R1-K1, 88.1% for L3-R1-T1, and 86.5% for L4-R1 (Figure 15 and Figure 16).
Lastly, as shown in Table 5, the accuracy for each class based on the ensemble model was as follows: 93.2% for L2-B1-R1-K1, 97.3% for L2-R1-K2, 100% for L2-T1-B1-R1, 98.6% for L3-B1-R1, 95.1% for L3-R1-K1, 92.9% for L3-R1-T1, and 98.1% for L4-R1.
As a result of k-fold cross-validation, the average accuracy for MobileNetV2 was 93.5%, with the results for each fold as follows: 94.3% for k = 1; 92.3% for k = 2; 93.3% for k = 3, 94.5% for k = 4; 93.1% for k = 5. For ResNet-50, the average accuracy was 94.0%, with the results for each fold as follows: 94.1% for k = 1; 96.9% for k = 2; 92.2% for k = 3; 93.5% for k = 4; and 93.1% for k = 5. For Inception V3, the average accuracy was 92.3%, with the results for each fold as follows: 95.7% for k = 1; 88.5% for k = 2; 94.7% for k = 3; 93.3% for k = 4; and 89.0% for k = 5.

5. Discussion

The present study developed a machine learning-based model to detect grip posture images while using a smartphone UI and to classify grip postures based on the number of fingers on each side of the smartphone. Previous studies on detecting and classifying grip postures manually differentiated grip postures rather than using machine learning algorithms or focused on grip postures in non-smartphone usage scenarios [1,8,63]. Additionally, methods such as using hand gloves or attaching markers made it difficult to induce natural hand movements in users. The grip posture recognition model developed in the present study ensures the user’s natural hand movements by utilizing web cameras without requiring additional devices on the user’s hand, employing machine learning methods to process and classify the image data. To identify various grip posture types, the grip posture experiment utilized nine different smartphone mock-ups with varying lengths, widths, and weights, along with four different usage scenarios, collecting data from participants with a range of hand sizes and hand widths. In the preprocessing phase, utilizing a combination of hand contour and skin color improved the accuracy of grip posture classification. Additionally, classifying grip postures based on the number of fingers positioned on different sides of the smartphone enabled detailed recognition of grip postures at the finger position level.
The image detection and classification model developed in the present study is effective in identifying the user’s grip posture, with an accuracy of 95.9%. The overall accuracies of the individual models used in the present study were 90.6% for MobileNetV2, 94.3% for ResNet-50, and 85.9% for Inception V3. The ensemble classification model achieved an accuracy of 95.9%, higher than that of the individual models. The ensemble technique combines the classification outputs of individual models to attain improved accuracy in image classification [8,10,63]. Chung et al. [12] used three models (VGG19, ResNet-50, and MobileNetV2) to detect and classify 29 hand gestures in American Sign Language (ASL). The accuracies of the individual models were 99.4% for VGG19, 98.2% for ResNet-50, and 88.1% for MobileNetV2. When the models were ensembled, the accuracy increased to 99.7%, demonstrating improved accuracy with the ensemble technique compared to the individual models. The RGB-based algorithm for tracking hands using skin color has the advantage of accurately tracking hands when there is no background noise or when the skin color of the data used is consistent. However, challenges arise when the background contains colors similar to skin tones or when there is variability in skin color, necessitating adjustments to the algorithm’s settings [64]. The data preprocessing method in this study focuses on identifying hand contours (hand shapes) through skin color, allowing the model to be applied regardless of skin color. Although Inception V3 was trained for more epochs than the other models, it demonstrated relatively lower classification accuracy. This may be attributed to its complex multi-branch convolutional architecture and large number of parameters, which likely require larger datasets to generalize effectively. In contrast, MobileNetV2 and ResNet-based models, with their lightweight or residual architectures, appear to generalize more efficiently on small, shape-oriented datasets such as grip posture images. Ariefwan et al. [65] explored suitable combinations of models and optimizers for facial recognition by testing MobileNetV2, Inception V3, and ResNet50 in combination with Adam, SGD, and RMSprop. Their results indicated that Inception V3 consistently yielded lower performance across all optimizers.
The classification model developed in this study may be utilized to identify the user’s smartphone grip posture and may serve as reference for designing ergonomic smartphone interfaces. The grip posture classification model developed in the present study may be utilized to identify the frequency of grip posture usage, which can then be referenced for designing optimal interface positions. For example, Choi et al. (2020) used frequency of grip posture usage information to determine the optimal UI positions, ensuring that preferred UI locations in frequently used grip postures were given higher importance compared to those in less frequently used postures [1]. The classification model used in this study may be applied to smartphones as well as to products operated with one hand to identify and classify grip postures. For instance, this model may identify and classify grip postures when using household appliances, allowing for the identification of frequently used grip postures to be referenced for interface design. By using the model developed in this study to identify users’ grip postures, it can contribute to designing product interfaces with high usability [66].
Lastly, the grip posture identification experiments in this study were performed in a laboratory environment using mock-ups, and the background of the image data needs to be evaluated in various environments. Although the smartphone grip posture image data in the present study were collected using mock-ups based on actual smartphone specifications, it is necessary to evaluate the model using commercially available smartphones to reflect real-world usage scenarios. Additionally, grip posture image data collected in this study were controlled with a black background. To ensure the model’s applicability in real-world situations, it needs to be evaluated using image data detected in various backgrounds [67,68]. Also, the unequal amounts of data for the seven grip postures are thought to have affected the model’s accuracy. Ensuring balanced data collection for each posture in future studies may improve the model’s accuracy. To address the issue of data imbalance, techniques such as applying class weighting during training, oversampling minority classes, or using synthetic data augmentation may be considered. Although the model achieved relatively high classification accuracy, the number of training epochs used for some models (8–15) was somewhat limited. These values were selected based on preliminary experiments to avoid overfitting, but future work may involve more systematic hyperparameter tuning—including adjustments to the number of epochs—to further improve model performance and generalizability.
Future studies may aim to develop grip posture classification models that better reflect real-world usage environments, expand the range of recognizable grip postures, and apply the findings to UI/UX optimization in home appliances. Improving the model’s preprocessing methods and training it with images captured in diverse backgrounds may enable robust grip posture classification even in visually complex environments. This approach is expected to enhance the model’s practical applicability in real-world usage scenarios. In addition, collecting data under various lighting conditions and from individuals with diverse skin tones may help improve classification accuracy in real-world environments. Although this study classified seven types of grip postures, future research may focus on developing models capable of recognizing a wider range of grip types, including those involving two-handed usage. Collecting balanced data across all categories might contribute to improved classification performance and generalizability. In addition, exploring sequential analysis techniques that can better capture the temporal patterns of grip posture data may improve the robustness and practical applicability of the model in real-world environments. The proposed grip posture classification model may be adapted for specific product domains such as remote controls or handheld vacuum cleaners. For instance, the identification of frequently used grip positions may inform the ergonomic placement of buttons or control elements in consumer electronics. Furthermore, analyzing grip patterns in assistive devices for individuals with motor impairments may support the development of more inclusive interface designs.

6. Conclusions

The present study developed a machine learning-based model for classifying smartphone grip postures with a high degree of accuracy. The ensemble model, combining MobileNetV2 (90.6%), ResNet-50 (94.2%), and Inception V3 (85.9%), achieved a classification accuracy of 95.9%, outperforming the individual models. This model leverages hand contour and skin color for preprocessing, enabling detailed recognition of grip postures based on finger positioning. The developed model offers potential for designing ergonomic smartphone interfaces by identifying frequently used grip postures. While the study demonstrates promising results, limitations exist, including the controlled laboratory environment and the use of mock-ups. Future research should focus on evaluating the model with commercially available smartphones in diverse real-world settings and addressing data imbalances across grip postures. Further work may also involve more systematic hyperparameter tuning to improve model performance and generalizability. Expanding the range of recognizable grip postures and applying the findings to UI/UX optimization in other handheld devices are also promising avenues for future exploration.

Author Contributions

Conceptualization, D.K., Y.C. and H.Y.; methodology, D.K., Y.C. and H.Y.; software, all coauthors; validation, D.K., X.C. and Y.L.; formal analysis, D.K., X.C., Y.L., A.S.M. and E.K.; data curation, D.K. and Y.C.; writing—original draft preparation, D.K. and H.Y.; writing—review and editing, all coauthors; visualization, D.K., Y.L. and H.Y.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by the research programs (2017M3C1B6070526; 2018R1A2A2A 05023299) through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (MEST), (No. 10063384; R0004840, 2017), those of the Ministry of Trade, Industry, and Energy (MOTIE) under Industrial Technology Innovation Program, and the Biomedical Research Institute Fund, Chonbuk National University Hospital.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board (IRB) of the Pohang University of Science and Technology (PIRB-2018-E100).

Informed Consent Statement

Informed consent was obtained from all participants in the study.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

ChatGPT-4o was used for proofreading the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Choi, Y.; Yang, X.; Park, J.; Lee, W.; You, H. Effects of Smartphone Size and Hand Size on Grip Posture in One-Handed Hard Key Operations. Appl. Sci. 2020, 10, 8374. [Google Scholar] [CrossRef]
  2. Wobbrock, J.; Myers, B.; Aung, H. The Performance of Hand Postures in Front- and Back-of-Device Interaction for mobile computing. Int. J. Hum. Comput. Stud. 2008, 66, 857–875. [Google Scholar] [CrossRef]
  3. Finneran, A.; O’Sullivan, L. Effects of Grip Type and Wrist Posture on Forearm EMG Activity, Endurance Time and Movement Accuracy. Int. J. Ind. Ergonom. 2013, 43, 91–99. [Google Scholar] [CrossRef]
  4. Lee, S.; Cha, M.; Hwangbo, H.; Mo, S.; Ji, G. Smartphone Form Factors: Effects of Width and Bottom Bezel on Touch Performance, Workload, and Physical Demand. Appl. Ergon. 2018, 67, 142–150. [Google Scholar] [CrossRef]
  5. Kietrys, D.; Gerg, M.; Dropkin, J.; Gold, J. Mobile Input Device Type, Texting Style and Screen Size Influence Upper Extremity and Trapezius Muscle Activity, and Cervical Posture While Texting. Appl. Ergon. 2015, 50, 98–104. [Google Scholar] [CrossRef]
  6. Trudeau, M.B.; Asakawa, D.S.; Jindrich, D.L.; Dennerlein, J.T. Two-Handed Grip on a Mobile Phone Affords Greater Thumb Motor Performance, Decreased Variability, and a More Extended Thumb Posture Than a One-Handed Grip. Appl. Ergon. 2016, 52, 24–28. [Google Scholar] [CrossRef]
  7. Karlson, A.K.; Bederson, B.B.; Contreras-Vidal, J.L. Understanding One-Handed Use of Mobile Devices. In Handbook of Research on User Interface Design and Evaluation for Mobile Technology; Lumsden, J., Ed.; IGI Global: Hershey, PA, USA, 2008; pp. 86–101. ISBN 978-159-904-871-0. [Google Scholar]
  8. Coleca, F.; State, A.; Klement, S.; Barth, E.; Martinetz, T. Self-Organizing Maps for Hand and Full Body Tracking. Neurocomputing 2015, 147, 174–184. [Google Scholar] [CrossRef]
  9. Achenbach, P.; Laux, S.; Purdack, D.; Müller, P.N.; Göbel, S. Give Me a Sign: Using Data Gloves for Static Hand-Shape Recognition. Sensors 2023, 23, 9847. [Google Scholar] [CrossRef]
  10. Haria, A.; Subramanian, A.; Asokkumar, N.; Poddar, S.; Nayak, J.S. Hand Gesture Recognition for Human Computer Interaction. Procedia. Comput. Sci. 2017, 115, 367–374. [Google Scholar] [CrossRef]
  11. Singh, A.; Singh, A.; Rani, R.; Dev, A.; Sharma, A. Hand Gesture Detection Using Convexity Hull and Convolutional Neural Network. In Proceedings of the 2022 International Conference on Machine Learning, Computer Systems and Security (MLCSS), Bhubaneswar, India, 5–6 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 105–110. [Google Scholar]
  12. Chung, H.X.; Hameed, N.; Clos, J.; Hasan, M.M. A Framework of Ensemble CNN Models for Real-Time Sign Language Translation. In Proceedings of the 14th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), Phnom Penh, Cambodia, 2–4 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 27–32. [Google Scholar]
  13. Nigam, S.; Shamoon, M.; Sakshi, D.; Choudhury, T. A Complete Study of Methodology of Hand Gesture Recognition System for Smart Homes. In Proceedings of the 2019 International Conference on Contemporary Computing and Informatics (IC3I), Amity Global Institute, Singapore, 12–14 December 2019; Niranjan, S.K., Rana, A., Khurana, H., Eds.; IEEE: Piscataway, NJ, USA, 2019; pp. 289–294. [Google Scholar]
  14. Dua, M.; Shakshi; Singla, R.; Raj, S.; Jangra, A. Deep CNN Models-Based Ensemble Approach to Driver Drowsiness Detection. Neural Comput. Appl. 2020, 33, 3155–3168. [Google Scholar] [CrossRef]
  15. Phung, V.H.; Rhee, E.J. A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets. Appl. Sci. 2019, 9, 4500. [Google Scholar] [CrossRef]
  16. Alnuaim, A.; Zakariah, M.; Hatamleh, W.A.; Tarazi, H.; Tripathi, V.; Amoatey, E.T. Human-Computer Interaction with Hand Gesture Recognition Using ResNet and MobileNet. Comput. Intell. Neurosci. 2022, 2022, 8777355. [Google Scholar] [CrossRef] [PubMed]
  17. Tang, C.; Ou, Y.; Jiang, G.; Xie, Q.; Xu, Y. Hand Tracking and Pose Recognition via Depth and Color Information. In Proceedings of the 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), Guangzhou, China, 11–14 December 2012; pp. 1104–1109. [Google Scholar]
  18. Hsu, R.L.; Abdel-Mottaleb, M.; Jain, A.K. Face Detection in Color Images. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 696–706. [Google Scholar] [CrossRef]
  19. Rahmat, R.F.; Chairunnisa, T.; Gunawan, D.; Pasha, M.F.; Budiarto, R. Hand Gestures Recognition with Improved Skin Color Segmentation in Human-Computer Interaction Applications. J. Theor. Appl. Inf. Technol. 2019, 97, 727–739. [Google Scholar]
  20. Yörük, E.; Konukoğlu, E.; Sankur, B.; Darbon, J. Shape-Based Hand Recognition. IEEE Trans. Image Process. 2006, 15, 1803–1815. [Google Scholar] [CrossRef]
  21. Yao, Y.; Fu, Y. Contour Model-Based Hand-Gesture Recognition Using the Kinect Sensor. IEEE Trans. Circuits Syst. Video Technol. 2014, 24, 1935–1944. [Google Scholar] [CrossRef]
  22. Minnen, D.; Zafrulla, Z. Towards Robust Cross-User Hand Tracking and Shape Recognition. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1235–1241. [Google Scholar]
  23. Sarma, D.; Bhuyan, M.K. Hand Detection by Two-Level Segmentation with Double-Tracking and Gesture Recognition Using Deep-Features. Sens. Imaging 2022, 23, 9. [Google Scholar] [CrossRef]
  24. Qi, J.; Xu, K.; Ding, X. Approach to hand posture recognition based on hand shape features for human–robot interaction. Complex. Intell. Syst. 2022, 8, 2825–2842. [Google Scholar] [CrossRef]
  25. Ansar, H.; Al Mudawi, N.A.; Alotaibi, S.S.; Alazeb, A.; Alabdullah, B.I.; Alonazi, M.; Park, J. Hand Gesture Recognition for Characters Understanding Using Convex Hull Landmarks and Geometric Features. IEEE Access 2023, 11, 82065–82078. [Google Scholar] [CrossRef]
  26. Dewi, C.; Chen, A.P.S.; Christanto, H.J. Deep Learning for Highly Accurate Hand Recognition Based on Yolov7 Model. Big Data Cogn. Comput. 2023, 7, 53. [Google Scholar] [CrossRef]
  27. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–29 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4510–4520. [Google Scholar]
  28. Wu, X.; Luo, Z.; Xu, H. Recognition of Pear Leaf Disease Under Complex Background Based on DBPNet and Modified MobilenetV2. IET Image Process. 2023, 17, 3055–3067. [Google Scholar] [CrossRef]
  29. Sun, Q.; Luo, X. A New Image Recognition Combining Transfer Learning Algorithm and MobileNet V2 Model for Palm Vein Recognition. In Proceedings of the 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC), Qingdao, China, 2–4 December 2022; pp. 559–564. [Google Scholar]
  30. Dabwan, B.A.; Jadhav, M.E.; Al Yami, M.; Hassan, E.A.; Almula, S.M.; Ali, Y.A. Classifying Hand Gestures for People with Disabilities Utilizing the MobileNetV2 Model. In Proceedings of the 2024 1st International Conference on Innovative Sustainable Technologies for Energy, Mechatronics, and Smart Systems (ISTEMS), Dehradun, India, 26–27 April 2024; pp. 1–4. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  32. Doma, G.; Miriyala, T. Object Detection Using ResNet50. Int. J. Creat. Res. Thoughts 2024, 12, 385–394. [Google Scholar]
  33. Yildirim, M.E.; Salman, Y.B.; Genc, E. Person Identification by Using ResNet on Hand Images. In Proceedings of the 2024 24th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 29 October–1 November 2024; pp. 1345–1348. [Google Scholar]
  34. Li, Z. Practice of Gesture Recognition Based on Resnet50. J. Phys. Conf. Ser. 2020, 1574, 012154. [Google Scholar] [CrossRef]
  35. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2818–2826. [Google Scholar]
  36. Hussain, A.; Amin, S.U.; Fayaz, M.; Seo, S. An Efficient and Robust Hand Gesture Recognition System of Sign Language Employing Finetuned Inception-V3 and EfficientNet-B0 Network. Comput. Syst. Sci. Eng. 2023, 46, 3509–3525. [Google Scholar] [CrossRef]
  37. Karsh, B.; Laskar, R.H.; Karsh, R.K. mIV3Net: Modified Inception V3 Network for Hand Gesture Recognition. Multimed. Tools Appl. 2024, 83, 10587–10613. [Google Scholar] [CrossRef]
  38. Anusha, S.B.; Samyama Gunjal, G.H.; Manjushree, N.S. Static Hand Gesture Prediction Using Inception V3. In Cognitive Science and Technology, Proceedings of the International Conference on Cognitive and Intelligent Computing, Hyderabad, India, 11–12 December 2021; Springer: Singapore, 2021; pp. 121–133. [Google Scholar] [CrossRef]
  39. Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A Survey on Ensemble Learning. Front. Comput. Sci. 2019, 14, 241–258. [Google Scholar] [CrossRef]
  40. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; Chapman and Hall/CRC: Boca Raton, FL, USA, 2012. [Google Scholar]
  41. Sen, A.; Mishra, T.K.; Dash, R. A Novel Hand Gesture Detection and Recognition System Based on Ensemble-Based Convolutional Neural Network. Multimed. Tools Appl. 2022, 81, 40043–40066. [Google Scholar] [CrossRef]
  42. Ewe, E.L.R.; Lee, C.P.; Kwek, L.C.; Lim, K.M. Hand Gesture Recognition via Lightweight VGG16 and Ensemble Classifier. Appl. Sci. 2022, 12, 7643. [Google Scholar] [CrossRef]
  43. Rahim, M.A.; Shin, J.; Yun, K.S. Soft Voting-based Ensemble Model for Bengali Sign Gesture Recognition. Ann. Emerg. Technol. Comput. 2022, 6, 41–49. [Google Scholar] [CrossRef]
  44. Jabbar, H.G. Advanced Threat Detection Using Soft and Hard Voting Techniques in Ensemble Learning. J. Robot. Control. 2024, 5, 1104–1116. [Google Scholar] [CrossRef]
  45. Size Korea Report on the Fifth Survey of Korean Anthropometry. Available online: http://sizekorea.kats.go.kr/ (accessed on 2 May 2013).
  46. Oldfield, R.C. The Assessment and Analysis of Handedness: The Edinburgh Inventory. Neuropsychologia 1971, 9, 97–113. [Google Scholar] [CrossRef] [PubMed]
  47. Achilleas, M.; Eleni, D.; Paris-Alexandros, K.; Minas, D. Real-Time Detection of Suspicious Objects in Public Areas Using Computer Vision. In Proceedings of the 21st Pan-Hellenic Conference on Informatics, Larissa, Greece, 28–30 September 2017; ACM: New York, NY, USA, 2017; pp. 1–2. [Google Scholar]
  48. Harini, V.; Prahelika, V.; Sneka, I.; Ebenezer, P.A. Hand Gesture Recognition Using OpenCV and Python. In New Trends in Computational Vision and Bio-Inspired Computing; Smys, S., Iliyasu, A.M., Bestak, R., Shi, F., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 1711–1719. [Google Scholar]
  49. OpenCV Team. Open Source Computer Vision Library. Available online: http://opencv.org/ (accessed on 20 August 2024).
  50. Shiqiang, Y.; Dan, Q.; Peilei, L. Research on Hand Recognition Method Based on Markov Random Field. Procedia Eng. 2017, 174, 482–488. [Google Scholar] [CrossRef]
  51. Chung, H.Y.; Chung, Y.L.; Tsai, W.F. An Efficient Hand Gesture Recognition System Based on Deep CNN. In Proceedings of the 2019 IEEE International Conference on Industrial Technology (ICIT), Melbourne, VIC, Australia, 13–15 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 853–858. [Google Scholar]
  52. Baumgartl, H.; Sauter, D.; Schenk, C.; Atik, C.; Buettner, R. Vision-Based Hand Gesture Recognition for Human-Computer Interaction Using MobileNetV2. In Proceedings of the 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 12–16 July 2021; Chan, W.K., Ed.; IEEE: Piscataway, NJ, USA, 2021; pp. 1667–1674. [Google Scholar]
  53. Dong, K.; Zhou, C.; Ruan, Y.; Li, Y. MobileNetV2 Model for Image Classification. In Proceedings of the 2020 2nd International Conference on Information Technology and Computer Application (ITCA), Guangzhou, China, 18–20 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 476–480. [Google Scholar]
  54. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed]
  55. Alnuaim, A.A.; Zakariah, M.; Shashidhar, C.; Hatamleh, W.A.; Tarazi, H.; Shukla, P.K.; Ratna, R. Speaker Gender Recognition Based on Deep Neural Networks and ResNet50. Wirel. Commun. Mob. Comput. 2022, 2022, 4444388. [Google Scholar] [CrossRef]
  56. Pan, Y.; Liu, J.; Cai, Y.; Yang, X.; Zhang, Z.; Long, H.; Zhao, K.; Yu, X.; Zeng, C.; Duan, J.; et al. Fundus Image Classification Using Inception V3 and ResNet-50 for the Early Diagnostics of Fundus Diseases. Front. Physiol. 2023, 14, 1126780. [Google Scholar] [CrossRef]
  57. Yuan, J.; Fan, Y.; Lv, X.; Chen, C.; Li, D.; Hong, Y.; Wang, Y. Research on the Practical Classification and Privacy Protection of CT Images of Parotid Tumors Based on ResNet50 Model. J. Phys. Conf. Ser. 2020, 1576, 012040. [Google Scholar] [CrossRef]
  58. Basnin, N.; Sumi, T.A.; Hossain, M.S.; Andersson, K. Early Detection of Parkinson’s Disease from Micrographic Static Hand Drawings. In Proceedings of the Brain Informatics 2021, Virtual Event, 17–19 September 2021; Mahmud, M., Kaiser, M.S., Vassanelli, S., Dai, Q., Zhong, N., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 433–447. [Google Scholar]
  59. Afroze, T.; Akther, S.; Chowdhury, M.A.; Hossain, E.; Hossain, M.S.; Andersson, K. Glaucoma Detection Using Inception Convolutional Neural Network V3. In Proceedings of the Applied Intelligence and Informatics 2021, Nottingham, UK, 30–31 July 2021. [Google Scholar]
  60. Tebaldi, C.; Knutti, R. The Use of the Multi-Model Ensemble in Probabilistic Climate Projections. Philos. Trans. R. Soc. A 2007, 365, 2053–2075. [Google Scholar] [CrossRef]
  61. Lipton, Z.C.; Elkan, C.; Naryanaswamy, B. Optimal Thresholding of Classifiers to Maximize F1 Measure. In Proceedings of the Machine Learning and Knowledge Discovery in Databases European Conference, ECML PKDD 2014, Nancy, France, 15–19 September 2014; pp. 225–239. [Google Scholar]
  62. Paumgartner, D.; Losa, G.; Weibel, E.R. Resolution Effect on the Stereological Estimation of Surface and Volume and Its Interpretation in Terms of Fractal Dimensions. J. Microsc. 1981, 121, 51–63. [Google Scholar] [CrossRef]
  63. Chattaraj, R.; Khan, S.; Roy, D.G.; Bepari, B.; Bhaumik, S. Vision-Based Human Grasp Reconstruction Inspired by Hand Postural Synergies. Comput. Electr. Eng. 2018, 70, 702–721. [Google Scholar] [CrossRef]
  64. Yan, H.; Liu, Y.; Wang, X.; Li, M.; Li, H. A Face Detection Method Based on Skin Color Features and AdaBoost Algorithm. J. Phys. Conf. Ser. 2021, 1748, 042015. [Google Scholar] [CrossRef]
  65. Ariefwan, M.R.M.; Diyasa, I.G.S.M.; Hindrayani, K.M. InceptionV3, ResNet50, ResNet18 and MobileNetV2 Performance Comparison on Face Recognition Classification. Literasi Nusant. 2023, 4, 1–10. [Google Scholar] [CrossRef]
  66. Zhang, L.; Sun, M.; Pei, Y. Innovative Design of Aging-Friendly Household Cleaning Products from the Perspective of Ergonomics. In Proceedings of the 25th International Conference on Human-Computer Interaction (HCII 2023), Copenhagen, Denmark, 23–28 July 2023; pp. 295–312. [Google Scholar]
  67. Hu, Y.; Wang, P.; Zhao, F.; Liu, J. Low-frequency Background Estimation and Noise Separation from High-frequency for Background and Noise Subtraction. Appl. Opt. 2024, 63, 283–289. [Google Scholar] [CrossRef]
  68. Kottha, B.R.T.N.; Penumacha, N.K.; Gadde, H.V. Smart Traffic Management System using Background Subtraction. In Proceedings of the 2024 International Conference on Expert Clouds and Applications (ICOECA), Bengaluru, India, 18–19 April 2024; pp. 511–518. [Google Scholar] [CrossRef]
Figure 1. Hand size distribution of participants by gender.
Figure 2. Diagram of a smartphone mock-up and specifications of the nine smartphone mock-ups used in the grip posture experiment.
Figure 3. Camera setup for grip posture experiment and views from the upper and lower cameras.
Figure 4. An example of grip posture classification based on the number of fingers positioned on different parts of the smartphone.
Figure 5. Preprocessing results of grip posture images.
Figure 6. The same grip posture type (L4-R1) for the left hand and the right hand.
Figure 7. MobileNetV2 architecture [27].
Figure 8. ResNet-50 architecture [31].
Figure 9. Inception V3 architecture [35,59].
Figure 10. Ensemble model including MobileNetV2, ResNet-50, and Inception V3.
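For readers who want a concrete picture of the ensemble in Figure 10, the sketch below illustrates a soft-voting combination of three classifiers' softmax outputs (cf. [43,44]). It is a minimal illustration only: the array names and the randomly generated placeholder probabilities are assumptions, and the averaging rule shown here is not necessarily the exact combination scheme used in the paper.

```python
import numpy as np

# Placeholder softmax outputs for a small batch of grip-posture images
# (rows = images, columns = the seven grip-posture classes). Random values
# keep the sketch self-contained; in practice these arrays would come from
# MobileNetV2, ResNet-50, and Inception V3 fine-tuned on the grip-posture data.
rng = np.random.default_rng(42)

def placeholder_softmax(n_images: int = 4, n_classes: int = 7) -> np.ndarray:
    logits = rng.normal(size=(n_images, n_classes))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

probs_mobilenet_v2 = placeholder_softmax()
probs_resnet_50 = placeholder_softmax()
probs_inception_v3 = placeholder_softmax()

# Soft voting: average the per-class probabilities of the three models and
# predict the grip-posture class with the highest mean probability.
mean_probs = (probs_mobilenet_v2 + probs_resnet_50 + probs_inception_v3) / 3.0
predicted_classes = mean_probs.argmax(axis=1)
print(predicted_classes)  # indices 0-6 correspond to the seven grip-posture types
```

Hard (majority) voting would instead take the most frequent of the three models' argmax predictions; either rule fits the structure shown in Figure 10.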
Figure 11. ROC curve for MobileNetV2.
Figure 12. Learning curve for MobileNetV2.
Figure 13. ROC curve for ResNet-50.
Figure 14. Learning curve for ResNet-50.
Figure 15. ROC curve for Inception V3.
Figure 16. Learning curve for Inception V3.
Table 1. Seven types of grip posture and the number of image data.
No.    Grip Posture Name    Number of Data
1      L2-B1-R1-K1          511
2      L2-R1-K2             279
3      L2-T1-B1-R1          120
4      L3-B1-R1             556
5      L3-R1-K1             981
6      L3-R1-T1             311
7      L4-R1                520
       Total                3278
Table 2. Classification accuracies for the seven classes of the test data using MobileNetV2.
Grip Posture            Accuracy (%)    Precision    Recall    F1-Score
Class 1: L2-B1-R1-K1    93.2            0.93         0.96      0.93
Class 2: L2-R1-K2       76.3            0.97         0.73      0.85
Class 3: L2-T1-B1-R1    80.0            1.00         0.90      0.89
Class 4: L3-B1-R1       98.5            0.96         0.99      0.97
Class 5: L3-R1-K1       94.3            0.90         0.94      0.92
Class 6: L3-R1-T1       90.5            0.93         0.90      0.92
Class 7: L4-R1          96.2            0.91         0.93      0.93
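The precision, recall, and F1-score columns in Tables 2–5 follow the standard per-class definitions, where TP, FP, and FN denote a class's true positives, false positives, and false negatives:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
$$

As a quick check, for Class 4 (L3-B1-R1) in Table 2, $F_1 = \dfrac{2 \times 0.96 \times 0.99}{0.96 + 0.99} \approx 0.97$, which matches the reported F1-score.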
Table 3. Classification accuracies for the seven classes of the test data using ResNet-50.
Grip Posture            Accuracy (%)    Precision    Recall    F1-Score
Class 1: L2-B1-R1-K1    93.2            0.95         0.93      0.94
Class 2: L2-R1-K2       84.2            0.91         0.84      0.88
Class 3: L2-T1-B1-R1    93.3            1.00         0.93      0.97
Class 4: L3-B1-R1       100             0.97         1.00      0.99
Class 5: L3-R1-K1       94.4            0.93         0.94      0.94
Class 6: L3-R1-T1       92.9            0.93         0.93      0.93
Class 7: L4-R1          96.2            0.94         0.96      0.95
Table 4. Classification accuracies for the seven classes of the test data using Inception V3.
Grip Posture            Accuracy (%)    Precision    Recall    F1-Score
Class 1: L2-B1-R1-K1    89.8            0.95         0.90      0.92
Class 2: L2-R1-K2       52.6            0.87         0.53      0.66
Class 3: L2-T1-B1-R1    60.0            0.90         0.60      0.72
Class 4: L3-B1-R1       97.1            0.97         0.97      0.97
Class 5: L3-R1-K1       89.4            0.77         0.89      0.82
Class 6: L3-R1-T1       88.1            1.00         0.88      0.94
Class 7: L4-R1          86.5            0.80         0.87      0.83
Table 5. Classification accuracies for the seven classes of the test data using the ensemble model.
Grip Posture            Accuracy (%)    Precision    Recall    F1-Score
Class 1: L2-B1-R1-K1    93.2            0.95         0.93      0.94
Class 2: L2-R1-K2       97.3            0.93         0.97      0.95
Class 3: L2-T1-B1-R1    100             1.00         1.00      1.00
Class 4: L3-B1-R1       98.6            0.97         0.99      0.98
Class 5: L3-R1-K1       95.1            0.96         0.95      0.96
Class 6: L3-R1-T1       92.9            0.98         0.93      0.95
Class 7: L4-R1          98.1            0.98         0.98      0.98
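Tables 2–5 report the usual per-class metrics computed from predicted versus true labels on the test images. The snippet below is a generic, minimal sketch of how such a table (and an overall accuracy such as the ensemble's 95.9%) can be produced with scikit-learn; the label arrays and class ordering are placeholders, not the authors' evaluation code.

```python
from sklearn.metrics import accuracy_score, classification_report

# The seven grip-posture classes from Table 1.
class_names = ["L2-B1-R1-K1", "L2-R1-K2", "L2-T1-B1-R1",
               "L3-B1-R1", "L3-R1-K1", "L3-R1-T1", "L4-R1"]

# Placeholder ground-truth and predicted labels for a handful of test images;
# in practice these would cover the full held-out test split.
y_true = [0, 1, 2, 3, 4, 5, 6, 4, 3, 0]
y_pred = [0, 1, 2, 3, 4, 5, 6, 4, 3, 1]

# Per-class precision, recall, and F1-score, analogous to Tables 2-5.
print(classification_report(y_true, y_pred, target_names=class_names, zero_division=0))

# Overall accuracy, analogous to the model-level accuracies reported in the text.
print(f"Overall accuracy: {accuracy_score(y_true, y_pred):.3f}")
```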