Article

YOLOv8n-RF: A Dynamic Remote Control Finger Recognition Method for Suppressing False Detection

College of Computer Science and Engineering, Xi’an Technological University, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(9), 2768; https://doi.org/10.3390/s25092768
Submission received: 18 March 2025 / Revised: 20 April 2025 / Accepted: 25 April 2025 / Published: 27 April 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

Gesture interaction is a novel human-computer interaction method for smart TVs. To address the problems of false detection and high detection cost in gesture recognition algorithms used for gesture interaction, this paper proposes the YOLOv8n-Remote Finger (YOLOv8n-RF) algorithm for dynamic remote control finger detection. The algorithm uses the CRVB-DSConvEMA module in the feature extraction network, adopts the SPPF-DSConvEMA module in the downsampling process, and introduces BiFPN in the Neck layer. Experiments conducted on the self-built Remote Finger dataset and the public HaGRID dataset show that, compared with the YOLOv8n algorithm, the proposed YOLOv8n-RF algorithm improves the mean Average Precision (mAP) by 1.23% and 0.84%, respectively. In addition, the model size is reduced by 2.49 M, the GFLOPs are decreased by 1.7, and the false detection rate is lowered by 22%. The YOLOv8n-RF algorithm meets the requirements of low cost and low complexity, which helps reduce false control operations on smart TVs.

1. Introduction

Through the integration of artificial intelligence technology, smart TVs have gained a number of functions, including voice control and gesture recognition. Although gesture interaction technology has great potential and is intuitive and natural, it is not yet as popular as voice interaction, and false detections in gesture interaction can lead to false control operations, degrading the user experience. Therefore, it is of great significance to study gesture interaction for smart TVs, reduce the gesture false detection rate, and achieve universal applicability.
Regarding the research on gesture interaction in smart TVs, Sang-Heon Lee's team [1] proposed a gesture recognition method based on a single camera. This method integrates the Adaboost algorithm, the KLT (Kanade-Lucas-Tomasi) tracker, and the SVM algorithm, achieving a high recognition rate for five types of gestures. With further research, S. Lian's team [2] addressed the issue of power consumption in gesture recognition modules and camera devices by proposing a solution aimed at reducing the computational cost and energy consumption of smart TV systems, thereby promoting the practical application of gesture interaction technology in smart TVs. Meanwhile, J. Ducloux's team [3] took a different approach and designed an accelerometer-based gesture recognition system. This system identifies gesture commands by pressing a button on the remote control and converts the gesture into a control command upon button release. This method not only simplifies user operations but also improves the accuracy and efficiency of gesture recognition. Neha Dawar's team [4] introduced depth cameras and wearable inertial sensor devices, further expanding the application scenarios and possibilities of gesture detection. Huiyue Wu's team [5] proposed a TV control system based on a combination of static and dynamic gesture models, mapping nineteen types of custom gestures to specific operational functions, thereby achieving a more flexible and intuitive gesture interaction experience.
In recent years, Yi Zhang's team [6] has utilized Doppler radar sensors and a machine learning-based Doppler signal classification algorithm to achieve efficient recognition of predefined gestures. This research provides new ideas and methods for smart TV gesture interaction technology. Additionally, Zimo Liao's team [7] proposed a solution that supports gesture control without hardware modifications. This solution enables mid-air gesture recognition based on screen and ambient light sensors, further reducing the implementation difficulty and cost of smart TV gesture interaction technology. Bayan Ibrahim Alabdullah's team [8] proposed a TV control model based on a recurrent neural network, which improves the accuracy of smart TV gesture recognition by computing skeletal features. Yuri Suzuki's team [9] developed a gesture recognition system using a mat embedded with optical reflection sensors and accelerometers as an interface for controlling household appliances. Huijing Wang's team [10] combined convolutional neural networks (CNNs) with the MPCNN transfer learning strategy to propose a novel smart TV control system.
Deep learning technology has significantly contributed to the advancement of gesture detection. Inspired by the performance of CNNs, Jaya Prakash Sahoo's team [11] proposed an end-to-end fine-tuning method. This method transfers the weights of a pre-trained CNN to a network trained on the target dataset. During this process, the weights and biases of the pre-trained CNN model are updated after each iteration on the target dataset. Md. Ahasan Atick Faisal's team [12] developed a low-cost wireless data glove that combines flexible sensors and IMUs, and proposed a novel prediction method based on deep learning that converts one-dimensional time series signals into two-dimensional images by introducing spatial projection to achieve dynamic gesture recognition. In addition, the proposed parallel path neural network architecture has a better ability to extract features from multi-modal data than traditional architectures. Due to the challenges of gesture recognition in complex backgrounds, Weina Zhou's team [13] proposed an encoder-decoder structure and integrated DRN with ASPP as the encoder module to effectively address the problem of gesture segmentation. To address the challenge that RNN, TCN, and 3DCNN struggle to effectively extract spatiotemporal features and classify noisy and complex skeleton sequences, Adam A. Q. Mohammed's team [14] proposed a multi-model ensemble gesture recognition network (MMEGRN) to leverage various capabilities in extracting and classifying skeleton sequences. Qi Sun's team [15] achieved high-precision gesture recognition by optimizing the convolutional neural network (CNN). Abu Saleh Musa Miah's team [16] utilized graphical models and general deep learning networks to extract effective gesture features. As the exploration of transformer models in gesture recognition tasks remains limited, Mallika Garg's team [17] proposed a novel GestFormer architecture for dynamic gesture recognition. Fo Hu's team [18] proposed the Spatio-Temporal Fusion Network with Brain Region Partitioning Strategy (STFN-BRPS), where the Recurrent Multi-scale Convolution Module (RMSCM) combines multi-scale convolution and CNN-BiLSTM for temporal feature extraction, the Dynamic Graph Convolution Module (DGCM) establishes electrode connections using a functional partitioning strategy, and the Feature Fusion Module (FFM) with channel attention adaptively weights spatio-temporal features, improving detection accuracy and robustness.
YOLO is an algorithm that balances real-time performance with high accuracy. To solve the problem of excessive parameters and high complexity in this algorithm, Dan Zhao’s team [19] developed a kiwifruit detection algorithm based on Faster-YOLOv8n. Their approach replaces all C2f modules in the YOLOv8n model with C2f-Faster modules, where feature maps are processed using partial convolution (PConv) and pointwise convolution (PWConv) while maintaining residual connections. To address the issues of low detection accuracy, missed detections, false positives, and excessive model parameters, Xiang Li’s team [20] proposed the 2SC-YOLOv8n steel defect detection model. This model integrates the Context Guided Block module and Shuffle Attention mechanism into the Backbone of YOLOv8n and adds the SENetV2 attention module to the Neck layer.
In summary, existing research often requires multiple peripheral devices or wearable equipment, leading to additional costs. Furthermore, YOLOv8n suffers from issues of false detections and model complexity. Therefore, the current research focus is on improving the accuracy of monocular camera gesture detection while maintaining a lightweight model and minimizing false detections that result in erroneous control. The contributions of this paper are as follows:
  • The CRVB-DSConvEMA module is used to replace the C2f module, enabling the model to adaptively focus on local details in prominent gestures based on the input feature map. Additionally, it enhances features across multiple scales, different channels, and spatial information.
  • The SPPF-DSConvEMA module is used to replace the SPPF module. By reconstructing partial channels and evenly distributing spatial semantics among sub-features, it preserves multi-angle information while improving computational efficiency. This approach further reduces false detections without increasing the model’s complexity.
  • The BiFPN module is adopted in the Neck layer of YOLOv8n, enabling the model to achieve greater performance gains with minimal computational cost. While ensuring the effectiveness and accuracy of the previous two steps, this approach further reduces the model’s complexity.

2. Materials and Methods

2.1. Materials

2.1.1. RemoteFinger Dataset

This study, set in the research context of gesture interaction with smart TVs, aims to generate gesture commands using specific hand gestures. Therefore, data collection was conducted with 10 participants, capturing three types of dynamic gesture commands. Utilizing OpenCV, the recorded videos were processed frame by frame, producing a total of 6300 images, which were manually annotated to form the Remote Finger 1.0 dataset. However, after training and testing with YOLOv8, analysis of various performance metrics revealed that the dataset had minimal feature differences between some categories. This led to the misclassification of certain gestures as others. To enhance the distinctiveness of the data, characteristics of the HaGRID dataset were referenced. The original 6300 images were manually filtered to remove adjacent frames with identical hand gestures, ensuring that each image contains a distinct gesture, and additional samples were collected under varying conditions, including different environments, lighting conditions, subjects, and distances between the camera and the hand. Furthermore, data augmentation techniques were applied to enrich the dataset. Sample images are shown in Figure 1, and detailed dataset information is provided in Table 1.
In Table 1, Remote Finger 1.0 refers to the initial version of the collected data, while Remote Finger represents the final dataset, which is also the one used in the following experiments.
Figure 2 shows the gesture image acquisition device, which has a maximum capture resolution of 1920 × 1800. To meet the model’s input requirements, we adjusted the resolution to 640 × 640 during data acquisition.
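For reference, frame extraction of this kind can be reproduced with a short OpenCV routine. The sketch below is illustrative only; the function name, file paths, and sampling stride are hypothetical and not taken from the original pipeline.

```python
import cv2
import os

def extract_frames(video_path, out_dir, size=(640, 640), stride=1):
    """Read a recorded gesture video and save resized frames as images.

    stride controls how many frames are skipped between saved images; the
    value used when building the Remote Finger dataset is not specified here.
    """
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frame = cv2.resize(frame, size)   # match the 640 x 640 model input
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical paths):
# extract_frames("gesture_volume_01.mp4", "dataset/volume", stride=3)
```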

2.1.2. HaGRID Dataset

Since the HaGRID dataset is particularly large, with a vast number of images in each category, manual filtering and annotation were performed. The detailed dataset information is provided in Table 2.

2.1.3. Data Augmentation

Since the custom dataset underwent manual filtering after collection, the total number of samples was slightly reduced. To expand the dataset and improve the quality of certain samples, data augmentation techniques such as mirroring, brightness adjustment (both increasing and decreasing), and adding Gaussian noise were applied. The processed results are shown in Figure 3.
The code logic for adding Gaussian noise to the image is as follows (Algorithm 1):
Algorithm 1. Gaussian Noise algorithm
Input: I: input image of size W × H × C; μ: noise mean (default 0); σ: noise standard deviation (controls intensity)
AddGaussianNoise(I, μ, σ):
    I_noise = copy(I)                                   # initialize the output image
    for each pixel coordinate (x, y):
        for each channel c ∈ {R, G, B}:
            γ ~ N(μ, σ²)                                # sample from the Gaussian distribution
            I_noise(x, y, c) = ClipToRange(I(x, y, c) + γ, 0, 255)
        end for
    end for
    return I_noise
The formula for vertical flipping is shown in (1), the formula for brightness change is shown in (2), and the formula for rotation is shown in (3).
$$I_{\mathrm{flip}}(x, y, c) = I(x,\ H - 1 - y,\ c) \tag{1}$$
$$I_{\mathrm{adj}}(x, y, c) = \mathrm{ClipToRange}\big(\alpha \cdot I(x, y, c) + \beta,\ 0,\ 1\big) \tag{2}$$
$$\begin{cases} x' = (x - x_c)\cos\theta - (y - y_c)\sin\theta + x_c \\ y' = (x - x_c)\sin\theta + (y - y_c)\cos\theta + y_c \end{cases} \tag{3}$$
In Equations (1)–(3), $I_{\mathrm{flip}}$ is the flipped image, $I \in \mathbb{R}^{W \times H \times C}$ is the original image, $x$ is the column coordinate, $y$ is the row coordinate, $c$ is the channel index, $I_{\mathrm{adj}}$ is the brightness-adjusted image, the $\mathrm{ClipToRange}$ function keeps the result within the valid range, $\alpha > 0$ is the contrast coefficient, $\beta$ is the brightness offset, $(x_c, y_c)$ are the rotation center coordinates, $(x', y')$ are the coordinates of the rotated pixel, and $\theta$ is the rotation angle.
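A compact NumPy/OpenCV sketch of Algorithm 1 and Equations (1)–(3) is given below. The parameter values are examples only, and the code operates on 8-bit images, so the clip range is [0, 255] rather than the normalized [0, 1] of Equation (2); it is not the exact script used to build the dataset.

```python
import cv2
import numpy as np

def add_gaussian_noise(img, mu=0.0, sigma=10.0):
    """Algorithm 1: add per-pixel Gaussian noise and clip to [0, 255]."""
    noise = np.random.normal(mu, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def flip_vertical(img):
    """Equation (1): I_flip(x, y, c) = I(x, H - 1 - y, c)."""
    return img[::-1, :, :].copy()

def adjust_brightness(img, alpha=1.2, beta=20):
    """Equation (2): clip(alpha * I + beta); brightening or darkening
    depends on the chosen alpha and beta."""
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

def rotate(img, theta_deg=15.0):
    """Equation (3): rotate the image around its centre by theta degrees."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), theta_deg, 1.0)
    return cv2.warpAffine(img, m, (w, h))

# Example usage on one sample (hypothetical file name):
# img = cv2.imread("frame_00001.jpg")
# augmented = [add_gaussian_noise(img), flip_vertical(img),
#              adjust_brightness(img, 1.3, 30), adjust_brightness(img, 0.7, -30),
#              rotate(img, 10)]
```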

2.2. Methods

2.2.1. Network Architecture

Given this research background, the method proposed in this paper uses YOLOv8n as the baseline model. This model consists of three main components: the feature extraction backbone (Backbone), the feature enhancement module (Neck), and the detection head (Head). The Backbone is responsible for extracting fundamental features from images and includes two key components: the C2f module and the SPPF module. The C2f module integrates multiple layers through convolutional fusion, facilitating the capture of multi-scale and contextual information. This design enhances feature representation and detection accuracy by capturing semantic information and contextual relationships in images, thereby improving the model’s generalization ability. Additionally, it optimizes computational efficiency by simplifying the convolutional process and stacking multiple layers. The SPPF module consists of convolution and pooling operations, which concatenate feature maps of different scales along the channel dimension. This approach enables the extraction of multi-scale feature information. In the Neck layer, cross-stage feature transmission is employed to facilitate the transfer and fusion of feature information across different levels. The Head layer plays a crucial role in detection by separating classification and localization tasks. This layered design minimizes interference between tasks, making the optimization process more independent.
Therefore, this paper improves the model by focusing on the C2f module, SPPF module, and Neck layer, aiming to reduce false detections while maintaining other performance metrics and lowering model complexity. The proposed YOLOv8n-RF dynamic remote finger detection network architecture, designed to suppress false detections, is illustrated in Figure 4.
This model improves the Backbone of YOLOv8n, which heavily relies on the C2f module to utilize multi-scale features and contextual information, thereby enhancing detection accuracy. However, stacking multiple C2f modules may introduce redundant information, and the standard convolutions used in this module mainly process local features, potentially limiting their effectiveness in capturing global feature relationships. To address these issues, the C2f module is replaced with the CRVB-DSConvEMA module, which integrates the RepViT Block from RepViT [21], DSConv, and EMA [22] to overcome the limitations of C2f. Additionally, the SPPF module, which frequently performs downsampling, may lead to the loss of certain positional information, reducing image resolution. This makes it more challenging for the network to precisely locate objects, thereby increasing the difficulty of target recognition. To mitigate this, the SPPF module is replaced with the SPPF-DSConvEMA module, which avoids dimensionality reduction by reconstructing partial channels and evenly distributing spatial semantics across sub-features. Regarding the Neck layer, which includes convolution, upsampling, and concatenation operations, the increased model complexity needs to be addressed. To achieve this, the BiFPN [23] module is introduced, which provides significant performance improvements with minimal computational cost. Ultimately, these optimizations ensure that the model maintains low complexity while effectively reducing false detections.

2.2.2. CRVB-DSConvEMA Module

To reduce false detections, enhance the network’s ability to recognize subtle differences in dynamic gesture information, and minimize misinformation during inference, the proposed CRVB-DSConvEMA module integrates RepViT Block, DSConv, and EMA.
Among them, RepViT (Revisiting Mobile CNN From ViT Perspective) [21] is a novel lightweight CNN series that re-examines and integrates lightweight CNNs from a ViT (Vision Transformer) perspective. By leveraging the architecture and optimization strategies of lightweight ViTs, it improves model performance and efficiency while maintaining low computational complexity. Its core structure, RepViT Block, transforms MobileNetV3 by separating the token mixer and channel mixer [24]. The specific structure of this module is shown in Figure 5.
As shown in Figure 5a, the MobileNetBlock consists of a 1 × 1 expansion convolution, a 3 × 3 depthwise convolution, and a 1 × 1 projection layer. The 1 × 1 expansion and projection convolutions enable information exchange between channels, while the 3 × 3 depthwise convolution is responsible for spatial feature fusion; these operations correspond to the channel mixer and the token mixer, respectively. As shown in Figure 5b, the RepViT Block moves the 3 × 3 depthwise convolution upward and employs structural re-parameterization, introducing a multi-branch topology during training to enhance performance. During inference, the multi-branch structure is merged into a single branch, eliminating the additional computational and storage costs associated with multi-branch configurations.
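Structural re-parameterization of this kind can be illustrated with a generic conv-BN folding routine in PyTorch. The sketch below shows the general RepVGG-style idea under the assumption of a depthwise 3 × 3 branch plus an identity branch; it is not the authors' RepViT implementation, and the helper names are ours.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution for inference."""
    w = conv.weight.clone()
    b = conv.bias.clone() if conv.bias is not None else torch.zeros(conv.out_channels)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.data = w * scale.reshape(-1, 1, 1, 1)
    fused.bias.data = (b - bn.running_mean) * scale + bn.bias
    return fused

def identity_as_dw_kernel(channels: int, k: int = 3) -> torch.Tensor:
    """Express an identity shortcut as an equivalent k x k depthwise kernel."""
    w = torch.zeros(channels, 1, k, k)
    w[:, 0, k // 2, k // 2] = 1.0
    return w

# At deploy time, the parallel training branches (e.g., a 3x3 depthwise conv + BN
# branch and an identity + BN branch) are each folded as above, and their weights
# and biases are summed into one 3x3 depthwise convolution, so inference runs on
# a single-branch structure with no extra cost.
```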
Dynamic Snake Convolution (DSConv) [25] introduces deformable offsets, enabling the learning of displacement values and adaptive adjustment of kernel sizes to accommodate object shape variations. It also employs an iterative strategy, sequentially selecting the next position of the target to be processed for observation. This ensures continuity of attention, supplements key features from multiple perspectives, and achieves efficient multi-view feature fusion. A comparison between Dynamic Snake Convolution and standard convolution is shown in Figure 6.
DSConv is a combination of traditional convolution and deformable convolution. For 2D convolution, assuming the convolution kernel size is 3 × 3 with the center coordinate denoted $(x_i, y_i)$, a standard convolution kernel can be expressed as:
$$K = \{(x-1,\ y-1),\ (x-1,\ y),\ \dots,\ (x+1,\ y+1)\} \tag{4}$$
To prevent the receptive field from drifting outside the target and to achieve better focus, a standard convolution kernel offset Δ is introduced, adding constraints to the standard convolution kernel in both the x-axis and y-axis directions.
$$K_{i\pm c} = \begin{cases} (x_{i+c},\ y_{i+c}) = \big(x_i + c,\ y_i + \sum_{i}^{i+c} \Delta y\big) \\ (x_{i-c},\ y_{i-c}) = \big(x_i - c,\ y_i + \sum_{i-c}^{i} \Delta y\big) \end{cases} \tag{5}$$
$$K_{j\pm c} = \begin{cases} (x_{j+c},\ y_{j+c}) = \big(x_j + \sum_{j}^{j+c} \Delta x,\ y_j + c\big) \\ (x_{j-c},\ y_{j-c}) = \big(x_j + \sum_{j-c}^{j} \Delta x,\ y_j - c\big) \end{cases} \tag{6}$$
Here, $\Sigma$ denotes the accumulation of the offsets. However, since the offset $\Delta$ is generally fractional, bilinear interpolation is used to handle its non-integer coordinates, namely:
$$K = \sum_{K'} B(K',\ K) \cdot K' \tag{7}$$
In Equation (7), $K$ denotes the fractional position given by Equations (5) and (6), $K'$ enumerates the integral spatial positions, and $B$ is the bilinear interpolation kernel, which can be decomposed into two one-dimensional interpolation kernels:
$$B(K',\ K) = b(K'_x,\ K_x) \cdot b(K'_y,\ K_y) \tag{8}$$
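A minimal NumPy sketch of Equations (5), (7), and (8) for the x-axis case is shown below. For brevity it reuses a single offset array for both directions, which the real DSConv does not do; the function names and sizes are illustrative, not the implementation used in the model.

```python
import numpy as np

def snake_positions_x(xc, yc, offsets_y, half=4):
    """Kernel positions along the x-axis with accumulated y-offsets (cf. Eq. (5))."""
    pos = [(xc, yc)]
    y_plus, y_minus = float(yc), float(yc)
    for c in range(1, half + 1):
        y_plus += offsets_y[c - 1]       # cumulative offset (positive side)
        y_minus -= offsets_y[c - 1]      # cumulative offset (negative side)
        pos.append((xc + c, y_plus))
        pos.insert(0, (xc - c, y_minus))
    return pos

def bilinear_sample(feat, x, y):
    """Bilinear interpolation at a fractional position (cf. Eqs. (7)-(8))."""
    h, w = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0
    x0, x1 = np.clip([x0, x0 + 1], 0, w - 1)
    y0, y1 = np.clip([y0, y0 + 1], 0, h - 1)
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

# Example: sample a 9-tap "snake" kernel row on a random feature map.
feat = np.random.rand(32, 32).astype(np.float32)
positions = snake_positions_x(16, 16, offsets_y=[0.3, -0.2, 0.5, 0.1], half=4)
values = [bilinear_sample(feat, px, py) for px, py in positions]
```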
To enhance feature extraction capability, a high-efficiency multi-scale attention (EMA) mechanism [22] is introduced. This mechanism dynamically adjusts the weights in the feature map based on the importance of each region. EMA achieves this by reorganizing a portion of the channels into the batch dimension and grouping the channel dimension into multiple sub-features, ensuring a uniform distribution of spatial semantic features within each feature group. This approach helps maintain channel-related information while minimizing computational overhead, allowing the module to focus on task-critical regions and improving its adaptability in complex scenarios. Its structure is shown in Figure 7.
In Figure 7, let the input tensor be $x \in \mathbb{R}^{C \times H \times W}$, with $C$ as the number of input channels and $H$ and $W$ as the spatial dimensions of the input features. The one-dimensional global average pooling along the horizontal direction for channel $c$ at height $H$ is given by:
$$Z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} x_c(H,\ i) \tag{9}$$
Similarly, the one-dimensional global average pooling along the vertical direction for channel $c$ at width $W$ is given by:
$$Z_c^W(W) = \frac{1}{H} \sum_{0 \le i \le H} x_c(i,\ W) \tag{10}$$
Finally, the two-dimensional global pooling operation is formulated as:
$$Z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i,\ j) \tag{11}$$
GroupNorm (Group Normalization), used in Figure 7, normalizes the grouped features to ensure a sufficient feature distribution within each group. It divides $C$ into $G$ groups, each containing $C/G$ channels, and normalizes each group. The specific calculation is:
$$\mathrm{GroupNorm}(X) = \gamma \cdot \frac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta \tag{12}$$
In Equation (12), $\mu$ and $\sigma$ denote the mean and standard deviation of each group's features, $\gamma \in \mathbb{R}^{C}$ is the scaling parameter, and $\beta \in \mathbb{R}^{C}$ is the shifting parameter.
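The directional pooling and normalization in Equations (9)–(12) map directly onto standard PyTorch operations; the snippet below is a minimal sketch with arbitrary tensor sizes, not the full EMA module.

```python
import torch
import torch.nn as nn

# Directional pooling used by EMA (cf. Equations (9)-(11)) on a feature map x
# of shape (B, C, H, W); sizes here are arbitrary.
x = torch.randn(2, 64, 32, 32)

z_h = x.mean(dim=3, keepdim=True)        # Eq. (9): average over W  -> (B, C, H, 1)
z_w = x.mean(dim=2, keepdim=True)        # Eq. (10): average over H -> (B, C, 1, W)
z_c = x.mean(dim=(2, 3), keepdim=True)   # Eq. (11): 2-D global pooling -> (B, C, 1, 1)

# Eq. (12): group normalization over G groups of C // G channels.
gn = nn.GroupNorm(num_groups=8, num_channels=64)   # gamma = gn.weight, beta = gn.bias
x_norm = gn(x)
```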
In this study, the depthwise convolution component of the RepViT Block is replaced with Dynamic Snake Convolution (DSConv). This structure employs re-parameterization techniques to eliminate the additional computational and storage costs associated with multi-branch architectures during inference. The CRVB-DSConvEMA module integrates the Efficient Multi-scale Attention (EMA) mechanism, ensuring a uniform distribution of spatial semantics across channels and sub-features. This design preserves multi-angle information while maintaining computational efficiency. It not only globally encodes information to adjust channel weights but also captures pixel-level relationships through cross-dimensional interactions. The improved CRVB-DSConvEMA structure is illustrated in Figure 8, and the RVB-DSConv module is depicted in Figure 9.
In Figure 8, when the input feature flows through the CRVB-DSConvEMA module, preliminary features of the input tensor $X \in \mathbb{R}^{C \times H \times W}$ are extracted, while the feature representation capability is enhanced through nonlinear activation. The two split feature maps are then fed into the RVB-DSConv block and the EMA block, respectively. The EMA attention mechanism captures long-range dependencies in the spatial dimension, and the DSConv branch models the features globally, enhancing the global representation capability. Next, the features output from the multiple branches are concatenated or fused along the channel dimension to form the final feature tensor, which combines multi-scale, local, and global information. Finally, the output of the previous layer undergoes a global EMA optimization, reintegrating information from different channels and spatial positions.
In Figure 9, the process begins with a 3 × 3 DSConv operation and a 1 × 1 DSConv operation for spatial feature extraction and fusion. This is followed by an SE layer (Squeeze-and-Excitation), which decouples the token mixer and channel mixer. Subsequently, through residual skip connections, the information is fused with the outputs of a 1 × 1 convolutional layer and a 1 × 1 projection layer, integrating channel and token mixer features. Finally, leveraging structural reparameterization, the computational load during the inference phase is significantly reduced compared to that during training.
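The SE layer referenced above can be summarized with a generic Squeeze-and-Excitation block; the sketch below is the standard formulation rather than the authors' exact layer, and the reduction ratio of 4 is an illustrative choice.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """A generic Squeeze-and-Excitation layer (channel attention)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global spatial pooling
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # restore channel count
            nn.Sigmoid(),                                   # per-channel gating weights
        )

    def forward(self, x):
        return x * self.fc(x)
```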

2.2.3. SPPF-DSConvEMA Module

The SPPF module in YOLOv8n has demonstrated its advantages in enhancing model performance through multi-scale feature fusion, especially in certain contexts. However, it must be acknowledged that the SPPF module may have limitations when dealing with variations in object scale and complex backgrounds. This is because, as the network deepens and frequent downsampling occurs, some positional information may be lost. Therefore, a fine-grained mechanism is needed to ensure that downsampling does not lead to the loss of fine-grained information.
To address the limitations of SPPF, the EMA attention mechanism is introduced to adaptively and dynamically adjust the weights in the feature map, ensuring the uniform distribution of spatial semantic features within each feature group. Meanwhile, DSConv is integrated into SPPF to ensure sufficient feature extraction from multiple perspectives before pooling. The specific structure of the proposed SPPF-DSConvEMA is shown in Figure 10, where the DBS module consists of DSConv, Batch Normalization (BatchNorm2d), and the activation function (SiLU).
In Figure 10, the tensor $X \in \mathbb{R}^{C \times H \times W}$ (the output of SPPF-DSConvEMA's pooling branch) is divided into $G$ sub-features along the channel dimension (Equation (10)) for multi-semantic learning. The grouped features are denoted as $X = [X_0, X_1, \dots, X_{G-1}]$, $X_i \in \mathbb{R}^{C//G \times W \times H}$ (usually $G \ll C$), which ensures a stable spatial grouping of feature semantics. After grouping, a three-branch parallel structure is formed to process features of different semantic levels: two parallel branches lie on the 1 × 1 path, while the third lies on the 3 × 3 path. In the 1 × 1 convolutional path, dual encoding of channel features is achieved through one-dimensional global average pooling along the horizontal and vertical axes. In parallel, the 3 × 3 convolutional path employs a single 3 × 3 kernel to produce a fused multi-scale feature representation. Here, the number of convolutional filters is independent of the batch dimension of the input data, because in PyTorch the parameter dimensions of a 2D convolutional kernel are defined as $(oup, inp, k, k)$, excluding the batch dimension; $oup$ is the number of output channels, $inp$ is the number of input channels, and $k$ is the kernel size. Thus, by mapping the $G$ groups into the batch dimension, the defined feature size becomes $C//G \times W \times H$, which serves as the tensor input, as shown in the sketch below. A dual-path Sigmoid activation is applied to the decoupled feature vectors output by the 1 × 1 convolutional layer, performing a nonlinear mapping to generate a two-dimensional probability distribution. Between the parallel 1 × 1 and 3 × 3 convolutional branches, a cross-dimensional heterogeneous feature interaction architecture is established, which fuses dual-dimensional attention weights within each feature group through element-wise multiplication. Before multiplication, the two tensors are guaranteed to have identical dimensions because, prior to feature-wise concatenation and activation, global average pooling is applied for spatial feature encoding, transforming the branch outputs into the target dimensions. After the first spatial attention map is generated, 2D global average pooling is similarly employed before the fusion activation to directly encode the 3 × 3 branch and the global spatial information into a specific dimensional shape. This step yields a second spatial attention map, which fully retains precise spatial location information. In the final stage, the two spatial attention weights are combined through a Sigmoid function to compute the output feature maps within each group. This method effectively captures pixel-level pairwise correlations and efficiently highlights the global contextual information embedded in all pixels.
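The channel-grouping step described above (mapping the $G$ sub-feature groups into the batch dimension) amounts to a simple reshape; the sizes in the sketch below are illustrative.

```python
import torch

# Mapping the G channel groups into the batch dimension, as described above.
B, C, H, W = 2, 64, 40, 40
G = 8                                       # number of sub-feature groups, G << C

x = torch.randn(B, C, H, W)
x_grouped = x.reshape(B * G, C // G, H, W)  # tensor fed to the parallel 1x1 / 3x3 branches

# After the per-group attention weights have been applied, the layout is restored:
x_restored = x_grouped.reshape(B, C, H, W)
```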

2.2.4. BiFPN Module

The neck network of YOLOv8 adopts a synergistic combination of the Path Aggregation Network (PAN) and the Feature Pyramid Network (FPN) architecture, as shown in Figure 11. The FPN framework effectively transfers deep feature information to shallower layers, enriching critical high-level insights. In contrast, the PAN structure facilitates the precise flow of localization data from shallow layers upward to feature-rich deep layers. This fusion creates the PANet [26] structure, which skillfully integrates shallow and deep features, significantly enhancing the model’s ability to recognize the most subtle details [27].
To significantly reduce the computational cost of the model and improve its practicality without compromising recognition accuracy, BiFPN is leveraged to achieve the research objective. This module enhances information flow between feature maps at different network levels through skip connections, allowing feature maps to retain more positional and detailed information, thereby helping the model focus better on key target information. Furthermore, it strengthens the original algorithm through more efficient bidirectional cross-scale connections and feature map fusion. Additionally, in terms of network topology, BiFPN integrates neural architecture search, a strategic approach designed to accommodate irregular feature network topologies under various task and resource constraints, thus offering greater flexibility in network design. Its well-crafted feature fusion and refinement operations not only improve accuracy but also successfully reduce computational complexity.
Taking the 6th layer of BiFPN as an example, its feature fusion is as follows:
$$P_6^{td} = \mathrm{Conv}\left(\frac{\omega_1 \cdot P_6^{in} + \omega_2 \cdot \mathrm{Resize}\big(P_7^{in}\big)}{\omega_1 + \omega_2 + \varepsilon}\right) \tag{13}$$
$$P_6^{out} = \mathrm{Conv}\left(\frac{\omega_1 \cdot P_6^{in} + \omega_2 \cdot P_6^{td} + \omega_3 \cdot \mathrm{Resize}\big(P_5^{in}\big)}{\omega_1 + \omega_2 + \omega_3 + \varepsilon}\right) \tag{14}$$
In Equations (13) and (14), P 6 t d represents the intermediate feature of the 6th layer in the top-down pathway, while P 6 o u t denotes the output feature of the 6th layer. The feature fusion in other layers follows a similar pattern.
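Equations (13) and (14) correspond to BiFPN's fast normalized (weighted) fusion. A minimal PyTorch sketch of one such fusion node is shown below; `WeightedFusion`, `n_inputs`, and `channels` are illustrative names and parameters, not part of the original implementation, and the inputs are assumed to already share the same spatial size and channel count (the Resize step in the equations).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion node in the spirit of Equations (13)-(14)."""
    def __init__(self, n_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))   # learnable fusion weights
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, inputs):
        w = F.relu(self.w)                            # keep weights non-negative
        w = w / (w.sum() + self.eps)                  # normalize the weights
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.conv(fused)

# Hypothetical usage for the top-down node of layer 6:
# p6_td = WeightedFusion(2, 256)([p6_in, F.interpolate(p7_in, scale_factor=2)])
```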

3. Results

3.1. Experimental Environment and Parameter Settings

The computer configuration used for training and testing in this experiment is shown in Table 3, and the main parameters of the specified program are listed in Table 4. Other parameters follow the default settings of the model.

3.2. Performance Evaluation Metrics

To evaluate the performance of YOLOv8n-RF, the following metrics were selected: mean Average Precision (mAP), Model Size, Giga Floating-point Operations (GFLOPs), Accuracy, False Drop Rate (FDR), and Model Detection Time (MDT). The calculation formulas are as follows:
$$\mathrm{Precision} = \frac{T_p}{T_p + F_p} \tag{15}$$
$$AP = \int_{0}^{1} \mathrm{Precision}(t)\, dt \tag{16}$$
$$mAP = \frac{\sum_{i=1}^{n} AP_i}{n} \tag{17}$$
$$\mathrm{FLOPs} = 2 \cdot H_{out} \cdot W_{out} \cdot C_{in} \cdot C_{out} \cdot K_h \cdot K_w \tag{18}$$
$$FDR = \sum_{i=1}^{6} error_i \tag{19}$$
$$MDT = \max(time_1,\ time_2,\ \dots,\ time_n) \tag{20}$$
In Equation (18), $H_{out}$ and $W_{out}$ represent the height and width of the output feature map, $C_{in}$ and $C_{out}$ denote the numbers of input and output channels, and $K_h$ and $K_w$ represent the height and width of the convolution kernel. In Equation (19), $error_i$ represents the probability of the test category being falsely detected as the $i$-th category. In Equation (20), since YOLOv8 only provides the detection time ($time_i$) for each individual image during model prediction, the longest detection time observed in this study is taken as the MDT value.
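As a quick numerical check of Equations (18) and (19), the sketch below evaluates the FLOPs of a hypothetical convolutional layer and the FDR of the YOLOv8n row of Table 7; the layer dimensions are illustrative only.

```python
def conv_flops(h_out, w_out, c_in, c_out, k_h, k_w):
    """Equation (18): FLOPs of a single convolutional layer."""
    return 2 * h_out * w_out * c_in * c_out * k_h * k_w

def false_drop_rate(error_probs):
    """Equation (19): sum of the per-category false-detection probabilities."""
    return sum(error_probs)

# FDR for the YOLOv8n row of Table 7 (rock misdetected as call/like/stop/volume):
print(false_drop_rate([0.10, 0.04, 0.46, 0.03]))  # 0.63, so 1-FDR is roughly 0.37-0.38
# FLOPs of a hypothetical 3x3 convolution mapping 64 channels to a 128 x 80 x 80 output:
print(conv_flops(80, 80, 64, 128, 3, 3))          # 943,718,400, i.e., about 0.94 GFLOPs
```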

3.3. Experimental Analysis

3.3.1. Ablation Experiment

To determine the importance of each proposed component and enhance the interpretability of the model, five sets of comparative experiments were conducted. Since this study involves testing on two datasets, the RemoteFinger dataset is abbreviated as RF, and the HaGRID dataset is abbreviated as HR for convenience. The experimental results are shown in Table 5.
In addition, to address false detections caused by incorrect gesture commands whose features resemble the correct ones, we included HaGRID's untrained 'rock' category in the detection test and obtained the 1-FDR metric. Because the YOLOv8n model was confronted with an untrained category, only the confusion matrix in the experimental results was valid, while the other metrics (e.g., mAP) were 0. Therefore, the 1-FDR value was derived from the normalized confusion matrix. Furthermore, due to normalization, 1-FDR carried a smaller error than FDR, so it was selected as the experimental metric.
As shown in Table 5, the baseline model YOLOv8n achieves a satisfactory mAP value when ignoring the parameters and GFLOPs metrics. However, additional experiments revealed that the presence of false detections still negatively impacted the user experience in real-world scenarios. After introducing the CRVB-DSConvEMA module into YOLOv8n, the model size was reduced by 1.32 M, and the false detection rate decreased by 7%, demonstrating the lightweight and efficient characteristics of CRVB-DSConvEMA. Building upon this, the proposed SPPF-DSConvEMA module was incorporated. Since EMA and DSConv are not inherently lightweight but rather designed to enhance multi-angle feature extraction, the model size slightly increased by 0.01 M, and GFLOPs increased by 0.3. However, this adjustment further reduced the false detection rate. Finally, integrating BiFPN enabled efficient multi-scale feature fusion with minimal computational cost. Compared to the model before introducing BiFPN, the model size decreased by 1.18 M, GFLOPs decreased by 0.9, and the false detection rate was reduced by 5%.
Overall, compared to YOLOv8n, the proposed improvements result in a total reduction of 2.49 M in model size, a 1.7 decrease in GFLOPs, and a 22% reduction in the false detection rate, while the MDT increased by only 2.6 ms. Based on the experimental results presented in Table 5, it can be concluded that the proposed enhancement strategies significantly improve the detection performance of the baseline model YOLOv8n, demonstrating their effectiveness in reducing false gesture detections.
For convenience in future use, the models in Table 5 are assigned simplified names, with the simplification process detailed in Table 6.
To illustrate the final improvement effect, the visualization of the data in Table 5 is shown in Figure 12.
Meanwhile, the normalized confusion matrix of false detections in this experiment is shown in Figure 13, and the detailed false detection data is presented in Table 7.
In Table 7, the sum of the FDR value and the (1-FDR) value for the YOLOv8n model does not equal 1. This is because, after normalization, YOLOv8 retains only two significant digits in its results. Therefore, to minimize errors, (1-FDR) was used as the evaluation metric.

3.3.2. Comparative Experiment

To evaluate the effectiveness of the detection model proposed in this paper, representative models with widely recognized detection efficiency in current research (SSD, MMDetectionv3, YOLOv5, YOLOv6, YOLOv7, and YOLOv9) were selected for comparison and testing against the proposed model and the YOLOv8n baseline. All network models were trained and tested using the same dataset, with the number of training epochs set to 100. The same evaluation metrics were used as observation criteria, and the experimental results are shown in Table 8.
Table 8 shows that compared to other models, YOLOv8n-RF achieves the highest mAP and 1-FDR values, the lowest model size and GFLOPs, and its recall value is second only to the highest, with a difference of no more than 0.65%. Overall, the proposed model demonstrates outstanding performance across all metrics, combining lightweight characteristics with low false detection rates.

4. Discussion

As shown in Figure 14, the false detections observed in the experiment include cases of multiple detections: multiple detection boxes either not overlapping or overlapping with each other.
Furthermore, by computing the gradient of the feature map to generate a heatmap, we gain deeper insights into the model’s attention to relevant features during the detection process. The heatmaps generated before and after the model improvement are shown in Figure 15. In the heatmap, deeper red indicates higher attention, yellow represents moderate attention, and blue signifies minimal impact on image recognition, which the model considers as redundant information. It can be observed that the improved model focuses more on the shape features of fingers, with the main attention areas concentrated on the shape of the gesture, which is more beneficial for recognizing gesture commands.
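Gradient-weighted heatmaps of this kind are typically produced with a Grad-CAM-style procedure. The sketch below illustrates the general technique with forward/backward hooks; it is not the exact visualization pipeline used for Figure 15, and `score_fn` and the chosen `layer` are assumptions supplied by the caller.

```python
import torch

def gradient_heatmap(model, layer, image, score_fn):
    """Weight a chosen feature map by the gradient of a scalar detection score
    and average over channels; score_fn reduces the model output to a scalar
    (e.g., the confidence of one predicted box)."""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = score_fn(model(image))
    score.backward()
    h1.remove()
    h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)   # per-channel importance
    cam = torch.relu((weights * feats[0]).sum(dim=1))   # (B, H, W) heatmap
    return cam / (cam.max() + 1e-8)                     # normalize to [0, 1]
```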
For the comparison of detection performance in normal detection and false detection cases, refer to Figure 16. When detecting at close range, the results are almost identical, whereas at long distances, there are noticeable differences. The first row shows the detection results of YOLOv8, while the second row displays the results of the improved model. Compared to the YOLOv8 baseline model, the proposed algorithm in this study demonstrated better detection performance in our experiments. Specifically, in long-distance detection, the confidence score was higher than that of the baseline model. Regarding false detection, the proposed model correctly detected the object in the first example, while its performance was slightly inferior to the original model in the second example. This is because the proposed model does not entirely eliminate false detections. However, considering the overall detection performance, the improved algorithm significantly outperformed the original algorithm.
In the final stage of this study, the dynamic gesture detection results of YOLOv8n-RF were mapped into remote control finger commands, and a compatible system was developed accordingly. Validation on this system demonstrated seamless interaction, further confirming the effectiveness of the algorithm. An experimental example is shown in Figure 17.

5. Conclusions

To address the issues of gesture commands being easily misrecognized and the high complexity of the model, this paper proposes the YOLOv8n-RF dynamic remote finger detection algorithm. This algorithm reduces the false detection rate of gesture command detection while achieving high detection performance with lower computational complexity. By embedding the CRVB-DSConvEMA module into the feature extraction network, the model effectively preserves fine morphological characteristics from multiple angles, reducing false detections while maintaining low computational complexity. Additionally, the introduction of the SPPF-DSConvEMA module allows adaptive weight adjustment during downsampling, preserving semantic information from multiple angles without losing important details. Finally, the BiFPN module is incorporated to make the model even more lightweight without compromising its performance.
Experiments were conducted on a custom Remote Finger dataset, the publicly available HaGRID dataset, and additional “rock” class data that were not used in training, covering a total of seven categories. Among them, the ‘rock’ class did not belong to the six predefined categories for training purposes and was solely used for false detection testing. The results showed that, compared to the original YOLOv8n, the YOLOv8n-RF algorithm achieved improvements of 1.23% and 0.84% in mAP along with 0.78% and 0.38% gains in recall on the two datasets, respectively, reduced model size by 2.49 M, decreased GFLOPs by 1.7, and lowered the false detection rate by 22%, while the MDT increased by only 2.6 ms. The YOLOv8n-RF algorithm effectively reduces false detections, thereby minimizing erroneous control actions. Additionally, it exhibits a certain level of generalization, making it suitable for real-world smart TV gesture interaction applications based on monocular cameras.

Author Contributions

Conceptualization, Y.W., G.W. and Y.Y.; methodology, Y.W., G.W. and Y.Y.; software, Y.W., G.W. and Y.Y.; validation, Y.W., G.W. and Y.Y.; formal analysis, Y.W., G.W. and Y.Y.; investigation, Y.W., G.W. and Y.Y.; resources, Y.W., G.W. and Y.Y.; data curation, Y.W., G.W. and Y.Y.; writing—original draft preparation, Y.W. and G.W.; writing—review and editing, Y.W. and G.W.; visualization, Y.W. and G.W.; supervision, Y.W.; project administration, Y.W. and G.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Shaanxi Provincial Key R&D Program Project (2024GX-YBXM-126).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, S.H.; Sohn, M.K.; Kim, D.J.; Kim, B.; Kim, H. Smart TV Interaction System Using Face and Hand Gesture Recognition. In Proceedings of the 2013 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 11–14 January 2013; pp. 173–174. [Google Scholar]
  2. Lian, S.; Hu, W.; Wang, K. Automatic user state recognition for hand gesture based low-cost television control system. IEEE Trans. Consum. Electron. 2014, 60, 107–115. [Google Scholar] [CrossRef]
  3. Ducloux, J.; Colla, P.; Petrashin, P.; Lancioni, W.; Toledo, L. Accelerometer-based Hand Gesture Recognition System for Interaction in Digital TV. In Proceedings of the 2014 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Montevideo, Uruguay, 12–15 May 2014; pp. 1537–1542. [Google Scholar]
  4. Dawar, N.; Kehtarnavaz, N. Real-time continuous detection and recognition of subject-specific smart TV gestures via fusion of depth and inertial sensing. IEEE Access 2018, 6, 7019–7028. [Google Scholar] [CrossRef]
  5. Wu, H.; Yang, L.; Fu, S.; Zhang, X. Beyond remote control: Exploring natural gesture inputs for smart TV systems. J. Ambient Intell. Smart Environ. 2019, 11, 335–354. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Dong, S.; Zhu, C.; Balle, M.; Zhang, B.; Ran, L. Hand gesture recognition for smart devices by classifying deterministic Doppler signals. IEEE Trans. Microw. Theory Tech. 2020, 69, 365–377. [Google Scholar] [CrossRef]
  7. Liao, Z.; Luo, Z.; Huang, Q.; Zhang, L.; Wu, F.; Zhang, Q.; Wang, Y. SMART: Screen-Based Gesture Recognition on Commodity Mobile Devices. In Proceedings of the 2021 Annual International Conference on Mobile Computing and Networking (MobiCom), New Orleans, LA, USA, 25–29 October 2021; pp. 283–295. [Google Scholar]
  8. Alabdullah, B.I.; Ansar, H.; Mudawi, N.A.; Alazeb, A.; Alshahrani, A.; Alotaibi, S.S.; Jalal, A. Smart Home Automation-Based Hand Gesture Recognition Using Feature Fusion and Recurrent Neural Network. Sensors 2023, 23, 7523. [Google Scholar] [CrossRef] [PubMed]
  9. Suzuki, Y.; Kato, K.; Furui, N.; Sakamoto, D.; Sugiura, Y. Exploring Gestural Interaction with a Cushion Interface for Smart Home Control. arXiv 2024, arXiv:2410.04730. [Google Scholar]
  10. Wang, H.; Xu, Y. Intelligent Television Operating System Based on Gesture Recognition. In Proceedings of the 2024 4th International Symposium on Computer Technology and Information Science (ISCTIS), Xi’an, China, 12–14 July 2024; pp. 218–222. [Google Scholar]
  11. Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network. Sensors 2022, 22, 706. [Google Scholar] [CrossRef] [PubMed]
  12. Faisal, M.A.A.; Abir, F.F.; Ahmed, M.U.; Ahad, M.A.R. Exploiting Domain Transformation and Deep Learning for Hand Gesture Recognition Using a Low-Cost Dataglove. Sci. Rep. 2022, 12, 21446. [Google Scholar] [CrossRef] [PubMed]
  13. Zhou, W.; Chen, K. A lightweight hand gesture recognition in complex backgrounds. Displays 2022, 74, 102226. [Google Scholar] [CrossRef]
  14. Mohammed, A.A.; Lv, J.; Islam, M.S.; Sang, Y. Multi-model ensemble gesture recognition network for high-accuracy dynamic hand gesture recognition. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 6829–6842. [Google Scholar] [CrossRef]
  15. Sun, Q.; Zhang, T.; Gao, S.; Yang, L.; Shao, F. Optimizing Gesture Recognition for Seamless UI Interaction Using Convolutional Neural Networks. arXiv 2024, arXiv:2411.15598. [Google Scholar]
  16. Miah, A.S.M.; Hasan, M.A.M.; Tomioka, Y.; Shin, J. Hand Gesture Recognition for Multi-Culture Sign Language Using Graph and General Deep Learning Network. IEEE Open J. Comput. Soc. 2024, 5, 144–155. [Google Scholar] [CrossRef]
  17. Garg, M.; Ghosh, D.; Pradhan, P.M. GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; pp. 2473–2483. [Google Scholar]
  18. Hu, F.; Zhang, L.; Yang, X.; Zhang, W.-A. EEG-based driver fatigue detection using spatio-temporal fusion network with brain region partitioning strategy. IEEE Trans. Intell. Transp. Syst. 2024, 25, 9618–9630. [Google Scholar] [CrossRef]
  19. Zhao, D.; Wang, J.; Shan, W. Research on kiwifruit detection algorithm based on Faster-YOLOv8n. In Proceedings of the 2024 5th International Conference on Artificial Intelligence and Computer Engineering (ICAICE), Wuhu, China, 8–10 November 2024; pp. 399–403. [Google Scholar]
  20. Li, X.; Dai, S.; Zhang, Y.; Jiang, J. Steel defect detection based on improved YOLOv8n. In Proceedings of the 2024 9th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 21–23 November 2024; pp. 818–823. [Google Scholar]
  21. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Rep ViT: Revisiting Mobile CNN From ViT Perspective. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920. [Google Scholar]
  22. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  23. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  24. Zhang, X.; Wang, M. Research on Pedestrian Tracking Technology for Autonomous Driving Scenarios. IEEE Access 2024, 12, 149662–149675. [Google Scholar] [CrossRef]
  25. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6047–6056. [Google Scholar]
  26. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Han, K.; Wang, Y. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. arXiv 2023, arXiv:2309.11331. [Google Scholar]
  27. Qu, Z.; Gao, L.Y.; Wang, S.Y.; Yin, H.N.; Yi, T.M. An improved YOLOv5 method for large objects detection with multi-scale feature cross-layer fusion network. Image Vis. Comput. 2022, 125, 104518. [Google Scholar] [CrossRef]
Figure 1. RemoteFinger Dataset Sample Images.
Figure 2. Image sensor device.
Figure 3. Data Augmentation Sample Images. (a) Original Image; (b) Transformed Image, Corresponding to Flipping, Gaussian Noise, Brightening, Rotation, and Darkening operations.
Figure 4. YOLOv8n-RF Network Architecture Diagram.
Figure 5. MobileNetBlock and RepViT Block Structure Diagram. (a) MobileNetBlock; (b) RepViT Block.
Figure 6. Comparison Diagram of Dynamic Snake Convolution and Standard Convolution. (a) Standard Convolution; (b) Dynamic Snake Convolution.
Figure 7. EMA structure diagram.
Figure 8. CRVB-DSConvEMA structure diagram.
Figure 9. RVB-DSConv Structure Diagram.
Figure 10. SPPF-DSConvEMA Module Structure Diagram.
Figure 11. FPN, PANet, and BiFPN Structure Diagram. (a) FPN; (b) PANet; (c) BiFPN.
Figure 12. Comparison chart of experimental data for different models.
Figure 13. Normalized Confusion Matrix of False Detections for Each Module on the rock Category. (a) YOLOv8n; (b) YOLOv8n-A; (c) YOLOv8n-B; (d) YOLOv8n-RF.
Figure 14. Example image of false detection cases. (a) Detection boxes without overlap; (b) Detection boxes partially overlapping; (c) Detection boxes completely overlapping.
Figure 15. Comparison of heatmaps before and after model improvement. (a) YOLOv8n; (b) YOLOv8n-RF.
Figure 16. Comparison of detection performance between two algorithms. (a) YOLOv8n normal detection result image; (b) YOLOv8n false detection result image; (c) YOLOv8n-RF normal detection result image; (d) YOLOv8n-RF false detection result image.
Figure 17. Example of remote control finger interaction tested with YOLOv8n-RF.
Table 1. RemoteFinger Dataset Description.
Dataset | Volume | Pointer | Brightness | Total (Images)
Remote Finger 1.0 | 2100 | 2100 | 2100 | 6300
Remote Finger | 2080 | 2080 | 2039 | 6199
Table 2. Filtered HaGRID Dataset Description Table.
Dataset | Call | Like | Stop | Total (Images)
HaGRID | 1988 | 1988 | 1988 | 5994
Table 3. Experimental Environment.
Name | Version
CPU | Intel(R) Xeon(R) Silver 4214R
GPU | RTX 3080 Ti (12 GB) × 1
OS | Ubuntu 18.04
PyTorch | 1.13.0
Python | 3.9.18
CUDA | 11.7
OpenCV | 4.8.1.78
Table 4. Main Parameters of the Program.
Name | Value
epoch | 100
imgsz | 640
batch | 32
workers | 8
device | 0
Table 5. Comparison of experimental results before and after model improvement.
CRVB-DSConvEMA | SPPF-DSConvEMA | BiFPN | mAP (%) RF | mAP (%) HR | Model Size (M) | GFLOPs | 1-FDR (%) Rock | MDT (ms) RF
- | - | - | 98.30 | 98.73 | 5.95 | 8.2 | 38 | 13.0
√ | - | - | 98.70 | 98.92 | 4.63 | 7.1 | 45 | 14.0
√ | √ | - | 99.30 | 99.43 | 4.64 | 7.4 | 55 | 16.0
√ | √ | √ | 99.53 | 99.57 | 3.46 | 6.5 | 60 | 15.6
Table 6. Table of Simplified Module Names.
Name | Model
YOLOv8n-A | CRVB-DSConvEMA
YOLOv8n-B | CRVB-DSConvEMA + SPPF-DSConvEMA
YOLOv8n-RF | CRVB-DSConvEMA + SPPF-DSConvEMA + BiFPN
Table 7. The situation where each module misclassifies “rock” into six different categories.
Model | Call | Like | Stop | Volume | Pointer | Brightness | FDR | 1-FDR
YOLOv8n | 0.10 | 0.04 | 0.46 | 0.03 | - | - | 0.63 | 0.38
YOLOv8n-A | 0.09 | 0.04 | 0.38 | 0.03 | 0.01 | - | 0.55 | 0.45
YOLOv8n-B | 0.08 | 0.03 | 0.31 | 0.03 | - | - | 0.45 | 0.55
YOLOv8n-RF | 0.05 | 0.06 | 0.29 | - | - | - | 0.40 | 0.60
Table 8. Performance metric data of various object detection models.
Model | mAP (%) RF | Recall (%) RF | mAP (%) HR | Recall (%) HR | Model Size (M) | GFLOPs | 1-FDR (%) Rock
SSD | 21.13 | - | 88.33 | - | 186 | - | 20
MMDetectionv3 | 47.43 | 58.33 | 65.00 | 64.00 | 59.9 | - | 40
YOLOv5 | 98.67 | 98.50 | 99.33 | 97.50 | 5.01 | 7.1 | 28
YOLOv6 | 97.43 | 96.83 | 99.20 | 98.33 | 8.28 | 11.8 | 20
YOLOv7 | 98.70 | 99.23 | 99.00 | 98.67 | 11.7 | 13.2 | 39
YOLOv8 | 98.30 | 98.20 | 98.73 | 98.50 | 5.95 | 8.2 | 38
YOLOv9 | 98.53 | 97.43 | 99.37 | 99.53 | 5.80 | 11.0 | 30
YOLOv8n-RF | 99.53 | 98.98 | 99.57 | 98.88 | 3.46 | 6.5 | 60

