Article

An Improved Lightweight YOLOv5s-Based Method for Detecting Electric Bicycles in Elevators

1 School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Information Science and Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2660; https://doi.org/10.3390/electronics13132660
Submission received: 17 June 2024 / Revised: 4 July 2024 / Accepted: 6 July 2024 / Published: 7 July 2024

Abstract

The increase in fire accidents caused by the indoor charging of electric bicycles (EBs) has raised public concern. Monitoring EBs in elevators is challenging, and existing YOLOv5-based detection methods face problems with computational load and detection rate. To address this issue, this paper presents an improved lightweight method based on YOLOv5s for detecting EBs in elevators. The method introduces the MobileNetV2 module to make the model lightweight, and it incorporates the CBAM attention mechanism and the Bidirectional Feature Pyramid Network (BiFPN) into the YOLOv5s neck network to improve detection precision. To verify that the model can be deployed at the elevator edge, we deploy it on a Raspberry Pi 4B embedded development board connected to a buzzer for application verification. The experimental results demonstrate that the model parameters are reduced by 58.4%, the computational complexity is reduced by 50.6%, the detection precision reaches 95.9%, and real-time detection of electric bicycles in elevators is achieved.

1. Introduction

With the acceleration of urbanization, the problem of road congestion in cities is increasing. In this context, more and more citizens are choosing electric bicycles for commuting. However, due to the limited number of ground charging stations, it is difficult to satisfy people’s charging demands. Most residents choose to charge their electric bicycles indoors, which poses potential safety hazards, including the risk of fire accidents. Therefore, the timely installation of an electric bicycle detection system in elevators is one important measure to avoid these safety hazards. Object detection algorithms can be classified into three types.
The first type is traditional object detection algorithms, such as Cascade + HOG, DPM + Haar, or SVM [1], which can be mainly divided into two categories: feature-based and segmentation-based. Feature-based object detection algorithms mainly rely on recognizing certain attributes of the object, which can be manually designed features or abstract features extracted by the algorithm in order to achieve detection and recognition functions. Segmentation-based object detection algorithms mainly achieve detection and recognition through characteristics such as region, color, and edge. They are intuitive and fast in computation, but the design and selection of their features heavily depend on manual labor, resulting in limitations in accuracy, objectivity, robustness, and generalization. Meanwhile, traditional object detection algorithms mostly use sliding-window techniques, which have low computational time efficiency, complex processing, and low accuracy. Therefore, with the development of computing power and data, traditional detection algorithms evidently fail to fulfill the requirements of users.
The second type involves candidate regions/windows + deep learning classification, such as R-CNN [2], Faster R-CNN [3], etc. These algorithms first identify the possible locations of targets in the image, namely candidate regions, and utilize information such as texture, edges, and colors in the image to ensure a high recall rate when selecting fewer windows. Compared to traditional object detection methods, these methods do not rely on manually designed features or prior knowledge but learn feature representations through end-to-end deep learning networks, thereby more effectively capturing complex and abstract target features. These methods have higher detection accuracy while improving computational efficiency through efficient feature extraction and computational models. They also exhibit better robustness in different scenarios and objects, significantly improving the practicality and performance of object detection algorithms. Although these algorithms achieve high accuracy, their inference is slow, making them unsuitable for fast, time-sensitive detection.
The third type involves regression techniques grounded in deep learning principles, including YOLO [4], SSD [5], and RetinaNet [6]. Because methods such as Faster R-CNN cannot meet real-time requirements in terms of speed, YOLO and related methods have gradually demonstrated their importance. These methods adopt a regression formulation: the entire image is fed into the network, and target bounding boxes and classes are regressed directly at multiple locations in the image, thereby greatly accelerating detection. SSD integrates the regression concept from YOLO with the anchor mechanism from Faster R-CNN to maintain the fast speed of YOLO while ensuring the precision of prediction. However, this family of algorithms has high data requirements, high computing resource demands, sensitivity to tuning, and poor interpretability, making it difficult to handle highly redundant feature maps.
Compared with YOLOv4, YOLOv5 significantly improves small object detection, computational performance, and real-time performance through a more efficient model structure and a more powerful adaptive pyramid pooling method. Object detection models are usually deployed on embedded development boards, such as for detecting melon-leaf anomalies [7], vehicle recognition [8], and so on. Due to the limited memory and resources of these embedded development boards, it is necessary to ensure lightweight processing of the selected models. The primary contributions of this paper include the following:
  • We use MobileNetV2 to reduce the parameters and computational complexity of the model and replace the original activation function ReLU6 with SiLU, which improves the training efficiency and convergence speed of the model, as well as its generalization ability and performance.
  • We introduce the CBAM attention mechanism at the output position of the backbone network to increase the performance of the model, which helps enhance the feature representation ability of the model, enhance the receptive field, improve detection precision, and reduce computational costs.
  • Considering the decrease in precision and recall after the introduction of MobileNetV2, the BiFPN model is introduced to ensure high detection precision.
  • We use a Raspberry Pi 4B for model application validation.
  • We evaluate the YOLOv5s-M2B model on the same dataset, comparing it against the Faster R-CNN, SSD, YOLOv3, and YOLOv8 models, and conclude that our model exhibits superior overall performance.

2. Related Works

2.1. Electric Bicycle Detection in Elevators

At present, research on electric bicycle detection in elevators mainly focuses on three aspects: lightweight model design, improvement of model detection performance, and model deployment.

2.1.1. Lightweight Model Design

The lightweight nature of the model is mainly achieved by introducing a lightweight network into the backbone network and reducing the number of layers in the network.
Liu, L. et al. [9] enhanced global and local feature selection while reducing computational load by combining shuffling operations in channel grouping convolution and time dilation convolution, resulting in a lightweight dilated shuffle group network. Benchmark experiments on the MIMIC-III and VitalDB datasets showed that their model outperformed other lightweight CNNs in balancing parameters and computational complexity. Because traditional CNN models require large numbers of parameters and extensive computational resources, LSGNet, a lightweight CNN model, was introduced in [10]. By adding the SGECA and ParcSG modules to an existing backbone network, the LSGNet model achieves high accuracy with only 18% of the parameters of MobileNetV3-Large [11].

2.1.2. Model Detection Performance

Model detection performance is mainly improved by introducing attention mechanisms and improving the loss function in the object detection algorithm network.
Zhao, Z. et al. [12] used the lightweight YOLOv5n [13] as the benchmark model and introduced the CBAM attention mechanism into the backbone network to enhance the model’s extraction of important feature regions. Additionally, the EIOU loss function was used to address the electric bicycle occlusion problem, and CARAFE [14] was used to improve detection performance. Although the improved algorithm significantly improves detection precision, due to the limitations of the YOLOv5n benchmark model itself, its detection accuracy remains relatively low, with an mAP value of only 86.35%.

2.1.3. Algorithm Deployment

Algorithm deployment is mainly achieved through the deployment of object detection algorithms on embedded development boards such as the Raspberry Pi series, PYNQ series, and Jetson series.
Zhao, Z. et al. [12] deployed their object detection algorithm on the NVIDIA Jetson TX2 NA development board, improving detection accuracy. Although the deployed algorithm achieves a detection speed of 30 FPS, the NVIDIA Jetson TX2 NA is expensive, and its universality on the x86 platform is poor.

3. Proposed Method

3.1. YOLOv5s Model

Compared to other versions of the YOLOv5 model, YOLOv5s is able to handle higher frame rates without losing too much precision and has a smaller model size and memory footprint, which means it consumes fewer resources during deployment and runtime, making it particularly suitable for embedded devices and resource-constrained environments such as elevators. Therefore, we chose YOLOv5s as the benchmark model. YOLOv5s consists of three parts: a backbone, a neck, and a prediction head, as shown in Figure 1.

3.1.1. Backbone Network

YOLOv5 adopts the CSPDarkNet architecture as its backbone network for extracting image features. This architecture combines DarkNet with the CSP module from CSPNet [15], significantly improving inference efficiency, reducing memory consumption, and enhancing the network’s learning ability.
YOLOv5’s Focus structure, serving as the initial layer, performs a specialized slicing-and-convolution operation to downsample the input feature map, thereby reducing computational and parameter complexity. Its structure is shown in Figure 2. It reorganizes information from the width and height dimensions into the channel dimension by sampling the image at pixel intervals, effectively avoiding the information loss that can occur with ordinary 2x downsampling. This process increases the number of input channels by a factor of four. Afterward, the processed feature map is convolved with 32 convolutional kernels to generate a feature map with 32 channels.
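To make the slicing operation concrete, the following is a minimal PyTorch sketch of such a Focus-style layer; the channel counts and layer names are illustrative and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice pixels at intervals along height and width, stack the four slices
    on the channel axis (4x channels, half resolution), then convolve."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2)

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2): no pixel information is discarded.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

# Example: a 640x640 RGB image becomes a 32-channel 320x320 feature map.
print(Focus()(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 320, 320])
```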
Utilizing the CSP [16] structure to construct the backbone network involves partitioning the input feature map into two sections. One section is processed by a subnetwork, while the other is forwarded directly to the subsequent layer. The processed segment is eventually merged back to create a cohesive feature map serving as input for the next layer. This process unfolds through the following steps (a minimal code sketch follows the list):
  • The input feature maps are divided into two parts.
  • In the subnetwork, convolutional layers first compress the input feature map, a series of convolution computations is then performed, and the result is finally expanded with convolutional layers. In this way, high-level features can be distilled at relatively low cost.
  • In the subsequent stages, the processed feature maps from the subnetwork are fused with directly processed feature maps, followed by a series of convolution operations.
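Below is a minimal PyTorch sketch of this split, process, and merge pattern; the layer sizes are illustrative and do not reproduce the exact CSPDarkNet blocks.

```python
import torch
import torch.nn as nn

class SimpleCSPBlock(nn.Module):
    """Split the input channels into two branches: one is processed by a small
    convolutional subnetwork, the other bypasses it; the two branches are then
    concatenated and fused by a final 1x1 convolution."""
    def __init__(self, channels, n_convs=2):
        super().__init__()
        half = channels // 2
        self.reduce_a = nn.Conv2d(channels, half, 1)   # branch to be processed
        self.reduce_b = nn.Conv2d(channels, half, 1)   # bypass branch
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(half, half, 3, padding=1), nn.SiLU())
            for _ in range(n_convs)
        ])
        self.fuse = nn.Conv2d(2 * half, channels, 1)

    def forward(self, x):
        a = self.blocks(self.reduce_a(x))              # processed part
        b = self.reduce_b(x)                           # directly forwarded part
        return self.fuse(torch.cat([a, b], dim=1))     # merge back into one map
```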

3.1.2. Neck Network

In YOLOv5, the PAN [17] structure is used as the feature fusion network in the object detection algorithm to reprocess and utilize the feature information extracted by CSPDarkNet. Its purpose is to improve the model’s capability to detect targets of various sizes by integrating features from multiple levels. Two modules comprise the PAN structure.
The feature pyramid module utilizes various convolutional and pooling layers of diverse dimensions to produce multi-scale feature maps. These feature maps of different scales can offer target information of various granularities and help correct positional deviations in the feature maps.
The feature fusion module integrates feature maps from various scales to enrich the model’s feature representation and enhance its perceptual capabilities.
In the PAN structure, the feature fusion module augments the top-down pathway with an additional bottom-up aggregation path, resulting in a more powerful feature representation (a minimal sketch of this fusion follows).
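The following sketch illustrates this two-way aggregation over three pyramid levels. It assumes all levels have already been projected to a common channel count and is not the authors' exact neck implementation.

```python
import torch
import torch.nn.functional as F

def pan_fuse(c3, c4, c5):
    """Minimal PAN-style fusion for three levels (e.g. 80x80, 40x40, 20x20 maps
    with identical channel counts): top-down pass first, then bottom-up."""
    # Top-down pathway (FPN part): propagate strong semantics to finer levels.
    p5 = c5
    p4 = c4 + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
    p3 = c3 + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
    # Bottom-up pathway (PAN part): propagate precise localization back down.
    n3 = p3
    n4 = p4 + F.max_pool2d(n3, kernel_size=2)
    n5 = p5 + F.max_pool2d(n4, kernel_size=2)
    return n3, n4, n5

levels = [torch.randn(1, 128, s, s) for s in (80, 40, 20)]
print([t.shape[-1] for t in pan_fuse(*levels)])  # [80, 40, 20]
```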

3.1.3. Loss Functions

The overall loss function used in YOLOv5 is as follows:
$$Loss = L_{ciou} + L_{obj} + L_{cls}$$
where $L_{ciou}$ represents the loss function for bounding box regression, which is used to compute the disparity between the predicted box position and the actual box position; $L_{obj}$ represents the confidence loss function, which calculates the disparity between predicted and actual target objects; and $L_{cls}$ represents the category loss function, which is used to compute the disparity between predicted and actual target categories.
YOLOv5 uses the CIoU [18] loss function to calculate the bounding box regression loss, defined by the following formula:
$$L_{ciou} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha\upsilon$$
where $b$ stands for the predicted box, $b^{gt}$ represents the true box, $IoU$ [19] represents the intersection-over-union ratio of the predicted box and the true box, $\rho$ stands for the Euclidean distance between the centers of the predicted box and the true box, $c$ stands for the diagonal length of the smallest rectangle that encloses both the predicted box and the true box, $\alpha$ represents the weight coefficient, and $\upsilon$ represents the variable that measures the similarity of aspect ratios. The calculation formulas for $\alpha$ and $\upsilon$ are as follows:
$$\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}$$
$$\upsilon = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
where $w^{gt}$ and $h^{gt}$ represent the width and height of the real box, while $w$ and $h$ refer to the width and height of the predicted box.
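For reference, a self-contained sketch of this CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) form is given below; it follows the formulas above rather than the exact YOLOv5 source.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) tensors of boxes as (x1, y1, x2, y2)."""
    # IoU term
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers (rho^2)
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # Squared diagonal of the smallest enclosing box (c^2)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term (v) and its weight (alpha)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```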
YOLOv5 employs the cross-entropy loss function [20] to compute both the confidence loss and category loss. The confidence loss calculation formula is as follows:
$$L_{obj} = -\sum_{i=0}^{P \times P}\sum_{j=0}^{S} I_{ij}^{obj}\left[\hat{C}_{ij}\log(C_{ij}) + (1-\hat{C}_{ij})\log(1-C_{ij})\right] - \lambda_{noobj}\sum_{i=0}^{P \times P}\sum_{j=0}^{S} I_{ij}^{noobj}\left[\hat{C}_{ij}\log(C_{ij}) + (1-\hat{C}_{ij})\log(1-C_{ij})\right]$$
where $P \times P$ stands for the number of grids, $S$ stands for the number of prior boxes corresponding to each grid, $I_{ij}^{obj}$ and $I_{ij}^{noobj}$, respectively, indicate whether the $j$-th prior box in the $i$-th grid contains the target to be detected, $\hat{C}_{ij}$ stands for the confidence of the true box, $C_{ij}$ stands for the confidence of the predicted box, and $\lambda_{noobj}$ is the weight coefficient.
The formula used to compute the classification loss is as follows:
$$L_{cls} = -\sum_{i=0}^{P \times P}\sum_{j=0}^{S} I_{ij}^{obj}\sum_{c \in classes}\left[\hat{P}_{ij}(c)\log(P_{ij}(c)) + (1-\hat{P}_{ij}(c))\log(1-P_{ij}(c))\right]$$
where $c$ represents the category to be detected by the network, $\hat{P}_{ij}(c)$ represents the true category label value corresponding to the prior box, and $P_{ij}(c)$ indicates the likelihood that the predicted box contains an instance of the corresponding category.

3.2. YOLOv5s-M2B Model

To better apply the model to elevators with smaller memory and computing resources, an improved YOLOv5s model based on MobileNetv2 and BiFPN (YOLOv5s-M2B) is proposed, as shown in Figure 3, where M represents the lightweight MobileNetv2 network.
The following is a detailed explanation of the above figure:
  • In the backbone, the MobileNetV2 network is introduced to replace the original YOLOv5s backbone, and the SPP and Focus modules are removed, thereby reducing the model’s parameter count and computational complexity while maintaining detection precision.
  • In the neck, the CBAM attention mechanism is added after each of the four up- and downsampling stages to improve the detection precision of the model.
  • After adding the CBAM attention mechanism, the original Concat operations in the neck are replaced with BiFPN to further improve the recognition performance and detection precision of the model. BiFPN_Add2 denotes the fusion of two feature maps by addition, and BiFPN_Add3 denotes the fusion of three feature maps by addition.

3.2.1. Algorithm Evaluation Indicators

We focus on the evaluation indicators of the model: precision (P), recall (R), mean average precision (mAP), parameters, and GFLOPs.
Precision is the proportion of true positive predictions among all positive predictions, reflecting how many of the samples predicted as positive are actually positive. The calculation formula is as follows:
$$P = \frac{TP}{TP + FP}$$
where $TP$ denotes correctly predicted positive observations and $FP$ denotes falsely predicted positive observations.
Recall is the proportion of correctly predicted positive samples among all true positive samples, reflecting the model’s ability to find all positive samples. The calculation formula is as follows:
$$R = \frac{TP}{TP + FN}$$
where $FN$ denotes falsely predicted negative observations.
Mean average precision is a comprehensive measure that averages the precision across multiple categories, and it is the main evaluation indicator for object detection algorithms. A superior mAP value signifies improved detection performance of the object detection model on the specified dataset. The calculation formula is as follows:
$$mAP = \frac{\sum_{i=1}^{K} AP_i}{K}$$
where $AP_i = \int_0^1 P_i(R_i)\,dR$ is the average precision of class $i$, $P_i$ is the precision of class $i$, $R_i$ is the recall of class $i$, $K$ is the number of classes, and $i \in [1, K]$.
mAP@.5 refers to the mAP across all categories when the intersection-over-union (IoU) threshold is set to 0.5, and mAP@.5:.95 refers to the mean of the mAP values computed at IoU thresholds in the range [0.5, 0.95] with a step size of 0.05.
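As an illustration of the AP integral above, the sketch below numerically integrates a precision-recall curve and averages per-class APs into an mAP; the two classes and their values are hypothetical.

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP = integral of P(R) over R, computed on a discrete curve.
    `recalls` must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Standard interpolation: make precision monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Hypothetical two-class example; mAP is the mean of the per-class APs.
ap_ebike = average_precision(np.array([0.2, 0.6, 0.9]), np.array([1.0, 0.9, 0.8]))
ap_person = average_precision(np.array([0.3, 0.7, 0.95]), np.array([0.95, 0.85, 0.75]))
print("mAP:", (ap_ebike + ap_person) / 2)
```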
Parameters represent the number of adjustable parameters that need to be learned by the model. Typically, larger models have a higher number of parameters and require more computational resources.
GFLOPs is a measure of computational complexity, representing the number of floating-point operations (in billions) required by the model for one forward pass.
As the ultimate goal of this article is to deploy the algorithm on devices with limited computing resources, it is necessary to ensure that the improved model has a lower number of parameters and computational complexity.

3.2.2. MobileNetV2

The most notable feature of MobileNetV2 is its use of an inverted residual structure, which is the opposite of the residual structure proposed by ResNet [21]. This structure reduces the network’s complexity and improves its robustness. Therefore, we use MobileNetV2 for lightweight processing, as shown in Table 1, where t indicates the channel expansion factor, c indicates the number of output channels, n represents the number of repetitions of the unit, and s stands for the stride.
In our model, we replace the activation function ReLU6 in MobileNetV2 with SiLU, which features unbounded upper limits, defined lower bounds, smoothness, and non-monotonic behavior. Figure 4 shows the activation function plots for ReLU6 and SiLU.
SiLU involves fewer operations in its computation process compared to ReLU6, resulting in relatively lower computational cost. This means that in resource-constrained environments such as edge devices, inference can be more efficient. Additionally, SiLU provides better gradient propagation, which helps improve the stability and convergence of the training process. Finally, using the SiLU activation function enhances the performance and generalization capability of the model. Table 2 shows the testing precision, recall, mAP@.5, and mAP@.5:.95 for SiLU and ReLU6 in the model. The improved YOLOv5s based on MobileNetV2 is denoted as YOLOv5s-MobileNetV2.
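A minimal sketch of the swap is shown below, using the torchvision MobileNetV2 as a stand-in for the backbone; the paper's model integrates MobileNetV2 directly into YOLOv5s, so this only illustrates the activation replacement itself.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

# SiLU(x) = x * sigmoid(x); ReLU6(x) = min(max(x, 0), 6).
x = torch.linspace(-6.0, 6.0, steps=5)
print(nn.SiLU()(x), nn.ReLU6()(x))

def relu6_to_silu(module: nn.Module) -> None:
    """Recursively replace every ReLU6 activation with SiLU, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU6):
            setattr(module, name, nn.SiLU(inplace=True))
        else:
            relu6_to_silu(child)

backbone = mobilenet_v2().features   # untrained torchvision stand-in backbone
relu6_to_silu(backbone)
```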

3.2.3. CBAM

CBAM [22] is a typical representative of channel and spatial mixed attention mechanisms, aimed at addressing the constraints of conventional convolutional neural networks when handling information across various scales, shapes, and orientations. CBAM includes a channel attention module (CAM) and a spatial attention module (SAM). By embedding these two modules into different layers of the CNN network, the network focuses on the channel and spatial levels of the feature map. Figure 5 shows its network structure.
The channel attention module is designed to boost the expression of features within each channel. The following are the specific steps for implementation:
  • Global max pooling and average pooling: In the input feature map, global max and average pooling operations are executed on each channel to calculate the maximum and average feature values on each channel, resulting in two vectors representing the global maximum and average features of each channel.
  • Fully connected layer: This layer receives the processed feature vectors and passes them to a shared fully connected layer to learn the attention weights of each channel. The network automatically determines which channels are more relevant to the current task through learning.
  • Sigmoid activation: Employing the Sigmoid activation function ensures that the channel attention weights are constrained within the range of 0 to 1.
  • Attention weighting: By utilizing the obtained attention weights and multiplying them one by one with each channel of the original feature map, a channel feature map that has undergone attention weighting processing is obtained. This process highlights channels that are beneficial to the current task while suppressing channels that are not related to the task.
The spatial attention module is designed to emphasize the importance of different positions in the image. The specific implementation steps are as follows (a minimal sketch of both CBAM modules follows the list):
  • Maximum pooling and average pooling: Within the input feature map, max pooling and average pooling operations are performed across the channel dimension to extract features representing various contextual scales.
  • Connection and convolution: The processed features are merged along the channel dimension to form a feature map containing contextual information at different scales. Next, the feature map is processed through convolutional layers to generate spatial attention weights.
  • Sigmoid activation: Similar to the channel attention module.
  • Attention weighting: By utilizing the generated spatial attention weights and applying them to the original feature map, the features of each spatial position are weighted to highlight important regions in the image while reducing the impact of unimportant regions.
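As noted above, the following is a compact PyTorch sketch of both CBAM modules. It uses 1x1 convolutions as the shared fully connected layers and a reduction ratio of 16, which are common defaults rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global max- and average-pooled descriptors pass through a shared MLP,
    are summed, and squashed to per-channel weights in (0, 1)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx) * x          # reweight each channel

class SpatialAttention(nn.Module):
    """Channel-wise max and mean maps are concatenated and convolved to a
    single-channel spatial weight map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        mx, _ = torch.max(x, dim=1, keepdim=True)
        avg = torch.mean(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1))) * x

class CBAM(nn.Module):
    """Channel attention first, then spatial attention, as in the CBAM paper."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```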

3.2.4. BiFPN

To increase the model’s detection performance, we introduce BiFPN [23] for multi-scale feature fusion. Figure 6c shows the network structure of BiFPN, which is an improvement on PAN. This neural network architecture is utilized for object detection and segmentation tasks within computer vision.
Compared to other feature fusion networks, BiFPN incorporates three optimizations (a sketch of its weighted fusion operation follows the list):
  • Nodes with only one input are removed, making the network structure simpler.
  • By adding an edge between the original input node and the output node, more features can be fused without significant additional cost.
  • The top-down and bottom-up paths are integrated into one module to achieve higher-level characteristic fusion.
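To illustrate the weighted fusion used by the BiFPN_Add2 and BiFPN_Add3 blocks in Figure 3, here is a minimal sketch of the fast normalized fusion operation; the learnable-weight formulation follows the BiFPN paper, not the authors' exact code.

```python
import torch
import torch.nn as nn

class BiFPNAdd(nn.Module):
    """Learnable, fast-normalized weighted sum of 2 or 3 feature maps of the
    same shape, mirroring the BiFPN_Add2 / BiFPN_Add3 blocks in Figure 3."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)                  # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)            # fast normalized fusion
        return sum(wi * xi for wi, xi in zip(w, inputs))

# Usage: fuse two feature maps from different pyramid levels (already resized
# to a common spatial resolution and channel count).
fuse2 = BiFPNAdd(num_inputs=2)
p4_td = fuse2([torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)])
```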

4. Experiments

4.1. Experimental Settings

Training Environment: The PyTorch deep learning framework is utilized, with CUDA 11.0 and CUDNN 11.0 in GPU mode, Windows 10 with an Intel (R) Core (TM) i7-11800H processor, 16.0 GB of RAM, and an NVIDIA GeForce RTX 3060 GPU. The training parameters are detailed in Table 3.
In Table 3, “epochs” refers to the total number of training epochs; “batch size” refers to the number of images fed to the model in each batch; “workers” refers to the number of data-loading processes; “mosaic” controls mosaic data augmentation, and setting its value to 1 enables this augmentation to improve the generalization performance of the model; “weight decay” refers to the weight decay coefficient, which helps avoid unstable gradients or losses during the initial training stage; “learning rate” refers to the initial learning rate, set here to 0.01; and “image size” refers to the size of the training and testing images.
As publicly available datasets do not include images of electric bicycles inside elevators, a dataset of 3900 images was collected through online searches, web crawling, and field photography. The images were annotated in the YOLO format using the LabelImg software, with the classes electric bicycle, bicycle, and person. The dataset was split into training, validation, and testing sets in an 8:1:1 ratio.
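The 8:1:1 split can be reproduced with a few lines of Python; the file names below are placeholders, not the actual dataset files.

```python
import random

random.seed(0)
images = [f"img_{i:04d}.jpg" for i in range(3900)]      # placeholder file names
random.shuffle(images)
n_train, n_val = int(0.8 * len(images)), int(0.1 * len(images))
train = images[:n_train]
val = images[n_train:n_train + n_val]
test = images[n_train + n_val:]
print(len(train), len(val), len(test))                  # 3120 390 390
```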
Raspberry Pi: We use the Raspberry Pi 4B as the deployment platform; Table 4 shows its detailed parameters.

4.2. Experimental Results

4.2.1. Training Results

Due to the limitations of traditional algorithms, the goal of this article is to design an object detection algorithm with high detection precision, a low number of parameters, and low computational complexity. Therefore, when comparing models, we compare precision, recall, mAP@.5, mAP@.5:.95, parameters, and GFLOPs, which intuitively reflect the advantages of the improved algorithm and the design objectives of this article. The training results after sequentially adding modules are shown in Table 5.
It can be observed that after adding MobileNetV2, although the precision and other evaluation indicators decreased, the number of parameters and computational complexity decreased significantly, by 58.4% and 50.6%, respectively, which meets the requirements for deployment on devices with limited storage and computing resources. After the introduction of CBAM and BiFPN, the parameters and computational cost changed little, but the precision and other evaluation indicators improved significantly, becoming essentially the same as those of the original YOLOv5s model.
Figure 7 provides a more intuitive representation of the above data. We can see that after 100 epochs of training, the mAP@.5 values for YOLOv5s, YOLOv5s-MobileNetV2, and YOLOv5s-M2B are all above 0.8, while the mAP@.5:.95 values are approximately 0.6. Specifically, after introducing MobileNetV2 and training, the values of all four evaluation metrics decreased, with the mAP@.5 and mAP@.5:.95 values showing the most significant drops. Afterward, with the addition of CBAM and BiFPN, the precision and other evaluation indicators all improved, and the final curves approached those of YOLOv5s, with higher precision and recall values. This indicates that combining MobileNetV2 with CBAM and BiFPN can recover evaluation metrics such as precision while substantially reducing the number of parameters and computational complexity.
The experimental results indicate that the improved YOLOv5s-M2B model can meet the requirements for combining lightweight design and precise detection performance.

4.2.2. Testing Results

To better validate the YOLOv5s-M2B detection algorithm and achieve its deployment in elevators, we used a Raspberry Pi 4B and a buzzer to detect electric bicycles in elevators. The implementation steps are as follows (a sketch of the Pi-side alarm logic follows the list):
  • Extract each frame from the video and pass it to the YOLOv5s-M2B model as an image. The model recognizes the received images and labels them with “person”, “cycle”, and “electric cycle”.
  • Establish an interface between the Raspberry Pi 4B and the PyCharm Professional 2023.3.5 software. Ensure that the Raspberry Pi 4B and PyCharm are on the same local area network, add an SSH interpreter to PyCharm, input the hostname and username of the Raspberry Pi to be connected, and enter the correct password to connect the two.
  • Use the established interface to transfer the annotated images to the Raspberry Pi 4B.
  • Check the annotated images received by the Raspberry Pi 4B. If an image contains an “electric cycle” annotation, a buzzer sounds an alarm; otherwise, no action is taken. While checking images, the Raspberry Pi 4B also shows them on the display screen, making it easier for elevator administrators to monitor the situation inside the elevator.
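As referenced above, a minimal sketch of the Pi-side alarm logic is shown below; the GPIO pin, the gpiozero library, and the label-list format are assumptions for illustration and are not stated in the paper.

```python
from gpiozero import Buzzer

ALARM_LABEL = "electric cycle"
buzzer = Buzzer(17)  # hypothetical wiring: buzzer signal pin on GPIO 17

def handle_detections(labels):
    """labels: list of class names detected in the annotated frame received
    from the host, e.g. ["person", "electric cycle"]."""
    if ALARM_LABEL in labels:
        buzzer.beep(on_time=0.5, off_time=0.5, n=3)   # sound the alarm
    # Otherwise take no action; the frame is still shown on the display.
```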
To further validate the effectiveness of the model for real-time applications, we measured the detection delay and the frames per second (FPS) separately. The detection delay refers to the time taken by the entire process from inputting data to the model to generating the output results. A smaller detection delay indicates that the model can respond promptly in real-time scenarios and therefore achieve a higher frame rate. FPS measures the number of frames the model processes per second, reflecting its running speed: the higher the FPS value, the faster the model runs. According to our measurements, the average detection delay per image and the FPS value are 0.096 s and 88.601, respectively. This indicates that our model runs fast and can respond promptly in real-time scenarios.
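The two quantities can be measured with a simple timing loop such as the sketch below, where `infer` stands for a single-frame inference call (a hypothetical placeholder for the deployed model).

```python
import time

def measure_latency(infer, frames):
    """Return (average per-frame delay in seconds, resulting FPS)."""
    start = time.perf_counter()
    for frame in frames:
        infer(frame)                       # run the model on one frame
    elapsed = time.perf_counter() - start
    delay = elapsed / len(frames)
    return delay, 1.0 / delay
```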
Figure 8 shows the testing results.

4.3. Comparison Experiment

To further verify the validity of YOLOv5s-M2B, we conducted comparative experiments with Faster-RCNN, SSD, YOLOv3, and YOLOv8 on the same dataset. Table 6 shows the comparison results.
According to the comparison results, the YOLOv5s-M2B model achieves higher detection precision than Faster-RCNN, SSD, and YOLOv3, with an mAP@.5 of 95.9%. Although its mAP@.5 and mAP@.5:.95 values are not the best overall, its number of parameters and computational complexity are the smallest. Therefore, this model is more suitable for detecting electric bicycles in elevators. Based on the data in Table 6, we can also infer that the improved model could be applied to detect other types of vehicles, such as bicycles, motorcycles, and baby strollers, and could be deployed on embedded devices in other enclosed spaces, giving it a wide range of applications.

5. Conclusions and Limitations

5.1. Article Conclusions

In this article, we present a lightweight method based on YOLOv5s for detecting electric bicycles in elevator scenes, enabling the network model to be deployed on embedded edge devices with small storage space and limited computing resources. MobileNetV2 is used to make the YOLOv5s model lightweight; it is combined with the CBAM attention mechanism and BiFPN feature fusion to improve detection performance, yielding the YOLOv5s-M2B model, which combines high precision with a low parameter count and low computational complexity. The results show that, compared with the original YOLOv5s, the improved YOLOv5s-M2B model reduces the number of parameters and the computational complexity by 58.4% and 50.6%, respectively, while achieving a high average detection precision of 95.9%. Finally, we tested the improved algorithm on a Raspberry Pi 4B and achieved good detection results, further demonstrating the feasibility of deploying the algorithm on embedded devices.

5.2. Design Limitations

This article conducts relevant research on the detection of electric vehicles in elevators and improves the YOLOv5s object detection algorithm, achieving good detection and recognition performance. However, there are still some design limitations:
  • The application scenarios of algorithms have limitations. This article only focuses on the detection of electric vehicles inside elevators and is not yet able to detect electric vehicles in other scenarios, such as roads and parking lots.
  • The dataset studied is limited. The dataset images used in this article have certain shortcomings, and more datasets related to electric vehicles in elevators need to be introduced to enable the algorithm to adapt to elevator environments with occlusions, blurring, and variable backgrounds, thereby enhancing the universality of the model.
  • The model is not novel enough. The model used in this article was proposed previously, and with the rapid development of the manufacturing industry today, there may be a mismatch between the older model and newly emerging embedded devices.

Author Contributions

Conceptualization, X.Y. and C.W.; Methodology, X.Y.; Software, X.Y.; Validation, Z.Z.; Formal analysis, Z.Z.; Investigation, X.Y.; Resources, C.W.; Data curation, Z.Z.; Writing—original draft, Z.Z.; Writing—review and editing, C.W.; Visualization, Z.Z. and X.Y.; Supervision, C.W.; Project administration, C.W.; Funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the first batch of “Pioneer” R&D Programs of Zhejiang Province in 2023 under grant 2023C01041.

Data Availability Statement

The data that support the findings of this study are openly available at http://43.142.54.61:8090/, using the login account Ebike and password Ebike (accessed on 17 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, J.; Xu, Z.; Fan, N.W.; Wang, Y.; Mo, W. The combination mode of forest and SVM for power network disaster response failure identification. Comput. Electr. Eng. 2024, 117, 109255.
  2. Girshick, B.R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  3. Shihabuddin, A.R.; Beevi, S. Efficient Mitosis Detection: Leveraging Pre-trained Faster R-CNN and Cell-level Classification. Biomed. Phys. Eng. Express 2024, 10, 025031.
  4. Alruwaili, M.; Siddiqi, H.M.; Atta, N.M.; Arif, M. Deep Learning and Ubiquitous Systems for Disabled People Detection Using YOLO Models. Comput. Hum. Behav. 2024, 154, 108150.
  5. Hao, M.; Sun, Q.; Xuan, C.; Zhang, X.; Zhao, M.; Song, S. Lightweight Small-Tailed Han Sheep Facial Recognition Based on Improved SSD Algorithm. Agriculture 2024, 14, 468.
  6. Jing, C.; Rongjie, W.; Anhui, L.; Jiang, D.; Wang, Y. A Feature Enhanced RetinaNet-Based for Instance-Level Ship Recognition. Eng. Appl. Artif. Intell. 2023, 126, 107133.
  7. Rahmat, H.; Wahjuni, S.; Rahmawan, H. Performance Analysis of Deep Learning-based Object Detectors on Raspberry Pi for Detecting Melon Leaf Abnormality. Int. J. Adv. Sci. Eng. Inf. Technol. 2022, 12, 386–391.
  8. Arrieta-Rodríguez, E.; Murillo, F.L.; Arnedo, M.; Caicedo, A.; Fuentes, M.A. Prototype for identification of vehicle plates and character recognition implemented in Raspberry Pi. IOP Conf. Ser. Mater. Sci. Eng. 2019, 519, 012028.
  9. Liu, L.; Hang, Y.; Chen, R.; He, X.; Jin, X.; Wu, D.; Li, Y. LDSG-Net: An efficient lightweight convolutional neural network for acute hypotensive episode prediction during ICU hospitalization. Physiol. Meas. 2024, 45, 065003.
  10. Yang, S.; Zhang, L.; Lin, J.; Cernava, T.; Cai, J.; Pan, R.; Zhang, X. LSGNet: A lightweight convolutional neural network model for tomato disease identification. Crop Prot. 2024, 182, 106715.
  11. Yi, Z.; Hancheng, H.; Zhixiang, L.; Yiwang, H.; Lu, M. Intelligent garbage classification system based on improve MobileNetV3-Large. Connect. Sci. 2022, 34, 1299–1321.
  12. Zhao, Z.; Li, S.; Wu, C.; Wei, X. Research on the rapid recognition method of electric bicycles in elevators based on machine vision. Sustainability 2023, 15, 13550.
  13. Li, H.; Zhuang, X.; Bao, S.; Chen, J.; Yang, C. SCD-YOLO: A lightweight vehicle target detection method based on improved YOLOv5n. J. Electron. Imaging 2024, 33, 023041.
  14. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 20–26 October 2019; pp. 3007–3016.
  15. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
  16. Bolton, C.; Davies, J. A Singleton Failures Semantics for Communicating Sequential Processes. Form. Asp. Comput. 2006, 18, 181–210.
  17. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
  18. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: New York, NY, USA, 2020; pp. 12993–13000.
  19. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520.
  20. Ruby, U.; Yendapalli, V. Binary Cross Entropy with Deep Learning Technique for Image Classification. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 5393–5397.
  21. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
  22. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521.
  23. Zhang, K.; Sun, M.; Han, T.X.; Yuan, X.; Guo, L.; Liu, L.; Tao, L. Residual Networks of Residual Networks: Multilevel Residual Networks. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 1303–1314.
Figure 1. YOLOv5s network architecture.
Figure 2. Focus structure diagram of the model.
Figure 3. YOLOv5s-M2B network architecture.
Figure 4. Activation function plots for SiLU and ReLU6.
Figure 5. CBAM network architecture.
Figure 6. Structure diagram of FPN, PAN and BiFPN.
Figure 7. The training results.
Figure 8. The testing results.
Table 1. The network structure of MobileNetV2.

| Input | Operator | t | c | n | s |
|---|---|---|---|---|---|
| 640² × 3 | conv2d | - | 32 | 1 | 2 |
| 320² × 32 | bottleneck | 1 | 16 | 1 | 1 |
| 320² × 16 | bottleneck | 6 | 24 | 2 | 2 |
| 160² × 24 | bottleneck | 6 | 32 | 3 | 2 |
| 80² × 32 | bottleneck | 6 | 64 | 4 | 2 |
| 40² × 64 | bottleneck | 6 | 96 | 3 | 1 |
| 40² × 96 | bottleneck | 6 | 160 | 3 | 2 |
| 20² × 160 | bottleneck | 6 | 320 | 1 | 1 |
Table 2. Algorithm evaluation indicators for different models.

| Indicator | SiLU | ReLU6 |
|---|---|---|
| Precision (%) | 91.2 | 88.8 |
| Recall (%) | 90.7 | 87.4 |
| mAP@.5 (%) | 94.5 | 92.7 |
| mAP@.5:.95 (%) | 62.6 | 61.5 |
Table 3. Experimental parameter configurations.

| Parameter Name | Numerical Value |
|---|---|
| epochs | 100 |
| batch size | 2 |
| workers | 8 |
| mosaic | 1.0 |
| weight decay | 0.0005 |
| learning rate | 0.01 |
| image size | 640 × 640 |
Table 4. Raspberry Pi 4B detailed parameters.

| Name | Parameter |
|---|---|
| SoC | BCM2711 |
| CPU | ARM Cortex-A72, 1.5 GHz |
| Memory | 1 GB/2 GB/4 GB LPDDR4 |
| Number of USB ports | 2 × USB 3.0, 2 × USB 2.0 |
| Video output | 2 micro HDMI ports, 2-lane MIPI DSI display port |
| Power input | 5 V USB Type-C |
Table 5. Training results for different models.

| Indicator | YOLOv5s | YOLOv5s-MobileNetV2 | YOLOv5s-M2B |
|---|---|---|---|
| Precision (%) | 95.2 | 88.8 | 94.3 |
| Recall (%) | 89.2 | 87.4 | 85.2 |
| mAP@.5 (%) | 97.2 | 92.7 | 95.9 |
| mAP@.5:.95 (%) | 68.2 | 61.5 | 66.4 |
| Parameters (M) | 7.02 | 2.92 | 3.03 |
| GFLOPs (G) | 15.8 | 7.8 | 8.7 |
Table 6. Results of comparison experiment.

| Model | mAP@.5 (%) | mAP@.5:.95 (%) | Parameters (M) | GFLOPs (G) |
|---|---|---|---|---|
| Faster-RCNN | 85.6 | 51.6 | 137.099 | 370.210 |
| SSD | 92.5 | 58.3 | 26.285 | 62.747 |
| YOLOv3 | 92.7 | 57.6 | 61.95 | 66.17 |
| YOLOv5s | 97.2 | 68.2 | 7.02 | 15.8 |
| YOLOv8 | 97.8 | 88.5 | 11.137 | 28.7 |
| YOLOv5s-M2B | 95.9 | 66.4 | 3.03 | 8.7 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation: Zhang, Z.; Yang, X.; Wu, C. An Improved Lightweight YOLOv5s-Based Method for Detecting Electric Bicycles in Elevators. Electronics 2024, 13, 2660. https://doi.org/10.3390/electronics13132660
