Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots

Wei, Zejiong; Tian, Feng; Qiu, Zhehang; Yang, Zhechen; Zhan, Runyang; Zhan, Jianming

doi:10.3390/act12080334

Open AccessArticle

Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots

by

Zejiong Wei

^1,2

,

Feng Tian

³,

Zhehang Qiu

²,

Zhechen Yang

²,

Runyang Zhan

¹ and

Jianming Zhan

^1,*

¹

School of Mechatronics and Energy Engineering, NingboTech University, Ningbo 315100, China

²

School of Mechanical Engineering, Zhejiang University, Hangzhou 310023, China

³

Ningbo Ruyi Joint Stock Co., Ltd., Ningbo 315100, China

^*

Author to whom correspondence should be addressed.

Actuators 2023, 12(8), 334; https://doi.org/10.3390/act12080334

Submission received: 3 July 2023 / Revised: 12 August 2023 / Accepted: 14 August 2023 / Published: 20 August 2023

(This article belongs to the Special Issue Design and Control of Actuators for Active Human−Machine Interaction)

Download

Browse Figures

Versions Notes

Abstract

:

In recent years, the global cold chain logistics market has grown rapidly, but the level of automation remains low. Compared to traditional logistics, automation in cold storage logistics requires a balance between safety and efficiency, and the current detection algorithms are poor at meeting these requirements. Therefore, based on YOLOv5, this paper proposes a recognition and grasping system for cartons in cold storage warehouses. A human–machine interaction system is designed for the cold storage environment, enabling remote control and unmanned grasping. At the algorithm level, the CA attention mechanism is introduced to improve accuracy. The Ghost lightweight module replaces the CBS structure to enhance runtime speed. The Alpha-DIoU loss function is utilized to improve detection accuracy. With the comprehensive improvements, the modified algorithm in this study achieves a 0.711% increase in mAP and a 0.7% increase in FPS while maintaining accuracy. Experimental results demonstrate that the CA attention mechanism increases fidelity by 2.32%, the Ghost lightweight module reduces response time by 13.89%, and the Alpha-DIoU loss function enhances positioning accuracy by 7.14%. By incorporating all the improvements, the system exhibits a 2.16% reduction in response time, a 4.67% improvement in positioning accuracy, and a significant overall performance enhancement.

Keywords:

cold chain warehousing; human–machine interaction; machine vision; object detection; YOLOv5

1. Introduction

With the continuous advancement of science and technology and the improvement of human living standards, cold storage warehouses, as buildings that can maintain a constant low temperature, are widely used in various fields such as food, chemistry, pharmaceuticals, vaccines, plasma, and scientific experiments [1]. Due to reasons related to food, environmental protection, and biosafety, most cold storage warehouses have strict hygiene and protective requirements. However, the process of staff entering and exiting the warehouse involves multiple disinfections and wearing protective clothing, which is time-consuming and cumbersome. The frequent entry and exit of personnel also lead to drastic heat and moisture exchange between the interior of the cold storage warehouse and the external environment, resulting in temperature fluctuations, frost formation, and potential damage to stored items. Moreover, for the safety of workers and biological samples, it is inconvenient for personnel to enter the cold storage warehouse for cargo loading and unloading, handling, inspections, and equipment maintenance. Therefore, there is an urgent need for unmanned remote operation and human–machine interaction in cold storage warehouses [2].

In a cold storage environment, low temperatures can pose a threat to the camera, motors, and controllers, leading to reduced operational lifespans. Therefore, it is essential to ensure precision while maximizing the real-time performance of the system. Consequently, rapid identification and grasping of cartons are key research areas [3], as their intelligence and accuracy will impact the costs and overall competitiveness of cold storage enterprises. Thus, conducting research in this area holds significant importance.

Based on convolutional neural networks, object detection algorithms can be classified into two types based on the presence of region proposals: Two-Stage object detection algorithms and One-Stage object detection algorithms. Two-Stage object detection algorithms first generate a series of candidate boxes as examples and then classify these boxes using a convolutional neural network. Representative algorithms include RCNN [4], Fast RCNN [5], Faster RCNN [6], SPPNet [7], and others. One-Stage algorithms do not generate candidate boxes; instead, they directly input the image into the convolutional neural network, transforming the problem of object boundary localization into a regression problem. The YOLO series represents the One-Stage approach and stands out for its speed and real-time detection capabilities [8]. YOLOv5 inherits the advantages of its predecessors and has significantly improved accuracy with model updates.

Many researchers have made improvements to the YOLOv5 algorithm for different scenarios. Chen et al. transformed various task requirements into a unified object localization task and proposed a self-template method to fill the image boundaries, improving the model’s generalization ability and speed [9]. Chen integrated image feature classification and improved the algorithm’s output structure through model pruning and classification [10]. Both of these algorithms have high complexity and low efficiency in single-class object recognition scenarios. Zhang et al. used intersection over union (IoU) as a distance function to improve detection speed. They also enhanced the accuracy of network detection regression through transfer learning and the CoordConv feature extraction method [11]. Chen et al. performed Mosaic-9 augmentation on the dataset and replaced the ResNet feature extraction network with MobileNet V3 Small, improving the feature extraction speed for small target samples [12]. Karoll et al. used a bidirectional feature pyramid network as an aggregation path and introduced the SimAM attention module for feature extraction, improving detection accuracy [13]. However, these three improvements are more suitable for scenes with high background complexity and are not suitable for storage recognition. Li et al. extended and iterated the shallow cross-stage partial connection (CSP) module and introduced an improved attention module in the residual block [14]. Zhou et al. introduced residual connections and weighted feature fusion to improve detection efficiency. They also combined Transformer modules to enhance information filtering rate [15]. These two methods have significant advantages for specific datasets but lack universality. Mohammad et al. improved the algorithm’s ability to detect small objects by adding shallow high-resolution features and changing the size of the output feature maps [16]. Zhou et al. added detection branches in the middle and head blocks of YOLOv5 to improve local feature acquisition. They also enhanced detection accuracy by adding CBAM and DA attention modules [17]. Xiao et al. modified the network’s width and depth for detecting small objects in high-resolution images, improving detection speed [18]. These three methods mainly focus on improving the detection of small-sized objects and do not fully meet the requirements of storage.

Although many studies have made improvements to the YOLOv5 algorithm, it is challenging to directly apply them to the recognition of stacked cartons in a cold storage warehouse environment. This article aims to address the need for high accuracy and real-time performance in cold storage logistics. The research content is as follows:

(1): By integrating the CA attention mechanism, meaningful features on the channel and spatial axes are extracted to enhance the correlation representation of target information between different channels in the feature map.
(2): By introducing the lightweight Ghost module, the model parameters are compressed, maintaining detection accuracy and speed, and facilitating subsequent deployment on mobile embedded devices.
(3): By optimizing the loss function and replacing the original network’s GIoU with Alpha-DIoU, faster convergence can be achieved, and the predicted boxes can be closer to the ground truth, improving localization accuracy.

The writing process is as follows:

(1): Chapter 2 designs a human–machine interactive control system, reasonably arranges various modules, and explains the system modules and composition structure.
(2): Chapter 3 proposes an improved recognition algorithm and conducts testing and evaluation of the enhanced algorithm.
(3): Chapter 4 conducts practical tests on the control system and algorithm. An experimental platform is set up to analyze the system’s performance from fidelity, response time, and accuracy perspectives, demonstrating the overall improvement in the control system’s performance.
(4): Chapter 5 provides a reasonable summary of the work presented in this paper.

2. Human–Machine Interaction Control System

The system configuration is shown in Figure 1. Firstly, an image acquisition is performed using a camera and transmitted to the PC. On the PC platform, using the Windows operating system, the YOLOv5 object detection model is run to detect and recognize objects in the images. After converting the coordinates of the target objects, they are transmitted to the lower-level machine using a serial communication protocol. Based on the received coordinate information, the gripper is accurately moved to the specified position by controlling the motor and air pump through instructions. The target localization and gripping are achieved by using a suction cup.

This paper’s control system consists of six subsystems: communication, capture, inference, electrical control, pneumatic control, and system. The framework diagram is shown in Figure 2.

The communication subsystem is a key component of the system, responsible for data transmission and information exchange between the upper computer and other devices. Serial communication is used to exchange data and control instructions with each subsystem, enabling their collaborative work. The capture subsystem is responsible for invoking the camera device to capture images and obtain clear and accurate image data for subsequent analysis and processing. The inference subsystem refers to the use of deep learning models in the system for tasks such as object detection and recognition on the captured images or videos. The electrical control subsystem involves motor control and driving, responsible for controlling the operation of motors in the system. It enables automation control and precise motion execution. The pneumatic control subsystem involves gas control and driving, used to control and adjust the flow, pressure, and operation of gases. It performs the motion control of the suction cup and object gripping actions. The system subsystem includes the overall system switch and reset settings.

3. Control Algorithm Design

3.1. Introduction to YOLOv5

YOLOv5 divides an image into an S × S grid and generates several candidate boxes adaptively within each grid cell. The parameters of these boxes are then calculated to obtain information such as center point, width, height, and confidence. Finally, object prediction is performed to obtain the results. The network consists of four parts: the input, backbone, neck, and head.

The input module receives the image as input and employs the Mosaic data augmentation technique. It randomly crops, arranges, and concatenates four images to generate a new input image. Adaptive anchor box calculation and adaptive image scaling are used to adapt to anchor boxes and image sizes in different training datasets. Taking YOLOv5s as an example, each image is scaled to a size of 640 × 640. These improvements enrich the model, enhance its robustness, and improve training speed and detection capabilities. The backbone is a critical part of feature extraction, consisting of modules such as Focus, CBS, C3, and SPPF. The Focus module slices the image, expands the input channels by 4 times, and performs convolution operations to achieve feature extraction and downsampling. The CBS module is a convolutional layer that introduces non-linearity through BN batch normalization and SiLU activation function to avoid overfitting. The C3 module is used to further learn more features. The SPPF module, short for spatial pyramid pooling, is used to enlarge the receptive field and fuse information from feature maps of different scales to achieve feature fusion.

The neck module is mainly used for feature fusion and adopts the FPN-PAN network structure. It transmits localization features in a bottom-up manner, enabling the feature maps to contain both detailed and semantic information. The neck module is responsible for feature extraction and transfers these features to the output module, where the output layer generates object boxes and class confidences. To address the non-overlapping boundary issue, the GIOU (Generalized IOU) Loss is used as the regression box prediction loss function. The non-maximum suppression (NMS) method is applied to remove low-scored prediction boxes and select the best results. The output module generates feature maps at three scales, performing downsampling by a factor of 8, 16, and 32 on the original image. The output module includes classification loss, localization loss, and confidence loss, which are used to predict small, medium, and large objects.

3.2. Model Comparison

After thorough comparison and consideration, we have decided to use YOLOv5 instead of YOLOv8 as the object detection model for this project. While YOLOv8 shows some performance improvements in certain aspects, YOLOv5 remains better suited for our current needs. Firstly, YOLOv5 exhibits faster inference speed, reaching up to 62.5 FPS under equivalent hardware conditions, whereas YOLOv8 achieves 54.4 FPS. Secondly, the average precision of both models on the COCO test set is quite close, with 47.0 and 47.2 mAP for YOLOv5 and YOLOv8, respectively, indicating minimal differences.

However, YOLOv5 has a parameter count of only 41M, making it more lightweight compared to YOLOv8’s 52M, which is more resource-friendly for our storage and computational resources. Moreover, YOLOv5 supports a wider range of input resolutions, ranging from 640 to 1280, for both training and inference. In contrast, YOLOv8 requires a minimum resolution of 1280, which places higher demands on the input images. Additionally, YOLOv5 offers various data augmentation techniques that can enhance model robustness, providing valuable assistance to us.

Furthermore, YOLOv5 boasts a more active developer community and a wealth of application cases, making it easier to obtain support and reference existing experiences. In summary, considering factors such as accuracy, speed, and resource requirements, we believe YOLOv5 is the superior choice. While YOLOv8 also has its merits, YOLOv5 can deliver the detection performance we need for the current project, while offering greater stability, flexibility, and ease of use.

3.3. Improvements to YOLOv5

A series of improvements were made to the YOLOv5 base, and the improved network structure is shown in Figure 3.

3.3.1. Addition of Coordinate Attention (CA) Mechanism

The Coordinate Attention (CA) mechanism incorporates position information into channel attention. It not only captures inter-channel information but also captures direction and position-aware information, which helps the model to locate and recognize the target of interest more accurately [19]. The CA mechanism consists of two steps: coordinate information embedding and coordinate attention generation, encoding precise positional information for channel relationships and long-range dependencies. The structure of the CA mechanism is shown in Figure 4.

For the input feature tensor,

X = [x_{1}, x_{2}, \dots, x_{C}] {\in R}^{C \times H \times W}

, where C, H, and W represent the number of channels, height, and width of the input feature map, respectively. CA encodes each channel by pooling kernels of size (H, 1) and (1, W) along the horizontal and vertical coordinates, respectively. The output for the c-th channel at height h and width w is given by the following equation:

z_{c}^{h} (h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c} (h, i)

(1)

z_{c}^{w} (w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c} (j, w)

(2)

The two transformations mentioned above aggregate features along two spatial directions to obtain a pair of direction-aware feature maps, namely X Avg Pool and Y Avg Pool in Figure 4. These feature maps are then concatenated and undergo a convolutional transformation F1 using a shared 1 × 1 convolutional kernel, as shown in the following equation:

f = δ (F_{1} ([z^{h}, z^{w}]))

(3)

In the equation, [·,·] represents the concatenation operation along the spatial dimension, δ is a non-linear activation function, and f refers to the intermediate feature map obtained from spatial information in the horizontal and vertical directions. The feature map f is split into two separate tensors,

f^{h}

and

f^{w}

, along the spatial dimension. These tensors are then transformed to have the same number of channels as the input tensor X using two 1 × 1 convolutions, denoted as

F_{h}

and

F_{w}

, as shown in the following equation:

\begin{matrix} g^{h} = σ (F_{h} (f^{h})) \end{matrix}

(4)

g^{w} = σ (F_{w} (f^{w}))

(5)

In the equation,

σ

represents the sigmoid function, which is applied to expand and scale the tensors

g^{h}

and

g^{w}

as attention weights. The final output feature tensor

Y = [y_{1}, y_{2}, \dots, y_{C}]

of the CA module can be expressed as follows:

y_{c} (i, j) = x_{c} (i, j) \times g_{c}^{h} (i) \times g_{c}^{w} (j)

(6)

By introducing the attention mechanism into the detection network, this study focuses the model’s attention on the target objects, thereby improving accuracy. The CA attention mechanism is added to the C3 module of the YOLO backbone network, as shown in Figure 5. The input feature map dimensions are set to 128, 256, 512, and 1024, according to the output size of the original position module. This structure allows the model to effectively utilize the CA attention mechanism and enhance the performance of the YOLO network.

3.3.2. Ghost Module Replacement

Replacing the C3 module in the YOLOv5 backbone network, which has a large number of parameters and slower detection speed, is necessary to achieve lightweight modeling for real-time object detection in embedded platforms for human–machine interaction. To address this, the Ghost module is introduced in this study, enabling a significant reduction in network parameters, model size, and computational speed improvement [20].

The Ghost module aims to alleviate the redundancy issue in traditional convolutional neural networks, which increases computational complexity. By reducing the number of convolutions and using a small number of convolutions to linearly transform features, the Ghost module achieves an increased feature map with fewer parameters, as shown in Figure 6. The process involves splitting the original convolutional layer into two steps; the first step performs convolutional computation using GhostNet, and the second step integrates the features to form the new output.

Given the input feature map

X \in R^{C \times H \times W}

and output feature map

Y \in R^{h^{'} \times w^{'} \times n}

, with a convolutional kernel f of size k × k, the convolutional computation

n \times h^{'} \times w^{'} \times c \times k \times k

is performed. Here, C, H, and W represent the channel number, height, and width of the input feature map, while n, h, and w represent the channel number, height, and width of the output feature map. Linear operations are applied to each original feature map to generate s feature maps, as follows:

y_{i j} = Φ_{i, j} (y_{i}^{'}), \forall i = 1, \dots, m, j = 1, \dots, s,

(7)

In the equation,

y_{i}^{'}

represents the i-th original feature map in Y,

Φ_{i, j}

represents the linear operation, m represents the number of intrinsic feature maps, and

n = m \times s

represents the number of feature maps obtained. The theoretical acceleration ratio can be calculated as follows:

\begin{matrix} r_{s} = \frac{n \cdot h^{'} \cdot w^{'} \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h^{'} \cdot w^{'} \cdot c \cdot k \cdot k + (s - 1) \cdot \frac{n}{s} \cdot h^{'} \cdot w^{'} \cdot d \cdot d} \\ = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s - 1}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s, \end{matrix}

(8)

The parameter compression rate is given by

r_{c} = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s - 1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s,

(9)

In this paper, all three CBS (Convolution-BatchNorm-SiLU) structures in the YOLOv5 architecture are replaced with the CBSGhost module, achieving structural lightweighting, as shown in Figure 7.

3.3.3. Improved Loss Function

During the training process of convolutional neural networks, the network parameters are continuously updated by calculating the error using a loss function. The loss function of YOLOv5 consists of three parts. The BECLogits loss function is used to calculate the confidence

{L o s s}_{c o n f}

, the cross-entropy loss function is used to calculate the classification target

{L o s s}_{c l a s s}

, and the GIoU Loss loss function is used to calculate the regression box prediction

{L o s s}_{l o c}

. The total loss is defined as follows:

L o s s = {L o s s}_{c o n f} + {L o s s}_{c l a s s} + {L o s s}_{l o c}

(10)

But the GIoU Loss still has some issues in certain situations. For example, when two bounding boxes are very close or overlap significantly, the IoU tends to approach 1, but they may not be well-aligned spatially. In such cases, GIoU can be highly misleading and lead to a decrease in detection performance. Therefore, when replacing the GIoU Loss function with the DIoU Loss function, as shown in Figure 8, the problem of GIoU producing larger loss values when the distance between two boxes is large and results in a larger enclosing region is solved. This leads to faster convergence [21]. The calculation formula for GIoU Loss is as follows:

{L o s s}_{D I o U (A, B)} = I o U (A, B) - \frac{ρ^{2} (b, b^{g t})}{c^{2}}

(11)

Among them,

b

and

b^{g t}

represent the center points of the predicted box and the ground truth box,

ρ (\cdot)

represents the Euclidean distance, the calculation result is

d

,

c

represents the diagonal length of the minimum enclosing box, and

I o U

represents the intersection over union between the predicted box and the ground truth box. Alpha-IoU Loss is a powerIoU loss function proposed by He et al. which can be used for accurate bbox regression and object detection [22]. By extending GIoU Loss using Alpha-IoU, the predictive loss function Alpha-DIoU Loss is obtained, and the calculation formula is as follows:

{L o s s}_{A l p h a - D I o U (A, B)} = {1 - I o U}^{α} (A, B) + \frac{ρ^{2 α} (b, b^{g t})}{c^{2 α}}

(12)

By utilizing the hyperparameter

α

, this predictive loss function can achieve different levels of bounding box regression accuracy more flexibly, and it exhibits greater stability for large datasets and noisy data. In this experiment, we set

α = 3

.

3.4. Performance Evaluation

3.4.1. Evaluation Metrics

The results of this model divide the data into true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) based on the annotated ground truth class and predicted class. TP represents the detection of a box that is actually a box, FP represents the detection of a box that is not actually a box, TN represents the detection of a non-box that is actually not a box, and FN represents the detection of a non-box that is actually a box. Precision (P) is the ratio of correctly identified boxes to the total number of predicted boxes, and recall (R) is the ratio of correctly identified boxes to the total number of ground truth positive samples.

p r e c i s i o n = \frac{T P}{T P + F P}

(13)

r e c a l l = \frac{T P}{T P + F N}

(14)

Average Precision (AP) and mean Average Precision (mAP) can be used to evaluate the recognition performance of the model.

A P = \int_{0}^{1} P (r) d r

(15)

m A P = \frac{\sum_{i = 1}^{N} {A P}_{i}}{N}

(16)

3.4.2. Evaluation Results

In this study, the Stack Carton Dataset (SCD) publicly released by Huazhong University of Science and Technology was used, consisting of a total of 8399 images and 151,679 instances. The dataset contains only one category: “Carton”. The training set and test set were divided in an 8:2 ratio. Conducting ablation experiments, the results are shown in Table 1 and Table 2, and Figure 9.

By comparing the above tables and figures, it can be observed that the three improvements have different aspects of enhancing the model’s performance. Additionally, the models incorporating these three improvements have achieved a balance between speed and accuracy. It is evident that the improved algorithms have increased the confidence scores for stack cartons and further strengthened the detection capability, aligning with the requirements of embedded devices.

4. Experiments

4.1. Experimental Design

To evaluate the effectiveness of the designed system, an experimental platform was set up. The host machine’s CPU is AMD Ryzen 7 5800H, GPU is NVIDIA GeForce RTX 3060, with 16.0 GB of RAM and 6.0 GB of video memory. The camera is ZED 2i with a polarizer and a lens focal length of 4 mm. The operating system is Windows 10, CUDA 11.6.134 is installed, and the development language is Python 3.9. The calculations are performed using the PyTorch framework. The experimental setup is shown in Figure 10. The experimental system includes a PC, control system, camera, motor, slider, suction cup, telescopic rod, and target cartons for picking. In addition, the camera is mounted above the parallel middle rail and positioned at the center to ensure the widest field of view.

4.2. Experimental Results Analysis

4.2.1. Fidelity Verification

As shown in Figure 11, false positives refer to incorrectly identifying other objects as detection targets, while false negatives refer to the failure to successfully identify the cardboard boxes. Detection was performed on 200 images, comprising a total of 1824 instances. The false positive and false negative quantities were calculated.

As shown in Figure 12 and Table 3, the second model has the lowest false positive and false negative quantities. It can be observed that the CA attention mechanism, by capturing directional and positional information, improves the accuracy of the detection model and increases the fidelity rate by 2.32%.

4.2.2. Response Time Verification

To verify the lightweight effect of the model, 1000 images were selected for response time testing. The five models were tested sequentially, and the runtime returned by the host computer was recorded.

The results are shown in Figure 13 and Table 4, where the third model had the shortest response time for image processing. It can be seen that Ghost lightweightization effectively improves the detection speed, reducing the response time by 13.89%.

4.2.3. Localization Accuracy Verification

As shown in Figure 14, the binocular camera coordinate system was used to obtain the coordinates of the center point of the cardboard box. The actual measured coordinates were compared to calculate the localization error. A total of 72 images with 648 instances were used for calculation, and the average value was obtained from 20 calculations.

As shown in Figure 15 and Table 5, the fourth model had the best localization accuracy. This validates the improvement in accuracy and robustness achieved by the Alpha-DIoU loss function, resulting in a 7.14% improvement in localization accuracy.

4.2.4. Comprehensive Analysis

To comprehensively evaluate the performance of the models, scores were assigned to each indicator based on their rankings. The best-performing model received a score of 10, the second best received a score of 9, and so on in a decreasing order.

As shown in Figure 16 and Table 6, the overall highest score was obtained by using all the improved models, with a score of 51. The algorithm reduced the response time by 2.16%, improved the localization accuracy by 4.67%, and only experienced a slight decrease of 0.3% in fidelity rate. Overall, it achieved the best comprehensive performance.

5. Conclusions

This paper presents a carton recognition and grasping control system based on YOLOv5.

By introducing the CA attention mechanism, incorporating Ghost lightweight modules, and modifying the loss function to Alpha-IoU, the system’s operational speed and prediction accuracy have been improved. Simulated experimental results on a PC platform demonstrate certain enhancements in carton stacking detection. The comprehensive algorithm improvements in this paper have led to a 0.711% increase in mean average precision (mAP) and a 0.7% increase in frames per second (FPS), all while maintaining precision.

An end-to-end human–machine interaction control system has been constructed, encompassing functions such as interface design, camera invocation, image transmission, and more, establishing a stable and reliable data interaction. Moreover, with the utilization of the enhanced model, algorithm response time has decreased by 2.16% and localization accuracy has improved by 4.67% in a simulated environment, facilitating future deployment on embedded systems.

This study combines principles from deep learning theory and, through preliminary experimental tests, presents an innovative solution for human–machine interaction applications of warehouse robots in the cold storage industry. It effectively enhances the overall efficiency of cold chain logistics, saves labor resources, and ensures personnel safety. Future work could focus on further optimizing system performance and enhancing algorithm efficiency and accuracy. The study holds significant potential for wide application and dissemination.

Author Contributions

Conceptualization, Z.W. and J.Z.; methodology, R.Z.; software, F.T.; validation, Z.Q., Z.Y. and R.Z.; formal analysis, Z.W. and J.Z.; investigation, J.Z.; resources, J.Z.; data curation, R.Z.; writing—original draft preparation, Z.W.; writing—review and editing, Z.Q. and Z.Y.; visualization, R.Z.; supervision, F.T. and J.Z.; project administration, F.T. and J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by a Major Science and Technology Research Project in Ningbo City (Grant No. 2023Z041) and Ningbo Natural Science Foundation (Grant No. 2021J165), as well as sponsored by the project "Research on Low-Temperature Adaptation Technology for Large Cold Storage Warehouse Robots" (Ningbo City Public Welfare Research Program).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data sharing is not applicable.

Conflicts of Interest

The author declare no conflict of interest.

References

Gould, W.P. Cold storage. In Quarantine Treatments for Pests of Food Plants; CRC Press: Boca Raton, FL, USA, 2019; pp. 119–132. [Google Scholar]
Zhao, Y.; Zhang, X.; Xu, X. Application and research progress of cold storage technology in cold chain transportation and distribution. J. Therm. Anal. Calorim. 2020, 139, 1419–1434. [Google Scholar] [CrossRef]
Tang, Y.; Hu, Z.; Tang, T.; Gao, X. Effect of goods stacking mode on temperature field of cold storage. IOP Conf. Ser. Earth Environ. Science. 2021, 675, 012052. [Google Scholar] [CrossRef]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef] [PubMed]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
Chen, S.; Yang, D.; Liu, J.; Tian, Q.; Zhou, F. Automatic weld type classification, tacked spot recognition and weld ROI determination for robotic welding based on modified YOLOv5. Robot. Comput. Integr. Manuf. 2023, 81, 102490. [Google Scholar] [CrossRef]
Chen, J.; Bao, E.; Pan, J. Classification and Positioning of Circuit Board Components Based on Improved YOLOv5. Procedia Comput. Sci. 2022, 208, 613–626. [Google Scholar] [CrossRef]
Zhang, H.; Tian, M.; Shao, G.; Cheng, J.; Liu, J. Target detection of forward-looking sonar image based on improved yolov5. IEEE Access 2022, 10, 18023–18034. [Google Scholar] [CrossRef]
Chen, Y.; Yang, J.; Wang, J.; Zhou, X.; Zou, J.; Li, Y. An Improved YOLOv5 Real-time Detection Method for Aircraft Target Detection. In Proceedings of the 2022 27th International Conference on Automation and Computing (ICAC), Bristol, UK, 1–3 September 2022; pp. 1–6. [Google Scholar] [CrossRef]
Liu, W.; Quijano, K.; Crawford, M.M. YOLOv5-Tassel: Detecting tassels in RGB UAV imagery with improved YOLOv5 based on transfer learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8085–8094. [Google Scholar] [CrossRef]
Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
Zhou, J.; Li, W.; Fang, H.; Zhang, Y.; Pan, F. The Hull Structure and Defect Detection Based on Improved YOLOv5 for Mobile Platform. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022; pp. 6392–6397. [Google Scholar] [CrossRef]
Xiao, B.; Guo, J.; He, Z. Real-time object detection algorithm of autonomous vehicles based on improved yolov5s. In Proceedings of the 2021 5th CAA International Conference on Vehicular Control and Intelligence (CVCI), Tianjin, China, 29–31 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
Zhou, N.; Liu, Z.; Zhou, J. Yolov5-based defect detection for wafer surface micropipe. In Proceedings of the 2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Guangzhou, China, 22–24 July 2022; pp. 165–169. [Google Scholar] [CrossRef]
Hamzenejadi, M.H.; Mohseni, H. Real-Time Vehicle Detection and Classification in UAV imagery Using Improved YOLOv5. In Proceedings of the 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 17–18 November 2022; pp. 231–236. [Google Scholar] [CrossRef]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar] [CrossRef]
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar] [CrossRef]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar] [CrossRef]
He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. α-IoU: A family of power intersection over union losses for bounding box regression. Adv. Neural Inf. Process. Syst. 2021, 34, 20230–20242. [Google Scholar]

Figure 1. System architecture diagram.

Figure 2. Controller function introduction.

Figure 3. YOLOv5 structure diagram.

Figure 4. Coordinate attention block.

Figure 5. Adding the CA module.

Figure 6. Ghost module.

Figure 7. Adding the Ghost module.

Figure 8. DIoU loss function illustration.

Figure 9. mAP comparison.

Figure 10. Experimental setup.

Figure 11. False positives and false negatives.

Figure 12. False positive and false negative quantities.

Figure 13. Response time comparison.

Figure 14. Host computer localization illustration.

Figure 15. Localization accuracy boxplot.

Figure 16. Model radar chart.

Table 1. Evaluation list.

ID	CA	Ghost	Alpha IOU
1	×	×	×
2	√	×	×
3	×	√	×
4	×	×	√
5	√	√	√

Table 2. Evaluation results.

ID	Precious/%	Recall/%	mAP/%	Params	Speed/FPS
1	98.938	97.282	90.804	7022326	31.6
2	98.955	97.2	90.812	7059694	31.3
3	98.583	96.725	89.705	5810006	36.7
4	98.256	97.298	91.437	12322312	30.9
5	98.941	96.744	91.515	10714896	32.3

Table 3. False positive and false negative experimental results.

	Base	CA	Ghost	Alpha-DIoU	All
False Negative	68	55	82	64	65
False Positive	12	10	18	10	12
False Negative rate	3.73%	3.01%	4.50%	3.51%	3.56%
False Positive rate	0.66%	0.55%	0.99%	0.55%	0.66%

Table 4. Response time experimental results.

Count	Time [s]
Count	Base	CA	Ghost	Alpha-DIoU	All
100	2.753	2.801	2.252	2.911	2.563
200	5.525	5.599	4.955	5.775	5.146
300	8.336	8.444	7.367	8.691	7.932
400	11.269	11.376	9.873	11.772	10.823
500	14.362	14.399	12.494	14.948	13.816
600	17.517	17.594	15.203	18.237	17.012
700	20.846	21.062	18.100	21.628	20.315
800	24.329	24.595	21.076	25.093	23.739
900	27.881	28.213	24.119	28.665	27.248
1000	31.645	31.948	27.248	32.362	30.960

Table 5. Localization accuracy experimental results.

Experiments	Position Error [cm]
Experiments	Base	CA	Ghost	Alpha-DIoU	All
1	1.13	1.12	1.18	0.96	0.98
2	1.06	1.15	1.19	1.02	1.06
3	1.08	1.02	1.25	1.06	1.02
4	1.08	1.05	1.22	1.08	1.06
5	1.13	1.03	1.23	0.92	1.15
6	1.08	1.12	1.25	1.08	1.05
7	1.13	0.98	1.29	1.02	0.96
8	1.09	1.10	1.13	0.96	0.98
9	1.15	1.08	1.18	1.08	1.08
10	1.03	1.12	1.24	0.92	1.06
11	1.13	1.02	1.16	1.02	1.10
12	1.03	1.05	1.25	1.05	1.02
13	1.12	1.08	1.19	1.10	0.98
14	1.09	1.11	1.22	0.98	1.08
15	1.12	1.12	1.24	1.05	0.96
16	1.05	1.00	1.30	0.96	1.08
17	1.08	1.12	1.28	1.08	1.12
18	1.12	1.15	1.14	0.92	1.08
19	1.02	1.12	1.20	1.06	1.02
20	1.12	1.05	1.15	0.96	0.98
average	1.092	1.0795	1.2145	1.014	1.041

Table 6. Localization accuracy experimental results.

	Base	CA	Ghost	Alpha-DIoU	All
Precious	9	10	7	6	8
Recall	9	8	6	10	7
mAP	7	8	6	9	10
Response Time	8	7	10	6	9
Position Acuracy	8	7	6	10	9
Fidelity Factor	7	10	6	9	8
Total Score	48	50	41	50	51

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wei, Z.; Tian, F.; Qiu, Z.; Yang, Z.; Zhan, R.; Zhan, J. Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots. Actuators 2023, 12, 334. https://doi.org/10.3390/act12080334

AMA Style

Wei Z, Tian F, Qiu Z, Yang Z, Zhan R, Zhan J. Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots. Actuators. 2023; 12(8):334. https://doi.org/10.3390/act12080334

Chicago/Turabian Style

Wei, Zejiong, Feng Tian, Zhehang Qiu, Zhechen Yang, Runyang Zhan, and Jianming Zhan. 2023. "Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots" Actuators 12, no. 8: 334. https://doi.org/10.3390/act12080334

APA Style

Wei, Z., Tian, F., Qiu, Z., Yang, Z., Zhan, R., & Zhan, J. (2023). Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots. Actuators, 12(8), 334. https://doi.org/10.3390/act12080334

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Research on Machine Vision-Based Control System for Cold Storage Warehouse Robots

Abstract

1. Introduction

2. Human–Machine Interaction Control System

3. Control Algorithm Design

3.1. Introduction to YOLOv5

3.2. Model Comparison

3.3. Improvements to YOLOv5

3.3.1. Addition of Coordinate Attention (CA) Mechanism

3.3.2. Ghost Module Replacement

3.3.3. Improved Loss Function

3.4. Performance Evaluation

3.4.1. Evaluation Metrics

3.4.2. Evaluation Results

4. Experiments

4.1. Experimental Design

4.2. Experimental Results Analysis

4.2.1. Fidelity Verification

4.2.2. Response Time Verification

4.2.3. Localization Accuracy Verification

4.2.4. Comprehensive Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI