1. Introduction
Containers serve as the primary mode of transportation for international trade [1]. According to the United Nations Conference on Trade and Development (UNCTAD), 90% of international trade volume is carried by maritime transport, and over 60% of maritime trade is conducted via container shipping [2]. In 2020, the global container transport trade volume reached 140 million TEUs (twenty-foot equivalent units). The container code serves as the unique identifier of a container and is essential for the control and management of cargo during the various stages of Container Multimodal Transportation (CMT).
Container terminals, particularly their gate entrances, are crucial nodes in CMT, responsible for recognizing and recording container codes [3]. The speed and accuracy of container code recognition at gate entrances directly affect the overall efficiency of CMT. With the rapid growth of global container transport volume, port throughput has increased from 485 million TEUs in 2007 to 820 million TEUs in 2020 [2]. However, existing container code recognition methods at gate entrances cannot keep pace with this growth. Slow recognition results in long waiting times for containers at gate entrances, leading to traffic congestion and environmental pollution [4]. Consequently, container terminal gate entrances have become a bottleneck that hinders the development of CMT [5].
The application of Internet of Things (IoT) technology in smart ports is rapidly advancing. Smart ports leverage IoT technology to achieve real-time monitoring and management of cargo, equipment, and infrastructure, enhancing operational efficiency and reducing costs. Existing IoT solutions primarily rely on barcode scanners and RFID devices for data collection. Although these methods are mature, they have certain limitations, such as the need for additional hardware and high maintenance costs. This manuscript proposes a video-based data collection solution that utilizes existing surveillance cameras, eliminating the need for extra hardware and enabling more efficient data collection. Moreover, the algorithm presented in this paper takes into account different angles and lighting conditions, making it adaptable to various image capture devices. As a result, it offers a more flexible and cost-effective option for IoT systems in smart ports.
Because manual container code recognition and recording is slow and costly [6], Automatic Container Code Recognition (ACCR) systems are widely used at port gate entrances. Among them, barcode-based ACCR systems have been phased out due to their poor reliability. Radio Frequency Identification (RFID)-based ACCR systems offer high detection accuracy but incur high deployment and maintenance costs [7]. Optical Character Recognition (OCR)-based ACCR systems, on the other hand, have low installation costs [8] and, owing to their simple system integration, are the mainstream approach in modern container terminals. However, their recognition accuracy is generally below 95%, mainly because container code localization (CCL) [9], character segmentation, and container code recognition (CCR) [10] impose strict requirements on imaging conditions. Under uneven lighting, image blurring, or skewed container codes, localization or character recognition errors are prone to occur.
Deep learning algorithms can automatically extract features directly from raw images, exhibit strong robustness to interference, and have achieved significant results in areas such as traffic signal detection [11], license plate recognition [12], and medical image segmentation [13]. Deep-learning-based ACCR methods can be divided into two types: (1) using a single neural network to complete the entire container code recognition process [14] and (2) stacking two neural networks to perform CCL and CCR separately [15]. However, the former requires complex post-processing, does not improve detection speed, and may reduce recognition accuracy. The latter improves recognition accuracy, but its model parameters are too large, leading to longer computation time. Existing algorithms fail to balance container code recognition speed and accuracy, and their high hardware requirements make them difficult to deploy at port gate entrances.
To address these issues, this paper presents a two-stage automatic container number recognition method based on deep learning algorithms. This method offers high recognition accuracy, fast speed, and easy deployment on devices with limited computational power. The first stage utilizes YOLOv4 for container number region localization, while the second stage employs Deeplabv3+ for container number character recognition on the localized region images. Additionally, improvements are made to the network structures of YOLOv4 and Deeplabv3+ to enhance the speed and accuracy of both models in the CCL and CCR tasks, resulting in the modified algorithms named C-YOLOv4 and C-Deeplabv3+. The method is applied to the data input segment of the port Internet of Things system. The main contributions of this paper are as follows:
- (1)
A dataset of container number images is constructed, primarily collected from container terminal gates. The dataset includes various complex scenarios such as tilted containers and rusty or contaminated container number characters;
- (2)
A CCL algorithm based on C-YOLOv4 is proposed. By compressing the backbone feature-extraction network, redesigning the multi-scale recognition module, and improving the loss function of YOLOv4, the algorithm achieves higher accuracy and speed for container number region localization in complex scenarios;
- (3)
A CCR algorithm based on C-Deeplabv3+ is introduced. A decoding structure combining sub-pixel convolutional upsampling and transposed convolutional upsampling is designed and applied to the DeepLabv3+ network. Additionally, a new dilated convolution branch is introduced to preserve more character detail in the image without increasing the parameter count.
The remainder of this paper is organized as follows: Section 2 briefly reviews existing methods for CCL and CCR. Section 3 describes the framework of the container number recognition method based on C-YOLOv4 and C-Deeplabv3+. Section 4 reports experiments on container number region localization and character recognition using a self-made dataset, followed by an analysis of the results. Section 5 presents the conclusions.
4. Experiment
This section presents experiments on CCL, CCR, and complete container number recognition to verify the advantages of the proposed ACCR method. The experiments were run on Windows 10, with an Nvidia GeForce RTX 3090 GPU (Nvidia Corporation, Santa Clara, CA, USA) and an Intel Xeon Silver 4210R @ 2.40 GHz processor (Intel Corporation, Santa Clara, CA, USA).
4.1. Dataset
The dataset was collected mainly at container terminal gates. To improve the versatility of the model, it also contains container images captured by cameras and mobile phones at the container terminal yard. After data cleaning, images that cannot be recognized by the human eye, such as those with severely rusted characters or heavily occluded container number areas, were removed, leaving 1721 images. The dataset consists mainly of dry bulk containers and oil tank containers and includes images in which the container number area is partially obscured by snow, the container is tilted, or the container number characters are rusted or contaminated. Some examples are shown in Figure 5.
4.2. Evaluation Indicators
In this paper, precision–recall (P–R) curves, F1 scores, and accuracy are used to evaluate the results of the CCL experiments. Precision is the proportion of truly positive samples among all samples predicted to be positive, as shown in Equation (5). Recall is the ratio of correctly predicted positive images to the total number of positive images, as shown in Equation (6). Accuracy is the proportion of all image data that is correctly predicted, as shown in Equation (7). The F1 score is the weighted harmonic mean of precision and recall, as shown in Equation (8), where TP denotes true positives, FP false positives, FN false negatives, and TN true negatives. In addition, FPS (frames per second) is used to evaluate the real-time performance of the model.
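Written out in the standard form implied by the definitions above (following the paper's equation numbering), these metrics are:

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP+FP} \tag{5}\\
\mathrm{Recall} &= \frac{TP}{TP+FN} \tag{6}\\
\mathrm{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN} \tag{7}\\
F1 &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \tag{8}
\end{align}
```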
Mean Pixel Accuracy (MPA) and FPS are the evaluation indicators of the CCR experiment. MPA is the average of the per-class pixel accuracies in pixel classification; the larger the MPA value, the higher the prediction accuracy, as shown in Equation (9). CPA (Class Pixel Accuracy) is the proportion of pixels of each class that are correctly classified, as shown in Equation (10). IoU (Intersection over Union) is the ratio of the intersection of the predicted region and the ground-truth region of a class to their union, as shown in Equation (11).
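Written in the standard form implied by these definitions, with $p_{ij}$ denoting the number of pixels of class $i$ predicted as class $j$ and $k$ the number of character classes, the metrics are:

```latex
\begin{align}
MPA &= \frac{1}{k}\sum_{i=1}^{k} CPA_i \tag{9}\\
CPA_i &= \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij}} \tag{10}\\
IoU_i &= \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}} \tag{11}
\end{align}
```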
4.3. Container Number Localization Experiment
The input image size is set to (416, 416, 3), and the ratio of the validation set to the training set is 1:9. Predicted boxes with an Intersection over Union (IoU) greater than 0.4 with the ground-truth boxes are retained, and no image distortion adjustment is performed. The batch size is 8 images per iteration, with a base learning rate of 0.001. Training is terminated when the learning rate remains below 1 × 10−6 for 10 epochs.
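As a concrete illustration, the training settings above can be summarized as a small configuration block together with the stopping rule; the identifier names below are illustrative and not taken from the authors' code.

```python
# Hypothetical training configuration mirroring the settings reported above
# (names are illustrative, not the authors' actual code).
config = {
    "input_shape": (416, 416, 3),
    "val_train_ratio": 1 / 9,     # validation : training = 1 : 9
    "iou_keep_threshold": 0.4,    # keep predicted boxes with IoU > 0.4 vs. ground truth
    "batch_size": 8,
    "base_lr": 1e-3,
    "lr_floor": 1e-6,
    "patience_epochs": 10,        # stop once the lr has stayed below lr_floor this long
}

def should_stop(lr_history: list, lr_floor: float = 1e-6, patience: int = 10) -> bool:
    """Terminate training once the learning rate has remained below lr_floor
    for the last `patience` epochs."""
    recent = lr_history[-patience:]
    return len(recent) == patience and all(lr < lr_floor for lr in recent)
```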
First, we compare the model parameters and detection speed of C-YOLOv4 and YOLOv4, as shown in Table 4. YOLOv4 has 39 MB of parameters, while C-YOLOv4 has only 13.6 MB, a reduction of approximately 78%.
The detection accuracy of the proposed C-YOLOv4 model is also verified by the P–R curve and the F1 curve, and the results are shown in Figure 6. Compared with recall or precision alone, the F1 value provides a more comprehensive measure of model performance. The P–R curve shows that the precision of the model stays around 1 as recall increases from 0 to 0.9, which indicates that C-YOLOv4 is very accurate in predicting positive samples. The F1 value, excluding the two threshold extremes of 0 and 1, is essentially above 0.95, which shows that the model's detection performance is very stable. In summary, the number of C-YOLOv4 parameters is reduced and the detection speed is substantially increased, while the detection accuracy of the model is not negatively affected.
Figure 7 shows examples of container number regions successfully localized and segmented by C-YOLOv4.
YOLOv3, YOLOv4, YOLOv5s, and C-YOLOv4 are also compared in terms of accuracy and speed (FPS) for container number region localization, and the results are shown in Table 5. As can be seen from Table 5, YOLOv3 has the lowest accuracy at 94.50%, while YOLOv5s achieves 98.43%, slightly lower than YOLOv4's 98.81%, because it sacrifices some accuracy to improve recognition speed. C-YOLOv4 has the highest accuracy at 99.76%, which is 1.06% higher than YOLOv4. In addition, YOLOv3 has the slowest recognition speed at 23.5 ms per frame, and YOLOv5s is faster than YOLOv4 but still slower than C-YOLOv4's 17.6 ms per frame. In summary, C-YOLOv4 achieves an excellent balance between recognition accuracy and speed.
The container number region localization accuracies of YOLOv3, YOLOv4, and YOLOv5s are lower than that of C-YOLOv4 because our dataset contains a large number of container number images that are tilted, have missing characters, or have contaminated number regions. YOLOv3, YOLOv4, and YOLOv5s cannot adapt to these complexities; they often miss detections and produce poor localization results. In contrast, the proposed C-YOLOv4 optimizes the loss function, the multi-scale recognition module, and other components for the container number recognition task and can consistently locate the container number region accurately. Some comparison results are shown in Figure 8, Figure 9 and Figure 10.
In terms of localization quality, C-YOLOv4 accurately locates the container number area even when the image is affected by oil contamination, severe tilting of the container, or partial occlusion of the number area, whereas YOLOv3, YOLOv4, and YOLOv5s sometimes miss the edge portion; YOLOv3 and YOLOv5s are affected more severely and show poorer localization accuracy. In Figure 8, Figure 9 and Figure 10, the red boxes show the detection results of the different models, and the yellow boxes highlight the differences between the detection results.
4.4. Container Number Character Recognition Experiment
Before the character recognition experiment, we rebuilt the dataset by manually localizing and segmenting the container number regions in the self-constructed dataset. The result is similar to the segmented region images in Section 4.3, but manual localization and segmentation is more accurate, which avoids character recognition failures caused by localization errors. We used Labelme as the annotation tool to label the container number characters; an annotated image is shown in Figure 11, where rectangular boxes of different colors represent different English letters or digits. In addition, because the letter prefixes of the container numbers in the self-built dataset are mainly TBBU and TBGU, we also cropped images of characters from other parts of the containers for model training, to ensure that the model can recognize other letters and to improve generalizability; some of these images are shown in Figure 12.
The image downsampling factor is set to 16, and the ratio of the validation set to the training set is 1:9. The batch size is 8 images, the base learning rate is set to 0.0001, and training is terminated when the learning rate remains below 1 × 10−6 for 10 epochs.
First, the number of model parameters and the recognition speed of DeepLabv3+ and C-DeepLabv3+ are compared, and the results are shown in Table 6. DeepLabv3+ has 39.3 MB of parameters, while C-DeepLabv3+ has only 2.63 MB, a reduction of about 95%.
We use MPA to evaluate the accuracy of C-DeepLabv3+ in character recognition. Since the targets of container number character recognition are the 36 classes 0–9 and A–Z, we obtained recognition results for 36 characters, as shown in Figure 13; the mean MPA is 99.22%, which shows that the overall recognition accuracy of C-DeepLabv3+ is very high. More specifically, although similarly shaped characters such as "0", "1", "6", "8", "B", "I", "J", "O", "Q", "U", and "V" have lower MPA values, the MPA for every character is still above 97%.
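To make the per-character evaluation concrete, the sketch below shows one way CPA and MPA could be computed from ground-truth and predicted label maps over the 36 classes; it mirrors Equations (9) and (10) and is not the authors' evaluation code.

```python
# Illustrative sketch (NumPy): per-character pixel accuracy (CPA) and its mean (MPA)
# from ground-truth and predicted label maps with classes 0..35 (0-9, A-Z).
import numpy as np

def per_class_accuracy(gt: np.ndarray, pred: np.ndarray, num_classes: int = 36):
    """Return (CPA array of length num_classes, mean MPA over classes present in gt)."""
    cpa = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = gt == c                         # pixels labeled as class c
        if mask.any():
            cpa[c] = (pred[mask] == c).mean()  # fraction of class-c pixels predicted correctly
    return cpa, np.nanmean(cpa)
```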
Segnet, PSPNet, DeepLabv3+, and C-DeepLabv3+ are also compared in terms of mean pixel classification accuracy (MPA) and recognition speed (FPS), and the results are shown in Table 7.
C-DeepLabv3+ has the highest MPA and recognition speed, which are 2.51% and 24% higher than those of DeepLabv3+, respectively. Segnet has the lowest MPA at 91.23%, which cannot satisfy the accuracy requirement of container number character recognition; its complex structure also makes it slow, recognizing only 6.3 images per second. Taken together, the C-DeepLabv3+ network achieves the best character recognition performance.
C-DeepLabv3+ uses MobileNetv3 as its backbone, which gives it a substantially higher recognition speed than Segnet, PSPNet, and DeepLabv3+. In addition, the improved Atrous Spatial Pyramid Pooling (ASPP) module and decoder enhance the accuracy of container number character recognition, giving the model the highest accuracy among the compared methods. For complete container number recognition, the methodology outlined in Table 1 was followed: the localization results of C-YOLOv4 were cropped with OpenCV and then fed into C-DeepLabv3+ for character recognition. The resulting recognition accuracy and speed are listed in Table 8.
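The sketch below illustrates this two-stage inference flow under simplifying assumptions: `detect_regions` and `segment_characters` are hypothetical stand-ins for C-YOLOv4 and C-Deeplabv3+ inference, and the crop/resize step uses OpenCV as described above.

```python
# Minimal sketch of the two-stage pipeline (not the authors' code): a detector
# returns container-number boxes, OpenCV crops each region, and a segmentation
# model decodes the per-pixel character classes into a string.
from typing import Callable, List, Tuple

import cv2
import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def recognize_container_number(
    image: np.ndarray,
    detect_regions: Callable[[np.ndarray], List[Box]],   # stand-in for C-YOLOv4 inference
    segment_characters: Callable[[np.ndarray], str],     # stand-in for C-DeepLabv3+ inference
) -> List[str]:
    """Localize container number regions, crop them, and recognize their characters."""
    results = []
    for (x1, y1, x2, y2) in detect_regions(image):
        crop = image[y1:y2, x1:x2]            # OpenCV-style crop of the code region
        crop = cv2.resize(crop, (416, 416))   # resize to the assumed network input size
        results.append(segment_characters(crop))
    return results
```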
We combine Table 1 and Table 8 into a single table to compare the container number recognition accuracy and speed of different studies; the results are shown in Table 9.
As can be seen from Table 9, the proposed ACCR method outperforms all previous methods in recognition success rate and speed, reaching 99.51% and 115 ms, respectively. The ACCR methods proposed in the previous literature are either too slow or have insufficient success rates. The method of Wu et al. has the highest success rate among them, 99.3%, only slightly lower than ours, but its recognition speed is 15.6 times slower than our method and cannot support real-time detection. Li et al.'s method is slightly faster than ours, but its recognition success rate is low, its robustness in complex environments is poor, and it is poorly suited to deployment at dock gates. Feng et al.'s method offers a more balanced recognition speed and success rate and has some resilience to complex environments, but it is still not as good as our method. The remaining methods in the table fall short in both recognition success rate and speed and are not discussed further.
In summary, our method achieves the highest recognition success rate and speed and can accurately and quickly obtain recognition results in various complex scenes. Figure 14 shows some of the container number character recognition results.
5. Conclusions
This study introduces a dual-stage container number recognition method leveraging deep learning techniques, specifically designed to swiftly and accurately identify container numbers at terminal gates. The proposed approach incorporates the enhanced algorithms C-YOLOv4 for detecting container number areas and C-Deeplabv3+ for character recognition. This method not only maintains high accuracy but also significantly accelerates recognition speed and reduces the complexity of the model, making it more practical for deployment at terminal facilities.
Significant enhancements were applied to YOLOv4, including replacing the CSPDarknet53 network with MobileNetv3 and substituting PANet with FPN to reduce parameters. The removal of 13 × 13 scale feature maps and the recalibration of prior box values have tailored the model to better fit the container number localization task. Furthermore, the adoption of an Enhanced Intersection Over Union (EIOU) loss function has optimized the speed of model training. The modified C-YOLOv4 demonstrated a 30% improvement in speed and a 1.06% increase in accuracy over the original model, outperforming other conventional models like YOLOv3 and YOLOv5s in complex scenarios. Modifications to Deeplabv3+ were also pivotal. Replacing Xception with MobileNetv3 as the backbone feature-extraction network reduced parameter size. The introduction of a new atrous convolution branch and a redesigned Atrous Spatial Pyramid Pooling (ASPP) module have improved the model’s precision in detecting small targets. Adjustments in the decoder, specifically the shift to transposed and sub-pixel convolution upsampling, have effectively preserved image details, boosting the character recognition capabilities of C-Deeplabv3+. Compared to its predecessor, the updated model has shown a 23.9% faster recognition speed and a 1.58% higher accuracy rate. However, the slower operation of C-Deeplabv3+ compared to C-YOLOv4 suggests the need for further optimization.
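As a rough illustration of the decoder change described above, the PyTorch sketch below combines sub-pixel (PixelShuffle) upsampling with transposed-convolution upsampling ahead of a 36-class per-pixel classifier; channel sizes and layer order are assumptions, not the authors' exact C-Deeplabv3+ decoder.

```python
# Hedged sketch of a decoder mixing sub-pixel and transposed-convolution upsampling.
import torch
import torch.nn as nn

class MixedUpsampleDecoder(nn.Module):
    def __init__(self, in_ch: int = 256, mid_ch: int = 64, num_classes: int = 36):
        super().__init__()
        # Sub-pixel upsampling: a conv expands channels by r*r, then PixelShuffle
        # rearranges them into a 2x larger feature map, preserving fine character detail.
        self.subpixel = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale_factor=2),
            nn.ReLU(inplace=True),
        )
        # Transposed-convolution upsampling: a learned further 2x upsampling step.
        self.transposed = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, mid_ch, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Per-pixel classification over the 36 character classes (0-9, A-Z).
        self.classifier = nn.Conv2d(mid_ch, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.subpixel(x)       # (N, mid_ch, 2H, 2W)
        x = self.transposed(x)     # (N, mid_ch, 4H, 4W)
        return self.classifier(x)  # (N, num_classes, 4H, 4W)

# Example: a 256-channel encoder feature map upsampled 4x and classified per pixel.
logits = MixedUpsampleDecoder()(torch.randn(1, 256, 26, 26))  # -> (1, 36, 104, 104)
```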
In future work, we will explore how to adapt our algorithms and models to various IoT device platforms and optimize their performance to ensure efficient operation under diverse hardware and resource constraints. We will also diversify the dataset to enhance the algorithm's generalization ability. Through these efforts, we aim to extend the container number recognition system to a broader range of IoT devices and increase its impact on actual port operations.
This research provides a solid foundation for robust, efficient container number recognition systems, with potential for further enhancement and application in real-world port operations.