Article

Automating Code Recognition for Cargo Containers

Institute of Electronics and Informatics Engineering of Aveiro (IEETA), 3810-193 Aveiro, Portugal
* Author to whom correspondence should be addressed.
Electronics 2025, 14(22), 4437; https://doi.org/10.3390/electronics14224437
Submission received: 26 September 2025 / Revised: 30 October 2025 / Accepted: 11 November 2025 / Published: 14 November 2025

Abstract

Maritime transport plays a pivotal role in global trade, where efficiency and accuracy in port operations are crucial. Among the various tasks carried out in ports, container code recognition is essential for tracking and handling cargo. Manual inspection of container codes is becoming increasingly impractical, as it introduces delays and raises the risk of human error. To address these issues, this work proposes a hybrid Optical Character Recognition (OCR) system that integrates YOLOv7 for text detection with the transformer-based TrOCR for recognition of container codes, enabling accurate and efficient automated recognition. This design addresses real-world challenges such as varying lighting, distortions, and multi-oriented container codes. The system was evaluated comprehensively on datasets that simulate the conditions found in port environments. The results demonstrate that the proposed hybrid model delivers significant improvements in detection and recognition accuracy and robustness compared to traditional OCR methods. In particular, its reliability in recognizing multi-oriented codes marks a notable advancement over existing solutions. Overall, this study presents an approach to automating container code recognition that contributes to the efficiency and modernization of port operations, with the potential to streamline workflows, reduce human error, and enhance the overall logistics chain.

1. Introduction

The use of maritime transport has grown significantly in recent decades, becoming the backbone of global trade and accounting for approximately 80% of global trade in 2024 [1]. This growth stems from its cost-effectiveness compared to air or land transport, as well as the accelerated globalization of supply chains. It has also placed pressure on port infrastructures, which must adapt continuously to ensure the smooth operation of supply chains. As critical nodes in global logistics networks, ports face numerous challenges, any one of which can lead to extended vessel idle times and increased operational costs across several industries.
Given that port operations involve complex processes requiring precise coordination, there is constant demand for faster turnaround times and higher throughput. The implementation of Industry 4.0 technologies in these scenarios has been a growing trend, driving a digital transformation known as 'Smart Ports' [2]. These smart ports are characterized by the use of newer technologies, such as the Internet of Things (IoT), Machine Learning (ML), and Big Data [3]. These technologies are used in all phases of port operations [4], minimizing unnecessary movements and reducing congestion, contributing to more streamlined port operations.
The applications of ML in port operations are diverse, reflecting the wide range of ML and Deep Learning (DL) algorithms [3]. Common implementations include predictive maintenance of equipment, scheduling systems, automated stacking cranes, and numerous other solutions. Among these, Optical Character Recognition (OCR) systems have garnered particular attention for their impact in this environment. By enabling automatic extraction of textual information, OCR proves indispensable in several systems, many of which are currently used as industry practice [3]. OCR has proven highly effective in yard operations, such as hazard placard detection and container code recognition. These applications improve the efficiency of container handling equipment, leading to reduced transport idle periods and shorter turnaround times. Although such technical advancements bring ports closer to an Industry 4.0 standard, there remains an ongoing need for more accurate and robust systems to fully meet the operational demands of modern maritime logistics.
To address the challenges of container code recognition, this work proposes a hybrid OCR system optimized for port operations. Existing OCR solutions such as Tesseract [5], originally designed for digitized documents, and EasyOCR, optimized for general real-world text, perform well in their respective domains [6], but they often fail under port-specific conditions.
Although prior studies have explored combinations of object detection and OCR models, this work distinguishes itself by evaluating multi-oriented container codes and developing a prototype tailored for deployment in port environments. Rather than relying on explicit orientation correction or rotation preprocessing, our approach integrates YOLOv7 for code detection and TrOCR for recognition, forming a robust pipeline. Through experiments with Tesseract, EasyOCR, TrOCR, and the deep text recognition benchmark, this work identifies an effective combination that excels under the unique challenges of port environments. These results not only strengthen container code recognition for port operations but also provide broader insights into OCR applications where handling multiple text orientations is required.
This article is structured as follows. Section 2 presents the materials and methods used in this work, including the literature review, dataset description, and the proposed hybrid OCR system. Section 3 reports and discusses the experimental results, including comparative analysis with existing OCR models. Finally, Section 4 concludes the paper by discussing the implications of the findings and suggesting directions for future research.

2. Materials and Methods

2.1. Literature Review

As the need to parse written information into machine-readable formats increased, the demand for OCR systems followed suit, and the growing adoption of ML and DL technologies made effective OCR systems both necessary and attainable [7]. OCR systems have evolved into “end-to-end” tools, where the user inputs an image and receives machine-readable text as output, without any additional processing steps. Typically, modern OCR pipelines consist of two main stages: text detection, to localize text regions in an image, and text recognition, to convert the detected text into machine-readable characters. Although some models were created for specific tasks, such as the recognition of digitized characters [5], most OCR systems lack the ability to handle text in different orientations, such as vertical text, which is common in port operations.
The most common detection models include Character Region Awareness for Text Detection (CRAFT) [8] and You Only Look Once (YOLO) [9]. While these models achieve strong performance, related efforts improve computational efficiency by modifying existing architectures [10] or through algorithmic optimizations, such as separable-moment methods for 3D image reconstruction [11], in order to achieve faster inference and accuracy gains.
Another common optimization approach is fine-tuning on task-specific data, allowing the model to adapt to the specific characteristics of the text in that setting, following similar principles applied in other domains, such as image analysis, where parameters of Meixner polynomials were optimized using the Firefly algorithm to enhance performance [12].
The recognition stage typically involves recurrent neural networks (RNNs), Long Short-Term Memory networks (LSTMs), and the combination of Convolutional Neural Networks (CNNs) with RNNs to form CRNNs. Interconnecting techniques in this way is common, as in the Deep Text Recognition (DTR) system proposed by Baek et al. [13], and has also been applied to medical imaging tasks such as cell classification, where combining CNNs with quaternion-based optimization has demonstrated improved classification accuracy and computational efficiency [14].
More recent advances in the recognition stage rely on transformer-based models. Introduced by Vaswani et al. [15], transformers have excelled in various natural language processing and computer vision tasks. They are characterized by an encoder-decoder structure, relying on positional encoding and feed-forward layers without convolution or recurrence. Building on this foundation, subsequent research has introduced several enhancements to the original architecture, such as memory compression and query prototyping, in order to reduce attention complexity [16].
These developments have allowed transformers to be used in the field of OCR. Transformer-based OCR (TrOCR), developed by Li et al. [17], uses a Vision Transformer (ViT) encoder and a text decoder, which allows it to learn visual patterns without relying on convolutional layers. Comparable optimization strategies have also been applied in image watermarking [18], where combining arithmetic optimization algorithms with the discrete wavelet transform (DWT) and singular value decomposition (SVD) achieves high accuracy at low computational cost.
The literature review highlights the evolution of OCR systems, from traditional methods to modern deep learning approaches. The integration of transformer-based models, such as TrOCR, represents a significant advancement in the field, offering improved performance and flexibility. Our proposal builds upon these advancements to develop a robust OCR system tailored for container code recognition in port operations.

2.2. Dataset

2.2.1. ISO 6346 Standard

In the context of creating a container code recognition system, the model must be able to detect and recognize the code of each container. These codes follow a standard structure defined by the International Organization for Standardization (ISO) 6346 standard [19], whose register is managed by the Bureau International des Containers (BIC) [20]. To be used in the global supply chain, all containers must adhere to this standard, ensuring that they are easily identifiable and traceable. The standard ensures uniformity in global logistics operations and simplifies the task for OCR systems by providing a clear code structure on which the models can be trained.
ISO 6346 divides the code into two main parts, as seen in Figure 1: the container number and the size and type code. The container number is split into four parts:
  • Owner code: Consists of 3 letters.
  • Category identifier: Designates the type of container. The most common identifiers are
    - J: Detachable freight container equipment,
    - U: General purpose container,
    - Z: Trailers and chassis.
  • Serial number: Consists of 6 digits and serves as the unique identifier of the container within the operator's fleet.
  • Check digit: A single digit, calculated mathematically to verify the integrity of the code (a short code sketch of this calculation is given at the end of this subsection).
The size and type code is a four-character code that provides further information about the container's physical characteristics:
  • 1st character: Overall length,
  • 2nd character: Overall height,
  • 3rd–4th characters: Type and additional features (e.g., General: G0 to G3; Tank: T0 to T9).
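A minimal sketch of the check-digit rule is shown below, written in Python under the ISO 6346 convention: letters map to the values 10 to 38 while skipping multiples of 11, each of the first ten characters is weighted by 2 raised to its position, and the weighted sum modulo 11 (with 10 mapped to 0) yields the check digit. The helper is illustrative rather than taken from the deployed system.

import string

# Letter values per ISO 6346: start at A = 10 and skip multiples of 11 (11, 22, 33).
LETTER_VALUES = {}
_value = 10
for _letter in string.ascii_uppercase:
    if _value % 11 == 0:
        _value += 1
    LETTER_VALUES[_letter] = _value
    _value += 1

def check_digit(code: str) -> int:
    """Compute the ISO 6346 check digit for the first 10 characters of a container code."""
    total = sum(
        (int(ch) if ch.isdigit() else LETTER_VALUES[ch]) * (2 ** position)
        for position, ch in enumerate(code[:10])
    )
    return total % 11 % 10

# The commonly cited example code CSQU305438 yields a check digit of 3.
assert check_digit("CSQU305438") == 3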

2.2.2. Dataset Description

To ensure a robust performance of the models across diverse operational environments, this work leverages two distinct datasets. The first dataset is provided by the Port of Sines Authority (PSA), while the second is open source from GitHub (Version: commit 4ff6f29) [21]. Each dataset is tailored to specific environments in the container logistics process, providing images at entry and exit points.
The PSA dataset, comprising 3295 images, was captured at the rail terminal of the Port of Sines. A fixed camera recorded the containers as trains moved in and out of the terminal during the port's working hours. This constant monitoring allows the dataset to reflect realistic challenges in port operations, such as rain, fog, and sun glare. Because it realistically represents port operations, this dataset served as the primary training and testing set for the model.
As the codes follow the ISO 6346 standard, which allows for both horizontal and vertical orientations, the dataset includes images of both formats. This is illustrated in Figure 2, where Figure 2a shows a horizontal code and Figure 2b shows a vertical code.
The open source dataset, available on GitHub, contains 2910 images depicting truck-based transport environments. The images cover two viewpoints: a side profile of the container and a rear view showing the container doors. While the PSA dataset does not include truck environments, this dataset covers an operational environment that is also present in the Port of Sines infrastructure, making it an ideal supplement for a model with comprehensive coverage of port scenarios.
To further characterize the datasets, the visual condition of the images was analyzed using luminance statistics, and the Blind Referenceless Image Spatial Quality Evaluator (BRISQUE) [22] was used as an image quality metric. As shown in Table 1, the PSA dataset exhibits lower luminance and higher BRISQUE values, indicating that the images are generally darker and of lower quality compared to the GitHub dataset. This difference in image quality is likely due to the less controlled rail environment, where lighting conditions and weather can significantly impact the quality of the captured images. In contrast, as truck gate operations are conducted during daylight hours and within covered or sheltered areas, the images directly benefit from more consistent and favorable lighting conditions, resulting in higher-quality images.
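The luminance statistics in Table 1 can be reproduced with a short script such as the sketch below, assuming OpenCV and NumPy are available; the BRISQUE scores were obtained with an existing implementation of the metric [22] and are not re-implemented here.

import cv2
import numpy as np

def luminance_stats(image_path: str) -> tuple[float, float]:
    # Read the image as grayscale and report the mean and standard deviation of luminance.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    return float(gray.mean()), float(gray.std())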
By leveraging a dual dataset approach, the models are trained to perform reliably across a wide range of scenarios, from the controlled environment of the truck terminal to the more variable conditions of train operations, while ensuring accurate recognition of both vertical and horizontal container codes. The combination of these datasets allows the model to learn and generalize in the dynamic and often unpredictable nature of global transportation systems.
To support the training of the models, a unified annotation schema was adopted. Using the VGG Image Annotator (VIA) tool [23], each image in the dataset was manually annotated with bounding boxes around the container codes, using a consistent format that records, for each region of interest, its label (container identification code or size and type code), its bounding box, and the corresponding text. These annotations are reused in both the detection and recognition training, allowing the two stages to learn from the same data.
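For illustration, the sketch below parses a VIA JSON export into per-region records usable by both training stages; the region attribute keys ("label" and "text") are assumptions for this example and may differ from the actual project schema.

import json

def load_via_annotations(via_json_path: str) -> list[dict]:
    with open(via_json_path, "r", encoding="utf-8") as f:
        project = json.load(f)
    records = []
    for entry in project.values():
        for region in entry.get("regions", []):
            shape = region["shape_attributes"]        # rectangle: x, y, width, height
            attrs = region.get("region_attributes", {})
            records.append({
                "filename": entry["filename"],
                "box": (shape["x"], shape["y"], shape["width"], shape["height"]),
                "label": attrs.get("label", ""),      # container number vs. size and type code
                "text": attrs.get("text", ""),        # transcription used for recognition training
            })
    return records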

2.3. Hybrid Pipeline Architecture

The proposed OCR system follows a structured hybrid pipeline, designed to effectively handle the harsh conditions of port operations. The architecture is divided into two main stages: the detection stage and the recognition stage.
The proposed pipeline operates sequentially, as shown in Figure 3. An input image is processed by the detection stage, where YOLOv7 localizes the text regions of interest corresponding to the container codes. This yields bounding boxes that define the extent of each code. Next, these bounding boxes are used to crop the original image, isolating each code region.
Each cropped region is then fed into the recognition stage, where TrOCR transcribes the text into machine-readable characters. The recognized outputs include the container identification code and the size and type code. Following the recognition stage, a post-processing step validates the output against the ISO 6346 standard: it verifies that the recognized container identification code is valid by recalculating the check digit and confirming that the structure is correct, and it validates the accompanying size and type code against the standard specifications. Although this step is essential for real-world deployment and integration into port management systems, it was not included in the model performance evaluation, as it does not directly affect model accuracy. Nevertheless, this validator provides an important bridge between the raw pipeline outputs and their practical application in port operations, enhancing the overall reliability of the system.
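The sketch below summarizes the detect, crop, recognize, and validate flow described above. The detector interface and the is_valid_iso6346 helper are illustrative placeholders, not the authors' code; the TrOCR calls follow the Hugging Face API referenced in Section 2.3.2.

from PIL import Image

def read_container_codes(image_path, detector, processor, recognizer):
    image = Image.open(image_path).convert("RGB")
    results = []
    # Detection stage: one bounding box per localized code region.
    for (x1, y1, x2, y2) in detector.detect(image):
        crop = image.crop((x1, y1, x2, y2))
        # Recognition stage: TrOCR transcribes the cropped region.
        pixel_values = processor(images=crop, return_tensors="pt").pixel_values
        generated_ids = recognizer.generate(pixel_values)
        text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        # Post-processing: flag codes that fail the ISO 6346 validation.
        results.append({
            "box": (x1, y1, x2, y2),
            "text": text,
            "valid": is_valid_iso6346(text),
        })
    return results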

2.3.1. Detection Stage

The detection stage uses the YOLOv7 model, a well-established architecture for real-time object detection. Although not specifically designed for text detection, YOLOv7 offers a balance between speed and accuracy, making it suitable for real-time applications in port operations. These characteristics, combined with its proven reliability at the time of system development, made it an ideal choice for this task.
To adapt the model, YOLOv7 was fine-tuned on the combined dataset. Training was performed in two separate phases: first with a frozen backbone, to leverage the feature-extraction capabilities of the pre-trained model, and second with an unfrozen backbone, to allow the entire network to adapt to the specific characteristics of container code detection.
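A minimal PyTorch-style sketch of this two-phase idea is given below. The actual experiments used the official YOLOv7 training scripts; the assumption that parameter names embed the layer index (e.g., "model.12.conv.weight") is made here only for illustration.

def set_backbone_trainable(model, trainable: bool, n_backbone_layers: int = 50) -> None:
    """Freeze or unfreeze the first n_backbone_layers of a YOLO-style model."""
    for name, param in model.named_parameters():
        # Assumed naming convention: "model.<layer_index>.<...>".
        layer_index = int(name.split(".")[1])
        param.requires_grad = trainable or layer_index >= n_backbone_layers

# Phase 1: frozen backbone, only the detection head adapts.
# set_backbone_trainable(model, trainable=False)
# Phase 2: unfrozen backbone, the entire network adapts end to end.
# set_backbone_trainable(model, trainable=True)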

2.3.2. Recognition Stage

The recognition stage employs TrOCR, a transformer-based model that has shown strong performance in various OCR tasks, particularly in complex scenarios. This robustness was a key factor in its selection, as the model achieves competitive accuracy on OCR benchmarks [17]. The model is trained with a Seq2Seq strategy, implemented using the Hugging Face Transformers library [24].
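A minimal sketch of the corresponding setup with the Hugging Face Transformers API is shown below, assuming the trocr-base-stage1 checkpoint listed in the training protocol; the full fine-tuning loop (data collation, Seq2SeqTrainer configuration) is omitted.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

# The decoder needs explicit start and padding token ids before Seq2Seq training.
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

# Inference on a cropped code region (a PIL image named `crop`):
# pixel_values = processor(images=crop, return_tensors="pt").pixel_values
# generated_ids = model.generate(pixel_values)
# text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]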

2.3.3. Training Protocol

To ensure effective training and evaluation, the datasets were unified and randomly divided into training, validation, and testing sets using a 70/20/10 ratio. This unified split allows both the detection and recognition models to learn from the same data distribution, and the randomization ensures that each set contains a representative mix of images from both datasets, resulting in a balanced distribution of operational environments and imaging conditions across all sets.
This approach also ensures that the test set consists entirely of unseen images for both models, allowing for a fair evaluation of their performance.
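A minimal sketch of this random 70/20/10 split is shown below; a fixed seed keeps the partition identical for the detection and recognition training runs, and the function name is illustrative.

import random

def split_dataset(image_paths, seed: int = 42):
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(0.7 * len(paths))
    n_val = int(0.2 * len(paths))
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]   # remaining ~10% stays unseen until evaluation
    return train, val, test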
Hardware and Frameworks
  • GPU: NVIDIA GeForce GTX 1080 Ti (11 GB VRAM)
  • TrOCR Framework: Hugging Face Transformers v4.46.3
    - Base Model: trocr-base-stage1 from Hugging Face
    - WandB Integration: Full training metrics logging
  • YOLOv7 Framework: Custom PyTorch v2.0.1 implementation
    - Input Resolution: 512 × 512 pixels
    - Augmentations: Mosaic (1.0), MixUp (0.15), FlipLR (0.5)
    - WandB Integration: Full training metrics logging
Hyperparameter Configuration
The hyperparameters were selected based on the official implementations of each model, as these provided a solid foundation for training. The epochs were determined empirically, based on training results, until no further improvements were observed in the validation metrics. Table 2 summarizes the key hyperparameters used for training both the TrOCR and YOLOv7 models.
TrOCR Specifications
The TrOCR model was trained using a Vision Encoder-Decoder architecture, with the following technical specifications:
  • Architecture:
    - Vision Encoder-Decoder with 384 M parameters
    - ViT encoder processing 384 × 384 images with 16 × 16 pixel patches
    - Transformer decoder: 1024 hidden dimensions, 16 attention heads
    - Maximum sequence length: 64 tokens
    - Gradient clipping at 1.0 max norm
  • Training Configuration:
    - Mixed FP16 precision (O1)
    - Gradient checkpointing enabled
    - Early stopping on validation CER
YOLOv7 Specifications
As for YOLOv7, the pipeline followed two training variants, one with the backbone frozen and the other with the entire network unfrozen. Both configurations share the following technical specifications:
  • Architecture:
    - Darknet-based with 36.5 M parameters
    - Loss component weights: bounding box 0.05 (CIoU), classification 0.3, objectness 0.7
    - Optimal Transport Assignment (OTA) enabled
  • Training Configuration:
    - FP32 precision
  • Training Variants:
    - Frozen: first 50 layers fixed
    - Unfrozen: entire network trainable
The training pipeline used Weights and Biases (WandB) to monitor, in real-time, the performance of models during training. For the TrOCR recognition model, we tracked the Character Error Rate (CER) and training loss at each step, while for the YOLOv7 detection model, we followed the mean average precision (mAP) metrics at different Intersection over Union (IoU) thresholds. The system automatically saved checkpoints for TrOCR every 500 steps and after each epoch for YOLOv7, allowing us to track progress and recover if needed.

3. Results and Discussion

3.1. Evaluation Metrics

The hybrid OCR system is evaluated through a series of metrics that assess both the detection and recognition stages. Due to the different nature of the two tasks, distinct metrics are used for each stage, as described below.
In the detection stage, precision is used to measure the correctly identified text regions out of all detected regions, while recall measures the percentage of true text regions that were successfully detected. The mean Average Precision (mAP) at Intersection over Union (IoU) thresholds is also used, which provides a comprehensive measure of the model’s ability to localize text regions. In this work, the mAP@0.5 and mAP@0.5:0.95 are used, which measure the model’s ability to correctly localize text regions at different IoU thresholds.
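For reference, the IoU underlying these thresholds is the ratio between the intersection and the union of a predicted and a ground-truth box; a minimal sketch:

def iou(box_a, box_b) -> float:
    # Boxes are (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)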
For the recognition stage, we evaluate the character- and code-level accuracy of the system using the Character Error Rate (CER) and the Word Error Rate (WER). CER measures the accuracy of individual characters, while WER evaluates the accuracy of the predicted codes as a whole. Both are error-based metrics, where a value closer to 0 indicates a closer match between the predicted and ground-truth text.
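Concretely, CER can be computed as the Levenshtein edit distance between the predicted and ground-truth strings, normalized by the reference length; the sketch below illustrates this, with WER applying the same idea at the level of whole codes.

def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def cer(prediction: str, reference: str) -> float:
    return levenshtein(prediction, reference) / max(len(reference), 1)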

3.2. YOLOv7 Training

For the detection model, two distinct training processes were conducted using the YOLOv7 architecture. The first involved training with an unfrozen backbone, allowing the model to adapt its feature-extraction capabilities to the container code dataset.
In Figure 4a, the unfrozen model training exhibits unstable performance, despite efforts to optimize the hyperparameters. While the precision and recall initially showed signs of learning, after epoch 15, the values began to spike and continued to oscillate significantly, preventing convergence.
By contrast, the frozen backbone training, shown in Figure 4b, exhibited a more stable and consistent learning pattern. The precision and recall followed a steady upward trend. This suggests that freezing the backbone layers allowed the model to leverage the pretrained feature extraction capabilities, leading to improved performance.
As shown in Table 3, the frozen backbone model outperformed the unfrozen variant across all detection metrics. These results are evaluated at the bounding box level, where the overlap between the predicted and ground truth boxes determines the precision, recall, and mAP scores. The weaker performance of the unfrozen model likely reflects the difficulty of adapting a deeper network with a limited training set, which may have constrained effective learning in the backbone layers. By contrast, the frozen backbone model utilized the pre-trained feature representations, maintaining strong feature extraction capabilities while avoiding overfitting to the small dataset.

3.3. TrOCR Training

The TrOCR model was fine-tuned using pre-trained weights from the Hugging Face Transformers library, adapting it to recognize both vertical and horizontal container codes. The training leveraged the same dataset employed in the detection stage, ensuring consistency in data distribution and annotation quality.
As shown in Figure 5, the training loss (a), evaluation loss (b), and CER (c) decreased over the training steps, indicating effective learning. Since the CER quantifies the proportion of incorrectly predicted characters, it was used as the primary metric for evaluating model performance. Despite some spikes around steps 3000 and 4000, the CER reached a stable plateau by the end of training at 6000 steps.
When evaluated on the unseen test set, the TrOCR model demonstrated strong recognition performance, achieving a CER of 0.0244 and a WER of 0.1066. These results indicate that fewer than 3% of characters were transcribed incorrectly and that entire container codes were recognized with an 89.34% exact match rate, as shown in Table 4. They underscore the model's effectiveness in handling the challenges of OCR in container logistics, where high accuracy is essential for operational reliability.

3.4. Comparative Results

The final evaluation results of the OCR system, integrating both the detection and recognition stages, are presented in Table 5. Recall, precision, and mAP metrics are used to evaluate the detection stage, while vertical and horizontal recognition accuracy correspond to the character-level accuracy for codes in the vertical and horizontal orientations, respectively. Frames per second (FPS) measures the inference speed, allowing the processing efficiency of the different model combinations to be compared.
These metrics are essential, as real-world environments may represent either orientation, and evaluating the models’ performance in both scenarios is crucial for a comprehensive assessment of the OCR system’s capabilities.
The experimental results demonstrate that the combined use of YOLOv7 and TrOCR establishes a robust OCR system for container code recognition. The hybrid system achieves an outstanding text detection recall of 96.77% and a high precision of 99.40%, together with vertical and horizontal recognition accuracies of 99.11% and 96.86%, respectively. In terms of runtime, the hybrid model reached 3.0303 FPS on the test set, whereas the deep text recognition model achieved the highest FPS of 16.6667. The high FPS of the DTR model reflects its efficiency as a lightweight model; however, its recognition accuracy was considerably lower than that of TrOCR, indicating a trade-off between speed and accuracy in this environment.
Although the FPS values may indicate the system is not yet suitable for real-time applications, they establish a valuable reference for future optimizations. Figure 6 summarizes this comparison, illustrating the relative efficiency of each evaluated configuration.
The ability to accurately recognize both vertical and horizontal codes is particularly noteworthy, as conventional OCR systems are optimized for horizontal text due to their training on predominantly horizontal text datasets. By generalizing across orientations, the hybrid system avoids the need for additional models or processing steps, reducing overall complexity and improving robustness. This represents a significant advancement for practical applications where text orientation can vary, such as port operations and other logistics tasks.
For comparison, the standalone OCR systems Tesseract and EasyOCR were also evaluated on the same test set, using their default configurations to reflect standard usage in practical settings. The only adjustment was to Tesseract, whose page segmentation mode was set to 5 or 6 for vertical and horizontal text, respectively, to optimize its performance for each orientation. The results indicate that both standalone systems underperform significantly compared to the hybrid approach. Tesseract performed poorly, achieving only 0.41% precision in text detection and a low FPS of 1.3158. EasyOCR exhibited inconsistent recognition capabilities, with a 57.32 percentage point gap between vertical and horizontal recognition accuracy, indicating limitations in handling text orientation variability. EasyOCR also had the lowest FPS of 0.5155, showing that it is not a viable option for real-time applications.
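The baseline configurations correspond roughly to the sketch below, assuming the pytesseract and easyocr packages; only Tesseract's page segmentation mode is adjusted per orientation, as described above.

import pytesseract
import easyocr

def tesseract_read(image, vertical: bool) -> str:
    # PSM 5 treats the region as a vertically aligned block, PSM 6 as a uniform block of text.
    return pytesseract.image_to_string(image, config="--psm 5" if vertical else "--psm 6")

reader = easyocr.Reader(["en"])  # default English model, as used for the baseline

def easyocr_read(image) -> str:
    # readtext accepts a file path or NumPy array and returns (box, text, confidence) tuples.
    return " ".join(text for _, text, _ in reader.readtext(image))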
As shown in Figure 7, the YOLOv7 and TrOCR hybrid model achieves superior performance across both vertical (a) and horizontal (b) text recognition tasks. This orientation-agnostic performance suggests that the TrOCR transformer-based architecture successfully generalizes to rotation variations, a critical capability that is often lacking in traditional OCR systems.

3.5. System Performance Analysis

To evaluate the effectiveness of the hybrid OCR system, this section analyses its performance under different text orientations and image quality conditions. The goal is to assess how these factors influence the detection and recognition accuracy, providing insights into the system’s robustness and areas for potential improvement.
To observe the performance differences between vertical and horizontal text recognition, an error matrix comparing recognition errors for both orientations is presented in Figure 8. The results show that vertical codes account for the majority of correctly recognized samples, indicating the robustness of the system. In contrast, horizontal samples exhibited a slightly higher proportion of misreads.
Among the error categories, digit and letter substitutions were the most common, while extra characters being recognized as errors remained rare. This discrepancy may be attributed to the training data distribution or inherent challenges in recognizing horizontally oriented text in port environments. Understanding these error patterns is important for guiding future improvements in model training and architecture to enhance overall recognition accuracy across all text orientations.
To investigate the relationship between image quality and model performance, Table 6 presents the detection and recognition metrics across different BRISQUE image quality ranges. The results indicate that the model maintains high detection performance across varying image quality levels. In particular, the mAP@50 remains consistently above 96% across all BRISQUE ranges, demonstrating the robustness of the YOLOv7 detection model in handling images of differing quality. Recognition accuracy maintained stability, with vertical accuracy consistently above 96% and horizontal accuracy showing a slight decrease in the lowest quality range.
These results show the hybrid pipeline effectively manages variations in image quality, which is crucial for real-world applications. The decline in horizontal accuracy at lower BRISQUE scores highlights an area for future work to enhance recognition robustness under challenging imaging conditions.
These findings demonstrate that the hybrid OCR pipeline is effective in handling variations in text orientation and image quality. However, they also highlight areas for further research and optimization under certain conditions.

4. Conclusions

This paper presents a novel hybrid OCR system that combines YOLOv7 for text detection and TrOCR for text recognition. The integration of these two models addresses significant challenges in port operations, particularly in detecting and recognizing codes under varying conditions and orientations.
Experimental results demonstrate that the proposed system performs well, achieving high accuracy and robustness compared to existing models. It successfully recognizes text in scenarios where traditional OCR systems struggle, while maintaining accuracy and inference speed. The implementation of this hybrid OCR system represents a significant step towards automating and optimizing port operations. By improving the reliability and speed of container code recognition, it contributes to the streamlining of workflows and enables better resource allocation, leading to improved operational effectiveness in port settings.

4.1. Limitations

Although the proposed system shows promising results, several limitations were identified during the study.
Firstly, the datasets used for training and validation were limited in size and collected over a limited time frame, so they may not represent the full variability encountered in real-world port environments, which may affect the model's generalizability. Secondly, while YOLOv7 proved effective for text detection, newer YOLO architectures with improved capabilities, such as small-object detection, may offer better performance under certain port conditions. Ongoing evaluation and comparison of emerging models is therefore necessary to ensure that the most effective solution is deployed for text detection in port environments.
Third, because TrOCR was fine-tuned for this specific task, it may not generalize well to other text recognition tasks without further adaptation. Additionally, although the computational requirements for deploying this hybrid system for testing were manageable, scaling the implementation for real-time port operations may require further optimization to ensure efficiency and responsiveness.
Fourth, the statistical analysis presented in this work is limited to reporting standard OCR metrics. The use of more advanced statistical methods, such as confidence intervals or hypothesis testing, would provide stronger evidence of model robustness and a clearer understanding of performance variability across different conditions.
Finally, the post-processing step, while crucial for ensuring the reliability of the recognized codes in practical applications, was not implemented during the evaluation phase. This does not affect the intrinsic performance of the detection and recognition models, but it may be modified to improve the system performance or lead to more robust results in practical applications.

4.2. Future Work

With the experimental results reported in Section 3.4, a prototype version of the system is already being tested at the Port of Sines in collaboration with an industrial partner. Although operational results from this deployment are not yet available, future work will focus on reporting performance data from this real-world application, while capturing images under challenging and diverse operational conditions. This will provide valuable insights into the system's effectiveness in operational settings and identify areas for further refinement and optimization.
A major focus of future research will be enhancing the system’s robustness and adaptability. To this end, expanding the dataset to encompass a broader range of environmental and operational conditions will strengthen the models’ generalization capabilities. Improving inference speed also remains a priority, as the current throughput is below real-time operational demands. Techniques such as model pruning, quantization, or exploring lightweight architectures could help achieve higher speeds while maintaining accuracy.
Another important direction is the comprehensive evaluation of the current system and its components. Ablation studies will be conducted to analyze the performance variation between the fine-tuned YOLOv7 models and provide insight into the performance gap between the frozen and unfrozen configurations. Additionally, the statistical analysis of the evaluated models will be expanded to include more rigorous techniques for both the detection and recognition models, such as k-fold cross-validation and hypothesis testing, offering stronger evidence of model robustness and a deeper understanding of performance variability across different conditions.
Given the significance of the results presented, further statistical validation with paired statistical tests, such as the paired t-test, will provide stronger evidence of model robustness, and the calculation of confidence intervals will expand the understanding of performance variability. Constructing error matrices that capture common recognition errors, such as occlusions, surface wear, and misclassifications, will assist in identifying systematic weaknesses and areas for improvement.
To enable this level of analysis, a more detailed dataset with a systematic, condition-aware collection of images, annotated with specific imaging conditions such as lighting variations, occlusions from cargo or equipment, motion blur, and surface wear, would be beneficial. With these detailed annotations, insights into recognition behavior across different scenarios could be obtained, enabling adaptive training strategies and error correction mechanisms tailored to the operational challenges of port environments.
Similarly, the post-processing module can be refined into a more advanced validation system that not only verifies the recognized codes but also corrects recognition errors. This can be achieved by cross-referencing with a database of known container codes, employing template matching techniques, or incorporating contextual information from port logistics systems. Such refinements could enable automatic correction of check-digit mismatches and improve overall recognition reliability in real-world applications.
In addition, future research should explore the integration of this OCR system with other port management systems, such as inventory tracking and logistics planning. This integration could provide a more comprehensive solution for port operations, leveraging the strengths of the OCR system to enhance overall efficiency and effectiveness.
In conclusion, the proposed hybrid OCR system offers an effective and practical solution to the challenge of container code recognition in port operations. With continued development and real-world deployment, it has the potential to become a critical component in the modernization and automation of logistics infrastructure.

Author Contributions

Conceptualization J.S., D.C. and A.J.R.N.; methodology J.S. and D.C.; software J.S.; validation J.S., D.C. and A.J.R.N.; formal analysis J.S.; investigation J.S.; resources A.J.R.N.; data curation J.S.; writing—original draft preparation J.S.; writing—review and editing D.C. and A.J.R.N.; visualization J.S.; supervision D.C. and A.J.R.N.; project administration A.J.R.N.; funding acquisition A.J.R.N. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the PRR—Recovery and Resilience Plan and by the NextGenerationEU funds at Universidade de Aveiro, through the scope of the Agenda for Business Innovation “NEXUS: Pacto de Inovação—Transição Verde e Digital para Transportes, Logística e Mobilidade” (Project no. 53 with the application C645112083-00000059).

Data Availability Statement

The open source data is available at the links referenced in the text. All other datasets referenced in this study are available from the corresponding author. The data is not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AdamW: Adaptive Moment Estimation with Weight Decay
BRISQUE: Blind Referenceless Image Spatial Quality Evaluator
BIC: Bureau International des Containers
CER: Character Error Rate
CIoU: Complete Intersection over Union
CNN: Convolutional Neural Network
CRNN: Convolutional Recurrent Neural Network
DL: Deep Learning
DTR: Deep Text Recognition
DWT: Discrete Wavelet Transform
EMA: Exponential Moving Average
FP16: 16-bit Floating Point Precision
FPS: Frames Per Second
IoU: Intersection over Union
ISO: International Organization for Standardization
LSTM: Long Short-Term Memory
mAP: mean Average Precision
ML: Machine Learning
OCR: Optical Character Recognition
OTA: Optimal Transport Assignment
PSA: Port of Sines Authority
RNN: Recurrent Neural Network
Seq2Seq: Sequence-to-Sequence
SGD: Stochastic Gradient Descent
SVD: Singular Value Decomposition
TrOCR: Transformer-based Optical Character Recognition
ViT: Vision Transformer
WandB: Weights and Biases
WER: Word Error Rate
YOLOv7: You Only Look Once version 7

References

  1. UNCTAD. Review of Maritime Transport 2024: Navigating Maritime Chokepoints. Available online: https://shop.un.org (accessed on 29 May 2025).
  2. de la Peña Zarzuelo, I.; Freire Soeane, M.J.; López Bermúdez, B. Industry 4.0 in the port and maritime industry: A literature review. J. Ind. Inf. Integr. 2020, 20, 100173. [Google Scholar] [CrossRef]
  3. Yang, Y.; Gai, T.; Cao, M.; Zhang, Z.; Zhang, H.; Wu, J. Application of Group Decision Making in Shipping Industry 4.0: Bibliometric Analysis, Trends, and Future Directions. Systems 2023, 11, 69. [Google Scholar] [CrossRef]
  4. Filom, S.; Amiri, A.M.; Razavi, S. Applications of machine learning methods in port operations—A systematic literature review. Transp. Res. Part E Logist. Transp. Rev. 2022, 161, 102722. [Google Scholar] [CrossRef]
  5. Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar] [CrossRef]
  6. Vedhaviyassh, D.R.; Sudhan, R.; Saranya, G.; Safa, M.; Arun, D. Comparative Analysis of EasyOCR and TesseractOCR for Automatic License Plate Recognition using Deep Learning Algorithm. In Proceedings of the 6th International Conference on Electronics, Communication and Aerospace Technology, ICECA 2022—Proceedings, Coimbatore, India, 1–3 December 2022; pp. 966–971. [Google Scholar] [CrossRef]
  7. Raj, R.; Kos, A. A Comprehensive Study of Optical Character Recognition. In Proceedings of the 2022 29th International Conference on Mixed Design of Integrated Circuits and System (MIXDES), Wrocław, Poland, 23–24 June 2022; pp. 151–154. [Google Scholar] [CrossRef]
  8. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9357–9366. [Google Scholar]
  9. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  10. Feng, X.; Wang, Z.; Liu, T. Port container number recognition system based on improved YOLO and CRNN Algorithm. In Proceedings of the International Conference on Artificial Intelligence and Electromechanical Automation, AIEA 2020, Tianjin, China, 26–28 June 2020; pp. 72–77. [Google Scholar] [CrossRef]
  11. Tahiri, M.A.; Karmouni, H.; Azzayani, A.; Sayyouri, M.; Qjidaa, H. Fast 3D Image Reconstruction by Separable Moments based on Hahn and Krawtchouk Polynomials. In Proceedings of the 2020 Fourth International Conference on Intelligent Computing in Data Sciences (ICDS), Fez, Morocco, 21–23 October 2020; pp. 1–7. [Google Scholar] [CrossRef]
  12. Bencherqui, A.; Tahiri, M.A.; Karmouni, H.; Daoui, A.; Alfidi, M.; Jamil, M.O.; Qjidaa, H.; Sayyouri, M. Optimization of Meixner Moments by the Firefly Algorithm for Image Analysis. In Digital Technologies and Applications; Motahhir, S., Bossoufi, B., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 439–448. [Google Scholar] [CrossRef]
  13. Baek, J.; Kim, G.; Lee, J.; Park, S.; Han, D.; Yun, S.; Oh, S.J.; Lee, H. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4714–4722. [Google Scholar]
  14. Tahiri, M.A.; El hlouli, F.Z.; Bencherqui, A.; Karmouni, H.; Amakdouf, H.; Sayyouri, M.; Qjidaa, H. White blood cell automatic classification using deep learning and optimized quaternion hybrid moments. Biomed. Signal Process. Control 2023, 86, 105128. [Google Scholar] [CrossRef]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5999–6009. [Google Scholar]
  16. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  17. Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13094–13102. [Google Scholar]
  18. Bencherqui, A.; Tamimi, M.; Tahiri, M.A.; Hicham, M.A.K.; Alfidi, M.; Jamil, M.O.; Qjidaa, H.; Mhamed, S. Optimal Color Image Watermarking Based on DWT-SVD Using an Arithmetic Optimization Algorithm. In Proceedings of the 6th International Conference on Smart Systems and Inventive Technology (ICSSIT 2023), Tirunelveli, India, 7–9 April 2023; pp. 381–391. [Google Scholar] [CrossRef]
  19. ISO 6346:2022; Freight Containers—Coding, Identification and Marking. ISO: Geneva, Switzerland, 2022. Available online: https://www.iso.org/standard/83558.html (accessed on 10 February 2024).
  20. Bureau International des Containers. BIC Code—The Standard for Container Identification. Available online: https://www.bic-code.org/ (accessed on 27 May 2025).
  21. Lin, B. ContainerNumber-OCR: Container Number Recognition Based on YOLOv7 and CRNN. GitHub Repository. 2023. Available online: https://github.com/lbf4616/ContainerNumber-OCR (accessed on 20 February 2025).
  22. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef] [PubMed]
  23. Dutta, A.; Zisserman, A. VGG Image Annotator (VIA). Version 2.0.12. Available online: http://www.robots.ox.ac.uk/~vgg/software/via/ (accessed on 30 May 2025).
  24. Hugging Face. TrOCR Model Documentation. 2024. Available online: https://huggingface.co/docs/transformers/model_doc/trocr (accessed on 18 February 2025).
Figure 1. ISO 6346 container code structure. (a) Horizontal layout. (b) Vertical layout.
Figure 2. Example images from the PSA dataset; (a) horizontal code. (b) vertical code.
Figure 3. Hybrid OCR pipeline architecture.
Figure 4. Training performance of the YOLOv7 detection model; (a) Unfrozen configuration, trained for 50 epochs. (b) Frozen configuration, where the backbone layers were frozen, trained for 50 epochs.
Figure 5. Training and evaluation of the TrOCR recognition model over 6000 steps; (a) training loss, (b) evaluation loss, and (c) Character Error Rate (CER).
Figure 6. Runtime performance comparison of the evaluated OCR systems, measured in frames per second (FPS).
Figure 7. Examples of the hybrid model on the test set, showing detected bounding boxes and recognized characters in both vertical and horizontal orientations; (a) Vertical code recognition. (b) Horizontal code recognition.
Figure 8. Error matrix comparing recognition errors between vertical and horizontal container codes.
Table 1. Image quality statistics for the PSA and GitHub datasets.
Dataset          L (mean)   L (σ)   BRISQUE
PSA dataset      32.06      18.80   41.70
GitHub dataset   38.94      23.32   28.39
Table 2. Comparative training hyperparameters.
Parameter       TrOCR           YOLOv7 (Frozen)    YOLOv7 (Unfrozen)
Batch Size      8               16                 16
Learning Rate   5 × 10^-5       0.01               0.01
LR Schedule     Linear Warmup   Cosine Annealing   Cosine Annealing
Warmup Steps    500 steps       3 epochs           3 epochs
Epochs          25              50                 50
Optimizer       AdamW           SGD                SGD
Momentum        -               0.937              0.937
Weight Decay    0               0.0005             0.0005
Table 3. Comparison of YOLOv7 detection performance with frozen backbone vs. original model.
Model                      Precision (P)   Recall (R)   mAP@0.5   mAP@0.5:0.95
YOLOv7 (Frozen Backbone)   94.8            95.4         96.9      65.3
YOLOv7 (Original)          91.1            79.7         84.0      56.3
Table 4. TrOCR model performance on the test dataset.
Model   CER      WER      Character Accuracy   Exact Match Score
TrOCR   0.0244   0.1066   97.56%               89.34%
Table 5. Final evaluation results of the OCR system using YOLO for detection and TrOCR for recognition.
Model                                 Detection Recall   Detection Precision   mAP@50   mAP@95   Vertical Recognition Accuracy   Horizontal Recognition Accuracy   FPS
YOLOv7 + TrOCR                        96.77%             99.40%                96.69%   70.84%   99.11%                          96.86%                            3.0303
YOLOv7 + Tesseract                    87.66%             99.34%                87.57%   64.41%   2.58%                           73.25%                            2.8571
YOLOv7 + EasyOCR                      89.48%             99.54%                89.48%   65.84%   15.31%                          64.23%                            0.5291
YOLOv7 + Deep text recognition [13]   90.12%             99.40%                96.69%   70.80%   10.79%                          7.88%                             16.6667
Tesseract                             11.76%             0.41%                 2.15%    0.77%    2.04%                           33.48%                            1.3158
EasyOCR                               88.73%             16.47%                27.09%   10.02%   13.85%                          71.17%                            0.5155
Table 6. Comparison of detection and recognition metrics across different BRISQUE image quality ranges.
BRISQUE   Number of Images   mAP@50   Horizontal Acc.   Vertical Acc.
0–20      141                98.26%   88.89%            98.07%
21–40     387                98.46%   92.44%            98.22%
41–60     184                96.02%   97.85%            98.20%
61–80     25                 97.83%   89.47%            96.23%
81–100    -                  -        -                 -