1. Introduction
Angle steel is the main structural material of electric power towers; each angle steel tower contains hundreds or even thousands of angle steel parts. Between unloading and shipment, the parts pass through hole-making, backing, trimming, galvanizing, sorting, packing, and other processes. Each part is given a dedicated steel stamp number, usually produced with a character mold punch. Because the steel stamp characters carry production information such as the dispatch number, equipment number, worker code, and processing time, they are crucial for product quality control and responsibility tracing. At present, identification relies mainly on traditional manual methods, which are not only inefficient but also susceptible to subjective bias, so there is an urgent need for technological innovation to achieve automatic identification [1,2]. Although existing OCR (Optical Character Recognition) equipment [3] and advanced machine vision technology are quite mature in many application areas, they still suffer from accuracy and consistency problems when recognizing steel stamp characters on transmission tower angle steel parts, and it is difficult for them to meet the requirements of industrial applications. This limitation is mainly due to the complex factors that arise during the manufacture and use of steel stamp characters, such as the material of the embossed part, the concave and convex height of the characters, tilt angle, mutilation, adhesion, small size, and irregular shape, which result in an uneven grayscale distribution and degraded image clarity. As a result, error rates in character segmentation and recognition are high, which makes angle steel embossed character recognition a considerable challenge. Most traditional character recognition methods are based on template-matching similarity, image edge information, shape features, and statistical classification techniques. To address the feature extraction problem of steel-stamped characters, Geng et al. [4] used a method based on fractal dimension and Hidden Markov features to binarize the characters and then used multiple classifiers to recognize the embossed characters of license plates. Zhang et al. [5] proposed an embossed character segmentation method that first screens the embossed character region, then performs morphological optimization, and finally combines the extracted embossed character features with a BP neural network to achieve embossed character recognition. These methods are sensitive to image noise and to small deformations of embossed characters, and they have difficulty handling character sets with concave and convex features and recognizing embossed characters against complex backgrounds.
In recent years, image recognition methods based on deep learning [6] have gained widespread attention for their speed and accuracy. Among them, YOLO (You Only Look Once), proposed by Redmon et al. [7,8,9,10,11,12,13,14], formulates target recognition as a regression problem and directly predicts bounding boxes and class probabilities from the full image with a single neural network optimized end to end; it has become one of the most popular methods for image recognition. Jaderberg et al. [15] proposed a method using synthetic data and artificial neural networks for natural scene text recognition, significantly improving recognition accuracy. Shi et al. [16] introduced an end-to-end trainable neural network for image-based sequence recognition, which is particularly useful for scene text recognition. Graves et al. [17] developed a multidimensional recurrent neural network for offline handwriting recognition, demonstrating its effectiveness in handling complex handwriting sequences. To further improve recognition accuracy in specific application areas, many scholars have carried out innovative research based on the YOLO framework. For example, Zhao Yan et al. [18] proposed a detection model that combines the Darknet framework with the YOLOv4 algorithm and enhanced its ability to detect multi-scale defects in circuit board characters by designing suitable anchors with the k-means clustering algorithm. Y.S. Si et al. [19] proposed an improved YOLOv4 model that employs the MSRCP (multi-scale retinex with chromaticity preservation) algorithm for image enhancement in low-light environments and incorporates an RFB-s structure in the model backbone to improve robustness to changes in cow body patterns. In addition, they improved the Non-Maximum Suppression (NMS) algorithm to raise the accuracy of target recognition. Huaibo Song et al. [20] improved the YOLOv5s detection network by introducing the Mixed Depth Separable Convolution (MixConv, MDC) module and combining it with the Squeeze-and-Excitation (SE) module in order to reduce the model parameters with essentially no loss of accuracy. In this improvement, the conventional convolution, normalization, and activation function (CBH) module in the backbone of the feature extraction network is replaced with the MDC module to form the YOLO v5-MDC network, which effectively reduces the model parameters. Although these improved YOLO-based target detection methods have raised detection accuracy in specific applications, they generally suffer from high model complexity and slow training. These drawbacks further increase the difficulty of detecting objects with small dimensions, varying locations, and irregular morphology.
To address the concave and convex characteristics of the characters on power tower angle steel parts, image noise, and related problems, this paper designs a multi-scale residual attention coding module and a selectable clustering minimum iteration center to improve the YOLOv5 model, and proposes a new deep learning network model for recognizing embossed characters on steel parts. The work makes two contributions: (1) in the YOLOv5 architecture, the channel multi-scale residual attention coding algorithm is integrated into the Spatial Pyramid Pooling-Fast (SPPF) stage and the downsampling stage of the backbone network, which enhances the network's ability to extract shallow features and assign weights to the detection targets, thus optimizing overall detection performance; and (2) a cluster center selection method is designed that selects, in each iteration, the cluster centers that minimize the differences within the clusters, ensuring that the selected centers best represent the characteristics of individual characters and thereby improving recognition of the shape and size of embossed characters. The main contents of this paper are structured as follows: the first part briefly describes the current state of research on the detection of steel stamp characters on angle steel and the structure of the paper; the second part briefly describes the challenges faced in the production of angle steel for transmission towers and in the automated recognition of stamped characters; the third part proposes the recognition model YOLOv5-R for mutilated and tiny stamped characters; the fourth part compares the improved model with other models through experiments and performance analyses on different types of angle steel embossed characters in detection tasks; and finally, the superiority of the improved model and its application potential are summarized.
3. Network Model
3.1. Angle Steel Embossed Character Recognition Based on YOLOv5
YOLO improves processing efficiency by reducing target detection to a single regression problem. The neural network divides the image into a grid, and each grid cell independently predicts the targets that fall within it. Batch normalization and ReLU activation are applied after each convolutional layer to improve nonlinear processing capability and overall stability. YOLOv5 is designed for efficiency and speed, making it suitable for resource-constrained environments while maintaining good detection accuracy. Its core architecture includes multiple convolutional layers, residual blocks, and an anchor box mechanism. Convolutional layers capture visual features at various levels using multi-scale convolutional kernels, while residual blocks address the vanishing gradient problem in deep networks through skip connections. The anchor box mechanism, by presetting bounding boxes of different sizes and ratios, accelerates the model's convergence for targets of various sizes and shapes and improves detection accuracy, as shown in Figure 3. The arrows in the figure represent the flow of data through the layers of the model, indicating the sequence of operations performed. The straight black arrows denote the standard forward pass through the Batch Normalization (BN), ReLU activation, and Convolution (Conv) layers, while the curved red arrows indicate the residual connections that add the input of a block to its output, helping to mitigate the vanishing gradient problem in deep networks.
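The role of the skip connection can be illustrated with a minimal sketch. It is not the YOLOv5 implementation: batch normalization is omitted, the convolution is replaced by a small dense map, and the names `residual_block`, `linear`, and the weight matrices are hypothetical. The point is only that the block outputs the input plus a transformed copy, so the input passes through unchanged when the transformation contributes nothing.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Stand-in for a convolution: a dense map given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, weights):
    """Return x + F(x), where F is ReLU followed by the stand-in linear layer."""
    transformed = linear(relu(x), weights)
    return [xi + fi for xi, fi in zip(x, transformed)]


x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]
identity_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With zero weights only the skip path contributes, so the input is returned as is.
print(residual_block(x, zero_weights))      # [1.0, -2.0, 3.0]
# With identity weights the output is the input plus ReLU(input).
print(residual_block(x, identity_weights))  # [2.0, -2.0, 6.0]
```

Because the identity path is always present, the derivative of the output with respect to the input contains a constant term of one, which is why such connections help gradients reach early layers.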
However, YOLOv5 performs poorly in recognizing damaged and rusty embossed characters. These characters are usually small and prone to damage, leading to missed or incorrect detections. Additionally, corrosion and wear can make it difficult to distinguish characters from the background, reducing recognition accuracy. While the standard anchor box mechanism optimizes detection for targets of various sizes and shapes, its preset sizes and ratios may not be sufficient to cover all character variants, affecting the model’s flexibility and adaptability. Residual blocks may fail to adequately retain critical information when dealing with these detailed and damaged characters, resulting in decreased recognition accuracy. To address these shortcomings, we introduced a multi-scale feature fusion mechanism that better captures different levels of information in images, improving detection capabilities for complex scenes and small targets. Additionally, we optimized the YOLOv5 anchor box mechanism to better suit small target detection by introducing a dynamic anchor box adjustment strategy that adjusts anchor box sizes dynamically based on target size and shape, thereby improving detection accuracy. Furthermore, we incorporated an attention mechanism into the network to enhance the model’s focus on key features, thereby improving overall detection performance. With self-attention and channel attention mechanisms, the model can more effectively distinguish targets from the background, increasing recognition accuracy.
3.2. Multi-Scale Residual Channel Attention Coding Mechanism Design (MSRC)
Attention mechanisms in image processing can finely control the channel and spatial dimensions of a model and use masks to direct attention precisely, thus improving recognition accuracy. They include self-attention, domain attention, and the channel and spatial attention modules (CAM and SAM). Although these mechanisms enhance image processing performance, the model still struggles to distinguish subtle differences between background noise and the actual target when dealing with small or damaged targets, which limits the recognition rate. These problems may be further amplified as the depth and complexity of the model increase, because the retention and reuse of earlier features degrade, which again limits gains in recognition accuracy. To overcome these limitations, this paper designs a multi-scale residual channel attention coding mechanism that remedies the insufficient filtering of non-critical information, thereby enhancing the recognition of small and damaged targets; its structure is shown in Figure 4. By strengthening feature extraction and information integration, this mechanism can significantly improve the recognition accuracy of the model, especially for low-resolution and small targets.
The multi-scale residual attention coding algorithm processes the feature map in fine detail and, by combining residual learning with the attention mechanism, effectively enhances feature representation and the continuous flow of information. The algorithm first applies residual connections that allow gradients to flow directly through the network, preventing information from being lost as it passes through deep layers. A 1 × 1 convolutional kernel is then used to reduce the dimensionality of the feature map, which lowers the computational requirements of the model while preserving the most critical information. A global pooling step follows, which compresses the features spatially, captures the most globally important features, and provides an accurate basis for the final attention weighting. The normalized exponential function (Softmax) is then used to assign the attention weights, which enhances the model's ability to determine feature importance by amplifying meaningful feature differences and normalizing the weights, as shown in Figure 5. Additionally, the Softmax function is applied at the final classification stage to convert the output logits into a probability distribution, enabling the model to assign a class label to each detected object with a confidence score. The calculation of the attention weights relies not only on the amount of information in the features themselves but also on the relative importance of the features, according to the following formula:
$$\alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}$$
where $\alpha_i$ represents the attention weight of the $i$-th feature map, $e_i$ is the globally pooled response of that feature map, exp denotes the natural exponential function, which ensures that each weight is positive and amplifies the differences between the vector representations of the feature maps, and the denominator $\sum_{j} \exp(e_j)$ sums the exponential terms of all feature maps so that the attention weights sum to 1. The algorithm then resizes all feature maps to the same size, weights each one by its attention weight, and adds the weighted feature maps to form the final fused feature map; the fusion formula is as follows:
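$$g = \sum_{i} \alpha_i \cdot \mathrm{Conv}_{1 \times 1}(f_i)$$
where $g$ denotes the fused feature map, $\alpha_i$ is the attention weight of the $i$-th feature map, $\mathrm{Conv}_{1 \times 1}(f_i)$ is the result of processing the $i$-th feature map $f_i$ with a 1 × 1 convolution, and $\sum_{i}$ denotes traversing all feature maps and adding their weighted results to obtain the final fused feature map.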
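The Softmax weighting and weighted fusion described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the feature maps are assumed to have already been reduced by the 1 × 1 convolution and resized to a common shape, and the names `softmax`, `fuse`, `maps`, and `scores` are hypothetical.

```python
import math


def softmax(scores):
    """Normalized exponential: positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(feature_maps, weights):
    """Weighted element-wise sum of equally sized 2-D feature maps."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * fm[r][c] for w, fm in zip(weights, feature_maps)) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2 x 2 feature maps and their pooled scores.
maps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
scores = [0.0, 0.0]  # equal scores give equal attention weights

alphas = softmax(scores)
fused = fuse(maps, alphas)
print(alphas)  # [0.5, 0.5]
print(fused)   # [[0.5, 1.0], [1.0, 0.5]]
```

Raising one score relative to the others shifts weight toward the corresponding feature map, which is the amplification effect attributed to the exponential.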
Residual attention coding uses cosine similarity to calculate the degree of match between the feature maps and predefined templates, which serves as a further measure of the model's feature recognition accuracy. Cosine similarity measures the similarity between two vectors as their dot product normalized by their norms, which effectively determines the consistency between the model output and the target template, as shown in the following equation:
$$S(i, b) = \frac{F(i) \cdot F(b)}{\lVert F(i) \rVert \, \lVert F(b) \rVert}$$
where $S(i, b)$ denotes the similarity between the $i$-th feature map and the template $b$, $F(i)$ denotes the vector representation of the $i$-th feature map after a series of processing steps, $F(b)$ denotes the vector representation of the reference template feature map after the same processing, $\cdot$ denotes the dot product of the two vectors, and $\lVert F(i) \rVert$ and $\lVert F(b) \rVert$ denote the norms of $F(i)$ and $F(b)$. To evaluate the attention weight of each feature map, the computed similarity is fed into the Softmax layer, which normalizes it into a probability. Finally, the sum of weighted feature maps is generated by bringing the up-sampled C1 and the corresponding other prediction heads Ci to the same size and number of channels and applying their attention weights, which is calculated as follows:
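$$A = \sum_{i} \alpha_i \cdot C_i$$
where $A$ denotes the sum of weighted feature maps, $\alpha_i$ is the attention weight of the $i$-th feature map, and $C_i$ is that feature map after resizing to the common size and number of channels.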
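As a rough illustration of the similarity step, the sketch below computes the cosine similarity of several feature vectors against a template vector and converts the similarities into attention weights with Softmax. The vectors and the names `template` and `features` are made up for the example, and flattening each feature map into a plain vector is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms.

    Both vectors are assumed to be non-zero; a zero vector would divide by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Normalized exponential: positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


template = [1.0, 0.0]       # hypothetical reference vector F(b)
features = [
    [2.0, 0.0],             # same direction as the template
    [0.0, 3.0],             # orthogonal to the template
    [-1.0, 0.0],            # opposite direction
]

sims = [cosine_similarity(f, template) for f in features]
weights = softmax(sims)
print(sims)     # similarities in [-1, 1]; magnitude of the vectors does not matter
print(weights)  # the best-matching feature receives the largest weight
```

Because cosine similarity ignores vector length, a feature map that matches the template in pattern but differs in overall intensity still scores highly.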
3.3. Design of Optional Clustering Minimum Iteration Center Module (OCMC)
In the YOLOv5 algorithm, the traditional K-means algorithm [22,23] is used to optimize the anchor box sizes. Initial cluster centers are selected at random, and their positions are updated iteratively to minimize the distance from the sample points to the cluster centers. Despite its simplicity and efficiency, the K-means algorithm still faces problems such as high computational complexity and dependence on a priori knowledge of the number of clusters. First, the cost is especially high for high-dimensional data or large-scale datasets, since the distance from every data point to the existing cluster centers must be calculated each time a new center is selected. Second, the number of clusters must be chosen a priori, which is often difficult in practical applications. In addition, Euclidean distance, the core metric of the clustering, may not adequately express the actual similarity between data points in some application scenarios.
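The limitation of Euclidean distance for box sizes can be seen in a small numerical example. The sketch below compares the Euclidean distance between two (width, height) pairs with their overlap ratio when the boxes share a corner, which is the usual convention in anchor clustering. The box sizes are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two (width, height) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def iou_wh(a, b):
    """Intersection over union of two boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


small_pair = ((10, 10), (20, 20))
large_pair = ((100, 100), (110, 110))

# Both pairs differ by 10 pixels in each dimension, so Euclidean distance is identical.
print(euclidean(*small_pair), euclidean(*large_pair))
# The overlap ratio tells them apart: the small boxes are far less alike.
print(iou_wh(*small_pair))  # 0.25
print(iou_wh(*large_pair))  # roughly 0.83
```

A 10-pixel difference doubles the side of a small character box but barely changes a large one, and only the overlap-based measure reflects this.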
To handle small and damaged targets more effectively, an optimized clustering algorithm is proposed. It replaces the traditional Euclidean distance with a distance based on the intersection over union (IoU), because Euclidean distance cannot accurately reflect the similarity between anchor boxes in target detection. The IoU, in contrast, takes into account the degree of overlap between anchor boxes and is therefore a more reasonable measure of their similarity. In this way, the problems caused by purely random selection of clustering centers can be effectively avoided, and the clustering process more accurately reflects the shape and size differences of the targets. The algorithm first randomly selects a sample from the data set as the first clustering center. For each sample not yet selected as a center, its distance to the nearest clustering center is calculated, and the probability of selecting that sample as the next clustering center is set proportional to the inverse of the IoU distance; the formula is as follows:
$$D(x, C) = 1 - \mathrm{IoU}(x, C)$$
where $x$ denotes the sample, $C$ denotes the clustering center, $D$ denotes the distance metric, and IoU denotes the ratio of the area of the intersection of the two bounding boxes to the area of their union. This probabilistic model ensures that the selection of clustering centers both incorporates randomness and reflects the actual physical proximity between samples. During the iteration process, each sample is assigned to the nearest cluster center based on its IoU distance. The center of each cluster is then updated so that the sum of the IoU distances of all samples within the cluster is minimized, and the algorithm terminates when the change in the cluster centers falls below a threshold or the maximum number of iterations is reached. The updated center is calculated as follows:
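$$C_{new} = \arg\min_{c} \sum_{x \in S} D(x, c)$$
where $C_{new}$ denotes the new clustering center, $S$ denotes the set of samples assigned to the cluster, and $D$ is the IoU distance defined above.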
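The clustering procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: boxes are (width, height) pairs aligned at a common corner, candidate centers are weighted by their IoU distance to the nearest existing center in the style of k-means++ (the exact weighting in the paper may differ), each new center is restricted to a member of its cluster, and "no change in the centers" stands in for the convergence threshold. All function names are hypothetical.

```python
import random


def iou_wh(a, b):
    """Intersection over union of two (width, height) boxes aligned at a corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def iou_distance(a, b):
    """Distance used for clustering: 1 - IoU."""
    return 1.0 - iou_wh(a, b)


def seed_centers(boxes, k, rng):
    """First center at random; later centers weighted by distance to the nearest center.

    Assumes the data contain at least k distinct boxes.
    """
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        weights = [min(iou_distance(b, c) for c in centers) for b in boxes]
        centers.append(rng.choices(boxes, weights=weights, k=1)[0])
    return centers


def assign(boxes, centers):
    """Group every box with its nearest center under the IoU distance."""
    clusters = [[] for _ in centers]
    for box in boxes:
        nearest = min(range(len(centers)), key=lambda i: iou_distance(box, centers[i]))
        clusters[nearest].append(box)
    return clusters


def update_center(cluster):
    """Pick the member with the smallest summed IoU distance to the whole cluster."""
    return min(cluster, key=lambda c: sum(iou_distance(x, c) for x in cluster))


def cluster_anchors(boxes, k, max_iter=50, seed=0):
    rng = random.Random(seed)
    centers = seed_centers(boxes, k, rng)
    for _ in range(max_iter):
        clusters = assign(boxes, centers)
        new_centers = [update_center(cl) if cl else c for cl, c in zip(clusters, centers)]
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return sorted(centers)


# Made-up box sizes: small square, medium square, and wide characters.
boxes = [(10, 12), (11, 10), (12, 11), (40, 42), (42, 40), (41, 41),
         (90, 30), (92, 32), (88, 31)]
print(cluster_anchors(boxes, k=3))
```

As with any K-means variant, the result depends on the initial centers, so running the sketch with several seeds and keeping the clustering with the lowest total distance is a reasonable precaution.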
3.4. Overall Framework of the Improved YOLOv5-R Network Model
To improve the detection accuracy of small and damaged targets, the improved model combines the multi-scale residual attention coding mechanism with the optional clustering minimum iteration center module. The attention mechanism extracts channel features through global max pooling and adjusts the response strength of each channel using the sigmoid activation function, which enhances the model's ability to handle complex backgrounds and multi-size targets and improves its capture of key features. The clustering module optimizes the clustering process by using the intersection over union (IoU) as the distance metric in place of the traditional Euclidean distance, which reduces the dependence on a priori knowledge of the number of clusters and reflects the similarity of anchor boxes more accurately, thereby improving the efficiency and accuracy of clustering. The structure of the improved YOLOv5-R neural network is shown in Figure 6; it is based on the YOLOv5 architecture and integrates the backbone network, the multi-scale residual attention mechanism, the feature pyramid, and the detection head. To optimize performance and retain more image detail, the model uses an input image size of 3 × 640 × 640. The input image first passes through high-resolution convolutional blocks (CBS: convolution, batch normalization, and SiLU activation), which are cascaded to extract complex and abstract features layer by layer. The image features then enter the multi-level CSP1_X (Cross Stage Partial) modules, shown as the orange, purple, and gray feature layers in the figure. Each module handles features at a certain granularity and reduces the computational cost and the number of model parameters through a split-and-merge strategy. After the feature map is further refined, it is fed into the multi-scale residual attention coding mechanism, which reduces the dimensionality while retaining the key information through a 1 × 1 convolution kernel and uses Global Max Pooling (GMP) to extract the salient features in the map. The weights of the pooled features are then computed by the normalized exponential function and integrated into a comprehensive feature map to enhance the accuracy of target segmentation. The feature maps output by the attention module are downsampled to compress the image resolution while retaining the core information. These feature maps are input into the feature pyramid network; in parallel, the classification probability is computed through the Softmax layer and combined by multiplication with the convolutional features of the CSP module in the feature pyramid, to ensure that the output classification results have a high confidence level.
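The channel weighting step mentioned above (global max pooling followed by a sigmoid) can be illustrated with a minimal sketch. It assumes that the gate for each channel is computed directly from the pooled value with no learned layers in between, which is a simplification; the names `global_max_pool` and `gate_channels` and the sample values are hypothetical.

```python
import math


def global_max_pool(feature_map):
    """Reduce a channels x rows x cols feature map to one maximum per channel."""
    return [max(max(row) for row in channel) for channel in feature_map]


def sigmoid(value):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def gate_channels(feature_map):
    """Scale every channel by the sigmoid of its globally pooled maximum."""
    gates = [sigmoid(p) for p in global_max_pool(feature_map)]
    return [[[g * v for v in row] for row in channel]
            for g, channel in zip(gates, feature_map)]


# Two hypothetical 2 x 2 channels: a weak response and a stronger one.
fm = [
    [[0.0, -1.0], [-2.0, 0.0]],
    [[1.0, 2.0], [0.5, 0.0]],
]

pooled = global_max_pool(fm)
gated = gate_channels(fm)
print(pooled)    # [0.0, 2.0]
print(gated[0])  # weak channel scaled by sigmoid(0.0) = 0.5
print(gated[1])  # strong channel scaled by sigmoid(2.0), about 0.88
```

Channels with a strong peak response keep most of their magnitude, while channels without one are attenuated, which is the sense in which the response strength of each channel is adjusted.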
Figure 6 shows the structure and information flow of the entire model. The upper half of the figure demonstrates how the multi-scale residual attention encoding mechanism is integrated into the feature pyramid to enhance feature extraction and target detection capabilities. The input image first passes through high-resolution convolution blocks (CBS) and then enters the multi-level CSP1_X modules for feature extraction. The feature map is further processed by the multi-scale residual attention encoding mechanism and then enters the feature pyramid network for final target detection. The lower half of the figure details the internal operations of certain modules in the upper half, including the multi-scale residual attention module, the CBS (convolution, batch normalization, and SiLU activation) modules for extracting high-level, abstract features of the image, the CSP (Cross Stage Partial Network) modules for processing features through partitioning and merging strategies to reduce computational cost and model parameters, and the SPPF (Spatial Pyramid Pooling-Fast) modules for further refining and fusing feature maps. The orange dashed lines indicate feature maps processed through 3 × 3 convolutions, which capture more local information and details. These feature maps are then processed through 1 × 1 convolutions (yellow dashed lines) to reduce dimensionality, simplify computation, and retain essential features. Finally, the processed features are passed to the detection head for target detection, as indicated by the blue dashed lines.
The arrows indicate the flow of data through the network layers, while the different colors represent various operations and modules. Beige represents the CBS (convolution, batch normalization, and SiLU activation) modules for high-resolution feature extraction. Orange, purple, and gray indicate different levels of the Cross Stage Partial (CSP) network (CSP1_1, CSP1_2, CSP1_3) for feature processing. Light blue represents the Spatial Pyramid Pooling-Fast (SPPF) module for refining feature maps. Light green shows processed feature maps at different scales (P3, P4, P5) for detection. Blue represents the YOLO loss layers for class, box, and objectness losses. Yellow represents 1 × 1 convolution layers for dimensionality reduction, and orange dashed lines indicate 3 × 3 convolution layers for capturing details. Blue dashed lines show the final steps to target detection. Black arrows show data flow, red arrows represent residual connections, and blue arrows lead to the YOLO loss calculation. These details help to understand the structure and functionality of the model.
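To relate the 640 × 640 input to the three detection scales, the sketch below computes the grid size at each scale and the grid cell responsible for a given box center. The strides of 8, 16, and 32 for P3, P4, and P5 are those of the standard YOLOv5 configuration and are assumed here rather than taken from the text; the function names are hypothetical.

```python
def grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Side length of the square prediction grid at each detection scale."""
    return {f"P{level}": input_size // stride
            for level, stride in zip((3, 4, 5), strides)}


def cell_of(cx, cy, stride):
    """Grid cell (column, row) containing a box center given in input-image pixels."""
    return int(cx // stride), int(cy // stride)


print(grid_sizes())          # {'P3': 80, 'P4': 40, 'P5': 20}
print(cell_of(100, 200, 8))  # (12, 25)
```

The finest grid (P3) has the smallest cells, which is why small stamped characters are mainly handled at that scale.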
5. Conclusions
This study developed an optimized stamped character recognition algorithm based on the YOLOv5 architecture. It incorporates an efficient multi-scale channel attention mechanism that reduces the resources spent on processing irrelevant information while significantly enhancing the weighting of key feature channels. A selectable clustering minimum iteration center module was also integrated to optimize the feature capture efficiency for small and irregular stamped characters. The test results show that, compared to existing methods, the model delivers superior overall performance in extracting features from fine or incomplete stamped characters, achieving an 8.3% increase in recognition accuracy, an 8% increase in recall rate, and a 46 ms reduction in detection time compared to the baseline YOLOv5 model. The model simplifies the network structure while enhancing recognition accuracy and processing speed. The next phase of research will explore how to strengthen the robustness of the model while reducing its parameters for more effective application in smart manufacturing.
Although the proposed method performs excellently in recognizing embossed characters on power transmission towers, its design and optimization are tailored for this specific application. For tasks such as handwritten character recognition, printed character recognition, or other industrial character recognition, the characters involved have different features and challenges. Therefore, the performance of this method in these scenarios may not be as outstanding as in power transmission tower character recognition. To improve generalizability, adjustments and optimizations are needed based on specific applications, such as training on datasets of handwritten characters or optimizing the model to handle higher character clarity. Nevertheless, the effectiveness of this method in other types of character recognition tasks still requires further validation and optimization.