Automatic Meter Reading from UAV Inspection Photos in the Substation by Combining YOLOv5s and DeeplabV3+

The combination of unmanned aerial vehicles (UAVs) and artificial intelligence is significant and has become a key topic in recent substation inspection applications, and meter reading is one of the challenging tasks. This paper proposes a method based on the combination of YOLOv5s object detection and Deeplabv3+ image segmentation to obtain meter readings through the post-processing of segmented images. Firstly, YOLOv5s was introduced to detect the meter dial area and classify the meter. The detected and classified images were then passed to the image segmentation algorithm. The backbone network of the Deeplabv3+ algorithm was replaced with the MobileNetv2 network, reducing the model size while ensuring the effective extraction of tick marks and pointers. To address inaccurate meter readings, the segmented pointer and scale areas were first eroded, and then the concentric circle sampling method was used to flatten the circular dial area into a rectangular area; several analog meter readings were calculated from the scale distances in the flattened area. The experimental results show that the mean average precision at an IoU threshold of 0.5 (mAP50) of the YOLOv5s model on this dataset reached 99.58%, with a single detection speed of 22.2 ms, and that the mean intersection over union (mIoU) of the image segmentation model reached 78.92%, 76.15%, 79.12%, 81.17%, and 75.73% for the five meter categories, respectively, with a single segmentation speed of 35.1 ms. At the same time, the effects of various commonly used detection and segmentation algorithms on meter reading recognition were compared. The results show that the method in this paper significantly improved the accuracy and practicability of substation meter reading detection in complex situations.


Introduction
Meter reading is an extremely important task that is widely used in real life [1]. Many meters [2][3][4] require periodic manual inspection and recording of readings to confirm that equipment is operating safely. Because meters vary in shape, scale, pointer, and characters, and because equipment such as high-voltage apparatus in substations is installed in diverse locations, regular manual inspection is difficult to carry out.
At present, traditional meters with scales and pointers are still commonly used in the substation environment. In the actual application environment, traditional manual reading not only introduces human error, but is also easily disturbed by the environment and carries the risk of electric shock. With the gradual promotion of unattended substations, inspection robots and UAVs equipped with automatic meter identification technology have been widely adopted. Therefore, realizing intelligent meter reading via computer vision methods has become a research hotspot.

The main contributions of this paper are as follows:
1. By combining UAV and deep learning vision technology, the problems of the low efficiency and high cost of traditional manual or robot inspection are solved;
2. The object detection algorithm YOLOv5s is introduced to improve the accuracy of detecting and classifying the meter dial area;
3. Deeplabv3+ is used for image segmentation, improving the detection accuracy of the pointer and the scale lines;
4. Based on the image segmentation results, a concentric circle sampling method is proposed to flatten the dial and realize the reading of the dial image.
This study is outlined as follows: Section 2 describes the YOLOv5 algorithm structure, the Deeplabv3+ algorithm structure, and the post-processing method for meter readings; Section 3 presents various comparative experiments and analyzes the experimental results; Section 4 concludes the study; and Section 5 puts forward an outlook for future work in view of the shortcomings of this research.

Meter Reading Recognition Based on Object Detection and Image Segmentation
According to the characteristics of UAV aerial images of substation meters and the shortcomings of traditional meter reading methods, this paper adopted an object detection algorithm and semantic segmentation technology based on deep learning, which can accurately locate the meter in the image together with its scale and pointer areas, thereby realizing the reading of substation equipment. Object detection was used to find the meter area in the image, generally the minimum enclosing rectangle that closely surrounds the meter target; image segmentation was then used to further segment the pixels of the meter pointer and scale areas within the meter image.
The idea of this paper was firstly to use YOLOv5s object detection [26,27] to detect the area where the meter target is located in the UAV aerial image and eliminate the interference of non-target areas; secondly, to accurately segment the scale and pointer positions in the cropped meter image with Deeplabv3+ image segmentation; and finally, to obtain the meter reading by post-processing. The processing flow of this paper is shown in Figure 1.
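The three-stage flow described above can be sketched as follows; every callable here is a hypothetical placeholder for the corresponding component (YOLOv5s detector, Deeplabv3+ segmenter, erosion, flattening, and reading), not the authors' actual code:

```python
import numpy as np

# Illustrative sketch of the pipeline in Figure 1. The five callables
# are placeholders: detect_fn stands in for the YOLOv5s model,
# segment_fn for Deeplabv3+, and the rest for the post-processing steps.

def read_meters(image, detect_fn, segment_fn, erode_fn, flatten_fn, read_fn):
    """Run the detect -> segment -> erode -> flatten -> read pipeline."""
    readings = []
    for (x1, y1, x2, y2), meter_type in detect_fn(image):
        dial = image[y1:y2, x1:x2]      # crop the detected dial region
        mask = segment_fn(dial)         # per-pixel scale/pointer labels
        mask = erode_fn(mask)           # strip small noise blobs
        flat = flatten_fn(mask)         # circular dial -> rectangle
        readings.append((meter_type, read_fn(flat, meter_type)))
    return readings
```

Each stage is replaceable independently, which is why the paper can swap backbones (e.g. MobileNetv2 for Xception) without touching the rest of the flow.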


The YOLO Model
YOLOv5 [28] is a single-stage object detection algorithm. According to the depth and width of the network, YOLOv5 has four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The depth of the network directly affects the detection accuracy and speed of the detector. The detection targets in this paper were meters in aerial images, which are small objects. On the premise of ensuring detection accuracy, the detection model was to be installed on an edge device, so the YOLOv5s version was used.

The YOLOv5s network consists of three parts and its network structure is shown in Figure 2. The backbone used mosaic data enhancement, splicing images through random scaling, random cropping, and random arrangement. The input image was imported into the Focus module for a slicing operation, and sample slices with a scale of 640 × 640 × 3 were spliced into 320 × 320 × 12. At the same time, a Cross Stage Partial Networks (CSP) [29] structure and Spatial Pyramid Pooling (SPP) [30] were introduced to realize convolution and pooling down-sampling for feature extraction. The next part was the Neck, consisting mainly of the Feature Pyramid Network (FPN) [31] and the Path Aggregation Network (PAN) [32]. The FPN layer transferred and fused the high-level strong semantic features through top-down up-sampling, while the PAN conveyed strong localization features from the bottom up, aggregating features from different backbone layers to different detection layers. The last part was the detection head, which output feature maps at three scales: 17 × 17, 20 × 20, and 23 × 23. In post-processing, the model generated multiple anchor boxes based on the object features and applied non-maximum suppression (NMS) [33]; if the confidence of an object being predicted as a given category was greater than the set threshold, it was retained, thus completing the object detection process.
YOLOv5s was used to detect the area of the image in which the meter was located. In order to improve the accuracy of the final meter reading, this paper firstly detected the dial area of the image through the YOLOv5s network model. Using the labeling software LabelImg to make the meter datasets, five different kinds of labels (bj, bjA, bjB, bjH, and bjL) were defined in the meter images; the type of meter represented by each label is shown in Figure 3. The training set was then imported into the YOLOv5s network model for training, generating the corresponding detection model weights.
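The confidence filtering and NMS step described above can be sketched in a minimal NumPy version (this is an illustrative sketch, not the actual YOLOv5 implementation; the threshold values are typical defaults, not the paper's settings):

```python
import numpy as np

# Minimal sketch of confidence filtering + non-maximum suppression (NMS).
# Boxes are [x1, y1, x2, y2]; thresholds are illustrative defaults.

def nms(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    keep_mask = scores >= conf_thresh          # drop low-confidence boxes
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]             # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # suppress overlapping boxes
    return boxes[keep], scores[keep]
```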

Deeplabv3+ Segmentation of Tick Marks and Pointers
Deeplabv3+ is a well-polished state-of-the-art segmentation model that has been widely used in many areas, such as remote sensing and medical image processing. This paper studied meter segmentation, in which the pointer and scale occupy a small proportion of the image and the segmentation requirements are relatively fine. Deeplabv3+ has a better segmentation effect on fine objects; therefore, this paper chose Deeplabv3+.
Deeplabv3+ adopts a spatial pyramid pooling module and an encoder-decoder structure for semantic segmentation. It takes the feature maps output by the backbone network and probes the incoming features with filters and pooling operations at multiple rates and multiple effective fields of view, encoding multi-scale features. The context information is enriched by encoding the semantic information, and the decoder part gradually recovers sharp target boundary information. In the meter images, because the pointer and the scale occupy a small proportion of the area, segmentation is more difficult. The module structure of Deeplabv3+ is shown in Figure 4. Instead of the Xception series used as the backbone feature extraction network in the original Deeplabv3+ paper, a different backbone network, MobileNetv2 [34], was used in this paper; it is more suitable for deployment on edge devices than Xception, with a more favorable parameter count and speed. The extracted feature maps went through the upgraded ASPP module: the feature map was first reduced in dimension through a 1 × 1 convolution kernel, then passed through three depthwise separable convolutions, and finally output through adaptive pooling. The compressed feature layer from the backbone was passed into the decoder part and, after being resized, was concatenated with the encoder output; the final result was obtained through two convolution kernels and up-sampling.
In this paper, Deeplabv3+ was used to segment the tick marks and the pointer positions of the dial area. The training set of the segmentation network adopted the images of the dial area detected by the YOLOv5s detection network. The cropped RGB image of the dial area was therefore transmitted to the Deeplabv3+ network, which output a segmentation map of the same size as the input image. Figure 5 shows an example of the tick training set; the upper image is the original image, and the lower image is the label image.

Erosion
The Deeplabv3+ segmentation removed most of the background and useless information, leaving only the scale and pointer outlines in the image; however, much noise still remained. In order to further eliminate interference, this paper performed morphological erosion on the segmented meter pointer and scale outlines to remove discrete point blocks. Assuming that the contour map point set is A and the structuring element is B, and that B moves in order over A, the eroded image is obtained from Formula (1):

A ⊖ B = { z | (B)_z ⊆ A } (1)

where (B)_z denotes B translated to position z. To remove the interference point blocks in the segmented image, the structuring element B used in this paper was designed as a 4 × 4 structure. The erosion operation eliminated the boundary points of objects, shrank boundaries inward, and removed objects smaller than the structuring element. The pointer and scale occupied few pixels in the segmentation map and were easily affected by noise; the effects of noise such as burrs and small bumps were removed by erosion. At the same time, objects connected only by small blocks were disconnected, improving the reading accuracy of the meter.
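As a sketch of this step, a naive binary erosion with a 4 × 4 structuring element can be written as follows (a real pipeline would normally call an optimized routine such as OpenCV's `cv2.erode`; this loop version only illustrates the rule that a pixel survives when its whole window is foreground):

```python
import numpy as np

# Naive binary erosion with a flat ksize x ksize structuring element.
# For simplicity the surviving value is written at the window's top-left
# corner rather than at a centred anchor.

def erode(mask, ksize=4):
    h, w = mask.shape
    out = np.zeros_like(mask)
    for y in range(h - ksize + 1):
        for x in range(w - ksize + 1):
            # keep the pixel only if the entire window is foreground
            if mask[y:y + ksize, x:x + ksize].all():
                out[y, x] = 1
    return out
```

Any foreground blob smaller than 4 × 4 pixels vanishes entirely, which is exactly how the burrs and small bumps described above are removed.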

The Flattening Method and Meter Readings
After semantic segmentation and erosion processing, the circular dial area of the pointer image was flattened into a rectangular area by the concentric circle sampling method. The length and geometric center of the two sides of the divided rectangular image were used as the diameter and the center of the initial concentric circle, respectively, and the initial rotation angle and the width and height of the flattened rectangular area were specified. The initial rotation angle was used to generate the initial sampling point of each concentric circle. The width corresponded to the number of times of sampling for each concentric circle, and the height corresponded to the number of sampled concentric circles. Starting from the initial sampling point of the concentric circles, the pixel values were uniformly sampled on the circumference of the concentric circles in a clockwise direction. Taking the center of the concentric circles as the center, and shortening the radius by one pixel unit, a new concentric circle was generated. The steps for sampling were repeated several times, and finally the flattened rectangular area corresponding to the circular dial area was obtained.
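The sampling procedure above can be sketched as follows, assuming the dial centre is the image centre and the radius shrinks by one pixel per row; parameter values such as `width=360` and the start angle are illustrative, not the paper's settings:

```python
import numpy as np

# Sketch of concentric circle sampling: each output row is one circle,
# sampled clockwise at `width` uniform angles, with the radius shrinking
# by one pixel unit per row. Parameters are illustrative assumptions.

def flatten_dial(mask, width=360, height=60, start_angle=-np.pi / 2):
    cy, cx = mask.shape[0] / 2.0, mask.shape[1] / 2.0   # dial centre
    radius = min(cy, cx) - 1                            # outermost circle
    flat = np.zeros((height, width), dtype=mask.dtype)
    for row in range(height):            # one concentric circle per row
        r = radius - row                 # shrink radius by one pixel unit
        for col in range(width):         # uniform clockwise samples
            theta = start_angle + 2 * np.pi * col / width
            y = int(round(cy + r * np.sin(theta)))
            x = int(round(cx + r * np.cos(theta)))
            if 0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]:
                flat[row, col] = mask[y, x]
    return flat
```

After flattening, the curved scale becomes a roughly horizontal band and the pointer becomes a roughly vertical streak, which is what makes the distance-based reading in the next step straightforward.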
The flattened rectangular area corresponded to the scale of the dial, and the midpoint coordinate of the line-segment was taken as the scale feature point for the scale, thereby forming a scale-center coordinate set that could represent the position of the scale. At the same time, the flattened image was scanned line by line from top to bottom, and the average value of the pointer pixel position was used as the coordinates of the pointer tip to indicate the pointer position.
Meter readings were calculated accurately from the flattened image, and the calculation formula is shown in (2):

R = (α / β) · µ (2)

where R represents the meter reading, α represents the distance from the initial scale point of the flattened image to the pointer, β represents the distance from the initial scale of the flattened image to the end scale, and µ represents the total range of the meter.
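Formula (2) translates directly into code; the coordinate arguments (pointer tip and start/end scale positions in the flattened image) use illustrative names:

```python
# Formula (2) as code: the reading R is the pointer's distance along the
# flattened scale (alpha) over the full scale length (beta), multiplied
# by the meter's total range (mu). Argument names are illustrative.

def meter_reading(pointer_x, start_scale_x, end_scale_x, full_range):
    alpha = pointer_x - start_scale_x   # start scale -> pointer distance
    beta = end_scale_x - start_scale_x  # start scale -> end scale distance
    return alpha / beta * full_range
```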

Evaluation Indicators
Due to the complex environment of the substation, UAV aerial images have a large field of view, which can cause missed and false detections of meters. Therefore, this paper used Precision and Recall to describe the performance of the meter detection model. The formulas for Precision and Recall are shown in (3) and (4):

Precision = TP / (TP + FP) (3)
Recall = TP / (TP + FN) (4)

where TP and FP represent the true and false positives, respectively, and FN represents the false negatives. To further evaluate the detection performance of the model, the AP (average precision) of each category was computed as the area under its precision-recall curve, and the mAP (mean average precision) was obtained from the AP values. The formulas for AP and mAP are shown in (5) and (6):

AP = ∫₀¹ P(R) dR (5)
mAP = (1/Q) Σ_{q=1}^{Q} AP_q (6)

In Formula (6), Q is the number of categories. The accuracy of the image semantic segmentation model is expressed by mIoU (mean intersection over union), and the calculation formula is shown in (7):

mIoU = (1/(K+1)) Σ_{i=0}^{K} p_ii / (Σ_{j=0}^{K} p_ij + Σ_{j=0}^{K} p_ji − p_ii) (7)

where K represents the number of label categories in the dataset; p_ii represents the number of pixels of category i predicted as i; p_ij represents the number of pixels of category i predicted as j; and p_ji represents the number of pixels of category j predicted as i.
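These metrics can be sketched as follows; the confusion-matrix convention (rows = ground truth, columns = prediction) is an assumption of this sketch:

```python
import numpy as np

# Precision/Recall from detection counts, and mIoU from a segmentation
# confusion matrix whose entry [i, j] counts pixels of true class i
# predicted as class j (this row/column convention is an assumption).

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

def mean_iou(confusion):
    tp = np.diag(confusion)                        # p_ii terms
    union = confusion.sum(1) + confusion.sum(0) - tp
    return (tp / union).mean()                     # average IoU over classes
```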

Data Acquisition and Transmission
Datasets were acquired using the company's self-produced drone nest, the developed platform scheduling software, and a DJI (Shenzhen DJI Sciences and Technologies Ltd., Shenzhen, China) Genie 4rtk (Real-Time Kinematic) UAV. The UAV integrates an HD video transmission system, a 360° rotating gimbal, and a 4K camera. The camera carried by the UAV stored all photos on an SD card. Transmission was carried out over a 4G/5G link or, when the UAV returned to the drone nest, over LAN. All collected data were sent to the control center.
This paper collected the meter images in the Harbin substation; filming hours were from 9:00 a.m. to 6:00 p.m. The shooting environment included different weather patterns, lighting, and time periods, and the flight collection followed the planned route. The flying height of the UAV matched the height of the meter being photographed, and the distance between the gimbal and the meter was about 1 to 1.5 m. The training and test sets were independent of each other. We collected a total of 1632 images across five categories: 979 images in the training set and 653 images in the test set; the number of each meter type is shown in Table 1.

YOLOv5s Detection Results
After the training was completed, the YOLOv5s detection model needed to be tested before the semantic segmentation task; the test results are shown in Figure 6. They show that, across different distances, angles, and weather conditions, the detection model could locate the meter target in the image and at the same time correctly identify the meter type. This proves that the model used in this paper had a certain generalization ability and that the training effect was ideal.

Deeplabv3+ Image Segmentation Results
After the Deeplabv3+ network was trained, the network model was tested. Figure 7 shows the test results for the tick marks and pointers after network segmentation. The upper layer is the original scene image and the lower layer is the corresponding segmented image. The test results show that the input image segmentation was basically correct, and the positions of the tick marks and the pointer could be accurately separated. The final image also needed to be eroded and flattened in order to correctly identify the dial reading.

Flattening Results
In the rectangular area, the scales were evenly arranged from left to right, the lower end of the pointer was close to the center of the dial, and the upper end was close to the scale. Figure 8 shows the result of flattening the image. The first coordinate in the scale-center coordinate set corresponded to the start scale position, and the last corresponded to the end scale position. The first distance, between the pointer tip coordinate and the start scale position, was calculated, as was the second distance, between the end scale position and the start scale position. The ratio of the first distance to the second distance was multiplied by the total range of the meter type to obtain the readings of several pointer meters in the substation.

Deeplabv3+ Image Segmentation Results
After the Deeplabv3+ network was trained, the network model was tested. Figure 7 shows the test results of the tick marks and the pointers of the model after network segmentation. The upper layer is the original scene image and the lower layer is the corresponding segmented image.

Deeplabv3+ Image Segmentation Results
After the Deeplabv3+ network was trained, the network model was tested. Figure 7 shows the test results of the tick marks and the pointers of the model after network segmentation. The upper layer is the original scene image and the lower layer is the corresponding segmented image. The test results show that the input image segmentation result was basically correct, and the position of the tick mark and the pointer could be accurately separated. The final image also needed to be corroded and flattened in order to correctly identify the dial reading.

Flattening Results
In the rectangular area, the scales were evenly arranged from left to right, the lower end of the pointer was close to the center of the dial, and the upper end was close to the scale. Figure 8 shows the result of flattening the image. The first scale center coordinate corresponded to the start-scale position, and the last scale center coordinate corresponded to the end-scale position. The first distance, between the pointer tip coordinate and the start-scale position, and the second distance, between the end-scale position and the start-scale position, were then calculated. The ratio of the first distance to the second distance, multiplied by the total range of that meter type, gave the readings of the pointer meters in the substation.
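The ratio computation described above reduces to a one-line formula; the column coordinates and the 1.6 range in the example below are hypothetical values, not figures from the paper:

```python
def meter_reading(pointer_x, start_x, end_x, full_range):
    """Reading = (pointer-to-start distance) / (start-to-end distance) * range.
    All coordinates are x-positions (columns) in the flattened rectangle."""
    return (pointer_x - start_x) / (end_x - start_x) * full_range

# e.g. pointer tip at column 150, scale spanning columns 100..300, range 1.6
print(meter_reading(150, 100, 300, 1.6))  # -> 0.4
```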

Comparative Experiments
In order to choose the algorithm version most suitable for the research in this paper, a comparison of the parameters, FLOPs, and running speed of each version of YOLOv5 was carried out. The experimental results are shown in Table 2. It can be seen from Table 2 that the YOLOv5s model had the fewest parameters and the fastest speed.
In order to verify the feasibility of the detection and segmentation algorithms used in this paper, this study compared and tested a variety of commonly used algorithms on the same dataset. The results of this comparison are shown in Tables 3 and 4. As can be seen in Table 3, this paper used the YOLOv5s model, with its faster detection speed, to detect the meter dial in the aerial images. The Deeplabv3+ model with MobileNetV2 as the backbone network was used to segment the pointer and the scale of the meter image, and the meter reading was obtained through the meter post-processing techniques. The YOLOv5s model detected a picture on the NVIDIA A100 GPU with an inference speed of 22 ms, which is significantly better than the YOLOv5m, YOLOv5L, YOLOv5X, YOLOv4 [35], and YOLOv3 [36] models. The size of the model was only 14.1 MB, and mAP50 reached 99.584% on this dataset. The good detection accuracy and fast detection speed are able to meet the daily meter image inspection requirements.
In summary, compared with other commonly used detection algorithms, YOLOv5s has the smallest model and the fastest inference speed under the premise of ensuring model accuracy. It is more suitable for real-time detection on edge devices deployed on UAVs.
As can be seen from Table 4, for the five types of meters, the mIoU of the method used in this paper reached 78.92%, 76.15%, 79.12%, 81.17%, and 75.73%, respectively. It is clearly better than the Deeplabv1 and Deeplabv2 models, though not as good as the original Deeplabv3+ model. However, the Deeplabv3+ model with MobileNetV2 as the backbone achieved a single-image segmentation speed of 35.1 ms on an NVIDIA GeForce 2080 GPU, which was significantly faster than the other three image segmentation models. The improved Deeplabv3+ model was only 11.1 MB and the parameters of the static model were only 2.8 MB. Furthermore, the model ran twice as fast, which greatly improved the segmentation speed. Table 4 shows that Xception65 is the most accurate backbone, and this study did not strictly require real-time detection; the method in this paper can therefore be applied to substation meter reading with either the Xception65 or the MobileNetv2 backbone network. However, the MobileNetv2 backbone detects faster, and we want the detection speed to be as fast as possible while remaining practically useful.
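The mIoU figures above are the per-class intersection-over-union averaged over classes; a minimal sketch of the metric (assuming integer-labelled prediction and ground-truth masks) is:

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean intersection-over-union over classes present in either mask."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both masks: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```

For a two-class toy example, class 0 with IoU 1/2 and class 1 with IoU 2/3 average to 7/12.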
Compared with the method of combining Faster R-CNN and U-Net in the literature [22], the method in this paper has the following advantages:

1. Compared with the Faster R-CNN algorithm, the YOLOv5s algorithm used in this paper has a significantly faster detection speed;
2. The Deeplabv3+ image segmentation algorithm is mainly used in industrial applications, whereas the U-Net image segmentation method is mainly used for medical image segmentation, so Deeplabv3+ is better suited to meter reading in industrial applications;
3. The post-processing methods in this paper, such as concentric circle sampling, were more robust than those of the industrial application in paper [22].

Figure 9 shows the meter reading results of this method, which are 1.2544, 0.4285, 0.3073, 0.0000, and 0.3977, respectively. It can be seen that the method in this paper could accurately read the value of the meter image, providing an effective and accurate reading method for UAV aerial photographs of pointer-type meters. Figure 10 shows the images of the five meters and Table 5 compares the manually measured values with the recognized values. The error is the absolute value of the manually measured value minus the recognized value, giving 0.0212, 0.0002, 0.0184, 0.0330, and 0.0017, respectively. The effectiveness of this method in the automatic identification and reading of substation meter images from UAV aerial photography is, therefore, proven.
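The concentric circle sampling that flattens the circular dial into a rectangle can be sketched with plain NumPy; the centre, radius range, and angular resolution below are illustrative assumptions, not the exact parameters of this work:

```python
import numpy as np

def flatten_dial(mask, center, r_min, r_max, n_angles=360):
    """Unwrap the annular dial region of a (binary) mask into a rectangle
    by sampling concentric circles: rows are radii, columns are angles."""
    h, w = mask.shape
    cy, cx = center
    radii = np.arange(r_min, r_max)
    thetas = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    out = np.zeros((len(radii), n_angles), dtype=mask.dtype)
    for i, r in enumerate(radii):
        ys = np.clip((cy + r * np.sin(thetas)).astype(int), 0, h - 1)
        xs = np.clip((cx + r * np.cos(thetas)).astype(int), 0, w - 1)
        out[i] = mask[ys, xs]  # nearest-neighbour sample along the circle
    return out
```

After unwrapping, the tick marks line up as columns and the pointer tip appears as a single column, which is what makes the left-to-right distance ratio in the reading formula well defined.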
The method proposed in this paper still has many aspects that need improvement. This work mainly realized reading detection for several types of pointer meters, providing technical support for the inspection of substation meters. However, many other types of meters have not yet been studied, and when the meter area is incomplete due to external conditions, the method will fail or produce large reading errors.

Conclusions
This paper has designed a method that combines YOLOv5s and Deeplabv3+ and has implemented substation meter detection and reading through a series of post-processing methods. The test results have proven that the method proposed in this paper could accurately read various meter types at different angles and under different conditions. The main contributions of this paper are as follows:

1. The use of UAVs flying designated routes at different times and under different weather conditions to collect 1632 images, including five different types of meters, for object detection model training;
2. The improvement of the backbone network of the Deeplabv3+ semantic segmentation network, which doubled the single-image inference speed of the segmentation algorithm and reduced the size of the model weights;
3. The use of erosion and the concentric circle sampling method to flatten images and realize meter panel reading.

The result has been an accurate reading of the meters while quickly detecting the meter area. In this paper, the inspection of substation instruments was combined with deep learning visual algorithms and mobile flying equipment. It is hoped that the work in this paper can provide some help for intelligent substation inspection.

Future Work
The main future work is to continue improving detection accuracy and speed, especially for different kinds of meters and more complex background conditions. At present, the intelligent inspection of meter readings has become an important development direction for the intelligent inspection of substations. In future research, we will add the detection and segmentation of other components in the substation environment, and combine other algorithms, such as object tracking and key point detection, to achieve state estimation and prediction, which will further improve the intelligence level of substation inspection in all aspects.