Article

OSO-YOLOv5: Automatic Extraction Method of Store Signboards in Street View Images Based on Multi-Dimensional Analysis

School of Surveying and Geosciences, Liaoning Technical University, Fuxin 123000, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2022, 11(9), 462; https://doi.org/10.3390/ijgi11090462
Submission received: 2 May 2022 / Revised: 23 July 2022 / Accepted: 25 August 2022 / Published: 28 August 2022

Abstract

To realize the construction of smart cities, the fine management of various street objects is very important. The management of object form, in particular, calls for standardization and precision. Store signboards are a tangible manifestation of urban culture. However, due to factors such as high spatial heterogeneity, interference from other ground objects, and occlusion, it is difficult to obtain accurate information from store signboards. In response to this problem, we propose the OSO-YOLOv5 network in this article. Based on the YOLOv5 network, we improve the C3 module in the backbone and propose an improved spatial pyramid pooling model. Finally, channel and spatial attention modules are added to the neck structure. Under the constraint of rectangular features, this method integrates location attention and topology reconstruction, realizes the automatic extraction of information from store signboards, improves computational efficiency, and effectively suppresses the effect of occlusion. Experiments were carried out on two self-labeled datasets. The quantitative analysis shows that the proposed model can achieve a high level of accuracy in the detection of store signboards. Compared with other mainstream object detection methods, the average precision (AP) is improved by 5.0–37.7%. More importantly, the related procedures have certain application potential in the field of smart city construction.

1. Introduction

Real-time and accurate information regarding urban infrastructure is an important prerequisite and key link in promoting the modern development of urban governance, and is also of great significance to the practice of the concept of “full-cycle management” of cities and the construction of smart cities [1]. Early information on urban infrastructure was mainly collected and summarized through field surveys. Although these data are relatively accurate, this approach is extremely time- and labor-intensive when investigating urban infrastructure at a large scale. As modern cities develop, new technologies and methods are urgently needed to update urban infrastructure information more rapidly. Street view images are an emerging remote sensing data source that is closely related to people’s livelihoods and covers a wide range of areas [2,3]. As a result, street view images have become an effective way to obtain information on urban infrastructure in a timely manner. Therefore, scholars have carried out research on extracting information about urban infrastructure using this emerging remote sensing data source [4,5,6].
Information from store signboards is one of the most important sources of information regarding urban infrastructure and has important value for urban appearance management, urban economic development analysis, and urban three-dimensional reconstruction [7]. However, to date, there have been few studies on store signboard information extraction using street view images. In this study, we therefore conducted research and analysis based on ground features that have characteristics similar to those of store signboards. As traditional object recognition approaches, the object-oriented methods use an image segmentation algorithm [8,9] to cluster pixels with similar characteristics into segmented area units, and then use an image classification algorithm [10,11] to extract the objects. However, the traditional methods rely on the spectral and textural information of the imagery for the segmentation and classification; that is, based on the homogeneity of the pixels, the image objects are aggregated from the bottom up [12,13]. Unfortunately, the spectral information of the imagery is easily affected by the imaging environment, weather conditions, seasons and shooting conditions [14,15], shadows [16,17], object color deformation [18], object overlap [19], and other problems. In addition to the above limitations, the main challenge in using store signboards as the identification target is that cities are developing rapidly and stores can be of many different sizes. Although store signboards are mainly of a regular rectangular shape, their specification, size, and color vary greatly, which leads to high spatial heterogeneity and makes it difficult to extract information from store signboards. At the technical level, obtaining image objects through segmentation and classification essentially does not consider high-level features such as the morphological information and semantic context information of the imagery, and pixel aggregation based solely on the spectral features does not make full use of the other features of street view images. As a result, the object units obtained by traditional methods often do not match people’s morphological cognition of the actual target objects, which leads to the failure of object-level classification. Therefore, it is of great practical significance and application value to carry out research into the accurate extraction of information from store signboards in urban areas.
To date, the task of interpreting store signboards from street view imagery has mainly been performed by manual visual interpretation. The reason for this is that humans can hierarchically perceive spatial features such as the shape and texture present in the imagery, and can perform hierarchical extraction of the obtained visual information, so as to achieve accurate extraction of the target information. Therefore, the key question is how to make full use of the rich features of street view images, simulate the human visual perception mechanism, and extract the regional units that are consistent with the actual targets (such as the store signboards we are concerned with in this article).
The rapid development of convolutional neural networks (CNNs) featuring data-driven decision making and self-iterative optimization has promoted a new round of research in image analysis and understanding [20], and has also led to the rapid development of target visual detection [21,22,23]. Representation learning through deep networks can perform multi-level abstraction in semantic image analysis, which means that it can far outperform traditional methods in remote sensing applications [24]. In the field of image processing, deep learning can be divided into semantic segmentation and object detection [25]. Multi-scale semantic feature extraction based on semantic segmentation and object detection has the characteristics of high efficiency and automation, which can improve the practicability of store signboard detection.
In the research on semantic segmentation, in order to solve the common problem of objects in street view imagery, i.e., the scale diversity, fully convolutional networks (FCNs) [26] are typically stacked with multiple convolutional layers and convergence layers [27,28]. For example, DeepLabv3 [29] introduces an atrous spatial pyramid pooling (ASPP) module and employs atrous convolution in cascade or in parallel to capture the multi-scale context at different atrous rates [30,31], and U-Net [32] concatenates feature maps at different levels [33,34]. However, with these pixel-based segmentation methods, the output often shows irregular edges and oversmoothed corners [35]. Previous work on billboard detection has treated the detection problem as a semantic segmentation problem [36,37], performing classification and localization at the image pixel level. However, from an application point of view, the annotation for image semantic segmentation is more time-consuming, which increases the difficulty of building large datasets. Another point is that the output of semantic segmentation is usually in the form of irregular graphics, while the geometric shape of store signboards is mainly rectangular. This is consistent with the output of object detection, so that object detection methods are more suitable for the task of store signboard detection.
In object detection research, many of the studied objects present problems similar to those of store signboards. For example, to address the problems of occlusions and small targets, Xu et al. [38] used a repulsive loss function to solve the target occlusion problem, and a scale-aware bidirectional sub-network was used to detect the targets at two scales. Perceptual fusion was then performed at the end of the inference. For the location of a target covered by a transparent window, Morera et al. [39] used a self-labeled dataset, and applied various data enhancement methods, combined with loss functions such as localization loss. Aiming at the problems of the low pixel ratio of foreground objects and the significant differences of the object scales, Wang et al. [40] proposed FSoD-Net, which integrates a Laplace kernel with fewer parallel multi-scale convolutional layers and incorporates three different isolated regression branch layers, improving the classification score of the predicted bounding boxes. A deep learning model can be used to achieve the classification and extraction of the different types of objects in the imagery, but it cannot achieve the effect of artificial visual interpretation, i.e., each type of object needs to be considered according to its specific visual characteristics [41,42,43,44]. Store signboards are usually concentrated in the middle of street view images, and are often occluded by trees. These occlusions fragment the target information and destroy the topological relationship. In addition, guideboards and taxi signboards in the scene can also interfere with the store signboards. Therefore, when building a model, a reasonable and effective identification method needs to be designed according to the characteristics of store signboards in street view images.
In this study, we built upon the most recent version of YOLO released by Ultralytics [45], often referred to as YOLOv5, to take advantage of its high detection accuracy and fast inference. We propose the OSO-YOLOv5 network, which integrates location attention and topology reconstruction under the constraint of rectangular features. Through extensive experiments, the feasibility and effectiveness of this method for store signboard extraction in various urban environments were verified.
The main contributions of this article can be summarized as follows:
(1) We add coordinate information before the convolutional layers in the backbone structure, which makes the method, to some extent, position-aware;
(2) We improve the C3 module in the backbone structure by enhancing the scale context and neighborhood context information, which gives the new C3 module rectangular feature constraints and a topology reconstruction function;
(3) We propose an improved spatial pyramid pooling model to reduce the computational load and improve the nonlinear learning ability of the model by cascading small-size kernel pools. In addition, we add channel and spatial attention modules to address the weakening of features during the feature delivery process.
The rest of this article is organized as follows: Section 2 presents the details of the proposed method. The experimental results and analysis are presented in Section 3. Finally, our conclusions are drawn in Section 4.

2. Methods

In the complex and changeable urban environment, under different construction scales and planning forms, store signboards can present very different morphological characteristics. The visual morphological information is the main feature of store signboard information extraction in street view images. Therefore, according to the features of store signboards in street view images, we introduce an information extraction method for store signboards, which benefits from the multi-level feature abstraction ability and high speed of the YOLOv5 network.

2.1. Visual Feature Analysis

Due to the differences in street environments and store scales, store signboards present different visual features in street view images. Therefore, based on the geometric features, textural features, and location features of the store signboards in street view images, they can be divided into three types: (1) store signboards with no occlusion; (2) store signboards with a partial degree of occlusion; and (3) store signboards with a high degree of occlusion. The characteristics of the different types of store signboards are summarized in Table 1.
In addition, Figure 1 shows street view images with various types of store signboards, and explains their visual characteristics: (1) Store signboards with no occlusion usually have clear boundaries, a relatively uniform internal texture, and are usually spatially distributed and concentrated in the middle of the images. They also present a clear rectangular morphology. (2) The store signboards with partial occlusion have incomplete boundary information, a rough texture, a rectangular shape that is relatively blurred, and are mostly concentrated in the middle of the images. (3) The store signboards with a high degree of occlusion have vague or basically non-existent boundary features, and the texture is disordered. However, the rectangular shape can be determined according to prior knowledge, and most of the signboards are concentrated in the middle of the images.
To sum up, in the proposed method, we consider the common features of store signboards—the rectangular shape and the centrally distributed position—and add rectangular feature constraints and coordinate perception to the network. Since the main interference factor in the automatic detection of store signboards is occlusion, we rely on the internal texture of store signs to identify them, and integrate a topology reconstruction module. We also consider the common problem of objects in street view images—multi-scale features—and introduce a multi-scale perception module.

2.2. Proposed Model

The model designed in this study includes two main parts: (1) The backbone, as shown in the blue box in Figure 2. Firstly, the module is reorganized according to the shape and spatial distribution of the store signboards, taking into account the constraints of rectangular features, location attention, and topology reconstruction. The CoordConv layer, C3_SC module, and C3 module are designed in order to improve the accuracy of the feature extraction. The improved spatial pyramid pooling (IFSPP) module is used to ensure scale-aware performance and improve the model generalization ability, without increasing the amount of model calculation. (2) The neck, as shown in the red box in Figure 2, is used to ensure the adequacy and accuracy of feature utilization while features with strong semantics are propagated and fused from high level to low level, and an attention module is added to it. The network architecture is robust to complex scenarios and is applicable to the research objects of this paper.

2.2.1. Improvements to the Backbone

The YOLOv5 backbone network is mainly composed of Focus, Conv, C3, and spatial pyramid pooling (SPP) modules. Among these modules, the Conv module (Figure 3a), as a key module for transforming between two spatial representations (Cartesian spatial coordinates (i, j) and pixel spatial coordinates), has a defect in that the specific position coordinates are not available in the input. The C3 module (Figure 3b) serves as a key module for the network to learn more features. Because its internal structure is not designed according to the features of store signboards and the imagery, only 3 × 3 kernel convolution is used to extract features, resulting in limited feature acquisition. Global/long-range dependencies benefit many computer vision tasks [46], but the limited receptive field of convolution hinders the modeling of such long-range interactions. The SPP module (Figure 3c) can greatly increase the receptive field and separate the most significant contextual features. However, for the rectangular objects in this study, an excessively large pooling window would not only increase the amount of computation, but could also incorporate irrelevant information. In view of the above problems, we carried out the following module design research.
(1) Location Attention
Store signboards are mostly concentrated in the middle of street view images. As a key module for transforming the coordinate space, the standard convolution structure has excellent properties, such as translation equivariance, local connection, and weight sharing. As the basic building block of different models, it has achieved great success in a variety of computer vision tasks. Some recent achievements in machine vision rely on stacking a large number of convolution structures. However, in the task of object coordinate modeling, general convolution cannot obtain specific position coordinate information, and cannot learn a smooth mapping between Cartesian space and pixel space [47].
In order to improve the effectiveness of the convolution and its location-aware performance, we introduce the CoordConv layer [47] to replace the convolutional layer in the backbone. That is, i-coordinate and j-coordinate channels are added to the input. The detailed structure of the CoordConv layer is shown in Figure 4. The CoordConv layer retains two of the original convolutional layer's properties: few parameters and high computational efficiency. It also gives the detection network access to where pixels lie within the image, improving the efficiency of producing output bounding boxes in the Cartesian coordinate system.
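As an illustration, a CoordConv layer of this kind can be sketched in PyTorch as follows (a minimal sketch based on the description above and on [47]; normalizing the coordinate channels to [-1, 1] is an implementation choice of ours, not taken from the paper):

```python
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    """Convolution whose input is augmented with i- and j-coordinate channels."""
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=0):
        super().__init__()
        # Two extra input channels hold the normalized row (i) and column (j) coordinates.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, stride, padding, bias=False)

    def forward(self, x):
        b, _, h, w = x.shape
        i_coords = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        j_coords = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, i_coords, j_coords], dim=1))
```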
(2) Rectangular Feature Constraint
Most store signboards are in the form of multi-scale rectangles, as shown in Figure 1. To perceive multi-scale rectangular features, more scale context is needed. Common ways to enlarge the scale context include dilated (atrous) convolution and global pooling. Anisotropic contextual information is widely present in real scenes, and for rectangular store signboards, using a square convolution kernel to capture this information would merge information from unrelated areas and lose flexibility.
To capture the scale context in the contextual information, we introduced improvements on the basis of the C3 module, and used a strip pooling (SP) module [48] to replace the 3 × 3 convolutional layer of the bottleneck submodule in the C3 module. As shown in Figure 5, a long kernel shape is first deployed along one spatial dimension. A narrow kernel shape is then maintained on the other spatial dimension. Such a structure can help to capture the long-range band structure features of store signboards and can prevent unrelated areas from interfering with the label prediction.
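A minimal PyTorch sketch of a strip pooling attention block in this spirit is given below (the 1D refinement kernel sizes and the sigmoid reweighting are our assumptions; the original SP module in [48] contains additional branches):

```python
import torch
import torch.nn as nn

class StripPooling(nn.Module):
    """Band-shaped pooling attention: horizontal and vertical strip contexts reweight the input."""
    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average each row over the width -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average each column over the height -> (B, C, 1, W)
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        band_h = self.conv_h(self.pool_h(x)).expand_as(x)  # horizontal strip context
        band_w = self.conv_w(self.pool_w(x)).expand_as(x)  # vertical strip context
        attn = torch.sigmoid(self.fuse(band_h + band_w))   # strip-context attention map
        return x * attn                                    # suppress responses from unrelated regions
```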
(3) Topological Reconstruction
As shown in Figure 1, store signboards are usually located in the middle of street view images and are easily occluded by trees, resulting in incomplete target information. Contextual information is critical to learn reliable and stable topologies, so we need to consider more contextual information from nearest neighbors. Self-attention is widely used in deep neural networks to capture the internal correlation of features and establish correct topological relations [49]. However, the traditional self-attention mechanism only undertakes information exchange in the spatial domain, and neglects the rich contextual information between close neighbors.
Focusing on the contextual information between neighbors, we introduced improvements on the basis of the C3 module, and used a contextual transformer (CoT) module [49] to replace the 3 × 3 convolutional layer of the bottleneck submodule in the C3 module. The CoT module (Figure 6) first encodes the input keys with 3 × 3 convolution to obtain the static contextual representation of the input, and then splices the coded keys with the input query. The dynamic multi-head attention matrix is learned by two consecutive 1 × 1 convolutions, and the attention matrix is multiplied by the input values to obtain the dynamic context representation of the input. The output of this module is the fusion result of static context expression and dynamic context expression. This module integrates contextual information mining and self-attention learning into a unified framework, explores the contextual information of close neighbors, establishes correct topological relations, and solves the problem of object fragmentation, to a certain extent.
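The following schematic PyTorch sketch follows the structure described above (a 3 × 3 convolution for the static context, two consecutive 1 × 1 convolutions for the attention, and fusion of static and dynamic context), but, for brevity, the dynamic aggregation is reduced to a per-position gating of the values rather than the full local matrix multiplication of the original CoT module [49]:

```python
import torch
import torch.nn as nn

class CoTBlock(nn.Module):
    """Simplified CoT-style block fusing static and dynamic contextual representations."""
    def __init__(self, channels):
        super().__init__()
        self.key_embed = nn.Conv2d(channels, channels, 3, padding=1, bias=False)   # static context
        self.value_embed = nn.Conv2d(channels, channels, 1, bias=False)
        self.attn = nn.Sequential(                                                  # two 1x1 convolutions
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1, bias=False),
        )

    def forward(self, x):
        k_static = self.key_embed(x)                                    # static contextual representation
        v = self.value_embed(x)
        a = torch.sigmoid(self.attn(torch.cat([k_static, x], dim=1)))   # attention from [static keys, query]
        k_dynamic = a * v                                               # simplified dynamic context
        return k_static + k_dynamic                                     # fuse static and dynamic context
```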
(4) Multi-Scale Information Extraction
Considering the multi-scale problem of the research object of this study, a pyramid squeeze attention (PSA) module [50] is introduced, as shown in Figure 7a. This module replaces the 3 × 3 convolution of the bottleneck in the C3 module, and the new C3 module is named the C3_P module. By using the multi-scale pyramid convolution structure to integrate the information of the input feature map, the spatial information of different scales can be extracted from each channel feature map. By extracting the attention weights of the multi-scale feature maps, cross-dimensional interaction is realized. On the one hand, the spatial information of different scales is captured to enrich the feature space. On the other hand, group convolution combined with channel attention helps to extract more hidden feature information. The bottleneck submodule in the C3_P module is designed as shown in Figure 8.
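A schematic sketch of a pyramid squeeze attention block of this kind is shown below (the kernel sizes, the use of plain rather than grouped convolutions, and the squeeze-and-excitation weight branch are simplifications of the module in [50]):

```python
import torch
import torch.nn as nn

class PSABlock(nn.Module):
    """Schematic pyramid squeeze attention: channel groups convolved at different kernel
    sizes, then reweighted by per-group channel attention normalized across the groups."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9), reduction=4):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        self.group_channels = channels // len(kernel_sizes)
        hidden = max(self.group_channels // reduction, 1)
        self.convs = nn.ModuleList([
            nn.Conv2d(self.group_channels, self.group_channels, k, padding=k // 2, bias=False)
            for k in kernel_sizes
        ])
        self.se = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(self.group_channels, hidden, 1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, self.group_channels, 1, bias=False),
            )
            for _ in kernel_sizes
        ])

    def forward(self, x):
        groups = torch.chunk(x, len(self.convs), dim=1)
        feats = [conv(g) for conv, g in zip(self.convs, groups)]            # multi-scale spatial features
        weights = torch.stack([se(f) for se, f in zip(self.se, feats)], 1)  # (B, groups, c, 1, 1)
        weights = torch.softmax(weights, dim=1)                             # cross-group attention
        out = [f * w for f, w in zip(feats, weights.unbind(dim=1))]
        return torch.cat(out, dim=1)
```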
The SPP module further integrates a multi-scale receptive field. Considering the scale diversity and the rectangular features of the research objects, using a large pooling window would increase the computational load and could merge in irrelevant information. If we stack two pooling layers with 3 × 3 kernels, the effective receptive field is 5 × 5 pixels [51]. Similarly, a stack of four 3 × 3 pooling layers has an effective receptive field of 9 × 9 pixels, and a stack of six 3 × 3 pooling layers has an effective receptive field of 13 × 13 pixels. The computational cost of a 5 × 5 kernel is 25/9 ≈ 2.78 times that of a 3 × 3 kernel with the same number of filters. Therefore, in order to improve the scale-aware performance, a cascaded stack of small-size pooling kernels was designed, and the IFSPP module was proposed, as shown in Figure 9. In addition to reducing the computational complexity, small-size kernel pooling also introduces more nonlinearity (more layers of nonlinear functions), making the decision function more discriminative and serving as implicit regularization, further improving the model's generalization capability.
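Under one plausible reading of this design, the IFSPP module can be sketched as a cascade of 3 × 3 max-pooling layers whose intermediate outputs are concatenated, analogous to YOLOv5's SPP/SPPF structure (the channel-reduction factor and the exact taps below are our assumptions):

```python
import torch
import torch.nn as nn

class IFSPP(nn.Module):
    """Cascaded 3x3 max pooling; taps after two, four and six pools cover effective
    windows of 5x5, 9x9 and 13x13 pixels, respectively."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        hidden = in_channels // 2
        self.reduce = nn.Conv2d(in_channels, hidden, 1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.fuse = nn.Conv2d(hidden * 4, out_channels, 1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        p2 = self.pool(self.pool(x))    # effective 5x5 window
        p4 = self.pool(self.pool(p2))   # effective 9x9 window
        p6 = self.pool(self.pool(p4))   # effective 13x13 window
        return self.fuse(torch.cat([x, p2, p4, p6], dim=1))
```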
(5) Parallel Architecture Design
Depth increases the expressiveness of a network, helping to learn more and more abstract features; however, a deeper network is not without its drawbacks [52]. A deeper network leads to more sequential processing and higher latency, which makes it harder to parallelize and less suitable for applications that require a fast response. In the proposed approach, a parallel sub-network was designed to reduce the depth and maintain high performance, considering the geometric features and positional structure of store signboards. We designed a parallel bottleneck architecture at the shallow feature graph, which is in the first C3 module, and named this module the C3_SC module. This location feature map has a higher resolution, so that more and clearer textural information can be obtained. As a result, more useful feature information can be extracted efficiently and pertinently, and a stable topological relation can be constructed. The bottleneck submodule in the C3_SC module is designed as shown in Figure 10.

2.2.2. Improvements to the Neck

A basic way to obtain reliable features is to design the backbone structure according to the features of the research target. In addition, we still need to ensure the adequacy and accuracy of feature utilization. The neck structure fuses semantic features from high level to low level and generates new feature maps. However, in this process, important information may be lost. In view of this, we conducted the following research. Store signboards are multi-scale targets, and accurate multi-scale fused features are a prerequisite for obtaining accurate detection results. Studies have shown that channel attention, spatial attention, or both can significantly improve the feature representation ability [53,54]. Based on the research into attention mechanisms by Woo et al. [55], we combined the spatial and channel attention of the feature map and recalibrated it to improve the current pyramid detection. Therefore, we integrate an attention module at each stage of the path aggregation structure to enhance the pyramidal features by weighting the fused feature map. The attention module first models the feature dependence of each layer of the pyramid feature map, further learns a feature importance vector, recalibrates the feature map, and emphasizes the useful features. The attention module consists of two main parts: (1) a channel attention block (CAB); and (2) a spatial attention block (SAB) (Figure 11).
(1) Channel Attention Block
The CAB focuses on enhancing the features along the channel of each pyramid level. It first explicitly models the dependency of the features along the channel and learns a channel-specific descriptor through the squeeze-and-excitation method. It then emphasizes the useful channels for more efficient global information expression of the feature maps in each pyramid level. The CAB can be expressed as follows:
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)
The input is a feature F of size H × W × C. We first carry out global average pooling and global max pooling over the spatial dimensions, obtaining two channel descriptors of size 1 × 1 × C. These are then fed into a shared two-layer neural network, in which the first layer has C/r neurons with a ReLU activation and the second layer has C neurons. The weight coefficient M_c is then obtained by adding the two outputs and applying the sigmoid activation function σ. Finally, the new feature is obtained by multiplying the weight coefficient with the original feature F.
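A minimal PyTorch sketch of this channel attention block, following the formula above, is given below (the reduction ratio r = 16 is an assumed default; the paper only states C/r):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBlock(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), applied as channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Shared two-layer MLP implemented with 1x1 convolutions (C -> C/r -> C).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # 1 x 1 x C descriptor from average pooling
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))   # 1 x 1 x C descriptor from max pooling
        m_c = torch.sigmoid(avg + mx)                # channel weight coefficient M_c
        return x * m_c
```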
(2) Spatial Attention Block
Similar to the CAB, the spatial attention block (SAB) enhances the features along with the spatial location of each pyramid level, which emphasizes the effective pixels and suppresses the ineffective or low-effect pixels. The SAB process can be expressed as follows:
M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F^{S}_{avg}; F^{S}_{max}])\big)
Given a feature F of size H × W × C, we first apply average pooling and max pooling along the channel dimension to obtain two spatial descriptors of size H × W × 1, and concatenate them. A 7 × 7 convolutional layer followed by the sigmoid activation function σ then produces the weight coefficient M_s. Finally, the new feature is obtained by multiplying the weight coefficient with the feature F.
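A corresponding sketch of the spatial attention block, again following the formula above, is:

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """M_s(F) = sigmoid(conv7x7([avg-pool; max-pool] along channels)), applied as spatial weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                            # H x W x 1 average over channels
        mx, _ = x.max(dim=1, keepdim=True)                           # H x W x 1 maximum over channels
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial weight coefficient M_s
        return x * m_s
```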
To sum up, the focus of the channel attention module is to learn the channel information from the combination of heterogeneous features, and the focus of the spatial attention module is to learn the spatial information from the combination of features. This attention-based feature enhancement can automatically explore the importance of different levels of features, effectively integrate heterogeneous regions in the channel dimensions, and further recalibrate the importance of each pixel position in the spatial dimensions to get closer to the area of interest. Therefore, it prevents misjudgment caused by feature blur, and improves the ability of the detection module to distinguish store signboards and background.

3. Results

3.1. Experimental Setup

3.1.1. Dataset Production

To the best of the authors’ knowledge, there are no public datasets for store signboard detection. Even the Chinese Text in the Wild (CTW) dataset [56] has no instances of store signboard labeling. In order to verify the performance and robustness of the proposed algorithm, we annotated two datasets to evaluate the method. The selected street view images came from two data sources: CTW [56] and Baidu. The CTW street view images are mainly composed of Tsinghua Tencent 100k and Tencent street view image data. Baidu street view is mainly based on image data crawled from the internet. The street view images of the two data sources cover large areas in different regions with different image resolutions, and both datasets contain scenes with multiple lighting conditions, multiple degrees of occlusion, and target overlap.
In order to train a more robust model using small datasets obtained from complex environmental conditions, the label what you see (LWYS) method [57] was used for the labeling, and only targets imaged in frontal view were considered, so as to reduce instances of missed detection and false detection. We selected 647 CTW street view images. The store signboards were divided into three categories according to the standards listed in Section 2.1. A total of 2995 targets were marked using LWYS, with the proportion of no-occlusion, partial-occlusion, and heavy-occlusion store signboards being 1:2:2. There is no need to consider classification in the actual detection task, but multi-class training can separate the store signboards and the background to the greatest extent. Therefore, when we evaluate the accuracy, we regard the detected targets as one class according to the actual task requirements, i.e., store signboards.
The generation of a deep-learning-based target detection model relies on training with a large volume of image data. Therefore, the 647 CTW street view images needed to be augmented, and each original image was augmented into four images. The image enhancement methods included brightness enhancement and reduction, horizontal mirroring, zooming, translation, and so on. In addition, in the process of image acquisition, the shaking of the acquisition device or tree branches can cause image blurring, so we added a Gaussian filter to simulate this kind of blurring. In order to increase the variability of the input images and make the model more robust to images obtained in different environments, 241 of the CTW street view images were cropped and trained together with the original images.
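For illustration, the image-level part of such an augmentation pipeline could be expressed with torchvision transforms as follows (the parameter values are illustrative, and in practice the bounding-box annotations must be transformed jointly with the images, which this image-only sketch omits):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.4),                                      # brightness enhancement/reduction
    T.RandomHorizontalFlip(p=0.5),                                      # horizontal mirroring
    T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.8, 1.2)),  # translation and zooming
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),                    # simulated blur from camera shake or branches
])
```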
We used the labeled CTW image data to carry out the training. After the data enhancement, the sample library was made up of 3637 street view images, including 13,244 store signboard objects. The image sizes included 2048 × 2048 and 1200 × 1024, and the ratio of the training set, verification set, and test set was 8:1:1. In order to verify the generalization ability of the model, 300 Baidu street view images of different sizes were used to evaluate the accuracy of the model, including 1643 store signboards.

3.1.2. Network Training

The experiments were conducted on a Dell computer with a Core(TM) i7-9700 CPU @ 3.00 GHz, 28 GB RAM, and an NVIDIA GeForce RTX 2060 GPU. The PyTorch deep learning framework was run in Windows 10, and the program code was written in Python language. CUDA, CuDNN, OpenCV, and other libraries were used to train and test the target recognition models.
As the input size required by the YOLOv5 network is fixed, we resized the images to a uniform size of 640 × 640. The batch size for the model training was set to 2, and batch normalization (BN) was applied at each weight update. The momentum was set to 0.937 and the weight decay rate was set to 0.0005. The number of training epochs was set to 500. After the training, the weight file of the recognition model was saved, and the performance of the model was evaluated using the test set.
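As a concrete illustration, the reported settings correspond to an optimizer configuration along the following lines (a minimal sketch; the placeholder model and the initial learning rate are our assumptions, since the paper does not state the learning rate):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder standing in for the OSO-YOLOv5 network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # assumed initial learning rate (not reported in the paper)
    momentum=0.937,       # momentum reported in Section 3.1.2
    weight_decay=0.0005,  # weight decay reported in Section 3.1.2
)
# Inputs are resized to 640 x 640, the batch size is 2, and training runs for 500 epochs.
```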

3.1.3. Evaluation Indices

In this article, the model's performance on the different datasets is evaluated using the precision, recall, intersection over union (IoU), F1-score, and average precision (AP). The formulas for these indices follow. The IoU is defined as the area of the intersection of the prediction and the ground truth divided by the area of their union. The AP is the average of the AP values at 10 IoU thresholds, from 0.50 to 0.95 with a step size of 0.05.
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{IoU} = \frac{ROI_P \cap ROI_G}{ROI_P \cup ROI_G} = \frac{TP}{TP + FP + FN}
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\mathrm{AP} = \frac{AP_{0.50} + AP_{0.55} + \cdots + AP_{0.95}}{10}
where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative pixels, respectively. The F1-score is used to trade off the recall and precision to show the overall performance of the trained model, and the AP is used to show the overall performance of the model under different confidence thresholds.
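These indices can be computed directly from the counts, as in the following sketch:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, IoU and F1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, iou, f1

def average_precision(ap_per_threshold):
    """Mean of the AP values at the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    assert len(ap_per_threshold) == 10
    return sum(ap_per_threshold) / len(ap_per_threshold)
```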
Based on the above evaluation indices, we propose a new evaluation criterion that judges whether the current prediction frame meets practical detection requirements by constraining the minimum overlap between the prediction and the ground truth. In the actual store signboard detection task, calculating the accuracy according to the existing target detection evaluation standard, i.e., according to the overlapping pixels without any constraint, is not consistent with the requirements of practical applications. For example, if the prediction frame overlaps the ground truth by only 50–60% [58], it should not be categorized as correct in actual work. Therefore, we determine whether the current prediction frame is correct by constraining the overlapping area between the prediction frame and the ground truth, in accordance with the relevant departmental regulations.
S_{union} = S_{pre} \cap S_{true}
\frac{S_{union}}{S_{pre}} > t
\frac{S_{union}}{S_{true}} > t
where Sunion is the overlap area, Spre is the prediction area, Strue is the ground-truth area, and t is the threshold of two ratios (one is the overlap area to prediction area, and the other is the overlap area to ground-truth area). We set the threshold value to 0.8, which was used to determine the current prediction frame as correct. Such an evaluation criterion guarantees the accuracy of detecting store signboards on the one hand, and fully considers the similarity of the prediction frame to the true value on the other hand, which is more reasonable in principle.
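Under this criterion, a prediction is accepted only if the overlap covers more than t of both the predicted box and the ground-truth box, as in the following sketch (boxes are assumed to be axis-aligned and given as (x1, y1, x2, y2)):

```python
def is_correct_prediction(pred_box, gt_box, t=0.8):
    """Return True if the overlap area exceeds the fraction t of BOTH boxes."""
    x1 = max(pred_box[0], gt_box[0])
    y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2])
    y2 = min(pred_box[3], gt_box[3])
    s_union = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # S_union: overlap of prediction and ground truth
    s_pre = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    s_true = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return s_union / s_pre > t and s_union / s_true > t
```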

3.1.4. Comparison Method Selection

To demonstrate the effectiveness of the proposed method for the detection of store signboards, we evaluated the method on the self-labeled CTW dataset and the self-labeled Baidu street view dataset, and compared it quantitatively with the methods used in [39], EfficientDet_d0 [59], Faster R-CNN_resnet50 [60], YOLOX_s [61], YOLOv4 [62], and YOLOv5_l. The study by Yuan et al. [56] was aimed at the text on store signboards, which is not consistent with our research subject. The study by Liu et al. [63] was aimed at illegal billboards on driveways and in the middle of the sidewalk, so there is, again, an essential difference between the two tasks. Therefore, the methods proposed in these two papers could not be used as comparative experimental methods. In the study by Morera et al. [39], SSD [51] and YOLOv3 [64] were used as comparison methods, so these two algorithms were selected for the experimental comparison in this article. The comparison methods included both one-stage classifiers and two-stage classifiers, as well as several versions of the you only look once (YOLO) family. YOLOv5_l is the basis of the improved model proposed in this paper.

3.2. Experimental Results and Analysis

3.2.1. Comparative Experimental Results and Analysis

The quantitative evaluation results are listed in Table 2 and Table 3. The best results are marked in bold and the second-best results are underlined.
Table 2 shows that the proposed method achieves the highest scores for all five metrics on the self-labeled CTW dataset, indicating that the proposed algorithm can maintain a balance between precision and recall. The corresponding precision and recall scores are 82.7% and 87.6%, respectively, indicating that 82.7% of the detected elements are store signboards in the test images, and 87.6% of all the store signboard elements in the images are correctly detected. The F1-score of the proposed method is 82.4%, which is 6.0–33.8% higher than the other methods, indicating the superiority of the proposed method. The IoU of the proposed method is 78.1%, which is 16.3% higher than the second-ranked YOLOv5_l. The quantitative results show that the methods of YOLOv3 and SSD described in [39] cannot complete the detection task of this article. Furthermore, our improved method improves the recall rate by 7.5% and the precision by 1.9%, compared with the original YOLOv5_l, indicating that more store signboards were successfully detected in the results of our method than in the original YOLOv5_l method.
To further test the performance of the proposed model under various data sources, we also conducted experiments on the Baidu street view dataset. The annotated Baidu dataset contained 300 street view images and, as with the CTW dataset, it also contained scenes of different complexities, with the difference being that the image sizes varied. From Table 3, it is clear that the proposed method ranks first in terms of precision, recall, F1-score, IoU, and AP. The AP score of the proposed method is 16.4–30.0% higher than that of the other methods, and the IoU is 10.7–34.6% higher than that of the other methods. The overall YOLOv5_l score is low, but the accuracy of the YOLOv3 and SSD methods used in [39] also decreases in the new dataset, indicating that we have improved the generalization ability of the model.
Figure 12 shows four representative visual results for the tested methods on the self-labeled CTW dataset, including scenes with different lighting conditions and different occlusion levels. It took about 0.22 s to infer 2048 × 2048 pixels for each test image using the proposed OSO-YOLOv5. The whole figure has nine rows and four columns. The columns correspond to four representative samples from the test dataset and their detection results from the different methods. The rows contain the labeled images and the prediction maps generated by the eight tested methods. The yellow box in each of these images indicates a local zoomed-in area.
As can be seen in Figure 12, the biggest difficulty for automatic detection in the CTW dataset is that the store signboards in the images are obscured by trees, resulting in blurred or even unobservable boundaries. Compared with the other tested methods, the proposed algorithm is less affected. In Figure 12, the first column of plots shows the large-scale store signboards without occlusion, and the second to fourth columns show the store signboards with increasing degrees of occlusion, i.e., they are more and more difficult for automatic extraction. For the unobstructed large-scale store signboards shown in the first column of the figure, our method, the two methods used in [39], and the other four methods can obtain complete extraction results, except for YOLOv5. However, these methods perform differently in the case of occlusion. As can be seen from the second column plot, only the proposed method, EfficientDet, and YOLOX can extract the small-scale, partially occluded store signboards, but the boundaries extracted by both EfficientDet and YOLOX are not very accurate. For the severely occluded store signboards shown in the third and fourth column plots, the extracted results of the proposed method are more consistent with the actual store signboard areas in the images, compared with the other tested algorithms, indicating that the OSO-YOLOv5 store signboard extraction network proposed in this paper can effectively improve the detection accuracy and eliminate the influence of occlusion on store signboard extraction. The visualization results show that the methods used in [39] are good for the detection of bus stop signs, but not so good for the store signboards studied in this article. Overall, we can conclude that the proposed method can perceive multi-scale information, deduce the occluded parts through the contextual information, and obtain a higher detection accuracy.

3.2.2. Results and Analysis for the Ablation Experiments

The comparative experiments demonstrated the effectiveness of the proposed method under complex conditions such as store signboards under shadow/tree occlusion conditions and multi-scale store signboards. Compared with the original YOLOv5_l algorithm, the proposed method improves the F1-score by 6.0% and the IoU by 16.3% on the CTW dataset. Since the designed framework is a multitasking framework, the optimization of individual modules does not necessarily guarantee improvement of the model’s performance. Therefore, we conducted extensive ablation experiments to investigate the impact of each module of the proposed method. We also show the improvements resulting from the combination of all the modules to demonstrate that these combined components are all complementary.
The dataset and evaluation indices selected for the experiments described in this section are the same as in Section 3.2.1. The quantitative evaluation results are shown in Figure 13.
As can be seen from Figure 13a,b, each line represents one evaluation metric, and the precision, recall, F1-score, and IoU generally show an increasing trend with the addition of the different modules. From the vertical observation in the figure, we can also clearly see that the method integrating all the modules obtains the highest scores in all four metrics. Accordingly, we can conclude that the integration of the different modules results in a certain improvement in the detection accuracy, but the combination of all the modules leads to the highest detection accuracy. The introduction of a single module leads to different degrees of improvement in the precision or recall values. These findings are sufficient to show that the different modules improve the extraction of multi-scale rectangular features in the middle of the images and solve the problem of target fragmentation, to some extent. These results also show that the components are complementary and improve the performance of the original YOLOv5_l from different perspectives.

4. Conclusions

With the many elements, complex structures, and dynamic changes in urban systems, fast and accurate extraction of store signboards from street view images has become a key research point for the intelligent management of urban components. Inaccurate store signboard information extraction is the key problem of mainstream target detection methods. In this study, we fully considered the characteristics of each type of store signboard, avoided the mutual interference between the different types of store signboards, made the sample production and extraction model training much less difficult, and then proposed a rectangular feature constraint, integrating location attention and topology reconstruction in the OSO-YOLOv5 network, to solve the problem of the intelligent detection of store signboards in real complex scenes.
The framework transforms the traditional multi-step workflow into an improved end-to-end deep learning architecture for the semi-automated supervision of urban components. The overall workflow of the proposed method is based on the original YOLOv5 backbone + neck structure. The improved backbone is used to improve the perception of multi-scale rectangular features and coordinate perception, and it also enhances the topological relationship construction. The improved neck is used to enhance the feature expression capability. In the context of smart city construction, we believe that this method will have great potential for application in mapping and updating scenarios. We also believe that this research breaks the technical bottleneck of urban information detection, and could be used to promote the high-quality development of cities. Specifically, this study could be extended to the detection of illegal billboards. We also aim to do more work on the matching of place names and addresses. The signboards detected in this research can be considered as POI data, and when combined with OCR technology, this information could be used to assign the corresponding attributes of the store name to each point of interest to realize the matching of place names and addresses.
We screened the street view images on the public CTW dataset, trained the model, and then tested the generalization ability of the model with street view images from the Baidu data source. The quantitative evaluation results obtained on both datasets showed that the proposed method has superior accuracy. The proposed method also showed higher localization accuracy in the case of a concentrated distribution and compact arrangement of targets. The method can also further improve the performance in complex cases where the targets have scale diversity or are occluded. There are indeed some limitations to this research, such as: (1) The current algorithm requires a relatively high-level device, so we need to develop a lightweight model that can be directly deployed on a mobile phone. (2) The algorithm needs to be further investigated in the case of a uniform background and a store signboard layout without boundaries, which is a typical “Chinese landscape”. (3) The current algorithm has a good effect in the detection of store signboards in frontal views, but the detection accuracy for store signboards in images taken at an angle is not satisfactory.
In our future work, we plan to improve the framework through the following aspects: (1) the introduction of lightweight models into the backbone or detection module; (2) further optimization of the localization learning, such as designing a new boundary-aware loss function; and (3) introducing more store information and establishing a knowledge graph, which will ultimately achieve an improved recognition accuracy by combining the object recognition techniques in computer vision with a knowledge graph.

Author Contributions

Funding acquisition, Jiguang Dai; investigation, Yue Gu; methodology, Jiguang Dai and Yue Gu; writing—original draft, Yue Gu; writing—review and editing, Jiguang Dai and Yue Gu. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 42071428 and 42071343, and the APC was funded by the National Natural Science Foundation of China, grant number 42071428.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available at https://ctwdataset.github.io/, accessed on 18 May 2020.

Acknowledgments

We would like to thank the source code provider of the network used in the comparative experiment. We would like to thank all the anonymous referees whose comments greatly strengthened this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, G. Research on the Measurement of the Construction Level and Development Strategy of Yiyang Smart City Based on Principal Component Analysis. In Proceedings of the 2020 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Vientiane, Laos, 11–12 January 2020; pp. 176–180. [Google Scholar]
  2. Anguelov, D.; Dulong, C.; Filip, D.; Frueh, C.; Lafon, S.; Lyon, R.; Ogale, A.; Vincent, L.; Weaver, J. Google Street View: Capturing the World at Street Level. Computer 2010, 43, 32–38. [Google Scholar] [CrossRef]
  3. Balali, V.; Golparvar-Fard, M. Segmentation and recognition of roadway assets from car-mounted camera video streams using a scalable non-parametric image parsing method. Autom. Constr. 2015, 49, 27–39. [Google Scholar] [CrossRef]
  4. Campbell, A.; Both, A.; Sun, Q. Detecting and mapping traffic signs from Google Street View images using deep learning and GIS. Comput. Environ. Urban Syst. 2019, 77, 101350. [Google Scholar] [CrossRef]
  5. Zünd, D.; Bettencourt, L. Street View Imaging for Automated Assessments of Urban Infrastructure and Services. In Urban Informatics; Shi, W., Goodchild, M., Batty, M., Kwan, M., Zhang, A., Eds.; Springer: Singapore, 2021; pp. 29–40. [Google Scholar]
  6. Luo, H.; Yang, Y.; Tong, B.; Wu, F.; Fan, B. Traffic Sign Recognition Using a Multi-Task Convolutional Neural Network. IEEE Trans. Intell. Transp. Syst. 2018, 19, 1100–1111. [Google Scholar] [CrossRef]
  7. Zhou, L.; Shi, Y.; Zheng, J. Business Circle Identification and Spatiotemporal Characteristics in the Main Urban Area of Yiwu City Based on POI and Night-Time Light Data. Remote Sens. 2021, 13, 5153. [Google Scholar] [CrossRef]
  8. Soilán, M.; Riveiro, B.; Martínez-Sánchez, J.; Arias, P. Traffic sign detection in MLS acquired point clouds for geometric and image-based semantic inventory. ISPRS J. Photogramm. Remote Sens. 2016, 114, 92–101. [Google Scholar] [CrossRef]
  9. Maboudi, M.; Amini, J.; Hahn, M.; Saati, M. Road Network Extraction from VHR Satellite Images Using Context Aware Object Feature Integration and Tensor Voting. Remote Sens. 2016, 8, 637. [Google Scholar] [CrossRef]
  10. Patil, M.; Desai, C.; Umrikar, B. Image Classification Tool for Land Use/Land Cover Analysis: A Comparative Study of Maximum Likelihood and Minimum Distance Method. Int. J. Geol. Earth Environ. Sci. (JGEE) 2012, 2, 189–196. [Google Scholar]
  11. Zhao, Q.; Jia, S.; Li, Y. Hyperspectral remote sensing image classification based on tighter random projection with minimal intra-class variance algorithm. Pattern Recognit. 2021, 111, 107635. [Google Scholar] [CrossRef]
  12. Yang, J.; Gao, W.; Duan, X.; Hu, Y. Extraction of Building Information Based on Object-oriented Feature Automatic Selection. Remote Sens. Inf. 2020, 36, 130–135. [Google Scholar]
  13. Chen, C.; Yu, J.; Ling, Q. Sparse attention block: Aggregating contextual information for object detection. Pattern Recognit. 2022, 124, 108418. [Google Scholar] [CrossRef]
  14. Shahryari, S.; Hamilton, C. Neural Network-POMDP-Based Traffic Sign Classification under Weather Conditions. In Proceedings of the 29th Canadian Conference on Artificial Intelligence, Canadian AI 2016, Victoria, BC, Canada, May 31–June 3 2016; pp. 122–127. [Google Scholar]
  15. Wali, S.B.; Abdullah, M.A.; Hannan, M.A.; Hussain, A.; Samad, S.A.; Ker, P.J.; Bin Mansor, M. Vision-Based Traffic Sign Detection and Recognition Systems: Current Trends and Challenges. Sensors 2019, 19, 2093. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Li, H.; Sun, F.; Liu, L.; Wang, L. A novel traffic sign detection method via color segmentation and robust shape matching. Neurocomputing 2015, 169, 77–88. [Google Scholar] [CrossRef]
  17. Fleyeh, H. Shadow and Highlight Invariant Colour Segmentation Algorithm for Traffic Signs. In Proceedings of the 2006 IEEE Conference on Cybernetics and Intelligent Systems, Taipei, Taiwan, 7–9 June 2006; pp. 1–7. [Google Scholar]
  18. Farhat, W.; Faiedh, H.; Souani, C.; Besbes, K. Real-time embedded system for traffic sign recognition based on ZedBoard. J. Real-Time Image Process. 2019, 16, 1813–1823. [Google Scholar] [CrossRef]
  19. Liu, C.; Li, S.; Chang, F.; Wang, Y. Machine Vision Based Traffic Sign Detection Methods: Review, Analyses and Perspectives. IEEE Access 2019, 7, 86578–86596. [Google Scholar] [CrossRef]
  20. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  21. Broni-Bediako, C.; Murata, Y.; Mormille, L.H.B.; Atsumi, M. Searching for CNN Architectures for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  22. Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object Detection in High Resolution Remote Sensing Imagery Based on Convolutional Neural Networks with Suitable Object Scale Features. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2104–2114. [Google Scholar] [CrossRef]
  23. Haq, M.A.; Rahaman, G.; Baral, P.; Ghosh, A. Deep Learning Based Supervised Image Classification Using UAV Images for Forest Areas Classification. J. Indian Soc. Remote Sens. 2021, 49, 601–606. [Google Scholar] [CrossRef]
  24. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the visual characteristics of store signboards.
Figure 2. Overview of the proposed framework.
Figure 3. Schematic diagram of the main modules in the YOLOv5 backbone: (a) the Conv module; (b) the C3 module; (c) the SPP module. (In the figure, k denotes the kernel size; the same applies below.)
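For reference, the sketch below follows the widely used YOLOv5-style implementation of the SPP module shown in Figure 3c: three parallel max-pooling branches (kernel sizes 5, 9, and 13, stride 1) are concatenated with an identity branch and fused by a 1 × 1 convolution. The channel widths are assumptions taken from the common open-source implementation, not values stated in this paper.

```python
# Minimal YOLOv5-style SPP sketch (illustrative assumptions, not the paper's exact module).
import torch
import torch.nn as nn


class SPP(nn.Module):
    def __init__(self, in_channels, out_channels, kernels=(5, 9, 13)):
        super().__init__()
        hidden = in_channels // 2  # assumed channel reduction before pooling
        self.cv1 = nn.Sequential(nn.Conv2d(in_channels, hidden, 1, bias=False),
                                 nn.BatchNorm2d(hidden), nn.SiLU())
        # Parallel max-pooling branches; stride 1 and padding k//2 preserve spatial size
        self.pools = nn.ModuleList([nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
                                    for k in kernels])
        self.cv2 = nn.Sequential(nn.Conv2d(hidden * (len(kernels) + 1), out_channels, 1, bias=False),
                                 nn.BatchNorm2d(out_channels), nn.SiLU())

    def forward(self, x):
        x = self.cv1(x)
        # Concatenate the identity branch with the pooled branches, then fuse with a 1x1 conv
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```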
Figure 4. Schematic diagram of a convolutional layer (a) and a CoordConv layer (b).
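Figure 4b contrasts a standard convolution with a CoordConv layer. As an illustration only, the following PyTorch sketch implements the CoordConv idea of Liu et al.: two coordinate channels normalized to [−1, 1] are concatenated to the feature map before an ordinary convolution. The class name CoordConv2d and its hyperparameters are assumptions, not the exact layer configuration used in OSO-YOLOv5.

```python
# Illustrative CoordConv sketch: append normalized x/y coordinate channels, then convolve.
import torch
import torch.nn as nn


class CoordConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=0):
        super().__init__()
        # +2 input channels for the added x- and y-coordinate maps
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size,
                              stride=stride, padding=padding)

    def forward(self, x):
        b, _, h, w = x.shape
        # Coordinate maps normalized to [-1, 1], broadcast to the batch and spatial size
        ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))


# Usage example: a 3x3 CoordConv over a 64-channel feature map
layer = CoordConv2d(64, 128, kernel_size=3, padding=1)
out = layer(torch.randn(2, 64, 80, 80))  # -> shape (2, 128, 80, 80)
```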
Figure 5. Schematic diagram of an SP module.
Figure 6. Schematic diagram of a CoT module.
Figure 7. Schematic diagram of the PSA module (a) and SPC module (b).
Figure 8. Schematic diagram of the bottleneck module in C3_P.
Figure 9. Schematic diagram of the IFSPP module.
Figure 10. Schematic diagram of the bottleneck submodule in the C3_SC module.
Figure 11. Schematic diagram of an attention block.
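Figure 11 summarizes the attention block added to the neck, which combines channel and spatial attention. The sketch below is a minimal CBAM-style block in PyTorch following the design of Woo et al.; the reduction ratio, spatial kernel size, and module order are illustrative assumptions and may differ from the block actually used in OSO-YOLOv5.

```python
# Minimal CBAM-style channel + spatial attention sketch (illustrative assumptions).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        # Shared MLP applied to global average- and max-pooled descriptors
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel-wise mean and max maps, concatenated and convolved into a spatial mask
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class AttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))  # channel attention first, then spatial attention
```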
Figure 12. Example detection results of the different methods on the self-labeled CTW dataset.
Figure 13. The effect of the different modules on the accuracy obtained on the self-labeled CTW dataset (a) and the self-labeled Baidu dataset (b). (A: the baseline YOLOv5_l method; B: with the CoordConv layer added; C: with the C3_SC and C3_P modules added; D: with the IFSPP module added; E: with the CBAM module added; F: the full method described in this article.)
Table 1. Characteristics of store signboards.
Area | Degree of Occlusion | Texture Feature | Morphology Feature | Location Feature
No occlusion | None | Uniform | Rectangles, clear distinct boundaries | Concentrated in the middle of the image
Occlusion | Partial | Rough | Relatively clear rectangles, incomplete boundaries | Concentrated in the middle of the image
Occlusion | High | Rough and complicated | No obvious boundaries | Concentrated in the middle of the image
Table 2. Results for the self-labeled CTW dataset.
Method | Precision (%) | Recall (%) | F1 (%) | IoU (%) | AP (%)
EfficientDet_d0 | 61.8 | 66.5 | 60.8 | 49.3 | 49.3
Faster R-CNN_resnet50 | 42.9 | 59.0 | 48.6 | 37.8 | 42.4
YOLOX_s | 66.9 | 84.4 | 72.5 | 62.3 | 61.3
YOLOv4 | 74.8 | 76.3 | 73.2 | 65.9 | 56.2
YOLOv5_l | 80.8 | 72.4 | 76.4 | 61.8 | 75.1
YOLOv3 | 75.4 | 66.2 | 70.5 | 61.3 | 53.7
SSD | 67.0 | 57.6 | 61.9 | 44.8 | 51.5
Proposed | 82.7 | 87.6 | 82.4 | 78.1 | 80.1
Table 3. Results for the self-labeled Baidu dataset.
Method | Precision (%) | Recall (%) | F1 (%) | IoU (%) | AP (%)
EfficientDet_d0 | 57.9 | 54.9 | 56.4 | 39.3 | 47.8
Faster R-CNN_resnet50 | 45.4 | 60.7 | 51.9 | 35.1 | 43.3
YOLOX_s | 59.0 | 59.1 | 59.0 | 41.5 | 49.0
YOLOv4 | 69.6 | 70.7 | 70.1 | 54.0 | 56.9
YOLOv5_l | 74.8 | 73.7 | 74.2 | 59.0 | 54.5
YOLOv3 | 64.7 | 59.7 | 62.1 | 45.0 | 50.7
SSD | 52.6 | 55.7 | 54.1 | 37.1 | 44.9
Proposed | 78.5 | 86.0 | 82.1 | 69.7 | 73.3
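The precision, recall, and F1 values reported in Tables 2 and 3 follow the standard object-detection definitions. The paper's evaluation script is not reproduced here; the sketch below is a minimal illustration under the common assumption that a detection counts as a true positive when its IoU with an as-yet-unmatched ground-truth box reaches 0.5. The helper names (iou, precision_recall_f1) are illustrative, and AP would additionally require ranking detections by confidence and integrating the precision-recall curve, which is omitted for brevity.

```python
# Minimal sketch of IoU-based precision/recall/F1 for one image (assumed 0.5 IoU threshold).
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def precision_recall_f1(detections, ground_truths, iou_thr=0.5):
    """Greedy one-to-one matching; detections are assumed pre-sorted by confidence."""
    matched, tp = set(), 0
    for det in detections:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp       # unmatched detections
    fn = len(ground_truths) - tp    # missed ground-truth boxes
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```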