Discriminative Semantic Feature Pyramid Network with Guided Anchoring for Logo Detection

Abstract: Recently, logo detection has received more and more attention for its wide applications in the multimedia field, such as intellectual property protection, product brand management, and logo duration monitoring. Unlike general object detection, logo detection is a challenging task, especially for small logo objects and large aspect ratio logo objects in real-world scenarios. In this paper, we propose a novel approach, named Discriminative Semantic Feature Pyramid Network with Guided Anchoring (DSFP-GA), which addresses these challenges by aggregating semantic information and generating anchor boxes with different aspect ratios. More specifically, our approach mainly consists of the Discriminative Semantic Feature Pyramid (DSFP) and Guided Anchoring (GA). Considering that the low-level feature maps used to detect small logo objects lack semantic information, we propose the DSFP, which enriches the discriminative semantic features of low-level feature maps and achieves better performance on small logo objects. Furthermore, preset anchor boxes are less efficient for detecting large aspect ratio logo objects. We therefore integrate the GA into our method to generate large aspect ratio anchor boxes and mitigate this issue. Extensive experimental results on four benchmarks demonstrate the effectiveness of our proposed DSFP-GA. Moreover, we conduct visual analysis and ablation studies to illustrate the advantage of our method in detecting small and large aspect ratio logo objects. The code and models can be found at https://github.com/Zhangbaisong/DSFP-GA.


I. INTRODUCTION
RESEARCH related to the logo field has been widely conducted in multimedia and beyond [1], [2], [3], [4], [5], [6], [7]. Logo detection is an important task because of its many applications, such as vehicle logo recognition for intelligent transportation [8] and protection of intellectual property [9] for commercial research.
Most current logo detectors directly adopt object detection methods, and thus are not tailored to the characteristics of logos. For example, many logo detection methods [10], [11] directly use feature maps extracted by CNNs. As a result, the semantic information of the low-level feature maps used for detecting small logo objects is insufficient, because the stride of low-level feature maps is small and the semantic information is not fully extracted. Moreover, existing models [12], [13], [14] use the preset anchor mechanism, and thus cannot effectively deal with logo objects of different aspect ratios, making it more difficult to detect large aspect ratio logo objects.
Logo detection shares with general object detection the challenge of small objects. In addition, logo detection has the unique challenge of large aspect ratio objects.
• Small logo objects are difficult to detect because the low-level feature maps used to detect them lack semantic information. Generally, low-level feature maps have more detailed information and high-level feature maps have more semantic information, and the two kinds of information are complementary. High-level feature maps have enough detailed information to detect large logo objects, but their stride is large, and a large stride causes small logo objects to be missed. Therefore, high-level feature maps are not suitable for detecting small logo objects. Low-level feature maps, however, contain less semantic information. Without enough semantic information, it is difficult to distinguish foreground from background on low-level feature maps, resulting in insufficient training on small logo objects. Lin et al. proposed the Feature Pyramid Network (FPN) [15] to generate pyramidal feature maps for object detection. As shown in Fig. 1 (a), it builds a feature pyramid by sequentially combining two adjacent layers via top-down and lateral connections. Although FPN is a simple and effective way to integrate semantic information, the top-down pathway does not fully integrate rich semantic information into low-level feature maps. Therefore, it is necessary to integrate the rich semantic information of multiple feature maps for small logo objects, and the key lies in the extraction and learning of discriminative features.
• Training with preset anchor boxes results in fewer positive samples and inaccurate features, and introduces a large number of negative-sample regions, which is disadvantageous for large aspect ratio logo objects. Existing methods do not provide a solution for large aspect ratio logo objects. Bao et al. applied Faster R-CNN to logo detection [16]. The Region Proposal Network (RPN) in Faster R-CNN mainly uses three kinds of anchor boxes with different proportions to obtain proposals. After that, similar detectors [17], [18], [19] that use the preset anchor mechanism were also applied to logo detection. These methods alleviate the impact of different aspect ratios on detection, but they are less efficient for detecting large aspect ratio logo objects. As shown in Fig. 2 (a), logo objects such as "napapijri" and "coffee beanery" are extremely wide, and logo objects such as "luciano soprani" and "simple human" are extremely tall. Such logo objects are very hard to detect. Moreover, large aspect ratio objects account for a large proportion of logo datasets. Fig. 2 (b) shows that about 35% of the objects have max/min values higher than 3, and more than 11.8% have max/min values greater than 5. Preset anchor boxes therefore make logo detection inefficient, and it is indispensable to address the issue of large aspect ratio objects in logo detection.

Fig. 2 (caption, partially recovered). (a) [...] (2) category "coffee beanery", max/min equals 7.9; (3) category "luciano soprani", max/min equals 7.7 (left box) and 6.5 (right box); (4) category "simple human", max/min equals 5.8 (left box) and 4.7 (right box). Blue boxes: ground-truth boxes. (b) Histogram of the number of boxes versus the ratio of the maximum dimension to the minimum dimension of the object on the LogoDet-3K dataset. Max/min values in the range 1-2.9 account for 65.1%, values in the range 3-4.9 account for 23.1%, and values greater than 5 account for 11.8%.
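To make the second challenge concrete, the following minimal sketch computes the max/min ratio used in Fig. 2 and the best IoU that a small set of preset anchors can reach against a very wide box. It is an illustration under simplifying assumptions, not the paper's code: anchors are perfectly centered on the object, share its area, and use width/height ratios of 0.5, 1, and 2; the function names and the example box size are hypothetical.

```python
import math

def max_min_ratio(width, height):
    """Ratio of the longer side to the shorter side of a box."""
    return max(width, height) / min(width, height)

def ratio_bucket(ratio):
    """Histogram bucket for a max/min ratio: below 3, from 3 to 5, 5 and above."""
    if ratio < 3.0:
        return "1-2.9"
    if ratio < 5.0:
        return "3-4.9"
    return "5+"

def centered_iou(box_a, box_b):
    """IoU of two (width, height) boxes that share the same center."""
    inter = min(box_a[0], box_b[0]) * min(box_a[1], box_b[1])
    union = box_a[0] * box_a[1] + box_b[0] * box_b[1] - inter
    return inter / union

def preset_anchors(area, ratios=(0.5, 1.0, 2.0)):
    """Equal-area anchors; each ratio is width / height (a convention assumed here)."""
    return [(math.sqrt(area * r), math.sqrt(area / r)) for r in ratios]

def best_preset_iou(gt_box, ratios=(0.5, 1.0, 2.0)):
    """Best IoU reachable by an equal-area, perfectly centered preset anchor."""
    area = gt_box[0] * gt_box[1]
    return max(centered_iou(anchor, gt_box) for anchor in preset_anchors(area, ratios))

# Hypothetical wide logo box with max/min = 7.9, as in the "coffee beanery" example.
wide_logo = (316.0, 40.0)
print(ratio_bucket(max_min_ratio(*wide_logo)))
print(round(best_preset_iou(wide_logo), 3))
```

Even in this idealized setting, the widest preset anchor overlaps the 7.9:1 box with an IoU of only about one third, which is below the IoU thresholds commonly used to assign positive samples.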
For the first challenge, since low-level feature maps contain less semantic information, the key is to enhance the semantic features of these feature maps. Therefore, we obtain discriminative semantic features by fusing high-level and middle-level feature maps into low-level feature maps. For the second challenge, preset anchor boxes are less effective for large aspect ratio logo objects. Hence, we adopt an anchor mechanism that can generate anchor boxes with different widths and heights. Based on the above considerations, we propose a novel logo detection method, Discriminative Semantic Feature Pyramid Network with Guided Anchoring (DSFP-GA), to address the issues of small logo objects and large aspect ratio logo objects. The DSFP-GA mainly consists of the Discriminative Semantic Feature Pyramid (DSFP) and Guided Anchoring (GA) [20]. Specifically, the DSFP fuses and enhances the semantic information of low-level feature maps, which makes it capable of detecting small logo objects. As shown in Fig. 1 (b), we rely on an architecture that combines feature maps via lateral connections and high-to-low aggregating pathways, which enriches the discriminative semantic features of low-level feature maps and achieves better performance on small logo objects. Furthermore, we adopt the GA in the region proposal part for detecting large aspect ratio logo objects; it generates large aspect ratio anchor boxes via an anchor location branch and an anchor shape branch. In addition, we incorporate the Complete IoU (CIoU) loss [21] into our method. The CIoU loss takes into account the central point distance, the overlap rate, the aspect ratio, and the penalty between anchor boxes and ground-truth boxes, which allows it to regress the localization of logo objects more accurately and thus improve the performance of logo detection.
The main contributions of this work can be summarized as follows:
• We propose a novel logo detection method, DSFP-GA, which enriches discriminative semantic features and generates large aspect ratio anchor boxes to simultaneously address the issues of small logo objects and large aspect ratio logo objects.
• We design the DSFP to obtain discriminative semantic features for logo detection. It can be inserted into any detection model for discriminative semantic feature representation, and can then achieve better performance on small object detection.
• Extensive evaluations demonstrate the effectiveness of the proposed DSFP-GA model over a wide range of state-of-the-art detection models on four logo datasets: LogoDet-3K [22], LogoDet-3K-1000 [22], QMUL-OpenLogo [23], and FlickrLogos-32 [24].
The remainder of this paper is organized as follows. The related work on object detection and logo detection algorithms is described in Section II. We describe the detailed framework design in Section III. Experimental results and analysis are reported in Section IV. Finally, we conclude the paper and propose our future research on logo detection in Section V.

II. RELATED WORK

A. Object Detection
Object detection has been an important task of computer vision research, and the development of deep learning has vastly improved the performance of object detection [25], [26], [27], [28], [29], [30]. A modern detector is usually composed of two parts: a backbone that is pre-trained on ImageNet [31], and a detection head that is used for predicting the localization and classification of objects. Backbones for these detectors include VGG, ResNet, SpineNet, ResNeXt, and DenseNet. Detection heads are generally divided into two kinds, i.e., one-stage detectors and two-stage detectors.
One-stage detectors include the YOLO series [32], [18], [19], SSD [17], and RetinaNet [33]. They are simpler and faster than two-stage detectors but are slightly behind in accuracy. The classical detectors generally use preset anchor boxes for object detection. However, manually setting the scale and proportion of the anchor boxes leads to inefficiency in detection tasks across different scenes. Recently, anchor-free methods [34], [35], [36], [37] and transformer-based methods [38] for object detection have been proposed. Two-stage detectors include the R-CNN series [39], [40], [41], [42] and ThunderNet [43]. Faster R-CNN employed the RPN to generate Regions of Interest (RoIs) by modifying preset anchor boxes and improved the efficiency of detectors. Many methods were then introduced to enhance Faster R-CNN from different aspects. Cascade R-CNN [44] extended Faster R-CNN to a multi-stage detector through the cascade architecture. Mask R-CNN [45] replaced the RoIPool layer with the RoIAlign layer using bilinear interpolation. Soft NMS [46] was proposed to improve NMS. We apply object detection methods to logo detection. Different from these methods, we fully consider the characteristics of logo objects, and our framework is built on Faster R-CNN. On the issue of small logo objects, we introduce the DSFP to enhance the semantic information of low-level feature maps and improve the performance on small logo objects. On the issue of large aspect ratio objects, we adopt the GA, which generates large aspect ratio anchor boxes to accurately match these logo objects and effectively improve the efficiency of detection.

B. Logo Detection
Logo detection is a special case of object detection; it can be applied to many fields and has great commercial value. Hence, logo detection has attracted extensive attention from researchers. Early logo detection methods were built on hand-crafted visual features (e.g., SIFT and HOG) and conventional classification models (e.g., SVM). Inspired by the recent advances in object detection using deep learning methods [47], [48], remarkable progress has been made in logo detection. Some existing detectors insert network layers between the backbone and the detection head, and these layers are usually used to collect feature maps from different levels. Normally, such a structure is composed of several bottom-up paths and several top-down paths. Detectors equipped with this mechanism include FPN, Path Aggregation Network (PANet) [49], and Balanced Feature Pyramid (BFP) [50]. FPN used lateral connections and a top-down pathway to enhance the semantic information of shallow layers. After that, PANet brought in a bottom-up pathway to further increase the detailed information in deep layers. BFP further integrated balanced semantic features to strengthen the original features. Different from these feature pyramid networks, our approach integrates rich semantic features into low-level feature maps, which enriches the discriminative semantic features of these feature maps for detecting small logo objects and thus improves logo detection.
In addition, whether logo detectors are derived from one-stage or two-stage methods, almost all of them use preset anchor boxes to obtain RoIs. However, there are many large aspect ratio objects in logo datasets, so preset anchor boxes may not be the most suitable choice for logo detectors. Therefore, instead of using preset anchor boxes, our approach generates anchor boxes with an anchor location branch and an anchor shape branch by learning the features of logo objects, which improves the performance on large aspect ratio logo objects. Compared with existing logo detectors, our proposed DSFP-GA is more effective for small logo objects and large aspect ratio logo objects.

III. APPROACH
To address the issues of detecting small logo objects and large aspect ratio logo objects, we propose the Discriminative Semantic Feature Pyramid Network with Guided Anchoring (DSFP-GA) for logo detection. The overall network architecture of DSFP-GA is shown in Fig. 3. It is mainly divided into four parts: feature extractor, feature pyramid, guided anchoring, and classification and regression. Specifically, the feature maps of input logo images are extracted by ResNet-50. Then the DSFP is mainly used to enrich the semantic information of low-level feature maps and improve the detection performance on small logo objects. The feature maps extracted by the feature extractor are input into the DSFP. Next, we will focus on the two main modules in the DSFP-GA, namely DSFP and GA.

A. Discriminative Semantic Feature Pyramid
In order to address the issue of detecting small logo objects, we propose the DSFP to obtain discriminative semantic features by integrating high-level and middle-level features with rich semantic information into low-level feature maps. As shown at the bottom of Fig. 3, the whole process mainly includes three steps: lateral connection, multiple upsampling, and feature fusion.
1) Lateral Connection. Multi-level feature maps generated by the feature extractor are fed into the DSFP. In Fig. 3, {C2, C3, C4, C5} are multi-level features from level 2 to 5, where Ci is the feature map at level i, and these feature maps are recorded as {Temp_P2, Temp_P3, Temp_P4, P5} after passing through lateral connections. Lateral connections contain a 3 × 3 convolutional layer on each merged feature map to reduce the aliasing effect of upsampling and integration.
2) Multiple Upsampling. To integrate multi-level features and preserve semantic information, we upsample feature maps {P5, Temp_P4, Temp_P3} to the corresponding sizes, using the classical nearest interpolation function. P5 is upsampled three times, to the sizes of feature maps {Temp_P4, Temp_P3, Temp_P2} respectively, and the three resulting feature maps are denoted as {C5_4, C5_3, C5_2}. Temp_P4 is upsampled twice, to the sizes of feature maps {Temp_P3, Temp_P2} respectively, and the two resulting feature maps are denoted as {C4_3, C4_2}. Temp_P3 is upsampled once, to the size of feature map Temp_P2, and the resulting feature map is denoted as C3_2. Through this step, we obtain the rich semantic information of multi-level feature maps at different resolutions.
3) Feature Fusion. In this step, we integrate feature maps of the same size. Feature maps C5_4 and Temp_P4 are integrated to get P4. Feature maps C5_3, C4_3, and Temp_P3 are integrated to get P3. Feature maps C5_2, C4_2, C3_2, and Temp_P2 are integrated to get P2. Afterward, we append a 3 × 3 convolutional layer on {P2, P3, P4} to reduce the aliasing effect. The final output feature maps {P2, P3, P4, P5} are used for logo detection following the same feature pyramid pipeline.
The proposed DSFP performs cross-layer fusion from top to bottom, which ensures that the semantic information of high-level and middle-level feature maps is fused directly with low-level feature maps. Through the above three steps, the DSFP fuses features of different levels and obtains discriminative semantic features for detecting small logo objects, which further improves the performance of logo detection.
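The upsampling and fusion steps can be sketched in plain Python on single-channel toy maps. This is a minimal illustration, not the released implementation: it assumes each pyramid level has half the resolution of the level below it, it takes "integrated" to mean element-wise addition (as in FPN), and it omits the lateral and 3 × 3 smoothing convolutions. The helper names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a 2D map by an integer factor."""
    return [[fmap[i // factor][j // factor]
             for j in range(len(fmap[0]) * factor)]
            for i in range(len(fmap) * factor)]

def add_maps(*maps):
    """Element-wise sum of equally sized 2D maps (fusion by addition is assumed)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(cols)] for i in range(rows)]

def dsfp_fuse(temp_p2, temp_p3, temp_p4, p5):
    """Fuse laterally connected maps so each lower level receives all higher levels."""
    # Multiple upsampling: P5 three times, Temp_P4 twice, Temp_P3 once.
    c5_4, c5_3, c5_2 = (upsample_nearest(p5, f) for f in (2, 4, 8))
    c4_3, c4_2 = (upsample_nearest(temp_p4, f) for f in (2, 4))
    c3_2 = upsample_nearest(temp_p3, 2)
    # Feature fusion of same-size maps.
    p4 = add_maps(c5_4, temp_p4)
    p3 = add_maps(c5_3, c4_3, temp_p3)
    p2 = add_maps(c5_2, c4_2, c3_2, temp_p2)
    return p2, p3, p4, p5

def constant_map(size, value):
    """Square toy feature map filled with one value."""
    return [[value] * size for _ in range(size)]

# Toy pyramid: 8x8, 4x4, 2x2, and 1x1 maps tagged with distinct constants.
p2, p3, p4, p5 = dsfp_fuse(constant_map(8, 1.0), constant_map(4, 10.0),
                           constant_map(2, 100.0), constant_map(1, 1000.0))
print(p2[0][0], p3[0][0], p4[0][0], p5[0][0])  # 1111.0 1110.0 1100.0 1000.0
```

The printed constants show which levels contribute to each output: P2 receives P5, Temp_P4, and Temp_P3 directly, whereas in a standard FPN top-down pathway the higher levels reach P2 only through successive pairwise merges.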

B. Guided Anchoring
Our approach adopts the GA to address the issue of detecting large aspect ratio logo objects. The GA adaptively generates the width and height of anchor boxes by learning the features of logo objects, which helps to obtain more accurate anchor boxes and thus improve detection performance. The GA mainly consists of two branches: anchor location and anchor shape.
1) Anchor Location. This branch is used to predict which regions could be the center regions of logo objects. It yields a probability map p(x, y | F_I) of the same size as the input feature map F_I, where x and y are the center coordinates of anchor boxes. Each entry p(x, y | F_I) corresponds to the location with coordinates (xs + s/2, ys + s/2) on the image I, where s is the stride of the feature map. Through a 1 × 1 convolutional layer, we get a map of objectness scores, and we use the sigmoid function to transform the scores into probability values.
2) Anchor Shape. The goal of this branch is to predict the width (w_p) and height (h_p) of anchor boxes. Because w_p and h_p vary over a large range, the branch predicts transformed values d_wp and d_hp instead. It contains a 1 × 1 convolutional layer that produces a two-channel map of d_wp and d_hp values, which are then converted to the corresponding w_p and h_p values through the conversion formula. The essential difference between guided anchoring and preset anchor boxes is that each position is associated with only one anchor box of dynamically predicted shape instead of a series of preset anchor boxes. Through the two branches of anchor location and anchor shape, our framework can obtain large aspect ratio anchor boxes and thus improve the performance of logo detection.
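The two branches can be illustrated with a short sketch that turns raw branch outputs into anchors. It assumes the conversion used in the original Guided Anchoring paper [20], w_p = σ · s · e^{d_wp} and h_p = σ · s · e^{d_hp} with an empirical scale factor σ = 8; whether DSFP-GA uses exactly this form is an assumption here. The probability threshold of 0.5 and all function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function that maps an objectness score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def location_to_image(x, y, stride):
    """Image coordinates (xs + s/2, ys + s/2) for feature-map cell (x, y)."""
    return (x * stride + stride / 2.0, y * stride + stride / 2.0)

def decode_shape(d_w, d_h, stride, sigma=8.0):
    """Convert raw shape outputs to width and height (Guided Anchoring form, assumed)."""
    return (sigma * stride * math.exp(d_w), sigma * stride * math.exp(d_h))

def guided_anchors(loc_scores, shape_outputs, stride, threshold=0.5):
    """Return one (cx, cy, w, h) anchor per location whose probability passes the threshold."""
    anchors = []
    for y, row in enumerate(loc_scores):
        for x, score in enumerate(row):
            if sigmoid(score) < threshold:
                continue  # unlikely to be an object center
            cx, cy = location_to_image(x, y, stride)
            d_w, d_h = shape_outputs[y][x]
            w, h = decode_shape(d_w, d_h, stride)
            anchors.append((cx, cy, w, h))
    return anchors

# Toy 2x2 feature map at stride 16; only cell (x=1, y=0) looks like an object center.
loc_scores = [[-4.0, 3.0], [-5.0, -6.0]]
shape_outputs = [[(0.0, 0.0), (1.0, -1.0)], [(0.0, 0.0), (0.0, 0.0)]]
anchors = guided_anchors(loc_scores, shape_outputs, stride=16)
print(anchors)
```

Because the width and height are predicted independently, the single anchor produced here has a width-to-height ratio of e², about 7.4, which no anchor from a small preset ratio set would provide.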

C. Loss Function
In the training of the DSFP-GA framework, the overall optimization loss function combines a classification loss L_cls and a localization loss L_reg. The classification loss in turn comprises three terms: L_loc, which is used for the anchor location branch, and L_ga_cls and L_head_cls, which are the classification losses of the GA and the detection head, respectively.
Since the center of the anchor usually accounts for a small portion of the whole feature map, we use the Focal Loss for L_loc to mitigate the imbalance of positive and negative samples.
For L_ga_cls and L_head_cls, we adopt the Cross Entropy Loss to calculate the classification loss.
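Read literally, this description corresponds to the decomposition below. It is a reconstruction under the assumption that all terms are summed with unit weights; any balancing coefficients used in the actual formulation are not recoverable from the text.

```latex
\begin{aligned}
L &= L_{cls} + L_{reg}, \\
L_{cls} &= L_{loc} + L^{ga}_{cls} + L^{head}_{cls}.
\end{aligned}
```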
The regression loss comprises L_shape, L_ga_reg, and L_head_reg, which are the regression losses of the anchor shape branch, the GA, and the detection head, respectively. For L_shape and L_ga_reg, we adopt the Smooth L1 Loss and the Bounded IoU Loss, respectively. For L_head_reg, we incorporate the CIoU Loss to obtain more accurate regression results on logo detection. The CIoU loss considers four geometric factors in the process of regression, namely the overlap rate, the central point distance, the aspect ratio, and the penalty, and thus can regress the localization of logo objects more accurately and improve the performance of logo detection.
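For reference, the CIoU loss of [21] for a single pair of boxes can be written as the following minimal sketch. It follows the standard published definition, 1 − IoU + ρ²/c² + αv; the box format (x1, y1, x2, y2), the function name, and the small eps constant for numerical safety are choices made here for illustration.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1
    # Overlap term: plain IoU.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)
    # Squared center distance, normalized by the squared diagonal of the enclosing box.
    center_dist = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4.0
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    diag = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4.0 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - iou + center_dist / diag + alpha * v

wide_gt = (0.0, 0.0, 80.0, 10.0)        # hypothetical 8:1 ground-truth box
square_pred = (25.0, -10.0, 55.0, 20.0)  # same center, wrong shape
print(ciou_loss(square_pred, wide_gt) > ciou_loss(wide_gt, wide_gt))  # True
```

The aspect-ratio term is what distinguishes CIoU from a plain IoU loss here: two predictions with the same overlap and the same center are still ranked by how closely their shape matches the ground truth.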

IV. EXPERIMENT
In this section, we conduct an extensive evaluation of the proposed method. Experiments are performed over multiple logo datasets and baseline network architectures.

A. Experimental Setting
1) Datasets. We conduct our experiments on four logo datasets with different scales. Most of the experiments are performed on the large-scale LogoDet-3K [22] dataset, which contains 113,710 images for training, 28,432 for validation, and 16,510 for testing. To assess the robustness of the DSFP-GA method, experiments are also performed on the LogoDet-3K-1000 [22] dataset, the middle-scale QMUL-OpenLogo [23] dataset, and the small-scale FlickrLogos-32 [24] dataset. The LogoDet-3K-1000 dataset is sampled from the LogoDet-3K dataset, and it consists of 53,049 images for training and 9,559 images for testing. The QMUL-OpenLogo dataset contains 27,083 images from 352 logo categories, obtained by aggregating and refining several existing logo datasets. The FlickrLogos-32 dataset consists of 2,240 images from 32 logo categories. The detailed statistics of the four datasets are shown in TABLE I.
2) Implementation Details. The proposed framework is implemented based on the ResNet-50 backbone, which is pre-trained on ImageNet [31]. For fair comparison, all baseline detectors are re-implemented with the publicly available mmdetection toolbox [51] in the same codebase. All models are trained on the train set and validated on the validation set. For evaluation, we adopt the widely used mean Average Precision (mAP) [52] with an IoU threshold of 0.5. We train these detectors with an initial learning rate of 0.002, and the input images are resized to 1000×600. All other hyper-

B. Ablation Study
In this part, we provide empirical analysis for each component in DSFP-GA: DSFP, GA, and CIoU loss. Ablation studies are conducted on Faster R-CNN with ResNet-50-FPN. We report the overall ablation studies on the LogoDet-3K dataset in TABLE II. Furthermore, the results of ablation studies on the LogoDet-3K-1000, QMUL-OpenLogo, and FlickrLogos-32 datasets are shown in TABLE IV, TABLE VI, and TABLE VII. These results show the effectiveness of our method from different aspects on different logo datasets.
1) DSFP. We evaluate the effect of the DSFP by comparing it with FPN. The DSFP is mainly used to improve the ability to detect small logo objects, and it also enhances the semantic information of feature maps. As shown in TABLE III, small logo objects and medium logo objects account for 1.8% and 29.8% of the LogoDet-3K dataset. The DSFP brings a 0.7% mAP improvement over Faster R-CNN on the LogoDet-3K dataset in TABLE II, validating the effectiveness of the DSFP.
Furthermore, the DSFP performs better than the FPN on the logo detection task, because it can extract more discriminative semantic features. To verify this, we visualize the heatmap in Fig. 4, which explains the DSFP is more

Besides visualizing heatmaps, we also visualize the detection results of two images with small logo objects in Fig. 5. Compared with DSFP-GA (Faster R-CNN + DSFP), Faster R-CNN misses a small logo object in the first image, which further demonstrates the advantage of DSFP-GA in small logo object detection. The second image in Fig. 7 has small and extremely tall objects. Faster R-CNN lacks a good solution for this kind of logo object, and its detection result is less satisfactory. In contrast, our method has an advantage in detecting small and extremely tall logo objects.
Moreover, we validate the benefit of the DSFP on the other three datasets. Similar to the LogoDet-3K dataset, small logo objects and medium logo objects account for 1.7% and 33% of the LogoDet-3K-1000 dataset in TABLE V. As shown in TABLE IV, the DSFP improves the mAP by 0.5% over Faster R-CNN, which shows that our DSFP enriches the discriminative semantic information of feature maps. For the QMUL-OpenLogo dataset, more than 23.1% are small logo objects and over 44% are medium logo objects, as shown in TABLE VIII. It can be seen that the main challenge on the QMUL-OpenLogo dataset is small logo objects. The DSFP shows a more obvious improvement over the baseline in TABLE VI, which shows the effectiveness of the DSFP in small logo detection. For the FlickrLogos-32 dataset, we can observe in TABLE IX that less than 5.4% are small logo objects and about 29.3% are medium logo objects. The DSFP still brings a 0.7% mAP improvement over the Faster R-CNN baseline in TABLE VII, which indicates that the DSFP enhances the discriminative semantic information of feature maps.
2) GA. We evaluate the advantage of the GA on the LogoDet-3K dataset. The GA no longer limits the aspect ratio and the size of anchor boxes, which addresses the issue of large aspect ratio logo objects. For the LogoDet-3K dataset, more than 35% of logo objects have an aspect ratio greater than 3, and more than 11.8% of logo objects have an aspect ratio greater than 5 in TABLE III. There are thus many large aspect ratio logo objects in the LogoDet-3K dataset. As shown in TABLE II, the GA improves the mAP from 84.5% to 86.6% on the LogoDet-3K dataset, which indicates the advantage of using the GA in our method. We visualize the results in Fig. 6 and Fig. 7 to show that our method is effective in dealing with large aspect ratio logo objects, and the detection results of DSFP-GA (Faster R-CNN + DSFP + GA) are better than those of Faster R-CNN. As shown in Fig. 6, for the first two images, DSFP-GA has more accurate detection results than Faster R-CNN. In the third image, Faster R-CNN mistakenly identifies the logo category, and the detection accuracy of the correct box is 26% lower than that of DSFP-GA. As shown in Fig. 7, for the first image, Faster R-CNN does not detect the tilted logo on the right side of the image. In contrast, DSFP-GA detects the logo object on the right side with high accuracy, which also shows that DSFP-GA is robust in detecting hard logo objects. As for the logo object on the left side, the accuracy of Faster R-CNN is 16% lower than that of DSFP-GA. In the second image, we can see that the ground-truth boxes are small and extremely tall logo objects. DSFP-GA detects these two logo objects with good accuracy. This compellingly indicates that DSFP-GA not only detects large aspect ratio logo objects well but also has better performance on small logo objects.

Fig. 7. Comparison of large aspect ratio (extremely tall) logo detection results between Faster R-CNN and DSFP-GA (Faster R-CNN + DSFP + GA). Blue boxes: ground-truth boxes. Orange boxes: correct detection boxes.

Table fragment (aspect ratio distribution): Range (1.0-2.9) 94.8%; Range (3.0-4.9) 4.3%; Range (5+) 0.9%.
The ablation studies on the LogoDet-3K-1000 dataset further validate the benefit of the GA. In TABLE V, logo objects of Range (3+) account for 36.1% and logo objects of Range (5+) account for 11.9% of the LogoDet-3K-1000 dataset. The GA brings a 0.7% mAP improvement on the LogoDet-3K-1000 dataset in TABLE IV, which validates the effectiveness of the GA in addressing the issue of large aspect ratio logo objects.
The ablation studies on the QMUL-OpenLogo dataset and the FlickrLogos-32 dataset also show the performance of the GA. As shown in TABLE VIII, more than 81.5% of the logo objects have an aspect ratio between 1 and 2.9, and about 4.3% have an aspect ratio greater than 5 on the QMUL-OpenLogo dataset. The GA improves the mAP from 53.5% to 53.7% in TABLE VI. Furthermore, as shown in TABLE IX, approximately 95% of the logo objects have an aspect ratio between 1 and 2.9, and only about 0.9% have an aspect ratio greater than 5 on the FlickrLogos-32 dataset. Similarly, the GA increases the mAP by 0.1%, as shown in TABLE VII. The gain from the GA is thus largest on the datasets that contain the most large aspect ratio logo objects, and we can conclude that the GA performs better on large aspect ratio logo objects.
3) CIoU Loss. We also evaluate the benefit of the CIoU loss on these four logo datasets. The CIoU loss obtains more accurate regression results by solving the problem of inconsistency, which improves detection performance. In TABLE II, the CIoU loss improves the mAP from 86.6% to 87.7% on the LogoDet-3K dataset. The CIoU loss also improves the mAP from 89.4% to 90.1% on the LogoDet-3K-1000 dataset in TABLE IV, and improves the mAP from 53.7% to 53.8% on the QMUL-OpenLogo dataset. The CIoU loss improves the mAP from 86.7% to 87.1% on the FlickrLogos-32 dataset in TABLE VII. These results validate the effectiveness of adopting the CIoU loss in our framework.

Table row fragment: DSFP-GA, ResNet-50-DSFP, 87.7.
In order to verify the better performance of DSFP-GA, we select two images that contain both small and large aspect ratio logo objects and visualize the detection results. As shown in Fig. 8, Faster R-CNN does not detect the small and wide logo object in the first image, whereas DSFP-GA achieves good localization and classification results on the same image. In the second image, Faster R-CNN mistakenly detects two logo objects, and the accuracy of another logo object detected by Faster R-CNN is much lower than that of DSFP-GA. This demonstrates the strong performance of DSFP-GA on both small and large aspect ratio logo objects.

C. Comparison of State-of-the-Art Frameworks
To further validate the versatility of the proposed DSFP-GA, experiments are performed with multiple popular baselines.We choose several one-stage frameworks that have good performance on general detection datasets in recent years.We also select a series of classic two-stage frameworks which are improved from Faster R-CNN and act as the state-of-the-art.
1) Experiment on the LogoDet-3K.Our method DSFP-GA achieves the best performance on the LogoDet-3K datasets.We compare DSFP-GA with the state-of-the-art detection approaches on the large-scale LogoDet-3K dataset in TA-BLE X.Compared with the existing two-stage baselines Faster R-CNN, Libra R-CNN, and Dynamic R-CNN, etc., the DSFP-GA significantly outperforms these state-of-the-art frameworks.Our framework is modified on Faster R-CNN and achieves the best 87.7% mAP, surpassing the Faster R-CNN baseline of 3.9% mAP, which indicates the effectiveness of our strategy.In addition, our framework also improves 4.6% mAP compared with PANet that is equipped with the excellent feature pyramid structure PAFPN.It can explain that our DSFP has a better effect on fusing the features of logo objects than PAFPN.We also compare DSFP-GA with state-of-theart one-stage approaches.Our framework brings 7.8% mAP improvement than ATSS [53] and 6.5% mAP improvement than GFL [54].Because there are many large aspect ratio logo objects, and our model is more efficient for this hard issue.
Moreover, some visual detection results of DSFP-GA are given in Fig. 9, which can better verify that our model has great performance in detecting various logo objects.We can see that our model has good detection results on large logo objects (category "cherry 7up", category "waffle house", etc.), medium logo objects (category "cheez whiz", category "swiss miss", etc.), and small logo objects (category "freia", category "nioxin", etc.).It is worth mentioning that there are multiple multi-scale objects in a test image, our model also can detect all logo objects well.These validate our framework can well detect logo objects of different sizes and has the capacity to handle multiple logo objects within one image.
2) Experiment on the LogoDet-3K-1000.DSFP-GA also has the best performance on the LogoDet-3K-1000 dataset.As shown in TABLE XI, DSFP-GA also achieves 90.1% mAP, which increases 1.9% mAP than Faster R-CNN.In addition, our framework also improves 1.0% mAP compared with PANet.Compared with the one-stage framework, our framework improves 2.3% mAP than ATSS and 2.4% mAP than GFL.The LogoDet-3K-1000 dataset contains more large aspect ratio logo objects, which proves the effectiveness of our model in dealing with this kind of logo objects.The experiments on the LogoDet-3K-1000 dataset further validate the advantages of the proposed DSFP-GA framework over a wide range of state-of-the-art methods.
3) Experiment on the QMUL-OpenLogo. From TABLE XII, which also lists the results of the baselines, we can see that our framework achieves the best performance (53.8% mAP) on the middle-scale QMUL-OpenLogo dataset. Compared with Faster R-CNN, DSFP-GA obtains a 1.2% mAP improvement. Our framework also improves by 0.9% mAP over PANet. This further shows that DSFP-GA detects logos on the QMUL-OpenLogo dataset, which contains more small logo objects, better than Faster R-CNN and PANet. Compared with the best-performing one-stage method, GFL, our method improves by 4.6% mAP (49.2% vs. 53.8%). These results indicate that our model is effective in dealing with the problem of small logo objects.
4) Experiment on the FlickrLogos-32. Our framework also has good performance on the small-scale FlickrLogos-32 dataset. The experimental results of baseline and our framework on the small-scale FlickrLogos-32 dataset are summa-

D. Qualitative Analysis
To further evaluate the performance of DSFP-GA in detecting small logo objects and large aspect ratio logo objects, we select from the LogoDet-3K dataset two categories in which small logo objects account for a large proportion and two categories in which large aspect ratio logo objects make up a substantial part, as shown in Fig. 10. The Average Precision (AP, the evaluation indicator for a single category) values of the four categories are shown in Fig. 11. We analyze the characteristics and AP values of these four categories below.
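As a rough illustration of the per-category AP mentioned above, the following sketch computes the area under the precision-recall curve with all-point interpolation. The exact AP protocol used for Fig. 11 is not stated in this section, so the interpolation scheme, the function name `average_precision`, and the input format are assumptions made for illustration only.

```python
def average_precision(detections, num_gt):
    """Area under the interpolated precision-recall curve for one category.

    detections: list of (confidence, is_true_positive) pairs (hypothetical
                input format; the matching to ground truth is done elsewhere).
    num_gt:     number of ground-truth boxes of this category (assumed > 0).
    """
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)

    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the interpolated curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

A category in which every ranked detection is correct and every ground-truth box is found yields an AP of 1.0; false positives ranked above true positives lower the value.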
The small logo objects in the "whisper" category have
As for the categories "ben franklin store" and "sram", we can see in Fig. 10 (c, d) that large aspect ratio logo objects account for a large proportion (49.50% and 53.97%, respectively). As shown in Fig. 11, DSFP-GA improves AP over Faster R-CNN by 12.2% and 26.4%, respectively. Faster R-CNN cannot deal well with large aspect ratio logo objects using the preset anchor boxes. In contrast, DSFP-GA performs better in these two categories, indicating that DSFP-GA has a clear advantage in detecting large aspect ratio logo objects.

V. CONCLUSION
In this work, we propose a novel logo detection framework, the Discriminative Semantic Feature Pyramid Network with Guided Anchoring (DSFP-GA), for detecting small logo objects and large aspect ratio logo objects, a rarely explored problem in the logo detection literature. The architecture addresses these logo detection challenges by introducing the DSFP and the GA. We have designed the DSFP to enhance discriminative semantic features, which improves the performance of small logo detection. The GA generates anchor boxes whose width and height adapt to the logo objects, which allows it to deal effectively with large aspect ratio logo objects. We further adopt the CIoU loss for regression to improve the performance of the framework. Extensive evaluations were conducted on four standard logo benchmarks to validate the advantages of the proposed DSFP-GA model over a wide range of state-of-the-art methods. In the future, we will design a better feature pyramid and region proposal network to further improve the performance of logo detection.

Fig. 1. (a) FPN introduces a top-down pathway and lateral connections to fuse multi-level features from level 2 to 5 (P2-P5). (b) Our DSFP adds high-to-low aggregating pathways and lateral connections, which mainly enrich the semantic information of low-level feature maps.

Fig. 2. (a) Four illustrative large aspect ratio logo images. (1) category "napapijri", max/min equals 6.1; (2) category "coffee beanery", max/min equals 7.9; (3) category "luciano soprani", (the left box) max/min equals 7.7, (the right box) max/min equals 6.5; (4) category "simple human", (the left box) max/min equals 5.8, (the right box) max/min equals 4.7. Blue boxes: ground-truth boxes. (b) Histogram of the number of boxes vs. the ratio of the maximum dimension to the minimum dimension of the object on the LogoDet-3K dataset. Boxes with max/min in the range 1-2.9 account for 65.1%, those in the range 3-4.9 account for 23.1%, and those with max/min greater than 5 account for 11.8%.
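The max/min statistic used in Fig. 2 can be sketched as follows. This is a minimal illustration rather than the authors' script: the function names, the `(x1, y1, x2, y2)` box format, and the bin edges at 3 and 5 (chosen to approximate the ranges quoted in the caption) are assumptions.

```python
def max_min_ratio(box):
    """Ratio of the longer side to the shorter side of a box.

    box: (x1, y1, x2, y2) corner coordinates (assumed format), with
         strictly positive width and height.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    return max(width, height) / min(width, height)


def ratio_histogram(boxes, edges=(3.0, 5.0)):
    """Percentage of boxes whose max/min ratio falls into each bin.

    With the default (assumed) edges the bins are [1, 3), [3, 5), and >= 5.
    """
    counts = [0] * (len(edges) + 1)
    for box in boxes:
        ratio = max_min_ratio(box)
        # The bin index is the number of edges the ratio has reached.
        counts[sum(ratio >= edge for edge in edges)] += 1
    return [100.0 * count / len(boxes) for count in counts]


# Hypothetical boxes with ratios 1.0, 2.0, 4.0, and 6.1.
example_boxes = [(0, 0, 10, 10), (0, 0, 20, 10), (0, 0, 40, 10), (0, 0, 10, 61)]
example_percentages = ratio_histogram(example_boxes)
```

Because the ratio uses the longer side over the shorter side, wide and tall logos are treated alike, which matches the "maximum dimension to minimum dimension" wording of the caption.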

Fig. 3. Overview of the proposed Discriminative Semantic Feature Pyramid Network with Guided Anchoring (DSFP-GA). Feature Extractor: we use ResNet-50 as the backbone to extract feature information. Discriminative Semantic Feature Pyramid: we propose the DSFP to obtain discriminative semantic features. It mainly contains lateral connections, multiple upsampling, and feature fusion. Guided Anchoring: we adopt the GA to generate anchor boxes that can have large aspect ratios, then classify each anchor as foreground or background and perform preliminary bounding box regression. Classification and Regression: output the corresponding category and final localization.
where $R_{CIoU}$ is the penalty term for the predicted box $B_p$ and the ground-truth box $B_g$, $b_p$ and $b_g$ denote the central points of $B_p$ and $B_g$, $\phi(\cdot)$ is the Euclidean distance, and $c$ is the diagonal length of the smallest enclosing box covering the two boxes. $\alpha$ is a positive trade-off parameter, and $w$ and $h$ are the width and height of the predicted box, respectively.
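To make the symbol definitions above concrete, the sketch below evaluates a CIoU-style loss for a single pair of boxes. The equation itself is not reproduced in this part of the text, so the code assumes the commonly used CIoU formulation, 1 - IoU plus a normalized center-distance term plus an aspect-ratio term weighted by the trade-off parameter; the function name `ciou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are illustrative assumptions.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU-style loss for one predicted box and one ground-truth box.

    pred, gt: (x1, y1, x2, y2) corner coordinates (assumed format).
    Assumed form: 1 - IoU + dist(b_p, b_g)^2 / c^2 + alpha * v.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1        # width and height of the predicted box
    w_g, h_g = gx2 - gx1, gy2 - gy1    # width and height of the ground truth

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_g * h_g - inter
    iou = inter / (union + eps)

    # Squared Euclidean distance between the two box centers.
    center_dist_sq = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal length c^2 of the smallest enclosing box.
    enclose_w = max(px2, gx2) - min(px1, gx1)
    enclose_h = max(py2, gy2) - min(py1, gy1)
    c_sq = enclose_w ** 2 + enclose_h ** 2

    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(w_g / (h_g + eps))
                              - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + center_dist_sq / (c_sq + eps) + alpha * v
```

In this formulation the center-distance term still provides a gradient signal when the boxes do not overlap, and the aspect-ratio term penalizes predictions whose shape differs from the ground truth, which is relevant for elongated logos.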

Fig. 4. Visualization comparison of the features extracted by FPN and DSFP. P2 and P3: the second level and the third level of the feature pyramid.

Fig. 8. Comparison of both small size and large aspect ratio logo detection results between Faster R-CNN and DSFP-GA. Blue boxes: ground-truth boxes. Orange boxes: correct detection boxes. Yellow boxes: mistaken detection boxes.

Fig. 9. Some examples of detection results of DSFP-GA. The orange boxes mark the locations of the logo objects. The category name and its accuracy are shown on top of each box.

Fig. 11. The AP values of four categories for Faster R-CNN and DSFP-GA on the LogoDet-3K dataset.

TABLE I. STATISTICS OF FOUR LOGO DATASETS

TABLE III. THE PROPORTION OF OBJECT SIZE AND ASPECT RATIO ON THE LOGODET-3K DATASET
parameters follow the settings in mmdetection toolbox if not specifically noted.

TABLE IX. THE PROPORTION OF OBJECT SIZE AND ASPECT RATIO ON THE FLICKRLOGOS-32 DATASET

TABLE IV .
As shown in TABLE IX, the CIoU loss increases

TABLE X. DETECTION RESULTS ON THE LOGODET-3K DATASET

TABLE XI. DETECTION RESULTS ON THE LOGODET-3K-1000 DATASET

TABLE XII. DETECTION RESULTS ON THE QMUL-OPENLOGO DATASET

TABLE XIII. DETECTION RESULTS ON THE FLICKRLOGOS-32 DATASET