Long-Range Dependence Involutional Network for Logo Detection

Logo detection is one of the crucial branches in computer vision due to various real-world applications, such as automatic logo detection and recognition, intelligent transportation, and trademark infringement detection. Compared with traditional handcrafted-feature-based methods, deep learning-based convolutional neural networks (CNNs) can learn both low-level and high-level image features. Recent decades have witnessed the great feature representation capabilities of deep CNNs and their variants, which are very good at discovering intricate structures in high-dimensional data and are thereby applicable to many domains including logo detection. However, logo detection remains challenging, as existing detection methods cannot well handle the problems of multiple scales and large aspect ratios. In this paper, we tackle these challenges by developing a novel long-range dependence involutional network (LDI-Net). Specifically, by rethinking the intrinsic principle of convolution, we designed long-range dependence involution (LD involution), which combines a new operator with a self-attention mechanism to alleviate the detection difficulties caused by large aspect ratios. We also introduce a multilevel representation neural architecture search (MRNAS) to detect multiscale logo objects by constructing a novel multipath topology. In addition, we implemented an adaptive RoI pooling module (ARM) to improve detection efficiency by addressing the problem of logo deformation. Comprehensive experiments on four benchmark logo datasets demonstrate the effectiveness and efficiency of the proposed approach.


Introduction
Feature extraction is the most fundamental problem in various image-related tasks. Recent years have witnessed the powerful feature representation capability of convolutional neural networks (CNNs), which makes them very good at extracting rich image features. This is because they have two inherent advantages, namely, sparse connectivity and weight sharing. The former replaces the full connectivity of traditional neural networks with local perception, while the latter significantly reduces the number of parameters, thus allowing CNNs to train lightweight models. These advantages enable CNNs to outperform traditional handcrafted-feature methods. Therefore, deep-learning-based CNNs are increasingly dominant in object detection tasks. As a special form of object detection, logo detection aims at finding all the logos in an image or a video and returning their locations. Extracting effective features is a crucial step in logo detection, in which deep CNNs can help in discovering intricate structures in logo datasets.
Logo detection plays an important role in various real-world applications, such as intelligent transportation [1,2], trademark infringement detection [3], and automatic logo detection and recognition [4]. However, logo detection is a complex task compared to general object detection because logo images usually have two distinctive characteristics: a large aspect ratio and multiple scales. On the one hand, there are many long words, artistic words, and feature images in a logo image resulting in a relatively large aspect ratio. On the other hand, the exquisite design of logo images causes a variety of multiscale logo objects in an image.
Compared with general object detection, the challenges of logo detection mainly come from two aspects:
• The large aspect ratio of a logo usually spans a large area in an image, as shown in Figure 1a. As far as we know, there has been little research on the problem of large aspect ratios in logo detection. Existing two-stage approaches based on fixed-size anchors fail to detect logos with flexible aspect ratios [5][6][7][8]. Optimized strategies [9][10][11] also have a limited effect on the detection of logos with a large aspect ratio. The method in [12] could generate anchors of any shape, but it was unable to extract long-range dependence. Similarly, the traditional convolutional approach cannot fully utilize long-range interactions or the locations of various spatial features, which severely restricts its ability to address the large aspect ratios of logos.
• Multiscale logo objects in an image. As seen in Figure 1b, 'adidas' appears both in the foreground and the background, but the scale varies greatly. Scale diversity can be resolved with feature pyramid networks (FPNs) [13], but the semantic information of small objects may be lost after multiple rounds of downsampling. PANet [14] adds a bottom-up information channel, but the information is concentrated more in adjacent layers. Although SEPC [15] can extract multilevel features, its topology is too simple to extract more information.
In this paper, we present a novel logo detection method called the long-range dependence involutional network (LDI-Net). We rethink the intrinsic principle of convolution, propose long-range dependence involution (LD involution), and apply it to a region proposal network (RPN). LD involution remedies two major convolutional flaws, since it has significant advantages in acquiring long-range interactions in the spatial and channel dimensions.
Meanwhile, it can preferentially extract significant visual information in space by creating particular involutional kernels for specific spatial locations. The construction of LD involution enables the visual information and elements in the spatial domain to be reasonably allocated and sorted on the logo image to the greatest extent. The channel-sharing involutional kernel allows us to use a larger K to establish and correlate long-range information, and significantly reduces the redundancy of the model. For logo detection, a logo image with a very large aspect ratio places high demands on long-distance information contact. LDI-Net improves the detection performance of logos with a large aspect ratio by employing a new operator and a self-attention mechanism. For the second issue, we propose a multilevel representation neural architecture search (MRNAS) to detect multiscale logo objects. MRNAS introduces six heterogeneous information paths to construct a diverse multipath topology that combines semantic information and location representation, optimizing cross-level interaction between features. Additionally, we implemented an adaptive RoI pooling module (ARM) to improve detection efficiency and achieve adaptive feature learning for differently shaped objects. By adding an additional offset and a modulation mechanism, ARM addresses the logo deformation problem caused by angles, occlusion, rotation, distortion, reflection, etc. (as shown in Figure 1c).
The main contributions of this paper can be summarized as follows:
• We developed a network with LD involution for logo detection, establishing long-range information dependence and ranking the significance of visual information via a new operator and a self-attention mechanism to solve the problem of large aspect ratios.
• We constructed a diverse multipath topology on the basis of neural architecture search theory in which each path utilizes a specific feature fusion.
• We conducted extensive experiments and evaluated our approach on four benchmark logo datasets: FlickrLogos-32, QMUL-OpenLogo, LogoDet-3K-1000, and LogoDet-3K. The experimental results demonstrate the effectiveness of the proposed model.

Object Detection
In recent years, CNNs have been widely used in deep learning and have achieved remarkable results [16][17][18]. Object detection is one of the most fundamental and challenging problems in computer vision, and has received much attention in recent years. In the era of deep learning, object detection methods are divided into two genres: two-stage and one-stage. A two-stage system first creates region proposals on the basis of image content, followed by categorization and localization. Classical two-stage algorithms include fast R-CNN [19], faster R-CNN [5], and cascade R-CNN [20]. One-stage algorithms are characterized by one-step completion without region proposals, directly generating the category and location coordinates, such as the YOLO series [8,21,22]. Among them, faster R-CNN is a milestone work based on RPN.
Recognizing multiscale objects is an important issue for object detection. Many works have built on FPN [13], including PANet [14], BiFPN [23], and SEPC [15], because of its strong performance in multilevel feature extraction. PANet enhanced the representation ability by integrating bottom-up and top-down paths. BiFPN introduced learnable weights to determine the importance of different input features, and repeatedly applied multiscale feature fusion. SEPC performed deformable convolution on the high-level features of a feature pyramid, which adapted to actual scale changes and maintained scale balance between layers. Although these methods implemented the information interaction between multiple layers, their relatively simple topologies fail to preserve the feature information of small objects.

Logo Detection
Logo detection has been extensively studied in e-commerce and multimedia fields [24][25][26][27]. Early logo detection was generally completed on the basis of manual features and traditional classification models, such as Viola-Jones (VJ) [28], the histogram of oriented gradients (HOG) [29], and the deformable parts model (DPM) [30]. Yan et al. [31] used the Bayesian classifier framework to detect and remove video logos. Wang et al. [32] implemented a simple automotive logo recognition method using template matching and edge orientation histograms.
In the last few years, deep-learning-based logo detection algorithms have become mainstream. Bao et al. [33] directly applied faster R-CNN to logo detection and achieved good performance. Xu et al. [27] proposed a solution to robust defence competition in e-commerce logo detection. Velazquez et al. [34] improved the detection performance of small objects by incorporating FPN into the DETR structure. Wang et al. [25] built the largest fully annotated logo detection dataset, i.e., LogoDet-3K, and proposed Logo-Yolo to resolve the imbalanced samples of logo objects. A cross-view learning method [35] provided ideas for logo detection. Hou et al. [26] constructed a large dataset, FoodLogoDet-1500, to address data limitations in food logo detection, and proposed MFDNet to address multiscale and similar logo problems.
Different from previous work, we rethought the intrinsic principle of convolution and applied the proposed LD involution to RPN. Meanwhile, we constructed a diverse multipath topology on the basis of neural architecture search theory in which each path utilized a specific feature fusion. In addition, we introduced ARM to achieve adaptive feature learning for different objects.

Our Approach
In this section, we present the proposed logo detection method, LDI-Net; the overall framework is shown in Figure 2. Specifically, the model first feeds the feature map into MRNAS to learn multilevel features after extracting essential features from the input image. Next, it feeds the feature map into an RPN established with LD involution to obtain higher-quality region proposals. Then, it feeds the feature map into ARM to enhance the modeling capability. Lastly, the model performs classification and localization. All components are described in detail in the following sections.

Multilevel Representation Neural Architecture Search
As shown in Figure 2, the main body of MRNAS is a fully connected directed acyclic graph composed of N + 2 nodes, where N is a predefined constant; to balance efficiency and accuracy, we set N to 5 in LDI-Net. The nodes of the directed acyclic graph represent feature maps driven by the feature pyramid: P is the input node, O is the output node, and t_i (i = 1, 2, ..., N) are the intermediate nodes. Different information paths serve as connections between pairs of nodes. We introduced six kinds of heterogeneous information paths: top-down, bottom-up, fusing-splitting, scale-equalizing, skip-connect, and none [36]. They realize the aggregated combination of multilevel information along different paths. These information paths PA(i, j) transform t_i into t_j, and each node j ∈ {1, 2, ..., N} aggregates the inputs of its predecessor nodes:

t_j = Σ_{i < j} PA(i, j)(t_i).
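To make the aggregation rule concrete, the following minimal sketch sums the contributions of each active path into a node. This is our own illustration, not the authors' code; `aggregate_node` and the toy path transforms are hypothetical names, and real paths would be learned multiscale operations rather than scalar transforms.

```python
import numpy as np

def aggregate_node(inputs, paths):
    """Aggregate a node's value from all predecessor nodes.

    inputs: list of predecessor feature maps t_0 .. t_{j-1}
    paths:  list of path transforms PA(i, j), one per predecessor;
            a 'none' path is represented by None and contributes nothing.
    """
    out = None
    for t_i, pa in zip(inputs, paths):
        if pa is None:  # the 'none' path: this edge is pruned
            continue
        contrib = pa(t_i)
        out = contrib if out is None else out + contrib
    return out

# Toy example: two predecessors, one skip-connect (identity) path and
# one scaling path standing in for a learned transform.
t0 = np.ones((2, 2))
t1 = np.full((2, 2), 3.0)
node = aggregate_node([t0, t1], [lambda x: x, lambda x: 0.5 * x])
```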

Long-Range Dependence Involution
The purpose of RPN is to generate region proposals when detecting objects. LD involution is a more efficient way to correlate information than convolution, and it improves the quality of the generated candidate regions compared with a convolutional RPN. Similar to involution [37], the feature transformation process of LD involution is shown in Figure 3. For a coordinate point in the input feature map, its feature vector is first transformed by the two-step generation procedure (given in Figure 4) and then reshaped and expanded into the involutional kernel corresponding to that coordinate point. The kernel is then multiply-added with the K × K neighborhood around the coordinate point to obtain the final output feature map. We focus on long-range dependence, which is crucial for the optimization of large aspect ratios. Inspired by [38], we adopted a flexible generation method to produce the involutional kernel instead of a convolutional kernel. As shown in Figure 4, the involutional kernel was constructed in two parts. In the first part, we used global self-attention to extract distant information. In the content-position section, we utilized relative position encodings R_h and R_w to represent height and width, respectively. We used q, k, and r to represent the query, key, and position encoding, respectively. Attention logits are denoted as qk^T + qr^T; in Figure 4, ⊙ and ⊗ represent element-wise and matrix multiplication, respectively. After self-attention, global average pooling was employed to refine the context modeling and enrich the extraction of long-range information. The second part captures channel dependence by learning the correlations between channels and filtering the attention over channels. After the position-wise feature extraction of the first part, the channel feature dependence was successively obtained with a 1 × 1 convolution, BN, ReLU, and another 1 × 1 convolution.
Combined with the general form described above, the module is defined as follows:

M_i,j = W_f δ(W_s G(S(X_i,j))),

where X_i,j and M_i,j represent the input and output, respectively, S represents global self-attention, and G represents global average pooling. W_s and W_f represent linear transformation matrices (1 × 1 convolutions were adopted here), and δ represents BN and ReLU. Figure 4. Construction of the involutional kernel.
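The multiply-add step of an involution-style operator can be sketched as follows. This is a simplified illustration of the general mechanism with the per-position, channel-shared kernels supplied directly, not the authors' LD involution implementation (which would generate the kernels via self-attention, pooling, and the two 1 × 1 convolutions above); `involution2d` is a hypothetical name.

```python
import numpy as np

def involution2d(x, kernels, K=3):
    """Apply per-position, channel-shared kernels to a feature map.

    x:       input feature map, shape (C, H, W)
    kernels: per-position kernels, shape (H, W, K, K); channel-shared,
             which is what permits a larger K at modest parameter cost
    """
    C, H, W = x.shape
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + K, j:j + K]  # (C, K, K) neighborhood
            # Multiply-add: the same spatial kernel weights every channel.
            out[:, i, j] = (patch * kernels[i, j]).sum(axis=(1, 2))
    return out

# With an identity kernel at every position, the operator returns the input.
x = np.random.rand(4, 5, 5)
ident = np.zeros((5, 5, 3, 3))
ident[:, :, 1, 1] = 1.0
y = involution2d(x, ident)
```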

Adaptive RoI Pooling Module
RoI pooling is used to pool arbitrary-size input feature maps into same-size feature maps. RoI pooling divides an RoI into i bins. Each bin can be formulated as follows:

y(i) = (1/m_i) Σ_{t=1}^{m_i} x(R_it),

where x is the input feature map and y is the output feature map. R_it is the sampling position of the t-th grid cell in the i-th bin, and m_i is the number of grid cells in the bin. We sum the sampled values over the grid cells and take the average to compute the output of the bin. As shown in Figure 2, in the adaptive RoI pooling, we added an additional offset and a modulation mechanism [39]:

y(i) = (1/m_i) Σ_{t=1}^{m_i} x(R_it + R_i) · h_i,

where R_i is the offset used to enrich the spatial sampling positions and improve the feature extraction ability of the network, and h_i is the modulation scalar used to assign a weight to each offset-corrected region.
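A minimal numerical sketch of a single pooling bin with the offset and modulation applied follows. The function name and nearest-neighbor sampling are our simplifications; real implementations use bilinear interpolation and learn the offsets and modulation scalars per bin.

```python
import numpy as np

def adaptive_roi_bin(x, cells, offset, modulation):
    """Compute one RoI-pooling bin with an offset and a modulation scalar.

    x:          feature map, shape (H, W)
    cells:      list of (row, col) sampling positions R_it inside the bin
    offset:     (dr, dc) shift applied to every cell (the learned R_i)
    modulation: scalar weight h_i for this bin
    """
    H, W = x.shape
    vals = []
    for r, c in cells:
        rr = int(np.clip(r + offset[0], 0, H - 1))
        cc = int(np.clip(c + offset[1], 0, W - 1))
        vals.append(x[rr, cc])
    return modulation * np.mean(vals)

x = np.arange(16, dtype=float).reshape(4, 4)
# With zero offset and modulation 1.0 this reduces to plain average pooling.
plain = adaptive_roi_bin(x, [(0, 0), (0, 1), (1, 0), (1, 1)], (0, 0), 1.0)
```

Shifting the offset moves the sampling grid toward the deformed logo region, while the modulation scalar downweights bins whose shifted samples land on background.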

Loss Function
In LDI-Net, the final loss function consists of L_rpn, L_cls, and L_loc, as listed in Equation (5):

L = L_rpn + L_cls + L_loc, (5)

where L_rpn is the RPN loss, L_cls is the classification loss, and L_loc is the bounding box regression loss. We implemented L_cls with the cross-entropy loss function. In order to better adapt to changes in the sample distribution, we used dynamic SmoothL1 loss (DSL) in L_loc to compensate for high-quality samples and pay more attention to them:

DSL(x, β_now) = 0.5|x|²/β_now, if |x| < β_now; |x| − 0.5β_now, otherwise.

To evaluate the effectiveness of the proposed LDI-Net, we completed comprehensive experimental validation on four datasets: the large-scale dataset LogoDet-3K [25], the medium-scale dataset LogoDet-3K-1000 [25], and two small-scale datasets, QMUL-OpenLogo [40] and FlickrLogos-32 [41]. LogoDet-3K contains 158,652 images, including 142,142 for trainval and 16,510 for testing. LogoDet-3K-1000 is a subset sampled from LogoDet-3K. To further evaluate the generalization and robustness of the LDI-Net model, we also carried out extensive experiments on two widely used logo detection datasets, i.e., QMUL-OpenLogo and FlickrLogos-32. A detailed description of these datasets is shown in Table 1, where classes, images, and objects represent the number of categories, images, and logos in each dataset, respectively, and trainval and test represent a division of the dataset whose sum is the number of images.
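As a side note on the loss, the dynamic SmoothL1 behavior used in L_loc can be sketched numerically, with β_now passed in explicitly. This is a sketch of the standard DSL formulation under our reading of Dynamic R-CNN, not the authors' training loop, which updates β_now from regression error statistics during training.

```python
import numpy as np

def dynamic_smooth_l1(x, beta):
    """Smooth-L1 loss with a beta that DSL shrinks over training.

    As regression errors of high-quality samples become small, a smaller
    beta sharpens the penalty near zero, emphasizing those samples.
    """
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax * ax / beta, ax - 0.5 * beta)

# A smaller beta increases the loss on small residuals, which is how
# DSL pays more attention to high-quality (low-error) samples.
err = np.array([0.05, 0.5, 2.0])
loss_big = dynamic_smooth_l1(err, beta=1.0)
loss_small = dynamic_smooth_l1(err, beta=0.1)
```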

Implementation Details
We implemented our method on the basis of the publicly available MMDetection toolbox [42], and used dynamic R-CNN [43] based on ResNet-50 as the baseline. We chose ResNet-50 as the backbone network for two reasons: (1) ResNet-50 itself has little influence on the results, which makes the improvement from the proposed model more apparent; and (2) it is a classical, widely used network, which facilitates comparisons in the experiments. For evaluation, we used the widely used mean average precision (mAP) [44] with an IoU threshold of 0.5. Meanwhile, we report processing time, model size, parameters, and FLOPs in order to further detail the experimental results. Processing time refers to the time from the beginning of the training process to convergence. The model size, parameters, and FLOPs provide a reference for measuring model complexity. In our experiments, the basic detection network was trained using stochastic gradient descent (SGD), and the initial learning rate was set to 0.002. In the data preprocessing stage, all input images were resized to 1000 × 600. The weight decay was 0.0001, and the momentum was 0.9. We followed the settings in MMDetection for the other hyperparameters.
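Since mAP at an IoU threshold of 0.5 is the main metric, a minimal IoU helper illustrates the matching criterion behind it. Boxes are given as (x1, y1, x2, y2); this is an illustrative sketch, not the MMDetection evaluator.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive at mAP@0.5 only if its IoU with an
# unmatched ground-truth box of the same class is at least 0.5.
hit = iou((0, 0, 10, 10), (5, 0, 15, 10))
```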

Comparisons with State of the Art
We compared the proposed LDI-Net with several other popular one-stage and two-stage baselines, as reported in Table 2. Table 2 shows the best detection performance of all methods with a uniform learning rate. The proposed LDI-Net method was superior to the other baselines, as it achieved the best performance with 88.7% mAP. It achieved 4.9% and 1.6% improvements over faster R-CNN and dynamic R-CNN, respectively. The proposed LDI-Net strategy also achieved the best performance compared to other approaches that utilize feature fusion. For example, PANet, Libra R-CNN, and our method all utilize feature fusion to extract multilevel features. In comparison, LDI-Net achieved 5.6% and 5.3% improvements over PANet and Libra R-CNN, respectively. To verify the detection of large aspect ratio logos, we also compared against guided anchoring and achieved a 2.4% accuracy improvement, which also shows our method's advantages in long-range interactions. In comparison with the other baselines, the proposed LDI-Net improved mAP by 10.4%, 8.8%, 7.5%, 6.6%, 4.3%, 5.2%, 6.0%, 2.0%, and 14.4% compared with FSAF, ATSS, GFL, Soft-NMS, generalized IoU, distance IoU, complete IoU, SABL, and sparse R-CNN, respectively.

Qualitative Analysis
In Figure 5, we present some illustrative examples for LDI-Net. Our model achieved good detection performance on regular large logos and logos with large aspect ratios. For example, the detection of Warburtons and CINNZEO showed good performance on multiscale logos, while the detection of Bubbly and BOLD ROCK in the second row showed great results on deformed logos. Our model also achieved over 98% detection accuracy on Skittles and eatZis, which occupy a large proportion of the image, and on logos with disparate aspect ratios (e.g., Intusium23 and Brigham's) in the third row. We also set up different iteration counts to compare the proposed strategy and dynamic R-CNN in terms of convergence and accuracy. Figure 6 provides the performance trend as the iterations increased, showing that our method gradually stabilized starting from 250,000 iterations and converged at 350,000. During the training process, it was clear that our method maintained a higher mAP than that of dynamic R-CNN.

Results on LogoDet-3K-1000
LogoDet-3K-1000 is a subset of LogoDet-3K with a suitable number of images and categories. Experiments on this dataset helped in further evaluating our model. We used different strategies on LogoDet-3K-1000 and list the results in Table 3. The proposed strategy outperformed the other baseline approaches and achieved 90.4% mAP. In detail, it achieved 1.3%, 2.0%, 1.3%, 1.9%, and 3.6% improvements compared with PANet, Libra R-CNN, guided anchoring, dynamic R-CNN, and sparse R-CNN, respectively.

Results on QMUL-OpenLogo
We provide experimental results on QMUL-OpenLogo to verify the effectiveness of LDI-Net. As shown in Table 4, our model obtained 56.3% mAP, outperforming all the other baselines. It achieved 2.5% and 14.7% improvements compared with the classical algorithms faster R-CNN and SSD, respectively. Both dynamic R-CNN and Libra R-CNN achieved 54.6% mAP; our method still achieved a 1.7% improvement over them. Our method was also superior to feature-fusion-based methods, e.g., PANet. These comparisons further verify the superiority of our method in information exchange and feature fusion.

Ablation Study
In this section, we conduct a comprehensive analysis of the effect of each LDI-Net component, namely, LD involution, MRNAS, and ARM, on four logo datasets, comparing detection and localization accuracy against dynamic R-CNN. We used dynamic R-CNN equipped with ResNet-50 and FPN as the baseline.

LD Involution
LD involution is a targeted solution to the large aspect ratio problem. As shown in Table 6, LD involution achieved 87.3% mAP, outperforming the other baselines on LogoDet-3K. Table 7 shows that our method outperformed the baseline by 1.2% on LogoDet-3K-1000. Our method also achieved 1.5% and 0.7% improvements on QMUL-OpenLogo and FlickrLogos-32, as shown in Tables 8 and 9, respectively. In the comparison with involution, our method also showed superiority; in particular, on QMUL-OpenLogo, it achieved a 1% improvement. Figure 7 shows the visualization comparison results of dynamic R-CNN and LD involution on LogoDet-3K. Logo images with large aspect ratios are used for visualization, which shows that the accuracy of our method was higher than that of the baseline. For example, for logos with very wide aspect ratios, our model achieved 3% and 27% improvements over the baseline, as shown in Figure 7a,b, respectively. In addition, LD involution could identify the small logo 'SEADOO', while the baseline could not, as shown in Figure 7b. This indicates that LD involution's extraction of long-range information is also effective for small objects with a large aspect ratio.

MRNAS
We applied the MRNAS module to solve multiscale problems and achieved good results. The MRNAS module performed well on two large logo detection datasets, i.e., LogoDet-3K and LogoDet-3K-1000. As shown in Table 6, on LogoDet-3K, the mAP of our model with MRNAS reached 87.5%, a 0.4% improvement over dynamic R-CNN. On LogoDet-3K-1000, the model with MRNAS reached 89.5% mAP, achieving 1% improvement over the baseline, as seen in Table 7. In addition, MRNAS achieved good performance on the other two datasets, in which mAP was significantly improved, as can be seen in Tables 8 and 9.
We provide some illustrative examples of logos with different scales from LogoDet-3K, as shown in Figure 8. In the first pair, the baseline could not detect the rightmost logo. In contrast, our method had a detection accuracy of 92%. Meanwhile, LDI-Net improved the detection mAP from 44% to 98% compared with dynamic R-CNN, which also shows the superiority of the proposed method in the second pair.

ARM
We conducted ablation experiments for ARM on four datasets, and the experimental results demonstrate that ARM works better than the baselines. As shown in Table 6, the performance of a single ARM module was comparable to that of two other modules, up to 88.2% on LogoDet-3K. In the other three datasets (Tables 7-9), ARM also achieved an improvement in mAP. The results indicate that ARM can be an effective solution to logo deformation.
We selected a variety of deformed logos for different reasons to fully illustrate the functionality of ARM in a visualization experiment. Figure 9a shows that our model could still achieve 82% detection accuracy on the logo that was deformed due to the camera angle. As shown in Figure 9b, our model could detect incomplete logo 'TIMEX', which confirms the effectiveness of our model.
After testing the components individually, we conducted experiments combining LD involution and MRNAS to further validate the model effects, as shown in Tables 6-9. On all four datasets, the combination of LD involution and MRNAS performed better than adding only one module.

Conclusions and Future Work
In this paper, we proposed a logo detection model, long-range dependence involutional network (LDI-Net), to detect logos with large aspect ratios by adding a new operator and a self-attention mechanism. Meanwhile, MRNAS was proposed to construct a novel multipath topology to realize multiscale logo detection. ARM was also introduced to enhance the ability of the proposed model to handle logo deformation.
So far, LDI-Net has worked well, but there are some limitations. Although multiscale logo detection is handled well, there is still room for improvement in the localization and classification of some small logos. Our method also solves the problem of logo deformation caused by occlusion and rotation very well, but the deformations caused by reflection and distortion need to be studied more specifically. In future work, we will continue to conduct in-depth research to solve the above problems. Further, we will address other challenges of logo detection, such as small, similar, and low-resolution logos.

Conflicts of Interest:
The authors declare no conflict of interest.