FastDARTSDet: Fast Differentiable Architecture Joint Search on Backbone and FPN for Object Detection

Abstract: Neural architecture search (NAS) is a popular branch of automatic machine learning (AutoML) that aims to search for efficient network structures. Many prior works have explored a wide range of search algorithms for classification tasks and have achieved better performance than manually designed network architectures. However, few works have explored NAS for object detection because of the difficulty of training convolutional neural networks from scratch. In this paper, we propose a framework, named FastDARTSDet, that searches directly on a large-scale object detection dataset (MS-COCO). Specifically, we apply the differentiable architecture search method (DARTS) to jointly search the backbone and feature pyramid network (FPN) architectures for object detection. Extensive experimental results on MS-COCO show the efficiency and efficacy of our method: it achieves 40.0% mean average precision (mAP) on the test set, outperforming many recent NAS methods.


Introduction
Machine learning has achieved great success in various tasks, including computer vision [1][2][3], natural language processing [4,5], and digital image and signal processing [6][7][8]. Neural architecture search (NAS) aims to automatically find the best neural network architecture in a given search space, instead of relying on manual design based on a large amount of trial and error and expert knowledge. Several approaches to NAS have been explored, including reinforcement learning, evolutionary algorithms, and differentiable NAS (DARTS). The works [9][10][11][12] adopt reinforcement learning by treating the generation of an architecture as the agent's action. Another reinforcement-learning-based method [13], named ENAS, proposes a weight-sharing strategy among the same operations in different architectures, which significantly reduces the time cost of the search process. Another work [14] uses maximum flow from graph theory to address the NAS task. Other works [15][16][17][18][19][20] adopt genetic algorithms by first encoding the neural architecture and then proposing a group of architectures as a population. Architectures in the population are selected according to their performance, and new individuals are then generated by crossover and mutation. Apart from the above methods, one-shot neural architecture search [21] has become one of the most popular NAS paradigms because of its high efficiency. Unlike reinforcement-learning-based methods [9][10][11][12] and evolutionary algorithms [15][16][17][18][19][20], which generate one candidate network at a time, one-shot methods construct a supernet containing all the connections in the search space and jointly train all operation weights. DARTS [22] further introduces architecture parameters and formulates NAS as a bi-level optimization problem, which is solved by stochastic gradient descent (SGD).
After the gradient-based search phase, DARTS discretizes the supernet and infers the final architecture simply according to the values of the architecture parameters, which is referred to as the discretization procedure. Some works [23,24] note the huge memory cost of one-shot NAS and are dedicated to more efficient search methods. Moreover, the works [25,26] attempt to improve the optimization algorithm for more stable search. Differentiable architecture search [22] has become the dominant approach, with a myriad of follow-up works [26][27][28][29][30] that reduce the search cost.
However, for object detection, the complexity of the networks and the difficulty of training them mean that the above methods, which focus on classification, cannot search directly on object detection datasets. Unlike classification models, object detection models contain two modules: a backbone and a feature pyramid network (FPN). To reduce the search cost, recent works search the architecture of only the backbone [31,32] or only the FPN [33][34][35]. This problem has attracted widespread attention in industry because, in practice, manually adjusting each module of a standard detection model is inefficient and suboptimal. It is difficult to evaluate the trade-off between inference time and accuracy, and the representation ability of each module, across different datasets. In particular, the authors of SM-NAS [36] find that "the combination of cascade RCNN and resnet18 (not the standard detection model) is faster and more accurate than the combination of FPN and resnet50 in coco [37] and BDD [38] (automatic driving data set)". Although many prior works have explored neural architecture search for object detection, they suffer from vast GPU memory cost and search time, e.g., 44 GPU-days for DetNAS [31]. The above NAS methods for object detection adopt the RetinaNet and Faster R-CNN frameworks. However, few NAS methods adopt the YOLO framework [39][40][41][42], which is a more efficient detector. EAutoDet [43] proposes an efficient architecture search method for the YOLO framework and is able to discover effective architectures in a few GPU-days. Inspired by the above methods, we build on the YOLOv5 framework and apply DARTS [22] to jointly search the architectures of the backbone and FPN. Our method aims to discover optimal architectures in a few GPU-days.
The contributions of our approach can be summarized as follows: (1) applying DARTS to the object detection task and supporting search over the architectures of both the backbone and FPN; and (2) strong performance on the COCO dataset, outperforming many recent manually designed networks.

Related Work
This section introduces prior works related to our method. We first introduce some classic object detection methods in Section 2.1. Then, we introduce efficient neural architecture search methods, most of which aim at the classification task. Finally, we introduce recent NAS methods for object detection.

Object Detection
Existing object detection frameworks usually consist of several parts: a CNN backbone to extract features, a feature fusion module to fuse extracted features at different scales, a region proposal network (RPN) to generate candidate target regions (in two-stage detectors [2]), and a detection head to classify objects and predict bounding boxes. Since each module plays an important role in object detectors, recent advances focus on the design of each module. For example, besides directly reusing existing classification backbones such as VGG [44] and ResNet [1], some researchers have put forward new backbones designed specifically for detectors (DetNet [45]). FPN [46] is one of the typical networks exploring the design of the feature fusion neck; it uses a top-down architecture to fuse features at different scales. Although great progress has been made through these designs, many of these detectors focus on just one module and ignore the relationship between the backbone and the head, which may lead to sub-optimal results. R-CNN [47] considers region proposals and achieves high accuracy, but it cannot detect objects in real time, even with Fast R-CNN [48] and Faster R-CNN [2], because of the region generation process. Apart from the above classic object detection methods, there are also manually designed detectors for rotation detection, including regression-based methods [49,50] and classification-based models [51][52][53], especially for high-precision detection of small objects [54][55][56].

One-Shot Architecture Search
One-shot NAS [21] regards neural network architectures as directed acyclic graphs (DAGs) and constructs a supernet containing all types of operations and connections in the search space. Each candidate neural network architecture can be seen as a sub-graph of the supernet. Based on one-shot NAS, DARTS [22] introduces architecture parameters to represent the importance of candidate operations and connections, which are then optimized alternately with the network weights by SGD. XNAS [57] addresses NAS as an online selection task and adopts the prediction with expert advice (PEA) theory to select operations and connections from the search space. Other methods [30,58] reduce the GPU memory requirement by gradually removing connections in the supernet. Ref. [30] proposes to prune the connections with low confidence and increase the depth of the supernet (the number of cells). Ref. [23] proposes PC-DARTS, which reduces the GPU memory requirement by sampling 1/K of the channels for each operation in the one-shot model, where K is a hyperparameter controlling the fraction of channels of each convolution that are activated at each iteration. PC-DARTS also accumulates the architecture importance across iterations to stabilize the optimization. Ref. [58] uses Bayesian learning and compression to compute the entropy of the connections, according to which the architecture can be pruned to reduce the GPU memory cost and accelerate the search phase.

NAS for Object Detection
DetNAS [31] follows DARTS and proposes to search neural architectures for object detection by SGD. However, it requires vast GPU memory and search time to discover an architecture. Recently, researchers [27][28][29] proposed a single-path search strategy to reduce the memory cost of searching for image classification. However, such a single-path strategy increases the difficulty of training supernets. Since object detection architectures are more complex and harder to train, the single-path strategy is not well suited to searching object detection architectures. Moreover, object detection networks have feature pyramid networks (FPN) to fuse features at different scales, which distinguishes them from classification networks. Since the FPN plays an important role in object detection networks, many works are dedicated to searching for optimal FPN architectures. NAS-FPN [33] searches the FPN architecture for RetinaNet [59], a popular one-stage detection framework. Specifically, FPN architectures are generated by an RNN controller, which is trained by reinforcement learning (RL). However, NAS-FPN requires vast GPU memory and search time. EAutoDet [43] proposes an efficient architecture search method for the YOLO framework and achieves strong performance on the COCO dataset. Inspired by the above related works, we search on the YOLO framework and directly apply DARTS [22] to jointly search the backbone and FPN architectures.

Search Space
This work designs search spaces for the backbone network and the feature pyramid network separately. Specifically, for the backbone, we consider efficiency and effectiveness and refer to DARTS [22].

Search Space for Backbone
The macro architecture of the backbone is shown in Figure 1, and we search the operation types and connections of each cell. The backbone consists of 3N normal cells and three reduction cells. Unlike DARTS, where all normal cells (and all reduction cells) share the same architecture, we search each cell independently, i.e., all cells can have different architectures. The importance of candidate operations and connections is represented by the architecture parameters α, which are trained by SGD during the search stage. After searching, we derive the final architecture according to the magnitude of α. Specifically, for each intermediate node, we preserve two connections from different predecessors and select the best operation on each of the two connections.
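As a minimal sketch of this derivation step, the following Python snippet keeps, for one intermediate node, the two incoming edges whose best operation has the largest architecture weight. The operation names, the dictionary layout, and the exclusion of the zero operation during derivation (a convention borrowed from DARTS [22]) are assumptions made for illustration, not details confirmed in this paper.

```python
from typing import Dict, List, Tuple


def derive_node(alpha: Dict[int, Dict[str, float]], keep: int = 2) -> List[Tuple[int, str]]:
    """Discretize one intermediate node.

    alpha maps a predecessor index to {operation name: architecture weight}.
    Returns the kept (predecessor, operation) pairs, sorted by predecessor.
    """
    scored = []
    for pred, weights in alpha.items():
        # Assumption: "zero" means "no connection" and is never selected.
        candidates = {op: w for op, w in weights.items() if op != "zero"}
        best_op = max(candidates, key=candidates.get)
        scored.append((candidates[best_op], pred, best_op))
    # Each predecessor contributes one edge, so the top-`keep` edges
    # necessarily come from different predecessors.
    scored.sort(key=lambda item: item[0], reverse=True)
    return sorted((pred, op) for _, pred, op in scored[:keep])


# Hypothetical architecture weights for a node with three predecessors.
alpha_node = {
    0: {"zero": 0.5, "identity": 0.1, "conv3x3": 0.3, "max_pool": 0.1},
    1: {"zero": 0.1, "identity": 0.2, "conv3x3": 0.2, "conv5x5": 0.5},
    2: {"zero": 0.2, "identity": 0.4, "conv3x3": 0.25, "max_pool": 0.15},
}
print(derive_node(alpha_node))  # [(1, 'conv5x5'), (2, 'identity')]
```

In this toy example the edge from predecessor 0 is dropped because its strongest non-zero operation (weight 0.3) is weaker than the best operations on the other two edges.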

[Figure 1: macro architecture of the backbone. Block labels recovered from the figure: Stem, Normal.]

Search Space for FPN
We extract three features of different spatial sizes from the backbone and pass them through the FPN. As shown in Figure 2, nodes denote feature maps and edges denote operations. Each normal cell is a supernet that is searched independently during the search process. As in the search space of the backbone, there are 7 candidate operations. Also as in the backbone, we introduce architecture parameters $\tilde{\gamma}$ to denote the importance of candidate operations for each normal cell, whose normalized weights are denoted as $\gamma = \mathrm{softmax}(\tilde{\gamma})$. To describe the features of the nodes in a normal cell, we take the j-th node as an example without loss of generality. The feature of node $v_j$ is

$$z_j = \sum_{i<j} \sum_{o \in O} \gamma^{o}_{ij} \, e^{o}_{ij}(z_i),$$

where the outer sum runs over the predecessors of $v_j$, O is the candidate operation set, $z_i$ is the feature of node $v_i$ (a predecessor of node $v_j$), and $e^{o}_{ij}$ denotes the operation o on edge $e_{ij}$ that connects nodes $v_i$ and $v_j$. After searching, only two connections are preserved for each node, and only one operation is selected for each connection according to the magnitude of γ.
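To illustrate how a node feature is formed from softmax-weighted candidate operations, the sketch below mixes three toy scalar operations on each incoming edge and sums the results over the predecessors. The scalar features, the operation names, and the layout of the raw (pre-softmax) parameters are hypothetical stand-ins; a real cell would apply convolutions and pooling to feature maps.

```python
import math
from typing import Callable, Dict, List


def softmax(raw: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw parameters."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


# Toy candidate operations acting on scalar "features" (assumption).
OPS: Dict[str, Callable[[float], float]] = {
    "zero": lambda z: 0.0,
    "identity": lambda z: z,
    "double": lambda z: 2.0 * z,  # stand-in for a learned convolution
}


def node_feature(pred_features: List[float], raw_gamma: List[List[float]]) -> float:
    """Sum, over predecessors, of the softmax-weighted mix of all operations."""
    names = list(OPS)
    z_j = 0.0
    for z_i, raw in zip(pred_features, raw_gamma):
        gamma = softmax(raw)  # normalized weights on this edge
        z_j += sum(g * OPS[name](z_i) for g, name in zip(gamma, names))
    return z_j


# Two predecessors with uniform raw parameters: each operation gets weight 1/3.
z = node_feature([3.0, 6.0], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(round(z, 6))  # 9.0
```

With uniform weights the first edge contributes (0 + 3 + 6) / 3 = 3 and the second (0 + 6 + 12) / 3 = 6, so the node feature is 9.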

The Proposed FastDARTSDet
Based on DARTS [22], we build a supernet containing all candidate operations and connections in the search space. Nodes of the supernet represent feature maps and edges denote operations. We use $e_{ij}$ to denote the edge from node $v_i$ to $v_j$, and $\{e^{o}_{ij}, o \in O\}$ to denote the candidate operations on edge $e_{ij}$, where O is the candidate operation set for each edge. Similar to prior works [21,22], we design O = {zero, identity, max pooling, average pooling, 3 × 3 convolution, 5 × 5 convolution, 3 × 3 dilated convolution, 5 × 5 dilated convolution}. To search architectures with a gradient-based method, we follow DARTS [22] and define the architecture parameters $\tilde{\alpha}^{o}_{ij}$ to represent the importance of the different candidate operations on each edge. The outputs of the operations $e^{o}_{ij}$ are averaged with weights $\alpha^{o}_{ij}$ to obtain the output of edge $e_{ij}$, represented as $o_j(x_i)$:

$$o_j(x_i) = \sum_{o \in O} \alpha^{o}_{ij} \, e^{o}_{ij}(x_i), \qquad \alpha^{o}_{ij} = \frac{\exp(\tilde{\alpha}^{o}_{ij})}{\sum_{o' \in O} \exp(\tilde{\alpha}^{o'}_{ij})},$$

where α denotes the normalized architecture parameters and $x_i$ denotes the output of $v_i$. The output of $v_j$ can thus be computed as follows:

$$x_j = \sum_{i<j} o_j(x_i).$$

Similar to DARTS [22], we address NAS as a bi-level optimization problem and solve it by SGD. The optimization problem is formalized as follows:

$$\min_{\alpha, \gamma} \; \mathcal{L}_{val}\big(W^{*}(\alpha, \gamma), \alpha, \gamma\big) \quad \text{s.t.} \quad W^{*}(\alpha, \gamma) = \arg\min_{W} \mathcal{L}_{train}(W, \alpha, \gamma),$$

where W denotes the supernet weights, α denotes the normalized architecture parameters for the backbone following Equation (1), γ denotes the normalized architecture parameters for the FPN module, and $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the losses on the data used to train the network weights and the architecture parameters, respectively. Unlike DARTS, which searches on a rather simple classification task, we search on a much more complex computer vision task, object detection. The major differences lie in two aspects: (1) detection datasets are larger than classification datasets, and (2) detection models are more complex than classification models. Specifically, detection models usually contain multiple parts, including the backbone, the feature pyramid network, and the detector head.
In this work, having defined the joint search space of the backbone and FPN in Section 3.1, we adopt the differentiable neural architecture search algorithm to search CNN architectures for the object detection task.
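The alternating scheme used to solve the bi-level problem can be sketched on a toy one-edge supernet. The example below assumes the first-order approximation of DARTS (the architecture parameters are updated on held-out data with the weights fixed, then the weights are updated on training data); the paper does not state which approximation it uses. The two candidate operations (a learnable scaling and the zero operation), the regression target y = 2x, the data, and the learning rates are all hypothetical.

```python
import math
from typing import List, Tuple


def softmax(raw: List[float]) -> List[float]:
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def mse(xs: List[float], w: float, alpha: List[float]) -> float:
    """Loss of the mixed edge p0 * (w * x) + p1 * 0 against the target 2 * x."""
    p0 = softmax(alpha)[0]
    return sum((p0 * w * x - 2.0 * x) ** 2 for x in xs) / len(xs)


def search(train_xs: List[float], val_xs: List[float], steps: int = 300,
           lr_w: float = 0.1, lr_alpha: float = 0.1) -> Tuple[float, List[float]]:
    w, alpha = 1.0, [0.0, 0.0]  # weight of the scaling op, raw architecture parameters
    for _ in range(steps):
        # Upper level: one gradient step on alpha using held-out data, w fixed.
        p0, p1 = softmax(alpha)
        g_p0 = sum(2.0 * (p0 * w * x - 2.0 * x) * w * x for x in val_xs) / len(val_xs)
        # Softmax derivative for two logits: dp0/dalpha0 = p0 * p1, dp0/dalpha1 = -p0 * p1.
        alpha[0] -= lr_alpha * g_p0 * p0 * p1
        alpha[1] += lr_alpha * g_p0 * p0 * p1
        # Lower level: one gradient step on w using training data, alpha fixed.
        p0 = softmax(alpha)[0]
        g_w = sum(2.0 * (p0 * w * x - 2.0 * x) * p0 * x for x in train_xs) / len(train_xs)
        w -= lr_w * g_w
    return w, alpha


train_xs, val_xs = [0.5, 1.0, 1.5], [0.8, 1.2]
w, alpha = search(train_xs, val_xs)
print(alpha[0] > alpha[1], mse(val_xs, w, alpha) < 1e-6)  # True True
```

After the search, the raw parameter of the useful operation exceeds that of the zero operation, so a discretization step that keeps the operation with the largest weight would select the learnable one.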

Protocols
The models are evaluated on the MS-COCO 2017 dataset [37].

MS-COCO Dataset
The MS-COCO dataset is a large-scale image dataset developed and maintained by Microsoft. Its tasks include recognition, segmentation, and detection. In classic object detection, the target location is given by a bounding box. Early detection datasets mainly targeted face detection and pedestrian detection; for example, the Caltech pedestrian dataset contains 350,000 bounding box labels. The Pascal VOC data includes 20 target categories, more than 11,000 images, and more than 27,000 target bounding boxes. More recently, a detection dataset derived from ImageNet provides 200 categories, 400,000 images, and 350,000 bounding boxes. Because some targets are strongly related to their surroundings rather than existing independently, it is meaningful to detect a target in a specific scene. Therefore, accurate location information is more important than a bounding box alone.

Experimental Settings
All our models are trained from scratch, that is, no ImageNet-pretrained weights are used to initialize our models. In the search process, a supernet with one normal cell (N = 1) in each block is built, as shown in Figure 1. The architecture parameters are defined to represent the importance of candidate operations and connections. The training set of COCO is divided into two parts, used for training the architecture parameters and the network weights, respectively. The final architecture is derived after alternately optimizing the architecture parameters and network weights for 30 epochs with the SGD optimizer. In the evaluation process, the discovered architectures are trained from scratch for 300 epochs with the SGD optimizer. The models are also deepened by increasing the number of normal cells N to 2. The hyper-parameters are set to those provided by YOLOv5 for a fair comparison. Our code is based on PyTorch, and all our experiments are conducted on V100 GPUs.
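The division of the training set into a part for network weights and a part for architecture parameters might look like the sketch below. The equal split, the fixed seed, and the function name `split_for_search` are assumptions for illustration; the ratio actually used is not stated here.

```python
import random
from typing import Iterable, List, Tuple


def split_for_search(image_ids: Iterable[int], weight_fraction: float = 0.5,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle image ids and split them into two disjoint parts.

    The first part is used to update the network weights, the second to
    update the architecture parameters.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # local RNG, so the split is reproducible
    cut = int(len(ids) * weight_fraction)
    return ids[:cut], ids[cut:]


weight_ids, arch_ids = split_for_search(range(10))
print(len(weight_ids), len(arch_ids))  # 5 5
```

Keeping the two parts disjoint means the architecture parameters are evaluated on images that the weights have not been fitted to, which is the role the validation loss plays in the bi-level formulation.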

Performance of the Supernet
The performance curves of the supernet during the search process are illustrated in Figure 3, including the loss functions, precision, recall, and mean average precision (mAP). The blue lines in Figure 3 are values averaged over all 80 classes in the COCO dataset. The three left columns report the bounding box loss, objectness loss, and classification loss. The two right panels on the top report the precision and recall on the training set. The two right panels on the bottom report the mAP on the validation set. Figure 3 shows that the supernet gradually converges during the search process, demonstrating the convergence of our search method.
The precision-recall and F1-score curves of each class are illustrated in Figure 4.

Performance of the Searched Model
The performance curves of the searched model during the evaluation process are illustrated in Figure 5. The model is trained from scratch for 300 epochs with the same hyper-parameters as YOLOv5. The precision-recall and F1-score curves of each class are illustrated in Figure 6. Figure 7 illustrates the confusion matrix among the 80 classes for the searched model.
To visualize the model performance, the ground-truth bounding boxes and the predicted bounding boxes of several images are displayed in Figure 8a,b. Overall, our model performs well and is able to detect most of the objects in the images. However, for small objects and occluded objects, the performance of our model is barely satisfactory. In addition, our model may confuse similar categories. For example, in the fourth image at the bottom, a boy holds a piece of paper; our model locates the paper correctly but misclassifies it as a laptop, since a laptop looks similar to a sheet of paper.

Ablation Study
To demonstrate the advantage of collaborative (joint) search, the results of searching only the backbone or only the FPN are reported in Table 1 and illustrated in Figure 9. Specifically, the default backbone architecture is that of DARTS [22], and the default FPN architecture is that of PANet [60]. The depth of the network is controlled by the number of stacked normal cells N. The baseline (default architectures for both backbone and FPN) achieves only 33.9% mAP when N = 1 and 38.1% mAP when N = 2. If the backbone is searched independently, the performance of the discovered model improves by 0.2% (for N = 1) and 0.6% (for N = 2); if the FPN is searched independently, it improves by 1.0% (for N = 1) and 0.9% (for N = 2); if the backbone and FPN are searched jointly, it improves by 2.5% (for N = 1) and 1.8% (for N = 2). We can draw the following conclusions. First, compared with the baseline, every search setting (backbone only, FPN only, or joint) yields an improved final architecture, which reflects the effectiveness of our algorithm. Second, comparing collaborative search with separate search, although all settings beat the baseline, the gain from collaborative search is clearly larger. This empirical result is consistent with our conjecture: previous studies tend to search only one part and fix the other, which ignores the connections between the different structures. End-to-end networks should be treated as a whole rather than as separate parts.
Finally, we notice that the gain shrinks as the model size increases: for N = 1, joint search surpasses independent search by nearly 2% mAP, while for N = 2, the improvement is less than 1%. This is because the complexity of the search increases with the model size; if the same experimental settings are kept, the benefit will inevitably decrease.
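The figures quoted in this ablation can be tabulated with a few lines of arithmetic. The sketch below assumes that the reported improvements are absolute mAP points added to the baseline; the dictionary layout and function names are illustrative only.

```python
# Baseline mAP (%) and reported gains (assumed to be absolute mAP points), keyed by N.
baseline = {1: 33.9, 2: 38.1}
gain = {
    "backbone only": {1: 0.2, 2: 0.6},
    "FPN only": {1: 1.0, 2: 0.9},
    "joint": {1: 2.5, 2: 1.8},
}


def absolute_map(setting: str, n: int) -> float:
    """mAP of a search setting, i.e., baseline plus its reported gain."""
    return round(baseline[n] + gain[setting][n], 1)


def joint_margin(n: int) -> float:
    """Gain of joint search over the better of the two independent searches."""
    best_independent = max(gain["backbone only"][n], gain["FPN only"][n])
    return round(gain["joint"][n] - best_independent, 1)


for n in (1, 2):
    print(n, absolute_map("joint", n), joint_margin(n))
# 1 36.4 1.5
# 2 39.9 0.9
```

Under this reading, the margin of joint search over the better independent setting is 1.5 points for N = 1 and 0.9 points for N = 2, which matches the trend that the benefit of joint search shrinks as the model grows.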

Comparison with Prior Methods
To show the effectiveness of our method, we compare the performance of the searched architectures with other state-of-the-art works on the COCO test-dev set in Table 2. Our discovered models (FastDARTSDet) achieve competitive or even better performance compared with peer NAS methods. Specifically, FastDARTSDet with N = 1 achieves 36.4% AP with only 5.8 M parameters, outperforming EfficientDet-D0 by 2.6% AP. FastDARTSDet with N = 2 achieves 40.0% AP with 6.9 M parameters, surpassing EfficientDet-D1 by 0.4% AP with a similar number of parameters. Moreover, our method requires only 4.2 GPU-days, significantly less than prior NAS methods for object detection. Additionally, most prior methods independently search either the backbone (DetNAS, EfficientDet) or the FPN module (NAS-FPN, NAS-FCOS, Auto-FPN), and only a few methods jointly search both the backbone and FPN (SM-NAS, Hit-Detector). On the one hand, compared with independent search methods, our search space is much larger; on the other hand, compared with other joint search methods, our method is much more efficient.

Table 1. Performance of joint and independent search for backbone and FPN on the MS-COCO validation set. The default backbone architecture is the one discovered by DARTS [22]. The default FPN architecture is that of PANet [60]. Specifically, the 'default' backbone uses the architecture discovered on the classification task, which costs 1.0 GPU-day to search. Consequently, for the setting of 'default' backbone and 'default' FPN, the search cost is 1.0 GPU-day. For the setting of 'default' backbone and 'searched' FPN, the search cost is 5.2 GPU-days (1.0 GPU-day to search the default backbone and 4.2 GPU-days to search the FPN). The best setting and the corresponding performance within each block are in bold.

Conclusions and Outlook
We have presented an efficient and effective search approach to discover optimal architectures for object detection, a relatively less explored area. Unlike previous works that search either the backbone or the FPN structure alone, our method searches the backbone and FPN architectures simultaneously. We adopt a differentiable algorithm in this complex search space and propose a kernel reusing technique to speed up the search process stably. Our method can discover outstanding architectures in 4.2 GPU-days, and their efficacy has been demonstrated by extensive experiments. In particular, the discovered architecture achieves 40.0 AP on COCO test-dev with 6.9 M parameters, and our lightweight model achieves 36.4 AP on COCO test-dev with 5.8 M parameters, which is competitive with or even better than state-of-the-art NAS methods for object detection.
For future work, we would formulate the architecture search problem as a special case of combinatorial optimization on graphs [65], which can be readily connected to recent advances in machine learning for combinatorial optimization [66]. In particular, it would be interesting to see whether useful architectures can be quickly retrieved from the architecture pool, as more and more architectures are found by experts and search algorithms. In this sense, architecture matching, or essentially graph matching, especially across multiple graphs [67] to estimate similarity, is an important direction to explore, since one would then not need to search for a brand-new architecture from scratch. We note a recent line of research on using machine learning, in particular graph neural networks, to achieve efficient and effective graph matching [68,69]. Readers are referred to [70] for a more comprehensive review. We believe that retrieval with fine search, rather than search from scratch, will be an important direction for future NAS research.