SSD7-FFAM: A Real-Time Object Detection Network Friendly to Embedded Devices from Scratch

The high computing and memory requirements are the biggest challenges in deploying existing object detection networks to embedded devices. Existing lightweight object detectors directly use lightweight neural network architectures such as MobileNet or ShuffleNet pre-trained on large-scale classification datasets, which results in poor network structure flexibility and is not suitable for some specific scenarios. In this paper, we propose a lightweight object detection network, Single-Shot MultiBox Detector (SSD)7-Feature Fusion and Attention Mechanism (FFAM), which saves storage space and reduces the amount of computation by reducing the number of convolutional layers. We offer a novel Feature Fusion and Attention Mechanism (FFAM) method to improve detection accuracy. Firstly, the FFAM method fuses high-level, semantically rich feature maps with low-level feature maps to improve the detection accuracy of small objects. A lightweight attention mechanism, formed by cascading channel and spatial attention modules, is then employed to enhance the target's contextual information and guide the network to focus on its easy-to-recognize features. SSD7-FFAM achieves 83.7% mean Average Precision (mAP), 1.66 MB of parameters, and a 0.033 s average running time on the NWPU VHR-10 dataset. The results indicate that the proposed SSD7-FFAM is more suitable for deployment to embedded devices for real-time object detection.


Introduction
As one of the fundamental visual recognition problems of computer vision, object detection is the basis of many other computer vision tasks, such as instance segmentation [1,2] and object tracking [3]. Object detection must not only identify object categories but also locate each instance. Object detection has been extensively studied in the literature. In recent years, benefiting from the rapid development of Deep Convolutional Neural Networks (DCNN), object detectors based on deep learning have achieved significant breakthroughs. Object detection is widely used in the real world, for example in robot vision, video surveillance, and autonomous driving. To improve detection accuracy, most research focuses on the design of increasingly complex object detectors such as R-CNN [1], the Single-Shot MultiBox Detector (SSD) [4], You Only Look Once (YOLO) [5], and their variants [2,6-9]. Although they achieve high detection accuracy, such object detection networks are usually challenging for embedded devices to handle due to computational and memory limitations. Therefore, the design and development of more efficient deep neural networks for real-time embedded object detection are highly expected.
Current state-of-the-art object detectors with deep learning can be mainly divided into two major categories: two-stage detectors [1,2,6,7] and one-stage detectors [4,5]. Two-stage detectors first generate a set of candidate boxes for the objects to be detected and then perform category prediction. Two-stage detectors have therefore often reported state-of-the-art results on many public benchmark datasets. However, they are relatively slow.

1. We propose a seven-layer convolutional lightweight real-time object detection network, SSD7-FFAM, which can be trained from scratch. It addresses the problems of existing lightweight object detectors that use pre-trained network models as backbones: their structures are fixed, difficult to optimize, and not suitable for specific scenarios.

2. A novel feature fusion and attention mechanism method is proposed to solve the reduction in detection accuracy caused by the decrease in the number of convolutional layers. It first combines high-level, semantically rich feature maps with low-level feature maps to improve the detection accuracy of small targets. At the same time, it cascades the channel attention module and the spatial attention module to enhance the contextual information of the target and guide the convolutional neural network to focus on the easily identifiable features of the object.

3. Compared with existing state-of-the-art lightweight object detectors, the proposed SSD7-FFAM has fewer parameters and can be applied to various specific embedded real-time detection scenarios.
The remainder of this paper is organized as follows. The related work is described in Section 2. In Section 3, the details of the proposed SSD7-FFAM are stated. The experimental results and discussions are then reported in Section 4. Finally, Section 5 concludes this paper.

Single-Stage Detectors
Different from two-stage detection algorithms, a single-stage detector directly trains a convolutional neural network to map image pixels to bounding-box coordinates. In 2016, Redmon et al. proposed a real-time detector called You Only Look Once (YOLO) [5]. YOLO divides the image space into a fixed number of grid cells and then makes predictions for each cell. YOLO achieved relatively good detection results at the time. However, YOLO only uses the last feature map for prediction, which is not suitable for multi-scale object detection. Later, to overcome the limitations of YOLO, Liu et al. proposed another single-stage object detector, the Single-Shot MultiBox Detector (SSD) [4]. SSD and YOLO both use a convolutional neural network for prediction, but SSD differs from YOLO in three respects: (1) it uses multi-scale feature maps for detection, so that relatively large feature maps detect relatively small objects and small feature maps are responsible for detecting large objects; (2) unlike YOLO, which uses fully connected layers, SSD directly uses convolution for detection; (3) SSD draws on the anchor concept in Faster R-CNN [6], and each cell sets prior boxes with different scales as the basis of the predicted bounding boxes, which reduces the difficulty of training. Other single-stage detectors have also been proposed recently, e.g., Single-Shot Refinement [25], RetinaNet [26], and CornerNet [27].
SSD uses an adjusted VGG as the backbone network and then adds additional convolutional layers to obtain more feature maps for detection. A basic SSD model is shown in Figure 1. SSD predicts objects on multiple feature maps, and each feature map predicts objects of different scales. The prediction module is divided into two parts: localization and classification. Redundant predicted boxes are then filtered through an algorithm called non-maximum suppression to form the final prediction result. By using multiple feature maps for object detection, SSD achieves detection accuracy comparable to Faster R-CNN at a faster speed. However, because a pre-trained VGG or an even deeper network is required as the backbone, SSD is not ideal for embedded devices with limited memory. Meanwhile, it is not well suited to detecting small objects, as each feature map makes predictions separately.
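Non-maximum suppression, the filtering step mentioned above, can be sketched in a few lines of plain Python. This is a minimal greedy version using corner-format boxes; production code would use an optimized implementation such as torchvision's `ops.nms`:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    remaining box and discard boxes overlapping it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

The 0.45 default matches the per-class Jaccard overlap used by SSD at test time.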

Deep Feature Fusion
Deep neural networks have a strong expressive ability by extracting deep feature maps. SSD uses the pyramid structure of the convolutional network to predict the target and achieves high-precision detection. A network called FPN [7], which combines high-level semantic information with low-level, high-resolution feature information, was proposed to achieve higher accuracy. FPN is a fully convolutional network whose backbone uses ResNet. FPN uses nearest-neighbor interpolation to upsample from top to bottom and then connects the result horizontally, by element-wise summation, with the feature map produced by a 1 × 1 convolution. Experiments applying FPN to RPN and Fast R-CNN [28] show that fusing different feature layers improves the accuracy of object detection, especially for small objects.
To detect small objects quickly and accurately, a multi-level feature fusion method that introduces contextual information into the SSD is proposed in the article Feature-Fused SSD: Fast Detection for Small Objects [29]. Gao et al. designed two different feature fusion modules: the connection module and the element summation module. Compared with SSD, its detection accuracy and speed have been improved to a certain extent. However, Feature-Fused SSD is also not suitable for embedded devices because of the increased computational complexity over SSD.

Visual Attention Mechanism
In cognitive science, the visual attention mechanism is essential for the human visual system to have such amazing data processing capabilities. In computer vision, the main research problem is how to establish a suitable calculation model to explain this attention mechanism. Introducing the attention mechanism in computer vision information processing can not only allocate limited computing resources to important targets but also produce results that meet human visual cognition requirements. Therefore, the visual attention mechanism has become a research hotspot in the field of computer vision.
In recent years, most research combining deep learning with the visual attention mechanism has focused on using masks to form attention. The mask uses new weights to identify critical features in the input image. Through training, the deep neural network learns the areas that need attention in each new image and thus forms attention. This kind of attention operates on the spatial domain [30] or the channel domain [31] and can be generated directly by the network after learning. Moreover, this kind of attention is differentiable, so models can learn the attention weights through gradient computation in forward and backward propagation [32].
Inspired by these excellent feature fusion and attention mechanism methods, we propose the SSD7-FFAM. The SSD7-FFAM is formed by adding the proposed FFAM to SSD7 [33]. The details of SSD7-FFAM are introduced in the next section.

Proposed Method
In this section, each part of the proposed SSD7-FFAM in Figure 2 is illustrated in detail. First, we introduce the specific structure of the entire network in Section 3.1. Then, the feature fusion module is described in Section 3.2. Next, we illustrate the attention module combined with feature fusion in Section 3.3. Section 3.4 then presents the prediction layers. Finally, Section 3.5 explains the loss function used during training.

Figure 2 depicts the specific structure of the proposed SSD7-FFAM. Two feature maps from different convolutional layers first pass through the feature fusion module to form a new feature map and then pass through the attention module to generate a feature map that is fed to the prediction layers. In SSD [4], the feature maps extracted by VGG and the additional convolutional layers are used for object localization and classification. However, the shallow feature maps lack important semantic information, which causes the detection accuracy to be inferior to that of two-stage detectors; SSD is therefore not conducive to the detection of small objects. Building on SSD7 [33], the proposed SSD7-FFAM adds two novel modules, a feature fusion module and an attention module, to compensate for the reduction in detection accuracy caused by the smaller number of convolutional layers. The feature fusion module combines two feature maps of different scales into a new feature map after transformation, enhancing the semantic information of the shallow feature maps. The attention module is a lightweight module that combines channel attention and spatial attention; it markedly improves network performance while adding only a small amount of computation and few parameters.

Feature Fusion Module
SSD7 [33] uses four feature maps of different sizes to predict objects independently, and there is no connection between each feature map. The deep low-resolution feature map has undergone many convolution operations and can extract rich semantic information, which helps distinguish the object and the background. However, due to excessive downsampling, a lot of detailed information is lost. The shallow, high-resolution feature map contains detailed information of the object, which is conducive to accurate object positioning. However, because it performs fewer convolution operations, the shallow feature map cannot extract enough high-level features, and the semantic information is insufficient. Therefore, combining shallow feature maps and deep feature maps to generate a feature map with high resolution and full semantic information can be beneficial to the detection of small targets.
The schematic diagram of the semantic interpolation processing proposed in FPN [7] is shown in Figure 3. The dark color indicates richer semantic information. The low-level feature maps are passed to the high level through convolution, the high-level feature maps are passed to the low level through upsampling, and the new feature maps generated by element-wise summation of the two are used to predict the object. Unlike FPN, in SSD7-FFAM, we use 2 × 2 deconvolutions instead of nearest-neighbor interpolation for upsampling. The deconvolution layer introduces non-linearity and therefore improves the network's feature representation ability. To make the sizes of the feature maps consistent, some deconvolutions are followed by zero-padding, i.e., adding pixels with a value of zero to a row or column. The two adjusted feature maps are then merged by element-wise summation so that the deep semantic information is propagated to the shallow feature maps. A Rectified Linear Unit (ReLU) follows the element-wise summation module. Finally, a smooth convolution with a 3 × 3 kernel is used to reduce the aliasing caused by enlarging the feature map. Figure 4 shows an example of the structure of the feature fusion module used in the proposed SSD7-FFAM. The new Conv4 feature map is passed to the attention module along with the other two feature maps.
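The fusion steps above (2 × 2 deconvolution, zero-padding, element-wise sum, ReLU, 3 × 3 smooth convolution) can be sketched in PyTorch as follows. The 37 × 37 and 18 × 18 map sizes in the usage example come from the prediction-layer scales given later; the channel widths are illustrative assumptions, as the paper does not list them here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Fuse a deep, low-resolution map into a shallow, high-resolution one:
    2x2 stride-2 deconvolution for upsampling, zero-padding to align sizes,
    element-wise sum, ReLU, then a 3x3 'smooth' conv to reduce aliasing."""
    def __init__(self, deep_ch, shallow_ch):
        super().__init__()
        # stride-2 deconvolution doubles the spatial size
        self.up = nn.ConvTranspose2d(deep_ch, shallow_ch, kernel_size=2, stride=2)
        self.smooth = nn.Conv2d(shallow_ch, shallow_ch, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        up = self.up(deep)
        # zero-pad if the deconvolved map is a row/column short of the target
        dh = shallow.size(2) - up.size(2)
        dw = shallow.size(3) - up.size(3)
        up = F.pad(up, (0, dw, 0, dh))
        return self.smooth(F.relu(shallow + up))
```

For example, fusing an 18 × 18 deep map into a 37 × 37 shallow map: the deconvolution yields 36 × 36, one row and column of zeros restore 37 × 37, and the output keeps the shallow map's resolution.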

Channel Attention Module
In order to effectively understand the scope of the object, many previous works [34,35] adopted the average pooling method. In this paper, SSD7-FFAM applies both maximum pooling and average pooling to the feature map after feature fusion so that the channel attention focuses on the object and its contextual information. The schematic diagram of the channel attention module is shown in Figure 5a. Firstly, we perform average pooling and maximum pooling on the feature map F ∈ R^(H×W×C) to obtain two 1 × 1 × C channel descriptors. Secondly, they are sent to a two-layer parameter-sharing neural network. The numbers of neurons in the two layers are C/r and C, where r is the reduction ratio, and the activation function is ReLU. Then, the weight coefficient M_c is obtained by adding the two features and applying a Sigmoid activation function. In summary, the channel attention is calculated as follows:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) (1)

where σ denotes the Sigmoid function and MLP is the shared two-layer network.
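A minimal PyTorch sketch of this channel attention module follows. The reduction ratio r is a free parameter in the text, so the value used in the test is an arbitrary illustrative choice:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average and max pooling produce two 1x1xC
    descriptors, a shared two-layer MLP (C -> C/r -> C with ReLU) processes
    each, and the sum goes through a Sigmoid to give per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)
```

The returned M_c is broadcast-multiplied with the input feature map to reweight its channels.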

Spatial Attention Module
The authors of [36] point out that the merge operation in the channel dimension can highlight the feature map's informative areas. After the channel attention module, we introduce a spatial attention module to focus on where features are meaningful. Similar to channel attention, given the feature map F′ after the channel attention module, the spatial attention module first performs average pooling and maximum pooling along the channel dimension to obtain two H × W × 1 descriptors and concatenates them along the channel axis. Then, the weight coefficient M_s is obtained after a 7 × 7 convolutional layer with a Sigmoid activation function. The spatial attention module is shown in Figure 5b and is computed as:

M_s(F′) = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)])) (2)

where f^(7×7) denotes a convolution with a 7 × 7 kernel and σ is the Sigmoid function. Figure 6 shows the structure of the attention module used in SSD7-FFAM, where F′ is the new feature obtained by multiplying M_c with the input feature, and F″ is the result of multiplying M_s by F′. The final feature map F_f is the result of passing the sum of F′ and F″ through the ReLU activation function.
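A matching sketch of the spatial module, together with the combination into F_f described above (channel weighting, then spatial weighting, then the sum through ReLU), might look like this:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average and max pooling give two
    H x W maps, concatenated and passed through a 7x7 conv + Sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # H x W average descriptor
        mx = x.amax(dim=1, keepdim=True)    # H x W max descriptor
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def attention_module(f, m_c, spatial):
    """Combine the two attentions as described in the text:
    F' = M_c * F, F'' = M_s(F') * F', and F_f = ReLU(F' + F'')."""
    f1 = m_c * f
    f2 = spatial(f1) * f1
    return torch.relu(f1 + f2)
```

Here `m_c` would come from the channel attention module applied to the same feature map.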

Prediction Layers
SSD [4] draws on the anchor concept in Faster R-CNN [6] and sets prior boxes with different scales in each unit of the feature map. The predicted bounding boxes are based on these prior boxes, which reduces the training difficulty to a certain extent. In SSD7-FFAM, we send the four feature maps produced by the feature fusion and attention modules to the prediction layers for object classification and localization. Their scales are (37, 37), (18, 18), (9, 9), and (4, 4), and the number of prior boxes set for each feature map is different. The scale of the prior boxes follows a linearly increasing rule: as the size of the feature map decreases, the scale of the prior box increases:

s_k = s_min + ((s_max − s_min)/(m − 1)) × (k − 1), k ∈ [1, m]

where s_k represents the ratio of the prior box size to the feature map, s_min and s_max represent the minimum and maximum values of the ratio (set to 0.2 and 0.9 in this paper), and m is the number of feature maps. Each prior box of each unit outputs a set of independent prediction values, divided into two parts: the confidence of each category and the location of the bounding box. In the prediction process, the category with the highest confidence is the category to which the bounding box belongs. In particular, when the first confidence value is the highest, the bounding box does not contain an object, because the background is also treated as a special category. The location of the bounding box contains four values (cx, cy, w, h), which represent the center coordinates, width, and height of the bounding box.
For an m × n feature map, SSD7-FFAM sets k prior boxes in each unit, then each unit needs (c + 4) × k prediction values, so a total of (c + 4) × k × m × n predicted values, where c is the number of categories. Similarly, SSD7-FFAM uses convolution for prediction and needs (c + 4) × k convolution kernels to complete the detection process of this feature map.
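The scale rule and the prediction-value count above can be checked with a short script. The values k = 4 priors per cell and c = 11 classes (10 NWPU categories plus background) in the example are illustrative assumptions, since the paper does not fix k per map here:

```python
def prior_scales(m, s_min=0.2, s_max=0.9):
    """Linearly increasing prior-box scales for m feature maps:
    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)."""
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 3)
            for k in range(1, m + 1)]

def num_predictions(m, n, k, c):
    """Total prediction values for an m x n feature map with k prior
    boxes per cell and c classes (background included): (c + 4) * k * m * n."""
    return (c + 4) * k * m * n

print(prior_scales(4))                 # scales for four feature maps
print(num_predictions(37, 37, 4, 11))  # the 37 x 37 map under the assumptions
```

With m = 4 the scales land at 0.2, 0.433, 0.667, and 0.9, matching the linear rule.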

Loss Function
In the training process, we first determine which prior box matches each ground truth; the bounding box corresponding to the matching prior box is responsible for predicting it. If a prior box does not match any ground truth, it can only match the background and is called a negative sample. Intersection over Union (IoU), the ratio of the intersection area of two boxes to their union area, is used in the matching process:

IoU(B_gt, B_p) = area(B_gt ∩ B_p) / area(B_gt ∪ B_p)

where B_gt is a ground-truth box and B_p is a prior box. The matching principle has two parts. First, each ground truth is matched with the prior box that has the highest IoU with it, ensuring that every ground truth matches some prior box. Second, each remaining unmatched prior box is also matched to a ground truth if their IoU is higher than the threshold (0.5 in our experiments).

There are many more negative samples than positive samples because there are fewer ground truths than prior boxes. SSD7-FFAM uses hard negative mining: the negative samples are arranged in descending order of confidence loss, and the top-k with the largest loss are selected as negatives so that the ratio of positive to negative samples is close to 1:3.
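The two-step matching strategy can be sketched in plain Python; the `iou_matrix` input is assumed to be precomputed from the prior boxes and ground truths:

```python
def match_priors(iou_matrix, threshold=0.5):
    """Assign ground truths to prior boxes.
    iou_matrix[g][p] is the IoU between ground truth g and prior p.
    Step 1: each ground truth claims its best-overlapping prior, so every
    ground truth is matched. Step 2: any remaining prior whose best IoU
    with some ground truth exceeds the threshold is matched to it; the
    rest stay negatives, marked -1."""
    num_gt = len(iou_matrix)
    num_priors = len(iou_matrix[0])
    matches = [-1] * num_priors
    # step 1: best prior per ground truth
    for g in range(num_gt):
        best = max(range(num_priors), key=lambda p: iou_matrix[g][p])
        matches[best] = g
    # step 2: threshold matching for the remaining priors
    for p in range(num_priors):
        if matches[p] == -1:
            best_g = max(range(num_gt), key=lambda g: iou_matrix[g][p])
            if iou_matrix[best_g][p] > threshold:
                matches[p] = best_g
    return matches
```

Priors left at -1 are the pool from which hard negative mining later draws the 1:3 negatives.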
After the training samples are determined, the loss function is calculated. SSD7-FFAM uses the weighted sum of the location loss (loc) and the confidence loss (conf) as the loss function:

L(x, c, l, g) = (1/N) (L_conf(x, c) + αL_loc(x, l, g)) (8)

where N is the number of positive samples and the weight parameter α is set to 1 through cross-validation; x_ij^p ∈ {0, 1} is an indicator parameter: x_ij^p = 1 means that the i-th prior box matches the j-th ground truth of category p; c is the category confidence prediction, l is the location prediction, and g is the location parameter of the ground truth.
For the confidence loss, the proposed SSD7-FFAM uses the SoftMax loss:

L_conf(x, c) = −Σ_{i∈Pos} x_ij^p log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0), with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

For the location loss, SSD7-FFAM employs the Smooth L1 loss over the offsets of the positive boxes:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1(l_i^m − ĝ_j^m)

where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise.
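The piecewise Smooth L1 term, quadratic near zero and linear beyond |x| = 1, can be written directly:

```python
def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise.
    Quadratic near zero (stable gradients), linear for large residuals
    (robust to outliers)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5
```

The location loss sums this term over the four box offsets (cx, cy, w, h) of every positive prior.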

Experimental Results and Discussions
We conducted experiments on two widely used datasets: NWPU VHR-10 [37] and Pascal VOC [38]. First, we compared the proposed SSD7-FFAM with other state-of-the-art methods on the NWPU VHR-10 dataset, which contains many small targets. Second, we conducted ablation experiments on the Pascal VOC dataset to understand the role of each module in the proposed FFAM. Finally, we added the proposed FFAM to the original SSD300 to form SSD300-FFAM and carried out comparative experiments with other state-of-the-art detectors on the Pascal VOC dataset to verify the effectiveness of the FFAM. We adopted the standard mean Average Precision (mAP) to measure object detection performance.

Datasets Description
The NWPU VHR-10 dataset is a 10-class geospatial object detection dataset. The dataset contains 3775 object instances, including 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles. These object instances are manually marked with horizontal bounding boxes. The dataset contains a total of 800 images, of which only the 650 annotated images were used in this experiment.
The Pascal VOC [38] dataset is a standardized dataset with good image quality and complete labels. It is widely used to evaluate object detection and instance segmentation algorithms. It contains 20 common object categories, namely airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV monitor. Each image in the dataset has a one-to-one correspondence with its annotation file (class label and bounding box), and an image may contain multiple object categories. The training set and test set were divided in a ratio of approximately 1:1. Pascal VOC 2007 contains 9936 labeled images with 24,640 objects, of which 5011 were used for training and validation, and the remaining 4952 images were used for testing. Pascal VOC 2012 contains 11,540 labeled images with 27,450 objects. In this experiment, all the labeled images in Pascal VOC 2012 were used for training and validation [39].

Evaluation Metric
Like most other object detection methods, the mean Average Precision (mAP) was adopted as the evaluation metric. In object detection, the mAP is the average of the APs over all categories. The AP depends on the precision and recall scores and is calculated as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), mAP = (1/N) Σ_{i=1}^{N} AP_i (16)

where N is the number of classes. AP is computed as the area under the precision-recall curve. TP indicates the true positives, i.e., detections with IoU > 0.5; FP indicates the false positives; FN indicates the false negatives.
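AP as the area under the precision-recall curve can be sketched as follows; the rectangle-rule integration here is a simplification of the interpolated AP used by the VOC benchmark, shown only to make the definition concrete:

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class: sweep detections in descending score order,
    accumulate TP/FP counts, and integrate precision over recall.
    is_tp[i] is True if detection i matches an unmatched ground truth
    with IoU > 0.5; num_gt is the number of ground-truth objects."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle rule
        prev_recall = recall
    return ap
```

mAP is then simply the mean of this value over the N classes.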

Experimental Details
The experimental platform is a PC with a 3.4 GHz CPU, 8.0 GB RAM, and the Windows 10 operating system. The proposed method was implemented in Python 3.6 with the PyTorch framework, accelerated by two NVIDIA GeForce GTX 2080Ti GPUs with 11 GB of GPU memory each, CUDA 10.0, and cuDNN 7.0.
The proposed SSD7-FFAM network was trained from scratch using Stochastic Gradient Descent (SGD). The momentum and weight decay were set to 0.9 and 0.0005, respectively. The batch size was 32, which can be adjusted according to the capacity of the GPU used. Unlike SSD variants that use pre-trained backbone networks, all convolutional layers of SSD7-FFAM were initialized with the "Xavier" method [40]. The maximum number of iterations was set to 120 k. The initial learning rate was 0.001 and was multiplied by a Gamma value of 0.1 at 80 k and 100 k iterations. For the training procedure, giving the training model and the test model similar input sizes has been shown to be the best way to train a multi-scale object detection network [41]; in these experiments, the input image size of all networks was set to 300 × 300 × 3. SSD7-FFAM uses NMS with a Jaccard overlap of 0.45 per class and a confidence threshold of 0.01, keeping the top 200 detections per image. To further limit the number of predictions to parse, we set top_k to 5 in the testing phase. Other experimental settings were the same as in the original SSD.
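The optimizer and learning-rate schedule above translate directly to PyTorch. The model here is a stand-in module, as the full network definition is outside this sketch:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Stand-in for the SSD7-FFAM network.
model = torch.nn.Conv2d(3, 16, 3)

# Settings from the paper: SGD, initial lr 0.001, momentum 0.9,
# weight decay 0.0005, lr multiplied by gamma = 0.1 at 80 k and 100 k iters.
optimizer = SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[80_000, 100_000], gamma=0.1)

# Per-iteration training loop (sketch): forward pass, loss, backward pass,
# then optimizer.step() followed by scheduler.step().
```

Note that `MultiStepLR` is stepped once per iteration here, not per epoch, to match the 80 k / 100 k iteration milestones.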

Results on NWPU VHR-10
In this experiment, we compared the proposed SSD7-FFAM with other state-of-the-art detectors on the NWPU VHR-10 dataset. The results are shown in Table 1. The proposed SSD7-FFAM achieved an mAP of 83.7%, which is a 13% improvement over the original SSD7 [33]. SSD7-FFAM has higher accuracy than the state-of-the-art two-stage methods RICNN [42], R-P-Faster R-CNN [43], and Faster R-CNN [6]. Table 1. The comparison result of the proposed SSD7-FFAM and state-of-the-art methods on the NWPU VHR-10 dataset.

Ablation Study on Pascal VOC 2007
In order to test the effects of the feature fusion module and the attention mechanism module, we ran models with different modules added to the SSD7 on the Pascal VOC 2007 test set and discussed each method's results. All experimental settings were consistent, except for the network structure and certain components.
First, we tested the effect of the attention mechanism module. Table 2 shows that adding the attention mechanism to different networks gives different results. Comparing the original SSD7 (row 1) with SSD7 plus the attention module (row 2), the attention mechanism raises the mAP to 37.6%, while SSD7 alone achieves only 36.6%.
At the same time, we compared the numbers of parameters and found that the attention mechanism adds only 0.01 MB on average. In the following experiments, we therefore used the attention mechanism to obtain better detection results. Table 2. Effects of the feature fusion module and the attention mechanism module on the SSD7. FF indicates the feature fusion module; AM indicates the attention mechanism module; Deconv indicates the deconvolution module.

Then, we conducted comparative experiments to test the effect of the feature fusion module on our experimental platform. The results are also listed in Table 2. SSD7-FFAM used the feature fusion module introduced in Section 3, while FPSSD7 [45] used bilinear interpolation for upsampling during feature fusion. The proposed feature fusion method improves the mAP by 7.6% (row 4), whereas the FPSSD7 [45] method improves it by only 5.1% (row 3). These results demonstrate that feature fusion with deconvolution contributes more to detection accuracy than bilinear interpolation. The combination of FPSSD7 [45] and the attention mechanism yields a higher mAP of 44.2% (row 5). Overall, the proposed SSD7-FFAM (row 6) shows an 8.1% improvement over the original SSD7.

Table 3 shows that the proposed SSD7-FFAM achieves the second-lowest average running time among the six models, only 0.145 s slower than FPSSD7 [45], while the accuracy of SSD7-FFAM is 3% higher. The average running time of SSD7-FFAM in comparison to other detectors is shown in Figure 8. Therefore, the quantitative results in Tables 2 and 3 show that the FFAM is effective in improving detection accuracy and can more accurately detect the various objects in the dataset.

Extended Research on SSD300
To further verify the effectiveness of the proposed FFAM method, in this experiment, we also added it to the original SSD [4], which is called SSD300-FFAM, and compared it with other state-of-the-art detectors on the Pascal VOC 2007 test set. Except for the different sizes of detection feature maps, other experimental settings are consistent with SSD300 [4].
We first compared the proposed SSD300-FFAM with other state-of-the-art detectors on the Pascal VOC 2007 test set. All detectors used VGG-16 as the backbone network, and their input image size was 300 × 300. Since our method is mainly a modification of SSD300 [4], we chose SSD300 [4] as the baseline. The detection results are summarized in Table 4. SSD300-FFAM achieves 78.7% mAP, an improvement of 1.2% over the baseline SSD300 and 5.5% higher than Faster R-CNN. SSD300-FFAM is slightly better than SSD-TSEFFM300 [46] (78.7% vs. 78.6%). Compared to the other detectors, we also achieved much better performance. SSD300-FFAM shows a large improvement on objects with specific backgrounds and some small objects, e.g., airplane (82.6%), chair (61.7%), motorbike (86.3%), and train (87.8%). This shows that SSD300-FFAM improves the detection of small objects by SSD to a certain extent; the proposed feature fusion and attention mechanism method therefore plays a useful role in object detection. We also compared the proposed SSD300-FFAM with state-of-the-art two-stage and one-stage detectors using different backbone networks. For the results in Table 5, we note the following:
• The proposed SSD300-FFAM has 300 more boxes than the baseline SSD300 because the sizes of the detection feature maps used are 38, 20, 10, 6, 3, and 1, respectively; however, its detection accuracy is higher.
• Two-stage methods like Faster R-CNN and R-FCN have a candidate-box extraction stage, so their number of boxes is much smaller than that of one-stage methods such as YOLOv3 [9], SSD, or SSD300-FFAM.
• As the comparison with the one-stage method YOLOv3 [9] shows, the deeper the backbone network used, the higher the object detector's accuracy.

Discussions
Previous work on lightweight object detection networks has achieved remarkable success. With the rapid development of the Internet of Things (IoT), some efforts [49-51] have been made to reduce the need for storage space and computational complexity on edge devices by using the IoT and cloud-based services. These existing efforts send algorithms and databases to AWS services hosted in the cloud to save storage space on the edge devices. However, they also suffer from long data transmission times, slow transmission speeds, and high demands on the device's network conditions. Compared with these works, the proposed SSD7-FFAM overcomes these limitations because it can be deployed directly on edge devices to realize real-time object detection; it remains constrained, however, by the limited processing power of the devices themselves.
Aiming at realizing real-time object detection on embedded devices, we reduced the number of convolutional layers of the original SSD. A novel method of feature fusion and attention mechanism is proposed to improve the detection accuracy. The experimental results comparatively demonstrate the compactness of SSD7-FFAM.
In addition to the results given, it should be noted that FFAM is a universal scheme that provides a general solution for designing and implementing a fast and high detection accuracy object detector. Except for remote sensing image detection, SSD7-FFAM also has the potential to be applied in other fields, including but not limited to autonomous driving, military object detection, and medical image detection.
Although SSD7-FFAM, unlike other SSD-based lightweight object detection networks, can be trained from scratch, one of its limitations is its low accuracy on large-scale object detection datasets, which guides the direction of our future study.
Moreover, SSD7-FFAM optimizes only the feature extraction part; the prediction part is not optimized. In future work, we will focus on the overall optimization of SSD7-FFAM.

Conclusions
In this paper, we propose SSD7-FFAM, a lightweight real-time object detection network for embedded devices that can be trained from scratch for specific scenarios. The proposed novel feature fusion and attention mechanism method effectively improves the accuracy of object detection. Firstly, through deconvolution and feature fusion, low-level feature maps and high-level feature maps are combined to obtain feature maps with stronger semantic information for prediction, which is conducive to detecting small objects. Secondly, the channel attention module and the spatial attention module are used to enhance the contextual information of the detection target so that the detector focuses on the object instead of the background. The experimental results on the NWPU VHR-10 dataset show that the proposed SSD7-FFAM reaches 83.7% mAP and processes an image in 0.033 s on average. SSD7-FFAM has only 1.66 MB of parameters and is easier to deploy to embedded devices than other lightweight object detection networks because it does not require a pre-trained model.
In our future work, in addition to addressing the above limitations, we will transplant the SSD7-FFAM to embedded devices such as drones, and we will test more methods and further optimize our model to enhance scalability.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, visualization, and writing-original draft preparation, Q.L.; resources, writing-review and editing, supervision, project administration, and funding acquisition, Y.L. and W.H. All authors have read and agreed to the published version of the manuscript.