GGT-YOLO: A Novel Object Detection Algorithm for Drone-Based Maritime Cruising

Abstract: Drones play an important role in the development of remote sensing and intelligent surveillance. Due to limited onboard computational resources, drone-based object detection still faces challenges in practical applications. By studying the balance between detection accuracy and computational cost, we propose a novel object detection algorithm for drone cruising in large-scale maritime scenarios. A Transformer is introduced to enhance the feature extraction part and is beneficial to small or occluded object detection. Meanwhile, the computational cost of the algorithm is reduced by replacing convolution operations with simpler linear transformations. To illustrate the performance of the algorithm, a specialized dataset composed of thousands of images collected by drones in maritime scenarios is presented, and quantitative and comparative experiments are conducted. Compared with other derivatives, the detection precision of the algorithm is increased by 1.4%, the recall by 2.6% and the average precision by 1.9%, while the parameters and floating-point operations are reduced by 11.6% and 7.3%, respectively. These improvements are expected to contribute to the application of drones in maritime and other remote sensing fields.


Introduction
The global drone market is expected to exceed $48 billion by 2026, as reported by Drone Industry Insights [1]. Given their advantages of high mobility, rapid response and wide field of view, drones are playing an important role in various human social activities, e.g., monitoring [2,3], photogrammetry [4,5], search-and-rescue [6], etc. Drones have been equipped with advanced Artificial Intelligence and Internet of Things techniques to carry out these tasks autonomously. However, challenges remain to be addressed in real-world applications.
Object detection helps drones find the position and class of objects in their view and is the primary requirement for drones applied to maritime cruising and searching missions. Over the last twenty years, various algorithms and application scenarios have been studied for object detection. Traditional approaches extract handcrafted features from image patches and select one or multiple classifiers to traverse the whole image, e.g., the histogram of oriented gradients (HOG) detector, the deformable parts model (DPM), etc. [7]. As the popular solutions of the last ten years, deep-learning-based approaches utilize deep networks to learn high-level feature representations of various objects, e.g., the region convolutional neural network (R-CNN), the you only look once series (YOLOs), etc. [8]. Even though remarkable achievements have been made with the above approaches, some common challenges remain to be addressed, such as object rotation and scale changes, small and occluded object detection, real-time performance of onboard systems, etc. In traditional scenarios, pedestrians and vehicles as the main detected objects present relatively

Related Work
In recent years, object detection based on drone vision has been studied extensively for various application fields. Representative related works are introduced in this section.
With the advantages of a wide perspective and high resolution, drone vision is well suited to remote sensing. Travelling vehicles [14], road information [15] and pavement distress [16] can be extracted from drone imagery by deep learning algorithms, e.g., Faster R-CNN, YOLOs, etc. An improved Faster R-CNN consisting of a top-down-top feature pyramid fusion structure was proposed for visual defect detection of catenary support devices [17]. For small object detection in drone images, more abundant feature information can be extracted by a multi-branch parallel feature pyramid network [18]; furthermore, a supervised spatial attention mechanism was considered to reduce background noise. Small object detection accuracy can be improved by a feature pyramid network, which is capable of fusing more representative features including shallow and deep feature maps [19]. The receptive field for small object detection was enriched by concatenating two ResNet models in the DarkNet of YOLOv3 [20] and by increasing convolution operations in an early layer [21]. To minimize missed targets due to occlusion, Tan et al. [22] introduced soft non-maximum suppression into the framework of YOLOv4 [23]. YOLOv5 provides four versions for different application scenarios: YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. Small object detection from the view of drones has been studied by improving YOLOv5 [24]. Refinements including adding a microscale detection layer, setting prior anchor boxes and adapting the confidence loss function of the detection layer were implemented in the YOLOv5 framework for small-sized wheat spike detection [25]. It can thus be seen that multiscale representation, contextual information, super resolution and region proposal are the main solutions for improving small object detection [26]. YOLOv6 [27] and YOLOv7 [28] were proposed successively in 2022.
The backbone of YOLOv6 utilizes EfficientRep instead of CSPDarkNet. It is worth mentioning that YOLOv6 continues to use anchor-free detection and introduces a new border regression loss, SIoU; in this sense, YOLOv6 can be regarded as combining the strengths of YOLOv5 and YOLOX. YOLOv7 presents a planned re-parameterized model to replace some of the original modules. Because these models are so recent, related work on their applications is still rare.
Considering the limited onboard computational resources, a few lightweight networks have been proposed for drone vision. To reduce the computational cost and network size, pointwise convolution and regular convolution were combined as the main building block of the network proposed by Liu et al. [29]. The inverted residual block [30] was utilized to construct a lightweight network for object recognition. For vehicle detection in aerial images, Javadi et al. [31] optimized YOLOv3 by replacing Darknet-53 with MobileNet-v2, which integrates depthwise separable convolution, linear bottlenecks and inverted residuals. An improved network named MobileNet-v3 was realized by adding a lightweight attention module and h-swish into MobileNet-v2. MobileNet-v3 was used to reduce the computational cost of YOLOv4 while ensuring feature extraction from aerial images [32]. Thus, how to obtain a good trade-off between computational cost and detection accuracy has become the focus of drone-vision research [33].
Maritime object detection, as one typical scenario, has been studied for many years. Prasad et al. [34,35] summarized recent visual perception algorithms for maritime scenarios and proposed corresponding assessment criteria for maritime computer vision. Maritime datasets for training and evaluating deep-learning-based visual algorithms were provided in [36,37]. Multi-spectral vision was studied for human body detection in maritime search-and-rescue tasks using drones [38]. Reverse depthwise separable convolution was applied in the backbone of YOLOv4 [39], which reduced the network parameters by 40% and is suitable for vision-based surface target detection by unmanned ships. Ship re-identification [40] is significant when ships frequently move in and out of the drone's view. Even though various algorithms for ship detection in SAR or horizontal-perspective images were presented in [41-43], drone-vision-based maritime object detection still presents some challenges. Background variation, scale variation, illumination conditions, visible proportion, etc. are thought to be especially serious when detecting and tracking maritime objects using drone vision.
Inspired by the outstanding works mentioned above, mobile vision systems require both detection accuracy and low computational cost. As a typical one-stage object detection framework, the YOLO series has been studied and applied in drone-based vision systems. On the one hand, feature enhancement is the main way to improve detection accuracy; on the other hand, network lightweighting reduces the computational burden of onboard systems. As a result, YOLOv5 is adopted as the framework of our drone vision for maritime object detection in this work. To this end, advanced models such as Transformer and GhostNet are utilized to improve the accuracy and efficiency of the original YOLOv5, and comparative experiments are conducted to obtain the optimal re-configuration of YOLOv5 with Transformer and GhostNet. The improved object detection framework is expected to perform better in drone maritime cruise scenarios.

Materials and Methods
Maritime object detection based on drone vision is studied in this work. While drones cruise through typical maritime scenarios, ships of different scales present various appearances in their view. Therefore, both detection accuracy and computational efficiency have to be considered when detecting these maritime objects. To this end, a novel drone-based maritime object detection algorithm is presented in Figure 1. The algorithm can be divided into three main parts: the backbone is responsible for extracting features from an input image and is composed of three network layers based on CNN and Transformer; feature maps with scales of 80 × 80, 40 × 40 and 20 × 20 are calculated through the backbone. The neck is responsible for fusing the feature maps. For the head, three detectors at different scales are utilized to calculate the positions and sizes of objects. In addition, the dataset specialized for drone-based maritime object detection is described in this section.

MariDrone Dataset
A specialized dataset is constructed for drone-based maritime object detection. The dataset is composed of thousands of maritime scenario images collected by our drone, a DJI M300, and is therefore named MariDrone. The drone, with dimensions of 810 × 670 × 430 mm, has a payload capacity of 2.7 kg, and the effective range of the remote control is 8 km. The positioning precision is 1-2 cm. The onboard vision system is equipped with a wide-angle camera and an embedded GPU device. The camera has a high resolution of 1200 million pixels and an 82.9° field of view. The embedded device, based on the NVIDIA Pascal™ GPU architecture, has 8 GB of memory and a memory bandwidth of 59.7 GB/s; it is responsible for real-time object detection using our algorithm. Images were collected by the onboard vision system while the drone cruised over the Yangtze River. Regarding illumination, both sunny and cloudy conditions are included in the dataset. The 3840 × 2140 image resolution is high enough to retain small objects and local details. To ensure the generalization of the MariDrone dataset, maritime videos were recorded using drones under different weather and illumination conditions. By sampling these videos, a total of 4743 real images were obtained. Compared with other similar datasets, the MariDrone dataset was constructed entirely by a flying drone. As a result, different scales, varying illuminations and various views are well represented in our dataset.
Furthermore, data augmentation is used to extend the MariDrone dataset. As shown in Figure 2, random combinations of transformation operations involving translation, scaling, rotation, dithering, etc. are utilized in the data augmentation process. Translation, scaling and rotation increase the variety of the labeled objects in the images, while color dithering enriches the maritime scenarios. Through such data augmentation, a total of 8340 images were obtained for the MariDrone dataset. Each image was annotated according to the COCO format. The dataset was divided into a training set, validation set and test set at a ratio of 7:2:1.
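As a minimal sketch, the 7:2:1 split described above can be reproduced as follows; the integer-ratio arithmetic and the fixed seed are our illustrative choices, not the authors' actual pipeline:

```python
import random

def split_dataset(image_ids, seed=0):
    """Shuffle image ids and split them 7:2:1 into train/val/test subsets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    # integer arithmetic avoids float rounding on the 70%/20% boundaries
    n_train, n_val = n * 7 // 10, n * 2 // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 8340 augmented images -> 5838 train, 1668 val, 834 test
train, val, test = split_dataset(range(8340))
```

Shuffling before splitting keeps the illumination and scene variety roughly balanced across the three subsets.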

GGT-YOLO Algorithm
The drone-based maritime object detection algorithm is described in this section and shown in Figure 3. Using YOLOv5 as the framework, one Transformer is fused into the backbone to enhance feature extraction, which benefits the detection of small or occluded objects in complex maritime scenarios from the drone's view. Two GhostNets are utilized to reduce the computational consumption of the network. The algorithm is therefore named GGT-YOLO. Compared with YOLOv5 and other derivatives, GGT-YOLO achieves an optimal balance between detection accuracy and computational cost.


Object Detection Framework
Compared with previous versions, YOLOv5 has advantages in data enhancement, feature extraction and loss calculation. With its faster detection speed and lower computational requirements, YOLOv5s is a lightweight version of YOLOv5 and is convenient to deploy onboard drones and other mobile terminals. Therefore, YOLOv5s is employed as the object detection framework in our work; note that in the following, any mention of 'YOLOv5' refers to YOLOv5s. The main sections of the framework are described as follows. In the input section, input images are standardized through mosaic data enhancement, anchor box calculation and image scaling, in that order. Then, a backbone is established for extracting various features from the standardized images. In this section, a focus model is applied to reduce parameters through a series of slicing operations. CBL is defined as a specialized network involving convolution and batch normalization with an activation function; it transmits features and alleviates gradient vanishing. A cross-stage partial network named C3 expands the gradient path so as to enhance feature extraction. More fine-grained feature maps are acquired by concatenating CBL with C3, and this combination of CBL and C3 is applied repeatedly in the backbone to calculate feature maps at different scales. Spatial pyramid pooling (SPP) is used to reduce the feature loss caused by image scaling and distortion. Subsequently, a neck section is mostly responsible for fusing the feature maps at different scales; using a feature pyramid framework, a bottom-up path aggregation network is designed. C3 in the neck differs from that in the backbone: it plays the role of down-sampling during fusion. Meanwhile, Concat refers to concatenating the feature maps after sampling.
Finally, three detection heads composed of convolution operations output the detection results at different scales. In each head, one 3 × 3 convolution is responsible for feature integration, while one 1 × 1 convolution adjusts the number of channels. In the framework, objects of small, medium and large sizes are detected by calculating the feature maps with scales of 80 × 80, 40 × 40 and 20 × 20, respectively.
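The correspondence between the 640 × 640 input and the three head resolutions follows from backbone strides of 8, 16 and 32; a small sketch (the function name is ours, not from the paper):

```python
def detection_grids(input_size=640, strides=(8, 16, 32)):
    """Grid resolution of each detection head for a square input.

    A smaller stride yields a finer grid, which is why the 80 x 80 map
    handles the smallest objects and the 20 x 20 map the largest.
    """
    return [input_size // s for s in strides]

grids = detection_grids()  # [80, 40, 20]
```

The same rule gives the head resolutions for any input size divisible by 32, e.g., a 1280-pixel input yields 160 × 160, 80 × 80 and 40 × 40 grids.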
Although YOLOv5 displays good performance, some challenges remain, especially when it is deployed onboard light and flexible drones. To improve detection performance under scale variation and to reduce computational cost, a novel algorithm, GGT-YOLO, is proposed by modifying the primary YOLOv5.

Feature Extraction Optimization
Due to the scale variations and frequent occlusions of ships in the drone's view, detecting maritime objects is a challenge for YOLOv5. As a typical attention mechanism model, Transformer pays more attention to key features instead of background or blank areas and is thus introduced to enhance the feature extraction of the algorithm. Inspired by Vision Transformer, Transformer is applied in the backbone of GGT-YOLO, as shown in Figure 3.
Transformer is composed of multi-head attention and a multilayer perceptron (MLP), with residual connections (Add) and normalization (Norm) applied between these networks. Multi-head attention calculates the relationships among pixels at different positions to enhance key features, especially for objects from multiple subspaces; in fact, each self-attention head can be viewed as one subspace of information. As shown in Figure 3, feature maps from the backbone network are reshaped into a vector I by a flattening operation, and the query vector Q, the key vector K and the value vector V are calculated from I by different linear transformations. Specifically, head_i denotes the result of the i-th self-attention obtained by scaled dot-product attention:

head_i = Attention(I W_i^Q, I W_i^K, I W_i^V) = softmax( (I W_i^Q)(I W_i^K)^T / sqrt(d_k) ) (I W_i^V)

where I W_i^Q is the linear transformation from I to Q for head_i, I W_i^K is the linear transformation from I to K, I W_i^V is the linear transformation from I to V, and d_k is the dimension of the key vector. Multi-head attention is calculated by concatenating the heads:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where Concat refers to the tensor concatenation operation and W^O is a linear transformation matrix. MLP is essentially a fully connected layer involving nonlinear transformations and is responsible for adjusting the spatial dimension of the feature maps. Meanwhile, normalization ensures that the network converges faster and resists overfitting. Global and rich contextual information can be captured by Transformer. Placed behind the SPP, Transformer contributes to detecting small or occluded objects in complex maritime scenarios.
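The multi-head attention described above can be sketched in NumPy. This is a hedged illustration of the standard scaled dot-product formulation, with illustrative sizes (16 tokens, 32 channels, 4 heads) rather than the network's actual dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(I, Wq, Wk, Wv, Wo, num_heads):
    """I: (n, d) flattened feature vectors; Wq/Wk/Wv/Wo: (d, d) projections.

    Each head attends within a d/num_heads-dimensional subspace, so the
    heads capture relationships among positions in different subspaces.
    """
    n, d = I.shape
    dk = d // num_heads
    Q, K, V = I @ Wq, I @ Wk, I @ Wv
    heads = []
    for h in range(num_heads):
        q = Q[:, h * dk:(h + 1) * dk]
        k = K[:, h * dk:(h + 1) * dk]
        v = V[:, h * dk:(h + 1) * dk]
        attn = softmax(q @ k.T / np.sqrt(dk))  # scaled dot-product attention
        heads.append(attn @ v)
    return np.concatenate(heads, axis=1) @ Wo  # Concat(head_1..h) W^O

rng = np.random.default_rng(0)
n, d, h = 16, 32, 4
I = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(I, Wq, Wk, Wv, Wo, num_heads=h)
```

The output keeps the (n, d) shape of the input, so the module can be dropped behind the SPP without changing the surrounding tensor dimensions.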

Network Lightweight Optimization
Computational cost is strictly constrained for drone onboard systems. On this premise, reducing the algorithm's consumption while ensuring its performance becomes a challenge. As one alternative solution, GhostNet is employed in the feature fusion section of the proposed GGT-YOLO.
Let us assume that most feature maps contain redundant information that is similar and ghost-like between one another. This redundant information, called ghost feature maps, guarantees a comprehensive understanding of the input feature maps. Using GhostNet, intrinsic and ghost feature maps are calculated in the following steps. First, m intrinsic feature maps are calculated from the input feature maps by ordinary convolution:

Y' = X * f

where Y' ∈ R^{h'×w'×m} denotes the intrinsic feature maps with m channels, h' and w' are the height and width of Y', X ∈ R^{h×w×c} is the input feature map with c channels, h and w are its height and width, f ∈ R^{c×k×k×m} denotes the convolution filters, and k × k is the kernel size of f. Then, ghost feature maps are generated by applying a series of cheap linear transformations to each intrinsic feature map in Y':

y_{i,j} = Φ_{i,j}(y'_i),  i = 1, ..., m,  j = 1, ..., S

where y'_i is the i-th intrinsic feature map in Y', Φ_{i,j} is the j-th linear transformation (except the last one), and y_{i,j} is the j-th ghost feature map. S is defined as the number of generated ghost feature maps; that is to say, each intrinsic feature map y'_i can generate one or more ghost feature maps. Finally, both intrinsic and ghost feature maps are combined to form the output feature maps. Because the linear transformations operate on each channel, their computational cost is far lower than that of ordinary convolutions. As a result, GhostNet reduces the parameters and computational consumption to about 1/S of those of the primary convolution network; S can be considered the theoretical speed-up ratio of GhostNet. As shown in Figure 3, two stacked GhostNets and the corresponding shortcuts make up the Ghost bottleneck. One GhostNet acts as an expansion layer to increase the channels of the feature maps, while the other reduces the channels to match the shortcut path. The shortcuts integrate key information from different layers into the feature maps.
Thereby, richer feature information with less computational cost can be obtained by the Ghost bottleneck.
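A minimal numerical sketch of the Ghost idea follows, assuming a 1 × 1 intrinsic convolution and per-channel scaling as the cheap transformation (the real GhostNet uses small depthwise convolutions); `param_ratio` is our illustration of the ~1/S parameter argument:

```python
import numpy as np

def ghost_module(X, W, scales):
    """Sketch of a Ghost module. X: (h, w, c) input, W: (c, m) 1x1 'conv'.

    m intrinsic maps come from an ordinary projection; ghost maps come from
    cheap per-channel transforms (here, scalar scaling) of those maps.
    """
    intrinsic = X @ W                          # m intrinsic feature maps
    ghosts = [intrinsic * s for s in scales]   # cheap linear transformations
    return np.concatenate([intrinsic] + ghosts, axis=-1)

def param_ratio(c, m, k, s):
    """Parameters of an ordinary conv producing m*s channels, divided by
    those of one intrinsic conv (m channels) plus (s-1) cheap d x d kernels
    per channel. Approaches s for large c, matching the speed-up ratio S."""
    d = k  # assume cheap kernels are the same size as the main kernel
    ordinary = c * k * k * m * s
    ghost = c * k * k * m + (s - 1) * m * d * d
    return ordinary / ghost

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 8, 16))
W = rng.standard_normal((16, 8))
out = ghost_module(X, W, scales=[0.5])  # 8 intrinsic + 8 ghost channels
```

With c = 64 input channels, m = 32 intrinsic maps, 3 × 3 kernels and s = 2, the ratio is just under 2, illustrating the roughly 1/S cost of the Ghost module.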
In this work, the Ghost bottleneck is used to replace the CBL in the C3 module, as shown in Figure 3; the resulting module is named C3Ghost. GhostNet converts intrinsic feature maps into ghost feature maps by linear transformations. Compared with the primary network of YOLOv5, the floating-point operations and network parameters are greatly reduced. Based on comparative experiments, C3Ghost is applied in the last two C3 modules of our GGT-YOLO algorithm.

Experiments and Discussion
For training and evaluating the proposed GGT-YOLO algorithm, the experiments were performed on a workstation equipped with an Intel® Core™ i7-9800X CPU @ 3.80 GHz × 16, 32 GB RAM and an NVIDIA GeForce RTX 2060Ti GPU with 12 GB of memory. The batch size is set to 16, the number of iterations to 300 and the input image size to 640 × 640; other parameters are left at their defaults. To enhance the diversity of the MariDrone dataset, horizontal-flip and mosaic data augmentation are adopted in the training phase.

Performance Analysis
In this section, the value of the GGT-YOLO algorithm is demonstrated by comparative experiments. The proposed algorithm is compared with YOLOv3 [20], YOLOv4 [23], YOLOv5 [11] and YOLOv7 [28] under the same conditions. Both our MariDrone dataset and the public RSOD dataset [44] are employed for evaluating the proposed algorithm and the other YOLO versions. RSOD is a remote sensing object detection dataset that includes four categories of objects, e.g., aircraft, oil tanks, playgrounds, etc.; a total of 976 images and 6950 objects are labeled in the dataset. Figure 4 shows the APs of these algorithms during training on the RSOD and MariDrone datasets, respectively. It can be seen in Figure 4a that the mean AP (mAP) of GGT-YOLO is 1.0% higher than that of YOLOv5. Although its P, R and mAP are similar to those of YOLOv7, the parameters and FLOPs of the proposed algorithm are reduced by 83.2% and 85.6%, respectively. In Figure 4b, the APs of all algorithms increase rapidly at the beginning, but the rates gradually slow down at around 100 iterations. At around 170 epochs, GGT-YOLO shows its advantage. From 210 to 250 epochs, GGT-YOLO almost overlaps with YOLOv5. During the 250-300 epoch period, all algorithms begin to converge, but GGT-YOLO still maintains high accuracy. On the whole, compared with the other YOLO series algorithms, GGT-YOLO has clear advantages in convergence speed and accuracy. In addition, it can be noted that around 300 iterations is the most suitable.
To demonstrate the performance of GGT-YOLO and the other YOLO series algorithms, P, R, AP, FLOPs and parameters are reported in Table 1. Compared with YOLOv5, the P of GGT-YOLO is increased by 1.4%, the R by 2.6% and the AP by 1.9%, while its parameters and FLOPs are reduced by 11.6% and 7.3%, respectively. Given these advantages, GGT-YOLO is considered better fitted to the onboard systems of drones. In addition, the evaluation based on the RSOD dataset is shown in Table 2.

Comparative Analysis
During maritime cruising by drones, ships in the remote and moving view present scale variations and frequent occlusions. Aside from computational cost, detection accuracy is also required of the onboard vision detection algorithm. As described in Section 3, Transformer is introduced to enhance the feature extraction of YOLOv5, while GhostNet is introduced to reduce the computational cost. How to fuse the two models with the primary network is analyzed in this section.
The proposed GGT-YOLO and other derivatives are defined in Table 3, where Bn represents the n-th C3 model behind the SPP into which GhostNet or Transformer is introduced. GGT-YOLO is defined by one Transformer replacing the first C3 model and two GhostNets replacing the fourth and fifth C3 models in the YOLOv5 framework. T-YOLO is defined by one Transformer replacing the first C3 model. G-YOLO is defined by one GhostNet replacing the fifth C3 model. GT-YOLO is defined by one Transformer replacing the first C3 model and one GhostNet replacing the fifth C3 model. Other derivatives, e.g., TT-YOLO, GG-YOLO and GGGT-YOLO, are defined in the same way. Through comparative experiments among these derivatives, the optimal solution for drone-based maritime object detection can be obtained. The APs of these networks designed by tentative combination are shown in Figure 5. It can be seen that even though all the networks converge, the proposed GGT-YOLO rises faster in the beginning stage and keeps a higher score in the final stage. In addition, the corresponding evaluation metrics are listed in Table 4, with our GGT-YOLO highlighted in bold.
Owing to the one C3Ghost applied in the neck section, the parameters and FLOPs of G-YOLO are reduced by 0.6 × 10^6 and 0.5 × 10^9, respectively, while its AP remains at the same level as YOLOv5. To further investigate the effect of C3Ghost on reducing computational cost, GG-YOLO, which applies two C3Ghost models, is studied.
As shown in Table 4, even though the computational cost of GG-YOLO is lower than that of G-YOLO, its AP starts to decrease. This shows that the GhostNet in C3Ghost can affect detection accuracy while reducing computational complexity. To guarantee reliable detection accuracy, T-YOLO introduces one Transformer into the backbone section of YOLOv5. The results in Table 4 show that the P, R and AP of T-YOLO are improved by 0.7%, 2.4% and 0.6%, respectively. Unfortunately, when two Transformer models are introduced into the network, the AP is improved by only 0.1%, while R decreases by 1.7%. This shows that Transformer can improve the average detection precision but not necessarily the recall.
To better balance computational cost and detection accuracy, the novel GGT-YOLO algorithm is found to be the optimal solution according to comparisons of the evaluation metrics. One Transformer and two C3Ghost models are introduced in GGT-YOLO. For verification, another two networks, GT-YOLO and GGGT-YOLO, are also designed (see Table 3). GT-YOLO replaces one C3 with one C3Ghost in the neck and introduces one Transformer in the backbone. Even though the detection accuracy of GGT-YOLO is the same as that of GT-YOLO, its parameters and FLOPs are fewer, meaning that GGT-YOLO has a lower computational cost. Furthermore, GGGT-YOLO applies more C3Ghost models; compared with GGT-YOLO, its detection accuracy degrades rapidly even though a lower computational complexity is attained. As shown in Figure 5, GGGT-YOLO does not seem to perform well in the convergence stage.

Results and Discussion
Thousands of images were recorded while drones were carrying out maritime cruising missions. Various situations are involved in the dataset, e.g., single object, multi-object, sunny, cloudy, etc. Ships of different sizes and orientations are also presented with labels in these images. After training, GGT-YOLO is tested and evaluated on the testing set and validation set. Part of the results are shown in Figure 6. It can be seen that all ships, including small or occluded ones, are detected from large-scale crowded backgrounds.
Through the exploratory experiments above, an optimal algorithm, GGT-YOLO, is proposed for drone-based vision to detect ships in maritime scenarios. Considering the limited computational ability of the onboard system, GhostNet is introduced to reduce the proposed algorithm's computational cost. Instead of general convolution, linear transformations are employed to generate feature maps in GhostNet, so fewer FLOPs are required. This benefits the deployment of the proposed algorithm on airborne systems. However, as more GhostNet modules are introduced, the detection accuracy in terms of P, R and AP begins to decrease. The reason is that the linear transformations of GhostNet cannot fully approximate the convolution operation. On the other hand, Transformer is shown to enhance the detection accuracy of the algorithm. Multi-head attention is able to calculate the contexts of pixels at different positions from multiple subspaces, which helps GGT-YOLO extract significant features from large-scale scenarios.
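The multi-head mechanism described above can be sketched in a few lines: queries, keys and values are projected, split into per-head subspaces, attended over all positions, and the heads are concatenated. The dimensions and random weights below are illustrative, not those of the actual GGT-YOLO Transformer:

```python
import numpy as np

# Minimal sketch of multi-head scaled dot-product attention, the mechanism
# that relates pixels at different positions from multiple subspaces.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    n, d = x.shape                       # n tokens (flattened pixels), d channels
    dh = d // n_heads                    # per-head subspace dimension
    q = (x @ wq).reshape(n, n_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, n_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(n, n_heads, dh).transpose(1, 0, 2)
    # each head attends over all n positions in its own subspace
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, n, n)
    out = (scores @ v).transpose(1, 0, 2).reshape(n, d)        # concat heads
    return out @ wo

rng = np.random.default_rng(0)
n, d, heads = 16, 32, 4                  # e.g. a 4x4 feature patch, 32 channels
x = rng.standard_normal((n, d))
wq, wk, wv, wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
y = multi_head_attention(x, wq, wk, wv, wo, heads)
print(y.shape)
```

Because every position attends to every other, small or occluded objects can borrow context from the whole scene, which is consistent with the accuracy gains reported for the Transformer-equipped derivatives.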
In conclusion, a lower computational cost as well as adequate detection accuracy has been achieved by our GGT-YOLO. The corresponding P, R and AP are 82%, 71.8% and 72.1%, respectively. In addition, the parameters and FLOPs are 6,234,710 and 15.1 × 10⁹, respectively. Through the comparative experiments, it can be noted that the proper introduction of Transformer and GhostNet improves the performance of the detection algorithm. The proposed GGT-YOLO is suitable for detecting maritime objects by drones.
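The reported figures can be cross-checked against the stated reductions. The baseline YOLOv5 values below are not quoted from the text; they are inferred from the percentages, so treat them as approximations:

```python
# Back-of-the-envelope check: what baseline do the stated reductions imply?
# These baseline values are derived, not measured.

ggt_params = 6_234_710          # reported GGT-YOLO parameters
ggt_flops = 15.1e9              # reported GGT-YOLO FLOPs

baseline_params = ggt_params / (1 - 0.116)   # ~11.6% parameter reduction
baseline_flops = ggt_flops / (1 - 0.073)     # ~7.3% FLOPs reduction

print(f"implied baseline params: {baseline_params:,.0f}")
print(f"implied baseline FLOPs:  {baseline_flops / 1e9:.1f} x 10^9")
```

The implied baseline (roughly 7.1 × 10⁶ parameters and 16.3 × 10⁹ FLOPs) is in line with a small YOLOv5 variant, which supports the internal consistency of the reported numbers.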

Conclusions
Both detection accuracy and computational consumption require consideration simultaneously when drones are employed to detect small or occluded objects in large-scale scenarios. In this work, we proposed a novel drone-based maritime object detection algorithm, in which feature extraction is enhanced while the computation of the feature fusion is optimized. A specialized dataset is introduced, and numerous comparative experiments have been conducted to evaluate the proposed algorithm. The results show that the P, R and AP are improved by 1.4%, 2.6% and 1.9%, respectively, compared with the primary YOLOv5. Furthermore, the parameters and floating-point operations are reduced by 11.6% and 7.3%, respectively. The algorithm thus provides an optimal solution for drone-based object detection in maritime and other remote sensing fields. In future work, further lightweighting of the feature fusion will be studied.