Centered Multi-Task Generative Adversarial Network for Small Object Detection

Despite the breakthroughs in accuracy and efficiency of object detection using deep neural networks, the performance of small object detection is far from satisfactory. Gaze estimation has developed significantly due to the development of visual sensors. Combining object detection with gaze estimation can significantly improve the performance of small object detection. This paper presents a centered multi-task generative adversarial network (CMTGAN), which combines small object detection and gaze estimation. To achieve this, we propose a generative adversarial network (GAN) capable of image super-resolution and two-stage small object detection. We exploit a generator in CMTGAN for image super-resolution and a discriminator for object detection. We introduce an artificial texture loss into the generator to retain the original features of small objects. We also use a centered mask in the generator so that the network focuses on the central part of each image, where small objects are more likely to appear in our method. We propose a discriminator with detection loss for two-stage small object detection, which can be adapted to other GANs for object detection. Compared with existing interpolation methods, the super-resolution images generated by CMTGAN are more explicit and contain more information. Experiments show that our method exhibits a better detection performance than mainstream methods.


Introduction
With visual sensors and computer vision development, gaze estimation technology can obtain gaze points with high accuracy [1]. However, the application of gaze estimation is still limited to visual attention analysis [2], assistive technologies for users with motor disabilities [3], behavior research [4], etc. Meanwhile, object detection algorithms such as YOLOv4 [5] and Faster RCNN [6] have low confidence and apparent location deviation in the prediction of small objects. The method of combining object detection and gaze estimation can significantly improve small object detection performance.
Object detection algorithms have achieved impressive accuracy and efficiency in detecting large objects. However, the performance with small-sized objects is far from satisfactory. There is still a big gap between the performances with small and large objects in recall and accuracy. To achieve a better detection performance on small objects, SSD [7] uses feature maps from shallow layers for small objects. FPN [8] exploits a feature pyramid to combine feature maps at different scales. Bai et al. [9] introduced a generative adversarial network to implement image super-resolution for small object detection. SOD-MTGAN [10] takes ROIs as input and predicts the categories and locations of objects.
The shallow feature maps are full of textural information but less discriminative, which leads to many false positive results in SSD. The up-sampling of FPN and [9] might generate artifacts which can cover the feature of small objects. SOD-MTGAN takes ROIs from baseline detectors as input, which means that SOD-MTGAN is only executed as the second stage of two-stage object detection. The performance of SOD-MTGAN is heavily dependent on its baseline detector. SOD-MTGAN exploits deconvolution layers for upsampling, which generates fewer artifacts [10]. However, SOD-MTGAN did not propose a method to suppress artifacts.
In this paper, we propose a centered multi-task generative adversarial network (CMTGAN) to improve detection performance on small objects, which exploits points of interest presented by gaze estimation methods or detectors (e.g., YOLOv4, etc.) for small object detection. We exploit a gaze estimation method or a detector as a baseline selector to propose points of interest. CMTGAN crops the selected regions centered on points of interest and performs two-stage object detection. Following the previous works on GANs, CMTGAN consists of two subnetworks: a generator and a discriminator. The generator performs super-resolution on selected regions. The discriminator distinguishes real images (high-resolution images) from fake images (super-resolution images) and performs complete two-stage object detection.
Contributions: The contributions can be summarized as follows:
(1) We propose an end-to-end convolutional network based on classical GAN for small object detection, which can perform effective single image super-resolution and complete two-stage object detection. Our method can be pre-trained on high-resolution images for super-resolution without extra object information, which helps the generator learn to extract features from low-resolution images efficiently. The generator performing super-resolution and the discriminator performing object detection can be trained together, which helps them learn to perform better detections simultaneously.
(2) We introduce artificial texture loss into the generator to suppress the artifacts generated by up-sampling, which improves the detection performance on small objects. Artificial texture loss helps the generator reach a balance between textures from original images and textures generated by super-resolution.
(3) We exploit a centered mask in the network, making the generator pay more attention to the central part of images.
(4) The experiments on VOC datasets reveal that CMTGAN can restore sharper images with more information from low-resolution images than traditional interpolation methods. Our method has a better performance than mainstream one-stage and two-stage object detection methods. It is also more efficient than object detection methods combined with CNN-based super-resolution methods. CMTGAN can perform state-of-the-art detection on small/medium objects.

Small Object Detection
Traditional object detection methods are based on handcrafted features and the deformable part model [11]. Due to the limitation of handcrafted features, traditional methods are far less robust than methods based on deep neural networks. Especially for small object detection, the performance of traditional methods is far from CNN-based methods.
In recent years, object detection methods based on deep neural networks have exhibited superior performances. Currently, CNN-based object detection methods can be categorized as one of two frameworks: the two-stage framework (e.g., Faster RCNN [6], FPN [8], etc.) and the one-stage framework (e.g., YOLO [5,12,13], SSD [7], etc.). Faster RCNN [6], a milestone of the two-stage framework, performs object detection with two stages. Faster RCNN proposes ROIs in the first stage, then predicts categories and regresses bounding boxes in the second stage. The one-stage frameworks such as YOLO convert object detection to regression problems, significantly improving detection speed. However, Faster RCNN and YOLOv4 still show unsatisfactory performance on small object detection.
To detect small objects better, SSD uses feature maps from the shallow layer. Although shallow feature maps contain more texture information, they lack semantic information, leading to false positive results in SSD. Compared to SSD-like detectors, our discriminator uses deep, strong semantic features to represent small objects, thus reducing the false positive rate.
FPN exploits the feature pyramid to combine low-resolution, semantically strong features with high-resolution, semantically weak features. With the feature pyramid, FPN exhibits a superior performance over Faster RCNN for small object detection. However, FPN up-samples low-resolution features to fit with high-resolution features, a process that introduces artifacts into the features and consequently degrades the detection performance. SOD-MTGAN [10] uses deconvolution layers for up-sampling, which introduces fewer artifacts into features. However, SOD-MTGAN has not proposed a specific method to suppress artifacts.
Xiang et al. [14] proposed a one-stage space-time video super-resolution framework. It exploits a ConvLSTM method to super-resolve videos, but it is not suitable for single image super-resolution. Su et al. [15] proposed a progressive mixture model for single image super-resolution, which achieved impressive performance on super-resolution.
Compared to FPN and the generator of SOD-MTGAN, our method explicitly suppresses artificial textures. We exploit deconvolution layers for up-sampling like SOD-MTGAN and propose artificial texture loss to suppress artifacts, which helps our network balance original textures and super-resolution textures.
Different from [15], we combine single image super-resolution and object detection in a CNN-based framework, which means they can be trained together.

Generative Adversarial Networks
In the primary work, the generative adversarial network generates realistic-looking images from random noise input [16]. GAN exhibits an impressive performance in image super-resolution [17,18], image editing [19,20], image generation, style transfer [21,22], representation learning [23,24], object detection [9,10,25], and so on. GAN includes a generator and a discriminator: the generator generates images, and the discriminator determines the authenticity of images. During training, the generator tries to generate more realistic-looking images, and the discriminator struggles to discover the difference between real images and fake images. After that, the well-trained generator can be used to generate realistic-looking images.
Ledig et al. [17] proposed a generative adversarial network for image super-resolution. The generator takes low-resolution images as input to generate super-resolution images. Real high-resolution images and fake images (i.e., super-resolution images) are delivered to the discriminator. The discriminator learns the difference between real images and fake images. Bai et al. [10] introduced SOD-MTGAN for image super-resolution and small object detection. The generator of SOD-MTGAN takes ROIs proposed by a baseline detector (e.g., Faster RCNN) as input and performs super-resolution on ROIs. The discriminator of SOD-MTGAN has three tasks: judging the authenticity of the image, predicting categories, and fine-tuning bounding boxes. The discriminator plays a role as the second-stage subnetwork in the two-stage framework. Therefore, the baseline detector has a significant influence on the detection performance of SOD-MTGAN.
Compared to SOD-MTGAN, the discriminator of our method performs complete two-stage small object detection. The generator of CMTGAN takes selected regions as input and performs super-resolution. Then the discriminator proposes ROIs on super-resolution images in the first stage, and predicts object categories and regresses object locations in the second stage. The baseline selector only proposes points of interest, which means that CMTGAN has less reliance on the selector.

Proposed Method
CMTGAN includes a generator and a discriminator. As shown in Figure 1, the baseline selector creates points of interest on the input containing small objects. We cropped the selected regions centered on points of interest as high-resolution images (HR images) and down-sampled the HR images to obtain low-resolution images (LR images). The generator takes the LR images to generate super-resolution images (SR images). The HR/SR images are delivered to the discriminator. The discriminator categorizes the input as real or fake and detects small objects.
As shown in Figure 2 and Table 1, we adopted a deep CNN architecture which has shown impressive performance in tiny face detection [9] and super-resolution [17]. There is one skip connection layer, one RPN layer, one sigmoid layer, two deconvolution layers, three convolution layers, and five residual blocks in the generator. Unlike [9], we introduced a skip connection layer into the generator, which brings texture information from shallow layers to up-sampling layers. Unlike the up-sampling layers in [9,17], we exploited deconvolution layers for up-sampling, which achieves a higher efficiency and generates fewer artifacts [10]. Every deconvolution layer performs up-sampling with a factor of 2, so the two layers together make SR images four times the size of LR images. We exploit a sigmoid layer to limit the output, which can avoid gradient exploding problems in training.

Discriminator
As shown in Figure 2 and Table 2, we employed ResNet-50 as our backbone network in the discriminator. ResNet-50 is not the only choice; it can be replaced with ResNet-101, AlexNet, or VGGNet for different objects. We introduced an ROI layer into the backbone network to propose ROIs. We used an average pooling layer following the backbone network for down-sampling. We used three parallel fully connected layers behind the average pooling layer, which distinguish the real HR images from the generated SR images, predict object categories, and regress bounding boxes, respectively.
The discriminator takes HR images and SR images as input. The backbone network extracts features from input and proposes ROIs. Figure 3a shows the tuple $u = (u_{x1}, u_{y1}, u_{x2}, u_{y2})$ of an ROI. Behind the average pooling layer, the first fully connected layer ($FC_{Adv}$) uses softmax to predict the probability ($P_{HR}$) of the input image being a real HR image. The second fully connected layer ($FC_{Cls}$) also uses softmax, which outputs the probabilities $P_{Cls} = (p_0, \ldots, p_K)$ of the ROI belonging to each of the $K + 1$ object categories. The third fully connected layer ($FC_{Loc}$) outputs the bounding box offset tuple $t = (t_x, t_y, t_w, t_h)$. As shown in Figure 3b, the offset tuple $t$ corresponds to the bounding box.
Compared to the discriminator in [17,23], our discriminator not only distinguishes real images from fake images but also detects objects in the images. The discriminator in [9] predicts the probability of the input being a face. The discriminator in [10] predicts the probability of the input being each of the categories and fine-tunes the bounding boxes. Compared to [9] and [10], our discriminator performs complete two-stage object detection, proposing ROIs, predicting object categories, and regressing bounding boxes. The difference between our method and [10] means that we only need a point of interest to detect a small object, while [10] needs an ROI proposed by its baseline detector.

Loss Function
We incorporated the loss functions from some state-of-the-art GAN approaches and propose centered content loss that satisfies the needs of small object detection. Centered content loss consists of pixel-wise loss, perception loss, and artificial texture loss. Centered content loss cooperates with adversarial loss, guiding the generator to generate realistic-looking images that are easier for small object detection. Furthermore, we propose two-stage detection loss, including ROI loss, classification loss, and regression loss. On the one hand, two-stage detection loss enables the discriminator to perform two-stage object detection. On the other hand, two-stage detection loss drives the generator to recover fine details from LR images for easier detection, as shown in Figure 2. In the following, we describe the centered content loss and the adversarial loss. Furthermore, we define the objective functions of the generator and the discriminator.

Centered Content Loss
As shown in Figure 1, the selected regions contain small objects in the central part. We introduced a centered mask which makes the content loss more sensitive to the central part of SR images. The centered mask is shown in Equation (1), and Figure 4 shows the suppression effect of our centered mask.
Here, W and H denote the size of SR images. Pixel-wise loss: Instead of the generator in [16] taking random noise as input, our generator creates SR images from LR images. A natural and straightforward way is to enforce the generator's output to be the ground-truth images by minimizing the pixel-wise loss, which has been proved effective in some state-of-the-art approaches [26,27]. The pixel-wise loss is computed as Equation (2).
Here, $M_{x,y}$ denotes the centered mask. $I^{HR}$ and $G_\omega(I^{LR})$ denote real HR images and generated SR images. $G$ represents the generator, and $\omega$ denotes its parameters. $W$ and $H$ denote the size of HR/SR images and the centered mask.
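To make the role of the mask concrete, the following is a minimal sketch of a mask-weighted pixel-wise loss. It is not the paper's implementation: Equations (1) and (2) are not reproduced in this text, so the Gaussian shape of `centered_mask`, its `sigma_frac` parameter, and the normalization by $W \cdot H$ are all assumptions made purely for illustration.

```python
import math


def centered_mask(width, height, sigma_frac=0.5):
    """Hypothetical centered mask with a Gaussian falloff from the center.

    ASSUMPTION: this shape only illustrates a mask that weights central
    pixels more heavily; it is not the mask defined in Equation (1).
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    sx, sy = sigma_frac * width, sigma_frac * height
    return [
        [math.exp(-(((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2) / 2.0)
         for x in range(width)]
        for y in range(height)
    ]


def masked_pixel_loss(hr, sr, mask):
    """Mean of mask-weighted squared differences between HR and SR pixels."""
    height, width = len(hr), len(hr[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            total += mask[y][x] * (hr[y][x] - sr[y][x]) ** 2
    return total / (width * height)
```

With such a mask, the same pixel error costs more when it occurs at the center of the region than at a corner, which is the behavior the centered content loss is meant to encourage.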

[Figure 4: Suppression effect of the centered mask. Panels: HR, SR, pixel-wise loss without centered mask, centered mask, pixel-wise loss with centered mask.]

Perception loss: Solutions of MSE optimization problems often lack high-frequency content, which results in images covered with overly smooth textures. Therefore, we adopted the perception loss based on the pre-trained ResNet [28]. The perception loss is computed as Equation (3).
Here, R denotes the pre-trained ResNet. w and h indicate the size of the feature map created by R.
Artificial texture loss: The perception loss increases high-frequency content in SR images, making them sharper. However, perception loss without suppression tends to introduce artificial textures into images, which do not exist in HR images. These artificial textures significantly reduce the perception loss, but they also obscure the original textures of images, which is fatal for small object detection. Artificial texture loss is proposed to suppress the artificial textures encouraged by perception loss. The artificial texture loss is computed as Equation (4).
where $M_x$ and $M_y$ are the variants of $M_{x,y}$ along the $x$ and $y$ directions. $W$ and $H$ denote the size of the super-resolution image. $G_\omega(I^{LR})_{x,*}$ is the sum of the pixel values of the $x$-th row in the generated image. $G_\omega(I^{LR})_{*,y}$ denotes the sum of the pixel values of the $y$-th column in the generated image.

Adversarial Loss
We adopted an adversarial loss to generate more realistic-looking SR images, which has been proved to be effective in [23]. The adversarial loss is defined as Equation (6), where $D$ represents the discriminator and $\theta$ denotes its parameters. $D_\theta(I^{HR})$ denotes the probability of the input $I^{HR}$ being a real HR image. The adversarial loss encourages the discriminator to have a stronger discriminative ability to distinguish real HR images from generated SR images. At the same time, the adversarial loss drives the generator to produce images with fine details.

Detection Loss
As shown in Figure 2, our discriminator is a two-stage object detection method. First, the discriminator proposes ROIs from the input. Second, the discriminator predicts object categories and regresses bounding boxes on ROIs. To achieve this, we propose detection loss, including ROI loss, classification loss, and regression loss.
ROI Loss: To complete the task of proposing ROIs and to ensure that the generated images contain more detail, we introduced the ROI loss to the overall objective. The ROI loss is defined as Equation (7), where $r = (r_{x1}, r_{y1}, r_{x2}, r_{y2})$ denotes a tuple of the true ROI regression target, and $u = (u_{x1}, u_{y1}, u_{x2}, u_{y2})$ denotes the proposed ROI tuple shown in Figure 3a.
In our method, ROI loss plays two roles. First, it guides the discriminator to propose ROIs from the input, regardless of whether they are real HR images or generated SR images. Second, it promotes the generator to recover images with more detail, making it easier to propose ROIs.
Classification Loss: In order to complete the object categorization, we adopted cross-entropy loss as our classification loss. The classification loss is defined as Equation (9), where $D_{cls}(I^{*}_{i})$ denotes the probability of the $i$-th input belonging to the $k$-th category. Our classification loss also plays two roles in the discriminator and the generator, respectively. First, it encourages the discriminator to predict accurate object categories. Second, it drives the generator to produce images that are easier to classify.
Regression Loss: We also introduced regression loss into the objective function to complete the two-stage object detection and to encourage generated images in which small objects are easier to localize. The regression loss is defined as Equation (11), where $v = (v_x, v_y, v_w, v_h)$ denotes a tuple of the true bounding box regression target, and $t = (t_x, t_y, t_w, t_h)$ denotes the tuple of the predicted bounding box, as shown in Figure 3b. Similar to the ROI loss, our regression loss also has two purposes. First, it guides the discriminator to fine-tune the bounding box in the ROI proposed in the first stage. Second, it encourages the generator to produce sharper images with more high-frequency content.

Objective Function
Based on the previous analysis, we propose the objective function of CMTGAN. CMTGAN can be trained by optimizing the objective function. We adopted two objective functions for the generator and the discriminator, respectively. The loss functions L G of the generator and L D of the discriminator are shown in Equations (12) and (13).
where $\lambda_{pix}$, $\lambda_{perc}$, $\lambda_{tex}$, $\lambda_{adv}$ and $\lambda_{det}$ denote the trade-off weights during training of generator $G$. $\tau_{ROI}$, $\tau_{cls}$, $\tau_{loc}$, and $\tau_{adv}$ denote the trade-off weights during training of discriminator $D$. $l_{\text{pixel-wise}}$, $l_{perception}$, $l_{tex}$, $l_{adv}$, $l_{ROI}$, $l_{cls}$ and $l_{loc}$ denote the pixel-wise loss in Equation (2), the perception loss in Equation (3), the artificial texture loss in Equation (4), the adversarial loss in Equation (6), the ROI loss in Equation (7), the classification loss in Equation (9) and the regression loss in Equation (11).
The loss function of generator $G$ consists of centered content loss, adversarial loss, and detection loss. Unlike previous GAN methods, we introduced the centered mask and artificial texture loss into the centered content loss. The centered mask makes the generator focus on improving details of the central part, which satisfies the needs of small object detection. Artificial texture loss helps the generator reach a balance between keeping original features and generating super-resolution textures.
The loss function of discriminator $D$ includes adversarial loss and detection loss. Unlike [10], we introduced ROI loss into our detection loss, which helps the discriminator perform the first stage of small object detection: proposing ROIs. We also adopt classification loss and regression loss for the second stage: predicting object categories and regressing bounding boxes.
While training the generator, we froze the discriminator, calculated the loss of the generator with L G , and updated the generator by backpropagation. Similar to the generator, we also optimized the discriminator while keeping the generator frozen.
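The way the individual terms enter the two objectives can be sketched as plain weighted sums. This is an illustrative reading of the description above rather than a reproduction of Equations (12) and (13), which are not shown in this text: the function names, the dictionary keys, and the choice to sum the three detection terms without individual weights inside the generator objective are assumptions, and the sign convention of the adversarial term is left to the caller.

```python
def generator_loss(losses, lam):
    """Weighted sum for the generator (ASSUMED form, not Equation (12)).

    `losses` holds scalar values for the keys pix, perc, tex, adv, roi,
    cls, loc; `lam` holds the weights pix, perc, tex, adv, det.
    """
    # Centered content loss: pixel-wise + perception + artificial texture.
    content = (lam["pix"] * losses["pix"]
               + lam["perc"] * losses["perc"]
               + lam["tex"] * losses["tex"])
    # Detection loss: ROI + classification + regression (assumed unweighted).
    detection = losses["roi"] + losses["cls"] + losses["loc"]
    return content + lam["adv"] * losses["adv"] + lam["det"] * detection


def discriminator_loss(losses, tau):
    """Weighted sum for the discriminator (ASSUMED form, not Equation (13))."""
    return (tau["adv"] * losses["adv"]
            + tau["roi"] * losses["roi"]
            + tau["cls"] * losses["cls"]
            + tau["loc"] * losses["loc"])
```

In an alternating scheme, `generator_loss` would be back-propagated only through the generator while the discriminator is frozen, and `discriminator_loss` only through the discriminator while the generator is frozen.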

Datasets and Evaluation Metrics
We implemented our model with PyTorch and all the following experiments were performed on a single NVIDIA GeForce RTX 3090 GPU. Table 3 shows our system requirements. Considering the GPU's performance, we experimentally validated our proposed method on the VOC dataset.
The VOC dataset contains 20 object categories including vehicles, households, animals, and others. This dataset has been widely used as a benchmark for object detection tasks [29].
Due to the resolution of the dataset, we exploited original images for the pre-training of the generator. After that, we created selected regions from original images for the pre-training of the discriminator and the training of CMTGAN, respectively. Due to the errors in the baseline selector, the point of interest may not coincide exactly with the center of the target. As shown in Figure 5, we also added a random offset $(x_{offset}, y_{offset})$ from the center of the target while creating selected regions. $(x_{offset}, y_{offset})$ is shown in Equation (14).
$$x_{offset} = \mathrm{random}(10,\ w_{object} \cdot h_{object}), \qquad y_{offset} = \mathrm{random}(10,\ w_{object} \cdot h_{object}) \tag{14}$$
where $w_{object}$ and $h_{object}$ denote the size of the detection target. The function $\mathrm{random}(x_1, x_2)$ returns a random integer from $x_1$ to $x_2$. After that, we took the point of interest as the center and cropped the selected region with a fixed size $size_{selected}$.
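The region-selection step can be illustrated with a short sketch. It assumes that the random integer is drawn inclusively at both ends, takes the upper bound of the offset as a caller-supplied parameter (`max_offset`) instead of deriving it from the object size, and shifts the crop window to stay inside the image at the borders. None of these details are specified in the text, and the function names are hypothetical.

```python
import random


def offset_point(cx, cy, max_offset, rng=random):
    """Shift the target center by a random offset to mimic selector error.

    ASSUMPTION: the lower bound 10 follows Equation (14); `max_offset`
    is supplied by the caller and both bounds are inclusive.
    """
    x_offset = rng.randint(10, max_offset)
    y_offset = rng.randint(10, max_offset)
    return cx + x_offset, cy + y_offset


def crop_box(px, py, size, img_w, img_h):
    """Return (left, top, right, bottom) of a square region centered on
    the point of interest (px, py).

    ASSUMPTION: near the image border the window is shifted inward so
    that it stays inside the image.
    """
    half = size // 2
    left = min(max(px - half, 0), max(img_w - size, 0))
    top = min(max(py - half, 0), max(img_h - size, 0))
    return left, top, min(left + size, img_w), min(top + size, img_h)
```

For example, `crop_box(250, 200, 150, 500, 375)` yields a 150 × 150 window around the point, while a point close to a corner yields a window of the same size pushed against that corner.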
We exploited average gradient (AG), standard deviation (STD), and mutual information (MI) to validate the performance of our generator, in which AG reflects the sharpness of images, STD reflects the quantity of information, and MI denotes the similarity between HR/SR images. Furthermore, we performed small object detection with CMTGAN and some mainstream methods with one-stage frameworks or two-stage frameworks. We divided the objects into small (area < $32^2$), medium ($32^2$ < area < $96^2$), and large objects (area > $96^2$). We focused on the detection of small/medium objects and report the final detection performance with AP.
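The size split can be written as a small helper. This sketch assumes the two thresholds are $32^2$ and $96^2$ pixels, with "small" meaning an area below $32^2$ as in COCO-style evaluation. How objects lying exactly on a threshold are assigned is not stated in the text, so the boundary handling here is an arbitrary choice.

```python
def size_category(width, height, small_max=32 ** 2, medium_max=96 ** 2):
    """Bucket an object by bounding-box area in pixels.

    ASSUMPTION: areas exactly equal to a threshold fall into the larger
    bucket; the text only gives strict inequalities.
    """
    area = width * height
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"
```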

Implementation Details
In the generator, we set the trade-off weights $\lambda_{pix} = 1$, $\lambda_{perc} = 0.006$, $\lambda_{tex} = 2 \times 10^{-8}$, $\lambda_{adv} = \lambda_{det} = 0.001$. In the discriminator, we set the trade-off weights $\tau_{adv} = \tau_{ROI} = \tau_{cls} = \tau_{loc} = 1$. First, we performed the pre-training of the generator and the discriminator. Second, we trained the CMTGAN for image super-resolution and small object detection.
Pre-training of the generator and the $FC_{adv}$ branch of the discriminator: We created HR images in the size of $400^2$ from the VOC dataset and exploited down-sampling to produce LR images at the size of $100^2$. Then, we performed the pre-training on HR and SR images. The generator produces SR images at the size of $400^2$ from LR images, and the $FC_{adv}$ branch outputs the probability of the input being a real HR image. Our generator was trained from scratch. The weights in each layer were initialized with a zero-mean Gaussian distribution with standard deviation 0.02, while the biases were initialized with 0. The backbone network of the discriminator loaded the pre-trained weights of ResNet-50. The weights in the fully connected layer of the $FC_{adv}$ branch were initialized with a zero-mean Gaussian distribution with a standard deviation of 0.1, while the biases were initialized with 0. During the pre-training, the weights and biases in the backbone network of the discriminator were fixed, which makes the discriminator more stable. We adopted the Adam optimizers for the generator and the discriminator, respectively. The learning rates for the optimizers were initially set to 0.0001 and were then reduced to 95% of their previous value after every epoch. We alternately updated the generator and the discriminator networks: we updated the generator every five iterations and updated the discriminator every iteration except on the generator's turn. The pre-training was terminated after 50 epochs, and the states of the network were recorded.
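The learning-rate decay and the alternating update pattern described above can be sketched in a few lines. The sketch assumes that iterations are counted from 1 and that the generator's turn falls on every fifth iteration. The text does not say where in the cycle the generator is updated, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.95):
    """Learning rate after `epoch` completed epochs of 5% decay per epoch."""
    return base_lr * decay ** epoch


def network_to_update(iteration, generator_period=5):
    """Return which network is updated at a 1-based iteration index.

    ASSUMPTION: the generator is updated on multiples of
    `generator_period`; the discriminator on all other iterations.
    """
    if iteration % generator_period == 0:
        return "generator"
    return "discriminator"
```

In PyTorch, the same per-epoch decay would typically be expressed with `torch.optim.lr_scheduler.ExponentialLR` using `gamma=0.95`, stepped once per epoch.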
Pre-training of the discriminator: We pre-trained the $FC_{cls}$ branch and $FC_{loc}$ branch of the discriminator on the selected regions with $size_{selected} = 150$. Similar to the former pre-training, we also fixed the backbone network of the discriminator. The backbone network of the discriminator loaded the pre-trained weights of ResNet-50. The weights in the RPN layers and in the fully connected layers of the $FC_{cls}$ and $FC_{loc}$ branches were initialized with a zero-mean Gaussian distribution with a standard deviation of 0.1, while the biases were initialized with 0. We adopted the Adam optimizer for the discriminator. The learning rate for the optimizer was initially set to 0.0001 and then reduced to 95% of its previous value after every epoch. The pre-training was terminated after 50 epochs, and the states of the network were recorded.
Training for CMTGAN: Finally, we trained CMTGAN on the selected regions. The generator performed super-resolution on the selected regions in the size of $150^2$. The discriminator performed object detection on the SR images in the size of $600^2$, predicting object categories and regressing bounding boxes. The generator and discriminator loaded the pre-trained weights. We adopted the Adam optimizers for the generator and the discriminator, respectively. The learning rates for the optimizers were initially set to $1 \times 10^{-5}$ and then reduced to 95% of their previous value after every epoch. We alternately updated the generator and the discriminator networks: we updated the generator every five iterations and updated the discriminator every iteration except on the generator's turn. The training contains 100 epochs. In the first 50 epochs, layers in the backbone network of the discriminator were fixed. In the following 50 epochs, no layer was fixed.

Performance of Super-Resolution
The generator performed super-resolution on LR images, and the performance is shown in Figure 6. We performed up-sampling with bicubic interpolation on LR images in the size of $100^2$ (Figure 6, row A) and restored images in $400^2$ (Figure 6, row B).
We super-resolved LR images with SPSR [30] and ESRGAN (Figure 6, row C and row D).
At the same time, we exploited CMTGAN without artificial texture loss to generate SR images with a factor of 4 (Figure 6, row E). Furthermore, we exploited CMTGAN with artificial texture loss to generate SR images with a factor of 4 (Figure 6, row F). It is evident that SR images in row E are significantly sharper than restored images in row B. However, SR images in row E contain some abnormal textures, which may cover the original texture information of small objects. Especially in the first image of row E, we can see that the wings are abnormally distorted by artificial textures. SR images in row F contain significantly fewer artificial textures than SR images in row E. The wings in the first image of row F are more realistic than those in row E.
Although SPSR exhibited an impressive performance on images of buildings, images generated by SPSR in row C contain too many artificial textures for small object detection compared to images generated by our method in row E. ESRGAN generated more realistic-looking images in row D compared to SPSR. Images generated by ESRGAN in row D look sharper than images in row E and show extremely clear boundaries. However, due to optical factors, real HR images captured by cameras do not contain such extremely clear boundaries, so these boundaries may interfere with object detection methods. More details are shown in the following experiments.
In summary, the generator of CMTGAN can generate sharper SR images than traditional interpolation methods. There is no significant gap between the generator of CMTGAN and CNN-based methods (e.g., ESRGAN, etc.) in single image super-resolution. Artificial texture loss shows significant suppression of artifacts, which helps the generator keep a balance between original features and super-resolution textures.
Furthermore, we quantitatively analyzed the super-resolution performance of CMTGAN with AG, STD, MI, and inference time. A higher AG means sharper images, and a higher STD means more information in images. MI shows the similarity between HR images and SR/RE images. We collected 54 HR images from the VOC dataset randomly and down-sampled them to the size of $150^2$, as shown in Figure 7. We up-sampled LR images with bilinear interpolation and bicubic interpolation to restore images in the size of $600^2$. The generator of CMTGAN produces SR images with a factor of 4. As shown in Table 4, we calculated AG, STD, and MI of SR/RE images to validate the performance of CMTGAN. Taking into consideration the needs of object detection on inference time, we also recorded the inference time in Table 5.
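For readers unfamiliar with these three measures, a minimal pure-Python sketch on small grayscale images (lists of rows of integer gray levels) is given here. The text does not state the exact formulas used for Table 4, so the definitions in this sketch are commonly used ones and should be read as assumptions: AG as the mean of $\sqrt{(\Delta_x^2 + \Delta_y^2)/2}$ over forward differences, STD as the population standard deviation of pixel values, and MI computed from the joint histogram of gray levels in bits.

```python
import math
from collections import Counter


def average_gradient(img):
    """Mean local gradient magnitude (ASSUMED definition of AG)."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            dx = img[y][x + 1] - img[y][x]
            dy = img[y + 1][x] - img[y][x]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((h - 1) * (w - 1))


def standard_deviation(img):
    """Population standard deviation of all pixel values."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def mutual_information(img_a, img_b):
    """Mutual information in bits from the joint gray-level histogram.

    ASSUMPTION: both images have the same size and discrete gray levels.
    """
    a = [p for row in img_a for p in row]
    b = [p for row in img_b for p in row]
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    return sum(
        (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
```

A flat image gives zero AG and STD, and the MI between an image and a flat image is zero, while the MI between an image and itself equals the entropy of its gray-level histogram.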
According to Table 4, it is clear that SR images generated by CNN-based methods have higher AG and STD than RE images generated by traditional interpolation methods, and images generated by ESRGAN have the best AG and STD. However, a higher AG and STD do not necessarily mean better images. The images generated by SPSR have a better AG and STD than CMTGAN, while they contain too many artificial textures, as shown in Figure 6. These artificial textures increase AG and STD, but also make small objects hard to detect. Therefore, we exploited MI to measure the similarity between HR images and SR/RE images. As shown in Table 4, SR images generated by CMTGAN have the best MI, which means that SR images generated by CMTGAN are the most similar to the original HR images.
According to Table 5, CMTGAN has the shortest inference time among CNN-based super-resolution methods. The generator of CMTGAN takes an average of 10.1 ms to perform super-resolution, which satisfies the needs of object detection. Although it takes more time than traditional interpolation methods, the inference time of CMTGAN is significantly shorter than SPSR and ESRGAN.
In summary, SR images generated by CMTGAN are sharper than images produced by traditional interpolation methods and contain more information. The generator of CMTGAN exhibits a similar super-resolution performance to some state-of-the-art CNN-based methods. SR images generated by CMTGAN are the most similar to the original HR images as compared to images generated by traditional interpolation methods and CNN-based methods. The generator of CMTGAN can perform real-time super-resolution on a single NVIDIA RTX3090, which satisfies the needs of small object detection.
We exploited CMTGAN to detect small objects, as shown in Figure 8. The generator performed super-resolution on the input, which made the images easier to detect objects in. The discriminator proposed ROIs in the first stage, and predicted object categories and regressed bounding boxes in the second stage.

[Figure 8: small object detection with CMTGAN. Panel labels: Input images, Select Regions, Super-Resolution (x4), Stage 1, Stage 2.]
We performed small/medium object detection on selected regions with CMTGAN, and with YOLOv4 and Faster RCNN combined with different up-sampling methods. For YOLOv4, we up-sampled the selected regions from 150 × 150 to 608 × 608 with bilinear interpolation and bicubic interpolation, and from 150 × 150 to 600 × 600 with SPSR and ESRGAN, which is similar to the super-resolution performed by the generator in CMTGAN. For Faster RCNN, we up-sampled the selected regions from 150 × 150 to 600 × 600 with bilinear interpolation, bicubic interpolation, SPSR, and ESRGAN, again similar to the super-resolution in CMTGAN. Then, we applied these methods to object detection.
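As an illustration of the interpolation baseline, the following sketch up-samples a single-channel region by an integer factor with bilinear interpolation, for instance from 150 × 150 to 600 × 600 with a factor of 4. It is a plain NumPy version written for this example, with `bilinear_upsample` a hypothetical name and a half-pixel sampling convention assumed; the experiments may have used a library routine with different conventions, and the 608 × 608 case for YOLOv4 would need a non-integer scale that this sketch does not cover.

```python
import numpy as np

def bilinear_upsample(img, factor=4):
    """Up-sample a 2-D image by an integer factor using bilinear interpolation."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    out_h, out_w = h * factor, w * factor
    # Map output pixel centres back to input coordinates (half-pixel convention),
    # clamping at the borders.
    ys = np.clip((np.arange(out_h) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # vertical weights, shape (out_h, 1)
    wx = (xs - x0)[None, :]   # horizontal weights, shape (1, out_w)
    # Blend the four neighbouring input pixels for every output pixel.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

Because every output pixel is a convex combination of input pixels, this baseline cannot add detail that the LR region lacks, which is consistent with the lower AG and STD reported for interpolation in Table 4.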
As shown in Table 6, CMTGAN has a better performance on small/medium object detection than YOLOv4 (i.e., 20.52% in AP) and Faster RCNN (i.e., 5.27% in AP). Although YOLOv4 combined with ESRGAN achieved a higher AP, its inference time also increased, as shown in Table 7. According to Table 7, YOLOv4 combined with bilinear interpolation has the shortest inference time. The inference time of CMTGAN is longer than that of YOLOv4 combined with bilinear interpolation but significantly shorter than that of YOLOv4 and Faster RCNN combined with CNN-based super-resolution methods. CNN-based super-resolution methods (e.g., ESRGAN, SPSR, etc.) may benefit small object detection, but they also take a long time to super-resolve LR images, which prevents real-time detection. CMTGAN exhibited a better object detection performance than Faster RCNN combined with traditional interpolation, with a similar inference time.

Conclusions
In this paper, we proposed CMTGAN, a new small object detection method based on generative adversarial networks. We introduced artificial texture loss and a centered mask into the generator, with which the generator could create super-resolution images that are easier for small object detection. The artificial texture loss helped the generator to balance the original features and super-resolution textures. The discriminator of our method performed complete two-stage object detection and distinguished real images from fake images, and it can be adapted to other GANs for detection tasks. The experimental results showed that, compared with the existing methods, the generator of CMTGAN could generate sharper super-resolution images with more information. CMTGAN had an obvious advantage in small/medium object detection.
In future work, we will focus on eliminating the baseline selector. Although CMTGAN has a similar inference time to Faster RCNN, there is still a significant difference between YOLOv4 and CMTGAN in inference time. We will investigate how to optimize the architecture of CMTGAN to perform more efficient object detection. Furthermore, we will investigate the generation of artifacts to achieve a better performance.