Selecting Post-Processing Schemes for Accurate Detection of Small Objects in Low-Resolution Wide-Area Aerial Imagery

In low-resolution wide-area aerial imagery, object detection algorithms are categorized as feature extraction and machine learning approaches, where the former often requires a post-processing scheme to reduce false detections and the latter demands multi-stage learning followed by post-processing. In this paper, we present an approach for selecting post-processing schemes for aerial object detection. We evaluated combinations of each of ten vehicle detection algorithms with each of seven post-processing schemes, where the best three schemes for each algorithm were determined using the average F-score metric. The performance improvement is quantified using basic information retrieval metrics as well as the classification of events, activities and relationships (CLEAR) metrics. We also implemented a two-stage learning algorithm using a hundred-layer densely connected convolutional neural network (DenseNet) for small object detection and evaluated its degree of improvement when combined with the various post-processing schemes. The highest average F-scores after post-processing are 0.902, 0.704 and 0.891 for the Tucson, Phoenix and online VEDAI datasets, respectively. The combined results show that our enhanced three-stage post-processing scheme achieves a mean average precision (mAP) of 63.9% for feature extraction methods and 82.8% for the machine learning approach.


Introduction
In object detection, small objects are regarded as those taking up less than 1% of an image when measuring the area of objects with a bounding box [1]. Detecting small objects in low-resolution wide-area aerial imagery is a challenging task, especially in remote sensing applications [2–15]. For example, in widely recognized datasets such as VEDAI, DOTA, MS COCO, Potsdam, and Munich [1,16–18], there is particular interest in locating small objects with a size of less than 32 × 32 pixels. In low-resolution aerial datasets, small objects such as cars or trucks usually cover 20 to several hundred pixels in area, which renders the detection task in an aerial video rather difficult. Detecting small objects in an aerial video is important in many practical applications, including urban traffic management, visual surveillance, local parking monitoring, military target strike, and emergency rescue [3,4,12,14,16,18]. Widely recognized as a special case of object detection, detecting small objects in aerial imagery is challenging due to issues such as the high density of objects within a small area, shadow from clouds, and partial occlusion.

The main contributions of this paper are summarized as follows. (i) We designed three sets of experiments to measure the degree of performance improvement of each of the seven post-processing schemes for each of the eleven algorithms on vehicle detection. The first set of experiments determined the best three post-processing schemes using two aerial image datasets (Tucson and Phoenix). The second set of experiments decided the best post-processing scheme for each of the ten algorithms. The third set of experiments conducted verification using two groups of selected post-processing methods for the 11th algorithm on a third public aerial dataset (online VEDAI).
(ii) We adapted a two-stage learning scheme by applying DenseNet to perform the detection task and combined it with our post-processing scheme to improve object detection accuracy. Quantitative results demonstrated its viability and efficiency for small object detection on a low-resolution wide-area aerial dataset (the online VEDAI dataset), which is not limited to small cars and trucks.
(iii) We measured the degree of improvement on object detection performance in addition to a time-savings evaluation. We adjusted the parameters for learning and validation to achieve the highest accuracy without any post-processing, then applied the selected post-processing and measured the speed-up factor using Google Colab Pro.
The remainder of this paper is organized as follows. In Section 2, we present some related work on traditional and current state-of-the-art schemes for small object detection in the remote sensing domain. In Section 3, we present a concise summary of the existing schemes as well as our proposed schemes for post-processing, and some key details on DenseNet, in addition to its architecture adapted for two-stage learning. Section 4 includes a discussion of our experiments along with quantitative and qualitative results. Section 5 discusses the results of our research study. Finally, Section 6 presents concluding remarks and future research directions.
To compare the degree of performance improvement of post-processing schemes for the VMO-based algorithm [28], we applied the three-stage scheme [13] in contrast to each of five post-processing schemes [4,22–25], and analyzed their performance. In our first set of experiments, each of the ten algorithms [25,27–35] was combined with each of the post-processing schemes [4,13,14,20,22,23,25] to determine the best three post-processing schemes. The second set of experiments determined the best post-processing scheme for traditional approaches to small object detection in aerial image datasets. A third set of experiments assessed the hundred-layer Tiramisu code using DenseNet [39], with learning parameters optimized for the online VEDAI dataset, in combination with various post-processing schemes.
Given that the established fully convolutional (FC) DenseNet models were trained with neither extra data nor post-processing, we adapted this scheme with its initial learning rates, performed similar regularization [39], and enhanced the global accuracy using our selected post-processing. We also studied several other recent schemes for small object detection.

Post-Processing Schemes
We previously derived a two-stage post-processing method, sieving and closing (S&C) [4], to reduce false detections. In the first stage, we perform area thresholding, which sieves out any detected object whose area falls outside a designated range. We used a pixel-area range of (5, 160) for the Tucson dataset and (5, 180) for the Phoenix dataset, because the spatial resolutions are slightly different. In the second stage, because some detection errors persist even after applying area thresholds, a morphological closing operation is performed to connect adjacent tiny objects, which tend to be false detections. This step also offers boundary smoothing and fills small holes inside each detection. The size of the morphological closing filter [50] is flexible with respect to the criterion of achieving the highest overall average F-score. All the binary objects within an area range encompassing most small cars and trucks are preserved, but fixed thresholding does not generalize to datasets with different resolutions [13].

Inspired by sieving and closing [4], we derived the three-stage (3Stage) post-processing scheme [13], where Stage 1 applies area thresholding to drop some incorrect detections, Stage 2 performs the morphological closing operation, and Stage 3 applies conditional object sieving with respect to a compactness measure for vehicle shape. This scheme overcame the shortcomings of using an expected vehicle size and fit datasets with various spatial resolutions [13]. In Stage 3, the compactness C of a region is defined as [26]

C = L² / (4πA),

where L represents the perimeter of the region and A is the area of the region. For our data, we define the lower compactness threshold as half of the smallest compactness, and the upper threshold as twice the largest compactness. To avoid discarding distorted small vehicles that are highly likely to be true detections, we retained detections with compactness in the range [C_small/2, 2 × C_large] [4,13].
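The paper's post-processing was implemented in MATLAB; a minimal Python sketch of the three stages is shown below. The area range, closing filter size, and compactness bounds are illustrative placeholders rather than the tuned values from the paper, and the normalization C = L²/(4πA) is assumed for the compactness measure.

```python
import numpy as np
from scipy import ndimage

def three_stage(mask, area_range=(5, 160), close_size=3, c_range=(0.5, 8.0)):
    """Binary detection map -> cleaned map (all thresholds illustrative)."""
    # Stage 1: area sieving; drop components whose pixel area is out of range.
    lab, n = ndimage.label(mask)
    areas = ndimage.sum(mask > 0, lab, index=np.arange(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = (areas >= area_range[0]) & (areas <= area_range[1])
    mask = keep[lab]
    # Stage 2: morphological closing to merge fragments and fill small holes.
    mask = ndimage.binary_closing(mask, structure=np.ones((close_size, close_size)))
    # Stage 3: conditional sieving by compactness C = L^2 / (4*pi*A).
    lab, n = ndimage.label(mask)
    out = np.zeros_like(mask)
    for idx, sl in enumerate(ndimage.find_objects(lab), start=1):
        region = lab[sl] == idx
        area = region.sum()
        # Crude perimeter estimate: region pixels touching the background.
        perim = (region & ~ndimage.binary_erosion(region)).sum()
        c = perim ** 2 / (4.0 * np.pi * area)
        if c_range[0] <= c <= c_range[1]:
            out[sl] |= region
    return out
```

Stage 3 loops over connected components so that each blob is accepted or rejected independently, mirroring the per-object sieving described above.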
The enhanced three-stage (Enh3Stage) post-processing scheme [20] was derived to achieve more independence with respect to object size [33]. In Stage 1, a 3 × 3 median filter is applied. In Stage 2, an opening operation is applied to sieve out trivial false detections, followed by a closing operation. Finally, Stage 3 applies linear Gaussian filtering followed by non-maximum suppression (NMS) to discard multiple false detections near a single object.
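Staying in Python for illustration, the three Enh3Stage stages can be sketched as below. The Gaussian sigma and NMS window size are assumed values, and the NMS here is a simple local-maximum test on the smoothed response, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage

def enh3stage(mask, nms_size=7):
    # Stage 1: 3x3 median filter removes isolated (salt-and-pepper) detections.
    m = ndimage.median_filter(mask.astype(np.uint8), size=3)
    # Stage 2: opening sieves out trivial blobs, closing reconnects fragments.
    m = ndimage.binary_opening(m, structure=np.ones((3, 3)))
    m = ndimage.binary_closing(m, structure=np.ones((3, 3)))
    # Stage 3: Gaussian response map plus non-maximum suppression, so a cluster
    # of nearby responses collapses to a single detection peak.
    resp = ndimage.gaussian_filter(m.astype(float), sigma=2.0)
    local_max = ndimage.maximum_filter(resp, size=nms_size)
    peaks = (resp == local_max) & (resp > 0)
    return m, np.argwhere(peaks)
```

The function returns both the cleaned binary map and the surviving peak coordinates, one per suppressed neighborhood.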
We also developed a two-stage spatial processing scheme (SpatialProc) [14], where the first stage used multi-neighborhood hysteresis thresholding. The second stage of spatial processing is identical to Stage 2 of the Enh3Stage post-processing scheme [20], while the filter sizes of the opening and closing operations were carefully adjusted in accordance with the spatial resolution of the datasets.
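The first-stage hysteresis thresholding can be sketched as follows; this is a plain two-threshold version, since the paper's multi-neighborhood variant and its thresholds are not reproduced here, so `low` and `high` are placeholders.

```python
import numpy as np
from scipy import ndimage

def hysteresis_threshold(img, low, high):
    """Keep weak pixels only when their component contains a strong pixel."""
    weak = img >= low
    strong = img >= high
    lab, n = ndimage.label(weak)        # connected components of the weak mask
    if n == 0:
        return np.zeros_like(weak)
    seeded = np.unique(lab[strong])     # labels that contain a strong seed
    keep = np.zeros(n + 1, dtype=bool)
    keep[seeded[seeded > 0]] = True
    return keep[lab]
```

Weak responses that are not connected to any strong seed are discarded, which is what allows a lower threshold without flooding the output with false detections.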

Experiments and Results
We evaluated quantitative detection performance in four scenarios: (i) six detection algorithms combined with the 3Stage scheme, and each of these six with filtering by shape index (SI); (ii) seven post-processing schemes, each combined with each of the ten object detection algorithms on the Tucson and Phoenix datasets, to determine the best three post-processing schemes for each algorithm; (iii) the ten algorithms, each combined with the three post-processing schemes out of seven determined in (ii), along with a visual comparison of the vehicle detection results; (iv) a two-stage machine learning approach with parameters adjusted to achieve the highest overall accuracy on small object detection in the online VEDAI dataset, using an updated flowchart with an optimally tuned ratio of training to testing folders and an updated FC-DenseNet structure, combined with each of the three post-processing schemes determined in (iii). Figure 1 depicts the pipeline of our research study on selecting post-processing schemes for the detection of small objects.
Figure 1. Pipeline of the proposed research study of small object detection. Labeled outputs of Type 1 are detection results of using ten feature extraction algorithms followed by each of the seven post-processing schemes. Labeled outputs of Type 2 correspond to an updated multi-stage learning approach (FC-DenseNet) with selected post-processing schemes.

Experimental Setup
We conducted our experiments using MATLAB R2019b on a Windows PC (Intel Core i7-8500U CPU at 1.80 GHz, 16 GB RAM). The average computation time per frame for each detection algorithm was found by dividing the total processing time of each algorithm, before and after combining the selected post-processing scheme, by the total number of frames in the experiment. Meanwhile, we applied the vehicle tiramisu code (Python version) [39] for image segmentation in the Google Colab Pro environment; after saving the outputs in a .mat file, we ran our post-processing code in MATLAB for each of the three datasets.
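The timing arithmetic above reduces to two trivial helpers; `speedup_factor` is a hypothetical name for the before/after per-frame time ratio described in the text.

```python
def avg_time_per_frame(total_seconds, n_frames):
    # Total wall-clock time of one detector run divided by the frame count.
    return total_seconds / n_frames

def speedup_factor(time_before, time_after):
    # Ratio > 1 means the "after" configuration is faster per frame.
    return time_before / time_after
```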

Two versions were implemented for each of the ten detection algorithms [14]: (1) using our prior binarization methods [4,20] for simple thresholding, and (2) replacing the binarization step with the proposed spatial processing. We refer to these two approaches as "before" and "after" post-processing in our experimental analysis, respectively. We compared the labeled outputs of the "before" and "after" versions to quantify the degree of improvement in object detection performance resulting from the proposed post-processing.

Datasets
Two aerial videos (spatial resolution: 720 × 480 pixels per frame), obtained from a low-resolution camcorder, served as our datasets for performance analysis of the ten vehicle detection algorithms [25,27–35], each combined with each of the five post-processing schemes [4,22–25], the three-stage scheme based on the compactness measure [13], and the enhanced three-stage scheme [20]. In our manual segmentation, the total number of ground truth (GT) vehicles is 8072 (4012 in the Tucson dataset and 4060 in the Phoenix dataset). All these vehicles have an approximately rectangular shape, and the vehicle area ranges from 40 to 150 pixels in the Tucson dataset and from 20 to 175 pixels in the Phoenix dataset. Automatic detections from any combination of an algorithm with or without a post-processing scheme are compared with the GT vehicles in each frame.
We used the online VEDAI dataset [52] as a third dataset (spatial resolution: 512 × 512 pixels per frame), which comprises nine classes of objects; excluding buses, boats, and planes, this study concerns small vehicles such as cars, pickups, tractors, camping cars, trucks and vans [46]. The online VEDAI dataset contains 1246 images with a total of 3600 instances across these six classes. The portion of this dataset used for training was varied from 80% to 95% in increments of 5%, and then 98% as the highest setting; the remainder was used for testing.

Classifications and Evaluation Metrics
We evaluated each algorithm [25,27–35] for small object detection by automatically classifying the correct detections and each type of detection error. The binary detection outputs were processed with 8-connected component labeling, and the overlaps between detections and the ground truth were evaluated similarly to the region matching proposed by Nascimento and Marques [53], in which the detections are characterized as correct detections (TPs), misses (FNs), false positives (FPs), splits and merges. To quantify the performance of each algorithm with and without post-processing, we use basic information retrieval (IR) metrics [3,4,40]:

Precision = TP / (TP + FP), Recall = TP / (TP + FN),

F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall). (4)

When β = 1, the F_β measure equals the F1-score, a harmonic mean of precision and recall. Substituting the expressions of precision and recall in terms of TP, FP and FN into Equation (4), the simplified expression of the F1-score can be written as

F1 = 2TP / (2TP + FP + FN).

We also report the percentage of wrong classifications,

PWC = 100 × (FN + FP) / (TP + FN + FP + TN).

An algorithm with a lower PWC score indicates better detection results.
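For concreteness, these IR metrics can be computed directly from raw detection counts; the helper below simply encodes precision, recall, the F_β measure, the simplified F1, and PWC.

```python
def ir_metrics(tp, fp, fn, beta=1.0, tn=0):
    """Basic IR metrics from detection counts; tn defaults to 0 because the
    aerial datasets discussed here contain no true negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    # Simplified F1 obtained by substituting precision and recall:
    f1 = 2 * tp / (2 * tp + fp + fn)
    pwc = 100.0 * (fn + fp) / (tp + fp + fn + tn)
    return precision, recall, f_beta, f1, pwc
```

For example, with 80 TPs, 20 FPs and 20 FNs, precision, recall and F1 all equal 0.8, and the F_β measure with β = 1 coincides with F1.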
To assess the comprehensive performance of each algorithm before and after combining with a post-processing scheme, we adopted the CLEAR metrics [41,42] for quantitative evaluations. If we denote the number of misses (FNs) as m_i and the number of FPs as fp_i, then the multiple object detection accuracy (MODA) in the i-th frame (i = 1, . . . , 100 in each dataset) is computed as [41]

MODA(i) = 1 − (c_m × m_i + c_f × fp_i) / N_G^(i),

where c_m and c_f represent the weights applied to the FNs and FPs, respectively, and N_G^(i) is the number of ground-truth objects in the i-th frame.
We equally weight c_m = c_f = 1 [41] and sum over all frames in each dataset [42], yielding the multiple object count (MOC),

MOC = 1 − Σ_i (m_i + fp_i) / Σ_i N_G^(i).

Since no negative samples exist in the ground truth of the Tucson and Phoenix datasets, the detection outputs for these datasets do not contain any true negatives (TNs). Meanwhile, splits (S) and merges (M) form a second type of detection error; they were not counted in either MODA or MOC, as there has been no final agreement on how to weight splits or merges in previous publications [40,42,53,54].
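A minimal sketch of MODA and its sequence-level aggregation follows; the aggregate below, which sums misses, false positives and ground-truth counts over all frames, is an assumption consistent with the text rather than a verbatim transcription of [41,42].

```python
def moda(misses, false_pos, n_gt, c_m=1.0, c_f=1.0):
    # Per-frame multiple object detection accuracy.
    return 1.0 - (c_m * misses + c_f * false_pos) / n_gt

def moc(miss_list, fp_list, gt_list):
    # Sequence-level aggregate with equal weights: sum errors and ground-truth
    # counts over all frames before normalizing.
    return 1.0 - (sum(miss_list) + sum(fp_list)) / sum(gt_list)
```

With 2 misses and 3 false positives against 10 ground-truth objects, the per-frame MODA is 0.5; repeating that frame twice leaves the aggregate unchanged.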
In the online VEDAI dataset, we used the accuracy metric to evaluate object detection performance, which is the ratio of the number of correct predictions to the total number of predictions [55,56]. For binary classification, the accuracy metric equals the percentage of correctly classified instances, which is expressed as [40]

Accuracy = (TP + TN) / (TP + FN + FP + TN), (9)

where the accuracy metric numerically matches 1 − PWC. Average precision (AP) is defined as the area under the precision-recall curve [57–59],

AP = ∫₀¹ P(R) dR,

where P = Precision and R = Recall. The mean average precision (mAP) is a global performance measure defined as the sample mean of AP [59].

We started our experimental study by evaluating six algorithms [27–32] combined with the 3Stage scheme [13] or filtering by SI [24]. We carefully adjusted the parameters of each algorithm to achieve the highest average F-score.

Table 1. F-score of each method for the two datasets.

Table 1 shows that our 3Stage scheme has a better average F-score than filtering by SI for all six algorithms; hence, we decided to replace filtering by SI with our 3Stage scheme in subsequent experiments involving combinations of post-processing schemes with each of the ten detection algorithms.
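The accuracy, AP and mAP measures defined earlier in this subsection can be sketched as below; AP is computed here by trapezoidal integration over a given recall grid, which approximates, but is not identical to, the interpolated AP used by common detection benchmarks.

```python
def accuracy(tp, tn, fp, fn):
    # Equation (9): fraction of correctly classified instances.
    return (tp + tn) / (tp + tn + fp + fn)

def average_precision(recall, precision):
    # Trapezoidal integration of P(R); assumes recall is sorted ascending.
    ap = 0.0
    for i in range(1, len(recall)):
        ap += (recall[i] - recall[i - 1]) * (precision[i] + precision[i - 1]) / 2.0
    return ap

def mean_average_precision(ap_per_class):
    # mAP as the sample mean of per-class AP values.
    return sum(ap_per_class) / len(ap_per_class)
```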

Seven Post-Processing Schemes Combined with Ten Algorithms
We randomly divided the 100 frames in each of the Tucson and Phoenix datasets into ten groups of ten frames each. Post-processing (the 3Stage scheme [13] and sieving and closing (S&C) [4]) combined with LC [27], VMO [28] and FDE [34] was selected to calculate average F-scores over the 100 frames in each of the two datasets. Ten groups of average F-scores were thus obtained, where each group contains ten frames. Table 2 shows the mean and standard deviation for each group, along with the 95% confidence intervals (CIs); each group of data has a tight CI along with a considerably small standard deviation.

Table 2. Statistical test on average F-score (10 × 10 frames): two post-processing schemes associated with three automatic detection algorithms.

We evaluated the ten feature extraction-based vehicle detection algorithms combined with each of the seven post-processing methods [4,13,14,20,22,23,25] (marked as M1 to M7) for all 200 frames in the Tucson and Phoenix datasets. The average F-scores are tabulated in Table 3, where the best three overall F-scores in each column (boldface numbers) were selected for further comparison.
From Table 3, we count the number of boldface entries in each row to determine that the Enh3Stage scheme (M6) obtained the highest number of votes (nine), S&C (M5) and SpatialProc (M7) both obtained six votes, the 3Stage scheme (M3) had four votes, and each of the other three schemes (M1, M2, M4) received no more than two votes.

Table 3. Average F-score for seven post-processing methods combined with each of the ten algorithms for the Tucson (T) and Phoenix (P) datasets.

Ten Algorithms Combined with the Best Three Post-Processing Schemes
We further present the performance analysis before and after post-processing, where the types of detections were analyzed for all 8072 vehicles in the Tucson and Phoenix datasets. Table 4 shows the quantitative results for each of the ten algorithms before and after combining with each of their best three post-processing schemes (as highlighted with bold numbers in Table 3). Note that M0 refers to no post-processing applied to an algorithm in Table 4 and the subsequent tables. As observed from Table 4, among the nine feature extraction-based algorithms combined with the Enh3Stage scheme (M6), VMO and MSSS display the highest and second highest TP counts; among the six algorithms with SpatialProc (M7), LC and MSSS show the highest and second highest TP counts; among the five algorithms with S&C (M5), LC and VMO represent the best two in TP counts. Among the tabulated ten detection algorithms combined with any type of post-processing, MSSS with FiltDil (M1) presents the largest TP count, while MF with S&O (M4) shows the lowest TP count. Regarding FNs (misses), MSSS with FiltDil (M1) and VMO with the Enh3Stage scheme (M6) show the smallest and second smallest FN counts. Regarding FPs, the lowest FP count was found for FDE with S&C (M5); the best reduction of FPs was shown by MF with S&O (M4), displaying a 93.2% decrease, and the least reduction was found for LC with S&C (M5), displaying a 50.7% decrease. Meanwhile, all the post-processing schemes in Table 4 reduced splits (S), where the best reduction was achieved by PAE with the 3Stage scheme (M3), and the least reduction was obtained by FT with FiltDil (M1). All seven post-processing schemes, for any of the ten algorithms, improve the average F-score at the mild cost of converting a small portion of TPs to merges. For reducing FPs and splits, each post-processing scheme has quite similar performance when detecting these small vehicles in our aerial datasets.
We used basic IR metrics [40] to evaluate the performance of the ten detection algorithms before and after applying post-processing. The average precision, recall and F-score of each algorithm before using any post-processing are depicted in Figure 2. Regarding precision and F-score, the highest values were achieved by LC in the Tucson dataset, while the lowest corresponded to CV (KFCM-CV) in the Tucson dataset and FICA in the Phoenix dataset. Regarding recall, the rates of eight algorithms (all except FDE and PAE) are higher than 0.9 in the Tucson dataset, while only two algorithms (LC and MSSS) achieved recall rates higher than 0.9 in the Phoenix dataset.

The detection results of the ten algorithms associated with the voted best three post-processing schemes were arranged into four groups: those algorithms with the Enh3Stage scheme (M6), those with S&C (M5), those with SpatialProc (M7), and those with any of the other four schemes (M1 through M4). The precision, recall and F-score of each of the ten algorithms after combining with the best three post-processing schemes are displayed in Figure 3, which consists of four sub-diagrams. The results for the two datasets were merged to simplify the visual comparison in Figure 3.

Regarding precision and F-score, when post-processed by the Enh3Stage scheme (M6), FDE and CV ranked the highest and lowest among the eight algorithms, respectively; the same conclusion was drawn when post-processed by S&C (M5), while the highest and lowest results were displayed by LC and CV when post-processed by SpatialProc (M7). For the other post-processing schemes, the highest scores were achieved by FDE with the 3Stage scheme (M3), and the lowest scores were obtained by FICA with HeurFilt (M2). Regarding recall, seven algorithms (all except MF and FICA) preserved recall rates above 0.7 when post-processed by the Enh3Stage scheme (M6) or any of the other four schemes (M1 through M4); five algorithms (all except CV) retained recall rates above 0.7 when post-processed by either S&C (M5) or SpatialProc (M7). The best three precision and average F-score results were achieved by FDE with S&C (M5), FDE with the 3Stage scheme (M3), and FDE with the Enh3Stage scheme (M6), while the best three recalls were achieved by MSSS with FiltDil (M1), VMO with S&C (M5), and VMO with the Enh3Stage scheme (M6).

Figure 3. Precision, Recall and F-score for ten feature extraction-based algorithms each combined with their best three post-processing schemes.
As seen in Figure 3, the improvements in precision and average F-score for all ten vehicle detection algorithms supported the validity of our post-processing selection, while the mild decrease in recall for each algorithm resulted from the trade-off of losing a small portion of TPs. Due to the extremely low resolution of the wide-area aerial frames, none of the algorithms combined with post-processing achieved an average F-score higher than 0.9.

Table 5 displays the percentage of wrong classifications (PWC) for our quantitative results on the ten feature extraction-based algorithms with their three best post-processing schemes. The best outcome was achieved by FDE with the Enh3Stage scheme (M6), which resulted in a PWC of 18.2% in the Tucson dataset. The smallest improvement was obtained by LC with S&C (M5), where the PWC was reduced from 62.4% to 58.9% in the Phoenix dataset. In the Tucson dataset, the PWC scores of nine algorithms (excluding KFCM-CV) were reduced to below 50%; in the Phoenix dataset, only VMO with any of its three best post-processing schemes (M3, M5 and M6) and FDE with S&C (M5) decreased to below 50%. Regarding the degree of improvement, the best three post-processing schemes for each of nine algorithms (all except VMO) performed better in the Tucson dataset than in the Phoenix dataset. In contrast, only VMO, when post-processed by the 3Stage scheme (M3) or S&C (M5), exhibited better scores in the Phoenix dataset than in the Tucson dataset.

Table 5. PWC for ten algorithms before and after combining with their best three post-processing schemes.
Table 6 presents quantitative results for each algorithm without any post-processing and combined with each of the voted post-processing schemes, measured by MODA and MOC from the CLEAR metrics [41,42], where 95% CIs are given for each metric.

From Table 6, we conclude that in the Tucson dataset, FDE with the Enh3Stage scheme (M6) has the highest MODA, KFCM-CV with SpatialProc (M7) shows the largest improvement, KFCM-CV with S&C (M5) has the lowest MODA, and LC with S&C (M5) has the smallest improvement. In the Phoenix dataset, FDE with S&C (M5) shows the highest MODA, FICA with the Enh3Stage scheme (M6) the largest improvement, and KFCM-CV with S&C (M5) the lowest MODA, while LC with the 3Stage scheme (M3) has the smallest improvement. Checking the corresponding MOC indices of each method before and after combining with any of the three best post-processing schemes, every numerical value closely coincides with the related sample mean of the MODA.

FC-DenseNet: Two-Stage Machine Learning for Small Object Detection
In this subsection, we present a third set of experiments to assess our two-stage machine learning approach (i.e., the FC-DenseNet model and its variations) for small object detection. The limitations of our previous study, and our plan for addressing them, are summarized as follows: (i) While our investigation included comprehensive tests voting for the best three post-processing schemes for small object detection using two low-resolution wide-area aerial datasets, the number of frames was relatively small, so the results may not be universally applicable. Hence, the online VEDAI dataset, with both training and testing folders, was used for our additional experiments. (ii) In addition to saliency detection, the four "pillar" techniques for small object detection have been specified as multi-scale representation, contextual information, super-resolution-based techniques and region proposals [1]; however, the ten algorithms [25,27–35] adapted for small object detection did not involve region proposals, i.e., deep CNN-based semantic image segmentation, for which it was reported that convincing object detection accuracy could also be achieved on similar urban scene datasets such as CamVid and Gatech [39]; hence, this keynote approach should also be included for verification. (iii) Regarding the voted best three schemes (Enh3Stage, SpatialProc, S&C), a subsequent comparison is necessary when handling a different online available aerial dataset; meanwhile, the other four schemes (filtered dilation, heuristic filtering, 3Stage, sieving and opening), which received fewer votes in prior tests, should not be completely excluded from another set of test scenarios when applying post-processing. As depicted in Figure 1, we performed tests to measure the degree of improvement from each post-processing scheme associated with the two-stage machine learning approach based on the updated FC-DenseNet.
We tabulated the numerical results from several sets of tests, and we further evaluate the method by selecting the best candidate scheme to improve small object detection accuracy while varying the ratio of training, validation, and testing data for the online VEDAI dataset.
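The scheme-voting step described above, ranking post-processing schemes by average F-score, can be sketched as follows. This is an illustrative sketch, not the paper's code: the helper names (`f_score`, `top_k_schemes`) and the score values in the example table are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def top_k_schemes(avg_fscores, k=3):
    """Rank post-processing schemes by average F-score, best first."""
    return sorted(avg_fscores, key=avg_fscores.get, reverse=True)[:k]

# Illustrative (not actual) average F-scores for one detection
# algorithm across five post-processing schemes.
scores = {"Enh3Stage": 0.902, "SpatialProc": 0.874, "S&C": 0.861,
          "3Stage": 0.845, "FiltDil": 0.810}
best3 = top_k_schemes(scores)
```

Repeating this ranking for each detection algorithm and counting how often each scheme lands in a top-3 list yields the votes used in the first set of experiments.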

Some additional tests are presented as follows: (i) The vehicle tiramisu code [39] for semantic segmentation was implemented on the online VEDAI dataset (size: 512 × 512), where the portion of training data was progressively tuned from 80% to 95% in 5% steps, and then to 98% as the highest (with the rest for testing), with a fixed portion of 1% (valid_pct = 0.01) for validation. (ii) The two-stage learning within ten epochs was applied to acquire the detection accuracy and to draw the curve of learning rate versus loss, where parameters such as weight decay (wd), learning rate (lr), and the optimal threshold were set to default values, and pct_start was initialized as 0.3/0.7. (iii) FC-DenseNet (with 103 layers and its alternatives) was adopted with a relatively smaller learning rate (selected from the curve in the right plot of Figure 5) to proceed with the updated two-stage learning, where the final outputs include the updated object detection accuracy with predicted detection labels. (iv) Final tests were designed to pick four post-processing schemes (the three voted schemes plus the relatively best one of the other four) to further evaluate the degree of improvement on the metrics of accuracy and mAP, and then to determine the best post-processing scheme (among the seven) for the online VEDAI dataset [1,52,55].
Figure 5. Sample results of the two-stage learning approach and the generated curve to find an optimal learning rate (after parameter tuning) on semantic image segmentation for online VEDAI.
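The split protocol in step (i) can be sketched as below. The image count of 1000 is purely illustrative (not the actual VEDAI size), and `split_counts` is a hypothetical helper; only the percentages follow the text.

```python
def split_counts(n_images, train_pct, valid_pct=0.01):
    """Return (train, valid, test) image counts for one split.

    Mirrors the protocol described above: a fixed 1% validation
    portion, a tunable training portion, and the remainder for test.
    """
    n_valid = round(n_images * valid_pct)
    n_train = round(n_images * train_pct)
    n_test = n_images - n_train - n_valid
    return n_train, n_valid, n_test

# Training portion swept from 80% to 95% in 5% steps, then 98%.
for pct in (0.80, 0.85, 0.90, 0.95, 0.98):
    print(pct, split_counts(1000, pct))  # 1000 is an illustrative count
```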
Regarding the CNN-related architecture of the fully convolutional (FC) DenseNet, with its four-layer dense block and building blocks as in [39], the connectivity pattern differs between the upsampling and downsampling paths: feature maps are generated at each of the four layers, the block output is formed by concatenating each layer's output, and the crucial parameter information is excerpted from every layer of the model. Crucial kernels such as Transition Down and Transition Up are combined in the final output.
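The concatenation pattern of the four-layer dense block can be simulated with plain numpy arrays, as a rough sketch only: the input channel count (48), growth rate (16), and random "layers" are illustrative stand-ins, not the exact FC-DenseNet configuration from [39].

```python
import numpy as np

def dense_block(x, n_layers=4, growth=16, rng=None):
    """Simulate the concatenation pattern of a 4-layer dense block."""
    rng = np.random.default_rng(0) if rng is None else rng
    inputs, layer_outputs = x, []
    for _ in range(n_layers):
        # Stand-in for BN + ReLU + 3x3 conv: maps `inputs.shape[0]`
        # channels down to `growth` new feature maps.
        new_maps = rng.standard_normal((growth,) + inputs.shape[1:])
        layer_outputs.append(new_maps)
        # Each subsequent layer sees the input plus all prior outputs.
        inputs = np.concatenate([inputs, new_maps], axis=0)
    # Block output: concatenation of the four layer outputs only.
    return np.concatenate(layer_outputs, axis=0)

x = np.zeros((48, 32, 32))   # (channels, H, W) entering the block
out = dense_block(x)          # 4 layers x growth 16 = 64 feature maps
```

With these illustrative numbers, the block emits 4 × 16 = 64 feature maps regardless of the input channel count, which is what makes the dense connectivity parameter-efficient.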
A subgroup of eight sample frames was loaded in the implementation of the two-stage learning approach for small object detection, where ten categories of objects may co-exist: terrain, car, truck, tractor, camping car, van, pickup, bus, boat, and plane. We present the initial results of the Google Colab program outputs after tuning each parameter in Figure 5, where we found that the highest accuracy achieved was 84.0%, and a better alternative learning rate (less than 0.001), close to the numerical point of 0.0001, appeared by the end of the second-stage learning.
The FC-DenseNet model [39] with 103 layers was applied to perform our updated two-stage learning, where the related parameters were selected as follows: wd = 0.01, lr = 0.0001, valid_pct = 0.01, and pct_start = 0.3/0.7. Figure 6 shows the updated results and samples of predicted GT objects, where the highest accuracy of 86.0% was reached in Epoch 6 of the second-stage learning, and the predicted labels of small objects match the ground truth. Given the results obtained from the online VEDAI dataset as mentioned above, the average detection accuracy is 78.2% for the initial second-stage learning and 85.5% for the updated second-stage learning, confirming FC-DenseNet 103 as the better model for small object detection.

Final Tests to Evaluate the Two-Stage Learning Approach with Post-Processing for Online VEDAI
Considering the low resolution and wide area of the online VEDAI dataset, we designed our final tests to select the best of the seven post-processing schemes. The experimental study was conducted in four scenarios. (i) Vote for the scheme with the highest accuracy score among the other four post-processing schemes (those with fewer votes from the first set of experiments). (ii) Take the scheme voted in (i) as the fourth scheme along with the best three schemes, apply each post-processing scheme after the updated two-stage learning, and evaluate the differences in the detection outputs, using accuracy and mAP as two metrics for quantitative comparison. (iii) Conduct a sensitivity analysis to evaluate the balance between object detection accuracy, algorithm complexity, and time cost when applying each of the four post-processing schemes and varying the number of convolutional layers in FC-DenseNet. (iv) Combine the scores obtained from (i) to (iii) to determine the best post-processing scheme for online VEDAI. Note that the size and resolution of the sample frames were adjusted before applying any post-processing scheme.
Applying each of the four schemes (FiltDil (M1), HeurFilt (M2), 3Stage (M3), S&O (M4)) to post-process the detection output after the updated two-stage learning, the detection accuracy scores are shown in Table 7, where three different training-to-testing ratios on the online VEDAI dataset were considered. While applying M4 reduced accuracy, each of the other three schemes mildly increased the scores, and the relatively best results were obtained with our 3Stage scheme (M3) across the different training portions.

Table 7. Performance evaluation of the accuracy metric for each of the four unvoted post-processing schemes associated with the initial detection after updated two-stage learning (FC-DenseNet 103).
We adopted the mean average precision (mAP) metric on ten sample frames from online VEDAI to measure the post-processing results of FiltDil (M1) and the 3Stage scheme (M3), where the results from five different training-to-testing ratios are displayed in Table 8. We determined that the 3Stage scheme may perform better than filtered dilation, since it yields the higher mAP (79.54% versus 75.33%) in the last column. Hence, we chose our 3Stage scheme (M3) as the fourth scheme, in addition to the three previously voted schemes (M5, M6 and M7), to compare the post-processing outcomes for the online VEDAI dataset. Using the two metrics of accuracy and mAP to measure the improved performance of the four post-processing methods, i.e., the 3Stage scheme (M3), S&C (M5), the Enh3Stage scheme (M6) and SpatialProc (M7), applied to the initial detection output after the updated two-stage learning, we present the quantitative scores for each post-processing scheme in Table 9, where five different training-to-testing ratios were applied, in contrast with the same cases without any post-processing.
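The per-class average precision underlying the mAP scores above can be sketched as follows. This is a generic VOC-style computation (area under the precision-recall curve with a monotone precision envelope), shown only to make the metric concrete; the paper's exact mAP protocol may differ.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Area under the precision-recall curve with the standard
    non-increasing precision envelope (VOC-style AP)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing along recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# A perfect detector keeps precision 1.0 at every recall level.
ap = average_precision(np.array([1.0, 1.0]), np.array([0.5, 1.0]))
# mAP is then the mean of AP over object classes.
```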
When the training set is no greater than 90%, better scores were achieved by S&C (M5) than by 3Stage (M3), while the opposite holds for very large training proportions (95% and 98%); among the four post-processing schemes, the best improvement was achieved by Enh3Stage (M6) and the least by SpatialProc (M7), using accuracy and mAP as metrics. Hence, we conclude that among the seven post-processing schemes for the entire online VEDAI dataset, our Enh3Stage scheme (M6) achieved the best results: when applying a training-to-testing ratio of 98% to 2%, the highest mAP and accuracy were 82.80% and 89.1%, respectively.
The computational efficiency of each algorithm combined with its best post-processing scheme was evaluated on the involved dataset(s). Experiments were conducted in MATLAB R2019b on a Dell laptop with an Intel Core i7-8500U 1.80 GHz CPU and 16 GB RAM. The average CPU execution times, in seconds per frame of size 720 × 480, for the eleven algorithms are reported in Table 10, indicating that, after combining each detection algorithm with its best post-processing scheme, FDE runs the fastest, FC-DenseNet sits at the median, and TE is the slowest in this experiment.
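The per-frame timing measurement can be sketched as below (the paper's experiments ran in MATLAB; this Python version only illustrates the averaging procedure, with a trivial stand-in workload and hypothetical helper name).

```python
import time

def mean_time_per_frame(process_fn, frames):
    """Average wall-clock seconds per frame for a processing function."""
    start = time.perf_counter()
    for frame in frames:
        process_fn(frame)
    return (time.perf_counter() - start) / len(frames)

# Trivial stand-in workload on dummy "frames" (lists of pixels).
t = mean_time_per_frame(lambda f: sum(f), [[1] * 1000] * 10)
```

For CPU time rather than wall-clock time, `time.process_time` could be substituted for `time.perf_counter`.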

Discussion
We have presented a method for selecting post-processing schemes for eleven automatic vehicle detection algorithms in low-resolution wide-area aerial imagery. In addition to the four existing post-processing schemes [22][23][24][25] and the S&C scheme, which comprises pixel-area sieving and morphological closing [4], three more post-processing schemes that we recently derived were included [13,14,20]. The 3Stage scheme [13] yields a better average F-score than filtering by shape index [24] for LC [27], VMO [28], FT [29], MSSS [30], KFCM-CV [31], and TE [32]. Our tests were applied to two aerial datasets, Tucson and Phoenix, taking the average F-score as the comparison metric over all frames for all combinations of the ten algorithms with each of the seven post-processing schemes. Voting from the highest average F-scores by row comparison, the best three post-processing schemes were associated with each detection algorithm. The highest number of votes went to the Enh3Stage scheme (M6) [20], while the second-highest number of votes was tied between S&C (M5) [4] and SpatialProc (M7). We conclude that, after post-processing, in the Tucson dataset, FDE and LC rank as the best two in precision, F-score and PWC, while FT and MSSS rank as the best two in recall. For the Phoenix dataset, FDE and VMO rank as the best two in precision, F-score and PWC, while VMO and MSSS rank as the best two in recall. The MODA and MOC metrics coincide with the ranks of each automatic algorithm on PWC score improvements. The vote counts suggest a feasible path to improving the accuracy of small object detection in low-resolution aerial image datasets.
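The CLEAR-style metrics cited above can be made concrete with a short sketch; the formulas below follow the common definitions of PWC and frame-level MODA, which may differ in detail from the paper's exact implementation, and the cost weights `c_m`/`c_f` are illustrative defaults.

```python
def pwc(tp, fp, tn, fn):
    """Percentage of Wrong Classifications (lower is better)."""
    return 100.0 * (fp + fn) / (tp + fp + tn + fn)

def moda(fp, fn, n_gt, c_m=1.0, c_f=1.0):
    """Frame-level Multiple Object Detection Accuracy (CLEAR).

    c_m and c_f weight missed detections and false positives
    against the number of ground-truth objects n_gt.
    """
    return 1.0 - (c_m * fn + c_f * fp) / n_gt
```

A perfect frame (no misses, no false positives) gives PWC = 0 and MODA = 1; each error moves PWC up and MODA down, which is why their rankings tend to coincide across algorithms.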
The hundred-layer vehicle tiramisu code [39] for semantic image segmentation was adapted as the eleventh algorithm for small object detection in the third set of experiments of our research study. The two-stage machine learning approach [39] and its updated version applying the FC-DenseNet103 model with tuned parameters were used to achieve the highest overall initial detection accuracy, where the best average accuracy was 85.5%. Two sets of experiments were designed to search for the best post-processing scheme between the two groups: applying mean average precision (mAP) and accuracy as two metrics for performance evaluation, we checked the other four schemes to determine the best one, and let this scheme join the group of the best three schemes voted from our first set of experiments. At the final stage, we conclude that the Enh3Stage scheme (M6) has the highest overall mAP and accuracy among the four post-processing schemes (M3, M5, M6 and M7) when different proportions of training and test images were chosen from the online VEDAI dataset. Since the best outcome from Enh3Stage was still slightly below 85.0% in mAP and close to 0.9 in accuracy, there remains room for future improvement.
There are several limitations to our research study. (i) Many feature-extraction-based algorithms for aerial vehicle detection are geometric-measure-based or grayscale-intensity-based methods, which lack temporal analysis. (ii) For fair comparison, all seven post-processing schemes provide heuristic improvement for each grayscale aerial frame, whereas false positives could be further eliminated using temporal filtering in aerial video datasets. (iii) Since there was no training data for the two aerial video datasets, our study did not assess machine learning schemes on those datasets. (iv) Newer metrics, e.g., the structural similarity index measure (SSIM) [60], could be applied for performance evaluation. (v) To demonstrate the robustness of the algorithms, many large datasets are needed for testing. All these topics may represent potential research directions for subsequent investigation.
Most recently, numerous deep-learning approaches have been applied to object detection and segmentation in traffic video analysis as well as wide-area remote surveillance [57,[61][62][63][64][65][66][67][68][69][70][71][72][73][74][75]. Detecting aerial vehicles with a deep learning scheme typically includes pretraining, sample frame labeling, feature extraction in a deep convolutional neural network (DCNN), and then an algorithm for object detection and segmentation to obtain the labeled outputs. While some multi-task learning-based methods have demonstrated time efficiency and object detection accuracy [43][44][45], further updates are still needed to deal with the increased complexity, time cost, and requirement of multi-core GPU support.

Conclusions
Our work addresses the accurate detection of small objects for vehicle detection using low-resolution wide-area datasets. We designed three sets of experiments for selecting post-processing schemes to improve the performance of object detection algorithms. In the first set of experiments, we voted the best three post-processing schemes combined with ten detection algorithms. In the second set of experiments, we determined the best post-processing scheme for each of the ten algorithms, based on the type of detections and two sets of performance metrics. In the third set of experiments, we applied these post-processing schemes to a two-stage machine learning approach and its variant model built on FC-DenseNet, and then measured the degree of improvement on online VEDAI. We quantified the object detection performance using basic IR metrics and the CLEAR metrics. Based on average F-score, accuracy and mAP, we determined that the Enh3Stage scheme may represent the best choice in our post-processing selection for improving vehicle detection accuracy in wide-area aerial imagery.