DisCountNet: Discriminating and Counting Network for Real-Time Counting and Localization of Sparse Objects in High-Resolution UAV Imagery

Abstract: Recent deep-learning counting techniques revolve around two distinct kinds of data: sparse data, which favors detection networks, and dense data, where density map networks are used. Both techniques fail to address a third scenario, where dense objects are sparsely located. Raw aerial images represent sparse distributions of data in most situations. To address this issue, we propose a novel and exceedingly portable end-to-end model, DisCountNet, and an example dataset to test it on. DisCountNet is a two-stage network that uses theories from both detection and heat-map networks to provide a simple yet powerful design. The first stage, DiscNet, operates on the theory of coarse detection, but does so by converting a rich and high-resolution image into a sparse representation where only important information is encoded. Following this, CountNet operates on the dense regions of the sparse matrix to generate a density map, which provides fine locations and count predictions on densities of objects. Comparing the proposed network to current state-of-the-art networks, we find that we can maintain competitive performance while using a fraction of the computational complexity, resulting in a real-time solution.


Introduction
Counting objects is a fine-grain scene-understanding problem which can arise in many real-world applications including counting people in crowded scenes and surveillance scenarios [1][2][3][4][5], counting vehicles [6], counting cells for cancer detection [7], and counting in agriculture settings for yield estimation and land use [8,9]. Counting questions also appear as some of the most difficult and challenging questions in Visual Question Answering (VQA). Despite very promising results in "yes/no" and "what/where/who/when" questions, counting questions (how many) are the most difficult questions for the system, which have the lowest performance [10,11].
In natural resource management, livestock populations are managed on pastures and rangeland consisting of hundreds or thousands of acres, which may not be easily accessible by ground-based vehicles. The emergence of micro Unmanned Aerial Vehicles (UAVs), featuring high flexibility, low cost, and high maneuverability has brought the opportunity to build effective management systems. They can easily access and survey large areas of land for data collection and translate this data into a user-friendly information source for managers.
Current methods to count animals and identify their locations through visual observation are expensive and time consuming. UAV technology has provided inexpensive tools that can be used to gather data for such purposes, but this also creates an urgent need for new automatic, real-time object detection and counting techniques. Existing computer vision algorithms for object detection and counting are mainly designed and evaluated on non-orthogonal photographs taken horizontally with optical cameras. For UAVs, images are taken vertically at higher altitudes (usually a hundred meters or less above ground level). In such images, the objects of interest can be very small and lack important information. For example, an aerial image of an animal shows only the top view, which presents a blob shape containing no outstanding or distinguishing features. Additionally, this area of interest looks similar to other objects in the background, such as trees and bushes, whereas the corresponding terrestrial image of the same animal has many distinguishing features, such as head, body, or legs, which make recognition easier. Moreover, ground-based images offer a balance between background and foreground, which is not present in UAV images taken from a high altitude. The difference between a frontal (ground-based) view of an animal and a top (aerial) view is depicted in Figure 1. Objects in aerial images are small, flat, and sparse; moreover, objects and backgrounds are highly imbalanced. In human-centric photographs, different parts of objects (head, tail, body, legs) are clearly observable, while aerial imagery presents very coarse features. In addition, aerial imagery presents confusing features, such as shadows.
In addition, the UAV images we use in this project are associated with a large scene-understanding problem, which is still a challenging issue even for ground-based images. Specific challenges to count and localize animals in large pastures include: (1) animals may be occluded by bushes and trees; (2) variant lighting conditions; (3) small areas of animals in the imagery make it difficult to detect them based on shape features; (4) herding animals tend to group together (form a herd).
Recent advances in deep neural networks (DNNs) along with massive datasets have facilitated the progress in artificial intelligence tasks such as image classification [12], object recognition [13,14], counting [8,9], contour and edge detection [15] and semantic segmentation [16]. Most successful network architectures have improved the performance of various vision tasks at the expense of significantly increased computational complexity. In many real-world applications, real-time analysis of data is necessary. One of the goals of our research is to develop a real-time algorithm that can count and localize animals while on board UAVs. For this purpose, we need an algorithm that balances portability and speed with accuracy instead of sacrificing the former for the latter. To address this, we propose a novel technique influenced by both detection and density map networks along with specialized training techniques in which coarse and fine detection occurs. One network operates on a sparse distribution, while the other operates on a dense distribution. By separating and specializing these tasks, we compete with state-of-the-art networks on this particular challenge while maintaining impressive speed and portability.
In this research, we have designed a novel end-to-end network that takes a large, high-resolution image as input and produces the count and localization of animals as output. The first stage, DiscNet, is designed to discriminate between foreground and background data, converting a full feature-rich image into a sparse representation where only foreground patches and their locations are encoded. The second network, CountNet, seeks to solve a density function. Operating on the sparse matrix from DiscNet, CountNet can limit its expensive calculations to important areas. An illustration of our network is presented in Figure 2. The novel contributions of our work include:

• We developed a novel end-to-end architecture for counting and localizing small and sparse objects in high-resolution aerial images. This architecture can allow for a real-time implementation on board the UAV, while maintaining comparable accuracy to state-of-the-art techniques.

• Our DisCount network discards a large amount of background information, limiting expensive calculations to important foreground areas.

• The hard example training part of our algorithm addresses the issues of shadow and occluded animals.

• We collected a novel UAV dataset, prepared the ground truth for it, and conducted a comprehensive evaluation.

Counting Methods
Counting can be divided into several categories based on the annotation methods used for generating the ground-truth data.

Counting Via Detection
One can assume that perfect detection leads to perfect counting. When objects are distinct and can be easily detected, this assumption holds. In this method, objects need to be annotated with bounding boxes. Several methods [5,17–19] have applied detection for counting objects. For instance, Ref. [5] manually annotated bounding boxes and trained a Faster R-CNN [20] for counting people in a crowd. However, in many cases these methods suffer from heavy occlusions among objects. Moreover, the annotation cost can be very expensive and impractical for very dense objects. In our case, although animals are sparsely located in the image, they herd together, which results in dense patches. Based on this observation, we use coarse detection methods to model the first function of density.

Counting Via Density Map
In this case, annotation involves marking a point location for each object in the image. This annotation is based on a density heat map and is preferred in scenarios where many objects occlude each other. Density heat-map annotation has been used in several cases, including counting cells, vehicles, and crowds [6,21–26]. Since our counting involves occluded objects in the selected patches, we use the density heat-map annotation technique.

Counting on Image Level
This counting is based on image level label regression [8,9,27] which is the least expensive annotation technique. However, these methods can only count. Since we are interested in both counting and localization of objects, we did not use image level annotation which is basically the global count in the image.

Counting Applications
Counting methods have mainly been applied to counting crowds [5,28–31], vehicles [6,32], and cells [7]. In agriculture, there has been limited research on counting apples and oranges [33], tomatoes [8,9], maize tassels [34], and animals [27]. However, the authors are not aware of any fully automatic techniques for counting animals or fruit from aerial imagery. The existing techniques for counting and detecting animals from UAVs [35–37] need either manual preparation of training data such that each image contains a single animal [35,36] or an extra sensor such as a thermal camera [37]. Due to payload limitations, it is not always possible to add an extra sensor, and thermal cameras are usually more expensive than optical ones, which can be a prohibitive cost for local farmers. Moreover, counting in [35,36] is performed via a post-processing step using connected component analysis. Our approach differs from previous work in that we have developed a fully automatic technique in which the regions of interest are selected automatically in the first part of the network (DiscNet), without any manual cropping of imagery, and counting is performed automatically in an end-to-end learning procedure on optical imagery.

Unmanned Aerial Systems
In recent years, UAS have been extensively used in various areas such as scene understanding and image classification [12], flood detection [16], vehicle tracking [38], forest inventory [39], soil moisture [40], and wildlife and animal management [36,41]. There has been very limited work on the use of UAS for monitoring livestock, particularly for animal detection, feeding behavior, and health monitoring. For a review of these techniques, see [42].
In addition, several methods based on DNNs [32,43–45] have been developed for object detection and tracking in satellite and aerial imagery, particularly for vehicles. When counting and detecting man-made objects (such as vehicles in parking lots) in aerial imagery, one deals with imagery that contains a roughly equal distribution of objects of interest and background, and there is no overlap between objects. In typical crowd or vehicle counting from aerial imagery, more than 70% of the image contains the object, while in our case less than 1% of the imagery contains the object of interest (cattle). To our knowledge, there are no fully automatic techniques for counting sparse objects from UAV imagery. Objects in UAV images are usually flat, proportionally small, and missing normal distinguishing features. Moreover, the ratio of foreground (object of interest) to background data in UAV imagery is prohibitively small. This means that we need to handle sparse information to separate foreground information from background data. Additionally, most domesticated animals used in agriculture are herding animals. This means that even though they represent a minute amount of sparsely distributed information, they tend to group, leading to dense situations that cannot be accurately handled by detection networks.

Data Collection
UAS flights for cattle and wildlife detection were conducted at the Welder Wildlife Foundation in Sinton, TX, in December 2015. This coincides with the typical dates for wildlife counts due to leaf drop of deciduous trees. A fixed-wing UAV fitted with a single-channel non-differential GPS and a digital RGB camera for photogrammetry was flown by the Measurement Analytics Lab (MANTIS) at Texas A&M University-Corpus Christi under a blanket Certificate of Authorization (COA) approved by the United States Federal Aviation Administration (FAA). Over 600 acres were covered using a fixed-wing small UAV called the SenseFly eBee (Figure 3). It is an ultra-lightweight (0.7 kg), fully autonomous platform that has a flight endurance of approximately 50 min on a fully charged battery in light wind, and can withstand wind speeds up to 44 km/h. With this setup, it can cover ten square kilometers per flight mission. For this survey, the platform was equipped with a Canon IXUS 127 HS 16.1 MP RGB camera with automatic exposure adjustment for optimal image exposure. Four flights were conducted at 80 m above ground level with 75% sidelap and 65% endlap to seamlessly cover the entire study area, producing over one thousand individual photographs. The resultant ground sample distance (GSD) was on average 3.8 cm. These images were post-processed using structure-from-motion (SfM) photogrammetric techniques to generate an orthorectified image mosaic (orthomosaic) (Figure 4). In this work, Pix4Dmapper Pro (Pix4D SA, 1015 Lausanne, Switzerland) was used to process the imagery. The SfM image processing workflow is summarized as follows [46]:
(1) Image sequences are input into the software, and a keypoint detection algorithm, such as a variant of the scale invariant feature transform (SIFT), is used to automatically extract features and find keypoint correspondences between overlapping images using a keypoint descriptor. SIFT is a well-known algorithm that allows for feature detection regardless of scale, camera rotations, camera perspectives, and changes in illumination [47].
(2) Keypoints, as well as approximate values of the image geo-position provided by the UAS autopilot (onboard GPS), are input into a least squares bundle block adjustment to simultaneously solve for camera interior and exterior orientation. Based on this reconstruction, the matching points are verified, and their 3D coordinates are calculated to generate a sparse point cloud.
(3) To improve reconstruction, ground control points (GCPs) laid out in the survey area are introduced to constrain the solution and optimize reconstruction. GCPs also improve the georeferencing accuracy of the generated data products.
(4) Densification of the point cloud is then performed using a Multi-View Stereo (MVS) algorithm to increase the spatial resolution. The resultant densified set of 3D points is used to generate a triangulated irregular network (TIN) and obtain a digital surface model (DSM).
(5) The DSM is then used by the software to project every image pixel and to calculate a geometrically corrected image mosaic (orthomosaic) with uniform scale.
Due to the low accuracy of the onboard GPS used to geotag the imagery, ground control targets were laid out in the study area, and RTK differential GPS was used to precisely locate their position within 2 to 4 cm horizontal and vertical accuracy. These control targets were used during the post-processing of the imagery to accurately georeference the orthomosaic image product.

Dataset Feature Description
The prominent features of this dataset are roads, cows, and fences, which are standard for ranch land in the southern United States. According to the USDA [48], each head of cattle requires roughly 2 acres of ranch land (one acre is 43,560 square feet, or 4047 square meters) to maintain year-round foraging. On average, a cow viewed orthogonally occupies roughly 16 square feet, or 1.5 square meters. This means that, by area, a properly populated ranch land will have approximately 0.02% of its area (16 of 87,120 square feet) occupied by cattle. As can be seen in Figure 5, any given image in our dataset contains a large amount of background information. In this figure, background information is represented as translucent areas, while important areas, containing objects of interest, are transparent. In addition, cattle are herding animals, meaning they travel in groups. This is especially prevalent in calves, which stay within touching distance of their mothers. Due to this large disparity of area to cow and the propensity of cattle to group up, the data exhibit unique distributions and sub-distributions, which can be described as densely packed locations of data scattered sparsely over a much larger area. Figure 5 shows an example of sparsity in an image. Out of 192 regions, only 11, signified by transparent areas, represent useful information for the given counting task. Furthermore, within the useful regions, only 12% of the pixel area represents non-zero values in the probability heat map.
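The sparsity figures in this section can be checked with a few lines of arithmetic. The sketch below uses only the quantities quoted in the text plus the standard conversion of 43,560 square feet per acre; the variable names are illustrative.

```python
# Figures quoted in the text; 43,560 sq ft per acre is the standard conversion.
SQFT_PER_ACRE = 43_560
ACRES_PER_HEAD = 2      # grazing area per head of cattle cited in the text
COW_AREA_SQFT = 16      # approximate top-down footprint of one cow

pasture_sqft = ACRES_PER_HEAD * SQFT_PER_ACRE
cow_area_fraction = COW_AREA_SQFT / pasture_sqft

# Sparsity of the Figure 5 example: 11 useful regions out of 192,
# with about 12% of the pixels in those regions non-zero in the heat map.
foreground_region_share = 11 / 192
nonzero_pixel_share = foreground_region_share * 0.12

print(f"area occupied by cattle:  {cow_area_fraction:.3%}")
print(f"foreground regions:       {foreground_region_share:.1%}")
print(f"non-zero heat-map pixels: {nonzero_pixel_share:.2%}")
```

Under these numbers, fewer than 1% of the heat-map pixels in the example image carry any signal, which is the imbalance the two-stage design is meant to exploit.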

Dataset Preparation
Individual images taken from the UAV have a native resolution of 3456 by 4608, which is scaled down to 1536 by 2048. Ground-truth annotations for this dataset are center of object point locations, all of which were labeled by hand. The density map is generated by processing the center of point objects with a Gaussian smoothing kernel with a size of 51 and a sigma of 9. This process can be visualized in Figure 6, with an example region, its ground-truth point annotation, and the resultant Gaussian smoothing output. The size of the Gaussian smoothing kernel roughly correlates to the average distance between the tip of a cow's head and the base of its tail. Due to the scale of the data, some unimportant areas may be labeled as important as they are located proximate to cows. To generate region labels, a sum operation is performed over each region. Any region with a value greater than zero is classified as foreground, with all others classified as background.
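The label-generation steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes the Gaussian kernel is normalized to sum to one (so the density map integrates to the object count) and that regions are 128 by 128 pixels, which is what a 12 by 16 grid over a 1536 by 2048 image implies. The canvas size and point coordinates are made up for the example.

```python
import math

def gaussian_kernel(size=51, sigma=9.0):
    """Normalized 2D Gaussian kernel (assumed to sum to 1)."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def density_map(points, height, width, kernel):
    """Stamp one kernel per (row, col) point annotation onto an empty map."""
    half = len(kernel) // 2
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        for ky, kernel_row in enumerate(kernel):
            y = py + ky - half
            if not 0 <= y < height:
                continue
            for kx, v in enumerate(kernel_row):
                x = px + kx - half
                if 0 <= x < width:
                    dmap[y][x] += v
    return dmap

def region_labels(dmap, region=128):
    """Label a region 1 (foreground) if its density sum is greater than zero."""
    height, width = len(dmap), len(dmap[0])
    labels = []
    for ry in range(0, height, region):
        row = []
        for rx in range(0, width, region):
            s = sum(sum(r[rx:rx + region]) for r in dmap[ry:ry + region])
            row.append(1 if s > 0 else 0)
        labels.append(row)
    return labels

# Hypothetical 256 x 384 canvas with three annotated animals.
points = [(40, 40), (60, 50), (120, 200)]
dmap = density_map(points, 256, 384, gaussian_kernel())
total_count = sum(map(sum, dmap))
labels = region_labels(dmap)
print(round(total_count, 6), labels)
```

The third point sits close to a region boundary, so its Gaussian tail spills into the region below and that region is labeled foreground as well. This is the effect noted in the text, where areas proximate to cows may be labeled as important.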

Our Approach
In this work, we will employ deep-learning models to approximate two density maps, $\theta_d$ and $\theta_c$, based on assumptions generated from observation of the data. These assumptions are:

1. The data can be accurately described as a set of two distributions.

2. The majority of our data can be classified as background information.

3. Background information can be safely discarded without losing contextual details.

4. Foreground information can be densely packed.

Therefore, we design a two-stage approach to solve the problem. The first stage, DiscNet, is designed to discriminate between background and foreground data, converting a full feature-rich image into a sparse representation where only foreground data and its location are encoded; this approximates the function $\theta_d$. The second network, CountNet, approximates $\theta_c$ by operating on the sparse matrix from DiscNet, so that it can limit its expensive calculations to foreground regions. The result of this design, DisCountNet, is a two-stage, end-to-end supervised learning process that maintains competitive accuracy while yielding a real-time solution to the given problem.
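The control flow of the two stages can be sketched as follows. This is a minimal illustration under stated assumptions: `threshold_discriminator` and `identity_counter` are hypothetical stand-ins for the trained DiscNet and CountNet, and images are plain nested lists. Only the routing of regions reflects the design described above.

```python
def threshold_discriminator(image, region):
    """Stand-in for DiscNet: flag a region if it contains any non-zero pixel."""
    rows, cols = len(image) // region, len(image[0]) // region
    return [[any(v > 0
                 for row in image[r * region:(r + 1) * region]
                 for v in row[c * region:(c + 1) * region])
             for c in range(cols)]
            for r in range(rows)]

def identity_counter(patch):
    """Stand-in for CountNet: treat the pixel values as the density itself."""
    return [list(row) for row in patch]

def discount(image, region,
             discriminate=threshold_discriminator,
             count_region=identity_counter):
    """Run the second stage only on regions the first stage marks as foreground."""
    mask = discriminate(image, region)
    density = [[0.0] * len(image[0]) for _ in image]
    processed = 0
    for r, mask_row in enumerate(mask):
        for c, is_foreground in enumerate(mask_row):
            if not is_foreground:
                continue  # background regions never reach the second stage
            patch = [row[c * region:(c + 1) * region]
                     for row in image[r * region:(r + 1) * region]]
            for dy, out_row in enumerate(count_region(patch)):
                density[r * region + dy][c * region:(c + 1) * region] = out_row
            processed += 1
    count = sum(map(sum, density))
    return density, count, processed

# Toy 4 x 4 image with 2 x 2 regions and a single unit of density.
image = [[0.0] * 4 for _ in range(4)]
image[0][3] = 1.0
density, count, processed = discount(image, region=2)
empty_processed = discount([[0.0] * 4 for _ in range(4)], region=2)[2]
print(count, processed, empty_processed)
```

In this sketch, background regions keep an exact zero in the output map, and an empty image triggers no second-stage work at all.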

Implementation and Training
The design for DisCountNet is detailed in Figure 2, which shows the full end-to-end implementation. DiscNet (the first stage) is an encoder characterized by convolutions with large kernels and leaky ReLU activation functions followed by aggressive pooling. The first four convolutions use kernels with sizes of seven, six, five, and four. The first three pooling operations are all max pooling with a kernel and stride of four, and the final pooling operation is another max pooling layer with a kernel and stride of two. The next-to-last layer in the network is a one-by-one convolution that reduces the feature map depth to two, followed by a SoftMax activation, yielding a 12 by 16 matrix of values that represent the likelihood that a cow is found in a given region. The aggressive striding allows us to use larger kernel sizes to capture contextual information that could otherwise be lost, while limiting expensive operations. DiscNet then uses this matrix to convert the original input image into a sparse representation, operating on the assumptions listed above. CountNet uses the sparse representation to generate per-pixel probability values. This flow of information can be visualized in Figure 7, which shows the data representations at different stages of the proposed network.

Our training procedure is depicted in Algorithm 1. Given the dataset $\{X_i\}_{i=0}^{N}$, DiscNet is trained using the full images and region-label ground truths via a weighted cross-entropy loss to determine whether there is a cow in a given region. Each image $X_i$ consists of $R^{(i)}$ regions, where each region is labeled by $y_r^{(i)}$ with $r = 1, \ldots, R^{(i)}$. The network prediction is denoted $\hat{y}_r^{(i)}$. The weighted cross-entropy loss is given by Equation (1); for convenience, we drop the superscript $(i)$ in the formula.

$$\mathcal{L}_d = -\sum_{r} \left( y_r \, \mathit{pr}^{-0.5} \log(\hat{y}_r) + (1 - y_r) \, \mathit{pr}^{0.5} \log(1 - \hat{y}_r) \right) \quad (1)$$

where $\mathit{pr} \in [0, 1]$ represents the percentage of regions with desired information. This weighted loss function serves to counterweight the loss values for our unbalanced dataset. For example, if an image is 90% background regions, the loss for foreground regions will be ten times higher. This causes the network to weigh the loss for positive examples more highly than for negative examples, resulting in an increased number of false positives and fewer false negatives. In our implementation, a false negative hurts performance much more than a false positive. A false negative in the discriminator means that a region with a cow is not passed to CountNet, so no cows can be detected there. A false positive, however, means that a region without a cow is passed on, for which CountNet can still compensate.

In the second stage, as CountNet seeks to model a different function based on Assumption 4, it uses a different implementation. CountNet features a U-Net structure [49] with modified operations. Operating on a sparse representation of the original image generated by DiscNet, CountNet creates a sparse density map. The encoding pathway features four convolution-pooling operations with skip connections to the decoding pathway, which uses transposed convolutions. All the pooling operations are max pooling with a kernel and stride of two. Each of the convolutions uses a three-by-three kernel, and all transposed convolutions use a stride of two. Finally, the network uses a one-by-one convolution with a feature depth of one that represents the likelihood that a cow is in a given pixel. U-Net-like architectures are well established for providing accurate inferences while maintaining contextual information in per-pixel tasks. CountNet is trained on the sparse data generated by DiscNet. CountNet's loss value is computed from the predicted regions and the corresponding ground-truth density map regions by minimizing the Mean Square Error.
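The $\ell_2$ loss function is given by Equation (2), where $z_i$ is a given ground-truth density map, $\hat{z}_i$ is the corresponding prediction, and $N$ is the number of density maps compared:

$$\mathcal{L}_c = \frac{1}{N} \sum_{i=1}^{N} \lVert z_i - \hat{z}_i \rVert_2^2 \quad (2)$$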
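Two details of this section lend themselves to a quick numerical check: the 12 by 16 output grid of DiscNet and the ten-to-one weighting in the example for Equation (1). The sketch below is illustrative only. It assumes the convolutions preserve spatial size (so only pooling changes resolution), that each image has at least one foreground region, and that the function names are hypothetical.

```python
import math

def discnet_grid(height, width, pool_strides=(4, 4, 4, 2)):
    """Spatial size after the four pooling stages, assuming the
    convolutions themselves preserve spatial size."""
    for stride in pool_strides:
        height //= stride
        width //= stride
    return height, width

def weighted_region_loss(labels, probs, eps=1e-12):
    """Weighted cross entropy over the regions of one image, following
    Equation (1). Assumes at least one foreground region (pr > 0)."""
    pr = sum(labels) / len(labels)          # share of foreground regions
    w_fg, w_bg = pr ** -0.5, pr ** 0.5      # foreground / background weights
    loss = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)       # keep log() finite
        loss -= y * w_fg * math.log(p) + (1 - y) * w_bg * math.log(1 - p)
    return loss, w_fg / w_bg

grid = discnet_grid(1536, 2048)

# One foreground region out of ten, i.e. 90% background.
labels = [1] + [0] * 9
probs = [0.5] * 10
loss, weight_ratio = weighted_region_loss(labels, probs)
print(grid, round(weight_ratio, 6))
```

With 10% foreground regions the two weights are $0.1^{-0.5}$ and $0.1^{0.5}$, whose ratio is $1/0.1 = 10$, matching the example in the text.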

Hard Example Training
During end-to-end training, CountNet maintains a list of loss values per region. At the end of each epoch, it sorts this list and discards the lowest half. CountNet then randomly perturbs the remaining regions using random flipping and rotation, and trains on them again with a larger batch size. The loss from this training is again stored, and the process is repeated m − 1 times. As the population decreases by half in every iteration, m should be chosen to ensure that the population of regions does not drop below a given batch size. On observation, regions used multiple times contain confusing features, such as black cows that look similar to shadows or cows obscured behind foliage.
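The halving schedule described above can be sketched as follows. This is a hedged illustration rather than the authors' training loop: `train_fn` is a hypothetical callback that trains on a batch of regions and returns one loss per region, the regions are tiny nested lists, and the population of 96 regions is chosen only for the example.

```python
import random

def augment(region, rng):
    """Randomly flip and rotate a square region given as a list of rows."""
    if rng.random() < 0.5:
        region = [row[::-1] for row in region]            # horizontal flip
    for _ in range(rng.randrange(4)):                     # 0 to 3 quarter turns
        region = [list(row) for row in zip(*region[::-1])]
    return region

def hard_example_rounds(regions, losses, train_fn, m, rng):
    """Repeat m - 1 times: keep the higher-loss half, perturb it, retrain."""
    population = list(zip(regions, losses))
    sizes = []
    for _ in range(m - 1):
        population.sort(key=lambda item: item[1])
        population = population[len(population) // 2:]    # drop lowest-loss half
        perturbed = [augment(region, rng) for region, _ in population]
        new_losses = train_fn(perturbed)                  # one loss per region
        population = list(zip(perturbed, new_losses))
        sizes.append(len(population))
    return population, sizes

# Hypothetical setup: 96 regions whose "loss" is simply their pixel sum.
rng = random.Random(0)
regions = [[[i, 0], [0, 0]] for i in range(96)]
losses = list(range(96))
stub_train = lambda batch: [sum(map(sum, region)) for region in batch]

population, sizes = hard_example_rounds(regions, losses, stub_train, m=3, rng=rng)
print(sizes)
```

Because the population halves on every round, the choice of m bounds the smallest batch that can still be formed from the surviving regions.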

Evaluation Metrics
For evaluation, we used five metrics in addition to comparing the number of parameters between DisCountNet, RetinaNet [14], and CSRNet [50]. The goal is counting and density map generation that are as accurate as possible while providing a real-time solution on portable hardware. The metrics can be broken into three groups: image-level label comparison, region-level label comparison, and generated density map quality comparison.

Image Level Label Metrics
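To compare raw counting results, we use mean squared error (mSE) and mean absolute error (mAE):

$$\mathrm{mSE} = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2 \quad (3)$$

$$\mathrm{mAE} = \frac{1}{n} \sum_{t=1}^{n} \lvert y_t - \hat{y}_t \rvert \quad (4)$$

The resultant values provide us with an idea of the average error expected for any image in our testing set. In both equations, $n$ is the total number of images, $y_t$ is a given ground-truth label count, and $\hat{y}_t$ is our count prediction for image $t$.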

Image Region Level Metric
The grid average mean absolute error (GAME) [6] metric provides more accurate information about counting quality. Mean absolute error does not care where errors occur as long as they average out, whereas GAME simultaneously considers the object count and the estimated location of the objects. The formula for GAME is as follows:

$$\mathrm{GAME}(L) = \frac{1}{n} \sum_{t=1}^{n} \left( \sum_{r=1}^{4^L} \lvert y_t^r - \hat{y}_t^r \rvert \right) \quad (5)$$

where $n$ is the total number of images, $L$ sets the grid level so that each image is divided into $4^L$ non-overlapping regions as in [6], $y_t^r$ is the actual count for image $t$ in region $r$, and $\hat{y}_t^r$ is the corresponding prediction. It should be noted that GAME(0) is equal to the mAE, as the region considered is the whole image.
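A small example makes the difference between mAE and GAME concrete. The sketch below assumes the grid definition of [6], in which level L splits each image into a 2^L by 2^L grid (4^L cells), and takes density maps as nested lists; the function names are illustrative.

```python
def cell_sums(dmap, level):
    """Sum the density inside each cell of a 2^level x 2^level grid."""
    cells = 2 ** level
    height, width = len(dmap), len(dmap[0])
    sums = []
    for i in range(cells):
        y0, y1 = i * height // cells, (i + 1) * height // cells
        for j in range(cells):
            x0, x1 = j * width // cells, (j + 1) * width // cells
            sums.append(sum(sum(row[x0:x1]) for row in dmap[y0:y1]))
    return sums

def game(gt_maps, pred_maps, level):
    """Grid average mean absolute error over a list of images."""
    total = 0.0
    for gt, pred in zip(gt_maps, pred_maps):
        total += sum(abs(g - p)
                     for g, p in zip(cell_sums(gt, level), cell_sums(pred, level)))
    return total / len(gt_maps)

# One object in the top-left corner, predicted in the bottom-right corner.
gt = [[0.0] * 4 for _ in range(4)]
pred = [[0.0] * 4 for _ in range(4)]
gt[0][0] = 1.0
pred[3][3] = 1.0

game0 = game([gt], [pred], level=0)   # whole image: the count is right
game1 = game([gt], [pred], level=1)   # 2 x 2 grid: the location is wrong
print(game0, game1)
```

The prediction has the correct total count, so GAME(0) reports no error, while GAME(1) penalizes both the cell that is missing the object and the cell that wrongly contains it.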

Density Map Quality Comparison
To evaluate the quality of the produced density heat maps, we use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [51]. These metrics provide insight into the quality of the generated density map compared to the ground truth. Standard implementations of these metrics are non-distance evaluations, meaning that they cannot be used to evaluate raw counting results, but rather provide insight into a network's ability to create accurate per-pixel values. The formula for PSNR is given below, with $\mathrm{MAX}_I$ being the maximum possible pixel value of a given image and mSE being the mean squared error found in Equation (3).

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{mSE}} \right) \quad (6)$$

The formula for structural similarity is as follows:

$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \quad (7)$$

where $\mu$ represents the mean, $\sigma$ represents the standard deviation, $\sigma_{xy}$ is the covariance, $x$ is a ground truth, and $y$ is a prediction. Finally, $c_1$ and $c_2$ are variables that stabilize the division.
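Both metrics can be computed directly from their definitions. The sketch below is a simplification under stated assumptions: SSIM is evaluated once over the whole image rather than over local windows as standard implementations do, the stabilizing constants use the common defaults $c_1 = (0.01\,\mathrm{MAX})^2$ and $c_2 = (0.03\,\mathrm{MAX})^2$, and the mean squared error is taken per pixel.

```python
import math

def flatten(image):
    return [v for row in image for v in row]

def psnr(gt, pred, max_value=1.0):
    """Peak signal-to-noise ratio from the per-pixel mean squared error."""
    g, p = flatten(gt), flatten(pred)
    mse = sum((a - b) ** 2 for a, b in zip(g, p)) / len(g)
    if mse == 0:
        return math.inf                      # identical images
    return 10 * math.log10(max_value ** 2 / mse)

def global_ssim(gt, pred, max_value=1.0):
    """Single-window SSIM over the whole image (a simplification)."""
    x, y = flatten(gt), flatten(pred)
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Ground truth of exact zeros versus a prediction with a small constant offset.
zeros = [[0.0] * 4 for _ in range(2)]
offset = [[0.1] * 4 for _ in range(2)]

psnr_offset = psnr(zeros, offset)
ssim_same = global_ssim(zeros, zeros)
ssim_offset = global_ssim(zeros, offset)
print(round(psnr_offset, 3), ssim_same, round(ssim_offset, 4))
```

In this toy case a prediction of exact zeros on an all-background map scores a perfect SSIM, while a small constant offset on every pixel lowers it sharply.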

Experimental Setup
We have compared the performance of our technique with two state-of-the-art techniques namely RetinaNet and CSRNet [14,50]. RetinaNet is a detection network [14] that generates bounding boxes and CSRNet generates a density map [50].
All networks were trained on a Spectrum TXR410-0032R Deep Learning Dev Box, leveraging an Intel Core i7-5930K, 64 GB of RAM, and four Nvidia GeForce Titan Xs.
DisCountNet was trained with the Adam [52] optimizer with a learning rate of 1e-3. The batch size for DiscNet was 1, and the batch size for CountNet was the number of regions detected by DiscNet. During hard example training, the batch size of DiscNet was set to 24. The number of repetitions of hard example training, m, was set to three to maintain a large sample population. A larger dataset could theoretically use a larger number of repetitions, as the batch size of regions could remain higher.
To generate positive anchors, RetinaNet was trained using images that contained cows extracted from a three-by-three grid of the original images using bounding box ground truths.
Validation was run with a Dell Inspiron 15-7577 using a solid-state drive, i5-7300 processor, 16 GB of RAM, and an Nvidia 1060 Max-Q. To operate on more limited hardware, images were split into non-overlapping regions before being processed by CSRNet and RetinaNet. For RetinaNet [14], a three-by-three grid was used to generate the regions, while CSRNet [50] was validated using a two-by-two grid. Using this hardware as an analog for consumer attainable and portable hardware, DisCountNet averaged 34 frames per second. To compare, RetinaNet [14] averaged 4 frames per second, and CSRNet [50] averaged 12 frames per second. This shows that only our technique can count and localize objects in real time.

Qualitative and Quantitative Results
A sample full image, its ground truth, and the density map predicted by our algorithm are shown in Figure 8. In addition to the density map, the network predicts 6 objects in this image, which corresponds to the actual count. As can be seen in Figure 8, our method is able to detect small and sparse objects in large UAV images. Further results for regions selected by our discriminator network are shown in Figure 9. This figure shows that our method can distinguish between two adjacent cattle and can detect animals that are occluded by foliage. As can be seen in Table 1, DisCountNet maintains competitive metrics despite using just over 1% of the parameters of current state-of-the-art networks. In addition, DisCountNet has the benefit of limiting computation, whereas the other networks do not. For example, on an empty image, DisCountNet would only use the operations of DiscNet, as CountNet would not run. RetinaNet [14] and CSRNet [50], however, would need to run their full operations on the empty image, resulting in computations with no benefit.
As can be seen in Tables 2 and 3, DisCountNet outperformed the state of the art in all GAME metrics as well as in SSIM. This shows that our method is more accurate than the others in simultaneous counting and localization of objects. The SSIM metric (Table 3) is 1e-4 from being a perfect score for DisCountNet. This is because a sparse representation allows for an output of exactly zero, which results in no error at all for the majority (around 80%) of pixels. CSRNet [50] does not have this type of design, so every pixel output value can be extremely close to zero but statistically will not be zero. This results in a small error value in every pixel, which is even more pronounced in SSIM than in the other metrics.

Conclusions
In this paper, we propose an innovative method to work with sparse datasets by designing a fully convolutional counting and localization method. Our method outperformed state-of-the-art techniques in quantitative metrics while providing real-time results. By design, we limit operations to important areas while discarding unimportant ones. While our method greatly improves counting and localization performance, it is limited in detecting and counting highly occluded objects. As can be seen in the bottom row of Figure 3, our network has difficulty detecting cows inside shrubbery with high occlusion. This technique is easily portable to other application domains, as it provides general implementations rather than specific hand-crafted techniques. Our technique could be expanded to an iterative series of DiscNets, or a cascade of weak convolutional regional classifiers. Through this iterative conversion of dense to sparse information representations, successive networks would work on less and less information.

Conflicts of Interest:
The authors declare no conflict of interest.