Transformer for Tree Counting in Aerial Images

Abstract: The number of trees and their spatial distribution are key information for forest management. In recent years, deep learning-based approaches have been proposed and have shown promising results in lowering the expensive labor cost of a forest inventory. In this paper, we propose a new efficient deep learning model called the density transformer, or DENT, for automatic tree counting from aerial images. The architecture of DENT contains a multi-receptive field convolutional neural network to extract visual feature representations from local patches and their wide context, a transformer encoder to transfer contextual information across correlated positions, a density map generator to generate the spatial distribution map of trees, and a fast tree counter to estimate the number of trees in each input image. We compare DENT with a variety of state-of-the-art methods, including one-stage and two-stage, anchor-based and anchor-free deep neural detectors, and different types of fully convolutional regressors for density estimation. The methods are evaluated on a new large dataset we built and an existing cross-site dataset. DENT achieves top accuracy on both datasets, significantly outperforming most of the other methods. We have released our new dataset, called the Yosemite Tree Dataset, containing a 10 km² rectangular study area with around 100k trees annotated, as a benchmark for public access.


Introduction
The density and distribution of forest trees are important information for ecologists to understand the ecosystem of a region. For example, the environmental effect of deforestation or forest fires may be estimated based on the number of lost trees and their locations. In recent decades, forest trees have often been counted with the help of aerial imagery. Since manually counting the trees in images is time-consuming, automatic tree counting methods have been developed to lower the time cost. With the breakthrough of deep learning in the recent decade, deep neural networks (DNNs) have made unprecedented progress in computer vision tasks such as image classification [1][2][3][4][5] and object detection [6][7][8][9][10][11]. DNNs have also become widely popular for object counting. One approach to object counting using DNNs is detection-based, i.e., to localize each individual object of interest first and then obtain the total number. So far this is the mainstream of the published tree counting methods [12][13][14][15][16][17][18]. Another approach is to regress the density of objects in the image using DNNs and then calculate the object count. This approach has been successful for crowd (people) counting [19][20][21][22][23]. However, the effectiveness of density-based methods for tree counting is not sufficiently explored, as they are reported in much fewer published works with limited comparative evaluation [24,25].
In this work, we propose a new method for tree counting called the density transformer, or DENT, which consists of a multi-receptive field (Multi-RF) convolutional neural network (CNN), a transformer, and two heads: a Density Map Generator (DMG) and a tree counter. The Multi-RF CNN extracts visual features from images with multiple receptive fields of different sizes simultaneously, perceiving the patterns of both the local patch and its concentric context. The transformer models the pair-wise relations between the visual features and filters the contextual visual information shared across different positions using an attention mechanism. The two heads, the DMG and the tree counter, decode the hidden states of the transformer in parallel to generate tree density maps at different granularity levels. If a relatively coarse density map already meets the demand, the DMG can be detached after training to save inference time. The whole DENT model is end-to-end trainable.
Currently, very few benchmark datasets are publicly available for tree counting tasks. Previous works reported their performance tested on either private data or a small subset (<10k trees) of public datasets made for other tasks [16,26]. The lack of a common benchmark makes a fair comparison across different methods difficult. Hence, we created a new labeled dataset called the Yosemite Tree Dataset, which contains aerial images for a ∼10 km² rectangular area with ∼100k trees whose coordinates are annotated. It is suitable for evaluating not only the performance of tree counting methods but also the counting error versus the area of interest. We have released this dataset to the public.
To demonstrate the effectiveness of DENT, we compare DENT with many existing state-of-the-art methods of different types, including fully convolutional network regressors and detectors. The methods are evaluated on the Yosemite Tree Dataset and the cross-site NeonTreeEvaluation Dataset [16]. On both of them, DENT achieves competitive results with the best existing methods and significantly outperforms most of the other methods.
The main contributions of this work include two parts. The first part is the novel end-to-end approach for tree counting, using an efficient multi-receptive field CNN architecture for visual feature representation, a transformer for modeling the pair-wise interaction between the visual features, and two heads for outputs at different granularities and time costs. The second part is the new Yosemite Tree Dataset as a common benchmark for tree counting.

Transformers
Transformers [27] are attention-based deep learning models. They were initially proposed in the area of natural language processing (NLP). The input of a transformer is an embedding sequence. The pair-wise interaction between any two elements of the sequence is modeled by the transformer. The output corresponding to an element is aggregated from all the elements of the sequence with different weights depending on their relationships. In this paper, we adopt a transformer to enhance the CNN features by selectively transferring contextual information among different elements.
Transformer-based methods have also been proposed for computer vision tasks such as object detection [11], image classification [28] in recent years. These methods are also applied on remote sensing images such as in [29,30]. However, to the best of our knowledge, this work is the first work applying a transformer as a density regressor to count objects in aerial images.

Density Estimation
Learning density maps using deep CNNs is a trend in crowd counting. In this paradigm, the counting task is formulated as a regression problem: the CNNs are trained to predict the density distribution over the input image, while the location of each individual object is not explicitly predicted. When the objects are crowded, the density map representation is relatively robust. Existing works have tried different network architectures. MCNN [19] uses a multi-column network with different filter sizes for objects at different scales. The features from all the columns are fused to predict the crowd density map. SwitchCNN [20] has an additional classifier to predict and switch to the best column for the given image. CSRNet [21] generates a high-resolution density map. It is composed of a front-end CNN for feature extraction and a back-end CNN for map generation. It uses dilated convolution instead of pooling or transposed convolution to reduce the computational complexity. CANNet [23] encodes contextual information at different scales by subtracting the local average from the feature maps.
For tree counting tasks, an AlexNet [1] regressor is applied in [24]. In the work of [25], AlexNet [1], VGGNet [2], and a UNet [31] are evaluated and compared; the UNet achieves the best performance. In this paper, we follow the paradigm of the density estimation problem and formulate tree counting as a regression problem.

Object Detection
The purpose of object detection is to localize each object of interest in the image. Traditional detectors explicitly use a sliding window of predefined size to scan each position of the image [32][33][34][35][36]. These early works usually extract hand-crafted features such as HOG [34] and SIFT [37]. These features are finally fed to a classifier such as a support vector machine (SVM) or a neural network. Modern detectors make use of the powerful features from deep convolutional neural networks (CNNs) pre-trained on large-scale classification datasets [1]. These detectors adopt different strategies to generate bounding boxes for objects using CNNs. Faster-RCNN [6], RetinaNet [8], and YOLO [9] predefine a set of anchors and formulate the detection into two sub-problems: classification of the subimage in each anchor and regression of the offset between the ground truth box and the anchor. CenterNet [10] treats the center of an object as a keypoint and regresses the width and height. RetinaNet, YOLO, and CenterNet infer the results in one shot. In contrast, Faster-RCNN recomputes the features for classification after the generation of region proposals.
So far, most of the published tree counting methods are based on detection. These methods can be categorized into three groups: (1) Explicitly using sliding windows. The very early works in [38][39][40][41] synthesize the expected appearance of trees and generate a template based on prior knowledge. The likelihood of the existence of a tree in a sliding window is estimated by the correlation between the tree template and the image patch in the window. However, the templates oversimplify the diverse appearance of trees in the real world. Later works use hand-crafted features plus classifiers. For example, a feature descriptor using circular autocorrelation is designed to detect the shape of palm trees in [42]. The goal in [43] is also to detect palm trees, but the descriptor used is HOG [34]. The works in [13,44] instead use CNNs to recognize palm trees in the sliding window, learning the features automatically. TS-CNNs [45] has two sliding windows of different sizes, each with an AlexNet classifier: one recognizes the pattern of trees, and the other suppresses false positives according to the spatial distribution of the surrounding objects.
(2) Fully convolutional classifiers, which are equivalent to sliding-window CNN classifiers but with better computational efficiency. U-Net [31] and DenseNet [46] are used to predict confidence maps of trees in [47,48]. The peaks on the confidence maps are taken as the final predictions.
Counting trees in aerial images using detectors is straightforward but has some disadvantages, especially when the trees are dense and crowded. First, the representation of overlapping trees may be ambiguous for detectors at test time. A typical detector usually outputs an excessive number of initial boxes and applies Non-Maximum Suppression (NMS) to select the best ones. The basic idea of NMS is to check the Intersection over Union (IoU) of every pair of proposal boxes and remove the one with the lower detection score when their IoU is higher than a preset threshold (typically 0.45 or 0.50). For tree counting, it is often the case that two correct boxes have a high IoU. An example is shown in Figure 1b. In this case, the NMS procedure will likely remove either the blue box or the yellow box and cause an underestimation of the tree count. Second, the threshold for the detection score directly affects the predicted tree count, and tuning the threshold carefully requires extra effort. Third, bounding boxes are relatively expensive to label: the labelers need to determine the width and the height of the boxes, which is often difficult when the trees overlap.
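The underestimation caused by NMS can be reproduced in a few lines. The following minimal greedy-NMS sketch (box format and threshold values are illustrative, not tied to any specific detector) shows that two correct but heavily overlapping tree boxes both survive, or not, depending solely on the IoU threshold:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh):
    """Greedy NMS: visit boxes in descending score order and keep a box
    only if its IoU with every already-kept box is at most the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

With two correct but overlapping tree boxes such as (0, 0, 10, 10) and (2, 2, 12, 12) (IoU ≈ 0.47), a threshold of 0.50 keeps both, while a threshold of 0.45 suppresses one of them, so the counted trees drop from two to one.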

A New Density Transformer, DENT
The architecture of the DENT model is illustrated in Figure 2. It contains four main components: a Multi-Receptive Field convolutional network (Multi-RF CNN) to compute a feature map over an input image, a transformer encoder to model the interaction of features extracted from different positions, a Density Map Generator (DMG) to predict the density of the trees, and a counter to regress the number of trees in the image.

Starting from an RGB aerial image I ∈ R^(3×H_0×W_0), the Multi-RF CNN generates a low-resolution feature map f_CNN ∈ R^(C×H×W), where C is the number of output channels, and in this paper H = H_0/32 and W = W_0/32. The feature map is projected using a trainable linear transform to generate f_visual ∈ R^(d_model×H×W), where d_model is the dimension of the hidden states of the transformer encoder. For convenience, it can also be reshaped and represented in sequence form:

f_visual = [f_0, f_1, f_2, ..., f_L],

where each f_i ∈ R^(d_model) is the feature vector at one spatial position of the H × W feature map.

Since each f_i corresponds to a certain position p_i in the image, we use it to estimate the tree density at p_i. We also use a special embedding f_cnt ∈ R^(d_model) to query the number of trees in the image. The transformer encoder selectively transfers information across f_0 ∼ f_L and f_cnt. The final hidden states of the transformer are decoded by the DMG and the tree counter: the DMG generates a density map D ∈ R^(H×W), while the tree counter outputs the number of trees ẑ ∈ R. The details of the components are discussed in the following sections.

Multi-Receptive Field Network
Inspired by the macula of the human retina, we extract the feature representation of each position of the image using multiple receptive fields, based on two intuitive assumptions: a wide receptive field of a CNN covers a large area of the image containing rich contextual information; on the other hand, a narrow one focuses on the details in a small region of interest without being distracted by the surrounding objects.
Early works such as MCNN [19] and SwitchCNN [20] control the receptive fields by designing multi-column networks with different convolutional kernel sizes. We argue that such a strategy has limitations. First, with these methods it is not easy to implement a small receptive field on much deeper networks, because the receptive field generally grows quickly as the depth of the network increases. Modern deep networks usually have large receptive fields; for example, VGG16 [2] has a receptive field of 212 × 212 while ResNet50 [3] has a receptive field of 483 × 483 [49]. Second, the widely used pretrained off-the-shelf models cannot be reused, and searching for the optimal architecture and pretraining takes extra effort. To avoid these limitations, we use an off-the-shelf network as a backbone and add skip connections to its early layers to implement small receptive fields.
We propose the Multi-Receptive Field convolutional network (Multi-RF CNN) depicted in Figure 3. Specifically, the network contains a vanilla ResNet18 and two extra paths added on convolutional Block 2. We refer to the original path of ResNet18 from Block 2 (i.e., Blocks 3∼5) as Path A. Path B consists of two 1 × 1 convolutional layers. Path C is simply an average pooling layer. The strides of the three paths are all 32 with respect to the original input image, while their receptive fields are naturally different: 466 × 466, 43 × 43, and 47 × 47, respectively. Offsets are also applied to the inputs of Paths B and C to ensure that the output feature maps from the three paths are center-aligned. These feature maps are concatenated along the channel axis as the final output. Although the architecture of the Multi-RF CNN is surprisingly simple, we observe that it outperforms the vanilla ResNet18 in our experiments.

Transformer Encoder
We exploit the self-attention mechanism of transformer [27] to model two types of interactions: those between the visual features extracted at different positions, and those between the visual features and the counting query. In this section, we introduce the transformer encoder and discuss the two types of interactions.
Architecture. We use only the encoder part of a standard transformer. The encoder contains a stack of encoder layers; by default, the number of encoder layers is 2 in this paper. Each encoder layer (Figure 4a) has an identical structure containing a multi-head attention sublayer and a feed-forward sublayer. Each sublayer has a residual connection, and its output is processed by layer normalization [50]. The attention mechanism takes effect in the multi-head attention sublayer (Figure 4b), where the core function is scaled dot-product attention. Given a query matrix Q ∈ R^(L_q×d_k), a key matrix K ∈ R^(L_k×d_k) and a value matrix V ∈ R^(L_k×d_v), the scaled dot-product attention is defined as

Attention(Q, K, V) = softmax(QKᵀ/√d_k) V.

The multi-head attention can be defined as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, with head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

where h is the total number of heads, and the projection matrices W_i^Q, W_i^K ∈ R^(d_model×d_k), W_i^V ∈ R^(d_model×d_v), and W^O ∈ R^(h·d_v×d_model) are learnable at the training stage. We omit the other details of the transformer, since the encoder we use is almost the same as the original; we refer the readers to [27] for the details.
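For reference, the scaled dot-product attention can be written in a few lines of NumPy; this is a minimal single-head sketch without the learned projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (L_q, L_k) pair-wise attention scores
    return softmax(scores, axis=-1) @ V
```

Each output row is a weighted average of the rows of V, with weights that sum to 1 along the key dimension.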
Interaction between visual features. Contextual information is essential for density estimation. It can be extracted by convolutional networks within their receptive fields, as discussed in Section 3.1. However, the interaction in a convolutional network occurs only between the convolutional kernels and the previous-layer feature maps. As a supplement, we exploit the self-attention mechanism to realize the pair-wise interaction between features at different positions. The attention score of feature vector f_i on another feature f_j can be roughly defined as

a_ij = exp(f_i · f_j / √d_model) / Σ_k exp(f_i · f_k / √d_model), (4)

and the contextual information collected by f_i can be defined as

c_i = Σ_j a_ij f_j. (5)

Equations (4) and (5) are equivalent to an individual head of the multi-head attention mechanism when Q = K = V = f_visual.
However, Equations (4) and (5) are permutation-invariant and ignore any positional information. Hence, we add a 2D version of positional encodings [11,27,51] to the visual features before feeding them into the transformer encoder:

PE(x, y)_(4i) = sin(x/10000^(4i/d_model)),
PE(x, y)_(4i+1) = cos(x/10000^(4i/d_model)),
PE(x, y)_(4i+2) = sin(y/10000^(4i/d_model)),
PE(x, y)_(4i+3) = cos(y/10000^(4i/d_model)), (6)

where (x, y) is the 2D position on the feature map and i is the dimension index.
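This 2D sinusoidal encoding can be implemented directly. The sketch below builds it as a (d_model, H, W) array, with the channel layout following Eq. (6) (sin/cos of x, then sin/cos of y, per frequency group):

```python
import numpy as np

def pos_encoding_2d(h, w, d_model):
    """2D sinusoidal positional encoding: each group of four channels
    carries sin(x), cos(x), sin(y), cos(y) at one frequency."""
    assert d_model % 4 == 0
    pe = np.zeros((d_model, h, w))
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for i in range(d_model // 4):
        div = 10000 ** (4 * i / d_model)   # wavelength for frequency group i
        pe[4 * i + 0] = np.sin(xs / div)
        pe[4 * i + 1] = np.cos(xs / div)
        pe[4 * i + 2] = np.sin(ys / div)
        pe[4 * i + 3] = np.cos(ys / div)
    return pe
```

The encoding is added element-wise to the visual feature map, so every position receives a unique, smoothly varying signature of its (x, y) coordinates.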
Interaction between visual features and counting query. Inspired by the [CLS] token used in BERT [52], we introduce a token [CNT] appended to the end of the input sequence of the transformer encoder (Figure 2c). The corresponding token type embedding is f_cnt. Hence, the input of the transformer encoder is [f_0, f_1, f_2, ..., f_L, f_cnt]. The hidden state of the transformer corresponding to the [CNT] token represents the aggregate embedding of the sequence and serves as a global context for tree counting. In contrast, each visual feature vector corresponds to a patch of the image and is used to estimate the local tree density. For convenience, these visual feature vectors are also referred to as [DEN] tokens in this paper. To differentiate the two types of tokens, we also apply a token type embedding f_den to the [DEN] tokens (Figure 2b). The application of f_den can be seen as an in-place self-add operation: f_i += f_den. Specifically, f_cnt, f_den ∈ R^(d_model), and both are learnable parameters at training time. The usage of the two token type embeddings is inspired by [52], where segment embeddings are used for different sentences, and [53], where token type embeddings are used for visual features versus textual features.

Density Map Generator (DMG)
The Density Map Generator is a fully connected feed-forward network followed by a reshape operation. The feed-forward network takes the final hidden state of the transformer corresponding to each [DEN] token to predict the tree density. The output sequence for all [DEN] tokens is reshaped into a 2D map, which is the predicted tree density map.
Tree density map. A tree density map (Figure 1d) represents the spatial distribution of trees in the image. The ground truth tree density map can be generated from the keypoint annotations of the trees (Figure 1c). Given an image I, let p_i = (x_i, y_i) denote the location of the i-th tree and z the tree count. The original annotation map is generated as

A(p) = Σ_(i=1)^z δ(p − p_i),

where δ is the delta function. Following the works for crowd counting [19][20][21][23], the ground truth tree density map D_gt is generated from the annotation map convolved with a Gaussian kernel:

D_gt(p) = (A ∗ G_σ)(p) = Σ_(i=1)^z G_σ(p − p_i).

(In practice, the model learns an H × W tree density map, which is a sum-pooled version of the H_0 × W_0 density map.)
Here G_σ(x) is a 2D Gaussian kernel with standard deviation σ:

G_σ(x) = (1/(2πσ²)) exp(−‖x‖²/(2σ²)).

Let D(p; I, θ) denote the predicted density map, where θ stands for the parameters of DENT. The loss of the DMG is the Mean Squared Error (MSE):

L_DMG = (1/(B·H·W)) Σ_(b=1)^B Σ_p (D(p; I_b, θ) − D_gt(p; I_b))²,

where B is the batch size, and H, W are the height and width of the density map.
Density-based counting. At the test stage, the estimated count of trees ẑ_R in a region of interest R is given by the integral of the tree density map:

ẑ_R = Σ_(p∈R) D(p; I, θ), (11)

because when the region is much larger than the kernel support, i.e., |R| ≫ σ², we have

Σ_(p∈R) D_gt(p) ≈ z_R,

the true number of trees in R, since each Gaussian kernel integrates to 1.
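The ground-truth map construction and the density-based count can be sketched in pure NumPy. Here each blob is renormalized per tree so that boundary truncation does not bias the count; this is a common implementation choice, not necessarily the paper's exact one:

```python
import numpy as np

def density_map(points, h, w, sigma=4.0):
    """Ground-truth density map: one Gaussian blob per annotated tree.
    Each blob is normalized to sum to 1, so the whole map sums to the
    tree count even when a tree sits near the image border."""
    yy, xx = np.mgrid[0:h, 0:w]
    d = np.zeros((h, w))
    for x, y in points:                      # (x, y) keypoint annotations
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        d += g / g.sum()
    return d

def count_in_region(d, y0, y1, x0, x1):
    """Density-based counting: integrate (sum) the map over a region."""
    return d[y0:y1, x0:x1].sum()
```

Summing the map over the whole image recovers the annotated tree count, and summing over a sub-region approximates the number of trees inside it, as in Eq. (11).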

Tree Counter
The tree counter of DENT is a feed-forward network that decodes the transformer output corresponding to the [CNT] token. The target of the network is the tree count normalized by the number of [DEN] tokens, i.e., the average of the density map:

c_target = z / L.

This network is also trained using the MSE loss; we found that the normalization helps balance the losses of the tree counter and the DMG. Denoting c(I, θ) as the output of the tree counter, the loss of the tree counter is

L_cnt = (1/B) Σ_(b=1)^B (c(I_b, θ) − z_b/L)².

The predicted tree count is ẑ = c(I, θ) · L. The tree counter is a relatively lightweight head of DENT compared with the DMG. Since the tree counter gives a predicted tree count for each H × W area in the study area, its predictions over the whole study area can also be seen as a coarse density map. If a more refined density map is not demanded, the DMG can be pruned after training; the computational complexity of the dot-product attention in the top encoder layer is then reduced from O(L² · d_model) to O(L · d_model), because the interaction between [DEN] tokens in that layer is no longer needed. Examples of the density maps generated by a DMG and a tree counter are shown for comparison in Figure 5.

Yosemite Tree Dataset
We choose a rectangular study area, centered at latitude 37.854, longitude −119.548, in Yosemite National Park, and build a benchmark dataset for tree counting based on RGB aerial images (Figure 6). The images are collected via Google Maps at 11.8 cm ground sampling distance (GSD) and stitched together. The study area is 2262.5 m × 4525.1 m in the real world and 19,200 × 38,400 pixels in the image. Inside the study area, the position of each individual tree is manually labeled; the total number of labeled trees is 98,949. To illustrate the variance of the land covers, the directions of light, and the sizes and shapes of the trees, some 960 × 960 example images cropped from the study area are shown in Figure 6b. The dataset is publicly available for download at https://github.com/nightonion/yosemite-tree-dataset (accessed on 31 December 2021). We split the study area into four regions A, B, C, and D of the same size (Figure 6a). Regions B and D are used as the training set and Regions A and C as the test set. To evaluate the accuracy of different tree counting methods, we further divide the study area into small non-overlapping square blocks. The counting errors in different blocks are calculated separately, and the statistics of the errors are used as the metrics. Different block sizes, for example, 960 × 960 and 4800 × 4800, can be used to analyze the accuracy versus the size of the region of interest.
To better demonstrate the ground truth distribution of the tree counts versus the block size, histograms are shown in Figure 7.

NeonTreeEvaluation Dataset
We also evaluate the models using the NeonTreeEvaluation Dataset [16], which is collected from 22 sites across the United States by multiple types of sensors. The forest types vary across sites (examples are shown in Figure 8). In this work, we only use the fully labeled RGB data, as follows: (1) a test set of 194 images containing 6634 annotated trees, where each image is 400 × 400 pixels and corresponds to a 40 m × 40 m region in the real world; (2) a training set of 15 much larger images containing 17,790 annotated trees, which we crop into 3395 training images of 400 × 400 pixels, consistent with the test images.

Evaluation Metric
Following the works for crowd density estimation, we evaluate different methods for tree counting using the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE), which are defined as follows:

MAE = (1/N) Σ_(i=1)^N |z_i − ẑ_i|,  RMSE = √((1/N) Σ_(i=1)^N (z_i − ẑ_i)²),

where N is the total number of blocks in the test set, z_i denotes the true number of trees in the i-th block, and ẑ_i is the predicted number of trees in the i-th block inferred by the algorithms. For the NeonTreeEvaluation Dataset, a block is simply a test image. For the Yosemite Tree Dataset, we set the block size to 960 × 960 and 4800 × 4800 and report the results for both.
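The two metrics above reduce to a couple of lines over the per-block counts:

```python
import math

def mae(z_true, z_pred):
    """Mean Absolute Error over the per-block tree counts."""
    return sum(abs(z - zh) for z, zh in zip(z_true, z_pred)) / len(z_true)

def rmse(z_true, z_pred):
    """Root Mean Squared Error over the per-block tree counts."""
    return math.sqrt(sum((z - zh) ** 2
                         for z, zh in zip(z_true, z_pred)) / len(z_true))
```

RMSE penalizes large per-block errors more heavily than MAE, so the two together indicate both the average error and its variability across blocks.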

Comparison to State-of-Art Methods
We compare DENT with state-of-the-art methods of different types, including density-based methods and detection-based methods. The tested density-based methods include fully convolutional networks originally designed for segmentation and crowd counting. The tested detection-based methods include one-stage and two-stage, anchor-based and anchor-free detectors. For Faster-RCNN, RetinaNet, YOLOv3, CenterNet, CSRNet, SANet, and CANNet we use their official implementations; for the other methods, we use third-party open-source implementations.
The results are shown in Tables 1 and 2. The two heads of DENT, i.e., the DMG and the tree counter, achieve closely matched performance. On the Yosemite Dataset, they are nearly on par with CANNet and outperform the other state-of-the-art methods in terms of MAE and RMSE for every test region and block size setting. On the cross-site NeonTreeEvaluation Dataset, they significantly outperform all the other methods.

Technical Details
We implement DENT using PyTorch [55]. On the Yosemite Tree Dataset, we crop 320 × 320 subimages from the study areas for training and testing. On the NeonTreeEvaluation Dataset, since the test set is officially provided as 400 × 400 images, we crop 400 × 400 subimages from the large training images for training. As the downsampling rate of the whole DENT is 32, we zero-pad the input images to 416 × 416 in both the training and test phases. The batch sizes used to train DENT on the Yosemite Tree Dataset and the NeonTreeEvaluation Dataset are 48 and 32, respectively. Except for those mentioned above, we use the same settings to train DENT on the two datasets.
Pretraining and initialization. The ResNet in the Multi-RF network is pre-trained on the ImageNet dataset [56,57]. All the other components of DENT are learned from scratch. All the parameters of the transformer are initialized with Xavier [58]. The token type embeddings are initialized using a normal distribution.
Loss. The total loss during training is the weighted sum of the losses of the DMG and the tree counter:

L_total = L_DMG + λ · L_cnt,

where λ is a weighting factor to balance the losses of the two heads. In our experiments, we use λ = 1 by default.
Optimizer. We use Adam [59] to minimize the loss for a total of 300 epochs without weight decay. The initial learning rate is 10⁻⁵ for the first 100 epochs; after that, the learning rate is decayed by a factor of 0.5 every 50 epochs. We also apply gradient clipping to stabilize the training, with the max norm of the gradients set to 0.1.
Regularization and Data Augmentation. To reduce overfitting, dropout and random flipping are applied. Specifically, a dropout of 0.1 is added before each Add&Norm layer in the transformer encoder, and the training images, along with the target tree density maps, are randomly flipped horizontally and/or vertically.
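When flipping, the image and its target density map must be transformed together so the supervision stays aligned with the pixels. A minimal sketch (the (C, H, W) image and (H, W) map layouts are assumptions for illustration):

```python
import numpy as np

def random_flip(img, dmap, rng):
    """Randomly flip an image (C, H, W) together with its target density
    map (H, W), so the ground truth stays aligned with the pixels."""
    if rng.random() < 0.5:                       # horizontal flip
        img, dmap = img[:, :, ::-1], dmap[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        img, dmap = img[:, ::-1, :], dmap[::-1, :]
    return img.copy(), dmap.copy()
```

Because flipping only permutes pixels, the total of the density map, and hence the counting target, is unchanged by the augmentation.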

Ablation Study
To evaluate the effects of the Multi-RF CNN and the transformer layers, ablation experiments are conducted on the test set (the union of Regions A and C) of the Yosemite Tree Dataset with 960 × 960 blocks. The results are provided in Table 3. We start from a ResNet18 without a transformer, whose output is projected linearly to a single channel using a 1 × 1 convolutional layer. Interestingly, this baseline already achieves lower errors than some existing methods (Table 1). After adding the two extra paths to the ResNet18 to obtain the Multi-RF network, the counting errors are lowered (the third row in Table 3). Adding a two-layer transformer encoder yields performance gains on both the ResNet18 and the Multi-RF network. We also tried different numbers of transformer layers; two layers work best in our experiments, while more layers worsen the results and take a longer training time to converge.

Inference Time
To demonstrate the computational efficiency of DENT, we test it on the whole 19,200 × 38,400 study area and report the inference time. The tests are done with a single NVIDIA Tesla V100 SXM2 GPU with CUDA 11.3. Every neural layer runs in native PyTorch with batch size = 1 in the default FP32 precision. We run each case 10 times and report the average. The inference time of our basic implementation is 47.8 s.
Faster version. Due to the shift-invariance of convolution, when a study area is scanned by the Multi-RF CNN, the size of the scan window (input size) does not affect the final feature map. (This is true only when every layer in the backbone has padding size = 0; beware that if padding size = 0 is used at test time, it should be used at training time as well to avoid an accuracy drop.) We adopt a two-stage inference mode to improve GPU utilization and lower the time cost. At the first stage, the backbone takes in a larger input image (the resolution is still 11.8 cm GSD, but each input image covers a larger area in the real world) and generates a larger feature map. At the second stage, the transformer, along with the DMG and the tree counter, scans the feature map using its original input size. Testing this strategy with a 4800 × 4800 input size for the Multi-RF CNN shortens the inference time to 16.0 s. When the DMG is pruned as discussed in Section 3.4, the inference time is further shortened to 11.5 s. Even further improvement is possible with tricks like batch processing and low-precision inference, but this is beyond the scope of this paper. A comparison of these different implementations of DENT and the state-of-the-art object detectors in terms of inference time is shown in Table 4.

Conclusions and Future Work
We presented DENT, a deep neural regressor based on a CNN and a transformer for tree counting in aerial images. We built a large benchmark dataset, the Yosemite Tree Dataset, to evaluate different tree counting algorithms, and also used an existing cross-site dataset to test the robustness of the methods. Our approach achieved competitive results and outperformed the state-of-the-art methods. The ablation study further supported the effectiveness of the design.
With the advancement of drones, aerial imagery is becoming more and more affordable. However, due to the limited field of view, the captured photos need to be stitched together to create the whole picture of a large study area, and this procedure can be laborious. For this reason, an accurate video-based tree counting algorithm would be more automatic and appealing. Inspired by the emerging applications of video-based density estimation methods for crowd counting, we will explore video-based tree counting algorithms in future work.