The design for DisCountNet is detailed in Figure 2, which shows the full end-to-end implementation. DiscNet (the first stage) is an encoder characterized by convolutions with large kernels and leaky ReLU activation functions, followed by aggressive pooling. The first four convolutions use kernels of sizes seven, six, five, and four. The first three pooling operations are max pooling with a kernel size and stride of four, and the final pooling operation is max pooling with a kernel size and stride of two. The last layers of the network are a one-by-one convolution that reduces the feature map depth to two, followed by a SoftMax activation, yielding a 12-by-16 matrix of values representing the likelihood that a cow is found in a given region. The aggressive striding allows us to use larger kernel sizes to capture contextual information that could otherwise be lost, while limiting expensive operations. DiscNet then uses this matrix to convert the original input image into a sparse representation, operating on the assumptions listed above. CountNet uses the sparse representation to generate per-pixel probability values. This flow of information is visualized in Figure 7, which shows the data representations at different stages of the proposed network. Our training procedure is depicted in Algorithm 1. Given the dataset

${\left\{{X}_{i}\right\}}_{i=0}^{N}$, DiscNet is trained on the full images and region-label ground truths via a weighted cross entropy loss to determine whether there is a cow in a given region. Each datum ${X}_{i}$ consists of ${R}^{\left(i\right)}$ regions, where each region is labeled by ${y}_{r}^{\left(i\right)}$ with $r=1\dots {R}^{\left(i\right)}$. The result of the network prediction is denoted as

${\tilde{y}}_{r}^{\left(i\right)}$. We use a weighted cross entropy minimization, which is given by Equation (1); for convenience, we drop the superscript $\left(i\right)$ in the formula:

$\mathcal{L}=-\sum_{r=1}^{R}\left[\left(1-{p}_{r}\right)\,{y}_{r}\log{\tilde{y}}_{r}+{p}_{r}\left(1-{y}_{r}\right)\log\left(1-{\tilde{y}}_{r}\right)\right]$ (1)

where ${p}_{r}\in [0,1]$ represents the percentage of regions with desired information. This weighted loss function counterweights the loss values for our unbalanced dataset. For example, if an image is 90% background regions, the loss for foreground regions will be roughly ten times higher. This causes the network to weigh the loss for positive examples more heavily than for negative examples, resulting in more false positives and fewer false negatives. In our implementation, a false negative hurts performance much more than a false positive: a false negative in the discriminator means that a region with a cow is not passed to CountNet, so no cows can be detected there, whereas a false positive means that a region without a cow is passed on, which CountNet can still compensate for. In the second stage, since CountNet models a different function based on Assumption 4, it uses a different implementation. CountNet features a U-Net structure [

49] with modified operations. Operating on the sparse representation of the original image generated by DiscNet, CountNet creates a sparse density map. The encoding pathway features four convolution-pooling operations with skip connections to the decoding pathway, which uses transposed convolutions. All pooling operations are max pooling with a kernel size and stride of two. Each convolution uses a three-by-three kernel, and all transposed convolutions use a stride of two. Finally, the network uses a one-by-one convolution with a feature depth of one, representing the likelihood that a cow is present at a given pixel. The U-Net-like architecture is proven to provide accurate inferences while maintaining contextual information for per-pixel tasks.
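The two stages described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the channel widths, the thresholding rule used to build the sparse representation, and the example input resolution are illustrative assumptions; the text fixes only the kernel sizes, pooling strides, and output shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscNet(nn.Module):
    """First stage: large-kernel convs (7, 6, 5, 4) with leaky ReLU and
    aggressive max pooling (strides 4, 4, 4, 2), then a 1x1 conv to depth
    two and a channel-wise SoftMax over {background, cow} per region."""
    def __init__(self, widths=(16, 32, 64, 128)):  # assumed channel widths
        super().__init__()
        c0, c1, c2, c3 = widths
        self.features = nn.Sequential(
            nn.Conv2d(3, c0, 7, padding='same'), nn.LeakyReLU(), nn.MaxPool2d(4, 4),
            nn.Conv2d(c0, c1, 6, padding='same'), nn.LeakyReLU(), nn.MaxPool2d(4, 4),
            nn.Conv2d(c1, c2, 5, padding='same'), nn.LeakyReLU(), nn.MaxPool2d(4, 4),
            nn.Conv2d(c2, c3, 4, padding='same'), nn.LeakyReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(c3, 2, 1),  # reduce feature map depth to two
        )

    def forward(self, x):
        return torch.softmax(self.features(x), dim=1)  # per-region likelihoods

def to_sparse(image, region_probs, threshold=0.5):
    """Zero out regions DiscNet deems background (threshold is an assumption)."""
    mask = (region_probs[:, 1:2] > threshold).float()
    mask = F.interpolate(mask, size=image.shape[-2:], mode='nearest')
    return image * mask

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.LeakyReLU(),
    )

class CountNet(nn.Module):
    """Second stage: U-Net-like network with four conv-pool steps, skip
    connections, stride-2 transposed convs, and a 1x1 conv to one channel."""
    def __init__(self, widths=(32, 64, 128, 256)):  # assumed channel widths
        super().__init__()
        w = list(widths)
        self.enc = nn.ModuleList(
            [conv_block(3, w[0])] + [conv_block(w[i], w[i + 1]) for i in range(3)])
        self.pool = nn.MaxPool2d(2, 2)
        self.bottom = conv_block(w[3], 2 * w[3])
        ups, decs = [], []
        for cout in reversed(w):
            ups.append(nn.ConvTranspose2d(2 * cout, cout, 2, stride=2))
            decs.append(conv_block(2 * cout, cout))
        self.up, self.dec = nn.ModuleList(ups), nn.ModuleList(decs)
        self.head = nn.Conv2d(w[0], 1, 1)  # per-pixel cow likelihood

    def forward(self, x):
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottom(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```

Under these assumptions, a 1536-by-2048 input to DiscNet is downsampled by 4 × 4 × 4 × 2 = 128, producing the 12-by-16 region grid, i.e. each grid cell covers a 128-by-128-pixel region.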

CountNet is trained on the sparse data generated by DiscNet. CountNet’s loss is computed from regions and their corresponding ground-truth density map regions by minimizing the mean squared error. The ${\ell}_{2}$ loss function is given by Equation (2), where ${z}_{i}$ is a given ground-truth density map and ${\tilde{z}}_{i}$ is a prediction:

$\ell_2=\frac{1}{N}\sum_{i=1}^{N}{\left\|{z}_{i}-{\tilde{z}}_{i}\right\|}_{2}^{2}$ (2)
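The two training losses can be sketched in PyTorch as follows. The exact weighting split in the cross entropy, scaling the foreground term by $1-p_r$ and the background term by $p_r$, is an assumption consistent with the counterweighting behavior described above, not a confirmed detail of Equation (1).

```python
import torch

def weighted_cross_entropy(y_true, y_pred, p, eps=1e-7):
    """Weighted binary cross entropy over region labels (in the spirit of
    Equation (1)). y_true: 0/1 region labels; y_pred: predicted per-region
    cow likelihoods; p: fraction of regions with desired (foreground)
    information. The (1-p)/p split is an assumed weighting scheme."""
    y_pred = y_pred.clamp(eps, 1 - eps)
    fg = (1 - p) * y_true * torch.log(y_pred)      # foreground term, weight 1-p
    bg = p * (1 - y_true) * torch.log(1 - y_pred)  # background term, weight p
    return -(fg + bg).sum()

def l2_loss(z_true, z_pred):
    """Mean squared error between ground-truth and predicted density maps
    (Equation (2))."""
    return ((z_true - z_pred) ** 2).mean()
```

With $p = 0.1$ (an image that is 90% background), an error on a foreground region is weighted 0.9 versus 0.1 for a background region, matching the roughly tenfold counterweighting described above.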