Domain-Aware Neural Architecture Search for Classifying Animals in Camera Trap Images

Simple Summary Camera traps acquire visual data in a non-disturbing and round-the-clock manner, so they are popular for ecological researchers observing wildlife. Each camera trap may record thousands of images of diverse species and bring about millions of images that need to be classified. Many methods have been proposed to classify camera trap images, but almost all methods rely on very deep convolutional neural networks that require intensive computational resources. Such resources may be unavailable and become formidable in cases where the surveillance area is large or becomes greatly expanded. We turn our attention to camera traps organized as groups, where each group produces images that are processed by the edge device with lightweight networks tailored for images produced by the group. To achieve this goal, we propose a method to automatically design networks deployable for edge devices with respect to given images. With the proposed method, researchers without any experience in designing neural networks can develop networks applicable for edge devices. Thus, camera trap images can be processed in a distributed manner through edge devices, lowering the costs of transferring and processing data accumulated at camera traps. Abstract Camera traps provide a feasible way for ecological researchers to observe wildlife, and they often produce millions of images of diverse species requiring classification. This classification can be automated via edge devices installed with convolutional neural networks, but networks may need to be customized per device because edge devices are highly heterogeneous and resource-limited. This can be addressed by a neural architecture search capable of automatically designing networks. However, search methods are usually developed based on benchmark datasets differing widely from camera trap images in many aspects including data distributions and aspect ratios. Therefore, we designed a novel search method conducted directly on camera trap images with lowered resolutions and maintained aspect ratios; the search is guided by a loss function whose hyper parameter is theoretically derived for finding lightweight networks. The search was applied to two datasets and led to lightweight networks tested on an edge device named NVIDIA Jetson X2. The resulting accuracies were competitive in comparison. Conclusively, researchers without knowledge of designing networks can obtain networks optimized for edge devices and thus establish or expand surveillance areas in a cost-effective way.


Introduction
Visual data are a rich source of information about wildlife and can provide strong support for wildlife conservation and ecological research. One cost-effective way to obtain visual data of wildlife is via camera traps that work in a non-disturbing [1] and round-theclock manner [2], thus making them ideal for observing wild animals otherwise difficult to monitor [3], e.g., nocturnal mammals [4] and large animals [5]. Because camera traps are noninvasive [6], a single deployment may record a diverse range of species [7]. Consequently, the recorded images have to be processed before being adopted in ecological research [8]. The images may go through several processing stages determined by the research process, a fundamental stage of which is species identification that is usually implemented as automatically and centrally classifying camera trap images at a data center installed with very deep convolutional neural networks (CNNs) [9][10][11][12][13]. In practice, there may be millions of images produced by camera traps [7,12,14,15], so image transfer and processing at a data center is often computationally intensive and costly. Furthermore, the scale of the surveillance area may also be restricted by the processing capability of the data center.
Edge computing [16] was ideally developed for such cases, i.e., intensive computation centralized at a data center can be split and localized by edge devices near camera traps [17,18]. Thus, fundamental processing steps such as removing images without animals [6,7,12,[19][20][21] and classifying images with animals [9][10][11][12][13] can be automatically conducted on edge devices. However, edge devices are not only heterogeneous [22] but also resource constrained [23]. These limitations of edge devices narrow down the range of available neural networks [23,24]. Hence, lightweight networks [25] designed for edge devices are critical in edge computing for camera trap images. Even so, "deep neural network design is very difficult, and it requires the experience and knowledge of experts, a lot of trial and error, and even inspiration" [26]. Luckily, network design can be automated through neural architecture search (NAS) [27]. However, NAS is often developed regardless of domain knowledge [28] regarding camera trap images of wildlife [29]. Specifically, NAS is often designed based on benchmark datasets such as CIFAR-10 [30] and ImageNet [31], which differ from camera trap images in many aspects, especially data distribution and aspect ratios as described below.
The data distribution of benchmark datasets may differ from camera trap images, e.g., in classes, image foregrounds and backgrounds. Camera trap images purely contain animals, but only partial classes in benchmark datasets are relevant to animals. For instance, six out of ten classes in CIFAR-10 are related to animals and 233 out of 1000 classes in ImageNet are relevant to vertebrate [32]. Consequently, NAS based on benchmark datasets may waste resources on designing networks optimized for data irrelevant to animals. In addition to classes, animal images in benchmark datasets also differ from camera trap images in foregrounds and backgrounds. For benchmark datasets [30,31], images are usually artificially preprocessed to guarantee that foreground animals are large and centered and their backgrounds are relatively small and may differ from animal habitats in the wild. In contrast, animals in habitats are photographed by camera traps under various conditions, so the animals may appear at random locations in images and are often closely related with image backgrounds.
The image aspect ratios of benchmark datasets differ from camera trap images, e.g., the aspect ratios of CIFAR-10 and ImageNet are both 1:1 (the image width and height are the same), though this ratio may not hold for camera trap images. For instance, the resolutions of camera trap images range from 2048 × 1536 (aspect ratio: 4:3) to 2616 × 1472 (16:9) in North American Camera Trap Images, i.e., NACTI [13], and the resolutions range from 1920 × 1080 (16:9) to 2048 × 1536 (4:3) in Missouri Camera Trap Images, i.e., MCTI [33]. Therefore, networks found by NAS based on benchmark datasets may require that camera trap images be resized to satisfy the aspect ratio 1:1. However, resizing images may alter their aspect ratios and introduce interpolated pixels, often resulting in either misshaped animals or memory waste.
In short, images from benchmark datasets adopted by NAS often differ from camera trap images, and this difference potentially implies domain shift [34]. Additionally, it may be hard to modify existing networks in line with the applications [29]. These issues inspired us to develop NAS based on the domain knowledge of camera trap images for edge devices. We used the proposed method to conduct searches directly on camera trap images rather than images of benchmark datasets. The aspect ratios of camera trap images are maintained during the search, which is guided by a loss function particularly derived for finding the lightweight networks. The hyper parameter of loss function was theoretically analyzed and carefully chosen, and lightweight networks found by the search were tested on the NVIDIA Jetson X2 edge device. The experimental results confirmed the validity of the proposed method. The main contributions of this paper are as follows.
1. A method named Domain-Aware Neural Architecture Search (DANAS) was developed regarding the domain knowledge of camera trap images. The proposed method directly searches networks on camera trap images, thus avoiding negative effects such as the domain shift incurred by benchmark datasets in conventional search methods. 2. Aspect ratios of camera trap images are maintained during the search. As part of domain knowledge, the changes of aspect ratio may not be automatically tackled by neural networks. Therefore, the changes are manually eliminated by first finding the most frequent aspect ratio and then padding images whose aspect ratios differ from the most frequent one. 3. A loss function was derived to guide DANAS to find lightweight networks applicable for edge devices. A theoretical analysis of the proposed loss function was conducted, and the analysis revealed the value of hyper parameter in the loss function to boost its guiding effect on the search.

Datasets
Two datasets were employed in this study: MCTI and NACTI, containing 24 thousand and 3.7 million camera trap images, respectively, with varying resolutions. Since label errors are found in NACTI and its millions of images require too much computational resources, NACTI was selectively adopted in this study in the form of a subset named NACTI-a containing 29 thousand images with varying resolutions. The species data in NACTI-a and MCTI are illustrated in Table 1.

Method
DANAS was developed within the framework of reinforcement learning [35,36], i.e., the search is implemented on sampling candidate networks from a search space through a sampler [29], as shown in Figure 1. In DANAS, the sampler is long short-term memory (LSTM) [37]. The reason to use LSTM as the sampler is that this sampler does not rely on parameter sharing [38], which may not be helpful for finding high-performance networks (as reported by [39]). Around the sampler, there are five conceptual search steps (from 1 to 5 in Figure 1). By repeating these steps, the quality of the sampled network is gradually improved via updates of the learnable parameters θ of the sampler. Starting from the first step, all five steps are introduced sequentially next. In Step 1 shown in Figure 1, LSTM samples candidate networks from the search space defined by a meta architecture, i.e., a prototype from which all candidate networks are derived. The meta architecture is similar to the ones defined in [35][36][37][38], i.e., a pipeline segmented to groups of layers called cells. There are two types of cells, normal and reduction cells, and the cells of the same type share the same inner structure. Besides the inner structures, the normal and reduction cells differ in the way they process data, i.e., the width and height dimensions of data remain the same before and after normal cells while the width and height of the input are halved through reduction cells. There are N normal cells in the pipeline, and each normal cell is adjacent to two reduction cells. At the end of the pipeline, a global average [40] is appended. In this study, the reduction cell was simplified to a single pooling layer, i.e., an average pooling or a max pooling with a kernel size of 5 × 5 or 3 × 3, and the normal cells were sampled based on the meta cell shown in Figure 2. As shown in Figure 2, a normal cell is a group of blocks whose inputs come from blocks in the same cell or previous B cells. For blocks not serving as inputs of any other blocks, their outputs are concatenated to produce the cell output. Each normal cell has B (a constant) blocks, and each block has M (determined by the sampler) operations. The operation is sampled from the same set of operations as that in [38], e.g., a stack of 3 × 3 depth-wise-separable convolution [41], batch normalization [42], and ReLU [43]. Accordingly, the sampler first determines the operation number M of a block by sampling an integer from some predefined integers, and then it samples inputs and operations for the block, and the sampling repeats for B blocks to form a normal cell. Once the normal cell has been sampled, the sampler samples a pooling layer to form a reduction cell. Both the sampled normal cell and the reduction cell are employed to build the candidate network.
In Step 2 shown in Figure 1, the candidate network is built based on the sampled cells and the meta architecture, i.e., assembling the cells according to the cell pipeline. The building process is identical with the one introduced in [44], i.e., we applied the adaptive meta-architecture [44] to build candidate networks. Once the candidate network is properly built, its performance is evaluated based on the camera trap images with maintained aspect ratios.
In Step 3 shown in Figure 1, the candidate network is trained and validated on camera trap images with the most frequent aspect ratio, i.e., the occurrences of unique aspect ratios of camera trap images are counted and the aspect ratio with the maximal count is chosen as the most frequent one. Images with aspect ratios different the most frequent one are padded by zero pixels. In practice, camera trap images are processed to have the same aspect ratio before the search starts, and the processed images are employed to train the candidate network. The trained network is then validated to yield validation accuracy to compute the loss.
In Step 4 shown in Figure 1, an accuracy reward [44] is generated based on accuracies obtained by training and validating a candidate network, and both the produced accuracy and the network parameter number [25] are employed to generate the loss J . The purpose of this step is to train LSTM to sample "good" networks via gradient-based optimization algorithms such as stochastic gradient descent (SGD). The meaning of "good" is twofold, i.e., the parameter number s of the network should be close to the desired parameter number (s * = 1.5 million in our case) and the accuracy reward R of the network should be close to the ideal accuracy (R * = 100, i.e., 100% accuracy). Since the reward is twofold, we need a bivariate reward function f (R, s) so that the gradient ∇ θ J of the total loss J synchronizes with the reward. According to the case of the unary loss function in studies of reinforcement learning [45], we defined J (θ) as where θ represents learnable parameters associated with the sampler, R is the accuracy reward involving the training and the validation accuracies of the candidate network, and s is the parameter number of the network in millions. The bivariate function f (R, s) provides the reward based on (R, s), and a Σ summarizes probabilities of sampling the candidate network through the sampler, i.e., where the notations are similar to [44], i.e., n C is the number of the cells in the candidate network, B i denotes the block number of the ith cell, n j is the operation number of the jth block, and P(x|y ) is the probability of sampling x under condition y. The details of a Σ can be found in Appendix A. Since and f (R, s) yields a scalar, the direction of ∇ θ|R,s J is solely determined by ∇ θ a Σ . However, a Σ remains unknown due to the unknown probability distributions of n j and a k , which means the direction of ∇ θ a Σ is out of our control, i.e., we cannot change the direction of ∇ θ|R,s J to point to promising positions of high rewards. However, we can change its magnitude ∇ θ|R,s J via f (R, s) so that ∇ θ|R,s J synchronizes with the reward. For example, suppose the sampler sampled a network of (R, s) close to (R * , s * ); we expect the sampler to sample networks alike, which requires that θ should not be largely updated by SGD involving ∇ θ|R,s J . However, ∇ θ|R,s J is partially determined by |∇ θ a Σ |, so ∇ θ|R,s J may not remain small when (R, s) is close to (R * , s * ). In this case, f (R, s) should scale |∇ θ a Σ | to ensure that the resulting ∇ θ|R,s J is relatively small. This requires the reward surface defined by f (R, s) to be similar to a whirlpool with vortex (R * , s * ).
We chose Witch of Agnesi [46] to build f (R, s) on account of its bell-like curve and the simple mathematical form that only introduces one hyper parameter. Therefore, f (R, s) is defined as where a ∈ R is the hyper parameter introduced by Witch of Agnesi. In practice, R * usually equals 100 (100% accuracy) [44], s * is determined by the application, and only a remains unknown. The value of a may be discovered by assuming both R and s are restrained within some range, and this assumption may be reasonable under certain search conditions. Specifically, let x = s − s * and y = R; then f (R, s) can be written as Assuming x 1 ≤ x ≤ x 2 and y 1 ≤ y ≤ y 2 , the volume V of f (x, y) within the assumed ranges is given by Suppose u 2 = −u 1 = u * < π/2 and 0 ≤ y ≤ R * , then, the formula above can be simplified by substituting tan u and sin(2u) by their Taylor series of order three, i.e., which is equivalent to which is a special case of monic cubic polynomials, i.e., the depressed cubic: u * 3 + c 1 u * = c 2 . According to Cardano's formula, the solution of the depressed cubic is where c 1 and c 2 are The solution u * of the depressed cubic requires where ε may take a small value such as 10 −6 . Figure 3 illustrates the surface of f (R, s) parameterized by a = 3/4 − 10 −6 , R * = 100 and s * = 1.5 within the ranges 0 ≤ R ≤ 2R * /3 and 0 ≤ s ≤ 5. As expected, f does have a whirlpoollike surface with the vortex (R * , s * ), and the sampler may be guided by ∇ θ|R,s J involving f to find lightweight networks. In step 5 shown in Figure 1, a selecting and training strategy is employed to find the optimal network. The idea behind this strategy is concentrating computational resources on promising networks found during the search, as the method first samples a relatively large number of candidate networks with small training epochs, e.g., 2 epochs in our case, and then finds the promising ones based on the sampled networks with large training epochs. In practice, we ran a single search to sample 1500 networks, and then networks with parameter numbers ranging from 1 to 1.5 million (the ideal parameter number in our case) were sorted decreasingly by their validation accuracies. If there are more than 150 networks, then only top 150 networks are retained for retraining through 5 epochs, and then the trained networks are sorted based on accuracies. If there are networks with accuracies >90%, then 15 networks are retained and retrained through 10 epochs; otherwise, half of the networks are retained and retrained. We stopped this procedure at 15 epochs and selected the top-1 network. If the difference between the accuracies between the top-1 and the top-2 networks was not large, e.g., less than 1%, then we would increase the epoch number and continue the training.

Results
The performance evaluation of DANAS was individually conducted on the NACTI-a and MCTI datasets. As shown in Table 1, the most frequent resolution of both NACTI-a and MCTI is 2048 × 1536 (aspect ratio 4:3). Accordingly, the images of the two datasets were resized to have the resolutions 85 × 64 (4:3) [44] for the search and 224 × 168 (4:3) for the test; for each dataset, the search was conducted on 85 × 64 images, and then the optimal network discovered by the search was trained and tested on 224 × 168 images. Each dataset was split to three subsets, i.e., the training set, the validation set and the test set, and the search was conducted on the first two subsets. The split was implemented by randomly sampling images from the dataset at a ratio of 0.64:0.16:0.2 of the sample numbers of three subsets, namely, 20% images were randomly sampled from the dataset to build the test set, then 20% images were randomly sampled from the remaining images to build the validation set, and the rest of the images served as the training set. The candidate networks found by the search were trained on the training set and then tested on the validation set, so the test set remained unknown to the search.
In searches on NACTI-a and MCTI, the pipeline shown in Figure 2 had three pairs of one reduction cell and five normal cells (N = 3) at most. The normal cell had five blocks (B = 5), and each block may have had five operations (M = 5) at most. The input channels of the normal cell and reduction cell were, respectively, fixed to 20 and 40. The output channel of the reduction cell was fixed to 40, while the output channel of the normal cell was automatically determined by its operations. The candidate network was trained by using AMSGrad [47] with a batch size of 32, two epochs, and a learning rate of 0.005. The training was conducted on 85 × 64 training images via a PyTorch module named Distributed Data Parallel (DDP) that loaded the network and the batches to available GPUs, individually trained networks on GPUs, collected the resulting gradients from all GPUs and synchronized networks based on the collected gradients. The trained network was then tested on 85 × 64 images of the validation set on each GPU, and the resulting accuracies were retrieved via PyTorch module named Manager. The retrieved accuracies were then averaged to yield the training and the validation accuracies that were used to generate the accuracy reward. Finally, the loss was computed based on the accuracy reward through the loss function whose hyper parameters were set as a = 3/4 − 10 −6 , R * = 100 and s * = 1.5. All searches were done on a workstation installed with 4 GPUs of NVIDIA TITAN Xp, Ubuntu 20.04, PyTorch 1.7.0 and MySQL 8.0.13.
In tests, several networks famous for their lightweight designs or performance were chosen for comparison with DANAS, i.e., MobileNet-v2 [48], EfficientNet [49], DenseNet [50], Resnet-18 [51], ResNext [52] and Wide ResNet [53]. Each network was trained by using SGD [54] of Nesterov momentum [55] with a batch size of 10, 20 epochs, and a learning rate ranging from 0.005 to 0.0001. The learning rate was changed by cosine schedule [54]. The training was conducted on 224 × 168 images from both the training and the validation sets via DDP, and the weights of the network at the last epoch were saved on the hard disk. During tests, the weights were read from the disk and employed to populate the network, and the network was tested on 224 × 168 images of the test set. All networks in comparison were trained and tested on the workstation, and the optimal networks found by DANAS were additionally tested on an NVIDIA Jetson X2 edge device installed with Ubuntu 18.06 and PyTorch 1.1.0.
Since the camera trap images differ widely between MCTI and NACTI-a, DANAS found different networks, which led to distinct accuracies and misclassifications for two datasets. The detailed results are discussed in the following sections.

Search and Test on NACTI-a
The search on NACTI-a consumed roughly 74 hours and found a network with 1.36 million parameters. The search performance was compared with a random search via steps like those shown in Figure 1. Specifically, the sampler in step 1 of Figure 1 was replaced by random sampling, and both memory constriction [44] in step 2 and sampler updating in step 4 of Figure 1 were removed. However, the memory constriction could not truly be removed due to the limited physical GPU memory, and the constriction was thus alleviated by resampling the networks until the pipeline shrinkage [44] did not happen. The training and test configurations of networks explored by the random search were the same as those in DANAS. The search procedures of the random search and DANAS are visualized in Figures 4 and 5, respectively.  As shown in Figure 4 regarding the random search, there were 57 networks with parameter numbers exceeding 2.5 million and 79 networks with validation accuracies exceeding 60%.
As shown in Figure 5 regarding DANAS, there were 32 networks with parameter numbers exceeding 2.5 million and 140 networks with validation accuracies exceeding 60%, and one of them was chosen as the optimal network according to step 5 in Figure 2. The optimal network is highlighted by a yellow star in Figure 5 and its normal cell is depicted in Figure 6; its reduction cell was simply a max pooling with a 3-by-3 kernel. The detailed network structure based on the normal cell shown in Figure 6 is illustrated in Figure 7, which shows how the data flowed through the normal and the reduction cells. The connections between cells are denoted by arrows. In Figure 7, cells labeled "normal cell i − 4", "normal cell i − 3" . . . "normal cell i + 1" correspond to cells labeled "Cell i − 4", "Cell i − 3" . . . "Cell i + 1" in Figure 6. If we rotate Figure 6 clockwise by 90 • , then cell labels and arrow colors in Figure 6 will match labels and arrow colors in Figure 7. For instance, yellow arrows between "cell i − 1" and "normal cell" in Figure 6 correspond to the yellow arrow between "3 × 3 max pool" and "normal cell i" in Figure 7, purple arrows between "cell i − 2" and "normal cell" in Figure 6 correspond to the purple arrow between "normal cell i − 2" and "normal cell i" in Figure 7, and so forth. For each normal cell shown in Figure 7, its inputs are signified by "a direct arrow running from the previous cell" and "three curved arrows running from another three previous cells", and each arrow in Figure 7 corresponds to a group of arrows with the same color in Figure 6.
The input channels of the normal and reduction cells were fixed to 20 and 40, respectively. The output channel of the reduction cell was fixed to 40, and the output channel of the normal cell was automatically determined by its operations. The fixed channel numbers served as element-wise additions within blocks, i.e., only tensors of the same dimensions could be added element-wise. Therefore, channels of any block input were assumed to be 20. If the input channel differed from this constant, then the input was fed to an additional stack of 1-by-1 convolution, batch normalization and ReLU for changing the channel number to 20. Accordingly, the channel numbers of all block outputs were the same, i.e., 20, due to the fact that no operation within a block affected input data dimensions. Besides the channel numbers of inputs, if an input to a block differed in widths or heights, then all inputs were resized to have the minimal width and height found among the block inputs. Therefore, all block inputs shared the same dimensions, and element-wise additions worked in any block. As shown in Figure 6, the cell output was obtained by concatenating block outputs, which required that outputs to concatenate had the same width and height. If the outputs differed in width or height, then they were resized to the maximal width and height found among outputs to concatenate. The number of cell output channels could thus be easily derived by counting the number of outputs to concatenate, e.g., for the normal cell in Figure 6, its output channel number was 60 = 3 × 20, i.e., three block outputs were concatenated to yield the cell output. For reduction cell, since the pooling layer only halved the width and height dimensions of inputs, the output channel was the same as the input channel, i.e., 40. If inputs of a reduction cell had different channel numbers other than 40, then the inputs were fed to the channel-changing stack the same as the normal cell. All convolutions in normal cells had strides set to 1 and paddings set to 1 or 2, respectively, for 3 × 3 or 5 × 5 convolutions. All poolings had strides set to 2 and paddings set to 1 or 2, respectively, for 3 × 3 or 5 × 5 poolings.
As shown in Figure 7, the output of the last normal cell was fed to a global average, i.e., a w × h average pooling where w and h refer to the input width and height, respectively. Here, a w × h × c tensor was pooled to c scalars via the global average where c denotes the class number. If the input to the global average had channels other than c channels, then the input was fed to an additional 1-by-1 convolution of stride set to 1 and padding set to 0 before the input was fed to the global average. The results of the network shown in Figure 7 are illustrated in Table 2, and the best accuracy within each row is highlighted by bold texts. As shown in Table 2, although the parameter number of the optimal network discovered by DANAS was small, the average test accuracy associated with DANAS was the third best of the compared networks. However, there were eight species accuracies in DANAS that were the best (bold digits in Table 2) compared to other networks, and there were eight best species accuracies in Resnet-18, which demonstrated the best average test accuracy.
There were 155 images misclassified by DANAS. Among all misclassifications, 78 were color images and the rest were night-vision images, i.e., about half misclassified images were night-vision images. The image samples of typical misclassifications are illustrated in Figure 8, i.e., the partial animal body in the left sample, the small region occupied by the animal in the middle left sample, and visually similar animals in the right and the middle right samples. Among all misclassifications, about 64% (99 samples) were misclassified due to the visual similarity of animals, and these misclassifications mainly originated from the deer and canine species. Samples of deer and canine misclassifications are shown in Figure 9. The misclassifications were mainly made among red deer (29 samples) and red fox (23 samples). For red deer samples, 14 samples were grayscale images without colors (the left sample in Figure 9), and the remaining color samples always contained red deer whose heads were obscured due to camera view limitations, body orientations (the middle-left sample in Figure 9), etc. For red fox samples, 11 samples were grayscale images (the middle-right sample in Figure 9), and the remaining color samples always contained foxes occupying small image regions (the right sample in Figure 9). Samples of misclassifications other than deer are shown in Figure 10. The misclassifications were made among bobcat, cougar, coyote, moose, etc., due to reasons similar to those of the deer and red fox misclassifications. Additionally, misclassification samples only containing animal heads are shown in Figure 10.

Search and Test on MCTI
The search on MCTI consumed roughly 62 hours and found a network with 1.43 million parameters. The search performance was compared with a random search whose configuration was the same as the one introduced in the previous section. The search procedures of DANAS and the random search are visualized in Figures 11 and 12, respectively. As shown in Figure 11 regarding the random search, there were 66 networks with parameter numbers exceeding 2.5 million (62 points on the right of vertical line at 2.5 in the figure; four points are not shown due to limited space) and 16 networks with validation accuracies exceeding 50%.
As shown in Figure 12 regarding DANAS, there were 15 networks with parameter numbers exceeding 2.5 million (13 points on the right of vertical line at 2.5 in the figure; two are not shown due to limited space) and 93 networks with validation accuracies exceeding 50%; one of them was chosen as the optimal network according to step 5 in Figure 2. The optimal network is highlighted by a yellow star in Figure 12, its normal cell is depicted in Figure 13; its reduction cell was simply a max pooling with a 3-by-3 kernel.  The network structure based on the cell in Figure 13 was the same as the one shown in Figure 7 because normal cells found on both NACTI-a and MCTI involved all previous cells, and the pipeline in Figure 7 illustrates how data flowed at the cell level (in contrast to the data flow at the block level that is shown in Figures 6 and 13). The test results are shown in Table 3, and the best accuracy within each row is highlighted by bold texts. As shown in Table 3, the parameter number of the optimal network discovered by DANAS was small, and the average test accuracy associated with DANAS was the best throughout the networks in comparison. There were 167 images misclassified by DANAS. Among all misclassifications, 45 were color images, and the rest were grayscale images. Samples of typical misclassifications are shown in Figure 14, i.e., vagueness due to dirty camera lens in the left sample, similar backgrounds and species in the middle samples, and partial animal body in the right sample.

Tests on Jetson X2
The optimal networks discovered by DANAS with the MCTI and NACTI-a datasets were tested on the NVIDIA Jetson X2 edge device shown in Figure 15. Because the versions of PyTorch installed in the workstation and the Jetson X2 are different, the format of network weights saved in the workstation was incompatible with Jetson X2. This issue was tackled by loading and resaving weights in Pickle-based format through PyTorch's built-in function torch.save() with the parameter "_use_new_zipfile_serialization" set to False. The resaved network weights and 224 × 168 test images were copied to Jetson X2 through secure copy protocol as in [44]. The test results are shown in Table 4.  As shown in Table 4, the average accuracies on Jetson X2 were 92.91% and 94.31% for NACTI-a and MCTI, respectively. The average accuracies on Jetson X2 were slightly lower than the corresponding accuracies obtained on the workstation, i.e., 92.86% and 94.02% for NACTI-a and MCTI, respectively.

Comparisons between DANAS and other Search Methods
Since comparisons of search methods based on custom-defined search space and various hardware may introduce bias, we compared our method with other methods via Nasbench-201 [39]. Nasbench-201 provides a database and application programming interfaces (APIs) for comparing search methods with the same search space and hardware.
In Nasbench-201, all candidate networks in a specific search space were trained, validated and tested on the CIFAR-10, CIFAR-100 and ImageNet-16-120 datasets [39]. The training, validating and testing accuracies were saved in databases and could be programmatically retrieved via an API, a network code encoding the network architecture. There were five operations scattering in three cells, i.e., the first cell containing one operation, the second containing two, and the last containing three. For each operation, there were five options available for sampling, i.e., "nor_conv_3 × 3", "none", "nor_conv_1 × 1", "avg_pool_3 × 3" and "skip_connect" [39]. Nasbench-201 does not distinguish networks of different operation inputs (i.e., for all networks of operations arranged in the same encoding order, only one network is trained, validated and tested on the aforementioned datasets.), so there are 5 × 5 2 × 5 3 = 15, 625 [39] networks in Nasbench-201, and a sampler tested on Nasbench-201 is restricted to sample operations only. Accordingly, we simplified our sampler and applied Bayesian optimization [21] to automatically find values of the sampler hyper parameters, i.e., the embedding dimension, the hidden unit number, the layer number, and the learning rate were set to 19, 33, 1, and 0.005, respectively. The rest of the configuration was the same as that of DANAS.
According to [39], there are two types of search methods tested on Nasbench-201, i.e., methods dependent on or independent of parameter sharing. Parameter sharing often means weights of a newly-sampled network are initialized by using weights from the previously-sampled networks trained on the dataset, so the weights of previously trained networks are not abandoned during the search. In [39], parameter-sharing-dependent methods were repeated three times and other methods were repeated 500 times. For each run of the method independent of parameter sharing, the method continued to run until the simulated training time [39] of its sampled networks reached a predefined limit called time budget, i.e., 12,000 s. [39]. The simulated training time of the sampled network was obtained by adding its training and validation time saved in Nasbench-201.
Since our method (DANAS) is independent of parameter sharing, DANAS was tested according to the configuration of search methods independent of parameter sharing, i.e., the search based on DANAS was repeated 500 times and each search automatically stopped once the time budget was reached. Different from methods in [39], our method requires an additional hyper parameter, i.e., the ideal parameter number s * . This parameter is set to the parameter number of the candidate network with the optimal validation accuracy. Accordingly, DANAS was tested against three datasets available in Nasbench-201, i.e., CIFAR-10 (s * = 0.87 in millions), CIFAR-100 (s * = 0.86 in millions), and ImageNet-16-120 (s * = 1.29 in millions). Because network weights required by parameter-sharing-dependent methods were not available at the time of paper submission, we only tested parametersharing-independent methods with Nasbench-201 on our own hardware. Specifically, all search steps except for training, validating and testing sampled networks were conducted on our hardware, and network accuracies and parameter numbers were directly retrieved from Nasbench-201. The configurations of all methods except DANAS were the same as [39]. The results are illustrated in Table 5. As shown in Table 5, five search methods were compared on three benchmark datasets, i.e., CIFAR-10, CIFAR-100 and ImageNet-16-120 [39]. Among methods in comparison, i.e., REA [56], RS [57], REINFORCE [45] and BOHB [58], our method (DANAS) achieved the second best test accuracy on CIFAR-10 and the third best test accuracy on both CIFAR-100 and ImageNet-16-120.

Discussion
DANAS was evaluated on two datasets, NACTI-a and MCTI. For both datasets, the random searches significantly differed from DANAS in changes of validation accuracy and parameter number over time.
In the case of NACTI-a, the number of networks with parameter numbers exceeding 2.5 million in the random search was almost twice that of DANAS, and the number of networks with validation accuracies exceeding 50% in the random search was roughly half that of DANAS. More importantly, the distribution of points from DANAS in Figure 5 illustrates a growing trend towards networks with few parameter numbers and high validation accuracies, i.e., the search tended to find Pareto solutions good for both accuracy and parameter number, while no such trend can be seen in Figure 4 regarding the random search.
In the case of MCTI, the ratio between the numbers of networks with 2.5 million parameter numbers or above for random search and DANAS was higher than the case of NACTI-a, i.e., about 4:1, and the ratio between the numbers of networks with validation accuracies exceeding 50% for the random search and DANAS was lower than the case of NACTI-a, i.e., about 1:8. The distribution of points from DANAS in Figure 12 illustrates the same trend as the case of NACTI-a, and the random search showed no such trend, as depicted in Figure 11.
The performance of the networks found by DANAS was evaluated by comparing the test accuracies with seven CNNs with parameter numbers ranging from 0.7 to 66.8 million on two datasets, NACTI-a and MCTI. Although the parameter numbers of networks found on both datasets were lower than 1.5 M, the test accuracy was the third best for NACTI-a and the best for MCTI. These results reveal the benefit of designing CNNs with structures highly customized for studied data and used device. Generally, the experimental results confirmed the validity of DANAS.
The search efficiency of DANAS was compared with search methods reported in [39] based on Nasbench-201, and the search methods with parameter sharing were retested on our hardware. For all benchmark datasets of Nasbench-201, our method outperformed all parameter-sharing-dependent methods reported in [39] and most of parameter-sharingindependent methods including the random search. Generally, DANAS outperformed NAS methods with parameter sharing and was competitive compared with NAS methods without parameter sharing.

Conclusions
In this study, DANAS is proposed to automatically design lightweight CNNs for ecological research powered by camera traps and edge computing. DANAS was developed based on domain knowledge of camera trap images, i.e., the search is conducted on camera trap images whose resolutions are lowered while the original aspect ratios are maintained. Therefore, the data distribution of the original dataset is preserved during the search, so the data distribution difference incurred by using benchmark datasets in traditional NAS is reduced in DANAS. Furthermore, the search in DANAS is guided by a loss function designed based on Witch of Agnesi whose hyper parameter was theoretically derived. In experiments, DANAS was shown to successfully find lightweight networks for two datasets of wildlife camera trap images. The found networks were then trained on a workstation and tested on both the workstation and an edge device. In comparison with CNNs of classical lightweight designs and good performance, the networks found by DANAS had low parameter numbers and competitive test accuracies. Generally, researchers without knowledge of designing CNNs can obtain lightweight CNNs optimized for edge devices through DANAS and thus expand surveillance areas in a cost-effective way. Data Availability Statement: Dataset NACTI is available at http://lila.science/datasets/nacti. Dataset MCTI is available at https://lila.science/datasets/missouricameratraps.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Since LSTM plays the role of a sampler, options to sample have to be converted to length-fixed vectors to feed the LSTM. This conversion is accomplished via a technique called word embedding, i.e., options are uniquely represented by plain words such as "conv 3 × 3" and each word is mapped to a row of a |V| × d matrix, where V denotes the set of unique words, |V| is the number of unique words, and d is a hyper parameter. The word-embedding matrix is implemented as a standalone layer in our sampler, which means the matrix elements were learned during the search. To sample a network from our search space, LSTM starts with a random word from V, i.e., a random row of wordembedding matrix is fed to the input layer of LSTM. The data flow through LSTM and yield two tensors: "hidden states" and "cell states". Hidden states are associated with the immediately previous sampling, and cell states are associated with long-term sampling found useful during training. Hidden states are then fed to some linear layer whose output corresponds to the sampling options. For example, suppose there are five different inputs and four different types; then there are two separate linear layers (matrices) of size h × 5 and h × 4, where h denotes the dimension of hidden states. If an operation input is desired, then the hidden states are only fed to the h × 5 linear layer, which yields a five-element vector, and the resulting vector is fed to SoftMax function to produce five probabilities. The probabilities are then used to build a categorical distribution, and the input is sampled based on the built distribution.
Mathematically, suppose the last sampling results in hidden states h t−1 ; then, for tth sampling, LSTM attempts to sample blocks of the normal cell. For the jth block, LSTM firstly samples the number of operations in the block and then samples inputs and types of operations, namely, h t−1 is fed to LSTM to produce h t , and h t is fed to a linear layer. The linear layer produces a vector that is converted to probabilities via softmax(), and an integer n j is sampled based on the probability P n j |θ via categorical distribution. Thus, current block has n j operations. Starting from the first operation, h t is fed to LSTM to produce h t+1 which is fed to a linear layer, and the vector produced by the linear layer is converted to probabilities. The operation input a 1 is then sampled based on P a 1 n j , θ , and h t+1 is fed to LSTM to yield h t+2 , which is used to produce the probabilities to sample the operation type a 2 based on P a 2 a 1 , n j , θ . For the kth sampling, h t+k−1 is fed to LSTM to yield h t+k which is fed to a linear layer to sample the option a k based on P a k a 1 :(k−1) , n j , θ . This sampling procedure continues until all blocks have been sampled. Then, LSTM starts to sample blocks of the reduction cell.
Specifically, suppose there are n C cells to sample; for the ith cell, LSTM needs to sample B i blocks. For the jth block, suppose LSTM samples the operation number n j based on P n j |θ ; then, LSTM needs to sample inputs and types for every n j operations. For the kth operation, suppose LSTM samples the input and type denoted by a k based on P a k a 1 :(k−1) , θ ; then, the samplings related with the kth operation may be quantitatively represented by n j ∑ k log P a k a 1 :(k−1) , θ , (A1) and the samplings related with the jth block may be quantitatively represented by log P n j |θ + n j ∑ k log P a k a 1 :(k−1) , θ , and the samplings related with the ith cell may be quantitatively represented by Thus, the value of a Σ (θ) is associated with θ of LSTM, and θ can be optimized via some optimization algorithm such as SGD. Even so, a Σ (θ) alone cannot make SGD work because a Σ (θ) contains no information about rewards, i.e., the gradient ∇ θ a Σ used in SGD is out of control since we are not sure whether ∇ θ a Σ leads to a promising position of high reward or not. This can be alleviated by introducing rewards, as described in the paper.