Passive infrared sensor-activated cameras, otherwise known as camera traps, have proved to be a tool of major interest and benefit to wildlife management practitioners and ecological researchers [1
]. Camera traps are used for a diverse array of purposes including presence–absence studies [3
], population estimates [6
], animal behaviour studies [10
], and species interactions studies [12
]. Comprehensive discussions of camera trap methodologies and their applications are provided in sources including [15
]. The capacity of camera traps to collect large amounts of visual data provides an unprecedented opportunity for remote wildlife observation; however, these same datasets incur a large cost and burden as image processing can be time consuming [2
]. The user is often required to inspect, identify and label tens of thousands of images per deployment, depending on the number of camera traps deployed. Large-scale spatio-temporal studies may involve tens to hundreds of cameras deployed consecutively over months to years, and the image review requirements are formidable and resource intensive. Numerous software packages have been developed over the last 20 years to help with analysing camera trap image data [19
], but these methods often require some form of manual image processing. Automation in image processing has been recognised internationally as a requirement for progress in wildlife monitoring [1
] and this has become increasingly urgent as camera trap deployment has grown over time.
The identification of information within camera trap imagery can be tackled using (a) paid staff, (b) internet crowd-sourcing, (c) citizen science, or (d) limiting the study size. All approaches involving human annotators can encounter errors due to fatigue. Using staff requires access to sufficient budget and capable personnel and constitutes an expensive use of valuable resources in terms of both time and money. The quality of species identification is likely to be high, but the time of qualified staff is then lost to other tasks, such as field work and data interpretation. Internet crowd-sourcing involves out-sourcing and payment to commercial providers. This approach can be economical with fast task completion; however, there is potential for large variation between annotators, influenced by experience and skill. Volunteer citizen scientists can also provide image annotation services, typically via access to web sites such as Zooniverse [20
]. Costs are lower than employing staff, but reliable species identification might require specialized training and errors have important implications for any subsequent machine algorithms developed [21
]. Limited control of data access, sharing and storage raises concerns around sensitive ecological datasets (e.g., endangered species) along with privacy legislation [22
]. Nonetheless, the use of volunteers or citizen scientists has proved effective in the field of camera trapping—notably via TEAM Network
] and the Snapshot Serengeti
]—but for some taxa, human identification has been shown to be problematic [25
]. Meek and Zimmerman [26
] discuss the challenges of using citizen science for camera trap research, particularly how managing such teams along with the data can incur enormous costs to the researchers. Limiting the design of studies by reducing the number of camera traps deployed, reviewing data for the presence of select species only, or evaluating only a proportion of the available data and archiving the remainder are unpalatable options. Such approaches constrain the data analysis methodologies available and limit the value of research findings [27
To overcome the limitations of approaches outlined above, including human error and operator fatigue, we have utilised computer science to develop automated labelling. As well as being able to confirm results, key strengths of this approach, compared to existing options, include it being consistent, comparatively fast, standardised, and relatively free from biases associated with operator fatigue. Advances in computer vision have been pronounced in recent years, with successful demonstrations of image recognition in fields as diverse as autonomous cars, citrus tree detection from drone imagery, and the identification of skin cancer [29
]. Recent work has also demonstrated the feasibility of Deep Learning approaches for species identification in camera trap images [32
] and more widely across agricultural and ecological monitoring [33
]. In the context of camera traps it is worth noting that such algorithms have been used in prototype software for this purpose since at least 2015, in projects such as Wild Dog Alert
]—building on earlier semi-automated species recognition algorithms [39
]. The practical benefit of this research for end-users has been limited, because they cannot access software to automatically process camera trap images. We therefore developed ClassifyMe
as a software tool to reduce the time and costs of image processing. The ClassifyMe
software is designed to be used on constrained hardware resources (such as field laptops), although it can also be used on office workstations. This is a challenging requirement for a software application because it is required to operate across diverse computer hardware and software configurations while providing the end-user with a high level of control and independence over their data. To elaborate on how we tackle these issues we outline the general structure and operation of ClassifyMe
and provide an evaluation of its performance using an Australian species case study along with Supplementary Materials
evaluating performance across Africa, New Zealand and North America.
2. Materials and Methods
The software is developed so it can be installed on individual computers under an End User Licence Agreement. The intent is that the user will upload an SD card of camera trap images, select the relevant model and then run ClassifyMe
on this dataset to automatically identify and sort the images (Figure 1).
The proposed workflow allows camera trap images to be processed on the user’s machine. This provides a high level of control over the use of and access to the data, alleviating concerns around the sharing, privacy and security of using web services. Furthermore, ClassifyMe avoids the need for the user to upload their data to cloud infrastructure, which can be prohibitive in terms of accessibility, time and cost. ClassifyMe adopts a ‘tethered’ service approach, whereby the user needs only intermittent internet access (every 3 months) to verify security credentials and so ensure continued access to the software. The ‘tethered’ service approach was adopted as a security mechanism to obstruct misuse and the unauthorised proliferation of the software for purposes such as poaching. A practitioner can therefore validate security credentials and download the appropriate regional identification model (e.g., New England model) prior to travel into the field. When in the field, ClassifyMe can be used to evaluate deployment success (e.g., after several weeks of camera trap data collection) and can be used in countries with limited or no internet connectivity. Validation services are available for approved users (e.g., ecology researchers or managers) who require extensions of the tethered renewal period.
2.2. Software Design Attributes
The software design and stability of ClassifyMe were complicated by our choice to operate solely on the user’s computer. As such, the software is capable of operating on a plethora of different operating systems and hardware designs. To limit stability issues in ClassifyMe, however, we have decided to currently release and support only the Windows 10™ operating system, which is widely used by field ecologists. Different hardware options are supported, including CPU-only and GPU; the models used by ClassifyMe are best supported by NVIDIA GPU hardware and, as a result, users with this hardware will experience substantially faster processing times (up to 20 times faster per dataset).
The ‘tethered’ approach and corresponding application for software registration might be viewed as an inconvenience by some users. However, these components are essential security aspects of the software. The ClassifyMe
software is a decentralised system; individual users access a web site, download the software and the model and then process their own data. The ClassifyMe
web service does not see the user’s end data and—without the registration and ‘tethering’ process—the software could be copied and redistributed in an unrestricted manner. When designing ClassifyMe
, the authors were in favour of free, unrestricted software, which could be widely redistributed. During the course of development, it occurred to the team that the software was also at risk of misuse. In particular ClassifyMe
could be used to rapidly scan camera trap images whilst in field to detect the presence of particular species such as African elephants which are threatened by poaching [40
]. To address this concern, a host of security features were incorporated into ClassifyMe
. These features include software licencing, user validation and certification, and extensive undisclosed software security features. Disclosed security features include tethering and randomly generated licence keys, and facilities to ensure that ClassifyMe
is used only on the registered hardware and unauthorised copying is prevented. In the event of a breach attempt, a remote shutdown of the software is initiated.
All recognition models are restricted, and approval is issued to users on a case-by-case basis. This security approach is implemented in a privacy-preserving context. The majority of security measures involve hidden internal logic along with security provisions of the communications with the corresponding ClassifyMe
web service at https://classifymeapp.com/
(to ensure the security of communications with the end user and their data). Information provided by the user and the corresponding hardware ‘fingerprinting’ identification is performed only with user consent and all information is stored on secured encrypted databases.
A potential disadvantage of the local processing approach adopted by ClassifyMe is that the user’s own computing resources are utilised, which potentially limits the scale and rate of data processing. An institutional cloud service, for instance, can auto-scale (once the data is uploaded) to accommodate data sets from hundreds of camera trap SD cards simultaneously. In contrast, the ClassifyMe user will be able to process only one camera trap dataset at a time. The ClassifyMe user will also have to implement their own data record management system; there is no database system integrated within ClassifyMe, which has the benefit of reducing software management complexity for end users but the disadvantage of not providing a management solution for large volumes of camera trap records. ClassifyMe is designed simply to review camera trap data for species identification, to auto-sort images and to export the classifications (indexed to image) to a csv file.
2.3. Graphical User Interface
When ClassifyMe is initiated, the main components consist of: (a) an image banner which displays thumbnails of the camera trap image dataset, (b) a model selection box (in this example set as ‘New England NSW’), and (c) the dialogue box providing user feedback (e.g., ‘Model New England NSW loaded’), along with a series of buttons (‘Load’, ‘Classify’, ‘Cancel’, ‘Clear’, ‘Models’) to provide the main mechanisms of user control (Figure 2).
The image banner provides a useful way for the user to visually scan the contents of the image data set to confirm that the correct data set is loaded. The ‘Models’ selection box allows users to select the most appropriate detection model for their data set. ClassifyMe offers facilities for multiple models to be developed and offered through the web service. A user might—for instance—operate camera trap surveys across multiple regions (e.g., New England NSW and SW USA). Selection of a specific model allows the user to adapt the model to the specific fauna of a region. Access to specific models is dependent on user approval by the ClassifyMe service providers. Facilities exist for developing as many classification models as required but dependent on the provision of model training datasets.
The dialogue box of ClassifyMe provides the primary mechanism of user interaction with the software. It provides textual responses and prompts which guide the user through use of the software and the classification process. Finally, the GUI buttons provide the main mechanism of user control. The ‘Load’ button is used to load an image dataset from the user’s files into the system; the ‘Classify’ button to start the classification of the loaded image data using the selected model; the ‘Cancel’ button to halt the current classification task, and the ‘Clear’ button to remove all current text messages from the dialogue box.
When an image dataset is loaded and the classification process started (Figure 3
), each image is scanned sequentially for the presence of an animal (or other category of interest) using the selected model. ClassifyMe
automatically sorts the images into sub-directories corresponding to the most likely classification and can also automatically detect and sort images where no animal or target category is found. The results are displayed on-screen via the dialogue box which reports the classification for each image as it is processed. The full set of classification results, which includes the confidence scores for the most likely categories, is stored as a separate csv file. ClassifyMe
creates a separate sub-directory for each new session. The full Unified Modelling Language (UML) structure of ClassifyMe
(omitting security features) is described in Supplementary Material S1.
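The sorting and export behaviour described above can be illustrated with a minimal Python sketch. It is not the ClassifyMe implementation: the function name `sort_images`, the `no_detection` folder name, the `classifications.csv` file name and the `min_confidence` threshold are all assumptions made for illustration, and the predictions are assumed to have already been produced by a recognition model.

```python
import csv
import shutil
from pathlib import Path


def sort_images(predictions, source_dir, session_dir, min_confidence=0.5):
    """Copy each image into a sub-directory named after its most likely class.

    `predictions` maps an image file name to a list of (label, confidence)
    pairs. An empty list, or a top score below `min_confidence`, sends the
    image to the hypothetical 'no_detection' folder.
    """
    source_dir, session_dir = Path(source_dir), Path(session_dir)
    session_dir.mkdir(parents=True, exist_ok=True)  # one folder per session
    rows = []
    for name, scores in sorted(predictions.items()):
        ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
        if ranked and ranked[0][1] >= min_confidence:
            label, confidence = ranked[0]
        else:
            label, confidence = "no_detection", 0.0
        target = session_dir / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_dir / name, target / name)
        rows.append({"image": name, "label": label, "confidence": confidence})

    # Export the classifications, indexed to image, as a csv file.
    with open(session_dir / "classifications.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["image", "label", "confidence"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Copying rather than moving the files is a deliberate choice in this sketch, so that the original SD card contents remain untouched.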
2.4. Recognition Models
The primary machine learning framework behind ClassifyMe
is DarkNet and YOLOv2 [41
]. The YOLOv2 framework is an object detector
deep network, based on a Darknet-19 convolutional neural network structure. YOLOv2 provides access to not only a classifier (e.g., species recognition) but also a localiser (where in image) and a counter (how many animals) which facilitates multi-species detections. ClassifyMe
at present is focused on species classification but future models could incorporate these additional capabilities due to the choice of YOLOv2. YOLOv2 is designed for high-throughput processing (40–90 frames per second) whilst achieving relatively high accuracy (YOLOv2 544 × 544: mean Average Precision of 78.6 at 40 frames per second on the Pascal VOC 2007 dataset using an NVIDIA GeForce GTX Titan X GPU [36
]). A range of other competitive object detectors such as SSD [42
], Faster R-CNN [43
] and R-FCN [44
] could also have been selected for this task. Framework choice was governed by a range of factors including: accuracy of detection and classification; processing speed on general purpose hardware; model development and training requirements; ease of integration into other software packages; and licencing. Dedicated object classifiers such as ResNet [45
] also provide high-accuracy performance on camera trap data [46
]; however, such models lack the future design flexibility of an object detector.
ClassifyMe is designed for the end-user to install relevant models from a library accessed via the configuration panel. The model is then made available for use in the model drop-down selector box; e.g., the user might install the Australian and New Zealand models via the configuration panel and, when analysing a specific data set, select the New Zealand model. These models are developed by the ClassifyMe
development team. Models are developed in consultation with potential end-users and when the image data provided meets the ClassifyMe
data requirements standard (Refer Supplementary Material S2
). Importantly, ClassifyMe
recognition models perform best when developed for the specific environment and species cohort to be encountered—and the specific camera trap imaging configuration to be used—in each study. When used outside the scope of the model, detection performance and accuracy might degrade. ClassifyMe
is designed primarily to support end-users who have put effort into ensuring high-quality annotated datasets and who value the use of automated recognition software within their long-term study sites.
2.5. Model Evaluation
ClassifyMe has currently been developed and evaluated for five recognition models. These are the Australia (New England, New South Wales), New Zealand, Serengeti (Tanzania), North America (Wisconsin) and South Western USA models. The Australia (New England, NSW) model was developed from data collected at the University of New England’s Newholme Field Laboratory, Armidale NSW. The New Zealand model was developed as part of a predator monitoring program in the context of the Kiwi Rescue
]. The Serengeti model was produced from a subset of the Snapshot Serengeti dataset [24
]. The North America (Wisconsin) model was developed using the Snapshot Wisconsin dataset [48
], whilst the South West USA model was developed using data provided by Caltech camera traps data collection [49
]. Source datasets were sub-set according to minimum data requirements for each category (comparable to the data standard advised in Supplementary Material S2
) and in light of current project developer resources.
Object detection models were developed for each dataset using YOLOv2. Hold-out test data sets were used to evaluate the performance of each model on data not used for model development. These hold-out test data sets were formed by randomly sampling images from the project image repository. Sample size varied with data availability, but the preferred approach was a balanced design (equal images per class) with an 80% training, 10% validation and 10% testing split, with the training set used for network weight estimation, the validation set for optimising algorithm hyper-parameters and the testing set for obtaining model performance metrics. No further constraints were imposed, such as ensuring test data was sourced from different sites or units. This approach is reasonable for large, long-term monitoring projects involving tens to hundreds of thousands of images captured from a discrete number of cameras in fixed locations. Visual correlation between images in small, randomly sampled data subsets is generally minimal in such situations. In this case, the algorithms developed are intended to process further imagery captured from these specific cameras and locations, and the model assessment approach needs to reflect this scenario adequately. The model performance assessment does not correspond to generalised location-invariant learning, which requires a different approach in which model assessment occurs on image samples from different cameras, locations or projects. This is not the presently intended use of ClassifyMe
, whose models are optimised to support specific large projects and not a general use case for any camera trap study. Generalised location-invariant models require further evaluation before they can be incorporated in future editions of ClassifyMe
. Model training was performed on a Dell XPS 8930 Intel Core i7-8700 CPU @ 3.20 GHz NVIDIA GeForce GTX 1060 6 GB GPU 16 GB RAM 1.8 TB HDD drive, running a Windows 10 Professional x64 operating system using YOLOv2, via the “AlexeyAB” Windows port [50
]. Training consisted of 9187 epochs, 16,000 iterations and 23 h for the natural illumination model, and 9820 epochs, 17,000 iterations and 25 h for the infrared illumination model.
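The balanced 80%/10%/10% design described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function name `balanced_split`, the `per_class` parameter and the fixed random seed are assumptions, and, as in the text, no constraint on camera site or unit is applied.

```python
import random
from collections import defaultdict


def balanced_split(labelled_images, per_class, seed=0):
    """Randomly split (image, label) pairs 80/10/10 with equal images per class."""
    by_class = defaultdict(list)
    for image, label in labelled_images:
        by_class[label].append(image)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_class):
        images = sorted(by_class[label])
        if len(images) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} images")
        chosen = rng.sample(images, per_class)  # balanced: same count per class
        n_train = per_class * 8 // 10
        n_val = per_class // 10
        splits["train"] += [(img, label) for img in chosen[:n_train]]
        splits["validation"] += [(img, label) for img in chosen[n_train:n_train + n_val]]
        splits["test"] += [(img, label) for img in chosen[n_train + n_val:]]
    return splits
```

A location-invariant assessment would instead group images by camera or site before sampling, which this sketch intentionally does not do.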
Overall recognition accuracies were 98.6% (natural illumination) and 98.7% (infrared illumination) for Australia (New England, NSW); 97.9% (natural and infrared illumination) for New Zealand; 99.0% (natural and flash illumination) for Serengeti; 95.9% (natural illumination) and 98.0% (infrared illumination) for North America (Wisconsin); and 96.8% (natural illumination) and 98.5% (infrared illumination) for the South West USA models. A range of model evaluation metrics were recorded, including accuracy, true positive rate, positive predictive value, Matthews Correlation Coefficient and AUNU (Area Under the Receiver Operating Characteristic Curve of each class against the rest, using the uniform distribution) [51
]. In this section, we will focus on the Australia (New England, NSW) model; further results for the other models are provided in Supplementary Material S3.
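For readers unfamiliar with these metrics, the sketch below shows how overall accuracy, per-class true positive rate, per-class positive predictive value and the multi-class Matthews Correlation Coefficient can be derived from a confusion matrix. It is a from-scratch illustration and not the tool used in the study; the function name `summarise` and the matrix layout (rows are true classes, columns are predicted classes) are assumptions. AUNU is omitted because it requires per-class confidence scores rather than a confusion matrix.

```python
import math


def summarise(confusion):
    """confusion[i][j] = number of images of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(confusion[i]) for i in range(k)]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    nan = float("nan")
    # True positive rate (recall) and positive predictive value (precision).
    tpr = [confusion[i][i] / true_counts[i] if true_counts[i] else nan for i in range(k)]
    ppv = [confusion[i][i] / pred_counts[i] if pred_counts[i] else nan for i in range(k)]

    # Multi-class Matthews Correlation Coefficient.
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = numerator / denominator if denominator else nan

    return {"accuracy": correct / total, "tpr": tpr, "ppv": ppv, "mcc": mcc}
```

For a two-class matrix this expression reduces to the familiar binary form of the Matthews Correlation Coefficient.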
The Australian (New England, NSW) data set consisted of nine recognition classes and a total of 8900 daylight illumination images and 8900 infrared illumination images. Specific details of the Australian (New England, NSW) data set are provided in Table 1
. Note that the models developed only distinguish between visually distinct classes; the current versions of ClassifyMe
models do not perform fine-grained recognition between visually similar classes, such as different species of Macropods. The component-based software design of ClassifyMe
allows the incorporation of such fine-grained recognition models if they are developed in the future. Another important consideration is that model evaluation has been performed for ‘in-bag’ samples, that is, the data was sourced from particular projects with large annotated data sets and the model developed is intended for use only within this project and network of cameras to automate image review. The application of the models to ‘out-of-bag’ samples from other sites or projects is not intended and can produce unstable recognition accuracy.
As previously stated, model performance was assessed using a randomly held-out test data set; the detection summary (Table 2
), the confusion matrix of the specific category performance (Table 3
), and the model performance metrics were evaluated (Table 4
) using PyCM [52
]. Figure 4
displays examples of detection outputs, including the rectangle detection box that is overlaid on the location of the animal in the image and the detected category.
The results of our testing indicate that ClassifyMe provides a high level of performance which is accessible across a wide range of end-user hardware with minimal configuration requirements.
4.1. Key Features and Benefits
ClassifyMe is the first application of its kind; it provides a software tool which allows field ecologists and wildlife managers access to the latest advances in artificial intelligence. Practitioners can utilise ClassifyMe to automatically identify, filter and sort camera trap image collections according to categories of interest. Such a tool fills a major gap in the operational requirements of all camera trap users irrespective of their deployments.
There are additional major benefits to localised processing on the end-user’s device. Most importantly, the local processing offered by ClassifyMe provides a high degree of privacy protection for end-user data. By design, ClassifyMe does not transfer classification information or user image data back to third parties; rather, all the processing of the object recognition module is performed locally, with minimal user information transferred back, via encryption, to the web service. The information transferred to the web service concerns the initial registration and installation process and the on-going verification services aimed at disrupting unauthorised distribution (which is targeted specifically at poaching and similar misuses of ClassifyMe software). These privacy and data control features are known to be appealing to many in our wider network of ecological practitioners, because transmitting and sharing images with third parties compromises (1) human privacy when images contain people, (2) the location of sensitive field equipment, and (3) the location of rare and endangered species that might be targeted by illegal traffickers. Researchers and wildlife management groups also often want control over the end-use of their data and sometimes have concerns about the unforeseen consequences of unrestricted data sharing.
4.2. Software Comparisons
At present, there are few alternatives to ClassifyMe
for the wildlife manager wanting to implement artificial intelligence technologies for the automated revision of their camera trap images. The most relevant alternative is the MLWIC: Machine Learning for Wildlife Image Classification in R package [53
]. The MLWIC package provides the option to run pre-trained models, and also for the user to develop their own recognition models suited to their own data sets. Whilst of benefit to a subset of research ecologists skilled with R, the approach proposed by Tabak et al. [53
] is not accessible to a wider audience as it requires a considerable investment of time and effort in mastering the intricacies of the R Development Language and Environment, along with the additional challenges of hardware and software configuration associated with this software. Integration of the MLWIC package within R is sensible if the user wants to incorporate automated image classification within their own workflows. However, such automated image recognition services are already offered in other leading machine learning frameworks, particularly TensorFlow [54
] and PyTorch [55
]. Such frameworks offer extensive capabilities with much more memory efficient processing for a similar investment in software programming know-how (Python) and hardware configuration. In fact, our wider research team routinely uses TensorFlow and PyTorch—along with other frameworks such as DarkNet19 [41
]—for camera-trap focused research. Integration with R is straight-forward, via exposure to a web-service API or via direct export of framework results as csv files. Within R, there are Python binding libraries which also allow access to Python code from within R and the TensorFlow interface package [56
] also provides a comparatively easy way of accessing the full TensorFlow framework from within R. In summary, there are a range of alternative options to the MLWIC package which are accessible with programming knowledge. AnimalFinder [19
] is a MATLAB 2016a script available to assist with the detection of animals in time-lapse sequence camera trap images. This process is, however, semi-automated and does not provide species identification; it also requires access to a MATLAB software licence and corresponding software scripting skills. AnimalScanner [57
] is a similar software application providing both a MATLAB GUI and a command line executable to scan sequences of camera trap images and identify three categories (empty frames, humans or animals), based on foreground object segmentation algorithms coupled with deep learning.
The Wildlife Insights (https://wildlifeinsights.org
] promises to provide cloud-based analysis services, including automated species recognition. The eMammal project provides both a cloud service and the Leopold desktop application [59
]. The Leopold eMammal desktop application uses computer vision technology to search for cryptic animals within a sequence and places a bounding box around the suspected animal [60
]. The objectives and functions of eMammal are—however—quite broad, and support citizen science identifications, expert review, data curation and training within the context of monitoring programs and projects. This approach is very different from the approach adopted by ClassifyMe
, which is a dedicated, on-demand application focused on automated species recognition on a user’s local machine with no requirement to upload datasets to third-party sources. The iNaturalist project (https://www.inaturalist.org
] is of a similar nature to eMammal but focused on digital or smartphone camera-acquired imagery from contributors across the world, and uses deep learning convolutional neural network models to perform image recognition within its cloud platform to assist with review by citizen scientists. Whilst very useful with a wide user base, iNaturalist does not specifically address the domain challenges of camera trap imagery. Motion Meerkat is a software application which also utilises computer vision in the form of mixture of Gaussian models to detect motion in videos which reduces the number of hours required for researcher review [62
]. DeepMeerkat provides similar functionality using convolutional neural networks to monitor for the presence of specific objects (e.g., hummingbirds) in videos [63
]. There is a further, wide range of software available including Renamer [64
] and VIXEN [65
] to support camera trap data management. Young, Rode-Margono and Amin [66
] have provided a detailed review of currently available camera trap software options.
4.3. Model Development
An important design decision of ClassifyMe
was to not allow end-users to train their own models. This is in contrast to software such as the MLWIC package. The decision was motivated by both legal aspects and quality control as opposed to commercial reasons. Of particular concern is use of the software to determine field locations of prized species that poachers could then target. These concerns are valid, with recent calls having been made for scientists to restrict publishing location data of highly sought-after species in peer-reviewed journals [67
]. Such capabilities could be of use to technologically inclined poachers, and providing such software—along with the ability to modify that software—presented a number of potential legal issues. Similar concerns exist concerning human privacy legislation [22
]. The strict registrations, legal and technological controls implemented within ClassifyMe
are designed to minimise risk of misuse.
Allowing end-users to train their own models also presents quality control issues. The deep networks utilised within ClassifyMe
(and similar software) are difficult to train to optimal performance and reliability. Specialised hardware and its configuration are also required for deep learning frameworks, which can be challenging even for computer scientists. Data access and the associated labelling of datasets is another major consideration; many users might not have sufficient sighting records nor the resources to label their datasets. The risk of developing and deploying a model which provides misleading results in practice is high—with quite serious potential consequences for wildlife observation programs. Schneider, Taylor and Kremer [69
] compared the performance of the YOLOv2 and Faster R-CNN object detectors on camera trap imagery. The YOLOv2 detector performed quite poorly with an average accuracy of 43.3% ± 14.5% (compared to Faster R-CNN which had an accuracy of 76.7% ± 8.31%) on the Gold-Standard Snapshot Serengeti dataset. The authors suggested that the low performance was due to limited data. Our results clearly indicate that YOLOv2 can perform well with strict data quality control protocols. Furthermore, the ClassifyMe
YOLOv2 model is most effective at longer-term study sites, where the model has been calibrated using annotated data specific to the study site. ClassifyMe
is also designed to integrate well with a range of other object detection frameworks including Faster-RCNN which is utilised within the software development team for research purposes. Future editions of ClassifyMe
might also explore the use of other detection frameworks or customised algorithms based on our on-going research focused on ‘out-of-bag’ models, suited for general use as well as the fine-grained recognition of similar species.
ClassifyMe resolves the issue of model development for practitioners by out-sourcing model development to domain experts who specialise in the development of such technology in collaborative academic and government joint research programs. Users can request model development, either for private use via a commercial contract, or for public use—which is free—and on the provision of image data sets to a protocol standard, the model will be developed and assessed for deployment as a ClassifyMe model library. ClassifyMe is designed to enable the selection of a suitably complex model to ensure good classification performance, but to also enable storage, computation and processing within a reasonable time frame (benchmark range 1–1.5 s per image, Intel i7 16 GB RAM) on end user computers. Cloud-based solutions, such as those used in the Kiwi Rescue and Wild Dog Alert programs, have the capacity to store data in a central location using a larger neural network structure on high-performance computer infrastructure. Such infrastructure is costly to run and is not ideal for all end-users.
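The benchmark range quoted above translates into deployment-level processing times with simple arithmetic, sketched below. The function name `review_time_hours` and the example image count are illustrative assumptions; only the 1–1.5 s per image range comes from the text.

```python
def review_time_hours(n_images, seconds_per_image=(1.0, 1.5)):
    """Return the (lower, upper) processing time in hours for n_images."""
    low, high = seconds_per_image
    return n_images * low / 3600, n_images * high / 3600


# Example: a hypothetical deployment of 36,000 images.
lower, upper = review_time_hours(36_000)
print(f"{lower:.1f} to {upper:.1f} hours")  # 10.0 to 15.0 hours
```

Under these assumptions, a deployment of tens of thousands of images could be processed unattended overnight on a CPU-only laptop.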