Toward Audio Beehive Monitoring: Deep Learning vs. Standard Machine Learning in Classifying Beehive Audio Samples

Abstract: Electronic beehive monitoring extracts critical information on colony behavior and phenology without invasive beehive inspections and transportation costs. As an integral component of electronic beehive monitoring, audio beehive monitoring has the potential to automate the identification of various stressors for honeybee colonies from beehive audio samples. In this investigation, we designed several convolutional neural networks and compared their performance with four standard machine learning methods (logistic regression, k-nearest neighbors, support vector machines, and random forests) in classifying audio samples from microphones deployed above landing pads of Langstroth beehives. On a dataset of 10,260 audio samples where the training and testing samples were separated from the validation samples by beehive and location, a shallower raw audio convolutional neural network with a custom layer outperformed three deeper raw audio convolutional neural networks without custom layers and performed on par with the four machine learning methods trained to classify feature vectors extracted from raw audio samples. On a more challenging dataset of 12,914 audio samples where the training and testing samples were separated from the validation samples by beehive, location, time, and bee race, all raw audio convolutional neural networks performed better than the four machine learning methods and a convolutional neural network trained to classify spectrogram images of audio samples. A trained raw audio convolutional neural network was successfully tested in situ on a low voltage Raspberry Pi computer, which indicates that convolutional neural networks can be added to a repertoire of in situ audio classification algorithms for electronic beehive monitoring.
The main trade-off between deep learning and standard machine learning is between feature engineering and training time: while the convolutional neural networks required no feature engineering and generalized better on the second, more challenging dataset, they took considerably more time to train than the machine learning methods. To ensure the replicability of our findings and to provide performance benchmarks for interested research and citizen science communities, we have made public our source code and our curated datasets.


Introduction
Many beekeepers listen to their hives to ascertain the state of their honey bee colonies because bee buzzing carries information on colony behavior and phenology. Honey bees generate specific sounds when exposed to stressors such as failing queens, predatory mites, and airborne toxicants [1]. While experienced beekeepers can hear changes in the sounds produced by stressed colonies, they may not always be able to determine the exact causes of the changes without hive inspections. Unfortunately, hive inspections disrupt the life cycle of bee colonies and put additional stress on the bees. Transportation costs are also a factor because many beekeepers drive long distances to their far-flung apiaries due to increasing urban and suburban sprawl. Since beekeepers cannot monitor their hives continuously due to obvious problems with logistics and fatigue, a consensus is emerging among researchers and practitioners that electronic beehive monitoring (EBM) can help extract critical information on colony behavior and phenology without invasive beehive inspections and transportation costs [2]. When viewed as a branch of ecoacoustics [3], an emerging interdisciplinary science that investigates natural and anthropogenic sounds and their relations with the environment, audio beehive monitoring contributes to our understanding of how beehive sounds reflect the states of beehives and the environment around them.
In this article, we contribute to the body of research on audio beehive monitoring by comparing several deep learning (DL) and standard machine learning (ML) methods in classifying audio samples from microphones deployed approximately 10 cm above Langstroth beehives' landing pads. In our experiments, the convolutional neural networks (ConvNets) performed on par with or better than the four standard ML methods on two curated datasets. On the first dataset of 10,260 audio samples (BUZZ1) where the training and testing samples were separated from the validation samples by beehive and location, a shallower raw audio ConvNet with a custom layer outperformed three deeper raw audio ConvNets without custom layers and performed on par with the four ML methods trained to classify feature vectors extracted from raw audio samples. On the second, more challenging dataset of 12,914 audio samples (BUZZ2) where the training and testing samples were separated from the validation samples by beehive, location, time, and bee race, all raw audio convolutional neural networks performed better than the four ML methods and a ConvNet trained to classify spectrogram images of raw audio samples obtained through Fourier analysis [4,5].
Our investigation gives an affirmative answer to the question of whether ConvNets can be used in real electronic beehive monitors that use low voltage devices such as Raspberry Pi [6] or Arduino [7]. Toward this end, we demonstrate that trained ConvNets can operate in situ on a credit card size Raspberry Pi computer. The presented methods do not depend on the exact positioning of microphones, and can be applied to any audio samples obtained from microphones placed inside or outside beehives provided that microphone propolization is controlled for. Our objective is to develop open source software tools for researchers, practitioners, and citizen scientists who want to capture and classify their audio beehive samples. To ensure the replicability of our findings and to establish performance benchmarks for interested research and citizen science communities, we have made public our two curated datasets, BUZZ1 (10,260 samples) and BUZZ2 (12,914 samples), of manually labeled audio samples (bee buzzing, cricket chirping, and ambient noise) used in this investigation [8].

Background
As an integral component of EBM, audio beehive monitoring has attracted considerable research and development effort. Bromenshenk et al. [1] designed a system for profiling acoustic signatures of honeybee colonies, analyzing these signatures, and identifying various stressors for the colonies from this analysis. A foam insulated microphone probe assembly is inserted through a hole drilled in the back of the upper super of a Langstroth hive that consists of two supers. The hole is drilled so that the probe is inserted between honeycomb frames at least four inches down from the top of the upper super, because brood frames are typically located in the two lower supers of a standard Langstroth hive. The probe's microphone is connected via a microphone amplifier to a computer with an audio card and the audio analyzing software that performs the Fast Fourier Transform (FFT) of captured wav files and then saves the processed data to text files for subsequent analysis. The frequency spectrum is processed into a running average and an overall average. The two averages are exported to text-based files and are analyzed with statistical software to associate sound spectra with acoustic variations. In an experiment, bee colonies treated with naphtha, ammonia, or toluene all produced statistically different scores with each stressor producing a unique acoustic signature. The researchers state that these associations can be detected with artificial neural networks (ANNs) but offer no experimental evidence of this claim.
Ferrari et al. [9] designed a system for monitoring swarm sounds in beehives. The system consisted of a microphone, a temperature sensor, and a humidity sensor placed in a beehive and connected to a computer in a nearby barn via underground cables. The sounds were recorded at a sample rate of 2 kHz and analyzed with Matlab. The researchers monitored three beehives for 270 h and observed that swarming was indicated by an increase of the buzzing frequency at about 110 Hz with a peak at 300 Hz when the swarm left the hive. Another finding was that a swarming period correlated with a rise in temperature from 33 °C to 35 °C with a temperature drop to 32 °C at the actual time of swarming.
Ramsey et al. [10] presented a method to analyze and identify honey bee swarming events through a computational analysis of acoustic vibrations captured with accelerometers placed on the outside walls of hives. The researchers placed two accelerometers on the outer walls of two Langstroth hives with Apis mellifera honey bees, one accelerometer per hive. A cavity was drilled in the center of the back wall of each hive. Both hives were located in close proximity to each other and approximately 10 m away from a house with a PC running a Linux distribution. The accelerometers were placed in the cavities and connected to a dual channel conditioner placed between the hives and encased in a waterproof acrylic box. The conditioner was coupled to the indoor PC with two coaxial cables through which the captured vibration samples were saved on the PC's hard disk. An ad hoc roof was placed above both hives to control for vibrations caused by rain drops. The vibration samples, each one hour long, were logged from April to June 2009. Averaged frequency spectra were computed for a frequency resolution of 20 Hz and an averaging time of 510 s. The averaged spectra were combined into one day long spectrograms which were analyzed with the Principal Component Analysis (PCA) for feature extraction. The witnessed swarming events were juxtaposed with the corresponding spectrograms and were found to exhibit a unique set of features. A subsequent minimization computer analysis suggested that swarming may be detected several days in advance. Peaks in vibrational activities were discovered at 250 Hz, 500 Hz, 750 Hz, and 2000 Hz.
Mezquida and Martinez [11] developed a distributed audio monitoring system for apiaries. The system consists of nodes and wireless sensor units with one node per apiary and one sensor unit per hive. A node is a solar-powered embedded computer. A sensor unit consists of an omnidirectional microphone and a temperature sensor. Sensor units are placed at the bottom of each hive protected by a grid to prevent propolization. The system was tested in an apiary of 15 Langstroth hives where up to 10 hives were monitored continuously by taking 8 s audio samples every hour with a sampling rate of 6250 Hz. The timestamped frequency spectra obtained with Fourier transform and temperature data were logged in a SQL database from May 2008 to April 2009. The researchers reported that sound volume and sound intensity at medium and low frequencies showed distinguishable daily patterns, especially in the spring. Sound volume did not have identifiable patterns in the winter.
Rangel and Seeley [12] also investigated audio signals of honeybee swarms. Five custom designed observation hives were sealed with glass covers. The captured video and audio data were monitored daily by human observers. By manually analyzing the captured audio samples and correlating them with the corresponding video samples, the researchers found that approximately one hour before swarm exodus, the production of piping signals gradually increased and ultimately peaked at the start of the swarm departure.
Kulyukin et al. [13] designed an algorithm for digitizing bee buzzing signals with harmonic intervals into A440 piano note sequences to obtain symbolic representations of buzzing signals over specific time periods. When viewed as a time series, such sequences correlate with other timestamped data such as estimates of bee traffic levels or temperatures [14,15]. The note range detected by the proposed algorithm on 3421.52 MB of 30-s wav files contained the first four octaves, with the lowest note being A0 and the highest note being F#4. It was observed that the peaks in the frequency counts, as the selected 24-h time period progressed, started in the first octave (D1), shifted higher to C3 and C#3 in the third octave, and returned back to the first octave at the end of the selected time period. Several notes in the fourth octave, e.g., F#4 and C#4, were also detected, but their frequency counts were substantially lower than those of the peaks in the first three octaves.
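To make the note range above concrete, the following is a minimal sketch of how a single frequency can be mapped to the nearest piano note under A440 twelve-tone equal temperament. It is not the algorithm of [13], which works on harmonic intervals in buzzing signals; the function names `piano_key` and `note_name` are hypothetical and introduced only for illustration.

```python
import math

# Note names in piano-key order, starting from key 1 (A0).
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def piano_key(freq_hz):
    """Nearest piano key number (A0 = 1, A4 = 49) for a frequency in Hz,
    assuming A4 = 440 Hz and twelve-tone equal temperament."""
    return round(49 + 12 * math.log2(freq_hz / 440.0))


def note_name(freq_hz):
    """Scientific pitch name (e.g., 'F#4') of the nearest piano key."""
    key = piano_key(freq_hz)
    name = NOTE_NAMES[(key - 1) % 12]
    octave = (key + 8) // 12  # octave numbers increase at each C
    return f"{name}{octave}"


if __name__ == "__main__":
    for freq in (27.5, 36.71, 130.81, 370.0):
        print(freq, "Hz ->", note_name(freq))
```

Under these assumptions, the lowest and highest notes reported above, A0 and F#4, correspond to roughly 27.5 Hz and 370 Hz.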
Representation learning is a branch of AI that investigates automatic acquisition of representations from raw signals for detection and classification [16]. Standard machine learning (ML) techniques are believed to be limited in their ability to acquire representations from raw data because they require considerable feature engineering to convert raw signals into feature vectors used in classification. Unlike conventional ML techniques, deep learning (DL) methods are believed to be better at acquiring multi-layer representations from raw data because real data often encode non-linear manifolds that low-order functions have a hard time modeling. In a DL architecture, each layer, starting from raw input, can be construed as a function transforming its input to the one acceptable for the next layer. Many features of these layers are learned automatically by a general purpose procedure known as backpropagation [17]. DL methods have been successfully applied to image classification [18-20], speech recognition and audio processing [21,22], music classification and analysis [23-25], environmental sound classification [26], and bioinformatics [27,28].
Convolutional neural networks (ConvNets) are a type of DL feedforward neural networks that have been shown in practice to better train and generalize than artificial neural networks (ANNs) on large quantities of digital data [29]. In ConvNets, filters of various sizes are convolved with a raw input signal to obtain a stack of filtered signals. This transformation is called a convolution layer. The size of a convolution layer is equal to the number of convolution filters. The convolved signals are typically normalized. A standard choice for signal normalization is the rectified linear unit (ReLU) that converts all negative values to 0's [30-32]. A layer of normalization units is called a normalization layer. Some ConvNet architectures (e.g., [26]) consider normalization to be part of convolution layers. The size of the output signal from a layer can be downsampled with a maxpooling layer, where a window is shifted in steps, called strides, across the input signal retaining the maximum value from each window. The fourth type of layer in a ConvNet is a fully connected (FC) layer, where the stack of processed signals is treated as a 1D vector so that each value in the output from a previous layer is connected via a synapse (a weighted connection that transmits a value from one unit to another) to each node in the next layer. In addition to the four standard layers, ConvNets can have custom layers that implement arbitrary functions to transform signals between consecutive layers.
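The three signal-level operations described above (convolution with a filter, ReLU, and maxpooling with a stride) can be illustrated with a minimal pure-Python sketch on a toy 1D signal. The signal, the single filter, and the function names are illustrative assumptions; real ConvNets learn many filters and use optimized tensor libraries.

```python
def conv1d(signal, kernel, stride=1):
    """Slide one filter across a 1D signal without padding.
    As in most ConvNet libraries, the kernel is not flipped."""
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, n - k + 1, stride)
    ]


def relu(values):
    """Rectified linear unit: replace every negative value with 0."""
    return [max(0.0, v) for v in values]


def maxpool1d(values, window, stride):
    """Keep the maximum of each window as it is shifted by `stride`."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]


# Toy input and a hand-written difference filter (both hypothetical).
signal = [1.0, -2.0, 3.0, 0.0, -1.0, 2.0, 4.0, -3.0]
kernel = [1.0, 0.0, -1.0]

feature_map = relu(conv1d(signal, kernel))
pooled = maxpool1d(feature_map, window=2, stride=2)

print(feature_map)  # [0.0, 0.0, 4.0, 0.0, 0.0, 5.0]
print(pooled)       # [0.0, 4.0, 5.0]
```

A convolution layer with m filters would repeat `conv1d` once per filter and stack the m resulting feature maps.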
Since standard and custom layers can be stacked arbitrarily deeply and in various permutations, ConvNets can model highly non-linear manifolds. The architectural features of a ConvNet that cannot be dynamically changed through backpropagation are called hyperparameters and include the number and size of filters in convolution layers, the window size and stride in maxpooling layers, and the number of neurons in FC layers. These permutations and hyperparameters distinguish one ConvNet architecture from another.
In audio processing, ConvNets have been trained to classify audio samples either by classifying raw audio or vectors of various features extracted from raw audio. For example, Piczak [26] evaluated the potential of ConvNets to classify short audio samples of environmental sounds by developing a ConvNet that consisted of two convolution layers with maxpooling and two FC layers. The ConvNet was trained on the segmented spectrograms of audio data. When evaluated on three public datasets of environmental and urban recordings, the ConvNet outperformed a baseline obtained from random forests with mel frequency cepstral coefficients (MFCCs) and zero crossing rates and performed on par with some state-of-the-art audio classification approaches. Aytar et al. [33] developed a deep ConvNet to learn directly from raw audio waveforms. The ConvNet was trained by transferring knowledge from computer vision to classify events from unlabeled videos. The representation learned by the ConvNet obtained state-of-the-art accuracy on three standard acoustic datasets. In [34], van den Oord et al. showed that generative networks can be trained to predict the next sample in an audio sequence. The proposed network consisted of 60 layers and sampled raw audio at a rate of 16 kHz to 48 kHz. Humphrey and Bello [35] designed a ConvNet to classify 5-s tiles of pitch spectra to produce an optimized chord recognition system. The ConvNet yielded state-of-the-art performance across a variety of benchmarks.

Hardware
All audio data for this investigation were captured by BeePi, a multi-sensor EBM system we designed and built in 2014 [13]. A fundamental objective of the BeePi design is reproducibility: other researchers and citizen scientists should be able to replicate our results at minimum costs and time commitments. Each BeePi monitor consists of exclusively off-the-shelf components: a Raspberry Pi computer, a miniature camera, a microphone splitter, four microphones connected to the splitter, a solar panel, a temperature sensor, a battery, and a hardware clock. We currently use Raspberry Pi 3 model B v1.2 computers, Pi T-Cobblers, breadboards, waterproof DS18B20 digital temperature sensors, and Pi cameras. Each BeePi unit is equipped with four Neewer 3.5 mm mini lapel microphones that are placed either above the landing pad or embedded in beehive walls. These microphones have a frequency range of 15-20 kHz, an omnidirectional polarity, and a signal-to-noise ratio greater than 63 dB. There are two versions of BeePi: solar and regular. In the solar version, a solar panel is placed either on top of or next to a beehive. For solar harvesting, we use Renogy 50 W 12 V monocrystalline solar panels, Renogy 10 A PWM solar charge controllers, and Renogy 10 ft 10 AWG solar adaptor kits. The regular version operates either on the grid or on a rechargeable battery. For power storage, we use two types of rechargeable batteries: the UPG 12 V 12 Ah F2 lead acid AGM deep cycle battery and the Anker Astro E7 26,800 mAh battery. All hardware components, except for solar panels, are placed in a Langstroth super. Interested readers are referred to [36] for BeePi hardware design descriptions, pictures, and assembly videos.
We have been iteratively improving the BeePi hardware and software since 2014. BeePi monitors have so far had five field deployments. The first deployment was in Logan, UT, USA (September 2014) when a BeePi monitor was placed into an empty hive and ran exclusively on solar power for two weeks. The second deployment was in Garland, UT, USA (December 2014-January 2015) when a BeePi monitor was placed in a hive with overwintering honey bees, operated successfully for nine out of the fourteen days of deployment exclusively on solar power, and captured about 200 MB of data. The third deployment was in North Logan, UT, USA (April 2016-November 2016) where four BeePi monitors were placed into four beehives at two small apiaries and collected 20 GB of data. The fourth deployment was in both Logan and North Logan, UT, USA (April 2017-September 2017) when four BeePi units were again placed into four beehives at two small apiaries to collect 220 GB of audio, video, and temperature data [15]. The fifth deployment started in Logan, UT, USA in May 2018 with four BeePi monitors placed in four new beehives freshly initiated with Carniolan honeybee colonies. As of July 2018, we had collected 66.04 GB of data.

Audio Data
The audio data for this investigation were taken from the datasets captured by six BeePi monitors. Two BeePi monitors were deployed in Logan, UT, USA (41. [13], we reported on some advantages and disadvantages of embedding microphones into hive walls and insulating them against propolization with small metallic mesh nets. In this study, the microphones were placed approximately 10 cm above the landing pad with two microphones on each side (see Figure 1). Each monitor saved a 30-s audio wav file every 15 min recorded with the four microphones above the beehives' landing pads and connected to a Raspberry Pi computer. We experimentally found these two parameters (30 s every 15 min) to be a reasonable compromise between the continuity of monitoring and the storage capacity of a BeePi monitor. In a BeePi monitor, all captured data are saved locally either on the Raspberry Pi's sdcard or on a USB storage device connected to the Raspberry Pi. No cloud computing facilities are used for data storage or analysis. Each 30-s audio sample was segmented into 2-s wav samples with a 1-s overlap, resulting in 28 2-s wav samples per one 30-s audio file. Data collection software is written in Python 2.7 [37].
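The segmentation step described above (2-s windows with a 1-s overlap) can be sketched as follows. This is a minimal illustration, not the BeePi data collection code; the function names and the 1 kHz sampling rate in the usage example are assumptions chosen only to keep the numbers small. The window count depends on the exact clip length and on how a final partial window is handled: with this idealized rule a clip of exactly 30 s yields 29 full windows, so the 28 reported above presumably reflects details of the recordings or tooling that this sketch does not model.

```python
def segment_starts(num_samples, sample_rate, win_s=2.0, hop_s=1.0):
    """Start indices of all full windows of `win_s` seconds taken every
    `hop_s` seconds (a 2-s window with a 1-s hop gives a 1-s overlap)."""
    win = int(win_s * sample_rate)
    hop = int(hop_s * sample_rate)
    return list(range(0, num_samples - win + 1, hop))


def segment(samples, sample_rate, win_s=2.0, hop_s=1.0):
    """Split a sequence of audio samples into overlapping windows."""
    win = int(win_s * sample_rate)
    starts = segment_starts(len(samples), sample_rate, win_s, hop_s)
    return [samples[s:s + win] for s in starts]


if __name__ == "__main__":
    rate = 1000                 # hypothetical sampling rate in Hz
    clip = [0.0] * (30 * rate)  # a silent clip of exactly 30 s
    windows = segment(clip, rate)
    print(len(windows), "windows of", len(windows[0]), "samples each")
```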
We obtained the ground truth classification by manually labeling 9110 2-s audio samples captured with four BeePi monitors in two apiaries (one in Logan, UT, USA and one in North Logan, UT, USA) in May and June 2017. The two apiaries were approximately seventeen kilometers apart. The first apiary (apiary 1) was in a more urban environment in Logan, UT, USA. In this apiary, the first monitored beehive (beehive 1.1) was placed next to a garage with a power generator. The second monitored beehive (beehive 1.2) was placed approximately twenty meters west of beehive 1.1 behind a row of aspens and firs where the generator's sound was less pronounced. However, beehive 1.2 was located closer to a large parking lot. Hence, some audio samples from beehive 1.2 contained sounds of car engines and horns. Some samples from beehives 1.1 and 1.2 contained sounds of fire engine or ambulance sirens. In May 2018, two more Langstroth beehives (beehive 1.3 and beehive 1.4) were added to apiary 1 to obtain the validation data for BUZZ2 (see below for details). The second apiary (apiary 2) was in a more rural area in North Logan, UT, USA, seventeen kilometers north of apiary 1. In this apiary, we collected the data from two monitored beehives located approximately fifteen meters apart. Each beehive was located next to an unmonitored beehive. The first monitored beehive in apiary 2 (beehive 2.1) was placed close to a big lawn. The lawn was regularly mowed by the property owner and watered with an automated sprinkler system. The second monitored beehive in apiary 2 (beehive 2.2) was located fifteen meters west of beehive 2.1, deeper in the backyard where ambient noise (sprinklers, mower's engine, children playing on the lawn, human conversation) was barely audible. The validation dataset of BUZZ2 used for model selection was recorded in May 2018 from two Langstroth beehives (1.3 and 1.4) in apiary 1.
We listened to each sample and placed it into one of the three non-overlapping categories: bee buzzing (B), cricket chirping (C), and ambient noise (N). The B category consisted of the samples where at least two of us could hear bee buzzing. The C category consisted of the audio files captured at night where at least two of us could hear the chirping of crickets and no bee buzzing. The N category included all samples where none of us could clearly hear either bee buzzing or cricket chirping. These were the samples with sounds of static microphone noise, thunder, wind, rain, vehicles, human conversation, sprinklers, and relative silence, i.e., absence of any sounds discernible to a human ear. We chose these categories for the following reasons. The B category is critical in profiling acoustic signatures of honeybee colonies to identify various colony stressors. The C category is fundamental in using audio classification as a logical clock because cricket chirping is cyclical, and, in Northern Utah, is present only from 11:00 p.m. until 6:00 a.m. The N category helps EBM systems to identify uniquely individual beehives because ambient noise varies greatly from location to location and from beehive to beehive at the same location.
The first labeled dataset, which we designated to be our train/test dataset, included 9110 samples from beehives 1.1 and 2.1: 3000 B samples, 3000 C samples, and 3110 N samples. We later labeled another dataset, which we designated to be our validation dataset, of 1150 audio samples (300 B samples, 350 N samples, and 500 C samples) captured from beehives 1.2 and 2.2 in May and June 2017. Thus, the audio samples in the validation dataset were separated from the audio samples in the training and testing dataset by beehive and location. In summary, our first curated dataset, which we called BUZZ1 [8], consisted of 10,260 audio samples: a train/test dataset of 9110 samples that we used in our train/test experiments and a validation dataset of 1150 audio samples that we used in our validation experiments for model selection.
In BUZZ1, the degree of data overlap between the training and testing datasets in terms of the exact beehives being used is controlled for by the fact that audio samples from different beehives differ from each other even when the hives are located in the same apiary. To better control for data overlap, we created another curated dataset, BUZZ2 [8], of 12,914 audio samples where we completely isolated the training data from the testing data in terms of beehive and location by taking 7582 labeled samples (76.48%) for training from beehive 1.1 in apiary 1 and 2332 labeled samples (23.52%) for testing from beehive 2.1 in apiary 2. The training set of 7582 samples consisted of 2402 B samples, 3000 C samples, and 2180 N samples. The testing set of 2332 samples consisted of 898 B samples, 500 C samples, and 934 N samples. The validation dataset of BUZZ2 used for model selection was recorded in May 2018 from two Langstroth beehives (1.3 and 1.4) in apiary 1. These beehives did not exist in apiary 1 in 2017. Both were freshly initiated with Carniolan honeybee colonies and started being monitored in May 2018. Thus, in BUZZ2, the training and testing beehives were separated from each other by beehive and location, while the validation beehives were separated from the train/test beehives by beehive, location, time (2017 vs. 2018), and bee race (Italian vs. Carniolan). The validation dataset included 1000 B samples, 1000 C samples, and 1000 N samples.
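The dataset sizes and split proportions given above can be cross-checked with a few lines of arithmetic. The sketch below uses only the per-class counts stated in the text; the dictionary layout and variable names are illustrative assumptions and do not reflect how the published datasets are organized on disk.

```python
# Per-class sample counts as stated in the text
# (B = bee buzzing, C = cricket chirping, N = ambient noise).
buzz1 = {
    "train_test": {"B": 3000, "C": 3000, "N": 3110},
    "validation": {"B": 300, "C": 500, "N": 350},
}
buzz2 = {
    "train": {"B": 2402, "C": 3000, "N": 2180},
    "test": {"B": 898, "C": 500, "N": 934},
    "validation": {"B": 1000, "C": 1000, "N": 1000},
}


def split_sizes(dataset):
    """Total number of samples in each split of a dataset."""
    return {name: sum(counts.values()) for name, counts in dataset.items()}


buzz1_sizes = split_sizes(buzz1)
buzz2_sizes = split_sizes(buzz2)

# Share of the BUZZ2 train/test pool that goes to each split.
train_test_total = buzz2_sizes["train"] + buzz2_sizes["test"]
train_share = 100 * buzz2_sizes["train"] / train_test_total
test_share = 100 * buzz2_sizes["test"] / train_test_total

print(buzz1_sizes, sum(buzz1_sizes.values()))
print(buzz2_sizes, sum(buzz2_sizes.values()))
print(f"BUZZ2 train share: {train_share:.2f}%, test share: {test_share:.2f}%")
```

The totals come to 10,260 samples for BUZZ1 and 12,914 for BUZZ2, and the BUZZ2 training and testing shares work out to about 76.48% and 23.52% of the 9914 train/test samples.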

Deep Learning
To obtain a baseline for our ConvNets that classify raw audio, we first developed a ConvNet architecture, called SpectConvNet, to classify spectrogram images of raw audio samples. In audio processing, a spectrogram represents an audio spectrum where time is represented along the x-axis and frequency along the y-axis and the strength of a frequency at each time frame is represented by color or brightness. Such spectrogram representations can be treated as RGB images. For example, Figures 2-4 show spectrogram images of bee buzzing, cricket chirping, and ambient noise, respectively, computed from three audio samples from BUZZ1. The spectrograms were computed by using the specgram function from the Python matplotlib.pyplot package. This function splits an audio sample into NFFT = 512 length segments and the spectrum of each segment is computed. The number of overlap points between each segment was set to 384. The window size was 2048 and the hop size was 2048/4 = 512. The scaling frequency was the same as the input audio sample's frequency.
The windowed version of each segment was obtained with the default Hanning window function. Each spectrogram was converted to a 100 × 100 image with a dpi of 300. Thus, the time and frequency domains were treated equally, and each was mapped to the range of [0, 99], with each pixel value approximating a frequency value at a particular point in time.
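To give a sense of the size of the time-frequency grid before it is rendered as a 100 × 100 image, the sketch below computes the number of frequency bins and time frames for a one-sided spectrogram with NFFT = 512 and an overlap of 384 points, using the standard no-padding framing rule. The function name and the 44.1 kHz sampling rate in the example are assumptions; the sampling rate of the spectrogram input is not stated in this section.

```python
def spectrogram_shape(num_samples, nfft=512, noverlap=384):
    """(frequency bins, time frames) of a one-sided spectrogram computed
    from full, unpadded segments of length `nfft` that overlap by
    `noverlap` points."""
    hop = nfft - noverlap      # 128 points between segment starts
    freq_bins = nfft // 2 + 1  # one-sided spectrum of a real signal
    if num_samples < nfft:
        return freq_bins, 0
    frames = (num_samples - noverlap) // hop
    return freq_bins, frames


if __name__ == "__main__":
    sample_rate = 44100  # hypothetical sampling rate in Hz
    bins, frames = spectrogram_shape(2 * sample_rate)
    print(f"{bins} frequency bins x {frames} time frames")
```

Under these assumptions a 2-s sample gives a 257 × 686 grid, so rendering it at 100 × 100 pixels compresses both axes considerably.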
The components of SpectConvNet are shown in Figure 5. Figure 6 shows a layer by layer control flow in SpectConvNet. Table 1 gives a detailed description of each layer of SpectConvNet. We trained SpectConvNet to classify 100 × 100 RGB spectrogram images such as the ones shown in Figures 2-4. We chose the size of 100 × 100 experimentally as a compromise between accuracy and performance. The use of square filters in SpectConvNet's layers was motivated by standard best practices in ConvNets trained to classify images (e.g., [18,20]). SpectConvNet has three convolution layers with ReLUs. In layer 1, the input is convolved using 100 3 × 3 filters (shown in red in Figure 6) with a stride of 1. The resultant 100 × 100 × 100 feature map is shown in layer 2 in red. The features are maxpooled with a kernel size of 2 resulting in a 50 × 50 × 100 feature map. The resultant feature map is processed with 200 3 × 3 filters (shown in blue in layer 2) with a stride of 1. The output is a 50 × 50 × 200 feature map (shown in blue in layer 3). In layer 3, another convolution is performed using 200 3 × 3 filters (shown in green) with a stride of 1. The result is a 50 × 50 × 200 feature map (shown in green in layer 4). The features are maxpooled with a kernel size of 2 resulting in a 25 × 25 × 200 feature map in layer 4. In layer 5, the output is passed through a 50-unit FC layer. In layer 6, a dropout [38] is applied with a keep probability of 0.5 to avoid overfitting and reduce complex co-adaptations of neurons [18]. Finally, the signal is passed through a 3-way softmax function that classifies it as bee buzzing (B), cricket chirping (C), or ambient noise (N).
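The feature map sizes quoted above can be traced with a short shape calculation. This is a sketch under two assumptions that the text implies but does not state outright: the convolutions use "same" padding (a 3 × 3 convolution with a stride of 1 leaves the 100 × 100 spatial size unchanged), and each maxpooling layer uses a stride equal to its kernel size. The helper names are hypothetical, and no weights or training are involved.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def conv2d_same(shape, filters, stride=1):
    """Output shape of a 'same'-padded 2D convolution (assumed padding)."""
    height, width, _channels = shape
    return (ceil_div(height, stride), ceil_div(width, stride), filters)


def maxpool2d(shape, kernel):
    """Output shape of maxpooling with stride equal to the kernel size."""
    height, width, channels = shape
    return (ceil_div(height, kernel), ceil_div(width, kernel), channels)


def spectconvnet_shapes(input_shape=(100, 100, 3)):
    """Layer-by-layer output shapes following the description in the text."""
    shapes = [("input", input_shape)]
    shape = conv2d_same(input_shape, filters=100)
    shapes.append(("conv1 + ReLU", shape))
    shape = maxpool2d(shape, kernel=2)
    shapes.append(("maxpool1", shape))
    shape = conv2d_same(shape, filters=200)
    shapes.append(("conv2 + ReLU", shape))
    shape = conv2d_same(shape, filters=200)
    shapes.append(("conv3 + ReLU", shape))
    shape = maxpool2d(shape, kernel=2)
    shapes.append(("maxpool2", shape))
    shapes.append(("flatten", (shape[0] * shape[1] * shape[2],)))
    shapes.append(("fc + dropout", (50,)))
    shapes.append(("softmax", (3,)))
    return shapes


if __name__ == "__main__":
    for name, shape in spectconvnet_shapes():
        print(f"{name:14s} {shape}")
```

The flattened 25 × 25 × 200 map has 125,000 values, so the 50-unit FC layer is by far the largest block of weights in this architecture.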
After designing SpectConvNet, we followed some design suggestions from [26,35] to design a second ConvNet architecture, henceforth referred to as RawConvNet, to classify raw audio waveforms (see Figure 7). The layer by layer control flow on RawConvNet is shown in Figure 8. Table 2 gives a detailed description of each layer and its parameters.

Layers Specification
Layer 1: Conv-2D; filters = 100, filterSize = 3, strides = 1, activation = relu, bias = True, biasInit = zeros, weightsInit = uniform scaling, regularizer = none, weightDecay = 0.001
Layer 2: Maxpool-2D; kernelSize = 2, strides = none

RawConvNet consists of two convolution layers, both of which have 256 filters with ReLUs. The weights are randomly initialized using the Xavier initialization described in [39] to keep the scale of the gradients roughly the same in both layers. Each convolution layer has L2 regularization and a weight decay of 0.0001. The first convolution layer has a stride of 4, whereas the second layer has a stride of 1. A larger stride is chosen in layer 1 to lower the computation costs.
The input signal to RawConvNet is a raw audio wav file downsampled to 12 kHz and normalized to have a mean of 0 and a variance of 1. The raw audio file is passed as a 20,000 × 1 tensor to layer 1 (see Figure 8). In layer 1, the 1D tensor is convolved with 256 n × 1 filters (shown in red in layer 1) with a stride of 4 horizontally, where n ∈ {3, 10, 30, 80, 100}, and activated with ReLUs. We introduced the n parameter so that we could experiment with various receptive field sizes to discover their impact on classification, which we discuss in detail in Section 4 below. The resultant 5000 × 256 feature map is shown in layer 2 in red. Batch normalization [40] is applied to the resultant signal to reduce the internal covariate shift and the signal is maxpooled with a kernel size of 4, resulting in a 1250 × 256 feature map. The signal is then convolved with 256 3 × 1 filters (shown in blue in layer 2) with a stride of 1 horizontally. The output is a 1250 × 256 feature map (shown in blue in layer 3). Batch normalization is applied again to the resultant signal and the signal is maxpooled with a kernel size of 4, resulting in a 313 × 256 feature map. In layer 4, we add a custom layer that calculates the mean of the resultant tensor of feature maps along the first axis. To put it differently, the custom layer is designed to calculate the global average over each feature map, thus reducing each feature map tensor to a single real number. The custom layer results in a 256 × 1 tensor (shown in magenta in layer 4). Finally, in layer 5, the output is passed through a 3-way softmax function that classifies it as B (bee buzzing), C (cricket chirping), or N (noise).
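The sequence of tensor lengths in RawConvNet, together with the custom averaging layer and the final softmax, can be sketched in plain Python. The sketch assumes "same" padding, under which the output length is the input length divided by the stride and rounded up; this reproduces the 5000, 1250, and 313 values in the text and also means that the filter width n does not change the output length. All function names are hypothetical, and the toy inputs at the end are made up for illustration.

```python
import math


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def rawconvnet_lengths(input_len=20000):
    """Length of the time axis after each layer ('same' padding assumed)."""
    lengths = [("input", input_len)]
    n = ceil_div(input_len, 4)  # convolution 1, stride 4
    lengths.append(("conv1", n))
    n = ceil_div(n, 4)          # maxpooling, kernel size 4
    lengths.append(("maxpool1", n))
    n = ceil_div(n, 1)          # convolution 2, stride 1
    lengths.append(("conv2", n))
    n = ceil_div(n, 4)          # maxpooling, kernel size 4
    lengths.append(("maxpool2", n))
    return lengths


def global_average(feature_maps):
    """Custom-layer sketch: mean along the time axis, giving one value per
    filter. `feature_maps` is a list of time steps, each a list of values."""
    steps = len(feature_maps)
    channels = len(feature_maps[0])
    return [sum(row[c] for row in feature_maps) / steps for c in range(channels)]


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(rawconvnet_lengths())
    print(global_average([[1.0, 2.0], [3.0, 4.0]]))  # two time steps, two filters
    print(softmax([2.0, 0.5, -1.0]))                 # hypothetical B, C, N scores
```

Because the averaging layer collapses the 313 × 256 map to 256 values, the classifier at the end of the network needs far fewer weights than an FC layer applied to the full feature map would.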

Standard Machine Learning
We used the following standard ML methods as comparison benchmarks for ConvNets: (M1) logistic regression [41]; (M2) k-nearest neighbors (KNN) [42]; (M3) support vector machine with a linear kernel one vs. rest (SVM OVR) classification [43]; and (M4) random forests [44]. These methods are supervised classification models that are trained on labeled data to classify new observations into one of the predefined classes.
The M1 model (logistic regression) is used for predicting dependent categorical variables that belong to one of a limited number of categories and assigns a class with the highest probability for each input sample. The M2 model (KNN) is a supervised learning model that makes no a priori assumptions about the probability distribution of feature vector values. The training samples are converted into feature vectors in the feature vector space with their class labels. Given an input sample, M2 computes the distances between the feature vector of an input sample and each feature vector in the feature vector space, and the input sample is classified by a majority vote of the classes of its k nearest neighbors, where k is a parameter. The M3 model (SVM OVR) builds a linear binary classifier for each class where n-dimensional feature vectors are separated by a hyperplane boundary into positive and negative samples, where negative samples are samples from other classes. Given an input sample, each classifier produces a positive probability of the input sample's feature vector belonging to its class. The input sample is classified with the class of the classifier that produces the maximum positive probability. The M4 model (random forests) is an ensemble of decision trees where each decision tree is used to classify an input sample, and a majority vote is taken to predict the best class of the input sample.
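The decision rules of M2 and M3 described above can be sketched in a few lines. This is an illustrative sketch only: it assumes Euclidean distance for KNN (the distance metric is not stated above), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    Euclidean distance is an assumption; other metrics are possible.
    """
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def one_vs_rest_classify(class_scores):
    """Pick the class whose binary classifier reports the highest positive score."""
    return max(class_scores, key=class_scores.get)

# Toy 2D feature vectors for two classes (hypothetical data).
train_vectors = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["B", "B", "B", "C", "C", "C"]
print(knn_classify(train_vectors, train_labels, (0.2, 0.1)))
print(one_vs_rest_classify({"B": 0.2, "C": 0.7, "N": 0.1}))
```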
We trained all four models on the same feature vectors automatically extracted from the raw audio files in BUZZ1. We used the following features: (F1) mel frequency cepstral coefficients (MFCC) [45]; (F2) chroma short term Fourier transform (STFT) [5]; (F3) melspectrogram [46]; (F4) STFT spectral contrast [47]; and (F5) tonnetz [48], a lattice representing the planar representations of pitch relations. Our methodology in using these features was based on the bag-of-features approach used in musical information retrieval [49]. Our objective was to create redundant feature vectors with as many potentially useful audio features as possible. For example, our inclusion of chroma and tonnetz features was based on our hypothesis that bee buzzing can be construed as musical melodies and mapped to note sequences [13]. Our inclusion of MFCC features, frequently used in speech processing, was motivated by the fact that some ambient sounds, especially around beehives in more urban environments, include human speech. Consequently, MFCCs, at least in theory, should be useful in classifying ambient sounds of human speech as noise [50]. Each raw audio sample was turned into a feature vector of 193 elements: 40 MFCCs, 12 chroma coefficients, 128 melspectrogram coefficients, seven spectral contrast coefficients, and six tonnetz coefficients.
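The layout of the 193-element feature vector can be made concrete with a short sketch. It assumes that the five feature blocks are concatenated in the order in which they are listed above; the block names and helper functions are hypothetical and are not taken from the authors' source code.

```python
# (name, number of coefficients) in the order listed in the text (assumed concatenation order).
FEATURE_BLOCKS = [
    ("mfcc", 40),
    ("chroma", 12),
    ("melspectrogram", 128),
    ("spectral_contrast", 7),
    ("tonnetz", 6),
]

def feature_slices(blocks=FEATURE_BLOCKS):
    """Return ({name: (start, end)}, total_length) for the concatenated vector."""
    slices = {}
    start = 0
    for name, size in blocks:
        slices[name] = (start, start + size)
        start += size
    return slices, start

def split_feature_vector(vector, blocks=FEATURE_BLOCKS):
    """Split a concatenated feature vector back into its named blocks."""
    slices, total = feature_slices(blocks)
    if len(vector) != total:
        raise ValueError("expected %d elements, got %d" % (total, len(vector)))
    return {name: vector[a:b] for name, (a, b) in slices.items()}

slices, total = feature_slices()
print(total)    # 40 + 12 + 128 + 7 + 6
print(slices)
```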
The structure of each feature vector is shown in Equation (1):

$$V = \big[\underbrace{v_{1}, \ldots, v_{40}}_{\text{MFCC}},\ \underbrace{v_{41}, \ldots, v_{52}}_{\text{chroma}},\ \underbrace{v_{53}, \ldots, v_{180}}_{\text{melspectrogram}},\ \underbrace{v_{181}, \ldots, v_{187}}_{\text{spectral contrast}},\ \underbrace{v_{188}, \ldots, v_{193}}_{\text{tonnetz}}\big] \tag{1}$$

To estimate the impact of feature scaling, we experimented with the following feature scaling techniques on the feature vectors: (S1) no scaling; (S2) standard scaling, when the mean and the standard deviation of the feature values in each feature vector are set to 0 and 1, respectively; (S3) min/max scaling, when the feature values are set to [0, 1) to minimize the effect of outliers; (S4) L1 normalization, when each feature vector is rescaled so that the sum of the absolute values of its elements is 1 [51]; (S5) L2 normalization, when each feature vector is rescaled so that the sum of the squares of its elements is 1 [51].
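The scaling options S2 to S5 can be illustrated on a single vector of values. The sketch below is ours and makes two assumptions: S4 and S5 are interpreted as L1 and L2 normalization of the vector, and each function operates on one list of values (whether the original experiments scaled per sample or per feature column is not specified here). The function names are hypothetical.

```python
import math

def standard_scale(values):
    """S2: shift and rescale to mean 0 and standard deviation 1 (non-constant input assumed)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """S3: map the smallest value to 0 and the largest to 1 (non-constant input assumed)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def l1_normalize(values):
    """S4 (as interpreted here): divide by the sum of absolute values."""
    norm = sum(abs(v) for v in values)
    return [v / norm for v in values]

def l2_normalize(values):
    """S5 (as interpreted here): divide by the Euclidean length."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

sample = [1.0, 2.0, 3.0, 4.0]
for scale in (standard_scale, min_max_scale, l1_normalize, l2_normalize):
    print(scale.__name__, scale(sample))
```

Note that L1 and L2 normalization discard the overall magnitude of a vector, which is one plausible reason why they could hurt classifiers that rely on it.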

DL on BUZZ1
In BUZZ1, the train/test dataset of 9110 labeled samples was obtained from beehives 1.1 and 2.1. The validation dataset of 1150 samples was obtained from beehives 1.2 and 2.2. In the experiments described in Section 3, the train/test dataset was repeatedly split into a training set (70%) and a testing set (30%) with the train_test_split procedure from the Python sklearn.model_selection library [52]. The ConvNets were implemented in Python 2.7 with tflearn [53] and were trained on the train/test dataset of BUZZ1 with the Adam optimizer [54] on an Intel Core i7-4770@3.40 GHz × 8 processor with 15.5 GiB of RAM and 64-bit Ubuntu 14.04 LTS. The Adam optimizer is a variant of stochastic gradient descent that can handle sparse gradients on noisy problems. Both ConvNets used categorical crossentropy as their cost function. Categorical crossentropy measures the probability error in classification tasks with mutually exclusive classes. The best learning rate (η) for SpectConvNet was found to be 0.001 and the best learning rate for RawConvNet was found to be 0.0001.
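For readers unfamiliar with the cost function, the sketch below shows how categorical crossentropy is computed for one sample from a 3-way softmax output. It is a plain-Python illustration written for this explanation, not the tflearn implementation; the function names and the small clipping constant eps are our own choices.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(probabilities, one_hot_target, eps=1e-12):
    """Negative log-probability assigned to the true class (eps avoids log(0))."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probabilities, one_hot_target))

# Classes in the order B (bee), C (cricket), N (noise); the true class here is B.
target = [1, 0, 0]
confident = softmax([4.0, 0.5, 0.1])
uncertain = softmax([0.0, 0.0, 0.0])
print(categorical_crossentropy(confident, target))
print(categorical_crossentropy(uncertain, target))
```

The loss is small when the network assigns a high probability to the correct class and equals ln 3 when the three classes are judged equally likely.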
Each ConvNet was trained for 100 epochs with a batch size of 128 (experimentally found to be optimal) and a 70-30 train/test split of the BUZZ1 train/test dataset of 9110 audio samples, obtained with the train_test_split procedure from the Python sklearn.model_selection library [52]. Table 3 gives the performance results comparing SpectConvNet with RawConvNet with different sizes of the receptive field. The ConvNet hyperparameters are given in Appendix A. RawConvNet's testing accuracy generally increased with the size of the receptive field. The accuracy was lowest (98.87%) when the receptive field size was 3 and highest (99.93%) when it was 80, with a slight drop at 100. With a receptive field size of 80, RawConvNet classified raw audio samples with a testing accuracy of 99.93% and a testing loss of approximately 0.004. RawConvNet slightly outperformed SpectConvNet in terms of a higher testing accuracy (99.93% vs. 99.13%), a lower testing loss (0.00432 vs. 0.00702), and a much lower runtime per epoch (462 s vs. 690 s). As the size of the receptive field increased in RawConvNet, more information was likely learned in the first convolution layer, which had a positive impact on its performance. The runtime per epoch was longest when the receptive field size was smallest. Figures S1-S5 of the Supplementary Materials give the graphs of the loss curves of RawConvNet for each receptive field size n ∈ {3, 10, 30, 80, 100} on the BUZZ1 training and testing dataset. The loss curves gradually decrease with training. The shapes of the training and testing loss curves are almost identical, which indicates that there is no overfitting.

To estimate the contribution of the custom layer in RawConvNet, we designed three deeper ConvNets (ConvNet 1, ConvNet 2, ConvNet 3) without the custom layer and trained them to classify raw audio samples. In all three ConvNets, the receptive field size of layer 1 was set to 80, as in RawConvNet, and the learning rate η was set to 0.0001.
In each ConvNet, the custom layer is replaced with various combinations of FC and convolution layers. In ConvNet 1 (see Figure 11), layers 1, 2, 3, and 5 are identical to RawConvNet (see Figure 8) and layer 4 is replaced with an FC layer with 256 neurons. In ConvNet 2 (see Figure 12), layers 1 and 2 are identical to layers 1 and 2 of RawConvNet but in layer 3, the output of maxpooling is convolved with 256 filters with a filter size of 3 and a stride of 1. Batch normalization is then performed on the output and the resultant feature map is passed to an FC layer with 256 units. A dropout of 0.5 is applied to the output of layer 5 before passing it to the FC softmax layer. Layer 6 is identical to the last layer in RawConvNet and ConvNet 1 and consists of an FC layer with 3 neurons that correspond to the B (bee buzzing), C (cricket chirping), and N (noise) classes with the softmax function. In ConvNet 3 (see Figure 13), layers 1, 2, and 3 are identical to layers 1, 2, and 3 of ConvNet 2. In layer 4, the output of layer 3 is additionally convolved with 256 filters with a filter size of 3 and a stride of 1. In layer 5, batch normalization is performed on the output of layer 4 before passing it to layer 6, which consists of an FC layer with 256 neurons with a dropout of 0.5. Layer 7 is the same as the last layer in ConvNets 1 and 2.

Table 4 gives the results of estimating the contribution of the custom layer by comparing RawConvNet (see Figure 8) with ConvNet 1 (see Figure 11), ConvNet 2 (see Figure 12), and ConvNet 3 (see Figure 13). All four networks were trained for 100 epochs with a 70-30 train/test split and a batch size of 128. The receptive field size was fixed to 80 in the first convolution layer for all ConvNets during training. The accuracies of ConvNets 1, 2, and 3 were slightly lower than the accuracy of RawConvNet. ConvNets 1, 2, and 3 also showed higher losses.
We also analyzed the contribution of the custom layer in RawConvNet (see Figures 7 and 8, Table 2) by generating the plots of the gradient weight distributions in the final FC layers of RawConvNet, ConvNet 1, ConvNet 2, and ConvNet 3 with TensorBoard [55]. The plots indicate that, unlike RawConvNet, ConvNets 1, 2, and 3 did not learn much in their final FC layers. The gradient weight distribution plots are given in Section S2 of the Supplementary Materials. The gap between the training and testing accuracies of ConvNet 2 (68.07% vs. 98.21%) and ConvNet 3 (66.81% vs. 98.02%) is noteworthy. This gap can be caused by underfitting, in which case more epochs are needed to increase the training accuracy of ConvNets 2 and 3. It may also be possible that both ConvNets 2 and 3 reached a local optimum and did not leave it until the end of training. Finally, the gap can be explained by the fact that the training data were harder to classify than the testing data, which, in principle, can occur because the train_test_split procedure from the Python sklearn.model_selection library [52] assigns samples to the training and testing datasets randomly.

ML on BUZZ1
To compare the performance of ConvNets with standard ML models, we trained the four standard ML models on the same Intel Core i7-4770@3.40 GHz × 8 computer with 15.5 GiB of RAM and 64-bit Ubuntu 14.04 LTS with a 60-40 train/test split of the BUZZ1 train/test dataset and with k-fold cross validation. We did not use k-fold cross validation with the DL methods on BUZZ1 (see Section 4.1) for practical reasons: it would take too long to train and test ConvNets on each separate fold on our hardware. In all four ML models, the optimal parameters were discovered through the grid search method [56] where, given a set of candidate values for each parameter, an exhaustive search over all possible combinations is used to obtain the best set of parameters.
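The two procedures mentioned above, k-fold cross validation and grid search, can be sketched without any ML library. The code below is a simplified illustration under our own assumptions (contiguous, unshuffled folds and a user-supplied scoring function); the function names, the parameter grid, and the toy scoring function are hypothetical and do not reproduce the authors' setup.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds of nearly equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def grid_search(param_grid, score_fn):
    """Score every combination of parameter values and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for "mean accuracy over the k folds" (hypothetical).
def toy_score(params):
    return -(params["k"] - 3) ** 2 - (params["weight"] - 1.0) ** 2

print(grid_search({"k": [1, 3, 5], "weight": [0.1, 1.0]}, toy_score))
for train, test in k_fold_indices(10, 3):
    print(len(train), len(test))
```

In practice, score_fn would train a model with the given parameters on each training fold and return its average accuracy on the held-out folds.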
We used three metrics to evaluate the performance of the ML models: classification accuracy, confusion matrix, and receiver operating characteristic (ROC) curve. The classification accuracy is the percentage of predictions that are correct. A confusion matrix summarizes the performance of a model on a dataset in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). An ROC curve is a graph that summarizes a classifier's performance over all possible thresholds. ROC curves are generated by plotting true positive rates on the y-axis against false positive rates on the x-axis.

Table 5 shows the accuracy results of the ML method M1 (logistic regression) with two testing procedures (i.e., a 60-40 train/test split with 5466 training samples and 3644 testing samples, and k-fold cross validation) and five feature scaling techniques (i.e., S1, S2, S3, S4, and S5). In both test types, the scaling techniques S1, S2, and S3 resulted in an accuracy above 99%, whereas S4 and S5 caused the accuracy to drop to the low 90% range.

Table 6 shows the confusion matrix that corresponds to Test 1 (row 1) in Table 5, i.e., train/test split with no feature scaling (S1). Since the confusion matrices for Tests 2 and 3 (rows 2 and 3 in Table 5, respectively) were almost identical to the confusion matrix in Table 6, we did not include them in the article to avoid clutter. Table 7 shows the confusion matrix that corresponds to Test 4 (row 4 in Table 5) with train/test split and L1 feature scaling (S4). Since the confusion matrix for Test 5 was almost identical to the confusion matrix for Test 4, we did not include it in the article. Regardless of the test type (train/test vs. k-fold), the S1, S2, and S3 feature scaling methods performed better than the S4 and S5 methods.
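The three metrics can be related to one another with a small sketch. It builds a 3-class confusion matrix (rows are true classes, columns are predicted classes, a convention we assume here), derives the accuracy from it, and computes the true and false positive rates for one class treated as positive against the rest; an ROC curve is the set of such rate pairs obtained as the decision threshold varies. The labels below are toy data, not results from the article.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def one_vs_rest_rates(matrix, class_index):
    """Return (true positive rate, false positive rate) for one class vs. the rest."""
    tp = matrix[class_index][class_index]
    fn = sum(matrix[class_index]) - tp
    fp = sum(row[class_index] for row in matrix) - tp
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)

classes = ["B", "C", "N"]
true_labels = ["B", "B", "C", "C", "N", "N"]
predicted = ["B", "B", "C", "N", "N", "B"]
matrix = confusion_matrix(true_labels, predicted, classes)
print(matrix)
print(accuracy(matrix))
print(one_vs_rest_rates(matrix, classes.index("B")))
```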
Since S1 (no scaling) resulted in basically the same accuracy as S2 and S3, feature scaling did not seem to contribute much to improving the classification accuracy of logistic regression.

Table 8 shows the accuracy results of the M2 classification method (KNN) with the same two testing procedures and the same five feature scaling methods. The feature scaling methods had almost no impact on the KNN classification accuracy in that the accuracy in all tests was above 99%. Table 9 gives the accuracy results for the M3 classification method (SVM OVR). The SVM OVR classification achieved the best performance when coupled with the S1, S2, and S3 feature scaling methods, and was negatively affected by the S4 and S5 feature scaling methods in the same way as logistic regression. The ROC curves for SVM OVR are presented in Figure S10 in the Supplementary Materials. Table 10 gives the accuracy results for the M4 classification method (random forests) with the same two testing procedures. No feature scaling was done because decision tree algorithms are not affected by feature scaling. Decision tree nodes partition data into two sets by comparing each feature to a specific threshold, which is not affected by different scales.
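The claim that threshold-based splits are unaffected by feature scaling can be illustrated with a single decision stump. The sketch below is our own toy example, not part of the authors' experiments: it searches for the threshold on one feature that minimizes misclassifications and shows that the resulting partition of the samples is the same before and after a positive linear rescaling of the feature. The function name and data are hypothetical.

```python
def best_stump_partition(values, labels):
    """Find the single-threshold split with the fewest misclassifications.

    Returns a list of booleans marking which samples fall on the left side.
    Assumes at least two distinct feature values.
    """
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for threshold in candidates:
        left = [v <= threshold for v in values]
        errors = 0
        for side in (True, False):
            side_labels = [l for l, is_left in zip(labels, left) if is_left == side]
            majority = max(set(side_labels), key=side_labels.count)
            errors += sum(1 for l in side_labels if l != majority)
        if best is None or errors < best[0]:
            best = (errors, left)
    return best[1]

values = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = ["N", "N", "N", "B", "B"]
rescaled = [1000 * v + 5 for v in values]   # positive scaling plus a shift

print(best_stump_partition(values, labels))
print(best_stump_partition(rescaled, labels))
```

The threshold itself changes with the scale, but the ordering of the samples, and therefore the split, does not.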

DL vs. ML on BUZZ1 Validation Dataset
We evaluated all raw audio ConvNets and ML models on the BUZZ1 validation dataset of 1150 audio samples. The receptive field size for all ConvNets was n = 80. Among the ConvNets, the top performer was RawConvNet (see Table 11). Among the ML models, the top performer was M1 (logistic regression) (see Table 12). Appendix B contains the confusion matrices for the other models on the BUZZ1 validation set. The total accuracy of RawConvNet (95.21%) was greater than the total accuracy of logistic regression (94.60%), and RawConvNet outperformed logistic regression in the bee and noise categories. Logistic regression outperformed RawConvNet in the cricket category.

Table 13 summarizes the classification performance of the raw audio ConvNets and the four ML methods on the BUZZ1 validation dataset. The results indicate that on this dataset, where the validation samples were separated from the training and testing samples by beehive and location, RawConvNet generalized better than the three deeper ConvNets and performed on par with logistic regression and random forests. While KNN and SVM OVR performed considerably worse than RawConvNet, logistic regression, and random forests, both models (KNN and SVM OVR) outperformed the three deeper ConvNets.

DL on BUZZ2
The raw audio ConvNets (RawConvNet, ConvNets 1, 2, and 3) were trained on the training dataset of BUZZ2 (7582 audio samples) with the same parameters as on BUZZ1 (see Section 4.1) and tested on the testing dataset of BUZZ2 (2332 audio samples). Recall that, in BUZZ2, the training dataset was completely separated from the testing dataset by beehive and location. Table 14 gives the performance results comparing RawConvNet with ConvNets 1, 2, and 3 on the BUZZ2 train/test dataset. The size of the receptive field was set to 80 for all ConvNets. The training accuracies of RawConvNet (99.98%) and ConvNet 1 (99.99%) were higher than those of ConvNet 2 (69.11%) and ConvNet 3 (69.62%). The testing accuracies of all ConvNets were on par. However, the testing loss of RawConvNet (0.14259) was considerably lower than the testing losses of ConvNet 1 (0.55976), ConvNet 2 (0.57461), and ConvNet 3 (0.58836). Thus, RawConvNet, a shallower ConvNet with a custom layer, is still preferable overall to the three deeper ConvNets without custom layers. It is noteworthy that there is a performance gap between the training and testing accuracies of ConvNets 2 and 3 similar to the gap between the same ConvNets on the BUZZ1 train/test dataset (see Table 4).

ML on BUZZ2
The four standard ML methods (logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), and random forests) were trained on the training dataset of BUZZ2 (7582 audio samples) and tested on the testing dataset of BUZZ2 (2332 audio samples). The training and testing took place on the same hardware and with the same parameters as on BUZZ1 (see Section 4.2). Since, as was discussed in Section 4.2, feature scaling did not improve the performance of the ML methods on BUZZ1, no feature scaling was used with the four ML methods on BUZZ2. Tables 15-18 give the confusion matrices for logistic regression, KNN, SVM OVR, and random forests, respectively, on the BUZZ2 testing dataset.
As these tables show, the total accuracy of KNN (98.54%) was the highest followed by random forests (96.61%). The total accuracies of logistic regression (89.15%) and SVM OVR (81.30%) were considerably lower, which indicates that these models did not generalize well on the testing data completely separated from the training data by beehive and location. When these results are compared with the results of the same ML methods on the BUZZ1 testing dataset in Section 4.2, one can observe that the accuracy of logistic regression fell by almost 10% (99% on BUZZ1 vs. 89.15% on BUZZ2), the accuracy of KNN stayed on par (99% on BUZZ1 vs. 98.54% on BUZZ2), the accuracy of SVM OVR decreased by almost 20% (99.91% on BUZZ1 vs. 81.30% on BUZZ2), and the accuracy of random forests dropped by over 3% (99.97% on BUZZ1 vs. 96.61% on BUZZ2).

DL vs. ML on BUZZ2 Validation Dataset
We evaluated the raw audio ConvNets and the ML models on the BUZZ2 validation dataset of 3000 audio samples. Recall that, in BUZZ2, the training beehives were completely separated by beehive and location from the testing beehive, while the validation beehives were separated from the training and testing beehives by beehive, location, time (2017 vs. 2018), and bee race (Italian vs. Carniolan). Tables in Appendix B give the confusion matrices for all models on the BUZZ2 validation dataset. Table 19 gives the performance summary of all models on this dataset. The receptive field size for all raw audio ConvNets was n = 80. The results in Table 19 indicate that in a validation dataset completely separated from a training dataset by beehive, location, time, and bee race, the raw audio ConvNet models generalized much better than the ML models. Of the ConvNet models, RawConvNet (see Table 20) was the best classifier. Of the four ML models, logistic regression (see Table 21) performed better than the other three ML models. RawConvNet outperformed logistic regression on the bee and noise categories and performed on par on the cricket category. In terms of total validation accuracies, all raw audio ConvNets generalized better than their ML counterparts and outperformed them.
For the sake of completeness, we trained SpectConvNet (see Figure 5 and Table 1) on the BUZZ2 training dataset and evaluated it on the BUZZ2 validation dataset to compare its performance with the raw audio ConvNets and the four ML models. The confusion matrix for this experiment is given in Table 22. SpectConvNet generalized better than the four ML models but worse than the raw audio ConvNets.

Running ConvNets on Low Voltage Devices
One of the research objectives of our investigation was to discover whether ConvNets could run in situ on low-voltage devices such as Raspberry Pi or Arduino. To achieve this objective, we persisted a trained RawConvNet on the SD card of a Raspberry Pi 3 model B v1.2 computer. The Raspberry Pi was powered with a fully charged Anker Astro E7 26800 mAh portable battery. Two hundred 30-s raw audio samples from the BUZZ1 train/test dataset were saved in a local folder on the Raspberry Pi.
Two experiments were performed on the Raspberry Pi to test the feasibility of in situ audio classification with RawConvNet. In the first experiment, a cronjob executed a script in Python 2.7.9 every 15 min. The script would load persisted RawConvNet into memory, load a randomly selected audio sample from the local folder with 200 samples, split the audio sample into overlapping 2-s segments, and then classify each 2-s audio segment with RawConvNet. In the first experiment, the fully charged battery supported audio classification for 40 h during which 162 30-s samples were processed. In other words, it took the system, on average, 13.66 s to process one 30-s audio sample on the Raspberry Pi 3 model B v1.2.
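The segmentation step of the script can be sketched as follows. The amount of overlap between consecutive 2-s segments is not stated above, so the 1-s hop used here is purely an assumption, as are the function name and the toy sample rate; the classifier call is omitted.

```python
def segment_samples(samples, sample_rate, window_seconds=2.0, hop_seconds=1.0):
    """Split a sequence of audio samples into overlapping fixed-length segments.

    hop_seconds < window_seconds yields overlap; the 1-s hop is an assumed value.
    Trailing samples that do not fill a whole window are dropped.
    """
    window = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    return [samples[i:i + window] for i in range(0, len(samples) - window + 1, hop)]

# Toy example: a 30-s "recording" at 10 samples per second (hypothetical rate).
sample_rate = 10
recording = list(range(30 * sample_rate))
segments = segment_samples(recording, sample_rate)
print(len(segments), len(segments[0]))
```

With a 1-s hop, a 30-s sample yields 29 segments of 2 s each, and each segment would then be passed to the classifier separately.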
In the second experiment, we modified the Python script to process a batch of four 30-s audio files once every 60 min. The objective of the second experiment was to estimate whether a batch approach to in situ audio classification would result in better power efficiency because the persisted RawConvNet would be loaded into memory only once per every four 30-s audio samples. In the second experiment, the fully charged Anker battery supported audio classification for 43 h during which 172 30-s audio samples were processed. Thus, it took the system 37.68 s, on average, to classify a batch of four 30-s audio samples, which resulted in 9.42 s per one 30-s audio sample.
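The per-sample figures in the two experiments follow from simple arithmetic, which the sketch below reproduces from the numbers reported above. The helper names are hypothetical, and the relative saving is our own derived quantity rather than a result stated by the experiments.

```python
def per_sample_seconds(batch_seconds, batch_size):
    """Average processing time per audio sample within a batch."""
    return batch_seconds / batch_size

def relative_saving(baseline_seconds, improved_seconds):
    """Fraction of the baseline per-sample time saved by the improved setup."""
    return (baseline_seconds - improved_seconds) / baseline_seconds

single = 13.66                              # s per 30-s sample, first experiment
batched = per_sample_seconds(37.68, 4)      # s per 30-s sample, second experiment
samples_second_experiment = 43 * 4          # one batch of four samples per hour for 43 h

print(round(batched, 2))
print(samples_second_experiment)
print(round(100 * relative_saving(single, batched), 1))
```

Under these numbers, batching reduces the average processing time per 30-s sample by roughly 31%.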

Discussion
Our experiments indicate that raw audio ConvNets (i.e., ConvNets classifying raw audio samples) can perform better than ConvNets that classify audio spectrogram images on some audio datasets. Specifically, on the BUZZ2 validation dataset, SpectConvNet outperformed the four ML models but performed considerably worse than all four raw audio ConvNets. While it may be possible to experiment with different image classification ConvNet architectures to improve the classification accuracy of SpectConvNet, RawConvNet's classification accuracies of 95.21% on the BUZZ1 validation dataset and 96.53% on the BUZZ2 validation dataset are quite adequate for practical purposes because, unlike SpectConvNet, it classifies unmodified raw audio signals without having to convert them into images. Consequently, it trains and classifies faster, and is less energy intensive, which makes it a better candidate for in situ audio classification on low voltage devices such as Raspberry Pi or Arduino.
When we compared the performance of RawConvNet, a shallower ConvNet with a custom layer, with the three deeper ConvNets without custom layers on both the BUZZ1 and BUZZ2 validation datasets, we observed (see Table 23) that adding more layers may not necessarily result in improved classification performance. The summary results in Table 23 and the confusion matrices in Appendix B indicate that the deeper networks performed worse than RawConvNet on both validation datasets. RawConvNet also achieved lower validation losses. These results indicate that on BUZZ1, where the validation samples were separated from the training and testing samples by beehive and location, RawConvNet generalized better than the three deeper ConvNets and performed on par with logistic regression and random forests. While KNN and SVM OVR performed worse than RawConvNet, logistic regression, and random forests, both models (KNN and SVM OVR) outperformed the three deeper ConvNets. The picture is different on BUZZ2, where the validation samples were completely separated from the training samples by beehive, location, time, and bee race. In terms of total validation accuracies, all raw audio ConvNets generalized better than their ML counterparts and outperformed them. It is interesting to observe that, for the ML models, feature scaling did not contribute to performance improvement.

While the total accuracies give us a reasonable description of model performance, it is instructive to take a look at how the models performed on both validation datasets by audio sample category. This information is summarized in Tables 24 and 25. On BUZZ1, both the DL and ML models generalized well in the bee category. In the cricket category, logistic regression, KNN, RawConvNet, and ConvNet 1 achieved a validation accuracy above 90%. In the noise category, the only classifiers with a validation accuracy above 90% were RawConvNet, logistic regression, and random forests.
On BUZZ2, only RawConvNet achieved a validation accuracy above 90% in the bee category. Thus, in the bee category, this was the only ConvNet that generalized well to a validation dataset where the audio samples in the training dataset were separated from the validation dataset by beehive, location, time, and bee race. All classifiers with the exception of SVM OVR and random forests generalized well in the cricket category. In the noise category, RawConvNet was again the only classifier with a validation accuracy above 90%.

The main trade-off between the ConvNets and the ML methods was between feature engineering and training time. Our scientific diary indicates that it took us approximately 80 h of research and experimentation to obtain the final set of features for the four ML methods that gave us an optimal classification performance. On the other hand, once the feature engineering was completed, it took less than 5 min for each of the four ML methods to train both on BUZZ1 and BUZZ2. Specifically, on a personal computer with an Intel Core i7-4770@3.40

We did not evaluate our methods on ESC-50 [26], ESC-10 [26], and UrbanSound8K [57], three datasets used in some audio classification investigations. ESC-50 is a collection of 2000 short environmental recordings divided into five major audio event classes: animals, natural and water sounds, human non-speech, interior and domestic sounds, and exterior and urban noises. ESC-10 is a subset of ESC-50 of 400 recordings divided into 10 audio classes: dog bark, rain, sea waves, baby cry, clock tick, sneeze, helicopter, chainsaw, rooster, fire cracker. UrbanSound8K is a collection of 8732 short recordings of 10 audio classes of urban noise: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music.
A principal reason why we did not consider these datasets in our investigation is that our objective is to provide a set of practical audio classification tools for electronic beehive monitoring. Consequently, it is important for us to train and test our models on audio samples captured by deployed electronic beehive monitors. Such audio samples are absent from ESC-50, ESC-10, and UrbanSound8K. Another, more fundamental, reason why we chose to curate and use our own domain-specific data is Wolpert and Macready's no-free-lunch (NFL) theorems [58] that imply that, if a stochastic or deterministic algorithm does well on one class of optimization problems, there is no guarantee that it will do as well on other optimization problems. Since the DL and ML methods we used in this investigation are optimization algorithms in that they optimize various cost functions, there is no guarantee that their adequate performance on ESC-50, ESC-10, and UrbanSound8K will generalize to other domain-specific datasets, and vice versa.

Conclusions
We designed several convolutional neural networks and compared their performance with logistic regression, k-nearest neighbors, support vector machines, and random forests in classifying audio samples from microphones deployed above the landing pads of Langstroth beehives. On a dataset of 10,260 audio samples where the training and testing samples were separated from the validation samples by beehive and location, a shallower raw audio convolutional neural network with a custom layer outperformed three deeper raw audio convolutional neural networks without custom layers and performed on par with the four machine learning methods trained to classify feature vectors extracted from raw audio samples. On a more challenging dataset of 12,914 audio samples where the training and testing samples were separated from the validation samples by beehive, location, time, and bee race, all raw audio convolutional neural networks performed better than the four machine learning methods and a convolutional neural network trained to classify spectrogram images of audio samples. Our investigation gives an affirmative answer to the question of whether ConvNets can be used in real electronic beehive monitors that use low voltage devices such as Raspberry Pi or Arduino. In particular, we demonstrated that a trained raw audio convolutional neural network can successfully operate in situ on a low voltage Raspberry Pi computer. Thus, our experiments suggest that convolutional neural networks can be added to a repertoire of in situ audio classification algorithms for audio beehive monitoring. The main trade-off between deep learning and standard machine learning was between feature engineering and training time: while the convolutional neural networks required no feature engineering and generalized better on the second, more challenging, dataset, they took considerably more time to train than the machine learning methods. 
To the electronic beehive monitoring, audio processing, bioacoustics, ecoacoustics, and agricultural technology communities, we contribute two manually labeled datasets (BUZZ1 and BUZZ2) of audio samples of bee buzzing, cricket chirping, and ambient noise and our data collection and audio classification source code. We hope that our datasets will provide performance benchmarks to all interested researchers, practitioners, and citizen scientists and, along with our source code, ensure the reproducibility of the findings reported in this article.
Supplementary Materials: The following materials are available online at http://www.mdpi.com/2076-3417/8/9/1573/s1. Figure S1. Training and validation losses of RawConvNet with receptive field size n = 3; Figure S2. Training and validation losses of RawConvNet with receptive field n = 10; Figure S3. Training and validation losses of RawConvNet with receptive field n = 30; Figure S4. Training and validation losses of RawConvNet with receptive field n = 80; Figure S5. Training and validation losses of RawConvNet with receptive field n = 100; Figure S6. Histograms of the gradients for layers 4 and 5 in ConvNet 1. The histograms on the left show the gradient distribution in the FC softmax layer in layer 4; the histograms on the right show the gradient distribution for the FC softmax layer in layer 5; Figure S7. Histograms of the gradients for layers 5 and 6 in ConvNet 2. The histograms on the left show the distribution of gradients for the FC softmax layer in layer 5; the histograms on the right show the gradient distribution for the FC softmax layer in layer 6; Figure S8. Histograms of the gradients for layers 6 and 7 in ConvNet 3. The histograms on the left show the distribution of the gradients for the FC softmax layer in layer 6; the histograms on the right show the gradient distribution for the FC softmax layer in layer 7; Figure S9. Histograms of the gradients in the FC softmax layer in layer 5 in RawConvNet; Figure S10. ROC of SVM OVR with Standard Scaling for Bee, Noise and Cricket classes; Three sample audio files (bee buzzing, cricket chirping, ambient noise) from BUZZ1.

Funding:
This research has been funded, in part, through the contributions to our Kickstarter fund raiser [36]. All bee packages, bee hives, and beekeeping equipment used in this study were personally funded by V.K.

Acknowledgments:
We are grateful to Richard Waggstaff, Craig Huntzinger, and Richard Mueller for letting us use their property in Northern Utah for longitudinal electronic beehive monitoring tests.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Hyperparameters
In ConvNets, the properties corresponding to the structure of different layers and neurons along with their arrangement and receptive field values are called hyperparameters. Table A1 shows the number of hyperparameters used by different models discussed in this article.

Appendix B. Confusion Matrices for BUZZ1 and BUZZ2 Validation Datasets
This appendix gives confusion matrices for the DL and ML models on the BUZZ1 and BUZZ2 validation datasets.