3.1. Hardware and Data Acquisition
The video data for this investigation were captured by BeePi monitors, multi-sensor EBM systems we designed and built in 2014 [8] and have been iteratively modifying since then [7,22]. Each BeePi monitor (see Figure 1) consists of a Raspberry Pi 3 Model B v1.2 computer, a Pi T-Cobbler, a breadboard, a waterproof DS18B20 temperature sensor, a Pi v2 8-megapixel camera board, a v2.1 ChronoDot clock, and a Neewer 3.5 mm mini lapel microphone placed above the landing pad. All hardware components fit in a single Langstroth super. BeePi units are powered either from the grid or from rechargeable batteries.
BeePi monitors have thus far had six field deployments. The first deployment was in Logan, UT (September 2014), when a single BeePi monitor was placed into an empty hive and ran on solar power for two weeks. The second deployment was in Garland, UT (December 2014–January 2015), when a BeePi monitor was placed in a hive with overwintering honeybees and operated successfully on solar power for nine of the fourteen days of deployment, capturing ≈200 MB of data. The third deployment was in North Logan, UT (April–November 2016), where four BeePi monitors were placed into four beehives at two small apiaries and captured ≈20 GB of data. The fourth deployment was in Logan and North Logan, UT (April–September 2017), when four BeePi units were placed into four beehives at two small apiaries to collect ≈220 GB of audio, video, and temperature data. The fifth deployment started in April 2018, when four BeePi monitors were placed into four beehives at an apiary in Logan, UT. In September 2018, we decided to keep the monitors deployed through the winter to stress test the equipment in the harsh weather conditions of northern Utah. By May 2019, we had collected over 400 GB of video, audio, and temperature data. The sixth field deployment started in May 2019 with four freshly installed bee packages and is still ongoing as of January 2021, with ≈250 GB of data collected so far. In early June 2020, we deployed a BeePi monitor on a swarm that made its home in one of our empty hives and have been collecting data on it since then.
We should note that, unlike many apiarists, we do not intervene in the life cycle of the monitored hives in order to preserve the objectivity of our data and observations. For example, we do not apply any chemical treatments to or re-queen failing or struggling colonies.
  3.2. Terminology, Notation, and Definitions
We use the terms frame and image interchangeably to refer to two-dimensional (2D) pixel matrices where pixels can be either real non-negative numbers or, as is the case with multi-channel images (e.g., PNG or BMP), tuples of real non-negative numbers.
We use pairs of matching left and right parentheses to denote sequences of symbols. We use the set-theoretic membership symbol ∈ to denote when a symbol is either in a sequence or in a set of symbols. We use the universal quantifier ∀ to denote the fact that some mathematical statement holds for all mathematical objects in a specific set or sequence and use the existential quantifier ∃ to denote the fact that some mathematical statement holds for at least one object in a specific set or sequence.
We use the symbols ∧ and ∨ to refer to the logical and and the logical or, respectively. Thus, $\forall i \in \mathbb{N}\; \exists j \in \mathbb{N}\; (j > i)$ states a common truism that for every natural number $i$ there is another natural number $j$ greater than $i$.
Let $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_m)$ be sequences of symbols, where $n$ and $m$ are positive integers. We define the intersection of two symbolic sequences $X \cap Y$ in Equation (1), where $k$ and $i$ are positive integers, $z_i \in X \cap Y$ whenever $z_i \in X \wedge z_i \in Y$, for $1 \le i \le k$ and $k \le \min(n, m)$. If two sequences have no symbols in common, then $X \cap Y = ()$.

$$X \cap Y = (z_1, z_2, \ldots, z_k), \qquad z_i \in X \wedge z_i \in Y \quad (1)$$
We define a video to be a sequence of consecutive equi-dimensional 2D frames $V = (f_1, f_2, \ldots, f_t)$, where $t$, $j$, and $k$ are positive integers such that $1 \le j, k \le t$. Thus, if a video $V$ contains 745 frames, then $V = (f_1, f_2, \ldots, f_{745})$, $t = 745$. By definition, videos consist of unique frames so that if $f_j \in V$, $f_k \in V$, and $j \ne k$, then $f_j \ne f_k$. It should be noted that $f_j$ and $f_k$ may be the same pixelwise. Any smaller sequence of consecutive frames of a larger video is also a video. For example, if $V = (f_1, f_2, \ldots, f_{745})$ is a video, then so are $(f_1, \ldots, f_{100})$ and $(f_{101}, \ldots, f_{745})$.
When we discuss multiple videos that contain the same frame symbol $f_j$ or when we want to emphasize specific videos under discussion, we use superscripts in frame symbols to reference respective videos. Thus, if videos $V_1$ and $V_2$ include $f_j$, then $f_j^{V_1}$ designates $f_j$ in $V_1$ and $f_j^{V_2}$ designates $f_j$ in $V_2$.
Let $V = (f_1, f_2, \ldots, f_t)$ be a video, $f_l \in V$, and $\xi$ be a positive integer. A frame's context in $V$, denoted as $C^V_\xi(f_l)$, is defined in Equation (2).

$$C^V_\xi(f_l) = (f_{l_1}, \ldots, f_{l-1}, f_l, f_{l+1}, \ldots, f_{l_2}) \quad (2)$$

where

$$l_1 = \max(1, l - \xi), \qquad l_2 = \min(t, l + \xi).$$

In other words, the context $C^V_\xi(f_l)$ of $f_l$ is a video that consists of a sequence of consecutive $\xi$ or fewer frames (possibly empty) that precede $f_l$ and a sequence of $\xi$ or fewer frames (possibly empty) that follow it. We refer to $\xi$ as a context size and to $C^V_\xi(f_l)$ as the $\xi$-context or, simply, context of $f_l$, and refer to $f_l$ as the contextualized frame of $C^V_\xi(f_l)$. If there is no need to reference $V$, we omit the superscript and refer to $f_l$ as the contextualized frame of $C_\xi(f_l)$.
For example, let $V = (f_1, f_2, \ldots, f_{10})$, then the 3-context of $f_5$ is

$$C_3^V(f_5) = (f_2, f_3, f_4, f_5, f_6, f_7, f_8).$$

Analogously, the 2-context of $f_1$ is

$$C_2^V(f_1) = (f_1, f_2, f_3).$$
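To make the context operation concrete, the following minimal Python sketch (an illustration of Equation (2) under an array-based frame representation; the function name `context` is ours, not from the BeePi codebase) computes the $\xi$-context of a frame:

```python
import numpy as np

def context(video, l, xi):
    """Return the xi-context of frame f_l (1-indexed) per Equation (2):
    f_l together with up to xi frames before it and up to xi frames after it."""
    t = len(video)
    lo = max(1, l - xi)       # l_1 = max(1, l - xi)
    hi = min(t, l + xi)       # l_2 = min(t, l + xi)
    return video[lo - 1:hi]   # convert 1-indexed frame numbers to a 0-indexed slice

# A toy 10-frame video of 4x4 frames: the 3-context of f_5 spans f_2..f_8.
V = [np.full((4, 4), i, dtype=float) for i in range(1, 11)]
assert len(context(V, 5, 3)) == 7
assert len(context(V, 1, 2)) == 3   # the 2-context of f_1 is (f_1, f_2, f_3)
```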
If $f \in V$, then $f(r, c)$ is the pixel value at row $r$ and column $c$ in $f$. If $f$ is a frame whose pixel values at each position $(r, c)$ are real numbers, we use the notation $\max(f)$ to refer to the maximum such value in the frame. If $V = (f_1, f_2, \ldots, f_t)$, then the mean frame of $V$, denoted as $\mu(V)$ or $\mu$ when $V$ can be omitted, is the frame where the pixel value at $(r, c)$ is the mean of the pixel values at $(r, c)$ of all frames $f_j \in V$, as defined in Equation (3).

$$\mu(V)(r, c) = \frac{1}{t} \sum_{j=1}^{t} f_j(r, c) \quad (3)$$
If pixels are n-dimensional tuples (e.g., each $f_j$ is an RGB or PNG image), then each pixel in $\mu(V)$ is an n-dimensional tuple of the means of the corresponding tuple values in all frames $f_j \in V$.
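In NumPy, the mean frame of Equation (3) is a single reduction over the stacked frames; the sketch below (our own illustration) covers both grayscale and n-channel frames:

```python
import numpy as np

def mean_frame(video):
    """Mean frame of Equation (3): the pixelwise mean over all frames.
    Works for (rows, cols) grayscale frames and (rows, cols, n) n-channel
    frames, averaging each channel independently."""
    return np.stack(video, axis=0).mean(axis=0)

f1 = np.array([[0.0, 2.0], [4.0, 6.0]])
f2 = np.array([[2.0, 4.0], [6.0, 8.0]])
print(mean_frame([f1, f2]))   # [[1. 3.] [5. 7.]]
```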
Let $V = (f_1, f_2, \ldots, f_t)$ be a video, and let $\xi$ be a context size and $C^V_\xi(f_l)$ be the context of $f_l \in V$. The $l$-th dynamic background frame of $V$, denoted as $B_l$, $1 \le l \le t$, is defined in Equation (4) as the mean frame of $C^V_\xi(f_l)$ (i.e., the mean frame of the $\xi$-context of $f_l$).

$$B_l = \mu\!\left(C^V_\xi(f_l)\right) \quad (4)$$
In general, the dynamic background operation specified in Equation (4) is designed to filter out noise, blurriness, and static portions of the images in a given video. As the third video set in the supplementary material shows, BeePIV can process videos taken against the background of grass, trees, and bushes. For an example, consider a 12-frame video $V$ in Figure 2, and let $\xi = 2$. Figure 3 shows 12 dynamic background frames for the video in Figure 2. In particular, $B_1$ is the mean frame of $C_2^V(f_1) = (f_1, f_2, f_3)$; $B_2$ is the mean frame of $C_2^V(f_2) = (f_1, \ldots, f_4)$; $B_3$ is the mean frame of $C_2^V(f_3) = (f_1, \ldots, f_5)$; and $B_4$ is the mean frame of $C_2^V(f_4) = (f_2, \ldots, f_6)$. Proceeding to the right in this manner, we reach $B_{12}$, the dynamic background of the last contextualized frame of $V$, which is the mean frame of $C_2^V(f_{12}) = (f_{10}, f_{11}, f_{12})$.
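Under the same representation, the dynamic background frames of Equation (4) reduce to windowed averages; a minimal sketch (not the production BeePi code):

```python
import numpy as np

def dynamic_backgrounds(video, xi):
    """Return (B_1, ..., B_t) per Equation (4): B_l is the mean frame of the
    xi-context of f_l, so near the ends of the video the average is taken
    over fewer frames."""
    t = len(video)
    stacked = np.stack(video, axis=0)
    return [stacked[max(0, l - xi):min(t, l + xi + 1)].mean(axis=0)
            for l in range(t)]   # l is 0-indexed here

# With xi = 2, B_1 is the mean of (f_1, f_2, f_3), as in Figure 3.
V = [np.random.rand(8, 8) for _ in range(12)]
B = dynamic_backgrounds(V, xi=2)
assert np.allclose(B[0], np.stack(V[:3]).mean(axis=0))
```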
A neighborhood function maps a pixel position $(r, c)$ in $f$ to a set of positions around it. In particular, we define two neighborhood functions $N_4$ (see Equation (5)) and $N_8$ (see Equation (6)) for the standard 4- and 8-neighborhoods, respectively, used in many image processing operations. Given a position $(r, c)$ in $f$, the statement $(i, j) \in N_8(r, c)$ states that the 8-neighborhood of $(r, c)$ includes a position $(i, j)$ such that $|i - r| \le 1$ and $|j - c| \le 1$.

$$N_4(r, c) = \{(r-1, c), (r+1, c), (r, c-1), (r, c+1)\} \quad (5)$$

$$N_8(r, c) = N_4(r, c) \cup \{(r-1, c-1), (r-1, c+1), (r+1, c-1), (r+1, c+1)\} \quad (6)$$
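The two neighborhood functions translate directly into code; the bounds-checking helper below is a practical addition of ours for border pixels, which the set definitions leave implicit:

```python
def n4(r, c):
    """4-neighborhood of (r, c) per Equation (5)."""
    return {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}

def n8(r, c):
    """8-neighborhood of (r, c) per Equation (6)."""
    return n4(r, c) | {(r - 1, c - 1), (r - 1, c + 1),
                       (r + 1, c - 1), (r + 1, c + 1)}

def clip_to_frame(positions, rows, cols):
    """Drop neighbors that fall outside a rows x cols frame."""
    return {(i, j) for (i, j) in positions if 0 <= i < rows and 0 <= j < cols}

print(sorted(clip_to_frame(n8(0, 0), 8, 8)))   # [(0, 1), (1, 0), (1, 1)]
```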
We use the terms PIV and digital PIV (DPIV) interchangeably, as we do the terms bee and honeybee. Every occurrence of the term bee in the text of the article refers to the Apis mellifera honeybee and to no other bee species. We also interchange the terms hive and beehive to refer to a Langstroth beehive hosting an Apis mellifera colony.
  3.3. BeePIV
  3.3.1. Dynamic Background Subtraction
Let $V = (f_1, f_2, \ldots, f_t)$ be a video. In BeePIV, the background frames are subtracted pixelwise from the corresponding contextualized frames to obtain difference frames. The $l$-th difference frame of $V$, denoted as $D_l$, is defined in Equation (7).

$$D_l = |f_l - B_l| \quad (7)$$

The pixels in $f_l$ that are closest to the corresponding pixels in $B_l$ represent positions that have remained unchanged over the period of physical time over which the frames in $C^V_\xi(f_l)$ were captured by the camera. Consequently, the positions of the pixels in $D_l$ where the difference is relatively high signal potential bee motions. Figure 4 shows several difference frames computed from the corresponding contextualized and background frames of the 12-frame video in Figure 2.
Let $V = (f_1, f_2, \ldots, f_t)$ be a video. We now introduce the video difference operator $\Delta_\xi$ in Equation (8) that applies the operation in Equation (7) to every contextualized frame $f_l$, $1 \le l \le t$.

$$\Delta_\xi(V) = (D_1, D_2, \ldots, D_t) \quad (8)$$

For example, if $V$ is the 12-frame video in Figure 2 and $\xi = 2$, then $\Delta_2(V) = (D_1, D_2, \ldots, D_{12})$ and $\Delta_2(V)$ contains each of the four difference frames shown in Figure 4.
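Combining the background computation with Equation (7) gives a direct sketch of the video difference operator (again illustrative, with the absolute pixelwise difference standing in for Equation (7)):

```python
import numpy as np

def video_difference(video, xi):
    """Delta_xi(V) of Equation (8): the sequence (D_1, ..., D_t) of absolute
    pixelwise differences between each contextualized frame f_l and its
    dynamic background B_l."""
    t = len(video)
    stacked = np.stack(video, axis=0)
    diffs = []
    for l in range(t):   # l is 0-indexed
        b_l = stacked[max(0, l - xi):min(t, l + xi + 1)].mean(axis=0)
        diffs.append(np.abs(video[l] - b_l))
    return diffs

# Static pixels difference out to ~0; a briefly changing pixel stands out.
V = [np.zeros((4, 4)) for _ in range(12)]
V[6][2, 2] = 255.0
D = video_difference(V, xi=2)
print(D[6][2, 2] > D[6][0, 0])   # True
```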
  3.3.2. Difference Smoothing
A difference frame $D_l$ may contain not only bee motions detected in the corresponding contextualized frame $f_l$ but also bee motions from the other frames in the context $C^V_\xi(f_l)$ or motions caused by flying bees' shadows or occasional blurriness. In BeePIV, smoothing is applied to $D_l$ to replace each pixel with a local average of the neighboring pixels. Insomuch as the neighboring pixels measure the same hidden variable, averaging reduces the impact of bee shadows and blurriness without necessarily biasing the measurement, which results in more accurate frame intensity values. An important objective of difference smoothing is to concentrate intensity energy in those areas of $D_l$ that represent actual bee motions in $f_l$.
The smoothing operator, $H$, is defined in Equation (9), where $\eta$ is a real positive number and $w$ is a weighting function assigning relative importance to each neighborhood position. In the current implementation of BeePIV, the weights are assigned with the weight function $w$ in Equation (10), so that $H$ replaces each pixel with the unweighted average of the 3 × 3 block formed by the pixel and its 8-neighborhood. We will use the notation $\hat{D}_l$ to denote the smoothed difference frame obtained from $D_l$ so that $\hat{D}_l = H(D_l)$. We will use the notation $H(D_l)$ as a shorthand for the application of $H$ to every position $(r, c)$ in $D_l$ to obtain $\hat{D}_l$. Figure 5 shows several smoothed difference frames obtained from the corresponding difference frames where the weights are assigned using the weight function $w$ in Equation (10).

$$\hat{D}_l(r, c) = H(D_l)(r, c) = \frac{1}{\eta} \sum_{(i, j) \in N_8(r, c) \cup \{(r, c)\}} w(i, j)\, D_l(i, j) \quad (9)$$

$$w(i, j) = 1, \qquad \eta = 9 \quad (10)$$
Let $V = (f_1, f_2, \ldots, f_t)$ and let $\Delta_\xi(V) = (D_1, D_2, \ldots, D_t)$ be a sequence of difference frames. The video smoothing operator $\mathcal{H}$ applies the smoothing operator $H$ to every frame in $\Delta_\xi(V)$ and returns a sequence of smoothed difference frames $(\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_t)$, as defined in Equation (11).

$$\mathcal{H}(\Delta_\xi(V)) = (H(D_1), H(D_2), \ldots, H(D_t)) = (\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_t) \quad (11)$$
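With uniform weights, the smoothing pass of Equations (9)–(11) is a 3 × 3 box filter; the sketch below uses scipy.ndimage.uniform_filter as a stand-in for whatever weighting the deployed system uses:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def smooth(diff_frame):
    """H of Equation (9) with uniform weights: each pixel becomes the average
    of the 3x3 block formed by the pixel and its 8-neighborhood."""
    return uniform_filter(diff_frame, size=3, mode="nearest")

def video_smooth(diff_frames):
    """The video smoothing operator of Equation (11)."""
    return [smooth(d) for d in diff_frames]

# An isolated spike is spread over its neighborhood and lowered: 90 / 9 = 10.
d = np.zeros((5, 5))
d[2, 2] = 90.0
print(smooth(d)[2, 2])   # 10.0
```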
  3.3.3. Color Variation
Since a single flying bee may generate multiple motion points in close proximity as its body parts (e.g., head, thorax, and wings) move through the air, pixels in close proximity (i.e., within a certain distance) whose values are above a certain threshold in smoothed difference frames can be combined into clusters. Such clusters can be reduced to single motion points on a uniform (white or black) background to improve the accuracy of PIV. Our conjecture, based on empirical observations, is that a video’s color variation levels and its bee traffic levels are related and that video-specific thresholds and distances for reducing smoothed difference frames to motion points can be obtained from video color variation.
Let $V = (f_1, f_2, \ldots, f_t)$ be a video and let $B$ (see Equation (4)) be the background frame of $V$ with $\xi = t$ (i.e., the number of frames in $V$). To put it differently, as Equation (4) implies, $B$ is the mean frame of the entire video and contains information about regions with little or no variation across all frames in $V$.

A color variation frame, denoted as $\Phi_l$ (see Equation (12)), is computed for each $f_l \in V$ as the smoothed squared pixelwise difference between $f_l$ and $B$ across all image channels.

$$\Phi_l = H(S_l), \qquad S_l(r, c) = \sum_{ch} \left(f_l(r, c, ch) - B(r, c, ch)\right)^2 \quad (12)$$
The color variation values from all individual color variation frames $\Phi_l$ are combined into one maximum color variation frame, denoted as $\Phi_{\max}$ (see Equation (13)), for the entire video $V$. Each position $(r, c)$ in $\Phi_{\max}$ holds the maximum value for $(r, c)$ across all color variation frames $\Phi_l$, where $1 \le l \le t$. In other words, $\Phi_{\max}$ contains quantized information for each pixel position on whether there has been any change in that position across all frames in the video.

$$\Phi_{\max}(r, c) = \max_{1 \le l \le t} \Phi_l(r, c) \quad (13)$$
Figure 6 gives the background and maximum color variation frames for a low bee traffic video $V_1$ and a high bee traffic video $V_2$. The maximum color variation frames are grayscale images whose pixel values range from 0 (black) to 255 (white). Thus, the whiter the value of a pixel in a $\Phi_{\max}$ is, the greater the color variation at the pixel's position. As can be seen in Figure 6, $\Phi_{\max}^{V_1}$ has fewer whiter pixels compared to $\Phi_{\max}^{V_2}$, and the higher color variation clusters in $\Phi_{\max}^{V_2}$ tend to be larger and more evenly distributed across the frame than the higher color variation clusters in $\Phi_{\max}^{V_1}$.
The color variation of a video $V$, denoted as $cv(V)$, is defined in Equation (14) as the standard deviation of the pixel values of $\Phi_{\max}$ about their mean, where $R$ and $C$ are the numbers of rows and columns in $\Phi_{\max}$ and $\overline{\Phi_{\max}}$ is its mean pixel value. Higher values of $cv(V)$ indicate multiple motions; lower values indicate either relatively few motions or a complete lack thereof. We intend to investigate this implication in our future work.

$$cv(V) = \sqrt{\frac{1}{R \cdot C} \sum_{r=1}^{R} \sum_{c=1}^{C} \left(\Phi_{\max}(r, c) - \overline{\Phi_{\max}}\right)^2} \quad (14)$$
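Equations (12)–(14) amount to a few array reductions; a sketch under the assumptions stated above (channelwise squared differences, box smoothing, standard deviation about the mean):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def color_variation(video):
    """cv(V) per Equations (12)-(14); video is a list of (rows, cols,
    channels) float arrays."""
    stacked = np.stack(video, axis=0).astype(float)
    background = stacked.mean(axis=0)                        # B: mean frame of the whole video
    sq = ((stacked - background) ** 2).sum(axis=-1)          # squared diffs summed over channels
    phi = np.stack([uniform_filter(s, size=3) for s in sq])  # smoothed Phi_l, Equation (12)
    phi_max = phi.max(axis=0)                                # Equation (13)
    return float(phi_max.std())                              # Equation (14)

# More changing pixels should yield a higher cv value.
quiet = [np.zeros((16, 16, 3)) for _ in range(10)]
busy = [np.random.rand(16, 16, 3) * 255 for _ in range(10)]
print(color_variation(quiet) < color_variation(busy))   # True
```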
We postulate in Equation (15) the existence of a function $\Gamma$ from reals to 2-tuples of reals that maps color variation values for videos (i.e., $cv(V)$) to video-specific threshold and distance values $\theta_V$ and $\delta_V$ that can be used to reduce smoothed difference frames to uniform background frames with motion points. In Section 4.2, we define one such $\Gamma$ function in Equation (33) and evaluate it in Section 4.3, where we present our experiments on computing the PIV interrogation window size and overlap from color variation.

$$\Gamma(cv(V)) = (\theta_V, \delta_V) \quad (15)$$
Suppose there is a representative sample of bee traffic videos $V_1, V_2, \ldots, V_n$ obtained from a deployed BeePi monitor. Let $cv_{\min}$ and $cv_{\max}$ be experimentally observed lower and upper bounds, respectively, for the values of $cv(V_i)$. In other words, for any $V_i$, $cv_{\min} \le cv(V_i) \le cv_{\max}$. Let $\theta_{\min}$ and $\theta_{\max}$ be the experimentally selected lower and upper bounds, respectively, for $\theta_V$, that hold for all videos in the sample. Then $\theta_V$ in Equation (15) can be constrained to lie between $\theta_{\min}$ and $\theta_{\max}$, as shown in Equation (16).

$$\theta_V = \theta_{\min} + \frac{cv(V) - cv_{\min}}{cv_{\max} - cv_{\min}} \left(\theta_{\max} - \theta_{\min}\right) \quad (16)$$
The frame thresholding operator, $T_\theta$, is defined in Equation (17), where, for any position $(r, c)$ in $\hat{D}_l$, $T_\theta(\hat{D}_l)(r, c) = \hat{D}_l(r, c)$ if $\hat{D}_l(r, c) \ge \theta$ and $T_\theta(\hat{D}_l)(r, c) = 0$, otherwise. The video thresholding operator, $\mathcal{T}_\theta$, applies $T_\theta$ to every frame in $\mathcal{H}(\Delta_\xi(V))$ and returns a sequence of smoothed thresholded difference frames $(\hat{D}_{1,\theta}, \ldots, \hat{D}_{t,\theta})$, as defined in Equation (18), where $\mathcal{H}(\Delta_\xi(V)) = (\hat{D}_1, \ldots, \hat{D}_t)$.

$$T_\theta(\hat{D}_l)(r, c) = \begin{cases} \hat{D}_l(r, c) & \text{if } \hat{D}_l(r, c) \ge \theta \\ 0 & \text{otherwise} \end{cases} \quad (17)$$

$$\mathcal{T}_\theta(\mathcal{H}(\Delta_\xi(V))) = (T_\theta(\hat{D}_1), T_\theta(\hat{D}_2), \ldots, T_\theta(\hat{D}_t)) = (\hat{D}_{1,\theta}, \ldots, \hat{D}_{t,\theta}) \quad (18)$$
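The thresholding operators map directly onto an elementwise comparison (sketch):

```python
import numpy as np

def threshold_frame(smoothed, theta):
    """T_theta of Equation (17): keep values >= theta, zero out the rest."""
    return np.where(smoothed >= theta, smoothed, 0.0)

def threshold_video(smoothed_frames, theta):
    """The video thresholding operator of Equation (18)."""
    return [threshold_frame(s, theta) for s in smoothed_frames]

d = np.array([[5.0, 40.0], [60.0, 10.0]])
print(threshold_frame(d, theta=30.0))   # [[ 0. 40.] [60.  0.]]
```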
In Section 4.2, we give the actual values of these experimentally determined bounds found with the videos in our testbed dataset. Figure 7 shows the background frame of the video in Figure 2 and the value of $cv(V)$ computed from $\Phi_{\max}$ for the video. Figure 8 shows the impact of smoothing difference frames with the smoothing operation $H$ (see Equation (9)) and then thresholding them with $\theta_V$ computed from $cv(V)$.
The values of $\theta_V$ are used to threshold smoothed difference frames $\hat{D}_l$ from a video $V$, and the values of $\delta_V$, as we explain below, are used to determine which pixels are in close proximity to local maxima and should be eroded. Higher values of $cv(V)$ indicate the presence of higher bee traffic and, consequently, must be accompanied by smaller values of $\delta_V$ between maxima points, because in higher traffic videos multiple bees fly in close proximity to each other. On the other hand, lower values of $cv(V)$ indicate lower traffic and must be accompanied by higher values of $\delta_V$, because in lower traffic videos bees typically fly farther apart.
  3.3.4. Difference Maxima
Equation (19) defines a maxima operator $M_4$ that returns 1 if a given position in a smoothed thresholded difference frame $\hat{D}_{l,\theta}$ is a local maximum by using the neighborhood function $N_4$ in Equation (5). By analogy, we can define $M_8$, another maxima operator that does the same operation by using the neighborhood function $N_8$ in Equation (6).

$$M_4(\hat{D}_{l,\theta})(r, c) = \begin{cases} 1 & \text{if } \hat{D}_{l,\theta}(r, c) > 0 \ \wedge\ \forall (i, j) \in N_4(r, c): \hat{D}_{l,\theta}(r, c) \ge \hat{D}_{l,\theta}(i, j) \\ 0 & \text{otherwise} \end{cases} \quad (19)$$
We will use the notation $M_n(\hat{D}_{l,\theta})$, where $n$ is a positive integer (e.g., $n = 4$ or $n = 8$), as a shorthand for the application of $M_n$ to every position $(r, c)$ in $\hat{D}_{l,\theta}$ to obtain the frame $\hat{D}_{l,b}$. The symbol $b$ in the subscript of $\hat{D}_{l,b}$ indicates that this frame is binary, where, per Equation (19), 1's indicate positions of local maxima. Figure 9 shows the application of $M_4$ to a smoothed thresholded difference frame $\hat{D}_{l,\theta}$ to obtain the corresponding $\hat{D}_{l,b}$.
The video maxima operator $\mathcal{M}_n$ applies the maxima operator $M_n$ to every frame in a sequence of smoothed thresholded difference frames $(\hat{D}_{1,\theta}, \ldots, \hat{D}_{t,\theta})$ and returns a sequence of binary difference frames, as defined in Equation (20).

$$\mathcal{M}_n((\hat{D}_{1,\theta}, \ldots, \hat{D}_{t,\theta})) = (M_n(\hat{D}_{1,\theta}), \ldots, M_n(\hat{D}_{t,\theta})) = (\hat{D}_{1,b}, \ldots, \hat{D}_{t,b}) \quad (20)$$
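Local maxima detection per Equation (19) can be sketched with a maximum filter over the 4-neighborhood footprint (scipy.ndimage):

```python
import numpy as np
from scipy.ndimage import maximum_filter

# Cross-shaped footprint: the pixel itself plus its 4-neighborhood.
N4_FOOTPRINT = np.array([[0, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0]], dtype=bool)

def maxima_frame(thresholded):
    """M_4 of Equation (19): 1 at positions whose value is positive and not
    exceeded by any 4-neighbor, 0 elsewhere."""
    local_max = maximum_filter(thresholded, footprint=N4_FOOTPRINT,
                               mode="constant", cval=0.0)
    return ((thresholded > 0) & (thresholded >= local_max)).astype(np.uint8)

d = np.array([[0.0, 0.0, 0.0],
              [0.0, 134.0, 0.0],
              [0.0, 90.0, 0.0]])
print(maxima_frame(d))   # 1 only at (1, 1); (2, 1) loses to its larger neighbor
```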
  3.3.5. Difference Maxima Erosion
Let $P$ be the sequence of $(i, j)$ positions of all maxima points in the frame $\hat{D}_{l,b}$. For example, in Figure 9, $P$ consists of the four positions of the 1's in the binary frame. We define an erosion operator $E_\delta$ that, given a distance $\delta$, constructs the set $P$, sorts the positions $(i, j)$ in $P$ by their $i$ coordinates, and, for every position $(i, j)$ such that $\hat{D}_{l,b}(i, j) = 1$, sets the pixel values of all the positions that are within $\delta$ pixels of $(i, j)$ in $\hat{D}_{l,b}$ to 0 (i.e., erodes them). We let $\hat{D}_{l,e}$ refer to the smoothed and eroded binary difference frame obtained from $\hat{D}_{l,b}$ after erosion and define the erosion operator $E_\delta$ in Equation (21).

$$E_\delta(\hat{D}_{l,b}) = \hat{D}_{l,e} \quad (21)$$
The application of the erosion operator is best described algorithmically. Consider the binary difference frame in Figure 10a. Recall that this frame is the frame in Figure 9b obtained by applying the maxima operator $M_4$ to the frame in Figure 9a. After $M_4$ is applied, the sequence of the local maxima positions is $P = (p_1, p_2, p_3, p_4)$.
Let us set the distance parameter $\delta$ of the erosion operator to 4 and compute $E_4(\hat{D}_{l,b})$. As the erosion operator scans $P$ left to right, the positions of the eroded maxima are saved in a dynamic lookup array (let us call it $I$) so that the previously eroded positions are never processed more than once. The array $I$ holds the index positions of the sorted pixel values at the positions in $P$. Initially, in our example, $I = (4, 1, 3, 2)$, because the pixel values at the positions in $P$, sorted from lowest to highest, are $(125.50, 134.00, 136.00, 143.00)$, so that the value 125.50 is at position 4 in $P$, the value 134.00 at position 1, the value 136.00 at position 3, and the value 143.00 at position 2. In other words, the pixel value at $p_4$ is the lowest and the pixel value at $p_2$ the highest. For each value in $I$ that has not yet been processed, the erosion operator computes the Euclidean distance between its coordinates and the coordinates of each point to the right of it in $I$ that has not yet been processed. For example, in the beginning, when the index position is at 1, the operator computes the distances between the coordinates of position 1 and the coordinates of positions 2, 3, and 4.
If the current point in $I$ is at $(i_1, j_1)$ and a point to the right of it in $I$ is at $(i_2, j_2)$, then $d = \sqrt{(i_1 - i_2)^2 + (j_1 - j_2)^2}$ is the distance between them. If $d \le \delta$, the point at $(i_2, j_2)$ is eroded and is marked as such. The erosion operator continues to loop through $I$, skipping the indices of the points that have been eroded. In this example, the positions 2 and 3 in $P$ are eroded. Thus, the frame $E_4(\hat{D}_{l,b})$ shown in Figure 10b has 1's at positions $p_1$ and $p_4$ and 0's everywhere else.
Since the erosion operator is greedy, it does not necessarily ensure that the largest pixel values are always selected, because their corresponding positions may be within the distance $\delta$ of a given point whose value may be lower. In our example, the largest pixel value 143 at position $p_2$ is eroded because it is within the distance threshold of an earlier-considered point with a lower value.
A more computationally involved approach to guarantee the preservation of relative local maxima is to sort the maxima values in $\hat{D}_{l,\theta}$ in descending order and continue to erode the positions within $\delta$ pixels of the position of each sorted maximum until there is nothing else to erode. In practice, we found that this method does not contribute to the accuracy of the algorithm due to the proximity of local maxima to each other. To put it differently, it is the positions of the local maxima that matter, not their actual pixel values in smoothed difference frames, in that the motion points generated by the multiple body parts of a flying bee have a strong tendency to cluster in close proximity.
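A direct transcription of the greedy distance-based scan (a simplified sketch of $E_\delta$ that visits the maxima in scan order rather than maintaining the sorted lookup array $I$):

```python
import numpy as np

def erode_maxima(binary, delta):
    """Greedy erosion E_delta per Equation (21): each surviving maximum erodes
    every not-yet-processed maximum within delta pixels of it."""
    positions = [tuple(p) for p in np.argwhere(binary == 1)]
    eroded = set()
    for a, p in enumerate(positions):
        if p in eroded:
            continue
        for q in positions[a + 1:]:
            if q not in eroded and np.hypot(p[0] - q[0], p[1] - q[1]) <= delta:
                eroded.add(q)
    out = np.zeros_like(binary)
    for p in positions:
        if p not in eroded:
            out[p] = 1
    return out

b = np.zeros((8, 8), dtype=np.uint8)
for p in [(1, 1), (2, 3), (3, 5), (7, 7)]:
    b[p] = 1
print(np.argwhere(erode_maxima(b, delta=4) == 1).tolist())   # [[1, 1], [3, 5], [7, 7]]
```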
The video erosion operator $\mathcal{E}_\delta$ applies the erosion operator $E_\delta$ to every frame in a sequence of binary difference frames $(\hat{D}_{1,b}, \ldots, \hat{D}_{t,b})$ and returns a sequence of corresponding eroded frames $(\hat{D}_{1,e}, \ldots, \hat{D}_{t,e})$, as defined in Equation (22).

$$\mathcal{E}_\delta((\hat{D}_{1,b}, \ldots, \hat{D}_{t,b})) = (E_\delta(\hat{D}_{1,b}), \ldots, E_\delta(\hat{D}_{t,b})) = (\hat{D}_{1,e}, \ldots, \hat{D}_{t,e}) \quad (22)$$
The positions of 1's in each $\hat{D}_{l,e}$ are treated as centers of small circles whose radius is a fraction $\rho$ of the width of $\hat{D}_{l,e}$, where $\rho$ is a fixed parameter in our current implementation. The $\rho$ parameter, in effect, controls the size of the motion points for PIV. After the black circles are drawn on a white background, the frame $\hat{D}_{l,e}$ becomes the frame $Z_l$. We refer to $Z_l$ frames as motion frames and define the drawing operator $R_\rho$ in Equation (23), where $Z_l$ is obtained by drawing at each position with 1 in $\hat{D}_{l,e}$ a black circle whose radius is the fraction $\rho$ of the width of $\hat{D}_{l,e}$.

$$R_\rho(\hat{D}_{l,e}) = Z_l \quad (23)$$
Figure 11 shows twelve motion frames obtained from the twelve frames in Figure 2. Figure 12 shows the detected motion points in each motion frame $Z_l$ plotted on the corresponding original frame $f_l$ from which $Z_l$ was obtained. The motion frames $Z_l$ record bee motions reduced to single points and constitute the input to the PIV algorithm described in the next section.
The video drawing operator $\mathcal{R}_\rho$ applies the drawing operator $R_\rho$ to every frame in a sequence of eroded frames $(\hat{D}_{1,e}, \ldots, \hat{D}_{t,e})$ and returns a sequence of the corresponding white background frames with black circles $(Z_1, \ldots, Z_t)$, as defined in Equation (24).

$$\mathcal{R}_\rho((\hat{D}_{1,e}, \ldots, \hat{D}_{t,e})) = (R_\rho(\hat{D}_{1,e}), \ldots, R_\rho(\hat{D}_{t,e})) = (Z_1, \ldots, Z_t) \quad (24)$$
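Rendering motion frames per Equations (23) and (24) only requires stamping a disk at each surviving maximum; the default $\rho$ below is a placeholder of ours, not the paper's value:

```python
import numpy as np

def draw_motion_frame(eroded, rho=0.01):
    """R_rho of Equation (23): a white frame with a black circle of radius
    rho * frame_width centered at every 1 in the eroded frame."""
    rows, cols = eroded.shape
    radius = max(1, int(round(rho * cols)))
    frame = np.full((rows, cols), 255, dtype=np.uint8)
    rr, cc = np.ogrid[:rows, :cols]
    for (i, j) in np.argwhere(eroded == 1):
        frame[(rr - i) ** 2 + (cc - j) ** 2 <= radius ** 2] = 0
    return frame

def draw_motion_frames(eroded_frames, rho=0.01):
    """The video drawing operator of Equation (24)."""
    return [draw_motion_frame(e, rho) for e in eroded_frames]
```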
  3.3.6. PIV and Directional Bee Traffic
Let $(Z_1, Z_2, \ldots, Z_t)$ be a sequence of motion frames and let $Z_l$ and $Z_{l+1}$ be two consecutive motion frames in this sequence that correspond to original video frames $f_l$ and $f_{l+1}$. Let $W_1$ be an $n \times n$ window, referred to as an interrogation area or interrogation window in the PIV literature, selected from $Z_l$ and centered at position $(r, c)$. Another $n \times n$ window, $W_2$, is selected in $Z_{l+1}$ so that $W_2$ has the same dimensions as $W_1$. The position of $W_2$ in $Z_{l+1}$ is a function of the position of $W_1$ in $Z_l$ in that it changes relative to $(r, c)$ to find the maximum correlation peak. For each possible position of $W_2$ in $Z_{l+1}$, a corresponding position $(u, v)$ is computed in the correlation matrix $C$ defined in Equation (25).

$$C(u, v) = \sum_{i} \sum_{j} W_1(i, j)\, W_2(i + u, j + v) \quad (25)$$
The 2D matrix correlation is computed between $W_1$ and $W_2$ with the formula in Equation (25), where $u$ and $v$ are integers in the interval $[-(n-1), n-1]$. In Equation (25), $W_1(i, j)$ and $W_2(i + u, j + v)$ are the pixel intensities at locations $(i, j)$ in $W_1$ and $(i + u, j + v)$ in $W_2$. For each possible position $(u, v)$ of $W_2$ inside $Z_{l+1}$, the correlation value $C(u, v)$ is computed. If the size of $W_1$ is $n \times n$ and the size of $W_2$ is $n \times n$, then the size of the matrix $C$ is $(2n - 1) \times (2n - 1)$.
The matrix $C$ records the correlation coefficient for each possible alignment of $W_2$ with $W_1$. A faster way to calculate the correlation coefficients between two image frames is to use the Fast Fourier Transform (FFT) and its inverse, as shown in Equation (26), where $\mathcal{F}$ denotes the FFT, $\mathcal{F}^{-1}$ its inverse, and $*$ the complex conjugate. The reason why the computation of Equation (26) is faster than the computation of Equation (25) is that $W_1$ and $W_2$ must be of the same size, which allows $C$ to be computed as an elementwise product in the frequency domain instead of a sum over all possible alignments.

$$C = \mathcal{F}^{-1}\left(\mathcal{F}(W_1)^{*} \cdot \mathcal{F}(W_2)\right) \quad (26)$$
If $C(u_{\max}, v_{\max})$ is the maximum value in $C$ and $(r, c)$ is the center of $W_1$, the pair of $(r, c)$ and $(u_{\max}, v_{\max})$ defines a displacement vector from $(r, c)$ in $Z_l$ to $(r + u_{\max}, c + v_{\max})$ in $Z_{l+1}$. This vector represents how particles may have moved from $Z_l$ to $Z_{l+1}$. The displacement vectors form a vector field used to estimate possible flow patterns.
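A minimal FFT cross-correlation between two equal-size interrogation windows, per Equation (26); zero-padding to $(2n-1) \times (2n-1)$ recovers the full correlation plane of Equation (25) (our sketch, using NumPy's FFT):

```python
import numpy as np

def piv_displacement(w1, w2):
    """Estimate the displacement from window w1 (in Z_l) to window w2
    (in Z_{l+1}) at the peak of the FFT correlation plane, Equation (26)."""
    rows, cols = w1.shape
    s = (2 * rows - 1, 2 * cols - 1)   # full correlation plane size
    F1 = np.fft.fft2(w1, s=s)
    F2 = np.fft.fft2(w2, s=s)
    corr = np.real(np.fft.ifft2(np.conj(F1) * F2))
    pr, pc = np.unravel_index(np.argmax(corr), corr.shape)
    # peak indices wrap modulo the plane size; map them back to signed shifts
    drow = pr if pr < rows else pr - s[0]
    dcol = pc if pc < cols else pc - s[1]
    return drow, dcol

# A motion point at (8, 8) moves to (10, 9): the estimated shift is (2, 1).
z1 = np.zeros((16, 16)); z1[8, 8] = 1.0
z2 = np.zeros((16, 16)); z2[10, 9] = 1.0
print(piv_displacement(z1, z2))   # (2, 1)
```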
Figure 13 shows two frames $f_l$ and $f_{l+1}$, two corresponding motion frames $Z_l$ and $Z_{l+1}$ obtained from them by the application of the operator in Equation (24), and the vector field computed from $Z_l$ and $Z_{l+1}$ by Equation (26). The two displacement vectors correspond to the motions of two bees, with the left bee moving slightly left and the right bee moving down and right.
In Equation (27), we define the PIV operator $G$ that applies to two consecutive motion frames $Z_l$ and $Z_{l+1}$ to generate a field of displacement vectors $\vec{v}_l$. The video PIV operator $\mathcal{G}$ applies the PIV operator $G$ to every pair of consecutive motion frames $Z_l$ and $Z_{l+1}$ in a sequence of motion frames and returns a sequence of vector fields, as defined in Equation (28), where $Z = (Z_1, \ldots, Z_t)$ is a sequence of motion frames. It should be noted that the number of motion frames in $Z$ exceeds the number of the vector fields returned by the video PIV operator by exactly 1.

$$G(Z_l, Z_{l+1}) = \vec{v}_l \quad (27)$$

$$\mathcal{G}(Z) = (G(Z_1, Z_2), G(Z_2, Z_3), \ldots, G(Z_{t-1}, Z_t)) = (\vec{v}_1, \ldots, \vec{v}_{t-1}) \quad (28)$$
After the vector fields are computed by the $G$ operator for each pair of consecutive motion frames $Z_l$ and $Z_{l+1}$, the directions of the displacement vectors are used to estimate directional bee traffic. Each vector is classified as lateral, incoming, or outgoing according to the value ranges in Figure 14. A vector is classified as outgoing if its direction is in the outgoing range in Figure 14, as incoming if its direction is in the incoming range, and as lateral otherwise.
Let $Z_l$ and $Z_{l+1}$ be two consecutive motion frames from a video $V$. Let $I(Z_l, Z_{l+1})$, $O(Z_l, Z_{l+1})$, and $L(Z_l, Z_{l+1})$ be the counts of incoming, outgoing, and lateral vectors. If $Z = (Z_1, \ldots, Z_k)$ is a sequence of $k$ motion frames obtained from $V$, then $I$, $O$, and $L$ can be used to define three video-based functions $\omega_I$, $\omega_O$, and $\omega_L$ that return the counts of incoming, outgoing, and lateral displacement vectors for $Z$, as shown in Equation (29).

$$\omega_I(Z) = \sum_{l=1}^{k-1} I(Z_l, Z_{l+1}), \qquad \omega_O(Z) = \sum_{l=1}^{k-1} O(Z_l, Z_{l+1}), \qquad \omega_L(Z) = \sum_{l=1}^{k-1} L(Z_l, Z_{l+1}) \quad (29)$$
For example, let $Z = (Z_1, Z_2, Z_3)$ such that $I(Z_1, Z_2) = 2$, $O(Z_1, Z_2) = 1$, $L(Z_1, Z_2) = 0$ and $I(Z_2, Z_3) = 1$, $O(Z_2, Z_3) = 3$, $L(Z_2, Z_3) = 1$. Then, $\omega_I(Z) = 3$, $\omega_O(Z) = 4$, and $\omega_L(Z) = 1$.
We define the video motion count operator $\Omega$ in Equation (30) as the operator that returns a 3-tuple of directional motion counts obtained from a sequence of motion frames $Z$ with $\omega_I$, $\omega_O$, and $\omega_L$.

$$\Omega(Z) = (\omega_I(Z), \omega_O(Z), \omega_L(Z)) \quad (30)$$
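Classifying displacement vectors by angle and accumulating the three counts might look as follows; the angular ranges below are placeholders of ours standing in for the ranges in Figure 14:

```python
import math
from collections import Counter

def classify(dx, dy):
    """Label one displacement vector; angles in degrees, counterclockwise
    from the positive x-axis, with y pointing up (away from the hive)."""
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    if 45.0 <= angle <= 135.0:
        return "outgoing"   # placeholder range, not the paper's
    if 225.0 <= angle <= 315.0:
        return "incoming"   # placeholder range, not the paper's
    return "lateral"

def motion_counts(vector_fields):
    """Omega of Equation (30): the 3-tuple (omega_I, omega_O, omega_L)."""
    totals = Counter(classify(dx, dy) for field in vector_fields
                     for (dx, dy) in field)
    return (totals["incoming"], totals["outgoing"], totals["lateral"])

fields = [[(0.1, 2.0), (-1.5, 0.2)], [(0.3, -2.2)]]
print(motion_counts(fields))   # (1, 1, 1)
```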
  3.3.7. Putting It All Together
We can now define the BeePIV algorithm in a single equation. Let $V$ be a video. Let the video's color variation (i.e., $cv(V)$) be computed with Equation (14) and let the values of $\theta$ and $\delta$ be computed from $cv(V)$ with Equation (15). We also need to select a context size $\xi$ and the parameter $\rho$ for the circle drawing operator $\mathcal{R}_\rho$ in Equation (24) to generate motion frames $Z$.

$$\Omega\!\left(\mathcal{R}_\rho\!\left(\mathcal{E}_\delta\!\left(\mathcal{M}_n\!\left(\mathcal{T}_\theta\!\left(\mathcal{H}\!\left(\Delta_\xi(V)\right)\right)\right)\right)\right)\right) = (\omega_I, \omega_O, \omega_L) \quad (31)$$
The BeePIV algorithm is defined in Equation (31) as an operator that applies to a video $V$. The operator is a composition of operators where each subsequent operator is applied to the output of the previous one. The operator starts by applying the video difference operator $\Delta_\xi$ in Equation (8) to subtract from each contextualized frame $f_l$ its background frame $B_l$. The output of $\Delta_\xi$ is given to the video smoothing operator $\mathcal{H}$ in Equation (11). The smoothed frames produced by $\mathcal{H}$ are thresholded by the video thresholding operator $\mathcal{T}_\theta$ in Equation (18) and given to the video maxima operator $\mathcal{M}_n$ in Equation (20) to detect positions of local maxima. The frames produced by $\mathcal{M}_n$ are given to the video erosion operator $\mathcal{E}_\delta$ in Equation (22). The eroded frames produced by $\mathcal{E}_\delta$ are processed by the video drawing operator $\mathcal{R}_\rho$ in Equation (24) that turns these frames into white background motion frames and gives them to the video motion count operator $\Omega$ in Equation (30) to return the counts of incoming, outgoing, and lateral displacement vectors. We refer to Equation (31) as the BeePIV equation.
Figure 15 gives a flowchart of the BeePIV algorithm.
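Tying the sketches above together, the BeePIV equation reads as a straight function composition. The sketch below reuses the functions defined in the earlier snippets; all parameter defaults are illustrative placeholders, with $\theta$ and $\delta$ in the paper coming from $cv(V)$ via Equation (15):

```python
def piv_fields(motion_frames):
    """Degenerate stand-in for Equation (28): one whole-frame window per
    consecutive pair; real PIV tiles frames into interrogation windows."""
    fields = []
    for z1, z2 in zip(motion_frames, motion_frames[1:]):
        drow, dcol = piv_displacement(255.0 - z1, 255.0 - z2)  # invert: points become bright
        fields.append([(dcol, -drow)])   # (dx, dy) with y pointing up the frame
    return fields

def beepiv(video, xi=2, theta=30.0, delta=4, rho=0.01):
    """A sketch of the BeePIV equation (Equation (31)) as a composition of
    the earlier sketches."""
    diffs = video_difference(video, xi)                # Equation (8)
    smoothed = video_smooth(diffs)                     # Equation (11)
    thresholded = threshold_video(smoothed, theta)     # Equation (18)
    maxima = [maxima_frame(s) for s in thresholded]    # Equation (20)
    eroded = [erode_maxima(m, delta) for m in maxima]  # Equation (22)
    motion = draw_motion_frames(eroded, rho)           # Equation (24)
    return motion_counts(piv_fields(motion))           # Equations (28) and (30)
```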