Ethoflow: Computer Vision and Artificial Intelligence-Based Software for Automatic Behavior Analysis

Manual monitoring of animal behavior is time-consuming and prone to bias. Computational resources, such as tracking systems, can mitigate these limitations and enable accurate, long-term behavioral assessments. There is a demand for robust software that handles analysis in heterogeneous environments (such as field conditions) and evaluates multiple individuals in groups while maintaining their identities. The Ethoflow software was developed using computer vision and artificial intelligence (AI) tools to monitor various behavioral parameters automatically. An object detection algorithm based on instance segmentation was implemented, allowing behavior monitoring in the field under heterogeneous environments. Moreover, a convolutional neural network was implemented to assess complex behaviors, expanding the possibilities of behavior analysis. The heuristics used to generate training data for the AI models automatically are described, and the models trained with these datasets exhibited high accuracy in detecting individuals in heterogeneous environments and assessing complex behavior. Ethoflow was employed for kinematic assessments and to detect trophallaxis in social bees. The software was developed as a desktop application and has a graphical user interface. In the Ethoflow algorithm, the AI processing is separate from the other modules, so kinematic measurements can run on an ordinary computer while complex behavior assessment runs on machines with graphics processing units. Ethoflow is a useful support tool for applications in biology and related fields.


Introduction
Behavioral studies are critical to understanding the fundamental aspects of animal ecology [1,2]. The assessment of animal behavior by visual inspection is limited and subjective and does not allow observations over long periods [3]. The use of computational tools in behavioral assessments allows accurate and long-term evaluations of animals [2,4]. For instance, automatic tracking systems obtain the animal's position in each frame of a digital video and record the Cartesian or polar coordinates of the movement [5].
From animals' coordinates over time, it is possible to calculate important kinematic measurements (e.g., the distance walked by the animal and meandering). Furthermore, evaluating complex behaviors (measurements based on characteristics extracted from specific animal behaviors) can provide relevant insights into animal biology. For example, the evaluation of complex behaviors among social insects, such as changes in trophallaxis (the complex social behavior of food exchange among nestmates), is important for understanding their response to stress agents such as pesticides [6,7].
Ethoflow offers the following features:
• has a graphical user interface (GUI) and has already been successfully applied in other studies [11,12];
• performs animal tracking in homogeneous or heterogeneous environments;
• can maintain the identities of individuals in animal group tracking;
• evaluates various kinematic variables (e.g., mean speed, turning angle, and group interaction);
• supports complex behavior assessment (e.g., mating, grooming, and trophallaxis).
A brief overview of recent tools involving tracking methods and AI techniques for animal behavioral assessments is presented in Section 2. The methods and results of the Ethoflow algorithm, applications in different setups and bioassays, and performance in processing speed and accuracy are described in Sections 3 and 4. Finally, the discussion and conclusions are presented in Sections 5 and 6, respectively.

Related Work
The tracking software Tracktor uses unsupervised machine learning to track animal groups while maintaining individuals' identities [13]. It exhibited advantages in processing speed and robustness compared to IdTracker [14] and ToxTrac [15]. Other tracking programs achieve outstanding performance using deep learning algorithms [16], including idtracker.ai [17] and TRex [18]. Both apply CNNs to track many animals simultaneously with high accuracy in maintaining individuals' identities.
In addition to tracking software, there are also tools for measuring the geometrical configuration of body parts, denoted pose estimation [16]. Deep learning approaches have also led to notable improvements in pose estimation software (e.g., DeepPoseKit, DeepLabCut, and LEAP) [19-21]. For instance, DeepLabCut uses transfer learning with a network pre-trained on large datasets (e.g., ImageNet). This approach can improve performance and reduce the number of required training examples [19]. However, it may come at the cost of slow inference due to the excessive parameterization of large networks. The LEAP framework uses a relatively simple 15-layer CNN to limit model complexity and maximize inference speed [20]. However, LEAP achieved limited accuracy compared to DeepPoseKit and DeepLabCut [21]. To improve the speed-accuracy tradeoff of DeepLabCut and LEAP, the DeepPoseKit toolkit was developed using Stacked DenseNet, a deep learning architecture that provides fast and accurate detection even at low spatial resolutions [21].
Performing behavioral assessments with tools that lack a graphical user interface (GUI), such as Tracktor [13], requires familiarity with programming, which can limit use by the general public. In this context, Ethoflow is user-friendly owing to its GUI. The available tracking software measures large animal collectives with high accuracy, especially the tools based on deep learning. However, this software operates by background subtraction or thresholding [13-15,17,18]. These approaches require video recordings of homogeneous environments and are not applicable in the field. In Ethoflow, we implemented thresholding by Otsu's method [22] (Section 3.1.3) to handle assessments in homogeneous environments. In addition, we implemented instance segmentation by Mask R-CNN [9] for evaluation in heterogeneous environments (Section 3.1.4).
With pose estimation toolkits, variables can be measured to predict complex animal behavior after posterior machine learning analysis [23]. Although the main goal of Ethoflow is tracking analysis, it also directly measures complex behaviors. After hyperparameter optimization, we defined a parsimonious CNN architecture to assess complex binary behavior (Section 3.1.8). Deep learning software is computationally costly and requires graphics processing unit (GPU) hardware; accordingly, it is not feasible on an ordinary computer. An interesting feature of our proposal is that the deep learning algorithms (used for analysis in heterogeneous environments and measurement of complex behaviors) are separate from the other modules in Ethoflow. Therefore, Ethoflow covers kinematic measurements on an ordinary computer and assesses more complex behavior with a GPU.

Software Features and Algorithm
The Ethoflow software was developed as a desktop application in the Python language, using the image library OpenCV [24] and the TensorFlow framework [25] with Keras for the AI models. Other libraries, such as SciPy [26], NumPy [27], Pandas [28], and scikit-learn [29], were also used. We recommend Python version 3.6.8 and Microsoft Windows 10 when running Ethoflow. The main input and output files, formats, descriptions, and quick usage examples are described in the Supplementary Materials (Table S1). The following subsections provide further details on the Ethoflow algorithm steps (Figure 1).

Input Video
Multi-threaded processing was implemented in the algorithm. In this procedure, the video is read in a thread independent of the processing thread, and the frames are stored in a stack (Figure 1; step 1.1; Appendix A). This avoids delays between frame reading and the other processing steps of the algorithm: frames are always available, yielding better rates in frames per second (fps).
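This decoupled reading/processing pattern can be sketched with the Python standard library alone. The integer "frames" below are stand-ins for decoded video frames (Ethoflow itself reads frames with OpenCV); the function and variable names are illustrative, not Ethoflow's.

```python
import queue
import threading

def read_frames(frame_source, stack, stop_token=None):
    """Producer thread: push frames onto the shared stack as they are read."""
    for frame in frame_source:
        stack.put(frame)
    stack.put(stop_token)  # signal the end of the video

def process_frames(stack, stop_token=None):
    """Consumer: frames are already buffered, so processing never waits on I/O."""
    processed = []
    while True:
        frame = stack.get()
        if frame is stop_token:
            break
        processed.append(frame * 2)  # placeholder for preprocessing/detection
    return processed

frames = [1, 2, 3, 4]             # stand-ins for decoded video frames
stack = queue.Queue(maxsize=128)  # bounded buffer caps memory use
reader = threading.Thread(target=read_frames, args=(frames, stack))
reader.start()
result = process_frames(stack)
reader.join()
```

A bounded queue keeps the reader from outrunning the processor indefinitely, which caps memory use on long videos.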

Figure 1. Flowchart of the Ethoflow algorithm. The numbers on the right side of the rectangles indicate the steps in the algorithm process. These steps are described in Sections 3.1.1-3.1.9. Diamond symbols indicate the option of using the deep learning algorithms (for analysis in a heterogeneous environment or measurement of complex behaviors) according to need. Thus, Ethoflow performs kinematic measurements on an ordinary computer and assesses more complex behavior with graphics processing unit hardware.


Preprocessing
In preprocessing (Figure 1; step 1.2), the video is processed to eliminate the regions that are not of interest to the user and transformed into a virtual primary color system (the XYZ color space). In this color space, the chromaticity (XZ) and luminance (Y) are coded separately, resulting in a more uniform response to luminosity variation. Then, grayscale transformation and normalization are applied to increase homogeneity between the frames. Smoothing is also applied through a transformation based on the median of neighboring pixels.

Object Detection
After preprocessing, manual and automatic image thresholds are applied to detect individuals (Figure 1; step 1.3). In manual thresholding, the classification of a pixel (x, y) is performed according to a global threshold defined by the user (g). One of the automatic thresholding options is based on Otsu's method [22], wherein the optimal threshold minimizes the within-class variance. This algorithm attempts to find a threshold value (k) that minimizes the within-class variances of the classes c₀ and c₁ (background and objects, respectively). If the set of gray levels of an image is L = {1, 2, ..., l} with pixel counts n₁, n₂, ..., n_l and N total pixels, then the probability of occurrence of a gray level is p_i = n_i/N. As the method is based on the normalized histogram, p_i ≥ 0 and Σᵢ p_i = 1. Thus, the probabilities of occurrence (ω_i), means (µ_i), and variances (σ_i²) of each class are

ω₀ = Σ_{i=1}^{k} p_i,  ω₁ = 1 − ω₀,
µ₀ = (1/ω₀) Σ_{i=1}^{k} i·p_i,  µ₁ = (1/ω₁) Σ_{i=k+1}^{l} i·p_i,
σ₀² = (1/ω₀) Σ_{i=1}^{k} (i − µ₀)²·p_i,  σ₁² = (1/ω₁) Σ_{i=k+1}^{l} (i − µ₁)²·p_i.

The within-class (σ_w²) and between-class (σ_b²) variances are

σ_w² = ω₀σ₀² + ω₁σ₁²,  σ_b² = ω₀ω₁(µ₁ − µ₀)².

The total variance is σ_t² = σ_w² + σ_b², and calculating the between-class variance improves the computational time because the variance between classes is based on first-order statistics (class means) [22].
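The Otsu search can be sketched in a few lines of NumPy; this is an illustrative implementation, not Ethoflow's own code. Because σ_t² is constant, maximizing σ_b² over candidate thresholds is equivalent to minimizing σ_w², and σ_b² needs only the class probabilities and means.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray, levels: int = 256) -> int:
    """Return the threshold k that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                      # p_i = n_i / N
    i = np.arange(levels)
    best_k, best_sigma_b = 0, -1.0
    for k in range(1, levels):
        w0, w1 = p[:k].sum(), p[k:].sum()      # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (i[:k] * p[:k]).sum() / w0       # class means
        mu1 = (i[k:] * p[k:]).sum() / w1
        sigma_b = w0 * w1 * (mu1 - mu0) ** 2   # between-class variance
        if sigma_b > best_sigma_b:
            best_sigma_b, best_k = sigma_b, k
    return best_k

# A bimodal "image": dark background around 20, bright objects around 200.
img = np.concatenate([np.full(500, 20), np.full(100, 200)]).astype(np.uint8)
k = otsu_threshold(img)
```

For this bimodal histogram, any threshold between the two modes separates background from objects, and the search lands on the first such value.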

Instance Segmentation
Instance segmentation (IS) [9] is another type of automatic segmentation available in Ethoflow for animal behavior assessments in heterogeneous environments (Figure 1; step 1.4). ResNet-101 [30] was the convolutional base used in this model, following a Mask R-CNN implementation [31]. In this model, the video frames pass through the convolutional base for feature extraction, leading to feature map generation. The region proposal network (RPN) is then applied, which provides several candidate boxes (ROI proposals). As several ROIs are generated, the model classifies these boxes into foreground proposals (animals) and background. ROI pooling is applied to standardize the foreground proposals' size, slicing each foreground into a fixed number of parts and applying max pooling. Finally, the boxes labeled as real animals are instantiated using a pixel-wise sigmoid function (Figure 2).

Post-Processing
In post-processing (Figure 1; step 1.5), morphological operations are applied to eliminate residues. First, dilation is used to fill parts that belong to the same individual but are detected separately. Second, the gradient is calculated and subtracted from the dilated frame to eliminate undesirable edges. Finally, erosion is applied to eliminate any noise erroneously detected as individuals.
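These three operations can be sketched with `scipy.ndimage`; Ethoflow's actual implementation presumably uses OpenCV's morphology functions, and the structuring elements and iteration counts here are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def postprocess(mask: np.ndarray) -> np.ndarray:
    """mask: binary array where True marks pixels detected as animal."""
    # 1. Dilation fills gaps so fragments of the same individual merge.
    dilated = ndimage.binary_dilation(mask)
    # 2. Morphological gradient (dilation minus erosion) marks edge pixels;
    #    subtracting it from the dilated frame removes undesirable edges.
    gradient = ndimage.binary_dilation(dilated) & ~ndimage.binary_erosion(dilated)
    no_edges = dilated & ~gradient
    # 3. A final erosion eliminates residual specks wrongly kept as individuals.
    return ndimage.binary_erosion(no_edges)

mask = np.zeros((12, 12), dtype=bool)
mask[1:5, 1:5] = True   # a real individual (4 x 4 blob)
mask[9, 9] = True       # one-pixel noise
clean = postprocess(mask)
```

The blob survives the pipeline while the isolated pixel is eroded away, which is the intended effect of the residue-removal step.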


Position and Identity
In step 1.6 of the algorithm, the animal contours (the pixels contained in the animal body) are identified. The contours are identified without establishing hierarchies while retaining only the extreme points of the contour line segments. The contour measurements, such as the area, length, and the ratio between the area and length, are calculated to restrict the contours that are identified based on the user's inputs.
When the number of contours identified is smaller than the number of individuals specified by the user, the nonhierarchical k-means clustering algorithm is applied to separate merged individuals. In this unsupervised machine learning algorithm, the number of groups (k) into which the set of pixels is grouped equals the number of individuals specified by the user. The initial k centroids are defined among the set of data points, and the next centroids are chosen according to the probability of spreading between the centers [32]. The contour points are compared with each centroid and allocated to the group where the Euclidean distance is minimal. Considering the algorithm inputs X = {x₁, ..., x_n} of n data points, the algorithm runs iteratively to find a set of centers C = {c₁, ..., c_k} that minimizes the function

φ_X(C) = Σ_{x∈X} d(x, C)²,

where d(x, C)² is the squared distance from x to the closest center in C. To choose the centroids, the first set of centers C₀ is randomly selected from the dataset. Then, for 2 ≤ i ≤ k, c_i is chosen to be equal to a data point x ∈ X according to the probability [32]

d(x, C)² / Σ_{x'∈X} d(x', C)².

A combinatorial optimization algorithm [33] is applied to maintain the identity of individuals, which provides the optimal identity assignment among the centroids of the animal contours. This is based on the Euclidean distance between the set of centroids of the objects in frame i+1, A = {a₁, a₂, ..., a_n}, and the set of centroids in frame i, B = {b₁, b₂, ..., b_n}. Considering that each a_i is assigned to only one b_j, the goal is to minimize the total cost of the assignments with respect to the distance matrix (D) between each a_i and b_j:

minimize Σ_{i=1}^{n} Σ_{j=1}^{n} d_ij·x_ij,
subject to Σ_i x_ij = 1 for each j, Σ_j x_ij = 1 for each i, x_ij ∈ {0, 1},

where d_ij is the cost (Euclidean distance) from centroid a_i to centroid b_j. There are n! ways to assign each a_i to a b_j; the optimal assignment is achieved iteratively with the following steps:
1. The minimum of each row is subtracted from the entire row.
2. The minimum of each column is subtracted from the entire column.
3. All zeros in the matrix are crossed out with the minimum possible number of lines.
4. If the number of crossing lines equals n, the optimal assignment is found. Otherwise, the smallest entry not crossed by any line is determined, subtracted from each uncrossed row, and added to each crossed column; the procedure then returns to step 3.
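In practice, this assignment problem can be solved with an off-the-shelf Hungarian-method solver. A sketch with SciPy (the variable names are ours, not Ethoflow's): build the Euclidean distance matrix D between the centroids of frame i and frame i+1, then ask for the minimum-cost one-to-one assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Centroids detected in consecutive frames (three individuals).
prev_centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # frame i
curr_centroids = np.array([[9.5, 0.5], [0.5, 9.5], [0.5, 0.5]])    # frame i+1

D = cdist(curr_centroids, prev_centroids)  # d_ij = ||a_i - b_j||
rows, cols = linear_sum_assignment(D)      # minimizes the total cost
identity = dict(zip(rows, cols))           # current blob -> previous identity
```

Each blob in the new frame inherits the identity of the previous-frame centroid it was matched to, so individuals keep their labels even after their detection order changes.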

Kinematic Variables
Among the identified and assigned animal contours, each individual's centroid (Cartesian position) is determined. Based on this Cartesian position (x, y) of individuals over time (video frames; f), various kinematic variables are computed in algorithm step 1.7. The distance that an animal walks during the video is the tracked distance (td) (Equation (12)):

td = Σ_{i=1}^{f−1} √((x_{i+1} − x_i)² + (y_{i+1} − y_i)²). (12)

Dividing td by the total time of the video gives the mean velocity. Ethoflow also calculates the maximum velocity achieved by the animal.
The average angle that the individual rotated in each frame (turning angle; ta) is computed as the absolute sum of the movement angles (°) divided by the number of video frames (f) (Equation (13)), while the meandering (M; the average angle that the individual rotated per unit of distance) is this sum divided by the tracked distance (td) (Equation (14)); the angle of the movement is the arctangent of the locomotion in the y (∆y_i) and x (∆x_i) planes:

ta = (Σ_i |θ_i|) / f, (13)
M = (Σ_i |θ_i|) / td, with θ_i = arctan(∆y_i/∆x_i). (14)
The movement of individuals is categorized based on user-defined values. When defining the analysis protocol, the user sets the thresholds for low (tl) and high (th) movement. Thus, considering the movement of an individual in each frame as mf: mf ≤ tl is counted as resting (the time associated with no activity of the individual); tl < mf ≤ th is counted as mean movement (the time in intermediate activity); and mf > th is counted as fast movement (the time in high activity). The sums of these counts are divided by the frames per second (fps) used to sample the video to obtain these values in time.
The user also sets a threshold for interaction (ti). An interaction is considered when individuals approach within a distance ≤ ti. The sum of all interactions of an individual is defined as its centrality. The network density (nd) is a measurement associated with group interaction (Equation (15)). A network is a set of items in which the vertices are defined as nodes (n) and the connections among them as edges (m) [34]. Here, the nodes are the individuals, and the edges represent the interactions among them:

nd = 2m / (n(n − 1)). (15)
If the user defines a region of interest (ri), Ethoflow computes how long the individuals stayed inside this region by counting, across the video frames (f), the frames in which each individual's position (coordinates x, y) falls within ri and dividing the count by the fps.

Considering the unit direction vector (u_i) of each individual (i), the proportion of the group polarized (p) at each frame is calculated as

p = (1/N) |Σ_{i=1}^{N} u_i|,

and the angular momentum (rotation; r) for each frame is based on the cross product (or vector product; ×) between the distance (d_i) of an individual to the center of mass of the group and its direction of movement (u_i):

r = (1/N) |Σ_{i=1}^{N} d_i × u_i|.

These parameters provide information on the global structure of the group [35], such as how much the individuals in a group are aligned (polarization; gp), how much the group displays low directional alignment between neighboring individuals (swarming; gs), and how much the group moves around its center of mass (milling; gm). The sums of these per-frame counts are divided by the fps to obtain these values in time.
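The per-individual measurements above can be computed directly from the coordinate arrays. An illustrative NumPy sketch (names and example values are ours, not Ethoflow's):

```python
import numpy as np

# One individual's coordinates per frame (a straight line, for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 0.0])
f = len(x)

dx, dy = np.diff(x), np.diff(y)
td = np.sqrt(dx**2 + dy**2).sum()                 # tracked distance
theta = np.degrees(np.arctan2(dy, dx))            # movement angle per step
ta = np.abs(theta).sum() / f                      # turning angle per frame
M = np.abs(theta).sum() / td if td > 0 else 0.0   # meandering

# Group polarization at one frame: norm of the mean unit heading of the
# N individuals (1 = fully aligned, 0 = no alignment).
headings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
headings /= np.linalg.norm(headings, axis=1, keepdims=True)
p = np.linalg.norm(headings.mean(axis=0))
```

For the straight-line trajectory, td is simply the path length and both angular measures are zero; the partially aligned group yields a polarization between 0 and 1.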

Complex Behavior Model
Ethoflow also measures complex behaviors using a CNN model (step 1.8). Different hyperparameter configurations were tested to define the CNN model (Figure 3) (Appendix B). In this step, the bounding box computed from the animal contours passes through the convolutional base (convolutional and max-pooling layers) for feature extraction. An activation function is applied to the output of each layer to introduce nonlinearity. Then, behavior classification is performed in the dense layers. When the complex behavior that the user is evaluating occurs, the network output equals behavior 1; otherwise, behavior 0. The sum of behavior occurrences is divided by the number of video frames to give the percentage of occurrence of the behavior. Thus, we are interested in determining the occurrence of binary behaviors that are detectable through spatial information.
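A minimal Keras sketch of a binary behavior classifier in this style; the actual Ethoflow architecture is the one defined in Figure 3, and the layer widths and input size below are illustrative assumptions. The optimizer and loss settings follow the paper's configuration (SGD with learning rate 0.0001 and momentum 0.9, binary cross-entropy), with ELU activations, batch normalization before the pooling layers, dropout after the inner dense layer, and a sigmoid output.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_behavior_cnn(input_shape=(96, 96, 1)):
    """Conv/max-pool feature extractor followed by dense classification layers."""
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="elu"),
        layers.BatchNormalization(),        # batch norm before max pooling
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="elu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),                   # vectorize for the dense layers
        layers.Dense(64, activation="elu"),
        layers.Dropout(0.5),                # dropout after the inner dense layer
        layers.Dense(1, activation="sigmoid"),  # behavior 1 vs. behavior 0
    ])
    m.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return m

model = build_behavior_cnn()
pred = model(tf.zeros((1, 96, 96, 1))).numpy()  # one cropped bounding box in
```

The sigmoid output is a probability in (0, 1); thresholding it at 0.5 yields the binary behavior label summed across frames.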

Output
In step 1.9, the behavioral parameters are automatically saved in a comma-separated values (csv) file in the path defined by the user. This file also contains the raw data, i.e., the coordinates (x, y) of movement in each frame. Thus, the user is free to calculate other kinematic parameters in addition to those automatically computed by the software. At the end of video processing, Ethoflow reports the detection rate (dr), which is the proportion of the video in which the individual was detected, minus false detections. A false detection is counted when an individual's between-frame velocity is greater than the 95th percentile of the group velocity across all frames. Given the instantaneous speed vector IS = (is₁, ..., is_n) and f frames in the video, dr is defined as

dr = |{is_i ∈ IS : is_i ≤ P.95}| / f,

where P.95 is the 95th percentile of the IS vector.
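Following that prose definition (between-frame speeds above the 95th percentile are discounted as false detections), dr can be sketched for a single, always-detected individual; the published metric uses the group-level percentile, so this single-track version is a simplifying assumption.

```python
import numpy as np

def detection_rate(xy: np.ndarray) -> float:
    """xy: (f, 2) array with one individual's coordinates per frame."""
    f = len(xy)
    speeds = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # instantaneous speeds
    p95 = np.percentile(speeds, 95)
    false_detections = int((speeds > p95).sum())          # implausible jumps
    return (f - false_detections) / f

# Smooth trajectory: no speed exceeds the 95th percentile, so dr = 1.0.
smooth = np.stack([np.arange(10.0), np.zeros(10)], axis=1)
# One implausible jump between frames 4 and 5 is discounted, so dr = 0.9.
jumpy = np.stack(
    [np.array([0, 1, 2, 3, 4, 54, 55, 56, 57, 58.0]), np.zeros(10)], axis=1
)
```

The percentile rule makes dr robust to occasional identity swaps or detector glitches, which show up as physically implausible speed spikes.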
Figure 3. Convolutional neural network architecture defined after hyperparameter optimization (Appendix B) to recognize complex animal behavior in Ethoflow. This model was configured with stochastic gradient descent (learning rate 0.0001, momentum 0.9) as the optimizer and binary cross-entropy as the loss function. Batch normalization was applied before the max-pooling layers. Dropout was applied after the inner dense layers. In the inner layers (convolutional or dense), the activation function was ELU. The dimensions (width × height × depth) of the feature map are given for each layer; the output dimensions of a layer are the same as the input dimensions of the next layer. In the flatten process, the data are transformed into a vector to enter the dense layers. In the last dense layer, a sigmoid function is applied, which gives the binary output.



Application in Heterogeneous Environments
Ethoflow was run on a machine with an Intel i7-9750H CPU (2.60 GHz × 12), 8 GB RAM, and an NVIDIA® GeForce® GTX 1660 Ti Max-Q GPU (6 GB). To apply Ethoflow in a heterogeneous environment experiment, we trained the IS model to detect the bee Melipona quadrifasciata using 1325 images with various heterogeneous backgrounds (Figure 4). In addition to these image data, inputs with bounding box positions, classes, and masks (pixel-wise positions of the animals) are required to train the IS model [9]. The manual generation of these inputs is a laborious task. Therefore, we developed a heuristic to generate these inputs automatically based on several random backgrounds and a video in homogeneous conditions in which objects are detected using manual segmentation or Otsu's method. Frames are randomly sampled from the video and pass through the preprocessing and object detection stages of the algorithm (Figure 5A). Then, the animals are "copied," and the contours are "pasted" onto random backgrounds (Figure 5B). Concomitantly, the bounding box, class, and mask of each animal are saved in a dictionary. Of all the data generated with the heuristic, 976 images (74%) were used for training, 249 (19%) for validation, and 100 (7%) to evaluate the classification using the average precision (AP) [36]. To obtain the AP, we calculated the intersection over union (IoU) of the predicted bounding boxes (i.e., the x, y coordinates of the upper-left corner and the width and height of the rectangular box around the object of interest) and the target bounding boxes. Based on the IoU, the precision (Equation (23)) and recall (Equation (24)) can be calculated using the true positives (TP), false positives (FP), and false negatives (FN) among the detected objects (DO) at a determined IoU threshold (x) (Equation (25)):

P = TP / (TP + FP), (23)
R = TP / (TP + FN). (24)

There is a tradeoff between precision and recall: the higher the recall, the more the model tends to find all the target objects, i.e., a low FN value. However, an increase in recall tends to decrease the precision, as it increases FP. Considering equally spaced recall levels n = (0, 0.1, ..., 1.0), interpolation is performed using the highest precision value at each recall level. Then, the AP is obtained from the interpolated precision values P_interp(r):

AP = (1/11) Σ_{r ∈ {0, 0.1, ..., 1.0}} P_interp(r), with P_interp(r) = max_{r' ≥ r} P(r').
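The IoU and the 11-point interpolated AP can be sketched as follows (illustrative code, not the paper's evaluation script; boxes use the x, y upper-left corner plus width and height format described above):

```python
import numpy as np

def iou(box_a, box_b):
    """Boxes as (x, y, w, h) with (x, y) the upper-left corner."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    iw = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))  # intersection width
    ih = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))  # intersection height
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def average_precision(recalls, precisions):
    """11-point interpolated AP: mean over r in {0, 0.1, ..., 1.0} of the
    highest precision achieved at recall >= r."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        at_least = precisions[recalls >= r]
        ap += at_least.max() if at_least.size else 0.0
    return ap / 11

# Two 10 x 10 boxes offset by (5, 5): intersection 25, union 175, IoU = 1/7.
overlap = iou((0, 0, 10, 10), (5, 5, 10, 10))
```

Interpolating with the maximum precision at each recall level makes AP insensitive to the local zigzag of the raw precision-recall curve.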

Application in Complex Behavior
Ethoflow was also applied to learn the detection of trophallaxis, the complex social behavior of food exchange among nestmates, in M. quadrifasciata. Thus, 1270 labeled images were generated (724 for non-trophallaxis and 546 for trophallaxis) ( Figure 6). In this dataset, 70% of the data was used for training, while 20% was used for validation. Another sample dataset (10%) was used to assess the classifier's performance based on the global accuracy from the confusion matrix, Kappa index, and Z-test (5%).

The labeled images used to train the CNN model for recognizing trophallaxis were also generated automatically through a heuristic. When bees perform trophallaxis, they position themselves in front of each other and exchange food. Based on this predictable positioning, the heuristic relies on the individuals' body area and length. Initially, the program estimates the median (M) and standard deviation (sd) of the body area (a) and length (l) in frames where there is no crossing (no meeting between individuals). Subsequently, the software obtains the images (b) from the video and labels them as trophallaxis if:

Application in Behavioral Bioassays
A behavioral assay was performed with the two stingless bee species. Bees were collected from four colonies each of M. quadrifasciata and Partamona helleri in Viçosa, State of Minas Gerais, Brazil (20°45′ S, 42°52′ W). The collected bees were kept for 1 h in the laboratory under conditions similar to those found in their colonies (28 °C and 80% relative humidity in total darkness) [37]. Subsequently, bee behavior was recorded in arenas (Petri dishes, 9 cm diameter, 2 cm height) for 15 min with a digital video camera (HDR-XR520V, Sony Corporation) at 30 fps and high definition (1920 × 1080 pixels). Behavioral bioassays were performed in a room with artificial fluorescent light at 25 ± 3 °C and 70 ± 5% relative humidity. Bioassays were performed with 37 replicates, each replicate corresponding to a group of five bees of one of the two species. The kinematic variables measured with Ethoflow included centrality, polarization, milling, resting, meandering, and tracked distance. For centrality, an interaction was counted when individuals approached within ≤1.41 cm of each other. An instantaneous tracked distance ≤0.046 cm frame⁻¹ was counted as resting. Centrality was the response variable in a model with an interaction between polarization and bee species, and in a model with an interaction between milling and bee species. Meandering was the response variable in a model with an interaction between resting and bee species. In addition, the tracked distance was compared between the bee species. These models were fitted as generalized linear models (GLMs) with a gamma distribution, an adequate distribution for continuous data in which the variance increases with the square of the mean [38]. When an explanatory variable had no significant effect, the model was simplified, and the results were plotted as a function of the significant variable.
A toxicological bioassay was also performed with M. quadrifasciata to demonstrate trophallaxis recognition under pesticide stress. Acclimated bees were orally exposed to the commercial formulation (cf) (water-dispersible granules at 700 g a.i. kg⁻¹, Bayer CropScience, São Paulo, SP, Brazil) of the neonicotinoid imidacloprid at a sublethal concentration (0.2 mg cf L⁻¹). This concentration is 300× lower than that recommended for controlling the whitefly Bemisia tabaci (60 mg cf L⁻¹) [39]. Imidacloprid is commonly associated with bee decline and causes motor impairments in bees [40]. After 3 h of exposure, the bees were filmed as previously described, and trophallaxis behavior was quantified using Ethoflow. The trophallaxis response (n = 60) to the pesticide was assessed using a GLM with a Poisson distribution, a suitable distribution for count data [38].

Performance
Using videos varying in resolution, number of individuals, animal species, and background (Supplementary Materials; Figure S1), we evaluated parameters associated with Ethoflow's performance and compared it with other tracking software with satisfactory processing rates, based on the processing speeds reported by Sridhar et al. (2019) [13]. A multiple regression model was applied to assess whether the fps rate responded to the interaction between resolution and the number of individuals.
The effect of centrality and the number of individuals on fps was assessed using a GLM with a gamma distribution. Analysis of covariance (ANCOVA) was performed to assess whether the detection rate varied with the interaction between the number of individuals and the background (homogeneous vs. heterogeneous).

Heterogeneous Environment and Complex Behavior
Ethoflow detected the tested bees with high precision and few false positives in heterogeneous environments (average precision ± standard error = 0.916 ± 0.02; Figure 7A). In addition, in the complex behavior assessment, the CNN model exhibited high accuracy in the validation process (global accuracy = 92.13%, Kappa index = 0.84, Z = 24.74, Figure 7B).
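The validation metrics reported here (global accuracy from the confusion matrix and the Kappa index) can be computed with scikit-learn; the labels below are illustrative, not the study's actual held-out test set.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Hypothetical predictions on a held-out test split; 1 = trophallaxis.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)        # global accuracy
kappa = cohen_kappa_score(y_true, y_pred)   # agreement beyond chance
cm = confusion_matrix(y_true, y_pred)       # rows: true, columns: predicted

print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
print(cm)
```

On these toy labels the global accuracy is 0.80 and the Kappa index 0.60; the Z-test on Kappa mentioned in the text is a separate significance test not shown here.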


Performance
In homogeneous backgrounds, Ethoflow achieved a median rate of 32.5 fps. This is a satisfactory processing speed compared with other tracking software that does not use AI in its algorithms (e.g., idTracker = 5.5 fps; ToxTrac = 28.6 fps; Tracktor = 25.7 fps) (Figure 9). A statistical interaction was observed between video resolution and group size in the fps rate (F1,130 = 12.81, p = 0.0005, Figure 10A). Quantification in the heterogeneous environment was not influenced by video resolution or the number of individuals (F1,28 = 0.81, p = 0.37, Figure 10B), and the fps rate in the heterogeneous environment (0.386 fps) was lower than in homogeneous backgrounds. The fps decreased as the centrality of individuals increased (F1,38 = 81.24, p < 0.0001, Figure 10C). There was no significant effect of the number of individuals (F1,37 = 0.009, p = 0.93) and no interaction between centrality and the number of individuals (F1,36 = 1.62, p = 0.21). In addition, the software exhibited high detection rates, with a significant interaction between the number of individuals and the type of background (F1,94 = 137.85, p < 0.0001, Figure 10D), where an increase in the number of individuals had a greater influence in heterogeneous environments.


Discussion
We developed the Ethoflow software using computer vision, machine learning, and deep learning techniques. The program showed consistent processing speed and accuracy. In addition to enabling the study of complex behaviors, Ethoflow allows the multivariate assessment of kinematic behaviors. Multivariate assessment of behavioral traits can bring important insights into animals' ecological aspects, for instance, in toxicological assessments and animal behavior studies [41][42][43]. Some modern software programs that use deep learning to evaluate behavior demand powerful machines with GPUs [17,18,44], which makes routine laboratory analysis on ordinary computers difficult. In the Ethoflow algorithm, the AI processing is separate from the other modules, enabling kinematic measurements on an ordinary computer and more complex behavior assessment using a GPU. Therefore, an ordinary computer (e.g., with a 3.60 GHz central processing unit) is sufficient to run Ethoflow for kinematic measurements in homogeneous environments. For complex behavior assessments and heterogeneous environments, a computer with a GPU is recommended to speed up processing.
Unraveling complex behaviors can be hindered by software that lacks adequate tools, is complex to set up, or has no GUI and thus requires familiarity with command-line tools [13,17,45], limiting its use by the general public. Thus, there is a demand for powerful software with a simplified interface that, at the same time, expands the ability to study more complex behaviors. In this context, Ethoflow is user-friendly and, owing to its GUI, does not require command-line input. Additionally, Ethoflow does not require great familiarity with computational tools and has multidisciplinary applications.
During processing in homogeneous backgrounds, the effects of resolution and the number of individuals on fps demonstrated that the frame-reading step (higher resolution, longer reading time) and identity calculation (more individuals, more combinations) can decrease processing speed. Nonetheless, the implementation of multi-threaded reading [46] in Ethoflow mitigates these problems. This type of reading avoids delays between calculating identities and reading frames, because there are always frames available in the queue for immediate identity calculation. This procedure yields satisfactory fps rates compared with other software available for the same purpose [13][14][15]. The identity calculation step occurs when at least two individuals interact. Thus, fps was negatively correlated with centrality because greater interaction requires more identity calculations. The number of individuals (mainly in groups > 3) had no consistent effect on fps, probably because the amount of interaction between individuals depends on the density of the group (i.e., the number of individuals per unit of space) and not only on its size (i.e., the number of individuals) [35].
In heterogeneous environments, neither video resolution nor the number of individuals influenced processing, and the fps rate was lower than in homogeneous backgrounds. This shows that the main processing bottleneck is the detection of animals by the instance segmentation model. Real-time processing (~30 fps) has not been achieved with instance segmentation; processing at around 5 fps has been reported even with a robust GPU [9]. Even though real-time processing is not possible with instance segmentation, this functionality gives Ethoflow great advantages, given the various possibilities of analysis in heterogeneous environments. Furthermore, video acquisition in Ethoflow is independent of processing, which enables real-time video recording.
The reliable detection rates obtained with Ethoflow demonstrate that the software is sufficiently robust for application in different assays. Moreover, using the heuristic to generate training data automatically made it possible to obtain a model with high average precision. However, in heterogeneous environments, there was a more pronounced decrease in the object detection rate; increasing the amount of training data can therefore improve detection [47]. With our heuristic, increasing the amount of data does not take much of the user's time, although it could increase computational training and inference time. Another alternative would be to increase the quality of the data with manually annotated images. One tool that can be used to label images manually is the VGG Image Annotator (VIA) [48].

Conclusions
This study describes the development and applications of the computer vision and artificial intelligence-based software Ethoflow. The software is suitable for multivariate kinematic evaluations, behavioral assessments in heterogeneous environments, and tracking individuals in groups while maintaining their identities, and it can be trained to learn complex behaviors. Ethoflow was applied to biological assessments and efficiently detected significant differences between bee species and under pesticide stress. Some possibilities for data analysis and representation were demonstrated with Ethoflow's output. The deep learning models were implemented to expand the possibilities of animal behavior analysis to other fields, including the behavioral monitoring of domestic animals in precision livestock farming. According to demand, Ethoflow will be constantly updated with improvements and new functions, such as tracking in three dimensions. Therefore, Ethoflow is a helpful support tool for technical and scientific applications in biology and related fields.

Patents
This software is registered with the Brazilian National Institute of Intellectual Property (INPI, Ministério da Economia, Brazil, reg. no. BR 51 2020 000737-6).

Informed Consent Statement: Not applicable.
Data Availability Statement: Ethoflow-related files, including source code and video tutorials, are available at https://sites.google.com/view/ethoflow (accessed on 6 May 2021). The datasets generated for this study are available at https://github.com/bernardesrodrigoc/Ethoflow (accessed on 6 May 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The video is read by a thread that is independent of the processing thread, and the frames are stored in a queue. This queue is a linear data structure that stores items in a "FIFO" (First In, First Out) manner (Figure A1). Frames are exchanged between the reading and processing threads. This increases processing speed, as frames are always present in the queue and ready for processing, and no time is spent waiting for the next frame to be read.
Figure A1. Representation of the processing steps in the multi-threading feature of Ethoflow.
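This producer/consumer scheme can be sketched with Python's `threading` and `queue` modules. The list of integers and the doubling step are stand-ins for video decoding and identity calculation; they are not Ethoflow's actual code.

```python
import queue
import threading

def reader(frame_source, q, sentinel=None):
    """Reading thread: pushes frames into a FIFO queue as fast as they
    can be decoded, independently of the processing thread."""
    for frame in frame_source:
        q.put(frame)      # blocks only if the queue is full
    q.put(sentinel)       # signal end of video

def processor(q, results, sentinel=None):
    """Processing thread: frames are already waiting in the queue, so no
    time is spent blocked on decoding the next frame."""
    while True:
        frame = q.get()
        if frame is sentinel:
            break
        results.append(frame * 2)   # stand-in for identity calculation

# Simulated video of 5 frames (a real reader would wrap cv2.VideoCapture).
frames = [1, 2, 3, 4, 5]
q = queue.Queue(maxsize=64)
results = []

t_read = threading.Thread(target=reader, args=(frames, q))
t_proc = threading.Thread(target=processor, args=(q, results))
t_read.start(); t_proc.start()
t_read.join(); t_proc.join()
print(results)   # frames processed in FIFO order
```

Bounding the queue (`maxsize=64`) keeps memory use constant when reading outpaces processing, while still guaranteeing the processor rarely waits.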


Appendix B
Different hyperparameters were tested to find a suitable convolutional neural network (CNN) model. This model was trained to recognize trophallaxis in the stingless bee Melipona quadrifasciata and can be used to recognize many binary behaviors. Using validation accuracy as the response variable, the interaction between dropout and the activation function of the network output was evaluated (Figure A2A). Dropout is a regularization technique that randomly zeroes out the input units of a layer, breaking fixed patterns to avoid overfitting [49]. Here, the best response was obtained by deactivating neurons with a probability of 0.2 (dropout = 0.2) (Figure A2A). Higher rates, despite reducing overfitting, decreased accuracy. The output activation function that presented the best response was the sigmoid function (Equation (A1)) (Figure A2A). This function maps the network output to a probability between 0 and 1. As the task involves binary behavior classification, the sigmoid function is expected to generate the best output.
f(x) = 1/(1 + e^(−x)), where x is the output from the previous hidden layer. (A1)
The activation function of the inner layers and the optimization method were also evaluated. The best result was obtained with the exponential linear unit (Elu) function, along with mini-batch stochastic gradient descent (mini-batch SGD) as the optimizer (Figure A2B). The Elu function (Equation (A2)) is an identity function for positive values and tends smoothly to −α for negative values. It saturates for very small (extremely negative) values, resulting in an activation average close to zero. Thus, Elus tend to normalize the layer's output, accelerate learning, and increase accuracy [50].
In mini-batch SGD, the term stochastic refers to the random sampling of batches from the data. Based on the loss value, the optimizer updates the network's trainable parameters (weights). This is done by calculating the gradient of the loss with respect to the current weights of the network. Mathematically, this process is performed by deriving the cost function and evaluating the gradient at the current weights. The weights are then updated in the direction opposite to the gradient, reducing the loss slightly with each batch. Since the classification is binary (the network output is a probability), binary cross-entropy (Equation (A3)) was used as the cost function. Cross-entropy is a measure of the distance between the expected result y and the predictions p(y).
H_p(q) = −(1/n) ∑_{i=1}^{n} [y_i · log(p(y_i)) + (1 − y_i) · log(1 − p(y_i))], where n is the number of network outputs. (A3)
As an increase in the learning rate tended to decrease accuracy (Figure A2C), the lowest rate tested (0.0005) was retained. The learning rate determines the magnitude of each gradient descent step; at high learning rates, network updates can become highly erratic.
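Equation (A3) translates directly into a few lines of NumPy; the clipping constant below is a numerical-stability assumption, not part of the equation.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Binary cross-entropy as in Equation (A3): the mean distance between
    the expected labels y and the predicted probabilities p(y)."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0); the bound 1e-7 is an illustrative choice.
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction yields a small loss,
# a confident wrong prediction a large one.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # low loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # high loss
```

The optimizer reduces this quantity batch by batch, which is exactly the gradient-descent update described above.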
The network interacts with the data in mini-batches; i.e., it does not process the entire dataset simultaneously, but rather divides the data into small batches. Although this hyperparameter is important in CNN models [51], it did not play an important role in our model (Figure A2D). Therefore, one of the smallest values (batch size = 5) was selected to accelerate the network's training time. In many statistical models, the normalization of variables is important (e.g., to avoid the predominance of some variables due to different scales). To this end, batch normalization layers were used in the CNN model. These layers adaptively normalize the data as the mean and variance change during training [52].
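Putting the selected hyperparameters together (dropout = 0.2, Elu activations, sigmoid output, SGD at learning rate 0.0005, batch normalization, binary cross-entropy), a binary-behavior CNN of this kind can be sketched in Keras. The input size, filter counts, and layer arrangement below are illustrative assumptions, not the exact Ethoflow architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small binary-behavior CNN using the hyperparameters selected above.
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),          # illustrative input size
    layers.Conv2D(16, 3, activation="elu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="elu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.2),                        # best dropout rate found
    layers.Dense(64, activation="elu"),
    layers.Dense(1, activation="sigmoid"),      # probability of trophallaxis
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.0005),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# Training would use the labeled images with the chosen mini-batch size:
# model.fit(train_images, train_labels, batch_size=5, epochs=...)
```

The single sigmoid unit pairs with binary cross-entropy, matching Equations (A1) and (A3).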

[Table A1: number of layers vs. validation accuracy (mean ±)]
Using the hyperparameters defined above, the network's size (number of layers) was also evaluated, and better accuracy was obtained with smaller architectures (Table A1). While more layers (a higher-dimensional representation space) allow the network to learn more complex representations, they increase the network's computational cost; accordingly, model L7 was employed. Data augmentation is a powerful technique for mitigating overfitting. Using the defined architecture (model L7 in Table A1), different data augmentation configurations were tested (Table A2). Excessive data augmentation reduces accuracy, while sets with little augmentation increase overfitting. Thus, set 3 was deemed the best option to address overfitting.
Table A2. Sets tested for data augmentation. In all tests, horizontal flip and fill mode = "nearest" were used. Model L7 in Table A1 was used for these tests.
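An augmentation configuration in the spirit of Table A2 can be set up with Keras's `ImageDataGenerator`. Horizontal flip and fill mode "nearest" are stated in the text; the rotation, shift, and zoom values below are illustrative assumptions, since the exact parameters of set 3 are not reproduced here.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Horizontal flip and fill_mode="nearest" were used in every tested set;
# the remaining values are illustrative, not the paper's exact "set 3".
augmenter = ImageDataGenerator(
    horizontal_flip=True,
    fill_mode="nearest",
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)

# Augmented batches are drawn on the fly during training, e.g.
# model.fit(augmenter.flow(images, labels, batch_size=5), ...)
images = np.random.rand(4, 64, 64, 3)   # dummy images for illustration
batch = next(augmenter.flow(images, batch_size=2, shuffle=False))
```

Because augmentation is applied per batch at training time, the stored dataset stays at its original size while the network effectively sees many transformed variants of each labeled image.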