Parallelization of the Honeybee Search Algorithm for Object Tracking

Object tracking refers to the relocation of specific objects in consecutive frames of a video sequence. Presently, this visual task is still considered an open research issue, and the computer science community attempted solutions from the standpoint of methodologies, algorithms, criteria, benchmarks, and so on. This article introduces a GPU-parallelized swarm algorithm, called the Honeybee Search Algorithm (HSA), which is a hybrid algorithm combining swarm intelligence and evolutionary algorithm principles, and was previously designed for three-dimensional reconstruction. This heuristic inspired by the search for food of honeybees, and here adapted to the problem of object tracking using GPU parallel computing, is extended from the original proposal of HSA towards video processing. In this work, the normalized cross-correlation (NCC) criteria is used as the fitness function. Experiments using 314 video sequences of the ALOV benchmark provides evidence about the quality regarding tracking accuracy and processing time. Also, according to these experiments, the proposed methodology is robust to high levels of Gaussian noise added to the image frames, and this confirms that the accuracy of the original NCC is preserved with the advantage of acceleration, offering the possibility of accelerating latest trackers using this methodology.


Introduction
Object tracking deals with the problem of following a specific object from a video source in a frame by frame basis, and tries to solve several challenges such as the occlusion of objects [1,2]; and the abrupt changes in light, color or size [3]. Different approaches, known as trackers, have been proposed in order to solve the object tracking problem, but no definitive answer has been found [3][4][5][6][7]. Usually, researchers compare trackers against each other regarding speed and accuracy with focus on trackers which provide the most accurate results but have considerable low speeds.
In general, trackers are slower because they perform more operations to provide better accuracy. This apparent negative correlation between accuracy and speed makes it challenging to select a suitable tracker for any given application [8]. In summary, the studied problem in this work is how to accelerate a specific tracker without having negative effects on their accuracy.
The Graphics Processing Unit (GPU) is a useful tool to accelerate computational tasks in general, including object tracking [9,10]. This device thrives at the moment of accelerating trackers such as the Normalized Cross-Correlation (NCC) [11][12][13]. Although NCC is not a recent tracker it is one "enthusiasm" of the dance performance. The dancer makes pauses for antennal contact with peers, to transfer some harvested samples of nectar. The angle recorded by bees during their trajectory to the resource, relative to the sun's azimuth (the horizontal component of direction towards the sun), is mirrored following the angle on the comb using the waggle portion of the dance.
The described process can be abstracted and represented using pseudocode, which can be useful to develop an algorithm. Algorithm 1 shows a very general view of the search process employed by honeybees. There are three main phases: exploration, recruitment, and harvest. The process is inherently parallel and the algorithm could be improved by embracing parallel processing technologies.

The Honeybee Search Algorithm
The following section provides more details of HSA used as the basis for this work. The algorithm was initially proposed by Olague and Puente [14][15][16] defined as a meta-heuristic that combines Evolutionary Algorithms and Swarm Intelligence with individuals that emulate the behavior of foraging honeybees. As displayed in Algorithm 1, there are three main phases: exploration, recruitment, and harvest. The exploration phase consists of sending search agents called explorers to random points of a specific multidimensional space; these points or possible solutions are called food sources. We define the richness or attractiveness of the food sources through a function called fitness function. The next phase, recruitment, is where the explorers return to the hive and try to convince other honeybees to follow them to exploit the food source they have found. In the last phase, harvest, a massive search covers the space where the surroundings of the best food sources receive considerable greater attention. An evolutionary algorithm similar to Evolution Strategies (µ + λ) is at the core of in the harvest and exploration phases; mutation and crossover are used as the leading evolutionary operators [29].
The exploration stage (Algorithm 2) is actually a modified Evolution Strategy (ES) of the type µ + λ. These evolutionary algorithms serve as a simulation of the natural process in which the honeybees perform asynchronous exploration of the space in search of the food source. The exploration stage starts creating a random population µ e of multidimensional points called explorers, which are then transformed into a new population λ e using the mutation, crossover, and random steps. The best individuals are selected using a fitness value as criteria, obtained using a fitness function. The importance of mutation commonly distinguishes this kind of Evolutionary Algorithm as the main source of change between generations (instead of crossover). In this case, we follow the approach proposed by Boumaza and Louchet [30], with some changes in operators used to evolve the population.
λ e sons are generated by three different methods: α e sons are generated by Polynomial Mutation [29], β e sons are generated by crossover using Simulated Binary Crossover [29] and γ e sons are generated randomly. It should be noted that α e + β e + γ e = λ e . Another important difference between the common ES and the one used in this algorithm is that a sharing operator [31] is used to penalize individuals that concentrate in a small space by reducing the individual's fitness value. The algorithm repeats this stage until a given number of iterations or generations. The recruiting phase is easily understood: we divide the total number λ of forager agents into groups. The proportional size p i of those groups is relative to the fitness value ( f it i ) of each of the individuals that were selected from the exploration phase, as seen in Equation (1). Not all explorers may have assigned foragers to harvest their location; these might not have found a food source that is good enough.
The harvest phase (Algorithm 3) is also a (µ + λ)-ES where the offspring is generated in the exact same way as in the exploration phase (λ h = α h + β h + γ h ). The differences are that, instead of beginning with random points, the starting points are the results from the exploration phase; and the number of generations and individuals (µ h , λ h ) may change. This phase means the algorithm performs an ES search for every good food source found during exploration. We compute the number of foragers assigned to each explorer trough the factor r i , shown in Equation (2), where µ h is the total size of the foragers' population.

Crossover
The Simulated Binary Crossover operator (SBX) [29] is used to generate a part of the offspring populations. The SBX operator emulates the working principle of the single point crossover operator on binary strings. From two parent solutions P 1 and P 2 , it creates two children C 1 and C 2 using Equations (3)- (5).
The spread factor β is dependent on a random variable u ∈ [0, 1] and on a user-defined non-negative value η c that characterizes the distribution of the children in relation to their parents.

Mutation
Mutation is applied to each of the real variables of the individual using a polynomial distribution perturbation [32]. The mutation operation modifies a parent P into a child C using the boundary values P (LOW) and P (UP) of each of the decision variables using Equations (6) and (7).

Normalized Cross-Correlation
As discussed in Section 1, the Normalized Cross-Correlation (NCC) is one of the less complicated approaches to object tracking, but is top tier compared to other trackers which used the ALOV video dataset for performance evaluation [3]. We use NCC as fitness function for HSA and describe the adaptations in further sections.
NCC is used to measure the correlation between two image templates, i.e., NCC quantifies how similar these image templates are [33][34][35]. The grade of similarity is a scalar number, the greater it is, the more correlated these images are. Usually, the aim of using NCC is to find the point (u, v) where certain template t is located within a certain image frame I (illustrated in Figure 1). In order to find the optimal (u, v), every suitable point (u, v) in I must be checked using NCC. The range of possible values for (u, v) is limited by the dimensions w × h of I as well as the dimensions m × n of t. The range of u extends from 0 to w − m while the range of v goes from 0 to h − n. The formula used to compute NCC for any given (u, v) is based on the squared Euclidean distance. Equation (8) shows how the squared Euclidean distance d 2 I,t (u, v) is obtained. The term I(x, y) refers to the value used to describe the gray level of pixel located at point (x, y) within frame I. Similarly, t(x − u, y − v) is the value of gray assigned to the pixel (x − u, y − v) of template t. Note (x, y) is any point contained by the rectangular window that has point (u, v) as the top left corner and (u + m, v + n) as bottom right corner, meaning u ≤ x ≤ u + m and v ≤ y ≤ v + n. Such inequalities ensure the top left corner of template t will always be (0, 0) while the bottom right corner will always be (m, n).
The expression used to obtain d 2 I,t (u, v) may be expanded, as in Equation (9). This manipulation reveals that the result depends on three terms. The term ∑ x,y t 2 (x − u, y − v), which remains constant, is ignored since it does not change for two different points (u, v). On the other hand, the term ∑ x,y I 2 (x, y) does change for different values (u, v), but this change reflects the fact that the selected point is different rather than measuring the similarity between pixels I(x, y) and t(x − u, y − v). Only the term ∑ x,y I(x, y)t(x − u, y − v) is useful to get the cross-correlation c(u, v) (Equation (10)) and thus it is established as a starting point for NCC.
NCC presents an improvement over c(u, v), since it provides resistance to changes across the frames of a video sequence. For example, a difference in the illumination of the targeted object that occurs between frames of the video. Another common problem that occurs when using c(u, v) is that bright spots in the image or template can cause confusion by provoking greater correlation values than actual features of the targeted object. In order to give NCC a greater resistance to the mentioned errors, the formula for c(u, v) is altered by normalizing the image and feature vectors to the unit length (see Equation (11)). The mean of all the pixels (x, y) in the region were u ≤ x ≤ u + m and v ≤ y ≤ v + n is denoted asĪ u,v and found using Equation (12). The termt is the mean of the pixels contained in t and requires Equation (13) to be evaluated.
The value γ(u, v), called NCC, ranges between −1 and 1 [35]. As the value approaches 1, the certainty of finding t in point (u, v) with great robustness is greater. However, NCC is not capable of detecting certain transformations such as rotation and scale changes in the targeted object.

NCC as a Fitness Function
Each point (u, v) in the image has a value γ(u, v). The fitness function in this case is γ(u, v), while the individual is a 2D point (u, v). In this case there are not different fitness functions for foraging and exploration phases. The formal definition of the optimization problem is as follows: In this work, we made small changes to the original proposal of HSA to adapt to the specific problem of object tracking. The exploration and harvest phases of the algorithm are left almost intact, as described by [14][15][16]. A significant difference is in the sharing operator. Since object tracking requires an optimal 2D point rather than a sparse cloud of 2D points, thus we can omit the sharing operator to ease the convergence to the optimal value. Also, the initial random population of the exploration phase is not entirely random; it is around a small window that surrounds the last known position (u, v) of the object.
The recruitment phase was modified to limit the section of the feasible region to be searched. Originally, small search spaces were defined around each food source found during the exploration phase ( Figure 2a). This was changed so that a unique subsection of the feasible region is defined. This single sub-region ( Figure 2b) is defined to contain all explorers within it. Ω Ω

Workflow of Parallel Experiments
The following section provides information about several different tools that were used in conjunction to perform the experiments and evaluate the raw data that was obtained. Figure 3 shows the overall workflow. There are two main actors: the CPU and the GPU. It begins when the CPU performs operations that are required before the parallel processing is performed. Then, the GPU starts working to perform the search operations. Finally, results are sent back from the GPU to the CPU and then presented to the user.  Observe that the stage "Random number generation" (described in Section 2.5.2) is only used in program that uses HSA (BEE) because the heuristic procedure requires them. Another important note is that the stage of "Parallel Processing" changes drastically between programs. The parallel processing of program BEE uses Parallel HSA while the second program performs an exhaustive search (NO BEE), using NCC.

Hardware
The GPU used in most of the experiments is an AMD Radeon R9 270, and it was programmed using OpenCL. The clock speed of the GPU is of 925 MHz, and the communication with the CPU is made through a PCI Express x16 slot that transmits about 16 GB/s. The architecture of the GPU is designed for parallel operations that require no synchronization [36]. The 1280 ALUs of the GPU can work independently, but not in a coordinated manner. If the program requires synchronization, the available resources are limited. The maximum number of synchronized work-items, also called the maximum work-group, is of 256 parallel work-items [37].
The CPU used to provide instructions to the GPU, image preprocessing, random number generation, and displaying results is a DELL XPS 8700 with an Intel Core i7-4790. The base frequency of the processor is 3.60 GHz. We use OpenCV for image manipulation. The CPU also has 15.6 GiB of RAM available.
The experiments of Section 3.7, which tested state of the art trackers, requiered a second GPU with the following characteristics: NVIDIA GeForce GTX 1050 with 640 cores and 1354MHz clock speed.

Random Number Generation
The processors of the GPU have a limited function set. This situation presents certain difficulties, such as the lack of native support for random number generation, required for several vital activities of HSA. The solution we implemented was to pre-generate random number arrays of different distributions with the CPU and sent them along with the frame image data to the GPU. Four arrays are generated and sent from the CPU to the GPU: 1. uniformly distributed integers (using OpenCV) 2. uniformly distributed real numbers (using OpenCV) 3. beta distribution for SBX [38,39] 4. and delta distribution for Polynomial Mutation [39].

GPU Kernel Parameters
The GPU kernel is the program that is executed by every work-item and can receive configuration parameters if required. In this case, we sent different configuration parameters for each program, Table 1 provides a summary. Program BEE requires several values related to HSA. Many of the assigned parameters were optimal values found by Olague and Puente [14][15][16]. Those parameters are: η m , η c , α e /λ e , β e /λ e , γ e /λ e , The number of generations for the exploration phase was selected using an experiment (described in Section 3.4), the harvest phase has the half of the number of generations used for exploration.

Honeybee Search Algorithm Parallelization
Each honeybee is logically composed of q GPU work-items. The maximum number of individuals that can work simultaneously is p/q, where p is the maximum work-group. To take advantage of these p/q individuals, the size of populations µ e , λ e , µ h and λ h are all set to the same size of p/q.
When evaluating the fitness function, all work-items that belong to a certain individual have work to do. However, there are activities in HSA that require a lower level of parallelization and only use one processor per individual. The list of activities that fall in this category are: • generation of the initial random µ e population for the exploration phase • generation of λ populations using mutation, crossover and random exploration • merge µ and λ populations • selection of the µ best from µ + λ populations Before the selection of the µ best from the µ + λ population, said population has to be sorted by fitness value. The sorting algorithm used in the described implementation is a parallelized merge-sort, that uses one work-item per honeybee [40].
The only phase not parallelized at all is the recruitment phase. One processor keeps working while the rest remain idle. This processor assigns how many foragers will be recruited by each explorer bee and defines the new search space.

Tests Procedure and Evaluation Criteria
The tests and evaluation of the different developed tracker proposals were carried out considering several evaluation criteria. First, the tests were performed using the ALOV video dataset (described in Section 2.6.1). Secondly, a quantitative evaluation is carried out using the F-score criterion for each proposal and analyzed using a survival curve (detailed in Section 2.6.2). Subsequently, the average time per frame criterion is obtained for each proposal and analyzed using a survival curve (detailed in Section 2.6.3). Besides, the time complexity is derived for each proposal and contrasted with the obtained results. Finally, performance in terms of accuracy and processing time for each proposal is analyzed through a proposed criterion, called Score-Time Efficiency (detailed in Section 2.6.4).

ALOV Video Dataset
The tests were performed using the ALOV video dataset (Figures 4 and 5) proposed by [3]. This dataset provides 314 videos and ground truth data for evaluation and has been used as a benchmark to test 19 different trackers.  The ALOV dataset has 14 categories and each one presents a different challenge for common trackers. The categories are: 1. Light: 33 videos that have sudden and intense changes in the main light source or how it illuminates the tracked object. 2. Surface Cover: 15 videos where the tracked object changes its surface cover, but this cover adopts the form of the object it covers. 3. Specularity: 18 videos with shiny objects that reflect light and produce specularities. 4. Transparency: 20 videos where the tracked object is transparent and easily confused with the background. 5. Shape: 24 videos in which the tracked object drastically changes its shape. 6 13. Zooming Camera: 29 videos where the zoom of the camera changes the displayed size of the object. 14. Long Duration: 10 videos that have greater duration, between one and two minutes.
To perform a quantitative evaluation against the other 19 trackers, the dataset authors suggest using the F-score [3], which is a measurement tool based on the intersection and union area between the provided answer and the real answer. We have successfully applied this measure to other evolutionary algorithms dealing with video tracking [6,7].
The authors of the dataset suggest the use of a survival curve; a type of graph that shows the results sorted in descending order and is more commonly used in the field of medicine to compare how different treatments affect patients [41,42]. The curves that are closer to the labeled horizontal line can be considered the most accurate.

F-Score
The F-score F is evaluated using Equations (14) and (15). The first Equation (14) is known as the PASCAL criterion and it helps to identify false positives and true positives. The criterion depends on how the rectangular window T i is given as a result for a certain frame (also called truth) intersects with the ground truth GT i and how this area compares to the union of both as illustrated in Figure 6. The ground truth data was also provided as part of the ALOV++ dataset to allow faster testing and measurement of new algorithms.
|T The number of false positives n f p increases when the PASCAL criterion fails; the number of true positives n tp increases on the opposite case, and the number of false negatives n f n increases when the object is present on the frame according to the ground truth, but the tested algorithm determines the object was not detected. precision = n tp n tp + n f p recall = n tp n tp + n f n F-score = 2 · precision · recall precision + recall Note F only has non-negative values lesser or equal to 1, and the PASCAL criterion measures not only that the result is in the correct place but also that it reflects the current, correct size of the tracked object.

Average Time per Frame
The average time per frame is obtained using the summation of the seconds used to process each frame (seconds per frame, spf) of each video sequence divided by the number of frames in that sequence.
Average Time per Frame = ∑ seconds per frame i Number of frames The average time per frame should be obtained for each video of the dataset and each one of the different tracker proposals. We also used the survival curve plots to visualize the average time per frame data. Moreover, the time complexity is formally detailed and later confirmed with experimental results.

Score-Time Efficiency
To allow the comparison in terms of F-score and seconds per frame, this work proposes the Score-Time Efficiency. This rate describes how much benefit is obtained in terms of F-score by each second spent to process the frame. The tracker proposal with the greater Score-Time Efficiency will be considered better. Equation (17) shows how we obtained the Score-Time Efficiency.
Score-Time Efficiency = F-score Average Time per Frame (17)

Results
The Big O notation or asymptotic notation is used in the literature to describe the running time of algorithms [43]; this work uses it with to predict how a certain proposed tracker would compare against another one and how parallelization affects its performance in an abstract and formal manner (see Section 3.1). The theoretical analysis is later confirmed by experimental evidence. Section 3.2 observes the difference between using CPU or GPU to process exhaustive NCC. To continue, our NCC implementation is validated and compared to other trackers using the results reported by Smeulders et al. [3] (Section 3.3). Later, Section 3.4 describes the experiment that was performed to find the optimal number of generations for HSA. Ultimately, the experiment in Section 3.5 determines which of the proposed implementations is the best alternative in terms of accuracy and speed. Additionally, Section 3.6 uses artificial noise to test the robustness of the best alternative. Finally, Section 3.7 provides the comparison of the performance of our best alternative against other recent trackers.

Polynomial Time Complexity
The time complexity of NCC is obtained by observing Equations (11), (12) and (13). The time it takes to gett using Equation (13)  Sequential exhaustive NCC requires to compute every single possible γ(u, v) for every (u, v). This is the size of the feasible region (w , note that mnwh is the greatest term because m should be smaller or equal to w and n should be lesser or equal to h (as illustrated in Figure 1). This is the time complexity of sequential exhaustive NCC (Table 2 row 1).
In exhaustive parallel NCC, the GPU performs the computation of several γ(u, v) simultaneously. Let c be the maximum number of simultaneous work-items of the GPU, then the time complexity of parallel exhaustive NCC is O( mnwh c ) ( Table 2 row 2).  (1) and (2). Please note that this analysis is specific to the case where the fitness function is γ(u, v), if the fitness function experiences any change, then so does the time complexity. The first line of the Exploration pseudocode takes O(µ e ), as it is a constrained random generation of µ e individuals. The second line of the exploration pseudocode has a time complexity of O(µ e mn) since γ(u, v) is computed for every one of the µ e individuals. Next, the time for line 4 is of O(λ e ) and O(λ e mn) for line 5; line 6 is omitted (no sharing). Exploration line 7 performs sorting, taking O((µ e + λ e )log(µ e + λ e )). In total, lines 4-7 have a time complexity of O(λ e mn), since line 5 should be significantly slower to compute. This O(λ e mn) time complexity operation is repeated several times (line 3), this number is called the maximum number of generations or g, producing a time complexity of O(gλ e mn) which is polynomially greater than the cost of lines 1-2. Thus, the exploration phase takes O(gλ e mn). The recruitment phase requires the summation of the fitness of all the µ e individuals (Equation (1)) and then Equation (2) has to be computed for each of those individuals, producing a time complexity of O(µ e ). The harvest phase is very similar to the exploration phase; the key difference is the size of the evaluated populations (µ h and λ h ) resulting in time complexity of O(gλ h mn). In polynomial terms, the greatest time complexity between exploration, recruitment, and harvest phases is that of the harvest phase given that it should be true that λ h > λ e ; taking those considerations the time complexity for sequential HSA NCC ( Table 2 row The parallelization of HSA is described in Section 2.5.4. The evaluation of the fitness function γ(u, v) for each individual is performed by q work-items simultaneously. Please note that all individuals are processed in parallel because the number of individuals is p/q, and this is dependent on the capacity of the GPU. This step derives in a O( mn q ) time complexity to get the fitness of all individuals in the population once. Still, it is not possible to process several generations at the same time; they are dependent on the results of the previous generation. Thus, the overall time complexity of Parallel HSA NCC (Table 2 row , then the parallelization is expected to successfully accelerate computations in both cases. However, the comparison between using exhaustive search or using HSA is non-intuitive. Still, we could reduce it to the simple idea that HSA will be faster if the size of the image frame (w × h) is significantly greater than the number of individuals in the λ h population multiplied by the number of generations g. In summary, HSA is expected to be faster than exhaustive search as long as wh > gλ h , which should be the case in most if not all of the tested videos. In the following sections, the different tracker proposals are tested in terms of speed and accuracy, to contrast the time complexity derivations of each proposal, and to confirm the advantages of parallelizing HSA for object tracking.

Sequential Exhaustive NCC vs. Parallel Exhaustive NCC
Our theoretical analysis has supported the assumption that using a GPU produces an acceleration against using the CPU to compute NCC with an exhaustive search. The experiments show that the assumption is correct, the GPU implementation of exhaustive NCC takes fewer seconds per frame on average.
We considered impractical to fully process all videos of the ALOV dataset using the CPU and exhaustive search, but in order to estimate the average time per frame, we tested the first two frames of each video. We made this decision because the CPU required 1 day and 18 h to process the first 5 videos of the ALOV dataset. To check that the accuracy of the results is not affected, a few random videos from the ALOV dataset were selected and used for testing, According to the results (Table 3), the GPU computes NCC for a full frame around 170 times faster than the CPU. A small difference was observed in the average F-score, equivalent to 12.60% of the overall standard deviation (not relevant). As expected, the parallelization of exhaustive NCC successfully accelerated the execution. This result supports our previous time complexity analysis: O(mnwh) > O( mnwh c ).

Reported NCC and Other Trackers versus This Work's NCC
This experiment was performed to show that the proposed NCC implementation behaves as reported by other authors. We used all the 314 videos of the ALOV dataset for testing in this experiment.
The tests labeled OURS use the described GPU implementation of NCC and the tests labeled CTRL are the ones reported for NCC [3,44].
As illustrated in Figure 7, the difference in average F-score between OURS and CTRL is small. The distance between the averages is 9.06% of the standard deviation of all the tests, not significant. These results validate that the proposed GPU implementation of NCC performs as good as expected in terms of F-score. Comparison of F-score between using this work's NCC (OURS) and another NCC implementation (CTRL), the raw data was obtained from [44]. 'Both' shows the average and standard deviation of OURS and CTRL combined. Figure 8 shows survival curves for OURS, CTRL and 18 trackers reported by Puddu [44]. Survival curves are obtained by sorting the scores for every video and show a general overview of the video tracker performance; the closer the curve is to the perfect score (1.0), the better. The other trackers that are displayed are: Struck (STR) [45], Foreground-Background Tracker (FBT) [46], Tracking Learning and Detection (TLD) [47], Tracking by Sampling Trackers (TST) [48], Lucas-Kanade Tracker (LKT) [49], Kalman Appearance Tracker (KAT) [50], Fragments-based Robust Tracking (FRT) [51], Mean Shift Tracking (MST) [52], Locally Orderless Tracking (LOT) [53], Incremental Visual Tracking (IVT) [54], Tracking on the Affine Group (TAG) [55], Tracking by Monte Carlo (TMC) [56], Adaptive Coupled-layer Tracking (ACT) [57], 1 -minimization Tracker (L1T) [58], 1 Tracker with Occlusion detection (L1O) [59], Hough-Based Tracking (HBT) [60], Super Pixel tracking (SPT) [61], and Multiple Instance learning Tracking (MIT) [62].  FBT  FRT  HBT  IVT  KAT  L1O  L1T  LKT  LOT  MIT  MST  SPT  STR  TAG  TLD  TMC TST CTRL OURS

Number of videos [videos]
F-score [arb. unit] Figure 8. Comparison of F-score survival curves between using this work's NCC (OURS) and the NCC implementation reproduced from [3,44] (CTRL). Other video trackers are shown for reference.
According to this experiment, our implementation of NCC is validated and can be considered a tracker of acceptable performance in terms of accuracy.

Adjusting the Honeybee Search Algorithm
This experiment was designed to determine if the time per frame is reduced when the number of generations (explained in Section 2.2) is lower, and what is the optimal number of generations. This experiment considers all the 314 videos of the ALOV dataset. All the tests use NCC, GPU, and HSA. The variation is made in the number of generations. The other configuration parameters of HSA reuse optimal values proposed by Olague and Puente [14][15][16] after numerous experiments. Figure 9a shows that the average F-score is stable between generations 2 to 10. The difference in score between 2 and 3 generations is 3.50% of the overall standard deviation, and 1.95% between 3 and 4 generations. The leap between 4 and 10 generations is only of 4.02%. On the other hand, there is a notable difference between 1 and 2 generations, measured as 66.79% of the overall standard deviation.  Figure 9b shows that as the number of generations increases, the average time per frame increases. From 1 to 2 generations, the average time per frame increases by 8.65% of the overall standard deviation. From 2 to 3 generations, the increment is 4.50%. When changing from 3 to 4 generations, time per frame increases by 8.48%. The pattern is caused by the increment in the number of generations for the harvest phase, which is the half of the generations used for the exploration phase. Finally, an increment from 4 to 10 generations causes the average time per frame to increase by 38.82% of the overall standard deviation, showing that the trend continues. Figure 10a shows a three-dimensional plot, the axes show the average F-score, the average time per frame and the number of generations; the color of the bars indicates a difference in the average F-score; this helps to visualize the trend observed when the number of generations of HSA varies. After two generations, the average F-score stabilizes. In contrast, the average time per frame keeps increasing. We inferred from that information that two generations provide the right balance between time and score. Nevertheless, the Score-Time Efficiency shows without a doubt that two generations is the best option (Figure 10b). As expected, the average time increases with each new generation. This aspect supports the time complexity analysis of parallel HSA with NCC as fitness function: O( gmn q ), where the number of generations is of significant influence. This result shows the practical utility of the Score-Time Efficiency, using it to determine that the optimal number of generations. Also, the observation that the F-score has asymptotic behaviour starting on two generations ensures that selecting that number does not mean a negative effect on accuracy. Figure 11 shows an example of the results that were obtained using two generations, the location of the object (a ball) is obtained with a certain degree of accuracy even when the size and illumitation of the object changes. Figure 11. Visual example of the results that were obtained using NCC and HSA with two generations. The frames that are shown correspond to the 23th video of the Light category of the ALOV dataset. Every 14th frame is shown.

GPU Exhaustive vs. GPU with Honeybee Search Algorithm
This experiment also uses all the 314 videos of the ALOV dataset and all the tests use NCC and the GPU. The tests labeled BEE use Parallel HSA configured to run for two generations, and the ones labeled NO BEE use exhaustive search. Keep in mind that the full capacity of the GPU is not used in BEE because of the synchronization needs.
Notice in Figure 12 that the average difference of F-score between BEE and NO BEE is relatively small. This difference is 1.54% of the overall standard deviation, an irrelevant difference. This validates that BEE performs as good as the exhaustive NCC tracker. The difference of average time per frame, displayed in Figure 13a, is of 42.22% of the overall standard deviation. However, there is a remarkable difference in the standard deviation of both groups. In other words, BEE provides a lower and more stable time per frame (about 7.58 times steadier). This result validates that using Parallel HSA provides an improvement in time per frame against a plain GPU-accelerated NCC. Figure 13b shows how using the GPU accelerates the implementation that uses HSA and how it compares against exhaustive search. This outcome shows how the polynomial time complexity of GPU BEE is lower than GPU NO BEE, which is lower than CPU NO BEE, supporting the theoretical analysis O(mnwh) > O( mnwh c ) > O( gmn q ). Finally, Figure 14 shows that using parallel HSA (GPU BEE) provides a greater Score-Time Efficiency meaning it is the best alternative. This simple observation is an intuitive condensation of the previous analysis of both the F-score and average time per frame observations.

Tests with Gaussian Noise
This experiment uses 14 videos of the ALOV dataset, one for each category. All the tests use NCC, HSA with two generations, and GPU. We apply Gaussian noise to test the resistance of the proposal to external perturbations. Gaussian noise requires the configuration parameters of median and standard deviation to generate random numbers with that distrubution [63], and then these are added to the values of each color channel. Figure 15 shows an example frame without noise (a), this is labeled as noise level 0. Then, the same frame with Gaussian noise using median and standard deviation with a noise level of 50 shown in (b). Finally, the same example frame with Gaussian noise using median and standard deviation with a noise level of 100 shown in (c). The Figure 16a shows that there is a difference in F-score between videos that display different levels of noise. In other words, the underlying object tracker (NCC) shows a degree susceptibility to noisy image frames. The difference between noise level 0 and level 50 is 34.88% of the overall standard deviation, and the difference between levels 50 and 100 is of 10.38% of the total standard deviation. On the other hand, Figure 16b shows that noise has no noticeable effect on the average time per frame.   A conclusion drew from the results obtained with the combination of NCC, HSA with two generations, and GPU is that Gaussian noise affects the accuracy. Nevertheless, the difference in F-score observed between no noise (level 0), and some noise (50 and 100) is in the range of 0.11 to 0.14, which is small considering that the full range of the F-score is from 0 to 1. On the other hand, the noise has no effect on speed.

Our Best Tracker Proposal versus Two Recent Trackers
Two different recent trackers were selected for contrast: Struck [45] and SiamMask [10]. Struck remains relevant because it is a purely online tracking proposal that requires no kind of training and is still able to participate in the latest issues of the VOT challenge [64]. On the other hand, SiamMask is considered among the best tracker proposals that participated in the latest VOT challenge (top 20) in terms of accuracy, also several techniques that take inspiration from SiamMask are positioned in the toptier. SiamMask relies on pre trainined models as it is based on convolutional neural networks. This section compares the performance of this two state of the art trackers against our best proposal (GPU BEE) in terms of speed and accuracy using 14 videos of the ALOV dataset, one per category.
The analysis of Struck and SiamMask revealed that both trackers preprocess the image frames before actually performing tracking, in the case of SiamMask this is usefull to reduce latency caused by the CPU to GPU communication. Taking that into account, a simple reduction of the frame size is performed in our proposal, by scaling the picture accordingly, whenever the width or the height of the frame is greater than 300 pixels. The effect on the accuracy caused by the reduction of frames can be considered negligible compared to the full resolution frame. On the other hand, a relevant reduction of the average time per frame is obtained. Figure 17a shows that the accuracy of GPU BEE is inferior to that of SiamMask and Struck. This is not surprising since GPU BEE uses NCC as a tracker, and different sources have found it to be surpassed in accuracy by Struck. However it is interesting to note that Struck showed a better accuracy than SiamMask in this particular setting, which is not the general case. The conclusion that is drawn is that Struck shows greater robustness in certain cases: low resolution videos and videos with background objects that resemble the target. In addition, it is clear that the parallelization of the HSA does not improve the accuracy of its fitness function (Figure 13), but a significant acceleration is observed compared to the conventional NCC tracker. As shown in (Figure 17b), the average time per frame of our best proposal (GPU BEE) is almost four orders of magnitude smaller than that obtained for CPU NO BEE. Moreover, the survival curve of average time per frame of GPU BEE is in the same window (from 0.2 to 0.9 spf) of Struck and SiamMask, a remarkable improvement against CPU NO BEE.

Discussion
The main concern of this work is to find out whether our proposed combination of HSA and GPU is capable of accelerating a given object tracker. It is shown in Figures 12 and 13 that the optimal proposal, the one that maximizes accuracy and speed, is parallel HSA. An area of opportunity that arises from this work is whether or not parallel HSA using NCC as fitness function can compete against other trackers in terms of speed. During the tests combining the GPU, HSA, and NCC, the minimum observed latency was of 0.0351 s per frame; the maximum was 2.6715 s per frame; and the average was of 0.3729 s per frame. The results show that the tracker's speed is affected by the dimensions of the frame (width and height); as well as the size of the tracked object.
We reviewed the literature and found that the ALOV dataset was not used to test the speed of trackers due to the authors' emphasis on accuracy. On the other hand, the dataset presents a demanding task because the frame sizes that range from 192 × 144 to 1920 × 1080 pixels and post many different challenges. Other authors have reported speed of their tracker proposals. For example, Wu et al. [4] which used video sequences with frame sizes than range from 128 × 96 to 640 × 480 pixels, which are significantly smaller than ALOV frame sizes. Using the data of the most accurate trackers reported in that benchmark [4], we found that the minimal latency is of 0.0027 s per frame, the maximum latency is of 1.9607 s per frame and the average is 0.3228 s per frame. In contrast our average time (0.3729 s per frame) seems to fall on the average, despite the fact that the ALOV dataset has larger frame sizes and presents more complex challenges.
This work presents the results of tests using only one object tracker (NCC), but the parallelized version of the Honeybee Search Algorithm as presented could and ought to be combined with other object trackers. This combination can provide more evidence of the benefits of combining these tools. Testing with other object trackers is not a trivial undertaking since this requires the study and understanding of the tracker and an analysis to determine if it can be accurately translated into a fitness function. We believe the main objective of this work has been fulfilled by providing the proof that the combination of HSA and the GPU is useful to accelerate the NCC object tracker without adverse effects on its accuracy. For future work, we have decided to address this further by testing this configuration with other trackers (Section 5.1). Another relevant point that supports the previous observation is that the parallel HSA was capable of accelerating NCC to the point of reaching Struck and SiamMask speed levels even when NCC is not a state of the art tracker as is confirmed in Section 3.7.
According to our results, the GPU may not be the best tool for the parallelization of HSA or other Swarm Intelligence algorithms. The GPU was designed to perform vector operations with small synchronization requirements, while many Swarm Intelligence algorithms focus on the idea of synchronization of individuals. Usually, a GPU has a limit to the number of possible synchronized threads, often a fraction of the full capacity of the GPU. Nevertheless, our parallel implementation of HSA achieved a higher speed than naïve parallelization of the NCC tracker. Even though, the used hardware is not the best alternative in the market, we found that our parallelization approach enhances the performance of the NCC tracker and this should be translatable to another GPU with higher capacities.

Conclusions
The presented results provide evidence to conclude that the proposed combination of a GPU and HSA can be useful to accelerate NCC, and possibly other trackers. Furthermore, using HSA, together with the GPU, provides a greater acceleration than using the GPU alone. Even when HSA is not a deterministic tool, the accuracy is not negatively affected when using it.
The experiments helped find the optimal value for the number of generations (two generations), an important configuration parameter for HSA that has an effect in speed and accuracy. Both the time complexity analysis and the experimental evidence showed the extent of the influence that this parameter has in the tracker's execution speed. Other experiments showed that the proposal is resistant to high levels of noise.
The proposed Score-Time Efficiency criterion was found useful to make simple comparisons of different tracker proposals in terms of accuracy and speed, allowing making decisions that require the optimization of both variables. According to the analysis performed using this criterion, the best option was parallel HSA with 2 generations using NCC as fitness function.
Comparison experiments of our best proposal against other recent trackers such as SiamMask and Struck show evidence that parallel HSA has no relevant effects on the accuracy of its tracker function (NCC), but dramatically reduces latency, almost four orders of magnitude in terms of average time per frame. This evidence exposes an opportunity area in which to focus future efforts: to employ the proposed methodology but using state of the art trackers as a starting point.

•
Perform tests with different datasets. Since the ALOV dataset is mainly focused on accuracy.
An interesting dataset could be the object tracking benchmark of Kristan et al. [64]. Also, use state-of-the-art trackers and other well-known traditional trackers for testing. For instance, Struck [45] and SiamMask [10] are currently considered some of the most accurate and fast trackers [3,64].

•
Perform tests with different parallel computing or hardware acceleration technologies. The synchronization required by HSA could be provided by another architecture such as the FPGA [65], which would allow designing a parallel computing architecture with custom capabilities.

•
Comparisons with other Swarm Intelligence heuristics. This work used HSA, which could be compared against other similar heuristics, such as Particle Swarm Optimization or Artificial Bee Colony.