Automated Design of Salient Object Detection Algorithms with Brain Programming

Despite recent improvements in computer vision, the design of artificial visual systems remains daunting because an explanation of visual computing algorithms remains elusive. Salient object detection is one problem that is still open due to the difficulty of understanding the brain's inner workings. Progress in this research area has followed the traditional path of hand-made designs based on neuroscience knowledge. In recent years, two different approaches based on genetic programming have appeared that enhance this technique. One follows the idea of combining previous hand-made methods through genetic programming and fuzzy logic. The other consists of improving the inner computational structures of basic hand-made models through artificial evolution. This work proposes expanding the artificial dorsal stream using a recent proposal to solve salient object detection problems. The approach draws on the two main branches of this research area: fixation prediction and detection of salient objects. We adopt the fusion of visual saliency and image segmentation algorithms as a template, and the proposed methodology discovers several critical structures within that template through artificial evolution. We present results on a benchmark designed by experts, with outstanding performance in comparison with the state-of-the-art.


Introduction
Saliency is a property found in the animal kingdom whose purpose is to select the most prominent region of the field of view. Elucidating the mechanism of human attention, including the learning of bottom-up and top-down processes, is of paramount importance for scientists working at the intersection of neuroscience, computer science, and psychology. Giving a robot/machine this ability will allow it to choose/differentiate the most relevant information. Learning the algorithm for detecting and segmenting salient objects from natural scenes has attracted great interest in computer vision and, recently, from people working with genetic programming [1,2]. While many models and applications have emerged, a deep understanding of their inner workings remains lacking. This work builds on a recent methodology that attempts to design brain-inspired models of the visual system, including the dorsal and ventral streams [3,4]. The dorsal stream is known as the "where" or "how" stream; this pathway guides actions, recognizes objects' locations in space, and is where visual attention occurs. The ventral stream is known as the "what" stream; this pathway is mainly associated with object recognition and shape representation tasks. This work deals with the optimization/improvement of an existing algorithm (modeling the dorsal stream) and allows evolution to improve this initial template method. The idea is to relieve the human designer of the whole dorsal stream design's responsibilities by focusing on the high-level concepts while leaving the computer (genetic programming, GP) the laborious chore of providing optimal variations to the template. Therefore, the human designer engages in the more creative process of defining a family of algorithms [5]. Figure 1 shows the template's implementation (individual representation) that emulates an artificial dorsal stream (ADS).
As we can observe, the whole algorithm represents a complex process based on two models: a neurophysiological model called the two-pathway cortical model (the two-streams hypothesis) and a psychological model called feature integration theory [6]. The latter theory states that human beings perform visual attention in two stages. The first, called the preattentive stage, processes visual information in parallel over the different feature dimensions that compose the scene: color, orientation, shape, and intensity. The second stage, called focal attention, integrates the features extracted in the previous stage to highlight the salient region of the scene (the salient object). Hence, the image is decomposed into several dimensions to obtain a set of conspicuity maps, which are then integrated, through a function known as evolved feature integration (EFI), into a single map called the saliency map.
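To make the two-stage pipeline concrete, the following Python sketch fuses a list of per-dimension conspicuity maps into a single saliency map. In the actual system the integration function (EFI) is discovered by evolution; a plain normalized sum is only a hypothetical stand-in, and the function name is ours.

```python
import numpy as np

def integrate_conspicuity_maps(conspicuity_maps):
    """Fuse per-dimension conspicuity maps into one saliency map.

    The evolved feature integration (EFI) is found by genetic programming;
    here a normalized summation stands in for it (illustration only).
    """
    saliency = np.zeros_like(conspicuity_maps[0], dtype=float)
    for cm in conspicuity_maps:
        rng = cm.max() - cm.min()
        # Rescale each map to [0, 1] so no dimension dominates by magnitude.
        normalized = (cm - cm.min()) / rng if rng > 0 else np.zeros_like(cm, dtype=float)
        saliency += normalized
    return saliency / len(conspicuity_maps)
```

Any monotone rescaling could replace the min-max normalization; the point is that the maps are brought to a common range before fusion.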
Brain programming (BP) is based on the most popular theory of feature integration for the dorsal stream and on the hierarchical representation of multiple layers as in the ventral stream [7]. Note that the template's design can be adapted according to the visual task; in this work, we focus on designing an artificial dorsal stream. Moreover, BP replaces data-driven models with a function-driven paradigm, in which a set of visual operators (VOs) is fused by synthesis to describe the image's properties to tackle object location and recognition tasks. This paper is organized as follows. First, we briefly outline the related work to highlight the research direction. Next, we detail the construction of the ADS template by adapting Graph-Based Visual Saliency (GBVS) combined with Multiscale Combinatorial Grouping (MCG) to an evolutionary machine learning algorithm. Then, we present the results of the evolutionary algorithm to illustrate the benefits of the new proposal. Finally, we close the article with our conclusions and future work on the automated design of brain models.

Figure 1: Brain programming implementation of the dorsal stream using the combination of visual saliency and image segmentation algorithms. We propose to discover a set of visual operators (VOs) and the evolutionary feature integration (EFI) within the template through artificial evolution. The design strikes a balance between the human designer and the computer.

Related Work
For a learning algorithm design technique to be well received, it needs to address several levels of analysis. A major critique of deep learning is its opacity.
Scientists depend on complex computational systems that are often ineliminably opaque, to the detriment of our ability to give scientific explanations and detect artifacts. Here we follow a strategy for increasing transparency based on three levels of explanation about what vision is, how it works, and why we still lack a general model, solution, or explanation for artificial vision [8]. The idea follows a goal-oriented framework where learning is studied as an optimization process [9]. The first level is theoretical transparency: knowledge of the visual information processing whose design is the computational goal. The second is algorithmic transparency: knowledge of how the visual processing is coded. The third is execution transparency: knowledge of how the program is implemented considering specific hardware and input data.
Visual attention has a long history, and we recommend the following recent articles to the interested reader to learn more about the subject [10,11,12].
However, to put it into practice, it is better to consult benchmarks [13,14,15]. In the present work, we select Li et al. [15] since it provides an extensive evaluation of fixation prediction and salient object segmentation algorithms, as well as statistics of the major datasets. They provide a framework focusing on the performance of GBVS against several state-of-the-art proposals.
The study also explains how to adapt fixation prediction algorithms to salient object detection by incorporating a segmentation stage. Fixation prediction algorithms aim at predicting where people look in images, whereas salient object detection focuses on a wide range of object-level computer vision applications. Since fixation prediction originated in the cognitive and psychological communities, its goal is to understand the biological mechanism; salient object detection does not necessarily need to explain biological phenomena.
Regarding the second level of explanation (algorithmic transparency), or the knowledge of the visual processing coding, we can observe two different approaches to incorporating learning into such a study. The first is exemplified by a deep learning technique (DHSNET, the Deep Hierarchical Saliency Network), since it is used as a building block in [2]. This method is a fully convolutional network (FCN)-based method, designed to address the limitations of multi-layer perceptron (MLP)-based methods [16]. FCN architectures lead to end-to-end spatial saliency representation learning and fast saliency prediction within a single feed-forward process. FCN-based methods are now dominant in the field of computer vision.
The second methodology is represented by evolutionary computation applying genetic programming, of which we identify two representative works. In [2], the contribution is oriented toward the automatic design of combination models through genetic programming. The proposed approach automatically selects the algorithms to be combined and the combination operators, using as input a set of candidate saliency detection methods and a set of combination operators. This idea follows a long history of combination models in computer vision. To achieve good results, the authors rely on complex algorithms like DHSNET, using them as building blocks to the detriment of transparency, since the method does not enhance the complex algorithms in the function set but only their output.
Since fixation prediction algorithms are complex heuristics, another alternative is to work directly with key parts of the algorithm in an attempt to improve/discover the whole design. In [1], genetic programming serves to generate visual attention models (fixation prediction algorithms) to tackle salient object segmentation. However, the authors took a step back, returning to the first stage (theoretical transparency), and revisited Koch et al. looking for a suitable model susceptible to optimization [17]. They developed an optimization-based approach to learn the complete model using a basic algorithm that serves the purpose of a template. This algorithm uses as a foundation the code reported in [18]. In this way, Dozal et al. attempt to fulfill the second stage (algorithmic transparency), since they contemplate the difficulty of articulating the whole design exposed by Treisman and Gelade. To sum up, it is not easy to delegate all practical aspects to the computer according to the genetic programming paradigm. This way of searching for visual attention programs has already impacted practical applications like visual tracking [19,20]. The method searches for new alternatives in the processes described by the feature integration theory (FIT), including the acquisition of visual features, the computation of conspicuity maps, and the integration of features. Nevertheless, a drawback is that the visual attention models are evolved to detect a particular/single object in the image.
In this work, we would like to identify all foreground regions and separate them from the background. Note that the foreground can contain any object in a particular database. This problem, known as salient object detection, was approached by Contreras et al. The idea is to replace Itti's algorithm with the proposal published in [21] and the further adaptation and benchmark described in [15].
Koch and coworkers adapted Treisman and Gelade's theory into basic computational reasoning. Itti's algorithm accomplishes two stages: 1) visual feature acquisition and 2) feature integration. It consists of visual feature extraction, computation of visual and conspicuity maps, feature combination, and the saliency map. GBVS is not different from Itti's implementation; however, it gives a better description of the technique through Markov processes. The idea is to adapt the GBVS algorithm to the symbolic framework of brain programming. Figure 1 depicts the proposed algorithm, in which multiple functions are discovered through artificial evolution. GBVS is a graph-based, bottom-up visual salience model consisting of three steps: first, extraction of features from the image; second, creation of activation maps using the characteristic vectors; and third, normalization of the activation maps and their combination into a master map. We adapt the algorithm described in [1] to the new proposal using four dimensions: color, orientation, shape, and intensity. In Koch's original work, there are three dimensions, each approached with a heuristic method, and the same holds for the integration step. We apply the set of functions and terminals provided in [22], with a few variants, to discover optimal heuristic models for each of these stages. The algorithm uses Markov chains to generate the activation maps. This approach is considered "organic" because, biologically, individual "nodes" (neurons) exist in a connected, retinotopically organized network (the visual cortex) and communicate with each other (synaptic activation) in a way that results in emergent behavior, including quick decisions about which areas of a scene require additional processing.

Methodology
BP aims to emulate the behavior of the brain through an evolutionary paradigm using neuroscience knowledge for different vision problems. The first works to introduce this technique [4,1] focused on automating the design of visual attention (VA) models and studied how it surpasses previous human-made systems developed by VA experts. To perceive salient visual features, the natural dorsal stream in the brain has developed VA as a skill based on selectivity and goal-driven behavior. The artificial dorsal stream (ADS) emulates this practice by automating the acquisition and integration steps. Practical applications of this model include tracking objects in video captured with a moving camera, as shown in [19,20].
BP is a long process consisting of several stages summarized in two central ideas correlated with each other. First, the primary goal of BP is to discover functions that are capable of optimizing complex models by adjusting the operations within them. Second, a hierarchical structure inspired by the human visual cortex uses function composition to extract features from images. It is possible to adapt this model depending on the task at hand; e.g., the focus of attention can be applied to saliency problems [1], or the complete artificial visual cortex (AVC) can be used for categorization/classification problems [4]. This study uses the ADS, explained to a full extent in the following subsections, to obtain as a final result the design of optimal salient object detection programs which satisfy the visual attention task.

Individual Representation
We represent individuals by using a set of functions for each V O defined in Section 3.3. Entities are encoded into a multi-tree architecture and optimized through evolutionary operations of crossover and mutation.
The architecture uses four syntactic trees: one for each evolved visual operator (EVO) regarding orientation, color, and shape, plus one for feature integration. We merge the CMs produced by the center-surround process, including feature and activation maps, using the EFI tree, generating a saliency map (SM) as a result. Section 3.3.1 provides details about the usage of these EVOs; additionally, Figure 1 provides a graphical representation of the complete BP workflow.
After initializing the first generation of individuals, the fitness of each solution is tested and used for creating a new population.

Artificial Dorsal Stream
The ADS models some components of the human visual cortex, where each layer represents a function achieved by synthesis through a set of mathematical operations; this constitutes a virtual bundle. We select visual features from the image to build an abstract representation of the object of interest. Therefore, the system looks for salient points (at different dimensions) in the image to construct a saliency map used in the detection process. The ADS comprises two main stages: the first acquires and transforms features in parallel that highlight the object, while in the second stage, all integrated features serve the goal of object detection.

Acquisition and Transformation of Features
In this stage, different parts of the artificial brain automatically separate basic features into dimensions. The input to the ADS is a color image I defined as the graph of a function, as follows.
Definition 1. Image as the graph of a function. Let f : U ⊂ R^2 → R^3 be a function mapping image coordinates to values in a color space; the color image I is the graph of f. We define the optimization process through the formulation of an appropriate search space and evaluation functions.

Feature Dimensions
In this step, we obtain relevant characteristics from the image by decomposing it and analyzing key features. Three EVOs transform the input picture to emphasize specific characteristics of the object. Note that the fourth visual map, VM_Int, is not evolved and is calculated as the average of the RGB color bands. These EVOs are the operators generated in Section 3.2. Individuals (programs) represent possible configurations for feature extraction that describe input images and are optimized through the evolutionary process. We perform these transformations to recreate the process of extracting information following the FIT. When each operator is applied, a VM generated for each dimension represents a partial procedure within the overall process. Each VM is a topographic map that represents, in some way, an elementary characteristic of the image.
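The non-evolved intensity dimension can be written down directly, since the text defines it as the average of the RGB bands; the function name below is ours, not the paper's.

```python
import numpy as np

def intensity_visual_map(image_rgb):
    """Non-evolved intensity dimension VM_Int: the per-pixel mean of the
    R, G, B bands, as described in the text."""
    return image_rgb.astype(float).mean(axis=2)
```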

Creating the Activation Maps
After selecting the visual operators generated by the evolutionary process, the feature maps complete the feature extraction. Next, activation maps are computed from the feature maps following a Markovian approach.

A Markovian Approach
Now, consider a fully connected graph denoted as G_A. Each node of M, indexed by (i, j) ∈ [n]^2, is connected to every other node. The weight of the edge from node (i, j) to node (p, q) is defined as

w_1((i, j), (p, q)) = d((i, j) || (p, q)) · F(i − p, j − q),

where

d((i, j) || (p, q)) = |log(M(i, j) / M(p, q))|,   F(a, b) = exp(−(a^2 + b^2) / (2σ^2)),

and σ is a free parameter of the algorithm. Thus, the weight of the edge from node (i, j) to node (p, q) is proportional to their dissimilarity and their closeness in the domain of M. It is possible then to define a Markov chain on G_A by normalizing the weights of the outbound edges of each node to 1 and drawing an equivalence between nodes and states, and between edge weights and transition probabilities.
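This construction can be sketched in Python for a small n×n map M, assuming M is strictly positive and non-constant (so every node has nonzero outbound weight). The equilibrium distribution of the chain is approximated here by power iteration; function and variable names are ours.

```python
import numpy as np

def gbvs_activation(M, sigma=2.0, iters=200):
    """Activation map via a GBVS-style Markov chain.

    Edge weight: |log(M(i,j)/M(p,q))| * exp(-((i-p)^2+(j-q)^2)/(2*sigma^2)).
    Outbound weights are normalized to 1 and the equilibrium distribution
    is approximated by power iteration. Assumes M > 0 and non-constant.
    """
    n = M.shape[0]
    coords = [(i, j) for i in range(n) for j in range(n)]
    N = n * n
    W = np.zeros((N, N))
    for a, (i, j) in enumerate(coords):
        for b, (p, q) in enumerate(coords):
            if a == b:
                continue
            d = abs(np.log(M[i, j] / M[p, q]))                      # dissimilarity
            F = np.exp(-((i - p) ** 2 + (j - q) ** 2) / (2 * sigma ** 2))  # closeness
            W[a, b] = d * F
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    v = np.full(N, 1.0 / N)
    for _ in range(iters):
        v = v @ P                          # power iteration toward equilibrium
    return v.reshape(n, n)                 # equilibrium mass = activation map
```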

Normalizing an Activation Map
This step is crucial to any saliency algorithm and remains a rich area of study. GBVS proposes another Markovian algorithm; the goal of this step is mass concentration in the activation maps. The authors construct a graph G_N with n^2 nodes labeled with indices from [n]^2. For each pair of connected nodes (i, j) and (p, q), they introduce an edge with weight

w_2((i, j), (p, q)) = A(p, q) · F(i − p, j − q).

Once again, each node's outbound edges are normalized to unity, and treating the resulting graph as a Markov chain makes it possible to calculate the equilibrium distribution over the nodes. Mass flows preferentially to those nodes with high activation. The artificial evolutionary process works with this modified version of GBVS. To improve the results, we can add the MCG either during evolution or afterward, since the computational cost of image segmentation with this algorithm is very high.
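A compact sketch of the mass-concentration step under the same assumptions as before: the weight of an edge into node (p, q) is proportional to the activation A(p, q) itself, so equilibrium mass gathers at highly activated locations. Names are illustrative.

```python
import numpy as np

def normalize_activation(A, sigma=2.0, iters=200):
    """Mass-concentration normalization of an activation map A (A > 0).

    Incoming edge weight to (p, q) is A(p, q) * F(i-p, j-q), so the
    stationary distribution concentrates on highly activated nodes.
    """
    n = A.shape[0]
    coords = [(i, j) for i in range(n) for j in range(n)]
    W = np.array([[A[p, q] * np.exp(-((i - p) ** 2 + (j - q) ** 2) / (2 * sigma ** 2))
                   for (p, q) in coords] for (i, j) in coords])
    P = W / W.sum(axis=1, keepdims=True)   # normalize outbound edges to 1
    v = np.full(len(coords), 1.0 / len(coords))
    for _ in range(iters):
        v = v @ P
    return v.reshape(n, n)
```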

Genetic Operations
We follow the approach detailed in [1], where the template represents an individual encoded as a multi-tree structure, so the genetic operations act at two levels:
• Chromosome-level mutation: The algorithm randomly selects a mutation point within a parent's chromosome and completely replaces the chosen operator with a randomly generated one.
• Gene-level mutation: Within a visual operator, randomly chosen, the algorithm selects a node, and the mutation operation randomly alters the sub-tree that results below this point.
Once we generate the new population, the evolutionary process continues, and we proceed to evaluate the new offspring.
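As an illustration only, the two mutation levels can be sketched on a toy multi-tree encoding: an individual is a list of trees, and a tree is a nested tuple over hypothetical function and terminal sets (none of these names come from the actual BP implementation).

```python
import random

# Toy encoding: a node is (op, child, child) or a terminal string.
FUNCS = ["add", "sub", "mul"]          # hypothetical function set
TERMS = ["R", "G", "B", "I"]           # hypothetical terminal set

def random_tree(depth=2):
    """Grow a small random expression tree."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    return (random.choice(FUNCS), random_tree(depth - 1), random_tree(depth - 1))

def chromosome_mutation(individual):
    """Chromosome level: replace one whole visual operator (tree)
    with a freshly generated random tree."""
    child = list(individual)
    child[random.randrange(len(child))] = random_tree()
    return child

def gene_mutation(tree):
    """Gene level: pick a node and replace the sub-tree below it
    with a new random sub-tree."""
    if isinstance(tree, str) or random.random() < 0.5:
        return random_tree()
    op, left, right = tree
    if random.random() < 0.5:
        return (op, gene_mutation(left), right)
    return (op, left, gene_mutation(right))
```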

Evaluation Measures
Evolutionary algorithms usually apply a previously defined fitness function to evaluate the individuals' performance. BP designs algorithms using the generated EVOs to extract features from input images through the ADS hierarchical structure depicted in Figure 1. Experts agree on the way to evaluate the various proposed solutions to the problem of salient object detection. In this work, we follow the protocol detailed with source code in [14] and apply two main evaluation measures: precision-recall and F-measure. Given a saliency map S, we convert it to a binary mask BM and compute precision and recall by comparing BM with the ground truth G, as in Equation (5):

Precision = |BM ∩ G| / |BM|,   Recall = |BM ∩ G| / |G|.   (5)

In this definition, binarization is a key step of the evaluation; the benchmark offers a threshold-based method to generate a precision-recall curve. The second measure, Equation (6), combines this information into a figure of merit:

F_β = ((1 + β^2) · Precision · Recall) / (β^2 · Precision + Recall).   (6)

This expression comprehensively evaluates the quality of a saliency map. The F-measure is the weighted harmonic mean of precision and recall. In the benchmark, β^2 is set to 0.3 to increase the importance of the precision value.
We calculate both evaluations with two variants. In the first approach, we obtain the maximum F-measure considering different thresholds for each image during the binarization process and then average over all images in the training or testing set; see [1]. The second variant is the one used in the benchmark, which consists of first calculating the average precision and recall that result from varying the thresholds and then reporting the maximum F-measure over the resulting curve. We use both approaches during the experiments.
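The two evaluation variants can be sketched as follows, assuming saliency maps and ground-truth masks are NumPy arrays; the function names and threshold handling are illustrative, not the benchmark's code.

```python
import numpy as np

BETA2 = 0.3  # benchmark weighting that favors precision

def precision_recall(saliency, gt, threshold):
    """Binarize a saliency map at a threshold and compare with ground truth."""
    bm = saliency >= threshold
    tp = np.logical_and(bm, gt).sum()
    precision = tp / bm.sum() if bm.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return precision, recall

def f_measure(p, r):
    return (1 + BETA2) * p * r / (BETA2 * p + r) if (p + r) else 0.0

def max_f_per_image(saliency, gt, thresholds):
    """Variant 1: best F over thresholds for one image; the per-image
    maxima are then averaged over the set."""
    return max(f_measure(*precision_recall(saliency, gt, t)) for t in thresholds)

def max_f_over_curve(pairs, thresholds):
    """Variant 2 (benchmark): average precision/recall per threshold over
    the whole set, then report the maximum F along the resulting curve."""
    best = 0.0
    for t in thresholds:
        ps, rs = zip(*(precision_recall(s, g, t) for s, g in pairs))
        best = max(best, f_measure(np.mean(ps), np.mean(rs)))
    return best
```

The two variants coincide for a perfect predictor but generally differ, since the maximum of averages is bounded by the average of maxima.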

Experiments and Results
Designing machine learning systems requires the definition of three different components: algorithm, data, and measure. In this section, we evaluate the proposed evolutionary algorithm with a standard test. Thus, the goal is to benchmark our algorithm against external criteria. In this way, we need to run a series of tests based on data and measures provided by well-known experts.
Finally, we contrast our results with several algorithms in the state-of-the-art.
In this research, we follow the protocol detailed in [15]. This benchmark is of great help because it gives us access to the source code of various algorithms, allowing a more exhaustive comparison. It also analyzes blunt flaws in the design of saliency benchmarks, known as dataset design bias, produced by emphasizing stereotypical concepts of saliency. The benchmark makes an extensive evaluation of fixation prediction and salient object segmentation algorithms. We focus on the salient object detection part, consisting of three databases: FT, IMGSAL, and PASCAL-S. We present complete results next. Also, we include a test of the best program with the databases proposed in [2].

Image Databases
FT is a database with 800 images for training and 200 images for testing.
Authors of the benchmark reserve this last set of 200 images for comparison.
We use the training dataset to perform a k-fold technique (k = 5) to find the best individual. The training dataset is randomly partitioned into five subsets of 160 images. A single subset of the k subsets is retained as the validation data for testing the model, and the remaining k − 1 subsets are used as training data. Table 9 reports the best program results for 30 executions in the k-fold technique. Each execution was run considering the parameters from Table 2.
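The 5-fold partitioning described above (800 training images split into five subsets of 160) can be sketched as a generic k-fold split; this is illustrative code, not the authors'.

```python
import random

def k_fold_splits(num_images=800, k=5, seed=0):
    """Randomly partition image indices into k folds; each fold serves once
    as validation while the remaining k-1 folds form the training data."""
    idx = list(range(num_images))
    random.Random(seed).shuffle(idx)
    fold = num_images // k
    folds = [idx[i * fold:(i + 1) * fold] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        splits.append((train, val))
    return splits
```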
We follow the same procedure for the PASCAL-S database. These datasets differ in annotation quality: FT and PASCAL-S obtain accurate ground truth, while IMGSAL provides a ground truth that was purposefully segmented following a raw segmentation, to illustrate a real case where humans indicate an object's location inaccurately. The best individual reaches up to 70.74% in the first run. The high dispersion is due to the complexity of the problem generated by the poor manual segmentation.

Experiments with Dozal's Fitness Function on GBVSBP
The experiment shows an average oscillating between 62.60% and 68.52% for training, with σ = 3.72 and σ = 1.52, respectively. On the other hand, during testing, the algorithm's highest average score is 67.04% with σ = 2.48. Table 6 reports the best solution. The results show excellent stability, as seen in Table 7, with a low standard deviation, especially in training, where the first run scores 1.58. During training, fitness reaches its highest in the fifth fold of run 6, scoring 66.39%, while on average, the fifth run scores first place with 63.45%. Regarding the testing stage, the algorithm scores the best individual at the fifth fold with 67.66%, and on average, the best run was the fifth with 63.63%. Table 8 presents the best set of trees.

The experiment with our second model, GBVSBP+MCG, and the FT database shows outstanding results compared to the previous model; see Table 9. Another remarkable difference is the stability during training regarding the standard deviation, which descends a little when testing the best models, where four values score above σ = 3. Meanwhile, in the testing stage, the best individual achieves 95.06% in the second run. On average, the algorithm discovered the best individuals considering all folds in the first run with 93.47%, while the second run reports the best results with an average of 92.13%. Table 10 shows the best solution.

Experiments with Benchmark's Score
As the second round of experiments, we adapt the algorithm to use the proposed benchmark's score as the fitness function, as explained earlier; see Section 3.8. From now on, all reported experiments consider this way of evaluation. With this new F-measure, the results show greater stability globally according to Table 11, despite runs 1 and 4 reporting σ = 4.47 and σ = 4.23.
All other values are below 2%. The fitness of the individuals, despite a slight decrease, remains competitive: the best individual in the training stage achieves 72.59%, and the best in the test stage reaches 73.81%. In this experiment, the highest average fitness occurred in the last run considering all folds and both stages, with values of 70.29% and 69.06%. Table 12 gives the best set of visual operators. Regarding stability, we must bear in mind that the results of this experiment will be much more consistent when the best individual is tested in the benchmark, since we are using the same F-measure. As a result, the best results were 89.02% in training, corresponding to the second run of the first fold, and 89.02% at testing, discovered in the first run of the fifth fold. Table 18 gives the best set of trees.

Analysis of the Best Evolutionary Run
Typical experimental results that illustrate the inner workings of genetic programming are those related to fitness, diversity, number of nodes, and depth of the tree. Figure 2 provides charts giving best fitness, average fitness, and median fitness. The purpose is to detail the performance and complexity of solutions through the whole evolutionary run. As we can observe, artificial evolution scores a high fitness within the first generations. On average, BP converges around the seventh generation. The chart depicting diversity shows the convergence of solutions in all of the four trees characterizing the program.
Compared to the fitness plot, these data demonstrate that despite the differences in diversity that occurred during the experiment, the model's performance remained constant. One of the biggest problems in genetic programming is the growth of a program's size without a corresponding rise in performance, mainly when the final result cannot generalize to new data. This problem is called bloat and is usually associated with tree representations. As observed in the last two graphs, complexity is kept low, with the number of nodes below seven and the depth below five for all trees. These numbers were consistently below the proposed setup for all experiments. The hierarchical structure allows an improvement in performance and the management of the algorithm's complexity. The values presented above correspond to fitness in the evolutionary cycle of BP. The benchmark offers two modalities: one that uses only 60% of the database and another containing all images; we keep the first option. Table 19 shows the final results achieved on the testing set over 10 random splits with our best program considering the FT dataset. Here, we report final results considering the following salient object detection algorithms: FT (Frequency-Tuned), GC (Global Contrast), SF (Saliency Filters), PCAS (Principal Component Analysis), and DHSNET. Also, we include the original proposal of the artificial dorsal stream, named focus of attention (FOA), reported in [1]. Note that we outperform all other algorithms in the benchmark. Figure 3 presents image results of all algorithms on the three databases for visual comparison. Note that even if the computer model is symbolic, its interpretation remains numeric, and therefore numerical errors exist. Nevertheless, the computation is data-independent since the proposal follows a function-driven paradigm.
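The node-count and depth statistics used above to monitor bloat are straightforward to compute on a tuple-encoded expression tree; the encoding here is an assumption for illustration, not the internal BP representation.

```python
def node_count(tree):
    """Number of nodes in a tuple-encoded tree: (op, child, ...) for
    functions, plain strings for terminal leaves."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(node_count(c) for c in tree[1:])

def depth(tree):
    """Depth measured in edges from the root; a lone terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(c) for c in tree[1:])
```

Tracking these two numbers per generation, as in Figure 2, is enough to detect bloat: size grows while fitness stays flat.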

Comparison with Other Approaches
Finally, we test our best solution (GBVSBP+MCG) on the four databases studied in [2]; the results are in Table 20. The best solution designed by Contreras-Cruz et al. was trained with MSRA-A and then tested with MSRA-BTest, ECSSD, SED2, and iCoseg. We observe that we score highest on three datasets (MSRA-BTest, ECSSD, and iCoseg) while achieving competitive results on SED2. We provide this comparison because [2] does not test their algorithms with the benchmark protocol; therefore, it is hard to make a clear comparison between the two approaches, and the results here serve to illustrate the methodologies' performance. Figure 5 illustrates the image processing through the whole GBVSBP+MCG program.

Conclusions
In this work, we propose a method to improve the ADS model presented in [1]. The method consists of applying an algorithm called GBVS, which surpasses Itti's previous model. GBVS follows a graph-based approach using Markov chains while involving the same stages as the Itti model. Moreover, we follow the idea of combining fixation prediction with a segmentation algorithm to obtain a new method called GBVSBP+MCG to tackle the problem of salient object segmentation. As we show in the experiments, the novel design scores highest on the FT, IMGSAL, and PASCAL-S datasets of the benchmark provided by [15].
These tests show the strength and generalization power of the discovered model compared with others developed manually and with current CNNs such as DHSNET, which is surpassed by more than 12 percentage points on FT. We also give results on four datasets described in [2]. On the IMGSAL database, the ground truth is poorly segmented, and GBVSBP significantly improves the score compared to the original proposal. Therefore, we can say that there is a considerable benefit in combining analytical methods with heuristic approaches. We believe this mixture of strategies can help find solutions to challenging problems in visual computing and beyond. One advantage is that the overall process and the final designs are explainable, which is a hot topic in today's artificial intelligence. This research attempts to advance studies conducted by experts (neuroscientists, psychologists, and computer scientists) by adapting the symbolic paradigm for machine learning to find better ways of describing the brain's inner workings.