Ultrafast Image Categorization in Biology and Neural Models

Humans are able to categorize images very efficiently, in particular to detect the presence of an animal very quickly. Recently, deep learning algorithms based on convolutional neural networks (CNNs) have achieved higher than human accuracy for a wide range of visual categorization tasks. However, the tasks on which these artificial networks are typically trained and evaluated tend to be highly specialized and do not generalize well, e.g., accuracy drops after image rotation. In this respect, biological visual systems are more flexible and efficient than artificial systems for more general tasks, such as recognizing an animal. To further the comparison between biological and artificial neural networks, we re-trained the standard VGG 16 CNN on two independent tasks that are ecologically relevant to humans: detecting the presence of an animal or an artifact. We show that re-training the network achieves a human-like level of performance, comparable to that reported in psychophysical tasks. In addition, we show that the categorization is better when the outputs of the models are combined. Indeed, animals (e.g., lions) tend to be less present in photographs that contain artifacts (e.g., buildings). Furthermore, these re-trained models were able to reproduce some unexpected behavioral observations from human psychophysics, such as robustness to rotation (e.g., an upside-down or tilted image) or to a grayscale transformation. Finally, we quantified the number of CNN layers required to achieve such performance and showed that good accuracy for ultrafast image categorization can be achieved with only a few layers, challenging the belief that image recognition requires deep sequential analysis of visual objects. We hope to extend this framework to biomimetic deep neural architectures designed for ecological tasks, but also to guide future model-based psychophysical experiments that would deepen our understanding of biological vision.


Biological Vision and Ultrafast Image Categorization
What distinguishes a visual scene that includes an animal from one that does not? This question of "animacy detection" is crucial for the survival of any species, especially in regard to the interactions between prey and predators. This constraint has therefore profoundly shaped the way biological visual systems process retinal input. Of particular importance is the fact that this response must be efficient and fast, while keeping energy requirements to a minimum. In addition, these systems must fit the ecological niche of the system under consideration, with the range of patterns to be recognized being different for, say, a lion, a bird, and a human. It is important to note that this biologically-inspired approach shares some similarities and differences with detection algorithms defined in computer vision. Our goal in this paper is both to propose bio-inspired ultrafast image categorization models and to better understand how biological visual systems can efficiently implement such a task [1].
Therefore, let us first define the task of rapidly detecting an animal in a scene (see Figure 1). This task is routinely used in the study of biological vision in the laboratory. Here, based on Rousselet et al. [7], we did not consider images of humans to be part of the animal class, since they seem to represent a class of their own. Human categorization of an animal can be performed with high accuracy (generally over 80% correct), very quickly [8], and is robust to geometric transformations [7]. Color seems to have little effect, whereas some low-level statistics [9], as well as other factors (such as the animal's position and size in the scene), may influence accuracy but not speed [10]. Accuracy is maximal when the animal is in the center of the visual field [11], but performance is still above chance level (at about 60%) at extreme eccentricities of about 70°. Such a task is performed seamlessly in parallel, so that multiple images can be categorized at once [12]. Surprisingly, once the task is learned, novel images are processed as quickly as familiar ones [13]. Given the difficulty of modeling this task, a scientific question is to understand what features in the image configuration are sufficient to produce such an effective behavioral response [14].
Figure 1. In ultrafast image categorization, the task is to report whether a briefly flashed image contains a class of object, such as an animal [3]. The presentation time can be on the order of 20 ms, and the response is, for example, the pressing or not pressing of a button. Representative images for distractors and targets are shown here for two classes: 'animal' and 'artifact'. Note that these tasks are a priori independent and that an animal target can be either a target or a distractor for the other task.

Feed-Forward Models of Ultrafast Image Categorization
Designing the best algorithm to solve ultrafast image categorization, as implemented in biological systems, is one possible way to answer this question. In this case, there are major constraints in the dynamics of vision, especially related to the limits of axonal transduction speed, which can lead to major difficulties in modeling the system [15]. In the case of the ultrafast go/no-go categorization task, two consequences follow from these physiological constraints: first, the response must be made quickly and therefore must be open-loop, i.e., before the action can take effect; second, since the whole process involves several processing steps before recurrent loops can refine neural activity, the flow of information is predominantly feed-forward [16]. This has been confirmed by EEG recordings of humans performing the task, showing that top-down signals (such as context or expectation) can influence categorization, but that the process is mainly a bottom-up, feed-forward process [17]. We can also expect that there should be a trade-off between accuracy and speed for image categorization algorithms [18].
Given the problem of designing the best algorithm to solve ultrafast image categorization in biologically inspired systems, it was previously shown that such a feed-forward architecture may be sufficient to perform the task [19]. This architecture consists of a sequence of layers that interleave a linear and a nonlinear process. This is similar to the simple and complex sublayers observed in the primary visual cortex. The linear part of the processing is performed by a convolutional operator, hence the name convolutional neural network (CNN) for this class of architecture. The nonlinear operation is often a simple rectifying unit, similar to the integration process that transforms the analog input to a neuron into a (positively defined) firing rate. In these architectures, the layer's resolution generally becomes progressively coarser along the levels of the hierarchy, until a few classification layers provide the final output [20]. The efficiency of this model yielded results comparable to humans performing the task on the same images [19]. Other popular methods use oriented luminance gradient histograms [21], but with a similar architecture, in which a sequence of processing steps in image space is followed by a classification step. Remarkably, these CNN architectures mirror that of the primate visual system, wherein the retinal image is transmitted from the thalamus to the primary visual cortex and then follows a path along the temporal lobe [16,22].

Related Work
Since their adoption as modeling tools, feed-forward architectures have been instrumental in the breakthrough of deep learning architectures, in particular in providing human-like performance for the PASCAL [23] and IMAGENET [24] challenges, that is, classifying millions of images into over 1000 different categories (labels). An important aspect of these architectures, originally inspired by neuroscience, is that they can be trained in a supervised manner, i.e., by associating each image with a given label in the training phase. This was illustrated for the MNIST challenge of classifying handwritten digits by associating each image of a digit with its recognized value [20]. This training process optimizes a given loss function applied to each image-label pair, which allows the weights of the network to be progressively adjusted using gradient descent. In particular, a CNN such as VGG 16 is a well-optimized architecture for the IMAGENET categorization challenge [25]. Therefore, we decided to use VGG 16 with IMAGENET as a starting point to better understand the process underlying ultrafast categorization of natural images, while bridging our knowledge between neuroscience and computer science.
The task defined for the IMAGENET [24] challenge could be considered computer-specific, since it requires choosing among 1000 labels, which implies knowing and remembering these 1000 labels to make the choice. Artificial neural networks can easily compare these 1000 possibilities simultaneously, whereas humans cannot; one can instead use a subset of behaviorally relevant labels to make the task more relevant to humans. Since we defined a novel task, it is then possible to "re-train" these CNNs to categorize images by defining a novel set of supervised pairs (e.g., an image containing an umbrella associated with the synset "artifact"). For the original IMAGENET [24] challenge, each input-output data pair consists of an input data point (an image from the IMAGENET database) and its corresponding output label (e.g., an image containing an umbrella associated with the label "umbrella"). The idea is to take the knowledge gained from one task and transfer it to a different but related task by using the right training pairs to re-train the CNN; this method is called transfer learning [26]. The advantage of using this method is that one can more easily explore the space of all possible architectures by adjusting the synaptic weights of the convolutional kernels, but also by testing the meta-parameters of the CNNs, such as the number of layers, the number of channels in each layer, or the coarsening of the visual information along the hierarchy [27]. Note that, at the extreme, even the best CNN may not be able to learn to categorize an image-independent feature, such as whether the calendar day on which the photo was taken is odd or even. For instance, we will show below that, following that logic, if we define a task consisting of random labels among the 1000 categories of IMAGENET, then none of our tested architectures can learn this task efficiently.
Finally, while a drawback of these networks is their lack of interpretability, we will exploit the fact that their raw efficiency gives a lower bound for the possibility of solving a given task.
Indeed, compared to random labels, the situation is different when defining more ecological tasks, such as categorizing animals or artifacts. This method can also be used by changing the definition of the supervision pairs to study changes in task context, rapid categorization, and object interference in the image [28]. A fitting question might be, "Is there an animal in this image?", since it reduces this human-machine bias by reducing the choices while maintaining a sufficiently complex and documented question. Searching for these kinds of categories seems to be a primordial function of the brain [16,29]. For example, using a set of specific stimuli, it has been shown that categories can be found in the brain areas of rhesus monkeys and that these categories can then be learned by artificial neural networks [30]. Our goal here is to obtain a model that is more faithful to the physiological data. In summary, and somewhat counterintuitively, compared to biological systems, it may be more difficult for a neural model to make a choice between only two alternatives, such as detecting an animal in an image, than to choose from 1000 labels [31,32]. This work will allow us to better understand how this is achieved in both biology and computational neuroscience models.

Main Contributions
What distinguishes an image with an animal from an image without an animal? To answer this scientific question, our work proposes three major contributions. First, to define the psychophysical task, we built a script to build large, arbitrary datasets of images based on IMAGENET [24]. It was defined by selecting labels according to a large semantic graph of English words: WORDNET [33] (see Figure 2). According to our scientific question, we first defined our task as the categorization of an animal in an image. As a control, we also defined an independent task consisting of detecting the presence of any artifact in the image (see Figure 1C,D). Second, we re-trained the existing VGG 16 model on these tasks and compared its performance with experimental data. This allowed us to test the robustness of our networks to different geometric transformations and to compare their accuracy with that observed in the physiological data. In addition, we compared the accuracy for both tasks, individually and jointly. Third, we tested different levels of complexity of such models by performing a gradual removal of layers from the original network. This experiment quantified whether low-level features could be sufficient to categorize animals [34] (although it is known that the global image statistics [35] or the spatial frequency envelope is not sufficient to categorize images [36]) and whether this could be accompanied by a decrease in invariance to geometric deformations. Finally, we discuss how this work can be useful in the design of future physiological experiments and in the design of novel computer vision architectures.

Building the Dataset Maker Library Using the WORDNET Hierarchy
To re-train a deep convolutional network (such as VGG 16) for a specific task, one of the most important components is the dataset. We needed a tool that would allow us to generate datasets suitable for answering our question. Therefore, we created a library that, from a keyword, generates a dataset with image folders containing (target) or not containing (distractor) this keyword [37]. For this, we used the corresponding set of labels from the IMAGENET database [24], which is based on a large lexical database of the English language: WORDNET [33]. The nouns, verbs, adjectives, and adverbs in this database are grouped into sets of cognitive synonyms (synsets), each of which expresses a different concept. These synsets are linked to each other using a few conceptual relations (see Figure 2). For example, if we set the dataset maker with the keyword 'animal', we used the hyperonym link to determine that a German Shepherd is a type of dog and that a dog is a type of 'animal', thus defining a hyperonym path. In this example, the synset 'animal' from WORDNET is in the hyperonym path of the label 'German Shepherd' in IMAGENET. Based on this relationship, the dataset maker selected a specific subset of labels in the IMAGENET database to build our datasets. Once the list of labels corresponding to our task was selected, the dataset maker randomly selected from the URLs provided for the IMAGENET challenge [24] to download the images that make up the dataset.
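The label selection described above can be illustrated with a toy hyperonym graph. This is a minimal sketch and not the code of the dataset maker [37]: the `HYPERNYM` dictionary, the label names, and the helper functions are hypothetical stand-ins for the WORDNET relations, and each synset is assumed to have a single parent, which is a simplification.

```python
# Toy "is-a" graph (child -> parent). Hypothetical stand-in for WORDNET,
# where a synset may in reality have several hyperonyms.
HYPERNYM = {
    "german_shepherd": "dog",
    "dog": "animal",
    "lion": "animal",
    "animal": "entity",
    "umbrella": "artifact",
    "artifact": "entity",
    "daisy": "plant",
    "plant": "entity",
}


def hypernym_path(label):
    """Return the chain from a label up to the root, label included."""
    path = [label]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def split_labels(labels, keyword):
    """Targets have the keyword on their hyperonym path; distractors do not."""
    targets = [lab for lab in labels if keyword in hypernym_path(lab)]
    distractors = [lab for lab in labels if keyword not in hypernym_path(lab)]
    return targets, distractors


labels = ["german_shepherd", "lion", "umbrella", "daisy"]
targets, distractors = split_labels(labels, "animal")
print(targets)      # labels kept as 'animal' targets
print(distractors)  # labels kept as distractors
```

With the keyword 'animal', the German Shepherd and lion labels become targets because 'animal' lies on their hyperonym path, while the remaining labels become distractors.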
With this tool, we generated datasets according to two given tasks. In particular, we generated the dataset necessary to train the network to answer our question, "Is there an animal in this scene?", by selecting the 'animal' synset. To answer the question, "Is there an artifact in this scene?", we followed the same protocol. As a control, we also created a 'random' dataset, which was generated by randomly selecting 500 labels from the IMAGENET database. The latter was generated to infer the role of the possible links between arbitrary labels by measuring the resulting efficiency of categorization by a deep convolutional network. In summary, we used Dataset Maker to generate three datasets: One based on the 'animal' synset, one based on the 'artifact' synset, and a 'random' one. Each newly generated dataset contains a 'test', 'validation', and 'train' set (with 1200, 800, and 2000 images, respectively). Each set contained a 'target' and a 'distractor' category (both with the same number of images). All networks were trained on the 'training' set and tested during training on the corresponding 'validation' set. We then computed accuracies using the 'test' set. As a control, we also tested the networks on the dataset from Serre et al. [19], which contains a total of 600 targets (images with an animal) and 600 distractors (images without an animal).
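The balanced split described above can be sketched as follows. The set sizes come from the text; the function name `make_splits`, the image identifiers, and the fixed random seed are assumptions made for illustration, not the dataset maker's actual interface.

```python
import random

# Number of images per set, as stated in the text (half targets, half distractors).
SPLIT_SIZES = {"test": 1200, "validation": 800, "train": 2000}


def make_splits(target_ids, distractor_ids, sizes=SPLIT_SIZES, seed=0):
    """Draw disjoint, class-balanced sets from two pools of image identifiers."""
    needed = sum(size // 2 for size in sizes.values())
    if min(len(target_ids), len(distractor_ids)) < needed:
        raise ValueError("not enough images to fill all sets")
    rng = random.Random(seed)
    targets, distractors = list(target_ids), list(distractor_ids)
    rng.shuffle(targets)
    rng.shuffle(distractors)
    splits, start = {}, 0
    for name, size in sizes.items():
        half = size // 2
        splits[name] = {
            "target": targets[start:start + half],
            "distractor": distractors[start:start + half],
        }
        start += half  # move past the images already assigned
    return splits


# Hypothetical identifiers standing in for downloaded images.
target_ids = [f"target_{i}" for i in range(2000)]
distractor_ids = [f"distractor_{i}" for i in range(2000)]
splits = make_splits(target_ids, distractor_ids)
print({name: len(s["target"]) + len(s["distractor"]) for name, s in splits.items()})
```

Slicing consecutive ranges of the shuffled pools guarantees that no image appears in more than one set.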

Transfer Learning
We used the transfer learning method to re-train networks [26]. This method takes the knowledge gained from one task and applies it to a different but related task. We used an existing network that had been pre-trained on a specific task: VGG 16 [25]. This architecture was loaded using the PYTORCH library [38], with weights pre-trained on the database used to solve the IMAGENET [24] task. We had previously found that this model provided the best trade-off between accuracy and complexity [39]. It also achieves a good model of biological function as measured by the Brain score [40]. For these reasons, we preferred VGG 16 to other architectures such as ResNet. Two notable advantages of the transfer learning method are the robustness and the convergence speed of training. This results in lower total execution time and energy consumption. In particular, this method allowed us to save computational time during the learning process and thus experiment with several possible strategies (see Figure 3). We first validated this hypothesis by training, as a control, a network initialized with random weights: VGG SLS (Supervised Learning from Scratch).
During transfer learning, we kept all layers of the VGG 16 network, since it is already capable of performing feature extraction on natural scenes, and re-trained only the last fully connected layer. In particular, we replaced this last layer trained on IMAGENET (i.e., with a vector of dimension K = 1000 that captures the predicted probability of detection for each of the labels in the IMAGENET database) with a layer whose output dimension is simply K = 1 and which represents the predicted probability of detection of a new object of interest, i.e., a target rather than a distractor. We then re-trained this fully connected layer, while freezing the weights of the other layers, to match the pre-trained features (the output of the convolutional layers) with the synset corresponding to the new task implemented by the dataset maker.
Following this process, we re-trained a network that we call VGG TLC (Transfer Learning on Classification layers). As a control, we also tested the effectiveness of freezing the remaining layers by completely re-training all layers of the pre-trained network, VGG TLA (Transfer Learning on All layers). Note that these two networks were trained without any form of data augmentation.
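The principle behind VGG TLC, a frozen feature extractor followed by a re-trained single-output layer, can be sketched on a toy problem. This is not the PYTORCH code used in the study: the two-dimensional inputs, the fixed matrix `FROZEN_W` standing in for the pre-trained layers, the plain gradient descent, and the learning rate are all assumptions chosen to keep the example self-contained.

```python
import math
import random

# Stand-in for the frozen pre-trained layers: these weights are never updated.
FROZEN_W = [[1.0, 1.0], [1.0, -1.0]]


def features(x):
    """Frozen feature extraction (a fixed linear map followed by tanh)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, labels, lr=0.5, epochs=200):
    """Fit a single-output (K = 1) linear layer on top of the frozen features."""
    feats = [features(x) for x in data]  # computed once, since the extractor is frozen
    w, b = [0.0] * len(feats[0]), 0.0
    n = len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(feats, labels):
            # Gradient of the binary cross-entropy with respect to the logit.
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            grad_w = [g + err * fi / n for g, fi in zip(grad_w, f)]
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Predicted probability that x is a target."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, features(x))) + b)


rng = random.Random(0)
data = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1.0 if x[0] + x[1] > 0 else 0.0 for x in data]  # toy 'target' rule
w, b = train_head(data, labels)
accuracy = sum((predict(x, w, b) > 0.5) == (y == 1.0) for x, y in zip(data, labels)) / len(data)
print(round(accuracy, 2))
```

Only `w` and `b` change during training. Re-training all layers, as in VGG TLA, would correspond to also updating `FROZEN_W`.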

Since the network is asked to make a binary decision during training ("Is this synset present in this scene?"), we implemented the loss using the binary cross-entropy loss with logits from the PYTORCH library. We used the stochastic gradient descent (SGD) optimizer from the PYTORCH library and validated parameters such as batch size, learning rate, and momentum by performing a sweep of these parameters for each network. During the sweep, we varied one of these parameters over a given range while leaving the others at their default values for 25 epochs. We chose the parameter values that gave the best average accuracy on the validation set: batch size = 8, learning rate = 0.00005, momentum = 0.99. Then, to increase the generality of our results, we applied various preprocessing steps to the inputs to introduce more variation into the training dataset, a procedure known as data augmentation. Starting from the VGG TLC protocol, we tested the effectiveness of this data augmentation using two strategies. First, we re-trained a pre-trained network with a set of custom transformations from the PYTORCH library: random horizontal flipping (with p = 0.5), random vertical flipping (with p = 0.5), a random rotation (p = 1), and random grayscale (with p = 0.5); this defines the VGG TLDA (Transfer Learning with Data Augmentation) model. Second, the input images were distorted using the auto-augment function from the PYTORCH library. This function implements a total of 16 randomly parameterized affine transformations on the inputs to perform data augmentation [41], thus defining the VGG TLAA (Transfer Learning with Auto Augment function) model. Finally, we studied a VGG RANDOM model trained on the 'random' dataset (that is, consisting of two categories defined by randomly chosen labels among the 1000 labels of IMAGENET).
Note that, although we implemented all transfer learning strategies on this dataset, as the results were similar for all strategies, we chose to display the networks obtained using the same training protocol as VGG TLAA.
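The custom augmentation used for VGG TLDA can be sketched in pure Python on a tiny image stored as nested lists of RGB tuples. This is an illustration only and not the PYTORCH transformations used in the study. In particular, the rotation is restricted here to quarter turns (an assumption made to avoid interpolation), and the luminance weights are the common ITU-R 601 coefficients.

```python
import random


def hflip(img):
    """Mirror the image left-right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top-bottom."""
    return img[::-1]


def rot90(img):
    """Rotate the image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def to_gray(img):
    """Replace each RGB pixel by its luminance, repeated on the three channels."""
    def gray(p):
        y = 0.299 * p[0] + 0.587 * p[1] + 0.114 * p[2]
        return (y, y, y)
    return [[gray(p) for p in row] for row in img]


def augment(img, rng):
    """Apply the random transformations with the probabilities given in the text."""
    if rng.random() < 0.5:            # horizontal flip, p = 0.5
        img = hflip(img)
    if rng.random() < 0.5:            # vertical flip, p = 0.5
        img = vflip(img)
    for _ in range(rng.randrange(4)):  # rotation always drawn (p = 1); quarter turns only here
        img = rot90(img)
    if rng.random() < 0.5:            # grayscale, p = 0.5
        img = to_gray(img)
    return img


image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
augmented = augment(image, random.Random(0))
print(augmented)
```

Each call to `augment` draws a new combination of transformations, so the network sees a different variant of every image at each epoch.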

Pruning
Another network manipulation that we tested is the modification of the CNN architecture. In particular, we tested the effect of pruning the convolutional layers of the pre-trained network VGG 16 to determine the complexity of the features required to categorize a given synset of interest. The VGG 16 network can be described as a hierarchically organized pipeline: first, a set of convolutional layers, then a set of fully connected layers [25]. The 13 convolutional layers are organized into 5 blocks. Within a block, each convolutional layer is followed by a nonlinearity and optionally a normalization (in our case, we did not use or test the batch normalization option). Within each block, the image size and the number of channels are constant. In general, the resolution decreases from block to block using max-pooling operations, while the number of channels increases from 64 in the first block to 512 in the last convolutional blocks.
Since the final stage of the set of convolutional layers is an adaptive pooling function that produces a feature map of constant size equal to 7 × 7, the size and architecture of the fully connected layers were kept constant. We named the new networks according to the number of convolutional layers removed: the network with one layer removed is called "vgg minus one", i.e., "VGG-1". The network VGG-1 thus had only its last convolutional layer pruned, after which we applied the same learning process as for the network VGG TLAA. We did the same for each of the 12 depth factors, from the deepest (VGG-1) to the shallowest (VGG-12).
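The naming scheme can be made concrete with a short sketch that lists which convolutional layers remain for a given depth factor. The block structure below is the standard VGG 16 configuration; the assumption that layers are removed one at a time from the deep end, as well as the helper names, are ours and only illustrate the bookkeeping.

```python
# Standard VGG 16 convolutional stack: (number of conv layers, channels) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def conv_layers(blocks=VGG16_BLOCKS):
    """One (block index, channels) entry per convolutional layer, in order."""
    return [(i, ch) for i, (n, ch) in enumerate(blocks, start=1) for _ in range(n)]


def prune(n_removed, blocks=VGG16_BLOCKS):
    """Remaining convolutional layers after removing the n_removed deepest ones."""
    layers = conv_layers(blocks)
    if not 0 <= n_removed < len(layers):
        raise ValueError("n_removed must leave at least one convolutional layer")
    return layers[:len(layers) - n_removed]


def network_name(n_removed):
    return f"VGG-{n_removed}"


for n in (1, 8, 12):
    remaining = prune(n)
    block, channels = remaining[-1]
    print(network_name(n), len(remaining), "conv layers, last one in block",
          block, "with", channels, "channels")
```

Under this assumption, VGG-8 keeps five convolutional layers and ends in the third block, while VGG-12 keeps only the first convolutional layer.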

Accuracy
The accuracy metric is used to describe the performance of the model. The network is expected to output a binary decision ('Is there an animal in the scene?') and is designed to provide the predicted probability of the presence of a synset of interest in the scene. We considered the output a 'target' if the network output was greater than 0.5 (i.e., 50%); otherwise, it was considered a 'distractor'. A true positive was defined as the case in which the network categorized an image as a 'target' when it was indeed a 'target'; if the image was in fact a 'distractor', the response was a false positive. Similarly, a true negative was defined as the case in which the network categorized an image as a 'distractor' when it was indeed a 'distractor'; if the image was in fact a 'target', the response was a false negative. Based on these counts, we calculated the accuracy as the ratio of the number of correct categorizations (true positives plus true negatives) to the total number of samples tested. To provide a comparison with the state of the art, we tested the VGG 16 network and computed its prediction by summing, after the softmax layer, the predictions of the labels whose hyperonym path contains the synset of interest, hence VGG LUT (Look Up Table). Accuracy was then computed using the same methodology as for the re-trained networks. We evaluated the accuracy of our different networks on the test set using Equation (1):

$$\mathrm{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{True negative} + \text{False positive} + \text{False negative}} \qquad (1)$$
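Both the accuracy of Equation (1) and the look-up-table read-out of VGG LUT can be written in a few lines. The sketch below uses made-up probabilities and a four-label softmax vector purely for illustration; the function names and the indices of the 'animal' labels are hypothetical.

```python
def accuracy(probabilities, is_target, threshold=0.5):
    """Accuracy as in Equation (1), from predicted probabilities and ground truth."""
    tp = tn = fp = fn = 0
    for p, truth in zip(probabilities, is_target):
        predicted_target = p > threshold
        if predicted_target and truth:
            tp += 1
        elif predicted_target and not truth:
            fp += 1
        elif not predicted_target and truth:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + tn + fp + fn)


def lut_probability(softmax_output, target_indices):
    """Sum the softmax probabilities of all labels under the synset of interest."""
    return sum(softmax_output[i] for i in target_indices)


# Made-up network outputs for four images and their ground truth.
probs = [0.9, 0.2, 0.6, 0.4]
truth = [True, False, False, True]
print(accuracy(probs, truth))  # one TP, one TN, one FP, one FN

# Made-up softmax output over four labels, two of which are 'animal' labels.
softmax_output = [0.5, 0.25, 0.125, 0.125]
animal_indices = [0, 2]
print(lut_probability(softmax_output, animal_indices))
```

The summed probability is then thresholded at 0.5 exactly like the output of a re-trained network, so the two kinds of models can be compared with the same metric.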

Performances on Natural Scenes Containing Animals without Transfer Learning
Testing the initial pre-trained network should be one of the first experiments before re-training the neural networks. When tested on images from the dataset on which it was trained, it indeed performed very well, with a mean accuracy of 0.99 for categorizing an animal (and 0.98 for an artifact; see Figure 4). These results are strikingly high compared to human behavioral results and highlight one difference between human and machine intelligence.
Figure 4. [...] [37]. For each dataset, the network is tested with original images (left) or after applying a random rotation (right). The dotted line represents the chance level for all graphs.
However, as soon as we added a perturbation such as a random rotation (images similar to [42]) into the same dataset, the performance dropped to 0.85 for the presence of an animal and 0.83 for the detection of an artifact, on par with human performance. Note that if the labels chosen in the task definition have no semantic link, as is the case for the test on the 'random' dataset, the network cannot perform a correct categorization, with or without rotation, and it yields an accuracy close to chance level.

Performances on Natural Scenes Containing Animals with Transfer Learning
We then tested different variations of transfer learning on the task, "Is there an animal in this visual scene?", and report the mean accuracies for different datasets in Table 1. First, the VGG LUT network appears to be robust, as validated on the dataset used by Serre et al. [19], on which it achieves a mean accuracy of 0.95. Note that, without any re-training, this exceeds the accuracy of about 0.84 obtained by the model designed in Serre et al. [19] and of about 0.80 in psychophysics. Now, let us focus on the networks after the transfer learning process: VGG TLC, VGG TLA, VGG TLDA, and VGG TLAA reached similar levels of performance on the test set (0.97, 0.96, 0.97, and 0.95, respectively) and also maintained robust categorization on the Serre et al. [19] dataset (0.94, 0.92, 0.91, and 0.88, respectively). Compared to VGG SLS, which could only reach 0.64 on the same task, these results show that transfer learning allows us to obtain highly accurate networks for the categorization of a synset of interest. Note that the low performance of VGG SLS is only due to the computational limits that we imposed in our study. We then focused on the robustness of the categorization obtained with the different re-training strategies (VGG TLC, VGG TLA, VGG TLDA, and VGG TLAA) compared to the state-of-the-art VGG LUT and to the performance expected from neurobiological models.

Robustness of the Categorization with Different Geometric Transformations
Since we were looking for the best robustness for this task, we tested VGG TLC, VGG TLA, VGG TLDA, VGG TLAA, and VGG LUT on the newly constructed dataset using our dataset maker library with the synset 'animal'. We applied either a grayscale filter or a vertical or a horizontal reflection to the input (see Table 2). We also tested the robustness to rotation by rotating the image around the center by an angle ranging from −180° to +180° (see Figure 5). All these networks maintained good average accuracy on the flipped dataset and on the grayscale dataset (see Table 2). These results were consistent with psychophysical results showing that ultrafast categorization is robust to a grayscale transformation [10]. Only VGG TLDA and VGG TLAA seemed to show robust accuracy at all angles, with peaks in accuracy at the cardinal orientations (−180°, −90°, 0°, 90°, and 180°), which could be explained by the pre-training weights of the networks, as they correspond to the peaks found in the categorization of VGG 16. We conclude here that data augmentation provides a more robust categorization of the synset of interest by the network, as VGG TLDA and VGG TLAA achieve better performance in this task. In addition, the protocol used to re-train VGG TLAA, with the auto-augment function of the PYTORCH library [41], compares favorably with our custom data augmentation: the performance of VGG TLAA is very close to that of VGG TLDA, with a tendency for VGG TLAA to be more robust to rotation. Therefore, the VGG TLAA network is the best fit for psychophysical observations due to its stability and robustness of categorization to different image transformations [7,42]. In the following, we therefore focused on exploring the features that this model relies on to perform its categorization.
Table 2. Mean accuracies for ultrafast image categorization of an animal in a scene using various geometric transformations on the input: vertical flip, horizontal flip, grayscale filter. These transformations were applied to the dataset built using our dataset maker library with the synset 'animal', for four re-trained networks: VGG TLC, VGG TLA, VGG TLDA, and VGG TLAA. The results are compared with those of the state-of-the-art network VGG LUT. All the transformations used here were performed using the PYTORCH library [38].

What Features Are Necessary to Achieve the Task?
We designed an experiment in which we gradually removed layers from the pre-trained VGG 16 network for 12 different "depth" factors. For each level, we tested the re-trained pruned networks on categorizing an animal in a scene from our IMAGENET-based dataset. VGG LUT and VGG TLAA achieved the best accuracy for this task (see Figure 6). The accuracies of the pruned networks remained similar to the performance found by Serre et al. [19], with a slight drop between VGG-9 and VGG-12. This is not a surprise, as their model relied on low-level features [34]. Note that the computational time required to perform the categorization decreased as layers were removed (on a Quadro RTX 5000 GPU, we obtained 0.005 ± 0.0001 s for VGG TLAA and 0.003 ± 0.0001 s for VGG-8; see Table 3).
We also tested all pruned networks on our IMAGENET-based dataset by rotating the image around the center from −180° to +180°; in this case, the categorization may lose robustness with fewer layers (see Figure 7). Indeed, as the number of layers decreased, the mean accuracy after rotation decreased and its standard deviation increased (VGG TLAA = 0.91 ± 0.02, VGG-8 = 0.73 ± 0.04). Although the networks seem to be able to categorize an animal with fewer layers, they seem to trade this advantage for a lower robustness to transformations such as rotations.
To get a better idea of the size of the feature maps needed to categorize an animal in a scene, we tested the networks on a new "shuffled" dataset, where each image had been divided into square patches of a given size that were then shuffled to generate a new image [43]. Since CNNs are by construction robust to translation, patch translation should have minimal impact on categorization unless it breaks some necessary patterns in the images. With few layers, the networks should rely on low-level features to perform their categorization, and indeed we obtained an idea of the size of feature maps required for different depths. In fact, between patch sizes 256 × 256 and 64 × 64, the categorization of the networks was robust to this transformation (see Figure 8). However, as soon as we reached the patch size of 32 × 32 pixels, the accuracy of all networks dropped sharply. Furthermore, there seemed to be a transition between deeper and medium networks, as the latter gave better average accuracies for this task. As a consequence, the size of the feature maps needed to perform such a task varies with the depth of the network. For example, VGG TLAA appears to rely on feature map sizes between 32 × 32 and 64 × 64 pixels, as its accuracy drops once the patch size falls from 64 × 64 to 32 × 32 pixels (see Figure 8); however, further study is needed to quantify this feature map size. In a future application, we could extract feature maps from these low-level layers to better understand the features needed to perform this task. This would allow us to design a stimulus set for a psychophysical task such as in Thorpe et al. [3]. Such a test could help determine whether these features are sufficient to categorize an animal in a flashed scene.
Figure 8. Accuracy on the test dataset, where we applied a shuffle transformation to the input image. We show the results as we decreased the size of the shuffled patches on the images. The networks were retrained to categorize animals and tested on datasets based on IMAGENET images created using the 'animal' synset. The index after "vgg-" indicates the number of convolutional layers pruned in the networks. The dotted line represents the chance level for all plots.
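The shuffle transformation used for this test can be sketched as follows: the image is cut into non-overlapping square patches, and the patches are placed back in a random order. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `shuffle_patches` is hypothetical, and the image sides are assumed to be exact multiples of the patch size.

```python
import numpy as np


def shuffle_patches(image, patch_size, rng):
    """Cut `image` into square patches and reassemble them in random order."""
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    p = patch_size
    # Collect the patches in row-major order.
    patches = [image[r * p:(r + 1) * p, c * p:(c + 1) * p]
               for r in range(rows) for c in range(cols)]
    order = rng.permutation(len(patches))
    out = np.empty_like(image)
    # Write patch number `src` into grid slot number `dst`.
    for dst, src in enumerate(order):
        r, c = divmod(dst, cols)
        out[r * p:(r + 1) * p, c * p:(c + 1) * p] = patches[src]
    return out


rng = np.random.default_rng(0)
toy_image = np.arange(64).reshape(8, 8)
shuffled = shuffle_patches(toy_image, 4, rng)
```

When the patch size equals the image size, there is a single patch and the image is returned unchanged. Pixel values within each patch are preserved, so only structure larger than a patch is disrupted, which is what allows the patch size to probe the spatial extent of the features a network relies on.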

Dependence of Accuracy Scores between the Two Tasks
We examined the dependence of the learning performance of VGG TLAA between the two tasks by varying the synset of interest used in the construction of the dataset. We used our dataset maker tool with the keyword 'artifact', thus generating a new network trained to categorize the presence of the 'artifact' synset in a natural scene: VGG ARTIFACT. We displayed the average accuracy of the network trained to detect the 'animal' synset (here, VGG ANIMAL stands for our VGG TLAA) on the dataset constructed with the 'animal' synset (respectively, of the network trained to detect the 'artifact' synset on the dataset constructed with the 'artifact' synset). Next, we tested the network trained to detect the 'animal' synset on the dataset constructed with the 'artifact' synset and vice versa. Here, by comparing the predictions for the 'animal' and 'artifact' synsets, we highlight a bias in the composition of the dataset. Although the outputs are independent, the 'animal' images confidently match the 'non-artifact' images (and vice versa), thus facilitating global detection (see Figure 9A,B). To infer the influence of this bias on the performance of the network, we used the dataset maker to generate a dataset based on the 'animal' synset where, in addition to not being animals, the distractors would also not be 'artifacts'. This defines the 'strictly animal' set (respectively, one defines the 'strictly artifact' set based on the 'artifact' synset where, in addition to not being an artifact, the distractors would also not be an 'animal'). Once this distinction was made, although there was a loss in performance for both networks, they remained fit for their respective tasks by maintaining an accuracy above 0.8. On the other hand, they did not seem to be able to predict the absence of their respective synsets once the set of distractors was modified (see Figure 9B,D).
These results reinforce the argument that, despite task independence, the composition of the dataset can generate bias in network categorization.
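The cross-task comparison described above can be illustrated with a small sketch: the same-task accuracy scores a network on its own labels, the cross-task accuracy scores the absence of one synset as a proxy for the presence of the other, and a mask selects the distractors allowed in the 'strict' sets. All names (`accuracy`, `cross_task_accuracy`, `strict_distractor_mask`), the 0.5 threshold, and the toy values are hypothetical and are not taken from the paper's code or data.

```python
import numpy as np


def accuracy(pred, target):
    """Fraction of images where the binary prediction matches the label."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    return float(np.mean(pred == target))


def cross_task_accuracy(p_other, target, threshold=0.5):
    """Accuracy when 'other synset absent' is used as a proxy for `target`."""
    return accuracy(np.asarray(p_other) < threshold, target)


def strict_distractor_mask(has_animal, has_artifact):
    """Distractors allowed in the 'strict' sets: neither animal nor artifact."""
    has_animal = np.asarray(has_animal, dtype=bool)
    has_artifact = np.asarray(has_artifact, dtype=bool)
    return ~has_animal & ~has_artifact


# Toy values (hypothetical): output of an 'animal' detector on six images.
p_animal = np.array([0.9, 0.8, 0.2, 0.1, 0.7, 0.3])
has_animal = np.array([1, 1, 0, 0, 1, 0], dtype=bool)
has_artifact = np.array([0, 0, 1, 1, 1, 0], dtype=bool)

same_task = accuracy(p_animal >= 0.5, has_animal)
cross_task = cross_task_accuracy(p_animal, has_artifact)
strict_mask = strict_distractor_mask(has_animal, has_artifact)
```

In this toy example only the last image qualifies as a strict distractor. A cross-task accuracy well above chance on a real dataset would indicate the kind of composition bias discussed in the text.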
As a control, we tested the VGG RANDOM network on the corresponding dataset (see Section 2.1 for details). As it obtains an average accuracy close to the one obtained with the VGG SLS network, its poor performance can be explained by the fact that the pre-trained weights of the VGG 16 network do not match the new task. Incidentally, this bias is also present in the dataset used by Serre et al. [19]. However, when we compared the performance of humans on this dataset with the performance achieved by the network on an image-by-image basis, we found a high correspondence (about 0.84) in their correct predictions. For some images, both the network and the humans succeeded or failed in categorizing an animal; for others, the network was wrong while the humans responded correctly on average, and vice versa (see Figure 10). We have displayed images where either a human alone or both a human and our model failed to categorize an animal in the scene, as this may reflect the specific features that humans or our models rely on to perform their categorization. This close relationship between human and network responses could allow us to select images and design physiological and psychophysical tests to infer the features necessary for such detection.
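One way to organize such an image-by-image comparison is a 2 × 2 table of outcomes (both correct, human only, model only, both wrong), from which an agreement rate can be derived. The sketch below is only one possible reading: the text does not specify how the correspondence of about 0.84 was computed, and the names `outcome_table` and `agreement` are hypothetical.

```python
import numpy as np


def outcome_table(human_correct, model_correct):
    """Count images in each of the four human/model outcome cells."""
    h = np.asarray(human_correct, dtype=bool)
    m = np.asarray(model_correct, dtype=bool)
    return {
        "both_correct": int(np.sum(h & m)),
        "human_only": int(np.sum(h & ~m)),
        "model_only": int(np.sum(~h & m)),
        "both_wrong": int(np.sum(~h & ~m)),
    }


def agreement(human_correct, model_correct):
    """Fraction of images where human and model share the same outcome."""
    table = outcome_table(human_correct, model_correct)
    return (table["both_correct"] + table["both_wrong"]) / sum(table.values())
```

The images falling in the `human_only` and `model_only` cells are the ones that would be of interest when selecting stimuli for follow-up tests.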

Discussion
In this paper, we have shown that we can re-train networks using transfer learning to apply them to an ecological image categorization task and obtain insights into visuo-cognitive processes. Such outcomes could in particular be beneficial when studying impaired systems such as in Autism Spectrum Disorder [44]. These artificial networks achieve accuracies similar to those found in psychophysical responses in humans. In the image processing flow at work in convolutional networks, the position of a feature within the feature maps has no influence on the activation of receptive fields. Since a translation merely shifts the position of features within the maps, these networks are expected to be robust to translation. A rotation, however, constitutes a global perturbation of the features composing the maps. Since the features themselves are then different, a rotation can activate different receptive fields. If the responses to these features were not previously learned, the network will be unable to generalize. This could explain the differences in performance between learning protocols that do or do not involve rotations. Furthermore, the robustness of the categorization is comparable to that found in psychophysical data. In particular, we have shown quantitatively that the categorization of the re-trained networks may be robust to transformations such as rotations, reflections, or grayscale filtering, as is observed in humans [3,7].
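The argument about translation versus rotation can be made concrete with a toy example: a single oriented filter followed by global max pooling gives the same response when the input is shifted, but not when it is rotated. This is a deliberately idealized NumPy sketch (one hand-made filter, wrap-around borders so that shifts are exact); the names `circular_correlate` and `pooled_response` are hypothetical, and real CNNs with padding, strides, and local pooling are only approximately robust to translation.

```python
import numpy as np


def circular_correlate(image, kernel):
    """Cross-correlate `image` with `kernel` using wrap-around borders."""
    out = np.zeros(image.shape, dtype=float)
    for a in range(kernel.shape[0]):
        for b in range(kernel.shape[1]):
            # np.roll with shift -a brings image[i + a] to position i.
            out += kernel[a, b] * np.roll(image, shift=(-a, -b), axis=(0, 1))
    return out


def pooled_response(image, kernel):
    """Rectify the feature map, then apply global max pooling."""
    feature_map = np.maximum(circular_correlate(image, kernel), 0.0)
    return float(feature_map.max())


image = np.zeros((8, 8))
image[2:4, :] = 1.0                   # a horizontal bar
kernel = np.array([[1.0], [-1.0]])    # responds to horizontal edges

original = pooled_response(image, kernel)                                     # 1.0
translated = pooled_response(np.roll(image, (3, 2), axis=(0, 1)), kernel)     # 1.0
rotated = pooled_response(np.rot90(image), kernel)                            # 0.0
```

The shifted bar excites the same filter at a different position, so the pooled response is unchanged. After a 90° rotation the bar is vertical and this filter no longer responds; a network would need a separate, learned filter for that orientation, which is consistent with the benefit of including rotations during training.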
We have studied networks that learn to detect whether an image contains an animal or an artifact. Two independent networks, each re-trained on one of the two categorization tasks, were used to highlight a link, or rather a bias, in this categorization. This kind of bias is also found in humans, seems to impact the categorization as well [45], and could be linked to top-down influences [46]. The question of detecting an animal in an image is indeed tightly linked to that of detecting an artifact, while allowing for the less likely appearance of an animal object (like a teddy bear) or of a non-animal non-object (like a mountain). The study of this kind of bias could possibly allow for building ecologically relevant datasets to maximize the learning process of the networks in order to discover more about the features needed for categorization [47].
While humans and machines reach a similar level of 80% correct categorization in this type of task, they could be driven to make different "mistakes", and these particular examples could then be used as material for the design of psychophysical tests. In addition, these systematic errors could offer a window into some of the processes at work in primate visual pathways. The last part of our study was based on the search for the features necessary for categorization. We found that, in agreement with the studies of Serre et al. [19], a simple feed-forward network based on low-level features was sufficient to perform categorization efficiently. Moreover, we estimated the size of the features needed to be between 32 × 32 and 64 × 64 pixels. Although categorization is still possible at this very low computational cost, we quantitatively show that it gradually loses robustness.

Perspectives
One of the main goals of this study was to provide a comparison for an ecological and well-studied task used in visual neuroscience. Although this study focuses on the analysis of categorization, it is a necessary step for a well-known task in the field of vision: visual search. This task consists of the simultaneous localization and detection of a visual target of interest. Applied to the case of natural scenes, visually searching, for example, for an animal (either prey, a predator, or a partner) constitutes a challenging problem due to large variability over the numerous visual dimensions. Previous models managed to solve the visual search task by dividing the image into sub-areas. This comes, however, at the cost of computer-intensive parallel processing on relatively low-resolution image samples [48,49]. Taking inspiration from natural vision systems [50], we developed a model that was built over the anatomical visual processing pathways observed in mammals, namely the "what" and the "where" pathways [51]. It operates in two steps: the first selects a region of interest, before knowing its actual visual content, through an ultrafast, low-resolution analysis of the full visual field; the second provides a detailed categorization of the selected "foveal" region reached with the saccade [52] (see Figure 11). In this perspective, our work would deepen the knowledge and models necessary for the realization of the "what" pathway. Modeling this dual-pathway architecture offers an efficient model of visual search as active vision. In particular, it allows us to address the shortcomings of CNNs with respect to physiological performances [53]. In the future, we expect to apply this model to better understand visual pathologies in which there exists a deficiency of one of the two pathways [54] while contributing to the field of computer vision.
Figure 11. Model built over the anatomical visual processing pathways observed in mammals, namely the "what" and the "where" pathways: the peripheral pathway (top row) is applied to a large display from a natural scene (A): it is first transformed into a retinotopic log-polar input (B), and the pathway is then trained to return a "saliency map" (C). The latter infers, for different positions in the target, the predicted accuracy value that can be reached by the foveal pathway, mimicking the "where" pathway used for global localization. The position with the best accuracy will feed a saccade system (D), adjusting the fixation point at the input of the foveal pathway (bottom row). It takes a subsample (E) of the large display (A), over which a categorization is done (F), mimicking the "what" pathway.
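The two-step procedure described in the text and in Figure 11 can be sketched as follows: the position with the highest predicted accuracy in the saliency map is selected (the "where" step), and a window of the full-resolution display around that position is then categorized (the "what" step). This sketch makes simplifying assumptions that are not in the original model: the saliency map is taken as given on a coarse Cartesian grid rather than a log-polar one, and the names `select_fixation`, `foveal_crop`, `visual_search`, and the `categorize` callable are hypothetical.

```python
import numpy as np


def select_fixation(saliency):
    """'Where' step: grid cell with the highest predicted accuracy."""
    return np.unravel_index(np.argmax(saliency), saliency.shape)


def foveal_crop(display, center, size):
    """Extract a size x size window around `center`, kept inside the display."""
    h, w = display.shape[:2]
    if size > min(h, w):
        raise ValueError("crop size exceeds display size")
    half = size // 2
    top = int(np.clip(center[0] - half, 0, h - size))
    left = int(np.clip(center[1] - half, 0, w - size))
    return display[top:top + size, left:left + size]


def visual_search(display, saliency, categorize, size):
    """Saccade to the most salient cell, then categorize the foveal crop."""
    row, col = select_fixation(saliency)
    # Map the coarse saliency cell to the center of the matching display block.
    cy = int((row + 0.5) * display.shape[0] / saliency.shape[0])
    cx = int((col + 0.5) * display.shape[1] / saliency.shape[1])
    crop = foveal_crop(display, (cy, cx), size)
    return (cy, cx), categorize(crop)
```

In the model described here, the saliency map would be produced by the peripheral pathway and `categorize` would be a re-trained network such as those studied in this paper; both are left as inputs in this sketch.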

Data Availability Statement:
This work is made reproducible using the following tools. First, the code reproducing all figures is available on GitHub at https://github.com/SpikeAI/2022-09_UltraFastCat/blob/main/Readme.md [55] (accessed on 15 March 2023), and in particular the DataSetMaker code at https://github.com/SpikeAI/DataSetMaker [37] (accessed on 15 March 2023) was used to retrieve images. The paper is available as an arXiv preprint at https://arxiv.org/abs/2205.03635, with links to previous versions and to the code (accessed on 15 March 2023). The associated Zotero group https://www.zotero.org/groups/4560566/ultrafastcat (accessed on 15 March 2023) gathers relevant literature on the subject.

Acknowledgments:
For the purpose of open access, the author has applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission.

Conflicts of Interest:
The authors declare no conflict of interest.