Visuospatial Working Memory for Autonomous UAVs: A Bio-Inspired Computational Model

Abstract: Visuospatial working memory is a fundamental cognitive capability of human beings needed for exploring the visual environment. This cognitive function is responsible for creating visuospatial maps, which are useful for maintaining a coherent and continuous representation of visual and spatial relationships among objects present in the external world. A bio-inspired computational model of Visuospatial Working Memory (VSWM) is proposed in this paper to endow Autonomous Unmanned Aerial Vehicles (UAVs) with this cognitive function. The VSWM model was implemented on a low-cost commercial drone. A total of 30 test cases were designed and executed. These test cases were grouped into three scenarios: (i) environments with static and dynamic vehicles, (ii) environments with people, and (iii) environments with people and vehicles. The visuospatial ability of the VSWM model was measured in terms of the ability to classify and locate objects in the environment. The VSWM model was capable of maintaining a coherent and continuous representation of visual and spatial relationships among interest objects present in the environment even when a visual stimulus was lost because of a total occlusion. The VSWM model proposed in this paper represents a step towards autonomous UAVs capable of forming visuospatial mental imagery in realistic environments.
As future work, we are going to implement our bio-inspired VSWM computational model on UAVs with stereo vision to add depth information to the visuospatial maps. We believe that designing autonomous UAVs capable of forming visuospatial mental imagery can be very useful for tasks such as surveillance and rescue missions, where VSWM is a fundamental cognitive capability for exploring the visual environment. On the other hand, the bio-inspired VSWM computational model proposed in this paper is part of a new brain-inspired cognitive architecture named CogniDron. Therefore, future work also includes modeling and designing other cognitive functions such as creating allocentric spatial maps, an emotional system, planning, decision-making, and learning.


Introduction
Unmanned Aerial Vehicles (UAVs), also known as drones, are becoming more and more popular in the research community because they have been widely studied in order to know their potential usage in different areas such as entertainment [1], marketing [2], healthcare [3], agriculture [4], and security [5]. The Artificial Intelligence (AI) research community has made a great effort to develop "intelligent" UAVs capable of doing tasks in an autonomous or semi-autonomous way [6]. Creating an "intelligent" UAV capable of navigating and interacting autonomously in the real world involves facing formidable challenges because it needs to be able to process multiple tasks simultaneously, such as constantly sensing the environment; identifying and classifying both static and dynamic obstacles and targets; generating an internal representation of the real world, where the spatial relationship information between the environment's objects must be constantly updated; and reasoning and making right decisions to react appropriately when unexpected events appear. New interdisciplinary fields such as cognitive computing [7] and cognitive infocommunication [8] aim to create various bio-inspired engineering applications such as brain-computer interfaces [9] and computational systems capable of mimicking human cognitive functions. In Section 4, a detailed analysis of the results and behavior exhibited by the drone is presented. Finally, Section 5 provides a description of experimental results and some concluding remarks.

A Bio-Inspired Computational Model for Autonomous UAVs
The VSWM of human beings combines stimuli identification with information about their location [16,21]. There are different cognitive processes underlying the building of visuospatial maps in the VSWM. These cognitive processes can be grouped into two major blocks. Such grouping is based on the division of cortical visual processing identified in both humans and non-humans (primates) as a dorsal and a ventral pathway (see Figure 1) [21][22][23]. The human brain areas belonging to the ventral pathway (commonly known as the "What" pathway) have been associated with cognitive tasks focused on identifying and recognizing visual stimuli to generate visual maps of them. On the other hand, the human brain areas belonging to the dorsal pathway (commonly known as the "Where" pathway) have been associated with cognitive tasks focused on processing spatial information of visual stimuli to generate spatial maps of them. Anatomically, the ventral pathway has been described as a multisynaptic pathway projecting from the visual cortex (VC) to the anterior temporal target (TE area) in the inferior temporal cortex (ITC), with a further projection from the ITC to the ventrolateral prefrontal cortex (vlPFC), whereas the dorsal pathway has been described as a multisynaptic pathway projecting from the visual cortex (VC) to the posterior parietal cortex (PPC), with a further projection from the PPC to the dorsolateral prefrontal cortex (dlPFC) [21,24]. The PPC itself is divided into an upper and lower portion: the superior parietal lobe (SPL) and inferior parietal lobe (IPL), respectively. These two lobes are separated from one another by a sulcus called the intraparietal sulcus (IPS). 
Processes involved in the ventral pathway focus on identifying and classifying objects in the environment in order to offer visual information to the VWM [16,21], whereas processes involved in the dorsal pathway focus on identifying locations of objects in the environment in order to offer spatial information to the SWM [16,21].
The bio-inspired VSWM computational model proposed in this paper has taken inspiration from the neural correlates underlying the VSWM that have been identified in both human and non-human (primate) brains. Therefore, the modules of the computational model have been labeled with the names of the brain areas that they represent. Figure 1 shows the brain areas that have been considered in proposing the bio-inspired VSWM computational model. These brain areas are part of the ventral and dorsal pathways, respectively.
The frontal eye fields (FEF) are a frontal brain region that receives converging inputs from both the dorsal and the ventral streams. This brain region contributes to the guidance of saccades. Additionally, the motor cortex (MC) has also been considered as part of the dorsal pathway in order to generate motor behaviors. Figure 1 shows the connections that have been considered in proposing the computational model of the VSWM. These connections show how visual information flows through these pathways in order to generate visuospatial maps. This does not mean that the bio-inspired computational model proposed in this paper considers all brain areas involved in the VSWM of human beings, or that it implements exact neural mechanisms. This bio-inspired computational model just offers an artificial and abstract representation of the major brain areas related to the VSWM.
Figure 2 shows the architectonic design of the bio-inspired VSWM computational model proposed in this paper. It also uses color codes for identifying the type of information that each module receives, processes, and sends. This information is classified as visual information (blue arrows), spatial information (green arrows), highly cognitive control information (yellow arrows), and sensorimotor information (red arrows). Finally, black arrows indicate the drone's interaction with the environment. The following subsections offer a detailed description of the inputs, processes, and outputs associated with the bio-inspired computational model's modules. The description is presented from a bio-inspired approach, but details about the computational implementation are also offered.

External Stimuli
• VC module. This module represents the visual cortex. The VC is a complex system consisting of different brain areas encompassing V1, V2, V3, and V4 [25,26]. Offering a detailed computational model of the VC is outside the scope of this paper. Therefore, the VC has been simplified as a single module within the bio-inspired VSWM computational model. Despite our limited knowledge of the neuronal mechanisms and processes involved in the VC, studies have demonstrated that visual perception starts in early visual areas with the detection of diverse low-level visual features such as angles, size, orientation, motion, and color [25][26][27][28]. This piecemeal analysis is very different from our subjective perception. However, these low-level visual features, encoded through large distributed neuronal populations in the VC, are fundamental for identifying complex objects in posterior visual processing brain areas [25]. Additionally, neuroscientific evidence suggests that the process of segmenting images into objects and background starts in the primary visual cortex (area V1) [25]. Therefore, the VC module proposed in this paper is responsible for separating relevant visual stimuli from the background; it then sends visual information to the ITC, IPL, and IPS modules for later processes related to spatial and visual identification. Additionally, a connection from the ITC module to the VC module has been considered in order to offer feedback. This feedback reduces the number of likely regions of interest (ROIs) identified early by the VC module. The VC module implements a heatmap-based visual stimuli detection method. This method is an adapted implementation of the heatmap-based object detection method used in CenterNet [29,30], adapted for making a rough separation of visual stimuli belonging to the foreground from the background. The algorithm implemented in this module works as follows:

The VC module processes an input RGB image I of width W and height H, where I ∈ ℝ^(W×H×3). The VC module aims to produce a keypoint heatmap Ŷ ∈ [0, 1]^((W/R)×(H/R)×C) by using a stacked hourglass network that downsamples the input image with an output stride R = 4, followed by two sequential hourglass modules. Each hourglass module is a symmetric 5-layer down- and up-convolutional network, as shown in Figure 3.
A heatmap head, dimension head, and offset head are obtained for each image and each of the C classes (our C classes comprise two keypoint types that represent relevant visual stimuli). The heatmap head is used for the estimation of keypoints on an input image. A heatmap focal loss is defined for training the heatmap head. Focal loss improves keypoint detection by weighting the keypoint detections. Equation (1) shows the focal loss function, where a value of 1 for Y_xyc represents a positive keypoint for a class and any other value represents a negative keypoint.
α and β are hyper-parameters with values of 2 and 4, respectively, and Ŷ_xyc represents a keypoint prediction. On the other hand, the offset head is used to recover the discretization error caused by the output stride. An L1 Norm Offset Loss is defined for training the offset head. Equation (2) describes the L1 Norm Offset Loss.
The dimension head predicts the dimension boxes of the keypoints using an L1 Norm Dimension Loss (see Equation (3)) for training, where Ŝ represents the predicted dimensions and s is the actual ground-truth dimensions.
The overall training objective is defined in Equation (4). This equation represents the total loss of the network, where λ_size and λ_off were set to 0.1 and 1, respectively.
An extended explanation of the heatmap-based object detection method can be found in [29,30].
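For the reader's convenience, the four losses referenced as Equations (1)–(4) follow the standard CenterNet formulation given in [29]; in the notation above they read (with α = 2, β = 4, λ_size = 0.1, λ_off = 1, N the number of keypoints, p a ground-truth center point, p̃ = ⌊p/R⌋ its low-resolution equivalent, and Ô the predicted offset):

```latex
% Equation (1): keypoint focal loss
L_k = \frac{-1}{N}\sum_{xyc}
\begin{cases}
(1-\hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}) & \text{if } Y_{xyc}=1\\
(1-Y_{xyc})^{\beta}\,(\hat{Y}_{xyc})^{\alpha}\log(1-\hat{Y}_{xyc}) & \text{otherwise}
\end{cases}

% Equation (2): L1 norm offset loss
L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\tfrac{p}{R}-\tilde{p}\right)\right|

% Equation (3): L1 norm dimension loss
L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|

% Equation (4): total training objective
L_{det} = L_k + \lambda_{size}L_{size} + \lambda_{off}L_{off}
```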

Visual Working Memory
• ITC module. This module represents the inferior temporal cortex. This brain area has been associated with the process of identifying complex objects in the environment, such as animate and inanimate objects [31,32]. Findings in neuroscience show that visual information starts in the visual area V1 and passes through a sequence of processing stages in the visual cortex until complex object representations are formed in the anterior part of the ITC [24]. Therefore, the ITC module proposed in this paper receives a constant stream of visual stimuli coming from the VC module to identify complex objects in the environment. As part of the implementation of this module, we have included a convolutional neural network (CNN) for classifying relevant stimuli according to a set of category clusters. The ITC module receives the heatmaps produced by the VC module and extracts the peaks for each class independently. All responses whose value is greater than or equal to that of their 8-connected neighbors are detected, and the top 100 peaks are kept. Let P̂_c be the set of n detected center points P̂ = {(x̂_i, ŷ_i)}, i = 1, …, n, of class c. Each keypoint location is given by integer coordinates (x_i, y_i). The keypoint values Ŷ_{x_i y_i c} are used as a measure of detection confidence and are used to produce a bounding box for each detection. The resulting data are used to feed a CNN with a 3 × 3 max pooling operation. This operation avoids detecting a single object as multiple objects. After classifying visual stimuli, the ITC module sends highly processed visual information to the vlPFC module. According to Madl et al. [12], representations in CNNs trained with real-world images are highly similar to those recorded in the ITC of humans and non-humans (primates).
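The peak-extraction step described above can be sketched in a few lines. The following is a minimal NumPy illustration of the 8-connected-neighbor test and top-100 selection, not the authors' implementation; `extract_peaks` and its parameters are hypothetical names:

```python
import numpy as np

def extract_peaks(heatmap, top_k=100):
    """Keep only local maxima of a single-class heatmap.

    A point survives if its value is >= all of its 8-connected
    neighbors (the 3x3 max-pooling trick used by CenterNet-style
    detectors); the top_k highest surviving peaks are returned as
    (x, y, confidence) tuples.
    """
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views covering each pixel's 3x3 neighborhood
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    local_max = windows.max(axis=0)
    keep = heatmap >= local_max  # equal to neighborhood max -> peak
    ys, xs = np.nonzero(keep)
    scores = heatmap[ys, xs]
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(xs[i]), int(ys[i]), float(scores[i])) for i in order]
```

The keypoint values surviving this test play the role of detection confidences described in the text.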

Spatial Working Memory
• SPL module. This module represents the superior parietal lobe. This brain area has been associated with tasks related to spatial working memory, maps of coordinates, spatial shifting, and spatial attention [17,33,34]. Therefore, the SPL module is responsible for decoding visual cues sent by the VC module through the IPS module. Information from these cues is used by the SPL module for creating maps of coordinates. These maps are encoded in an egocentric fashion, according to the relative coordinates of the UAV and its orientation. Allocentric representations, relative to environmental references, are not yet considered in this first version of the bio-inspired VSWM computational model. Neurophysiological studies suggest that the superior parietal lobe is part of the network involved in visual search tasks [33,34]. Therefore, the SPL module helps with spatial shifting and spatial attentional tasks in order to detect displacement of visual stimuli. Additionally, this module shares visuospatial information with the IPS module. Figure 4 shows an example of coordinate maps created by the SPL module. Fuzzy spatial information about the relationship between the UAV and ROIs identified in the environment is processed and included in these maps. For instance, Figure 4a shows that in a first scene two ROIs have been identified. According to the ROIs' coordinates, the SPL module has considered that ROI 1 is located to the front left of the UAV, and ROI 2 is located to the front right of the UAV. On the other hand, Figure 4b shows a second scene, where two new ROIs have been identified. When two or more ROIs are close to each other, the SPL module generates additional fuzzy spatial information about the relationship between these ROIs. In order to establish the positions of the identified ROIs relative to the UAV, Equations (5) and (6) are defined for segmenting the UAV's field of view into five regions, as shown in Figure 5.
These trivial equations allow the UAV to situate ROIs in the environment.
Equation (5) establishes the limit between the region classified as front and the regions classified as front left and front right, where x and f(x) represent the x-coordinate and y-coordinate, respectively.
Equation (6) establishes the limits between the regions classified as front left and left, and between the regions classified as front right and right. Moreover, a local matrix of N×M dimensions is applied over each ROI to generate additional fuzzy spatial information when two or more ROIs are close to each other. Equations (5) and (6), together with their reflections about the x-axis, are applied for segmenting this local matrix to establish spatial relationships between ROIs.
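Since Equations (5) and (6) are not reproduced here, the five-region segmentation can only be sketched under assumed boundaries. The following illustration uses angular limits (`FRONT_LIMIT`, `SIDE_LIMIT`) as placeholders for the paper's actual boundary functions; the function name and thresholds are hypothetical:

```python
import math

# Illustrative boundary angles (degrees from the forward axis); the
# paper derives the actual limits from Equations (5) and (6).
FRONT_LIMIT = 20.0
SIDE_LIMIT = 60.0

def classify_roi(x, y):
    """Return one of five egocentric regions for an ROI centroid.

    (x, y) are coordinates relative to the UAV: y points forward,
    x points to the right of the drone.
    """
    angle = math.degrees(math.atan2(x, y))  # 0 degrees = straight ahead
    if abs(angle) <= FRONT_LIMIT:
        return "front"
    if angle < 0:
        return "front left" if angle >= -SIDE_LIMIT else "left"
    return "front right" if angle <= SIDE_LIMIT else "right"
```

Applying the same test inside the local N×M matrix around each ROI yields the extra fuzzy relations between nearby ROIs described above.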
• IPS module. This module represents the intraparietal sulcus. In monkeys and humans, the IPS subdivides the PPC into a superior and an inferior parietal lobe [17]. The IPS has been associated with the creation of attentional priority maps, also known as saliency maps. This brain area is part of the network involved in the visual search task [33]. Therefore, the IPS module has bidirectional communication with the SPL and IPL modules, which are also involved in the visual search task. The IPS module is responsible for decoding visual cues coming from the VC module in order to represent spatial information for the SWM. The creation of attentional priority maps is part of the tasks done by this module. An attentional priority map is a topographic representation of the distribution of attentional weights. In order to create an attentional priority map, the IPS module integrates both bottom-up (perceptual features of the stimulus) and top-down (high-level representations of expectations and action goals) information [35,36]. Bottom-up information comes from the VC module, whereas top-down information comes from the dlPFC module. The attentional priority map is useful in the presence of distractors because the calibration of attentional weights allows the computational model to resolve the competition between stimuli presented in the environment. This module sends visuospatial information generated by itself and other modules (such as the SPL and IPL modules) to the dlPFC module. Additionally, this module has direct communication with the FEF module in order to support visually guided actions. However, a mechanism for generating priority maps has not yet been implemented in this first version of the bio-inspired VSWM computational model.
• IPL module. This module represents the inferior parietal lobe. This brain area has been considered relevant in the interruption of current cognitive activity and the reorientation of attention when a salient stimulus of high behavioral relevance appears at an unexpected position [17]. However, the behavioral relevance of a stimulus is determined not only by perceptual factors but also by top-down processes such as expectations or behavioral goals. The IPL has also been associated with maintaining alertness, sustaining attention, and detecting novelty [17]. Therefore, the current IPL module proposed in this paper has been limited to sending cues for indicating or alerting when a new ROI appears in the environment. The IPL module receives visual stimuli coming from the VC module. Additionally, the IPL module has bidirectional communication with the IPS module to share information.

Visuospatial Working Memory
• LPFC module. This module represents the lateral prefrontal cortex (LPFC). Findings in neuroscience indicate that, broadly but strictly within the domain of vision, the entire LPFC is involved in maintenance and manipulation tasks such as attention, working memory, and switching task sets [24]. The LPFC module includes the dlPFC and vlPFC modules. Therefore, this module integrates spatial and visual information coming from the dlPFC and vlPFC modules, respectively, to create visuospatial maps. Figure 6 shows an example of visuospatial maps created by the LPFC module. These maps offer a coherent and continuous representation of the interest objects that have been identified, along with their spatial relationships in the environment. For instance, Figure 6a shows a first scene where a vehicle and a person have been identified and labeled as vehicle 1 and person 1, respectively. Spatial information indicates that vehicle 1 is located in front, but to the left side of the UAV, and person 1 is located in front, but to the right side of the UAV. On the other hand, Figure 6b shows a second scene, where two additional persons (labeled person 2 and person 3, respectively) have appeared and are close to each other. In this case, when two or more interest objects are close to each other, additional spatial information indicating the spatial relationship between these objects is generated and included. Additionally, Figure 6b shows what happens when an object has gone out of the UAV's visual field. In this case, a selective removal process is activated. However, this process can be stopped if the object reappears in the environment. Therefore, whereas the VSWM tries to maintain information about objects that are no longer present in the environment, the selective removal process operates on outdated information to limit the VSWM's load and hence facilitates the maintenance of relevant information [37,38].
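The visuospatial map of Figure 6 can be pictured as a small labeled graph. The following minimal structure, with hypothetical class and method names, illustrates the idea only; it is not the authors' implementation:

```python
class VisuospatialMap:
    """A toy graph of interest objects and fuzzy spatial relations.

    Nodes hold each object's relation to the UAV; edges hold the extra
    fuzzy information generated when two objects are close to each other.
    """
    def __init__(self):
        self.nodes = {}   # label -> relation to the UAV
        self.edges = {}   # (label_a, label_b) -> relation between objects

    def add_object(self, label, relation_to_uav):
        self.nodes[label] = relation_to_uav

    def relate(self, a, b, relation):
        self.edges[(a, b)] = relation

# Rebuilding the example of Figure 6b:
vmap = VisuospatialMap()
vmap.add_object("vehicle 1", "front left")
vmap.add_object("person 1", "front right")
vmap.add_object("person 2", "front right")
vmap.add_object("person 3", "front right")
vmap.relate("person 2", "person 3", "close to")
```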
To implement this behavior in the bio-inspired VSWM computational model, Equation (7), the exponential forgetting curve R = e^(−t/s), was coded as part of the selective removal process, where R is retrievability (a measure of how easy it is to retrieve a piece of information from the VSWM), s is the stability of memory (how fast R falls over time in the absence of a relevant stimulus), and t is time. As shown in Figure 6b, when a stimulus that has been previously identified is no longer present in the environment, its representation in the VSWM is degraded until it disappears. The graph represents this process by fading the corresponding node until it dissipates, simulating how irrelevant stimuli that are no longer present in the environment are removed from the UAV's VSWM.
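Assuming the exponential forgetting form R = e^(−t/s) suggested by the variable definitions of Equation (7), the selective removal process can be sketched as follows; the class, the stability default, and the removal threshold are illustrative assumptions, not values from the paper:

```python
import math

class TrackedObject:
    """An interest object held in the VSWM with decaying retrievability."""
    def __init__(self, label, stability=3.0):
        self.label = label
        self.stability = stability   # s: larger -> slower forgetting
        self.time_unseen = 0.0       # t: time since the stimulus was last seen

    def retrievability(self):
        # Equation (7) under the assumed form: R = exp(-t / s)
        return math.exp(-self.time_unseen / self.stability)

def selective_removal(memory, dt, threshold=0.05):
    """Advance time for unseen objects and drop those whose R has decayed.

    If an object reappears, resetting its time_unseen to 0 restores
    R to 1 and stops the removal, as described in the text.
    """
    for obj in memory:
        obj.time_unseen += dt
    return [obj for obj in memory if obj.retrievability() >= threshold]
```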
• dlPFC module. This module represents the dorsolateral prefrontal cortex. There is neuroscientific evidence showing that the dorsolateral prefrontal cortex is responsible for spatial selectivity [24]. The dlPFC has also been associated with planning, problem solving [39], and top-down executive control [21]. Currently, the dlPFC module is responsible for integrating spatial information coming from the IPS module into the visuospatial maps and for sending motor commands to the FEF and MC modules in order to exhibit top-down behavior.
• vlPFC module. This module represents the ventrolateral prefrontal cortex. Neuroscientific evidence shows that the vlPFC is responsible for object selectivity [24]. Currently, the vlPFC module is responsible for integrating visual information coming from the ITC module into the visuospatial maps.

Motor System
• FEF module. This module represents the frontal eye field. Neuroscientific evidence indicates that the frontal eye field receives converging inputs from many cortical areas involved in bottom-up and top-down attentional control. Currently, the computational module that represents the frontal eye field proposed in this paper considers two connections, coming from the dlPFC and IPS modules [17,36,40]. Motor commands implemented in this module allow the UAV's camera to pan and tilt a specific number of degrees.
• MC module. This module represents the motor cortex. This brain area can be divided into three major areas: the primary motor cortex, the premotor cortex, and the supplementary motor area. Voluntary movements depend critically on these motor areas [41]. However, proposing a detailed computational model that considers all motor areas involved in generating movements is outside the scope of this paper. Therefore, the motor cortex has been simplified as a single module labeled MC. This module has a set of motor commands such as take-off, land, rotate clockwise, rotate counterclockwise, fly forward, fly backward, fly up, fly down, fly left, and fly right. These commands can be invoked by the dlPFC module.
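The MC module's command set can be sketched as a small dispatcher. This is an illustrative structure only; the mapping of each command to the Bebop 2's actual flight primitives is not shown, and the class and method names are hypothetical:

```python
from enum import Enum

class MotorCommand(Enum):
    """The MC module's command set, as listed in the text."""
    TAKE_OFF = "take-off"
    LAND = "land"
    ROTATE_CW = "rotate clockwise"
    ROTATE_CCW = "rotate counterclockwise"
    FLY_FORWARD = "fly forward"
    FLY_BACKWARD = "fly backward"
    FLY_UP = "fly up"
    FLY_DOWN = "fly down"
    FLY_LEFT = "fly left"
    FLY_RIGHT = "fly right"

class MCModule:
    """Minimal command dispatcher invoked by the dlPFC module.

    In the real system each command would map to a flight primitive of
    the drone's SDK; here commands are appended to a sink (any list)
    so the sketch remains testable without hardware.
    """
    def __init__(self, command_sink):
        self.sink = command_sink

    def execute(self, command):
        if not isinstance(command, MotorCommand):
            raise ValueError(f"unknown motor command: {command!r}")
        self.sink.append(command)
        return command
```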

Materials and Methods
The bio-inspired VSWM computational model proposed in this paper was implemented on a Bebop 2. This is a low-cost commercial drone equipped with a 14-megapixel camera with a fish-eye lens, a dual-core processor with a quad-core GPU, GPS, and 8 GB of flash storage. Nevertheless, the bio-inspired VSWM computational model was not embedded in the drone. Instead, a laptop (with an Intel Core i7 processor, an external NVIDIA GeForce GTX 1060 video card, and 16 GB of RAM) was used to host the bio-inspired VSWM computational model. Therefore, the drone and the bio-inspired VSWM computational model communicate through a Wi-Fi connection. Currently, the bio-inspired VSWM computational model can identify people and vehicles (cars and trucks) in the real world. The following research questions were studied in order to identify the performance and accuracy of the bio-inspired VSWM computational model proposed in this paper:

1. What is the visuospatial ability of the bio-inspired VSWM computational model when the drone flies between 9.84 and 16.40 feet?
2. Can the bio-inspired VSWM computational model offer a coherent and continuous representation of visual and spatial relationships among objects of interest present in the environment?
To answer these questions, a total of 30 test cases were designed and executed. These test cases were grouped into three scenarios: (i) environments with static and dynamic vehicles, (ii) environments with people, and (iii) environments with people and vehicles. Detailed information on these test cases is available at CogniDron/VSWM/test_report. In none of the test cases did the drone have a specifically defined target in the visual task; therefore, visual attention was driven by bottom-up processes for generating visuospatial maps of the environment.

First Test Scenario: Environments with Static and Dynamic Vehicles
The main objective of this test scenario was to test the bio-inspired VSWM computational model's visuospatial ability to identify vehicles present in the environment and then to generate 2D visuospatial maps of them. A second objective was to test the model's capability for retaining visuospatial information for a short period of time when visual stimuli were no longer present in the environment. Therefore, 12 test cases were designed and implemented in this test scenario. Table 2 shows a summary of these test cases, including the test case ID, the flight altitude reached by the drone, and a brief description of each test case. Since there was no specific target defined in this visual task, the drone's camera was fixed at a specific point in the environment in all the test cases. For example, in the test cases executed at flight altitudes of 9.84, 13.12, and 16.40 feet, the visual stimuli were a parked vehicle and a vehicle in motion. After the two vehicles were identified by the bio-inspired computational model, the vehicle in motion was intentionally driven outside of the drone's field of view and returned after a few seconds to test the model's ability to retain visuospatial information for a short period of time when visual stimuli are no longer present.
Finally, the vehicle in motion was permanently left outside of the drone's field of view to test the bio-inspired computational model's selective removal process, which degrades the visuospatial information of visual stimuli that are no longer present in the environment until that information is deleted from the VSWM buffer.
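The selective removal process described above can be sketched as a decaying activation on buffer entries. This is a minimal illustration under assumed parameters (the decay rate, threshold, and data layout below are not specified in the paper): each tracked stimulus keeps its last known position; when it is no longer visible its activation decays each update step, and once the activation falls below a threshold the entry is deleted from the buffer. If the stimulus reappears, its position and activation are refreshed.

```python
class VSWMBuffer:
    """Minimal VSWM buffer with a selective removal (decay) process."""

    def __init__(self, decay_rate: float = 0.2, threshold: float = 0.1):
        self.decay_rate = decay_rate   # activation lost per update step (assumed)
        self.threshold = threshold     # below this, the entry is removed (assumed)
        self.entries = {}              # id -> {"pos": (x, y), "activation": float}

    def observe(self, stimulus_id, pos):
        # Stimulus visible: store/refresh its position at full activation.
        self.entries[stimulus_id] = {"pos": pos, "activation": 1.0}

    def update(self, visible_ids):
        # Decay entries whose stimulus was not seen in the current frame;
        # delete those whose activation has degraded below the threshold.
        for sid in list(self.entries):
            if sid not in visible_ids:
                self.entries[sid]["activation"] -= self.decay_rate
                if self.entries[sid]["activation"] < self.threshold:
                    del self.entries[sid]  # selective removal
```

With these assumed parameters, a stimulus that disappears survives in the buffer for a few update steps (its last position remains available, as when a vehicle briefly leaves the field of view) and is then removed if it never reappears.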

Second Test Scenario: Environments with People
Like the first test scenario, the main objective of this test scenario was to test the bio-inspired VSWM computational model's visuospatial ability; however, this second scenario involved people instead of vehicles. Therefore, 9 test cases were designed and implemented in this test scenario. Participants were strategically located in the environment in order to obtain images of them at different scales. Table 3 shows a summary of the test cases implemented in this scenario. As in the first scenario, there was no specific target defined, so the drone's camera was fixed at a specific point in the environment in all the test cases. For example, in one test case executed at 9.84 feet, the visual stimuli were three persons located diagonally in front of the drone. After the three persons were identified by the bio-inspired computational model, the person located to the right of the drone walked around the environment and crossed behind another person in order to generate a partial or total occlusion between them for a few seconds. In another test case at 9.84 feet, three persons walked aimlessly around the environment while always avoiding a partial or total occlusion between them. After the three persons were identified by the bio-inspired computational model, one of them went outside of the drone's field of view, followed by a second person.

Third Test Scenario: Environments with People and Vehicles
The objective of this test scenario was the same as in the previous test scenarios; however, in this scenario, both people and vehicles were present simultaneously in the environment. In total, 9 test cases were designed and implemented. Table 4 shows a summary of these test cases. As in the previous scenarios, there was no specific target defined, so the drone's camera was fixed at a specific point in the environment in all the test cases. For example, in one test case executed at 9.84 feet, the visual stimuli were two persons and two parked vehicles. After the persons and vehicles were identified by the bio-inspired computational model, the persons walked aimlessly around the environment, always avoiding a partial or total occlusion between themselves and the vehicles.

Results
This paper considers visuospatial ability as a person's capacity to identify visual and spatial relationships among objects. Therefore, the visuospatial ability of the bio-inspired VSWM computational model was measured in terms of its ability to classify and locate objects in the environment. Images and videos were recorded in each test case to analyze the model's visuospatial ability offline (Supplementary Materials are available at CogniDron/VSWM/videos). In order to estimate the model's accuracy for generating spatial relationships among relevant stimuli in the environment, 100 frame samples were taken from the video of each test case. We observed that the spatial relationships among relevant stimuli present in the environment were always correct in the three test scenarios. Frames taken by the bio-inspired computational model were stored in order to compute its accuracy for classifying relevant visual stimuli (such as people and vehicles) present in the environment. Results of each test scenario are reported in Tables 5-7, respectively. Table 5 shows the results obtained in the first test scenario: the model's accuracy for classifying vehicles ranged from 95.40% to 100%. In the second test scenario, the model's accuracy for classifying people ranged from 96.33% to 100% (see Table 6).
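The per-scenario classification accuracy can be estimated from the stored frame samples as the fraction of ground-truth objects correctly classified. This is a simplified sketch; the paper does not give its exact formula, and the frame annotation format below is an assumption.

```python
from collections import Counter

def classification_accuracy(frames):
    """Accuracy (%) over annotated frame samples.

    `frames` is a list of (detected_labels, ground_truth_labels) pairs,
    one pair per sampled frame. Each ground-truth object counts as
    correct if a matching detection of the same class is available
    (duplicates are handled with multiset counts).
    """
    correct = total = 0
    for detected, truth in frames:
        remaining = Counter(detected)  # detections not yet matched
        for label in truth:
            total += 1
            if remaining[label] > 0:
                remaining[label] -= 1
                correct += 1
    return 100.0 * correct / total if total else 0.0
```

For instance, a scenario where every sampled frame's people and vehicles are detected would score 100%, matching the upper end of the ranges reported in Tables 5-7.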
Finally, Table 7 shows the results obtained in the third test scenario. The bio-inspired computational model's accuracy for classifying both people and vehicles simultaneously ranged from 99.75% to 100% in this scenario.
The selective removal process proposed in the bio-inspired VSWM computational model was useful for maintaining a coherent and continuous representation of visual and spatial relationships among objects of interest present in the environment. Figure 7 shows how visuospatial information is maintained even when a visual stimulus is lost as a consequence of a total occlusion. As a result of this occlusion, the bio-inspired VSWM computational model keeps the last position where the visual stimulus was identified, and the selective removal process is activated. This removal process gradually degrades the information of the visual stimulus that disappeared. However, the process is stopped if the visual stimulus reappears in the environment: when a stimulus reappears, the visuospatial map updates its location; otherwise, the stimulus information is removed from the visuospatial map. The same behavior was observed when the bio-inspired VSWM computational model had problems identifying visuospatial information from one frame to another. Figure 8 shows an example of how the model generates additional fuzzy spatial relationship information between relevant stimuli when they are close to each other. For instance, four people are present in the environment shown in Figure 8. The bio-inspired computational model identified a first person to the front-right of the drone, a second person in front of the drone, and two more persons to the front-left of the drone. Additionally, the model considers that these last two persons are close to each other, showing that additional fuzzy spatial information is generated to establish the spatial relationship between them.
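The qualitative direction labels (front, front-left, front-right) and the fuzzy "close to each other" relation can be illustrated with a short sketch. The angular sectors, distance thresholds, and label vocabulary below are assumed values for illustration; the paper does not specify its membership functions.

```python
import math

def direction_label(dx: float, dy: float) -> str:
    """Coarse egocentric direction of a stimulus relative to the drone.

    dx > 0 means the stimulus is to the right of the drone, dy > 0 means
    it is ahead; sector boundaries (22.5/67.5 degrees) are assumed.
    """
    angle = math.degrees(math.atan2(dx, dy))  # 0 degrees = straight ahead
    if -22.5 <= angle <= 22.5:
        return "front"
    if 22.5 < angle <= 67.5:
        return "front-right"
    if -67.5 <= angle < -22.5:
        return "front-left"
    return "right" if angle > 0 else "left"

def closeness(p, q, near: float = 2.0, far: float = 6.0) -> float:
    """Fuzzy 'close to each other' membership between two 2D positions.

    Returns 1.0 below `near`, 0.0 above `far`, and decreases linearly
    in between (both thresholds are assumed values).
    """
    d = math.dist(p, q)
    if d <= near:
        return 1.0
    if d >= far:
        return 0.0
    return (far - d) / (far - near)
```

Under these assumptions, the Figure 8 situation would correspond to one stimulus labeled "front-right", one "front", two "front-left", with a high `closeness` value triggering the extra "close to each other" relation for the last pair.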
Finally, we are aware that the visuospatial performance of the bio-inspired computational model proposed in this paper can vary according to the environment's features, such as lights and shadows, occlusions, the number of visual stimuli, and background clutter.

Figure 7. Generating visuospatial maps when a total occlusion happens.

Figure 8. Additional fuzzy spatial relationship information between relevant stimuli when they are close to each other.

Discussion
VSWM is a fundamental cognitive capability of human beings needed for exploring the visual environment. This cognitive function is responsible for creating visuospatial maps, which are then further processed for the generation of specific actions such as spatial shifting, visual search, and location of targets. The results obtained after executing the test cases show that the bio-inspired VSWM computational model proposed in this paper is capable of maintaining a coherent and continuous representation of visual and spatial relationships among objects of interest present in the environment, even when a visual stimulus is lost because of a total occlusion. We consider that this bio-inspired computational model represents a step towards autonomous UAVs capable of forming visuospatial mental imagery in realistic environments. As future work, we are going to implement our bio-inspired VSWM computational model on UAVs with stereo vision to add depth information to the visuospatial maps. We believe that designing autonomous UAVs capable of forming visuospatial mental imagery can be very useful for tasks such as surveillance and rescue missions, where VSWM is a fundamental cognitive capability for exploring the visual environment. On the other hand, the bio-inspired VSWM computational model proposed in this paper is part of a new brain-inspired cognitive architecture named CogniDron. Therefore, future work also includes modeling and designing other cognitive functions such as creating allocentric spatial maps, an emotional system, planning, decision-making, and learning.
Supplementary Materials: The following are available online at https://drive.google.com/drive/folders/1b0yWNkhVh08OhaPkEzrzTjzpv5Mm2-eP. The videos and images of the 30 test cases implemented for testing the bio-inspired VSWM computational model, as well as additional information regarding these test cases to support the computational model proposed in this paper, are available at CogniDron/VSWM/videos and images.