Application-Oriented Retinal Image Models for Computer Vision

Energy and storage restrictions are relevant concerns for software applications running in low-power environments. Computer vision (CV) applications exemplify this concern well, since conventional uniform image sensors typically capture large amounts of data to be further handled by the appropriate CV algorithms. Moreover, much of the acquired data is often redundant and outside of the application's interest, which leads to unnecessary processing and energy expenditure. In the literature, techniques for sensing and re-sampling images in non-uniform fashions have emerged to cope with these problems. In this study, we propose Application-Oriented Retinal Image Models that define a space-variant configuration of uniform images and contemplate requirements of energy consumption and storage footprints for CV applications. We hypothesize that our models might decrease energy consumption in CV tasks. Moreover, we show how to create the models and validate their use in a face detection/recognition application, evidencing the compromise between storage, energy, and accuracy.

The proposed framework to generate application-oriented retinal image models. The workflow begins by defining the application's requirements regarding operation (e.g., objects' positioning, illumination) and efficiency (e.g., storage, accuracy). Then, a proper implicit function (e.g., ℓ₂) and the spatial configuration of the retinal image model, comprising foveal and peripheral regions, are chosen. The next step is the generation of the model by means of an optimization procedure that considers the implicit function and the spatial configuration to resample points in the 2-D Cartesian space. The final artifact is an application-oriented retinal model comprised of uniformly- and non-uniformly-sampled foveal and peripheral regions, respectively. This model is used to resample uniform images, taking them to a space-variant domain and potentially contemplating the requirements determined beforehand.
view and a high-resolution region that is used to foveate a point in a real scene, thereby reducing data processing to a dense, smaller region (fovea), or to a wider, sparse one (periphery) [3,4]. Both regions can also operate in synergy: the periphery examines coarse data to trigger a detailed analysis through foveation.

Concepts of the human visual system have already been explored from the hardware and software perspectives. On the hardware side, the problem has been dealt with mainly on two fronts: (i) the development of imaging sensors with specific non-uniform spatial configurations [5], and (ii) the use of an intermediary hardware layer to remap uniform images into variable-resolution ones. The first front allows the capture of topology-fixed foveated images at sensing time, whereas the second provides more flexibility to change the mapping without relying on software routines.

Pure software-based approaches, in opposition, offer more flexibility to simulations, albeit with higher computational costs. In [7], a saccadic search strategy based on foveation for facial landmark detection and authentication is presented. The authors apply a log-polar mapping to some image points and extract Gabor filter responses at these locations, thus imitating the characteristics of the human retina. For training, SVM classifiers are used to discriminate between positive and negative classes of facial landmarks (eyes and mouth) represented by the collected Gabor responses. When testing, the saccadic search procedure evaluates several image points in search of candidate landmarks that are further used to authenticate the depicted individual. A more complete review on space-variant imaging from the hardware and software perspectives using log-polar mappings is detailed in [8]. Furthermore, in [9], a foveated object detector is proposed.
The detector operates on variable-resolution images obtained by resampling uniform ones with a simplified model of the human visual cortex. The results showed that the detector was capable of approximating the accuracy of a uniform-resolution-oriented one, thereby providing a satisfactory insight into evolutionary biology processes. In another work [10], image foveation is exploited along with a single-pixel camera architecture to induce a compromise between resolution and frame rate. The images are resampled by a space-variant model that is constantly reshaped to match the regions of interest detected in the image by a motion tracking procedure, thus effectively simulating a moving fovea that increasingly gathers high-resolution data across frames. To facilitate comparisons among different sensor arrangements, an appropriate method is described in [11]. The idea is to provide a common space for creating lattices of any kind. To demonstrate the viability of the method, the rectangular and hexagonal lattices are implemented and images built according to both arrangements are further compared.

Despite the progress of CV research fields in exploiting space-variant models, there is a lack of a single generic framework for seamlessly handling images generated by heterogeneous pixel sampling strategies. In this paper, we address this issue by proposing a framework for designing Application-Oriented Retinal Image Models (ARIMs) that establish a non-uniform sampling configuration of uniform images. We propose to define the appropriate model for an application on demand, taking into account specific requirements of the target application. By exploiting such models, we hypothesize it might be possible to decrease the energy spent in computer vision tasks. We show how to create the models and validate their use in a face detection/recognition application, considering the compromise among storage rates, energy, and accuracy. We use a regular image sensor to simulate the operation of the proposed models.

Instead of using a traditional image, coming from a general uniform sensor, we argue that the best approach is to examine the target application and investigate its requirements/demands. CV applications can comprise a very diverse set of requirements, ranging from efficiency-related ones, such as storage, speed, energy, and accuracy, to other very application-specific ones, such as the need for objects to move slowly or be positioned in specific locations in the scene, be situated at a minimum/maximum distance from the camera, be illuminated by a close light source, and so forth.

The application considered in this paper is concerned with user authentication based on the face: the individual enters and leaves the scene from any side, placing themselves in front of a camera that captures the scene with a wide field of view.

Although authentication across a wide field of view is a good idea, since more faces can be captured, retaining some pixel data in the peripheral areas, even in a sparse manner, is also appropriate. Finally, another suitable strategy towards energy reduction is downsampling the image before performing face detection/recognition. This might reduce the energy spent in the whole authentication process, but at the cost of a drop in accuracy.

The issues discussed above illustrate examples of requirements to be defined by the analysis of an application's domain. In this paper, they were essential to guide the definition of a model for the biometric application.

The design of the model starts with selecting a proper implicit function. The idea is that the function will act as a control mechanism to spread out the non-uniformly sampled points over a desired image region. Figure 2 depicts examples of implicit functions we explored (ℓ₁, ℓ₂, and ℓ∞).
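As a rough sketch, the implicit functions above can be implemented as ℓp-norm distance fields; the function names and the centre parameters below are our own illustration, not the paper's code:

```python
import math

# Sketch (our naming): lp-norm distance fields used as implicit
# functions around a fovea centred at (cx, cy). Larger values mean
# "farther from the fovea", so they can drive where sampled points go.
def l1(x, y, cx=0.0, cy=0.0):
    return abs(x - cx) + abs(y - cy)

def l2(x, y, cx=0.0, cy=0.0):
    return math.hypot(x - cx, y - cy)

def linf(x, y, cx=0.0, cy=0.0):
    return max(abs(x - cx), abs(y - cy))
```

The level sets of these functions are diamonds (ℓ₁), circles (ℓ₂), and squares (ℓ∞); the chosen shape dictates the contour of the fovea and how peripheral density falls off around it.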

This step is concerned with the spatial characteristics the model must obey. We developed hybrid space-variant models inspired by the human retina. In general, the models comprise two very distinct regions: the fovea and the periphery. The fovea is a fixed-size region of pixels sampled uniformly according to a predefined grid. For instance, a region of size 2⁶ × 2⁶ pixels can be uniformly sampled by a grid of size 2⁵ × 2⁵ pixels. Given these characteristics, we can apply conventional CV algorithms in the fovea. In contrast, the periphery is a fovea-surrounding region with a non-uniform pixel density that decreases with the distance from the fovea.
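A minimal sketch of the uniform foveal sampling just described (the helper name and its arguments are hypothetical): sampling a 2⁶ × 2⁶ region with a 2⁵ × 2⁵ grid amounts to keeping every second pixel along each axis.

```python
# Sketch (hypothetical helper): uniformly sample a fixed-size foveal
# region of `region` x `region` pixels with a coarser `grid` x `grid`
# grid. For region=64, grid=32 the stride is 2, matching the
# 2^6 x 2^6 region sampled by a 2^5 x 2^5 grid mentioned in the text.
def sample_fovea(image, top, left, region=64, grid=32):
    stride = region // grid  # e.g. 64 // 32 = 2
    return [[image[top + r * stride][left + c * stride]
             for c in range(grid)]
            for r in range(grid)]
```

Because the result is itself a small uniform image, conventional CV algorithms (e.g., a face detector) can run on it unchanged.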

The following four parameters must be specified prior to the creation of the hybrid model:

• Number of foveas: Surely a human eye has only one fovea, but it is perfectly fine for a model to comprise more than one region of uniform sampling, depending on the application at hand. In our biometric application, we considered only one fovea.

• Location of foveas: The foveas should be spatially organized adhering to the specific requirements of the application. In ours, the fovea is centralized in the image.

• Density of foveas: The foveas can be downsampled to simulate a uniform image-resolution reduction. We tested different densities (grids) for our fovea.

• Density of periphery: The periphery is an important region that encompasses few, sparse data in a non-uniform sampling configuration. As discussed previously, by retaining and wisely handling sparse peripheral information (e.g., detecting motion and coarse objects in such an area), the application's resource usage might be optimized.

There are several ways to achieve a non-uniform point distribution. Our approach is inspired by the computer graphics literature and previous works [12,13]. Besides the implicit function, the number of peripheral (non-uniform) points and the aspect ratio of the sensor must be provided. We generate a point distribution via a local non-linear optimization procedure that, starting from an initial distribution, tries to minimize the global energy function defined in Equation 1, where x is a point in image space.
(1)

Version April 2, 2020 submitted to Sensors

Points connected by springs try to repel each other when they are too close. We do not use Newton's physical model of forces from springs. Instead, we have a mass-free system, so springs generate "velocity forces."
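Since the exact energy of Equation 1 is not reproduced here, the sketch below only illustrates the mass-free "velocity force" idea with a generic repulsion between points closer than a rest length; the function name, the rest length, and the step size are all our assumptions:

```python
# Sketch (assumptions: a simple inverse-distance repulsion stands in
# for the paper's exact energy of Equation 1). Springs produce
# "velocity forces": each point moves directly along the net repulsion,
# with no mass or acceleration involved.
def relax(points, iters=50, rest=0.1, step=0.05):
    pts = [list(p) for p in points]
    for _ in range(iters):
        vel = [[0.0, 0.0] for _ in pts]
        for i, (xi, yi) in enumerate(pts):
            for j, (xj, yj) in enumerate(pts):
                if i == j:
                    continue
                dx, dy = xi - xj, yi - yj
                d = (dx * dx + dy * dy) ** 0.5
                if 0.0 < d < rest:            # too close: repel
                    f = (rest - d) / d
                    vel[i][0] += f * dx
                    vel[i][1] += f * dy
        for p, v in zip(pts, vel):            # velocity update, no mass
            p[0] = min(1.0, max(0.0, p[0] + step * v[0]))
            p[1] = min(1.0, max(0.0, p[1] + step * v[1]))
    return pts
```

Iterating this update spreads clustered points apart until neighbouring distances reach the rest length, yielding a relaxed, non-uniform distribution inside the unit square.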

The optimization process is very sensitive to its initial conditions. A uniform distribution of the points is used as the starting configuration.

In this section, we present the experimental setup necessary for simulating the usage of the proposed models. The chosen dataset closely resembles one of a biometric application. A subset is comprised of four (4) frame sequences (S1, S2, S3, and S4), each of which is registered by three cameras (C1, C2, and C3). For instance, the frame sequence P1E_S2_C3 refers to the second sequence (S2) of people entering portal 1 (P1E) and captured by camera 3 (C3).

We used 34 image sequences (out of 48) from the dataset during our evaluations; among the discarded ones, eleven (11) sequences where no face is found in the fovea were ignored. This decision was taken because no face recognition accuracy evaluations (using our models) would apply to these sequences.

First and third rows: original frames; second and fourth rows: reconstruction with a model that considers an optical-flow peripheral representation. Green and yellow arrows indicate motion direction to the right and left sides, respectively, whereas the ON and OFF labels refer to the operational status of the foveal (face detection/recognition) and peripheral (optical flow) regions. Note that the motion analysis, besides triggering foveal analysis, is also able to restart conveniently, as long as faces are not detected in the fovea during a time interval of frames (left-most frame in the fourth row).

We simulated the operation of a specific-purpose sensor by re-sampling images according to our ARIMs. The idea is to generate images containing two regions: (i) the fovea, encompassing a small area where resolution is uniform, and (ii) the periphery, where pixels are arranged non-uniformly over a wider area. With such a configuration, we were able to perform experiments considering different foveal resolutions, while also taking advantage of the periphery according to the specific requirements of the application. In this vein, we adopted an optical flow representation (orientation and magnitude) for peripheral pixels. The motivation behind that representation is that the detection/recognition in the fovea is triggered only when there is movement towards it coming from the periphery. Also, both the detection and recognition procedures turn off when no face is found within a predefined time interval.
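The triggering behaviour described above can be sketched as a small state machine; the class name, method signature, and default timeout are our own placeholders, not the paper's implementation:

```python
# Sketch of the gating logic (our naming): peripheral motion towards
# the fovea switches the foveal pipeline ON; it switches OFF again
# after `timeout` consecutive frames without a detected face.
class FovealGate:
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.on = False
        self.frames_without_face = 0

    def update(self, motion_towards_fovea, face_found):
        if motion_towards_fovea:
            self.on = True                      # periphery triggers fovea
        if self.on:
            if face_found:
                self.frames_without_face = 0
            else:
                self.frames_without_face += 1
            if self.frames_without_face >= self.timeout:
                self.on = False                 # save energy: stop detection
                self.frames_without_face = 0
        return self.on
```

Per frame, the simulation would call `update()` with the peripheral optical-flow verdict and the foveal detector's result, running the (expensive) face detection/recognition only while the gate is on.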

In this scenario, therefore, more energy can be saved. The workflow of the simulation process is depicted in Figure 5.

Implemented workflow for simulating the use of ARIMs in a specific CV application. In an ideal scenario, the ARIM, a captured image frame, and the chosen pixel representations for foveal and peripheral areas are input to a hypothetical specific-purpose sensor that changes its configuration at run-time. Such a sensor would yield a stream (bytestream) of pixel data from each region of the captured image. The stream (not the 2-d image) would be forwarded to the CV application. For simulation purposes, however, this architecture is fully implemented in software.

To measure energy consumption, we use special registers from Intel processors called model-specific registers (MSRs). At the code level, we read these registers before and after a block of instructions, and calculate the difference between these values.
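The bytestream such a hypothetical specific-purpose sensor could emit might be laid out as follows; the header format, the `(x, y, value)` peripheral triples, and the 8-bit pixel assumption are all ours, purely for illustration:

```python
import struct

# Sketch (hypothetical layout): serialise one frame of ARIM output as
# a bytestream -- a small header (fovea rows, fovea cols, number of
# peripheral samples), the uniformly-sampled 8-bit foveal pixels in
# row-major order, then (x, y, value) peripheral samples. The CV
# application would consume this stream, not a 2-d image.
def pack_frame(fovea, peripheral):
    out = struct.pack("<HHI", len(fovea), len(fovea[0]), len(peripheral))
    for row in fovea:
        out += bytes(row)                      # 8-bit foveal pixels
    for x, y, v in peripheral:
        out += struct.pack("<HHB", x, y, v)    # sparse peripheral samples
    return out
```

For a 2 × 2 fovea and one peripheral sample this yields 8 (header) + 4 (fovea) + 5 (sample) = 17 bytes, far less than a dense frame.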

More specifically, we read the MSR_RAPL_POWER_UNIT register to measure the energy spent in image readings, face detection/recognition procedures, and optical flow analysis (when using ARIMs).

Quantifying reductions in the numbers of pixels and image data sizes is essential for assessing the benefits of using different ARIMs in practical situations. Table 1 shows these measurements. We notice that the ARIMs reduced the number of pixels and the size of images by more than 91%. These results are also justified by the quality of the optical flow analysis, which seems to be acceptable for the tested application.
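Because RAPL energy counters are cumulative and wrap around, the "read before, read after, subtract" pattern described above needs overflow handling. The sketch below assumes Linux's powercap interface as an unprivileged stand-in for raw MSR reads; the fixed wrap range is a simplification (the real range is exposed in `max_energy_range_uj`):

```python
# Sketch: on Linux, an unprivileged alternative to reading the RAPL
# MSRs directly is the powercap sysfs interface, which reports
# cumulative package energy in microjoules.
def read_energy_uj(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    with open(path) as f:
        return int(f.read())

def energy_delta_uj(before, after, max_range_uj=2**32):
    # The counter wrapped around if `after` is smaller than `before`.
    if after >= before:
        return after - before
    return max_range_uj - before + after
```

Measuring a block of instructions then amounts to calling `read_energy_uj()` before and after it and passing both readings to `energy_delta_uj()`.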
Table 2 presents the minimum, mean, and maximum accuracy loss rates induced by each model in comparison to the benchmarks. Whereas the maximum obtained loss was 50% for Model_1 and the P2E dataset, very small loss rates (close to 0%) were registered in more than one scenario. Another interesting phenomenon is the high loss rates observed for the P2E and P2L datasets, possibly due to slightly divergent conditions relative to the P1E and P1L datasets.

The use of the models reduced energy consumption while keeping accuracy rates acceptable, as previously discussed. Table 3 presents the minimum, mean, and maximum energy reduction rates induced by each model relative to the benchmarks, i.e., the obtained energy savings. As expected, the reduction rates decrease with the increase in foveal resolution, because there is more data to process. This is verifiable by a quick comparison between the mean rates of Model_1 (half density) and Model_3 (full density), for example.

A crucial observation that led to the present study is that image data captured by uniform sensors are often dense and redundant, leading to computationally expensive solutions in terms of storage, processing, and energy consumption. We addressed this issue by exploiting a space-variant scheme inspired by mechanisms of biological vision, in particular, the way humans sense through the retina. We introduced a generic framework for designing application-oriented retinal image models.

The models should be used to re-sample the input images prior to executing a specific CV task. We selected a biometric application to illustrate the conception and usefulness of appropriate models.

The evaluation evidences the viability of the proposed models and their conformity to our initial expectations regarding resource savings.

In future works, we intend to use our framework in other CV applications, such as surveillance and assembly-line inspection. Another possibility is to explore different representations for the periphery of our models. An intermediary hardware layer (e.g., an FPGA) in which the models are computed could also be implemented. In that case, a more complex repertoire of variables would need to be considered, including the costs of computing the models and resampling in the FPGA, as well as the application's domain.

Even with these variables in play, we believe such an infrastructure could yield positive impacts on energy saving.