A Change-Driven Image Foveation Approach for Tracking Plant Phenology

Abstract: One of the challenges in remote phenology studies lies in how to efficiently manage large volumes of data obtained as long-term sequences of high-resolution images. A promising approach is known as image foveation, which is able to reduce the computational resources used (i.e., memory storage) in several applications. In this paper, we propose an image foveation approach towards plant phenology tracking where relevant changes within an image time series guide the creation of foveal models used to resample unseen images. By doing so, images are taken to a space-variant domain where regions vary in resolution according to their contextual relevance for the application. We performed our validation on a dataset of vegetation image sequences previously used in plant phenology studies.


Introduction
Vision stands out among other long-evolved biological sensing mechanisms because of its complexity and effectiveness. Unlike conventional digital cameras, the human retina is designed to perceive in a non-uniform manner. While the fovea centralis extracts high-resolution data, the retinal periphery perceives with a gradually-decreasing resolution. By operating with less data and relying on saccades [1], the brain processes these space-variant images more efficiently [2,3]. This scheme has been applied to image-based tasks, such as computer vision [4][5][6][7][8] and image compression [9][10][11].
As a rich source of data, digital images have triggered scientific advances in several areas. In plant phenology, where life cycle events are investigated along with their relationships to climate [12,13], knowledge of plant and vegetation dynamics over time is essential to understand ecosystem processes such as carbon and water exchange [14]. Manual on-the-ground data sampling and investigation can be coupled with hardware/software apparatuses to better capture, manage, and process multimedia data (e.g., images and videos), thus supporting phenological studies in several ways [15]. For the past decades, image-based phenological studies have been successfully developed using satellite-based imaging [16], near-surface cameras [14,15] and more recently, unmanned aerial vehicles [17].
A technical challenge imposed by phenological imaging is related to the memory space occupied, given the need for frequent and extensive imaging. Although hardware technology is continuously advancing in terms of storage capacity, some computing environments still face restrictions on data volumes, such as near real-time data processing applications and bandwidth-limited remote data transfer. For uses such as phenological monitoring, redundancy in image sequences due to invariant or slow-changing pixels contributes largely to the storage issue. Another significant feature of uniform high-resolution images is that they provide detailed data about relevant and non-relevant regions, without distinction, as a result of the underlying acquisition process in traditional cameras. In foveated vision systems, such an issue relates to the task of finding fixation points (i.e., foveal centers) across one or more images, which enables a space-variant image representation to be devised. When applying models of human vision, however, one must be aware of the difficulty in predicting human fixations, as these vary strongly between different subjects and tasks. Thus, as an extra challenge, algorithms to find appropriate fixation points should be application-dependent, as there is no optimal routine that covers all possible vision tasks [18].
Inspired by foveation and human perception, some approaches have attempted to address these challenges. Niu et al. [9] proposed a foveal imaging system comprising a set of lenses and a scanning microlens to produce images where a region is locally magnified and the remaining regions stay unchanged. The authors claim the system is suitable for object tracking and monitoring applications due to its dynamic foveal mechanism. However, peripheral resolution cannot be degraded and different microlenses are necessary to vary the structural foveal arrangement. Shi et al. [10] introduced a foveation-oriented approach to compress remote sensing images in which good image quality levels can be retained for inspection tasks. Nonetheless, that study considers only a central fovea and does not investigate other fixation points, which could be useful in real scenarios. Bektas et al. [11] took human perception concepts into account to compress geographical images in a perceptually lossless manner. By applying models of visual perception, the approach favored data reduction and user experience in tasks involving visual interpretation of images. The study conducted experiments with real users who had their eyes tracked while performing search tasks in a sequence of images. The results showed that few participants noticed the degradation artifacts caused by processing the images with the models. However, the participants were only assigned the task of looking for a circular map symbol in each image, so the study did not delve into more specific tasks such as identifying and recognizing places, structures, and other contextual elements.
The interesting balance between efficiency and resource allocation provided by foveal images has led us to further explore this idea. Here, we demonstrate an image representation approach that resembles the human retina in plant phenology tracking. Our method combines segmentation and image foveation to create foveal models and images with varying degrees of spatial resolution. Such variation depends on a contextual relevance that, in our validation scope, is determined by a phenological metric and a behavior pattern of choice. We show that our models yield storage savings, while also retaining efficiency in commonly adopted settings of the phenology research field.

This paper is organized as follows. Section 2 describes our approach for devising image models based on a change-driven strategy and on foveation concepts. Section 3 shows the adopted dataset, as well as the steps needed to create the models and to reconstruct images on top of these models. Section 4 presents experimental details regarding the validation steps, shows the results, and discusses them. Section 5 concludes the paper and highlights topics for further investigation.

Figure 1 shows the complete workflow of our approach. First, a set of images captured over a time period is used for training. Then, the following steps are performed:

1. Binary Map: This step creates a binary map encoding a phenological event, and segments the image space into foveal and peripheral regions. To this end, we employ a motion history histogram (MHH) [19] to create a frequency map representing the spatial occurrence of binary behavior patterns over a sequence of binary change maps (CM). A CM encodes a desired change, which is determined by a phenological metric, between two RGB images. For instance, the metric of Increase (Equation (1)) can be used to encode a Green-up event by capturing increases in the green-channel pixel values (leaves) of a certain plant individual over time.
In Equation (1), $CM_{Inc}(x, y, t)$ stands for the change map of two consecutive images from the sequence (the first one at timestamp $t$), and $I_t(x, y)$ refers to the $(x, y)$-pixel value in the $t$-th image. Note that a sequence of $N$ images yields $N - 1$ CMs.
The use of CMs to depict changes in plant phenology was previously developed and tested with data derived from ground-based direct visual observation, relying on MHHs to detect and represent temporal changes in arbitrary temporal multivariate numerical data [20].
Here, we applied the proposed binary encoding approach to characterize phenological events. The MHH stores the frequency with which each pixel exhibited a specific behavior in the CM sequence. For instance, a "short" Increase behavior can be modeled as 0110, which refers to an increase of the values of a specific pixel in two consecutive images. Figure 2 exemplifies the idea of detecting binary patterns in a series of typical phenological images. However, we use a binarized version of the MHH, as we are only interested in pixel positions where a pattern was detected with a frequency greater than a threshold sigma (σ). Figure 3 illustrates the complete process.

2. Gaussian Kernel Density Estimation (KDE): After delineating foveal and peripheral regions, the next step estimates a 2-d non-parametric probability density function from the binary map using a KDE. The final artifact of this process is a 2-d real-valued matrix representing the corresponding function and matching the size of the map. However, as an inverse analogy to the distribution of cones in the retina, we adjusted the function such that its values increase with the distance to the fovea(s), thus resembling an upside-down 2-d mixture of Gaussians. In Figure 1, the red and blue regions indicate high and low values, respectively.

3. Foveal Model: We used the non-parametric function and a Hilbert curve in a pixel-sampling procedure to create a foveal model. The Hilbert curve maps a 1-d parameter space to a higher-dimensional space (i.e., 2-d), thus creating a sequential order by visiting each midpoint of a square once. Unlike a usual raster-scan approach, the Hilbert curve favors the preservation of locality properties inherent in multidimensional data, as it traverses neighboring regions prior to visiting distant ones [21].
The proposed method initially divides the image space into four squares around its center. Whenever the Euclidean distances from a square's midpoint to its two closest neighbors are greater than the value of the non-parametric function at that midpoint, the square is recursively divided into four equal-sized squares. This can be envisioned as a gradual refinement process in which more pixels are sampled the closer they are to a fovea. Refinement of a square stops when a further subdivision would reach subpixel positions. The final curve is not homogeneous (see Figure 1), and its vertices constitute the non-uniform sampling scheme.

4. Space-Variant Region of Interest (sROI): Although the foveal model may contain regions of variable interest, one might choose only a subset of them. Having a gradually-decreasing resolution from the foveal centers towards peripheral regions could be helpful in specific circumstances. Instead of using a delimiting rectangular window or a binary mask over a uniform image, a space-variant model allows us to deal with a non-uniformly constrained region of interest, which we refer to hereafter as sROI, that may represent a phenological behavior of a given plant individual over time.

After a "training" step, a new set of images taken in a different time period but from the same area can be used to test the foveal models and their extracted sROIs. Images can be taken as a set of sparse points, or be reconstructed by several methods (e.g., Voronoi diagrams, quad-trees) that provide a 2-d space-variant representation. From that, data can be extracted and processed (e.g., phenological visual rhythms and interest-point/statistical descriptors). By using space-variant images, memory storage may be reduced, given that some image regions are represented at different resolutions and contain less data.

Figure 3. Pipeline of the Binarized MHH generation. This example illustrates the entire process applied to detect a specific pattern throughout the image series. First, two consecutive images $I_i$ and $I_{i+1}$ from an image series $S = \{I_i, I_{i+1}, \ldots, I_n : 0 \le i \le n, n \ge 1\}$ are compared according to a phenological metric (in this example, the pixel Increase). The comparison yields a binary change map $CM_i$ marking the pixels that conform to the metric. As comparisons proceed (i.e., as $i$ increases), all change maps (CMs) are inspected to detect pixels that change according to a behavior pattern (in this example, 010). Finally, the MHH is gradually updated upon detections. Suppose a pixel at position $(x_1, y_1)$ is found to follow the desired pattern (in this example, green border cells) at a certain point of the examinations. In such a case, $MHH(x_1, y_1)$ would increase by one. After the entire time series has been processed, the MHH is binarized by means of a predefined threshold sigma (σ).
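
The binary-map step described above (change maps, pattern counting, and thresholding) can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names are hypothetical, the Increase metric is assumed to be a strict pixel-wise comparison between consecutive images, and patterns are assumed to be matched with a sliding window over the CM sequence.

```python
import numpy as np

def change_maps_increase(series):
    """Binary change maps for an Increase-like metric.

    series: array of shape (N, H, W), one single-channel image per time step.
    Returns an (N-1, H, W) array where 1 marks pixels whose value grew from
    image t to image t+1. Assumption: strict '>' with no tolerance.
    """
    series = np.asarray(series, dtype=float)
    return (series[1:] > series[:-1]).astype(np.uint8)

def motion_history_histogram(cms, pattern):
    """Count, per pixel, how often a binary pattern (e.g. '010') occurs.

    Assumption: the pattern is matched at every window position along the
    CM sequence, so consecutive matches may share a bordering 0.
    """
    pat = np.array([int(c) for c in pattern], dtype=np.uint8)
    k = len(pat)
    mhh = np.zeros(cms.shape[1:], dtype=int)
    for start in range(cms.shape[0] - k + 1):
        window = cms[start:start + k]                      # shape (k, H, W)
        mhh += np.all(window == pat[:, None, None], axis=0)
    return mhh

def binarize(mhh, sigma):
    """Keep only pixels whose pattern frequency exceeds the threshold sigma."""
    return (mhh > sigma).astype(np.uint8)
```

For a sequence of N images this yields N − 1 change maps, matching the count stated in the text; the binarized map is then the input to the density-estimation step.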

Materials and Methods
In this section, we detail the experimental procedure adopted, including the dataset used, how foveal models are generated, and the image reconstruction (or resampling) step.

Dataset
The dataset employed contains daily sequences of RGB images (in JPEG format) with 1280 × 960 pixels. The sequences cover the years from 2012 to 2015; every day, an average of six images per hour were captured between 6 am and 6 pm. Binary masks of field-identified individual plants are also available. The images were produced with a digital hemispherical lens camera (Mobotix Q 24, Mobotix AG, Germany), which was placed on a monitoring tower far above the canopy and recorded the phenology of a Cerrado (neotropical savanna) area at Itirapina, São Paulo, Brazil [15,22]. Our study covered the transition between the dry and the wet season, a period when most of the plant species are producing new leaves. According to [22], in which a dataset from the same area was used, leaf flush periods occur from the end of August to the beginning of October.

Foveal Models
To generate the models, we used the 2012 image dataset, which was preprocessed as follows. First, we removed unwanted elements, such as the camera tower and the information inserted by the camera's software. Then, images were down-scaled to 25% of their original width and height (320 × 240 pixels) to favor the creation of well-structured models (i.e., with a few, smooth foveal regions). Through scale-space inspections based on Gaussian pyramid decompositions, we noticed that images at lower scale levels (i.e., with reduced sizes) provide more satisfactory models, as down-scaling removes most noise-like data. Each image was then converted into its green chromatic coordinate (GCC) representation. The GCC is a common index used for near-surface phenology that reflects a measurement of the proportion of green color signal on an RGB image pixel or region [15,22,23]. Finally, we calculated the 90th percentile of each day to encapsulate relevant daily data into a single image and, possibly, minimize the impact of lighting changes (e.g., intensity, angles) on the time series [23].
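
As an illustration of the last two preprocessing steps, the sketch below computes the GCC as G / (R + G + B) and collapses the images of one day into a single image. The function names are hypothetical, and the 90th percentile is assumed to be taken per pixel across the day's GCC images, which is one plausible reading of the procedure.

```python
import numpy as np

def gcc(rgb):
    """Green chromatic coordinate G / (R + G + B), computed per pixel.

    rgb: array of shape (..., 3). Pixels with R + G + B == 0 are set to 0.
    """
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1)
    out = np.zeros(total.shape)
    np.divide(rgb[..., 1], total, out=out, where=total > 0)
    return out

def daily_percentile(gcc_stack, q=90):
    """Collapse one day of GCC images, shape (K, H, W), into a single image.

    Assumption: the percentile is taken independently for every pixel.
    """
    return np.percentile(np.asarray(gcc_stack, dtype=float), q, axis=0)
```

A day with K captured images would thus be summarized as `daily_percentile(np.stack([gcc(img) for img in day_images]))`, where `day_images` is a hypothetical list of RGB arrays.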
We selected four binary patterns typifying different behaviors of a phenological change of Increase (see Section 2) encoded in MHHs. The 010, 0110, 01110, and 011110 patterns (ordered from the shortest to the longest) indicate how pixels behave in terms of frequencies of continuous changes (sequences of "1"s) bounded by steady states (bordering "0"s) throughout the image series. Although other temporal patterns could be contemplated, we have chosen these because, together with the Increase metric, they are capable of encoding the Green-up change. For each pattern, we used different values of sigma (σ; found empirically) to binarize the MHHs. Finally, the models generated (i.e., their points) were re-scaled back to the original, high-resolution size. Figure 4 shows the MHHs, KDE implicit functions, and foveal models generated for each pattern. The "spreading" aspect of MHHs also allows us to identify all regions that respond to a particular event.
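
To make the model-generation steps of Section 2 (density estimation and recursive refinement) more concrete, the following sketch builds a non-uniform set of sample points from a binarized MHH. It rests on several assumptions that are not taken from this work: a fixed-bandwidth Gaussian KDE, an inverted density rescaled to a hypothetical `max_spacing` in pixels, a square image whose side is a power of two, and the side length of a square used as a proxy for the distance between neighboring midpoints. The Hilbert ordering of the points is omitted; only the quadtree-like refinement is shown.

```python
import numpy as np

def inverted_kde(binary_map, bandwidth=16.0, max_spacing=32.0):
    """Inverted Gaussian density: near 0 at foveal pixels, larger far away."""
    h, w = binary_map.shape
    gy, gx = np.mgrid[0:h, 0:w]
    density = np.zeros((h, w))
    for y0, x0 in zip(*np.nonzero(binary_map)):
        density += np.exp(-((gy - y0) ** 2 + (gx - x0) ** 2)
                          / (2.0 * bandwidth ** 2))
    if density.max() > 0:
        density /= density.max()
    return max_spacing * (1.0 - density)

def refine(f, x0, y0, size, level, points):
    """Split a square while it is coarser than the function at its midpoint."""
    cx, cy = x0 + size / 2.0, y0 + size / 2.0
    h, w = f.shape
    value = f[min(int(cy), h - 1), min(int(cx), w - 1)]
    half = size / 2.0
    if size > value and half >= 1.0:      # stop before subpixel squares
        for dy in (0.0, half):
            for dx in (0.0, half):
                refine(f, x0 + dx, y0 + dy, half, level + 1, points)
    else:
        points.append((cx, cy, level))    # midpoint and refinement level

def build_model(f):
    """Start from four squares around the image center, then refine each."""
    half = f.shape[0] / 2.0
    points = []
    for dy in (0.0, half):
        for dx in (0.0, half):
            refine(f, dx, dy, half, 1, points)
    return points
```

Each returned tuple holds a midpoint and the refinement level at which it was produced, so regions near a fovea end up with many deep-level points and peripheral regions with few shallow ones.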

Image Reconstruction
One plausible way to use a foveal model is to return to the 2-d space by means of a reconstruction step. The resulting image is stored on a uniform grid, but its contents are represented in a space-variant domain. Figure 5 illustrates the procedure we adopted in this paper, which relies on calculating a Voronoi diagram for the set of points comprising each model, then reconstructing the image by drawing Voronoi cells filled with the same RGB color of the cell's central pixel in the captured scene. This creates a foveal image that has greater content heterogeneity in foveal regions, as these carry higher resolutions compared to peripheral ones.
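
The reconstruction described above can be sketched without building the Voronoi diagram explicitly, since filling each Voronoi cell with the color at its generating point is equivalent to giving every pixel the color of its nearest sample point. The function name is hypothetical, and the brute-force distance computation below is an assumption suited only to small examples; a spatial index such as a k-d tree would be preferable for full-resolution images.

```python
import numpy as np

def reconstruct_voronoi(image, points):
    """Nearest-sample reconstruction of an RGB image.

    image:  array of shape (H, W, 3), the captured scene.
    points: sequence of (x, y) sample positions from a foveal model.
    Every output pixel takes the color found at its nearest sample point,
    which is what drawing color-filled Voronoi cells produces.
    """
    image = np.asarray(image)
    h, w = image.shape[:2]
    pts = np.asarray(points, dtype=float)
    gy, gx = np.mgrid[0:h, 0:w]
    # Squared distance from every pixel to every sample point: (H, W, K).
    d2 = (gx[..., None] - pts[:, 0]) ** 2 + (gy[..., None] - pts[:, 1]) ** 2
    nearest = d2.argmin(axis=-1)
    px = np.clip(pts[:, 0].astype(int), 0, w - 1)
    py = np.clip(pts[:, 1].astype(int), 0, h - 1)
    colors = image[py, px]                # one color per sample point
    return colors[nearest]
```

Large cells in the periphery therefore appear as flat, single-color patches, while foveal regions keep most of their original detail.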

Results
In this section, we present a validation for the proposed workflow. We assess the effectiveness of foveal models in terms of time series' correlation rates of an appropriate vegetation index and memory storage usage induced by models.

Evaluation for Reconstructed Images and ROIs
In plant phenology, monitoring variations in plant individual features, such as those related to color and shape, is paramount to understand the phenophases that these individuals undergo, and, consequently, the associated ecosystem processes [15]. A suitable measure to determine these variations in digital images over time is the mean GCC of image pixels or regions. Thus, we validated the foveal models from Section 3.2 by comparing regions from original and reconstructed images having plant individuals known to undergo the same phenological change (i.e., Increase) encoded in the models. We used individuals of the Aspidosperma tomentosum and Caryocar brasiliensis species (for details see [22]).
We tested the models with high-resolution images from the years of 2013, 2014, and 2015. The top panels of Figure 6 show the mean GCC time series regarding the tested years for original and foveal images with each model. To measure the similarity between the original and foveal time series, we calculated the Pearson correlation coefficient between the series (Figure 6, bottom panels). The high positive correlation results suggest that the reconstructed visual information is still significant even under varying resolution and degradation levels caused by the space-variant representation. Additionally, models encoding the 010 and 0110 patterns seem more effective at incorporating visual information from Aspidosperma tomentosum individuals, whereas the 011110 pattern is more effective for Caryocar brasiliensis. Figure 7 presents examples of the phenological images used in the experiments, and their masked, reconstructed, and mask-reconstructed versions for a visual inspection.
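
The comparison between original and foveal time series can be sketched as follows. The function names are hypothetical, and the mean GCC of a region is assumed here to be the average of per-pixel GCC values inside the individual's binary mask; averaging the color channels first would be an equally plausible alternative.

```python
import numpy as np

def mean_gcc(rgb, mask=None):
    """Mean per-pixel GCC of an image, optionally restricted to a mask."""
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1)
    g = np.zeros(total.shape)
    np.divide(rgb[..., 1], total, out=g, where=total > 0)
    if mask is not None:
        return float(g[np.asarray(mask, dtype=bool)].mean())
    return float(g.mean())

def pearson(series_a, series_b):
    """Pearson correlation coefficient between two equally long series."""
    return float(np.corrcoef(series_a, series_b)[0, 1])
```

Given hypothetical lists `originals` and `foveals` of RGB images for the same days, the correlation for one individual would be `pearson([mean_gcc(o, mask) for o in originals], [mean_gcc(f, mask) for f in foveals])`.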
We also conducted image quality evaluations. Table 1 shows the root mean squared error (RMSE), mean absolute error (MAE), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR) results for reconstructed images using each model (binary pattern). PSNR and SSIM are two image quality metrics commonly used in the literature to compare different image compression/reconstruction schemes. The PSNR expresses, in decibels, the ratio between the squared peak signal value and the power of the noise (i.e., the mean squared error), whereas the SSIM quantifies (in the [0, 1] range) the similarity between two images by considering perceptual differences, which include the structural divergences between the depicted objects (e.g., along their edges) caused by image degradation [24]. As expected, SSIM and PSNR values are low, due to the large Voronoi cells in peripheral areas. However, RMSE and MAE values were acceptable, indicating that the degradation, although severe in some regions, does not shift the error rates too much, thereby suggesting that foveal images may still be accurate for some analyses, such as plant phenology tracking (as shown in the present study).

Table 1. Results for reconstructed images using the four models respective to the 010, 0110, 01110, and 011110 binary patterns. The considered metrics are root mean squared error (RMSE), mean absolute error (MAE), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR).
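
For reference, the three error-based metrics can be computed as in the sketch below, using their standard definitions. The function names are hypothetical, and the peak value of 255 assumes 8-bit images; SSIM is left out because it requires a windowed computation that is better taken from an image-processing library.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two images of equal shape."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(a, b):
    """Mean absolute error between two images of equal shape."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.mean(np.abs(diff)))

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in decibels: 10 * log10(peak^2 / MSE)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")               # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```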

Evaluation for sROIs
Instead of image reconstructions, a subset of the 1-d points from a Hilbert curve can be used. The curve embedded in each foveal model is non-homogeneous and presents variable point densities across the 2-d space it fills. By considering the curve as a sequence of points, such variability can also be verified, but it is necessary to know the refinement level each point in the sequence belongs to. In essence, a new refinement level is obtained every time a recursive call is performed during the model generation (see the algorithmic description in Section 2-step 3), thereby dividing a square area (at the n-th level) into four equal-sized squares, each of which will then belong to the (n + 1)-th level. Points at lower levels are associated with poorly refined regions and coarser resolutions, while points at higher levels are associated with deeper refinements and finer resolutions. In this vein, we introduce the concept of minimum refinement levels (MRLs) to constrain a sequence to a subset comprising all points at and above a certain desired level. We also refer to this MRL-constrained sequence as sROI.
In Figure 8, we use (i) different colors to represent distinct refinement levels and (ii) a quad-tree layout we believe is convenient for visualizing the refinement steps leading to the non-homogeneous aspect of the curve. For instance, the 6th MRL includes points at the 6th, 7th, and 8th refinement levels.

Figure 8. ..., and 8th (right) minimum refinement levels (MRLs). Each level is represented in a distinct color. In this case, the lowest MRL is the 2nd, meaning that at least two recursive refinements were performed for any of the four points of the initial curve. The 2nd MRL therefore includes all points of the model. Nevertheless, the highest refinement level of the depicted model is the 8th one. Thus, the 6th MRL comprises the points at the 6th, 7th, and 8th refinement levels, whereas the 8th MRL includes only the points of its own level.
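
The MRL constraint amounts to a simple filter over the point sequence. The sketch below assumes, hypothetically, that each point of a model is stored as an `(x, y, level)` tuple in curve order; the function name is likewise an assumption.

```python
def constrain_to_mrl(points, mrl):
    """Keep the points at or above a minimum refinement level (MRL).

    points: sequence of (x, y, level) tuples in curve order.
    The relative order of the retained points is preserved, so the
    result remains a 1-d sequence along the curve.
    """
    return [p for p in points if p[2] >= mrl]
```

With levels ranging from 2 to 8, `constrain_to_mrl(points, 2)` would return the full model, while `constrain_to_mrl(points, 8)` would keep only the most refined points.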
The neighborhood preservation property of the Hilbert curve allows us to compare mean GCC correlations from ROIs and sROIs directly. The ROI, in this case, refers to the binary masks of each individual provided by the dataset. Figure 9 shows time series and correlation values for sROIs at the 6th and 8th MRLs of each model. Since sROIs enclose many regions conforming to the event encoded in their models, low correlation values are expected (Figure 9-bottom panels). Moreover, the correlation analyses showed a reverse behavior of some models across distinct species. For example, in the 2013 data, the 010 pattern is positively correlated with Aspidosperma tomentosum, but negatively correlated with Caryocar brasiliensis.

Storage Usage
A benefit of adopting space-variant image representations is the reduced data volumes. First, our image reconstruction process essentially compresses the images in a space-variant manner. Thus, it is enough to compare the size of these images with the ones obtained with other compression methods. We compared the original image, JPEG images with five quality factors varying from 90 (best) to 10 (worst), and the four foveal models from Section 3 (Figure 10). Our foveal models generated resampled-image sizes that are similar to those obtained via more aggressive JPEG compression settings, i.e., by applying qualities 30 and 10. Additionally, there is a significant difference between the image sizes obtained with our foveal models and with JPEG quality factors above 30, which produce images with less degradation at the cost of higher storage sizes. In contrast, we observed a reduction of around 99% in the number of points (pixels) for the images generated from the foveal models in comparison to the original image (Table 2). The same holds for bytes.
Compared to ROIs and sROIs, the Voronoi reconstruction leads to a higher impact on memory storage rates, since more data is required to represent a foveal image in a 2-d uniform Cartesian grid. However, as foveal images show greater visual homogeneity in peripheral regions, due to large super-pixel-like artifacts (i.e., Voronoi cells), additional standard compression techniques can also be applied to decrease the storage sizes even further. Moreover, there is no need to save the vertices of the Voronoi cells, as these can be calculated by Voronoi algorithms, although this may require extra and repetitive computational processing. Finally, the combined results of Table 2 and Section 4.2 suggest that the usage of full or MRL-constrained sROIs in phenology tracking (as substitutes for 2-d images) is a valid and viable approach to reduce the manipulated data sizes.

Figure 10. Comparison of the storage sizes (in kilobytes, KB) of compressed images with our change-driven approach and with the usual JPEG compression process. The models evaluated were the ones generated by the 010, 0110, 01110, and 011110 binary patterns, whereas the JPEG quality factors considered were 10, 30, 50, 70, and 90. The graph shows the mean size of the compressed images.

Conclusions
We have introduced a change-driven image foveation approach to deal with large volumes of data from phenological images. Several phenology applications must be aware of storage limitations, such as real-time expert processing systems, very-high-resolution imaging sensors, and low-bandwidth remote data transmission. As global long-term and widespread databases of phenological imaging become available, efficient storage with minimal loss will be necessary. To address these problems, we propose to create foveal models that are able to encode phenological metrics and a behavior pattern. MHHs and a Hilbert curve provide the fixation points/regions and the varying-resolution aspect of the models, respectively. We then propose that these models be applied to create foveal images that contain less data while retaining most of the relevant information.
We evaluated model correlation rates for mean GCC time series (2-d and 1-d scenarios), visual quality, and memory storage. Our results show a reduction in the amount of stored data and a viable new image representation, both in terms of quality and relevant-data preservation. In the image compression spectrum, our approach is also valuable as it reaches similar image storage sizes to those obtained with a simple JPEG compression technique using low-quality factors. Although the compressed images show visual artifacts, our compression proceeds in a semantics-wise manner as determined and encoded by the foveal models.
Data variability may represent a challenge to any remote sensing approach targeting vegetation tracking, and we have employed foveation precisely to account for such variation. Although our foveal models are static and built on top of the behavior seen in a specific year, their resolution-degrading configuration still correlates well with those from subsequent years. For a very long time series, however, our approach might have some natural drawbacks, as climate and anthropogenic issues contribute to amplify uncertainties over time. We leave this investigation for future work.
Finally, examining space-variant imaging sensors and Field Programmable Gate Array devices that are able to handle foveation at the hardware level is a promising research avenue, as these could boost the autonomy of the image acquisition process, particularly in remote areas. In this context, evaluating the energy consumption levels of different foveation procedures, performed on variable hardware and software platforms, could be an interesting goal for future studies.