Semi-Automatic Spectral Image Stitching for a Compact Hybrid Linescan Hyperspectral Camera towards Near Field Remote Monitoring of Potato Crop Leaves

The miniaturization of hyperspectral cameras has opened a new path to capture spectral information. One such camera, called the hybrid linescan camera, requires accurate control of its movement. Contrary to classical linescan cameras, where one line is available for every band in one shot, the latter asks for multiple shots to fill a line with multiple bands. Unfortunately, the reconstruction is corrupted by a parallax effect, which affects each band differently. In this article, we propose a two-step procedure, which first reconstructs an approximate datacube in two different ways, and second, performs a corrective warping on each band based on a multiple homography framework. The second step combines different stitching methods to perform this reconstruction. A complete synthetic and experimental comparison is performed by using geometric indicators of reference points. It appears throughout the course of our experimentation that misalignment is significantly reduced but remains non-negligible at the potato leaf scale.


Context
Near in situ agriculture observation has experienced a massive boom in recent years with the arrival of the big data era. Plant disease monitoring is of fundamental interest for crop management. Several studies related to different kinds of cultures have thus emerged, e.g., corn [1], wheat [2], grapes [3], or potatoes [4].
Nowadays, hyperspectral imaging (HSI) appears as an interesting track for agriculture to measure the plant phenotype [5]. HSI provides datacubes with two spatial and one spectral dimension. Observing the spatial pixels along the spectral dimension allows to analyze the spectral content of small areas. Specifically, health plant monitoring systems are devised from multiple embedded systems, such as satellites [6], aircraft [7], unmanned aircraft vehicles (UAV) [2], ground-based vehicles [8], and laboratory systems [9].
In the meantime, UAVs have become popular and affordable, so scientific experimentation with embedded sensors has emerged. Onboard sensors have to be compact and robust and should work with limited resources (energy and processing unit), especially with small UAVs with limited carrying capacities.
Among HSI compact sensors-which may be carried by UAVs-only multishot (spatiospectral linescan) [10] ones can yield images with a large spatial spectral and time resolution. This fine resolution enables to obtain accurate radiance images of parts of vegetation crops. This kind of sensor requires a precise movement to obtain a consistent datacube. However, uncontrolled transient movements of light UAVs embedded with compact commercial GNSS controller make it difficult to recover datacubes with a reconstruction error lower

Related Works
Generally, image stitching is tackled over pairs of RGB images presenting a significant overlapping area with different viewing angles. One classical way to tackle this problem is a feature-based registration approach. It usually involves a two-step technique [13]. The first one is to look for a putative set with matching pairs of points between images, which is usually performed with the search for invariant features [14]. It is usually followed by a curative step that aims at removing false matches, also viewed as outliers. This step enables providing a clean matching set.
Image alignment, which considers the optimal matching set as the input, is often performed through a single global transformation [15], but misalignments still exist, due to model inadequacy. To overcome this problem, some authors proposed some local projective models, such as as-projective-as-possible warp [16] or shape-preserving-half-projective warp [17], which warps locally according to the proximity of the reference points.
However, these techniques cannot be applied directly to pairs of hyperspectral datacubes because each pixel has nearly two hundred bands, which cannot be realigned independently for computation reasons. Instead, a reference layer is chosen, and the extraction of coordinate feature points is performed between two identical reference layers from both datacubes. Then, reference layer stitching is performed by finding a single adequate model between the reference layer from datacube 1 and datacube 2. The same model is applied to other layers since each datacube is already aligned within layers [18].

Our Contribution and Paper Organization
In our case, extracting feature points directly from raw images is impossible since raw images involve several spectral bands which have their sensing specificity and sensitivity. Instead, once a first datacube reconstruction is performed, there still exists misalignments between layers. Contrary to classical image stitching or datacube stitching, a collection of warping models has to be learned with respect to a reference layer. However, searching for feature points in each layer is tedious, so extrapolating them with feature points from a few spectral layers feature appears as an interesting solution.
In this paper, the hybrid linescan camera is introduced. We propose two new approximate datacube reconstruction methods, which both assume a regular linear move of the sensor, but which highlight two different points of view. The first one is based on heuristic shifts between raw frames, while the second one uses the physics of the sensor. However, spectral layers obtained from both methods still contain misalignments, due to non-rigid object moves, according to the spectral band and object height. A reference spectral layer is chosen as the one obtained with NADIR rays, and it is considered geometrically unbiased. As a result, any spectral layer is aligned on the reference spectral layer by identifying a specific warping homography for each layer. As an illustration, the whole procedure is summarized in Figure 1. A comparative study of state-of-the art-stitching methods is performed to evaluate the quality of the corrective step. The remainder of the paper reads as follows. Firstly, it emphasizes projective warps, which maps the matching points from one image to another. It then focuses on the specificity of our hyperspectral camera to develop an approximate spectral recovery. However, given that objects from various spectral planes are not well spatially located, various stitching methods are proposed for each spectral layer. It is then followed by practical experimentation, which shows a significant improvement of the geometric scattering of reference points.

Inlier Set of Matching Pairs
Image alignment is often tackled between a pair of selected images. The first step to perform image alignment is to get an input set of matching points, i.e., a set of N P putative corresponding points described as P {(p j , q j )} N P j=1 , where p j accounts for the coordinates in input Image 1, whereas q j accounts for the corresponding coordinate vector in Image 2. This is usually achieved by applying feature detection algorithms that look at the invariant feature descriptors in each image, such as scale invariant feature transform (SIFT) [19] or speeded up robust features (SURF) [14]. However, the putative set is usually embedded with false matches, which may corrupt the alignment step, so some specialized algorithms are devoted to removing mismatches [20,21]. In some cases, the automatic generation of an input set is not feasible for several reasons, especially in agricultural crops, where plant patterns are repetitive, so it calls for the manual extraction of reference points [12,22].

Global Homography Warping (GHW)
Consider that we get an optimal inlier set I {(p j , q j )} N I j=1 ⊂ P-where N I stands for the number of feature points from this set-which describes the mapping between 2D coordinates from Image 1 to Image 2. A homography matrix H relates homogeneous coordinates p j [x j , y j , 1] T from image 1 to its corresponding points in the second image q j [u j , v j , 1] T which can be written as follows: where ∼ accounts for collinearity between both vectors. Equation (1) is also called global homography warping (GHW). Another point of view consists in considering two orthogonal vectors to q j , e.g., [0, −1, v j ] and [1, 0, −u j ], which may be used to obtain equivalent equations, i.e., where h vec(H) and A j may be read from Equation (2). Stacking Equation (2) for all coordinates belonging to the inlier set I enables to introduce a global matrix A defined as Instead of exact equality, this search is usually formulated by looking for the least significant singular vector under an optimization framework, i.e., This formulation is often regarded as being very sensitive to noise, which corrupts the point coordinates from the inlier set [4,23], but some simple coordinate changes are able to improve the result quality [23].

Improved Warping Methods
A single homography relies on the assumption that all the objects from the scene belong to the same plane, which is scarcely true. As a consequence, local homographies are carried out by dividing the image into small grids [16] and by computing local weights according to the proximity of reference points to the grid center. This gives rise to a set of weighted optimizations which may be written as follows: where ∆ accounts for a grid number of the scene among N ∆ grids, and W ∆ results from the computation of local weights derived from the grid center [16]. This method is denoted as as-projective-as-possible (APAP).
As an alternative, dual homography warping (DHW) [24] may be performed by dividing the matching set into two groups, each one having its own homography model. Any pixel in the scene is subject to its specific warping model resulting from a convex combination of these two homography models. Indeed, the current convex combination coefficient is computed from the ratio of distances to each cluster center, which makes the DHW model specific to the current pixel [24].
In addition, both methods consider implicitly that a scene is made of a few continuous zones, where objects from the same zone belong to the same plane. Moreover, the plane change occurs at the border of a specific region of 2D images. This assumption is sometimes difficult to check with complex scenes.
Furthermore, image stitching algorithms are not able to perform datacube stitching. Only Zhang [18] proposed a robust elastic warp that fits first to the central spectral plane from one cube to another and then adapts the transformation to other spectral planes.
Nevertheless, owing to the sensor structure, neither image stitching algorithms nor datacube stitching ones can be applied to raw images. A first datacube reconstruction is expected to perform multiple-layer stitching jobs.
In addition, even when a first datacube is available, none of the previous stitching methods are directly applicable to perform several layer stitching tasks, but adapting them seems to be a good strategy. As an illustration, Table 1 which gathers the assumptions related to the scene of each previous method together with the advantages and inconveniences is introduced. Note that our new methods in Table 1 are denoted without the term "collection of" for the rest of the article since the basic ones are not able to be applied in the context of spectral layer alignment. GHW [23] Objects in the same plane Fast X X APAP [16] Smooth changes Slow X X DHW [24] Two distinct planes Medium X X SPHP [17] Two distinct planes Slow X X Zhang [18] Smooth changes X Fast X Collection of GHW (ours) Objects in the same plane X X Medium Collection of DHW (ours) Two distinct planes X X Slow

Methodology
In this section, a complete description of the camera and its carrier system is performed.

Spatio-Spectral Scanning
Our hyperspectral camera is provided by IMEC and falls into the category of compact spatio-spectral imagers [25]. It differs from classical linescan cameras in the sense that it does not provide a complete spectral description of a spatial line in one shot. It can acquire a complete scene within a range [460; 901] nm by moving linearly along the y direction as shown in Figure 2a since only one wavelength is obtained at one spatial position. Figure 2a illustrates that a point P in the scene may be observed by all the wavelengths by acquiring several images ∀t ∈ [t 1 , t 3 ]. Figure 2b shows the tracking of a feature point in both the central and extreme angles, which correspond to central and extreme band numbers.
Practically, a bank of raw images is acquired at a frame rate f e and it is indexed with the variable i ∈ N * , which is related to the acquisition time, i.e., scanning direction

Sensor Structure
Raw images obtained from this sensor have a resolution of 2048 pixels width (columns) by 1088 pixels long, which are viewed as rows. Figure 3 shows a typical raw image whose description is specified below: 1.
The 4 first rows are not used.

2.
64 spectral stripes (also called bandlets) of 5 by 2048 pixels each enable to inspect the visible wavelength range, represented in the top part of Figure 3.

3.
A 120 by 2048 pixels rectangle (corresponding to 24 stripes) accounts for a blind area. This area corresponds to a white rectangle in the middle of Figure 3.
The 4 last rows are not used either.
As a summary, a raw image contains 192 useful spectral bandlets covering both visible and NIR wavelengths. Each pixel from the raw image captures not only different spectral information, but also a different position in the scene.

3-Axis Representation
By adding a wavelength axis as a third dimension, a raw image may be converted into a stair, as shown in Figure 4, whose step is equal to the step wavelength.
x y b Figure 4. A 3D view of a raw image with 4 spectral bands.
A compact datacube may be built by stacking multiple stairs obtained from successive raw images along the scanning direction (y-axis). Owing to the stripe width (5 pixels), a displacement between two successive images strictly equal to 5 pixels corresponds to the case where successive stairs can be stacked together without any free space ( Figure 5). A shift between two successive images greater than 5 pixels leads to free space between the stairs corresponding to missing data. In cases such that the shift is lower than 5 pixels, successive stairs overlap along the y-axis, enabling to yield redundant data. This choice corresponds to the operating conditions for our study, where an interpolation task between redundant data may be carried out. However, accurate control of the shift between two successive images is not easy, except with laboratory systems. As a consequence, we chose to use such a system, which enables better control of the overlapping along the acquisition process. The description of our laboratory system is provided in the experimental section (Section 6.2).

Two Proposed Spectral Stitching Methods
We here propose two approximate spectral reconstruction methods as a first step of the datacube design. The first one is based on heuristic spectral reconstruction, while the second one uses the physics of the hybrid linescan sensor.

Heuristic Based Spectral Reconstruction (HSR)
In this section, we propose a new spectral stitching method for hyperspectral data reconstruction. This method is composed of two steps, as described in Figure 6:

1.
A sub-datacube reconstruction for hyperspectral image reconstruction and extraction of hyperspectral bands; 2.
A fusion of sub-datacubes based on matching procedures.

Raw Images
Visible Sub-datacube A hyperspectral raw frame has three zones: visible, dead and NIR. The dead (blind) zone splits each frame into two zones: the first one involves 64 visible spectral bands, while the second one includes 128 near-infrared (NIR) spectral bands. To reconstruct a sub-datacube in each zone (visible and NIR), the first step is based upon finding a frame step number shift, which derives from a spatial drift between manually extracted feature points from raw images. Therefore, hyperspectral layers for visible and NIR images may be built by stacking several stairs from regularly chosen raw images. Then, it results in two independent sub-datacubes issuing from these two zones. Finally, in a second step, image matching is performed between each layer of the two sub-datacubes and a chosen reference layer.

Sub-Datacube Reconstruction
In this subsection, hyperspectral images are reconstructed from raw images by extracting spectral bands for visible and NIR layers. To achieve this goal, each sub-datacube is built by stacking a stair from a particular frame as described in Figure 7, and performing the same job with the frame number increased by the frame step parameter. Let i n−1 be the current frame number, then the next frame number i n to be considered is expressed as follows: where n is the stair number. Since the index i n should be an integer, the frame step should also be an integer. If its estimation is not an integer, it could then be approximated as an integer. b y i 1 = 1 Once the right frame number is selected, the second question is which the y-coordinate from the datacube corresponds to the selected band in the selected raw image. The answer derives from the definition of a global offset. First, from the IMEC configuration file of our hybrid linescan camera, two characteristic offsets are available from the raw image definition (Section 3.2): Then, the y s position on the sensor of the beginning of the stripe number b may be deduced from the sensor structure, i.e., Then, the stripe coming from the n th stair for band b should be stored at position y b n in the sub-datacube. Its expression is derived from Figure 7 and is expressed as follows: The reconstruction algorithm which derives from Equations (6), (8) and (9) is provided in Algorithm 1. As a result, two independent sub-datacubes are available. However, the frame step parameter f rame_step should be identified for a complete procedure.

Algorithm 1 Sub-datacube Reconstruction
for b from 1 to 64 do from frame i copy the 5 rows from position y s according to Equation (8) paste the 5 rows at position y b n into spectral plane b using Equation (9) end for for b from 65 to 192 do from frame i copy the 5 rows from position y s according to Equation (8) paste the 5 rows at position y b n into spectral plane b using Equation (9) end for update i to i + f rame_step end while

Estimating the Frame Step Parameter
The previous procedure needs to stitch one data stair beside another one. As a consequence, the frame step parameter needs to be identified. To this end, in each area (visible, NIR), a feature point extracted manually from several regularly spaced raw frames enables to estimate a frame step expressed in pixels per stripe width (5-pixel datacube). Tracking a feature point along the frame number enables the computation of an approximate frame step. Let ∆i i − i be the frame number shift and ∆y s y s (i ) − y s (i), then the frame step which enables to move from one stripe to another reads as follows: The frame step parameter may also be computed in the least square sense by estimating the slope parameter of the curve ∆i versus ∆y s . Practically, four regularly spaced raw frames are used to compute the parameter.
However, visible and NIR sub-datacubes are not matched together, due to the integer rounding of the frame step parameter. In fact, a drift will occur along with the datacube design. Indeed, if the real value of the frame step parameter is, for example, 9.5 frames per band, its integer rounding may lead to an approximate value of 9. A greater number of bands is inserted in the whole datacube, which leads to a datacube stretching effect, where the latter is increasing with the band number. On the contrary, a lower number of bands is stacked if the frame step is overestimated, which corresponds to a contraction effect. As a consequence, image matching should be achieved.

Matching and Fusion of Sub-Datacube
At this stage, the two sub-datacubes are not aligned; moreover, there are still misalignments between bands within the sub-datacube. As a consequence, the procedure proposed in this section consists of matching and fusing both visible and NIR spectral layers into a single global hyperspectral datacube.
To this aim, it can be noticed first that only part of the scene is shared by the two subdatacubes, so unshared parts may be removed to ensure that sub-datacubes can be fused. Then, the desired datacube is built by stacking the spectral layers of each sub-datacube at the right position. Figure 8 describes an example of a drift of a feature point coordinate along the spectral band number b. So, considering that the first band is the reference, the feature point should obtain the same position in every band. As a result, the offset in Figure 8 describes the shift to be performed to reach the target position (2800 px) for every layer. To perform this correction automatically, the target layer number b should be shifted according to a specific model. We propose to fit an affine model for each spectral area and then connect each model with an offset at the border of each spectral area. To that goal, a slope parameter is used for each zone (resp. slope Vis for visible area and slope N IR for NIR area). The slope parameter is calculated for each zone in the least square framework and approximated to an integer value. Then, an offset integer parameter, denoted as o f f set Vis2N IR , is deduced from the difference between the border of each area.
The first spectral band is considered the reference one in this case. Figure 8 shows the shift to be performed to stitch any layer with the reference one. This shift, denoted as o f f set, is computed with the slope parameter of each zone and potentially the offset parameter o f f set Vis2N IR . Algorithm 2 describes the new datacube obtained with each layer shifted by a computed offset, i.e., This method enables to design a datacube which may be evaluated afterwards. However, the method represented by Algorithms 1 and 2 requires the parameters to be integers. This condition implies that the frame step from Algorithm 1 is an integer. This assumption which may only be approximately checked may lead to a possible drift in the datacube reconstruction. The same drift may be observed in Algorithm 2 according to band number b due to an integer rounding error.

Physical-Based Spectral Reconstruction (PSR)
This new method takes into account the physics of image acquisition such that only one parameter is required to perform an approximate datacube reconstruction. Contrary to the previous method, this method is devoted to real parameter computation even if, in essence, indexes are integer numbers. In addition, it tries to exploit radiance redundancy in order to obtain reliable information. Figure 9 describes the different steps of the new method.

First Basis Change
Originally, each pixel coordinate is mapped to a single position in the coordinate system (x s , y s , i), where (x s , y s ) is the pixel position on the sensor for frame number i. Now, we would like to replace y s with a dual coordinate (b * , k), where b * accounts for the band number (1 ≤ b * ≤ 216), and k is the shift index within a band (1 ≤ k ≤ 5). Figure 10 shows these dual coordinates on successive rows from any raw image. The associated starting position for extended band number b * on the sensor, denoted y s , derives from the definition of the sensor, i.e., where b * ∈ {1 . . . 216} (Note that only 192 bands are really used since band 65 to 88 correspond to the dead zone of the sensor. b * is related to b, i.e., b * = b + 24 × 1 b>64 ). Now, each pixel in the original system coordinate can be mapped to a single position in the new coordinate system (x s , b * , k, i) by noticing that Equation (12) stands for integer division of y s − 5 by 5, i.e., As a result, b * and k can be obtained from the integer division y s − 5 by 5, i.e., b * = y s −5 5 where mod stands for modulo. Equation (14) enables to compute the new coordinates from the input system coordinate.

Second Basis Change
To build a hyperspectral datacube, it is required to fill a 3D array indexed by (x, y, b) coordinates, where (x, y) account for ground axes and b is the true band number. So, in other words, we want to find a function that maps each pixel from each raw image in the 3D array. The sensor is assumed to move in the y-direction at a constant speed expressed in pixels per frame, denoted as step. This assumption leads to the new coordinate system, where it can be noticed that y takes into account the linear kinetic model of the sensor over time, i.e., Note that some unused bands, corresponding to 65 ≤ b * ≤ 88, enable to skip the computation of the corresponding coordinates.

Interpolation of the Radiance
In the case where the speed is sufficiently low, the same point from the scene may be observed by several raw images in the same band b * . This situation enables to exploit the redundancy of the scene for each band due to low-speed conditions. This may be performed using an interpolation task that enables to fill the datacube with the estimated radiance.
Let (b * , k, i) and (b * , k , i ) be the coordinates observing the same point, then it leads to the following equation by equating y with both parameters as follows: As a result, a relationship between these parameters should be checked, i.e., However, this relationship is scarcely exact since i and k are both integers while the step parameter is not. Instead, it is usually searched for a set of indexes (i, k) which approximates every target integer coordinate y for band number b, i.e., This set S b * y gathers all pairs of candidates for a fixed value of both the target coordinate y and band number b * . In practice, these candidates from this set are looking at the same object in the scene. As a result, they may be used to perform an interpolation job.
Let us suppose for the sake of clarity that y from Equation (15), denoted as y b * k,i , corresponds to the coordinate with band number b * , shift index k, and image number i.
A weighted interpolation of radiance values may be performed according to the distance |y − y b * k,i |. For this purpose, we introduce the weight parameter for every member of the set S b * y as follows: A weighted average of the radiance values among candidates may be carried out using the previous weight definition Equation (19), i.e., where I b * k,i accounts for the vector of radiance for a complete row (with x fixed, corresponding to 2048 pixel values) obtained with all pairs of indexes (i, k) ∈ S b * y whileÎ b * y stands for the estimated radiance row vector at coordinate y for band number b * . Algorithm 3 summarizes the different steps explained before.

Algorithm 3 Physical-based spectral reconstruction (PSR).
Create a datacube with zero's values and size (nb_lines, nb_columns, nb_bands) for band b * from 1 to nb_bands do create table of estimated position according to Equation (15) for line from 1 to nb_lines do According to Equation (18), search and copy all lines to create an interpolated line according to Equations (19) and (20) paste the interpolated line at position (line, . . . , band) on datacube end for end for This formulation assumes there is no drift in the x-direction according to time. This hypothesis enables to perform a complete estimation of a row image. In practice, this assumption is difficult to check, and there is usually a small shift in the x-direction also. However, this drift is negligible for the candidates belonging to the same set S y,b , so Equation (20) remains a reliable estimator.

Estimating the Step Parameter
The step parameter plays an important role in the stitching task.

1.
Initialization. The knowledge of the sensor speed enables to propose an approximate step value, expressed in pixels per frame, i.e., where V d [mm/s] accounts for the sensor speed and GIFOV [mm/px] stands for the ground instantaneous field of view derived from the global ground field of view [26]. Since all the parameters are known, it is easy to propose an approximate step parameter as an initial value.
2. Update rule. The first run with an estimated speed parameter denoted as step leads to a first spectral reconstruction. Tracking a reference point in different bands enables inspecting a potential drift. Let y 108 (This band corresponds to a ray angle approximately equal to 0 degrees) and y b * be the y-coordinate of the point from, respectively, the reference spectral band and the b * th spectral band. Then, it turns out from Equation (15) that they may be written as follows: Additionally, a single coordinate of the reference point, denoted as y true , should be expected with the true parameter step in different spectral bands so that the following holds: By subtracting Equation (22) with Equation (23), it results in the following: By subtracting both equations in Equation (24), an estimation of the deviation from the true value of the step parameter may be found, i.e., In order to propose a new step value, Equation (25) may be applied once to find the step increment with a single spectral band b * or in the least squares sense with multiple target bands to fit the best parameter. This new step value may be computed as follows: Algorithm (PSR) may then be applied with the new step parameter to obtain better stitching performance. Figure 11 illustrates a more general recursive schema, where ∆step l accounts for the innovation at iteration l obtained from Equation (25). Now, a first datacube reconstruction is available. In conclusion, both PSR and HSR have advantages and inconveniences. HSR is faster but less accurate in the geometric object definition, while PSR is more accurate but requires more computation time. However, both methods still need the datacube to be realigned.

Corrective Warping of the Datacube
Once an approximate reconstruction of the datacube is performed, either with HSR or PSR, a corrective task may be considered. As accurate as the step parameter estimate may be, a physical error could persist. In addition, scanning the objects with various heights causes abrupt jumps in the y-coordinates, related to the ray angle. Figure 12 illustrates a jump in the y-coordinate due to object height variation. So, the spectral layers are still not aligned as expected.
scanning direction To further improve the reconstruction, several warping methods are applied to realign the spectral layers.

Building the Set of Matching Pairs
As classical automatic methods (e.g., SIFT or SURF) fail in building a set of matching pairs, it is necessary to manually extract the feature points. After noticing that paired points are aligned and regularly spaced, the following two-step strategy can be used:

1.
Identify several feature points in a few regularly spaced spectral layers.

2.
Predict the position of feature points in each intermediate spectral layer with linear interpolation.
This step provides a set S of feature point spatial coordinates in every spectral band. The set S is defined as follows: This set is randomly divided into two subsets: the first one, S tr for the training phase, the second one S te for the test stage.

Fitting the Warping Model
First, it is important to choose a spectral layer as a reference layer. Usually, the chosen layer is the NADIR one because it is not affected by the parallax effect. It here corresponds to band number b equal to 84. The warping model function w b→re f , parameterized by the reference layer and the target layer b, takes as input any pixel coordinates issuing from layer b and provides the estimated coordinates in the reference layer, i.e., The function w b→re f (., .) transforms each pixel position from spectral layer b into a reference layer pixel position. According to the chosen model from Section 2, the parameters of the function w b→re f are fitted for each band number b on the corresponding training set S b tr extracted from S tr and defined as S b tr {(x b j y b j ) : j ∈ {1 · · · card(S tr )} }. The learning phase tries to find parameters of a specific model η b which minimize the following distance for each b, i.e., These parameters depend on the chosen reference warping method. For the choice of a specific model, please refer to Section 2 which details different possible models.
Finally, (192 − 1) warping models are obtained via Algorithm 4: one for each spectral band different from the reference one. and (x re f , y re f ) Apply w b→re f (., .) on the test set to evaluate the test error end for

Applying the Model and Post-Processing
Applying a warping model w b→re f (., .) consists of creating a destination layer I b from an original layer I b , which should, in essence, be geometrically superposed to I re f . The intensity in I b derives from I b , i.e., In theory, the function w b→re f should be bijective, so each pixel from the destination layer should be filled with the initial layer. However, in practice, the created enhanced layer may contain some of the following problems: 1.
There may be some empty pixels since the warping model may not be surjective, i.e., ∃ (x re f , y re f ) ∈ X re f such that The empty pixels which are surrounded by filled pixels may be spatially interpolated by a post-processing step, as shown in Figure 13. The others are left empty. 2.
The model may not be injective since two points from the original space point toward the same destination. This property translates mathematically into the following:

3.
A point from X b may point outside from X re f . It is represented in Figure 13 as a missing part. In such a case, the corresponding point is not taken into account.
Missing part Empty part Retrievable part Figure 13. Warping issues.

Practical Experimentation
Our methods explained in Section 4 are applied first on a synthetic dataset, then on three experimental datasets. A comparative study of our reconstruction followed by some state-of-the-art warping methods, as described in Section 2 are performed on each dataset.

Synthetic Dataset
This dataset is generated by an algorithm which simulates the reflection of light rays on the objects. As a consequence, it highlights the impact of different ray angles on the image. To this aim, the scene is composed of a stack of four rectangular planes located in specific positions described in Table 2 as shown in a perspective view in Figure 14. The background outside of these objects is taken as a black pattern. The camera has a focal length of 12 mm. Raw images are taken every 0.1 s from a fixed altitude of 2500 mm. The ground IFOV (GIFOV) is equal in this case to 1.13 mm. The camera follows a uniform rectilinear movement with a speed equal to 11.50 mm/s so that the step parameter is approximately equal to 1 pixel per frame.

Real Dataset
Contrary to [10], where the weather conditions may vary throughout the acquisition process, we select experimentation for which sunny and stable light conditions are guaranteed. Our laboratory system is composed of a camera carried by a moving arm which is coupled to the moving part of a linear axis. A PLC controller enables to accurately control the position of the arm to a predefined setpoint. This setpoint is chosen with a fixed speed throughout the experimentation. This system is fixed over a movable scaffolding. Figure 15 shows the current setup of our laboratory system in a private garden. Note that the speed of the camera is chosen to ensure a step parameter lower than 5 pixels per frame, which yields conditions of band overlapping. The data were collected in Mametz, Northern France, on an outdoor potato garden. Three experiments were conducted respectively on 12 August 2020, 7 June 2021, and 19 July 2021, respectively called Mametz 1, Mametz 2, and Mametz 3. The parameters of these experiments are summarized in Table 3. It gathers several parameters, including the camera parameters and kinematic ones. In the case of Mametz 1, the camera, driven by an open-loop laboratory system, follows an approximate rectilinear movement. The camera speed may vary within reasonable proportions, according to mechanical friction but the average speed is roughly equal to 2.6 mm/s. Contrary to Mametz 1, the camera is driven by a speed-controlled system for other experiments, where the camera speed can be considered constant throughout the course of acquisition.

Evaluation Index
The quality of the fit may be evaluated by inspecting the root mean square error (RMSE) for each layer, refs. [16,24,27], on either the training set (S tr ) or the test set (S te ), defined as follows: where S t accounts for either the training set (S tr ) or the test set (S te ). P b stands for the coordinates of the estimated points in band b while Q re f are the coordinates of the reference points in the reference band. This error is converted into millimeters by dividing the error expressed in pixels by the size of one pixel GIFOV. This index is used throughout the experimentation to assess the performance of each method.

Results
The reference band, obtained for b * = 108 or equivalently b = 84, is built with NADIR rays so it has a perfect geometric reconstruction. Extreme bands are obtained with increasing angles from NADIR. As a result, the parallax effect increases in practice with |b * − 108|. The reconstruction error RMSE is then plotted throughout the experimentation, according to the extended band number b * (involving the dead zone). Figure 16a,b show two resulting layers obtained from the HSR method and the PSR method, respectively. On the left and on the right part of each figure, one layer is displayed individually, while a mixture of both layers is represented in the middle. It can be noticed that both methods involve significant misalignments, which can be measured with the RMSE index. Figure 17 intends to compare the two approximate reconstruction methods on a real dataset obtained on 12 August 2020. To achieve the comparison, the HSR datacube is resized to the PSR datacube size. As expected, the RMSE index increases with |b * − 108|. Roughly speaking, HSR and PSR appear to be equivalent. Contrary to HSR, which involves several trends in visible (b * ≤ 64) and NIR domains (b * ≥ 89), as shown in Figure 17, PSR keeps the same single trend along b * , represented by a single slope. In our opinion, a corrective warping would be more difficult to fit on HSR, so we decided to apply all future corrective steps on PSR reconstruction only.  Figures 18 and 19 show the evolution of the RMSE index according to the extended band number b * by referring to the training set and test set, respectively. The global homography warping (GHW) achieves the best performance. As explained in [23], without large noise on feature points, GHW often performs well, and advanced robust methods may not be useful. It may also be noticed that the improvement brought by DHW with respect to PSR appears here not significant.  During the Mametz 1 experiment, fifty feature points are assigned to the training set and one hundred to the test set, which corresponds to the large sets compared with the minimum requirement for testing and training. Figures 20 and 21 show the performance for the training set and the test set, respectively. Even if DHW performance looks bad on the training step, it finally appears as the best method on the test step. However, the enhancement with respect to GHW is not significant.  Figures 22 and 23 show the reconstruction performance for Mametz 2. This experimentation is conducted with a small focal distance, which leads to a reduced area viewed by all bands. Indeed, at the beginning and the end of scanning experimentation, only a small number of bands observe the scene. In such a situation, a small number of feature points are registered due to the reduced common area. DHW, which divides the training set into two unbalanced subsets, involves probably only the minimum number of feature points necessary to perform the learning step. Indeed, it is observed that one feature point which is affected to one class for the NIR domain is labeled to the other class in the visible domain. As a result, the performance of DHW method decreases in the visible area, due to a mismatch of one of the two homographies. This fact causes the asymmetry of the DHW performance curve, shown in both Figures 22 and 23, between the visible and NIR domains. Moreover, the noisy aspect of DHW performance may also be due to the small number of feature points from this experimentation. Globally, GHW is probably the most interesting method for its regular performance and its ease of implementation.  On Figures 24 and 25, GHW and DHW turn out to be equivalent. Note also that the best performance is achieved for this experimentation among all the others, with a maximum RMSE error equal to 1 cm. Even if the RMSE could be improved, the goal in potato health monitoring is to track essentially the evolution of the red edge [28], which is performed with bands within 127 < b * < 141. In this range, the error remains lower than 3 mm, which is considered a reasonable drift. To conclude, our feeling about experimentations is that using a single global homography warping (GHW) is probably preferable for further improvement. Furthermore, the most consistent experimentation is probably Mametz 3 since the camera speed is accurately controlled and the focal length is adapted for the scene observation.
Moreover, as a perspective, it should be noticed that a few homography models with an appropriate switching procedure should be considered in the future. The difficulty relies on the way the switching procedure should act.

Conclusions
In this article, we have formulated the problem of alignment of spectral layers for a specific spatio-spectral camera. After introducing the sensor structure, we defined a new coordinate system that enables to reconstruct an approximate datacube. However, the disparity of the scene object heights induces geometric misalignment, which is highly dependent on the band number. This corrective step is tackled with multiple spectral homography transformations, which enable to lower the object drifts. The reconstruction was evaluated through one synthetic dataset and three different real datasets. Though a significant improvement was achieved concerning the primary reconstruction on each dataset, the alignment accuracy still needs to be lowered regarding the size of vegetation leaves.