Planar-Equirectangular Image Stitching

: The 360° cameras have served as a convenient tool for people to record their special moments or everyday lives. The supported panoramic view allowed for an immersive experience with a virtual reality (VR) headset, thus adding viewer enjoyment. Nevertheless, they cannot deliver the best angular resolution images that a perspective camera may support. We put forward a solution by placing the perspective camera planar image onto the pertinent 360° camera equirectangular image region of interest (ROI) through planar-equirectangular image stitching. The proposed method includes (1) tangent image-based stitching pipeline to solve the equirectangular image spherical distortion, (2) feature matching scheme to increase correct feature match count, (3) ROI detection to ﬁnd the relevant ROI on the equirectangular image, and (4) human visual system (HVS)-based image alignment to tackle the parallax error. The qualitative and quantitative experiments showed improvement of the proposed planar-equirectangular image stitching over existing approaches on a collected dataset: (1) less distortion on the stitching result, (2) 29.0% increased on correct matches, (3) 5.72° ROI position error from the ground truth and (4) lower aggregated alignment-distortion error over existing alignment approaches. We discuss possible improvement points and future research directions.


Introduction
The proliferation of inexpensive 360°cameras has instigated various use of the camera, not only for research [1,2], but also for daily life [3][4][5]. Consumers have been using the 360°c ameras for recording and possibly sharing their social activities such as trips, holidays, or parties with others [3]. Some of them will even use a virtual reality (VR) headset to view the captured scene with great immersion, allowing them to re-experience the moments [6]. While mindlessly looking around the captured scene, they can find new things or unanticipated events, make them want to see the details of what was happening. Nevertheless, the view from 360°camera is not suitable for the detailed view because it cannot provide a high angular resolution view as much as perspective cameras could offer [7]. It would be interesting to search online for the perspective camera planar images that captured similar scenes then stitch them on top of the relevant region of interest (ROI). We bring forward an image-stitching algorithm that supports the high-resolution view on the ROI. The algorithm aims to stitch the high angular resolution planar image (HPI) onto the 360°equirectangular image (EI).
The image stitching process is generally used to combine multiple overlapping input images into a single wide field of view (FOV) output image [8], i.e., planar-planar image stitching. When stitching the input images, it is crucial to establish a well-aligned overlap-
We introduced the background case to perform the planar-equirectangular image stitching in this section. We next explain the past and current works on the image stitching in Section 2; emphasizing the problems found in existing approaches for the planarequirectangular case. We propose the TI-based image stitching solution to solve the problems; explain the TI underlying theory in Section 3 and the proposed stitching in Section 4. We performed the related experiments and discussed the results in Section 5. We present the conclusion and point out the future research directions in Section 6.

Planar-Planar Image Stitching
Planar-planar image stitching algorithms aim to join multiple overlapping planar images into a wider FOV or panoramic image [9]. The algorithm is principally designed to tackle artifacts on the overlapping region while keeping the stitching result natural [13], i.e., less distorted. One of the key factors influencing the artifacts' level is the quality of the image alignment step. Well-aligned input images are desirable to impose a lower requirement on the subsequent deghosting and postprocessing [14].
Image alignment is an act to transform the input images, making the adjacent ones aligned. The process focused on finding the geometric relationships (i.e., geometry models) within each input. Researchers have recently strived to find the geometry models based on sparse feature matching [9]. The goal is to attain the models that best minimize the feature pairs alignment error.
Existing feature matching-based image alignment methods can be grouped into global and local hybrid transformation [9]. The global methods transform the image by applying the same geometry model on a group of pixels. The simplest version used a single model, such as projective [15,16] or affine, for the entire image region. Due to its simplicity, the model only performs well if the inputs differ purely rotational or contain only one plane. The one plane characteristic mostly valid only for scenes captured far away from the camera. To solve more complicated situations where the captured scene consisted of different planes, Gao et al. [17] suggested the dual-homography method. They divided the image into two dominant planes (i.e., ground and distant), with each plane have its geometry model. Regardless, these methods demand the scene to have a certain number of dominant planes.
The methods belong to the local hybrid transformation model treat image as a mesh. The image mesh is divided into grids of vertices; each corresponds to a different transformation model. One well-known method that belongs to this category is the as-projectiveas-possible (APAP) approach [14]. They attributed each vertex to a separate homography model. They estimated the model by assigning various weight values for each feature pairs depending on their location relative to the vertex. This separated model designation allows each vertex to align themselves independently to minimize the closest feature pair alignment error. Although the model independence allowed them to receive excellent performance on aligning the input images, it also prompted a distortion problem. For instance, the stitching result will not be likely able to preserve rigid or straight objects [9]. Researchers [18,19] have been adding additional constraints, such as similarity [20], to reduce the distortion. However, the constraints inclusion costs higher computational time.

Non-Planar Image Stitching
Compared to planar-planar image stitching, there are only a few researchers who attempted to stitch non-planar images. Dornaika et al. [21] tried to create a multi-resolution image from a pair of wide-angle low-resolution and high-resolution image for surveillance application. Instead of employing sparse feature matching, they used pixel-based correspondences and homography models to perform image alignment. While they had a similar purpose, we plan to perform the stitching on the EI whose higher image distortion than their wide-angle image.
Dong et al. [22] stitched wide-angle images on a fisheye-lens video. As they implemented a planar feature extractor, the matching accuracy was initially low. They devised a multi-homography inlier selection method to improve the accuracy. They estimated a global homography and multiple local homographies from the correct matches. They finally constructed the stitching result by combining the weighted global and local homographies. The research showed that modifications of existing planar image processing were essential to handle image distortion on non-planar image stitching.

Icosahedron to Reduce Equirectangular Image (EI) Distortion
High spherical image distortion on the EI makes the existing image processing algorithm designed for planar image perform poorly. Recent approaches attempted to solve the issue by adopting icosahedral sphere representation, i.e., icosahedron. At the base level, the icosahedron consists of 12 vertices and 20 faces. The subdivision along the vertices and faces will cause the icosahedron to resemble a sphere closer. The higher subdivision, smaller spherical distortion will be [23]. The distance between vertices and faces will be similar to each other.
Zhao et al. [24] introduced a spherical-based feature extractor, spherical oriented FAST and Rotated BRIEF (SPHORB): the spherical version of the planar ORB [25]. The SPHORB detects and describes the features directly on the geodesic grid formed from the icosahedron subdivision. They demonstrated planar-equirectangular feature matching, showing higher accuracy of the SPHORB compared to the ORB. Unfortunately, the test was conducted only on two image sets and on the less-distorted equirectangular ROI.
Eder et al. [10] formulized TIs creation through gnomonic projection [26] on the icosahedron face. They introduced a visibility region on the TIs to assess the spherical image distortion. The higher the icosahedron level, the smaller the distortion on the visibility region will be, thus allowing existing planar image processing to work better. They used the visibility region to mask acceptable features on each TI for simple equirectangular feature extraction demonstration.

Tangent Images (TIs)
In this section, we would like to explain some theories underlying our TI-based planarequirectangular image stitching. We first explain how to generate the TI from the EI. We also describe how to map from one TI back to the EI or the other TI. At last, we would like to explain the procedure of TI-based feature extraction-matching.

Tangent Images (TIs) Creation
We intend to generate a TI T from an EI E. In other words, we seek to map the EI pixels P E ∈ E onto the TI P T ∈ T. First, we map the pixel P E to a unit sphere P S ∈ S: where P E = (x E , y E ) and P S = (λ S , φ S ). The symbol W E and H E denotes the EI E width and height respectively. We then project the P S to P T as shown in Figure 1. To generate the tangent image (TI) T, we find the correspondence of every TI pixel P T on the sphere: P S . The resulting P S is different for every possible TI T. Each TI T coincide with sphere S on C T . We set C T to be in the center of TI T.
Let the sphere-TI coincident point C T = (λ 0 , φ 1 ), the gnomonic projection [26] will project the P S = (λ S , φ S ) to the P T = (x T , y T ): On some cases, we want to find the corresponding P S = (λ S , φ S ) for every TI pixel P T = (x T , y T ). We use inverse gnomonic projection [26] for such cases: As seen from Figure 1, both the P S to P T mapping and the P T to P S mapping will depend on the C T position. As there are infinite choices of C T , we use icosahedron faces position to determine the C T . Precisely, we use icosahedron I whose size covers the whole unit sphere S at origin O. The icosahedron I has all his faces to tangent with the unit sphere S. Figure 2c illustrates the configuration. The configuration treats the unit sphere S as an inscribed sphere [27]. Such sphere has coincident points C T on the icosahedron faces F centroid. Icosahedron and its use for tangent images (TIs). (a,b) showed icosahedron with different subdivision levels b. The increase of b will also increase the number of faces N F . (c) showed an icosahedron I fully cover a unit sphere S. The sphere S coincides with the icosahedron face F at centroid C T . We create the TI T by extending the face F . The higher b will offer more TI position C T possibility and lower distortion in the visibility region Ω T = ∆P T0 P T1 P T2 . The standard planar image processing algorithm works better in the visibility region Ω T than outside.

Icosahedron Subdivision Level
Icosahedron can be characterized by using subdivision level b. The b level determines the number of vertices, edges, and faces in the icosahedron. The higher b level is obtained by performing the subdivision process on the icosahedron. The process splits current edges in the middle. The split will create a new vertex, which is then connected to form new edges and faces. One-time subdivision process will increase the icosahedron b level by one. The icosahedron whose b is one level higher has icosahedron faces four times more than the lower one. For instance, one time subdivision process on the icosahedron with b = 0 (i.e., base-level icosahedron) and N F = 20, will result in an icosahedron b = 1 whose N F = 80. Figure 2a,b shows examples of icosahedron with subdivision level b = 0 and 1. The higher b offers more icosahedron faces, increasing the possible TIs generated. The region Ω T is smaller (see Figure 2c) but providing less distortion than the lower b. We could expect better performance but a slower computational time when using the higher b.

Tangent Images (TIs) Coordinate Mapping
By using icosahedron face position, we generate several TIs from a single EI. We perform the image processing for each TI separately. Afterward, we map the result (i.e., the pixel coordinate) from one TI either back to the EI (i.e., TI-EI mapping) or the other TI (i.e., TI-TI mapping). For example, let T 0 and T 1 be two separate TIs generated from an EI E. We want to project a pixel coordinate from P 0 ∈ T 0 to get P 1 ∈ T 1 (for the TI-TI mapping) and from P 0 ∈ T 0 to P E ∈ E (for the TI-EI mapping). We first map the P 0 to the unit sphere S at S 0 ∈ S by using Equation (3) and inserting the T 0 coincident point C 0 to the equation. Then, for TI-TI mapping case, we get the P 1 ∈ T 1 from the S 0 by inserting the T 1 coincident point C 1 into Equation (2). For the TI-EI mapping case, we use the inverse of Equation (1) to get the P E ∈ E from the S 0 .

Tangent Image (TI)-Based Feature Extraction-Matching
Each TI T has a visible region Ω T whose distortion level is considered lower than the outside [10] (see Figure 2c). The visibility region Ω T is a region in the TI T that covers the icosahedron face F . We perform feature extraction-matching between TIs and the HPI using the Ω T region. The Ω T forms a triangle mask on the TI. We use the triangle mask as a feature extraction mask for the TI; we do not use the mask for the HPI feature extraction. We perform feature matching between the TI and the HPI using the extracted features. We get the resulting matched features inside the Ω T . We then reproject the matched features from the neighboring TIs to obtain matched features outside region Ω T (see Figure 3).  Figure 3 illustrates an example of feature extraction-matching between a main tangent image (MTI) T m and the HPI. We first perform feature extraction-matching between MTI and HPI. The process will result in matched features shown as a yellow dashed line in the figure. In addition, we also perform the feature extraction-matching between the HPI and N number of the neighbor tangent images (NTI) T mn , where n = 1, 2, ...N. The feature extraction-matching result in matched features shown in red, blue, and green dashed lines for T m1 , T m2 , and T m3 respectively. We finally map the features from each NTI coordinate onto MTI coordinate (shown in red, blue, and green straight line for T m1 , T m2 , and T m3 respectively) using the procedure explained in Section 3.3. We define the total matched features obtained (including the ones reprojected from NTI) in the MTI as the accumulative matched features (AMF). Figure 4 shows the three-steps pipeline for our planar-equirectangular image stitching process. First, we detect the ROI position on the EI. The step tries to find the ROI appearance depicted by the HPI input on the EI input. The output from this step will result in an ROI-TI whose appearance is similar to the ROI depicted by the HPI input. Second, we implement the feature extraction-matching between the ROI-TI and the HPI. The step results in ROI-TI and HPI feature matching pairs. Third, we carry out the image alignment between the ROI-TI and the HPI. The step aligns, then blends the HPI. It then processes the blended HPI into two separate ways: (1) directly stitch the blended HPI onto the ROI-TI, and (2) remaps the blended HPI back to equirectangular coordinate through TI-EI mapping procedure using ROI-TI coincident point C ROI (see Section 3.3) then stitches it with EI. The first way will result in a high angular resolution ROI tangent image (HROI-TI); the second way will result in a high angular resolution ROI EI (HROI-EI).

ROI Detection
We divide the ROI detection into two steps: (1) ROI-TI searching and (2) ROI-TI refinement. The first step aims to pick the closest TI to the ROI: the ROI-TI. The second step refines the ROI-TI position (i.e., coincident point C ROI ) closer to the true ROI.

ROI Tangent Image (ROI-TI) Searching
As we do not have any prior information about the ROI position on the EI, we need to perform the feature extraction-matching for the whole EI region. Therefore, we use icosahedron with subdivision level b = 0 (i.e., base-level icosahedron). The lowest b value will generate the smallest number of TIs to represent the whole EI. Thus, it allows for the fastest feature extraction-matching between the HPI and all generated TIs.
We use ORB [25] to extract features from both images quickly. Despite its fast performance, ORB has lower accuracy than other commonly used feature extractors [28] for feature matching tasks. Therefore, we use the grid-based motion statistics (GMS) [12] to improve the accuracy. The GMS is used after the Brute-Force (BF) feature matching process. It identifies correct matches (i.e., true matches) among the feature matching results. We use the identified true matches from the GMS as the final feature matching result.
Precisely, we first generate all possible TIs using the base-level icosahedron. As there are 20 icosahedron faces on the base level, we will obtain a total of 20 TIs. We iterate all 20 TIs T m and perform the feature reprojection described in Section 3.4 to gather the AMF for each T m and the HPI. We use the ORB feature extractor, BF feature matcher, and GMS during the process. After we finish iterating the TIs, we will obtain AMF for all 20 TIs.
We employ a coarse-to-fine approach to pick the ROI-TI among the 20 TIs. We first eliminate false TIs from the list of candidates. We define the false TIs as TIs whose false matches value are high. Figure 5 illustrate the process of detecting the false TIs. We expect that FTI has low AMF despite having a high feature number by itself. The tangent images T 1 in Figure 5 exhibit such characteristics. Although T 1 has high self feature matches, the corresponding NTIs (i.e., T 11 , T 12 , and T 13 ) have low feature matches with the HPI. The overall combination will result in low AMF |M 1 | for the T 1 (shown by red lines). On the contrary, T 0 has high AMF |M 0 |. In addition to having high self feature matches, the corresponding NTI (i.e., T 01 , T 02 , and T 03 ) overall also has high feature matches. The overall combination results in high AMF for T 0 (shown by green lines). . We expect that false TI has a low accumulated features number |M m | despite having a high feature number by itself. The TI T 1 exhibits such characteristics: 4 self-matched features and 4 for M 1 . On the contrary, T 0 has high |M 0 |: 4 self-matched features and 8 AMF. Thus, T 1 is not a false TI. We only show T 0 and T 1 and three NTI for both of them for clarity.
The false TI elimination will leave several TI candidates to pick as the ROI-TI. To select the ROI-TI, we first categorize the AMF into two: central AMF M c and boundary AMF M b . The central AMF means that the corresponding feature is located in the HPI central region. On the contrary, the boundary AMF has the corresponding feature outside the HPI central region. We pick the TI whose central AMF is the highest among the remaining ones and assign it as the ROI-TI. Figure 6 illustrates the process. H central region (i.e., the light blue rectangle box). For example, the (TI) T 2 has two features in total, but one is located outside the light blue rectangle. Thus, T 2 only has one central AMF in total. Meanwhile, T 0 has four central AMF in total: the highest among the others. We then select T 0 as the ROI-TI T ROI .

ROI Tangent Image (ROI-TI) Refinement
As illustrated in Figure 6, on the HPI side, the central AMF M c is located in the center (i.e., inside the light blue rectangle). On the contrary, the central AMF on the ROI-TI T ROI = T 0 side is closer to the boundary than the center. We expect the centralized M c position to make the ROI-TI coincident point (on the unit sphere) C ROI ∈ S closer to the true ROI. We use the central AMF centroid P CAMF ∈ T ROI as the C ROI ∈ S projection point. We readjust the P CAMF position iteratively.
We first calculate the centroid of current CFM P 0 CAMF ∈ T 0 ROI . We then convert the centroid pixel coordinate P 0 CAMF to the unit sphere S 0 CAMF ∈ S through Equation (3). We set the S 0 CAMF as the new coincident point C 0 ROI . If the central AMF centroid P 0 CAMF is not close enough to the ROI-TI central coordinate, we reproject all matched features through feature reprojection as described in Section 3.3 using C 0 ROI . We then calculate the central AMF centroid P 1 CAMF , S 1 CAMF and the central AMF centroid error from the center. The process repeats until the central AMF centroid P k CAMF error is close enough to the center. We use the last C k ROI to generate the refined ROI-TI. We use the refined ROI-TI as input for the next feature extraction-matching step.

ROI Feature Extraction-Matching
It is crucial to have the matched features distributed across the overlapping region on the image alignment step. In contrast to the standard planar-planar case, the overlapping region of planar-equirectangular image alignment covers the whole HPI. Therefore, we need to have the matched features to be distributed across the HPI. One way to improve the distribution is to increase the number of the matched features. We propose a new kNN-based GMS matcher to increase the numbers of matched features.
GMS has been capable of increasing the number of matched features [12]. It converts the quantity of matched features to identify the quality matches (i.e., the true matches). Precisely, GMS first divided the image into several grids. It then used the number of similar neighbors inside the image grids to identify the correct features. The similar neighbors were defined as features that: (1) located inside the same image grid, and (2) the corresponding features also located inside the same image grid. The higher the number of similar neighbors, the highest the possibility it contains the correct matches.
The original implementation of GMS used the results from the BF matcher as the input (i.e., BF-based GMS matcher). In this paper, we propose to use the kNN feature matcher as the input (i.e., kNN-based GMS matcher). Different from the BF-based GMS matcher that only used the closest feature correspondence (i.e., 1st correspondence) as the input, the proposed approach uses up to k feature correspondences as the input (i.e., 1st, 2nd, ..., kth correspondence). We assume there is a possibility that the correct correspondence is on the 2nd, 3rd, ..., kth correspondence. We perform the GMS feature matching iteratively according to the number of k. For every iteration, we switch the correspondence to the next closest (e.g., from 1st to 2nd-closest, or from 2nd to 3rd-closest). Figure 7 illustrates the first and second iteration of the proposed approach. After performing feature matching between the TI and the HPI beforehand, we initially use the 1st correspondence to perform the GMS matcher. Figure 7a shows the situation before the first iteration. Figure 7b shows the result of the first iteration, with a green line that denotes the GMS output. The red, blue, and orange features are similar neighbors: they are located in the same image grid either for the TI side or the HPI side. Thus, they are identified as correct matches. On the contrary, the yellow feature has correspondence (on the HPI side) on a different grid, thus not sharing similar neighbors. Therefore, we switch the yellow to use the 2nd correspondence for the next iteration. As the 2nd correspondence is also located on the same grid as the other correspondences, the yellow feature shared similar neighbors. We repeat the similar pattern for next 3rd, 4th, ... kth correspondences. Figure 7. Proposed kNN-based GMS feature matcher illustration. We performed kNN matcher between the tangent image (TI) and the high angular resolution planar image (HPI) beforehand. We then (a) use the 1st correspondence to perform the first iteration of the GMS matcher. (b) shows the result of the first iteration. The yellow feature has 1st correspondence far from the neighbor correspondences; GMS did not identify it as correct matches (i.e., no green line for the yellow feature). (c) showed the preparation for the second iteration of GMS matcher with the yellow feature is now switched to its 2nd correspondence. (d) As the 2nd correspondence is close to the neighbor correspondences, GMS would identify it as correct matches (i.e., green line for the yellow feature. To improve the feature matching accuracy, we also need to have a less-distorted visibility region Ω T on the TIs. Consequently, we use b value higher than the one used for the ROI detection. We set the b = 1 for this step. Although higher b will increase the number of TIs generated, we only need to gather the AMF on the ROI-TI. Therefore, we only perform the feature extraction-matching between the HPI with the ROI-TI and its corresponding NTIs (see Section 3.4). We use the matching result for the next image alignment step.

HVS-Based Image Alignment
Although the TI representation has made it possible to carry out existing planar-planar approaches, several differences are found when stitching the HPI onto the EI. In the planar-planar case, the stitching algorithms were designed for side-by-side images. The alignment constraint only applies for one side (i.e., the overlapping one) of each input image, allowing the other side free. In multiple input images, the transformation model of every input image can also be altered [29]. On the contrary, the planar-equirectangular image stitching attempts to put the HPI to their corresponding ROI on the EI. The EI acts as a template that the form should be fixed, leaving the HPI as the only input to be modified. As the EI also covers the whole HPI region, these situations can lead to alignment-distortion error trade-off due to parallax error. Figure 8 illustrate this situation. Figure 8a,b are the grid representation of the HPI and the ROI-TI respectively. As the parallax error will cause both views to differ slightly, the feature pairs (i.e., numbered circle on the figure) location on each image will also be different. The differences can cause misalignment in the stitching result. A common way to reduce the misalignment is by providing each vertex a separate model (i.e., local hybrid transformation approach). As shown in Figure 8c, the approaches can reduce the alignment error: the feature pairs (i.e., the same numbered circle) located close to each other on the stitching result. However, the features are located in the whole HPI region, and we can only alter the HPI vertices to minimize the alignment error. These situations make the stitching result central region (i.e., the yellow grid) distorted. The central region distortion can be a problem as our initial goal is to provide a high angular resolution version of the ROI-TI. Naturally, the focused object is located in the central of the ROI-TI. Thus, the distortion around the central region can be displeasing.  A simple way to preserve the central region is by only focusing on the alignment error minimization on the central and using the global transformation model for the whole vertices. This way, the distortion on the central (thus also the focused object) can be minimized. However, the misalignment on the boundary can be seen clearly (e.g., misalignment between feature no. 8,9,10,15,16).
To handle this alignment-distortion error trade-off, we proposed an image stitching inspired by the HVS. The HVS has a much greater resolution in the central visual field compared to the boundary [11]. It used the center vision (i.e., foveal vision) to resolve the focused object's detail resolution. On the other hand, the boundary vision (i.e., peripheral vision) exists mainly to provide low-resolution cues to guide the fovea movements [11]. Therefore, we intend to keep the central region of the stitching result as accurate: wellaligned and less distorted. Likewise, we intend to put the stitching result boundary region as well-aligned, thus not distracting human fovea when inspecting the other EI region.
Mathematically, to align the HPI H with the ROI-TI T ROI , we first divide the HPI H into a vertices grid v i k ∈ H. We then find the affine (i.e., global) model H a f f that map the v i k ∈ H to the v f k ∈ T ROI . In addition, we also find the APAP (i.e., local) H apap model that map the v i k ∈ H to the v p k ∈ T ROI . Finally, we linearly combine the two terms to obtain final where k > 0 is user-defined constant and R is distance of vertex v f k ∈ T ROI to the ROI-TI grid central. The global-local weight w will vary depending on the vertex's distance to the image center. The w is higher for vertices close to the center.
Every vertex v o k will have the same H a f f , serving to keep the objects or straight-line less distorted. We choose affine over homography to represent the global term as it has fewer degrees of freedom (DOG). The HPI and the ROI-TI should have a similar scene; thus, fewer DOG models would afford better regularization.
The w is high on the central region to emphasize the H a f f term, making the central region less distorted. On the contrary, w is small on the boundary region to emphasize the H apap term, making the boundary region better align with the ROI-TI boundary. We expected to have an alignment result whose distortion error is small on the central region, and alignment error is small on the boundary region as shown in Figure 8d).
To reduce the noticeable artifacts that occurred on the alignment result boundary region, we perform an image alpha blending on the HPI region beforehand. We then perform two separate processes: (1) directly stitch the aligned HPI onto the ROI-TI, and (2) remaps the aligned HPI back to the equirectangular coordinate through the TI-EI mapping procedure (see Section 3.3) then stitches it with EI. The first process will result in the HROI-TI, and the second process will result in the HROI-EI (see Figure 4).
We took the pictures by putting the We collected 14 equirectangular and 16 highresolution planar images. There are two EIs paired with two high-resolution planar images for each (i.e., ID no. 4-5 and ID no. 14-15), As a result, we have 16 equirectangular-planar image pairs in total, as shown in Figure 9. We used RICOH THETA V [30] to capture the EIs. To capture the HPIs, we used PTZ Logitecth Pro camera [31] (1 image) and Samsung Galaxy Note 5 [32] (15 images). For the THETA V-the Galaxy Note 5 case, we put the THETA V on top of our head to capture the picture, the Galaxy Note 5 in front of our face. For the THETA V-PTZ Logitech, we put the cameras on a camera rig. The images captured both seven indoor and outdoor scenes with ROI located on various EI regions. Each equirectangular and high-resolution planar image has 3840 × 1920 and 1920 × 1080 original image size respectively. We resize the EIs into 1280 × 640 and high-resolution planar image into 640 × 360 for image processing purposes.
We performed experiments for each contribution. We analyzed the results then suggest relevant solutions to address the problems found. We also present the computation times took for each planar-equirectangular image stitching pipeline step. In addition, we briefly explore the possibility of implementing the proposed pipeline in other 360°image formats. We hence organize this section in the following orders: 1.
Computation time. 6. Implementation in other 360°image format.

KNN-Based Gms Feature Matcher
We experimented with the ROI-TI and the HPI feature matching; we focused on providing higher inliers for the image alignment step. We measured the correct matches number (#inliers), accuracy, and entropy [28] to evaluate the performance. The #inliers denoted how many correct feature matches were detected; accuracy tells how accurate the feature matching was; entropy represents the correct feature match spread level on the ROI-TI. We also showed the inliers relative increase (in percentage) of the proposed KNN-based relative to the BF approach. We identified the correct and wrong feature pairs by manual intervention after the matching. Table 1 showed improved feature match count (#inliers) detected by using the proposed kNN-based GMS matcher. We could see an increase of inliers up to 29.0% on the # Inliers and Entropy (5.25 vs. 5.20) compared to the original BF-matcher implementation. Although the accuracy was higher for the BF-based GMS matcher (96.6% vs. 97.6%), the proposed kNN-based GMS matcher had already been high (owing to the scene similarity between the ROI-TI and the HPI). This result showed that our proposed kNN-based GMS feature matcher could improve the number and distribution (i.e., entropy) of the feature matches. Table 1. Feature matching result between the ROI tangent image (ROI-TI) and the high-resolution planar image (HPI). We compared the original BF-based GMS Matcher (Original) and the kNN-based GMS Matcher (Proposed). The proposed approach can increase number of inliers (# Inliers) up to 29.0% compared to the original. It also increases the the matched features distribution (Entropy), which is the main purpose of the proposed approach. There is a slight reduction in Accuracy but the value is still high. We analyzed deeper on the decrease of the accuracy. The decreased means there was also an increase in the number of wrong matches (i.e., outliers) during the k = 2, 3, ..., K iterations. The increase of the inliers and also the outliers made the accuracy pretty much stayed the same.

ID # Inliers
We believed the cause of the phenomenon was that similar neighbor feature members correspond to the wrong matches. The situation is illustrated in Figure 10. In other words, if the similar neighbors were already correct since the beginning, then the number of the inliers would increase. On the contrary, if the similar neighbors were wrong since the beginning, then the number of the outliers would increase. We concluded that the increase of inliers came from the increase of "correct" similar neighbors feature members; it did not come from the change of the "wrong" similar neighbors to the "correct" similar neighbors. The analysis was in line because entropy could only slightly improve (from 5.20 to 5.25): the new inliers come from the same image grid. The proposed approach was unable to distribute the inliers to the other grids. The yellow feature has 1st correspondence far from the neighbor correspondences; GMS did not identify it as the similar neighbors as the others (i.e., no red line for the yellow feature). (c) showed the preparation for the second iteration of GMS matcher with the yellow feature is now switched to its 2nd correspondence. (d) As the 2nd correspondence is close to the neighbor correspondences, GMS would identify it as similar neighbors (i.e., the red line for the yellow feature. However, as the matches in the neighbors are wrong, the yellow feature is also a wrong match. Regardless, the experiment showed that switching the correspondence to the next one could increase feature matches. We seek a better way to switch the correspondence (i.e., not merely switch it to the next one). For example, we could incorporate the scale information around similar neighbors to pick the subsequent correspondence.

ROI Detection
We determine the ground truth ROI position (i.e., coincident point C GT ) for each image set beforehand. We then measured the angle between the ROI-TI coincident point C ROI and the C GT . We used the angle as the error indicator as the coincident points are located on the unit sphere surface (see Section 3.1). Table 2 summarized the ROI detection experiment results. The "Initial" columns denoted C ROI error to the C GT after the ROI-TI searching step;the "Refined" columns denoted C ROI error to the C GT after the refinement step. On average, the ROI-TI searching step had the C ROI deviated around 17.3°from the C GT . The refinement could reduce the error closer up to 5.72°on average. It can be seen that the orientation error was below 15°on 15 of 16 image sets, with 14 of them below 10°. The result showed that the detection picked the correct closest face. Therefore, the follow-up refinement could reduce the orientation error further. Table 2. ROI detection experiment result. The Angular Error (°) denotes angle difference between the ROI tangent image (ROI-TI) coincident point C T and the ground truth coincident point C GT . The "Initial" represents the angular error between the C ROI and the C GT before the refinement process; the "Refined" represents the angular error between the C ROI and the C GT before the refinement process after the refinement process. The "Refined" error is the final ROI-TI C T error to C GT .

Tangent Image (TI)-Based Planar-Equirectangular Stitching Pipeline
We evaluated the proposed TI-based planar-equirectangular image stitching pipeline performance. We compared the proposed method to the existing planar-planar image stitching pipeline. In other words, we checked whether the intermediate transition from EI to the ROI-TI could reduce the undesired distortion on the stitching result compared to stitch the HPI onto the EI directly. It is important to note that we mainly focused on evaluating the image alignment step, leaving other stitching pipeline steps untouched. Both methods would use the same feature pairs and only differ on which coordinate was used to represent the EI features. The TI-based approach used the ROI-TI coordinate, while the planar-planar image alignment approach used the EI coordinate. We used the existing affine transformation model for both and performed the alpha blending afterward. Figure 11 showed the results. On image no. 1, we see that both approaches resulted in similar results either for the HROI-EI or for the HROI-TI. The results were as expected because the ROI was located about the center region of the EI. This area had low spherical image distortion; thus, the planar-planar image stitching algorithm could work well. Images no. 2 and no. 3 in Figure 11 show that the planar-planar image alignment caused significant distortion. The ROI of image no. 3 located further from the central compared to the no. 2. The stitching results were more distorted as the spherical distortion was higher further from the EI central region. The worst result was shown in image no. 4, where most of the ROI was located on the EI polar region. This region had the strongest distortion compared to the others. We could see how the output image of planar-planar image alignment could not even be recognized. On the contrary, images no. 2, 3, and 4 could still be recognized fine on the proposed TI-based one. The result showed that, as opposed to the less-distorted wide FOV image where the typical planar-planar image stitching can still work [21], the distortion on the EI was too strong to give a good image stitching result. Figure 11. Visual comparison between high angular resolution planar image (HPI)-equirectangular image (EI) stitching (PROPOSED) with planar-planar image stitching (ORIGINAL). The dashed red line depicts distorted region. Both methods resulted in a pair of high-resolution ROI equirectangular (HROI-EI) and tangent images (HROI-TI). When the ROI is located around the EI central region (image no. 1), both methods result in similar results. As the ROI goes further from the center (image no. 2, 3, and 4), the proposed method results in a much less distorted image than the original one. Zoom in the image to see the resolution difference between the ROI tangent image (ROI-TI) with stitching result.

HVS-Based Image Alignment
As a preprocessing step, we divided the HPI into grids of 33 × 19 vertices. We used an ellipse mask to perform alpha blending on the HPI. We then stitched the blended alpha to the ROI-TI, which resulted in HROI-TI. We focused on evaluating the HROI-TI during this experiment.
As the proposed approach is naturally a combination of affine and APAP, we also used the affine and APAP model as the comparator. Moreover, as the proposed HVSbased alignment approach attempted to address the alignment and distortion error on the stitching result, we used both alignment and distortion error as our evaluation metric. We showed both errors qualitatively through image inspection; quantitatively through alignment and distortion error calculation. Figure 12 showed the results of this experiment. We outlined our qualitative inspection as red, yellow, and green for "bad", "medium", and "good" results. The qualitative inspections are based on the distortion and misalignment that occurred on the outlined parts. The ground truths are obtained from the ROI-TI: we want to have a stitching result that provides the high-resolution version of the ROI-TI without having misalignment along the boundary and without distorting them (especially around the central region).

Qualitative Measurement
In image no. 1, we can see that affine transformation suffered misalignment along the boundary. On the contrary, APAP and the proposed (i.e., HVS-based) approach can align pretty well around the region. Image no. 2 showed more complicated situations. The ROI region roughly consisted of multiple planes. For example, object no. 4 (i.e., focused object) was located on a separate plane compared to objects no. 1 and no. 2. The affine transformation could align with an object no. 4 pretty well but suffer from misalignment on other objects. On the other hand, APAP could align objects no. 1, no. 2, no. 3, and no. 5, whose located around the boundary. However, object no. 4 was distorted. Our proposed method could align all the objects as they adopt affine factor on the central region and APAP factor on the boundary region. Image no. 3 had a similar result. The affine method could keep the object no. 4 (i.e., focused object) boundary shape straight. However, it misaligned the other objects. The APAP method could align the boundary better even though the scattered and uneven objects make it harder to align them perfectly. The proposed method gets a similar result as the APAP but can preserve the object no. 4 shapes better. Image no. 4 also behaves similarly as the no. 3. However, as the APAP and affine cannot align object no. 3, the proposed method cannot align object no. 3. The case shows the fail-case of our method.
One way to improve the alignment during the boundary is to provide a distributed alignment on the boundary. The lack of matched features around the ROI-TI can be seen on object no. 3 on Image no. 4. We plan to improve our kNN-based feature matching approach to add new the similar neighbors around the boundary. Furthermore, the failure to perfectly align the boundary objects can be essential or not. The use of widescreen or HMD to observe the stitching result can affect the importance of perfect boundary region alignment, as long as the crucial focused object can be viewed accurately in high-resolution. The premise is in line with our HVS-based assumption to generate the stitching result. In the future, we would like to perform a user study to test whether our image alignment algorithm could provide adequate stitching results.
Nevertheless, distortion found on the focused objects can be displeasing. The situation occurred due to the various size of the focused objects. For instance, object no. 4 on image no. 3 (i.e., the keyboard) is horizontally elongated, while object no. 4 on image no. 4 (i.e., the street sign) is vertically elongated. The problems can potentially be solved if we can adjust the global-local weight w accordingly. Currently, the global-local weight w (see Equation (4) is still decided manually by the user. Object recognition [33] or line detection [34] to find the focused object region can automatically adjust global-local weight w. We can group the vertices according to objects detected in the region. We then use the same local model term for vertices that belong to the same objects. Figure 12. Visual comparison of the resulting HROI-TI between proposed image alignment with affine and APAP. Red, yellow, and green outline respectively represent subjective "bad", "medium", and "good" evaluation of the alignment and the object shape quality. Overall, the AFFINE method could preserve the object shape in the central region but suffered from misalignment on the boundary. The APAP method could align the boundary better but suffered from distortion in the central region. The PROPOSED method could agree between the alignment error found in AFFINE and the distortion error found in APAP. We used the ROI-TI as the GROUND TRUTH. Zoom in the image to see the resolution difference between the ROI tangent image (ROI-TI) with the affine, the APAP, and the Proposed.

Quantitative Measurement
To calculate the alignment error e align , we first calculated the root-mean-squared error (RMSE) between AMF pairs. We used the inspected approaches (i.e., the HVS-based, the APAP, or the Affine) to transform the matched features from the HPI coordinate to the ROI-TI coordinate. We then found the RMSE between the transformed features with their correspondences in the ROI-TI. We set the RMSE as the alignment error e align [14].
To determine the distortion error e distort , we first calculated the 2D rigid-body similarity constraint error [20] of the stitching result vertices. As we only focused on central distortion error, we only calculated the vertices located nearby the image center. To decide it, we assigned an image grid whose size is 3 × 3 distributed evenly inside the image. We described the central vertices as the vertices whose closest grid is the center grid. We averaged these central vertices similarity errors and used them as the distortion error D err .
In addition, we also defined aggregated alignment-distortion e aggregate for APAP P, Affine F, and the proposed HVS-based V comparison: where A = {P, F, V}. The equation is a simple approach to aggregate the relative error of both e align and e distort . As our proposed image alignment method is trying to find the trade-off between e align and e distort , we expected that our proposed method would not often be shown the best on both criteria. Therefore, we intended to use this metric to show the overall performance of our proposed method in compensating both e align and e distort . We made following hypothesis: • e align on proposed method (V) is lower than affine (F) but higher than APAP (P): e align (F) > e align (V) > e align (P) • e distort on proposed method (V) is lower than APAP (P) but higher than affine (F): e distort (P) > e distort (V) > e distort (F) • e aggregate on proposed method (V) is lower than both APAP (P) and affine (F): e aggregate (P) > e aggregate (V) and e aggregate (F) > e aggregate (V) Table 3 summarized the result for this experiment. Overall, the results of the experiment followed our three hypotheses with some exceptions. Most of the time, the e aggregate could show that our proposed method is best (12 of 16 cases). However, this metric alone can give a different interpretation when compared with the other two hypotheses. For instance, ID no. 5 showed that the proposed approach followed both alignment and distortion error hypotheses yet gave FALSE results overall. On the contrary, this metric gave TRUE results on image ID no. 7 despite violating both alignment and distortion error hypotheses.
The ID no. 5, 7, 8, 10, and 15 showed that our method has the lowest e align compared to others. These are examples of ideal cases where our proposed method could get the lowest alignment error as it took the best factor on both central and boundary regions. On the central, the affine factor contributed more to our proposed method, while on the boundary, the APAP factor contributed more. These cases showed that the affine factor could align the central region well even though it has a high e align on the boundary. Alignment during the central-boundary transition also works well on our proposed method resulted in the lowest e align .
A similar reason can also hold on e distort case. As we measured only the central distortion, we got a similar distortion error between the proposed method and the affine model. The transition between central to boundary region could only cause slight differences, which could be positive (lower e distort ) or negative (higher e distort ) effect. The ID no. 6 and 7 showed that our method has the lowest e distort . On the contrary, image ID no. 14 showed our method has the highest e distort .
On the proposed image alignment, we relied on the affine model characteristics property to keep the shape in the ROI region (i.e., to preserve the ROI region similarity).
We did not explicitly add a similarity constraint to the equation [18]. Therefore, there are possibilities that the affine model had a higher distortion error compared to the APAP model, such as shown on image ID no. 5. Table 3. Results for the image alignment step between the ROI tangent image (ROI-TI) with the high-resolution planar image (HPI). We compared the image alignment step output of the Proposed (V) approach with existing Affine (F) and APAP (P) approaches. Most of results follow the alignment error hypotheses e align (F) > e align (V) > e align (P) (11 of 16); distortion error hypotheses e distort (P) > e distort (V) > e distort (F) (12 of 16); alignment-distortion error hypotheses e aggregate (P) > e aggregate (V), e aggregate (F) > e aggregate (V) (12 of 16). Overall, the proposed approach shows the lowest aggregated alignment-distortion error e aggregate (i.e., 0.61) compared to Affine (0.72) and APAP (0.67).
Alignment Error e align (pixel)  Table 4 showed that most of the processing occurred during the ROI detection step. The ROI detection step also took much longer than the ROI-TI feature extraction matching step. We look deeper at both steps and found that most computation time was used to extract the features from generated TIs. The ROI detection used b = 0 to generate 20 TIs. It extracted the features from all available TIs. On the contrary, the ROI feeature extractionmatching step used b = 1. Although generated 80 TIs in total, the step did not extract the features from all TIs. It only extracted features at ROI-TI and its neighbors, which is around 13 TIs in total. Furthermore, the visibility region Ω T was higher for b = 0, making the triangle mask for feature extraction wider. We believe that differentiating and adjusting some of the ORB feature extractor hyperparameters could solve the problem.  1 The % denotes the percentage of the computation time relative to the overall steps.

Computation Time
The image alignment step also took a similar computation time as the ROI detection. We look deeper into the image alignment and found that The computation time was mainly used to count the local H apap term. We think of two ways to solve the computation time problem. First, we would like to use a faster optimization solver library than Eigen to compute the models. Second, we would like to vary the HPI grid vertices number. On the preliminary test, we used 17 × 10 vertices and applied the similarity constraint explicitly. This setting is quite similar to applied on [18]. However, as the planar-equirectangular image stitching overlapping region is bigger than the typical planar-planar image stitching, we found that the vertices size was inadequate. We are still figuring out the best setting for the vertices.

Implementation in Other 360°Image Format
Although we specifically designed our system to handle the 360°image in equirectangular format, the TI can be extended to handle the other formats. A straightforward way is by changing the equirectangular to unit sphere mapping (see Equation (1)) according to each format model. For example, we can use the unified spherical model (USM) [35] to map the fisheye or catadioptric image onto the unit sphere [36,37].
Besides, the use of other image formats than EI can be beneficial for some cases. Precisely, the EI captured by the RICOH THETA V initially captures the image in the dual-fisheye format. The camera internally stitches the two fisheye images to produce the EI. The EI then suffered from the stitching artifacts found on the two fisheye images overlapping region. As the planar-equirectangular image stitching relies on the reliability of the EI image, the artifacts on the boundary region can reduce the performance (e.g., the feature extraction-matching or the image alignment performance) when stitching the images around the boundary.

Conclusions
We introduce a planar-equirectangular image stitching algorithm that places the HPI onto their corresponding ROI region on the EI. The proposed method includes: (1) TI-based planar-equirectangular stitching pipeline, (2) kNN-based GMS feature matching, (3) planar-equirectangular image ROI detection, and (4) HVS-based image alignment.
The TI generation in the stitching pipeline could mitigate the strong spherical image distortion on the EI. It is possible to employ existing planar-based feature extractionmatching and image alignment on the TIs. The proposed kNN-based GMS feature matching is implemented to increase the feature match inliers and entropy. The increase of inliers would make the matched features covered a wider overlapping region than typical planarplanar image stitching. The feature extraction-matching result is then used to conduct ROI detection and HVS-based image alignment. The ROI detection algorithm performs a coarse-to-fine searching process on the base-level TI: (1) find the ROI-TI then (2) refine the ROI-TI position vector incrementally. The image alignment is next conducted between the ROI-TI and the HPI. As the parallax error can instigate alignment-distortion error trade-off when stitching the planar image onto the equirectangular one, the proposed HVS-based alignment yields a less distorted central region while minimizing the boundary mismatch. The approach keeps the focused object on the central region (e.g., rigid structure or line) less distorted while placing the HPI on the EI ROI with slight misalignment.
We conducted a qualitative and quantitative experiment on the collected dataset and got the following results: (1) compared to original BF-based GMS feature matching, the kNN-based approach showed an increase of feature matches up to 29.0% and higher entropy value (5.25 vs. 5.20), (2) the ROI detection algorithm could find the ROI direction vector up to 5.72°error on average, (3) less distorted stitching result when using TI-based compared to existing planar-planar image stitching, and (4) the HVS-based image alignment gave lowest aggregated alignment-distortion error over affine and APAP method (0.61 vs. 0.72 vs. 0.67).
The experiments also showed several shortcomings of our methods. First, the kNNbased feature matching could not provide a noticeable improvement on the correct matches spreads level (i.e., entropy value). Second, the ROI detection algorithm could not improve the AMF entropy significantly as they only increase the amount of existing similar neighbor grid members: not adding new similar neighbor grids. A better way to switch correspondences could potentially increase entropy and accuracy. Third, the calculation of ROI orientation still consumed significant computational time compared to other steps. An adjustment on the ORB feature extractor hyperparameters could potentially solve the issue. Fourth, the best combination of global affine and local homography model still required user intervention as the focused object sizes can be varied. Additional object recognition or line detection step to automatically adjust the weight can be explored in the future. Another way is to add the similarity constraints to improve the focus object shape through a faster optimization library or a parallel processing algorithm.
We also would like to adopt more advanced seam-driven image stitching for future works [38,39]. We believe such a method will make a smoother stitching result boundary though we have to consider the increased computation time and reduced high-resolution region possibility. Furthermore, we would like to deploy our algorithm on the VR headset. We want to perform a user study to test whether our image alignment algorithm could provide adequate stitching results.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.