Multispectral Demosaicing Based on Iterative-Linear-Regression Model for Estimating Pseudo-Panchromatic Image

This paper proposes a method for demosaicing raw images captured by multispectral cameras. The proposed method estimates a pseudo-panchromatic image (PPI) via an iterative-linear-regression model and utilizes the estimated PPI for multispectral demosaicing. The PPI is estimated through horizontal and vertical guided filtering, with the subsampled multispectral-filter-array-(MSFA) image and the low-pass-filtered MSFA image as the guide image and filtering input, respectively. The number of iterations is automatically determined according to a predetermined criterion. Spectral differences between the estimated PPI and the MSFA image are calculated for each channel, and each spectral difference is interpolated using directional interpolation. The weights are calculated from the estimated PPI, and each interpolated spectral difference is combined using the weighted sum. The experimental results indicate that the proposed method outperforms state-of-the-art methods with regard to spatial and spectral fidelity for both synthetic and real-world images.


Introduction
Commercial cameras, which capture traditional red-green-blue-(RGB) images, typically record only three colors in the visible band, and they are commonly used for general landscapes and portraits, making them one of the most popular camera types. However, with the development of various industries, there is a growing need to record or identify objects that are not easily discernible in RGB images. To meet this demand, multispectral cameras have been developed. Multispectral imaging has become an increasingly important tool in various fields, such as remote sensing [1], agriculture [2], and biomedical imaging [3]. These imaging systems capture information from multiple spectral bands, thereby providing valuable information that is not visible in traditional grayscale or RGB imaging.
There are various methods for acquiring multispectral images, including rotating structures of different optical filters for each band. Although this approach can capture multispectral images with full resolution for each channel, it is unsuitable for capturing moving subjects. To address this issue, cameras employing the one-snapshot method are used. These cameras acquire mosaic images when a photograph is captured. The resulting mosaic image appears similar to the Bayer pattern [4] used in commercial RGB cameras, as shown in Figure 1a. The mosaic patterns of multispectral-filter arrays (MSFAs) vary depending on the manufacturer. The most commonly used pattern is a 4 × 4 array, which can be divided into two cases: one where there is a dominant channel, e.g., green, as shown in Figure 1b [5], and another where all the channels have the same probability of appearance of 1/16, as shown in Figure 1c [6].
Figure 1. (a) Bayer pattern [4]; (b) MSFA with one dominant band [5]; (c) MSFA with no dominant band in the IMEC camera [6]. The numbers are the band numbers.
A mosaic image is a two-dimensional image in which not every channel is measured at every pixel; therefore, demosaicing is required to estimate the unmeasured pixels. Since the introduction of the original snapshot camera for Bayer patterns, several demosaicing methods have been developed. There are three main traditional approaches: using the color-ratio domain, which assumes a constant ratio between colors in the local region [7]; using the color-difference domain, which assumes a constant difference between colors in the local region [8]; and using the residual domain [9] with guided filtering [10]. These methods first interpolate the dominant green channel and then interpolate the remaining R and B channels using the interpolated G channel. Each channel is interpolated to restore high frequencies through edge estimation. Recent advances in demosaicing have led to the emergence of techniques based on deep learning, in addition to the aforementioned traditional methods. These methods typically use convolutional neural networks (CNNs) to train a network to generate an original image from a raw-image input. Gharbi et al. [11] proposed a joint-demosaicing-and-denoising method using residual networks.
Compared to Bayer filters, the MSFA is a relatively new technology. Therefore, demosaicing methods for MSFAs have been developed by modifying and advancing Bayer-pattern-based demosaicing methods. The simplest method of demosaicing MSFAs is to use weighted bilinear filters for each channel. However, this approach has the disadvantage of blurring images. To overcome this limitation, a method using the spectral-difference domain, which is similar to the color-difference domain in Bayer-pattern-based methods, was developed [12]. Additionally, the binary-tree-based-edge-sensing-(BTES) method [13] was developed, which first interpolates the centers of the unoccupied pixels. The multispectral-local-directional-interpolation-(MLDI) method [14] was also developed, which combines spectral-difference domains with BTES. However, the MLDI method was optimized for its proposed MSFA rather than a general MSFA, because the order of adjacent spectral bands must be offset to match the BTES order. Moreover, a method was developed for interpolating multispectral channels by creating a pseudo-panchromatic image (PPI) as a guide [15]. This method is suitable for any non-redundant 4 × 4 MSFA. In addition, deep-learning-based multispectral-demosaicing methods have been developed [16][17][18][19], which typically produce better results than traditional methods. However, deep-learning-based multispectral demosaicing has far smaller datasets available than deep-learning-based Bayer-filter demosaicing. Consequently, the data are not sufficient to train a complex network, and if the filter arrangement changes, the network must be retrained accordingly.
In this paper, we propose a method that solves the problem of high frequencies not being accurately estimated when estimating the PPI. Additionally, conventional studies used only directional information about raw images in demosaicing, which is insufficient for estimating the PPI. Because the PPI is an image representing all channels, directional information about it must be included in the demosaicing process. Our approach builds on the following observations: (1) prior research [15] has demonstrated the usefulness of the PPI for multispectral demosaicing; (2) the PPI can be estimated from the high frequencies of the MSFA image; (3) guided filtering restores the high-frequency components of a guide image while preserving details. To this end, we propose a method that uses guided filtering to estimate the PPI and then restores high frequencies for each channel by identifying edges according to the estimated PPI. Our approach is optimized for 4 × 4 MSFA patterns without a dominant band but can be adapted to other patterns.
The main contributions of this study are as follows:
1. We propose a novel method for iterative-guided-filtering-pseudo-panchromatic-image-(IGFPPI) estimation that performs iterative guided filtering in both the horizontal and vertical directions and combines the results.
2. The proposed guided-filtering technique is iterative and automatically determines the stopping criterion for each image.
3. We use the estimated IGFPPI to determine the weights of each channel, and we obtain the interpolated spectral-difference domain through a weighted sum of the differences between the IGFPPI and the spectral channels. Finally, we add the IGFPPI to obtain the demosaicing result, following the demosaicing order of the BTES method.
We conducted extensive experiments to compare the quantitative and qualitative results of the proposed method with those of previously reported methods, using the peak-signal-to-noise ratio (PSNR), the structural-similarity-index measure (SSIM) [20], and the spectral-angle-mapper-(SAM) [21] metrics. The results indicated that the proposed method outperformed both traditional and deep-learning methods. In addition to using the synthesized data, we conducted experiments on actual images captured by IMEC cameras. The demosaicing results for these real-world images suggest that the proposed method performs well in practical situations.
The remainder of this paper is organized as follows: Section 2 presents related work. Section 3 describes the proposed method. Section 4 presents the experimental procedures and results. Section 5 presents our conclusions.

Related Work
The proposed algorithm is designed to be effective for multispectral cameras that acquire images in multispectral bands. This section presents an observational model that accurately describes the image-acquisition process using multispectral cameras. Our algorithm builds on the principles of guided image filtering and PPI estimation, which allow accurate demosaicing of multispectral images. Herein, we comprehensively review these methods.

Observation Model
The observation model of a multispectral camera can be expressed as follows:

I^c_k = Q( ∫_a^b E(λ) R_k(λ) T^c(λ) dλ ),  (1)

where I^c_k represents the acquired pixel of channel c at pixel k; Q(·) is the quantization function; a and b represent, respectively, the spectral minimum and maximum ranges of the multispectral camera; E(λ) represents the relative spectral power distribution of the light source; R_k(λ) is the spectral-reflectance factor of a subject at pixel k; and T^c(λ) represents the transmittance of the MSFA channel c.
From the observation model, a raw image of a multispectral camera with N channels is defined as follows:

I^MSFA_k = Σ_{c=1}^{N} M^c_k · I^c_k,  (2)

where I^MSFA_k represents the raw image, I^c_k represents the full resolution of channel c at pixel k, and M^c represents the binary mask, a special type of image comprising only 0 s and 1 s that is used to represent the MSFA channel c.
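As a concrete illustration of (2), the sketch below builds the binary masks M^c for a periodic MSFA and samples a full-resolution multispectral cube into a single raw mosaic. This is a minimal NumPy sketch rather than the authors' code; the function names `msfa_masks` and `mosaic` are our own.

```python
import numpy as np

def msfa_masks(width, height, pattern):
    """Binary masks M^c for a periodic MSFA given as a 2-D array of band indices."""
    ph, pw = pattern.shape
    rows = np.arange(height)[:, None] % ph
    cols = np.arange(width)[None, :] % pw
    bands = pattern[rows, cols]                 # band index at every pixel
    n_bands = pattern.max() + 1
    return np.stack([(bands == c).astype(float) for c in range(n_bands)])

def mosaic(cube, pattern):
    """Sample a full-resolution cube of shape (C, H, W) into one raw MSFA image, Eq. (2)."""
    masks = msfa_masks(cube.shape[2], cube.shape[1], pattern)
    return (masks * cube).sum(axis=0)

# 4 x 4 MSFA with no dominant band: each of the 16 bands appears once per period
pattern = np.arange(16).reshape(4, 4)
cube = np.random.rand(16, 8, 8)
raw = mosaic(cube, pattern)
```

Because the masks partition the pixel grid, exactly one channel contributes to each raw pixel, matching the non-redundant MSFA of Figure 1c.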

Pseudo-Panchromatic Image
Mihoubi et al. proposed PPI estimation as a guide for multispectral demosaicing [15]. The PPI I^M_k at pixel k is defined as the average image over all the channels of a multispectral image, as follows:

I^M_k = (1/N) Σ_{c=1}^{N} I^c_k.  (3)

They developed a two-step algorithm for estimating the PPI. The first step is to create an initial PPI of the low-frequency components from the raw image. The initial PPI Ī^M is estimated using a simple Gaussian filter M, as follows:

Ī^M = I^MSFA * M,  (4)

where I^MSFA represents a raw image and * denotes convolution. Second, a high-frequency component is added to the initial PPI. The high-frequency component is calculated under the assumption that the local difference of the initial PPI is similar to that of the raw image, where the local difference is the difference between the value of an arbitrary pixel k and the weighted average value of its eight nearest neighbors q ∈ Ñ_k with the same channel. The final PPI Î^M_k at pixel k is defined as follows:

Î^M_k = (1/Σ_q γ_q) Σ_{q∈Ñ_k} γ_q Ī^M_q + I^MSFA_k − (1/Σ_q γ_q) Σ_{q∈Ñ_k} γ_q I^MSFA_q,  (5)

where γ_q is the weight calculated from the reciprocal of the difference between k and q in the raw image I^MSFA.
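The PPI definition (3) and the low-pass first step (4) can be sketched as follows. This minimal NumPy sketch is not the authors' implementation; the 5 × 5 binomial kernel standing in for the Gaussian filter M and the zero-padded borders are assumptions for illustration.

```python
import numpy as np

def ppi_reference(cube):
    """Ground-truth PPI, Eq. (3): the per-pixel average over all N channels."""
    return cube.mean(axis=0)

def initial_ppi(raw, kernel):
    """Initial low-frequency PPI, Eq. (4): convolve the raw MSFA image with a
    low-pass kernel M (zero-padded borders for brevity)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(raw, ((ph, ph), (pw, pw)))
    out = np.zeros_like(raw, dtype=float)
    H, W = raw.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A 5 x 5 binomial kernel is one plausible low-pass choice for M (assumption).
k1 = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
kernel = np.outer(k1, k1) / 256.0
raw = np.random.rand(8, 8)
ppi0 = initial_ppi(raw, kernel)
```

Since the kernel sums to one, a locally constant raw image is preserved away from the borders, which is the behavior expected of the low-frequency estimate.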

Guided Filtering
Guided filtering is a powerful and versatile technique for image processing that has numerous applications, including denoising, deblurring, edge-preserving smoothing, and tone mapping. It is particularly useful for images with textures, where traditional filters may not preserve important features. The guided filter has a linear form and can be expressed mathematically as

q_l = a_k I_l + b_k, ∀ l ∈ ω_k,  (6)

where I_l represents the guidance image, q_l represents the filtered image, a_k and b_k are the filter coefficients, and l is the pixel coordinate in a local window ω_k centered at pixel k. For determining the filter coefficients, the cost function within the window is given as follows:

E(a_k, b_k) = Σ_{l∈ω_k} ((a_k I_l + b_k − p_l)^2 + ε a_k^2),  (7)

and its solution is given as

a_k = ((1/N_ω) Σ_{l∈ω_k} I_l p_l − µ_k p̄_k) / (σ^2_k + ε),
b_k = p̄_k − a_k µ_k,  (8)

where µ_k, σ^2_k, and p̄_k represent the mean and variance of the guidance image I and the mean of the filtering input p in the local window ω_k, respectively; ε is a regularization parameter; and N_ω represents the number of pixels in ω_k.
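A plain, unoptimized implementation of (6)-(8) might look like the following NumPy sketch. The square window radius, the naive box filter, and the value of ε are our assumptions for illustration, not the paper's settings (the proposed method uses rectangular h × v windows).

```python
import numpy as np

def box(img, r):
    """Mean over a (2r+1) x (2r+1) window, clamped at the image borders."""
    H, W = img.shape
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - r), min(H, i + r + 1)
            j0, j1 = max(0, j - r), min(W, j + r + 1)
            out[i, j] = img[i0:i1, j0:j1].mean()
    return out

def guided_filter(I, p, r=2, eps=1e-4):
    """Guided filter, Eqs. (6)-(8): q = a*I + b with per-window coefficients.
    I is the guidance image, p the filtering input."""
    mu = box(I, r)                     # mean of guidance, mu_k
    pbar = box(p, r)                   # mean of input, p̄_k
    corr = box(I * p, r)               # mean of I*p
    var = box(I * I, r) - mu * mu      # variance of guidance, sigma_k^2
    a = (corr - mu * pbar) / (var + eps)
    b = pbar - a * mu
    # average the coefficients over each window before forming the output
    return box(a, r) * I + box(b, r)
```

With a small ε, filtering an image with itself as a guide is close to the identity, while a constant input stays constant, two quick sanity checks on the coefficient solution.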

Proposed Algorithm
In this section, we describe the two main components of the proposed method. First, we explain the process of estimating the PPI from the raw image I^MSFA. Then, we describe the process of performing directional multispectral demosaicing using the estimated PPI.

Iterative Guided Filtering for Estimating PPI
The proposed IGFPPI framework comprises three steps, as shown in Figure 2. First, a low-pass filter is applied to the MSFA to generate an initial image Ī. Then, subsampling is performed, followed by iterative guided filtering. Finally, upsampling is performed to obtain the estimated PPI image. The initial PPI, which is denoted as Ī, includes the low-frequency components of all channels that contribute to the final PPI image. Equation (4) is used to obtain the initial estimate. Next, we perform subsampling on both Ī and I^MSFA for each channel as a preprocessing step to restore the high frequencies of the final PPI. The subsampled versions of the raw image I^MSFA and the initial PPI Ī in channel c are denoted as İ^c and Ī̇^c, respectively. The sizes of I^MSFA and Ī are (W × H), and the sizes of İ^c and Ī̇^c are (W/4 × H/4), where W and H represent the width and height of the image, respectively. We use the subsampled İ^c as the guidance image and the subsampled Ī̇^c as the filtering input for the iterative guided filtering. Iterative guided filtering is performed separately in the horizontal and vertical directions. If the window size is increased to estimate more precise high frequencies, the estimate becomes closer to the MSFA image, which is the guide image. To prevent this, we calculate the horizontal and vertical directions separately and combine the two results to obtain the estimation. The window size used to calculate the linear coefficients is denoted as (h × v); horizontal guided filtering is used when h > v, and vertical guided filtering is used when v > h.
In the first iteration, t = 0, guided filtering is performed in the horizontal and vertical directions, using the subsampled İ^c as the guidance image and the subsampled initial PPI Ī̇^c as the filtering input:

İ^{c,h}_0(i, j) = a^{c,h}_0(i, j) İ^c(i, j) + b^{c,h}_0(i, j),
İ^{c,v}_0(i, j) = a^{c,v}_0(i, j) İ^c(i, j) + b^{c,v}_0(i, j),

where the pixel coordinates are represented by (i, j). For t ≥ 1, the iterative guided filtering is repeated using the following expressions:

İ^{c,h}_t(i, j) = a^{c,h}_t(i, j) İ^c(i, j) + b^{c,h}_t(i, j),
İ^{c,v}_t(i, j) = a^{c,v}_t(i, j) İ^c(i, j) + b^{c,v}_t(i, j),

where (a^{c,h}_t, b^{c,h}_t) and (a^{c,v}_t, b^{c,v}_t) are the linear coefficients in the horizontal and vertical directions, respectively. The filtering inputs for iteration t + 1 are the outputs of iteration t, i.e., İ^{c,h}_t and İ^{c,v}_t, respectively. Next, we describe the criterion block in Figure 2, which determines when the loop stops. The iterator has two stopping conditions: (1) each pixel stops changing, and (2) the entire image stops changing. The loop stops when both conditions are satisfied.
The condition for each pixel to stop changing is determined by the following expressions:

d^{c,h}_t(i, j) = |İ^{c,h}_t(i, j) − İ^{c,h}_{t−1}(i, j)|,
d^{c,v}_t(i, j) = |İ^{c,v}_t(i, j) − İ^{c,v}_{t−1}(i, j)|,

where d^{c,h}_t(i, j) represents the absolute difference between the results of the horizontal loops of the previous and current steps, and d^{c,v}_t(i, j) represents the absolute difference between the results of the vertical loops of the previous and current steps. These two values indicate changes in the image; as they converge to zero, there is little change in the pixels at position (i, j). Additionally,

δ^{c,h}_t(i, j) = |İ^{c,h}_t(i, j + 1) − İ^{c,h}_t(i, j − 1)|,
δ^{c,v}_t(i, j) = |İ^{c,v}_t(i + 1, j) − İ^{c,v}_t(i − 1, j)|,

where δ^{c,h}_t(i, j) represents the horizontal change in the result of the current step's horizontal iteration; a value close to zero indicates that there is no change in the horizontal direction. Similarly, δ^{c,v}_t(i, j) represents the vertical change in the result of the current step's vertical iteration. The criterion for pixel change is determined by multiplying these two expressions, as follows:

D^{c,h}_t(i, j) = d^{c,h}_t(i, j) · δ^{c,h}_t(i, j),
D^{c,v}_t(i, j) = d^{c,v}_t(i, j) · δ^{c,v}_t(i, j).

The pixel change stops when D^{c,h}_t(i, j) < ϵ_pixel for the horizontal direction and when D^{c,v}_t(i, j) < ϵ_pixel for the vertical direction, where ϵ_pixel represents a predefined threshold. The global condition under which the loop stops is calculated using the following expressions:

MAD^{c,h}(t) = (1/(Ẇ Ḣ)) Σ_{i,j} d^{c,h}_t(i, j),
MAD^{c,v}(t) = (1/(Ẇ Ḣ)) Σ_{i,j} d^{c,v}_t(i, j),

where Ẇ and Ḣ represent the width and height of the subsampled image, respectively. The mean absolute difference (MAD) measures the extent to which the entire image changes and is calculated as the average absolute value of the difference between the results of the previous and current steps. Ye et al. determined the convergence based solely on the MAD value [22]. However, our focus is the convergence of the difference between the current and previous MADs to zero, rather than the value of the MAD approaching zero. This is because the MAD may not converge to zero, owing to the conditions that prevent each pixel from changing. The difference in MAD between the current and previous steps is calculated as follows:

∆^{c,h}_MAD(t) = |MAD^{c,h}(t) − MAD^{c,h}(t − 1)|,
∆^{c,v}_MAD(t) = |MAD^{c,v}(t) − MAD^{c,v}(t − 1)|.

The final number of iterations, T, is the smallest value of t that satisfies both ∆^{c,h}_MAD(t) < ϵ_global and ∆^{c,v}_MAD(t) < ϵ_global.
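The two stopping conditions can be sketched as below for a single direction. The exact forms of d_t, δ_t, and D_t here, and the simplification to the horizontal direction only, are our assumptions; the threshold values are the ones used in the experiments.

```python
import numpy as np

EPS_PIXEL = 1e-4    # threshold on per-pixel change (value used in the paper)
EPS_GLOBAL = 1e-3   # threshold on the change of the image-wide MAD

def pixel_change(prev, curr):
    """d_t: absolute difference between consecutive iterations."""
    return np.abs(curr - prev)

def horizontal_gradient(img):
    """delta_t: absolute horizontal change of the current result (assumed form)."""
    g = np.zeros_like(img, dtype=float)
    g[:, 1:] = np.abs(img[:, 1:] - img[:, :-1])
    return g

def should_stop(prev, curr, prev_mad):
    """Check both criteria: (1) every pixel's D_t = d_t * delta_t is below
    EPS_PIXEL, and (2) the MAD between iterations has itself stopped changing."""
    d = pixel_change(prev, curr)
    D = d * horizontal_gradient(curr)
    mad = d.mean()
    pixels_converged = bool((D < EPS_PIXEL).all())
    global_converged = prev_mad is not None and abs(mad - prev_mad) < EPS_GLOBAL
    return pixels_converged and global_converged, mad
```

In the full method, this check would be evaluated per channel and per direction inside the guided-filtering loop, with the returned MAD fed back as `prev_mad` for the next iteration.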
With the number of iterations T determined, the results obtained by guided filtering in the horizontal and vertical directions are weighted and summed as follows:

İ^c_T(i, j) = (w^{c,h}(i, j) İ^{c,h}_T(i, j) + w^{c,v}(i, j) İ^{c,v}_T(i, j)) / (w^{c,h}(i, j) + w^{c,v}(i, j)),

where w^{c,h}(i, j) and w^{c,v}(i, j) are the weights in the horizontal and vertical directions, respectively, and are defined as follows:

w^{c,h}(i, j) = 1 / (D^{c,h}_T(i, j) + ϵ),
w^{c,v}(i, j) = 1 / (D^{c,v}_T(i, j) + ϵ),

where a small criterion value contributes to a large weight. The final step involves guided upsampling of the subsampled channel İ^c_T to generate the final PPI image. To achieve this, we set the window size for the linear coefficients in guided filtering to h = v, and we then upsample the image for each channel to the position of the raw image. The indices (m, n) ∈ [0, 1, 2, 3]^2 determine the grid for upsampling and depend on the subsampled channel c; they represent the position of a pixel within a 4 × 4 block. For example, if c = 1 in Figure 1c, (m, n) is (3, 3).
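The directional fusion and the grid placement that precedes the final guided upsampling might be sketched as follows; the reciprocal-criterion weight form and the small ε are assumptions, and the function names are our own.

```python
import numpy as np

def combine_directions(est_h, est_v, crit_h, crit_v, eps=1e-8):
    """Fuse horizontal and vertical estimates with reciprocal-criterion
    weights: a small directional criterion -> a large weight (assumed form)."""
    w_h = 1.0 / (crit_h + eps)
    w_v = 1.0 / (crit_v + eps)
    return (w_h * est_h + w_v * est_v) / (w_h + w_v)

def scatter_to_grid(sub, m, n, scale=4):
    """Place the subsampled channel at offset (m, n) of each 4 x 4 block,
    i.e., back onto the raw-image grid, before guided upsampling."""
    H, W = sub.shape
    full = np.zeros((H * scale, W * scale))
    full[m::scale, n::scale] = sub
    return full
```

When the two criteria are equal, the fusion reduces to a plain average; when one direction's criterion is much smaller, its estimate dominates, which is the intended edge-adaptive behavior.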

Directional Multispectral Demosaicing
In this section, we present the proposed multispectral-demosaicing method that utilizes the outcomes of IGFPPI. The overall framework of the method is illustrated in Figure 3. We utilize the disparities between the estimated PPI and each channel to generate the spectral-difference domain. We then perform directional interpolation of the unoccupied pixels in the spectral-difference domain. Finally, we add the interpolated image and the estimated PPI to obtain the final multispectral-demosaicing result. In Figure 3, the masking block refers to setting the raw image I^MSFA to zero except at the positions of the corresponding channel c. The proposed directional-interpolation technique utilizes the interpolation order of the BTES method and weight calculation using the PPI. The BTES method first interpolates the center pixel in each step, resulting in four steps for a 4 × 4 MSFA, as shown in Figure 3. Here, the WC&WS block represents the weight calculation (WC) and the weighted sum (WS). Let ∆^c_s1(i, j) represent the center pixel of channel c requiring interpolation in the first step. The weight and weighted-sum expressions for step 1 are given by (18) and (19), where s0 refers to step 0, s1 refers to step 1, and γ represents the weights.
The equations for step 2 are (20) and (21). Steps 3 and 4 are performed in the same manner as steps 1 and 2. Finally, the multispectral image is obtained by adding the estimated PPI to the spectral-difference image obtained through the BTES order and directional interpolation, as follows:

Î^c(i, j) = Î^M(i, j) + ∆^c(i, j).
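The step-1 weighted sum in the spectral-difference domain can be sketched as follows, assuming reciprocal-PPI-difference weights over the four diagonal neighbors; the exact weight form γ of (18)-(19) is not reproduced here, and the function name is our own.

```python
import numpy as np

def interp_center(diff, ppi, i, j, eps=1e-8):
    """Step-1 sketch: fill the center pixel (i, j) of a spectral-difference
    plane from its four diagonal neighbors, weighting each neighbor by the
    reciprocal PPI difference along that diagonal (assumed weight form)."""
    offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    num = den = 0.0
    for di, dj in offsets:
        w = 1.0 / (abs(ppi[i + di, j + dj] - ppi[i, j]) + eps)
        num += w * diff[i + di, j + dj]
        den += w
    return num / den
```

Neighbors whose PPI value is close to the center's receive large weights, so edges detected in the PPI steer the interpolation, which is the role the WC&WS block plays in Figure 3.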

Metrics
To evaluate the quality of the demosaicing results, we used three quantitative metrics: the PSNR, the SSIM, and the SAM.
The PSNR, which measures the logarithm of the ratio between the maximum signal value and the average difference between the reference image and the estimated image, was calculated as follows:

PSNR = 10 log_10(MAX^2 / MSE),
MSE = (1/(W H)) Σ_{i=1}^{W} Σ_{j=1}^{H} (x(i, j) − x̂(i, j))^2,

where MAX represents the maximum value of the image, MSE represents the mean squared error between the reference image x and the estimated image x̂, and W and H represent the width and height of the image, respectively.
The SSIM was used to evaluate the similarity between the reference image x and the estimated image x̂. It was calculated using the following equation:

SSIM(x, x̂) = ((2 µ_x µ_x̂ + c_1)(2 σ_xx̂ + c_2)) / ((µ_x^2 + µ_x̂^2 + c_1)(σ_x^2 + σ_x̂^2 + c_2)),

where µ_x and µ_x̂ represent the means of the image vectors x and x̂, respectively. The standard deviations of x and x̂ are represented by σ_x and σ_x̂, respectively. The covariance between x and x̂ is represented by σ_xx̂, and c_1 and c_2 are constants used to prevent the denominator from approaching zero.
The SAM is commonly used to evaluate multispectral images. It represents the average of the angles formed by the reference and estimated image vectors and is calculated using the following formula:

SAM = (1/(W H)) Σ_k arccos( (x_k · x̂_k) / (‖x_k‖ ‖x̂_k‖) ),

where x_k and x̂_k are the spectral vectors of the reference and estimated images at pixel k. For the PSNR and SSIM, larger values indicate better performance, and for the SAM, smaller values indicate better performance.
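The PSNR and SAM can be computed as in the NumPy sketch below (ours, not the evaluation code used in the experiments; SSIM is omitted for brevity since a library implementation would typically be used in practice).

```python
import numpy as np

def psnr(ref, est, max_val=1.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def sam(ref, est, eps=1e-12):
    """Mean spectral angle (in radians) between the per-pixel spectra of two
    multispectral cubes of shape (C, H, W)."""
    r = ref.reshape(ref.shape[0], -1)
    e = est.reshape(est.shape[0], -1)
    cos = (r * e).sum(0) / (np.linalg.norm(r, axis=0) * np.linalg.norm(e, axis=0) + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean())
```

Note that the SAM is invariant to a per-pixel scaling of the spectrum (the angle between x_k and α·x_k is zero), which is why it complements the intensity-sensitive PSNR.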

Dataset and Implementation Detail
In our experiments, we compared the proposed method to previously reported methods using the TokyoTech-31band (TT31) [23] and TokyoTech-59band (TT59) [24] datasets. The TT31 dataset included 35 multispectral images, each containing 31 spectral bands ranging from 420 to 720 nm. The TT59 dataset included 16 multispectral images with 59 spectral bands ranging from 420 to 1000 nm, with the bands spaced 10 nm apart. We excluded the popular CAVE [25] dataset from our experiments because it was used to train the conventional deep-learning methods. To generate the synthetic dataset, we used IMEC's "snapshot mosaic" multispectral camera sensor, i.e., XIMEA's xiSpec [26]. We utilized the publicly available normalized transmittance of this camera [15]; its MSFA is a 4 × 4 array whose central spectral bands λ_c are 469, 480, 489, 499, 513, 524, 537, 551, 552, 566, 580, 590, 602, 613, 621, and 633 nm. The bands were arranged in ascending order, as shown in Figure 1c. We used the normalized transmittance and the D65 illuminant to obtain multispectral images for each band, in accordance with (1). The obtained images were then sampled using (2) to generate the raw MSFA images.
The pixel values of the synthesis datasets ranged from 0 to 1. The window size was set to h = 7 and v = 3 for horizontal guided filtering, h = 3 and v = 7 for vertical guided filtering, and h = 5 and v = 5 for the final guided upsampling. We set ϵ_pixel, which determined the change in each pixel, to 10^−4, and ϵ_global, which determined the change in the entire image, to 10^−3.

Results for Synthesis Dataset and Real-World Image
For a quantitative evaluation of the proposed method, we compared it to six other methods. The first conventional method (CM1) was the BTES method [13], which prioritized the interpolation of the empty center pixel in the spatial domain of each channel. Interpolation was performed using a weighted sum, and the weights were calculated using the reciprocal of the difference between neighboring pixels. The second conventional method (CM2) was a spectral-difference-(SD) method that employed weighted bilinear filtering in the spectral-difference domain [27]. The third conventional method (CM3) was an iterative-spectral-difference-(ItSD) method that used weighted bilinear filtering in the spectral-difference domain [28]; here, the CM2 method was applied repeatedly for each channel. The fourth conventional method (CM4) was the MLDI method, which is similar to the BTES method of [14], except that the interpolation was performed in the spectral domain instead of the spatial domain. The fifth conventional method (CM5) was the PPID method, which estimated the PPI as a guide image [15] and performed interpolation in the spectral-difference domain based on the PPI. The sixth conventional method (CM6) was a mosaic-convolution-attention-network-(MCAN) method, in which the mosaic pattern was erased by an end-to-end demosaicing network [16]. This deep-learning method was implemented using the code published online by the authors.
Figure 4 shows the results of the estimated PPIs used as guide images. Figure 4a displays the average of the original multispectral cube, Figure 4b shows the estimated PPI of PPID [15], and Figure 4c shows the estimated PPI of the proposed IGFPPI. The estimated PPI of PPID is blurred and has low contrast. In contrast, the proposed IGFPPI restored high-frequency components better than PPID, and its contrast is close to that of the original. The results for the PSNR, SSIM, and SAM on TT31 are presented in Tables 1-3, respectively. In the tables, a dark-gray background indicates the best score, and a light-gray background indicates the second-best score. Of the 35 images in the TT31 dataset, the proposed method had the best PSNR for 19 images and the second-best PSNR for 16 images. Additionally, it had the best SSIM for 20 images, the second-best SSIM for 15 images, the best SAM for 18 images, and the second-best SAM for 17 images. The average PSNR, SSIM, and SAM values for the TT31 dataset indicated that the proposed method outperformed the other methods. Figures 5 and 6 present the qualitative evaluation results for TT31, including those for the Butterfly and ChartCZP images, with the images cropped to highlight differences. We obtained the red, green, and blue channels from the demosaiced multispectral image cube and represented them as RGB images for qualitative evaluation. Figures 5a-h and 6a-h show RGB images for which we extracted channel 16 for red, channel 6 for green, and channel 1 for blue from the multispectral image cube. Figures 5i-p and 6i-p show the error maps of Figures 5a-h and 6a-h. The results of CM1 are the blurriest; CM2 and CM3 estimated high frequencies somewhat well, but artifacts can be seen. CM4 and CM6 nearly perfectly restored high frequencies in the resolution chart; however, the mosaic pattern was not entirely removed from the general color images. In CM6, demosaicing is performed using a network that erases the mosaic pattern for each channel. This method performs demosaicing on the 16 channels of an MSFA; however, the arrangement differs from that in the paper of CM6. In the experimental results of this method, we can observe that only the evaluation metrics of the chart images, which are close to monotone, achieve high scores. This is because the mosaic pattern is easily erased in images where changes in all channels are constant, but it is not erased in images where a large change occurs in a specific color. In general, the outcomes of CM5 and PM (the proposed method) appeared similar. However, for images such as the resolution chart, PM exhibited superior high-frequency recovery and less color aliasing than CM5. Overall, the images produced by PM had fewer mosaic-pattern artifacts and less color aliasing than those produced by the conventional methods.
For a quantitative evaluation of the TT59 dataset, we computed the PSNR, SSIM, and SAM values, which are presented in Tables 4-6, respectively. Of the 16 images in the TT59 dataset, the proposed method had the best PSNR for 10 images and the second-best PSNR for 4 images. Moreover, it had the best SSIM for 8 images, the second-best SSIM for 7 images, the best SAM for 12 images, and the second-best SAM for 4 images. The average PSNR, SSIM, and SAM values for the TT59 dataset indicated that the proposed method achieved the best results.
The results for the TT59 dataset were similar to those for the TT31 dataset. In the gray areas, CM4 and CM6 effectively recovered the high frequencies. However, in the colored sections, MSFA pattern artifacts were introduced, resulting in grid-like artifacts. By comparison, CM5 and PM performed better overall, with PM recovering high frequencies better than CM5, as shown in the resolution chart.
Figure 7 shows the demosaicing results for different MSFA arrangements. Figure 7a-h shows the MSFAs in which adjacent spectra are grouped in a 2 × 2 shape, and Figure 7i-p shows the MSFAs of the original IMEC camera. The proposed method can be observed to be more robust and to have fewer artifacts than the conventional methods. In particular, Figure 7c,d,f show grid artifacts where the black line of the butterfly is broken, whereas the proposed method shows reduced grid artifacts compared with the other methods.
Table 7 presents a comparison of the execution times, measured on a desktop with an Intel i7-11700k processor, 32 GB of memory, and an Nvidia RTX 3090 GPU. CM6 was tested using PyTorch, whereas the remaining methods were tested using MATLAB R2021a. We measured the average execution times over all the datasets. The method with the shortest execution time was CM1, followed by CM5, PM, CM4, CM2, CM6, and CM3. In addition, as shown in Figure 8, the methods were tested on images captured using an IMEC camera. To qualitatively evaluate the multispectral image cube in the real world, we used the same method that was employed to evaluate the synthesis dataset. Channels 16, 6, and 1 of the multispectral image cube were extracted as the R, G, and B images, respectively, as shown in Figure 8. These results were similar to the experimental results obtained for the synthesis dataset. CM1, CM2, and CM3 exhibited blurred images and strong color aliasing, whereas CM4 exhibited MSFA pattern artifacts. Among the conventional methods, CM5 achieved the best results. CM6, which is a deep-learning method, performed well for the resolution chart. However, the proposed method exhibited better high-frequency recovery and less color aliasing.

Conclusions
We proposed the IGFPPI method for PPI estimation and a directional-multispectral-demosaicing method that uses the estimated PPI obtained from IGFPPI. Guided filtering was used to estimate the PPI from the raw image of the MSFA, where a Gaussian filter was used to obtain the PPI of the low-frequency components, and horizontal and vertical guided filtering was used to estimate the high-frequency components. Using the estimated PPI, we performed directional interpolation in the spectral-difference domain to obtain the final demosaiced multispectral image.
In extensive experiments, among the methods tested, the proposed method achieved the best quantitative scores for the PSNR, SSIM, and SAM and exhibited the best restoration of high frequencies and the fewest color artifacts in a qualitative evaluation, with a reasonable computation time. The proposed method also achieved good results for real-world images. Furthermore, our proposed method can be adapted to perform multispectral demosaicing in the case of a periodic MSFA and when the spectral transmittance of the MSFA varies. In future research, we will focus on image-fusion demosaicing using both multispectral and color filter arrays.

Table 7. Computation Time of Methods.