Three-Dimensional Block Matching Using Orthonormal Tree-Structured Haar Transform for Multichannel Images

Multichannel images, i.e., images of the same object or scene taken in different spectral bands or with different imaging modalities/settings, are common in many applications. For example, multispectral images contain several wavelength bands and hence carry richer information than color images. Multichannel magnetic resonance imaging and multichannel computed tomography images are common in medical imaging diagnostics, and multimodal images are also routinely used in art investigation. All methods for grayscale images can be applied to multichannel images by processing each channel/band separately. However, this requires vast computational time, especially for the task of searching for overlapping patches similar to a given query patch. To address this problem, we propose a three-dimensional orthonormal tree-structured Haar transform (3D-OTSHT) targeting fast full-search-equivalent three-dimensional block matching in multichannel images. The use of a three-dimensional integral image significantly reduces the time needed to obtain the 3D-OTSHT coefficients. We demonstrate superior performance of the proposed block matching.


Introduction
Block matching is a fundamental tool used to search for blocks (patches) similar to a given query. It has been widely used in solving various image processing problems, such as object recognition and tracking [1], image registration [2], image analysis [3], and image restoration [4], to name a few. Block matching searches for patches of the same size as the query and is sensitive to deformation. In this sense, it differs from common image descriptors such as the scale-invariant feature transform (SIFT) [5] and speeded-up robust features (SURF) [6], which extract features robust to deformation. In block matching, generally, a full search (FS) algorithm that exhaustively compares all the pixel intensities of all the overlapping candidates is the most accurate, but requires vast computation due to the huge number of candidates in a large search space. The larger the image size and the number of image bands, the harder it is to use FS.
Fast block matching has been studied from both algorithmic and architectural perspectives. On the architecture side, there are several works on hardware acceleration and configuration by custom instructions, especially for motion estimation in video coding [7][8][9][10][11]. The algorithms fall into two categories: FS-equivalent and non-FS-equivalent algorithms. Some non-FS-equivalent algorithms, such as three-step search [12] and diamond search [13], reduce the amount of computation by limiting search areas, and others do so by approximating patterns [14,15]; in both cases, there is a trade-off between accuracy and efficiency. The scope of this paper is limited to FS-equivalent algorithms.
Fast FS-equivalent algorithms have been intensively developed in order to address the computational complexity of FS [16][17][18][19]. Although these methods can be applied only to patches whose side length is a power of two, they show that searching in a transformed domain is much more efficient than searching in the spatial domain: the additional calculation to obtain the transform coefficients is fully compensated by the reduction of candidates. The orthogonal Haar transform (OHT) reportedly performs fastest in this setting [20,21]. One of the reasons is that its transform coefficients can be calculated in a unique way using an integral image [22,23]. Once an integral image is built, the transform coefficients can be calculated with a few arithmetic operations. This is based on the fact that the Haar transform matrix is sparse and composed of rectangular functions, which distinguishes it from other transforms. In order to overcome the limitation on patch sizes, the two-dimensional orthonormal tree-structured Haar transform (2D-OTSHT), a generalization of OHT, was proposed [24]. Compared to OHT, the normalization factor of 2D-OTSHT is not an integer, but this has little effect on the overall speedup.
In this paper, we propose a three-dimensional orthonormal tree-structured Haar transform (3D-OTSHT) with a three-dimensional integral image for multichannel images. In the proposed 3D-OTSHT, the transform coefficients can be obtained by a few arithmetic operations regardless of patch size. Focusing on the pruning performance in the transformed domain, we also consider a combination with FS, in which the pruning process is stopped once the number of candidates has been sufficiently reduced. We demonstrate superior results regarding the savings in computation time compared to the straightforward solution, in which the fast FS-equivalent method for grayscale images is applied to each band separately. Experimental results are obtained using a standard dataset for color images and a five-band multispectral image dataset.
The paper is organized as follows: In Section 2, we provide the problem setting and required techniques as preliminaries. In Section 3, we present the design of 3D-OTSHT for the three-dimensional (3D) integral image and the 3D block matching using 3D-OTSHT. Our evaluations of the pruning performance and speedup are detailed in Section 4. Finally, in Section 5, we conclude our study.

Preliminaries
First, the problem targeted in this paper is stated. Next, a couple of techniques required for the proposed method are briefly described.

Problem Statement
Consider the problem of searching for patches similar to a given query in a multichannel image. In the full search (FS) algorithm, the matching patches are detected with a threshold in a sliding-window manner by measuring the similarity of all the candidates in the whole search space. Generally, the sum of squared differences (SSD) over all the intensities of a candidate is used as the similarity measure.
Let q(x, y, z) be a query patch of size N × N having B bands, and let I(u, v, w) be an image of size M × M having B bands. The SSD of all the intensities of the candidate located at (u, v) is calculated as

SSD(u, v) = ∑_{z=1}^{B} ∑_{y=1}^{N} ∑_{x=1}^{N} (I(u + x − 1, v + y − 1, z) − q(x, y, z))².

It turns out that (2(M − N + 1)²N²B − 1) additions and (M − N + 1)²N²B multiplications are required for the search. The aim of this paper is to reduce the computational complexity of the search in multichannel images while keeping the same accuracy as FS, using the same threshold as FS uses.
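As a concrete illustration of this baseline, the FS procedure can be sketched as follows (a minimal Python sketch; the array layout and function name are our own choices, not from the paper):

```python
import numpy as np

def full_search(image, query, threshold):
    """Exhaustive full search (FS): compute the SSD between the query
    and every overlapping candidate of the same size over all B bands,
    keeping the candidates whose SSD is within the threshold.
    image: (M, M, B) array; query: (N, N, B) array."""
    M = image.shape[0]
    N = query.shape[0]
    matches = []
    for u in range(M - N + 1):
        for v in range(M - N + 1):
            candidate = image[u:u + N, v:v + N, :]
            ssd = np.sum((candidate - query) ** 2)  # N*N*B squared terms
            if ssd <= threshold:
                matches.append((u, v))
    return matches
```

With an M × M image, the double loop visits (M − N + 1)² candidates, each costing O(N²B) operations, which is exactly the cost the transformed-domain search aims to avoid.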
A part of I(u, v, w), i.e., the i-th candidate, and the query are hereafter simply expressed in vector form as p_i and q, respectively; e.g., the SSD of the i-th candidate is represented as SSD_i = ||p_i − q||_2^2.

Tree-Structured Haar Transform
Tree-structured Haar transform (TSHT) is a generalization of the Haar transform that can be applied to signals of arbitrary length [25]. In the conventional Haar transform [26], the basis is built by dividing an interval equally into two subintervals. In TSHT, on the other hand, the basis is built by dividing an interval unequally into two subintervals and putting weights on them. The complete division of the intervals can be expressed by a binary tree structure. Figure 1a shows an example of a binary tree having 3 leaves. A circle represents a node. The topmost node is the root. A node having no connection below it (no child node) is a leaf. The number inside a circle represents the number of leaves the node has, which determines the internal division ratio when dividing an interval and the weight of the intervals in the basis function. Figure 1b shows the intervals corresponding to the nodes of the binary tree. The interval corresponding to a node is divided internally into two subintervals, in a ratio equal to the number of leaves of the left child node to that of the right child node.
Let α be a node of a binary tree having N leaves, and let α_0 and α_1 be the left child node and the right child node of α, respectively. We denote by ι_α the interval corresponding to α and by ν(α) the number of leaves that α has. The basis function for the interval ι_root is the constant function

φ_root(t) = 1/√N for t ∈ ι_root,

and the basis function associated with an internal node α is

φ_α(t) = √(ν(α_0)ν(α_1)/ν(α)) × (1/ν(α_0)) for t ∈ ι_{α_0}, −√(ν(α_0)ν(α_1)/ν(α)) × (1/ν(α_1)) for t ∈ ι_{α_1}, and 0 elsewhere.

Thus, except for ι_root, the absolute value of the weight on each of the two subintervals is inversely proportional to its number of leaves. Figure 1c shows the basis.
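To make the construction concrete, the discrete (vector) form of an orthonormal TSHT basis can be generated as below, assuming each leaf corresponds to one sample; the nested-pair tree encoding and the function names are ours, not from [25]:

```python
import numpy as np

def num_leaves(tree):
    # A tree is either a leaf (None) or a pair (left_subtree, right_subtree).
    return 1 if tree is None else num_leaves(tree[0]) + num_leaves(tree[1])

def tsht_basis(tree):
    """Rows of the returned (N, N) matrix are the orthonormal TSHT basis
    functions: the constant function for the root interval, then one
    two-valued function per internal node of the tree."""
    N = num_leaves(tree)
    rows = [np.full(N, 1.0 / np.sqrt(N))]

    def recurse(node, start):
        if node is None:  # a leaf spawns no basis function
            return
        left, right = node
        nl, nr = num_leaves(left), num_leaves(right)
        w = np.sqrt(nl * nr / (nl + nr))  # normalization factor
        row = np.zeros(N)
        row[start:start + nl] = w / nl             # weight proportional to 1/ν(α0)
        row[start + nl:start + nl + nr] = -w / nr  # weight proportional to -1/ν(α1)
        rows.append(row)
        recurse(left, start)
        recurse(right, start + nl)

    recurse(tree, 0)
    return np.vstack(rows)
```

For any binary tree, the resulting matrix G satisfies GGᵀ = I, which is the property the pruning argument in Section 3 relies on.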

Integral Image
The 3D integral image J of an image I is defined as J(x, y, z) = ∑_{u≤x} ∑_{v≤y} ∑_{w≤z} I(u, v, w). With the 3D integral image, the sum S_A of all the intensities in region A (ABCD-EFGH), whose diagonal starts at the location (sX, sY, sZ) and ends at (eX, eY, eZ), as shown in Figure 2, is calculated by seven additions/subtractions regardless of region size as

S_A = J(eX, eY, eZ) − J(sX − 1, eY, eZ) − J(eX, sY − 1, eZ) − J(eX, eY, sZ − 1) + J(sX − 1, sY − 1, eZ) + J(sX − 1, eY, sZ − 1) + J(eX, sY − 1, sZ − 1) − J(sX − 1, sY − 1, sZ − 1).

This property is the key to the significant speedup of the proposed method.
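A sketch of this computation in Python (zero-padding the integral image on the low-index side avoids the sX − 1 boundary checks; the names are ours):

```python
import numpy as np

def integral_image_3d(img):
    """3D integral image: J[x+1, y+1, z+1] = sum of img[0..x, 0..y, 0..z].
    The extra zero plane on each low side makes every box-sum lookup valid."""
    J = np.zeros(tuple(s + 1 for s in img.shape))
    J[1:, 1:, 1:] = img.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return J

def box_sum(J, sX, sY, sZ, eX, eY, eZ):
    """Sum of intensities in the inclusive box [sX..eX, sY..eY, sZ..eZ]
    (0-based) using seven additions/subtractions, regardless of box size."""
    return (J[eX + 1, eY + 1, eZ + 1]
            - J[sX, eY + 1, eZ + 1]
            - J[eX + 1, sY, eZ + 1]
            - J[eX + 1, eY + 1, sZ]
            + J[sX, sY, eZ + 1]
            + J[sX, eY + 1, sZ]
            + J[eX + 1, sY, sZ]
            - J[sX, sY, sZ])
```

Building J costs one pass over the image; afterwards every region sum is O(1), which is what makes the per-coefficient cost of 3D-OTSHT independent of patch size.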

Three-Dimensional Block Matching for Multichannel Images
We propose here fast FS-equivalent three-dimensional (3D) block matching for multichannel images using the three-dimensional orthonormal tree-structured Haar transform (3D-OTSHT). First, we construct 3D-OTSHT so as to simplify the computation of its coefficients. Next, we describe the 3D block matching using 3D-OTSHT.

Three-Dimensional Orthonormal Tree-Structured Haar Transform
One of the vectors forming the basis of the vector space for a three-dimensional region is referred to in this paper as a basis block. The 3D-OTSHT consists of basis blocks built by subdividing a three-dimensional region formed by intervals along the X, Y, and Z axes. For rapid calculation of the 3D-OTSHT coefficients via the integral image, we design each basis block to have at most two regions, each of which is assigned a constant value.
Let us consider the basis blocks for a query of size N × N having B bands. We generate the set of basis blocks by basis block functions. Let T_X, T_Y, and T_Z be the binary trees for the X axis (N leaves), the Y axis (N leaves), and the Z axis (B leaves), respectively. The nodes of T_X, T_Y, and T_Z are denoted by α, β, and γ, respectively.
We define the basis block function for the region (ι_root × ι_root × ι_root) as (8), and the basis block functions for the other regions as (9)–(14), where c+_ϕn and c−_ϕn (n = 1, 2, . . . , 7) are a positive constant and a negative constant, respectively, which include the normalization factor and weight given in (15)–(21). The series of functions (8)–(14) is applied at all the intervals corresponding to the nodes of the three binary trees. Eventually, N²B basis blocks are generated. Each basis block is denoted by G_k (k = 1, 2, . . . , N²B) in vector form; accordingly, c+_ϕn and c−_ϕn are replaced by c+_k and c−_k, respectively. Let G be the set of basis blocks, i.e., G = [G_1, G_2, . . . , G_{N²B}]^T, where ^T denotes transposition. G is orthonormal. Figure 3 shows the appearance of 3D-OTSHT. Function ϕ_1 divides a region along the X axis into two regions, c+_* and c−_*, where c+_* and c−_* denote the regions assigned to c+_ϕn and c−_ϕn as in (15)–(21), respectively. Functions ϕ_2 and ϕ_3 divide, along the Y axis, the regions c+_* and c−_* built by ϕ_1; ϕ_4 and ϕ_5 divide, along the Z axis, the regions c+_* and c−_* built by ϕ_2; and ϕ_6 and ϕ_7 divide, along the Z axis, the regions c+_* and c−_* built by ϕ_3.

3D Block Matching Using 3D-OTSHT with 3D Integral Image
In the proposed 3D block matching, the SSD of the 3D-OTSHT coefficients is calculated as the similarity measure, but not all the transform coefficients of all the candidates are used. A candidate that does not match is rejected from the search in the middle of processing, which is called pruning.
The transform coefficients are obtained efficiently with a 3D integral image. From (6), the k-th 3D-OTSHT coefficient, P(k), of a patch p in vector form is obtained as

P(k) = c+_k S(p+_*) + c−_k S(p−_*),

where S(·) denotes the sum of intensities over a region, and p+_* and p−_* are the regions in the patch corresponding to the regions to which c+_k and c−_k are assigned in the k-th basis block, respectively. Each of the two sums is obtained from the 3D integral image with a few additions, regardless of region size.
For pruning, the similarity of candidates is calculated using a subset of the 3D-OTSHT basis. Let G^k be the subset of G that contains the first k basis blocks, i.e., G^k = [G_1, G_2, . . . , G_k]^T. The similarity of the i-th candidate is calculated with G^k as

SSD^k_i = ||P^k_i − Q^k||_2^2,

where P^k_i = G^k p_i and Q^k = G^k q. At every k (k = 1, 2, . . . , N²B), SSD^k_i is compared with the threshold. Once SSD^k_i exceeds the threshold, the i-th candidate is rejected from the search, and no further OTSHT coefficients or SSDs are calculated for it. As long as the same threshold as FS is used, the unmatched candidates are securely rejected. The theory behind this is that an orthonormal transform preserves the energy of a signal in the transformed domain. Therefore, it holds that

||p_i − q||_2^2 = ||P_i − Q||_2^2,

where P_i = G p_i and Q = G q. From ||P^k_i − Q^k||_2^2 ≤ ||P_i − Q||_2^2 for k = 1, 2, . . . , N²B, secure rejection is guaranteed. For this reason, the transform should be orthonormal and SSD should be used as the similarity measure.
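The secure-rejection argument can be checked numerically. The sketch below uses a random orthonormal matrix as a stand-in for the 3D-OTSHT basis G, since the two properties used here, energy preservation and monotone partial sums, hold for any orthonormal transform:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16

# Random orthonormal matrix as a stand-in for the basis G
# (QR of a Gaussian matrix yields orthonormal columns; rows after .T).
G, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = G.T

p = rng.standard_normal(n)  # flattened candidate patch p_i
q = rng.standard_normal(n)  # flattened query q
P, Q = G @ p, G @ q

# Energy preservation: ||p - q||^2 == ||P - Q||^2.
full_ssd = np.sum((p - q) ** 2)
assert np.isclose(np.sum((P - Q) ** 2), full_ssd)

# Partial sums SSD^k are monotone lower bounds of the full SSD, so a
# candidate whose SSD^k already exceeds the FS threshold can be
# rejected securely (no false negatives).
partial_ssd = np.cumsum((P - Q) ** 2)
assert np.all(partial_ssd <= full_ssd + 1e-9)
```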

3D Block Matching Using Limited 3D-OTSHT
All the basis blocks of 3D-OTSHT together detect patches with the same accuracy as FS. However, it is inefficient to use all the basis blocks, because G_k becomes sparser as k increases. To avoid this, a limited number of basis blocks is used for pruning, and after the number of candidates has been reduced, the remaining candidates are scrutinized by FS. That is, G^K with a certain K is used instead of G for SSD^k_i shown in (23), and at every k (k = 1, 2, . . . , K), SSD^k_i is compared with the threshold for pruning.
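A minimal sketch of this hybrid strategy, with candidates flattened to vectors and a generic orthonormal matrix standing in for G (all names are our own):

```python
import numpy as np

def prune_then_full_search(candidates, query, G, K, threshold):
    """Hybrid search sketch: reject candidates using the first K
    transform-domain coefficients, then scrutinize the survivors with
    the exact (spatial-domain) SSD, as in the combination with FS.
    candidates: (num, n) flattened patches; G: (n, n) orthonormal."""
    Qk = G[:K] @ query
    matches = []
    for i, p in enumerate(candidates):
        ssd_k = 0.0
        for k in range(K):
            d = G[k] @ p - Qk[k]
            ssd_k += d * d
            if ssd_k > threshold:  # lower bound already exceeds threshold
                break
        else:
            # Survivor: verify with the exact SSD and the same threshold.
            if np.sum((p - query) ** 2) <= threshold:
                matches.append(i)
    return matches
```

Because the partial SSD only ever underestimates the exact SSD, the survivors always include every true match, so the final FS pass returns exactly the FS result.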

Evaluation
We performed 3D block matching in order to evaluate the pruning performance and elapsed time of the proposed method using multichannel images.

Methods and Environments
We compared the performance of the following five search methods: FS, two-dimensional OTSHT with a two-dimensional integral image with single judge (2D-OTSHT-2DI-S) [24], two-dimensional OTSHT with a two-dimensional integral image with whole judge (2D-OTSHT-2DI-W), two-dimensional OTSHT with a 3D integral image (2D-OTSHT-3DI) [27], and the proposed 3D-OTSHT with a 3D integral image (3D-OTSHT-3DI). 2D-OTSHT-2DI-S and -W apply the OTSHT for grayscale images to each channel, and judge the candidates in different ways: the single judge (-S) decides whether to reject a candidate based on a single channel, while the whole judge (-W) evaluates a candidate based on all the channels. In 2D-OTSHT-3DI, a basis block for the 3D integral image is formed by piling up copies of the same 2D basis image along the channel axis.
We use two image data sets: a color image data set and a five-band multispectral image data set. The color image data set, called SIDBA, contains 12 scenes of size 256 × 256 having 3 color channels [28]. The 5-band image data set contains 11 scenes of size 1824 × 1368 having 5 bands [29]. For each data set, five patch sizes are used. We chose 10 queries randomly in an image per patch size. Then, we obtain the ground truth patches similar to the query by FS with a threshold T: if the SSD of the i-th patch satisfies SSD_i ≤ T, we set the i-th patch as a ground truth. In this experiment, the threshold is T = 10N²B for the color images and T = 2N²B for the 5-band multispectral images. The same threshold is used in all the methods. Table 1 summarizes the number of ground truth patches. From Table 1, it can be seen that the mean number of ground truth patches tends to decrease as the patch size increases and to increase as the image size increases. We confirmed that the number of ground truth patches found for queries with low standard deviations of pixel intensities is likely to be large, and that the number found for queries with high standard deviations is likely to be one. Therefore, the threshold should be set appropriately considering the size and characteristics of the images in practical applications. Generally, a larger threshold yields more ground truth patches; however, a threshold that is too large is meaningless. All the algorithms are written in C as single-thread tasks, compiled with Xcode 10.1, and run on a macOS system with a 4 GHz Intel Core i7 and 16 GB RAM, where eight active processor cores with hardware multithreading are used.

Pruning Performance
The pruning performance is evaluated by the ratio, R(K), of the number of remaining extra patches detected by K basis images/blocks to the number of all the candidates, which is defined by

R(K) = (N_d(K) − N_g) / N_c,

where N_d(K) refers to the number of patches detected by K basis images/blocks, N_g is the number of ground truth patches, and N_c represents the number of all the candidates. Lower R(K) means better performance. Figures 4 and 5 show the pruning performance, R(K) (K = 2, 4, 8, . . . , K_max), in the color images and the 5-band multispectral images, respectively, where K_max refers to the maximum number of basis images/blocks. In the proposed 3D-OTSHT-3DI, K_max = N²B, while in the other methods, K_max = N². We confirmed that all the methods detect every ground truth patch at any K, i.e., there are no false negatives. Therefore, the final accuracy (the accuracy after the remaining candidates are scrutinized by FS) is the same as the accuracy of FS. The proposed 3D-OTSHT-3DI method yields the best pruning performance on both the 3-band and 5-band images for all the patch sizes. In both image data sets, we observe that when K is greater than or equal to 8, 3D-OTSHT-3DI is better than the other methods, and that as K increases, the rate of change reduces. These facts suggest that not all the basis blocks are required for the overall speedup. Figure 6 shows an example of patches detected at K basis images/blocks by the different methods and for different values of K. Each detected patch of size 13 × 13 is shown as an orange square, and the detected overlapping patches cover different areas depending on the employed method and the K-value. It can be seen that the number of candidate patches decreases as K increases. The result of 3D-OTSHT-3DI does not differ significantly between K = 8 and K = 16 and, clearly, these results agree best with the ground truth.

Computational Complexity
Here, we consider the number of arithmetic operations with respect to the number of basis images/blocks, counting only the operations of OTSHT, i.e., those not included in the arithmetic operations of FS. Table 2 shows the number of additions and multiplications per pixel for the search for patches similar to a query of size N × N having B bands, including building an integral image. It shows the worst case, where no candidates are rejected at any of the K basis images/blocks. In 2D-OTSHT-2DI-S and -W, two additions are needed per pixel for building a 2D integral image in each band, and (8K − 1) additions and 3K multiplications are required per pixel for the SSD in each band; thus, in total, (2 + 8K − 1)B = (8K + 1)B additions and 3KB multiplications are needed per pixel. In 2D- and 3D-OTSHT-3DI, (3B − 1) additions are needed per pixel for a 3D integral image, and (16K − 1) additions and 3K multiplications are required per pixel for the SSD. For the same K, the number of additions for 2D- and 3D-OTSHT-3DI is smaller than that for 2D-OTSHT-2DI-S and -W, except for images having 2 bands, while the number of multiplications for 2D- and 3D-OTSHT-3DI is smaller than that for 2D-OTSHT-2DI-S and -W for any image with more than one band.
Table 2. The number of additions and multiplications per pixel for searching for patches having B bands with K basis images/blocks.
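The per-pixel counts above can be reproduced with two small helpers (worst case, no rejection; the function names are ours):

```python
def ops_2d_otsht_2di(K, B):
    """Per-pixel (additions, multiplications) for 2D-OTSHT-2DI-S/-W:
    2 adds per band for the 2D integral image, plus (8K - 1) adds and
    3K multiplications per band for the SSD."""
    return (2 + 8 * K - 1) * B, 3 * K * B

def ops_otsht_3di(K, B):
    """Per-pixel (additions, multiplications) for 2D-/3D-OTSHT-3DI:
    (3B - 1) adds for the 3D integral image, plus (16K - 1) adds and
    3K multiplications for the SSD."""
    return (3 * B - 1) + (16 * K - 1), 3 * K
```

For K = 8, for instance, the 3DI variants need 135 additions per pixel versus 195 for B = 3, but 132 versus 130 for B = 2, reproducing the two-band exception; their multiplication count, 3K, is independent of B and therefore always smaller for B > 1.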

Speedup
In this section, we analyze when to stop the pruning process for the overall speedup in combination with FS, showing the mean elapsed time at K basis images/blocks for each method, and compare the final performance of the five methods. Figures 7 and 8 show the mean ratio (%) of the elapsed time of each method at K basis images/blocks to the elapsed time of FS. Tables 3 and 4 detail the mean elapsed time and the ratio to the elapsed time of FS when G^K (K = 1, 2, 4, . . . , 32) is used, for the color images and the 5-band multispectral images, respectively. The fastest time at each patch size is expressed in bold. From Tables 3 and 4, it can be seen that the proposed method outperforms the other methods except for patches of size 9 × 9 in the color images and patches of size 5 × 5 in the 5-band images. In the color images, for patch size 5 × 5, the fastest mean elapsed time was 1.03 ms, which is 23.17% of the mean elapsed time of FS, while for 21 × 21, the fastest mean elapsed time was 1.30 ms, which is 1.78% of the mean elapsed time of FS. In the 5-band images, for patch size 5 × 5, the fastest mean elapsed time was 0.050 s, which is 17.44% of the mean elapsed time of FS, while for 45 × 45, the fastest mean elapsed time was 0.164 s, which is 0.78% of the mean elapsed time of FS. The larger the patch size and the larger the number of bands, the higher the efficiency of the proposed method.
We confirmed that the mean ratio did not differ much between eight active processor cores with hardware multithreading and one core in the same system. We also compared the five methods, each with its best-performing K-value, to A2DHT [20]. A2DHT is an FS-equivalent algorithm for grayscale images and reportedly performs fastest among FS-equivalent algorithms; its source code is provided by the authors [30]. We modified the code so that the method was applied to each band separately. Since A2DHT is limited to patch sizes that are powers of two, queries of size 16 × 16 were used for the color image dataset, and queries of size 16 × 16 and 32 × 32 were used for the 5-band image dataset. The number of ground truth patches is summarized in Table 5. We confirmed that all the methods detected every ground truth patch. Table 6 shows the mean elapsed time and the mean ratio to FS time, where the fastest time and its ratio at each patch size are expressed in bold. It can be seen that the proposed method outperforms state-of-the-art methods.

Conclusions
We proposed a fast FS-equivalent 3D block matching method to search for patches similar to a given query in multichannel images. The proposed method uses 3D-OTSHT and reduces the number of candidates with SSD in the transformed domain. The pruning process rejects unmatched candidates during the block matching. Moreover, in combination with FS, the pruning process is stopped partway through for the overall speedup. We designed 3D-OTSHT to make the most of the 3D integral image, rather than as a direct extension of one-dimensional TSHT. Unmatched patches are securely rejected because the transform is orthonormal and SSD is used as the similarity measure. We analyzed the pruning performance and mean elapsed time using a color image dataset and a 5-band multispectral image dataset, and demonstrated that the proposed method outperforms state-of-the-art methods. In the color images, the mean elapsed time was shortened to less than 2% of the FS time. In the 5-band multispectral images, the search time was shortened to around 0.8% of the FS time, hence allowing more than 100-times faster processing without sacrificing accuracy. We believe that these huge savings in computation time can enable new applications of patch matching in multichannel images that were not feasible before due to prohibitive computational complexity.